Computational Analysis of Slovenian Historical Newspapers (1771–1914): Linguistic, Thematic, and State-Building Insights
DOI:
https://doi.org/10.51663/pnz.65.3.02
Keywords:
corpus linguistics, Slovenian newspapers, OCR errors, keyword analysis
Abstract
This paper presents a computational-linguistic analysis of sPeriodika, a historical corpus of Slovenian periodicals published between 1771 and 1914. Using keyword analysis and diachronic analysis, we examined the linguistic, thematic, and historical dimensions of the ten most prominent newspapers in the corpus. The findings reveal the central role these newspapers played in shaping the Slovenian national awakening after 1848, and they highlight the varied thematic orientations of individual periodicals, such as agriculture, pedagogy, literature, and advertising. The study also addresses the challenges that poor optical character recognition (OCR) quality poses for the digitization of historical texts, and the consequences of those challenges for linguistic and content analysis. By combining computational methods with historical research, the study offers insight into the development of the Slovenian language, the role of the media in forming national identity, and the possibilities for improving OCR-based textual resources.
License
Copyright (c) 2025 Ajda Pretnar Žagar

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.