Computational Analysis of Slovenian Historical Newspapers (1771–1914): Linguistic, Thematic, and Nation-Building Insights

Authors

DOI:

https://doi.org/10.51663/pnz.65.3.02

Keywords:

keyword analysis, OCR errors, corpus linguistics, historical newspapers

Abstract

This paper presents a computational linguistic analysis of sPeriodika, a historical corpus of Slovenian periodicals published between 1771 and 1914. Using keyword analysis and diachronic analysis, we explore the linguistic, thematic, and historical dimensions of ten prominent newspapers in the corpus. Our findings reveal the centrality of these newspapers in shaping Slovenian nation-building during the post-1848 period, while also highlighting the diverse thematic orientations of individual periodicals, including agriculture, pedagogy, literature, and advertising. Moreover, the study examines the challenges posed by low-quality Optical Character Recognition (OCR) in historical text digitisation and its implications for linguistic and content analysis. By combining computational methods with historical inquiry, this research provides insights into the evolution of the Slovenian language, the media’s role in nation-building, and the potential for improving OCR-based textual resources.

Published

2025-12-22

Issue

Section

Articles