CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

Luka Terčon; Kaja Dobrovoljc; Nikola Ljubešić

doi:10.51663/pnz.65.3.05

Authors

Luka Terčon Faculty of Arts, University of Ljubljana; Faculty of Computer and Information Science, University of Ljubljana https://orcid.org/0009-0006-3237-3583
Kaja Dobrovoljc Faculty of Arts, University of Ljubljana, Jožef Stefan Institute https://orcid.org/0000-0002-5909-7965
Nikola Ljubešić Jožef Stefan Institute; University of Ljubljana, Faculty of Computer and Information Science; Institute of Contemporary History https://orcid.org/0000-0001-7169-9152

Keywords:

South Slavic languages, automatic linguistic processing, annotation pipeline, linguistic annotation

Abstract

We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza and give a detailed description of the model training process for the latest 2.2 release of the pipeline. We also report performance scores produced by the pipeline for different languages and language varieties. CLASSLA-Stanza exhibits consistently high performance across all supported languages and outperforms its parent pipeline Stanza on all supported tasks. We also present the pipeline’s new functionality that enables efficient processing of web data and describe the efficiency of the pipeline for annotating written transcripts of spoken data.

References

Literature:

Bañón, Marta, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, et al. “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages.” In 23rd Annual Conference of the European Association for Machine Translation, EAMT 2022. European Association for Machine Translation, 2022.

de Marneffe, Marie-Catherine, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. “Universal Dependencies.” Computational Linguistics 47, no. 2 (2021): 255–308. https://doi.org/10.1162/coli_a_00402.

Dobrovoljc, Kaja, and Jaka Čibej. “Spoken Slovenian Treebank: New annotated data, parsing models and linguistic insights.” Paper presented at the UniDive 3rd General Meeting: Universality, diversity and idiosyncrasy in language technology, Budapest, 2025. https://unidive.lisn.upsaclay.fr/lib/exe/fetch.php?media=meetings:general_meetings:3rd_unidive_general_meeting:59_spoken_slovenian_treebank_n.pdf.

Dobrovoljc, Kaja, and Joakim Nivre. “The Universal Dependencies Treebank of Spoken Slovenian.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), edited by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, et al., 1566–73. Portorož, Slovenia: European Language Resources Association (ELRA), 2016. https://aclanthology.org/L16-1248/.

Dobrovoljc, Kaja, Luka Terčon, and Nikola Ljubešić. “Universal Dependencies za slovenščino: nove smernice, ročno označeni podatki in razčlenjevalni model” [Universal Dependencies for Slovenian: New Guidelines, Manually Annotated Data, and Parsing Model]. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave 11, no. 1 (2023): 218–246. https://doi.org/10.4312/slo2.0.2023.1.218-246.

Erjavec, Tomaž, Darja Fišer, Simon Krek, and Nina Ledinek. “The JOS Linguistically Tagged Corpus of Slovene.” In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), edited by Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias. Valletta, Malta: European Language Resources Association (ELRA), 2010. https://aclanthology.org/L10-1087/.

Erjavec, Tomaž. “MULTEXT-East: Morphosyntactic Resources for Central and Eastern European Languages.” Language Resources and Evaluation 46, no. 1 (2012): 131–42. http://www.jstor.org/stable/41486069.

Goldhahn, Dirk, Maciej Sumalvico, and Uwe Quasthoff. “Corpus collection for under-resourced languages with more than one million speakers.” Collaboration and Computing for Under-Resourced Languages (CCURL) (2016): 67–73. http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-CCURL2016_Proceedings.pdf#page=74.

Grčar, Miha, Simon Krek, and Kaja Dobrovoljc. “Obeliks: statistični oblikoskladenjski označevalnik in lematizator za slovenski jezik” [Obeliks: A Statistical Morphosyntactic Annotation and Lemmatization Tool for the Slovenian Language]. In Zbornik Osme konference Jezikovne tehnologije, Ljubljana, Slovenia, vol. 19. 2012.

Krek, Simon, Polona Gantar, Kaja Dobrovoljc, and Iza Škrjanec. “Označevanje udeleženskih vlog v učnem korpusu za slovenščino” [Annotating Semantic Roles in a Training Corpus for Slovenian]. In Proceedings of the Conference on Language Technologies and Digital Humanities (JT-DH-2016). Ljubljana University Press, Faculty of Arts, 2016. https://doi.org/10.5281/zenodo.14165095.

Ljubešić, Nikola, and Kaja Dobrovoljc. “What Does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian.” In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, edited by Tomaž Erjavec, Michał Marcińczuk, Preslav Nakov, Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, and Roman Yangarber, 29–34. Florence, Italy: Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/W19-3704.

Ljubešić, Nikola, Tomaž Erjavec, Maja Miličević Petrović, and Tanja Samardžić. “Together we are stronger: Bootstrapping language technology infrastructure for South Slavic languages with CLARIN.SI.” In CLARIN. The Infrastructure for Language Resources, edited by Darja Fišer and Andreas Witt. De Gruyter, 2022. https://doi.org/10.1515/9783110767377-017.

Ljubešić, Nikola, Luka Terčon, and Kaja Dobrovoljc. “CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages.” In Proceedings of the Conference on Language Technologies and Digital Humanities (JT-DH-2024). Institute of Contemporary History, 2024. https://doi.org/10.5281/zenodo.13936406.

Osenova, Petya, and Kiril Simov. “Universalizing BulTreeBank: A Linguistic Tale about Glocalization.” In The 5th Workshop on Balto-Slavic Natural Language Processing, edited by Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Hristo Tanev, and Roman Yangarber. Hissar, Bulgaria: INCOMA Ltd., Shoumen, Bulgaria, 2015. https://aclanthology.org/W15-5313/.

Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. “Stanza: A Python Natural Language Processing Toolkit for Many Human Languages.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, edited by Asli Celikyilmaz and Tsung-Hsien Wen, 101–8. Online: Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.acl-demos.14.

Samardžić, Tanja, Nikola Ljubešić, and Maja Miličević. “Regional Linguistic Data Initiative (ReLDI).” BSNLP@RANLP (2015).

Terčon, Luka, and Nikola Ljubešić. “CLASSLA-Stanza: The next step for linguistic processing of South Slavic Languages.” arXiv preprint (2023). https://doi.org/10.48550/arXiv.2308.04255.

Van Nguyen, Minh, Viet Dac Lai, Amir Pouran Ben Veyseh, and Thien Huu Nguyen. “Trankit: A light-weight transformer-based toolkit for multilingual natural language processing.” arXiv preprint (2021). https://doi.org/10.48550/arXiv.2101.03289.

Online sources:

Arhar Holdt, Špela, Simon Krek, Kaja Dobrovoljc, Tomaž Erjavec, Polona Gantar, Jaka Čibej, Eva Pori, et al. “Training Corpus SUK 1.0.” 2022. http://hdl.handle.net/11356/1747.

Arhar Holdt, Špela, Simon Krek, Kaja Dobrovoljc, Tomaž Erjavec, Polona Gantar, Jaka Čibej, Eva Pori, et al. “Training Corpus SUK 1.1.” 2024. http://hdl.handle.net/11356/1959.

Batanović, Vuk, Nikola Ljubešić, Tanja Samardžić, and Tomaž Erjavec. “Serbian Linguistic Training Corpus SETimes.SR 2.0.” 2023. http://hdl.handle.net/11356/1843.

Erjavec, Tomaž, Ana-Maria Barbu, Ivan Derzhanski, Ludmila Dimitrova, Radovan Garabík, Nancy Ide, Heiki-Jaan Kaalep, et al. ‘MULTEXT-East “1984” Annotated Corpus 4.0.’ 2010. http://hdl.handle.net/11356/1043.

Krek, Simon, Kaja Dobrovoljc, Tomaž Erjavec, Sara Može, Nina Ledinek, Nanika Holz, Katja Zupan, et al. “Training Corpus ssj500k 2.3.” 2021. http://hdl.handle.net/11356/1434.

Lenardič, Jakob, Jaka Čibej, Špela Arhar Holdt, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Katja Zupan, and Kaja Dobrovoljc. “CMC Training Corpus Janes-Tag 3.0.” 2022. http://hdl.handle.net/11356/1732.

Ljubešić, Nikola, and Biljana Stojanovska. “Macedonian Linguistic Training Corpus SETimes.MK 0.1.” 2023. http://hdl.handle.net/11356/1886.

Ljubešić, Nikola, and Tanja Samardžić. “Croatian Linguistic Training Corpus Hr500k 2.0.” 2023. http://hdl.handle.net/11356/1792.

Ljubešić, Nikola, Peter Rupnik, and Taja Kuzman. “Croatian Web Corpus CLASSLA-web.hr 1.0.” 2024. http://hdl.handle.net/11356/1929.

Ljubešić, Nikola, Peter Rupnik, and Taja Kuzman. “Slovenian Web Corpus CLASSLA-web.sl 1.0.” 2024. http://hdl.handle.net/11356/1882.

Ljubešić, Nikola, Tomaž Erjavec, Vuk Batanović, Maja Miličević, and Tanja Samardžić. “Croatian Twitter Training Corpus ReLDI-NormTagNER-Hr 3.0.” 2023. http://hdl.handle.net/11356/1793.

Ljubešić, Nikola, Tomaž Erjavec, Vuk Batanović, Maja Miličević, and Tanja Samardžić. “Serbian Twitter Training Corpus ReLDI-NormTagNER-Sr 3.0.” 2023. http://hdl.handle.net/11356/1794.

Terčon, Luka, and Nikola Ljubešić. “Word Embeddings CLARIN.SI-Embed.Hr 2.0.” 2023. http://hdl.handle.net/11356/1790.

Terčon, Luka, and Nikola Ljubešić. “Word Embeddings CLARIN.SI-Embed.Sr 2.0.” 2023. http://hdl.handle.net/11356/1789.

Terčon, Luka, and Nikola Ljubešić. “Word Embeddings CLARIN.SI-Embed.Mk 2.0.” 2023. http://hdl.handle.net/11356/1788.

Terčon, Luka, and Nikola Ljubešić. “Word Embeddings CLARIN.SI-Embed.Bg 1.0.” 2023. http://hdl.handle.net/11356/1796.

Terčon, Luka, Nikola Ljubešić, and Tomaž Erjavec. “Word Embeddings CLARIN.SI-Embed.Sl 2.0.” 2023. http://hdl.handle.net/11356/1791.

Zupan, Katja, Nikola Ljubešić, and Tomaž Erjavec. “Smernice Janes-NER za označevanje imenskih entitet v slovenskem jeziku: Različica 1.1.” CJVT Wiki. Accessed February 2, 2025. https://wiki.cjvt.si/books/08-imenske-entitete/page/oznacevalne-smernice.

Žitnik, Slavko, and Frenk Dragar. “SloBENCH Evaluation Framework.” 2021. http://hdl.handle.net/11356/1469.
