CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

Authors

DOI:

https://doi.org/10.51663/pnz.65.3.05

Keywords:

South Slavic languages, natural language processing, annotation pipeline, linguistic annotation

Abstract

This article presents CLASSLA-Stanza, a pipeline for automatic linguistic annotation of South Slavic languages that is based on the Stanza natural language processing pipeline. We describe the main improvements that CLASSLA-Stanza introduces over Stanza and give a detailed account of the model training procedure for version 2.2, the latest release of the tool. We also report the pipeline's performance across different languages and language varieties. CLASSLA-Stanza achieves consistently high results for all supported languages and outperforms the original Stanza pipeline on every one of them. Finally, we present a new pipeline feature that enables efficient processing of web texts, and we describe how well the pipeline performs when annotating speech transcripts.


Published

2025-12-22

Issue

Section

Articles