Leveraging a Morphological Lexicon for a Semi-Automatic Approach to Correcting Lemmas and Morphosyntactic Tags

DOI:

https://doi.org/10.51663/pnz.65.3.06

Keywords:

lemmatization, morphosyntactic tagging, training corpora, morphological lexicon, corpus annotation

Abstract

This paper presents a new semi-automatic approach to correcting lemmas and morphosyntactic tags. Unlike previous manual annotation approaches for Slovene corpora, the method adds a step in which tokens and their automatically assigned lemmas and morphosyntactic tags are cross-referenced with the forms included in the Sloleks Morphological Lexicon of Slovene. Based on this comparison, each token is classified into one of several annotation scenarios. The approach has noticeably reduced the time and resources invested in annotation by eliminating a large number of redundant tasks. Its advantages include the possibility of dividing annotation tasks into groups of similar annotation problems (e.g. disambiguation of grammatical homographs). With adequate data preparation, it also greatly reduces the need for annotators to be familiar with the extensive Multext-East morphosyntactic tag set for Slovene, a requirement that created a bottleneck in similar annotation campaigns. The method was tested during the annotation of the ROG Training Corpus of Spoken Slovene. We also test the scenario classification algorithm on the SUK Training Corpus of Written Slovene, which was annotated using the traditional sentence-by-sentence, token-by-token approach. We present the results and argue that the method should be used in future annotation campaigns to save resources and improve overall annotation consistency, and we discuss some of the caveats and disadvantages of the proposed approach.
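The cross-referencing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the annotation scenarios, so the four scenario labels, the `classify` and `group_by_scenario` helpers, and the toy lexicon are all assumptions made for illustration. The entries and tags in the toy lexicon are illustrative and have not been checked against Sloleks.

```python
from collections import defaultdict

# Hypothetical toy lexicon: surface form -> set of (lemma, tag) analyses.
# Illustrative entries only; not taken from or verified against Sloleks.
LEXICON = {
    "hiša": {("hiša", "Ncfsn")},
    "je": {("biti", "Va-r3s-n"), ("jesti", "Vmpr3s")},
}


def classify(form, lemma, tag, lexicon):
    """Assign a token to a (hypothetical) annotation scenario.

    The scenario names are assumptions; the abstract only states that
    each token falls into one of several scenarios after the comparison.
    """
    analyses = lexicon.get(form.lower())
    if analyses is None:
        # The form is unknown to the lexicon: needs full manual review.
        return "form_not_in_lexicon"
    if (lemma, tag) not in analyses:
        # The form is known, but not with the automatically assigned analysis.
        return "analysis_mismatch"
    if len(analyses) > 1:
        # The assigned analysis is possible, but the form is a homograph.
        return "homograph_check"
    # Exactly one lexicon analysis, and it matches the automatic one.
    return "confirmed"


def group_by_scenario(tokens, lexicon):
    """Group (form, lemma, tag) triples by scenario for batch annotation."""
    groups = defaultdict(list)
    for form, lemma, tag in tokens:
        groups[classify(form, lemma, tag, lexicon)].append((form, lemma, tag))
    return dict(groups)


if __name__ == "__main__":
    tokens = [
        ("Hiša", "hiša", "Ncfsn"),
        ("je", "biti", "Va-r3s-n"),
        ("hiša", "hiša", "Ncfsa"),
        ("eee", "eee", "I"),
    ]
    for scenario, items in group_by_scenario(tokens, LEXICON).items():
        print(scenario, items)
```

Grouping tokens this way reflects the idea that similar problems, such as homograph disambiguation, can be handed to annotators as separate batches, while tokens whose analysis is unambiguously confirmed by the lexicon may need no manual attention at all.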

Published

2025-12-22

Section

Articles