Treebanking Spoken Slovenian: New Data, Models, and Lessons Learned

Kaja Dobrovoljc

doi:10.51663/pnz.65.3.01

Authors

Kaja Dobrovoljc, Faculty of Arts, University of Ljubljana; Jožef Stefan Institute. ORCID: https://orcid.org/0000-0002-5909-7965

Keywords:

corpus annotation, dependency treebank, spontaneous speech, Universal Dependencies, parsing

Abstract

This paper presents a new version of the Spoken Slovenian Treebank (SST), a balanced and representative collection of transcribed spontaneous speech with manually annotated lemmas, part-of-speech tags, morphological features, and syntactic dependencies, recently expanded with over 3,000 newly annotated utterances. After a brief overview of the data sampling, annotation, and consolidation processes (presented in detail in previous work), we evaluate the significance of this new language resource for both linguistic research and natural language processing. We first highlight its distinctive lexical and morphosyntactic features in comparison to written language, and then assess the impact of these features on the performance of tools for automatic grammatical annotation. Finally, we reflect on the methodological insights gained during treebank creation, discuss the potential of SST for advancing spoken language research, and argue for the necessity of such resources in supporting linguistic diversity in language technology.

