Treebanking Spoken Slovenian: New Data, Models, and Lessons Learned
DOI:
https://doi.org/10.51663/pnz.65.3.01
Keywords:
corpus annotation, dependency treebank, spontaneous speech, Universal Dependencies, parsing
Abstract
This paper presents a new version of the Spoken Slovenian Treebank (SST), a balanced and representative collection of transcribed spontaneous speech with manually annotated lemmas, part-of-speech tags, morphological features, and syntactic dependencies, recently expanded with over 3,000 newly annotated utterances. After a brief overview of the data sampling, annotation, and consolidation processes, which are presented in detail in previous work, we evaluate the significance of this new language resource for both linguistic research and natural language processing. We first highlight its distinctive lexical and morphosyntactic features in comparison to writing, and then assess their impact on the performance of tools for automatic grammatical annotation. Finally, we reflect on the methodological insights gained during treebank creation, discuss the potential of SST for advancing spoken language research, and argue for the necessity of such resources in supporting linguistic diversity in language technology.
License
Copyright (c) 2025 Kaja Dobrovoljc

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.