The Spoken Slovenian Treebank: New Data, Models, and Key Lessons

Authors

DOI:

https://doi.org/10.51663/pnz.65.3.01

Keywords:

corpus annotation, dependency treebank, spontaneous speech, Universal Dependencies, parsing

Abstract

This paper presents a new version of the Spoken Slovenian Treebank (SST), a balanced and representative collection of transcribed spontaneous speech with manually annotated lemmas, parts of speech, morphological features, and syntactic dependencies, which was recently extended with more than 3,000 newly parsed utterances. After a brief overview of the sampling, annotation, and harmonization of the corpus data, which we described in more detail in an earlier paper, we illustrate the value of this language resource for research in linguistics and natural language processing. By comparing the spoken and written treebanks, we first highlight the lexical and morphosyntactic particularities of speech relative to written language, and then show how they affect the performance of tools for automatic grammatical parsing of speech transcriptions. Finally, we present the methodological lessons learned while developing the treebank, discuss its potential for further research on spoken language, and stress the importance of such resources for addressing linguistic diversity in the development of language technologies.

Published

2025-12-22

Issue

Section

Articles