Structural and Semantic Classification of Verbal Multi-Word Expressions in Slovene

Authors

  • Polona Gantar, Faculty of Arts, University of Ljubljana
  • Špela Arhar Holdt, Centre for Language Resources and Technologies, University of Ljubljana
  • Jaka Čibej, Jožef Stefan Institute
  • Taja Kuzman

DOI:

https://doi.org/10.51663/pnz.59.1.05

Keywords:

verb phrases, corpus approach, multi-word expressions, PARSEME, Slovene

Abstract

This paper is an extended version of a conference paper presenting the categorization of verbal multi-word expressions (VMWEs) according to the PARSEME COST Action Shared Task 1.1 Guidelines. The categorization is universal but takes into account the characteristics of each language included. It was used to annotate 13,511 sentences of the Slovene ssj500k 2.0 training corpus, yielding 3,364 identified VMWEs categorized as inherently reflexive verbs, light verb constructions, inherently adpositional verbs, and verbal idioms. The paper presents both the quantitative and the qualitative results of the analysis, compares the proposed categorization system with existing work on VMWEs in Slovene linguistics, and evaluates the usefulness of the system for future work.

Published

2019-06-11