Can Artificial Intelligence Understand Croatian Idioms? An Evaluation of Large Language Models on Lexicographic Tasks
DOI:
https://doi.org/10.51663/pnz.65.3.09
Keywords:
artificial intelligence, large language models, lexicography, idioms, conceptual organization
Abstract
This article examines the potential of ChatGPT to automate two lexicographic tasks in the Online Dictionary of Croatian Idioms (ODCI): (1) identifying semantic equivalents among idioms and (2) generating semantic fields for phraseological units. The aim of the research is to assess how effectively artificial intelligence can automate the process of distinguishing and classifying idioms by meaning, thereby reducing the amount of manual lexicographic work. Because methodologies for applying language technologies are still evolving alongside technological innovation, this study contributes to a better understanding of how AI tools work and of their ability to produce high-quality, usable language data. The results indicate that ChatGPT shows considerable potential for conceptual organization in lexicography. Challenges remain, however, chiefly the non-deterministic nature of AI-generated responses and the need to manually edit automatically generated data.
License
Copyright (c) 2025 Ivana Filipović Petrović, Slobodan Beliga

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish in this journal agree to the following copyright terms:
- Authors retain copyright and grant the journal the right of first publication. The work is simultaneously licensed under a Creative Commons Attribution License, which allows others to share the work with an acknowledgement of its authorship and of its initial publication in this journal.
- Authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the work published in the journal (e.g. depositing it in an institutional repository or publishing it in a book), with an acknowledgement that the work was first published in this journal.
- Before and during the submission process, authors may post their work online (e.g. in institutional repositories or on their own website). They are encouraged to do so, since it can lead to productive exchanges and to earlier and wider citation of the published work (see The Effect of Open Access).