Can AI Understand Croatian Idioms? Assessing Large Language Models in Lexicographic Tasks

DOI:

https://doi.org/10.51663/pnz.65.3.09

Keywords:

artificial intelligence, large language models, lexicography, idioms, conceptual organisation

Abstract

This paper explores the potential of ChatGPT for automating two lexicographic tasks in the Online Dictionary of Croatian Idioms (ODCI): (1) identifying semantic equivalents among idioms and (2) generating semantic fields for idiomatic expressions. The study evaluates how effectively AI can distinguish and group idioms by meaning, with the aim of reducing manual lexicographic work. As methodologies for applying language technologies continue to evolve, this research contributes to the understanding of AI capabilities in linguistic analysis. The findings suggest that ChatGPT shows considerable potential for conceptual organisation in lexicography. Nevertheless, challenges persist, especially the unpredictability of AI-generated responses and the need for human post-editing.

Published

2025-12-22

Section

Articles