Predicting Slovene Text Complexity Using Readability Measures

Authors

  • Tadej Škvorc, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
  • Simon Krek, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia; University of Ljubljana, Faculty of Arts, Aškerčeva 2, 1000 Ljubljana, Slovenia
  • Senja Pollak, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
  • Špela Arhar Holdt, University of Ljubljana, Faculty of Arts, Aškerčeva 2, 1000 Ljubljana, Slovenia; University of Ljubljana, Faculty of Computer and Information Science, 1000 Ljubljana, Slovenia
  • Marko Robnik-Šikonja, University of Ljubljana, Faculty of Computer and Information Science, 1000 Ljubljana, Slovenia

DOI:

https://doi.org/10.51663/pnz.59.1.10

Keywords:

readability, natural language processing, text analysis

Abstract

Most existing readability measures were designed for English texts. We adapt these measures to Slovene and test them. We evaluate ten well-known readability formulas and eight additional readability criteria on five types of texts: children’s magazines, general magazines, daily newspapers, technical magazines, and transcriptions of National Assembly sessions. Because these groups of texts target different audiences, we assume that their differences in writing style should be reflected in their readability scores. Our analysis shows which readability measures distinguish well between the groups and which fail to do so.

Published

2019-06-10