<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>Structural and Semantic Classification of Verbal Multi-Word Expressions in
                    Slovene</title>
                <author>
                    <name>
                        <forename>Polona</forename>
                        <surname>Gantar</surname>
                        <affiliation>Department of Translation, Faculty of Arts, University of
                            Ljubljana</affiliation>
                        <address>
                            <addrLine>Aškerčeva 12</addrLine>
                            <addrLine>SI-1000 Ljubljana</addrLine>
                        </address>
                        <email>apolonija.gantar@guest.arnes.si</email>
                    </name>
                </author>
                <author>
                    <name>
                        <forename>Špela</forename>
                        <surname>Arhar Holdt</surname>
                        <affiliation>CJVT, Faculty of Computer and Information Science, University
                            of Ljubljana</affiliation>
                        <address>
                            <addrLine>Večna pot 113</addrLine>
                            <addrLine>SI-1000 Ljubljana</addrLine>
                        </address>
                        <email>spela.arhar@cjvt.si</email>
                    </name>
                </author>
                <author>
                    <name>
                        <forename>Jaka</forename>
                        <surname>Čibej</surname>
                        <affiliation>Artificial Intelligence Laboratory, Jožef Stefan
                            Institute</affiliation>
                        <address>
                            <addrLine>Jamova cesta 39</addrLine>
                            <addrLine>SI-1000 Ljubljana</addrLine>
                        </address>
                        <email>jaka.cibej@ijs.si</email>
                    </name>
                </author>
                <author>
                    <name>
                        <forename>Taja</forename>
                        <surname>Kuzman</surname>
                        <email>kuzman.taja@gmail.com</email>
                    </name>
                </author>
            </titleStmt>
            <editionStmt>
                <edition><date>2019-04-15</date></edition>
            </editionStmt>
            <publicationStmt>
                <publisher>
                    <orgName xml:lang="sl">Inštitut za novejšo zgodovino</orgName>
                    <orgName xml:lang="en">Institute of Contemporary History</orgName>
                    <address>
                        <addrLine>Kongresni trg 1</addrLine>
                        <addrLine>SI-1000 Ljubljana</addrLine>
                    </address>
                </publisher>
                <pubPlace>http://ojs.inz.si/pnz/article/view/325</pubPlace>
                <date>2019</date>
                <availability status="free">
                    <licence>http://creativecommons.org/licenses/by-nc-nd/4.0/</licence>
                </availability>
            </publicationStmt>
            <seriesStmt>
                <title xml:lang="sl">Prispevki za novejšo zgodovino</title>
                <title xml:lang="en">Contributions to Contemporary History</title>
                <biblScope unit="volume">59</biblScope>
                <biblScope unit="issue">1</biblScope>
                <idno type="ISSN">2463-7807</idno>
            </seriesStmt>
            <sourceDesc>
                <p>No source, born digital.</p>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <projectDesc xml:lang="en">
                <p>Contributions to Contemporary History is one of the central Slovenian scientific
                    historiographic journals, dedicated to publishing articles from the field of
                    contemporary history (the 19th and 20th century).</p>
                <p>The journal is published three times per year in Slovenian and in the following
                    foreign languages: English, German, Serbian, Croatian, Bosnian, Italian, Slovak
                    and Czech. The articles are all published with abstracts in English and
                    Slovenian as well as summaries in English.</p>
            </projectDesc>
            <projectDesc xml:lang="sl">
                <p>Prispevki za novejšo zgodovino je ena osrednjih slovenskih znanstvenih
                    zgodovinopisnih revij, ki objavlja teme s področja novejše zgodovine (19. in 20.
                    stoletje).</p>
                <p>Revija izide trikrat letno v slovenskem jeziku in v naslednjih tujih jezikih:
                    angleščina, nemščina, srbščina, hrvaščina, bosanščina, italijanščina, slovaščina
                    in češčina. Članki izhajajo z izvlečki v angleščini in slovenščini ter povzetki
                    v angleščini.</p>
            </projectDesc>
        </encodingDesc>
        <profileDesc>
            <langUsage>
                <language ident="sl"/>
                <language ident="en"/>
            </langUsage>
            <textClass>
                <keywords xml:lang="en">
                    <term>verb phrases</term>
                    <term>corpus approach</term>
                    <term>multi-word expressions</term>
                    <term>PARSEME</term>
                    <term>Slovene</term>
                </keywords>
                <keywords xml:lang="sl">
                    <term>glagolske zveze</term>
                    <term>korpusni pristop</term>
                    <term>večbesedne enote</term>
                    <term>PARSEME</term>
                    <term>slovenščina</term>
                </keywords>
            </textClass>
        </profileDesc>
        <revisionDesc>
            <listChange>
                <change>
                    <date>2019-06-07</date>
                    <name>Mihael Ojsteršek</name>
                    <desc>Conversion from DOCX to TEI, additional encoding</desc>
                </change>
            </listChange>
        </revisionDesc>
    </teiHeader>
    <text>
        <front>
            <docAuthor>Polona Gantar<note place="foot" xml:id="ftn1" n="*">
                    <hi rend="bold">Department of Translation, Faculty of Arts, University of
                        Ljubljana, Aškerčeva 12, SI-1000 Ljubljana,</hi><ref
                        target="mailto:apolonija.gantar@guest.arnes.si"><hi rend="bold"
                            >apolonija.gantar@guest.arnes.si</hi></ref></note></docAuthor>
            <docAuthor>Špela Arhar Holdt<note place="foot" xml:id="ftn2" n="**">
                    <hi rend="bold"> CJVT, Faculty of Computer and Information Science, University
                        of Ljubljana, Večna pot 113, SI-1000 Ljubljana, </hi><ref
                        target="mailto:spela.arhar@cjvt.si">
                        <hi rend="bold">spela.arhar@cjvt.si</hi></ref></note></docAuthor>
            <docAuthor>Jaka Čibej<note place="foot" xml:id="ftn3" n="***">
                    <hi rend="bold"> Artificial Intelligence Laboratory, Jožef Stefan Institute,
                        Jamova cesta 39, SI-1000 Ljubljana, </hi>
                    <ref target="mailto:jaka.cibej@ijs.si"><hi rend="bold"
                        >jaka.cibej@ijs.si</hi></ref></note></docAuthor>
            <docAuthor>Taja Kuzman<note place="foot" xml:id="ftn4" n="****">
                    <ref target="mailto:kuzman.taja@gmail.com"><hi rend="bold"
                            >kuzman.taja@gmail.com</hi></ref></note></docAuthor>
            <docImprint>
                <idno type="cobissType">Cobiss type: 1.01</idno>
                <idno type="UDC">UDC: 003.295:821.163.6‘367.625</idno>
            </docImprint>
            <div type="abstract" xml:lang="sl">
                <head>IZVLEČEK</head>
                <head>Strukturna in pomenska klasifikacija glagolskih večbesednih enot v
                    slovenščini</head>
                <p>
                    <hi rend="italic">Prispevek je nadgrajena različica konferenčnega prispevka, v
                        katerem predstavljamo kategorije glagolskih večbesednih enot (GVBE), kot so
                        bile oblikovane v okviru mednarodne COST akcije PARSEME Shared Task 1.1. S
                        kategorijami, ki so nadjezikovne in obenem prilagojene posameznim vključenim
                        jezikom, smo označili 13.511 povedi učnega korpusa ssj500k 2.0. Rezultat
                        označevanja je 3.364 identificiranih večbesednih glagolskih enot, ki so
                        klasificirane kot: inherentno povratni glagoli, zveze z glagoli v pomensko
                        oslabljeni rabi, predložnomorfemski glagoli in glagolski idiomi. V prispevku
                        rezultate označevanja predstavimo kvantitativno in kvalitativno, pri čemer
                        sopostavimo predlagani sistem klasifikacije ob obstoječe prakse na področju
                        slovenistične obravnave GVBE in ocenimo uporabnost sistema za nadaljnje
                        delo.</hi>
                </p>
                <p>
                    <hi rend="italic">Ključne besede: glagolske zveze, korpusni pristop, večbesedne
                        enote, PARSEME, slovenščina</hi></p>
            </div>
            <div type="abstract">
                <head>ABSTRACT</head>
                <p>
                    <hi rend="italic">This paper is an extended version of a conference paper
                        presenting the categorization of verbal multi-word expressions (VMWEs)
                        according to the PARSEME COST Action Shared Task 1.1 Guidelines. The
                        categorization is universal but takes into account the characteristics of
                        the individual languages included in it. The Shared Task was used to
                        annotate over 13,000 sentences of the Slovene ssj500k 2.0 training corpus,
                        which resulted in nearly 3,400 identified VMWEs categorized as inherently
                        reflexive verbs, light verb constructions, inherently adpositional verbs,
                        and verbal idioms. The paper presents both the quantitative and qualitative
                        results of the analysis, compares the suggested categorization system to
                        existing work on VMWEs in Slovene linguistics, and evaluates the use of the
                        proposed system for future work.</hi></p>
                <p>
                    <hi rend="italic">Keywords: verb phrases, corpus approach, multi-word
                        expressions, PARSEME, Slovene</hi>
                </p>
            </div>
        </front>
        <body>
            <div>
                <head>Introduction </head>
                <p>In the digital medium, the bulk of interactions between users – as well as
                    between users and computers or applications – occur with the use of language,
                    which is why the existence and open accessibility of digital language
                    infrastructure is of vital importance to the development and vitality of a
                    language. Slovene is no exception in this regard; it requires an infrastructure
                    that serves as a source of information for the language community as well as a
                    resource to be used in applied/theoretical linguistic research and the
                    development of new language technology tools, methods, and services. Examples of
                    such infrastructure include digital language resources that allow for continued
                    updates and contributions from the community, language databases with structured
                    and machine-readable data, and training corpora in which authentic texts are
                    annotated with different linguistic categories. In this regard, digital
                    lexicography, whose aim is to prepare the dictionary part of this language
                    infrastructure, plays an essential role in the field of digital humanities.</p>
                <p>In the field of digital lexicography, multi-word expressions (MWEs) are
                    considered important for constructing machine-readable language resources that
                    in turn enable the compilation of electronic MWE lexicons and the development of
                    language technology tools for a specific language. In order to achieve these
                    goals, it is crucial to know the linguistic features of different types of MWEs
                    and develop methods and standards for their identification in authentic language
                    use.</p>
                <p>However, this is not a trivial task. Definitions and categorisations of MWEs
                    differ according to their methodological and theoretical basis and research
                        goals.<note place="foot" xml:id="ftn5" n="1"> For an overview of MWE
                        classifications according to different methodological approaches, see <ref
                            target="#Gantar.Colman.2018">Gantar et al. (2018)</ref>.</note> A
                    lexicographic perspective focuses on the semantic characteristics of MWEs and
                    defines them as “different types of phrases that demonstrate a certain degree of
                    idiomatic meaning” (<ref target="#Atkins.2008">Atkins and Rundell 2008,
                        166</ref>) or as phrases whose “exact meaning is not directly obtained from
                    its component parts” (<ref target="#Sag.2002">Sag et al. 2002</ref>). On the
                    other hand, the definition of MWEs from the perspective of machine processing
                    emphasises their statistical significance: “a group of tokens in a sentence that
                    cohere more strongly than ordinary syntactic combinations. That is, they are
                    idiosyncratic in form, function, or frequency” (<ref target="#Schneider.2014"
                        >Schneider et al. 2014</ref>) and their inability to be split into
                    independent lexemes and at the same time maintain their semantic and syntactic
                    functions, as well as their lexical, syntactic, semantic, pragmatic and
                    statistical idiomaticity (<ref target="#Baldwin.2010">Baldwin and Kim 2010,
                        3</ref>). Although no universally accepted definition of MWEs exists,
                    researchers in linguistics and NLP both agree that the key feature separating
                    MWEs from free phrases is the special relationship between the elements that form
                    the MWE. This relation is usually defined using such concepts as collocability
                    (or statistical idiomaticity), idiomaticity (or semantic non-compositionality),
                    syntactic (in)flexibility, which also includes the possibility of an internal
                    modification of the phrase and the flexible order of its lexicalised elements,
                    and lexical variance. </p>
                <p>An attempt to provide the much-needed guidelines and a pilot study on the
                    annotation of MWEs in language corpora was made as part of the PARSEME COST
                    Action Shared Task 1.1.<note place="foot" xml:id="ftn6" n="2">
                        <hi rend="italic">Home – PARSEME</hi>, <ref target="http://www.parseme.eu"
                            >http://www.parseme.eu</ref>.</note> The task focused on the automatic
                    identification of verbal multi-word expressions (VMWEs) in running text. As part
                    of the task, universal guidelines for VMWE classification were compiled
                    containing examples for all languages involved. Based on the guidelines, a
                    multi-lingual corpus was manually annotated with VMWEs and made available under
                    a Creative Commons licence.</p>
                <p>While the categories of MWEs were designed as language-independent, the specific
                    characteristics of all the included languages had to be taken into account to
                    reach a universally applicable solution. In this paper, we focus on the
                    Slovene results, which will be useful when compiling a digital lexicon of
                    Slovene MWEs, as well as other language resources such as the Dictionary of
                    Modern Slovene (<ref target="#Gorjanc.2017">Gorjanc et al. 2017</ref>) and a
                    corpus-based grammar of Slovene. The topic was presented in <ref
                        target="#Gantar.Colman.2018">Gantar et al. (2018)</ref> with a focus on MWEs
                    and their theoretical framework in Slovene studies. This paper focuses on MWEs
                    from the perspective of a unified concept that was applied to 20 different
                    languages within the PARSEME Shared Task 1.1. A comparison of the results can be
                    found in <ref target="#Ramisch.2018">Ramisch et al. (2018)</ref>. </p>
            </div>
            <div>
                <head>Identifying and Categorizing Verbal Multi-Word Expressions</head>
                <p>The verb plays a crucial role in the sentence in terms of co-text organization,
                    which is why the PARSEME Shared Task focused on verbal multi-word expressions
                    (VMWEs). For further analysis, it is crucial to determine the differences
                    between the definitions and categorizations of VMWEs as established in Slovene
                    studies on the one hand, and the international PARSEME COST Action on the other.
                    Our task aims to adapt the international annotation scheme in order to
                    include Slovene. Our research question focuses on the applicability of the
                    PARSEME system to authentic Slovene texts. Can the adapted PARSEME categories be
                    applied in practice? Are they attributable, robust, one-dimensional, and
                    represented in actual language use? What information do they entail (e.g. in
                    terms of syntax), how can they contribute to the development of new automatic
                    extraction methods, and finally, which problems arise when applying the system
                    to text? In the following sections, we present the annotation method,
                    followed by a quantitative and a qualitative analysis. The latter focuses on
                    individual categories, their characteristics, and the potential problems of the
                    approach.</p>
                <div>
                    <head>Verbal Multiword Expressions – Slovenian Case</head>
                    <p>In Slovene studies, MWEs are divided into a) phraseological units (PUs), in
                        which at least one component carries meaning that differs from one of its
                        denotative “dictionary” senses, and expresses figurativeness, and b) all
                        other multiword expressions (i.e. fixed expressions), which are
                        characterized by a certain degree of fixedness and denote a meaning that can
                        be predicted from the meanings of their elements. PUs are further divided by
                        syntactic structure: the clausal type (which also includes proverbs) and the
                        phrasal type (all non-verbal PUs). In Slovene linguistic theory, verbal MWEs
                        are determined by their morphosyntactic features (<ref
                            target="#Toporišič.1973-74">Toporišič 1973/74</ref>; <ref
                            target="#Kržišnik.1994">Kržišnik 1994</ref>): a MWE is classified as a
                        VMWE if it includes a verbal element and if it functions as a predicate.
                        However, it remains unclear how to classify examples in which the verbal MWE
                        does not function as a predicate, e.g. <hi rend="italic">hočeš nočeš</hi>
                        ‘like it or not’, which includes two verbal elements, but functions as an
                        adverbial.</p>
                    <p>The problem of categorizing MWEs according to their morphological structure
                        and syntactic function was resolved in the PARSEME Shared Task by defining
                        the main criterion for VMWEs as follows: their syntactic head in the
                        prototypical form is a verb, regardless of whether the expression can also
                        fulfil other syntactic roles. In addition, Slovene categorizations
                        have so far never treated verbs with the <hi rend="italic">se/si</hi>
                        morpheme as a separate MWE category. Phrasal verbs that consist of a verb
                        and a preposition and carry an independent meaning were categorized as MWEs
                        only conditionally (<ref target="#Kržišnik.1994">Kržišnik 1994,
                        58</ref>).</p>
                </div>
                <div>
                    <head>Verbal Multiword Expressions within the PARSEME Shared Task 1.1</head>
                    <p>For the categorization of VMWEs within the PARSEME Shared Task 1.1,
                        exhaustive guidelines<note place="foot" xml:id="ftn7" n="3">
                            <hi rend="italic">PARSEME Shared Task 1.1 - Annotation guidelines</hi>,
                                <ref
                                target="http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/"
                                >http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/</ref>.</note>
                        were prepared in which the VMWE categories are defined by semantic and
                        syntactic features and are described with decision trees. The identification
                        and categorization process consisted of three steps. In the first step, we
                        identified potential VMWEs consisting of a verb as the syntactic head of the
                        phrase and at least one other word. In the second step, we identified the
                        lexicalised elements within the phrase. In the third step, we used detailed
                        linguistic tests consisting of generic and specific language criteria to
                        determine the correct category of the identified VMWE.</p>
                    <p>According to the guidelines, VMWE categories fall into two classes,
                        depending on whether a category can be applied to the majority of languages
                        included in the task or is typical of individual (groups of) languages.
                        The universal categories include verbal idioms (VID) and light verb
                        constructions (LVC), which are further divided into full (LVC.full) and
                        causal (LVC.cause). The quasi-universal categories, which are used within
                        individual groups of languages, include inherently reflexive verbs (IRV),
                        which are typical of most Slavic languages, and verb-particle constructions
                        (VPC), typical of Germanic languages. In the second version of the
                        guidelines, an additional quasi-universal category was added: inherently
                        adpositional verbs (IAV), which require an open syntactic slot and are
                        typical of Slovene and several other Slavic languages. </p>
                    <p>For Slovene, examples of VMWEs can be found for all the categories except for
                        VPC. For certain categories, however, there are specific
                        characteristics based on syntactic or morphological features of Slovene or
                        on grammatical categories that are generally accepted in Slovene but differ
                        to some extent from other languages. The specific Slovene features will be
                        described along with individual VMWE types.</p>
                </div>
            </div>
            <div>
                <head>The Corpus and Annotation Tool</head>
                <p>VMWEs were annotated in the Slovene ssj500k 2.0 training corpus (<ref
                        target="#Krek.2017">Krek et al. 2017</ref>), which consists of approximately
                    500,000 tokens and just under 28,000 sentences sampled from the FidaPLUS corpus
                    of Slovene (<ref target="#ArharHoldt.2007">Arhar Holdt and Gorjanc 2007</ref>).
                    The entire corpus is morphosyntactically tagged (<ref target="#Grčar.2012">Grčar
                        et al. 2012</ref>). Certain portions also contain named-entity annotations
                    and syntactic dependencies (<ref target="#Dobrovoljc.2012">Dobrovoljc et al.
                        2012</ref>). In the first annotation phase, 11,411 sentences were annotated
                    by two annotators with VMWEs as defined by the first version of the PARSEME
                    Guidelines (<ref target="#Candito.2016">Candito et al. 2016</ref>).
                    Disagreements in annotation were discussed and adjusted accordingly. In the
                    second phase, the categories were automatically modified based on the second
                    version of the PARSEME Guidelines and manually checked. The second phase
                    continued with an additional 2,100 sentences, annotated in packages by
                    individual annotators. Problematic examples were discussed and then annotated
                    accordingly.</p>
                <p>The tool used for annotation in the first phase was the SentenceMarkup System (Figure
                    1), a custom tool primarily developed for syntactic dependency annotation of
                    Slovene texts (<ref target="#Dobrovoljc.2012">Dobrovoljc et al. 2012</ref>). The
                    tool was adjusted for the annotation of VMWEs by adding an additional
                    independent and interconnectable annotation layer (cf. <ref
                        target="#Gantar.2017">Gantar et al. 2017</ref>).</p>
                <figure>
                    <graphic url="SentenceMarkup_VMWEs.PNG"/>
                    <head>Figure 1: Annotations in the SentenceMarkup System</head>
                </figure>
                <p>In the second phase, the annotation took place in the FLAT annotation platform
                    (FoLiA Linguistic Annotation Tool), which was adapted for the purposes of the
                    PARSEME Shared Task and tested on 13 collaborating languages (Figure 2). The
                    FLAT platform allows text strings to be annotated with a set of categories.
                    Files can be assigned to different annotators. The supported formats are XML and
                    TSV, while annotated files are exported in XML. All annotations are saved
                    automatically. The interface also features text search using CQL.</p>
                <figure>
                    <graphic url="FLAT_WMWEs.png"/>
                    <head>Figure 2: Annotations in FLAT</head>
                </figure>
            </div>
            <div>
                <head>Quantitative Analysis </head>
                <p>The annotated VMWEs were imported into the ssj500k 2.1 training corpus (<ref
                        target="#Krek.2017">Krek et al. 2018</ref>). Among the 13,511 sentences
                    annotated in the first two annotation phases, 2,920 of them (approximately 22%)
                    contain at least one VMWE. On average, each of these sentences features 1.15
                    VMWEs. Taking into account all the annotated sentences, each sentence contains
                    approximately 0.25 VMWEs; in other words, on average, one VMWE is present in
                    every fourth sentence.</p>
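                <p>The ratios in this paragraph can be re-derived with a few lines of Python. The
                    following is a minimal sketch rather than part of the original study: it assumes
                    only the reported totals (13,511 annotated sentences and 3,364 VMWEs in the
                    training corpus) and the reported average of 1.15 VMWEs per sentence containing
                    a VMWE. All variable and function names are illustrative.</p>
                <eg xml:space="preserve">
```python
# Totals reported in the paper; the names below are illustrative only.
ANNOTATED_SENTENCES = 13_511    # sentences annotated in the two phases
TOTAL_VMWES = 3_364             # VMWEs in the training corpus
AVG_PER_VMWE_SENTENCE = 1.15    # reported average for sentences with a VMWE


def vmwe_density(total_vmwes: int, sentences: int) -> float:
    """Return the average number of VMWEs per annotated sentence."""
    return total_vmwes / sentences


density = vmwe_density(TOTAL_VMWES, ANNOTATED_SENTENCES)

# Inverse of the density: how many sentences one VMWE is spread over.
sentences_per_vmwe = 1 / density

# Number of VMWE-bearing sentences implied by the reported average.
# This is only an estimate, because 1.15 is a rounded figure.
implied_vmwe_sentences = TOTAL_VMWES / AVG_PER_VMWE_SENTENCE
implied_share = implied_vmwe_sentences / ANNOTATED_SENTENCES

print(f"VMWEs per annotated sentence: {density:.2f}")
print(f"Sentences per VMWE: {sentences_per_vmwe:.1f}")
print(f"Implied share of sentences with a VMWE: {implied_share:.0%}")
```
                </eg>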
                <p>Table 1 shows the distribution of the annotated VMWEs by category. The final
                    number of VMWEs in the training corpus is 3,364. The number of different VMWEs
                    (i.e. without any repetitions of the same unit) is just under 1,100. In terms
                    of absolute frequency, the most frequent category is IRV (48%), and the
                    least frequent category is LVC.cause (2%). The categories with the highest
                    number of different VMWEs are VID and IRV, while LVC.full and LVC.cause are the
                    least diverse categories. We describe each category in more detail in section
                    5.</p>
                <table rend="table-scroll">
                    <head>Table 1: Distribution of annotated VMWEs by category</head>
                    <row role="label">
                        <cell>Category</cell>
                        <cell>Example</cell>
                        <cell>Translation</cell>
                        <cell>All VMWEs</cell>
                        <cell>Percent</cell>
                        <cell>Different VMWEs </cell>
                    </row>
                    <row>
                        <cell>Inherently Reflexive Verbs (IRV) </cell>
                        <cell><hi rend="italic">bati se</hi></cell>
                        <cell>to be afraid</cell>
                        <cell style="text-align:right;">1,627</cell>
                        <cell style="text-align:right;">48%</cell>
                        <cell style="text-align:right;">345</cell>
                    </row>
                    <row>
                        <cell>Inherently Adpositional Verbs (IAV) </cell>
                        <cell><hi rend="italic">priti do</hi></cell>
                        <cell>to come about</cell>
                        <cell style="text-align:right;">710</cell>
                        <cell style="text-align:right;">21%</cell>
                        <cell style="text-align:right;">154</cell>
                    </row>
                    <row>
                        <cell>Verbal Idioms (VID) </cell>
                        <cell><hi rend="italic">spati kot ubit</hi>
                        </cell>
                        <cell>(lit.) to sleep like a dead person </cell>
                        <cell style="text-align:right;">724</cell>
                        <cell style="text-align:right;">22%</cell>
                        <cell style="text-align:right;">457</cell>
                    </row>
                    <row>
                        <cell>Light Verb Constructions (LVC):<lb/>
                            LVC.cause</cell>
                        <cell><hi rend="italic">spraviti koga v smeh</hi></cell>
                        <cell>to make someone laugh </cell>
                        <cell style="text-align:right;">64</cell>
                        <cell style="text-align:right;">2%</cell>
                        <cell style="text-align:right;">27</cell>
                    </row>
                    <row>
                        <cell>Light Verb Constructions (LVC):<lb/>
                            LVC.full</cell>
                        <cell><hi rend="italic">biti v pomoč</hi></cell>
                        <cell>to be of help </cell>
                        <cell style="text-align:right;">239</cell>
                        <cell style="text-align:right;">7%</cell>
                        <cell style="text-align:right;">103</cell>
                    </row>
                    <row>
                        <cell>Total</cell>
                        <cell>-</cell>
                        <cell>-</cell>
                        <cell style="text-align:right;">3,364</cell>
                        <cell style="text-align:right;">100%</cell>
                        <cell style="text-align:right;">1,086</cell>
                    </row>
                </table>
                <p>Table 2 shows the most common VMWE structures by parts of speech (V – verb, N –
                    noun, A – adjective, R – adverb, Pre – preposition, Pro – pronoun). The
                    structures occurring in the corpus with a frequency below 10 have been
                    categorized as Other. The most frequent structures are V + Pro, V + Pre, V + N
                    and V + Pre + N. Collectively, they account for approximately 85% of all
                    annotated VMWEs.</p>
                <table rend="table-scroll">
                    <head>Table 2: Distribution of annotated VMWEs by part-of-speech structure </head>
                    <row role="label">
                        <cell>Structure</cell>
                        <cell>Example</cell>
                        <cell>Translation</cell>
                        <cell>Frequency</cell>
                        <cell>Percent</cell>
                    </row>
                    <row>
                        <cell>V + Pro</cell>
                        <cell><hi rend="italic">bati se</hi></cell>
                        <cell>to be afraid</cell>
                        <cell style="text-align:right;">1,663</cell>
                        <cell style="text-align:right;">49%</cell>
                    </row>
                    <row>
                        <cell>V + Pre</cell>
                        <cell><hi rend="italic">priti do</hi></cell>
                        <cell>to come about</cell>
                        <cell style="text-align:right;">535</cell>
                        <cell style="text-align:right;">16%</cell>
                    </row>
                    <row>
                        <cell>V + N</cell>
                        <cell><hi rend="italic">imeti odnos</hi></cell>
                        <cell>to have a relationship</cell>
                        <cell style="text-align:right;">372</cell>
                        <cell style="text-align:right;">11%</cell>
                    </row>
                    <row>
                        <cell>V + Pre + N</cell>
                        <cell><hi rend="italic">biti pod vtisom</hi></cell>
                        <cell>to be under the impression</cell>
                        <cell style="text-align:right;">303</cell>
                        <cell style="text-align:right;">9%</cell>
                    </row>
                    <row>
                        <cell>V + Pro + A</cell>
                        <cell><hi rend="italic">biti si edini</hi></cell>
                        <cell>to be unanimous</cell>
                        <cell style="text-align:right;">146</cell>
                        <cell style="text-align:right;">4%</cell>
                    </row>
                    <row>
                        <cell>V + R</cell>
                        <cell><hi rend="italic">biti res</hi></cell>
                        <cell>to be true</cell>
                        <cell style="text-align:right;">136</cell>
                        <cell style="text-align:right;">4%</cell>
                    </row>
                    <row>
                        <cell>V + Pro + Pre + N</cell>
                        <cell><hi rend="italic">ujeti se v past</hi></cell>
                        <cell>to get caught in a trap</cell>
                        <cell style="text-align:right;">24</cell>
                        <cell style="text-align:right;">1%</cell>
                    </row>
                    <row>
                        <cell>V + A</cell>
                        <cell><hi rend="italic">biti jasno</hi></cell>
                        <cell>to be clear</cell>
                        <cell style="text-align:right;">20</cell>
                        <cell style="text-align:right;">1%</cell>
                    </row>
                    <row>
                        <cell>V + A + N</cell>
                        <cell><hi rend="italic">imeti glavno besedo</hi></cell>
                        <cell>to have the last word</cell>
                        <cell style="text-align:right;">19</cell>
                        <cell style="text-align:right;">1%</cell>
                    </row>
                    <row>
                        <cell>V + Pre + N + N</cell>
                        <cell><hi rend="italic">biti na robu propada</hi></cell>
                        <cell>to be on the verge of collapse </cell>
                        <cell style="text-align:right;">12</cell>
                        <cell style="text-align:right;">&lt;1%</cell>
                    </row>
                    <row>
                        <cell>V + Pro + N</cell>
                        <cell><hi rend="italic">vzeti si čas</hi></cell>
                        <cell>to take one's time</cell>
                        <cell style="text-align:right;">11</cell>
                        <cell style="text-align:right;">&lt;1%</cell>
                    </row>
                    <row>
                        <cell>Other</cell>
                        <cell>-</cell>
                        <cell>-</cell>
                        <cell style="text-align:right;">123</cell>
                        <cell style="text-align:right;">4%</cell>
                    </row>
                    <row>
                        <cell>Total</cell>
                        <cell>-</cell>
                        <cell>-</cell>
                        <cell style="text-align:right;">3,364</cell>
                        <cell style="text-align:right;">100%</cell>
                    </row>
                </table>
            </div>
            <div>
                <head>Qualitative Analysis</head>
                <p>The qualitative analysis deals with the semantic and structural features of
                    VMWEs. Based on the PARSEME Guidelines, several characteristic features of
                    Slovene were identified on the level of structural and semantic tests used to
                    determine the category of VMWEs. In the analysis, we focused on patterns within
                    structures for each subcategory, the syntactic environment of the expression as
                    a unit, and the lexical units filling the corresponding participant slots. Based
                    on corpus examples, we also tried to identify the indicators of semantic
                    integrality that could be useful when automatically identifying VMWEs in
                    text.</p>
                <div>
                    <head>Inherently Reflexive Verbs (IRV)</head>
                    <p>The PARSEME Shared Task 1.1 guidelines treat verbs occurring with the
                        independent morpheme <hi rend="italic">se/si</hi> as a separate category of
                        VMWEs called <hi rend="italic">inherently reflexive verbs</hi>. It is a
                        language-specific category that includes phrases in which the verb without
                        the morpheme <hi rend="italic">se/si</hi> does not exist (<hi rend="italic"
                            >zdeti se</hi> 'to seem', *<hi rend="italic">zdeti</hi>) or in which the
                        presence of <hi rend="italic">se/si</hi> changes the meaning of the verb
                            (<hi rend="italic">pobrati se</hi> 'to recover' vs <hi rend="italic"
                            >pobrati</hi> 'to pick up').</p>
                    <p>Inherently reflexive verbs account for the largest share of VMWEs in the
                        training corpus (see Table 1). Among the correctly categorized examples
                        (1,621 in total)<note place="foot" xml:id="ftn8" n="4"> Among the 1,627
                            annotated examples, four were miscategorized. In two examples, the
                            elements of the expressions were incorrectly annotated. These examples
                            were excluded from further analysis.</note> we identified 339 different
                        IRVs, with the following most frequently occurring verbs: <hi rend="italic"
                            >zdeti se</hi> 'to seem', <hi rend="italic">odločiti se</hi> 'to
                        decide', <hi rend="italic">zgoditi se</hi> 'to come to pass' and <hi
                            rend="italic">pojaviti se</hi> 'to appear'. </p>
                    <p>To test whether the expression is semantically integral and to differentiate
                        it from other types of verb phrases with <hi rend="italic">se/si</hi> that
                        are not defined as VMWEs, we examined which syntactic positions the verb
                        opens up when it functions as a phrase. Inherently reflexive verbs
                        keep <hi rend="italic">se</hi>/<hi rend="italic">si</hi> as an obligatory
                        verb morpheme in all forms of their inflectional paradigm and can be
                        transitive (<hi rend="italic">bati se koga</hi>/<hi rend="italic">česa</hi>
                        'to be afraid of smn/sth') or intransitive (<hi rend="italic">znajti se</hi>
                        'to find oneself somewhere', <hi rend="italic">zvečeriti se</hi> 'to fall
                        [evening]'). </p>
                    <p>Inherently reflexive verbs as VMWEs must be differentiated from verbs where
                        the reflexive pronoun <hi rend="italic">se/si</hi> is not an obligatory
                        morpheme but serves another function, more specifically: (a) it denotes
                        reciprocity (<hi rend="italic">poljubljati se</hi> 'to kiss [each other]',
                            <hi rend="italic">srečati se</hi> 'to encounter [each other]'), (b) it
                        denotes that the target of the action is the subject (<hi rend="italic"
                            >umivati se</hi> 'to wash [oneself]', <hi rend="italic">praskati se</hi>
                        'to scratch [oneself]'), or that the action is to the benefit of the subject
                            (<hi rend="italic">kuhati si</hi> 'to cook [oneself sth]'), (c) it is
                        used for passivizing the sentence by removing the agent (<hi rend="italic"
                            >kdo ponavlja kaj</hi> 'someone repeats something' – <hi rend="italic"
                            >kaj se ponavlja</hi> 'something is repeated'), and (d) it denotes a
                        generic action (<hi rend="italic">govori se</hi> 'it is said'; <hi
                            rend="italic">se razume</hi> 'it is understood').</p>
                    <p>With verbs that can also occur without <hi rend="italic">se</hi>/<hi
                            rend="italic">si</hi>, only the phrases where the morpheme changes the
                        verb's meaning are categorized as IRVs. There are cases in which the
                        presence (or absence) of <hi rend="italic">se</hi>/<hi rend="italic">si</hi>
                        causes a semantic shift directly tied to a human subject. In these cases, the
                        verb takes on a metaphorical meaning (<hi rend="italic">pobrati se</hi> 'to
                        recover' vs <hi rend="italic">pobrati</hi> 'to pick sth up'; <hi rend="italic"
                            >gristi se</hi> 'to worry' vs <hi rend="italic">gristi</hi> 'to bite').</p>
                    <p>In Slovene linguistics, lexicalised phrases consisting of a verb and the <hi
                            rend="italic">se</hi>/<hi rend="italic">si</hi> morpheme have so far not
                        been treated as fixed expressions. The main focus has been on recognizing
                        the function of the morpheme or the reflexive pronoun in terms of denoting
                        different degrees of agentness or the subject's (un)involvedness, as in the
                        case of the non-singular (<hi rend="italic">zbrati se</hi> 'to gather') or
                        generic agent (<hi rend="italic">tiskati se</hi> 'to be printed') (<ref
                            target="#Žele.2012">Žele 2012, 44</ref>; <ref target="#Toporišič.1982"
                            >Toporišič 1982, 244</ref>; <ref target="#Toporišič.2000">2000,
                            503</ref>). Identifying IRVs in text on the basis of
                        their semantic and syntactic properties is particularly important for the automatic
                        identification of MWEs. In future lexicons and dictionaries, IRVs should
                        thus be treated either as independent entries or as part of polysemy.</p>
                </div>
                <div>
                    <head>Light Verb Constructions (LVC)</head>
                    <p>Light verb constructions have been treated from different perspectives by
                        different authors (for an overview, see <ref target="#GodecSoršak.2013"
                            >Soršak 2013</ref>). In most definitions, the verbs in LVCs are
                        categorized as something between full verbs and auxiliary verbs, while the
                        expressions that feature them are categorized as a phenomenon between fixed
                        and free expressions. Using existing typologies for Slovene (<ref
                            target="#Toporišič.2000">Toporišič 2000</ref>; <ref target="#Žele.1999"
                            >Žele 1999</ref>), Soršak analyzes Slovene LVCs based on the entries in
                        the Dictionary of Standard Slovene (<ref target="#SSSK.">SSKJ</ref>). The
                        results highlight that the dictionary often mentions the semantically light
                        use of a verb in places where the use is stylistically marked, most
                        frequently as expressive (<ref target="#GodecSoršak.2013">Soršak 2013,
                            514</ref>; e.g. <hi rend="italic">groza ga sprehaja, </hi>lit. 'terror
                        is walking him'). The results described in this paper show the opposite – in
                        the annotated corpus, LVCs are typical, stylistically neutral, and
                        frequently occurring.</p>
                    <p>As per the PARSEME Guidelines, an LVC must fulfil the following conditions: it
                        consists of a verb and a noun or a noun phrase that can also take the form
                        of a prepositional phrase (<hi rend="italic">imeti mnenje</hi> 'to have an
                        opinion', <hi rend="italic">biti v dvomih</hi> 'to be in doubt'), and must
                        open up its own valency slots (<hi rend="italic">kdo ima predavanje za
                            koga</hi> 'someone holds a lecture for someone'). Semantically, the
                        expression must denote an action (<hi rend="italic">imeti predavanje</hi>
                        'to hold a lecture') or a state (<hi rend="italic">biti v dvomih</hi> 'to be
                        in doubt'). Depending on the verb, the category has two subtypes: (a) if the
                        verb contributes to the meaning on a predominantly categorical level, the
                        expression is categorized as LVC.full (<hi rend="italic">biti v pomoč</hi>
                        'to be of help'); (b) if the subject can be interpreted as the cause or
                        source of the denoted action, the expression is categorized as LVC.cause
                            (<hi rend="italic">spraviti v smeh</hi> 'to make smn laugh'). The LVC
                        tests also take into account the abstractness of the noun (<hi rend="italic"
                            >imeti avto</hi> 'to have a car' is not a multiword expression, while
                        idiomatic expressions like <hi rend="italic">imeti mačka</hi> 'lit. to have
                        a cat – to have a hangover' are categorized as VIDs) and, with LVC.full, the
                        possibility of rephrasing by omitting the verb (<hi rend="italic">Janez ima
                            predavanje</hi> 'Janez holds a lecture' – <hi rend="italic">Janezovo
                            predavanje</hi> 'Janez's lecture'). </p>
                    <p>Despite the somewhat elusive concept of LVCs, the annotation process has
                        confirmed that the PARSEME guidelines are specific enough to be successfully
                        applied to real text. Of the 303 examples annotated as LVCs (one of which was
                        categorized incorrectly), 78.8% were LVC.full and 21.2% LVC.cause. Structurally,
                        87.1% were combinations of a verb and a noun, while 12.9% were combinations
                        of a verb and a prepositional phrase. The annotated LVCs contained a total
                        of 19 different verbs,<note place="foot" xml:id="ftn9" n="5">This is the
                            full set of light verbs attested in the data, confirming that the set of verbs
                            occurring in these expressions is limited. In the dictionary, <ref
                                target="#GodecSoršak.2013">Soršak (2013, 513)</ref> finds mentions
                            of semantic lightness in 420 verb entries. However, as mentioned, the
                            labels often signify stylistically marked and atypical language
                            use.</note> predominantly the verb <hi rend="italic">imeti</hi> 'to
                        have' (65.6%), but also <hi rend="italic">biti</hi> 'to be' (13.6%) and <hi
                            rend="italic">dati/dajati</hi> 'to give' (a total of 9.6%).<note
                            place="foot" xml:id="ftn10" n="6"> In Slovene linguistics, verb phrases
                            with <hi rend="italic">imeti</hi> 'to have' and <hi rend="italic"
                                >biti</hi> 'to be' have been most frequently treated as the
                            equivalent of LVCs, but analyzed from different perspectives (see e.g.
                            Vidovič Muha 1998).</note> Other verbs (<hi rend="italic">narediti</hi>
                        'to do', <hi rend="italic">postaviti</hi>/<hi rend="italic">postavljati</hi>
                        'to put', <hi rend="italic">ostati</hi> 'to remain', <hi rend="italic"
                            >voditi</hi> 'to lead', <hi rend="italic">namenjati</hi> 'to pay
                        [attention]', <hi rend="italic">delati</hi> 'to do/make', <hi rend="italic"
                            >storiti</hi> 'to do', <hi rend="italic">vzbujati</hi>/<hi rend="italic"
                            >zbujati</hi> 'to incite', <hi rend="italic">dobiti</hi> 'to get', <hi
                            rend="italic">zastaviti</hi> 'to pose', <hi rend="italic">spraviti</hi>
                        'to make', <hi rend="italic">doseči</hi> 'to achieve' and <hi rend="italic"
                            >nositi</hi> 'to wear') occur less frequently, often in a single
                        expression (<hi rend="italic">ostati v spominu</hi> 'to remain in one's
                        memory', <hi rend="italic">namenjati pozornost</hi> 'to pay attention to
                        sth').</p>
                    <p>Combinations of a verb and a prepositional phrase are somewhat more typical
                        of the LVC.cause category. In the annotated data, these combinations occur
                        exclusively with the prepositions <hi rend="italic">v</hi> 'in' (33
                        instances) and <hi rend="italic">na</hi> 'on' (6 instances). In the majority
                        of cases, the combination is <hi rend="italic">biti v</hi> (<hi
                            rend="italic">biti v pomoč</hi> 'to be of help', <hi rend="italic">biti
                            v podporo</hi> 'to provide support', <hi rend="italic">biti v
                            navadi</hi> 'to be a habit').</p>
                    <p>In the annotated expressions, a relatively limited set of nouns can be found:
                        a total of 97. The most frequent nouns are <hi rend="italic">težava</hi>
                        'problem' (21) and <hi rend="italic">pravica</hi> 'right' (20), followed by
                            <hi rend="italic">možnost</hi> 'possibility', <hi rend="italic"
                            >mnenje</hi> 'opinion', <hi rend="italic">učinek</hi> 'effect', <hi
                            rend="italic">vloga</hi> 'role', <hi rend="italic">vpliv</hi>
                        'influence', <hi rend="italic">vtis</hi> 'impression', <hi rend="italic"
                            >pomoč</hi> 'help', <hi rend="italic">občutek</hi> 'feeling', <hi
                            rend="italic">prednost</hi> 'advantage', <hi rend="italic">sreča</hi>
                        'luck', <hi rend="italic">korist</hi> 'benefit', <hi rend="italic"
                            >vprašanje</hi> 'question', <hi rend="italic">volja</hi> 'will', <hi
                            rend="italic">posledica</hi> 'consequence'. As expected, some of these
                        nouns occur exclusively in LVC.full (<hi rend="italic">pravica</hi>, <hi
                            rend="italic">možnost</hi>, <hi rend="italic">mnenje</hi>, <hi
                            rend="italic">vloga</hi>), while others occur in LVC.cause (<hi
                            rend="italic">učinek</hi>, <hi rend="italic">vpliv</hi>, <hi
                            rend="italic">vtis</hi>, <hi rend="italic">pomoč</hi>). In other cases,
                        the category depends on the meaning of the verb (<hi rend="italic">dati
                            prednost</hi> 'to give an advantage' → LVC.cause and <hi rend="italic"
                            >imeti prednost</hi> 'to have an advantage' → LVC.full).</p>
                    <p>In accordance with the conclusions made by <ref target="#GodecSoršak.2013"
                            >Soršak (2013, 519)</ref>, the results show that the featured verbs can
                        also be used with full meaning, while the semantic lightness in LVCs is
                        complemented by the nominal part (<hi rend="italic">imeti</hi> 'to have'
                        meaning 'to possess' compared to <hi rend="italic">imeti posledice</hi> 'to
                        have consequences' meaning 'to cause/lead to consequences'). Semantically,
                        the noun groups occurring in LVC.cause describe the result of an action, be
                        it a type of result (<hi rend="italic">učinek</hi> 'effect', <hi
                            rend="italic">vpliv</hi> 'influence', <hi rend="italic">vtis</hi>
                        'impression'), a positive (<hi rend="italic">korist</hi> 'benefit', <hi
                            rend="italic">užitek</hi> 'pleasure') or negative consequence (<hi
                            rend="italic">muka</hi> 'torment', <hi rend="italic">preglavica</hi>
                        'trouble'). The semantically light verb binds the result to the subject (<hi
                            rend="italic">nekdo</hi>/<hi rend="italic">nekaj daje vtis</hi> 'smn/sth
                        makes an impression', i.e. the agent is the cause of the action). In certain
                        cases, LVCs can be converted into semantically full verbs with a similar
                        morphological base (<hi rend="italic">dosegati učinek</hi> 'to achieve an
                        effect' – <hi rend="italic">učinkovati</hi> 'to affect'; <hi rend="italic"
                            >imeti vpliv</hi> 'to have an influence' – <hi rend="italic"
                            >vplivati</hi> 'to influence'), but not always (<hi rend="italic">imeti
                            posledice</hi> 'to have consequences' – /).</p>
                    <p>The nouns occurring in LVC.full are semantically more diverse. Dividing them
                        into semantic groups reveals that the common ground of these expressions can
                        be defined as planning or estimating success. Among the encountered LVCs are
                        phrases with nouns dealing with (a) communication (<hi rend="italic"
                            >mnenje</hi> 'opinion', <hi rend="italic">predlog</hi> 'suggestion', <hi
                            rend="italic">vprašanje</hi> 'question'); or describing (b) the
                        potential for success (<hi rend="italic">možnost</hi> 'possibility', <hi
                            rend="italic">prednost</hi> 'advantage', <hi rend="italic"
                            >priložnost</hi> 'opportunity'); (c) initial steps (<hi rend="italic"
                            >obljuba</hi> 'promise', <hi rend="italic">napoved</hi> 'prediction',
                            <hi rend="italic">načrt</hi> 'plan'); (d) potential reasons for failure
                            (<hi rend="italic">napaka</hi> 'mistake', <hi rend="italic"
                            >pomanjkljivost</hi> 'disadvantage'). Other groups deal with (e)
                        negative states (<hi rend="italic">težava</hi> 'problem', <hi rend="italic"
                            >strah</hi> 'fear', <hi rend="italic">dvom</hi> 'doubt'), (f) positive
                        qualities (<hi rend="italic">moč</hi> 'power', <hi rend="italic">pogum</hi>
                        'courage', <hi rend="italic">potrpljenje</hi> 'patience'), (g) achieved
                        results (<hi rend="italic">izobrazba</hi> 'education', <hi rend="italic"
                            >status</hi> 'status', <hi rend="italic">posel</hi> 'business'), and (h)
                        attitudes towards as yet unrealized goals (<hi rend="italic">želja</hi>
                        'wish', <hi rend="italic">ambicija</hi> 'ambition', <hi rend="italic"
                            >vizija</hi> 'vision'). Again, some examples can be converted into a
                        semantically full verb (<hi rend="italic">imeti mnenje</hi> 'to have an
                        opinion' – <hi rend="italic">meniti</hi> 'to think'), while others cannot (<hi
                            rend="italic">imeti ambicije</hi> 'to have ambitions' – /).</p>
                </div>
                <div>
                    <head>Inherently Adpositional Verbs (IAV)</head>
                    <p>Inherently adpositional verbs, also called verbs with a lexicalised
                        prepositional morpheme (<ref target="#Žele.2002">Žele 2002</ref>), were
                        included in the PARSEME Guidelines during the second annotation phase as an
                        optional test category.<note place="foot" xml:id="ftn11" n="7">Based on the
                            feedback from the first annotation campaign and the issues discussed
                            among the contributors, idiomatic combinations of verbs with
                            prepositions or postpositions (IAVs) were separated from verb-particle
                            constructions (VPCs) such as <hi rend="italic">to put off, to blow up, to
                                do in</hi>, in which the particle completely changes the meaning or
                            adds a partly predictable but non-spatial meaning to the verb. Unlike
                            VPCs, which are characteristic of Germanic languages and Hungarian, less
                            so of Romance languages, and absent in Slavic languages, IAVs can
                            exclusively be found in the Balto-Slavic language group.</note> The
                        guidelines define IAVs as verbs that only occur with a prepositional
                        morpheme (<hi rend="italic">simpatizirati z</hi> 'to sympathize with') or
                        verbs that change meaning when occurring with a prepositional morpheme (<hi
                            rend="italic">biti za</hi> 'to be for, to support' vs. <hi rend="italic"
                            >biti</hi> 'to be'). The participants required by the verb phrase as a
                        whole are not a part of the VMWE, as opposed to e.g. <hi rend="italic">stati
                            na + trdnih tleh</hi> 'to stand on + solid ground', which is categorized
                        as a VID.</p>
                    <p>Prepositions were treated as free verb morphemes as early as <ref
                            target="#Metelko.1825">Metelko's Grammar of Slovene (1825, 247–56)</ref>
                        and were analyzed in further detail by <ref target="#Breznik.1916">Breznik
                            (1916, 250</ref>; <ref target="#Breznik.1934">1934, 225</ref>). Verbs
                        with a lexicalised prepositional morpheme were also analyzed by <ref
                            target="#Žele.2002">Žele (2002)</ref> and <ref target="#Kržišnik.1994"
                            >Kržišnik (1994)</ref>, the former from the perspective of the degree of
                        lexicality of the preposition and the latter from the perspective of phrase
                        fixedness as either a phraseological unit with structural fixedness (<hi
                            rend="italic">biti ob čem</hi> 'to be next to sth' meaning 'to be
                        positioned next to sth') or phrasemes with lexical fixedness (<hi
                            rend="italic">biti ob kaj</hi> 'to lose sth').</p>
                    <p>In the training corpus, IAVs account for approximately 21% of all annotated
                        VMWEs (see Table 1). Among the 710 examples, 154 different IAVs were
                        identified. The following examples appear with a frequency of at least 20:
                            <hi rend="italic">iti za</hi> 'to be about' (always in the third person
                        singular – <hi rend="italic">gre za</hi>), <hi rend="italic">priti do</hi>
                        'to occur', <hi rend="italic">ukvarjati se z</hi> 'to work on sth', <hi
                            rend="italic">vplivati na</hi> 'to influence', <hi rend="italic">skrbeti
                            za</hi> 'to take care of', <hi rend="italic">temeljiti na</hi> 'to be
                        based on', <hi rend="italic">naleteti na</hi> 'to encounter', <hi
                            rend="italic">veljati za</hi> 'to be considered' and <hi rend="italic"
                            >biti proti</hi> 'to be against'. As per the guidelines, the IAV
                        category also includes verb phrases that consist of an inherently reflexive
                        verb (see 5.1) and a lexicalised prepositional morpheme (<hi rend="italic"
                            >nanašati se na</hi> 'to refer to sth').</p>
                    <p>The most frequent lexicalised prepositional morpheme is <hi rend="italic"
                            >za</hi> 'for', occurring with 34 different verbs (e.g. <hi
                            rend="italic">gre za</hi> 'to be about'), followed by <hi rend="italic"
                            >na</hi> 'on', occurring with 33 different verbs (e.g. <hi rend="italic"
                            >vplivati na</hi> 'to influence'). Other frequent prepositional morphemes
                        are <hi rend="italic">z/s</hi> 'with', <hi rend="italic">do</hi> 'to' and
                            <hi rend="italic">v</hi> 'in'.</p>
                    <p>The lexicalised prepositional morpheme is usually positioned after the verb,
                        which is true in 86% of the annotated examples. In the vast majority of
                        cases, the morpheme is positioned directly after the verb or within a narrow
                        window (up to 3 words). An exception is <hi rend="italic">gre za</hi>, where an
                        intervening element serves to reference preceding information (<hi
                            rend="italic">gre [v tem primeru] za</hi> 'it [in this case] is about').
                        In less frequent examples where the prepositional morpheme is positioned
                        before the verb, the distance between the verb and the morpheme is
                        significantly larger (in 20% of the cases, the distance is 3+ words).</p>
                    <p>Verbs with a lexicalised prepositional morpheme can also be identified based
                        on common semantic features, e.g. the expression of (a) function or quality:
                            <hi rend="italic">veljati za</hi> [<hi rend="italic">favorita</hi>] 'to
                        be considered [a favorite]',<note place="foot" xml:id="ftn12" n="8">With
                            IAVs, we also list typical collocates from the Gigafida Corpus of
                            Written Slovene to ease semantic disambiguation.</note>
                        <hi rend="italic">imenovati za [direktorja]</hi> 'to appoint [smn as director]',
                            <hi rend="italic">označiti za [laž]</hi> 'to call [sth] out as [a lie]';
                        (b) (dis)agreement: <hi rend="italic">biti za/proti [globalizacijo]</hi> 'to
                        be for/against [globalization]'; (c) basis: <hi rend="italic">temeljiti na
                            [dejstvu]</hi> 'to be based on [fact]', <hi rend="italic">graditi na
                            [zaupanju]</hi> 'to build on [trust]'; (d) beginning or change of
                        action/state: <hi rend="italic">pasti v [komo]</hi> 'to fall into [a coma]',
                            <hi rend="italic">prerasti v [ljubezen]</hi> 'to blossom into [love]';
                        (e) change of quality or form: <hi rend="italic">pretvoriti v [energijo]</hi>
                        'to convert into [energy]'; (f) survival: <hi rend="italic">iti skozi
                            [proces]</hi> 'to go through [a process]'; (g) active participation: <hi
                            rend="italic">ukvarjati se z</hi> 'to work on sth', <hi rend="italic"
                            >skrbeti za</hi> 'to take care of sth'.</p>
                    <p>IAVs are characterized by the fact that the presence of the prepositional
                        morpheme often changes the valency properties of the verb, e.g. (a) when
                        the originally intransitive verb becomes transitive, as in the example <hi
                            rend="italic">živeti</hi> 'to live' : <hi rend="italic">živeti od
                            koga</hi>/<hi rend="italic">česa</hi> 'to live off smn/sth'; (b) when
                        there is a change in the case of the prepositional complement, e.g. <hi
                            rend="italic">obrniti se na koga</hi> 'to turn to someone (fig.)' : <hi
                            rend="italic">obrniti se h komu</hi> 'to turn to someone (lit.)'. There
                        are also many movement verbs that, as IAVs, take on a non-spatial meaning
                        denoting a state (<hi rend="italic">priti skozi</hi> 'to go
                        through' in the sense of 'to survive'). With verbs featuring a wide semantic
                        range, the prepositional morpheme typically narrows down the meaning (<hi
                            rend="italic">biti</hi> 'to be' : <hi rend="italic">biti za</hi> 'to be
                        for, to support sth'). Some verbs within IAVs require an abstract object,
                        e.g. <hi rend="italic">pasti v [depresijo, vrtinec nizkotnosti]</hi> 'to
                        fall into [depression, a whirlpool of insidiousness]', <hi rend="italic"
                            >dišati po</hi> [<hi rend="italic">prevari</hi>] 'to smell of [deceit]',
                            <hi rend="italic">pokati od</hi> [<hi rend="italic">veselja</hi>] 'to be
                        bursting with [joy]'.</p>
                    <p>Identifying inherently adpositional verbs poses a challenge both for human
                        annotators and language technology tools as additional elements can
                        intervene between the lexicalised morpheme and the verb. In addition,
                        numerous verb-preposition combinations can denote a literal meaning while
                        not exhibiting any change in the case of the object complement (<hi
                            rend="italic">stati za</hi> [<hi rend="italic">vrati</hi>] 'to stand
                        behind [the door]' : <hi rend="italic">stati za</hi> [<hi rend="italic"
                            >dejanji</hi>] 'to stand by [one's actions]'). They can also be polysemous
                            (<hi rend="italic">priti do</hi> [<hi rend="italic">spremembe</hi>] 'to
                        occur [change]' : <hi rend="italic">priti do</hi> [<hi rend="italic"
                            >denarja</hi>] 'to get [money]'). The analysis offers a starting point
                        for the automatic identification of IAVs and provides possibilities for more
                        detailed research, especially in terms of valency, sentence patterns and the
                        semantic features of participants.</p>
                </div>
                <div>
                    <head>Verbal Idioms (VID)</head>
                    <p>The PARSEME Guidelines define verbal idioms (VIDs) as combinations of at least two
                        lexicalised elements in which the verb is the syntactic head that requires
                        at least one participant in the sentence pattern. The participants can take
                        different syntactic roles, e.g. a direct or prepositional object complement
                            (<hi rend="italic">plačati ceno</hi> 'to pay a price', <hi rend="italic"
                            >zravnati z zemljo</hi> 'to level with the earth'), a subject (<hi
                            rend="italic">zgodba se ponavlja</hi> 'lit. the story repeats itself'),
                        an adverbial (<hi rend="italic">spati kot ubit</hi> 'lit. to sleep like a
                        dead person') or a subordinate clause (<hi rend="italic">vedeti</hi>, <hi
                            rend="italic">koliko je ura</hi> 'lit. to know what time it is' in the
                        sense 'to know what is going on'). VIDs must also retain a meaning that is
                        independent of the meanings of their elements, even under certain syntactic
                        conversions. The Guidelines mention that the elements can appear in the
                        expected paradigms (declensions), in different tenses, in the active or
                        passive voice, with lexical variation, etc.</p>
                    <p>The definition provided by the PARSEME Guidelines differs from the one found
                        in Slovene linguistics in that it focuses on the verb as the head and the
                        lexicalised elements within the verb's sentence pattern. On the other hand,
                        Slovene linguistics focuses primarily on the ability of the verb phrase as a
                        whole to take the role of the predicate (<ref target="#Toporišič.1973-74"
                            >Toporišič 1973/74</ref>; <ref target="#Kržišnik.1994">Kržišnik
                            1994</ref>). From this point of view, it is problematic to treat phrases
                        that feature a verb as the fixed part, but as a whole do not always take the
                        role of the predicate. In some cases, they can take the role of an object
                        complement (<hi rend="italic">[ne spodobi se] voditi za nos </hi>'lit. [it
                        is not proper] to lead someone by the nose' in the sense 'fooling someone is
                        frowned upon'), a sentence (<hi rend="italic">srce se trga</hi> [<hi
                            rend="italic">komu</hi>] '[someone's] heart is breaking'), or an
                        adverbial (<hi rend="italic">hočeš nočeš</hi> 'like it or not').</p>
                    <p>In the training corpus, 724 units were categorized as VIDs, which represents
                        22% of all VMWEs (see Table 1). As can be expected, VIDs occurring more than
                        10 times feature the verbs <hi rend="italic">biti</hi> 'to be' and <hi
                            rend="italic">imeti</hi> 'to have'. Several other VIDs occur more than 5
                        times (<hi rend="italic">biti kos</hi> 'to be sth's match', <hi
                            rend="italic">priti prav</hi> 'to come in handy', <hi rend="italic"
                            >igrati vlogo</hi> 'to play a role', <hi rend="italic">pustiti pri
                            miru</hi> 'to leave sth be', <hi rend="italic">priskočiti na pomoč</hi> 'to
                        rush to smn's aid', and <hi rend="italic">imeti opravka s/z</hi> 'to busy
                        oneself with'), along with fixed discourse markers (cf. <ref target="#Dobrovoljc.2017">Dobrovoljc 2017</ref>):
                            <hi rend="italic">se pravi</hi> 'which is to say', <hi rend="italic">kdo
                            ve</hi> 'who knows'.</p>
                    <p>As mentioned above, the most frequent structures are combinations of the verb
                            <hi rend="italic">biti</hi> 'to be' and an adverb/adjective/noun. Taking
                        into account their structural fixedness and the semantic vagueness of the verb,
                        they should be treated as separate lexicon entries: <hi rend="italic">biti
                            všeč/res/mar/prida/prav/kos</hi> 'to be likeable/true/to care/to be of
                        benefit/to be right/to be smn's match'. This group also includes phrases
                        with the semantically broad verb <hi rend="italic">imeti</hi> 'to have': <hi
                            rend="italic">imeti prav/rad</hi> 'to be right/to love', <hi
                            rend="italic">ne imeti pojma/smisla</hi> 'to have no clue/to make no
                        sense'.</p>
                    <p>Another frequent structure in the training corpus is the combination of a
                        verb and a noun or noun phrase. Among the verbs, the most frequent are <hi
                            rend="italic">delati</hi> 'to make' (<hi rend="italic">delati
                            družbo/gužvo/izjeme/preglavice/razlike/sceno/škodo</hi> 'to do/make
                        company/a crowd/an exception/trouble/a difference/a scene/damage') and <hi
                            rend="italic">dati</hi> 'to give' (<hi rend="italic">dati
                            polet/pečat</hi> 'to give momentum/to leave a mark'). Such combinations
                        structurally coincide with LVCs, but cannot be converted in the same way as
                        LVCs to express possession (<hi rend="italic">Miha ima predavanje</hi> 'Miha
                        holds a lecture' → <hi rend="italic">Mihovo predavanje</hi> 'Miha's
                        lecture', but not <hi rend="italic">Miha dela gužvo</hi> 'Miha is crowding
                        the place' → *<hi rend="italic">Mihova gužva </hi>'Miha's crowd'). The
                        largest percentage in the training corpus is covered by VIDs consisting of a
                        verb and a prepositional phrase. Again, the most frequent verb is <hi
                            rend="italic">biti</hi> 'to be' (<hi rend="italic">biti na dosegu
                            roke</hi> 'to be within reach', <hi rend="italic">biti na
                            razpolago/voljo</hi> 'to be at one's disposal'), followed by e.g. <hi
                            rend="italic">priti</hi> 'to come' (<hi rend="italic">priti na dan</hi>
                        'to come to light', <hi rend="italic">priti na misel</hi> 'to come to
                        mind') and <hi rend="italic">dati</hi> 'to give' (<hi rend="italic">dati na
                            izbiro</hi> 'to give a choice'). In terms of fixedness, some
                        combinations of a verb and a nominal/prepositional phrase require negation
                            (<hi rend="italic">ne moči si kaj</hi> 'can't help but', <hi
                            rend="italic">ni ne duha ne sluha o (kom/čem)</hi> 'no trace of sth',
                            <hi rend="italic">ni para (komu)</hi> 'someone has no equal').</p>
                    <p>The training corpus also features other structures, but with lower
                        frequencies (<hi rend="italic">solze stopijo v oči (komu)</hi> 'someone's
                        eyes are watering'<hi rend="italic">, časi se spreminjajo</hi> 'times are
                        changing'). These also include proverbs (<hi rend="italic">bolje preprečiti
                            kot zdraviti</hi> 'lit. better to prevent than to cure') and comparisons
                            (<hi rend="italic">igrati se [s kom/čim] kot mačka z mišjo</hi> 'lit. to
                        play [with smn/sth] like a cat plays with a mouse'), as well as verb-adverb
                        combinations (<hi rend="italic">priti skupaj</hi> 'to come together'<hi
                            rend="italic">, daleč priti</hi> 'to come far') and combinations of a
                        verb and a pronominal morpheme (<hi rend="italic">zagosti jo (komu)</hi> 'to
                        create mischief for someone').</p>
                    <p>Within their sentence patterns, VIDs open up predictable syntactic slots
                        filled by participants with typical semantic roles. A quick overview of the
                        annotated examples shows that certain verb forms are fixed or more frequent
                        (e.g. third person or negated forms) and that lexical elements in a certain
                        slot are to some extent predictable: <hi rend="italic">(svet, življenje,
                            vse) postaviti na glavo</hi> 'to turn [the world/life/everything] upside
                        down'.</p>
                </div>
            </div>
            <div>
                <head>Discussion and Conclusion</head>
                <p>The annotation task has shown that the annotation set-up (including the tool
                    and the annotation scheme) is suitable. However, content-wise, the task is
                    relatively complex and requires a more advanced linguistic background. The
                    categories provided in the available guidelines can be assigned and are
                    formally distinguishable from each other; categorization problems occur
                    mostly when distinguishing collocations from VMWEs. The quantitative analysis
                    shows that all categories are robust and present in authentic texts.</p>
                <p>Based on the annotated VMWEs, we were able to identify certain pattern features
                    on the syntactic and semantic levels. These patterns represent a good starting
                    point for a set of rules for the automatic extraction of VMWEs and further
                    language description. Methodologically, we made a shift in focus from a
                    functional-syntactic perspective to the description of interconnected features
                    on the morphosyntactic, syntactic, semantic, and lexical levels.</p>
                <p>As expected, VMWEs are typically formed by verbs with a wide semantic range, e.g.
                        <hi rend="italic">biti</hi> 'to be', <hi rend="italic">dati</hi> 'to give',
                        <hi rend="italic">imeti</hi> 'to have'. This wide range makes the verbs lose
                    their lexical qualities within the VMWE, while they keep their morphological
                    features, syntactic function, and position in the sentence pattern. The degree
                    to which the meaning of the verb as an element of the MWE contributes to the
                    meaning of the whole is often difficult to determine, one of the reasons being
                    that numerous verb phrases structurally coincide with several categories, but
                    carry no idiomatic meaning. In the text, they are difficult to distinguish from
                    free phrases or collocations (frequent, semantically sensible and structurally
                    adequate word co-occurrences).</p>
                <p>On the other hand, the initial structural and semantic analysis has shown that
                    (a) individual types of VMWEs form recognizable structural patterns, e.g. verb +
                    nominal/prepositional phrase; (b) the lexicalization of elements influences the
                    change in the participants' position and their semantic roles (<hi rend="italic"
                        >vreči se po kom</hi> 'to take after smn' – <hi rend="italic">vreči se v
                        kaj</hi> 'to begin working enthusiastically' – <hi rend="italic">vreči koga
                        ven</hi> 'to throw smn out'); (c) the sequence of verb elements in a
                    VMWE is usually not fixed, but (d) there are certain tendencies in word order
                    and in the number and type of intervening elements. Furthermore, (e)
                    certain lexical elements can be predicted based on the frequency and the
                    elements of the co-text; (f) for better automatic identification of VMWEs, their
                    formalized description should include information on all levels of language
                    description.</p>
                <p>The list of VMWEs obtained from the annotated corpus represents a set of lexicon
                    units that can be used in machine learning for the automatic identification of
                    VMWEs in text.</p>
                <p>While our research did not include a systematic analysis of the sentence
                    patterns, it should be mentioned that the training corpus includes the syntactic
                    (formalized syntactic dependencies) and semantic (semantic role labelling) data
                    that can be used to analyze them. This would allow us to identify more general
                    sentence patterns for a certain VMWE type and use them in automatic
                    extraction.</p>
                <p>To correctly identify different MWEs, we will also create a typology of
                    non-verbal MWEs, e.g. nominal (<hi rend="italic">žlahtna kapljica</hi> 'fine
                    wine'), adjectival (<hi rend="italic">vreden greha</hi> 'worthy of sin'), or
                    adverbial phrases (<hi rend="italic">zdaj ali nikoli</hi> 'now or never'), as
                    well as phrases containing particles, conjunctions and pronouns (<hi
                        rend="italic">ja pa ja</hi> 'as if', <hi rend="italic">s tem da</hi> 'taking
                    into account that'), which were identified as frequent n-grams (<ref
                        target="#Dobrovoljc.2017">Dobrovoljc 2017</ref>). Another challenge to
                    tackle is the relation between the canonical and converted forms of MWEs, e.g.
                        <hi rend="italic">začarani krog</hi> 'vicious circle' – <hi rend="italic"
                        >biti ujet v začarani krog/v začaranem krogu</hi> 'to be caught in a vicious
                    circle' – <hi rend="italic">izviti se/rešiti se iz začaranega kroga</hi> 'to
                    escape from a vicious circle' – <hi rend="italic">vrteti se/znajti se v
                        začaranem krogu</hi> 'to spin/end up in a vicious circle' – <hi
                        rend="italic">izstopiti iz začaranega kroga</hi> 'to step out of a vicious
                    circle', etc.
                    Furthermore, it is difficult to identify MWEs with an independent, but
                    non-metaphorical meaning, e.g. fixed expressions of the type <hi rend="italic"
                        >tehnološki park</hi> 'technology park' and <hi rend="italic">ustavno
                        sodišče</hi> 'constitutional court', which are closer to terminology and named
                    entities.</p>
            </div>
            <div>
                <head>Acknowledgments</head>
                <p>The authors acknowledge the financial support from the Slovenian Research Agency:
                    (a) research core funding No. P6-0215, <hi rend="italic">Slovene Language –
                        Basic, Contrastive, and Applied Studies</hi>; (b) research core funding No.
                    P6-0411, <hi rend="italic">Language Resources and Technologies for Slovene</hi>; and (c) project
                    funding No. J6-8256, <hi rend="italic">New grammar of modern standard Slovene:
                        resources and methods</hi>. The research was conducted within the framework
                    of the IC1207 PARSEME COST Action<note place="foot" xml:id="ftn13" n="9">
                        <hi rend="italic">Home – PARSEME</hi>, <ref target="http://www.parseme.eu"
                            >http://www.parseme.eu</ref>.</note> and the IS1305 ENeL COST
                        Action.<note place="foot" xml:id="ftn14" n="10">
                        <hi rend="italic">Action IS1305 – COST</hi>, <ref
                            target="http://www.elexicography.eu">www.elexicography.eu</ref>.</note></p>
            </div>

        </body>
        <back>
            <div type="bibliography">
                <head>Sources and Literature</head>
                <listBibl>
                    <bibl xml:id="ArharHoldt.2007">Arhar Holdt, Špela, and Vojko Gorjanc. 2007.
                        “Korpus FidaPLUS: nova generacija slovenskega referenčnega korpusa.” <hi
                            rend="italic">Jezik in slovstvo</hi> 52, No. 2 (January): 95–110.</bibl>
                    <bibl xml:id="Atkins.2008">Atkins, Sue B. T., and Michael Rundell. 2008. <hi
                            rend="italic">The Oxford Guide to Practical Lexicography.</hi> New York:
                        Oxford University Press. </bibl>
                    <bibl xml:id="Baldwin.2010">Baldwin, Timothy, and Su Nam Kim. 2010. “Multiword
                        Expressions.” In <hi rend="italic">Handbook of Natural Language
                            Processing</hi>, edited by Nitin Indurkhya and Fred J. Damerau, Second
                        Edition, 267–92. Boca Raton: CRC Press.</bibl>
                    <bibl xml:id="Breznik.1916">Breznik, Anton. 1916. <hi rend="italic">Slovenska
                            slovnica za srednje šole</hi>. Celovec: Družba sv. Mohorja.</bibl>
                    <bibl xml:id="Breznik.1934">Breznik, Anton. 1934. <hi rend="italic">Slovenska
                            slovnica za srednje šole</hi>. 4th, enlarged edition. Celje: Družba sv.
                        Mohorja.</bibl>
                    <bibl xml:id="Candito.2016">Candito, Marie, Fabienne Cap, Silvio Cordeiro,
                        Vassiliki Foufi, Polona Gantar, Voula Giouli, Carlos Herrero, Mihaela
                        Ionescu, Verginica Mititelu, Johanna Monti, Joakim Nivre, Mihaela Onofrei,
                        Carla Parra Escartín, Manfred Sailer, Carlos Ramisch, Monica-Mihaela Rizea,
                        Agata Savary, Ivelina Stoyanova, Sara Stymne, and Veronika Vincze. 2016. <hi
                            rend="italic">PARSEME shared task 1.0 annotation guidelines - version
                            1.6b</hi> – last updated on November 26, 2016. <ref
                            target="http://parsemefr.lif.uiv-mrs.fr/parseme-st-guidelines/1.0/"
                            >http://parsemefr.lif.uiv-mrs.fr/parseme-st-guidelines/1.0/</ref>.</bibl>
                    <bibl xml:id="Dobrovoljc.2017">Dobrovoljc, Kaja. 2017. “Multi-word discourse
                        markers and their corpus-driven identification: the case of MWDM extraction
                        from the reference corpus of spoken Slovene.” <hi rend="italic"
                            >International Journal of Corpus Linguistics</hi> 22, No. 4 (December):
                        551–82. </bibl>
                    <bibl xml:id="Dobrovoljc.2012">Dobrovoljc, Kaja, Simon Krek, and Jan Rupnik.
                        2012. “Skladenjski razčlenjevalnik za slovenščino.” In <hi rend="italic"
                            >Zbornik Osme konference Jezikovne tehnologije</hi>, edited by Tomaž
                        Erjavec and Jerneja Žganec Gros, 42–47. Ljubljana: Jožef Stefan Institute. </bibl>
                    <bibl xml:id="Gantar.Colman.2018">Gantar, Polona, Lut Colman, Carla Parra
                        Escartín and Héctor Martínez Alonso. 2018. “Multiword Expressions: Between
                        Lexicography and NLP.” <hi rend="italic">International Journal of
                            Lexicography</hi>: 1–25. </bibl>
                    <bibl xml:id="Gantar.ArharHoldt.2018">Gantar, Polona, Špela Arhar Holdt, Jaka
                        Čibej, Taja Kuzman, and Teja Kavčič. 2018. “Glagolske večbesedne enote v
                        učnem korpusu ssj500k 2.1.” In <hi rend="italic">Proceedings of the
                            Conference on Language Technologies &amp; Digital Humanities</hi>,
                        edited by Darja Fišer and Andrej Pančur, 85–92. Ljubljana: Znanstvena
                        založba Filozofske fakultete.</bibl>
                    <bibl xml:id="Gantar.2017">Gantar, Polona, Simon Krek, and Taja Kuzman. 2017.
                        “Verbal multiword expressions in Slovene.” In <hi rend="italic">Europhras 2017,
                            Computational and Corpus-Based Phraseology: proceedings</hi>, edited by
                        Ruslan Mitkov, 247–59. Cham: Springer. </bibl>
                    <bibl xml:id="GodecSoršak.2013">Godec Soršak, Lara. 2013. “Glagoli z oslabljenim
                        pomenom v Slovarju slovenskega knjižnega jezika.” <hi rend="italic"
                            >Slavistična revija</hi> 61, No. 3 (March): 507–22.</bibl>
                    <bibl xml:id="Gorjanc.2017">Gorjanc, Vojko, Polona Gantar, Iztok Kosem, and
                        Simon Krek, eds. 2017. <hi rend="italic">Dictionary of modern Slovene:
                            problems and solutions</hi>. Ljubljana: Ljubljana University Press,
                        Faculty of Arts. <ref
                            target="https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/book/15"
                            >https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/book/15</ref>.</bibl>
                    <bibl xml:id="Grčar.2012">Grčar, Miha, Simon Krek, and Kaja Dobrovoljc. 2012.
                        “Obeliks: statistični oblikoskladenjski označevalnik in lematizator za
                        slovenski jezik.” In <hi rend="italic">Zbornik Osme konference Jezikovne
                            tehnologije</hi>, edited by Tomaž Erjavec and Jerneja Žganec Gros.
                        Ljubljana: Jožef Stefan Institute.</bibl>
                    <bibl xml:id="Krek.2017">Krek, Simon, Kaja Dobrovoljc, Tomaž Erjavec, Sara Može,
                        Nina Ledinek, Nanika Holz, Katja Zupan, Polona Gantar, and Taja Kuzman.
                        2017. “Training corpus ssj500k 2.0.” <hi rend="italic">Slovenian language
                            resource repository CLARIN.SI.</hi>
                        <ref target="http://hdl.handle.net/11356/1165"
                            >http://hdl.handle.net/11356/1165</ref>.</bibl>
                    <bibl xml:id="Kržišnik.1994">Kržišnik, Erika. 1994. “Slovenski glagolski frazemi
                        (ob primeru glagolov govorjenja).” PhD diss., Faculty of Arts, University of
                        Ljubljana.</bibl>
                    <bibl xml:id="Metelko.1825">Metelko, Franc Serafin. 1825. <hi rend="italic"
                            >Lehrgebäude der slowenischen Sprache im Königreiche Illyrien und in den
                            benachbarten Provinzen</hi>. Laibach: Leopold Eger.</bibl>
                    <bibl xml:id="Ramisch.2018">Ramisch, Carlos, Silvio Ricardo Cordeiro, Agata
                        Savary, Veronika Vincze, Verginica Barbu Mititelu, Archna Bhatia, Maja
                        Buljan, Marie Candito, Polona Gantar et al. 2018. “Edition 1.1 of the
                        PARSEME shared task on automatic identification of verbal multiword
                        expressions.” In <hi rend="italic">Proceedings: LAW-MWE-CxG 2018, The 12th
                            Linguistic Annotation Workshop (LAW XII) and the 14th Workshop on
                            Multiword Expressions (MWE 2018)</hi>, edited by Agata Savary, Carlos
                        Ramisch, Jena D. Hwang, Nathan Schneider, Melanie Andresen, Sameer Pradhan,
                        and Miriam R. L. Petruck, 222–40. Santa Fe: Association for Computational
                        Linguistics. <ref target="http://aclweb.org/anthology/W18-49"
                            >http://aclweb.org/anthology/W18-49</ref>.</bibl>
                    <bibl xml:id="Sag.2002">Sag, Ivan, Timothy Baldwin, Francis Bond, Ann Copestake,
                        and Dan Flickinger. 2002. “Multiword Expressions: a Pain in the Neck for
                        NLP.” In <hi rend="italic">Proceedings of the 3rd International Conference
                            on Intelligent Text Processing and Computational Linguistics</hi>
                        (CICLing 2002), edited by Alexander Gelbukh, 1–15. Berlin, Heidelberg, New
                        York: Springer. </bibl>
                    <bibl xml:id="Schneider.2014">Schneider, Nathan, Spencer Onuffer, Nora Kazour,
                        Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith.
                        2014. “Comprehensive annotation of multiword expressions in a social web
                        corpus.” In <ref
                            target="https://aclweb.org/anthology/volumes/proceedings-of-the-ninth-international-conference-on-language-resources-and-evaluation-lrec-2014/"
                                ><hi rend="italic">Proceedings of the Ninth International Conference
                                on Language Resources and Evaluation (LREC-2014)</hi></ref>,
                        edited by Nicoletta Calzolari, Khalid Choukri,
                        Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion
                        Moreno, Jan Odijk, and Stelios Piperidis, 455–61. European Language
                        Resources Association (ELRA).</bibl>
                    <bibl xml:id="SSSK."><hi rend="italic">Slovar slovenskega knjižnega jezika</hi>.
                        2nd edition. Ljubljana: SAZU and Fran Ramovš Institute of the Slovenian
                        Language ZRC SAZU. <ref target="http://www.fran.si"
                        >www.fran.si</ref>.</bibl>
                    <bibl xml:id="Toporišič.1973-74">Toporišič, Jože. 1973/74. “K izrazju in
                        tipologiji slovenske frazeologije.” <hi rend="italic">Jezik in slovstvo</hi>
                        19, No. 8 (Spring): 273–79.</bibl>
                    <bibl xml:id="Toporišič.1982">Toporišič, Jože. 1982. <hi rend="italic">Nova
                            slovenska skladnja</hi>. Ljubljana: Državna Založba Slovenije.</bibl>
                    <bibl xml:id="Toporišič.2000">Toporišič, Jože. 2000. <hi rend="italic">Slovenska
                            slovnica</hi>. Maribor: Založba Obzorja.</bibl>
                    <bibl xml:id="Vidovič-Muha.1998">Vidovič-Muha, Ada. 1998. “Pomenski preplet
                        glagolov imeti in biti – njuna jezikovnosistemska stilistika.” <hi
                            rend="italic">Slavistična revija</hi> 46, No. 4: 293–323.</bibl>
                    <bibl xml:id="Žele.1999">Žele, Andreja. 1999. “Vezljivost v slovenskem knjižnem
                        jeziku (s poudarkom na glagolu).” PhD diss., Faculty of Arts, University of
                        Ljubljana.</bibl>
                    <bibl xml:id="Žele.2002">Žele, Andreja. 2002. “Prostomorfemski glagoli kot
                        slovarska gesla.” <hi rend="italic">Jezikoslovni zapiski</hi> 8, No. 1:
                        95–108.</bibl>
                    <bibl xml:id="Žele.2012">Žele, Andreja. 2012. <hi rend="italic"
                            >Pomensko-skladenjske lastnosti slovenskega glagola</hi>. Linguistica et
                        philologica 27. Ljubljana: Založba ZRC, ZRC SAZU.</bibl>
                </listBibl>
            </div>
            <div type="summary">
                <docAuthor>Polona Gantar, Špela Arhar Holdt, Jaka Čibej, Taja Kuzman</docAuthor>
                <head style="text-transform: uppercase;">Structural and Semantic Classification of Verbal Multi-Word
                        Expressions in Slovene</head>
                <head rend="subheader" style="text-transform: uppercase;">Summary</head>
                <p>In this paper, we present an analysis of Slovene verbal multi-word expressions
                    (VMWEs) based on the categorization made within PARSEME COST Action Shared Task
                    1.1 for 20 different languages. The purpose of the task was to identify VMWEs in
                    running text based on syntactic and semantic guidelines, as well as to compile a
                    manually annotated multi-language corpus to be made available under a Creative
                    Commons licence. The results of the analysis will be useful in the compilation
                    of a digital lexicon of Slovene multi-word units and will help establish a
                    theoretical framework that takes into account the specific characteristics of
                    Slovene while still fulfilling international criteria.</p>
                <p>Unlike the functional-syntactic criteria advocated thus far in Slovene studies
                        (<ref target="#Toporišič.1973-74">Toporišič 1973/74</ref>; <ref
                        target="#Kržišnik.1994">Kržišnik 1994</ref>), the classification of VMWEs
                    within the PARSEME Shared Task 1.1 focuses on the identification of the
                    syntactic head of the MWE. This allows MWEs to be divided into verbal,
                    adjectival, nominal, and other types regardless of the function they have in
                    the sentence as a semantic and syntactic whole. The PARSEME classification consists
                    of both universal and language-specific categories. Universal categories include
                    verbal idioms (VID; <hi rend="italic">plačati ceno</hi> 'to pay the price') and
                    light verb constructions, which are further divided into full (LVC.full; <hi
                        rend="italic">imeti mnenje</hi> 'to have an opinion') and causative (LVC.cause;
                        <hi rend="italic">spraviti v smeh</hi> 'to make someone laugh').
                    Language-specific categories encompass inherently reflexive verbs (IRV; <hi
                        rend="italic">zdeti se</hi> 'to seem'), which are typical of most Slavic
                    languages; phrasal verbs (VPC), typical of Germanic languages; and inherently
                    adpositional verbs (IAV), also typical of most Slavic languages, including
                    Slovene. A total of 13,511 sentences in the Slovene training corpus ssj500k 2.0
                        (<ref target="#Krek.2017">Krek et al. 2017</ref>) were annotated with 3,364
                    VMWEs: 1,627 IRV (48%), 724 VID (22%), 710 IAV (21%), 239 LVC.full (7%), and 64
                    LVC.cause (2%).</p>
                <p>A linguistic analysis of the individual categories highlights numerous semantic
                    and syntactic characteristics of the identified VMWEs that can be taken into
                    account in the compilation of an MWE lexicon and the automatic identification of
                    MWEs in text. Among other things, the results show the importance of the
                    criteria used to distinguish between different types of reflexive verbs based on
                    the role of the reflexive pronoun: such verbs can be viewed either as independent
                    lexical units with their own meaning (e.g. <hi rend="italic">delati se</hi> 'to
                    pretend') or as verbal phrases denoting e.g. mutual (<hi rend="italic"
                        >poljubljati se</hi> 'to kiss each other'), reflexive (<hi rend="italic"
                        >umivati se</hi> 'to wash oneself'), or passive actions (<hi rend="italic"
                        >ponavljati se</hi> 'to be repeated'). The analysis has also shown that
                    although the order of the components of a VMWE is usually not fixed, certain
                    tendencies exist in terms of word order and the number of intervening elements.
                    A semantic analysis of VMWEs has also revealed the presence of semantic groups
                    formed by VMWEs within an individual category, as well as the properties of
                    light verbs and verbs that typically form idiomatic units.</p>
                <p>The study provides a good basis for further analyses of Slovene MWEs. In
                    particular, the VMWE annotations in the training corpus can be analyzed together
                    with their formalized syntactic dependency trees and the semantic roles assigned
                    to the participants in the sentence.</p>
            </div>
            <div type="summary" xml:lang="sl">
                <docAuthor>Polona Gantar, Špela Arhar Holdt, Jaka Čibej, Taja Kuzman</docAuthor>
                <head>STRUKTURNA IN POMENSKA KLASIFIKACIJA GLAGOLSKIH VEČBESEDNIH ENOT V
                    SLOVENŠČINI</head>
                <head rend="subheader" style="text-transform: uppercase;">Povzetek</head>
                <p>V prispevku predstavljamo analizo glagolskih večbesednih enot (GVBE) v
                    slovenščini na podlagi kategorizacije, kot je bila izdelana v okviru PARSEME
                    COST Action Shared Task 1.1 za 20 različnih jezikov. Namen naloge je bil
                    identificirati GVBE v tekočem besedilu na podlagi skladenjskih in pomenskih
                    smernic ter izdelava ročno označenega večjezičnega korpusa, ki bo na voljo pod
                    licenco Creative Commons. Rezultati analize bodo uporabljeni pri izdelavi
                    digitalnega leksikona večbesednih enot za slovenščino kot tudi za utemeljitev
                    teoretičnih izhodišč, ki upoštevajo specifike slovenščine in so hkrati usklajena
                    z mednarodnimi merili.</p>
                <p>Klasifikacija GVBE znotraj PARSEME Shared Task 1.1 za razliko od
                    funkcijskoskladenjskih meril, ki jih predvideva slovenistično jezikoslovje (<ref
                        target="#Toporišič.1973-74">Toporišič 1973/74</ref>; <ref
                        target="#Kržišnik.1994">Kržišnik 1994</ref>), postavlja v izhodišče
                    prepoznavanje skladenjskega jedra VBE, kar omogoča njihovo delitev na glagolske,
                    pridevniške, samostalniške ipd. VBE, neodvisno od funkcije, ki jo v stavku
                    opravljajo kot pomenska in skladenjska celota. V izhodišču predvideva
                    Parsemovska klasifikacija univerzalne in jezikovnospecifične kategorije. Znotraj
                    prvih loči glagolske idiome (VID; <hi rend="italic">plačati ceno</hi>) in zveze
                    z glagoli v pomensko oslabljeni rabi, ki so členjeni na prave (LVC.full; <hi
                        rend="italic">imeti mnenje</hi>) in vzročne (LVC.cause; <hi rend="italic"
                        >spraviti v smeh</hi>). Znotraj druge skupine pa inherentno povratne glagole
                    (IRV; <hi rend="italic">zdeti se</hi>), ki so tipični za večino slovanskih
                    jezikov, frazne glagole (VPC), značilne za germanske jezike, in glagole z
                    leksikaliziranim predložnim morfemom (IAV), ki so tipični za slovenščino in
                    večino slovanskih jezikov. V učnem korpusu ssj500k 2.0 (<ref target="#Krek.2017"
                        >Krek et al. 2017</ref>) smo označili 13.511 stavkov, v katerih smo
                    identificirali skupno 3.364 GVBE v naslednjih deležih: 1.627 IRV (48 %), 724 VID
                    (22 %), 710 IAV (21 %), 239 LVC.full (7 %) in 64 LVC.cause (2 %).</p>
                <p>Jezikoslovna analiza posameznih kategorij je pokazala številne semantične in
                    skladenjske značilnosti identificiranih GVBE, ki jih bo mogoče upoštevati pri
                    izdelavi leksikona VBE ter pri njihovi avtomatski identifikaciji v besedilu. Med
                    drugim je izpostavila merila za ločevanje različnih tipov povratnih glagolov na
                    podlagi vloge povratnega zaimka, kar omogoča njihovo obravnavanje bodisi kot
                    samostojnih leksikalnih enot z lastnim pomenom (npr. <hi rend="italic">delati
                        se</hi>) bodisi kot glagolskih zvez v različnih upovedovalnih vlogah, kot so
                    npr. vzajemnost (<hi rend="italic">poljubljati se</hi>), povratnost (<hi
                        rend="italic">umivati se</hi>), pasivizacija (<hi rend="italic">ponavljati
                        se</hi>) ipd. Analize so tudi pokazale, da zaporedje elementov v GVBE
                    navadno ni ustaljeno, obstajajo pa določene tendence glede besednega reda in
                    števila vrivajočih se elementov. Analiza GVBE s semantičnega vidika je pokazala
                    navzočnost določenih semantičnih skupin, ki jih tvorijo GVBE v posamezni
                    kategoriji, kot tudi lastnosti glagolov v pomensko oslabljeni rabi ter glagolov,
                    ki tipično tvorijo idiomatične enote.</p>
                <p>Raziskava postavlja dobre osnove za nadaljnje analize VBE v slovenščini, zlasti
                    ob upoštevanju skladenjskih oznak v obliki formaliziranih skladenjskih drevesnic
                    v učnem korpusu, in semantičnih vlog, pripisanih udeležencem v stavčnem
                    vzorcu.</p>
            </div>
        </back>
    </text>
</TEI>
