<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>You, Thou and Thee: A Statistical Analysis of Shakespeare’s Use of Pronominal
                    Address Terms</title>
                <author>
                    <name>
                        <forename>Isolde</forename>
                        <surname>van Dorst</surname>
                        <affiliation>Lancaster University</affiliation>
                        <email>i.vandorst@lancaster.ac.uk</email>
                    </name>
                </author>
            </titleStmt>
            <editionStmt>
                <edition><date>2019-04-15</date></edition>
            </editionStmt>
            <publicationStmt>
                <publisher>
                    <orgName xml:lang="sl">Inštitut za novejšo zgodovino</orgName>
                    <orgName xml:lang="en">Institute of Contemporary History</orgName>
                    <address>
                        <addrLine>Kongresni trg 1</addrLine>
                        <addrLine>SI-1000 Ljubljana</addrLine>
                    </address>
                </publisher>
                <pubPlace>http://ojs.inz.si/pnz/article/view/326</pubPlace>
                <date>2019</date>
                <availability status="free">
                    <licence>http://creativecommons.org/licenses/by-nc-nd/4.0/</licence>
                </availability>
            </publicationStmt>
            <seriesStmt>
                <title xml:lang="sl">Prispevki za novejšo zgodovino</title>
                <title xml:lang="en">Contributions to Contemporary History</title>
                <biblScope unit="volume">59</biblScope>
                <biblScope unit="issue">1</biblScope>
                <idno type="ISSN">2463-7807</idno>
            </seriesStmt>
            <sourceDesc>
                <p>No source, born digital.</p>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <projectDesc xml:lang="en">
                <p>Contributions to Contemporary History is one of the central Slovenian scientific
                    historiographic journals, dedicated to publishing articles from the field of
                    contemporary history (the 19th and 20th century).</p>
                <p>The journal is published three times per year in Slovenian and in the following
                    foreign languages: English, German, Serbian, Croatian, Bosnian, Italian, Slovak
                    and Czech. The articles are all published with abstracts in English and
                    Slovenian as well as summaries in English.</p>
            </projectDesc>
            <projectDesc xml:lang="sl">
                <p>Prispevki za novejšo zgodovino je ena osrednjih slovenskih znanstvenih
                    zgodovinopisnih revij, ki objavlja teme s področja novejše zgodovine (19. in 20.
                    stoletje).</p>
                <p>Revija izide trikrat letno v slovenskem jeziku in v naslednjih tujih jezikih:
                    angleščina, nemščina, srbščina, hrvaščina, bosanščina, italijanščina, slovaščina
                    in češčina. Članki izhajajo z izvlečki v angleščini in slovenščini ter povzetki
                    v angleščini.</p>
            </projectDesc>
        </encodingDesc>
        <profileDesc>
            <langUsage>
                <language ident="sl"/>
                <language ident="en"/>
            </langUsage>
            <textClass>
                <keywords xml:lang="en">
                    <term>pronominal address terms</term>
                    <term>Shakespeare</term>
                    <term>corpus linguistics</term>
                    <term>digital humanities</term>
                    <term>statistical modelling</term>
                </keywords>
                <keywords xml:lang="sl">
                    <term>izrazi zaimkovnega naslavljanja</term>
                    <term>Shakespeare</term>
                    <term>korpusno jezikoslovje</term>
                    <term>digitalna humanistika</term>
                    <term>statistično modeliranje</term>
                </keywords>
            </textClass>
        </profileDesc>
        <revisionDesc>
            <listChange>
                <change>
                    <date>2019-06-07</date>
                    <name>Mihael Ojsteršek</name>
                    <desc>Pretvorba iz DOCX v TEI, dodatno kodiranje</desc>
                </change>
            </listChange>
        </revisionDesc>
    </teiHeader>
    <text>
        <front>
            <docAuthor>Isolde van Dorst<note place="foot" xml:id="ftn1" n="*">
                    <hi rend="bold">Lancaster
                        University, </hi><ref target="mailto:i.vandorst@lancaster.ac.uk"><hi rend="bold"
                            >i.vandorst@lancaster.ac.uk</hi></ref></note>
            </docAuthor>
            <docImprint>
                <idno type="cobissType">Cobiss type: 1.01</idno>
                <idno type="UDC">UDC: 004.934:821.111SHAK(083.41)</idno>
            </docImprint>
            <div type="abstract" xml:lang="sl">
                <head>IZVLEČEK</head>
                <head style="text-transform: uppercase;">You, thou in thee: statistična analiza uporabe izrazov
                        zaimkovnega naslavljanja pri Shakespearu</head>
                <p>
                    <hi rend="italic">Študija se ukvarja z oblikovanjem napovednega modela,
                        namenjenega ugotavljanju, katere jezikovne in nejezikovne značilnosti
                        vplivajo na izbiro zaimkov v Shakespearovih igrah. V angleščini, ki se je
                        uporabljala v Shakespearovem obdobju, je razlikovanje med YOU in THOU, ki je
                        danes arhaično, še obstajalo. Običajno se navaja, da sta ga določala
                        relativni družbeni status ter osebna bližina govorca in naslovljenca. Vendar
                        pa je treba še ugotoviti, ali bo statistično strojno učenje potrdilo to
                        tradicionalno razlago. Proučuje se 23 značilnosti, izbranih z različnih
                        jezikoslovnih področij, kot so pragmatika, sociolingvistika in analiza
                        pogovora. Trije uporabljeni algoritmi – naivni Bayesov klasifikator,
                        odločitveno drevo in metoda podpornih vektorjev – so izbrani kot
                        ilustrativni nabor možnih modelov zaradi njihovih kontrastnih predpostavk in
                        učne pristranskosti. Opravita se dve napovedi, prva o binarnem (</hi><hi
                        rend="italic smallcaps">you</hi><hi rend="italic">/</hi><hi
                        rend="italic smallcaps">thou</hi><hi rend="italic">) razlikovanju in druga o
                        trinarnem (you/thou/thee) razlikovanju. Od vseh treh algoritmov daje
                        najboljše rezultate metoda podpornih vektorjev. Po ugotovitvah so
                        značilnosti, ki najbolje napovejo izbiro zaimka, besede iz neposrednega
                        jezikovnega konteksta. Izkazalo se je, da na napoved zaimka vpliva tudi več
                        drugih značilnosti, vključno z imenom govorca in naslovljenca, razliko v
                        statusu ter pozitivnim ali negativnim mnenjem.</hi></p>
                <p><hi rend="italic"> Ključne besede: izrazi zaimkovnega naslavljanja, Shakespeare,
                        korpusno jezikoslovje, digitalna humanistika, statistično
                    modeliranje</hi></p>
            </div>
            <div type="abstract">
                <head>ABSTRACT</head>
                <p>
                    <hi rend="italic">This study creates a prediction model to identify which
                        linguistic and extra-linguistic features influence pronoun choices in the
                        plays of Shakespeare. In the English of Shakespeare’s time, the now-archaic
                        distinction between </hi><hi rend="italic smallcaps">you</hi><hi
                        rend="italic"> and </hi><hi rend="italic smallcaps">thou</hi><hi
                        rend="italic"> persisted, and is usually reported as being determined by
                        relative social status and personal closeness of speaker and addressee.
                        However, it remains to be determined whether statistical machine learning
                        will support this traditional explanation. Twenty-three features are investigated,
                        having been selected from multiple linguistic areas, such as pragmatics,
                        sociolinguistics and conversation analysis. The three algorithms used, Naive
                        Bayes, decision tree and support vector machine, are selected as
                        illustrative of a range of possible models in light of their contrasting
                        assumptions and learning biases. Two predictions are performed, firstly on a
                        binary (</hi><hi rend="italic smallcaps">you</hi><hi rend="italic">/</hi><hi
                        rend="italic smallcaps">thou</hi><hi rend="italic">) distinction and then on
                        a trinary (you/thou/thee) distinction. Of the three algorithms, the support
                        vector machine models score best. The features identified as the best
                        predictors of pronoun choice are the words in the direct linguistic context.
                        Several other features are also shown to influence the pronoun prediction,
                        including the names of the speaker and addressee, the status differential,
                        and positive and negative sentiment.</hi></p>
                <p>
                    <hi rend="italic">Keywords: pronominal address terms, Shakespeare, corpus
                        linguistics, digital humanities, statistical modelling</hi></p>
            </div>
        </front>
        <body>
            <div>
                <head>Introduction</head>
                <p>For several decades much research has been undertaken on the use of <hi
                        rend="italic">you</hi>, <hi rend="italic">thou</hi> and <hi rend="italic"
                        >thee</hi> in Shakespeare’s works. However, the results so far have yet to
                    arrive at an exact and conclusive answer regarding how these pronouns were
                    used.</p>
                <p>This study combines the strengths of multiple research fields in an effort to
                    determine via hitherto unused methods which linguistic and extra-linguistic
                    features influence the choice of second person singular pronoun (<hi
                        rend="italic">you</hi> versus <hi rend="italic">thou</hi> or <hi
                        rend="italic">thee</hi>) in the plays of William Shakespeare. Prior findings
                    in literary and linguistic studies are utilised to find which features could be
                    relevant in this choice, and tools and applications created for corpus
                    linguistics and computer science are exploited to analyse the data in a more
                    exact way than has so far been accomplished. Through these techniques, I hope to
                    identify which features can contribute to a more accurate prediction of pronoun
                    choice, in a model to mimic the pronoun use of Shakespeare.</p>
                <p>It is worth observing at this point that it has not yet been determined whether
                    it is even possible to predict the pronoun based on linguistic features. Part of
                    the aim of this paper is to make a determination on this point. In other words,
                    is it possible to create a computational model that can predict which pronoun
                    will be used based on a set of linguistic and extra-linguistic features taken
                    from the text itself and selected on the basis of knowledge that we have of
                    English in the late 1500s and early 1600s? To accomplish this, all occurrences
                    of <hi rend="italic">you</hi>, <hi rend="italic">thou</hi> and <hi rend="italic"
                        >thee</hi> are extracted from Shakespeare’s plays, and every instance is
                    manually coded for 23 linguistic and extra-linguistic features, creating data
                    which will serve to ascertain the answer to this primary question. A second
                    question to be addressed is whether some features perform better as predictors
                    of the pronoun choice than others. Thirdly, the issue of whether the use of
                    different algorithms affects the prediction outcomes will be considered.</p>
                <p>Throughout this paper, italicised <hi rend="italic">you</hi>, <hi rend="italic"
                        >thou</hi> and <hi rend="italic">thee</hi> refer to specific pronoun forms.
                    However, whereas <hi rend="italic">you</hi> – in Early Modern English as in
                    contemporary English – does not exhibit any formal variation for pronoun case,
                        <hi rend="italic">thou</hi> is strictly a nominative form with <hi
                        rend="italic">thee</hi> as its accusative/dative form. <hi rend="italic"
                        >Thou</hi> and <hi rend="italic">thee</hi> are therefore related
                    inflectional forms of a single pronoun lemma; <hi rend="italic">you</hi> exists
                    in variation with both. Small capitals are used to indicate the pronoun lemmas,
                    thus: <hi rend="smallcaps">you</hi> and <hi rend="smallcaps">thou</hi>, where
                        <hi rend="smallcaps">thou</hi> includes both <hi rend="italic">thou</hi> and
                        <hi rend="italic">thee</hi>. Whenever discussing pronouns in this paper, I
                    am strictly referring to the singular second-person pronouns <hi rend="italic"
                        >you</hi>, <hi rend="italic">thou</hi> and <hi rend="italic">thee</hi> that
                    are examined in this study.</p>
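                <p>The extraction step and the lemma convention described above can be illustrated
                    with a minimal sketch in Python. It is not the procedure used in this study:
                    the function name extract_pronouns, the record fields and the example line are
                    hypothetical, the tokeniser is deliberately crude, and the sketch makes no
                    attempt to separate singular from plural uses of <hi rend="italic">you</hi>.</p>
                <eg xml:space="preserve"><![CDATA[
```python
import re

# Assumed mapping from pronoun form to lemma, following the convention
# stated in the text: thou and thee share the lemma THOU; you is YOU.
LEMMA = {"you": "YOU", "thou": "THOU", "thee": "THOU"}

# Crude tokeniser (an assumption): runs of ASCII letters and apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def extract_pronouns(line, window=3):
    """Return one record per you/thou/thee token found in a line of text.

    Each record holds the pronoun form, its lemma, and up to `window`
    lower-cased words of left and right context.
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(line)]
    records = []
    for i, tok in enumerate(tokens):
        if tok in LEMMA:
            records.append({
                "form": tok,
                "lemma": LEMMA[tok],
                "left": tokens[max(0, i - window):i],
                "right": tokens[i + 1:i + 1 + window],
            })
    return records


# Invented example line, not a quotation from the corpus.
for record in extract_pronouns("I do beseech thee, tell me what you mean"):
    print(record["form"], record["lemma"], record["left"], record["right"])
```
]]></eg>
                <p>In this sketch the form field corresponds to a three-way (<hi rend="italic"
                        >you</hi>/<hi rend="italic">thou</hi>/<hi rend="italic">thee</hi>)
                    labelling and the lemma field to a two-way (<hi rend="smallcaps">you</hi>/<hi
                        rend="smallcaps">thou</hi>) labelling; contracted forms are not matched by
                    this simple tokeniser.</p>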
            </div>
            <div>
                <head>Background</head>
                <div>
                    <head>Digital Humanities</head>
                    <p>Over the past few years, computational research has branched out into other
                        research fields that are not necessarily closely connected to computer
                        science. Digital Humanities (DH) is an umbrella term for all research that
                        is computational but approaches the datasets investigated within, and/or
                        addresses questions or problems that are of importance to, the disciplines
                        of the humanities.</p>
                    <p>The popularity of Digital Humanities, a cross-domain field of study, is
                        attributable to the fact that it does not diminish the differences between
                        fields but rather operationalises this difference to solve difficulties that
                        could not be dealt with within a single discipline. The role of
                        computational methods in the humanities can be considered as that of a
                        supporting character; in any DH computer modelling research, it should be
                        kept in mind that the interpretation is as important as the suitability of a
                        computational model and its outcomes.</p>
                </div>
                <div>
                    <head>Early Modern English and <hi rend="smallcaps">you</hi>/<hi
                            rend="smallcaps">thou</hi></head>
                    <p>In Early Modern English (EModE), two different second person singular
                        pronouns were used, namely the formally singular <hi rend="smallcaps"
                            >thou</hi> and the formally plural (but pragmatically also
                        respectful-singular) <hi rend="smallcaps">you</hi>, with only the latter
                        surviving the EModE period (Taavitsainen and Jucker 2003). The difference
                        between the uses of these two pronouns is evident from multiple literary
                        studies that have addressed Shakespeare’s work, work of his contemporaries,
                        and other documents from this era, such as <ref target="#Walker.2003">Walker (2003)</ref> 
                        and <ref target="#Busse.2002">Busse (2002)</ref>.
                        These studies suggest that unwritten social rules governed the use of these
                        pronouns, and that abiding by these rules was necessary in order to speak
                        according to society’s standards. The use of the two different pronouns acted as a
                        sign of relative status: <hi rend="smallcaps">you</hi> would be used to
                        superiors and <hi rend="smallcaps">thou</hi> towards inferiors. The choice
                        of pronoun can thus also operate as a subtle means of showing respect or
                        disrespect; using the pronouns in this way would have been natural and easy
                        to English native speakers of the period.</p>
                    <p>Shakespeare lived during the Early Modern English period, and thus used both
                            <hi rend="smallcaps">you</hi> and <hi rend="smallcaps">thou</hi> in his
                        writing. His work was written less than 100 years before <hi rend="italic"
                            >thou</hi> and <hi rend="italic">thee</hi> disappeared from the standard
                        language (surviving in dialects and archaicised registers, such as pious
                        addresses to the divinity). Thus we may straightforwardly posit that the
                        disappearance of <hi rend="smallcaps">thou</hi> was likely already in
                        progress around his time. Though obviously heightened in its use of
                        emotional and dramatic language and style to accommodate the genre of the
                        play script, the language of Shakespeare – including the usage of the two
                        second-person pronouns – can be assumed to be a reasonably good
                        representation of the language used generally in social interaction and
                        conversation at that time (<ref target="#Calvo.1992">Calvo 1992</ref>).</p>
                </div>
                <div>
                    <head>Prior Studies on <hi rend="smallcaps">you</hi>/<hi rend="smallcaps"
                            >thou</hi></head>
                    <p>Most studies of Shakespeare’s use of <hi rend="smallcaps">you</hi> and <hi
                            rend="smallcaps">thou</hi> so far have been literary and nonnumeric
                        studies (<ref target="#Brown.1960">Brown and Gilman 1960</ref>; 
                        <ref target="#Quirk.1974">Quirk 1974</ref>; <ref target="#Calvo.1992">Calvo 1992</ref>); the relatively few to
                        have used data-based or quantitative techniques did not implement any method
                        beyond directly comparing raw frequency counts (<ref target="#Busse.2003">Busse 2003</ref>;
                        <ref target="#Mazzon.2003">Mazzon 2003</ref>;
                        <ref target="#Stein.2003">Stein 2003</ref>). Moreover, these studies did not look at all the extant
                        Shakespeare plays, but instead chose a few plays to focus on. Nonetheless,
                        these studies have demonstrated some patterns in the use of <hi
                            rend="smallcaps">you</hi> and <hi rend="smallcaps">thou</hi> and thus
                        provide a workable foundation for a more in-depth study of the usage of
                        those two pronouns.</p>
                    <p>These prior studies converge on the overall conclusion that the pronouns <hi
                            rend="smallcaps">you</hi> and <hi rend="smallcaps">thou</hi> appear to
                        be used to support the explicit expression of respect, social status, and
                        familiarity. <ref target="#Quirk.1974">Quirk (1974)</ref> and <ref target="#Mazzon.2003">Mazzon (2003)</ref> characterise the role of the
                        pronoun as a linguistic marker, whose usage can be seen as either marked or
                        unmarked. In other words, the use of a particular pronoun can be seen as
                        marked when it is used unexpectedly, for example when <hi rend="smallcaps"
                            >you</hi> is expected based on social status, but <hi rend="smallcaps"
                            >thou</hi> is used instead. Thus, in contrast to earlier studies 
                        (<ref target="#Brown.1960">Brown
                        and Gilman 1960</ref>), they do not perceive <hi rend="smallcaps">you</hi> and <hi
                            rend="smallcaps">thou</hi> as standing in direct contrast; instead, they
                        treat the pronouns as having a more variable interpretation than had
                        previously been assumed, depending on the context in which they occur.
                        <ref target="#Calvo.1992">Calvo (1992)</ref> and <ref target="#Stein.2003">Stein (2003)</ref> expand on this by concluding that
                        the markedness of the pronoun depends on the context and the situation,
                        while pronoun choice additionally depends on stable factors such as the
                        social statuses of, and the level of familiarity between, the characters in
                        Shakespeare’s plays (the speakers and addressees in this study), rather than on
                            <hi rend="italic">just</hi> the latter factors (<ref target="#Brown.1960">Brown and Gilman 1960</ref>).
                        The emotive effect of the utterances within which the <hi rend="smallcaps"
                            >you/thou</hi> distinction is utilised is of importance as well;
                        feelings such as anger and love for another character may find expression
                        through pronoun choice. This is connected to the notion of respect, as, in
                        an angry remark, marked pronouns can be used to disrespect the addressee
                        based on their social status (<ref target="#Stein.2003">Stein 2003</ref>).</p>
                    <p>As <ref target="#Stein.2003">Stein (2003)</ref> and <ref target="#Busse.2006">Busse (2006)</ref> already stressed in their studies, a study
                        of <hi rend="smallcaps">you</hi> and <hi rend="smallcaps">thou</hi> in
                        Shakespeare cannot and should not be limited to a single research
                        discipline. Rather, what is needed is a combination of literature,
                        sociolinguistics, pragmatics and conversation analysis, which are all useful
                        in capturing the complexity of pronominal address and the social
                        constraints that may have underpinned the choice of one honorific
                        pronoun-form over the other. </p>
                </div>
            </div>
            <div>
                <head>Methodology</head>
                <p>As has already been mentioned, this is a strictly empirical study which attempts
                    to verify the findings of earlier research through a computational approach. The
                    use of a computational, statistical method is motivated by the goal of creating
                    a more objective representation of Shakespeare’s use of <hi rend="smallcaps"
                        >you</hi> and <hi rend="smallcaps">thou</hi> in his plays than has been
                    accomplished so far, since it does not require analysis of meaning-in-context by
                    a human being, but rather proceeds directly from quantitative measurements.</p>
                <div>
                    <head>Hypotheses</head>
                    <p>Three hypotheses were formulated on the basis of the literature:</p>
                    <list type="ordered">
                        <item>No single model will be able to predict the pronominal address term
                            solely based on linguistic and extra-linguistic features.</item>
                    </list>
                    <p>This, being a null hypothesis, is exactly what this study aims to falsify by
                        developing such a model. It is not likely that a single model will be able
                        to predict Shakespeare’s original choice of <hi rend="smallcaps">you</hi> or
                            <hi rend="smallcaps">thou</hi> based on linguistic and extra-linguistic
                        features, because this choice is dependent on so many factors. However, a
                        computational model that combines insights from literature,
                        sociolinguistics, pragmatics and conversation analysis is expected to
                        predict the pronoun choice successfully, as it includes all the factors that
                        might influence the choice for either <hi rend="smallcaps">you</hi> or <hi
                            rend="smallcaps">thou</hi>.</p>
                    <list type="ordered:2" rend="ordered">
                        <item>The features of social status, age and sentiment will be better
                            predictors of the pronoun choice than other features.</item>
                    </list>
                    <p>A hierarchy will be established that ranks the linguistic and
                        extra-linguistic features by how well they predict the pronoun choice in the
                        best-performing model. It may be inferred from the literature that social status,
                        age and sentiment are highly likely to be at the top of this hierarchy,
                        among the most influential features; these three features have shown up most
                        reliably in prior research.</p>
                    <list type="ordered:3" rend="ordered">
                        <item>The best performing algorithm will combine features both dependently
                            and independently.</item>
                    </list>
                    <p>The different learning biases and assumptions of the three algorithms applied
                        in this study will reveal how the features interact with one another. The
                        first algorithm, Naive Bayes, assumes all features are independent of one
                        another, while the decision tree algorithm assumes that the features are all
                        dependent on each other. Lastly, the support vector machine works with both
                        dependent and independent features. I expect the set of features that will
                        be included in the final model to be a combination of both dependent and
                        independent features, and therefore the support vector machine algorithm to
                        perform best. The three algorithms will be discussed in more detail later in
                        the section “Classification based on three algorithms”.</p>
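                    <p>The independence assumption attributed to Naive Bayes above can be made
                        concrete with a minimal sketch of a categorical Naive Bayes classifier with
                        add-one smoothing, written in plain Python. It is an illustration only and
                        not the implementation used in this study: the feature names S_Status and
                        Neg_Sent follow Table 1, but their values, the four training rows and the
                        function names are invented for the example.</p>
                    <eg xml:space="preserve"><![CDATA[
```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count label frequencies and, per label, each feature's value frequencies."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, label) -> Counter of values
    vocab = defaultdict(set)             # feature -> set of observed values
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(feature, label)][value] += 1
            vocab[feature].add(value)
    return label_counts, value_counts, vocab


def predict(model, row):
    """Return the label with the highest log-posterior.

    Each feature contributes its own log-probability term, which is the
    independence assumption; add-one smoothing avoids zero probabilities.
    """
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        for feature, value in row.items():
            count = value_counts[(feature, label)][value]
            score += math.log((count + 1) / (n + len(vocab[feature])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data: four pronoun instances with two categorical features.
rows = [
    {"S_Status": "high", "Neg_Sent": "yes"},
    {"S_Status": "high", "Neg_Sent": "no"},
    {"S_Status": "low", "Neg_Sent": "no"},
    {"S_Status": "low", "Neg_Sent": "yes"},
]
labels = ["THOU", "THOU", "YOU", "YOU"]

model = train_naive_bayes(rows, labels)
print(predict(model, {"S_Status": "high", "Neg_Sent": "no"}))  # THOU
```
]]></eg>
                    <p>Because every feature adds a separate term to the score, this kind of model
                        cannot represent a case in which one feature matters only in combination
                        with another, which is the contrast with the other two algorithms drawn
                        above.</p>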
                </div>
                <div>
                    <head>Data</head>
                    <p>The data for this study comes from the <hi rend="italic">Encyclopaedia of
                            Shakespeare’s Language</hi> project<note place="foot" xml:id="ftn2"
                            n="1">More information on this project, which is funded by the Arts and
                            Humanities Research Council (AH/N002415/1), can be found on <ref
                                target="http://wp.lancs.ac.uk/shakespearelang/"
                                >http://wp.lancs.ac.uk/shakespearelang/</ref>.</note>, which is a
                        research project at Lancaster University (UK). The project corpus consists
                        of 38 of Shakespeare’s plays, which includes all 36 plays from the First
                        Folio with the addition of <hi rend="italic">The Two Noble Kinsmen</hi> and
                            <hi rend="italic">Pericles: Prince of Tyre</hi>. A broadly annotated
                        version of the full Shakespeare corpus can be found online<note place="foot"
                            xml:id="ftn3" n="2">CQPweb Main Page, <ref
                                target="http://cqpweb.lancs.ac.uk"
                            >http://cqpweb.lancs.ac.uk</ref>.</note>. Some of the annotation and all
                        of the abbreviations used for the titles of the plays follow <hi
                            rend="italic">The Arden Shakespeare</hi>.</p>
                    <div>
                        <head>Linguistic and Extra-Linguistic Features</head>
                        <table rend="table-scroll">
                            <head>Table 1: List of all features used in this study</head>
                            <row role="label">
                                <cell>Feature</cell>
                                <cell style="text-align:center;">Acronym</cell>
                                <cell style="text-align:center;">Annotation</cell>
                            </row>
                            <row>
                                <cell>Genre</cell>
                                <cell style="text-align:center;">Genre</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                            <row>
                                <cell>Play name</cell>
                                <cell style="text-align:center;">Play</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                            <row>
                                <cell>Play, act, scene</cell>
                                <cell style="text-align:center;">Scene</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                            <row>
                                <cell>Speaker ID</cell>
                                <cell style="text-align:center;">S_ID</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                            <row>
                                <cell>Speaker gender</cell>
                                <cell style="text-align:center;">S_Gender</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                            <row>
                                <cell>Speaker status</cell>
                                <cell style="text-align:center;">S_Status</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                            <row>
                                <cell>Production date</cell>
                                <cell style="text-align:center;">Prod_Date</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                            <row>
                                <cell>N-gram</cell>
                                <cell style="text-align:center;">LW1-3, RW1-3</cell>
                                <cell style="text-align:center;">Automatic</cell>
                            </row>
                            <row>
                                <cell>Positive sentiment</cell>
                                <cell style="text-align:center;">Pos_Sent</cell>
                                <cell style="text-align:center;">Automatic</cell>
                            </row>
                            <row>
                                <cell>Negative sentiment</cell>
                                <cell style="text-align:center;">Neg_Sent</cell>
                                <cell style="text-align:center;">Automatic</cell>
                            </row>
                            <row>
                                <cell>Speaker age</cell>
                                <cell style="text-align:center;">S_Age</cell>
                                <cell style="text-align:center;">Manual</cell>
                            </row>
                            <row>
                                <cell>Location</cell>
                                <cell style="text-align:center;">Location</cell>
                                <cell style="text-align:center;">Manual</cell>
                            </row>
                            <row>
                                <cell>Addressee ID</cell>
                                <cell style="text-align:center;">A_ID</cell>
                                <cell style="text-align:center;">Automatic</cell>
                            </row>
                            <row>
                                <cell>Addressee gender</cell>
                                <cell style="text-align:center;">A_Gender</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                            <row>
                                <cell>Addressee status</cell>
                                <cell style="text-align:center;">A_Status</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                            <row>
                                <cell>Addressee age</cell>
                                <cell style="text-align:center;">A_Age</cell>
                                <cell style="text-align:center;">Manual</cell>
                            </row>
                            <row>
                                <cell>Status differential</cell>
                                <cell style="text-align:center;">Stat_Diff</cell>
                                <cell style="text-align:center;">Automatic</cell>
                            </row>
                            <row>
                                <cell>No. of people addressed</cell>
                                <cell style="text-align:center;">A_Number</cell>
                                <cell style="text-align:center;">Pre-annotated</cell>
                            </row>
                        </table>
                        <p>The Encyclopaedia of Shakespeare’s Language corpus is richly annotated.
                            However, some additional annotation was necessary to perform a full
                            analysis of what extra-linguistic features could be predictors of the
                            pronominal address term. The full set of features used in this study can
                            be found in Table 1. The added features are briefly described here.</p>
                        <p>As a referent (such as a second person singular pronoun) is dependent on
                            context, the adjacent part of the utterance is used as a feature to test
                            the effect of co-text. Six co-textual words are included, i.e. a 7-gram
                            altogether. “LW” labels the words occurring on the left of the pronoun,
                            and “RW” the words on the right of the pronoun. Each of these words is
                            numbered based on its distance from the pronoun, e.g. LW3 is the third
                            word on the left of the pronoun. In corpus linguistics, collocations are
                            often examined within a three-word window, meaning there are three words
                            on either side of the word of interest. While I am not necessarily
                            looking at specific collocations of <hi rend="smallcaps">you</hi> and
                                <hi rend="smallcaps">thou</hi>, the LW/RW features will look at
                            similarities and differences in co-textual words to see if they can
                            predict the pronoun choice.</p>
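                        <p>The windowing described above can be sketched in Python as follows. This
                            is an illustrative sketch rather than the script used in the study: the
                            function name <hi rend="italic">cotext_windows</hi>, the choice to store
                            missing positions as None, and the invented example utterance are all
                            assumptions.</p>

```python
from typing import Dict, List, Optional

# Target forms; "thou" and "thee" are kept separate in this sketch.
PRONOUNS = {"you", "thou", "thee"}


def cotext_windows(tokens: List[str], window: int = 3) -> List[Dict[str, Optional[str]]]:
    """Return one row per pronoun with LW1..LWn and RW1..RWn features.

    LW1 is the word directly left of the pronoun and LW3 the third word
    to the left. Positions beyond the utterance boundary are stored as
    None (an assumption; the source does not say how edges were handled).
    """
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() not in PRONOUNS:
            continue
        row: Dict[str, Optional[str]] = {"pronoun": token.lower()}
        for d in range(1, window + 1):
            left, right = i - d, i + d
            row[f"LW{d}"] = tokens[left].lower() if left >= 0 else None
            row[f"RW{d}"] = tokens[right].lower() if len(tokens) > right else None
        rows.append(row)
    return rows


# Invented example utterance, already tokenised and stripped of punctuation.
example = "I do beseech you sir be merry".split()
for row in cotext_windows(example):
    print(row)
```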
                        <p>Another feature noted as critical in prior studies is sentiment, that is,
                            the use of the pronoun to convey positivity or negativity. Sentiment was
                            annotated with the use of the 7-gram described above. <hi rend="italic"
                                >SentiStrength</hi> is a lexicon-based sentiment analysis program
                            that assigns each phrase separate scores for positivity and negativity
                            (<ref target="#Thelwall.2010">Thelwall
                            et al. 2010</ref>). Since <hi rend="italic">SentiStrength</hi> was developed
                            to work with online comments rather than complete sentences as in formal
                            written English, it works well with n-grams too. The scores for
                            positivity and negativity are kept as separate variables.</p>
                        <p>The corpus already included metadata on the speakers; however, I wanted
                            to include age as well. The age of a character is often not given except
                            when it is an important attribute of that character, making this
                            difficult to annotate. Therefore, <ref target="#Quennell.2002">Quennell and Johnson’s (2002)</ref>
                            character descriptions were used. The characters were sorted into a
                            trinary classification, with ‘adult’ as the default category. Any
                            deviations towards ‘younger’ or ‘older’ were based on textual references
                            or the character’s name, such as for ‘Old Man’ in <hi rend="italic">King
                                Lear</hi>. Older characters were occasionally classified as such
                            based on the fact they had adult children with prominent roles in the
                            plays.</p>
                        <p>A more global feature is the location where the scene is set. This was
                            difficult to annotate, due to the often unreliable stage directions.
                            Instead of a nominal description for each scene location, I used a
                            binary annotation of ‘public’ and ‘private’. The text itself was
                            examined to determine the location based on what characters said about
                            their location, but in addition <ref target="#Bate.2007">Bate and Rasmussen’s (2007)</ref> annotation
                            and <ref target="#Greenblatt.1997">Greenblatt, Cohen, Howard and Maus’ (1997)</ref> annotations were
                            consulted. The use of these three resources enabled the binary manual
                            annotation of location for every scene.</p>
                        <p>Besides the information about the speaker and the scene, information
                            regarding the addressee is essential when analysing character
                            interaction from a conversation analysis perspective. As a manual
                            annotation of the addressee would be extremely time-consuming, I instead
                            used an automatic method which identifies the previous speaker as the
                            addressee of any given utterance. This is in line with the last-as-next
                            bias used in conversation analysis (<ref target="#Mazeland.2003">Mazeland 2003</ref>). This means that,
                            even in larger group conversations, it is often expected that the last
                            speaker before the current speaker will also be the next speaker, thus
                            making it likely that the current speaker is addressing the last
                            speaker. If the utterances were interrupted by the start of a new scene
                            or other stage directions (e.g. someone walking into the scene), the
                            annotated addressee would be the <hi rend="italic">next</hi> speaker
                            rather than the previous speaker for the first utterance after the
                            interruption.</p>
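                        <p>The addressee heuristic described above can be sketched as follows. The
                            input layout (a speaker ID paired with a flag marking the first
                            utterance after a scene start or stage direction) and the function name
                            <hi rend="italic">assign_addressees</hi> are assumptions made for
                            illustration. Boundary cases that the description does not cover, such
                            as an interrupted utterance with no usable next speaker, are returned as
                            None.</p>

```python
from typing import List, Optional, Tuple

# Each utterance is (speaker_id, follows_interruption). The flag marks the
# first utterance after a scene start or a stage direction. This input
# layout is an assumption made for the sketch.
Utterance = Tuple[str, bool]


def assign_addressees(utterances: List[Utterance]) -> List[Optional[str]]:
    """Previous speaker is the addressee; after an interruption, the next one."""
    addressees: List[Optional[str]] = []
    for i, (speaker, follows_interruption) in enumerate(utterances):
        if follows_interruption or i == 0:
            # Look ahead, but not across another interruption.
            nxt = i + 1
            usable = len(utterances) > nxt and not utterances[nxt][1]
            addressees.append(utterances[nxt][0] if usable else None)
        else:
            # Last-as-next: the current speaker answers the previous one.
            addressees.append(utterances[i - 1][0])
    return addressees


# Invented exchange: a servant enters before the fourth utterance.
scene = [
    ("KING", True),
    ("PRINCE", False),
    ("KING", False),
    ("SERVANT", True),
    ("KING", False),
]
print(assign_addressees(scene))
```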
                        <p>Using the data for the social status of the speaker and the addressee, I
                            also created a status differential. As the status category labels are
                            numeric and ordered, this can be done by taking the difference between
                            the two. For example, a king (status = 0) and a servant (status = 6) are
                            distant in status, and thus will have a high status differential (here:
                            6). Between a king and a prince (status = 1), the difference is much
                            smaller (here: 1). This feature, an absolute difference, was
                            automatically generated from the already annotated features.</p>
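                        <p>As a minimal sketch, the status differential amounts to the absolute
                            difference between the two numeric status labels. The function name is
                            hypothetical; the status values are those given in the example
                            above.</p>

```python
def status_differential(speaker_status: int, addressee_status: int) -> int:
    """Absolute distance between two ordered, numeric status labels."""
    return abs(speaker_status - addressee_status)


KING, PRINCE, SERVANT = 0, 1, 6  # status values taken from the example in the text

print(status_differential(KING, SERVANT))  # king addressing servant
print(status_differential(SERVANT, KING))  # same distance in the other direction
print(status_differential(KING, PRINCE))   # king addressing prince
```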
                        <p>A feature that had to be excluded is familiarity between characters
                            (social distance). This data was not already available, and it was
                            beyond the scope of this study to annotate this for all relevant
                            character pairs. The literature has shown this to be a relevant feature.
                            However, through the use of sentiment analysis, I have attempted to
                            cover the complimentary and insulting aspects that could arise from high
                            familiarity, and any lack thereof arising from low familiarity.
                            Obviously, this does not cover all aspects of familiarity, but it means
                            that this feature is not totally neglected.</p>
                    </div>
                </div>
                <div>
                    <head>Classification Based on Three Algorithms</head>
                    <p>Three different algorithms are used for the classification task, namely Naive
                        Bayes, decision trees and support vector machines. Although it would be ideal
                        to achieve high precision and recall scores, the main goal of this research
                        is to see whether it is even possible to predict the second person singular
                        pronoun choice through a computational application <hi rend="italic">at
                            all</hi>. If this is indeed the case, what features contribute to this
                        prediction? It is thus more important to verify which features influence the
                        choice and to what extent they do so. </p>
                    <p>The reason for using three algorithms, and in particular these three, is
                        their differences in learning biases and assumptions. Naive Bayes assumes
                        all features are independent of one another, whereas a decision tree attempts
                        to create a dependent, hierarchical structure in the features. A support
                        vector machine (SVM) is more complex and is able to combine both dependent
                        and independent features. The addition of the latter algorithm will be
                        particularly useful if the difference between the two simpler algorithms’
                        models is small.</p>
                    <p>As well as applying three algorithms, I will also look at the difference
                        between keeping <hi rend="italic">thou</hi> and <hi rend="italic">thee</hi>
                        separate and combining them into the one category <hi rend="smallcaps"
                            >thou</hi>. For this, I will run both a binary (<hi rend="smallcaps"
                            >you</hi> and <hi rend="smallcaps">thou</hi>) and a trinary (<hi
                            rend="italic">you</hi>, <hi rend="italic">thou</hi> and <hi
                            rend="italic">thee</hi>) classification, to see whether this affects the
                        scores or changes which features are included in the best models.</p>
                </div>
                <div>
                    <head>Overview of Implementation</head>
                    <p>I ran the three algorithms using the Waikato Environment for Knowledge
                        Analysis (Weka<note place="foot" xml:id="ftn4" n="3">Weka 3 - Data Mining
                            with Open Source Machine Learning Software in Java, <ref
                                target="http://www.cs.waikato.ac.nz/ml/weka/"
                                >http://www.cs.waikato.ac.nz/ml/weka/</ref>.</note>) software<note
                            place="foot" xml:id="ftn5" n="4">In Weka, Naive Bayes is identified as
                            NaiveBayesMultinomial, decision tree as J48, and support vector machine
                            as SMO.</note> with the default settings. The algorithms were run using
                        10-fold cross-validation, so that the reported scores are based on training
                        and testing across all folds combined.</p>
                    <p>The number of relevant instances of <hi rend="italic">you/thou/thee</hi>
                        extracted from the dataset is 22,932, which makes up 99.5% of the total
                        number of such pronouns in the dataset. The pronouns were extracted using a
                        Python script with simple heuristics. About 0.5% was missed due to noise in
                        the dataset. The number of instances of <hi rend="italic">you/thou/thee</hi>
                        that were extracted from each play ranges from 363 (in <hi rend="italic"
                            >Macbeth</hi>) to 811 (in <hi rend="italic">Coriolanus</hi>).</p>
                    <p>I attempted to improve or maintain the scores while making the model simpler
                        by excluding features, that is, through feature ablation. When there were
                        conflicting changes in the scores, the scores of precision and F-measure
                        were prioritised. I hoped to identify which features truly help predict the
                        pronoun by building the simplest but best performing model. The baseline
                        that the models were compared to is derived from the distribution of the
                        pronouns in the dataset, that is, 62.6% <hi rend="smallcaps">you</hi> and
                        37.4% <hi rend="smallcaps">thou</hi>.</p>
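                    <p>To make the baseline concrete, the following sketch works through the
                        weighted-average scores of a classifier that always predicts the majority
                        class. It assumes the 62.6417% share reported as the baseline accuracy in
                        the results tables, and it assumes that classes which are never predicted
                        receive a score of 0 on every measure. The function name is
                        hypothetical.</p>

```python
from typing import Dict


def majority_baseline(share: float) -> Dict[str, float]:
    """Weighted-average scores when every instance gets the majority label."""
    precision = share  # correct predictions divided by all predictions
    recall = 1.0       # every majority-class instance is retrieved
    f_measure = 2 * precision * recall / (precision + recall)
    # Minority classes score 0 on all three measures (assumption), so only
    # the majority class contributes to the weighted averages.
    return {
        "precision": share * precision,
        "recall": share * recall,
        "f_measure": share * f_measure,
        "accuracy": share,
    }


scores = majority_baseline(0.626417)
print({name: round(value, 3) for name, value in scores.items()})
```

                    <p>Rounded to three decimals, this yields 0.392 for precision, 0.626 for recall
                        and 0.483 for F-measure, which agrees with the weighted-average baseline
                        rows in the results tables.</p>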
                    <p>I first took out groups of features that are related, rather than one feature
                        at a time. Among the 23 features, I created six different groups. The first
                        group related to the wider linguistic and social context (play, production
                        date, genre, scene, location), while the second group was the closer
                        linguistic co-text (n-gram). Information on the speaker (name, status,
                        gender, age) and the addressee (name, status, gender, age, number of people)
                        formed groups 3 and 4, respectively. I kept status differential on its own, because it
                        relates to multiple groups. Finally, the last group was sentiment (positive
                        and negative). After the group ablation, I went back over the features to
                        see if individual feature exclusions would improve the model further. This
                        ensured the simplest and best model for each algorithm. The scores and the
                        features included in each model are given in Tables 2, 3 and 4.</p>
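                    <p>The two-stage ablation described above can be sketched as a simple greedy
                        loop. This is not the procedure as run in Weka: the function names, the
                        single-number score and the invented <hi rend="italic">toy_score</hi>
                        function are assumptions for illustration. In the study the score would
                        come from cross-validated runs, with precision and F-measure prioritised
                        when scores conflicted. The group membership follows the description above
                        and the labels follow Table 1.</p>

```python
from typing import Callable, Dict, List, Tuple

# Six groups covering the 23 features; labels follow Table 1.
FEATURE_GROUPS: Dict[str, List[str]] = {
    "context": ["Play", "Prod_Date", "Genre", "Scene", "Location"],
    "co-text": ["LW1", "LW2", "LW3", "RW1", "RW2", "RW3"],
    "speaker": ["S_ID", "S_Status", "S_Gender", "S_Age"],
    "addressee": ["A_ID", "A_Status", "A_Gender", "A_Age", "A_Number"],
    "status_diff": ["Stat_Diff"],
    "sentiment": ["Pos_Sent", "Neg_Sent"],
}

Scorer = Callable[[List[str]], float]


def ablate(groups: Dict[str, List[str]], evaluate: Scorer) -> Tuple[List[str], float]:
    """Drop groups, then single features, whenever the score does not fall."""
    kept = [f for members in groups.values() for f in members]
    best = evaluate(kept)
    # Pass 1 removes whole groups; pass 2 removes individual features.
    candidates = [list(m) for m in groups.values()] + [[f] for f in kept]
    for removal in candidates:
        trial = [f for f in kept if f not in removal]
        if not trial or len(trial) == len(kept):
            continue  # nothing would be left, or these features are already gone
        score = evaluate(trial)
        if score >= best:  # keep the simpler model if it is at least as good
            kept, best = trial, score
    return kept, best


def toy_score(features: List[str]) -> float:
    """Invented stand-in for a cross-validated score; not real data."""
    useful = {"LW1", "RW1", "S_ID"}
    return len(useful.intersection(features)) - 0.01 * len(features)


kept, best = ablate(FEATURE_GROUPS, toy_score)
print(kept, round(best, 2))
```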
                </div>
            </div>
            <div>
                <head>Results</head>
                <div>
                    <head>Trinary Classification Scores</head>
                    <p>Table 2 shows the results of the trinary classification. As can be seen, each
                        model performed substantially better than the baseline model on all scores.
                        The F-measure of the best model, the support vector machine model, is
                        highlighted in bold.</p>
                    <table rend="table-scroll">
                        <head>Table 2: Scores for precision, recall, F-measure and accuracy for
                            trinary pronoun prediction</head>
                        <row role="label">
                            <cell cols="2">Algorithm</cell>
                            <cell style="text-align:center;">Precision</cell>
                            <cell style="text-align:center;">Recall</cell>
                            <cell style="text-align:center;">F-measure</cell>
                            <cell style="text-align:center;">Accuracy</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="4">Baseline</cell>
                            <cell style="text-align:center;">Weighted Avg.</cell>
                            <cell style="text-align:center;">0.392</cell>
                            <cell style="text-align:center;">0.626</cell>
                            <cell style="text-align:center;">0.483</cell>
                            <cell style="text-align:center;" rows="4">62.6417%</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">you</hi></cell>
                            <cell style="text-align:center;">0.626</cell>
                            <cell style="text-align:center;">1.000</cell>
                            <cell style="text-align:center;">0.770</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">thou</hi></cell>
                            <cell style="text-align:center;">0.000</cell>
                            <cell style="text-align:center;">0.000</cell>
                            <cell style="text-align:center;">0.000</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">thee</hi></cell>
                            <cell style="text-align:center;">0.000</cell>
                            <cell style="text-align:center;">0.000</cell>
                            <cell style="text-align:center;">0.000</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="4">Naive Bayes</cell>
                            <cell style="text-align:center;">Weighted Avg.</cell>
                            <cell style="text-align:center;">0.826</cell>
                            <cell style="text-align:center;">0.826</cell>
                            <cell style="text-align:center;">0.826</cell>
                            <cell style="text-align:center;" rows="4">82.64%</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">you</hi></cell>
                            <cell style="text-align:center;">0.880</cell>
                            <cell style="text-align:center;">0.885</cell>
                            <cell style="text-align:center;">0.882</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">thou</hi></cell>
                            <cell style="text-align:center;">0.865</cell>
                            <cell style="text-align:center;">0.850</cell>
                            <cell style="text-align:center;">0.857</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">thee</hi></cell>
                            <cell style="text-align:center;">0.509</cell>
                            <cell style="text-align:center;">0.510</cell>
                            <cell style="text-align:center;">0.510</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="4">Decision Tree</cell>
                            <cell style="text-align:center;">Weighted Avg.</cell>
                            <cell style="text-align:center;">0.732</cell>
                            <cell style="text-align:center;">0.752</cell>
                            <cell style="text-align:center;">0.712</cell>
                            <cell style="text-align:center;" rows="4">75.2093%</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">you</hi></cell>
                            <cell style="text-align:center;">0.738</cell>
                            <cell style="text-align:center;">0.960</cell>
                            <cell style="text-align:center;">0.835</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">thou</hi></cell>
                            <cell style="text-align:center;">0.896</cell>
                            <cell style="text-align:center;">0.574</cell>
                            <cell style="text-align:center;">0.700</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">thee</hi></cell>
                            <cell style="text-align:center;">0.408</cell>
                            <cell style="text-align:center;">0.097</cell>
                            <cell style="text-align:center;">0.157</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="4">Support Vector Machine</cell>
                            <cell style="text-align:center;">Weighted Avg.</cell>
                            <cell style="text-align:center;">0.854</cell>
                            <cell style="text-align:center;">0.857</cell>
                            <cell style="text-align:center;"><hi rend="bold">0.854</hi></cell>
                            <cell style="text-align:center;" rows="4">85.675%</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">you</hi></cell>
                            <cell style="text-align:center;">0.871</cell>
                            <cell style="text-align:center;">0.927</cell>
                            <cell style="text-align:center;">0.898</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">thou</hi></cell>
                            <cell style="text-align:center;">0.919</cell>
                            <cell style="text-align:center;">0.836</cell>
                            <cell style="text-align:center;">0.876</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="italic">thee</hi></cell>
                            <cell style="text-align:center;">0.659</cell>
                            <cell style="text-align:center;">0.566</cell>
                            <cell style="text-align:center;">0.609</cell>
                        </row>
                    </table>
                </div>
                <div>
                    <head>Binary Classification Scores</head>
                    <p>Table 3 shows the results of the best models for the binary classification.
                        The F-measure of the best model, again the support vector machine model, is
                        highlighted in bold. This is also the best scoring model out of all models
                        presented in this paper.</p>
                    <table rend="table-scroll">
                        <head>Table 3: Scores for precision, recall, F-measure and accuracy for
                            binary pronoun prediction</head>
                        <row role="label">
                            <cell cols="2">Algorithm</cell>
                            <cell style="text-align:center;">Precision</cell>
                            <cell style="text-align:center;">Recall</cell>
                            <cell style="text-align:center;">F-measure</cell>
                            <cell style="text-align:center;">Accuracy</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="3">Baseline</cell>
                            <cell style="text-align:center;">Weighted Avg.</cell>
                            <cell style="text-align:center;">0.392</cell>
                            <cell style="text-align:center;">0.626</cell>
                            <cell style="text-align:center;">0.483</cell>
                            <cell style="text-align:center;" rows="3">62.6417%</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="smallcaps">you</hi></cell>
                            <cell style="text-align:center;">0.626</cell>
                            <cell style="text-align:center;">1.000</cell>
                            <cell style="text-align:center;">0.770</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="smallcaps">thou</hi></cell>
                            <cell style="text-align:center;">0.000</cell>
                            <cell style="text-align:center;">0.000</cell>
                            <cell style="text-align:center;">0.000</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="3">Naive Bayes</cell>
                            <cell style="text-align:center;">Weighted Avg.</cell>
                            <cell style="text-align:center;">0.868</cell>
                            <cell style="text-align:center;">0.868</cell>
                            <cell style="text-align:center;">0.867</cell>
                            <cell style="text-align:center;" rows="3">86.8306%</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="smallcaps">you</hi></cell>
                            <cell style="text-align:center;">0.876</cell>
                            <cell style="text-align:center;">0.920</cell>
                            <cell style="text-align:center;">0.897</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="smallcaps">thou</hi></cell>
                            <cell style="text-align:center;">0.853</cell>
                            <cell style="text-align:center;">0.782</cell>
                            <cell style="text-align:center;">0.816</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="3">Decision Tree</cell>
                            <cell style="text-align:center;">Weighted Avg.</cell>
                            <cell style="text-align:center;">0.818</cell>
                            <cell style="text-align:center;">0.818</cell>
                            <cell style="text-align:center;">0.818</cell>
                            <cell style="text-align:center;" rows="3">81.8376%</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="smallcaps">you</hi></cell>
                            <cell style="text-align:center;">0.849</cell>
                            <cell style="text-align:center;">0.863</cell>
                            <cell style="text-align:center;">0.856</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="smallcaps">thou</hi></cell>
                            <cell style="text-align:center;">0.764</cell>
                            <cell style="text-align:center;">0.744</cell>
                            <cell style="text-align:center;">0.754</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="3">Support Vector Machine</cell>
                            <cell style="text-align:center;">Weighted Avg.</cell>
                            <cell style="text-align:center;">0.872</cell>
                            <cell style="text-align:center;">0.873</cell>
                            <cell style="text-align:center;"><hi rend="bold">0.872</hi></cell>
                            <cell style="text-align:center;" rows="3">87.2798%</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="smallcaps">you</hi></cell>
                            <cell style="text-align:center;">0.886</cell>
                            <cell style="text-align:center;">0.914</cell>
                            <cell style="text-align:center;">0.900</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;"><hi rend="smallcaps">thou</hi></cell>
                            <cell style="text-align:center;">0.848</cell>
                            <cell style="text-align:center;">0.803</cell>
                            <cell style="text-align:center;">0.825</cell>
                        </row>
                    </table>
                </div>
                <div>
                    <head>Feature Comparison of the Models</head>
                    <p>Overall, the final models contain similar sets of features. The exact
                        compositions are given in Table 4. What is surprising is that the binary
                        classification model for the decision tree is very different from the other
                        models: it does not contain any of the words from the n-gram as a predictor,
                        whereas the others do.</p>
                    <table rend="table-scroll">
                        <head>Table 4: Features included in the best model of each algorithm</head>
                        <row role="label">
                            <cell>Algorithm</cell>
                            <cell style="text-align:center;">Type</cell>
                            <cell>Features included</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="2">Naive Bayes</cell>
                            <cell style="text-align:center;">Trinary</cell>
                            <cell>LW1, LW2, RW1, RW2, S_ID</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;">Binary</cell>
                            <cell>LW1, LW2, LW3, RW1, RW2, RW3, A_ID</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="2">Decision Tree</cell>
                            <cell style="text-align:center;">Trinary</cell>
                            <cell>LW1, LW2, RW1, RW2, S_ID, Stat_Diff, Neg_Sent</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;">Binary</cell>
                            <cell>Scene, S_ID, S_Gender, A_ID, A_Status, A_Age,
                                Stat_Diff, Pos_Sent</cell>
                        </row>
                        <row style="border-top:1px solid #333333;">
                            <cell rows="2">Support Vector Machine</cell>
                            <cell style="text-align:center;">Trinary</cell>
                            <cell>LW1, RW1, S_ID, S_Age, A_ID, A_Age, A_Number,
                                Stat_Diff, Pos_Sent, Neg_Sent</cell>
                        </row>
                        <row>
                            <cell style="text-align:center;">Binary</cell>
                            <cell>LW1, RW1, S_ID, S_Age, A_ID, A_Age, A_Number,
                                Stat_Diff, Pos_Sent, Neg_Sent</cell>
                        </row>
                    </table>
                </div>
            </div>
            <div>
                <head>Discussion</head>
                <p>This study has given some new insights into the analysis of pronominal address
                    terms. Looking at the second person singular pronoun choice as a binary and a
                    trinary classification problem resulted in slightly different outcomes. Even
                    though the highest scores were achieved in the binary classification, one might
                    still wonder whether this is the best method for addressing the second person
                    singular pronoun choice. Looking back at prior studies on pronoun interpretation
                    and comparing them to the features used in this study, we can conclude that <hi
                        rend="italic">thee</hi> and <hi rend="italic">thou</hi> are equal in their
                    opposition to <hi rend="italic">you</hi>, with the main difference being their
                    grammatical role. From the model comparison, we have seen that the co-text is
                    most important when predicting the pronoun. This is evidence of the purely
                    grammatical difference between <hi rend="italic">thou</hi> and <hi rend="italic"
                        >thee</hi> and their overall similarity in other aspects. Therefore, both
                    linguistically and computationally, it makes more sense to perform a binary
                    classification.</p>
                <p>Differences between the algorithms were observed, but all three algorithms easily
                    outperformed the baseline. The support vector machine models performed best, but
                    the scores for the Naive Bayes models were quite similar to those for the SVM
                    models. A choice between these approaches could be based solely on the scores
                    for accuracy, precision, recall and F-measure, or could also take into account
                    model complexity, which is considerably higher for the support vector machine
                    models. The more nuanced models that the support vector machine creates, which
                    include more features than the models of the other algorithms, may suggest that
                    the extra complexity of SVM models is indeed beneficial.</p>
                <p>The best predicting features were the LW and RW features, which supports the
                    importance of the direct linguistic co-text. In particular, RW1 emerged as the
                    most important feature in predicting the second person singular pronominal
                    address term. Other important features were the speaker’s name, addressee’s
                    name, status differential, positive sentiment and negative sentiment, with
                    additional support from the speaker’s gender, addressee’s status, addressee’s
                    age, speaker’s age, and number of people addressed. Only six features were not
                    included in any of the models: genre, play, production date, location, speaker’s
                    status and addressee’s gender.</p>
                <p>I am, therefore, now able to reject the null hypothesis that it is not possible
                    to build a reliable prediction model based on linguistic and extra-linguistic
                    features. All six models demonstrate that linguistic and extra-linguistic
                    features substantially improve the prediction of the pronominal address term, as
                    all six outperform the baseline.</p>
                <p>The second hypothesis, about which features would be good predictors, was
                    partially correct in predicting that social status, age and sentiment would be
                    included in the best models. However, none of these features was the main
                    predictor of pronoun choice; that role fell to the immediate co-text.</p>
                <p>With regard to the final hypothesis, the results indicate that the features are
                    partly dependent on and partly independent of each other. However, since the
                    Naive Bayes models, which assume that the features are independent, perform
                    almost identically to the support vector machine models, we can say that the
                    features are, for the most part, independent of one another.</p>
            </div>
            <div>
                <head>Conclusions</head>
                <p>The primary finding of this study is that it is indeed possible to build a
                    prediction model for the use of <hi rend="smallcaps">you</hi> versus <hi
                        rend="smallcaps">thou</hi> with a singular referent in the plays of
                    Shakespeare that is based on linguistic and extra-linguistic features. In
                    particular, the direct linguistic co-text of the second person singular
                    pronoun is important. Other important features include the speaker’s and
                    addressee’s names, status differential and both positive and negative sentiment.
                    All in all, this suggests that the pronoun choice is influenced by several
                    linguistic and extra-linguistic features.</p>
                <p>The best-scoring model was the binary classification model of the support vector
                    machine, which reached 87.3% accuracy.</p>
                <p>For future research, I would recommend an exploration of other algorithms and
                    features that were left out of this study, such as morphology, word embeddings
                    and POS tags. This would help us gain more information about the linguistic
                    co-text directly surrounding the second person singular pronoun, which would
                    likely give more insight into why this direct co-text is so important in
                    deciding the choice of <hi rend="smallcaps">you</hi> or <hi rend="smallcaps"
                        >thou</hi>. Moreover, including familiarity between characters (social
                    distance) as a feature would be beneficial, as this has been noted multiple
                    times in prior research as an influential factor, but was beyond the scope of
                    this study.</p>
                <p>Although this study has not yet provided a comprehensive set of all the
                    linguistic and extra-linguistic features that influence the second person
                    singular pronoun choice in Shakespeare’s plays, it has definitely provided a
                    more objective and extensive analysis of the matter that furthers the research
                    into <hi rend="smallcaps">you</hi> and <hi rend="smallcaps">thou</hi>.</p>
            </div>
            <div>
                <head>Acknowledgements</head>
                <p>The research presented in this article was conducted in collaboration with the
                    Encyclopaedia of Shakespeare’s Language project at Lancaster University. This
                    project is funded by the UK’s Arts and Humanities Research Council (AHRC), grant
                    reference AH/N002415/1. The Shakespeare corpus will be made publicly available
                    in Summer 2019, first via the CQPweb interface and then through download at a
                    later stage. Many thanks to Jonathan Culpeper and the rest of the team for their
                    advice and support throughout the study.</p>
            </div>
        </body>
        <back>
            <div type="bibliography">
                <head>References</head>
                <listBibl>
                    <head>Literature:</head>
                    <bibl xml:id="Bate.2007">Bate, Jonathan, and Eric Rasmussen, eds. 2007. <hi
                            rend="italic">William Shakespeare: Complete works</hi>. London: The
                        Royal Shakespeare Company.</bibl>
                    <bibl xml:id="Brown.1960">Brown, Roger W., and Albert Gilman. 1960. “The
                        pronouns of power and solidarity.” In <hi rend="italic">Style in
                            language</hi>, edited by Thomas A. Sebeok, 253-76. Cambridge: MIT
                        Press.</bibl>
                    <bibl xml:id="Busse.2006">Busse, Beatrix. 2006. <hi rend="italic">Vocative
                            constructions in the language of Shakespeare</hi>. Amsterdam: John
                        Benjamins.</bibl>
                    <bibl xml:id="Busse.2003">Busse, Ulrich. 2003. “The co-occurrence of nominal and
                        pronominal address forms in the Shakespeare Corpus: Who says thou or you to
                        whom?” In <hi rend="italic">Diachronic perspectives on address term
                            systems</hi>, edited by Irma Taavitsainen and Andreas H. Jucker,
                        193-221. Amsterdam: John Benjamins.</bibl>
                    <bibl xml:id="Busse.2002">Busse, Ulrich. 2002. <hi rend="italic">The function of
                            linguistic variation in the Shakespeare corpus: A corpus-based study of
                            the morpho-syntactic variability of the address pronouns and their
                            socio-historical and pragmatic implications</hi>. Amsterdam: John
                        Benjamins.</bibl>
                    <bibl xml:id="Calvo.1992">Calvo, Clara. 1992. “Pronouns of address and social
                        negotiation in As You Like It.” In <hi rend="italic">Language and
                            Literature</hi>, Vol. 1(1), 5-27. London: Longman Group UK Ltd.</bibl>
                    <bibl xml:id="Greenblatt.1997">Greenblatt, Stephen, Walter Cohen, Jean E.
                        Howard, and Katherine E. Maus. 1997. <hi rend="italic">The Norton
                            Shakespeare: Based on the Oxford edition</hi>. New York: W.W. Norton
                        &amp; Company, Inc.</bibl>
                    <bibl xml:id="Mazeland.2003">Mazeland, Harrie. 2003. <hi rend="italic">Inleiding
                            in de conversatieanalyse</hi>. Bussum: Coutinho bv.</bibl>
                    <bibl xml:id="Mazzon.2003">Mazzon, Gabriella. 2003. “Pronouns and nominal
                        address in Shakespearean English: A socio-affective marking system in
                        transition.” In <hi rend="italic">Diachronic perspectives on address term
                            systems</hi>, edited by Irma Taavitsainen and Andreas H. Jucker, 223-49.
                        Amsterdam: John Benjamins.</bibl>
                    <bibl xml:id="Quennell.2002">Quennell, Peter, and Hamish Johnson. 2002. <hi
                            rend="italic">Who’s who in Shakespeare</hi>. London: Routledge.</bibl>
                    <bibl xml:id="Quirk.1974">Quirk, Randolph. 1974. “Shakespeare and the English
                        language.” In <hi rend="italic">The linguist and the English language</hi>,
                        edited by R. Quirk, 46-64. London: Edward Arnold.</bibl>
                    <bibl xml:id="Stein.2003">Stein, Dieter. 2003. “Pronominal usage in Shakespeare:
                        Between sociolinguistics and conversation analysis.” In <hi rend="italic"
                            >Diachronic perspectives on address term systems</hi>, edited by Irma
                        Taavitsainen and Andreas H. Jucker, 251-307. Amsterdam: John
                        Benjamins.</bibl>
                    <bibl xml:id="Taavitsainen.2003">Taavitsainen, Irma, and Andreas H. Jucker.
                        2003. “Introduction.” In <hi rend="italic">Diachronic perspectives on
                            address term systems</hi>, edited by Irma Taavitsainen and Andreas H.
                        Jucker, 1-25. Amsterdam: John Benjamins.</bibl>
                    <bibl xml:id="Thelwall.2010">Thelwall, Mike, Kevan Buckley, Georgios Paltoglou,
                        Di Cai, and Arvid Kappas. 2010. “Sentiment strength detection in short
                        informal text.” <hi rend="italic">Journal of the American Society for
                            Information Science and Technology</hi>, 61(12): 2544-58. <ref
                            target="https://doi.org/10.1002/asi.21416"
                            >https://doi.org/10.1002/asi.21416</ref>.</bibl>
                    <bibl xml:id="Walker.2003">Walker, Terry. 2003. “<hi rend="italic">You</hi> and
                            <hi rend="italic">thou</hi> in Early Modern English dialogues: Patterns
                        of usage.” In <hi rend="italic">Diachronic perspectives on address term
                            systems</hi>, edited by Irma Taavitsainen and Andreas H. Jucker, 309-42.
                        Amsterdam: John Benjamins.</bibl>
                </listBibl>
            </div>
            <div type="summary">
                <docAuthor>Isolde van Dorst</docAuthor>
                <head style="text-transform: uppercase;">You, thou and thee: A statistical analysis of Shakespeare’s
                        use of pronominal address terms</head>
                <head rend="subheader" style="text-transform: uppercase;">Summary</head>
                <p>Much research has been undertaken on the use of <hi rend="italic">you</hi>, <hi
                        rend="italic">thou</hi> and <hi rend="italic">thee</hi> in Shakespeare’s
                    works. However, the results so far have yet to arrive at an exact and conclusive
                    answer regarding how these pronouns were used. This study combines the strengths
                    of multiple research fields in an effort to determine via hitherto unused
                    computational methods which linguistic and extra-linguistic features influence
                    the second person singular pronoun choices in the plays of Shakespeare. In the
                    English of Shakespeare’s time, the now-archaic distinction between <hi
                        rend="smallcaps">you</hi> and <hi rend="smallcaps">thou</hi> persisted, and
                    is usually reported as being determined by relative social status and personal
                    closeness of speaker and addressee. However, even between studies with similar
                    outcomes, the results vary considerably in the degree of influence and in the
                    inclusion or exclusion of a wide range of other potential influencing factors.
                    Therefore, it remains to be determined whether statistical machine learning will
                    support this traditional explanation.</p>
                <p>In this study, 23 linguistic and extra-linguistic features are investigated,
                    having been selected from multiple linguistic areas, such as pragmatics,
                    sociolinguistics and conversation analysis. The three algorithms used, Naive
                    Bayes, decision tree and support vector machine, are selected as illustrative of
                    a range of possible models in light of their contrasting assumptions and
                    learning biases. Two predictions are performed, firstly on a binary (<hi
                        rend="smallcaps">you</hi>/<hi rend="smallcaps">thou</hi>) distinction and
                    then on a trinary (<hi rend="italic">you</hi>/<hi rend="italic">thou</hi>/<hi
                        rend="italic">thee</hi>) distinction, giving six final models to compare.
                    This is a strictly empirical study, which attempts to verify the findings of
                    earlier research through a computational approach. Its aim and main focus is to
                    try and find a pattern or model that best explains the use of second person
                    singular pronominal address terms in Shakespeare, rather than simply achieve the
                    best performing model.</p>
                <p>The primary finding of this study is that it is indeed possible to build a
                    prediction model for the use of singular second person pronouns in the plays of
                    Shakespeare based on linguistic and extra-linguistic features. In particular,
                    the direct linguistic co-text of the pronoun is the most important feature in
                    all of the models except one. Several other features also influence the
                    pronoun prediction, including the names of the speaker and
                    addressee, the status differential, and positive and negative sentiment.
                    Additionally, all three algorithms easily outperformed the baseline. Out of the
                    three algorithms, the support vector machine models score best. However, the
                    Naive Bayes models perform almost equally well. This suggests that the features
                    are, for the most part, independent of one another. When comparing the binary
                    and trinary classification outcomes, the binary models scored better than the
                    trinary ones. Looking back at prior studies on pronoun interpretation and
                    comparing them to the features used in this study, we can conclude that <hi
                        rend="italic">thee</hi> and <hi rend="italic">thou</hi> are equal in their
                    opposition to <hi rend="italic">you</hi>, with the main difference being their
                    grammatical role. Therefore, both linguistically and computationally, it makes
                    most sense to use the binary classification.</p>
            </div>
            <div type="summary" xml:lang="sl">
                <docAuthor>Isolde van Dorst</docAuthor>
                <head style="text-transform: uppercase;">You, thou in thee: statistična analiza uporabe izrazov
                        zaimkovnega naslavljanja pri Shakespearu </head>
                <head rend="subheader" style="text-transform: uppercase;">Povzetek</head>
                <p>O uporabi zaimkov <hi rend="italic">you</hi>, <hi rend="italic">thou</hi> in <hi
                        rend="italic">thee</hi> v Shakespearovih delih je bilo opravljenih veliko
                    raziskav. Vendar rezultati doslej še niso dali natančnega in dokončnega odgovora
                    o tem, kako so se ti zaimki uporabljali. Študija združuje prednosti z različnih
                    raziskovalnih področij, da bi z računalniškimi metodami, ki doslej še niso bile
                    uporabljene, ugotovili, katere jezikovne in nejezikovne značilnosti vplivajo na
                    izbiro osebnega zaimka druge osebe ednine v Shakespearovih igrah. V angleščini,
                    ki se je uporabljala v Shakespearovem obdobju, je razlikovanje med YOU in THOU,
                    ki je danes arhaično, še obstajalo. Običajno se navaja, da sta ga določala
                    relativni družbeni status ter osebna bližina govorca in naslovljenca. Vendar pa
                    se tudi med študijami s podobnimi rezultati ti zelo razlikujejo glede stopnje
                    vplivanja ter upoštevanja ali neupoštevanja številnih drugih mogočih dejavnikov
                    vpliva. Zato je treba še ugotoviti, ali bo statistično strojno učenje potrdilo
                    to tradicionalno razlago.</p>
                <p>V tej študiji se proučuje 23 jezikovnih in nejezikovnih značilnosti, izbranih z
                    različnih jezikoslovnih področij, kot so pragmatika, sociolingvistika in analiza
                    pogovora. Trije uporabljeni algoritmi – naivni Bayesov klasifikator, odločitveno
                    drevo in metoda podpornih vektorjev – so izbrani kot ilustrativni nabor možnih
                    modelov zaradi njihovih kontrastnih predpostavk in učne pristranskosti. Opravita
                    se dve napovedi, prva o binarnem (<hi rend="italic">you/thou</hi>) razlikovanju
                    in druga o trinarnem (<hi rend="italic">you/thou/thee</hi>) razlikovanju, s
                    čimer dobimo šest končnih modelov, ki jih lahko primerjamo. Študija je strogo
                    empirična, njen cilj pa je z računalniškim pristopom preveriti ugotovitve
                    predhodnih raziskav. Osredotoča se predvsem na iskanje vzorca ali modela, ki bi
                    najbolje pojasnil uporabo izrazov zaimkovnega naslavljanja za drugo osebo ednine
                    pri Shakespearu, in ne le na oblikovanje modela, ki deluje najbolje.</p>
                <p>Temeljna ugotovitev te študije je, da je resnično mogoče oblikovati napovedni
                    model za uporabo zaimkov za drugo osebo ednine v Shakespearovih igrah na podlagi
                    jezikovnih in nejezikovnih značilnosti. Poleg tega je neposredni jezikovni
                    kontekst zaimka najpomembnejša značilnost v vseh modelih razen v enem. Na
                    napoved zaimka vpliva tudi več drugih značilnosti, vključno z imenom govorca in
                    naslovljenca, razliko v statusu ter pozitivnim ali negativnim mnenjem. Vsi trije
                    algoritmi so tudi z lahkoto dosegli boljše rezultate od izhodišča. Od vseh treh
                    algoritmov daje najboljše rezultate metoda podpornih vektorjev. Vendar tudi
                    modeli naivnega Bayesovega klasifikatorja dosegajo skoraj enako dobre rezultate.
                    Iz tega izhaja, da so značilnosti večinoma neodvisne druga od druge. Primerjava
                    binarne in trinarne klasifikacije je pokazala, da so rezultati binarnih modelov
                    boljši od rezultatov trinarnih. Če primerjamo predhodne študije o interpretaciji
                    zaimkov z značilnostmi, uporabljenimi v tej študiji, lahko ugotovimo, da sta
                    zaimka <hi rend="italic">thee</hi> in <hi rend="italic">thou</hi> v opoziciji z
                    zaimkom <hi rend="italic">you</hi> enakovredna, pri čemer je najpomembnejša
                    razlika njihova slovnična vloga. Zato je z jezikoslovnega in računalniškega
                    stališča najbolj smiselna uporaba binarne klasifikacije. </p>
            </div>
        </back>
    </text>
</TEI>
