A Mixed-principle Rule-based Approach to the Automatic Syllabification of Serbian

In this paper we present a mixed-principle rule-based approach to the automatic syllabification of Serbian, based on prescriptive rules from traditional grammar in combination with the Sonority Sequencing Principle. We explore the problems and limitations of the existing rule set and sonority-based approaches, introduce an algorithm that utilizes both means in an attempt to produce a more accurate segmentation of words into syllables that is better aligned with the intuition of the native speakers, and present the statistical data related to the distribution of syllables and their structure in Serbian.


Introduction
Syllables have been considered -although not unequivocally (cf. Koehler 1966) -to be one of the basic units in phonology constituting the minimal units of pronunciation, and to play a role in prosody, phonotactics, and phonological processing (Ladefoged and Johnson 2014). The role of the segmentation of words into syllables and their distributional properties began to see an increase in importance in speech technologies in the 1990s (Iacoponi and Savy 2011), most notably in the areas of speech recognition (SR) and text-to-speech synthesis (TTS).
Syllable segmentation today plays a role in speech technologies on the segmental level -conditioning the length of segmental units such as consonants and vowelsas well as on the prosodic level -governing rhythmical alternations (Bigi and Petrone 2014). Syllable segmentation is also a key component in hyphenation (e.g. Kaplar et al. 2018), although it should be noted that, at least in Serbian, hyphenation is governed by a partially diverging set of rules from those governing syllabification 1 . Syllable distribution data is also of crucial importance for psycholinguistic experiments, as syllable frequency has been shown to play a role in the processing of words (e.g. Barber et al. 2004;Cholin et al. 2006;Cholin and Levelt 2009). Developing an automatic system of syllabification allows for the segmentation of large-scale language corpora needed for the development of automatic systems or the extraction of relevant data related to frequency syllable distributions, which would otherwise require a large number of trained annotators and would be a resource and cost heavy undertaking.
The two generally distinguishable approaches to automatic syllabification are rule-based versus data-driven approaches (Marchand et al. 2009). While data-driven approaches have taken over many aspects of natural language processing, and there are a number of data-driven models of syllable segmentation using artificial neural networks (e.g. Daelemans and van den Bosch 1992;Hunt 1993;Stoianov et al. 1997;Landsiedel et al. 2011), the unavailability of segmented data for Serbian makes rulebased approaches the only viable option for automatic syllabification in Serbian.
To the best of our knowledge, there is a single publicly available attempt at developing a rule-based syllabifier for Serbian by Kaplar et al. (2018). In this paper we lay out a number of problems and limitations with the ruleset used in their syllabification system and why relying on the existing set of prescriptive rule descriptions from traditional grammar is insufficient to capture and describe a syllabification system that is aligned with the intuition of native speakers of Serbian. A relatable attempt at automatic syllabification was developed by Meštrović et al. (2015) for Croatian, the key difference between their work and ours being in the principle behind the syllabification algorithm which in their case relied solely on the onset maximization principlelimiting possible syllable onsets to valid onsets at the beginning of words. Taking into account Morelli's (1999) limitations on possible syllable onsets in Serbo-Croatian, the onset maximization principle employed by Meštrović et al. could be considered a comparatively liberal system. In order to attempt to constrain our syllabifer, we are decided on a different approach that will not rely on onset maximization, but rather a combination of a number of alternative principles.
In this paper we present a mixed-principle rule-based approach to the syllabification of Serbian. Our starting set of rules is based on the Gramatika srpskoga jezika by Stanojčić and Popović (2005), a prescriptive textbook for Serbian grammar that presents a set of rule descriptions for the segmentation of words into syllables. In a previous version of our syllabification algorithm (Kovač and Marković 2018), we made a number of changes to the rule descriptions of Stanojčić and Popović (2005) as the formulation of some of the descriptions proved to be redundant, some were examplebased and not specific enough for a formal implementation, and we also expanded them with three added modifications related to the treatment of nasals and the alveolar sonorant /r/ based on Kašić (2014) and the treatment of alveolar sonorants /l/ and /n/ based on Zec (2000). In this paper we extend our previous algorithm to include a module for validating the structure of syllables in terms of their compliance with the Sonority Sequencing Principle (SSP), thus further fine-tuning the accuracy of our segmentation, and resolving a number of problems noted in our earlier implementation.
The goal of the paper is threefold: i) to improve our system for automatic rulebased syllabification for Serbian based on the formalization of existing rule descriptions by the addition of the sonority sequencing validation module, ii) to provide an analysis of the outcomes of the automatic syllabification process in order to address possible theoretical considerations and serve as a basis for the development of future syllabifiers, and iii) to present statistical data related to the distribution of syllables and their structure in Serbian.

Prescriptive Rule Descriptions
Our starting set of rules was based on the formalization of the rule descriptions governing the segmentation of words into syllables from the Gramatika srpskoga jezika by Stanojčić and Popović (2005). Being a prescriptive textbook on Serbian grammar used at a high school level by all student profiles, we expected these rules to constitute the common knowledge base shared by the majority of native speakers.
(1) In words made up of multiple phonemes, consonants, sonorants and vowels, the syllable boundary comes after the vowel and before the consonant (e.g. či-ta-ti [to read]). In addition to this general rule, they list the following rules -(2), (3), (4), (5) and (6) -that further specify medial syllable boundaries depending on consonant manner of articulation.
(2) Medially, in a consonant cluster which has an affricate or fricative sound in its initial position, the syllable boundary will be before that consonant cluster ( Stanojčić and Popović (2005, 32) also introduce the rule descriptions (7) and (8) to define when the sonorants /r/, /l/, and /n/ constitute syllable nuclei.

Revising the Existing Rule Set
The development of our syllabification algorithm has been an iterative process testing the existing rule set and making changes as needed. While other authors (e.g. Kaplar et al. 2018) used the rule descriptions of Stanojčić and Popović (2005) directly to implement a software architecture for syllabification in Serbian, we have found a number of problems with this approach.
The definition of the rule description under (1) causes the initial member of a consonant cluster in the rule descriptions under (2)-(6) to be understood as the first consonant following a vowel. However, given that the sonorants /r/, /l/, and /n/ can also constitute syllable nuclei in Serbian in certain positions, as presented under rule descriptions (7) and (8), a more precise definition would be that the initial member of a consonant cluster is the first consonant following an element that constitutes a syllable nucleus. The general rule under (1) should be then revised as follows.
In addition to our expansion of the general rule presented under (1) to include the syllable bearing sonorants, while formalizing the rule descriptions via finite-state automata, rules (2) and (3) proved to be redundant as they produced identical outcomes to the general rule under (1*). Because of this, these rules were disregarded in our syllabification algorithm.
During our early testing of the verbatim implementation of the rule descriptions, we also noticed that the existing rule descriptions treated a consonant cluster consisting of a nasal in initial position followed by a consonant that is not one of the sonorants /j/, /v/, /l/, /ʎ/, and /r/ as a part of the following syllable onset, producing outcomes such as: gu-ngula [commotion], mo-mci [guys], ka-ncelarije [offices], su-nce [sun], etc. Contrary to Stanojčić and Popović (2005), authors such as Kašić (2014) argue that nasals should be treated analogously to plosives during syllabification because there is a complete occlusion in the oral cavity during their production. If this principle were to be employed, rule (5) should be revised as follows.
(5*) If a consonant cluster consists of a plosive or nasal in its initial position and some other consonant except the sonorants /j/, /v/, /l/, /ʎ/, and /r/, the syllable boundary will be between the consonants.
Following rule (5*), the examples above would then be segmented as: gun-gula [commotion], mom-ci [guys], kan-celarije [offices], sun-ce [sun], etc. Even though in the earlier implementation of our syllabifier (Kovač and Marković 2018) we did not want to employ the Sonority Sequencing Principle (SSP), we opted for the treatment of nasals by Kašić (2014) in our implementation, which respected the limitations put forward by the Sonority Hierarchy, and was more in line with native speaker intuition.

The Sonority Hierarchy
Sonority Theory accounts for the organization of segments into well-formed sequences, both within the syllable and across syllabic boundaries. This organization is driven by principles of sonority, a property that is used as the basis of ranking all sounds along a scale, from less sonorous to more sonorous ones. Although there is a general consensus that segments are ranked by their inherent sonority, the notion of sonority itself is not unambiguously described in the phonetic and phonological literature. Among the phonetic approaches, Ladefoged (1982) defines sonority as the perceptual salience or loudness of a sound, and Bloch and Trager (1942; according to Goldsmith 1995) define it as the amount of airflow in the resonance chamber. For others, sonority is dependent on multiple phonetic parameters (Ohala and Kawasaki 1984;Ohala 1990;Butt 1992). In the phonological literature, sonority is generally defined as a multi-valued feature (Foley 1972;Hankamer and Aissen 1974;Selkirk 1984), although there are also authors who argue that it is derivable from the more basic binary features of phonological theory (Clements 1990). Other questions that are often addressed are whether sonority scales are universal or language-specific, allowing freedom to languages in assigning sonority values, and how fine-grained distinctions sonority scales should capture. For example, Clements' universal sonority scale includes only four major classes of consonants (Clements 1990), ranked from least sonorous to most sonorous, as in (i): (i) O < N < L < G (O = obstruents, N = nasals, L = liquids, G = glides) Selkirk (1984, 112) proposes a much more detailed scale, which divides all sounds into 11 groups, assuming more subtle differences in sonority values. Selkirk also states that the sonority indices may not be as important in themselves as the sonority relations that they express. Selkirk's scale of sonority in consonants is given in (ii): (ii) p, t, k < b, d, g < f, θ < v, z, ð < s < m, n < l < r Sonority scales serve as the basis of constructing segment sequences within syllables. The universal cross-linguistic generalization is that in the sequence of segments, the one ranking highest on the sonority scale constitutes the peak of the syllable, i.e. it is the syllabic nucleus. As for the other segments around the nucleus, they are organized so that the more sonorous ones are closer to the nucleus, and less sonorous ones are more distant. This generalization is referred to as Sonority Sequencing Principle (SSP). Thus a syllable with an ascending sonority slope in the onset and a descending slope in the coda, such as, for example blunt, is a well-formed syllable, whereas *lbutn is prohibited, due to the violation of the SSP. Adopting thee SSP often solves the problems of syllabic consonants, since they generally occur in environments where they constitute a sonority peak, as in the Serbian word pr-vi.

The Need for Sonority
Apart from the segmentation of nasals analogously to plosives following Kašić (2014) that relied on principles of the SSP, in our initial attempt at the formalization of the rule description under (8) of Stanojčić and Popović (2005) we had to rely on sonority to define the criteria for when the alveolar sonorants /l/ and /n/ act as syllable nuclei.
As Stanojčić and Popović gave no formal criteria defining the contexts of syllable bearing /l/ and /n/, our initial attempt to draw on generalizations based on their examples for syllable carrying /l/ (Stlp, Vlča glava, Žlne, Vlava, Plzen) and /n/ (Idn, Ibn-Saud). In analogy to the rules descriptions under (7a) and (7b) and our added rule (7c*) defining the contexts in which the alveolar phoneme /r/ can act as a syllable nucleus, we implemented rule (8*) to define the conditions under which the phonemes /l/ and /n/ can act as syllable bearing nuclei.
(8*) The other two alveolar sonorants, /l/ and /n/, can be syllable carriers if they are found: a) medially between two consonants, b) initially before a consonant, or c) finally after a consonant.
However, the formulation under (8*) allowed for outcomes such as: Be-rn, Ka-rl, erla-jn, Kla-jn, kasa-rn-skim, Linko-ln, Va-jl-om, etc. in which the phonemes /l/ and /n/ identified as syllable nuclei have a lower sonority level than the consonants in their onset or coda. Because the phonemes /r/ and /j/ are more sonorous than the phonemes /l/ and /n/, and the lateral phoneme /l/ is more sonorous than the nasal phoneme /n/, native speakers do not perceive the elements of lower sonority as syllable nuclei in these contexts. Zec (2000) states that alveolar sonorants can be syllable bearing elements in Serbian only in contexts in which there is no segment of a higher level of sonority in their immediate vicinity. Because of this, we needed to further specify rule (8*) to take sonority constraints into consideration as follows.
(8**) The other two alveolar sonorants, /l/ and /n/, can be syllable carriers if they are found: a) medially between two consonants of lower sonority, b) initially before a consonant of lower sonority, or c) finally after a consonant of lower sonority.
It turns out that this principle can also account for the behavior of the syllable bearing /r/ in Serbian. In fact, it does not only provide a general account for consonantal syllabic nuclei in Serbian that subsumes the rules under (7) and (8**) it also accounts for our extension of rule (7) that keeps the the consonant cluster /rje/ of the ijekavica dialect unsegmented in initial position 2 . Because the phoneme /j/ has a higher level of sonority than /r/, the phoneme /r/ should not be treated as a syllable nucleus initially in words such as rjeka [river].
In our previous implementation of the syllabifier (Kovač and Marković 2018), we attempted to limit our reliance on the Sonority Sequencing Principle to the cases above. However, during the evaluation of our algorithm, we encountered a number of syllable structures that were unexpected due to their absence from the onset maximization approach to syllabification developed for Croatian by Meštrović et al. (2015). Namely, we encountered the syllable structure CCCCCVC in mo-na-rhstvom [with the monarchy], the structure CCCCCV in the words se-rbska The way we attempted to remedy this issue was to limit the syllable onset length three-syllable clusters, which is the maximum length of non-syllabic consonant clusters word initially in Serbian (Kašić 2014). While this constraint, in combination with rules (5) and (6) , alp-skog [alpine] -some medial clusters with a syllabic consonant still remained a problem. For example, in the word najstrpljiviji [most patient], which contains a syllabic /r/, the syllable boundary that would be placed between /na/ and /jstr/ -na-jstr-pljiviji -which does not coincide with native speaker intuition. The Sonority Sequencing Principle seems like a perfect solution for this cases, as it would require the structure of a syllable to follow a sonority scale, with the syllable nucleus being the most sonorous element, while sonority would gradually decrease towards the periphery of the syllable (Zec 2000). With this added sonority requirement, the phoneme /j/, being more sonorous than /s/ and /t/, would have to constitute a part of the previous syllable where it would be of a lower sonority when compared to its neighbouring syllable bearing vowel, and the syllable boundary would be naj-str-pljiviji which is in line with native speaker intuition.
As a final check following rules (1)-(8**), we add rule (9) that has the ability to shift the syllable boundary in order to avoid a violation of the sonority hierarchy.
(9) If the syllable structure resulting from rules (1)-(8**) does not conform to the Sonority Sequencing Principle, move the boundary so that the phoneme violating the sonority sequence is shifted into the neighboring syllable.

An Adapted Sonority Hierarchy
In our sonority sequencing module, we relied on a combination of Selkirk's (1984) sonority scale, the sonority apertures for Serbian described by Subotić et al. (2012), and some notes on sonority sequencing in Serbian from Zec (2000). Our sonority scale is shown under (iii). (iii) p, t, k < b, d, g < ts, tʃ, tɕ < f, ʃ, h < v, z, ʒ < s < m, n, ɲ < l, ʎ < j, r < a, e, i o, u The highest sonority group in our implementation was made up by the vowels of Serbian. As vowels constitute syllable nuclei and there can only be a single vowel per syllable, we did not need to make a distinction between three sonority apertures of vowels (i, u < e, o < a) as it is the case in the hierarchy of Subotić et al. (2012). Following Selkirk (1984), we divided sonorants into three sonority classes, and following Zec (2000), we treated liquids as more sonorous than nasals, and, within liquids, the phoneme /r/ as more sonorous than laterals. For the needs of our implementation, we treated the phoneme /r/ and glide /j/ as a single sonority group, although from a theoretical standpoint /j/ would be considered as more sonorous out of the two given its semi-vowel nature. We opted for treating /s/ as an element of higher sonority than voiced fricative despite its voiceless nature following Selkirk (1984), and expanded Selkirk's hierarchy with the addition of affricates between voiceless fricatives and voiced plosives as a parallel to the aperture order presented by Subotić et al. (2012).
It is important to note that there are sequences which clearly do not conform with the SSP in a number of languages, and which may undermine the relevance and power of the sonority hierarchy. A very common pattern, found across a number of unrelated languages, is the possibility of an /s/ + plosive sequence in the syllable onset, which would be in clear violation if we were to adopt the sonority scale outlined above. In Serbian, there is a known ambiguity in syllable segmentation in the case of continuant fricative phonemes. For example, the word postaviti [to set] can be syllabified as both po-sta-vi-ti and pos-ta-vi-ti (Gvozdanović 2011). We therefore adopt the view put forward in Morelli (1999), who argues that fricatives and plosives may be treated as a single class with respect to sonority in these cases -since splitting them into separate classes would make wrong typological predictions -and add an exception to our sonority sequencing module that leaves fricative + plosive sequences as a viable sequence in the syllable onset.

Our Algorithm 3
Our mixed-principle syllabification algorithms consists of the following steps: I. Identify vowels in the word and mark their positions as positions capable of constituting syllable nuclei (based on (1)). II. If a word contains the letters l, n or the letter r not followed by the sequence je in the center of a consonant cluster consisting of elements of lower sonority or at the beginning or a word followed by a consonant of lower sonority, or the letters l or n at the end of a word preceded by a consonant of lower sonority, treat those positions in the word as capable of constituting syllable nuclei (based on (1*), (7), and (8**)). III. For each position identified as capable of constituting a syllable nucleus: A. If it is followed by a sequence of two sonorants, mark the syllable boundary between the two sonorants (based on (4)), except if the second sonorant is j and it is followed by e. If the second sonorant is j followed by e, mark the syllable boundary before the sonorant cluster (based on (6)). B. If it is followed by a sequence of a plosive or nasal and a plosive, fricative, affricate or nasal, mark the syllable boundary between the two consonants (based on (5*)). C. In all other cases mark the syllable boundary after the syllable nucleus (based on (1*)). IV. Run a recursive sonority check (based on (9)): A. If the word consists of more than one syllable, convert the syllable structures identified by the previous steps into sonority group values. B. For each syllable, check if there is a violation of the SSP at the edges of the syllable ignoring the check at the onset on the word-initial syllable and the check in the coda of the word-final syllable. C. If a violation found is a sequence of a fricative followed by a plosive in the onset, ignore the violation. D. If there is a violation, remove the letter from the edge of the syllable, and add it onto the neighboring syllable. E. Repeat until no violation is found.

Syllable Distribution Data
In this section, we present the statistical distribution data of syllables in Serbian based on our updated syllabification process applied to the Serbian Lemmatized and PoS Annotated Corpus SrpLemKor (Popović 2010;Utvić 2011). We chose SrpLemKor for our analysis, because its annotation allowed us to filter out numbers, Roman numerals, abbreviations and non-Serbian words or suffixes in compounds (at least to some extent) and thus reduce noise in the data.
The following results show the syllable distribution statistics based on 3,648,543 non-unique word-forms (word tokens) from SrpLemKor. From a total of 4,681,713 entities (punctuation and word tokens) in our version of the corpus, 113,679 (2.43%) entities of texts #260, #4505 and #4517 were excluded because the files contained faulty encoding. Based on corpus tags, we excluded 919,161 (19.63%) entities tagged PUNCT (punctuation), SENT (sentence separator full-stops), RN (Roman numerals), NUM @card@ (Arabic numerals), ABB (abbreviations) and ? (non-Serbian words and other uncategorized entries). An additional 815 (0.02%) entities that contained the characters w, q and x were removed in an attempt to further reduce noise stemming from foreign words, as not all foreign words were tagged as such in the corpus. In the process of syllabification, an additional 12,877 (0.28%) entities were removed as they were solely made up of consonant clusters with no available syllable nucleus candidate.

Syllable Type Distributions in Serbian
In the 3,648,543 word-forms from SrpLemKor, a total of 8,196,771 syllables were identified. Table 1 presents the syllable type distribution based on our mixed-principle syllabification algorithm. These results show the distribution of syllables in a somewhat noisy data. We found there are still foreign words annotated as non-foreign in the corpus constituting some of the less-frequent syllable structures listed as "Other" in Table 1. For example, an instance of the syllable structure VCCCCC was found to correspond to the segmentation of the German word Pe-itscht [lashes], the syllable structure CCCCVCCC was identified in the German word Fle-i-schmarkt [meat market], and the structure CCCCCVC was found in the German word Gle-i-chschal-tung [co-ordination]. The structure CCCCCCVC was found in the German word Na-chtschat-ten [nightshade] and in the toponym CRYSLER. The syllable structure CCVCCCC was found in the source transcription of the last name Pe-tritsch and in the English word knights. The syllable structure CCCVCCC was identified to be a part of the German words Wolfsmilch [spurge] and E-in-ge-schickt [sent in] and to correspond to the English word string. The syllable structure CCCCCCV was identified in the German words We-ihna-chtsbra-e-u-che [Christmas trees], Stor-chschna-bel [Crane's bill], while the structure CCCCCV was found in the words Re-chtsge-schi-chte [history of law] and Um-gangsspra-che [vernacular], as well as in the sequences šttske and su-žnjstva. The syllable structure CCCCVCC was found in the German word Ze-it-schrift [magazine], and in multiple occurrences of the source spelling of the last names Schmidt and Rot-hchild. The structure VCCCC was found in the German words Deutsch [German], Ernst [seriousness], in the sequence der-demnaechst [soon], and in the strings ikvbv and EHCmc. As can be seen from the examples above, besides foreign origin words, noise in the data can also be found in typos and strings we did not manage to identify. Another example of such string was ngBpJKTnQ identified as the structure VCCCCCCCC. Most structures identified as CVCCCC were the result of typos, e.g. serbsk, kra-levstv, pod-danstv, carstv, slav-jansk, ju-go-slo-venskg, cr-no-gorskg, but also foreign origin names, e.g. Hirsch, Herbst, Lokotsch, and Worlds in additions to strings such as majnds and Gorrrr. In addition to these, one occurrence of the syllable structure CVCCCCCCCC that stood for the onomatopoeic vulgarism mršššššššš [go away].
We also found 2 syllable structures that differed from the structures found by Meštrović et al. (2005) -e-ssel-blu-me [cowslip], and more. We discuss the implications of these for our syllabification algorithm in the Discussion section below.

Syllable Type Positional Distributions in Serbian
We also examined the syllable type frequencies with respect to their position in a word. Four positional frequencies are presented in Table 2: syllable type frequencies in monosyllabic words, and syllables type frequencies in the initial position, in medial positions, and in the final position of polysyllabic words. Based on SrpLemKor, the most frequent monosyllabic syllable structures in Serbian are CV (50%), V (25%) and CVC (11%). The most frequent syllable structures in the initial position of polysyllabic words are CV (56%), CCV (15%) and V (15%). In medial positions in polysyllabic words, the most frequent syllable structures are CV (69%), CCV (14%) and CVC (10%). The most frequent syllable structures in the final position of polysyllabic words are CV (65%), CVC (16%) and CCV (8%). It is interesting to note the asymmetry that the syllable structures CCCVCC, VCCC, and CCCCVC occurred only in monosyllabic words and in the final position of polysyllabic words, while the syllable structure CCCCVC occurred in all positions except the initial position in polysyllabic words.

Syllable Nuclei Statistics in Serbian
The distribution of different syllable nuclei in Serbian based on the SrpLemKor corpus is presented in Table 3. Based on the positional nucleus distribution data, it can be seen that overall /a/ and /o/ constitute the most frequent nuclei in Serbian. However, there is some positional variation. While the most frequent nuclei in final, medial, and initial position of polysyllabic words are also /a/ and /o/, in monosyllabic words, the most frequent nuclei are /a/ and /e/.

Discussion
While our mixed-principle rule-based syllabification algorithm is suitable for the segmentation of words into syllables following the ruleset we devised based by the combination of prescriptive rule descriptions and the employment of the Sonority Sequencing Principle, there are still some practical and theoretical considerations to be addressed.
While reporting on the syllable distribution data, we mentioned that the 3,648,543 word-forms extracted from SrpLemKor used for the calculation of statistical data related to the distribution of syllables and their structure in Serbian still contained some noise such as foreign words, typos, and possibly random character strings. Based on 500 random samples taken from the syllable output data checked by a human evaluator, the estimate of the amount of such noise in the data is <2%. Given the nature of corpus-based data, this noise should not significantly impact the reliability of the distributional information.
From a theoretical standpoint, in formulating our algorithm, we disregarded the three-syllable consonant cluster limitation put forward by Kašić (2014) in favor of exploring the limitations of the sonority module. The occurrence of the two syllable types CCCCVC and CCCCV, which were not present in the onset-maximizationbased syllabification algorithm for Croatian (Meštrović et al. 2015), shows that in a limited number of instances this constraint is needed to exclude syllable clusters that are in accordance with the SSP and prescriptive rule descriptions, but seem contrary to native speaker intuition about syllable boundaries. In addition to this, there is the ambiguity in syllable segmentation in the case of continuant fricative phonemes (Gvozdanović 2011) with the continuant constituting either the first place in the onset of the syllable or the last place in the coda of the previous syllable, e.g. the possibility to syllabify postaviti [to set] as po-sta-vi-ti and pos-ta-vi-ti, would require a larger-scale study examining the intuition of native speakers on syllabification to make an assumption about contemporary tendencies in the segmentation in these contexts.
In order to verify the syllabic status of different clusters, it would be interesting to conduct a series of monitoring studies modeled after Mehler et al. (1981), who have shown that reaction times to a word are faster if the word is primed by a sequence corresponding to a syllable in the word when compared to priming with a string that does not constitute a syllable. Bradley et al. (1993) argue that these effects produce mixed results in some languages which contain a large number of ambisyllabic segments, so these studies may also reveal whether and to what extent syllables play a role in prelexical processing in Serbian.

Conclusion
In this paper we presented a mixed-principle rule-based syllabifier modelled after the rule descriptions found in Stanojčić and Popović (2005), extended by rule specifications from Kašić (2014) and Zec (2000), and complemented by a sonority sequencing module based on Selkirk (1984), Subotić et al. (2012), and Zec (2000).
An implementation of the existing prescriptive rules for the segmentation of words into syllables allowed us to gain an insight into the problem areas of the rule descriptions, and propose a number of revisions and amendments to the existing rules. The sonority sequencing module revealed the need for an additional onset-length limitation constraint, and pointed out the limitations of sonority in ambiguous consonant clusters that would require further exploration and validation by native speaker intuition. We have also gained an insight into the distribution of different syllable structures and syllable nuclei following this approach, which will be useful for comparison with the performance of alternative syllabification systems.
In the future, we plan to compare our system to an onset-maximization-based syllabifier for Serbian in combination with the prescriptive rules to see if we can create an alternative system that will produce outputs consistent with the intuition of native speakers of Serbian. It would be interesting to see a systematic comparison of our current approach and the onset-maximization approach with data gathered from the intuition of contemporary native speakers of Serbian.
We also believe that, while phonological criteria present a basis for syllabification, in the future we will also need to test whether and to what extent approaches based solely on phonological criteria result in syllable boundaries that coincide with morphological boundaries. Our assumption is that phonological rules will need to be amended by morphological criteria to result in syllabification that respects morphological boundaries as well.
In addition to these, the question of the treatment of foreign origin words and transcribed foreign words might be an additional point to consider. As an extension of a syllabifier, a language detection algorithm might be employed to properly segment the former, while the latter might not need special treatment as the process of transcription should in itself contain a degree of phonological adaptation.

A MIxED-PRINCIPLE RuLE-BASED APPROACH TO THE AuTOMATIC SyLLABIFICATION OF SERBIAN
sUMMarY In this paper we present a mixed-principle rule-based approach to the automatic syllabification of Serbian based on prescriptive rule descriptions from traditional grammar found in Stanojčić and Popović (2005), extended by rule specifications from Kašić (2014) and Zec (2000), and complemented by a sonority sequencing module based on Selkirk (1984), Subotić et al. (2012), and Zec (2000).
Syllable segmentation plays a role in speech technologies -most notably in the areas of speech recognition and text-to-speech synthesis -at both the segmental and prosodic levels. It is also one of the governing factors in hyphenation, and syllable frequency distribution data is used in psycholinguistic experiments as a covariate. The unavailability of segmented data for Serbian makes a rule-based approach to automatic syllabification the only viable option as there is no data available for training a datadriven neural network model, and the segmentation of large-scale language corpora by trained annotators would be a resource and cost heavy undertaking.
Our goal in this paper is threefold: i) we extend and improve an earlier version of our syllabification algorithm by introducing a sonority sequencing validation module which resolves a number of issues present in the earlier version of our syllabifier, ii) we provide a detailed analysis of the outcomes of the automatic syllabification process in order to address possible theoretical considerations and serve as a basis for the development of future syllabifiers, and iii) we present the statistical data related to the distribution of syllables and their structure in Serbian to be used in psycholinguistic experiments.
The implementation of the existing set of prescriptive rules for the segmentation of words into syllables in Serbian allowed us to gain an insight into problem areas of the rule descriptions, and propose a number of revisions and amendments to the existing rules. The sonority sequencing module revealed the need for an additional onsetlength limitation constraint, and pointed out the limitations of sonority in ambiguous consonant clusters -such is the case with continuant fricative phonemes that seem to be able to occupy either the first place in the onset of a syllable or the last place in the coda of a previous syllable -that would require further exploration and validation by native speaker intuition.
The data on the distribution of different syllable structures and syllable nuclei following this approach will be useful for comparison with the performance of alternative syllabification systems. In the future, it would be interesting to see a systematic comparison of our current approach to alternative approaches such as an onset-maximization approach evaluated on segmentation data gathered from the native speakers of Serbian.