Bringing Together Linguistics and Social History in Automated Text Analysis of Greek Papyri


1. Introduction

The Greek documentary papyri and ostraca from Egypt present a wealth of information for large-scale variational-linguistic research: they form a considerable corpus—about 4.5 million Greek words—spanning several centuries from the third century BCE to the eighth century CE; they contain various text genres, including texts in a more “everyday” language, such as private letters; and they are sociolinguistically far more diverse than literary texts, counting texts from non-elite writers, women, non-native speakers of Greek, etc. They are, therefore, indispensable when studying the development of the Greek language during the post-classical period.
However, these texts have, so far, received limited attention in Ancient Greek linguistic research. One possible reason for this is the lack of any large-scale linguistically annotated corpus. [1] Although all documentary papyri, ostraca, and tablets are publicly available in XML format in the form of the Duke Databank of Documentary Papyri (DDbDP), [2] only the raw text is included, without any tokenization or linguistic annotation. There are also limited data on the sociolinguistic background of the scribes: generally, only the place where the document was found or written and its date. As Ancient Greek encodes a large amount of information in its morphology, this lack of morphological encoding renders it very difficult to search the corpus from a linguistic point of view. A variational approach to Ancient Greek would, in addition, benefit from more detailed sociolinguistic information on the scribes and authors of papyrus texts.
This paper will present a first attempt to annotate the complete papyrus corpus for linguistic and socio-linguistic information. For this, we can, fortunately, build upon several tools and databases already publicly available, which include machine learning tools, such as part-of-speech taggers and syntactic parsers, linguistically annotated corpora of Greek to train these tools on, and databases with historical data on the papyrus texts. The second goal of this paper is, therefore, to show how natural language processing in the papyri can benefit from these resources, with a particular focus on the historical ones. This is not a one-way street, however: linguistic annotations can, in turn, also be a useful asset for socio-historical research. After briefly describing the resources we have drawn on, the following sections will describe the various steps in the annotation process—tokenization, linguistic annotation, and sociolinguistic annotation—and demonstrate how historical and linguistic approaches to the papyri can benefit from each other.

2. Resources

All linguistic information (i.e., part-of-speech/morphology, lemmata, syntax, and semantics) has been determined using a stochastic machine-learning approach, i.e., by inferring this information statistically instead of on the basis of manually coded rules. For part-of-speech/morphological tagging, we used RFTagger, [3] which has been specifically developed to handle languages with large tag sets, in this case Greek; we used Lemming [4] as a lemmatizer; Stanford’s Graph-Based Neural Dependency Parser [5] to determine syntactic dependencies; and several machine learning packages implemented in R [6] for automatic semantic labeling. Since purely stochastic approaches tend to perform rather poorly with highly inflected languages because of the large number of unknown word forms, [7] we also integrated the output of the rule-based morphological analyzer Morpheus. [8] All these tools were trained on Greek treebank data, most prominently the Ancient Greek Dependency Treebanks, [9] the PROIEL, [10] Gorman, [11] Pedalion, [12] and Sematia [13] treebanks. For part-of-speech tagging and lemmatization, we also used a manually tagged Greek New Testament corpus [14] and a Septuagint corpus. [15]
The databases from the Trismegistos project [16] were a valuable asset for our project. Trismegistos is an interdisciplinary platform with information about texts from the Ancient World in general, roughly 800 BCE – 800 CE. Its original focus was on Egypt, however, and the papyrological sources were at the core of setting up the infrastructure. Through cooperation with the Heidelberger Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens (HGV), Trismegistos contains metadata on all documentary papyri in Trismegistos Texts. It is also a partner in the Papyrological Navigator, in which the full text of the DDbDP and the metadata of HGV have been brought together through the unique stable numerical identifier that is the TM id. The presence of the TM number in the DDbDP full text in XML made it possible to draw on information from other TM databases as well. For the project presented here, Trismegistos People, with its separate onomastic and prosopographical tables covering all of Egypt, turned out to be a welcome complement to other lexical tools. The information in Trismegistos Places could also be used, although this is currently less developed lexically. Finally, the Trismegistos Text Irregularities database, developed in cooperation with Joanne Stolk, proved essential for combining the attested and the regularized versions of a text in the analysis.

3. Annotation of the Greek Papyri

3.1 Tokenization

Although tokenization (i.e., the division of a text into “tokens,” including not only words but also punctuation marks) is a relatively trivial step, there are nevertheless some complications. If the text that was originally written by the scribe has been regularized by the modern editor, for example, one has to decide from which version of the text the tokens should be chosen. Henriksson and Vierros (2017) created a tool that separates the two versions from each other, generating both an “original” and a “regularized” tokenized version of the same text. Yet, neither version is particularly suitable for automated linguistic analysis:

  • The “original” version, particularly due to the lack of a unified spelling convention and the fact that it is missing words or characters, is simply too irregular for an automated natural language processing tool to analyze, since such tools are trained on highly regularized literary prose (see section 2).
  • The “regularized” version, on the other hand, is too “regular.” Editors frequently correct not only irregular spellings but also morphosyntactic problems such as case usage. In some instances, even case usage that is consistent with post-classical Greek but violates classical Greek norms is emended. While this would probably be beneficial for natural language processing, we would prefer to see the morphology annotated in the way it appears in the text and not in the editor’s head.

Therefore, we decided that it would be beneficial to include both text versions in the tokenization, in order to be able to choose dynamically between the regularized and original versions of a token, according to the type of regularization (spelling vs. grammatical). This was possible due to the existence of the Trismegistos Text Irregularities database, [17] which classifies each editorial regularization according to its linguistic level: “grapheme,” “phoneme,” “morpheme,” “lexeme,” or “phrase.” “Lexeme” is mostly used for semantic or unexplained scribal “mistakes,” while “phrase” typically tags words and even sections that are supplied by the editor. In the future, our automated linguistic analysis will use the regularized version when the “error” occurs at the phoneme or grapheme level, but the original version, or possibly both, in the case of morphological regularizations. [18]
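The selection rule described above can be illustrated with a minimal Python sketch. The `Token` class, its field names, and the example forms are hypothetical and are not taken from the project's code; the fallback for levels other than grapheme, phoneme, and morpheme is likewise an assumption, since the text does not specify it.

```python
from dataclasses import dataclass
from typing import Optional

# Levels at which the regularized reading is preferred (spelling-type changes).
SPELLING_LEVELS = {"grapheme", "phoneme"}


@dataclass
class Token:
    original: str               # form as written by the scribe
    regularized: Optional[str]  # editor's regularization, if any
    level: Optional[str]        # level of the regularization (hypothetical field)


def forms_for_analysis(token: Token) -> list[str]:
    """Return the form(s) to hand to the linguistic analysis."""
    if token.regularized is None:
        return [token.original]
    if token.level in SPELLING_LEVELS:
        return [token.regularized]
    if token.level == "morpheme":
        # Keep the scribe's morphology; the editor's form stays as an alternative.
        return [token.original, token.regularized]
    # Assumption: for "lexeme", "phrase", or unknown levels, use the edited form.
    return [token.regularized]


# Invented examples: a spelling variant and a case regularization.
print(forms_for_analysis(Token("χέριν", "χαίρειν", "phoneme")))
print(forms_for_analysis(Token("ἀδελφοῦ", "ἀδελφῷ", "morpheme")))
```

Returning a list rather than a single string leaves room for the "possibly both" option mentioned for morphological regularizations.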

3.2 Linguistic Annotation

3.2.1 Part-of-Speech Tagging
Part-of-speech and morphology information was determined using RFTagger, [19] which was specifically developed to handle languages with large tag sets, such as Greek. We followed the approach of Dik and Whaling (2008) for classical Greek literature: using the morphological analysis tool Morpheus (see above, section 2), all possible analyses for each word were determined. Afterward, we added the output of Morpheus as a “lexicon” to RFTagger, allowing the tagger to choose the most probable analysis according to its contextual probability model. As in Dik and Whaling (2008), restricting possible analyses to the ones provided by Morpheus was clearly beneficial for the tagging process: excluding proper names and punctuation marks, we achieved an accuracy of about 94.7% on a manually annotated test corpus of 2,160 words (2,045 tag/morphology combinations identified correctly; see section 2 for the training data).
As for proper names, tagging accuracy was only 75.5%, as the training corpus and Morpheus’s dictionary did not contain a large part of the (mostly Egyptian) names occurring in papyrus texts. However, we resolved this problem by making use of the Trismegistos People database, which contains all attested personal names, in both their inflected forms and their lemmata. We added all lemmata to Morpheus’s dictionary. In addition, as Trismegistos People also contains morphological case information, we used it to correct personal names incorrectly analyzed by the tagger. For place names, we intend to add the lemmata in the Trismegistos Places database to Morpheus’s dictionary as well.
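The idea of restricting the tagger to analyses licensed by a morphological lexicon can be sketched as follows. This is a simplified illustration in Python, not RFTagger's actual interface: the function names, the tag labels, and the scores are invented, and the fallback for forms without any lexicon entry is an assumption. The last line only reproduces the accuracy arithmetic reported above.

```python
def constrained_tag(scores: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring tag among those the lexicon allows."""
    if not scores:
        raise ValueError("no tag scores given")
    candidates = {tag: p for tag, p in scores.items() if tag in allowed}
    if not candidates:
        # Assumption: for forms unknown to the lexicon, use the unconstrained scores.
        candidates = scores
    return max(candidates, key=candidates.get)


def accuracy(correct: int, total: int) -> float:
    """Percentage of correctly identified items."""
    return 100 * correct / total


# Invented contextual scores for one word form.
scores = {"noun.nom.sg": 0.55, "noun.gen.sg": 0.40, "verb.pres.ind": 0.05}

# The lexicon rules out the nominative reading, so the genitive wins.
print(constrained_tag(scores, {"noun.gen.sg", "verb.pres.ind"}))  # noun.gen.sg

# Figures from the text: 2,045 correct out of 2,160 words.
print(round(accuracy(2045, 2160), 1))  # 94.7
```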
3.2.2 Lemmatization
To determine lemma information, we employed Lemming. [20] This tool was trained on the same corpus we used for part-of-speech tagging, using word forms and automatically generated tags as its input. We also employed the lemmata from Liddell-Scott-Jones’s A Greek-English Lexicon (LSJ) as a resource for Lemming and modified Lemming’s code so that it would only consider lemmata that were accepted as valid by Morpheus, unless none were available. With this method, about 99% (2,130/2,158) of the lemmata were identified correctly. Almost all remaining errors were due to incorrect morphology information resulting from the part-of-speech tagging. An example is P. Tebt. 3.1.772 (= TM 5364), where, in line 5, the lemma of διακριθῶ was incorrectly identified as διακριθάω instead of διακρίνω because it was tagged as an active present indicative instead of a passive aorist subjunctive.
For proper names, lemmatization encounters the same problems as part-of-speech tagging. Therefore, all lemmata present in the Trismegistos People database were added to Lemming’s dictionary. We intend to do the same for the lemmata in Trismegistos Places.
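The filtering step, in which only lemmata accepted by the morphological analyzer are considered unless none are available, might look roughly like the Python sketch below. It does not reflect Lemming's internal code: `choose_lemma`, the ranked candidate list, and the example forms are hypothetical.

```python
def choose_lemma(ranked_candidates: list[str], valid_lemmata: set[str]) -> str:
    """Return the best-ranked candidate that the analyzer accepts.

    ranked_candidates: lemmatizer output, best guess first.
    valid_lemmata: lemmata accepted by the morphological analyzer for this form.
    """
    if not ranked_candidates:
        raise ValueError("no candidates given")
    for lemma in ranked_candidates:
        if lemma in valid_lemmata:
            return lemma
    # No candidate is accepted by the analyzer: keep the lemmatizer's best guess.
    return ranked_candidates[0]


# Invented example: the top guess is not a valid lemma, the second one is.
print(choose_lemma(["λύσω", "λύω"], {"λύω"}))  # λύω

# Figures from the text: 2,130 of 2,158 lemmata correct.
print(round(100 * 2130 / 2158, 1))  # 98.7
```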
3.2.3 Syntactic Parsing
Next, the papyri were parsed syntactically using Stanford’s Graph-Based Neural Dependency Parser. [21] The training data come from a large group of treebanking projects, as described in section 2. Since there were many different annotators involved in these projects, these data inevitably contain a large number of inconsistent annotations. We used a mix of rule-based and statistical techniques to resolve a substantial part of them. Moreover, as the PROIEL corpus used an entirely different annotation scheme than the other projects, we converted its annotation scheme to the most frequently used one (i.e., the AGDT scheme, which is based on the Prague Dependency Treebanks) with a number of manually written rules. Although the conversion was not perfect, the large amount of data the PROIEL project offered (about 280,000 tokens) clearly outweighed potential conversion problems: without these data, parsing accuracy dropped by 4 percentage points.
Our test set was a manually annotated corpus of 1,677 sentences (20,869 tokens), taken from the Pedalion treebanks. The LAS (Labeled Attachment Score: the proportion of words that had both their syntactic head and their relation correctly identified) was 84.5%. Interestingly, about one fourth of the errors we found were related to inconsistencies in training or test data, even though we had already reduced a large number of them, which shows how difficult it is to come up with a clear and unambiguous annotation format for Greek syntax.
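For readers less familiar with the metric, the following Python sketch shows how a Labeled Attachment Score is typically computed, alongside the unlabeled variant (UAS). The four-token example and its relation labels are invented for illustration and are not taken from the test set.

```python
# Each token is described by (index of its head, dependency relation); 0 = root.
Arc = tuple[int, str]


def attachment_scores(gold: list[Arc], predicted: list[Arc]) -> tuple[float, float]:
    """Return (UAS, LAS) as fractions between 0 and 1."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    # UAS: head correct; LAS: head and relation both correct.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / len(gold), las_hits / len(gold)


gold = [(2, "SBJ"), (0, "PRED"), (2, "OBJ"), (3, "ATR")]
predicted = [(2, "SBJ"), (0, "PRED"), (2, "ADV"), (2, "ATR")]

uas, las = attachment_scores(gold, predicted)
print(uas, las)  # 0.75 0.5
```

In the example, the third token has the right head but the wrong label, so it counts for UAS only; the fourth token has the wrong head and counts for neither.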
3.2.4 Semantic Analysis
A final step in the linguistic analysis pipeline was the automated processing of meaning. ‘Meaning’ is obviously a very broad term, and therefore we focused on two semantic analysis tasks. Firstly, we modelled lexical meaning, using so-called distributional semantic or vector space models. [22] These models represent the meaning of a word as a vector of real numbers, calculated on the basis of co-occurrence patterns in a large corpus. The goal of distributional semantic modelling is to represent similar words with mathematically similar vectors, so that distance measures can be used to calculate how semantically close two words are to each other. Such techniques require a large amount of input data: we therefore included not only the papyrus corpus but also a large corpus of literary texts taken from the Perseus [23] and First1KGreek [24] projects, which we automatically analyzed with the techniques described above. In particular, we found that including syntactic information greatly improved the accuracy of the distributional models, even though these syntactic parses were calculated automatically and are therefore not flawless.
Next, we also modelled phrasal meaning, in the form of semantic role annotation. These roles represent the semantic relation of the dependents of a verb to the event predicated by it, e.g., its agent, or the time or location of the event. The semantic roles included were the ones defined in the Pedalion grammar of KU Leuven. [25] The distributional word models described above turned out to be particularly helpful for this task. Labeling accuracy for the papyri was 80.9% on a test corpus of 1,646 words, although it was substantially lower for literary texts. One hurdle is the small amount of manually annotated semantic training data for the labeler: it could only make use of about 11,000 training examples, unlike the other tasks, which use a training corpus of almost a million tokens. Hopefully the presence of an automatic labeler [26] may help to increase this number in future annotation projects.
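As a toy illustration of the vector space idea, the Python sketch below represents each lemma by counts of the syntactic contexts it occurs in and compares lemmata by cosine similarity. The counts, the choice of (relation, lemma) pairs as context features, and the use of raw counts are assumptions made for the example; the actual models are considerably more elaborate.

```python
import math

# A sparse context vector: (relation, co-occurring lemma) -> count.
Vector = dict[tuple[str, str], float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse context vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented co-occurrence counts.
vectors: dict[str, Vector] = {
    "μήτηρ": {("ATR", "υἱός"): 4, ("SBJ", "γράφω"): 1},
    "πατήρ": {("ATR", "υἱός"): 5, ("SBJ", "γράφω"): 2},
    "πυρός": {("OBJ", "μετρέω"): 6, ("ATR", "ἀρτάβη"): 3},
}

# Kinship terms share contexts; a kinship term and a grain term do not.
print(round(cosine(vectors["μήτηρ"], vectors["πατήρ"]), 2))
print(cosine(vectors["μήτηρ"], vectors["πυρός"]))
```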
3.2.5 Benefits for Historical Research
As a result of this, we now have a new corpus of almost 4.5 million Greek words annotated for morphology, lemmata, syntax, and semantics. To make this accessible to the scholarly community, we have developed a new Trismegistos database called Words on the basis of the XML. This MySQL database consists of two tables: WORD with the lemmata and WORDREF for the attestations. Accessible under www.trismegistos.org/words, the PHP/JavaScript interface allows users to search for a specific Greek lemma or a morphological category, with immediate figures for their frequency. Upon selection, a survey of all attestations is then provided, with pie charts presenting the relative frequency of each morphological feature (e.g., the number of attestations in the aorist tense, or in the genitive case and plural number). The search can be limited to certain regions, specific periods, the material of the writing surface, or even the type of text, the latter mainly thanks to Joanne Stolk’s work on the Text Irregularities database. It is also possible to select on the basis of whether an attestation is complete, partially reconstructed, or even corrected by the editor or the ancient scribe. An option to filter the results on the basis of the grammatical context (in the sense of immediate vicinity rather than syntactic dependence) is available, as well as export facilities. Recently, a search possibility for morphological features has been added as well. [27]
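To make the two-table design more concrete, here is a hedged Python sketch using an in-memory SQLite database as a stand-in for MySQL. Only the table names WORD and WORDREF come from the description above; every column name, the text identifiers, and the sample rows are hypothetical and do not reflect the actual Trismegistos schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (
    word_id INTEGER PRIMARY KEY,
    lemma   TEXT NOT NULL
);
CREATE TABLE wordref (
    ref_id    INTEGER PRIMARY KEY,
    word_id   INTEGER NOT NULL REFERENCES word(word_id),
    text_id   INTEGER,   -- placeholder for a text identifier
    form      TEXT,      -- attested form
    gram_case TEXT       -- morphological case of the attestation
);
""")

# Invented sample data: one lemma with four attestations.
conn.execute("INSERT INTO word (word_id, lemma) VALUES (1, 'μήτηρ')")
conn.executemany(
    "INSERT INTO wordref (word_id, text_id, form, gram_case) VALUES (?, ?, ?, ?)",
    [
        (1, 1, "μητρός", "genitive"),
        (1, 2, "μητρός", "genitive"),
        (1, 3, "μητρός", "genitive"),
        (1, 3, "μήτηρ", "nominative"),
    ],
)


def case_counts(connection: sqlite3.Connection, lemma: str) -> list[tuple[str, int]]:
    """Count attestations per case for one lemma, most frequent first."""
    return connection.execute(
        """
        SELECT r.gram_case, COUNT(*) AS n
        FROM wordref AS r
        JOIN word AS w ON w.word_id = r.word_id
        WHERE w.lemma = ?
        GROUP BY r.gram_case
        ORDER BY n DESC, r.gram_case
        """,
        (lemma,),
    ).fetchall()


print(case_counts(conn, "μήτηρ"))  # [('genitive', 3), ('nominative', 1)]
```

A grouped count of this kind is the sort of figure that could feed the pie charts of morphological features mentioned above.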
Such a tool has great potential to speed up data collection. In the old days, the lexical method of searching for all possible attestations of specific words relevant to a historical problem relied on indices and took much manual labor. The advent of the DDbDP, with its full text, had already hastened the heuristic undertaking by an order of magnitude and, in the process, revolutionized the way scholars work. A new, completely searchable lemmatized corpus of all Greek documentary texts with information on morphology can even go a step further: for example, a search for μήτηρ results in 19,745 hits, mostly in the genitive because many attestations occur in the Roman period naming formula. For research such as Depauw (2010), the collection of sources and their chronological and geographic survey is now possible in less than a minute, instead of the week of manual labor it then cost. A study of the morphological category “vocative,” such as Dickey (2004), would now also be possible in a fraction of the time traditionally required. Of course, the tool is not yet perfect and even 95% accuracy—remember, proper names excluded—still implies that it contains some 225,000 errors. We have developed a tagging function by which users can easily indicate or correct these errors, but a systematic survey of some typical errors would no doubt be rewarding.
Moreover, the tool opens up fascinating new avenues in the world of Named Entity Recognition, certainly if the syntactical annotation can be improved further. The addition of titles to people identified by name, for example, may be largely automated—something which we are also currently exploring. Finally, the importance of the presence of semantic annotation for historical research should also not be underestimated: the possibility to detect similar words to a given target word will enable historians to search for a broad range of lemmata that express the same concept rather than being tied to the specific lemmata they are personally aware of. The presence of semantic role annotation also allows for queries for places and dates that are not expressed by proper names.

3.3 Sociolinguistic Annotation

The Trismegistos databases already present some sociolinguistic information, such as text genre and the place where the papyrus text was written. Several other language-external variables are, however, currently not available but might also be interesting for sociolinguistic research. In this section, we briefly describe ongoing research to automatically detect the native language of the writers of papyrus texts.
In a multi-lingual environment, such as Hellenistic Egypt, native language interference is an important factor in diachronic language change. [28] It is difficult, however, to estimate the extent to which such interference is a factor in specific changes. So far, there has not been any systematic effort to determine the native language of the writers of papyrus texts. Therefore, we decided to try to infer this information automatically with a machine-learning approach.
One option is to use some texts of which we are reasonably confident about the scribe’s ethnicity as training material to classify other texts. For some letters, the scribe’s ethnicity can already be roughly deduced on the basis of the initial greeting: in the case of Ὀρσενοῦπις Νείλωι τῶι ἀδελφῶι χαίρειν (BGU 2 450 = TM 28143), for instance, the writer has the Egyptian name Ὀρσενοῦπις (TM Nam 568). It is, therefore, likely that the native language of the scribe would also be Egyptian. By grouping texts on the basis of the writer’s onomastically reconstructed ethnicity, they can be used as training material. We can look for specific spelling variants in these texts, or, conversely, for errors in case usage or other grammatical features. In either case, these irregularities can presumably be traced back to the nature of the Egyptian language, e.g., that it does not distinguish voiced from unvoiced sounds and is non-inflecting. [29] Outside of linguistic features, the ethnicity of other names in the text might be a useful feature as well, under the assumption that Egyptians are more likely to interact with other Egyptians than Greeks are. These features can then, in turn, be used to infer the native language of the writers of other papyrus texts.
Note that this onomastic approach to ethnicity and mother tongue can only be “rough,” as Egyptians, including Egyptian scribes, also regularly used Greek names. [30] A further caveat is that people may not always have written letters themselves. However, most cases where scribe and author are not the same seem to relate to the higher classes, which were more likely to have Greek names and to employ the service of scribes who knew Greek well, presumably as their mother tongue. Moreover, we would expect that using a Greek name also correlates with a better mastery of the Greek language, so the amount of noise might not be a large problem. [31] Another approach is to cluster the texts based on features that are known to occur frequently with Egyptian scribes, e.g., voiced for voiceless plosives, or case mistakes. This way we do not have to make any assumptions about the onomastic data. Since features such as the confusion between voiceless and voiced consonants seem to correlate well with grammatical problems that may be typical of Egyptian usage, such as wrong gender usage or the use of articles for relative pronouns, [32] this approach might be more fruitful. Obviously, automatically generated sociolinguistic information might also be useful for historical approaches to the papyri, as it would give more background information on the writers of the texts that might not always be available.
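A first sanity check for the onomastic approach is to compare how often a given irregularity occurs in texts grouped by the writer's name. The Python sketch below computes such per-group averages; the record layout, the field names, and the toy counts are invented for illustration and do not reproduce the project's data.

```python
from collections import defaultdict


def mean_by_group(texts: list[dict], feature: str) -> dict[str, float]:
    """Average number of occurrences of `feature` per text, for each name group."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for text in texts:
        group = text["name_group"]
        totals[group] += text[feature]
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Invented toy records: one entry per letter with an initial greeting.
texts = [
    {"name_group": "Egyptian", "case_problems": 1},
    {"name_group": "Egyptian", "case_problems": 0},
    {"name_group": "Egyptian", "case_problems": 1},
    {"name_group": "Egyptian", "case_problems": 0},
    {"name_group": "Greek", "case_problems": 0},
    {"name_group": "Greek", "case_problems": 0},
    {"name_group": "Greek", "case_problems": 1},
    {"name_group": "Greek", "case_problems": 0},
]

print(mean_by_group(texts, "case_problems"))  # {'Egyptian': 0.5, 'Greek': 0.25}
```

The same per-text feature counts could serve as input to a classifier or a clustering algorithm, which is the direction the section describes.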

4. Conclusion

The Greek papyri provide a wealth of information for both historical and linguistic research. This paper has presented a fruitful combination of both approaches in the context of a first automated analysis of these texts. While existing machine learning tools, such as part-of-speech taggers and syntactic parsers, can be trained on literary texts, the specific genre of the papyri poses many problems, including substandard spellings and many unknown word forms, especially proper names. By drawing on historical data from the Trismegistos databases, we can fine-tune these tools to perform well on papyrus texts too. Sociolinguistic approaches to the papyri can also benefit from extra-linguistic data, such as place and genre information contained in these databases.
On the other hand, a fully linguistically annotated corpus of the papyri is a considerable asset for historical research as well. Lemmata and morphological, syntactic, and semantic annotations can be heuristically useful in finding relevant data on specific concepts for historical research. Automated text classifications (e.g., by native language) may also provide valuable insights into the history of these texts.
In the future, we aim to cooperate with the Papyrological Navigator, not only to make sure that the information can be accessed on that platform but also to ensure that changes and corrected readings in the original text find their way into the annotated version of TM Words. Close cooperation will, furthermore, allow the development of ways to make the Greek papyri more accessible to students whose Greek may not be sufficient to read the text. Currently, some of the annotation efforts described in this paper are already implemented in Trismegistos Texts: an example is shown at https://www.trismegistos.org/text/2, where the morphological interpretation, lemmatization, and translation have been added to the individual words of the text.

Bibliography

Bamman, David, and Gregory Crane. 2011. “The Ancient Greek and Latin Dependency Treebanks.” In Language Technology for Cultural Heritage, ed. C. Sporleder, C. van den Bosch, and K. Zervanou, 79–98. Berlin.
Cayless, Hugh A., James M. S. Cowey, Ryan Baumann, and Timothy David Hill. 2016. Idp.data: Data from the Integrating Digital Papyrology Project. papyri.info. https://github.com/papyri/idp.data.
Coussement, Sandra. 2016. ‘Because I am Greek’: Polyonymy as an Expression of Ethnicity in Ptolemaic Egypt. Studia Hellenistica 55. Leuven.
Crane, Gregory. 1991. “Generating and Parsing Classical Greek.” Literary and Linguistic Computing 6(4):243–245.
Dahlgren, Sonja, Alek Keersmaekers, and Joanne Stolk. Submitted. “Language contact in historical documents: the identification and co-occurrence of Egyptian transfer features in Greek documentary papyri.”
Depauw, Mark. 2010. “Do Mothers Matter? The Emergence of Metronymics in Early Roman Egypt.” In The Language of the Papyri, ed. T. V. Evans and D. D. Obbink, 120–139. Oxford.
Depauw, Mark, and Tom Gheldof. 2013. “Trismegistos: An Interdisciplinary Platform for Ancient World Texts and Related Information.” In Theory and Practice of Digital Libraries — TPDL 2013 Selected Workshops, ed. L. Bolikowski, V. Casarosa, P. Goodale, N. Houssos, P. Manghi, and J. Schirrwagen, 40–52. Valletta.
Depauw, Mark, and Joanne Stolk. 2014. “Linguistic Variation in Greek Papyri: Towards a New Tool for Quantitative Study.” Greek, Roman, and Byzantine Studies 55(1):196–220.
Dickey, Eleanore. 2004. “The Greek Address System of the Roman Period and its relation to Latin.” Classical Quarterly 54:494–527.
Dik, Helma, and Richard Whaling. 2008. “Bootstrapping Classical Greek Morphology.” In Digital Humanities 2008: Book of Abstracts, ed. L. L. Opas-Hanninen, 105. Oulu.
Dozat, Timothy, Peng Qi, and Christopher D. Manning. 2017. “Stanford’s Graph-Based Neural Dependency Parser at the CoNLL 2017 Shared Task.” In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, ed. J. Hajič, D. Zeman, J. Nivre, F. Ginter, S. Petrov, M. Straka, M. Popel, and E. Bejček, 20–30. Vancouver.
Fewster, Penelope. 2002. “Bilingualism in Roman Egypt.” In Bilingualism in Ancient Society: Language Contact and the Written Text, ed. J. N. Adams, M. Janse, and S. Swain, 220–245. Oxford.
Hajič, Jan. 2000. “Morphological Tagging: Data vs. Dictionaries.” In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, ed. J. M. Wiebe, 94–101. Seattle.
Haug, Dag T. T., and Marius Jøhndal. 2008. “Creating a Parallel Treebank of the Old Indo-European Bible Translations.” In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), ed. B. Alex, S. Degaetano-Ortlieb, A. Feldman, A. Kazantseva, N. Reiter, S. Szpakowicz, 27–34. Marrakech.
Henriksson, Erik, and Marja Vierros. 2017. “Preprocessing Greek Papyri for Linguistic Annotation.” Journal of Data Mining & Digital Humanities: Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages. http://jdmdh.episciences.org/1385.
Kraft, R., ed. 1988. Morphologically Analyzed Septuagint (version 1.0). Computer-Assisted Tools for Septuagint Studies (CATSS), University of Pennsylvania. http://ccat.sas.upenn.edu/gopher/.
Müller, Thomas, Ryan Cotterell, Alexander M. Fraser, and Hinrich Schütze. 2015. “Joint Lemmatization and Morphological Tagging with Lemming.” In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, ed. L. Màrquez, C. Callison-Burch, and J. Su, 2268–2274. Lisbon.
Open Greek and Latin. 2019. XML Files for the Works in the First Thousand Years of Greek Project. https://github.com/OpenGreekAndLatin/First1KGreek.
Perseus Digital Library. 2019. XML Canonical Resources for Greek Literature. https://github.com/PerseusDL/canonical-greekLit.
Popel, Martin, David Mareček, Jan Štěpánek, Daniel Zeman, and Zdeněk Žabokrtský. 2013. “Coordination Structures in Dependency Treebanks.” In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics I, ed. P. Fung, M. Poesio, and H. Schütze, 517–527. Sofia.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/.
Schmid, Helmut, and Florian Laws. 2008. “Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging.” In Proceedings of the 22nd International Conference on Computational Linguistics I, ed. D. Scott and H. Uszkoreit, 777–784. Manchester, UK.
Tauber, J. K., ed. 2017. MorphGNT: SBLGNT Edition [Data Set] (version 6.12). https://github.com/morphgnt/sblgnt.
Turney, Peter D., and Patrick Pantel. 2010. “From Frequency to Meaning: Vector Space Models of Semantics.” Journal of Artificial Intelligence Research 37:141–188.
Van Hal, Toon, and Yannick Anné. 2017. “Reconciling the Dynamics of Language with a Grammar Handbook. On Pedalion, an Ongoing Greek Grammar Project.” Digital Scholarship in the Humanities 32(2):448–454.
Vierros, Marja. 2015. Bilingual Notaries in Hellenistic Egypt: A Study of Greek as a Second Language. Collectanea Hellenistica 5. Brussels.

Footnotes

1. The largest available linguistically annotated papyrus corpus is the Sematia and PapyGreek corpus (Henriksson and Vierros 2017), which is annotated manually for morphology, lemmata, and syntax (see https://github.com/ezhenrik).
2. Cayless et al. 2016.
3. Schmid and Laws 2008.
4. Müller et al. 2015.
5. Dozat, Qi, and Manning 2017.
6. R Core Team 2020.
7. Hajič 2000.
8. Crane 1991.
9. Bamman and Crane 2011; the 2.1 version was used.
10. Haug and Jøhndal 2008.
11. Gorman 2020.
12. Keersmaekers et al. 2019.
13. Henriksson and Vierros 2017.
14. Tauber 2017.
15. Kraft 1988.
16. Depauw and Gheldof 2013.
17. Depauw and Stolk 2014.
18. The current version of Trismegistos Words (see below) still uses only the regularized version for automated analysis.
19. Schmid and Laws 2008.
20. Müller et al. 2015.
21. Dozat, Qi, and Manning 2017.
22. Turney and Pantel 2010.
23. Perseus Digital Library 2019.
24. Open Greek and Latin 2019.
25. Van Hal and Anné 2017.
26. This labeler is released on GitHub (https://github.com/alekkeersmaekers/PRL).
28. Fewster 2002.
29. See also Dahlgren, Keersmaekers, and Stolk, submitted, for an overview of the spelling evidence.
30. Coussement 2016.
31. Looking at the texts with initial greetings, the evidence is somewhat mixed. On the one hand, problems of case usage are more frequent in texts when the scribe has an Egyptian name: on average, 0.39 case problems per text versus 0.19 when the scribe has a Greek name. On the other hand, certain phonological problems that we associate with Egyptian, such as the use of γ instead of κ, are not more common with scribes with an Egyptian name than with scribes with a Greek one.
32. Vierros 2015.