3. Annotation of the Greek Papyri
3.1 Tokenization
Although tokenization, i.e., the division of a text into “tokens” (not only words but also punctuation marks), is a relatively trivial step, it nevertheless involves some complications. If the text originally written by the scribe has been regularized by the modern editor, for example, one has to decide from which version of the text the tokens should be taken. Henriksson and Vierros (2017) created a tool that separates the two versions, generating both an “original” and a “regularized” tokenized version of the same text (sketched schematically after the list below). Yet neither version is particularly suitable for automated linguistic analysis:
- The “original” version, owing in particular to the lack of a unified spelling convention and to missing words or characters, is simply too irregular to be analyzed by an automated natural language tool trained on highly regularized literary prose (see above).
- The “regularized” version, on the other hand, is too “regular”: editors frequently correct not only irregular spellings but also morphosyntactic problems such as case usage. In some instances, even case usage that is consistent with post-classical Greek but violates classical norms is emended. While such corrections would probably benefit natural language processing, we would prefer to see the morphology annotated as it appears in the text, not as it appears in the editor’s head.
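To make the dual output concrete, the following is a minimal sketch of a token that retains both versions side by side. The class, field names, and example values are illustrative assumptions, not the actual output schema of the Henriksson and Vierros (2017) tool or of the Trismegistos database.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """A single token retaining both text versions (illustrative schema)."""
    original: str                      # the form as written by the scribe
    regularized: Optional[str] = None  # the editor's regularization, if any
    level: Optional[str] = None        # linguistic level of the irregularity,
                                       # e.g. "grapheme", "phoneme", "morpheme"

# A typical iotacistic spelling, regularized at the phoneme level
# (hypothetical example):
tok = Token(original="ἡμεῖν", regularized="ἡμῖν", level="phoneme")
```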
We therefore decided to include both text versions in the tokenization, so as to be able to choose dynamically between the regularized and the original version of a token according to the type of regularization (spelling vs. grammatical). This was possible thanks to the Trismegistos Text Irregularities database, [17] which classifies each editorial regularization according to its linguistic level: “grapheme,” “phoneme,” or “morpheme.” “Lexeme” is mostly used for semantic or unexplained scribal “mistakes,” while “phrase” typically tags words and even whole sections supplied by the editor. In the future, our automated linguistic analysis will use the regularized version when the “error” occurs at the phoneme or grapheme level, but the original version, or possibly both, in the case of morphological regularizations. [18]
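The intended selection logic can be outlined as follows. This is a schematic sketch of the planned behavior rather than an existing implementation; it reuses the Token class from the previous sketch, and the function and set names are hypothetical.

```python
# Levels at which the editorial intervention is purely orthographic, so the
# regularized form can safely be fed to the NLP pipeline.
ORTHOGRAPHIC_LEVELS = {"grapheme", "phoneme"}

def form_for_analysis(token: Token) -> str:
    """Choose which version of a token the linguistic analysis should see."""
    if token.regularized is None:
        return token.original      # no editorial intervention
    if token.level in ORTHOGRAPHIC_LEVELS:
        return token.regularized   # spelling-level regularization: accept
    # Morpheme-level regularizations: keep the scribe's morphology as written
    # (possibly analyzing both versions, as noted above).
    return token.original
```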