What is the Classical Language Toolkit?
Johnson began work on CLTK in 2012 to fill a perceived void in open source software offering state-of-the-art natural language processing (NLP) for ancient languages. Though classicists were pioneers in what today is known as digital humanities, they had not kept pace with the recent maturing of the field of natural language processing. Arising out of the disciplines of computer science and linguistics, NLP has quickly evolved into a significant part of artificial intelligence research.
The CLTK Core is written in the Python programming language and is distributed as an easy-to-install package for end users. It currently offers good basic (and some advanced) NLP functionality for Ancient Greek, Latin, Akkadian, Old French, and the Germanic languages. Examples of this functionality include sentence segmenters, word tokenizers, part-of-speech taggers, and stopword filters. Several more advanced tools are available too, such as pre-trained word2vec models, prosody scanning, and IPA phonetic transcription.
The toolkit’s features for Ancient Greek and Latin are the most developed; however, by design all languages supported by CLTK are “first-class citizens,” in that all language-specific data sets and functions are similarly easy for end users to access. One consequence of such parallelism of languages is that CLTK becomes an extensible framework with which language traditions may be studied in comparison to one another with less friction than otherwise possible. For example, were Greek, Hebrew, and Latin toolkits available, but in different programming languages, accessible with various conventions, and for different platforms, then normalizing input, processing algorithms, and output would be a significant challenge. With a consistent application programming interface, or API, however, users can (in theory, if not entirely yet in practice) analyze language statistics of source texts and their ancient translations, observe diachronic syntactic trends of related languages (such as the evolution of Latin into the Romance languages), or even correlate the representation of ideas between two different literary traditions.
Beyond the software for the Core tools, CLTK also provides digital texts and linguistic data sets. In fact, the project began in part as an effort simply to consolidate the valuable, yet often hard-to-find, NLP data that were spread across the Internet. There was a tremendous amount of valuable information available to be collated, including open access data sets and public-domain text repositories. As data volume and complexity grew, there arose an increasing need for careful organization, access, and dissemination of these data, so as to encourage use. Accordingly, CLTK developed a corpus management system, akin to a software package manager (the most well-known example being Apple’s App Store). This system allows users to download corpora hosted by the CLTK organization, as well as to identify other corpora that CLTK cannot host, for example because of licensing restrictions. In contrast to most NLP projects, CLTK takes some responsibility for providing users with good texts and assisting them in data management.
For all of CLTK’s efforts to make NLP more accessible to non-technical users, there still remains a significant learning curve to begin using the software, as users must first gain some familiarity with the Unix-style command line and the Python programming language. In the spirit of further breaking down barriers to bringing NLP to the ancient world, the project began a partnership to provide an NLP-backed, online reading environment, an example of which can be seen in Khan’s 2016 GSoC contributions to the CLTK Archive.
Bootstrapping a reading environment
Hollis mentored Khan’s development of features for a front-end application for CLTK designed to help users perform sophisticated analysis and annotation of literary text data.  Taking advantage of the multimodal potential of the web browser, Khan brought together texts already available in the CLTK corpora with user-supplied annotations, images, scholarly citations, and other types of linked data. While originally included in a website called the CLTK Archive, at present offline, these commentary-writing features have since been deployed at several sites, including the Classical Commentaries by the Center for Hellenic Studies (CHS), the Archive of African American Folklore, the Orpheus Digital Collections Platform, and the forthcoming New Alexandria at CHS (which also uses CLTK corpora). 
The larger goal of Hollis and Khan’s work was not just a presentation layer for a data set but an aesthetically pleasing digital environment to facilitate reading, note-taking, and learning. Two foundational principles guided the scope and direction of Khan’s project. First, the code ought to be thoroughly open, reusable with a variety of text data, and able to be re-implemented by other projects needing annotation. Second, the project aimed to combine metadata and open-access media elegantly. Provided the user is not overwhelmed with too much information, such metadata can enrich the reading experience.
Finally, and at the heart of Khan’s project, was a feature for users to add annotations to the reading environment. Annotations could be made either private or public; in the latter case, the annotations could be integrated into other users’ personal libraries. One obvious use case for this feature was that teachers could share annotations with students, so as to facilitate the comprehension of ancient languages.
Upon completion of the summer’s project, a number of enhancements were considered, mostly concerning adding more metadata and making the update process seamless. Other improvements include allowing users finer-grained control over which metadata components to display in the front end (for example, vocabulary and Wikipedia entities, but not images or JSTOR citations). As open source software, Hollis’s front end and Khan’s annotation software are available for anyone to deploy and extend.
Lemmatization as reading
The 2016 GSoC project for the CLTK Core focused on lemmatization, that is, the automated retrieval of dictionary headwords from a given text. For his project, Latin and Greek Backoff Lemmatizer, Burns adopted a strategy commonly applied to part-of-speech identification, namely sequential backoff tagging. For this implementation, he repurposed part-of-speech taggers available in the Natural Language Toolkit (NLTK). With this backoff process, rather than using a single-pass algorithm to return lemmata, several lemmatizers are chained together and used in succession. With each pass, if a lemma is found for a given word, it is assigned and the corresponding word is ignored on subsequent passes. If a lemma is not found, the word is passed down to the next lemmatizer in the chain. The process finishes when all of the lemmatizers (theoretically, an infinite number can be chained together) have been exhausted. Any words that remain untagged at the end of the backoff chain are then assigned a default lemma, such as “UNKNOWN” or None.
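The backoff process just described can be illustrated with a minimal sketch in plain Python. It is not the CLTK or NLTK implementation: the function names, the two toy lookup tables, and the token-by-token loop are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# A lemmatizer takes one token and returns a lemma, or None to defer.
Lemmatizer = Callable[[str], Optional[str]]

# Toy lookup tables (hypothetical data, for illustration only).
INDECLINABLES = {"in": "in", "et": "et", "non": "non"}
KNOWN_FORMS = {"arma": "arma", "virum": "vir", "cano": "cano"}


def lookup_lemmatizer(token: str) -> Optional[str]:
    """First pass: indeclinable words found in a lookup table."""
    return INDECLINABLES.get(token)


def dictionary_lemmatizer(token: str) -> Optional[str]:
    """Second pass: inflected forms found in a small form dictionary."""
    return KNOWN_FORMS.get(token)


def backoff_lemmatize(
    tokens: List[str],
    chain: List[Lemmatizer],
    default: Optional[str] = "UNKNOWN",
) -> List[Tuple[str, Optional[str]]]:
    """Try each lemmatizer in order; stop at the first one that answers."""
    results = []
    for token in tokens:
        lemma = None
        for lemmatizer in chain:
            lemma = lemmatizer(token)
            if lemma is not None:
                break  # assigned: later lemmatizers never see this token
        # Tokens that no lemmatizer could handle receive the default lemma.
        results.append((token, lemma if lemma is not None else default))
    return results


CHAIN = [lookup_lemmatizer, dictionary_lemmatizer]
print(backoff_lemmatize(["arma", "in", "ignotum"], CHAIN))
```

Because each lemmatizer only needs to return a lemma or None, new passes can be added to the chain without changing the loop, which is what makes the chain extensible in principle to any number of lemmatizers.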
In a volume on literacies, it should be mentioned that although the lemmatizers are hardly “literate,” they are grounded in recognizable classical-language reading patterns. One backoff lemmatizer assigns lemmata to indeclinable (and non-ambiguous) words such as adverbs, prepositions, and interjections. The Latin word in will always resolve to the English headword “in.” As a “reader,” the lemmatizer recognizes this and can effectively retrieve the indeclinable word from a lookup table. Another backoff lemmatizer uses regular expressions, a sort of flexible, text-based pattern search, to parse word endings and assign lemmata accordingly. So, for example, a lemma like latinitas (“Latinity”) can be derived using regular expressions from the form latinitati because in the Latin corpus the ending -itati always resolves to -itas. Yet another backoff lemmatizer, the Principal Parts lemmatizer, combines regular-expression-based pattern matching with a lookup dictionary of valid verb stems. This lemmatizer first matches words with recognizable verb endings, removes the ending, looks up the stem in a principal parts dictionary, and returns a lemma if the stem is valid. So, for example, -eris can be used to derive the lemma for the perfect subjunctive verb audiveris, because audiv- is a valid verb stem (from audio, “to hear”), but the same pass would ignore the dative/ablative plural adjective asperis (from asper, “rough”) because asp- is not a valid verb stem. These lexical and morphological negotiations can again be seen as a “readerly” process and are exactly the kinds of word-ending identifications and manipulations asked of every beginning Latin learner. The computer can simply read, stem, and validate forms at a speed and scale unimaginable to a human reader.
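The two ending-based strategies can be sketched in a few lines of Python. This is a simplified illustration rather than the CLTK code: the rule list, the stem table, and the function names are hypothetical, and only the -itati, -eris, audiv-, and asperis examples are taken from the discussion above.

```python
import re
from typing import Optional

# Hypothetical ending rules as (pattern, replacement) pairs. A real rule
# set would contain many more entries than this single example.
ENDING_RULES = [(re.compile(r"itati$"), "itas")]

# Hypothetical principal-parts table mapping valid verb stems to headwords.
VERB_STEMS = {"audiv": "audio"}

# Verb endings that the principal-parts pass will try to strip.
VERB_ENDINGS = ["eris"]


def regex_lemmatize(token: str) -> Optional[str]:
    """Rewrite a recognized ending to its headword ending, else defer."""
    for pattern, replacement in ENDING_RULES:
        if pattern.search(token):
            return pattern.sub(replacement, token)
    return None


def principal_parts_lemmatize(token: str) -> Optional[str]:
    """Strip a verb ending, then accept only stems found in the table."""
    for ending in VERB_ENDINGS:
        if token.endswith(ending):
            stem = token[: -len(ending)]
            if stem in VERB_STEMS:
                return VERB_STEMS[stem]
    return None  # no valid verb stem: leave the word for another pass


print(regex_lemmatize("latinitati"))
print(principal_parts_lemmatize("audiveris"))
print(principal_parts_lemmatize("asperis"))
```

The stem check is what keeps asperis out of the verb pass: the ending matches, but asp- is absent from the table, so the function returns None and the word is left for another lemmatizer in the chain.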
Another way in which the lemmatizers imitate the reading process is through context-based decision making. A lemmatizer introduced in the first iteration of the Backoff module is one which “reads” through a pretagged corpus of Latin texts to determine the most likely lemma for a given word based on word frequency. This process is known as training the tagger. The program attempts to distinguish words based on prior “knowledge,” that is, information “learned” from the training data, and applies this to new, unseen text. For example, the form bella is found in the training corpus most frequently under the lemma bellum (“war”), so all instances of bella in newly seen text will be tagged as such. While this can lead to false negatives, that is, it may mistag a form of the adjective bellus, -a, -um (“beautiful”), the trade-off works in favor of higher accuracy overall for the lemmatizer because bellum is a far more frequent word in the training data than bellus.
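A frequency-based lemmatizer of this kind can be approximated with a short sketch. The training pairs below are invented toy data, not counts from any treebank, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Invented (form, lemma) pairs standing in for a pretagged corpus.
TRAINING_PAIRS = [
    ("bella", "bellum"),
    ("bella", "bellum"),
    ("bella", "bellum"),
    ("bella", "bellus"),
    ("arma", "arma"),
]


def train(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each form to the lemma it appears under most often."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[form][lemma] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def lemmatize(token: str, model: Dict[str, str]) -> Optional[str]:
    """Return the learned lemma, or None for forms unseen in training."""
    return model.get(token)


model = train(TRAINING_PAIRS)
print(lemmatize("bella", model))
print(lemmatize("ignotum", model))
```

In this toy model every instance of bella resolves to bellum, including the one that was tagged bellus in training, which is the trade-off described above.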
A variation of the Train lemmatizer that also replicates classical-language reading practices can be seen in the TrainPOS lemmatizer. This version looks in the training data not only at the most frequent lemma for a given word, but also at the part of speech (POS) of the following word. Accordingly, this lemmatizer can more accurately disambiguate cum (“with”) from cum (“when”), long noted as a challenge for automated lemmatization: the preposition cum is followed with greater frequency by a noun or adjective than its homograph. More experiments need to be done on extracting lemmata using adjacent POS data, but it could prove useful in resolving many of the ambiguous forms that keep lemmatizers from achieving even greater accuracies.
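The idea of conditioning on the following word's part of speech can be sketched as follows. The sentences, the POS labels, and the lemma labels cum_prep and cum_conj are invented for illustration, and the sketch assumes that the POS of the next word is already known at lemmatization time (for example, from a separate tagger).

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str, str]  # (form, lemma, part of speech)

# Invented training sentences; labels are hypothetical.
TRAINING_SENTENCES: List[List[Token]] = [
    [("cum", "cum_prep", "PREP"), ("amicis", "amicus", "NOUN")],
    [("cum", "cum_prep", "PREP"), ("magna", "magnus", "ADJ"), ("cura", "cura", "NOUN")],
    [("cum", "cum_conj", "CONJ"), ("venisset", "venio", "VERB")],
    [("cum", "cum_conj", "CONJ"), ("haec", "hic", "PRON"), ("dixisset", "dico", "VERB")],
]


def train_pos(sentences: List[List[Token]]) -> Dict[Tuple[str, Optional[str]], str]:
    """Count lemmata per (form, POS of following word) and keep the winner."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i, (form, lemma, _pos) in enumerate(sentence):
            # The context is the POS of the next token, or None at sentence end.
            next_pos = sentence[i + 1][2] if i + 1 < len(sentence) else None
            counts[(form, next_pos)][lemma] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def lemmatize_with_context(
    form: str, next_pos: Optional[str], model: Dict[Tuple[str, Optional[str]], str]
) -> Optional[str]:
    """Return the lemma learned for this form in this context, else None."""
    return model.get((form, next_pos))


pos_model = train_pos(TRAINING_SENTENCES)
print(lemmatize_with_context("cum", "NOUN", pos_model))
print(lemmatize_with_context("cum", "VERB", pos_model))
```

A context that never occurred in training returns None, so in a backoff chain the word would simply fall through to a context-free lemmatizer.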
Ancient literacy in the browser
The GSoC work on a Core tool, that is, the lemmatizer, and the Archive developed along separate paths, largely insulated from each other. As CLTK moves forward, however, it is critical to examine how these two projects can be integrated to make each stronger. More importantly, in the context of this article on the role that CLTK plays in the future of digital literacies, it is critical to examine how the Core tools can help readers negotiate texts presented in websites like the Archive and how such websites can give context to output from the Core tools. This output often resembles the kind of paratextual help that was ubiquitous in pedagogical materials from antiquity, such as the compilation of running word-lists or glosses, the addition of macrons, and so on. With this in mind, it is worth reflecting on how the relationship between the Core tools and reading interfaces fits into these practices of ancient literacy.
Raffaella Cribiore has shown, for example, that the physical expression of student texts found in Egyptian papyri makes ample use of reading aids and tools. Moreover, many of these papyrological aids exhibit the analysis of texts into their constituent elements. Many of these find parallels in the CLTK Core tools. Separation and demarcation of word divisions, whether by spaces, dots, dashes, or other marks, show the long-standing utility of tokenization. Annotation of vowel quantity shows the utility of the macronizer; running word-lists the utility of the lemmatizer. Marginal notes with intertextual references, translations, and various morphological and lexical information hint at valuable tools that CLTK could develop for integration into front-end services.
The digital analytical edition begins from the idea of treating texts as data and then presents views of this data based on a “set of instructions.” Here we approach a Model-View-Controller (MVC) version of working with classical texts. To take just the projects from GSoC, in an environment where the Core tools and the Archive were tightly integrated, we could pass code using the Backoff Lemmatizer to a text of Virgil’s Aeneid, such that, instead of seeing “Arma virumque cano etc.” as we do in many currently available online editions of this text, we could see the entire text lemmatized before us as a Python list: ['arma', 'vir', '-que', 'cano', etc.]. In theory this view could be changed on the fly to represent any output derived from the underlying CLTK-enhanced Python. As the CLTK web API matures, we expect that such in situ NLP processing will be integrated into its machine-readable layer.
W. A. Johnson refers to literacies as “text-oriented events in particular sociocultural contexts.” His choice of words is serendipitous with respect to how programmers and software architects design user interfaces like the CLTK Archive. “Event” is the programming term for how the user interacts with the browser and so by extension how an Archive user interacts with the reading environment. While there are many online venues available for reading classical literature, in large part they replicate on the screen traditional modes of negotiating texts in their printed form. CLTK, by leveraging innovative work in both the Core and the Archive, can promote an event-driven literacy that literally transforms the way works from antiquity are presented to their audience, offering this generation of classical-text readers levels of engagement and interactivity not realistically available to previous generations.
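To make the notion of switchable views concrete, here is a minimal sketch in plain Python. It does not use the CLTK API or the Archive code: the lemma table, the enclitic splitting, and the view names are all hypothetical stand-ins for what the Core tools and a front end would supply.

```python
from typing import Callable, Dict, List, Union

# Toy lemma table (hypothetical); a real pipeline would call a lemmatizer.
LEMMATA = {"arma": "arma", "virum": "vir", "cano": "cano"}


def split_enclitic(token: str) -> List[str]:
    """Split a trailing -que when the remainder is a known form."""
    if token.endswith("que") and token[:-3] in LEMMATA:
        return [token[:-3], "-que"]
    return [token]


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and separate enclitics."""
    tokens: List[str] = []
    for word in text.lower().split():
        tokens.extend(split_enclitic(word))
    return tokens


def plain_view(text: str) -> str:
    """The text as a conventional online edition would display it."""
    return text


def lemma_view(text: str) -> List[str]:
    """The same text presented as a list of lemmata."""
    return [LEMMATA.get(token, token) for token in tokenize(text)]


# The "controller": one underlying text, several interchangeable views.
VIEWS: Dict[str, Callable[[str], Union[str, List[str]]]] = {
    "plain": plain_view,
    "lemmata": lemma_view,
}


def render(text: str, view: str) -> Union[str, List[str]]:
    return VIEWS[view](text)


print(render("Arma virumque cano", "plain"))
print(render("Arma virumque cano", "lemmata"))
```

Adding another view, such as a running word-list or a macronized text, would mean registering one more function in the table while the underlying text stays untouched.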
Bibliography
Abelson, H., and G. J. Sussman. 1996. Structure and Interpretation of Computer Programs. 2nd ed. Cambridge, MA.
Andor, D., C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Alexander. 2016. Globally Normalized Transition-Based Neural Networks. arXiv preprint arXiv:1603.06042.
Barker, E., and M. Terras. 2016. “Greek Literature, the Digital Humanities, and the Shifting Technologies of Reading.” Oxford Handbooks Online. https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199935390.001.0001/oxfordhb-9780199935390-e-45.
Berti, M., ed. 2019. Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution. Berlin.
Bird, S., E. Klein, and E. Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastopol, CA.
Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3:993–1022.
Brunner, T. F. 1987. “Data Banks for the Humanities: Learning from Thesaurus Linguae Graecae.” Scholarly Communication 7:6–9.
Burstall, R. 2000. “Christopher Strachey—Understanding Programming Languages.” Higher-Order and Symbolic Computation 13:51–55.
Celano, G. G. A., G. Crane, B. Almas, et al. 2014. “The Ancient Greek and Latin Dependency Treebank v.2.1.” https://perseusdl.github.io/treebank_data/.
Crane, G. 1991. “Generating and Parsing Classical Greek.” Literary and Linguistic Computing 6:243–45. https://academic.oup.com/dsh/article-lookup/doi/10.1093/llc/6.4.243.
Crane, G., B. Almas, A. Babeu, L. Cerrato, A. Krohn, F. Baumgart, M. Berti, G. Franzini, and S. Stoyanova. 2014. “Cataloging for a Billion Word Library of Greek and Latin.” Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (DATeCh) 83–88. Madrid.
Cribiore, R. 2001. Gymnastics of the Mind. Princeton.
DeHoratius, E. F. 2000. “Review of Vergil’s Aeneid 10 & 12: Pallas & Turnus by B. W. Boyd.” Classical Journal 96:211–215.
Dickey, E. 2016. Learning Latin the Ancient Way: Latin Textbooks from the Ancient World. Cambridge.
Dué, C., ed. 2009. Recapturing a Homeric Legacy: Images and Insights from the Venetus A Manuscript of the Iliad. Hellenic Studies Series 35. Washington, DC.
Ebbott, M. 2009. “Text and Technologies: The Iliad and the Venetus A.” In Dué 2009:31–55.
Gruber-Miller, J., ed. 2006. When Dead Tongues Speak: Teaching Beginning Greek and Latin. New York.
Hancox, P. J. 1996. “A Brief History of Natural Language Processing.” http://www.cs.bham.ac.uk/~pjh/sem1a5/pt1/pt1_history.html.
Heslin, P. 1999. “Diogenes.” https://community.dur.ac.uk/p.j.heslin/Software/Diogenes/index.php.
Hockey, S. M. 2000. Electronic Texts in the Humanities: Principles and Practice. Oxford.
Hunt, A., and D. Thomas. 2000. The Pragmatic Programmer: From Journeyman to Master. Boston.
Johnson, W. A. 2009. “Introduction.” In Johnson and Parker 2009:3–10.
Johnson, W. A., and H. N. Parker. 2009. Ancient Literacies: The Culture of Reading in Greece and Rome. Oxford and New York.
Jones, S. E. 2016. Roberto Busa, S. J., and the Emergence of Humanities Computing: The Priest and the Punched Cards. New York.
Lancashire, I., ed. 1991. The Humanities Computing Yearbook 1989–1990. Oxford.
Lehnert, W. G., and M. H. Ringle, eds. 1982. State of the Art in Natural-Language Understanding. New York.
Lenders, W. 2013. “The Early History of Computational Lexicography: The 1950s and 1960s.” In Gouws, Heid, Schweickard, and Wiegand 2013:969–982.
Lockspeiser, B., L. Israel, E. Damboritz, R. Neiss, S. Kaplan, J. Mosenkis, and N. Santacruz. 2012. “Sefaria.” https://github.com/Sefaria.
Manning, C. D., and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA.
McCaffrey, D. 2006. “Reading Latin Efficiently and the Need for Cognitive Strategies.” In Gruber-Miller 2006:113–33.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” Advances in Neural Information Processing Systems 26:3111–3119. https://papers.nips.cc/book/advances-in-neural-information-processing-systems-26-2013.
Muccigrosso, J. D. 2004. “Frequent Vocabulary in Latin Instruction.” Classical World 97:409–433.
Perkins, J. 2014. Python 3 Text Processing with NLTK 3 Cookbook. Birmingham, UK.
Piotrowski, M. 2012. Natural Language Processing for Historical Texts. San Rafael, CA.
Rockwell, G., and S. Sinclair. 2016. Hermeneutica: Computer-Assisted Interpretation in the Humanities. Cambridge, MA.
Short, W. M. 2019. “Computational Classics? Programming Natural Language Understanding.” SCS Blog. https://classicalstudies.org/scs-blog/william-m-short/blog-computational-classics-programming-natural-language-understanding.
Smith, D. A., J. A. Rydberg-Cox, and G. Crane. 2000. “The Perseus Project: A Digital Library for the Humanities.” Literary & Linguistic Computing 15:15–25.
Vandendorpe, C. 2009. From Papyrus to Hypertext: Toward the Universal Digital Library. Urbana, IL.
Waltz, D. L. 1982. “State of the Art in Natural-Language Understanding.” In Lehnert and Ringle 1982:3–32.
Whitaker, W. 1993. “Words v.1.97F.” http://archives.nd.edu/whitaker/wordsdoc.htm#SUMMARY.
Wittern, C. 2006. “Chinese Buddhist Texts for the New Millennium—The Chinese Buddhist Electronic Text Association (CBETA) and Its Digital Tripitaka.” Journal of Digital Information 2. https://journals.tdl.org/jodi/index.php/jodi/article/view/84/83.
Footnotes
1. This paper deals specifically with CLTK’s participation in GSoC 2016, although the organization has also participated in the intervening years.
2. Throughout this paper, we will use Core to refer to CLTK’s suite of NLP tools and Archive to refer to the reading environment. Since publication, the CLTK Archive has been moved to Luke Hollis’s Orpheus project, available at http://orphe.us/. All links provided in this paper were last accessed on July 1, 2019.
3. The project also credits and offers thanks to its board of academic advisors, who offer input on specialist scholarly issues and advocate for CLTK in the academy, namely Neil Coffee, Gregory Crane, Peter Meineck, and Leonard Muellner.
4. To name just a few: Father Roberto Busa’s Index Thomisticus, morphologically tagged treebanks of Thomas Aquinas’s writings (founded in 1949; see Jones 2016 on the project’s history); the Thesaurus Linguae Graecae, digital texts of the Greek canon through the Byzantine era (founded 1972; for the TLG’s history, see Brunner 1987); CD-ROMs from the Packard Humanities Institute, including among others, the classical Latin canon, and Greco-Roman epigraphy and papyri (the Institute itself dates to 1987, though David Packard published a computer-generated concordance to Livy in 1968; see Lancashire 1991:215; Lenders 2013:974); and the Perseus Project, offering source texts, translations, and morphological assistance to readers (first release in 1987; see Smith, Rydberg-Cox, et al. 2000). On the early years of text digitization, across many disciplines, see Hockey 2000:11–23.
5. Noteworthy early exceptions include Gregory Crane’s Morpheus (Crane 1991) and Peter Heslin’s Diogenes (Heslin 1999); see also Piotrowski 2012. The situation has improved considerably and NLP for historical languages is now an active area of research. The table of contents for a recent special issue of the Journal of Data Mining and the Digital Humanities on the computer-aided processing of intertextuality in ancient languages provides a glimpse of research interests in this growing field (https://jdmdh.episciences.org/page/intertextuality-in-ancient-languages); see also Short 2019.
6. From its inception through the 1980s, NLP was driven by programs composed of hand-written rules (Waltz 1982; Hancox 1996). These were succeeded by a growing number of statistical approaches to the identification and manipulation of human language. Statistical methods include supervised machine learning especially, but also unsupervised clustering (e.g., latent Dirichlet allocation; Blei, Ng, et al. 2003), word embeddings (e.g., word2vec; Mikolov, Sutskever, et al. 2013), and, recently, deep learning (e.g., SyntaxNet; Andor, Alberti, et al. 2016). For a foundational, and largely still relevant, NLP textbook, see Manning and Schütze 1999.
8. This is by analogy to the computer science concept, coined by Christopher Strachey (Burstall 2000:52; Abelson and Sussman 1996:§1.3.4).
9. On the issue of compatibility between available digital tools, see the forthcoming chapter from Burns on “Building a Text Analysis Pipeline for Classical Languages” in Berti 2019.
10. Corpora are now available for over two dozen languages, and, by partnering with several projects, we have access to digital text collections with millions of words each: for example, the Open Greek and Latin Project (Crane, Almas, et al. 2014); for Biblical Hebrew and Aramaic, Sefaria (Lockspeiser, Israel, et al. 2012); and for classical Chinese Buddhist texts, CBETA (Wittern 2006). Examples of open access data sets and text repositories available to be collated include treebanks from the Alpheios Project (http://alpheios.net) and the Ancient Greek and Latin Dependency Treebank (Celano, Crane, et al. 2014), the source code for Whitaker’s Words (Whitaker 1993), and the contents of the Latin Library website (http://www.thelatinlibrary.com), which have all been used to some extent in CLTK tool development.
11. Documentation for this CorpusImporter is available at http://docs.cltk.org/en/latest/importing_corpora.html.
15. In addressing this goal, CLTK consulted researchers who had created similar reading and annotation interfaces for ancient literature. We owe a special thanks to Brett Lockspeiser for sharing his experiences building Sefaria, an interactive website for biblical Hebrew and Aramaic (Lockspeiser, Israel, et al. 2012). The CLTK team has also learned much from years of involvement with the Perseus Project and CHS.
17. Special corpus conversion scripts were authored by the Sefaria team and by Thibault Clérice of the Open Greek and Latin Project. The CLTK data format is documented at https://github.com/cltk/cltk_api/wiki/JSON-data-format-specifications.
18. On the NLTK’s part-of-speech capabilities, see Bird, Klein, et al. 2009:229–31; Perkins 2014:81–109.
19. For a comprehensive list of all GSoC tasks accomplished by Burns, see https://gist.github.com/diyclassics/fc80024d65cc237f185a9a061c5d4824.
20. See, for example, McCaffrey 2006.
22. In AGLDT, the lemma bellum appears 128 times and the lemma bellus appears 9 times.
23. In AGLDT, cum (“with”) is followed by a noun or an adjective 76% of the time (152 out of 200 instances), while the figure for cum (“when”) is 35% (105 out of 297 instances). On the challenge of disambiguating cum, see, for example, Muccigrosso 2004:419: “Many Latin words are unambiguous in form (e.g., mihi). These the computer can handle quite competently, but the ambiguous words (e.g., cum) require more intelligence.”
24. See, for example, the beginners’ learning material from antiquity described in Dickey 2016:4–6 and throughout.
25. Cribiore 2001:132–36, 167–76.
26. Speaking critically of grammarians’ tendency to miss the linguistic forest for the trees, Cribiore 2001:215 speaks of a “myopic ability to dismember a text into its components.” Myopic or not, transformation tasks such as tokenization and annotation tasks such as part-of-speech tagging are nevertheless part of a continuum of “dismembering” going back at least as far as these grammarians.
27. As E. F. DeHoratius (DeHoratius 2000) notes in his review of B. W. Boyd’s edition of Aeneid 10 and 12, a book designed to supplement Pharr’s coverage of the first six books: “Detractors…claim that the very visible vocabulary system that garners so much praise provides those inexperienced readers for whom it was expressly designed and with whom it is so often used too great a crutch to lean on.”
29. This kind of presentational and editorial flexibility is part of a larger trend in reading environments for classical languages, demonstrated, to pick just a few representative examples, by the embedded widgets in the Perseus Digital Library’s Scaife Viewer (available at https://scaife.perseus.org/); the NLP and annotation modules in Recogito (available at https://recogito.pelagios.org/) or the Coptic Scriptorium (available at http://copticscriptorium.org/); and the linked-data model for manuscript images and scholia of the Homer Multitext (available at http://www.homermultitext.org/). See Ebbott 2009:52–55 for a concise statement of the “advantages of digital technologies to construct a truly different type of critical edition” for classical works as well as the challenges. For an overview of trends in reading environments for classical languages, in particular classical Greek, see Barker and Terras 2016.
30. In this respect, there is a great deal of sympathy between this vision of the CLTK Core/Archive and the text-analytical tools discussed in Rockwell and Sinclair 2016 as “hermeneutica,” embeddable, interactive web tools that allow users to experiment with textual data.
31. See jsfiddle.net.
32. See the statement of Jean-Michel Adam, quoted in Vandendorpe 2009:52: “To better define the text as object, I will draw on the idea that ‘every text contains a set of instructions’ for readers, which enables them to orient themselves in the piece of world presented in the book.”
34. Johnson 2009:3.