The eScriptorium VRE for Manuscript Cultures

Peter A. Stokes, Benjamin Kiessling, Daniel Stökl Ben Ezra, Robin Tissot, and El Hassane Gargem
EPHE – PSL (Paris, France), UMR 8546 Archéologie et Philologie d’Orient et d’Occident (AOROC)

1. Introduction: What is eScriptorium?

The eScriptorium VRE is software being developed at the École Pratique des Hautes Études, Université Paris Sciences et Lettres (EPHE – PSL), with the immediate goal of providing a web interface to an engine for the automatic transcription of written sources, both printed and handwritten, in principle in any current or historical system of writing. [1] The software is intended to be a core component in a larger VRE which provides the key steps in the normal editorial chain. The assumption here is that the researcher has images of the text and wishes first to obtain a transcription of the text in those images; then (most likely) to add markup to the text, and potentially also to the images, in order to encode the various editorial judgments that are a core part of any edition; perhaps then also, or instead, to apply techniques such as Natural Language Processing (NLP) or Named Entity Recognition (NER); and finally to publish the text and images, along with accompanying data and metadata. To date, the focus has been entirely on the first of these steps, that is, allowing the manual, semi-automatic, or automatic transcription of texts; the user can then export the results to be marked up, analyzed, and/or published in other frameworks.
The development of eScriptorium began as part of a larger project at PSL called Scripta, which is studying the history and practice of writing in almost all its forms across most of human history, and it has since been continued most notably in the Resilience project, [2] a European project looking to develop a long-term research infrastructure for religious studies. The range of languages and writing systems being studied in the Scripta project is enormous, covering the Ancient Near East (e.g. Ancient Aramaic, Ptolemaic Egyptian, Ugaritic), Iran and Central Asia (including Elamite, Sogdian, Middle Iranian), India and South-East Asia (such as Sanskrit, Classical Tamil, Old Javanese, Tai-Lue), East Asia (Tibetan, Classical Chinese, Old and medieval Japanese), and the Classical and Medieval West (including Greek, Umbrian, and Old Slavonic), among others. The scope of Resilience is in principle even broader, as it should cover all languages relevant to any aspect of religious studies in Europe for the next thirty-five years. It has therefore been a crucial element of the project that the software must avoid, as far as possible, all assumptions about the nature of the writing and language being processed. The writing may be left to right, right to left, top to bottom, or even bottom to top; the support may be paper or parchment, but also stone, palm leaf, clay, wood, or many others; it may be written with a pen, painted with a brush, or inscribed with a chisel; the writing system may be alphabetic, logographic, or hieroglyphic; and so on. This variety means that almost all levels of the software are much more complex than they would be for a single type of script, as will be discussed below.
The eScriptorium VRE is designed to interface with the Kraken engine for OCR/HTR. [3] This engine has been developed by Benjamin Kiessling, who is also from EPHE – PSL and is part of the eScriptorium team. Written in Python, the Kraken engine has been designed from the beginning to embed as few assumptions about writing systems as possible, and so to work with a very wide range of different scripts. It is highly modular, and each module has a large number of parameters that the user can set to accommodate the specific needs of the case in question. Furthermore, if the existing parameters are not sufficient, it is entirely possible for a sufficiently skilled user to replace any given module of the Kraken engine with a custom-made one. This flexibility is extremely important, particularly for the very diverse needs of the Scripta and Resilience projects. It is also very rare, and not present to nearly the same extent in other available systems, which normally provide more of an “end to end” process with little or no opportunity to intervene in the middle (as discussed further below). However, this also means that Kraken requires a relatively good understanding of OCR/HTR software and processes, as well as comfort with installing modules and dependencies and running processes directly from the command line. For this reason, it is not appropriate for the majority of users in the humanities, for whom the time and effort that must be invested in learning these techniques may seem too much for a technology that is still relatively new and whose benefits may be in doubt. The eScriptorium interface therefore serves as a user-friendly way into the Kraken engine, providing a system that functions well for the majority of users, while allowing those with more specialized needs to judge the system and its likely value, and so to make an informed decision about whether to invest further in the details.

2. The eScriptorium Workflow

In order to understand the current state of the art in OCR/HTR systems, it is necessary first to understand the basic workflow of eScriptorium and other similar systems. In general, going from images to transcription requires the following basic steps:

  1. Importing the images into the system, including preprocessing such as PDF import and other format conversions.
  2. Finding the lines of text and other significant elements (columns, glosses, initials, etc.) on the images: that is, subdividing an image into sets of shapes on that image that correspond to different region types.
  3. Transcribing the lines of text: that is, converting images of lines of text into the corresponding text.
  4. Compiling the lines of text into a coherent document and exporting the result for markup, publication, etc.

Steps 1 and 4 are relatively straightforward and are largely a question of interface. [4] In eScriptorium, the import can be done directly from a user’s hard drive, or by simply giving the URL of the IIIF manifest file and leaving the machine to import the images automatically. Steps 2 and 3 are generally much more complex, since, as we have seen, the computer needs to be able to treat a very wide range of different documents, supports, and layouts. [5] It is important to note that each of these basic steps comprises multiple sub-tasks, with sometimes unexpected ways in which they can fail. Step 2 not only finds lines but also has to sort them into the same order in which a human would read them, a process which is by no means easy given the many possible layouts across different writing systems. This step is crucial, though, because the transcription produced in step 3 will be completely unreadable, even if perfect at the line level, if this ordering operation has failed. In Kraken, and therefore in eScriptorium, both of these steps are handled by trainable computer vision algorithms, that is, by machine learning. Very broadly, this means that one must have example documents which have already been prepared: that is, images of pages annotated with columns, lines and so on for step 2, and transcriptions of texts matching the images for step 3. One then submits these example cases to the computer, and the machine “learns” from them, creating a statistical model of the images which it has already seen; this model then allows it either to segment new, unseen images into regions and lines, or to produce a transcription of the text in those unseen images. For transcription, the eScriptorium interface also allows the user to enter the text directly in order to compile ground truth material for training, and it provides mechanisms for correcting any errors in the automatic or indeed manual results (as shown in Stokes 2020c, video number 3).
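To make the division of labor between interface and engine more concrete, the following is a minimal sketch of how the underlying Kraken Python API can be driven directly for steps 2 and 3 (eScriptorium wraps equivalent calls in its web interface); the image and model filenames are placeholders, and the exact signatures may vary between Kraken versions.

```python
from PIL import Image
from kraken import blla, rpred
from kraken.lib import models

im = Image.open("page_001.png")                    # placeholder image of a full page

# Step 2: trainable baseline segmentation (lines, regions, reading order)
segmentation = blla.segment(im)

# Step 3: line-wise transcription using a trained recognition model
model = models.load_any("my_transcription_model.mlmodel")   # placeholder model file
for record in rpred.rpred(model, im, segmentation):
    print(record.prediction)                       # one line of recognized text per record
```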

In practice, these two steps for machine learning often comprise several sub-steps. For instance, one may begin by preparing a certain number of pages by hand, most often by typing them out manually (Stokes 2020c, video number 4). It may then be helpful to train the machine on this relatively small sample and then automatically transcribe some more pages: the results may contain numerous errors, but correcting them may be faster than typing out the whole page from scratch. After manual correction, the machine can then be re-trained with this additional material and, as a result, the subsequent pages will have fewer errors and will therefore be faster to correct. This process can then be repeated until the results are good enough to be useful. What counts as “good enough” depends very much on the use-case: it is generally not possible to get perfect results, but it is often possible to get over 99% character accuracy, meaning that perhaps just one or two errors per page need correcting, which is very much faster than typing by hand. Indeed, even this correction may be unnecessary, depending on the purpose of the transcription. If one is preparing an edition for publication then clearly one or two errors per page is not acceptable and must be corrected, but if the purpose is to use the text for distant reading or other large-scale analyses such as NLP, NER, automatically identifying uncatalogued texts, and so on, then 99% accuracy may be more than good enough, and indeed some methods for text identification work with only 60% accuracy. There are also ways of speeding up this initial process of text-entry, for instance by taking an existing transcription of the same text from elsewhere, importing it into eScriptorium, and then adjusting it to match this particular exemplar, or by taking examples of other transcribed texts written in scripts very similar to the new text and training on those.
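The figure of “99% character accuracy” used here is simply one minus the character error rate, that is, the edit distance between the automatic transcription and a reference transcription, divided by the length of the reference. A small self-contained illustration (not part of eScriptorium itself) is:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(reference: str, hypothesis: str) -> float:
    """Character accuracy = 1 - (edit distance / reference length)."""
    return 1.0 - levenshtein(reference, hypothesis) / max(len(reference), 1)

# A 2,000-character page with 20 wrong characters gives 99% accuracy:
print(char_accuracy("a" * 2000, "a" * 1980 + "b" * 20))   # 0.99
```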
This discussion raises clear practical questions, such as how many pages of pre-prepared transcription are necessary for good results, and how to decide whether or not the time and effort is worthwhile compared to simply producing a transcription by hand. Unfortunately it is impossible to answer these questions generally, since the results depend on too many variables, including the consistency of the handwriting, the similarity of the writing to other documents which have already been used for training (if any), the number and consistency of the abbreviations, the quality of the images, the consistency of the page layout, and even the quality of the pen and ink, insofar as faded ink or blurred strokes may also change the results. The time required to correct texts is also difficult to quantify, as this again depends not only on the expertise of the user and familiarity with the system, but also on factors like the regularity of orthography in the document, since, for instance, a very regular orthography such as that of Latin may well be amenable to automatic “spell correction.” Such automatic correction is not built into eScriptorium, since it is rarely available for historical languages, but it could easily be applied via an external tool connected to eScriptorium or directly to Kraken. In any case, some general observations can still be made. For instance, if one has a small corpus of just a few pages, or a collection of short documents (such as a large set of single-sheet charters), each written by a different person with relatively varied script, then these approaches are less likely to be useful unless one has a model that is already trained on a large quantity of similarly varied samples. The larger the number of pages in a consistent script, the more worthwhile such automatic approaches become. In any case, the interface of eScriptorium has been designed to be very ergonomic, and some users with “impossible” cases for automatic transcription (for instance, a large collection of highly varied inscriptions in stone written in a mixture of Sanskrit and Old Khmer) have nevertheless continued to use eScriptorium simply because they have found it a faster and more ergonomic interface than a word processor or other text editor.

State of the Art in Current OCR/HTR

For the more technically-minded readers, it is helpful at this point to understand the current state of the art in systems for manuscript OCR/HTR. Feature-complete systems at the time of writing include Kraken, but also Tesseract 4 and Transkribus. Broadly speaking, these systems work in similar ways for automatic transcription, and they give similar results in terms of accuracy, but they differ significantly in areas such as their ability to cope with complex “non-standard” page layouts or “unusual” scripts (as seen from a modern Western point of view), and in their degree of openness, both in terms of Open-Source software and in the ability to import and export data and trained models, points which are discussed in the following section.
These current systems are generally line-wise text recognizers: that is, text images are transcribed a whole line at a time, in contrast to the character-by-character approach employed in traditional OCR for printed text (such as in Tesseract 3 and Sakhr). The line-wise transcription is usually performed by a recurrent or hybrid convolutional-recurrent neural network which has been trained specifically for a particular script or even a specific hand, depending on the complexity and variability of the writing. All modern systems are trained with an implicit alignment between the desired textual output and the input images, most often through what is known as the CTC loss function. [6] The primary benefit of this alignment is easier acquisition of data for training, as the laborious labelling of single characters is replaced with the much simpler transcription of whole lines.
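As a purely illustrative sketch (not Kraken’s actual network, which is defined through its own VGSL specification language), a line-wise recognizer of this kind can be approximated in PyTorch as a small convolutional feature extractor feeding a bidirectional recurrent layer, trained with the CTC loss; all of the sizes below are arbitrary.

```python
import torch
import torch.nn as nn

class LineRecognizer(nn.Module):
    """Toy CNN + BiLSTM line recognizer, assuming fixed-height line images (batch, 1, 48, width)."""
    def __init__(self, n_classes: int):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(64 * 12, 128, bidirectional=True, batch_first=True)
        self.head = nn.Linear(256, n_classes + 1)        # +1 for the CTC blank label

    def forward(self, x):
        f = self.cnn(x)                                   # (batch, 64, 12, width/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)    # one feature vector per image column
        out, _ = self.rnn(f)
        return self.head(out).log_softmax(-1)             # per-column class log-probabilities

# The CTC loss aligns the per-column predictions with the (shorter) target transcription,
# so no character-level annotation of the training images is needed.
model = LineRecognizer(n_classes=100)
loss_fn = nn.CTCLoss(blank=100)
images = torch.randn(4, 1, 48, 400)                       # a batch of four synthetic line images
logits = model(images).permute(1, 0, 2)                   # CTCLoss expects (time, batch, classes)
targets = torch.randint(0, 100, (4, 30))                  # four transcriptions of 30 symbols each
loss = loss_fn(logits, targets,
               input_lengths=torch.full((4,), logits.size(0), dtype=torch.long),
               target_lengths=torch.full((4,), 30, dtype=torch.long))
loss.backward()
```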
Before transcribing the text, a modern system needs first to identify the lines on the page, and this is handled by a layout analysis (LA) module, the exact functionality of which depends on the nature of the text transcription module. Character classifiers require the extraction of single glyphs from the page by the LA system, a process which can be difficult for cursive scripts, while line-based systems can use whole line extraction techniques which are more versatile and script-independent. The major technical difference between different HTR packages lies in this LA module: how it models lines and if it can be adapted to new kinds of documents. Tesseract and Ocropus retain hand-crafted, non-adaptable computer vision methods that output rectangular boxes around the lines. These work reasonably well for printed documents and clean handwriting but cannot reliably process complex manuscripts, especially if the lines of text are curved or otherwise do not fit naturally into these boxes. Recently, new forms of LA use the baselines of the text instead of boxes, and this has been successful in dealing with highly complex material containing slanted, curved, and rotated lines. [7] Methods following this paradigm are popular in the research community, but actual implementations are currently limited to Transkribus and Kraken/eScriptorium.
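The practical difference between the two approaches to layout analysis can be seen in the data they produce. The following illustrative structures (the key names are chosen here for clarity, although they are close to those used by Kraken’s baseline segmenter) show why a baseline-plus-boundary representation can follow a curved or rotated line where a simple rectangle cannot:

```python
# A bounding-box segmenter can only say "the line lies inside this rectangle":
box_line = {"bbox": [120, 80, 1450, 150]}   # x_min, y_min, x_max, y_max

# A baseline segmenter records the curve on which the writing sits, plus a polygon
# around the ink, so slanted, curved, or rotated lines remain well defined:
baseline_line = {
    "baseline": [[120, 125], [600, 115], [1100, 130], [1450, 142]],  # points along the written line
    "boundary": [[115, 90], [1455, 95], [1455, 150], [115, 148]],    # polygon enclosing the text
}
```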
Additionally, a wide variety of research algorithms can be found in the literature which have not yet been implemented in OCR systems. These include systems that merge the steps of layout analysis and transcription, or methods that are optimized to extract text from noisy environments such as natural scenes. [8] Another active field of research in computer science concerns methods to decrease the manual labor required to train machine learning algorithms successfully, through approaches such as domain adaptation (adapting models trained on one kind of document to another), semi-supervised learning (training on partially labelled examples), and wholly synthetic manuscript pages (training the machine on “artificial” images so that it can learn to read the real ones). [9] Nevertheless, the fact remains that the creation of these example cases or “ground truth” for the computer is the longest and most laborious part of the process for the end user, and it may even seem contradictory that one must manually transcribe many pages in order for the computer to transcribe automatically. Indeed, if one only has a very small corpus, or if the range of different scribal hands or styles of writing in that corpus is large, then it may not be worth the effort to use these automatic methods. However, if the corpus is large and homogeneous enough that the computer can be trained to a sufficiently high level for one’s needs, then, once this initial groundwork is done, the results can be spectacular, with thousands or even millions of words being transcribed automatically at literally the click of a button. Even so, it should not be surprising that there is no “magic solution” that can instantly solve all cases. As we know very well, texts are extremely complex objects, with a great deal of variety in terms of layout, format, structure, style of script, and so on, and it takes human beings many years of specialized training to learn to read them. This complexity makes them interesting and worthy of years of study, but it should come as no surprise that it also makes them difficult to treat with a machine.

3. Openness and Import/Export of Images, Texts and Models

In addition to the flexibility and adaptability to different writing systems, another of the core principles of both Kraken and eScriptorium is that of openness. The software for both Kraken and eScriptorium is open-source and free for anyone to download, use, and modify. More significantly, though, the framework is designed to allow for the easy import and export of images, transcriptions, and trained models, and this is particularly important for a number of reasons. As we now know very well from experience, a closed system is extremely risky in terms of sustainability, since if you are locked into a given system then you become entirely dependent on it: if it ceases to function then your project is potentially in jeopardy, and you can easily become hostage to any future developments, such as if a free service becomes paid-for. It is therefore always of the utmost importance that one uses standards-compliant data, and that this data can be freely imported into and exported from different pieces of software, in order to avoid dependence on any one piece and thereby help ensure the longevity of the process as a whole. Equally if not more important in the case of OCR/HTR is the ability to import and export trained models in particular, which matters in both scholarly and practical terms. From a scholarly point of view, there are very real questions around transparency and accountability in the use of machine learning. If we are preparing transcriptions automatically in this way, then naturally the results will be influenced by the training data that we have provided to the machine. As discussed above, there are different standards and practices for transcription, and the types of document and script that are used in the ground truth will inevitably influence the results. Other scholars will therefore need access to the ground truth and trained models in order to understand exactly how the text was obtained, to evaluate whether it was appropriate or not, and to anticipate potential biases and errors. [10] From a practical point of view, the process of compiling ground truth is often laborious, as we have seen, and in fact, on a commercial level, very large-scale datasets of high-quality ground truth are extremely valuable, which is part of the reason why multi-billion dollar corporations are so keen to access our e-mails, labelled photographs, and so on. In addition, the process of training a model is also relatively slow and intensive. This is of less concern to the average user in the humanities, since we can simply leave the computer to “do its thing” while we get on with something else. Nevertheless, training Deep Learning systems like Kraken is very demanding for the machine, and it can take weeks on a normal home computer. The process is very much faster if one has access to a High Performance Computing (HPC) center with specialized hardware, but very few scholars in the humanities have this access, and in any case the process uses a relatively high amount of electricity, with financial and ecological implications. [11] Fortunately, the training is the intensive and slow part: once it is done, the model can be used relatively quickly and easily for segmentation and transcription. However, this again illustrates the benefits of exporting and sharing models.
If I can train my model in an HPC center, and then download it and send it to you, or—even better—publish it on an open repository, then you and anyone else can take my model, upload it to your instance of eScriptorium (or Kraken or some other system), and use it from there. You may need to retrain it to fit your specific documents, but as long as our documents are sufficiently close, the training that you need to do can be significantly reduced, both in time and in the amount of ground truth required. You would then ideally also publish your re-trained model in a public repository, and in this way we can build up a shared collection of trained models, thereby significantly reducing the computing time and energy that is currently being wasted on the redundant training of many different models on what is essentially the same script. More specifically, Kraken and eScriptorium both allow users to export and import models, for instance downloading them to their personal computers to do with as they wish. Kraken is also directly linked to Zenodo, a large-scale public international repository for research data. This means that one can decide at any point to publish a model to Zenodo, and the system will then take care of the publication, meaning that the model will be preserved for the long term according to best practices in data archiving, including the automatic assignment of a persistent identifier (a DOI) for future reference. Managing this successfully requires significant care in documentation and metadata, as future users of an existing model will need to know which standards of transcription were used, along with which sample images and so on, and this in turn requires that the entire ground truth also be published along with the model. [12]
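By way of illustration only, the kind of exchange that such publication involves can be sketched against Zenodo’s public REST API; the access token, filenames, and metadata below are placeholders, and Kraken’s own tooling automates a similar process on the user’s behalf.

```python
import requests

ZENODO_TOKEN = "..."                                   # placeholder personal access token
API = "https://zenodo.org/api/deposit/depositions"

# 1. Create an empty deposition.
dep = requests.post(API, params={"access_token": ZENODO_TOKEN}, json={}).json()

# 2. Upload the trained model file itself.
with open("caroline_minuscule.mlmodel", "rb") as f:    # placeholder model file
    requests.put(f"{dep['links']['bucket']}/caroline_minuscule.mlmodel",
                 data=f, params={"access_token": ZENODO_TOKEN})

# 3. Attach the metadata that future users will need: transcription conventions,
#    training corpus, licence, and a pointer to the published ground truth.
metadata = {"metadata": {
    "title": "Caroline minuscule transcription model (illustrative example)",
    "upload_type": "dataset",
    "description": "Trained on ...; transcription follows ... conventions; ground truth at ...",
    "creators": [{"name": "Example, A."}],
}}
requests.put(f"{API}/{dep['id']}", params={"access_token": ZENODO_TOKEN}, json=metadata)

# 4. Publishing the deposition is what triggers the assignment of a persistent DOI.
requests.post(f"{API}/{dep['id']}/actions/publish", params={"access_token": ZENODO_TOKEN})
```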

4. Some Challenges for a Multi-Script VRE

The discussion so far presents eScriptorium as it stands at the time of writing. Although still in development, it is already being used by numerous teams in several different instances across Europe and the United States. [13] There are, however, numerous challenges that remain if the project is to achieve its goals. Aside from technical details of implementation, it seems at this point that the most significant challenges lie in the goal of being as close as possible to working with any script. Indeed, it is already clear that this is not truly possible: for instance, as mentioned above, the automatic transcription module of Kraken applies a line-by-line approach, but this assumes that the text is in fact written in lines (or that it can be approximated as such), an assumption that does not hold for hieroglyphic scripts like ancient Egyptian. Indeed, the question of text direction is more complex than one might first imagine. On the face of it, the situation is simple enough: most scripts read in lines from left to right and then top to bottom (such as Greek and Latin), or in lines from right to left and then top to bottom (such as Hebrew or Arabic), or in columns from top to bottom and then right to left (as is often the case for Chinese and Japanese).
There are, however, other cases, such as lines read from top to bottom and then left to right (such as Mongolian), or columns read from bottom to top (such as some inscriptions in Old Javanese). Furthermore, some scripts like Latin are (usually) written on a baseline, while others like Hebrew are (usually) written from a top-line, and Arabic can be written along short diagonal segments which form a line overall. Writing can also be circular, for instance on coins, or spiral-shaped, for instance on prayer-bowls, or radiating out from a central point, for instance in Arabic marginal glosses, or in very complex shapes that form pictures, for instance in micrography or calligrammes. The situation is even more complex in multigraphic contexts, that is, where different writing-systems are mixed in a single document, such as a seventeenth-century liturgical manuscript written in Chinese and Hebrew, to take just one example from many thousands. [14] The international Unicode standard for representing the world’s writing systems in computers describes an algorithm for treating bidirectional documents (such as mixing English and Arabic); this standard is in its forty-second revision at the time of writing and currently extends to nearly eighteen thousand words, or around forty-five pages if printed. This illustrates the complexity of simply displaying a typed document with different directions of script, let alone that of automatically transcribing such a document from an image and then presenting it in an interface where the user can correct the lines, regions and transcription, with all the text being displayed in the correct directions as required.
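To give a sense of the problem at its very simplest, the following sketch uses the python-bidi package, one third-party implementation of the Unicode Bidirectional Algorithm, to reorder a single line mixing English and Hebrew for display; an HTR interface must solve this and far harder ordering problems both when displaying and when letting the user edit transcriptions.

```python
from unicodedata import bidirectional
from bidi.algorithm import get_display   # python-bidi: an implementation of the Unicode bidi algorithm

# Logical order (the order in which the characters are stored): English quoting a Hebrew word.
logical = "The word \u05e9\u05dc\u05d5\u05dd means peace."

# Every character carries a Unicode bidirectional category (L, R, AL, WS, and so on).
print([bidirectional(c) for c in "a\u05e9"])   # ['L', 'R']

# get_display reorders the string into visual order for a left-to-right context;
# the same logic must run, in effect in reverse, whenever a user edits a mixed-direction line.
print(get_display(logical))
```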
This variety of writing-systems, and particularly the need to cater for so-called “rare” and historical scripts, introduces further challenges beyond directionality. By definition, “rare” languages and scripts do not have large corpora and are not already well catered for by existing software and methods. Indeed, the very nature of Deep Learning is that, as we have already seen, it requires large amounts of pre-existing material, and in general the more such material the better (provided that the data is sufficiently representative). This means that, almost by definition, methods that rely on “big data” are not appropriate for “rare” languages and scripts. In practice these methods can usually still be used to some extent, as long as the corpus is not very small or very heterogeneous, but such efforts will almost always be small “boutique” projects, without the support of large companies (for both better and worse) or reusability across domains. Furthermore, some of the basic techniques for improving the results of OCR/HTR will not work in these cases. For instance, it is very common to use some sort of statistical language processing to correct errors in the OCR of modern texts: a very simple example is to run a spell-checker on the result, but more complex examples automatically analyze the language and attempt to correct errors based on what is or is not linguistically likely or even possible. Such an approach can improve the transcription considerably, but it requires a pre-existing model of the language, so that the computer can recognize what is and is not likely to be an error. However, a search for a spell-checker for (say) Old Vietnamese is very unlikely to bear fruit any time soon, if ever, and indeed the same holds for accurate statistical models of orthographically varied historical writing in general.
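For comparison, the kind of dictionary-based post-correction that is trivial for well-resourced modern languages can be sketched in a few lines; the word list and the simulated OCR output here are invented, and the point is precisely that no such lexicon or language model exists for most historical languages and spelling systems.

```python
from difflib import get_close_matches

# An invented toy lexicon; for a "rare" or historical language no such resource may exist at all.
lexicon = ["regnum", "regina", "gratia", "plena", "dominus"]

def correct(token: str) -> str:
    """Replace an OCR token with the closest lexicon entry, if one is similar enough."""
    match = get_close_matches(token.lower(), lexicon, n=1, cutoff=0.75)
    return match[0] if match else token

ocr_output = "gratia plcna regnum".split()        # "plcna" simulates a common c/e confusion
print(" ".join(correct(w) for w in ocr_output))   # gratia plena regnum
```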

5. Different Points of View

There is, however, a further aspect of this: as those of us in the humanities know well, there are many different conventions and possibilities when preparing editions. Despite the claims of some that a transcription must be a simple reproduction of “what is on the page,” it is nevertheless clear that a written text contains an effectively infinite amount of information, and any transcription is necessarily an active selection and, in a sense, translation from one system of writing to another. [15] Even Latin texts have different conventions for transcription, depending on whether the context is Classical or Medieval, whether paleographical, epigraphical or papyrological, and so on, and the complexity multiplies enormously when considering practices for languages such as those of South-East Asia or even the Ancient Near East. [16] It is therefore impossible and indeed undesirable that the computer automatically produce a transcription without any guidance from the user, since it is impossible to know a priori which standards of transcription the computer should follow. One can certainly imagine pre-preparing a list of common cases, following conventions established by significant scholarly bodies, and this would indeed be very useful and desirable, but it still seems certain that many other cases would remain.
This need to accommodate different standards makes the reuse of models more difficult, since it increases the degree to which models must be retrained for specific cases. The challenge goes much further than this, however, and extends to the basis of any VRE for manuscript studies or textual editing, since it also means that any VRE which will be used by a wide range of people must be able to accommodate these different standards. Extending the workflow from transcription to edition introduces further complexities, as there are (also) many different types of edition, and the variety in editions is (probably) greater than that of transcriptions. Ideally, a single VRE should accommodate all standards and types of edition, as well as all standards of transcription, for all writing-systems written on all supports. Such flexibility is possible, but it comes at a cost: either the interface must be extremely complex to account for all the different options, or it requires some level of programming in order to produce a customized interface specific to a given project. Indeed, this is perhaps the biggest challenge faced by the Text Encoding Initiative (TEI).
Many have complained that the TEI guidelines for text encoding are extremely complex and unwieldy, and that they do not prescribe a single convention for transcription, meaning that, in effect, they are not a true standard. However, if the TEI Guidelines did impose a single convention then they would immediately become unusable for all those who want or indeed need to use other conventions. Texts are different, and editions and research projects have different goals and therefore different needs, even those projects that are studying the same texts. [17] In practice, then, the TEI is a sort of “meta-standard,” from which one can specify more restricted and prescriptive standards for particular contexts, with one of the more successful examples being EpiDoc for editions of inscriptions. [18] VREs and other tools for the preparation and publication of digital editions face a similar challenge: it is certainly within the bounds of technology to develop a simple process whereby one can produce a transcript or even publish an edition very easily in a relatively small number of clicks, but this necessarily means that most of the decisions will be taken from the researcher and put into the hands of the tool-developers. Conversely, processes can also be developed which give manuscript specialists control over fine details according to their own needs, but this means complex interfaces and/or the need to actively write code at some level to customize the process.
This need for different standards, methods, and points of view extends well beyond transcription and editions to all forms of manuscript studies and indeed to all scholarly research in the humanities. [19] Armando Petrucci and Colette Sirat have both made similar observations for palaeography: “Infatti ogni terminologia paleografica è legata ad una particolare visione storica del fenomeno scrittorio … ma legittimamente utilizzabili risulteranno comunque tutte quelle fondate su premesse metodologiche valide e su rigorose analisi grafiche.” [20] “Two things which are similar are always similar in certain respects. … Generally, similarity, and with it repetition, always presupposes the adoption of a point of view: some similarities or repetitions will strike us if we are interested in one problem or another” (Sirat 2006:310, citing Popper 1968:420–422).
This may seem obvious, but in fact it raises a fundamental epistemological question: annotation and comparison are core tasks of scholarship and have been identified as two of the six “scholarly primitives” which underlie all of our work. [21] However, annotation, or description, depends on terminology, as indeed does discovery, another of Unsworth’s primitives; and, as Petrucci has noted, there is no single terminology which can be claimed as “the” valid one over all others. Similarly, Sirat and Popper have noted that comparison also requires a point of view, and different problems require different comparisons and therefore different points of view. This poses a very significant challenge to VREs, and indeed to the ideal of interoperability and related areas such as Linked Open Data. In addition, the fundamental principles of machine learning itself, as a mere kind of statistical inference, can cast doubt on the scholarly value of the automatic output of these methods when “hidden” inside VREs, with their design assumptions and limitations remaining relatively opaque to humanities users even in open systems. Different terminologies can be related through ontologies and other tools, such that one can record, for both human and computer use, that two given terms are close in meaning, exact matches, related, broader or narrower in scope, and so on, and in this way one can link different terminologies, at least in principle. [22] However, this is a complex and laborious process, and it also requires a very deep understanding of both terminologies, as the scope for misunderstanding and error is enormous. We must therefore consider how to design VREs to enable different points of view, which includes allowing for different terminologies, different interfaces, and different ways of presenting, annotating, comparing, searching, and otherwise working with the data. Indeed, as Elena Pierazzo has observed, the text itself is only one part of the scholarly work and interpretation in an edition, and Johanna Drucker and others have shown that the interface in general is interpretative and embeds specific methods and viewpoints, and therefore necessarily excludes others. [23] For this reason, it seems unlikely that there can ever be a single VRE or other system which responds to all requirements; instead, perhaps we must accept that there will always be many different tools and frameworks.
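As a small sketch of what such a mapping can look like in practice, the following uses the rdflib library and the SKOS vocabulary to record that two terms from two (invented) palaeographical terminologies are close in meaning without asserting that they are identical; all URIs and labels are placeholders.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

g = Graph()
TERMS_A = Namespace("https://example.org/terminology-a/")   # placeholder vocabularies
TERMS_B = Namespace("https://example.org/terminology-b/")

# Two concepts from two different terminologies, each with its own preferred label.
g.add((TERMS_A.caroline, SKOS.prefLabel, Literal("Caroline minuscule", lang="en")))
g.add((TERMS_B.karolingische, SKOS.prefLabel, Literal("karolingische Minuskel", lang="de")))

# Record that the two concepts are close in meaning, but not asserted to be identical.
g.add((TERMS_A.caroline, SKOS.closeMatch, TERMS_B.karolingische))

print(g.serialize(format="turtle"))
```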
Certainly we must follow existing standards where these are available: otherwise there is no hope for interchange or data sharing, and our material is certain to be lost very quickly once our custom tool is no longer maintained and our custom data is therefore unreadable. Similarly, the best option remains to follow standards and to use tools that allow for the ready import and export of data, including trained models in the case of machine learning. In this way we have some chance of moving between different tools, frameworks, and VREs as necessary, taking advantage of those that best respond to the point of view that we need for a given problem at a given moment. This also suggests favoring smaller modules that can be pieced together into different workflows as required, rather than large, centralized, monolithic VREs, and this piecing together in turn requires at least some understanding of data and (probably) basic programming. This should not be of concern to classicists or others in the humanities: certainly software development is a highly skilled profession that requires specialized training, but the basics of Python and XSLT are very much easier to learn than the complexities of Greek or Latin.


Balogh, D., and A. Griffiths. 2020. DHARMA Transliteration Guide. Release version 3. Draft paper.
British Library and A. Keinan-Schoobaert. 2020. Ground Truth Transcriptions for Training OCR of Historical Arabic Handwritten Texts. London.
Diem, M., F. Kleber, S. Fiel, T. Grüning, and B. Gatos. 2017. “cBAD: ICDAR2017 Competition on Baseline Detection.” In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 9 vols., 1:1355–1360. Kyoto.
Drucker, J. 2014. Graphesis: Visual Forms of Knowledge Production. Cambridge, MA.
Elliott, T., G. Bodard, H. Cayless et al. 2006–2020. EpiDoc: Epigraphic Documents in TEI XML. La Jolla, CA.
Graves, A., S. Fernández, F. Gomez, and J. Schmidhuber. 2006. “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks.” In ICML ’06: Proceedings of the 23rd International Conference on Machine Learning, ed. W. Cohen and A. Moore, 369–376. New York.
Huitfeldt, C., and C. M. Sperberg-McQueen. 2008. “What is Transcription?” Literary and Linguistic Computing 23:295–310.
———. 2018. “Interpreting Difference among Transcripts.” In Digital Humanities Book of Abstracts. Mexico City.
Kang, L., P. Riba, Y. Wang, M. Rusiñol, A. Fornés, and M. Villegas. 2020. “GANwriting: Content-Conditioned Generation of Styled Handwritten Word Images.” In Computer Vision – ECCV 2020, ed. A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, 273–289. Heidelberg.
Kiessling, B. 2019. “Kraken – A Universal Text Recognizer for the Humanities.” Digital Humanities Book of Abstracts. Utrecht.
Kiessling, B., D. Stökl Ben Ezra, and M. Miller. 2019. “BADAM: A Public Dataset for Baseline Detection in Arabic-script Manuscripts.” HIP@ICDAR 2019. New York.
Kiessling, B., R. Tissot, D. Stökl Ben Ezra, and P. A. Stokes. 2019. “eScriptorium: An Open Source Platform for Historical Document Analysis.” In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 19–24. Washington, DC.
Miles, A., and S. Bechhofer. 2009. SKOS Simple Knowledge Organisation System. Cambridge, MA.
OCR-D. n.d. The Ground-Truth-Guidelines. Wolfenbüttel.
Petrucci, A. 2001. La descrizione del manoscritto: storia, problemi, modelli. 2nd ed. Rome.
Pierazzo, E. 2011. “A Rationale of Digital Documentary Editions.” Literary and Linguistic Computing 26:463–477.
———. 2015. Digital Scholarly Editing: Theories, Models and Methods. Farnham, UK.
Pierazzo, E., and P. A. Stokes. 2011. “Putting the Text back into Context: A Codicological Approach to Manuscript Transcription.” In Kodikologie und Paläographie im digitalen Zeitalter 2 – Codicology and Palaeography in the Digital Age 2, ed. F. Fischer, C. Fritze, and G. Vogeler, 397–429. Norderstedt.
Robinson, P., and E. Solopova. 1993. “Guidelines for Transcription of the Manuscripts of the Wife of Bath’s Prologue.” In The Canterbury Tales Project: Occasional Papers, ed. N. F. Blake and P. Robinson, 19–52. Oxford.
Sirat, C. 2006. Writing as Handwork: A History of Handwriting in Mediterranean and Western Culture. Ed. L. Schramm. Bibliologia 24. Turnhout.
Stokes, P. A. 2009. “The Digital Dictionary.” Florilegium 26:37–69.
———. 2020a. “eScriptorium : un outil pour la transcription automatique des documents.” In ÉpheNum: Veille, agenda et actualités des humanités numériques à l’EPHE. Paris. English version available at
———. 2020b. “Palaeography, Codicology and Stemmatology.” In Handbook of Stemmatology: History, Methodology, Digital Approaches, ed. P. Roelli, 46–56. Berlin.
———. 2020c. “eScriptorium 1–6.”
Stokes, P. A., and G. Noël. 2010. “Project Report.” Anglo-Saxon Cluster. London.
Unsworth, J. 2000. “Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How Might Our Tools Reflect This?” Paper presented at Humanities Computing: Formal Methods, Experimental Practice. King’s College, London, May 13, 2000. London.
Wang, K., B. Babenko, and S. Belongie. 2011. “End-to-End Scene Text Recognition.” In 2011 International Conference on Computer Vision, 1457–1464. Washington, DC.
Wigington, C., C. Tensmeyer, B. Davis, W. Barrett, B. Price, and S. Cohen. 2018. “Start, Follow, Read: End-to-End Full-Page Handwriting Recognition.” In Proceedings of the European Conference on Computer Vision (ECCV), ed. V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, 367–383. Heidelberg.


[ back ] 1. Further discussions and papers on the eScriptorium project include Kiessling 2019, Kiessling, Stökl Ben Ezra et al. 2019, Kiessling, Tissot et al. 2019, and Stokes 2020a. This work has received funding from the European Union’s Horizon 2020 Research and Innovation program under Grant Agreement No. 871127 (RESILIENCE), and from the Initiatives de Recherches Interdisciplinaires et Stratégiques of Université PSL (Scripta–PSL).
[ back ] 3. Some writers distinguish Optical Character Recognition (OCR) as the automatic transcription of printed text and Handwritten Text Recognition (HTR) as that of handwritten text, whereas others reserve OCR for character-based approaches to recognition and HTR for line-based. As a result, the difference between OCR and HTR is often blurred in the current literature, and so we use the two terms together as interchangeable unless clearly stated otherwise.
[ back ] 4. For videos showing these steps in an early version of eScriptorium, see Stokes 2020a and Stokes 2020c, numbers 1 and 5.
[ back ] 5. For videos showing these steps see Stokes 2020a and Stokes 2020c, numbers 2, 4, and 6. For convenience, the term “document” is used in this article as a shorthand to refer to any instance of writing, whether printed or handwritten, without reference to any specific form, support, or writing instrument.
[ back ] 6. Graves et al. 2006.
[ back ] 7. Diem et al. 2017.
[ back ] 8. Wigington et al. 2018, Wang et al. 2011.
[ back ] 9. One recent example among others is Kang et al. 2020.
[ back ] 10. A simple example of an error resulting from this point was a project which attempted to automatically identify authorship in vernacular Old English texts, but without controlling for different editorial practices in the sources: as a result, the project was successful in identifying editors but not authors. See further Stokes 2009:54–56.
[ back ] 11. As just one example, the GPUs that the eScriptorium team is currently putting in place will have an estimated running cost of approximately €10,000 per year.
[ back ] 12. For further discussion of this and other related problems, see (for example) OCR-D n.d.
[ back ] 13. As well as Scripta and RESILIENCE, other example projects include OpenITI AOCP, LectAuRep, Sofer Mahir, DIM STCN, and CREMMA, among others.
[ back ] 14. Hebrew Union College MS 926, available at
[ back ] 15. Discussions of this include Pierazzo 2011:464–472, Pierazzo 2015:85–101, Huitfeldt and Sperberg-McQueen 2008, Huitfeldt and Sperberg-McQueen 2018, and Robinson and Solopova 1993:20–29.
[ back ] 16. One example of a transcription guide for such languages is Balogh and Griffiths 2020.
[ back ] 17. An example of this is described by Stokes and Noël 2010.
[ back ] 18. Elliott, Bodard, Cayless et al. 2006–2020.
[ back ] 19. Some concrete examples of the impact of transcription on interpretation are given by Stokes 2020b:50–54.
[ back ] 20. Petrucci 2001:70–71: “In fact, every paleographical terminology is connected to a particular historical vision of the phenomenon of handwriting … but all terminologies prove legitimately useful nonetheless, if they are founded on valid methodological premises and rigorous graphical analyses” (our translation).
[ back ] 21. These “scholarly primitives” are from an influential talk by Unsworth (2000).
[ back ] 22. One important example of such a schema is the Simple Knowledge Organisation System (SKOS), for which see Miles and Bechhofer 2009, esp. §10 Mapping Properties.
[ back ] 23. Pierazzo 2011:473–475, Drucker 2014.