2. The eScriptorium Workflow
To understand the current state of the art in OCR/HTR systems, one must first understand the basic workflow of eScriptorium and other similar systems. In general, going from images to transcription requires four basic steps:
- Importing the images into the system, including preprocessing such as PDF import and other format conversions.
- Finding the lines of text and other significant elements (columns, glosses, initials, etc.) on the images: that is, subdividing each image into sets of shapes that correspond to different region and line types.
- Transcribing the lines: that is, converting each line image into the corresponding text.
- Compiling the lines of text into a coherent document and exporting the result for markup, publication, etc.
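To make steps 2 and 3 concrete, the following is a minimal sketch of how they look in Kraken, the engine underlying eScriptorium, using its Python API. The model path is a placeholder, and the exact return types vary between kraken versions; this is an illustration of the pipeline, not a substitute for the web interface, which exposes the same functionality without any programming.

```python
# A sketch of steps 2 and 3 with kraken (placeholder model path;
# return types differ slightly between kraken versions).
from PIL import Image
from kraken import blla, rpred
from kraken.lib import models

im = Image.open('page.png')

# Step 2: trainable baseline segmentation (lines and regions,
# sorted into reading order), using kraken's default model.
seg = blla.segment(im)

# Step 3: text recognition with a trained transcription model.
model = models.load_any('transcription_model.mlmodel')  # placeholder path
for record in rpred.rpred(model, im, seg):
    print(record.prediction)
```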
Steps 1 and 4 are relatively straightforward and are largely a question of interface. [4] In eScriptorium, import can be done directly from a user's hard drive, or simply by giving the URL of a IIIF manifest file and leaving the machine to import the images automatically. Steps 2 and 3 are generally much more complex, since, as we have seen, the computer needs to be able to handle a very wide range of different documents, supports, and layouts. [5]

It is important to note that each of these basic steps comprises multiple sub-tasks, with sometimes unexpected ways in which they can fail. Step 2 not only finds the lines but must also sort them into the order in which a human would read them, a process which is by no means easy given the many possible layouts across different writing systems. This step is crucial, though, because the transcription produced in step 3 will be completely unreadable, even if perfect at the line level, if this ordering operation has failed.

In Kraken, and therefore in eScriptorium, both of these steps are handled by trainable computer-vision algorithms, that is, by machine learning. Very broadly, this means that one must have example documents which have already been prepared: images of pages annotated with columns, lines and so on for step 2, and transcriptions of text matching the images for step 3. One then submits these examples to the computer, which "learns" from them, creating a statistical model of the images it has already seen; this model then allows it either to segment new, unseen images into regions and lines, or to produce a transcription of the text on them. The eScriptorium interface also allows transcriptions to be entered directly in order to compile ground-truth material for training, and it provides mechanisms for the user to correct any errors in the automatic (or indeed manual) results (as shown in [Stokes 2020c], video number 3).
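As an illustration of what the IIIF import in step 1 involves, the sketch below fetches a manifest and collects one image URL per canvas. It assumes a IIIF Presentation API 2.x manifest and is deliberately minimal; eScriptorium's actual importer handles many more cases (other API versions, multiple images per canvas, image-size negotiation, and so on).

```python
# Minimal sketch of IIIF manifest import: fetch the manifest JSON and
# collect one image URL per canvas. Assumes IIIF Presentation API 2.x;
# real importers must handle other versions and edge cases.
import requests

def iiif_image_urls(manifest_url: str) -> list[str]:
    manifest = requests.get(manifest_url, timeout=30).json()
    urls = []
    for canvas in manifest['sequences'][0]['canvases']:
        # each canvas carries (at least) one image annotation
        urls.append(canvas['images'][0]['resource']['@id'])
    return urls
```

Each returned URL can then be downloaded and queued for segmentation and transcription, which is essentially what happens automatically when a user pastes a manifest URL into the interface.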