1. Introduction
2. Our Innovation
While the approach of the SAWS project demonstrates an effective way to model links within a set of paragraphs, the problem we faced was the modeling of the links between paragraphs and subparagraphs in canonical hierarchies (print editions) and physical witnesses (manuscripts and fragments). The two constraints were:
- The library should preserve the physical layout of the text witnesses—MS, volume, folio, column, and line.
- Multiple links exist between passages in text witnesses to various canonical works, many times overlapping.
3. Framework components
We offer an integrated process of text modeling, linking, and alignment that supports multiple hierarchies and enables different modes of presentation, based on the integration of various available tools and standards as well as their modification to fit the unique challenges typical to the chosen literary corpus. A practical/ideological decision was made, to use only open-source software to build the solution, so that it can be reused and enhanced by others.
- CapiTainS [1], an open-source implementation of the CTS standard, was picked as a basis for the solution. As it provides not only an implementation of the API, but also a web UI for the digital library, it offered a basis for the VRE. CapiTainS uses TEI files as the storage of each text in the digital library, and we therefore produced a single TEI file per text. Two separate TEI/XML schemas were modeled, one for the description of the various canonical hierarchies, used in print editions, and the other for the description of physical witnesses, both in manuscripts and fragments.
- Based on the structure of the TEI files, two types of CTS URNs were generated, for canonical hierarchies and physical witnesses:
a. For canonical editions, the hierarchy follows the structural hierarchy, e.g. weekly Pentateuch lection, paragraph, word. For example, urn:cts:ancJewLit:tanhuma.reeh.pe:5 references the 5th paragraph in Tanhuma PE for Re’eh lection (Deut. 11:26–16:17).b. For physical witnesses, the hierarchy follows the physical layout of the manuscript or fragment, and paragraphs within the manuscript can be cited as ranges of words. For example, urn:cts:ancJewLit:tanhuma.reeh.yevr3b314:2.42-2.120 refers to Genizah fragment St. Petersburg Yevr. III B 314, folio 2, words 42–120.
- We used the simplest RDF ontology to represent the various links between canonical hierarchies and the physical witnesses. It contains a single predicate to denote that two text ranges within the library are linked—isRelatedTo. Given the complex structure of the Tanhuma genre, it is not yet clear to us whether it makes sense to specify the type of relationships between paragraphs in the library, or whether the mere notion that the texts are related should be recorded in the library, and presented to the researchers. We will consider allowing the researchers to specify the types of relationships, and to add/delete/edit relationships in the library, in future work.
- In order to locate the related text ranges in the library, we have used RWFS (Rolling Window Fuzzy Search), a text reuse detection algorithm, to automatically identify and align related paragraphs. RWFS locates similar text ranges in a given input text and corpus of texts by scanning the input text while performing fuzzy full-text searches on a window of words—ngrams—and clustering the positive results into consecutive text ranges (we initially developed RWFS for cataloging and identification of mal-transcribed texts, and we will present it in detail in Jewish Studies DH, Luxemburg, January 2021).
- We used TRAViz [3] for word-by-word text alignment. TRAViz, a JavaScript library developed within the eTRACES DH project, generates visualizations for Text Variant Graphs that show the variations between different editions of texts. It was integrated into CapiTainS’s user interface to provide a visualization of related texts when zooming into them.
4. A demonstrator
This work enabled the creation of a basic demo edition for the Tanhuma Corpus. We included in the library manually transcribed texts for a single Pentateuch lection—Re’eh. CapiTainS was used as the CTS server, and its Web UI—Nemo—was used to provide browser-based interface to the library. With the help of its tutorial and documentation, we have configured and extended CapiTainS Nemo to support right-to-left texts, the annotation linked paragraphs, and zooming into linked paragraphs. The VRE provides:
- A digital library of all canonical and physical witnesses, each browsable by its hierarchy.
- Graphic annotation of linked paragraphs, displayed when reading a text, allowing the user to zoom into linked passages for close reading.
- A view which displays linked paragraphs in two manners:
a. Side-by-side synoptic paragraph presentation.b. TRAViz based word-by-word synoptic presentation (visualizing the same transition point).
5. Future work
The ability to refer to any text range using CTS, combined with annotation with RDF, enables the tackling of many other challenges by offering additional elements for representation and documentation of various relations, for example:
- The original literary structure of the Tanhuma is not represented in any canonical print edition. The model may enable a scholar to define a hierarchical division that describes the assumed original structure and to map the texts against it.
- In some cases, physical witnesses do contain hierarchical elements that were introduced by the original copyists, whether they fit existing canonical hierarchies or not. The model enables the documentation of such structural annotations.
- Another feature of the Tanhuma corpus is its modularity, in which paragraphs are reused and appear in different homilies, sometimes greatly paraphrased, to fit the context. Such a feature may be documented by annotations using RDF predicates expressing the reuse type.