Digital Research Library for Multi-Hierarchical Interrelated Texts: From ‘Tikkoun Sofrim’ Text Production to Text Modeling

Uri Schor, Vered Raziel-Kretzmer, Moshe Lavee, and Tsvi Kuflik
University of Haifa

1. Introduction

“Tanhuma / Yelamdenu Literature” is a vast literary corpus of rabbinic homilies to the Hebrew Bible. It is a diverse group of interrelated texts that evolved through a long process of fluid and unstable oral and written transmission. Different recensions are available in a variety of print editions, dozens of manuscripts, hundreds of fragments, and secondary citations in medieval anthologies (Bregman 2003). There is no single accepted canonical hierarchy, i.e. an internal structure by which the text is subdivided (such as sedarim, chapters, and paragraphs), that represents all text variants. Various print editions provide different hierarchies and content, and the evidence of manuscripts and fragments further complicates the picture (Urbach 1966). Accordingly, the print medium provides insufficient access to the corpus and an inadequate representation of it in its entirety and complexity. A virtual research environment (VRE) for the Tanhuma is expected to provide a dynamic collection of homilies drawn from within the entire corpus and reflecting the complex interrelations between editions, manuscripts, and fragments.
The Tikkoun-Sofrim project was conceived for transcribing Tanhuma manuscripts by combining handwritten text recognition (HTR) and crowdsourcing (Wecker et al. 2019 and 2020). Following this project, we proposed a data modeling framework and an integrative process for inclusion of the transcriptions in a future VRE. We build upon recent perceptions that a VRE may provide “texts in versions and texts as versions in a way that is much simpler, more intuitive and dynamic than corresponding print attempts” (Pierazzo 2016:50). This new theoretical model inspects more granular pieces of texts in manuscripts, aiming to detect variations and mutations between these fragments (Pierazzo 2016) and representing the complex organization of these variations in a graph data structure (Andrews and Macé 2013). Based on hypergraphs representing the multiple relations between pairs of homilies or sections, it is possible to cope with the Tanhuma corpus, for which it is impossible to model textual transmission lines as directed acyclic graphs (stemmata). “The pervasive linkage between different contents and parts promote a modularised structure and a module-oriented vision of scholarly editions” (Sahle 2016:29), supporting the high degree of modularity of the Tanhuma corpus.

2. Our Innovation

The model for the digital library was inspired by the Sharing Ancient Wisdoms (SAWS) project [2], in which a digital library for a gnomological corpus was designed and implemented. The gnomological (wisdom-saying) texts abound with interrelations, such as translations, allusions, and paraphrases. The SAWS library was composed of the texts in the gnomological collections, each divided into short paragraphs of atomic wise sayings. These paragraphs were made accessible through the Canonical Text Services (CTS) API, so that each paragraph is citable by a unique ID, a CTS URN. The links between such sayings were modeled using the Resource Description Framework (RDF), thus creating a graph in which the vertices are paragraphs and the edges are the links between paragraphs. Different types of links were denoted by an RDF ontology, e.g. isTranslationOf, isParallelTo, isShorterVersionOf, etc.
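
The graph described above can be sketched in a few lines of Python. This is a minimal illustration, not the SAWS implementation: it stores links as plain (subject, predicate, object) triples rather than in an RDF store, the three predicate names are the ones quoted in the text, and the URNs and helper names (`build_graph`, `links_of_type`) are invented for the example.

```python
from collections import defaultdict

# Hypothetical paragraph URNs; only the predicate names come from the text.
TRIPLES = [
    ("urn:cts:example:collA.text1:3", "isTranslationOf", "urn:cts:example:collB.text2:7"),
    ("urn:cts:example:collA.text1:4", "isParallelTo", "urn:cts:example:collC.text5:1"),
    ("urn:cts:example:collB.text2:7", "isShorterVersionOf", "urn:cts:example:collC.text5:2"),
]


def build_graph(triples):
    """Index triples by subject: paragraph URN -> list of (predicate, object URN)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def links_of_type(graph, predicate):
    """Return every (subject, object) pair connected by the given predicate."""
    return [(s, o) for s, edges in graph.items() for p, o in edges if p == predicate]


graph = build_graph(TRIPLES)
parallels = links_of_type(graph, "isParallelTo")
```

Each paragraph URN is a node and each typed link is a labeled edge, so a query such as "all parallels" is a filter over edge labels.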

While the approach of the SAWS project demonstrates an effective way to model links within a set of paragraphs, the problem we faced was the modeling of the links between paragraphs and subparagraphs in canonical hierarchies (print editions) and physical witnesses (manuscripts and fragments). The two constraints were:

  • The library should preserve the physical layout of the text witnesses—MS, volume, folio, column, and line.
  • Multiple links exist between passages in text witnesses to various canonical works, many times overlapping.

With the limitation of a single hierarchy per text imposed by the CTS standard, we decided to use two different hierarchy schemas, one for canonical edition texts and one for transcribed physical witnesses. The canonical texts were organized by the traditional structural hierarchy of the print editions, while the transcriptions were organized by the physical layout of the manuscripts. In each case, the CTS hierarchy goes down to the word level, allowing for citing, and therefore linking, arbitrary text ranges bounded by two words. Using RDF, we documented the links within the library of citable texts. Together with a Web UI, we provide the researcher with a bird’s-eye view of a complex web of interrelated texts, along with the ability to zoom in for close reading of related texts. With the abundance of canonical editions containing Tanhuma genre texts, and the growing availability of manuscript transcriptions (the artifacts of the Tikkoun Sofrim project), a comprehensive digital library can be built in the coming years.
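
To illustrate why word-level citation makes arbitrary and overlapping ranges linkable, the following Python sketch parses a CTS URN with a passage range and tests whether two ranges in the same text overlap. It is an assumption-laden illustration rather than part of the project code: it handles only purely numeric citation levels of equal depth, ignores CTS subreferences, and the names `CtsRange`, `parse_urn`, and `overlaps` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CtsRange:
    namespace: str   # e.g. "ancJewLit"
    work: str        # e.g. "tanhuma.reeh.yevr3b314"
    start: tuple     # citation levels as integers, e.g. (2, 42)
    end: tuple       # equal to start when the URN cites a single point


def parse_urn(urn: str) -> CtsRange:
    """Split a CTS URN into namespace, work, and a numeric passage range."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN with a passage component: {urn!r}")
    _, _, namespace, work, passage = parts
    start_ref, _, end_ref = passage.partition("-")
    start = tuple(int(level) for level in start_ref.split("."))
    end = tuple(int(level) for level in end_ref.split(".")) if end_ref else start
    return CtsRange(namespace, work, start, end)


def overlaps(a: CtsRange, b: CtsRange) -> bool:
    """True if both ranges cite the same text and share at least one word."""
    same_text = (a.namespace, a.work) == (b.namespace, b.work)
    # Tuples compare level by level, which matches document order here
    # because both references are assumed to have the same depth.
    return same_text and a.start <= b.end and b.start <= a.end


first = parse_urn("urn:cts:ancJewLit:tanhuma.reeh.yevr3b314:2.42-2.120")
second = parse_urn("urn:cts:ancJewLit:tanhuma.reeh.yevr3b314:2.100-2.150")
shared = overlaps(first, second)
```

Because each range is bounded by two word positions, a passage in a witness can be linked to several canonical works at once, even when the linked ranges overlap.
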
In the current poster we present a basic, preliminary application of our model, the representation of relations between physical witnesses and canonical hierarchies. Specifically, the poster shows a fragment from the Cairo Genizah, containing a homily for Deuteronomy. This fragment is the only available witness to contain together materials that are found separately in two print editions. However, the model we suggest is flexible enough to support many other complexities, as we shall present below, after describing the process and the current demo edition.

3. Framework components

We offer an integrated process of text modeling, linking, and alignment that supports multiple hierarchies and enables different modes of presentation. It integrates various available tools and standards and modifies them to fit the unique challenges typical of the chosen literary corpus. For both practical and ideological reasons, we decided to use only open-source software to build the solution, so that it can be reused and enhanced by others.

  1. CapiTainS [1], an open-source implementation of the CTS standard, was picked as a basis for the solution. As it provides not only an implementation of the API, but also a web UI for the digital library, it offered a basis for the VRE. CapiTainS uses TEI files as the storage of each text in the digital library, and we therefore produced a single TEI file per text. Two separate TEI/XML schemas were modeled, one for the description of the various canonical hierarchies, used in print editions, and the other for the description of physical witnesses, both in manuscripts and fragments.
  2. Based on the structure of the TEI files, two types of CTS URNs were generated, for canonical hierarchies and physical witnesses:
    a. For canonical editions, the hierarchy follows the structural hierarchy, e.g. weekly Pentateuch lection, paragraph, word. For example, a URN of this type can reference the 5th paragraph in Tanhuma PE for the Re’eh lection (Deut. 11:26–16:17).
    b. For physical witnesses, the hierarchy follows the physical layout of the manuscript or fragment, and paragraphs within the manuscript can be cited as ranges of words. For example, urn:cts:ancJewLit:tanhuma.reeh.yevr3b314:2.42-2.120 refers to Genizah fragment St. Petersburg Yevr. III B 314, folio 2, words 42–120.
  3. We used the simplest RDF ontology to represent the various links between canonical hierarchies and the physical witnesses. It contains a single predicate to denote that two text ranges within the library are linked—isRelatedTo. Given the complex structure of the Tanhuma genre, it is not yet clear to us whether it makes sense to specify the type of relationships between paragraphs in the library, or whether the mere notion that the texts are related should be recorded in the library, and presented to the researchers. We will consider allowing the researchers to specify the types of relationships, and to add/delete/edit relationships in the library, in future work.
  4. In order to locate the related text ranges in the library, we used RWFS (Rolling Window Fuzzy Search), a text reuse detection algorithm, to automatically identify and align related paragraphs. RWFS locates similar text ranges in a given input text and a corpus of texts by scanning the input text while performing fuzzy full-text searches on a window of words (n-grams) and clustering the positive results into consecutive text ranges. We initially developed RWFS for cataloging and identification of mal-transcribed texts, and we will present it in detail at Jewish Studies DH, Luxembourg, January 2021.
  5. We used TRAViz [3] for word-by-word text alignment. TRAViz, a JavaScript library developed within the eTRACES DH project, generates visualizations for Text Variant Graphs that show the variations between different editions of texts. It was integrated into CapiTainS’s user interface to provide a visualization of related texts when zooming into them.
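
The rolling-window idea in step 4 can be sketched as follows. This is a simplified stand-in, not the RWFS implementation: the fuzzy full-text search is replaced by a character-level similarity ratio from Python's `difflib`, a single target text stands in for the corpus, and the function names, window size, and threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def window_matches(window, target_words, threshold):
    """True if some same-length window of the target is similar enough."""
    size = len(window)
    probe = " ".join(window)
    for j in range(len(target_words) - size + 1):
        candidate = " ".join(target_words[j:j + size])
        if SequenceMatcher(None, probe, candidate).ratio() >= threshold:
            return True
    return False


def rolling_window_search(input_text, target_text, size=3, threshold=0.8):
    """Return (first_word, last_word) index ranges of the input that resemble the target."""
    words = input_text.split()
    target_words = target_text.split()
    # Slide a window of `size` words over the input and keep the positions that hit.
    hits = [
        i for i in range(len(words) - size + 1)
        if window_matches(words[i:i + size], target_words, threshold)
    ]
    # Cluster overlapping or adjacent hit windows into consecutive word ranges.
    ranges = []
    for i in hits:
        start, end = i, i + size - 1
        if ranges and start <= ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


found = rolling_window_search(
    "lorem ipsum the quick brown fox jumps dolor amet",
    "alpha the quick brown fox jumps omega",
)
```

The returned word ranges are the kind of spans that can then be cited as word-level CTS ranges and recorded as related passages.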

4. A demonstrator

This work enabled the creation of a basic demo edition for the Tanhuma corpus. We included in the library manually transcribed texts for a single Pentateuch lection, Re’eh. CapiTainS was used as the CTS server, and its Web UI, Nemo, was used to provide a browser-based interface to the library. With the help of its tutorial and documentation, we configured and extended CapiTainS Nemo to support right-to-left texts, the annotation of linked paragraphs, and zooming into linked paragraphs. The VRE provides:

  1. A digital library of all canonical and physical witnesses, each browsable by its hierarchy.
  2. Graphic annotation of linked paragraphs, displayed when reading a text, allowing the user to zoom into linked passages for close reading.
  3. A view which displays linked paragraphs in two manners:
    a. Side-by-side synoptic paragraph presentation.
    b. TRAViz-based word-by-word synoptic presentation (visualizing the same transition point).
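
The data behind a word-by-word synoptic view can be approximated with a simple pairwise alignment. The sketch below is not TRAViz, which is a JavaScript library that builds variant graphs over several editions. It only aligns two word sequences with Python's `difflib`, and the function name `align_words` and the sample texts are invented for illustration.

```python
from difflib import SequenceMatcher


def align_words(text_a, text_b):
    """Align two texts word by word.

    Returns rows of (tag, words_from_a, words_from_b), where tag is one of
    'equal', 'replace', 'delete', or 'insert'.
    """
    a, b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


rows = align_words("one two three four", "one two five four six")
for tag, left, right in rows:
    print(f"{tag:8} | {left:10} | {right}")
```

Rows tagged `equal` correspond to shared text, while the other tags mark the points where the two witnesses diverge.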

5. Future work

The ability to refer to any text range using CTS, combined with annotation with RDF, enables the tackling of many other challenges by offering additional elements for representation and documentation of various relations, for example:

  1. The original literary structure of the Tanhuma is not represented in any canonical print edition. The model may enable a scholar to define a hierarchical division that describes the assumed original structure and to map the texts against it.
  2. In some cases, physical witnesses do contain hierarchical elements that were introduced by the original copyists, whether they fit existing canonical hierarchies or not. The model enables the documentation of such structural annotations.
  3. Another feature of the Tanhuma corpus is its modularity, in which paragraphs are reused and appear in different homilies, sometimes greatly paraphrased, to fit the context. Such a feature may be documented by annotations using RDF predicates expressing the reuse type.

6. Conclusion

At the core of our model is the use of a library containing texts in multiple hierarchies, both diverse canonical hierarchies from print editions and hierarchies of physical witnesses of the corpus. The model enables the representation of all the links between the texts. The use of RDF will enable marking and storing more complex interrelations between the texts. The model is based on common standards and open-source tools and hence may serve for the tailoring of applications for other corpora with similar characteristics.


Andrews, T. L., and C. Macé. 2013. “Beyond the tree of texts: Building an empirical model of scribal variation through graph analysis of texts and stemmata.” Literary and Linguistic Computing 28(4):504–521.
Bregman, M. 2003. The Tanhuma-Yelammedenu Literature: Studies in the Evolution of the Versions. Piscataway, NJ. Hebrew.
CapiTainS [1].
Pierazzo, E. 2016. “Modelling Digital Scholarly Editing: From Plato to Heraclitus.” In Digital Scholarly Editing: Theories and Practices, ed. M. J. Driscoll and E. Pierazzo, 41–58. Cambridge, UK.
Sahle, P. 2016. “What is a Scholarly Digital Edition?” In Digital Scholarly Editing: Theories and Practices, ed. M. J. Driscoll and E. Pierazzo, 19–39. Cambridge, UK.
Sharing Ancient Wisdoms (SAWS) Project [2].
Urbach, E. E. 1966. “Remnants of the Tanhuma-Yelamdenu.” Kobetz Al Yad 6:1–55. Hebrew.
Wecker, A. J. et al. 2019. “Tikkoun sofrim: A webapp for Personalization and Adaptation of Crowdsourcing transcriptions.” In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, 109–110. New York.
Wecker, A. J. et al. 2020. “Opportunities for Personalization for Crowdsourcing in Handwritten Text Recognition.” In Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization, 373–375. New York.