Mining Manuscript Data in the New Testament Virtual Manuscript Room

Andrew Smith, Shepherds Theological Seminary


In the late 1980s, businesses were collecting customer data because they knew it was valuable to what would later be termed business intelligence, but until the 1990s there was no clear path to mining and usefully exploiting those data. The New Testament Virtual Manuscript Room is in the business world’s 1980s: we have collected great amounts of textual data, but those data could be mined for information much broader than critical editions of the New Testament. Reconceptualizing the use of that massive database would prompt changes in data entry, as the encoding for data points such as codicological and paratextual features is woefully underdeveloped in TEI-XML. This reconceptualization would also require new means of querying and post-processing the data. This paper offers three proposals for how this extremely valuable virtual research environment (VRE) could be developed into a much more powerful resource.

Introduction to the NTVMR

The New Testament Virtual Manuscript Room (NTVMR), a popular online resource for scholars and students in biblical studies, is advertised as “a place where scholars can come to find the most exhaustive list of New Testament manuscript resources, can contribute to marking attributes about these manuscripts, and can find state of the art tools for researching this rich dataset.” [1] The NTVMR and its broader Collaborative Research Environment are key to the production of the critical editions of the Greek New Testament known as the Editio Critica Maior (ECM). Apart from the ECM work being performed at Münster’s Institut für neutestamentliche Textforschung (INTF), external ECM editorial teams also use the NTVMR environment, [2] as do several projects editing text outside the Greek New Testament. [3] The range of projects now using the NTVMR is a testimony to its maturity and utility as a suite of manuscript tools. Manuscript data in the NTVMR follow the EpiDoc Guidelines for encoding ancient documents using the TEI-XML standard, [4] with additional guidelines developed by the International Greek New Testament Project (IGNTP). [5] Transcriptions are mapped to images of the associated pages, which are either hosted locally by the INTF or linked from external sites when better images are available elsewhere. And though the NTVMR stores each page of a manuscript as a separate XML file, the underlying database associates each page with a unique document identification number (DocID).
To date, of the roughly 5,300 extant Greek New Testament manuscripts, the total number of transcribed pages of text initially appears quite small: [6] only 2.61% (or 55,828) of the cataloged 2,139,833 manuscript pages have been published online. [7] Despite the immense work ahead in advancing the transcription numbers, the NTVMR still represents the most complete dataset of this type. Development of the environment has focused on the production of critical editions of the Greek New Testament, yet more could be done to facilitate populating and researching the dataset.

The Problem/Opportunity

Codicological and paratextual manuscript data encoded in the NTVMR are minimal, as the editorial activity of preparing critical editions does not require them. However, the scholarly use of paratextual and codicological data in contextualizing biblical manuscripts is growing to where populating this information alongside transcription data now makes sense. [8] The NTVMR represents a well-developed platform that could be modified with little difficulty to accommodate additional non-textual data. Additionally, more rigorous data definition and entry would enhance the usability of the manuscript data. What follows are three proposals to reach this goal.

Proposal 1: Expand Data Structures for Non-textual Information

The physical description element <physDesc> is already defined by TEI-XML (but not used on the NTVMR) and provides a data structure for supplying codicological information. Specific elements that can be encoded include:

  • the object description including support materials and description, collation (arrangement of leaves, bifolia, etc.), foliation (numeration), condition, and layout;
  • the hand description including the writing style, script descriptions, decoration elements, musical notation, and additions (e.g. marginalia or other annotation);
  • the binding description including binding information (type of covering, boards, etc.), seal descriptions, and accompanying material.

Predefined attributes permit fine-grained precision in describing these elements. For example, a <handNote> element might carry the attributes script="chancery" and medium="ink-green".
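As an illustration, a minimal <physDesc> record for a codex might be encoded as follows (a sketch only: all values are invented for the example and are not drawn from an actual NTVMR transcription):

```xml
<physDesc>
  <objectDesc form="codex">
    <supportDesc material="parch">
      <support>Parchment of medium quality</support>
      <collation>Quires of eight leaves (quaternions)</collation>
      <foliation>Foliated in a later hand, upper right recto</foliation>
      <condition>Water damage along the outer margins</condition>
    </supportDesc>
    <layoutDesc>
      <layout columns="2" writtenLines="41">Two columns of 41 lines per page</layout>
    </layoutDesc>
  </objectDesc>
  <handDesc>
    <handNote script="chancery" medium="ink-green">Marginal annotations in a later chancery hand</handNote>
  </handDesc>
  <bindingDesc>
    <binding><p>Rebound in the nineteenth century over wooden boards</p></binding>
  </bindingDesc>
</physDesc>
```

Because these elements and attributes are already defined in TEI P5, adopting them would require no extension of the standard itself, only a decision about which parts of <physDesc> the NTVMR will populate.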

Additionally, one metadata item that would benefit from data structure expansion (though it is maintained externally to the XML) is the bibliography for each manuscript as provided by the online Liste. The Liste bibliography for some manuscripts has become so large that additional entry tagging would permit faster sorting of materials. [9] I recommend flags for: (1) the editio princeps of a manuscript; (2) an edition; and (3) a detailed or comparative study. This would immediately delineate between passing mentions of a manuscript and potential research materials.
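If the Liste bibliography entries were encoded as TEI <bibl> elements, the three proposed flags could be implemented as a small controlled set of type values. The encoding below is hypothetical (the type values are my own labels, not an existing NTVMR or Liste convention); the study entry reuses a work cited in this paper:

```xml
<listBibl>
  <!-- hypothetical 'type' values implementing the three proposed flags -->
  <bibl type="editioPrinceps"><!-- first published edition of the manuscript --></bibl>
  <bibl type="edition"><!-- subsequent critical or diplomatic edition --></bibl>
  <bibl type="study">Parker, D. C. 1992. Codex Bezae. Cambridge.</bibl>
</listBibl>
```

Untyped entries would remain valid, so the flags could be added incrementally to the largest bibliographies first.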

Proposal 2: Clearly Define Objective Data Points

With many collaborators performing data entry in the NTVMR, it is a challenge to bring the searchable data toward some form of normalization. Efficacious searching requires publicly defining how data points are to be measured against an objective standard whenever possible.
Currently, the transcription standard for ECM projects is to ignore textually superfluous elements (including punctuation and diacritical marks) in minuscule manuscripts; only a small number of such features are recorded in the majuscules and papyri. [10] However, how these elements are to be recorded is not well defined. For example, enlarged letters (which the NTVMR identifies as “capitals”) are assigned a height value with a unit of “line(s)”; the IGNTP guidelines state: “The height of the capitals is expressed as the number of lines of ordinary script on the page corresponding to the height of the capital.” [11] How is the height of a line defined? What level of precision should be expected when entering a value (e.g. 2, 2.2, 2.25)? A much more frequently used feature (in the majuscules) is punctuation. What distinguishes a middle dot from a raised dot? Is the vertical position measured against the preceding character or the following character? If one is editing a critical edition, these questions will appear pedantic; if one is researching unit delimitation across many manuscripts (for example), having standardized and accurate data is essential.
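To make the point concrete, a published standard could pin down both the unit and the precision of each value. The fragment below is a sketch assuming enlarged letters and punctuation are encoded with the TEI <hi> and <pc> elements; the attribute values and the accompanying definitions are illustrative of the kind of explicit rules being proposed, not existing NTVMR policy:

```xml
<!-- Enlarged letter: height in lines of ordinary script on the same page,
     recorded to a defined precision (here, two decimal places) -->
<hi rend="cap" height="2.25">Ε</hi>

<!-- Punctuation: the type vocabulary is closed, and vertical position is
     defined as measured against the preceding character -->
<pc type="raised-dot">·</pc>
```

Publishing such definitions alongside the transcription guidelines would let a researcher querying, say, all raised dots in the majuscules trust that every transcriber measured the same thing the same way.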
Additionally, the formatting or vocabulary of paratextual and codicological elements needs to be standardized. A cross-organizational standard need not be adopted, but a consistent model within the NTVMR is desirable. For example, when describing quire formations, a standard collation formula—something quite variable in scholarly literature—will need to be followed. [12]

Proposal 3: Regulate Subjective Data Entry

Some data will necessarily be subjectively descriptive rather than objectively measurable. For example, identifying types of majuscule or minuscule Greek writing involves evaluation of visual data not easily quantified. Regulation of this subjective data could be accomplished by having one evaluator create the model for other collaborators. In the example of identifying styles of writing, ideally a single person would perform that evaluation over many manuscripts on a first pass through the data. This would provide consistent identification across a large data set, establish the vocabulary for collaborators, and create a guide for future script identification. The NTVMR should then publish a specification for subjective data to facilitate querying the database.
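Such a specification could be published as a TEI ODD customization that closes the value list of, for example, the script attribute on <handNote>. The fragment below is a sketch; the script names are illustrative palaeographic labels, not an established NTVMR vocabulary:

```xml
<!-- Hypothetical ODD fragment constraining handNote/@script to the
     vocabulary established by the first-pass evaluator -->
<attDef ident="script" mode="change">
  <valList type="closed" mode="replace">
    <valItem ident="biblical-majuscule"/>
    <valItem ident="alexandrian-majuscule"/>
    <valItem ident="sloping-ogival-majuscule"/>
    <valItem ident="perlschrift"/>
  </valList>
</attDef>
```

A closed value list turns the evaluator’s judgments into machine-checkable constraints: later collaborators can only select from the published vocabulary, and queries over the attribute return a predictable set of values.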


Very briefly I have outlined three steps that would, with minimal changes to the underlying data structures, advance the NTVMR toward its advertised goal as a research platform and enrich the overall dataset for the study of biblical manuscripts. Implementation of these proposals is relatively low-cost with regard to modifying existing data structures, and project-dependent with regard to acquiring new paratextual/codicological data. The result will allow scholars to populate information about these manuscripts more fully and to make better use of an already outstanding platform.


Jongkind, D. 2007. Scribal Habits of Codex Sinaiticus. Piscataway, NJ.
Lied, L. I. and M. Maniaci. 2018. Bible As Notepad: Tracing Annotations and Annotation Practices in Late Antique and Medieval Biblical Manuscripts. Berlin.
Malik, P. 2017. P. Beatty III (𝔓47): The Codex, Its Scribe, and Its Text. Leiden.
Parker, D. C. 1992. Codex Bezae. Cambridge.
Smith, W. A. 2014. A Study of the Gospels in Codex Alexandrinus: Codicology, Palaeography, and Scribal Hands. Leiden.


1. Retrieved 7 July 2020.
2. These include the Museum of the Bible (MOTB) Greek Paul Project, the MOTB Greek Psalter Project, the MOTB Syriac Climacus’ Ladder of Divine Ascent Project, the Τομέας Μελέτης Χειρόγραφης Παράδοσης – Byzantine Project, the Erstellung einer kritischen Edition der Johannesapokalypse (Kirchliche Hochschule Wuppertal/Bethel), and the Titles of the New Testament project.
3. The environment has been adopted by a number of projects editing text outside the Greek New Testament, including the Coptic-Sahidic Old Testament Project (Akademie der Wissenschaften zu Göttingen), the Canons of Apa Joannes the Archimandrite project (Paris Lodron Universität Salzburg), the Paratexts of the Bible project (Ludwig-Maximilians-Universität München), and the Avestan Digital Archive (Freie Universität Berlin).
4. The Text Encoding Initiative (TEI) has produced “a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics” (retrieved 7 July 2020). The EpiDoc collaborative “addresses not only the transcription and editorial preparation of the texts themselves, but also the history, materiality and metadata of the objects on which the texts appear” (retrieved 7 July 2020).
5. These guidelines are hosted by the University of Birmingham under the title “IGNTP guidelines for XML transcriptions of New Testament manuscripts using TEI P5.”
6. Taken from the NTVMR Statistics tool calculations, retrieved 12 August 2020.
7. The high number of pages in the minuscule and lectionary witnesses, typically longer witnesses for which transcriptions are largely lacking, is the primary reason for this. Transcription rates of 84.41% for the 1,520 cataloged papyrus pages and 24.89% for the 26,638 cataloged majuscule pages attest to the focus of the transcription work on the witnesses most frequently cited in critical editions. Only a small number of minuscules and lectionaries are cited in the critical apparatus of the ECM editions because these manuscripts are largely Byzantine in character; thus, only 3.40% (44,692) of the 1,314,729 cataloged minuscule pages and 0.40% (3,224) of the 796,946 cataloged lectionary pages have been transcribed.
8. Examples of studies that explore codicological and/or paratextual data include David C. Parker, Codex Bezae (Cambridge: Cambridge University Press, 1992); Dirk Jongkind, Scribal Habits of Codex Sinaiticus (Piscataway, NJ: Gorgias Press, 2007); W. Andrew Smith, A Study of the Gospels in Codex Alexandrinus: Codicology, Palaeography, and Scribal Hands (Leiden: Brill, 2014); Peter Malik, P. Beatty III (𝔓47): The Codex, Its Scribe, and Its Text (Leiden: Brill, 2017); and Liv I. Lied and Marilena Maniaci, Bible As Notepad: Tracing Annotations and Annotation Practices in Late Antique and Medieval Biblical Manuscripts (Berlin: De Gruyter, 2018).
9. GA 05 (Codex Bezae) has 344 entries, GA 01 (Codex Sinaiticus) has 333, GA 03 (Codex Vaticanus) has 256, and GA 02 (Codex Alexandrinus) has 200. The Liste page also links to external resources, the most relevant to the bibliographical materials being the Pinakes site. Not all manuscripts in the Liste have Pinakes links, and the Pinakes categories for sorting these references are still broader than what is proposed here.
10. The ECM Apokalypse team has recorded punctuation in all manuscripts, though their guidelines for doing so are internal to that project.
11. “IGNTP Guidelines for Transcribing Greek Manuscripts using the Online Transcription Editor” (version 1.2), 37.
12. This raises a further issue. Users can currently search on the NTVMR for a limited set of paratextual features by constructing a query URL. This returns a list of features in the Manuscript Workspace page with a DocID, PageId, (biblical) Content, and Image—that is, the location (and possible image) for each feature. This is already very helpful. If the outputs were customizable to display other values (e.g. the color of non-black inks), that would immediately alert the researcher to the frequency of variations in a feature and accelerate the process of correlating minable codicological and paratextual data with textual data. The ability to download these outputs in standard formats (e.g. a list of comma-separated values) would be ideal for researchers desiring to sort larger query outputs.