Data-Driven Scholarship in Digital Classics: Purpose and Benefit of Text Mining

1. Introduction: Text and Author: What’s in the data?

If we accept that there is no human action apart from culture, then the question of structures, objects, and conditions of knowledge, of awareness dispositives and textual strategies, exists before all applications of the methods of digital humanities.
This fundamentally affects the question of data, i.e., the change that is visible on the level of our texts, and the change of the view of the data, i.e., the methods involved. The number of new connections is discussed under the auspices of ‘distant reading’ [1] and can be deduced from very large amounts of text and data through the application of algorithm-based evaluation with methods such as text mining, clustering, or topic modeling. But complexity reduction, visualization, and explorative experimentation have also led to completely new questions.
Without claiming to provide a full and complete description, this paper offers an exemplary report, a sketch of some new developments. New approaches are emerging in classical scholarship in relation to textual strategies: algorithm-based analysis not only changes our textual foundation; it must also be accompanied by corresponding considerations that reflect on its advantages and consequences.
Classics has played a pioneering role in the production of digital humanities because it developed digital text corpora earlier than other disciplines. For that reason the ancient texts are almost completely digitized today. [2] But the classical hermeneutical understanding of texts must fundamentally change if scholars want to consciously shape the epistemological effect of digitization, and these changes must emerge from the interplay between the application of algorithm-based methods and epistemic systematization: the relationship between form and function in digitally represented objects is different from that in non-digital representation. Applying Franco Moretti’s juxtaposition of close reading (classical text analysis) and distant reading (text analysis with digital humanities methods), close reading can be described as the situation of a walker in the forest, and distant reading as that of someone flying over the forest in an airplane. The walker can take several paths through the forest, pursue different interests, and describe things in great detail; the one who sees the forest from the plane has a bird’s eye view of the entire forest and its surroundings: this is the view from above. Today, algorithm-based analysis gives us new possibilities for exploration. But the fear of a “tabula rasa interpretation” that reads “texts algorithmically” and produces a false objectivity is not entirely unfounded and has to be taken into account. [3]
The following remarks will illustrate my point on the basis of work that was carried out in the context of the projects “eAQUA” and “Digital Plato,” in which precisely such methodological considerations were developed on the basis of the adaptation of methods from text mining. [4] It is best to show on the basis of concrete examples how methods change and what conclusions can be drawn from the results.
It is necessary to start with a text type typical for the field of Classics: fragments. For the time period between the eighth century BCE and third century CE, 59% of ancient authors’ texts have been preserved only as fragments, 12% of their work has been preserved in fragmentary transmissions by other authors, and merely 29% are known to be complete. [5] In this respect, it quickly becomes clear how important working with fragments is to the Classics. Not only is the question of the cohesiveness of an oeuvre problematic, but also the question of attributing a text to a certain author. Generally, the designation of something being “received as a fragment” means that a passage of text in a work of a certain author is attributed to another author. This attribution can take place in a variety of ways:

(author A): as author B already described, the following happened to XY: …


(author A): as authors B, C, and D have already described, the following happened to XY: …


(author X): but many have described that the following happened to XY: …


(author X): it was described that the following happened to XY: …

For centuries scholars have been assembling collections of fragments from this kind of information about authors B, C, D, and also the unspecified group “many.”

But we can see that the attribution of authors for fragmentary transmitted texts raises various problems: [6] what is the relationship here between quotations, reported speech, and paraphrase? Is a fragment always merely a quotation in direct speech or should we also regard reported speech? According to which criteria do we demarcate a quotation from a fragment in the first place? Or is a fragment everything: direct quotation, reported speech, and paraphrase? And, last but not least, the decisive question: who can be called the author here in the first place?
The attempt to identify an author out of the fragments has already led our disciplines down various dead ends. However, new literary theory offers a groundbreaking theoretical foundation that actually can help us with these difficult questions. Here, I am referring to the concept of the “Death of the Author” associated with Roland Barthes, Michel Foucault, and especially Julia Kristeva:

The word’s status is thus defined horizontally (the word in the text belongs to both writing subject and addressee) as well as vertically (the word in the text is oriented towards an anterior or synchronic literary corpus) … each word (text) is an intersection of words (texts) where at least one other word (text) can be read … any text is constructed as a mosaic of quotations; any text is the absorption and transformation of another. [7]

It leads us down a different path than the philological-hermeneutical approach of the usual school of history whose approach is tied to the traditional concept of the author, i.e., always postulating an identifiable connection between author and text.

The current understanding in History and Classics is based on the admirable works on fragments in the collections of, for instance, Diels/Kranz (fragments of the Presocratics) and Jacoby (fragments of the Greek historians). Glenn W. Most and G. Schepens have re-examined the long-standing question of quoting and quotation relating to the special form that is meant by the usage of the term ‘fragment’. [8]
What I am mainly concerned with here is exploring the historical and theoretical aspects of scholarly practices in collecting fragments, especially fragments of lost historical works.
Meaning “a small part broken off or separated from something,” [9] a fragment is either an object separated from its original ensemble (like monuments, sculptures, inscriptions, etc.) or it is—on a metaphorical level—a textual passage without its original context. Regarding the original context, textual fragments are usually separated from the surrounding text not by accident or loss but by an editor’s decision to classify this text passage as a fragment of a lost work. To sum up: “Generally speaking, classical fragments are made rather than born.” [10]
Ever since antiquity, dicta, apophthegmata, and sayings of wise men, oracles or aphorisms, were collected, assorted, and assembled consistently by scholars. The omnipresence of quotes is evident from antiquity to modern times. However, reconstruction of completely lost works from quotations embedded in works written by later authors is a concept of modern times.
We have collections of fragments of Greek historians since the Renaissance, but the climax of collecting fragments was the 19th century, at the beginning of which Creuzer in Heidelberg set out to collect the fragments of the Greek historians. His aim of tracing the origins and the development of Greek historiography provided a basis for all further work. From his and the later collections (Müller, as well as Mullach and Diels, Jacoby, today the New Jacoby, etc.) it seems fairly obvious that the context of these collections is a special narrative framework: the aforementioned aim (reconstruction of completely lost works from quotations embedded in works written by later authors) warrants an approach that treats the works from which the quotations are extracted as mere quarry sites.
The editors, until nearly the present day, accepted as an unexamined assumption that only the fragment matters, not its wider context, and that the main questions of interest in the study of fragments are problems of textual and historical authenticity. Therefore, we are dealing with the distinction between fragment and testimony, a problem that deserves further consideration. G. Schepens has proposed to consider the term ‘cover-text’ to broaden our understanding of this distinction, as ‘cover’ implies the different aspects of preserving, concealing, and enclosing fragments. [11]
Although this is an inspiring analysis, my assumption is that Schepens’s idea is also based on the concept of the authorial author. This differentiation does not resolve the central problem: if a context within a work is missing for a text (and this is even more the case when the name of the author is missing) and cannot be reconstructed, then the context is replaced by the framework of a collection, whether in a collection of quotations or fragments, the context of which is formed chronologically, personally, or thematically, according to the scholar’s interest. [12]

2. Examples

2.1 eAQUA

The purpose of the eAQUA project continues to be the study of this complex relationship with new applications from Information Retrieval. This provides a basis for an advanced study of digital text resources. Crane, Romanello, and especially Berti have already covered important issues in this field, [13] in particular the topic of the representation of fragments in digital editions. However, the underlying presuppositions Crane et al. use to distinguish between quotes as fragments warrant further clarification. The process of extracting a text as a fragment and editing it in a collection seems parallel to the focalization that presents a narrative through the special perspective of the narrator (Genette) or, with Foucault, as an ‘author function’ (Foucault, What is an Author?). [14] Likewise, including passages from another work as quotation, direct or indirect speech, allusion, or paraphrase matches the concept of different levels of voices. The distinction between narrator and voice, as Roland Barthes formulated in the title of his famous essay The Death of the Author (1967), [15] dispensed with the idea of authorship and, with that, dissolved the fundamental connection between author and text, which had been constitutive for the fragment collections up to that point, bearing consequences also on the semantic context. The strong relationship between this textual epistemology and the underlying assumption in an automatic textual analysis is evident. [16]
In the eAQUA project we use rule-based and probabilistic approaches that can distinguish different levels of phrases. Thus, the unsupervised automatic extraction of textual passages that are parallels, i.e., re-used in other works, is possible: extracting ‘re-used’ phrases of a text compared with the whole corpus of Greek texts is based on the method of string-matching algorithms. One tool we have developed searches for parallels with word size 5. [17]
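The idea of searching for parallels with a fixed word size can be illustrated with a minimal sketch. It assumes that "word size 5" means runs of five consecutive words and that texts are compared by exact matching of such runs; the function names, the whitespace tokenization, and the two English sample sentences are invented for illustration and are not taken from the eAQUA tools or from any ancient text.

```python
def word_ngrams(text, n=5):
    """Return the set of all runs of n consecutive words in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_passages(source_text, quoting_text, n=5):
    """Return the word n-grams that occur verbatim in both texts."""
    return word_ngrams(source_text, n) & word_ngrams(quoting_text, n)


# Invented sample sentences (not real quotations).
source_text = "the athenians were the first of the greeks to charge the enemy at a run"
quoting_text = "he says that the athenians were the first of the greeks to flee"

parallels = shared_passages(source_text, quoting_text)
```

A real pipeline would additionally need normalization of Greek diacritics and word forms, and an index over the whole corpus instead of a pairwise comparison.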
Another tool uses a search function to generate a graphic visualization of co-occurrences. The search functions of newer text mining methods go further by showing syntagmatic contexts. [18] The search function of eAQUA is based on this capability. With the help of a statistical co-occurrence analysis, calculated with the likelihood ratio, the frequency of the mutual appearance of a word pair is given in relation to the total occurrence in the text corpus so that systematic co-occurrence profiles are developed and visualized.
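How a co-occurrence profile comes about can be sketched in a few lines. The sketch assumes that co-occurrence is counted per sentence and that each word is counted at most once per sentence; the text does not specify the window eAQUA uses, and the function name and the toy sentences are hypothetical. It only produces raw counts, without the significance weighting described above.

```python
from collections import Counter


def cooccurrence_profile(sentences, term):
    """Count, for every other word, the sentences it shares with `term`."""
    profile = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        if term in tokens:
            # Every other word in this sentence co-occurs once with `term`.
            profile.update(tokens - {term})
    return profile


# Invented toy corpus (not real source text).
toy_sentences = [
    "anacharsis the scythian praised simple living",
    "the scythian nomads moved with their herds",
    "solon welcomed anacharsis the scythian in athens",
]
profile = cooccurrence_profile(toy_sentences, "anacharsis")
```

Words that appear only once in such a profile, like the hapax co-occurrences discussed in the examples, are exactly the entries a purely frequency-oriented reading would tend to overlook.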
In the following I will outline three examples from our work that were examined in eAQUA. The first example shows how the citation analysis works and, above all, how reliably it works: the algorithm-based identification of textual passages makes for more precise research than the traditional manual approach. The second and third examples go further and attempt to document, with the help of the co-occurrence search, that semantic contexts can also be “re-discovered” through the algorithm-based search.
2.1.1. Example: Plutarch’s De malignitate Herodoti [19]
Plutarch’s essay on ‘The malice of Herodotus’ seems to be an oddity among his works. [20] Plutarch seriously meant to rescue the good name of his ancestors the Boeotians and the Corinthians, whom Herodotus—in his eyes—had slandered in his Histories. Plutarch criticizes Herodotus’ aim, topic, and treatment; he quotes extensively what Herodotus says only to deny it immediately: whole sentences are copied or paraphrased, most of them slightly changed. In contrast to the mere mention of other authors, Plutarch uses here verbal quotations in a very explicit way, although not always entirely reliably: [21] Helmboldt/O’Neil identify 25 verbatim quotes; Bowen has 73; Pearson in the Loeb edition 56; and eAQUA 49 verbatim quotes.
Comparing the tables of references from Bowen (1992:148), Helmboldt/O’Neil (1959:34–37 on Herodotus) and from the comments and annotations of Pearson with the results from eAQUA’s search shows that, on the whole, the references in eAQUA are as sound and complete as in the existing literature on Plutarch’s quotations. [22] Also, the detailed examination reveals that the results from eAQUA’s search are consistent with the results obtained by the conventional, manual method (see Figures 1 and 2): Helmboldt/O’Neil counted 25 verbatim quotations; Bowen in his commentary 73 verbatim quotations; [23] and eAQUA has 49 verbatim quotations (2 are doubly counted).
Figure 1. eAQUA citation analysis of Plutarch’s De malignitate Herodoti (854e–874c)
Figure 2. 49 verbatim quotations from Herodotus in Plutarch’s De malignitate Herodoti (854e–874c).
2.1.2. Example: Anacharsis
My second example concerns the visualization of fragments, which, on the basis of a questionable attribution to an author, were eliminated from fragment collections. Here I used the co-occurrence from eAQUA.
Anacharsis, one of the Seven Wise Men, was a popular figure in Classical literature over the centuries. His first appearance is in the Histories of Herodotus (4, 46; 74–75), and later on he is the central character in dialogues by Plutarch and Lucian. In antiquity he was said to have written some books, but nothing has been preserved except some sayings illustrating the prominence of this colorful character. Unlike other figures who are accepted by modern editors as authors, Anacharsis has never been part of the traditional canon. Therefore, little attention has been paid to his sayings, and only one edition of his fragments [24] and another of his letters exist, [25] although these are classified as pseudo-epigraphic writings by Reuters in the edition of the so-called Letters of Anacharsis. Neither edition is part of the digitized corpus of Greek literature.
Through the co-occurrence search, words are identified that appear together more frequently than would be expected by chance. As I already mentioned, fragments are a special case in this respect and not actually a case for the co-occurrence search, because they are generally unique or at least very rare textual passages. But eAQUA offers a different possibility:
The inspection of the co-occurrence lists for the search term ‘Anacharsis’, especially of those that appear only once, yields a stunning result (Figures 3 and 4):
Figure 3. Co-occurrence graph of Ἀνάχαρσις.
Figure 4. Co-occurrences of Ἀνάχαρσις.
Figure 5. Word tree of the co-occurrence Ἀνάχαρσις-Ξενιάδες and reference in Sextus Empiricus 7.55–56.
Gorgias (DK 82 B3) maintains probably the most famous position in this chain of arguments:

Nothing exists. Even if something exists, nothing can be known about it. Even if something can be known about it, knowledge about it can’t be communicated to others. (transl. Kathleen Freeman)

Primarily, Gorgias refers to visible things leading mankind astray and the ostensibility of the specified objectivity of factuality (Schubert 2007). Protagoras and Xenophanes too were famous for such positions:

Xenophanes (DK 21 B 15):

ἀλλ’ εἰ χεῖρας ἔχον βόες <ἵπποι τ’> ἠὲ λέοντες ἢ γράψαι χείρεσσι καὶ ἔργα τελεῖν ἅπερ ἄνδρες, ἵπποι μέν θ’ ἵπποισι, βόες δέ τε βουσὶν ὁμοίας καί <κε> θεῶν ἰδέας ἔγραφον καὶ σώματ’ ἐποίουν τοιαῦθ’, οἷόν περ καὐτοὶ δέμας εἶχον <ἕκαστοι>.
But if cattle and horses or lions had hands, or were able to draw with their hands and do the works that men can do, horses would draw the forms of the gods like horses and cattle like cattle and they would make their bodies such as they each had themselves. (transl. Kathleen Freeman)
Protagoras (DK 80 B4):

περὶ μὲν θεῶν οὐκ ἔχω εἰδέναι, οὔθ’ ὡς εἰσὶν οὔθ’ ὡς οὐκ εἰσὶν οὔθ’ ὁποῖοί τινες ἰδέαν· πολλὰ γὰρ τὰ κωλύοντα εἰδέναι ἥ τ’ ἀδηλότης καὶ βραχὺς ὢν ὁ βίος τοῦ ἀνθρώπου.
About the gods, I am not able to know whether they exist or do not exist, nor what they are like in form; for the factors preventing knowledge are many: the obscurity of the subject, and the shortness of human life. (transl. Kathleen Freeman)

Encountering the nomadic Anacharsis in this context is surprising, as he was indeed known as a wise man in the tradition (see above), but not as a philosopher who might have engaged in epistemological argument. Yet Sextus Empiricus cites Anacharsis’ philosophical ideas extensively: [26]

The citation is one of the longest testimonies and is relayed verbatim by Sextus. Only in Anacharsis’ so-called pseudo-epigraphic epistles (Reuters 1963, so-called because they are classified as later forgery) can longer texts be found ascribed to his name. Commonly these epistles are classified as Hellenistic works (third/second century BCE). Kindstrand and other editors include neither these epistles nor this passage from Sextus in the edition of sayings and testimonies of Anacharsis. [27]
This skeptical view is purely subjective: when the passage from Sextus is compared with Herodotus, where Anacharsis is credited with a similar opinion concerning truth, and with the philosophy of the fifth century, there is a good case for placing this opinion in the fifth century BCE. [28] At the very least it can be argued that the athetesis of this testimony was rash, and that this passage from Sextus gives us a new perspective on the ancient tradition concerning Anacharsis.
Only this one, single fragment for the semantic context in which Anacharsis is quoted exists in the entire corpus of Greek literature. The quote was rejected by the editors of the collection of fragments as non-authentic and therefore eliminated from the entire scholarly discussion. [29] However, co-occurrence analysis that functions entirely independent of the connection between author, work, and connection within a work makes the discovery possible and thus opens the possibility of re-opening and re-evaluating the context.
The algorithm-based textual analysis, which functions without the textual hermeneutic framework, constructs a different textual basis than the one available through the critical editions and commentaries. On the basis of this foundation, it makes possible a clear and verifiable evaluation. Above all, it also makes possible a change in perspective regarding the attribution of quotations, or—couched in more neutral terms in relation to the question of “who is speaking”—a different view of the various voices in the text.
2.1.3. Example: Ἀτθίδος (Atthidos)
This example will demonstrate how, in the special case of our ancient Greek corpus and the specific situation of fragmentary texts, the co-occurrence search leads us to the discovery of new, previously unknown semantic connections, in contrast to the standard approaches and contrary to standard assumptions.
In this case, the search term is Ἀτθίδος (Atthidos), the genitive case of “Atthis,” the title by which a work about the history of Athens was designated in ancient times; the author of such a work is today called an “Atthidographer.”
The modern term “Atthidographers” refers to a group of historians who wrote books on the history of Athens, from mythical epochs up to their own time, and who flourished in the fourth century BCE. [30] Their works, the so-called “Atthides,” have not survived; only quotations by later authors allow an approximate reconstruction of the content and intent of their work. Any search query has to bear in mind that the search results display the co-occurrences of both the quoting author and the cited work. As these quotes are usually associated with a reference to the book of the respective “Atthis” from which the quoted sentence has been taken (e.g., ἐν δ Ἀτθίδος: in the fourth book of the Atthis), Ἀτθίδος, the most frequent case of “Atthis” in the corpus (in TLG-E: ἀτθίδος/Atthidos occurring 276 times; the second most frequent form ἀτθίδι/Atthidi 83 times), seems to be well suited.
The word form occurs 276 times in the corpus of the TLG-E and has the frequency class 14: it is not a rare word, but not overly common either (the most frequent term in the TLG-E corpus is καί, which occurs roughly 2^14 = 16,384 times more often than Ἀτθίδος).
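The notion of a frequency class can be made concrete with a short sketch. It assumes one common definition, the floor of the base-2 logarithm of the ratio between the count of the most frequent word and the count of the word in question; the text does not state which definition eAQUA uses. The count for καί below is a hypothetical figure chosen only so that the ratio is exactly 2^14; it is not a reported value.

```python
import math


def frequency_class(max_count, word_count):
    """Return floor(log2(max_count / word_count)) (assumed definition)."""
    return math.floor(math.log2(max_count / word_count))


atthidos_count = 276                  # reported in the text
kai_count = atthidos_count * 2 ** 14  # hypothetical, not a reported figure
atthidos_class = frequency_class(kai_count, atthidos_count)
```

Under this definition each step up in class corresponds to a halving of frequency relative to the most frequent word.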
In Figures 7 and 8 a phrase traced from μετανεστήκασιν (3rd pl. perf. ind. act. [ionic] of μετανίστημι with the meaning ‘to migrate’) has been marked up. Together with νομάδας (adj. acc. fem. of νομάδες with the meaning of grazing/nomadic or acc. pl. of νομάδοι with the meaning of nomadic people / herdsmen) it points to a nomadic relationship of migration.
Figure 6. Graph of Ἀτθίδος, co-occurrence of Ἀτθίδος – μετανεστήκασιν – νομάδας.
Figure 7. Word tree of Ἀτθίδος – μετανεστήκασιν.
Figure 8. Word tree of Ἀτθίδος – νομάδας.
These are extremely rare co-occurrences of Ἀτθίδος. By default, we use the log-likelihood ratio (lgl) as the standard measure for calculating the co-occurrences in the word lists shown.
Here the factors of overall frequency and joint occurrence are weighted differently, so that even with the same number of co-occurrences different lgl-values can arise:

νομάδας occurs three times in connection with Ἀτθίδος and has the lgl-value 34.3777
σποράδην occurs three times in connection with Ἀτθίδος and has the lgl-value 25.5101

The lower significance value arises from the frequency in the entire corpus, which is considerably higher for σποράδην (325) than for νομάδας (76). Accordingly, with the same number of co-occurrences, the algorithm rates the co-occurrence as more significant when the overall frequency (freq) is lower.

This is based on the fundamental assumption that the more frequently two words appear in the same corpus, the greater the likelihood that the two appear together by chance. When the word σποράδην is found 325 times, but only three times in connection with Ἀτθίδος (less than one percent), the semantic connection is assumed to be weaker than with νομάδας, which appears together with Ἀτθίδος in three of its 76 occurrences. With μετανεστήκασιν the lgl-value is roughly twice as large as with the other two because all three occurrences of the word, that is, 100 percent, appear together with Ἀτθίδος.
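The effect of overall frequency on the significance value can be reproduced with a minimal sketch. It uses Dunning's log-likelihood ratio for a 2x2 contingency table, a widely used formulation; the text does not give the exact formula behind lgl, so this is an assumption. The sketch further assumes that the reported word frequencies can be treated as sentence counts and uses a hypothetical corpus size, so it is meant to show the ordering of the three words, not to reproduce the reported lgl-values.

```python
import math


def _term(observed, expected):
    # A cell with zero observations contributes nothing to the sum.
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def log_likelihood(freq_a, freq_b, joint, total):
    """Dunning-style G^2 for a 2x2 contingency table of sentence counts."""
    observed = [
        joint,                            # both words present
        freq_a - joint,                   # only word A
        freq_b - joint,                   # only word B
        total - freq_a - freq_b + joint,  # neither word
    ]
    expected = [
        freq_a * freq_b / total,
        freq_a * (total - freq_b) / total,
        (total - freq_a) * freq_b / total,
        (total - freq_a) * (total - freq_b) / total,
    ]
    return 2.0 * sum(_term(o, e) for o, e in zip(observed, expected))


TOTAL_SENTENCES = 5_000_000  # hypothetical corpus size, not from the text
ATTHIDOS = 276               # frequency reported in the text

frequencies = {"νομάδας": 76, "σποράδην": 325, "μετανεστήκασιν": 3}
scores = {
    word: log_likelihood(ATTHIDOS, freq, 3, TOTAL_SENTENCES)
    for word, freq in frequencies.items()
}
```

With three joint occurrences in every case, the score rises as the overall frequency of the co-occurring word falls, which is the behaviour described above.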
The highest lgl-value of a co-occurrent of Ἀτθίδος is that of Φιλόχορος: 820.076 (calculated for the entire corpus, the maximum of the lgl for Φιλόχορος is at 357,140 in the co-occurrence of δέ and μήν). With a frequency of 576, Φιλόχορος appears together with Ἀτθίδος 60 times (approximately 10 percent of the time). A similarly high lgl-value is found for Ἀνδροτίων, with 806.222. The word appears together with Ἀτθίδος in 53 of 252 cases, that is, about 21 percent of the time. The factors of overall frequency and shared appearance are evaluated differently by the algorithm, so that even with the same co-occurrence number different lgl-values can be produced.
In the following table the values are summarized:

Word             Overall frequency   lgl      Co-occurrences with Ἀτθίδος
νομάδας          76                  34.38    3
σποράδην         325                 25.51    3
μετανεστήκασιν   3                   59.56    3
Φιλόχορος        576                 820.08   60
Ἀνδροτίων        252                 806.92   53
Table 1: lgl values
This co-occurrence refers to a quotation from the Atthis of Philochorus (fourth/third century BCE), who, as a detailed investigation has demonstrated, supposed a nomadic phase in the early history of mythical Athens. [31]
In the Word Tree option, three sources are shown, the epitome of Stephanus’ Ethnica, from which Felix Jacoby extracted a fragment of the Atthidographer Philochorus (FGrHist no. 328) as No. 3b. The text from Herodian also contains the fragment of Philochorus, but is not included in Felix Jacoby’s collection of fragments, so that the assemblage of sources from eAQUA has a more thorough coverage. [32]
This result demonstrates a very unusual development: What are nomads doing in Attica? The Athenians were proud of their autochthonous development and autochthony is the opposite of nomadism. It seems that the Atthidographers invented a phase in Athenian history to show an evolutionary development from nomadism to settled life. This implies that they have taken over a model from philosophers like Aristotle and applied it to their own concept of Athenian history.
According to the citation from Stephanus, the Atthidographer Philochorus had established a connection between the pre-historic nomadic livelihood of the Athenians, their initial sedentary lifestyle, and their first founding of asty / polis.
Jacoby insists in his commentary on the two fragments that the first phrase of the fragment (2a: Ἀθηναῖοι δὲ πρῶτοι τῶν ἄλλων ἄστη καὶ πόλεις ᾤκησαν) does not belong to the second one because this phrase occurs with almost the same wording in Stephanus. [33]
In doing so, he dissolved the causal relationship that results from the text in Stephanus. The detailed analysis shows that the Atthidographers, as historians of Athens, apparently disagreed with the official Athenian idea of origin. Only the idea of cultural evolution progressing in different stages, first discussed in the fifth and fourth century BCE, may explain this stage of nomadic livelihood in the Athenian pre-history that can be reconstructed for the Atthidographers.
Thus, the co-occurrence analysis can lead to better results than the traditional approach, which is based upon experience and linear reading in the case of very rare co-occurrences. The reason is that co-occurrence analysis is built on the algorithm-based evaluation of an entire corpus, which is not restricted by the subjective decisions of an editor.
In view of fragments, this corresponds to the ‘un-editing’ that Ray Siemens describes for the relationship between digital and print editions, as follows:

In addition to acknowledging the value of the electronic medium to editing and the edition, such “assemblages” also recognize the critical practice of “un-editing,” whereby the reader is exposed to the various layers of editorial mediation of a given text, as well as an increased awareness of the “materiality” of the text-object under consideration. [34]

In his book entitled Radiant Textuality, Jerome McGann establishes the view that texts are n-dimensional in every medium, especially in contrast to the copy-text theory and its emphasis on editorial decision. Accordingly, an approach that is based on this model of intertextuality, i.e., the relation of the texts to and among each other without the intentions of an author or editor intervening, would be much closer to the text—without the interference of editors.

2.2. Digital Plato [35]

Concerning the relationship between text and the formation of tradition, the aforementioned examples are limited to literal quotations and texts that later entered into fragment collections. Very different, but equally important, forms of reception in ancient tradition are paraphrases and allusions that narrowly avoid literal identification. Unlike quotes, the relationship between texts with little or no textual identity is much more difficult to determine; determining the relationship between pre-text and post-text (i.e., intended or unintentional reference, transforming reference, reformulation, commentary, etc.) is a special challenge. In contrast to the detection of a citation, the algorithmic analysis of a paraphrase must take a different approach.
Where relationships do not always match on the textual level, matches can often only be derived from context. The word2vec algorithm, [36] a method of word embeddings using a shallow, two-layer neural network, performs a contextual search based on the assumption that words frequently used in similar contexts will carry a similar meaning. The comparison of the contexts of two words can serve as a similarity measure of the word pair; the comparison of whole passages is based on a novel distance function between text documents, the Word Mover’s Distance. [37]
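How a distance between passages can be derived from word vectors may be illustrated with a small sketch. It does not implement the full Word Mover's Distance, which requires solving an optimal transport problem; instead it uses the simpler relaxed variant, in which every word of one passage is moved to its nearest word in the other and the larger of the two directions is taken. The two-dimensional vectors are hand-made stand-ins, not trained word2vec embeddings, and all names are hypothetical.

```python
import math
from collections import Counter


def _one_sided_cost(doc_from, doc_to, vectors):
    """Average distance from each word in doc_from to its nearest word in doc_to."""
    weights = Counter(doc_from)
    total = sum(weights.values())
    targets = set(doc_to)
    cost = 0.0
    for word, count in weights.items():
        nearest = min(math.dist(vectors[word], vectors[t]) for t in targets)
        cost += (count / total) * nearest
    return cost


def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed Word Mover's Distance: the larger of the two one-sided costs."""
    return max(_one_sided_cost(doc_a, doc_b, vectors),
               _one_sided_cost(doc_b, doc_a, vectors))


# Hand-made toy vectors (assumption, not trained embeddings).
toy_vectors = {
    "soul": (0.90, 0.10), "spirit": (0.85, 0.15),
    "immortal": (0.80, 0.30), "deathless": (0.78, 0.32),
    "ship": (0.10, 0.90), "harbor": (0.20, 0.80),
}
query = ["soul", "immortal"]
paraphrase = ["spirit", "deathless"]
unrelated = ["ship", "harbor"]

distance_to_paraphrase = relaxed_wmd(query, paraphrase, toy_vectors)
distance_to_unrelated = relaxed_wmd(query, unrelated, toy_vectors)
```

The query and its paraphrase share no word, yet their distance is small because their words lie close together in the vector space; this is the property a paraphrase search relies on.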
The results are very promising: A search for paraphrases of Plato Phaid. 80 a10-b5 in the Praeparatio Evangelica of the Church father Eusebius

Plat. Phaid. 80 a10-b5: Σκόπει δή, ἔφη, ὦ Κέβης, εἰ ἐκ πάντων τῶν εἰρημένων τάδε ἡμῖν συμβαίνει, τῷ μὲν θείῳ καὶ ἀθανάτῳ καὶ νοητῷ καὶ μονοειδεῖ καὶ ἀδιαλύτῳ καὶ ἀεὶ ὡσαύτως κατὰ ταὐτὰ ἔχοντι ἑαυτῷ ὁμοιότατον εἶναι ψυχή, τῷ δὲ ἀνθρωπίνῳ καὶ θνητῷ καὶ πολυειδεῖ καὶ ἀνοήτῳ καὶ διαλυτῷ καὶ μηδέποτε κατὰ ταὐτὰ ἔχοντι ἑαυτῷ ὁμοιότατον αὖ εἶναι σῶμα.
Consider then, Cebes, said he, whether from all that has been said we obtain these results: that soul is most like the divine, and immortal, and intelligible, and uniform, and indissoluble, and ever unchangeable and self-consistent; and the body on the other hand most like the human, and mortal, and unintelligible, and multiform, and dissoluble, and never consistent with itself.

shows this result:

Figure 9. Results Nr.1-3 for paraphrases of Platon, Phaid. 80 a10–b5 in the Praeparatio Evangelica of Eusebius.
The first hit (marked text passage is highlighted in yellow)

Σκόπει δή, ἔφη, ὦ Κέβης, εἰ ἐκ πάντων τῶν εἰρημένων τάδε ἡμῖν συμβαίνει, τῷ μὲν θείῳ καὶ ἀθανάτῳ καὶ νοητῷ καὶ μονοειδεῖ καὶ ἀδιαλύτῳ καὶ ἀεὶ ὡσαύτως κατὰ ταὐτὰ ἔχοντι ἑαυτῷ ὁμοιότατον εἶναι ψυχή, τῷ δὲ ἀνθρωπίνῳ καὶ θνητῷ καὶ πολυειδεῖ καὶ ἀνοήτῳ καὶ διαλυτῷ καὶ μηδέποτε κατὰ ταὐτὰ ἔχοντι ἑαυτῷ ὁμοιότατον αὖ εἶναι σῶμα.
Consider then, Cebes, said he, whether from all that has been said we obtain these results: that soul is most like the divine, and immortal, and intelligible, and uniform, and indissoluble, and ever unchangeable and self-consistent; and the body on the other hand most like the human, and mortal, and unintelligible, and multiform, and dissoluble, and never consistent with itself.

is the same as the input passage from Plato and a verbal citation (in Eusebius Praeparatio Evangelica 11.27.13).

The second hit (marked text passage is highlighted in turquoise) is the second part of the sentence in Plato Phaidon 80b3–5. The algorithm finds and distinguishes complex semantic contexts, which have traditionally relied on textual interpretation by human experts. The passage that was entered is only the first part of the sentence, but the program finds the semantic relationship between it and the other part of the sentence, although that part of the passage means quite the opposite of the first part and has no textual overlap with it.
In the third hit (marked text passage is highlighted in yellow)
Euseb. PE 15, 22, 32

κἂν δι’ ἑνὸς ποικίλον, οἷον πρόσωπον. οὐ γὰρ ἄλλο μὲν ῥινός, ἄλλο δὲ ὀφθαλμοῦ, ἀλλὰ ταὐτὸν ὁμοῦ πάντων. καὶ εἰ τὸ μὲν δι’ ὀμμάτων, τὸ δὲ δι’ ἀκοῆς, ἕν τι δεῖ εἶναι εἰς ὃ ἄμφω· ἢ πῶς ἂν εἴποι ὅτι ἕτερα ταῦτα, μὴ εἰς τὸ αὐτὸ ὁμοῦ τῶν αἰσθημάτων ἐλθόντων; δεῖ τοίνυν τοῦτο ὥσπερ κέντρον εἶναι, γραμμὰς δὲ συμβαλλούσας ἐκ περιφερείας κύκλου τὰς πανταχόθεν αἰσθήσεις πρὸς τοῦτο περαίνειν, …
For there are not different powers that perceive the nostril and the eye, but the same perceives all at once. And if one impression comes through the eyes, and another through hearing, there must be some one power which both reach: or how could one say that these are different, if the sensations did not reach the same sentient power at the same time? This, therefore, must be as it were a centre, and lines converging from the circumference of the circle must convey the sensations from all sides to it …

the algorithm finds a connection between Eusebius and Plato at an even higher level. The hit shows a passage in Eusebius’ Praeparatio Evangelica that cites Plotinus, who is traditionally identified as the founder of Neoplatonism and who cites Plato’s Phaidon in his Enneads. This Eusebius passage cites Plotinus IV 7.6, in which Plotinus in turn cites the text from Plato’s Phaidon!

This demonstrates that the algorithm finds semantic similarities at a higher level of understanding: it finds not only paraphrases, but also contexts in reception and tradition that have no basis in textual overlap. [38]
Even at this stage, the search algorithm can detect relationships that we would likely term meaning, which is very promising for the further development of a paraphrase search.
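The contrast drawn here, between matching by shared wording and similarity without any textual overlap, can be illustrated with a minimal sketch. It is not the Digital Plato implementation. It assumes toy two-dimensional word vectors with invented values (TOY_VECTORS), English stand-in tokens instead of lemmatized Greek, and the relaxed lower bound of the Word Mover’s Distance described by Kusner et al. 2015 (note 37), computed over word embeddings of the kind introduced by Mikolov et al. 2013a (note 36). The function names jaccard_overlap and relaxed_wmd are hypothetical.

```python
import math
from collections import Counter

# Toy 2-d "embeddings" with invented values (an assumption for illustration,
# not trained vectors). First axis ~ "soul-like", second axis ~ "body-like".
TOY_VECTORS = {
    "soul": (0.90, 0.10), "immortal": (0.80, 0.30), "divine": (0.85, 0.20),
    "psyche": (0.88, 0.12), "deathless": (0.79, 0.31), "godlike": (0.84, 0.22),
    "body": (0.10, 0.90), "mortal": (0.20, 0.80), "human": (0.15, 0.85),
}


def jaccard_overlap(a, b):
    """Share of distinct tokens that two passages have in common (0.0 to 1.0)."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


def relaxed_wmd(a, b, vectors):
    """Relaxed Word Mover's Distance: a lower bound of the full distance.

    Each word sends all of its (normalized bag-of-words) weight to the
    nearest word of the other passage; the larger of the two directions
    is returned.
    """
    def one_way(src, dst):
        weights = Counter(src)
        total = sum(weights.values())
        targets = set(dst)
        cost = 0.0
        for word, count in weights.items():
            nearest = min(math.dist(vectors[word], vectors[t]) for t in targets)
            cost += (count / total) * nearest
        return cost

    return max(one_way(a, b), one_way(b, a))


query = ["soul", "immortal", "divine"]
paraphrase = ["psyche", "deathless", "godlike"]  # same sense, no shared token
unrelated = ["body", "mortal", "human"]          # opposite sense, no shared token

for name, passage in (("paraphrase", paraphrase), ("unrelated", unrelated)):
    print(name,
          "overlap:", jaccard_overlap(query, passage),
          "distance:", round(relaxed_wmd(query, passage, TOY_VECTORS), 3))
```

In this toy setting the query shares no token with either passage, so the overlap score cannot tell them apart, whereas the embedding distance to the paraphrase is far smaller than the distance to the unrelated passage. A real system would need vectors trained on a Greek corpus and lemmatized input.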

3. Conclusion

The examples of eAQUA’s search with string-matching algorithms and our research with the Digital Plato project’s neural networks demonstrate that these tools give an excellent estimate of quotations, text reuse, and direct or indirect text tradition. Concerning the question of fragments, these tools broaden the possibilities of helping to ‘unearth’ fragments lost in the process of editorial decisions. Representing a kind of ‘unediting’ whereby the various layers of editorial decisions can be decoded, this new approach yields a promising new avenue of research that combines methodological consideration and scientific practice. In the same way, however, starting from the complete work as opposed to the fragment, the use of the algorithm-based paraphrase search with the help of neural networks in the field of deep learning opens up new paths for detecting indirect transmission.
An algorithm-based analysis reduces a text to its essence to such an extent that authors, editors, and interpreters completely fade into the background. “Automatic extraction” yields results that create a kind of reference that is not dependent on individual interpretation. These processes provide an efficient way to bridge the gap between distant and close reading through a combination of empirical and traditional methods of interpretation. Combined with algorithm-based analysis, this allows the hermeneutic circle of interpretation and knowledge to be closed and offers deeper insight into the reception of ancient tradition.


1. Moretti 2007; cf. Manovich 2007 and Crane 2006.
2. Thesaurus Linguae Graecae; Bibliotheca Teubneriana Latina Online; Perseus Digital Library; Library of Latin Texts; an overview is given in (22.8.2017).
3. Liu March 2013:414.
4. Cf., funded by the German Federal Ministry of Education and Research 2008–2013.
5. Berti, et al. 2009.
6. Cf. Grafton 1998 and Laks 1997.
7. Kristeva 1980:37.
8. Schepens 1998; Schepens 2005; Most 2009:9–22.
9. Oxford Dictionary online, 9.5.2017.
10. Dionisotti 1977:1.
11. Schepens 2005:X; 1997:166–172.
12. Cf. Schepens 1997:166, who differentiates between situations in which no context or a reduced or rather an entirely different context exists.
13. Crane & Romanello 2009, Berti 2016, 2016a,b, 2009a,b.
14. Foucault 1969.
15. Barthes 1967.
16. Cf. detailed in Schubert 2012.
17. A comprehensive description (written by Jens Wittig, who has adapted and further developed the visualization tools in the eAQUA portal) can be found in the “Wissensdatenbank” at the following link:; cf. Büchler et al. 2010.
18. If two terms occur together in at least one local context, they stand in a relationship that can be termed a syntagmatic context. Cf. Bünte 2010.
19. Extended version of Schubert 2012.
20. Bowen 1992:2.
21. Plutarch, The malice of Herodotus: (De malignitate Herodoti), tr. and comm. by A.J. Bowen. Warminster Aris & Phillips 1992:7. Helmboldt/O’Neil 1959.
22. Bowen (1992) 148; Helmboldt/O’Neil (1959) 34–37; Pearson (1965) ad loc.
23. Bowen does not differentiate precisely between quotations and paraphrases.
24. Kindstrand 1981.
25. Reuters 1963.
26. 7.55–59: Καὶ Ἀνάχαρσις, ὡς φασίν, ὁ Σκύθης πάσης τέχνης τὴν κριτικὴν κατάληψιν ἀναιρεῖ, σφόδρα τε ἐπιτιμᾷ τοῖς Ἕλλησι ταύτην ἀπολείπουσιν· τίς γάρ ἐστι, φησίν, ὁ κρίνων τι τεχνικῶς; ἆρά γε ὁ ἰδιώτης ἢ ὁ τεχνίτης; ἀλλ’ ἰδιώτην μὲν οὐκ ἂν εἴποιμεν· πεπήρωται γὰρ πρὸς τὴν γνῶσιν τῶν τεχνικῶν ἰδιωμάτων, καὶ ὡς οὔτε τυφλὸς λαμβάνει τὰ τῆς ὁράσεως ἔργα οὔτε κωφὸς τὰ τῆς ἀκοῆς, οὕτως οὐδὲ ὁ ἄτεχνος ὀξυωπεῖ πρὸς τὴν κατάληψιν τοῦ τεχνικῶς ἀποτελεσθέντος, ἐπεί τοι ἐὰν καὶ τούτῳ μαρτυρῶμεν τήν τινος πράγματος τεχνικοῦ κρίσιν, οὐ διοίσει τῆς τέχνης ἡ ἀτεχνία, ὅπερ ἐστὶν ἄτοπον· ὥστε οὐχ ὁ ἰδιώτης ἐστὶ κριτὴς τῶν τεχνικῶν ἰδιωμάτων. (56) λείπεται ἄρα λέγειν τὸν τεχνίτην· ὃ πάλιν ἐστὶν ἀπίθανον. ἤτοι γὰρ ὁ ὁμόζηλος τὸν ὁμόζηλον ἢ ὁ ἀνομόζηλος τὸν ἑτερόζηλον κρίνει. ἀλλ’ ὁ ἑτερόζηλος οὐχ οἷός τέ ἐστι κρίνειν τὸν ἑτερόζηλον· τῆς γὰρ ἰδίας τέχνης ἐστὶν ἐπιγνώμων, (57) πρὸς δὲ τὴν ἀλλοτρίαν ἰδιώτης καθέστηκεν. καὶ μὴν οὐδὲ <ὁ> ὁμόζηλος τὸν ὁμόζηλον δύναται δοκιμάζειν· αὐτὸ γὰρ τοῦτο ἐζητοῦμεν, τίς ἐστιν ὁ τούτους κρίνων ἐν μιᾷ δυνάμει τὸ ὅσον ἐπὶ τῇ αὐτῇ τέχνῃ καθεστῶτας. ἄλλως τε, εἴπερ οὗτος ἐκεῖνον κρίνει, γενήσεται τὸ αὐτὸ κρῖνόν τε καὶ κρινόμενον πιστόν τε καὶ ἄπιστον· (58) ᾗ μὲν γὰρ ὁμόζηλός ἐστιν ὁ ἕτερος τῷ κρινομένῳ, κρινόμενος καὶ αὐτὸς ἄπιστος ἔσται, ᾗ δὲ κρίνει, πιστὸς γενήσεται. οὐ δυνατὸν δὲ τὸ αὐτὸ καὶ κρῖνον καὶ κρινόμενον καὶ πιστὸν καὶ ἄπιστον ὑπάρχειν· οὐκ ἄρα ἔστι τις ὁ κρίνων τεχνικῶς. διὰ δὲ τοῦτο οὐδὲ κριτήριον· (59) τῶν γὰρ κριτηρίων τὰ μέν ἐστι τεχνικὰ τὰ δὲ ἰδιωτικά, οὔτε δὲ τὰ ἰδιωτικὰ κρίνει, ὥσπερ οὐδὲ ὁ ἰδιώτης, οὔτε τὰ τεχνικά, ὥσπερ οὐδὲ ὁ τεχνίτης, διὰ τὰς ἔμπροσθεν εἰρημένας αἰτίας. τοίνυν οὐδέν ἐστι κριτήριον.
(55) Anacharsis the Scythian also, as they say, destroys the apprehension which judges concerning every art, and strongly censures the Greeks for accepting it. “For who,” says he, “is the man who judges a thing by rules of art? Is he the non-expert or the expert artist? But surely we could not say that he is the non-expert; for he is lacking in knowledge of the special features of the art, and just as the blind man does not perceive the effects of vision, nor the deaf those of hearing, so neither is the non-expert keen of sight to apprehend the result produced by artistic methods; since in fact, were we to entrust to him the judgment of any product of art, there will be no difference between lack of art and art, which is absurd. So that the non-expert is not the judge of the special features of art.
(56) It remains, then, to say that the expert artist is the judge; and this again is improbable. For either the fellow-craftsman judges the fellow-craftsman, or the man of one craft the man of another craft. But the man of one craft is incapable of judging the man of another craft; for he is learned in his own art,
(57) but in regard to another man’s he is in the position of a non-expert. Nor in fact can the fellow-craftsman pass judgment on his fellow-craftsman; for precisely this was our question — Who is he that judges those who stand on the same level inasmuch as they are engaged in the same art? And besides, if this fellow-craftsman judges that one, the same thing will be both judging and judged, both trusted and distrusted;
(58) for in so far as the other man is a fellow-craftsman of the man who is being judged, he himself also will be subject to judgment and distrusted, whereas, in so far as he is giving judgment, he will be trusted. But it is not possible for the same thing to be both judging and judged, trusted and distrusted. Therefore, there is none who judges by rules of art. And because (59) of this there is no criterion either; for of criteria some are technical, others non-technical, but, for the reasons already stated, neither the non-technical criteria judge any more than the non-expert, nor the technical any more than the expert artist. So, then no criterion exists. (transl. Bury)
27. Kindstrand 1981:49. Cf. Meins 2017.
28. Cf. Schubert 2010c:157ff.
29. Meins 2017 has presented the first comprehensive analysis of this passage.
30. Jacoby 1923–1958; since 2006 available online and as CD-ROM (Brill); since 2007: Worthington, Brill’s New Jacoby; cf. G. Schepens et al. (1998ff.).
31. Details in Schubert 2010b.

32. FGrHist 328,F.2a (= Stephanus, “Ethnica” p.292 Billerbeck):

ἄστυ· ἡ κοινῶς πόλις. διαφέρει δέ, ὅτι τὸ μὲν κτίσμα δηλοῖ ἡ δὲ πόλις καὶ τοὺς πολίτας. „ἐκλήθη δὲ ἄστυ” ὡς Φιλόχορος ἐν ᾱ Ἀτθίδος (FGrHist 328 F 2a) „διὰ τὸ πρότερον νομάδας καὶ σποράδην ζῶντας τότε συνελθεῖν καὶ στῆναι ἐκ τῆς πλάνης εἰς τὰς κοινὰς οἰκήσεις, ὅθεν οὐ μετανεστήκασιν. Ἀθηναῖοι δὲ πρῶτοι τῶν ἄλλων ἄστη καὶ πόλεις ᾤκησαν“.
“asty: The polis in general. But there is a difference, in that the one (asty) indicates the physical structure, while polis denotes the citizens also. It was called asty, as Philokhoros (says) in the first (eleventh MS) (book) of the Atthis, on account of the fact that, previously living as scattered nomads, at that time they stood still (stenai) from their wandering and came together into common habitations, from which they have not moved. The Athenians preceded others in building towns (asty) and cities (poleis).” (transl. Harding)

This is also handed down in the “Etymologicum Magnum”, in an entry apparently extracted from Oros’ work ‘On Peoples’.

FGrHist 328, F.2b (= ET. (GEN.) M. p. 160, 5): ἄστυ· ἡ πόλις· Φιλόχορος ἐν τῶι ᾱ τῆς Ἀτθίδος φησίν· «ἄστυ δὲ προσηγορεύσαν τὴν πόλιν διὰ τὸ πρότερον νομάδας καὶ σποράδην ζῶντας τότε συνελθεῖν καὶ στῆναι ἐκ τῆς πλάνης εἰς τὰς κοινὰς οἰκήσεις, ὅθεν οὐ μετανέστησαν». οὕτως Ὦρος ἐν ἐθνικῶν.
Asty. The polis. Philokhoros in the first (book) of Atthis says, “they gave the name asty to the polis on account of the fact that, previously living as scattered nomads, at the time they stood still from their wandering and came together into common habitations, from which they did not move. So (says) Oros in (the) On Peoples.” (transl. Harding)
33. Stephanus s.v. Ἀθῆναι: Jacoby FGrHist (Text) comm. to F2–4, pp. 264: Stephanus, “Ethnica” p. 66 Billerbeck: s.v. Ἀθῆναι: … πρῶτοι γὰρ Ἀθηναῖοι τὰ ἄστη καὶ τὰς πόλεις εὑρεῖν ἱστοροῦνται, ὅθεν καὶ τὴν ἀκρόπολιν αὐτῶν πόλιν ἐκάλουν κυρίῳ ὀνόματι, …
34. Siemens 2010:1–50, 10f.
35. The following remarks are a very short description; the detailed description is published in Pöckelmann et al. in Digital Classics Online 3.3 (2017) and Pöckelmann et al. (2019).
36. Mikolov et al. 2013a.
37. Kusner et al. 2015.
38. Further examples and results in Schubert et al. (2019).