Lexicology and Statistical Analysis With Hyperbase-Latin Web Edition: The Specific Collocations Method Applied to Some Latin Semantic Fields

University of Liège, UR Mondes Anciens, LASLA
Over the past thirty years, numerous works [1] have aimed to renew the field of lexical semantics using the method of semic analysis. [2] These publications have led to undeniable advances in Latin lexicography, but the method nevertheless faces a major difficulty: the identification of the semes, [3] as well as the distinction between semes particular to words (“sèmes inhérents”) and semes conditioned by the context (“sèmes afférents”), relies largely on the subjectivity of the linguist.
However, a statistical method, the so-called “specific collocations” method, can help limit this subjectivity. Based on the z-score test, [4] it specifies, on the one hand, which words (forms or lemmas) are attracted by the words constituting the lexical field under consideration and, on the other hand, how strong this attraction is. Thanks to the Latin textual database of LASLA (Laboratoire d’Analyse Statistique des Langues Anciennes of the University of Liège), as well as to the Hyperbase Latin Web Edition interface, one can easily apply the method to a large Latin corpus.
A study of the lists of specific collocations contributes to a better identification of the contextual semes of words according to the various text types in which they occur. A comparison of the specific collocations of several words belonging to the same semantic field helps to refine the description of the inherent semes of each word. These principles will be illustrated by examining some Latin lexical fields, both concrete and abstract.

0. Lexicology and Textometry

Relying on the theories of his predecessors, the work of Rastier (1987) makes a first distinction between “inherent semes” and “afferent semes.” The “inherent semes” belong to the signified of the lexeme “in language.” The “afferent semes” are realized only in the context of a particular discourse. Among the “afferent semes,” Rastier also distinguishes “socially normalized semes,” which result from implicit judgments related to the culture and society in which the discourse is produced. The question is to know in which way new discourse analysis methods can help us to identify semes corresponding to each of these various types.
Basic “afferent semes” of a lexeme are defined as related to a particular discourse or a type of particular discourses: it is clear that they can only be determined by examining the discourse in which the lexeme appears. “Socially normalized semes,” on the other hand, are related to the characteristic features of the culture and society conditioning their nature: these features are undoubtedly reflected in the discourse itself, even if the context of the discourse production can also be a key factor. Therefore, as far as all “afferent semes” are concerned, developing new methods for text analysis will obviously also improve the ways to identify these kinds of semes, and textometry, defined as statistical analysis of textual data, will be a useful tool for doing so.
The utility of these new methods in order to identify precise “inherent semes” is not so obvious when those semes are defined as unrelated to discourse. In this case, the usual way to proceed is to collect and compare dictionary word definitions, in order to extract from them common features that can be considered “inherent semes” of words. However, even in word definitions, the dictionaries take into account the micro-context in which words occur. In this respect, textometry allows accurate analysis of those micro-contexts.
Textometry is, therefore, a useful tool for semic analysis, offering new ways to distinguish between “inherent” and “afferent” semes. A feature that can be related to a lexeme across a large corpus grouping various text types can probably be considered an “inherent seme,” especially if it is also related to the micro-context of the word. Conversely, a feature appearing only in one text type or only in a particular context cannot be defined as an “inherent seme,” although it could be one of the “afferent semes” of the word.

1. Textometry and Latin Databases

Classical Latin is one of the best-equipped languages for textometric research. Already in the early 1960s, LASLA undertook to build a large database of classical Latin literary texts (www.ulg.ac.be/cipl/bdlasla/). The main characteristic of this database is that it relates each unit of a text to a lemma and to a complete morphological analysis, with, in addition, a syntactic code for each verb form. The choice of the lemma and the morphosyntactic annotation relies on semi-automatic processing and involves systematic manual selection and verification of the data by a trained philologist. This processing is demanding, but the resulting information is far more reliable than any produced by a fully automatic process.

Today, the LASLA Classical Latin database has reached about 2,500,000 word forms and aims to cover, one day, the whole of classical Latin literature, from Plautus to Apuleius. In addition, the LASLA team is working on some later texts, such as the Historia Augusta, and has already analyzed a large corpus of Late and Medieval Latin hagiographical texts. In order to allow textometric studies of this corpus, LASLA has developed a joint project with the UMR 7320 “Bases, corpus, langage” at the Université Côte d’Azur (CNRS). Together, the two laboratories adapted the new Hyperbase Web Edition platform to handle the LASLA Latin files, distributed in various databases: a general one with all the LASLA files, another with the major works, and several corresponding to various genres or authors. The Hyperbase-L.A.S.L.A. Web Edition (http://hyperbase.unice.fr/hyperbase/?edition=lasla) allows not only documentary mining but also statistical research. The user can search for any word form, lemma, morphological tag, syntactic tag, or string consisting of a combination of word forms, lemmas, and tags, with, if needed, gaps between the units of the string. From a statistical point of view, Hyperbase offers numerous analyses (reduced deviation, tree analysis, correspondence analysis), not only of the distribution in the corpus of all the above-mentioned items but also of their collocates. As far as semic analysis is concerned, the distributions of words belonging to the same semantic field can already be instructive, as illustrated by Figure 1. According to this distribution histogram, the lemma supplico appears to be characteristic of Plautus and of some of Cicero’s speeches. By contrast, the lemma precor is mainly found in poetry and tragedies. This seems to indicate a specialization of the two lemmas related to text genres.
However, such a distribution histogram does not allow us to pinpoint which semantic features of the two lemmas can explain this specialization. To do so, we have to use a more sophisticated method, relying on the analysis of the specific collocations of both lemmas.

Figure 1: Distribution in the base “Latin” of the lemmas supplico and precor.

2. Latin Lexicology and Specific Collocations

A first attempt to use the specific collocations method as a tool for semic analysis was made by Célentin (2010) in a study of the lexical field of the notion of “cave.” Célentin tries to identify the semes distinguishing the meanings of the words antrum, cauerna, specus, and spelunca. In this study, he takes into account the text genre, as well as the diachronic dimension, extending the corpus to Late Latin poetry. For instance, he points out that, in classical Latin literature, specus and spelunca share some collocates but differ in others, in line with their different distributions. The lemma spelunca is used mostly by Lucretius, Vergilius and Juvenalis. Its specific collocates belong to various semantic fields: earth (saxus, terra, mons, humus, tellus), water (lacus, liquidus, gutta, fons), vastness (cauus, magnus, ingens, uastus), coldness and darkness (frigidus, umbra), and confinement (claudo). Vergilius also uses the lemma specus, but this word is especially characteristic of Curtius, of the tragedies of Seneca and of Tacitus. Most of the semantic fields of the collocates of specus are quite similar to those of spelunca, even if the collocates differ: earth (tellus, mons, rupes, terra, humus, saxum, fretum), water (linquo, unda, lacus, liquor, fons, ripa), vastness (ingens, altus, uastus, immensus, bassus, profundus, altitudo, latus), darkness (obscurus, ater, umbra, occultus), and confinement (claustrum, includo, recludo). However, some collocates of specus constitute a semantic field specific to this word and are related to the notion of “hell” (Dis, Manes, inferus, Tartares, Styx). With such a method, we can identify a seme specific to specus in comparison to spelunca. To know whether this seme should be considered “inherent” or “afferent,” we only have to verify whether it is present in all occurrences of the word in the corpus.
Célentin’s work demonstrates that this method is very efficient for this particular semantic field, but we may wonder whether it is as efficient for other concrete terms, whether it can also be applied to abstract words, and whether it could be improved by using Hyperbase to vary the analysis parameters: word form or lemma as pivot word, word forms or lemmas as collocates, and length of the text span.

3. Specific Collocations with Hyperbase Latin

The method used to calculate specific collocations with Hyperbase relies on the division of the corpus into two subcorpora: the first groups all the text spans in which the word form or lemma under consideration is present; the second is made up of all the other text spans. The software calculates which words or lemmas are specific to the first subcorpus, the one that includes the word form or lemma we wish to study. As this subcorpus has been built on the basis of this word form or lemma functioning as a “pivot,” we assume that the word forms or lemmas specific to this subcorpus have a close relationship with this “pivot” and can be considered its “specific collocates.” As explained in Brunet, [5] the statistical test used to identify the specific collocates is the hypergeometric law, [6] but the results are also transposed into z-scores, which are easier to interpret. With respect to the mathematical expectation, the deviation we observe in the number of occurrences of a word form or lemma can be considered random if the z-score is between 0 and +2. Above 2, there is less than a 5 in 100 chance that the deviation is random, and we can consider the word form or lemma a specific collocate. The higher the z-score, the stronger the association between the word and its collocate.
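The reasoning behind the z-score can be sketched in a few lines of Python. This is not Hyperbase's own computation: the sketch assumes a normal approximation built on the expected value and variance of the hypergeometric distribution, and the function names and the parameters k, n, K, and N are illustrative choices. The figures in the usage lines are hypothetical.

```python
import math

def collocation_z_score(k, n, K, N):
    """Standardized deviation of a candidate collocate in the pivot subcorpus.

    k: occurrences of the candidate inside the pivot subcorpus
    n: size of the pivot subcorpus (in tokens)
    K: occurrences of the candidate in the whole corpus
    N: size of the whole corpus (in tokens)
    """
    p = K / N                      # overall relative frequency of the candidate
    expected = n * p               # mathematical expectation in the subcorpus
    # Hypergeometric variance (sampling without replacement)
    variance = n * p * (1 - p) * (N - n) / (N - 1)
    return (k - expected) / math.sqrt(variance)

def is_specific_collocate(z, threshold=2.0):
    """Apply the threshold of 2 mentioned in the text."""
    return z > threshold

# Hypothetical figures: a word found 20 times in a 1,000-token corpus,
# 8 of them inside a 100-token pivot subcorpus (2 would be expected).
z = collocation_z_score(k=8, n=100, K=20, N=1000)
print(round(z, 2), is_specific_collocate(z))
```

With the same hypothetical corpus, observing exactly the expected 2 occurrences gives a z-score of 0, which is why only clearly positive deviations are retained as specific collocates.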
With Hyperbase, we can calculate attractions between word forms, between lemmas, between tags, or also between units belonging to different categories (for instance, between a lemma and word forms). We can also vary the text spans from a three-word span to a whole sentence, or even a whole paragraph. We can calculate the specific collocates in the whole corpus, in only the texts of one author, or only in one text. We can also identify the specific collocates of pairs of word forms or lemmas. Hyperbase is then also a powerful instrument to pinpoint which collocates can be characteristic of a genre, a text type, or the style of an author.
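To make the role of the text span concrete, here is a minimal Python sketch of how a corpus already cut into spans could be split around a pivot and how the four counts needed for a specificity test could be collected. The function name, the data layout (each span as a list of lemmas), and the toy spans are assumptions made for illustration; they do not reproduce Hyperbase's data model. Counting the pivot's own tokens in the subcorpus size is also a simplifying choice.

```python
def partition_counts(spans, pivot, candidate):
    """Return (k, n, K, N) for a candidate collocate of a pivot.

    spans: list of text spans, each a list of lemmas (or word forms)
    k: occurrences of the candidate in spans containing the pivot
    n: total size of the spans containing the pivot
    K: occurrences of the candidate in all spans
    N: total size of all spans
    """
    k = n = K = N = 0
    for span in spans:
        size = len(span)
        hits = span.count(candidate)
        N += size
        K += hits
        if pivot in span:          # this span belongs to the pivot subcorpus
            n += size
            k += hits
    return k, n, K, N

# Hypothetical toy spans (not taken from the LASLA corpus)
toy_spans = [
    ["deus", "precor", "immortalis"],
    ["populus", "supplico"],
    ["deus", "uideo"],
    ["bellum", "gero"],
]
print(partition_counts(toy_spans, pivot="precor", candidate="deus"))
```

Changing how the spans are cut (three tokens, five tokens, a sentence, a paragraph) changes n and k, and therefore the list of collocates that emerge.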
In order to illustrate the method, we can go back to the supplico/precor case (Figure 2). According to their collocates (text span = five tokens), it is quite clear that supplico and precor belong to two different domains. With collocates such as populo, supplico concerns the field of human and public activities, while precor, with deos and immortales as collocates, is clearly related to the act of praying to the gods. If we apply the same method, for instance, to potestas and potentia, we notice marked differences between the two words. These differences are consistent with the usual definitions of the two words. The lemma potestas is characterized by a list of collocates related to the Roman institutions: in decreasing order of the z-score, tribunicius, tribunus, consularis, plebs, populus, creo, redigo, lex, permitto, magistratus, curiatus, Romanus, consul, senatus, etc. The lemma potentia is related to proper names and to words belonging to the imperial court and the possession of goods: in decreasing order of the z-score, gratia, ops, nobilitas, principatus, princeps, infringo, facultas, superbia, eloquentia, dominatio, diuitiae, inuidia, cupiditas, etc.

These two examples show that the method applies successfully not only to concrete words but also to abstract words. Two other cases will exemplify the various search possibilities offered by Hyperbase: the case of uirtus and that of teneo.

Figure 2: Specific collocates (word forms) of the lemmas supplico and precor.

4. The Virtus Case

As we also know from modern languages—such as, “liberty” versus “liberties”—it is sometimes necessary to distinguish between the singular and the plural forms of a word to establish its various definitions. As Hyperbase allows choosing a lemma associated with a tag as “pivot,” it is easy to study collocates of the singular and plural forms of the same word. The behavior of the word uirtus is, in this respect, particularly exemplary, as shown in Figure 3.

Figure 3: Substantives as specific collocates—lemmas—of the singular and plural forms of uirtus.
We have chosen here to study the substantives functioning as collocates of the singular and plural forms of the lemma uirtus in a text span corresponding to a sentence. The units taken into consideration in order to detect the specific collocates are lemmas. About half of the collocates are common to both the singular and the plural forms of uirtus: uirtus itself, uitium, animus, natura, fortuna, res, bonum, ingenium, homo, gloria, etc. The association strength can, however, be very different: while uitium is the first collocate of the plural, it does not appear in the list of the twenty-six strongest collocates of the singular uirtus. We notice the same phenomenon for natura and ingenium, respectively the fourth and the seventh collocates of uirtutes. More generally, the order of the common collocates is largely different. In addition, plural and singular forms have their own specific collocates. Collocates of the plural uirtutes are mostly related to the philosophical sphere: temperantia, prudentia, honestas, fama, aetas, adolescentia, pietas, communitas, eloquentia, gratia, continentia, conscientia, comitas, sagacitas, etc. Collocates of the singular uirtus derive from the meaning of “courage” or “quality of the uir,” and are related to war: hostis, miles, respublica, populus, laus, bellum, consilium, imperator, legio, consul, ciuis, ciuitas, exercitus, proelium, etc. Nevertheless, the singular uirtus can also be associated with specific qualities: uoluptas, dignitas, felicitas, auctoritas, etc.

The study of adjectives functioning as collocates of uirtus is also very instructive. We have used a text span of three tokens in order to study specifically the adjectives preceding or following the occurrence of uirtus (Figures 4 and 5). We have only taken into account adjectives that can agree with the feminine singular. The results for the specific collocates of the singular forms of uirtus (summa, eximia, multa, pristina, par, magna, antiqua, singulari, maior, beata, incredibili) are consistent with what we noticed for the substantives. For the plural, the list of adjectives likely to agree with uirtutes, uirtutum, or uirtutibus is more limited and more general: pares, ceterae, grauissimae, and similes are all adjectives used to compare one uirtus to others, which means that the plural use stays in the philosophical domain.

Figure 4: Adjectives as specific collocates (word forms) of the singular forms of uirtus.

Figure 5: Adjectives as specific collocates (word forms) of the plural forms of uirtus.

It is also clear that the singular uirtus, which is related to public life and war, and the plural uirtutes, which have a philosophical meaning, do not appear in the same text types. However, we wonder in what kinds of text the singular uirtus does appear and if it is always endowed with the same meaning. A study of the distribution of the singular forms of uirtus (Figure 6) shows that uirtus is not only characteristic of historians, such as Caesar and Sallustius, but also, especially, of Cicero and Seneca.

Figure 6: Distribution of the singular forms of uirtus (base “Latin” z-score).

Figure 7: Specific collocates of the singular forms of uirtus (Cicero versus Seneca).
However, a study of the specific collocates of those singular forms in both authors shows that they have quite different conceptions of the notion (Figure 7). Among the first twenty-five collocates (text span = sentence), only three are common to both authors: uirtus itself, homo, and natura. As far as Cicero is concerned, the concept of uirtus is related to respublica and to what constitutes the Roman respublica. Seneca relates uirtus first to bonum, uoluptas, and uita.
In the various tests used to study the collocates of uirtus, we varied the text span from three tokens to a whole sentence. When the text span is limited to three words, a methodological question arises about the competition between the specific collocations method and the use of a simple concordance. With the last case, that of teneo, we will try to pinpoint the advantages of the method.

5. The Teneo Case

By using the concordance function of Hyperbase, we can detect strings of five word forms with an occurrence of the lemma teneo as the third one. Among the 1,777 occurrences of these strings, we find some recurrent word forms: 45 locum, 17 urbem, 12 consilium, 11 imperium, 9 postem, 9 risum, 8 ordinem, 8 arma, 5 castra, 5 libellos, 3 prouinciam, and 3 uirgam. All these occurrences seem to correspond to phraseological units with different levels of lexical freezing, but it is quite difficult to evaluate which collocation is more frozen than another. Applied to the same strings of five words, the specific collocations method is a way to address this challenge. In decreasing order of z-score, the list of the collocates of teneo is the following: 12.75 locum, 9.75 postem, 7.69 risum, 5.42 urbem, 4.99 castra, 4.89 uirgam, 4.87 libellos, 4.85 consilium, 4.67 ordinem, and 4.51 imperium. The word forms arma and prouinciam do not appear in this list.
If we compare both lists, we notice that the number of occurrences of arma teneo is higher than that of castra teneo. However, given the number of occurrences of the two forms in the corpus, castra is a specific collocate of teneo while arma is not. This means that phraseological freezing is stronger for castra than for arma. The expressions prouinciam teneo and uirgam teneo have the same number of occurrences in the corpus (three), but with a z-score of 4.89, the freezing is much stronger for uirgam teneo: we only find six occurrences of uirgam in the whole corpus. In the cases of postem and risum, we also have the same number of occurrences: nine. Although the z-score is high in both cases, the higher z-score of postem reflects the fact that nearly all the occurrences of postem (9/11) appear with teneo, while we find thirty-four occurrences of risum independent of teneo. Such findings would benefit lexicography: for instance, in the well-known dictionary of Gaffiot, we find, under the lemma teneo, the expressions locum teneo, risum teneo, and somnum teneo, although somnum is not a specific collocate of teneo. Furthermore, the expression postem teneo is absent there, and we only find it under the lemma postis. The specific collocations method could be an opportunity to take into account the relative frequencies of phraseological units in the descriptions of word meanings.
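The comparison between postem, risum, and uirgam can be checked with simple arithmetic. The short Python sketch below uses only the counts given above (for risum, the total of 43 is derived as 9 occurrences with teneo plus 34 without); the function and variable names are illustrative. It computes the share of each form's occurrences that appear with teneo, which is only one ingredient of the z-score: the z-score also depends on the absolute counts, which is why risum, with nine co-occurrences, still scores higher than uirgam, with three.

```python
# (occurrences with teneo, occurrences in the whole corpus), from the text
counts = {
    "postem": (9, 11),
    "risum": (9, 9 + 34),   # 9 with teneo + 34 independent of teneo
    "uirgam": (3, 6),
}

def share_with_pivot(with_pivot, total):
    """Proportion of a form's occurrences found next to the pivot."""
    return with_pivot / total

shares = {form: share_with_pivot(*pair) for form, pair in counts.items()}
ranking = sorted(shares, key=shares.get, reverse=True)

for form in ranking:
    print(f"{form}: {shares[form]:.2f}")
```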

6. Lexicology and Specific Collocations: An Appraisal

The specific collocations method will not replace philological study or linguistic analysis, but it is certainly an interesting tool for lexicology, in particular when combined with the semic approach. The method is useful for detecting “inherent” or “afferent” semes and for bringing out precise meaning variations of a lexeme according to diachrony, genres, styles, and idiolects. It also helps to establish hierarchies in the freezing process of the phraseological units pointed out by dictionaries. The next step in the use of this method for lexicological purposes would likely be to take into account multiple collocations, such as the collocation we detected with Hyperbase between the lemma teneo and the three word forms ferrum, dextra, and manu, all collocates of one another.


Brunet, Cl. 2002. Etude sémantique de beneficium, iniuria et d’autres noms désignant des actes de bienfaisance et de malfaisance en latin dans un rapport d’antonymie. PhD diss., Université de Franche-Comté.
Brunet, É. 1982. “Loi hypergéométrique et loi normale. Comparaison dans les grands corpus.” In Actes du 2e Colloque de lexicologie politique. vol. III, ed. D. Bonnaud-Lamotte, 253–264. Paris.
———. 2011. Hyperbase, Logiciel hypertexte pour le traitement documentaire et statistique des corpus textuels, Manuel de référence, version standard 8.0 et 9.0. http://hyperbase.unice.fr/hyperbase//doc/manuel.pdf
Célentin, H. 2010. Antrum, cauerna, specus & spelunca: études contrastives. MA Thesis, Université de Liège. http://web.philo.ulg.ac.be/lasla/wp-content/uploads/sites/7/2017/09/HCelentin.pdf.
Évrard, É. and Mellet, L. 1998. “Les méthodes quantitatives en langues anciennes.” Lalies 18: 111–155.
Gavoille, L. 2007. “Oratio ou la parole persuasive. Etude sémantique et pragmatique.” In Bibliothèque d’Études classiques. 53rd ed.
Heiden, S. 2004. “Interface hypertextuelle à un espace de cooccurrences: implémentation dans Weblex.” In Le poids des mots. Actes des 7èmes Journées internationales d’Analyse Statistique des Données Textuelles, ed. G. Purnelle, C. Fairon and A. Dister, 577–588. Louvain-la-Neuve.
Moussy, Cl. 1989. “Les métaphores lexicalisées et l’analyse sémique.” In Actes du 5ème Colloque International de Linguistique Latine, Louvain-la-Neuve et Borzée, 31 mars – 4 avril = Cahiers de l’Institut de Linguistique de Louvain, 15, 1–4, ed. M. Lavency and D. Longrée, 309–319.
———. 1991. “La structure du signifié : utilité et limites de l’analyse en traits pertinents (avec application au latin).” In New Studies in Latin Linguistics (Studies in Language Companion Series, 21), ed. R. Coleman, 63–73. Amsterdam and Philadelphia.
Rastier, Fr. 1987. Sémantique interprétative. Paris.
Soutet, O. 1995. Linguistique. Paris.
Thomas, J.-Fr. 2002. Gloria et laus. Etude sémantique. In Bibliothèque d’Études classiques, 31st ed.
———. 2007. Déshonneur et honte en latin: étude sémantique. In Bibliothèque d’Études classiques, 50th ed.


1. Among others, Moussy 1989 and 1991; J.-Fr. Thomas 2002 and 2007; Cl. Brunet 2002; Gavoille 2007.
2. Rastier 1987.
3. The principle of semantic analysis is to divide each lexical morpheme into minimal features of meaning called “semes.” The sum of the semes of a signifier forms a “sememe.” See Rastier 1987: 17; Soutet 1995: 261–262; Célentin 2010: 63–67. More explanations will be given in section 0.
4. On the formula for calculating the z-score, see section 3 and É. Brunet 1982; Évrard and Mellet 1998: 128–131.
5. É. Brunet 2011: 29–32.
6. As described by Heiden 2004.