Demos: Challenges and Lessons


Article Contents

Summary
Introduction
Down the Pipe
What You Get for Free
The PDFs and the Economics of Transformation
The Challenge of Canonical Citation, or, Don’t Mess with the DTD
A Good Use of Standoff Markup
The Future: Inscriptions, Ancient Works, and TextServices
Some Conclusions


Summary

Dēmos: Classical Athenian Democracy (www.stoa.org/projects/demos/home) is a medium-sized digital library of texts aimed at inviting non-specialist readers to engage in a critical reading of primary and secondary sources for this ancient historical topic. This article will not describe the contents of the site at length (for that, see the site itself or the description of the project forthcoming in the New England Classical Journal) but will focus on this site as an object lesson in the benefits, and potential pitfalls, of building an XML-based collection of humanist texts.


Introduction

Dēmos: Classical Athenian Democracy (www.stoa.org/projects/demos/home) is an online collection of articles about Athenian democracy in the 5th and 4th centuries BCE. As of this writing, Dēmos consists of 113 articles, totalling approximately 400,000 words of content, by at least 13 different authors. The 26 major articles are available as PDF files, and these alone total 785 pages of content. So the site is a medium-sized digital library, if we define small as a few documents and large as many thousands of documents.

From the outset, Dēmos was intended to bring together secondary materials (from relatively straightforward descriptions of institutions and biographies of historical figures to scholarly argumentative essays) with easy and meaningful access to the primary sources. Easy access was not hard to envision: direct hyperlinks to the primary sources in the original language or in translation whenever those sources were online. To make this access meaningful, however, we wanted to provide contextual information about those sources, so that readers would be able not only to read, say, a passage from Demosthenes, but also to understand who Demosthenes was, the nature (and limits) of his speeches as evidence, and the details of whatever speech is being cited. We wanted to deliver similar contextual information for place names, personal names, and other elements that might not be familiar to a general readership.

For such a site to work, and to continue to work as its contents grew and as technology and available resources changed, the site had to be dynamic and had to adhere to a separation of concerns. The content of each article had to deal with Athenian democracy and its sources, in some form that was both human-readable and machine-readable but that did not presuppose any particular operating system or medium of publication. Some other mechanism had to take care of cross-referencing and linking, and had to reflect automatically, to the extent possible, the current state of resources in the site locally and those resources available online at other sites. And yet another mechanism had to take care of the ultimate display of the linked and cross-referenced data.

In early 2002, just as we had accumulated enough content for Dēmos to consider publishing its first edition, a combination of open standards for encoding and manipulating electronic texts, and freely available software for working with those texts, reached a critical point of stability, maturity, and simplicity. What had once required programmers of the highest degree of experience was now within the grasp of classicists with mere enthusiasm and technological competence.

The first of the standards in question was the set of Document Type Definitions for XML files developed by the Text Encoding Initiative (the TEI-DTDs). The second was the Extensible Stylesheet Language Transformations (XSLT), which describe how to transform one XML document into another. The third was Cascading Style Sheets (CSS), which allow a site to impose formatting on HTML documents by means of stylesheets that are external to the HTML. The software in question was the combination of Jakarta Tomcat and Cocoon, two Java-based, open-source applications that, together, allow (among other things) XML documents to be delivered to a web browser as HTML, having undergone various transformations along the way.

So the content of Dēmos consists of TEI-conformant XML documents. These  are dissected, merged, and transformed in various ways by a set of custom  XSLT stylesheets, with Tomcat/Cocoon doing the work, before being delivered  as HTML to a reader’s browser, which formats them according to custom  CSS stylesheets for attractive and intuitive reading and interaction.

The work of assembling this system took place between January and June of  2003. In this article, I want to describe some of the technological decisions,  innovations (sometimes mine, sometimes borrowed), and mistakes (all mine)  that went into publication of Dēmos, the debt that this project owes  to other scholars working on similar projects, and some challenges and possibilities  I foresee for the future of the site.


Down the Pipe

When a user visits an article in Dēmos, her browser contacts the site,  sending a request for a named article along with three pieces of information:  the name of the requested article, the section of the requested article (by number or name), and the preferred way of presenting Greek text.

This request goes to Cocoon, which matches the “article” request  to a pipeline. The pipeline is a defined process that begins by reading  one or more XML files, transforms them in various ways, and ends by outputting  an XML file. In the case of a request for a section of an article, the pipeline  works like this:

    •      Cocoon finds a file whose title matches the requested article and reads it.
    •      Cocoon then applies an XSLT stylesheet that prepares any Greek text present for subsequent treatment. By the end of this pipeline, Greek text should appear in the user’s preferred encoding and should be linked to morphological and lexical tools. To do that, it is first necessary to mark each individual Greek word explicitly, using the <w> tag (for “word”) as defined in the TEI Guidelines (http://www.tei-c.org/P4X/AI.html#AILC). Greek text appears in the XML files in Beta Code, and even if the user wants to see Greek transliterated or in Unicode, that Beta Code information must be preserved, to be passed on later to morphological parsers. So this first transformation includes it as an attribute of the <w> tag.

      So “o( a)/nqrwpos sofo/s e)stin.” in the original file would be transformed into a sequence of separately tagged words:

      o(
      a)/nqrwpos
      sofo/s
      e)stin.

      Latin words are handled similarly.

    •      Cocoon notes the user’s preferred method of encoding Greek, whether in Latin transliteration, Beta Code, or Unicode. Using that information, the application processes the XML file with the Transcoder, a Java function written by Hugh Cayless. The Transcoder looks for any elements in the XML that have a “lang” attribute of “grc” (for “Greek”). It then assumes that the text of that element is in Beta Code, and transliterates that text into the preferred encoding. Now all the Greek words are encoded according to the user’s preference, with their original Beta Code preserved.
    •      Cocoon then applies the “article” XSLT stylesheet. This stylesheet begins the process of transforming the XML into XHTML for display to the user. The first thing it does is note the requested section, find that section in the XML file, and work only on that (so rather than extracting the requested section from the file, the stylesheet in effect discards all the other sections from the file).

      This stylesheet actually calls many other stylesheets that perform specific transformations: building the table of contents and adding navigational elements, wrapping text marked as quotations with typographic quotation marks, italicizing emphasized text or text in Latin, and so on.

      It also marks text that will eventually be linked. This includes phrases explicitly marked as cross-references, but also personal names, place names, and (most importantly) citations to sources. For now, these are simply marked with tags, since Cocoon does not yet have the information necessary to make the links specific.
    •      The next stage of the pipeline takes the almost-fully-transformed document and merges it with four other XML files:

      “descriptions_available” – This is an XML list of all the articles currently in Dēmos that describe ancient authors or works, or genres of evidence. Because this file is generated dynamically whenever it is requested, it will always be up-to-date.

      “perseus_available” – This is an XML list of all ancient works known to be available in the Perseus Digital Library (http://www.perseus.tufts.edu). At the moment, Dēmos relies on Perseus for almost all of its linking to the texts and translations of primary sources.

      “demos_available” – This is an XML list of all articles currently in Dēmos. It is also generated dynamically whenever requested. (That process, by the way, takes place in another Cocoon pipeline; pipelines can call other pipelines, which is part of their strength.)

      “lookup” – This file is discussed at length below. Briefly, it allows editorial intervention in the process of automatically generated linking.
    •      With all of the information from the above four files now available, one last XSLT stylesheet fills in the targets for all links. If a citation points to an ancient work that is available in Perseus, the stylesheet links to that work; if a cross-reference points to an article that exists in Dēmos, then the stylesheet links to that. If there is a citation to a work for which there is a descriptive article, or whose author is the subject of a descriptive article, the stylesheet will generate a link to that article.

      As part of this step, the transformation also cleans up, removing the extra data and lists that have been added along the way. The result is a document in XHTML, a format that follows the rules of XML but is recognizable to web browsers.
    •      The final stage of the pipeline is the “serializer,” which sends the completed XHTML file back to the user’s browser, which will format it according to the Cascading Style Sheets and display it on the screen.
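
The first transformation in the list above (wrapping each Greek word in its own element while keeping the Beta Code alongside it) might be sketched as follows. This is a minimal Python illustration, not the XSLT that Dēmos actually uses. The attribute name `betacode`, the function name `wrap_greek_words`, and the `foreign` element in the sample input are assumptions made for the example; only the `w` element and the `lang="grc"` convention come from the description above.

```python
import xml.etree.ElementTree as ET


def wrap_greek_words(elem):
    """Replace the text of `elem` with one <w> child per whitespace-separated word.

    Each <w> keeps the original Beta Code in a `betacode` attribute
    (a hypothetical attribute name), so a later stage could change the
    displayed text without losing the form a morphological parser needs.
    Only elements with plain text content are handled in this sketch.
    """
    words = (elem.text or "").split()
    elem.text = None
    for i, word in enumerate(words):
        w = ET.SubElement(elem, "w", {"lang": "grc", "betacode": word})
        w.text = word
        if i < len(words) - 1:
            w.tail = " "  # keep a space between consecutive words


# Sample input: a Greek phrase in Beta Code inside an element marked lang="grc".
source = '<p>He said <foreign lang="grc">o( a)/nqrwpos sofo/s e)stin.</foreign></p>'
root = ET.fromstring(source)

# Collect the matching elements first, then modify them.
for greek in list(root.iter("foreign")):
    if greek.get("lang") == "grc":
        wrap_greek_words(greek)

wrapped = ET.tostring(root, encoding="unicode")
print(wrapped)
```

Because the Beta Code travels as an attribute, a transcoding step that rewrites the text of each `w` element would leave the original form intact for later stages.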


What You Get For Free

With its ability to perform a series of transformations on an XML file, Cocoon delivers a number of benefits that a site can exploit with very little effort. The most noteworthy is indexing.

One of the common weaknesses of traditional print articles is the lack of  indexing, which is too laborious and costly for a serial publication. But with XML content, indexing is almost automatic.

Just as the Cocoon pipeline that delivers an article to the reader takes note of which section the reader wants and works only with that section, indexing  using XML and XSLT is not so much a matter of generating an index, but of  identifying things to be indexed and throwing away anything else.

For example, if we want an index of personal names, the XSLT stylesheet need include only instructions for dealing with elements marked by one of the tags used for personal names. That stylesheet will ignore anything in the XML that is not a personal name. From there it is only a matter of sorting and formatting.
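
The idea of indexing by keeping only the marked elements and discarding the rest can be illustrated with a short Python sketch. It stands in for the XSLT described above and is not the project's own code. The tag name `persName`, the sample article, and the function name `index_of_names` are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny invented article; only the <persName> elements matter to the index.
article = """<text>
  <div n="1"><p><persName>Pericles</persName> spoke after <persName>Cleon</persName>.</p></div>
  <div n="2"><p><persName>Demosthenes</persName> cites <persName>Pericles</persName>.</p></div>
</text>"""


def index_of_names(xml_text, tag="persName"):
    """Collect the text of every `tag` element, ignore everything else,
    then remove duplicates and sort alphabetically."""
    root = ET.fromstring(xml_text)
    names = {"".join(el.itertext()).strip() for el in root.iter(tag)}
    return sorted(names)


name_index = index_of_names(article)
print(name_index)  # ['Cleon', 'Demosthenes', 'Pericles']
```

Passing a different tag name would, under the same assumptions, yield an index of place names or any other marked category.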

Dēmos’ articles offer indices of personal names, place names, names of archaeological artifacts, deme names, and tribe names. Each article also includes a double index locorum. “Double” because the index has two sections that display the same data in two different ways. The first section is sorted by citation, and can thus answer the question, “Is Aristot. Ath. Pol. 12.3 cited, and if so, where?” The second section of the index is sorted according to section of the article, thus answering the question, “What are the sources for Payment for Participation in the Council?”
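
A double index of this kind amounts to grouping the same (section, citation) pairs in two directions. The Python sketch below shows one way to do that. The pairs are invented sample data, not taken from any Dēmos article, and the function name `double_index` is hypothetical.

```python
from collections import defaultdict

# Invented (section title, citation) pairs, as they might be gathered from one article.
citations = [
    ("Payment for Participation in the Council", "Aristot. Ath. Pol. 62.2"),
    ("Introduction", "Aristot. Ath. Pol. 12.3"),
    ("Payment for Participation in the Council", "Thuc. 8.69.4"),
    ("Introduction", "Thuc. 8.69.4"),
]


def double_index(pairs):
    """Return two views of the same data: citation -> sections, section -> citations."""
    by_citation = defaultdict(list)
    by_section = defaultdict(list)
    for section, citation in pairs:
        by_citation[citation].append(section)
        by_section[section].append(citation)
    # Plain string sorting is used here; a real index would need to sort
    # the numeric parts of citations as numbers.
    return (
        {c: sorted(s) for c, s in sorted(by_citation.items())},
        {s: sorted(c) for s, c in sorted(by_section.items())},
    )


by_citation, by_section = double_index(citations)
print(by_citation["Aristot. Ath. Pol. 12.3"])  # where is this passage cited?
print(by_section["Payment for Participation in the Council"])  # sources for this section
```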


The PDFs and the Economics of Transformation

For the 26 major articles currently in Dēmos, the site offers PDF versions for download. With the combination of Cocoon and XSLT, and the addition of another standard, XSL-FO (eXtensible Stylesheet Language Formatting Objects), it is possible to transform XML into PDF dynamically. Dēmos’ PDFs, however, are not generated dynamically, but through a process that combines XSLT transformation and hand-editing in a page-layout program. The reason for this follows.

Every view of a Dēmos article includes a link that will allow the reader to see the whole article as one long page in the browser. This view can be printed, and offers sufficient clarity and organization to make for relatively satisfactory reading. So even without PDFs users can print out Dēmos’ articles for offline reading.

When contemplating making PDFs available, then, and after examining the possibility of using XSL-FO for the job, I decided that the economics of generating PDFs dynamically were wrong. While XSL-FO is powerful, it does not yet offer the level of control over how a text appears on the page that an editor can have using a modern page-layout application such as Adobe’s InDesign.

In order to make Dēmos’ PDFs represent a significant improvement over simply printing from the browser, I wrote a basic stylesheet that does 90% of the formatting of the XML and delivers it to the browser. From there, I can copy and paste that text into an InDesign template and polish the layout. The resulting PDFs are formatted in the Adobe Minion Pro font, an OpenType font that contains a full complement of polytonic Greek glyphs; the articles include marginal notes, fully kerned text, ligatures, and other typographic features that are expected of a printed text, but still very difficult if not impossible to accomplish purely automatically.


The Challenge of Canonical Citation, or, Don’t Mess with the DTD

One of the most important features of Dēmos is the contextual information offered in addition to traditional scholarly citation. A general readership does not necessarily know what “Dem. 18.123” means, and will scarcely be more enlightened having followed a hyperlink to the middle of some paragraph in “On the Crown.” To be critical interpreters of ancient history, readers need to understand the nature of the evidence. So, whenever possible, Dēmos supplements a citation like “Dem. 18.123” with a marginal note entitled “Read About the Evidence,” including a link to contextual information about Demosthenes and contextual information about “Dem. 18”: that it is an example of oratory, a genre with certain problems and potentials, the historical circumstances of the speech, and so forth.

So the publishing system needs to know, for each citation, to what author and work the citation refers. This is not as easy as it sounds. An experienced reader can look at “Dem. 18.123” and discern that “Dem.”  is the author and “18” is the work. And it would be possible to  extract this information programmatically. But the same algorithm that could  dissect “Dem. 18.123” would be misled by “Hdt. 4.123.”  Likewise, one that could handle “Aristot. Pol. ####” would be  confused by “Dion. Hal. ####.”
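
To make the difficulty concrete, here is a deliberately naive Python parser of the kind the paragraph above warns against. It is only an illustration of the problem, not something Dēmos uses, and the function name `naive_parse` is hypothetical. It assumes that the first number after the author's abbreviation always names a work.

```python
def naive_parse(citation):
    """Split 'Author Work.Section', assuming the first number is always a work."""
    author, _, reference = citation.partition(" ")
    work, _, section = reference.partition(".")
    return {"author": author, "work": work, "section": section}


# Reasonable for Demosthenes: speech 18, section 123.
print(naive_parse("Dem. 18.123"))

# Misleading for Herodotus: "4" is a book of his single work, not a work "4".
print(naive_parse("Hdt. 4.123"))
```

Both citations have the same shape, so no rule based on shape alone can tell that they should be read differently; the distinction has to be supplied from outside the citation string.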

I solved this problem by adding to the TEI-DTD, thus creating a special “DemosTEI-DTD.” This new DTD added three attributes to the element used for citations: “author,” “work,” and “primary.” The first two are self-explanatory; the third, “primary,” could take “true” or “false” as values, thus allowing the site to distinguish between primary and secondary sources for purposes of indexing.

This works pretty well, in practice. A citation such as “Dem. 18.123” carries its author and work in those attributes, so the stylesheets can easily extract the author’s and work’s names and cross-reference them to contextual articles, generating marginal notes and links.

But it was a mistake, nevertheless. By using a non-standard DTD, even one that differs from the TEI’s DTD in such a small way, I made Dēmos a much less friendly citizen in the world of digital libraries, since its files will not validate against the standard DTD. It has made managing the site more difficult, since the “DemosTEI” DTD has to be on any server that hosts the project, which adds another burden for the server’s administrators. And it is untidy: the XML files should not be cluttered with redundant information, since the citations themselves serve as unique identifiers.

The answer to this problem—which I will put in place during the summer  of 2004—is “standoff markup,” the practice of using separate  XML documents to supplement the information in the TEI-conformant ones.

For example, the articles themselves should contain nothing but the canonical citation in their markup. A separate file should contain the information identifying the author and work for citations following a certain pattern:

    citation pattern: Dem. 18.
    author: Demosthenes
    work: Dem. 18
    primary: true

The XSLT stylesheets would read a citation from the article, find its match in the external file, and collect information about the author and work accordingly. Had I employed a system like this from the outset, Dēmos would be better for it today. Fortunately, because the articles are valid according to a DTD, it will be relatively easy to make sweeping, site-wide changes using simple XSLT scripts.
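
A lookup of this kind could work by matching each citation against the patterns in the external file and taking the most specific match. The Python sketch below is one possible reading of that idea, not the planned implementation. The element name `pattern`, the attribute name `prefix`, the second sample entry, and the function name `describe` are all invented for the example; only the “author,” “work,” and “primary” fields come from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical standoff file: element and attribute names are assumptions.
STANDOFF = """<citations>
  <pattern prefix="Dem. 18." author="Demosthenes" work="Dem. 18" primary="true"/>
  <pattern prefix="Hdt. " author="Herodotus" work="Hdt." primary="true"/>
</citations>"""


def describe(citation, standoff_xml=STANDOFF):
    """Return author, work, and primary flag for the longest matching prefix,
    or None when no pattern in the standoff file matches."""
    root = ET.fromstring(standoff_xml)
    matches = [p for p in root.iter("pattern")
               if citation.startswith(p.get("prefix"))]
    if not matches:
        return None
    best = max(matches, key=lambda p: len(p.get("prefix")))
    return {
        "author": best.get("author"),
        "work": best.get("work"),
        "primary": best.get("primary") == "true",
    }


print(describe("Dem. 18.123"))
print(describe("Hdt. 4.123"))
```

With this arrangement the articles carry only the canonical citation, and the knowledge of how to read it lives in one separate file that can be corrected without touching any article.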


A Good Use of Standoff Markup

Dēmos does take advantage of standoff markup, in the form of a file  named “lookup.xml” that the site’s stylesheets consult when  building links between resources.

When the stylesheets encounter certain kinds of elements—personal names,  place names, names of ancient authors or works—they first see if the element is present as a “keyword” element in the file lookup.xml.

If the element is not, then the stylesheets automatically generate an appropriate  link. If the element in question is the name of an ancient work, the stylesheets  will look for a Dēmos article describing that work and, if one is present,  link to it; otherwise, the stylesheets will generate no link. If the element  is a personal name or place name, the stylesheet will generate a link that  looks up that element by name in the Perseus Encyclopedia.

But if the element is present as a keyword in lookup.xml, the stylesheet  can read from that file one or more targets for linking, and will generate  a link to a menu of appropriate resources.
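
The decision just described (consult lookup.xml first, and fall back to automatic linking otherwise) can be sketched in Python. This is an illustration under stated assumptions rather than the site's XSLT. The `entry` and `target` element names, the shape of the sample file, and the function name `resolve` are hypothetical; the “keyword” element and the “key” attribute are mentioned in the article.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of lookup.xml; the entry/target structure is an assumption.
LOOKUP = """<lookup>
  <entry key="lawcourt">
    <keyword>jury</keyword>
    <keyword>juror</keyword>
    <keyword>court</keyword>
    <target>intro_legal_system</target>
    <target>punishment</target>
  </entry>
</lookup>"""


def resolve(name, kind, lookup_xml=LOOKUP, demos_articles=frozenset()):
    """Decide how to link `name`, which was marked up as `kind`
    ('work', 'person', 'place', or anything else)."""
    root = ET.fromstring(lookup_xml)
    # 1. Editorial intervention: a keyword match yields a menu of targets.
    for entry in root.iter("entry"):
        keywords = {k.text for k in entry.iter("keyword")} | {entry.get("key")}
        if name in keywords:
            return ("menu", [t.text for t in entry.iter("target")])
    # 2. Automatic linking for names that are not listed in the lookup file.
    if kind == "work" and name in demos_articles:
        return ("article", [name])
    if kind in ("person", "place"):
        return ("encyclopedia", [name])
    return ("none", [])


print(resolve("jury", "term"))        # several keywords share one set of targets
print(resolve("Pericles", "person"))  # falls back to a lookup by name
print(resolve("Dem. 18", "work"))     # no descriptive article known: no link
```

Because every keyword under an entry resolves to the same list of targets, one entry gives both one-to-many and many-to-many linking.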

For example, many technical terms are explicitly marked up as elements in the articles, including the term “jury.” While there is no single article dedicated to the topic of juries under the Athenian democracy, there are a number of articles or sections of articles in Dēmos, and a number of external resources, that would be helpful to the reader who wants to learn more about juries.

So, in lookup.xml, there is an element whose key attribute is “lawcourt”; under this element there is a “keyword” element with “jury” as its value. Also under that element are other elements that point to resources relevant to lawcourts:

lawcourt
court
courts
juror
jury
palladium
court system
helliaea
delphinium
intro_legal_system
punishment

      Heliaea
lawcourt
Heliaia
http://www.agathe.gr/cgi-bin/qtvr?site=agora;node=58

When the stylesheets encounter “jury” as an element in a Dēmos  article, they use the information in lookup.xml to generate a page that will invite readers to read the following resources:

    • The article entitled “An Introduction to the Legal System,” by Victor Bers and Adriaan Lanni, included in Dēmos.
    • The article entitled “Punishment in Athenian Law,” by Danielle Allen, included in Dēmos.
    • The entry on “dikē” in the Law Glossary, included in Dēmos.
    • The entry on “klepsudra” in the Law Glossary, included in Dēmos.
    • The Perseus Encyclopedia entry on the Heliaea.
    • The sections on the lawcourt and the Heliaea from the Athenian Agora Excavations’ site.
    • Epsilon 2505, an entry in the Suda Online, describing the Heliaea.

So this instance of standoff markup allows editorial intervention in the automated process of linking. It also allows one-to-many linking, when many resources shed light on a single entity.

It also allows many-to-many linking. Each entry in lookup.xml can have many “keyword” elements. So “jury,” “juror,” “lawcourt,” “Heliaea,” and “court” will all point to this set of resources having to do with courts, juries, and justice under the Athenian democracy.

And, of course, this system preserves the “Separation of Concerns.” An individual article need contain information only about its own topic. Its markup does not need to be “aware” of any other articles in Dēmos or elsewhere. Each article can contain only information about content. The Cocoon pipeline is the place where the “concern” of function belongs, and so the pipeline handles the linking by bringing in an extra XML file.

And the standoff markup, being itself a well-formed XML file, could be used in other ways, in other projects, since it is also independent of the stylesheets that manipulate it under the current implementation.


The Future: Inscriptions, Ancient Works, and TextServices

In the near future, we hope to add to the growing Dēmos library two important collections. First, Michael Arnush has been editing a collection of inscriptions, with notes and translations, fundamental to our understanding of Athenian democracy. We will mark these up according to the TEI-DTD, following the conventions of the EpiDoc initiative, which is working to standardize how the TEI tagset should be applied to epigraphic documents. These texts will require a new set of stylesheets to bring them to a wide audience in a useful way: hiding some of the more esoteric conventions of epigraphy from casual readers while making them available to scholars, providing a convenient and intuitive reading environment, and associating those sources with other articles already in place.

We also hope to build a library of ancient literary texts and translations,  with rich internal markup, that could be fully integrated into the site. Some  of these will be texts not currently available online, such as certain works  of Plutarch; others will be new editions of texts now available elsewhere.

This project will stand on the TextServices protocol, described elsewhere  in this issue of Classics@, and so the Dēmos Ancient Texts collection  will be fully integrated into a distributed digital library of humanist texts,  with easy and fully automated sharing of data and metadata.

In fact, all of the workings of Dēmos are going to be rebuilt in accordance  with the TextServices protocol. This will make the site itself more orderly,  expandable, and compliant with standards, and will allow any other site to  incorporate the texts that make up Dēmos into its own collection, for  its own purposes.


Some Conclusions

Our discipline is still very new to electronic publishing, its potentials, and its pitfalls, and Dēmos: Classical Athenian Democracy has certainly not been exempt from problems. All contributors to the project have been, to a greater or lesser extent, learning on the job.

But the virtue of working in an electronic medium is that dead-end streets  are not one-way; as long as the content remains intact and untouched, there  is no real danger of “breaking” anything. And the virtue of working  with open standards such as XML, XSLT, and CSS, and working with open-source   software such as Tomcat and Cocoon, lies in the opportunities to profit from  the experience and talents of others.

Hugh Cayless’ contribution of the Transcoder has been a sine qua non for Dēmos and dozens of other ongoing projects. Bruce Robertson’s deep knowledge of Java-based server software and his patience in sharing it with the less-enlightened have been invaluable. Their work, and the work and insights of Anne Mahoney, Ross Scaife, Neel Smith, Michael Jones, and others, were what allowed this project to come together, and will certainly contribute to any future improvements.

And just as the potential of electronic publication has engendered an active and collegial scholarly community, the technologies and techniques at our disposal here at the beginning of the millennium promise to bring our texts (both our ancient texts and those texts we create as we try to understand them) into fruitful new relationships.