Wednesday, August 27, 2008

The First Thousand Years of Greek

The Center for Hellenic Studies has just announced, via its website, a project led by Neel Smith (Holy Cross) entitled "The First Thousand Years of Greek." I reproduce the entire announcement here, since the CHS website isn't set up to let me link directly to the announcement itself:
The First Thousand Years of Greek aims to create a corpus, to be made available under a free license, of TEI-compliant texts and lemmatized word indices coordinated with the on-line Liddell-Scott-Jones lexicon from the Perseus project. The coverage ultimately should include at least one version of every Greek text known to us from manuscript transmission from the beginning of alphabetic writing in Greece through roughly the third century CE.

In 2008, the capabilities of consumer-level personal computers, the tools available specifically for working with ancient Greek, and above all the publication of digital resources under licenses enabling scholarly use place the dream of the First Thousand Years of Greek within reach. Gregory Crane and the Perseus project have augmented Liddell-Scott-Jones with unique identifiers on every entry, and released this under a Creative Commons (free) license. Peter Heslin, whose work has always been a model of appropriate free licensing, has recently published in Diogenes 3 a polished library for working with the TLG E corpus, and by applying the open-sourced Perseus morphological parser to every word in the TLG E word list and then publishing the resulting index, has shown how even data sets with a restrictive license like the TLG can be used to create valuable new free resources. Hugh Cayless' transcoding transformer has become an indispensable piece of the programmer's toolkit, as support for Unicode continues to mature in a range of programming languages on different operating systems. At the Center for Hellenic Studies, Neel Smith and Christopher Blackwell have led the development of Canonical Text Services (information at chs, or mirrored here), a network service that retrieves passages of text identified by canonical references.
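To make the transcoding step mentioned above concrete, here is a minimal sketch of converting lowercase Beta Code (the ASCII encoding used by the TLG) to precomposed Unicode Greek. It is not Hugh Cayless' transcoder and does not reproduce its interface; the function name `beta_to_unicode` and the tables are assumptions made for illustration. Capital letters (the `*` prefix) and the rest of Beta Code's escape vocabulary are deliberately left out.

```python
import unicodedata

# Lowercase Beta Code letters and their Greek equivalents (illustrative subset).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic marks mapped to Unicode combining characters.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(text):
    """Convert lowercase Beta Code to NFC-normalized Unicode Greek.

    Hypothetical helper: handles letters, the diacritics above, and
    final sigma only. Unknown characters pass through unchanged.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == "s":
            # Sigma is final unless another letter follows it directly.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            out.append("σ" if nxt in LETTERS else "ς")
        elif ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        else:
            out.append(ch)
    # NFC composes base letters and combining marks into precomposed forms.
    return unicodedata.normalize("NFC", "".join(out))
```

Because Beta Code writes diacritics after the letter they modify, the sketch can emit combining characters in input order and leave composition to Unicode normalization, which is one reason mature Unicode support in the programming language matters for this kind of work.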

By combining public-domain readings of ancient texts or translations, which can be automatically transferred from digital collections such as the TLG, Perseus, and Project Gutenberg, with existing free resources, the CHS team will automate —and make it possible for others to automate— the most tedious aspects of creating the First Thousand Years of Greek. What we currently lack, and must create manually, is shockingly basic: an inventory of existing ancient Greek texts. The TLG Canon is a useful reference, but it is an inventory of print volumes, not of Greek texts. (So Ptolemy's Geography appears as two works in the TLG Canon because the TLG used two different print editions for different parts of the work; and of course entries for texts in “fragments” collections appear in the TLG Canon even though they do not exist as independent texts.) An inventory of Greek texts preserved by manuscript transmission will necessarily present a selection of material that is radically different from the material found in the TLG Canon.

In addition to historical metadata included in such an inventory, we need to determine for each text how it should be cited, and how that citation scheme should be mapped onto the TEI's semantic markup. There is no way to avoid making these editorial decisions individually for each text included in the First Thousand Years of Greek, but once the citation scheme has been organized for a given text, we should be able to extract readings automatically from the TLG, Perseus, or Project Gutenberg, and then apply software to the extracted content to generate the new texts and indices of the First Thousand Years of Greek.
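As a rough illustration of what mapping a citation scheme onto TEI markup can look like, the sketch below treats a two-level scheme (book, then line) as a list of TEI element names and uses it to retrieve a passage by a dotted reference such as `1.2`. Everything here is an assumption for the sake of the example: the sample document, the `CITATION_SCHEME` list, and the `get_passage` function are hypothetical and are not the project's software or the Canonical Text Services interface.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample text: placeholder content, not a real edition.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="book" n="1">
      <l n="1">line one of book one</l>
      <l n="2">line two of book one</l>
    </div>
    <div type="book" n="2">
      <l n="1">line one of book two</l>
    </div>
  </body></text>
</TEI>"""

# The editorial decision for this text: each level of a citation
# corresponds to one TEI element, identified by its @n attribute.
CITATION_SCHEME = ["div", "l"]  # book, then line


def get_passage(root, reference, scheme=CITATION_SCHEME):
    """Return the text cited by a dotted reference such as '1.2'."""
    parts = reference.split(".")
    if len(parts) > len(scheme):
        raise ValueError(f"reference {reference!r} is deeper than the scheme")
    node = root.find(f".//{{{TEI_NS}}}body")
    for tag, n in zip(scheme, parts):
        node = node.find(f"{{{TEI_NS}}}{tag}[@n='{n}']")
        if node is None:
            raise KeyError(reference)
    # Collapse whitespace so a whole book reads as a single string.
    return " ".join("".join(node.itertext()).split())
```

With the sample parsed by `ET.fromstring(SAMPLE)`, a reference of `1.2` selects a single line and a reference of `2` selects a whole book. The point of the design is that the per-text editorial decision lives in one small piece of data (the scheme), while the extraction code stays the same for every text.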

The quality of existing digital and print editions across the set of texts covered by the First Thousand Years of Greek will not be perfectly even. This will certainly mean that coverage of some parts of the project will advance more quickly than others. The CHS team expects that by beginning with material already available in good digital and print sources, we can gather a significant corpus quickly, and continue to expand its coverage over time. In the fall of 2008, the project is focusing on the first thousand years of Greek verse, with the goal of creating a complete corpus of all Greek texts in verse known through manuscript copying through the third century CE. The CHS welcomes collaborators, and invites any individuals, groups, or institutions who would like to contribute or just find out more about the First Thousand Years of Greek to email the project lead, Neel Smith, at first1kyears at

