horothesia: Eighteen Years of EpiDoc. Now what?

Transcript of my keynote address, delivered to the EAGLE 2014 International Conference on Monday, September 29, 2014, at the École normale supérieure in Paris:

Thank you.

Allow me to begin by thanking the organizers of this conference. The conference chairs: Silvia Orlandi, Francois Berard, and John Scheid. Members of the Steering Committee: Vittore Casarosa, Pietro Liuzzo, and Raffaella Santucci. The local organizing committee: Elizabeth Le Bunetel and Philippe Martineau. Members of the EAGLE 2014 General Committee -- you are too numerous to mention, but no less appreciated. To the sponsors of EAGLE Europeana: the Competitiveness and Innovation Framework Programme of the European Commission. Europeana. Wikimedia Italia. To the presenters and poster-authors and their collaborators. To those who have made time out of busy schedules to prepare for, support, or attend this event. Colleagues and friends. Thank you for the invitation to speak and to be part of this important conference.

OK. Please get out your laptops and start up the Oxygen XML Editor. If you actually read the syllabus for the course, you'd have already downloaded the latest copy of the EpiDoc schema...

Just kidding.

I have perhaps misled you with my title. This talk will not just be about EpiDoc. Instead, I'd like to use EpiDoc as an entrance point into some thoughts I've had about what we are doing here. About where we are going. I'd like to take EpiDoc as an example -- and the EAGLE 2014 Conference as a metaphor -- for something much larger: the whole disparate, polyvalent, heterarchical thing that we sometimes call "Épigraphie et électronique". Digital epigraphy. Res epigraphica digitalis.

Before we try to unpack how we got here and where we're going, I'd like to ask for your help in trying to illuminate who we are. I'd like you to join me in a little exercise in public self-identification. Not only is this an excellent way to help fill the generous block of time that the conference organizers have given me for this talk, it's also much less awkward than trooping out to the Place de la Sorbonne and doing trust falls on the edge of the fountain. ... Right?

Seriously. This conference brings together a range of people and projects that really have had no specific venue to meet, and so we are in some important ways unknown to each other. It's my hypothesis that, if we learn a bit about each other up front, we prime the pump of collaboration and exchange for the rest of the conference. After all, why do we travel to conferences at all if it is not for the richness of interacting with each other, both during sessions and outside them? OK, and as Charlotte Roueché is ever vigilant to remind us, for the museums.

OK then, are you ready?

Independent of any formal position or any academic or professional credential, raise your hand if you would answer "yes" to this question: "Are you an epigraphist?"

What about "are you an information scientist?"

Historians?

Oh, yes, you can be more than one of these -- you'll recall I rolled out the word "heterarchy" in my introduction!

How about "Wikipedian?" "Cultural Heritage Professional?" "Programmer?" "Philologist?" "Computer Scientist?" "Archivist?" "Museologist?" "Linguist?" "Archaeologist?" "Librarian?" "Physicist?" "Engineer?" "Journalist?" "Clergy?"

Phooey! No clergy!

Let's get at another distinction. How many of you would identify yourselves as teachers?

What about students?

Researchers? Administrators? Technicians? Interested lay persons?

OK, now that we have your arms warmed up, let's move on to voices.

If you can read, speak, or understand a reasonable amount of the English language, please join me in saying "I understand English."

Ready? "I understand English."

OK. Now, if we can read, speak, or understand a reasonable amount of French, shall we say "Je comprends le français?"

"Je comprends le français."

What about Arabic?

Bulgarian? Catalan? Flemish? German? Of course there are many more represented here, but I think you get my point.

OK. Now let's build this rhetorical construct one step higher.

This one involves standing up if that's physically appropriate for you, so get yourselves ready! If you cannot stand, by all means choose some other, more appropriate form of participation.
Independent of any formal position or any academic credential, I want you to stand up if you consider yourself a "scholar".

Now, please stay standing -- or join those standing -- if you consider yourself a "student".

Yes, I did it. I reintroduced the word "student" from another category of our exercise. I am not only a champion of heterarchy, but also of recursive redefinition.

And now, please stay standing -- or join those standing -- if you consider yourself an "enthusiast."

If you're not standing, please stand if you can.

Now, pick out someone near you that you have not met. Shake their hand and introduce yourself. Ask them what they are so enthusiastic about that they were compelled to come to this conference!

Alright. Please resume your seats.

I think we're warmed up.

Let me encourage you to adopt a particular mindset while you are here at this conference. I hope that you will find it to be both amenable and familiar. It's the active recognition of the valuable traits we all share: intelligence, inquisitiveness, inventiveness, incisiveness, interdependence. Skill. Stamina. Uniqueness. Respect for the past. Congeniality.

I am here, in part, because I have a deep, inescapable interest in the study of ancient documents and in the application of computational methods and new media to their resurrection, preservation, and contemplation, and to their reintegration into the active cultural memory of the human people.
I have looked over the programme for this conference, and I have the distinct impression that your reasons for being here are somewhat similar to mine. I am delighted to have this opportunity to visit with old friends and fellow laborers. And to make the acquaintance of so many new ones. I expect to be dazzled by the posters and presentations to come. Are you as excited as I am?

My title did promise some EpiDoc.

How many of you know EpiDoc?

How many of you know what EpiDoc is?

How many of you have heard of EpiDoc?

The word "EpiDoc" is a portmanteau, composed of the abbreviated word "epigraphy" and the abbreviated word "document" or "documentation" (I can't remember which). It has become a misnomer, as EpiDoc is used for much more than epigraphic documents and documentation. It has found a home in papyrology and in the study of texts transmitted to us from antiquity via the literary and book-copying cultures of the intervening ages. It has at least informed, if not been directly used in, other allied subfields like numismatics and sigillography. It's quite possible I'll learn this week of even broader usages.

EpiDoc is a digital format and method for the encoding of both transcribed and descriptive information about ancient texts and the objects that supported and transmitted them. Formally, it is a wholly conformant customization of the Text Encoding Initiative's standard for the representation of texts in digital form. It is serialized in XML -- the Extensible Markup Language -- a specification developed and maintained by the World-Wide Web Consortium.

EpiDoc is more than format and method. It is a community of practice. The term embraces all the people who learn, use, critique, and talk about EpiDoc. It also takes in the Guidelines, tools, and other helps that have been created and curated by those people. All of them are volunteers, scraping together the time to work on EpiDoc out of their personal time, their academic time, and out of the occasional grant. There has never been formal funding devoted to the development or maintenance of the EpiDoc guidelines or software. If you are a participant in the EpiDoc community, you are a hero.

EpiDoc was born in the late 1990s in a weird little room in the northwest corner of the third floor of Murphey Hall on the campus of the University of North Carolina at Chapel Hill. The room is no longer there. It was consumed in a much-needed and long-promised renovation in 2003 or so. It was the old Classics Department computer lab: a narrow space with a sturdy, home-made, built-in counter along two walls and a derelict bookshelf. It was part of a suite of three rooms, the most spacious of which was normally granted as an office to that year's graduate fellow.

The room had been appropriated by Classics graduate students Noel Fiser and Hugh Cayless, together with classical archaeology graduate student Kathryn McDonnell, and myself (an interloper from the History Department). The Classics department -- motivated and led by these graduate students with I-forget-which-faculty-member serving as figurehead -- had secured internal university funding to digitize the department's collection of 35 millimeter slides and build a website for searching and displaying the resulting images. They bought a server with part of the grant. It soon earned the name Alecto after one of the Furies in Greek mythology. I've searched in vain for a picture of the lab, which at some point we sponge-painted in bright colors evocative of the frescoes from Minoan Santorini. The world-wide web was less than a decade old.

I was unconscious then of the history of computing and the classics at Chapel Hill. To this day, I don't know if that suite of rooms had anything to do with David Packard and his time at Chapel Hill. At the Epigraphic Congress in Oxford, John Bodel pointed to Packard's Livy concordance as one of the seminal moments in the history of computing and the classics, and thus the history of digital epigraphy. I'd like to think that we intersected that heritage not just in method, but in geography.

I had entered the graduate program in ancient history in the fall of 1995. I had what I would later come to understand to have been a spectacular slate of courses for my first term: Richard Talbert on the Roman Republic, Jerzy Linderski on Roman Law, and George Houston on Latin Epigraphy.
Epigraphy was new to me. I had seen and even tried my hand at reading the odd Latin or Greek inscription, but I had no knowledge of the history or methods of the discipline, and very little skill. As George taught it, the Latin Epigraphy course was focused on the research use of the published apparatus of Latin epigraphy. The CIL. The journals. The regional and local corpora. What you could do with them.

If I remember correctly, the Epigraphic Database Heidelberg was not yet online, nor were the Packard Greek inscriptions (though you could search them on CDROM). Yes, the same Packard. Incidentally, I think we'll hear something very exciting about the Packard Greek Inscriptions in tomorrow's Linked Ancient World Data panel.

Anyway, at some point I came across the early version of what is now called the Epigraphische Datenbank Clauss - Slaby, which was online. Back then it was a simple search engine for digital transcripts of the texts in L'Année épigraphique from 1888 through 1993. Crucially, one could also download all the content in plain text files. If I understand it correctly, these texts were also destined for publication via the Heidelberg database (and eventually Rome too) after verification by autopsy or inspection of photographs or squeezes.

At some point, I got interested in abbreviations. My paper for George's class was focused on "the epigraphy of water" in Roman North Africa. I kept running across abbreviations in the inscriptions that didn't appear in any of the otherwise helpful lists one finds in Cagnat or one of the other handbooks. In retrospect, the reasons are obvious: the handbook author tailors the list of abbreviations to the texts and types of texts featured in the handbook itself. Because those texts are selected for importance and range, the statistical distributions of textual types and language, and of features like abbreviation, are not the same as those for the entire corpus. So, what is a former programmer to do? Why not download the texts from Clauss' site and write a program to hunt for parentheses? The Leiden Conventions make parentheses a strong indicator of abbreviations that have been expanded by an editor, so the logic for the program seemed relatively straightforward.
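
For the curious, the parenthesis hunt can be sketched in a few lines of Python. This is a minimal illustration, not the original program: the regular expression, the function names `find_abbreviations` and `tally`, and the sample text are all assumptions made for the example. It assumes plain-text transcriptions in which an abbreviation is written as a run of letters immediately followed by its expansion in parentheses, as in `Aug(usti)`.

```python
import re
from collections import Counter

# A run of letters immediately followed by a parenthesized run of letters,
# e.g. "Aug(usti)": abbreviation "Aug", editorial expansion "usti".
ABBREVIATION = re.compile(r"([^\W\d_]+)\(([^\W\d_]+)\)")


def find_abbreviations(text):
    """Return (abbreviation, expanded word) pairs found in one transcription."""
    return [
        (match.group(1), match.group(1) + match.group(2))
        for match in ABBREVIATION.finditer(text)
    ]


def tally(texts):
    """Count how often each (abbreviation, expanded word) pair occurs."""
    counts = Counter()
    for text in texts:
        counts.update(find_abbreviations(text))
    return counts


if __name__ == "__main__":
    sample = ["Imp(eratori) Caes(ari)", "d(ecreto) d(ecurionum)", "Imp(eratori)"]
    for (abbreviation, expansion), count in tally(sample).most_common():
        print(abbreviation, expansion, count)
```

The sketch deliberately ignores the harder cases: parentheses used for something other than an expansion, and abbreviations with more than one expanded segment, which it would split into separate pairs.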

Mercifully, the hacktastical code that I wrote to do this task has, I think, perished from the face of the earth. The results, which I serialized into HTML form, may still be consulted on the website of the American Society of Greek and Latin Epigraphy.

As useful as the results were, I was dissatisfied with the experience. The programming language I had used -- called "C" -- was not a very good fit for the kind of text processing involved. Moreover, as good as the Leiden Conventions are, parentheses are used for things other than abbreviations. So, there was manual post-processing to be done. And then there were the edge cases, like abbreviations that stand alone in one document, but are incorporated into longer abbreviations in others. And then there were expanded use cases: searching for text in one inscription that was abbreviated in another. Searching for abbreviations or other strings in text that was transcribed from the original, rather than in editorial supplement or restoration. And I wanted a format and software tools that were a better fit for textual data and this class of problems.

XML and the associated Extensible Stylesheet Language (XSL) -- both then fairly new -- seemed like a good alternative approach. So I found myself confronted with a choice: should I take XML and invent my own schema for epigraphic texts, or should I adopt and adapt something someone else had already created? This consideration -- to make or to take -- is still of critical importance not only for XML, but for any format specification or standards definition process. It's important too for most digital projects. What will you build and on what will you build it?

There are pros and cons. By adopting an existing standard or tool, you can realize a number of benefits. You don't reinvent the wheel. You build on the strengths and the lessons of others. You can discuss problems and approaches with others who are using the same method. You probably make it easier to share and exchange your tools and any data you create. It's possible that many of the logic problems that aren't obvious to you at the beginning have already been encountered by the pioneers.
But standards and specifications can also be walled gardens in which decisions and expert knowledge are hoarded by the founders or another elite group. They can undermine openness and innovation. They can present a significant learning curve. You can use a complex standard and find that you've built a submarine to cross the Seine. Something simpler might have worked better.

Back then, there was a strong narrative around warning people off the cavalier creation of new XML schemas. The injunction was articulated in a harsh metaphor: "every time someone creates a new schema, a kitten dies." Behind this ugly metaphor was the recognition of another potential pitfall: building an empty cathedral. Your data format -- your personal or parochial specification -- might embody everything you imagined or needed, but be largely useless to, or unused by, anyone else.
So, being a cat lover, and being lazy (all the best programmers are lazy), I went looking for an existing schema. I found it in the Text Encoding Initiative. Whether the TEI (and EpiDoc) fit your particular use case is something only you can decide. For me, at that time and since, it was a good fit. I was particularly attracted to a core concept of the TEI: one should encode the intent behind the formatting and structure in a document -- the semantics of the authorial and editorial tasks -- rather than just the specifics of the formatting. So, where the Leiden Conventions would have us use parentheses to mark the editorial expansion of an abbreviation, the TEI gives us XML elements that mean "abbreviation" and "expansion." Where a modern Latin epigraphic edition would use a subscript dot to indicate that the identity of a character is ambiguous without reference to context, the TEI gives us the "unclear" element.
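
To make the contrast concrete, here is a small Python sketch that rewrites two Leiden-style conventions as TEI markup. It is an illustration under stated assumptions, not a description of any EpiDoc tool: the function name `leiden_to_tei` and the patterns are hypothetical, the subscript dot is assumed to be typed as the combining character U+0323, and the nesting of `expan`, `abbr`, and `ex` shown here is one common pattern that should be checked against the EpiDoc Guidelines before use.

```python
import re
from xml.sax.saxutils import escape

# One letter, optionally carrying U+0323 COMBINING DOT BELOW (the subscript dot).
LETTER = r"[^\W\d_]\u0323?"
# Abbreviated letters immediately followed by a parenthesized expansion.
ABBREVIATION = re.compile(rf"((?:{LETTER})+)\(([^\W\d_]+)\)")
# A single dotted letter.
UNCLEAR = re.compile(r"([^\W\d_])\u0323")


def mark_unclear(fragment):
    """Wrap each dotted letter in <unclear> and drop the dot itself."""
    return UNCLEAR.sub(r"<unclear>\1</unclear>", fragment)


def leiden_to_tei(text):
    """Rewrite parenthesized expansions and dotted letters as TEI elements."""
    pieces = []
    position = 0
    for match in ABBREVIATION.finditer(text):
        # Text between abbreviations: escape XML specials, then mark unclear letters.
        pieces.append(mark_unclear(escape(text[position:match.start()])))
        pieces.append(
            "<expan><abbr>%s</abbr><ex>%s</ex></expan>"
            % (mark_unclear(match.group(1)), match.group(2))
        )
        position = match.end()
    pieces.append(mark_unclear(escape(text[position:])))
    return "".join(pieces)


if __name__ == "__main__":
    print(leiden_to_tei("Aug(usti)"))
    print(leiden_to_tei("Au\u0323g(usti)"))
```

The point of the exercise is the one made above: once the parentheses and dots become named elements, a search can ask for "abbreviations" or "unclear letters" directly rather than guessing from punctuation.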

This encoding approach pays off. I'll give just one example. For a few years now, I've been helping Arlo Griffiths (who directs the Jakarta research center of the École française d'Extrême-Orient) to develop a corpus of the surviving inscriptions of the Campa Kingdoms. This is a small corpus, perhaps 400 extant inscriptions, from coastal Vietnam, that includes texts in both Sanskrit and the incompletely understood Old Cam language. The script involved has not yet made its way into the Unicode specification. The standard transliteration scheme for this script, as well as some of the other editorial conventions used in the publication of Cam inscriptions, overlaps and conflicts with the Leiden conventions. But with TEI/EpiDoc there is no confusion or ambiguity. The XML says what the editor means to say, and the conventions of transcription are preserved unchanged, perhaps someday to be converted programmatically to Unicode when Unicode is ready.

EpiDoc transitioned from a personal project to a public one when another potential use case came along. For some time, a committee commissioned by the Association Internationale d'Épigraphie Grecque et Latine had been working under the direction of Silvio Panciera, then the chair of Latin epigraphy at La Sapienza in Rome. Their goal was to establish a comprehensive database of Greek and Latin inscriptions, primarily for the purpose of searching the texts and associated descriptive information or metadata. It was Charles Crowther at Oxford's new Centre for the Study of Ancient Documents who put me in contact with the committee. And it was Charles who championed the eventual recommendation of the committee that the system they envisioned must be able to import and export structured documents governed by a standard schema. He was thinking of EpiDoc.

Many years have passed and many things have changed, and I'm forced to leave out the names of so many people whose hard work and acumen have brought about those changes. Here in Paris today Panciera's vision stands on the cusp of realization. It has also been transcended, for we are not here to talk about a standalone textual database or a federation of such, but about the incorporation of Greek and Latin epigraphy -- in all its historiographical variety and multiplicity of reception -- into the digital cultural heritage system of Europe (Europeana) and into the independent digital repository of a global people: Wikipedia and Wikidata. That EpiDoc can play a role in this grand project just blows me away.

And it's not just about EAGLE, Europeana, Wikipedia, and EpiDoc. It's about a myriad other databases, websites, images, techniques, projects, technologies, and tools. It's about you and the work that you do.

Even as we congratulate ourselves on our achievements and the importance of our mission, I hope you'll let me encourage you to keep thinking forward. We are doing an increasingly good job of bringing computational approaches into many aspects of the scholarly communication process. But plenty remains to be done. We are starting to make the transition from using computer hardware and software to make conventional books and digital imitations thereof; "born digital" is starting to mean something more than narrative forms in PDF and HTML, designed to be read directly by each single human user and, through them, digested into whatever database, notebook, or other research support system that person uses. We are now publishing data that is increasingly designed for harvesting and analyzing by automated agents and that is increasingly less encumbered by outdated and obstructive intellectual property regimes. Over time, our colleagues will begin to spend less time seeking and ingesting data, and more time analyzing, interpreting, and communicating results. We are also lowering the barriers to appreciation and participation in global heritage by a growing and more connected and more vulnerable global people.

Will we succeed in this experiment? Will we succeed in helping to build a mature and responsible global culture in which heritage is treasured, difference is honored, and a deep common cause embraced and protected? Will we say three years from now that building that database or encoding those texts in EpiDoc was the right choice? In a century, will our work be accessible and relevant to our successors and descendants? In 5? In 10?

I do not know. But I am thrilled, honored, and immensely encouraged to see you here, walking this ancient road and blazing this ambitious and hopeful new trail. This is our opportunity to help reunite the world's people and an important piece of their heritage. We are a force against the recasting of history into political rhetoric. We stand against the convenient ignorance of our past failures and their causes. We are the antidote to the destruction of ancient statues of the Buddha, to the burning of undocumented manuscripts, to papyri for sale on eBay, to fields of holes in satellite images where once there was an unexcavated ancient site.

Let's do this thing.

Monday, October 6, 2014
