Friday, November 7, 2014

Updated in Planet Atlantides: From Stone to Screen

I have updated in the Maia and Electra feed aggregators the URL and feed URL for the website of the From Stone to Screen project at the University of British Columbia:

title = From Stone to Screen
url = http://fromstonetoscreen.com/
creators = University of British Columbia
description = There are over 1,000 artifacts and squeezes of inscriptions in the collection of the Department of Classical, Near Eastern, and Religious Studies of The University of British Columbia. Until now, the collection was only available on site in Vancouver. We are excited to announce the beginning of our effort to make these objects available for study to scholars and students around the world.
feed = http://fromstonetoscreen.com/?feed=rss2


New in EpiDig: From Stone to Screen

The epigraphic squeeze and artifact digitization project at the University of British Columbia has a new website, and a record for it has been added to the Digital Epigraphy Zotero Library.




Friday, October 24, 2014

New in EpiDig

Records for the following digital resources for the discovery, publication, study, and teaching of epigraphy have been added to the EpiDig library at zotero.org:

Bérard, François, Denis Feissel, Nicolas Laubry, Pierre Petitmengin, Denis Rousset, and Michel Sève. Guide de l’Epigraphiste.
“Epigraphica Anatolica.” Institut für Altertumskunde, Universität zu Köln.
Fossati, Dario. Hittite Epigraphic Findings in the Ancient Near East.
Maspero, Gaston. Cahiers de notes épigraphiques de Gaston Maspero ; copies d’inscriptions hiéroglyphiques, de papyrus égyptiens et coptes ; dessins, croquis, etc. de monuments d’Akmîm, Louqsor, etc. (1881-1884)1801.
Revue Raydân. Centre Français d’Archéologie et de Sciences Sociales de Sanaa.
Roman Inscriptions of Britain.
Studi Epigrafici E Linguistici Sul Vicino Oriente Antico.
Sylloge Epigraphica Barcinonensis: SEBarc.
Van Nijf, Onno. Saxa Loquuntur.

Monday, October 6, 2014

New in Planet Atlantides: Stone to Screen

I have just added the following blog to the Electra and Maia feed aggregators:

title = From Stone to Screen
url = https://fromstonetoscreen.wordpress.com/
description = A collaborative project to create a digital database of the archaeological teaching collections at the University of British Columbia.
feed url = https://fromstonetoscreen.wordpress.com/feed/

Eighteen Years of EpiDoc. Now what?

Transcript of my keynote address, delivered to the EAGLE 2014 International Conference on Monday, September 29, 2014, at the École normale supérieure in Paris:

Thank you.

Allow me to begin by thanking the organizers of this conference. The conference chairs: Silvia Orlandi, François Bérard, and John Scheid. Members of the Steering Committee: Vittore Casarosa, Pietro Liuzzo, and Raffaella Santucci. The local organizing committee: Elizabeth Le Bunetel and Philippe Martineau. Members of the EAGLE 2014 General Committee -- you are too numerous to mention, but no less appreciated. To the sponsors of EAGLE Europeana: the Competitiveness and Innovation Framework Programme of the European Commission. Europeana. Wikimedia Italia. To the presenters and poster-authors and their collaborators. To those who have made time out of busy schedules to prepare for, support, or attend this event. Colleagues and friends. Thank you for the invitation to speak and to be part of this important conference.

OK. Please get out your laptops and start up the Oxygen XML Editor. If you actually read the syllabus for the course, you'd have already downloaded the latest copy of the EpiDoc schema...

Just kidding.

I have perhaps misled you with my title. This talk will not just be about EpiDoc. Instead, I'd like to use EpiDoc as an entrance point into some thoughts I've had about what we are doing here. About where we are going. I'd like to take EpiDoc as an example -- and the EAGLE 2014 Conference as a metaphor -- for something much larger: the whole disparate, polyvalent, heterarchical thing that we sometimes call "Épigraphie et électronique". Digital epigraphy. Res epigraphica digitalis.

Before we try to unpack how we got here and where we're going, I'd like to ask for your help in trying to illuminate who we are. I'd like you to join me in a little exercise in public self-identification. Not only is this an excellent way to help fill the generous block of time that the conference organizers have given me for this talk, it's also much less awkward than trooping out to the Place de la Sorbonne and doing trust falls on the edge of the fountain. ... Right?

Seriously. This conference brings together a range of people and projects that really have had no specific venue to meet, and so we are in some important ways unknown to each other. It's my hypothesis that, if we learn a bit about each other up front, we prime the pump of collaboration and exchange for the rest of the conference. After all, why do we travel to conferences at all if it is not for the richness of interacting with each other, both during sessions and outside them. OK, and as Charlotte Roueché is ever vigilant to remind us, for the museums.

OK then, are you ready?

Independent of any formal position or any academic or professional credential, raise your hand if you would answer "yes" to this question: "Are you an epigraphist?"

What about "are you an information scientist?"

Historians?

Oh, yes, you can be more than one of these -- you'll recall I rolled out the word "heterarchy" in my introduction!

How about "Wikipedian?" "Cultural Heritage Professional?" "Programmer?" "Philologist?" "Computer Scientist?" "Archivist?" "Museologist?" "Linguist?" "Archaeologist?" "Librarian?" "Physicist?" "Engineer?" "Journalist?" "Clergy?"

Phooey! No clergy!

Let's get at another distinction. How many of you would identify yourselves as teachers?

What about students?

Researchers? Administrators? Technicians? Interested lay persons?

OK, now that we have your arms warmed up, let's move on to voices.

If you can read, speak, or understand a reasonable amount of the English language, please join me in saying "I understand English."

Ready? "I understand English."

OK. Now, if we can read, speak, or understand a reasonable amount of French, shall we say "Je comprends le français?"

"Je comprends le français."

What about Arabic?

Bulgarian? Catalan? Flemish? German? Of course there are many more represented here, but I think you get my point.

OK. Now let's build this rhetorical construct one step higher.

This one involves standing up if that's physically appropriate for you, so get yourselves ready! If you cannot stand, by all means choose some other, more appropriate form of participation.
Independent of any formal position or any academic credential, I want you to stand up if you consider yourself a "scholar".

Now, please stay standing -- or join those standing -- if you consider yourself a "student".

Yes, I did it. I reintroduced the word "student" from another category of our exercise. I am not only a champion of heterarchy, but also of recursive redefinition.

And now, please stay standing -- or join those standing -- if you consider yourself an "enthusiast."

If you're not standing, please stand if you can.

Now, pick out someone near you that you have not met. Shake their hand and introduce yourself. Ask them what they are so enthusiastic about that they were compelled to come to this conference!

Alright. Please resume your seats.

I think we're warmed up.

Let me encourage you to adopt a particular mindset while you are here at this conference. I hope that you will find it to be both amenable and familiar. It's the active recognition of the valuable traits we all share: intelligence, inquisitiveness, inventiveness, incisiveness, interdependence. Skill. Stamina. Uniqueness. Respect for the past. Congeniality.

I am here, in part, because I have a deep, inescapable interest in the study of ancient documents and in the application of computational methods and new media to their resurrection, preservation, and contemplation, and to their reintegration into the active cultural memory of the human people.
I have looked over the programme for this conference, and I have the distinct impression that your reasons for being here are somewhat similar to mine. I am delighted to have this opportunity to visit with old friends and fellow laborers. And to make the acquaintance of so many new ones. I expect to be dazzled by the posters and presentations to come. Are you as excited as I am?

My title did promise some EpiDoc.

How many of you know EpiDoc?

How many of you know what EpiDoc is?

How many of you have heard of EpiDoc?

The word "EpiDoc" is a portmanteau, composed of the abbreviated word "epigraphy" and the abbreviated word "document" or "documentation" (I can't remember which). It has become a misnomer, as EpiDoc is used for much more than epigraphic documents and documentation. It has found a home in papyrology and in the study of texts transmitted to us from antiquity via the literary and book-copying cultures of the intervening ages. It has at least informed, if not been directly used in, other allied subfields like numismatics and sigillography. It's quite possible I'll learn this week of even broader usages.

EpiDoc is a digital format and method for the encoding of both transcribed and descriptive information about ancient texts and the objects that supported and transmitted them. Formally, it is a wholly conformant customization of the Text Encoding Initiative's standard for the representation of texts in digital form. It is serialized in XML -- the Extensible Markup Language -- a specification developed and maintained by the World-Wide Web Consortium.

EpiDoc is more than format and method. It is a community of practice. The term embraces all the people who learn, use, critique, and talk about EpiDoc. It also takes in the Guidelines, tools, and other helps that have been created and curated by those people. All of them are volunteers, scraping together the time to work on EpiDoc out of their personal time, their academic time, and out of the occasional grant. There has never been formal funding devoted to the development or maintenance of the EpiDoc guidelines or software. If you are a participant in the EpiDoc community, you are a hero.

EpiDoc was born in the late 1990s in a weird little room in the northwest corner of the third floor of Murphey Hall on the campus of the University of North Carolina at Chapel Hill. The room is no longer there. It was consumed in a much-needed and long-promised renovation in 2003 or so. It was the old Classics Department computer lab: a narrow space with a sturdy, home-made, built-in counter along two walls and a derelict bookshelf. It was part of a suite of three rooms, the most spacious of which was normally granted as an office to that year's graduate fellow.

The room had been appropriated by Classics graduate students Noel Fiser and Hugh Cayless, together with classical archaeology graduate student Kathryn McDonnell, and myself (an interloper from the History Department). The Classics department -- motivated and led by these graduate students with I-forget-which-faculty-member serving as figurehead -- had secured internal university funding to digitize the department's collection of 35 millimeter slides and build a website for searching and displaying the resulting images. They bought a server with part of the grant. It soon earned the name Alecto after one of the Furies in Greek mythology. I've searched in vain for a picture of the lab, which at some point we sponge-painted in bright colors evocative of the frescoes from Minoan Santorini. The world-wide web was less than a decade old.

I was unconscious then of the history of computing and the classics at Chapel Hill. To this day, I don't know if that suite of rooms had anything to do with David Packard and his time at Chapel Hill. At the Epigraphic Congress in Oxford, John Bodel pointed to Packard's Livy concordance as one of the seminal moments in the history of computing and the classics, and thus the history of digital epigraphy. I'd like to think that we intersected that heritage not just in method, but in geography.

I had entered the graduate program in ancient history in the fall of 1995. I had what I would later come to understand to have been a spectacular slate of courses for my first term: Richard Talbert on the Roman Republic, Jerzy Linderski on Roman Law, and George Houston on Latin Epigraphy.
Epigraphy was new to me. I had seen and even tried my hand at reading the odd Latin or Greek inscription, but I had no knowledge of the history or methods of the discipline, and very little skill. As George taught it, the Latin Epigraphy course was focused on the research use of the published apparatus of Latin epigraphy. The CIL. The journals. The regional and local corpora. What you could do with them.

If I remember correctly, the Epigraphic Database Heidelberg was not yet online, nor were the Packard Greek inscriptions (though you could search them on CDROM). Yes, the same Packard. Incidentally, I think we'll hear something very exciting about the Packard Greek Inscriptions in tomorrow's Linked Ancient World Data panel.

Anyway, at some point I came across the early version of what is now called the Epigraphische Datenbank Clauss - Slaby, which was online. Back then it was a simple search engine for digital transcripts of the texts in L'Année Épigraphique from 1888 through 1993. Crucially, one could also download all the content in plain text files. If I understand it correctly, these texts were also destined for publication via the Heidelberg database (and eventually Rome too) after verification by autopsy or inspection of photographs or squeezes.

At some point, I got interested in abbreviations. My paper for George's class was focused on "the epigraphy of water" in Roman North Africa. I kept running across abbreviations in the inscriptions that didn't appear in any of the otherwise helpful lists one finds in Cagnat or one of the other handbooks. In retrospect, the reasons are obvious: the handbook author tailors the list of abbreviations to the texts and types of texts featured in the handbook itself. Because those texts are selected for importance and range, their statistical distributions of textual types and language, and of features like abbreviation, are not the same as those for the entire corpus. So, what is a former programmer to do? Why not download the texts from Clauss' site and write a program to hunt for parentheses? The Leiden Conventions make parentheses a strong indicator of abbreviations that have been expanded by an editor, so the logic for the program seemed relatively straightforward.

Mercifully, the hacktastical code that I wrote to do this task has, I think, perished from the face of the earth. The results, which I serialized into HTML form, may still be consulted on the website of the American Society of Greek and Latin Epigraphy.

As useful as the results were, I was dissatisfied with the experience. The programming language I had used -- called "C" -- was not a very good fit for the kind of text processing involved. Moreover, as good as the Leiden Conventions are, parentheses are used for things other than abbreviations. So, there was manual post-processing to be done. And then there were the edge cases, like abbreviations that stand alone in one document, but are incorporated into longer abbreviations in others. And then there were expanded use cases: searching for text in one inscription that was abbreviated in another. Searching for abbreviations or other strings in text that was transcribed from the original, rather than in editorial supplement or restoration. And I wanted a format and software tools that were a better fit for textual data and this class of problems.

XML and the associated Extensible Stylesheet Language (XSL) -- both then fairly new -- seemed like a good alternative approach. So I found myself confronted with a choice: should I take XML and invent my own schema for epigraphic texts, or should I adopt and adapt something someone else had already created? This consideration -- to make or to take -- is still of critical importance not only for XML, but for any format specification or standards definition process. It's important too for most digital projects. What will you build and on what will you build it?

There are pros and cons. By adopting an existing standard or tool, you can realize a number of benefits. You don't reinvent the wheel. You build on the strengths and the lessons of others. You can discuss problems and approaches with others who are using the same method. You probably make it easier to share and exchange your tools and any data you create. It's possible that many of the logic problems that aren't obvious to you at the beginning have already been encountered by the pioneers.
But standards and specifications can also be walled gardens in which decisions and expert knowledge are hoarded by the founders or another elite group. They can undermine openness and innovation. They can present a significant learning curve. You can use a complex standard and find that you've built a submarine to cross the Seine. Something simpler might have worked better.

Back then, there was a strong narrative around warning people off the cavalier creation of new XML schemas. The injunction was articulated in a harsh metaphor: "every time someone creates a new schema, a kitten dies." Behind this ugly metaphor was the recognition of another potential pitfall: building an empty cathedral. Your data format -- your personal or parochial specification -- might embody everything you imagined or needed, but be largely useless to, or unused by, anyone else.
So, being a cat lover, and being lazy (all the best programmers are lazy), I went looking for an existing schema. I found it in the Text Encoding Initiative. Whether the TEI (and EpiDoc) fit your particular use case is something only you can decide. For me, at that time and since, it was a good fit. I was particularly attracted to a core concept of the TEI: one should encode the intent behind the formatting and structure in a document -- the semantics of the authorial and editorial tasks -- rather than just the specifics of the formatting. So, where the Leiden Conventions would have us use parentheses to mark the editorial expansion of an abbreviation, the TEI gives us XML elements that mean "abbreviation" and "expansion." Where a modern Latin epigraphic edition would use a subscript dot to indicate that the identity of a character is ambiguous without reference to context, the TEI gives us the "unclear" element.

This encoding approach pays off. I'll give just one example. For a few years now, I've been helping Arlo Griffiths (who directs the Jakarta research center of the École française d'Extrême-Orient) to develop a corpus of the surviving inscriptions of the Campa Kingdoms. This is a small corpus, perhaps 400 extant inscriptions, from coastal Vietnam, that includes texts in both Sanskrit and the incompletely understood Old Cam language. The script involved has not yet made its way into the Unicode specification. The standard transliteration scheme for this script, as well as some of the other editorial conventions used in the publication of Cam inscriptions, overlaps and conflicts with the Leiden conventions. But with TEI/EpiDoc there is no confusion or ambiguity. The XML says what the editor means to say, and the conventions of transcription are preserved unchanged, perhaps someday to be converted programmatically to Unicode when Unicode is ready.

EpiDoc transitioned from a personal project to a public one when another potential use case came along. For some time, a committee commissioned by the Association Internationale d'Épigraphie Grecque et Latine had been working under the direction of Silvio Panciera, then the chair of Latin epigraphy at La Sapienza in Rome. Their goal was to establish a comprehensive database of Greek and Latin inscriptions, primarily for the purpose of searching the texts and associated descriptive information or metadata. It was Charles Crowther at Oxford's new Centre for the Study of Ancient Documents who put me in contact with the committee. And it was Charles who championed the eventual recommendation of the committee that the system they envisioned must be able to import and export structured documents governed by a standard schema. He was thinking of EpiDoc.

Many years have passed and many things have changed, and I'm forced to leave out the names of so many people whose hard work and acumen have brought about those changes. Here in Paris today Panciera's vision stands on the cusp of realization. It has also been transcended, for we are not here to talk about a standalone textual database or a federation of such, but about the incorporation of Greek and Latin epigraphy -- in all its historiographical variety and multiplicity of reception -- into the digital cultural heritage system of Europe (Europeana) and into the independent digital repository of a global people: Wikipedia and Wikidata. That EpiDoc can play a role in this grand project just blows me away.

And it's not just about EAGLE, Europeana, Wikipedia, and EpiDoc. It's about a myriad other databases, websites, images, techniques, projects, technologies, and tools. It's about you and the work that you do.

Even as we congratulate ourselves on our achievements and the importance of our mission, I hope you'll let me encourage you to keep thinking forward. We are doing an increasingly good job of bringing computational approaches into many aspects of the scholarly communication process. But plenty remains to be done. We are starting to make the transition from using computer hardware and software to make conventional books and digital imitations thereof; "born digital" is starting to mean something more than narrative forms in PDF and HTML, designed to be read directly by each single human user and, through them, digested into whatever database, notebook, or other research support system that person uses. We are now publishing data that is increasingly designed for harvesting and analyzing by automated agents and that is increasingly less encumbered by outdated and obstructive intellectual property regimes. Over time, our colleagues will begin to spend less time seeking and ingesting data, and more time analyzing, interpreting, and communicating results. We are also lowering the barriers to appreciation and participation in global heritage by a growing and more connected and more vulnerable global people.

Will we succeed in this experiment? Will we succeed in helping to build a mature and responsible global culture in which heritage is treasured, difference is honored, and a deep common cause embraced and protected? Will we say three years from now that building that database or encoding those texts in EpiDoc was the right choice? In a century, will our work be accessible and relevant to our successors and descendants? In 5? In 10?

I do not know. But I am thrilled, honored, and immensely encouraged to see you here, walking this ancient road and blazing this ambitious and hopeful new trail. This is our opportunity to help reunite the world's people and an important piece of their heritage. We are a force against the recasting of history into political rhetoric. We stand against the convenient ignorance of our past failures and their causes. We are the antidote to the destruction of ancient statues of the Buddha, to the burning of undocumented manuscripts, to papyri for sale on eBay, to fields of holes in satellite images where once there was an unexcavated ancient site.

Let's do this thing.


Friday, September 26, 2014

New in Electra and Maia: I.Sicily

I have just added the following blog to the Maia and Electra Atlantides feed aggregators:

title = I.Sicily
url = http://isicily.wordpress.com/
creators = Jonathan Prag
license = None
description = Building a digital corpus of Sicilian inscriptions
keywords = None
feed = http://isicily.wordpress.com/feed/


Thursday, July 17, 2014

New in Electra: RIDE

I have just added the following digital resource to the Electra Atlantis feed aggregator:

title = RIDE: A review journal for digital editions and resources
url = http://ride.i-d-e.de/
creators = Alexander Czmiel, et al. (eds.)
license =  http://creativecommons.org/licenses/by/4.0/
description = RIDE is a review journal dedicated to digital editions and resources. RIDE aims to direct attention to digital editions and to provide a forum in which expert peers criticise and discuss the efforts of digital editors in order to improve current practices and advance future developments. It will do so by asking its reviewers to pay attention not only to the traditional virtues and vices of any edition, but also to the progressing methodology and its technical implications.
feed = http://ride.i-d-e.de/feed/

Wednesday, July 16, 2014

New in Electra: EpiDoc Workshop

I have just added the following blog to the Electra Atlantis feed aggregator:

title = EpiDoc workshop
url = http://epidocworkshop.blogspot.co.uk/
creators = Simona Stoyanova, et al.
description = Share markup examples; give and receive feedback
keywords = EpiDoc, epigraphy, inscriptions, XML, TEI
feed = http://epidocworkshop.blogspot.com/feeds/posts/default?alt=rss

Monday, July 14, 2014

Hacking on Apache Log Files with Python

There are plenty of tools out there for doing web request analysis. I wanted to pull some information out of some Apache log files without all that overhead. Here's what I did:

I got Rory McCann's apache-log-parser (after some googling; it's on GitHub and on PyPI). I set up a Python virtual environment using Doug Hellmann's virtualenvwrapper, activated it, and then used:

 pip install apache-log-parser  

Since I'd never used apache-log-parser before, I had to get familiar with it. I discovered that, to use it, I had to know the format string that Apache was using to log information for my site. This was a two-step process, figured out by reading the Log Files section of the Apache documentation and poking about with grep.

First, I searched in the Apache configuration files for the CustomLog directive that's associated with the virtual host I wanted to analyze. This gave me a 'nickname' for the log configuration. More spelunking in Apache config files -- this time in the main configuration file -- turned up the definition of that nickname (Apache uses the LogFormat directive for this purpose):

 $ cd /etc/apache2/  
 $ grep CustomLog sites-enabled/foo.nowhere.org  
  CustomLog /var/log/apache2/foo.nowhere.org-access.log combined  
 $ grep combined apache2.conf   
 LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined  

It's that LogFormat string that needs to be given to Rory's code so it knows how to parse your log files.
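To make the meaning of that format string concrete, here is a minimal sketch that uses only the standard library's re module to pull the same fields out of one line in the "combined" format. This is not how apache-log-parser works internally; the regular expression, the group names (host, logname, and so on), the parse_combined helper, and the sample line are all assumptions made up for illustration.

```python
import re

# Hand-written regex counterpart of Apache's "combined" LogFormat:
#   %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
# The group names are illustrative, not apache-log-parser's keys.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) '          # %h  remote host
    r'(?P<logname>\S+) '       # %l  remote logname ("-" if absent)
    r'(?P<user>\S+) '          # %u  remote user ("-" if absent)
    r'\[(?P<time>[^\]]+)\] '   # %t  time the request was received
    r'"(?P<request>[^"]*)" '   # %r  first line of the request
    r'(?P<status>\d{3}) '      # %>s final status code
    r'(?P<bytes>\S+) '         # %b  response size ("-" if no body)
    r'"(?P<referer>[^"]*)" '   # %{Referer}i
    r'"(?P<agent>[^"]*)"'      # %{User-Agent}i
)


def parse_combined(line):
    """Return a dict of fields for one combined-format line, or None."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None


# A made-up log line, for illustration only.
sample = ('192.0.2.10 - - [31/May/2014:23:59:17 +0000] '
          '"GET /index.html HTTP/1.1" 200 5120 '
          '"http://example.org/" "ExampleBot/1.0"')
fields = parse_combined(sample)
```

A regex like this breaks on quoted fields that themselves contain escaped double quotes, which is one more reason to let a dedicated parser do the work on real data.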

After some experimenting in the Python interpreter to get a feel for the code and its capabilities, I wrote a few lines of my own to wrap the setup and file reading operations, and saved them as a module named loggulper:

 #!/usr/bin/env python  
 # -*- coding: utf-8 -*-  
   
 import apache_log_parser  
 import glob  
 import logging  
   
 # supported log file formats  
 APACHE_COMBINED="%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""  
 APACHE_COMMON="%h %l %u %t \"%r\" %>s %b"  
   
 def gulp(log_file_path, pattern=APACHE_COMBINED):
   """ parse every log file matching the glob into one list of dicts """
   log_data=[]
   # build a line parser from the Apache LogFormat string
   line_parser=apache_log_parser.make_parser(pattern)
   for file_name in glob.glob(log_file_path):
     logging.info("file_name: %s" % file_name)
     # the with statement closes the file even if reading fails
     with open(file_name, 'r') as log_file:
       lines = log_file.readlines()
     logging.info(" read %s lines" % len(lines))
     for line in lines:
       # each parsed line becomes a dictionary of request fields
       line_data=line_parser(line)
       log_data.append(line_data)
   logging.info("total number of events parsed: %s" % len(log_data))
   return log_data

For this particular server, I had multiple log files, but I wanted to have all the requests from all of them (parsed into dictionaries by Rory's code) in a single list for subsequent analysis. So, back to the Python interpreter:

 >>> import logging  
 >>> logging.basicConfig(level=logging.INFO)  
 >>> import loggulper  
 >>> d = loggulper.gulp("/path/to/my/log/files/foo.nowhere.org-access.*")  

I'll spare you the logging messages. This took several minutes. I ended up with about 1.5 million requests in the list. Real life intervened. How to save this data for later without having to run through the whole import process again? Pickle it.

 >>> import cPickle as pickle  
 >>> out_name = "logdata.pickle"  
 >>> outf = open(out_name, 'wb')  # binary mode: HIGHEST_PROTOCOL writes binary data
 >>> pickler = pickle.Pickler(outf, pickle.HIGHEST_PROTOCOL)  
 >>> pickler.dump(d)  
 <cPickle.Pickler object at 0x1044044e8>  
 >>> outf.close()  

The whole list was saved to a 933.3 MB file in just a few seconds (full disclosure: I have a solid-state drive). It was nearly as quick to read back in a couple of days later (new interpreter session and all):

 >>> import cPickle as pickle  
 >>> in_name="logdata.pickle"  
 >>> inf=open(in_name, 'rb')  # binary mode, to match how the pickle was written
 >>> unpickler=pickle.Unpickler(inf)  
 >>> d=unpickler.load()  
 >>> len(d)  
 1522015  
 >>> d[0].keys()  
 ['status', 'request_header_referer', 'remote_user', 'request_header_user_agent__browser__family', 'request_header_user_agent__is_mobile', 'request_header_user_agent__browser__version_string', 'request_header_user_agent', 'request_http_ver', 'request_header_user_agent__os__version_string', 'remote_logname', 'time_recieved_isoformat', 'time_recieved', 'request_first_line', 'request_header_user_agent__os__family', 'request_method', 'request_url', 'remote_host', 'time_recieved_datetimeobj', 'response_bytes_clf']  

It's important to notice at this point that the word "received" is misspelled "recieved" in keys in the dictionaries returned by apache-log-parser. If you don't notice this early on, it will cause some amount of frustration.
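One way to sidestep the misspelling is to rename the affected keys once, right after parsing. The sketch below is a hypothetical helper (fix_spelling is my own name, not part of apache-log-parser), and the sample dictionary holds made-up values under a few of the keys listed above. The rest of this post keeps the original, misspelled keys.

```python
def fix_spelling(record):
    """Return a copy of record with 'recieved' corrected in key names."""
    return dict((key.replace('recieved', 'received'), value)
                for key, value in record.items())


# Made-up values, using a few of the keys shown above.
sample = {
    'status': '200',
    'time_recieved': '[31/May/2014:23:59:17 +0000]',
    'time_recieved_isoformat': '2014-05-31T23:59:17',
}
fixed = fix_spelling(sample)
```

Applied to the whole list, that would be something like d = [fix_spelling(req) for req in d], at the cost of a second copy of every dictionary while the comprehension runs.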

It turned out that my log data included events past the end of the reporting period I'd been given (ending 31 May 2014), so I needed to filter out just those requests that fell within the reporting period. Python list comprehensions to the rescue:

 >>> dates=[req['time_recieved_datetimeobj'] for req in d]  
 >>> max(dates)  
 datetime.datetime(2014, 7, 13, 11, 37, 31)  
 >>> min(dates)  
 datetime.datetime(2013, 7, 21, 3, 41, 26)  
 >>> from datetime import datetime  
 >>> d_relevant=[req for req in d if req['time_recieved_datetimeobj'] < datetime(2014, 6, 1)]  
 >>> dates=[req['time_recieved_datetimeobj'] for req in d_relevant]  
 >>> max(dates)  
 datetime.datetime(2014, 5, 31, 23, 59, 17)  
 >>> min(dates)  
 datetime.datetime(2013, 7, 21, 3, 41, 26)  
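If the same filtering has to be repeated for other reporting periods, the comprehension can be wrapped in a small function. This is only a sketch: the function name within is hypothetical, the sample records carry just the timestamp key, and the start date of 1 June 2013 is an assumption for illustration (I was only given the end of the period).

```python
from datetime import datetime


def within(requests, start, end, key='time_recieved_datetimeobj'):
    """Keep requests whose timestamp is in the half-open range [start, end)."""
    return [req for req in requests if start <= req[key] < end]


# Three stand-in records, using timestamps seen in the session above.
sample = [
    {'time_recieved_datetimeobj': datetime(2013, 7, 21, 3, 41, 26)},
    {'time_recieved_datetimeobj': datetime(2014, 5, 31, 23, 59, 17)},
    {'time_recieved_datetimeobj': datetime(2014, 7, 13, 11, 37, 31)},
]
kept = within(sample, datetime(2013, 6, 1), datetime(2014, 6, 1))
```

Using an exclusive end bound of midnight on 1 June means a request logged at any time on 31 May is still kept, with no need to spell out 23:59:59.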

Now to separate requests made by self-identified bots and spiders from the rest of the traffic:

 >>> robots=[req for req in d_relevant if 'bot' in req['request_header_user_agent'].lower() or 'spider' in req['request_header_user_agent'].lower()]  
 >>> len(robots)  
 848450  
 >>> humans=[req for req in d_relevant if 'bot' not in req['request_header_user_agent'].lower() and 'spider' not in req['request_header_user_agent'].lower()]  
 >>> len(humans)  
 486249  
 >>> len(robots) + len(humans) == len(d_relevant)  
 True  
 >>> unique_bots=[]  
 >>> for bot in robots:  
 ...   if bot['request_header_user_agent'] not in unique_bots:  
 ...     unique_bots.append(bot['request_header_user_agent'])  
 ...   
 >>> len(unique_bots)  
 229
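The loop above tests membership in a list, which means scanning the list of user agents found so far for every one of the robot requests. A sketch of an alternative, using collections.Counter from the standard library (Python 2.7 or later), collects the unique user agents and how often each one appears in a single pass. The agent_counts helper and the sample user-agent strings are made up for illustration.

```python
from collections import Counter


def agent_counts(requests, key='request_header_user_agent'):
    """Count how many requests each distinct user-agent string made."""
    return Counter(req[key] for req in requests)


# Stand-in records with invented user-agent strings.
sample = [
    {'request_header_user_agent': 'ExampleBot/1.0'},
    {'request_header_user_agent': 'ExampleBot/1.0'},
    {'request_header_user_agent': 'ExampleSpider/2.0'},
]
counts = agent_counts(sample)
unique = len(counts)          # number of distinct user agents
top = counts.most_common(1)   # the busiest agent and its request count
```

With the real data, len(agent_counts(robots)) should match the length of unique_bots, and most_common(10) would show which bots account for most of the traffic.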

Aside: yes, I know there could well still be automated agents in the "humans" list; I've only filtered out those that are not operated by the sneaky or the uninformed. Let's not worry about that issue for now.

Stay tuned for the next installment, wherein I hope we actually learn something about how our server is being used.

Tuesday, June 3, 2014

New in Maia: Mār Šiprim and Laboratoire Orient et Méditerranée

I have added feeds for the following web resources to the Maia Atlantis feed aggregator:

title = Mār Šiprim
url = http://mar-shiprim.org/
creators = International Association for Assyriology
license = None
description = Official Newsletter for the International Association for Assyriology (IAA). Through this Newsletter, the IAA aims to provide an online platform for Assyriologists and Near-Eastern enthusiasts where to interact with each other on both an intellectual and an informal level, thus establishing an international linkage among colleagues.
keywords = None
feed = http://mar-shiprim.org/feed/

title = Laboratoire Orient et Méditerranée
url = http://www.orient-mediterranee.com/?lang=fr
creators = None
license = None
description = Orient & Méditerranée est une Unité Mixte de Recherche en Sciences historiques, philologiques et religieuses, associant le Centre National de la Recherche Scientifique, CNRS, l’Université Paris-Sorbonne, Paris IV, l’Université Panthéon-Sorbonne, Paris 1 et l’École Pratique des Hautes Études
keywords = académie des inscriptions et belles-lettres, actualités, annuaire, antique, antiques, antiquité classique et tardive, arabie, araméen, archeology, archives, archéologiques, archéologues, bible, calendrier, centre national de la recherche scientifique, chantiers de fouille, cnrs, collections, colloques, collège de france, communication, contact, coopérations, coran, cours spécialisés, crédits, disciplines, distinctions, documentaires, débuts du christianisme, electroniques, formation, historiens des religions, informations administratives, initiation, islam médiéval, langue syriaque, les chercheurs du lesa, lesa, liens utiles, linguistes, l’université panthéon-sorbonne, l’université paris-sorbonne, mediterranee, membres, missions de terrain, monde byzantin, monde méditerranéen, mondes cananéen, médecine grecque, méditerranée, médiévale, organigramme, orient, orient & méditerranée, orient chrétien, ougarit, ouvrages récents, paris 1, paris iv, philologiques, philologues, phénicien, plan du site, proche-orient, programmes, présentation, publications, publications des membres de l’umr, punique, qumrân, rassemble cinq laboratoires, recherches, religions monothéistes, responsabilité d’entreprises documentaires, ressources documentaires, revues, sciences historiques, sciences humaines, sciences religieuses, soutenances, spip 2, spécialistes du monde, séminaires, sémitique, sémitique occidental, template, textes fondateurs, thèses, thèses en cours, umr 8167, umr8167, unité mixte de recherche, vallée de l’euphrate syrien, valorisation de la recherche, vient de paraître, école pratique des hautes études, écoles doctorales, époques, éthiopie, études sémitiques
feed = http://www.orient-mediterranee.com/spip.php?page=backend

Friday, May 23, 2014

New in Planet Maia: Building Tabernae and Archaeology of Portus (MOOC)

I have just added the following resources to the Maia Atlantis feed aggregator:

title = Building Tabernae
url = http://buildingtabernae.org/
creators = Miko Flohr
license = None
description = About two years ago, I received a quarter million euro grant from the Dutch government for a  four year project on urban commercial investment in Roman Italy, and a project blog was already in the proposal. The project – Building Tabernae – started April, 2013, and is now about to enter a new phase, in which some results will start emerging, and new data will be gathered. The blog, I hope, is a way to force the scholar in charge of this project – me – to record and communicate the project’s successes and failures, and everything else that one encounters when investigating commercial investment in Roman Italy, and to be available for discussion with specialists and non-specialists alike.
feed = http://buildingtabernae.org/feed/

title = Archaeology of Portus: Exploring the Lost Harbour of Ancient Rome
url = http://moocs.southampton.ac.uk/portus/
creators = University of Southampton and FutureLearn
license = None
description = The University of Southampton and FutureLearn are running a MOOC (Massive Open Online Course), focusing on the archaeological work in progress at the Roman site of Portus. It is one of a number of Southampton-based courses that will be made available for you to study online, for free, wherever they are based in the world, in partnership with FutureLearn.
feed = http://moocs.southampton.ac.uk/portus/feed/

Tuesday, May 13, 2014

Additions and corrections in Planet Atlantides

I've just added the following blog to the Maia and Electra feed aggregators:

title = Standards for Networking Ancient Prosopographies
url = http://snapdrgn.net/
creators = Gabriel Bodard, et al.
description = Networking Ancient Prosopographies: Data and Relations in Greco-Roman Names (hereafter SNAP:DRGN or SNAP) project aims to address the problem of linking together large collections of material (datasets) containing information about persons, names and person-like entities managed in heterogeneous systems and formats.
feed = http://snapdrgn.net/feed

I've also updated the entry for MutEc as follows (corrected feed url):

title = Mutualisation d'outils numériques pour les éditions critiques et les corpus (MutEC)
url = http://www.mutec-shs.fr
creators = Marjorie Burghart, et al.
description = MutEC est un dispositif de partage, d'accumulation et de diffusion des technologies et des méthodologies qui émergent dans le champ des humanités numériques.
feed = http://www.mutec-shs.fr/?q=rss.xml

MITH and tDAR continue to respond to our bot with 403 Forbidden, so their content will not appear in the aggregators.

Friday, April 18, 2014

New in Maia: Oral Poetry

The following blog has been added to the Maia Atlantis feed aggregator:

title = Oral Poetry
url = http://oralpoetry.blogspot.com/
feed = http://oralpoetry.blogspot.com/feeds/posts/default?alt=rss

Planet Atlantides Updates: Antiquitas, Archeomatica, Source, tDAR and MITH

I have added subscriptions for the following resources to the indicated aggregators at Planet Atlantides:

To Electra:

title = Source: Journalism Code, Context & Community
site = https://source.opennews.org/en-US/
license = CC Attribution 3.0 http://creativecommons.org/licenses/by/3.0/
feed = https://source.opennews.org/en-US/rss/

To Maia:

title = Antiquitas
site = http://antiquitas.hypotheses.org/
creators = Hervé Huntzinger
description = Ce carnet a pour objet de fédérer la communauté pédagogique et scientifique investie dans le Parcours « Sciences de l'Antiquité » de l’Université de Lorraine. Il fournit aux futurs étudiants une information claire sur l’offre de formation. Il ouvre aux étudiants de master et de doctorat un espace pour mettre en valeur leurs travaux et s’initier à la recherche. Il offre, enfin, aux enseignants-chercheurs une plateforme permettant d’informer les chercheurs, les étudiants et le public averti de l’actualité de la recherche. La formation est adossée à l’équipe d’accueil Hiscant-MA (EA1132), spécialisés en Sciences de l’Antiquité.
feed = http://antiquitas.hypotheses.org/feed

I have also updated the feed URL in both Electra and Maia for the following resource:

title = Archeomatica: Tecnologie per i Beni Culturali
site = http://www.archeomatica.it/
description = Tutte le notizie sulle tecnologie applicate ai beni culturali per il restauro e la conservazione
feed = http://feeds.feedburner.com/Archeomatica

The following resources are presently responding to requests from the Planet Atlantides Feed Bot for access to their feeds with a 403 Forbidden HTTP status code. Consequently, updates from these resources will not be seen in the aggregators unless and until the curators of these resources make a server configuration change to permit us to syndicate the content.

title = Maryland Institute for Technology in the Humanities (MITH)
site = http://mith.umd.edu/
description = Maryland Institute for Technology in the Humanities (MITH) at the University of Maryland, College Park
feed = http://mith.umd.edu/feed/

title = The Digital Archaeological Record (tDAR)
site = http://www.tdar.org/
feed = http://www.tdar.org/feed/




Friday, April 11, 2014

Mining AWOL more carefully for ISSNs

I made a couple of bad assumptions in my previous attempt to mine ISSNs out of the content of the AWOL Blog:
  1. I assumed that the string "ISSN" would always appear in all caps.
  2. I assumed that the string "ISSN" would be followed immediately by a colon (:).
In fact, the following command indicates there are at least 673 posts containing instances of the string (ignoring capitalization) "issn" in the AWOL content:
ack -hilo issn  post-*.xml | wc -l
In an attempt to make sure we're capturing real ISSN strings, I refined the regular expression to try to capture a leading "ISSN" string, and then everything possibly following until and including a properly formatted ISSN. I've seen both ####-#### and ######## (where # is either a digit or the character "X") in the wild, so I accommodated both possibilities. Here's the command:
ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml > issn-raw.txt
You can see the raw list of the matched strings here. If we count the lines generated by that command instead of saving them to file, we can see that there are at least 1931 ISSNs in AWOL.
ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l
Then I wondered, are we getting just one ISSN per file or multiples? We know that some of the posts in the blog are about single resources, but there are also plenty of posts about collections and also posts that gather up all the references to every known instance of a particular genre (e.g., open-access journals or journals in JSTOR). So I modified the command to count how many files have these "well-formed" ISSN strings in them (the -l option to ack):
ack -hilo 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l
For a total of 638 affected files. Here's a list of the affected files, for future team reference.

One wonders about the discrepancy between 638 and 673, but at least I know I now have a regular expression that can capture most of the ISSNs and their values. I'll do some spot-checking later to see if I can figure out what's being missed and why.
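
For the spot-checking, the same pattern carries over to Python's `re` module. What follows is a minimal sketch, not the code used for the counts above. The file names and post contents are hypothetical, and I am assuming that searching line by line reproduces what ack does. It lists the posts that mention "issn" but never yield a well-formed match, which is where the gap between the two file counts would come from.

```python
import re

# The ack pattern: "issn", any run of non-digits, then ####-#### or ########.
ISSN_PATTERN = re.compile(r'issn[^\d]*[\dX]{4}-?[\dX]{4}', re.IGNORECASE)
ISSN_MENTION = re.compile(r'issn', re.IGNORECASE)


def issn_strings(text):
    """All well-formed matches, searched one line at a time."""
    matches = []
    for line in text.splitlines():
        matches.extend(ISSN_PATTERN.findall(line))
    return matches


def mentions_without_match(posts):
    """Names of posts that mention 'issn' but yield no well-formed match."""
    return sorted(
        name for name, text in posts.items()
        if ISSN_MENTION.search(text) and not issn_strings(text)
    )


# Hypothetical post contents keyed by file name.
posts = {
    'post-a.xml': 'ISSN: 1234-5678',
    'post-b.xml': 'issn (online) 2345-6789\nISSN print 0000000X',
    'post-c.xml': 'The ISSN is not yet assigned.',
    'post-d.xml': 'No identifiers here.',
}

print(sum(len(issn_strings(text)) for text in posts.values()))
print(mentions_without_match(posts))
```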

More importantly, it's now very clear that mining the ISSNs out of the blog posts on our way to Zotero is a worthwhile task. Not only will we be able to add them to the records, we may also be able to use them to look up existing catalog data from other databases with which to better populate the fields in the corresponding Zotero records.




New in Electra

I've just added the following blogs to the Electra Atlantis feed aggregator:

Mining AWOL for Identifiers

NB: There is now a follow-up post to this one, in which various bad assumptions made here are addressed: "Mining AWOL more carefully for ISSNs".

In collaboration with Pavan Artri, Dawn Gross, Chuck Jones, Ronak Parpani, and David Ratzan, I'm currently working on a project to port the content of Chuck's Ancient World Online (AWOL) blog to a Zotero library. Funded in part by a grant from the Gladys Krieble Delmas Foundation, the idea is to make the information Chuck gathers available for more structured data needs, like citation generation, creation of library catalog records, and participation in linked data graphs. So far, we have code that successfully parses the Atom XML "backup" file we can get from Blogger and uses the Zotero API to create a Zotero record for each blog post and to populate its title (derived from the title of the post), url (the first link we find in the body of the post), and tags (pulled from the Blogger "labels").
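
To make the mapping concrete, here is a minimal sketch of pulling the three values out of one Atom entry with the Python standard library. It is not the project code. The sample entry, the function name, the label `scheme` URI, and the shape of the resulting dictionary are all assumptions made for illustration, and no request is made to the Zotero API.

```python
import re
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'
# Assumed scheme URI for Blogger "labels"; other category elements are skipped.
LABEL_SCHEME = 'http://www.blogger.com/atom/ns#'

# Hypothetical entry; the post body is escaped HTML inside <content>.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Open Access Journal: Example Journal</title>
  <category scheme="http://www.blogger.com/atom/ns#" term="Open Access"/>
  <category scheme="http://www.blogger.com/atom/ns#" term="Epigraphy"/>
  <content type="html">&lt;a href="http://example.org/journal"&gt;Example Journal&lt;/a&gt; ISSN: 1234-5678</content>
</entry>"""


def entry_to_record(entry_xml):
    """Return title, first link in the body, and labels for one Atom entry."""
    entry = ET.fromstring(entry_xml)
    title = entry.findtext(ATOM + 'title', default='').strip()
    body = entry.findtext(ATOM + 'content', default='')
    first_link = re.search(r'href="([^"]+)"', body)
    labels = [
        cat.get('term')
        for cat in entry.findall(ATOM + 'category')
        if cat.get('scheme') == LABEL_SCHEME
    ]
    return {
        'title': title,
        'url': first_link.group(1) if first_link else '',
        'tags': [{'tag': label} for label in labels],
    }


print(entry_to_record(SAMPLE_ENTRY))
```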

We know that some of the post bodies also contain standard numbers (like ISSNs and ISBNs), but it has been unclear how many of them there are and how regular the structure of text strings in which they appear. Would it be worthwhile to try to mine them out programmatically and insert them into the Zotero records as well? If so, what's our best strategy for capturing them ... i.e., what sort of parenthetical remarks, whitespace, and punctuation might intervene between them and the corresponding values? Time to do some data prospecting ...

We'd previously split the monolithic "backup" XML file into individual XML files, one per post (click at your own risk; there are a lot of files in that github listing and your browser performance in rendering the page and its JavaScript may vary). Rather than writing a script to parse all that stuff just to figure out what's going on, I decided to try my new favorite can-opener, ack (previously installed stresslessly on my Mac with another great tool, the Homebrew package manager).

Time for some fun with regular expressions! I worked on this part iteratively, trying to start out as liberally as possible, thereby letting in a lot of irrelevant stuff so as not to miss anything good. I assumed that we want to catch acronyms, so strings of two or more capital letters, preceded by a word boundary. I didn't want to just use a [A-Z] range, since AWOL indexes multilingual resources, so I had recourse to the Unicode Categories feature that's available in most modern regular expression engines, including recent versions of Perl (on which ack relies). So, I started off with:
\b\p{Lu}\p{Lu}+
After some iteration on the results, I ended up with something more complex, trying to capture anything that fell between the acronym itself and the first subsequent colon, which seemed to be the standard delimiter between the designation+explanation of the type of identifier and the identifying value itself. I figure we'll worry how to parse the value later, once we're sure which identifiers we want to capture. So, here's the regex I ultimately used:
\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]
The full ack command looked like this:
ack -oh "\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]" post-*.xml > ../awol-acronyms/raw.txt
where the -h option tells ack to "suppress the prefixing of filenames on output when multiple files are searched" and the -o option tells ack to "show only the part of each line matching" my regex pattern (quotes from the ack man page). You can browse the raw results here.
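
Python's built-in `re` module has no `\p{Lu}` syntax, so the sketch below approximates only the first, simpler pattern by checking Unicode categories with `unicodedata`. It is an illustration rather than a port of the ack command, and the function names and sample sentence are hypothetical.

```python
import re
import unicodedata


def leading_capitals(word):
    """Return the run of uppercase letters (category Lu) that starts the word."""
    run = []
    for ch in word:
        if unicodedata.category(ch) != 'Lu':
            break
        run.append(ch)
    return ''.join(run)


def find_acronyms(text):
    # Approximates "two or more capital letters preceded by a word boundary":
    # take each word and keep its leading run of capitals if it is long enough.
    found = []
    for match in re.finditer(r'\w+', text):
        run = leading_capitals(match.group())
        if len(run) >= 2:
            found.append(run)
    return found


# \u00c9 is a capital E with acute accent, so a plain [A-Z] range would miss it.
print(find_acronyms('ISSN online: 1234-5678; \u00c9PHE and DOIs, but not Paris'))
```

As with the ack pattern, a capitalized word such as "Paris" is skipped because only one capital starts it, while "DOIs" contributes its leading "DOI".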

So, how to get this text file into a more analyzable state? First, I thought I'd pull it into my text editor, Sublime, and use its text manipulation functions to filter for unique lines and then sort them. But then it occurred to me that I really wanted to know the frequency of identifier classes across the whole of the blog content, so I turned to OpenRefine.

I followed OR's standard process for importing a text file (being sure to set the right character encoding for the file on which I was working). Then, I used the column edit functionality and the string manipulation functions in the Open Refine Expression Language (abbreviated GREL because it used to be called "Google Refine Expression Language") to clean up the strings (regularizing whitespace, trimming leading and trailing whitespace, converting everything to uppercase, and getting rid of whitespace immediately preceding colons). That part could all have been done in a step outside OR with other tools, but I didn't think about it until I was already there.

Then came the part OR is actually good at, faceting the data (i.e., getting all the unique strings and counts of same). I then used the GREL facetCount() function to get those values into the table itself, followed this recipe to get rid of matching rows in the data, and exported a CSV file of the unique terms and their counts (github's default display for CSV makes our initial column very wide, so you may have to click on the "raw" link to see all the columns of data).
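
The cleanup and faceting steps could also be done outside OR. Here is a minimal sketch with Python's `collections.Counter`; the input lines are hypothetical, and the `normalize` function only mirrors the cleanup steps described above (regularize and trim whitespace, uppercase, drop whitespace before colons).

```python
import re
from collections import Counter


def normalize(label):
    """Apply the cleanup steps to one matched string."""
    label = re.sub(r'\s+', ' ', label).strip()  # regularize and trim whitespace
    label = label.upper()                       # convert everything to uppercase
    return re.sub(r'\s+:', ':', label)          # no whitespace before a colon


# Hypothetical lines standing in for the raw ack output.
raw_lines = ['ISSN:', 'issn :', '  ISSN   paper:', 'ISSN Paper:', 'ISBN:']

# Counting the normalized strings is the equivalent of a text facet.
counts = Counter(normalize(line) for line in raw_lines)
for label, count in counts.most_common():
    print(label, count)
```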

There are some things that need investigating, but what strikes me is that only ISSN appears to be worth capturing programmatically. ISSNs appear 44 times in 14 different variations:


ISSN: 17
ISSN paper: 9
ISSN electrònic: 4
ISSN electronic edition: 2
ISSN electrónico: 2
ISSN électronique: 2
ISSN impreso: 2
ISSN Online: 2
ISSN edición electrónica: 1
ISSN format papier: 1
ISSN Print: 1
ISSN print edition: 1
ONLINE ISSN: 1
PRINT ISSN: 1

Compare ISBNs:


ISBN of Second Part: 2
ISBN: 1
ISBN Compiled by: 1

DOIs make only one appearance, and there are no Library of Congress cataloging numbers.

Now to point my collaborators at this blog post and see if they agree with me...





Thursday, April 10, 2014

Batch XML validation at the command line

Updated: 8 August, 2017 to reflect changes in the installation pattern for jing.

Against a RelaxNG schema. I had help figuring this out from Hugh and Ryan at DC3:

$ find {searchpath} -name "*.xml" -print | parallel --tag jing {relaxngpath}
The find command hunts down all files ending with ".xml" in the directory tree under searchpath. The parallel command takes that list of files and fires off (in parallel) a jing validation run for each of them. The --tag option passed to parallel ensures we get the name of the file passed through with each error message. This turns out (in general terms as seen by me) to be much faster than running each jing call in sequence, e.g. with the -exec primary in find.
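
The same find-then-fan-out idea can be sketched in Python, for cases where GNU Parallel is not to hand. This is an illustration under assumptions, not a replacement for the pipeline above. The function names are hypothetical, and it assumes the validator takes the file as its last argument and signals errors with a non-zero exit status.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def validate_one(command, path):
    """Run command + [path]; return (path, exit status, combined output)."""
    result = subprocess.run(
        command + [str(path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return str(path), result.returncode, result.stdout


def validate_tree(search_path, command, workers=4):
    """Validate every *.xml file under search_path; return only the failures."""
    paths = sorted(Path(search_path).rglob('*.xml'))
    # Threads are enough here: each worker just waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: validate_one(command, p), paths))
    return [r for r in results if r[1] != 0]


# Hypothetical usage, mirroring the command line above:
# for path, status, output in validate_tree('./texts/xml', ['jing', 'tei-epidoc.rng']):
#     print(path, output, sep='\t')
```

Returning the path alongside the output plays the role of the --tag option: each message stays attached to the file that produced it.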

As I'm running on a Mac, I had to install GNU Parallel and the Jing RelaxNG Validator. That's what Homebrew is for:
$ brew install jing
$ brew install jing-trang
$ brew install parallel
NB: you may have to install an older version of Java before you can get the jing-trang formula to work in Homebrew (e.g., brew install java6).

What's the context, you ask? I have lots of reasons to want to be able to do this. The proximal cause was batch-validating all the EpiDoc XML files for the inscriptions that are included in the Corpus of Campā Inscriptions before regenerating the site for an update today. I wanted to see quickly if there were any encoding errors in the XML that might blow up the XSL transforms we use to generate the site. So, what I actually ran was:
$ curl -O http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng
$ find ./texts/xml -name '*.xml' -print | parallel --tag jing tei-epidoc.rng
 Thanks to everybody who built all these tools!


Thursday, February 27, 2014

Planet Atlantides grows up and gets its own user-agent string

So, sobered by recent spelunking and bad-bot-chasing in various server logs and convicted by sage advice that ought to be followed by everyone in the UniversalFeedParser documentation, I have customized the bot used on Planet Atlantides for fetching web feeds so it identifies itself unambiguously to the web servers from which it requests those feeds.

Here's the explanatory text I just posted to the Planet Atlantides home page. Please let me know if you have suggestions or critiques.

Feed reading, bots, and user agents

As implied above, Planet Atlantides uses Sam Ruby's "Venus" branch of the Planet "river of news" feed reader. That code is written in the Python language and uses an earlier version of the Universal Feed Parser library for fetching web feeds (RSS and Atom formats). Out of the box, its HTTP requests use the feed parser's default user agent string, so your server logs will only have recorded "UniversalFeedParser/4.2-pre-274-svn +http://feedparser.org/" when our copy of the software pulled your feed in the past.

Effective 27 February 2014, the Planet Atlantides production version of the code now identifies itself with the following user agent string: "PlanetAtlantidesFeedBot/0.2 +http://planet.atlantides.org/". Production code runs on a machine with the IP address 66.35.62.81, and never runs more than once per hour. Apart from a one-time set of test episodes on 27 February 2014 itself, log entries recording our user agent string and a different IP address represent spoofing by a potential bad actor other than me and my automagical bot. You should nuke them from orbit; it's the only way to be sure. Note that from time to time, I may run test code from other IP addresses, but I will in future use the user agent string beginning with "PlanetAtlantidesTestBot" for such runs. You can expect them to be infrequent and irregular.

Please email me if you have any questions about Planet Atlantides, its bot, or these user agent strings. In particular, if you put something like "PlanetAtlantidesBot is messing up my site" in your subject line, I'll look at it and respond as quickly as I can.
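
For anyone who wants to do the same for their own bot, here is a minimal sketch of attaching a custom user agent to a request using only the Python standard library. It is not the Venus or feed parser configuration that Planet Atlantides actually uses. The function name and the example.org URL are hypothetical, and the request is built but never sent.

```python
import urllib.request

USER_AGENT = 'PlanetAtlantidesFeedBot/0.2 +http://planet.atlantides.org/'


def build_feed_request(feed_url, user_agent=USER_AGENT):
    """Prepare (but do not send) a request that announces a custom user agent."""
    return urllib.request.Request(feed_url, headers={'User-Agent': user_agent})


request = build_feed_request('http://example.org/feed/')
# urllib stores header names capitalized as 'User-agent'.
print(request.get_header('User-agent'))
# Sending it would look like: urllib.request.urlopen(request, timeout=30)
```

Putting a bot name, a version, and a URL where its purpose is explained into the string gives server administrators what they need to recognize the bot in their logs and find out who runs it.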

Monday, February 10, 2014

Thursday, January 30, 2014

Pruned from Maia: Dead and Damaged Feeds

The following resources have been pruned from the Maia Atlantis feed aggregator because their feeds (and in some cases the whole resource) have disappeared with no alternative address or are consistently returning errors:
  • GIS for Archaeology and CRM (formerly at http://www.gisarch.com; domain now up for sale)
  • ABZU Recent Additions (feed returns 404)
  • epea pteroenta (feed and site perpetually return 500)
  • Internet Archaeology (feed content is invalid; site sports a notice saying a server upgrade is impending)
  • Portable Antiquities Scheme Blog (feed returns 404)
  • Art and Social Identities in Late Antiquity (University of Aarhus) (site and feed are gone)
  • ArcLand News (feed returns 404; site sports a notice saying a server upgrade is impending)
  • Jonathan Eaton (Imperium Sine Fine) (feed returns 404; blogger site says "feed has been removed")
 Please contact me if you have updated feed URLs for any of these resources.

I have also updated a number of feeds that had moved, including some that did not provide redirects and had to be sought out manually.

Thursday, January 16, 2014

New in Maia: "Greek in Italy" and "Spartokos a Lu"

The following blogs have been added to the Maia Atlantis feed aggregator: