horothesia: April 2014

Friday, April 18, 2014

New in Maia: Oral Poetry

The following blog has been added to the Maia Atlantis feed aggregator:

title = Oral Poetry
url = http://oralpoetry.blogspot.com/
feed = http://oralpoetry.blogspot.com/feeds/posts/default?alt=rss

Planet Atlantides Updates: Antiquitas, Archeomatica, Source, tDAR and MITH

I have added subscriptions for the following resources to the indicated aggregators at Planet Atlantides:

To Electra:

title = Source: Journalism Code, Context & Community
site = https://source.opennews.org/en-US/
license = CC Attribution 3.0 http://creativecommons.org/licenses/by/3.0/
feed = https://source.opennews.org/en-US/rss/

To Maia:

title = Antiquitas
site = http://antiquitas.hypotheses.org/
creators = Hervé Huntzinger
description = This blog aims to bring together the teaching and research community involved in the "Sciences de l'Antiquité" track at the Université de Lorraine. It gives prospective students clear information about the courses on offer. It gives master's and doctoral students a space to showcase their work and get started in research. Finally, it offers faculty researchers a platform for informing researchers, students, and the interested public about current research. The program is backed by the Hiscant-MA research unit (EA1132), which specializes in the study of antiquity.
feed = http://antiquitas.hypotheses.org/feed

I have also updated the feed URL in both Electra and Maia for the following resource:

title = Archeomatica: Tecnologie per i Beni Culturali
site = http://www.archeomatica.it/
description = All the news on technologies applied to cultural heritage for restoration and conservation
feed = http://feeds.feedburner.com/Archeomatica

The following resources are presently responding to requests from the Planet Atlantides Feed Bot for access to their feeds with a 403 Forbidden HTTP status code. Consequently, updates from these resources will not be seen in the aggregators unless and until the curators of these resources make a server configuration change to permit us to syndicate the content.

title = Maryland Institute for Technology in the Humanities (MITH)
site = http://mith.umd.edu/
description = Maryland Institute for Technology in the Humanities (MITH) at the University of Maryland, College Park
feed = http://mith.umd.edu/feed/

title = The Digital Archaeological Record (tDAR)
site = http://www.tdar.org/
feed = http://www.tdar.org/feed/

Friday, April 11, 2014

Mining AWOL more carefully for ISSNs

I made a couple of bad assumptions in my previous attempt to mine ISSNs out of the content of the AWOL Blog:

I assumed that the string "ISSN" would always appear in all caps.
I assumed that the string "ISSN" would be followed immediately by a colon (:).

In fact, the following command indicates there are at least 673 posts containing instances of the string (ignoring capitalization) "issn" in the AWOL content:

ack -hilo issn post-*.xml | wc -l

In an attempt to make sure we're capturing real ISSN strings, I refined the regular expression to capture a leading "ISSN" string, and then everything that follows up to and including a properly formatted ISSN number. I've seen both ####-#### and ######## in the wild (where # is either a digit or the character "X"), so I accommodated both possibilities. Here's the command:

ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml > issn-raw.txt

You can see the raw list of the matched strings here. If we count the lines generated by that command instead of saving them to file, we can see that there are at least 1931 ISSNs in AWOL.

ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l
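The 1931 matches may include the same ISSN more than once, since a journal can be mentioned in several posts. What follows is a minimal sketch of one way to reduce the raw matches to distinct, normalized values, assuming a POSIX shell with sed -E, tr and sort available. The function name normalize_issns is hypothetical and is not part of the workflow described above.

```shell
# Hypothetical helper (an assumption, not from the original post): read raw
# matches on stdin, one per line (e.g. the contents of issn-raw.txt), and
# print each distinct ISSN once, normalized to the ####-#### form.
# Step 1 (sed): keep only the last eight ISSN characters and insert the hyphen.
# Step 2 (tr): uppercase a lowercase "x", which a case-insensitive search admits.
# Step 3 (sort -u): drop duplicates.
normalize_issns() {
  sed -E -n 's/.*([0-9Xx]{4})-?([0-9Xx]{4})$/\1-\2/p' |
    tr 'x' 'X' |
    sort -u
}
```

For example, `normalize_issns < issn-raw.txt | wc -l` would count distinct values rather than matches.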

Then I wondered, are we getting just one ISSN per file or multiples? We know that some of the posts in the blog are about single resources, but there are also plenty of posts about collections and also posts that gather up all the references to every known instance of a particular genre (e.g., open-access journals or journals in JSTOR). So I modified the command to count how many files have these "well-formed" ISSN strings in them (the -l option to ack):

ack -hilo 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l

That gives a total of 638 affected files. Here's a list of the affected files, for future team reference.

One wonders about the discrepancy between 638 and 673, but at least I know I now have a regular expression that can capture most of the ISSNs and their values. I'll do some spot-checking later to see if I can figure out what's being missed and why.
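One way to start that spot-checking is to list the files that mention "issn" but contain no well-formed match. The sketch below is hypothetical: it uses grep -E as a stand-in for ack (so [0-9] replaces \d), and the function name issn_mismatches is an assumption. One possible cause of the gap, untested here, is a label and its number falling on different lines, since both tools match line by line.

```shell
# Hypothetical helper (an assumption, not from the original post): list the
# post-*.xml files in directory $1 that mention "issn" in any capitalization
# but contain no well-formed ISSN string by the pattern used above.
issn_mismatches() {
  dir=$1
  all=$(mktemp) && good=$(mktemp) || return 1
  # Files mentioning "issn" at all (case-insensitive).
  grep -il 'issn' "$dir"/post-*.xml | sort > "$all"
  # Files with a label followed by a well-formed number on the same line.
  grep -ilE 'issn[^0-9]*[0-9X]{4}-?[0-9X]{4}' "$dir"/post-*.xml | sort > "$good"
  # comm -23 prints the lines that appear only in the first sorted list.
  comm -23 "$all" "$good"
  rm -f "$all" "$good"
}
```

Run from the directory that holds the split posts, `issn_mismatches .` would print one file name per line for manual inspection.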

More importantly, it's now very clear that mining the ISSNs out of the blog posts on our way to Zotero is a worthwhile task. Not only will we be able to add them to the records, we may also be able to use them to look up existing catalog data from other databases with which to better populate the fields in the corresponding Zotero records.
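Before using a mined value for lookups, it may be worth checking that it is a plausible ISSN. The eighth character of an ISSN is a check digit computed modulo 11 from the first seven digits (weights 8 down to 2, with "X" standing for 10). A minimal sketch of that check, assuming a POSIX shell and awk; the function name issn_is_valid is hypothetical:

```shell
# Hypothetical helper (an assumption, not from the original post): succeed if
# $1 is an ISSN (####-#### or ########) with a correct modulus-11 check digit.
issn_is_valid() {
  printf '%s\n' "$1" | awk '
    {
      s = toupper($0)
      # Require the expected shape before doing any arithmetic.
      if (s !~ /^[0-9][0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9X]$/) exit 1
      gsub(/-/, "", s)
      sum = 0
      # Weight the first seven digits 8, 7, ..., 2.
      for (i = 1; i <= 7; i++) sum += substr(s, i, 1) * (9 - i)
      check = (11 - sum % 11) % 11
      expected = (check == 10) ? "X" : check ""
      exit (substr(s, 8, 1) == expected) ? 0 : 1
    }'
}
```

For instance, `issn_is_valid 0378-5955 && echo ok` prints "ok", while a value with a wrong final digit makes the function return a non-zero status.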

New in Electra

I've just added the following blogs to the Electra Atlantis feed aggregator:

dh+lib: where the digital humanities and librarianship meet
Digital Humanities Universität Leipzig
James Cummings (In my <element/>)

Mining AWOL for Identifiers

NB: There is now a follow-up post to this one, in which various bad assumptions made here are addressed: "Mining AWOL more carefully for ISSNs".

In collaboration with Pavan Artri, Dawn Gross, Chuck Jones, Ronak Parpani, and David Ratzan, I'm currently working on a project to port the content of Chuck's Ancient World Online (AWOL) blog to a Zotero library. Funded in part by a grant from the Gladys Krieble Delmas Foundation, the idea is to make the information Chuck gathers available for more structured data needs, like citation generation, creation of library catalog records, and participation in linked data graphs. So far, we have code that successfully parses the Atom XML "backup" file we can get from Blogger and uses the Zotero API to create a Zotero record for each blog post and to populate its title (derived from the title of the post), url (the first link we find in the body of the post), and tags (pulled from the Blogger "labels").

We know that some of the post bodies also contain standard numbers (like ISSNs and ISBNs), but it has been unclear how many of them there are and how regular the structure of the text strings in which they appear is. Would it be worthwhile to try to mine them out programmatically and insert them into the Zotero records as well? If so, what's our best strategy for capturing them ... i.e., what sort of parenthetical remarks, whitespace, and punctuation might intervene between the identifier labels and the corresponding values? Time to do some data prospecting ...

We'd previously split the monolithic "backup" XML file into individual XML files, one per post (click at your own risk; there are a lot of files in that github listing and your browser performance in rendering the page and its JavaScript may vary). Rather than writing a script to parse all that stuff just to figure out what's going on, I decided to try my new favorite can-opener, ack (previously installed stresslessly on my Mac with another great tool, the Homebrew package manager).

Time for some fun with regular expressions! I worked on this part iteratively, trying to start out as liberally as possible, thereby letting in a lot of irrelevant stuff so as not to miss anything good. I assumed that we want to catch acronyms, so strings of two or more capital letters, preceded by a word boundary. I didn't want to just use a [A-Z] range, since AWOL indexes multilingual resources, so I had recourse to the Unicode Categories feature that's available in most modern regular expression engines, including recent versions of Perl (on which ack relies). So, I started off with:

\b\p{Lu}\p{Lu}+

After some iteration on the results, I ended up with something more complex, trying to capture anything that fell between the acronym itself and the first subsequent colon, which seemed to be the standard delimiter between the designation+explanation of the type of identifier and the identifying value itself. I figure we'll worry about how to parse the value later, once we're sure which identifiers we want to capture. So, here's the regex I ultimately used:

\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]

The full ack command looked like this:

ack -oh "\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]" post-*.xml > ../awol-acronyms/raw.txt

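where the -h option tells ack to "suppress the prefixing of filenames on output when multiple files are searched" and the -o option tells ack to "show only the part of each line matching" my regex pattern (quotes from the ack man page). One caveat: inside a bracketed character class Perl reads \b as a backspace character rather than a word boundary, so in effect the pattern stops at the first colon that follows a run of non-punctuation characters. You can browse the raw results here.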

So, how to get this text file into a more analyzable state? First, I thought I'd pull it into my text editor, Sublime, and use its text manipulation functions to filter for unique lines and then sort them. But then it occurred to me that I really wanted to know the frequency of identifier classes across the whole of the blog content, so I turned to OpenRefine.

I followed OR's standard process for importing a text file (being sure to set the right character encoding for the file on which I was working). Then, I used the column edit functionality and the string manipulation functions in the General Refine Expression Language (abbreviated GREL; it used to be called "Google Refine Expression Language") to clean up the strings (regularizing whitespace, trimming leading and trailing whitespace, converting everything to uppercase, and getting rid of whitespace immediately preceding colons). That part could all have been done in a step outside OR with other tools, but I didn't think about it until I was already there.

Then came the part OR is actually good at, faceting the data (i.e., getting all the unique strings and counts of same). I then used the GREL facetCount() function to get those values into the table itself, followed this recipe to get rid of matching rows in the data, and exported a CSV file of the unique terms and their counts (github's default display for CSV makes our initial column very wide, so you may have to click on the "raw" link to see all the columns of data).
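For comparison, the cleanup and counting steps can also be approximated at the command line. This is a rough sketch under stated assumptions, not what was actually run: it assumes a POSIX shell with sed -E, and the function name count_labels is hypothetical.

```shell
# Hypothetical helper (an assumption, not from the original post): read raw
# label strings on stdin (e.g. the contents of raw.txt) and print
# "label<TAB>count", most frequent first.
# sed:  collapse runs of whitespace, trim both ends, drop spaces before colons.
# tr:   uppercase; many tr implementations only touch ASCII letters here.
# sort | uniq -c | sort -rn: count identical lines, largest count first.
# awk:  uniq -c prints "count label"; swap that to "label<TAB>count".
count_labels() {
  sed -E 's/[[:space:]]+/ /g; s/^ //; s/ $//; s/ :/:/g' |
    tr '[:lower:]' '[:upper:]' |
    sort | uniq -c | sort -rn |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print $0 "\t" n }'
}
```

Something like `count_labels < raw.txt > counts.tsv` would then produce a tab-separated table comparable to the exported CSV, though accented letters may not be uppercased the way OpenRefine handles them.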

There are some things that need investigating, but what strikes me is that only ISSN seems worth capturing programmatically. ISSNs appear 46 times in 14 different variations:

ISSN:	17
ISSN paper:	9
ISSN electrònic:	4
ISSN electronic edition:	2
ISSN electrónico:	2
ISSN électronique:	2
ISSN impreso:	2
ISSN Online:	2
ISSN edición electrónica:	1
ISSN format papier:	1
ISSN Print:	1
ISSN print edition:	1
ONLINE ISSN:	1
PRINT ISSN:	1

Compare ISBNs:

ISBN of Second Part:	2
ISBN:	1
ISBN Compiled by:	1

DOIs make only one appearance, and there are no Library of Congress cataloging numbers.

Now to point my collaborators at this blog post and see if they agree with me...

Thursday, April 10, 2014

Batch XML validation at the command line

Updated: 8 August, 2017 to reflect changes in the installation pattern for jing.

This recipe validates a batch of XML files against a RelaxNG schema. I had help figuring this out from Hugh and Ryan at DC3:

$ find {searchpath} -name "*.xml" -print | parallel --tag jing {relaxngpath}

The find command hunts down all files ending with ".xml" in the directory tree under searchpath. The parallel command takes that list of files and fires off (in parallel) a jing validation run for each of them. The --tag option passed to parallel ensures we get the name of the file passed through with each error message. This turns out (in general terms as seen by me) to be much faster than running each jing call in sequence, e.g. with the -exec primary in find.
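Because --tag prefixes every output line with the file name and a tab, the combined output is easy to summarize. A minimal sketch, assuming the validation messages arrive on standard output; the function name errors_per_file is hypothetical:

```shell
# Hypothetical helper (an assumption, not from the original post): read
# "parallel --tag" output on stdin, where each line is "<file><TAB><message>",
# and print "count<TAB>file" with the files that have the most messages first.
errors_per_file() {
  cut -f1 | sort | uniq -c | sort -rn |
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}
```

Appending `| errors_per_file` to the find/parallel pipeline above would give a per-file tally instead of the full list of messages.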

As I'm running on a Mac, I had to install GNU Parallel and the Jing RelaxNG Validator. That's what Homebrew is for:

~~$ brew install jing~~
$ brew install jing-trang
$ brew install parallel

NB: you may have to install an older version of Java before you can get the jing-trang formula to work in homebrew (e.g., brew install java6).

What's the context, you ask? I have lots of reasons to want to be able to do this. The proximal cause was batch-validating all the EpiDoc XML files for the inscriptions that are included in the Corpus of Campā Inscriptions before regenerating the site for an update today. I wanted to see quickly if there were any encoding errors in the XML that might blow up the XSL transforms we use to generate the site. So, what I actually ran was:

$ curl -O http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng
$ find ./texts/xml -name '*.xml' -print | parallel --tag jing tei-epidoc.rng

Thanks to everybody who built all these tools!
