SyntaxHighlighter

Friday, April 11, 2014

Mining AWOL more carefully for ISSNs

I made a couple of bad assumptions in my previous attempt to mine ISSNs out of the content of the AWOL Blog:
  1. I assumed that the string "ISSN" would always appear in all caps.
  2. I assumed that the string "ISSN" would be followed immediately by a colon (:).
In fact, the following command indicates there are at least 673 posts containing instances of the string (ignoring capitalization) "issn" in the AWOL content:
ack -hilo issn  post-*.xml | wc -l
 In an attempt to make sure we're capturing real ISSN strings, I refined the regular expression to try to capture a leading "ISSN" string, and then everything possibly following until and including a properly formatted ISSN number. I've seen both ####-#### and ########, (where # is either a digit or the character "X") in the wild, so I accommodated both possibilities. Here's the command:
ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml > issn-raw.txt
You can see the raw list of the matched strings here. If we count the lines generated by that command instead of saving them to file, we can see that there are at least 1931 ISSNs in AWOL.
ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l
Then I wondered, are we getting just one ISSN per file or multiples? We know that some of the posts in the blog are about single resources, but there are also plenty of posts about collections and also posts that gather up all the references to every known instance of a particular genre (e.g., open-access journals or journals in JSTOR). So I modified the command to count how many files have these "well-formed" ISSN strings in them (the -l option to ack):
ack -hilo 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l
For a total of 638 affected files. Here's a list of the affected files, for future team reference.

One wonders about the discrepancy between 638 and 673, but at least I know I now have a regular expression that can capture most of the ISSNs and their values. I'll do some spot-checking later to see if I can figure out what's being missed and why.

More importantly, it's now very clear that mining the ISSNs out of the blog posts on our way to Zotero is a worthwhile task. Not only will we be able to add them to the records, we may also be able to use them to look up existing catalog data from other databases with which to better populate the fields in the corresponding Zotero records.




New in Electra

I've just added the following blogs to the Electra Atlantis feed aggregator:

Mining AWOL for Identifiers

NB: There is now a follow-up post to this one, in which various bad assumptions made here are addressed: "Mining AWOL more carefully for ISSNs".

In collaboration with Pavan Artri, Dawn Gross, Chuck Jones, Ronak Parpani, and David Ratzan, I'm currently working on a project to port the content of Chuck's Ancient World Online (AWOL) blog to a Zotero library. Funded in part by a grant from the Gladys Krieble Delmas Foundation, the idea is to make the information Chuck gathers available for more structured data needs, like citation generation, creation of library catalog records, and participation in linked data graphs. So far, we have code that successfully parses the Atom XML "backup" file we can get from Blogger and uses the Zotero API to create a Zotero record for each blog post and to populate its title (derived from the title of the post), url (the first link we find in the body of the post), and tags (pulled from the Blogger "labels").

We know that some of the post bodies also contain standard numbers (like ISSNs and ISBNs), but it has been unclear how many of them there are and how regular the structure of text strings in which they appear. Would it be worthwhile to try to mine them out programmatically and insert them into the Zotero records as well? If so, what's our best strategy for capturing them ... i.e., what sort of parenthetical remarks, whitespace, and punctuation might intervene between them and the corresponding values? Time to do some data prospecting ...

We'd previously split the monolithic "backup" XML file into individual XML files, one per post (click at your own risk; there are a lot of files in that github listing and your browser performance in rendering the page and its JavaScript may vary). Rather than writing a script to parse all that stuff just to figure out what's going on, I decided to try my new favorite can-opener, ack (previously installed stresslessly on my Mac with another great tool, the Homebrew package manager).

Time for some fun with regular expressions! I worked on this part iteratively, trying to start out as liberally as possible, thereby letting in a lot of irrelevant stuff so as not to miss anything good. I assumed that we want to catch acronyms, so strings of two or more capital letters, preceded by a word boundary. I didn't want to just use a [A-Z] range, since AWOL indexes multilingual resources, so I had recourse to the Unicode Categories feature that's available in most modern regular expression engines, including recent versions of Perl (on which ack relies). So, I started off with:
\b\p{Lu}\p{Lu}+
After some iteration on the results, I ended up with something more complex, trying to capture anything that fell between the acronym itself and the first subsequent colon, which seemed to be the standard delimiter between the designation+explanation of the type of identifier and the identifying value itself. I figure we'll worry how to parse the value later, once we're sure which identifiers we want to capture. So, here's the regex I ultimately used:
\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]
The full ack command looked like this:
ack -oh "\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]" post-*.xml > ../awol-acronyms/raw.txt
where the -h option telling ack to "suppress the prefixing of filenames on output when multiple files are searched" and the -o option telling ack to "show only the part of each line matching" my regex pattern (quotes from the ack man page). You can browse the raw results here.

So, how to get this text file into a more analyzable state? First, I thought I'd pull it into my text editor, Sublime, and use its text manipulation functions to filter for unique lines and then sort them. But then, it occurred to me that I really wanted to know frequency of identifier classes across the whole of the blog content, so I turned to OpenRefine.

I followed OR's standard process for importing a text file (being sure to set the right character encoding for the file on which I was working). Then, I used the column edit functionality and the string manipulation functions in the Open Refine Expression Language (abbreviated GREL because it used to be called "Google Refine Expression Language") to clean up the strings (regularizing whitespace, trimming leading and trailing whitespace, converting everything to uppercase, and getting rid of whitespace immediately preceding colons). That part could all have been done in a step outside OR with other tools, but I didn't think about it until I was already there.

Then came the part OR is actually good at, faceting the data (i.e., getting all the unique strings and counts of same). I then used the GREL facetCount() function to get those values into the table itself, followed this recipe to get rid of matching rows in the data, and exported a CSV file of the unique terms and their counts (github's default display for CSV makes our initial column very wide, so you may have to click on the "raw" link to see all the columns of data).

There are some things that need investigating, but what strikes me is that apparently only ISSN is probably worth capturing programmatically. ISSNs appear 44 times in 14 different variations:


ISSN: 17
ISSN paper: 9
ISSN electrònic: 4
ISSN electronic edition: 2
ISSN electrónico: 2
ISSN électronique: 2
ISSN impreso: 2
ISSN Online: 2
ISSN edición electrónica: 1
ISSN format papier: 1
ISSN Print: 1
ISSN print edition: 1
ONLINE ISSN: 1
PRINT ISSN: 1

Compare ISBNs:


ISBN of Second Part: 2
ISBN: 1
ISBN Compiled by: 1

DOIs make only one appearance, and there are no Library of Congress cataloging numbers.

Now to point my collaborators at this blog post and see if they agree with me...





Thursday, April 10, 2014

Batch XML validation at the command line

Against a RelaxNG schema. I had help figuring this out from Hugh and Ryan at DC3:

$ find {searchpath} -name "*.xml" -print | parallel --tag jing {relaxngpath}
The find command hunts down all files ending with ".xml" in the directory tree under searchpath. The parallel command takes that list of files and fires off (in parallel) a jing validation run for each of them. The --tag option passed to jing ensures we get the name of the file passed through with each error message. This turns out (in general terms as seen by me) to be much faster than running each jing call in sequence, e.g. with the --exec primary in find.

As I'm running on a Mac, I had to install GNU Parallel and the Jing RelaxNG Validator. That's what Homebrew is for:
$ brew install jing
$ brew install parallel
 What's the context, you ask? I have lots of reasons to want to be able to do this. The proximal cause was batch-validating all the EpiDoc XML files for the inscriptions that are included in the Corpus of Campā Inscriptions before regenerating the site for an update today. I wanted to see quickly if there were any encoding errors in the XML that might blow up the XSL transforms we use to generate the site. So, what I actually ran was:
$ curl -O http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng
$ find ./texts/xml -name '*.xml' -print | parallel --tag jing tei-epidoc.rng
 Thanks to everybody who built all these tools!


Thursday, February 27, 2014

Planet Atlantides grows up and gets its own user-agent string

So, sobered by recent spelunking and bad-bot-chasing in various server logs and convicted by sage advice that ought to be followed by everyone in the UniversalFeedParser documentation, I have customized the bot used on Planet Atlantides for fetching web feeds so it identifies itself unambiguously to the web servers from which it requests those feeds.

Here's the explanatory text I just posted to the Planet Atlantides home page. Please let me know if you have suggestions or critiques.

Feed reading, bots, and user agents

As implied above, Planet Atlantides uses Sam Ruby's "Venus" branch of the Planet "river of news" feed reader. That code is written in the Python language and uses an earlier version of the Universal Feed Reader library for fetching web feeds (RSS and Atom formats). Out of the box, its http requests use the feed parser's default user agent string, so your server logs will only have recorded "UniversalFeedParser/4.2-pre-274-svn +http://feedparser.org/" when our copy of the software pulled your feed in the past. 

Effective 27 February 2014, the Planet Atlantides production version of the code now identifies itself with the following user agent string: "PlanetAtlantidesFeedBot/0.2 +http://planet.atlantides.org/". Production code runs on a machine with the IP address 66.35.62.81, and never runs more than once per hour. Apart for a one-time set of test episodes on 27 February 2014 itself, log entries recording our user agent string and a different IP address represent spoofing by a potential bad actor other than me and my automagical bot. You should nuke them from orbit; it's the only way to be sure. Note that from time-to-time, I may run test code from other IP addresses, but I will in future use the user agent string beginning with "PlanetAtlantidesTestBot" for such runs. You can expect them to be infrequent and irregular.

Please email me if you have any questions about Planet Atlantides, its bot, or these user agent strings. In particular, if you put something like "PlanetAtlantidesBot is messing up my site" in your subject line, I'll look at it and respond as quickly as I can.