SyntaxHighlighter

Friday, April 11, 2014

Mining AWOL for Identifiers

NB: There is now a follow-up post to this one, in which various bad assumptions made here are addressed: "Mining AWOL more carefully for ISSNs".

In collaboration with Pavan Artri, Dawn Gross, Chuck Jones, Ronak Parpani, and David Ratzan, I'm currently working on a project to port the content of Chuck's Ancient World Online (AWOL) blog to a Zotero library. Funded in part by a grant from the Gladys Krieble Delmas Foundation, the idea is to make the information Chuck gathers available for more structured data needs, like citation generation, creation of library catalog records, and participation in linked data graphs. So far, we have code that successfully parses the Atom XML "backup" file we can get from Blogger and uses the Zotero API to create a Zotero record for each blog post and to populate its title (derived from the title of the post), url (the first link we find in the body of the post), and tags (pulled from the Blogger "labels").

We know that some of the post bodies also contain standard numbers (like ISSNs and ISBNs), but it has been unclear how many of them there are and how regular the structure of text strings in which they appear. Would it be worthwhile to try to mine them out programmatically and insert them into the Zotero records as well? If so, what's our best strategy for capturing them ... i.e., what sort of parenthetical remarks, whitespace, and punctuation might intervene between them and the corresponding values? Time to do some data prospecting ...

We'd previously split the monolithic "backup" XML file into individual XML files, one per post (click at your own risk; there are a lot of files in that github listing and your browser performance in rendering the page and its JavaScript may vary). Rather than writing a script to parse all that stuff just to figure out what's going on, I decided to try my new favorite can-opener, ack (previously installed stresslessly on my Mac with another great tool, the Homebrew package manager).

Time for some fun with regular expressions! I worked on this part iteratively, trying to start out as liberally as possible, thereby letting in a lot of irrelevant stuff so as not to miss anything good. I assumed that we want to catch acronyms, so strings of two or more capital letters, preceded by a word boundary. I didn't want to just use a [A-Z] range, since AWOL indexes multilingual resources, so I had recourse to the Unicode Categories feature that's available in most modern regular expression engines, including recent versions of Perl (on which ack relies). So, I started off with:
\b\p{Lu}\p{Lu}+
After some iteration on the results, I ended up with something more complex, trying to capture anything that fell between the acronym itself and the first subsequent colon, which seemed to be the standard delimiter between the designation+explanation of the type of identifier and the identifying value itself. I figure we'll worry how to parse the value later, once we're sure which identifiers we want to capture. So, here's the regex I ultimately used:
\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]
The full ack command looked like this:
ack -oh "\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]" post-*.xml > ../awol-acronyms/raw.txt
where the -h option telling ack to "suppress the prefixing of filenames on output when multiple files are searched" and the -o option telling ack to "show only the part of each line matching" my regex pattern (quotes from the ack man page). You can browse the raw results here.

So, how to get this text file into a more analyzable state? First, I thought I'd pull it into my text editor, Sublime, and use its text manipulation functions to filter for unique lines and then sort them. But then, it occurred to me that I really wanted to know frequency of identifier classes across the whole of the blog content, so I turned to OpenRefine.

I followed OR's standard process for importing a text file (being sure to set the right character encoding for the file on which I was working). Then, I used the column edit functionality and the string manipulation functions in the Open Refine Expression Language (abbreviated GREL because it used to be called "Google Refine Expression Language") to clean up the strings (regularizing whitespace, trimming leading and trailing whitespace, converting everything to uppercase, and getting rid of whitespace immediately preceding colons). That part could all have been done in a step outside OR with other tools, but I didn't think about it until I was already there.

Then came the part OR is actually good at, faceting the data (i.e., getting all the unique strings and counts of same). I then used the GREL facetCount() function to get those values into the table itself, followed this recipe to get rid of matching rows in the data, and exported a CSV file of the unique terms and their counts (github's default display for CSV makes our initial column very wide, so you may have to click on the "raw" link to see all the columns of data).

There are some things that need investigating, but what strikes me is that apparently only ISSN is probably worth capturing programmatically. ISSNs appear 44 times in 14 different variations:


ISSN: 17
ISSN paper: 9
ISSN electrònic: 4
ISSN electronic edition: 2
ISSN electrónico: 2
ISSN électronique: 2
ISSN impreso: 2
ISSN Online: 2
ISSN edición electrónica: 1
ISSN format papier: 1
ISSN Print: 1
ISSN print edition: 1
ONLINE ISSN: 1
PRINT ISSN: 1

Compare ISBNs:


ISBN of Second Part: 2
ISBN: 1
ISBN Compiled by: 1

DOIs make only one appearance, and there are no Library of Congress cataloging numbers.

Now to point my collaborators at this blog post and see if they agree with me...





1 comment:

Tom Elliott said...

Well, Chuck wrote me back saying he was surprised that there were only so few ISSNs in the content. This made me worried that my fancy regular expression was too fancy and not effective. So, I decided to count the number of files containing any capitalization combination of the string "issn".

We can get a list of all files containing the string (ignoring) case by running:

ack -ohil issn post-*.xml

If we count the total number of lines in that output, we can get the number of files (i.e., blog posts) in which the string appears:

ack -ohil issn post-*.xml | wc -l

The answer? 673.

I've got some more work to do...