Friday, April 11, 2014

Mining AWOL more carefully for ISSNs

I made a couple of bad assumptions in my previous attempt to mine ISSNs out of the content of the AWOL Blog:
  1. I assumed that the string "ISSN" would always appear in all caps.
  2. I assumed that the string "ISSN" would be followed immediately by a colon (:).
In fact, the following command indicates there are at least 673 posts containing instances of the string (ignoring capitalization) "issn" in the AWOL content:
ack -hilo issn  post-*.xml | wc -l
 In an attempt to make sure we're capturing real ISSN strings, I refined the regular expression to try to capture a leading "ISSN" string, and then everything possibly following until and including a properly formatted ISSN number. I've seen both ####-#### and ########, (where # is either a digit or the character "X") in the wild, so I accommodated both possibilities. Here's the command:
ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml > issn-raw.txt
You can see the raw list of the matched strings here. If we count the lines generated by that command instead of saving them to file, we can see that there are at least 1931 ISSNs in AWOL.
ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l
Then I wondered, are we getting just one ISSN per file or multiples? We know that some of the posts in the blog are about single resources, but there are also plenty of posts about collections and also posts that gather up all the references to every known instance of a particular genre (e.g., open-access journals or journals in JSTOR). So I modified the command to count how many files have these "well-formed" ISSN strings in them (the -l option to ack):
ack -hilo 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l
For a total of 638 affected files. Here's a list of the affected files, for future team reference.

One wonders about the discrepancy between 638 and 673, but at least I know I now have a regular expression that can capture most of the ISSNs and their values. I'll do some spot-checking later to see if I can figure out what's being missed and why.

More importantly, it's now very clear that mining the ISSNs out of the blog posts on our way to Zotero is a worthwhile task. Not only will we be able to add them to the records, we may also be able to use them to look up existing catalog data from other databases with which to better populate the fields in the corresponding Zotero records.

1 comment:

Scot Mcphee said...

In one way this is one argument why strictly structured document stylesheets have the advantage over loose XML aggregations of key/value pairs.

e.g. value rather than value ... in the XSD or RelaxNG you can then specify your ISSN format.

But of course, then the schema is not adaptable for different datasets or flexible to fit new uses.