Friday, February 10, 2012

Give Me the Zotero Item Keys!

I fear and hope that this post will cause someone smarter than me to pipe up and say UR DOIN IT WRONG ITZ EZ LYK DIS ...

Here's the use case:

The Integrating Digital Papyrology project (and friends) has a Zotero group library populated with 1,445 bibliographic records that were developed on the basis of an old, built-by-hand Checklist of Editions of Greek and Latin Papyri (etc.). A lot of checking and improving was done to the data in Zotero.

Separately, there's now a much larger pile of bibliographic records related to papyrology that were collected (on different criteria) by the Bibliographie Papyrologique project. They have been machine-converted (into TEI document fragments) from a sui generis FileMaker Pro database and are now hosted via papyri.info (the raw data is on GitHub).

There is considerable overlap between these two datasets, but also significant divergence. We want to merge "matching" records in a carefully supervised way, making sure not to lose any of the extra goodness that BP adds to the data but taking full advantage of the corrections and improvements that were done to the Checklist data.

We started by doing an export-to-RDF of the Zotero data and, as a first step, that was banged up (programmatically) against the TEI data on the basis of titles. Probable matches were hand-checked and a resulting pairing of papyri.info bibliographic ID numbers against Zotero short titles was produced. You can see the resulting XML here.

I should point out that almost everything up to here including the creation and improvement of the data, as well as anything below regarding the bibliography in papyri.info, is the work of others. Those others include Gabriel Bodard, Hugh Cayless, James Cowey, Carmen Lantz, Adam Prins, Josh Sosin, and Jen Thum. And the BP team. And probably others I'm forgetting at the moment or who have labored out of my sight. I erect this shambles of a lean-to on the shoulders of giants.

To guide the work of our bibliographic researchers in analyzing the matched records, I wanted to create an HTML file that looks like this:
  • Checklist Short Title = Papyri.info ID number and Full Title String
  • BGU 10 = PI idno 7513: Papyrusurkunden aus ptolemäischer Zeit. (Ägyptische Urkunden aus den Staatlichen Museen zu Berlin. Griechische Urkunden. X. Band.)
  • etc. 
In that list, I wanted items to the left to be linked to the online view of the Zotero record at zotero.org and items on the right linked to the online view of the TEI record at papyri.info. The XML data we got from the initial match process provided the papyri.info bibliographic ID numbers, from which it's easy to construct the corresponding URIs, e.g., http://papyri.info/biblio/7513.

But Zotero presented a problem. URIs for bibliographic records on the Zotero server use alphanumeric "item keys" like this: CJ3WSG3S (as in https://www.zotero.org/groups/papyrology/items/itemKey/CJ3WSG3S/).

That item key string is not, to my knowledge, included in any of the export formats produced by the Zotero desktop client, nor is it surfaced in its interface (argh). It appears possible to hunt them down programmatically via the Zotero Read API, though I haven't tried it for reasons that will be explained shortly. It is certainly possible to hunt for them manually via the web interface, but I'm not going to try that for more than about 3 records.

How I got the Zotero item keys

So, I had two choices at this point: write some code to automate hunting down the item keys via the Zotero Read API, or crack open the Zotero SQLite database on my local client and see if the item keys are lurking in there too. Since I'm on a newish laptop on which I hadn't yet installed Xcode, which seems to be a prerequisite to installing support for a Python virtual environment, which is the preferred way to get pip, which is the preferred install prerequisite for pyzotero, which is the Python wrapper for the Zotero API, I had to make some choices about which yaks to shave.
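For the record, the API route would presumably have looked something like the following sketch (untested by me; the group ID and API key are placeholders, and I'm trusting the pyzotero docs that items come back as dicts exposing the item key and a shortTitle field):

from pyzotero import zotero

# Placeholders: substitute the real group ID and an API key with read access.
GROUP_ID = '123456'
API_KEY = 'xxxxxxxxxxxxxxxx'

zot = zotero.Zotero(GROUP_ID, 'group', API_KEY)
# everything() keeps requesting pages until the whole library has been fetched
for item in zot.everything(zot.items()):
    data = item.get('data', item)  # newer pyzotero nests the fields under 'data'
    print('%s\t%s' % (data.get('shortTitle', ''), item.get('key', '')))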

I decided to start the (notoriously slow) Xcode download yak and then have a go at the SQLite yak while that was going on.

I grabbed the trial version of RazorSQL (which looked like a good shortcut after a few minutes of Googling), made a copy of my Zotero database, and started poking around. I thought about looking for detailed documentation (starting here I guess), but direct inspection started yielding results so I just kept going commando-style. It became clear at once that I wasn't going to find a single table containing my bibliographic entries. The Zotero client database is all normalized and modularized and stuff. So I viewed table columns and table contents as necessary and started building a SQL query to get at what I wanted. Here's what ultimately worked:

SELECT itemDataValues.value, items.key FROM items 
INNER JOIN libraries ON items.libraryID = libraries.libraryID
INNER JOIN groups ON libraries.libraryID = groups.libraryID
INNER JOIN itemData ON items.itemID = itemData.itemID
INNER JOIN itemDataValues ON itemData.valueID = itemDataValues.valueID
INNER JOIN fields ON itemData.fieldID = fields.fieldID
WHERE groups.name = 'Papyrology' AND fields.fieldID = 116

The SELECT statement gets me two values for each match dredged up by the rest of the query: a value stored in the itemDataValues table and a key stored in the items table. The various JOINs get us from the items table to the specific value (i.e., a short title) that we want. A fieldID of 116 in the fields table corresponds to the short-title field you see in your Zotero client. I found that out by inspecting the fields table; I could have used more JOINs to be able to use the string "shortTitle" in my WHERE clause, but that would have just taken more time.
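Incidentally, if you'd rather skip the GUI entirely, the same query runs happily from a few lines of Python against a copy of the database using the standard library's sqlite3 module. A sketch only (the file name is whatever you called your copy; and assuming the fields table carries a fieldName column, which would let you trade the magic 116 for the string itself):

import sqlite3

# Always work on a copy of zotero.sqlite, never the live database.
conn = sqlite3.connect('zotero-copy.sqlite')
query = """
    SELECT itemDataValues.value, items.key FROM items
    INNER JOIN libraries ON items.libraryID = libraries.libraryID
    INNER JOIN groups ON libraries.libraryID = groups.libraryID
    INNER JOIN itemData ON items.itemID = itemData.itemID
    INNER JOIN itemDataValues ON itemData.valueID = itemDataValues.valueID
    INNER JOIN fields ON itemData.fieldID = fields.fieldID
    WHERE groups.name = 'Papyrology' AND fields.fieldName = 'shortTitle'
"""
for short_title, key in conn.execute(query):
    print('%s\t%s' % (short_title, key))
conn.close()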

The results of that query against my database looked like this:

P.Cair.Preis.    2245UKTH
CPR 18           26K8TAJT
P.Bodm. 28       282XKDE9
P.Gebelen        29ETKPXC
O.Krok           2BBMS7NS
P.Carlsb. 5      2D2ZNT4C
P.Mich.Aphrod.   2DTD2NIZ
P.Carlsb. 9      2FWF6T6I
P.Col. 1         2G4CF756
P.Lond.Copt. 2   2GAEU5QP
P.Harr. 1        2GCCNGJV
O.Deir el-Bahari 2GH3FEA2
P.Harrauer       2H3T6EU2
(etc.)

So: copy that tabular result out of the RazorSQL GUI, paste it into a new LibreOffice spreadsheet, save it, and I've got an XML file that I can dip into from the XSLT I'd already started on to produce my HTML view.
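(If the spreadsheet round-trip offends, a few more lines of Python would do the same job, turning the tab-separated dump into a simple XML file the XSLT can pull in with document(). The file and element names below are just ones I made up:)

import csv
import xml.etree.ElementTree as ET

# 'keys.tsv' holds the two-column results copied out of RazorSQL
# (or printed by the script above) and saved as plain text.
root = ET.Element('items')
with open('keys.tsv') as f:
    for row in csv.reader(f, delimiter='\t'):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        item = ET.SubElement(root, 'item', key=row[1].strip())
        item.text = row[0].strip()
ET.ElementTree(root).write('zotero-keys.xml', encoding='utf-8')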

Here's the resulting HTML file.

On we go.

Oh, and for those paying attention to such things, Xcode finished downloading about two-thirds of the way through this process ...

Tuesday, February 7, 2012

Playing with PELAGIOS: Open Context and Labels

Latest in the Playing with PELAGIOS series.

I've just modified the tooling and re-run the Pleiades-oriented-view-of-the-GAWD report to include the RDF triples just published by Open Context and to exploit, when available, rdfs:label on the annotation target in order to produce more human-readable links in the HTML output. This required the addition of an OPTIONAL clause to the SPARQL query, as well as modifications to the results-processing XSLT. The new versions are indicated/linked on the report page.
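For anyone following along at home, the gist of the OPTIONAL business is that the query now asks for a label when one exists but keeps the row when one doesn't. Here's a rough sketch of the pattern, fired at the local 4store endpoint from Python. This is not the actual report query: the variable names are mine, I'm reusing the Akragas/Agrigentum example from earlier posts, and I'm assuming the third-party requests library and 4store's /sparql/ endpoint.

import requests

QUERY = """
PREFIX oac: <http://www.openannotation.org/ns/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?target ?label
WHERE {
 ?annotation oac:hasBody <http://pleiades.stoa.org/places/462086> .
 ?annotation oac:hasTarget ?target .
 OPTIONAL { ?target rdfs:label ?label . }
}
"""

r = requests.post('http://localhost:8080/sparql/', data={'query': QUERY})
print(r.text)  # SPARQL results XML; targets without labels still come back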

You can see the results of these changes, for example, in the Antiochia/Theoupolis page.

Monday, February 6, 2012

Playing with PELAGIOS: The GAWD is Live

This is the latest in an ongoing series chronicling my dalliances with data published by the PELAGIOS project partners.

I think it's safe to say that, thanks to the PELAGIOS partner institutions, we do have a Graph of Ancient World Data (GAWD) on the web. It's still in early stages, and one has to do some downloading, unzipping, and so forth to engage with it at the moment, but indeed the long-awaited day has dawned.

Here's the perspective, as of last Friday, from the vantage point of Pleiades. I've used SPARQL to query the GAWD for all information resources that the partners claim (via their RDF data dumps) are related to Pleiades information resources. I.e., I'm pulling out a list of information resources about texts, pictures, objects, grouped by their relationships to what Pleiades knows about ancient places (findspot, original location, etc.). I've sorted that view of the graph by the titles Pleiades gives to its place-related information resources and generated an HTML view of the result. It's here for your browsing pleasure.
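If you're curious about the shape of the query behind that report, it boils down to asking for every annotation whose body is a Pleiades place and pulling out the target along with it. What follows is my reconstruction, not the actual query used for the report; it assumes the third-party requests library and a local 4store SPARQL endpoint at http://localhost:8080/sparql/ (the real report also sorts by Pleiades titles downstream of the query).

import requests

QUERY = """
PREFIX oac: <http://www.openannotation.org/ns/>
SELECT ?place ?target
WHERE {
 ?annotation oac:hasBody ?place .
 ?annotation oac:hasTarget ?target .
 FILTER regex(str(?place), "^http://pleiades.stoa.org/places/")
}
"""

r = requests.post('http://localhost:8080/sparql/', data={'query': QUERY})
print(r.text)  # one row per partner resource linked to a Pleiades place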

Next Steps and Desiderata

For various technical reasons, I'm not yet touching the data of a couple of PELAGIOS partners (CLAROS and SPQR), but those issues will hopefully be resolved soon. I still need to dig into figuring out what Open Context is doing on this front. Other key resources -- especially those emanating from ISAW -- are not yet ready to produce RDF (but we're working on it).

There are a few things I'd like the PELAGIOS partners to consider/discuss adding to their data:

  • Titles/labels for the information resources (using rdfs:label?). This would make it possible for me to produce more intuitive/helpful labels for users of my HTML index. Descriptions would be cool too. As would some indication of the type of thing(s) a given resource addresses (e.g., place, statue, inscription, text).
  • Categorization of the relationships between their information resources and Pleiades information resources. Perhaps some variation of the terms originally explored by Concordia (whence the GAWD moniker), as someone on the PELAGIOS list has already suggested.
What would you like to see added to the GAWD? What would you do with it?

Thursday, February 2, 2012

Playing with PELAGIOS: Dealing with a bazillion RDF files

Latest in the Playing with PELAGIOS series.

Some of the PELAGIOS partners distribute their annotation RDF in a relatively small number of files. Others (like SPQR and ANS) have a very large number of files. This makes the technique I used earlier for adding triples to the database ungainly. Fortunately, 4store provides some command line methods for loading triples.

First, stop the 4store http server (why?):
$ killall 4s-httpd
Try to import all the RDF files.  Rats!
$ 4s-import -a pelagios *.rdf
-bash: /Applications/4store.app/Contents/MacOS/bin/4s-import: Argument list too long
Bash to the rescue (but note that doing one file at a time has a cost on the 4store side):
$ for f in *.rdf; do 4s-import -av pelagios $f; done
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.00000.rdf>
Pass 1, processed 10 triples (10)
Pass 2, processed 10 triples, 8912 triples/s
Updating index
Index update took 0.000890 seconds
Imported 10 triples, average 4266 triples/s
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.101.rdf>
Pass 1, processed 11 triples (11)
Pass 2, processed 11 triples, 9856 triples/s
Updating index
Index update took 0.000936 seconds
Imported 11 triples, average 4493 triples/s
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.10176.rdf>
Pass 1, processed 8 triples (8)
Pass 2, processed 8 triples, 6600 triples/s
Updating index
Index update took 0.000892 seconds
Imported 8 triples, average 3256 triples/s
... 
This took a while. There are 86,200 files in the ANS annotation batch.

Note the use of the -a option on 4s-import to ensure the triples are added to the current contents of the database, rather than replacing them! Note also the -v option, which is what gives you the report (otherwise, it's silent and that makes my ctrl-c finger twitchy).

Now, back to the SPARQL mines.

Wednesday, February 1, 2012

Playing with PELAGIOS: Arachne was easy after nomisma

Querying Pleiades annotations out of Arachne RDF was as simple as loading the Arachne Objects by Places RDF file into 4store the same way I did nomisma and running the same SPARQL query.  Cost: 5 minutes. Now I know about 29 objects in the Arachne database that they think are related to Akragas/Agrigentum. For example:

Playing with PELAGIOS: Nomisma

So, I want to see how hard it is to query the RDF that PELAGIOS partners are putting together. The first experiment is documented below.

Step 1: Set up a Triplestore (something to load the RDF into and support queries)

Context: I'm a triplestore n00b. 

I found Jeni Tennison's Getting Started with RDF and SPARQL Using 4store and RDF.rb and, though I had no interest in messing around with Ruby as part of this exercise, the recommendation of 4store as a triplestore sounded good, so I went hunting for a Mac binary and downloaded it.

Step 2: Grab RDF describing content in Nomisma.org

Context: I'm a point-and-click expert.

I downloaded the PELAGIOS-conformant RDF data published by Nomisma.org at http://nomisma.org/nomisma.org.pelagios.rdf.

Background: "Nomisma.org is a collaborative effort to provide stable digital representations of numismatic concepts and entities, for example the generic idea of a coin hoard or an actual hoard as documented in the print publication An Inventory of Greek Coin Hoards (IGCH)."

Step 3: Fire up 4store and load in the nomisma.org RDF

Context: I'm a 4store n00b, but I can cut and paste, read and reason, and experiment.

Double-clicked the 4store icon in my Applications folder. It opened a terminal window.

To create and start up an empty database for my triples, I followed the 4store instructions and Tennison's post (mutatis mutandis) and so typed the following in the terminal window ("pelagios" is the name I gave to my database; you could call yours "ray" or "jay" if you like):
$ 4s-backend-setup pelagios
$ 4s-backend pelagios
Then I started up 4store's SPARQL http server and aimed it at the still-empty "pelagios" database so I could load my data and try my hand at some queries:
$ 4s-httpd pelagios
Loading the nomisma data was then as simple as moving to the directory where I'd saved the RDF file and typing:
$ curl -T nomisma.org.pelagios.rdf 'http://localhost:8080/data/http://nomisma.org/nomisma.org.pelagios.rdf/'
Note how the URI base for nomisma items is appended to the URL string passed via curl. This is how you specify the "model URI" for the graph of triples that gets created from the RDF.

Step 4: Try to construct a query and dig out some data.

Context: I'm a SPARQL n00b, but I'd done some SQL back in the day and XML and namespaces are pretty much burned into my soul at this point. 

Following Tennison's example, I pointed my browser at http://localhost:8080/test/. I got 4store's SPARQL test query interface. I googled around looking grumpily at different SPARQL "how-tos" and "getting starteds" and trying stuff and pondering repeated failure until this worked:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oac: <http://www.openannotation.org/ns/>

SELECT ?x
WHERE {
 ?x oac:hasBody <http://pleiades.stoa.org/places/462086> .
} 

That's "find the ID of every OAC Annotation in the triplestore that's linked to Pleiades Place 462086" (i.e., Akragas/Agrigentum, modern Agrigento in Sicily). It's a list like this:
  • http://nomisma.org/nomisma.org.pelagios.rdf#igch1910-agrigentum-5
  • http://nomisma.org/nomisma.org.pelagios.rdf#igch2089-agrigentum-24
  • http://nomisma.org/nomisma.org.pelagios.rdf#igch2101-agrigentum-32
  • ...
51 IDs in all.

But what I really want is a list of the IDs of the nomisma entities themselves so I can go look up the details and learn things. Back to the SPARQL mines until I produced this:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oac: <http://www.openannotation.org/ns/>

SELECT ?nomismaid
WHERE {
 ?x oac:hasBody <http://pleiades.stoa.org/places/462086> .
 ?x oac:hasTarget ?nomismaid .
} 

Now I have a list of 51 nomisma IDs: one for the mint and 50 coin hoards that illustrate the economic network in which the ancient city participated (e.g., http://nomisma.org/id/igch2081).

Cost: about 2 hours of time, 1 cup of coffee, and three favors from Sebastian Heath on IRC.

Up next: Arachne, the object database of the Deutsches Archäologisches Institut.