Thursday, February 2, 2012

Playing with PELAGIOS: Dealing with a bazillion RDF files

Latest in a Playing with PELAGIOS series

Some of the PELAGIOS partners distribute their annotation RDF in a relatively small number of files. Others (like SPQR and ANS) have a very large number of files. This makes the technique I used earlier for adding triples to the database (one HTTP PUT per file via curl) ungainly. Fortunately, 4store provides command-line tools for loading triples.

First, stop the 4store http server (why?):
$ killall 4s-httpd
Try to import all the RDF files at once. Rats!
$ 4s-import -a pelagios *.rdf
-bash: /Applications/4store.app/Contents/MacOS/bin/4s-import: Argument list too long
Bash to the rescue (but note that doing one file at a time has a cost on the 4store side):
$ for f in *.rdf; do 4s-import -av pelagios "$f"; done
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.00000.rdf>
Pass 1, processed 10 triples (10)
Pass 2, processed 10 triples, 8912 triples/s
Updating index
Index update took 0.000890 seconds
Imported 10 triples, average 4266 triples/s
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.101.rdf>
Pass 1, processed 11 triples (11)
Pass 2, processed 11 triples, 9856 triples/s
Updating index
Index update took 0.000936 seconds
Imported 11 triples, average 4493 triples/s
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.10176.rdf>
Pass 1, processed 8 triples (8)
Pass 2, processed 8 triples, 6600 triples/s
Updating index
Index update took 0.000892 seconds
Imported 8 triples, average 3256 triples/s
... 
This took a while. There are 86,200 files in the ANS annotation batch.

Note the use of the -a option on 4s-import to ensure the triples are added to the current contents of the database, rather than replacing them! Note also the -v option, which is what gives you the report (otherwise, it's silent, and that makes my Ctrl-C finger twitchy).
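
If the one-file-at-a-time cost gets painful, a possible middle ground is to hand 4s-import the files in batches that stay under the shell's argument limit. What follows is only a sketch, not something run against the ANS batch: it assumes 4s-import accepts several file arguments in one call (the failed command above suggests it does), and the function name import_batches, the IMPORT_CMD override, and the default batch size of 500 are placeholders of my own choosing.

```shell
# Sketch: import the *.rdf files in the current directory in batches.
# import_batches, IMPORT_CMD and the default batch size are hypothetical
# names/values chosen for illustration, not part of 4store.
import_batches() {
  kb=$1            # 4store knowledge base name, e.g. pelagios
  size=${2:-500}   # maximum number of files per import call
  # find emits NUL-terminated file names; xargs -n caps how many are
  # passed to each call, so no single command line grows too long.
  find . -maxdepth 1 -name '*.rdf' -print0 |
    xargs -0 -n "$size" ${IMPORT_CMD:-4s-import} -a "$kb"
}
```

Called as `import_batches pelagios 500`, this would run one import per 500 files instead of one per file. Setting IMPORT_CMD=echo prints the commands instead of running them, which is a cheap way to check the batching before touching the database.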

Now, back to the SPARQL mines.

Wednesday, February 1, 2012

Playing with PELAGIOS: Arachne was easy after nomisma

Querying Pleiades annotations out of Arachne RDF was as simple as loading the Arachne Objects by Places RDF file into 4store the same way I did nomisma and running the same SPARQL query.  Cost: 5 minutes. Now I know about 29 objects in the Arachne database that they think are related to Akragas/Agrigentum. For example:

Playing with PELAGIOS: Nomisma

So, I want to see how hard it is to query the RDF that PELAGIOS partners are putting together. The first experiment is documented below.

Step 1: Set up a Triplestore (something to load the RDF into and support queries)

Context: I'm a triplestore n00b. 

I found Jeni Tennison's Getting Started with RDF and SPARQL Using 4store and RDF.rb and, though I had no interest in messing around with Ruby as part of this exercise, the recommendation of 4store as a triplestore sounded good, so I went hunting for a Mac binary and downloaded it.

Step 2: Grab RDF describing content in Nomisma.org

Context: I'm a point-and-click expert.

I downloaded the PELAGIOS-conformant RDF data published by Nomisma.org at http://nomisma.org/nomisma.org.pelagios.rdf.

Background: "Nomisma.org is a collaborative effort to provide stable digital representations of numismatic concepts and entities, for example the generic idea of a coin hoard or an actual hoard as documented in the print publication An Inventory of Greek Coin Hoards (IGCH)."

Step 3: Fire up 4store and load in the nomisma.org data

Context: I'm a 4store n00b, but I can cut and paste, read and reason, and experiment.

Double-clicked the 4store icon in my Applications folder. It opened a terminal window.

To create and start up an empty database for my triples, I followed the 4store instructions and Tennison's post (mutatis mutandis) and so typed the following in the terminal window ("pelagios" is the name I gave to my database; you could call yours "ray" or "jay" if you like):
$ 4s-backend-setup pelagios
$ 4s-backend pelagios
Then I started up 4store's SPARQL http server and aimed it at the still-empty "pelagios" database so I could load my data and try my hand at some queries:
$ 4s-httpd pelagios
Loading the nomisma data was then as simple as moving to the directory where I'd saved the RDF file and typing:
$ curl -T nomisma.org.pelagios.rdf 'http://localhost:8080/data/http://nomisma.org/nomisma.org.pelagios.rdf/'
Note how the URI base for nomisma items is appended to the URL string passed via curl. This is how you specify the "model URI" for the graph of triples that gets created from the RDF.
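
To make the shape of that URL explicit, here is a small sketch that assembles it from its two parts: the server's /data/ path and the model URI. The helper names data_url and put_graph are hypothetical, and the default of http://localhost:8080 simply mirrors the command above.

```shell
# Sketch: build the 4store data URL for a given model (graph) URI.
# data_url and put_graph are hypothetical helper names.
data_url() {
  graph=$1                         # model URI for the graph
  base=${2:-http://localhost:8080} # where 4s-httpd is listening
  printf '%s/data/%s\n' "$base" "$graph"
}

# Upload one RDF file into that graph (needs a running 4s-httpd).
put_graph() {
  curl -T "$1" "$(data_url "$2")"
}
```

With these, the load above becomes `put_graph nomisma.org.pelagios.rdf http://nomisma.org/nomisma.org.pelagios.rdf/`.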

Step 4: Try to construct a query and dig out some data.

Context: I'm a SPARQL n00b, but I'd done some SQL back in the day and XML and namespaces are pretty much burned into my soul at this point. 

Following Tennison's example, I pointed my browser at http://localhost:8080/test/. I got 4store's SPARQL test query interface. I googled around looking grumpily at different SPARQL "how-tos" and "getting starteds" and trying stuff and pondering repeated failure until this worked:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oac: <http://www.openannotation.org/ns/>

SELECT ?x
WHERE {
 ?x oac:hasBody <http://pleiades.stoa.org/places/462086> .
} 

That's "find the ID of every OAC Annotation in the triplestore that's linked to Pleiades Place 462086" (i.e., Akragas/Agrigentum, modern Agrigento in Sicily). The result is a list like this:
  • http://nomisma.org/nomisma.org.pelagios.rdf#igch1910-agrigentum-5
  • http://nomisma.org/nomisma.org.pelagios.rdf#igch2089-agrigentum-24
  • http://nomisma.org/nomisma.org.pelagios.rdf#igch2101-agrigentum-32
  • ...
51 IDs in all.

But what I really want is a list of the IDs of the nomisma entities themselves so I can go look up the details and learn things. Back to the SPARQL mines until I produced this:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oac: <http://www.openannotation.org/ns/>

SELECT ?nomismaid
WHERE {
 ?x oac:hasBody <http://pleiades.stoa.org/places/462086> .
 ?x oac:hasTarget ?nomismaid .
} 

Now I have a list of 51 nomisma IDs: one for the mint and 50 coin hoards that illustrate the economic network in which the ancient city participated (e.g., http://nomisma.org/id/igch2081).
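
The same query should work for any other Pleiades place by swapping the ID, so here is a sketch that wraps it in a shell function and sends it from the command line instead of the browser form. Treat the details as assumptions: I'm taking it that the server answers queries at /sparql/ and accepts a form-encoded POST with a query parameter, and targets_query and run_query are names I made up for illustration. The unused rdf, rdfs and foaf prefixes are left out.

```shell
# Sketch: print the SPARQL query that lists the targets of annotations
# whose body is a given Pleiades place. targets_query and run_query are
# hypothetical helper names.
targets_query() {
  place_id=$1   # numeric Pleiades place ID, e.g. 462086
  cat <<EOF
PREFIX oac: <http://www.openannotation.org/ns/>

SELECT ?target
WHERE {
  ?x oac:hasBody <http://pleiades.stoa.org/places/$place_id> .
  ?x oac:hasTarget ?target .
}
EOF
}

# Send a query to the SPARQL endpoint (assumed to be /sparql/ on the
# same server) and print whatever comes back.
run_query() {
  curl -s --data-urlencode "query=$1" "${2:-http://localhost:8080}/sparql/"
}
```

Usage would be something like `run_query "$(targets_query 462086)"`, with the results arriving in whatever format the server returns by default.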

Cost: about 2 hours of time, 1 cup of coffee, and three favors from Sebastian Heath on IRC.

Up next: Arachne, the object database of the Deutsches Archäologisches Institut.



Tuesday, January 17, 2012