Friday, October 13, 2017

Using OpenRefine with Pleiades

This past summer, DC3's Ryan Baumann developed a reconciliation service for Pleiades. He's named it Geocollider. It has two manifestations:

  • Upload a CSV file containing placenames and/or longitude/latitude coordinates, set matching parameters, and get back a CSV file of possible matches.
  • An online Application Programming Interface (API) compatible with the OpenRefine data-cleaning tool.
The first manifestation is relatively self-documenting. This blog post is about using the second, the API, with OpenRefine.

Reconciliation


I.e., matching (collating, aligning) your placenames against places in Pleiades.

Running OpenRefine against Geocollider for reconciliation purposes is as easy as:
When you've worked through the results of your reconciliation process and selected matches, OpenRefine will have added the corresponding Pleiades place URIs to your dataset. That may be all you want or need (for example, if you're preparing to bring your own dataset into the Pelagios network). If so, just export the results and go on with your work.

But if you'd like to actually get information about the Pleiades places, proceed to the next section.

Augmentation


I.e., pulling data from Pleiades into OpenRefine and selectively parsing it for information to add to your dataset.

Pleiades provides an API for retrieving information about each place resource it contains. One of the data formats this API provides is JSON, which OpenRefine is designed to work with. The following recipe demonstrates how to use the General Refine Expression Language (GREL) to extract the "Representative Location" associated with each Pleiades place.

Caveat: at present, this recipe does not work with the current Mac OS X release of OpenRefine (2.7), even though it should and, hopefully, eventually will. It has not been tested with the current releases for Windows and Linux, but they probably suffer from the same limitations as the OS X release. More information, including a non-trivial technical workaround, may be had from OpenRefine Issue 1265. I will update this blog post if and when a resolution is forthcoming.

1. Create a new column containing Pleiades JSON. 

Assuming your dataset is open in an OpenRefine project and that it contains a column that has been reconciled using Geocollider, select the drop-down menu on that column and choose "Edit column" -> "Add column by fetching URLs ..."

Screen capture of OpenRefine column drop-down menu: add column by fetching URLs

In the dialog box, provide a name for the new column you are about to create. In the "expression" box, enter a GREL expression that retrieves the Pleiades URL from the reconciliation match on each cell and appends the string "/json" to it:
cell.recon.match.id + "/json"

Screen capture of OpenRefine dialog box: add column by fetching URLs

OpenRefine retrieves the JSON for each matched place from Pleiades and inserts it into the appropriate cell in the new column. 
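For readers who want to check outside OpenRefine what step 1 fetches, here is a minimal Python sketch of the same idea: append "/json" to a Pleiades place URI and download the result. The function names are hypothetical (they are not part of OpenRefine, Geocollider, or Pleiades), and stripping a trailing slash is a small addition that the GREL expression above does not perform.

```python
import json
from urllib.request import urlopen


def pleiades_json_url(place_uri: str) -> str:
    """Build the JSON URL for a Pleiades place URI.

    Mirrors the GREL expression cell.recon.match.id + "/json",
    except that a trailing slash (if any) is removed first.
    """
    return place_uri.rstrip("/") + "/json"


def fetch_place(place_uri: str, timeout: float = 30.0) -> dict:
    """Download and decode the JSON for one place (requires network access)."""
    with urlopen(pleiades_json_url(place_uri), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_place with the URI of a matched place should return a dictionary holding the same JSON that OpenRefine stores as text in the new column.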

2. Create another new column by parsing the representative longitude out of the JSON.

From the drop-down menu on the column containing JSON, select "Edit column" -> "Add column based on this column..."
Screen capture of OpenRefine column drop-down menu: add column based on this column


In the dialog box, provide a name for the new column. In the expression box, enter a GREL expression that extracts the longitude from the reprPoint object in the JSON:
value.parseJson()['reprPoint'][0]

Screen capture of OpenRefine column dialog box: add column based on this column


Note that reprPoint contains a two-element list, like:
[ 37.328382, 38.240638 ]
Pleiades follows the GeoJSON specification in ordering coordinate pairs as longitude, latitude. To get the longitude, therefore, use index 0, the first element in the list.

3. Create a column for the latitude.

Use the method explained in step 2, but select the second list item from reprPoint (index 1):
value.parseJson()['reprPoint'][1]
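The extraction in steps 2 and 3 can also be sketched in Python, which may help when verifying a few rows by hand. The function name is hypothetical, and the handling of a missing or null reprPoint is an assumption about places that have no representative location; the recipe above does not cover that case.

```python
import json
from typing import Optional, Tuple


def representative_point(place_json: str) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) from a Pleiades JSON string.

    Returns None when reprPoint is absent, null, or too short
    (an assumed case, not one described in the recipe).
    """
    data = json.loads(place_json)
    point = data.get("reprPoint")
    if not point or len(point) < 2:
        return None
    # GeoJSON ordering: index 0 is longitude, index 1 is latitude.
    return float(point[0]), float(point[1])


sample = '{"reprPoint": [37.328382, 38.240638]}'
print(representative_point(sample))  # (37.328382, 38.240638)
```

Unpacking the result as longitude, latitude = representative_point(sample) gives the same two values that the GREL expressions with indexes 0 and 1 place in separate columns.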

4. Carry on ...

Your dataset in OpenRefine will now look something like this:
screen capture showing portion of an OpenRefine table that includes an ancient toponym, JSON retrieved from Pleiades, and latitude and longitude values extracted from that JSON