

Friday, October 13, 2017

Using OpenRefine with Pleiades

This past summer, DC3's Ryan Baumann developed a reconciliation service for Pleiades. He's named it Geocollider. It has two manifestations:

  • Upload a CSV file containing placenames and/or longitude/latitude coordinates, set matching parameters, and get back a CSV file of possible matches.
  • An online Application Programming Interface (API) compatible with the OpenRefine data-cleaning tool.
The first is relatively self-documenting. This blog post is about using the second with OpenRefine.

Reconciliation


I.e., matching (collating, aligning) your placenames against places in Pleiades.

Running OpenRefine against Geocollider for reconciliation is straightforward: add Geocollider to OpenRefine as a standard reconciliation service and reconcile your placename column against it.
When you've worked through the results of your reconciliation process and selected matches, OpenRefine will have added the corresponding Pleiades place URIs to your dataset. That may be all you want or need (for example, if you're preparing to bring your own dataset into the Pelagios network) ... just export the results and go on with your work. 
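For the curious, the exchange between OpenRefine and Geocollider follows OpenRefine's standard reconciliation API. Here's a minimal Python sketch of that exchange; the endpoint URL is a placeholder (use the one published on the Geocollider site), and the query and result shapes are those of the generic reconciliation protocol rather than anything Geocollider-specific:

import json
import requests

ENDPOINT = "https://example.org/geocollider/reconcile"  # placeholder endpoint

# the reconciliation protocol takes a JSON object of named queries ...
queries = {"q0": {"query": "Aphrodisias", "limit": 3}}
response = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})

# ... and returns candidate matches, each with an id, a label, a score, and a match flag
for candidate in response.json()["q0"]["result"]:
    print(candidate["id"], candidate["name"], candidate["score"], candidate["match"])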

But if you'd like to actually get information about the Pleiades places, proceed to the next section.

Augmentation


I.e., pulling data from Pleiades into OpenRefine and selectively parsing it for information to add to your dataset.

Pleiades provides an API for retrieving information about each place resource it contains. One of the data formats this API provides is JSON, which is a format with which OpenRefine is designed to work. The following recipe demonstrates how to use the General Refine Expression Language to extract the "Representative Location" associated with each Pleiades place. 

Caveat: this recipe will not, at present, work with the current Mac OSX release of OpenRefine (2.7), even though it should and hopefully eventually will.  It has not been tested with the current releases for Windows and Linux, but they probably suffer from the same limitations as the OSX release. More information, including a non-trivial technical workaround, may be had from OpenRefine Issue 1265. I will update this blog post if and when a resolution is forthcoming.

1. Create a new column containing Pleiades JSON. 

Assuming your dataset is open in an OpenRefine project and that it contains a column that has been reconciled using Geocollider, select the drop-down menu on that column and choose "Edit column" -> "Add column by fetching URLs ..."

Screen capture of OpenRefine column drop-down menu: add column by fetching URLs

In the dialog box, provide a name for the new column you are about to create. In the "expression" box, enter a GREL expression that retrieves the Pleiades URL from the reconciliation match on each cell and appends the string "/json" to it:
cell.recon.match.id + "/json"

Screen capture of OpenRefine dialog box: add column by fetching URLs

OpenRefine retrieves the JSON for each matched place from Pleiades and inserts it into the appropriate cell in the new column. 
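For orientation, the JSON that comes back for each place looks something like this (heavily abridged, with illustrative values; the real records also carry names, locations, and other details):

{
  "title": "Example Place",
  "reprPoint": [37.328382, 38.240638]
}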

2. Create another new column by parsing the representative longitude out of the JSON.

From the drop-down menu on the column containing JSON, select "Edit column" -> "Add column based on this column..."
Screen capture of OpenRefine column drop-down menu: add column based on this column


In the dialog box, provide a name for the new column. In the expression box, enter a GREL expression that extracts the longitude from the reprPoint object in the JSON:
value.parseJson()['reprPoint'][0]

Screen capture of OpenRefine column dialog box: add column based on this column


Note that the reprPoint object contains a two-element list, like:
[ 37.328382, 38.240638 ]
Pleiades follows the GeoJSON specification, which orders coordinate pairs as longitude, latitude; so, to get the longitude, you use the index (0) of the first element in the list.

3. Create a column for the latitude

Use the method explained in step 2, but select the second list item from reprPoint (index=1).
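That is, the GREL expression is the same except for the index:
value.parseJson()['reprPoint'][1]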

4. Carry on ...

Your data set in OpenRefine will now look something like this:
screen capture showing portion of an OpenRefine table that includes an ancient toponym, JSON retrieved from Pleiades, and latitude and longitude values extracted from that JSON


Thursday, June 20, 2013

It is happening

A couple of hours ago, I was sitting out on the back deck with my wife and pets, enjoying perfect temperatures, morning birdsong, lavender-scented country air, and a cup of freshly brewed Costa Rican coffee (roasted by the good folks at the Kaffeeklatsch in Huntsville). Idyllic.

I was flipping through the latest news stories, blog posts, and such, brought into my phone by my feed reader (currently Feedly). I was trying to ignore the omnipresent bad news of the world, when this popped up:

screen capture of a feed summary in Feedly on my Android phone
Forma[m] Lollianus fecit?!? I'm predisposed by my dissertation experience to trigger on certain Latin and Greek words because of their sometime significance for the study of Roman geography. Forma is of course one of those words, and it does (probably more often than justified) get translated as "map" or "plan." Could this be — admittedly against the odds — an inscription on a map or plan drafted or surveyed by some guy named Lollianus?

If you're me, the possibility warrants a click-through to a corresponding record in the Heidelberg Epigraphic Databank (EDH).

My mappish hopes were quickly dashed, but just as quickly were replaced by interest in a group of inscribed objects I hadn't run across before: mirrors from Roman Dacia bearing makers' inscriptions. "Forma" can mean "mirror"? A quick check of Lewis & Short at Perseus doesn't support that idea, but builds confidence in a better interpretation: "mold, stamp, form". Was this mirror, or some part of it, somehow cast or stamped out? The EDH entry tells me there are 9 identical mirrors extant and that the inscription goes around the "Fassung" (frame?). Yup.

Cool. I learned something today before breakfast. And it's knowledge I can use when I come back to doing more with the geographical/mapping/surveying vocabulary.

And then it hits me: that's not information I went looking for, not a search I initiated. New information of interest was pushed to me because I had previously used a software tool to express interest in a number of information sources including, but not limited to, ancient inscriptions. The software kept an eye on new output from those sources and made it available to me for review and engagement in a mode and at a time and place of my choosing. And because the source data was online, open, and linked in a standard format, I was able to drink coffee and pet my dog on the back deck in Moontown, Alabama while making use of the scholarly work done yesterday(!) by Brigitte Gräf in Heidelberg, Germany.

Isn't this one of the things we've been working toward?

How did that happen?


Sometime earlier this year, Frank Grieshaber in Heidelberg rolled out web page listings and corresponding Atom feeds of recently changed content in the EDH (e.g., latest updates to the inscriptions database). I added them, along with similar data-oriented feeds, to a feed aggregator I dubbed Planet Potamos (with "Potamos" trying lamely to evoke a rushing river of data; the "Planet" acknowledges the feed aggregation software I use). I put the same feed subscriptions into my personal feed reader (I could also have subscribed to the Potamos aggregator's own feed, but it only updates periodically and I'm an immediacy junkie). I installed and configured my feed reader on every device I use.

The rest is magic. Magic made the old-fashioned way by lots of people in many different places and times developing standards, building software, creating data, doing research, and sharing.

What next?


Well, I hope that Frank and his colleagues in Heidelberg will eventually add thumbnail images (where they have them) to the EDH feeds. I hope that the other epigraphic databases (and indeed all kinds of ancient studies web applications) will set up similar feeds. I hope that we can all start using more linked-data approaches in and alongside such feeds in order to communicate seminal interpretive/discovery facets (like geography, personography, temporality and genre) in machine-actionable ways. I hope the spirit and practice of openness that lubricates and accelerates this sort of synergy continues to grow and flower.

As for me, I'm thinking about how I might set up some kind of filtering mechanism that would highlight or prioritize content in my feed reader that's potentially relevant to my (e.g.) geo/map/survey vocabulary interests. Hmmmmm....


Thursday, April 18, 2013

Citing Sources in Digital Annotations

I'm collaborating with other folks both in and outside ISAW on a variety of digital scholarly projects in which Linked Open Data is playing a big role. We're using the Resource Description Framework (RDF) to provide descriptive information for, and make cross-project assertions about, a variety of entities of interest and the data associated with them (places, people, themes/subjects, creative works, bibliographic items, and manuscripts and other text-bearing objects). So, for example, I can produce the following assertions in RDF (using the Terse RDF Triple Language, or TuRTLe):

<http://syriaca.org/place/45> a <http://geovocab.org/spatial#Feature> ;
  rdfs:label "Serugh" ;
  rdfs:comment "An ancient city where Jacob of Serugh was bishop."@en ;
  foaf:primaryTopicOf <http://en.wikipedia.org/wiki/Suruç> ;
  owl:sameAs <http://pleiades.stoa.org/places/658405#this> .

This means: 'There's a resource identified with the Uniform Resource Identifier (URI) "http://syriaca.org/place/45" about which the following is asserted:
  • it is a spatial feature (in the sense of the GeoVocab spatial vocabulary);
  • it is labeled "Serugh";
  • it can be described, in English, as "an ancient city where Jacob of Serugh was bishop";
  • it is the primary topic of the Wikipedia article on Suruç; and
  • it is the same place as the one Pleiades identifies with the URI http://pleiades.stoa.org/places/658405#this.'
(Folks familiar with what Sean Gillies has done for the Pleiades RDF will recognize my debt to him in what precedes.)

But there are plenty of cases in which just issuing a couple of triples to encode an assertion about something isn't sufficient; we need to be able to assign responsibility/origin for those assertions and to link them to supporting argument and evidence (i.e., standard scholarly citation practice). For this purpose, we're very pleased by the Open Annotation Collaboration, whose Open Annotation Data Model was recently updated and expanded in the form of a W3C Community Draft (8 February 2013) (the participants in Pelagios use basic OAC annotations to assert geographic relationships between their data and Pleiades places).


A basic OADM annotation uses a series of RDF triples to link together a "target" (the thing you want to make an assertion about) and a "body" (the content of your assertion). You can think of them as footnotes. The "target" is the range of text after which you put your footnote number (only in OADM you can add a footnote to any real, conceptual, or digital thing you can identify) and the "body" is the content of the footnote itself. The OADM draft formally explains this structure in section 2.1. This lets me add an annotation to the resource from our example above (the ancient city of Serugh) by using the URI "http://syriaca.org/place/45" as the target of an annotation, thus:
<http://syriaca.org/place/45/anno/desc6> a oa:Annotation ;
  oa:hasBody <http://syriaca.org/place/45/anno/desc6/body> ;
  oa:hasTarget <http://syriaca.org/place/45> ;
  oa:motivatedBy oa:describing ;
  oa:annotatedBy <http://syriaca.org/editors.xml#tcarlson> ;
  oa:annotatedAt "2013-04-03T00:00:01Z" ;
  oa:serializedBy <https://github.com/paregorios/srpdemo1/blob/master/xsl/place2ttl.xsl> ;
  oa:serializedAt "2013-04-17T13:35:05.771-05:00" .

<http://syriaca.org/place/45/anno/desc6/body> a cnt:ContentAsText, dctypes:Text ;
  cnt:chars "an ancient town, formerly located near Sarug."@en ;
  dc:format "text/plain" .

I hope you'll forgive me for not spelling that all out in plain text, as all the syntax and terms are explained in the OADM. What I'm concerned about in this blog post is really what the OADM doesn't explicitly tell me how to do, namely: show that the annotation body is actually a quotation from a published book. The verb oa:annotatedBy lets me indicate that the annotation itself was made (i.e., the footnote was written) by a resource identified by the URI "http://syriaca.org/editors.xml#tcarlson". If I'd given you a few more triples, you could have figured out that that resource is a real person named Thomas Carlson, who is one of the editors working on the Syriac Reference Portal project. But how do I indicate (as he very much wants to do because he's a responsible scholar and has no interest in plagiarizing anyone) that he's deliberately quoting a book called The Scattered Pearls: A History of Syriac Literature and Sciences? Here's what I came up with (using terms from Citation Typing Ontology and the DCMI Metadata Terms):
<http://syriaca.org/place/45/anno/desc7/body> a cnt:ContentAsText, dctypes:Text ;
  cnt:chars "a small town in the Mudar territory, between Ḥarran and Jarabulus. [Modern name, Suruç (tr.)]"@en ;
  dc:format "text/plain" ;
  cito:citesAsSourceDocument <http://www.worldcat.org/oclc/255043315> ;
  dcterms:bibliographicCitation "The Scattered Pearls: A History of Syriac Literature and Sciences, p. 558"@en .

The addition of the triple containing cito:citesAsSourceDocument lets me make a machine-actionable link to the additional structured bibliographic data about the book that's available at Worldcat (but it doesn't say anything about page numbers!). The addition of the triple containing dcterms:bibliographicCitation lets me provide a human-readable citation.
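For anyone who wants to load these snippets into a triplestore, here are the prefix declarations they assume. These are just the standard namespace bindings for the vocabularies named above, not an excerpt from the project's own files:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix cnt: <http://www.w3.org/2011/content#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix cito: <http://purl.org/spar/cito/> .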

I'd love to have feedback on this approach from folks in the OAC, CITO, DCTERMS, and general linked data communities. Could I do better? Should I do something differently?


The SRP team is currently evaluating a sample batch of such annotations, which you're also welcome to view. The RDF can be found here. These files are generated from the TEI XML here using the XSLT here.

Thursday, October 4, 2012

Pleiades Machine Tags for Blog Posts? Yes!

So, a few minutes ago I noticed a new post in my feed reader from a blog I've admired for a while: Javier Andreu Pintado's Oppida Imperii Romani. I've thought for a long time that I ought to get in touch with him (we don't know each other from Adam as far as I know) and see if we could figure out a more-or-less automated way to get his posts to show up on the associated Pleiades pages.

Then it hit me:

Why can't we just use labels incorporating Pleiades IDs like we've been doing with machine tags on Flickr and query the Blogger API to get the associated posts?

Why not indeed. It turns out it just works.

To test, I added the string "pleiades:depicts=579885" as a label on my blog post from last December, "Pleiades, Flickr, and the Ancient World Image Bank" (since that tag is used in an example in that post; I recognize that the blog post doesn't actually depict that place, which is what that label term ought to mean, but this is just a test).

Then I went to the Google APIs Explorer page for the Blogger "list posts" function (which I found by googling) and entered my blog's ID and the label string in the appropriate fields.



And, in a matter of milliseconds, I got back a JSON representation of my blog post.
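If you'd rather not click through the APIs Explorer, the same query is a single GET against the Blogger API's posts list, filtered by label. A sketch in Python, assuming the v3 endpoint; BLOG_ID and API_KEY are placeholders you supply yourself:

import requests

BLOG_ID = "1234567890"    # placeholder: your blog's numeric ID
API_KEY = "YOUR_API_KEY"  # placeholder: a Google API key

response = requests.get(
    "https://www.googleapis.com/blogger/v3/blogs/%s/posts" % BLOG_ID,
    params={"labels": "pleiades:depicts=579885", "key": API_KEY},
)
for post in response.json().get("items", []):
    print(post["title"], post["url"])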



So now I'm thinking we might explore the possibility of creating a widget on Pleiades place pages to feature blog posts tagged like this from selected blogs. It appears that, to execute the API queries against Blogger, we have to do them blog-by-blog with known IDs, but that's probably OK anyway so we can curate the process of blog selection and prevent spam.

It occurs to me that the Pelagios community might be interested in looking at this approach in order to build a gateway service to inject blog posts into the Pelagios network.

And while I'm name-checking, I wonder if any Wordpress aficionados out there can come up with a functionally equivalent mechanism.

Friday, June 1, 2012

Ancient Studies Needs Open Bibliographic Data and Associated URIs

Update 1:  links throughout, minor formatting changes, proper Creative Commons Public Domain tools, parenthetical about import path from Endnote and such, fixing a few typos.

The NEH-funded Linked Ancient World Data Institute, still in progress at ISAW, has got me thinking about a number of things. One of them is bibliography and linked data. Here's a brain dump, intended to spark conversation and collaboration.

What We Need

  • As much bibliographic data as possible, for both primary and secondary sources (print and digital), publicly released to third parties under either a public domain declaration or an unrestrictive open license.
  • Stable HTTP URIs for every work and author included in those datasets.

Why

Bibliographic and citation collection and management are integral to every research and publication project in ancient studies. We could save each other a lot of time, and get more substantive work done in the field, if it were simpler and easier to do. We could more easily and effectively tie together disparate work published on the web (and appearing on the web through retrospective digitization) if we had a common infrastructure and shared point of reference. There's already a lot of digital data in various hands that could support such an effort, but a good chunk of it is not out where anybody with good will and talent can get at it to improve it, build tools around it, etc.

What I Want You (and Me) To Do If You Have Bibliographic Data
  1. Release it to the world through a third party. No matter what format it's in, give a copy to someone else whose function is hosting free data on the web. Dump it into a public repository at github.com or sourceforge.net. Put it into a shared library at Zotero, Bibsonomy, Mendeley, or another bibliographic content website (most have easy upload/import paths from Endnote, and other citation management applications). Hosting a copy yourself is fine, but giving it to a third party demonstrates your bona fides, gets it out of your nifty but restrictive search engine or database, and increments your bus number.
  2. Release it under a Creative Commons Public Domain Mark or Public Domain Dedication (CC0). Or if you can't do that, find as open a Creative Commons or similar license as you can. Don't try to control it. If there's some aspect of the data that you can't (because of rights encumbrance) or don't want to (why?) give away to make the world a better place, find a quick way to extract, filter, or excerpt that aspect and get the rest out.
  3. Alert the world to your philanthropy. Blog or tweet about it. Post a link to the data on your institutional website. Above all, alert Chuck Jones and Phoebe Acheson so it gets announced via Ancient World Online and/or Ancient World Open Bibliographies.
  4. Do the same if you have other useful data, like identifiers for modern or ancient works or authors.
  5. Get in touch with me and/or anyone else to talk about the next step: setting up stable HTTP URIs corresponding to this stuff.
Who I'm Talking To

First of all, I'm talking to myself, my collaborators, and my team-mates at ISAW. I intend to eat my own dogfood.

Here are other institutions and entities I know about who have potentially useful data.
  • The Open Library : data about books is already out there and available, and there are ways to add more
  • Perseus Project : a huge, FRBR-ized collection of MODS records for Greek and Latin authors, works, and modern editions thereof.
  • Center for Hellenic Studies: identifiers for Greek and Latin authors and works
  • L'Année Philologique and its institutional partners like the American Philological Association: the big collection of analytic secondary bibliography for classics (journal articles)
  • TOCS-IN: a collaboratively collected batch of analytic secondary bibliography for classics
  • Papyri.info and its contributing project partners: TEI bibliographic records for  much of the bibliography produced for or cited by Greek and Latin papyrologists (plus other ancient language/script traditions in papyrology)
  • Gnomon Bibliographische Datenbank: masses of bibliographic data for books and articles for classics
  • Any and every university library system that has a dedicated or easily extracted set of associated catalog records. Especially any with unique collections (e.g., Cincinnati) or those with databases of analytical bibliography down to the level of articles in journals and collections.
  • Ditto any and every ancient studies digital project that has bibliographic data in a database.
Comments, Reactions, Suggestions

Welcome, encouraged, and essential. By comment here or otherwise (but not private email please!).

Friday, November 4, 2011

It's all coming together at PELAGIOS

For years (over a decade in fact) we've been dreaming and talking about linking up ancient world resources on the web along the thematic axis of geography. Pleiades was launched in no small part in pursuit of that vision. And today comes more proof -- to which many can relate -- that hard work, collaboration, and openness bear really tasty fruit.
The Perseus geospatial data now includes annotations of ancient places with Pleiades URIs. Beginning next week, the Places widget in the Perseus interface will include links to download the Pleiades annotations in OAC compliant RDF format. These links will appear for any text with place entity markup which also has places from this dataset. We are also providing a link to search on the top five most frequently mentioned of these places in the Pelagios graph explorer.
(Check out the rest of the story, which provides a screenshot of the interface changes and a step-by-step description of how the work was done).

How did this come to be possible? Here's a very much abridged history:

  • Perseus built a path-breaking, web-based digital library of resources for the study of the ancient world; released a bunch of their code and content under open licenses; and managed the geographic aspects of the content as data
  • Pleiades built on and marshaled the efforts of the Classical Atlas Project, the Digital Atlas of Roman and Medieval Civilization, and other collaborators to publish an ever-improving geographic dataset on the web under a permissive open license
  • Leif Isaksen, on behalf of the Google Ancient Places project, took that dataset, mashed it up with another open geographical dataset (GeoNames) and published the results (Pleiades+) under a public domain declaration (more openness).
  • The PELAGIOS team took Pleiades+ and started matching it with their data. Perseus is just the latest member of that team to do so, and there are more on the way.
The resulting interface enhancements Perseus is announcing today are just the latest visible example of how the web of people benefits from the creation and exploitation of the web of data, and it's all super-charged by openness.

I'm grateful to the hard-working folks, and the array of funding agencies and host institutions, whose commitment and support are making these dreams come true.

Thursday, December 17, 2009

Interoperation with Pleiades

I've had a few questions lately about how other web-based publications could be designed to support interoperation with Pleiades. Here's my working advice:

Any project that wants to lay the groundwork for geographic interoperability on the basis of Pleiades should:

1. Capture and manage Pleiades identifiers (stable URLs like http://pleiades.stoa.org/places/638753/) for each place one might want to cite.

2. Request membership in the Pleiades community and add/modify content therein as necessary in order to create new resources (and new URLs) for places that Pleiades doesn't yet document, but which are provably historical and relevant to content controlled by the external project.

3. Capture and manage stable URLs from Wikipedia or GeoNames that correspond to modern geographic entities that are relevant to the content controlled by the external project. Don't conflate modern and ancient locations, as this will eventually lead to heartbreak.

4. Emit paged web feeds in the Atom Syndication Format (RFC 4287) that also conform to the guidance documented (with in-the-wild, third-party examples) at:

http://www.atlantides.org/trac/concordia/wiki/ConcordiaAtomFeeds

and make use of the terms defined at

http://www.atlantides.org/trac/concordia/wiki/ConcordiaThesaurus

to indicate publicly relationships such as "findspot" and "original location" between the content controlled by the external project, Pleiades resources, Wikipedia resources, GeoNames resources and resources published by other third parties (an example entry is sketched after this list).

5. Alert us so we can include the entry-point URL for the feeds in the seeded search horizon list for the web crawler and search index service we are developing.
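To make item 4 more concrete, here is a minimal, purely illustrative Atom entry of the kind an external project might emit for one of its objects. The rel value on the Pleiades link is a placeholder; the real relationship terms ("findspot", "original location", and so on) are the ones defined on the ConcordiaThesaurus page linked above:

<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/inscriptions/12345</id>
  <title>Example inscription</title>
  <updated>2009-12-17T00:00:00Z</updated>
  <link rel="alternate" type="text/html" href="http://example.org/inscriptions/12345.html"/>
  <!-- placeholder rel; substitute the appropriate term from the Concordia thesaurus -->
  <link rel="findspot" href="http://pleiades.stoa.org/places/638753/"/>
</entry>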

You can see how the Epigraphic Databank Heidelberg team has been thinking about how to accomplish this at:

http://www.atlantides.org/trac/concordia/wiki/PleiadesMoI

and

http://www.atlantides.org/trac/concordia/wiki/EDHgeographyTable

Thursday, November 19, 2009

Bridging Institutional Repository and Bibliographic Management

As an institution, ISAW has an interest in disseminating, preserving and promoting the research products and publications of its faculty, research staff, students, affiliates and collaborators. Our parent institution, NYU, has made a commitment to the persistent dissemination of such materials when voluntarily contributed to its Faculty Digital Archive (FDA). We'll use the FDA as a locus for materials that fit well into DSpace (with which the FDA is realized) and that aren't rights-constrained. But we also need mechanisms for developing and publishing the whole bibliographic story of a particular faculty member, research group, project or conference with links from the individual entries to digital copies wherever they may be (e.g., the FDA, JSTOR, Internet Archive, Google Books). For this function, we like Zotero. Atop Zotero's robust and ubiquitous feed documents, we can build interoperability with our website and other tools and venues in a way that is also completely visible to commercial and third-party search and discovery tools.

There will be a number of iterations necessary to reach a fully robust solution, but we're already taking some of the first steps.

As an early experiment with the FDA, we had a student assistant input all of my boss's articles in PDF format, along with descriptive metadata (see: Roger Bagnall's Publications). The default metadata schema in the FDA wasn't a perfect fit for journal article citations, but the FDA staff is now working with us to extend the schema to meet our needs. We're using the Zotero data model as a guide.

Given that the metadata in this collection is the only structured dataset around for Roger's articles, I wanted to be able to get it all back out to use for other things. The FDA does provide web feeds, but (unlike Zotero) these aren't comprehensive for a given context and don't incorporate all the metadata fields. But we can use FDA's OAI-PMH interface to get the full metadata with a query like:

http://archive.nyu.edu/request?verb=ListRecords&metadataPrefix=oai_dc&set=hdl_2451_28115

where "hdl_2451_28115" is the identifier for the "Roger Bagnall's Publications" container I linked to above. (Special thanks to Ekaterina Pechekhonova on the NYU Digital Library team, who helped me with syntax).

As a further experiment, I wrote an XSL transform to convert the OAI-PMH XML document into the RDF XML Zotero can import. There are a couple of inelegant hacks in the transform (mainly to get at substrings within single fields), but I'm still happy with the results. The import into Zotero went smoothly:

http://www.zotero.org/paregorios/items/collection/1505597

Next steps: move this to a shared Zotero library so Roger, a student assistant and members of our digital projects team can collaborate to enter the rest of the publications (books, book sections, etc.) and fix any errors in the article records. Then we'll look at the process for using that metadata (via another transform) to help us populate the FDA. We'll also start working on parsing and aggregating Zotero's feeds for use on our website (in Roger's online profile and aggregated with other affiliates' feeds to provide a "recent publications" section).

We're also experimenting with Zotero for the bibliography of our Pleiades project (a collaborative online gazetteer of the Greek and Roman world), and as a component in a potential replacement for the Checklist of Editions of Greek, Latin, Demotic and Coptic Papyri, Ostraca and Tablets. On a more personal level, I've taken to doing all my bookmarking with Zotero and have set up a folder in my library (with associated feed) so that colleagues can follow what I'm citing on a daily basis.

Tuesday, June 9, 2009

Determining BAtlas IDs for future Pleiades interoperation

For those who are working with datasets they'd like eventually to link up with Pleiades, we created the Barrington Atlas ID scheme. I've just posted some more tools for helping you determine the BAtlas IDs to go with your existing geographic names or other information.

There's now a draft "Barrington Atlas Index with Identifiers". In PDF (watch out: 7.2 MB) it looks like:


It's also available in a 1.0 MB zip-compressed HTML version, with somewhat semantic class attributes on spans that could be used to parse out different themes ahead of an attempt to match it to a names list:

And of course there is already the home-brewed XML format we distributed the original IDs in (last release tar-gzipped archive):

Share and enjoy!

Wednesday, January 28, 2009

The Concordia Graph

In yesterday's post, I should also have linked directly to the working copy of the Concordia Graph ... persons, places, names, objects and some basic, history-oriented relationships between them ... a subset of what hopefully GAWD will eventually address (as non-idiosyncratically as possible).

Tuesday, January 27, 2009

Semantic Web, Scholarly Resources for Antiquity and the Museum

Our on-going work on geographically functional, cross-resource, machine-actionable citation(!) with the Web continues to get more interesting.

The kickoff was, of course, the joint NEH/JISC grant that is (under the rubric of the Concordia project) funding our look at this in collaboration with the Centre for Computing in the Humanities at King's College, London. Our two workshops (and lots of discussion with other parties in between) have led us through KML, Atom+GeoRSS, citation vocabularies and OAI/ORE on to Cool URIs, Linked Data, CIDOC CRM and more.

Traffic is now building on the Graph of Ancient World Data discussion group (e.g., Sebastian Heath's post on coin hoard data at nomisma.org). Yesterday, Sean Gillies rolled out some changes to the Pleiades interface that provide #this endpoints for Pleiades places, so that Sebastian and others can make explicit reference either to the historical places themselves (non-information resources cited like http://pleiades.stoa.org/places/639166#this) or our descriptions of them on the web (information resources, cited like http://pleiades.stoa.org/places/639166/).

And then this afternoon I came across the latest Talis Semantic Web podcast, featuring Koven Smith on Semantic Web initiatives at the Metropolitan Museum of Art. 38 minutes well-spent. They're thinking about and exploring a number of the approaches and technologies we're interested in, but from a museum perspective. It would be interesting to discuss how these methods could be used to better bridge gaps between museums, field archaeologists, epigraphers, numismatists, papyrologists, prosopographers, historical geographers, librarians, archivists and the rest!

Thursday, July 10, 2008

Barrington Atlas IDs

Update: follow the batlasids tag trail for follow-ups.

Back in February, I blogged about clean URLs and feed aggregation. In March, we learned about the ORE specification for mapping resource aggregations in Atom XML, just as we were gearing up to start work on the Concordia project, with support from the US National Endowment for the Humanities and the UK Joint Information Services Committee.

Our first workshop was held in May. One of the major outcomes was a to-do for me: provide a set of stable identifiers for every citable geographic feature in the Barrington Atlas so collaborators could start publishing resource maps and building interoperation services right away, without waiting for the full build-out of Pleiades content (which will take some time).

The first fruits can be downloaded at: http://atlantides.org/batlas/ . All content under that URL is licensed cc-by. Back versions are in dated subdirectories.

There you'll find XML files for 3 of the Atlas maps (22, 38 and 65). There's only one feature class for which we don't provide IDs: roads. More on why not another time. I'll be adding files for more of the maps as quickly as I can, beginning with Egypt and the north African coast west from the Nile delta to Tripolitania (the Concordia "study area"). Our aim is full coverage for the Atlas within the next few months.

What do you get in the files?


IDs (aka aliases) for every citable geographic feature in the Barrington Atlas. For example:
  • BAtlas 65 G2 Ouasada = ouasada-65-g2
If you combine one of these aliases with the "uribase" also listed in the file (http://atlantides.org/batlas/) you get a Uniform Resource Identifier for that feature (this should answer Sebastian Heath's question).
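So, for example, the Ouasada alias above yields:
  • http://atlantides.org/batlas/ouasada-65-g2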

For features with multiple names, we provide multiple aliases to facilitate ease of use for our collaborators. For example, for BAtlas 65 A2 Aphrodisias/Ninoe, any of the following aliases are valid:
  • aphrodisias-ninoe-65-a2
  • aphrodisias-65-a2
  • ninoe-65-a2
Features labeled in the Atlas with only a number are also handled. For example, BAtlas 38 C1 no. 9 is glossed in the Map-by-Map Directory with the location description (modern names): "Siret el-Giamel/Gasrin di Beida". So, we produce the following aliases, all valid:
  • (9)-38-c1
  • (9)-siret-el-giamel-gasrin-di-beida-38-c1
  • (9)-siret-el-giamel-38-c1
  • (9)-gasrin-di-beida-38-c1
Most unlabeled historical/cultural features also get identifiers. For example:
  • Unnamed aqueduct at Laodicea ad Lycum in BAtlas 65 B2 = aqueduct-laodicea-ad-lycum-65-b2
  • Unnamed bridge at Valerian in BAtlas 22 B5 = bridge-valeriana-22-b5
Unlocated toponyms and false names (appearing only in the Map-by-Map Directory) get treated like this:
  • BAtlas 22 unlocated Acrae = acrae-22-unlocated
  • BAtlas 38 unlocated Ampelos/Ampelontes? = ampelos-ampelontes-38-unlocated = ampelos-38-unlocated = ampelontes-38-unlocated
  • BAtlas 65 false name ‘Itoana’ = itoana-65-false
The XML files also provide associated lists of geographic names, formatted BAtlas citations and other information useful for searching, indexing and correlating these entries with your own existing datasets. What you don't get is coordinates. That's what the Pleiades legacy data conversion work is for, and it's a slower and more expensive process.

Read on to find out how you can start using these identifiers now, and get links to the corresponding Pleiades data automatically as it comes on line over time.

Why do we need these identifiers?


Separate digital projects would like to be able to refer unambiguously to any ancient Greek or Roman geographic feature using a consistent, machine-actionable scheme. The Barrington Atlas is a stable, published resource that can provide this basis if we construct the corresponding IDs.

Even without coordinates, other projects can begin to interoperate with each other immediately, as long as they have a common scheme of identifiers. After using BAtlas URIs to normalize, control or annotate their geographic description, they can publish services or crosswalks that provide links for the relationships within and between their datasets. For example, for each record in a database of coins you might like links to all the other coins minted by the same city, or to digital versions (in other databases) of papyrus documents and inscriptions found at that site.

Moreover, we would like other projects to start using a consistent identifier scheme now, so that as Pleiades adds content we can build more interoperation around it (e.g., dynamic mapping, coordinate lookup, proximity search across multiple collections). To that end, Pleiades will provide redirects (303 see other) from Barrington Atlas URIs (following the scheme described here) as follows:
  • If a corresponding entry exists in Pleiades, the web browser will be redirected to that Pleiades page automatically
  • If there is not yet a corresponding entry in Pleiades, the web browser will be redirected to an HTML page providing a full human-readable citation of the Atlas, as well as information about this service
So, for example:
  • http://atlantides.org/batlas/aphrodisias-ninoe-65-a2 will re-direct to http://pleiades.stoa.org/places/638753
  • http://atlantides.org/batlas/vlahii-22-e4 will re-direct to http://atlantides.org/batlas/vlahii-22-e4.html until there is a corresponding Pleiades record
The HTML landing pages for non-Pleiades redirects are not in place yet, but we're working on it. We'll post again when that's working.

Why URIs for a discretely citable feature in a real-world, printed atlas?

I'll let Bizer, Cyganiak and Heath explain the naming of resources with URI references. In the parlance of "Linked Data on the Web," Barrington Atlas features are "non-information resources"; that is, they are non-digital/real-world discrete entities about which web authors and services may want to make assertions or around which to perform operations. What we are doing is creating a stable system for identifying and citing these resources so that those assertions and operations can be automated using standards-compliant web mechanisms and applications. The HTML pages to which web browsers will be automatically redirected constitute "information resources" that describe the "non-information resources" identified by the original URIs.

How

If I get a comment box full of requests for a blow-by-blow description of the algorithm, I'll post something on that. If you're really curious and energetic, have a look at the code. It's intended mostly for short-term, internal use, so it's not marvelously documented. Yes, it's a hack.

One of the big headaches was deciding how to decompose the complex labels into simple, clean ASCII strings that can be legal URL components. Sean blogged about that, and wrote some code to do it, shortly after the workshop.
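The gist of that decomposition, sketched in Python for anyone who hasn't read Sean's post (this is the general idea, not his actual code): strip diacritics, lowercase, and collapse anything that isn't a letter or digit into hyphens.

import re
import unicodedata

def slugify(label):
    # decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", label)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # lowercase and collapse runs of non-alphanumerics into single hyphens
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")

print(slugify("Aphrodisias/Ninoe"))   # prints: aphrodisias-ninoe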

Credit where credit is due

Sean and I had a lot of help from the workshop participants (Ben Armintor, Gabriel Bodard, Hugh Cayless, Sebastian Heath, Tim Libert, Sebastian Rahtz and Charlotte Roueché) in sorting out what to do here. Older, substantive conversations that informed this process (with these folks and others; notably Rob Chavez, Greg Crane, Ruth Mostern, Dan Pett, Ross Scaife†, Patrick Sims-Williams, Linda Smith and Neel Smith) go back as far as 2000, shortly after the Atlas was published.

Many thanks to all!

Examples in the Wild


Sebastian Rahtz has already mocked up an example service for the Lexicon of Greek Personal Names. It takes a BAtlas alias and returns you all the name records in their system that are associated with the corresponding place. So, for example:
  • http://clas-lgpn2.class.ox.ac.uk/batlas/aloros-50-b3
This is just one of several services that LGPN is developing. See the LGPN web services page, as well as the LGPN presentation to the Digital Classicist Seminar in London last month.

Sebastian Heath, for some time, has been incorporating Pleiades identifiers into the database records of the American Numismatic Society. He has blogged about that work in the context of Concordia.

Do you have an application? Let me know!

Thursday, May 8, 2008

Open Library API, Bibo Ontology and Digital Bibliographies

I bet we're going to want to fiddle with the Open Library API and the Bibo Ontology in the context of the Pleiades bibliography application (and some others we're thinking about, like a next-generation Checklist of Editions for papyri and the like).
  • Seek and get digital books from the Open Library.
  • Use Bibo in other-than-html serializations of the underlying MODS records, and maybe even microformatishly in the HTML version. (We already use COinS -- for interop with Zotero -- but it's lossy, ungainly and suboptimally human-readable).
Thanks to Dave Pattern (via planet code4lib) for the pointer to the Open Library API.

Wednesday, April 2, 2008

Concordia licensing and openness

Andy Powell hopes "that the conditions of funding in this case mandated that the resulting resources be made open rather than just free" and wonders what licenses will govern the content produced or incorporated by the various projects funded under the joint NEH/JISC Transatlantic Digitization Collaboration grants.

I can only speak for the Concordia project, a collaboration of ISAW and CCH.

In answer to the first question: no, I am not aware of any mandate placed on us in this regard. We did make explicit commitments of our own in our proposal about licensing, and we're now bound to abide by those.

Here is a list of the software and content that we will use, modify or produce, indicating the license that now governs (or will govern) each:

Wednesday, March 26, 2008

Subaudible to me: APIS News and Updates

No easily spotted webfeed for APIS News and Updates either. Bummer.

Concordia grant award

Yesterday I had the pleasure of attending a nice event at the Folger library during which the Chairman of the National Endowment for the Humanities announced the award of 5 grants under the NEH/JISC joint Transatlantic Digitization Collaboration rubric (press release).

I'm happy to report that Pleiades is part of one of the winning proposals. The award goes jointly to ISAW at NYU and to CCH/Classics at King's College, London for a collaboration we're calling "Concordia" (to reflect its focus on cross-project interoperability). The principal investigators are Roger Bagnall and Charlotte Roueché. Sean Gillies, Gabriel Bodard and I will join them in working on the project. The period of performance is 1 April 2008 - 31 March 2009.

What will we do?
Our advisory board:

Subaudible to me: Digital Medievalist News

I can't find a web feed for Digital Medievalist News. Bummer.

Tuesday, March 4, 2008

Behold the power of the ORE

Dan Cohen rocked my feed reader this morning with news that the Open Archives Initiative has unveiled the Object Reuse and Exchange (ORE) Specification. This initiative came in below my RADAR (as so many things do!); Dan's post is well worth a close reading, both as an introduction and as a rationale.

As I understand it so far, ORE provides a structured method for mapping relationships between digital resources (different formats, multiple versions, works that cite other works, reviews of works, etc.). Any party -- an author, an archivist, an (e)journal editor, an automated process -- can construct these maps and then publish them via a serialization format for use by other individuals and processes. As Dan writes:
Today's scholarship ... cannot be contained by web pages or PDFs put into an institutional repository, but rather consists of what the ORE team has termed “aggregates,” or constellations of digital objects that often span many different web servers and repositories ... By forging semantic links between pieces entailed in a work of scholarship [ORE] keeps those links active and dynamic and allows for humans, as well as machines that wish to make connections, to easily find these related objects. It also allows for a much better preservation path for digital scholarship because repositories can use ORE to get the entirety of a work and its associated constellation rather than grabbing just a single published instantiation of the work.
Sean and I have been poring over the ORE Spec for the last hour or so, and especially the section on the primary serialization format for ORE Resource Maps, which makes use of Atom.

Pleiades fans will already know that, at the beginning, Sean designed into our publication interface an Atom+GeoRSS serialization component (e.g., Pleiades Cyrene in Atom), and that he is a vocal advocate for RESTful geoapps that employ Atom and other appropriate formats. Last Friday, I gave a presentation about Atom+GeoRSS for cross-project interoperability to an audience at the British School in Rome; this approach has grown out of our Pleiades work. In comparing where we have been going with where ORE is going, it's clear that the practice is very close (as Sean points out). In coming days I'll be reworking the example to match the ORE spec, and we'll be doing some upgrades to our standard Pleiadic Atom feeds as well.