
Thursday, April 10, 2014

Batch XML validation at the command line

Updated 8 August 2017 to reflect changes in the installation pattern for jing.

Here's how to batch-validate a pile of XML files against a RelaxNG schema at the command line. I had help figuring this out from Hugh and Ryan at DC3:

$ find {searchpath} -name "*.xml" -print | parallel --tag jing {relaxngpath}
The find command hunts down all files ending with ".xml" in the directory tree under searchpath. The parallel command takes that list of files and fires off (in parallel) a jing validation run for each of them. The --tag option passed to parallel prefixes each line of jing's output with the name of the file being validated, so every error message identifies the file it came from. In my experience this is much faster than running the jing calls in sequence, e.g. with the -exec primary in find.
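For comparison, the sequential version using find's -exec primary would look something like this (same placeholder paths as above; find substitutes each filename where the {} appears):

$ find {searchpath} -name "*.xml" -exec jing {relaxngpath} {} \;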

As I'm running on a Mac, I had to install GNU Parallel and the Jing RelaxNG Validator. That's what Homebrew is for:
$ brew install jing-trang    # formerly: brew install jing
$ brew install parallel
NB: you may have to install an older version of Java before you can get the jing-trang formula to work in Homebrew (e.g., brew install java6).

What's the context, you ask? I have lots of reasons to want to be able to do this. The proximal cause was batch-validating all the EpiDoc XML files for the inscriptions that are included in the Corpus of Campā Inscriptions before regenerating the site for an update today. I wanted to see quickly if there were any encoding errors in the XML that might blow up the XSL transforms we use to generate the site. So, what I actually ran was:
$ curl -O http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng
$ find ./texts/xml -name '*.xml' -print | parallel --tag jing tei-epidoc.rng
Thanks to everybody who built all these tools!


Thursday, June 20, 2013

It is happening

A couple of hours ago, I was sitting out on the back deck with my wife and pets, enjoying perfect temperatures, morning birdsong, lavender-scented country air, and a cup of freshly brewed Costa Rican coffee (roasted by the good folks at the Kaffeeklatsch in Huntsville). Idyllic.

I was flipping through the latest news stories, blog posts, and such, brought into my phone by my feed reader (currently Feedly). I was trying to ignore the omnipresent bad news of the world, when this popped up:

[Screen capture of a feed summary in Feedly on my Android phone]
Forma[m] Lollianus fecit?!? I'm predisposed by my dissertation experience to trigger on certain Latin and Greek words because of their sometime significance for the study of Roman geography. Forma is of course one of those words, and it does (probably more often than justified) get translated as "map" or "plan." Could this be — admittedly against the odds — an inscription on a map or plan drafted or surveyed by some guy named Lollianus?

If you're me, the possibility warrants a click-through to a corresponding record in the Heidelberg Epigraphic Databank (EDH).

My mappish hopes were quickly dashed, but just as quickly were replaced by interest in a group of inscribed objects I hadn't run across before: mirrors from Roman Dacia bearing makers' inscriptions. "Forma" can mean "mirror"? A quick check of Lewis & Short at Perseus doesn't support that idea, but builds confidence in a better interpretation: "mold, stamp, form". Was this mirror, or some part of it, somehow cast or stamped out? The EDH entry tells me there are 9 identical mirrors extant and that the inscription goes around the "Fassung" (frame?). Yup.

Cool. I learned something today before breakfast. And it's knowledge I can use when I come back to doing more with the geographical/mapping/surveying vocabulary.

And then it hits me: that's not information I went looking for, not a search I initiated. New information of interest was pushed to me because I had previously used a software tool to express interest in a number of information sources including, but not limited to, ancient inscriptions. The software kept an eye on new output from those sources and made it available to me for review and engagement in a mode and at a time and place of my choosing. And because the source data was online, open, and linked in a standard format, I was able to drink coffee and pet my dog on the back deck in Moontown, Alabama while making use of the scholarly work done yesterday(!) by Brigitte Gräf in Heidelberg, Germany.

Isn't this one of the things we've been working toward?

How did that happen?


Sometime earlier this year, Frank Grieshaber in Heidelberg rolled out web page listings and corresponding Atom feeds of recently changed content in the EDH (e.g., latest updates to the inscriptions database). I added them, along with similar data-oriented feeds, to a feed aggregator I dubbed Planet Potamos (with "Potamos" trying lamely to evoke a rushing river of data; the "Planet" acknowledges the feed aggregation software I use). I put the same feed subscriptions into my personal feed reader (I could have subscribed to the Potamos aggregator's own feed instead, but it only updates periodically and I'm an immediacy junkie). I installed and configured my feed reader on every device I use.
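(For the curious: Planet-style aggregators are driven by a simple INI configuration that pairs each subscribed feed URL with a display name. This is a from-memory sketch with an invented feed URL, not Potamos's actual config.)

[Planet]
name = Planet Potamos
link = http://example.org/potamos/

[http://example.org/edh/atom.xml]
name = EDH: latest updates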

The rest is magic. Magic made the old-fashioned way by lots of people in many different places and times developing standards, building software, creating data, doing research, and sharing.

What next?


Well, I hope that Frank and his colleagues in Heidelberg will eventually add thumbnail images (where they have them) to the EDH feeds. I hope that the other epigraphic databases (and indeed all kinds of ancient studies web applications) will set up similar feeds. I hope that we can all start using more linked-data approaches in and alongside such feeds in order to communicate seminal interpretive/discovery facets (like geography, personography, temporality and genre) in machine-actionable ways. I hope the spirit and practice of openness that lubricates and accelerates this sort of synergy continues to grow and flower.

As for me, I'm thinking about how I might set up some kind of filtering mechanism that would highlight or prioritize content in my feed reader that's potentially relevant to my (e.g.) geo/map/survey vocabulary interests. Hmmmmm....
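(A minimal sketch of the kind of filter I have in mind, assuming a hypothetical feed URL and an off-the-cuff keyword list; a real version would want proper XML parsing rather than a grep over raw Atom:)

$ curl -s 'http://example.org/edh/atom.xml' | grep -i -E '<title[^>]*>[^<]*(forma|groma|terminus)'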


Thursday, November 19, 2009

Bridging Institutional Repository and Bibliographic Management

As an institution, ISAW has an interest in disseminating, preserving and promoting the research products and publications of its faculty, research staff, students, affiliates and collaborators. Our parent institution, NYU, has made a commitment to the persistent dissemination of such materials when voluntarily contributed to its Faculty Digital Archive (FDA). We'll use the FDA as a locus for materials that fit well into DSpace (with which the FDA is realized) and that aren't rights-constrained. But we also need mechanisms for developing and publishing the whole bibliographic story of a particular faculty member, research group, project or conference with links from the individual entries to digital copies wherever they may be (e.g., the FDA, JSTOR, Internet Archive, Google Books). For this function, we like Zotero. Atop Zotero's robust and ubiquitous feed documents, we can build interoperability with our website and other tools and venues in a way that is also completely visible to commercial and third-party search and discovery tools.

There will be a number of iterations necessary to reach a fully robust solution, but we're already taking some of the first steps.

As an early experiment with the FDA, we had a student assistant input all of my boss's articles in PDF format, along with descriptive metadata (see: Roger Bagnall's Publications). The default metadata schema in the FDA wasn't a perfect fit for journal article citations, but the FDA staff is now working with us to extend the schema to meet our needs. We're using the Zotero data model as a guide.

Given that the metadata in this collection is the only structured dataset around for Roger's articles, I wanted to be able to get it all back out to use for other things. The FDA does provide web feeds, but (unlike Zotero) these aren't comprehensive for a given context and don't incorporate all the metadata fields. But we can use FDA's OAI-PMH interface to get the full metadata with a query like:

http://archive.nyu.edu/request?verb=ListRecords&metadataPrefix=oai_dc&set=hdl_2451_28115

where "hdl_2451_28115" is the identifier for the "Roger Bagnall's Publications" container I linked to above. (Special thanks to Ekaterina Pechekhonova on the NYU Digital Library team, who helped me with syntax).

As a further experiment, I wrote an XSL transform to convert the OAI-PMH XML document into the RDF XML Zotero can import. There are a couple of inelegant hacks in the transform (mainly to get at substrings within single fields), but I'm still happy with the results. The import into Zotero went smoothly:

http://www.zotero.org/paregorios/items/collection/1505597
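(For anyone wanting to reproduce this: an XSL transform like that can be run from the command line with xsltproc; the stylesheet and file names here are hypothetical stand-ins for mine:)

$ xsltproc oai2zotero.xsl oai-records.xml > bagnall-articles.rdf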

Next steps: move this to a shared Zotero library so Roger, a student assistant and members of our digital projects team can collaborate to enter the rest of the publications (books, book sections, etc.) and fix any errors in the article records. Then we'll look at the process for using that metadata (via another transform) to help us populate the FDA. We'll also start working on parsing and aggregating Zotero's feeds for use on our website (in Roger's online profile and aggregated with other affiliates' feeds to provide a "recent publications" section).

We're also experimenting with Zotero for the bibliography of our Pleiades project (a collaborative online gazetteer of the Greek and Roman world), and as a component in a potential replacement for the Checklist of Editions of Greek, Latin, Demotic and Coptic Papyri, Ostraca and Tablets. On a more personal level, I've taken to doing all my bookmarking with Zotero and have set up a folder in my library (with an associated feed) so that colleagues can follow what I'm citing on a daily basis.

Friday, January 30, 2009

There is more than one "TimeMap" in the geohistorical software space

Guest blogging at the Google Geo Developer's Blog, UC Berkeley's Nick Rabinowitz details his TimeMap Javascript library that:
helps the Google Maps API play nicely with the SIMILE Timeline API to create maps and timelines that work together
This is not to be confused with the older TimeMap family of software components (some now open-sourced), originally built by the Archaeological Computing Laboratory at the University of Sydney under the direction of Ian Johnson.

Monday, January 5, 2009

The Study and Publication of Inscriptions in the Age of the Computer

Update (7 January 2009): added links to abstracts

This Saturday, 10 January 2009, Paul Iversen and I will be co-chairing the following panel at the Joint Annual Meetings of the American Philological Association and the Archaeological Institute of America. The panel, on the topic of digital study and publication of inscriptions, is sponsored by the American Society of Greek and Latin Epigraphy. I hope to see you there!

Saturday, January 10, 8:30-11:00 a.m. in Independence I of the Marriott Hotel, Philadelphia:
  1. Publishing Image and Text in Digital Epigraphy
    Neel Smith (College of the Holy Cross)
    [ abstract not available ]
  2. Topic Maps and the Semantics of Inscriptions
    Marion Lamé (Alma Mater Studiorum, Università di Bologna, Italy and Université de Provence, Aix-Marseille 1, France)
    [ abstract in pdf (courtesy APA) ]
  3. An Efficient Method for Digitizing Squeezes & Performing Automated Epigraphic Analysis
    Eleni Bozia, Angelos Barmpoutis and Robert S. Wagman (University of Florida)
    [ abstract in msword (courtesy APA) ]
  4. Opportunities for Epigraphy in the Context of 3-D Digitization
    Gabriel Bodard (King’s College London) and Ryan Baumann (Univ. of Kentucky)
    [ abstract in pdf (courtesy APA) ]

Monday, October 20, 2008

Near Eastern Prosopography and Onomastics

Charles Helton wants the equivalent of the Lexicon of Greek Personal Names for Sumerian and Akkadian. What's the state of prosopographical and onomastic research in that context, and the status of relevant projects (digital or otherwise)?

The DH Stack(s)

Lots of interesting posts in the last couple of days about Digital Humanities skills, software and cyberinfrastructure initiatives:

Friday, September 26, 2008

Reuters (EndNote) sues George Mason over Zotero

By way of the Courthouse News Service we hear that:

Thomson Reuters demands $10 million and an injunction to stop George Mason University from distributing its new Web browser application, Zotero ... Reuters claims George Mason is violating its license agreement and destroying the EndNote customer base.

Thursday, September 11, 2008

First Thousand Years of Greek: Utilities

I recently blogged the announcement of the CHS-sponsored First Thousand Years of Greek project. It looks as if Neel is beginning to roll out related code, documentation and information on the CHS Digital Incunabula site.

Sarah Parcak on Egypt in Huntsville, September 17

The North Alabama Society of the Archaeological Institute of America is hosting Dr. Sarah Parcak for two talks next Wednesday:
  • Women and Power in Antiquity: A New Kingdom Case Study from Deir el-Medina, Thebes, 2:20 p.m. in Roberts 419 on the UAH campus
  • Making the Mummies Dance from Space: Using Satellite Imagery to Find Ancient Egypt, 7:30 p.m. in the Chan Auditorium (first floor Business Administration Building) on the UAH campus
You can read more about Dr. Parcak's work, and much else, on the NASAIA blog, Excavate!

Friday, August 29, 2008

Get Paid to Read Greek!

From Greg Crane:

Contribute to the Greek and Latin Treebanks!

We are currently looking for advanced students of Greek and Latin to contribute syntactic analyses (via a web-based system) to our existing Latin Treebank (described below) and our emerging Greek Treebank as well (for which we have just received funding). We particularly encourage students at various levels to design research projects around this new tool. We are looking in particular for the following:
  • Get paid to read Greek! We can have a limited number of research assistantships for advanced students of the languages who can work for the project from their home institutions. We particularly encourage students who can use the analyses that they produce to support research projects of their own.
  • We also encourage classes of Greek and Latin to contribute as well. Creating the syntactic analyses provides a new way to address the traditional task of parsing Greek and Latin. Your class work can then contribute to a foundational new resource for the study of Greek and Latin - both courses as a whole and individual contributors are acknowledged in the published data.
  • Students and faculty interested in conducting their own original research based on treebank data will have the option to submit their work for editorial review to have it published as part of the emerging Scaife Digital Library.
To contribute, please contact David Bamman (david.bamman@tufts.edu) or Gregory Crane (gregory.crane@tufts.edu).

For more information, see http://nlp.perseus.tufts.edu/syntax/treebank/.

Wednesday, August 20, 2008

Natural Language Toolkit (NLTK) penetration?

I'd be interested to know of digital classicists, antiquisters, and those inhabiting neighboring nodes who are making use of NLTK, and what your impressions of its strengths and weaknesses are.

Thursday, July 31, 2008

Outfox Shoutout

I wanted to blog about this as soon as it hit my feed reader, but then there was that proposal to finish. Anyway:

One of the highlights of a decade spent at Carolina was getting to work with Gary Bishop, a professor in the Department of Computer Science. We found ourselves in a collaboration initiated by Jason Morris, a blind Classics graduate student who was deeply interested in ancient geography and for whom Braille maps constituted a ridiculously low-bandwidth, low-resolution disappointment. The idea of producing immersive spatial audio maps took off in the hands of a group of Gary's undergraduate students and, with some seed money from Microsoft Research, this one initiative blossomed into a research and teaching program in assistive technology.

Gary's recently blogged about a cool new project: the Outfox extension for Firefox, which:
allows in-page JavaScript to access local platform services and devices such as text-to-speech synthesis, sound playback and game controllers
It's open source (BSD License), and you can help.

Thursday, May 8, 2008

Open Library API, Bibo Ontology and Digital Bibliographies

I bet we're going to want to fiddle with the Open Library API and the Bibo Ontology in the context of the Pleiades bibliography application (and some others we're thinking about, like a next-generation Checklist of Editions for papyri and the like).
  • Seek and get digital books from the Open Library.
  • Use Bibo in other-than-html serializations of the underlying MODS records, and maybe even microformatishly in the HTML version. (We already use COinS -- for interop with Zotero -- but it's lossy, ungainly and suboptimally human-readable).
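A first sniff at the Open Library side might look like the following (a sketch: the Books API endpoint and parameters are the documented ones as I understand them, but the ISBN is an arbitrary example):

$ curl -s 'http://openlibrary.org/api/books?bibkeys=ISBN:0199213127&format=json&jscmd=data'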
Thanks to Dave Pattern (via planet code4lib) for the pointer to the Open Library API.

Thursday, May 1, 2008

Sharing the road

So, you find yourself in a free-wifi coffeeshop (or similar venue), where you're sharing bandwidth with others. You have a bunch of really big files that you need to transfer to a remote server so collaborators can get at them. You know that upload speed at your location is throttled pretty aggressively (my usual haunt has Bell South DSL, and I've never seen a big upload average higher than 48 Kbps). So it's likely that if you blast that stuff out, it'll slow things down for everybody in the venue (I tried. It did.).

This uses all the bandwidth it can get:
scp huge-file.zip myname@myserver.org:
This is more neighbor-friendly (rsync's --bwlimit is measured in kilobytes per second, so this caps my upload at roughly 10 KB/s):
rsync --bwlimit=10 -e ssh huge-file.zip myname@myserver.org:
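(If you'd rather stay with scp, it has a comparable knob: the -l option limits bandwidth in Kbit/s, so roughly the same ceiling would be:)
scp -l 80 huge-file.zip myname@myserver.org: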
I really don't want to get thrown out of here ...

Tuesday, March 4, 2008

Watching Omeka

Shawn is trying out Omeka. Thanks, and thanks for blogging about the experience. We're beginning to think about what we'll do for online avatars of ISAW exhibitions beginning in 2009, and so Omeka is on our list of things to look at closely.

Friday, February 29, 2008

Atom+GeoRSS for interoperability: Cyrenaican archaeology, epigraphy, geography

The influenza kept me off the plane to Rome, but happily I was at least able to give my talk (via Skype) this morning. The occasion is a meeting at the British School in Rome, organized by the Inscriptions of Roman Cyrenaica project, to bring together scholars working in Cyrenaica to explore the potential for cross-project collaboration and data sharing. I used our work so far on Pleiades (and a bunch of Sean's ideas exchanged on IRC) as a spring-board for a methodological proposal: using Atom+GeoRSS feeds to facilitate cross-project data discovery and citation.
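To make the proposal concrete, here is a hypothetical Atom entry carrying a GeoRSS point; every identifier and value below is invented for illustration, not taken from any project's actual feed:

<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:georss="http://www.georss.org/georss">
  <title>Inscription: IRCyr C.1 (example)</title>
  <id>http://example.org/ircyr/C.1</id>
  <updated>2008-02-29T12:00:00Z</updated>
  <link rel="alternate" href="http://example.org/ircyr/C.1"/>
  <georss:point>32.821 21.858</georss:point>
  <summary>Funerary inscription from Cyrene (illustrative entry only)</summary>
</entry>

Any project publishing entries like these could have its new material discovered, mapped, and cited by the others with off-the-shelf feed tooling.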

There will be more about this in future posts, but for now, the slides (mostly screen shots) are available:
