SyntaxHighlighter

Thursday, April 10, 2014

Batch XML validation at the command line

Against a RelaxNG schema. I had help figuring this out from Hugh and Ryan at DC3:

$ find {searchpath} -name "*.xml" -print | parallel --tag jing {relaxngpath}
The find command hunts down all files ending with ".xml" in the directory tree under searchpath. The parallel command takes that list of files and fires off (in parallel) a jing validation run for each of them. The --tag option passed to jing ensures we get the name of the file passed through with each error message. This turns out (in general terms as seen by me) to be much faster than running each jing call in sequence, e.g. with the --exec primary in find.

As I'm running on a Mac, I had to install GNU Parallel and the Jing RelaxNG Validator. That's what Homebrew is for:
$ brew install jing
$ brew install parallel
 What's the context, you ask? I have lots of reasons to want to be able to do this. The proximal cause was batch-validating all the EpiDoc XML files for the inscriptions that are included in the Corpus of Campā Inscriptions before regenerating the site for an update today. I wanted to see quickly if there were any encoding errors in the XML that might blow up the XSL transforms we use to generate the site. So, what I actually ran was:
$ curl -O http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng
$ find ./texts/xml -name '*.xml' -print | parallel --tag jing tei-epidoc.rng
 Thanks to everybody who built all these tools!


No comments: