
Wednesday, May 23, 2012

First pass at extracting useful data from my dissertation

You'll find context in yesterday's post on the dissertation.

It turned out to be easier than I anticipated to start extracting useful information from my born-digital-for-printing-on-dead-trees dissertation. Here's a not-yet-perfect XML serialization (borrowing tags from the TEI) of the "instance" information found in the dissertation narrative:

https://github.com/paregorios/demarc/blob/master/xml/instances.xml

Each instance is a historical event (or, in some cases, a series of events) relating to boundary demarcation or dispute within the empire. Here's a comparison between the original formatting for paper and the XML.

For paper:

XML:
<?xml version="1.0" encoding="UTF-8"?>
<div type="instance" xml:id="INST9">
  <idno type="original">INST9</idno>
  <head>A Negotiated Boundary between the <placeName 
    type="ancient">Zamucci</placeName> and the <placeName 
    type="ancient">Muduciuvi</placeName></head>
  <p rend="indent">Burton 2000, no. 78</p>
  <p>Date(s): <date>AD 86</date></p>
  <p type="treDisputeStatement">This boundary marker was placed in 
    accordance with the agreement of both parties (<foreign xml:lang="la">ex 
    conven/tione utrarumque nationum</foreign>), and therefore may be taken as
    evidence of a <hi rend="bold">boundary dispute</hi>.</p>
  <p rend="indent">This single boundary marker from coastal <placeName 
    type="modern">Libya</placeName> provides the only evidence for the resolution
    of a boundary dispute between these two indigenous peoples. The date of the 
    demarcation, as calculated from the imperial titulature, places the event in 
    the same year as the reported ‘destruction’ of the <placeName 
    type="ancient">Nasamones</placeName> by <placeName type="ancient">Legio III 
    Augusta</placeName> as a consequence of a tax revolt in which tax collectors 
    were killed.<note n="286"> Zonaras 11.19. </note> It is not clear whether 
    the boundary action was related to the conflict, or merely took advantage of
    the temporary presence of the legionary legate in what ought to have been
    part of the proconsular province. Surviving documentation for proconsuls
    during the 80s AD is incomplete, and therefore we cannot say who was
    governing <placeName type="ancient">Africa Proconsularis </placeName>at the 
    time of this demarcation.<note n="287"> Thomasson 1996, 45-48. </note>
    Neither party seems to have been related to the <placeName 
    type="ancient">Nasamones</placeName>; rather, they are thought to be sub-
    tribes of the <placeName type="ancient">Macae.</placeName><note 
    n="288">Mattingly 1994, 27-28, 32, 74, 76.. </note></p>
</div>
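
To give a sense of what "useful data" might look like once the narrative is in this form, here is a minimal sketch in Python that pulls the identifier, the date, and the place names out of one instance. It is an illustration only, not the code behind instances.xml: the function name `summarize_instance` is hypothetical, the sample is an abbreviated copy of the INST9 markup above, and it assumes the elements carry no TEI namespace, as in the sample.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Abbreviated copy of the INST9 sample, for illustration only.
SAMPLE = """<div type="instance" xml:id="INST9">
  <head>A Negotiated Boundary between the <placeName
    type="ancient">Zamucci</placeName> and the <placeName
    type="ancient">Muduciuvi</placeName></head>
  <p>Date(s): <date>AD 86</date></p>
  <p rend="indent">This single boundary marker from coastal <placeName
    type="modern">Libya</placeName> provides the only evidence.</p>
</div>"""


def summarize_instance(div):
    """Return the id, date, and place names (grouped by type) of one instance."""
    places = {}
    for place in div.iter("placeName"):
        # Collapse the line breaks and padding left over from pretty-printing.
        name = " ".join("".join(place.itertext()).split())
        names = places.setdefault(place.get("type"), [])
        if name not in names:
            names.append(name)
    date = div.find(".//date")
    return {
        "id": div.get(XML_ID),
        "date": date.text if date is not None else None,
        "places": places,
    }


summary = summarize_instance(ET.fromstring(SAMPLE))
print(summary)
```

Grouping by the `type` attribute keeps ancient and modern names apart, which is the distinction the markup already records.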



One thing that made this a lot easier than it might have been was the way I used styles in Microsoft Word back when I created the original version of the document. Rather than just painting formatting onto my text for headings, paragraphs, strings of characters, and so forth, I created a custom "style" for each type of thing I wanted to format (e.g., an "instance heading" or a "personal name"). I associated the desired visual formatting with each of these styles, but the style names themselves, since they captured the semantic distinctions I was interested in, provided the hooks I needed today for writing this stuff out as sort-of TEI XML.
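
The idea of using style names as hooks can be sketched roughly as follows. This is a minimal illustration in Python, not the actual conversion code: the style names in the two lookup tables are hypothetical stand-ins for whatever the Word document really uses, and the sketch assumes the paragraphs have already been read out of Word as a paragraph style plus a list of (character style, text) runs.

```python
import xml.etree.ElementTree as ET

# Hypothetical style names; the real ones in the dissertation may differ.
PARAGRAPH_STYLES = {
    "Instance Heading": ("head", {}),
    "Body Indented": ("p", {"rend": "indent"}),
}
CHARACTER_STYLES = {
    "Ancient Place Name": ("placeName", {"type": "ancient"}),
    "Modern Place Name": ("placeName", {"type": "modern"}),
}


def paragraph_to_element(style, runs):
    """Build one element from a paragraph style and its (char_style, text) runs."""
    tag, attrs = PARAGRAPH_STYLES.get(style, ("p", {}))
    para = ET.Element(tag, dict(attrs))
    last = None  # most recently added child element, if any
    for char_style, text in runs:
        if char_style in CHARACTER_STYLES:
            # A styled run becomes a child element of its own.
            child_tag, child_attrs = CHARACTER_STYLES[char_style]
            last = ET.SubElement(para, child_tag, dict(child_attrs))
            last.text = text
        elif last is None:
            # Unstyled text before any child belongs to the paragraph itself.
            para.text = (para.text or "") + text
        else:
            # Unstyled text after a child is stored as that child's tail.
            last.tail = (last.tail or "") + text
    return para


head = paragraph_to_element("Instance Heading", [
    (None, "A Negotiated Boundary between the "),
    ("Ancient Place Name", "Zamucci"),
    (None, " and the "),
    ("Ancient Place Name", "Muduciuvi"),
])
print(ET.tostring(head, encoding="unicode"))
```

Keeping the mapping in plain dictionaries means a new style can be supported by adding one entry, without touching the function.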

There's more to do, obviously, but this was a satisfying first step.
