Monday, February 4, 2008

Escape from PGeo, part deux

Last week, I posted a short bit on using ogrinfo and ogr2ogr to convert a layer in an ESRI personal geodatabase to a shapefile. Today we'll automate the process with some python programming to extract all layers (regardless of geometry) from an arbitrary pgeo file.

The following examples assume you're running python within an FWTools shell.

Get a list of layers

The first challenge is capturing the output from ogrinfo (the list of layers) into a form we can parse. Python gives us a number of methods for issuing commands to the hosting system environment. The one to use here is popen3 from the os module since it will let us invoke ogrinfo and then work with its output (specifically stdout) as a file-like object in code. We'll read the output lines into a python list for subsequent processing.
>>> import os
>>> infilename = "ItAntMapping.mdb"
>>> child_stdin, child_stdout, child_stderr = os.popen3("ogrinfo %s" % infilenam
>>> output = child_stdout.readlines()
>>> output
["INFO: Open of `ItAntMapping.mdb'\n", " using driver `PGeo' successful.\n"
, '1: whollyimprecise\n', '2: placesAdded\n', '3: placesEstimated\n', '4: places
Solid\n', '5: stretchesDangling\n', '6: stretchesFloating\n', '7: stretchesSolid
\n', '8: tpPoints\n', '9: stretchesUnlocated\n']
Now we want to cleanup this list of lines so we're left with a list of layer names (no prefixed numbers, no newline characters, no strings other than layer names). We'll iterate through each line in the output and, if a python regular expression designed to differentiate between layer names and ogrinfo's other outputs finds a match, we'll copy the relevant portion of the matched line into a new list of "layers". We'll need the python regular expression module (re) for this step.

In pseudocode:
  • set up a regular expression to match a line that begins with a string of digits, a colon, and a space, followed by a string of arbitrary length and a newline (group the string that is the layer name)
  • create a new list to hold the layer names
  • for each line in the output list:
    • attempt a regular expression match
    • if matched: append the contents of the group (the layer name) to the layers list
In python:
import re
>>> regex = re.compile('^\d+: (.*)\n')
>>> layers = []
>>> for line in output:
... m = regex.match(line)
... if m:
... layers.append(m.groups()[0])
>>> layers
['whollyimprecise', 'placesAdded', 'placesEstimated', 'placesSolid', 'stretchesD
angling', 'stretchesFloating', 'stretchesSolid', 'tpPoints', 'stretchesUnlocated
Extract each named layer

Now the easy part. Iterate through the list of layers, invoking ogr2ogr for each layer. We don't need to capture any i/o streams this time, so we'll just use the plain-vanilla system method.
>>> for layer in layers:
... os.system('ogr2ogr -f "ESRI Shapefile" %s.shp ItAntMapping.mdb %s' % (la
yer, layer))
>>> ^Z
C:\Users\Tom\Documents\itins>dir *.shp
Volume in drive C has no label.
Volume Serial Number is 2C47-654D

Directory of C:\Users\Tom\Documents\itins

02/04/2008 01:03 PM 156 placesAdded.shp
02/04/2008 01:03 PM 7,464 placesEstimated.shp
02/04/2008 01:03 PM 85,780 placesSolid.shp
02/04/2008 01:03 PM 100 stretchesDangling.shp
02/04/2008 01:03 PM 100 stretchesFloating.shp
02/04/2008 01:03 PM 62,420 stretchesSolid.shp
02/04/2008 01:03 PM 1,684 stretchesUnlocated.shp
02/04/2008 01:03 PM 100 whollyimprecise.shp
8 File(s) 157,804 bytes
0 Dir(s) 172,207,665,152 bytes free
Package up the code

I've incorporated these steps into a python script, along with some error handling and usage methods. It's a bit more generalized than what's shown above. Enjoy, and let me know about mistakes or suggestions for improvement.

No comments: