Thursday, July 17, 2014

New in Electra: RIDE

I have just added the following digital resource to the Electra Atlantis feed aggregator:

title = RIDE: A review journal for digital editions and resources
url = http://ride.i-d-e.de/
creators = Alexander Czmiel, et al. (eds.)
license =  http://creativecommons.org/licenses/by/4.0/
description = RIDE is a review journal dedicated to digital editions and resources. RIDE aims to direct attention to digital editions and to provide a forum in which expert peers criticise and discuss the efforts of digital editors in order to improve current practices and advance future developments. It will do so by asking its reviewers to pay attention not only to the traditional virtues and vices of any edition, but also to the progressing methodology and its technical implications.
feed = http://ride.i-d-e.de/feed/

Wednesday, July 16, 2014

New in Electra: EpiDoc Workshop

I have just added the following blog to the Electra Atlantis feed aggregator:

title = EpiDoc workshop
url = http://epidocworkshop.blogspot.co.uk/
creators = Simona Stoyanova, et al.
description = Share markup examples; give and receive feedback
keywords = EpiDoc, epigraphy, inscriptions, XML, TEI
feed = http://epidocworkshop.blogspot.com/feeds/posts/default?alt=rss

Monday, July 14, 2014

Hacking on Apache Log Files with Python

There are plenty of tools out there for doing web request analysis. I wanted to pull some information out of some Apache log files without all that overhead. Here's what I did:

I got Rory McCann's apache-log-parser (after some googling; it's on GitHub and on PyPI). I set up a Python virtual environment using Doug Hellmann's virtualenvwrapper, activated it, and then used:

 pip install apache-log-parser  

Since I'd never used apache-log-parser before, I had to get familiar with it. I discovered that, to use it, I had to know the format string that Apache was using to log information for my site. This was a two-step process, figured out by reading the Log Files section of the Apache documentation and poking about with grep.

First, I searched in the Apache configuration files for the CustomLog directive that's associated with the virtual host I wanted to analyze. This gave me a 'nickname' for the log configuration. More spelunking in Apache config files -- this time in the main configuration file -- turned up the definition of that nickname (Apache uses the LogFormat directive for this purpose):

 $ cd /etc/apache2/  
 $ grep CustomLog sites-enabled/foo.nowhere.org  
  CustomLog /var/log/apache2/foo.nowhere.org-access.log combined  
 $ grep combined apache2.conf   
 LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined  

It's that LogFormat string that needs to be given to Rory's code so it knows how to parse your log files.
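
Before wiring anything up, it's easy to check in the interpreter that the format string and the parser agree: apache_log_parser.make_parser() takes the LogFormat string and returns a callable that turns a single log line into a dictionary of fields. A minimal sketch (the sample log line below is invented for illustration):

 >>> import apache_log_parser
 >>> line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"")
 >>> entry = line_parser('66.249.66.1 - - [14/Jul/2014:09:30:00 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')
 >>> entry['remote_host']
 '66.249.66.1'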

After some experimenting in the Python interpreter to get a feel for the code and its capabilities, I wrote a few lines of my own to wrap the setup and file reading operations:

 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 # loggulper.py: wrap apache-log-parser setup and log-file reading
 
 import apache_log_parser
 import glob
 import logging
 
 # supported log file formats
 APACHE_COMBINED = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
 APACHE_COMMON = "%h %l %u %t \"%r\" %>s %b"
 
 def gulp(log_file_path, pattern=APACHE_COMBINED):
   """Parse all log files matching the glob in log_file_path; return a list of dicts."""
   log_data = []
   line_parser = apache_log_parser.make_parser(pattern)
   for file_name in glob.glob(log_file_path):
     logging.info("file_name: %s" % file_name)
     # read the whole file, then hand each line to the parser
     with open(file_name, 'r') as f:
       lines = f.readlines()
     logging.info(" read %s lines" % len(lines))
     for line in lines:
       log_data.append(line_parser(line))
   logging.info("total number of events parsed: %s" % len(log_data))
   return log_data

For this particular server, I had multiple log files, but I wanted to have all the requests from all of them (parsed into dictionaries by Rory's code) in a single list for subsequent analysis. So, back to the Python interpreter:

 >>> import logging  
 >>> logging.basicConfig(level=logging.INFO)  
 >>> import loggulper  
 >>> d = loggulper.gulp("/path/to/my/log/files/foo.nowhere.org-access.*")  

I'll spare you the logging messages. This took several minutes. I ended up with about 1.5 million requests in the list. Real life intervened. How to save this data for later without having to run through the whole import process again? Pickle it.

 >>> import cPickle as pickle  
 >>> out_name = "logdata.pickle"  
 >>> outf = open(out_name, 'w')  
 >>> pickler = pickle.Pickler(outf, pickle.HIGHEST_PROTOCOL)  
 >>> pickler.dump(d)  
 <cPickle.Pickler object at 0x1044044e8>  
 >>> outf.close()  

The whole list was saved to a 933.3 MB file in just a few seconds (full disclosure: I have a solid-state drive). It was nearly as quick to read back in a couple of days later (new interpreter session and all):

 >>> import cPickle as pickle  
 >>> in_name="logdata.pickle"  
 >>> inf=open(in_name, 'r')  
 >>> unpickler=pickle.Unpickler(inf)  
 >>> d=unpickler.load()  
 >>> len(d)  
 1522015  
 >>> d[0].keys()  
 ['status', 'request_header_referer', 'remote_user', 'request_header_user_agent__browser__family', 'request_header_user_agent__is_mobile', 'request_header_user_agent__browser__version_string', 'request_header_user_agent', 'request_http_ver', 'request_header_user_agent__os__version_string', 'remote_logname', 'time_recieved_isoformat', 'time_recieved', 'request_first_line', 'request_header_user_agent__os__family', 'request_method', 'request_url', 'remote_host', 'time_recieved_datetimeobj', 'response_bytes_clf']  

It's important to notice at this point that the word "received" is misspelled as "recieved" in the keys of the dictionaries returned by apache-log-parser. If you don't notice this early on, it will cause some amount of frustration.
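
If you want to guard against that spelling ever being corrected upstream, one defensive option is a tiny helper that tolerates both forms; when_received() below is a hypothetical convenience of mine, not something apache-log-parser provides:

 def when_received(req):
   """Return the parsed request timestamp, tolerating either spelling of 'received'."""
   return req.get('time_recieved_datetimeobj') or req.get('time_received_datetimeobj')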

It turned out that my log data included events past the end of the reporting period I'd been given (which ended 31 May 2014), so I needed to filter the list down to just the requests that fell within that period. Python list comprehensions to the rescue:

 >>> dates=[req['time_recieved_datetimeobj'] for req in d]  
 >>> max(dates)  
 datetime.datetime(2014, 7, 13, 11, 37, 31)  
 >>> min(dates)  
 datetime.datetime(2013, 7, 21, 3, 41, 26)  
 >>> from datetime import datetime  
 >>> d_relevant=[req for req in d if req['time_recieved_datetimeobj'] < datetime(2014,06,01)]  
 >>> dates=[req['time_recieved_datetimeobj'] for req in d_relevant]  
 >>> max(dates)  
 datetime.datetime(2014, 5, 31, 23, 59, 17)  
 >>> min(dates)  
 datetime.datetime(2013, 7, 21, 3, 41, 26)  
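
A small portability aside: the zero-padded arguments in datetime(2014,06,01) only work because Python 2 reads them as octal literals; Python 3 would reject them as a SyntaxError, so the unpadded form of the same comprehension is safer:

 >>> d_relevant = [req for req in d if req['time_recieved_datetimeobj'] < datetime(2014, 6, 1)]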

Now to separate requests made by self-identified bots and spiders from the rest of the traffic:

 >>> robots=[req for req in d_relevant if 'bot' in req['request_header_user_agent'].lower() or 'spider' in req['request_header_user_agent'].lower()]  
 >>> len(robots)  
 848450  
 >>> humans=[req for req in d_relevant if 'bot' not in req['request_header_user_agent'].lower() and 'spider' not in req['request_header_user_agent'].lower()]  
 >>> len(humans)  
 486249  
 >>> len(robots) + len(humans) == len(d_relevant)  
 True  
 >>> unique_bots=[]  
 >>> for bot in robots:  
 ...   if bot['request_header_user_agent'] not in unique_bots:  
 ...     unique_bots.append(bot['request_header_user_agent'])  
 ...   
 >>> len(unique_bots)  
 229
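
The same deduplication can be done in one pass with collections.Counter, which also keeps a tally per user-agent string; a quick sketch of that alternative (only the distinct count is shown here):

 >>> from collections import Counter
 >>> bot_counts = Counter(req['request_header_user_agent'] for req in robots)
 >>> len(bot_counts)
 229

Calling bot_counts.most_common() would then rank the self-identified crawlers by request volume.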

Aside: yes, I know there could well still be automated agents in the "humans" list; I've only filtered out the ones that identify themselves as such, not those run by the sneaky or the uninformed. Let's not worry about that issue for now.

Stay tuned for the next installment, wherein I hope we actually learn something about how our server is being used.

Tuesday, June 3, 2014

New in Maia: Mār Šiprim and Laboratoire Orient et Méditerranée

I have added feeds for the following web resources to the Maia Atlantis feed aggregator:

title = Mār Šiprim
url = http://mar-shiprim.org/
creators = International Association for Assyriology
license = None
description = Official Newsletter of the International Association for Assyriology (IAA). Through this Newsletter, the IAA aims to provide an online platform where Assyriologists and Near-Eastern enthusiasts can interact with each other on both an intellectual and an informal level, thus establishing international links among colleagues.
keywords = None
feed = http://mar-shiprim.org/feed/

title = Laboratoire Orient et Méditerranée
url = http://www.orient-mediterranee.com/?lang=fr
creators = None
license = None
description = Orient & Méditerranée is a joint research unit (Unité Mixte de Recherche) in the historical, philological, and religious sciences, bringing together the Centre National de la Recherche Scientifique (CNRS), the Université Paris-Sorbonne (Paris IV), the Université Panthéon-Sorbonne (Paris 1), and the École Pratique des Hautes Études.
keywords = académie des inscriptions et belles-lettres, actualités, annuaire, antique, antiques, antiquité classique et tardive, arabie, araméen, archeology, archives, archéologiques, archéologues, bible, calendrier, centre national de la recherche scientifique, chantiers de fouille, cnrs, collections, colloques, collège de france, communication, contact, coopérations, coran, cours spécialisés, crédits, disciplines, distinctions, documentaires, débuts du christianisme, electroniques, formation, historiens des religions, informations administratives, initiation, islam médiéval, langue syriaque, les chercheurs du lesa, lesa, liens utiles, linguistes, l’université panthéon-sorbonne, l’université paris-sorbonne, mediterranee, membres, missions de terrain, monde byzantin, monde méditerranéen, mondes cananéen, médecine grecque, méditerranée, médiévale, organigramme, orient, orient & méditerranée, orient chrétien, ougarit, ouvrages récents, paris 1, paris iv, philologiques, philologues, phénicien, plan du site, proche-orient, programmes, présentation, publications, publications des membres de l’umr, punique, qumrân, rassemble cinq laboratoires, recherches, religions monothéistes, responsabilité d’entreprises documentaires, ressources documentaires, revues, sciences historiques, sciences humaines, sciences religieuses, soutenances, spip 2, spécialistes du monde, séminaires, sémitique, sémitique occidental, template, textes fondateurs, thèses, thèses en cours, umr 8167, umr8167, unité mixte de recherche, vallée de l’euphrate syrien, valorisation de la recherche, vient de paraître, école pratique des hautes études, écoles doctorales, époques, éthiopie, études sémitiques
feed = http://www.orient-mediterranee.com/spip.php?page=backend

Friday, May 23, 2014

New in Planet Maia: Building Tabernae and Archaeology of Portus (MOOC)

I have just added the following resources to the Maia Atlantis feed aggregator:

title = Building Tabernae
url = http://buildingtabernae.org/
creators = Miko Flohr
license = None
description = About two years ago, I received a quarter-million-euro grant from the Dutch government for a four-year project on urban commercial investment in Roman Italy, and a project blog was already in the proposal. The project – Building Tabernae – started in April 2013, and is now about to enter a new phase, in which some results will start emerging, and new data will be gathered. The blog, I hope, is a way to force the scholar in charge of this project – me – to record and communicate the project’s successes and failures, and everything else that one encounters when investigating commercial investment in Roman Italy, and to be available for discussion with specialists and non-specialists alike.
feed = http://buildingtabernae.org/feed/

title = Archaeology of Portus: Exploring the Lost Harbour of Ancient Rome
url = http://moocs.southampton.ac.uk/portus/
creators = University of Southampton and FutureLearn
license = None
description = The University of Southampton and FutureLearn are running a MOOC (Massive Open Online Course) focusing on the archaeological work in progress at the Roman site of Portus. It is one of a number of Southampton-based courses that will be made available, in partnership with FutureLearn, for you to study online, for free, wherever you are in the world.
feed = http://moocs.southampton.ac.uk/portus/feed/

Tuesday, May 13, 2014

Additions and corrections in Planet Atlantides

I've just added the following blog to the Maia and Electra feed aggregators:

title = Standards for Networking Ancient Prosopographies
url = http://snapdrgn.net/
creators = Gabriel Bodard, et al.
description = The Networking Ancient Prosopographies: Data and Relations in Greco-Roman Names (hereafter SNAP:DRGN or SNAP) project aims to address the problem of linking together large collections of material (datasets) containing information about persons, names, and person-like entities managed in heterogeneous systems and formats.
feed = http://snapdrgn.net/feed

I've also updated the entry for MutEC as follows (corrected feed URL):

title = Mutualisation d'outils numériques pour les éditions critiques et les corpus (MutEC)
url = http://www.mutec-shs.fr
creators = Marjorie Burghart, et al.
description = MutEC is a facility for sharing, accumulating, and disseminating the technologies and methodologies that are emerging in the field of the digital humanities.
feed = http://www.mutec-shs.fr/?q=rss.xml

MITH and tDAR continue to respond to our bot with 403 Forbidden, so their content will not appear in the aggregators.