
Monday, July 14, 2014

Hacking on Apache Log Files with Python

There are plenty of tools out there for doing web request analysis. I wanted to pull some information out of some Apache log files without all that overhead. Here's what I did:

I got Rory McCann's apache-log-parser (after some googling; it's on GitHub and on PyPI). I set up a Python virtual environment using Doug Hellmann's virtualenvwrapper, activated it, and then used:

 pip install apache-log-parser  
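
If you haven't used virtualenvwrapper before, the setup step amounts to something like the following (the environment name is my own invention; mkvirtualenv both creates and activates the environment):

 $ mkvirtualenv apache-logs  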

Since I'd never used apache-log-parser before, I had to get familiar with it. I discovered that, to use it, I had to know the format string that Apache was using to log information for my site. This was a two-step process, figured out by reading the Log Files section of the Apache documentation and poking about with grep.

First, I searched in the Apache configuration files for the CustomLog directive that's associated with the virtual host I wanted to analyze. This gave me a 'nickname' for the log configuration. More spelunking in Apache config files -- this time in the main configuration file -- turned up the definition of that nickname (Apache uses the LogFormat directive for this purpose):

 $ cd /etc/apache2/  
 $ grep CustomLog sites-enabled/foo.nowhere.org  
  CustomLog /var/log/apache2/foo.nowhere.org-access.log combined  
 $ grep combined apache2.conf   
 LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined  

It's that LogFormat string that needs to be given to Rory's code so it knows how to parse your log files.
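
For a quick feel for what the parser returns, a minimal interpreter session looks something like this (the log line here is invented for illustration, using the simpler "common" format):

 >>> import apache_log_parser  
 >>> line_parser = apache_log_parser.make_parser('%h %l %u %t "%r" %>s %b')  
 >>> entry = line_parser('127.0.0.1 - - [14/Jul/2014:10:00:00 -0400] "GET / HTTP/1.1" 200 1234')  
 >>> entry['remote_host']  
 '127.0.0.1'  
 >>> entry['status']  
 '200'  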

After some experimenting in the Python interpreter to get a feel for the code and its capabilities, I wrote a few lines of my own to wrap the setup and file reading operations:

 #!/usr/bin/env python  
 # -*- coding: utf-8 -*-  
   
 import apache_log_parser  
 import glob  
 import logging  
   
 # supported log file formats  
 APACHE_COMBINED = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""  
 APACHE_COMMON = "%h %l %u %t \"%r\" %>s %b"  
   
 def gulp(log_file_path, pattern=APACHE_COMBINED):  
     """Import and parse all log files matching log_file_path."""  
     log_data = []  
     line_parser = apache_log_parser.make_parser(pattern)  
     for file_name in glob.glob(log_file_path):  
         logging.info("file_name: %s" % file_name)  
         # read all lines from this log file  
         with open(file_name, 'r') as f:  
             lines = f.readlines()  
         logging.info(" read %s lines" % len(lines))  
         for line in lines:  
             log_data.append(line_parser(line))  
     logging.info("total number of events parsed: %s" % len(log_data))  
     return log_data  

For this particular server, I had multiple log files, but I wanted to have all the requests from all of them (parsed into dictionaries by Rory's code) in a single list for subsequent analysis. So, back to the Python interpreter:

 >>> import logging  
 >>> logging.basicConfig(level=logging.INFO)  
 >>> import loggulper  
 >>> d = loggulper.gulp("/path/to/my/log/files/foo.nowhere.org-access.*")  

I'll spare you the logging messages. This took several minutes. I ended up with about 1.5 million requests in the list. Real life intervened. How to save this data for later without having to run through the whole import process again? Pickle it.

 >>> import cPickle as pickle  
 >>> out_name = "logdata.pickle"  
 >>> outf = open(out_name, 'wb')  
 >>> pickler = pickle.Pickler(outf, pickle.HIGHEST_PROTOCOL)  
 >>> pickler.dump(d)  
 <cPickle.Pickler object at 0x1044044e8>  
 >>> outf.close()  

The whole list was saved to a 933.3 MB file in just a few seconds (full disclosure: I have a solid-state drive). It was nearly as quick to read back in a couple of days later (new interpreter session and all):

 >>> import cPickle as pickle  
 >>> in_name = "logdata.pickle"  
 >>> inf = open(in_name, 'rb')  
 >>> unpickler = pickle.Unpickler(inf)  
 >>> d = unpickler.load()  
 >>> len(d)  
 1522015  
 >>> d[0].keys()  
 ['status', 'request_header_referer', 'remote_user', 'request_header_user_agent__browser__family', 'request_header_user_agent__is_mobile', 'request_header_user_agent__browser__version_string', 'request_header_user_agent', 'request_http_ver', 'request_header_user_agent__os__version_string', 'remote_logname', 'time_recieved_isoformat', 'time_recieved', 'request_first_line', 'request_header_user_agent__os__family', 'request_method', 'request_url', 'remote_host', 'time_recieved_datetimeobj', 'response_bytes_clf']  

It's important to notice at this point that the word "received" is misspelled as "recieved" in the keys of the dictionaries returned by apache-log-parser. If you don't notice this early on, it will cause some amount of frustration.
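
If you'd rather not live with the misspelling, you could normalize the keys right after loading, along these lines (a throwaway sketch; the examples below just use the library's spelling):

 >>> d_fixed = [{k.replace('recieved', 'received'): v for k, v in req.items()} for req in d]  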

It turned out that my log data included events past the end of the reporting period I'd been given (ending 31 May 2014), so I needed to winnow the list down to just those requests that fell within the period. Python list comprehensions to the rescue:

 >>> dates=[req['time_recieved_datetimeobj'] for req in d]  
 >>> max(dates)  
 datetime.datetime(2014, 7, 13, 11, 37, 31)  
 >>> min(dates)  
 datetime.datetime(2013, 7, 21, 3, 41, 26)  
 >>> from datetime import datetime  
 >>> d_relevant = [req for req in d if req['time_recieved_datetimeobj'] < datetime(2014, 6, 1)]  
 >>> dates=[req['time_recieved_datetimeobj'] for req in d_relevant]  
 >>> max(dates)  
 datetime.datetime(2014, 5, 31, 23, 59, 17)  
 >>> min(dates)  
 datetime.datetime(2013, 7, 21, 3, 41, 26)  

Now to separate requests made by self-identified bots and spiders from the rest of the traffic:

 >>> robots=[req for req in d_relevant if 'bot' in req['request_header_user_agent'].lower() or 'spider' in req['request_header_user_agent'].lower()]  
 >>> len(robots)  
 848450  
 >>> humans=[req for req in d_relevant if 'bot' not in req['request_header_user_agent'].lower() and 'spider' not in req['request_header_user_agent'].lower()]  
 >>> len(humans)  
 486249  
 >>> len(robots) + len(humans) == len(d_relevant)  
 True  
 >>> unique_bots=[]  
 >>> for bot in robots:  
 ...   if bot['request_header_user_agent'] not in unique_bots:  
 ...     unique_bots.append(bot['request_header_user_agent'])  
 ...   
 >>> len(unique_bots)  
 229

Aside: yes, I know there could well still be automated agents in the "humans" list; I've only filtered out those that are not operated by the sneaky or the uninformed. Let's not worry about that issue for now.
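
That said, the user-agent fields that apache-log-parser breaks out make it easy to eyeball what's left; something like this (output omitted) gives a quick tally of the browser families claimed by the "humans":

 >>> from collections import Counter  
 >>> families = Counter(req['request_header_user_agent__browser__family'] for req in humans)  
 >>> families.most_common(5)  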

Stay tuned for the next installment, wherein I hope we actually learn something about how our server is being used.

Wednesday, August 21, 2013

Get a List of Machine Tags from the Flickr API

I want to know what Pleiades machine tags are in use on photos throughout Flickr (more background here). I thought I'd learn how to ask for that information from the Flickr API via a script. I requested and got an API key (see http://www.flickr.com/help/api/). I set up a Python virtual environment and git repository for the project. I went looking for Python code that already implemented interaction with the API and settled (somewhat arbitrarily) on Beej's Python Flickr API kit (now maintained by Sybren Stüvel). Then I used pip install flickrapi to get the package.
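
For orientation: a minimal call with this version of flickrapi looks something like the session below (the API key is a placeholder). With format="json", the library hands back the raw response string, still wrapped in a jsonFlickrApi() callback, which the script below has to strip off:

>>> import flickrapi
>>> flickr = flickrapi.FlickrAPI('0123456789abcdef')  # placeholder key
>>> resp = flickr.machinetags_getPairs(namespace='pleiades', format='json')
>>> resp[:14]
'jsonFlickrApi('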

Here's a command-line session running the script and showing its output:

(pleiades-flickr)darkstar:pleiades-flickr paregorios$ python src/listptags.py 
pleiades:atteststo is used on 15 photos in Flickr
pleiades:denotes is used on 1 photos in Flickr
pleiades:depcits is used on 2 photos in Flickr
pleiades:depicts is used on 7229 photos in Flickr
pleiades:findspot is used on 2197 photos in Flickr
pleiades:finspot is used on 2 photos in Flickr
pleiades:foundat is used on 1 photos in Flickr
pleiades:observedat is used on 3 photos in Flickr
pleiades:origin is used on 225 photos in Flickr
pleiades:place is used on 970 photos in Flickr
pleiades:places is used on 19 photos in Flickr
pleiades:where is used on 119 photos in Flickr
(pleiades-flickr)darkstar:pleiades-flickr paregorios$ 


Here's the code (version at github):

#!/usr/bin/env python
"""
A Flickr tag bot
"""

import argparse
import flickrapi
import json
import logging as l
import os
import sys
import traceback

from myflickr import API_KEY, NAMESPACE_DEFAULT

SCRIPT_DESC = "poll machine tags from flickr"

def main(args):
    """ Unleash the bot! """

    flickr = flickrapi.FlickrAPI(API_KEY)
    resp = flickr.machinetags_getPairs(namespace=args.namespace, format="json")
    # with format="json", the API wraps the response in a jsonFlickrApi()
    # callback; strip it before parsing
    if resp[:14] == "jsonFlickrApi(":
        jstr = resp[14:-1]
        j = json.loads(jstr)
        ptags = [(p['_content'], p['usage']) for p in j['pairs']['pair']]
        for ptag in ptags:
            print "%s is used on %s photos in Flickr" % ptag


if __name__ == "__main__":
    try:
        parser = argparse.ArgumentParser(
            description=SCRIPT_DESC,
            formatter_class=argparse.ArgumentDefaultsHelpFormatter)
        parser.add_argument("-n", "--namespace", default=NAMESPACE_DEFAULT,
                            help="namespace to use in requesting machine tags")
        parser.add_argument("-v", "--verbose", action="store_true", default=False,
                            help="verbose output")
        args = parser.parse_args()
        if args.verbose:
            l.basicConfig(level=l.DEBUG)
        else:
            l.basicConfig(level=l.WARNING)
        main(args)
        sys.exit(0)
    except KeyboardInterrupt:  # Ctrl-C
        raise
    except SystemExit:  # sys.exit()
        raise
    except Exception as e:
        print "ERROR, UNEXPECTED EXCEPTION"
        print str(e)
        traceback.print_exc()
        os._exit(1)
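
The API key and default namespace come from a local myflickr module that isn't shown above (and shouldn't be committed to a public repository); it need be nothing more than something like:

# myflickr.py -- local settings, kept out of version control
API_KEY = 'your-flickr-api-key-here'  # placeholder
NAMESPACE_DEFAULT = 'pleiades'        # inferred from the output above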


Comments, questions, and constructive criticism welcomed!