A better way to separate Apache log files by virtual host domains

Apache’s “combined” log format is one of the most common formats used for access logging, and it contains useful fields such as referrer and user agent. Unfortunately, it does not contain a field listing the virtual host for which a request was made. With Apache, this is easily rectified by defining a custom log format and post-processing the logs to maintain compatibility. Add to httpd.conf:

LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" mycombined

Apache, however, needs to be told to start using this log format, which can be done by modifying the CustomLog directive that should already be in httpd.conf:

CustomLog log/access.log mycombined

The first field of every log line will now list the virtual host, like so:

rhombic.net msnbot.msn.com - - [11/Apr/2006:04:00:34 -0500] "GET / HTTP/1.0" 200 3490 "-" "msnbot/0.9 (+http://search.msn.com/msnbot.htm)"

Unfortunately, we now have a custom log format that many logging tools may be unable to deal with. The fix is just as simple: write a script that, given our special log file, splits its lines into separate files, one per domain, stripping the virtual host field along the way. I wrote my solution in Python:

import sys

fpCache = {}  # Maps each domain to its open output file

fileName = sys.argv[1]

with open(fileName) as fpFullLog:
    for line in fpFullLog:
        # Split off the first field (the domain); leave the rest of the log line alone
        domain, _, rest = line.partition(" ")
        if not rest:
            continue  # Skip blank or malformed lines that have no second field

        if domain not in fpCache:
            fpCache[domain] = open(fileName + "." + domain, "a")

        fpCache[domain].write(rest)

for fp in fpCache.values():
    fp.close()

For those who miss the obvious, use of this script is:

% python split-accesslog-by-domain.py access.log

which will produce files with their domains appended:

access.log.example.com
access.log.foo.com
 …etc…

Each of these files is now in Apache’s combined log format, ready to be used as input to almost every statistics package.
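A quick way to sanity-check that claim is to match a few output lines against a pattern for the combined format. The sketch below is illustrative only: the `COMBINED` pattern and the `is_combined` helper are names I made up, and the pattern is deliberately simple, so it assumes quoted fields contain no embedded double quotes.

```python
import re

# Hypothetical, simplified pattern for one combined-format line:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"'
)

def is_combined(line):
    """Return True if the line looks like Apache's combined log format."""
    return COMBINED.fullmatch(line.rstrip("\n")) is not None

sample = ('msnbot.msn.com - - [11/Apr/2006:04:00:34 -0500] '
          '"GET / HTTP/1.0" 200 3490 "-" '
          '"msnbot/0.9 (+http://search.msn.com/msnbot.htm)"')

print(is_combined(sample))                   # A line after splitting
print(is_combined("rhombic.net " + sample))  # A line still carrying the virtual host
```

A line that still has the virtual host in front fails the check, because the extra leading field pushes the timestamp out of position.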

This script will only work on POSIX-compliant UNIXes that support the “append” write mode. To avoid having to open, close, and reopen a file many times, the script incorporates a ridiculously simple and extremely effective file handle cache. This caching will become a problem if there are too many different domains, as it may be possible to exceed the limit on the number of open files a process may have. Fixing this is an exercise for the reader, as is adding more exception detection and mitigation.
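For readers who would rather not do the exercise, one possible approach to the open-file limit is to cap the cache and close the least recently used handle when it fills up. This is a minimal sketch, not part of the original script: the `HandleCache` class, its method names, and the default of 64 open files are all assumptions chosen for illustration. Because files are opened in append mode, a handle that was closed and later reopened simply continues at the end of its file.

```python
from collections import OrderedDict

class HandleCache:
    """Keep at most max_open append-mode files open (hypothetical helper)."""

    def __init__(self, max_open=64):  # 64 is an arbitrary illustrative limit
        self.max_open = max_open
        self.handles = OrderedDict()  # path -> file, oldest use first

    def get(self, path):
        fp = self.handles.pop(path, None)
        if fp is None:
            if len(self.handles) >= self.max_open:
                # Evict and close the least recently used handle
                _, oldest = self.handles.popitem(last=False)
                oldest.close()
            fp = open(path, "a")
        self.handles[path] = fp  # Re-insert as the most recently used entry
        return fp

    def close_all(self):
        for fp in self.handles.values():
            fp.close()
        self.handles.clear()
```

In the script above, the dictionary lookup would become something like `cache.get(fileName + "." + domain).write(rest)`, with `cache.close_all()` replacing the final loop.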


Comments

Hepcat Willy

Thanks for this article, I’ve been struggling for days to get virtualhost logs working. Your script works perfectly! Yes, I often miss the obvious… Kind regards, Hep

Anonymous Visitor

CustomLog /Volumes/……. "%{host}i %v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""