By Samat Jain
May 23, 2006 - 2:02am
Apache’s “combined” log format is one of the most common formats used for access logging, and it contains useful fields such as referrer and user agent. Unfortunately, it does not contain a field listing the virtual host for which a request was made. With Apache, this is easily rectified by defining a custom log format and post-processing the logs to maintain compatibility. Add to httpd.conf:
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" mycombined
Apache, however, needs to be told to start using this log format, which can be done by modifying the CustomLog directive that should already be in httpd.conf:
CustomLog log/access.log mycombined
The first field of every log line will now list the virtual host, like so:
rhombic.net msnbot.msn.com - - [11/Apr/2006:04:00:34 -0500] "GET / HTTP/1.0" 200 3490 "-" "msnbot/0.9 (+http://search.msn.com/msnbot.htm)"
Unfortunately, we now have a custom log format that many log analysis tools may be unable to deal with. The fix is just as simple: write a script that, given our special log file, strips the virtual host field from each line and writes the rest of the line to a separate file per domain. I wrote my solution in Python:
import sys

fpCache = {}  # Maps each domain to its open output file
fileName = sys.argv[1]

fpFullLog = open(fileName)
for line in fpFullLog:
    line = line.split(" ", 1)
    domain = line[0]  # Extract domain
    line = line[1]    # Leave rest of log line alone
    if domain not in fpCache:
        # First time we see this domain: open its file in append mode
        fpCache[domain] = open(fileName + "." + domain, "a")
    fpCache[domain].write(line)
fpFullLog.close()

for fp in fpCache.values():
    fp.close()
For those who miss the obvious, use of this script is:
% python split-accesslog-by-domain.py access.log
which will produce files with their domains appended:
access.log.example.com
access.log.foo.com
…etc…
Each of these files is now in Apache’s combined log format, ready to be used as input to almost every statistics package.
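As a quick sanity check, the claim above can be tested with a regular expression. The following is only a sketch: the pattern and the names COMBINED_RE and is_combined are my own assumptions, not something Apache ships, and the pattern does not handle escaped quotes inside a field. It shows that the sample line from earlier only matches once the virtual host field has been stripped.

```python
import re

# Assumed pattern for the combined format: host, ident, user, [time],
# "request", status, bytes, "referer", "user agent".
# Limitation: fields containing escaped quotes (\") will not match.
COMBINED_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def is_combined(line):
    """Return True if line looks like a combined-format log entry."""
    return COMBINED_RE.match(line.rstrip("\n")) is not None


# The sample line in the custom format, virtual host first
full = ('rhombic.net msnbot.msn.com - - [11/Apr/2006:04:00:34 -0500] '
        '"GET / HTTP/1.0" 200 3490 "-" '
        '"msnbot/0.9 (+http://search.msn.com/msnbot.htm)"')

# What the splitting script writes out: everything after the first field
rest = full.split(" ", 1)[1]

print(is_combined(full))
print(is_combined(rest))
```

The first call fails to match because the extra leading field shifts every other field by one position, which is the same reason statistics packages choke on the unsplit log.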
This script will only work on POSIX-compliant UNIXes that support the “append” write mode. To avoid having to open, close, and reopen a file many times, the script incorporates a ridiculously simple and extremely effective file handle cache. This cache will become a problem if there are too many different domains, as it may be possible to exceed the limit on open files a process may have. Fixing this is an exercise for the reader, as is adding more error detection and handling.
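For readers who would rather not do the exercise, here is one possible approach to the open-file limit, offered as a sketch rather than the definitive fix. It swaps the plain dictionary for a least-recently-used cache built on collections.OrderedDict. The function name split_log and the cap MAX_OPEN are hypothetical, and the value 100 is an arbitrary assumption; pick something below your system's per-process limit.

```python
from collections import OrderedDict

MAX_OPEN = 100  # Assumed cap (must be at least 1); keep it below the OS limit


def split_log(file_name, max_open=MAX_OPEN):
    """Split file_name into one file per domain with at most max_open outputs open."""
    cache = OrderedDict()  # domain -> open file, least recently used first
    with open(file_name) as full_log:
        for line in full_log:
            parts = line.split(" ", 1)
            if len(parts) != 2:
                continue  # Skip lines that have no virtual host field
            domain, rest = parts
            fp = cache.pop(domain, None)
            if fp is None:
                if len(cache) >= max_open:
                    # Evict the least recently used handle to stay under the cap
                    _, oldest = cache.popitem(last=False)
                    oldest.close()
                # Append mode lets an evicted file be reopened without losing data
                fp = open(file_name + "." + domain, "a")
            cache[domain] = fp  # Re-insert at the end: most recently used
            fp.write(rest)
    for fp in cache.values():
        fp.close()
```

Calling split_log(sys.argv[1]) after an import sys would make it a drop-in replacement for the script above. The trade-off is that a log with many domains interleaved evenly will reopen files more often, so the cap is worth setting as high as the system comfortably allows.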
Comments
Hepcat Willy on March 16, 2007 - 10:52am wrote…
Thanks for this article, I’ve been struggling for days to get virtualhost logs working. Your script works perfectly! Yes, I often miss the obvious… Kind regards, Hep
Anonymous Visitor on August 28, 2009 - 5:01am wrote…
CustomLog /Volumes/……. "%{host}i %v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""