%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline]. For more information, type 'help(pylab)'.
Nikolay Koldunov
koldunovn@gmail.com
In this notebook, I am going to show a simple example of Apache access log analysis with pandas. It's my first experience with pandas, and I am sure there are better and more efficient ways to do some of the things shown here. So comments, suggestions and corrections of my broken English are very welcome. You can send me an email, or create a PR for this notebook at github.
We will need the apachelog module for parsing the log. We also have to know the format of the log, which is set in the Apache config. In my case I have no access to the Apache config, but the hosting provider gives a description of the format on its help page. Below is the format itself and a short description of every element:
format = r'%V %h %l %u %t \"%r\" %>s %b \"%i\" \"%{User-Agent}i\" %T'
Here is what each directive means (copied mostly from this SO post):
%V - the server name according to the UseCanonicalName setting
%h - the remote host (i.e. the client IP)
%l - identity of the user determined by identd (not usually used, since not reliable)
%u - the user name determined by HTTP authentication
%t - the time the request was received
%r - the request line from the client ("GET / HTTP/1.0")
%>s - the status code sent from the server to the client (200, 404 etc.)
%b - the size of the response to the client (in bytes)
\"%i\" - the Referer header: the page that linked to this URL
%{User-Agent}i - the browser identification string
%T - the time taken to serve the request, in seconds
import apachelog, sys
Set the format:
fformat = r'%V %h %l %u %t \"%r\" %>s %b \"%i\" \"%{User-Agent}i\" %T'
Create the parser:
p = apachelog.parser(fformat)
Sample string:
koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0
sample_string = 'koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0'
data = p.parse(sample_string)
data
{'%>s': '200', '%T': '0', '%V': 'koldunov.net', '%b': '65237', '%h': '85.26.235.202', '%i': 'http://koldunov.net/?p=364', '%l': '-', '%r': 'GET /?p=364 HTTP/1.0', '%t': '[16/Mar/2013:00:19:43 +0400]', '%u': '-', '%{User-Agent}i': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
Read the log file into a list of lines:
log = open('access_log_for_pandas').readlines()
Parse every line and create a list of dictionaries:
log_list = []
for line in log:
    try:
        data = p.parse(line)
    except:
        sys.stderr.write("Unable to parse %s" % line)
        continue
    # reformat the timestamp: '[16/Mar/2013:00:19:43 +0400]' -> '16/Mar/2013 00:19:43 +0400'
    data['%t'] = data['%t'][1:12]+' '+data['%t'][13:21]+' '+data['%t'][22:27]
    log_list.append(data)
We had to tweak the time format a bit, since otherwise pandas would not be able to parse it. (Note that we also skip lines that fail to parse, so that a broken line does not get appended twice with stale data.)
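For example, here is what the slicing does to the timestamp of our sample string:
t = '[16/Mar/2013:00:19:43 +0400]'
t[1:12] + ' ' + t[13:21] + ' ' + t[22:27]
'16/Mar/2013 00:19:43 +0400'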
The loop gives us a list of dictionaries that can be transformed into a DataFrame:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame, Panel
df = DataFrame(log_list)
Show the first two rows of the DataFrame:
df[0:2]
%>s | %T | %V | %b | %h | %i | %l | %r | %t | %u | %{User-Agent}i | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 200 | 0 | www.oceanographers.ru | 26126 | 109.165.31.156 | - | - | GET /index.php?option=com_content&task=section... | 16/Mar/2013 08:00:25 +0400 | - | Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20... |
1 | 200 | 0 | www.oceanographers.ru | 10532 | 109.165.31.156 | http://www.oceanographers.ru/index.php?option=... | - | GET /templates/ja_procyon/css/template_css.css... | 16/Mar/2013 08:00:25 +0400 | - | Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20... |
We are not going to use all the data, so let's delete some of the columns:
del df['%T']; del df['%V']; del df['%i']; del df['%l']; del df['%u']; del df['%{User-Agent}i']
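By the way, the same thing can be done in a single call (an equivalent alternative to the dels above):
df = df.drop(['%T', '%V', '%i', '%l', '%u', '%{User-Agent}i'], axis=1)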
Now rename the columns to something that humans can understand:
df = df.rename(columns={'%>s': 'Status', '%b':'b',
'%h':'IP', '%r':'Request', '%t': 'Time'})
The first five rows of the resulting DataFrame:
df.head()
Status | b | IP | Request | Time | |
---|---|---|---|---|---|
0 | 200 | 26126 | 109.165.31.156 | GET /index.php?option=com_content&task=section... | 16/Mar/2013 08:00:25 +0400 |
1 | 200 | 10532 | 109.165.31.156 | GET /templates/ja_procyon/css/template_css.css... | 16/Mar/2013 08:00:25 +0400 |
2 | 200 | 1853 | 109.165.31.156 | GET /templates/ja_procyon/switcher.js HTTP/1.0 | 16/Mar/2013 08:00:25 +0400 |
3 | 200 | 37153 | 109.165.31.156 | GET /includes/js/overlib_mini.js HTTP/1.0 | 16/Mar/2013 08:00:25 +0400 |
4 | 200 | 3978 | 109.165.31.156 | GET /modules/ja_transmenu/transmenuh.css HTTP/1.0 | 16/Mar/2013 08:00:25 +0400 |
Convert the Time column to datetime format and make it the index (pop drops the original Time column):
df.index = pd.to_datetime(df.pop('Time'))
The Status variable is of string type, so we have to change it to int:
df['Status'] = df['Status'].astype('int')
Some of the rows in the b column contain the string '-', so we can't convert them with astype:
df['b'][93]
'-'
We can apply a custom function to this column that converts dashes to NaN and everything else to floats, additionally converting from bytes to megabytes:
def dash2nan(x):
    if x == '-':
        x = np.nan
    else:
        x = float(x)/1048576.
    return x
df['b'] = df['b'].apply(dash2nan)
I am sure there is a more elegant way to do this :)
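Indeed, there is: one vectorized alternative (a sketch, meant to be used instead of the apply above) replaces the dashes first and then converts the whole column at once:
df['b'] = df['b'].replace('-', np.nan).astype('float') / 1048576.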
Our first, simplest plot: outgoing traffic from the website:
df['b'].plot()
<matplotlib.axes.AxesSubplot at 0xbf7574c>
Looks like somebody downloaded something big from the website at around 9 a.m.
But actually the first thing you probably want to know is how many visits (hits, really) your site gets, and how they are distributed in time. We resample the series of the variable b at 5-minute intervals and count the number of requests in every time span. In this case it doesn't matter which variable we use: the numbers simply indicate how many times information from the website was requested.
df_s = df['b'].resample('5t', how='count')
df_s.plot()
<matplotlib.axes.AxesSubplot at 0xc14588c>
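A side note for readers with a recent pandas version: the how= keyword was later removed from resample; there the equivalent is:
df_s = df['b'].resample('5min').count()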
We can not only count the number of requests per time span, but also calculate the sum of the traffic for each one:
df_b = df['b'].resample('10t', how=['count','sum'])
df_b['count'].plot( color='r')
legend()
df_b['sum'].plot(secondary_y=True)
<matplotlib.axes.AxesSubplot at 0xc2d53ac>
On the plot you can see that the number of server requests does not always coincide with the amount of traffic; the correlation between the two is actually not very high:
df_b.corr()
count | sum | |
---|---|---|
count | 1.000000 | 0.512629 |
sum | 0.512629 | 1.000000 |
We can have a closer look at the curious morning peak:
df_b['2013-03-16 6:00':'2013-03-16 10:00']['sum'].plot()
<matplotlib.axes.AxesSubplot at 0xc3f5dac>
It seems that this traffic spike was caused by only one request. Let's find out what this request looks like. Choose all requests with a response size larger than 20 Mb (remember, we converted bytes to Mb):
df[df['b']>20]
Status | b | IP | Request | |
---|---|---|---|---|
Time | ||||
2013-03-16 09:02:59 | 200 | 21.365701 | 77.50.248.20 | GET /books/Bondarenko.pdf HTTP/1.0 |
It was a .pdf file of a book (look at the Request field), and this explains the peak in outgoing traffic at 2013-03-16 09:02:59.
Clearly 20 Mb is a large request (for our website at least :)). But what is the typical size of the server's response? The histogram of response sizes (less than 20 Mb) looks like this:
cc = df[df['b']<20]
cc.b.hist(bins=10)
<matplotlib.axes.AxesSubplot at 0xc52374c>
So most of the files are smaller than 0.5 Mb. In fact, they are even smaller:
cc = df[df['b']<0.3]
cc.b.hist(bins=100)
<matplotlib.axes.AxesSubplot at 0xc5760ec>
Very small responses can be all kinds of stuff, but larger ones are probably some relatively big files. Let's have a look at these little spikes between 0.2 and 0.25 Mb:
cc = df[(df['b']>0.2)&(df['b']<0.25)]
cc.b.hist(bins=100)
<matplotlib.axes.AxesSubplot at 0xc57cacc>
We can now locate the size of the file more precisely and find out its name:
cc = df[(df['b']>0.220)&(df['b']<0.224)]
cc.head()
Status | b | IP | Request | |
---|---|---|---|---|
Time | ||||
2013-03-16 10:13:51 | 200 | 0.223101 | 178.238.18.194 | GET /mypict/itpocover.png HTTP/1.0 |
2013-03-16 11:02:12 | 200 | 0.223101 | 2.95.140.217 | GET /mypict/itpocover.png HTTP/1.0 |
2013-03-16 11:18:48 | 200 | 0.223101 | 213.87.138.145 | GET /mypict/itpocover.png HTTP/1.0 |
2013-03-16 11:33:36 | 200 | 0.223101 | 91.207.106.252 | GET /mypict/itpocover.png HTTP/1.0 |
2013-03-16 11:34:03 | 200 | 0.223101 | 212.3.137.202 | GET /mypict/itpocover.png HTTP/1.0 |
This is one of the image files from the front page of the website.
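To double-check that the spikes in this band really come from a single file, we can count the distinct requests in it (a quick sketch):
cc['Request'].value_counts().head()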
I would like to know how the different server responses are distributed in time. Let's first try a not very elegant approach: we create several variables, each holding a time series of Status values for one particular status code. Then a DataFrame is created out of these time series.
t_span = '2H'
df_404 = df['Status'][df['Status'] == 404].resample(t_span, how='count')
df_403 = df['Status'][df['Status'] == 403].resample(t_span, how='count')
df_301 = df['Status'][df['Status'] == 301].resample(t_span, how='count')
df_304 = df['Status'][df['Status'] == 304].resample(t_span, how='count')
df_200 = df['Status'][df['Status'] == 200].resample(t_span, how='count')
status_df = DataFrame({'Not Found':df_404, 'Forbidden':df_403, 'Moved Permanently':df_301, 'Not Modified':df_304, 'OK':df_200,})
status_df.head()
Forbidden | Moved Permanently | Not Found | Not Modified | OK | |
---|---|---|---|---|---|
Time | |||||
2013-03-16 08:00:00 | 22 | 17 | 3 | 4 | 375 |
2013-03-16 10:00:00 | 26 | 76 | 2 | 2 | 607 |
2013-03-16 12:00:00 | 45 | 83 | 2 | 23 | 780 |
2013-03-16 14:00:00 | 29 | 34 | 4 | 2 | 699 |
2013-03-16 16:00:00 | 27 | 94 | 5 | 3 | 886 |
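As an aside, the five nearly identical lines above can be collapsed into a single dict (a sketch that should build the same DataFrame, using the same resample API):
names = {404: 'Not Found', 403: 'Forbidden', 301: 'Moved Permanently',
         304: 'Not Modified', 200: 'OK'}
status_df = DataFrame(dict((name, df['Status'][df['Status'] == code].resample(t_span, how='count'))
                           for code, name in names.items()))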
We plot all values at once:
status_df.plot(figsize=(10, 3))
<matplotlib.axes.AxesSubplot at 0xcda1e6c>
Show only the error and redirection status codes as a stacked plot:
status_df[['Not Found','Forbidden','Moved Permanently','Not Modified']].plot(kind='barh', stacked=True, figsize=(10, 7))
<matplotlib.axes.AxesSubplot at 0xce27e2c>
But there is a better way to get statistics about different groups of values. Here we group our data by Status:
grouped_status = df.groupby('Status')
The data are now rearranged into several groups, each corresponding to a certain value of Status (here, only the first two rows of every group are shown):
grouped_status.head(2)
Status | b | IP | Request | ||
---|---|---|---|---|---|
Status | Time | ||||
200 | 2013-03-16 08:00:25 | 200 | 0.024916 | 109.165.31.156 | GET /index.php?option=com_content&task=section... |
2013-03-16 08:00:25 | 200 | 0.010044 | 109.165.31.156 | GET /templates/ja_procyon/css/template_css.css... | |
206 | 2013-03-16 11:42:53 | 206 | 0.031250 | 109.171.109.164 | GET /images/stories/pdf/0104_044.pdf HTTP/1.0 |
2013-03-16 11:42:53 | 206 | 0.224717 | 109.171.109.164 | GET /images/stories/pdf/0104_044.pdf HTTP/1.0 | |
301 | 2013-03-16 08:13:08 | 301 | 0.000232 | 38.100.21.63 | GET /forum HTTP/1.0 |
2013-03-16 08:13:09 | 301 | 0.000234 | 38.100.21.63 | GET /podcast HTTP/1.0 | |
302 | 2013-03-16 08:47:04 | 302 | NaN | 176.8.91.244 | POST /podcast/wp-comments-post.php HTTP/1.0 |
2013-03-16 11:27:40 | 302 | NaN | 78.30.212.183 | POST /podcast/wp-comments-post.php HTTP/1.0 | |
304 | 2013-03-16 08:19:41 | 304 | NaN | 178.154.206.250 | GET /images/stories/researchers/laplace.jpg HT... |
2013-03-16 08:33:14 | 304 | NaN | 178.154.206.250 | GET /images/stories/researchers/treshnikov.jpg... | |
400 | 2013-03-16 08:13:09 | 400 | 0.000999 | 38.100.21.63 | GET /.. HTTP/1.0 |
403 | 2013-03-16 08:04:23 | 403 | 0.001054 | 5.79.199.116 | GET /forum/download/file.php?avatar=156.gif HT... |
2013-03-16 08:04:23 | 403 | 0.001054 | 5.79.199.116 | GET /forum/download/file.php?avatar=216.jpg HT... | |
404 | 2013-03-16 08:04:25 | 404 | 0.001063 | 5.79.199.116 | GET /apple-touch-icon-precomposed.png HTTP/1.0 |
2013-03-16 08:04:26 | 404 | 0.001063 | 5.79.199.116 | GET /apple-touch-icon.png HTTP/1.0 |
We can count the number of values in each group:
grouped_status.size().plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0xcfe76ac>
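By the way, the same numbers, already sorted in descending order, are returned by value_counts:
df['Status'].value_counts()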
We can look at individual groups:
t_span = '30t'
grouped_status.get_group(301)['Status'].resample(t_span, how='count').plot(color='g', label='301')
legend()
grouped_status.get_group(200)['Status'].resample(t_span, how='count').plot(color='b', secondary_y=True, label='200')
<matplotlib.axes.AxesSubplot at 0xd0799ec>
Now I want to group by IPs and calculate how many requests we get from each of them:
ips = df.groupby('IP').size()
I don't want to see all of them, just the top 10:
ips.sort()
ips[-10:].plot(kind='barh')
<matplotlib.axes.AxesSubplot at 0xd13c0ec>
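A note for newer pandas versions, where the in-place Series.sort() used above no longer exists; there the equivalent would be:
ips = ips.sort_values()
ips.nlargest(10)   # or get the ten busiest IPs directly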
I just want to have a nice table with the top 10 IPs, so I convert them to a DataFrame :)
ips_fd = DataFrame({'Number of requests':ips[-10:]})
ips_fd = ips_fd.sort(columns='Number of requests', ascending=False)
ips_fd
Number of requests | |
---|---|
IP | |
66.249.73.137 | 814 |
178.154.206.250 | 168 |
85.114.191.235 | 165 |
173.199.115.187 | 127 |
85.142.104.227 | 126 |
178.70.210.234 | 120 |
85.142.104.228 | 111 |
95.118.89.119 | 93 |
217.69.133.68 | 74 |
92.255.161.124 | 72 |
We can also group by both IP and Status:
ips_status = df.groupby(['IP', 'Status']).size()
ips_status.sort()
ips_status[-20:].plot(kind='barh')
<matplotlib.axes.AxesSubplot at 0xd3425cc>
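If a table is preferable to a bar chart, the same counts can be pivoted so that IPs become rows and status codes become columns (a sketch; fillna covers IP/status pairs that never occurred):
df.groupby(['IP', 'Status']).size().unstack('Status').fillna(0)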
Now let's find out where the visitors come from. For this we need the pygeoip module and the free GeoLiteCity database. Import the module and initialise the database:
import pygeoip
gi = pygeoip.GeoIP('./GeoLiteCity.dat', pygeoip.MEMORY_CACHE)
This is how the output of pygeoip looks (we test our top IP from the previous section):
ipcon = gi.record_by_addr('64.233.161.99')
ipcon
{'area_code': 650, 'city': 'Mountain View', 'continent': 'NA', 'country_code': 'US', 'country_code3': 'USA', 'country_name': 'United States', 'dma_code': 807, 'latitude': 37.41919999999999, 'longitude': -122.0574, 'metro_code': 'San Francisco, CA', 'postal_code': '94043', 'region_name': 'CA', 'time_zone': 'America/Los_Angeles'}
Guess who it is :) Looks like Google Bot is my main reader. That's rather sad :(
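As a quick sanity check (my assumption here: genuine Google crawlers have a reverse DNS record under googlebot.com), one can look up the host name of our top IP with the standard library:
import socket
socket.gethostbyaddr('66.249.73.137')[0]   # expected to end in .googlebot.com for a real crawler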
Here is a loop that creates a list of dictionaries with information about the IP addresses. We also add the number of requests made from each of the IPs:
ipcon = []
for iip in ips.index:
    rres = gi.record_by_addr(iip)
    # rres['ip'] = iip
    rres['Number'] = ips[iip]
    # delete some fields we don't need
    del rres['area_code']
    del rres['dma_code']
    del rres['metro_code']
    del rres['postal_code']
    del rres['region_name']
    del rres['time_zone']
    del rres['country_code']
    ipcon.append(rres)
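One caveat: gi.record_by_addr() can return None (or an empty record) for addresses that are missing from the database, which would crash the loop above. A defensive variant (a sketch that keeps the row count aligned with ips) could look like this:
fields = ['city', 'continent', 'country_code3', 'country_name', 'latitude', 'longitude']
ipcon = []
for iip in ips.index:
    rres = gi.record_by_addr(iip) or {}           # empty dict if the IP is unknown
    row = dict((k, rres.get(k)) for k in fields)  # missing fields become None (NaN in the frame)
    row['Number'] = ips[iip]
    ipcon.append(row)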
Create a DataFrame from this list, using the index of ips:
reg = DataFrame(ipcon, index=ips.index)
reg.head()
Number | city | continent | country_code3 | country_name | latitude | longitude | |
---|---|---|---|---|---|---|---|
IP | |||||||
112.78.139.130 | 1 | Jakarta | AS | IDN | Indonesia | -6.1744 | 106.8294 |
116.25.249.140 | 1 | Shenzhen | AS | CHN | China | 22.5333 | 114.1333 |
117.199.193.19 | 1 | Agra | AS | IND | India | 27.1833 | 78.0167 |
117.27.175.158 | 1 | Zhangzhou | AS | CHN | China | 24.5133 | 117.6556 |
119.147.6.60 | 1 | Guangzhou | AS | CHN | China | 23.1167 | 113.2500 |
Group by Country:
country = reg.groupby('country_code3')
ff = country.Number.agg('sum').copy()
ff.sort()
ff[-10:].plot(kind='barh')
<matplotlib.axes.AxesSubplot at 0xd6c68cc>
The website is in Russian, so no surprises here.
Group by city:
city = reg.groupby('city')
ff = city.Number.agg('sum').copy()
ff.sort()
ff[-20:].plot(kind='barh', figsize=(5,8))
<matplotlib.axes.AxesSubplot at 0xd86fa2c>