This data conversation uses car parking meter data obtained via an FOI request to the Isle of Wight Council for Pay and Display ticket machine transaction data from the ticket machines in the River Road Car Park, Yarmouth, Isle of Wight, for the financial year 2012-13.
The data includes an identifier for the ticket machine that issued the ticket, the time the ticket was issued, the tariff band (i.e. the nominal ticket value), and the amount of cash paid for the ticket.
The following conversation is a predominantly visual one, where questions are asked of the data and responses are provided in graphical form - as charts - that then need interpreting.
Several lines of questioning naturally arise as the conversation unfolds.
This conversation with data has been created within an interactive IPython Notebook, using the pandas data wrangling library and the ggplot
graphics library.
#Import some necessary programming libraries that we'll use for the analysis
import pandas as pd
from ggplot import *
#And some housekeeping
import warnings
warnings.simplefilter(action = "ignore", category = FutureWarning)
#See what data files we have available
!ls data/iw_parkingMeterData/
4_10_River Road Transaction Report April 2012.xls
4_11_Transaction Report RR Aug 2012.xls
4_3_Ticket Machine Locations with GIS.xlsx
4_4_Tony hirst reply iw14 2 27649 18mar14.pdf
4_5_Transaction Report RR Dec 2012 March 2013.xls
4_6_Transaction Report RR July 2012.xls
4_7_Transaction Report RR June 2012.xls
4_8_Transaction Report RR May 2012.xls
4_9_Transaction Report RR Sept Nov 2012.xls
correspondence.pdf
#I'm going to start by just looking at data from the period Dec 2012 to March 2013.
#Read in a single spreadsheet and tidy it
df=pd.read_excel("data/iw_parkingMeterData/4_5_Transaction Report RR Dec 2012 March 2013.xls",skiprows=6)
#We need to clean the data a little, dropping empty columns, identifying timestamps as such
df.dropna(how='all',axis=1,inplace=True)
df.Date=pd.to_datetime(df.Date, format="%Y-%m-%d %H:%M:%S", errors="coerce")
df.dropna(subset=["Date"],inplace=True)
#So what does the data look like?
df[:5]
  | Date | Machine | Description | Tariff | Description.1 | Cash
---|---|---|---|---|---|---
0 | 2012-12-01 06:38:53 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
1 | 2012-12-01 07:26:12 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
2 | 2012-12-01 08:22:15 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
3 | 2012-12-01 08:27:01 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
4 | 2012-12-01 08:34:11 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
#What are the separate tariff bands?
df['Description.1'].unique()
array(['LS with Cch £6.60 6>24hrs', 'LS with Cch £1.00 30m>1hr', 'LS with Cch £4.50 4->6hrs', 'LS with Cch £1.90 1->2hrs', 'LS with Cch £3.40 2->4hrs', 'LS with Cch £0.60 30 Mins', 'LS with Cch £3.00 Cch>10h', 'LS with Cch £10 Cch10>14h'], dtype=object)
#It's possibly easier to work with the Tariff code, so what code applies to which description?
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
dfx=df[["Tariff","Description.1"]].copy()
dfx.rename(columns=lambda x: x.replace('.','_'), inplace=True)
q="SELECT DISTINCT Tariff, Description_1 FROM dfx"
pysqldf(q)
  | Tariff | Description_1
---|---|---
0 | 01F | LS with Cch £6.60 6>24hrs |
1 | 01B | LS with Cch £1.00 30m>1hr |
2 | 01E | LS with Cch £4.50 4->6hrs |
3 | 01C | LS with Cch £1.90 1->2hrs |
4 | 01D | LS with Cch £3.40 2->4hrs |
5 | 01A | LS with Cch £0.60 30 Mins |
6 | 01G | LS with Cch £3.00 Cch>10h |
7 | 01H | LS with Cch £10 Cch10>14h |
#We can use this information to generate a mapping from the description or tariff to the tariff price
#[Really should automate the extraction of the amount from the description - see the sketch after the table below]
tariffMap={'01A':0.6, '01B':1,'01C':1.9, '01D':3.4,'01E':4.5,'01F':6.6,'01G':3,'01H':10}
df["Tariff_val"]=df['Tariff'].apply(lambda x: tariffMap[x])
df[:3]
  | Date | Machine | Description | Tariff | Description.1 | Cash | weekday | hour | Tariff_val
---|---|---|---|---|---|---|---|---|---
0 | 2012-12-01 06:38:53 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 5 | 6 | 6.6 |
1 | 2012-12-01 07:26:12 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 5 | 7 | 6.6 |
2 | 2012-12-01 08:22:15 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 5 | 8 | 6.6 |
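As an aside, rather than hand-coding the tariffMap above, the price could be parsed straight out of the description string. Here's a minimal sketch of that idea (it assumes every description contains a single £-prefixed amount; price_from_description is just an illustrative helper name, not part of the original analysis):
#Sketch: parse the tariff price out of the description string rather than hand-coding it
import re
def price_from_description(desc):
    #Pull out the number that follows the £ sign, e.g. "£6.60 6>24hrs" -> 6.6, "£10 Cch10>14h" -> 10.0
    m = re.search(r'£(\d+(?:\.\d+)?)', desc)
    return float(m.group(1)) if m else None
#df["Tariff_val"]=df['Description.1'].apply(price_from_description)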
#How much cash was taken over this period in total?
df[['Cash']].sum()
Cash 18385.85 dtype: float64
#If people paid exactly the tariff price, how much would have been taken?
df[['Tariff_val']].sum()
Tariff_val 18076.6 dtype: float64
#So for this one car park, over four off season months, how much was overpaid?
round(df['Cash'].sum() - df['Tariff_val'].sum(), 2)
309.25
#How much cash was taken over this period for each machine?
df[['Machine','Cash']].groupby('Machine').sum()
Machine | Cash
---|---
YARR01 | 9876.65
YARR02 | 8509.20
#How much cash was taken over this period for each machine and tariff?
df[['Machine','Tariff','Cash']].groupby(['Machine','Tariff']).sum()
Machine | Tariff | Cash
---|---|---
YARR01 | 01A | 81.00
YARR01 | 01B | 634.05
YARR01 | 01C | 2004.90
YARR01 | 01D | 1981.35
YARR01 | 01E | 1391.80
YARR01 | 01F | 3780.55
YARR01 | 01G | 3.00
YARR02 | 01A | 116.90
YARR02 | 01B | 601.15
YARR02 | 01C | 1979.45
YARR02 | 01D | 1676.00
YARR02 | 01E | 1017.50
YARR02 | 01F | 3070.20
YARR02 | 01G | 18.00
YARR02 | 01H | 30.00
The total cash amounts are interesting, but if we want to know how busy the car park was, we need to count the number of tickets issued.
#So how many tickets of each tariff type were issued by each machine?
df[["Tariff","Machine"]].groupby(['Tariff',"Machine"]).agg(len).sort_index()
Tariff | Machine | Count
---|---|---
01A | YARR01 | 133
01A | YARR02 | 192
01B | YARR01 | 627
01B | YARR02 | 595
01C | YARR01 | 1022
01C | YARR02 | 1014
01D | YARR01 | 572
01D | YARR02 | 488
01E | YARR01 | 302
01E | YARR02 | 222
01F | YARR01 | 564
01F | YARR02 | 463
01G | YARR01 | 1
01G | YARR02 | 6
01H | YARR02 | 3
#Can you show me that graphically?
p = ggplot(aes(x='Tariff'), data=df)
p + geom_bar() + ggtitle("Number of Tickets per Tariff") + labs("Tariff Code", "Count") + facet_wrap('Machine',scales='fixed')
[Figure: "Number of Tickets per Tariff" - bar chart of ticket counts by tariff code, faceted by machine]
It looks as if YARR02 is used slightly less - is the area of the car park it covers "further away" from where people are likely to want to go?
There's possibly a diagnostic here too - if sales from one machine fall off and the other runs at a higher rate than normal, that suggests a possible problem with the first machine. We won't explore that here, but we could pursue it in a more detailed investigation.
#Here's the same question asked another way
p = ggplot(aes(x='Tariff',fill="Machine"), data=df)
p + geom_bar() + ggtitle("Number of Tickets per Tariff") + labs("Tariff Code", "Count")
#Ideally these bars would be "dodged" - placed side-by-side, but the charting library doesn't support that at the moment
[Figure: "Number of Tickets per Tariff" - stacked bar chart of ticket counts by tariff code, coloured by machine]
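As a possible workaround for the missing "dodge", we could count the tickets per tariff and machine with pandas and draw a grouped bar chart using pandas' own matplotlib-based plotting rather than ggplot - a sketch only, not part of the conversation proper:
#Sketch: a side-by-side ("dodged") bar chart via pandas' built-in plotting
counts = df.groupby(['Tariff','Machine']).size().unstack('Machine')
counts.plot(kind='bar', title="Number of Tickets per Tariff")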
I'm now going to start exploring when there is most activity. One way of doing this is to summarise the data and look for activity around particular days of the week or hours of the day.
#We can extract different time components from the timestamps as follows
# /via http://pandas-docs.github.io/pandas-docs-travis/timeseries.html
# minute: the minutes of the datetime
# weekday OR dayofweek: the day of the week with Monday=0, Sunday=6
# week: the week ordinal of the year
df['weekday']=df['Date'].apply(lambda x: x.dayofweek)
# hour: the hour of the datetime
df['hour']=df['Date'].apply(lambda x: x.hour)
#Let's just check that's worked:
df[:3]
  | Date | Machine | Description | Tariff | Description.1 | Cash | weekday | hour
---|---|---|---|---|---|---|---|---
0 | 2012-12-01 06:38:53 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 5 | 6 |
1 | 2012-12-01 07:26:12 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 5 | 7 |
2 | 2012-12-01 08:22:15 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 5 | 8 |
How many transactions are issued by day of week? Let's plot them as a bar chart.
ggplot(df, aes(x='factor(weekday)'))+geom_bar()
[Figure: bar chart of transaction counts by day of week]
# this, or similar, should be supported at some point? +scale_x_discrete(labels=["Mon","Tues","Weds","Thurs","Fri","Sat","Sun"])
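In the meantime, one workaround is to map the weekday numbers to day names in a new column and plot against that instead - a quick sketch (note that the chart may then order the labels alphabetically rather than Monday-first):
#Sketch: label weekdays by name rather than number
dayNames = {0:'Mon', 1:'Tues', 2:'Weds', 3:'Thurs', 4:'Fri', 5:'Sat', 6:'Sun'}
df['weekday_name'] = df['weekday'].map(dayNames)
#ggplot(df, aes(x='weekday_name')) + geom_bar()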
So Saturday appears to be the most popular day of the week, and Monday the quietest.
#How many transactions occurred by hour of day?
ggplot(df, aes(x='hour'))+geom_bar(binwidth=1)
[Figure: bar chart of transaction counts by hour of day]
#Can we split that up to see whether it's different across days of the week?
ggplot(df, aes(x='hour'))+geom_bar(binwidth=1)+facet_wrap('weekday',scales='fixed')
[Figure: transaction counts by hour of day, faceted by day of week]
In distribution terms, it looks as if activity concentrates more in the middle of the day at weekends, compared to weekdays. (We could run statistical tests to check this.)
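One quick way to check would be to compare the hour-of-day distributions for weekday and weekend tickets with a two-sample test - a sketch using scipy's Kolmogorov-Smirnov test (scipy is assumed to be available; it isn't used elsewhere in this conversation):
#Sketch: compare weekday and weekend hour-of-day distributions
from scipy.stats import ks_2samp
weekday_hours = df[df['weekday'] < 5]['hour']
weekend_hours = df[df['weekday'] >= 5]['hour']
ks_2samp(weekday_hours, weekend_hours)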
#Can we probe that distribution a little further, perhaps seeing how the hourly counts are made up from different tariff counts?
ggplot(df, aes(x='hour',fill='Tariff'))+geom_bar(binwidth=1)+facet_wrap('weekday',scales='fixed')
[Figure: transaction counts by hour of day, coloured by tariff, faceted by day of week]
So that's not too clear - and we need a legend. But the grey-blue band doesn't appear to be used much in the afternoon... And there's a burst of red band activity last thing on a Saturday. The light blue also seems quite popular on a Saturday?
#Let's try to dig into that a little more. For a given day of the week, how do the tariff bands get used over the day?
ggplot(df[df['weekday']==2], aes(x='hour'))+geom_bar(binwidth=1)+facet_grid('Tariff')+ggtitle('Wednesday')
[Figure: "Wednesday" - transaction counts by hour of day, faceted by tariff]
So the longer ticket 01F is bought in the morning (reasonable) and late in the day (to cover the next morning). 01B and 01C (up to an hour and 1-2 hours) are popular throughout the day. There is maybe a burst in sales of the short 30 minute 01A ticket at the end of the day?
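That hunch could be checked numerically with a quick crosstab of hour against tariff for the Wednesday transactions - again, a sketch rather than part of the original analysis:
#Sketch: count Wednesday tickets by hour of day and tariff band
wednesdays = df[df['weekday']==2]
pd.crosstab(wednesdays['hour'], wednesdays['Tariff'])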
So how does another day compare?
#Let's see what activity for Saturday looks like:
ggplot(df[df['weekday']==5], aes(x='hour'))+geom_bar(binwidth=1)+facet_grid('Tariff')+ggtitle('Saturday')
[Figure: "Saturday" - transaction counts by hour of day, faceted by tariff]
There definitely seems to be an upswing in short term ticket sales at the end of the day: people going out for the evening?
#Let's try to look over all the data to see how the tariff bands compare by hour of day
ggplot(df[(df['Tariff']!='01H') & (df['Tariff']!='01G') ], aes(x='hour'))+geom_bar(binwidth=1)+facet_wrap('Tariff',scales='fixed')
[Figure: transaction counts by hour of day, faceted by tariff (01G and 01H excluded)]
To what extent do people pay more for their parking than they need to - at least in terms of paying more for a ticket than its actual marked price?
#Let's plot a count of cash payments using bins that are 5 pence wide
p = ggplot(aes(x='Cash'), data=df)
p + geom_histogram(binwidth=0.05)
[Figure: histogram of cash payment amounts, 5p bins]
Note the "echo peaks" at £2.00 and £3.50 - representing 10p overpayments on the £1.90 01C tariff and £3.40 01D tariff. Clever, eh? Set the tariff just below natural coinage, perhaps in the expectation you'll get the 'natural' amount a good proportion of the time.
#The Overpayment column is a boolean that specifies whether there was an overpayment or not
df["Overpayment"]=(df["Cash"]!=df["Tariff_val"])
#The OverpaymentVal identifies how much, if anything, was overpaid
df["OverpaymentVal"]=df["Cash"]-df["Tariff_val"]
df[1220:1223]
  | Date | Machine | Description | Tariff | Description.1 | Cash | weekday | hour | Tariff_val | Overpayment | OverpaymentVal
---|---|---|---|---|---|---|---|---|---|---|---
1220 | 2013-01-16 13:34:31 | YARR01 | River Road 1 Yarmouth | 01D | LS with Cch £3.40 2->4hrs | 3.4 | 2 | 13 | 3.4 | False | 0.0 |
1221 | 2013-01-16 14:14:49 | YARR01 | River Road 1 Yarmouth | 01C | LS with Cch £1.90 1->2hrs | 2.0 | 2 | 14 | 1.9 | True | 0.1 |
1222 | 2013-01-16 14:17:16 | YARR01 | River Road 1 Yarmouth | 01A | LS with Cch £0.60 30 Mins | 0.6 | 2 | 14 | 0.6 | False | 0.0 |
#So how common are overpayments by tariff type?
df[["Tariff","Overpayment"]].groupby(['Tariff',"Overpayment"]).agg(len)
Tariff | Overpayment | Count
---|---|---
01A | False | 304
01A | True | 21
01B | False | 1197
01B | True | 25
01C | False | 1074
01C | True | 962
01D | False | 854
01D | True | 206
01E | False | 455
01E | True | 69
01F | False | 890
01F | True | 137
01G | False | 7
01H | False | 3
It seems 01C is getting on for a 50% overpayment rate!
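We can put a figure on that by taking the mean of the Overpayment flag within each tariff band - the mean of a boolean column gives the proportion of True values (a quick sketch):
#Sketch: proportion of tickets in each tariff band that were overpaid
df.groupby('Tariff')['Overpayment'].mean()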
How does revenue come in over the data collection period?
#Let's order the data by timestamp, then add up the cumulative revenue
df.sort_values(['Date'],inplace=True)
df['Cash_cumul'] = df.Cash.cumsum()
df[:3]
  | Date | Machine | Description | Tariff | Description.1 | Cash | weekday | hour | Tariff_val | Overpayment | OverpaymentVal | Cash_cumul
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2012-12-01 06:38:53 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 5 | 6 | 6.6 | False | 0.0 | 6.6 |
1 | 2012-12-01 07:26:12 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 5 | 7 | 6.6 | False | 0.0 | 13.2 |
3221 | 2012-12-01 07:30:14 | YARR02 | River Road 2 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 7.0 | 5 | 7 | 6.6 | True | 0.4 | 20.2 |
#How does it look?
g = ggplot(aes(x="Date",y="Cash_cumul"), data=df )+ geom_line()
g
[Figure: cumulative cash taken over time]
#We can also calculate the accumulated amount within each tariff band
#Let's group the data by tariff band
group=df[['Tariff','Cash']].groupby('Tariff')
#For each group of rows, apply the cumulative sum transformation to each row in the group
#The number of rows in the response will be the same as the number of rows in the original data frame
df['Cash_cumul2']=group.transform('cumsum')['Cash']
#Here's how it looks:
ggplot(df,aes(x="Date",y="Cash_cumul2",colour="Tariff"))+geom_line()
[Figure: cumulative cash taken over time, coloured by tariff]
#We can also split the amounts out into separate charts
ggplot(df, aes(x="Date",y="Cash_cumul2")) + geom_line() \
+ ggtitle("Payments made over time") \
+ labs("Transaction Date", "Transaction amount (£)") \
+ facet_wrap("Tariff",scales = "fixed")