In this project, we analyse a dataset of westbound traffic on the I-94 Interstate highway.
Our goal is to identify a few factors associated with heavy traffic and to examine how they correlate with traffic volume.
Information about the dataset, including a downloadable CSV, can be found at the UCI Machine Learning Repository.
Credits: Thanks to John Hogue for compiling the dataset and making it available.
Note: As per the documentation
Due to the above note on the dataset, we should avoid generalising our results for the entire I-94 highway.
Let's now read the CSV file and examine a few sample rows along with the broader information about the dataset.
import pandas as pd
traffic = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")
traffic
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
48204 rows × 9 columns
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
traffic["traffic_volume"].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
From the distribution of the traffic_volume data, we can see that the hourly volume ranges from 0 to 7,280 vehicles, with a mean of about 3,260. A quarter of the records show 1,193 vehicles or fewer, while another quarter show 4,933 or more, roughly four times as many.
This suggests that factors such as time of day (day or night hours), holidays, or road maintenance and construction may be in play.
Since we have the date and hour at which each record was taken, it would be interesting to look next at how the volume varies by hour.
traffic["date_time"] = pd.to_datetime(traffic["date_time"])
# Extract the hour and store it as a separate column
traffic["hour"] = traffic["date_time"].dt.hour
# Extract the month, day of the week, day and year as separate columns for later use
traffic["month"] = traffic["date_time"].dt.month
traffic["day_of_week"] = traffic["date_time"].dt.dayofweek
traffic["day"] = traffic["date_time"].dt.day
traffic["year"] = traffic["date_time"].dt.year
day_time_traffic = traffic[traffic["hour"].between(6,19)]
night_time_traffic = traffic[(traffic["hour"] < 6) | (traffic["hour"] > 19)]
Above, we have separated the hour part of date_time into its own column and divided the dataset into two parts: day time (hours 6 through 19) and night time (hours before 6 or after 19).
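The split relies on `Series.between` including both endpoints by default. The following is a minimal sketch of that behaviour on toy data; the values in `toy` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy hourly records (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "hour": [0, 5, 6, 12, 19, 20, 23],
    "traffic_volume": [500, 900, 5300, 4800, 3300, 2800, 950],
})

# between() is inclusive on both ends by default, so hours 6 and 19 count as day time
is_day = toy["hour"].between(6, 19)

day = toy[is_day]
night = toy[~is_day]  # same rows as (hour < 6) | (hour > 19)

print(day["hour"].tolist())    # [6, 12, 19]
print(night["hour"].tolist())  # [0, 5, 20, 23]
```

Negating a single mask guarantees that every row lands in exactly one of the two parts.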
Now we will use the divided dataset to plot the traffic volume frequency side-by-side.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
sns.set(style = "darkgrid") #To have better theme for grid display
plt.figure(figsize=(10,6))
# Subplot for day time traffic (hours 6 through 19)
plt.subplot(1,2,1)
plt.title("Day Time Traffic")
plt.ylim([0, 8000])
plt.xlim([0, 7500])
plt.xlabel("Traffic Volume")
plt.ylabel("Frequency")
plt.tight_layout() #To ensure labels don't overlap
day_time_traffic["traffic_volume"].plot.hist()
# Subplot for night time traffic (hours 20 through 5)
plt.subplot(1,2,2)
plt.title("Night Time Traffic")
plt.ylim([0, 8000])
plt.xlim([0, 7500])
plt.xlabel("Traffic Volume")
plt.ylabel("Frequency")
plt.tight_layout() #To ensure labels don't overlap
night_time_traffic["traffic_volume"].plot.hist()
plt.show()
# Describe the two dataset traffic volume series
print('Statistics: Day time traffic')
print('='*len('Statistics: Day time traffic'))
print(day_time_traffic["traffic_volume"].describe())
print('\n')
print('Statistics: Night time traffic')
print('='*len('Statistics: Night time traffic'))
print(night_time_traffic["traffic_volume"].describe())
Statistics: Day time traffic
============================
count    27925.000000
mean      4611.267574
std       1294.808371
min          0.000000
25%       3992.000000
50%       4761.000000
75%       5521.000000
max       7280.000000
Name: traffic_volume, dtype: float64

Statistics: Night time traffic
==============================
count    20279.000000
mean      1398.818334
std       1047.170978
min          0.000000
25%        441.000000
50%        929.000000
75%       2456.000000
max       4939.000000
Name: traffic_volume, dtype: float64
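From the histograms and the statistics on day time versus night time traffic volume, we can infer that day time traffic is much heavier: the mean is about 4,611 vehicles per hour against about 1,399 at night. Three quarters of the day time records show at least 3,992 vehicles, whereas three quarters of the night time records show 2,456 or fewer.
Since our goal is to find indicators of heavy traffic, we will proceed with the day time dataset only.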
Next, we will look at one possible indicator: time. There might be more people on the road in a certain month, on a certain day, or at a certain time of day.
We will start by analysing traffic volume by month.
Note: When we extracted the hour, we also extracted the month and day of the week from date_time and stored them in the month and day_of_week columns respectively.
We will use the month column that we already created from date_time to group the data and calculate averages.
# Group by month and calculate averages for all numeric columns
by_month = day_time_traffic.groupby('month').mean(numeric_only=True)
# Show the averages grouped by month
by_month
month | temp | rain_1h | snow_1h | clouds_all | traffic_volume | hour | day_of_week | day | year |
---|---|---|---|---|---|---|---|---|---|
1 | 265.267536 | 0.014556 | 0.000545 | 57.604933 | 4354.506274 | 12.376893 | 2.877975 | 15.727823 | 2015.771095 |
2 | 266.356072 | 0.003843 | 0.000000 | 50.849654 | 4561.668150 | 12.384768 | 2.914936 | 14.489614 | 2015.671118 |
3 | 273.438491 | 0.017853 | 0.000000 | 56.565138 | 4697.060550 | 12.352294 | 3.061009 | 16.403211 | 2015.727523 |
4 | 279.456943 | 0.098661 | 0.000000 | 59.478029 | 4733.964271 | 12.501437 | 3.011910 | 15.446817 | 2015.673511 |
5 | 289.065720 | 0.124830 | 0.000000 | 55.594814 | 4754.816176 | 12.481424 | 2.869582 | 16.289087 | 2015.575077 |
6 | 294.316698 | 0.288086 | 0.000000 | 47.834234 | 4750.101802 | 12.381081 | 3.073874 | 15.303604 | 2015.854054 |
7 | 296.482285 | 3.838574 | 0.000000 | 40.813218 | 4466.139009 | 12.339080 | 3.028017 | 15.865661 | 2015.608477 |
8 | 294.816310 | 0.229472 | 0.000000 | 41.743297 | 4782.513801 | 12.332413 | 2.993691 | 15.768533 | 2015.966088 |
9 | 292.261089 | 0.311384 | 0.000000 | 45.024201 | 4719.379909 | 12.419178 | 2.915525 | 15.373059 | 2016.156164 |
10 | 283.835078 | 0.020980 | 0.000000 | 52.818719 | 4784.037931 | 12.524138 | 3.051232 | 15.750246 | 2014.757635 |
11 | 276.438355 | 0.005304 | 0.000000 | 56.159981 | 4552.144880 | 12.458707 | 2.982539 | 16.292591 | 2014.752714 |
12 | 267.551172 | 0.039121 | 0.001945 | 66.363166 | 4227.925219 | 12.412490 | 3.036993 | 15.678202 | 2014.547733 |
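The grouped averages contain only numeric columns. The following is a minimal sketch of how such a table can be produced when the frame also holds text columns, assuming a pandas version where `mean` accepts `numeric_only`; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy frame mixing a text column with numeric columns (hypothetical values)
toy = pd.DataFrame({
    "month": [1, 1, 7, 7],
    "weather_main": ["Clouds", "Snow", "Clear", "Rain"],  # text column
    "temp": [265.0, 267.0, 296.0, 298.0],
    "traffic_volume": [4300, 4400, 4700, 4900],
})

# numeric_only=True averages only the numeric columns and skips weather_main
monthly_avg = toy.groupby("month").mean(numeric_only=True)
print(monthly_avg)
```

Passing the flag explicitly keeps the result independent of the library's default for non-numeric columns, which has changed between pandas releases.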
Having our averages grouped by month, let's now visualise how the average traffic volume changed each month.
by_month['traffic_volume'].plot.line(x="month", y="traffic_volume")
plt.title("Average traffic volume by month")
plt.xlabel('Month')
plt.ylabel('Avg. Traffic Volume')
plt.xlim([1,12])
# Set one tick per month (1 through 12) so we can clearly see the movement per month
plt.xticks(np.arange(1,13,step=1))
plt.ylim([4200,5000])
plt.show()
Looking at the line plot above, we can observe a few interesting patterns in the average traffic volume:
Warmer months seem to have higher traffic volume than colder months; the lower volume towards the end of the year may also reflect the holiday season (Thanksgiving and Christmas).
Among the warmer months, July is an exception: its average traffic volume drops noticeably. We will pull the July data across years to see whether the drop is consistent or a true outlier.
# Keep only the July data, then group by year to get the July averages across years
by_july = day_time_traffic[day_time_traffic["month"] == 7].groupby('year').mean(numeric_only=True)
# Line plot by year for average traffic volume
by_july['traffic_volume'].plot.line(x="year", y="traffic_volume")
plt.title("July: Average traffic volume over years")
plt.xlabel('Year')
plt.ylabel('Avg. Traffic Volume')
plt.ylim([3000,5000])
plt.show()
From this analysis, it seems that the significant July drop happened only in 2016 and looks like an outlier rather than a seasonal recurrence.
A quick web search for "July 2016 I-94" turns up Black Lives Matter protests; the resulting conflict with the police shut down the freeway for a large part of a weekend.
In particular, the protest over the shooting of Philando Castile (July 6, 2016) took place on the weekend of July 9-10, 2016.
by_day_july_2016 = day_time_traffic[(day_time_traffic["month"] == 7) & (day_time_traffic["year"] == 2016)].groupby('day').mean(numeric_only=True)
by_day_july_2016['traffic_volume'].plot.line(x="day", y="traffic_volume", figsize=(15,6))
plt.title("July 2016: Average traffic volume by day")
plt.xlabel('Day')
plt.ylabel('Avg. Traffic Volume')
plt.xlim([1,31])
plt.xticks(np.arange(1,32,step=1))
plt.show()
However, when we plot the average traffic volume for each day of July 2016, the significant drop appears on the weekend of July 23-24.
This could still be linked to the protest closure if the dates in the dataset are off, though we cannot confirm this from the data.
Either way, we can treat the July exception as an outlier caused by one particular year.
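One way to locate such a drop without reading it off a plot is to compare each day's average against a typical value for the period. The following is a minimal sketch under stated assumptions: the daily averages are hypothetical, and the 60% cut-off is an arbitrary choice for illustration, not a rule used in this analysis.

```python
import pandas as pd

# Hypothetical daily averages for part of one month (illustrative values only)
daily_avg = pd.Series(
    [4900, 5100, 5000, 2100, 1900, 4950, 5050],
    index=pd.Index([20, 21, 22, 23, 24, 25, 26], name="day"),
    name="traffic_volume",
)

# Flag days whose average falls below 60% of the median for the period.
# The 60% cut-off is an arbitrary assumption made for this sketch.
threshold = 0.6 * daily_avg.median()
low_days = daily_avg[daily_avg < threshold]

print(low_days.index.tolist())  # [23, 24]
```

Because weekends carry less traffic than business days in any case, a fairer check on real data would compare a suspect weekend against other weekends only.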
Next, we will look at the average traffic volume for each day of the week.
We will use the day_of_week column that we already created from date_time to group the data and calculate averages.
Note: Day of the week is a range from 0 (Monday) through 6 (Sunday).
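As a quick check of this numbering convention, the sketch below applies `dt.dayofweek` and `dt.day_name()` to the first and last timestamps shown in the dataset preview.

```python
import pandas as pd

# First and last timestamps shown in the dataset preview
stamps = pd.Series(pd.to_datetime(["2012-10-02 09:00:00", "2018-09-30 23:00:00"]))

# dayofweek counts from Monday=0 to Sunday=6
print(stamps.dt.dayofweek.tolist())   # [1, 6]
print(stamps.dt.day_name().tolist())  # ['Tuesday', 'Sunday']
```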
# Group by day of the week and calculate averages for all numeric columns
by_dayofweek = day_time_traffic.groupby('day_of_week').mean(numeric_only=True)
# Show the averages grouped by day of the week
by_dayofweek # 0 is Monday, 6 is Sunday
day_of_week | temp | rain_1h | snow_1h | clouds_all | traffic_volume | hour | month | day | year |
---|---|---|---|---|---|---|---|---|---|
0 | 282.126012 | 2.536164 | 0.000015 | 56.566723 | 4771.066360 | 12.482926 | 6.395495 | 15.337128 | 2015.528215 |
1 | 282.068457 | 0.144639 | 0.000191 | 52.259240 | 5071.712720 | 12.341066 | 6.466225 | 15.605149 | 2015.581188 |
2 | 282.020898 | 0.061704 | 0.000935 | 52.758020 | 5171.716737 | 12.405123 | 6.622979 | 15.993534 | 2015.497140 |
3 | 282.121619 | 0.181953 | 0.000127 | 53.279605 | 5187.552885 | 12.452429 | 6.514676 | 15.878289 | 2015.490385 |
4 | 281.884306 | 0.093327 | 0.000192 | 50.770397 | 5164.136903 | 12.441526 | 6.562263 | 15.962364 | 2015.505431 |
5 | 282.080936 | 0.102685 | 0.000081 | 49.643327 | 3697.681496 | 12.390546 | 6.502781 | 15.600101 | 2015.527300 |
6 | 282.034226 | 0.151365 | 0.000000 | 51.798243 | 3213.063237 | 12.367378 | 6.595232 | 15.634630 | 2015.447930 |
With the averages grouped by day of the week, let's now visualise how the average traffic volume varies across the week.
by_dayofweek['traffic_volume'].plot.line(x="day_of_week", y="traffic_volume")
plt.title("Average traffic volume by day of the week")
plt.xlabel('Day of Week')
plt.ylabel('Avg. Traffic Volume')
plt.xlim([0,6])
# Setting ticks so we see the days rather than numbers
day_of_week_labels = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
plt.xticks(np.arange(0,7,step=1), labels=day_of_week_labels,rotation=90)
plt.ylim([3000,5250])
plt.yticks(np.arange(3000,5250,step=250))
plt.show()
As the plot above shows, the average traffic volume is markedly higher on business days (Monday to Friday), at roughly 4,800 to 5,200 vehicles, and drops sharply on weekends (Saturday and Sunday), to roughly 3,200 to 3,700.
Next, we will look at the average traffic volume by time of day. For this, we will separate the business days (Monday to Friday) from the weekends (Saturday and Sunday).
We will use the hour column in combination with the day_of_week column, both already created from date_time, to calculate averages.
# Separate weekend and business days from the overall day time traffic
# (filter first, then copy, so only the selected rows are duplicated)
business_days = day_time_traffic[day_time_traffic["day_of_week"] <= 4].copy()
weekend = day_time_traffic[day_time_traffic["day_of_week"] >= 5].copy()
by_hour_business = business_days.groupby("hour").mean(numeric_only=True)
by_hour_weekend = weekend.groupby("hour").mean(numeric_only=True)
by_hour_business
hour | temp | rain_1h | snow_1h | clouds_all | traffic_volume | month | day_of_week | day | year |
---|---|---|---|---|---|---|---|---|---|
6 | 278.432872 | 0.164433 | 0.000067 | 45.685695 | 5365.983210 | 6.573539 | 1.997314 | 16.005373 | 2015.558093 |
7 | 278.662639 | 0.145105 | 0.000068 | 50.538983 | 6030.413559 | 6.363390 | 1.984407 | 15.810847 | 2015.562712 |
8 | 278.938443 | 0.144614 | 0.000135 | 53.666441 | 5503.497970 | 6.567659 | 1.989175 | 15.889716 | 2015.493234 |
9 | 279.628421 | 0.156829 | 0.000139 | 53.619709 | 4895.269257 | 6.484386 | 1.981263 | 15.701596 | 2015.548924 |
10 | 280.664650 | 0.113984 | 0.000033 | 54.781417 | 4378.419118 | 6.481283 | 1.957888 | 16.094251 | 2015.526738 |
11 | 281.850231 | 0.151976 | 0.000000 | 52.808876 | 4633.419470 | 6.448819 | 1.979957 | 15.682176 | 2015.528275 |
12 | 282.832763 | 0.090271 | 0.001543 | 53.855714 | 4855.382143 | 6.569286 | 1.989286 | 15.764286 | 2015.550000 |
13 | 283.292447 | 0.092433 | 0.000370 | 53.325444 | 4859.180473 | 6.465237 | 1.982988 | 15.656065 | 2015.514053 |
14 | 284.091787 | 0.102991 | 0.000746 | 55.326531 | 5152.995778 | 6.588318 | 1.990852 | 15.683322 | 2015.501056 |
15 | 284.450605 | 0.090036 | 0.000274 | 54.168467 | 5592.897768 | 6.541397 | 1.962563 | 15.635709 | 2015.509719 |
16 | 284.399011 | 0.118180 | 0.000632 | 54.444132 | 6189.473647 | 6.580464 | 1.995081 | 15.635980 | 2015.483486 |
17 | 284.263033 | 7.299358 | 0.000000 | 55.204960 | 5784.827133 | 6.510576 | 1.994165 | 15.510576 | 2015.482859 |
18 | 284.388061 | 0.121533 | 0.000125 | 54.183079 | 4434.209431 | 6.529126 | 1.988211 | 15.708738 | 2015.529126 |
19 | 283.439235 | 0.156652 | 0.000000 | 53.014184 | 3298.340426 | 6.460993 | 1.989362 | 15.705674 | 2015.492199 |
by_hour_weekend
hour | temp | rain_1h | snow_1h | clouds_all | traffic_volume | month | day_of_week | day | year |
---|---|---|---|---|---|---|---|---|---|
6 | 278.115656 | 0.270134 | 0.000000 | 44.491639 | 1089.100334 | 6.533445 | 5.521739 | 15.884615 | 2015.498328 |
7 | 278.095331 | 0.291540 | 0.000000 | 50.006623 | 1589.365894 | 6.518212 | 5.501656 | 15.970199 | 2015.442053 |
8 | 277.981017 | 0.083870 | 0.000083 | 48.877076 | 2338.578073 | 6.523256 | 5.503322 | 15.734219 | 2015.471761 |
9 | 279.785660 | 0.075234 | 0.000364 | 49.688042 | 3111.623917 | 6.603120 | 5.492201 | 15.519931 | 2015.495667 |
10 | 280.403811 | 0.079674 | 0.000103 | 48.915808 | 3686.632302 | 6.491409 | 5.503436 | 16.039519 | 2015.458763 |
11 | 282.129355 | 0.141387 | 0.000000 | 52.372973 | 4044.154955 | 6.482883 | 5.491892 | 15.544144 | 2015.549550 |
12 | 282.936119 | 0.095784 | 0.000000 | 51.418018 | 4372.482883 | 6.500901 | 5.493694 | 15.353153 | 2015.495495 |
13 | 283.784951 | 0.180452 | 0.000000 | 53.095841 | 4362.296564 | 6.580470 | 5.497288 | 15.493671 | 2015.508137 |
14 | 284.663261 | 0.087847 | 0.000000 | 52.735401 | 4358.543796 | 6.644161 | 5.500000 | 15.594891 | 2015.474453 |
15 | 284.854578 | 0.074771 | 0.000000 | 52.148624 | 4342.456881 | 6.612844 | 5.506422 | 15.543119 | 2015.486239 |
16 | 284.755487 | 0.145894 | 0.000000 | 53.630088 | 4339.693805 | 6.566372 | 5.507965 | 15.631858 | 2015.467257 |
17 | 284.760020 | 0.135783 | 0.000000 | 53.064057 | 4151.919929 | 6.571174 | 5.508897 | 15.357651 | 2015.446619 |
18 | 284.308607 | 0.040956 | 0.000000 | 50.948529 | 3811.792279 | 6.531250 | 5.496324 | 15.332721 | 2015.545956 |
19 | 283.463563 | 0.054773 | 0.000000 | 49.558984 | 3220.234120 | 6.537205 | 5.499093 | 15.560799 | 2015.491833 |
We will now visualise average traffic volume by time of the day comparing business days to weekends.
# Line plot of weekend and business days with the same x and y-axis scales for comparison
by_hour_business['traffic_volume'].plot(x="hour", y="traffic_volume", label="Business Days")
by_hour_weekend['traffic_volume'].plot(x="hour", y="traffic_volume", label="Weekend")
# Graph visual context setting
plt.title("Average Traffic Volume by time of the day")
plt.xlabel('Hour')
plt.ylabel('Avg. Traffic Volume')
# Ensure x and y axes ticks are spread for presentability
plt.xlim([6,19])
plt.xticks(np.arange(6,19,step=2))
plt.ylim([1000,7000])
plt.yticks(np.arange(1000,7000,step=1000))
plt.legend() #Display legend for Business Days vs Weekend
plt.show()
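By comparing the average traffic volumes by time of day for business days and weekends side by side, we can note the following:
On business days, the average traffic volume is higher than on weekends at every hour, and the gap is largest during the peaks.
During business days, traffic peaks at 7:00 (about 6,030 vehicles) and at 16:00 (about 6,189 vehicles), with a dip to about 4,378 at 10:00.
During weekends, traffic rises steadily from about 1,089 vehicles at 6:00 to a plateau of roughly 4,350 between 12:00 and 16:00, then declines.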
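The peak hours can also be read off programmatically rather than from the plot. The sketch below uses the hourly averages from the two tables above, rounded to whole vehicles; the variable names `business` and `weekend_avg` are introduced here for illustration only.

```python
import pandas as pd

hours = list(range(6, 20))

# Average traffic volume per hour, rounded from the business-day and weekend tables
business = pd.Series(
    [5366, 6030, 5503, 4895, 4378, 4633, 4855, 4859, 5153, 5593, 6189, 5785, 4434, 3298],
    index=hours, name="business_days",
)
weekend_avg = pd.Series(
    [1089, 1589, 2339, 3112, 3687, 4044, 4372, 4362, 4359, 4342, 4340, 4152, 3812, 3220],
    index=hours, name="weekend",
)

# Two busiest hours on business days, and the busiest weekend hour
print(business.nlargest(2).index.tolist())  # [16, 7]
print(weekend_avg.idxmax())                 # 12

# Business-day averages exceed weekend averages at every hour
print(bool((business > weekend_avg).all()))  # True
```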
Another possible indicator of heavy traffic is weather. In our dataset, we have the related columns:
temp
rain_1h
snow_1h
clouds_all
weather_main
weather_description
We will look at the numeric columns among the above and their correlation to traffic_volume.
# Get Pearson's r for traffic volume against the numeric columns, ignoring the time-related numeric columns
day_time_traffic.corr(numeric_only=True)["traffic_volume"].drop(["traffic_volume","hour","month","day_of_week", "year"])
temp          0.117139
rain_1h       0.003612
snow_1h       0.003786
clouds_all   -0.024714
day          -0.009218
Name: traffic_volume, dtype: float64
We see that temp is the most correlated with traffic_volume among the numeric weather columns, although the correlation is weak (about 0.12). Let's visualise the relationship through a scatter plot.
sns.scatterplot(data=day_time_traffic, x="traffic_volume", y="temp")
plt.title("Temperature vs Traffic Volume")
plt.show()
From the above analysis, none of the numeric weather columns shows a strong linear relationship with traffic volume.
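For reference, Pearson's r is the covariance of two variables divided by the product of their standard deviations. The following is a minimal sketch that computes it from that definition and compares it with `Series.corr`; the `toy` values are hypothetical and not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "temp": [270.0, 275.0, 280.0, 285.0, 290.0, 295.0],
    "traffic_volume": [4300, 4900, 4500, 5100, 4600, 5000],
})

x = toy["temp"].to_numpy()
y = toy["traffic_volume"].to_numpy()

# Pearson's r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Series.corr uses Pearson's method by default
r_pandas = toy["temp"].corr(toy["traffic_volume"])

print(round(r_manual, 4), round(r_pandas, 4))
```

A value near 0 indicates little linear association, which is how the coefficients for rain_1h, snow_1h and clouds_all read.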
We will now look at the categorical weather columns weather_main and weather_description and their relationship with traffic_volume (if any).
Since these are categorical columns, we will first calculate the averages by grouping with these columns.
by_weather_main = day_time_traffic.groupby("weather_main").mean(numeric_only=True)
by_weather_desc = day_time_traffic.groupby("weather_description").mean(numeric_only=True)
# Sort and plot so that we can clearly see the top indicator for heavy traffic
by_weather_main['traffic_volume'].sort_values().plot.barh(x="weather_main", y="traffic_volume", figsize=(6,6))
plt.xlabel("Traffic Volume")
plt.ylabel("Weather Type")
plt.xlim([0,6000])
plt.xticks(np.arange(0,6000,step=1000))
plt.title('Weather Type vs Traffic Volume')
plt.show()
From the analysis and plot above, none of the main weather types appears to have a significant impact on traffic volume.
# Sort and plot so that we can clearly see the top indicator for heavy traffic
by_weather_desc['traffic_volume'].sort_values().plot.barh(x="weather_description", y="traffic_volume", figsize=(6,12))
plt.xlabel("Traffic Volume")
plt.ylabel("Weather Desc.")
plt.title('Weather Desc vs Traffic Volume')
plt.xlim([0,6000])
plt.xticks(np.arange(0,6000,step=1000))
plt.show()
From the plot above, shower snow, light rain and snow, and shower drizzle appear to be the conditions associated with an average traffic volume of more than 5,000 vehicles.
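An average per category says nothing about how many records stand behind it. The following is a minimal sketch, on hypothetical toy data, of reporting the record count next to each mean so that averages built on very few observations can be spotted; it is a sanity check one might run, not part of the analysis above.

```python
import pandas as pd

# Toy data (hypothetical values): one category has a single record
toy = pd.DataFrame({
    "weather_description": ["sky is clear"] * 4 + ["shower snow"],
    "traffic_volume": [4500, 4700, 4600, 4800, 5600],
})

# Mean traffic volume and number of records per weather description
summary = (
    toy.groupby("weather_description")["traffic_volume"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

In this toy frame the highest mean rests on one record, which illustrates why a high average alone is weak evidence.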
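In this project, we tried to find a few indicators of heavy traffic on the I-94 Interstate highway. We found two types of indicators:
Time indicators: traffic is usually heavier during the warmer months (March to October) than during the colder months (November to February), on business days than on weekends, and, on business days, around the rush hours of 7:00 and 16:00.
Weather indicators: shower snow, light rain and snow, and shower drizzle.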