This analysis is attempting to identify factors associated with heavy traffic.
The dataset is comprised of daily and hourly entries for westbound traffic midway between Minneapolis and St. Paul, the largest and capital cities, respectively, in Minnesota. It was collected from a Department of Transportation station on interstate 94 between October 2012 and September 2018.
The dataset was created by John Hogue from the UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume
The colors used are the University of Minnesota school colors, gold for graphs and maroon for line plots. Go Gophers!
# import libraries, open file, set display options
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pprint
pd.options.display.float_format = '{:20,.4f}'.format
traffic = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.colheader_justify', 'center')
pd.set_option('display.precision', 3)
A quick look at the beginning and ending of the data set shows nine columns with a mix of categorical text and numeric information. The source of the data set states that an automatic traffic recorder (ATR) was used to tabulate traffic volume. It is expressed in the number of vehicles per hour.
# display head, tail, and basic info
display(traffic.head())
display(traffic.tail())
display(traffic.info())
display(traffic.describe())
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.2800 | 0.0000 | 0.0000 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.3600 | 0.0000 | 0.0000 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.5800 | 0.0000 | 0.0000 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.1300 | 0.0000 | 0.0000 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.1400 | 0.0000 | 0.0000 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.4500 | 0.0000 | 0.0000 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.7600 | 0.0000 | 0.0000 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.7300 | 0.0000 | 0.0000 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.0900 | 0.0000 | 0.0000 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.1200 | 0.0000 | 0.0000 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48204 entries, 0 to 48203 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 48204 non-null object 1 temp 48204 non-null float64 2 rain_1h 48204 non-null float64 3 snow_1h 48204 non-null float64 4 clouds_all 48204 non-null int64 5 weather_main 48204 non-null object 6 weather_description 48204 non-null object 7 date_time 48204 non-null object 8 traffic_volume 48204 non-null int64 dtypes: float64(3), int64(2), object(4) memory usage: 3.3+ MB
None
temp | rain_1h | snow_1h | clouds_all | traffic_volume | |
---|---|---|---|---|---|
count | 48,204.0000 | 48,204.0000 | 48,204.0000 | 48,204.0000 | 48,204.0000 |
mean | 281.2059 | 0.3343 | 0.0002 | 49.3622 | 3,259.8184 |
std | 13.3382 | 44.7891 | 0.0082 | 39.0158 | 1,986.8607 |
min | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
25% | 272.1600 | 0.0000 | 0.0000 | 1.0000 | 1,193.0000 |
50% | 282.4500 | 0.0000 | 0.0000 | 64.0000 | 3,380.0000 |
75% | 291.8060 | 0.0000 | 0.0000 | 90.0000 | 4,933.0000 |
max | 310.0700 | 9,831.3000 | 0.5100 | 100.0000 | 7,280.0000 |
# separate rain column by no rain or some rain
no_rain = traffic[traffic["rain_1h"] == 0]
some_rain = traffic[traffic["rain_1h"] > 0]
print("Hours with no rain: ", len(no_rain))
print("Hours with some rain", len(some_rain))
# separate snow column by no snow or some snow
no_snow = traffic[traffic["snow_1h"] == 0]
some_snow = traffic[traffic["snow_1h"] > 0]
print("Hours with no snow: ", len(no_snow))
print("Hours with some snow: ", len(some_snow))
Hours with no rain: 44737 Hours with some rain 3467 Hours with no snow: 48141 Hours with some snow: 63
# create a some_rain column with any rain measurements greater than 0
traffic["some_rain"] = traffic["rain_1h"][traffic["rain_1h"] > 0]
# basic stats for some_rain
display(traffic["some_rain"].describe())
count 3,467.0000 mean 4.6475 std 166.9703 min 0.2500 25% 0.2500 50% 0.6400 75% 1.7800 max 9,831.3000 Name: some_rain, dtype: float64
# remove row with outlier rain values
traffic = traffic[traffic["rain_1h"]<100]
# basic stats for some_rain
display(traffic["some_rain"].describe())
count 3,466.0000 mean 1.8123 std 3.3100 min 0.2500 25% 0.2500 50% 0.6400 75% 1.7800 max 55.6300 Name: some_rain, dtype: float64
# create a some_snow column with any snow measurements greater than 0
traffic["some_snow"] = traffic["snow_1h"][traffic["snow_1h"] > 0]
# basic stats for some_snow
display(traffic["some_snow"].describe())
count 63.0000 mean 0.1702 std 0.1499 min 0.0500 25% 0.0600 50% 0.1000 75% 0.2500 max 0.5100 Name: some_snow, dtype: float64
# convert temperature column from Kelvin to Fahrenheit
def k_to_f(k_temp):
f_temp = ((k_temp-273.15)*(9/5)) + 32
return f_temp
traffic["temp"] = traffic["temp"].map(k_to_f)
# basic stats for temp column
display(traffic["temp"].describe())
count 48,203.0000 mean 46.4998 std 24.0085 min -459.6700 25% 30.2180 50% 48.7400 75% 65.5808 max 98.4560 Name: temp, dtype: float64
# # look at outlier temp rows
# cold_days = traffic[traffic["temp"]<-100]
# display(cold_days)
# remove rows with outlier temperatures
traffic = traffic[traffic["temp"]>-100]
# basic stats for temp column
display(traffic["temp"].describe())
count 48,193.0000 mean 46.6048 std 22.8769 min -21.5680 25% 30.2540 50% 48.7580 75% 65.5880 max 98.4560 Name: temp, dtype: float64
The previous statistics for traffic volume shows that the minimum value is 0, the maximum value is 7280, and that traffic is usually somewhere in the middle. A histogram shows an interesting distribution. The mean is heavily influenced by how often the traffic is really light and how often the traffic is fairly bad.
# create histogram of traffic volume for entire data set
traffic["traffic_volume"].plot.hist(color="#FFCC33")
plt.title("Vehicles per Hour")
plt.xticks([1000,3000,5000,7000], [1000,3000,5000,7000])
plt.yticks([2000,4000,6000,8000], [2000,4000,6000,8000])
plt.show()
Creating a daytime group (7:00 am to 6:00 pm) and a nighttime group (7:00 pm to 6:00 am) reveals unsurprising differences in traffic volume.
# transform column to datetime
traffic["date_time"] = pd.to_datetime(traffic["date_time"])
# create date_time series
day_night = traffic["date_time"].dt.hour
# create day traffic series from 0700 to 1800
day_traffic = day_night.between(7, 18)
day_traffic = traffic.loc[day_traffic].copy()
# create night traffic series from 1900 to 0600
night_traffic = day_night.between(19, 24) | day_night.between(0, 6)
night_traffic = traffic.loc[night_traffic].copy()
# verify
print("The number of rows and columns for the day and night series.")
display(day_traffic.shape)
display(night_traffic.shape)
The number of rows and columns for the day and night series.
(23874, 11)
(24319, 11)
# create two graphs
plt.figure(figsize=(10,7))
# daytime traffic volume histogram
plt.subplot(2,2,1)
day_traffic["traffic_volume"].plot.hist(color="#FFCC33")
plt.title("Daytime Vehicles per Hour")
plt.xticks([1000,3000,5000,7000], [1000,3000,5000,7000])
plt.xlim(0,7500)
plt.ylim(0,8000)
plt.ylabel("")
# nighttime traffic volume histogram
plt.subplot(2,2,2)
night_traffic["traffic_volume"].plot.hist(color="#FFCC33")
plt.title("Nighttime Vehicles per Hour")
plt.xticks([1000,3000,5000,7000], [1000,3000,5000,7000])
plt.xlim(0,7500)
plt.ylim(0,8000)
plt.ylabel("")
plt.show()
# basic daytime and nighttime stats
display(day_traffic.describe())
display(night_traffic.describe())
temp | rain_1h | snow_1h | clouds_all | traffic_volume | some_rain | some_snow | |
---|---|---|---|---|---|---|---|
count | 23,874.0000 | 23,874.0000 | 23,874.0000 | 23,874.0000 | 23,874.0000 | 1,770.0000 | 34.0000 |
mean | 48.4347 | 0.1213 | 0.0003 | 53.1255 | 4,762.3038 | 1.6365 | 0.1779 |
std | 23.4823 | 0.8804 | 0.0089 | 37.5635 | 1,174.1816 | 2.8249 | 0.1553 |
min | -21.5680 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.2500 | 0.0500 |
25% | 31.1540 | 0.0000 | 0.0000 | 5.0000 | 4,253.0000 | 0.2500 | 0.0600 |
50% | 51.1340 | 0.0000 | 0.0000 | 75.0000 | 4,820.0000 | 0.6400 | 0.1000 |
75% | 68.5220 | 0.0000 | 0.0000 | 90.0000 | 5,559.0000 | 1.6900 | 0.3025 |
max | 98.4560 | 44.4500 | 0.5100 | 100.0000 | 7,280.0000 | 44.4500 | 0.5100 |
temp | rain_1h | snow_1h | clouds_all | traffic_volume | some_rain | some_snow | |
---|---|---|---|---|---|---|---|
count | 24,319.0000 | 24,319.0000 | 24,319.0000 | 24,319.0000 | 24,319.0000 | 1,696.0000 | 29.0000 |
mean | 44.8084 | 0.1392 | 0.0002 | 45.6870 | 1,785.5309 | 1.9959 | 0.1610 |
std | 22.1202 | 1.1111 | 0.0074 | 40.0464 | 1,441.8681 | 3.7420 | 0.1455 |
min | -20.0740 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.2500 | 0.0500 |
25% | 29.4080 | 0.0000 | 0.0000 | 1.0000 | 530.5000 | 0.2500 | 0.0500 |
50% | 46.8140 | 0.0000 | 0.0000 | 40.0000 | 1,287.0000 | 0.6700 | 0.1000 |
75% | 63.5900 | 0.0000 | 0.0000 | 90.0000 | 2,819.0000 | 1.8500 | 0.2500 |
max | 94.1540 | 55.6300 | 0.5100 | 100.0000 | 6,386.0000 | 55.6300 | 0.5100 |
# create month column and convert info in date_time column to months 1-12
day_traffic["month"] = day_traffic["date_time"].dt.month
# group daytime by month and get the mean column values for each month
day_traffic_month_group = day_traffic.groupby("month").mean()
# basic stats for daytime month group mean traffic volume
display(day_traffic_month_group["traffic_volume"].describe())
# line plot of daytime month group mean traffic volumes
day_traffic_month_group["traffic_volume"].plot.line(c="#7A0019")
plt.title("Daytime Mean Traffic Volume by Month")
plt.yticks([4400,4600,4800], [4400,4600,4800])
plt.xlabel("")
plt.xticks([1,2,3,4,5,6,7,8,9,10,11,12],["Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"])
plt.show()
count 12.0000 mean 4,767.5037 std 189.9449 min 4,374.8346 25% 4,676.7308 50% 4,880.0964 75% 4,907.9511 max 4,928.3020 Name: traffic_volume, dtype: float64
# group July by year
july_months = traffic[traffic["date_time"].dt.month == 7]
july_yearly_group = july_months.groupby(traffic["date_time"].dt.year)
# line plot of July yearly group mean traffic volume
july_yearly_group["traffic_volume"].mean().plot.line(c="#7A0019")
plt.title("July Mean Traffic Volume by Year")
plt.xlabel("")
plt.show()
# create day_of_week column and convert info in date_time column to days 0-6
day_traffic["day_of_week"] = day_traffic["date_time"].dt.dayofweek
# group daytime by day of week and get the mean column values for each day
day_traffic_day_group = day_traffic.groupby("day_of_week").mean()
# line plot of daytime week of day group mean traffic volumes
day_traffic_day_group["traffic_volume"].plot.line(c="#7A0019")
plt.title("Daytime Mean Traffic Volume by Day")
plt.yticks([3500,4000,4500,5000],[3500,4000,4500,5000])
plt.xlabel("")
plt.xticks([0,1,2,3,4,5,6],["Mon","Tue","Wed","Thur","Fri","Sat","Sun"])
plt.show()
# create an hour column and convert info in date_time column to hours 0 to 23
day_traffic["hour"] = day_traffic["date_time"].dt.hour
# group daytime into weekdays (4=Friday) and weekends (5=Saturday)
weekdays = day_traffic[day_traffic["day_of_week"] <= 4]
weekends = day_traffic[day_traffic["day_of_week"] >= 5]
# group weekday entries by hour and get the mean column values for each hour
weekdays_hours_group = weekdays.groupby("hour").mean()
# group weekend entries by hour and get the mean column values for each hour
weekends_hours_group = weekends.groupby("hour").mean()
# basic stats for daytime weekday hour group mean traffic volume
# display(weekdays_hours_group["traffic_volume"].describe())
# create two graphs
plt.figure(figsize=(10,6))
# line plot of weekdays hours group mean traffic volumes
plt.subplot(2,2,1)
weekdays_hours_group["traffic_volume"].plot.line(c="#7A0019")
plt.title("Weekday Traffic Volume by Hour")
plt.ylim(0,6500)
# line plot of weekends hours group mean traffic volumes
plt.subplot(2,2,2)
weekends_hours_group["traffic_volume"].plot.line(c="#7A0019")
plt.title("Weekend Traffic Volulme by Hour")
plt.ylim(0,6500)
plt.show()
# line plot for nighttime traffic vollume by hour
night_traffic["hour"] = night_traffic["date_time"].dt.hour
nighttime_hours_group = night_traffic.groupby("hour").mean()
nighttime_hours_group["traffic_volume"].plot.bar(color="#FFCC33")
plt.ylim(0,6000)
plt.title("Nightime Mean Traffic Volume by Hour")
plt.xlabel("")
plt.show()
There are small correlations between daytime traffic volume and measurable amounts of snow and rain. Neither of these associations would be obvious though.
# Pearson correlation coefficient for all daytime columns
day_traffic.corr()
temp | rain_1h | snow_1h | clouds_all | traffic_volume | some_rain | some_snow | month | day_of_week | hour | |
---|---|---|---|---|---|---|---|---|---|---|
temp | 1.0000 | 0.0849 | -0.0197 | -0.1408 | 0.1255 | 0.2065 | 0.5590 | 0.2240 | 0.0030 | 0.1632 |
rain_1h | 0.0849 | 1.0000 | 0.0068 | 0.0885 | -0.0407 | 1.0000 | 0.7923 | 0.0194 | -0.0047 | -0.0168 |
snow_1h | -0.0197 | 0.0068 | 1.0000 | 0.0277 | 0.0013 | -0.0111 | 1.0000 | 0.0268 | -0.0088 | 0.0039 |
clouds_all | -0.1408 | 0.0885 | 0.0277 | 1.0000 | -0.0333 | 0.0850 | 0.1273 | 0.0004 | -0.0418 | 0.0235 |
traffic_volume | 0.1255 | -0.0407 | 0.0013 | -0.0333 | 1.0000 | -0.1723 | 0.4014 | -0.0227 | -0.4163 | 0.1724 |
some_rain | 0.2065 | 1.0000 | -0.0111 | 0.0850 | -0.1723 | 1.0000 | nan | 0.0084 | 0.0343 | -0.0429 |
some_snow | 0.5590 | 0.7923 | 1.0000 | 0.1273 | 0.4014 | nan | 1.0000 | 0.2191 | 0.0110 | 0.0840 |
month | 0.2240 | 0.0194 | 0.0268 | 0.0004 | -0.0227 | 0.0084 | 0.2191 | 1.0000 | 0.0138 | 0.0080 |
day_of_week | 0.0030 | -0.0047 | -0.0088 | -0.0418 | -0.4163 | 0.0343 | 0.0110 | 0.0138 | 1.0000 | -0.0025 |
hour | 0.1632 | -0.0168 | 0.0039 | 0.0235 | 0.1724 | -0.0429 | 0.0840 | 0.0080 | -0.0025 | 1.0000 |
# scatter plot for daytime traffic volume and some snow
plt.scatter(day_traffic["traffic_volume"],day_traffic["some_snow"],c="#FFCC33")
plt.title("Traffic Volume and Snow")
plt.xlabel("Vehicles per Hour")
plt.ylabel("Snow in mm per Hour")
plt.show()
# scatter plot for daytime traffic volume and some rain
plt.scatter(day_traffic["traffic_volume"], day_traffic["some_rain"],c="#FFCC33")
plt.title("Traffic and Rain")
plt.xlabel("Vehicles per Hour")
plt.ylabel("Rain in mm per Hour")
plt.show()
weather_main
column provides an interesting graph but no real insight. The traffic volume seems to be fairly similar across the group.weather_description
doesn't reveal any surprises either. The highest traffic volumes were on days with snow, so maybe we can conclude the Minnesotans are pretty resilient to bad weather!# group day traffic by weather main and weather description
weather_main = day_traffic.groupby("weather_main").mean()
weather_description = day_traffic.groupby("weather_description").mean()
# bar chart for weather main
weather_main["traffic_volume"].plot.barh(color="#FFCC33")
plt.xticks([1000,3000,5000], [1000,3000,5000])
plt.ylabel("")
plt.title("Weather and Daytime Traffic Volume")
plt.show()
# bar chart for weather description
weather_description["traffic_volume"].plot.barh(figsize=(7,10), color="#FFCC33")
plt.xticks([1000,3000,5000], [1000,3000,5000])
plt.ylabel("")
plt.title("Weather and Daytime Traffic Volume")
plt.show()