Guided Project: Finding Heavy Traffic Indicators on I-94

We are going to analyze a dataset about the westbound traffic on the I-94 Interstate highway. John Hogue made the dataset available, and it can be downloaded from the UCI Machine Learning Repository. The goal of our analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.

In [1]:
import pandas as pd 
#Load in the data
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
In [2]:
#Display first 5 rows
traffic.head()
Out[2]:
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
0 None 288.28 0.0 0.0 40 Clouds scattered clouds 2012-10-02 09:00:00 5545
1 None 289.36 0.0 0.0 75 Clouds broken clouds 2012-10-02 10:00:00 4516
2 None 289.58 0.0 0.0 90 Clouds overcast clouds 2012-10-02 11:00:00 4767
3 None 290.13 0.0 0.0 90 Clouds overcast clouds 2012-10-02 12:00:00 5026
4 None 291.14 0.0 0.0 75 Clouds broken clouds 2012-10-02 13:00:00 4918
In [3]:
#Display last 5 rows
traffic.tail()
Out[3]:
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
48199 None 283.45 0.0 0.0 75 Clouds broken clouds 2018-09-30 19:00:00 3543
48200 None 282.76 0.0 0.0 90 Clouds overcast clouds 2018-09-30 20:00:00 2781
48201 None 282.73 0.0 0.0 90 Thunderstorm proximity thunderstorm 2018-09-30 21:00:00 2159
48202 None 282.09 0.0 0.0 90 Clouds overcast clouds 2018-09-30 22:00:00 1450
48203 None 282.12 0.0 0.0 90 Clouds overcast clouds 2018-09-30 23:00:00 954
In [4]:
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

This dataset has 9 columns and 48,204 rows. No column contains null values, and the data types are a mix of float, integer, and object. The date_time column shows that the records run from 2012-10-02 09:00:00 to 2018-09-30 23:00:00.

The dataset documentation mentions that a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records westbound traffic (cars moving from east to west). This means that the results of our analysis will be about the westbound traffic in the proximity of that station. In other words, we should avoid generalizing our results for the entire I-94 highway.

Analyzing Traffic Volume

In [5]:
#Plotting a histogram to examine the distribution of the traffic_volume column. Using Pandas method.
import matplotlib.pyplot as plt
%matplotlib inline
traffic['traffic_volume'].plot.hist()
plt.title('Frequency of Traffic Volume')
plt.xlabel('Traffic Volume')
plt.show()
In [6]:
#creating a summary of statistics of traffic_volume column
traffic['traffic_volume'].describe()
Out[6]:
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64

From the summary statistics of the traffic_volume column, we see that the hourly traffic volume varied from 0 to 7,280 cars, with an average of about 3,260 cars. About 25% of the time, 1,193 cars or fewer passed the station each hour; this probably occurs at night or when the road is under construction. About 25% of the time, the traffic volume was roughly four times as much (4,933 cars or more).
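To make the quartile reading concrete, here is a minimal sketch on a small made-up series (the values are illustrative and are not taken from the I-94 dataset). It checks what share of the observations falls at or below the 25th percentile and at or above the 75th percentile.

```python
import pandas as pd

# Toy hourly volumes (hypothetical values, not from the I-94 dataset)
volumes = pd.Series([100, 300, 500, 1200, 3000, 3400,
                     4500, 4900, 5200, 6000, 6500, 7000])

# 25th and 75th percentiles, the same statistics describe() reports
q1 = volumes.quantile(0.25)
q3 = volumes.quantile(0.75)

# Share of observations at or below Q1 and at or above Q3
share_at_or_below_q1 = (volumes <= q1).mean()
share_at_or_above_q3 = (volumes >= q3).mean()

print(q1, q3)
print(share_at_or_below_q1, share_at_or_above_q3)
```

In this toy series, a quarter of the values sit at or below the 25% mark and a quarter sit at or above the 75% mark, which is how the 25% and 75% rows of describe() should be read.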

The possibility that time of day influences traffic volume gives our analysis an interesting direction: comparing daytime and nighttime data.

We'll start by dividing the dataset into two parts:

- Daytime data: hours from 7 a.m. to 7 p.m. (12 hours).
- Nighttime data: hours from 7 p.m. to 7 a.m. (12 hours).

While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.

In [7]:
#transforming the column to a datetime datatype
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
traffic['date_time']
Out[7]:
0       2012-10-02 09:00:00
1       2012-10-02 10:00:00
2       2012-10-02 11:00:00
3       2012-10-02 12:00:00
4       2012-10-02 13:00:00
                ...        
48199   2018-09-30 19:00:00
48200   2018-09-30 20:00:00
48201   2018-09-30 21:00:00
48202   2018-09-30 22:00:00
48203   2018-09-30 23:00:00
Name: date_time, Length: 48204, dtype: datetime64[ns]
In [8]:
#isolate the daytime rows (7 a.m. up to, but not including, 7 p.m.), then copy them
day_time = traffic[(traffic['date_time'].dt.hour >= 7) & (traffic['date_time'].dt.hour < 19)].copy()
day_time.shape
Out[8]:
(23877, 9)
In [9]:
#isolate the nighttime rows (7 p.m. up to, but not including, 7 a.m.), then copy them
night_time = traffic[(traffic['date_time'].dt.hour >= 19) | (traffic['date_time'].dt.hour < 7)].copy()
night_time.shape
Out[9]:
(24327, 9)
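As a sanity check on the split, the two hour filters should be mutually exclusive and together cover every row. The sketch below verifies this on two days of made-up hourly timestamps (the `toy` frame is a hypothetical stand-in for the real data), and also confirms that the two shapes reported above add up to the full row count.

```python
import pandas as pd

# Two days of hourly timestamps (toy data, not the real dataset)
toy = pd.DataFrame({
    'date_time': pd.date_range('2012-10-02', periods=48,
                               freq=pd.Timedelta(hours=1))
})

hour = toy['date_time'].dt.hour
is_day = (hour >= 7) & (hour < 19)      # same condition as the daytime filter
is_night = (hour >= 19) | (hour < 7)    # same condition as the nighttime filter

# Every row should match exactly one of the two masks
overlap = int((is_day & is_night).sum())
covered = int((is_day | is_night).sum())
print(overlap, covered == len(toy))

# The shapes reported above should add up to the full dataset
print(23877 + 24327 == 48204)
```

Because the nighttime condition is the logical negation of the daytime one, `~is_day` would select the same rows as `is_night`.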

Now we're going to compare the traffic volume at night and during the day.

In [10]:
#plotting side-by-side histograms with matplotlib

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.hist(day_time['traffic_volume'])
plt.xlim([0,7500])
plt.ylim([0,8000])
plt.title('Daytime Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')


plt.subplot(1, 2, 2)
plt.hist(night_time['traffic_volume'])
plt.xlim([0,7500])
plt.ylim([0,8000])
plt.title('Nighttime Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')

plt.show()
In [11]:
day_time['traffic_volume'].describe()
Out[11]:
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64
In [12]:
night_time['traffic_volume'].describe()
Out[12]:
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64

The daytime histogram is left-skewed: most of the values pile up on the right side of the histogram, and the median is higher than the mean. The nighttime histogram is right-skewed: most of the values pile up on the left side of the histogram, and the mean is higher than the median.
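The mean-versus-median comparison used here can be written as a small rule of thumb. The helper below (`skew_direction` is a hypothetical name introduced for illustration) applies it to the mean and median values reported by describe() above. It is only a heuristic, not a formal skewness statistic.

```python
def skew_direction(mean, median):
    """Heuristic: a mean below the median suggests left skew,
    a mean above the median suggests right skew."""
    if mean < median:
        return 'left-skewed'
    if mean > median:
        return 'right-skewed'
    return 'roughly symmetric'

# Mean and median (50%) copied from the two describe() outputs above
daytime = skew_direction(mean=4762.047452, median=4820.0)
nighttime = skew_direction(mean=1785.377441, median=1287.0)

print('daytime:', daytime)
print('nighttime:', nighttime)
```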

Although there are still measurements of over 5,000 cars per hour, the traffic at night is generally light compared to the daytime (a mean of about 1,785 cars per hour versus about 4,762). Our goal is to find indicators of heavy traffic, so we'll focus only on the daytime data moving forward.

In [13]:
#create a month column and average the numeric columns for each month
day_time['month'] = day_time['date_time'].dt.month
by_month = day_time.groupby('month').mean(numeric_only=True)
by_month['traffic_volume']
Out[13]:
month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64
In [14]:
#plotting a line graph showing monthly traffic volume averages

by_month['traffic_volume'].plot.line()    
plt.title('Monthly Traffic Volume Averages')
plt.xlabel('Month')
plt.ylabel('Traffic Volume Averages') 
plt.xticks(range(1,13))
plt.show()

The line graph shows that traffic volume averages are high in March through June and in August through October, which are the warmer months, and low in January, February, November, and December. July, however, has a low average, which is unusual for a warm month. Let's see how the traffic volume changed each year in July.

In [15]:
#creating a year column, then averaging the July traffic volume for each year
day_time['year'] = day_time['date_time'].dt.year
only_july = day_time[day_time['month'] == 7]
only_july = only_july.groupby('year').mean(numeric_only=True)
only_july['traffic_volume'].plot.line()

plt.show()

Typically, the traffic is pretty heavy in July, similar to the other warm months. The only exception we see is 2016, which had a sharp decrease in traffic volume. One possible reason for this is road construction; this article from 2016 supports this hypothesis.

As a tentative conclusion here, we can say that warm months generally show heavier traffic compared to cold months. In a warm month, you can expect a traffic volume close to 5,000 cars for each daytime hour.
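The warm-versus-cold comparison can be put into numbers with the monthly averages already shown above. This is a rough sketch: it takes a plain, unweighted average of the monthly means, and the March to October and November to February groupings are the ones used in this analysis rather than any formal definition of seasons.

```python
# Daytime monthly averages copied from the by_month output above
monthly_avg = {
    1: 4495.613727, 2: 4711.198394, 3: 4889.409560, 4: 4906.894305,
    5: 4911.121609, 6: 4898.019566, 7: 4595.035744, 8: 4928.302035,
    9: 4870.783145, 10: 4921.234922, 11: 4704.094319, 12: 4374.834566,
}

warm_months = range(3, 11)     # March through October
cold_months = [11, 12, 1, 2]   # November through February

# Unweighted average of the monthly means in each group
warm_mean = sum(monthly_avg[m] for m in warm_months) / len(warm_months)
cold_mean = sum(monthly_avg[m] for m in cold_months) / len(cold_months)

print(round(warm_mean), round(cold_mean), round(warm_mean - cold_mean))
```

Even with the low July average included, the warm-month group comes out a few hundred cars per hour above the cold-month group.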

Time Indicators

Next, we'll look at how traffic volume changes by day of the week. After that, we'll generate a line plot for the time of day. Because the weekends would drag down the hourly averages, we'll look at business days and weekends separately, splitting the data by day type.

In [16]:
#create a day-of-week column (0 is Monday, 6 is Sunday) and average by day
day_time['dayofweek'] = day_time['date_time'].dt.dayofweek
by_dayofweek = day_time.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume']
Out[16]:
dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64
In [17]:
#plotting a line graph showing daily traffic volume averages
by_dayofweek['traffic_volume'].plot.line()
plt.title('Daily Traffic Volume Averages')
plt.xlabel('Day')
plt.ylabel('Traffic Volume Averages') 
plt.show()

On business days (Monday through Friday), traffic volume is significantly higher. Except for Monday, which averages just under 4,900 cars, every business day averages over 5,000 cars. Weekend traffic is lighter, with averages below 4,000 cars.
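Since dt.dayofweek encodes days as integers (0 is Monday, 6 is Sunday), it can help to attach day names before reading the numbers. The sketch below does that with the averages copied from the output above. The names `daily_avg` and `day_names` are introduced here for illustration only.

```python
import pandas as pd

# Daytime averages by dayofweek, copied from the output above
daily_avg = pd.Series(
    [4893.551286, 5189.004782, 5284.454282, 5311.303730,
     5291.600829, 3927.249558, 3436.541789],
    index=range(7),
)

day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
named = daily_avg.rename(index=dict(enumerate(day_names)))

# Days whose average exceeds 5,000 cars per hour
over_5000 = named[named > 5000].index.tolist()

# Unweighted averages of the daily means for each day type
weekday_mean = daily_avg.iloc[:5].mean()
weekend_mean = daily_avg.iloc[5:].mean()

print(over_5000)
print(round(weekday_mean), round(weekend_mean))
```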

Time Indicators: Time of Day

In [18]:
day_time['hour'] = day_time['date_time'].dt.hour
business_days = day_time[day_time['dayofweek'] <= 4].copy()  #4 == Friday
weekend = day_time[day_time['dayofweek'] > 4].copy()  #5 == Saturday, 6 == Sunday

#getting average traffic volume for each hour on business days
by_hour_businessdays = business_days.groupby('hour').mean(numeric_only=True)

#getting average traffic volume for each hour on weekends
by_hour_weekends = weekend.groupby('hour').mean(numeric_only=True)
In [19]:
#Plotting two line plots showing average traffic volume changes by time of the day
plt.figure(figsize=(11,4))

plt.subplot(1, 2, 1)
by_hour_businessdays['traffic_volume'].plot.line()
plt.xlim(5,20)
plt.ylim(1500,6300)
plt.title('Hourly Businessday Traffic Volume')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')

plt.subplot(1, 2, 2)
by_hour_weekends['traffic_volume'].plot.line()
plt.xlim(5,20)
plt.ylim(1500,6300)
plt.title('Hourly Weekend Traffic Volume')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')
plt.show()

At each hour of the day, the traffic volume is generally higher on business days than on weekends. As expected, the rush hours are around 7 and 16, when most people travel from home to work and back. We see volumes of over 6,000 cars at rush hours.
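Rush hours can also be read off programmatically instead of from the plot. The sketch below uses a made-up hourly series (the numbers are hypothetical placeholders, not the actual by_hour_businessdays values) and picks the two hours with the highest average volume.

```python
import pandas as pd

# Hypothetical business-day averages by hour (illustrative values only)
hourly_avg = pd.Series({
    7: 6030, 8: 5500, 9: 4900, 10: 4400, 11: 4600, 12: 4850,
    13: 4860, 14: 5150, 15: 5600, 16: 6190, 17: 5780, 18: 4430,
})

# The two hours with the highest average volume, in chronological order
rush_hours = sorted(hourly_avg.nlargest(2).index.tolist())

# Hours whose average exceeds 6,000 cars
over_6000 = hourly_avg[hourly_avg > 6000].index.tolist()

print(rush_hours, over_6000)
```

With the real data, the same two calls could be applied to by_hour_businessdays['traffic_volume'].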

To summarize, we found a few time-related indicators of heavy traffic:

- The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
- The traffic is usually heavier on business days compared to weekends.
- On business days, the rush hours are around 7 and 16.

Weather indicators

In [20]:
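#find the correlation between traffic_volume and the other numeric columns
#numeric_only=True (pandas 1.5+) skips the text and datetime columns
day_time.corr(numeric_only=True)['traffic_volume']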
Out[20]:
temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
traffic_volume    1.000000
month            -0.022337
year             -0.003557
dayofweek        -0.416453
hour              0.172704
Name: traffic_volume, dtype: float64

Among the weather columns, temperature shows the strongest correlation with traffic_volume, at just +0.13. The other weather columns (rain_1h, snow_1h, clouds_all) don't show any strong correlation with traffic_volume.
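To compare the correlations at a glance, they can be ranked by absolute value. The sketch below reuses the coefficients from the output above. Note that dayofweek and hour are time columns rather than weather columns, and that a linear correlation with an integer-coded day of the week is only a crude summary.

```python
import pandas as pd

# Correlations with traffic_volume, copied from the output above
corr = pd.Series({
    'temp': 0.128317, 'rain_1h': 0.003697, 'snow_1h': 0.001265,
    'clouds_all': -0.032932, 'traffic_volume': 1.000000,
    'month': -0.022337, 'year': -0.003557,
    'dayofweek': -0.416453, 'hour': 0.172704,
})

# Rank every other column by the strength (absolute value) of its correlation
ranked = corr.drop('traffic_volume').abs().sort_values(ascending=False)

# Strongest correlation among the weather columns only
weather_cols = ['temp', 'rain_1h', 'snow_1h', 'clouds_all']
strongest_weather = corr[weather_cols].abs().idxmax()

print(ranked)
print(strongest_weather)
```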

Let's generate a scatter plot to visualize the correlation between temp and traffic_volume.

In [21]:
#plotting a scatter plot of temp against traffic_volume
day_time.plot.scatter('traffic_volume', 'temp')
plt.show()

We can conclude that temperature doesn't look like a solid indicator of heavy traffic.

Weather types

In [22]:
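#average the numeric columns for each main weather category, then plot traffic volume
by_weather_main = day_time.groupby('weather_main').mean(numeric_only=True)
by_weather_main['traffic_volume'].plot.barh()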
plt.show()
In [23]:
by_weather_description = day_time.groupby('weather_description').mean(numeric_only=True)

#plotting horizontal bar plot

by_weather_description['traffic_volume'].plot.barh(figsize=(6,12))
plt.xlabel('Average traffic volume')
plt.ylabel('Weather description')
plt.show()

Three weather types have an average traffic volume exceeding 5,000 cars: shower snow, light rain and snow, and proximity thunderstorm with drizzle. It's unclear why these weather types have the highest average traffic values; this is bad weather, but not particularly bad. Perhaps more people take their cars out of the garage when the weather is bad instead of riding a bike or walking.
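One check that the analysis above does not perform, and that might help interpret these averages, is counting how many rows stand behind each weather description: an average based on a handful of hours is less reliable than one based on thousands. The sketch below shows the idea on a tiny made-up frame. The `toy` data and the threshold of 2 rows are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not from the I-94 dataset)
toy = pd.DataFrame({
    'weather_description': ['shower snow', 'sky is clear', 'sky is clear',
                            'sky is clear', 'light rain', 'light rain'],
    'traffic_volume': [5600, 4700, 4900, 4800, 4600, 4800],
})

# Mean traffic volume and number of rows for each weather description
summary = (toy.groupby('weather_description')['traffic_volume']
              .agg(['mean', 'count']))

# Keep only the descriptions backed by at least 2 rows (arbitrary threshold)
well_supported = summary[summary['count'] >= 2]

print(summary)
print(well_supported.index.tolist())
```

In this toy frame, the highest average belongs to a description that appears only once, so it drops out once the row-count filter is applied.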

In this project, we tried to identify a few indicators of heavy traffic on the I-94 Interstate highway. We found two types of indicators: time indicators and weather indicators.

Time indicators

  • The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
  • The traffic is usually heavier on business days compared to the weekends.
  • On business days, the rush hours are around 7 and 16.

Weather indicators

  • Shower snow
  • Light rain and snow
  • Proximity thunderstorm with drizzle