Project: Finding Heavy Traffic Indicators on I-94

Interstate 94 (I-94) is an east–west Interstate Highway connecting the Great Lakes and northern Great Plains regions of the United States. Its western terminus is in Billings, Montana, at a junction with I-90; its eastern terminus is in Port Huron, Michigan, where it meets with I-69 and crosses the Blue Water Bridge into Sarnia, Ontario, Canada, where the route becomes Ontario Highway 402. It thus lies along the primary overland route from Seattle (via I-90) to Toronto (via Ontario Highway 401), and is the only east–west Interstate highway to have a direct connection to Canada.


Traffic records for I-94 were compiled into a dataset, created by (). In this project, we will use this dataset to analyze which factors indicate heavy traffic on I-94.

According to the dataset description, the records come from a station located midway between Minneapolis and St. Paul, MN, and cover westbound traffic only (that is, vehicles traveling from east to west). So we will look for indicators of heavy westbound traffic near that station. The columns are described below:

holiday (Categorical): US National holidays plus regional holiday, Minnesota State Fair
temp (Numeric): Average temp in kelvin
rain_1h (Numeric): Amount in mm of rain that occurred in the hour
snow_1h (Numeric): Amount in mm of snow that occurred in the hour
clouds_all (Numeric): Percentage of cloud cover
weather_main (Categorical): Short textual description of the current weather
weather_description (Categorical): Longer textual description of the current weather
date_time (DateTime): Hour of the data collected in local CST time
traffic_volume (Numeric): Hourly I-94 ATR 301 reported westbound traffic volume

Now, let's get started!

1. Checking out data

We'll start the project by loading the data and looking at some basic information about it.

In [77]:
## Using pandas, seaborn, matplotlib libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
## Read the CSV file
i94 = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
In [78]:
## Check basic information
i94.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

From this basic information, we can see that the data has 48,204 rows and 9 columns, and that no column has missing values. The date_time column, however, is stored as object (string) type rather than as a datetime. Let's look at the first five rows of the dataset.

In [79]:
## Check first 5 rows of dataset
i94.head()
Out[79]:
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
0 None 288.28 0.0 0.0 40 Clouds scattered clouds 2012-10-02 09:00:00 5545
1 None 289.36 0.0 0.0 75 Clouds broken clouds 2012-10-02 10:00:00 4516
2 None 289.58 0.0 0.0 90 Clouds overcast clouds 2012-10-02 11:00:00 4767
3 None 290.13 0.0 0.0 90 Clouds overcast clouds 2012-10-02 12:00:00 5026
4 None 291.14 0.0 0.0 75 Clouds broken clouds 2012-10-02 13:00:00 4918

We will explore 'traffic_volume' first. Instead of printing the values, we'll look at the distribution in a histogram and then check the summary statistics:

In [80]:
## Drawing histogram for 'traffic_volume':
i94['traffic_volume'].plot.hist()
plt.show()
In [81]:
## Check the distribution by number:
i94['traffic_volume'].describe()
Out[81]:
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64

Based on the histogram and the summary statistics, we can see that:

  1. The most frequent traffic volumes are around 4,500-5,500 vehicles per hour; volumes from 2,500 to just under 4,500 and from 5,500 to just under 6,500 are also common.
  2. The concentration of records at 0-1,000 is odd and needs a closer look (is min = 0 a valid reading or not?).

A hypothesis: from the histogram and the first five rows of the dataset, we suspect that daytime hours (after 8 a.m. and before 6 p.m.) carry more traffic than the rest of the day. The quartiles point the same way:

  1. 25% of the records are at 1,193 vehicles per hour or fewer, which suggests nighttime readings.
  2. 25% of the records are at 4,933 vehicles per hour or more, a level that is plausible during the daytime, especially at peak hours.
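
One way to probe this hypothesis, and the odd cluster of low values, is to average the volume by hour of day. The sketch below is only an illustration: it uses a tiny made-up frame in place of the real file, and the helper name mean_volume_by_hour is hypothetical.

```python
import pandas as pd

def mean_volume_by_hour(df):
    """Return the mean traffic_volume for each hour of the day (0-23)."""
    hours = pd.to_datetime(df['date_time']).dt.hour
    return df.groupby(hours)['traffic_volume'].mean()

# Made-up sample rows (not taken from the real dataset)
sample = pd.DataFrame({
    'date_time': ['2020-01-01 02:00:00', '2020-01-02 02:00:00',
                  '2020-01-01 09:00:00', '2020-01-02 09:00:00'],
    'traffic_volume': [300, 500, 5000, 5400],
})
hourly = mean_volume_by_hour(sample)
print(hourly)
```

If the low values cluster in the small hours when this is run on the real data, the 0-1,000 range is probably ordinary nighttime traffic rather than a recording problem.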

2. Exploring data (Part 1) : Day-time or Night-time?

Since the time of day may be an indicator of traffic conditions, it gives us a direction for the analysis: comparing the daytime records with the nighttime records.


We split the data into daytime and nighttime as follows (hours on the 24-hour clock):

  • Daytime data: hours 7 to 19
  • Nighttime data: hours 20 to 6
In [82]:
## Convert the date_time column from object (string) to datetime:
i94['date_time'] = pd.to_datetime(i94['date_time'])
## Check the 'date_time' data type
i94.info()
i94.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              48204 non-null  object        
 1   temp                 48204 non-null  float64       
 2   rain_1h              48204 non-null  float64       
 3   snow_1h              48204 non-null  float64       
 4   clouds_all           48204 non-null  int64         
 5   weather_main         48204 non-null  object        
 6   weather_description  48204 non-null  object        
 7   date_time            48204 non-null  datetime64[ns]
 8   traffic_volume       48204 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(2), object(3)
memory usage: 3.3+ MB
Out[82]:
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
0 None 288.28 0.0 0.0 40 Clouds scattered clouds 2012-10-02 09:00:00 5545
1 None 289.36 0.0 0.0 75 Clouds broken clouds 2012-10-02 10:00:00 4516
2 None 289.58 0.0 0.0 90 Clouds overcast clouds 2012-10-02 11:00:00 4767
3 None 290.13 0.0 0.0 90 Clouds overcast clouds 2012-10-02 12:00:00 5026
4 None 291.14 0.0 0.0 75 Clouds broken clouds 2012-10-02 13:00:00 4918
In [83]:
## Extract the hour from the datetime data:
hours = i94['date_time'].dt.hour
## Partition the data into daytime/nighttime
## (between() is inclusive at both ends, so hour 19 counts as daytime):
daytime = i94[hours.between(7, 19)]
nighttime = i94[(hours < 7) | (hours > 19)]
In [84]:
## Check data:
daytime.head()
Out[84]:
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
0 None 288.28 0.0 0.0 40 Clouds scattered clouds 2012-10-02 09:00:00 5545
1 None 289.36 0.0 0.0 75 Clouds broken clouds 2012-10-02 10:00:00 4516
2 None 289.58 0.0 0.0 90 Clouds overcast clouds 2012-10-02 11:00:00 4767
3 None 290.13 0.0 0.0 90 Clouds overcast clouds 2012-10-02 12:00:00 5026
4 None 291.14 0.0 0.0 75 Clouds broken clouds 2012-10-02 13:00:00 4918
In [85]:
nighttime.head()
Out[85]:
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
11 None 289.38 0.0 0.0 1 Clear sky is clear 2012-10-02 20:00:00 2784
12 None 288.61 0.0 0.0 1 Clear sky is clear 2012-10-02 21:00:00 2361
13 None 287.16 0.0 0.0 1 Clear sky is clear 2012-10-02 22:00:00 1529
14 None 285.45 0.0 0.0 1 Clear sky is clear 2012-10-02 23:00:00 963
15 None 284.63 0.0 0.0 1 Clear sky is clear 2012-10-03 00:00:00 506

We have partitioned the dataset into two parts: one containing only daytime records (hours 7 to 19) and the other containing only nighttime records. Now we're going to compare traffic volume in daytime vs. nighttime:
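
Before comparing the two parts, it can help to confirm that every row landed in exactly one of them. The following is a minimal sketch on made-up timestamps; the function split_day_night and its start/end parameters are hypothetical names, not part of the notebook.

```python
import pandas as pd

def split_day_night(df, start=7, end=19):
    """Split rows into daytime (start <= hour <= end) and nighttime (the rest)."""
    hours = df['date_time'].dt.hour
    is_day = hours.between(start, end)  # inclusive at both ends
    return df[is_day], df[~is_day]

# Made-up timestamps covering each hour of one day
sample = pd.DataFrame({
    'date_time': pd.to_datetime([f'2020-01-01 {h:02d}:00:00' for h in range(24)])
})
day, night = split_day_night(sample)

# Every row must land in exactly one of the two parts
assert len(day) + len(night) == len(sample)
assert day.index.intersection(night.index).empty
print(len(day), len(night))
```

Using the negated mask for the night part guarantees the two parts are complementary, whatever boundaries are chosen.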

In [86]:
## Draw the histograms on a grid
plt.figure(figsize=[30,10]) ## Set the size of the figure
## Draw the first histogram: daytime
plt.subplot(1,2,1)
plt.hist(daytime['traffic_volume'])
plt.title('Traffic Volume During Daytime (times)')
plt.xlabel('Volume (times)')
plt.ylabel('Frequency')
## Drawing the second graph at the right
plt.subplot(1,2,2)
plt.hist(nighttime['traffic_volume'])
plt.title('Traffic Volume During Night Time (times)')
plt.xlabel('Volume (times)')
plt.ylabel('Frequency')
plt.show()
In [87]:
## Check the distribution of two dataset:
daytime['traffic_volume'].describe()
Out[87]:
count    25838.000000
mean      4649.292360
std       1202.321987
min          0.000000
25%       4021.000000
50%       4736.000000
75%       5458.000000
max       7280.000000
Name: traffic_volume, dtype: float64

Combining the histogram (the left graph) with the summary statistics, we get:

  • 75% of the daytime records are at or above 4,021 vehicles per hour, and the median and upper quartile are higher still (4,736 and 5,458). This matches the histogram, where the 4,000-6,000 range is the densest. Most values sit at the high end with a tail toward low volumes, so the distribution is left-skewed.
  • Values of 0 are present, but in the least dense part of the distribution.

Sub-conclusion 1: Traffic volume is high during the daytime. We will check the nighttime data to see whether the time of day is really an indicator.

In [88]:
nighttime['traffic_volume'].describe()
Out[88]:
count    22366.000000
mean      1654.648484
std       1425.175292
min          0.000000
25%        486.000000
50%       1056.500000
75%       2630.750000
max       6386.000000
Name: traffic_volume, dtype: float64

Similarly, combining the histogram (the right graph) with the summary statistics, we get:

  • The lower quartile is 486 vehicles per hour and the median is about 1,056, while the maximum is 6,386. This matches the histogram, where the 0-1,000 range is the densest. Most values sit at the low end with a tail toward high volumes, so the distribution is right-skewed: traffic is much lighter at night.
  • Because the upper quartile is about 2,630, we still see some records in the 2,000-3,000 range, but fewer than in 0-1,000.

Sub-conclusion 2: Traffic volume is lower during the nighttime than during the daytime.
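
The direction of a skew can also be checked numerically. In the summary statistics above, the daytime mean (4,649) lies below the median (4,736), while the nighttime mean (1,655) lies above the median (1,056.5). The sketch below shows the same pattern on made-up numbers using pandas' Series.skew(); the values are illustrative only.

```python
import pandas as pd

# Made-up volumes that mimic the two shapes (not real records)
day_like = pd.Series([0, 4000, 4500, 4700, 4800, 5000, 5200])
night_like = pd.Series([300, 400, 500, 600, 800, 1000, 6000])

# Series.skew() is negative for a left-skewed sample and positive for a right-skewed one
print(day_like.skew(), night_like.skew())
print(day_like.mean() < day_like.median())      # True: the low tail pulls the mean down
print(night_like.mean() > night_like.median())  # True: the high tail pulls the mean up
```

On the real data, the same check would be daytime['traffic_volume'].skew() and nighttime['traffic_volume'].skew().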

CONCLUSION: The time of day can be considered an indicator of heavy traffic, specifically the daytime. Nighttime traffic is much lighter, so it contributes little to an analysis of heavy traffic.

3. Exploring data (Part 2): Time factor (Month, day of week...)

We have concluded that the daytime affects heavy traffic on I-94. Other time-related factors might also play a role in traffic volume, for example the month or the day of the week. Now we're going to explore them.

Starting with the month, we examine how traffic volume relates to the month of the year through its mean value, as computed by the code below:

In [91]:
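## Extract the month into a new column:
i94['month'] = i94['date_time'].dt.month
## Average the numeric columns by month
## (numeric_only=True skips the text columns, which recent pandas versions no longer drop automatically):
by_month = i94.groupby('month').mean(numeric_only=True)
by_month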
Out[91]:
temp rain_1h snow_1h clouds_all traffic_volume
month
1 264.907683 0.012154 0.000716 55.393909 3051.081378
2 265.577242 0.002204 0.000000 49.326716 3197.945547
3 272.764798 0.012138 0.000000 54.839705 3308.388611
4 278.722985 0.084137 0.000000 57.634656 3304.372388
5 288.172498 0.123616 0.000000 51.648557 3366.319432
6 293.351019 0.257601 0.000000 42.975610 3419.077413
7 295.276199 2.348936 0.000000 36.036288 3205.481752
8 293.615149 0.311969 0.000000 37.697807 3394.241891
9 291.190540 0.286053 0.000000 40.749152 3303.049334
10 282.966573 0.051330 0.000000 50.069105 3390.678376
11 275.795938 0.003049 0.000000 53.813619 3167.592784
12 267.097371 0.051226 0.001847 64.189456 3024.257943
In [92]:
## Line plot of the relation between month and volume:
plt.plot(by_month['traffic_volume'])
plt.title('Traffic Volume by Month (times)')
plt.ylabel('Volume (times)')
plt.xlabel('Month')
plt.show()

At a quick glance at the graph, the months with the highest average traffic volume are April to June (spring into early summer), August, and October, while the cold months (November to February) are the lowest and July shows a dip. One possible explanation is that people are more active in warmer months, which raises the demand for travel. The differences are modest, though: the monthly means range only from about 3,024 to about 3,419.
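
The ranking of months can be read straight from the table of monthly means rather than from the graph alone. The sketch below copies the traffic_volume column of that table (rounded to two decimals) into a Series; the variable names are illustrative.

```python
import pandas as pd

# Monthly mean traffic_volume, copied from the table above (rounded to 2 decimals)
monthly_mean = pd.Series(
    [3051.08, 3197.95, 3308.39, 3304.37, 3366.32, 3419.08,
     3205.48, 3394.24, 3303.05, 3390.68, 3167.59, 3024.26],
    index=range(1, 13), name='traffic_volume')

top4 = monthly_mean.nlargest(4)
bottom4 = monthly_mean.nsmallest(4)
print(top4.index.tolist())     # [6, 8, 10, 5]
print(bottom4.index.tolist())  # [12, 1, 11, 2]
```

June, August, October, and May have the highest means, and December, January, November, and February the lowest.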

Next, we will examine how the day of the week affects traffic volume on I-94, in the same way as we did for the month.

In [93]:
## Create new column: day of week
i94['dayofweek'] = i94['date_time'].dt.dayofweek  # 0 is Monday, 6 is Sunday
## Average the numeric columns by day of week (numeric_only=True skips the text columns):
by_dayofweek = i94.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume']
Out[93]:
dayofweek
0    3309.387161
1    3488.555799
2    3583.196681
3    3637.899663
4    3656.358836
5    2773.638120
6    2368.588329
Name: traffic_volume, dtype: float64
In [95]:
## Visualize the relation of day of week to traffic volume:
plt.plot(by_dayofweek['traffic_volume'])
plt.title('Traffic Volume by Day of Week (times)')
plt.xlabel('Day')
plt.ylabel('Volume (times)')
plt.show()

As the graph shows, traffic volume rises steadily from Monday (index 0 on the x-axis) to its peak on Friday (index 4), then drops sharply on Saturday and reaches its minimum on Sunday (index 6). Friday is the turning point: it is the busiest day of the week, and from there into the weekend traffic volume falls significantly.

By averaging over all days of the week together, we let the weekend drag down the average of the business days. So let's take a closer look at two separate datasets, one for business days and one for weekend days, to make the difference in traffic volume between them clear.

In [105]:
## Split the data into business days and weekend days:
business_days = i94[i94['dayofweek'] <= 4].copy()
weekend_days = i94[i94['dayofweek'] > 4].copy()
## Group by day and take the mean of each dataset
by_businessdays = business_days.groupby('dayofweek').mean(numeric_only=True)
by_weekenddays = weekend_days.groupby('dayofweek').mean(numeric_only=True)
In [106]:
## Check some data:
by_businessdays['traffic_volume'].describe()
Out[106]:
count       5.000000
mean     3535.079628
std       142.036450
min      3309.387161
25%      3488.555799
50%      3583.196681
75%      3637.899663
max      3656.358836
Name: traffic_volume, dtype: float64
In [107]:
by_weekenddays['traffic_volume'].describe()
Out[107]:
count       2.000000
mean     2571.113225
std       286.413454
min      2368.588329
25%      2469.850777
50%      2571.113225
75%      2672.375673
max      2773.638120
Name: traffic_volume, dtype: float64

Now let's plot these two datasets and compare them.

In [110]:
## Use grid plot to compare
plt.figure(figsize=[20,10])
plt.subplot(1,2,1)
plt.plot(by_businessdays['traffic_volume'])
plt.title('Traffic Volume in Business Days (times)')
plt.xlabel('Day')
plt.ylabel('Volume (times)')
## The second plot in the grid
plt.subplot(1,2,2)
plt.plot(by_weekenddays['traffic_volume'])
plt.ylim([0,3700])
plt.title('Traffic Volume in Weekend Days (times)')
plt.xlabel('Day')
plt.ylabel('Volume (times)')
plt.show()

Combining the summary statistics with the line graphs, we can see a significant difference in traffic volume between business days and weekend days. On business days, volume climbs from its minimum on Monday to its peak on Friday (index 4 on the x-axis), the end of the business week. The trends look roughly linear: an increase across the business days and a decrease across the weekend, although with only five and two points per line this is just a rough impression.

On the weekend graph, the minimum is reached on Sunday, and the decline from Saturday to Sunday is gentle compared with the sharp drop from Friday to fewer than 3,000 vehicles per hour on Saturday (index 5 on the x-axis). To summarize, traffic volume increases through the business days and peaks on Friday.
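
To put a number on the gap between business days and the weekend, the per-day means printed earlier can be averaged within each group. This is a small sketch that copies those seven values into a Series; the variable names are illustrative.

```python
import pandas as pd

# Mean traffic_volume per day of week, copied from the output above (0 = Monday)
by_day = pd.Series(
    [3309.387161, 3488.555799, 3583.196681, 3637.899663,
     3656.358836, 2773.638120, 2368.588329],
    index=range(7))

business_mean = by_day.loc[0:4].mean()   # .loc slicing is inclusive: Monday to Friday
weekend_mean = by_day.loc[5:6].mean()    # Saturday and Sunday
relative_drop = (business_mean - weekend_mean) / business_mean
print(round(business_mean, 2), round(weekend_mean, 2), round(relative_drop, 3))
```

The two group means agree with the describe() outputs above (about 3,535 and 2,571), so average weekend volume is roughly 27% lower than on business days.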

CONCLUSION 2: Traffic volume is higher in certain months and on business days, which is when traffic conditions are heavy.

Time is an indicator of heavy traffic on I-94. This covers the daytime (hours 7 to 19), business days, and certain months of the year: April to June, August, and October.

4. Another factor: Weather

After time, we're going to explore the data further and find out what else can affect traffic. Let's do this by examining the correlation coefficients between traffic volume and the other columns.

In [118]:
## Compute the correlation matrix of the numeric columns
i94.corr(numeric_only=True)
Out[118]:
temp rain_1h snow_1h clouds_all traffic_volume month dayofweek
temp 1.000000 0.009069 -0.019755 -0.101976 0.130299 0.223738 -0.007708
rain_1h 0.009069 1.000000 -0.000090 0.004818 0.004714 0.001298 -0.006920
snow_1h -0.019755 -0.000090 1.000000 0.027931 0.000733 0.020412 -0.014928
clouds_all -0.101976 0.004818 0.027931 1.000000 0.067054 -0.009133 -0.039715
traffic_volume 0.130299 0.004714 0.000733 0.067054 1.000000 -0.002533 -0.149544
month 0.223738 0.001298 0.020412 -0.009133 -0.002533 1.000000 0.010741
dayofweek -0.007708 -0.006920 -0.014928 -0.039715 -0.149544 0.010741 1.000000

Recall that the Pearson correlation coefficient (r) ranges from -1 to 1: the closer r is to 1 or -1, the stronger the linear relationship between two variables, and the closer it is to 0, the weaker. In the correlation table above, r(traffic_volume, temp) = 0.13 and r(traffic_volume, dayofweek) = -0.149. Both are weak, but they are the largest in magnitude in the traffic_volume row, so temp and dayofweek are the candidates most worth examining. We have already looked at dayofweek, so let's turn to temp.
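
Because the sign of r only gives the direction of a relationship, the columns are best compared by absolute value. The sketch below copies the traffic_volume row of the correlation table into a Series and ranks it; the variable names are illustrative.

```python
import pandas as pd

# traffic_volume row of the correlation table above
r = pd.Series({
    'temp': 0.130299, 'rain_1h': 0.004714, 'snow_1h': 0.000733,
    'clouds_all': 0.067054, 'month': -0.002533, 'dayofweek': -0.149544,
})

# Rank by strength (absolute value); the original signs are kept in the result
ranked = r.reindex(r.abs().sort_values(ascending=False).index)
print(ranked)
```

dayofweek and temp come first, followed by clouds_all; rain_1h, month, and snow_1h are all very close to zero.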

In [113]:
## Convert the temp from Kelvin to Celsius
i94['temp'] = i94['temp'] - 273.15
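
As a quick check of the conversion, 0 °C corresponds to 273.15 K. The helper below is a hypothetical stand-alone version of the same arithmetic, applied to the first temp value shown in the head() output.

```python
def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvins to degrees Celsius."""
    return kelvin - 273.15

# First temp value shown in the head() output above
print(round(kelvin_to_celsius(288.28), 2))  # 15.13
```

A reading of about 15 °C is plausible for a morning in early October, which is when that first record was taken.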
In [121]:
## Graphing the relation of traffic volume with temp:
plt.scatter(i94['traffic_volume'],i94['temp'],)
plt.title('Traffic Volume by Ambient Temperature (times)')
plt.xlabel('Volume (times)')
plt.ylabel('Temperature (Celsius)')
plt.ylim([-40,40])
plt.show()

The scatter plot does not reveal any clear pattern, which is consistent with the weak correlation (r = 0.13), even though it is close in magnitude to that of dayofweek. We will try the next candidate: clouds_all.

In [122]:
## Graphing the relation of traffic volume with cloud cover percentage:
plt.scatter(i94['traffic_volume'], i94['clouds_all'])
plt.title('Traffic Volume by Cloud Cover Percentage (times)')
plt.xlabel('Volume (times)')
plt.ylabel('Cloud Cover (%)')
plt.show()

Since the scatter plot of traffic volume against cloud cover percentage shows nothing either, we move on to the categorical columns weather_main and weather_description. We're going to calculate the average traffic volume associated with each unique value in these two columns:

In [123]:
## Calculate the mean traffic volume for each weather category:
by_weathermain = i94.groupby('weather_main').mean(numeric_only=True)
by_weatherdesc = i94.groupby('weather_description').mean(numeric_only=True)
In [124]:
## Check data:
by_weathermain['traffic_volume']
Out[124]:
weather_main
Clear           3055.908819
Clouds          3618.449749
Drizzle         3290.727073
Fog             2703.720395
Haze            3502.101471
Mist            2932.956639
Rain            3317.905501
Smoke           3237.650000
Snow            3016.844228
Squall          2061.750000
Thunderstorm    3001.620890
Name: traffic_volume, dtype: float64
In [139]:
by_weathermain['traffic_volume'].describe()
Out[139]:
count      11.000000
mean     3067.239524
std       424.607888
min      2061.750000
25%      2967.288764
50%      3055.908819
75%      3304.316287
max      3618.449749
Name: traffic_volume, dtype: float64
In [126]:
by_weatherdesc['traffic_volume'].head()
Out[126]:
weather_description
SQUALLS          2061.750000
Sky is Clear     3423.148899
broken clouds    3661.142092
drizzle          3094.858679
few clouds       3691.453476
Name: traffic_volume, dtype: float64
In [140]:
by_weatherdesc['traffic_volume'].describe()
Out[140]:
count      38.000000
mean     3350.650074
std       708.136232
min      2061.750000
25%      2866.053160
50%      3220.126983
75%      3632.773235
max      5664.000000
Name: traffic_volume, dtype: float64

Now we're going to plot graphs to reveal the relation between traffic volume and weather_main / weather_description. This time, we'll use horizontal bar plots:

In [138]:
## Drawing plot of weather main and traffic volume
by_weathermain['traffic_volume'].plot.barh()
plt.title('Traffic Volume by Weather Status')
plt.xlabel('Volume (times)')
plt.ylabel('Weather Status')
plt.show()

As the graph shows, average traffic volume is highest under Clouds, followed by Haze. One possible reading is that people prefer to travel in cool, overcast weather rather than under a clear sky (compare with the weather description results below).

Also in the graph, Rain, Smoke, and Drizzle all have a high average volume (above 3,000). Only Squall stands out with the minimum, which is unsurprising since few people want to drive through a sudden violent storm. In other words, people tend to travel when the weather is reasonably dry, cool, and not too severe.

Finally, we examine the relation of traffic volume to the specific weather descriptions.

In [141]:
## Drawing plot of specific weather description and traffic volume
by_weatherdesc['traffic_volume'].plot.barh(figsize=[30,60])
plt.title('Traffic Volume by Specific Weather Situation')
plt.xlabel('Volume (times)')
plt.ylabel('Weather Situation')
plt.show()

There is a curious point here: the traffic volume is highest (over 5,000 vehicles per hour) for the description shower snow. We can also see that sky is clear and Sky is Clear describe the same situation but are kept as separate values because the grouping is case-sensitive, and this situation may rank second after shower snow. The minimum, unsurprisingly, is SQUALLS.
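
Both oddities can be examined with one small change to the grouping: lowercase the labels first, and report the number of records next to each mean. A group whose mean is a whole number, like the 5664.0 maximum above, may rest on very few records, and the count would reveal that. The sketch below uses made-up rows, and the helper name mean_volume_by_description is hypothetical.

```python
import pandas as pd

def mean_volume_by_description(df):
    """Mean and record count of traffic_volume per lowercased weather_description."""
    key = df['weather_description'].str.lower()
    return df.groupby(key)['traffic_volume'].agg(['mean', 'count'])

# Made-up rows (not real records) with the same label in two spellings
sample = pd.DataFrame({
    'weather_description': ['Sky is Clear', 'sky is clear', 'sky is clear', 'shower snow'],
    'traffic_volume': [3000, 4000, 5000, 5664],
})
summary = mean_volume_by_description(sample)
print(summary)
```

In this toy example the two spellings merge into one group of three records, while shower snow is backed by a single record, so its mean should be read with caution.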

weather_main is only a short label for the situation that weather_description describes in detail, and the two comparisons give partly mismatched results. We therefore treat the weather_main result above as a reference only.

Because weather_description records the weather at measurement time in more detail, we rely on it and consider that traffic volume is high in clear weather. As for shower snow, we can accept the result provisionally and wait for additional data to check whether it holds up.

In clear weather, traffic volume tends to increase.

Summary:

  1. Time is an indicator of heavy traffic on I-94. This covers the daytime (hours 7 to 19), the warmer months (April to June, August, and October), and business days.
  2. Weather is another indicator of heavy traffic, but so far only clear weather can be accepted as associated with higher traffic volume. As mentioned, shower snow is kept as a reference item that may turn out to be a factor in higher traffic volume.