We will analyze a dataset about the westbound traffic on the I-94 Interstate highway.
The goal of our analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc.
import pandas as pd
traffic = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")
traffic.head()
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
traffic.head(50)
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
5 | None | 291.72 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 14:00:00 | 5181 |
6 | None | 293.17 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 15:00:00 | 5584 |
7 | None | 293.86 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 16:00:00 | 6015 |
8 | None | 294.14 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-02 17:00:00 | 5791 |
9 | None | 293.10 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-02 18:00:00 | 4770 |
10 | None | 290.97 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-02 19:00:00 | 3539 |
11 | None | 289.38 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 20:00:00 | 2784 |
12 | None | 288.61 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 21:00:00 | 2361 |
13 | None | 287.16 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 22:00:00 | 1529 |
14 | None | 285.45 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 23:00:00 | 963 |
15 | None | 284.63 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 00:00:00 | 506 |
16 | None | 283.47 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 01:00:00 | 321 |
17 | None | 281.18 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 02:00:00 | 273 |
18 | None | 281.09 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 03:00:00 | 367 |
19 | None | 279.53 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 04:00:00 | 814 |
20 | None | 278.62 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 05:00:00 | 2718 |
21 | None | 278.23 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 06:00:00 | 5673 |
22 | None | 278.12 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 08:00:00 | 6511 |
23 | None | 282.48 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 09:00:00 | 5471 |
24 | None | 291.97 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 12:00:00 | 5097 |
25 | None | 293.23 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 13:00:00 | 4887 |
26 | None | 294.31 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 14:00:00 | 5337 |
27 | None | 295.17 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 15:00:00 | 5692 |
28 | None | 295.13 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 16:00:00 | 6137 |
29 | None | 293.66 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-03 18:00:00 | 4623 |
30 | None | 290.65 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-03 19:00:00 | 3591 |
31 | None | 288.19 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-03 20:00:00 | 2898 |
32 | None | 287.10 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 21:00:00 | 2637 |
33 | None | 286.25 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 22:00:00 | 1777 |
34 | None | 285.26 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 23:00:00 | 1015 |
35 | None | 284.55 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 00:00:00 | 598 |
36 | None | 283.47 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 01:00:00 | 369 |
37 | None | 283.17 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 02:00:00 | 312 |
38 | None | 282.04 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 03:00:00 | 367 |
39 | None | 281.69 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 04:00:00 | 835 |
40 | None | 281.32 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 05:00:00 | 2726 |
41 | None | 280.74 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 06:00:00 | 5689 |
42 | None | 280.57 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 07:00:00 | 6990 |
43 | None | 281.86 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 08:00:00 | 5985 |
44 | None | 284.98 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 09:00:00 | 5309 |
45 | None | 289.18 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 10:00:00 | 4603 |
46 | None | 291.55 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 11:00:00 | 4884 |
47 | None | 294.97 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 12:00:00 | 5104 |
48 | None | 296.38 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 13:00:00 | 5178 |
49 | None | 297.32 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 14:00:00 | 5501 |
As we can see, there is some irregularity in the data at the 23rd column. We will try to correct it only if it has occured substantially.
traffic.tail()
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
traffic.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48204 entries, 0 to 48203 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 48204 non-null object 1 temp 48204 non-null float64 2 rain_1h 48204 non-null float64 3 snow_1h 48204 non-null float64 4 clouds_all 48204 non-null int64 5 weather_main 48204 non-null object 6 weather_description 48204 non-null object 7 date_time 48204 non-null object 8 traffic_volume 48204 non-null int64 dtypes: float64(3), int64(2), object(4) memory usage: 3.3+ MB
import matplotlib.pyplot as plt
%matplotlib inline
traffic_volume = traffic['traffic_volume']
traffic_volume.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb290a4d940>
traffic_volume.describe()
count 48204.000000 mean 3259.818355 std 1986.860670 min 0.000000 25% 1193.000000 50% 3380.000000 75% 4933.000000 max 7280.000000 Name: traffic_volume, dtype: float64
The most direct comparison for the staggering difference in traffic volume can be done by comparing the data at daytime and nighttime.
We can divide the day as follows-
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
traffic['date_time'].dt.hour
0 9 1 10 2 11 3 12 4 13 .. 48199 19 48200 20 48201 21 48202 22 48203 23 Name: date_time, Length: 48204, dtype: int64
traffic_day = traffic.loc[(traffic['date_time'].dt.hour >=7)
& (traffic['date_time'].dt.hour < 19),
"traffic_volume"]
day = traffic.copy()[(traffic['date_time'].dt.hour >=7)
& (traffic['date_time'].dt.hour < 19)]
traffic_day
0 5545 1 4516 2 4767 3 5026 4 4918 ... 48194 4302 48195 4302 48196 4283 48197 4132 48198 3947 Name: traffic_volume, Length: 23877, dtype: int64
traffic_night = traffic.loc[(traffic['date_time'].dt.hour >=19)
| (traffic['date_time'].dt.hour < 7),
"traffic_volume"]
night = traffic.copy()[(traffic['date_time'].dt.hour >=19)
| (traffic['date_time'].dt.hour < 7)]
traffic_night
10 3539 11 2784 12 2361 13 1529 14 963 ... 48199 3543 48200 2781 48201 2159 48202 1450 48203 954 Name: traffic_volume, Length: 24327, dtype: int64
print(traffic_day.shape, traffic_night.shape)
(23877,) (24327,)
As we can see, there is a difference between the shape or number o fnetries in the daytime and nightime. This is due to missing hours in our data. As mentioned in a cell above, the data after 23rd columns are missing. By default these data were supposed to contribute to the daytime rows.
plt.figure(figsize = (12,4))
plt.subplot(1,2,1)
plt.hist(traffic_day)
plt.xlim(0, 7500)
plt.ylim(0, 8000)
plt.title('Traffic Volume: Day')
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')
plt.subplot(1,2,2)
plt.hist(traffic_night)
plt.xlim(0, 7500)
plt.ylim(0, 8000)
plt.title('Traffic Volume: Night')
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')
Text(0.5, 0, 'Traffic Volume')
traffic_day.describe()
count 23877.000000 mean 4762.047452 std 1174.546482 min 0.000000 25% 4252.000000 50% 4820.000000 75% 5559.000000 max 7280.000000 Name: traffic_volume, dtype: float64
traffic_night.describe()
count 24327.000000 mean 1785.377441 std 1441.951197 min 0.000000 25% 530.000000 50% 1287.000000 75% 2819.000000 max 6386.000000 Name: traffic_volume, dtype: float64
Our goal is to find indicators of heavy traffic, so we decide to only focus on the daytime data from here on.
day['month'] = day['date_time'].dt.month
by_month = day.groupby('month').mean()
by_month['traffic_volume']
month 1 4495.613727 2 4711.198394 3 4889.409560 4 4906.894305 5 4911.121609 6 4898.019566 7 4595.035744 8 4928.302035 9 4870.783145 10 4921.234922 11 4704.094319 12 4374.834566 Name: traffic_volume, dtype: float64
by_month['traffic_volume'].plot.line()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb28c2c5040>
The traffic volume is low during the winter months(Nov-Feb) while it is high during the summmer months(March-Oct). There is one exception in month og July where the volume has dropped uncharacteristically.
We can check if there is any event or factor which causes the average traffic volume to drop in July.
Below we plot the average traffic volume in JULY for each year of the data.
day['year'] = day['date_time'].dt.year
only_july = day[day['month'] == 7]
only_july.groupby('year').mean()['traffic_volume'].plot.line()
plt.show()
As seen, the decrease was more specific to 2016. As stated in this report, the downwards spike was due to construction.
Now, to conclude the monthly analysis, the warm months show a high traffic volume as compared to cold months, at nearly 5000 hourly cars.
We will now analyse using a more fundamental metric than month, that is by days of the weeks.
day['dayofweek'] = day['date_time'].dt.dayofweek
by_dayofweek = day.groupby('dayofweek').mean()
by_dayofweek['traffic_volume']
dayofweek 0 4893.551286 1 5189.004782 2 5284.454282 3 5311.303730 4 5291.600829 5 3927.249558 6 3436.541789 Name: traffic_volume, dtype: float64
by_dayofweek['traffic_volume'].plot.line()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb290d3a550>
Now we will analyse an even more fundametric, time of the each day
day['hour'] = day['date_time'].dt.hour
bussiness_days = day.copy()[day['dayofweek'] <= 4] # 4 == Friday
weekend = day.copy()[day['dayofweek'] >= 5] # 5 = Saturday
by_hour_business = bussiness_days.groupby('hour').mean()
by_hour_weekend = weekend.groupby('hour').mean()
plt.figure(figsize=(11,3.5))
plt.subplot(1, 2, 1)
by_hour_business['traffic_volume'].plot.line()
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('Traffic Volume By Hour: Monday–Friday')
plt.subplot(1, 2, 2)
by_hour_weekend['traffic_volume'].plot.line()
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('Traffic Volume By Hour: Weekend')
plt.show()
To conclude our analysis of effects of time metrics on traffic volume are as follows-
Having focused on time indicators for heavy traffic, we will now analyse the weather topic. We have many columns about weather types in dataframe.
day.corr()["traffic_volume"]
temp 0.128317 rain_1h 0.003697 snow_1h 0.001265 clouds_all -0.032932 traffic_volume 1.000000 month -0.022337 year -0.003557 dayofweek -0.416453 hour 0.172704 Name: traffic_volume, dtype: float64
Among the weather based columns, the temp column has the strongest correlation with traffic_volume.
day.plot.scatter('traffic_volume', 'temp')
plt.ylim(235, 320)
plt.show()
There is no type of correlation between traffic_volume and temp
day.plot.scatter('traffic_volume', 'rain_1h')
plt.show()
day.plot.scatter('traffic_volume', 'snow_1h')
plt.show()
day.plot.scatter('traffic_volume', 'clouds_all')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb294731550>
There is no correlation of volume with any parameters
We will have to look at the categorical weather related columns because we dont have any correlation with numerical based columns
by_weather_main = day.groupby('weather_main').mean()
by_weather_main['traffic_volume'].plot.barh()
plt.show()
by_weather_main = day.groupby('weather_description').mean()
by_weather_main['traffic_volume'].plot.barh(figsize = (9,15))
plt.show()
Weather types exceed 5000 only for three weather types -
Time Indicators-
Weather Indicators -