In this project, we will analyse a dataset about the westbound traffic on the I-94 Interstate highway, an east–west Interstate Highway connecting the Great Lakes and northern Great Plains regions of the United States.
The dataset for this analysis is made available by the UCI Machine Learning Repository.
The aim of this analysis is to find indicators of heavy traffic on I-94. The dataset provides the following columns, some of which may serve as such indicators:
Column | Type | Description |
---|---|---|
holiday | Categorical | US national holidays plus regional holiday |
temp | Numeric | Average temperature in kelvin |
rain_1h | Numeric | Amount of rain in mm that occurred in the hour |
snow_1h | Numeric | Amount of snow in mm that occurred in the hour |
clouds_all | Numeric | Percentage of cloud cover |
weather_main | Categorical | Short textual description of the current weather |
weather_description | Categorical | Longer textual description of the current weather |
date_time | DateTime | Hour of the data collected, in local CST time |
traffic_volume | Numeric | Hourly I-94 ATR 301 reported westbound traffic volume |
# importing the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Read the dataset
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic.head()
traffic.tail()
traffic.info()
The dataset contains a total of 48204 rows and 9 columns. The columns include both numerical and string values. There is no column or datapoint with a null or empty value.
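The shape and the absence of missing values can also be checked directly instead of being read off the `info()` output. The following is a minimal sketch on a small made-up frame; the `sample` rows are hypothetical, and in the notebook the same calls would be made on `traffic`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real `traffic` frame
sample = pd.DataFrame({
    'temp': [288.28, 289.36, 289.58],
    'weather_main': ['Clouds', 'Clouds', 'Clouds'],
    'traffic_volume': [5545, 4516, 4767],
})

n_rows, n_cols = sample.shape             # (number of rows, number of columns)
nulls_per_column = sample.isnull().sum()  # count of missing values in each column
has_nulls = bool(nulls_per_column.any())  # True if any column has a missing value

print(n_rows, n_cols, has_nulls)
```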
Each entry contains data about traffic for a particular hour. The data was recorded hourly from 2012-10-02 09:00:00 to 2018-09-30 23:00:00.
The documentation of the dataset states that the station collecting the data was located approximately midway between Minneapolis and Saint Paul, and that it recorded only westbound traffic, i.e. cars moving from east to west.
Therefore, our conclusions cannot be generalized to traffic along the entire I-94 highway.
Let's look at the distribution of traffic volume.
traffic['traffic_volume'].plot.hist()
plt.xlabel('Traffic Volume')
plt.title('Distribution of traffic volume')
plt.show()
traffic['traffic_volume'].describe()
The minimum traffic volume is 0, indicating no traffic, and the maximum is 7280.
About 25% of the time, 1193 cars or fewer pass the station each hour. These hours probably fall late at night.
About 25% of the time, 4933 cars or more pass the station each hour. These hours probably fall during the day.
The most frequent traffic volumes are in the ranges 0 to 1000 and 4500 to 5000. These may reflect traffic during the night and the day respectively.
There are few cases where traffic volume reaches values between 6500 and 7000.
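The 25% and 75% rows reported by `describe()` are quantiles: about a quarter of the hourly observations lie at or below the first, and about a quarter lie at or above the second. A small sketch of that reading, using a hypothetical evenly spaced series rather than the real data:

```python
import pandas as pd

# Hypothetical hourly volumes: 0, 70, 140, ..., 7000 (101 evenly spaced values)
volumes = pd.Series(range(0, 101)) * 70

q25 = volumes.quantile(0.25)  # value below which about 25% of observations fall
q75 = volumes.quantile(0.75)  # value above which about 25% of observations fall

share_low = (volumes <= q25).mean()   # fraction of observations at or below q25
share_high = (volumes >= q75).mean()  # fraction of observations at or above q75

print(q25, q75, round(share_low, 3), round(share_high, 3))
```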
Our previous analysis shows that there could be a potential correlation between the time of the day and traffic volume.
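We will divide our dataset into two parts: daytime data (hours 7 to 18, i.e. from 7 a.m. up to 7 p.m.) and nighttime data (hours 19 to 6, i.e. from 7 p.m. up to 7 a.m.).
Though these values may not be good enough to separate daytime from nighttime, they can serve as a good starting point.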
# Convert the date_time column to datetime objects
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
# Daytime covers hours 7 to 18; nighttime covers the remaining hours (19 to 23 and 0 to 6)
hour = traffic['date_time'].dt.hour
daytime = traffic[hour.between(7, 18, inclusive='both')].copy()
nighttime = traffic[(hour >= 19) | (hour <= 6)].copy()
nighttime.head(3)
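The split relies on a boolean mask built from the hour of each timestamp. The sketch below shows the mask on a handful of hypothetical timestamps (the `stamps` values are made up for illustration). Note that `dt.hour` only takes values from 0 to 23, so an upper bound of 24 is never reached.

```python
import pandas as pd

# Hypothetical timestamps covering both sides of each boundary
stamps = pd.Series(pd.to_datetime([
    '2016-01-01 00:00:00', '2016-01-01 06:00:00', '2016-01-01 07:00:00',
    '2016-01-01 12:00:00', '2016-01-01 18:00:00', '2016-01-01 19:00:00',
    '2016-01-01 23:00:00',
]))

hours = stamps.dt.hour
is_day = hours.between(7, 18, inclusive='both')  # True for hours 7 through 18

day_hours = hours[is_day].tolist()     # hours assigned to the daytime subset
night_hours = hours[~is_day].tolist()  # every remaining hour goes to nighttime

print(day_hours, night_hours)
```

Using the negated mask for the night subset guarantees that every row ends up in exactly one of the two subsets.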
# Plot the day and night histograms side by side
plt.figure(figsize=(15, 12))
# Day
plt.subplot(2, 2, 1)
plt.hist(daytime['traffic_volume'])
plt.title('Variation of traffic: day')
plt.ylim([0, 8000])
plt.xlim(0, 7500)
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
#Night
plt.subplot(2, 2, 2)
plt.hist(nighttime['traffic_volume'])
plt.title('Variation of traffic: Night')
plt.ylim([0, 8000])
plt.xlim(0, 7500)
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.show()
daytime['traffic_volume'].describe()
nighttime['traffic_volume'].describe()
The histogram of daytime traffic volume is left skewed, which shows that most daytime traffic volumes are high. About 75% of the time, between roughly 4000 and 7250 cars pass the station each hour.
The histogram of nighttime traffic volume is right skewed, which indicates that most traffic volumes are low at night. 75% of the time, the traffic volume is 2819 or less.
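The direction of the skew can also be checked numerically with `Series.skew()`: a negative value points to a long tail toward low values (left skew), and a positive value to a long tail toward high values (right skew). A minimal sketch on two hypothetical samples, not on the real `daytime` and `nighttime` frames:

```python
import pandas as pd

# Hypothetical samples: one bunched at high volumes, one bunched at low volumes
day_like = pd.Series([1000, 4000, 4500, 5000, 5000, 5500, 6000])
night_like = pd.Series([300, 400, 500, 500, 800, 1500, 5000])

day_skew = day_like.skew()      # negative for a left-skewed sample
night_skew = night_like.skew()  # positive for a right-skewed sample

print(day_skew < 0, night_skew > 0)
```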
Since the traffic is light at night, and the aim of this analysis is to find indicators of heavy traffic, we can get rid of all the data collected at night.
Henceforth we will continue working with daytime data.
Time is an indicator that may affect traffic, since people find themselves outside at different times for different reasons.
We will look at the variation of traffic in three different time frames: month, day of the week, and time of day.
daytime['month'] = daytime['date_time'].dt.month
# Grouping the dataset by month with the mean as an aggregate function
by_month = daytime.groupby('month').mean(numeric_only=True)  # skip the string columns
by_month['traffic_volume']
by_month['traffic_volume'].plot.line()
plt.show()
We can observe heavy traffic from March to October. This may be due to the warm weather, which encourages outdoor activity.
Traffic is lighter from November to February, which may be due to the cold in these months.
There is one exception within the warm months: we observe a sudden decrease in traffic in July.
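One way to look into the July dip would be to break the July averages down by year and see whether a single year pulls the mean down. The sketch below only illustrates the grouping; the `july` frame and its values are hypothetical and say nothing about the real data.

```python
import pandas as pd

# Hypothetical July observations; the real check would filter `daytime` by month
july = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2015-07-01 08:00:00', '2015-07-02 08:00:00',
        '2016-07-01 08:00:00', '2016-07-02 08:00:00',
    ]),
    'traffic_volume': [5000, 5200, 3000, 3400],
})

july['year'] = july['date_time'].dt.year
july_by_year = july.groupby('year')['traffic_volume'].mean()  # mean volume per year
lowest_year = july_by_year.idxmin()                           # year with the lowest July mean

print(july_by_year)
print(lowest_year)
```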
daytime['dayofweek'] = daytime['date_time'].dt.dayofweek
by_dayofweek = daytime.groupby('dayofweek').mean(numeric_only=True)  # skip the string columns
by_dayofweek['traffic_volume'] # 0 is Monday, 6 is Sunday
by_dayofweek['traffic_volume'].plot.line()
plt.show()
We can observe heavy traffic volume on business days (Monday to Friday), which is probably due to people commuting and going about their work.
The traffic volume is low on Saturday and even lower on Sunday.
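The integer codes returned by `dt.dayofweek` run from 0 (Monday) to 6 (Sunday), so business days are the codes 0 to 4. A short sketch of that mapping on a few hypothetical dates:

```python
import pandas as pd

# Hypothetical dates: a Monday, a Friday, a Saturday and a Sunday
dates = pd.Series(pd.to_datetime(['2016-01-04', '2016-01-08', '2016-01-09', '2016-01-10']))

codes = dates.dt.dayofweek.tolist()                     # integer day codes, Monday == 0
names = dates.dt.day_name().tolist()                    # readable day names
is_business_day = (dates.dt.dayofweek <= 4).tolist()    # True for Monday to Friday

print(codes, names, is_business_day)
```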
daytime['hour'] = daytime['date_time'].dt.hour
business_days = daytime[daytime['dayofweek'] <= 4].copy() # 4 == Friday
weekend = daytime[daytime['dayofweek'] >= 5].copy() # 5 == Saturday
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
print(by_hour_business['traffic_volume'])
print(by_hour_weekend['traffic_volume'])
plt.figure(figsize=(12, 4))
# Business day plot
plt.subplot(1, 2, 1)
by_hour_business['traffic_volume'].plot.line()
plt.title('Traffic Volume Variation: Business Days')
plt.xlabel('Hour')
plt.ylim([0, 6250])
plt.ylabel('Traffic Volume')
# Weekend Plot
plt.subplot(1, 2, 2)
by_hour_weekend['traffic_volume'].plot.line()
plt.title('Traffic Volume Variation: Weekend')
plt.xlabel('Hour')
plt.ylim([0, 6250])
plt.ylabel('Traffic Volume')
plt.show()
On business days, traffic is heaviest at 7 a.m., 4 p.m. and 5 p.m. These are the times at which people travel to and return from work.
On weekends, by contrast, traffic is light from 7 a.m. to 9 a.m.
Overall, at any given hour of the day, traffic is heavier on business days than on weekends.
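The rush hours can also be read off the hourly averages programmatically rather than from the plot. A minimal sketch, assuming a series indexed by hour like `by_hour_business['traffic_volume']`; the values below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical mean volumes per hour (index = hour of day)
by_hour = pd.Series(
    {7: 6000, 8: 5500, 12: 4800, 16: 6200, 17: 5800, 18: 4400},
    name='traffic_volume',
)

peak_hour = by_hour.idxmax()                  # hour with the highest mean volume
top_two = by_hour.nlargest(2).index.tolist()  # the two busiest hours, busiest first

print(peak_hour, top_two)
```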
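To sum up, traffic is heavier during the warm months (March to October) than during the cold months, on business days than on weekends, and, on business days, around 7 a.m. and between 4 p.m. and 5 p.m.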
Another indicator that may influence traffic is weather. Our dataset contains the following weather-related columns: temp, rain_1h, snow_1h, clouds_all, weather_main and weather_description.
Some of these columns are numerical. Let us check their correlation with traffic volume.
daytime.corr(numeric_only=True)['traffic_volume']  # correlate the numeric columns only
The correlation between the weather indicators and the traffic volume is very weak. The temp column shows the strongest correlation, with a value of only 0.13.
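The values above are Pearson correlation coefficients, which only make sense for numeric columns. The sketch below shows one way to restrict the computation to numeric columns with `select_dtypes`; the `frame` and its values are hypothetical and much smaller than the real data.

```python
import pandas as pd

# Hypothetical frame mixing numeric and string columns
frame = pd.DataFrame({
    'temp': [270.0, 280.0, 290.0, 300.0],
    'weather_main': ['Snow', 'Clouds', 'Clear', 'Clear'],
    'traffic_volume': [4000, 4800, 4500, 5000],
})

# Keep only the numeric columns, then take the Pearson correlation with traffic_volume
corr = frame.select_dtypes('number').corr()['traffic_volume']

print(corr)
```

A coefficient close to 0, such as the 0.13 found for temp, indicates almost no linear relationship, whereas values near 1 or -1 indicate a strong one.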
Let's visualize the relationship between temp and traffic volume.
daytime.plot.scatter('traffic_volume', 'temp')
plt.ylim(230, 320)
plt.show()
Next, let's look at the categorical weather-related columns to see whether they give useful insight. There are two of these columns: weather_main and weather_description.
# Grouping the data by weather_main and weather_description
by_weather_main = daytime.groupby('weather_main').mean(numeric_only=True)
by_weather_description = daytime.groupby('weather_description').mean(numeric_only=True)
# Traffic Volume vs Weather main
by_weather_main['traffic_volume'].plot.barh()
plt.show()
No weather type has an average traffic volume above 5000 cars, so we cannot conclude that any of them is a strong indicator of heavy traffic.
Let's look at the weather_description distribution.
# Traffic Volume vs Weather description
by_weather_description['traffic_volume'].plot.barh(figsize=(4, 10))
plt.show()
The average traffic volume exceeds 5000 cars when the weather matches the descriptions shower snow, proximity thunderstorm with drizzle, and light rain and snow. This suggests that these weather conditions may be indicators of heavy traffic.
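A group mean can be misleading when it rests on very few rows, so it may be worth looking at the number of observations behind each weather description alongside its average. The following is a sketch of that check on hypothetical data; the `obs` frame is made up and does not reflect the real counts.

```python
import pandas as pd

# Hypothetical observations; the real check would use `daytime`
obs = pd.DataFrame({
    'weather_description': ['shower snow', 'sky is clear', 'sky is clear',
                            'sky is clear', 'light rain'],
    'traffic_volume': [5600, 4700, 4900, 5100, 4800],
})

# Mean volume and number of rows for each weather description
summary = obs.groupby('weather_description')['traffic_volume'].agg(['mean', 'count'])
heavy = summary[summary['mean'] > 5000]  # descriptions whose mean exceeds 5000

print(heavy)
```

In this made-up example the only description above 5000 is backed by a single row, which is the kind of case where the mean alone would be weak evidence.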
In this project, we analyzed the I-94 highway traffic dataset to find the main indicators of heavy traffic.
At the end of the analysis, we found two major types of heavy traffic indicators: time indicators and weather indicators.