In this project, we're going to analyze a dataset about the westbound traffic on the I-94 Interstate highway connecting the Great Lakes and northern Great Plains regions of the U.S. The dataset was made available by John Hogue and can be downloaded from this repository.
The goal of our analysis is to determine a few indicators of heavy traffic on I-94, such as weather type, day of the week, hour, etc.
We found that traffic is most intense in the daytime, in warm months, and on business days, especially at 6.00-8.00 and 16.00-17.00. Temperature doesn't influence traffic intensity, while some relatively light weather conditions do. The lowest average traffic volume was recorded in 2016, followed by the highest in 2017. Of all the holidays, Columbus Day has the heaviest traffic, while Christmas Day and New Year's Day have the lightest.
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| 1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| 2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| 3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| 4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
traffic.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
| 48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
| 48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
| 48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
| 48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
print(f'\033[1mNumber of rows:\033[0m\t\t {traffic.shape[0]:,}'
f'\n\033[1mNumber of columns:\033[0m \t {traffic.shape[1]:,}'
f'\n\033[1mNumber of missing values:\033[0m {traffic.isnull().sum().sum()}'
f'\n\n\033[1mCOLUMN NAMES:\033[0m \n{traffic.columns.to_list()}'
f'\n\n\033[1mDATA TYPES:\033[0m \n{traffic.dtypes}')
Number of rows: 48,204
Number of columns: 9
Number of missing values: 0

COLUMN NAMES:
['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main', 'weather_description', 'date_time', 'traffic_volume']

DATA TYPES:
holiday object
temp float64
rain_1h float64
snow_1h float64
clouds_all int64
weather_main object
weather_description object
date_time object
traffic_volume int64
dtype: object
There are 48,204 entries and 9 columns in the dataset, and no NaN values. The column `holiday` seems to contain a lot of 'None' values, though. Below are the descriptions of each column from the documentation:
- `holiday`: US national holidays plus a regional holiday, the Minnesota State Fair,
- `temp`: average temperature in kelvin,
- `rain_1h`: the amount of rain per hour, in mm,
- `snow_1h`: the amount of snow per hour, in mm,
- `clouds_all`: % of cloud cover,
- `weather_main`: a short textual description of the current weather,
- `weather_description`: a longer textual description of the current weather,
- `date_time`: the hour of the data collected, in local CST time,
- `traffic_volume`: hourly I-94 ATR 301 reported westbound traffic volume.

The dataset documentation mentions that a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records cars moving from east to west.
This means that the results of our analysis will be about the westbound traffic in the proximity of that station, so we should avoid generalizing our results for the entire I-94 highway.
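A practical note on the 'None' values in `holiday`: they are literal strings in the CSV. Depending on the pandas version, `read_csv` may parse them as missing values instead (my understanding is that 'None' joined the default missing-value markers in pandas 2.0, but treat that as an assumption to verify). The sketch below uses a tiny made-up CSV, not the real file, to show one way of keeping the strings intact:

```python
import io
import pandas as pd

# Two made-up rows laid out like a few columns of the dataset (values are illustrative).
sample_csv = io.StringIO(
    "holiday,temp,traffic_volume\n"
    "None,288.28,5545\n"
    "Columbus Day,273.08,455\n"
)

# keep_default_na=False switches off pandas' built-in list of missing-value
# markers; na_values=[''] still treats empty cells as NaN.
sample = pd.read_csv(sample_csv, keep_default_na=False, na_values=[''])

none_rows = (sample['holiday'] == 'None').sum()
print(f"Rows with the literal string 'None': {none_rows}")
print(f"Missing values in holiday: {sample['holiday'].isnull().sum()}")
```

With these options, comparisons such as `traffic['holiday'] != 'None'` should behave the same way regardless of the pandas version.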
Let's plot the distribution of traffic volume:
def create_hist(df, bins, color, title, title_font, axis_font, tick_font):
    """Plot a histogram of the traffic_volume column on the current axes."""
    plt.hist(df['traffic_volume'], bins=bins, color=color)
    plt.title(title, fontsize=title_font)
    plt.xlabel('Traffic volume, cars/hr', fontsize=axis_font)
    plt.ylabel('Frequency', fontsize=axis_font)
    plt.xticks(fontsize=tick_font)
    plt.yticks(fontsize=tick_font)
    sns.despine()

# Plotting the overall distribution of traffic volume
plt.figure(figsize=(10,6))
create_hist(df=traffic, bins=24, color='slateblue',
            title='Traffic volume distribution',
            title_font=30, axis_font=20, tick_font=14)
round(traffic['traffic_volume'].describe())
count 48204.0
mean 3260.0
std 1987.0
min 0.0
25% 1193.0
50% 3380.0
75% 4933.0
max 7280.0
Name: traffic_volume, dtype: float64
Some observations here:
At this point, we can assume that traffic volume is strongly influenced by daytime and nighttime. In particular, in the daytime, most probably, there are distinct periods of moderate traffic (related to working hours) and heavy traffic with traffic jams (related to the morning and evening hours when people usually go to or from work). So, let's compare daytime with nighttime data.
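We'll start by dividing the dataset into two parts:
- daytime data: hours from 7.00 to 19.00 (12 hours),
- nighttime data: hours from 19.00 to 7.00 (12 hours).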
While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
# Dividing the dataset into daytime (7.00-18.59) and nighttime (19.00-6.59) subsets.
# Calling .copy() after filtering gives independent dataframes to which
# new columns can be added safely later on.
hour = traffic['date_time'].dt.hour
day = traffic[(hour>=7)&(hour<19)].copy()
night = traffic[(hour<7)|(hour>=19)].copy()
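Since 7.00 and 19.00 are only a starting criterion, it can be handy to wrap the split in a function and try other boundaries. Below is a minimal sketch on synthetic hourly data; `split_day_night`, `day_start`, and `day_end` are hypothetical names introduced here, not part of the original analysis:

```python
import pandas as pd

def split_day_night(df, day_start=7, day_end=19):
    """Hypothetical helper: return (day, night) copies of df split by hour."""
    hours = df['date_time'].dt.hour
    is_day = (hours >= day_start) & (hours < day_end)
    return df[is_day].copy(), df[~is_day].copy()

# Synthetic data: one row per hour for a single day, with made-up volumes.
demo = pd.DataFrame({
    'date_time': pd.date_range('2012-10-02 00:00', periods=24,
                               freq=pd.Timedelta(hours=1)),
    'traffic_volume': range(24),
})

demo_day, demo_night = split_day_night(demo)
print(len(demo_day), len(demo_night))
```

Shifting `day_start` to 6, for example, would move the early-morning hour into the daytime subset.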
Now, we're going to compare traffic volume at night and during the day:
dfs = [day, night]
titles = ['Traffic volume: Day', 'Traffic volume: Night']
colors = ['deepskyblue', 'darkblue']
# Plotting the distributions of traffic volume at night and during the day
plt.figure(figsize=(10,5))
for i in range(1,3):
    plt.subplot(1, 2, i)
    create_hist(df=dfs[i-1], bins=20, color=colors[i-1], title=titles[i-1],
                title_font=22, axis_font=16, tick_font=11)
plt.tight_layout(pad=2)
day_stats = round(day['traffic_volume'].describe())
night_stats = round(night['traffic_volume'].describe())
print(f'\033[1mDAYTIME STATS:\033[0m'
f'\n{day_stats}'
f'\n\n\033[1mNIGHTTIME STATS:\033[0m'
f'\n{night_stats}\n')
DAYTIME STATS:
count 23877.0
mean 4762.0
std 1175.0
min 0.0
25% 4252.0
50% 4820.0
75% 5559.0
max 7280.0
Name: traffic_volume, dtype: float64

NIGHTTIME STATS:
count 24327.0
mean 1785.0
std 1442.0
min 0.0
25% 530.0
50% 1287.0
75% 2819.0
max 6386.0
Name: traffic_volume, dtype: float64
Let's analyze the features for daytime and nighttime separately.
Daytime:
Nighttime:
All in all, from the histograms above, we can conclude that in general, the night traffic is much less intense than that of daytime. Since our goal is to find indicators of heavy traffic, let's focus on the daytime.
One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain hour.
Let's check how traffic volume changed by different time units: year, month, day of the week, or hour.
day['year'] = day['date_time'].dt.year
# numeric_only=True keeps the text and datetime columns out of the averaging
by_year = day.groupby('year').mean(numeric_only=True)

def create_line_plot(df, title, xlabel, tick_min=None, tick_max=None, labels=None,
                     xmin=None, xmax=None, ymin=None, ymax=None,
                     title_font=30, color='slateblue',
                     linestyle=None, marker=None, rotation=None):
    """Plot the traffic_volume column of df against its index on the current axes."""
    plt.plot(df['traffic_volume'], color=color, linestyle=linestyle, marker=marker)
    plt.title(title, fontsize=title_font)
    plt.xlabel(xlabel, fontsize=20)
    plt.ylabel('Traffic volume, cars/hr', fontsize=20)
    # Explicit tick positions are set only when a tick range is given
    if tick_max is not None:
        ticks = list(range(tick_min, tick_max+1))
    else:
        ticks = None
    plt.xticks(ticks=ticks, labels=labels, fontsize=15, rotation=rotation)
    plt.yticks(fontsize=12)
    plt.xlim(xmin,xmax)
    plt.ylim(ymin,ymax)
    sns.despine()
# Plotting traffic volume by year
plt.figure(figsize=(10,5))
create_line_plot(df=by_year,
title='Traffic volume by year', xlabel='Year')
The lowest traffic volume was recorded in 2016, followed by the highest in 2017. One possible explanation is that large-scale road works in 2016 temporarily reduced the traffic volume. If the road was being expanded, that could also explain the increase the year after. This remains a hypothesis, since the dataset itself contains no information about road works.
day['month'] = day['date_time'].dt.month
by_month = day.groupby('month').mean(numeric_only=True)
# Plotting traffic volume by month
plt.figure(figsize=(10,5))
create_line_plot(df=by_month,
title='Traffic volume by month', xlabel=None,
tick_min=1, tick_max=12,
labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
xmin=1, xmax=12,
rotation=30)
We see that in general, the traffic is much less intense in cold months (November-February). However, we can observe a sharp traffic volume decrease also in July, which looks curious. Let's try to figure out the reason for it and check the traffic volume of this month by year.
july = day[day['month']==7].groupby('year').mean(numeric_only=True)
# Plotting traffic volume in July by year
plt.figure(figsize=(10,5))
create_line_plot(df=july,
title='Traffic volume in July by year', xlabel='Year')
We can state that the decrease comes only from the year 2016, which we already singled out as the year with the lowest traffic volume. All the other years show quite high traffic volume in July, just as we saw earlier for the other warm months. Our assumption about large-scale road works in 2016 seems more plausible now.
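The same kind of check can be done for all months at once with a month-by-year table, which makes a single anomalous cell easy to spot. The sketch below runs on made-up numbers (the low July 2016 value is planted on purpose) and is only an illustration of the technique, not a result from the dataset:

```python
import pandas as pd

# Made-up averages; the July 2016 value is deliberately low.
demo = pd.DataFrame({
    'year':  [2015, 2015, 2016, 2016, 2017, 2017],
    'month': [6, 7, 6, 7, 6, 7],
    'traffic_volume': [4900, 4950, 4850, 3900, 4950, 5000],
})

# Rows are months, columns are years, cells hold the mean traffic volume.
month_by_year = demo.pivot_table(index='month', columns='year',
                                 values='traffic_volume', aggfunc='mean')
print(month_by_year)

# Year with the lowest July average in this toy table.
lowest_july_year = month_by_year.loc[7].idxmin()
print(lowest_july_year)
```

On the real data, the same call on `day` (after the `year` and `month` columns are added) should give one cell per month and year.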
day['day_of_week'] = day['date_time'].dt.dayofweek
by_dayofweek = day.groupby('day_of_week').mean(numeric_only=True)
# Plotting traffic volume by day of the week
plt.figure(figsize=(10,5))
create_line_plot(df=by_dayofweek,
title='Traffic volume by day of the week', xlabel=None,
tick_min=0, tick_max=6,
xmin=0, xmax=6,
labels=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
We observe a considerable difference between the average traffic volume on business days, usually exceeding 5,000 cars/hr, and that on weekends, especially on Sunday (less than 3,500 cars/hr).
To analyze traffic volume by hour, we should take into account that the weekends would drag down the average values. Hence, it makes sense to look at the averages separately for business days and weekends.
day['hour'] = day['date_time'].dt.hour
business_days = day[day['day_of_week']<=4].copy() # 4 == Friday
weekend = day[day['day_of_week']>=5].copy() # 5 == Saturday
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
dfs = [by_hour_business, by_hour_weekend]
titles = ['Traffic volume by hour:\nBusiness days', 'Traffic volume by hour:\nWeekends']
colors = ['blue', 'red']
# Plotting traffic volume by hour
plt.figure(figsize=(10,5))
for i in range(1,3):
    plt.subplot(1, 2, i)
    create_line_plot(df=dfs[i-1],
                     title=titles[i-1], xlabel='Time',
                     tick_min=7, tick_max=18,
                     xmin=7, xmax=18, ymin=1500, ymax=6250,
                     title_font=25, color=colors[i-1])
plt.tight_layout(pad=2)
For almost all daytime hours, the traffic is heavier on business days than on weekends. For business days, there are 2 clear peaks, 7.00-8.00 and 16.00-17.00, both related to rush hours when people go to work and back. On weekends there are no peaks: the traffic gradually increases from 7.00 till 12.00, stays on a plateau, and starts decreasing from 16.00.
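Peaks like these can also be read off numerically instead of from the plot. A minimal sketch, assuming a series of hourly averages indexed by hour (the values below are made up and only mimic a two-peak shape):

```python
import pandas as pd

# Made-up hourly averages for business days, indexed by hour of day (7..18).
by_hour_demo = pd.Series(
    [6000, 5500, 4900, 4400, 4600, 4800, 4900, 5200, 5600, 6200, 5800, 4400],
    index=range(7, 19), name='traffic_volume')

# The two busiest hours, largest first.
peak_hours = by_hour_demo.nlargest(2)
print(peak_hours)
```

Applied to `by_hour_business['traffic_volume']`, the same call would list the busiest daytime hours on business days.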
All in all, we found the following time indicators of more intense traffic:
In addition, we discovered a sharp traffic volume reduction in 2016, presumably due to road expansion works, followed by the highest peak in 2017.
Another possible indicator of heavy traffic is the weather. We can find information about the weather in the following columns: `temp`, `rain_1h`, `snow_1h`, `clouds_all`, `weather_main`, `weather_description`. The first 4 of them are numerical, so let's try to figure out how they correlate with `traffic_volume`.
round(day.corr(numeric_only=True)['traffic_volume'][['temp', 'rain_1h', 'snow_1h', 'clouds_all']], 3)
temp 0.128 rain_1h 0.004 snow_1h 0.001 clouds_all -0.033 Name: traffic_volume, dtype: float64
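For reference, `corr()` returns Pearson correlation coefficients by default: the covariance of two variables divided by the product of their standard deviations, ranging from -1 to 1. The sketch below spells that formula out in plain Python on toy lists; `pearson_r` is a hypothetical helper written for illustration:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

perfect = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # y grows linearly with x
inverse = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])    # y falls linearly with x
unrelated = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])  # no linear trend
print(perfect, inverse, unrelated)
```

Values close to 0, like the ones above for the weather columns, mean there is almost no linear relationship.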
Temperature shows the strongest correlation with traffic volume, though it is still very weak. Let's plot these two variables against each other:
def create_scatter_plot(df, column, title, xlabel, xmin=None, xmax=None):
    """Draw a scatter plot of traffic volume against the given column."""
    plt.figure(figsize=(10,5))
    plt.scatter(df[column], df['traffic_volume'], color='slateblue', alpha=0.1)
    plt.title(title, fontsize=30)
    plt.xlabel(xlabel, fontsize=20)
    plt.ylabel('Traffic volume, cars/hr', fontsize=20)
    plt.xlim(xmin,xmax)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    sns.despine()
# Plotting traffic volume vs. temperature
create_scatter_plot(df=day, column='temp',
title='Traffic volume vs. temperature',
xlabel='Temperature, K')
There are 2 clearly erroneous temperature values that squeeze the rest of the plot. Let's ignore them by limiting the x-axis to a realistic range:
create_scatter_plot(df=day, column='temp',
title='Traffic volume vs. temperature',
xlabel='Temperature, K', xmin=240, xmax=315)
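Limiting the axis only hides the erroneous readings from the plot; they still take part in the correlation computed earlier. A possible follow-up, sketched here on made-up rows, is to filter them out and recompute the coefficient. The 240-315 K window simply reuses the axis limits above and is an assumption, not a documented validity range:

```python
import pandas as pd

# Made-up readings; 0.0 K stands in for an obviously invalid temperature.
demo = pd.DataFrame({
    'temp': [0.0, 265.3, 280.1, 291.4, 300.2],
    'traffic_volume': [4500, 4300, 4700, 4900, 5100],
})

# Assumed plausibility window in kelvin (same as the x-axis limits above).
TEMP_MIN, TEMP_MAX = 240, 315
plausible = demo[demo['temp'].between(TEMP_MIN, TEMP_MAX)]

r_all = demo['temp'].corr(demo['traffic_volume'])
r_plausible = plausible['temp'].corr(plausible['traffic_volume'])
print(f'All rows: {r_all:.3f}, plausible rows only: {r_plausible:.3f}')
```

In the real dataset only 2 daytime rows are affected, so the coefficient may change very little, but the check rules out the outliers as a cause of the weak correlation.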
Now we can conclude that there is no meaningful correlation between temperature and traffic volume, so temperature isn't a reliable indicator of heavy traffic. The same holds for the other 3 numerical weather columns (`rain_1h`, `snow_1h`, and `clouds_all`), which showed even lower Pearson correlation coefficients. To see if we can find more useful data, we'll look next at the categorical weather columns: `weather_main` and `weather_description`.
We're going to calculate and plot the average traffic volume associated with each weather type, i.e. each unique value in the columns `weather_main` and `weather_description`.
by_weather_main = day.groupby('weather_main').mean(numeric_only=True).sort_values('traffic_volume')
by_weather_description = day.groupby('weather_description').mean(numeric_only=True).sort_values('traffic_volume')

def create_stem_plot(df, fig_height,
                     title='Traffic volume by weather type',
                     ymin=None, ymax=None, vert_line=5000):
    """Draw a horizontal stem plot of average traffic volume per category."""
    plt.figure(figsize=(10,fig_height))
    plt.hlines(y=df.index,
               xmin=0, xmax=df['traffic_volume'],
               color='slateblue')
    plt.plot(df['traffic_volume'], df.index,
             'o', c='slateblue')
    plt.title(title, fontsize=30)
    plt.xlabel('Traffic volume, cars/hr', fontsize=20)
    plt.ylabel(None)
    plt.xlim(0,None)
    plt.ylim(ymin,ymax)
    plt.tick_params(left=False)
    # Thin reference line marking the chosen traffic volume threshold
    plt.axvline(x=vert_line, color='grey', linewidth=0.2)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    sns.despine(left=True)
# Plotting traffic volume by weather type
create_stem_plot(df=by_weather_main, fig_height=6)
There are no weather types where traffic volume exceeds 5,000 cars/hr, so we cannot identify any heavy traffic indicator from the `weather_main` column. Let's plot the results for the `weather_description` column instead:
# Plotting traffic volume by weather type (detailed)
create_stem_plot(df=by_weather_description, fig_height=20,
ymin=-1, ymax=38)
In this case, we can identify the following 3 weather types that led to heavy traffic of more than 5,000 cars/hr:
The results look surprising: evidently, there are many other weather types in the dataset representing much worse weather where traffic is much lighter. One possible explanation here is that really bad weather conditions (thunderstorms, very heavy rain, squalls, etc.) are usually forecast in advance, so people try to do their best not to travel by car on such days.
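One more thing that might help interpret these averages is the number of hourly records behind each weather type, since a mean over a handful of rows can come out high by chance. The sketch below shows how the mean and the row count can be computed side by side; the volumes are made up, and whether the top weather types are actually rare in the dataset is not something established here:

```python
import pandas as pd

# Made-up rows: one weather type appears only once.
demo = pd.DataFrame({
    'weather_description': ['broken clouds', 'broken clouds', 'broken clouds',
                            'proximity thunderstorm',
                            'overcast clouds', 'overcast clouds'],
    'traffic_volume': [4700, 4800, 4900, 5600, 4600, 4800],
})

# Mean traffic volume and number of rows per weather type, sorted by mean.
summary = (demo.groupby('weather_description')['traffic_volume']
               .agg(['mean', 'count'])
               .sort_values('mean'))
print(summary)
```

A category with a high mean but a very small count would deserve less weight as an indicator than one backed by hundreds of hours.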
Earlier we concluded that the nighttime traffic is much lighter than that of daytime, and since our goal was to find indicators of heavy traffic, we focused on the daytime only. Now that we've already figured out the main influencing factors, let's return to the nighttime traffic and check its overall trends with respect to different time and weather indicators discussed above:
# Calculating and plotting nighttime traffic volume by different indicators
#____________________________________________________
# By year
night['year'] = night['date_time'].dt.year
by_year_night = night.groupby('year').mean(numeric_only=True)
plt.figure(figsize=(10,5))
create_line_plot(df=by_year_night,
title='Traffic volume by year (nighttime)', xlabel='Year')
plt.show()
print('\n')
#____________________________________________________
# By month
night['month'] = night['date_time'].dt.month
by_month_night = night.groupby('month').mean(numeric_only=True)
plt.figure(figsize=(10,5))
create_line_plot(df=by_month_night,
title='Traffic volume by month (nighttime)', xlabel=None,
tick_min=1, tick_max=12,
labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
rotation=30,
xmin=1, xmax=12)
plt.show()
print('\n')
#____________________________________________________
# By day of the week
night['day_of_week'] = night['date_time'].dt.dayofweek
by_dayofweek_night = night.groupby('day_of_week').mean(numeric_only=True)
plt.figure(figsize=(10,5))
create_line_plot(df=by_dayofweek_night,
title='Traffic volume by day of the week (nighttime)', xlabel=None,
tick_min=0, tick_max=6,
xmin=0, xmax=6,
labels=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.show()
print('\n')
#____________________________________________________
# By hour
night['hour'] = night['date_time'].dt.hour
business_days_night = night[night['day_of_week']<=4].copy() # 4 == Friday
weekend_night = night[night['day_of_week']>=5].copy() # 5 == Saturday
by_hour_business_night = business_days_night.groupby('hour').mean(numeric_only=True)
by_hour_weekend_night = weekend_night.groupby('hour').mean(numeric_only=True)
dfs_night = [by_hour_business_night, by_hour_weekend_night]
titles_night = ['Traffic volume by hour: Business days (nighttime)', 'Traffic volume by hour: Weekends (nighttime)']
colors = ['blue', 'red']
plt.figure(figsize=(10,10))
for i in range(1,3):
    plt.subplot(2, 1, i)
    create_line_plot(df=dfs_night[i-1],
                     title=titles_night[i-1], xlabel='Time',
                     tick_min=0, tick_max=23,
                     labels=['0','1', '2', '3', '4', '5', '6', '', '', '', '', '', '', '', '', '', '', '', '', '19', '20','21','22','23'],
                     xmin=-0.2, xmax=23.2,
                     ymin=0, ymax=6250,
                     title_font=25, color=colors[i-1], linestyle=' ', marker='o')
    plt.tick_params(bottom=False)
plt.tight_layout(pad=2)
plt.show()
print('\n')
#_______________________________________________________
# By temperature
print(round(night.corr(numeric_only=True)['traffic_volume'][['temp', 'rain_1h', 'snow_1h', 'clouds_all']], 3))
print('\n')
create_scatter_plot(df=night, column='temp',
title='Traffic volume by temperature (nighttime)',
xlabel='Temperature, K', xmin=240, xmax=310)
plt.show()
print('\n')
#_______________________________________________________
# By weather type
by_weather_main_night = night.groupby('weather_main').mean(numeric_only=True).sort_values('traffic_volume')
by_weather_description_night = night.groupby('weather_description').mean(numeric_only=True).sort_values('traffic_volume')
create_stem_plot(df=by_weather_main_night, fig_height=6,
title='Traffic volume by weather type (nighttime)')
plt.show()
print('\n')
create_stem_plot(df=by_weather_description_night, fig_height=20,
title='Traffic volume by weather type \n(nighttime)',
ymin=-0.5, ymax=32.5, vert_line=2500)
temp 0.094 rain_1h -0.013 snow_1h -0.007 clouds_all 0.013 Name: traffic_volume, dtype: float64
Even though the absolute values of traffic volume are much smaller for the nighttime, we can observe some already familiar tendencies:
As in the case of daytime, these weather types are not that bad. They are probably not forecast as far in advance as seriously bad weather conditions, so they don't keep people from traveling by car.
Finally, let's explore traffic intensity on holidays:
print(f"\033[1m{traffic[traffic['holiday']!='None']['holiday'].nunique()} unique holidays in the dataset:\033[0m\n"
f"{list(traffic[traffic['holiday']!='None']['holiday'].unique())}\n\n"
f"\033[1m{len(traffic[traffic['holiday']!='None'])} entries\033[0m for all the holidays\n\n"
f"\033[1mThe trips on holidays happened at the following hours:\033[0m {list(traffic[traffic['holiday']!='None']['date_time'].dt.hour.unique())} ")
11 unique holidays in the dataset:
['Columbus Day', 'Veterans Day', 'Thanksgiving Day', 'Christmas Day', 'New Years Day', 'Washingtons Birthday', 'Memorial Day', 'Independence Day', 'State Fair', 'Labor Day', 'Martin Luther King Jr Day']

61 entries for all the holidays

The trips on holidays happened at the following hours: [0]
A strange thing is that all 61 holiday entries were registered at midnight, Cinderella style! 😃 Besides, 61 entries (out of 48,204) looks suspiciously low. Either it's a technical issue to be fixed, or in the majority of cases the holidays simply were not marked. Let's try to figure it out:
xmas_entries = traffic['holiday'].value_counts()['Christmas Day']
print(f'\033[1mNumber of entries for Christmas Day:\033[0m\t {xmas_entries}')
# Creating a new column with the dates converted to strings
traffic['date'] = traffic['date_time'].dt.strftime('%Y-%m-%d')
xmas_days = len(traffic[traffic['date'].str[-5:]=='12-25'])
print(f'\033[1mReal number of trips on Christmas Day:\033[0m\t {xmas_days}')
Number of entries for Christmas Day: 6
Real number of trips on Christmas Day: 165
We see now that there is indeed an issue with the `holiday` column and that in reality, we have many more entries related to Christmas Day. Presumably, the same can be said about the other holidays. Hence, let's update the `holiday` column. For this purpose, let's ignore the State Fair, which lasts 12 days every year and finishes on Labor Day, and count the real number of entries for each of the other holidays. The majority of them, however, don't have a fixed date, so we'll look up the exact date of each holiday for every year covered by our dataframe:
min_year = traffic['date_time'].dt.strftime('%Y').min()
max_year = traffic['date_time'].dt.strftime('%Y').max()
print(f'\033[1mYear range:\033[0m {min_year}-{max_year}')
Year range: 2012-2018
# Creating a list of dates in the range of 2012-2018 for each holiday
columbus_day = ['2012-10-08', '2013-10-14', '2014-10-13', '2015-10-12',
'2016-10-10', '2017-10-09', '2018-10-08']
veterans_day = ['2012-11-11', '2012-11-12', '2013-11-11', '2014-11-11', '2015-11-11',
'2016-11-11', '2017-11-10', '2017-11-11', '2018-11-11', '2018-11-12']
thanksgiving_day = ['2012-11-22', '2013-11-28', '2014-11-27', '2015-11-26',
'2016-11-24', '2017-11-23', '2018-11-22']
washington_bday = ['2012-02-20', '2013-02-18', '2014-02-17', '2015-02-16',
'2016-02-15', '2017-02-20', '2018-02-19']
memorial_day = ['2012-05-28', '2013-05-27', '2014-05-26', '2015-05-25',
'2016-05-30', '2017-05-29', '2018-05-28']
labor_day = ['2012-09-03', '2013-09-02', '2014-09-01', '2015-09-07',
'2016-09-05', '2017-09-04', '2018-09-03']
martin_luther_king_day = ['2012-01-16', '2013-01-21', '2014-01-20', '2015-01-19',
'2016-01-18', '2017-01-16', '2018-01-15']
holiday_dates = [columbus_day, veterans_day, thanksgiving_day, washington_bday,
memorial_day, labor_day, martin_luther_king_day]
holiday_names = ['Columbus Day', 'Veterans Day', 'Thanksgiving Day', 'Washingtons Birthday',
'Memorial Day', 'Labor Day', 'Martin Luther King Jr Day']
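Most of these dates follow simple calendar rules (for example, Columbus Day is the second Monday of October, Thanksgiving Day the fourth Thursday of November, and Memorial Day the last Monday of May), so they can also be computed rather than looked up. The sketch below uses only the standard library. `nth_weekday` and `last_weekday` are hypothetical helpers, and the `_calc` lists are illustrative names that reproduce three of the hand-written lists above:

```python
import calendar
from datetime import date, timedelta

def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday (0 = Monday) in a month."""
    first = date(year, month, 1)
    # Days from the 1st of the month to the first requested weekday.
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

def last_weekday(year, month, weekday):
    """Date of the last given weekday (0 = Monday) in a month."""
    last = date(year, month, calendar.monthrange(year, month)[1])
    # Days to step back from the end of the month.
    offset = (last.weekday() - weekday) % 7
    return last - timedelta(days=offset)

years = range(2012, 2019)
columbus_day_calc = [nth_weekday(y, 10, calendar.MONDAY, 2).isoformat() for y in years]
thanksgiving_calc = [nth_weekday(y, 11, calendar.THURSDAY, 4).isoformat() for y in years]
memorial_day_calc = [last_weekday(y, 5, calendar.MONDAY).isoformat() for y in years]
print(columbus_day_calc)
```

Veterans Day is different: it has a fixed date (November 11), and the list above also includes the adjacent weekdays on which it was observed, so it is easier to keep by hand.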
# Calculating real numbers of trips on holidays in the dataframe
for i in range(len(traffic)):
    # Holidays without a fixed date: look the date up in the lists above
    for j in range(len(holiday_names)):
        if traffic.loc[i, 'date'] in holiday_dates[j]:
            traffic.loc[i, 'holiday'] = holiday_names[j]
    # Holidays with a fixed date: compare the month-day part of the date
    if traffic.loc[i, 'date'][5:10]=='12-25':
        traffic.loc[i, 'holiday'] = 'Christmas Day'
    if traffic.loc[i, 'date'][5:10]=='01-01':
        traffic.loc[i, 'holiday'] = 'New Years Day'
    if traffic.loc[i, 'date'][5:10]=='07-04':
        traffic.loc[i, 'holiday'] = 'Independence Day'
traffic['holiday'].value_counts()
None 46742
Veterans Day 198
Christmas Day 167
Independence Day 162
Labor Day 157
Martin Luther King Jr Day 142
Washingtons Birthday 136
Thanksgiving Day 135
Memorial Day 134
New Years Day 114
Columbus Day 112
State Fair 5
Name: holiday, dtype: int64
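The row-by-row loop works but visits every entry several times, which is slow on 48,204 rows. A vectorized alternative is sketched below on a tiny made-up dataframe. It assumes the same precedence as the loop (fixed-date holidays overwrite list-based ones, and existing labels are kept otherwise); `date_to_holiday` and `fixed_date_holidays` are hypothetical lookup tables, shortened here for brevity:

```python
import pandas as pd

# Made-up rows: only the midnight entry carries a holiday label at first.
demo = pd.DataFrame({
    'date_time': pd.to_datetime(['2012-10-08 00:00', '2012-10-08 09:00',
                                 '2012-12-25 15:00', '2013-03-05 12:00']),
    'holiday': ['Columbus Day', 'None', 'None', 'None'],
})

# Hypothetical lookup tables (a real one would hold every date from the lists).
date_to_holiday = {'2012-10-08': 'Columbus Day'}
fixed_date_holidays = {'12-25': 'Christmas Day',
                       '01-01': 'New Years Day',
                       '07-04': 'Independence Day'}

dates = demo['date_time'].dt.strftime('%Y-%m-%d')
floating = dates.map(date_to_holiday)           # NaN where the date is not listed
fixed = dates.str[5:].map(fixed_date_holidays)  # match on the MM-DD part

# Fixed-date holidays first, then list-based ones, then the existing label.
demo['holiday'] = fixed.fillna(floating).fillna(demo['holiday'])
print(demo['holiday'].tolist())
```

The full `date_to_holiday` dictionary could be built from `holiday_dates` and `holiday_names` with a nested comprehension.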
Now that the holidays are labeled correctly, we can visualize traffic volume by holiday:
by_holiday = traffic[(traffic['holiday']!='None')&(traffic['holiday']!='State Fair')].groupby('holiday').mean(numeric_only=True).sort_values('traffic_volume')
# Plotting traffic volume by holiday
create_stem_plot(df=by_holiday, fig_height=6, vert_line=3000,
title='Traffic volume by holiday')
Of all the holidays, the heaviest traffic (3,500 cars/hr) is related to Columbus Day. Attitudes to this holiday and the scale of its celebration differ considerably from state to state. In particular, the small part of the I-94 highway that we're analyzing in this project lies in the state of Minnesota, where Columbus Day was reconsidered in 2014 and renamed to Indigenous Peoples' Day to respect the history and culture of Native Americans. Anyway, this day is an official public holiday in Minnesota and it's widely celebrated, even if not for the original reason.
The lightest traffic is observed on Christmas Day and New Years Day, when people, most probably, stay at home with their families or rather walk than drive.
In this project, we analyzed the data on the westbound traffic on the I-94 Interstate highway of the U.S., registered by a station midway between Minneapolis and Saint Paul, with the goal of determining a few indicators of heavy traffic on this part of the road. Below are the most important insights:
None of these weather conditions is particularly bad. They are probably not forecast as far in advance as really bad ones, so they don't keep people from traveling by car.