#!/usr/bin/env python
# coding: utf-8

# # Finding Heavy Traffic Indicators on I-94

# In this project, we will analyze a dataset about the westbound traffic on the [I-94 Interstate highway](https://en.wikipedia.org/wiki/Interstate_94). John Hogue made the dataset available, and the dataset can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume).
# 
# The goal of this analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.

# In[1]:


# Importing the pandas library and reading in the dataset
import pandas as pd

traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')

# Examining the first and last rows of the traffic dataset
# (both calls are wrapped in print() so that both outputs are shown)
print(traffic.head())
print(traffic.tail())


# In[2]:


# Getting more detailed information about the traffic dataset
traffic.info()
# The dataset has 48,204 rows and 9 columns, distributed by dtype as follows:
# 3 are float64, 2 are int64 and 4 are object type
# None of the columns have any null values
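
# A minimal sketch of how the dtype and null tallies above could be computed
# in code rather than read off the `info()` output. `toy_frame` and its single
# row are invented for illustration and are not taken from the dataset.
import pandas as pd

toy_frame = pd.DataFrame({
    'temp': [288.28],
    'clouds_all': [40],
    'weather_main': ['Clouds'],
})
dtype_counts = toy_frame.dtypes.value_counts()  # number of columns per dtype
null_counts = toy_frame.isnull().sum()          # number of nulls per column
print(dtype_counts)
print(null_counts)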


# Note that the traffic data used in this project was recorded at a station located approximately midway between Minneapolis and Saint Paul, and that the station only recorded westbound traffic (cars moving from east to west).
# 
# From this we can deduce that the results of our analysis will be about the westbound traffic in the proximity of that station. In other words, we should avoid generalizing our results for the entire I-94 highway.

# # Analysing Traffic Volume Distribution

# In this section, we plot a histogram to visualize the distribution of the `traffic_volume` column. When we use Matplotlib inside Jupyter, we also need to add the `%matplotlib inline` magic that enables Jupyter to generate the graphs.

# In[3]:


# Importing Matplotlib library
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# In[4]:


# Plotting `traffic_volume` histogram using pandas method
traffic['traffic_volume'].plot.hist()
plt.show()


# In[5]:


# Showing the statistics of the `traffic_volume` column
traffic['traffic_volume'].describe()


# Between 2012-10-02 09:00:00 and 2018-09-30 23:00:00, the hourly traffic volume varied from 0 to 7,280 cars, with an average of 3,260 cars.
# 
# About 25% of the time, 1,193 cars or fewer passed the station each hour. At the other end, about 25% of the time the traffic volume was roughly four times as much: 4,933 cars or more.
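# 
# The quartile reading above can be sketched on a small series. The values in `toy_volume` are invented for illustration (they are not from the dataset); the point is only that a quarter of the observations sit at or below the 25th percentile and a quarter sit at or above the 75th.

# In[ ]:


import pandas as pd

# Hypothetical hourly counts, made up for this example
toy_volume = pd.Series([200, 400, 1200, 3000, 3400, 4800, 5000, 6000])

q1 = toy_volume.quantile(0.25)  # 25th percentile
q3 = toy_volume.quantile(0.75)  # 75th percentile

# Share of observations at or below q1, and at or above q3
share_low = (toy_volume <= q1).mean()
share_high = (toy_volume >= q3).mean()
print(q1, q3, share_low, share_high)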

# # Extracting Nighttime and Daytime Traffic Volumes

# Because the time of day is likely to influence traffic volume, we compare daytime data with nighttime data.
# 
# To do this, we start by dividing the dataset into two parts:
# 
# - Daytime data: hours from 7 a.m. to 7 p.m. (12 hours)
# - Nighttime data: hours from 7 p.m. to 7 a.m. (12 hours)
# 
# While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.

# In[6]:


# Transforming the `date_time` column to datetime format
traffic['date_time'] = pd.to_datetime(traffic['date_time'])


# ## Isolating Daytime and Nighttime Data

# In[7]:


# `date_time` was already converted to datetime in the previous cell.
# Filter first, then copy, so that only the selected rows are duplicated.
day = traffic[(traffic['date_time'].dt.hour >= 7) & (traffic['date_time'].dt.hour < 19)].copy()
print(day.shape)

night = traffic[(traffic['date_time'].dt.hour >= 19) | (traffic['date_time'].dt.hour < 7)].copy()
print(night.shape)


# The result above shows that day and night have different row counts (23,877 versus 24,327) even though each covers 12 hours, which could be due to a few hours of missing data. For instance, if you look at rows 176 and 177 (`traffic.iloc[176:178]`), you'll notice there's no data for two hours (4 and 5).
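# 
# One way such gaps could be located is to look at the time elapsed between consecutive records. The sketch below uses four invented timestamps in `toy_times` (not rows from the dataset) with hours 4 and 5 deliberately left out; any step longer than one hour marks the first record after a gap.

# In[ ]:


import pandas as pd

# Hypothetical timestamps: 04:00 and 05:00 are absent on purpose
toy_times = pd.Series(pd.to_datetime([
    '2012-10-10 02:00', '2012-10-10 03:00',
    '2012-10-10 06:00', '2012-10-10 07:00',
]))

# Time since the previous record (NaT for the first row)
step = toy_times.diff()

# Records that follow a gap of more than one hour
gaps = toy_times[step > pd.Timedelta(hours=1)]
print(gaps)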

# # Histogram Plots for Daytime and Nighttime

# Now that we've isolated day and night, we're going to look at the histograms of traffic volume side-by-side by using a grid chart.

# In[8]:


plt.figure(figsize=(12, 4))

# Left panel of a 1x2 grid: daytime histogram
plt.subplot(1, 2, 1)
day['traffic_volume'].plot.hist()
plt.title('Daytime Traffic Volume')
plt.xlim([-100, 7500])
plt.ylim([0, 8000])
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')

# Right panel: nighttime histogram, with the same axis limits for comparison
plt.subplot(1, 2, 2)
night['traffic_volume'].plot.hist()
plt.title('Nighttime Traffic Volume')
plt.xlim([-100, 7500])
plt.ylim([0, 8000])
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')

plt.show()


# In[9]:


# Summary statistics of the `traffic_volume` column for the daytime data
# There are 23,877 daytime records
day['traffic_volume'].describe()


# In[10]:


# Summary statistics of the `traffic_volume` column for the nighttime data
# There are 24,327 nighttime records
night['traffic_volume'].describe()


# The daytime traffic volume histogram is left skewed. This means that most of the traffic volume values are high — there are 4,252 or more cars passing the station each hour 75% of the time (because 25% of values are less than 4,252).
# 
# The nighttime traffic volume histogram is right skewed. This means that most of the traffic volume values are low — 75% of the time, the number of cars that passed the station each hour was less than 2,819.
# 
# Although there are still measurements of over 5,000 cars per hour, the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we'll only focus on the daytime data moving forward.
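# 
# The direction of skew can also be checked numerically rather than by eye. The sketch below uses `Series.skew()` on two invented series (`toy_left` and `toy_right`, not taken from the dataset): a negative value points to a left-skewed distribution, a positive value to a right-skewed one.

# In[ ]:


import pandas as pd

# Hypothetical values, made up to show each skew direction
toy_left = pd.Series([1, 5, 6, 6, 7, 7, 7])   # long tail toward low values
toy_right = pd.Series([1, 1, 1, 2, 2, 3, 7])  # long tail toward high values

left_skew = toy_left.skew()
right_skew = toy_right.skew()
print(left_skew, right_skew)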

# # Time Indicators of Heavy Traffic

# One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain time of the day.
# 
# We're going to look at a few line plots showing how the traffic volume changed according to the following parameters:
# 
# - Month
# - Day of the week
# - Time of day
# 
# The fastest way to get the average traffic volume for each month is by using the `DataFrame.groupby()` method.

# In[11]:


day['month'] = day['date_time'].dt.month


# In[12]:


day.info()


# ## Time Indicator: Month

# In[13]:


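# numeric_only=True restricts the mean to numeric columns;
# recent pandas versions raise an error when asked to average text columns
by_month = day.groupby('month').mean(numeric_only=True)
by_month['traffic_volume']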


# In[14]:


# Generating the line plot for the monthly averages
by_month['traffic_volume'].plot.line()
plt.show()


# The traffic looks less heavy during cold months (November–February) and more intense during warm months (March–October), with one interesting exception: July. There is a sharp decline in the monthly average traffic volume in July. Is there anything special about July? Is traffic significantly less heavy in July each year?
# 
# To answer the last question, let's see how the traffic volume changed each year in July.

# In[15]:


day['year'] = day['date_time'].dt.year
only_july = day[day['month'] == 7]
# Select the column before averaging so that text columns are never averaged
only_july.groupby('year')['traffic_volume'].mean().plot.line()
plt.show()


# Typically, the traffic is pretty heavy in July, similar to the other warm months. The only exception we see is 2016, which had a sharp decrease in traffic volume. One possible reason for this is road construction: [this article from 2016](https://www.crainsdetroit.com/article/20160728/NEWS/160729841/weekend-construction-i-96-us-23-bridge-work-i-94-lane-closures-i-696) supports this hypothesis.
# 
# As a tentative conclusion here, we can say that warm months generally show heavier traffic compared to cold months. In a warm month, you can expect a traffic volume close to 5,000 cars for each daytime hour.
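# 
# To see whether a July dip is confined to a single year, one option is a year-by-month table of averages. The sketch below is built on an invented frame `toy` (the numbers are illustrative, not results from the dataset) and compares each July with the preceding June.

# In[ ]:


import pandas as pd

# Hypothetical monthly averages, made up for this example
toy = pd.DataFrame({
    'year': [2015, 2015, 2016, 2016, 2017, 2017],
    'month': [6, 7, 6, 7, 6, 7],
    'traffic_volume': [4900, 4800, 4950, 3900, 5000, 4850],
})

# Rows are years, columns are months, cells are mean traffic volume
toy_table = toy.pivot_table(index='year', columns='month',
                            values='traffic_volume', aggfunc='mean')

# Change from June to July within each year; the most negative value
# marks the year with the largest drop
july_change = toy_table[7] - toy_table[6]
print(july_change)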

# ## Time Indicator: Day of the Week

# In[16]:


day['dayofweek'] = day['date_time'].dt.dayofweek
# numeric_only=True keeps text columns out of the average
by_dayofweek = day.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume']  # 0 is Monday, 6 is Sunday


# In[17]:


# Generating the line plot for the day-of-week averages
by_dayofweek['traffic_volume'].plot.line()
plt.show()


# The average traffic volume steadily increases from Monday to Friday and declines from Saturday to Sunday, which implies that traffic is significantly heavier on business days than on weekends.

# ## Time Indicator: Time of Day

# Because the weekends will drag down our average values, we'll look at the hourly averages separately. To do that, we'll start by splitting the data based on the day type: business day or weekend.
# 
# Below we split the dataset and plot the graphs.

# In[18]:


day['hour'] = day['date_time'].dt.hour
business_days = day[day['dayofweek'] <= 4].copy() # 4 == Friday
weekend = day[day['dayofweek'] >= 5].copy() # 5 == Saturday
# numeric_only=True keeps text columns out of the average
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)

print(by_hour_business['traffic_volume'])
print(by_hour_weekend['traffic_volume'])


# In[19]:


plt.figure(figsize=(10, 4))

# Left panel: average traffic volume per hour on business days
plt.subplot(1, 2, 1)
by_hour_business['traffic_volume'].plot.line()
plt.title('Business Days Traffic Volume')
plt.xlim([6, 20])
plt.ylim([1500, 6500])
plt.xlabel('Hour of Day')
plt.ylabel('Average Traffic Volume')

# Right panel: the same plot for weekends, with matching axis limits
plt.subplot(1, 2, 2)
by_hour_weekend['traffic_volume'].plot.line()
plt.title('Weekend Traffic Volume')
plt.xlim([6, 20])
plt.ylim([1500, 6500])
plt.xlabel('Hour of Day')
plt.ylabel('Average Traffic Volume')

plt.show()


# At each hour of the day, the traffic volume is generally higher on business days than on weekends. As expected, the rush hours are around 7 and 16, when most people travel from home to work and back. We see volumes of over 6,000 cars at rush hours.

# In our search for time indicators of heavy traffic, we found the following:
# 
# - Traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
# - The traffic is usually heavier on business days compared to weekends.
# - On business days, the rush hours are around 7 and 16. We see volumes of over 6,000 cars at rush hours.
# 
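# The rush hours can also be read off an hourly series directly instead of from the plot. The sketch below assumes a series shaped like `by_hour_business['traffic_volume']` (hour as index, average volume as values); `toy_by_hour` and its numbers are invented for illustration.

# In[ ]:


import pandas as pd

# Hypothetical hourly averages for hours 7 to 18, made up for this example
toy_by_hour = pd.Series(
    [6000, 5500, 4900, 4600, 4700, 4800, 4900, 5200, 5600, 6200, 5800, 4400],
    index=range(7, 19),
    name='traffic_volume',
)

peak_hour = toy_by_hour.idxmax()    # hour with the highest average volume
busiest = toy_by_hour.nlargest(2)   # the two busiest hours, highest first
print(peak_hour)
print(busiest)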

# # Weather Indicators

# Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: `temp`, `rain_1h`, `snow_1h`, `clouds_all`, `weather_main`, and `weather_description`.
# 
# A few of these columns are numerical so let's start by looking up their correlation values with traffic_volume.

# In[20]:


day.info()


# In[21]:


# Correlation between traffic volume and the numeric columns
# (numeric_only=True skips text columns, which cannot be correlated)
day.corr(numeric_only=True)['traffic_volume']


# Temperature shows the strongest correlation, with a value of just +0.13. The other relevant columns (`rain_1h`, `snow_1h`, `clouds_all`) don't show any strong correlation with `traffic_volume`.
# 
# Let's generate a scatter plot to visualize the correlation between temp and traffic_volume.

# In[22]:


# Scatter plot of traffic volume against temperature
day.plot.scatter('traffic_volume', 'temp')
plt.ylim(230, 320) # two erroneous 0 K temperature readings would otherwise stretch the y-axis
plt.show()


# From this we can see that temperature doesn't look like a solid indicator of heavy traffic. We need to look at other weather-related columns: weather_main and weather_description.
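# 
# Instead of only clipping the axis, the erroneous 0 K readings could be excluded before measuring the correlation. This is a sketch on an invented frame `toy_weather` (values are illustrative, not from the dataset); the `temp > 0` filter is an assumption about what counts as a valid reading.

# In[ ]:


import pandas as pd

# Hypothetical readings: the first row mimics a faulty 0 K temperature
toy_weather = pd.DataFrame({
    'temp': [0.0, 270.0, 280.0, 290.0, 300.0],
    'traffic_volume': [4000, 4400, 4600, 4800, 5000],
})

# Keep only rows with a physically plausible temperature
valid = toy_weather[toy_weather['temp'] > 0]

# Pearson correlation with and without the faulty reading
corr_all = toy_weather['temp'].corr(toy_weather['traffic_volume'])
corr_valid = valid['temp'].corr(valid['traffic_volume'])
print(corr_all, corr_valid)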

# # Weather Types

# In[23]:


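# numeric_only=True keeps the remaining text columns out of the average
by_weather_main = day.groupby('weather_main').mean(numeric_only=True)
by_weather_description = day.groupby('weather_description').mean(numeric_only=True)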


# In[24]:


by_weather_main['traffic_volume'].plot.bar()
plt.show()


# The average traffic volume stays below 5,000 for every weather type, so there is no clear indicator of heavy traffic here.

# In[25]:


by_weather_description['traffic_volume'].plot.barh(figsize=(10,20))
#plt.figure(figsize=(10,20)) or Series.plot.barh(figsize=(width,height))
plt.show()


# There are three weather types whose average traffic volume exceeds 5,000:
# - Shower snow
# - Light rain and snow
# - Proximity thunderstorm with drizzle
# 
# This suggests that these conditions could be indicators of heavy traffic.
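# 
# A group average is only as reliable as the number of rows behind it, so it may be worth looking at the row count next to each mean. The sketch below uses an invented frame `toy_desc` (not data from the dataset); whether the three categories above are rare in the real data has not been checked here.

# In[ ]:


import pandas as pd

# Hypothetical rows: one category is common, the other appears once
toy_desc = pd.DataFrame({
    'weather_description': ['sky is clear'] * 4 + ['shower snow'],
    'traffic_volume': [4700, 4800, 4600, 4900, 5600],
})

# Mean traffic volume and number of rows for each description
toy_summary = (toy_desc.groupby('weather_description')['traffic_volume']
               .agg(['mean', 'count']))
print(toy_summary)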

# # Conclusion

# In this project the main aim was to find indicators of heavy traffic on the I-94 Interstate highway. After our analysis of the data, we found two types of indicators:
# 
# - Time indicators
#     - The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
#     - The traffic is usually heavier on business days compared to the weekends.
#     - On business days, the rush hours are around 7 and 16.
# 
# 
# - Weather indicators
#     - Shower snow
#     - Light rain and snow
#     - Proximity thunderstorm with drizzle