#!/usr/bin/env python
# coding: utf-8

# # Guided Project: Finding Heavy Traffic Indicators on I-94
#
# We are going to analyze a dataset about the westbound traffic on the [I-94 Interstate highway](https://en.wikipedia.org/wiki/Interstate_94). John Hogue made the dataset available, and it can be downloaded from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php).
#
# The goal of our analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.

# In[1]:


import pandas as pd

# Load the data
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')


# In[2]:


# Display the first 5 rows
traffic.head()


# In[3]:


# Display the last 5 rows
traffic.tail()


# In[4]:


traffic.info()


# The dataset has 9 columns and 48,204 rows. No column contains null values, and the data types are a mixture of float, integer, and object. The `date_time` column shows that the records run from 2012-10-02 09:00:00 to 2018-09-30 23:00:00.
#
# The dataset documentation mentions that a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records westbound traffic (cars moving from east to west). This means that the results of our analysis will be about the westbound traffic in the proximity of that station. In other words, we should avoid generalizing our results for the entire I-94 highway.

# ### Analyzing Traffic Volume

# In[5]:


# Plot a histogram to examine the distribution of the traffic_volume column, using the pandas plotting method.
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

traffic['traffic_volume'].plot.hist()
plt.title('Frequency of Traffic Volume')
plt.xlabel('Traffic Volume')
plt.show()


# In[6]:


# Summary statistics for the traffic_volume column
traffic['traffic_volume'].describe()


# From the summary statistics of the `traffic_volume` column, we see that the hourly traffic volume varied from 0 to 7,280 cars, with an average volume of 3,259 cars. About 25% of the time, there were 1,193 cars or fewer passing the station each hour. This probably occurs during the night, or when a road is under construction. About 25% of the time, the traffic volume was four times as much (4,933 cars or more).
#
# The possibility that the time of day (night versus day) influences traffic volume steers our analysis in an interesting direction: comparing daytime and nighttime data.
#
# We'll start by dividing the dataset into two parts:
#
# - `Daytime data`: hours from 7 a.m. to 7 p.m. (12 hours).
# - `Nighttime data`: hours from 7 p.m. to 7 a.m. (12 hours).
#
# While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.

# In[7]:


# Convert the date_time column to a datetime data type
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
traffic['date_time']


# In[8]:


# Isolate the daytime rows (7:00 up to, but not including, 19:00).
# Filtering first and copying afterwards gives an independent DataFrame,
# so adding columns to it later does not trigger a SettingWithCopyWarning.
day_time = traffic[(traffic['date_time'].dt.hour >= 7) & (traffic['date_time'].dt.hour < 19)].copy()
day_time.shape


# In[9]:


# Isolate the nighttime rows (19:00 up to, but not including, 7:00)
night_time = traffic[(traffic['date_time'].dt.hour >= 19) | (traffic['date_time'].dt.hour < 7)].copy()
night_time.shape


# Now we're going to compare the traffic volume at night and during the day.
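# As a quick sanity check of the day/night split, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical and not part of the dataset. It assumes one reading per hour and uses the same 7-to-19 boundary, so each half should receive exactly 12 hours.

```python
import pandas as pd

# Hypothetical toy data: one reading per hour for a single day.
toy = pd.DataFrame({
    'date_time': pd.to_datetime([f'2016-07-01 {h:02d}:00:00' for h in range(24)]),
    'traffic_volume': range(24),
})

hour = toy['date_time'].dt.hour

# Daytime is 7:00 up to, but not including, 19:00; everything else is night.
is_day = (hour >= 7) & (hour < 19)
toy_day = toy[is_day].copy()
toy_night = toy[~is_day].copy()

print(len(toy_day), len(toy_night))
```

# Negating a single mask, rather than writing two separate conditions, is one way to make sure that no row lands in both halves or in neither.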
# In[10]:


# Plot the daytime and nighttime histograms side by side with matplotlib
plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.hist(day_time['traffic_volume'])
plt.xlim([0,7500])
plt.ylim([0,8000])
plt.title('Daytime Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(night_time['traffic_volume'])
plt.xlim([0,7500])
plt.ylim([0,8000])
plt.title('Nighttime Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')

plt.show()


# In[11]:


day_time['traffic_volume'].describe()


# In[12]:


night_time['traffic_volume'].describe()


# The daytime histogram is left-skewed: most of the values pile up on the right side of the histogram, and the median is higher than the mean. The nighttime histogram is right-skewed: most of the values pile up on the left side of the histogram, and the mean is higher than the median.
#
# Although there are still measurements of over 5,000 cars per hour, the traffic at night is generally light compared to the daytime. Our goal is to find indicators of heavy traffic, so we'll only focus on the daytime data moving forward.

# In[13]:


day_time['month'] = day_time['date_time'].dt.month
# numeric_only=True skips the text columns, which cannot be averaged
by_month = day_time.groupby('month').mean(numeric_only=True)
by_month['traffic_volume']


# In[14]:


# Plot a line graph of the monthly traffic volume averages
by_month['traffic_volume'].plot.line()
plt.title('Monthly Traffic Volume Averages')
plt.xlabel('Month')
plt.ylabel('Traffic Volume Averages')
plt.xticks(range(1,13))
plt.show()


# The line graph shows that the average traffic volume is higher from March to June and from August to October, which are the warmer months, and lower in January, February, November, and December. July is an exception: it has a low traffic volume average, which is unusual for a warm month. Let's see how the traffic volume in July changed each year.
# In[15]:


# Add a year column and plot the average July traffic volume for each year
day_time['year'] = day_time['date_time'].dt.year
only_july = day_time[day_time['month'] == 7]
# numeric_only=True skips the text columns, which cannot be averaged
only_july = only_july.groupby('year').mean(numeric_only=True)
only_july['traffic_volume'].plot.line()
plt.show()


# Typically, the traffic is pretty heavy in July, similar to the other warm months. The only exception we see is 2016, which had a sharp decrease in traffic volume. One possible reason for this is road construction, and this article from 2016 supports this hypothesis.
#
# As a tentative conclusion here, we can say that warm months generally show heavier traffic compared to cold months. In a warm month, you can expect a traffic volume close to 5,000 cars for each daytime hour.

# ### Time Indicators

# Next, we look at how the traffic volume changes with the day of the week.

# In[16]:


# Add a day-of-week column (0 == Monday, 6 == Sunday) and average by it
day_time['dayofweek'] = day_time['date_time'].dt.dayofweek
by_dayofweek = day_time.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume']


# In[17]:


# Plot a line graph of the traffic volume averages for each day of the week
by_dayofweek['traffic_volume'].plot.line()
plt.title('Daily Traffic Volume Averages')
plt.xlabel('Day of week (0 = Monday, 6 = Sunday)')
plt.ylabel('Traffic Volume Averages')
plt.show()


# The traffic volume is significantly heavier on business days (Monday through Friday) than on weekends. Except for Monday, the business-day averages all exceed 5,000 cars. Weekend traffic is lighter, with averages below 4,000 cars.
#
# We'll now generate a line plot for the time of day. The weekends, however, will drag down the average values, so we're going to look at the averages separately. To do that, we'll start by splitting the data based on the day type: business day or weekend.
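# For reference, `dt.dayofweek` numbers the days from 0 (Monday) to 6 (Sunday), which is what makes a `<= 4` cutoff select business days. The following is a minimal sketch on made-up dates. The `toy` frame is hypothetical and only illustrates the numbering and the business day versus weekend split.

```python
import pandas as pd

# Hypothetical toy data: seven consecutive days, starting on Monday 2016-07-04.
toy = pd.DataFrame({
    'date_time': pd.to_datetime([f'2016-07-{d:02d}' for d in range(4, 11)]),
})

toy['dayofweek'] = toy['date_time'].dt.dayofweek    # 0 == Monday, 6 == Sunday
toy['day_name'] = toy['date_time'].dt.day_name()    # readable label for checking

toy_business = toy[toy['dayofweek'] <= 4]   # Monday through Friday
toy_weekend = toy[toy['dayofweek'] > 4]     # Saturday and Sunday

print(toy[['dayofweek', 'day_name']])
```

# Printing the day names next to the numbers is a cheap way to confirm which side of the cutoff each day falls on.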
# In[18]:


day_time['hour'] = day_time['date_time'].dt.hour
business_days = day_time[day_time['dayofweek'] <= 4].copy()  # 4 == Friday
weekend = day_time[day_time['dayofweek'] > 4].copy()  # 5 == Saturday, 6 == Sunday

# Average traffic volume per hour on business days
by_hour_businessdays = business_days.groupby('hour').mean(numeric_only=True)

# Average traffic volume per hour on weekends
by_hour_weekends = weekend.groupby('hour').mean(numeric_only=True)


# In[19]:


# Plot two line plots showing how the average traffic volume changes by time of day
plt.figure(figsize=(11,4))

plt.subplot(1, 2, 1)
by_hour_businessdays['traffic_volume'].plot.line()
plt.xlim(5,20)
plt.ylim(1500,6300)
plt.title('Hourly Businessday Traffic Volume')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')

plt.subplot(1, 2, 2)
by_hour_weekends['traffic_volume'].plot.line()
plt.xlim(5,20)
plt.ylim(1500,6300)
plt.title('Hourly Weekend Traffic Volume')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')

plt.show()


# At each hour of the day, the traffic volume is generally higher during business days compared to the weekends. As expected, the rush hours are around 7 and 16, when most people travel from home to work and back. We see volumes of over 6,000 cars at rush hours.
#
# To summarize, we found a few time-related indicators of heavy traffic:
#
# - The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
# - The traffic is usually heavier on business days compared to weekends.
# - On business days, the rush hours are around 7 and 16.

# ### Weather indicators

# In[20]:


# Find the correlation between traffic_volume and the numeric columns.
# Selecting the numeric columns first avoids errors from the text columns.
day_time.select_dtypes(include='number').corr()['traffic_volume']


# Among the weather columns, temperature shows the strongest correlation, with a value of just +0.13. The other relevant columns (`rain_1h`, `snow_1h`, `clouds_all`) don't show any strong correlation with `traffic_volume`.
#
# Let's generate a scatter plot to visualize the correlation between `temp` and `traffic_volume`.
# In[21]:


# Plot a scatter plot of traffic volume against temperature
day_time.plot.scatter('traffic_volume', 'temp')
plt.show()


# We can conclude that temperature doesn't look like a solid indicator of heavy traffic.

# ### Weather types

# In[22]:


# Average traffic volume for each main weather category
by_weather_main = day_time.groupby('weather_main').mean(numeric_only=True)
by_weather_main['traffic_volume'].plot.barh()
plt.show()


# In[23]:


# Average traffic volume for each detailed weather description
by_weather_description = day_time.groupby('weather_description').mean(numeric_only=True)

# Plot a horizontal bar plot
by_weather_description['traffic_volume'].plot.barh(figsize=(6,12))
plt.xlabel('Average traffic volume')
plt.ylabel('Weather description')
plt.show()


# Three weather types have an average traffic volume above 5,000: shower snow, light rain and snow, and proximity thunderstorm with drizzle. It's unclear why these weather types have the highest average traffic values. These are bad weather conditions, but not especially severe ones. When the weather is bad, perhaps more people take their cars out of the garage instead of riding a bicycle or walking.

# In this project, we attempted to identify a few indicators of heavy traffic on the I-94 Interstate highway. We found two types of indicators: time indicators and weather indicators.
#
# Time indicators
# - The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
# - The traffic is usually heavier on business days compared to the weekends.
# - On business days, the rush hours are around 7 and 16.
#
# Weather indicators
# - Shower snow
# - Light rain and snow
# - Proximity thunderstorm with drizzle
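# The weather averages do not show how many hourly readings stand behind each weather type. One possible follow-up, which is not part of the analysis above, is to put the row count next to each mean before trusting the ranking. The sketch below uses made-up data, and both the `toy` frame and the `min_rows` threshold are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one frequent weather type and one rare one.
toy = pd.DataFrame({
    'weather_description': ['sky is clear'] * 4 + ['shower snow'],
    'traffic_volume': [4000, 4500, 5000, 4500, 5600],
})

# Mean traffic volume and number of readings for each weather description.
summary = (
    toy.groupby('weather_description')['traffic_volume']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)

# Keep only the descriptions backed by at least min_rows readings.
min_rows = 3  # hypothetical threshold, chosen only for this illustration
well_supported = summary[summary['count'] >= min_rows]

print(summary)
print(well_supported)
```

# In this toy example the rare type tops the ranking on the strength of a single reading, which is the kind of pattern a count column makes visible.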