#!/usr/bin/env python
# coding: utf-8

# # Title
# Finding Heavy Traffic Indicators on I-94
#
# # Project Description
# I'm going to analyze a dataset about the westbound traffic on the I-94 Interstate highway.
#
# My goal is to find out which factors affect heavy traffic on I-94. These factors can be weather type, time of day, day of the week, etc.

# ## Importation
# Here is the section to import all the packages/libraries that will be used throughout this notebook.

# In[2]:


# Data handling
import pandas as pd
import numpy as np
from datetime import datetime

# Visualization (Matplotlib, Plotly, Seaborn, etc.)
import seaborn as sns
import matplotlib.pyplot as plt

# EDA (pandas-profiling, etc.)
...

# Feature Processing (Scikit-learn processing, etc.)
...

# Machine Learning (Scikit-learn Estimators, Catboost, LightGBM, etc.)
...

# Hyperparameters Fine-tuning (Scikit-learn hp search, cross-validation, etc.)
...

# Other packages


# # Data Loading
# Here is the section to load the datasets (train, eval, test) and the additional files.

# In[3]:


traffic = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")


# # Exploratory Data Analysis: EDA
# Here is the section to **inspect** the datasets in depth, **present** them, make **hypotheses** and **think** about the *cleaning, processing and feature creation*.

# ## Dataset overview
#
# Have a look at the loaded datasets using the following methods: `.head(), .info()`

# In[4]:


traffic.shape


# In[5]:


traffic.head(20)


# In[6]:


traffic['holiday'].unique()


# ## Hypothesis
# #### Null Hypothesis, H0
# Time is the factor that affects traffic the most.
#
# #### Alternative Hypothesis, H1
# Time is NOT the factor that affects traffic the most.

# ## Questions
# 1. Which holidays have the most traffic?
# 2. Which weekdays have the most traffic?
# 3. What time of day has the most traffic? (Is it morning, afternoon, evening, midnight or midday?)
#
# MORNING
# This is the time from midnight to midday.
#
# AFTERNOON
# This is the time from midday (noon) to evening.
# From 12:00 hours to approximately 18:00 hours.
#
# EVENING
# This is the time from the end of the afternoon to midnight.
# From approximately 18:00 hours to 00:00 hours.
#
# MIDNIGHT
# This is the middle of the night (00:00 hours).
#
# MIDDAY
# This is the middle of the day, also called "NOON" (12:00 hours).
#
# 4. Which factor affects traffic the most?
# 5. Compare rain, snow and temperature based on traffic_volume
# 6. What's the highest recorded traffic?
#

# In[7]:


traffic.info()


# ## Data Cleaning

# In[8]:


# Split the date_time column into date and time
traffic[['date', 'time']] = traffic['date_time'].str.split(' ', expand=True)
traffic.head(1)


# In[9]:


# Create the weekday column from the date_time column
traffic['weekday'] = pd.to_datetime(traffic['date_time']).dt.day_name()  # alternative: .dt.dayofweek
traffic.drop(columns=['date_time'], inplace=True)
traffic['weekday'].unique()


# In[10]:


def stringToTime(timeString):
    """Parse an 'HH:MM:SS' string into a datetime.time object."""
    return datetime.strptime(timeString, '%H:%M:%S').time()

midnight = stringToTime('00:00:00')
midday = stringToTime('12:00:00')
sixpm = stringToTime('18:00:00')

# Create the time_of_day column from the time column.
# The boundaries follow the definitions given in the Questions section.
traffic['time_of_day'] = traffic['time'].apply(
    lambda x: 'morning' if midnight < stringToTime(x) < midday
    else ('afternoon' if midday < stringToTime(x) < sixpm
    else ('evening' if stringToTime(x) >= sixpm
    else ('midday' if stringToTime(x) == midday
    else ('midnight' if stringToTime(x) == midnight else x))))
)

# dropping time column
# traffic.drop(columns=['time'], inplace=True)

# Convert time to an integer of the form HHMMSS (e.g. '09:00:00' becomes 90000)
traffic['time'] = traffic['time'].apply(lambda x: int(stringToTime(x).strftime("%H%M%S")))
traffic['time_of_day'].unique()


# In[11]:


traffic.head(60)


# ## Analysis
#
# 1. Which holidays have the most traffic?
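# A total per holiday grows with the number of rows recorded for that holiday,
# so it can help to look at the mean and the row count next to the sum.
# The sketch below uses a small hypothetical DataFrame (`toy`) whose column
# names mirror this notebook; the values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    "holiday": ["New Years Day", "New Years Day", "Labor Day", "None", "None"],
    "traffic_volume": [1000, 1400, 1500, 4000, 5000],
})

# Keep only rows that are flagged as a holiday
holiday_rows = toy[toy["holiday"] != "None"]

# Total, mean and number of rows per holiday, ranked by total
summary = (
    holiday_rows.groupby("holiday")["traffic_volume"]
    .agg(total="sum", mean="mean", rows="count")
    .sort_values("total", ascending=False)
)
print(summary)
```

# In this toy example the ranking by total and the ranking by mean differ,
# which is why the row count is worth checking before comparing holidays.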
# In[12]:


top_holidays = traffic.groupby("holiday")["traffic_volume"].sum().reset_index().sort_values(by="traffic_volume", ascending=False)
top_holidays = top_holidays[top_holidays.holiday != 'None']
top_holidays


# In[25]:


fig = plt.figure(figsize=(12, 5))
plt.title("Top Holidays by Traffic_Volume")
sns.barplot(data=top_holidays.head(-7), y="holiday", x="traffic_volume", palette='Blues_d')
plt.show()


# 2. Which weekdays have the most traffic?

# In[14]:


top_weekdays = traffic.groupby('weekday')["traffic_volume"].sum().reset_index().sort_values(by="traffic_volume", ascending=False)
top_weekdays


# In[43]:


fig = plt.figure(figsize=(5, 5))
plt.title("Top Weekdays by Traffic_Volume")
sns.barplot(data=top_weekdays.head(-3), x="weekday", y="traffic_volume")
plt.show()


# 3. What time of day has the most traffic? (Is it morning, afternoon, evening, midnight or midday?)

# In[15]:


top_hours = traffic.groupby('time_of_day')["traffic_volume"].sum().reset_index().sort_values(by="traffic_volume", ascending=False)
top_hours


# 4. Which factor affects traffic the most?

# In[16]:


# Restrict the correlation to numeric columns; recent pandas versions raise an error on text columns otherwise
corr_matrix = traffic.corr(numeric_only=True)
corr_matrix['traffic_volume'].sort_values(ascending=False)


# No variable has a strong linear correlation with traffic_volume.
#
# Although none of the correlations is strong, time has the strongest linear correlation with traffic volume.
#
# This is consistent with our null hypothesis that time affects traffic_volume the most. Note that the correlation coefficient only captures linear relationships.

# 5. Compare rain, snow and temperature based on traffic_volume

# 6. What's the highest recorded traffic?

# In[17]:


top_traffic = traffic.sort_values('traffic_volume', ascending=False)
top_traffic.head(1)


# In[ ]:
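# Question 5 (comparing rain, snow and temperature) could be approached by
# correlating each weather column with traffic_volume. The sketch below is
# only a starting point on hypothetical data: the column names `temp`,
# `rain_1h` and `snow_1h` are assumptions and should be checked against the
# output of `traffic.info()`.

```python
import pandas as pd

# Hypothetical toy data. The weather column names are assumptions;
# the values are made up for illustration only.
toy = pd.DataFrame({
    "temp": [270.0, 275.0, 280.0, 285.0, 290.0, 295.0],
    "rain_1h": [0.0, 0.0, 1.0, 0.0, 2.0, 0.0],
    "snow_1h": [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    "traffic_volume": [1200, 2500, 3100, 4000, 3600, 5200],
})

weather_cols = ["temp", "rain_1h", "snow_1h"]

# Pearson correlation of each weather column with traffic_volume
weather_corr = (
    toy[weather_cols]
    .corrwith(toy["traffic_volume"])
    .sort_values(ascending=False)
)
print(weather_corr)
```

# On the real data, `toy` would be replaced by `traffic`. As with the
# correlation matrix above, this only measures linear association.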