#!/usr/bin/env python
# coding: utf-8

# # Accidents and Traffic in Atlanta
#
# ### What is my goal for this project?
# As someone studying machine learning and data science, I didn't really have a good portfolio of projects to show. I had some experience working on this type of material in graduate school, and I have completed several online machine learning and data science courses, but I didn't have many projects to show for it.
#
# My initial idea for this project was to browse the data sets on Kaggle and pick one that seemed interesting. I would then look at the data, train a classifier or regressor to make some predictions, and put together some written thoughts on it. I didn't really have a plan when I picked the dataset; I just wanted to explore it and see what came to me.
#
# ### What dataset did I choose?
# After spending some time browsing through various data sets, I ended up choosing [US Accidents (4.2 million records): A Countrywide Traffic Accident Dataset (2016 - 2020)](https://www.kaggle.com/sobhanmoosavi/us-accidents). There is an accompanying paper for this data set located at [arxiv.org](https://arxiv.org/abs/1906.05409). Using this data set, I narrowed the scope of the data down to the Atlanta, Georgia area, where I recently moved.
#
# ### About the Data:
#
# According to the author, the data was collected in real time using multiple traffic APIs. The majority of the data, approximately 63% and 36%, came from MapQuest and Bing respectively. The data was collected from February 2016 until December 2020. The author of the data set made some comments on the validity of the data [here](https://www.kaggle.com/sobhanmoosavi/us-accidents/discussion/159189). Overall, the author believes that this data set is a subset of all accidents in the United States. The author also discussed a change in collection techniques for MapQuest after August 2017 [here](https://www.kaggle.com/sobhanmoosavi/us-accidents/discussion/126883).
# # Given the above information, while the data may not be complete, it should be enough to give an interesting view into the traffic around the Atlanta area.
#
# The following table is a description of the column values as per the author's [website](https://smoosavi.org/datasets/us_accidents).
#
# | Attribute | Description |
# | :--- | :--- |
# | ID | This is a unique identifier of the accident record. |
# | Source | Indicates the source of the accident report (i.e. the API which reported the accident). |
# | TMC | A traffic accident may have a Traffic Message Channel (TMC) code which provides a more detailed description of the event. |
# | Severity | Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). |
# | Start_Time | Shows start time of the accident in local time zone. |
# | End_Time | Shows end time of the accident in local time zone. End time here refers to when the impact of the accident on traffic flow was dismissed. |
# | Start_Lat | Shows latitude in GPS coordinate of the start point. |
# | Start_Lng | Shows longitude in GPS coordinate of the start point. |
# | End_Lat | Shows latitude in GPS coordinate of the end point. |
# | End_Lng | Shows longitude in GPS coordinate of the end point. |
# | Distance(mi) | The length of the road extent affected by the accident. |
# | Description | Shows a natural language description of the accident. |
# | Number | Shows the street number in address field. |
# | Street | Shows the street name in address field. |
# | Side | Shows the relative side of the street (Right/Left) in address field. |
# | City | Shows the city in address field. |
# | County | Shows the county in address field. |
# | State | Shows the state in address field. |
# | Zipcode | Shows the zipcode in address field. |
# | Country | Shows the country in address field. |
# | Timezone | Shows timezone based on the location of the accident (eastern, central, etc.). |
# | Airport_Code | Denotes the airport-based weather station which is closest to the location of the accident. |
# | Weather_Timestamp | Shows the time-stamp of the weather observation record (in local time). |
# | Temperature(F) | Shows the temperature (in Fahrenheit). |
# | Wind_Chill(F) | Shows the wind chill (in Fahrenheit). |
# | Humidity(%) | Shows the humidity (in percentage). |
# | Pressure(in) | Shows the air pressure (in inches). |
# | Visibility(mi) | Shows visibility (in miles). |
# | Wind_Direction | Shows wind direction. |
# | Wind_Speed(mph) | Shows wind speed (in miles per hour). |
# | Precipitation(in) | Shows precipitation amount in inches, if there is any. |
# | Weather_Condition | Shows the weather condition (rain, snow, thunderstorm, fog, etc.). |
# | Amenity | A POI annotation which indicates presence of an amenity in a nearby location. |
# | Bump | A POI annotation which indicates presence of a speed bump or hump in a nearby location. |
# | Crossing | A POI annotation which indicates presence of a crossing in a nearby location. |
# | Give_Way | A POI annotation which indicates presence of a give_way in a nearby location. |
# | Junction | A POI annotation which indicates presence of a junction in a nearby location. |
# | No_Exit | A POI annotation which indicates presence of a no_exit in a nearby location. |
# | Railway | A POI annotation which indicates presence of a railway in a nearby location. |
# | Roundabout | A POI annotation which indicates presence of a roundabout in a nearby location. |
# | Station | A POI annotation which indicates presence of a station in a nearby location. |
# | Stop | A POI annotation which indicates presence of a stop in a nearby location. |
# | Traffic_Calming | A POI annotation which indicates presence of traffic_calming in a nearby location. |
# | Traffic_Signal | A POI annotation which indicates presence of a traffic_signal in a nearby location. |
# | Turning_Loop | A POI annotation which indicates presence of a turning_loop in a nearby location. |
# | Sunrise_Sunset | Shows the period of day (i.e. day or night) based on sunrise/sunset. |
# | Civil_Twilight | Shows the period of day (i.e. day or night) based on civil twilight. |
# | Nautical_Twilight | Shows the period of day (i.e. day or night) based on nautical twilight. |
# | Astronomical_Twilight | Shows the period of day (i.e. day or night) based on astronomical twilight. |
#
# ### What is the project?
# For this project, I will be looking at plots of traffic accidents and the impact on traffic when there is an auto accident. The impacts on traffic in this data set are rated by severity from 1 to 4, with 1 meaning the accident caused minimal traffic delays and 4 meaning an accident had an extreme effect on traffic delays.
#
# The data set doesn't define a clear timetable for the severity of the delays. However, the author posted some approximate estimates [here](https://www.kaggle.com/sobhanmoosavi/us-accidents/discussion/152370). Delays are estimated as follows.
#
# | Severity | Time |
# | :--- | :--- |
# | 1 | 2m 30s |
# | 2 | 3m 15s |
# | 3 | 8m |
# | 4 | 18m |
#
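# For convenience in later analysis, the delay table above can be captured in code. This is just a sketch: the dictionary and helper names are mine, and the delay figures are the author's rough estimates from the discussion post, not measurements.

```python
# Approximate expected delay (in seconds) for each Severity code,
# taken from the dataset author's rough estimates.
SEVERITY_DELAY_SECONDS = {
    1: 150,   # ~2m 30s
    2: 195,   # ~3m 15s
    3: 480,   # ~8m
    4: 1080,  # ~18m
}

def estimated_delay_minutes(severity):
    """Return the rough expected traffic delay in minutes for a severity code."""
    return SEVERITY_DELAY_SECONDS[severity] / 60
```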
# # To examine the data, I will plot several different types of traffic maps. Next, I will divide the data into train and test sets in order to train a severity classifier for accidents that will predict the traffic delay given an accident's location and relevant information.
#
# To look at the data and train the classifier, I will use the years 2017 through 2019. I will discard 2016 since data from the first part of the year is missing. I will also not consider the year 2020 in the first part of the project since various COVID measures had a serious impact on traffic in Atlanta. In the last part of the project, I will compare the 2020 data to the previous years to see how the data shows COVID affected traffic delays and accidents.

# In[2]:

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
import json

with open('credentials.json') as f:
    json_data = json.load(f)
mapbox_key = json_data['mapbox_key']

random_state = 0

import plotly.io as pio
pio.renderers.default = "notebook"

# # Cleaning and processing the data
#
# The first step is to read in the data and take a look at it. The data contains 49 columns of information and 4,232,541 rows of accident entries. The columns contain varying information on location, weather, dates, times, traffic impact, source, and other information.
# The exact explanation of each column can be found [here](https://smoosavi.org/datasets/us_accidents).

# In[3]:

df_all = pd.read_csv('US_Accidents_Dec20.csv')
df_all

# ### Narrowing the data to Atlanta
# In order to narrow the data to the Atlanta, Georgia area, I started by retrieving the latitude and longitude coordinates for Atlanta via Google. The listed coordinates for Atlanta are 33.7490° N, 84.3880° W. To select the area around Atlanta, I opened a map and selected what I thought was a good representation of the city and outer suburbs, roughly a 35-mile-wide square. The first step is to retrieve the rows in the data set that correspond to this area.

# In[4]:

atlanta_lat = 33.7490
atlanta_lng = -84.3880
radius = 0.3

df = df_all[(df_all['Start_Lat'] <= atlanta_lat + radius) &
            (df_all['Start_Lat'] >= atlanta_lat - radius) &
            (df_all['Start_Lng'] >= atlanta_lng - radius) &
            (df_all['Start_Lng'] <= atlanta_lng + radius)]
df.head()

# ### Removing non-accident data
# The column [TMC (Traffic Message Channel)](https://wiki.openstreetmap.org/wiki/TMC/Event_Code_List) uses a variety of codes to describe traffic incidents. After referencing all the codes listed in the Atlanta data, there are 140 entries for code 406, which is described as 'entry slip road closed'. This didn't sound like an accident, so I decided to investigate further. Fortunately, the accidents have a text description. Looking through these descriptions, several incidents didn't seem to be accidents at all. For example, several roads were closed due to fallen trees, protests, and various other activities. After looking through them, most of the entries that reference an accident have the text 'accident' in the description. So, I will filter out any rows that don't have that description.
# In[5]:

to_remove = ~df[df['TMC'] == 406]['Description'].str.contains('accident', na=False)
df = df.drop(df[df['TMC'] == 406][to_remove].index)

# ### Removing columns that won't be used
# The data set contains certain columns that won't be used for mapping and predictions.
#
# In the paper, the original traffic accident data only contained GPS data. The author used a tool that converted that GPS data into the data for number, street, side, city, county, state, country, and zip code. Given that the original data didn't contain this information, I decided to stick with the original latitude and longitude coordinates to do mapping and prediction.
#
# For the columns End_Lat and End_Lng, many values were missing, so I decided to use the starting coordinates. Distance(mi) is also a function of the start and end points, so I removed it as well.
#
# The columns ID and Source were irrelevant, so I removed those.
#
# The column TMC was used for cleaning in a previous step, so it is no longer needed. Also, TMC codes may not be available at the time of an accident to predict the delay.
#
# The column End_Time will be removed since the Severity column will be used as a target. Knowing the end time of an accident isn't something that would be known beforehand in prediction.
#
# The column Description will be removed because it is a text description of the accident, and it also wouldn't be known beforehand.
#
# The column Timezone is irrelevant since the entire area is in the same timezone.
#
# The column Airport_Code will be removed since it gives no localized information on the accident.
#
# The column Weather_Timestamp will be removed since the time of the last weather update can't provide additional information. After some research, I found several websites that have historical weather APIs, but they all charge a fee, which I'm avoiding for this project. So, checking a timestamp to see if more accurate weather data is available can't be accomplished.
# # This leaves the following columns:

# In[6]:

df = df.drop(['ID', 'Source', 'TMC', 'End_Time', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description',
              'Number', 'Street', 'Side', 'City', 'State', 'Zipcode', 'Country', 'Timezone',
              'Airport_Code', 'Weather_Timestamp', 'County'], axis=1)
df.columns

# ### Converting the timestamp to columns
# Since I'm going to plot several charts by time, I'm going to convert the timestamp into its own separate columns for simplicity. Columns will be made for Year, Month, Day, Hour, and DayOfWeek. The hours will be rounded to the nearest hour. DayOfWeek will be in the format of 0 to 6, where 0 is Monday.

# In[7]:

# convert to datetime format
df['Start_Time'] = pd.to_datetime(df['Start_Time'], infer_datetime_format=True)
# round to nearest hour
df['Start_Time'] = df['Start_Time'].dt.round("H")
# create day of week column
df.insert(2, 'DayOfWeek', '')
df['DayOfWeek'] = df['Start_Time'].dt.weekday
# create an hour of the day column
df.insert(2, 'Hour', '')
df['Hour'] = df['Start_Time'].dt.hour
# create day of month column
df.insert(2, 'Day', '')
df['Day'] = df['Start_Time'].dt.day
# create month column
df.insert(2, 'Month', '')
df['Month'] = df['Start_Time'].dt.month
# create year column
df.insert(2, 'Year', '')
df['Year'] = df['Start_Time'].dt.year

df = df.drop(['Start_Time'], axis=1)
display(df.head())

# ### Checking for missing data
# Before the data can be used, any missing data must be dealt with. The first step is to identify which values are missing and how many.

# In[8]:

def check_na():
    return pd.merge(left=df.isna().sum()[df.isna().sum() != 0].rename('null counts'),
                    right=df.isnull().mean()[df.isnull().mean() != 0].rename('percentage').round(4)*100,
                    left_index=True, right_index=True).sort_values(ascending=False, by=['percentage'])

check_na()

# There are nine columns with missing values. For Wind_Chill(F) and Precipitation(in), over half of the values are missing.
# Since I don't have access to historical weather data in order to fill those in, those two columns will be dropped entirely. Wind_Chill(F) is a computation of temperature and wind speed, so that information should still be encoded in the data. While Precipitation(in) would probably be a good indicator, there is just too much information to fill in without having access to an outside data source.
#
# For the remaining missing values, it is possible to fill in some data from surrounding values. For a missing value, we can check another entry on that same date in the same area and use that value. For example, if Weather_Condition is missing from an entry with lat/lng x1 y1 on date Jan 1 2000 at 8am, and we have another entry with lat/lng x2 y2 on Jan 1 2000 at 8am, it's reasonable that those would have the same Weather_Condition or other weather values.

# In[9]:

df = df.drop(['Wind_Chill(F)', 'Precipitation(in)'], axis=1)

def get_missing_row_values(row_id):
    # attempts to fill in the missing values from other incidents that happened on the same day
    row = df.loc[row_id].copy()
    # find all entries that happen on that day
    columns = ['Year', 'Month', 'Day']
    values = [row.Year, row.Month, row.Day]
    matches = df[(df[columns] == values).all(1)].copy()
    # drop duplicate row that was passed in
    matches = matches.drop(row_id)
    matches.insert(6, 'LL_Dist', '')
    # compute distance between the coordinates
    matches['LL_Dist'] = (np.sqrt((row.Start_Lat - matches.Start_Lat)**2 + (row.Start_Lng - matches.Start_Lng)**2))
    matches.insert(6, 'Time_Diff', '')
    # compute time between the accidents
    matches['Time_Diff'] = abs(row.Hour - matches.Hour)
    # sort matches by time differential then distance
    matches = matches.sort_values(['Time_Diff', 'LL_Dist'], ascending=[True, True])
    # to make things simple, we will take the match with the closest hour and least distance
    # this may lead to some small inaccuracy, but the missing values are few, so it should be ok
    # since the match dataframe is sorted by hour then distance, the first matching value should be used
    # possible missing values are:
    # 'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)', 'Weather_Condition'
    def fill_row(col_name):
        nonlocal row
        nonlocal matches
        if pd.isna(row[col_name]):
            first_valid = matches[col_name].first_valid_index()
            if first_valid is not None:
                row[col_name] = matches.loc[first_valid][col_name]
    fill_row('Temperature(F)')
    fill_row('Humidity(%)')
    fill_row('Pressure(in)')
    fill_row('Visibility(mi)')
    fill_row('Wind_Direction')
    fill_row('Wind_Speed(mph)')
    fill_row('Weather_Condition')
    return row

df_nan = df[df.isna().any(axis=1)].copy()
for i, row in df_nan.iterrows():
    filled_row = get_missing_row_values(i)
    df.loc[i] = filled_row

check_na()

# After filling in missing weather values from other entries on the same dates, most of the values have been filled in. There are still a small number missing. Next, I will take a look at the values that are missing and other values from those same dates.

# In[10]:

df_nan = df[df.isna().any(axis=1)].copy()
display(df_nan.groupby(['Year', 'Month', 'Day']).size())

# The missing data is spread out over five separate days. Next, let's take a look at other data from those days to see why no values were filled in.
# Note: I will use head() to display part of the data for brevity.

# In[11]:

display(df.groupby(['Year', 'Month', 'Day']).get_group((2016, 7, 17)))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2016, 10, 29)))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2018, 6, 29)).head(5))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2018, 6, 30)).head(5))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2020, 11, 8)).head(5))

# For the dates June 29 2018, June 30 2018, and November 8 2020, all weather data is missing. Since I have no access to historical weather data and there are such a small number of values missing, I will just drop those rows.
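# The same-day fill strategy described above can be seen in miniature on a toy frame (all values here are invented for illustration): sort candidate rows by hour difference, break ties by lat/lng distance, and borrow the first non-missing value.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is missing its temperature; rows 1 and 2 are other
# accidents on the same day.
toy = pd.DataFrame({
    'Hour':           [8,      8,      9],
    'Start_Lat':      [33.70,  33.71,  33.90],
    'Start_Lng':      [-84.40, -84.41, -84.20],
    'Temperature(F)': [np.nan, 61.0,   55.0],
})

row = toy.loc[0]
matches = toy.drop(0).copy()
# rank candidates by closeness in time, then in space
matches['Time_Diff'] = (row['Hour'] - matches['Hour']).abs()
matches['LL_Dist'] = np.sqrt((row['Start_Lat'] - matches['Start_Lat'])**2 +
                             (row['Start_Lng'] - matches['Start_Lng'])**2)
matches = matches.sort_values(['Time_Diff', 'LL_Dist'])

# first_valid_index() gives the label of the first non-NaN entry in the
# sorted candidates; use it to fill the gap
fill_label = matches['Temperature(F)'].first_valid_index()
toy.loc[0, 'Temperature(F)'] = matches.loc[fill_label, 'Temperature(F)']
```

Here the same-hour neighbor (61.0°F) wins over the later, more distant reading.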
# For the dates July 17 2016 and October 29 2016, only the Wind_Speed(mph) is missing. The Wind_Direction on all of those days happens to be calm, so I will take the mean wind speed of all calm entries to fill in these missing values. First, I will look at the mean values for each category.

# In[12]:

df.groupby('Wind_Direction').mean()['Wind_Speed(mph)']

# As can be seen above, the Wind_Direction categories have overlapping values such as CALM and Calm, E and East, etc. Next, I will combine these groups with each other and fill in the missing Wind_Speed(mph) mean values.

# In[13]:

df.replace({'CALM' : 'Calm', 'East' : 'E', 'North' : 'N', 'South' : 'S', 'VAR' : 'Variable', 'West' : 'W'}, inplace=True)
ws_mean = df.groupby('Wind_Direction').mean()['Wind_Speed(mph)'].loc['Calm'].round(1)

ws_index = df.groupby(['Year', 'Month', 'Day']).get_group((2016, 7, 17)).index
df.loc[ws_index, 'Wind_Speed(mph)'] = ws_mean
display(df.loc[ws_index])

ws_index = df.groupby(['Year', 'Month', 'Day']).get_group((2016, 10, 29)).index
df.loc[ws_index, 'Wind_Speed(mph)'] = ws_mean
display(df.loc[ws_index])

# As noted above, all weather data is missing for June 29 2018, June 30 2018, and November 8 2020, and I have no access to historical weather data. The only remaining NaN values fall on these dates, so I will just drop all NaN rows.

# In[14]:

df = df.drop(df_nan.index)

# ### Converting binary values
# To use the data set in learning algorithms, the data values must be converted to numerical form. There are several columns of True and False values, and several columns of Day and Night values. These values will be converted to 1 and 0.
# In[15]:

tf_columns = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout',
              'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
df[tf_columns] = df[tf_columns].replace({False : 0, True : 1})
df.replace({'Day' : 1, 'Night' : 0}, inplace=True)

# ### Converting categorical values
# Now categorical values need to be converted to numerical values. In the Wind_Direction column, there are values such as N, NE, NNE, etc. that need to be converted. The Weather_Condition column will be skipped for now. I plan to take a look at weather values in one of the plots, and values such as Rainy, Windy, Clear, etc. will be useful.

# In[16]:

df['Wind_Direction'] = df['Wind_Direction'].astype('category').cat.codes

# ### Removing columns with single data values
#
# Some of the columns have no variance. For example, a column that contains all 0s or all 1s isn't really useful to a model, so those columns should be dropped.

# In[17]:

drop_no_variance = []
for col in df.columns:
    if len(df[col].value_counts()) <= 1:
        drop_no_variance.append(col)
        display(df[col].value_counts())
df = df.drop(drop_no_variance, axis=1)

# ### Display and Save the data
# The data has been cleaned and is ready to be plotted and fed into a learning algorithm. The output below confirms there are no missing values and shows the newly cleaned data set.

# In[18]:

df = df.reset_index(drop=True)
display(df.isna().sum())
display(df)
df.to_csv('clean_atlanta_accidents.csv', index=False)

# # Exploring the data
#
# In this section, I'll take a deeper look at the data. Several plots will be made to get an idea of the traffic accident patterns.
#
# I will be looking at data from the years 2017, 2018, and 2019. Since 2016 is an incomplete year, it won't be considered. I also won't be considering the year 2020 since COVID-19 was reported to have a significant effect on Atlanta traffic.
# However, in the last section of this project, I will take a look at the data and make some comparisons between 2020 and the other three years.
#
# So, the first step is to remove and separate the years of data.

# In[19]:

df_covid = df[df['Year'] == 2020].copy().reset_index(drop=True)
df.drop(df.loc[df['Year'] == 2016].index, inplace=True)
df.drop(df.loc[df['Year'] == 2020].index, inplace=True)
df = df.reset_index(drop=True)
display(df.head())
display(df_covid.head())

# ### Looking at traffic density
# Next, we'll take a look at the accident density for 2017, 2018, and 2019. Each year will be plotted separately using a heatmap to plot the accident locations. This should give us an idea of any hot spots in the area. Having lived in Atlanta for several years and recently moved back, I would expect the hotspots to be on the interstates around the city. The 285 perimeter should be fairly heavy. I20, I75, and I85 should also be very heavy, with an even larger hotspot in the downtown connector area.

# In[20]:

def plot_yearly_heatmap():
    zmin = 0.9
    zmax = 5.1
    radius = 2
    mapbox_style = 'light'
    df_2017 = df[df['Year'] == 2017]
    df_2018 = df[df['Year'] == 2018]
    df_2019 = df[df['Year'] == 2019]
    fig = make_subplots(rows=1, cols=3,
                        specs=[[dict(type='mapbox'), dict(type='mapbox'), dict(type='mapbox')]],
                        subplot_titles=('2017 Traffic Accidents<br>({} Accidents)'.format(df_2017.shape[0]),
                                        '2018 Traffic Accidents<br>({} Accidents)'.format(df_2018.shape[0]),
                                        '2019 Traffic Accidents<br>({} Accidents)'.format(df_2019.shape[0])),
                        vertical_spacing=0.05, horizontal_spacing=0.01)
    fig.add_trace(go.Densitymapbox(lat=df_2017['Start_Lat'], lon=df_2017['Start_Lng'], z=[1] * df_2017.shape[0], radius=radius, colorscale='Turbo', colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=1)
    fig.add_trace(go.Densitymapbox(lat=df_2018['Start_Lat'], lon=df_2018['Start_Lng'], z=[1] * df_2018.shape[0], radius=radius, colorscale='Turbo', colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=2)
    fig.add_trace(go.Densitymapbox(lat=df_2019['Start_Lat'], lon=df_2019['Start_Lng'], z=[1] * df_2019.shape[0], radius=radius, colorscale='Turbo', colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=3)
    fig.update_layout(width=900, height=450, showlegend=False,
                      mapbox=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style),
                      mapbox2=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style),
                      mapbox3=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style))
    #fig.write_html('yearly_heatmap.html')
    fig.show()

plot_yearly_heatmap()

# As expected, we can clearly see the outlines of I20, I75, and I85. The downtown connector area is also quite heavy. Zooming in on the maps, you can also see smaller clusters of accidents scattered around the map, but they don't show up quite as well since they aren't as dense.
#
# ### Looking at weather over 2017-2019
# Next, I will plot traffic accidents for the combined three years with several weather conditions taken into account. The weather conditions are listed as follows.

# In[21]:

df['Weather_Condition'].value_counts()

# Most of these conditions have too few entries to map meaningfully. However, most of these conditions overlap in their type of weather.
# For example, you have T-Storm, Heavy T-Storm, Thunderstorms and Rain, Heavy Thunderstorms and Rain, and more, which could all be combined into a general Thunderstorm category. For this plot, I will divide the data into seven categories: Clear, Rain, Thunderstorm, Windy, Winter, Cloudy, and Fog. These might not be as granular as they could be, but it is a good starting place to have a look at the weather effects.
#
# I would expect to see larger traffic delays in worse weather.
# In[27]: def plot_density_weather_map(df_data): weather_types = {} weather_types['Clear'] = ['Clear', 'Fair'] weather_types['Rain'] = ['Light Rain', 'Rain', 'Light Drizzle', 'Heavy Rain', 'Mist', 'Light Rain Showers', 'Rain / Windy', 'Light Rain Shower', 'Rain Showers', 'Drizzle'] weather_types['Thunderstorms'] = ['Light Thunderstorms and Rain', 'Thunderstorm', 'Heavy Thunderstorms and Rain', 'Thunder in the Vicinity', 'Thunderstorms and Rain', 'Heavy T-Storm', 'T-Storm', 'Light Rain with Thunder', 'Thunder', 'Light Rain / Windy', 'Heavy Rain / Windy', 'T-Storm / Windy', 'Squalls', 'Heavy T-Storm / Windy', 'Thunder / Windy'] weather_types['Windy'] = ['Mostly Cloudy / Windy', 'Fair / Windy', 'Cloudy / Windy', 'Partly Cloudy / Windy'] weather_types['Winter'] = ['Light Snow', 'Light Ice Pellets', 'Snow', 'Light Freezing Rain', 'Wintry Mix'] weather_types['Cloudy'] = ['Mostly Cloudy', 'Overcast', 'Partly Cloudy', 'Scattered Clouds', 'Cloudy'] weather_types['Fog'] = ['Fog', 'Haze', 'Drizzle and Fog', 'Patches of Fog', 'Smoke'] df_clear = df_data[df_data['Weather_Condition'].isin(weather_types['Clear'])] df_rain = df_data[df_data['Weather_Condition'].isin(weather_types['Rain'])] df_tstorm = df_data[df_data['Weather_Condition'].isin(weather_types['Thunderstorms'])] df_windy = df_data[df_data['Weather_Condition'].isin(weather_types['Windy'])] df_winter = df_data[df_data['Weather_Condition'].isin(weather_types['Winter'])] df_cloudy = 
df_data[df_data['Weather_Condition'].isin(weather_types['Cloudy'])] df_fog = df_data[df_data['Weather_Condition'].isin(weather_types['Fog'])] fig = go.Figure() zmin = 0.9 zmax = 5.1 colorscale = ['blue', 'yellow', 'orange', 'red'] fig.add_densitymapbox(lat=df_clear['Start_Lat'], lon=df_clear['Start_Lng'], z=[1] * df_clear.shape[0], radius=2, colorscale='Turbo', visible=True, zmin=zmin, zmax=zmax) fig.add_densitymapbox(lat=df_rain['Start_Lat'], lon=df_rain['Start_Lng'], z=[1] * df_rain.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax) fig.add_densitymapbox(lat=df_tstorm['Start_Lat'], lon=df_tstorm['Start_Lng'], z=[1] * df_tstorm.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax) fig.add_densitymapbox(lat=df_windy['Start_Lat'], lon=df_windy['Start_Lng'], z=[1] * df_windy.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax) fig.add_densitymapbox(lat=df_winter['Start_Lat'], lon=df_winter['Start_Lng'], z=[1] * df_winter.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax) fig.add_densitymapbox(lat=df_cloudy['Start_Lat'], lon=df_cloudy['Start_Lng'], z=[1] * df_cloudy.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax) fig.add_densitymapbox(lat=df_fog['Start_Lat'], lon=df_fog['Start_Lng'], z=[1] * df_fog.shape[0], radius=2, colorscale='Turbo', visible=False, zmin=zmin, zmax=zmax) label_string = 'Number of Accidents: {}' annotation_clear = dict(text=label_string.format(df_clear.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper') annotation_rain = dict(text=label_string.format(df_rain.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper') annotation_tstorm = dict(text=label_string.format(df_tstorm.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper') annotation_windy = dict(text=label_string.format(df_windy.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper') annotation_winter = dict(text=label_string.format(df_winter.shape[0]), 
showarrow=False, x=0.7, y=1.06, yref='paper') annotation_cloudy = dict(text=label_string.format(df_cloudy.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper') annotation_fog = dict(text=label_string.format(df_fog.shape[0]), showarrow=False, x=0.7, y=1.06, yref='paper') annotation_weather_label = dict(text='Weather Condition', showarrow=False, x=-0.3, y=1.06, yref='paper') button_clear = dict(method='update', args=[dict(visible=[True, False, False, False, False, False, False]), dict(annotations=[annotation_weather_label, annotation_clear])], label='Clear') button_rain = dict(method='update', args=[dict(visible=[False, True, False, False, False, False, False]), dict(annotations=[annotation_weather_label, annotation_rain])], label='Rain') button_tstorm = dict(method='update', args=[dict(visible=[False, False, True, False, False, False, False]), dict(annotations=[annotation_weather_label, annotation_tstorm])], label='Thunderstorm') button_windy = dict(method='update', args=[dict(visible=[False, False, False, True, False, False, False]), dict(annotations=[annotation_weather_label, annotation_windy])], label='Windy') button_winter = dict(method='update', args=[dict(visible=[False, False, False, False, True, False, False]), dict(annotations=[annotation_weather_label, annotation_winter])], label='Winter') button_cloudy = dict(method='update', args=[dict(visible=[False, False, False, False, False, True, False]), dict(annotations=[annotation_weather_label, annotation_cloudy])], label='Cloudy') button_fog = dict(method='update', args=[dict(visible=[False, False, False, False, False, False, True]), dict(annotations=[annotation_weather_label, annotation_fog])], label='Fog') fig.update_layout(width=600, height=600, mapbox=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8.5, style='light'), title=dict(text='Atlanta Accident Map', x=0.5, xanchor='center', xref='paper'), updatemenus=[dict(buttons=[button_clear, button_rain, button_tstorm, 
button_windy, button_winter, button_cloudy, button_fog], type='buttons')], annotations=[annotation_weather_label, annotation_clear], margin=dict(l=100)) #fig.write_html('weather_heatmap.html') fig.show() plot_density_weather_map(df)
# Above, we can see the accidents under different weather conditions by selecting a button on the left side. The results aren't quite what I anticipated, but they make sense. The data seems to be imbalanced towards Clear and Cloudy. Atlanta being in the South where it is usually warm, I expected a small count of Winter accidents. Also, fog, while not rare, isn't too common, so I expected it to be low. The Windy category was also expected to be low, since most windy days were grouped into other categories, such as Heavy Rain / Windy and Heavy T-Storm / Windy being grouped into Thunderstorms. However, I expected Rain and Thunderstorms to be much higher. I looked briefly into how MapQuest and Bing categorize weather data, but I didn't turn up any info. We could reason that the counts are much lower because thunderstorms and rain are brief events. For example, when it rains on a day, it may only rain for a couple of hours, so there is only a small window for accidents to occur in. It entirely depends on how the weather is classified. Does Rain mean rain was currently falling from the sky? Does it mean that it rained that day? There are many questions we would need to answer to quantify this data, but I don't have that information available. So, with no further information, I will continue to the next section.
#
# ### Looking at daily and hourly accidents
#
# One of the defining features of big-city traffic is the daily commute. People flood the roadways during 'rush hour' to go to and from work. News stations report on it constantly throughout the day. Rush hour usually runs from early to mid morning and then again from late afternoon until evening.
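# Before plotting, the rush-hour pattern can be sanity-checked numerically with a simple per-hour count. A minimal sketch on invented data (the real notebook would group the cleaned df's Hour column the same way):

```python
import pandas as pd

# Toy accident hours: a morning cluster, a midday lull, and an evening peak.
toy = pd.DataFrame({'Hour': [8, 8, 8, 12, 17, 17, 17, 17, 23]})

# Count accidents per hour; commute hours should dominate.
hourly_counts = toy.groupby('Hour').size()
peak_hour = hourly_counts.idxmax()
```

In a commute-driven city, the busiest hours should land in the morning and evening rush windows, which is what the animation below visualizes spatially.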
To get a look at this concept, I'll plot the years 2017-2019 hourly traffic data for a week. The plot will be animated and cycle through the 24 hour 7 day week of each year. # In[28]: #animate all yearly and covid def plot_animate_delays_by_year(): df_2017 = df[df['Year'] == 2017] df_2018 = df[df['Year'] == 2018] df_2019 = df[df['Year'] == 2019] mapbox_style = 'light' zmin = 0 zmax = 2 buttons_list = [dict(type="buttons", buttons=[dict(label="Play", method="animate", args=[None,dict(frame=dict(duration=500,redraw=True),fromcurrent=True)]), dict(label="Pause", method="animate", args=[[None],dict(frame=dict(duration=0,redraw=True),mode="immediate")])], direction="left", pad={"r": 10, "t": 35}, showactive=False, x=0.27, xanchor="right", y=0, yanchor="top")] sliders_list = [dict(active=0, visible=True, yanchor="top", xanchor="left", currentvalue=dict(font=dict(size=20, color='#000000'), prefix="", visible=True, xanchor="center"), pad=dict(b=10, t=10), len=0.8, x=0.3, y=0, tickcolor='white', font={'color' : 'white'}, steps=[])] fig = make_subplots(rows=1, cols=3, specs=[[dict(type='mapbox'), dict(type='mapbox'), dict(type='mapbox')]], vertical_spacing=0.05, horizontal_spacing=0.01) fig_frames = [] label_string = 'Number of Accidents: {}' title_annotations = [dict(text='2017 Accidents', showarrow=False, x=0.1, y=1.13, yref='paper'), dict(text='2018 Accidents', showarrow=False, x=0.5, y=1.13, yref='paper'), dict(text='2019 Accidents', showarrow=False, x=0.9, y=1.13, yref='paper')] for d in range(0, 7): for h in range(0, 24): annotation_list = title_annotations.copy() df_2017_temp = df_2017[(df_2017['DayOfWeek'] == d) & (df_2017['Hour'] == h)] df_2018_temp = df_2018[(df_2018['DayOfWeek'] == d) & (df_2018['Hour'] == h)] df_2019_temp = df_2019[(df_2019['DayOfWeek'] == d) & (df_2019['Hour'] == h)] annotation_list.append(dict(text=label_string.format(df_2017_temp.shape[0]), showarrow=False, x=0.075, y=1.07, yref='paper')) 
annotation_list.append(dict(text=label_string.format(df_2018_temp.shape[0]), showarrow=False, x=0.5, y=1.07, yref='paper')) annotation_list.append(dict(text=label_string.format(df_2019_temp.shape[0]), showarrow=False, x=0.94, y=1.07, yref='paper')) if d == 0 and h == 0: fig.add_trace(go.Densitymapbox(lat=df_2017_temp['Start_Lat'], lon=df_2017_temp['Start_Lng'], z=[1] * df_2017_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, colorscale='Turbo', colorbar=dict(tickmode='linear')), row=1, col=1) fig.add_trace(go.Densitymapbox(lat=df_2018_temp['Start_Lat'], lon=df_2018_temp['Start_Lng'], z=[1] * df_2018_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, colorscale='Turbo', colorbar=dict(tickmode='linear')), row=1, col=2) fig.add_trace(go.Densitymapbox(lat=df_2019_temp['Start_Lat'], lon=df_2019_temp['Start_Lng'], z=[1] * df_2019_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, colorscale='Turbo', colorbar=dict(tickmode='linear')), row=1, col=3) starting_annotations = annotation_list.copy() day = d + 1 hour = '0{}'.format(h) if h <= 9 else '{}'.format(h) timestamp = pd.Timestamp('2021-02-0{}T{}:00:00'.format(day, hour)) day_text = timestamp.strftime('%A - %I:%M %p') f1 = go.Densitymapbox(lat=df_2017_temp['Start_Lat'], lon=df_2017_temp['Start_Lng'], z=[1] * df_2017_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, colorscale='Turbo', colorbar=dict(tickmode='linear')) f2 = go.Densitymapbox(lat=df_2018_temp['Start_Lat'], lon=df_2018_temp['Start_Lng'], z=[1] * df_2018_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, colorscale='Turbo', colorbar=dict(tickmode='linear')) f3 = go.Densitymapbox(lat=df_2019_temp['Start_Lat'], lon=df_2019_temp['Start_Lng'], z=[1] * df_2019_temp.shape[0], radius=2, zmin=zmin, zmax=zmax, colorscale='Turbo', colorbar=dict(tickmode='linear')) layout = go.Layout(annotations=annotation_list) frame = go.Frame(data=[f1, f2, f3], traces=[0, 1, 2], name=day_text, layout=layout) fig_frames.append(frame) slider_time_step = dict(args=[[day_text], dict(mode='immediate', 
frame=dict(duration=300, redraw=True))], method='animate', label=day_text) sliders_list[0]['steps'].append(slider_time_step) fig_layout = go.Layout(width=900, height=450, showlegend=False, mapbox=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style), mapbox2=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style), mapbox3=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style), updatemenus=buttons_list, sliders=sliders_list, annotations=starting_annotations) fig.frames = fig_frames fig.update_layout(fig_layout) #fig.write_html('yearly_animation_heatmap.html') fig.show() plot_animate_delays_by_year() # Looking at the above plots gives an interesting look into how accidents occur over the week, but it is hard to get an accurate view of the information. You can see accidents pick up in the morning rush hour time, but there seems to be a steady flow of accidents throughout the rest of the daylight hours. Below, we will look at a more detailed plot. 
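# As an aside, the per-day, per-hour zero-filled counts that the next cell builds with explicit loops can also be produced in one step with a pandas reindex over the full day-by-hour grid. A minimal sketch on synthetic data (the column names mirror this notebook's; the random records are purely illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic accident records: DayOfWeek in 0-6, Hour in 0-23 (sparse on purpose)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({'DayOfWeek': rng.integers(0, 7, 500),
                        'Hour': rng.integers(0, 12, 500)})  # no accidents after 11 AM

# Count accidents per (day, hour), then reindex over the full grid so
# missing combinations become explicit zeros instead of absent rows.
full_grid = pd.MultiIndex.from_product([range(7), range(24)],
                                       names=['DayOfWeek', 'Hour'])
counts = (df_demo.groupby(['DayOfWeek', 'Hour']).size()
                 .reindex(full_grid, fill_value=0))

assert len(counts) == 7 * 24     # every day/hour pair is present
assert counts.loc[(0, 23)] == 0  # empty slots are zero, not missing
```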
# In[29]: def plot_line_delays_by_year(): df_2017 = df[df['Year'] == 2017] df_2018 = df[df['Year'] == 2018] df_2019 = df[df['Year'] == 2019] df_2017_counts = df_2017.groupby(['DayOfWeek', 'Hour']).count() df_2018_counts = df_2018.groupby(['DayOfWeek', 'Hour']).count() df_2019_counts = df_2019.groupby(['DayOfWeek', 'Hour']).count() hours = [pd.Timestamp(2021, 2 , 1, x).strftime('%I %p') for x in range(0, 24)] y_2017_data_list = [] y_2018_data_list = [] y_2019_data_list = [] #some days/hour are empty, 0 counts need to be filled in for d in range(0, 7): data_list_2017 = [] data_list_2018 = [] data_list_2019 = [] for h in range(0, 24): if (d, h) not in df_2017_counts.index: data_list_2017.append(0) else: data_list_2017.append(df_2017_counts.loc[(d, h)][0]) if (d, h) not in df_2018_counts.index: data_list_2018.append(0) else: data_list_2018.append(df_2018_counts.loc[(d, h)][0]) if (d, h) not in df_2019_counts.index: data_list_2019.append(0) else: data_list_2019.append(df_2019_counts.loc[(d, h)][0]) y_2017_data_list.append(data_list_2017) y_2018_data_list.append(data_list_2018) y_2019_data_list.append(data_list_2019) fig = make_subplots(rows=5, cols=2, subplot_titles=['Monday', 'Saturday', 'Tuesday', 'Sunday', 'Wednesday', '', 'Thursday', '', 'Friday'], vertical_spacing=0.125, horizontal_spacing=0.1) line_styles = [dict(color='blue'), dict(color='green'), dict(color='purple')] mark_styles = [dict(color='blue'), dict(color='green'), dict(color='purple')] fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[0], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5), row=1, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[0], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5), row=1, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[0], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5), row=1, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[1], name='2017', line=line_styles[0], 
marker=mark_styles[0], opacity=0.5, showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[1], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[1], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[2], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=3, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[2], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=3, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[2], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=3, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[3], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=4, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[3], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=4, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[3], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=4, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[4], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=5, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[4], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=5, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[4], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=5, col=1) fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[5], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=1, col=2) 
fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[5], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=1, col=2)
    fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[5], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=1, col=2)
    fig.add_trace(go.Scatter(x=hours, y=y_2017_data_list[6], name='2017', line=line_styles[0], marker=mark_styles[0], opacity=0.5, showlegend=False), row=2, col=2)
    fig.add_trace(go.Scatter(x=hours, y=y_2018_data_list[6], name='2018', line=line_styles[1], marker=mark_styles[1], opacity=0.5, showlegend=False), row=2, col=2)
    fig.add_trace(go.Scatter(x=hours, y=y_2019_data_list[6], name='2019', line=line_styles[2], marker=mark_styles[2], opacity=0.5, showlegend=False), row=2, col=2)
    fig.update_layout(width=800, height=800)
    for r in range(1, 6):
        for c in range(1, 3):
            fig.update_yaxes(row=r, col=c, range=[0, 400])
    #fig.write_html('weekday_hourly_plots.html')
    fig.show()

plot_line_delays_by_year()

# Given the plots above, it is easier to see the weekly rush hour effect. During the Monday to Friday work week, there is a sharp spike around 7-8 AM and another small hump in the evening around 5-6 PM. The number of accidents during the morning rush hour is significantly higher than during the evening rush hour. The weekend also takes on an entirely different shape: accidents occur in a hump centered around the afternoon, and there are many fewer of them.
#
# # Predicting traffic severity from an accident
#
# In the previous sections, I looked at the locations of traffic accidents around Atlanta. These accidents also came with a rating of the severity of the traffic delays they caused. In the following section, I will train a classifier to predict the delay an accident might cause. I will split the data into train and test sets and see what kind of accuracy I can get when predicting the severity of an accident.
# I will use several types of classifiers to see which gives the best accuracy.
#
# ### Weather Encoding
#
# Before splitting the data into train and test sets, there is one thing I must fix. In the previous sections, I used the weather data to make a plot. For the weather data to be used by the following classifiers, it must be converted to a numerical format.
#
# Initially, I used one-hot encoding to convert the data to a numerical format. I did this because it seemed difficult to apply ordinality to the weather conditions. That is, if the weather were ordinal, you could organize the weather conditions from least to most severe. While this may be true to some extent, it's not clear cut. Yes, thunderstorms are worse than a clear day, but is fog worse than light or heavy rain? Is wind worse than light rain? Is heavy wind worse than light rain? There were many weather types that would be hard to rank, and many that seemed to overlap. Thus, one-hot encoding seemed the best option.
#
# After comparing categorical values to one-hot encoding, I noticed no real increase in accuracy with one-hot encoding; in fact, most models were worse. This is most likely because many of the weather conditions were sparse, as seen in a previous section where many weather conditions had only a single-digit number of data points. The weather conditions in this dataset are poorly categorized, with many overlapping types, as seen in the section where I plotted the weather types against accidents. Also, using one-hot encoding caused a substantial increase in run time for certain models.
#
# However, combining the weather data as I did for the previous weather plot, or in some different grouping, may help results. I may come back later to attempt that, but for now, I will use categorical values. In this specific project, one-hot encoding of the original weather conditions didn't increase the accuracy.
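# As a side-by-side illustration of the two encodings discussed above, here is a minimal sketch on a toy weather column (the values are illustrative, not the dataset's full list):

```python
import pandas as pd

weather = pd.Series(['Clear', 'Rain', 'Fog', 'Clear', 'Thunderstorm'])

# Categorical codes: a single integer column, which implies an
# (arbitrary, alphabetical) ordering of the conditions.
codes = weather.astype('category').cat.codes

# One-hot: one 0/1 column per condition, with no implied ordering, but
# the feature count grows with the number of distinct weather strings.
one_hot = pd.get_dummies(weather)

assert codes.nunique() == 4     # four distinct conditions -> four codes
assert one_hot.shape == (5, 4)  # five rows, one column per condition
```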
# In[25]: df['Weather_Condition'] = df['Weather_Condition'].astype('category').cat.codes # ### Splitting the train/test data # # The first step is to randomly divide the data into train and test splits. I will use an 80/20 split of the data. # In[26]: features = ['Hour', 'DayOfWeek', 'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)', 'Weather_Condition', 'Amenity', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'Start_Lat', 'Start_Lng'] X_train, X_test, y_train, y_test = train_test_split(df[features], df['Severity'], test_size=0.2, random_state=random_state) # ### Dummy Classifier # # The first classifier I will implement is a dummy classifier. This classifier will simply predict the target with the most frequent count that it has seen with the training data. It is a good measurement to compare other classifiers to. # In[26]: clf_dummy = DummyClassifier(strategy='most_frequent', random_state=random_state) clf_dummy.fit(X_train, y_train) clf_dummy_score = clf_dummy.score(X_test, y_test) display(y_train.value_counts()) display('Dummy Classifier Score: {}'.format(clf_dummy_score)) # As can be seen in the value_counts of the training set above, traffic delays of severity 3 are the most common target. The dummy classifier always predicts a severity of 3. Using this technique, it achieves an accuracy rating of approximately 60.7%. # # ### Feature Selection and Models # # For this project, I was looking to read up on and get experience with several different classifiers in the scikit-learn library. For the following models, I will use multiple types of feature selection, and I will use the same types for each different classifier in order to see a comparison to how well they work. # # For the first model, I will use all the features. 
The models will be trained using default settings.
#
# For the second model, I will use the features I feel would give the most information on an accident. The models will be trained on default settings. The features used will be:
#
# | Feature | Why |
# | --- | --- |
# | Latitude | Location of an accident is important. |
# | Longitude | Location of an accident is important. |
# | Traffic Signal | Many accidents happen at stop lights. Blocked intersections can cause extreme delays. |
# | Crossing | Accidents at road crossings without stop lights can back up the roads, causing delays. |
# | Stop | Accidents at stop signs could cause traffic to back up, causing extra delays. |
# | Hour | Time of day can affect delays. Example: delays can be more or less severe depending on whether they happen during rush hour. |
# | Day Of The Week | Previous graphs showed fewer accidents happened on the weekend, so this could help predict delays. |
# | Weather Condition | Weather could be important in predicting delays. Example: rain slows down traffic, and an accident could increase the slowdown. |
#
# There are several features I will choose to skip. For example, hour of the day should capture the same information as the twilight and sunset features, so I will leave those out. Wind speed, humidity, visibility, and pressure should be somewhat captured by the weather condition category. Other features didn't seem like they would add much information.
#
# For the third model, I will use the top seven features from the SelectKBest algorithm in the scikit-learn library. The model will be trained on default settings. The chart below shows the SelectKBest scores for the dataset. Latitude and longitude had the strongest scores, with a steep dropoff after that. I decided to go with the top seven because that includes hour and day of the week. Day of the week seems like it could be useful since the previous charts showed Saturdays and Sundays having fewer accidents.
#
# For the fourth model, I will use Recursive Feature Elimination with Cross Validation (RFECV) from the scikit-learn library. The model will be trained on default settings.
#
# For the fifth model, I will use all features as in the first model, but I will perform a grid search over a few hand-selected parameter values. For brevity, I will run the parameter search once and then hard code the results to speed up the run time of the notebook.

# In[27]:

best_features = SelectKBest(score_func=mutual_info_classif, k='all')
fit = best_features.fit(X_train, y_train)
feature_scores = pd.Series(fit.scores_, index=X_train.columns)
feature_scores.sort_values(ascending=True).plot(kind='barh', figsize=(5,10))

# Below, I will select the seven features from the SelectKBest algorithm and select my choices for the second feature set.
# In[28]: k_best_features = ['Hour', 'DayOfWeek', 'Crossing', 'Traffic_Signal', 'Astronomical_Twilight', 'Start_Lat', 'Start_Lng'] my_selected_features = ['Hour', 'DayOfWeek', 'Weather_Condition', 'Crossing', 'Stop', 'Traffic_Signal', 'Start_Lat', 'Start_Lng'] # ## Decision Tree Classifier # # For the first classifier, I will use a simple decision tree. As stated in the feature selection and model section, I will run five different variations of the decision tree. # In[52]: clf_dt_all = DecisionTreeClassifier(random_state=random_state) clf_dt_all.fit(X_train, y_train) clf_dt_all_score = clf_dt_all.score(X_test, y_test) display('Decision Tree - All Features with Default Model Parameters:') display('Score: {}'.format(clf_dt_all_score)) # In[29]: clf_dt_my = DecisionTreeClassifier(random_state=random_state) clf_dt_my.fit(X_train[my_selected_features], y_train) clf_dt_my_score = clf_dt_my.score(X_test[my_selected_features], y_test) display('Decision Tree - My Features with Default Model Parameters:') display('Score: {}'.format(clf_dt_my_score)) # In[30]: clf_dt_kb = DecisionTreeClassifier(random_state=random_state) clf_dt_kb.fit(X_train[k_best_features], y_train) clf_dt_kb_score = clf_dt_kb.score(X_test[k_best_features], y_test) display('Decision Tree - KBest(7) with Default Model Parameters:') display('Score: {}'.format(clf_dt_kb_score)) # In[31]: estimator_dt = DecisionTreeClassifier(random_state=random_state) selector_dt = RFECV(estimator_dt, scoring='accuracy') selector_dt.fit(X_train, y_train) selector_dt_score = selector_dt.score(X_test, y_test) features_dt = [features[x] for x in np.where(selector_dt.support_)[0]] display('Decision Tree - Recursive Feature Elimination with Cross Validation with Default Model Parameters') display('Features Selected: {}'.format(features_dt)) display('Score: {}'.format(selector_dt_score)) # In[29]: # clf_dt = DecisionTreeClassifier(random_state=random_state) # params_dt = {'criterion' : ['gini', 'entropy'], # 'max_depth': [15, 25, 
50, 100, None],
#              'min_samples_split': [2, 5, 10, 15, 20, 30, 50, 100, 200],
#              'min_samples_leaf' : [1, 2, 5, 10, 15, 20],
#              'splitter' : ['best', 'random']}
# clf_dt_search = GridSearchCV(estimator=clf_dt, param_grid=params_dt, cv=3, n_jobs=-1, verbose=0)
# clf_dt_search.fit(X_train, y_train)
# clf_dt_search_score = clf_dt_search.score(X_test, y_test)
# display('Decision Tree Search Score: {}'.format(clf_dt_search_score))
# display('Search Best Parameters: {}'.format(clf_dt_search.best_params_))

params_dt_found = {'criterion': 'gini', 'max_depth': 25, 'min_samples_leaf': 5, 'min_samples_split': 100, 'splitter': 'best'}
clf_dt_search = DecisionTreeClassifier(random_state=random_state, **params_dt_found)
clf_dt_search.fit(X_train, y_train)
clf_dt_search_score = clf_dt_search.score(X_test, y_test)
display('Decision Tree - All Features with Grid Search')
display('Score: {}'.format(clf_dt_search_score))

# For the standard decision tree, the best accuracy was achieved using recursive feature elimination. The model reached an accuracy of 76.3% using only the latitude and longitude features. This is an improvement of approximately 15.6 percentage points over the dummy classifier's 60.7%.
#
# ### Random Forest Classifier
#
# For the next classifier, I'll try the random forest classifier. For the fifth model of the previous classifier, I used GridSearchCV to narrow down parameters. For this model, GridSearchCV was taking much longer, so this and all the following classifiers will use RandomizedSearchCV instead. The randomized search does not try every parameter combination provided; you specify the number of iterations you would like, and it randomly samples parameter values to try for those iterations.
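# To make the search-budget difference concrete, the sketch below compares the exhaustive candidate count of a grid search with the fixed n_iter of a randomized search on a small synthetic problem (the parameter values are illustrative, not the ones searched in this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)
params = {'n_estimators': [25, 50, 100],
          'max_depth': [3, 5, None],
          'min_samples_leaf': [1, 2, 5]}

# GridSearchCV would fit every combination (times the CV folds).
n_grid_candidates = len(ParameterGrid(params))  # 3 * 3 * 3 = 27

# RandomizedSearchCV samples a fixed number of combinations instead.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                          n_iter=5, cv=3, random_state=0)
rand.fit(X, y)

assert n_grid_candidates == 27
assert len(rand.cv_results_['params']) == 5  # only n_iter candidates tried
```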
# In[39]: clf_rf_all = RandomForestClassifier(random_state=random_state) clf_rf_all.fit(X_train, y_train) clf_rf_all_score = clf_rf_all.score(X_test, y_test) display('Random Forest - All Features with Default Model Parameters:') display('Score: {}'.format(clf_rf_all_score)) # In[34]: clf_rf_my = RandomForestClassifier(random_state=random_state) clf_rf_my.fit(X_train[my_selected_features], y_train) clf_rf_my_score = clf_rf_my.score(X_test[my_selected_features], y_test) display('Random Forest - My Features with Default Model Parameters:') display('Score: {}'.format(clf_rf_my_score)) # In[35]: clf_rf_kb = RandomForestClassifier(random_state=random_state) clf_rf_kb.fit(X_train[k_best_features], y_train) clf_rf_kb_score = clf_rf_kb.score(X_test[k_best_features], y_test) display('Random Forest - KBest(7) with Default Model Parameters:') display('Score: {}'.format(clf_rf_kb_score)) # In[45]: estimator_rf = RandomForestClassifier(random_state=random_state) selector_rf = RFECV(estimator_rf, scoring='accuracy') selector_rf.fit(X_train, y_train) selector_rf_score = selector_rf.score(X_test, y_test) features_rf = [features[x] for x in np.where(selector_rf.support_)[0]] display('Random Forest - Recursive Feature Elimination with Cross Validation with Default Model Parameters') display('Features Selected: {}'.format(features_rf)) display('Score: {}'.format(selector_rf_score)) # In[37]: # clf_rf = RandomForestClassifier(random_state=random_state) # params_rf = {'criterion' : ['gini', 'entropy'], # 'max_depth': [15, 25, 50, 100, None], # 'max_features': ['auto', 'log2', 0.1, 0.2, 0.3, 0.4, 0.5], # 'min_samples_split': [2, 5, 10, 15, 20, 30, 50, 100, 200], # 'min_samples_leaf' : [1, 2, 5, 10, 15, 20], # 'bootstrap' : [True, False], # 'n_estimators' : [50, 100, 200, 300]} # clf_rf_search = RandomizedSearchCV(estimator=clf_rf, param_distributions=params_rf, n_iter=200, cv=3, n_jobs=-1, verbose=0) # clf_rf_search.fit(X_train, y_train) # clf_rf_search_score = 
clf_rf_search.score(X_test, y_test)
# display('Random Forest Search Score: {}'.format(clf_rf_search_score))
# display('Search Best Parameters: {}'.format(clf_rf_search.best_params_))

params_rf_found = {'criterion' : 'entropy', 'max_depth': None, 'max_features': 0.5, 'min_samples_split': 15, 'min_samples_leaf' : 1, 'bootstrap' : False, 'n_estimators' : 50}
clf_rf_search = RandomForestClassifier(random_state=random_state, **params_rf_found)
clf_rf_search.fit(X_train, y_train)
clf_rf_search_score = clf_rf_search.score(X_test, y_test)
display('Random Forest - All Features with Randomized Search')
display('Score: {}'.format(clf_rf_search_score))

# For the random forest classifier, the best accuracy was achieved using the randomized parameter search. The random forest had an accuracy of 78.3%, approximately 17.6 percentage points higher than the dummy classifier's 60.7%.
#
# ### Gradient Boosting Classifier
#
# For the next classifier, I'll try the gradient boosting classifier. I will use another randomized parameter search to find the best parameters.
# In[47]: clf_gb_all = GradientBoostingClassifier(random_state=random_state) clf_gb_all.fit(X_train, y_train) clf_gb_all_score = clf_gb_all.score(X_test, y_test) display('Gradient Boosting - All Features with Default Model Parameters:') display('Score: {}'.format(clf_gb_all_score)) # In[39]: clf_gb_my = GradientBoostingClassifier(random_state=random_state) clf_gb_my.fit(X_train[my_selected_features], y_train) clf_gb_my_score = clf_gb_my.score(X_test[my_selected_features], y_test) display('Gradient Boosting - My Features with Default Model Parameters:') display('Score: {}'.format(clf_gb_my_score)) # In[40]: clf_gb_kb = GradientBoostingClassifier(random_state=random_state) clf_gb_kb.fit(X_train[k_best_features], y_train) clf_gb_kb_score = clf_gb_kb.score(X_test[k_best_features], y_test) display('Gradient Boosting - KBest(7) with Default Model Parameters:') display('Score: {}'.format(clf_gb_kb_score)) # In[41]: estimator_gb = GradientBoostingClassifier(random_state=random_state, n_iter_no_change=3) selector_gb = RFECV(estimator_gb, scoring='accuracy') selector_gb.fit(X_train, y_train) selector_gb_score = selector_gb.score(X_test, y_test) features_gb = [features[x] for x in np.where(selector_gb.support_)[0]] display('Gradient Boosting - Recursive Feature Elimination with Cross Validation with Default Model Parameters') display('Features Selected: {}'.format(features_gb)) display('Score: {}'.format(selector_gb_score)) # In[42]: # clf_gb = GradientBoostingClassifier(random_state=random_state, n_iter_no_change=3) # params_gb = {'learning_rate' : [0.09, 0.1, 0.11], # 'subsample' : [0.9, 1.0], # 'max_depth': [5, 8, 10], # 'max_features': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], # 'n_estimators' : [125, 150, 175]} # clf_gb_search = RandomizedSearchCV(estimator=clf_gb, param_distributions=params_gb, n_iter=200, cv=3, n_jobs=-1, verbose=2) # clf_gb_search.fit(X_train, y_train) # clf_gb_search_score = clf_gb_search.score(X_test, y_test) # display('Gradient Boosting Search Score: 
{}'.format(clf_gb_search_score))
# display('Gradient Boosting Parameters: {}'.format(clf_gb_search.best_params_))

params_gb_found = {'learning_rate' : 0.09, 'subsample' : 0.9, 'max_depth': 10, 'max_features': 0.6, 'n_estimators' : 150}
clf_gb_search = GradientBoostingClassifier(random_state=random_state, n_iter_no_change=3, **params_gb_found)
clf_gb_search.fit(X_train, y_train)
clf_gb_search_score = clf_gb_search.score(X_test, y_test)
display('Gradient Boosting - All Features with Randomized Search')
display('Score: {}'.format(clf_gb_search_score))

# The gradient boosting classifier managed an accuracy of 78.9% using the randomized parameter search. This is approximately 18.2 percentage points better than the dummy classifier's 60.7%.
#
# ### Extra Tree Classifier
#
# For the next classifier, I'll try the extra tree classifier. I will use another randomized parameter search to find the best parameters.

# In[43]:

clf_et_all = ExtraTreesClassifier(random_state=random_state)
clf_et_all.fit(X_train, y_train)
clf_et_all_score = clf_et_all.score(X_test, y_test)
display('Extra Tree - All Features with Default Model Parameters:')
display('Score: {}'.format(clf_et_all_score))

# In[44]:

clf_et_my = ExtraTreesClassifier(random_state=random_state)
clf_et_my.fit(X_train[my_selected_features], y_train)
clf_et_my_score = clf_et_my.score(X_test[my_selected_features], y_test)
display('Extra Tree - My Features with Default Model Parameters:')
display('Score: {}'.format(clf_et_my_score))

# In[45]:

clf_et_kb = ExtraTreesClassifier(random_state=random_state)
clf_et_kb.fit(X_train[k_best_features], y_train)
clf_et_kb_score = clf_et_kb.score(X_test[k_best_features], y_test)
display('Extra Tree - KBest(7) with Default Model Parameters:')
display('Score: {}'.format(clf_et_kb_score))

# In[46]:

estimator_et = ExtraTreesClassifier(random_state=random_state)
selector_et = RFECV(estimator_et, scoring='accuracy')
selector_et.fit(X_train, y_train)
selector_et_score = selector_et.score(X_test, y_test)
features_et =
[features[x] for x in np.where(selector_et.support_)[0]]
display('Extra Tree - Recursive Feature Elimination with Cross Validation with Default Model Parameters')
display('Features Selected: {}'.format(features_et))
display('Score: {}'.format(selector_et_score))

# In[47]:

# clf_et = ExtraTreesClassifier(random_state=random_state)
# params_et = {'criterion' : ['gini', 'entropy'],
#              'max_depth': [30, 40, 50, 60, None],
#              'max_features': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
#              'n_estimators' : [50, 100, 200, 300],
#              'bootstrap' : [True, False]}
# clf_et_search = RandomizedSearchCV(estimator=clf_et, param_distributions=params_et, n_iter=200, cv=3, n_jobs=-1, verbose=3)
# clf_et_search.fit(X_train, y_train)
# clf_et_search_score = clf_et_search.score(X_test, y_test)
# display('Extra Tree Search Score: {}'.format(clf_et_search_score))
# display('Extra Tree Parameters: {}'.format(clf_et_search.best_params_))

params_et_found = {'criterion' : 'entropy', 'max_depth': None, 'max_features': 0.6, 'n_estimators' : 200, 'bootstrap' : False}
clf_et_search = ExtraTreesClassifier(random_state=random_state, **params_et_found)
clf_et_search.fit(X_train, y_train)
clf_et_search_score = clf_et_search.score(X_test, y_test)
display('Extra Tree - All Features with Randomized Search')
display('Score: {}'.format(clf_et_search_score))

# The extra tree classifier managed an accuracy of 77.3% using recursive feature elimination. This is approximately 16.6 percentage points higher than the dummy classifier's 60.7% accuracy.
#
# ### Voting Classifier
#
# The models from the four classifiers above all returned results significantly better than the dummy classifier, with the best being the gradient boosted trees using the randomized parameter search at 78.9% accuracy. Overall, the best results from the four classifiers were within a couple of percentage points of each other. I tried other classifiers in my initial prototyping, but these four tree-based classifiers gave the best results.
#
# Next, I'm going to combine these four best performing models into a voting classifier. A voting classifier takes multiple models; each model makes a prediction, and the voting classifier combines those predictions using a vote to produce the final prediction.
#
# There are multiple ways to count the vote. For a 'hard' vote, each classifier gets one vote, and the classification with the most votes wins. For a 'soft' vote, the predicted class probabilities from all the models are averaged, and the class with the highest average probability wins. I will also use weighted voting, where each classifier is given a weight based on its test accuracy. The accuracy values from the previous test data will be normalized so they sum to 1.
#
# Below, I will take a hard and a soft vote. Then, I will take a weighted hard and soft vote. The best performing models from above were the decision tree with RFECV at 76.3% accuracy, the random forest with randomized search at 78.3%, gradient boosting with randomized search at 78.9%, and the extra tree classifier with RFECV at 77.3%.
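# The vote-counting rules described above can be illustrated without scikit-learn. A small sketch with made-up class probabilities from four models (the weights reuse the four test accuracies quoted above):

```python
import numpy as np

# Made-up class probabilities for one sample from four models (classes 0-3).
probs = np.array([[0.1, 0.2, 0.6, 0.1],   # this model's argmax is class 2
                  [0.1, 0.1, 0.5, 0.3],   # class 2
                  [0.2, 0.1, 0.3, 0.4],   # class 3
                  [0.1, 0.2, 0.3, 0.4]])  # class 3

# Hard vote: each model casts one vote for its argmax class.
votes = probs.argmax(axis=1)
hard_winner = np.bincount(votes).argmax()

# Soft vote: average the probabilities across models, then take the argmax.
soft_winner = probs.mean(axis=0).argmax()

# Weighted soft vote: weight each model by its normalized test accuracy.
weights = np.array([0.763, 0.783, 0.789, 0.773])
weights = weights / weights.sum()
weighted_winner = np.average(probs, axis=0, weights=weights).argmax()
```

# Note that a hard vote can tie (here, two votes each for classes 2 and 3); np.bincount's argmax breaks ties toward the lower class index, while the averaged probabilities resolve them more smoothly.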
# In[48]:

clf_vote = VotingClassifier(estimators=[('DT', selector_dt), ('RF', clf_rf_search), ('GB', clf_gb_search), ('ET', selector_et)], voting='hard')
clf_vote.fit(X_train, y_train)
clf_vote_score = clf_vote.score(X_test, y_test)
display('Hard Voting Score: {}'.format(clf_vote_score))
clf_vote = VotingClassifier(estimators=[('DT', selector_dt), ('RF', clf_rf_search), ('GB', clf_gb_search), ('ET', selector_et)], voting='soft')
clf_vote.fit(X_train, y_train)
clf_vote_score = clf_vote.score(X_test, y_test)
display('Soft Voting Score: {}'.format(clf_vote_score))

# In[49]:

weights = np.array([selector_dt_score, clf_rf_search_score, clf_gb_search_score, selector_et_score])
weights = weights / weights.sum()
clf_vote = VotingClassifier(estimators=[('DT', selector_dt), ('RF', clf_rf_search), ('GB', clf_gb_search), ('ET', selector_et)], voting='hard', weights=weights)
clf_vote.fit(X_train, y_train)
clf_vote_score = clf_vote.score(X_test, y_test)
display('Weighted Hard Voting Score: {}'.format(clf_vote_score))
clf_vote = VotingClassifier(estimators=[('DT', selector_dt), ('RF', clf_rf_search), ('GB', clf_gb_search), ('ET', selector_et)], voting='soft', weights=weights)
clf_vote.fit(X_train, y_train)
clf_vote_score = clf_vote.score(X_test, y_test)
display('Weighted Soft Voting Score: {}'.format(clf_vote_score))

# After working with several models above, the best accuracy before the voting classifier was the randomized search on the gradient boosting classifier, with a final accuracy of 78.9%. Combining these classifiers into a voting classifier yielded an accuracy of 79.6% for the weighted hard vote. Considering the voting classifier must store and run all four models, which increases memory use and slows prediction, this isn't a very significant increase in accuracy.
#
# # A look at traffic during Covid-19
#
# In the earlier section, I removed the traffic data from the year 2020 due to the possibility it could be anomalous.
It was widely reported in Atlanta that traffic flows were lighter. There were also [some reports that the severity of accidents was worse because of this.](https://www.wsbtv.com/news/local/atlanta/despite-covid-19-shutdown-there-was-huge-rise-deadly-wrecks-georgia-roads-over-last-year/FIYWJJQS6ZDJPGN67T7YHY3R7E/) The suggested cause was that with less traffic on the road, people were traveling at higher speeds, which made the accidents worse. However, a more severe accident wouldn't necessarily cause longer traffic delays, since there were fewer vehicles on the road to be affected. Given those thoughts, I thought it was worth taking a look at some plots of the data for the year. # # Atlanta began to see an impact from Covid-19 in March. After the first case in early March, Atlanta slowly began shutting down. It started with event cancellations and school closures, and eventually led to shelter-in-place orders. It was widely reported that there was less traffic on the roads. Given that information, we would expect the year 2020 to be significantly lower in the various traffic plots. We will take a look at that below. # # ### Monthly accident count # # Given the discussion above, I would expect to see a dip in the accident counts for 2020 starting in March. With many businesses in Atlanta switching to remote work, I would expect that trend to continue somewhat through the rest of the year. However, when I moved to Atlanta in August of 2020, there seemed to be plenty of traffic on the roads, and business seemed to be going on mostly as usual. It had been almost 20 years since I lived in Atlanta, but I definitely wouldn't have called the traffic light. 
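# Before the plots, the year-by-month accident counts can also be tabulated numerically. Below is a minimal sketch of the idea using pandas.crosstab on a toy DataFrame; it assumes only that the data has 'Year' and 'Month' columns, like the notebook's df and df_covid.

```python
import pandas as pd

# Toy stand-in for the accident DataFrame: one row per accident,
# with the year and month it occurred.
df_demo = pd.DataFrame({
    'Year':  [2019, 2019, 2019, 2020, 2020],
    'Month': [3, 3, 4, 3, 4],
})

# Rows are months, columns are years, values are accident counts.
monthly_counts = pd.crosstab(df_demo['Month'], df_demo['Year'])
print(monthly_counts)
```

# On the real data, the same call would give one column per year, making a quick numeric comparison possible alongside the plots below.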
# In[30]: def plot_crash_count_severity_per_month(): df_2017 = df[df['Year'] == 2017] df_2018 = df[df['Year'] == 2018] df_2019 = df[df['Year'] == 2019] df_2017_count = df_2017.groupby(['Month']).count().iloc[:,0] df_2018_count = df_2018.groupby(['Month']).count().iloc[:,0] df_2019_count = df_2019.groupby(['Month']).count().iloc[:,0] df_2020_count = df_covid.groupby(['Month']).count().iloc[:,0] df_2017_mean = df_2017.groupby(['Month']).mean()['Severity'] df_2018_mean = df_2018.groupby(['Month']).mean()['Severity'] df_2019_mean = df_2019.groupby(['Month']).mean()['Severity'] df_2020_mean = df_covid.groupby(['Month']).mean()['Severity'] months = [pd.Timestamp(1900, x , 1).strftime('%b') for x in range(1, 13)] fig = make_subplots(rows=2, cols=1, subplot_titles=['Accident Count', 'Traffic Impact Severity (Mean)']) line_styles = [dict(color='blue'), dict(color='green'), dict(color='purple'), dict(color='red')] mark_styles = [dict(color='blue'), dict(color='green'), dict(color='purple'), dict(color='red')] fig.add_trace(go.Scatter(x=months, y=df_2017_count, name='2017', line=line_styles[0], marker=mark_styles[0]), row=1, col=1) fig.add_trace(go.Scatter(x=months, y=df_2018_count, name='2018', line=line_styles[1], marker=mark_styles[1]), row=1, col=1) fig.add_trace(go.Scatter(x=months, y=df_2019_count, name='2019', line=line_styles[2], marker=mark_styles[2]), row=1, col=1) fig.add_trace(go.Scatter(x=months, y=df_2020_count, name='2020', line=line_styles[3], marker=mark_styles[3]), row=1, col=1) fig.add_trace(go.Scatter(x=months, y=df_2017_mean, name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=months, y=df_2018_mean, name='2018', line=line_styles[1], marker=mark_styles[1], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=months, y=df_2019_mean, name='2019', line=line_styles[2], marker=mark_styles[2], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=months, y=df_2020_mean, name='2020', 
line=line_styles[3], marker=mark_styles[3], showlegend=False), row=2, col=1) fig.update_layout(width=900, height=600) #fig.write_html('monthly_covid_plot.html') fig.show() plot_crash_count_severity_per_month() # The plots above show some interesting results. The accident count for the year 2020 was, for the most part, below the other three years until September of 2020. In September, there was a fairly steep climb in accident count. While I commented that traffic seemed normal when I moved here, I didn't expect to see accidents above normal levels. Also, on the traffic impact severity charts, we see a fairly significant drop in severity from September 2020 onward. # # ### Weekly accident count # # While the monthly count and impact severity weren't quite what was expected, let's take a look at the count and traffic impact severity across the days of the week. # In[31]: def plot_crash_count_severity_per_weekday(): df_2017 = df[df['Year'] == 2017] df_2018 = df[df['Year'] == 2018] df_2019 = df[df['Year'] == 2019] df_2017_count = df_2017.groupby(['DayOfWeek']).count().iloc[:,0] df_2018_count = df_2018.groupby(['DayOfWeek']).count().iloc[:,0] df_2019_count = df_2019.groupby(['DayOfWeek']).count().iloc[:,0] df_2020_count = df_covid.groupby(['DayOfWeek']).count().iloc[:,0] df_2017_mean = df_2017.groupby(['DayOfWeek']).mean()['Severity'] df_2018_mean = df_2018.groupby(['DayOfWeek']).mean()['Severity'] df_2019_mean = df_2019.groupby(['DayOfWeek']).mean()['Severity'] df_2020_mean = df_covid.groupby(['DayOfWeek']).mean()['Severity'] days = [pd.Timestamp(2021, 2 , x).strftime('%a') for x in range(1, 8)] fig = make_subplots(rows=2, cols=1, subplot_titles=['Accident Count', 'Traffic Impact Severity (Mean)']) line_styles = [dict(color='blue'), dict(color='green'), dict(color='purple'), dict(color='red')] mark_styles = [dict(color='blue'), dict(color='green'), dict(color='purple'), dict(color='red')] fig.add_trace(go.Scatter(x=days, y=df_2017_count, name='2017', line=line_styles[0], 
marker=mark_styles[0]), row=1, col=1) fig.add_trace(go.Scatter(x=days, y=df_2018_count, name='2018', line=line_styles[1], marker=mark_styles[1]), row=1, col=1) fig.add_trace(go.Scatter(x=days, y=df_2019_count, name='2019', line=line_styles[2], marker=mark_styles[2]), row=1, col=1) fig.add_trace(go.Scatter(x=days, y=df_2020_count, name='2020', line=line_styles[3], marker=mark_styles[3]), row=1, col=1) fig.add_trace(go.Scatter(x=days, y=df_2017_mean, name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=days, y=df_2018_mean, name='2018', line=line_styles[1], marker=mark_styles[1], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=days, y=df_2019_mean, name='2019', line=line_styles[2], marker=mark_styles[2], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=days, y=df_2020_mean, name='2020', line=line_styles[3], marker=mark_styles[3], showlegend=False), row=2, col=1) fig.update_layout(width=900, height=600) #fig.write_html('daily_covid_plot.html') fig.show() plot_crash_count_severity_per_weekday() # Taking a look at the weekly data, we can see that it follows a pattern that was seen earlier where counts are higher during the week and drop off on the weekends. We can see the accident counts are slightly lower for 2020 during the week, but not by much. However, on the weekends, the counts are slightly higher for 2020. Perhaps this could be attributed to remote work? Perhaps people being stuck inside their home all week get out on the weekends to do something? Next, we can see that traffic impact is lower than the previous three years on all days. During 2017-2019, there seemed to be higher delays on the weekend even though there were lower accident counts. However, 2020 didn't see those same impacts over the weekend. # # ### Hourly accident count # # In this last plot, I'm going to look at the hourly accident counts and traffic impacts. 
Given the difference in weekday and weekend data seen in previous sections, I'm going to split the data into hourly plots for the weekday and weekend for all years. # In[32]: def plot_crash_count_severity_per_hour(): df_2017 = df[df['Year'] == 2017] df_2018 = df[df['Year'] == 2018] df_2019 = df[df['Year'] == 2019] df_2017_counts = df_2017.groupby(['DayOfWeek', 'Hour']).count() df_2018_counts = df_2018.groupby(['DayOfWeek', 'Hour']).count() df_2019_counts = df_2019.groupby(['DayOfWeek', 'Hour']).count() df_2020_counts = df_covid.groupby(['DayOfWeek', 'Hour']).count() df_2017_means = df_2017.groupby(['DayOfWeek', 'Hour']).mean() df_2018_means = df_2018.groupby(['DayOfWeek', 'Hour']).mean() df_2019_means = df_2019.groupby(['DayOfWeek', 'Hour']).mean() df_2020_means = df_covid.groupby(['DayOfWeek', 'Hour']).mean() hours = [pd.Timestamp(2021, 2 , 1, x).strftime('%I %p') for x in range(0, 24)] y_2017_data_list = [[], []] y_2018_data_list = [[], []] y_2019_data_list = [[], []] y_2020_data_list = [[], []] #some days/hour are empty, 0 counts need to be filled in for d in range(0, 7): data_list_2017_c, data_list_2017_m = [], [] data_list_2018_c, data_list_2018_m = [], [] data_list_2019_c, data_list_2019_m = [], [] data_list_2020_c, data_list_2020_m = [], [] for h in range(0, 24): if (d, h) not in df_2017_counts.index: data_list_2017_c.append(0) data_list_2017_m.append(0) else: data_list_2017_c.append(df_2017_counts.loc[(d, h)][0]) data_list_2017_m.append(df_2017_means.loc[(d, h)]['Severity']) if (d, h) not in df_2018_counts.index: data_list_2018_c.append(0) data_list_2018_m.append(0) else: data_list_2018_c.append(df_2018_counts.loc[(d, h)][0]) data_list_2018_m.append(df_2018_means.loc[(d, h)]['Severity']) if (d, h) not in df_2019_counts.index: data_list_2019_c.append(0) data_list_2019_m.append(0) else: data_list_2019_c.append(df_2019_counts.loc[(d, h)][0]) data_list_2019_m.append(df_2019_means.loc[(d, h)]['Severity']) if (d, h) not in df_2020_counts.index: 
data_list_2020_c.append(0) data_list_2020_m.append(0) else: data_list_2020_c.append(df_2020_counts.loc[(d, h)][0]) data_list_2020_m.append(df_2020_means.loc[(d, h)]['Severity']) y_2017_data_list[0].append(data_list_2017_c) y_2018_data_list[0].append(data_list_2018_c) y_2019_data_list[0].append(data_list_2019_c) y_2020_data_list[0].append(data_list_2020_c) y_2017_data_list[1].append(data_list_2017_m) y_2018_data_list[1].append(data_list_2018_m) y_2019_data_list[1].append(data_list_2019_m) y_2020_data_list[1].append(data_list_2020_m) fig = make_subplots(rows=2, cols=2, subplot_titles=['Weekday Accident Count', 'Weekend Accident Count', 'Weekday Traffic Impact Severity (Mean)', 'Weekend Traffic Impact Severity (Mean)']) line_styles = [dict(color='blue'), dict(color='green'), dict(color='purple'), dict(color='red')] mark_styles = [dict(color='blue'), dict(color='green'), dict(color='purple'), dict(color='red')] fig.add_trace(go.Scatter(x=hours, y=np.array(y_2017_data_list[0][:5]).sum(axis=0), name='2017', line=line_styles[0], marker=mark_styles[0]), row=1, col=1) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2018_data_list[0][:5]).sum(axis=0), name='2018', line=line_styles[1], marker=mark_styles[1]), row=1, col=1) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2019_data_list[0][:5]).sum(axis=0), name='2019', line=line_styles[2], marker=mark_styles[2]), row=1, col=1) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2020_data_list[0][:5]).sum(axis=0), name='2020', line=line_styles[3], marker=mark_styles[3]), row=1, col=1) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2017_data_list[0][5:]).sum(axis=0), name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), row=1, col=2) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2018_data_list[0][5:]).sum(axis=0), name='2018', line=line_styles[1], marker=mark_styles[1], showlegend=False), row=1, col=2) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2019_data_list[0][5:]).sum(axis=0), name='2019', 
line=line_styles[2], marker=mark_styles[2], showlegend=False), row=1, col=2) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2020_data_list[0][5:]).sum(axis=0), name='2020', line=line_styles[3], marker=mark_styles[3], showlegend=False), row=1, col=2) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2017_data_list[1][:5]).mean(axis=0), name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2018_data_list[1][:5]).mean(axis=0), name='2018', line=line_styles[1], marker=mark_styles[1], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2019_data_list[1][:5]).mean(axis=0), name='2019', line=line_styles[2], marker=mark_styles[2], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2020_data_list[1][:5]).mean(axis=0), name='2020', line=line_styles[3], marker=mark_styles[3], showlegend=False), row=2, col=1) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2017_data_list[1][5:]).mean(axis=0), name='2017', line=line_styles[0], marker=mark_styles[0], showlegend=False), row=2, col=2) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2018_data_list[1][5:]).mean(axis=0), name='2018', line=line_styles[1], marker=mark_styles[1], showlegend=False), row=2, col=2) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2019_data_list[1][5:]).mean(axis=0), name='2019', line=line_styles[2], marker=mark_styles[2], showlegend=False), row=2, col=2) fig.add_trace(go.Scatter(x=hours, y=np.array(y_2020_data_list[1][5:]).mean(axis=0), name='2020', line=line_styles[3], marker=mark_styles[3], showlegend=False), row=2, col=2) fig.update_layout(width=900, height=600, xaxis1=dict(tickmode='linear', tickangle=90), xaxis2=dict(tickmode='linear', tickangle=90)) fig.update_yaxes(row=1, col=1, range=[0, 1800]) fig.update_yaxes(row=1, col=2, range=[0, 1800]) fig.update_yaxes(row=2, col=1, range=[0, 5]) fig.update_yaxes(row=2, col=2, range=[0, 5]) 
#fig.write_html('weekday_hourly_covid_plots.html') fig.show() plot_crash_count_severity_per_hour() # As can be seen in the above charts, the weekday accident counts for 2020 are significantly lower during the morning rush hour, while the rest of the time they fall close to the counts for the other three years. There is also not much difference in weekday traffic severity between the years. For the weekend data, all years are close except for a slight dip in traffic severity in the early morning hours. # # # Summary # # In summary, this project was meant to be an exercise in cleaning an unprocessed data set, making interactive plots, and getting more familiar with the scikit-learn library. # # The data was a set of traffic accidents and the traffic delays they caused. To begin, I narrowed the data down to the Atlanta, Georgia area. I cleaned the data by removing values that weren't needed and filling in missing values. After the data was prepared, I used the Plotly library to make various interactive plots to visualize the data. # # I looked at heatmap plots for the years 2017, 2018, and 2019. The heatmaps showed large concentrations of accidents around Atlanta's interstates. I looked at a weather plot for all years that showed a heatmap for each type of weather condition. This data was imbalanced towards clear and cloudy conditions, so not much information could be gathered from the plot. Next, I plotted animated heatmaps for 2017, 2018, and 2019 that showed hourly accident heatmaps from Monday through Sunday. There was a slight increase in accidents during rush hour times, but the animation didn't give a clear picture of the data. I looked further at rush hour incidents by plotting line graphs of each weekday broken down into hours, where the rise in rush hour accidents during the week was clearly visible. 
# # After looking at various plots, I divided the data into train/test sets, which I used to train Decision Tree, Random Forest, Gradient Boosted Trees, and Extra Trees classifiers. For each of these classifiers, I trained several models using various techniques. Next, I combined some of the best models into a voting classifier. Overall, I managed to get significantly better results than the dummy classifier, which chose the most frequent category. # # Next, I took a look at traffic data for the year 2020. I compared this data to the data from the previous sections to get an idea of how Covid-19 affected accidents and delays. The monthly accident count showed 2020 starting off below the other years but climbing above them towards the end of the year. The weekly accident count, broken down Monday through Sunday, showed 2020 slightly below the other years during the week but slightly above on the weekends. Finally, I plotted the hourly data for the four years, which showed a significantly lower accident count only during the weekday morning rush hour; all other times were reasonably close. Overall, the data for 2020 didn't match what was expected. # # I have one final note about the plots and data for this project. Many of the traffic plots showed different results than expected. I want to call back to the About the Data section, where it was mentioned that this dataset is most likely a subset of all the accidents in the United States. While the data came from multiple large, reliable sources, it is unlikely they captured every accident that happened. It is impossible to know what data is missing, and this could be why the results differ so much from expectations. This project was intended to be a practice exercise and not a completely accurate representation of traffic accidents.