As someone studying machine learning and data science, I didn't have a strong portfolio of projects to show. I had some experience with this kind of work in graduate school, and I had completed several online machine learning and data science courses, but I didn't have many projects to show for it.
My initial idea for this project was to browse the data sets on Kaggle and pick one that seemed interesting. I would then look at the data, train a classifier or regressor to make some predictions, and put together some written thoughts on it. I didn't really have a plan when I picked the data set; I just wanted to explore it and see what came to me.
After spending some time browsing through various data sets, I ended up choosing US Accidents (4.2 million records), A Countrywide Traffic Accident Dataset (2016 - 2020). There is an accompanying paper for this data set located at arxiv.org. Using this data set, I narrowed the scope down to the Atlanta, Georgia area, where I recently moved.
According to the author, the data was collected in real time using multiple traffic APIs. The majority of the data, approximately 63% and 36%, came from MapQuest and Bing respectively. The data was collected from February 2016 until December 2020. The author of the data set made some comments on the validity of the data here. Overall, the author believes that this data set is a subset of all accidents in the United States. The author also discussed a change in collection techniques for MapQuest after August 2017 here.
Given the above information, while the data may not be complete, it should be enough to give an interesting view into the traffic around the Atlanta area.
The following table is a description of the column values as per the author's website.
Attribute | Description |
---|---|
ID | This is a unique identifier of the accident record. |
Source | Indicates source of the accident report (i.e. the API which reported the accident). |
TMC | A traffic accident may have a Traffic Message Channel (TMC) code which provides more detailed description of the event. |
Severity | Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). |
Start_Time | Shows start time of the accident in local time zone. |
End_Time | Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed. |
Start_Lat | Shows latitude in GPS coordinate of the start point. |
Start_Lng | Shows longitude in GPS coordinate of the start point. |
End_Lat | Shows latitude in GPS coordinate of the end point. |
End_Lng | Shows longitude in GPS coordinate of the end point. |
Distance(mi) | The length of the road extent affected by the accident. |
Description | Shows natural language description of the accident. |
Number | Shows the street number in address field. |
Street | Shows the street name in address field. |
Side | Shows the relative side of the street (Right/Left) in address field. |
City | Shows the city in address field. |
County | Shows the county in address field. |
State | Shows the state in address field. |
Zipcode | Shows the zipcode in address field. |
Country | Shows the country in address field. |
Timezone | Shows timezone based on the location of the accident (eastern, central, etc.). |
Airport_Code | Denotes an airport-based weather station which is the closest one to location of the accident. |
Weather_Timestamp | Shows the time-stamp of weather observation record (in local time). |
Temperature(F) | Shows the temperature (in Fahrenheit). |
Wind_Chill(F) | Shows the wind chill (in Fahrenheit). |
Humidity(%) | Shows the humidity (in percentage). |
Pressure(in) | Shows the air pressure (in inches). |
Visibility(mi) | Shows visibility (in miles). |
Wind_Direction | Shows wind direction. |
Wind_Speed(mph) | Shows wind speed (in miles per hour). |
Precipitation(in) | Shows precipitation amount in inches, if there is any. |
Weather_Condition | Shows the weather condition (rain, snow, thunderstorm, fog, etc.) |
Amenity | A POI annotation which indicates presence of amenity in a nearby location. |
Bump | A POI annotation which indicates presence of speed bump or hump in a nearby location. |
Crossing | A POI annotation which indicates presence of crossing in a nearby location. |
Give_Way | A POI annotation which indicates presence of give_way in a nearby location. |
Junction | A POI annotation which indicates presence of junction in a nearby location. |
No_Exit | A POI annotation which indicates presence of no_exit in a nearby location. |
Railway | A POI annotation which indicates presence of railway in a nearby location. |
Roundabout | A POI annotation which indicates presence of roundabout in a nearby location. |
Station | A POI annotation which indicates presence of station in a nearby location. |
Stop | A POI annotation which indicates presence of stop in a nearby location. |
Traffic_Calming | A POI annotation which indicates presence of traffic_calming in a nearby location. |
Traffic_Signal | A POI annotation which indicates presence of traffic_signal in a nearby location. |
Turning_Loop | A POI annotation which indicates presence of turning_loop in a nearby location. |
Sunrise_Sunset | Shows the period of day (i.e. day or night) based on sunrise/sunset. |
Civil_Twilight | Shows the period of day (i.e. day or night) based on civil twilight. |
Nautical_Twilight | Shows the period of day (i.e. day or night) based on nautical twilight. |
Astronomical_Twilight | Shows the period of day (i.e. day or night) based on astronomical twilight. |
For this project, I will be looking at plots of traffic accidents and at the impact accidents have on traffic. Impacts in this data set are rated by severity from 1 to 4, with 1 meaning the accident caused minimal traffic delay and 4 meaning it had an extreme effect on traffic delay.
The data set doesn't define a clear timetable for the severity of the delays. However, the author posted some approximate estimates here. The delays are estimated as follows.
Severity | Time |
---|---|
1 | 2m 30s |
2 | 3m 15s |
3 | 8m |
4 | 18m |
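For convenience later on, here's a small helper that maps a severity level to the author's approximate delay estimate above. These are rough estimates, not values contained in the data set.
#approximate delay in minutes per severity level, per the author's rough estimates
severity_delay_minutes = {1: 2.5, 2: 3.25, 3: 8.0, 4: 18.0}
def estimated_delay(severity):
    #returns the approximate delay in minutes, or None for an unknown severity level
    return severity_delay_minutes.get(severity)
print(estimated_delay(3))  #8.0 minutes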
To examine the data, I will plot several different types of traffic maps. Next, I will divide the data into train and test sets in order to train a severity classifier that will predict the traffic delay given an accident's location and other relevant information.
To look at the data and train the classifier I will use the years 2017 through 2019. I will discard 2016 since data from the first part of the year is missing. I will also not consider the year 2020 in the first part of the project since various COVID measures had a serious impact on traffic in Atlanta. In the last part of the project, I will compare the 2020 data to the previous years to see how the data shows COVID affected traffic delays and accidents.
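As a rough sketch of the modeling setup I have in mind (the actual feature selection and training come later, after the data is cleaned), the split and a simple baseline would look something like the following. The small frame below is made up purely to show the shape of the code, not real data.
#minimal sketch: stratified train/test split plus a majority-class baseline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
toy = pd.DataFrame({'Hour': [7, 8, 17, 18, 12, 9, 16, 8],
                    'Temperature(F)': [60, 65, 80, 75, 70, 55, 85, 62],
                    'Severity': [2, 2, 3, 2, 2, 3, 2, 3]})
X, y = toy.drop('Severity', axis=1), toy['Severity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print('baseline accuracy:', baseline.score(X_test, y_test))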
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
import json
#load the Mapbox API key from a local credentials file
with open('credentials.json') as f:
    json_data = json.load(f)
mapbox_key = json_data['mapbox_key']
random_state = 0
import plotly.io as pio
pio.renderers.default = "notebook"
The first step is to read in the data and take a look at it. The data contains 49 columns of information and 4,232,541 rows of accident entries. The columns contain varying information on location, weather, dates, times, traffic impact, source, and other information. The exact explanation of each column can be found here.
df_all = pd.read_csv('US_Accidents_Dec20.csv')
df_all
In order to narrow the data to the Atlanta, Georgia area, I started by retrieving the latitude and longitude coordinates for Atlanta via Google. The listed coordinates for Atlanta are 33.7490° N, 84.3880° W. To select the area around Atlanta, I opened a map and chose what I thought was a good representation of the city and outer suburbs, a box roughly 35 by 40 miles. The first step is to retrieve the rows in the data set that correspond to this area.
#bounding box of +/- 0.3 degrees around downtown Atlanta
atlanta_lat = 33.7490
atlanta_lng = -84.3880
radius = 0.3
df = df_all[(df_all['Start_Lat'] <= atlanta_lat + radius) &
            (df_all['Start_Lat'] >= atlanta_lat - radius) &
            (df_all['Start_Lng'] >= atlanta_lng - radius) &
            (df_all['Start_Lng'] <= atlanta_lng + radius)]
df.head()
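As a quick sanity check on that size: one degree of latitude is about 69 miles, and a degree of longitude shrinks with the cosine of the latitude, so the ±0.3-degree box works out to roughly 41 miles north-south by 34 miles east-west.
#back-of-the-envelope size of the bounding box in miles
miles_per_deg_lat = 69.0
north_south = 2 * radius * miles_per_deg_lat
east_west = 2 * radius * miles_per_deg_lat * np.cos(np.radians(atlanta_lat))
print('bounding box is roughly {:.0f} x {:.0f} miles'.format(north_south, east_west))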
The TMC (Traffic Message Channel) column contains a variety of codes used to describe traffic incidents. After cross-referencing the codes listed in the Atlanta data, there are 140 entries for code 406, which is described as 'entry slip road closed'. This didn't sound like an accident, so I decided to investigate further. Fortunately, the accidents have a text description. Looking through these descriptions, several incidents didn't seem to be accidents at all. For example, several roads were closed due to fallen trees, protests, and various other activities. Most of the code 406 entries that do involve an accident contain the text 'accident' in the description, so I will filter out any code 406 rows that don't.
#keep only the TMC 406 rows whose description mentions an accident
to_remove = ~df[df['TMC'] == 406]['Description'].str.contains('accident')
df = df.drop(df[df['TMC'] == 406][to_remove].index)
The data set contains certain columns that won't be used for mapping and predictions.
In the paper, the original traffic accident data only contained GPS coordinates. The author used a tool to convert those coordinates into the number, street, side, city, county, state, country, and zip code fields. Given that the original data didn't contain this information, I decided to stick with the original latitude and longitude coordinates for mapping and prediction.
For the columns End_Lat and End_Lng, many values were missing, so I decided to use only the starting coordinates. Distance(mi) is a function of the start and end points, so I removed it as well.
The columns ID and Source were irrelevant, so I removed those.
The column TMC was used to clean in a previous step, so it is no longer needed. Also, TMC codes may not be available at the time of an accident to predict the delay.
The column End_Time will be removed since the Severity column will be used as a target. Knowing the end time of an accident isn't something that would be known beforehand in prediction.
The column Description will be removed because it is a text description of the accident, and also wouldn't be known beforehand.
The column Timezone is irrelevant since the entire area is in the same timezone.
The column Airport_Code will be removed since it gives no localized information on the accident.
The column Weather_Timestamp will be removed since the time of the last weather update can't provide additional information on its own. After some research, I found several websites with historical weather APIs, but they all charge a fee, which I'm avoiding for this project, so checking the timestamp against a more accurate weather source isn't an option.
This leaves the following columns:
df = df.drop(['ID', 'Source', 'TMC', 'End_Time', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'Number', 'Street',
'Side', 'City', 'State', 'Zipcode', 'Country', 'Timezone', 'Airport_Code', 'Weather_Timestamp', 'County'], axis=1)
df.columns
Since I'm going to plot several charts by time, I'm going to convert the timestamp into its own separate columns for simplicity. Columns will be made for Year, Month, Day, Hour, and DayOfWeek. The hours will be rounded to the nearest hour. DayOfWeek will be in the format of 0 to 6, where 0 is Monday.
#convert to datetime format
df['Start_Time'] = pd.to_datetime(df['Start_Time'], infer_datetime_format=True)
#round to nearest hour
df['Start_Time'] = df['Start_Time'].dt.round("H")
#create day of week
df.insert(2, 'DayOfWeek', '')
df['DayOfWeek'] = df['Start_Time'].dt.weekday
#create an hour of the day column
df.insert(2, 'Hour', '')
df['Hour'] = df['Start_Time'].dt.hour
#create day of month column
df.insert(2, 'Day', '')
df['Day'] = df['Start_Time'].dt.day
#create month column
df.insert(2, 'Month', '')
df['Month'] = df['Start_Time'].dt.month
#create year column
df.insert(2, 'Year', '')
df['Year'] = df['Start_Time'].dt.year
df = df.drop(['Start_Time'], axis=1)
display(df.head())
Before the data can be used, any missing data must be dealt with. The first step is to identify which values are missing and how many.
def check_na():
    return pd.merge(left=df.isna().sum()[df.isna().sum() != 0].rename('null counts'),
                    right=df.isnull().mean()[df.isnull().mean() != 0].rename('percentage').round(4)*100,
                    left_index=True, right_index=True).sort_values(ascending=False, by=['percentage'])
check_na()
There are nine columns with missing values. For Wind_Chill(F) and Precipitation(in), over half of the values are missing. Since I don't have access to historical weather data to fill those in, those two columns will be dropped entirely. Wind_Chill(F) is a computation of temperature and wind speed, so that information should still be encoded in the data. While Precipitation(in) would probably be a good indicator, there are just too many missing values to fill in without access to an outside data source.
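For reference, the standard NWS wind chill formula (valid for temperatures at or below 50°F and wind speeds above 3 mph) uses only those two quantities, which is why dropping the column shouldn't lose much information.
#standard NWS wind chill formula (applies when temp_f <= 50 and wind_mph > 3)
def wind_chill_f(temp_f, wind_mph):
    return 35.74 + 0.6215 * temp_f - 35.75 * wind_mph**0.16 + 0.4275 * temp_f * wind_mph**0.16
print(wind_chill_f(30, 10))  #roughly 21 F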
For the remaining missing values, it is possible to fill in some data from surrounding values. For a missing value, we can check another entry on that same date in the same area and use that value. For example, if Weather_Condition is missing from an entry with lat/lng x1 y1 on date Jan 1 2000 at 8am, and we have another entry with lat/lng x2 y2 on Jan 1 2000 at 8am, it's reasonable that those would have the same Weather_Condition or other weather values.
df = df.drop(['Wind_Chill(F)', 'Precipitation(in)'], axis=1)
def get_missing_row_values(row_id):
    #attempts to fill in the missing values for a row from other incidents that happened on the same day
    row = df.loc[row_id].copy()
    #find all entries that happen on that day
    columns = ['Year', 'Month', 'Day']
    values = [row.Year, row.Month, row.Day]
    matches = df[(df[columns] == values).all(1)].copy()
    #drop the row that was passed in so it doesn't match itself
    matches = matches.drop(row_id)
    matches.insert(6, 'LL_Dist', '')
    #compute distance between the coordinates
    matches['LL_Dist'] = (np.sqrt((row.Start_Lat - matches.Start_Lat)**2 + (row.Start_Lng - matches.Start_Lng)**2))
    matches.insert(6, 'Time_Diff', '')
    #compute time between the accidents
    matches['Time_Diff'] = abs(row.Hour - matches.Hour)
    #sort matches by time differential then distance
    matches = matches.sort_values(['Time_Diff', 'LL_Dist'], ascending=[True, True])
    #to make things simple, we will take the match with the closest hour and least distance
    #this may lead to some small inaccuracy, but the missing values are few, so it should be ok
    #since the matches dataframe is sorted by hour then distance, the first valid value is used
    #possible missing values are:
    #'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)', 'Weather_Condition'
    def fill_row(col_name):
        nonlocal row
        nonlocal matches
        if pd.isna(row[col_name]):
            value = matches[col_name].first_valid_index()
            if value is not None:
                row[col_name] = matches.loc[value][col_name]
    fill_row('Temperature(F)')
    fill_row('Humidity(%)')
    fill_row('Pressure(in)')
    fill_row('Visibility(mi)')
    fill_row('Wind_Direction')
    fill_row('Wind_Speed(mph)')
    fill_row('Weather_Condition')
    return row
df_nan = df[df.isna().any(axis=1)].copy()
for i, row in df_nan.iterrows():
    filled_row = get_missing_row_values(i)
    df.loc[i] = filled_row
check_na()
After filling in missing weather values from other entries on the same day, most of the values have been filled in, but a small number are still missing. Next, I will take a look at the rows that are still missing values alongside other entries from those same dates.
df_nan = df[df.isna().any(axis=1)].copy()
display(df_nan.groupby(['Year', 'Month', 'Day']).size())
The missing data is spread out over five separate days. Next, let's take a look at other data from those days to see why no values were filled in. Note: I will use head() to display part of the data for brevity.
display(df.groupby(['Year', 'Month', 'Day']).get_group((2016, 7, 17)))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2016, 10, 29)))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2018, 6, 29)).head(5))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2018, 6, 30)).head(5))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2020, 11, 8)).head(5))
For the dates June 29 2018, June 30 2018, and November 8 2020, all weather data is missing. Since I have no access to historical weather data and there are such a small number of values missing, I will just drop those rows.
For the dates July 17 2016 and October 29 2016, only Wind_Speed(mph) is missing. The Wind_Direction on all of those rows happens to be calm, so I will take the mean wind speed of all calm entries to fill in the missing values. First, I will look at the mean values for each category.
df.groupby('Wind_Direction').mean()['Wind_Speed(mph)']
As can be seen above, the Wind_Direction categories have overlapping values such as CALM and Calm, E and East, etc. Next, I will combine these groups with each other and fill in the missing Wind_Speed(mph) mean values.
df.replace({'CALM' : 'Calm', 'East' : 'E', 'North' : 'N', 'South' : 'S', 'VAR' : 'Variable', 'West' : 'W'}, inplace=True)
ws_mean = df.groupby('Wind_Direction').mean()['Wind_Speed(mph)'].loc['Calm'].round(1)
ws_index = df.groupby(['Year', 'Month', 'Day']).get_group((2016, 7, 17)).index
df.loc[ws_index, 'Wind_Speed(mph)'] = ws_mean
display(df.loc[ws_index])
ws_index = df.groupby(['Year', 'Month', 'Day']).get_group((2016, 10, 29)).index
df.loc[ws_index, 'Wind_Speed(mph)'] = ws_mean
display(df.loc[ws_index])
As noted above, all weather data is missing for June 29 2018, June 30 2018, and November 8 2020. The only remaining NaN values fall on these dates, so I will drop all rows that still contain NaN values.
#drop the rows that still contain missing values
df = df.dropna()
To use the data set in learning algorithms, the data values must be converted to numerical form. There are several columns of True and False values. There are also several columns of Day and Night values. These values will be converted to 1 and 0.
#the boolean point-of-interest columns
tf_columns = ['Amenity', 'Bump', 'Crossing', 'Give_Way',
              'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',
              'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
#convert the True/False POI columns to 1/0
df[tf_columns] = df[tf_columns].replace({False : 0, True : 1})
#convert the Day/Night columns (Sunrise_Sunset and the twilight columns) to 1/0
df.replace({'Day' : 1, 'Night' : 0}, inplace=True)
Now the categorical values need to be converted to numerical values. The Wind_Direction column has values such as N, NE, NNE, etc. that need to be converted. The Weather_Condition column will be skipped for now; I plan to look at weather in one of the plots, and values such as Rainy, Windy, Clear, etc. will be more useful there as text.
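One thing to note is that cat.codes assigns integer codes in the alphabetical order of the categories, so it can be worth recording the mapping before converting (an optional step):
#optional: capture the category -> code mapping before the conversion below
wind_dir_mapping = dict(enumerate(df['Wind_Direction'].astype('category').cat.categories))
print(wind_dir_mapping)  #e.g. {0: 'Calm', 1: 'E', ...}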
df['Wind_Direction'] = df['Wind_Direction'].astype('category').cat.codes
Some of the columns have no variance. For example, a column that contains all 0s or all 1s isn't useful to a model, so those columns should be dropped.
drop_no_variance = []
for col in df.columns:
    if len(df[col].value_counts()) <= 1:
        drop_no_variance.append(col)
        display(df[col].value_counts())
df = df.drop(drop_no_variance, axis=1)
The data has been cleaned and is ready to be plotted and fed into a learning algorithm. The output below confirms there are no missing values and shows the newly cleaned data set.
df = df.reset_index(drop=True)
display(df.isna().sum())
display(df)
df.to_csv('clean_atlanta_accidents.csv', index=False)
In this section, I'll take a deeper look at the data. Several plots will be made to get an idea of the traffic accident patterns.
I will be looking at data from the years 2017, 2018, and 2019. Since 2016 is an incomplete year, it won't be considered. I also won't be considering the year 2020 since COVID-19 was reported to have a significant effect on Atlanta traffic. However, in the last section of this project, I will take a look at the data and make some comparisons between 2020 and the other three years.
So, the first step is to separate out the 2020 data and remove the years that won't be used.
df_covid = df[df['Year'] == 2020].copy().reset_index(drop=True)
df.drop(df.loc[df['Year'] == 2016].index, inplace=True)
df.drop(df.loc[df['Year'] == 2020].index, inplace=True)
df = df.reset_index(drop=True)
display(df.head())
display(df_covid.head())
Next, we'll take a look at the accident density for 2017, 2018, and 2019. Each year will be plotted separately using a heatmap of the accident locations. This should give us an idea of any hot spots in the area. Having lived in Atlanta for several years before recently moving back, I would expect the hotspots to be on the interstates around the city. The I-285 perimeter should be fairly heavy. I-20, I-75, and I-85 should also be very heavy, with an even larger hotspot around the downtown connector.
def plot_yearly_heatmap():
    zmin = 0.9
    zmax = 5.1
    radius = 2
    mapbox_style = 'light'
    df_2017 = df[df['Year'] == 2017]
    df_2018 = df[df['Year'] == 2018]
    df_2019 = df[df['Year'] == 2019]
    fig = make_subplots(rows=1, cols=3, specs=[[dict(type='mapbox'), dict(type='mapbox'), dict(type='mapbox')]],
                        subplot_titles=('2017 Traffic Accidents<br>({} Accidents)'.format(df_2017.shape[0]),
                                        '2018 Traffic Accidents<br>({} Accidents)'.format(df_2018.shape[0]),
                                        '2019 Traffic Accidents<br>({} Accidents)'.format(df_2019.shape[0])),
                        vertical_spacing=0.05, horizontal_spacing=0.01)
    fig.add_trace(go.Densitymapbox(lat=df_2017['Start_Lat'], lon=df_2017['Start_Lng'], z=[1] * df_2017.shape[0], radius=radius, colorscale='Turbo',
                                   colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=1)
    fig.add_trace(go.Densitymapbox(lat=df_2018['Start_Lat'], lon=df_2018['Start_Lng'], z=[1] * df_2018.shape[0], radius=radius, colorscale='Turbo',
                                   colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=2)
    fig.add_trace(go.Densitymapbox(lat=df_2019['Start_Lat'], lon=df_2019['Start_Lng'], z=[1] * df_2019.shape[0], radius=radius, colorscale='Turbo',
                                   colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=3)
    fig.update_layout(width=900, height=450, showlegend=False,
                      mapbox=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style),
                      mapbox2=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style),
                      mapbox3=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style))
    #fig.write_html('yearly_heatmap.html')
    fig.show()
plot_yearly_heatmap()