Accidents and Traffic in Atlanta

What is my goal for this project?

As someone studying machine learning and data science, I didn't really have a good portfolio of projects to show. I had some experience working on this type of stuff in graduate school, and I have completed several online machine learning and data science courses, but I didn't have many projects to show for that.

My initial idea for this project was to browse the data sets on Kaggle and pick one that seemed interesting. I would then look at the data, train a classifier or regressor to do some predictions, and put together some written thoughts on it. I didn't really have a plan when I picked the dataset, I just wanted to mess around with it and see what came to me.

What dataset did I choose?

After spending some time browsing through various data sets, I ended up choosing US Accidents (4.2 million records) A Countrywide Traffic Accident Dataset (2016 - 2020). There is an accopanying paper for this data set located at arxiv.org. Using this data set, I narrowed the scope of the data down to the Atlanta, Georgia area where I recently moved.

About the Data:

According to the author, the data was collected in real time using multiple traffic APIs. The majority of the data, approximately 63% and 36%, came from Mapquest and Bing repsectively. The data was collected from February 2016 until December 2020. The author of the data set made some comments on the validity of the data here. Overall, the author believes that this data set is a subset of the entire accidents in the United States. The author also discussed a change in collection techniques for MapQuest after August 2017 here.

Given the above information, while the data may not be complete, it should be enough to give an interesting view into the traffic around the Atlanta area.

The following table is a description of the column values as per the author's website.

Attribute Description
ID This is a unique identifier of the accident record.
Source Indicates source of the accident report (i.e. the API which reported the accident.).
TMC A traffic accident may have a Traffic Message Channel (TMC) code which provides more detailed description of the event.
Severity Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).
Start_Time Shows start time of the accident in local time zone.
End_Time Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed.
Start_Lat Shows latitude in GPS coordinate of the start point.
Start_Lng Shows longitude in GPS coordinate of the start point.
End_Lat Shows latitude in GPS coordinate of the end point.
End_Lng Shows longitude in GPS coordinate of the end point.
Distance(mi) The length of the road extent affected by the accident.
Description Shows natural language description of the accident.
Number Shows the street number in address field.
Street Shows the street name in address field.
Side Shows the relative side of the street (Right/Left) in address field.
City Shows the city in address field.
County Shows the county in address field.
State Shows the state in address field.
Zipcode Shows the zipcode in address field.
Country Shows the country in address field.
Timezone Shows timezone based on the location of the accident (eastern, central, etc.).
Airport_Code Denotes an airport-based weather station which is the closest one to location of the accident.
Weather_Timestamp Shows the time-stamp of weather observation record (in local time).
Temperature(F) Shows the temperature (in Fahrenheit).
Wind_Chill(F) Shows the wind chill (in Fahrenheit).
Humidity(%) Shows the humidity (in percentage).
Pressure(in) Shows the air pressure (in inches).
Visibility(mi) Shows visibility (in miles).
Wind_Direction Shows wind direction.
Wind_Speed(mph) Shows wind speed (in miles per hour).
Precipitation(in) Shows precipitation amount in inches, if there is any.
Weather_Condition Shows the weather condition (rain, snow, thunderstorm, fog, etc.)
Amenity A amenity in a nearby location.
Bump A POI annotation which indicates presence of speed bump or hump in a nearby location.
Crossing A POI annotation which indicates presence of crossing in a nearby location.
Give_Way A POI annotation which indicates presence of give_way in a nearby location.
Junction A POI annotation which indicates presence of junction in a nearby location.
No_Exit A POI annotation which indicates presence of no_exit in a nearby location.
Railway A POI annotation which indicates presence of railway in a nearby location.
Roundabout A POI annotation which indicates presence of roundabout in a nearby location.
Station A POI annotation which indicates presence of station in a nearby location.
Stop A POI annotation which indicates presence of stop in a nearby location.
Traffic_Calming A POI annotation which indicates presence of traffic_calming in a nearby location.
Traffic_Signal A POI annotation which indicates presence of traffic_signal in a nearby location.
Turning_Loop A POI annotation which indicates presence of turning_loop in a nearby location.
Sunrise_Sunset Shows the period of day (i.e. day or night) based on sunrise/sunset.
Civil_Twilight Shows the period of day (i.e. day or night) based on civil twilight.
Nautical_Twilight Shows the period of day (i.e. day or night) based on nautical twilight.
Astronomical_Twilight Shows the period of day (i.e. day or night) based on astronomical twilight.

What is the project?

For this project, I will be looking at plots of traffic accidents and the impact on traffic when there is an auto accident. The impacts on traffic in this data set are rated by severity from 1 to 4 with 1 meaning the accident caused minimal traffic delays and 4 meaning an accident had an extreme effect on traffic delays.

The data set doesn't define a clear timetable for the severity of the delays. However, the author posted some approximate estimates here. Delays are estimated at the following.

Severity Time
1 2m 30s
2 3m 15s
3 8m
4 18m


To examine the data, I will plot several different types of traffic maps. Next, I will divide the data into train and test sets in order to train a severity classifier for accidents that will predict the traffic delay given an accidents location and relevant information.

To look at the data and train the classifier I will use the years 2017 through 2019. I will discard 2016 since data from the first part of the year is missing. I will also not consider the year 2020 in the first part of the project since various COVID measures had a serious impact on traffic in Atlanta. In the last part of the project, I will compare the 2020 data to the previous years to see how the data shows COVID affected traffic delays and accidents.

In [2]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

import json

with open('credentials.json') as f:
    json_data = json.load(f)
    mapbox_key = json_data['mapbox_key']
    
random_state = 0

import plotly.io as pio
pio.renderers.default = "notebook"

Cleaning and processing the data

The first step is to read in the data and take a look at it. The data contains 49 columns of information and 4,232,541 rows of accident entries. The columns contain varying information on location, weather, dates, times, traffic impact, source, and other information. The exact explanation of each column can be found here.

In [3]:
df_all = pd.read_csv('US_Accidents_Dec20.csv')
df_all
Out[3]:
ID Source TMC Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) Description Number Street Side City County State Zipcode Country Timezone Airport_Code Weather_Timestamp Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Precipitation(in) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 A-1 MapQuest 201.0 3 2016-02-08 05:46:00 2016-02-08 11:00:00 39.865147 -84.058723 NaN NaN 0.010 Right lane blocked due to accident on I-70 Eas... NaN I-70 E R Dayton Montgomery OH 45424 US US/Eastern KFFO 2016-02-08 05:58:00 36.9 NaN 91.0 29.68 10.0 Calm NaN 0.02 Light Rain False False False False False False False False False False False False False Night Night Night Night
1 A-2 MapQuest 201.0 2 2016-02-08 06:07:59 2016-02-08 06:37:59 39.928059 -82.831184 NaN NaN 0.010 Accident on Brice Rd at Tussing Rd. Expect del... 2584.0 Brice Rd L Reynoldsburg Franklin OH 43068-3402 US US/Eastern KCMH 2016-02-08 05:51:00 37.9 NaN 100.0 29.65 10.0 Calm NaN 0.00 Light Rain False False False False False False False False False False False False False Night Night Night Day
2 A-3 MapQuest 201.0 2 2016-02-08 06:49:27 2016-02-08 07:19:27 39.063148 -84.032608 NaN NaN 0.010 Accident on OH-32 State Route 32 Westbound at ... NaN State Route 32 R Williamsburg Clermont OH 45176 US US/Eastern KI69 2016-02-08 06:56:00 36.0 33.3 100.0 29.67 10.0 SW 3.5 NaN Overcast False False False False False False False False False False False True False Night Night Day Day
3 A-4 MapQuest 201.0 3 2016-02-08 07:23:34 2016-02-08 07:53:34 39.747753 -84.205582 NaN NaN 0.010 Accident on I-75 Southbound at Exits 52 52B US... NaN I-75 S R Dayton Montgomery OH 45417 US US/Eastern KDAY 2016-02-08 07:38:00 35.1 31.0 96.0 29.64 9.0 SW 4.6 NaN Mostly Cloudy False False False False False False False False False False False False False Night Day Day Day
4 A-5 MapQuest 201.0 2 2016-02-08 07:39:07 2016-02-08 08:09:07 39.627781 -84.188354 NaN NaN 0.010 Accident on McEwen Rd at OH-725 Miamisburg Cen... NaN Miamisburg Centerville Rd R Dayton Montgomery OH 45459 US US/Eastern KMGY 2016-02-08 07:53:00 36.0 33.3 89.0 29.65 6.0 SW 3.5 NaN Mostly Cloudy False False False False False False False False False False False True False Day Day Day Day
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4232536 A-4239402 Bing NaN 2 2019-08-23 18:03:25 2019-08-23 18:32:01 34.002480 -117.379360 33.99888 -117.37094 0.543 At Market St - Accident. NaN Pomona Fwy E R Riverside Riverside CA 92501 US US/Pacific KRAL 2019-08-23 17:53:00 86.0 86.0 40.0 28.92 10.0 W 13.0 0.00 Fair False False False False False False False False False False False False False Day Day Day Day
4232537 A-4239403 Bing NaN 2 2019-08-23 19:11:30 2019-08-23 19:38:23 32.766960 -117.148060 32.76555 -117.15363 0.338 At Camino Del Rio/Mission Center Rd - Accident. NaN I-8 W R San Diego San Diego CA 92108 US US/Pacific KMYF 2019-08-23 18:53:00 70.0 70.0 73.0 29.39 10.0 SW 6.0 0.00 Fair False False False False False False False False False False False False False Day Day Day Day
4232538 A-4239404 Bing NaN 2 2019-08-23 19:00:21 2019-08-23 19:28:49 33.775450 -117.847790 33.77740 -117.85727 0.561 At Glassell St/Grand Ave - Accident. in the ri... NaN Garden Grove Fwy R Orange Orange CA 92866 US US/Pacific KSNA 2019-08-23 18:53:00 73.0 73.0 64.0 29.74 10.0 SSW 10.0 0.00 Partly Cloudy False False False False True False False False False False False False False Day Day Day Day
4232539 A-4239405 Bing NaN 2 2019-08-23 19:00:21 2019-08-23 19:29:42 33.992460 -118.403020 33.98311 -118.39565 0.772 At CA-90/Marina Fwy/Jefferson Blvd - Accident. NaN San Diego Fwy S R Culver City Los Angeles CA 90230 US US/Pacific KSMO 2019-08-23 18:51:00 71.0 71.0 81.0 29.62 10.0 SW 8.0 0.00 Fair False False False False False False False False False False False False False Day Day Day Day
4232540 A-4239406 Bing NaN 2 2019-08-23 18:52:06 2019-08-23 19:21:31 34.133930 -117.230920 34.13736 -117.23934 0.537 At Highland Ave/Arden Ave - Accident. NaN CA-210 W R Highland San Bernardino CA 92346 US US/Pacific KSBD 2019-08-23 20:50:00 79.0 79.0 47.0 28.63 7.0 SW 7.0 0.00 Fair False False False False False False False False False False False False False Day Day Day Day

4232541 rows × 49 columns

Narrowing the data to Atlanta

In order to narrow the data to the Atlanta, Georgia area, I started by retrieving the lattitude and longitude coordinates for Atlanta via Google. The listed coordinates for Atlanta are 33.7490° N, 84.3880° W. To select the area around Atlanta, I opened a map and selected what I thought was a good representation of the city and outer suburbs which is approximately a 35 square mile area. The first step is to retrieve the rows in the data set that correspond to this area.

In [4]:
atlanta_lat = 33.7490
atlanta_lng = -84.3880
radius = 0.3

df = df_all[(df_all['Start_Lat'] <= atlanta_lat + radius) & 
            (df_all['Start_Lat'] >= atlanta_lat - radius) & 
            (df_all['Start_Lng'] >= atlanta_lng - radius) & 
            (df_all['Start_Lng'] <= atlanta_lng + radius)]
df.head()
Out[4]:
ID Source TMC Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) Description Number Street Side City County State Zipcode Country Timezone Airport_Code Weather_Timestamp Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Precipitation(in) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
146255 A-146262 MapQuest 245.0 3 2016-11-30 15:13:44 2016-11-30 17:26:28 33.546177 -84.577347 NaN NaN 0.01 Two lanes blocked due to accident on I-85 Nort... NaN Senoia Rd R Fairburn Fulton GA 30213 US US/Eastern KATL 2016-11-30 15:09:00 63.0 NaN 97.0 29.75 3.0 WSW 9.2 0.05 Rain False False False False False False False False False False False False False Day Day Day Day
146256 A-146263 MapQuest 201.0 3 2016-11-30 15:25:27 2016-11-30 16:54:36 33.766376 -84.527321 NaN NaN 0.01 Restrictions on exit ramp due to accident on I... NaN GA-402 E R Atlanta Fulton GA 30336 US US/Eastern KFTY 2016-11-30 15:09:00 63.0 NaN 90.0 29.73 3.0 SSW 5.8 0.04 Rain False False False False False False False False False False False False False Day Day Day Day
146257 A-146264 MapQuest 229.0 3 2016-11-30 14:42:27 2016-11-30 16:57:07 33.786896 -84.493134 NaN NaN 0.01 Slow traffic due to accident on I-285 Southbou... NaN Donald Lee Hollowell Pkwy NW R Atlanta Fulton GA 30331 US US/Eastern KFTY 2016-11-30 14:40:00 63.0 NaN 90.0 29.73 2.5 SSW 8.1 0.62 Heavy Rain False False False False False False False False False False False False False Day Day Day Day
146258 A-146265 MapQuest 201.0 2 2016-11-30 16:27:58 2016-11-30 16:57:41 33.697849 -84.418266 NaN NaN 0.01 Accident on GA-166 Arthur Langford Pkwy at Syl... NaN Sylvan Rd R Atlanta Fulton GA 30344 US US/Eastern KATL 2016-11-30 16:52:00 63.0 NaN 97.0 29.77 9.0 WSW 10.4 0.01 Overcast False False False False False False False False False False False False False Day Day Day Day
146259 A-146266 MapQuest 201.0 3 2016-11-30 16:14:20 2016-11-30 16:58:59 33.696915 -84.404984 NaN NaN 0.01 Accident on I-75 Southbound at Exits 242 243 I... NaN Arthur Langford Pkwy E R Atlanta Fulton GA 30315 US US/Eastern KATL 2016-11-30 15:52:00 63.0 NaN 97.0 29.70 10.0 SW 8.1 0.13 Light Rain False False False False False False False False False False False False False Day Day Day Day

Removing non accident data

For the column TMC (Traffic Message Channel), it lists a variety of codes used to describe traffic incidents. After referencing all the codes listed in the Atlanta data, there are 140 entries for code 406 which is described as 'entry slip road closed'. This didn't sound like an accident, so I decided to investigate further. Fortunately, the accidents have a text description. Looking through these descriptions, several incidents didn't seem to be accidents at all. For example, several roads were closed due to fallen trees, protests, and various other activities. After looking through them, most of the entries that reference an accident have the text 'accident' in the description. So, I will filter out any rows that don't have that description.

In [5]:
to_remove = ~df[df['TMC'] == 406]['Description'].str.contains('accident')
df = df.drop(df[df['TMC'] == 406][to_remove].index)

Removing columns that won't be used

The data set contains certain columns that won't be used for mapping and predictions.

In the paper, the original traffic accident data only contained GPS data. The author used a tool that converted that GPS data into the data for number, street, side, city, county, state, country, and zip code. Given that the original data didn't contain this information, I decided to stick with the original lattitude and longitude coordinates to do mapping and prediction.

For the columns End_Lat and End_Lng, many values were missing, so I decided to use the starting coordinates. Distance(mi) is also a function of Start - End, so I removed it as well.

The columns ID and Source were irrelevant, so I removed those.

The column TMC was used to clean in a previous step, so it is no longer needed. Also, TMC codes may not be available at the time of an accident to predict the delay.

The column End_Time will be removed since the Severity column will be used as a target. Knowing the end time of an accident isn't something that would be known beforehand in prediction.

The column Description will be removed because it is a text description of the accident, and also wouldn't be known beforehand.

The column Timezone will be irrelevant since the entire area is in the same timezone.

The column Airport_Code will be removed since it gives no localized information on the accident.

The column Weather_Timestamp will be removed since the time of the last weather update can't provide additional information. After some research, I found several websites that have historical weather APIs, but they all charged a fee which I'm avoiding for this project. So, checking a timestamp to see if more accurate weather data is available can't be accomplished.

This leaves the following columns:

In [6]:
df = df.drop(['ID', 'Source', 'TMC', 'End_Time', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'Number', 'Street', 
              'Side', 'City', 'State', 'Zipcode', 'Country', 'Timezone', 'Airport_Code', 'Weather_Timestamp', 'County'], axis=1)
df.columns
Out[6]:
Index(['Severity', 'Start_Time', 'Start_Lat', 'Start_Lng', 'Temperature(F)',
       'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)',
       'Wind_Direction', 'Wind_Speed(mph)', 'Precipitation(in)',
       'Weather_Condition', 'Amenity', 'Bump', 'Crossing', 'Give_Way',
       'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',
       'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop', 'Sunrise_Sunset',
       'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight'],
      dtype='object')

Converting the timestamp to columns

For this project, I'm going to plot several charts by time, I'm going to convert the timestamp into it's own separate columns for simplicity. Columns will be made for Year, Month, Day, Hour, and DayOfWeek. The hours will be rounded to the nearest hour. DayOfWeek will be in the format of 0 to 6 where 0 is Monday.

In [7]:
#convert to datetime format
df['Start_Time'] = pd.to_datetime(df['Start_Time'], infer_datetime_format=True)
#round to nearest hour
df['Start_Time'] = df['Start_Time'].dt.round("H")
#create day of week
df.insert(2, 'DayOfWeek', '')
df['DayOfWeek'] = df['Start_Time'].dt.weekday
#create an hour of the day column
df.insert(2, 'Hour', '')
df['Hour'] = df['Start_Time'].dt.hour
#create day of week column
df.insert(2, 'Day', '')
df['Day'] = df['Start_Time'].dt.day
#create month column
df.insert(2, 'Month', '')
df['Month'] = df['Start_Time'].dt.month
#create year column
df.insert(2, 'Year', '')
df['Year'] = df['Start_Time'].dt.year
df = df.drop(['Start_Time'], axis=1)
display(df.head())
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Precipitation(in) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
146255 3 2016 11 30 15 2 33.546177 -84.577347 63.0 NaN 97.0 29.75 3.0 WSW 9.2 0.05 Rain False False False False False False False False False False False False False Day Day Day Day
146256 3 2016 11 30 15 2 33.766376 -84.527321 63.0 NaN 90.0 29.73 3.0 SSW 5.8 0.04 Rain False False False False False False False False False False False False False Day Day Day Day
146257 3 2016 11 30 15 2 33.786896 -84.493134 63.0 NaN 90.0 29.73 2.5 SSW 8.1 0.62 Heavy Rain False False False False False False False False False False False False False Day Day Day Day
146258 2 2016 11 30 16 2 33.697849 -84.418266 63.0 NaN 97.0 29.77 9.0 WSW 10.4 0.01 Overcast False False False False False False False False False False False False False Day Day Day Day
146259 3 2016 11 30 16 2 33.696915 -84.404984 63.0 NaN 97.0 29.70 10.0 SW 8.1 0.13 Light Rain False False False False False False False False False False False False False Day Day Day Day

Checking for missing data

Before the data can be used, any missing data must be dealt with. The first step is to identify which values are missing an how many.

In [8]:
def check_na():
    return pd.merge(left=df.isna().sum()[df.isna().sum() != 0].rename('null counts'), 
                    right=df.isnull().mean()[df.isnull().mean() != 0].rename('percentage').round(4)*100, 
                    left_index=True, right_index=True).sort_values(ascending=False, by=['percentage'])
check_na()
Out[8]:
null counts percentage
Wind_Chill(F) 39253 56.60
Precipitation(in) 38395 55.36
Wind_Speed(mph) 8059 11.62
Humidity(%) 338 0.49
Temperature(F) 317 0.46
Wind_Direction 281 0.41
Visibility(mi) 265 0.38
Pressure(in) 239 0.34
Weather_Condition 237 0.34

There are nine columns with missing values. For Wind_Chill(F) and Precipitation(in), over half of the values are missing. Since I don't have access to historical weather data in order to fill those in, those two columns will be completely dropped. Wind_Chill(F) is a computation of temperature and wind speed, so that information should still be encoded in the data. While Precepitation(in) would probably be a good indicator, there is just too much information to fill in without having access to an outside data source.

For the remaining missing values, it is possible to fill in some data from surrounding values. For a missing value, we can check another entry on that same date in the same area and use that value. For example, if Weather_Condition is missing from an entry with lat/lng x1 y1 on date Jan 1 2000 at 8am, and we have another entry with lat/lng x2 y2 on Jan 1 2000 at 8am, it's reasonable that those would have the same Weather_Condition or other weather values.

In [9]:
df = df.drop(['Wind_Chill(F)', 'Precipitation(in)'], axis=1)

def get_missing_row_values(row_id):
    #attempts to fill in the missing value from a row with other incidents that happened on the same day
    row = df.loc[row_id].copy()
    #find all entries that happen on that day
    columns = ['Year', 'Month', 'Day']
    values = [row.Year, row.Month, row.Day]
    matches = df[(df[columns] == values).all(1)].copy()
    #drop duplicate row that was passed in
    matches = matches.drop(row_id)
    matches.insert(6, 'LL_Dist', '')
    #compute distance between the coordinates
    matches['LL_Dist'] = (np.sqrt((row.Start_Lat - matches.Start_Lat)**2 + (row.Start_Lng - matches.Start_Lng)**2))
    matches.insert(6, 'Time_Diff', '')
    #compute time between the accidents
    matches['Time_Diff'] = abs(row.Hour - matches.Hour)
    #sort matches by time differential then distance 
    matches = matches.sort_values(['Time_Diff', 'LL_Dist'], ascending=[True, True])
    #to make things simple, we will take the match that is closest hour and least distance
    #this may lead to some small innacuracy, but the missing values are few, so it should be ok
    #since the match dataframe is sorted by hour then distance, the first matching value should be used
    #possible missing values are:
    #'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)', 'Weather_Condition'
    def fill_row(col_name):
        nonlocal row
        nonlocal matches
        if pd.isna(row[col_name]): 
            value = matches[col_name].first_valid_index()
            if value:
                row[col_name] = matches.loc[value][col_name]
    fill_row('Temperature(F)')
    fill_row('Humidity(%)')
    fill_row('Pressure(in)')
    fill_row('Visibility(mi)')
    fill_row('Wind_Direction')
    fill_row('Wind_Speed(mph)')
    fill_row('Weather_Condition')
    return row

df_nan = df[df.isna().any(axis=1)].copy()
for i, row in df_nan.iterrows():
    filled_row = get_missing_row_values(i)
    df.loc[i] = filled_row
    
check_na()
Out[9]:
null counts percentage
Wind_Speed(mph) 144 0.21
Temperature(F) 136 0.20
Humidity(%) 136 0.20
Pressure(in) 136 0.20
Visibility(mi) 136 0.20
Wind_Direction 136 0.20
Weather_Condition 136 0.20

After filling in missing weather values from nearby dates, most of the values have been filled in. There are still a small number missing. Next, I will take a look at the values that are missing and other values from those same dates.

In [10]:
df_nan = df[df.isna().any(axis=1)].copy()
display(df_nan.groupby(['Year', 'Month', 'Day']).size())
Year  Month  Day
2016  7      17      1
      10     29      7
2018  6      29     72
             30     16
2020  11     8      48
dtype: int64

The missing data is spread out over five separate days. Next, let's take a look at other data from those days to see why no values were filled in. Note: I will use head() to display part of the data for brevity.

In [11]:
display(df.groupby(['Year', 'Month', 'Day']).get_group((2016, 7, 17)))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2016, 10, 29)))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2018, 6, 29)).head(5))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2018, 6, 30)).head(5))
display(df.groupby(['Year', 'Month', 'Day']).get_group((2020, 11, 8)).head(5))
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
2809500 4 2016 7 17 0 6 33.868 -84.28513 72.0 91.0 30.11 10.0 Calm NaN Clear False False False False False False False False False False False False False Night Night Night Night
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
150475 3 2016 10 29 19 5 33.917973 -84.338135 73.0 46.0 30.14 10.0 Calm NaN Clear False False False False False False False False False False False False False Night Night Day Day
150476 3 2016 10 29 19 5 33.937080 -84.158661 68.0 56.0 30.18 10.0 Calm NaN Clear False False False False False False False False False False False False False Night Night Day Day
150477 3 2016 10 29 20 5 33.912239 -84.207817 66.9 59.0 30.17 10.0 Calm NaN Clear False False False False False False False False False False False False False Night Night Night Night
150478 3 2016 10 29 22 5 33.699013 -84.266167 70.0 57.0 30.18 10.0 Calm NaN Clear False False False False False False False False False False False False False Night Night Night Night
150479 3 2016 10 29 22 5 33.827091 -84.252602 61.0 75.0 30.18 10.0 Calm NaN Clear False False False False False False False False False False False False False Night Night Night Night
150480 3 2016 10 29 22 5 33.821548 -84.359383 60.1 78.0 30.19 9.0 Calm NaN Clear False False False False False False False False False False False False False Night Night Night Night
150481 3 2016 10 29 23 5 33.864868 -84.439674 62.8 72.0 30.20 10.0 Calm NaN Clear False False False False False False False False False False False False False Night Night Night Night
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
2094112 2 2018 6 29 4 4 33.849075 -84.430061 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False True False Night Night Night Night
2094116 3 2018 6 29 4 4 33.711597 -84.217415 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Night Night Night Night
2094124 3 2018 6 29 6 4 33.746490 -84.430565 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Night Day Day Day
2094125 3 2018 6 29 6 4 33.742832 -84.403595 NaN NaN NaN NaN NaN NaN NaN False False False False True False False False False False False False False Night Day Day Day
2094136 2 2018 6 29 7 4 33.823612 -84.367332 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False True False Day Day Day Day
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
1983936 3 2018 6 30 11 5 34.037842 -84.562691 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Day Day Day Day
1983948 3 2018 6 30 13 5 33.703480 -84.170181 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Day Day Day Day
1983949 3 2018 6 30 13 5 33.928226 -84.176018 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Day Day Day Day
1983950 3 2018 6 30 14 5 33.793625 -84.392815 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Day Day Day Day
1983955 3 2018 6 30 15 5 33.793625 -84.392815 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Day Day Day Day
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
580121 2 2020 11 8 12 6 33.852207 -84.369438 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False True False Day Day Day Day
580123 3 2020 11 8 15 6 33.745125 -84.390442 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Day Day Day Day
580125 3 2020 11 8 15 6 33.760025 -84.493500 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Day Day Day Day
580128 4 2020 11 8 17 6 33.834286 -84.250153 NaN NaN NaN NaN NaN NaN NaN False False False False False False False False False False False False False Day Day Day Day
580130 3 2020 11 8 18 6 33.797981 -84.395782 NaN NaN NaN NaN NaN NaN NaN False False False False True False False False False False False False False Day Day Day Day

For the dates June 29 2018, June 30 2018, and November 8 2020, all weather data is missing. Since I have no access to historical weather data and there are such a small number of values missing, I will just drop those rows.

For the dates July 17 2016 and October 29 2016 only the Wind_Speed(mph) is missing. The Wind_Direction on all of those days happen to be calm, so I will take the mean of all calm days to fill in these missing values. First, I will look at the mean values for each category.

In [12]:
df.groupby('Wind_Direction').mean()['Wind_Speed(mph)']
Out[12]:
Wind_Direction
CALM         0.000000
Calm         4.953956
E            8.261937
ENE          8.385809
ESE          7.376227
East         8.070343
N            6.769912
NE           7.790666
NNE          6.377500
NNW          9.618584
NW          10.239534
North        7.409104
S            7.427520
SE           6.816618
SSE          7.167884
SSW          7.515848
SW           7.628337
South        7.583828
VAR          4.663056
Variable     4.815705
W            8.767478
WNW          9.799287
WSW          7.646964
West         8.538314
Name: Wind_Speed(mph), dtype: float64

As can be seen above, the Wind_Direction categories have overlapping values such as CALM and Calm, E and East, etc. Next, I will combine these groups with each other and fill in the missing Wind_Speed(mph) mean values.

In [13]:
df.replace({'CALM' : 'Calm', 'East' : 'E', 'North' : 'N', 'South' : 'S', 'VAR' : 'Variable', 'West' : 'W'}, inplace=True)

ws_mean = df.groupby('Wind_Direction').mean()['Wind_Speed(mph)'].loc['Calm'].round(1)

ws_index = df.groupby(['Year', 'Month', 'Day']).get_group((2016, 7, 17)).index
df.loc[ws_index, 'Wind_Speed(mph)'] = ws_mean
display(df.loc[ws_index])

ws_index = df.groupby(['Year', 'Month', 'Day']).get_group((2016, 10, 29)).index
df.loc[ws_index, 'Wind_Speed(mph)'] = ws_mean
display(df.loc[ws_index])
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
2809500 4 2016 7 17 0 6 33.868 -84.28513 72.0 91.0 30.11 10.0 Calm 3.1 Clear False False False False False False False False False False False False False Night Night Night Night
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
150475 3 2016 10 29 19 5 33.917973 -84.338135 73.0 46.0 30.14 10.0 Calm 3.1 Clear False False False False False False False False False False False False False Night Night Day Day
150476 3 2016 10 29 19 5 33.937080 -84.158661 68.0 56.0 30.18 10.0 Calm 3.1 Clear False False False False False False False False False False False False False Night Night Day Day
150477 3 2016 10 29 20 5 33.912239 -84.207817 66.9 59.0 30.17 10.0 Calm 3.1 Clear False False False False False False False False False False False False False Night Night Night Night
150478 3 2016 10 29 22 5 33.699013 -84.266167 70.0 57.0 30.18 10.0 Calm 3.1 Clear False False False False False False False False False False False False False Night Night Night Night
150479 3 2016 10 29 22 5 33.827091 -84.252602 61.0 75.0 30.18 10.0 Calm 3.1 Clear False False False False False False False False False False False False False Night Night Night Night
150480 3 2016 10 29 22 5 33.821548 -84.359383 60.1 78.0 30.19 9.0 Calm 3.1 Clear False False False False False False False False False False False False False Night Night Night Night
150481 3 2016 10 29 23 5 33.864868 -84.439674 62.8 72.0 30.20 10.0 Calm 3.1 Clear False False False False False False False False False False False False False Night Night Night Night

For the dates June 29 2018, June 30 2018, and November 8 2020, all weather data is missing. Since I have no access to historical weather data and there are such a small number of values missing, I will just drop those rows. The only remaining NaN values fall on these dates, so I will just drop all NaN rows.

In [14]:
df = df.drop(df_nan.index)

Converting binary values

To use the data set in learning algorithms, the data values must be converted to numerical form. There are several columns of True and False values. There are also several columns of Day and Night values. These values will be converted to 1 and 0.

In [15]:
tf_columns = ['Amenity', 'Bump', 'Crossing', 'Give_Way',
       'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',
       'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
df.replace({False : 0, True : 1}, inplace=True)

df.replace({'Day' : 1, 'Night' : 0}, inplace=True)

Converting categorical values

Now categorical values need to be converted to numerical values. In the Wind_Direction column, there are values such as N, NE, NNE, etc that need to be converted. The Weather_Condition column will be skipped for now. I plan to take a look at weather values in one of the plots and values such as Rainy, Windy, Clear, etc will be useful.

In [16]:
df['Wind_Direction'] = df['Wind_Direction'].astype('category').cat.codes

Removing columns with single data values

Some of the columns have no variance. For example, a column that contains all 0s or all 1s isn't really useful to a model, so they should be dropped.

In [17]:
drop_no_variance = []
for col in df.columns:
    if len(df[col].value_counts()) <= 1:
        drop_no_variance.append(col)
        display(df[col].value_counts())
df = df.drop(drop_no_variance, axis=1)
0    69207
Name: Bump, dtype: int64
0    69207
Name: Roundabout, dtype: int64
0    69207
Name: Turning_Loop, dtype: int64

Display and Save the data

The data has been cleaned and is ready to be plotted and fed into a learning algorithm. Below shows there are no missing values and shows the newly cleaned data set.

In [18]:
df = df.reset_index(drop=True)
display(df.isna().sum())
display(df)
df.to_csv('clean_atlanta_accidents.csv', index=False)
Severity                 0
Year                     0
Month                    0
Day                      0
Hour                     0
DayOfWeek                0
Start_Lat                0
Start_Lng                0
Temperature(F)           0
Humidity(%)              0
Pressure(in)             0
Visibility(mi)           0
Wind_Direction           0
Wind_Speed(mph)          0
Weather_Condition        0
Amenity                  0
Crossing                 0
Give_Way                 0
Junction                 0
No_Exit                  0
Railway                  0
Station                  0
Stop                     0
Traffic_Calming          0
Traffic_Signal           0
Sunrise_Sunset           0
Civil_Twilight           0
Nautical_Twilight        0
Astronomical_Twilight    0
dtype: int64
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Crossing Give_Way Junction No_Exit Railway Station Stop Traffic_Calming Traffic_Signal Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 3 2016 11 30 15 2 33.546177 -84.577347 63.0 97.0 29.75 3.0 17 9.2 Rain 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1 3 2016 11 30 15 2 33.766376 -84.527321 63.0 90.0 29.73 3.0 12 5.8 Rain 0 0 0 0 0 0 0 0 0 0 1 1 1 1
2 3 2016 11 30 15 2 33.786896 -84.493134 63.0 90.0 29.73 2.5 12 8.1 Heavy Rain 0 0 0 0 0 0 0 0 0 0 1 1 1 1
3 2 2016 11 30 16 2 33.697849 -84.418266 63.0 97.0 29.77 9.0 17 10.4 Overcast 0 0 0 0 0 0 0 0 0 0 1 1 1 1
4 3 2016 11 30 16 2 33.696915 -84.404984 63.0 97.0 29.70 10.0 13 8.1 Light Rain 0 0 0 0 0 0 0 0 0 0 1 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
69202 2 2019 8 23 18 4 33.665910 -84.344760 83.0 67.0 28.91 10.0 7 7.0 Mostly Cloudy 0 0 0 0 0 0 0 0 0 0 1 1 1 1
69203 2 2019 8 23 21 4 33.920200 -84.320150 80.0 74.0 28.94 10.0 4 5.0 Fair 0 0 0 1 0 0 0 0 0 0 0 0 1 1
69204 2 2019 8 23 19 4 33.803530 -84.249600 85.0 63.0 28.92 10.0 7 5.0 Partly Cloudy 0 0 0 0 0 0 0 0 0 1 1 1 1 1
69205 2 2019 8 23 19 4 33.803350 -84.249240 85.0 63.0 28.92 10.0 7 5.0 Partly Cloudy 0 0 0 0 0 0 0 0 0 1 1 1 1 1
69206 2 2019 8 23 20 4 33.895722 -84.252706 80.0 74.0 28.94 10.0 4 5.0 Fair 0 0 0 0 0 0 0 0 0 0 0 1 1 1

69207 rows × 29 columns

Exploring the data

In this section, I'll take a deeper look at the data. Several plots will be made to get an idea of the traffic accident patterns.

I will be looking at data from the years 2017, 2018, and 2019. Since 2016 is an incomplete year, it won't be considered. I also won't be considering the year 2020 since COVID-19 was reported to have a significant effect on Atlanta traffic. However, in the last section of this project, I will take a look at the data and make some comparisons between 2020 and the other three years.

So, the first step is to remove and separate the years of data.

In [19]:
df_covid = df[df['Year'] == 2020].copy().reset_index(drop=True)
df.drop(df.loc[df['Year'] == 2016].index, inplace=True)
df.drop(df.loc[df['Year'] == 2020].index, inplace=True)
df = df.reset_index(drop=True)
display(df.head())
display(df_covid.head())
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Crossing Give_Way Junction No_Exit Railway Station Stop Traffic_Calming Traffic_Signal Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 3 2017 1 2 14 0 33.652431 -84.396278 55.9 100.0 30.11 0.2 6 8.1 Light Rain 0 0 0 1 0 0 0 0 0 0 1 1 1 1
1 3 2017 1 2 14 0 33.744976 -84.390343 55.9 100.0 30.11 0.2 6 8.1 Light Rain 0 0 0 0 0 0 0 0 0 0 1 1 1 1
2 3 2017 1 2 14 0 33.928226 -84.176018 53.1 96.0 30.13 0.5 5 6.9 Light Rain 0 0 0 0 0 0 0 0 0 0 1 1 1 1
3 2 2017 1 2 15 0 33.821548 -84.359383 53.1 96.0 30.12 1.0 14 4.6 Light Rain 0 0 0 0 0 0 0 0 0 0 1 1 1 1
4 3 2017 1 2 15 0 33.843380 -84.487671 53.6 100.0 30.13 1.5 3 6.9 Rain 0 0 0 0 0 0 0 0 0 0 1 1 1 1
Severity Year Month Day Hour DayOfWeek Start_Lat Start_Lng Temperature(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Weather_Condition Amenity Crossing Give_Way Junction No_Exit Railway Station Stop Traffic_Calming Traffic_Signal Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 2 2020 12 31 6 3 33.744804 -84.350777 45.0 97.0 29.00 0.25 3 5.0 Fog 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 3 2020 12 31 13 3 33.758720 -84.379395 55.0 83.0 29.20 7.00 1 6.0 Cloudy 0 0 0 0 0 0 0 0 0 0 1 1 1 1
2 3 2020 12 31 14 3 33.711750 -84.217361 52.0 97.0 29.02 3.00 0 0.0 Cloudy 0 0 0 0 0 0 0 0 0 0 1 1 1 1
3 3 2020 12 31 15 3 33.912098 -84.208755 54.0 93.0 29.01 3.00 1 3.0 Cloudy 0 0 0 0 0 0 0 0 0 0 1 1 1 1
4 3 2020 12 31 16 3 33.797970 -84.392929 57.0 89.0 29.18 6.00 6 3.0 Cloudy 0 0 0 0 0 0 0 0 0 0 1 1 1 1

Looking at traffic density

Next, we'll take a look at the accident density for 2017, 2018, and 2019. Each year will be plotted separately using a heatmap to plot the accident locations. This should give us an idea of any hot spots in the area. Having lived in Atlanta for several years and recently moving back, I would expect the hotspots to be on the interstates around the city. The 285 perimeter should be fairly heavy. I20, I75, and I85 should also be very heavy with an even larger hotspot in the downtown connector area.

In [20]:
def plot_yearly_heatmap():
    zmin = 0.9
    zmax = 5.1
    radius = 2
    mapbox_style = 'light'

    df_2017 = df[df['Year'] == 2017]
    df_2018 = df[df['Year'] == 2018]
    df_2019 = df[df['Year'] == 2019]

    fig = make_subplots(rows=1, cols=3, specs=[[dict(type='mapbox'), dict(type='mapbox'), dict(type='mapbox')]], 
                        subplot_titles=('2017 Traffic Accidents<br>({} Accidents)'.format(df_2017.shape[0]), 
                                        '2018 Traffic Accidents<br>({} Accidents)'.format(df_2018.shape[0]), 
                                        '2019 Traffic Accidents<br>({} Accidents)'.format(df_2019.shape[0])), 
                        vertical_spacing=0.05, horizontal_spacing=0.01)
    
    fig.add_trace(go.Densitymapbox(lat=df_2017['Start_Lat'], lon=df_2017['Start_Lng'], z=[1] * df_2017.shape[0], radius=radius, colorscale='Turbo', 
                                   colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=1)

    fig.add_trace(go.Densitymapbox(lat=df_2018['Start_Lat'], lon=df_2018['Start_Lng'], z=[1] * df_2018.shape[0], radius=radius, colorscale='Turbo', 
                                   colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=2)

    fig.add_trace(go.Densitymapbox(lat=df_2019['Start_Lat'], lon=df_2019['Start_Lng'], z=[1] * df_2019.shape[0], radius=radius, colorscale='Turbo', 
                                   colorbar=dict(tickmode='linear'), zmin=zmin, zmax=zmax), row=1, col=3)

    fig.update_layout(width=900, height=450, showlegend=False,
                      mapbox=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style), 
                      mapbox2=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style), 
                      mapbox3=dict(center=dict(lat=atlanta_lat, lon=atlanta_lng), accesstoken=mapbox_key, zoom=8, style=mapbox_style))

    #fig.write_html('yearly_heatmap.html')
    fig.show()
    
     
plot_yearly_heatmap()