The objective of this project is to perform an extensive analysis of the PulsePoint emergency data and to apply clustering and dimensionality reduction techniques.
The results of the analysis may be beneficial to a variety of business stakeholders.
For example:
PulsePoint is a 911-connected mobile app that allows users to view and receive alerts on calls being responded to by fire departments and emergency medical services. The app's main feature, and where its name comes from, is that it sends alerts to users at the same time that dispatchers are sending the call to emergency crews. The goal is to increase the possibility that a victim in cardiac arrest will receive cardiopulmonary resuscitation (CPR) quickly. The app uses the current location of a user and will alert them if someone in their vicinity is in need of CPR. The app, which interfaces with the local government public safety answering point, will send notifications to users only if the victim is in a public place and only to users that are in the immediate vicinity of the emergency. - Wikipedia
PulsePoint incident logs can be used to identify local patterns of emergencies, helping local businesses as well as emergency agencies stay alert and take precautions, which in the long term supports the social well-being of the community.
The data was collected via web scraping using Python. The logs cover the period from 2021-05-02 to 2021-12-31.
PulsePoint Respond mobile app UI (visual inspection of the data):
NB: This project also serves as my assignment for the course below -
%%capture
!pip install geopandas # geo-plotting
!pip install pdpipe # data pipeline
!pip install yellowbrick # for elbow method
import re
import json
import requests
import urllib
import pandas as pd
import numpy as np
import pdpipe as pdp
# from tqdm import tqdm
from tqdm.auto import tqdm # for notebooks
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas() # https://stackoverflow.com/a/34365537/11105356
from datetime import timedelta, datetime
# data visualization
import folium
import plotly.graph_objects as go
import plotly.express as px
import geopandas
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
from wordcloud import WordCloud
from folium.plugins import MarkerCluster, HeatMap
from geopy.geocoders import Nominatim # reverse geocoding
# data processing and algorithm
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import (KMeans, DBSCAN, OPTICS,
AgglomerativeClustering,
MiniBatchKMeans)
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.decomposition import PCA
from IPython.display import Image, HTML, Markdown
# from IPython.html import widgets
%matplotlib inline
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
plt.rcParams['figure.figsize'] = 12, 8
# utility function to print markdown string
def printmd(string):
    display(Markdown(string))
pd.set_option('display.max_colwidth', None)
SEED = 42
# set the size of the geo bubble
def set_size(value):
    '''
    Takes the numeric value of an attribute to visualize on a map
    (Plotly Geo-Scatter plot) and returns the bubble size for the
    entity whose attribute value was supplied, on a log scale.
    '''
    result = np.log(1 + value)
    if result < 0:  # guard against negative sizes for values in (-1, 0)
        result = 0.1
    return result
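A minimal sanity check of the log scaling (illustrative values, not from the dataset):
assert set_size(0) == 0.0               # log(1) = 0
assert round(set_size(99), 2) == 4.61   # log(100) ≈ 4.61, so a 100x raw difference stays readable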
# API Key
API_KEY_POSITIONSTACK = "YOUR_API_KEY_HERE"
parse_dates=['date_of_incident']
pulse_point_df = pd.read_csv("/content/PulsePoint_local_threats_emergencies.csv",
parse_dates=parse_dates,
skipinitialspace=True)
# to parse datetime column later
# pulse_point_df.date_of_incident = pd.to_datetime(pulse_point_df.date_of_incident)
printmd(f"Dataset has **{pulse_point_df.shape[0]}** rows and **{pulse_point_df.shape[1]}** columns")
Strip Object Columns
This removes noise such as extra whitespace.
For example, the data contains state values such as " CA" and "CA", which would otherwise be identified as separate entities; stripping the strings resolves that issue.
pulse_point_df = pulse_point_df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
pulse_point_df.sort_values(by='date_of_incident')
Data was collected from 2021-05-02 to 2021-12-31
pulse_point_df.columns
Column | Description | Data Type |
---|---|---|
id | Record id | numeric, int |
type | Incident type (recent or active) | object |
title | Title of the incident (e.g., Medical Emergency, Fire) | object |
agency | Agency name (e.g., fire departments, emergency medical services) | object |
location | Location where the incident took place | object |
timestamp_time | Time when the incident record was logged | object |
date_of_incident | Date when the incident record was logged | datetime |
description | Emergency code description (e.g., E53 refers to a fire engine truck) | object |
duration | Duration of the incident | object |
incident_logo | Logo of the incident | object |
agency_logo | Logo of the agency | object |
pulse_point_df.info()
pulse_point_df.dtypes.value_counts()
pulse_point_df.describe(include='object').T
def path_to_image_html(path):
    '''
    Converts an image URL to the '<img src="' + path + '"/>' format.
    Formatting adjustments to control the height, aspect ratio, size, etc.
    can be added inline, as in the example below.
    '''
    return '<img src="' + path + '" style="max-height:124px;"/>'  # option: width="60"
pulse_point_df_short = pulse_point_df.head(10)
HTML(pulse_point_df_short.to_html(escape=False , formatters=dict(incident_logo=path_to_image_html, agency_logo=path_to_image_html)))
def missing_value_describe(data):
    # check missing values in the data
    total = data.isna().sum().sort_values(ascending=False)
    missing_value_pct_stats = (data.isnull().sum() / len(data) * 100)
    missing_value_col_count = sum(missing_value_pct_stats > 0)
    # missing_value_stats = missing_value_pct_stats.sort_values(ascending=False)[:missing_value_col_count]
    missing_data = pd.concat([total, missing_value_pct_stats], axis=1, keys=['Total', 'Percentage(%)'])
    print("Number of rows with at least 1 missing value:", data.isna().any(axis=1).sum())
    print("Number of columns with missing values:", missing_value_col_count)
    if missing_value_col_count != 0:
        # print out column names with missing value percentage
        print("\nMissing percentage (descending):")
        display(missing_data[:missing_value_col_count])
        # plot missing values
        missing = data.isnull().sum()
        missing = missing[missing > 0]
        missing.sort_values(inplace=True)
        missing.plot.bar()
    else:
        print("No missing data!!!")
# pass a dataframe to the function
missing_value_describe(pulse_point_df)
pulse_point_df.drop(['id', 'incident_logo', 'agency_logo'], axis=1, inplace=True)
Active incidents are noisy duplicates of the recent-type incidents that could not be removed during the data collection process; they contribute nothing to the analysis, so they are dropped.
pulse_point_df.type.value_counts()
pulse_point_df.drop(pulse_point_df[pulse_point_df.type == 'active'].index, inplace=True)
pulse_point_df.reset_index(drop=True, inplace=True)
Drop redundant column "type"
pulse_point_df.drop(columns=['type'], axis=1, inplace=True)
pulse_point_df.location
pulse_point_df.location.value_counts().head(10)
There are many variations in the location column, as shown above.
We can split the locations into multiple features:

State
Text after the last comma appears to be the short form of a US state or Canadian province.
CA -> California
OR -> Oregon

City
Text after the second-to-last comma appears to be a city name (or a town or county name).
MEDFORD is a city in Oregon (last example - "E BARNETT RD, MEDFORD, OR")

Address
Apart from the state and city names, the rest is counted as the address feature when there are three comma-separated elements.

Address_2
Apart from the state, city, and address, the rest is counted as the extended address (address_2) feature when there are four comma-separated elements.

Business
A bracket-enclosed string is counted as the business name.
From the examples above, OJAI ARCADE (21002302) and MCMINNVILLE FIRE DEPARTMENT are counted as the business feature.
def get_business_name(location):
    # https://stackoverflow.com/a/38212061/11105356
    stack = 0
    start_index = None
    results = []
    for i, c in enumerate(location):
        if c == '(':
            if stack == 0:
                start_index = i + 1  # string to extract starts one index later
            # push to stack
            stack += 1
        elif c == ')':
            # pop stack
            stack -= 1
            if stack == 0:
                results.append(location[start_index:i])
    try:
        if len(results) == 0:
            return None
        elif len(results) == 1 and len(results[0]) == 1:
            return None
        elif len(results) == 1 and len(results[0]) != 1:
            return results[0].strip()
        elif len(results) > 1 and len(results[0]) == 1:
            return None
        else:
            return results[1].strip()
    except IndexError:
        pass
### handles variations such as -
# 5709 RICHMOND RD, STE 76, JAMES CITY COUNTY, VA (JANIE & JACK)
# 433 SARATOGA RD, SCHENECTADY, NY ((GLENVILLE)EAST GLENVILLE FD)
# I 229 RAMP & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD (I 229 MM 8 NB)
# 6501 MISTY WATERS DR, STE (S)E260 (N), BURLEIGH COUNTY, ND
Example 1 (3 elements):
302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))
address = 302 E OJAI AVE, city = OJAI, state = CA, business = OJAI ARCADE (21002302)
Example 2 (4 elements):
GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA
address = GRASSIE BLVD, address_2 = STE 212, city = WINNIPEG, state = MANITOBA (will be converted to MB later)
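A quick sanity check of get_business_name against the worked examples above (a sketch; the expected values follow the rules described earlier):
assert get_business_name('302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))') == 'OJAI ARCADE (21002302)'
assert get_business_name('GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA') is None  # no bracketed segment
assert get_business_name('I 229 RAMP & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD (I 229 MM 8 NB)') == 'I 229 MM 8 NB'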
# examples
# 302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302)) --- 3 segments with business inside
# 1959 MORSE RD, COLUMBUS, OH (DOLLAR GENERAL)
# I 229 RAMP & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD (I 229 MM 8 NB)
# GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA --- 4 segments
# split location into 3 or 4 parts depending on number of commas ->
# 3 segments : address, city, state
# 4 segments : address, address_2, city, state
# to extract bracket enclosed string
pulse_point_df['business'] = pulse_point_df.location.apply(lambda x : get_business_name(x))
### remove enclosed business name from the location string
pulse_point_location_data = pulse_point_df.apply(lambda row : row['location'].replace(str(row['business']), ''), axis=1)
# remove leftover brackets from the business-name replacement
# https://stackoverflow.com/a/49183590/11105356
# remove a (...) substring with a leading whitespace at the end of the string only
pulse_point_location_data = pulse_point_location_data.str.replace(r"\s*\([^()]*\)$", "", regex=True).str.strip()
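# quick check of the trailing-parenthetical pattern on a sample leftover (illustrative string):
assert re.sub(r"\s*\([^()]*\)$", "", "1959 MORSE RD, COLUMBUS, OH ()") == "1959 MORSE RD, COLUMBUS, OH"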
# split the location
four_col_location_split = ['address', 'address_2', 'city','state']
three_col_location_split = ['address', 'city','state']
# four col indices
# pulse_point_location_data[pulse_point_location_data.str.split(',', expand=True)[3].notna()]
extra_loc_data = pulse_point_location_data.str.split(',', expand=True) # to expand columns
four_col_indices = extra_loc_data[extra_loc_data.apply(lambda x: np.all(pd.notnull(x[3])) , axis = 1)].index
four_col_loc_df = extra_loc_data.iloc[four_col_indices]
four_col_loc_df.columns = four_col_location_split
four_col_loc_df
pulse_point_df.loc[four_col_loc_df.index , four_col_location_split] = four_col_loc_df
pulse_point_df[four_col_location_split] = pulse_point_df[four_col_location_split].apply(lambda x: x.str.strip())
pulse_point_df[four_col_location_split]
# there are far fewer four-segment locations than three-segment ones
four_col_loc_df_mask = extra_loc_data.index.isin(four_col_indices)
three_col_loc_df = extra_loc_data[~four_col_loc_df_mask].drop([3], axis=1)
three_col_loc_df.columns = three_col_location_split
# extra_loc_data[~three_col_loc_df][3].notna().sum() # to check null values
three_col_loc_df
pulse_point_df.loc[three_col_loc_df.index , three_col_location_split] = three_col_loc_df
pulse_point_df[three_col_location_split] = pulse_point_df[three_col_location_split].apply(lambda x: x.str.strip())
pulse_point_df[three_col_location_split]
pulse_point_df[['location','address', 'address_2', 'city','state', 'business']]
missing_value_describe(pulse_point_df[['location','address', 'address_2', 'city','state', 'business']])
pulse_point_df[pulse_point_df.city.isna()]
pulse_point_df = pulse_point_df[pulse_point_df.city.notna()]
mask = ((pulse_point_df.city.isna()) | (pulse_point_df.city==u'') )
display(pulse_point_df[mask])
For these rows the business names are the same as the city names. I first removed the text containing the business name and then performed the text extraction for cities; that is why the city names are blank in cases like these.
Let's replace their blank city names with the business names.
pulse_point_df.loc[mask,'city'] = pulse_point_df[mask].business
display(pulse_point_df.state.value_counts())
printmd(f"**Total {len(pulse_point_df.state.value_counts().index)} States. Some of them are Canadian provinces, ex - MANITOBA**")
# Canadian Province Mapping
# https://www150.statcan.gc.ca/n1/pub/92-195-x/2011001/geo/prov/tbl/tbl8-eng.htm
# https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada
ca_province_dic = {
'Newfoundland and Labrador': 'NL',
'Prince Edward Island': 'PE',
'Nova Scotia': 'NS',
'New Brunswick': 'NB',
'Quebec': 'QC',
'Ontario': 'ON',
'Manitoba': 'MB',
'Saskatchewan': 'SK',
'Alberta': 'AB',
'British Columbia': 'BC',
'Yukon': 'YT',
'Northwest Territories': 'NT',
'Nunavut': 'NU',
}
# approach 1
# def handle_state(data_attr):
# for k, v in canada_provinces_dic.items():
# if data_attr.strip().lower() == k.lower():
# return canada_provinces_dic[k]
# else:
# return data_attr
# pulse_point_df['state'] = pulse_point_df.state.apply(handle_state)
# approach 2
# https://stackoverflow.com/a/69994272/11105356
ca_province_dict = {k.lower():v for k,v in ca_province_dic.items()}
pulse_point_df['state'] = pulse_point_df['state'].str.lower().map(ca_province_dict).fillna(pulse_point_df.state)
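# illustrative behaviour of the map/fillna idiom (hypothetical values):
#   pd.Series(['MANITOBA', 'CA']).str.lower().map(ca_province_dict) -> ['MB', NaN]
#   fillna() then restores the unmatched original 'CA', so US state codes pass through untouched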
# Exception state : example - 'FL #1005' , 'NY EAST GLENVILLE FD', ' DE / RM304'
mask = pulse_point_df.state.apply(lambda x:len(x)>2)
display(pulse_point_df[mask].state)
Keeping only the first segment, which is the short form of the state, and discarding the rest (noise)
pulse_point_df.loc[mask,'state'] = pulse_point_df[mask].state.apply(lambda x: x.split()[0])
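# e.g. 'FL #1005'.split()[0] -> 'FL', 'NY EAST GLENVILLE FD'.split()[0] -> 'NY' (illustrative)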
pulse_point_df.state.value_counts()
# CONCORD
mask = pulse_point_df.state.str.startswith('CONCORD')
display(pulse_point_df[mask])
printmd("**CONCORD should be in CA**")
pulse_point_df.loc[mask,'state'] = 'CA'
pulse_point_df.state.value_counts()
# https://stackoverflow.com/a/57846984/11105356
UNITS = {'s': 'seconds', 'm': 'minutes', 'h': 'hours', 'd': 'days', 'w': 'weeks'}
# days and weeks should never occur in this data, but are supported anyway
def convert_to_seconds(s):
    s = s.replace(" ", "")
    return int(timedelta(**{
        UNITS.get(m.group('unit').lower(), 'seconds'): int(m.group('val'))
        for m in re.finditer(r'(?P<val>\d+)(?P<unit>[smhdw]?)', s, flags=re.I)
    }).total_seconds())
# convert_to_seconds("1 h 34 m")
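# minimal sanity checks (illustrative inputs):
assert convert_to_seconds("1 h 34 m") == 5640  # 1*3600 + 34*60
assert convert_to_seconds("45 s") == 45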
pulse_point_df["duration_in_seconds"] = pulse_point_df.duration.apply(lambda x:convert_to_seconds(x))
pulse_point_df["day_name"], pulse_point_df["weekday"] = pulse_point_df.date_of_incident.dt.day_name(), pulse_point_df.date_of_incident.dt.weekday
pulse_point_df["month_name"] = pulse_point_df.date_of_incident.dt.month_name()
## more features
# pulse_point_df.date_of_incident.dt.month_name()
# pulse_point_df.date_of_incident.dt.month
# pulse_point_df.date_of_incident.dt.day
# pulse_point_df.date_of_incident.dt.day_name()
# pulse_point_df.date_of_incident.dt.weekday
# pulse_point_df.date_of_incident.dt.isocalendar().week
pulse_point_df.tail(40)
I will assign time_of_the_day values based on the time ranges below -
Time of the Day | Range |
---|---|
Morning | 5 AM to 11:59 AM |
Afternoon | 12 PM to 4:59 PM |
Evening | 5 PM to 8:59 PM |
Night | 9 PM to 11:59 PM |
Midnight | 12 AM to 4:59 AM |
# https://stackoverflow.com/a/70018607/11105356
def time_range(time):
    hour = datetime.strptime(time, '%I:%M %p').hour
    if hour > 20:
        return "Night"
    elif hour > 16:
        return "Evening"
    elif hour > 11:
        return "Afternoon"
    elif hour > 4:
        return "Morning"
    else:
        return "Midnight"
pulse_point_df["time_of_the_day"] = pulse_point_df.timestamp_time.apply(lambda time: time_range(time))
# # pulse_point_df.timestamp_time = pd.to_datetime(pulse_point_df.timestamp_time).dt.time
pulse_point_df.to_csv('PulsePoint-emergencies-cleaned.csv', index=False)
A quick overview of the preprocessed data -
The preprocessed dataset contains 5 additional columns extracted from the location column and another 5 columns extracted from the date_of_incident and duration columns. The id, incident_logo, and agency_logo columns from the original dataset were discarded.
Columns | Description | Data Type |
---|---|---|
business | Name of the business place extracted from location(e.g., JANIE & JACK, DOLLAR GENERAL etc.) | object |
address | Address where the incident took place (extracted from location) | object |
address_2 | Extended address where the incident took place (extracted from location) | object |
city | City where the incident took place (extracted from location). It could also be a town or county name | object |
state | State where the incident took place (extracted from location) | object |
duration_in_seconds | Incident duration in seconds (extracted from duration) | numeric, int |
day_name | Name of the day when the incident took place | object |
weekday | The day of the week with Monday=0, Sunday=6 (extracted from date) | numeric, int |
month_name | Name of the month (extracted from date) | object |
time_of_the_day | morning (5AM-11:59AM), afternoon (12PM-4:59 PM), evening (5PM-8:59PM), night (9PM-11:59PM), midnight (12AM-4:59AM) | object |
printmd(f"There are total **{pulse_point_df.shape[0]}** incidents")
pulse_point_df.info()
pulse_point_df.describe().T
pulse_point_df.describe(include='object').T
missing_value_describe(pulse_point_df)
printmd(f"There are total **{len(pulse_point_df.title.unique())}** types of incidents")
pulse_point_df.title.value_counts().head(20)
# crisp wordcloud : https://stackoverflow.com/a/28795577/11105356
data = pulse_point_df.title.value_counts().to_dict()
wc = WordCloud(width=800, height=400,background_color="white", max_font_size=300).generate_from_frequencies(data)
plt.figure(figsize=(14,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()
printmd(f"There are total **{len(pulse_point_df.agency.unique())}** agencies")
# Top agencies by incident engagement count
pulse_point_df.agency.value_counts().head(20)
pulse_point_df.agency.value_counts().head(10).sort_values(ascending=False).plot(kind = 'bar');
Most frequent - Montgomery County
data = pulse_point_df.agency.value_counts().to_dict()
wc = WordCloud(width=800, height=400,background_color="white", max_font_size=300).generate_from_frequencies(data)
plt.figure(figsize=(14,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()