The objective of this project is to conduct an extensive analysis of the PulsePoint emergency data and to apply clustering and dimensionality reduction techniques.
The results of the analysis may be beneficial to a variety of business stakeholders.
For example –
PulsePoint is a 911-connected mobile app that allows users to view and receive alerts on calls being responded to by fire departments and emergency medical services. The app's main feature, and where its name comes from, is that it sends alerts to users at the same time that dispatchers are sending the call to emergency crews. The goal is to increase the possibility that a victim in cardiac arrest will receive cardiopulmonary resuscitation (CPR) quickly. The app uses the current location of a user and will alert them if someone in their vicinity is in need of CPR. The app, which interfaces with the local government public safety answering point, will send notifications to users only if the victim is in a public place and only to users that are in the immediate vicinity of the emergency. - Wikipedia
PulsePoint incident logs can be used to identify local patterns of emergencies, helping local businesses and emergency agencies stay alert and take precautions, which in the long term supports the social well-being of the community.
The data was collected via web scraping using Python. The logs cover the period from 2021-05-02 to 2021-12-31.
PulsePoint Respond mobile app UI (visual inspection of the data):
NB: This project also serves as my assignment for the course below -
%%capture
!pip install geopandas # geo-plotting
!pip install pdpipe # data pipeline
!pip install yellowbrick # for elbow method
import re
import json
import requests
import urllib
import pandas as pd
import numpy as np
import pdpipe as pdp
# from tqdm import tqdm
from tqdm.auto import tqdm # for notebooks
# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas() # https://stackoverflow.com/a/34365537/11105356
from datetime import timedelta, datetime
# data visualization
import folium
import plotly.graph_objects as go
import plotly.express as px
import geopandas
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
from wordcloud import WordCloud
from folium.plugins import MarkerCluster, HeatMap
from geopy.geocoders import Nominatim # reverse geocoding
# data processing and algorithm
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import (KMeans, DBSCAN, OPTICS,
AgglomerativeClustering,
MiniBatchKMeans)
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.decomposition import PCA
from IPython.display import Image, HTML, Markdown
# from IPython.html import widgets
%matplotlib inline
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
plt.rcParams['figure.figsize'] = 12, 8
# utility function to print markdown string
def printmd(string):
display(Markdown(string))
pd.set_option('display.max_colwidth', None)
SEED = 42
# set the size of the geo bubble
def set_size(value):
'''
Takes the numeric value of a parameter to visualize on a map (Plotly geo-scatter plot)
and returns the bubble size for the entity whose numeric attribute value was supplied.
'''
result = np.log(1+value)
if result < 0:
result = 0.1
return result
# API Key
API_KEY_POSITIONSTACK = "YOUR_API_KEY_HERE"
parse_dates=['date_of_incident']
pulse_point_df = pd.read_csv("/content/PulsePoint_local_threats_emergencies.csv",
parse_dates=parse_dates,
skipinitialspace=True)
# to parse datetime column later
# pulse_point_df.date_of_incident = pd.to_datetime(pulse_point_df.date_of_incident)
printmd(f"Dataset has **{pulse_point_df.shape[0]}** rows and **{pulse_point_df.shape[1]}** columns")
Dataset has 361245 rows and 11 columns
Strip Object Columns
This removes noise such as extra whitespace.
For example, some state values appear as both " CA" and "CA", which would be treated as separate entities; stripping resolves this.
pulse_point_df = pulse_point_df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
pulse_point_df.sort_values(by='date_of_incident')
id | type | title | agency | location | timestamp_time | date_of_incident | description | duration | incident_logo | agency_logo | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3569 | recent | Commercial Fire | Suffolk Fire Rescue | 2210 E WASHINGTON ST, SUFFOLK, VA | 12:38 PM | 2021-05-02 | B1 E1 E2 E3 E4 EMS1 L3 L5 M1 R1 R6 SF1 | 1 h 25 m | https://web.pulsepoint.org/assets/images/list/cf_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1344 |
361013 | 3365 | recent | Structure Fire | Alachua/Gainesville | 15270 NW 150TH AVE, STE 3046, ALACHUA, FL | 8:16 PM | 2021-05-02 | DC6 E21 E25 E29 Q23 R21 SQ29 | 23 m | https://web.pulsepoint.org/assets/images/list/sf_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1079 |
361012 | 3364 | recent | Full Assignment | Allegheny County EMS | 428 LINCOLN HIGHLANDS DR, NORTH FAYETTE, PA | 9:10 PM | 2021-05-02 | 1902 | 21 m | https://web.pulsepoint.org/assets/images/list/full_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=599 |
361011 | 3363 | recent | Commercial Fire | Akron Fire | 750 MULL AVE, STE 3D, AKRON, OH | 9:11 PM | 2021-05-02 | AKAT6 AKBC4 AKBC9 AKCH4 AKEN11 AKEN3 AKEN4 AKEN6 AKEN9 AKFI3 AKL4 AKL9 AKM10 AKM12 AKM4 AKM6 AKT10 AMR5 AMR6 | 4 h 35 m | https://web.pulsepoint.org/assets/images/list/cf_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=979 |
361010 | 3362 | recent | Mutual Aid | Allegheny County EMS | 965 BURTNER RD, HARRISON, PA | 9:25 PM | 2021-05-02 | 111 | 1 h 55 m | https://web.pulsepoint.org/assets/images/list/mu_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=599 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
358158 | 361760 | recent | Gas Leak | Fairfax County Fire | 3903I FAIR RIDGE DR, FAIRFAX, VA (MASSAGE ENVY) | 8:14 PM | 2021-12-31 | A440 BC407 E421M E440M TL440M | 33 m | https://web.pulsepoint.org/assets/images/list/gas_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1441 |
358157 | 361759 | recent | Medical Emergency | Escambia Co EMS | ROYCE ST, BRENT, FL | 8:15 PM | 2021-12-31 | M34 | 47 m | https://web.pulsepoint.org/assets/images/list/me_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=874 |
358156 | 361758 | recent | Medical Emergency | Fairfax County Fire | BURKE COMMONS RD, BURKE, VA | 8:18 PM | 2021-12-31 | E432M M432 | 26 m | https://web.pulsepoint.org/assets/images/list/me_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1441 |
358162 | 361764 | recent | Medical Emergency | Fairbanks ECC | MARY ANN ST, FAIRBANKS, AK | 8:08 PM | 2021-12-31 | M2 | 33 m | https://web.pulsepoint.org/assets/images/list/me_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1288 |
358066 | 363743 | active | Medical Emergency | Cosumnes FD | SANTO CT, ELK GROVE, CA | 10:40 PM | 2021-12-31 | E72 E77 M71 | NaN | https://web.pulsepoint.org/assets/images/list/me_list.png | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=551 |
361245 rows × 11 columns
Data was collected from 2021-05-02 to 2021-12-31
pulse_point_df.columns
Index(['id', 'type', 'title', 'agency', 'location', 'timestamp_time', 'date_of_incident', 'description', 'duration', 'incident_logo', 'agency_logo'], dtype='object')
Columns | Description | Data Type |
---|---|---|
id | Contains record id | numeric, int |
type | Incident type (recent or active) | object |
title | Title of the incident (e.g., Medical Emergency, Fire) | object |
agency | Agency name (e.g., fire departments, emergency medical services) | object |
location | Location where the incident took place | object |
timestamp_time | Time when the incident record was logged | object |
date_of_incident | Date when the incident record was logged | datetime |
description | Emergency code description (e.g., E53 - refers to Fire Engine Truck ) | object |
duration | Duration of the incident | object |
incident_logo | Logo of the incident | object |
agency_logo | Logo of the agency | object |
pulse_point_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 361245 entries, 0 to 361244 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 361245 non-null int64 1 type 361245 non-null object 2 title 361245 non-null object 3 agency 361245 non-null object 4 location 361245 non-null object 5 timestamp_time 361245 non-null object 6 date_of_incident 361245 non-null datetime64[ns] 7 description 344622 non-null object 8 duration 281278 non-null object 9 incident_logo 361245 non-null object 10 agency_logo 361245 non-null object dtypes: datetime64[ns](1), int64(1), object(9) memory usage: 30.3+ MB
pulse_point_df.dtypes.value_counts()
object 9 int64 1 datetime64[ns] 1 dtype: int64
pulse_point_df.describe(include='object').T
count | unique | top | freq | |
---|---|---|---|---|
type | 361245 | 2 | recent | 281278 |
title | 361245 | 89 | Medical Emergency | 240433 |
agency | 361245 | 792 | Montgomery County | 7265 |
location | 361245 | 222795 | EUCLID AV, EUCLID, OH | 135 |
timestamp_time | 361245 | 1440 | 6:42 AM | 415 |
description | 344622 | 111286 | E1 | 1334 |
duration | 281278 | 728 | 16 m | 6420 |
incident_logo | 361245 | 89 | https://web.pulsepoint.org/assets/images/list/me_list.png | 240433 |
agency_logo | 361245 | 648 | https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=100 | 6114 |
def path_to_image_html(path):
'''
Converts an image URL into an HTML string of the form
'<img src="..."/>' so that pandas to_html renders the image.
Formatting adjustments (height, width, aspect ratio, etc.) can be
added inline, as in the style attribute below.
'''
return '<img src="' + path + '" style="max-height:124px;"/>' # option : width="60"
pulse_point_df_short = pulse_point_df.head(10)
HTML(pulse_point_df_short.to_html(escape=False , formatters=dict(incident_logo=path_to_image_html, agency_logo=path_to_image_html)))
id | type | title | agency | location | timestamp_time | date_of_incident | description | duration | incident_logo | agency_logo | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3569 | recent | Commercial Fire | Suffolk Fire Rescue | 2210 E WASHINGTON ST, SUFFOLK, VA | 12:38 PM | 2021-05-02 | B1 E1 E2 E3 E4 EMS1 L3 L5 M1 R1 R6 SF1 | 1 h 25 m | (image) | (image) |
1 | 3570 | recent | Fire | Tacoma Fire | 2300 A ST, TACOMA, WA | 11:52 AM | 2021-05-02 | NaN | 1 h 46 m | (image) | (image) |
2 | 3571 | recent | Residential Fire | Tamarac Fire | 4601 NW 30TH TER, TAMARAC, FL | 10:00 AM | 2021-05-02 | BC15 E34 E37 Q110 R278 R34 R37 | 8 m | (image) | (image) |
3 | 3572 | recent | Electrical Fire | Tamarac Fire | 4611 NW 30TH TER, TAMARAC, FL | 9:52 AM | 2021-05-02 | Q78 | 40 m | (image) | (image) |
4 | 3573 | recent | Fire | Tacoma Fire | PUYALLUP AVE & A ST, TACOMA, WA | 9:37 AM | 2021-05-02 | E04 | 18 m | (image) | (image) |
5 | 3574 | recent | Fire | Tacoma Fire | S 24TH ST & A ST, TACOMA, WA | 9:00 AM | 2021-05-02 | E02 | 14 m | (image) | (image) |
6 | 3575 | recent | Commercial Fire | Stafford County Fire | 1287 JEFFERSON DAVIS HWY, FREDERICKSBURG, VA | 5:49 AM | 2021-05-02 | A6 E1 R1U | 8 m | (image) | (image) |
7 | 3576 | recent | Residential Fire | Suffolk Fire Rescue | 101 ROCKLAND TER, SUFFOLK, VA | 5:04 AM | 2021-05-02 | B1 E1 E3 E6 EMS1 EMS2 L3 M3 M6 R1 SF1 | 1 h 6 m | (image) | (image) |
8 | 3577 | active | Residential Fire | Suffolk Co FRES | 1701 AVALON PINES DR, CORAM, NY | 4:27 AM | 2021-05-03 | ?5-06-A | NaN | (image) | (image) |
9 | 3578 | recent | Mutual Aid | WPG Fire Paramedic | DAKOTA ST & MEADOWOOD DR, WINNIPEG, MANITOBA | 1:33 AM | 2021-05-03 | NaN | 2 m | (image) | (image) |
def missing_value_describe(data):
# check missing values in the data
total = data.isna().sum().sort_values(ascending=False)
missing_value_pct_stats = (data.isnull().sum() / len(data)*100)
missing_value_col_count = sum(missing_value_pct_stats > 0)
# missing_value_stats = missing_value_pct_stats.sort_values(ascending=False)[:missing_value_col_count]
missing_data = pd.concat([total, missing_value_pct_stats], axis=1, keys=['Total', 'Percentage(%)'])
print("Number of rows with at least 1 missing values:", data.isna().any(axis = 1).sum())
print("Number of columns with missing values:", missing_value_col_count)
if missing_value_col_count != 0:
# print out column names with missing value percentage
print("\nMissing percentage (desceding):")
display(missing_data[:missing_value_col_count])
# plot missing values
missing = data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
else:
print("No missing data!!!")
# pass a dataframe to the function
missing_value_describe(pulse_point_df)
Number of rows with at least one missing value: 93351 Number of columns with missing values: 2 Missing percentage (descending):
Total | Percentage(%) | |
---|---|---|
duration | 79967 | 22.136500 |
description | 16623 | 4.601586 |
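A small sanity check of the counting logic used by the helper above, on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd

# toy frame (hypothetical values)
toy = pd.DataFrame({
    "a": [1, 2, np.nan],
    "b": [np.nan, np.nan, 3],
    "c": [1, 2, 3],
})

# every row here has at least one NaN: rows 0 and 1 in "b", row 2 in "a"
rows_with_na = toy.isna().any(axis=1).sum()

# columns containing any NaN: "a" and "b"
cols_with_na = (toy.isnull().sum() / len(toy) * 100 > 0).sum()

print(rows_with_na, cols_with_na)  # 3 2
```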
pulse_point_df.drop(['id', 'incident_logo', 'agency_logo'], axis=1, inplace=True)
"Active" incidents are noisy duplicates of "recent" incidents that could not be filtered out during data collection, so they do not contribute to the analysis and can be dropped.
pulse_point_df.type.value_counts()
recent 281278 active 79967 Name: type, dtype: int64
pulse_point_df.drop(pulse_point_df[pulse_point_df.type == 'active'].index, inplace=True)
pulse_point_df.reset_index(drop=True, inplace=True)
Drop redundant column "type"
pulse_point_df.drop(columns=['type'], axis=1, inplace=True)
pulse_point_df.location
0 2210 E WASHINGTON ST, SUFFOLK, VA 1 2300 A ST, TACOMA, WA 2 4601 NW 30TH TER, TAMARAC, FL 3 4611 NW 30TH TER, TAMARAC, FL 4 PUYALLUP AVE & A ST, TACOMA, WA ... 281273 1252 WILROY RD, SUFFOLK, VA 281274 913 E WASHINGTON ST, SUFFOLK, VA 281275 717 OCEAN BREEZE WK, OCEAN BEACH, NY 281276 S 8TH ST & YAKIMA AVE, TACOMA, WA 281277 S 92ND ST & S HOSMER ST, TACOMA, WA Name: location, Length: 281278, dtype: object
pulse_point_df.location.value_counts().head(10)
COLLINS AVE, MIAMI BEACH, FL 99 N HARBOR BL, FULLERTON, CA 79 175 NE 1ST ST, MCMINNVILLE, OR (MCMINNVILLE FIRE DEPARTMENT) 78 WASHINGTON AVE, MIAMI BEACH, FL 77 FREMONT BLVD, FREMONT, CA 76 ALTON RD, MIAMI BEACH, FL 70 PRESTON RD, FRISCO, TX 70 LEGACY DR, FRISCO, TX 66 E STATE ST, ROCKFORD, IL 65 STONEBROOK PKWY, FRISCO, TX 64 Name: location, dtype: int64
There are many variations in the location column.
We can split the locations into multiple features -
State¶
Text after the last comma appears to be the short form of US states or Canadian provinces.
CA -> California state
OR -> Oregon state
City¶
Text after the second-to-last comma appears to be the city (or town/county) name
MEDFORD is a city in Oregon (last example - "E BARNETT RD, MEDFORD, OR")
Address¶
Apart from the state and city name, the rest is treated as the address feature when there are three comma-separated elements
Address_2¶
Apart from the state, city, and address, the remaining element is treated as an extended address (address_2) feature when there are four comma-separated elements
Business¶
A string enclosed in parentheses is counted as the business name.
From the examples above, OJAI ARCADE (21002302) and MCMINNVILLE FIRE DEPARTMENT are counted as business features
def get_business_name(location):
# https://stackoverflow.com/a/38212061/11105356
stack = 0
start_index = None
results = []
for i, c in enumerate(location):
if c == '(':
if stack == 0:
start_index = i + 1 # string to extract starts one index later
# push to stack
stack += 1
elif c == ')':
# pop stack
stack -= 1
if stack == 0:
results.append(location[start_index:i])
try:
if len(results) == 0:
return None
elif len(results) == 1 and len(results[0]) == 1:
return None
elif len(results) == 1 and len(results[0])!=1:
return results[0].strip()
elif len(results) > 1 and len(results[0])==1:
return None
else:
return results[1].strip()
except IndexError as ie:
pass
### handles variations such as -
# 5709 RICHMOND RD, STE 76, JAMES CITY COUNTY, VA (JANIE & JACK)
# 433 SARATOGA RD, SCHENECTADY, NY ((GLENVILLE)EAST GLENVILLE FD)
# I 229 RAMP & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD (I 229 MM 8 NB)
# 6501 MISTY WATERS DR, STE (S)E260 (N), BURLEIGH COUNTY, ND
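A few spot checks on the variations listed above. The function below is a compact restatement of the same stack-based scan, repeated only so this cell runs standalone:

```python
def get_business_name(location):
    # same stack-based parenthesis scan as above, restated compactly
    stack, start_index, results = 0, None, []
    for i, c in enumerate(location):
        if c == '(':
            if stack == 0:
                start_index = i + 1
            stack += 1
        elif c == ')':
            stack -= 1
            if stack == 0:
                results.append(location[start_index:i])
    if not results:
        return None
    if len(results) == 1:
        return None if len(results[0]) == 1 else results[0].strip()
    # multiple top-level groups: a single-char first group is noise like "(S)"
    return None if len(results[0]) == 1 else results[1].strip()

# spot checks against the variations listed above
print(get_business_name("302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))"))
# OJAI ARCADE (21002302)
print(get_business_name("I 229 RAMP & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD (I 229 MM 8 NB)"))
# I 229 MM 8 NB
print(get_business_name("6501 MISTY WATERS DR, STE (S)E260 (N), BURLEIGH COUNTY, ND"))
# None
```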
Example 1 (3 elements):
302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))
address = 302 E OJAI AVE, city = OJAI, state = CA, business = OJAI ARCADE (21002302)
Example 2 (4 elements):
GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA
address = GRASSIE BLVD, address_2 = STE 212, city = WINNIPEG, state = MANITOBA (will be converted to MB later)
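The comma split behind the 3-segment/4-segment distinction, shown on the two examples above (business already removed from the first):

```python
import pandas as pd

locs = pd.Series([
    "302 E OJAI AVE, OJAI, CA",                   # 3 segments
    "GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA",  # 4 segments
])
parts = locs.str.split(',', expand=True)

# the 3-segment row gets None in the 4th column, which is how
# four-column rows can be separated from three-column rows
print(parts[3].notna().tolist())  # [False, True]
```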
# examples
# 302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302)) --- 3 segments with business inside
# 1959 MORSE RD, COLUMBUS, OH (DOLLAR GENERAL)
# I 229 RAMP & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD (I 229 MM 8 NB)
# GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA --- 4 segments
# split location into 3 or 4 parts depending on number of commas ->
# 3 segments : address, city, state
# 4 segments : address, address_2, city, state
# to extract bracket enclosed string
pulse_point_df['business'] = pulse_point_df.location.apply(lambda x : get_business_name(x))
### remove the enclosed business name from the location string
# guard against None so that str(None) == "None" is never substituted
pulse_point_location_data = pulse_point_df.apply(lambda row : row['location'].replace(row['business'], '') if row['business'] else row['location'], axis=1)
# remove leftover brackets from the business replacement
# https://stackoverflow.com/a/49183590/11105356
# remove a (...) substring with a leading whitespace at the end of the string only
pulse_point_location_data = pulse_point_location_data.str.replace(r"\s*\([^()]*\)$", "", regex=True).str.strip()
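Because the pattern is anchored with `$`, only a trailing `(...)` is removed; a parenthetical in the middle of the string is left alone. A quick check on toy strings:

```python
import pandas as pd

s = pd.Series([
    "1959 MORSE RD, COLUMBUS, OH (DOLLAR GENERAL)",
    "I 229 RAMP & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD",
])
# only the trailing parenthetical is stripped; "(0.1 MILES)" survives
out = s.str.replace(r"\s*\([^()]*\)$", "", regex=True).str.strip()
print(out.tolist())
# ['1959 MORSE RD, COLUMBUS, OH', 'I 229 RAMP & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD']
```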
# split the location
four_col_location_split = ['address', 'address_2', 'city','state']
three_col_location_split = ['address', 'city','state']
# four col indices
# pulse_point_location_data[pulse_point_location_data.str.split(',', expand=True)[3].notna()]
extra_loc_data = pulse_point_location_data.str.split(',', expand=True) # to expand columns
four_col_indices = extra_loc_data[extra_loc_data[3].notna()].index # rows where a 4th segment exists
four_col_loc_df = extra_loc_data.iloc[four_col_indices]
four_col_loc_df.columns = four_col_location_split
four_col_loc_df
address | address_2 | city | state | |
---|---|---|---|---|
41 | 150 JOHNSON AVE | STE 14 | CAPE CANAVERAL | FL |
43 | 1262 PRAIRIE LN | STE 305 | TITUSVILLE | FL |
54 | 6431 N 84TH ST | STE 4 | MILWAUKEE | WI |
64 | 405 S BLAINE ST | STE 5 | NEWBERG | OR |
65 | 18390 SW BOONES FERRY RD | STE F207 | TIGARD | OR |
... | ... | ... | ... | ... |
281244 | 4515 86TH ST | STE 35 | URBANDALE | IA |
281253 | 223 E BAKERVIEW RD | STE 348 | BELLINGHAM | WA |
281254 | 1129 11TH ST | STE 304 | WEST DES MOINES | IA |
281256 | 1245 SE UNIVERSITY AVE | STE 103 | WAUKEE | IA |
281271 | 34464 CORTEZ BLVD | BLDG NOT FOUND | RIDGE MANOR | FL |
9032 rows × 4 columns
pulse_point_df.loc[four_col_loc_df.index , four_col_location_split] = four_col_loc_df
pulse_point_df[four_col_location_split] = pulse_point_df[four_col_location_split].apply(lambda x: x.str.strip())
pulse_point_df[four_col_location_split]
# far fewer four-segment locations than three-segment locations
address | address_2 | city | state | |
---|---|---|---|---|
0 | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... |
281273 | NaN | NaN | NaN | NaN |
281274 | NaN | NaN | NaN | NaN |
281275 | NaN | NaN | NaN | NaN |
281276 | NaN | NaN | NaN | NaN |
281277 | NaN | NaN | NaN | NaN |
281278 rows × 4 columns
four_col_loc_df_mask = extra_loc_data.index.isin(four_col_indices)
three_col_loc_df = extra_loc_data[~four_col_loc_df_mask].drop([3], axis=1)
three_col_loc_df.columns = three_col_location_split
# extra_loc_data[~three_col_loc_df][3].notna().sum() # to check null values
three_col_loc_df
address | city | state | |
---|---|---|---|
0 | 2210 E WASHINGTON ST | SUFFOLK | VA |
1 | 2300 A ST | TACOMA | WA |
2 | 4601 NW 30TH TER | TAMARAC | FL |
3 | 4611 NW 30TH TER | TAMARAC | FL |
4 | PUYALLUP AVE & A ST | TACOMA | WA |
... | ... | ... | ... |
281273 | 1252 WILROY RD | SUFFOLK | VA |
281274 | 913 E WASHINGTON ST | SUFFOLK | VA |
281275 | 717 OCEAN BREEZE WK | OCEAN BEACH | NY |
281276 | S 8TH ST & YAKIMA AVE | TACOMA | WA |
281277 | S 92ND ST & S HOSMER ST | TACOMA | WA |
272246 rows × 3 columns
pulse_point_df.loc[three_col_loc_df.index , three_col_location_split] = three_col_loc_df
pulse_point_df[three_col_location_split] = pulse_point_df[three_col_location_split].apply(lambda x: x.str.strip())
pulse_point_df[three_col_location_split]
address | city | state | |
---|---|---|---|
0 | 2210 E WASHINGTON ST | SUFFOLK | VA |
1 | 2300 A ST | TACOMA | WA |
2 | 4601 NW 30TH TER | TAMARAC | FL |
3 | 4611 NW 30TH TER | TAMARAC | FL |
4 | PUYALLUP AVE & A ST | TACOMA | WA |
... | ... | ... | ... |
281273 | 1252 WILROY RD | SUFFOLK | VA |
281274 | 913 E WASHINGTON ST | SUFFOLK | VA |
281275 | 717 OCEAN BREEZE WK | OCEAN BEACH | NY |
281276 | S 8TH ST & YAKIMA AVE | TACOMA | WA |
281277 | S 92ND ST & S HOSMER ST | TACOMA | WA |
281278 rows × 3 columns
pulse_point_df[['location','address', 'address_2', 'city','state', 'business']]
location | address | address_2 | city | state | business | |
---|---|---|---|---|---|---|
0 | 2210 E WASHINGTON ST, SUFFOLK, VA | 2210 E WASHINGTON ST | NaN | SUFFOLK | VA | None |
1 | 2300 A ST, TACOMA, WA | 2300 A ST | NaN | TACOMA | WA | None |
2 | 4601 NW 30TH TER, TAMARAC, FL | 4601 NW 30TH TER | NaN | TAMARAC | FL | None |
3 | 4611 NW 30TH TER, TAMARAC, FL | 4611 NW 30TH TER | NaN | TAMARAC | FL | None |
4 | PUYALLUP AVE & A ST, TACOMA, WA | PUYALLUP AVE & A ST | NaN | TACOMA | WA | None |
... | ... | ... | ... | ... | ... | ... |
281273 | 1252 WILROY RD, SUFFOLK, VA | 1252 WILROY RD | NaN | SUFFOLK | VA | None |
281274 | 913 E WASHINGTON ST, SUFFOLK, VA | 913 E WASHINGTON ST | NaN | SUFFOLK | VA | None |
281275 | 717 OCEAN BREEZE WK, OCEAN BEACH, NY | 717 OCEAN BREEZE WK | NaN | OCEAN BEACH | NY | None |
281276 | S 8TH ST & YAKIMA AVE, TACOMA, WA | S 8TH ST & YAKIMA AVE | NaN | TACOMA | WA | None |
281277 | S 92ND ST & S HOSMER ST, TACOMA, WA | S 92ND ST & S HOSMER ST | NaN | TACOMA | WA | None |
281278 rows × 6 columns
missing_value_describe(pulse_point_df[['location','address', 'address_2', 'city','state', 'business']])
Number of rows with at least one missing value: 280163 Number of columns with missing values: 2 Missing percentage (descending):
Total | Percentage(%) | |
---|---|---|
address_2 | 272246 | 96.788942 |
business | 266152 | 94.622402 |
pulse_point_df[pulse_point_df.city.isna()]
title | agency | location | timestamp_time | date_of_incident | description | duration | business | address | address_2 | city | state |
---|
pulse_point_df = pulse_point_df[pulse_point_df.city.notna()]
mask = ((pulse_point_df.city.isna()) | (pulse_point_df.city==u'') )
display(pulse_point_df[mask])
title | agency | location | timestamp_time | date_of_incident | description | duration | business | address | address_2 | city | state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
38923 | Mutual Aid | Sumter Fire & EMS | 34498 CORTEZ BLVD, BLDG NOT FOUND, RIDGE MANOR, FL (RIDGE MANOR) | 11:53 PM | 2021-07-03 | NaN | 4 m | RIDGE MANOR | 34498 CORTEZ BLVD | BLDG NOT FOUND | FL | |
50888 | Mutual Aid | San Ramon Valley FPD | 3590 CLAYTON RD, CONCORD, CA (CONCORD) | 10:15 PM | 2021-07-12 | PM32 | 52 m | CONCORD | 3590 CLAYTON RD | NaN | CA | |
272125 | Traffic Collision | Idaho Falls Fire | UNKNOWN & W 137TH S, S 5TH W, SHELLEY, ID (SHELLEY) | 12:53 AM | 2021-12-28 | AB5 | 9 m | SHELLEY | UNKNOWN & W 137TH S | S 5TH W | ID |
For these rows the business name is the same as the city name. The business text was removed from the location string before the city was extracted, which is why the city came out blank in cases like these.
Let's fill in the missing city names from the business names.
pulse_point_df.loc[mask,'city'] = pulse_point_df[mask].business
display(pulse_point_df.state.value_counts())
printmd(f"**Total {len(pulse_point_df.state.value_counts().index)} States. Some of them are Canadian provinces, ex - MANITOBA**")
CA 85135 FL 26543 WA 17828 VA 17754 OH 17079 OR 15827 WI 9808 MO 9521 TX 8451 IL 5783 PA 5172 IN 4960 KS 4647 NV 4574 MN 3865 NC 3759 AZ 3642 TN 3624 DE 2861 OK 2849 MANITOBA 2730 MD 2658 ND 2569 NY 2228 CO 1815 DC 1788 NE 1721 NJ 1663 ID 1505 SD 1372 AK 1271 GA 1192 KY 908 UT 816 HI 808 AR 797 SC 778 NM 396 MI 277 IA 201 AL 58 LA 21 ON 15 BC 6 NV ()) 1 MO (NUSACH HARI BNAI ZION CONGREGATION 1 CONCORD 1 Name: state, dtype: int64
Total 47 States. Some of them are Canadian provinces, ex - MANITOBA
# Canadian Province Mapping
# https://www150.statcan.gc.ca/n1/pub/92-195-x/2011001/geo/prov/tbl/tbl8-eng.htm
# https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada
ca_province_dic = {
'Newfoundland and Labrador': 'NL',
'Prince Edward Island': 'PE',
'Nova Scotia': 'NS',
'New Brunswick': 'NB',
'Quebec': 'QC',
'Ontario': 'ON',
'Manitoba': 'MB',
'Saskatchewan': 'SK',
'Alberta': 'AB',
'British Columbia': 'BC',
'Yukon': 'YT',
'Northwest Territories': 'NT',
'Nunavut': 'NU',
}
# approach 1
# def handle_state(data_attr):
# for k, v in canada_provinces_dic.items():
# if data_attr.strip().lower() == k.lower():
# return canada_provinces_dic[k]
# else:
# return data_attr
# pulse_point_df['state'] = pulse_point_df.state.apply(handle_state)
# approach 2
# https://stackoverflow.com/a/69994272/11105356
ca_province_dict = {k.lower():v for k,v in ca_province_dic.items()}
pulse_point_df['state'] = pulse_point_df['state'].str.lower().map(ca_province_dict).fillna(pulse_point_df.state)
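The map-and-fillna pattern in isolation, on toy values: keys not found in the mapping produce NaN, which `fillna` restores from the original series.

```python
import pandas as pd

# reduced mapping (toy values) illustrating the pattern above
mapping = {'manitoba': 'MB', 'ontario': 'ON'}
states = pd.Series(['MANITOBA', 'CA', 'ONTARIO', 'FL'])

# unmatched keys map to NaN, which fillna replaces with the original value
out = states.str.lower().map(mapping).fillna(states)
print(out.tolist())  # ['MB', 'CA', 'ON', 'FL']
```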
# Exception state : example - 'FL #1005' , 'NY EAST GLENVILLE FD', ' DE / RM304'
mask = pulse_point_df.state.apply(lambda x:len(x)>2)
display(pulse_point_df[mask].state)
25695 MO (NUSACH HARI BNAI ZION CONGREGATION 50895 CONCORD 78626 NV ()) Name: state, dtype: object
Keep only the first segment, which is the state abbreviation, and discard the rest (noise)
pulse_point_df.loc[mask,'state'] = pulse_point_df[mask].state.apply(lambda x: x.split()[0])
pulse_point_df.state.value_counts()
CA 85135 FL 26543 WA 17828 VA 17754 OH 17079 OR 15827 WI 9808 MO 9522 TX 8451 IL 5783 PA 5172 IN 4960 KS 4647 NV 4575 MN 3865 NC 3759 AZ 3642 TN 3624 DE 2861 OK 2849 MB 2730 MD 2658 ND 2569 NY 2228 CO 1815 DC 1788 NE 1721 NJ 1663 ID 1505 SD 1372 AK 1271 GA 1192 KY 908 UT 816 HI 808 AR 797 SC 778 NM 396 MI 277 IA 201 AL 58 LA 21 ON 15 BC 6 CONCORD 1 Name: state, dtype: int64
# CONCORD
mask = pulse_point_df.state.str.startswith('CONCORD')
display(pulse_point_df[mask])
printmd("**CONCORD should be in CA**")
title | agency | location | timestamp_time | date_of_incident | description | duration | business | address | address_2 | city | state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
50895 | Mutual Aid | San Ramon Valley FPD | 2020 GRANT ST, STE 1205, CONCORD | 9:51 PM | 2021-07-12 | PM32 | 20 m | None | 2020 GRANT ST | NaN | STE 1205 | CONCORD |
CONCORD should be in CA
pulse_point_df.loc[mask,'state'] = 'CA'
pulse_point_df.state.value_counts()
CA 85136 FL 26543 WA 17828 VA 17754 OH 17079 OR 15827 WI 9808 MO 9522 TX 8451 IL 5783 PA 5172 IN 4960 KS 4647 NV 4575 MN 3865 NC 3759 AZ 3642 TN 3624 DE 2861 OK 2849 MB 2730 MD 2658 ND 2569 NY 2228 CO 1815 DC 1788 NE 1721 NJ 1663 ID 1505 SD 1372 AK 1271 GA 1192 KY 908 UT 816 HI 808 AR 797 SC 778 NM 396 MI 277 IA 201 AL 58 LA 21 ON 15 BC 6 Name: state, dtype: int64
#https://stackoverflow.com/a/57846984/11105356
UNITS = {'s':'seconds', 'm':'minutes', 'h':'hours', 'd':'days', 'w':'weeks'}
# days and weeks should not occur in this data, but are handled anyway
def convert_to_seconds(s):
s = s.replace(" ","")
return int(timedelta(**{
UNITS.get(m.group('unit').lower(), 'seconds'): int(m.group('val'))
for m in re.finditer(r'(?P<val>\d+)(?P<unit>[smhdw]?)', s, flags=re.I)
}).total_seconds())
# convert_to_seconds("1 h 34 m")
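A couple of spot checks on the duration parser. The definitions below repeat the ones above only so this cell runs standalone:

```python
import re
from datetime import timedelta

UNITS = {'s': 'seconds', 'm': 'minutes', 'h': 'hours', 'd': 'days', 'w': 'weeks'}

def convert_to_seconds(s):
    # same regex-based parser as above, repeated so this cell is self-contained
    s = s.replace(" ", "")
    return int(timedelta(**{
        UNITS.get(m.group('unit').lower(), 'seconds'): int(m.group('val'))
        for m in re.finditer(r'(?P<val>\d+)(?P<unit>[smhdw]?)', s, flags=re.I)
    }).total_seconds())

print(convert_to_seconds("1 h 34 m"))  # 5640  (3600 + 34 * 60)
print(convert_to_seconds("16 m"))      # 960
```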
pulse_point_df["duration_in_seconds"] = pulse_point_df.duration.apply(lambda x:convert_to_seconds(x))
pulse_point_df["day_name"], pulse_point_df["weekday"] = pulse_point_df.date_of_incident.dt.day_name(), pulse_point_df.date_of_incident.dt.weekday
pulse_point_df["month_name"] = pulse_point_df.date_of_incident.dt.month_name()
## more features
# pulse_point_df.date_of_incident.dt.month_name()
# pulse_point_df.date_of_incident.dt.month
# pulse_point_df.date_of_incident.dt.day
# pulse_point_df.date_of_incident.dt.day_name()
# pulse_point_df.date_of_incident.dt.weekday
# pulse_point_df.date_of_incident.dt.isocalendar().week
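The `dt` accessors used above, shown on the two boundary dates of the collection period:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2021-05-02", "2021-12-31"]))
print(dates.dt.day_name().tolist())    # ['Sunday', 'Friday']
print(dates.dt.weekday.tolist())       # [6, 4]  (Monday=0 ... Sunday=6)
print(dates.dt.month_name().tolist())  # ['May', 'December']
```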
pulse_point_df.tail(40)
title | agency | location | timestamp_time | date_of_incident | description | duration | business | address | address_2 | city | state | duration_in_seconds | day_name | weekday | month_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
281238 | Refuse/Garbage Fire | Huntington Beach FD | 21661 BROOKHURST ST, HUNTINGTON BEACH, CA | 12:32 AM | 2021-05-03 | ME83 | 19 m | None | 21661 BROOKHURST ST | NaN | HUNTINGTON BEACH | CA | 1140 | Monday | 0 | May |
281239 | Residential Fire | Whatcom Fire/EMS | 3037 PACIFIC ST, BELLINGHAM, WA | 3:30 AM | 2021-05-03 | NaN | 59 m | None | 3037 PACIFIC ST | NaN | BELLINGHAM | WA | 3540 | Monday | 0 | May |
281240 | Mutual Aid | West Sacramento Fire | 1700 CAPITAL AVE, SAC, WEST SACRAMENTO, CA | 2:16 AM | 2021-05-03 | E43 WSAID | 3 m | None | 1700 CAPITAL AVE | SAC | WEST SACRAMENTO | CA | 180 | Monday | 0 | May |
281241 | Mutual Aid | WestShore FDs | 25157 CARLTON PARK, STE 120, NORTH OLMSTED, OH | 2:13 AM | 2021-05-03 | FVM31 | 54 m | None | 25157 CARLTON PARK | STE 120 | NORTH OLMSTED | OH | 3240 | Monday | 0 | May |
281242 | Mutual Aid | West Sacramento Fire | 106 J ST, SACRAMENTO, CA (SPIRITS RESTAURANT) | 2:13 AM | 2021-05-03 | B44 E44 WSAID | 17 m | SPIRITS RESTAURANT | 106 J ST | NaN | SACRAMENTO | CA | 1020 | Monday | 0 | May |
281243 | Mutual Aid | West Sacramento Fire | 1210 FRONT ST, SACRAMENTO, CA (RIO CITY CAFE) | 2:10 AM | 2021-05-03 | BT41 WSAID | 46 m | RIO CITY CAFE | 1210 FRONT ST | NaN | SACRAMENTO | CA | 2760 | Monday | 0 | May |
281244 | Structure Fire | Westcom | 4515 86TH ST, STE 35, URBANDALE, IA | 1:13 AM | 2021-05-03 | A213 A323 A433 C300 C404 E411 E431 JGMENG2 L325 L425 | 28 m | None | 4515 86TH ST | STE 35 | URBANDALE | IA | 1680 | Monday | 0 | May |
281245 | Refuse/Garbage Fire | Wicomico County | 317 WHITMAN AVE, SALISBURY, MD | 12:57 AM | 2021-05-03 | TR2 | 16 m | None | 317 WHITMAN AVE | NaN | SALISBURY | MD | 960 | Monday | 0 | May |
281246 | Residential Fire | West Palm Beach Fire | 436 51ST ST, WEST PALM BEACH, FL | 12:37 AM | 2021-05-03 | BC1 BC5 E1 E4 EMS2 FOO HM2 L6 PI11 R1 R3 SQ5 TAC8A TR1 WPIV WPIV2 | 1 h 30 m | None | 436 51ST ST | NaN | WEST PALM BEACH | FL | 5400 | Monday | 0 | May |
281247 | Appliance Fire | Wicomico County | 1149 S DIVISION ST, SALISBURY, MD | 8:55 PM | 2021-05-02 | AC1 E1 TR2 | 12 m | None | 1149 S DIVISION ST | NaN | SALISBURY | MD | 720 | Sunday | 6 | May |
281248 | Residential Fire | Westfield Fire | 6311 E 161ST ST, NOBLESVILLE, IN | 3:36 PM | 2021-05-02 | E382 | 26 m | None | 6311 E 161ST ST | NaN | NOBLESVILLE | IN | 1560 | Sunday | 6 | May |
281249 | Refuse/Garbage Fire | West Pierce Fire | 112TH ST SW & FARWEST DR SW, LAKEWOOD, WA | 3:14 PM | 2021-05-02 | E22 | 17 m | None | 112TH ST SW & FARWEST DR SW | NaN | LAKEWOOD | WA | 1020 | Sunday | 6 | May |
281250 | Mutual Aid | Wicomico County | 31671 W POST OFFICE RD, PRINCESS ANNE, MD | 11:47 AM | 2021-05-02 | ET151 RE302 | 1 h 5 m | None | 31671 W POST OFFICE RD | NaN | PRINCESS ANNE | MD | 3900 | Sunday | 6 | May |
281251 | Extinguished Fire | West Metro | W YALE AVE & S INDIANA ST, LAKEWOOD, CO | 10:44 AM | 2021-05-02 | E9 | 13 m | None | W YALE AVE & S INDIANA ST | NaN | LAKEWOOD | CO | 780 | Sunday | 6 | May |
281252 | Structure Fire | Westcom | 14575 SE UNIVERSITY AVE, WAUKEE, IA | 10:33 AM | 2021-05-02 | A913 A917 C900 E190 E220 E910 L425 | 59 m | None | 14575 SE UNIVERSITY AVE | NaN | WAUKEE | IA | 3540 | Sunday | 6 | May |
281253 | Structure Fire | Whatcom Fire/EMS | 223 E BAKERVIEW RD, STE 348, BELLINGHAM, WA | 6:18 AM | 2021-05-02 | B1 E6 L5 | 13 h 10 m | None | 223 E BAKERVIEW RD | STE 348 | BELLINGHAM | WA | 47400 | Sunday | 6 | May |
281254 | Confirmed Structure Fire | Westcom | 1129 11TH ST, STE 304, WEST DES MOINES, IA | 6:17 AM | 2021-05-02 | A193 A213 C100 C104 C199 C219 E170 E180 E220 L215 L325 U218 WHTENG | 4 h 48 m | None | 1129 11TH ST | STE 304 | WEST DES MOINES | IA | 17280 | Sunday | 6 | May |
281255 | Structure Fire | Westcom | 1650 SE HOLIDAY CREST CIR, WAUKEE, IA | 4:52 AM | 2021-05-02 | A433 A913 C219 E190 E220 E431 E910 L425 WKEFD1 | 1 h 22 m | None | 1650 SE HOLIDAY CREST CIR | NaN | WAUKEE | IA | 4920 | Sunday | 6 | May |
281256 | Structure Fire | Westcom | 1245 SE UNIVERSITY AVE, STE 103, WAUKEE, IA | 4:37 AM | 2021-05-02 | A913 C219 C901 E220 E431 E910 L215 | 21 m | None | 1245 SE UNIVERSITY AVE | STE 103 | WAUKEE | IA | 1260 | Sunday | 6 | May |
281257 | Fire | Anoka County | 3740 BRIDGE ST, SAINT FRANCIS, MN | 3:37 AM | 2021-05-03 | NaN | 2 m | None | 3740 BRIDGE ST | NaN | SAINT FRANCIS | MN | 120 | Monday | 0 | May |
281258 | Commercial Fire | Anne Arundel CFD | 7514 RITCHIE HWY, GLEN BURNIE, MD (LA FONTAINE BLEUE) | 2:11 AM | 2021-05-03 | BC01 CH12 E122 E181 E301 E311 E331 MU33 RS11 SAFE03 SAFE07 SCMD TK26 TK31 | 44 m | LA FONTAINE BLEUE | 7514 RITCHIE HWY | NaN | GLEN BURNIE | MD | 2640 | Monday | 0 | May |
281259 | Extinguished Fire | Anne Arundel CFD | 808 ELMHURST RD, SEVERN, MD | 2:04 AM | 2021-05-03 | E041 E331 | 26 m | None | 808 ELMHURST RD | NaN | SEVERN | MD | 1560 | Monday | 0 | May |
281260 | Extinguished Fire | Anne Arundel CFD | 245 KILMARNOCK DR, MILLERSVILLE, MD | 1:02 AM | 2021-05-03 | E301 | 30 m | None | 245 KILMARNOCK DR | NaN | MILLERSVILLE | MD | 1800 | Monday | 0 | May |
281261 | Mutual Aid | Anne Arundel CFD | 7015 AARONSON DR, BWI AIRPORT, MD (GENERAL AVIATION TERMINAL AND SIGNATURE FLIGHT SUPPORT) | 12:37 AM | 2021-05-03 | HOLD01 RE23 RS11 TK04 TR04 | 1 h 17 m | GENERAL AVIATION TERMINAL AND SIGNATURE FLIGHT SUPPORT | 7015 AARONSON DR | NaN | BWI AIRPORT | MD | 4620 | Monday | 0 | May |
281262 | Residential Fire | Anne Arundel CFD | 106 PINECREST DR, ANNAPOLIS, MD | 12:15 AM | 2021-05-03 | NaN | 43 m | None | 106 PINECREST DR | NaN | ANNAPOLIS | MD | 2580 | Monday | 0 | May |
281263 | Mutual Aid | Anne Arundel CFD | 2508 KNIGHTHILL LN, BOWIE, MD | 10:14 PM | 2021-05-02 | MU05 | 38 m | None | 2508 KNIGHTHILL LN | NaN | BOWIE | MD | 2280 | Sunday | 6 | May |
281264 | Mutual Aid | Anne Arundel CFD | FORT MEADE RD & LAUREL BOWIE RD, LAUREL, MD | 9:32 PM | 2021-05-02 | RE27 | 26 m | None | FORT MEADE RD & LAUREL BOWIE RD | NaN | LAUREL | MD | 1560 | Sunday | 6 | May |
281265 | Mutual Aid | Anne Arundel CFD | 13503 AVEBURY DR, LAUREL, MD | 9:12 PM | 2021-05-02 | MU27 | 39 m | None | 13503 AVEBURY DR | NaN | LAUREL | MD | 2340 | Sunday | 6 | May |
281266 | Commercial Fire | Anaheim FD | 2400 E KATELLA AV, ANAHEIM, CA (STADIUM TOWERS PLAZA BUILDING) | 8:25 PM | 2021-05-02 | AB2 AE3 AE7 OE3 OR6 OT6 | 14 m | STADIUM TOWERS PLAZA BUILDING | 2400 E KATELLA AV | NaN | ANAHEIM | CA | 840 | Sunday | 6 | May |
281267 | Mutual Aid | Anoka County | 13301 HANSON BLVD NW, ANDOVER, MN (ANOKA COUNTY PUBLIC SAFETY CAMPUS) | 11:01 AM | 2021-05-02 | NaN | 42 m | ANOKA COUNTY PUBLIC SAFETY CAMPUS | 13301 HANSON BLVD NW | NaN | ANDOVER | MN | 2520 | Sunday | 6 | May |
281268 | Structure Fire | Anoka County | SOUTH COON CREEK DR NW & ROUND LAKE BLVD NW, ANDOVER, MN | 6:22 AM | 2021-05-02 | AALL AE21 | 10 m | None | SOUTH COON CREEK DR NW & ROUND LAKE BLVD NW | NaN | ANDOVER | MN | 600 | Sunday | 6 | May |
281269 | Residential Fire | Anaheim FD | 3114 W TYLER AV, ANAHEIM, CA | 4:31 AM | 2021-05-02 | AB2 AE11 AE4 CE61 CT61 | 34 m | None | 3114 W TYLER AV | NaN | ANAHEIM | CA | 2040 | Sunday | 6 | May |
281270 | Structure Fire | Anaheim FD | 710 E CERRITOS AV, ANAHEIM, CA (CRENSHAW LUMBER) | 1:33 AM | 2021-05-02 | AB1 AB2 AE1 AE3 AE5 AE6 AE7 AI2 AT1 AT3 AT6 CE83 OB1 OE3 | 5 h 46 m | CRENSHAW LUMBER | 710 E CERRITOS AV | NaN | ANAHEIM | CA | 20760 | Sunday | 6 | May |
281271 | Mutual Aid | Sumter Fire & EMS | 34464 CORTEZ BLVD, BLDG NOT FOUND, RIDGE MANOR, FL (DOLLAR GENERAL) | 3:45 AM | 2021-05-03 | NaN | 9 m | DOLLAR GENERAL | 34464 CORTEZ BLVD | BLDG NOT FOUND | RIDGE MANOR | FL | 540 | Monday | 0 | May |
281272 | Residential Fire | Suffolk Fire Rescue | 234 N 4TH ST, SUFFOLK, VA | 3:37 AM | 2021-05-03 | B1 E1 E2 E3 EMS1 L3 M3 R1 R6 SF1 | 8 m | None | 234 N 4TH ST | NaN | SUFFOLK | VA | 480 | Monday | 0 | May |
281273 | Residential Fire | Suffolk Fire Rescue | 1252 WILROY RD, SUFFOLK, VA | 2:27 AM | 2021-05-03 | B1 E1 E2 E3 EMS1 L3 M3 M9 R1 SF1 T1 T9 | 21 m | None | 1252 WILROY RD | NaN | SUFFOLK | VA | 1260 | Monday | 0 | May |
281274 | Commercial Fire | Suffolk Fire Rescue | 913 E WASHINGTON ST, SUFFOLK, VA | 2:08 AM | 2021-05-03 | B1 E1 E2 E3 E4 L3 L5 M3 R1 R6 SF1 | 17 m | None | 913 E WASHINGTON ST | NaN | SUFFOLK | VA | 1020 | Monday | 0 | May |
281275 | Residential Fire | Suffolk Co FRES | 717 OCEAN BREEZE WK, OCEAN BEACH, NY | 11:05 PM | 2021-05-02 | 3-20-05 3-20-07 3-20-30 3-20-31 3-20-32 3-20-A OBCHPD-A | 51 m | None | 717 OCEAN BREEZE WK | NaN | OCEAN BEACH | NY | 3060 | Sunday | 6 | May |
281276 | Fire | Tacoma Fire | S 8TH ST & YAKIMA AVE, TACOMA, WA | 7:38 PM | 2021-05-02 | E01 | 8 m | None | S 8TH ST & YAKIMA AVE | NaN | TACOMA | WA | 480 | Sunday | 6 | May |
281277 | Fire | Tacoma Fire | S 92ND ST & S HOSMER ST, TACOMA, WA | 3:47 PM | 2021-05-02 | E08 | 2 m | None | S 92ND ST & S HOSMER ST | NaN | TACOMA | WA | 120 | Sunday | 6 | May |
I will assign time-of-day values based on the ranges below:
Time of the Day | Range |
---|---|
Morning | 5 AM to 11:59 AM |
Afternoon | 12 PM to 4:59 PM |
Evening | 5 PM to 8:59 PM |
Night | 9 PM to 11:59 PM |
Midnight | 12 AM to 4:59 AM |
# https://stackoverflow.com/a/70018607/11105356
from datetime import datetime  # required for strptime; not in the imports above
def time_range(time):
    hour = datetime.strptime(time, '%I:%M %p').hour
if hour > 20:
return "Night"
elif hour > 16:
return "Evening"
elif hour > 11:
return "Afternoon"
elif hour > 4:
return "Morning"
else:
return "Midnight"
pulse_point_df["time_of_the_day"] = pulse_point_df.timestamp_time.apply(time_range)
# # pulse_point_df.timestamp_time = pd.to_datetime(pulse_point_df.timestamp_time).dt.time
pulse_point_df.to_csv('PulsePoint-emergencies-cleaned.csv', index=False)
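The row-wise apply above can also be done in a vectorized way. A sketch using `pd.cut` with the same bin edges as `time_range` (toy timestamps here, not the notebook's column):

```python
import pandas as pd

# Parse the 12-hour timestamps once, then bin the hour with pd.cut.
# Bin edges mirror the ranges used by time_range() above.
times = pd.Series(["5:04 AM", "1:13 PM", "6:30 PM", "10:45 PM", "2:10 AM"])
hours = pd.to_datetime(times, format="%I:%M %p").dt.hour

labels = pd.cut(
    hours,
    bins=[-1, 4, 11, 16, 20, 23],  # (-1,4]=Midnight, (4,11]=Morning, ...
    labels=["Midnight", "Morning", "Afternoon", "Evening", "Night"],
)
print(labels.tolist())
# ['Morning', 'Afternoon', 'Evening', 'Night', 'Midnight']
```

On a few hundred thousand rows this avoids one `strptime` call per row inside a Python-level apply.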
A quick overview of the preprocessed data:
The preprocessed dataset contains 5 additional columns extracted from the location column and another 5 columns extracted from the date_of_incident and duration columns. The Id, Incident_logo and agency_logo columns from the original dataset were discarded.
Columns | Description | Data Type |
---|---|---|
business | Name of the business place extracted from location(e.g., JANIE & JACK, DOLLAR GENERAL etc.) | object |
address | Address where the incident took place (extracted from location) | object |
address_2 | Extended address where the incident took place (extracted from location) | object |
city | City where the incident took place (extracted from location). It could also be a town or a county | object |
state | State where the incident took place (extracted from location) | object |
duration_in_seconds | Incident duration in seconds (extracted from duration) | numeric, int |
day_name | Name of the day when the incident took place | object |
weekday | The day of the week, with Monday=0 and Sunday=6 | numeric, int |
month_name | Name of the month (extracted from date) | object |
time_of_the_day | morning (5AM-11:59AM), afternoon (12PM-4:59 PM), evening (5PM-8:59PM), night (9PM-11:59PM), midnight (12AM-4:59AM) | object |
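The code that derives duration_in_seconds from duration is not shown in this excerpt. A minimal sketch of parsing strings such as `16 m` or `1 h 30 m` (the helper name `duration_to_seconds` is hypothetical, not the notebook's):

```python
import re

def duration_to_seconds(duration: str) -> int:
    """Convert PulsePoint duration strings like '16 m' or '1 h 30 m'
    to seconds. Sketch only; the notebook's actual helper is not shown."""
    hours = re.search(r"(\d+)\s*h", duration)
    minutes = re.search(r"(\d+)\s*m", duration)
    total = 0
    if hours:
        total += int(hours.group(1)) * 3600
    if minutes:
        total += int(minutes.group(1)) * 60
    return total

print(duration_to_seconds("16 m"))       # 960
print(duration_to_seconds("1 h 30 m"))   # 5400
print(duration_to_seconds("13 h 10 m"))  # 47400
```

The results line up with the rows shown earlier (e.g. a `1 h 30 m` incident carries 5400 in duration_in_seconds).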
printmd(f"There are a total of **{pulse_point_df.shape[0]}** incidents")
There are a total of 281278 incidents
pulse_point_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 281278 entries, 0 to 281277
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   title                281278 non-null  object
 1   agency               281278 non-null  object
 2   location             281278 non-null  object
 3   timestamp_time       281278 non-null  object
 4   date_of_incident     281278 non-null  datetime64[ns]
 5   description          267894 non-null  object
 6   duration             281278 non-null  object
 7   business             15126 non-null   object
 8   address              281278 non-null  object
 9   address_2            9032 non-null    object
 10  city                 281278 non-null  object
 11  state                281278 non-null  object
 12  duration_in_seconds  281278 non-null  int64
 13  day_name             281278 non-null  object
 14  weekday              281278 non-null  int64
 15  month_name           281278 non-null  object
 16  time_of_the_day      281278 non-null  object
dtypes: datetime64[ns](1), int64(2), object(14)
memory usage: 38.6+ MB
pulse_point_df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
duration_in_seconds | 281278.0 | 2578.742312 | 2878.920631 | 0.0 | 1020.0 | 1860.0 | 3480.0 | 116760.0 |
weekday | 281278.0 | 3.076256 | 2.012422 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 |
pulse_point_df.describe(include='object').T
count | unique | top | freq | |
---|---|---|---|---|
title | 281278 | 88 | Medical Emergency | 179753 |
agency | 281278 | 773 | Montgomery County | 6296 |
location | 281278 | 179763 | COLLINS AVE, MIAMI BEACH, FL | 99 |
timestamp_time | 281278 | 1440 | 5:04 AM | 333 |
description | 267894 | 85203 | E1 | 1249 |
duration | 281278 | 728 | 16 m | 6420 |
business | 15126 | 11244 | UNINC | 83 |
address | 281278 | 155622 | MAIN ST | 462 |
address_2 | 9032 | 2758 | STE BLK | 266 |
city | 281278 | 3530 | LOS ANGELES | 8017 |
state | 281278 | 44 | CA | 85136 |
day_name | 281278 | 7 | Sunday | 45983 |
month_name | 281278 | 8 | November | 53193 |
time_of_the_day | 281278 | 5 | Morning | 101072 |
missing_value_describe(pulse_point_df)
Number of rows with at least 1 missing value: 280200
Number of columns with missing values: 3
Missing percentage (descending):
Total | Percentage(%) | |
---|---|---|
address_2 | 272246 | 96.788942 |
business | 266152 | 94.622402 |
description | 13384 | 4.758282 |
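`missing_value_describe` is a custom helper that is not defined in this excerpt. A minimal sketch consistent with the output above (per-column missing totals and percentages, sorted descending):

```python
import pandas as pd

def missing_value_describe(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the missing-value summary used above: total and
    percentage of NaNs per column, only columns with missing values,
    sorted by percentage descending."""
    total = df.isna().sum()
    pct = 100 * total / len(df)
    out = pd.DataFrame({"Total": total, "Percentage(%)": pct})
    return out[out["Total"] > 0].sort_values("Percentage(%)", ascending=False)

# tiny demo frame: column 'a' is half missing, 'b' is complete
demo = pd.DataFrame({"a": [1, None, None, 4], "b": [1, 2, 3, 4]})
print(missing_value_describe(demo))
```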
printmd(f"There are a total of **{len(pulse_point_df.title.unique())}** types of incidents")
There are a total of 88 types of incidents
pulse_point_df.title.value_counts().head(20)
Medical Emergency 179753 Traffic Collision 22835 Fire Alarm 11147 Alarm 7420 Public Service 7363 Refuse/Garbage Fire 4437 Structure Fire 4152 Lift Assist 3130 Mutual Aid 2862 Fire 2723 Residential Fire 2559 Expanded Traffic Collision 2527 Interfacility Transfer 2014 Outside Fire 1963 Vehicle Fire 1810 Investigation 1628 Carbon Monoxide 1513 Vegetation Fire 1428 Hazardous Condition 1423 Commercial Fire 1405 Name: title, dtype: int64
# crisp wordcloud : https://stackoverflow.com/a/28795577/11105356
data = pulse_point_df.title.value_counts().to_dict()
wc = WordCloud(width=800, height=400,background_color="white", max_font_size=300).generate_from_frequencies(data)
plt.figure(figsize=(14,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()
printmd(f"There are a total of **{len(pulse_point_df.agency.unique())}** agencies")
There are a total of 773 agencies
# Top agencies by incident engagement count
pulse_point_df.agency.value_counts().head(20)
Montgomery County 6296 Columbus Fire 4957 Milwaukee Fire 4793 Cleveland EMS 4728 Contra Costa FPD 4609 Fairfax County Fire 3547 Hamilton County 3473 Eug Spfld Fire 3302 Boone County Joint 3143 LAFD - Central 3136 Rockford Fire 3052 LA County FD (Div 4) 2992 LA County FD (Div 8) 2990 LA County FD (Div 6) 2964 LA County FD (Div 1) 2925 Seminole County Fire 2909 LA County FD (Div 2) 2900 LA County FD (Div 5) 2864 Seattle FD 2858 Miami Beach Fire 2802 Name: agency, dtype: int64
pulse_point_df.agency.value_counts().head(10).sort_values(ascending=False).plot(kind = 'bar');
Most frequent: Montgomery County
data = pulse_point_df.agency.value_counts().to_dict()
wc = WordCloud(width=800, height=400,background_color="white", max_font_size=300).generate_from_frequencies(data)
plt.figure(figsize=(14,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()
The codes themselves are defined by each agency and are typically followed by a number to identify a particular instance of each asset type. A legend is sometimes provided on the agency information page. (Credit: PulsePoint Wikipedia)
Note: There is no standard for the identifier abbreviations (E, T, S, BC, RA, PM, etc.), and they can vary significantly from agency to agency.
Example - Ventura County Fire Department PulsePoint Unit Abbreviations PDF
To know more, visit - https://www.pulsepoint.org/unit-status-legend
pulse_point_df.description.value_counts().head(10)
E1 1249 E2 1056 E4 743 E3 701 E6 699 E11 671 M1 636 E14 610 E10 601 E51 564 Name: description, dtype: int64
Checking for the MRE (Medic Rescue Engine) code in description:
mask = pulse_point_df.description.str.contains('MRE', regex=False, na=False)
display(pulse_point_df.description[mask])
printmd(f"**{pulse_point_df.description[mask].count()}** instances contain **MRE** code")
565 B5 DMRE238 E24 E47 M9 RSFB261 SOLE237 SOLT237 1903 B14 B15 E132 ME32 ME33 ME35 ME52 MRE31 T35 3637 B19 B29 E132 ME34 ME36 ME37 MRE31 Q44 T35 3669 MRB4 MRE3 MRF6 MRFD1 MRR76 MRS5 MRT7 MRT8 4087 B17 ME21 ME23 ME26 SMRE4 ... 264925 B18 ME21 ME23 ME26 SMRE4 WT26 276683 A14 B1 B2 E15 E6 EMS1 H14 M2 MRED T12 280991 B14 E22 MRE23 281020 B14 B2 E22 IV12 ME1 ME20 ME21 MRE23 MT5 VNCIV 281088 E132 MRE31 Name: description, Length: 93, dtype: object
93 instances contain MRE code
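Note that the value counts above treat each full description string as one value, so `E1 T1` and `E1` count separately. To count individual unit codes, one could split and explode first (a sketch on toy data; the notebook column is `pulse_point_df.description`):

```python
import pandas as pd

# Each description is a space-separated list of unit identifiers.
# Split into lists, explode to one unit per row, then count.
demo = pd.Series(["E1 T1", "E1 M1", "E1"])
unit_counts = demo.str.split().explode().value_counts()
print(unit_counts["E1"])  # 3
```

This would also make the `str.contains('MRE', ...)` check stricter, since a substring match currently catches codes like `SMRE4` and `DMRE238` as well.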
(pulse_point_df.duration_in_seconds/ 60).value_counts().head(30)
# alternative
# pulse_point_df.duration.value_counts().head(20)
16.0 6420 18.0 6334 15.0 6332 17.0 6330 14.0 6202 19.0 6192 20.0 6025 21.0 5908 13.0 5907 12.0 5664 22.0 5530 11.0 5360 23.0 5195 10.0 5076 24.0 4877 9.0 4582 25.0 4535 26.0 4308 27.0 4018 8.0 3832 28.0 3652 29.0 3541 30.0 3425 7.0 3409 31.0 3174 4.0 3060 32.0 3044 5.0 2977 33.0 2925 3.0 2911 Name: duration_in_seconds, dtype: int64
Most of the emergency engagements lasted under 30 minutes
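That claim can be quantified directly as the share of incidents under 30 minutes (a sketch on toy durations in seconds; in the notebook this would run on `pulse_point_df.duration_in_seconds`):

```python
import pandas as pd

# Fraction of incidents shorter than 30 minutes: comparing against the
# threshold yields booleans, and their mean is the share that are True.
durations = pd.Series([960, 5400, 720, 1560, 1020, 3900, 780])
share_under_30 = (durations < 30 * 60).mean()
print(f"{share_under_30:.0%} under 30 minutes")  # 71% under 30 minutes
```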
printmd(f"PulsePoint covers a total of **{len(pulse_point_df.city.unique())}** cities")
PulsePoint covers a total of 3530 cities
pulse_point_city_df = pulse_point_df.groupby(['city','state'], as_index=False).count()[['city', 'state', 'title']].reset_index(drop=True).rename(columns={'title':'count'})
pulse_point_city_df.head(50)
city | state | count | |
---|---|---|---|
0 | * | NM | 3 |
1 | **UNDEFINED | CA | 17 |
2 | -105.124526 | CO | 1 |
3 | 0304 | NJ | 5 |
4 | 0306 | NJ | 4 |
5 | 0308 | NJ | 1 |
6 | 0310 | NJ | 3 |
7 | 0311 | NJ | 2 |
8 | 0312 | NJ | 3 |
9 | 0313 | NJ | 7 |
10 | 0315 | NJ | 1 |
11 | 0316 | NJ | 1 |
12 | 0318 | NJ | 1 |
13 | 0319 | NJ | 5 |
14 | 0320 | NJ | 3 |
15 | 0322 | NJ | 3 |
16 | 0323 | NJ | 4 |
17 | 0324 | NJ | 12 |
18 | 0325 | NJ | 1 |
19 | 0332 | NJ | 1 |
20 | 0334 | NJ | 2 |
21 | 0335 | NJ | 1 |
22 | 0337 | NJ | 2 |
23 | 0338 | NJ | 5 |
24 | 0339 | NJ | 3 |
25 | 10TH AVE N | ID | 1 |
26 | 11TH ST N | ID | 1 |
27 | 12809 - NAME? | NY | 1 |
28 | 1328 - NAME? | DE | 1 |
29 | 1ST AVE N | ID | 1 |
30 | 21 | NJ | 3 |
31 | 21804 - NAME? | MD | 7 |
32 | 21875 - NAME? | MD | 1 |
33 | 29 PALMS | CA | 9 |
34 | 2ND AVE N & RIVERFRONT PARK RD | ID | 1 |
35 | 50TH ST S | ID | 1 |
36 | 6TH AVE N | ID | 1 |
37 | ABERDEEN | SD | 38 |
38 | ABINGTON | PA | 190 |
39 | ACCOKEEK | MD | 3 |
40 | ACME | WA | 2 |
41 | ACTON | CA | 37 |
42 | ADDISON | TX | 70 |
43 | ADELANTO | CA | 20 |
44 | ADMIRALS CHASE | DE | 1 |
45 | ADVANCE | IN | 2 |
46 | AFFTON | MO | 1 |
47 | AGASSIZ | BC | 1 |
48 | AGOURA | CA | 23 |
49 | AGOURA HILLS | CA | 147 |
Some cities in different states share the same name:
pulse_point_city_df[pulse_point_city_df.city.str.lower() == 'bloomington']
city | state | count | |
---|---|---|---|
330 | BLOOMINGTON | CA | 9 |
331 | BLOOMINGTON | IN | 410 |
Outliers in city names: `*`, `0324`, `**UNDEFINED`, `12809 - NAME?`, etc.
pulse_point_df[pulse_point_df.city.str.startswith('0324')].head()
title | agency | location | timestamp_time | date_of_incident | description | duration | business | address | address_2 | city | state | duration_in_seconds | day_name | weekday | month_name | time_of_the_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
63653 | Fire Alarm | Burlington County | 6407 NORMANDY DR, 0324, NJ | 4:54 AM | 2021-07-26 | F3603 F361 F363 F3631 F3632 F3635 P36115 | 12 m | None | 6407 NORMANDY DR | NaN | 0324 | NJ | 720 | Monday | 0 | July | Midnight |
63729 | Medical Emergency | Burlington County | HAINESPORT MT LAUREL RD & FOX RUN, 0324, NJ | 12:13 AM | 2021-07-26 | E3681 P36137 P36159 P36185 | 1 h 30 m | None | HAINESPORT MT LAUREL RD & FOX RUN | NaN | 0324 | NJ | 5400 | Monday | 0 | July | Midnight |
79887 | Medical Emergency | Burlington County | CHADBURY RD & ABERDEEN DR, 0324, NJ | 5:21 AM | 2021-08-09 | E3693 P3695 | 56 m | None | CHADBURY RD & ABERDEEN DR | NaN | 0324 | NJ | 3360 | Monday | 0 | August | Morning |
175543 | Medical Emergency | Burlington County | CHURCH RD, 0324, NJ | 4:55 AM | 2021-10-21 | E3671 E3672 P36122 | 1 h 26 m | None | CHURCH RD | NaN | 0324 | NJ | 5160 | Thursday | 3 | October | Midnight |
193627 | Fire Alarm | Burlington County | 4105 ADELAIDE DR, 0324, NJ | 3:20 AM | 2021-11-04 | F3614 | 12 m | None | 4105 ADELAIDE DR | NaN | 0324 | NJ | 720 | Thursday | 3 | November | Midnight |
geolocator = Nominatim(user_agent='myapplication')
location = geolocator.geocode("50TH ST S")
print(location.address)
display(location.raw)
print("Latitude: ", location.raw['lat'], ", Longitude: ", location.raw['lon'])
50th Street South, Gulfport, Pinellas County, Florida, 33707, United States
{'boundingbox': ['27.750967', '27.7517259', '-82.7011036', '-82.701095'], 'class': 'highway', 'display_name': '50th Street South, Gulfport, Pinellas County, Florida, 33707, United States', 'importance': 0.4, 'lat': '27.7517259', 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', 'lon': '-82.7011036', 'osm_id': 11238610, 'osm_type': 'way', 'place_id': 99532491, 'type': 'residential'}
Latitude: 27.7517259 , Longitude: -82.7011036
geolocator = Nominatim(user_agent='myapplication')
def get_nominatim_geocode(address):
try:
location = geolocator.geocode(address)
return location.raw['lon'], location.raw['lat']
except Exception as e:
# print(e)
return None, None
# alternative way : scraping from the website
# def get_nominatim_geocode(address):
# url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) + '?format=json'
# try:
# response = requests.get(url).json()
# return response[0]["lon"], response[0]["lat"]
# except Exception as e:
# # print(e)
# return None, None
def get_positionstack_geocode(address):
BASE_URL = "http://api.positionstack.com/v1/forward?access_key="
API_KEY = API_KEY_POSITIONSTACK
url = BASE_URL +API_KEY+'&query='+urllib.parse.quote(address)
try:
response = requests.get(url).json()
# print( response["data"][0])
return response["data"][0]["longitude"], response["data"][0]["latitude"]
except Exception as e:
# print(e)
return None, None
def get_geocode(address):
    long, lat = get_nominatim_geocode(address)
    if long is None:  # fall back to Positionstack when Nominatim fails
        return get_positionstack_geocode(address)
    return long, lat
address = "50TH ST S"
get_geocode(address)
('-82.7011036', '27.7517259')
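One caveat when scaling this up: Nominatim's usage policy allows roughly one request per second, so bulk geocoding should be throttled (geopy also ships a ready-made `geopy.extra.rate_limiter.RateLimiter` for exactly this). A minimal stdlib sketch of the same idea; the `throttle` decorator and stubbed `geocode_throttled` below are hypothetical, not from the notebook:

```python
import time
from functools import wraps

def throttle(min_delay_seconds):
    """Enforce a minimum delay between successive calls to the wrapped
    function (sketch of what geopy's RateLimiter does)."""
    def decorator(func):
        last_call = [0.0]  # mutable cell holding the last call time
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@throttle(min_delay_seconds=1.0)
def geocode_throttled(address):
    # in the notebook this would wrap get_geocode(address);
    # stubbed here so the sketch runs standalone
    return None, None
```

With ~3,800 city lookups this matters: the long wall time reported further below (about 35 minutes) is dominated by network latency and pacing, not CPU.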
Some cities with the same name appear in two different countries. For example:
address = 'Suffolk'
location = geolocator.geocode(address)
location
Location(Suffolk, East of England, England, United Kingdom, (52.241001350000005, 1.0466830312565236, 0.0))
Adding a trailing 'USA' to the location text resolves this:
address = 'Suffolk, USA'
location = geolocator.geocode(address)
location
Location(Suffolk, Suffolk (city), Virginia, 23434, United States, (36.7282096, -76.5835703, 0.0))
Including the state and country in the query helps resolve the appropriate location.
Let's fetch geolocation of some cities
test_df = pulse_point_city_df.tail().copy()  # copy to avoid SettingWithCopyWarning below
test_df
city | state | count | |
---|---|---|---|
3783 | ZEPHYR COVE | NV | 16 |
3784 | ZEPHYRHILLS | FL | 12 |
3785 | ZIONSVILLE | IN | 193 |
3786 | ZOAR | OH | 2 |
3787 | ZOC-ORLANDO | FL | 2 |
test_df['location'] = test_df['city'] + ', ' + test_df['state'] + ', USA'
# test_df[['city', 'state']].agg(', '.join, axis=1) + ', USA'
test_df
city | state | count | location | |
---|---|---|---|---|
3783 | ZEPHYR COVE | NV | 16 | ZEPHYR COVE, NV, USA |
3784 | ZEPHYRHILLS | FL | 12 | ZEPHYRHILLS, FL, USA |
3785 | ZIONSVILLE | IN | 193 | ZIONSVILLE, IN, USA |
3786 | ZOAR | OH | 2 | ZOAR, OH, USA |
3787 | ZOC-ORLANDO | FL | 2 | ZOC-ORLANDO, FL, USA |
%%time
location_test_df = test_df.location.progress_apply(lambda x:get_geocode(str(x.strip()))).apply(pd.Series)
0%| | 0/5 [00:00<?, ?it/s]
CPU times: user 113 ms, sys: 7.43 ms, total: 121 ms Wall time: 2.72 s
location_test_df.columns = ['longitude', 'latitude']
test_df = test_df.join(location_test_df)
display(test_df)
city | state | count | location | longitude | latitude | |
---|---|---|---|---|---|---|
3783 | ZEPHYR COVE | NV | 16 | ZEPHYR COVE, NV, USA | -119.9472389 | 39.0060103 |
3784 | ZEPHYRHILLS | FL | 12 | ZEPHYRHILLS, FL, USA | -82.1812531782471 | 28.24262955 |
3785 | ZIONSVILLE | IN | 193 | ZIONSVILLE, IN, USA | -86.26473508663867 | 39.963235499999996 |
3786 | ZOAR | OH | 2 | ZOAR, OH, USA | -81.4223375 | 40.6142286 |
3787 | ZOC-ORLANDO | FL | 2 | ZOC-ORLANDO, FL, USA | -81.2937 | 28.4196 |
Alternative approach (iterating over every row). Note that this version geocodes by `row.city` alone rather than the full location string, so ambiguous names can resolve incorrectly (compare ZOC-ORLANDO below with the location-based result above).
%%time
for index,row in test_df.iterrows():
test_df.loc[index,'longitude'], test_df.loc[index,'latitude'] = get_geocode(row.city.strip())
display(test_df)
city | state | count | location | longitude | latitude | |
---|---|---|---|---|---|---|
3783 | ZEPHYR COVE | NV | 16 | ZEPHYR COVE, NV, USA | -119.9472389 | 39.0060103 |
3784 | ZEPHYRHILLS | FL | 12 | ZEPHYRHILLS, FL, USA | -82.1812531782471 | 28.24262955 |
3785 | ZIONSVILLE | IN | 193 | ZIONSVILLE, IN, USA | -86.26473508663867 | 39.963235499999996 |
3786 | ZOAR | OH | 2 | ZOAR, OH, USA | -81.4223375 | 40.6142286 |
3787 | ZOC-ORLANDO | FL | 2 | ZOC-ORLANDO, FL, USA | 17.2179 | 48.8399 |
CPU times: user 51 ms, sys: 9.18 ms, total: 60.2 ms Wall time: 2.67 s
Create a temporary column "location" by merging city, state and country
canada_mask = pulse_point_city_df.state.isin([*ca_province_dic.values()])
pulse_point_city_df['location'] = pulse_point_city_df['city'] + ', ' + pulse_point_city_df['state']
# use .loc assignment to avoid chained indexing (SettingWithCopyWarning)
pulse_point_city_df.loc[canada_mask, 'location'] += ', CANADA'
pulse_point_city_df.loc[~canada_mask, 'location'] += ', USA'
# to verify
# pulse_point_city_df[pulse_point_city_df['location'].str.endswith('USA')]
# pulse_point_city_df[pulse_point_city_df['location'].str.endswith('CANADA')]
%%time
location_df = pulse_point_city_df.location.progress_apply(lambda x:get_geocode(str(x.strip()))).apply(pd.Series)
0%| | 0/3788 [00:00<?, ?it/s]
CPU times: user 29.8 s, sys: 4.68 s, total: 34.5 s Wall time: 34min 59s
location_df.columns = ['longitude', 'latitude']
pulse_point_city_df = pulse_point_city_df.join(location_df)
Check for missing values and drop cities without geocode
pulse_point_city_df.isna().sum()
city         0
state        0
count        0
location     0
longitude    6
latitude     6
dtype: int64
pulse_point_city_df.dropna(inplace=True)
pulse_point_city_df.to_csv('City-coordinate.csv', index=False)
pulse_point_city_df.tail()
city | state | count | location | longitude | latitude | |
---|---|---|---|---|---|---|
3783 | ZEPHYR COVE | NV | 16 | ZEPHYR COVE, NV, USA | -119.9472389 | 39.0060103 |
3784 | ZEPHYRHILLS | FL | 12 | ZEPHYRHILLS, FL, USA | -82.1812531782471 | 28.24262955 |
3785 | ZIONSVILLE | IN | 193 | ZIONSVILLE, IN, USA | -86.26473508663867 | 39.963235499999996 |
3786 | ZOAR | OH | 2 | ZOAR, OH, USA | -81.4223375 | 40.6142286 |
3787 | ZOC-ORLANDO | FL | 2 | ZOC-ORLANDO, FL, USA | -81.2937 | 28.4196 |
pulse_point_city_df.sort_values(by='count', ascending=False).head(20)
city | state | count | location | longitude | latitude | |
---|---|---|---|---|---|---|
1891 | LOS ANGELES | CA | 8017 | LOS ANGELES, CA, USA | -118.242766 | 34.0536909 |
696 | COLUMBUS | OH | 4873 | COLUMBUS, OH, USA | -83.0007065 | 39.9622601 |
2132 | MILWAUKEE | WI | 4815 | MILWAUKEE, WI, USA | -87.922497 | 43.0349931 |
648 | CLEVELAND | OH | 4700 | CLEVELAND, OH, USA | -81.6936813 | 41.4996562 |
2843 | ROCKFORD | IL | 3213 | ROCKFORD, IL, USA | -89.093966 | 42.2713945 |
3014 | SEATTLE | WA | 2856 | SEATTLE, WA, USA | -122.3300624 | 47.6038321 |
2091 | MIAMI BEACH | FL | 2799 | MIAMI BEACH, FL, USA | -80.1353006 | 25.7929198 |
691 | COLUMBIA | MO | 2777 | COLUMBIA, MO, USA | -92.3484631580807 | 38.9464035 |
3726 | WINNIPEG | MB | 2730 | WINNIPEG, MB, CANADA | -97.1384584 | 49.8955367 |
3158 | SPOKANE | WA | 2703 | SPOKANE, WA, USA | -117.4235106 | 47.6571934 |
1182 | FREMONT | CA | 2611 | FREMONT, CA, USA | -121.988571 | 37.5482697 |
1044 | EUGENE | OR | 2608 | EUGENE, OR, USA | -123.0950506 | 44.0505054 |
2790 | RICHMOND | VA | 2600 | RICHMOND, VA, USA | -77.43428 | 37.5385087 |
1880 | LONG BEACH | CA | 2587 | LONG BEACH, CA, USA | -118.191604 | 33.7690164 |
1382 | HAMPTON | VA | 2476 | HAMPTON, VA, USA | -76.3452057 | 37.0300969 |
1186 | FRISCO | TX | 2461 | FRISCO, TX, USA | -96.8236116 | 33.1506744 |
1194 | FT LAUDERDALE | FL | 2356 | FT LAUDERDALE, FL, USA | -80.1433786 | 26.1223084 |
339 | BOCA RATON | FL | 2259 | BOCA RATON, FL, USA | -80.0830984 | 26.3586885 |
583 | CHATTANOOGA | TN | 2252 | CHATTANOOGA, TN, USA | -85.3094883 | 35.0457219 |
2056 | MEDFORD | OR | 2248 | MEDFORD, OR, USA | -122.8718605 | 42.3264181 |
Top 5 cities by agency engagement:
Name | Count | State |
---|---|---|
1. LOS ANGELES | 8017 | CA |
2. COLUMBUS | 4873 | OH |
3. MILWAUKEE | 4815 | WI |
4. CLEVELAND | 4700 | OH |
5. ROCKFORD | 3213 | IL |
geometry = geopandas.points_from_xy(pulse_point_city_df.longitude, pulse_point_city_df.latitude)
geo_df = geopandas.GeoDataFrame(pulse_point_city_df[['city','count','longitude', 'latitude']], geometry=geometry)
geo_df.head()
city | count | longitude | latitude | geometry | |
---|---|---|---|---|---|
0 | * | 3 | 100.54536963597755 | 13.73723285 | POINT (100.54537 13.73723) |
1 | **UNDEFINED | 17 | 38.1239 | 55.9865 | POINT (38.12390 55.98653) |
2 | -105.124526 | 1 | -105.548 | 38.9967 | POINT (-105.54782 38.99666) |
3 | 0304 | 5 | -74.6185170609449 | 40.36201695 | POINT (-74.61852 40.36202) |
4 | 0306 | 4 | -74.6185170609449 | 40.36201695 | POINT (-74.61852 40.36202) |
heat_map = folium.Map(location=[48, -102], tiles='Cartodb dark_matter', zoom_start=4)  # avoid shadowing the built-in map()
heat_data = [[point.xy[1][0], point.xy[0][0]] for point in geo_df.geometry]
# heat_data
HeatMap(heat_data).add_to(heat_map)
heat_map
# to avoid recursion depth issue change latitude,longitude type to float
# https://github.com/python-visualization/folium/issues/1105
pulse_point_city_df['latitude'] = pulse_point_city_df['latitude'].astype(float)
pulse_point_city_df['longitude'] = pulse_point_city_df['longitude'].astype(float)
map_USA = folium.Map(location=[48, -102],
zoom_start=4,
prefer_canvas=True,
)
occurences = folium.map.FeatureGroup()
n_mean = pulse_point_city_df['count'].mean()
for lat, lng, number, city in zip(pulse_point_city_df['latitude'],
pulse_point_city_df['longitude'],
pulse_point_city_df['count'],
pulse_point_city_df['city']):
occurences.add_child(
folium.vector_layers.CircleMarker(
[lat, lng],
radius=number/(n_mean/3), # radius for number of occurrences
color='yellow',
fill=True,
fill_color='blue',
fill_opacity=0.4,
# tooltip = city
tooltip=str(number)+','+str(city)[:21], # tooltip shows at most 21 characters
# most of the city names contain 5-20 characters
# check pulse_point_city_df.city.apply(len).plot();
# get more from tooltip https://github.com/python-visualization/folium/issues/1010#issuecomment-435968337
)
)
map_USA.add_child(occurences)
printmd(f"PulsePoint covers a total of **{len(pulse_point_df.state.unique())}** US states & Canadian provinces")
PulsePoint covers a total of 44 US states & Canadian provinces
pulse_point_df.state.value_counts().head(20)
CA 85136 FL 26543 WA 17828 VA 17754 OH 17079 OR 15827 WI 9808 MO 9522 TX 8451 IL 5783 PA 5172 IN 4960 KS 4647 NV 4575 MN 3865 NC 3759 AZ 3642 TN 3624 DE 2861 OK 2849 Name: state, dtype: int64
Top 5 states by agency engagement:
Name | Count | Abbreviation |
---|---|---|
1. California | 85136 | CA |
2. Florida | 26543 | FL |
3. Washington | 17828 | WA |
4. Virginia | 17754 | VA |
5. Ohio | 17079 | OH |
Let's Visualize it
pulse_point_df.state.value_counts().head(10).sort_values(ascending=False).plot(kind = 'bar');
(pulse_point_df.groupby('state').sum()['duration_in_seconds'].sort_values(ascending=False) / 3600).head(5)
state
CA    52874.550000
FL    17708.350000
VA    13721.133333
WA    13311.633333
OH    13197.016667
Name: duration_in_seconds, dtype: float64
California has over 52,000 hours of agency engagement, whereas Florida, the second highest, has under 18,000 hours, roughly a third as much
df_state_incident = pulse_point_df.groupby(["date_of_incident",
"state"],
as_index=False).count()[['date_of_incident',
'state', 'title']].reset_index(drop=True).rename(columns={'date_of_incident':'date',
'title':'count'})
df_state_incident.columns = ['date', 'state', 'count']
df_state_incident
date | state | count | |
---|---|---|---|
0 | 2021-05-02 | AK | 1 |
1 | 2021-05-02 | AZ | 2 |
2 | 2021-05-02 | CA | 131 |
3 | 2021-05-02 | CO | 4 |
4 | 2021-05-02 | DC | 1 |
... | ... | ... | ... |
5434 | 2021-12-31 | SD | 29 |
5435 | 2021-12-31 | UT | 2 |
5436 | 2021-12-31 | VA | 417 |
5437 | 2021-12-31 | WA | 256 |
5438 | 2021-12-31 | WI | 81 |
5439 rows × 3 columns
pipeline = pdp.PdPipeline([
pdp.ApplyByCols('count', set_size, 'size', drop=False),
])
agg_incident_data = pipeline.apply(df_state_incident)
agg_incident_data.fillna(0, inplace=True)
agg_incident_data = agg_incident_data.sort_values(by='date', ascending=True)
agg_incident_data.date = agg_incident_data.date.dt.strftime('%Y-%m-%d') # convert to string object
agg_incident_data.tail()
date | state | count | size | |
---|---|---|---|---|
5420 | 2021-12-31 | AZ | 19 | 2.995732 |
5419 | 2021-12-31 | AK | 39 | 3.688879 |
5437 | 2021-12-31 | WA | 256 | 5.549076 |
5427 | 2021-12-31 | MN | 26 | 3.295837 |
5438 | 2021-12-31 | WI | 81 | 4.406719 |
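The `set_size` helper used in the pipeline is not defined in this excerpt. The printed sizes are consistent with a `log1p` transform of the count (an inference from the output above, not code from the notebook), which compresses the huge count range into usable marker sizes:

```python
import numpy as np

# Inferred set_size: the table above shows count 19 -> 2.995732,
# 256 -> 5.549076, 81 -> 4.406719, which match log1p exactly.
def set_size(count):
    return np.log1p(count)  # natural log of (count + 1)

print(round(set_size(19), 6), round(set_size(256), 6), round(set_size(81), 6))
# 2.995732 5.549076 4.406719
```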
fig = px.scatter_geo(
agg_incident_data, locations="state", locationmode='USA-states',
scope="usa",
color="count",
size='size', hover_name="state",
range_color= [0, 2000],
projection="albers usa", animation_frame="date",
title='PulsePoint Incidents: Local Emergencies By State',
color_continuous_scale="portland"
)
fig.show()
# https://developers.google.com/public-data/docs/canonical/states_csv
state_coordinate = pd.read_html("https://developers.google.com/public-data/docs/canonical/states_csv")[0]
state_coordinate
state | latitude | longitude | name | |
---|---|---|---|---|
0 | AK | 63.588753 | -154.493062 | Alaska |
1 | AL | 32.318231 | -86.902298 | Alabama |
2 | AR | 35.201050 | -91.831833 | Arkansas |
3 | AZ | 34.048928 | -111.093731 | Arizona |
4 | CA | 36.778261 | -119.417932 | California |
5 | CO | 39.550051 | -105.782067 | Colorado |
6 | CT | 41.603221 | -73.087749 | Connecticut |
7 | DC | 38.905985 | -77.033418 | District of Columbia |
8 | DE | 38.910832 | -75.527670 | Delaware |
9 | FL | 27.664827 | -81.515754 | Florida |
10 | GA | 32.157435 | -82.907123 | Georgia |
11 | HI | 19.898682 | -155.665857 | Hawaii |
12 | IA | 41.878003 | -93.097702 | Iowa |
13 | ID | 44.068202 | -114.742041 | Idaho |
14 | IL | 40.633125 | -89.398528 | Illinois |
15 | IN | 40.551217 | -85.602364 | Indiana |
16 | KS | 39.011902 | -98.484246 | Kansas |
17 | KY | 37.839333 | -84.270018 | Kentucky |
18 | LA | 31.244823 | -92.145024 | Louisiana |
19 | MA | 42.407211 | -71.382437 | Massachusetts |
20 | MD | 39.045755 | -76.641271 | Maryland |
21 | ME | 45.253783 | -69.445469 | Maine |
22 | MI | 44.314844 | -85.602364 | Michigan |
23 | MN | 46.729553 | -94.685900 | Minnesota |
24 | MO | 37.964253 | -91.831833 | Missouri |
25 | MS | 32.354668 | -89.398528 | Mississippi |
26 | MT | 46.879682 | -110.362566 | Montana |
27 | NC | 35.759573 | -79.019300 | North Carolina |
28 | ND | 47.551493 | -101.002012 | North Dakota |
29 | NE | 41.492537 | -99.901813 | Nebraska |
30 | NH | 43.193852 | -71.572395 | New Hampshire |
31 | NJ | 40.058324 | -74.405661 | New Jersey |
32 | NM | 34.972730 | -105.032363 | New Mexico |
33 | NV | 38.802610 | -116.419389 | Nevada |
34 | NY | 43.299428 | -74.217933 | New York |
35 | OH | 40.417287 | -82.907123 | Ohio |
36 | OK | 35.007752 | -97.092877 | Oklahoma |
37 | OR | 43.804133 | -120.554201 | Oregon |
38 | PA | 41.203322 | -77.194525 | Pennsylvania |
39 | PR | 18.220833 | -66.590149 | Puerto Rico |
40 | RI | 41.580095 | -71.477429 | Rhode Island |
41 | SC | 33.836081 | -81.163725 | South Carolina |
42 | SD | 43.969515 | -99.901813 | South Dakota |
43 | TN | 35.517491 | -86.580447 | Tennessee |
44 | TX | 31.968599 | -99.901813 | Texas |
45 | UT | 39.320980 | -111.093731 | Utah |
46 | VA | 37.431573 | -78.656894 | Virginia |
47 | VT | 44.558803 | -72.577841 | Vermont |
48 | WA | 47.751074 | -120.740139 | Washington |
49 | WI | 43.784440 | -88.787868 | Wisconsin |
50 | WV | 38.597626 | -80.454903 | West Virginia |
51 | WY | 43.075968 | -107.290284 | Wyoming |
pulse_point_state_df = pulse_point_df.groupby(['state']).count()[['title']].reset_index().rename(columns={'title':'count'})
pulse_point_state_df
state | count | |
---|---|---|
0 | AK | 1271 |
1 | AL | 58 |
2 | AR | 797 |
3 | AZ | 3642 |
4 | BC | 6 |
5 | CA | 85136 |
6 | CO | 1815 |
7 | DC | 1788 |
8 | DE | 2861 |
9 | FL | 26543 |
10 | GA | 1192 |
11 | HI | 808 |
12 | IA | 201 |
13 | ID | 1505 |
14 | IL | 5783 |
15 | IN | 4960 |
16 | KS | 4647 |
17 | KY | 908 |
18 | LA | 21 |
19 | MB | 2730 |
20 | MD | 2658 |
21 | MI | 277 |
22 | MN | 3865 |
23 | MO | 9522 |
24 | NC | 3759 |
25 | ND | 2569 |
26 | NE | 1721 |
27 | NJ | 1663 |
28 | NM | 396 |
29 | NV | 4575 |
30 | NY | 2228 |
31 | OH | 17079 |
32 | OK | 2849 |
33 | ON | 15 |
34 | OR | 15827 |
35 | PA | 5172 |
36 | SC | 778 |
37 | SD | 1372 |
38 | TN | 3624 |
39 | TX | 8451 |
40 | UT | 816 |
41 | VA | 17754 |
42 | WA | 17828 |
43 | WI | 9808 |
state_coordinate[~state_coordinate.state.isin(pulse_point_state_df.state)].reset_index(drop=True)
state | latitude | longitude | name | |
---|---|---|---|---|
0 | CT | 41.603221 | -73.087749 | Connecticut |
1 | MA | 42.407211 | -71.382437 | Massachusetts |
2 | ME | 45.253783 | -69.445469 | Maine |
3 | MS | 32.354668 | -89.398528 | Mississippi |
4 | MT | 46.879682 | -110.362566 | Montana |
5 | NH | 43.193852 | -71.572395 | New Hampshire |
6 | PR | 18.220833 | -66.590149 | Puerto Rico |
7 | RI | 41.580095 | -71.477429 | Rhode Island |
8 | VT | 44.558803 | -72.577841 | Vermont |
9 | WV | 38.597626 | -80.454903 | West Virginia |
10 | WY | 43.075968 | -107.290284 | Wyoming |
pulse_point_state_df = pulse_point_state_df.merge(state_coordinate, on='state', how='left')
pulse_point_state_df
# three Canadian provinces also appear in the data:
# Manitoba : MB
# British Columbia : BC
# Ontario : ON
state | count | latitude | longitude | name | |
---|---|---|---|---|---|
0 | AK | 1271 | 63.588753 | -154.493062 | Alaska |
1 | AL | 58 | 32.318231 | -86.902298 | Alabama |
2 | AR | 797 | 35.201050 | -91.831833 | Arkansas |
3 | AZ | 3642 | 34.048928 | -111.093731 | Arizona |
4 | BC | 6 | NaN | NaN | NaN |
5 | CA | 85136 | 36.778261 | -119.417932 | California |
6 | CO | 1815 | 39.550051 | -105.782067 | Colorado |
7 | DC | 1788 | 38.905985 | -77.033418 | District of Columbia |
8 | DE | 2861 | 38.910832 | -75.527670 | Delaware |
9 | FL | 26543 | 27.664827 | -81.515754 | Florida |
10 | GA | 1192 | 32.157435 | -82.907123 | Georgia |
11 | HI | 808 | 19.898682 | -155.665857 | Hawaii |
12 | IA | 201 | 41.878003 | -93.097702 | Iowa |
13 | ID | 1505 | 44.068202 | -114.742041 | Idaho |
14 | IL | 5783 | 40.633125 | -89.398528 | Illinois |
15 | IN | 4960 | 40.551217 | -85.602364 | Indiana |
16 | KS | 4647 | 39.011902 | -98.484246 | Kansas |
17 | KY | 908 | 37.839333 | -84.270018 | Kentucky |
18 | LA | 21 | 31.244823 | -92.145024 | Louisiana |
19 | MB | 2730 | NaN | NaN | NaN |
20 | MD | 2658 | 39.045755 | -76.641271 | Maryland |
21 | MI | 277 | 44.314844 | -85.602364 | Michigan |
22 | MN | 3865 | 46.729553 | -94.685900 | Minnesota |
23 | MO | 9522 | 37.964253 | -91.831833 | Missouri |
24 | NC | 3759 | 35.759573 | -79.019300 | North Carolina |
25 | ND | 2569 | 47.551493 | -101.002012 | North Dakota |
26 | NE | 1721 | 41.492537 | -99.901813 | Nebraska |
27 | NJ | 1663 | 40.058324 | -74.405661 | New Jersey |
28 | NM | 396 | 34.972730 | -105.032363 | New Mexico |
29 | NV | 4575 | 38.802610 | -116.419389 | Nevada |
30 | NY | 2228 | 43.299428 | -74.217933 | New York |
31 | OH | 17079 | 40.417287 | -82.907123 | Ohio |
32 | OK | 2849 | 35.007752 | -97.092877 | Oklahoma |
33 | ON | 15 | NaN | NaN | NaN |
34 | OR | 15827 | 43.804133 | -120.554201 | Oregon |
35 | PA | 5172 | 41.203322 | -77.194525 | Pennsylvania |
36 | SC | 778 | 33.836081 | -81.163725 | South Carolina |
37 | SD | 1372 | 43.969515 | -99.901813 | South Dakota |
38 | TN | 3624 | 35.517491 | -86.580447 | Tennessee |
39 | TX | 8451 | 31.968599 | -99.901813 | Texas |
40 | UT | 816 | 39.320980 | -111.093731 | Utah |
41 | VA | 17754 | 37.431573 | -78.656894 | Virginia |
42 | WA | 17828 | 47.751074 | -120.740139 | Washington |
43 | WI | 9808 | 43.784440 | -88.787868 | Wisconsin |
Drop Canadian Provinces
pulse_point_state_df.dropna(inplace=True)
pulse_point_state_df = pulse_point_state_df.reset_index(drop=True)
url = (
"https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
state_data = pulse_point_state_df.iloc[:,[0,1]]
m = folium.Map(location=[48, -102], zoom_start=4)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_data,
columns=["state", "count"],
key_on="feature.id",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Number of Incidents",
).add_to(m)
folium.LayerControl().add_to(m)
m
# icon credit : https://icon-icons.com/icon/location-sos-phone-call-help/68848
# https://www.clipartmax.com/middle/m2H7i8G6N4H7b1N4_metallic-icon-royalty-free-cliparts-icone-sos-png/
# custom icon : https://stackoverflow.com/a/68992396/11105356
for i in range(0, len(pulse_point_state_df)):
folium.Marker(
location = [pulse_point_state_df.iloc[i]['latitude'], pulse_point_state_df.iloc[i]['longitude']],
popup = folium.Popup(f"{pulse_point_state_df.iloc[i]['name']}\n{pulse_point_state_df.iloc[i]['count']}", parse_html=True),
icon=folium.features.CustomIcon('https://i.postimg.cc/JhmnMQXj/sos.png', icon_size=(24, 31))
).add_to(m)
m
# https://plotly.com/python/choropleth-maps
fig = go.Figure(data=go.Choropleth(
locations=pulse_point_state_df['state'], # Spatial coordinates
z = pulse_point_state_df['count'].astype(float), # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
colorscale = 'Reds',
colorbar_title = "Total Occurrences",
))
fig.update_layout(
title_text = 'US PulsePoint Emergencies Occurrences by State',
    geo_scope='usa', # limit map scope to USA
)
fig.show()
pulse_point_df.date_of_incident.value_counts().head(20)
2021-11-13    3918
2021-11-27    3765
2021-11-21    3481
2021-12-31    3422
2021-12-07    3273
2021-11-07    3164
2021-11-25    2963
2021-08-20    2947
2021-06-22    2931
2021-06-18    2890
2021-06-16    2865
2021-11-04    2862
2021-08-23    2850
2021-12-05    2832
2021-10-12    2755
2021-09-16    2692
2021-11-10    2656
2021-08-21    2560
2021-08-09    2547
2021-08-15    2529
Name: date_of_incident, dtype: int64
pulse_point_df.groupby(['date_of_incident']).count()['title'].reset_index().rename(columns={'date_of_incident':'Date',
'title':'count'}).sort_values('Date').plot(y='count',
x='Date',label="Incident")
plt.xlabel('Date of Incidents')
plt.ylabel('Number of Incidents')
plt.title("Incidents Frequency (Daily)")
plt.show();
The number of PulsePoint dispatches increased after August 2021
pulse_point_df.groupby([pd.Grouper(key='date_of_incident',
freq='W-MON')]).count()['title'].reset_index().rename(columns={'date_of_incident':'Date',
'title':'count'}).sort_values('Date').plot(y='count',
x='Date',label="Incident")
plt.xlabel('Date (Month)')
plt.ylabel('Number of Incidents')
plt.title("Incidents Frequency (Weekly)")
plt.show();
# pulse_point_df.groupby(['day_name','time_of_the_day'],as_index=False).count()
incident_time_df = pulse_point_df.groupby(["day_name", "time_of_the_day"],
as_index=False).count()[['day_name',
'time_of_the_day',
'title']].reset_index(drop=True).rename(columns={'date_of_incident':'date',
'title':'incident_count'})
incident_time_df
day_name | time_of_the_day | incident_count | |
---|---|---|---|
0 | Friday | Afternoon | 4436 |
1 | Friday | Evening | 3278 |
2 | Friday | Midnight | 11565 |
3 | Friday | Morning | 15066 |
4 | Friday | Night | 4284 |
5 | Monday | Afternoon | 4806 |
6 | Monday | Evening | 4258 |
7 | Monday | Midnight | 10151 |
8 | Monday | Morning | 14196 |
9 | Monday | Night | 5633 |
10 | Saturday | Afternoon | 5943 |
11 | Saturday | Evening | 5325 |
12 | Saturday | Midnight | 7945 |
13 | Saturday | Morning | 11508 |
14 | Saturday | Night | 6887 |
15 | Sunday | Afternoon | 6652 |
16 | Sunday | Evening | 4314 |
17 | Sunday | Midnight | 12949 |
18 | Sunday | Morning | 17344 |
19 | Sunday | Night | 4724 |
20 | Thursday | Afternoon | 4016 |
21 | Thursday | Evening | 5192 |
22 | Thursday | Midnight | 12619 |
23 | Thursday | Morning | 15335 |
24 | Thursday | Night | 6262 |
25 | Tuesday | Afternoon | 3888 |
26 | Tuesday | Evening | 4054 |
27 | Tuesday | Midnight | 10678 |
28 | Tuesday | Morning | 13109 |
29 | Tuesday | Night | 4894 |
30 | Wednesday | Afternoon | 4796 |
31 | Wednesday | Evening | 4737 |
32 | Wednesday | Midnight | 10002 |
33 | Wednesday | Morning | 14514 |
34 | Wednesday | Night | 5918 |
display(incident_time_df.groupby(['day_name']).sum().reset_index().sort_values(["incident_count"], ascending=False))
display(incident_time_df.groupby(['day_name']).sum().plot(kind='bar'))
day_name | incident_count | |
---|---|---|
3 | Sunday | 45983 |
4 | Thursday | 43424 |
6 | Wednesday | 39967 |
1 | Monday | 39044 |
0 | Friday | 38629 |
2 | Saturday | 37608 |
5 | Tuesday | 36623 |
The highest number of incidents occurred on Sunday
# incident_time_df.groupby(['day_name','time_of_the_day']).sum().plot(kind='bar', figsize=(25,8));
fig = px.bar(incident_time_df,
x="day_name",
y="incident_count",
color="time_of_the_day",
barmode="group",
labels={'day_name':'Day',
'incident_count': 'Incident Count',
'time_of_the_day': ''},
title=f"Number of Incidents by Time of the Day",
).for_each_trace(lambda t: t.update(name=t.name.replace("=","")))
printmd("Emergency responses spiked at **midnight** or in the **morning**")
fig.show()
## alternative in seaborn catplot
# g=sns.catplot(data= incident_time_df,
# x="time_of_the_day",
# col='day_name',
# y='incident_count',
# kind='bar',
# height=6,
# col_wrap=4,
# )
# # bug : x-ticks not showing while using col_wrap
# # fixed : https://stackoverflow.com/a/52184614/11105356
# for ax in g.axes.flatten():
# ax.tick_params(labelbottom=True)
# g.set_ylabels('Incident count')
# # for rotated x-ticks
# # for ax in g.axes:
# # plt.setp(ax.get_xticklabels(), visible=True, rotation=45)
# # plt.subplots_adjust(hspace=0.5)
# plt.show()
Emergency responses spiked at midnight or in the morning
Most incidents occurred at midnight or in the morning. Some incidents likely began during the night and were logged the following morning.
pulse_point_df.groupby(['title']).count()[['agency']].rename(columns={'agency':'total'}).sort_values('total', ascending=False)[:25]
total | |
---|---|
title | |
Medical Emergency | 179753 |
Traffic Collision | 22835 |
Fire Alarm | 11147 |
Alarm | 7420 |
Public Service | 7363 |
Refuse/Garbage Fire | 4437 |
Structure Fire | 4152 |
Lift Assist | 3130 |
Mutual Aid | 2862 |
Fire | 2723 |
Residential Fire | 2559 |
Expanded Traffic Collision | 2527 |
Interfacility Transfer | 2014 |
Outside Fire | 1963 |
Vehicle Fire | 1810 |
Investigation | 1628 |
Carbon Monoxide | 1513 |
Vegetation Fire | 1428 |
Hazardous Condition | 1423 |
Commercial Fire | 1405 |
Gas Leak | 1389 |
Wires Down | 1366 |
Smoke Investigation | 1252 |
Odor Investigation | 1154 |
Elevator Rescue | 980 |
mask = (pulse_point_df.time_of_the_day == 'Midnight') | (pulse_point_df.time_of_the_day == 'Morning')
highest_occ_incident = pulse_point_df[mask].groupby(['time_of_the_day','title']).count()[['agency']].rename(columns={'agency':'total'})
highest_occ_incident.sort_values('total', ascending=False)[:25]
total | ||
---|---|---|
time_of_the_day | title | |
Morning | Medical Emergency | 64587 |
Midnight | Medical Emergency | 47640 |
Morning | Traffic Collision | 8262 |
Midnight | Traffic Collision | 7135 |
Morning | Fire Alarm | 3507 |
Midnight | Fire Alarm | 3112 |
Morning | Public Service | 2725 |
Alarm | 2437 | |
Midnight | Alarm | 2067 |
Public Service | 2042 | |
Morning | Refuse/Garbage Fire | 1669 |
Structure Fire | 1566 | |
Fire | 1102 | |
Midnight | Structure Fire | 1064 |
Morning | Residential Fire | 1027 |
Lift Assist | 993 | |
Mutual Aid | 966 | |
Outside Fire | 936 | |
Midnight | Refuse/Garbage Fire | 858 |
Morning | Expanded Traffic Collision | 840 |
Midnight | Mutual Aid | 772 |
Morning | Investigation | 735 |
Midnight | Expanded Traffic Collision | 730 |
Lift Assist | 721 | |
Residential Fire | 697 |
Top ten emergencies during 'Midnight' or 'Morning' -
Midnight :
Morning :
pulse_point_df.isna().sum()
title                       0
agency                      0
location                    0
timestamp_time              0
date_of_incident            0
description             13384
duration                    0
business               266152
address                     0
address_2              272246
city                        0
state                       0
duration_in_seconds         0
day_name                    0
weekday                     0
month_name                  0
time_of_the_day             0
dtype: int64
pulse_point_cluster_df = pulse_point_df.drop([# 'location',
'timestamp_time',
'date_of_incident',
# 'description',
'duration',
# 'address',
'business',
'address_2',
], axis=1)
pulse_point_cluster_df.dropna(inplace=True)
pulse_point_cluster_df.isna().sum()
title                  0
agency                 0
location               0
description            0
address                0
city                   0
state                  0
duration_in_seconds    0
day_name               0
weekday                0
month_name             0
time_of_the_day        0
dtype: int64
def scaling_df(df):
X_cluster = df.copy()
object_cols = df.columns[df.dtypes == object].to_list()
label_enc=LabelEncoder()
for i in object_cols:
X_cluster[i]=X_cluster[[i]].apply(label_enc.fit_transform)
scaler = MinMaxScaler()
scaler.fit(X_cluster)
X_cluster_scaled = pd.DataFrame(scaler.transform(X_cluster),columns= X_cluster.columns)
return X_cluster_scaled
X_cluster = pulse_point_cluster_df.copy()
X_cluster_scaled = scaling_df(X_cluster)
X_cluster_scaled
title | agency | location | description | address | city | state | duration_in_seconds | day_name | weekday | month_name | time_of_the_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.127907 | 0.854356 | 0.144585 | 0.294277 | 0.160584 | 0.855613 | 0.953488 | 0.043679 | 0.500000 | 1.0 | 0.571429 | 0.00 |
1 | 0.662791 | 0.875163 | 0.257272 | 0.367620 | 0.285639 | 0.871238 | 0.209302 | 0.004111 | 0.500000 | 1.0 | 0.571429 | 0.75 |
2 | 0.209302 | 0.875163 | 0.257622 | 0.971573 | 0.286045 | 0.871238 | 0.209302 | 0.020555 | 0.500000 | 1.0 | 0.571429 | 0.75 |
3 | 0.290698 | 0.872562 | 0.788836 | 0.524330 | 0.801920 | 0.869792 | 0.976744 | 0.009250 | 0.500000 | 1.0 | 0.571429 | 0.75 |
4 | 0.290698 | 0.872562 | 0.812933 | 0.521830 | 0.824166 | 0.869792 | 0.976744 | 0.007194 | 0.500000 | 1.0 | 0.571429 | 0.75 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
267889 | 0.662791 | 0.854356 | 0.047843 | 0.294371 | 0.053327 | 0.855613 | 0.953488 | 0.010791 | 0.166667 | 0.0 | 0.571429 | 0.50 |
267890 | 0.127907 | 0.854356 | 0.363169 | 0.294324 | 0.402886 | 0.855613 | 0.953488 | 0.008736 | 0.166667 | 0.0 | 0.571429 | 0.50 |
267891 | 0.662791 | 0.853056 | 0.324505 | 0.067862 | 0.360158 | 0.630208 | 0.697674 | 0.026208 | 0.500000 | 1.0 | 0.571429 | 1.00 |
267892 | 0.290698 | 0.872562 | 0.816313 | 0.519694 | 0.826735 | 0.869792 | 0.976744 | 0.004111 | 0.500000 | 1.0 | 0.571429 | 0.25 |
267893 | 0.290698 | 0.872562 | 0.816477 | 0.527699 | 0.826836 | 0.869792 | 0.976744 | 0.001028 | 0.500000 | 1.0 | 0.571429 | 0.00 |
267894 rows × 12 columns
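Note that label-encoding the object columns imposes an arbitrary numeric order that distance-based methods will pick up. A hedged alternative sketch using one-hot encoding before scaling (an alternative, not the notebook's approach; the tiny frame below is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny hypothetical frame mimicking the mixed-type cluster columns
df = pd.DataFrame({
    "title": ["Fire", "Medical Emergency", "Fire"],
    "state": ["CA", "WA", "CA"],
    "duration_in_seconds": [1200, 300, 2460],
})

# One-hot the object columns instead of label-encoding them,
# so no artificial ordering is introduced between categories
encoded = pd.get_dummies(df, columns=["title", "state"])

scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(encoded), columns=encoded.columns)
print(scaled.shape)
```

The trade-off is dimensionality: one-hot encoding high-cardinality columns such as `city` or `address` would explode the feature count, which is likely why label encoding was used here.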
def pulse_point_pca(X_data, n_components):
pca = PCA(n_components=n_components)
fit_pca = pca.fit(X_data)
print("Variance Explained with {0} components ".format(n_components),
round(sum(fit_pca.explained_variance_ratio_),2))
return fit_pca, fit_pca.transform(X_data)
# for 12 components
pca_full, pulsepoint_data_full = pulse_point_pca(X_cluster_scaled, X_cluster_scaled.shape[1])
Variance Explained with 12 components 1.0
X_cluster_scaled.shape
(267894, 12)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.title("Proportion of PCA variance\nexplained by number of components")
plt.xlabel("Number of components")
plt.ylabel("Proportion of variance explained");
We need about 7 components to explain ~90% of the variance in the data
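The component count needed for a given variance target can also be read off programmatically from the cumulative ratio. A sketch on a synthetic 12-column matrix standing in for the scaled features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled feature matrix (12 columns, like the notebook's)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 12)) @ rng.normal(size=(12, 12))

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%
n_90 = int(np.argmax(cum_var >= 0.90)) + 1
print(n_90, round(cum_var[n_90 - 1], 3))
```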
pulse_point_state_duration_df = pulse_point_cluster_df.groupby('city').agg({'agency':'count', 'duration_in_seconds': 'sum'}).reset_index()
pulse_point_state_duration_df.duration_in_seconds = pulse_point_state_duration_df.duration_in_seconds.apply(lambda x: x/3600)
pulse_point_state_duration_df.columns= ['city','total_agency_engagement', 'total_duration_hr']
x = pulse_point_state_duration_df['total_agency_engagement'].values
y = pulse_point_state_duration_df['total_duration_hr'].values
plt.scatter(x,y)
plt.title('Agency Engagement vs Incident Duration')
plt.xlabel('Number of Agency Engagement')
plt.ylabel('Total Incident Duration(hour)')
plt.show()
The total_agency_engagement
(number of agency engagements) and total_duration_hr
(total duration in hours) columns have a positive linear relationship.
A higher total incident duration in a city indicates more agency engagements
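The strength of that linear relationship can be quantified with the Pearson correlation coefficient. A sketch on hypothetical engagement/duration pairs (stand-ins for the two per-city columns):

```python
import numpy as np

# Hypothetical per-city values standing in for the two columns
engagement = np.array([10, 50, 120, 400, 900, 2500], dtype=float)
duration_hr = np.array([6, 30, 70, 260, 520, 1600], dtype=float)

# Pearson r close to +1 indicates a strong positive linear relationship
r = np.corrcoef(engagement, duration_hr)[0, 1]
print(round(r, 3))
```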
X = pulse_point_state_duration_df[['total_agency_engagement', 'total_duration_hr']].values
With “k-means++” initialization, the objective of this clustering is to create groups of cities based on the number of agency engagements and the total incident duration (hours)
# To decide the optimal number of clusters K, apply the elbow method to k-means++
wcss=[]
for i in range(1,11):
kmeans = KMeans(n_clusters= i, init='k-means++', random_state=SEED)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
# inertia_ is the within-cluster sum of squared distances (WCSS) minimized by k-means
The elbow suggests the best value of K lies between 2 and 4
plt.plot(range(1,11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()
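The elbow can also be estimated numerically from the WCSS curve: a simple knee heuristic picks the K where the curve bends most sharply (largest second difference). This is a sketch on synthetic blobs, not the notebook's city matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D stand-in for the (engagement, duration) matrix
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

# Knee heuristic: the elbow is where the second difference of WCSS is largest;
# wcss[i] corresponds to k = i + 1, and diff(n=2) shifts the index by one more
second_diff = np.diff(wcss, 2)
best_k = int(np.argmax(second_diff)) + 2
print(best_k)
```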
kmeans = KMeans(n_clusters= 3, init='k-means++', random_state=SEED)
y_kmeans= kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=200, alpha=0.5);
plt.title('Clusters of Incidents By Duration and Agency Engagement')
plt.xlabel('Number of Agency Engagement')
plt.ylabel('Total Incident Duration(hour)')
plt.show()
The k-means clustering algorithm groups the cities into three clusters based on the duration of incidents and the number of agencies involved. A shorter total duration corresponds to fewer agency engagements, and vice versa.
Group 1 : Cities with a very low total incident duration and few agency engagements
Group 2 : Cities with a comparatively higher total incident duration and more agency engagements
Group 3 : Cities with the highest total incident duration and agency engagements
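The per-group sizes and ranges behind such a description can be read directly from the fitted labels. A sketch on hypothetical city data (the three synthetic tiers below are stand-ins for the real matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the per-city (engagement, duration) matrix
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([50, 30], 10, (40, 2)),       # low-activity cities
    rng.normal([400, 250], 40, (15, 2)),     # mid-activity cities
    rng.normal([2000, 1200], 100, (5, 2)),   # high-activity cities
])

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)

# Size and engagement range of each group
for k in range(3):
    members = X[labels == k]
    print(k, len(members),
          round(members[:, 0].min(), 1), round(members[:, 0].max(), 1))
```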
The “KElbowVisualizer” from the yellowbrick library, which implements the “elbow” method, also finds the optimum K value to be 3
# Instantiate the clustering model and visualizer
model = AgglomerativeClustering()
visualizer = KElbowVisualizer(model, k=(1,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show();
#Initiating the Agglomerative Clustering model
AC = AgglomerativeClustering(n_clusters=3)
yhat_AC = AC.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=yhat_AC, s=50, cmap='viridis')
plt.title('Clusters of Incidents By Duration and Agency Engagement')
plt.xlabel('Number of Agency Engagement')
plt.ylabel('Total Incident Duration(hour)')
plt.show()
#Initiating the Agglomerative Clustering model
AC = AgglomerativeClustering(n_clusters=3, linkage='complete')
yhat_AC = AC.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=yhat_AC, s=50, cmap='viridis')
plt.title('Clusters of Incidents By Duration and Agency Engagement')
plt.xlabel('Number of Agency Engagement')
plt.ylabel('Total Incident Duration(hour)')
plt.show()
From the above clustering techniques, it is clear that “complete” linkage is not suitable for agglomerative clustering here: although three clusters were requested, it mostly formed two, leaving a one-element cluster caused by an outlier. K-means and “ward” agglomerative clustering provided better results. Note that the density of cities is high where both the number of agency engagements and the total incident duration are low.
K-means++ focused on separating the densely packed low-value cities, with an unequal split of the parameter range –
The range of cluster 1 is larger than that of cluster 0 in k-means, whereas ward agglomerative clustering split clusters 0 and 1 almost equally. If the range of the parameters (engagements or duration) matters for another factor – for example, budget allocation with respect to engagements, or business decisions and future planning based on the duration of emergencies – then, depending on the priority, either clustering would be acceptable.
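One way to compare these partitions quantitatively is the silhouette score, which rewards tight, well-separated clusters. A sketch on synthetic blobs (not the notebook's city matrix):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D stand-in for the (engagement, duration) matrix
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

models = {
    "k-means++": KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0),
    "ward": AgglomerativeClustering(n_clusters=3, linkage='ward'),
    "complete": AgglomerativeClustering(n_clusters=3, linkage='complete'),
}

# Higher silhouette (closer to 1) means tighter, better-separated clusters
scores = {name: silhouette_score(X, m.fit_predict(X)) for name, m in models.items()}
for name, s in scores.items():
    print(name, round(s, 3))
```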
top_incidents_list = pulse_point_df.title.value_counts().head(25).reset_index().rename(columns={'title':'total','index':'title'})
top_incidents_list
title | total | |
---|---|---|
0 | Medical Emergency | 179753 |
1 | Traffic Collision | 22835 |
2 | Fire Alarm | 11147 |
3 | Alarm | 7420 |
4 | Public Service | 7363 |
5 | Refuse/Garbage Fire | 4437 |
6 | Structure Fire | 4152 |
7 | Lift Assist | 3130 |
8 | Mutual Aid | 2862 |
9 | Fire | 2723 |
10 | Residential Fire | 2559 |
11 | Expanded Traffic Collision | 2527 |
12 | Interfacility Transfer | 2014 |
13 | Outside Fire | 1963 |
14 | Vehicle Fire | 1810 |
15 | Investigation | 1628 |
16 | Carbon Monoxide | 1513 |
17 | Vegetation Fire | 1428 |
18 | Hazardous Condition | 1423 |
19 | Commercial Fire | 1405 |
20 | Gas Leak | 1389 |
21 | Wires Down | 1366 |
22 | Smoke Investigation | 1252 |
23 | Odor Investigation | 1154 |
24 | Elevator Rescue | 980 |
top_incidents_list[:10].plot(x='title', y='total', rot=30);
top_10_incidents = top_incidents_list.title.tolist()[:10]
top_10_incidents
['Medical Emergency', 'Traffic Collision', 'Fire Alarm', 'Alarm', 'Public Service', 'Refuse/Garbage Fire', 'Structure Fire', 'Lift Assist', 'Mutual Aid', 'Fire']
pulse_point_top_10_df = pulse_point_df[pulse_point_df['title'].isin(top_10_incidents)].reset_index(drop=True)
pulse_point_top_10_df
title | agency | location | timestamp_time | date_of_incident | description | duration | business | address | address_2 | city | state | duration_in_seconds | day_name | weekday | month_name | time_of_the_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Fire | Tacoma Fire | 2300 A ST, TACOMA, WA | 11:52 AM | 2021-05-02 | NaN | 1 h 46 m | None | 2300 A ST | NaN | TACOMA | WA | 6360 | Sunday | 6 | May | Morning |
1 | Fire | Tacoma Fire | PUYALLUP AVE & A ST, TACOMA, WA | 9:37 AM | 2021-05-02 | E04 | 18 m | None | PUYALLUP AVE & A ST | NaN | TACOMA | WA | 1080 | Sunday | 6 | May | Morning |
2 | Fire | Tacoma Fire | S 24TH ST & A ST, TACOMA, WA | 9:00 AM | 2021-05-02 | E02 | 14 m | None | S 24TH ST & A ST | NaN | TACOMA | WA | 840 | Sunday | 6 | May | Morning |
3 | Mutual Aid | WPG Fire Paramedic | DAKOTA ST & MEADOWOOD DR, WINNIPEG, MANITOBA | 1:33 AM | 2021-05-03 | NaN | 2 m | None | DAKOTA ST & MEADOWOOD DR | NaN | WINNIPEG | MB | 120 | Monday | 0 | May | Midnight |
4 | Fire | Winter Park Fire | 1000 EARLY AVE, WINTER PARK, FL | 12:43 AM | 2021-05-03 | E61 | 20 m | None | 1000 EARLY AVE | NaN | WINTER PARK | FL | 1200 | Monday | 0 | May | Midnight |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
245817 | Structure Fire | Anoka County | SOUTH COON CREEK DR NW & ROUND LAKE BLVD NW, ANDOVER, MN | 6:22 AM | 2021-05-02 | AALL AE21 | 10 m | None | SOUTH COON CREEK DR NW & ROUND LAKE BLVD NW | NaN | ANDOVER | MN | 600 | Sunday | 6 | May | Morning |
245818 | Structure Fire | Anaheim FD | 710 E CERRITOS AV, ANAHEIM, CA (CRENSHAW LUMBER) | 1:33 AM | 2021-05-02 | AB1 AB2 AE1 AE3 AE5 AE6 AE7 AI2 AT1 AT3 AT6 CE83 OB1 OE3 | 5 h 46 m | CRENSHAW LUMBER | 710 E CERRITOS AV | NaN | ANAHEIM | CA | 20760 | Sunday | 6 | May | Midnight |
245819 | Mutual Aid | Sumter Fire & EMS | 34464 CORTEZ BLVD, BLDG NOT FOUND, RIDGE MANOR, FL (DOLLAR GENERAL) | 3:45 AM | 2021-05-03 | NaN | 9 m | DOLLAR GENERAL | 34464 CORTEZ BLVD | BLDG NOT FOUND | RIDGE MANOR | FL | 540 | Monday | 0 | May | Midnight |
245820 | Fire | Tacoma Fire | S 8TH ST & YAKIMA AVE, TACOMA, WA | 7:38 PM | 2021-05-02 | E01 | 8 m | None | S 8TH ST & YAKIMA AVE | NaN | TACOMA | WA | 480 | Sunday | 6 | May | Evening |
245821 | Fire | Tacoma Fire | S 92ND ST & S HOSMER ST, TACOMA, WA | 3:47 PM | 2021-05-02 | E08 | 2 m | None | S 92ND ST & S HOSMER ST | NaN | TACOMA | WA | 120 | Sunday | 6 | May | Afternoon |
245822 rows × 17 columns
top_10_time_df = pulse_point_top_10_df.groupby(["title","day_name", "time_of_the_day"],
as_index=False).count()[['title','day_name',
'time_of_the_day',
'agency']].reset_index(drop=True).rename(columns={'date_of_incident':'date',
'agency':'incident_count'})
top_10_time_df
title | day_name | time_of_the_day | incident_count | |
---|---|---|---|---|
0 | Alarm | Friday | Afternoon | 113 |
1 | Alarm | Friday | Evening | 114 |
2 | Alarm | Friday | Midnight | 355 |
3 | Alarm | Friday | Morning | 435 |
4 | Alarm | Friday | Night | 134 |
... | ... | ... | ... | ... |
345 | Traffic Collision | Wednesday | Afternoon | 232 |
346 | Traffic Collision | Wednesday | Evening | 398 |
347 | Traffic Collision | Wednesday | Midnight | 851 |
348 | Traffic Collision | Wednesday | Morning | 990 |
349 | Traffic Collision | Wednesday | Night | 408 |
350 rows × 4 columns
# plotly categorical barplot
def plot_top_incident_by_time(title):
fig = px.bar(top_10_time_df[top_10_time_df.title.str.strip()==title],
x="day_name",
y="incident_count",
color="time_of_the_day",
barmode="group",
labels={'day_name':'Day',
'incident_count': 'Incident Count',
'time_of_the_day': ''},
title=f"{title} by Time of The Day",
).for_each_trace(lambda t: t.update(name=t.name.replace("=","")))
# remove '=' sign from color
# https://github.com/plotly/plotly_express/issues/36
fig.show()
# seaborn alternative
# def plot_top_incident_by_time(title):
# g=sns.catplot(data=top_10_time_df[top_10_time_df.title.str.strip()==title],
# x="time_of_the_day",
# y='incident_count',
# col='day_name',
# kind='bar',
# height=6,
# col_wrap=4,)
# for ax in g.axes.flatten():
# ax.tick_params(labelbottom=True)
# g.set_ylabels('Incident count')
# g.set_axis_labels('Time of the Day')
# g.set_titles("{col_name}")
# # g.despine(left=True)
# # plt.suptitle('Incident By Time')
# plt.show()
plot_top_incident_by_time('Medical Emergency')
plot_top_incident_by_time('Traffic Collision')
plot_top_incident_by_time('Fire Alarm')
plot_top_incident_by_time('Alarm')
plot_top_incident_by_time('Public Service')
plot_top_incident_by_time('Refuse/Garbage Fire')
plot_top_incident_by_time('Structure Fire')
plot_top_incident_by_time('Mutual Aid')
plot_top_incident_by_time('Lift Assist')
plot_top_incident_by_time('Fire')
According to a 2017 study from the U.S. Census Bureau, California's local governments consist of 57 counties; 482 cities, towns, and villages; and 2,894 special districts
pulse_point_ca_df = pulse_point_df[pulse_point_df.state.str.strip() == 'CA'].copy()
pulse_point_ca_df.drop(axis=1, columns=['state'],inplace=True)
pulse_point_ca_df.head()
title | agency | location | timestamp_time | date_of_incident | description | duration | business | address | address_2 | city | duration_in_seconds | day_name | weekday | month_name | time_of_the_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | Structure Fire | Woodland Fire | 2220 MURPHY DR, WOODLAND, CA | 11:00 PM | 2021-05-02 | BAT1 BAT31 E1 E2 E3 E32 E7 T3 | 41 m | None | 2220 MURPHY DR | NaN | WOODLAND | 2460 | Sunday | 6 | May | Night |
20 | Fire | Colton Fire | EB 10 MT VERNON > I 215, COLTON, CA | 12:57 AM | 2021-05-03 | MT211 | 22 m | None | EB 10 MT VERNON > I 215 | NaN | COLTON | 1320 | Monday | 0 | May | Midnight |
21 | Working Residential Fire | Contra Costa FPD | 30 GUZMAN CT, CONCORD, CA | 12:25 AM | 2021-05-03 | BC1 BC2 BS107 DISP E105 E109 E110 E122 E1665 INV80 M9 PD PGE2 T101 TC10 TIMER | 2 h 55 m | None | 30 GUZMAN CT | NaN | CONCORD | 10500 | Monday | 0 | May | Midnight |
25 | Refuse/Garbage Fire | Compton Fire | 710 - LONG BEACH FRWY & W ALONDRA BLVD, COMPTON, CA | 11:34 AM | 2021-05-02 | 42 | 32 m | None | 710 - LONG BEACH FRWY & W ALONDRA BLVD | NaN | COMPTON | 1920 | Sunday | 6 | May | Morning |
35 | Residential Fire | Alameda Co FD | 32242 MERCURY WAY, UNION CITY, CA | 1:12 AM | 2021-05-12 | B07 E27 E32 E33 R24 T31 TAC03 | 23 m | None | 32242 MERCURY WAY | NaN | UNION CITY | 1380 | Wednesday | 2 | May | Midnight |
pulse_point_ca_df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
duration_in_seconds | 85136.0 | 2235.815401 | 2495.429214 | 0.0 | 960.0 | 1500.0 | 2940.0 | 94080.0 |
weekday | 85136.0 | 3.157278 | 2.070344 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 |
The average duration of the incidents is ~37 minutes
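The ~37-minute figure follows directly from the mean in the `describe()` output above:

```python
mean_seconds = 2235.815401  # mean duration_in_seconds from describe()
print(round(mean_seconds / 60, 1))  # ≈ 37.3 minutes
```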
pulse_point_ca_df.describe(include='object').T
count | unique | top | freq | |
---|---|---|---|---|
title | 85136 | 81 | Medical Emergency | 52612 |
agency | 85136 | 190 | Contra Costa FPD | 4609 |
location | 85136 | 54278 | N HARBOR BL, FULLERTON, CA | 79 |
timestamp_time | 85136 | 1440 | 6:42 AM | 113 |
description | 83417 | 22841 | E51 | 439 |
duration | 85136 | 469 | 19 m | 2528 |
business | 3264 | 2633 | UNINC | 83 |
address | 85136 | 50046 | EL CAMINO REAL | 112 |
address_2 | 2791 | 1197 | GIL | 168 |
city | 85136 | 774 | LOS ANGELES | 8017 |
day_name | 85136 | 7 | Sunday | 15451 |
month_name | 85136 | 8 | November | 16856 |
time_of_the_day | 85136 | 5 | Morning | 32291 |
pulse_point_ca_df.description.value_counts().head(20)
E51        439
E57        355
E53        307
E3         270
E58        269
E56        261
E1         254
E55        254
E52        247
FA1 FE1    243
E60        238
E13        233
E14        232
E54        232
E18        229
E11        225
E10        225
E7         217
E33        214
E2         211
Name: description, dtype: int64
The most frequent description codes (E51, E57, ...) refer to fire engine units dispatched -
printmd(f"There are a total of **{len(pulse_point_ca_df.city.unique())}** cities in California")
There are a total of 774 cities in California
pulse_point_ca_df.day_name.value_counts().head(10)
printmd(f"**Most emergencies take place on Saturday and Sunday (the weekend) in California**")
Most emergencies take place on Saturday and Sunday (the weekend) in California
pulse_point_ca_df.time_of_the_day.value_counts()
Morning      32291
Midnight     22433
Afternoon    11710
Night        11128
Evening       7574
Name: time_of_the_day, dtype: int64
pulse_point_ca_df.title.value_counts().head(20)
Medical Emergency             52612
Traffic Collision              7065
Refuse/Garbage Fire            3418
Alarm                          2908
Fire Alarm                     2519
Public Service                 2076
Structure Fire                 1216
Fire                           1037
Investigation                   889
Outside Fire                    871
Expanded Traffic Collision      826
Vehicle Fire                    645
Wires Down                      569
Commercial Fire                 501
Interfacility Transfer          485
Lift Assist                     448
Waterflow Alarm                 444
Residential Fire                439
Emergency Response              414
Vegetation Fire                 408
Name: title, dtype: int64
Apart from medical emergencies, the top incident types in California include fires, alarms, and traffic collisions.
California is susceptible to an impressive array of natural hazards, including earthquakes, fires, flooding and mudslides.
Here is a good article on this -
4 REASONS CALIFORNIA IS MORE SUSCEPTIBLE TO NATURAL DISASTERS THAN OTHER STATES
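For higher-level plots, the raw incident titles above could be collapsed into broader categories. A minimal sketch with assumed keyword rules (these are illustrative, not PulsePoint's own taxonomy):

```python
def broad_category(title):
    """Map a raw incident title to a broad category via keyword matching.

    The keyword rules are illustrative assumptions, not PulsePoint's taxonomy.
    """
    t = title.lower()
    if "medical" in t or "transfer" in t or "lift assist" in t:
        return "Medical"
    if "collision" in t or "traffic" in t:
        return "Traffic"
    if "fire" in t and "alarm" not in t:
        return "Fire"
    if "alarm" in t:
        return "Alarm"
    return "Other"

titles = ["Medical Emergency", "Traffic Collision", "Refuse/Garbage Fire",
          "Fire Alarm", "Waterflow Alarm", "Public Service"]
print([broad_category(t) for t in titles])
# ['Medical', 'Traffic', 'Fire', 'Alarm', 'Alarm', 'Other']
```

Applied with `pulse_point_ca_df.title.map(broad_category)`, this gives a handful of classes that are easier to compare across cities or times of day.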
pulse_point_ca_df.city.value_counts().head(20)
LOS ANGELES      8017
FREMONT          2611
LONG BEACH       2587
FULLERTON        2138
COMPTON          1705
SANTA ANA        1683
SANTA CLARITA    1452
LANCASTER        1413
MILPITAS         1403
WOODLAND         1377
POMONA           1266
ONTARIO          1230
OCEANSIDE        1227
IRVINE           1175
PALMDALE         1074
CONCORD           970
GARDEN GROVE      942
ANTIOCH           937
GLENDALE          821
INGLEWOOD         784
Name: city, dtype: int64
Map Geolocation
mask = (pulse_point_city_df.city.isin(pulse_point_ca_df.city.unique().tolist())) & (pulse_point_city_df.state == 'CA')
ca_city = pulse_point_city_df[mask].reset_index(drop=True)
ca_city
 | city | state | count | location | longitude | latitude |
---|---|---|---|---|---|---|
0 | **UNDEFINED | CA | 17 | **UNDEFINED, CA, USA | 38.123901 | 55.986527 |
1 | 29 PALMS | CA | 9 | 29 PALMS, CA, USA | -116.054351 | 34.135692 |
2 | ACTON | CA | 37 | ACTON, CA, USA | -118.186838 | 34.480741 |
3 | ADELANTO | CA | 20 | ADELANTO, CA, USA | -117.409215 | 34.582770 |
4 | AGOURA | CA | 23 | AGOURA, CA, USA | -118.738129 | 34.143161 |
... | ... | ... | ... | ... | ... | ... |
768 | YOLO | CA | 184 | YOLO, CA, USA | -121.905900 | 38.718454 |
769 | YORBA LINDA | CA | 320 | YORBA LINDA, CA, USA | -117.824971 | 33.890110 |
770 | YUCCA VALLEY | CA | 11 | YUCCA VALLEY, CA, USA | -116.413984 | 34.123621 |
771 | ZAMORA | CA | 63 | ZAMORA, CA, USA | -121.881912 | 38.796568 |
772 | ZAYANTE | CA | 74 | ZAYANTE, CA, USA | -122.043573 | 37.091892 |
773 rows × 6 columns
ca_city['count'].describe()
count     773.000000
mean      110.133247
std       382.835327
min         1.000000
25%         3.000000
50%         9.000000
75%        73.000000
max      8017.000000
Name: count, dtype: float64
ca_city['count'].plot(title='Emergency Incidents Distribution on California Cities');
plt.xlabel('City index')  # x-axis is the row index of ca_city, not a count of cities
plt.ylabel('Incidents')
plt.show()
ca_city['color']=ca_city['count'].apply(lambda count:"Black" if count>=1500 else
"green" if count>=1200 and count<1500 else
"Orange" if count>=800 and count<1200 else
"darkblue" if count>=500 and count<800 else
"red" if count>=300 and count<500 else
"lightblue" if count>=100 and count<300 else
"brown" if count>=10 and count<100 else
"violet" if count>=5 and count<10 else
"grey")
ca_city['size']=ca_city['count'].apply(lambda count:10 if count>=1500 else
8 if count>=1200 and count<1500 else
7 if count>=800 and count<1200 else
6 if count>=500 and count<800 else
5 if count>=300 and count<500 else
4 if count>=100 and count<300 else
3 if count>=10 and count<100 else
2 if count>=5 and count<10 else
1)
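The two chained-ternary lambdas above encode the same break points twice; `pd.cut` can express the binning once per column. A sketch assuming the same right-open bin edges as the `>=`/`<` conditions:

```python
import numpy as np
import pandas as pd

# One shared set of right-open bin edges replaces both chained-ternary lambdas.
edges = [0, 5, 10, 100, 300, 500, 800, 1200, 1500, np.inf]
color_labels = ["grey", "violet", "brown", "lightblue", "red",
                "darkblue", "Orange", "green", "Black"]
size_labels = [1, 2, 3, 4, 5, 6, 7, 8, 10]

counts = pd.Series([1, 7, 50, 2000])  # sample incident counts
color = pd.cut(counts, bins=edges, labels=color_labels, right=False).astype(str)
size = pd.cut(counts, bins=edges, labels=size_labels, right=False).astype(int)
print(color.tolist())  # ['grey', 'violet', 'brown', 'Black']
print(size.tolist())   # [1, 2, 3, 10]
```

Keeping the edges in one list means the color and size scales can never drift out of sync when the thresholds change.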
geometry2 = geopandas.points_from_xy(ca_city.longitude, ca_city.latitude)
geo_df2 = geopandas.GeoDataFrame(ca_city[['city','count','longitude', 'latitude']], geometry=geometry2)
geo_df2.head()
 | city | count | longitude | latitude | geometry |
---|---|---|---|---|---|
0 | **UNDEFINED | 17 | 38.123901 | 55.986527 | POINT (38.12390 55.98653) |
1 | 29 PALMS | 9 | -116.054351 | 34.135692 | POINT (-116.05435 34.13569) |
2 | ACTON | 37 | -118.186838 | 34.480741 | POINT (-118.18684 34.48074) |
3 | ADELANTO | 20 | -117.409215 | 34.582770 | POINT (-117.40922 34.58277) |
4 | AGOURA | 23 | -118.738129 | 34.143161 | POINT (-118.73813 34.14316) |
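Row 0 (`**UNDEFINED`) geocoded to latitude ~56°, far outside California, and would distort any map built from these points. A minimal bounding-box filter (the CA bounds used here are rough assumptions, not exact borders):

```python
import pandas as pd

# Approximate California bounding box (assumed values, not exact borders).
LAT_MIN, LAT_MAX = 32.0, 42.5
LON_MIN, LON_MAX = -125.0, -114.0

# Sample rows mirroring the head of ca_city above.
cities = pd.DataFrame({
    "city": ["**UNDEFINED", "29 PALMS", "ACTON"],
    "latitude": [55.986527, 34.135692, 34.480741],
    "longitude": [38.123901, -116.054351, -118.186838],
})

in_ca = (cities.latitude.between(LAT_MIN, LAT_MAX)
         & cities.longitude.between(LON_MIN, LON_MAX))
clean = cities[in_ca].reset_index(drop=True)
print(clean.city.tolist())  # ['29 PALMS', 'ACTON']
```

The same mask applied to `ca_city` would drop mis-geocoded rows before building `geometry2` and the folium layers.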
geoJSON_df = geopandas.read_file(state_geo)
geoJSON_CA = geoJSON_df.loc[geoJSON_df.id == 'CA']
geoJSON_CA
 | id | name | geometry |
---|---|---|---|
4 | CA | California | POLYGON ((-123.23326 42.00619, -122.37885 42.01166, -121.03700 41.99523, -120.00186 41.99523, -119.99638 40.26452, -120.00186 38.99935, -118.71478 38.10113, -117.49890 37.21934, -116.54044 36.50186, -115.85034 35.97060, -114.63446 35.00118, -114.63446 34.87521, -114.47015 34.71090, -114.33323 34.44801, -114.13606 34.30561, -114.25655 34.17416, -114.41538 34.10844, -114.53587 33.93318, -114.49754 33.69767, -114.52492 33.54979, -114.72757 33.40739, -114.66184 33.03496, -114.52492 33.02948, -114.47015 32.84327, -114.52492 32.75563, -114.72209 32.71730, -116.04751 32.62419, -117.12647 32.53656, -117.24696 32.66800, -117.25244 32.87613, -117.32911 33.12259, -117.47151 33.29785, -117.78370 33.53884, -118.18352 33.76339, -118.26019 33.70314, -118.41355 33.74148, -118.39164 33.84007, -118.56690 34.04272, -118.80241 33.99890, -119.21866 34.14678, -119.27890 34.26727, -119.55823 34.41515, -119.87589 34.40967, -120.13878 34.47539, -120.47288 34.44801, -120.64814 34.57946, -120.60980 34.85878, -120.67005 34.90259, -120.63171 35.09976, -120.89460 35.24764, -120.90556 35.45029, -121.00414 35.46124, -121.16845 35.63650, -121.28347 35.67484, -121.33276 35.78438, -121.71614 36.19515, -121.89688 36.31565, -121.93522 36.63878, -121.85854 36.61140, -121.78734 36.80309, -121.92974 36.97836, -122.10501 36.95645, -122.33504 37.11528, -122.41719 37.24125, -122.40076 37.36174, -122.51578 37.52057, -122.51578 37.78346, -122.32956 37.78346, -122.40624 38.15042, -122.48839 38.11208, -122.50482 37.93134, -122.70199 37.89300, -122.93750 38.02993, -122.97584 38.26544, -123.12919 38.45165, -123.33184 38.56667, -123.44138 38.69811, -123.73713 38.95553, -123.68784 39.03221, -123.82476 39.36630, -123.76452 39.55252, -123.85215 39.83184, -124.10957 40.10569, -124.36151 40.25904, -124.41080 40.43978, -124.15886 40.87794, -124.10957 41.02581, -124.15886 41.14083, -124.06575 41.44206, -124.14790 41.71591, -124.25744 41.78163, -124.21363 42.00071, -123.23326 42.00619)) |
map_CA = folium.Map(location = [38, -115], zoom_start = 6)
# https://stackoverflow.com/a/61129097/11105356
folium.GeoJson(geoJSON_CA.geometry,
name='California').add_to(map_CA)
for lat,lon,area,color,count,size in zip(ca_city['latitude'],ca_city['longitude'],ca_city['city'],ca_city['color'],ca_city['count'],ca_city['size']):
folium.CircleMarker([lat, lon],
popup=folium.Popup(f"{area}, {count}", parse_html=True),
radius=size*5,
color='blue',  # Leaflet expects a CSS color name or hex, not matplotlib's 'b' shorthand
fill=True,
fill_opacity=0.7,
fill_color=color,
).add_to(map_CA)
map_CA
heat_data = [[point.xy[1][0], point.xy[0][0]] for point in geo_df2.geometry ]
HeatMap(heat_data).add_to(map_CA)
map_CA
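Since `ca_city` already carries `latitude`/`longitude` columns, the `heat_data` pairs can also come straight from the DataFrame, without round-tripping through the shapely `Point` objects. A small sketch on sample data:

```python
import pandas as pd

# Same [lat, lon] pairs, taken directly from the coordinate columns.
df = pd.DataFrame({"latitude": [34.1, 34.5], "longitude": [-116.1, -118.2]})
heat_data = df[["latitude", "longitude"]].values.tolist()
print(heat_data)  # [[34.1, -116.1], [34.5, -118.2]]
```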
map_CA_c = folium.Map(location = [38, -115], zoom_start = 6)
# https://stackoverflow.com/a/61129097/11105356
folium.GeoJson(geoJSON_CA.geometry,
name='California').add_to(map_CA_c)
for i in range(0, len(ca_city)):
folium.Marker(
location = [ca_city.iloc[i]['latitude'], ca_city.iloc[i]['longitude']],
popup = folium.Popup(f"{ca_city.iloc[i]['city']}\n{ca_city.iloc[i]['count']}", parse_html=True),
icon=folium.features.CustomIcon('https://i.postimg.cc/JhmnMQXj/sos.png', icon_size=(24, 31))
).add_to(map_CA_c)
HeatMap(heat_data).add_to(map_CA_c)
map_CA_c