1 Introduction

The objective of this project is to analyze the PulsePoint emergency data extensively and to apply clustering and dimensionality reduction techniques to it.

The results of the analysis might benefit a variety of stakeholders.

For example:

  1. Real estate agencies can use the frequency and type of emergency occurrences to judge which areas are riskier for housing and take appropriate precautions.
  2. Local authorities can avoid siting oil/gas filling stations in fire-prone locations.
  3. Local governments can estimate a proper budget for preventive measures against local emergencies and other natural phenomena.

Background

PulsePoint is a 911-connected mobile app that allows users to view and receive alerts on calls being responded to by fire departments and emergency medical services. The app's main feature, and where its name comes from, is that it sends alerts to users at the same time that dispatchers are sending the call to emergency crews. The goal is to increase the possibility that a victim in cardiac arrest will receive cardiopulmonary resuscitation (CPR) quickly. The app uses the current location of a user and will alert them if someone in their vicinity is in need of CPR. The app, which interfaces with the local government public safety answering point, will send notifications to users only if the victim is in a public place and only to users that are in the immediate vicinity of the emergency. - Wikipedia

PulsePoint incident logs can be used to identify local patterns of emergencies, which helps local businesses and emergency agencies stay alert and take precautions that, in the long term, support the social well-being of the community.

Data Collection

The data was collected via web scraping using Python. The logs cover 2021-05-02 to 2021-12-31.

[Figure: PulsePoint Respond mobile app UI, used for visual inspection of the data]

NB: This project also serves as my assignments for the course below -

View this project on GitHub : ahmedshahriar/PulsePoint-Data-Analytics

Kaggle Notebook : ahmedshahriarsakib/pulsepoint-emergency-analytics

2 Libraries & Configuration

I used positionstack for geocoding as a backup option for Nominatim.

You can create an account for the positionstack API (25,000 free requests/month).

In [1]:
%%capture
!pip install geopandas    # geo-plotting     
!pip install pdpipe       # data pipeline 
!pip install yellowbrick  # for elbow method 
In [2]:
import re
import json
import requests
import urllib

import pandas as pd
import numpy as np
import pdpipe as pdp

# from tqdm import tqdm
from tqdm.auto import tqdm  # for notebooks

# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas() # https://stackoverflow.com/a/34365537/11105356


from datetime import timedelta, datetime

# data visualization

import folium
import plotly.graph_objects as go
import plotly.express as px
import geopandas
import seaborn as sns
import matplotlib.pyplot as plt

from plotly.subplots import make_subplots
from wordcloud import WordCloud

from folium.plugins import MarkerCluster, HeatMap


from geopy.geocoders import Nominatim # reverse geocoding

# data processing and algorithm
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import (KMeans, DBSCAN, OPTICS, 
                             AgglomerativeClustering,
                             MiniBatchKMeans)
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.decomposition import PCA


from IPython.display import Image, HTML, Markdown, display  # display is used by printmd below
# from IPython.html import widgets


%matplotlib inline
sns.set(style='whitegrid', palette='muted', font_scale=1.2)

plt.rcParams['figure.figsize'] = 12, 8

# utility function to print markdown string
def printmd(string):
    display(Markdown(string))

pd.set_option('display.max_colwidth', None)

SEED = 42

# set the size of the geo bubble
def set_size(value):
    '''
    Takes the numeric value of a parameter to visualize on a map (Plotly Geo-Scatter plot)
    Returns a number to indicate the size of a bubble for a country which numeric attribute value 
    was supplied as an input
    '''
    result = np.log(1+value)
    if result < 0:
        result = 0.1
    return result

# API Key
API_KEY_POSITIONSTACK = "YOUR_API_KEY_HERE"

3 Explore Dataset

In [4]:
parse_dates=['date_of_incident']
pulse_point_df = pd.read_csv("/content/PulsePoint_local_threats_emergencies.csv", 
                             parse_dates=parse_dates,
                             skipinitialspace=True)

# to parse datetime column later
# pulse_point_df.date_of_incident = pd.to_datetime(pulse_point_df.date_of_incident)
In [5]:
printmd(f"Dataset has **{pulse_point_df.shape[0]}** rows and **{pulse_point_df.shape[1]}** columns")

Dataset has 361245 rows and 11 columns

Strip Object Columns

This removes noise such as extra whitespace.

For example, the data contains state values such as " CA" and "CA", which would otherwise be treated as separate entities. Stripping the values fixes that.

In [6]:
pulse_point_df = pulse_point_df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
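
To illustrate why stripping matters, here is a minimal plain-Python sketch of the same idea. The sample values and the helper name `strip_if_str` are made up for illustration; the notebook itself applies the equivalent lambda to the whole DataFrame.

```python
from collections import Counter

# Hypothetical sample values; " CA" and "CA" differ only by a leading space.
raw_states = [" CA", "CA", "FL ", "FL", "WA"]


def strip_if_str(value):
    """Strip surrounding whitespace from strings; leave other types untouched."""
    return value.strip() if isinstance(value, str) else value


before = Counter(raw_states)
after = Counter(strip_if_str(v) for v in raw_states)

print(len(before), "distinct labels before stripping")
print(len(after), "distinct labels after stripping")
```

The `isinstance` check is what lets the same function run safely over numeric and missing values.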
In [7]:
pulse_point_df.sort_values(by='date_of_incident')
Out[7]:
id type title agency location timestamp_time date_of_incident description duration incident_logo agency_logo
0 3569 recent Commercial Fire Suffolk Fire Rescue 2210 E WASHINGTON ST, SUFFOLK, VA 12:38 PM 2021-05-02 B1 E1 E2 E3 E4 EMS1 L3 L5 M1 R1 R6 SF1 1 h 25 m https://web.pulsepoint.org/assets/images/list/cf_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1344
361013 3365 recent Structure Fire Alachua/Gainesville 15270 NW 150TH AVE, STE 3046, ALACHUA, FL 8:16 PM 2021-05-02 DC6 E21 E25 E29 Q23 R21 SQ29 23 m https://web.pulsepoint.org/assets/images/list/sf_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1079
361012 3364 recent Full Assignment Allegheny County EMS 428 LINCOLN HIGHLANDS DR, NORTH FAYETTE, PA 9:10 PM 2021-05-02 1902 21 m https://web.pulsepoint.org/assets/images/list/full_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=599
361011 3363 recent Commercial Fire Akron Fire 750 MULL AVE, STE 3D, AKRON, OH 9:11 PM 2021-05-02 AKAT6 AKBC4 AKBC9 AKCH4 AKEN11 AKEN3 AKEN4 AKEN6 AKEN9 AKFI3 AKL4 AKL9 AKM10 AKM12 AKM4 AKM6 AKT10 AMR5 AMR6 4 h 35 m https://web.pulsepoint.org/assets/images/list/cf_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=979
361010 3362 recent Mutual Aid Allegheny County EMS 965 BURTNER RD, HARRISON, PA 9:25 PM 2021-05-02 111 1 h 55 m https://web.pulsepoint.org/assets/images/list/mu_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=599
... ... ... ... ... ... ... ... ... ... ... ...
358158 361760 recent Gas Leak Fairfax County Fire 3903I FAIR RIDGE DR, FAIRFAX, VA (MASSAGE ENVY) 8:14 PM 2021-12-31 A440 BC407 E421M E440M TL440M 33 m https://web.pulsepoint.org/assets/images/list/gas_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1441
358157 361759 recent Medical Emergency Escambia Co EMS ROYCE ST, BRENT, FL 8:15 PM 2021-12-31 M34 47 m https://web.pulsepoint.org/assets/images/list/me_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=874
358156 361758 recent Medical Emergency Fairfax County Fire BURKE COMMONS RD, BURKE, VA 8:18 PM 2021-12-31 E432M M432 26 m https://web.pulsepoint.org/assets/images/list/me_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1441
358162 361764 recent Medical Emergency Fairbanks ECC MARY ANN ST, FAIRBANKS, AK 8:08 PM 2021-12-31 M2 33 m https://web.pulsepoint.org/assets/images/list/me_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=1288
358066 363743 active Medical Emergency Cosumnes FD SANTO CT, ELK GROVE, CA 10:40 PM 2021-12-31 E72 E77 M71 NaN https://web.pulsepoint.org/assets/images/list/me_list.png https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=551

361245 rows × 11 columns

Data was collected from 2021-05-02 to 2021-12-31

3.1 Metadata Summary

In [8]:
pulse_point_df.columns
Out[8]:
Index(['id', 'type', 'title', 'agency', 'location', 'timestamp_time',
       'date_of_incident', 'description', 'duration', 'incident_logo',
       'agency_logo'],
      dtype='object')
Column           | Description                                                         | Data Type
id               | Contains record id                                                  | numeric, int
type             | Incident type (recent or active)                                    | object
title            | Title of the incident (e.g., Medical Emergency, Fire)               | object
agency           | Agency name (e.g., fire departments, emergency medical services)    | object
location         | Location where the incident took place                              | object
timestamp_time   | Time when the incident record was logged                            | object
date_of_incident | Date when the incident record was logged                            | datetime
description      | Emergency code description (e.g., E53 - refers to Fire Engine Truck) | object
duration         | Duration of the incident                                            | object
incident_logo    | Logo of the incident                                                | object
agency_logo      | Logo of the agency                                                  | object
In [9]:
pulse_point_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 361245 entries, 0 to 361244
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                361245 non-null  int64         
 1   type              361245 non-null  object        
 2   title             361245 non-null  object        
 3   agency            361245 non-null  object        
 4   location          361245 non-null  object        
 5   timestamp_time    361245 non-null  object        
 6   date_of_incident  361245 non-null  datetime64[ns]
 7   description       344622 non-null  object        
 8   duration          281278 non-null  object        
 9   incident_logo     361245 non-null  object        
 10  agency_logo       361245 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 30.3+ MB

Data Types

In [10]:
pulse_point_df.dtypes.value_counts()
Out[10]:
object            9
int64             1
datetime64[ns]    1
dtype: int64

Object Data

In [11]:
pulse_point_df.describe(include='object').T
Out[11]:
count unique top freq
type 361245 2 recent 281278
title 361245 89 Medical Emergency 240433
agency 361245 792 Montgomery County 7265
location 361245 222795 EUCLID AV, EUCLID, OH 135
timestamp_time 361245 1440 6:42 AM 415
description 344622 111286 E1 1334
duration 281278 728 16 m 6420
incident_logo 361245 89 https://web.pulsepoint.org/assets/images/list/me_list.png 240433
agency_logo 361245 648 https://web.pulsepoint.org/DB/GetAgencyImage.php?agency_id=100 6114

Web Image Data

In [12]:
def path_to_image_html(path):
    '''
    Convert an image URL to an '<img src="..."/>' HTML tag.
    Formatting adjustments (height, aspect ratio, size, etc.) can be
    added to the tag, as with the style attribute below.
    '''

    # the style value needs both its opening and closing quote to be valid HTML
    return '<img src="' + path + '" style="max-height:124px;"/>'  # option: width="60"

pulse_point_df_short = pulse_point_df.head(10)
HTML(pulse_point_df_short.to_html(escape=False , formatters=dict(incident_logo=path_to_image_html, agency_logo=path_to_image_html)))
Out[12]:
id type title agency location timestamp_time date_of_incident description duration incident_logo agency_logo
0 3569 recent Commercial Fire Suffolk Fire Rescue 2210 E WASHINGTON ST, SUFFOLK, VA 12:38 PM 2021-05-02 B1 E1 E2 E3 E4 EMS1 L3 L5 M1 R1 R6 SF1 1 h 25 m
1 3570 recent Fire Tacoma Fire 2300 A ST, TACOMA, WA 11:52 AM 2021-05-02 NaN 1 h 46 m
2 3571 recent Residential Fire Tamarac Fire 4601 NW 30TH TER, TAMARAC, FL 10:00 AM 2021-05-02 BC15 E34 E37 Q110 R278 R34 R37 8 m
3 3572 recent Electrical Fire Tamarac Fire 4611 NW 30TH TER, TAMARAC, FL 9:52 AM 2021-05-02 Q78 40 m
4 3573 recent Fire Tacoma Fire PUYALLUP AVE & A ST, TACOMA, WA 9:37 AM 2021-05-02 E04 18 m
5 3574 recent Fire Tacoma Fire S 24TH ST & A ST, TACOMA, WA 9:00 AM 2021-05-02 E02 14 m
6 3575 recent Commercial Fire Stafford County Fire 1287 JEFFERSON DAVIS HWY, FREDERICKSBURG, VA 5:49 AM 2021-05-02 A6 E1 R1U 8 m
7 3576 recent Residential Fire Suffolk Fire Rescue 101 ROCKLAND TER, SUFFOLK, VA 5:04 AM 2021-05-02 B1 E1 E3 E6 EMS1 EMS2 L3 M3 M6 R1 SF1 1 h 6 m
8 3577 active Residential Fire Suffolk Co FRES 1701 AVALON PINES DR, CORAM, NY 4:27 AM 2021-05-03 ?5-06-A NaN
9 3578 recent Mutual Aid WPG Fire Paramedic DAKOTA ST & MEADOWOOD DR, WINNIPEG, MANITOBA 1:33 AM 2021-05-03 NaN 2 m

3.2 Missing Values

In [13]:
def missing_value_describe(data):
    # check missing values in the data
    total = data.isna().sum().sort_values(ascending=False)
    missing_value_pct_stats = (data.isnull().sum() / len(data)*100)
    missing_value_col_count = sum(missing_value_pct_stats > 0)

    # missing_value_stats = missing_value_pct_stats.sort_values(ascending=False)[:missing_value_col_count]
    missing_data = pd.concat([total, missing_value_pct_stats], axis=1, keys=['Total', 'Percentage(%)'])

    print("Number of rows with at least 1 missing values:", data.isna().any(axis = 1).sum())
    print("Number of columns with missing values:", missing_value_col_count)

    if missing_value_col_count != 0:
        # print out column names with missing value percentage
        print("\nMissing percentage (descending):")
        display(missing_data[:missing_value_col_count])

        # plot missing values
        missing = data.isnull().sum()
        missing = missing[missing > 0]
        missing.sort_values(inplace=True)
        missing.plot.bar()
    else:
        print("No missing data!!!")

# pass a dataframe to the function
missing_value_describe(pulse_point_df)
Number of rows with at least 1 missing values: 93351
Number of columns with missing values: 2

Missing percentage (descending):
Total Percentage(%)
duration 79967 22.136500
description 16623 4.601586
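
The percentage column is the missing count divided by the number of rows, multiplied by 100. As a quick arithmetic check, this short sketch recomputes it from the counts reported above (the variable names are my own):

```python
# Counts taken from the output above.
total_rows = 361245
missing_counts = {"duration": 79967, "description": 16623}

# Percentage of rows with a missing value, per column.
missing_pct = {col: count / total_rows * 100 for col, count in missing_counts.items()}

# Print the columns from most to least missing.
for col, pct in sorted(missing_pct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col}: {pct:.2f}%")
```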

4 Data Cleaning

Discard Columns

In [14]:
pulse_point_df.drop(['id', 'incident_logo', 'agency_logo'], axis=1, inplace=True)

Remove Active Incidents

Active incidents are noisy duplicates of the "recent" incidents that could not be removed during data collection, so they do not contribute to the analysis.

In [15]:
pulse_point_df.type.value_counts()
Out[15]:
recent    281278
active     79967
Name: type, dtype: int64
In [16]:
pulse_point_df.drop(pulse_point_df[pulse_point_df.type == 'active'].index, inplace=True)
pulse_point_df.reset_index(drop=True, inplace=True)

Drop redundant column "type"

In [17]:
pulse_point_df.drop(columns=['type'], inplace=True)  # axis is redundant when columns= is given

5 Feature Extraction

5.1 Location

In [18]:
pulse_point_df.location
Out[18]:
0            2210 E WASHINGTON ST, SUFFOLK, VA
1                        2300 A ST, TACOMA, WA
2                4601 NW 30TH TER, TAMARAC, FL
3                4611 NW 30TH TER, TAMARAC, FL
4              PUYALLUP AVE & A ST, TACOMA, WA
                          ...                 
281273             1252 WILROY RD, SUFFOLK, VA
281274        913 E WASHINGTON ST, SUFFOLK, VA
281275    717 OCEAN BREEZE WK, OCEAN BEACH, NY
281276       S 8TH ST & YAKIMA AVE, TACOMA, WA
281277     S 92ND ST & S HOSMER ST, TACOMA, WA
Name: location, Length: 281278, dtype: object
In [19]:
pulse_point_df.location.value_counts().head(10)
Out[19]:
COLLINS AVE, MIAMI BEACH, FL                                    99
N HARBOR BL, FULLERTON, CA                                      79
175 NE 1ST ST, MCMINNVILLE, OR (MCMINNVILLE FIRE DEPARTMENT)    78
WASHINGTON AVE, MIAMI BEACH, FL                                 77
FREMONT BLVD, FREMONT, CA                                       76
ALTON RD, MIAMI BEACH, FL                                       70
PRESTON RD, FRISCO, TX                                          70
LEGACY DR, FRISCO, TX                                           66
E STATE ST, ROCKFORD, IL                                        65
STONEBROOK PKWY, FRISCO, TX                                     64
Name: location, dtype: int64

Insights and Feature Extraction

The location column comes in several variations, such as:

  • NE OAK SPRINGS FARM RD, CARLTON, OR
  • W 10TH ST, LONG BEACH, CA
  • 302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))
  • 175 NE 1ST ST, MCMINNVILLE, OR (MCMINNVILLE FIRE DEPARTMENT)
  • E BARNETT RD, MEDFORD, OR

We can split the locations into multiple features:

State

The text after the last comma appears to be the abbreviation of a US state or Canadian province.

CA -> California

OR -> Oregon

City

The text between the second-to-last and last commas appears to be the city name (or a town or county name).

MEDFORD is a city in Oregon (last example - "E BARNETT RD, MEDFORD, OR")

Address

When there are three comma-separated elements, everything other than the city and state is treated as the address feature.

Address_2

When there are four comma-separated elements, the second element is treated as an extended address (address_2) feature.

Business

A parenthesis-enclosed string is treated as the business name.

In the examples above, OJAI ARCADE (21002302) and MCMINNVILLE FIRE DEPARTMENT are treated as business features.

Business Place Extractor

In [20]:
def get_business_name(location):
    # https://stackoverflow.com/a/38212061/11105356
    stack = 0
    start_index = None
    results = []

    for i, c in enumerate(location):
        if c == '(':
            if stack == 0:
                start_index = i + 1  # string to extract starts one index later

            # push to stack
            stack += 1
        elif c == ')':
            # pop stack
            stack -= 1

            if stack == 0:
                results.append(location[start_index:i])
    try:
      if len(results) == 0:
        return None
      elif len(results) == 1 and len(results[0]) == 1:
        return None
      elif len(results) == 1 and len(results[0])!=1:
        return results[0].strip()
      elif len(results) > 1 and len(results[0])==1:
        return None
      else:
        return results[1].strip()
    except IndexError as ie:
      pass

### handles variations such as -
# 5709 RICHMOND RD, STE 76, JAMES CITY COUNTY, VA (JANIE & JACK)
# 433 SARATOGA RD, SCHENECTADY, NY ((GLENVILLE)EAST GLENVILLE FD)
# I 229 RAMP  & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD (I 229 MM 8 NB)
# 6501 MISTY WATERS DR, STE (S)E260 (N), BURLEIGH COUNTY, ND

Split Location

Example 1 (3 elements): 302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))

address = 302 E OJAI AVE, city = OJAI, state = CA, business = OJAI ARCADE (21002302)

Example 2 (4 elements): GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA

address = GRASSIE BLVD, address_2 = STE 212, city = WINNIPEG, state = MANITOBA (will be converted to MB later)
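
The two examples above can be reproduced on single strings with a small plain-Python sketch. This is not the notebook's pandas implementation: `split_location` is a hypothetical helper, and it assumes the business name, when present, is one balanced parenthesized group at the very end of the string.

```python
def split_location(location):
    """Split a location string into address, address_2, city, state, business."""
    business = None
    core = location.strip()

    # Walk backwards to find the "(" that matches the final ")".
    if core.endswith(")"):
        depth = 0
        for i in range(len(core) - 1, -1, -1):
            if core[i] == ")":
                depth += 1
            elif core[i] == "(":
                depth -= 1
                if depth == 0:
                    business = core[i + 1:-1].strip()
                    core = core[:i].strip()
                    break

    parts = [part.strip() for part in core.split(",")]
    if len(parts) == 3:
        address, city, state = parts
        address_2 = None
    elif len(parts) == 4:
        address, address_2, city, state = parts
    else:
        raise ValueError(f"unexpected number of segments: {len(parts)}")

    return {"address": address, "address_2": address_2,
            "city": city, "state": state, "business": business}


example_1 = split_location("302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302))")
example_2 = split_location("GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA")
print(example_1)
print(example_2)
```

Removing the trailing group by position, rather than by replacing its text everywhere, keeps a city that shares its name with the business intact.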

In [21]:
# examples
# 302 E OJAI AVE, OJAI, CA (OJAI ARCADE (21002302)) --- 3 segments with business inside
# 1959 MORSE RD, COLUMBUS, OH (DOLLAR GENERAL)
# I 229 RAMP  & I 229 RAMP (0.1 MILES), SIOUX FALLS, SD (I 229 MM 8 NB)
# GRASSIE BLVD, STE 212, WINNIPEG, MANITOBA --- 4 segments



# split location into 3 or 4 parts depending on number of commas -> 
# 3 segments : address, city, state
# 4 segments : address, address_2, city, state


# to extract bracket enclosed string 
pulse_point_df['business'] = pulse_point_df.location.apply(lambda x : get_business_name(x))

### remove enclosed business name from the location string
pulse_point_location_data = pulse_point_df.apply(lambda row : row['location'].replace(str(row['business']), ''), axis=1)

# remove leftover brackets from the business replacement
# https://stackoverflow.com/a/49183590/11105356
# remove a (...) substring with a leading whitespace at the end of the string only
# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
pulse_point_location_data = pulse_point_location_data.str.replace(r"\s*\([^()]*\)$", "", regex=True).str.strip()


# split the location
four_col_location_split = ['address', 'address_2', 'city','state']
three_col_location_split = ['address', 'city','state']


# four col indices
# pulse_point_location_data[pulse_point_location_data.str.split(',', expand=True)[3].notna()]

extra_loc_data = pulse_point_location_data.str.split(',', expand=True) # to expand columns
four_col_indices = extra_loc_data[extra_loc_data.apply(lambda x: np.all(pd.notnull(x[3])) , axis = 1)].index
four_col_loc_df = extra_loc_data.iloc[four_col_indices]
four_col_loc_df.columns = four_col_location_split
four_col_loc_df
Out[21]:
address address_2 city state
41 150 JOHNSON AVE STE 14 CAPE CANAVERAL FL
43 1262 PRAIRIE LN STE 305 TITUSVILLE FL
54 6431 N 84TH ST STE 4 MILWAUKEE WI
64 405 S BLAINE ST STE 5 NEWBERG OR
65 18390 SW BOONES FERRY RD STE F207 TIGARD OR
... ... ... ... ...
281244 4515 86TH ST STE 35 URBANDALE IA
281253 223 E BAKERVIEW RD STE 348 BELLINGHAM WA
281254 1129 11TH ST STE 304 WEST DES MOINES IA
281256 1245 SE UNIVERSITY AVE STE 103 WAUKEE IA
281271 34464 CORTEZ BLVD BLDG NOT FOUND RIDGE MANOR FL

9032 rows × 4 columns

Four Features

In [22]:
pulse_point_df.loc[four_col_loc_df.index , four_col_location_split] = four_col_loc_df
pulse_point_df[four_col_location_split] = pulse_point_df[four_col_location_split].apply(lambda x: x.str.strip())
pulse_point_df[four_col_location_split]

# locations with four features are far fewer than locations with three features
Out[22]:
address address_2 city state
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
... ... ... ... ...
281273 NaN NaN NaN NaN
281274 NaN NaN NaN NaN
281275 NaN NaN NaN NaN
281276 NaN NaN NaN NaN
281277 NaN NaN NaN NaN

281278 rows × 4 columns

Three Features

In [23]:
four_col_loc_df_mask = extra_loc_data.index.isin(four_col_indices)
three_col_loc_df = extra_loc_data[~four_col_loc_df_mask].drop([3], axis=1)
three_col_loc_df.columns = three_col_location_split

# extra_loc_data[~three_col_loc_df][3].notna().sum() # to check null values

three_col_loc_df
Out[23]:
address city state
0 2210 E WASHINGTON ST SUFFOLK VA
1 2300 A ST TACOMA WA
2 4601 NW 30TH TER TAMARAC FL
3 4611 NW 30TH TER TAMARAC FL
4 PUYALLUP AVE & A ST TACOMA WA
... ... ... ...
281273 1252 WILROY RD SUFFOLK VA
281274 913 E WASHINGTON ST SUFFOLK VA
281275 717 OCEAN BREEZE WK OCEAN BEACH NY
281276 S 8TH ST & YAKIMA AVE TACOMA WA
281277 S 92ND ST & S HOSMER ST TACOMA WA

272246 rows × 3 columns

In [24]:
pulse_point_df.loc[three_col_loc_df.index , three_col_location_split] = three_col_loc_df
pulse_point_df[three_col_location_split] = pulse_point_df[three_col_location_split].apply(lambda x: x.str.strip())
pulse_point_df[three_col_location_split]
Out[24]:
address city state
0 2210 E WASHINGTON ST SUFFOLK VA
1 2300 A ST TACOMA WA
2 4601 NW 30TH TER TAMARAC FL
3 4611 NW 30TH TER TAMARAC FL
4 PUYALLUP AVE & A ST TACOMA WA
... ... ... ...
281273 1252 WILROY RD SUFFOLK VA
281274 913 E WASHINGTON ST SUFFOLK VA
281275 717 OCEAN BREEZE WK OCEAN BEACH NY
281276 S 8TH ST & YAKIMA AVE TACOMA WA
281277 S 92ND ST & S HOSMER ST TACOMA WA

281278 rows × 3 columns

Final Merging of Location Features

In [25]:
pulse_point_df[['location','address', 'address_2', 'city','state', 'business']]
Out[25]:
location address address_2 city state business
0 2210 E WASHINGTON ST, SUFFOLK, VA 2210 E WASHINGTON ST NaN SUFFOLK VA None
1 2300 A ST, TACOMA, WA 2300 A ST NaN TACOMA WA None
2 4601 NW 30TH TER, TAMARAC, FL 4601 NW 30TH TER NaN TAMARAC FL None
3 4611 NW 30TH TER, TAMARAC, FL 4611 NW 30TH TER NaN TAMARAC FL None
4 PUYALLUP AVE & A ST, TACOMA, WA PUYALLUP AVE & A ST NaN TACOMA WA None
... ... ... ... ... ... ...
281273 1252 WILROY RD, SUFFOLK, VA 1252 WILROY RD NaN SUFFOLK VA None
281274 913 E WASHINGTON ST, SUFFOLK, VA 913 E WASHINGTON ST NaN SUFFOLK VA None
281275 717 OCEAN BREEZE WK, OCEAN BEACH, NY 717 OCEAN BREEZE WK NaN OCEAN BEACH NY None
281276 S 8TH ST & YAKIMA AVE, TACOMA, WA S 8TH ST & YAKIMA AVE NaN TACOMA WA None
281277 S 92ND ST & S HOSMER ST, TACOMA, WA S 92ND ST & S HOSMER ST NaN TACOMA WA None

281278 rows × 6 columns

In [26]:
missing_value_describe(pulse_point_df[['location','address', 'address_2', 'city','state', 'business']])
Number of rows with at least 1 missing values: 280163
Number of columns with missing values: 2

Missing percentage (descending):
Total Percentage(%)
address_2 272246 96.788942
business 266152 94.622402

Drop Garbage

In [27]:
pulse_point_df[pulse_point_df.city.isna()]
Out[27]:
title agency location timestamp_time date_of_incident description duration business address address_2 city state
In [28]:
# .copy() avoids SettingWithCopyWarning when columns are assigned later
pulse_point_df = pulse_point_df[pulse_point_df.city.notna()].copy()

5.2 City

In [29]:
mask = ((pulse_point_df.city.isna()) | (pulse_point_df.city==u'') )

display(pulse_point_df[mask])
title agency location timestamp_time date_of_incident description duration business address address_2 city state
38923 Mutual Aid Sumter Fire & EMS 34498 CORTEZ BLVD, BLDG NOT FOUND, RIDGE MANOR, FL (RIDGE MANOR) 11:53 PM 2021-07-03 NaN 4 m RIDGE MANOR 34498 CORTEZ BLVD BLDG NOT FOUND FL
50888 Mutual Aid San Ramon Valley FPD 3590 CLAYTON RD, CONCORD, CA (CONCORD) 10:15 PM 2021-07-12 PM32 52 m CONCORD 3590 CLAYTON RD NaN CA
272125 Traffic Collision Idaho Falls Fire UNKNOWN & W 137TH S, S 5TH W, SHELLEY, ID (SHELLEY) 12:53 AM 2021-12-28 AB5 9 m SHELLEY UNKNOWN & W 137TH S S 5TH W ID

In these rows the business name is the same as the city name. Because I removed the business-name text from the location first and extracted the city afterwards, the city name was removed too, which left it blank in cases like these.

Let's replace these city names with the business names.

In [30]:
pulse_point_df.loc[mask,'city'] = pulse_point_df[mask].business
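
The blank city values can be reproduced on a single string. The sketch below uses one of the rows shown above and contrasts replacing every occurrence of the business name with removing only a trailing parenthesized group. The second variant is an alternative I am suggesting, not what the notebook does, and its regex does not handle nested parentheses.

```python
import re

location = "3590 CLAYTON RD, CONCORD, CA (CONCORD)"
business = "CONCORD"

# Replacing every occurrence of the business name also removes the city.
naive = location.replace(business, "")
naive_parts = [part.strip() for part in naive.split(",")]

# Removing only a trailing "(...)" group leaves the city intact.
trailing_only = re.sub(r"\s*\([^()]*\)$", "", location)
safe_parts = [part.strip() for part in trailing_only.split(",")]

print(naive_parts)
print(safe_parts)
```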

5.3 State

In [31]:
display(pulse_point_df.state.value_counts())
printmd(f"**Total {len(pulse_point_df.state.value_counts().index)} States. Some of them are Canadian provinces,  ex - MANITOBA**")
CA                                        85135
FL                                        26543
WA                                        17828
VA                                        17754
OH                                        17079
OR                                        15827
WI                                         9808
MO                                         9521
TX                                         8451
IL                                         5783
PA                                         5172
IN                                         4960
KS                                         4647
NV                                         4574
MN                                         3865
NC                                         3759
AZ                                         3642
TN                                         3624
DE                                         2861
OK                                         2849
MANITOBA                                   2730
MD                                         2658
ND                                         2569
NY                                         2228
CO                                         1815
DC                                         1788
NE                                         1721
NJ                                         1663
ID                                         1505
SD                                         1372
AK                                         1271
GA                                         1192
KY                                          908
UT                                          816
HI                                          808
AR                                          797
SC                                          778
NM                                          396
MI                                          277
IA                                          201
AL                                           58
LA                                           21
ON                                           15
BC                                            6
NV ())                                        1
MO (NUSACH HARI BNAI ZION CONGREGATION        1
CONCORD                                       1
Name: state, dtype: int64

Total 47 States. Some of them are Canadian provinces, ex - MANITOBA

Canadian Province

Mapping Canadian provinces to their unique short form

In [32]:
# Canadian Province Mapping
# https://www150.statcan.gc.ca/n1/pub/92-195-x/2011001/geo/prov/tbl/tbl8-eng.htm
# https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada

ca_province_dic = {
    'Newfoundland and Labrador': 'NL',
    'Prince Edward Island': 'PE',
    'Nova Scotia': 'NS',
    'New Brunswick': 'NB',
    'Quebec': 'QC',
    'Ontario': 'ON',
    'Manitoba': 'MB',
    'Saskatchewan': 'SK',
    'Alberta': 'AB',
    'British Columbia': 'BC',
    'Yukon': 'YT',
    'Northwest Territories': 'NT',
    'Nunavut': 'NU',
}

# approach 1

# def handle_state(data_attr):
#   for k, v in ca_province_dic.items():
#       if data_attr.strip().lower() == k.lower():
#         return v
#   return data_attr

# pulse_point_df['state'] =  pulse_point_df.state.apply(handle_state)


# approach 2

# https://stackoverflow.com/a/69994272/11105356

ca_province_dict = {k.lower():v for k,v in ca_province_dic.items()}
pulse_point_df['state']  = pulse_point_df['state'].str.lower().map(ca_province_dict).fillna(pulse_point_df.state)
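
Approach 2 lower-cases each value, looks it up in the mapping, and falls back to the original value when there is no match. The per-value logic can be sketched in plain Python as follows; `normalize_state` is a hypothetical helper and the dictionary is a subset of the one above.

```python
# Subset of the province mapping above, keyed by lower-case name.
ca_province_abbrev = {"manitoba": "MB", "ontario": "ON", "british columbia": "BC"}


def normalize_state(state):
    """Return the abbreviation for a full province name, else the input unchanged."""
    return ca_province_abbrev.get(state.lower(), state)


states = ["CA", "MANITOBA", "ON", "FL"]
normalized = [normalize_state(s) for s in states]
print(normalized)
```

Values that are already abbreviations, such as "ON", pass through unchanged because only full province names are keys in the mapping.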

Noise Removal

In [33]:
# Exception state : example - 'FL  #1005' , 'NY EAST GLENVILLE FD', ' DE / RM304'

mask = pulse_point_df.state.apply(lambda x:len(x)>2)
display(pulse_point_df[mask].state)
25695    MO (NUSACH HARI BNAI ZION CONGREGATION
50895                                   CONCORD
78626                                    NV ())
Name: state, dtype: object

Keep only the first segment, which is the state abbreviation, and discard the rest (noise).

In [34]:
pulse_point_df.loc[mask,'state'] = pulse_point_df[mask].state.apply(lambda x: x.split()[0])
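
Applied to the three noisy values shown above, taking the first whitespace-separated token behaves as in this plain-Python sketch. It also shows why one value still needs a manual fix afterwards.

```python
# State values taken from the output above.
noisy_states = ["MO (NUSACH HARI BNAI ZION CONGREGATION", "NV ())", "CONCORD"]

# Keep only the first whitespace-separated token of each value.
first_tokens = [value.split()[0] for value in noisy_states]

# Anything still longer than two characters is not a state abbreviation.
needs_review = [token for token in first_tokens if len(token) > 2]

print(first_tokens)
print(needs_review)
```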
In [35]:
pulse_point_df.state.value_counts()
Out[35]:
CA         85135
FL         26543
WA         17828
VA         17754
OH         17079
OR         15827
WI          9808
MO          9522
TX          8451
IL          5783
PA          5172
IN          4960
KS          4647
NV          4575
MN          3865
NC          3759
AZ          3642
TN          3624
DE          2861
OK          2849
MB          2730
MD          2658
ND          2569
NY          2228
CO          1815
DC          1788
NE          1721
NJ          1663
ID          1505
SD          1372
AK          1271
GA          1192
KY           908
UT           816
HI           808
AR           797
SC           778
NM           396
MI           277
IA           201
AL            58
LA            21
ON            15
BC             6
CONCORD        1
Name: state, dtype: int64

Leftover

In [36]:
# CONCORD
mask = pulse_point_df.state.str.startswith('CONCORD')

display(pulse_point_df[mask])
printmd("**CONCORD should be in CA**")
title agency location timestamp_time date_of_incident description duration business address address_2 city state
50895 Mutual Aid San Ramon Valley FPD 2020 GRANT ST, STE 1205, CONCORD 9:51 PM 2021-07-12 PM32 20 m None 2020 GRANT ST NaN STE 1205 CONCORD

CONCORD should be in CA

In [37]:
pulse_point_df.loc[mask,'state'] = 'CA'
In [38]:
pulse_point_df.state.value_counts()
Out[38]:
CA    85136
FL    26543
WA    17828
VA    17754
OH    17079
OR    15827
WI     9808
MO     9522
TX     8451
IL     5783
PA     5172
IN     4960
KS     4647
NV     4575
MN     3865
NC     3759
AZ     3642
TN     3624
DE     2861
OK     2849
MB     2730
MD     2658
ND     2569
NY     2228
CO     1815
DC     1788
NE     1721
NJ     1663
ID     1505
SD     1372
AK     1271
GA     1192
KY      908
UT      816
HI      808
AR      797
SC      778
NM      396
MI      277
IA      201
AL       58
LA       21
ON       15
BC        6
Name: state, dtype: int64

5.4 Time

Converting time string to seconds

For example - "1 h 34 m" will be 94*60 = 5640 seconds

In [39]:
#https://stackoverflow.com/a/57846984/11105356

UNITS = {'s':'seconds', 'm':'minutes', 'h':'hours', 'd':'days', 'w':'weeks'}

# days and weeks are not expected to appear in this data

def convert_to_seconds(s):
    s = s.replace(" ","")
    return int(timedelta(**{
        UNITS.get(m.group('unit').lower(), 'seconds'): int(m.group('val'))
        for m in re.finditer(r'(?P<val>\d+)(?P<unit>[smhdw]?)', s, flags=re.I)
    }).total_seconds())

# convert_to_seconds("1 h 34 m")
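As a quick sanity check, the helper can be run against duration strings that appear in the dataset and compared with the corresponding duration_in_seconds values. The snippet below restates the function so that it runs on its own; the `samples` dictionary is a hand-picked illustration.

```python
import re
from datetime import timedelta

UNITS = {'s': 'seconds', 'm': 'minutes', 'h': 'hours', 'd': 'days', 'w': 'weeks'}

def convert_to_seconds(s):
    # Same logic as the helper above: drop spaces, then read <value><unit> pairs.
    s = s.replace(" ", "")
    return int(timedelta(**{
        UNITS.get(m.group('unit').lower(), 'seconds'): int(m.group('val'))
        for m in re.finditer(r'(?P<val>\d+)(?P<unit>[smhdw]?)', s, flags=re.I)
    }).total_seconds())

# Duration text -> expected seconds (minutes * 60, hours * 3600).
samples = {"19 m": 1140, "1 h 30 m": 5400, "13 h 10 m": 47400, "1 h 34 m": 5640}
results = {text: convert_to_seconds(text) for text in samples}

for text, seconds in results.items():
    print(f"{text!r:>12} -> {seconds}")
```

Two properties of this approach are worth keeping in mind: a bare number without a unit is treated as seconds, and if the same unit appears twice in a string only the last value is kept.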

Duration (seconds)

Extract the total duration, in seconds, from the "duration" text

In [40]:
pulse_point_df["duration_in_seconds"] = pulse_point_df.duration.apply(convert_to_seconds)
In [41]:
pulse_point_df["day_name"] = pulse_point_df.date_of_incident.dt.day_name()
pulse_point_df["weekday"] = pulse_point_df.date_of_incident.dt.weekday

pulse_point_df["month_name"] = pulse_point_df.date_of_incident.dt.month_name()


## more features

# pulse_point_df.date_of_incident.dt.month_name()
# pulse_point_df.date_of_incident.dt.month

# pulse_point_df.date_of_incident.dt.day
# pulse_point_df.date_of_incident.dt.day_name()

# pulse_point_df.date_of_incident.dt.weekday
# pulse_point_df.date_of_incident.dt.isocalendar().week
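To illustrate what these `.dt` accessors return, here is a small sketch on the two earliest dates in the collection window. The `dates` series and `features` frame are illustrative names only.

```python
import pandas as pd

# Two dates from the start of the collection period.
dates = pd.to_datetime(pd.Series(["2021-05-02", "2021-05-03"]))

features = pd.DataFrame({
    "day_name": dates.dt.day_name(),      # full weekday name
    "weekday": dates.dt.weekday,          # Monday=0 ... Sunday=6
    "month_name": dates.dt.month_name(),  # full month name
})

print(features)
```

The weekday column is numeric, which makes it convenient for sorting and grouping, while day_name is better suited to plot labels.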
In [42]:
pulse_point_df.tail(40)
Out[42]:
| | title | agency | location | timestamp_time | date_of_incident | description | duration | business | address | address_2 | city | state | duration_in_seconds | day_name | weekday | month_name |
| 281238 | Refuse/Garbage Fire | Huntington Beach FD | 21661 BROOKHURST ST, HUNTINGTON BEACH, CA | 12:32 AM | 2021-05-03 | ME83 | 19 m | None | 21661 BROOKHURST ST | NaN | HUNTINGTON BEACH | CA | 1140 | Monday | 0 | May |
| 281239 | Residential Fire | Whatcom Fire/EMS | 3037 PACIFIC ST, BELLINGHAM, WA | 3:30 AM | 2021-05-03 | NaN | 59 m | None | 3037 PACIFIC ST | NaN | BELLINGHAM | WA | 3540 | Monday | 0 | May |
| 281240 | Mutual Aid | West Sacramento Fire | 1700 CAPITAL AVE, SAC, WEST SACRAMENTO, CA | 2:16 AM | 2021-05-03 | E43 WSAID | 3 m | None | 1700 CAPITAL AVE | SAC | WEST SACRAMENTO | CA | 180 | Monday | 0 | May |
| 281241 | Mutual Aid | WestShore FDs | 25157 CARLTON PARK, STE 120, NORTH OLMSTED, OH | 2:13 AM | 2021-05-03 | FVM31 | 54 m | None | 25157 CARLTON PARK | STE 120 | NORTH OLMSTED | OH | 3240 | Monday | 0 | May |
| 281242 | Mutual Aid | West Sacramento Fire | 106 J ST, SACRAMENTO, CA (SPIRITS RESTAURANT) | 2:13 AM | 2021-05-03 | B44 E44 WSAID | 17 m | SPIRITS RESTAURANT | 106 J ST | NaN | SACRAMENTO | CA | 1020 | Monday | 0 | May |
| 281243 | Mutual Aid | West Sacramento Fire | 1210 FRONT ST, SACRAMENTO, CA (RIO CITY CAFE) | 2:10 AM | 2021-05-03 | BT41 WSAID | 46 m | RIO CITY CAFE | 1210 FRONT ST | NaN | SACRAMENTO | CA | 2760 | Monday | 0 | May |
| 281244 | Structure Fire | Westcom | 4515 86TH ST, STE 35, URBANDALE, IA | 1:13 AM | 2021-05-03 | A213 A323 A433 C300 C404 E411 E431 JGMENG2 L325 L425 | 28 m | None | 4515 86TH ST | STE 35 | URBANDALE | IA | 1680 | Monday | 0 | May |
| 281245 | Refuse/Garbage Fire | Wicomico County | 317 WHITMAN AVE, SALISBURY, MD | 12:57 AM | 2021-05-03 | TR2 | 16 m | None | 317 WHITMAN AVE | NaN | SALISBURY | MD | 960 | Monday | 0 | May |
| 281246 | Residential Fire | West Palm Beach Fire | 436 51ST ST, WEST PALM BEACH, FL | 12:37 AM | 2021-05-03 | BC1 BC5 E1 E4 EMS2 FOO HM2 L6 PI11 R1 R3 SQ5 TAC8A TR1 WPIV WPIV2 | 1 h 30 m | None | 436 51ST ST | NaN | WEST PALM BEACH | FL | 5400 | Monday | 0 | May |
| 281247 | Appliance Fire | Wicomico County | 1149 S DIVISION ST, SALISBURY, MD | 8:55 PM | 2021-05-02 | AC1 E1 TR2 | 12 m | None | 1149 S DIVISION ST | NaN | SALISBURY | MD | 720 | Sunday | 6 | May |
| 281248 | Residential Fire | Westfield Fire | 6311 E 161ST ST, NOBLESVILLE, IN | 3:36 PM | 2021-05-02 | E382 | 26 m | None | 6311 E 161ST ST | NaN | NOBLESVILLE | IN | 1560 | Sunday | 6 | May |
| 281249 | Refuse/Garbage Fire | West Pierce Fire | 112TH ST SW & FARWEST DR SW, LAKEWOOD, WA | 3:14 PM | 2021-05-02 | E22 | 17 m | None | 112TH ST SW & FARWEST DR SW | NaN | LAKEWOOD | WA | 1020 | Sunday | 6 | May |
| 281250 | Mutual Aid | Wicomico County | 31671 W POST OFFICE RD, PRINCESS ANNE, MD | 11:47 AM | 2021-05-02 | ET151 RE302 | 1 h 5 m | None | 31671 W POST OFFICE RD | NaN | PRINCESS ANNE | MD | 3900 | Sunday | 6 | May |
| 281251 | Extinguished Fire | West Metro | W YALE AVE & S INDIANA ST, LAKEWOOD, CO | 10:44 AM | 2021-05-02 | E9 | 13 m | None | W YALE AVE & S INDIANA ST | NaN | LAKEWOOD | CO | 780 | Sunday | 6 | May |
| 281252 | Structure Fire | Westcom | 14575 SE UNIVERSITY AVE, WAUKEE, IA | 10:33 AM | 2021-05-02 | A913 A917 C900 E190 E220 E910 L425 | 59 m | None | 14575 SE UNIVERSITY AVE | NaN | WAUKEE | IA | 3540 | Sunday | 6 | May |
| 281253 | Structure Fire | Whatcom Fire/EMS | 223 E BAKERVIEW RD, STE 348, BELLINGHAM, WA | 6:18 AM | 2021-05-02 | B1 E6 L5 | 13 h 10 m | None | 223 E BAKERVIEW RD | STE 348 | BELLINGHAM | WA | 47400 | Sunday | 6 | May |
| 281254 | Confirmed Structure Fire | Westcom | 1129 11TH ST, STE 304, WEST DES MOINES, IA | 6:17 AM | 2021-05-02 | A193 A213 C100 C104 C199 C219 E170 E180 E220 L215 L325 U218 WHTENG | 4 h 48 m | None | 1129 11TH ST | STE 304 | WEST DES MOINES | IA | 17280 | Sunday | 6 | May |
| 281255 | Structure Fire | Westcom | 1650 SE HOLIDAY CREST CIR, WAUKEE, IA | 4:52 AM | 2021-05-02 | A433 A913 C219 E190 E220 E431 E910 L425 WKEFD1 | 1 h 22 m | None | 1650 SE HOLIDAY CREST CIR | NaN | WAUKEE | IA | 4920 | Sunday | 6 | May |
| 281256 | Structure Fire | Westcom | 1245 SE UNIVERSITY AVE, STE 103, WAUKEE, IA | 4:37 AM | 2021-05-02 | A913 C219 C901 E220 E431 E910 L215 | 21 m | None | 1245 SE UNIVERSITY AVE | STE 103 | WAUKEE | IA | 1260 | Sunday | 6 | May |
| 281257 | Fire | Anoka County | 3740 BRIDGE ST, SAINT FRANCIS, MN | 3:37 AM | 2021-05-03 | NaN | 2 m | None | 3740 BRIDGE ST | NaN | SAINT FRANCIS | MN | 120 | Monday | 0 | May |
| 281258 | Commercial Fire | Anne Arundel CFD | 7514 RITCHIE HWY, GLEN BURNIE, MD (LA FONTAINE BLEUE) | 2:11 AM | 2021-05-03 | BC01 CH12 E122 E181 E301 E311 E331 MU33 RS11 SAFE03 SAFE07 SCMD TK26 TK31 | 44 m | LA FONTAINE BLEUE | 7514 RITCHIE HWY | NaN | GLEN BURNIE | MD | 2640 | Monday | 0 | May |
| 281259 | Extinguished Fire | Anne Arundel CFD | 808 ELMHURST RD, SEVERN, MD | 2:04 AM | 2021-05-03 | E041 E331 | 26 m | None | 808 ELMHURST RD | NaN | SEVERN | MD | 1560 | Monday | 0 | May |
| 281260 | Extinguished Fire | Anne Arundel CFD | 245 KILMARNOCK DR, MILLERSVILLE, MD | 1:02 AM | 2021-05-03 | E301 | 30 m | None | 245 KILMARNOCK DR | NaN | MILLERSVILLE | MD | 1800 | Monday | 0 | May |
| 281261 | Mutual Aid | Anne Arundel CFD | 7015 AARONSON DR, BWI AIRPORT, MD (GENERAL AVIATION TERMINAL AND SIGNATURE FLIGHT SUPPORT) | 12:37 AM | 2021-05-03 | HOLD01 RE23 RS11 TK04 TR04 | 1 h 17 m | GENERAL AVIATION TERMINAL AND SIGNATURE FLIGHT SUPPORT | 7015 AARONSON DR | NaN | BWI AIRPORT | MD | 4620 | Monday | 0 | May |
| 281262 | Residential Fire | Anne Arundel CFD | 106 PINECREST DR, ANNAPOLIS, MD | 12:15 AM | 2021-05-03 | NaN | 43 m | None | 106 PINECREST DR | NaN | ANNAPOLIS | MD | 2580 | Monday | 0 | May |
| 281263 | Mutual Aid | Anne Arundel CFD | 2508 KNIGHTHILL LN, BOWIE, MD | 10:14 PM | 2021-05-02 | MU05 | 38 m | None | 2508 KNIGHTHILL LN | NaN | BOWIE | MD | 2280 | Sunday | 6 | May |
| 281264 | Mutual Aid | Anne Arundel CFD | FORT MEADE RD & LAUREL BOWIE RD, LAUREL, MD | 9:32 PM | 2021-05-02 | RE27 | 26 m | None | FORT MEADE RD & LAUREL BOWIE RD | NaN | LAUREL | MD | 1560 | Sunday | 6 | May |
| 281265 | Mutual Aid | Anne Arundel CFD | 13503 AVEBURY DR, LAUREL, MD | 9:12 PM | 2021-05-02 | MU27 | 39 m | None | 13503 AVEBURY DR | NaN | LAUREL | MD | 2340 | Sunday | 6 | May |
| 281266 | Commercial Fire | Anaheim FD | 2400 E KATELLA AV, ANAHEIM, CA (STADIUM TOWERS PLAZA BUILDING) | 8:25 PM | 2021-05-02 | AB2 AE3 AE7 OE3 OR6 OT6 | 14 m | STADIUM TOWERS PLAZA BUILDING | 2400 E KATELLA AV | NaN | ANAHEIM | CA | 840 | Sunday | 6 | May |
| 281267 | Mutual Aid | Anoka County | 13301 HANSON BLVD NW, ANDOVER, MN (ANOKA COUNTY PUBLIC SAFETY CAMPUS) | 11:01 AM | 2021-05-02 | NaN | 42 m | ANOKA COUNTY PUBLIC SAFETY CAMPUS | 13301 HANSON BLVD NW | NaN | ANDOVER | MN | 2520 | Sunday | 6 | May |
| 281268 | Structure Fire | Anoka County | SOUTH COON CREEK DR NW & ROUND LAKE BLVD NW, ANDOVER, MN | 6:22 AM | 2021-05-02 | AALL AE21 | 10 m | None | SOUTH COON CREEK DR NW & ROUND LAKE BLVD NW | NaN | ANDOVER | MN | 600 | Sunday | 6 | May |
| 281269 | Residential Fire | Anaheim FD | 3114 W TYLER AV, ANAHEIM, CA | 4:31 AM | 2021-05-02 | AB2 AE11 AE4 CE61 CT61 | 34 m | None | 3114 W TYLER AV | NaN | ANAHEIM | CA | 2040 | Sunday | 6 | May |
| 281270 | Structure Fire | Anaheim FD | 710 E CERRITOS AV, ANAHEIM, CA (CRENSHAW LUMBER) | 1:33 AM | 2021-05-02 | AB1 AB2 AE1 AE3 AE5 AE6 AE7 AI2 AT1 AT3 AT6 CE83 OB1 OE3 | 5 h 46 m | CRENSHAW LUMBER | 710 E CERRITOS AV | NaN | ANAHEIM | CA | 20760 | Sunday | 6 | May |
| 281271 | Mutual Aid | Sumter Fire & EMS | 34464 CORTEZ BLVD, BLDG NOT FOUND, RIDGE MANOR, FL (DOLLAR GENERAL) | 3:45 AM | 2021-05-03 | NaN | 9 m | DOLLAR GENERAL | 34464 CORTEZ BLVD | BLDG NOT FOUND | RIDGE MANOR | FL | 540 | Monday | 0 | May |
| 281272 | Residential Fire | Suffolk Fire Rescue | 234 N 4TH ST, SUFFOLK, VA | 3:37 AM | 2021-05-03 | B1 E1 E2 E3 EMS1 L3 M3 R1 R6 SF1 | 8 m | None | 234 N 4TH ST | NaN | SUFFOLK | VA | 480 | Monday | 0 | May |
| 281273 | Residential Fire | Suffolk Fire Rescue | 1252 WILROY RD, SUFFOLK, VA | 2:27 AM | 2021-05-03 | B1 E1 E2 E3 EMS1 L3 M3 M9 R1 SF1 T1 T9 | 21 m | None | 1252 WILROY RD | NaN | SUFFOLK | VA | 1260 | Monday | 0 | May |
| 281274 | Commercial Fire | Suffolk Fire Rescue | 913 E WASHINGTON ST, SUFFOLK, VA | 2:08 AM | 2021-05-03 | B1 E1 E2 E3 E4 L3 L5 M3 R1 R6 SF1 | 17 m | None | 913 E WASHINGTON ST | NaN | SUFFOLK | VA | 1020 | Monday | 0 | May |
| 281275 | Residential Fire | Suffolk Co FRES | 717 OCEAN BREEZE WK, OCEAN BEACH, NY | 11:05 PM | 2021-05-02 | 3-20-05 3-20-07 3-20-30 3-20-31 3-20-32 3-20-A OBCHPD-A | 51 m | None | 717 OCEAN BREEZE WK | NaN | OCEAN BEACH | NY | 3060 | Sunday | 6 | May |
| 281276 | Fire | Tacoma Fire | S 8TH ST & YAKIMA AVE, TACOMA, WA | 7:38 PM | 2021-05-02 | E01 | 8 m | None | S 8TH ST & YAKIMA AVE | NaN | TACOMA | WA | 480 | Sunday | 6 | May |
| 281277 | Fire | Tacoma Fire | S 92ND ST & S HOSMER ST, TACOMA, WA | 3:47 PM | 2021-05-02 | E08 | 2 m | None | S 92ND ST & S HOSMER ST | NaN | TACOMA | WA | 120 | Sunday | 6 | May |

Time of the day

Each incident is assigned a time-of-day label based on the ranges below:

| Time of the Day | Range |
| Morning | 5 AM to 11:59 AM |
| Afternoon | 12 PM to 4:59 PM |
| Evening | 5 PM to 8:59 PM |
| Night | 9 PM to 11:59 PM |
| Midnight | 12 AM to 4:59 AM |
In [43]:
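# https://stackoverflow.com/a/70018607/11105356
from datetime import datetime

def time_range(time):
    # Parse a 12-hour clock string such as "9:51 PM" and keep only the hour (0-23)
    hour = datetime.strptime(time, '%I:%M %p').hour
    if hour > 20:      # 9 PM to 11:59 PM
        return "Night"
    elif hour > 16:    # 5 PM to 8:59 PM
        return "Evening"
    elif hour > 11:    # 12 PM to 4:59 PM
        return "Afternoon"
    elif hour > 4:     # 5 AM to 11:59 AM
        return "Morning"
    else:              # 12 AM to 4:59 AM
        return "Midnight"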
In [44]:
pulse_point_df["time_of_the_day"] = pulse_point_df.timestamp_time.apply(time_range)

# # pulse_point_df.timestamp_time = pd.to_datetime(pulse_point_df.timestamp_time).dt.time
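The same binning can also be expressed without a Python-level function call per row. The sketch below is an alternative, not the notebook's method: it parses the hour with `pd.to_datetime` and bins it with `pd.cut`, using edges chosen to match the range table above. The sample `times` are the boundary values of each range.

```python
import pandas as pd

# Boundary times for every range in the table above.
times = pd.Series([
    "12:00 AM", "4:59 AM", "5:00 AM", "11:59 AM", "12:00 PM",
    "4:59 PM", "5:00 PM", "8:59 PM", "9:00 PM", "11:59 PM",
])

# Hour of day (0-23) parsed from the 12-hour clock text.
hours = pd.to_datetime(times, format="%I:%M %p").dt.hour

# Right-inclusive bins: (-1, 4], (4, 11], (11, 16], (16, 20], (20, 23]
labels = ["Midnight", "Morning", "Afternoon", "Evening", "Night"]
time_of_day = pd.cut(hours, bins=[-1, 4, 11, 16, 20, 23], labels=labels)

print(pd.DataFrame({"time": times, "hour": hours, "label": time_of_day}))
```

The result is a categorical column whose categories keep the order given in `labels`, which can be handy when the labels are later used on a plot axis.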

Save Cleaned and Processed Dataset

In [45]:
pulse_point_df.to_csv('PulsePoint-emergencies-cleaned.csv', index=False)
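CSV files do not store column dtypes, so date_of_incident comes back as plain text when the file is reloaded unless it is parsed explicitly. The sketch below shows the difference on a toy frame, using an in-memory buffer in place of the real file.

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned dataset.
df = pd.DataFrame({
    "date_of_incident": pd.to_datetime(["2021-05-02", "2021-05-03"]),
    "duration_in_seconds": [720, 1140],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Without parse_dates the date column is read back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the datetime dtype is restored.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date_of_incident"])

print(plain.dtypes, parsed.dtypes, sep="\n\n")
```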

6 EDA

A quick overview of the preprocessed data:

The preprocessed dataset contains 5 additional columns extracted from the location column and another 5 derived from the date_of_incident, duration, and timestamp_time columns. The Id, Incident_logo, and agency_logo columns from the original dataset were discarded.

| Columns | Description | Data Type |
| business | Name of the business place extracted from location (e.g., JANIE & JACK, DOLLAR GENERAL) | object |
| address | Address where the incident took place (extracted from location) | object |
| address_2 | Extended address where the incident took place (extracted from location) | object |
| city | City where the incident took place (extracted from location). It could also be a town or a country | object |
| state | State where the incident took place (extracted from location) | object |
| duration_in_seconds | Incident duration in seconds (extracted from duration) | numeric, int |
| day_name | Name of the day when the incident took place | object |
| weekday | Day of the week, with Monday=0 and Sunday=6 | numeric, int |
| month_name | Name of the month (extracted from date) | object |
| time_of_the_day | Morning (5 AM-11:59 AM), Afternoon (12 PM-4:59 PM), Evening (5 PM-8:59 PM), Night (9 PM-11:59 PM), Midnight (12 AM-4:59 AM) | object |
In [46]:
printmd(f"There are a total of **{pulse_point_df.shape[0]}** incidents")

There are a total of 281278 incidents

In [47]:
pulse_point_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 281278 entries, 0 to 281277
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   title                281278 non-null  object        
 1   agency               281278 non-null  object        
 2   location             281278 non-null  object        
 3   timestamp_time       281278 non-null  object        
 4   date_of_incident     281278 non-null  datetime64[ns]
 5   description          267894 non-null  object        
 6   duration             281278 non-null  object        
 7   business             15126 non-null   object        
 8   address              281278 non-null  object        
 9   address_2            9032 non-null    object        
 10  city                 281278 non-null  object        
 11  state                281278 non-null  object        
 12  duration_in_seconds  281278 non-null  int64         
 13  day_name             281278 non-null  object        
 14  weekday              281278 non-null  int64         
 15  month_name           281278 non-null  object        
 16  time_of_the_day      281278 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(14)
memory usage: 38.6+ MB
In [48]:
pulse_point_df.describe().T
Out[48]:
| | count | mean | std | min | 25% | 50% | 75% | max |
| duration_in_seconds | 281278.0 | 2578.742312 | 2878.920631 | 0.0 | 1020.0 | 1860.0 | 3480.0 | 116760.0 |
| weekday | 281278.0 | 3.076256 | 2.012422 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 |
In [49]:
pulse_point_df.describe(include='object').T
Out[49]:
| | count | unique | top | freq |
| title | 281278 | 88 | Medical Emergency | 179753 |
| agency | 281278 | 773 | Montgomery County | 6296 |
| location | 281278 | 179763 | COLLINS AVE, MIAMI BEACH, FL | 99 |
| timestamp_time | 281278 | 1440 | 5:04 AM | 333 |
| description | 267894 | 85203 | E1 | 1249 |
| duration | 281278 | 728 | 16 m | 6420 |
| business | 15126 | 11244 | UNINC | 83 |
| address | 281278 | 155622 | MAIN ST | 462 |
| address_2 | 9032 | 2758 | STE BLK | 266 |
| city | 281278 | 3530 | LOS ANGELES | 8017 |
| state | 281278 | 44 | CA | 85136 |
| day_name | 281278 | 7 | Sunday | 45983 |
| month_name | 281278 | 8 | November | 53193 |
| time_of_the_day | 281278 | 5 | Morning | 101072 |
In [50]:
missing_value_describe(pulse_point_df)
Number of rows with at least 1 missing values: 280200
Number of columns with missing values: 3

Missing percentage (desceding):
| | Total | Percentage(%) |
| address_2 | 272246 | 96.788942 |
| business | 266152 | 94.622402 |
| description | 13384 | 4.758282 |
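The `missing_value_describe` helper is defined earlier in the notebook and is not reproduced here. As a rough illustration of how such a summary can be computed, the sketch below uses a hypothetical function `missing_summary` on a toy frame; it is an assumption about the approach, not the notebook's implementation.

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    total = df.isna().sum()
    summary = pd.DataFrame({
        "Total": total,
        "Percentage(%)": total / len(df) * 100,
    })
    # Keep only columns with gaps, largest first.
    return summary[summary["Total"] > 0].sort_values("Total", ascending=False)

# Toy frame: business is mostly empty, description has one gap.
toy = pd.DataFrame({
    "title": ["Fire", "Alarm", "Fire", "Mutual Aid"],
    "business": [None, "DOLLAR GENERAL", None, None],
    "description": ["E1", None, "E22", "TR2"],
})

summary = missing_summary(toy)
# Rows that contain at least one missing value.
rows_with_gaps = int(toy.isna().any(axis=1).sum())

print(summary)
print("Rows with at least 1 missing value:", rows_with_gaps)
```

The same arithmetic reproduces the figures above: address_2 has 9032 non-null values out of 281278 rows, so 272246 are missing, which is about 96.79%.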

6.1 Incidents

In [51]:
printmd(f"There are a total of **{pulse_point_df.title.nunique()}** types of incidents")

There are a total of 88 types of incidents

Top

In [52]:
pulse_point_df.title.value_counts().head(20)
Out[52]:
Medical Emergency             179753
Traffic Collision              22835
Fire Alarm                     11147
Alarm                           7420
Public Service                  7363
Refuse/Garbage Fire             4437
Structure Fire                  4152
Lift Assist                     3130
Mutual Aid                      2862
Fire                            2723
Residential Fire                2559
Expanded Traffic Collision      2527
Interfacility Transfer          2014
Outside Fire                    1963
Vehicle Fire                    1810
Investigation                   1628
Carbon Monoxide                 1513
Vegetation Fire                 1428
Hazardous Condition             1423
Commercial Fire                 1405
Name: title, dtype: int64
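Raw counts can be easier to compare as shares of all incidents. The sketch below converts the three largest counts listed above into percentages of the 281278 total; on the full frame, `value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Top three counts copied from the output above.
counts = pd.Series({
    "Medical Emergency": 179753,
    "Traffic Collision": 22835,
    "Fire Alarm": 11147,
})
total_incidents = 281278

# Share of all incidents, in percent, rounded to one decimal place.
share = (counts / total_incidents * 100).round(1)

print(share)
```

Medical emergencies alone account for roughly 64% of all logged incidents, which is worth keeping in mind when reading the word cloud that follows.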

Wordcloud

In [53]:
# crisp wordcloud : https://stackoverflow.com/a/28795577/11105356

data = pulse_point_df.title.value_counts().to_dict()
wc = WordCloud(width=800, height=400,background_color="white", max_font_size=300).generate_from_frequencies(data)
plt.figure(figsize=(14,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()

6.2 Agency

In [54]:
printmd(f"There are a total of **{pulse_point_df.agency.nunique()}** agencies")

There are a total of 773 agencies

Most Active

In [55]:
# Top agencies by incident engagement count

pulse_point_df.agency.value_counts().head(20)
Out[55]:
Montgomery County       6296
Columbus Fire           4957
Milwaukee Fire          4793
Cleveland EMS           4728
Contra Costa FPD        4609
Fairfax County Fire     3547
Hamilton County         3473
Eug Spfld Fire          3302
Boone County Joint      3143
LAFD - Central          3136
Rockford Fire           3052
LA County FD (Div 4)    2992
LA County FD (Div 8)    2990
LA County FD (Div 6)    2964
LA County FD (Div 1)    2925
Seminole County Fire    2909
LA County FD (Div 2)    2900
LA County FD (Div 5)    2864
Seattle FD              2858
Miami Beach Fire        2802
Name: agency, dtype: int64
In [56]:
pulse_point_df.agency.value_counts().head(10).sort_values(ascending=False).plot(kind = 'bar');

Wordcloud

Most frequent - Montgomery County

In [57]:
data = pulse_point_df.agency.value_counts().to_dict()
wc = WordCloud(width=800, height=400,background_color="white", max_font_size=300).generate_from_frequencies(data)
plt.figure(figsize=(14,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()