%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
from mpl_toolkits.basemap import Basemap
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import SCORERS
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV, LassoCV
from scipy import sparse
from sklearn.metrics import mean_squared_error, mean_absolute_error
import random
# Base path for the input CSV files.
PATH = './data/'
# Notebook plot theme (cosmetic only).
from jupyterthemes import jtplot
jtplot.style(theme='onedork')
# Load the Boston crime incidents and the cleaned daily-weather tables.
file_name_cr = 'crime.csv'
df_crime = pd.read_csv(PATH+file_name_cr, encoding = "ISO-8859-1")
df_crime.head()
file_name_weather = 'Boston weather_clean.csv'
df_weather = pd.read_csv(PATH+file_name_weather, encoding = "ISO-8859-1")
df_weather.tail()
# Missing values per column before cleaning.
df_crime.isnull().sum()
Let's see which districts are present in the data.
df_crime.DISTRICT.value_counts()
# Fill categorical gaps with explicit sentinel labels rather than dropping rows.
df_crime.DISTRICT = df_crime.DISTRICT.fillna('NON')
df_crime.SHOOTING = df_crime.SHOOTING.fillna('NON')
df_crime.UCR_PART.value_counts()
df_crime.UCR_PART = df_crime.UCR_PART.fillna('Other')
len(df_crime.STREET.unique())
df_crime.STREET = df_crime.STREET.fillna('Other')
# Inspect, then drop, records that have no coordinates at all.
df_crime[np.isnan(df_crime.Lat)]['Location'].unique()
df_crime = df_crime[~np.isnan(df_crime.Lat)]
df_crime.isnull().sum()
df_weather.isnull().sum()
df_crime.OFFENSE_CODE_GROUP.value_counts()
df_crime.OFFENSE_CODE.value_counts()
# Coordinate ranges — reveals placeholder outliers far from Boston.
min(df_crime.Lat), max(df_crime.Lat)
min(df_crime.Long), max(df_crime.Long)
There are outlier spikes in the coordinates (placeholder values far from Boston).
# Parse timestamps once, vectorized.
# FIX: the original used .map(pd.to_datetime), which parses row-by-row and is
# far slower; pd.to_datetime(series) is equivalent and vectorized.
df_crime.OCCURRED_ON_DATE = pd.to_datetime(df_crime.OCCURRED_ON_DATE)
df_crime['test_one'] = 1  # helper column: 1 per incident, so sum == count
# Incidents per (year, month).
# FIX: groupby with a tuple of keys is deprecated/ambiguous — use a list.
g = df_crime.groupby(['YEAR', 'MONTH'])['test_one'].sum()
fig = plt.figure(1, (12, 8))
ax1 = fig.add_subplot(211)
g.unstack(level=0).plot(kind='bar', grid=True, ax=ax1)
ax2 = fig.add_subplot(212)
g.plot(grid=True, ax=ax2)
ax2.set_xticks(range(len(g)))
# Label every point as "YYYY-MM".
ax2.set_xticklabels(["%s-%02d" % item for item in g.index.tolist()], rotation=90)
Most crimes are committed in the summer and the fewest in the winter; the average monthly rate stays roughly constant across years.
# Incidents per hour of the day.
df_crime.groupby(('HOUR'))['test_one'].sum().plot(kind='bar',figsize=(15,6), grid = True);
Most crimes are committed in the evening, the fewest late at night.
# Incidents per day of week, ordered Monday..Sunday, as percentage shares.
order = ['Monday', 'Tuesday', 'Wednesday','Thursday','Friday','Saturday','Sunday']
df_crime.groupby(('DAY_OF_WEEK'))['test_one'].sum().loc[order].plot(kind = 'pie', figsize=(7, 7), autopct='%.2f');
Toward the weekend the number of crimes decreases.
# Shooting-related incidents only: by month, over time, and by weekday.
g = df_crime[df_crime.SHOOTING != 'NON'].groupby(('YEAR','MONTH'))['test_one'].sum()
fig = plt.figure(1, (12, 12))
ax1 = fig.add_subplot(311)
g.unstack(level=0).plot(kind='bar', grid = True, ax = ax1, stacked=True);
ax2 = fig.add_subplot(312)
g.plot(grid = True, ax = ax2);
ax2.set_xticks(range(len(g)));
ax2.set_xticklabels(["%s-%02d" % item for item in g.index.tolist()], rotation=90);
g = df_crime[df_crime.SHOOTING != 'NON'].groupby(('DAY_OF_WEEK'))['test_one'].sum().loc[order]
ax3 = fig.add_subplot(313)
g.plot(kind='bar', grid = True, ax = ax3, stacked=True);
# Top 30 streets by number of incidents.
g = df_crime.groupby(('STREET'))['test_one'].sum().sort_values(ascending = False).head(30)
fig = plt.figure(1, (12, 8))
#ax1 = fig.add_subplot(211)
g.plot(kind='bar', grid = True);
# Top 20 offense groups.
g = df_crime.groupby(('OFFENSE_CODE_GROUP'))['test_one'].sum().sort_values(ascending = False).head(20)
fig = plt.figure(1, (12, 8))
#ax1 = fig.add_subplot(211)
g.plot(kind='barh', grid = True);
# Incident share per district.
g = df_crime.groupby(('DISTRICT'))['test_one'].sum().sort_values(ascending = False).head(20)
fig = plt.figure(1, (12, 8))
#ax1 = fig.add_subplot(211)
g.plot(kind='pie', grid = True);
# Keep only rows with plausible Boston coordinates (drops the placeholder
# spikes seen above). FIX: .copy() so the column assignment below does not
# raise SettingWithCopyWarning / silently fail on a view.
df_crime_pre = df_crime[(df_crime.Lat > 20) & (df_crime['Long'] < -20)].copy()
# Per offense group: counts plus median coordinates.
# FIX: the original used groupby(...)['Lat','Long','test_one'] (tuple column
# selection) and aggregate((np.sum, np.median)) — both deprecated; lists are
# the supported form and behave identically.
g = (df_crime_pre.groupby('OFFENSE_CODE_GROUP')[['Lat', 'Long', 'test_one']]
     .aggregate([np.sum, np.median])
     .sort_values(by=('test_one', 'sum'), ascending=False))
fig = plt.figure(1, (8, 8))
plt.scatter(x=g[('Lat', 'median')].values, y=g[('Long', 'median')].values, marker='*')
# All incidents coloured by district.
fig = plt.figure(1, (8, 8))
sns.scatterplot(x='Long', y='Lat', hue='DISTRICT', data=df_crime_pre)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
# Coordinate density estimates.
df_crime_pre['Lat'].plot(kind='kde')
df_crime_pre['Long'].plot(kind='kde')
# Day-of-month column, then join daily weather on (year, month, day).
df_crime_pre['DAY'] = df_crime_pre['OCCURRED_ON_DATE'].dt.day
df_total = df_crime_pre.merge(df_weather, left_on=['YEAR', 'MONTH', 'DAY'], right_on=['Year', 'Month', 'Day'])
plt.figure();
# Temperature distributions on crime days (from the merged table).
df_total['Low Temp (F)'].hist(alpha = 0.5, bins = 30);
df_total['High Temp (F)'].hist(alpha = 0.5, bins = 30);
# this is a real temperature distribution
plt.figure();
df_weather['Low Temp (F)'].hist(alpha = 0.5, bins = 30);
df_weather['High Temp (F)'].hist(alpha = 0.5, bins = 30);
df_total.groupby('Events')['test_one'].aggregate(np.sum).plot.pie();
# what precipitation events were recorded at the time of the crimes
Let's calculate the correlation between the variables.
def plot_corr(df, size=10):
    """Draw a heatmap of the pairwise correlation matrix of *df*.

    df   -- DataFrame whose numeric columns are correlated.
    size -- figure width/height in inches.
    """
    corr_matrix = df.corr()
    fig, axis = plt.subplots(figsize=(size, size))
    heat = axis.matshow(corr_matrix, interpolation='none', cmap='plasma')
    labels = corr_matrix.columns
    plt.xticks(range(len(labels)), labels, rotation=90)
    plt.yticks(range(len(labels)), labels)
    fig.colorbar(heat, ax=axis, shrink=0.7)
# One-hot encode the main categorical columns and assemble a numeric frame
# (dummies + coordinates + weather columns 23:42) for the correlation heatmap.
df_dummies_OFFENSE_CODE_GROUP = pd.get_dummies(df_total['OFFENSE_CODE_GROUP'])
df_dummies_Events = pd.get_dummies(df_total['Events'])
df_dummies_DISTRICT = pd.get_dummies(df_total['DISTRICT'])
#df_dummies_SHOOTING = pd.get_dummies(df_total['SHOOTING'])
df_dummies_UCR_PART = pd.get_dummies(df_total['UCR_PART'])
df_dummies_STREET = pd.get_dummies(df_total['STREET'])
df_new = pd.concat([df_dummies_OFFENSE_CODE_GROUP, df_dummies_Events, df_dummies_DISTRICT, df_dummies_UCR_PART, df_total[['Lat', 'Long']], df_total.iloc[:, 23:42]], axis=1)
plot_corr(df_new, 25)
plt.figure(figsize=(12, 12))
map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)
#http://server.arcgisonline.com/arcgis/rest/services
#EPSG Number of America is 4269
#World_Street_Map
#World_Topo_Map
map.arcgisimage(service='World_Street_Map', xpixels = 1000, verbose= True)
# Plot the first 100 incidents.
# FIX: df_crime.Long[i] is label-based; rows were dropped during cleaning, so
# small integer labels may be missing and raise KeyError — use positional .iloc.
for i in range(100):
    xpt, ypt = map(df_crime.Long.iloc[i], df_crime.Lat.iloc[i])
    map.scatter(xpt, ypt, c='b')
# airport coordinates
coo_aero = (-71.01, 42.36)
xpt, ypt = map(coo_aero[0], coo_aero[1])
map.scatter(xpt, ypt, c='r')
# south station
coo_v1 = (-71.06, 42.34)
xpt, ypt = map(coo_v1[0], coo_v1[1])
map.scatter(xpt, ypt, c='r')
plt.show()
# Restrict to a single offense code (3006) and build a regression matrix for
# predicting the incident coordinates.
X = df_total.copy()
X = X[X.OFFENSE_CODE == 3006]
X.head()
X.info()
# Drop identifiers and duplicated join keys.
X.drop(['INCIDENT_NUMBER', 'Location', 'test_one', 'Day', 'Month', 'Year'], axis = 1, inplace=True)
# Treat calendar fields as categoricals so get_dummies encodes them.
X['OFFENSE_CODE'] = X.OFFENSE_CODE.astype('category')
X['YEAR'] = X.YEAR.astype('category')
X['MONTH'] = X.MONTH.astype('category')
X['HOUR'] = X.HOUR.astype('category')
X['DAY'] = X.DAY.astype('category')
#X['quarter'] = X.quarter.astype('category')
# Drop high-cardinality / location-leaking columns.
X.drop(['STREET',], axis = 1, inplace=True)
X.drop(['DISTRICT',], axis = 1, inplace=True)
X.drop(['REPORTING_AREA',], axis = 1, inplace=True)
X.drop(['OFFENSE_DESCRIPTION',], axis = 1, inplace=True)
X_hot = pd.get_dummies(X)
X_hot.drop(['OCCURRED_ON_DATE',], axis = 1, inplace=True)
# Standardize the remaining numeric columns; Lat/Long are excluded because
# they are the regression targets.
ss = StandardScaler()
col = [x for x in X_hot.columns if ((X_hot[x].dtype == np.float64 or X_hot[x].dtype == np.int64) and (x !='Lat') and (x != 'Long'))]
col
X_ss = pd.DataFrame(ss.fit_transform(X_hot[col].values), columns=col)
X_hot.drop(col, axis=1, inplace=True)
X_hot_ss = X_hot.reset_index(drop=True).join(X_ss.reset_index(drop=True))
X_hot_ss.shape
# Targets: latitude (y1) and longitude (y2).
y1 = X_hot_ss['Lat']
y2 = X_hot_ss['Long']
X_hot_ss.drop(['Lat', 'Long'], axis=1, inplace=True)
X_train, X_val, y1_train, y1_val, y2_train, y2_val = train_test_split(X_hot_ss, y1, y2, random_state = 42)
%%time
# Ridge regression with built-in CV over alphas; one model per coordinate.
model1_cv = RidgeCV(alphas= (0.01, 0.05,0.1,0.7,1,2), cv=7)
model2_cv = RidgeCV(alphas= (0.01, 0.05,0.1,0.7,1,2), cv=7)
#model = LinearRegression()
rid1_cv = model1_cv.fit(X_train, y1_train)
rid2_cv = model2_cv.fit(X_train, y2_train)
pred_Lat_cv = rid1_cv.predict(X_val)
# NOTE: despite the 'mse_' prefix these variables hold mean *absolute* errors.
mse_Lat_cv = mean_absolute_error(y1_val, pred_Lat_cv)
pred_Long_cv = rid2_cv.predict(X_val)
mse_Long_cv = mean_absolute_error(y2_val, pred_Long_cv)
mse_Lat_cv, mse_Long_cv
# Residuals ('del' = true - predicted) and their log-scale histograms.
dd_Lat_cv = pd.DataFrame({'tr':y1_val, 'pred': pred_Lat_cv, 'del': (y1_val - pred_Lat_cv)}).sort_values(by=['del'])
dd_Long_cv = pd.DataFrame({'tr':y2_val, 'pred': pred_Long_cv, 'del': (y2_val - pred_Long_cv)}).sort_values(by=['del'])
fig = plt.figure(1, (12, 12))
ax1 = fig.add_subplot(211)
dd_Lat_cv['del'].hist(bins = 100, ax = ax1, range = (-0.2,0.2), log = True)
ax2 = fig.add_subplot(212)
dd_Long_cv['del'].hist(bins = 100, ax = ax2, range = (-0.2,0.2), log= True)
plt.figure(figsize=(12, 12))
map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)
#http://server.arcgisonline.com/arcgis/rest/services
#EPSG Number of America is 4269
map.arcgisimage(service='World_Topo_Map', xpixels=1000, verbose=True)
# 100 random validation points: true location (red) vs Ridge prediction (blue).
random.seed(3)
for rr in range(100):
    # FIX: random.randint's upper bound is *inclusive*, so len(...) could
    # index one past the end -> len-1. FIX: the true coordinates must come
    # from the validation targets (y1_val/y2_val), positionally aligned with
    # pred_*_cv; the original y1[rnd_int] indexed the full data set and was
    # misaligned with the predictions.
    rnd_int = random.randint(0, len(dd_Lat_cv) - 1)
    coordinat_true_Lat = y1_val.iloc[rnd_int]
    coordinat_true_Long = y2_val.iloc[rnd_int]
    coordinat_pred_Lat = pred_Lat_cv[rnd_int]
    coordinat_pred_Long = pred_Long_cv[rnd_int]
    xpt1, ypt1 = map(coordinat_true_Long, coordinat_true_Lat)
    map.scatter(xpt1, ypt1, c='r')
    xpt2, ypt2 = map(coordinat_pred_Long, coordinat_pred_Lat)
    map.scatter(xpt2, ypt2, c='b')
# Largest Ridge coefficient magnitudes for each coordinate model.
pd.DataFrame({'colum': X_train.columns, 'coef': np.abs(model1_cv.coef_)}).sort_values(['coef'], ascending=False).head(10)
pd.DataFrame({'colum': X_train.columns, 'coef': np.abs(model2_cv.coef_)}).sort_values(['coef'], ascending=False).head(10)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
%%time
# Random forest per coordinate; the grid search is left commented out.
#params = {'n_estimators':[160],'max_depth':[20]}
rfc1=RandomForestRegressor(random_state=42, n_jobs= 3, n_estimators = 200)
#rfc1_grid = GridSearchCV(rfc1,param_grid=params,cv=3,n_jobs= 3)
rfc1.fit(X_train, y1_train)
#rfc1=rfc1_grid.best_estimator_ # assign best model to rfc
rfc1
(pd.Series(rfc1.feature_importances_, index=X_train.columns)
.nlargest(20)
.plot(kind='barh'))
plt.title('Feature Importance') # top 10 features
%%time
rfc2=RandomForestRegressor(random_state=42, n_jobs= 3, n_estimators = 200)
#rfc2_grid = GridSearchCV(rfc2,param_grid=params,cv=5, n_jobs= 3)
rfc2.fit(X_train, y2_train)
#rfc2=rfc2_grid.best_estimator_ # assign best model to rfc
rfc2
(pd.Series(rfc2.feature_importances_, index=X_train.columns)
.nlargest(20)
.plot(kind='barh'))
plt.title('Feature Importance') # top 10 features
# Validation errors (mean absolute error despite the 'mse_' names),
# compared with the Ridge baseline.
pred_Lat_rfr = rfc1.predict(X_val)
mse_Lat_rfr = mean_absolute_error(y1_val, pred_Lat_rfr)
pred_Long_rfr = rfc2.predict(X_val)
mse_Long_rfr = mean_absolute_error(y2_val, pred_Long_rfr)
mse_Lat_rfr, mse_Long_rfr
mse_Lat_cv, mse_Long_cv
plt.figure(figsize=(12, 12))
map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)
#http://server.arcgisonline.com/arcgis/rest/services
#EPSG Number of America is 4269
map.arcgisimage(service='World_Topo_Map', xpixels=1000, verbose=True)
# 100 random validation points: true (red) vs random-forest prediction (blue),
# joined by a yellow segment.
random.seed(3)
for rr in range(100):
    # FIX: randint's upper bound is inclusive -> len-1 avoids IndexError.
    # FIX: take true coordinates from the validation targets, positionally
    # aligned with pred_*_rfr (y1[rnd_int] indexed the full set, misaligned).
    rnd_int = random.randint(0, len(dd_Lat_cv) - 1)
    coordinat_true_Lat = y1_val.iloc[rnd_int]
    coordinat_true_Long = y2_val.iloc[rnd_int]
    coordinat_pred_Lat = pred_Lat_rfr[rnd_int]
    coordinat_pred_Long = pred_Long_rfr[rnd_int]
    xpt1, ypt1 = map(coordinat_true_Long, coordinat_true_Lat)
    map.scatter(xpt1, ypt1, c='r')
    xpt2, ypt2 = map(coordinat_pred_Long, coordinat_pred_Lat)
    map.scatter(xpt2, ypt2, c='b')
    map.plot((xpt1, xpt2), (ypt1, ypt2), c='y')
# LassoCV per coordinate, same evaluation protocol as Ridge above.
model3 = LassoCV(random_state = 42)
model4 = LassoCV(random_state = 42)
model3.fit(X_train, y1_train)
model4.fit(X_train, y2_train)
pred_Lat_3 = model3.predict(X_val)
# NOTE: mean absolute error, despite the 'mse_' prefix.
mse_Lat_3 = mean_absolute_error(y1_val, pred_Lat_3)
pred_Long_3 = model4.predict(X_val)
mse_Long_3 = mean_absolute_error(y2_val, pred_Long_3)
mse_Lat_3, mse_Long_3
# Residuals and their log-scale histograms.
dd_Lat_lasso = pd.DataFrame({'tr':y1_val, 'pred': pred_Lat_3, 'del': (y1_val - pred_Lat_3)}).sort_values(by=['del'])
dd_Long_lasso = pd.DataFrame({'tr':y2_val, 'pred': pred_Long_3, 'del': (y2_val - pred_Long_3)}).sort_values(by=['del'])
fig = plt.figure(1, (12, 12))
ax1 = fig.add_subplot(211)
dd_Lat_lasso['del'].hist(bins = 100, ax = ax1, range = (-0.2,0.2), log = True)
ax2 = fig.add_subplot(212)
dd_Long_lasso['del'].hist(bins = 100, ax = ax2, range = (-0.2,0.2), log= True)
# FIX: the original called plt.figure(figsize=(12,12)) twice in a row,
# creating a stray empty figure — one call suffices.
plt.figure(figsize=(12, 12))
map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)
map.arcgisimage(service='World_Topo_Map', xpixels=1000, verbose=True)
# 100 random validation points: true (red) vs Lasso prediction (blue).
random.seed(3)
for rr in range(100):
    # FIX: randint's upper bound is inclusive -> len-1 avoids IndexError.
    # FIX: true coordinates from the validation targets, positionally aligned
    # with pred_*_3 (y1[rnd_int] indexed the full set, misaligned).
    rnd_int = random.randint(0, len(dd_Lat_cv) - 1)
    coordinat_true_Lat = y1_val.iloc[rnd_int]
    coordinat_true_Long = y2_val.iloc[rnd_int]
    coordinat_pred_Lat = pred_Lat_3[rnd_int]
    coordinat_pred_Long = pred_Long_3[rnd_int]
    xpt1, ypt1 = map(coordinat_true_Long, coordinat_true_Lat)
    map.scatter(xpt1, ypt1, c='r')
    xpt2, ypt2 = map(coordinat_pred_Long, coordinat_pred_Lat)
    map.scatter(xpt2, ypt2, c='b')
# Lasso intercept and largest coefficient magnitudes per coordinate model.
model3.intercept_
pd.DataFrame({'colum': X_train.columns, 'coef': np.abs(model3.coef_)}).sort_values(['coef'], ascending=False).head(10)
pd.DataFrame({'colum': X_train.columns, 'coef': np.abs(model4.coef_)}).sort_values(['coef'], ascending=False).head(10)
df_total.head()
# Cast object columns to pandas 'category' dtype for CatBoost.
for c in df_total.columns:
    col_type = df_total[c].dtype
    if col_type == 'object' or col_type.name == 'category':
        df_total[c] = df_total[c].astype('category')
X_total = df_total.copy()
# Drop identifiers, location-leaking fields, targets and duplicated join keys.
X_total.drop(['STREET','DISTRICT','REPORTING_AREA','OFFENSE_DESCRIPTION','Lat', 'Long','Location','INCIDENT_NUMBER','OCCURRED_ON_DATE','test_one','Year', 'Month', 'Day'], axis = 1, inplace=True)
X_total = X_total[X_total['OFFENSE_CODE']==3006]
X_train, X_val, y1_train, y1_val, y2_train, y2_val = train_test_split(X_total, y1, y2, random_state = 42)
# Print positional indices of categorical columns (used as cat_features below).
for i, c in enumerate(X_total.columns):
    col_type = df_total[c].dtype
    if col_type == 'object' or col_type.name == 'category':
        print(i)
SCORERS.keys()
# CatBoost on GPU; cat_features are the positional indices printed above.
param = {'iterations':3000, 'depth':10, 'learning_rate':0.03, 'loss_function':'RMSE', 'random_seed' : 42,
'task_type' : "GPU", 'cat_features' : [1,2,5,7,29],'verbose':0}
model_y1 = CatBoostRegressor(**param)
#catboost_pool = Pool(X_train, y1_train, cat_features = [1,2,5,7,29])
model_y1.fit(X_train, y1_train)
# Training curve (log RMSE).
fig = plt.figure(1, (12, 12))
plt.plot(np.log(model_y1.evals_result_['learn']['RMSE']))
%%time
#rsCV_y1 = RandomizedSearchCV(model_y1, param_distributions={'iterations':[100,500,1000,2000], 'depth':[i for i in range(2,15)]}, n_iter=20, n_jobs=1, random_state=42, scoring='neg_median_absolute_error')
#rsCV_y1.fit(X_train, y1_train)
model_y2 = CatBoostRegressor(**param )
model_y2.fit(X_train, y2_train)
%%time
#rsCV_y2 = RandomizedSearchCV(model_y2, param_distributions={'iterations':[100,500,1000,2000], 'depth':[i for i in range(2,15)]}, n_iter=20, n_jobs=1, random_state=42, scoring='neg_median_absolute_error')
#rsCV_y2.fit(X_train, y2_train)
#rsCV_y1.best_params_, rsCV_y1.best_params_
#y1_pred = rsCV_y1.best_estimator_.predict(X_val)
#y2_pred = rsCV_y2.best_estimator_.predict(X_val)
y1_pred = model_y1.predict(X_val)
y2_pred = model_y2.predict(X_val)
y1_pred, y2_pred
# NOTE: mean absolute error, despite the 'mse_' prefix.
mse_Lat_boost = mean_absolute_error(y1_val, y1_pred)
mse_Long_boost = mean_absolute_error(y2_val, y2_pred)
mse_Lat_boost, mse_Long_boost
plt.figure(figsize=(12, 12))
map = Basemap(llcrnrlon=-71.2, llcrnrlat=42.2, urcrnrlon=-70.95, urcrnrlat=42.46, epsg=4269)
#http://server.arcgisonline.com/arcgis/rest/services
#EPSG Number of America is 4269
map.arcgisimage(service='World_Topo_Map', xpixels=1000, verbose=True)
# 100 random validation points: true (red) vs CatBoost prediction (blue).
random.seed(3)
for rr in range(100):
    # FIX: randint's upper bound is inclusive, so len(y1_pred) could raise
    # IndexError -> len-1. FIX: true coordinates from the validation targets,
    # positionally aligned with y*_pred (y1[rnd_int] was misaligned).
    rnd_int = random.randint(0, len(y1_pred) - 1)
    coordinat_true_Lat = y1_val.iloc[rnd_int]
    coordinat_true_Long = y2_val.iloc[rnd_int]
    coordinat_pred_Lat = y1_pred[rnd_int]
    coordinat_pred_Long = y2_pred[rnd_int]
    xpt1, ypt1 = map(coordinat_true_Long, coordinat_true_Lat)
    map.scatter(xpt1, ypt1, c='r')
    xpt2, ypt2 = map(coordinat_pred_Long, coordinat_pred_Lat)
    map.scatter(xpt2, ypt2, c='b')
    # map.plot((xpt1,xpt2),(ypt1,ypt2),c = 'y')
# Top-20 CatBoost feature importances (longitude model).
feat_imp = pd.Series(model_y2.feature_importances_, index=X_train.columns)
feat_imp.nlargest(20).plot(kind='barh', figsize=(8, 10))
X.columns
df = X
# Daily incident counts for this offense code (time series per timestamp,
# counting non-null Lat as a proxy for "one incident").
aggr_df = df.groupby('OCCURRED_ON_DATE')[['Lat']].count()
aggr_df.columns = ['Lat_count']
aggr_df.head()
# Resample to calendar days (missing days sum to 0).
daily_df = aggr_df.resample('D').apply(sum)
daily_df.head(n=10)
fig = plt.figure(figsize=(30,8))
plt.plot(daily_df.index, daily_df.Lat_count)
# Weekly totals for a smoother view.
weekly_df = daily_df.resample('W').apply(sum)
fig = plt.figure(figsize=(30,8))
plt.plot(weekly_df.index, weekly_df.Lat_count)
from fbprophet import Prophet
import logging
logging.getLogger().setLevel(logging.ERROR)
# NOTE: `df` aliases daily_df, so renaming the column to 'y' mutates daily_df
# as well — prepareData() further below relies on that side effect, so the
# alias is kept deliberately.
df = daily_df
df.columns = ['y']
df.index.name = 'ds'
prediction_size = 90  # forecast horizon in days
# FIX: .copy() — the original slice was a view, and the `train_df['ds'] = ...`
# assignment below raised SettingWithCopyWarning and could silently fail.
train_df = df[:-prediction_size].copy()
train_df.tail(n=3)
train_df['ds'] = train_df.index  # Prophet requires explicit 'ds'/'y' columns
m = Prophet()
m.fit(train_df)
future = m.make_future_dataframe(periods=prediction_size)
future.tail(n=3)
forecast = m.predict(future)
forecast.tail(n=3)
# Last 200 actual days plus the forecast with its uncertainty band.
fig = plt.figure(figsize=(30, 8))
plt.plot(df.y.tail(200))
plt.plot(forecast.ds.tail(prediction_size), forecast.yhat.tail(prediction_size))
plt.plot(forecast.ds.tail(prediction_size), forecast.yhat_lower.tail(prediction_size))
plt.plot(forecast.ds.tail(prediction_size), forecast.yhat_upper.tail(prediction_size))
def make_comparison_dataframe(historical, forecast):
    """Join the history with the forecast.

    The resulting dataset will contain columns 'yhat', 'yhat_lower',
    'yhat_upper' and 'y'.
    """
    forecast_part = forecast.set_index('ds')[['yhat', 'yhat_lower', 'yhat_upper']]
    return forecast_part.join(historical)
# History and forecast side by side.
cmp_df = make_comparison_dataframe(df, forecast)
cmp_df.tail(n=10)
def calculate_forecast_errors(df, prediction_size):
    """Return {'MAE': ...} computed over the last *prediction_size* rows.

    Expects columns 'y' (actual) and 'yhat' (forecast); *df* is not mutated.
    """
    frame = df.copy()
    frame['e'] = frame['y'] - frame['yhat']
    forecast_tail = frame[-prediction_size:]
    mean_abs = np.mean(np.abs(forecast_tail['e']))
    return {'MAE': mean_abs}
# Report forecast error over the held-out horizon.
for err_name, err_value in calculate_forecast_errors(cmp_df, prediction_size).items():
    print(err_name, err_value)
data = pd.DataFrame(df.y.copy())
data.columns = ["y"]
# Adding the lag of the target variable from 6 steps back up to 44
for i in range(6, 45):
    data["lag_{}".format(i)] = data.y.shift(i)
data.tail(7)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit
# for time-series cross-validation set 5 folds
tscv = TimeSeriesSplit(n_splits=5)
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error
def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE in percent; divides by y_true, so y_true must be nonzero."""
    relative_errors = np.abs((y_true - y_pred) / y_true)
    return np.mean(relative_errors) * 100
def timeseries_train_test_split(X, y, test_size):
    """
    Perform train-test split with respect to time series structure:
    the newest *test_size* fraction of rows becomes the test set.
    """
    split_at = int(len(X) * (1 - test_size))
    return (X.iloc[:split_at], X.iloc[split_at:],
            y.iloc[:split_at], y.iloc[split_at:])
# Drop rows whose lag features are NaN, then split chronologically.
y_ = data.dropna().y
X_ = data.dropna().drop(['y'], axis=1)
# reserve 30% of data for testing
X_train, X_test, y_train, y_test = timeseries_train_test_split(X_, y_, test_size=0.3)
lr = LinearRegression()
lr.fit(X_train, y_train)
def plotModelResults(model, X_train=X_train, X_test=X_test, plot_intervals=False, plot_anomalies=False):
    """Plot model predictions against the test-set actuals.

    model          -- fitted regressor exposing .predict
    X_train/X_test -- feature matrices (defaults: module-level splits)
    plot_intervals -- draw mean CV error +/- 1.96*std bands around the prediction
    plot_anomalies -- mark actuals falling outside those bands (needs intervals)

    Relies on module-level `y_train`, `y_test` and `tscv`.
    """
    prediction = model.predict(X_test)
    plt.figure(figsize=(15, 7))
    plt.plot(prediction, "g", label="prediction", linewidth=2.0)
    plt.plot(y_test.values, label="actual", linewidth=2.0)
    if plot_intervals:
        # cross-validated MAE and its spread, estimated on the training part
        cv = cross_val_score(model, X_train, y_train,
                             cv=tscv,
                             scoring="neg_mean_absolute_error")
        mae = cv.mean() * (-1)
        deviation = cv.std()
        scale = 1.96  # ~95% band under a normal-error assumption
        lower = prediction - (mae + scale * deviation)
        upper = prediction + (mae + scale * deviation)
        plt.plot(lower, "r--", label="upper bond / lower bond", alpha=0.5)
        plt.plot(upper, "r--", alpha=0.5)
        if plot_anomalies:
            anomalies = np.array([np.NaN] * len(y_test))
            anomalies[y_test < lower] = y_test[y_test < lower]
            anomalies[y_test > upper] = y_test[y_test > upper]
            plt.plot(anomalies, "o", markersize=10, label="Anomalies")
    # BUG FIX: mean_absolute_percentage_error expects (y_true, y_pred); the
    # original passed (prediction, y_test), dividing by the prediction instead
    # of the actual values, so the reported percentage was wrong.
    error = mean_absolute_percentage_error(y_test, prediction)
    plt.title("Mean absolute percentage error {0:.2f}%".format(error))
    plt.legend(loc="best")
    plt.tight_layout()
    plt.grid(True)
def plotCoefficients(model):
    """
    Plots sorted coefficient values of the model
    (reads the module-level X_train for the feature names).
    """
    coef_table = pd.DataFrame(model.coef_, X_train.columns)
    coef_table.columns = ["coef"]
    coef_table["abs"] = coef_table.coef.apply(np.abs)
    coef_table = coef_table.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
    plt.figure(figsize=(15, 7))
    coef_table.coef.plot(kind='bar')
    plt.grid(True, axis='y')
    plt.hlines(y=0, xmin=0, xmax=len(coef_table), linestyles='dashed')
plotModelResults(lr, plot_intervals=True)
plotCoefficients(lr)
# Add calendar features derived from the datetime index.
data.index = pd.to_datetime(data.index)
data["day"] = data.index.day
data["weekday"] = data.index.weekday
data['is_weekend'] = data.weekday.isin([5,6])*1
data.tail()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
y_ = data.dropna().y
X_ = data.dropna().drop(['y'], axis=1)
X_train, X_test, y_train, y_test = timeseries_train_test_split(X_, y_, test_size=0.3)
# Scale features; fit on the training part only to avoid leakage.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
plotModelResults(lr, X_train=X_train_scaled, X_test=X_test_scaled, plot_intervals=True)
plotCoefficients(lr)
def code_mean(data, cat_feature, real_feature):
    """Return {category value: mean of *real_feature*} over *data* (target encoding)."""
    grouped_means = data.groupby(cat_feature)[real_feature].mean()
    return dict(grouped_means)
def prepareData(series, lag_start, lag_end, test_size, target_encoding=False):
    """Build lag and calendar features from *series*, then split chronologically.

    series          -- datetime-indexed frame/series with target column 'y'
    lag_start/end   -- lags y(t-lag_start) .. y(t-lag_end-1) are added
    test_size       -- fraction of the newest rows held out for testing
    target_encoding -- replace day/weekday with train-set mean-target encodings

    Depends on module-level ``timeseries_train_test_split`` and ``code_mean``.
    """
    # copy of the initial dataset
    data = pd.DataFrame(series.copy())
    # lags of series
    for lag in range(lag_start, lag_end):
        data["lag_{}".format(lag)] = data.y.shift(lag)
    # datetime features
    data["day"] = data.index.day
    data["weekday"] = data.index.weekday
    data['is_weekend'] = data.weekday.isin([5, 6]) * 1
    if target_encoding:
        # calculate averages on train set only, to avoid leakage
        train_rows = int(len(data.dropna()) * (1 - test_size))
        weekday_means = code_mean(data[:train_rows], 'weekday', "y")
        day_means = code_mean(data[:train_rows], 'day', "y")
        data['weekday_average'] = list(map(weekday_means.get, data.weekday))
        data["day_average"] = list(map(day_means.get, data.day))
        # drop encoded variables
        data.drop(["day", "weekday"], axis=1, inplace=True)
    # train-test split
    clean = data.dropna()
    return timeseries_train_test_split(clean.drop(['y'], axis=1), clean.y, test_size=test_size)
# Mean target per *weekday* (the "Hour averages" title string is misleading —
# the grouping key is weekday, not hour).
average_hour = code_mean(data, 'weekday', "y")
plt.figure(figsize=(7, 5))
plt.title("Hour averages")
pd.DataFrame.from_dict(average_hour, orient='index')[0].plot()
plt.grid(True);
# Full pipeline: lags 6..24 plus calendar features, scaled linear model with
# confidence intervals and anomaly markers.
X_train, X_test, y_train, y_test = prepareData(daily_df, lag_start=6, lag_end=25, test_size=0.3, target_encoding=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
plotModelResults(lr, X_train=X_train_scaled, X_test=X_test_scaled, plot_intervals=True, plot_anomalies=True)
plotCoefficients(lr)
##