Notebook
"crime.csv" contains an array of data on crimes in the city of Boston for the years 2015-2018. The data array contains the following data: 'INCIDENT_NUMBER' - incident number 'OFFENSE_CODE' - offense code 'OFFENSE_CODE_GROUP' - offense group code 'OFFENSE_DESCRIPTION' - brief description of the incident 'DISTRICT' - district 'REPORTING_AREA' - reporting site 'SHOOTING' - whether weapons were used 'OCCURRED_ON_DATE' - date of offense 'YEAR' - year 'MONTH' - month 'DAY_OF_WEEK' 'HOUR' 'UCR_PART' - Uniform Crime Reports 'STREET' 'Lat' - latitude 'Long' - longitude 'Location' - coordinate The target variables that we will predict are the crime scene, i.e. coordinates - longitude and latitude. Also try to predict the number of crimes per day. The task is relevant due to the fact that it will help prevent crimes and reduce the number of victims.
"Boston weather_clean.csv" contains an array of weather conditions that were in the city of Boston and coincide in time with the main file "crime.csv". 'Year' - year 'Month' - month 'Day' - day 'High Temp (F)' - Maximum temperature 'Avg Temp (F)' - average 'Low Temp (F)' - minimal 'High Dew Point (F)' - maximum dew point 'Avg Dew Point (F)' - average 'Low Dew Point (F)' - min 'High Humidity (%)' - Humidity 'Avg Humidity (%)' 'Low Humidity (%)' 'High Sea Level Press (in)' 'Avg Sea Level Press (in)' 'Low Sea Level Press (in)' 'High Visibility (mi)' - visibility 'Avg Visibility (mi)' 'Low Visibility (mi)' 'High Wind (mph)' 'Avg Wind (mph)' 'High Wind Gust (mph)' 'Snowfall (in)' snow cover 'Precip (in)' - precipitation 'Events' - Events The weather information, I hope, will help build a more accurate model for the target variable.
Now check for missing values. None remain.
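A missing-value check of this kind can be sketched as follows. The small DataFrame stands in for "crime.csv", and the treatment (dropping rows without coordinates, filling a missing district with the label "UNKNOWN") is an assumption made for illustration, not necessarily what was done here.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for crime.csv (the real file would be read with pd.read_csv)
crime = pd.DataFrame({
    "OFFENSE_CODE_GROUP": ["Larceny", "Vandalism", "Larceny"],
    "DISTRICT": ["D4", "B2", None],
    "Lat": [42.35, np.nan, 42.33],
})

# Number of missing values in each column
missing_before = crime.isnull().sum()

# Assumed treatment: drop rows without a latitude,
# then fill a missing district with a placeholder label
crime = crime.dropna(subset=["Lat"]).fillna({"DISTRICT": "UNKNOWN"})
missing_after = crime.isnull().sum()

print(missing_before)
print(missing_after)
```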
Let's examine the data with summary tables, starting with the most frequent crimes.
Next, the coordinates. Some records are filled in incorrectly: the value -1 appears in place of a real coordinate.
Relationship between street and number of crimes
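A placeholder such as -1 is easy to spot in a summary table. The sketch below uses made-up coordinates; only the column names 'Lat' and 'Long' come from the dataset.

```python
import pandas as pd

# Made-up coordinates with one placeholder record
coords = pd.DataFrame({
    "Lat": [42.35, -1.0, 42.33, 42.36],
    "Long": [-71.06, -1.0, -71.08, -71.05],
})

# The minimum of Lat and the maximum of Long both reveal the -1 placeholder
summary = coords.describe()

# Count the records where both coordinates hold the placeholder
n_placeholder = int(((coords["Lat"] == -1) & (coords["Long"] == -1)).sum())

print(summary.loc[["min", "max"]])
print(n_placeholder)
```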
Distribution by district
Remove outliers in the coordinates
The graph shows the centroids of the crimes by type. It can be seen that different crimes are most often committed in different parts of the city.
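One simple way to do this is to keep only the points inside a bounding box around the city. The box limits below are rough values chosen for illustration (an assumption), and the data is made up.

```python
import pandas as pd

crime = pd.DataFrame({
    "Lat": [42.35, -1.0, 42.33, 42.36],
    "Long": [-71.06, -1.0, -71.08, -71.05],
})

# Rough bounding box around Boston; the exact limits are an assumption
LAT_MIN, LAT_MAX = 42.2, 42.5
LONG_MIN, LONG_MAX = -71.3, -70.9

# Boolean mask: True for rows whose coordinates fall inside the box
in_box = (
    crime["Lat"].between(LAT_MIN, LAT_MAX)
    & crime["Long"].between(LONG_MIN, LONG_MAX)
)
crime_clean = crime[in_box].reset_index(drop=True)

print(len(crime), "->", len(crime_clean))
```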
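The per-type centers can be computed as the mean latitude and longitude of each offense group. This is a minimal sketch on made-up points; the column names follow the dataset description.

```python
import pandas as pd

crime = pd.DataFrame({
    "OFFENSE_CODE_GROUP": ["Larceny", "Larceny", "Vandalism", "Vandalism"],
    "Lat": [42.30, 42.34, 42.36, 42.38],
    "Long": [-71.10, -71.06, -71.04, -71.02],
})

# Mean coordinate of every offense group (its "center of gravity")
centroids = crime.groupby("OFFENSE_CODE_GROUP")[["Lat", "Long"]].mean()

print(centroids)
```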
Distribution of crimes by coordinates
Now merge in the weather data.
Number of crimes and temperature
It can be seen that there is no correlation between the weather and the type of crime. The weather variables are correlated only with one another.
Now let's plot a few crime locations on the map.
Given the lack of correlation and the almost uniform distribution of the number of crimes, the chance of predicting the crime location is small.
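Because the weather file is keyed by year, month and day, the join needs a day column on the crime side. The sketch below derives it from 'OCCURRED_ON_DATE'; the helper column name 'DAY' is hypothetical and the rows are made up.

```python
import pandas as pd

crime = pd.DataFrame({
    "OCCURRED_ON_DATE": [
        "2017-01-01 10:00:00",
        "2017-01-01 23:30:00",
        "2017-01-02 08:15:00",
    ],
    "YEAR": [2017, 2017, 2017],
    "MONTH": [1, 1, 1],
})
weather = pd.DataFrame({
    "Year": [2017, 2017],
    "Month": [1, 1],
    "Day": [1, 2],
    "Avg Temp (F)": [35.0, 28.0],
})

# Day of month taken from the timestamp ('DAY' is a hypothetical helper column)
crime["DAY"] = pd.to_datetime(crime["OCCURRED_ON_DATE"]).dt.day

# Left join: every crime keeps its row and receives that day's weather
merged = crime.merge(
    weather,
    left_on=["YEAR", "MONTH", "DAY"],
    right_on=["Year", "Month", "Day"],
    how="left",
)

print(merged[["OCCURRED_ON_DATE", "Avg Temp (F)"]])
```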
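A correlation matrix is one way to arrive at this kind of observation. The numbers below are invented so that the two temperature columns move together while the crime column does not; 'crime_count' is a hypothetical column name.

```python
import pandas as pd

# Invented daily values: temperatures move together, crime counts do not follow them
daily = pd.DataFrame({
    "High Temp (F)": [40, 55, 70, 85, 60],
    "Avg Temp (F)": [33, 48, 62, 77, 52],
    "crime_count": [260, 250, 262, 255, 258],
})

# Pearson correlation between every pair of columns
corr = daily.corr()

print(corr.round(2))
```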
As the metric we will use MAE. It is less sensitive to outliers than squared-error metrics, and since our prediction area is small, an error measure oriented toward the median suits it better.
Let's try several regression methods, starting with simple linear ones and ending with a library specialized for time series analysis.
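The link between MAE and the median can be checked on a toy example: for a constant prediction, the median gives a lower MAE than the mean when the data contains an outlier. The numbers are made up.

```python
import numpy as np

# Toy target with a single outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

def mae(y_true, pred):
    """Mean absolute error of a prediction (scalar or array)."""
    return float(np.mean(np.abs(y_true - pred)))

mae_median = mae(y, np.median(y))  # constant prediction = median (3.0)
mae_mean = mae(y, np.mean(y))      # constant prediction = mean (22.0)

print(mae_median, mae_mean)
```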
Since we predict coordinates, we have to remove from the data everything that indirectly encodes location: street, district.
For linear models, we normalize the numerical features.
The error in the predicted coordinates is very large.
All predictions cluster around the median. Let's look at the important features.
The features that mattered most for the linear model were time and average temperature. But the prediction accuracy is very low.
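The two preparation steps above, followed by a linear fit, can be sketched with scikit-learn. The data is random and the feature set is an assumption; the actual features and model settings used here are not shown in the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data; only the column names come from the dataset
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "HOUR": rng.integers(0, 24, n),
    "MONTH": rng.integers(1, 13, n),
    "Avg Temp (F)": rng.normal(55, 15, n),
    "STREET": ["WASHINGTON ST"] * n,
    "DISTRICT": ["D4"] * n,
    "Lat": rng.normal(42.33, 0.03, n),
    "Long": rng.normal(-71.08, 0.03, n),
})

targets = ["Lat", "Long"]
spatial = ["STREET", "DISTRICT"]  # columns that leak the location

# Drop the targets and the spatial columns from the feature matrix
X = data.drop(columns=targets + spatial)
y = data[targets]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize the numerical features, then fit a linear model on both targets
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
pred = model.predict(X_test)

# MAE averaged over latitude and longitude
mae = mean_absolute_error(y_test, pred)
print(mae)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the normalization.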
A similar result. The trees showed that the most important features are time and sea-level pressure. The prediction accuracy is low, worse than that of the linear model.
Another linear model, with regularization, showed the worst result.
The tree ensemble seems to give a better result, but the accuracy is still very poor. Predicting where a crime will happen is not possible with these features.
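Feature importances of a tree model can be read as shown below. The text does not say which tree model was used, so RandomForestRegressor is an assumption, and the target is synthetic: it depends only on 'HOUR' so that the ranking is easy to verify.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "HOUR": rng.integers(0, 24, n),
    "Avg Temp (F)": rng.normal(55, 15, n),
})
# Synthetic latitude that depends only on HOUR, plus a little noise
y = 42.30 + 0.002 * X["HOUR"] + rng.normal(0, 0.001, n)

# The choice of a random forest is an assumption made for this sketch
forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```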
An analysis of the predictions and of the loss on a hold-out sample tells us that it is not possible to build one general model for predicting all types of crimes. Now we will try to predict only one type of crime using time series methods.
The time series prediction did not lead to a good result either. The average predicted value differs greatly from the actual one.
Finally, we will try to set up the analysis the way it is done for recurrent neural networks: we will use the past values of the series as inputs.
As a result, we can say that predicting the crime location from the day's weather and the description of the crime is impossible. The predicted coordinates tend toward the average value, and the number of crimes is determined with a large error. Probably more data is needed from other areas, such as demography and economics.
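To treat one type of crime as a time series, the incidents first have to be turned into daily counts. The sketch below does this on made-up records; "Larceny" is just an example group, not necessarily the one analyzed here.

```python
import pandas as pd

crime = pd.DataFrame({
    "OFFENSE_CODE_GROUP": ["Larceny", "Larceny", "Vandalism", "Larceny"],
    "OCCURRED_ON_DATE": pd.to_datetime([
        "2017-01-01 10:00",
        "2017-01-01 22:00",
        "2017-01-02 09:00",
        "2017-01-03 12:00",
    ]),
})

# Keep a single offense group (the group chosen here is only an example)
one_type = crime[crime["OFFENSE_CODE_GROUP"] == "Larceny"]

# Count incidents per calendar day; days without incidents get 0
daily_counts = one_type.set_index("OCCURRED_ON_DATE").resample("D").size()

print(daily_counts)
```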
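Using past values as inputs amounts to building lag features. This is a minimal sketch on a made-up series of daily counts; the number of lags and the column names 'lag_1', 'lag_2' are hypothetical choices.

```python
import pandas as pd

# Made-up daily crime counts
counts = pd.Series([5, 7, 6, 9, 8, 10], name="count")

n_lags = 2  # how many past days to use as features (an arbitrary choice)

lagged = pd.DataFrame({"y": counts})
for k in range(1, n_lags + 1):
    # lag_k holds the value observed k days earlier
    lagged[f"lag_{k}"] = counts.shift(k)

# The first n_lags rows have no complete history, so drop them
lagged = lagged.dropna()

print(lagged)
```

Each remaining row pairs the current value 'y' with the values from the previous days, and any regression model can then be trained on this table.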