A summary of the lecture "Analyzing Police Activity with pandas", via DataCamp
# Import the pandas library as pd
import pandas as pd
# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv('./dataset/police.csv')
When a police officer stops a driver, a small percentage of those stops ends in an arrest. This is known as the arrest rate. In this exercise, you'll find out whether the arrest rate varies by time of day.
First, you'll calculate the arrest rate across all stops in the ri DataFrame. Then, you'll calculate the hourly arrest rate by using the hour attribute of the index. The hour ranges from 0 (midnight) to 23 (11 PM).
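# Combine the date and time strings, separated by a space
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')
# Parse the combined strings as datetimes
ri['stop_datetime'] = pd.to_datetime(combined)
# Store 'is_arrested' as a Boolean column so that its mean is a rate
ri['is_arrested'] = ri['is_arrested'].astype(bool)
# Use 'stop_datetime' as the index to enable time-based operations
ri.set_index('stop_datetime', inplace=True)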
# Calculate the overall arrest rate
print(ri.is_arrested.mean())
# Calculate the hourly arrest rate
print(ri.groupby(ri.index.hour).is_arrested.mean())
# Save the hourly arrest rate
hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean()
0.09025408486936048
stop_datetime
0     0.121206
1     0.144250
2     0.144120
3     0.148370
4     0.179310
5     0.178899
6     0.043614
7     0.053497
8     0.073591
9     0.070199
10    0.069306
11    0.075217
12    0.087040
13    0.078964
14    0.080171
15    0.080526
16    0.089505
17    0.107914
18    0.089883
19    0.078508
20    0.091482
21    0.153265
22    0.110715
23    0.108225
Name: is_arrested, dtype: float64
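To see why the mean of a Boolean column is a rate, and how the hour attribute of a datetime index groups the rows, here is a minimal sketch on made-up data. The `toy` DataFrame and its five stops are invented for illustration and are not taken from police.csv.

```python
import pandas as pd

# Hypothetical mini-dataset: five stops, two of which ended in an arrest
toy = pd.DataFrame(
    {'is_arrested': [True, False, False, True, False]},
    index=pd.to_datetime([
        '2015-01-01 00:10', '2015-01-01 00:45',
        '2015-01-02 13:05', '2015-01-03 13:30', '2015-01-04 13:55',
    ]),
)

# True counts as 1 and False as 0, so the mean is the share of arrests
overall_rate = toy.is_arrested.mean()

# index.hour extracts the hour (0-23) of each timestamp for grouping
toy_hourly_rate = toy.groupby(toy.index.hour).is_arrested.mean()

print(overall_rate)
print(toy_hourly_rate)
```

Stops on different days fall into the same group whenever they share an hour, which is what makes the result a time-of-day profile rather than a timeline.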
In this exercise, you'll create a line plot from the hourly_arrest_rate object. A line plot is appropriate in this case because you're showing how a quantity changes over time.
This plot should help you to spot some trends that may not have been obvious when examining the raw numbers!
import matplotlib.pyplot as plt
# Create a line plot of 'hourly_arrest_rate'
hourly_arrest_rate.plot()
# Add the xlabel, ylabel, and title
plt.xlabel('Hour')
plt.ylabel('Arrest Rate')
plt.title('Arrest Rate by Time of Day')
In a small portion of traffic stops, drugs are found in the vehicle during a search. In this exercise, you'll assess whether these drug-related stops are becoming more common over time.
The Boolean column drugs_related_stop indicates whether drugs were found during a given stop. You'll calculate the annual drug rate by resampling this column, and then you'll use a line plot to visualize how the rate has changed over time.
# Calculate the annual rate of drug-related stops
# ('A' is the year-end frequency alias; pandas 2.2+ deprecates it in favor of 'YE')
print(ri.drugs_related_stop.resample('A').mean())
# Save the annual rate of drug-related stops
annual_drug_rate = ri.drugs_related_stop.resample('A').mean()
# Create a line plot of 'annual_drug_rate'
annual_drug_rate.plot()
stop_datetime
2005-12-31    0.006390
2006-12-31    0.006913
2007-12-31    0.007520
2008-12-31    0.006998
2009-12-31    0.009079
2010-12-31    0.009407
2011-12-31    0.009035
2012-12-31    0.009388
2013-12-31    0.012283
2014-12-31    0.013280
2015-12-31    0.011787
Freq: A-DEC, Name: drugs_related_stop, dtype: float64
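Resampling can be thought of as a groupby over regular time bins. The sketch below, on invented data, compares an annual resample with an explicit groupby on the calendar year. It uses the 'YS' (year start) alias rather than the 'A' (year end) alias from the exercise, because 'A' is deprecated in recent pandas releases. Only the bin labels differ, not the rates.

```python
import pandas as pd

# Hypothetical data: six stops across two years, True when drugs were found
toy = pd.Series(
    [False, True, False, False, False, True],
    index=pd.to_datetime([
        '2014-02-01', '2014-06-15', '2014-11-30', '2014-12-31',
        '2015-03-10', '2015-09-20',
    ]),
    name='drugs_related_stop',
)

# Resample into calendar-year bins; 'YS' labels each bin by its first day
toy_annual_rate = toy.resample('YS').mean()

# The same rates via an explicit groupby on the calendar year
by_year = toy.groupby(toy.index.year).mean()

print(toy_annual_rate)
print(by_year)
```

The resampled result keeps a datetime index, which is why plotting it gives a properly spaced time axis.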
As you saw in the last exercise, the rate of drug-related stops increased significantly between 2005 and 2015. You might hypothesize that the rate of vehicle searches was also increasing, which would have led to an increase in drug-related stops even if more drivers were not carrying drugs.
You can test this hypothesis by calculating the annual search rate, and then plotting it against the annual drug rate. If the hypothesis is true, then you'll see both rates increasing over time.
# Calculate and save the annual search rate
annual_search_rate = ri.search_conducted.resample('A').mean()
# Concatenate 'annual_drug_rate' and 'annual_search_rate'
annual = pd.concat([annual_drug_rate, annual_search_rate], axis='columns')
# Create subplots from 'annual'
annual.plot(subplots=True)
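As a rough illustration of what the concatenation step produces, the sketch below joins two small Series column-wise. The values and the `toy_` names are hypothetical, chosen only to mirror the shape of the real annual rates.

```python
import pandas as pd

# Hypothetical annual rates, indexed by year-end dates
years = pd.to_datetime(['2013-12-31', '2014-12-31', '2015-12-31'])
toy_drug_rate = pd.Series([0.012, 0.013, 0.012], index=years,
                          name='drugs_related_stop')
toy_search_rate = pd.Series([0.030, 0.031, 0.029], index=years,
                            name='search_conducted')

# axis='columns' places the Series side by side, aligned on the shared index;
# each Series name becomes a column label
toy_annual = pd.concat([toy_drug_rate, toy_search_rate], axis='columns')

print(toy_annual)
```

Passing a frame like `toy_annual` to `.plot(subplots=True)` draws one panel per column, each with its own y-axis, so two rates on different scales stay readable.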
The state of Rhode Island is broken into six police districts, also known as zones. How do the zones compare in terms of what violations are caught by police?
In this exercise, you'll create a frequency table to determine how many violations of each type took place in each of the six zones. Then, you'll filter the table to focus on the "K" zones, which you'll examine further in the next exercise.
# Create a frequency table of districts and violations
print(pd.crosstab(ri.district, ri.violation))
# Save the frequency table as 'all_zones'
all_zones = pd.crosstab(ri.district, ri.violation)
# Select rows 'Zone K1' through 'Zone K3'
print(all_zones.loc['Zone K1':'Zone K3'])
# Save the smaller table as 'k_zones'
k_zones = all_zones.loc['Zone K1':'Zone K3']
violation  Equipment  Moving violation  Other  Registration/plates  Seat belt  Speeding
district
Zone K1          673              1254    290                  120          0      5960
Zone K2         2061              2962    942                  768        481     10448
Zone K3         2302              2898    706                  695        638     12323
Zone X1          296               671    143                   38         74      1119
Zone X3         2049              3086    769                  671        820      8779
Zone X4         3541              5353   1560                 1411        843      9795

violation  Equipment  Moving violation  Other  Registration/plates  Seat belt  Speeding
district
Zone K1          673              1254    290                  120          0      5960
Zone K2         2061              2962    942                  768        481     10448
Zone K3         2302              2898    706                  695        638     12323
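The two steps above, tallying pairs with a frequency table and then slicing rows by label, can be sketched on a handful of invented stops. The `toy` data is hypothetical. The point to notice is that label slicing with .loc includes both endpoints, unlike positional slicing.

```python
import pandas as pd

# Hypothetical stops: one row per stop, with its district and violation
toy = pd.DataFrame({
    'district': ['Zone K1', 'Zone K1', 'Zone K2',
                 'Zone K3', 'Zone X1', 'Zone K2'],
    'violation': ['Speeding', 'Equipment', 'Speeding',
                  'Speeding', 'Other', 'Speeding'],
})

# Count how often each (district, violation) pair occurs;
# pairs that never occur are filled with 0
toy_all_zones = pd.crosstab(toy.district, toy.violation)

# Label-based slicing with .loc includes both 'Zone K1' and 'Zone K3'
toy_k_zones = toy_all_zones.loc['Zone K1':'Zone K3']

print(toy_all_zones)
print(toy_k_zones)
```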
Now that you've created a frequency table focused on the "K" zones, you'll visualize the data to help you compare what violations are being caught in each zone.
First you'll create a bar plot, which is an appropriate plot type since you're comparing categorical data. Then you'll create a stacked bar plot in order to get a slightly different look at the data. Which plot do you find to be more insightful?
# Create a bar plot of 'k_zones'
k_zones.plot(kind='bar')
plt.savefig('../images/k-zones-plot.png')
# Create a stacked bar plot of 'k_zones'
k_zones.plot(kind='bar', stacked=True)
In the traffic stops dataset, the stop_duration column tells you approximately how long the driver was detained by the officer. Unfortunately, the durations are stored as strings, such as '0-15 Min'. How can you make this data easier to analyze?
In this exercise, you'll convert the stop durations to integers. Because the precise durations are not available, you'll have to estimate the numbers using reasonable values: 8 for '0-15 Min', 23 for '16-30 Min', and 45 for '30+ Min'.
# Print the unique values in 'stop_duration'
print(ri.stop_duration.unique())
# Create a dictionary that maps strings to integers
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}
# Convert the 'stop_duration' strings to integers using the 'mapping'
ri['stop_minutes'] = ri.stop_duration.map(mapping)
# Print the unique values in 'stop_minutes'
print(ri.stop_minutes.unique())
['0-15 Min' '16-30 Min' nan '30+ Min']
[ 8. 23. nan 45.]
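Notice that the mapped values print as floats (8., 23., 45.) even though the dictionary holds integers. The sketch below, using a small invented Series, shows the likely reason: entries with no matching key, such as missing values, become NaN, and NaN forces a float dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical durations, including one missing value
toy_duration = pd.Series(['0-15 Min', '30+ Min', np.nan, '16-30 Min'])

# Same mapping as in the exercise
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}

# Values without a key in the dictionary (here NaN) map to NaN,
# so the result is stored as float64 rather than int64
toy_minutes = toy_duration.map(mapping)

print(toy_minutes)
print(toy_minutes.dtype)
```

The float dtype is harmless for the analysis that follows, since aggregations such as the mean skip NaN by default.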
If you were stopped for a particular violation, how long might you expect to be detained?
In this exercise, you'll visualize the average length of time drivers are stopped for each type of violation. Rather than using the violation column in this exercise, you'll use violation_raw since it contains more detailed descriptions of the violations.
# Calculate the mean 'stop_minutes' for each value in 'violation_raw'
print(ri.groupby('violation_raw').stop_minutes.mean())
# Save the resulting Series as 'stop_length'
stop_length = ri.groupby('violation_raw').stop_minutes.mean()
# Sort 'stop_length' by its values and create a horizontal bar plot
stop_length.sort_values().plot(kind='barh')
violation_raw
APB                                 17.967033
Call for Service                    22.140805
Equipment/Inspection Violation      11.445340
Motorist Assist/Courtesy            17.741463
Other Traffic Violation             13.844490
Registration Violation              13.736970
Seatbelt Violation                   9.662815
Special Detail/Directed Patrol      15.123632
Speeding                            10.581509
Suspicious Person                   14.910714
Violation of City/Town Ordinance    13.254144
Warrant                             24.055556
Name: stop_minutes, dtype: float64
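For a compact view of the group-then-sort pattern used here, the following sketch computes mean durations on a few invented rows. The `toy` data is hypothetical. Sorting in ascending order matters for the plot because a horizontal bar chart draws the first entry at the bottom, which puts the longest bar on top.

```python
import pandas as pd

# Hypothetical stops with estimated durations in minutes
toy = pd.DataFrame({
    'violation_raw': ['Speeding', 'Warrant', 'Speeding',
                      'Warrant', 'Seatbelt Violation'],
    'stop_minutes': [8, 45, 23, 23, 8],
})

# Mean duration per violation, sorted from shortest to longest
toy_stop_length = (
    toy.groupby('violation_raw').stop_minutes.mean().sort_values()
)

print(toy_stop_length)
```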