A Summary of lecture "Analyzing Police Activity with pandas, via datacamp
import pandas as pd
import matplotlib.pyplot as plt
In this exercise, you'll examine the temperature columns from the weather dataset to assess whether the data seems trustworthy. First you'll print the summary statistics, and then you'll visualize the data using a box plot.
When deciding whether the values seem reasonable, keep in mind that the temperature is measured in degrees Fahrenheit, not Celsius!
# Read 'weather.csv' into a DataFrame named 'weather'
weather = pd.read_csv('./dataset/weather.csv')
# Describe the temperature columns
print(weather[['TMIN', 'TAVG', 'TMAX']].describe())
# Create a box plot of the temperature columns
weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box')
plt.savefig('../images/tmin-boxplot.png')
TMIN TAVG TMAX count 4017.000000 1217.000000 4017.000000 mean 43.484441 52.493016 61.268608 std 17.020298 17.830714 18.199517 min -5.000000 6.000000 15.000000 25% 30.000000 39.000000 47.000000 50% 44.000000 54.000000 62.000000 75% 58.000000 68.000000 77.000000 max 77.000000 86.000000 102.000000
In this exercise, you'll continue to assess whether the dataset seems trustworthy by plotting the difference between the maximum and minimum temperatures.
What do you notice about the resulting histogram? Does it match your expectations, or do you see anything unusual?
# Create a 'TDIFF' column that represents temperature difference
weather['TDIFF'] = weather.TMAX - weather.TMIN
# Describe the 'TDIFF' column
print(weather.TDIFF.describe())
# Create a histogram with 20 bins to visualize 'TDIFF'
weather.TDIFF.plot(kind='hist', bins=20)
count 4017.000000 mean 17.784167 std 6.350720 min 2.000000 25% 14.000000 50% 18.000000 75% 22.000000 max 43.000000 Name: TDIFF, dtype: float64
<matplotlib.axes._subplots.AxesSubplot at 0x7f454980add0>
The weather DataFrame contains 20 columns that start with 'WT', each of which represents a bad weather condition. For example:
For every row in the dataset, each WT column contains either a 1 (meaning the condition was present that day) or NaN (meaning the condition was not present).
In this exercise, you'll quantify "how bad" the weather was each day by counting the number of 1 values in each row.
# Copy 'WT01' through 'WT22' to a new DataFrame
WT = weather.loc[:, 'WT01':'WT22']
# Calculate the sum of each row in 'WT'
weather['bad_conditions'] = WT.sum(axis='columns')
# Replace missing values in 'bad_conditions' with 0
weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int')
# Create a histogram to visualist 'bad_conditions'
weather['bad_conditions'].plot(kind='hist')
<matplotlib.axes._subplots.AxesSubplot at 0x7f45455a6f50>
In the previous exercise, you counted the number of bad weather conditions each day. In this exercise, you'll use the counts to create a rating system for the weather.
The counts range from 0 to 9, and should be converted to ratings as follows:
# Count the unique values in 'bad_conditions' and sort the index
print(weather.bad_conditions.value_counts().sort_index())
# Create a dictionary that maps integers to strings
mapping = {0:'good', 1:'bad', 2:'bad', 3:'bad', 4:'bad', 5:'worse',
6:'worse', 7:'worse', 8:'worse', 9:'worse'}
# Convert the 'bad_conditions' integers to strings using the 'mapping'
weather['rating'] = weather.bad_conditions.map(mapping)
# Count the unique values in 'rating'
print(weather.rating.value_counts())
0 1749 1 613 2 367 3 380 4 476 5 282 6 101 7 41 8 4 9 4 Name: bad_conditions, dtype: int64 bad 1836 good 1749 worse 432 Name: rating, dtype: int64
Since the rating column only has a few possible values, you'll change its data type to category in order to store the data more efficiently. You'll also specify a logical order for the categories, which will be useful for future exercises.
NOTE : in pandas 1.0.3, ordered categories must be defined with pd.api.types.CategoricalDtype
# Create a list of weather ratings in logical order
cats = pd.api.types.CategoricalDtype(categories=['good', 'bad', 'worse'], ordered=True)
# cats = pd.categories=['good', 'bad', 'worse'] # for newer versions of pandas
# Change the data type of 'rating' to category
weather['rating'] = weather.rating.astype(cats)
# Examine the head of 'rating'
print(weather.rating.head())
0 bad 1 bad 2 bad 3 bad 4 bad Name: rating, dtype: category Categories (3, object): [good < bad < worse]
In this exercise, you'll prepare the traffic stop and weather rating DataFrames so that they're ready to be merged:
ri = pd.read_csv('./dataset/police.csv')
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')
ri['stop_datetime'] = pd.to_datetime(combined)
ri['is_arrested'] = ri['is_arrested'].astype(bool)
# Reset the index of 'ri'
ri.reset_index(inplace=True)
# Examine the head of 'ri'
print(ri.head())
# Create a DataFrame from the 'DATE' and 'rating' columns
weather_rating = weather[['DATE', 'rating']]
# Examine the head of 'weather_rating'
print(weather_rating.head())
index state stop_date stop_time county_name driver_gender driver_race \ 0 0 RI 2005-01-04 12:55 NaN M White 1 1 RI 2005-01-23 23:15 NaN M White 2 2 RI 2005-02-17 04:15 NaN M White 3 3 RI 2005-02-20 17:15 NaN M White 4 4 RI 2005-02-24 01:20 NaN F White violation_raw violation search_conducted search_type \ 0 Equipment/Inspection Violation Equipment False NaN 1 Speeding Speeding False NaN 2 Speeding Speeding False NaN 3 Call for Service Other False NaN 4 Speeding Speeding False NaN stop_outcome is_arrested stop_duration drugs_related_stop district \ 0 Citation False 0-15 Min False Zone X4 1 Citation False 0-15 Min False Zone K3 2 Citation False 0-15 Min False Zone X4 3 Arrest Driver True 16-30 Min False Zone X1 4 Citation False 0-15 Min False Zone X3 stop_datetime 0 2005-01-04 12:55:00 1 2005-01-23 23:15:00 2 2005-02-17 04:15:00 3 2005-02-20 17:15:00 4 2005-02-24 01:20:00 DATE rating 0 2005-01-01 bad 1 2005-01-02 bad 2 2005-01-03 bad 3 2005-01-04 bad 4 2005-01-05 bad
In this exercise, you'll merge the ri and weather_rating DataFrames into a new DataFrame, ri_weather.
The DataFrames will be joined using the stop_date column from ri and the DATE column from weather_rating. Thankfully the date formatting matches exactly, which is not always the case!
Once the merge is complete, you'll set stop_datetime as the index, which is the column you saved in the previous exercise.
# Examine the shape of 'ri'
print(ri.shape)
# Merge 'ri' and 'weather_rating' using left join
ri_weather = pd.merge(left=ri, right=weather_rating,
left_on='stop_date', right_on='DATE', how='left')
# Examine the shape of 'ri_weather'
print(ri_weather.shape)
# Set 'stop_datetime' as the index of 'ri_weather'
ri_weather.set_index('stop_datetime', inplace=True)
(91741, 17) (91741, 19)
Do police officers arrest drivers more often when the weather is bad? Find out below!
Since you previously defined a logical order for the weather categories, good < bad < worse, they will be sorted that way in the results.
print(ri_weather.is_arrested.mean())
print(ri_weather.groupby('rating').is_arrested.mean())
# Calculate the arrest rate for each 'violation' and 'rating'
print(ri_weather.groupby(['violation', 'rating']).is_arrested.mean())
0.09025408486936048 rating good 0.086842 bad 0.090479 worse 0.106527 Name: is_arrested, dtype: float64 violation rating Equipment good 0.058995 bad 0.066311 worse 0.097357 Moving violation good 0.056227 bad 0.058050 worse 0.065860 Other good 0.076923 bad 0.087443 worse 0.062893 Registration/plates good 0.081574 bad 0.098160 worse 0.115625 Seat belt good 0.028587 bad 0.022493 worse 0.000000 Speeding good 0.013404 bad 0.013314 worse 0.016886 Name: is_arrested, dtype: float64
The output of a single .groupby() operation on multiple columns is a Series with a MultiIndex. Working with this type of object is similar to working with a DataFrame:
In this exercise, you'll practice accessing data from a multi-indexed Series using the .loc[] accessor.
# Save the output of the groupby operation from the last exercise
arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean()
# Print the 'arrest_rate' Series
print(arrest_rate)
# Print the arrest rate for moving violations in bad weather
print(arrest_rate.loc['Moving violation', 'bad'])
# Print the arrest rates for speeding violations in all three weather condtions
print(arrest_rate.loc['Speeding'])
violation rating Equipment good 0.058995 bad 0.066311 worse 0.097357 Moving violation good 0.056227 bad 0.058050 worse 0.065860 Other good 0.076923 bad 0.087443 worse 0.062893 Registration/plates good 0.081574 bad 0.098160 worse 0.115625 Seat belt good 0.028587 bad 0.022493 worse 0.000000 Speeding good 0.013404 bad 0.013314 worse 0.016886 Name: is_arrested, dtype: float64 0.05804964058049641 rating good 0.013404 bad 0.013314 worse 0.016886 Name: is_arrested, dtype: float64
In this exercise, you'll start by reshaping the arrest_rate Series into a DataFrame. This is a useful step when working with any multi-indexed Series, since it enables you to access the full range of DataFrame methods.
Then, you'll create the exact same DataFrame using a pivot table. This is a great example of how pandas often gives you more than one way to reach the same result!
# Unstack the 'arrest_rate' Series into a DataFrame
print(arrest_rate.unstack())
# Create the same DataFrame using a pivot table
print(ri_weather.pivot_table(index='violation', columns='rating', values='is_arrested'))
rating good bad worse violation Equipment 0.058995 0.066311 0.097357 Moving violation 0.056227 0.058050 0.065860 Other 0.076923 0.087443 0.062893 Registration/plates 0.081574 0.098160 0.115625 Seat belt 0.028587 0.022493 0.000000 Speeding 0.013404 0.013314 0.016886 rating good bad worse violation Equipment 0.058995 0.066311 0.097357 Moving violation 0.056227 0.058050 0.065860 Other 0.076923 0.087443 0.062893 Registration/plates 0.081574 0.098160 0.115625 Seat belt 0.028587 0.022493 0.000000 Speeding 0.013404 0.013314 0.016886