This workbook reproduces the initial data analysis tasks, following ProPublica's methodology description and code.
We will do our analysis with Pandas.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
Create a folder called "data" under the current folder, and download the COMPAS scores dataset. The dataset is archived and is not expected to change. Therefore, you can comment out this block if you already downloaded the data.
# Creates a folder "data" under the current folder
!mkdir -p data
# Removes any prior file if it exists
!rm -f data/compas-scores-two-years.csv*
# Fetches the most recent dataset and stores it under the folder data
!curl 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv' -o data/compas-scores-two-years.csv
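Since the dataset is archived, an alternative to commenting the block out is to skip the download whenever the file is already present. The following is a minimal sketch of that idea using only the standard library; `fetch_if_missing` is a hypothetical helper name, not part of ProPublica's code, and unlike the shell commands above it never deletes an existing file.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        # Equivalent of `mkdir -p` for the target folder
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest
```

In this notebook it could be called with the URL from the `curl` command above and the path `'data/compas-scores-two-years.csv'`.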
Inspect the file. Read the description of the data collection methodology. Salient points are highlighted below, with my annotations in bold font. See full description from ProPublica for additional details.
Goal: We looked at more than 10,000 criminal defendants in Broward County, Florida, and compared their predicted recidivism rates with the rate that actually occurred over a two-year period.
COMPAS tool input (data subjects): When most defendants are booked in jail, they respond to a COMPAS questionnaire. Their answers are fed into the COMPAS software to generate several scores including predictions of Risk of Recidivism and Risk of Violent Recidivism.
How COMPAS input was acquired by ProPublica: Through a public records request, ProPublica obtained two years worth of COMPAS scores from the Broward County Sheriff’s Office in Florida. We received data for all 18,610 people who were scored in 2013 and 2014.
COMPAS tool output: Each pretrial defendant received at least three COMPAS scores: “Risk of Recidivism,” “Risk of Violence” and “Risk of Failure to Appear.” ... COMPAS scores for each defendant ranged from 1 to 10, with ten being the highest risk. Scores 1 to 4 were labeled by COMPAS as “Low”; 5 to 7 were labeled “Medium”; and 8 to 10 were labeled “High.”
Data integration (record linkage) to match COMPAS input and output with an individual's criminal history: Starting with the database of COMPAS scores, we built a profile of each person’s criminal history, both before and after they were scored. We collected public criminal records from the Broward County Clerk’s Office website through April 1, 2016. On average, defendants in our dataset were not incarcerated for 622.87 days (sd: 329.19).
Data integration (record linkage) details: We matched the criminal records to the COMPAS records using a person’s first and last names and date of birth. This is the same technique used in the Broward County COMPAS validation study conducted by researchers at Florida State University in 2010. We downloaded around 80,000 criminal records from the Broward County Clerk’s Office website.
What is recidivism?: Northpointe defined recidivism as “a finger-printable arrest involving a charge and a filing for any uniform crime reporting (UCR) code.” We interpreted that to mean a criminal offense that resulted in a jail booking and took place after the crime for which the person was COMPAS scored. ... For most of our analysis, we defined recidivism as a new arrest within two years.
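The record linkage step described above (matching on first name, last name, and date of birth) can be sketched with a pandas merge. This is only an illustration on made-up rows: the two toy tables, the `normalize_keys` helper, and the decision to trim and lower-case names before matching are assumptions, since the methodology text does not say how names were normalized.

```python
import pandas as pd

# Hypothetical toy records; names and dates are invented for illustration.
scores = pd.DataFrame({
    "first": ["Ann", "Bob", "Cy"],
    "last": ["Lee", "Ray", "Fox"],
    "dob": ["1980-01-01", "1990-05-05", "1975-03-03"],
    "decile_score": [3, 8, 5],
})
records = pd.DataFrame({
    "first": ["ann", "Bob ", "Dee"],
    "last": ["LEE", "Ray", "Kim"],
    "dob": ["1980-01-01", "1990-05-05", "1960-12-12"],
    "charge": ["theft", "battery", "fraud"],
})


def normalize_keys(frame):
    """Return a copy with trimmed, lower-cased names and parsed birth dates."""
    out = frame.copy()
    for col in ("first", "last"):
        out[col] = out[col].str.strip().str.lower()
    out["dob"] = pd.to_datetime(out["dob"])
    return out


linked = normalize_keys(scores).merge(
    normalize_keys(records),
    on=["first", "last", "dob"],
    how="left",       # keep every scored person, matched or not
    indicator=True,   # adds a _merge column: "both" or "left_only"
)
print(linked[["first", "last", "decile_score", "charge", "_merge"]])
```

Rows marked `left_only` are scored people with no matching criminal record. Matching on names and birth dates alone can produce both missed and false matches, which is one reason to profile the linked data before using it.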
csv_file_vr = 'data/compas-scores-two-years.csv'
df = pd.read_csv(csv_file_vr)
df.head()
| | id | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | ... | v_decile_score | v_score_text | v_screening_date | in_custody | out_custody | priors_count.1 | start | end | event | two_year_recid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | miguel hernandez | miguel | hernandez | 2013-08-14 | Male | 1947-04-18 | 69 | Greater than 45 | Other | ... | 1 | Low | 2013-08-14 | 2014-07-07 | 2014-07-14 | 0 | 0 | 327 | 0 | 0 |
1 | 3 | kevon dixon | kevon | dixon | 2013-01-27 | Male | 1982-01-22 | 34 | 25 - 45 | African-American | ... | 1 | Low | 2013-01-27 | 2013-01-26 | 2013-02-05 | 0 | 9 | 159 | 1 | 1 |
2 | 4 | ed philo | ed | philo | 2013-04-14 | Male | 1991-05-14 | 24 | Less than 25 | African-American | ... | 3 | Low | 2013-04-14 | 2013-06-16 | 2013-06-16 | 4 | 0 | 63 | 0 | 1 |
3 | 5 | marcu brown | marcu | brown | 2013-01-13 | Male | 1993-01-21 | 23 | Less than 25 | African-American | ... | 6 | Medium | 2013-01-13 | NaN | NaN | 1 | 0 | 1174 | 0 | 0 |
4 | 6 | bouthy pierrelouis | bouthy | pierrelouis | 2013-03-26 | Male | 1973-01-22 | 43 | 25 - 45 | Other | ... | 1 | Low | 2013-03-26 | NaN | NaN | 2 | 0 | 1102 | 0 | 0 |
5 rows × 53 columns
# projection: keep a subset of the columns in the original dataset
df1 = df[['id','sex','age', 'race','decile_score','score_text','v_decile_score','v_score_text',
'two_year_recid','priors_count']]
df1.head()
| | id | sex | age | race | decile_score | score_text | v_decile_score | v_score_text | two_year_recid | priors_count |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Male | 69 | Other | 1 | Low | 1 | Low | 0 | 0 |
1 | 3 | Male | 34 | African-American | 3 | Low | 1 | Low | 1 | 0 |
2 | 4 | Male | 24 | African-American | 4 | Low | 3 | Low | 1 | 4 |
3 | 5 | Male | 23 | African-American | 8 | High | 6 | Medium | 0 | 1 |
4 | 6 | Male | 43 | Other | 1 | Low | 1 | Low | 0 | 2 |
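The projected columns include both the numeric `decile_score` and its label `score_text`. According to the methodology quoted earlier, scores 1 to 4 are labeled Low, 5 to 7 Medium, and 8 to 10 High. A minimal sketch of that mapping with `pd.cut` follows; the bin edges are taken from the quoted description and the variable names are illustrative.

```python
import pandas as pd

deciles = pd.Series(range(1, 11), name="decile_score")

# Bins are right-inclusive: (0, 4] -> Low, (4, 7] -> Medium, (7, 10] -> High
bands = pd.cut(deciles, bins=[0, 4, 7, 10], labels=["Low", "Medium", "High"])
print(pd.DataFrame({"decile_score": deciles, "band": bands}))
```

Applying the same `pd.cut` call to `df1["decile_score"]` and comparing the result with `df1["score_text"]` is one way to check that the labels in the file follow the documented bands.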
Understand the statistical properties of the dataset.
Here, we inspect the basic properties of the dataset: the breakdown by age, gender, and race.
df1["age"].hist()
[Figure: histogram of age, all rows (df1)]
df1["race"].value_counts().plot(kind='bar')
[Figure: bar chart of defendant counts by race, all rows (df1)]
df1["sex"].value_counts().plot(kind='bar')
[Figure: bar chart of defendant counts by sex, all rows (df1)]
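Bar charts show the shape of each breakdown, but the exact counts and shares are easier to compare as a table. The sketch below, run on a small made-up frame standing in for `df1`, reports both for one categorical column; `profile_categorical` is a hypothetical helper name.

```python
import pandas as pd


def profile_categorical(frame, column):
    """Return counts and percentage shares for one categorical column."""
    counts = frame[column].value_counts()
    shares = (frame[column].value_counts(normalize=True) * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": shares})


# Hypothetical toy data standing in for df1.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Male"],
    "race": ["Other", "African-American", "Caucasian", "African-American"],
})
print(profile_categorical(toy, "sex"))
print(profile_categorical(toy, "race"))
```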
"However not all of the rows are useable for the first round of analysis.
There are a number of reasons [to] remove rows because of missing data:
- If the charge date of a defendants Compas scored crime was not within 30 days from when the person was arrested, we assume that because of data quality reasons, that we do not have the right offense.
- We coded the recidivist flag -- is_recid -- to be -1 if we could not find a compas case at all.
- In a similar vein, ordinary traffic offenses -- those with a c_charge_degree of 'O' -- will not result in Jail time are removed (only two of them).
- We filtered the underlying data from Broward county to include only those rows representing people who had either recidivated in two years, or had at least two years outside of a correctional facility."
# clean the data per ProPublica's methodology
df2 = df1[(df.c_charge_degree != 'O') & (df.score_text != 'N/A') & (df.is_recid != -1)
& (df.days_b_screening_arrest <= 30) & (df.days_b_screening_arrest >= -30)]
df2.head()
| | id | sex | age | race | decile_score | score_text | v_decile_score | v_score_text | two_year_recid | priors_count |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Male | 69 | Other | 1 | Low | 1 | Low | 0 | 0 |
1 | 3 | Male | 34 | African-American | 3 | Low | 1 | Low | 1 | 0 |
2 | 4 | Male | 24 | African-American | 4 | Low | 3 | Low | 1 | 4 |
5 | 7 | Male | 44 | Other | 1 | Low | 1 | Low | 0 | 0 |
6 | 8 | Male | 41 | Caucasian | 6 | Medium | 2 | Low | 1 | 14 |
print('Original ', len(df))
print('Projected ', len(df1))
print('Cleaned ', len(df2))
df2.head()
Original 7214
Projected 7214
Cleaned 6172
| | id | sex | age | race | decile_score | score_text | v_decile_score | v_score_text | two_year_recid | priors_count |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Male | 69 | Other | 1 | Low | 1 | Low | 0 | 0 |
1 | 3 | Male | 34 | African-American | 3 | Low | 1 | Low | 1 | 0 |
2 | 4 | Male | 24 | African-American | 4 | Low | 3 | Low | 1 | 4 |
5 | 7 | Male | 44 | Other | 1 | Low | 1 | Low | 0 | 0 |
6 | 8 | Male | 41 | Caucasian | 6 | Medium | 2 | Low | 1 | 14 |
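The counts above show how many rows the cleaning step removes in total, but not which condition is responsible. A small sketch of a per-condition breakdown follows. It runs on made-up rows that reuse the column names from the filter; the condition labels are paraphrases, not ProPublica's wording.

```python
import pandas as pd

# Hypothetical toy rows; column names match those used in the cleaning filter.
toy = pd.DataFrame({
    "c_charge_degree": ["F", "M", "O", "F", "M"],
    "score_text": ["Low", "High", "Low", "N/A", "Medium"],
    "is_recid": [0, 1, 0, 0, -1],
    "days_b_screening_arrest": [-1.0, 45.0, 0.0, 2.0, None],
})

conditions = {
    "not an ordinary traffic offense": toy.c_charge_degree != "O",
    "has a score label": toy.score_text != "N/A",
    "COMPAS case found": toy.is_recid != -1,
    # between() is inclusive on both ends and False for missing values
    "charge within 30 days of arrest": toy.days_b_screening_arrest.between(-30, 30),
}

keep = pd.Series(True, index=toy.index)
for name, mask in conditions.items():
    print(f"{name}: fails for {(~mask).sum()} row(s)")
    keep &= mask

print("rows kept:", keep.sum(), "of", len(toy))
```

A row can fail more than one condition, so the per-condition counts need not add up to the total number of removed rows.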
Look at basic properties of the dataset: the breakdown by age, gender, and race. Compare histograms before and after data cleaning.
Observe that we are going through the lifecycle iteratively: profile, clean, profile again.
# visualize basic dataset statistics after cleaning
df2["age"].hist()
[Figure: histogram of age, cleaned rows (df2)]
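Two histograms drawn in separate cells are hard to compare by eye. One option is to bin both versions of the column with the same edges and put the counts side by side. The sketch below does this on made-up ages; `age_profile` and the bin edges are assumptions chosen for illustration.

```python
import pandas as pd


def age_profile(before, after, edges=(18, 25, 35, 45, 55, 100)):
    """Count ages per bin before and after cleaning, side by side."""
    bins = list(edges)
    table = pd.DataFrame({
        "before": pd.cut(before, bins=bins, right=False).value_counts(sort=False),
        "after": pd.cut(after, bins=bins, right=False).value_counts(sort=False),
    })
    table["removed"] = table["before"] - table["after"]
    return table


# Hypothetical ages; in the notebook these would be df1["age"] and df2["age"].
ages_before = pd.Series([19, 23, 24, 31, 44, 47, 69])
ages_after = pd.Series([19, 24, 31, 44, 69])
print(age_profile(ages_before, ages_after))
```

If the `removed` column is concentrated in a few bins, the cleaning step has changed the age distribution, which is worth knowing before any downstream analysis.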
Are there differences in the distribution of risk scores by gender or race?
# compute score histograms by race and by gender
df_f = df2[(df2.sex == 'Female')]
df_f["decile_score"].hist()
[Figure: histogram of decile_score, female defendants]
df_m = df2[(df2.sex == 'Male')]
df_m["decile_score"].hist()
[Figure: histogram of decile_score, male defendants]
df_aa = df2[(df2.race == 'African-American')]
df_aa["decile_score"].hist()
[Figure: histogram of decile_score, African-American defendants]
df_wh = df2[(df2.race == 'Caucasian')]
df_wh["decile_score"].hist()
[Figure: histogram of decile_score, Caucasian defendants]
df_aa["score_text"].value_counts().plot(kind='bar')
[Figure: bar chart of score_text counts, African-American defendants]
df_wh["score_text"].value_counts().plot(kind='bar')
[Figure: bar chart of score_text counts, Caucasian defendants]
df_aa["v_score_text"].value_counts().plot(kind='bar')
[Figure: bar chart of v_score_text counts, African-American defendants]
df_wh["v_score_text"].value_counts().plot(kind='bar')
[Figure: bar chart of v_score_text counts, Caucasian defendants]
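Because the groups differ in size, raw counts in separate bar charts can be misleading. A row-normalized cross-tabulation puts the groups on the same scale. The sketch below uses made-up rows standing in for `df2`; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for df2.
toy = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 4,
    "score_text": ["High", "Medium", "Low", "High", "Low", "Low", "Medium", "Low"],
})

# normalize="index" makes each row (each group) sum to 1
shares = pd.crosstab(toy["race"], toy["score_text"], normalize="index")
# Put the labels in risk order instead of alphabetical order
shares = shares.reindex(columns=["Low", "Medium", "High"])
print(shares.round(2))
```

Calling `shares.plot(kind='bar')` on the result would draw the groups side by side in a single chart.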
See the ProPublica article for details. When evaluating performance of the prediction instrument, we consider accuracy and parity (or balance).
High false-positive rates: Of those deemed likely to re-offend, in any crime category, only 61 percent were arrested for any subsequent crimes within two years. Only 20 percent of people predicted to commit violent crimes actually went on to do so.
Racial skew in scores: Scores for white defendants were skewed toward lower-risk categories. Scores for black defendants were not, which is a symptom of a potential problem.
Racial skew in false-positive rates: Among African-American defendants who did not re-offend, 44.9% had been labeled higher risk, compared to 23.5% of Caucasian defendants who did not re-offend.
Racial skew in false-negative rates: Among Caucasian defendants who did re-offend, 47.7% had been labeled lower risk, compared to 28% of African-American defendants who did re-offend.
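The false-positive and false-negative rates above are computed within each group, conditioned on the actual outcome. A minimal sketch of that computation on made-up rows follows. It assumes that "higher risk" means a `score_text` of Medium or High and that `two_year_recid == 1` marks re-offending; `error_rates` is a hypothetical helper, not ProPublica's code.

```python
import pandas as pd


def error_rates(frame, group_col="race"):
    """False-positive and false-negative rates per group.

    Assumes "higher risk" means score_text is Medium or High.
    """
    predicted = frame["score_text"] != "Low"
    actual = frame["two_year_recid"] == 1
    rows = {}
    for group, idx in frame.groupby(group_col).groups.items():
        p, a = predicted.loc[idx], actual.loc[idx]
        fp = (p & ~a).sum()   # labeled higher risk, did not re-offend
        tn = (~p & ~a).sum()  # labeled low risk, did not re-offend
        fn = (~p & a).sum()   # labeled low risk, did re-offend
        tp = (p & a).sum()    # labeled higher risk, did re-offend
        rows[group] = {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}
    return pd.DataFrame.from_dict(rows, orient="index")


# Hypothetical toy rows with two groups, "A" and "B".
toy = pd.DataFrame({
    "race": ["A"] * 4 + ["B"] * 4,
    "score_text": ["High", "Low", "Medium", "Low", "Low", "Low", "High", "Low"],
    "two_year_recid": [0, 0, 1, 1, 0, 0, 1, 1],
})
rates = error_rates(toy)
print(rates)
```

The denominators are the people who did not re-offend (for FPR) and the people who did (for FNR), not the whole group.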
What do we mean when we say that data is biased? Societal bias? Measurement bias?
Consider the COMPAS questionnaire. Race is not one of the questions. Then why are we seeing a difference in recidivism scores across the populations?
In this dataset, African-American defendants have a higher observed recidivism rate. Recent results show that when recidivism rates differ across sub-populations, a score that is equally predictive for each group cannot also equalize both false-positive rates and false-negative rates!
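One way to see the tension is through the definitions themselves. For a group with base rate p (the share that re-offends), the false-positive rate, the false-negative rate, and the positive predictive value (PPV, the share of higher-risk labels that turn out correct) are linked by FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR). The sketch below plugs in made-up numbers to show that holding PPV and FNR equal across two groups with different base rates forces their FPRs apart.

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR implied by a group's base rate, PPV, and FNR.

    Follows from the confusion-matrix definitions:
    FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical numbers: same PPV and FNR for both groups, different base rates.
ppv, fnr = 0.6, 0.3
for label, base_rate in [("group 1", 0.5), ("group 2", 0.4)]:
    print(label, round(implied_fpr(base_rate, ppv, fnr), 3))
```

With these inputs the group with the higher base rate ends up with the higher false-positive rate, even though the score is equally predictive for both groups.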
This is not a technical issue: it is not our definition that's flawed! Our society exhibits structural bias, which processes like those studied here reinforce.