This notebook contains my analysis of data presented in "Machine Bias", Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, ProPublica, May 23, 2016.
I would like to thank the authors of that article for making their data and analysis freely available. They are a model of open science.
Copyright 2018 Allen Downey
The code and text of this notebook are under this license: Creative Commons Attribution 4.0 International
The data are from this repository, which contains the data and analysis pipeline described on this web page.
The terms of use for the data are here. In compliance with those terms, I am not distributing the data, but there is a link below that downloads it directly.
%matplotlib inline
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sympy import symbols, Eq, solve
from overthink import decorate
In this section I start with the data reported here and replicate the analysis there, computing various metrics based on the confusion matrix: prevalence, sensitivity, specificity, false positive rate, false negative rate, positive predictive value, and negative predictive value.
The following function takes an array and returns a Pandas DataFrame:
def make_matrix(a):
"""Make a confusion matrix from an array.
a: array or list of lists
returns: DataFrame
"""
a = np.asarray(a).reshape((2, 2))
index = ['Survived', 'Recidivated']
columns = ['Low', 'High']
return pd.DataFrame(a, index=index, columns=columns)
Make the matrix for all defendants.
a = [[2681, 1282], [1216, 2035]]
matrix_all = make_matrix(a)
Low | High | |
---|---|---|
Survived | 2681 | 1282 |
Recidivated | 1216 | 2035 |
Confirm that the total matches what's reported in the article.
np.sum(a)
7214
Compute sensitivity and specificity.
def percent(x, y):
"""Compute the percentage `x/(x+y)`.
"""
return x / (x+y) * 100
def sens_spec(m):
"""Compute sensitivity and specificity.
m: confusion matrix
"""
tn, fp, fn, tp = m.values.flatten()
sens = percent(tp, fn)
spec = percent(tn, fp)
return sens, spec
Compute sensitivity and specificity for all defendants.
sens, spec = sens_spec(matrix_all)
sens, spec
(62.59612426945556, 67.65076961897553)
Compute error rates.
def error_rates(m):
"""Compute false positive and false negative rate.
m: confusion matrix
"""
tn, fp, fn, tp = m.values.flatten()
fpr = percent(fp, tn)
fnr = percent(fn, tp)
return fpr, fnr
Compute error rates for all defendants.
fpr, fnr = error_rates(matrix_all)
fpr, fnr
(32.349230381024476, 37.40387573054445)
Compute predictive value.
def predictive_value(m):
"""Compute positive and negatie predictive value.
m: confusion matrix
"""
tn, fp, fn, tp = m.values.flatten()
ppv = percent(tp, fp)
npv = percent(tn, fn)
return ppv, npv
Compute predictive value for all defendants.
ppv, npv = predictive_value(matrix_all)
ppv, npv
(61.350618028338864, 68.79651013600206)
Compute prevalence.
def prevalence(df):
"""Compute prevalence.
m: confusion matrix
"""
tn, fp, fn, tp = df.values.flatten()
prevalence = percent(tp+fn, tn+fp)
return prevalence
Compute prevalences for all defendants.
prev = prevalence(matrix_all)
prev
45.06515109509287
Compute all metrics and put them in a DataFrame.
def compute_metrics(m, name=''):
"""Compute all metrics.
m: confusion matrix
returns: DataFrame
"""
fpr, fnr = error_rates(m)
ppv, npv = predictive_value(m)
sens, spec = sens_spec(m)
prev = prevalence(m)
index = ['FP rate', 'FN rate', 'PPV', 'NPV',
'Sensitivity', 'Specificity', 'Prevalence']
df = pd.DataFrame(index=index, columns=['Percent'])
df.Percent = fpr, fnr, ppv, npv, sens, spec, prev
df.index.name = name
return df
Compute metrics for all defendants.
compute_metrics(matrix_all, 'All defendants')
Percent | |
---|---|
All defendants | |
FP rate | 32.349230 |
FN rate | 37.403876 |
PPV | 61.350618 |
NPV | 68.796510 |
Sensitivity | 62.596124 |
Specificity | 67.650770 |
Prevalence | 45.065151 |
Make the confusion matrix for black defendants.
a = [[990, 805], [532, 1369]]
matrix_black = make_matrix(a)
Low | High | |
---|---|---|
Survived | 990 | 805 |
Recidivated | 532 | 1369 |
Compute metrics for black defendants.
black_metrics = compute_metrics(matrix_black, 'Black defendants')
Percent | |
---|---|
Black defendants | |
FP rate | 44.846797 |
FN rate | 27.985271 |
PPV | 62.971481 |
NPV | 65.045992 |
Sensitivity | 72.014729 |
Specificity | 55.153203 |
Prevalence | 51.433983 |
Make the confusion matrix for white defendants.
a = [[1139, 349], [461, 505]]
matrix_white = make_matrix(a)
Low | High | |
---|---|---|
Survived | 1139 | 349 |
Recidivated | 461 | 505 |
Compute metrics for white defendants.
white_metrics = compute_metrics(matrix_white, 'White defendants')
Percent | |
---|---|
White defendants | |
FP rate | 23.454301 |
FN rate | 47.722567 |
PPV | 59.133489 |
NPV | 71.187500 |
Sensitivity | 52.277433 |
Specificity | 76.545699 |
Prevalence | 39.364303 |
So far, all results are consistent with those reported in the article, including the headline results:
The false positive rate for black defendants is substantially higher than for white defendants (45%, compared to 23%).
The false negative rate for black defendants is substantially lower (28%, compared to 48%).
The false positive rate is the fraction of all non-recidivists who not labeled low risk.
The false negative rate is the fraction of all recidivists who were labeled low risk.
error_rates(matrix_black)
(44.84679665738162, 27.985270910047344)
error_rates(matrix_white)
(23.45430107526882, 47.72256728778468)
An ideal test should have equal predicitive value in all groups; that is, two people in the same risk category should have the same probability of recidivism, regardless of what group they are in.
An ideal test should also have the same error rates for all groups; that is, two non-recidivists should have the same probability of being classified as high risk.
Unfortunately, these two goals are in conflict:
If you design a test to achieve equal predictive value across groups with different prevalence, you will find that error rates depend on prevalence. Speficially, false positive rates will be higher in groups with higher rates of recividism.
If you design a test to achieve equal error rates across groups, you will find that predictive value depends on prevalence. Specifically, positive predictive value will be lower in groups with lower rates of recidivism.
The next two sections demonstrate these effects.
A confusion matrix contains four values, but because they are contrained to add up to 100, it only takes 3 values to determine a confusion matrix.
For example, if you specify prevalence, PPV, and NPV, that determines a confusion matrix, and then you can compute the error rates.
Or, if you specify prevalence, FPR, and FNR, that determines a confusion matrix, and then you can compute predictive values.
The following function takes prevalence, PPV, and NPV and returns a confusion matrix.
def constant_predictive_value(prev, ppv, npv):
"""Make a confusion matrix with given metrics.
prev: prevalence
ppv: positive predictive value
npv: negative predictive value
returns: confusion matrix
"""
tn, fp, fn, tp = symbols('tn fp fn tp')
eq1 = Eq(percent(tp+fn, tn+fp), prev)
eq2 = Eq(percent(tp, fp), ppv)
eq3 = Eq(percent(tn,fn), npv)
eq4 = Eq(tn+fp+fn+tp, 1)
soln = solve([eq1, eq2, eq3, eq4], [tn, fp, fn, tp])
a = list(soln.values())
return make_matrix(a)
To test it, I'll construct a confusion matrix with the actual metrics from all defendents.
ppv, npv = predictive_value(matrix_all)
prev = prevalence(matrix_all)
m = constant_predictive_value(prev, ppv, npv)
Low | High | |
---|---|---|
Survived | 0.371638480731910 | 0.177710008317161 |
Recidivated | 0.168561131133906 | 0.282090379817023 |
If we use it to compute the other metrics, they are consistent with the results we got with the original data.
compute_metrics(m)
Percent | |
---|---|
FP rate | 32.3492303810245 |
FN rate | 37.4038757305443 |
PPV | 61.3506180283389 |
NPV | 68.7965101360021 |
Sensitivity | 62.5961242694557 |
Specificity | 67.6507696189755 |
Prevalence | 45.0651510950929 |
We can use this function to run the "constant predictive value" model, which asks what happens if we keep PPV and NPV constant, and vary prevalence.
ppv, npv = predictive_value(matrix_all)
prevalences = np.linspace(32, 60, 31)
fp_rates = pd.Series(index=prevalences)
fn_rates = pd.Series(index=prevalences)
for prev in prevalences:
df = constant_predictive_value(prev, ppv, npv)
fpr, fnr = error_rates(df)
fp_rates[prev] = fpr
fn_rates[prev] = fnr
The following figure shows the error rates we would expect from a test with equal predictive value for all groups, regardless of prevalence.
def plot_cpv_model():
fp_rates.plot(label='false positive rate', color='C1')
fn_rates.plot(label='false negative rate', color='C4')
decorate(xlabel='Prevalence', ylabel='Percent',
title='Expected error rates, constant predictive value',
loc='upper center')
plot_cpv_model()
As prevalence increases, false positive rates increase quickly. Note the vertical scale: the difference in error rates between a low-prevalence group and a high-prevalence group is dramatic!
For the COMPAS test, the effect is not as extreme. The following figure shows the constant prediction model again, including data points for the white defendants (left), all defendants (middle), and black defendants (right).
def plot_fpr_fnr(m):
"""Plot error rates versus prevalence.
m: confusion matrix
"""
prev = prevalence(m)
fpr, fnr = error_rates(m)
plt.plot(prev, fpr, 'o', color='C1')
plt.plot(prev, fnr, 'o', color='C4')
plot_fpr_fnr(matrix_all)
plot_fpr_fnr(matrix_black)
plot_fpr_fnr(matrix_white)
plot_cpv_model()
For black defendants:
The actual false positive rate is lower that what we would expect if the test had the same predictive value for all groups.
The actual false negative rate is higher than expected.
For white defendants:
The actual false positive rate is higher than what we would expect if the test had the same predictive value for all groups.
The actual false negative rate is lower than expected.
Relative to the CPV model, the COMPAS test is what I will call "tempered", that is, less sensitive to variation in prevalence between groups.
In the previous section we held predictive value constant and computed the effect on error rates. In this section we'll go the other way: if we hold error rates constant for all groups, what effect does that have on predictive value?
The following function takes prevalence and error rates and returns a confusion matrix.
def constant_error_rates(prev, fpr, fnr):
"""Make a confusion matrix with given metrics.
prev: prevalence
fpr: false positive rate
fnr: false negative rate
returns: confusion matrix
"""
tn, fp, fn, tp = symbols('tn fp fn tp')
eq1 = Eq(percent(tp+fn, tn+fp), prev)
eq2 = Eq(percent(fp, tn), fpr)
eq3 = Eq(percent(fn, tp), fnr)
eq4 = Eq(tn+fp+fn+tp, 1)
soln = solve([eq1, eq2, eq3, eq4], [tn, fp, fn, tp])
a = list(soln.values())
return make_matrix(a)
Again, just to test it, we can replicate the observed confusion matrix.
fpr, fnr = error_rates(matrix_all)
prev = prevalence(matrix_all)
m = constant_error_rates(prev, fpr, fnr)
Low | High | |
---|---|---|
Survived | 0.371638480731910 | 0.177710008317161 |
Recidivated | 0.168561131133907 | 0.282090379817022 |
And it has the right metrics
compute_metrics(m)
Percent | |
---|---|
FP rate | 32.3492303810245 |
FN rate | 37.4038757305445 |
PPV | 61.3506180283388 |
NPV | 68.7965101360020 |
Sensitivity | 62.5961242694555 |
Specificity | 67.6507696189755 |
Prevalence | 45.0651510950929 |
Now we can see how predictive value depends on prevalence (with error rates held constant).
fpr, fnr = error_rates(matrix_all)
prevalences = np.linspace(20, 70, 31)
ppv_rates = pd.Series(index=prevalences)
npv_rates = pd.Series(index=prevalences)
for prev in prevalences:
df = constant_error_rates(prev, fpr, fnr)
ppv, npv = predictive_value(df)
ppv_rates[prev] = ppv
npv_rates[prev] = npv
The following function plots the results.
def plot_cer_model():
ppv_rates.plot(label='Positive predictive value', color='C0')
npv_rates.plot(label='Negative predictive value', color='C2')
decorate(xlabel='Prevalence', ylabel='Percent',
title='Expected predictive value, constant error rates',
loc='upper center')
plot_cer_model()
As prevalence increases, so does positive predictive value.
For the COMPAS test, the effect is not as extreme. The following figure shows the constant error rate again, including data points for the white defendants (left), all defendants (middle), and black defendants (right).
def plot_ppv_npv(m):
"""Plot predictive values versus prevalence.
m: confusion matrix
"""
prev = prevalence(m)
ppv, npv = predictive_value(m)
plt.plot(prev, ppv, 'o', color='C0')
plt.plot(prev, npv, 'o', color='C2')
plot_ppv_npv(matrix_all)
plot_ppv_npv(matrix_black)
plot_ppv_npv(matrix_white)
plot_cer_model()
Again, the test is less sensitive to differences in prevalence between groups than we would expect from the constant error rate model.
In this section I read the detailed dataset available from this repository and run validation checks.
# Uncomment and run this cell once to download the data.
# Then comment it again so you don't download it every time you run the notebook.
# !wget 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv'
cp = pd.read_csv("compas-scores-two-years.csv")
cp.head()
id | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | ... | v_decile_score | v_score_text | v_screening_date | in_custody | out_custody | priors_count.1 | start | end | event | two_year_recid | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | miguel hernandez | miguel | hernandez | 2013-08-14 | Male | 1947-04-18 | 69 | Greater than 45 | Other | ... | 1 | Low | 2013-08-14 | 2014-07-07 | 2014-07-14 | 0 | 0 | 327 | 0 | 0 |
1 | 3 | kevon dixon | kevon | dixon | 2013-01-27 | Male | 1982-01-22 | 34 | 25 - 45 | African-American | ... | 1 | Low | 2013-01-27 | 2013-01-26 | 2013-02-05 | 0 | 9 | 159 | 1 | 1 |
2 | 4 | ed philo | ed | philo | 2013-04-14 | Male | 1991-05-14 | 24 | Less than 25 | African-American | ... | 3 | Low | 2013-04-14 | 2013-06-16 | 2013-06-16 | 4 | 0 | 63 | 0 | 1 |
3 | 5 | marcu brown | marcu | brown | 2013-01-13 | Male | 1993-01-21 | 23 | Less than 25 | African-American | ... | 6 | Medium | 2013-01-13 | NaN | NaN | 1 | 0 | 1174 | 0 | 0 |
4 | 6 | bouthy pierrelouis | bouthy | pierrelouis | 2013-03-26 | Male | 1973-01-22 | 43 | 25 - 45 | Other | ... | 1 | Low | 2013-03-26 | NaN | NaN | 2 | 0 | 1102 | 0 | 0 |
5 rows × 53 columns
cp.shape
(7214, 53)
for col in cp.columns:
print(col)
id name first last compas_screening_date sex dob age age_cat race juv_fel_count decile_score juv_misd_count juv_other_count priors_count days_b_screening_arrest c_jail_in c_jail_out c_case_number c_offense_date c_arrest_date c_days_from_compas c_charge_degree c_charge_desc is_recid r_case_number r_charge_degree r_days_from_arrest r_offense_date r_charge_desc r_jail_in r_jail_out violent_recid is_violent_recid vr_case_number vr_charge_degree vr_offense_date vr_charge_desc type_of_assessment decile_score.1 score_text screening_date v_type_of_assessment v_decile_score v_score_text v_screening_date in_custody out_custody priors_count.1 start end event two_year_recid
The following functions compute value counts and percentages for various variables.
def make_dataframe(series, *columns):
"""Make a Series into a DataFrame with one column.
So it looks better in Jupyter.
series: Series
columns: column name(s)
returns: DataFrame
"""
df = pd.DataFrame(series.values,
index=series.index,
columns=columns)
df.index.name = series.name
return df
def counts(df, var):
"""Compute counts for each unique value.
df: DataFrame
var: variable name
"""
series = df[var].value_counts()
return make_dataframe(series, 'Count')
def percentages(df, var):
"""Compute percentages for each unique value.
df: DataFrame
var: variable name
"""
series = df[var].value_counts() / len(df) * 100
return make_dataframe(series, 'Percentage')
Breakdown by age
counts(cp, 'age_cat')
Count | |
---|---|
age_cat | |
25 - 45 | 4109 |
Greater than 45 | 1576 |
Less than 25 | 1529 |
percentages(cp, 'age_cat')
Percentage | |
---|---|
age_cat | |
25 - 45 | 56.958691 |
Greater than 45 | 21.846410 |
Less than 25 | 21.194899 |
Breakdown by race
counts(cp, 'race')
Count | |
---|---|
race | |
African-American | 3696 |
Caucasian | 2454 |
Hispanic | 637 |
Other | 377 |
Asian | 32 |
Native American | 18 |
percentages(cp, 'race')
Percentage | |
---|---|
race | |
African-American | 51.233712 |
Caucasian | 34.017189 |
Hispanic | 8.830053 |
Other | 5.225950 |
Asian | 0.443582 |
Native American | 0.249515 |
Breakdown by sex
counts(cp, 'sex')
Count | |
---|---|
sex | |
Male | 5819 |
Female | 1395 |
percentages(cp, 'sex')
Percentage | |
---|---|
sex | |
Male | 80.6626 |
Female | 19.3374 |
Breakdown by recidivism
counts(cp, 'two_year_recid')
Count | |
---|---|
two_year_recid | |
0 | 3963 |
1 | 3251 |
percentages(cp, 'two_year_recid')
Percentage | |
---|---|
two_year_recid | |
0 | 54.934849 |
1 | 45.065151 |
Breakdown by risk category
counts(cp, 'score_text')
Count | |
---|---|
score_text | |
Low | 3897 |
Medium | 1914 |
High | 1403 |
percentages(cp, 'score_text')
Percentage | |
---|---|
score_text | |
Low | 54.019961 |
Medium | 26.531744 |
High | 19.448295 |
The following function computes cross-tabulations.
def crosstab(df, index, columns):
"""Compute a cross-tabulation.
df: DataFrame
index: variable(s) that will label the rows
columns: variable(s) that will label the columns
returns: DataFrame
"""
xtab = df.pivot_table(index=index,
columns=columns,
values='id',
aggfunc='count')
return xtab
Breakdown by sex and race
xtab = crosstab(cp, 'sex', 'race')
race | African-American | Asian | Caucasian | Hispanic | Native American | Other |
---|---|---|---|---|---|---|
sex | ||||||
Female | 652 | 2 | 567 | 103 | 4 | 67 |
Male | 3044 | 30 | 1887 | 534 | 14 | 310 |
Breakdown by age and race
xtab = crosstab(cp, 'age_cat', 'race')
race | African-American | Asian | Caucasian | Hispanic | Native American | Other |
---|---|---|---|---|---|---|
age_cat | ||||||
25 - 45 | 2194 | 14 | 1312 | 367 | 12 | 210 |
Greater than 45 | 582 | 11 | 752 | 143 | 3 | 85 |
Less than 25 | 920 | 7 | 390 | 127 | 3 | 82 |
Breakdown by age and sex
xtab = crosstab(cp, 'age_cat', 'sex')
sex | Female | Male |
---|---|---|
age_cat | ||
25 - 45 | 807 | 3302 |
Greater than 45 | 300 | 1276 |
Less than 25 | 288 | 1241 |
Distribution of decile scores for black defendants.
cp.loc[cp.race=='African-American', 'decile_score'].hist()
decorate(xlabel='Decile Score',
ylabel='Count',
title='Black defendant decile scores',
ylim=[0, 700])
Distribution of decile scores for white defendants.
cp.loc[cp.race=='Caucasian', 'decile_score'].hist()
decorate(xlabel='Decile Score',
ylabel='Count',
title='White defendant decile scores',
ylim=[0, 700])
Cross tabulation of decile score and race.
crosstab(cp, 'decile_score', 'race')
race | African-American | Asian | Caucasian | Hispanic | Native American | Other |
---|---|---|---|---|---|---|
decile_score | ||||||
1 | 398.0 | 15.0 | 681.0 | 196.0 | NaN | 150.0 |
2 | 393.0 | 4.0 | 361.0 | 113.0 | 4.0 | 66.0 |
3 | 346.0 | 5.0 | 273.0 | 86.0 | 1.0 | 36.0 |
4 | 385.0 | NaN | 285.0 | 52.0 | 1.0 | 46.0 |
5 | 365.0 | 1.0 | 241.0 | 52.0 | NaN | 22.0 |
6 | 384.0 | 3.0 | 194.0 | 37.0 | 2.0 | 21.0 |
7 | 400.0 | 1.0 | 143.0 | 34.0 | 4.0 | 10.0 |
8 | 359.0 | 2.0 | 114.0 | 26.0 | 1.0 | 10.0 |
9 | 380.0 | NaN | 98.0 | 20.0 | 2.0 | 8.0 |
10 | 286.0 | 1.0 | 64.0 | 21.0 | 3.0 | 8.0 |
Cross tabulation of decile score and age group.
crosstab(cp, 'decile_score', 'age_cat')
age_cat | 25 - 45 | Greater than 45 | Less than 25 |
---|---|---|---|
decile_score | |||
1 | 654 | 782 | 4 |
2 | 660 | 182 | 99 |
3 | 437 | 120 | 190 |
4 | 434 | 98 | 237 |
5 | 405 | 85 | 191 |
6 | 357 | 104 | 180 |
7 | 340 | 77 | 175 |
8 | 298 | 57 | 157 |
9 | 292 | 54 | 162 |
10 | 232 | 17 | 134 |
Here's the confusion matrix with all three score categories.
crosstab(cp, 'two_year_recid', 'score_text')
score_text | High | Low | Medium |
---|---|---|---|
two_year_recid | |||
0 | 402 | 2681 | 880 |
1 | 1001 | 1216 | 1034 |
To make sure I've got the data right, I'll reproduce the confusion matrices from the article.
def compute_matrix(df):
"""Compute a confusion matrix from data.
df: DataFrame
returns: confusion matrix
"""
high = cp.score_text.isin(['Medium', 'High']).astype(int)
return crosstab(df, 'two_year_recid', high)
All defendants.
compute_matrix(cp)
score_text | 0 | 1 |
---|---|---|
two_year_recid | ||
0 | 2681 | 1282 |
1 | 1216 | 2035 |
And the differences are 0.
matrix_all - compute_matrix(cp).values
Low | High | |
---|---|---|
Survived | 0 | 0 |
Recidivated | 0 | 0 |
Black defendants.
black = cp[cp.race=='African-American']
compute_matrix(black)
score_text | 0 | 1 |
---|---|---|
two_year_recid | ||
0 | 990 | 805 |
1 | 532 | 1369 |
matrix_black - compute_matrix(black).values
Low | High | |
---|---|---|
Survived | 0 | 0 |
Recidivated | 0 | 0 |
White defendants.
white = cp[cp.race=='Caucasian']
compute_matrix(white)
score_text | 0 | 1 |
---|---|---|
two_year_recid | ||
0 | 1139 | 349 |
1 | 461 | 505 |
matrix_white - compute_matrix(white).values
Low | High | |
---|---|---|
Survived | 0 | 0 |
Recidivated | 0 | 0 |
To check for calibration, I group defendents by decile score and compute prevalence (recidivism rate) in each group.
This analysis does not take observation time into account, unlike the analysis in the original article.
The following function groups defendants by decile score and computes prevalence in each group.
def calibration_curve(df):
"""Compute probability of recidivism by decile score.
df: DataFrame
returns: Series
"""
return df.groupby('decile_score').two_year_recid.mean()
The following figure shows this calibration curve for all defendants and broken down by race.
def plot_calibration(df, group_vars, title=''):
calibration_curve(cp).plot(linestyle='dotted',
label='All defendants', color='gray')
for name, group in cp.groupby(group_vars):
if len(group) > 1000:
calibration_curve(group).plot(label=name)
decorate(xlabel='Decile score',
ylabel='Prob recidivism',
title=title)
plot_calibration(cp, 'race')
The test is well calibrated. People with higher scores have higher probabilities of recidivism. In fact, we could use this curve to transform COMPAS scores into probabilities.
The test is about equally calibrated for black and white defendants, although black defendants with scores 3 and 4 are more likely to recidivate than white defendants with the same scores (that apparent difference might not be statistically significant).
Here's the breakdown by age group.
plot_calibration(cp, 'age_cat')
The test is about equally calibrated for all age groups, which means that people with the same score have about the same probability of recidivism, regardless of what group they are in.
There are only 4 people in the "Less than 25" group with decile score 1, which is why that data point is so out of line.
Here's the breakdown by sex.
plot_calibration(cp, 'sex')
Here's the first case where the test does not seem well calibrated. At all decile scores, female defendants are substantially less likely to recidivate than male defendants.
Or, reading this graph the other way, female defendants are given decile scores 1-2 points higher than male defendants with the same actual risk of recidivism.
In this section I'll compare actual PPV and FPR for a variety of subgroups to the values we would expect based on the CPV model; that is, a model where the predictive values are the same for all groups.
The following function groups defendants by group_vars
and returns a table with one row for each group.
def make_table(df, group_vars, expected_ppv, expected_npv):
"""Make a table with one line per group.
df: DataFrame
group_vars: string or list of string variable names
expected_ppv:
expected_npv:
returns: table
"""
# make the DataFrame
columns = ['count', 'prevalence',
'actual PPV', 'actual NPV', 'actual FNR',
'actual FPR', 'expected FPR', 'difference']
columns = group_vars + columns
table = pd.DataFrame(columns=columns)
# loop through the groups
grouped = df.groupby(group_vars)
for i, (name, group) in enumerate(grouped):
if not isinstance(name, tuple):
name = name,
# size of group
count = len(group)
# compute metrics
matrix = compute_matrix(group)
prev = prevalence(matrix)
actual_ppv, actual_npv = predictive_value(matrix)
actual_fpr, actual_fnr = error_rates(matrix)
# generate the CPV matrix
cpv = constant_predictive_value(prev,
expected_ppv, expected_npv)
# get the expected error rates
expected_fpr, _ = error_rates(cpv * 100)
# for very low and high prevalences, it might
# not be possible to achieve given predictive values
if expected_fpr < 0:
expected_fpr = 0
if expected_fpr > 100:
expected_fpr = 100
# difference between actual and expected
diff = actual_fpr - expected_fpr
# add a row to the table
row = name + (count, prev,
actual_ppv, actual_npv, actual_fnr,
actual_fpr, expected_fpr, diff)
table.loc[i] = row
# sort the table by prevalence
table.sort_values(by='prevalence', inplace=True)
return table
subset = cp[cp.race.isin(['African-American', 'Caucasian'])]
subset.shape
(6150, 53)
Here's the breakdown by age category.
ppv, npv = predictive_value(matrix_all)
group_vars = ['age_cat']
table1 = make_table(subset, group_vars, ppv, npv)
age_cat | count | prevalence | actual PPV | actual NPV | actual FNR | actual FPR | expected FPR | difference | |
---|---|---|---|---|---|---|---|---|---|
1 | Greater than 45 | 1334 | 32.908546 | 54.945055 | 75.360825 | 54.441913 | 18.324022 | 3.25812703191668 | 15.0658953144520 |
0 | 25 - 45 | 3506 | 47.803765 | 61.931499 | 66.782609 | 34.188544 | 37.049180 | 40.7730043881377 | -3.72382406026880 |
2 | Less than 25 | 1310 | 57.404580 | 64.665912 | 57.611241 | 24.069149 | 55.913978 | 78.8593270762673 | -22.9453485816437 |
Again, the actual behavior of the test is tempered, compare to the CPV model; that is, the results are less extreme than the model expects.
Here's the breakdown by race.
ppv, npv = predictive_value(matrix_all)
group_vars = ['race']
table2 = make_table(subset, group_vars, ppv, npv)
race | count | prevalence | actual PPV | actual NPV | actual FNR | actual FPR | expected FPR | difference | |
---|---|---|---|---|---|---|---|---|---|
1 | Caucasian | 2454 | 39.364303 | 59.133489 | 71.187500 | 47.722567 | 23.454301 | 17.2544721036439 | 6.19982897162488 |
0 | African-American | 3696 | 51.433983 | 62.971481 | 65.045992 | 27.985271 | 44.846797 | 53.4036054274053 | -8.55680877002365 |
The false positive rate for whites is higher than we would expect if predictive value were the same for all groups.
The false positive rate for blacks is lower than we would expect.
Here's the breakdown by sex.
ppv, npv = predictive_value(matrix_all)
group_vars = ['sex']
table3 = make_table(subset, group_vars, ppv, npv)
sex | count | prevalence | actual PPV | actual NPV | actual FNR | actual FPR | expected FPR | difference | |
---|---|---|---|---|---|---|---|---|---|
0 | Female | 1219 | 36.587367 | 50.980392 | 75.683891 | 35.874439 | 35.575679 | 10.8846867203845 | 24.6909924516724 |
1 | Male | 4931 | 49.097546 | 64.369680 | 66.193182 | 34.407270 | 35.019920 | 45.0678340229313 | -10.0479137042062 |
The false positive rate for women is substantially higher than what we would expect in the CPV model, which is consistent with the calibration results in the previous section.
Here's the breakdown by age and race.
ppv, npv = predictive_value(matrix_all)
group_vars = ['age_cat', 'race']
table4 = make_table(subset, group_vars, ppv, npv)
age_cat | race | count | prevalence | actual PPV | actual NPV | actual FNR | actual FPR | expected FPR | difference | |
---|---|---|---|---|---|---|---|---|---|---|
3 | Greater than 45 | Caucasian | 752 | 27.792553 | 55.555556 | 77.322835 | 68.899522 | 9.576427 | 0 | 9.57643 |
2 | Greater than 45 | African-American | 582 | 39.518900 | 54.655870 | 71.641791 | 41.304348 | 31.818182 | 17.6262781850286 | 14.1919036331532 |
1 | 25 - 45 | Caucasian | 1312 | 43.140244 | 60.000000 | 67.241379 | 46.996466 | 26.809651 | 26.9139758955996 | -0.104324421068796 |
5 | Less than 25 | Caucasian | 390 | 48.974359 | 59.071730 | 66.666667 | 26.701571 | 48.743719 | 44.6495211038666 | 4.09419748909818 |
0 | 25 - 45 | African-American | 2194 | 50.592525 | 62.685402 | 66.374589 | 27.657658 | 44.095941 | 50.3106752879841 | -6.21473432857449 |
4 | Less than 25 | African-American | 920 | 60.978261 | 66.718266 | 52.554745 | 23.172906 | 59.888579 | 97.8224228659821 | -37.9338434787955 |
Breakdown by age and sex.
ppv, npv = predictive_value(matrix_all)
group_vars = ['age_cat', 'sex']
table5 = make_table(subset, group_vars, ppv, npv)
age_cat | sex | count | prevalence | actual PPV | actual NPV | actual FNR | actual FPR | expected FPR | difference | |
---|---|---|---|---|---|---|---|---|---|---|
2 | Greater than 45 | Female | 259 | 23.552124 | 41.071429 | 81.280788 | 62.295082 | 16.666667 | 0 | 16.6667 |
3 | Greater than 45 | Male | 1075 | 35.162791 | 57.467532 | 73.794003 | 53.174603 | 18.794835 | 7.82872060272464 | 10.9661144044490 |
0 | 25 - 45 | Female | 704 | 40.056818 | 55.520505 | 72.609819 | 37.588652 | 33.412322 | 18.9349160741967 | 14.4774062006848 |
4 | Less than 25 | Female | 256 | 40.234375 | 46.276596 | 76.470588 | 15.533981 | 66.013072 | 19.3720445626131 | 46.6410273328117 |
1 | 25 - 45 | Male | 2802 | 49.750178 | 63.319672 | 65.097160 | 33.500717 | 38.139205 | 47.3182268799581 | -9.17902233450359 |
5 | Less than 25 | Male | 1054 | 61.574953 | 69.640288 | 54.038997 | 25.423729 | 52.098765 | 100 | -47.9012 |
Breakdown by race and sex.
ppv, npv = predictive_value(matrix_all)
group_vars = ['race', 'sex']
table6 = make_table(subset, group_vars, ppv, npv)
race | sex | count | prevalence | actual PPV | actual NPV | actual FNR | actual FPR | expected FPR | difference | |
---|---|---|---|---|---|---|---|---|---|---|
2 | Caucasian | Female | 567 | 35.097002 | 50.446429 | 74.927114 | 43.216080 | 30.163043 | 7.69083250175007 | 22.4722109765108 |
0 | African-American | Female | 652 | 37.883436 | 51.335312 | 76.507937 | 29.959514 | 40.493827 | 13.7867567264522 | 26.7070704340416 |
3 | Caucasian | Male | 1887 | 40.646529 | 62.222222 | 70.167064 | 48.891786 | 21.250000 | 20.3968108299805 | 0.853189170019501 |
1 | African-American | Male | 3044 | 54.336399 | 65.106151 | 62.054681 | 27.690447 | 46.115108 | 64.9466440776828 | -18.8315361640137 |
Breakdown by age, race, and sex.
ppv, npv = predictive_value(matrix_all)
group_vars = ['age_cat', 'race', 'sex']
table7 = make_table(subset, group_vars, ppv, npv)
age_cat | race | sex | count | prevalence | actual PPV | actual NPV | actual FNR | actual FPR | expected FPR | difference | |
---|---|---|---|---|---|---|---|---|---|---|---|
4 | Greater than 45 | African-American | Female | 88 | 22.727273 | 44.827586 | 88.135593 | 35.000000 | 23.529412 | 0 | 23.5294 |
6 | Greater than 45 | Caucasian | Female | 171 | 23.976608 | 37.037037 | 78.472222 | 75.609756 | 13.076923 | 0 | 13.0769 |
7 | Greater than 45 | Caucasian | Male | 581 | 28.915663 | 61.111111 | 76.985743 | 67.261905 | 8.474576 | 0 | 8.47458 |
10 | Less than 25 | Caucasian | Female | 87 | 31.034483 | 38.235294 | 94.736842 | 3.703704 | 70.000000 | 0 | 70 |
0 | 25 - 45 | African-American | Female | 395 | 38.227848 | 52.659574 | 74.879227 | 34.437086 | 36.475410 | 14.5784223461080 | 21.8969874899576 |
2 | 25 - 45 | Caucasian | Female | 309 | 42.394822 | 59.689922 | 70.000000 | 41.221374 | 29.213483 | 24.9067389374193 | 4.30674420864812 |
5 | Greater than 45 | African-American | Male | 494 | 42.510121 | 55.963303 | 68.115942 | 41.904762 | 33.802817 | 25.2138089028154 | 8.58900799859303 |
3 | 25 - 45 | Caucasian | Male | 1003 | 43.369890 | 60.107817 | 66.455696 | 48.735632 | 26.056338 | 27.5430043789732 | -1.48666635080423 |
8 | Less than 25 | African-American | Female | 169 | 44.970414 | 50.833333 | 69.387755 | 19.736842 | 63.440860 | 32.0728304197014 | 31.3680297953524 |
1 | 25 - 45 | African-American | Male | 1799 | 53.307393 | 64.409881 | 63.881020 | 26.590198 | 46.309524 | 60.6900439996716 | -14.3805201901478 |
11 | Less than 25 | Caucasian | Male | 303 | 54.125413 | 67.455621 | 62.686567 | 30.487805 | 39.568345 | 64.0583101483587 | -24.4899648246177 |
9 | Less than 25 | African-American | Male | 751 | 64.580559 | 70.342205 | 48.888889 | 23.711340 | 58.646617 | 100 | -41.3534 |
Those are all the possible subgroups for these three variables.
Now we can see what the results look like.
tables = [table1, table2, table3, table4, table5, table6, table7];
The following function plots one data point per subgroup showing the given metric versus prevalence.
Groups with a small number of people are shown with lighter colors.
def plot_table_var(table, var, color):
"""Plot one data point per row.
table: DataFrame
var: which metric to plot
color: string
"""
for _, row in table.iterrows():
alpha = 0.8 if row['count'] > 200 else 0.3
plt.plot(row['prevalence'], row[var],
'o', color=color, alpha=alpha)
Here's what the results look like for FPR.
fp_rates.plot(label='Expected FPR, constant CPV',
color='C1')
plt.axhline(fpr, linestyle='dotted',
label='Expected FPR, constant FPR', color='gray')
for table in tables:
plot_table_var(table, 'actual FPR', 'C1')
decorate(xlabel='Prevalence',
ylabel='Percent',
title='False positive rates by subgroup')
In general, groups with higher prevalence have higher false positive rates, but the effect is less extreme than what we would expect from the CPV model.
Here are the results for positive predictive value.
ppv_rates.plot(label='Expected PPV, constant FPR', color='C0')
plt.axhline(ppv, linestyle='dotted',
label='Expected PPV, constant PPV', color='gray')
for table in tables:
plot_table_var(table, 'actual PPV', 'C0')
decorate(xlabel='Prevalence',
ylabel='Rate',
title='Positive predictive value by subgroup')
Groups with higher prevalence have higher PPV, but the effect is less extreme than we would expect from the CPV model.
Here are the results for false negative rate.
fn_rates.plot(label='Expected FNR, constant CPV',
color='C4')
plt.axhline(fnr, linestyle='dotted',
label='Expected FNR, constant FNR', color='gray')
for table in tables:
plot_table_var(table, 'actual FNR', 'C4')
decorate(xlabel='Prevalence',
ylabel='Percent',
title='False negative rates by subgroup')
Groups with higher prevalence have lower FNR, but the effect is less extreme than we would expect from the CPV model.
Here are the results for negative predictive value.
npv_rates.plot(label='Expected NPV, constant FPR', color='C2')
plt.axhline(npv, linestyle='dotted',
label='Expected NPV, constant NPV', color='gray')
for table in tables:
plot_table_var(table, 'actual NPV', 'C2')
decorate(xlabel='Prevalence',
ylabel='Percent',
title='Negative predictive value by subgroup')
Groups with higher prevalence have lower NPV. In this case, the effect is almost exactly what we would expect from the CPV model.
from scipy.interpolate import interp1d
def crossing(series, value, **options):
"""Find where a function crosses a value.
series: Series
value: number
options: passed to interp1d (default is linear interp)
returns: number
"""
interp = interp1d(series.values, series.index, **options)
return interp(value)
def interpolate(series, value, **options):
"""Evaluate a function at a value.
series: Series
value: number
options: passed to interp1d (default is linear interp)
returns: number
"""
interp = interp1d(series.index, series.values, **options)
return interp(value)
cal_all = calibration_curve(cp)
cal_all.plot()
decorate(ylabel='Probability of recidivism')
crossing(cal_all, 0.4)
array(3.40971594)
crossing(cal_all, 0.7)
array(9.01595501)
interpolate(cal_all, 3.4)
array(0.39943493)
interpolate(cal_all, 9)
array(0.6988189)
def make_error_dist(std_dev):
"""Make a discrete Gaussian distribution.
std_dev: standard deviation
returns: Series that maps errors to probabilities
"""
errors = np.linspace(-3, 3, 21)
prob_error = np.exp(-(errors/std_dev)**2)
prob_error /= np.sum(prob_error)
error_dist = pd.Series(prob_error, index=errors)
return error_dist
error_dist = make_error_dist(std_dev=2)
error_dist.plot(label='')
decorate(xlabel='Error (score)',
ylabel='Probability')
def individual_fpr(actual_prob_recid, cal, thresh, std_dev):
"""Compute an individual FPR.
actual_prob_recid: actual probability of recidivism
cal: calibration curve, map from score to prob_recid
thresh: threshold between low and not low risk
std_dev: standard deviation of the error function
returns: individual FPR
"""
# look up actual_prob_recid to get correct score
correct_score = crossing(cal, actual_prob_recid)
# make the error distribution
error_dist = make_error_dist(std_dev)
# loop through possible errors
total_prob = 0
for error, prob_error in error_dist.iteritems():
# hypothetical score
score = correct_score+error
score = max(score, 1)
score = min(score, 10)
# probability of being classified 'not low' | error
prob_positive = 0 if score < thresh else 1
# probability of being a false positive | error
prob_fp = prob_positive * (1-actual_prob_recid)
total_prob += prob_error * prob_fp
return total_prob
individual_fpr(0.3, cal_all, 4.5, 2)
0.01623457260428886
individual_fpr(0.5, cal_all, 4.5, 2)
0.3538656315021642
individual_fpr(0.7, cal_all, 4.5, 2)
0.3
def compute_fpr_vs_prob_recid(cal, thresh, std_dev):
"""Computes FPR as a function of probability of recidivism.
cal: calibration curve, map from score to prob_recid
thresh: threshold between low and not low risk
std_dev: standard deviation of the error function
returns: Series
"""
prob_recid_array = np.linspace(min(cal), max(cal), 21)
prob_fpr_series = pd.Series(index=prob_recid_array)
for prob_recid in prob_recid_array:
fpr = individual_fpr(prob_recid, cal, thresh, std_dev)
prob_fpr_series[prob_recid] = fpr
return prob_fpr_series
s = compute_fpr_vs_prob_recid(cal_all, thresh=4.5, std_dev=2)
s.plot(label='FPR, std_dev=2')
s = compute_fpr_vs_prob_recid(cal_all, thresh=4.5, std_dev=1)
s.plot(label='FPR, std_dev=1')
decorate(xlabel='Actual probability of recidivism',
ylabel='Probability of false positive')
def individual_fpr_given_score(actual_score, cal, thresh, std_dev):
"""Compute an individual FPR.
actual_score: score assigned
cal: calibration curve, map from score to prob_recid
thresh: threshold between low and high risk
std_dev: standard deviation of the error function
returns: individual FPR
"""
# make the error distribution
error_dist = make_error_dist(std_dev)
# loop through possible errors
total_prob = 0
for error, prob_error in error_dist.iteritems():
# correct score
correct_score = actual_score-error
correct_score = max(correct_score, 1)
correct_score = min(correct_score, 10)
# map from correct score to probability of recidivism.
# if calibration curves are different for different
# groups, this one should be group specific.
correct_prob_recid = interpolate(cal, correct_score)
cond_ifpr = individual_fpr(correct_prob_recid,
cal, thresh, std_dev)
total_prob += prob_error * cond_ifpr
return total_prob
individual_fpr_given_score(6, cal_all, thresh=4.5, std_dev=2)
0.3451680036503149
def make_ifpr_series(cal, thresh, std_dev):
scores = np.arange(1, 11)
t = [individual_fpr_given_score(score, cal, thresh, std_dev)
for score in scores]
ifpr_series = pd.Series(t, scores)
return ifpr_series
thresh = 4.5
s = make_ifpr_series(cal_all, thresh, std_dev=2)
s.plot(label='FPR, std_dev=2')
s = make_ifpr_series(cal_all, thresh, std_dev=1)
s.plot(label='FPR, std_dev=1')
decorate(xlabel='Score',
ylabel='Probability of false positive')
def assign_individual_fpr(df, cal, thresh, std_dev):
"""Assign individual FPRs to defendants.
df: DataFrame
cal: calibration curve, map from score to prob_recid
thresh: threshold between low and high risk
std_dev: standard deviation of the error function
"""
# compute the map from score to FPR
ifpr_series = make_ifpr_series(cal, thresh, std_dev)
# assign FPR to each defendant
df['ifpr'] = [ifpr_series[score] for score in df.decile_score]
assign_individual_fpr(cp, cal_all, thresh=4.5, std_dev=2)
def make_cdf(series):
"""Make a CDF."""
counts = series.value_counts().sort_index()
counts /= counts.sum()
return counts.cumsum()
def plot_cdf(cdf, **options):
"""Plot a CDF as a step function."""
plt.step(cdf.index, cdf.values, where='post', **options)
cdf_ifpr = make_cdf(cp.ifpr)
0.018482 0.199612 0.050389 0.330053 0.128211 0.433601 0.210571 0.540200 0.258458 0.593291 0.282626 0.687691 0.292149 0.758109 0.327170 0.829082 0.345168 0.917937 0.353374 1.000000 Name: ifpr, dtype: float64
plot_cdf(cdf_ifpr, label='All')
decorate(xlabel='Individual probability of false positive',
ylabel='CDF',
ylim=[0,1])
black = cp[cp.race=='African-American']
white = cp[cp.race=='Caucasian']
thresh = 4.5
std_dev = 2
cal_black = calibration_curve(black)
s = make_ifpr_series(cal_black, thresh, std_dev)
s.plot(label='FPR, black')
cal_white = calibration_curve(white)
s = make_ifpr_series(cal_white, thresh, std_dev)
s.plot(label='FPR, white')
decorate(xlabel='Score',
ylabel='Probability of false positive')
male = cp[cp.sex=='Male']
female = cp[cp.sex=='Female']
thresh = 4.5
std_dev = 2
cal_male = calibration_curve(male)
s = make_ifpr_series(cal_male, thresh, std_dev)
s.plot(label='FPR, male')
cal_female = calibration_curve(female)
s = make_ifpr_series(cal_female, thresh, std_dev)
s.plot(label='FPR, female')
decorate(xlabel='Score',
ylabel='Probability of false positive')
assign_individual_fpr(cp, cal_all, thresh=4.5, std_dev=2)
cp.ifpr.mean()
0.18984802588839386
cp.groupby('race').ifpr.mean()
race African-American 0.225117 Asian 0.115755 Caucasian 0.160501 Hispanic 0.142154 Native American 0.240612 Other 0.119566 Name: ifpr, dtype: float64
assign_individual_fpr(cp, cal_all, thresh=4.5, std_dev=1)
cp.ifpr.mean()
0.182639042962266
cp.groupby('race').ifpr.mean()
race African-American 0.223846 Asian 0.098804 Caucasian 0.148866 Hispanic 0.125011 Native American 0.239059 Other 0.100288 Name: ifpr, dtype: float64
assign_individual_fpr(cp, cal_all, thresh=4.5, std_dev=0.01)
cp.ifpr.mean()
0.1777100083171615
cp.groupby('race').ifpr.mean()
race African-American 0.220867 Asian 0.097329 Caucasian 0.142568 Hispanic 0.119877 Native American 0.228798 Other 0.085460 Name: ifpr, dtype: float64
cal_all = calibration_curve(cp)
cal_all.plot()
decorate(ylabel='Probability of recidivism')
def individual_fnr(actual_prob_recid, cal, thresh, std_dev):
"""Compute an individual FNR.
actual_prob_recid: actual probability of recidivism
cal: calibration curve, map from score to prob_recid
thresh: threshold between low and not low risk
std_dev: standard deviation of the error function
returns: individual FNR
"""
# look up actual_prob_recid to get correct score
correct_score = crossing(cal, actual_prob_recid)
# make the error distribution
error_dist = make_error_dist(std_dev)
# loop through possible errors
total_prob = 0
for error, prob_error in error_dist.iteritems():
# hypothetical score
score = correct_score+error
score = max(score, 1)
score = min(score, 10)
# probability of being classified 'low' | error
prob_negative = 0 if score >= thresh else 1
# probability of being a false negative | error
prob_fp = prob_negative * actual_prob_recid
total_prob += prob_error * prob_fp
return total_prob
individual_fnr(0.3, cal_all, 4.5, 2)
0.29304232602673336
individual_fnr(0.5, cal_all, 4.5, 2)
0.14613436849783573
individual_fnr(0.7, cal_all, 4.5, 2)
0.0
def compute_fnr_vs_prob_recid(cal, thresh, std_dev):
"""Computes FNR as a function of probability of recidivism.
cal: calibration curve, map from score to prob_recid
thresh: threshold between low and not low risk
std_dev: standard deviation of the error function
returns: Series
"""
prob_recid_array = np.linspace(min(cal), max(cal), 21)
prob_fnr_series = pd.Series(index=prob_recid_array)
for prob_recid in prob_recid_array:
fnr = individual_fnr(prob_recid, cal, thresh, std_dev)
prob_fnr_series[prob_recid] = fnr
return prob_fnr_series
s = compute_fnr_vs_prob_recid(cal_all, thresh=4.5, std_dev=2)
s.plot(label='FNR, std_dev=2')
s = compute_fnr_vs_prob_recid(cal_all, thresh=4.5, std_dev=1)
s.plot(label='FNR, std_dev=1')
decorate(xlabel='Actual probability of recidivism',
ylabel='Probability of false negative')
def individual_fnr_given_score(actual_score, cal, thresh, std_dev):
"""Compute an individual FNR.
actual_score: score assigned
cal: calibration curve, map from score to prob_recid
thresh: threshold between low and high risk
std_dev: standard deviation of the error function
returns: individual FNR
"""
# make the error distribution
error_dist = make_error_dist(std_dev)
# loop through possible errors
total_prob = 0
for error, prob_error in error_dist.iteritems():
# correct score
correct_score = actual_score-error
correct_score = max(correct_score, 1)
correct_score = min(correct_score, 10)
# map from correct score to probability of recidivism.
# if calibration curves are different for different
# groups, this one should be group specific.
correct_prob_recid = interpolate(cal, correct_score)
cond_ifnr = individual_fnr(correct_prob_recid,
cal, thresh, std_dev)
total_prob += prob_error * cond_ifnr
return total_prob
individual_fnr_given_score(6, cal_all, thresh=4.5, std_dev=2)
0.09711411113575788
def make_ifnr_series(cal, thresh, std_dev):
scores = np.arange(1, 11)
t = [individual_fnr_given_score(score, cal, thresh, std_dev)
for score in scores]
ifnr_series = pd.Series(t, scores)
return ifnr_series
thresh = 4.5
s = make_ifnr_series(cal_all, thresh, std_dev=2)
s.plot(label='FNR, std_dev=2')
s = make_ifnr_series(cal_all, thresh, std_dev=1)
s.plot(label='FNR, std_dev=1')
decorate(xlabel='Score',
ylabel='Probability of false negative')
def assign_individual_fnr(df, cal, thresh, std_dev):
"""Assign individual FNRs to defendants.
df: DataFrame
cal: calibration curve, map from score to prob_recid
thresh: threshold between low and high risk
std_dev: standard deviation of the error function
"""
# compute the map from score to FPR
ifnr_series = make_ifnr_series(cal, thresh, std_dev)
# assign FPR to each defendant
df['ifnr'] = [ifnr_series[score] for score in df.decile_score]
assign_individual_fnr(cp, cal_all, thresh=4.5, std_dev=2)
cdf_ifnr = make_cdf(cp.ifnr)
0.000205 0.053091 0.002586 0.123510 0.016131 0.194483 0.043440 0.276546 0.097114 0.365401 0.177649 0.459800 0.233751 0.566399 0.248533 0.766011 0.266719 0.869559 0.273906 1.000000 Name: ifnr, dtype: float64
plot_cdf(cdf_ifnr, label='All')
decorate(xlabel='Individual probability of false negative',
ylabel='CDF',
ylim=[0,1])
thresh = 4.5
std_dev = 2
cal_black = calibration_curve(black)
s = make_ifnr_series(cal_black, thresh, std_dev)
s.plot(label='FNR, black')
cal_white = calibration_curve(white)
s = make_ifnr_series(cal_white, thresh, std_dev)
s.plot(label='FNR, white')
decorate(xlabel='Score',
ylabel='Probability of false negative')
thresh = 4.5
std_dev = 2
cal_male = calibration_curve(male)
s = make_ifnr_series(cal_male, thresh, std_dev)
s.plot(label='FNR, male')
cal_female = calibration_curve(female)
s = make_ifnr_series(cal_female, thresh, std_dev)
s.plot(label='FNR, female')
decorate(xlabel='Score',
ylabel='Probability of false negative')
assign_individual_fnr(cp, cal_all, thresh=4.5, std_dev=2)
cp.ifnr.mean()
0.1681762611669047
cp.groupby('race').ifnr.mean()
race African-American 0.139389 Asian 0.209441 Caucasian 0.194595 Hispanic 0.203360 Native American 0.110333 Other 0.218244 Name: ifnr, dtype: float64
assign_individual_fnr(cp, cal_all, thresh=4.5, std_dev=1)
cp.ifnr.mean()
0.1690990899101053
cp.groupby('race').ifnr.mean()
race African-American 0.136836 Asian 0.210584 Caucasian 0.198345 Hispanic 0.210041 Native American 0.105383 Other 0.225368 Name: ifnr, dtype: float64
assign_individual_fnr(cp, cal_all, thresh=4.5, std_dev=0.01)
cp.ifnr.mean()
0.1685611311339046
cp.groupby('race').ifnr.mean()
race African-American 0.136599 Asian 0.197959 Caucasian 0.197450 Hispanic 0.207289 Native American 0.114221 Other 0.228528 Name: ifnr, dtype: float64
In this section I explore what it would take to make a test with the same false positive rate for all groups.
def fpr_thresh(df, thresh):
df = df.copy()
df['high'] = df.decile_score >= thresh
matrix_all = crosstab(df, 'two_year_recid', 'high')
fpr, fnr = error_rates(matrix_all)
return fpr
fpr_thresh(cp, 5)
fpr_thresh(black, 5)
fpr_thresh(white, 5)
def sweep_thresh(df):
threshes = range(2,10)
sweep = pd.Series(index=threshes)
for thresh in threshes:
sweep[thresh] = fpr_thresh(df, thresh)
return sweep
plt.axhline(32.25, color='gray')
sweep_thresh(cp).plot(label='All')
sweep_thresh(black).plot(label='Black')
sweep_thresh(white).plot(label='White')
decorate(xlabel='Threshold',
ylabel='False positive rate')
def find_threshold(group, fpr):
series = sweep_thresh(group)
xs = crossing(series.dropna(), fpr)
return xs
all_thresh = find_threshold(cp, 32.35)
black_thresh = find_threshold(black, 32.35)
white_thresh = find_threshold(white, 32.35)
interpolate(calibration_curve(cp), all_thresh)
interpolate(calibration_curve(cp), black_thresh)
interpolate(calibration_curve(black), black_thresh)
interpolate(calibration_curve(cp), white_thresh)
interpolate(calibration_curve(white), white_thresh)
black_male = black[black.sex=='Male']
black_male.shape
black_female = black[black.sex=='Female']
black_female.shape
old_black_female = black_female[black_female.age_cat=='Greater than 45']
old_black_female.shape
old_white_female = cp[(cp.age_cat=='Greater than 45') &
(cp.sex=='Female') &
(cp.race=='Caucasian')]
old_white_female.shape
young_black_male = cp[(cp.age_cat=='Less than 25') &
(cp.sex=='Male') &
(cp.race=='African-American')]
young_black_male.shape
fpr_thresh(cp, 5)
fpr_thresh(black, 5)
fpr_thresh(black_female, 5)
fpr_thresh(old_black_female, 5)
fpr_thresh(black_male, 5)
fpr_thresh(young_black_male, 5)
plt.axhline(32.25, color='gray')
sweep_thresh(black).plot(label='Black', color='gray')
sweep_thresh(black_male).plot(label='Black male')
sweep_thresh(young_black_male).plot(label='Young black male')
decorate(xlabel='Threshold',
ylabel='False positive rate')
plt.axhline(32.25, color='gray')
sweep_thresh(black).plot(label='Black', color='gray')
sweep_thresh(black_female).plot(label='Black female')
sweep_thresh(old_black_female).plot(label='Old black female')
decorate(xlabel='Threshold',
ylabel='False positive rate')
ybm_thresh = find_threshold(young_black_male, 32.35)
obf_thresh = find_threshold(old_black_female, 32.35)
interpolate(calibration_curve(cp), ybm_thresh)
interpolate(calibration_curve(cp), obf_thresh)