#!/usr/bin/env python
# coding: utf-8

# In[11]:

# Data Science Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
get_ipython().run_line_magic('matplotlib', 'inline')

# Project Helper Files
from constants import *


# # Quantifying the *School to Prison Pipeline* -- a high-level analysis of discipline-based inequities in America's Public Schools
# 
# ## Introduction
# 
# America's 96,000 public schools provide the foundation for the next generation's education and, ultimately, future success.
# 
# Yet educational researchers and practitioners worry that America's public schools do not provide all students an equal opportunity for that success.
# 
# Some stakeholders feel the inequitable treatment of students -- particularly poor and minority students -- is so severe that they've coined the term *School to Prison Pipeline*. That is, __some schools push students not toward success, but toward the criminal justice system.__
# 
# ## Overview of Investigation Foci
# 
# The following three questions center on exploring this prominent and debated concept of "The School to Prison Pipeline" in US schools, generally defined as *the disproportionate tendency of minors and young adults from disadvantaged backgrounds to become incarcerated because of increasingly harsh school and municipal policies.*
# 
# Discussions of "The School to Prison Pipeline" and its causes largely center on a few interrelated factors:
# 1. __Disproportionate suspensions__, particularly of black and disabled students.
# 2. __Police presence on campus__, especially in contrast to a lack of counselors and mental health professionals, meaning troubled students may be more likely to be received by *law enforcement* than by counselors trained in helping students cope, reform, and stay in school.
# 3. __"Zero tolerance" policies that increase suspension and expulsion rates__, as schools deliver harsh discipline for even first-time or relatively minor offenses in an effort to stop disruptive behavior before it gets worse.
# 
# We'll explore these key factors via the school-level (disaggregated) data on all 96,000 US public schools from the 2015-16 CRDC. Because causality cannot be determined from this data alone, __this investigation centers on the extent and nature of these key measures at a national level__.

# # Overview of the Dataset
# ## The Civil Rights Data Collection, school-level data for 2015-2016
# 
# The Civil Rights Data Collection (CRDC) is a biennial survey required by the U.S. Department of Education's (Department) Office for Civil Rights (OCR) since 1968. (Note, however, that survey content changes over time.)
# 
# The 2015-16 CRDC (the most recent year published) collects data from all public local educational agencies (LEAs, i.e. *School Districts*) and schools, including
# - long-term secure juvenile justice facilities
# - charter schools
# - alternative schools
# - and schools serving students with disabilities
# 
# with a response rate of 99.8% from __17,337 LEAs and 96,360 schools__. Specifically, I will be looking at the finer-grained *data disaggregated by school*.
# 
# ### Data content and format
# Each school (row) in the dataset includes 1,800 columns (typically a student count disaggregated by race and gender for some school measure) covering 32 general topics, comprising a 460 MB CSV. __The topics I investigate use only about 50 columns pertaining to suspensions, expulsions, and school population, and I will look only at white, black, and hispanic students,__ who form the majority of students at nearly all schools.
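# To make the CRDC naming convention concrete, here is a minimal sketch of how a column list
# like `COLS_WITH_NEEDED_DATA` (imported above from the `constants` helper) could be assembled.
# The real list lives in that helper file; the measure names below are just the ones used later
# in this notebook, shown for illustration.

# In[ ]:

# Illustrative only: disaggregated CRDC columns follow the pattern '<MEASURE>_<RACE>_<SEX>'.
example_measures = ['SCH_ENR',                # enrollment count
                    'SCH_DISCWODIS_MULTOOS',  # >1 out-of-school suspension, students w/o disabilities
                    'SCH_DISCWODIS_EXPZT']    # expulsions under zero-tolerance policies
example_cols = [f'{measure}_{race}_{sex}'
                for measure in example_measures
                for race in ['BL', 'WH', 'HI']
                for sex in ['M', 'F']]
print(len(example_cols), example_cols[:3])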
# ### Known issues with dataset
# - From the CRDC report: *"An important consideration for response rates is that the reporting process requires all schools and LEAs to respond to each survey item on the CRDC. Some LEAs, that did not have complete data, __reported a zero value__. It is not possible to determine all possible situations where this may have occurred. As such, it may be the case that the item response rates may be positively biased. For the large majority of CRDC survey items, __the rate of missing data ranged from 0-5% of reported values__."*
# - In the separate __Data Prep notebook__, IDs in the original dataset with more than 10 significant digits had been cast to scientific-notation strings. __I recreated the IDs__ for those rows, which are a combination of state, LEA, and school identifiers.
# - CRDC does not provide lat-long coordinates, so [I merged in school lat-longs from the National Center for Education Statistics](https://nces.ed.gov/programs/edge/Geographic/SchoolLocations) in the Data Prep notebook. When schools did not have lat-long coordinates available, I assigned them the coordinates of their school district (LEA) office.

# ## Load All Schools Data (after downloading)
# [Find this data in this Google Drive folder.](https://drive.google.com/file/d/1LVRwFWpevvqGvMLlNt95BXmLVmBXfG0a/view?usp=sharing)

# In[5]:

DATA_FILE = 'data (download CSVs here)/crdc-data-with-lat-long.csv'

crdc_data = pd.read_csv(
    DATA_FILE,
    usecols=COLS_WITH_NEEDED_DATA,
    low_memory=False,
    encoding="ISO-8859-1"
)


# ### Geographic distribution of all schools
# To get a sense for the geographic distribution of the 96,000 schools, we can plot their lat-long coordinates. Schools are colored simply by their latitude.
# 
# Note Alaska and Hawaii faintly on the left, with lower population densities.

# In[6]:

plt.scatter(x=crdc_data['LON1516'], y=crdc_data['LAT1516'],
            c=crdc_data['LAT1516'], s=0.001, cmap='ocean')
plt.show()


# # Analytic Questions
# # Question 1: How disproportionate are suspensions across race and gender?

# In[17]:

df = crdc_data


# In[18]:

RACES = ['BL', 'WH', 'HI']
SEXES = ['M', 'F']
POP_LOWER_BOUND = 50  # Remove demographic populations (e.g. white male) smaller than this threshold

# 1. Plotting

def plot_measure_across_all_demographics(df, calculation, measure, bounds=[0, 1]):
    """Plot one map per race/sex combination, coloring each school by the given measure column."""
    figure_num = 0
    plt.figure(figsize=(20, 6))
    for sex in SEXES:
        for race in RACES:
            figure_num += 1
            likelihood = f'{calculation}_{measure}_{race}_{sex}'
            curr_dem_data = df[pd.notnull(df[likelihood])]
            plt.subplot(len(SEXES), len(RACES), figure_num)
            plt.scatter(x=curr_dem_data['LON1516'], y=curr_dem_data['LAT1516'],
                        c=curr_dem_data[likelihood], s=1, alpha=1, cmap='coolwarm')
            plt.title(f'{race}_{sex}, avg: {round(curr_dem_data[likelihood].mean(), 2)}, n: {curr_dem_data[likelihood].count()}')
            plt.colorbar()
            plt.clim(*bounds)
            plt.axis('off')
    plt.subplots_adjust(wspace=0.8, hspace=0.6)
    plt.show()

# 2. Calculations

# Iterative function which appends likelihood columns to the df for all demographics.
# The 'comparison_race' parameter lets you compare how many times as likely the first race
# is to be affected as the comparison race.
def calculate_likelihood_comparisons(df, measure, comparison_race=None,
                                     races=RACES, sexes=SEXES, lower_bound=POP_LOWER_BOUND):
    df = remove_schools_with_pop_less_than(df, lower_bound)
    for sex in sexes:
        for race in races:
            df = calculate_likelihood_comparison(df, measure, race, sex, comparison_race, sex)
    return df

def remove_schools_with_pop_less_than(df, lower_bound):
    # Keep only schools where every demographic count column meets the threshold
    filter_col_df = df[DEMOGRAPHIC_COUNT_COLS]
    filtered_df = filter_col_df[filter_col_df >= lower_bound].dropna()
    return df.merge(filtered_df)

def calculate_likelihood_comparison(df, measure, race, sex, comparison_race, comparison_sex):
    likelihood = get_percentage_affected(df, measure, race, sex)
    column_name = f'PERCENT_AFFECTED_{measure}_{race}_{sex}'
    if comparison_race:
        likelihood = likelihood / get_percentage_affected(df, measure, comparison_race, comparison_sex)
        column_name = f'LH_COMPARED_TO_{comparison_race}_FOR_{measure}_{race}_{sex}'
    likelihood = likelihood[(likelihood != np.inf) & (pd.notnull(likelihood)) & (likelihood > 0)]  # Keep only finite, non-null, positive values
    return df.merge(
        likelihood.to_frame(column_name),
        how='left',
        left_index=True,
        right_index=True,
    )

def get_percentage_affected(df, measure, race, sex):
    affected = f'{measure}_{race}_{sex}'  # e.g. 'SCH_DISCWODIS_MULTOOS_BL_M'
    pop_total = f'SCH_ENR_{race}_{sex}'   # e.g. 'SCH_ENR_BL_M'
    return df[affected] / df[pop_total]


# ### What percentage of each demographic population receives suspensions?

# In[20]:

data = calculate_likelihood_comparisons(df, 'SCH_DISCWODIS_MULTOOS')  # "more than one out-of-school suspension"

calculation = 'PERCENT_AFFECTED'
measure = 'SCH_DISCWODIS_MULTOOS'
plot_measure_across_all_demographics(data, calculation, measure, bounds=[0, 0.15])


# ### How many times more likely to be suspended are black and latino students than white students?
# For the above "percent of population affected" plots, it's near impossible to compare the severity *across a single school*. One way to zero in on this is to color schools by *how much more likely* a certain population is to be affected __compared to its white counterpart population__. This measure might reveal schools where, even if one demographic is severely affected, so were other demographics.
# 
# How likelihood comparisons are calculated:
# 
# `Percent of X pop affected / Percent of White counterpart population affected`

# In[21]:

data = calculate_likelihood_comparisons(df, 'SCH_DISCWODIS_MULTOOS', comparison_race='WH')  # "more than one out-of-school suspension"

calculation = 'LH_COMPARED_TO_WH_FOR'
measure = 'SCH_DISCWODIS_MULTOOS'
plot_measure_across_all_demographics(data, calculation, measure, bounds=[1, 4])


# # Question 2: What impact have "Zero Tolerance" policies had on expulsion?
# 
# All analyses look at "students without disabilities expelled on zero tolerance policy".

# ### Zero Tolerance policies impact 2% of white students, but about *4% and 5%* of black and latino students, respectively, at schools that gave out at least one Zero Tolerance-based expulsion for that demographic.

# In[22]:

measure = 'SCH_DISCWODIS_EXPZT'
calculation = 'PERCENT_AFFECTED'
data = calculate_likelihood_comparisons(df, measure, lower_bound=1)
plot_measure_across_all_demographics(data, calculation, measure, bounds=[0, 0.09])
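# A small hedged check of the headline Zero Tolerance figures above: the average percent of each
# demographic affected, across the schools kept by the `lower_bound=1` filter. This sketch simply
# averages the `PERCENT_AFFECTED_*` columns created by `calculate_likelihood_comparisons`; it is
# not one of the original analysis cells.

# In[ ]:

for race in RACES:
    for sex in SEXES:
        col = f'{calculation}_{measure}_{race}_{sex}'
        print(col, round(data[col].mean(), 3))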
# # Question 3: Does the proportion of Campus Counselors (psychologists) vs Campus Police Officers correlate with suspensions?
# 
# ### Data notes:
# 
# 1. It's possible to have *fractional* staff recorded if they are not full-time.
# 
# 2. __A data entry, system-level error__ on the form filled in by schools meant that only __22,000 schools (25,000 after corrections)__ correctly entered the number of Law Enforcement Officers on campus. Where this system error occurred, the value is `-9`. As a result, there are only __17,500 schools with *both* Police and Counselor counts__ (a quick count check appears after the helper functions below). We'll *investigate Juvenile Justice Facilities separately*: first, because we'd expect different counselor and police presence there, and second because none of them actually recorded the number of police.
# 
# 3. #### Where are the police data?
# Notably, none of the 608 Juvenile Justice facilities have a count entered for Police officers. This may be because the police staff at JJ facilities do not map to the categories on the survey.
# 
# We *could* try to use __Security Guards__ as a proxy for Police. However, only 64 of the 608 JJ facilities have data for both Counselors *and* Security Guards, so we've removed all JJ schools from the following analysis.

# In[23]:

POLICE = 'SCH_FTESECURITY_LEO'
COUNSELORS = 'SCH_FTESERVICES_PSY'
SECURITY_GUARDS = 'SCH_FTESECURITY_GUA'
SUSPENSIONS = 'TOTAL_SUSPENSIONS'

def plot_ratio_to_students_of(
        data=crdc_data,
        x_name=POLICE,
        y_name=COUNSELORS,
        xlabel=None,
        ylabel=None,
        xlim=None,
        ylim=None,
        dot_size=0.5,
        show_hist=False
):
    # Filter out any negative numbers, which signal data errors
    schools_with_correctly_documented_staff = (data[[y_name, x_name]] >= 0).all(axis='columns')
    staff_df = data.loc[schools_with_correctly_documented_staff]

    # Get ratio to student population
    total_pop = staff_df['TOT_ENR_M'] + staff_df['TOT_ENR_F']
    staff_to_students_df = pd.DataFrame()
    staff_to_students_df[y_name] = staff_df[y_name] / total_pop
    staff_to_students_df[x_name] = staff_df[x_name] / total_pop

    # Total Suspensions, for dot coloring
    staff_to_students_df[SUSPENSIONS] = staff_df[['TOT_DISCWODIS_MULTOOS_M', 'TOT_DISCWODIS_MULTOOS_F']].sum(
        axis='columns') / total_pop

    # Remove anomalies
    staff_to_students_df = staff_to_students_df[staff_to_students_df[SUSPENSIONS] > 0]

    plt.scatter(x=staff_to_students_df[x_name], y=staff_to_students_df[y_name],
                alpha=0.9, s=dot_size, c=staff_to_students_df[SUSPENSIONS], cmap='coolwarm')
    plt.clim(0, 0.05)
    plt.ylabel(ylabel)
    plt.xlabel(xlabel)
    plt.title(f'{xlabel} vs {ylabel}, color=Long Term Suspension Percentage, n={len(staff_to_students_df)}')
    plt.colorbar()
    axes = plt.gca()
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    plt.show()

    if show_hist:
        # Hist of x axis
        plot_hist(staff_to_students_df, x_name, xlabel)
        # Hist of y axis
        plot_hist(staff_to_students_df, y_name, ylabel)

    return staff_to_students_df

def plot_hist(data, name, label, x_range=[0, 0.008]):
    plt.hist(data[name], range=x_range, bins=100)
    plt.xlabel(label)
    plt.ylabel('Number of Schools')
    plt.title(f'Distribution of {label}, n={len(data)}, mean={round(data[name].mean(), 3)}')
    plt.grid(True)
    plt.show()
    return
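# As referenced in data note 2 above, a quick hedged count check: how many schools have a usable
# (non-negative) Law Enforcement Officer count, and how many have both Police and Counselor counts.
# A minimal sketch using the column constants just defined; the corrected figures quoted in the
# notes come from the CRDC documentation, not from this cell.

# In[ ]:

valid_police = crdc_data[POLICE] >= 0
valid_counselors = crdc_data[COUNSELORS] >= 0
print('Schools with a usable Police (LEO) count:', int(valid_police.sum()))
print('Schools with both Police and Counselor counts:', int((valid_police & valid_counselors).sum()))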
# # Non Juvenile Justice Schools:
# ## *Long Term Suspensions correlate with Police Presence*
# 
# Plotting Campus Police against Campus Counselors per student, and coloring schools by suspension rate, we see a hotspot of suspensions at schools with low counselor levels and high police levels.
# 
# You can also see how __schools seem to hire with a predetermined ratio in mind__: consistent ratio lines jut outward, most notably one marking a 1-to-1 ratio.
# 
# This supports existing intuition in the "School to Prison Pipeline" concept -- *schools with high suspension rates are more likely to confront troubled kids with police than with trained counselors*. However, __it remains unclear whether high levels of suspension-worthy activity triggered hiring more Police, or whether increased Police (and a lack of counseling) escalate suspension counts.__
# 
# This leads to two questions we can investigate: __does police presence alone correlate with suspensions, and does it worsen when combined with low counseling?__

# In[24]:

staff_to_students_df = plot_ratio_to_students_of(
    x_name=POLICE,
    y_name=COUNSELORS,
    xlabel='Police/student',
    ylabel='Counselor/student',
    xlim=[0, 0.01],
    ylim=[0, 0.01],
    show_hist=True
)


# ## *Does a lack of counselors alone account for increased suspensions?*
# 
# ### Interestingly, schools with the recommended ratio of Counselors to Students actually have *3x the rate of suspensions*
# 
# This, of course, may result from schools with high rates of suspensions and misbehaviour attempting to mediate those rates by adding counselors. But further research is required.

# In[25]:

insufficient_counselors = staff_to_students_df[staff_to_students_df[COUNSELORS] < 0.004]
plot_hist(insufficient_counselors, SUSPENSIONS,
          label=f'Suspension Rates when Counselors/Student < 0.004', x_range=[0, 0.2])

sufficient_counselors = staff_to_students_df[staff_to_students_df[COUNSELORS] >= 0.004]
plot_hist(sufficient_counselors, SUSPENSIONS,
          label=f'Suspension Rates when Counselors/Student >= 0.004', x_range=[0, 0.2])


# ## *Do high levels of Police alone correlate with increased suspensions?*
# 
# ### Here, results line up predictably with the STPP: schools with a Police/Student ratio >= 0.004 actually have *~10x the rate of suspensions* of those with less police presence
# 
# This, of course, may result from schools with high rates of suspensions and misbehaviour responding by adding police. But further research is required.
# 
# Further research is also necessary to identify which schools have such high police presence, given that there are only about 1,000 of them.

# In[26]:

# Note: the histograms plot the suspension-rate column (SUSPENSIONS) for each police-presence group.
low_police = staff_to_students_df[staff_to_students_df[POLICE] < 0.004]
plot_hist(low_police, SUSPENSIONS,
          label=f'Suspension Rates when Police/Student < 0.004', x_range=[0, 0.2])

high_police = staff_to_students_df[staff_to_students_df[POLICE] >= 0.004]
plot_hist(high_police, SUSPENSIONS,
          label=f'Suspension Rates when Police/Student >= 0.004', x_range=[0, 0.2])


# ## Q: *What's the correlation with having more Police than Counselors?*
# ## A: A 2x increase in suspensions: from 0.001 to 0.002

# In[148]:

more_police = staff_to_students_df[staff_to_students_df[COUNSELORS] < staff_to_students_df[POLICE]]
plot_hist(more_police, SUSPENSIONS,
          label=f'Suspension Rates when Police/Student > Counselor/Student', x_range=[0, 0.2])

less_police = staff_to_students_df[staff_to_students_df[COUNSELORS] >= staff_to_students_df[POLICE]]
plot_hist(less_police, SUSPENSIONS,
          label=f'Suspension Rates when Police/Student <= Counselor/Student', x_range=[0, 0.2])
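# As a rough significance check on the comparison above, the cell below applies the `ttest_ind`
# imported at the top of this notebook to the two groups' suspension rates. This is a hedged
# sketch, not part of the original analysis: it assumes a Welch t-test on school-level rates is a
# reasonable first pass, which heavily skewed rate data may violate.

# In[ ]:

t_stat, p_value = ttest_ind(
    more_police[SUSPENSIONS].dropna(),
    less_police[SUSPENSIONS].dropna(),
    equal_var=False  # Welch's t-test: don't assume equal variances between the groups
)
print(f'Welch t-test, more-police vs fewer-police schools: t={round(t_stat, 2)}, p={p_value:.2g}')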
# # Future questions to investigate
# 
# ### Digging deeper into Zero Tolerance Policies
# With proper data on *which schools implement zero-tolerance policies* (rather than just knowing which schools have given at least one expulsion via those policies), we could more accurately measure the impact of these policies.
# 
# *Does zero-tolerance policy adoption correlate with other factors? Do ZT policy schools have a higher rate of expulsions in general?*
# 
# 
# ## *By introducing previous years' data, what trends do we see in critical STPP measures?*
# Of particular interest: can we see schools increasing police presence at the cost of counselors? And does that correspond to changes in suspension rates?
# 
# Further, we could take differences between years to identify schools that underwent large changes, and investigate them further for signs of the cause (a rough sketch of this year-over-year approach appears at the end of the notebook).
# 
# ## Expanding my analysis to more demographic and discipline categories
# More obviously, this investigation looked only at students from three races, without disabilities, and only at long-term suspensions. More rigorous analysis would include more expansive demographics as well as comparison against and aggregation with short-term suspensions.
# 
# ## Incorporate Socio-Economic Status and Test Scores
# [The NYT's Upshot did a data-vis article comparing race, academic performance, and income](https://www.nytimes.com/interactive/2016/04/29/upshot/money-race-and-success-how-your-school-district-compares.html) utilizing the Stanford Education Data Archive (SEDA). The positive trend between wealth and performance was striking, as was how wealth seems not to neutralize race-based gaps in achievement -- *do these patterns hold for discipline as well?*

# In[ ]:
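# A rough sketch of the year-over-year idea above. Nothing here runs against real data: the
# earlier-year CRDC file is not part of this download, and the `id_col` argument is a placeholder
# for whatever school identifier the prepped files share.

# In[ ]:

def year_over_year_change(current_df, previous_df, id_col, measure_col):
    """Return per-school change in `measure_col` between two prepped CRDC years (illustrative sketch)."""
    merged = current_df.merge(previous_df, on=id_col, suffixes=('_curr', '_prev'))
    merged[f'{measure_col}_CHANGE'] = merged[f'{measure_col}_curr'] - merged[f'{measure_col}_prev']
    # Schools with the largest swings (e.g. in SCH_FTESECURITY_LEO) would be candidates for closer study
    return merged.sort_values(f'{measure_col}_CHANGE', ascending=False)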