Analyzing NYC High School Data

Introduction

The SAT (Scholastic Aptitude Test) is administered during the senior year of high school. Colleges across the country rely on SAT scores as one of their key admission criteria.

We will use data published by the New York City Public schools to investigate some of these questions and explore some of the demographic factors that may be connected to variations in SAT scores. The initial dataset covers 435 highschools with over 100 pieces of information on school performance, programs, and student demographics for each school.

New York City has a significant immigrant population and is very diverse, so one of the aims of this project is to find correlations between SAT scores and demographic factors such as gender, race, income, and other variables.

Conclusions

After analyzing the different features that may influence the SAT score results, we arrived to the following conclusions:

  • From the New York Schools Sat Scores Distributions histogram we have seen that the majority of the schools have a SAT score between 1100 and 1300.
  • SAT score is highly related with total enrollment (N_s and N_p) and safety (saf_t_11 and saf_s_11).
  • Safety average by district interactive map shows us that safest schools which are located in Manhattan and in the nearby regions of Queens and Bronx have a hogher safety score.
  • Hispanic and Black ethnic groups are very disadvantaged in comparision with White and Asian ethnic groups.
  • There is no strong evidence of differences depending on students gender.
  • In general schools with bigger classes tend to have a better SAT score.
  • Prerequisites

    In this project are used the following Python packages:

  • Numpy
  • Pandas
  • Seaborn
  • Matplotlib
  • Warnings
  • Plotly
  • Cufflinks
  • Sklearn
  • Read in the data

    The final data set combines 7 different data sets freely published by New York City in order to find correlations between SAT scores, different demographic measures (for example, gender/ethnicity) or schools satisfaction scores (for example, school safety):

  • SAT scores by school: SAT scores for each high school in New York City.
  • School attendance: Attendance information for each school in New York City.
  • School survey: Surveys of parents, teachers, and students at each school.
  • Class size: Information on class size for each school.
  • AP test results: Advanced Placement (AP) exam.
  • Graduation outcomes: he percentage of students who graduated, and other outcome information.
  • Demographics: Demographic information for each school.
  • In [160]:
    import matplotlib.pyplot as plt
    import warnings
    import seaborn as sns
    import pandas as pd
    import numpy
    import re
    %matplotlib inline
    
    In [161]:
    from plotly.offline import iplot, init_notebook_mode
    import plotly.express as px
    import cufflinks as cf
    
    In [162]:
    from sklearn import preprocessing
    
    In [163]:
    cf.go_offline(connected=True)
    init_notebook_mode(connected=True)
    
    In [164]:
    data_files = [
        "ap_2010.csv",
        "class_size.csv",
        "demographics.csv",
        "graduation.csv",
        "hs_directory.csv",
        "sat_results.csv"
    ]
    
    data = {}
    
    for f in data_files:
        d = pd.read_csv("schools/{0}".format(f))
        data[f.replace(".csv", "")] = d
    

    Read in the surveys

    In [165]:
    all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
    d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
    survey = pd.concat([all_survey, d75_survey], axis=0)
    
    survey["DBN"] = survey["dbn"]
    
    survey_fields = [
        "DBN", 
        "rr_s", 
        "rr_t", 
        "rr_p", 
        "N_s", 
        "N_t", 
        "N_p", 
        "saf_p_11", 
        "com_p_11", 
        "eng_p_11", 
        "aca_p_11", 
        "saf_t_11", 
        "com_t_11", 
        "eng_t_11", 
        "aca_t_11", 
        "saf_s_11", 
        "com_s_11", 
        "eng_s_11", 
        "aca_s_11", 
        "saf_tot_11", 
        "com_tot_11", 
        "eng_tot_11", 
        "aca_tot_11",
    ]
    survey = survey.loc[:,survey_fields]
    data["survey"] = survey
    

    Add DBN columns

    In [166]:
    data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
    
    def pad_csd(num):
        string_representation = str(num)
        if len(string_representation) > 1:
            return string_representation
        else:
            return "0" + string_representation
        
    data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
    data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
    

    Convert columns to numeric

    In [167]:
    cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
    for c in cols:
        data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
    
    data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
    
    def find_lat(loc):
        coords = re.findall("\(.+, .+\)", loc)
        lat = coords[0].split(",")[0].replace("(", "")
        return lat
    
    def find_lon(loc):
        coords = re.findall("\(.+, .+\)", loc)
        lon = coords[0].split(",")[1].replace(")", "").strip()
        return lon
    
    data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
    data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
    
    data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
    data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
    

    Condense datasets

    In [168]:
    class_size = data["class_size"]
    class_size = class_size[class_size["GRADE "] == "09-12"]
    class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
    
    class_size = class_size.groupby("DBN").agg(numpy.mean)
    class_size.reset_index(inplace=True)
    data["class_size"] = class_size
    
    data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
    
    data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
    data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
    

    Convert AP scores to numeric

    In [169]:
    cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
    
    for col in cols:
        data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
    

    Combine the datasets

    In [170]:
    combined = data["sat_results"]
    
    combined = combined.merge(data["ap_2010"], on="DBN", how="left")
    combined = combined.merge(data["graduation"], on="DBN", how="left")
    
    to_merge = ["class_size", "demographics", "survey", "hs_directory"]
    
    for m in to_merge:
        combined = combined.merge(data[m], on="DBN", how="inner")
    
    In [171]:
    plt.figure(figsize=(20, 10))
    sns.heatmap(combined.isnull())
    
    Out[171]:
    <AxesSubplot:>
    In [172]:
    combined.isnull().sum()
    
    Out[172]:
    DBN                                  0
    SCHOOL NAME                          0
    Num of SAT Test Takers               0
    SAT Critical Reading Avg. Score     28
    SAT Math Avg. Score                 28
                                      ... 
    priority09                         363
    priority10                         363
    Location 1                           0
    lat                                  0
    lon                                  0
    Length: 159, dtype: int64

    After creating the combined dataset, the previous heatmap shows us that ther are som columns like priority09 which contain mainly null values. To solve this problem, data imputation technique will be applied.

    Data Imputation

    In [173]:
    combined = combined.fillna(combined.mean())
    combined = combined.fillna(0)
    combined.head(5)
    
    Out[173]:
    DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... priority04 priority05 priority06 priority07 priority08 priority09 priority10 Location 1 lat lon
    0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355.0 404.0 363.0 1122.0 0 129.028846 197.038462 ... Then to Manhattan students or residents Then to New York City residents 0 0 0.0 0.0 0.0 220 Henry Street\nNew York, NY 10002\n(40.7137... 40.713764 -73.985260
    1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383.0 423.0 366.0 1172.0 UNIVERSITY NEIGHBORHOOD H.S. 39.000000 49.000000 ... 0 0 0 0 0.0 0.0 0.0 200 Monroe Street\nNew York, NY 10002\n(40.712... 40.712332 -73.984797
    2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377.0 402.0 370.0 1149.0 EAST SIDE COMMUNITY HS 19.000000 21.000000 ... 0 0 0 0 0.0 0.0 0.0 420 East 12 Street\nNew York, NY 10009\n(40.72... 40.729783 -73.983041
    3 01M509 MARTA VALLE HIGH SCHOOL 44 390.0 433.0 384.0 1207.0 0 129.028846 197.038462 ... 0 0 0 0 0.0 0.0 0.0 145 Stanton Street\nNew York, NY 10002\n(40.72... 40.720569 -73.985673
    4 01M539 NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ... 159 522.0 574.0 525.0 1621.0 NEW EXPLORATIONS SCI,TECH,MATH 255.000000 377.000000 ... 0 0 0 0 0.0 0.0 0.0 111 Columbia Street\nNew York, NY 10002\n(40.7... 40.718725 -73.979426

    5 rows × 159 columns

    Add a school district column for mapping

    In [174]:
    def get_first_two_chars(dbn):
        return dbn[0:2]
    
    combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
    

    New York Schools sat scores distribution

    In [175]:
    plt.figure(figsize=(20, 10))
    sns.set_style("darkgrid")
    ax = sns.distplot(combined['sat_score'])
    ax.set_yticks([]) 
    ax.set_title("New York Schools Sat Scores Distributions", fontweight="bold", fontsize=15,)
    ax.set_xlabel("Sat Score")
    ax.set_ylabel("")
    ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
    sns.despine(left=True)
    

    The New York Schools Sat Scores Distributions histogram shows us that the majority of the schools have a sat score between 1100 and 1300. There are some schools over 1600 sat score, these may be considered as schools with a high sat score.

    Find correlations

    In [176]:
    correlations = combined.corr()
    correlations = correlations["sat_score"]
    print(correlations)
    
    SAT Critical Reading Avg. Score    0.986820
    SAT Math Avg. Score                0.972643
    SAT Writing Avg. Score             0.987771
    sat_score                          1.000000
    AP Test Takers                     0.523140
                                         ...   
    priority08                              NaN
    priority09                              NaN
    priority10                              NaN
    lat                               -0.121029
    lon                               -0.132222
    Name: sat_score, Length: 67, dtype: float64
    

    Plotting survey correlations

    In [177]:
    # Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
    survey_fields.remove("DBN")
    
    In [178]:
    sns.set_style("darkgrid")
    
    In [179]:
    plt.figure(figsize=(20, 20))
    ax = sns.barplot(y=survey_fields, x=correlations[survey_fields], palette='Blues')
    ax.set_title("Sat score correlation with survey fields", fontweight="bold", fontsize=15,)
    ax.set_xlabel("Correlation with SAT Scores",fontsize=12)
    ax.set_ylabel("",fontsize=14)
    ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
    sns.despine(left=True)
    

    As is shown in the previous Sat score correlation with survey fields plot. There are high correlations between sat_score and N_s and N_p. It makes sense, since these columns are correlated with the total_enrollment column.

    Follwed by the safety columns saf_t_11 and saf_s_11 which are high correlated with the sat_score column to.

    Also, another interesting correlation but no with the same importance as the previous ones is the sat_score correlation with rr_s colum. Which shows the student response rate.

    Differences between parent, teacher and students responses

    In [180]:
    df = combined[['DBN'] + survey_fields]
    df
    
    Out[180]:
    DBN rr_s rr_t rr_p N_s N_t N_p saf_p_11 com_p_11 eng_p_11 ... eng_t_11 aca_t_11 saf_s_11 com_s_11 eng_s_11 aca_s_11 saf_tot_11 com_tot_11 eng_tot_11 aca_tot_11
    0 01M292 89.0 70 39 379.000000 26.0 151.0 7.8 7.7 7.4 ... 6.1 6.5 6.000000 5.600000 6.100000 6.700000 6.7 6.2 6.6 7.0
    1 01M448 84.0 95 10 385.000000 37.0 46.0 7.9 7.4 7.2 ... 6.6 7.3 6.000000 5.700000 6.300000 7.000000 6.8 6.3 6.7 7.2
    2 01M450 0.0 98 28 598.208333 42.0 150.0 8.7 8.2 8.1 ... 8.0 8.8 6.611667 6.094722 6.620278 7.381389 7.9 7.9 7.9 8.4
    3 01M509 90.0 100 21 306.000000 29.0 69.0 7.7 7.4 7.2 ... 6.1 6.8 6.400000 5.900000 6.400000 7.000000 6.9 6.2 6.6 7.0
    4 01M539 98.0 68 51 923.000000 67.0 736.0 8.5 7.9 7.9 ... 5.9 7.3 7.300000 6.400000 7.000000 7.700000 7.8 6.7 6.9 7.8
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    358 32K545 80.0 57 34 504.000000 28.0 205.0 8.0 7.5 7.5 ... 3.4 4.8 6.400000 5.800000 6.200000 7.000000 6.7 5.2 5.7 6.5
    359 32K549 73.0 88 41 295.000000 30.0 160.0 8.2 8.1 7.9 ... 8.1 8.5 6.500000 5.900000 6.600000 7.300000 7.3 7.2 7.5 8.1
    360 32K552 59.0 49 31 256.000000 17.0 127.0 7.9 7.9 7.7 ... 7.0 7.1 6.400000 5.700000 6.700000 7.300000 6.9 6.5 7.1 7.4
    361 32K554 88.0 81 38 206.000000 13.0 80.0 9.4 8.6 8.5 ... 9.1 9.6 8.900000 7.100000 8.400000 8.900000 9.3 8.1 8.7 9.1
    362 32K556 67.0 71 35 323.000000 24.0 164.0 8.5 7.9 7.8 ... 7.6 7.7 6.600000 6.000000 6.700000 7.300000 7.4 7.0 7.4 7.7

    363 rows × 23 columns

    In [181]:
    df = df.melt(id_vars='DBN')
    df
    
    Out[181]:
    DBN variable value
    0 01M292 rr_s 89.0
    1 01M448 rr_s 84.0
    2 01M450 rr_s 0.0
    3 01M509 rr_s 90.0
    4 01M539 rr_s 98.0
    ... ... ... ...
    7981 32K545 aca_tot_11 6.5
    7982 32K549 aca_tot_11 8.1
    7983 32K552 aca_tot_11 7.4
    7984 32K554 aca_tot_11 9.1
    7985 32K556 aca_tot_11 7.7

    7986 rows × 3 columns

    In [182]:
    aux = pd.DataFrame(df['variable'].str.split("_").to_list())
    df['question'] = aux[0]
    df['respondent'] = aux[1]
    df = df.drop(columns=['variable'])
    df
    
    Out[182]:
    DBN value question respondent
    0 01M292 89.0 rr s
    1 01M448 84.0 rr s
    2 01M450 0.0 rr s
    3 01M509 90.0 rr s
    4 01M539 98.0 rr s
    ... ... ... ... ...
    7981 32K545 6.5 aca tot
    7982 32K549 8.1 aca tot
    7983 32K552 7.4 aca tot
    7984 32K554 9.1 aca tot
    7985 32K556 7.7 aca tot

    7986 rows × 4 columns

    In [183]:
    df['question'].value_counts(dropna=False)
    
    Out[183]:
    saf    1452
    com    1452
    eng    1452
    aca    1452
    rr     1089
    N      1089
    Name: question, dtype: int64
    In [184]:
    df['respondent'].value_counts(dropna=False)
    
    Out[184]:
    t      2178
    s      2178
    p      2178
    tot    1452
    Name: respondent, dtype: int64
    In [185]:
    df = df[df['respondent'] != 'tot']
    df = df[df['question'] != 'N']
    df = df[df['question'] != 'rr']
    df['respondent'] = df['respondent'].map({
        'p': 'Parent', 
        't':'Teacher', 
        's':'Student'})
    df['question'] = df['question'].map({
        'aca': 'Academic Expectations', 
        'eng': 'Engagement', 
        'com': 'Communication',
        'saf': 'Safety'})
    df = df.reset_index(drop=True)
    df
    
    Out[185]:
    DBN value question respondent
    0 01M292 7.8 Safety Parent
    1 01M448 7.9 Safety Parent
    2 01M450 8.7 Safety Parent
    3 01M509 7.7 Safety Parent
    4 01M539 8.5 Safety Parent
    ... ... ... ... ...
    4351 32K545 7.0 Academic Expectations Student
    4352 32K549 7.3 Academic Expectations Student
    4353 32K552 7.3 Academic Expectations Student
    4354 32K554 8.9 Academic Expectations Student
    4355 32K556 7.3 Academic Expectations Student

    4356 rows × 4 columns

    In [186]:
    plt.figure(figsize=(20, 10))
    sns.set_style("darkgrid")
    ax = sns.barplot(x='question',
                     y='value',
                     hue='respondent',
                     data=df,
                     palette=sns.color_palette("GnBu_d"))
    ax.set_title("Question values by respondent", fontweight="bold", fontsize=15,)
    ax.set_xlabel("Question")
    ax.set_ylabel("Value")
    ax.tick_params(bottom=True, top=False, left=True, right=False, labelbottom=True)
    sns.despine(left=True)
    

    From the Question values by respondent chart, we can see that all perceptions independently of the role of the participant are common. In general students are the ones that rate things slightly lower than teachers and parents. On the other hand, parents are the ones who rate things slightly higher. Teachers are in the middle quite near the student's ratings.

    The only type of question where students and parents ratings are quite the same is Academic Expectations.

    On the other hand, Safety rating seems to be the aspect where they differ the most. That's why in the next section the Safety scores will be deeply explored.

    Exploring safety scores

    In [187]:
    warnings.filterwarnings("ignore",category=FutureWarning)
    combined.iplot(
        x='saf_s_11',
        y='sat_score',
        categories='boro',
        text="SCHOOL NAME",
        vline= {"x":combined["saf_s_11"].mean(),
                "color":"#000000",
                "dash": "dash"
               },
        hline={"y":combined["sat_score"].mean(),
               "color":"#000000",
               "dash": "dash"
              },
        xTitle='Safety score 11',
        yTitle='Satisfaction score avg',
        title='Exploring safety score'
    )
    

    After seeing the Exploring safety score we can conclude that there is a correlation between the sat_score and sad_s_11.

    The majority of the schools have an average saf_s_11 between 5.5 and 7.5 and an average sat_score over 1000 and 1400.

    Also, we can see that there is a high correlation between schools with an extremely high SAT score and sad_s_11 score.

    To sum up, there is a correlation between these indicators but it isn't so strong at may seem at first look.

    In [188]:
    fig = px.scatter(combined[['sat_score','SCHOOL NAME', 'saf_tot_11', 'boro']].dropna(),
                     trendline='ols',
                     x="saf_tot_11", 
                     y="sat_score", 
                     size="saf_tot_11", 
                     color="sat_score", 
                     size_max=15, 
                     hover_name="SCHOOL NAME", 
                     hover_data=['boro', 'saf_tot_11','sat_score'],
                     color_continuous_scale=px.colors.sequential.Viridis, 
                     title='Environment Safety and Respect Perception Correlation with SAT Scores')
    fig.show()
    

    In the Environment Safety and Respect Perception Correlation with SAT Scores plot we can see clearly the postivie correlation between sad_tot_11 and sat_scores with the number of outliers in the upper-right corner.

    On the other hand, the previous plot also shows us that there is a reduced group of schools which have a high safety rank but low SAT scores. In the lower right part of the chart, this schools are mainly from Queens and Brooklyn.

    Mapping safety scores

    In [189]:
    safety_avg_by_district = combined.groupby('school_dist').agg(numpy.mean)
    safety_avg_by_district
    
    Out[189]:
    SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score AP Test Takers Total Exams Taken Number of Exams with scores 3 4 or 5 Total Cohort CSD NUMBER OF STUDENTS / SEATS FILLED ... grade_span_max expgrade_span_max zip total_students number_programs priority08 priority09 priority10 lat lon
    school_dist
    01 441.833333 473.333333 439.333333 1354.500000 116.681090 173.019231 135.800000 93.500000 1.0 115.244241 ... 12.0 12.0 10003.166667 659.500000 1.333333 0.0 0.0 0.0 40.719022 -73.982377
    02 426.619092 444.186256 424.832836 1295.638184 128.908454 201.516827 157.495833 158.647849 2.0 149.818949 ... 12.0 12.0 10023.770833 621.395833 1.416667 0.0 0.0 0.0 40.739699 -73.991386
    03 428.529851 437.997512 426.915672 1293.443035 156.183494 244.522436 193.087500 183.384409 3.0 156.005994 ... 12.0 12.0 10023.750000 717.916667 2.000000 0.0 0.0 0.0 40.781574 -73.977370
    04 402.142857 416.285714 405.714286 1224.142857 129.016484 183.879121 151.035714 113.857143 4.0 132.362265 ... 12.0 12.0 10029.857143 580.857143 1.142857 0.0 0.0 0.0 40.793449 -73.943215
    05 427.159915 438.236674 419.666098 1285.062687 85.722527 115.725275 142.464286 143.677419 5.0 120.623901 ... 12.0 12.0 10030.142857 609.857143 1.142857 0.0 0.0 0.0 40.817077 -73.949251
    06 382.011940 400.565672 382.066269 1164.643881 108.711538 159.715385 105.425000 180.848387 6.0 139.041709 ... 12.0 12.0 10036.200000 628.900000 1.300000 0.0 0.0 0.0 40.848970 -73.932502
    07 376.461538 380.461538 371.923077 1128.846154 73.703402 112.476331 105.276923 105.605459 7.0 97.597416 ... 12.0 12.0 10452.692308 465.846154 1.461538 0.0 0.0 0.0 40.816815 -73.919971
    08 386.214383 395.542741 377.908005 1159.665129 118.379371 168.020979 144.731818 215.510264 8.0 129.765099 ... 12.0 12.0 10467.000000 547.636364 1.272727 0.0 0.0 0.0 40.823803 -73.866087
    09 373.755970 383.582836 374.633134 1131.971940 71.411538 104.265385 98.470000 113.330645 9.0 100.118588 ... 12.0 12.0 10456.100000 449.700000 1.150000 0.0 0.0 0.0 40.836349 -73.906240
    10 403.363636 418.000000 400.863636 1222.227273 132.231206 226.914336 191.618182 161.318182 10.0 168.876526 ... 12.0 12.0 10463.181818 757.863636 1.500000 0.0 0.0 0.0 40.870345 -73.898360
    11 389.866667 394.533333 380.600000 1165.000000 83.813462 122.484615 108.833333 122.866667 11.0 129.031031 ... 12.0 12.0 10467.933333 563.666667 1.533333 0.0 0.0 0.0 40.873138 -73.856120
    12 364.769900 379.109453 357.943781 1101.823134 93.102564 139.442308 153.450000 110.467742 12.0 91.684504 ... 12.0 12.0 10463.166667 409.000000 1.083333 0.0 0.0 0.0 40.831412 -73.886946
    13 409.393800 424.127440 403.666361 1237.187600 232.931953 382.704142 320.773077 224.595533 13.0 218.306055 ... 12.0 12.0 11207.153846 895.153846 2.076923 0.0 0.0 0.0 40.692865 -73.977016
    14 395.937100 398.189765 385.333049 1179.459915 77.798077 114.873626 123.282143 112.347926 14.0 123.643728 ... 12.0 12.0 11210.785714 545.357143 2.000000 0.0 0.0 0.0 40.711599 -73.948360
    15 395.679934 404.628524 390.295854 1190.604312 94.574786 141.581197 153.450000 104.207885 15.0 135.707319 ... 12.0 12.0 11214.222222 573.111111 1.666667 0.0 0.0 0.0 40.675972 -73.989255
    16 371.529851 379.164179 369.415672 1120.109701 82.264423 126.519231 153.450000 247.185484 16.0 177.501282 ... 12.0 12.0 11219.000000 440.250000 1.750000 0.0 0.0 0.0 40.688008 -73.929686
    17 386.571429 394.071429 380.785714 1161.428571 105.583791 163.087912 111.360714 121.357143 17.0 130.246192 ... 12.0 12.0 11220.642857 547.071429 1.642857 0.0 0.0 0.0 40.660313 -73.955636
    18 373.454545 373.090909 371.454545 1118.000000 129.028846 197.038462 153.450000 72.771261 18.0 72.209438 ... 12.0 12.0 11224.000000 344.000000 1.090909 0.0 0.0 0.0 40.641863 -73.914726
    19 367.083333 377.583333 359.166667 1103.833333 88.097756 124.769231 120.670833 114.322581 19.0 105.752625 ... 12.0 12.0 11207.500000 440.416667 1.916667 0.0 0.0 0.0 40.676547 -73.882158
    20 406.223881 465.731343 401.732537 1273.687761 227.805769 359.407692 177.690000 591.374194 20.0 420.029766 ... 12.0 12.0 11210.200000 2521.400000 3.800000 0.0 0.0 0.0 40.626751 -74.006191
    21 395.283582 421.786974 389.242062 1206.312619 135.467657 203.835664 142.377273 275.351906 21.0 224.702989 ... 12.0 12.0 11221.000000 1098.272727 3.272727 0.0 0.0 0.0 40.593596 -73.978465
    22 473.500000 502.750000 474.250000 1450.500000 391.007212 614.509615 370.362500 580.250000 22.0 495.279369 ... 12.0 12.0 11223.000000 2149.000000 2.250000 0.0 0.0 0.0 40.618285 -73.952288
    23 380.666667 398.666667 378.000000 1157.333333 29.000000 31.000000 153.450000 87.000000 23.0 120.113095 ... 12.0 12.0 11219.000000 391.000000 1.333333 0.0 0.0 0.0 40.668586 -73.912298
    24 405.846154 434.000000 402.153846 1242.000000 126.474852 179.094675 115.165385 234.682382 24.0 213.471903 ... 12.0 12.0 11206.153846 962.461538 2.230769 0.0 0.0 0.0 40.740621 -73.911518
    25 437.250000 483.500000 436.250000 1357.000000 205.260817 279.889423 174.793750 268.733871 25.0 280.576007 ... 12.0 12.0 11361.000000 1288.875000 1.875000 0.0 0.0 0.0 40.745414 -73.815558
    26 445.200000 487.600000 444.800000 1377.600000 410.605769 632.407692 392.090000 825.600000 26.0 595.953216 ... 12.0 12.0 11388.600000 2837.400000 4.600000 0.0 0.0 0.0 40.748507 -73.759176
    27 407.800000 422.200000 394.300000 1224.300000 100.611538 145.315385 95.125000 288.961290 27.0 249.324536 ... 12.0 12.0 11556.300000 1072.000000 2.500000 0.0 0.0 0.0 40.638828 -73.807823
    28 445.941655 465.997286 435.908005 1347.846947 182.010490 273.559441 175.336364 351.214076 28.0 255.381164 ... 12.0 12.0 11422.000000 1304.272727 2.545455 0.0 0.0 0.0 40.709344 -73.806367
    29 395.764925 399.457090 386.707836 1181.929851 63.385817 96.514423 135.268750 98.108871 29.0 88.372155 ... 12.0 12.0 11413.625000 474.125000 1.250000 0.0 0.0 0.0 40.685276 -73.752740
    30 430.679934 465.961857 429.740299 1326.382090 157.231838 252.123932 115.150000 310.526882 30.0 251.803744 ... 12.0 12.0 11103.000000 1123.333333 2.555556 0.0 0.0 0.0 40.755398 -73.932306
    31 457.500000 472.500000 452.500000 1382.500000 228.908654 355.111538 194.435000 450.787097 31.0 380.528319 ... 12.0 12.0 10307.100000 1847.500000 5.000000 0.0 0.0 0.0 40.595680 -74.125726
    32 371.500000 385.833333 362.166667 1119.500000 70.342949 100.179487 83.558333 105.333333 32.0 100.525613 ... 12.0 12.0 11231.666667 381.500000 1.000000 0.0 0.0 0.0 40.696295 -73.917124

    32 rows × 67 columns

    In [190]:
    safety_avg_by_district.reset_index(inplace=True)
    
    In [191]:
    import folium
    import branca.colormap as cm
    
    In [192]:
    # NY = [40.693943, -73.985880]
    school_map = folium.Map(location=[40.693943, -73.985880], 
                            #width=500,
                            #height=800,
                            zoom_start=10,
                            tiles='CartoDB positron')
    
    map_title = "Safety average by district"
    title_html = '''
                 <h3 align="center" style="font-size:20px"><b>{}</b></h3>
                 '''.format(map_title)
    school_map.get_root().html.add_child(folium.Element(title_html))
    
    size_parameter = "total_enrollment"
    color_parameter = "saf_s_11"
    
    
    start_col_val = safety_avg_by_district[color_parameter].min()
    end_col_val = safety_avg_by_district[color_parameter].max()
    start_col_hex = "#d01c8b" # used https://colorbrewer2.org/ 
    end_col_hex = "#4dac26"
    colormap = cm.LinearColormap(colors=[start_col_hex,
                                         #"white",
                                         end_col_hex],
                                 vmin=start_col_val,
                                 vmax=end_col_val
                                )
    colormap.caption = "Safety average by district"
    
    for index, row in safety_avg_by_district.iterrows():
        folium.Circle(location=[row["lat"],row["lon"]],
                      color=colormap(row[color_parameter]),
                      fill_color=colormap(row[color_parameter]),
                      fill_opacity=0.75,
                      radius=row[size_parameter],
                      weight=2,
                      tooltip="District "+row["school_dist"]+
                          " <br> Avg. Safety / Respect Score, students response (color): {:.1f}".format(row[color_parameter])+
                          " <br> Avg. enrollment per school (size): {:.0f}".format(row[size_parameter])
                     ).add_to(school_map)
    
    school_map.add_child(colormap)
        
    
    school_map
    
    Out[192]:
    Make this Notebook Trusted to load map: File -> Trust Notebook
    In [193]:
    saf_pivot_df = combined.pivot_table(index=['boro', 'school_dist'], values='saf_s_11', margins='All')
    index = saf_pivot_df.index
    saf_pivot_df
    
    Out[193]:
    saf_s_11
    boro school_dist
    Bronx 07 6.800000
    08 6.364697
    09 6.655000
    10 6.577273
    11 6.186667
    12 7.116667
    Brooklyn 13 6.407692
    14 6.685714
    15 6.177778
    16 5.875000
    17 6.007143
    18 6.190909
    19 6.150000
    20 7.120000
    21 6.390909
    22 7.025000
    23 6.233333
    32 6.766667
    Manhattan 01 6.768611
    02 6.910660
    03 6.716667
    04 6.885714
    05 6.314286
    06 6.950000
    Queens 24 7.107692
    25 6.912500
    26 6.760000
    27 6.390000
    28 6.627273
    29 6.075000
    30 7.033333
    Staten Island 31 6.530000
    All 6.611667
    In [194]:
    saf_pivot_df = saf_pivot_df.reset_index(drop=True)
    
    In [195]:
    df = pd.DataFrame(index.tolist())
    df['saf_s_11'] = saf_pivot_df['saf_s_11']
    
    In [196]:
    df
    
    Out[196]:
    0 1 saf_s_11
    0 Bronx 07 6.800000
    1 Bronx 08 6.364697
    2 Bronx 09 6.655000
    3 Bronx 10 6.577273
    4 Bronx 11 6.186667
    5 Bronx 12 7.116667
    6 Brooklyn 13 6.407692
    7 Brooklyn 14 6.685714
    8 Brooklyn 15 6.177778
    9 Brooklyn 16 5.875000
    10 Brooklyn 17 6.007143
    11 Brooklyn 18 6.190909
    12 Brooklyn 19 6.150000
    13 Brooklyn 20 7.120000
    14 Brooklyn 21 6.390909
    15 Brooklyn 22 7.025000
    16 Brooklyn 23 6.233333
    17 Brooklyn 32 6.766667
    18 Manhattan 01 6.768611
    19 Manhattan 02 6.910660
    20 Manhattan 03 6.716667
    21 Manhattan 04 6.885714
    22 Manhattan 05 6.314286
    23 Manhattan 06 6.950000
    24 Queens 24 7.107692
    25 Queens 25 6.912500
    26 Queens 26 6.760000
    27 Queens 27 6.390000
    28 Queens 28 6.627273
    29 Queens 29 6.075000
    30 Queens 30 7.033333
    31 Staten Island 31 6.530000
    32 All 6.611667
    In [197]:
    df = df.rename(columns={0: "boro", 1: "school_dist"})
    df.drop(df.tail(1).index,inplace=True)
    df
    
    Out[197]:
    boro school_dist saf_s_11
    0 Bronx 07 6.800000
    1 Bronx 08 6.364697
    2 Bronx 09 6.655000
    3 Bronx 10 6.577273
    4 Bronx 11 6.186667
    5 Bronx 12 7.116667
    6 Brooklyn 13 6.407692
    7 Brooklyn 14 6.685714
    8 Brooklyn 15 6.177778
    9 Brooklyn 16 5.875000
    10 Brooklyn 17 6.007143
    11 Brooklyn 18 6.190909
    12 Brooklyn 19 6.150000
    13 Brooklyn 20 7.120000
    14 Brooklyn 21 6.390909
    15 Brooklyn 22 7.025000
    16 Brooklyn 23 6.233333
    17 Brooklyn 32 6.766667
    18 Manhattan 01 6.768611
    19 Manhattan 02 6.910660
    20 Manhattan 03 6.716667
    21 Manhattan 04 6.885714
    22 Manhattan 05 6.314286
    23 Manhattan 06 6.950000
    24 Queens 24 7.107692
    25 Queens 25 6.912500
    26 Queens 26 6.760000
    27 Queens 27 6.390000
    28 Queens 28 6.627273
    29 Queens 29 6.075000
    30 Queens 30 7.033333
    31 Staten Island 31 6.530000
    In [198]:
    import plotly.express as px
    fig = px.bar(df, x='school_dist', y='saf_s_11', color='boro', title='Saf_s_11 by school district')
    fig.show()
    

    From the Safety average by district interactive map, it looks like:

  • High safety scores: Upper Manhattan and parts of Queens.
  • Low safety scores: Brooklyn.
  • From the safety by district table printed and Saf_s_11 by school district interactive bar chart we can exctract more precise conclusions:

  • saf_s_11 scores, range from a low of 5.87 for district 16 in Brooklyn, to a high of 7.12 for district 20, which is also in Brooklyn.
  • Queens and the Bronx have districts near both the bottom and top of the range.
  • The unique district of Staten Island's saf_s_11 score is under the mean of the city.
  • Racial differences in SAT preformance

    In [199]:
    race_percentage_indicators = [
        'white_per', 
        'asian_per', 
        'black_per', 
        'hispanic_per'
    ]
    
    In [200]:
    plt.figure(figsize=(20, 10))
    ax = correlations[race_percentage_indicators].plot.bar()
    ax.set_title("Racial correlation with SAT performance", fontweight="bold", fontsize=15,)
    ax.set_xlabel("Race percentage")
    ax.set_ylabel("Correlation")
    ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
    sns.despine(left=True)
    

    From the Racial correlation with SAT performance it seems like a higher percentage of white or asian students correlates positively with the SAT score averages.

    On the other hand, seems looks like a higher percentage of black and hispanic students correlates negatively with the SAT score averages.

    On this point we have two questions are this correlations due a lack of funding for schools in certain areas, with higher percentage of black and hispanic students or simply white and asian students are more fitting better in the current educational system.

    Hispanic per vs Sat score

    In [201]:
    warnings.filterwarnings("ignore",category=FutureWarning)
    combined.iplot(
        x='hispanic_per',
        y='sat_score',
        categories='boro',
        text="SCHOOL NAME",
        vline= {"x":combined["hispanic_per"].mean(),
                "color":"#000000",
                "dash": "dash"
               },
        hline={"y":combined["sat_score"].mean(),
               "color":"#000000",
               "dash": "dash"
              },
        xTitle='Hispanic %',
        yTitle='Satisfaction score avg',
        title='Hispanic per vs Sat score'
    )
    

    It tseems that there isn't a strong correlation between Hispanic per and Sat score but the previous plot shows clearly that independently of the percentage of the hispanic students, the Sat score is between 1000 and 1400, as we have seen previously.

    What the previous scatter plot shows is that the schools with a percentage of hispanic students between 0% and 20% are the ones with a higher average Satisfaction score.

    On the other hand the extreme cases of the schools with a a hispanic students percentage of 100% are the ones with lower SAT score. This may be due a lack of capital on this schools for hispanic students.

    In [202]:
    combined[combined['hispanic_per'] > 95]['SCHOOL NAME']
    
    Out[202]:
    44                         MANHATTAN BRIDGES HIGH SCHOOL
    82      WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL
    89     GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...
    125                  ACADEMY FOR LANGUAGE AND TECHNOLOGY
    141                INTERNATIONAL SCHOOL FOR LIBERAL ARTS
    176     PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE
    253                            MULTICULTURAL HIGH SCHOOL
    286               PAN AMERICAN INTERNATIONAL HIGH SCHOOL
    Name: SCHOOL NAME, dtype: object

    Schools info

    Schools info. After reading the descriptions of the listed schools, they appear to primarily be geared towards recent immigrants to the US. These schools students are starting the inclusion program by learning English, which would explain the lower SAT scores.

    In [203]:
    combined[(combined['hispanic_per'] < 10) & (combined['sat_score'] > 1800)]['SCHOOL NAME']
    
    Out[203]:
    37                                STUYVESANT HIGH SCHOOL
    151                         BRONX HIGH SCHOOL OF SCIENCE
    187                       BROOKLYN TECHNICAL HIGH SCHOOL
    327    QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...
    356                  STATEN ISLAND TECHNICAL HIGH SCHOOL
    Name: SCHOOL NAME, dtype: object

    The schools listed above are specialized on science and technology, that generally receive extra founding to foment investigation and research. They only admin students after passing an entrance exam.

    There is no explanation why this percentage of hispanic students is so low, but the admision process may explain why the SAT score of this students is higher.

    Gender differences in SAT scores

    In [204]:
    plt.figure(figsize=(20, 10))
    ax = correlations[['male_per', 'female_per']].plot.bar()
    ax.set_title("Gender differences in SAT scores", fontweight="bold", fontsize=15,)
    ax.set_xlabel("Gender",fontsize=12)
    ax.set_ylabel("Correlation",fontsize=12)
    ax.tick_params(bottom=True, top=False, left=False, right=False, labelbottom=True)
    

    In the Gender differences in SAT scores, we can see that females have a high positive correlation with SAT scores and males a high negative correlation.

    Female gender and SAT score

    In [205]:
    warnings.filterwarnings("ignore",category=FutureWarning)
    combined.iplot(
        x='female_per',
        y='sat_score',
        categories='boro',
        text="SCHOOL NAME",
        vline= {"x":combined["female_per"].mean(),
                "color":"#000000",
                "dash": "dash"
               },
        hline={"y":combined["sat_score"].mean(),
               "color":"#000000",
               "dash": "dash"
              },
        xTitle='Female %',
        yTitle='Satisfaction score avg',
        title='Female per vs Sat score'
    )
    

    From the scatter above we can see a strong correlation between Satisfaction Score and Female percentage. But, there is a cluster of schools with a female percentage between 55% and 65% with a high SAT score.

    In [206]:
    combined[(
        combined["female_per"] > 60) & 
        (combined["sat_score"] > 1700)]["SCHOOL NAME"]
    
    Out[206]:
    5                         BARD HIGH SCHOOL EARLY COLLEGE
    26                         ELEANOR ROOSEVELT HIGH SCHOOL
    60                                    BEACON HIGH SCHOOL
    61     FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A...
    302                          TOWNSEND HARRIS HIGH SCHOOL
    Name: SCHOOL NAME, dtype: object

    The shwon cluster in the scatter plot which schools are listed aboved, are formed by schools with high academic standards and specialized on arts. School info example.

    AP Exam Scores and SAT Scores

    In the U.S., high school students take Advanced Placement (AP). exams to earn college credit. There are AP exams for many different subjects.

    It makes sense that the number of students at a school who took AP exams would be highly correlated with the school's SAT scores. Let's explore this relationship. Because total_enrollment is highly correlated with sat_score, we don't want to bias our results. Instead, we'll look at the percentage of students in each school who took at least one AP exam.

    In [207]:
    combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]
    

    AP per and SAT score

    In [208]:
    fig = px.scatter(combined[['sat_score','SCHOOL NAME', 'ap_per', 'boro']].dropna(),
                     trendline='ols',
                     x="ap_per", 
                     y="sat_score", 
                     size="ap_per", 
                     color="sat_score", 
                     size_max=15, 
                     hover_name="SCHOOL NAME", 
                     hover_data=['boro', 'ap_per','sat_score'],
                     color_continuous_scale=px.colors.sequential.Viridis, 
                     title='AP per vs Sat score')
    fig.show()
    

    It looks like there is a relationship between the percentage of students in a school who take the AP exam, and their average SAT scores. It's not an extremely strong correlation, though. There are some cases with a high ap_per and low sat_score and some cases with low ap_per and low sat_score.

    Average Class Size and SAT score

    In [209]:
    combined[['sat_score','SCHOOL NAME','AVERAGE CLASS SIZE', 'boro']]
    
    Out[209]:
    sat_score SCHOOL NAME AVERAGE CLASS SIZE boro
    0 1122.0 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 22.564286 Manhattan
    1 1172.0 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 22.231250 Manhattan
    2 1149.0 EAST SIDE COMMUNITY SCHOOL 21.200000 Manhattan
    3 1207.0 MARTA VALLE HIGH SCHOOL 23.571429 Manhattan
    4 1621.0 NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ... 25.510526 Manhattan
    ... ... ... ... ...
    358 1154.0 EBC HIGH SCHOOL FOR PUBLIC SERVICE–BUSHWICK 22.958824 Brooklyn
    359 1035.0 BUSHWICK SCHOOL FOR SOCIAL JUSTICE 22.760000 Brooklyn
    360 1060.0 ACADEMY OF URBAN PLANNING 23.900000 Brooklyn
    361 1315.0 ALL CITY LEADERSHIP SECONDARY SCHOOL 17.793750 Brooklyn
    362 1055.0 BUSHWICK LEADERS HIGH SCHOOL FOR ACADEMIC EXCE... 25.060000 Brooklyn

    363 rows × 4 columns

    In [210]:
    fig = px.scatter(combined[['sat_score','SCHOOL NAME', 'AVERAGE CLASS SIZE', 'boro']].dropna(),
                     trendline='ols',
                     x="AVERAGE CLASS SIZE", 
                     y="sat_score", 
                     size="AVERAGE CLASS SIZE", 
                     color="sat_score", 
                     size_max=15, 
                     hover_name="SCHOOL NAME", 
                     hover_data=['boro', 'AVERAGE CLASS SIZE','sat_score'],
                     color_continuous_scale=px.colors.sequential.Viridis, 
                     title='Average Class Size vs SAT Scores')
    fig.show()
    

    From the previous scatter plot we can extract that there is a correlation and in general schools with higher average in terms of class size seem to have also a higher average in SAT scores.

    Schools final scores

    In [211]:
    combined.head(5)
    
    Out[211]:
    DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... priority06 priority07 priority08 priority09 priority10 Location 1 lat lon school_dist ap_per
    0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355.0 404.0 363.0 1122.0 0 129.028846 197.038462 ... 0 0 0.0 0.0 0.0 220 Henry Street\nNew York, NY 10002\n(40.7137... 40.713764 -73.985260 01 0.305756
    1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383.0 423.0 366.0 1172.0 UNIVERSITY NEIGHBORHOOD H.S. 39.000000 49.000000 ... 0 0 0.0 0.0 0.0 200 Monroe Street\nNew York, NY 10002\n(40.712... 40.712332 -73.984797 01 0.098985
    2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377.0 402.0 370.0 1149.0 EAST SIDE COMMUNITY HS 19.000000 21.000000 ... 0 0 0.0 0.0 0.0 420 East 12 Street\nNew York, NY 10009\n(40.72... 40.729783 -73.983041 01 0.031773
    3 01M509 MARTA VALLE HIGH SCHOOL 44 390.0 433.0 384.0 1207.0 0 129.028846 197.038462 ... 0 0 0.0 0.0 0.0 145 Stanton Street\nNew York, NY 10002\n(40.72... 40.720569 -73.985673 01 0.351577
    4 01M539 NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ... 159 522.0 574.0 525.0 1621.0 NEW EXPLORATIONS SCI,TECH,MATH 255.000000 377.000000 ... 0 0 0.0 0.0 0.0 111 Columbia Street\nNew York, NY 10002\n(40.7... 40.718725 -73.979426 01 0.158091

    5 rows × 161 columns

    In [212]:
    combined.describe()
    
    Out[212]:
    SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score AP Test Takers Total Exams Taken Number of Exams with scores 3 4 or 5 Total Cohort CSD NUMBER OF STUDENTS / SEATS FILLED ... expgrade_span_max zip total_students number_programs priority08 priority09 priority10 lat lon ap_per
    count 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 ... 363.0 363.000000 363.000000 363.000000 363.0 363.0 363.0 363.000000 363.000000 363.000000
    mean 404.119403 419.656716 399.662687 1223.438806 129.028846 197.038462 153.450000 193.870968 14.019284 171.762429 ... 12.0 10724.719008 777.556474 1.834711 0.0 0.0 0.0 40.743533 -73.925345 0.210503
    std 57.486468 64.804541 59.187101 178.223775 175.816044 310.125423 239.117292 224.190938 9.251380 184.956860 ... 0.0 536.599348 815.206633 1.680307 0.0 0.0 0.0 0.081881 0.071633 0.167423
    min 279.000000 312.000000 286.000000 887.000000 6.000000 6.000000 6.000000 1.000000 1.000000 10.000000 ... 12.0 10002.000000 110.000000 1.000000 0.0 0.0 0.0 40.528229 -74.192154 0.011765
    25% 371.000000 378.000000 365.000000 1113.000000 42.000000 56.000000 76.000000 82.000000 7.000000 80.311189 ... 12.0 10301.000000 395.000000 1.000000 0.0 0.0 0.0 40.687143 -73.979256 0.069868
    50% 397.000000 408.000000 390.000000 1193.000000 129.028846 197.038462 153.450000 111.000000 12.000000 105.166667 ... 12.0 10472.000000 477.000000 1.000000 0.0 0.0 0.0 40.741205 -73.929154 0.172805
    75% 417.500000 439.000000 411.500000 1266.500000 129.028846 197.038462 153.450000 193.870968 21.000000 153.784188 ... 12.0 11216.500000 648.000000 2.000000 0.0 0.0 0.0 40.819472 -73.890124 0.307213
    max 679.000000 735.000000 682.000000 2096.000000 2117.000000 3692.000000 2687.000000 1209.000000 32.000000 1329.052632 ... 12.0 11694.000000 5458.000000 10.000000 0.0 0.0 0.0 40.888373 -73.726910 0.877747

    8 rows × 68 columns

    In [213]:
    correlations = combined.corr()
    correlations = correlations["sat_score"]
    correlations
    
    Out[213]:
    SAT Critical Reading Avg. Score    0.986820
    SAT Math Avg. Score                0.972643
    SAT Writing Avg. Score             0.987771
    sat_score                          1.000000
    AP Test Takers                     0.523140
                                         ...   
    priority09                              NaN
    priority10                              NaN
    lat                               -0.121029
    lon                               -0.132222
    ap_per                             0.057171
    Name: sat_score, Length: 68, dtype: float64

    To compute the Schools Final Score from the correlations array, all the columns with a positive correlation bigger than the threshold (threshold=0.35) will be used.

    In [214]:
    positivie_correlations = correlations[correlations > 0.35]
    positivie_correlations
    
    Out[214]:
    SAT Critical Reading Avg. Score         0.986820
    SAT Math Avg. Score                     0.972643
    SAT Writing Avg. Score                  0.987771
    sat_score                               1.000000
    AP Test Takers                          0.523140
    Total Exams Taken                       0.514333
    Number of Exams with scores 3 4 or 5    0.463245
    NUMBER OF STUDENTS / SEATS FILLED       0.394626
    NUMBER OF SECTIONS                      0.362673
    AVERAGE CLASS SIZE                      0.381014
    total_enrollment                        0.367857
    asian_num                               0.475445
    asian_per                               0.570730
    white_num                               0.449559
    white_per                               0.620718
    female_num                              0.388631
    N_s                                     0.423463
    N_p                                     0.421530
    total_students                          0.407827
    Name: sat_score, dtype: float64

    Before that, we will remove the columns that contain repeted information or may affect on the final score, this columns are:

    • asian_num
    • white_num
    In [215]:
    positivie_correlations = positivie_correlations.drop(labels=['asian_num', 'white_num'])
    
    In [216]:
    aux = combined[positivie_correlations.index]
    aux
    
    Out[216]:
    SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score AP Test Takers Total Exams Taken Number of Exams with scores 3 4 or 5 NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE total_enrollment asian_per white_per female_num N_s N_p total_students
    0 355.0 404.0 363.0 1122.0 129.028846 197.038462 153.45 88.000000 4.000000 22.564286 422 14.0 1.7 163.0 379.000000 151.0 323.0
    1 383.0 423.0 366.0 1172.0 39.000000 49.000000 10.00 105.687500 4.750000 22.231250 394 29.2 2.3 168.0 385.000000 46.0 299.0
    2 377.0 402.0 370.0 1149.0 19.000000 21.000000 153.45 57.600000 2.733333 21.200000 598 9.7 10.4 271.0 598.208333 150.0 649.0
    3 390.0 433.0 384.0 1207.0 129.028846 197.038462 153.45 69.642857 3.000000 23.571429 367 9.3 1.6 197.0 306.000000 69.0 401.0
    4 522.0 574.0 525.0 1621.0 255.000000 377.000000 191.00 156.368421 6.157895 25.510526 1613 27.8 44.9 819.0 923.000000 736.0 1725.0
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    358 384.0 409.0 361.0 1154.0 47.000000 64.000000 13.00 150.941176 6.411765 22.958824 606 1.7 0.7 290.0 504.000000 205.0 513.0
    359 348.0 353.0 334.0 1035.0 129.028846 197.038462 153.45 71.066667 3.266667 22.760000 434 0.5 1.8 224.0 295.000000 160.0 416.0
    360 342.0 364.0 354.0 1060.0 76.000000 100.000000 10.00 102.375000 4.312500 23.900000 398 1.5 1.0 181.0 256.000000 127.0 313.0
    361 428.0 465.0 422.0 1315.0 7.000000 8.000000 153.45 66.937500 3.812500 17.793750 263 5.3 1.5 126.0 206.000000 80.0 333.0
    362 347.0 358.0 350.0 1055.0 34.000000 35.000000 18.00 132.333333 5.400000 25.060000 462 0.9 0.0 216.0 323.000000 164.0 362.0

    363 rows × 17 columns

    In [217]:
    aux.describe()
    
    Out[217]:
    SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score AP Test Takers Total Exams Taken Number of Exams with scores 3 4 or 5 NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE total_enrollment asian_per white_per female_num N_s N_p total_students
    count 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000
    mean 404.119403 419.656716 399.662687 1223.438806 129.028846 197.038462 153.450000 171.762429 6.491468 24.794077 793.035813 10.149587 8.430028 392.983471 598.208333 247.721763 777.556474
    std 57.486468 64.804541 59.187101 178.223775 175.816044 310.125423 239.117292 184.956860 5.956961 3.555077 865.298394 14.716465 14.030145 426.246744 667.764179 283.769637 815.206633
    min 279.000000 312.000000 286.000000 887.000000 6.000000 6.000000 6.000000 10.000000 1.000000 10.000000 141.000000 0.000000 0.000000 0.000000 70.000000 12.000000 110.000000
    25% 371.000000 378.000000 365.000000 1113.000000 42.000000 56.000000 76.000000 80.311189 3.366071 22.642708 386.500000 1.300000 1.000000 183.000000 291.000000 102.000000 395.000000
    50% 397.000000 408.000000 390.000000 1193.000000 129.028846 197.038462 153.450000 105.166667 4.375000 24.966667 471.000000 3.800000 2.100000 244.000000 374.000000 175.000000 477.000000
    75% 417.500000 439.000000 411.500000 1266.500000 129.028846 197.038462 153.450000 153.784188 6.296703 27.157222 669.000000 10.750000 8.900000 374.500000 552.500000 279.000000 648.000000
    max 679.000000 735.000000 682.000000 2096.000000 2117.000000 3692.000000 2687.000000 1329.052632 41.157895 37.023077 5332.000000 89.500000 82.100000 2315.000000 4768.000000 2804.000000 5458.000000

    Now we have a dataset with the most signigicant columns. But as it is shown in the previous description this columns values are in different ranges, that's why we previously importes Sklearn preprocessing, now all this dataset values will be scaled to values between 0 and 1.

    In [218]:
    x = aux.values #returns a numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)
    
    In [219]:
    df
    
    Out[219]:
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
    0 0.1900 0.217494 0.194444 0.194376 0.058280 0.051828 0.054998 0.059133 0.074705 0.464947 0.054132 0.156425 0.020706 0.070410 0.065773 0.049785 0.039828
    1 0.2600 0.262411 0.202020 0.235732 0.015632 0.011666 0.001492 0.072543 0.093381 0.452622 0.048738 0.326257 0.028015 0.072570 0.067050 0.012178 0.035340
    2 0.2450 0.212766 0.212121 0.216708 0.006158 0.004069 0.054998 0.036087 0.043163 0.414461 0.088037 0.108380 0.126675 0.117063 0.112433 0.049427 0.100785
    3 0.2775 0.286052 0.247475 0.264682 0.058280 0.051828 0.054998 0.045216 0.049803 0.502216 0.043537 0.103911 0.019488 0.085097 0.050234 0.020415 0.054413
    4 0.6075 0.619385 0.603535 0.607113 0.117954 0.100651 0.069004 0.110965 0.128440 0.573973 0.283568 0.310615 0.546894 0.353780 0.181567 0.259312 0.301982
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    358 0.2625 0.229314 0.189394 0.220844 0.019422 0.015735 0.002611 0.106850 0.134762 0.479547 0.089578 0.018994 0.008526 0.125270 0.092380 0.069126 0.075355
    359 0.1725 0.096927 0.121212 0.122415 0.058280 0.051828 0.054998 0.046296 0.056444 0.472189 0.056444 0.005587 0.021924 0.096760 0.047893 0.053009 0.057218
    360 0.1575 0.122931 0.171717 0.143093 0.033160 0.025502 0.001492 0.070031 0.082487 0.514375 0.049509 0.016760 0.012180 0.078186 0.039591 0.041189 0.037958
    361 0.3725 0.361702 0.343434 0.354012 0.000474 0.000543 0.054998 0.043165 0.070036 0.288411 0.023502 0.059218 0.018270 0.054428 0.028948 0.024355 0.041698
    362 0.1700 0.108747 0.161616 0.138958 0.013264 0.007868 0.004476 0.092743 0.109567 0.557301 0.061838 0.010056 0.000000 0.093305 0.053853 0.054441 0.047120

    363 rows × 17 columns

    In [220]:
    df.describe()
    
    Out[220]:
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
    count 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000 363.000000
    mean 0.312799 0.254508 0.287027 0.278279 0.058280 0.051828 0.054998 0.122635 0.136747 0.547461 0.125609 0.113403 0.102680 0.169755 0.112433 0.084428 0.124824
    std 0.143716 0.153202 0.149462 0.147414 0.083286 0.084136 0.089190 0.140219 0.148338 0.131557 0.166692 0.164430 0.170891 0.184124 0.142138 0.101637 0.152432
    min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
    25% 0.230000 0.156028 0.199495 0.186931 0.017054 0.013565 0.026110 0.053304 0.058919 0.467849 0.047293 0.014525 0.012180 0.079050 0.047041 0.032235 0.053291
    50% 0.295000 0.226950 0.262626 0.253102 0.058280 0.051828 0.054998 0.072148 0.084043 0.553848 0.063572 0.042458 0.025579 0.105400 0.064708 0.058381 0.068624
    75% 0.346250 0.300236 0.316919 0.313896 0.058280 0.051828 0.054998 0.109006 0.131897 0.634910 0.101715 0.120112 0.108404 0.161771 0.102703 0.095630 0.100598
    max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
    In [221]:
    df['final_score'] = df.sum(axis=1)
    df['final_score'] = (df['final_score'] * 10) / 17
    df
    
    Out[221]:
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 final_score
    0 0.1900 0.217494 0.194444 0.194376 0.058280 0.051828 0.054998 0.059133 0.074705 0.464947 0.054132 0.156425 0.020706 0.070410 0.065773 0.049785 0.039828 1.186626
    1 0.2600 0.262411 0.202020 0.235732 0.015632 0.011666 0.001492 0.072543 0.093381 0.452622 0.048738 0.326257 0.028015 0.072570 0.067050 0.012178 0.035340 1.292734
    2 0.2450 0.212766 0.212121 0.216708 0.006158 0.004069 0.054998 0.036087 0.043163 0.414461 0.088037 0.108380 0.126675 0.117063 0.112433 0.049427 0.100785 1.263724
    3 0.2775 0.286052 0.247475 0.264682 0.058280 0.051828 0.054998 0.045216 0.049803 0.502216 0.043537 0.103911 0.019488 0.085097 0.050234 0.020415 0.054413 1.303027
    4 0.6075 0.619385 0.603535 0.607113 0.117954 0.100651 0.069004 0.110965 0.128440 0.573973 0.283568 0.310615 0.546894 0.353780 0.181567 0.259312 0.301982 3.397787
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    358 0.2625 0.229314 0.189394 0.220844 0.019422 0.015735 0.002611 0.106850 0.134762 0.479547 0.089578 0.018994 0.008526 0.125270 0.092380 0.069126 0.075355 1.258947
    359 0.1725 0.096927 0.121212 0.122415 0.058280 0.051828 0.054998 0.046296 0.056444 0.472189 0.056444 0.005587 0.021924 0.096760 0.047893 0.053009 0.057218 0.936425
    360 0.1575 0.122931 0.171717 0.143093 0.033160 0.025502 0.001492 0.070031 0.082487 0.514375 0.049509 0.016760 0.012180 0.078186 0.039591 0.041189 0.037958 0.939801
    361 0.3725 0.361702 0.343434 0.354012 0.000474 0.000543 0.054998 0.043165 0.070036 0.288411 0.023502 0.059218 0.018270 0.054428 0.028948 0.024355 0.041698 1.258644
    362 0.1700 0.108747 0.161616 0.138958 0.013264 0.007868 0.004476 0.092743 0.109567 0.557301 0.061838 0.010056 0.000000 0.093305 0.053853 0.054441 0.047120 0.991267

    363 rows × 18 columns

    In [222]:
    df['final_score'].describe()
    
    Out[222]:
    count    363.000000
    mean       1.728054
    std        1.060487
    min        0.589832
    25%        1.116121
    50%        1.331321
    75%        1.903177
    max        8.611356
    Name: final_score, dtype: float64

    Now let's add the final_score column to the previous combined dataframe.

    In [223]:
    combined['final_score'] = df['final_score']
    combined
    
    Out[223]:
    DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... priority07 priority08 priority09 priority10 Location 1 lat lon school_dist ap_per final_score
    0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355.0 404.0 363.0 1122.0 0 129.028846 197.038462 ... 0 0.0 0.0 0.0 220 Henry Street\nNew York, NY 10002\n(40.7137... 40.713764 -73.985260 01 0.305756 1.186626
    1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383.0 423.0 366.0 1172.0 UNIVERSITY NEIGHBORHOOD H.S. 39.000000 49.000000 ... 0 0.0 0.0 0.0 200 Monroe Street\nNew York, NY 10002\n(40.712... 40.712332 -73.984797 01 0.098985 1.292734
    2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377.0 402.0 370.0 1149.0 EAST SIDE COMMUNITY HS 19.000000 21.000000 ... 0 0.0 0.0 0.0 420 East 12 Street\nNew York, NY 10009\n(40.72... 40.729783 -73.983041 01 0.031773 1.263724
    3 01M509 MARTA VALLE HIGH SCHOOL 44 390.0 433.0 384.0 1207.0 0 129.028846 197.038462 ... 0 0.0 0.0 0.0 145 Stanton Street\nNew York, NY 10002\n(40.72... 40.720569 -73.985673 01 0.351577 1.303027
    4 01M539 NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ... 159 522.0 574.0 525.0 1621.0 NEW EXPLORATIONS SCI,TECH,MATH 255.000000 377.000000 ... 0 0.0 0.0 0.0 111 Columbia Street\nNew York, NY 10002\n(40.7... 40.718725 -73.979426 01 0.158091 3.397787
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    358 32K545 EBC HIGH SCHOOL FOR PUBLIC SERVICE–BUSHWICK 88 384.0 409.0 361.0 1154.0 EBC-HS FOR PUB SERVICE (BUSH) 47.000000 64.000000 ... 0 0.0 0.0 0.0 1155 Dekalb Avenue\nBrooklyn, NY 11221\n(40.69... 40.694482 -73.929154 32 0.077558 1.258947
    359 32K549 BUSHWICK SCHOOL FOR SOCIAL JUSTICE 75 348.0 353.0 334.0 1035.0 0 129.028846 197.038462 ... 0 0.0 0.0 0.0 400 Irving Avenue\nBrooklyn, NY 11237\n(40.696... 40.696962 -73.910816 32 0.297301 0.936425
    360 32K552 ACADEMY OF URBAN PLANNING 67 342.0 364.0 354.0 1060.0 Academy of Urban Planning 76.000000 100.000000 ... 0 0.0 0.0 0.0 400 Irving Avenue\nBrooklyn, NY 11237\n(40.696... 40.696962 -73.910816 32 0.190955 0.939801
    361 32K554 ALL CITY LEADERSHIP SECONDARY SCHOOL 39 428.0 465.0 422.0 1315.0 All City Leadership Secondary School 7.000000 8.000000 ... Then to New York City residents 0.0 0.0 0.0 321 Palmetto Street\nBrooklyn, NY 11237\n(40.6... 40.697408 -73.913153 32 0.026616 1.258644
    362 32K556 BUSHWICK LEADERS HIGH SCHOOL FOR ACADEMIC EXCE... 23 347.0 358.0 350.0 1055.0 Bushwick Leaders High School for Academic Exce... 34.000000 35.000000 ... 0 0.0 0.0 0.0 797 Bushwick Avenue\nBrooklyn, NY 11221\n(40.6... 40.694996 -73.927986 32 0.073593 0.991267

    363 rows × 162 columns

    In [224]:
    combined.loc[combined['final_score'] > 5]
    
    Out[224]:
    DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... priority07 priority08 priority09 priority10 Location 1 lat lon school_dist ap_per final_score
    37 02M475 STUYVESANT HIGH SCHOOL 832 679.0 735.0 682.0 2096.0 STUYVESANT HS 1510.0 2819.0 ... 0 0.0 0.0 0.0 345 Chambers Street\nNew York, NY 10282\n(40.7... 40.717746 -74.014049 02 0.457992 7.327012
    61 03M485 FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A... 531 566.0 564.0 577.0 1707.0 FIORELLO H.LAGUARDIA HS 691.0 1236.0 ... 0 0.0 0.0 0.0 100 Amsterdam Avenue\nNew York, NY 10023\n(40.... 40.773671 -73.985269 03 0.265259 5.108604
    151 10X445 BRONX HIGH SCHOOL OF SCIENCE 731 632.0 688.0 649.0 1969.0 BRONX HS OF SCIENCE 1190.0 2435.0 ... 0 0.0 0.0 0.0 75 West 205 Street\nBronx, NY 10468\n(40.87705... 40.877056 -73.889780 10 0.394955 6.613325
    187 13K430 BROOKLYN TECHNICAL HIGH SCHOOL 1277 587.0 659.0 587.0 1833.0 BROOKLYN TECHNICAL HS 2117.0 3692.0 ... 0 0.0 0.0 0.0 29 Ft Greene Place\nBrooklyn, NY 11217\n(40.68... 40.688107 -73.976745 13 0.397037 8.611356
    276 22K405 MIDWOOD HIGH SCHOOL 824 478.0 519.0 476.0 1473.0 MIDWOOD HS 745.0 1223.0 ... 0 0.0 0.0 0.0 2839 Bedford Avenue\nBrooklyn, NY 11210\n(40.6... 40.633335 -73.952916 22 0.193909 5.718548
    304 26Q415 BENJAMIN N. CARDOZO HIGH SCHOOL 888 480.0 545.0 489.0 1514.0 BENJAMIN N. CARDOZO 676.0 1145.0 ... 0 0.0 0.0 0.0 57-00 223Rd Street\nOakland Gardens, NY 11364\... 40.752392 -73.756083 26 0.172802 5.427189
    305 26Q430 FRANCIS LEWIS HIGH SCHOOL 934 468.0 539.0 467.0 1474.0 FRANCIS LEWIS HS 697.0 1033.0 ... 0 0.0 0.0 0.0 58-20 Utopia Parkway\nFresh Meadows, NY 11365\... 40.740556 -73.792848 26 0.167871 5.008304
    322 28Q440 FOREST HILLS HIGH SCHOOL 762 456.0 497.0 454.0 1407.0 FOREST HILLS HS 733.0 1116.0 ... 0 0.0 0.0 0.0 67-01 110 Street\nForest Hills, NY 11375\n(40.... 40.729432 -73.845629 28 0.191184 5.021383

    8 rows × 162 columns

    In [225]:
    combined.loc[combined['final_score'].idxmax()]
    
    Out[225]:
    DBN                                                        13K430
    SCHOOL NAME                        BROOKLYN TECHNICAL HIGH SCHOOL
    Num of SAT Test Takers                                       1277
    SAT Critical Reading Avg. Score                               587
    SAT Math Avg. Score                                           659
                                                    ...              
    lat                                                       40.6881
    lon                                                      -73.9767
    school_dist                                                    13
    ap_per                                                   0.397037
    final_score                                               8.61136
    Name: 187, Length: 162, dtype: object

    As we can see following our sclae, only 8 schools have a final sat score bigger than 5. The school with the best final sat score is BROOKLYN TECHNICAL HIGH SCHOOL. The next chart shows all the final sat score results for each school.

    In [226]:
    fig = px.bar(combined, x='SCHOOL NAME', y='final_score', title='Final score by school')
    fig.show()
    
    In [227]:
    combined['final_score'].describe()
    
    Out[227]:
    count    363.000000
    mean       1.728054
    std        1.060487
    min        0.589832
    25%        1.116121
    50%        1.331321
    75%        1.903177
    max        8.611356
    Name: final_score, dtype: float64

    Conclusions

    This project combined and analyzed 7 data sets that contained information no only about SAT scores, also included schools of NY and student's demographic factors. The main objective of this project was to analyze and find which are the factors that influence the most performances in SATs.

    From the New York Schools Sat Scores Distributions histogram we have seen that the majority of the schools have an SAT score between 1100 and 1300. Also, there is a small group with SAT scores over 1600, this is considered schools with a very high SAT score.

    From the correlations plot we concluded that columns like total_enrollment have a good correlation with SAT score but the most important conclusion is that there is a high correlation between SAT score and the saf_t_11 and saf_s_11 columns which refer to the safety ranks of the schools.

    After that, we studied the difference in the results depending on the respondent. In the Question values by respondent chart, we can see that all perceptions independently of the role of the participant are common. In general students are the ones that rate things slightly lower than teachers and parents. On the other hand, parents are the ones who rate things slightly higher. Teachers are in the middle quite near the student's ratings. Safety rating seems to be the aspect where students and parents differ the most, knowing that and the high correlation between safety and SAT scores we decided to explore deeply Safety Scores.

    From the safety exploration and the Safety average by district interactive map, we have seen that safety influences the average Sat Score, there are no high SAT scores for schools with low safety scores. But there are schools with a low SAT score and a normal/high safety score and low values of other factors. In the Safety average by district interactive map we have seen that safest schools are located in Manhattan and in the nearby regions of Queens and Bronx.

    In terms of ethnic groups, we have seen that Hispanic and black ethnic groups are very disadvantaged in comparison with White and Asian ethnic groups. In general, they form or formed part of schools with poor academic standards and low safety ranks.

    We have seen that there is no strong evidence depending on students gender. Also, there is no evidence that AP exam preparation influences in the SAT scores.

    In the last step, we have seen that there is a relationship between average class size and SAT scores, schools with bigger classes tend to have a better SAT score (also there are exceptions).

    Finally, we have used the features correlations values with SAT_score and a threshold of 0.35 to determine which are the most important columns. After reducing this list of features we used the Scikit Learn Preprocessing MinMaxScaler to normalize all these values in a range of 0 and 1. After we summed all the results a divided it by the total num of columns. Last, the results were multiplied by 10 to convert it in a 0 to 10 scale. That's how we created the final_score for each school.