Analysing NYC High School Data

by Raghav_A

New York City has a significant immigrant population and is very diverse, so comparing demographic factors such as race, income, and gender with SAT scores is a good way to examine whether the SAT is a fair test. Survey data from NYC schools also lets us compare SAT scores with school-safety scores, average class size, and the number of AP test takers. Let's see if we can find some useful correlations.

Read in the data

In [61]:
import pandas as pd
import numpy
import re
import matplotlib.pyplot as plt

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

Read in the surveys

In [2]:
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey

Add DBN columns

In [3]:
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
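
As a quick check of the padding step, the sketch below applies the same idea to a few made-up district numbers and school codes (the values are illustrative, not taken from class_size.csv). It uses `str.zfill`, which should behave like `pad_csd` for non-negative integers.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from class_size.csv.
sample = pd.DataFrame({"CSD": [1, 9, 10, 32],
                       "SCHOOL CODE": ["M015", "X505", "X141", "K549"]})

# zfill(2) left-pads with zeros to a width of 2, like pad_csd.
sample["padded_csd"] = sample["CSD"].astype(str).str.zfill(2)
sample["DBN"] = sample["padded_csd"] + sample["SCHOOL CODE"]
print(sample["DBN"].tolist())
```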

Convert columns to numeric

In [4]:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

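def find_lat(loc):
    # Raw string so that "\(" is passed to the regex engine unchanged
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    # Raw string so that "\(" is passed to the regex engine unchanged
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon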

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
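
To see what find_lat and find_lon pull out, here is a small sketch on a made-up "Location 1" value. The address and coordinates are invented; only the trailing "(lat, lon)" layout is assumed from the code above. The helper `find_lat_lon` is hypothetical and uses one regex with capture groups instead of split/replace.

```python
import re

# Invented example; assumes the field ends with "(lat, lon)".
loc = "123 Example Avenue\nBrooklyn, NY 11225\n(40.670299, -73.961648)"

def find_lat_lon(text):
    """Return (lat, lon) as floats, or (None, None) if no pair is found."""
    match = re.search(r"\((-?\d+(?:\.\d+)?), (-?\d+(?:\.\d+)?)\)", text)
    if match is None:
        return None, None
    return float(match.group(1)), float(match.group(2))

lat, lon = find_lat_lon(loc)
print(lat, lon)
```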

Condense datasets

In [5]:
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
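
Condensing matters because the later merges join on DBN: if a dataset has several rows per school, the join repeats that school once per row. A toy sketch of the filter-then-average step (column names mirror class_size, the rows are made up):

```python
import pandas as pd

# Made-up rows: each school appears more than once.
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448", "01M448"],
    "GRADE ": ["09-12", "09-12", "09-12", "06"],
    "PROGRAM TYPE": ["GEN ED", "GEN ED", "GEN ED", "GEN ED"],
    "AVERAGE CLASS SIZE": [22.0, 26.0, 30.0, 18.0],
})

# Keep high-school general-education rows only, then average per school.
hs = toy[(toy["GRADE "] == "09-12") & (toy["PROGRAM TYPE"] == "GEN ED")]
per_school = hs.groupby("DBN", as_index=False)["AVERAGE CLASS SIZE"].mean()
print(per_school)
```

After this step each DBN appears exactly once, which is what the merge on DBN relies on.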

Convert AP scores to numeric

In [6]:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
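
With errors="coerce", any value that cannot be parsed becomes NaN instead of raising an error. A minimal sketch, assuming suppressed entries are stored as a placeholder string such as "s" (an assumption about the raw files, not something shown above):

```python
import pandas as pd

raw = pd.Series(["39", "s", "255"])  # "s" is an assumed placeholder value
parsed = pd.to_numeric(raw, errors="coerce")

print(parsed.tolist())            # the unparseable entry becomes NaN
print(int(parsed.isna().sum()))   # number of coerced values
```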

Combine the datasets

In [7]:
combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

# Fill missing numeric values with the column mean (text columns are skipped)
combined = combined.fillna(combined.mean(numeric_only=True))
combined = combined.fillna(0)
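
The join type decides which schools survive: a left join keeps every row of the left table and fills gaps with NaN, while an inner join drops schools missing from either side. A toy sketch (the DBN values and scores are made up):

```python
import pandas as pd

sat = pd.DataFrame({"DBN": ["A", "B", "C"],
                    "sat_score": [1200.0, 1100.0, 1300.0]})
ap = pd.DataFrame({"DBN": ["A", "C"], "AP Test Takers ": [40.0, 90.0]})
survey = pd.DataFrame({"DBN": ["A", "B"], "saf_s_11": [6.5, 7.1]})

left = sat.merge(ap, on="DBN", how="left")        # keeps A, B, C; B gets NaN
both = left.merge(survey, on="DBN", how="inner")  # keeps only A and B

# Same imputation idea as above: column means first, then 0 as a fallback.
filled = both.fillna(both.mean(numeric_only=True)).fillna(0)
print(filled)
```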

Add a school district column for mapping

In [8]:
def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

Find correlations

In [180]:
correlations = combined.corr()
correlations['sat_score']
Out[180]:
SAT Critical Reading Avg. Score    0.986820
SAT Math Avg. Score                0.972643
SAT Writing Avg. Score             0.987771
sat_score                          1.000000
AP Test Takers                     0.523140
                                     ...   
priority09                              NaN
priority10                              NaN
lat                               -0.121029
lon                               -0.132222
ap_per                             0.057171
Name: sat_score, Length: 68, dtype: float64
In [399]:
correlations[abs(correlations['sat_score'])>0.25]['sat_score'].sort_values()
Out[399]:
frl_percent                            -0.722225
sped_percent                           -0.448170
ell_percent                            -0.398750
hispanic_per                           -0.396985
black_per                              -0.284139
N_t                                     0.291463
saf_t_11                                0.313810
SIZE OF LARGEST CLASS                   0.314434
saf_tot_11                              0.318753
Total Cohort                            0.325144
male_num                                0.325520
saf_s_11                                0.337639
aca_s_11                                0.339435
NUMBER OF SECTIONS                      0.362673
total_enrollment                        0.367857
AVERAGE CLASS SIZE                      0.381014
female_num                              0.388631
NUMBER OF STUDENTS / SEATS FILLED       0.394626
total_students                          0.407827
N_p                                     0.421530
N_s                                     0.423463
white_num                               0.449559
Number of Exams with scores 3 4 or 5    0.463245
asian_num                               0.475445
Total Exams Taken                       0.514333
AP Test Takers                          0.523140
asian_per                               0.570730
white_per                               0.620718
SAT Math Avg. Score                     0.972643
SAT Critical Reading Avg. Score         0.986820
SAT Writing Avg. Score                  0.987771
sat_score                               1.000000
Name: sat_score, dtype: float64

Plotting survey correlations

There are several fields in the combined dataset that originally came from a survey of parents, teachers, and students. I will make a bar plot of the correlations between these fields and sat_score. This lets me dive deeper into the fields that have a high correlation with sat_score.

In [387]:
combined.corr()[survey_fields].loc['sat_score',:].sort_values().plot.bar()
Out[387]:
<matplotlib.axes._subplots.AxesSubplot at 0x35ac5a3b88>

Immediately, it can be observed that the absolute correlation of the survey fields with sat_score varies from about 0.02 (almost no correlation) to about 0.4 (moderate correlation). Any correlation with an absolute value above 0.25 can be considered 'interesting' and deserves a deeper dive. Next, I isolate only those survey fields whose absolute correlation with sat_score exceeds 0.25.
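
Filtering on the absolute value keeps strong negative correlations as well as strong positive ones. A small sketch using four r values copied from the correlation output earlier in this notebook (frl_percent and lon are not survey fields; they are included only to show the effect of the sign):

```python
import pandas as pd

# r values with sat_score, copied from the correlation output.
r = pd.Series({"frl_percent": -0.722225, "saf_s_11": 0.337639,
               "N_t": 0.291463, "lon": -0.132222})

# Keep entries with |r| > 0.25, sorted from most negative to most positive.
interesting = r[r.abs() > 0.25].sort_values()
print(interesting)
```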

Rather than rely on the r (correlation) value alone, it is better to plot the two fields being compared as a scatterplot. Doing so shows whether there is actually a relationship, or whether the r value is an artifact of a few influential outliers.
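
To illustrate why the scatterplot matters, the sketch below uses made-up numbers in which eight points have no linear relationship, yet adding a single extreme point pushes r above 0.9.

```python
import numpy as np

# Eight made-up points with no linear relationship.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([5, 3, 6, 4, 5, 3, 6, 4], dtype=float)
r_without = np.corrcoef(x, y)[0, 1]

# The same points plus one influential outlier at (30, 30).
r_with = np.corrcoef(np.append(x, 30.0), np.append(y, 30.0))[0, 1]

print(round(r_without, 3), round(r_with, 3))
```

A scatterplot of the second dataset would show the lone point immediately, whereas the r value alone would not.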

In [300]:
# A function that makes multiple scatterplots in a single figure (3 per row)
def scatter_plots_multiple(colnames_list):
    numcols = len(colnames_list)
    numrows = (numcols + 2) // 3  # rows needed for a 3-column grid
    fig = plt.figure(figsize = (15, numrows*5))

    for i, colname in enumerate(colnames_list, start=1):
        # size the grid to the number of plots instead of a fixed 3x3
        ax = fig.add_subplot(numrows, 3, i)
        ax.scatter(combined[colname], combined['sat_score'])
        ax.set_title(colname + ' vs "sat_score"')
    plt.show()
In [301]:
# Compute the correlations once, then keep fields with |r| > 0.25
survey_corr = combined.corr().loc['sat_score', survey_fields]
survey_sat_corr = survey_corr[survey_corr.abs() > 0.25]
scatter_plots_multiple(survey_sat_corr.index)

The Safety and Respect score based on student responses (the saf_s_11 field) seems to show the clearest relationship with sat_score among the survey fields. While most schools are clustered around safety scores of 6-7, schools with a higher student-rated safety score seem to have higher mean sat_score values as well.

Let's dive deeper into this survey field, and try to visualise which districts have higher average safety scores and which ones don't.

In [10]:
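# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
# This cell has to run before the survey-correlation cells above, which index corr() by survey_fields.
if "DBN" in survey_fields:
    survey_fields.remove("DBN")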

saf_s_11 safety scores by District

In [209]:
# district-wise average scores
district = combined.groupby('school_dist').agg(numpy.mean)
district
Out[209]:
SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score AP Test Takers Total Exams Taken Number of Exams with scores 3 4 or 5 Total Cohort CSD NUMBER OF STUDENTS / SEATS FILLED ... expgrade_span_max zip total_students number_programs priority08 priority09 priority10 lat lon ap_per
school_dist
01 441.833333 473.333333 439.333333 1354.500000 116.681090 173.019231 135.800000 93.500000 1.0 115.244241 ... 12.0 10003.166667 659.500000 1.333333 0.0 0.0 0.0 40.719022 -73.982377 0.192551
02 426.619092 444.186256 424.832836 1295.638184 128.908454 201.516827 157.495833 158.647849 2.0 149.818949 ... 12.0 10023.770833 621.395833 1.416667 0.0 0.0 0.0 40.739699 -73.991386 0.265711
03 428.529851 437.997512 426.915672 1293.443035 156.183494 244.522436 193.087500 183.384409 3.0 156.005994 ... 12.0 10023.750000 717.916667 2.000000 0.0 0.0 0.0 40.781574 -73.977370 0.267818
04 402.142857 416.285714 405.714286 1224.142857 129.016484 183.879121 151.035714 113.857143 4.0 132.362265 ... 12.0 10029.857143 580.857143 1.142857 0.0 0.0 0.0 40.793449 -73.943215 0.246798
05 427.159915 438.236674 419.666098 1285.062687 85.722527 115.725275 142.464286 143.677419 5.0 120.623901 ... 12.0 10030.142857 609.857143 1.142857 0.0 0.0 0.0 40.817077 -73.949251 0.161767
06 382.011940 400.565672 382.066269 1164.643881 108.711538 159.715385 105.425000 180.848387 6.0 139.041709 ... 12.0 10036.200000 628.900000 1.300000 0.0 0.0 0.0 40.848970 -73.932502 0.220879
07 376.461538 380.461538 371.923077 1128.846154 73.703402 112.476331 105.276923 105.605459 7.0 97.597416 ... 12.0 10452.692308 465.846154 1.461538 0.0 0.0 0.0 40.816815 -73.919971 0.170719
08 386.214383 395.542741 377.908005 1159.665129 118.379371 168.020979 144.731818 215.510264 8.0 129.765099 ... 12.0 10467.000000 547.636364 1.272727 0.0 0.0 0.0 40.823803 -73.866087 0.249342
09 373.755970 383.582836 374.633134 1131.971940 71.411538 104.265385 98.470000 113.330645 9.0 100.118588 ... 12.0 10456.100000 449.700000 1.150000 0.0 0.0 0.0 40.836349 -73.906240 0.175797
10 403.363636 418.000000 400.863636 1222.227273 132.231206 226.914336 191.618182 161.318182 10.0 168.876526 ... 12.0 10463.181818 757.863636 1.500000 0.0 0.0 0.0 40.870345 -73.898360 0.153976
11 389.866667 394.533333 380.600000 1165.000000 83.813462 122.484615 108.833333 122.866667 11.0 129.031031 ... 12.0 10467.933333 563.666667 1.533333 0.0 0.0 0.0 40.873138 -73.856120 0.170508
12 364.769900 379.109453 357.943781 1101.823134 93.102564 139.442308 153.450000 110.467742 12.0 91.684504 ... 12.0 10463.166667 409.000000 1.083333 0.0 0.0 0.0 40.831412 -73.886946 0.265387
13 409.393800 424.127440 403.666361 1237.187600 232.931953 382.704142 320.773077 224.595533 13.0 218.306055 ... 12.0 11207.153846 895.153846 2.076923 0.0 0.0 0.0 40.692865 -73.977016 0.180886
14 395.937100 398.189765 385.333049 1179.459915 77.798077 114.873626 123.282143 112.347926 14.0 123.643728 ... 12.0 11210.785714 545.357143 2.000000 0.0 0.0 0.0 40.711599 -73.948360 0.217193
15 395.679934 404.628524 390.295854 1190.604312 94.574786 141.581197 153.450000 104.207885 15.0 135.707319 ... 12.0 11214.222222 573.111111 1.666667 0.0 0.0 0.0 40.675972 -73.989255 0.181893
16 371.529851 379.164179 369.415672 1120.109701 82.264423 126.519231 153.450000 247.185484 16.0 177.501282 ... 12.0 11219.000000 440.250000 1.750000 0.0 0.0 0.0 40.688008 -73.929686 0.309973
17 386.571429 394.071429 380.785714 1161.428571 105.583791 163.087912 111.360714 121.357143 17.0 130.246192 ... 12.0 11220.642857 547.071429 1.642857 0.0 0.0 0.0 40.660313 -73.955636 0.209731
18 373.454545 373.090909 371.454545 1118.000000 129.028846 197.038462 153.450000 72.771261 18.0 72.209438 ... 12.0 11224.000000 344.000000 1.090909 0.0 0.0 0.0 40.641863 -73.914726 0.396711
19 367.083333 377.583333 359.166667 1103.833333 88.097756 124.769231 120.670833 114.322581 19.0 105.752625 ... 12.0 11207.500000 440.416667 1.916667 0.0 0.0 0.0 40.676547 -73.882158 0.200646
20 406.223881 465.731343 401.732537 1273.687761 227.805769 359.407692 177.690000 591.374194 20.0 420.029766 ... 12.0 11210.200000 2521.400000 3.800000 0.0 0.0 0.0 40.626751 -74.006191 0.150214
21 395.283582 421.786974 389.242062 1206.312619 135.467657 203.835664 142.377273 275.351906 21.0 224.702989 ... 12.0 11221.000000 1098.272727 3.272727 0.0 0.0 0.0 40.593596 -73.978465 0.206245
22 473.500000 502.750000 474.250000 1450.500000 391.007212 614.509615 370.362500 580.250000 22.0 495.279369 ... 12.0 11223.000000 2149.000000 2.250000 0.0 0.0 0.0 40.618285 -73.952288 0.215706
23 380.666667 398.666667 378.000000 1157.333333 29.000000 31.000000 153.450000 87.000000 23.0 120.113095 ... 12.0 11219.000000 391.000000 1.333333 0.0 0.0 0.0 40.668586 -73.912298 0.063672
24 405.846154 434.000000 402.153846 1242.000000 126.474852 179.094675 115.165385 234.682382 24.0 213.471903 ... 12.0 11206.153846 962.461538 2.230769 0.0 0.0 0.0 40.740621 -73.911518 0.185186
25 437.250000 483.500000 436.250000 1357.000000 205.260817 279.889423 174.793750 268.733871 25.0 280.576007 ... 12.0 11361.000000 1288.875000 1.875000 0.0 0.0 0.0 40.745414 -73.815558 0.205119
26 445.200000 487.600000 444.800000 1377.600000 410.605769 632.407692 392.090000 825.600000 26.0 595.953216 ... 12.0 11388.600000 2837.400000 4.600000 0.0 0.0 0.0 40.748507 -73.759176 0.124673
27 407.800000 422.200000 394.300000 1224.300000 100.611538 145.315385 95.125000 288.961290 27.0 249.324536 ... 12.0 11556.300000 1072.000000 2.500000 0.0 0.0 0.0 40.638828 -73.807823 0.150687
28 445.941655 465.997286 435.908005 1347.846947 182.010490 273.559441 175.336364 351.214076 28.0 255.381164 ... 12.0 11422.000000 1304.272727 2.545455 0.0 0.0 0.0 40.709344 -73.806367 0.215716
29 395.764925 399.457090 386.707836 1181.929851 63.385817 96.514423 135.268750 98.108871 29.0 88.372155 ... 12.0 11413.625000 474.125000 1.250000 0.0 0.0 0.0 40.685276 -73.752740 0.211378
30 430.679934 465.961857 429.740299 1326.382090 157.231838 252.123932 115.150000 310.526882 30.0 251.803744 ... 12.0 11103.000000 1123.333333 2.555556 0.0 0.0 0.0 40.755398 -73.932306 0.170433
31 457.500000 472.500000 452.500000 1382.500000 228.908654 355.111538 194.435000 450.787097 31.0 380.528319 ... 12.0 10307.100000 1847.500000 5.000000 0.0 0.0 0.0 40.595680 -74.125726 0.176337
32 371.500000 385.833333 362.166667 1119.500000 70.342949 100.179487 83.558333 105.333333 32.0 100.525613 ... 12.0 11231.666667 381.500000 1.000000 0.0 0.0 0.0 40.696295 -73.917124 0.170409

32 rows × 68 columns

In [329]:
# plotting safety scores district wise on NYC map (using Basemap library)
from mpl_toolkits.basemap import Basemap

def nyc_plot_district(fieldname):
    fig,ax = plt.subplots(figsize = (6,6))
    m = Basemap(projection = 'merc', llcrnrlat = 40.496044, urcrnrlat = 40.915256, 
                llcrnrlon = -74.255735, urcrnrlon = -73.700272, resolution = 'h')
    m.drawcoastlines(color = 'black', linewidth = 1)
    m.drawmapboundary(fill_color = '#85A6D9')
    
    # Creating scatterplot
    sc = m.scatter(district['lon'].tolist(), 
              district['lat'].tolist(),
              zorder = 2, s=20, 
              latlon = True, 
              c=district[fieldname], 
              cmap = 'summer')
    # Colour scale: the 'summer' colormap runs from green (low) to yellow (high)
    fig.colorbar(sc, ax=ax, label=fieldname)
    if fieldname == 'saf_s_11':
        ax.set_title('Heat-Map: District Wise Safety Scores for NYC Schools')
    plt.show()
    

def nyc_plot_school(df):
    fig,ax = plt.subplots(figsize = (6,6))
    m = Basemap(projection = 'merc', llcrnrlat = 40.496044, urcrnrlat = 40.915256, 
                llcrnrlon = -74.255735, urcrnrlon = -73.700272, resolution = 'i')
    m.drawcoastlines(color = 'black', linewidth = 1)
    m.drawmapboundary(fill_color = '#85A6D9')
    
    # Creating scatterplot
    m.scatter(df['lon'].tolist(), 
              df['lat'].tolist(),
              zorder = 2, s=20, 
              latlon = True, 
              c='black')
    ax.set_title('Scatter Plot: NYC Schools')
    plt.show()
In [235]:
nyc_plot_district('saf_s_11')

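From the NYC map above, most districts in Manhattan and Queens seem to have lower average student-rated safety scores, while Brooklyn has relatively higher ones. Note that matplotlib's 'summer' colormap runs from green (low values) to yellow (high values), so the dot colours should be read against that scale.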

Race and sat_score correlation

By plotting the correlations between the race columns (white_per, asian_per, black_per, hispanic_per) and sat_score, we can determine whether there are any racial differences in SAT performance.

In [108]:
combined.corr().loc['sat_score',['white_per','asian_per','black_per','hispanic_per']].plot.bar()
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x35a4fbd448>
In [302]:
scatter_plots_multiple(['white_per','asian_per','black_per','hispanic_per'])

Immediately we can see:

  • a positive correlation between the percentage of White/Asian students in a school and sat_score
  • a negative correlation between the percentage of Black/Hispanic students in a school and sat_score

This suggests that the SAT is not favourable to students from Hispanic/Black communities.

Another interesting thing to investigate is the cluster of schools with low sat_score values in the hispanic_per plot, where 99%+ of the students are Hispanic. On diving deeper, we find that these schools take in newly arrived immigrants from Spanish-speaking countries who have been in the USA for fewer than 2-3 years. This may contribute to lower scores on the English sections of the SAT and, in turn, to lower overall SAT scores.

In [415]:
# investigating schools with > 95% hispanic percentage
combined[combined['hispanic_per']>95][['SCHOOL NAME','hispanic_per','sat_score']]
Out[415]:
SCHOOL NAME hispanic_per sat_score
44 MANHATTAN BRIDGES HIGH SCHOOL 99.8 1058.0
82 WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL 96.7 1174.0
89 GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M... 99.8 1014.0
125 ACADEMY FOR LANGUAGE AND TECHNOLOGY 99.4 951.0
141 INTERNATIONAL SCHOOL FOR LIBERAL ARTS 99.8 934.0
176 PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE 99.8 970.0
253 MULTICULTURAL HIGH SCHOOL 99.8 887.0
286 PAN AMERICAN INTERNATIONAL HIGH SCHOOL 100.0 951.0

Gender and sat_score correlation

We see that gender has a negligible correlation with SAT scores. But on a deeper dive using the scatterplots, we can see that schools with a more balanced gender ratio (close to 1:1 boys to girls) have higher SAT scores on average.

Also, the cluster of dots near 100% on the female_per plot shows that all-girls schools seem to have relatively low SAT scores on average. The same goes for schools with more than 80% male students.
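
A near-zero r does not rule out a relationship: Pearson's r only measures linear association, so a pattern that peaks near a 50/50 split can cancel out. The sketch below uses made-up values, and `imbalance` (the distance from 50% female) is a hypothetical helper column that is not part of the dataset.

```python
import pandas as pd

# Made-up schools whose scores peak at a balanced gender ratio.
toy = pd.DataFrame({"female_per": [0.0, 25.0, 50.0, 75.0, 100.0],
                    "sat_score": [1100.0, 1250.0, 1400.0, 1250.0, 1100.0]})

# Hypothetical column: distance from a 50/50 split.
toy["imbalance"] = (toy["female_per"] - 50).abs()

r_linear = toy["female_per"].corr(toy["sat_score"])
r_imbalance = toy["imbalance"].corr(toy["sat_score"])
print(round(r_linear, 3), round(r_imbalance, 3))
```

In this toy data the correlation with female_per is essentially zero, while the correlation with the distance from a balanced ratio is strongly negative.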

In [270]:
combined.corr().loc['sat_score',['male_per','female_per']].plot.bar()
Out[270]:
<matplotlib.axes._subplots.AxesSubplot at 0x35acf6be48>
In [303]:
scatter_plots_multiple(['female_per','male_per'])
In [346]:
combined[combined['female_per']>80][['SCHOOL NAME','sat_score','priority01']]
Out[346]:
SCHOOL NAME sat_score priority01
15 URBAN ASSEMBLY SCHOOL OF BUSINESS FOR YOUNG WO... 1127.000000 Open only to female students
49 THE HIGH SCHOOL OF FASHION INDUSTRIES 1257.000000 Open to New York City residents
70 YOUNG WOMEN'S LEADERSHIP SCHOOL 1326.000000 Open only to female students
71 YOUNG WOMEN'S LEADERSHIP SCHOOL 1326.000000 Open only to female students
104 WOMEN'S ACADEMY OF EXCELLENCE 1171.000000 Open only to female students
133 HIGH SCHOOL FOR VIOLIN AND DANCE 1039.000000 Priority to Bronx students or residents who at...
137 THE MARIE CURIE SCHOOL FOR MEDICINE, NURSING, ... 1157.000000 Priority to Bronx students or residents who at...
191 URBAN ASSEMBLY INSTITUTE OF MATH AND SCIENCE F... 1223.438806 Open only to female students
264 THE URBAN ASSEMBLY SCHOOL FOR CRIMINAL JUSTICE 1223.438806 Open only to female students
329 YOUNG WOMEN'S LEADERSHIP SCHOOL, QUEENS 1316.000000 Open only to female students
338 YOUNG WOMEN'S LEADERSHIP SCHOOL, ASTORIA 1223.438806 Open only to female students
In [308]:
combined[combined['male_per']>80][['SCHOOL NAME','sat_score']]
Out[308]:
SCHOOL NAME sat_score
99 URBAN ASSEMBLY SCHOOL FOR CAREERS IN SPORTS 1181.0
101 ALFRED E. SMITH CAREER AND TECHNICAL EDUCATION... 1158.0
115 EAGLE ACADEMY FOR YOUNG MEN 1134.0
135 BRONX ENGINEERING AND TECHNOLOGY ACADEMY 1150.0
160 HIGH SCHOOL OF COMPUTERS AND TECHNOLOGY 1111.0
170 BRONX AEROSPACE HIGH SCHOOL 1163.0
207 AUTOMOTIVE HIGH SCHOOL 1093.0
249 FDNY HIGH SCHOOL FOR FIRE AND LIFE SAFETY 1023.0
254 TRANSIT TECH CAREER AND TECHNICAL EDUCATION HI... 1193.0
267 HIGH SCHOOL OF SPORTS MANAGEMENT 1164.0
295 AVIATION CAREER & TECHNICAL EDUCATION HIGH SCHOOL 1364.0
In [320]:
combined[(combined['female_per']>60)&(combined['sat_score']>1700)][['SCHOOL NAME','sat_score','female_per']]
Out[320]:
SCHOOL NAME sat_score female_per
5 BARD HIGH SCHOOL EARLY COLLEGE 1856.0 68.7
26 ELEANOR ROOSEVELT HIGH SCHOOL 1758.0 67.5
60 BEACON HIGH SCHOOL 1744.0 61.0
61 FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A... 1707.0 73.6
302 TOWNSEND HARRIS HIGH SCHOOL 1910.0 71.1

AP Test Takers and sat_score correlation

In the U.S., high school students take Advanced Placement (AP) exams to earn college credit. There are AP exams for many different subjects.

It makes sense that the number of students at a school who took AP exams would be highly correlated with the school's SAT scores. Let's explore this relationship.

In [147]:
combined['ap_per'] = combined['AP Test Takers ']/combined['total_enrollment']
combined.plot.scatter('ap_per','sat_score')
Out[147]:
<matplotlib.axes._subplots.AxesSubplot at 0x35aa158c08>
In [322]:
combined[combined['sat_score'] > 1800][['SCHOOL NAME','sat_score','ap_per']]
Out[322]:
SCHOOL NAME sat_score ap_per
5 BARD HIGH SCHOOL EARLY COLLEGE 1856.0 0.209123
37 STUYVESANT HIGH SCHOOL 2096.0 0.457992
79 HIGH SCHOOL FOR MATHEMATICS, SCIENCE AND ENGIN... 1847.0 0.280788
151 BRONX HIGH SCHOOL OF SCIENCE 1969.0 0.394955
155 HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE 1920.0 0.514589
187 BROOKLYN TECHNICAL HIGH SCHOOL 1833.0 0.397037
302 TOWNSEND HARRIS HIGH SCHOOL 1910.0 0.537719
327 QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO... 1868.0 0.514354
356 STATEN ISLAND TECHNICAL HIGH SCHOOL 1953.0 0.478261
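
The ap_per column divides by enrollment because raw AP counts also track school size: in the correlation output earlier, AP Test Takers correlates with sat_score at about 0.52 and total_enrollment at about 0.37, while ap_per is only about 0.06. A toy sketch with two made-up schools:

```python
import pandas as pd

toy = pd.DataFrame({"SCHOOL NAME": ["Large school", "Small school"],
                    "AP Test Takers ": [300.0, 60.0],
                    "total_enrollment": [3000.0, 200.0]})

# AP test takers as a share of total enrollment.
toy["ap_per"] = toy["AP Test Takers "] / toy["total_enrollment"]
print(toy[["SCHOOL NAME", "ap_per"]])
```

The large school has five times as many AP test takers, but the small school has the higher share.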

Average Class Size and sat_score correlation

Here we see a moderate positive correlation (r ≈ 0.38): schools with a larger average class size seem to have higher average SAT scores, and vice versa.

In [371]:
combined.plot.scatter('AVERAGE CLASS SIZE', 'sat_score')
Out[371]:
<matplotlib.axes._subplots.AxesSubplot at 0x35aeea3a08>

CONCLUSION

  1. Schools with higher average SAT scores seem to have a higher Safety and Respect score based on student responses in the surveys.
  2. The SAT is not favourable to Black/Hispanic students, as schools with a higher percentage of these students have lower SAT scores.
  3. Schools with a more balanced gender ratio have higher SAT scores, and vice versa.
  4. Schools with a larger average class size seem to have higher average SAT scores, and vice versa.