Analysing NYC High School Data¶

by Raghav_A

New York City has a large immigrant population and is very diverse, so comparing SAT scores against demographic factors such as race, income, and gender is a useful way to examine whether the SAT is a fair test. Survey data from schools across NYC also lets us compare SAT scores against school-safety ratings, average class size, and the number of AP test takers. Let's see if we can find some useful correlations.

Read in the data¶

In [61]:

import pandas as pd
import numpy
import re
import matplotlib.pyplot as plt

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

Read in the surveys¶

In [2]:

all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey

Add DBN columns¶

In [3]:

data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
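The padding step above can also be written without a helper function. This is a minimal sketch on a made-up table (the school codes here are placeholders, not rows from class_size.csv), using the pandas string method `zfill` to left-pad the district number to two characters:

```python
import pandas as pd

# Toy stand-in for the class_size table; the values are made up for illustration.
toy = pd.DataFrame({
    "CSD": [1, 9, 10, 31],
    "SCHOOL CODE": ["M015", "X412", "X141", "R605"],
})

# Zero-pad the district number to two characters, then append the school code.
toy["padded_csd"] = toy["CSD"].astype(str).str.zfill(2)
toy["DBN"] = toy["padded_csd"] + toy["SCHOOL CODE"]

print(toy["DBN"].tolist())
```

Single-digit districts gain a leading zero and two-digit districts are left unchanged, which matches what pad_csd does.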

Convert columns to numeric¶

In [4]:

cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
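As an alternative to the two helper functions, the coordinates can be pulled out in one vectorised step. The sketch below is only an illustration: the location strings are made up to resemble the "Location 1" format, and the pattern assumes plain decimal coordinates inside parentheses. One difference worth noting is that find_lat and find_lon index coords[0], which raises an IndexError when a string has no parenthesised pair, whereas `str.extract` returns NaN for such rows.

```python
import pandas as pd

# Hypothetical "Location 1" strings: an address followed by "(lat, lon)".
loc = pd.Series([
    "883 Classon Avenue\nBrooklyn, NY 11225\n(40.67029890700047, -73.96164787599963)",
    "no coordinates here",
])

# Two capture groups: latitude and longitude inside the parentheses.
coords = loc.str.extract(r"\(([-\d.]+),\s*([-\d.]+)\)")
coords.columns = ["lat", "lon"]

# Convert the captured text to floats; unmatched rows stay NaN.
coords = coords.apply(pd.to_numeric, errors="coerce")

print(coords)
```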

Condense datasets¶

In [5]:

class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
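The reason for condensing is that each dataset should end up with one row per DBN before merging. Here is a minimal sketch of the class-size step on a made-up table (the DBN values and class sizes are placeholders), showing how filtering and then averaging per DBN leaves a unique key:

```python
import pandas as pd

# Toy class-size table: one school appears twice (e.g. two subjects).
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M332", "01M332"],
    "GRADE ": ["09-12", "09-12", "09-12", "06"],
    "PROGRAM TYPE": ["GEN ED", "GEN ED", "GEN ED", "GEN ED"],
    "AVERAGE CLASS SIZE": [20.0, 30.0, 25.0, 18.0],
})

# Keep only high-school, general-education rows, as in the cell above.
hs = toy[(toy["GRADE "] == "09-12") & (toy["PROGRAM TYPE"] == "GEN ED")]

# Average the remaining rows so that each DBN appears exactly once.
condensed = hs.groupby("DBN", as_index=False)["AVERAGE CLASS SIZE"].mean()

print(condensed)
```

If DBN were not unique, a later merge on DBN would duplicate rows for that school.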

Convert AP scores to numeric¶

In [6]:

cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

Combine the datasets¶

In [7]:

combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
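To make the effect of the two merge strategies and the mean imputation concrete, here is a small sketch on made-up tables (the DBN values "A", "B", "C" and all numbers are placeholders). A left merge keeps every SAT row even when the other table has no match, an inner merge drops rows without a match, and `fillna` with the column means replaces the remaining gaps. I pass `numeric_only=True` so that the text columns are skipped when computing the means.

```python
import pandas as pd

sat = pd.DataFrame({"DBN": ["A", "B", "C"], "sat_score": [1200.0, 1400.0, 1100.0]})
ap = pd.DataFrame({"DBN": ["A"], "AP Test Takers ": [50.0]})
directory = pd.DataFrame({"DBN": ["A", "B"], "lat": [40.7, 40.8]})

# Left merge: all three SAT rows survive; B and C get NaN for AP data.
left = sat.merge(ap, on="DBN", how="left")

# Inner merge: school C has no directory entry, so it is dropped.
inner = left.merge(directory, on="DBN", how="inner")

# Fill remaining gaps with each numeric column's mean.
filled = inner.fillna(inner.mean(numeric_only=True))

print(filled)
```

In this toy case school B ends up with 50 AP test takers, a value that was never measured for it. Imputed values like this look the same as real ones in later tables, so they are worth keeping in mind when reading individual rows.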

Add a school district column for mapping¶

In [8]:

def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

Find correlations¶

In [180]:

correlations = combined.corr()
correlations['sat_score']

Out[180]:

SAT Critical Reading Avg. Score    0.986820
SAT Math Avg. Score                0.972643
SAT Writing Avg. Score             0.987771
sat_score                          1.000000
AP Test Takers                     0.523140
                                     ...   
priority09                              NaN
priority10                              NaN
lat                               -0.121029
lon                               -0.132222
ap_per                             0.057171
Name: sat_score, Length: 68, dtype: float64

In [399]:

correlations[abs(correlations['sat_score'])>0.25]['sat_score'].sort_values()

Out[399]:

frl_percent                            -0.722225
sped_percent                           -0.448170
ell_percent                            -0.398750
hispanic_per                           -0.396985
black_per                              -0.284139
N_t                                     0.291463
saf_t_11                                0.313810
SIZE OF LARGEST CLASS                   0.314434
saf_tot_11                              0.318753
Total Cohort                            0.325144
male_num                                0.325520
saf_s_11                                0.337639
aca_s_11                                0.339435
NUMBER OF SECTIONS                      0.362673
total_enrollment                        0.367857
AVERAGE CLASS SIZE                      0.381014
female_num                              0.388631
NUMBER OF STUDENTS / SEATS FILLED       0.394626
total_students                          0.407827
N_p                                     0.421530
N_s                                     0.423463
white_num                               0.449559
Number of Exams with scores 3 4 or 5    0.463245
asian_num                               0.475445
Total Exams Taken                       0.514333
AP Test Takers                          0.523140
asian_per                               0.570730
white_per                               0.620718
SAT Math Avg. Score                     0.972643
SAT Critical Reading Avg. Score         0.986820
SAT Writing Avg. Score                  0.987771
sat_score                               1.000000
Name: sat_score, dtype: float64

Plotting survey correlations¶

There are several fields in the combined dataset that originally came from a survey of parents, teachers, and students. I will make a bar plot of the correlations between these fields and sat_score, and then take a closer look at the fields that correlate most strongly with it.

In [387]:

combined.corr()[survey_fields].loc['sat_score',:].sort_values().plot.bar()

Out[387]:

<matplotlib.axes._subplots.AxesSubplot at 0x35ac5a3b88>

Immediately, it can be observed that the correlation of the survey fields with sat_score (in absolute terms) varies from 0.02 (almost no correlation) to 0.4 (medium correlation). I treat any correlation with an absolute value above 0.25 as 'interesting' and worth a deeper look. Next, I isolate only those survey fields whose absolute correlation with sat_score exceeds 0.25.

Rather than rely on the r (correlation) value alone, it is better to plot the two fields being compared in a scatterplot. This shows whether there is a genuine relationship, or whether the r value is an artifact of a few influential outliers.
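To illustrate why the r value alone can mislead, here is a minimal sketch with made-up numbers (not taken from the school data). Eight points show no real trend, and a single extreme point is then added. For comparison I also compute Pearson's r on the ranks of the values, which is Spearman's rank correlation and is less sensitive to one extreme point.

```python
import pandas as pd

# Made-up points: eight with no real trend, plus one extreme value at the end.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30], dtype=float)
y = pd.Series([2, 1, 2, 1, 2, 1, 2, 1, 30], dtype=float)

r_without = x.iloc[:8].corr(y.iloc[:8])  # Pearson r, outlier excluded
r_with = x.corr(y)                       # Pearson r, outlier included

# Pearson r on the ranks is Spearman's rank correlation.
rho_with = x.rank().corr(y.rank())

print(round(r_without, 2), round(r_with, 2), round(rho_with, 2))
```

With these numbers the Pearson r moves from roughly -0.22 to roughly 0.96 once the single outlier is included, while the rank-based value stays low at about 0.18. A scatterplot makes this kind of situation obvious at a glance.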

In [300]:

# A function that makes multiple scatterplots in a single figure
def scatter_plots_multiple(colnames_list):
    numcols = len(colnames_list)
    # Lay the plots out three per row, with as many rows as needed
    numrows = max(1, (numcols + 2) // 3)
    fig = plt.figure(figsize = (15, numrows*5))

    for i, colname in enumerate(colnames_list, start=1):
        ax = fig.add_subplot(numrows, 3, i)
        ax.scatter(combined[colname], combined['sat_score'])
        ax.set_title(colname + ' vs "sat_score"')
    plt.show()

In [301]:

# Compute the survey correlations once, then keep those above the threshold
survey_corr = combined.corr()[survey_fields].loc['sat_score',:]
survey_sat_corr = survey_corr[survey_corr > 0.25]
scatter_plots_multiple(survey_sat_corr.index)

The Safety and Respect score based on student responses (the saf_s_11 field) is among the survey fields that correlate most clearly with sat_score. While most schools are clustered around safety scores of 6 to 7, it can be seen that schools with a higher student-rated safety score tend to have a higher mean sat_score as well.

Let's dive deeper into this survey field, and try to visualise which districts have higher average safety scores and which ones don't.

In [10]:

# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")

`saf_s_11` safety scores by District¶

In [209]:

# district-wise average scores
district = combined.groupby('school_dist').agg(numpy.mean)
district

Out[209]:

	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	AP Test Takers	Total Exams Taken	Number of Exams with scores 3 4 or 5	Total Cohort	CSD	NUMBER OF STUDENTS / SEATS FILLED	...	expgrade_span_max	zip	total_students	number_programs	priority08	priority09	priority10	lat	lon	ap_per
school_dist
01	441.833333	473.333333	439.333333	1354.500000	116.681090	173.019231	135.800000	93.500000	1.0	115.244241	...	12.0	10003.166667	659.500000	1.333333	0.0	0.0	0.0	40.719022	-73.982377	0.192551
02	426.619092	444.186256	424.832836	1295.638184	128.908454	201.516827	157.495833	158.647849	2.0	149.818949	...	12.0	10023.770833	621.395833	1.416667	0.0	0.0	0.0	40.739699	-73.991386	0.265711
03	428.529851	437.997512	426.915672	1293.443035	156.183494	244.522436	193.087500	183.384409	3.0	156.005994	...	12.0	10023.750000	717.916667	2.000000	0.0	0.0	0.0	40.781574	-73.977370	0.267818
04	402.142857	416.285714	405.714286	1224.142857	129.016484	183.879121	151.035714	113.857143	4.0	132.362265	...	12.0	10029.857143	580.857143	1.142857	0.0	0.0	0.0	40.793449	-73.943215	0.246798
05	427.159915	438.236674	419.666098	1285.062687	85.722527	115.725275	142.464286	143.677419	5.0	120.623901	...	12.0	10030.142857	609.857143	1.142857	0.0	0.0	0.0	40.817077	-73.949251	0.161767
06	382.011940	400.565672	382.066269	1164.643881	108.711538	159.715385	105.425000	180.848387	6.0	139.041709	...	12.0	10036.200000	628.900000	1.300000	0.0	0.0	0.0	40.848970	-73.932502	0.220879
07	376.461538	380.461538	371.923077	1128.846154	73.703402	112.476331	105.276923	105.605459	7.0	97.597416	...	12.0	10452.692308	465.846154	1.461538	0.0	0.0	0.0	40.816815	-73.919971	0.170719
08	386.214383	395.542741	377.908005	1159.665129	118.379371	168.020979	144.731818	215.510264	8.0	129.765099	...	12.0	10467.000000	547.636364	1.272727	0.0	0.0	0.0	40.823803	-73.866087	0.249342
09	373.755970	383.582836	374.633134	1131.971940	71.411538	104.265385	98.470000	113.330645	9.0	100.118588	...	12.0	10456.100000	449.700000	1.150000	0.0	0.0	0.0	40.836349	-73.906240	0.175797
10	403.363636	418.000000	400.863636	1222.227273	132.231206	226.914336	191.618182	161.318182	10.0	168.876526	...	12.0	10463.181818	757.863636	1.500000	0.0	0.0	0.0	40.870345	-73.898360	0.153976
11	389.866667	394.533333	380.600000	1165.000000	83.813462	122.484615	108.833333	122.866667	11.0	129.031031	...	12.0	10467.933333	563.666667	1.533333	0.0	0.0	0.0	40.873138	-73.856120	0.170508
12	364.769900	379.109453	357.943781	1101.823134	93.102564	139.442308	153.450000	110.467742	12.0	91.684504	...	12.0	10463.166667	409.000000	1.083333	0.0	0.0	0.0	40.831412	-73.886946	0.265387
13	409.393800	424.127440	403.666361	1237.187600	232.931953	382.704142	320.773077	224.595533	13.0	218.306055	...	12.0	11207.153846	895.153846	2.076923	0.0	0.0	0.0	40.692865	-73.977016	0.180886
14	395.937100	398.189765	385.333049	1179.459915	77.798077	114.873626	123.282143	112.347926	14.0	123.643728	...	12.0	11210.785714	545.357143	2.000000	0.0	0.0	0.0	40.711599	-73.948360	0.217193
15	395.679934	404.628524	390.295854	1190.604312	94.574786	141.581197	153.450000	104.207885	15.0	135.707319	...	12.0	11214.222222	573.111111	1.666667	0.0	0.0	0.0	40.675972	-73.989255	0.181893
16	371.529851	379.164179	369.415672	1120.109701	82.264423	126.519231	153.450000	247.185484	16.0	177.501282	...	12.0	11219.000000	440.250000	1.750000	0.0	0.0	0.0	40.688008	-73.929686	0.309973
17	386.571429	394.071429	380.785714	1161.428571	105.583791	163.087912	111.360714	121.357143	17.0	130.246192	...	12.0	11220.642857	547.071429	1.642857	0.0	0.0	0.0	40.660313	-73.955636	0.209731
18	373.454545	373.090909	371.454545	1118.000000	129.028846	197.038462	153.450000	72.771261	18.0	72.209438	...	12.0	11224.000000	344.000000	1.090909	0.0	0.0	0.0	40.641863	-73.914726	0.396711
19	367.083333	377.583333	359.166667	1103.833333	88.097756	124.769231	120.670833	114.322581	19.0	105.752625	...	12.0	11207.500000	440.416667	1.916667	0.0	0.0	0.0	40.676547	-73.882158	0.200646
20	406.223881	465.731343	401.732537	1273.687761	227.805769	359.407692	177.690000	591.374194	20.0	420.029766	...	12.0	11210.200000	2521.400000	3.800000	0.0	0.0	0.0	40.626751	-74.006191	0.150214
21	395.283582	421.786974	389.242062	1206.312619	135.467657	203.835664	142.377273	275.351906	21.0	224.702989	...	12.0	11221.000000	1098.272727	3.272727	0.0	0.0	0.0	40.593596	-73.978465	0.206245
22	473.500000	502.750000	474.250000	1450.500000	391.007212	614.509615	370.362500	580.250000	22.0	495.279369	...	12.0	11223.000000	2149.000000	2.250000	0.0	0.0	0.0	40.618285	-73.952288	0.215706
23	380.666667	398.666667	378.000000	1157.333333	29.000000	31.000000	153.450000	87.000000	23.0	120.113095	...	12.0	11219.000000	391.000000	1.333333	0.0	0.0	0.0	40.668586	-73.912298	0.063672
24	405.846154	434.000000	402.153846	1242.000000	126.474852	179.094675	115.165385	234.682382	24.0	213.471903	...	12.0	11206.153846	962.461538	2.230769	0.0	0.0	0.0	40.740621	-73.911518	0.185186
25	437.250000	483.500000	436.250000	1357.000000	205.260817	279.889423	174.793750	268.733871	25.0	280.576007	...	12.0	11361.000000	1288.875000	1.875000	0.0	0.0	0.0	40.745414	-73.815558	0.205119
26	445.200000	487.600000	444.800000	1377.600000	410.605769	632.407692	392.090000	825.600000	26.0	595.953216	...	12.0	11388.600000	2837.400000	4.600000	0.0	0.0	0.0	40.748507	-73.759176	0.124673
27	407.800000	422.200000	394.300000	1224.300000	100.611538	145.315385	95.125000	288.961290	27.0	249.324536	...	12.0	11556.300000	1072.000000	2.500000	0.0	0.0	0.0	40.638828	-73.807823	0.150687
28	445.941655	465.997286	435.908005	1347.846947	182.010490	273.559441	175.336364	351.214076	28.0	255.381164	...	12.0	11422.000000	1304.272727	2.545455	0.0	0.0	0.0	40.709344	-73.806367	0.215716
29	395.764925	399.457090	386.707836	1181.929851	63.385817	96.514423	135.268750	98.108871	29.0	88.372155	...	12.0	11413.625000	474.125000	1.250000	0.0	0.0	0.0	40.685276	-73.752740	0.211378
30	430.679934	465.961857	429.740299	1326.382090	157.231838	252.123932	115.150000	310.526882	30.0	251.803744	...	12.0	11103.000000	1123.333333	2.555556	0.0	0.0	0.0	40.755398	-73.932306	0.170433
31	457.500000	472.500000	452.500000	1382.500000	228.908654	355.111538	194.435000	450.787097	31.0	380.528319	...	12.0	10307.100000	1847.500000	5.000000	0.0	0.0	0.0	40.595680	-74.125726	0.176337
32	371.500000	385.833333	362.166667	1119.500000	70.342949	100.179487	83.558333	105.333333	32.0	100.525613	...	12.0	11231.666667	381.500000	1.000000	0.0	0.0	0.0	40.696295	-73.917124	0.170409

32 rows × 68 columns

In [329]:

# plotting safety scores district wise on NYC map (using Basemap library)
from mpl_toolkits.basemap import Basemap

def nyc_plot_district(fieldname):
    fig,ax = plt.subplots(figsize = (6,6))
    m = Basemap(projection = 'merc', llcrnrlat = 40.496044, urcrnrlat = 40.915256, 
                llcrnrlon = -74.255735, urcrnrlon = -73.700272, resolution = 'h')
    m.drawcoastlines(color = 'black', linewidth = 1)
    m.drawmapboundary(fill_color = '#85A6D9')
    
    # Creating scatterplot
    m.scatter(district['lon'].tolist(), 
              district['lat'].tolist(),
              zorder = 2, s=20, 
              latlon = True, 
              c=district[fieldname], 
              cmap = 'summer')
    if fieldname == 'saf_s_11':
        ax.set_title('Heat-Map: District Wise Safety Scores for NYC Schools')
    plt.show()
    

def nyc_plot_school(df):
    fig,ax = plt.subplots(figsize = (6,6))
    m = Basemap(projection = 'merc', llcrnrlat = 40.496044, urcrnrlat = 40.915256, 
                llcrnrlon = -74.255735, urcrnrlon = -73.700272, resolution = 'i')
    m.drawcoastlines(color = 'black', linewidth = 1)
    m.drawmapboundary(fill_color = '#85A6D9')
    
    # Creating scatterplot
    m.scatter(df['lon'].tolist(), 
              df['lat'].tolist(),
              zorder = 2, s=20, 
              latlon = True, 
              c='black')
    ax.set_title('Scatter Plot: NYC Schools')
    plt.show()

In [235]:

nyc_plot_district('saf_s_11')


From the NYC map above, we can see that most districts in Manhattan and Queens seem to have lower average student-rated safety scores (yellow dots), while Brooklyn has relatively higher safety scores.

Race and `sat_score` correlation¶

By plotting the correlations between the race-percentage columns (white_per, asian_per, black_per, hispanic_per) and sat_score, we can see whether SAT performance differs with a school's racial composition.

In [108]:

combined.corr().loc['sat_score',['white_per','asian_per','black_per','hispanic_per']].plot.bar()

Out[108]:

<matplotlib.axes._subplots.AxesSubplot at 0x35a4fbd448>

In [302]:

scatter_plots_multiple(['white_per','asian_per','black_per','hispanic_per'])

Immediately we can see:

a positive correlation between the percentage of white/Asian students in a school and sat_score
a negative correlation between the percentage of Black/Hispanic students in a school and sat_score

This suggests that the SAT may not be favourable to students from Hispanic/Black communities, although a school-level correlation alone cannot separate the test itself from other factors such as income (frl_percent has the strongest negative correlation with sat_score, at about -0.72).

Another interesting thing to investigate further is the cluster of schools with low sat_score values in the hispanic_per plot, where more than 95% of students are Hispanic. On diving deeper, we find that these schools take in newly arrived immigrants from Hispanic countries who have been in the USA for fewer than 2-3 years. This may contribute to lower scores on the English sections of the SAT and, in turn, lower overall SAT scores.

In [415]:

# investigating schools with > 95% hispanic percentage
combined[combined['hispanic_per']>95][['SCHOOL NAME','hispanic_per','sat_score']]

Out[415]:

	SCHOOL NAME	hispanic_per	sat_score
44	MANHATTAN BRIDGES HIGH SCHOOL	99.8	1058.0
82	WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL	96.7	1174.0
89	GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...	99.8	1014.0
125	ACADEMY FOR LANGUAGE AND TECHNOLOGY	99.4	951.0
141	INTERNATIONAL SCHOOL FOR LIBERAL ARTS	99.8	934.0
176	PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE	99.8	970.0
253	MULTICULTURAL HIGH SCHOOL	99.8	887.0
286	PAN AMERICAN INTERNATIONAL HIGH SCHOOL	100.0	951.0

About - MANHATTAN BRIDGES HIGH SCHOOL
About - Gregorio Luperon High School for Science and Mathematics

Gender and sat_score correlation¶

We see that gender has a negligible correlation with SAT scores. But on a deeper dive using the scatterplots, we can see that schools with a more balanced gender mix (a ratio close to 1:1 boys to girls) have higher SAT scores on average.

Also, the cluster of dots near 100% on the female_per plot shows that all-girls schools tend to have relatively low SAT scores. The same goes for schools with more than 80% male students.
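A near-zero r value and a visible pattern in the scatterplot are not contradictory: Pearson's r only measures a straight-line relationship. The sketch below uses made-up numbers (not the school data) in which scores peak at a 50/50 split. The column name gender_imbalance is my own hypothetical helper, defined as the distance of female_per from 50.

```python
import pandas as pd

# Made-up data: scores are highest at 50% and fall off symmetrically.
toy = pd.DataFrame({
    "female_per": [0.0, 25.0, 50.0, 75.0, 100.0],
    "sat_score": [1100.0, 1300.0, 1500.0, 1300.0, 1100.0],
})

# Straight-line correlation with female_per: the two halves cancel out.
r_linear = toy["female_per"].corr(toy["sat_score"])

# Hypothetical helper column: how far a school is from a 1:1 ratio.
toy["gender_imbalance"] = (toy["female_per"] - 50).abs()
r_balance = toy["gender_imbalance"].corr(toy["sat_score"])

print(round(r_linear, 2), round(r_balance, 2))
```

In this toy case the correlation with female_per is zero, while the correlation with gender_imbalance is -1. Applying the same transformation to the combined data would be one way to check the 1:1 observation numerically instead of by eye.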

In [270]:

combined.corr().loc['sat_score',['male_per','female_per']].plot.bar()

Out[270]:

<matplotlib.axes._subplots.AxesSubplot at 0x35acf6be48>

In [303]:

scatter_plots_multiple(['female_per','male_per'])

In [346]:

combined[combined['female_per']>80][['SCHOOL NAME','sat_score','priority01']]

Out[346]:

	SCHOOL NAME	sat_score	priority01
15	URBAN ASSEMBLY SCHOOL OF BUSINESS FOR YOUNG WO...	1127.000000	Open only to female students
49	THE HIGH SCHOOL OF FASHION INDUSTRIES	1257.000000	Open to New York City residents
70	YOUNG WOMEN'S LEADERSHIP SCHOOL	1326.000000	Open only to female students
71	YOUNG WOMEN'S LEADERSHIP SCHOOL	1326.000000	Open only to female students
104	WOMEN'S ACADEMY OF EXCELLENCE	1171.000000	Open only to female students
133	HIGH SCHOOL FOR VIOLIN AND DANCE	1039.000000	Priority to Bronx students or residents who at...
137	THE MARIE CURIE SCHOOL FOR MEDICINE, NURSING, ...	1157.000000	Priority to Bronx students or residents who at...
191	URBAN ASSEMBLY INSTITUTE OF MATH AND SCIENCE F...	1223.438806	Open only to female students
264	THE URBAN ASSEMBLY SCHOOL FOR CRIMINAL JUSTICE	1223.438806	Open only to female students
329	YOUNG WOMEN'S LEADERSHIP SCHOOL, QUEENS	1316.000000	Open only to female students
338	YOUNG WOMEN'S LEADERSHIP SCHOOL, ASTORIA	1223.438806	Open only to female students

In [308]:

combined[combined['male_per']>80][['SCHOOL NAME','sat_score']]

Out[308]:

	SCHOOL NAME	sat_score
99	URBAN ASSEMBLY SCHOOL FOR CAREERS IN SPORTS	1181.0
101	ALFRED E. SMITH CAREER AND TECHNICAL EDUCATION...	1158.0
115	EAGLE ACADEMY FOR YOUNG MEN	1134.0
135	BRONX ENGINEERING AND TECHNOLOGY ACADEMY	1150.0
160	HIGH SCHOOL OF COMPUTERS AND TECHNOLOGY	1111.0
170	BRONX AEROSPACE HIGH SCHOOL	1163.0
207	AUTOMOTIVE HIGH SCHOOL	1093.0
249	FDNY HIGH SCHOOL FOR FIRE AND LIFE SAFETY	1023.0
254	TRANSIT TECH CAREER AND TECHNICAL EDUCATION HI...	1193.0
267	HIGH SCHOOL OF SPORTS MANAGEMENT	1164.0
295	AVIATION CAREER & TECHNICAL EDUCATION HIGH SCHOOL	1364.0

In [320]:

combined[(combined['female_per']>60)&(combined['sat_score']>1700)][['SCHOOL NAME','sat_score','female_per']]

Out[320]:

	SCHOOL NAME	sat_score	female_per
5	BARD HIGH SCHOOL EARLY COLLEGE	1856.0	68.7
26	ELEANOR ROOSEVELT HIGH SCHOOL	1758.0	67.5
60	BEACON HIGH SCHOOL	1744.0	61.0
61	FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A...	1707.0	73.6
302	TOWNSEND HARRIS HIGH SCHOOL	1910.0	71.1

AP Test Takers and sat_score correlation¶

In the U.S., high school students take Advanced Placement (AP) exams to earn college credit. There are AP exams for many different subjects.

It makes sense that the number of students at a school who took AP exams would be highly correlated with the school's SAT scores. Let's explore this relationship.
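Raw counts of AP test takers grow with school size, so it helps to compare the share of students taking AP exams instead. Here is a minimal sketch with made-up numbers (three hypothetical schools) showing how the ranking changes once the count is divided by enrollment:

```python
import pandas as pd

# Made-up schools: the largest one has the most AP test takers in absolute terms.
toy = pd.DataFrame({
    "SCHOOL NAME": ["Small", "Medium", "Large"],
    "AP Test Takers ": [100.0, 100.0, 1000.0],
    "total_enrollment": [200.0, 400.0, 4000.0],
})

# Share of enrolled students who took at least one AP exam.
toy["ap_per"] = toy["AP Test Takers "] / toy["total_enrollment"]

most_takers = toy.loc[toy["AP Test Takers "].idxmax(), "SCHOOL NAME"]
highest_share = toy.loc[toy["ap_per"].idxmax(), "SCHOOL NAME"]

print(most_takers, highest_share)
```

Here the large school has the most test takers, but the small school has the highest share (0.5 against 0.25).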

In [147]:

combined['ap_per'] = combined['AP Test Takers ']/combined['total_enrollment']
combined.plot.scatter('ap_per','sat_score')

Out[147]:

<matplotlib.axes._subplots.AxesSubplot at 0x35aa158c08>

In [322]:

combined[combined['sat_score'] > 1800][['SCHOOL NAME','sat_score','ap_per']]

Out[322]:

	SCHOOL NAME	sat_score	ap_per
5	BARD HIGH SCHOOL EARLY COLLEGE	1856.0	0.209123
37	STUYVESANT HIGH SCHOOL	2096.0	0.457992
79	HIGH SCHOOL FOR MATHEMATICS, SCIENCE AND ENGIN...	1847.0	0.280788
151	BRONX HIGH SCHOOL OF SCIENCE	1969.0	0.394955
155	HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE	1920.0	0.514589
187	BROOKLYN TECHNICAL HIGH SCHOOL	1833.0	0.397037
302	TOWNSEND HARRIS HIGH SCHOOL	1910.0	0.537719
327	QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...	1868.0	0.514354
356	STATEN ISLAND TECHNICAL HIGH SCHOOL	1953.0	0.478261

Average Class Size and sat_score correlation¶

Here we see a moderate positive correlation (r ≈ 0.38): schools with a higher average class size seem to have higher average SAT scores, and vice versa.

In [371]:

combined.plot.scatter('AVERAGE CLASS SIZE', 'sat_score')

Out[371]:

<matplotlib.axes._subplots.AxesSubplot at 0x35aeea3a08>

CONCLUSION¶

Schools with higher average SAT scores seem to have a higher Safety and Respect score based on student responses in the surveys.
The SAT appears to be unfavourable to Black/Hispanic students, as schools with a higher percentage of these students have lower SAT scores.
Schools with a more balanced gender mix have higher SAT scores, and vice versa.
Schools with higher average class size seem to have higher average SAT scores, and vice-versa.
