Exploring The Causes of SAT Performance in NYC Public Schools

The quality of education and the causes of disparity in public school performance have always been top priorities for parents, education administrators, and elected officials. Here we will look at various demographic metrics and evaluate their impact on SAT scores.

The Scholastic Aptitude Test, or SAT, is a nationally standardized test that higher education institutions use in their admission decisions. It consists of three sections (Critical Reading, Math, and Writing), each scored out of a maximum of 800 points, for a maximum total of 2400. The SAT is often used as a measure of a school's overall quality.

We will be using data published by the City of New York through its "Open Data" portal:

Class size - Average class size for each school

AP Exam Results - School level AP exam results

Graduation Outcomes - Graduation outcomes for each school

Demographics Information - Demographic data for each school

SAT Results - SAT scores for each high school

Attendance - Attendance and enrollment statistics

School Surveys - Yearly parent, teacher, and student survey at each school

Read In The Data

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import folium

%matplotlib inline

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

# Store all dataframes in a dictionary for easy reference
data = {}
for f in data_files:
    d = pd.read_csv(f)
    key_name = f.replace(".csv", "")
    data[key_name] = d
In [2]:
data.keys()
Out[2]:
dict_keys(['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results'])

Observations:

  • Each school entry has a unique DBN, or the relevant information needed to compile one (see the sketch below)
  • We can ultimately join the data sets based on this number
  • The SAT scores are divided into three sections, which we can combine into a total score
  • Some of the schools have multiple rows, so we will have to do some processing to ensure that each DBN is unique
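
As a quick illustration, a DBN (District Borough Number) breaks down into a two-digit district number, a one-letter borough code, and a three-digit school number. A minimal sketch using a DBN that appears in sat_results:

# How a DBN decomposes into its parts.
dbn = "01M292"
district = dbn[:2]   # "01"  - school district
borough = dbn[2]     # "M"   - borough code (M = Manhattan)
school = dbn[3:]     # "292" - school number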

Read In The Survey Data

In [3]:
all_survey = pd.read_csv("survey_all.txt", delimiter="\t", encoding="windows-1252")
d75_survey = pd.read_csv("survey_d75.txt", delimiter="\t", encoding="windows-1252")

# Create a dataframe with the combined surveys
survey = pd.concat([all_survey, d75_survey], axis=0)
In [4]:
# Make "DBN" lowercase
survey["DBN"] = survey["dbn"]

# Keep only the relevant columns: the DBN plus fields describing how students, teachers, and
# parents feel about school safety, communication, engagement, and academic standards
survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey[survey_fields]
data["survey"] = survey

Cleaning and Combining the Data

Adding a DBN column to class_size

Although class_size does not have a DBN column, we can create one using the CSD and School Code columns:

In [5]:
data['class_size'].head()
Out[5]:
CSD BOROUGH SCHOOL CODE SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY) NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS DATA SOURCE SCHOOLWIDE PUPIL-TEACHER RATIO
0 1 M M015 P.S. 015 Roberto Clemente 0K GEN ED - - - 19.0 1.0 19.0 19.0 19.0 ATS NaN
1 1 M M015 P.S. 015 Roberto Clemente 0K CTT - - - 21.0 1.0 21.0 21.0 21.0 ATS NaN
2 1 M M015 P.S. 015 Roberto Clemente 01 GEN ED - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN
3 1 M M015 P.S. 015 Roberto Clemente 01 CTT - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN
4 1 M M015 P.S. 015 Roberto Clemente 02 GEN ED - - - 15.0 1.0 15.0 15.0 15.0 ATS NaN
In [6]:
data['sat_results']['DBN'].head()
Out[6]:
0    01M292
1    01M448
2    01M450
3    01M458
4    01M509
Name: DBN, dtype: object
In [7]:
# Cast CSD to 'str' and pad it with a leading '0' so that single-digit district numbers become two characters

# Define function to be used in .apply()
def dbn_func(n):
    return str(n).zfill(2)

# Create new 'padded_csd' column
data['class_size']['padded_csd'] = data['class_size']['CSD'].apply(dbn_func)

# Join 'padded_csd' and 'SCHOOL CODE' into new 'DBN' column
data['class_size']['DBN'] = data['class_size']['padded_csd'] + data['class_size']['SCHOOL CODE']
In [8]:
data['class_size'].head()
Out[8]:
CSD BOROUGH SCHOOL CODE SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY) NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS DATA SOURCE SCHOOLWIDE PUPIL-TEACHER RATIO padded_csd DBN
0 1 M M015 P.S. 015 Roberto Clemente 0K GEN ED - - - 19.0 1.0 19.0 19.0 19.0 ATS NaN 01 01M015
1 1 M M015 P.S. 015 Roberto Clemente 0K CTT - - - 21.0 1.0 21.0 21.0 21.0 ATS NaN 01 01M015
2 1 M M015 P.S. 015 Roberto Clemente 01 GEN ED - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN 01 01M015
3 1 M M015 P.S. 015 Roberto Clemente 01 CTT - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN 01 01M015
4 1 M M015 P.S. 015 Roberto Clemente 02 GEN ED - - - 15.0 1.0 15.0 15.0 15.0 ATS NaN 01 01M015

Combine SAT Section Scores Into One Total Score

The SAT scores are currently split across the three test sections. We want to combine them into one total-score column for easier analysis.

In [9]:
# Convert each column to a numeric data type
data['sat_results']['SAT Math Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Math Avg. Score'], errors="coerce")
data['sat_results']['SAT Critical Reading Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Critical Reading Avg. Score'], errors="coerce")
data['sat_results']['SAT Writing Avg. Score'] = pd.to_numeric(data['sat_results']['SAT Writing Avg. Score'], errors="coerce")

# Add them together into a new column
data['sat_results']['sat_score'] = data['sat_results']['SAT Math Avg. Score'] + data['sat_results']['SAT Critical Reading Avg. Score'] + data['sat_results']['SAT Writing Avg. Score']

data['sat_results']['sat_score'].head()
Out[9]:
0    1122.0
1    1172.0
2    1149.0
3    1174.0
4    1207.0
Name: sat_score, dtype: float64

Extracting Geographic Location Data

In [10]:
data['hs_directory']['Location 1'].value_counts()
Out[10]:
122 Amsterdam Avenue\nNew York, NY 10023\n(40.77429641100048, -73.98482270099964)        6
500 East Fordham Road\nBronx, NY 10458\n(40.86001222100049, -73.88822960399966)          6
240 East 172 Street\nBronx, NY 10457\n(40.84037339800045, -73.91083800799964)            6
1980 Lafayette Avenue\nBronx, NY 10473\n(40.82230376500047, -73.85596138999966)          6
3000 East Tremont Avenue\nBronx, NY 10461\n(40.840513977000455, -73.83812095999963)      6
                                                                                        ..
28-01 41 Avenue\nLong Island City, NY 11101\n(40.751380834000486, -73.93744971899963)    1
48-10 31 Avenue\nAstoria, NY 11103\n(40.75812467500049, -73.91033330899967)              1
550 Wheeler Avenue\nManhattan, NY 10004\n(40.690787629000454, -74.01975732599965)        1
485 Clawson Street\nStaten Island, NY 10306\n(40.567913003000456, -74.11536224299965)    1
36-41 28 Street\nAstoria, NY 11106\n(40.7570971560005, -73.93261840099967)               1
Name: Location 1, Length: 259, dtype: int64
In [11]:
import re

# Define function to be used in apply
def lat_func(s):
    lat = re.findall(r"(?<=\()(.*),", s)[0]
    return lat 

# Extract latitude data and put it in a new column
data['hs_directory']['lat'] = data['hs_directory']['Location 1'].apply(lat_func)

data['hs_directory']['lat'].head()
Out[11]:
0     40.67029890700047
1      40.8276026690005
2    40.842414068000494
3     40.71067947100045
4    40.718810094000446
Name: lat, dtype: object
In [12]:
# Define function to be used in apply
def long_func(s):
    long = re.findall(r"(-\d*\.\d*)(?=\))", s)[0]
    return long

# Extract longitude data and put it in a new column
data['hs_directory']['lon'] = data['hs_directory']['Location 1'].apply(long_func)

data['hs_directory']['lon'].head()
Out[12]:
0    -73.96164787599963
1    -73.90447525699966
2    -73.91616158599965
3    -74.00080702099967
4    -73.80650045499965
Name: lon, dtype: object
In [13]:
# Convert the columns to a numeric value instead of a string
data['hs_directory']['lat'] = pd.to_numeric(data['hs_directory']['lat'], errors="coerce")
data['hs_directory']['lon'] = pd.to_numeric(data['hs_directory']['lon'], errors="coerce")

The class_size, graduation, and demographics data sets have multiple rows for some of the schools, and therefore multiple rows with the same DBN. Before we can combine the data sets, we will need to consolidate these rows into a single row per DBN.

Remove Duplicate DBN Data

The class_size set has multiple rows for each school, one per GRADE and PROGRAM TYPE. Since we are analyzing high schools, we are only interested in grades 9-12, and the PROGRAM TYPE we are interested in is "GEN ED".

In [14]:
class_size = data['class_size']
class_size = class_size[class_size['GRADE '] == '09-12']
class_size = class_size[class_size['PROGRAM TYPE'] == 'GEN ED']
class_size.head(5)
Out[14]:
CSD BOROUGH SCHOOL CODE SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY) NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS DATA SOURCE SCHOOLWIDE PUPIL-TEACHER RATIO padded_csd DBN
225 1 M M292 Henry Street School for International Studies 09-12 GEN ED ENGLISH English 9 - 63.0 3.0 21.0 19.0 25.0 STARS NaN 01 01M292
226 1 M M292 Henry Street School for International Studies 09-12 GEN ED ENGLISH English 10 - 79.0 3.0 26.3 24.0 31.0 STARS NaN 01 01M292
227 1 M M292 Henry Street School for International Studies 09-12 GEN ED ENGLISH English 11 - 38.0 2.0 19.0 16.0 22.0 STARS NaN 01 01M292
228 1 M M292 Henry Street School for International Studies 09-12 GEN ED ENGLISH English 12 - 69.0 3.0 23.0 13.0 30.0 STARS NaN 01 01M292
229 1 M M292 Henry Street School for International Studies 09-12 GEN ED MATH Integrated Algebra - 53.0 3.0 17.7 16.0 21.0 STARS NaN 01 01M292

We can see that each DBN still has multiple rows, one for each class subject and course listed in CORE SUBJECT (MS CORE and 9-12 ONLY) and CORE COURSE (MS CORE and 9-12 ONLY). We can use .groupby() to split the dataframe into per-DBN groups and then get the average class size using the .agg() function.

In [15]:
# Groupby 'DBN' and then aggregate by mean
class_size = class_size.groupby(by='DBN', axis=0).agg(np.mean)

# Reset index so 'DBN' becomes a column again
class_size.reset_index(inplace=True)
data['class_size'] = class_size

data['class_size'].head()
Out[15]:
DBN CSD NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS SCHOOLWIDE PUPIL-TEACHER RATIO
0 01M292 1 88.0000 4.000000 22.564286 18.50 26.571429 NaN
1 01M332 1 46.0000 2.000000 22.000000 21.00 23.500000 NaN
2 01M378 1 33.0000 1.000000 33.000000 33.00 33.000000 NaN
3 01M448 1 105.6875 4.750000 22.231250 18.25 27.062500 NaN
4 01M450 1 57.6000 2.733333 21.200000 19.40 22.866667 NaN

For each DBN in the demographics data set, there are multiple school years. We want the most recent data, the 2011-2012 school year in this case.

In [16]:
data['demographics'].head()
Out[16]:
DBN Name schoolyear fl_percent frl_percent total_enrollment prek k grade1 grade2 ... black_num black_per hispanic_num hispanic_per white_num white_per male_num male_per female_num female_per
0 01M015 P.S. 015 ROBERTO CLEMENTE 20052006 89.4 NaN 281 15 36 40 33 ... 74 26.3 189 67.3 5 1.8 158.0 56.2 123.0 43.8
1 01M015 P.S. 015 ROBERTO CLEMENTE 20062007 89.4 NaN 243 15 29 39 38 ... 68 28.0 153 63.0 4 1.6 140.0 57.6 103.0 42.4
2 01M015 P.S. 015 ROBERTO CLEMENTE 20072008 89.4 NaN 261 18 43 39 36 ... 77 29.5 157 60.2 7 2.7 143.0 54.8 118.0 45.2
3 01M015 P.S. 015 ROBERTO CLEMENTE 20082009 89.4 NaN 252 17 37 44 32 ... 75 29.8 149 59.1 7 2.8 149.0 59.1 103.0 40.9
4 01M015 P.S. 015 ROBERTO CLEMENTE 20092010 96.5 208 16 40 28 32 ... 67 32.2 118 56.7 6 2.9 124.0 59.6 84.0 40.4

5 rows × 38 columns

In [17]:
# Filter the dataframe to the 2011-2012 school year
data['demographics'] = data['demographics'][data['demographics']['schoolyear'] == 20112012]

print(data['demographics'].head())
       DBN                                              Name  schoolyear  \
6   01M015  P.S. 015 ROBERTO CLEMENTE                           20112012   
13  01M019  P.S. 019 ASHER LEVY                                 20112012   
20  01M020  PS 020 ANNA SILVER                                  20112012   
27  01M034  PS 034 FRANKLIN D ROOSEVELT                         20112012   
35  01M063  PS 063 WILLIAM MCKINLEY                             20112012   

   fl_percent  frl_percent  total_enrollment prek    k grade1 grade2  ...  \
6         NaN         89.4               189   13   31     35     28  ...   
13        NaN         61.5               328   32   46     52     54  ...   
20        NaN         92.5               626   52  102    121     87  ...   
27        NaN         99.7               401   14   34     38     36  ...   
35        NaN         78.9               176   18   20     30     21  ...   

   black_num black_per hispanic_num hispanic_per white_num white_per male_num  \
6         63      33.3          109         57.7         4       2.1     97.0   
13        81      24.7          158         48.2        28       8.5    147.0   
20        55       8.8          357         57.0        16       2.6    330.0   
27        90      22.4          275         68.6         8       2.0    204.0   
35        41      23.3          110         62.5        15       8.5     97.0   

   male_per female_num female_per  
6      51.3       92.0       48.7  
13     44.8      181.0       55.2  
20     52.7      296.0       47.3  
27     50.9      197.0       49.1  
35     55.1       79.0       44.9  

[5 rows x 38 columns]

The graduation data set also has multiple rows per DBN. This is due to multiple years in the Cohort column and multiple subgroups in the Demographic column.

We will take the most recent year from the Cohort column and the Total Cohort rows from the Demographic column.

In [18]:
# Filter the dataframe
data['graduation'] = data['graduation'][data['graduation']['Cohort']=='2006']
data['graduation'] = data['graduation'][data['graduation']['Demographic']=='Total Cohort']

print(data['graduation'].head())
     Demographic     DBN                            School Name Cohort  \
3   Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL   2006   
10  Total Cohort  01M448    UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   2006   
17  Total Cohort  01M450             EAST SIDE COMMUNITY SCHOOL   2006   
24  Total Cohort  01M509                MARTA VALLE HIGH SCHOOL   2006   
31  Total Cohort  01M515  LOWER EAST SIDE PREPARATORY HIGH SCHO   2006   

    Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n  \
3             78              43                     55.1%                36   
10           124              53                     42.7%                42   
17            90              70                     77.8%                67   
24            84              47                       56%                40   
31           193             105                     54.4%                91   

   Total Regents - % of cohort Total Regents - % of grads  ...  \
3                        46.2%                      83.7%  ...   
10                       33.9%                      79.2%  ...   
17         74.400000000000006%                      95.7%  ...   
24                       47.6%                      85.1%  ...   
31                       47.2%                      86.7%  ...   

   Regents w/o Advanced - n Regents w/o Advanced - % of cohort  \
3                        36                              46.2%   
10                       34                              27.4%   
17                       67                74.400000000000006%   
24                       23                              27.4%   
31                       22                              11.4%   

   Regents w/o Advanced - % of grads Local - n Local - % of cohort  \
3                              83.7%         7                  9%   
10                             64.2%        11                8.9%   
17                             95.7%         3                3.3%   
24                             48.9%         7  8.300000000000001%   
31                               21%        14                7.3%   

   Local - % of grads Still Enrolled - n Still Enrolled - % of cohort  \
3               16.3%                 16                        20.5%   
10              20.8%                 46                        37.1%   
17               4.3%                 15                        16.7%   
24              14.9%                 25                        29.8%   
31              13.3%                 53                        27.5%   

   Dropped Out - n Dropped Out - % of cohort  
3               11                     14.1%  
10              20       16.100000000000001%  
17               5                      5.6%  
24               5                        6%  
31              35       18.100000000000001%  

[5 rows x 23 columns]

We need to convert the exam scores in the ap_2010 data set to numeric values.

In [19]:
# Convert 'AP Test Takers ', 'Total Exams Taken', and 'Number of Exams with scores 3 4 or 5'
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
    
print(data["ap_2010"].dtypes)
DBN                                      object
SchoolName                               object
AP Test Takers                          float64
Total Exams Taken                       float64
Number of Exams with scores 3 4 or 5    float64
dtype: object

Combining the Data Sets

Because our analysis is centered on SAT scores and how demographic factors impact them, we want to preserve as many rows as possible from the sat_results data set. We will use a left merge:

In [20]:
combined = data["sat_results"]

# Left Merge
combined = combined.merge(data['ap_2010'], how='left', on='DBN')
combined = combined.merge(data['graduation'], how='left', on='DBN')
combined.shape
Out[20]:
(479, 33)
In [21]:
# Rename the lowercase 'dbn' column to uppercase 'DBN' to match the other data sets
data['hs_directory'] = data['hs_directory'].rename({'dbn':'DBN'}, axis=1)

# Inner Merge
combined = combined.merge(data['class_size'], how='inner', on='DBN')
combined = combined.merge(data['demographics'], how='inner', on='DBN')
combined = combined.merge(data['survey'], how='inner', on='DBN')
combined = combined.merge(data['hs_directory'], how='inner', on='DBN')

Fill missing values

In [22]:
# Fill missing values with the mean of each respective column
combined = combined.fillna(combined.mean())

# Fill any leftover missing values with zero
combined = combined.fillna(0)
combined.head()
Out[22]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... priority04 priority05 priority06 priority07 priority08 priority09 priority10 Location 1 lat lon
0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355.0 404.0 363.0 1122.0 0 129.028846 197.038462 ... Then to Manhattan students or residents Then to New York City residents 0 0 0.0 0.0 0.0 220 Henry Street\nNew York, NY 10002\n(40.7137... 40.713764 -73.985260
1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383.0 423.0 366.0 1172.0 UNIVERSITY NEIGHBORHOOD H.S. 39.000000 49.000000 ... 0 0 0 0 0.0 0.0 0.0 200 Monroe Street\nNew York, NY 10002\n(40.712... 40.712332 -73.984797
2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377.0 402.0 370.0 1149.0 EAST SIDE COMMUNITY HS 19.000000 21.000000 ... 0 0 0 0 0.0 0.0 0.0 420 East 12 Street\nNew York, NY 10009\n(40.72... 40.729783 -73.983041
3 01M509 MARTA VALLE HIGH SCHOOL 44 390.0 433.0 384.0 1207.0 0 129.028846 197.038462 ... 0 0 0 0 0.0 0.0 0.0 145 Stanton Street\nNew York, NY 10002\n(40.72... 40.720569 -73.985673
4 01M539 NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ... 159 522.0 574.0 525.0 1621.0 NEW EXPLORATIONS SCI,TECH,MATH 255.000000 377.000000 ... 0 0 0 0 0.0 0.0 0.0 111 Columbia Street\nNew York, NY 10002\n(40.7... 40.718725 -73.979426

5 rows × 158 columns

To analyze by school district we will need to extract it from the DBN column. The school district is the first two characters of the DBN.

In [23]:
# Define function to be used in .apply()
def first_two(s):
    dist = s[0:2]
    return dist

# Create new 'school_dist' column using .apply()
combined['school_dist'] = combined['DBN'].apply(first_two)

combined['school_dist'].head()
Out[23]:
0    01
1    01
2    01
3    01
4    01
Name: school_dist, dtype: object

What Is Driving SAT Score?

To answer this we will use the Pearson correlation coefficient, or r value. The r value measures the linear correlation between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation), with zero representing no correlation at all.
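
pandas' .corr() computes r for every pair of numeric columns; as a reference point, a minimal sketch of the formula for a single pair of columns:

# Pearson r from the definition: covariance over the product of standard deviations.
x = combined['total_enrollment']
y = combined['sat_score']
sxy = ((x - x.mean()) * (y - y.mean())).sum()
sxx = ((x - x.mean()) ** 2).sum()
syy = ((y - y.mean()) ** 2).sum()
r = sxy / (sxx * syy) ** 0.5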

In [24]:
# Compute each column's correlation with 'sat_score', sorted in descending order.
combined_corr = combined.corr().loc[:,['sat_score']].sort_values(by='sat_score', ascending=False)

# Adjust figure size so it's long enough for our data set.
plt.figure(figsize=(4,20))

# Plot
ax = sns.heatmap(combined_corr, annot=True, cbar=False)

Observations:

  • total_enrollment and AVERAGE CLASS SIZE have a positive correlation with sat_score. One might expect smaller schools and smaller classes to produce more academic proficiency, but here the opposite is true.
  • female_per and female_num correlate positively while male_per and male_num correlate negatively; however, the relationships are quite weak.
  • Teacher and student ratings of school safety (saf_t_11 and saf_s_11) correlate positively
  • Student ratings of academic standards (aca_s_11) correlate positively, while academic standard ratings from teachers and parents show much less correlation
  • There is racial inequality: white_per & asian_per correlate positively while black_per & hispanic_per correlate negatively
  • The percentage of English language learners (ell_percent) and the percentage of students on free or reduced-price lunch (frl_percent) both have strong negative correlations.

Total Enrollment v. SAT Score

In [25]:
# Plot
fig, ax = plt.subplots(figsize=(9, 6))
ax = sns.scatterplot(data=combined, x='total_enrollment', y='sat_score')

# Plot Aesthetics
ax.set_title('Total Enrollment v. SAT Score', fontsize=18)
ax.set_xlabel('Total Enrollment', fontsize=14)
ax.set_ylabel('SAT Score', fontsize=14)
sns.despine()

There is a cluster of schools at the bottom left where both total enrollment and SAT scores are very low, which could be skewing the r value.
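
One quick way to gauge the cluster's pull (a sketch; the 1000-point cutoffs mirror the filter below and are otherwise arbitrary) is to recompute r with those schools excluded:

# Pearson r with and without the low-enrollment, low-score cluster.
r_all = combined['total_enrollment'].corr(combined['sat_score'])
outside = combined[(combined['total_enrollment'] >= 1000) | (combined['sat_score'] >= 1000)]
r_outside = outside['total_enrollment'].corr(outside['sat_score'])
print(r_all, r_outside)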

In [26]:
# Filter to view schools with low enrollment and low SAT score
low_enrollment = combined[combined['total_enrollment'] < 1000]
low_enrollment = low_enrollment[low_enrollment['sat_score'] < 1000]

# View the school names and percentage of English learners
low_enrollment.loc[:,['SCHOOL NAME', 'ell_percent']]
Out[26]:
SCHOOL NAME ell_percent
91 INTERNATIONAL COMMUNITY HIGH SCHOOL 90.1
125 ACADEMY FOR LANGUAGE AND TECHNOLOGY 86.6
126 BRONX INTERNATIONAL HIGH SCHOOL 87.1
139 KINGSBRIDGE INTERNATIONAL HIGH SCHOOL 88.7
141 INTERNATIONAL SCHOOL FOR LIBERAL ARTS 79.9
176 PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE 92.9
179 HIGH SCHOOL OF WORLD CULTURES 86.4
188 BROOKLYN INTERNATIONAL HIGH SCHOOL 83.1
225 INTERNATIONAL HIGH SCHOOL AT PROSPECT HEIGHTS 91.0
237 IT TAKES A VILLAGE ACADEMY 30.4
253 MULTICULTURAL HIGH SCHOOL 94.6
286 PAN AMERICAN INTERNATIONAL HIGH SCHOOL 91.3
In [27]:
# City-wide average of ELL students
combined['ell_percent'].mean()
Out[27]:
13.334986225895316

Observations

The schools which have very low enrollment as well as low SAT scores are international schools where the percentage of students who are English Language Learners is very high relative to the city-wide average.

In this case it is actually ell_percent which correlates more closely with sat_score, rather than total_enrollment.
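
A quick check of the two r values side by side:

# ell_percent should show the stronger relationship (negative) with sat_score.
print(combined['ell_percent'].corr(combined['sat_score']))
print(combined['total_enrollment'].corr(combined['sat_score']))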

Let's plot out this relationship:

In [28]:
# Plot
fig, ax = plt.subplots(figsize=(9, 6))
ax = sns.scatterplot(data=combined, x='ell_percent', y='sat_score')

# Plot Aesthetics
ax.set_title('English Learners v. SAT Score', fontsize=18)
ax.set_xlabel('English Learners Percentage', fontsize=14)
ax.set_ylabel('SAT Score', fontsize=14)
sns.despine()

School Districts v. SAT Score

In [29]:
# Assign SAT score median for use in boxplot
sat_median = combined['sat_score'].median()

# Plot 
fig, ax = plt.subplots(figsize=(25, 15))
ax = sns.boxplot(x="school_dist", y="sat_score", data=combined,fliersize=0, color=[105/255, 158/255, 219/255])

# Plot Aesthetics
ax.tick_params(axis='x', labelsize=18)
ax.tick_params(axis='y', labelsize=18)
ax.set_ylabel('SAT Score', fontsize=24)
ax.set_xlabel('School District', fontsize=24)
ax.set_title('SAT Scores By School District', fontsize=30)
ax.set(ylim=(800, 1900))
ax.axhline(y=sat_median, color='red', label='City-Wide SAT Score Median', linestyle='--')
ax.legend(prop={'size': 24})
sns.despine()

Observations

We don't see very wide variance among the district medians; all the districts fall within about 20% of the city-wide median. District 22 (Southeastern Brooklyn) and district 26 (Bayside, Queens) have the highest SAT scores, while district 32 (Bushwick, Brooklyn) has the lowest.
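
To rank the districts directly, a minimal sketch:

# Median SAT score per district, sorted best to worst.
district_medians = combined.groupby('school_dist')['sat_score'].median().sort_values(ascending=False)
print(district_medians.head())
print(district_medians.tail())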

Examining Survey Fields with SAT Score

Recall that we combined the all_survey and d75_survey data sets into a single data set called survey. These surveys contained responses from parents, teachers, and students on issues such as engagement, safety, and communication.

Recall the data dictionary (note: while this reference is from 2010, the field descriptions are the same for our 2011 data set). In brief, each survey field's prefix gives the topic (saf = safety and respect, com = communication, eng = engagement, aca = academic expectations), the middle token gives the respondent group (_p = parent, _t = teacher, _s = student, _tot = all respondents), rr_* fields are response rates, and N_* fields are respondent counts.

In [30]:
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
In [31]:
correlations = combined.corr()
correlations = correlations['sat_score']
correlations_survey = correlations[survey_fields].sort_values(ascending=False)

# Plot
fig, ax = plt.subplots(figsize=(10, 10))
ax=sns.barplot(x='sat_score', y='index', data=correlations_survey.reset_index(), color=[105/255, 158/255, 219/255])

# Plot Aesthetics
ax.axvline(x=0.25,color='red',linestyle='dashed',linewidth=0.5)
ax.set_title('Survey Responses and SAT Score Correlation', fontsize=18)
ax.set_ylabel('Survey', fontsize=14)
ax.set_xlabel('Correlation Value', fontsize=14)
ax.tick_params(labelsize=11)
sns.despine()

# Bar Labels
for p in ax.patches:
    width = p.get_width()
    x, y = p.get_xy()
    ax.annotate(str(round(width, 2)),(max(width, 0) + 0.005, y + 0.5))

Observations

N_s, N_t, & N_p are proxies for total_enrollment, it is expected that they would be high as well.

The safety scores saf_s_11, saf_tot_11, and saf_t_11 are highly correlated with SAT scores; however, the parents' safety score (saf_p_11) is much less correlated. This could be because students and teachers are present at school every day and have first-hand impressions of safety, while parents are more removed from the daily experience.

The student-reported academic score aca_s_11 is highly correlated. If students feel they are being challenged academically, it makes sense that they would tend to do well on the SAT. Interestingly, neither the parents' nor the teachers' perception of academic standards shows much correlation.

rr_s is moderately correlated; if a student cares enough to fill out the survey, it might be a sign that they take school seriously and will do well on the SAT.

Safer Schools Mean Higher SAT Scores

In [32]:
# Plot
fig, ax = plt.subplots(figsize=(9, 6))
ax = sns.scatterplot(data=combined, x='saf_s_11', y='sat_score')

# Plot Aesthetics
ax.set_title('Student Reported Safety Scores v. SAT Score', fontsize=18)
ax.set_xlabel('Safety Scores', fontsize=14)
ax.set_ylabel('SAT Score', fontsize=14)
sns.despine()
In [33]:
safest = combined[combined['saf_s_11'] > 8]
safest = safest.loc[:, ['lat', 'lon', 'SCHOOL NAME']]
safest
Out[33]:
lat lon SCHOOL NAME
5 40.718962 -73.976066 BARD HIGH SCHOOL EARLY COLLEGE
26 40.770116 -73.953379 ELEANOR ROOSEVELT HIGH SCHOOL
155 40.871255 -73.897516 HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE
173 40.827764 -73.900387 BRONX LATIN
175 40.825226 -73.893950 PEACE AND DIVERSITY ACADEMY
305 40.740556 -73.792848 FRANCIS LEWIS HIGH SCHOOL
338 40.771251 -73.924602 YOUNG WOMEN'S LEADERSHIP SCHOOL, ASTORIA
356 40.567913 -74.115362 STATEN ISLAND TECHNICAL HIGH SCHOOL
361 40.697408 -73.913153 ALL CITY LEADERSHIP SECONDARY SCHOOL
In [34]:
# Create Empty Map
m = folium.Map(location=[40.708, -73.94],
                        zoom_start=10.5,
                        tiles="CartoDB positron")

# Add markers to pin the schools in our safest dataframe
for i in range(len(safest)):
    folium.Marker([safest.iloc[i]['lat'], safest.iloc[i]['lon']], popup=safest.iloc[i]['SCHOOL NAME']).add_to(m)
m
Out[34]:
(Interactive folium map with a marker at each of the schools listed above.)

Observations

This plot confirms that as safety scores rise, SAT scores follow. From a safety score of 5 up to about 7, we see an overall increase in SAT scores. From 7 through 8.25, we see an expansion in the upper limit of scores. From 8.5 onwards, we see scores that are lower than we would have expected.
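
To put rough numbers on that pattern (a sketch; the bin edges are an arbitrary choice):

# Mean SAT score within student-reported safety-score bins.
safety_bins = pd.cut(combined['saf_s_11'], bins=[5, 6, 7, 8, 9])
print(combined.groupby(safety_bins)['sat_score'].mean())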

In [35]:
# Group by borough
grouped_boro = combined.groupby(by='boro').agg(np.mean).reset_index()

# Plot
fig, ax = plt.subplots(figsize=(10, 10))
ax=sns.barplot(x='boro', y='saf_s_11', data=grouped_boro, color=[105/255, 158/255, 219/255])
ax.set(ylim=(6, 7))

# Plot Aesthetics
ax.set_title('Student-Reported Safety Scores By Borough', fontsize=18)
ax.set_xlabel('Borough', fontsize=16)
ax.set_ylabel('Safety Score', fontsize=16)
ax.tick_params(labelsize=12)
sns.despine()

# Bar Labels
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.2f'),  
                   (p.get_x() + p.get_width() / 2,  
                    p.get_height()), ha='center', va='center', 
                   size=15, xytext=(0, 8), 
                   textcoords='offset points')

Observations

Manhattan and Queens have the highest average safety score while Staten Island has the lowest.

Exploring Race v. SAT Score - Large Disparities Exist

In [36]:
race_cols = [
    'white_per',
    'asian_per', 
    'black_per',
    'hispanic_per']
correlations_race = correlations[race_cols].reset_index()
In [37]:
# Plot
fig, ax = plt.subplots(figsize=(9, 9))
ax=sns.barplot(x=race_cols, y='sat_score', data=correlations_race, color=[105/255, 158/255, 219/255])

# Plot Aesthetics
ax.set_title('Correlations By Race', fontsize=18)
ax.set_xlabel('Race Percentage', fontsize=16)
ax.set_ylabel('SAT Score Correlation', fontsize=16)
ax.tick_params(labelsize=12)
ax.set_xticklabels(labels=["White", "Asian", "Black", "Hispanic"])
sns.despine()

# Bar Labels
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.2f'),  
                   (p.get_x() + p.get_width() / 2,  
                    p.get_height()), ha='center', va='center', 
                   size=15, xytext=(0, 8), 
                   textcoords='offset points')

Observations

The percentages of white and asian students correlate positively with SAT score, while the percentages of black and hispanic students correlate negatively.

Hispanic Percentage v. SAT Score - Language Is A Big Barrier

In [38]:
# Plot
fig, ax = plt.subplots(figsize=(9, 6))
ax = sns.scatterplot(data=combined, x='hispanic_per', y='sat_score')

# Plot Aesthetics
ax.set_title('Hispanic Percentage v. SAT Score', fontsize=18)
ax.set_xlabel('Hispanic Percentage', fontsize=14)
ax.set_ylabel('SAT Score', fontsize=14)
sns.despine()

Observations

Overall, as the hispanic percentage increases, the SAT score tends to decrease. Once the hispanic percentage reaches around 25%, there are no schools with SAT scores above about 1500. For schools with percentages between 25% and around 80%, the SAT score stays within a consistent range of about 1050 to 1400. There are also a handful of schools with percentages near 100%.

In [39]:
combined[combined['hispanic_per']>95]['SCHOOL NAME']
Out[39]:
44                         MANHATTAN BRIDGES HIGH SCHOOL
82      WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL
89     GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...
125                  ACADEMY FOR LANGUAGE AND TECHNOLOGY
141                INTERNATIONAL SCHOOL FOR LIBERAL ARTS
176     PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE
253                            MULTICULTURAL HIGH SCHOOL
286               PAN AMERICAN INTERNATIONAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object
In [40]:
# Plot
fig, ax = plt.subplots(figsize=(9, 6))
ax = sns.regplot(data=combined, x='ell_percent', y='sat_score', line_kws={"color": "red"})

# Plot Aesthetics
ax.set_title('English Learners v. SAT Score', fontsize=18)
ax.set_xlabel('English Learners Percentage', fontsize=14)
ax.set_ylabel('SAT Score', fontsize=14)
sns.despine()

The schools with over 95% hispanic students are international schools whose student bodies are made up largely of first-generation immigrants from Spanish-speaking countries, for whom English is a second language.

The above plot confirms that as the percentage of English learners increases, the average SAT score at the school decreases.

This bolsters our earlier observation that English language learner percentage is a big driver of SAT scores.

In [41]:
combined[(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)]["SCHOOL NAME"]
Out[41]:
37                                STUYVESANT HIGH SCHOOL
151                         BRONX HIGH SCHOOL OF SCIENCE
187                       BROOKLYN TECHNICAL HIGH SCHOOL
327    QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...
356                  STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object

These schools are among the 9 "selective and specialized high schools of New York City" that serve the needs of academically or artistically gifted students. A separate admissions test is required for enrollment. As expected, these students do well on the SAT.

Exploring Gender v. SAT Score - Not Much Correlation

In [42]:
gender_cols = ['male_per','female_per']
correlations_gender = correlations[gender_cols].reset_index()
In [43]:
# combined.corr()['sat_score'][['male_per','female_per']].plot.bar()

# Plot
fig, ax = plt.subplots(figsize=(9, 9))
ax=sns.barplot(x=gender_cols, y='sat_score', data=correlations_gender, color=[105/255, 158/255, 219/255])

# Plot Aesthetics
ax.set_title('Correlations By Gender', fontsize=18)
ax.set_xlabel('Gender Percentage', fontsize=16)
ax.set_ylabel('SAT Score Correlation', fontsize=16)
ax.tick_params(labelsize=12)
ax.set_xticklabels(labels=["Male", "Female"])
sns.despine()

# Bar Labels
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.2f'),  
                   (p.get_x() + p.get_width() / 2,  
                    p.get_height()), ha='center', va='center', 
                   size=15, xytext=(0, 8), 
                   textcoords='offset points')

Observations

Overall there is not much correlation between gender and SAT score. Higher female percentage has a slightly positive correlation while higher male percentage has a slightly negative correlation.

In [44]:
# Plot
fig, ax = plt.subplots(figsize=(9, 6))
ax = sns.scatterplot(data=combined, x='female_per', y='sat_score')

# Plot Aesthetics
ax.set_title('Female Percentage v. SAT Score', fontsize=18)
ax.set_xlabel('Female Percentage', fontsize=14)
ax.set_ylabel('SAT Score', fontsize=14)
sns.despine()