The SAT (Schoolastic Aptitude Test) is administered during the senior year of high school. Colleges across the country rely on SAT scores as one of their key admission critera.

A college degree is important to unlocking increased opportunities for social and economic advancement. Because of the SAT's importance in the college admissions process it come under increased scrutiny and concern that it may help perpetuate societal inequality, rather than reducing it, by being subject to various biases.

We will use data published by the New York City Public schools in order to investigate some of these questions and explore some of the demographic factors that may be connected to variations in SAT scores. The inital dataset covers 435 highschools with over 100 pieces of information on school performance, programs and student demographics for each school.

We start by analyzing all the factors and identifying those with the strongest relationship to SAT performance. Then we drill into specific areas of interest.

- Subsidized school lunch in relationship to SAT scores and other factors
- How the results of a survey of students, parents and teachers relate to SAT performance
- The school safety ratings from those surveys and how they relate to SAT scores
- The relationship between students receiving subsidized school lunch and SAT scores
- Racial differences in SAT Scores
- SAT performance by hispanic students
- Gender differences in SAT performance
- The relationship between SAT test scores and participation in AP (Advanced Placement) testing
- Relationships between class size and SAT performance

Caveat: This analysis looks at relationships between SAT scores and other factors. These relationships could exist because of coincidence, or because they stem from the same underlying cause, or because one is the cause of the other. This analysis can't distinguish among them.

**The single strongest relationship to SAT scores is the socioeconomic status of students.** New York City Schools reports the percentage of students receiving free or reduced price school lunch at each high school. If we compare this percentage to average SAT scores at each school, we find that the more students qualifying for subsidized school lunch programs the lower the schools SAT scores.

Other factors:

Schools with higher proportions of asian and white students tend to have higher average SAT scores, while schools with higher proportions of black and hispanic students tend to have lower average SAT scores.

*However, no racial/ethnic component is as strongly related to SAT scores as the percentage of students receiving free or reduced-cost school lunch*.Schools with larger proportions of hispanic students generally cater to new immigrants and have a significant % of students who are learning English. They also tend to have more students receiving free or reduced-cost school lunch.

Schools with high proportions of students learning English tend to have low average SAT scores. The SAT is only administrered in English, so it makes sense that English language learners would be a strong disadvantage.

The gender makeup of schools does not appear to be closely related to SAT scores.

There is not an obvious relationship between SAT scores and the number of students taking AP exams.

By analyzing survey data, I found that student and teacher perceptions of school safety have some relationship to higher SAT scores, as do student perceptions of academic expectations.

Further investigating the student responses about school safety, I found that:

- Brooklyn, Queens and the Bronx all have school districts that get the highest safety ratings from students, and
*also*have school districts that get lowest safety ratings. - Staten Island's single district is slightly below the overall average.
- Manhattan's schools are all in the middle of the range, with no schools at either end of the extreme.

- Brooklyn, Queens and the Bronx all have school districts that get the highest safety ratings from students, and

In [1]:

```
from matplotlib import pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy
import pandas as pd
from pandas.plotting import scatter_matrix
import re
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline
%config InlineBackend.figure_format='retina'
#pd.options.display.max_rows = 200
pd.options.display.max_columns = 30
```

In [2]:

```
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for f in data_files:
d = pd.read_csv("schools/{0}".format(f))
data[f.replace(".csv", "")] = d
```

In [3]:

```
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
```

In [4]:

```
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
def pad_csd(num):
string_representation = str(num)
if len(string_representation) > 1:
return string_representation
else:
return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
```

In [5]:

```
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
def find_lat(loc):
coords = re.findall("\(.+, .+\)", loc)
lat = coords[0].split(",")[0].replace("(", "")
return lat
def find_lon(loc):
coords = re.findall("\(.+, .+\)", loc)
lon = coords[0].split(",")[1].replace(")", "").strip()
return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
```

In [6]:

```
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
```

In [7]:

```
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
```

In [8]:

```
data['graduation'].info()
```

In [9]:

```
for col in data['graduation'].columns[5:]:
data['graduation'][col] = pd.to_numeric(data['graduation'][col].str.strip().str.replace('%',''), errors="coerce")
```

In [10]:

```
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
combined = combined.merge(data[m], on="DBN", how="inner")
combined = combined.fillna(combined.mean())
combined = combined.dropna(axis=1, thresh=len(combined)*.25)
combined = combined.fillna(0)
```

In [11]:

```
def get_first_two_chars(dbn):
return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
```

In [12]:

```
correlations = combined.corr()
#correlations = correlations
print(correlations["sat_score"].sort_values())
```

In [13]:

```
correlations[correlations["sat_score"].abs() > .25]["sat_score"].sort_values().plot.barh(figsize = (10,10))
```

Out[13]:

`frl_percent`

, which is the percentage of students who qualify for free or reduced-price school lunch, has a strong negative correlation with `sat_scores`

, with an R-value of -0.722225. The only other correlations of this magnitude are positive, and are other metrics of academic performance.

Other demographic characteristics with a negative correlation to SAT scores:

`black_per`

: percentage of black students in student body`hispanic_per`

: percentage of hispanic students`ell_percent`

: percentage of students learning english`sped_percent`

: percentage of students receiving special education

The academic measures that correlate most strongly are, not surprisingly, the SAT subscores (Writing, Math, Critical Reading). Percentages of students receiving advanced "Regents" diplomas also correlates strongly with high SAT scores.

Interestingly class and school size also seem to have a positive correlation with SAT scores.

We saw above that the percentage of students receiving free or reduced-price lunch (FRL) correlates strongly with lower SAT scores. Below I investigate that relationship, and identify other factors that correlate strongly with the percentage of FRL students.

In [14]:

```
combined.plot.scatter(x='frl_percent', y='sat_score')
```

Out[14]:

Plotting SAT scores against the percentage of students receiving subsidized school lunch makes the relationship between the more obvious, and also underscores the fact that the average NYC public highschool has more than 50% of students receiving subsidized school lunch.

In [15]:

```
correlations[abs(correlations['frl_percent']) >0.3]['frl_percent'].sort_values()
```

Out[15]:

There are a number of factors with strong negative correlations with `frl_percent`

. Most are other academic measures, which we saw earlier have a high positive correlation with SAT scores. Interestingly, the relationship is stronger for the Critical Reading and Writing SAT scores than it is for the Math portion. Other factors with a strong negative correlation include the percentage of white students.

The strongest positive correlation is with the percentage of hispanic students (`hispanic_per`

) followed by the percentage of students considered english language learners (`ell_percent`

).

Our dataset includes responses to a set of survey questions presented to students, teachers and parents. This gives us an opportunity to explore how their perceptions relate to SAT scores.

In [16]:

```
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
```

In [17]:

```
survey_fields.sort()
correlations["sat_score"][survey_fields].plot.barh(figsize=(10,12))
```

Out[17]:

**Key:**

- The first three letters indicate the question asked:
- aca: Academic expections
- com: Communications
- eng: Engagement
- saf: Safety
- r: Response rate
- N: Total number of respondants

- The next letters are the different categores of respondants.
- p: parents
- t: teachers
- s: students
- tot: Total of all the above categories

- The trailing number, where present, indicates the position of the question on the survey.

**Observations:**

The strongest correlations are for the number of parents and students responding the the survey. This is largely determined by school size, which, we've already noted, has a positive correlation with SAT scores.

Among the survey questions, parent, student and teacher ratings of school safety show the strongest correlation with SAT scores.

Student reports of engagement correlate weakly with higher SAT scores, but much more strongly than teacher or parent perceptions.

Student reports of academic expectations correlates with higher SAT scores, more so than teacher perceptions, and much more than parental perceptions.

As noted above, student and teacher perceptions of safety have a positive relationship with SAT scores. This seems like an important relationship, and further, safety is also important on its own. I investigate this in more detail below.

In [18]:

```
combined.plot.scatter(x='saf_s_11', y='sat_score', figsize=(7,5))
```

Out[18]:

While a correlation analysis identifies a relationship between SAT scores and safety scores, the relationship is only really evident above `saf_s_11`

scores of ~6.5.

In [19]:

```
district_averages = combined.groupby('school_dist').mean()
```

In [20]:

```
plt.figure(figsize=(8,8))
m = Basemap(
projection='merc',
llcrnrlat=40.496044,
urcrnrlat=40.915256,
llcrnrlon=-74.255735,
urcrnrlon=-73.700272,
resolution='h',
)
m.drawmapboundary()
m.drawcoastlines(color='#6D5F47', linewidth=.4)
m.drawrivers(color='#6D5F47', linewidth=.4)
lat = district_averages['lat'].tolist()
lon = district_averages['lon'].tolist()
m.scatter(lon, lat, s=50, zorder=2, latlon=True, c=district_averages['saf_s_11'], cmap='summer')
plt.show()
```

In [21]:

```
print("min:", district_averages['saf_s_11'].min())
print("mean:", district_averages['saf_s_11'].mean())
print("max:", district_averages['saf_s_11'].max())
print("stdev:", district_averages['saf_s_11'].std())
```

In [22]:

```
combined.pivot_table(index=['boro', 'school_dist'], values='saf_s_11', margins='All')
```

Out[22]:

- Across all districts,
`saf_s_11`

scores, range from a low of 5.87 for district 16 in Brooklyn, to a high of 7.12 for district 20, which is also in Brooklyn. - Queens and the Bronx also repeat this pattern of having districts near both the bottom and top of the range.
- Staten Island's only district's
`saf_s_11`

score is a bit below the citywide mean. - Manhattan's districts have a narrower distribution with scores falling within approximately one standard deviation of the citywide mean.

The survey scores show some differences between the different groups (parents, students, teachers) taking the survey. Here I investigate those differences.

In [23]:

```
survey_data = combined[['DBN'] + survey_fields]
survey_data = survey_data.melt(id_vars='DBN')
#There has to be a better way?
survey_data[['question','respondant']] = pd.DataFrame(survey_data['variable'].str.split("_").to_list()).iloc[:,0:2]
survey_data = survey_data.drop('variable', axis=1)
survey_data = survey_data[survey_data['question'] != 'N']
survey_data = survey_data[survey_data['question'] != 'rr']
survey_data = survey_data[survey_data['respondant'] != 'tot']
survey_data['respondant'] = survey_data['respondant'].map({"p": "Parent", "s": "Student", "t": "Teacher"})
survey_data = survey_data.reset_index(drop=True)
#Finally!
ax = survey_data.pivot_table(index='question', columns='respondant', values='value').plot.barh(figsize=(9,4))
ax.set_yticklabels(["Safety", "Engagement", "Communication", "Academic Expectations"])
```

Out[23]:

We can see, in the chart above, that parents, students, and teachers have similar perceptions of safety, engagement, academic expections and communications. Students rate things slighly lower than teachers do, while parents rate them slightly higher.

In [24]:

```
race_columns = ['white_per', 'asian_per', 'black_per', 'hispanic_per']
correlations['sat_score'][race_columns].plot.barh()
```

Out[24]:

There is a inverse relationship between percentage of Black or Hispanic students and a school's average SAT scores. There is a stronger positive relationship between percentage of White or Asian students and a school's SAT scores.

*Keep in mind that we earlier found that a proxy for socioeconomic status, frl_percent (the percentage of students receiving subsidized school lunch) had a strong negative correlation with SAT scores.*

In [25]:

```
correlations['frl_percent'][race_columns].plot.barh()
```

Out[25]:

Above we see how `frl_percent`

, the percentage of students receiving subsidized lunch, correlate with racial makup of schools. Racial diferences in socioeconomic status could help explain racial disparities in SAT scores. This could be the subject of a separate analysis.

Let's take a closer look at the percentage of hispanic students and school's average SAT scores.

In [26]:

```
combined.plot.scatter(x='hispanic_per', y='sat_score')
```

Out[26]:

We can see that the schools with highest average SAT scores have low percentages of hispanic students. In addition, the schools with nearly 100% hispanic students have some of the lowest average SAT scores.

However, schools with the lowest SAT scores have a wide range of percentages of hispanic students. Moreover, schools with average SAT scores also have a wide range of percentages of hispanic students.

Let's look at schools with >95% hispanic students and see what we can learn about them.

In [27]:

```
high_hispanic = combined[combined['hispanic_per'] >95]
high_hispanic.T
```

Out[27]:

**Select details:**

- MANHATTAN BRIDGES HIGH SCHOOL: Caters to recently arrived students from spanish-speaking countries. 46% english language learners, but all have spanish as native language. 100% minority. 83% low-SES. High college readyness and performance on state-mandated standardized tests. Good discipline/order.
- WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL: 81% low-SES. 91%hispanic. 16% english language learners. Lots of experiential learning. 90% graduate in 4y. Average college prep performance.
- GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...: Focus on spanish speaking students new to the country (less than 2y) with at least 8y education in their native countries. 78% english language learners. Entering students are often poorly prepared academically -- Low 4y graduation rate. 80% 6-year graduation rate. 99% low SES. Below average college readyness.
- ACADEMY FOR LANGUAGE AND TECHNOLOGY: Focus on new immigrants. Most students graduate on time. 100% latino. 2/3rds english language learners. Orderly and Safe. 100% low-SES. Good college readyness.
- INTERNATIONAL SCHOOL FOR LIBERAL ARTS 100% hispanic. 71% learning english (admissions focused on those who use spanish as primary language at home). 75% graduate on time. Average safety. 98% low SES. Many students enter with poor academic preparation but overall, students make good progress.
- PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE: Focus on new spanish-speaking immigrants. Above average safety and discipline. 84% graduate in 4 years. 99% low SES. 88% english language learners. 30% interrupted education. 40% unaccompanied minors.
- MULTICULTURAL HIGH SCHOOL: Mix of In US less than 3y and longer residing bi-lingual students. 98% hispanic. 88% english language learners. 93% low-SES. Average order but above average bullying. Many students are older than average and have to help support their families. Problematic college prep.
- PAN AMERICAN INTERNATIONAL HIGH SCHOOL: All students have been in US less than 4y and speak spanish. Most teachers speak english-only. Good safety. 85% graduate in 4y. Problematic college prep. 99% hispanic. 84% low-SES. 84% english language learners.

More than 80% at all these schools (>90% in most) qualify for free or reduced-price lunch, indicating that most students are of low socio-economic status. Recall that we earlier found that the percentage of students receiving free or reduced price lunch had a strong negative correlation with SAT scores.

Most have ~50% or more students who are english language learners.

Let's contrast this with schools with low percentages of hispanic students and high SAT scores.

In [28]:

```
low_hispanic = combined[(combined['hispanic_per'] <10) & (combined['sat_score']>1800)]
low_hispanic.T
```

Out[28]:

The schools with the highest SAT scores and lowest proportions of hispanic students are majority asian, with one exception. They all have ~0% english language learners, and <=50% low-SES (using subsidized school lunch as the proxy). All have selective admissions based on test scores.

Most or all have a focus on Science, Technology, Engineering and Math (STEM).

In [29]:

```
combined[(combined['hispanic_per'] >20) & (combined['sat_score']>1800)].T
```

Out[29]:

There is only one school with high average SAT scores and more than than 20% hispanic students, the High School for Mathematics, Science, and Engineering (HSMSE). It is a small (~100 students per cohort) STEM-focused school located in Manhattan and draws from students from all five bouroghs. Despite having a significant hispanic population, the school has 0% english language learners. This sets it apart from the predominantly hispanic schools we looked at earlier.

HSMSE, like other selective, specialized high schools in the NYC Schools requires a foreign language for graduation. It is somewhat unusual though in that it only offers two languages. All students must take German, unless they can demonstrate proficiency in Spanish, in which case they can elect to study advanced Spanish, instead. This may not explain the relatively high percentage of hispanic students though as all the other competitive high schools, save one, offer Spanish among their language options. Two, Brooklyn Latin and the High School of American Studies, require Spanish of all students. Queens HS of Sciences also offers Spanish in addition to only one other language (Mandarin).

We saw earlier that the more english language learners a school has, the lower the average SAT scores tend to be. Let's look at that in a little more detail, along side some of the other variables we've been interested in.

In [30]:

```
combined.plot.scatter(x='ell_percent', y='sat_score')
```

Out[30]:

The relationship between `sat_scores`

and `ell_percent`

is most clear at the extremes. Top schools (investigated earlier) have few if any english language learners while schools that predominantly have english language learners have below school average SAT scores, and some have the lowest school average SAT scores in the system. Between the two extremes the relationship is still visible.

Interestingly schools tend to either have fewer than 30% of students learning english, or more than 70%.

In [31]:

```
combined[combined['sat_score'] > 1700][['SCHOOL NAME', 'sat_score', 'hispanic_per', 'frl_percent', 'ell_percent']]
```

Out[31]:

The table above shows schools with the highest high average SAT scores. Their hispanic student populations range from 2.4% to 22.8%, and the proportion of students receiving subsidized lunch (a proxy for low socioeconomic status) range from 15.8% to 53.3%. However, none have more 0.5% of their student body learning english.

In [32]:

```
correlations["sat_score"][['female_per', 'male_per']].plot.barh()
```

Out[32]:

These corrlation values are very low; there is no correlation between the gender-balance in schools and their average SAT scores.

Nevertheless, we can take this opportunity to look at the few high schools with predominantly female (>60%) student bodies and high (>1700) school average SAT scores.

In [33]:

```
combined[(combined['female_per'] >60) & (combined['sat_score']>1700)].T
```

Out[33]:

All of the above schools have selective admissions. Unlike the other selective schools with high SAT scores investigated earlier, these schools don't seem to have a scientific/technical focus. They are instead either a more balanced liberal arts cirriculum, arts or humanities. All have less than ~30% receiving free or reduced cost school lunch and ~0% English language learners.

There is only one school where the ratios are reversed.

In [34]:

```
combined[(combined['male_per'] >60) & (combined['sat_score']>1700)].T
```

Out[34]:

In [35]:

```
combined[((combined['male_per'] > 50)) & (combined['sat_score']>1700)][['SCHOOL NAME', 'sat_score','male_per']].reset_index(drop=True)
```

Out[35]:

There are, however eight high-scoring schools with >50% male student bodies. All, save one, are STEM schools. They make up the bulk of NYC's nine specialized high schools, with admissions determined by scores on a standardized aptitude test.

In [36]:

```
combined['ap_per'] = 100*combined['AP Test Takers '] / combined['total_enrollment']
```

In [37]:

```
combined[['ap_per', 'sat_score']].corr()
```

Out[37]:

In [38]:

```
combined.plot.scatter(x='ap_per', y='sat_score')
```

Out[38]:

There isn't a correlation between the percentage of students taking AP exams and SAT test scores. This is unexpected since both seem to go hand-in-hand as part of college prep.

In [39]:

```
correlations["sat_score"]['AVERAGE CLASS SIZE']
```

Out[39]:

In [40]:

```
combined.plot.scatter(x='AVERAGE CLASS SIZE', y='sat_score')
```

Out[40]:

There is a relationship between a school's average class size and it's average SAT score.

It's not clear why this would be the case. One hypothesis is that on average the most capable students need less attention from teachers, and so schools assign more of them to a classroom.

This analysis investigated factors that relate to average SAT performance at individual schools. The presence of these relationship should not be taken to imply a causal link. These relationships could exist because of coincidence, or because they stem from the same underlying cause, or because one is the cause of the other. This analysis can't distinguish among those possibilities.

We found:

- There isn't clear evidence of gender bias in the SAT.
- There is evidence that race and associated factors do relate to SAT scores. However, this relationship also seems to involve the relationship between race and low socioeconomic status.
- SAT scores of hispanic students are also likely influenced by a high percentage immigrant English language learners, many of whom didn't receive adequate schooling in their home countries. This intersects with the fact that the SAT is only given in English.

The source of the data used in this analysis is NYC's data portal. The specific source datasets follow:

- 2012 SAT ResultsEducation
- 2010 - 2011 School Attendance and Enrollment Statistics by District
- 2010-2011 Class Size - School-level detail
- 2010 AP (College Board) School Level Results
- 2005-2010 Graduation Outcomes - School Level
- 2006 - 2012 School Demographics and Accountability Snapshot
- 2011 NYC School Survey
- 2014 - 2015 DOE High School Directory