Analyzing NYC High School Data

In this project we will use New York City's public data on SAT scores, school demographics, school-level AP results, and surveys of parents, students, and teachers.

These are the datasets:

  • sat_results: SAT scores for each high school
  • ap_2010: Advanced Placement exam results
  • class_size: class sizes per school
  • demographics: school demographics
  • graduation: graduation outcomes
  • hs_directory: the high school directory
  • survey_all and survey_d75: surveys of parents, teachers, and students

The SAT has recently come under fire, and this analysis helps explain why the critics have a point: the exam suffers from a circularity problem. It doesn't show that some people are smarter than others; it mainly shows that people who are good at taking tests, and who can afford to prepare well in advance, obtain better results.

A special section is dedicated to the correlation between the free lunch program and SAT scores. Fewer resources is a predictor of lower scores.

Is it necessary to replace the SAT? Arguably yes, because it is made by people with good test scores to measure other people who are also good at taking tests.

But there is a positive side: if we eliminated this kind of exam entirely, we would also lose evidence of the problems of poverty and academic under-preparation. Perhaps the test should instead be focused on spotting the effects of poverty and how to deal with them.

The test is also designed for people who will do bureaucratic work in big companies, not for entrepreneurs or people who love art.

So, let's see the results.

1. Read in the data

We will first open the datasets. Since there are several of them, we will gather them all into one dictionary:

  1. Put the names of the files in a list.
  2. Create an empty dictionary to hold the datasets.
  3. Loop over the file names, reading each dataset and assigning it to the dictionary under the file name (without the extension) as the key.
In [1]:
import pandas as pd
import numpy
import re
import matplotlib.pyplot as plt
import as px
import seaborn as sns
import plotly.graph_objects as go
import matplotlib.image as mpimg
from matplotlib import rcParams

# List of names of the datasets
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv",
]

data = {}

#Looping to put the datasets in the dictionary
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

2. Read in the surveys

Now we will use concat to unify the survey datasets that we downloaded:

  1. The encoding of these datasets is windows-1252, an older single-byte encoding, so we pass it explicitly to pandas.read_csv().
  2. In the other datasets the school identification code is stored in a column named "DBN", so we copy the surveys' dbn column into a DBN column to be able to merge on the same code later.
  3. We keep only the survey-response columns, removing other information such as the principal's name or the school location.
  4. Finally, we assign the new survey dataset to the previously created dictionary.
In [2]:
# With the encoding 'windows-1252' we use a different delimiter, "\t"
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

# Survey response-rate and aggregated score columns, plus the DBN key
survey_fields = [
    "DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p",
    "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11",
    "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11",
    "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11",
    "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
#Assign again the dataset to the dictionary
data["survey"] = survey

3. Add DBN columns

When we explored all of the data sets, we noticed that some of them, like class_size and hs_directory, don't have a DBN column. hs_directory does have a dbn column, though, so we can just rename it.

class_size has two columns, "CSD" and "SCHOOL CODE"; combining them gives us the DBN.

So we need to pad the "CSD" column, converting single-digit district numbers to a two-digit, zero-padded format so they match the first two characters of the DBN, and then concatenate the two columns.

We will use the pandas.Series.apply() method, along with a custom function that:

  1. Takes in a number.
  2. Converts the number to a string using the str() function.
  3. Check the length of the string using the len() function.
  4. If the string is two digits long, returns the string.
  5. If the string is one digit long, adds a 0 to the front of the string, then returns it.
  6. You can use the string method zfill() to do this.
In [3]:
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    # Single-digit CSDs get a leading zero
    return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
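As step 6 above notes, the string method zfill() can do the padding in a single call; a minimal sketch (the function name pad_csd_zfill is ours):

```python
def pad_csd_zfill(num):
    # zfill pads the string with leading zeros up to a width of 2,
    # e.g. 1 -> "01" while 13 stays "13"
    return str(num).zfill(2)

print(pad_csd_zfill(1))   # 01
print(pad_csd_zfill(13))  # 13
```

Either version produces the same two-character district codes.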

4. Convert columns to numeric

We'll need to convert the SAT Math Avg. Score, SAT Critical Reading Avg. Score, and SAT Writing Avg. Score columns in the sat_results data set from the object (string) data type to a numeric data type. We can use the pandas.to_numeric() method for the conversion. If we don't convert the values, we won't be able to add the columns together.

It's important to pass the keyword argument errors="coerce" when we call pandas.to_numeric(), so that pandas treats any invalid strings it can't convert to numbers as missing values instead.
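A quick sketch of what errors="coerce" does, on a toy Series (the values are invented):

```python
import pandas as pd

# One of these strings is not a valid number
scores = pd.Series(["385", "s", "412"])
numeric = pd.to_numeric(scores, errors="coerce")

# The invalid string becomes NaN instead of raising an error
print(numeric.isnull().tolist())  # [False, True, False]
```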

After we perform the conversion, we can use the addition operator (+) to add all three columns together.

We also want to extract the latitude and longitude from the Location 1 column of hs_directory; for example, 40.8276026690005 and -73.90447525699966. Taken together, latitude and longitude make up a pair of coordinates that allows us to pinpoint any location on Earth.

We can do the extraction with a regular expression such as re.findall(r"\(.+, .+\)", loc).

This call returns a list like ['(40.8276026690005, -73.90447525699966)']. We'll need to process the result further, using the string methods split() and replace(), to extract each coordinate.
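Applied to a made-up Location 1 string (the address is hypothetical; the coordinates are the ones quoted above), the extraction works like this:

```python
import re

# A hypothetical "Location 1" value: address lines followed by "(lat, lon)"
loc = "1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)"

coords = re.findall(r"\(.+, .+\)", loc)
print(coords)  # ['(40.8276026690005, -73.90447525699966)']

# Split on the comma, then strip the parentheses and whitespace
lat = coords[0].split(",")[0].replace("(", "")
lon = coords[0].split(",")[1].replace(")", "").strip()
print(lat, lon)  # 40.8276026690005 -73.90447525699966
```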

In [4]:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

5. Condense datasets

We'll need to condense these data sets so that each value in the DBN column is unique. If not, we'll run into issues when it comes time to combine the data sets.

While the main data set we want to analyze, sat_results, has unique DBN values for every high school in New York City, other data sets aren't as clean. A single row in the sat_results data set may match multiple rows in the class_size data set, for example. This situation will create problems, because we don't know which of the multiple entries in the class_size data set we should combine with the single matching entry in sat_results.

So first we need to:

  • Create a new variable called class_size, and assign the value of data["class_size"] to it.
  • Filter class_size so the GRADE column only contains the value 09-12.
  • Filter class_size so that the PROGRAM TYPE column only contains the value GEN ED.
  • Group the data by the unique DBN code of each high school, averaging the numeric columns, and assign the result back to data["class_size"].
  • Use a boolean mask to keep only the rows of demographics where "schoolyear" is 20112012, the most recent year in the dataset.
  • In graduation, keep only the most recent "Cohort", which is 2006.
  • Also in graduation, keep only the "Total Cohort" rows of the "Demographic" column.
In [5]:
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(numpy.mean)
# Restore DBN as a column so we can merge on it later
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
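To see what the groupby step does, here is a toy example (the DBN values and class sizes are invented) that collapses duplicate DBN rows into a single averaged row per school:

```python
import pandas as pd

# Two class-size rows for the same (made-up) school, and one for another
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448"],
    "AVERAGE CLASS SIZE": [22.0, 24.0, 30.0],
})

# Average the numeric columns per school, then restore DBN as a column
condensed = toy.groupby("DBN").agg("mean").reset_index()
print(condensed)
```

After this step each DBN appears exactly once, which is what the later merges require.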

6. Convert AP scores to numeric

High school students take the AP exams before applying to college. There are several AP exams, each corresponding to a school subject. High school students who earn high scores may receive college credit.

AP exams are scored on a 1 to 5 scale; 3 or higher is a passing score. Many high school students take AP exams, particularly those who attend academically challenging institutions. AP exams are much rarer in schools that lack funding or academic rigor.

It will be interesting to find out whether AP exam scores are correlated with SAT scores across high schools. To determine this, we'll need to convert the AP exam scores in the ap_2010 data set to numeric values first.

There are three columns we'll need to convert:

  • AP Test Takers (note that there's a trailing space in the column name)
  • Total Exams Taken
  • Number of Exams with scores 3 4 or 5
In [6]:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

7. Combine the datasets

Both the ap_2010 and the graduation data sets have many missing DBN values, so we'll use a left join when we merge the sat_results data set with them. Because we're using a left join, our final dataframe will have all of the same DBN values as the original sat_results dataframe. The remaining datasets have fewer missing DBN values, so we'll merge those with an inner join.

We'll need to use the pandas df.merge() method to merge dataframes. The "left" dataframe is the one we call the method on, and the "right" dataframe is the one we pass into df.merge().

Because we're using the DBN column to join the dataframes, we'll need to specify the keyword argument on="DBN" when calling pandas.DataFrame.merge().

First, we'll assign data["sat_results"] to the variable combined. Then, we'll merge all of the other dataframes with combined. When we're finished, combined will have all of the columns from all of the data sets.

In [7]:
combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
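The difference between the two join types can be seen on two tiny frames (all values are invented):

```python
import pandas as pd

sat = pd.DataFrame({"DBN": ["01M292", "01M448"], "sat_score": [1122.0, 1172.0]})
ap = pd.DataFrame({"DBN": ["01M292"], "AP Test Takers ": [39.0]})

# Left join: every DBN from sat survives; missing AP data becomes NaN
left = sat.merge(ap, on="DBN", how="left")
print(len(left))   # 2

# Inner join: only DBNs present in both frames survive
inner = sat.merge(ap, on="DBN", how="inner")
print(len(inner))  # 1
```

This is why the rows dropped by the left joins are then patched with fillna() above.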

8. Add a school district column for mapping

Mapping the statistics out on a school district level might be an interesting way to analyze them. Adding a column to the data set that specifies the school district will help us accomplish this.

The school district is just the first two characters of the DBN. We can apply a function over the DBN column of combined that pulls out the first two characters.

In [8]:
def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

9. Find correlations

Once the data is ready, we can start with the correlations. pandas.DataFrame.corr() helps us find them; we then select only the column of correlations with sat_score.

In [9]:
correlations = combined.corr()
correlations = correlations["sat_score"]
SAT Critical Reading Avg. Score    0.986820
SAT Math Avg. Score                0.972643
SAT Writing Avg. Score             0.987771
sat_score                          1.000000
AP Test Takers                     0.523140
priority08                              NaN
priority09                              NaN
priority10                              NaN
lat                               -0.121029
lon                               -0.132222
Name: sat_score, Length: 67, dtype: float64

10. Plotting survey correlations

Visualizing the correlations is easier than looking at the table.

For this part we are going to use Pros: the visuals are striking, it is easy to read values off the chart, and it is easy to map colors and sizes to the data. Cons: it takes more memory, the code is heavier, and it takes more time to process.

In [10]:
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
In [11]:
survey_corr = combined.corr()["sat_score"][survey_fields].sort_values().to_frame(name="sat_score")
fig =, y="sat_score", color="sat_score", height=600)