In this project we will use New York City's public data on SAT scores, school demographics, school-level AP results, and surveys of parents, students, and teachers. The individual datasets are listed in the loading code below.

The SAT has recently come under fire, and this analysis also helps explain why the critics have a point: the exam suffers from a problem of circularity. It doesn't show that some people are smarter than others; it only shows that if you are good at taking tests, and have the money to prepare well in advance, you will probably obtain better results.

A special part of the analysis is dedicated to the correlation between the free lunch program and SAT scores: having fewer resources is a predictor of lower scores.

Is it necessary to overhaul the SAT? Yes, because it is an exam made by people with good test scores to measure other people who are also good at test scores. On the positive side, though, if we eliminated this kind of exam entirely we would also lose evidence of the problems of poverty and academic qualifications. Perhaps the test should be refocused on spotting the effects of poverty and working out how to deal with them. The test is also designed for people who will do bureaucratic work in big companies, not for entrepreneurs or people who love art.

So, let's see the results.
We will first open the datasets. Since there are several of them, we will load them all into a single dictionary:
import pandas as pd
import numpy
import re
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import plotly.graph_objects as go
import matplotlib.image as mpimg
from matplotlib import rcParams
#List of names of the dataset files
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
#Loop over the files and store each dataframe in the dictionary, keyed by file name
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d
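As a quick sanity check that everything loaded, we can print the shape of each dataframe in the dictionary (a small inspection sketch using only what was loaded above):
# Print (rows, columns) for every dataset we just read
for name, df in data.items():
    print(name, df.shape)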
Now we will use pandas.concat() to unify the two survey datasets that we downloaded. The files are encoded in windows-1252, an older single-byte encoding, and their values are tab-separated, so we pass delimiter="\t" and encoding='windows-1252' to read_csv():
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
#Store the combined survey dataframe back in the dictionary
data["survey"] = survey
When we explored all of the data sets, we noticed that some of them, like class_size and hs_directory, don't have a DBN column. hs_directory does have a dbn column, though, so we can just copy it over under the uppercase name.

class_size is different: it has no DBN at all, but the DBN is simply the combination of its "CSD" and "SCHOOL CODE" columns. The CSD column stores the district as a plain number, so we need to pad single-digit districts to a two-character format (1 becomes "01"); otherwise the concatenated code won't match the DBN format used in the other data sets.

We will use the pandas.Series.apply() method, along with a custom function that converts the number to a string with str(), checks its length with len(), and adds a leading "0" when the string is only one character long (str.zfill() would achieve the same).
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
We'll need to convert the SAT Math Avg. Score, SAT Critical Reading Avg. Score, and SAT Writing Avg. Score columns in the sat_results data set from the object (string) data type to a numeric data type, using pandas.to_numeric(). If we don't convert the values, we won't be able to add the columns together.

It's important to pass the keyword argument errors="coerce" when we call pandas.to_numeric(), so that pandas treats any invalid strings it can't convert to numbers as missing values instead of raising an error. After we perform the conversion, we can use the addition operator (+) to add all three columns together.
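To see what errors="coerce" does, here is a tiny example on made-up values (hypothetical data, purely for illustration):
# The invalid string "s" becomes NaN instead of raising an error
print(pd.to_numeric(pd.Series(["380", "s", "421"]), errors="coerce"))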
We also want to extract each school's coordinates from the Location 1 column of hs_directory, which embeds them in a string such as "(40.8276026690005, -73.90447525699966)". Taken together, latitude and longitude make up a pair of coordinates that allows us to pinpoint any location on Earth.

We can do the extraction with a regular expression that matches the parenthesized pair; re.findall() returns the match as a list like ['(40.8276026690005, -73.90447525699966)']. We'll then need to process this result further, using the string methods split() and replace(), to pull out each coordinate.
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat
def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
We'll need to condense these data sets so that each value in the DBN column is unique. If not, we'll run into issues when it comes time to combine the data sets.
While the main data set we want to analyze, sat_results, has unique DBN values for every high school in New York City, other data sets aren't as clean. A single row in the sat_results data set may match multiple rows in the class_size data set, for example. This situation will create problems, because we don't know which of the multiple entries in the class_size data set we should combine with the single matching entry in sat_results.
So first we filter class_size down to high school general-education classes and average the remaining rows per DBN; then we keep a single school year in demographics and a single cohort in graduation:
class_size = data["class_size"]
# Keep only high school grades and general education classes
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
# Average the remaining rows so each DBN appears only once
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
High school students take the AP exams before applying to college. There are several AP exams, each corresponding to a school subject. High school students who earn high scores may receive college credit.
AP exams are scored on a scale of 1 to 5; 3 or higher is a passing score. Many high school students take AP exams, particularly those who attend academically challenging institutions, while AP exams are much rarer at schools that lack funding or academic rigor.
It will be interesting to find out whether AP exam scores are correlated with SAT scores across high schools. To determine this, we'll need to convert the AP exam scores in the ap_2010 data set to numeric values first.
There are three columns we'll need to convert:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
Both the ap_2010 and the graduation data sets have many missing DBN values, so we'll use a left join when we merge the sat_results data set with them. Because we're using a left join, our final dataframe will have all of the same DBN values as the original sat_results dataframe.

We'll need to use the pandas df.merge() method to merge dataframes. The "left" dataframe is the one we call the method on, and the "right" dataframe is the one we pass into df.merge(). Because we're using the DBN column to join the dataframes, we'll need to specify the keyword argument on="DBN" when calling pandas.DataFrame.merge().

First, we'll assign data["sat_results"] to the variable combined. Then, we'll merge all of the other dataframes with combined, using an inner join for the remaining, more complete data sets. When we're finished, combined will have all of the columns from all of the data sets.
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")
# Fill missing values with each column's mean, then fill anything still missing with 0
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
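Before moving on, it's worth confirming that the merges behaved as expected and that the imputation removed every missing value (a quick check, assuming the steps above ran as-is):
# One row per school; the second number should be 0 if imputation worked
print(combined.shape)
print(combined.isnull().sum().sum())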
Mapping the statistics out on a school district level might be an interesting way to analyze them. Adding a column to the data set that specifies the school district will help us accomplish this.
The school district is just the first two characters of the DBN, so we can apply a function over the DBN column of combined that pulls out the first two characters.
def get_first_two_chars(dbn):
    return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
Once the data is ready we can start looking at correlations. pandas.DataFrame.corr() computes pairwise correlations between all numeric columns; from the resulting matrix we select only the correlations with sat_score.
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
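If we want to focus on the strongest relationships first, regardless of sign, we can sort the correlations by absolute value (a small sketch built on the series computed above):
# Largest correlations in either direction
print(correlations.abs().sort_values(ascending=False).head(10))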
Visualizing the correlations is easier than looking at the table.

For this part we are going to use plotly.express. Pros: the visuals are striking, it's easy to read values off the chart, and colors and sizes are easy to apply. Cons: it uses more memory, the code is heavier, and it takes more time to render.
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
# Use a different name so we don't overwrite the `data` dictionary from earlier
survey_correlations = combined.corr()["sat_score"][survey_fields].sort_values()
fig = px.bar(survey_correlations, y='sat_score', color='sat_score', height=600)
fig.show()
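Since seaborn and matplotlib are already imported, here is a lighter-weight static equivalent of the same chart (a sketch that trades plotly's interactivity for speed):
# Static horizontal bar chart of the same survey correlations
plt.figure(figsize=(8, 6))
sns.barplot(x=survey_correlations.values, y=survey_correlations.index)
plt.xlabel("Correlation with sat_score")
plt.show()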