Analyzing NYC High School Data¶

In this project we will use the public data of New York City regarding SAT scores, school demographics, AP school level, survey to parents, students and teachers.

These are the datasets:

2012 SAT Results: SAT Scores per school.
2010 - 2011 School Attendance and Enrollment Statistics by District: attendance information per district.
2010-2011 Class Size - School-level detail: number of alumni per school.
2010 AP (College Board) School Level Results: AP Test taking.
2005-2010 Graduation Outcomes - School Level: how many students graduate.
2006 - 2012 School Demographics and Accountability Snapshot: group identity and other info of the students.
2011 NYC School Survey: survey for teachers, students and parents.

SAT exams are recently under fire. This analysis helps also to understand why the critics are right: the exams have a problem of circularity, it doesn't show that some people are smarter than others, only shows that perhaps if you are good at taking tests and have money for prepare way ahead you would obtain better results.

An special part is dedicated of the correlation of free lunch program and the sat scores. Lower resources is a predictor of bad scores.

Is it necessary to override the SAT exams? Yes, because is made by people that have good test scores to measure people that also is good at test scores.

But, the positive part is that if we totally eliminate this kind of exams we could also loss evidence of the problems of poverty and academic qualifications. Maybe the test must be focused to spot problems of poverty and how to deal with them.

Also the TEST is designed for people that will work in big companies with bureaucratic works, not for entrepreneurs or people that love art.

So, let's see the results.

1. Read in the data¶

We are going first open the datasets, but as we have very large datasets, we are going to convert all the datasets in one big dictionary:

We are going to put the names of the files in one list.
Next we create an empty dictionary where we put the datasets.
Finally we loop using the datasets names so we can assign each datasets to the dictionary with the key name

In [1]:

import pandas as pd
import numpy
import re
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import plotly.graph_objects as go
import matplotlib.image as mpimg
from matplotlib import rcParams


#List od names of the datasets
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

#Looping to put the datasets in the dictionary
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

2. Read in the surveys¶

Now we will use concatto unify all the surveys datasets that we downloaded:

The encoding in this data sets is windows-1252, is and old kind of encoding, you can see the historical statistics usage here
In other datasets the school identification code is written "DBN", we are going to change the name in the surveys to later concat using the same code.
We are going to use only survey responses, removing other info like principal name or school location.
Finally we will assign the new survey dataset to the previously created dictionary.

In [2]:

# With the encoding 'windows-1252' we use a differente delimiter "\t"
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
#Assign again the dataset to the dictionary
data["survey"] = survey

3. Add DBN columns¶

When we explored all of the data sets, we noticed that some of them, like class_size and hs_directory, don't have a DBN column. hs_directory does have a dbn column, though, so we can just rename it

Some of the datasets don't have a DBN column. For example class_sizehave two columns, "CSD" and "Code", when we combine both data sets we obtain the DBN.

So, we need to pad the "CSD" column, convert the single number to a 0x format so we can obtain the same number code that in DBN, and once we have that we combine the numbers.

We will use the pandas.Series.apply() method, along with a custom function that:

Takes in a number.
Converts the number to a string using the str() function.
Check the length of the string using the len() function.
If the string is two digits long, returns the string.
If the string is one digit long, adds a 0 to the front of the string, then returns it.
You can use the string method zfill() to do this.

In [3]:

data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

4. Convert columns to numeric¶

We'll need to convert the SAT Math Avg. Score, SAT Critical Reading Avg. Score, and SAT Writing Avg. Score columns in the sat_results data set from the object (string) data type to a numeric data type. We can use the pandas.to_numeric() method for the conversion. If we don't convert the values, we won't be able to add the columns together.

It's important to pass the keyword argument errors="coerce" when we call pandas.to_numeric(), so that pandas treats any invalid strings it can't convert to numbers as missing values instead.

After we perform the conversion, we can use the addition operator (+) to add all three columns together.

We also want to extract the latitude, 40.8276026690005, and the longitude, -73.90447525699966. Taken together, latitude and longitude make up a pair of coordinates that allows us to pinpoint any location on Earth.

We can do the extraction with a regular expression.

This command will return [(40.8276026690005, -73.90447525699966)]. We'll need to process this result further using the string methods split() and replace() methods to extract each coordinate.

In [4]:

cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

def find_lat(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

5. Condense datasets¶

We'll need to condense these data sets so that each value in the DBN column is unique. If not, we'll run into issues when it comes time to combine the data sets.

While the main data set we want to analyze, sat_results, has unique DBN values for every high school in New York City, other data sets aren't as clean. A single row in the sat_results data set may match multiple rows in the class_size data set, for example. This situation will create problems, because we don't know which of the multiple entries in the class_size data set we should combine with the single matching entry in sat_results.

So first we need to:

Create a new variable called class_size, and assign the value of data["class_size"] to it.
Filter class_size so the GRADE column only contains the value 09-12.
Filter class_size so that the PROGRAM TYPE column only contains the value GEN ED.
Display the first five rows of class_size to verify.
Group the data with the unique DBN code of each high school and assign to a new column "class_size"
Apply a pandas bool to select in the column "demographics" only the "schoolyear" 20112012 so we obtain the most recent year data of the dataset.
The "cohort" we choose the most recent year that is 2006
In the demographic we only select the "Total cohort" values.

In [5]:

class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

6. Convert AP scores to numeric¶

High school students take the AP exams before applying to college. There are several AP exams, each corresponding to a school subject. High school students who earn high scores may receive college credit.

AP exams have a 1 to 5 scale; 3 or higher is a passing score. Many high school students take AP exams -- particularly those who attend academically challenging institutions. AP exams are much more rare in schools that lack funding or academic rigor.

It will be interesting to find out whether AP exam scores are correlated with SAT scores across high schools. To determine this, we'll need to convert the AP exam scores in the ap_2010 data set to numeric values first.

There are three columns we'll need to convert:

AP Test Takers (note that there's a trailing space in the column name)
Total Exams Taken
Number of Exams with scores 3 4 or 5

In [6]:

cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

7. Combine the datasets¶

Both the ap_2010 and thegraduation data sets have many missingDBN values, so we'll use a left join when we merge thesat_results data set with them. Because we're using aleft join, our final dataframe will have all of the sameDBN values as the originalsat_results dataframe.

We'll need to use the pandasdf.merge() method to merge dataframes. The "left" dataframe is the one we call the method on, and the "right" dataframe is the one we pass intodf.merge().

Because we're using theDBN column to join the dataframes, we'll need to specify the keyword argumenton="DBN" when calling pandas.DataFrame.merge().

First, we'll assigndata["sat_results"] to the variablecombined. Then, we'll merge all of the other dataframes withcombined. When we're finished,combined will have all of the columns from all of the data sets.

In [7]:

combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

combined = combined.fillna(combined.mean())
combined = combined.fillna(0)

8. Add a school district column for mapping¶

Mapping the statistics out on a school district level might be an interesting way to analyze them. Adding a column to the data set that specifies the school district will help us accomplish this.

The school district is just the first two characters of the DBN. We can apply a function over the DBN column of combinedthat pulls out the first two letters.

In [8]:

def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

9. Find correlations¶

Once we have the data ready we can start with the correlations, pandas.corr()help us to find the correlations, we select only the data correlated with sat scores.

In [9]:

correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)

SAT Critical Reading Avg. Score    0.986820
SAT Math Avg. Score                0.972643
SAT Writing Avg. Score             0.987771
sat_score                          1.000000
AP Test Takers                     0.523140
                                     ...   
priority08                              NaN
priority09                              NaN
priority10                              NaN
lat                               -0.121029
lon                               -0.132222
Name: sat_score, Length: 67, dtype: float64

10. Plotting survey correlations¶

Visualize correlations is easier than loom at the table.

On this part we are going to use plotly.express. Pros: the visuals are astonishing, easy to see the values of the data, easy to deploy colors and sizes. Cons: takes more memory, heavier code, takes more time to process.

In [10]:

# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")

In [11]:

data = combined.corr()["sat_score"][survey_fields].sort_values()
fig = px.bar(data, y='sat_score', color='sat_score', height=600)
fig

The top correlations that we obtain are:

N_s: Number of student respondents
N_p: Number of parent respondents
aca_s_11: Academic expectations score based on student responses
saf_s_11: Safety and Respect score based on student responses
saf_tot_11: Safety and Respect total score

I couldn't find the exact questions that the NYC government asked to the students, teachers and parents, so I assume that the score is 0 to 10, 10 is the highest.

So, the correlation indicates that more responses of parents and students implies better results in the sat_score. But this is really weak, lower than 0.5, so you need take this with a big pinch of salt.

The most interesting part is the Safety and Respect score, is not clear what is the question, but it is related with: how the students feel about their institution, how the students feel about their borough.

11. Safety and Respect : scatter plot and maps.¶

What we are trying to see is how strong is related the Security and Respect score with the SAT scores of the schools.

The most interesting part is always the outliers: which schools have low Security and Respect score and also have good sat scores.

One useful scatter plot library is plotly express. The quality of the graphs are good, but the most valuable is that you can see which point is what, so you can easily retrieve data.

In [12]:

# With plotly express automatic put a degrade color using a column name
# and the name of the school can be deployed directly in the data using hover_name 
fig = px.scatter(combined, x="saf_s_11", y="sat_score", color='sat_score', size='saf_s_11'
                 ,hover_name="SCHOOL NAME")
fig.show()

As expected we have a graph that shows that schools with greater SAT scores tend to have a better qualification in Security and Respect. But as alway we have some interesting outliers:

BRONX HIGH SCHOOL OF SCIENCE: is an elite public school with a mean score of 1969, with 8 Nobel prize winner and 2 Pulitzer winner between their ex alumni, but in Security and Respect the results are near the mean of results.
PEACE AND DIVERSITY ACADEMY: near to the previous BRONX HIGH SCHOOL, with a better score in safety and respect, but sat_score 1155.
BENJAMIN N. CARDOZO HIGH SCHOOL: with above 1500 sat score and 6.5 safety score is a curious example of that security is not necessarily equal to more sat scores.

The point is to see that this correlation is not clearly strong, and is not clear that if you choose a safer place you will have better sat_scores.

11.2. Mapping the security score¶

We want know to map the relations between the safety score and the security of the boroughs.

We are using pandas.groupby and numpy.meanaggregation method so we can obtain the mean value of safety score by borough.
Next, we are using the plotly to map the values in a high resolution map, also using the mapbox_style like scatter_mapbox and density_mapbox to see the results.

In [13]:

districts = combined.groupby("school_dist").agg(numpy.mean)
#We need to reset the index so we can later use the results as dataset
districts.reset_index(inplace=True)
districts['saf_s_11'].head()

Out[13]:

0    6.768611
1    6.910660
2    6.716667
3    6.885714
4    6.314286
Name: saf_s_11, dtype: float64

In [14]:

import plotly.express as px
df = districts
#
fig = px.scatter_mapbox(df, lat="lat", lon="lon", color="saf_s_11", size="sat_score",
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=9.5,
                  mapbox_style="carto-positron")
fig.show()

In [15]:

import plotly.express as px
df = combined
fig = px.density_mapbox(df, lat="lat", lon="lon", z="saf_s_11", radius=10,
                  zoom=9,hover_name="SCHOOL NAME",
                  mapbox_style="stamen-terrain")
fig.show()

To understand better the results, we need to compare with the crime map in NYC, it is not the same distribution at first glance, but it is also because they evaluate the safety into the school, not only externally.

In [16]:

rcParams['figure.figsize'] = 12 ,8 
image = mpimg.imread("NYC_Crime.png")
plt.imshow(image)
plt.show()

12. Group identity issues¶

Now we are going to analyze group identity correlation with sat_score. We have 4 groups:

white_per
asian_per
black_per
hispanic_per

To analyze the impact of each group in the sat score we need to categorize the schools regarding their sat score:

We are going to use the quantileoption to divide the groups.
We need to create a function that categorize the sat_score.
apply the function to the schools, and create a new column school_level.
Plot results using scatter and bar plots.

In [17]:

#Quantiles help us to divide the data in similar distributed groups 
combined.sat_score.quantile([0.01,0.5,0.75,0.9,0.99])

Out[17]:

0.01     942.72
0.50    1193.00
0.75    1266.50
0.90    1423.00
0.99    1913.80
Name: sat_score, dtype: float64

In [18]:

#define function for classifying schools based on sat_score quantiles
def level(row):
    if row < 942.72:
        return 'Lowest: Below 1%'
    elif row >= 942.72 and row < 1193:
        return 'Low: Below 50%'
    elif row >=1193 and row < 1266.5:
        return 'Medium: Above 50%'
    elif row >=1266.5 and row < 1423:
        return 'High: Above 75%'
    elif row >=1423 and row < 1913.8:
        return 'Top: Above 90%'
    else:
        return 'Elite: Above 99%'
     

#create new column using the function above
combined['school_level'] = combined['sat_score'].apply(level)

#view DataFrame 
combined['school_level'].head()

Out[18]:

0       Low: Below 50%
1       Low: Below 50%
2       Low: Below 50%
3    Medium: Above 50%
4       Top: Above 90%
Name: school_level, dtype: object

In [19]:

#explore the correlation between groups and sat_score
group_id = ["white_per", "asian_per", "black_per", "hispanic_per"]
group_corr = combined.corr()["sat_score"][group_id].sort_values()
fig = px.bar(group_corr, y='sat_score', color='sat_score', height=600, width=500)
fig

white_perhave a positive correlation with sat_score and hispanic_perhave a negative correlation. I think is important to be very wary of this kind of correlations, because is useful for a racist perspective and the correlations are not strong.

Anyway, we are going to explore what are the situation with the hispanic_perand comapre the data with the other groups.

12.1. `hispanic_per` vs `sat_score`¶

In [20]:

fig = px.scatter(combined, x="hispanic_per", y="sat_score",
                 size="sat_score", color="school_level",
                 hover_name="SCHOOL NAME", size_max=15, title='SAT Score vs. Hispanic percentage of alumns')
fig.show()

As we can see, 2 of the worst performing schools are 100% hispanic groups. But almost all the schools in NYC have hispanics in their schools.

The problem in this case is that if the correlation is clear you need to have not Top schools with high percentage of hispanics and good sat_score, but we have one case: MANHATTAN CENTER FOR SCIENCE AND MATHEMATICS. With 57.4% hispanic_perand sat_score 1430.

So, we can discard that this is a racial issue, because we need more info and a dynamic view of the scores, in 2019 hispanic sat_score is better than in 2011, so the problem is not "racial", we have other issues to explore, latter we are going to see another direction.

12.2. Group identity and school level: visualizing the data¶

How many people of different group identity does they have in each school level? To answer the question we will do:

Create a pivot_table using the school_level as index
Create a bar plot that aggregate the values so we can easily see the distribution of groups in each school levels.

In [21]:

table = pd.pivot_table(combined, values=group_id, index=['school_level'],
                       aggfunc=numpy.mean)
table.reset_index(inplace=True)
cm = sns.light_palette("#2ecc71", as_cmap=True)
table.style.background_gradient(cmap=cm)

Out[21]:

	school_level	asian_per	black_per	hispanic_per	white_per
0	Elite: Above 99%	46.025000	4.100000	8.225000	41.050000
1	High: Above 75%	14.520370	29.131481	40.603704	14.888889
2	Low: Below 50%	4.293220	44.193785	48.594915	2.252542
3	Lowest: Below 1%	7.750000	10.050000	78.025000	3.675000
4	Medium: Above 50%	9.815385	35.728571	46.183516	7.496703
5	Top: Above 90%	31.272727	17.378788	19.854545	30.190909

In [22]:

table = table.reindex([3,2,4,1,5,0])
fig = px.bar(table, x="school_level", y=group_id, title="Group percentages vs school level")
fig.show()

The group identity distribution show us the following:

The worst performing schools have the most hispanic_per: 78.025%
The elite schools have the lest hispanic_per: 8.225%
The best performing schools have more percentage of asians and white.
The elite and top schools have the lest percentage of blacks and hispanics.

The issue here is that the data can mislead us: we can associate better sat schools with some race properties, but this is false.

Behind the great results are preparation, a lot of preparation. The Asian families have an old culture that requires exams to access to government jobs. We don't have any "racial superiority". Even the low class members invest in SAT prep exams.

Also the great proportion of asians in elite schools indicate that these schools have a selection process that help them to have only students with greater chance of high sat scores.

So, the low sat score problem is not a "race" problem, even schools with high scores have a great percentage of black and hispanic people.

But, if the problem is not "racial", what is the problem?.

12.3. Low `hispanic_per`: how are the groups distributed¶

In [23]:

low_schools = combined[(combined["hispanic_per"]<10)]
table_low = pd.pivot_table(low_schools, values=group_id, index=['school_level'],
                       aggfunc=numpy.mean)
table_low.reset_index(inplace=True)
table_low = table_low.reindex([2,3,1,4,0])
fig = px.bar(table_low, x="school_level", y=group_id, title="Group identity vs school level, hispanic_per < 10%")
fig.show()

If the hispanic_peris fewer than 10%, we have some changes in the data:

We don't have lowest level schools.
The schools below the top level have a majority of black people.
In the top schools we have a greater percentage of asians and white.
More presence of black people in low level schools indicates not that the blacks are not smart, only indicates that maybe the problem is more correlated with poverty.

13. Gender and `sat_score`¶

In this part we want to explore the relation between gender and SAT scores, we need to explore first if we have any clear correlation, later the percentages.

In [24]:

gender_id = ["male_per", "female_per"]
group_corr = combined.corr()["sat_score"][gender_id].sort_values()
fig = px.bar(group_corr, y='sat_score', color='sat_score', height=600, width=500)
fig

Looking the correlation between gender and sat scores we found: nothing.

A slightly difference, but nothing strong to have a clear conclusion. Let's see the distribution of female_per vs sat_score.

13.1. `sat_score` vs `female_per`¶

In [25]:

fig = px.scatter(combined, x="female_per", y="sat_score",
                 size="sat_score", color="school_level",
                 hover_name="SCHOOL NAME", size_max=15, title='SAT Score vs. Female_per')
fig.show()

Looking through the dataset, we have some interesting stuff:

TOWNSEED HARRIS SCHOOL: Have a high sat_score (above 1900) and a high percentage of girls (71.1%).
The elite and the lowest level schools have similar percentage of women in their schools.
The schools with almost 100% women, have sat_scores between 1127 and 1326.
A cluster of TOP schools have more than 60% of women in their schools.

13.2. `female_per` vs `male_per` and `sat_score`.¶

To explore the gender differences regarding the school level, we need to do the following:

Define the quantiles to segment the percentage of female and male.
Create a function that segment the percentages.
Plot the data in bars to see the difference.

In [26]:

combined.female_per.quantile([0.01,0.2,0.5,0.75,0.95])

Out[26]:

0.01    12.092
0.20    43.180
0.50    50.100
0.75    55.900
0.95    73.690
Name: female_per, dtype: float64

In [27]:

#define function for classifying players based on points
def gender(row):
    if row < 12:
        return '< 12%'
    elif row >= 12 and row < 50:
        return '12% to 50%'
    elif row >=50 and row < 60:
        return '50% to 60%'
    elif row >=60 and row < 75:
        return '60% to 75%'
    else:
        return '> 75%'
     

#create new column using the function above
combined['fem_level'] = combined['female_per'].apply(gender)

#view DataFrame 
combined['fem_level'].head()

Out[27]:

0    12% to 50%
1    12% to 50%
2    12% to 50%
3    50% to 60%
4    50% to 60%
Name: fem_level, dtype: object

In [28]:

fem_table = pd.pivot_table(combined, values=["sat_score"], index=['fem_level'],
                       aggfunc=numpy.mean)
fem_table.reset_index(inplace=True)
fem_table = fem_table.reindex([3,0,1,2,4])
fem_table.style.background_gradient(cmap=cm)

Out[28]:

	fem_level	sat_score
3	< 12%	1137.250000
0	12% to 50%	1208.157612
1	50% to 60%	1220.491512
2	60% to 75%	1310.972973
4	> 75%	1233.832276

In [29]:

#define function for classifying players based on points
def gender_m(row):
    if row < 12:
        return '< 12%'
    elif row >= 12 and row < 50:
        return '12% to 50%'
    elif row >=50 and row < 60:
        return '50% to 60%'
    elif row >=60 and row < 75:
        return '60% to 75%'
    else:
        return '> 75%'
     

#create new column using the function above
combined['male_level'] = combined['male_per'].apply(gender_m)

#view DataFrame 
combined['male_level'].head()

Out[29]:

0    60% to 75%
1    50% to 60%
2    50% to 60%
3    12% to 50%
4    12% to 50%
Name: male_level, dtype: object

In [30]:

male_table = pd.pivot_table(combined, values=["sat_score"], index=['male_level'],
                       aggfunc=numpy.mean)
male_table.reset_index(inplace=True)
male_table = male_table.reindex([3,0,1,2,4])
male_table.style.background_gradient(cmap=cm)

Out[30]:

	male_level	sat_score
3	< 12%	1243.701824
0	12% to 50%	1241.574497
1	50% to 60%	1211.220043
2	60% to 75%	1195.403528
4	> 75%	1172.133333

In [31]:

fig = go.Figure(data=[
    go.Bar(name='Male percentage', x=male_table['male_level'], y=male_table['sat_score']),
    go.Bar(name='Female percentage', x=fem_table['fem_level'], y=fem_table['sat_score'])
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

We can take some quick conclusions:

We have a progressive tendency that with more percentage of women the schools obtain better sat score.
The inverse is with males: more percentage of males diminish sat scores.
Another curious result is that schools with lower percentage of women, the school obtain less sat score.

13.3. Gender vs School levels¶

Now we are going to explore the school levels vs the genders. The main objective is to see how the extremes have the distribution of genders.

In [32]:

table_gen = pd.pivot_table(combined, values=gender_id, index=['school_level'],
                       aggfunc=numpy.mean)
table_gen.reset_index(inplace=True)
table_gen = table_gen.reindex([3,2,4,1,5,0])
table_gen.style.background_gradient(cmap=cm)

Out[32]:

	school_level	female_per	male_per
3	Lowest: Below 1%	47.525000	52.475000
2	Low: Below 50%	48.566667	51.431638
4	Medium: Above 50%	52.129670	47.870330
1	High: Above 75%	51.981481	48.018519
5	Top: Above 90%	56.545455	43.454545
0	Elite: Above 99%	43.025000	56.975000

In [33]:

table_gen = table_gen.reindex([3,2,4,1,5,0])
fig = px.bar(table_gen, x="school_level", y=gender_id, title="Gender percentage vs School Level")
fig.show()

As expected from previous point, female_perincrease with the school_level, we have new observations:

In the lowest level school the male and female percentage is almost the same.
The top schools have a mean of more than 56% of women in their schools.
In the elite schools the male percentage is greater that the female percentage.

14. AP Test vs SAT Score¶

Another important data point is the AP test takers vs. SAT Score. The AP Tests have a different kind of grade (1 to 5) and is useful for colleges boards to decide if the student is qualified for their studies.

What we could expect is that a school with more AP test takers have more chance of better SAT scores.

In [34]:

combined["ap_per"] = combined["AP Test Takers "]/combined["total_enrollment"]
fig = px.scatter(combined, x="ap_per", y="sat_score",
                 size="sat_score", color="school_level",
                 hover_name="SCHOOL NAME", size_max=15, title='SAT Score vs. AP Percentage')
fig.show()

Looking through the data we can say:

Elite and top schools have more percentage of AP test takers than High level schools.
Medium and low schools have more percentage of AP test takers than any other schools.
Is not clear that AP test takers is correlated with better SAT scores.

15. Full correlations map¶

Visualize all the correlations in a table is really difficult, even segmentation of the data is not enough to fully appreciate all the dataset correlations.

Here we would explore strong correlation (above 0.7 or below -0.7) and visualize the results using a correlation matrix.

In [35]:

plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
correlation_A = combined.corr()
correlation_A = correlation_A[(correlation_A>0.7)&(correlation_A<1)]


s = correlation_A.stack()
s = s.reset_index()
s

Out[35]:

	level_0	level_1	0
0	SAT Critical Reading Avg. Score	SAT Math Avg. Score	0.929221
1	SAT Critical Reading Avg. Score	SAT Writing Avg. Score	0.982826
2	SAT Critical Reading Avg. Score	sat_score	0.986820
3	SAT Math Avg. Score	SAT Critical Reading Avg. Score	0.929221
4	SAT Math Avg. Score	SAT Writing Avg. Score	0.931385
...	...	...	...
257	total_students	male_num	0.943833
258	total_students	female_num	0.956412
259	total_students	N_s	0.942661
260	total_students	N_t	0.923733
261	total_students	N_p	0.817987

262 rows × 3 columns

<Figure size 1152x432 with 0 Axes>

In [36]:

flights = s.pivot("level_0", "level_1", 0)
fig = px.imshow(flights)
fig.show()

We have a lot of strong positive correlations, one interesting one is the correlation between Number of Exams with score 3 4 or 5 vs asian_num.

Another curious correlation is between female_num vswhite_num.

In [37]:

plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
correlation_B = combined.corr()
correlation_B = correlation_B[(correlation_B<-0.6)]


s_B = correlation_B.stack()
s_B = s_B.reset_index()
corr_matrix_B = s_B.pivot("level_0", "level_1", 0)
fig = px.imshow(corr_matrix_B)
fig.show()

<Figure size 1152x432 with 0 Axes>

In the negative side of the correlations we found a very interesting correlation: free lunch.

Free lunch in 2011 was huge. You need to qualify to had a free lunch in 2011 , and in this year the crisis forced more families to opt for free lunch.

We need to explore further this data.

15.1. Free Lunch vs. Sat scores¶

As in the previous exercises we are going to create groups according the percentage of free lunch students. Next we compare with the sat scores results.

In [38]:

#Define the groups to divide the free lunch data
combined.frl_percent.quantile([0.01,0.2,0.5,0.75,0.95])

Out[38]:

0.01    18.40
0.20    54.44
0.50    69.60
0.75    77.00
0.95    87.17
Name: frl_percent, dtype: float64

In [39]:

#define function for classifying sat scores based on free lunch
def frl(row):
    if row < 19:
        return 'frl < 19%'
    elif row >= 19 and row < 55:
        return 'frl 19% to 55%'
    elif row >=55 and row < 70:
        return 'frl 55% to 70%'
    elif row >=70 and row < 77:
        return 'frl 70% to 77%'
    elif row >=77 and row < 88:
        return 'frl 77% to 88%'
    else:
        return 'frl > 88%'
     

#create new column using the function above
combined['frl_level'] = combined['frl_percent'].apply(frl)

frl_table = pd.pivot_table(combined, values=["sat_score"], index=['frl_level'],
                       aggfunc=numpy.mean)
frl_table.reset_index(inplace=True)
frl_table = frl_table.reindex([4,0,1,2,3,5])
frl_table.style.background_gradient(cmap=cm)

Out[39]:

	frl_level	sat_score
4	frl < 19%	1762.666667
0	frl 19% to 55%	1409.714731
1	frl 55% to 70%	1210.293112
2	frl 70% to 77%	1160.910276
3	frl 77% to 88%	1127.959859
5	frl > 88%	1100.992351

In [40]:

fig = px.bar(frl_table, x="frl_level", y="sat_score", title="Sat score vs Free lunch level")
fig.show()

We can observe:

The schools with less percentage of students with free lunch have 60% better sat score than the schools with more than 88% of students with free lunch.
But the difference between the schools with 55% to 70% of students with free lunch have a 10% better score compared with schools with 88% or more free lunch students.

The difference is abysmal, let's explore further the data.

In [41]:

table_frl = pd.pivot_table(combined, values="frl_percent", index=['school_level'],
                       aggfunc=numpy.mean)
table_frl.reset_index(inplace=True)
table_frl = table_frl.reindex([3,2,4,1,5,0])
table_frl.style.background_gradient(cmap=cm)

Out[41]:

	school_level	frl_percent
3	Lowest: Below 1%	80.800000
2	Low: Below 50%	73.471751
4	Medium: Above 50%	68.582418
1	High: Above 75%	56.842593
5	Top: Above 90%	38.203030
0	Elite: Above 99%	25.575000

In [42]:

fig = px.bar(table_frl, x="school_level", y="frl_percent", title="Free lunch percentage vs School Level")
fig.show()

The elite schools have 25.5% of free lunch students, also have less free lunch percentage of students than any other schools.
The lowest level schools have more 80.8% percentage of free lunch students.
High, medium and low level schools all have more than 50% of free lunch students.

Now let's group the group identity and the free lunch percentage above 70%. Which groups have more presence with free lunch and have score above 1200?

In [43]:

top_frl = combined[(combined["frl_percent"]>70)&(combined["sat_score"]>1266.5)]
table_frl = pd.pivot_table(top_frl, values=group_id, index=['school_level'],
                       aggfunc=numpy.mean)
table_frl.reset_index(inplace=True)
table_frl = table_frl.reindex([1,0,2])
fig = px.bar(table_frl, x="school_level", y=group_id, title="Free lunch schools > 70%, Race vs Level School")
fig.show()

Reviewing the data we obtain:

In top schools we have a large percentage of asians, is possible that top schools could help with some perks to asians, like free lunch, to attract students with greater chance to obtain high sat scores.
In high level schools we have a greater proportion of hispanics and blacks, but black persons are not in great proportion, is possible that is schools prefer hispanics over blacks to obtain better sat scores? Difficult to say.

In [44]:

fig = px.scatter(combined, x="frl_percent", y="sat_score",
                 size="sat_score", color="school_level",
                 hover_name="SCHOOL NAME", size_max=15, title='SAT Score vs. Free Lunch')
fig.show()

We find several outliers:

ALL CITY LEADERSHIP SECONDARY SCHOOL: This school have a high percentage of hispanics, and also have an average SAT score of 1315.

In [45]:

#
high_frl = combined[(combined["frl_percent"]>80)&(combined["school_level"]=="High: Above 75%")]
groups_a= ["SCHOOL NAME","white_per", "asian_per", "black_per", "hispanic_per","female_per", "saf_s_11"]
high_frl[groups_a]

Out[45]:

	SCHOOL NAME	white_per	asian_per	black_per	hispanic_per	female_per	saf_s_11
361	ALL CITY LEADERSHIP SECONDARY SCHOOL	1.5	5.3	12.9	79.5	47.9	8.9

Top schools with more than 70% of students with free lunch:

HIGH SCHOOL FOR DUAL LANGUAGE AND ASIAN STUDIES: It's an school with high percentage of asian students, with sat score average of 1430
MANHATTAN CENTER FOR SCIENCE AND MATHEMATICS: An interesting case because it's a School totally against the odds, with more than 57.4 percent of hispanics and a sat score average of 1423.

In [46]:

top_frl = combined[(combined["frl_percent"]>70)&(combined["school_level"]=="Top: Above 90%")]
top_frl[groups_a]

Out[46]:

	SCHOOL NAME	white_per	asian_per	black_per	hispanic_per	female_per	saf_s_11
46	HIGH SCHOOL FOR DUAL LANGUAGE AND ASIAN STUDIES	2.3	89.5	3.4	4.0	48.7	6.611667
67	MANHATTAN CENTER FOR SCIENCE AND MATHEMATICS	2.2	20.1	19.7	57.4	53.6	6.600000

In the extreme of almost 99.2% students with free lunch:

HIGH SCHOOL FOR EXCELLENCE AND INNOVATION: the vast majority is hispanic and black students, almost all the students with free lunch, but an average sat score 1223.

In [47]:

top_frl = combined[(combined["SCHOOL NAME"]=="HIGH SCHOOL FOR EXCELLENCE AND INNOVATION")]
top_frl[groups_a]

Out[47]:

	SCHOOL NAME	white_per	asian_per	black_per	hispanic_per	female_per	saf_s_11
83	HIGH SCHOOL FOR EXCELLENCE AND INNOVATION	0.0	0.0	25.8	72.5	47.2	7.6

Conclusions¶

We found a positive but weak correlation between sat scores and security and respect answers in the survey of students.
Hispanic population have a negative correlation with sat scores. As explained previously this is not a "racial" issue, we need to explore more factor and see outliers that have great scores even having more than 70% of hispanic students.
Female percentage is weak correlated with positive sat scores, anyway the elite schools have mostly male students.
AP test takers are correlated with sat score results, only that is not enough to indicate a positive correlation with sat scores.
Free lunch program takers have a strong negative correlation with sat scores. Mostly this could indicate that have lower resources for sat preparation is a better predictor of low scores than race, gender or AP test takers.