In this project we will use New York City public data on SAT scores, school demographics, school-level AP results, and surveys of parents, students, and teachers. The individual datasets are listed in the code below.
The SAT has recently come under fire, and this analysis also helps explain why the critics have a point: the exam suffers from a problem of circularity. It does not show that some people are smarter than others; it mostly shows that people who are good at taking tests, and who have the money to prepare well in advance, obtain better results.
A special part of the analysis is dedicated to the correlation between the free lunch program and SAT scores: fewer resources is a predictor of lower scores.
Should the SAT be overhauled? Arguably yes, because it is an exam built by people who are good at tests to measure other people who are good at tests.
On the positive side, if we eliminated this kind of exam entirely we would also lose evidence of the problems of poverty and academic preparation. Perhaps the test should be refocused on spotting the effects of poverty and on how to deal with them.
The test is also designed for people who will end up doing bureaucratic work in big companies, not for entrepreneurs or people who love art.
So, let's see the results.
We will first open the datasets; since we have several large files, we will load them all into one dictionary:
import pandas as pd
import numpy
import re
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import plotly.graph_objects as go
import matplotlib.image as mpimg
from matplotlib import rcParams
#List of the dataset file names
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
#Looping to put the datasets in the dictionary
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d
Now we will use concat to unify the two survey datasets that we downloaded. The files use windows-1252, an older text encoding, and are tab-delimited, so we pass encoding='windows-1252' and delimiter="\t" to read_csv:
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
#Put the trimmed survey back into the data dictionary
data["survey"] = survey
When we explored the data sets, we noticed that some of them, like class_size and hs_directory, don't have a DBN column. hs_directory does have a dbn column, though, so we can simply copy it into DBN.
class_size is different: it has two columns, "CSD" and "SCHOOL CODE", which combined give the DBN. So we need to pad the "CSD" column, converting a single digit into a two-character, zero-padded string so it matches the district part of the DBN, and then concatenate it with the school code.
We will use the pandas.Series.apply() method along with a custom function that:
converts the number to a string with str(),
checks its length with len(),
and left-pads it with a zero when needed (the str.zfill() method could do the same).
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
We'll need to convert the SAT Math Avg. Score, SAT Critical Reading Avg. Score, and SAT Writing Avg. Score columns in the sat_results data set from the object (string) data type to a numeric data type. We can use the pandas.to_numeric() method for the conversion. If we don't convert the values, we won't be able to add the columns together.
It's important to pass the keyword argument errors="coerce" when we call pandas.to_numeric(), so that pandas treats any invalid strings it can't convert to numbers as missing values instead.
After we perform the conversion, we can use the addition operator (+) to add all three columns together.
We also want to extract the latitude and longitude from the Location 1 column of hs_directory; for example, 40.8276026690005 and -73.90447525699966. Taken together, latitude and longitude form a pair of coordinates that pinpoints any location on Earth.
We can do the extraction with a regular expression. The re.findall() call below returns a list with one string, '(40.8276026690005, -73.90447525699966)', which we then process with the string methods split() and replace() to extract each coordinate.
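To illustrate the idea on a single value (the address below is made up; only the trailing "(lat, lon)" part matters, mirroring the format described above):
#Illustration only: a hypothetical Location 1 value ending in "(lat, lon)"; re was imported at the top
sample_loc = "123 Example Street Bronx, NY 10456 (40.8276026690005, -73.90447525699966)"
coords = re.findall(r"\(.+, .+\)", sample_loc)      # ['(40.8276026690005, -73.90447525699966)']
lat, lon = coords[0].replace("(", "").replace(")", "").split(",")
print(lat.strip(), lon.strip())                     # 40.8276026690005 -73.90447525699966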
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat
def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
We'll need to condense these data sets so that each value in the DBN column is unique. If not, we'll run into issues when it comes time to combine the data sets.
While the main data set we want to analyze, sat_results, has unique DBN values for every high school in New York City, other data sets aren't as clean. A single row in the sat_results data set may match multiple rows in the class_size data set, for example. This situation will create problems, because we don't know which of the multiple entries in the class_size data set we should combine with the single matching entry in sat_results.
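A quick way to confirm the duplication (a sketch, using the DBN column we built for class_size above):
#Count how many class_size rows share each DBN; counts above 1 mean duplicates
print(data["class_size"]["DBN"].value_counts().head())
print(data["class_size"]["DBN"].duplicated().sum(), "rows carry a DBN that already appeared")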
So we first condense class_size by keeping only grades 09-12 and GEN ED programs and averaging the remaining rows per DBN, and we filter demographics and graduation down to a single relevant row per school:
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
High school students take the AP exams before applying to college. There are several AP exams, each corresponding to a school subject. High school students who earn high scores may receive college credit.
AP exams have a 1 to 5 scale; 3 or higher is a passing score. Many high school students take AP exams, particularly those who attend academically challenging institutions. AP exams are much rarer in schools that lack funding or academic rigor.
It will be interesting to find out whether AP exam scores are correlated with SAT scores across high schools. To determine this, we'll need to convert the AP exam scores in the ap_2010 data set to numeric values first.
There are three columns we'll need to convert:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
Both the ap_2010 and the graduation data sets have many missing DBN values, so we'll use a left join when we merge the sat_results data set with them. Because we're using a left join, our final dataframe will have all of the same DBN values as the original sat_results dataframe.
We'll need to use the pandas df.merge() method to merge dataframes. The "left" dataframe is the one we call the method on, and the "right" dataframe is the one we pass into df.merge(). Because we're using the DBN column to join the dataframes, we'll need to specify the keyword argument on="DBN" when calling pandas.DataFrame.merge().
First, we'll assign data["sat_results"] to the variable combined. Then, we'll merge all of the other dataframes with combined. When we're finished, combined will have all of the columns from all of the data sets.
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")
#Fill missing numeric values with each column's mean, then fill anything still missing with 0
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
Mapping the statistics out on a school district level might be an interesting way to analyze them. Adding a column to the data set that specifies the school district will help us accomplish this.
The school district is just the first two characters of the DBN. We can apply a function over the DBN column of combined that pulls out the first two characters.
def get_first_two_chars(dbn):
    return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
Once the data is ready we can start looking at correlations. pandas' DataFrame.corr() computes the correlation matrix, and from it we select only the correlations with sat_score:
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
SAT Critical Reading Avg. Score    0.986820
SAT Math Avg. Score                0.972643
SAT Writing Avg. Score             0.987771
sat_score                          1.000000
AP Test Takers                     0.523140
...
priority08                              NaN
priority09                              NaN
priority10                              NaN
lat                               -0.121029
lon                               -0.132222
Name: sat_score, Length: 67, dtype: float64
Visualizing the correlations is easier than looking at the table.
For this part we are going to use plotly.express. Pros: the visuals are striking, it is easy to read off data values, and colors and sizes are easy to map to columns. Cons: it uses more memory, the code is heavier, and it takes more time to render.
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
survey_corr = combined.corr()["sat_score"][survey_fields].sort_values()
fig = px.bar(survey_corr, y='sat_score', color='sat_score', height=600)
fig
The top correlations that we obtain are:
N_s: Number of student respondents
N_p: Number of parent respondents
aca_s_11: Academic expectations score based on student responses
saf_s_11: Safety and Respect score based on student responses
saf_tot_11: Safety and Respect total score
I couldn't find the exact questions that the NYC government asked the students, teachers and parents, so I assume that the score runs from 0 to 10, with 10 the highest.
So, the correlations indicate that more responses from parents and students go along with better sat_score results. But the relationship is weak, below 0.5, so take it with a big pinch of salt.
The most interesting of these is the Safety and Respect score. The exact question is not clear, but it relates to how students feel about their institution and about their borough.
What we want to see is how strongly the Safety and Respect score is related to the schools' SAT scores.
The most interesting part is always the outliers: which schools have a low Safety and Respect score and still have good SAT scores?
One useful scatter plot library is plotly express. The quality of the graphs is good, but the most valuable feature is that you can see which point is which school, so you can easily retrieve the data behind it.
# plotly express can color points automatically by a column name,
# and the school name can be shown on hover using hover_name
fig = px.scatter(combined, x="saf_s_11", y="sat_score", color='sat_score', size='saf_s_11',
                 hover_name="SCHOOL NAME")
fig.show()
As expected, the graph shows that schools with higher SAT scores tend to have a better Safety and Respect score. But as always we have some interesting outliers.
The point is that this correlation is not clearly strong, and it is not clear that choosing a safer school will give you better SAT scores.
We now want to map the relation between the safety score and the different districts of the city.
To do that we will:
use pandas.groupby with the numpy.mean aggregation so we obtain the mean safety score per district,
and use plotly to put the values on a high resolution map, with mapbox-based charts such as scatter_mapbox and density_mapbox to see the results.
districts = combined.groupby("school_dist").agg(numpy.mean)
#We need to reset the index so we can later use the results as dataset
districts.reset_index(inplace=True)
districts['saf_s_11'].head()
0    6.768611
1    6.910660
2    6.716667
3    6.885714
4    6.314286
Name: saf_s_11, dtype: float64
import plotly.express as px
df = districts
#
fig = px.scatter_mapbox(df, lat="lat", lon="lon", color="saf_s_11", size="sat_score",
color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=9.5,
mapbox_style="carto-positron")
fig.show()
import plotly.express as px
df = combined
fig = px.density_mapbox(df, lat="lat", lon="lon", z="saf_s_11", radius=10,
zoom=9,hover_name="SCHOOL NAME",
mapbox_style="stamen-terrain")
fig.show()
To understand the results better, we can compare them with an NYC crime map. At first glance the distribution is not the same, but that is partly because the survey evaluates safety inside the school, not only in the surrounding area.
rcParams['figure.figsize'] = 12 ,8
image = mpimg.imread("NYC_Crime.png")
plt.imshow(image)
plt.show()
Now we are going to analyze how group identity correlates with sat_score. We have 4 groups:
white_per
asian_per
black_per
hispanic_per
To analyze the impact of each group on the SAT score we need to categorize the schools by their sat_score:
use the quantile method to choose the cut points,
define a function that assigns a level based on sat_score,
apply the function to the schools and create a new column, school_level,
and visualize the result with scatter and bar plots.
#Quantiles help us to divide the data into similarly distributed groups
combined.sat_score.quantile([0.01,0.5,0.75,0.9,0.99])
0.01     942.72
0.50    1193.00
0.75    1266.50
0.90    1423.00
0.99    1913.80
Name: sat_score, dtype: float64
#define function for classifying schools based on sat_score quantiles
def level(row):
    if row < 942.72:
        return 'Lowest: Below 1%'
    elif row >= 942.72 and row < 1193:
        return 'Low: Below 50%'
    elif row >= 1193 and row < 1266.5:
        return 'Medium: Above 50%'
    elif row >= 1266.5 and row < 1423:
        return 'High: Above 75%'
    elif row >= 1423 and row < 1913.8:
        return 'Top: Above 90%'
    else:
        return 'Elite: Above 99%'
#create new column using the function above
combined['school_level'] = combined['sat_score'].apply(level)
#view DataFrame
combined['school_level'].head()
0       Low: Below 50%
1       Low: Below 50%
2       Low: Below 50%
3    Medium: Above 50%
4       Top: Above 90%
Name: school_level, dtype: object
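The same banding can also be done without the if/elif chain using pandas.cut; a minimal sketch, reusing the quantile cut points above and writing to a hypothetical school_level_cut column so the original column is left untouched:
#Alternative sketch: build the bands with pd.cut (left-closed intervals match the level() function)
bins = [-numpy.inf, 942.72, 1193, 1266.5, 1423, 1913.8, numpy.inf]
labels = ['Lowest: Below 1%', 'Low: Below 50%', 'Medium: Above 50%',
          'High: Above 75%', 'Top: Above 90%', 'Elite: Above 99%']
combined['school_level_cut'] = pd.cut(combined['sat_score'], bins=bins, labels=labels, right=False)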
#explore the correlation between groups and sat_score
group_id = ["white_per", "asian_per", "black_per", "hispanic_per"]
group_corr = combined.corr()["sat_score"][group_id].sort_values()
fig = px.bar(group_corr, y='sat_score', color='sat_score', height=600, width=500)
fig
white_per has a positive correlation with sat_score and hispanic_per has a negative one. I think it is important to be very wary of this kind of correlation, because it can feed a racist reading of the data and the correlations are not strong.
Anyway, we are going to explore the situation with hispanic_per and compare it with the other groups.
hispanic_per vs sat_score
fig = px.scatter(combined, x="hispanic_per", y="sat_score",
                 size="sat_score", color="school_level",
                 hover_name="SCHOOL NAME", size_max=15,
                 title='SAT Score vs. Hispanic percentage of students')
fig.show()
As we can see, two of the worst performing schools are 100% hispanic, but almost every school in NYC has hispanic students.
If the correlation were clear-cut, there should be no top schools with a high percentage of hispanic students and a good sat_score, yet we have one case: MANHATTAN CENTER FOR SCIENCE AND MATHEMATICS, with 57.4% hispanic_per and a sat_score of 1430.
So we cannot conclude that this is a racial issue; we would need more information and a dynamic view of the scores. In 2019 the hispanic SAT average was better than in 2011, so the problem is not "racial"; there are other issues to explore, and later we will look in another direction.
How many students from each group identity are there at each school level? To answer the question we will build a pivot_table using school_level as the index:
table = pd.pivot_table(combined, values=group_id, index=['school_level'],
aggfunc=numpy.mean)
table.reset_index(inplace=True)
cm = sns.light_palette("#2ecc71", as_cmap=True)
table.style.background_gradient(cmap=cm)
|   | school_level | asian_per | black_per | hispanic_per | white_per |
|---|---|---|---|---|---|
| 0 | Elite: Above 99% | 46.025000 | 4.100000 | 8.225000 | 41.050000 |
| 1 | High: Above 75% | 14.520370 | 29.131481 | 40.603704 | 14.888889 |
| 2 | Low: Below 50% | 4.293220 | 44.193785 | 48.594915 | 2.252542 |
| 3 | Lowest: Below 1% | 7.750000 | 10.050000 | 78.025000 | 3.675000 |
| 4 | Medium: Above 50% | 9.815385 | 35.728571 | 46.183516 | 7.496703 |
| 5 | Top: Above 90% | 31.272727 | 17.378788 | 19.854545 | 30.190909 |
table = table.reindex([3,2,4,1,5,0])
fig = px.bar(table, x="school_level", y=group_id, title="Group percentages vs school level")
fig.show()
The group identity distribution shows us the following:
Lowest level schools: hispanic_per of 78.025%
Elite schools: hispanic_per of 8.225%
The issue here is that the data can mislead us: we could associate better SAT schools with some racial property, but that would be false.
Behind the great results there is preparation, a lot of preparation. Many Asian families come from cultures with a long tradition of competitive exams for access to government jobs; there is no "racial superiority" at work. Even low-income families invest in SAT prep.
Also, the large proportion of Asian students in elite schools indicates that these schools have a selection process that helps them admit only students with a greater chance of high SAT scores.
So the low SAT score problem is not a "race" problem; even schools with high scores have a large percentage of black and hispanic students.
But if the problem is not "racial", what is the problem?
hispanic_per: how are the groups distributed
low_schools = combined[(combined["hispanic_per"]<10)]
table_low = pd.pivot_table(low_schools, values=group_id, index=['school_level'],
aggfunc=numpy.mean)
table_low.reset_index(inplace=True)
table_low = table_low.reindex([2,3,1,4,0])
fig = px.bar(table_low, x="school_level", y=group_id, title="Group identity vs school level, hispanic_per < 10%")
fig.show()
If hispanic_per is below 10%, the distribution of the other groups changes noticeably.
Gender and sat_score
In this part we want to explore the relation between gender and SAT scores. First we check whether there is any clear correlation, and then we look at the percentages.
gender_id = ["male_per", "female_per"]
group_corr = combined.corr()["sat_score"][gender_id].sort_values()
fig = px.bar(group_corr, y='sat_score', color='sat_score', height=600, width=500)
fig
Looking at the correlation between gender and SAT scores we find: nothing.
There is a slight difference, but nothing strong enough for a clear conclusion. Let's see the distribution of female_per vs sat_score.
sat_score vs female_per
fig = px.scatter(combined, x="female_per", y="sat_score",
                 size="sat_score", color="school_level",
                 hover_name="SCHOOL NAME", size_max=15, title='SAT Score vs. Female_per')
fig.show()
Looking through the dataset, we find some interesting points worth a closer look.
female_per vs male_per and sat_score
To explore the gender differences by school level, we do the following:
combined.female_per.quantile([0.01,0.2,0.5,0.75,0.95])
0.01    12.092
0.20    43.180
0.50    50.100
0.75    55.900
0.95    73.690
Name: female_per, dtype: float64
#define function for classifying schools based on their female percentage
def gender(row):
    if row < 12:
        return '< 12%'
    elif row >= 12 and row < 50:
        return '12% to 50%'
    elif row >= 50 and row < 60:
        return '50% to 60%'
    elif row >= 60 and row < 75:
        return '60% to 75%'
    else:
        return '> 75%'
#create new column using the function above
combined['fem_level'] = combined['female_per'].apply(gender)
#view DataFrame
combined['fem_level'].head()
0    12% to 50%
1    12% to 50%
2    12% to 50%
3    50% to 60%
4    50% to 60%
Name: fem_level, dtype: object
fem_table = pd.pivot_table(combined, values=["sat_score"], index=['fem_level'],
aggfunc=numpy.mean)
fem_table.reset_index(inplace=True)
fem_table = fem_table.reindex([3,0,1,2,4])
fem_table.style.background_gradient(cmap=cm)
|   | fem_level | sat_score |
|---|---|---|
| 3 | < 12% | 1137.250000 |
| 0 | 12% to 50% | 1208.157612 |
| 1 | 50% to 60% | 1220.491512 |
| 2 | 60% to 75% | 1310.972973 |
| 4 | > 75% | 1233.832276 |
#define function for classifying schools based on their male percentage
def gender_m(row):
    if row < 12:
        return '< 12%'
    elif row >= 12 and row < 50:
        return '12% to 50%'
    elif row >= 50 and row < 60:
        return '50% to 60%'
    elif row >= 60 and row < 75:
        return '60% to 75%'
    else:
        return '> 75%'
#create new column using the function above
combined['male_level'] = combined['male_per'].apply(gender_m)
#view DataFrame
combined['male_level'].head()
0    60% to 75%
1    50% to 60%
2    50% to 60%
3    12% to 50%
4    12% to 50%
Name: male_level, dtype: object
male_table = pd.pivot_table(combined, values=["sat_score"], index=['male_level'],
aggfunc=numpy.mean)
male_table.reset_index(inplace=True)
male_table = male_table.reindex([3,0,1,2,4])
male_table.style.background_gradient(cmap=cm)
|   | male_level | sat_score |
|---|---|---|
| 3 | < 12% | 1243.701824 |
| 0 | 12% to 50% | 1241.574497 |
| 1 | 50% to 60% | 1211.220043 |
| 2 | 60% to 75% | 1195.403528 |
| 4 | > 75% | 1172.133333 |
fig = go.Figure(data=[
go.Bar(name='Male percentage', x=male_table['male_level'], y=male_table['sat_score']),
go.Bar(name='Female percentage', x=fem_table['fem_level'], y=fem_table['sat_score'])
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()
We can draw some quick conclusions from the grouped bars: average SAT scores tend to rise with the female percentage (except at the very highest female levels), while they fall as the male percentage rises.
Now we are going to explore the school levels vs the genders. The main objective is to see how the genders are distributed at the extremes.
table_gen = pd.pivot_table(combined, values=gender_id, index=['school_level'],
aggfunc=numpy.mean)
table_gen.reset_index(inplace=True)
table_gen = table_gen.reindex([3,2,4,1,5,0])
table_gen.style.background_gradient(cmap=cm)
|   | school_level | female_per | male_per |
|---|---|---|---|
| 3 | Lowest: Below 1% | 47.525000 | 52.475000 |
| 2 | Low: Below 50% | 48.566667 | 51.431638 |
| 4 | Medium: Above 50% | 52.129670 | 47.870330 |
| 1 | High: Above 75% | 51.981481 | 48.018519 |
| 5 | Top: Above 90% | 56.545455 | 43.454545 |
| 0 | Elite: Above 99% | 43.025000 | 56.975000 |
table_gen = table_gen.reindex([3,2,4,1,5,0])
fig = px.bar(table_gen, x="school_level", y=gender_id, title="Gender percentage vs School Level")
fig.show()
As expected from the previous point, female_per increases with school_level, with one new observation: the Elite schools are the exception, with a clearly higher male percentage.
Another important data point is AP test takers vs. SAT score. The AP tests use a different scale (1 to 5) and are used by college boards to decide whether a student is qualified for their programs.
What we would expect is that a school with a higher share of AP test takers has a better chance of good SAT scores.
combined["ap_per"] = combined["AP Test Takers "]/combined["total_enrollment"]
fig = px.scatter(combined, x="ap_per", y="sat_score",
size="sat_score", color="school_level",
hover_name="SCHOOL NAME", size_max=15, title='SAT Score vs. AP Percentage')
fig.show()
Looking through the data one plot at a time only gets us so far, and visualizing all the correlations in a table is really difficult; even segmenting the data is not enough to fully appreciate all of the dataset's correlations.
Here we will explore the strong correlations (above 0.7 on the positive side, below -0.6 on the negative side) and visualize the results as a correlation matrix.
#Keep only the strong positive correlations (0.7 < r < 1)
correlation_A = combined.corr()
correlation_A = correlation_A[(correlation_A>0.7)&(correlation_A<1)]
s = correlation_A.stack()
s = s.reset_index()
s
|   | level_0 | level_1 | 0 |
|---|---|---|---|
| 0 | SAT Critical Reading Avg. Score | SAT Math Avg. Score | 0.929221 |
| 1 | SAT Critical Reading Avg. Score | SAT Writing Avg. Score | 0.982826 |
| 2 | SAT Critical Reading Avg. Score | sat_score | 0.986820 |
| 3 | SAT Math Avg. Score | SAT Critical Reading Avg. Score | 0.929221 |
| 4 | SAT Math Avg. Score | SAT Writing Avg. Score | 0.931385 |
| ... | ... | ... | ... |
| 257 | total_students | male_num | 0.943833 |
| 258 | total_students | female_num | 0.956412 |
| 259 | total_students | N_s | 0.942661 |
| 260 | total_students | N_t | 0.923733 |
| 261 | total_students | N_p | 0.817987 |

262 rows × 3 columns
#Pivot the stacked pairs into a matrix and show it as a heatmap
corr_matrix_A = s.pivot(index="level_0", columns="level_1", values=0)
fig = px.imshow(corr_matrix_A)
fig.show()
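Since seaborn is already imported, the same matrix can also be rendered as a static heatmap; a minimal sketch (the color map is an arbitrary choice):
#Static alternative with seaborn; NaN cells (the weak correlations we filtered out) are left blank
plt.figure(figsize=(16, 6))
sns.heatmap(corr_matrix_A, cmap="viridis")
plt.show()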
We have a lot of strong positive correlations. One interesting one is between Number of Exams with scores 3 4 or 5 and asian_num.
Another curious correlation is between female_num and white_num.
#Keep only the strong negative correlations (r < -0.6)
correlation_B = combined.corr()
correlation_B = correlation_B[(correlation_B<-0.6)]
s_B = correlation_B.stack()
s_B = s_B.reset_index()
corr_matrix_B = s_B.pivot(index="level_0", columns="level_1", values=0)
fig = px.imshow(corr_matrix_B)
fig.show()
On the negative side of the correlations we find a very interesting one: free lunch.
Free lunch participation in 2011 was huge. You needed to qualify for free lunch, and in that year the economic crisis forced more families to opt in. We need to explore this data further.
As in the previous exercises, we are going to create groups according to the percentage of free lunch students and then compare them with the SAT score results.
#Define the groups to divide the free lunch data
combined.frl_percent.quantile([0.01,0.2,0.5,0.75,0.95])
0.01    18.40
0.20    54.44
0.50    69.60
0.75    77.00
0.95    87.17
Name: frl_percent, dtype: float64
#define function for classifying schools based on their free lunch percentage
def frl(row):
    if row < 19:
        return 'frl < 19%'
    elif row >= 19 and row < 55:
        return 'frl 19% to 55%'
    elif row >= 55 and row < 70:
        return 'frl 55% to 70%'
    elif row >= 70 and row < 77:
        return 'frl 70% to 77%'
    elif row >= 77 and row < 88:
        return 'frl 77% to 88%'
    else:
        return 'frl > 88%'
#create new column using the function above
combined['frl_level'] = combined['frl_percent'].apply(frl)
frl_table = pd.pivot_table(combined, values=["sat_score"], index=['frl_level'],
aggfunc=numpy.mean)
frl_table.reset_index(inplace=True)
frl_table = frl_table.reindex([4,0,1,2,3,5])
frl_table.style.background_gradient(cmap=cm)
|   | frl_level | sat_score |
|---|---|---|
| 4 | frl < 19% | 1762.666667 |
| 0 | frl 19% to 55% | 1409.714731 |
| 1 | frl 55% to 70% | 1210.293112 |
| 2 | frl 70% to 77% | 1160.910276 |
| 3 | frl 77% to 88% | 1127.959859 |
| 5 | frl > 88% | 1100.992351 |
fig = px.bar(frl_table, x="frl_level", y="sat_score", title="Sat score vs Free lunch level")
fig.show()
We can observe that the average SAT score falls steadily as the free lunch percentage rises; the gap between the lowest and highest free lunch levels is abysmal. Let's explore the data further.
table_frl = pd.pivot_table(combined, values="frl_percent", index=['school_level'],
aggfunc=numpy.mean)
table_frl.reset_index(inplace=True)
table_frl = table_frl.reindex([3,2,4,1,5,0])
table_frl.style.background_gradient(cmap=cm)
|   | school_level | frl_percent |
|---|---|---|
| 3 | Lowest: Below 1% | 80.800000 |
| 2 | Low: Below 50% | 73.471751 |
| 4 | Medium: Above 50% | 68.582418 |
| 1 | High: Above 75% | 56.842593 |
| 5 | Top: Above 90% | 38.203030 |
| 0 | Elite: Above 99% | 25.575000 |
fig = px.bar(table_frl, x="school_level", y="frl_percent", title="Free lunch percentage vs School Level")
fig.show()
Now let's combine group identity with a free lunch percentage above 70%: which groups have more presence in schools with high free lunch rates and SAT scores above 1266.5 (the 75th percentile)?
top_frl = combined[(combined["frl_percent"]>70)&(combined["sat_score"]>1266.5)]
table_frl = pd.pivot_table(top_frl, values=group_id, index=['school_level'],
aggfunc=numpy.mean)
table_frl.reset_index(inplace=True)
table_frl = table_frl.reindex([1,0,2])
fig = px.bar(table_frl, x="school_level", y=group_id, title="Free lunch schools > 70%, Race vs Level School")
fig.show()
Reviewing the full data in a scatter plot of free lunch percentage vs SAT score:
fig = px.scatter(combined, x="frl_percent", y="sat_score",
size="sat_score", color="school_level",
hover_name="SCHOOL NAME", size_max=15, title='SAT Score vs. Free Lunch')
fig.show()
We find several outliers. For example, a high-level school with more than 80% of students on free lunch:
#
high_frl = combined[(combined["frl_percent"]>80)&(combined["school_level"]=="High: Above 75%")]
groups_a= ["SCHOOL NAME","white_per", "asian_per", "black_per", "hispanic_per","female_per", "saf_s_11"]
high_frl[groups_a]
|   | SCHOOL NAME | white_per | asian_per | black_per | hispanic_per | female_per | saf_s_11 |
|---|---|---|---|---|---|---|---|
| 361 | ALL CITY LEADERSHIP SECONDARY SCHOOL | 1.5 | 5.3 | 12.9 | 79.5 | 47.9 | 8.9 |
Top schools with more than 70% of students with free lunch:
top_frl = combined[(combined["frl_percent"]>70)&(combined["school_level"]=="Top: Above 90%")]
top_frl[groups_a]
|   | SCHOOL NAME | white_per | asian_per | black_per | hispanic_per | female_per | saf_s_11 |
|---|---|---|---|---|---|---|---|
| 46 | HIGH SCHOOL FOR DUAL LANGUAGE AND ASIAN STUDIES | 2.3 | 89.5 | 3.4 | 4.0 | 48.7 | 6.611667 |
| 67 | MANHATTAN CENTER FOR SCIENCE AND MATHEMATICS | 2.2 | 20.1 | 19.7 | 57.4 | 53.6 | 6.600000 |
At the extreme, with almost 99.2% of students on free lunch:
top_frl = combined[(combined["SCHOOL NAME"]=="HIGH SCHOOL FOR EXCELLENCE AND INNOVATION")]
top_frl[groups_a]
|   | SCHOOL NAME | white_per | asian_per | black_per | hispanic_per | female_per | saf_s_11 |
|---|---|---|---|---|---|---|---|
| 83 | HIGH SCHOOL FOR EXCELLENCE AND INNOVATION | 0.0 | 0.0 | 25.8 | 72.5 | 47.2 | 7.6 |