In this demo project, we will dig into the US education system and ask whether standardized testing works and is fair. Because SAT scores act as a pass to college for high school students when they apply, is the standardized test unfair to certain groups? To investigate this question, we will focus on the relationship between SAT scores and demographic information. We could correlate SAT scores with factors like race, gender, income, and more.
The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's important to perform well.
The test consists of three sections, each of which has 800 possible points. The combined score is out of 2,400 possible points (while this number has changed a few times, the dataset for our project is based on 2,400 total points). Organizations often rank high schools by their average SAT scores. The scores are also considered a measure of overall school district quality.
Data on high school SAT scores and demographics for each NYC high school is available on the NYC Open Data website.
Here is a quick look at these data. One thing we can observe is that if we work with only these 2 datasets, we can't get all of the information we want for our analysis, so we have to gather information from somewhere else and combine it.
Below are some websites with datasets covering SAT scores and demographic information; we have to take the data from each website and, because the datasets are interrelated, gather it into one dataset before we can start.
Before we move into coding, we'll need to do some background research. A thorough understanding of the data helps us avoid costly mistakes, such as thinking that a column represents something other than what it does. Background research gives us a better understanding of how to combine and analyze the data. We'll research these sources:
From the research, we've observed some information:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Since we have several files, we will read the file names from a list and store the resulting data frames in a dictionary for later convenience.
## Create function to open file:
def open_file(file_name):
    file = pd.read_csv('schools/{}'.format(file_name))
    return file
## Create a list of file names:
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduations.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for i in data_files:
    key = i.replace('.csv','') ## extract the name of file
    d = open_file(i)
    data[key] = d
## Check:
for key, value in data.items():
    print(key)
Since our main interest is the SAT score data, let's take a look at a few records under the 'sat_results' key of the data dictionary.
## Extract some records of SAT data:
print(data['sat_results'].head())
data['sat_results'].info()
We have some information about this data:
For the other data we've got, we will loop through the keys in the data store and look at the first five rows of each dataset.
for key in data:
    if key != 'sat_results':
        print(data[key].head())
We can observe that:
Previously, we did not import the survey files because their format is txt and the encoding isn't UTF-8; now we will load them into 2 data frames and merge them into 1. The files contain the District 75 survey and the common (all-schools) survey.
## Read survey_all in:
all_survey = pd.read_csv('schools/survey_all.txt', delimiter = '\t', encoding='windows-1252')
# Read survey_d75 in:
d75_survey = pd.read_csv('schools/survey_d75.txt', delimiter = '\t', encoding='windows-1252')
## Combine data:
survey = pd.concat([all_survey,d75_survey], axis=0)
survey.head()
all_survey.info()
We observed that:
After cross-checking with the dataset description, we can see that apart from the DBN field and the fields from "rr_s" to "aca_tot_11", we don't need most of the rest, so we will drop them to make the data cleaner and easier to process. The reason to keep those fields: these columns give us aggregate survey data about how parents, teachers, and students feel about school safety, academic performance, and more. They also give us the DBN, which allows us to uniquely identify the school.
To process this:
## Copy and create new column of dbn value:
survey.loc[:,'DBN'] = survey['dbn']
## Create a lst of field for analysis:
list_name = ['DBN',"rr_s", "rr_t", "rr_p",
"N_s", "N_t", "N_p", "saf_p_11", "com_p_11",
"eng_p_11", "aca_p_11", "saf_t_11", "com_t_11",
"eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11",
"aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
survey = survey.loc[:,list_name]
print(survey.shape)
# Upload into data store:
data['survey'] = survey
Back to the datasets in the data store, we can see that the DBN field is not visible in the school directory (hs_directory) and class_size data; in class_size, it is completely missing.
Looking in detail, if we prepend the CSD to the school code in the class_size fields, we get the DBN found in the sat_results data, like below:
That means we need to add a leading zero before each single-digit CSD value, like the example below:
Then we combine the new CSD with the School Code, and we have a DBN column for the class_size data.
# Check data type of CSD:
data['class_size']['CSD'].dtype
# Check data type of SCHOOL CODE
data['class_size']['SCHOOL CODE'].dtype
## Fill CSD function:
def fill_z(number):
    if len(str(number)) > 1:
        return str(number)
    if len(str(number)) == 1:
        new = str(number).zfill(2)
        return new
# Apply for CSD series:
data['class_size'].loc[:,'pad_CSD'] = data['class_size']['CSD'].apply(fill_z)
data['class_size']['pad_CSD'].unique()
## Create a new DBN column:
data['class_size'].loc[:,'DBN'] = data['class_size']['pad_CSD'] + data['class_size']['SCHOOL CODE']
## Check:
data['class_size']['DBN'].head()
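As a side note, the same padding and DBN construction can be done without a helper function, using pandas' vectorized string methods. A minimal sketch, assuming CSD is numeric; pad_csd and dbn_alt are illustrative names not used in the rest of the analysis:
## Vectorized alternative to fill_z:
pad_csd = data['class_size']['CSD'].astype(str).str.zfill(2)   ## e.g. 1 -> "01", 10 -> "10"
dbn_alt = pad_csd + data['class_size']['SCHOOL CODE']          ## e.g. "01" + "M292" -> "01M292"
dbn_alt.head()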
As mentioned, the SAT dataset stores the three section results (SAT Critical Reading Avg. Score, SAT Math Avg. Score, SAT Writing Avg. Score) in separate columns, and we'd like to combine them into one total score for convenience in the analysis. The job for us now is to check the data type of these three columns and change them to numeric if they are not already.
column_check = ['SAT Critical Reading Avg. Score', 'SAT Math Avg. Score', 'SAT Writing Avg. Score']
# Check the format:
data['sat_results'][column_check].dtypes
## Convert data type:
for i in column_check:
    data['sat_results'][i] = pd.to_numeric(data['sat_results'][i], errors='coerce')
## Combine data:
data['sat_results'].loc[:,'sat_scores'] = data['sat_results'].iloc[:,3:].sum(axis=1)
data['sat_results']['sat_scores'].head()
In sub-mission 6, we combined the SAT results into one column and stored it in sat_scores. Now, checking the other datasets, we can see that Location 1 in the hs_directory file is written in a complex way. It's time to extract the longitude and latitude coordinates; this enables us to map the schools and uncover any geographic patterns in the data.
To get a clearer picture of the location field mentioned above, let's look at a few rows of location data:
## Preview of location data:
data['hs_directory']['Location 1'].head(n=3)
## Extract the first location:
data['hs_directory']['Location 1'].iloc[0]
There is a common pattern here: XX address \n address XX \n (lat_position, long_position), and the part we want to pull out is the data in parentheses (lat_pos, long_pos). Taken together, latitude and longitude make up a pair of coordinates that allows us to pinpoint any location on Earth.
We will use a regex to parse out only the part we want, in the code block below:
## Import the regex library
import re
## Define function to extract the lat_coordinate:
def coordinate(str_):
    modify = re.findall(r'\(.+\)', str_)
    lat = (modify[0].split(','))[0].replace('(','')
    return lat
## Filter:
data['hs_directory'].loc[:,'lat'] = data['hs_directory']['Location 1'].apply(coordinate)
data['hs_directory']['lat'].head()
## Define function for long coordinate:
def coordinate_long(str_):
    modify = re.findall(r'\(.+\)', str_)
    long = (modify[0].split(','))[1].replace(')','')
    return long
# Filter:
data['hs_directory'].loc[:,'long'] = data['hs_directory']['Location 1'].apply(coordinate_long)
# Convert the data type:
order = ['lat', 'long']
for name in order:
    data['hs_directory'][name] = pd.to_numeric(data['hs_directory'][name], errors='coerce')
## Check the longitude data:
print(data['hs_directory']['long'].head(n=3))
## Check the data type:
data['hs_directory'][order].dtypes
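As a design note, the two helper functions above could be replaced by a single vectorized extraction. A minimal sketch using Series.str.extract with two capture groups; lat_alt and long_alt are illustrative column names, not used in the rest of the analysis:
## Extract both coordinates at once with one regex (two capture groups):
coords = data['hs_directory']['Location 1'].str.extract(r'\(([^,]+),\s*([^)]+)\)')
data['hs_directory'].loc[:,'lat_alt'] = pd.to_numeric(coords[0], errors='coerce')
data['hs_directory'].loc[:,'long_alt'] = pd.to_numeric(coords[1], errors='coerce')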
Conclusion: Over 6 sub-missions, we've finished pre-processing the data, but the other datasets still contain multiple duplicate DBN records, while in the sat_results dataset this field is unique. For example, sat_results has 01M292 -> 01M448 -> ... once each, while a dataset like class_size has 01M292 x n times -> 01M448 x n times -> ... This situation creates problems, because we don't know which of the multiple entries in the class_size dataset we should combine with the single matching entry in sat_results. To resolve this issue, we'll condense the class_size, graduations and demographics datasets so that each DBN is unique.
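To illustrate why duplicate DBNs are a problem, here is a minimal toy sketch with made-up values (not from our files): merging a one-row SAT record against three class_size rows for the same DBN triples the SAT row, while condensing first removes the ambiguity.
## Toy illustration (hypothetical values) of the duplicate-DBN problem:
toy_sat = pd.DataFrame({'DBN': ['01M292'], 'sat_scores': [1200]})
toy_class = pd.DataFrame({'DBN': ['01M292']*3, 'AVERAGE CLASS SIZE': [22, 28, 31]})
print(toy_sat.merge(toy_class, on='DBN', how='left'))   ## 3 rows for a single school
print(toy_sat.merge(toy_class.groupby('DBN').mean().reset_index(), on='DBN', how='left'))   ## 1 row after condensing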
Let's take a look at a few records of the class_size data:
data['class_size'].head(n=3)
The DBN is duplicated 5~6 times due to differences in PROGRAM TYPE, GRADE and other fields, but let's look more closely at the PROGRAM TYPE and GRADE fields:
## Identify the value of PROGRAM TYPE field:
print('Dataset of Program type: {}'.format(data['class_size']['PROGRAM TYPE'].unique()))
## Identify the value of GRADE field:
print('Dataset of Grade: {}'.format(data['class_size'].iloc[:,4].unique())) # GRADE is the 5th column (position index 4)
Because we're only interested in the high school level (these are the SAT's attendees), we drop everything except GRADE '09-12' (removing the lower grades and 'MS Core') and keep only the 'GEN ED' program type.
# Store the trial data:
class_size = data['class_size'].copy()
## Filter the GRADE & PROGRAM TYPE data:
bool_1 = class_size.iloc[:,4] == '09-12'
bool_2 = class_size['PROGRAM TYPE'] == 'GEN ED'
add = bool_1&bool_2
class_size = class_size[add]
# Check:
print(class_size.shape)
class_size.head()
As the result points out, the DBN is still not unique, and one school is still repeated 5 times or more. At the same time, we can see the values change across CORE SUBJECT and CORE COURSE (e.g. English 9, 10... and MATH with various courses). But we don't want the final data to cover only a subset of courses of a subset of subjects in any school; instead, we want our class size data to include every single class a school offers, so that the data keeps as much information as possible.
To do this, we can group by the DBN field and apply an aggregate average to each group.
## Group dataset along DBN field, and computing:
class_size = (class_size.groupby('DBN')).agg(np.mean)
## Assign back the DBN column: (because the groupby operate make DBN become index)
class_size.reset_index(inplace=True)
class_size.head()
## Assign back to the class size data in data store:
data['class_size'] = class_size
We're done with the class_size dataset; let's move on to the demographics data and see what we can do there.
# Check the demographics data:
data['demographics'].head(n=3)
We see the same scene as with the class_size data before: duplicates. One item (school) is duplicated across the schoolyear field, which describes the period used for calculating students' graduation. Going back to the SAT scores dataset, remember that we only analyze the 2011~2012 period, so this gives us a condition to filter the data.
## Check the schoolyear field:
data['demographics']['schoolyear'].dtypes
# Convert data type:
def to_str(number):
    str_1 = str(number)[:4]
    str_2 = str(number)[4:]
    return str_1 + '-' + str_2
data['demographics'].loc[:,'schoolyear'] = data['demographics']['schoolyear'].apply(to_str)
## Filter:
bool_1 = data['demographics']['schoolyear'] == '2011-2012'
data['demographics'] = data['demographics'][bool_1]
data['demographics'].head(n=3)
A quick glance at the first 3 records shows no duplicate DBN items, so everything went OK.
The next step is to check the graduations data and clean it if any duplicates appear; let's look at a few records:
## Check the graduations data:
data['graduations'].head(n=3)
Because our interest is in the 2011-2012 period, let's check the Cohort field for any values outside the range we want. In addition, we only want the value Total Cohort in the Demographic field (Demographic appears to refer to a specific demographic group, while Cohort refers to the year the data represents), so let's check and see what to delete.
## Check the range period in Cohort:
data['graduations']['Cohort'].unique()
## Create mask condition:
bool_1 = data['graduations']['Cohort'] == '2006'
bool_2 = data['graduations']['Demographic'] == 'Total Cohort'
combine = bool_1&bool_2
## Filter:
data['graduations'] = data['graduations'][combine]
data['graduations'].head(n=3)
We've now got the data almost free of DBN duplicates, so let's turn to the Advanced Placement (AP) data. High school students take the AP exams before applying to college. There are several AP exams, each corresponding to a school subject; high school students who earn high scores may receive college credit. What we need to do is convert the AP test score columns from strings to numeric values.
AP exams have a 1 to 5 scale; 3 or higher is a passing score. Many high school students take AP exams, particularly those who attend academically challenging institutions. AP exams are rarer in schools that lack funding or academic rigor. It will be interesting to find out whether AP exam scores are correlated with SAT scores across high schools.
There are three columns we'll need to convert:
## Check column name:
data['ap_2010'].columns
## Format column name:
data['ap_2010'].columns = data['ap_2010'].columns.str.strip()
## Convert value format
col = ['AP Test Takers', 'Total Exams Taken','Number of Exams with scores 3 4 or 5']
for name in col:
    data['ap_2010'][name] = pd.to_numeric(data['ap_2010'][name], errors='coerce')
# Check:
data['ap_2010'][col].dtypes
Now we're almost done, so it's time to consider how to combine the datasets in the data store. Each now has a unique DBN field, but we also observed that not every DBN value in the SAT results dataset matches the other datasets (in other words, the other datasets contain more DBN values than the SAT results data). That's why we need to think about what kind of merge to perform.
It's a good time to look back at our goal: find the correlation between SAT scores and demographic data. Since our SAT data isn't as large as the other data, we don't want to drop any information from the SAT score data, so which kind of join should we use?
We will use the pandas merge function, with the left method on the datasets where DBN coverage is incomplete, to avoid losing data; where the coverage is good, an inner join will give us the proper data. Let's start with the sat_results dataset as the left side.
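Before applying this to the real data, here is a minimal toy sketch (hypothetical DBNs and values, not from our files) of the difference between the two join types: a left join keeps every SAT row even when there is no match, while an inner join keeps only the DBNs present in both tables.
## Toy illustration of left vs inner joins:
left_df = pd.DataFrame({'DBN': ['01M292', '01M448'], 'sat_scores': [1200, 1450]})
right_df = pd.DataFrame({'DBN': ['01M292'], 'AP Test Takers': [39]})
print(left_df.merge(right_df, on='DBN', how='left'))    ## keeps both rows, NaN where unmatched
print(left_df.merge(right_df, on='DBN', how='inner'))   ## keeps only the matching DBN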
# Combine first pair dataset:
combine = data['sat_results'] ## define the left data
combine = combine.merge(data['ap_2010'],how='left',on='DBN')
# Combine the next dataset:
combine = combine.merge(data['graduations'], how='left', on='DBN')
# Check:
print(combine.shape)
combine.head(n=3)
Now that we've performed the left joins, we still have to merge class_size, demographics, survey, and hs_directory into combined. Since these files contain information that's more valuable to our analysis and also have fewer missing DBN values, we'll use the inner join type.
# Setting for display more columns:
pd.set_option('display.max_columns',60)
## Priority to merge: class_size -> demographics -> survey -> hs_directory
combine = combine.merge(data['class_size'],how='inner', on='DBN', suffixes=('_x1', '_y1'))
combine = combine.merge(data['demographics'], on='DBN', how='inner', suffixes=('_x2','_y2'))
combine = combine.merge(data['survey'], on='DBN', how='inner', suffixes=('_x3','_y3'))
print(combine.shape)
combine.head(n=3)
# Processing combine with hs_directory:
data['hs_directory'].columns = data['hs_directory'].columns.str.strip()
# Create a new columns:
data['hs_directory'].loc[:,'DBN'] = data['hs_directory']['dbn']
combine = combine.merge(data['hs_directory'], on='DBN', how='inner', suffixes=('_x4','_y4'))
print(combine.shape)
Now, we will fill in the missing values with the column mean (for columns on a ratio scale, like avg_class_size ...) and fill whatever remains with 0.
# Fill with mean value:
mean = combine.copy().mean()
combine.fillna(mean, inplace=True)
# Fill any remain with 0:
combine.fillna(0, inplace=True)
# Check:
combine.isnull().sum()
# Extract the school district:
def extract_dis(str_):
    return str_[0:2]
combine.loc[:,'school_dis'] = combine.DBN.apply(extract_dis)
# Check:
combine.school_dis.head(n=3)
The data is now ready for analysis and plotting, so let's move on to the analysis process.
As mentioned, we'd like to relate the SAT results to demographic information, so to get an overview of what we will do in the next steps, let's reveal the relations between these fields.
We will use the correlation method, via the r value, to measure the relation between these fields. An r value falls between -1 and 1. The value determines whether two columns are positively correlated, not correlated, or negatively correlated. The closer to 1 the r value is, the stronger the positive correlation between the two columns. The closer to -1 the r value is, the stronger the negative correlation (i.e., the more "opposite" the columns are). The closer to 0, the weaker the correlation.
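As a quick illustration of what the r value measures, here is a minimal sketch that computes the Pearson correlation by hand on two small hypothetical series and compares it with pandas' built-in method:
## Pearson r by hand vs pandas, on two hypothetical series:
x = pd.Series([400, 420, 450, 500, 520])
y = pd.Series([410, 400, 470, 530, 510])
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(((x - x.mean())**2).sum() * ((y - y.mean())**2).sum())
print(r_manual)    ## manual Pearson r
print(x.corr(y))   ## same value from pandas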
Because our main theme is the fairness of a standardized test like the SAT, the fields attached to students, such as gender and race, and their correlations with the SAT are the objects we're aiming for. If men tended to score higher on the SAT, for example, that would indicate that the SAT is potentially unfair to women, and vice versa.
# Setting the display row:
pd.set_option('display.max_rows',80)
# Find the correlation:
correlation = combine.corr(method='pearson')['sat_scores'] ## Set the obj is sat scores
correlation
Let's make a barplot of this correlation result; we will include the survey items we added, together with the SAT section scores, to make them easier to track.
## Create group of correlation graph:
col = ['SAT Critical Reading Avg. Score','SAT Math Avg. Score','SAT Writing Avg. Score',"rr_s", "rr_t", "rr_p",
"N_s", "N_t", "N_p", "saf_p_11", "com_p_11",
"eng_p_11", "aca_p_11", "saf_t_11", "com_t_11",
"eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11",
"aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
## Create relation graph:
df = correlation[col]
## Create color function:
cor_negative = df<0
cor_nev_map = cor_negative.map({True:'red',
False:'blue'})
ax = df.plot.barh(figsize=(10, 10), color=cor_nev_map)
plt.xlabel('Relation score')
plt.ylabel('Factor')
plt.title('Relation between SAT result and factors')
plt.show()
For the correlation results:
- Together with the SAT section scores (SAT Critical Reading Avg. Score, SAT Math Avg. Score, SAT Writing Avg. Score), the survey fields do show a relation with the SAT result, but not a strong one, so there is a high probability that we won't get any stronger conclusion from this data.
- Gender counts (male_num, female_num ...) show only a weak relation.
- Among the race fields, white_per seems to have the highest relation with a higher SAT result.
- total_enrollment seems to have a clear relation with the SAT result: schools with more students seem to do better than schools with fewer enrollments.
- ell_percent and frl_percent stand out for their negative correlations: SAT results do not relate well to schools that mainly serve English language learners (students whose native language is Spanish, French, etc.).
- saf_t_11 and saf_s_11 relate only weakly to SAT scores (less than 0.25).

Since enrollment seems to have a relation with the SAT result, let's make a plot to see the relation between these fields.
# Relation graph of SAT result and enroll:
plt.scatter(combine.total_enrollment, combine.sat_scores)
plt.xlabel('Total enrollment (person)')
plt.ylabel('SAT result')
plt.title('Relation of SAT result and total enrollment')
plt.show()
Looking at the relation graph, we can see something contrary to expectations: even where enrollment is high, the SAT results do not seem higher than in the areas with lower enrollment, so we can reject the supposition that higher enrollment means better SAT results.
We also observe a cluster with total enrollment less than 1000 and SAT results less than 1000 points. We will filter the school names that appear in this cluster and see which schools they are.
## Get the name of school with total enrollment and SAT result both less than 1000
bool_1 = combine.total_enrollment <1000 # enroll amount less than 1000
bool_2 = combine.sat_scores <1000 # sat score less than 1000 points
total = bool_1&bool_2
low_enroll = combine[total]
low_enroll.loc[:,['SCHOOL NAME','ell_percent']]
Most of the schools in this cluster have a high ell_percent (with some exceptions, such as the COLUMBIA SECONDARY SCHOOL FOR MATH, SCIENCE.. case). We will generate a relation graph of ell_percent and sat_scores to take a closer look.
## Relation graph of English language learner percentage with SAT result:
plt.scatter(combine['ell_percent'], combine['sat_scores'])
plt.xlabel('English language leaner percentage (%)')
plt.ylabel('SAT result (points)')
plt.show()
One way to make very granular statistics easier to read is to aggregate them. In this case, we aggregate by district, which enables us to understand how ell_percent varies district-by-district instead of the unintelligibly granular school-by-school variation.
## Aggregate data by district:
districts = combine.groupby('school_dis').agg(np.mean)
## Display the data:
districts.reset_index(inplace=True)
districts.loc[:,['school_dis','ell_percent']].head(n=5)
As mentioned above, the teacher and student safety scores have only a weak relation with the SAT result. But we can plot a relation graph to check just how weak the relationship is, in the graph below:
## Relation graph of saf_s_11 field and SAT result:
combine.plot.scatter(x='saf_s_11', y='sat_scores')
plt.xlabel('Safety survey from Student')
plt.ylabel('SAT results (score)')
plt.show()
Except for some points from 7.0 and above that go with higher SAT results, the rest mostly stick in the range of 1000~1500 points. Now let's look at how the safety report varies by borough.
## Aggregate safety scores by borough
combine.loc[:,'borough'] = combine.DBN.str[2] ## define the borough : X-borough-district_number
boroughs = combine.groupby('borough').agg(np.mean) ## we evaluate eveything by the mean value
boroughs.reset_index(inplace = True)
boroughs[['borough','saf_s_11']]
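To make the single-letter codes easier to read, we can map them to full borough names (the standard NYC DBN convention: K = Brooklyn, M = Manhattan, Q = Queens, R = Staten Island, X = The Bronx). A minimal sketch; borough_name is an illustrative column not used later:
## Map borough letters to full names (standard NYC DBN codes):
borough_names = {'K':'Brooklyn', 'M':'Manhattan', 'Q':'Queens', 'R':'Staten Island', 'X':'The Bronx'}
boroughs.loc[:,'borough_name'] = boroughs['borough'].map(borough_names)
boroughs[['borough','borough_name','saf_s_11']]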
With the borough codes K, M, Q, R, X decoded, we can observe from this result that Manhattan and Queens are the boroughs where students rate the safety condition highest. So, let's see whether these places also have higher SAT results (even though we know these fields have little relation with each other).
boroughs[['borough','sat_scores']].sort_values(by='sat_scores', ascending=False)
As expected, Richmond County (the borough of Staten Island) got the highest SAT result, but because we take the mean SAT result for each borough, there's a chance that a few standout cases drive it; still, we can expect Staten Island to appear in the list of schools with high SAT scores.
In sub-mission 1, we briefly touched on the demographics data (which borough? ...). Now we will dig deeper into the demographics data, such as race/ethnicity (white_per, asian_per, black_per, hispanic_per) and gender (male_per, female_per), and answer the question: is the standardized test unfair to certain groups?
To get a quick look, we will plot a relation graph of the 4 race/ethnicity factors above against the SAT result, and then dig deeper based on that graph.
## Relation graph of race/ethnicity vs SAT result:
# Define columns:
reg = ['white_per', 'asian_per', 'black_per', 'hispanic_per']
# Define color filter:
corr_reg_pos = correlation[reg]>0
corr_pos_map = corr_reg_pos.map({True:'blue', False:'red'})
# Plot:
correlation[reg].plot.barh(figsize=(5,6), color = corr_pos_map)
plt.xlabel('Relation degree')
plt.ylabel("Student's infor: race/ religon")
plt.title('Relation between SAT result with\nrace/religon information of student')
plt.show()
We have some observations:
Now we want to find the schools these students attend, for three cases:
- Schools with low vs. high SAT results (below or above 1200 points)
- Schools with hispanic_per greater than 95%
- Schools with hispanic_per less than 10% and an average SAT score greater than 1800

Before that, let's plot the relation between the hispanic student percentage and the SAT result:
## Relation between hispanic_per with sat_scores:
combine.plot.scatter(x='hispanic_per', y='sat_scores')
plt.xlabel('Ratio of hispanic student join (%)')
plt.ylabel('SAT result (points)')
plt.show()
Now that we clearly know what each case covers, let's dig in.
## Find school with hispanic student in (find by hispanic_per)
hispanic_stdn = combine[['sat_scores','hispanic_per','SCHOOL NAME']]
## case 1:
#1. Low SAT result:
low_SAT = hispanic_stdn[hispanic_stdn.sat_scores<1200]
#2. High SAT result:
high_SAT = hispanic_stdn[hispanic_stdn.sat_scores>1200]
## case 2: find school with ratio of hispanic student greater than 95%:
hispnaic_95 = hispanic_stdn[hispanic_stdn.hispanic_per>95]
## case 3: find the school with less hispanic student and average SAT score greater than 1800:
case3 = hispanic_stdn[hispanic_stdn.hispanic_per<10]
# Groupby school:
school = case3.groupby('SCHOOL NAME').agg(np.mean)
school = school.reset_index()
# Find the information of school with hispanic student is the most:
hispnaic_95['SCHOOL NAME']
After researching these schools, we find that they primarily serve Spanish-speaking English language learners (they are taught English, but much of the time they speak Spanish). For that reason, their SAT results can be a bit lower, because they must learn English first and need time to get used to it. Let's move on to the schools with fewer hispanic students.
## Find the school with hispanic student in and average SAT scores is above 1800:
school[['SCHOOL NAME','sat_scores']][school['sat_scores']>1800]
We already touched on this when we looked at the relation between borough and sat_scores and concluded that the borough with the highest SAT result is Staten Island; one of those schools appears here, STATEN ISLAND TECHNICAL HIGH SCHOOL. The school with the highest SAT result is STUYVESANT HIGH SCHOOL, which is located in Manhattan and whose principal, according to our background research, brings an Asian approach to study. The shared point is that the ratio of hispanic students at these schools is small.
Now, let's take a look at the gender information and, as usual, start with its relation graph to get an overview: between men and women, which gender tends to have higher SAT scores?
## Relation of gender to SAT result:
#Define the columns:
gender = ['male_per', 'female_per']
#Color filter:
corr_gen = correlation[gender]>0
corr_gen_map = corr_gen.map({True:'pink',False:'Grey'})
#Plot:
correlation[gender].plot.barh(color=corr_gen_map)
plt.xlabel('Correlation with SAT result')
plt.ylabel('Gender field')
plt.show()
We can see a very weak relation between gender and SAT results (less than 0.1), but looking closer, we can say that female students do slightly better than male students. Let's explore it by finding the school names for the special cases below:
- Schools with female_per greater than 50% and an average SAT score greater than 1200
- Schools with female_per greater than 60% and an average SAT score greater than 1700

Before this, let's make a graph of the female percentage against the SAT result.
## Relation graph of female student and SAT result:
combine.plot.scatter(x='female_per', y='sat_scores')
plt.xlabel('Ratio of girls in school (%)')
plt.ylabel('SAT result (points)')
plt.show()
# Find school name with special case:
girls = combine[['SCHOOL NAME', 'female_per', 'sat_scores']] ## Extract female student into another df
# case 1: high female_per (>50%) and high SAT score (>1200 points)
bool_a = girls['female_per']>50
bool_b = girls['sat_scores']>1200
mask = bool_a&bool_b
case1 = girls[mask]['SCHOOL NAME']
# case 2: female_per > 60%, average SAT score >1700:
con1 = girls[girls['female_per'] >60]
case2 = con1.groupby('SCHOOL NAME').agg(np.mean)
school_name = case2.reset_index()
# School name case 1:
case1
Because case 1 gives little insight, we'll move on to case 2:
## School name case 2:
school_name[['SCHOOL NAME','sat_scores']][school_name['sat_scores']>1700]
We have some observations on the result:
Plus: in the correlation table we can see that the number of students who have taken the AP test and the total number of AP exams also correlate with SAT results (though not very strongly, ~0.25). Now we will explore the relation of these fields with the SAT result to see whether there are any interesting insights. Because total_enrollment is highly correlated with sat_scores, we don't want it to bias the results; instead, we'll look at the percentage of students in each school who took at least one AP exam.
# Create a column with percentage of student who take at least 1 AP test:
combine.loc[:,'ap_percent'] = combine['AP Test Takers']/combine['total_enrollment']
# Comput the correlation of this field to SAT result:
corr_2 = combine.corr()['ap_percent']['sat_scores']
print('Correlation of percentage student taken AP test with SAT results is {}'.format(corr_2))
# Plot the relation graph of AP test takers percent to SAT result:
combine.plot.scatter(x='ap_percent', y='sat_scores')
plt.xlabel('Percentage of student taken AP test')
plt.ylabel('SAT result (points)')
plt.show()
Following the correlation result, we can consider that the SAT result is negatively affected by the percentage of students who take the AP exam, and we can see something similar here: up to about 20%, SAT results look fine and can reach ~1500 points, but past 20% the SAT results do not get better (except for some outliers). We can suggest that the more students focus on taking many AP tests, the worse their SAT results get, because they spend time getting better at AP and have less time to focus on the SAT.
There are 5 class-size fields: NUMBER OF STUDENTS / SEATS FILLED, NUMBER OF SECTIONS, AVERAGE CLASS SIZE, SIZE OF SMALLEST CLASS, SIZE OF LARGEST CLASS. Let's draw the relation graph of these fields with the SAT result to get an overview first, and then we will dig into a particular field to interpret its meaning.
# Define the column:
col_class = ['NUMBER OF STUDENTS / SEATS FILLED','NUMBER OF SECTIONS','AVERAGE CLASS SIZE',
'SIZE OF SMALLEST CLASS','SIZE OF LARGEST CLASS']
# Define the color filter:
class_corr = correlation[col_class]>0
class_corr_pos = class_corr.map({True:'green', False:'red'})
# Plot:
correlation[col_class].plot.barh(color = class_corr_pos, figsize=(8,7))
plt.xlabel('Correlation with SAT result')
plt.ylabel('Class size factor')
plt.show()
We can see in the graph that SIZE OF LARGEST CLASS has a good relation with the SAT result, suggesting that a school operating at a larger scale will have better SAT results than those with only small classes. Now let's dig into this and create another relation graph, between SIZE OF LARGEST CLASS and sat_scores, to see the relation in detail.
# Scatter plot of SIZE OF LARGEST CLASS vs SAT result:
combine.plot.scatter(x='SIZE OF LARGEST CLASS', y='sat_scores')
plt.xlabel('Amount of largest class')
plt.ylabel('SAT result (points)')
plt.show()
We can see that classes of 30-35 students seem to go with better SAT results, so let's pick out the school names whose largest class size is 30~35 students and whose average SAT score is above 1700.
# Pick up school name with class size >30 and average SAT result is above 1700:
# Create a df that use for both this case and Average Size case
class_sizeM = combine[['SCHOOL NAME', 'SIZE OF LARGEST CLASS','AVERAGE CLASS SIZE', 'sat_scores']]
# Filter:
#task1 : largest class size between 30 and 35
task1 = class_sizeM[class_sizeM['SIZE OF LARGEST CLASS'].between(30,35)]
school = task1.groupby('SCHOOL NAME').agg(np.mean)
school.reset_index(inplace=True)
# Display some school name:
school[['SCHOOL NAME', 'SIZE OF LARGEST CLASS', 'sat_scores']][school['sat_scores']>1700]
As expected, we can see a lot of familiar school names here, with their maximum class size in the range of 30 - 35 students per class. Such a class size is still manageable, yet not so small that the school cannot collect enough fees to develop its teaching and learning conditions.
Now we have an extra question: on average, how many students per class makes for an effective learning environment from which students can get better SAT scores? We will answer this question with the AVERAGE CLASS SIZE field; let's see the relationship between AVERAGE CLASS SIZE and the sat_scores result.
# Scatter plot of AVERAGE CLASS vs SAT result:
combine.plot.scatter(x='AVERAGE CLASS SIZE', y='sat_scores')
plt.xlabel('Amount of student in class (average)')
plt.ylabel('SAT result (points)')
plt.show()
We can immediately find the school names with an average class size from 25 to 30 and an average SAT score above 1700, to confirm one thing: the optimal number for most schools is 25-30 students, and the maximum size of one class can go up to 35 but should not go beyond that, since we can see that the distribution for class sizes greater than 35 is not good at all (which is the obvious, expected result).
# Filter:
#task2 : average class size between 25 and 30
task2 = class_sizeM[class_sizeM['AVERAGE CLASS SIZE'].between(25,30)]
school_2 = task2.groupby('SCHOOL NAME').agg(np.mean)
school_2.reset_index(inplace=True)
# Display some school name:
school_2[['SCHOOL NAME', 'AVERAGE CLASS SIZE', 'sat_scores']][school_2['sat_scores']>1700]
Once again, these are all familiar school names mentioned above, under 'Which schools with a higher female percentage still get better SAT results?' and 'Which schools have fewer hispanic students and get better SAT results?'. If you want to know why, this is one reason: not only do they have a near-optimal number of students per class, they also have effective teaching methods, and some of them screen students when they first join.
CONCLUSION: The SAT result has a good relation with class size, especially for classes of 25 - 35 students.
To evaluate whether a school is a good school, we first need to define 'good'; here it means a school whose students get good SAT results. We can use the data summarized by borough: sort the boroughs by their average SAT scores and draw a bar plot to see the details.
## Call back the result:
task3 = boroughs[['borough','sat_scores']].sort_values(by='sat_scores', ascending=False)
print(task3)
## Generate bar plot:
plt.barh(task3['borough'], task3['sat_scores'], height=0.5)
plt.xlabel('SAT result (points)')
plt.ylabel('Borough')
plt.show()
Before taking further steps, we want to be clear about what a high SAT score is. So let's draw a histogram with a line at the most common SAT result; from there we can see what mark we can use to decide that someone got a high SAT score.
combine['sat_scores'].describe()
# Define some measurement unit:
q1, q3 = np.quantile(combine['sat_scores'],[.25, .75])
sat_mode = combine['sat_scores'].mode()[0]
print('IQR={}'.format(q3-q1))
# Plot histogram of SAT score:
plt.hist(combine['sat_scores'], density=True, label='SAT result', histtype='step')
# combine['sat_scores'].plot.kde()
plt.axvline(sat_mode, color='black', label='Most of SAT score')
plt.axvline(q1, color='red', label='Min limit', ls='dashed')
plt.axvline(q3, color='red', label='Max limit', ls='dashed')
plt.xlabel('SAT scores (points)')
plt.ylabel('Density')
plt.legend()
plt.title('Distribution of SAT score')
plt.show()
From the distribution of SAT scores in the graph above, we can see that we need to clean the SAT score data: the most common SAT score in this data is zero (0), which is impossible for a real-world SAT result. To do this, we can drop all records with SAT score = 0 (these zeros can appear when filling in missing values).
# Drop the records with SAT score = 0
combine = combine[combine['sat_scores']>0]
# Check the distribution again:
combine['sat_scores'].describe()
# Plot the graph again:
# Define some measurement unit:
q1, q3 = np.quantile(combine['sat_scores'],[.25, .75])
sat_mode = combine['sat_scores'].mode()[0]
print('IQR={}'.format(q3-q1))
print('Most of SAT scores = {}'.format(sat_mode))
kurtosis = combine['sat_scores'].kurt()
skewness = combine['sat_scores'].skew()
print('Kurtosis of distribution = {}'.format(kurtosis))
print('Skewness of distribution = {}'.format(skewness))
# Plot histogram of SAT score:
plt.hist(combine['sat_scores'], density=True, label='SAT result', histtype='step')
combine['sat_scores'].plot.kde()
plt.axvline(sat_mode, color='black', label='Most of SAT score')
plt.axvline(q1, color='green', label='Min limit', ls='dashed')
plt.axvline(q3, color='green', label='Max limit', ls='dashed')
plt.xlabel('SAT scores (points)')
plt.ylabel('Density')
plt.xlim(combine['sat_scores'].min(), combine['sat_scores'].max())
plt.legend()
plt.title('Distribution of SAT score')
plt.show()
From the statistical metrics we can see that:
For convenience, we will add a line marking the score that beats 85% of the rest; reaching it gives the best chance of a college admission pass and makes a big difference compared with the rest.
# Define some measurement unit:
q1, q3 = np.quantile(combine['sat_scores'],[.25, .75])
percen_85 = np.percentile(combine['sat_scores'],85)
sat_mode = combine['sat_scores'].mode()[0]
print('The SAT point that beat over 85% the rest = {}'.format(percen_85))
print('Most of SAT scores = {}'.format(sat_mode))
# Plot histogram of SAT score:
plt.hist(combine['sat_scores'], density=True, label='SAT result', histtype='step')
combine['sat_scores'].plot.kde()
plt.axvline(sat_mode, color='black', label='Most of SAT score')
plt.axvline(q1, color='green', label='Min limit', ls='dashed')
plt.axvline(q3, color='green', label='Max limit', ls='dashed')
plt.axvline(percen_85, color='red', label='Most of 85%', ls='dashed')
plt.xlabel('SAT scores (points)')
plt.ylabel('Density')
plt.xlim(combine['sat_scores'].min(), combine['sat_scores'].max())
plt.legend()
plt.title('Distribution of SAT score')
plt.show()
We're done; from the graph we can see that a score beating 85% of the rest means reaching 1364 SAT points. We will take this result to filter the schools in each borough and see the ratio of schools that reach this level.
# Filter the school from each borough:
schl_borg = combine.groupby('borough')
# Get the school for each borough:
data_group = {}
key = combine['borough'].unique()
for item in key:
    data_group[item] = schl_borg.get_group(item) # get the group for the corresponding borough
#Check:
for k, v in data_group.items():
    print(k)
# Count the original number of schools in each borough:
count_ori = []
for item in key:
    count = data_group[item]['SCHOOL NAME'].count()
    count_ori.append(count)
# Check:
count_ori
# Filter the school from each borough with SAT score >1364 points
schl_borg_2 = combine.groupby('borough').apply(lambda x: x[x['sat_scores']>1364])
# Get the school for each borough:
data_group_2 = {}
key = combine['borough'].unique()
for item in key:
    data_group_2[item] = schl_borg_2[schl_borg_2['borough']==item] # get the group for the corresponding borough
#Check:
for k, v in data_group_2.items():
    print(k)
# Count the schools with high SAT scores in each borough:
count_after = []
for item in key:
    count = data_group_2[item]['SCHOOL NAME'].count()
    count_after.append(count)
# Check:
print(count_after)
# Get the ratio of school which above 1364 SAT points:
result = []
for i, j in zip(count_after, count_ori):
    ratio = (i/j)*100
    result.append(ratio)
result
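As a side note, the same per-borough ratios can be computed more concisely with a single groupby; a minimal sketch that should reproduce the values in result (up to ordering):
## Concise alternative: share of schools per borough with SAT scores above 1364 points
ratio_alt = combine.groupby('borough')['sat_scores'].apply(lambda s: (s > 1364).mean() * 100)
print(ratio_alt)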
We now have both the total number of schools in each borough and the number of schools in each borough that meet the SAT score condition. Now we will plot 2 graphs: one describes the percentage of schools in each borough that meet the SAT score condition, and the other shows the number of schools before and after filtering by the SAT score condition.
# Set dataframe for the result:
df3 = pd.DataFrame(index=key, data={'Original':count_ori,
'With high SAT score':count_after,
'borough_fullname':['Manhattan','The Bronx','Brooklyn','Queens','Staten Island']})
# Combine plot:
plt.figure(figsize=(20,8))
plt.subplot(1,2,2)
#Statistic plot: which school is 'good'?
plt.barh(key, result, height = 0.3)
plt.xlabel('Percentage of school follow SAT scores (%)')
plt.ylabel('Borough')
plt.yticks(ticks=[4,3,2,1,0],labels=['Staten Island','Queens','Brooklyn','Bronx','Manhattan'])
plt.title('Percentage of schools amount in each borough\n which have high SAT score')
plt.subplot(1,2,1)
#Plot graph to see the different of each borough:
plt.barh(y=df3['borough_fullname'], width=df3['Original'],height=0.3, label='Original')
plt.barh(y=df3['borough_fullname'], width=df3['With high SAT score'],align='edge',height=-0.3, label='With high SAT score')
plt.xlabel('Number of schools')
plt.ylabel('Borough')
plt.yticks(ticks=[4,3,2,1,0],labels=['Staten Island','Queens','Brooklyn','Bronx','Manhattan'])
plt.legend()
plt.title('Distribution of number of school in each borough\nfollowed by SAT scores')
plt.show()
CONCLUSION: We can see a lot of things here:
OVERALL CONCLUSION: