The results of the standardized SAT exam are used by US colleges to decide which high school graduates to admit.
Understanding whether any unintended patterns or mechanisms discriminate against certain groups of students is an important step toward making the US education system more equal - something the US Department of Education calls a "formidable challenge".
To be an informed citizen on this matter (and learn some Python along the way), I want to know:
This analysis will use descriptive statistics (e.g. mean, correlations, z-score) and data visualizations (e.g. scatter plots, bar charts, maps) to explore the following datasets on NYC high schools:
The ETL portion of this analysis was executed based on dataquest instructions.
Additional dataset: Property prices by NYC neighborhood

___
The main findings are:
Notes: The data is generally quite outdated (in part from 2011). SAT scores are not the only relevant parameter for evaluating the quality of education.

___
# standard libs from dq and additional utilities (PEP-8 made me do it in this order)
from tqdm.notebook import tqdm_notebook # progress bar for time-consuming geocoding
import warnings # used here to suppress FutureWarning
import pandas as pd
import numpy
import re
import matplotlib.pyplot as plt
import seaborn as sns
# additional libs used for this project
from sodapy import Socrata # api calls
import geopy # reverse geocoding
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import plotly as py # interactive plotting
import plotly.graph_objs as go
import cufflinks as cf # interactive plotting made easy (pandas style)
# Using plotly + cufflinks in offline mode
from plotly.offline import iplot, init_notebook_mode
cf.go_offline(connected=True)
init_notebook_mode(connected=True)
# jupyter command to show plots in notebook
%matplotlib inline
# setting pandas display options
pd.options.display.float_format = '{:,.3f}'.format
pd.set_option("display.max_columns", 20)
#pd.set_option("display.max_rows", 20)
pd.set_option("display.max_colwidth", 50)
# read separate input files into a single dictionary
# csv-files
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for f in data_files:
d = pd.read_csv("schools/{0}".format(f))
data[f.replace(".csv", "")] = d
# survey data, different format
all_survey = pd.read_csv("schools/survey_all.txt",
delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt",
delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:, survey_fields]
data["survey"] = survey
# add DBN columns as unique identifier in datasets where they are not yet present
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].astype(str).str.pad(2,fillchar="0")
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
# sum the three SAT section averages into a total score
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
# extracting latitude and longitude to separate columns
pattern = r"\((.+),(.+)\)"
data["hs_directory"][["lat","lon"]] = data["hs_directory"]["Location 1"].str.extract(pattern).astype(float)
# convert AP scores to numeric
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
# selecting the classes relevant for SAT-analysis (high school, general education)
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
# compute average class size by school across all subjects
data["class_size"] = pd.pivot_table(class_size, values="AVERAGE CLASS SIZE", index="DBN").reset_index()
# select most recent data available in "demographics" and "graduation" dataset
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
# initiate combined dataset with our main variable we want to investigate
combined = data["sat_results"]
# left joins for somewhat sparse datasets, because we do not want to lose any SAT info
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
# inner joins for the other datasets, which potentially contain very relevant explanatory features.
# if these features are not available for a high school, we want to exclude that school
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
combined = combined.merge(data[m], on="DBN", how="inner")
# missing values in numeric columns are filled with the column mean, remaining gaps with zeros
combined = combined.fillna(combined.mean()) # potential lever to improve the quality of the analysis, not in scope of this learning exercise
combined = combined.fillna(0)
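The mean imputation above is exactly the lever mentioned in the comment. As a minimal sketch of one alternative (illustrative only - at this point in the notebook the gaps are already filled, and it assumes the "boro" column merged in from hs_directory):
# hypothetical refinement (not applied here): fill numeric gaps with the median
# of the school's borough instead of the global mean
num_cols = combined.select_dtypes(include="number").columns
alt = combined.copy()
alt[num_cols] = alt[num_cols].fillna(alt.groupby("boro")[num_cols].transform("median"))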
# Add a school district column for later analysis
combined["school_dist"] = combined["DBN"].str[:2]
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
# compute all correlations with SAT scores
correlations = combined.corr()["sat_score"]
# build a bar chart to visualize the correlation of survey responses with SAT scores
ax = correlations[survey_fields].sort_values().plot(kind="bar",
figsize=(8, 6),
title="Correlation of survey responses and SAT-scores")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
for p in ax.patches:
ax.annotate(str(round(p.get_height(), 2)),
(p.get_x()+0.1, max(p.get_height(), 0)+0.01),
rotation=90
)
ax.axhline(y=0.0, color='black', linestyle='-', linewidth=1)
plt.show()
Safety / Respect ("saf_...") and students academic expectations ("aca_s_11") are correlated with good SAT scores.
This intuitively makes sense - if students feel safe and have trust in themselves / sets higher expectations, they (can) focus on achieving good academic results.
Further details:
# build a scatter plot to visualize correlation between SAT and Safety and Respect Rating (students)
ax = combined.plot(kind="scatter", x="saf_s_11", y="sat_score",
figsize=(6, 4),
title="Correlation of students Safety and Respect Score and SAT-scores, by school")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()
Below a Safety / Respect rating of roughly 6.5, there are no schools with high SAT scores. To me this looks like unsafe environments preclude excellent scores.
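A quick sanity check of this reading (a sketch against the combined frame built above):
# best SAT average among schools that students rate below ~6.5 on safety
low_saf = combined[combined["saf_s_11"] < 6.5]
print(low_saf["sat_score"].max())
# the three strongest schools in that low-safety group
print(low_saf.nlargest(3, "sat_score")[["SCHOOL NAME", "saf_s_11", "sat_score"]])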
# build second scatter plot to visualize correlation between SAT and Safety and Respect Rating (students)
by_district = combined.groupby("school_dist").mean().reset_index()
ax = by_district.plot(kind="scatter", x="saf_s_11", y="sat_score",
figsize=(6, 4),
title="Correlation of students Safety and Respect Score and SAT-scores, by school district")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()
The relationship is also visible at the aggregate level - but with a wider spread of SAT scores as the Safety / Respect Score increases. Again, this could be an indication that safety is a necessary but not a sufficient condition for high SAT scores.
I could not install Basemap (soon to be deprecated) to generate maps, so I used Folium instead.
Useful links:
# code for map
import folium
import branca.colormap as cm
# set a simple baselayer and zoom-in on NY
school_map = folium.Map(location=[40.693943, -73.985880],
#width=500,
#height=800,
zoom_start=10,
tiles='CartoDB positron')
# generate a title above the map
# from https://github.com/python-visualization/folium/issues/1202
map_title = "NY highschools: Safety / Respect Score and avg. enrollment by school district"
title_html = '''
<h3 align="center" style="font-size:20px"><b>{}</b></h3>
'''.format(map_title)
school_map.get_root().html.add_child(folium.Element(title_html))
# select column names of the viz parameters
size_parameter = "total_enrollment"
color_parameter = "saf_s_11"
# generate a colormap to shade the circles on the map
# from https://stackoverflow.com/questions/56876620/unsure-how-to-use-colormap-with-folium-marker-plot
start_col_val = by_district[color_parameter].min()
end_col_val = by_district[color_parameter].max()
start_col_hex = "#d01c8b" # used https://colorbrewer2.org/
end_col_hex = "#4dac26"
colormap = cm.LinearColormap(colors=[start_col_hex,
#"white",
end_col_hex],
vmin=start_col_val,
vmax=end_col_val
)
colormap.caption = "Safety / Respect Score, students response"
# create a circle for each school district
for index, row in by_district.iterrows():
folium.Circle(location=[row["lat"],row["lon"]],
color=colormap(row[color_parameter]),
fill_color=colormap(row[color_parameter]),
fill_opacity=0.75,
radius=row[size_parameter],
weight=2,
tooltip="District "+row["school_dist"]+
" <br> Avg. Safety / Respect Score, students response (color): {:.1f}".format(row[color_parameter])+
" <br> Avg. enrollment per school (size): {:.0f}".format(row[size_parameter])
).add_to(school_map)
# add a legend
school_map.add_child(colormap)
# display the map
school_map
Especially students in Northern Brooklyn and parts of the Bronx report below average Safety / Respect Scores.
This somewhat aligns with a New York Times article from around that time - the piece reports that precincts with the highest crime rates are in these areas.
Up to date stats by precinct and crime type can be found here.
# build bar chart to visualize correlation between SAT scores and the racial makeup of students
race_field = ["asian_per","black_per","hispanic_per","white_per"]
ax = correlations[race_field].sort_values().plot(kind="bar",
figsize=(8, 8),
title="Correlation of race to SAT-scores")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
for p in ax.patches: # set data point labels
ax.annotate(str(round(p.get_height(), 2)),
(p.get_x() + 0.2, max(p.get_height(), 0)+0.02),
rotation=90
)
ax.axhline(y=0.0, color='black', linestyle='-', linewidth=1)
plt.show()
Trying out plotly and cufflinks to enable easier data discovery through interactive charts - inspiration, examples, more examples.
with warnings.catch_warnings(): # to catch warning about np.module in Python 3.8.2.
warnings.filterwarnings("ignore",category=FutureWarning)
combined.iplot(
x='hispanic_per',
y='sat_score',
categories='boro',
text="SCHOOL NAME",
vline= {"x":combined["hispanic_per"].mean(),
"color":"#000000",
"dash": "dash"
},
hline={"y":combined["sat_score"].mean(),
"color":"#000000",
"dash": "dash"
},
xTitle='Share hispanics, %',
yTitle='Avg. SAT score',
title='Share of hispanic students vs. SAT score by school <br> dashed lines: average of entire dataset'
)
print("Recommendation: choose option to 'show closest data point on hover' in top right corner")
Recommendation: choose option to 'show closest data point on hover' in top right corner
Almost all schools in the Bronx (double-click "Bronx" in the legend) exhibit above-average shares of Hispanic students and below-average SAT scores. There might be confounding factors (crime or poverty) leading to the low SAT scores of Hispanic students in NYC. Sadly, the only elite schools in the Bronx (Lehman, Bronx High School of Science) have very few Hispanic students.
Schools with very low average SAT scores and very high shares of Hispanic students specialize in students with limited English skills (these schools are often named "International" etc.). It is not surprising that they score below average, because the SAT tests both reading and writing.
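A rough way to check this explanation (a sketch - it assumes these schools literally carry "international" in their "SCHOOL NAME"):
# compare schools with "international" in their name to the full dataset
intl = combined[combined["SCHOOL NAME"].str.contains("international", case=False, na=False)]
print(intl[["sat_score", "hispanic_per", "ell_percent"]].mean())
print(combined[["sat_score", "hispanic_per", "ell_percent"]].mean())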
The high schools with high SAT scores and low shares of Hispanic students are all highly selective public schools with excellent reputations. I am interested in seeing the racial makeup of all these elite schools (avg. SAT > 1,800) compared to all schools.
# build stacked bar chart of the racial makeup of all schools vs. elite schools
num_race = ['asian_num', 'black_num', 'hispanic_num', 'white_num']
high_sat = combined["sat_score"] > 1800
race_dist = pd.DataFrame(data=[combined.loc[:, num_race].sum()/combined.loc[:, num_race].sum().sum(),
combined.loc[high_sat, num_race].sum()/combined.loc[high_sat, num_race].sum().sum()],
index=["All schools", "Elite schools"])*100
race_dist.columns = ["Asian","Black","Hispanic", "White"]
race_dist.astype(int).iplot(kind="bar", barmode="stack", yTitle='Share of all students in %',
title='Racial distribution NYC high schools, <br> all vs. elite (SAT avg. > 1,800)')
Black and Hispanic students are clearly underrepresented in NYC's elite high schools - white and especially Asian students are overrepresented compared to all NYC high schools. This is the case even though there are elite schools in all five boroughs.
# build bar chart to visualize correlation between SAT and gender of students
gender = ["male_per", "female_per"]
ax = correlations[gender].sort_values().plot(kind="bar",
figsize=(8, 8),
title="Correlation of gender to SAT-scores")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
for p in ax.patches: # set data point labels
ax.annotate(str(round(p.get_height(), 2)),
(p.get_x() + 0.2, max(p.get_height(), 0)+0.002),
rotation=0
)
ax.axhline(y=0.0, color='black', linestyle='-', linewidth=1)
plt.show()
Schools exhibit higher SAT scores when the share of female students is higher, but the correlation is pretty weak (~0.1).
It is generally sufficient to plot only n-1 dummies of a categorical variable. Gender has only two possible values here, so the bar for female will always be the mirror image of the one for male and vice versa.
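This symmetry can be verified directly from the correlations computed earlier - because male_per and female_per sum to 100, their correlations with sat_score are exact mirror images:
# the two gender correlations mirror each other (female_per = 100 - male_per)
print(correlations[["male_per", "female_per"]])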
with warnings.catch_warnings(): # to catch FutureWarning about np.module being used in pandas
warnings.filterwarnings("ignore",category=FutureWarning)
combined.iplot(
x='female_per',
y='sat_score',
categories='boro',
text="SCHOOL NAME",
vline= {"x":combined["female_per"].mean(),
"color":"#000000",
"dash": "dash"
},
hline={"y":combined["sat_score"].mean(),
"color":"#000000",
"dash": "dash"
},
xTitle='Share females, %',
yTitle='Avg. SAT score',
title='Share of female students vs. SAT score by school <br> dashed lines: average of entire dataset'
)
print("Recommendation: choose option to 'show closest data point on hover' in top right corner")
Recommendation: choose option to 'show closest data point on hover' in top right corner
# built function "describe_plus" to help with investigation of subsets
# helper functions
def describe_nan(df):
"""
Return transposed pandas describe with share of nan-values
Parameters:
df = pandas.DataFrame
"""
df_des = df.describe().transpose()
df_des["share_nan"] = 1-(df_des["count"]/df.shape[0])
df_des_col_order = ["share_nan", "count", "mean",
"std", "min", "25%", "50%", "75%", "max"]
df_des = df_des[df_des_col_order]
return df_des
def describe_frequent(df, n=5):
"""
Returns dataframe with most common values
Parameters:
df = pandas.DataFrame
n = Optional number of most common values
"""
topn_df = pd.DataFrame()
col_names = df.columns
for col in col_names:
col_name = df[col].name
top_n = df[col].value_counts(dropna=False)[
:n].reset_index().reset_index()
top_n_norm = df[col].value_counts(dropna=False, normalize=True)[
:n].reset_index().reset_index()
top_n["norm"] = (top_n_norm[col_name]*100).astype(int)
top_n["level_0"] = top_n["level_0"]+1
top_n["string"] = "value: "+top_n["index"].astype(
str)+" | count: "+top_n[col_name].astype(str)+" | norm %: "+top_n["norm"].astype(str)
result = top_n["string"]
result.rename(col_name, inplace=True)
topn_df = pd.concat([topn_df, result], axis=1)
topn_df.index = "Top"+(topn_df.index + 1).astype(str)
return topn_df
# main function
def describe_plus(df, subset=None, output=None,
figsize=(15, 20), cmap="RdBu"):
'''
Returns a dataframe of descriptive strings.
If only a dataframe is passed, the output of pandas' 'describe' method is extended with the share of missing values in %.
If a subset is also passed, the subset is compared to the entire dataset.
Parameters:
df: Required pandas.DataFrame
subset: Optional pandas.DataFrame with identical columns to compare to df
output: Optional, if only one df is passed, "non_num" returns df with all
non numeric columns and adds descriptive stats as first rows.
If subset is passed, "clean" returns a DataFrame without styling.
If subset is passed, "sorted" returns a styled DataFrame,
sorted by z-score.
cmap: Optional parameter, default colormap "RdBu"
Returns:
(styled) pandas.Dataframe
'''
if subset is None:
if output == "non_num":
# get only the numeric columns from the dataframe
all_cols = df.columns
# describe only works on numerical columns
num_cols = describe_nan(df).index
non_num_cols = [col for col in all_cols if col not in num_cols]
df_non_num = df[non_num_cols]
# calc some stats on non-numeric columns
df_stats = pd.DataFrame(columns=non_num_cols)
df_stats.loc["Record count"] = df_non_num.shape[0]
df_stats.loc["Count Nan"] = df_non_num.isna().sum()
df_stats.loc["Share Nan"] = df_non_num.isna().sum() / \
df_non_num.shape[0]
df_stats.loc["Count unique"] = df_non_num.nunique()
# add most common values
df_stats = pd.concat([df_stats, describe_frequent(df_non_num)])
# prepend stats to df
df_non_num_up = pd.concat([df_stats, df_non_num])
return df_non_num_up
else:
return describe_nan(df).style.background_gradient(subset="share_nan", low=0, high=1, cmap="Blues")
else:
# calculate descriptive stats
subset_desc = describe_nan(subset)
df_desc = describe_nan(df)
# calculate percent change (used to be for sns.heatmap)
hm_data = (subset_desc/df_desc)
hm_data["z-score, subset mean"] = (subset_desc["mean"] -
df_desc["mean"])/(df_desc["std"])
hm_data["count"] = 0 # no color for count
hm_data = hm_data[['count', "share_nan", 'z-score, subset mean',
'mean', 'std', 'min', '25%', '50%', '75%', 'max']]
# format main annotation values
main_values = pd.DataFrame()
main_values = pd.concat(
[main_values, subset_desc.loc[:, "mean":"max"]], axis=1)
main_values = main_values.applymap('{:,.2f}'.format)
main_values["z-score, subset mean"] = (
subset_desc["mean"]-df_desc["mean"])/(df_desc["std"])
main_values["count"] = subset_desc['count'].apply('{:,.0f}'.format)
main_values["share_nan"] = (
subset_desc['share_nan']*100).apply('{:,.0f}%'.format)
main_values = main_values[[
'count', "share_nan", 'z-score, subset mean', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']]
# values of subset are compared to main dataset in annotation as follows:
# count (share of total records)
# share_Nan (% compared to entire dataset)
# mean (% compared to entire dataset)
# z-score of subset-mean (no comparison)
# std (% compared to entire dataset)
# min,25%,50%,75%,max (percentile rank of entire dataset)
# dataframe has the same columns, but z-score is also separate and shaded
# percentages are formatted as integers with sign
# pct rank: copy output from description method to original dataframe
subset_desc_pct_rank = pd.concat(
[df, subset_desc.loc[:, "min":"max"].transpose()]).dropna(axis=1, subset=["min"])
# pct rank: make a column with percentile rank for every numeric column
list_col_pct = []
for col in subset_desc_pct_rank:
col_pct = col+"_pct_rank"
list_col_pct.append(col_pct)
subset_desc_pct_rank[col_pct] = subset_desc_pct_rank[col].rank(
pct=True)
# pct rank: discard main dataset, apply proper names and shape
subset_desc_pct_rank = (
subset_desc_pct_rank.loc["min":"max", list_col_pct].transpose()*100).astype(int)
subset_desc_pct_rank.index = subset_desc_pct_rank.index.str.replace(
"_pct_rank", "")
subset_desc_pct_rank.columns = subset_desc_pct_rank.columns+"_pct_rank"
# format all values going into parenthesis
paran = pd.DataFrame()
paran["count"] = ((subset_desc/df_desc)["count"]
* 100).apply('{:,.0f}%'.format)
paran["share_nan"] = hm_data["share_nan"].apply('{:,.0f}%'.format)
paran["z-score, subset mean"] = "-"
paran["mean"] = ((((subset_desc/df_desc)["mean"])-1)
* 100).apply('{:+,.0f}%'.format)
paran["std"] = ((((subset_desc/df_desc)["std"])-1)
* 100).apply('{:+,.0f}%'.format)
paran = pd.concat([paran, subset_desc_pct_rank], axis=1)
paran.columns = paran.columns.str.replace("_pct_rank", "")
# build the annotation
annot = main_values.astype(str)+" ("+paran.astype(str)+")"
annot["z-score, subset mean"] = main_values["z-score, subset mean"]
# decided to redo column headers
new_col_headers = {'count': "count (share of df)",
"share_nan": "share_nan (change to df)",
'z-score, subset mean': "z-score, subset mean",
'mean': " mean (change to df)",
'std': "std (change to df)",
'min': "min (pct rank in df)",
'25%': "25% (pct rank in df)",
'50%': "50% (pct rank in df)",
'75%': "75% (pct rank in df)",
'max': "max (pct rank in df)"
}
annot.rename(inplace=True, columns=new_col_headers)
if output == "clean":
return annot
elif output == "sorted":
return annot.sort_values(by="z-score, subset mean").style.background_gradient(
subset="z-score, subset mean", cmap="RdBu")
else:
return annot.style.background_gradient(
subset="z-score, subset mean", cmap="RdBu")
# compute new feature: share of students that took Advanced Placement (AP) exams to earn college credit
combined["ap_per"] = combined["AP Test Takers "]/combined["total_enrollment"]*100
# build scatter plot investigating correlation between share of AP-Test takers and SAT-score
with warnings.catch_warnings(): # to catch FutureWarning about np.module being used in pandas
warnings.filterwarnings("ignore",category=FutureWarning)
combined.iplot(
x='ap_per',
y='sat_score',
categories='boro',
text="SCHOOL NAME",
vline= {"x":combined["ap_per"].mean(),
"color":"#000000",
"dash": "dash"
},
hline={"y":combined["sat_score"].mean(),
"color":"#000000",
"dash": "dash"
},
xTitle='AP-Test Takers, %',
yTitle='Avg. SAT score',
title='Share of AP-Test takers vs. SAT score by school <br> dashed lines: average of entire dataset'
)
print("Recommendation: choose option to 'show closest data point on hover' in top right corner")
Recommendation: choose option to 'show closest data point on hover' in top right corner
Advanced Placement (AP) is a program in the United States and Canada created by the College Board which offers college-level curricula and examinations to high school students.
I see three interesting patterns in our data:
All in all, it looks like the share of AP test takers is not a strong indicator of SAT performance.
See some detailed analysis below.
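Before diving into subsets, the overall linear relationship can be checked in one line (using the combined frame from above):
# overall Pearson correlation between the AP share and SAT score
print(combined["ap_per"].corr(combined["sat_score"]))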
# select only schools with high (but not extremely high) shares of AP test takers and low SAT scores
high_ap_low_sat = combined[(combined["ap_per"]>25)&
(combined["ap_per"]<50)&
(combined["sat_score"]<1300)]
describe_plus(combined,subset=high_ap_low_sat,output="sorted")
 | count (share of df) | share_nan (change to df) | z-score, subset mean | mean (change to df) | std (change to df) | min (pct rank in df) | 25% (pct rank in df) | 50% (pct rank in df) | 75% (pct rank in df) | max (pct rank in df) |
---|---|---|---|---|---|---|---|---|---|---|
AVERAGE CLASS SIZE | 105 (29%) | 0% (nan%) | -0.560828 | 22.80 (-8%) | 3.40 (-5%) | 10.00 (0) | 20.83 (11) | 23.17 (32) | 25.14 (53) | 30.94 (96) |
SAT Math Avg. Score | 105 (29%) | 0% (nan%) | -0.525060 | 385.63 (-8%) | 29.36 (-55%) | 312.00 (0) | 367.00 (15) | 385.00 (33) | 411.00 (52) | 441.00 (76) |
sat_score | 105 (29%) | 0% (nan%) | -0.513529 | 1,131.92 (-7%) | 80.00 (-55%) | 913.00 (0) | 1,080.00 (13) | 1,143.00 (33) | 1,199.00 (52) | 1,273.00 (77) |
SAT Critical Reading Avg. Score | 105 (29%) | 0% (nan%) | -0.510077 | 374.80 (-7%) | 29.49 (-49%) | 287.00 (0) | 360.00 (13) | 381.00 (32) | 400.00 (53) | 418.00 (75) |
SAT Writing Avg. Score | 105 (29%) | 0% (nan%) | -0.476021 | 371.49 (-7%) | 26.97 (-54%) | 291.00 (0) | 357.00 (15) | 377.00 (36) | 394.00 (55) | 414.00 (76) |
N_s | 105 (29%) | 0% (nan%) | -0.459995 | 291.04 (-51%) | 81.18 (-88%) | 93.00 (2) | 240.00 (17) | 300.00 (29) | 340.00 (42) | 598.21 (78) |
total_enrollment | 105 (29%) | 0% (nan%) | -0.458799 | 396.04 (-50%) | 64.15 (-93%) | 141.00 (0) | 356.00 (17) | 401.00 (28) | 445.00 (44) | 500.00 (58) |
total_students | 105 (29%) | 0% (nan%) | -0.451494 | 409.50 (-47%) | 85.48 (-90%) | 110.00 (0) | 351.00 (18) | 412.00 (30) | 466.00 (47) | 664.00 (76) |
male_num | 105 (29%) | 0% (nan%) | -0.448658 | 189.65 (-53%) | 64.14 (-86%) | 0.00 (1) | 159.00 (18) | 195.00 (33) | 225.00 (47) | 402.00 (78) |
female_num | 105 (29%) | 0% (nan%) | -0.437781 | 206.38 (-47%) | 73.66 (-83%) | 61.00 (1) | 163.00 (17) | 196.00 (31) | 239.00 (47) | 495.00 (83) |
Total Cohort | 105 (29%) | 0% (nan%) | -0.420025 | 99.71 (-49%) | 59.19 (-74%) | 1.00 (0) | 70.00 (16) | 93.00 (33) | 116.00 (53) | 193.87 (75) |
N_t | 105 (29%) | 0% (nan%) | -0.418452 | 23.34 (-41%) | 6.28 (-84%) | 6.00 (0) | 20.00 (23) | 24.00 (39) | 28.00 (55) | 37.00 (75) |
number_programs | 105 (29%) | 0% (nan%) | -0.411742 | 1.14 (-38%) | 0.43 (-75%) | 1.00 (35) | 1.00 (35) | 1.00 (35) | 1.00 (35) | 3.00 (84) |
N_p | 105 (29%) | 0% (nan%) | -0.381221 | 139.54 (-44%) | 80.69 (-72%) | 28.00 (1) | 78.00 (17) | 124.00 (32) | 195.00 (54) | 408.00 (88) |
hispanic_num | 105 (29%) | 0% (nan%) | -0.366238 | 186.80 (-39%) | 105.43 (-67%) | 17.00 (1) | 89.00 (19) | 204.00 (44) | 276.00 (61) | 405.00 (81) |
sped_num | 105 (29%) | 0% (nan%) | -0.360001 | 61.67 (-40%) | 28.84 (-74%) | 0.00 (1) | 49.00 (26) | 65.00 (43) | 80.00 (57) | 125.00 (80) |
white_per | 105 (29%) | 0% (nan%) | -0.333332 | 3.75 (-55%) | 8.02 (-43%) | 0.00 (2) | 0.70 (17) | 1.40 (38) | 3.10 (60) | 60.00 (98) |
white_num | 105 (29%) | 0% (nan%) | -0.318748 | 15.54 (-86%) | 34.60 (-88%) | 0.00 (2) | 3.00 (17) | 5.00 (30) | 12.00 (54) | 245.00 (88) |
asian_num | 105 (29%) | 0% (nan%) | -0.317136 | 23.87 (-82%) | 45.33 (-87%) | 0.00 (1) | 4.00 (19) | 8.00 (33) | 22.00 (54) | 272.00 (88) |
asian_per | 105 (29%) | 0% (nan%) | -0.297371 | 5.77 (-43%) | 10.68 (-27%) | 0.00 (1) | 1.00 (20) | 2.20 (38) | 5.10 (57) | 64.50 (99) |
black_num | 105 (29%) | 0% (nan%) | -0.282365 | 167.32 (-29%) | 102.75 (-58%) | 0.00 (0) | 95.00 (26) | 127.00 (42) | 251.00 (67) | 413.00 (87) |
ell_num | 105 (29%) | 0% (nan%) | -0.168173 | 65.17 (-29%) | 95.56 (-40%) | 0.00 (1) | 18.00 (28) | 30.00 (43) | 58.00 (67) | 377.00 (93) |
saf_t_11 | 105 (29%) | 0% (nan%) | -0.116720 | 7.04 (-1%) | 0.88 (-2%) | 4.40 (0) | 6.50 (23) | 7.20 (50) | 7.60 (70) | 9.00 (98) |
male_per | 105 (29%) | 0% (nan%) | -0.078240 | 48.28 (-2%) | 14.72 (+5%) | 0.00 (1) | 44.20 (25) | 49.80 (49) | 55.30 (76) | 84.30 (98) |
saf_tot_11 | 105 (29%) | 0% (nan%) | -0.062090 | 7.27 (-1%) | 0.57 (-10%) | 5.60 (0) | 6.90 (24) | 7.30 (48) | 7.70 (73) | 8.60 (98) |
saf_s_11 | 105 (29%) | 0% (nan%) | -0.054626 | 6.57 (-1%) | 0.64 (-8%) | 5.20 (1) | 6.10 (25) | 6.60 (51) | 7.00 (72) | 8.50 (99) |
rr_s | 105 (29%) | 0% (nan%) | -0.026024 | 80.53 (-0%) | 14.90 (-4%) | 0.00 (0) | 75.00 (26) | 84.00 (48) | 90.00 (70) | 99.00 (98) |
aca_s_11 | 105 (29%) | 0% (nan%) | -0.022072 | 7.37 (-0%) | 0.42 (-8%) | 6.40 (0) | 7.10 (27) | 7.38 (50) | 7.60 (72) | 9.10 (99) |
lat | 105 (29%) | 0% (nan%) | -0.016571 | 40.74 (-0%) | 0.08 (-0%) | 40.58 (1) | 40.67 (19) | 40.74 (52) | 40.82 (75) | 40.89 (99) |
aca_t_11 | 105 (29%) | 0% (nan%) | -0.015370 | 7.50 (-0%) | 0.96 (+2%) | 4.80 (0) | 6.90 (24) | 7.50 (48) | 8.30 (77) | 9.20 (98) |
Number of Exams with scores 3 4 or 5 | 105 (29%) | 0% (nan%) | -0.011268 | 150.76 (-2%) | 19.43 (-92%) | 11.00 (5) | 153.45 (61) | 153.45 (61) | 153.45 (61) | 153.45 (61) |
AP Test Takers | 105 (29%) | 0% (nan%) | -0.008076 | 127.61 (-1%) | 9.52 (-95%) | 42.00 (24) | 129.03 (63) | 129.03 (63) | 129.03 (63) | 129.03 (63) |
Total Exams Taken | 105 (29%) | 0% (nan%) | -0.008049 | 194.54 (-1%) | 16.17 (-95%) | 53.00 (23) | 197.04 (65) | 197.04 (65) | 197.04 (65) | 197.04 (65) |
com_t_11 | 105 (29%) | 0% (nan%) | 0.003475 | 6.53 (+0%) | 1.10 (-5%) | 3.50 (1) | 5.70 (23) | 6.60 (52) | 7.50 (78) | 8.60 (98) |
zip | 105 (29%) | 0% (nan%) | 0.016941 | 10,733.81 (+0%) | 513.98 (-4%) | 10,002.00 (1) | 10,455.00 (31) | 10,473.00 (51) | 11,208.00 (67) | 11,691.00 (99) |
eng_t_11 | 105 (29%) | 0% (nan%) | 0.028632 | 7.02 (+0%) | 1.02 (-1%) | 4.10 (1) | 6.40 (28) | 7.10 (53) | 7.80 (78) | 9.10 (98) |
eng_s_11 | 105 (29%) | 0% (nan%) | 0.029803 | 6.64 (+0%) | 0.56 (+3%) | 5.50 (1) | 6.30 (28) | 6.60 (50) | 7.00 (76) | 8.60 (99) |
saf_p_11 | 105 (29%) | 0% (nan%) | 0.044588 | 8.21 (+0%) | 0.53 (-11%) | 6.30 (1) | 7.90 (28) | 8.30 (54) | 8.60 (76) | 9.80 (99) |
aca_tot_11 | 105 (29%) | 0% (nan%) | 0.046239 | 7.59 (+0%) | 0.49 (-5%) | 6.10 (0) | 7.30 (28) | 7.60 (52) | 8.00 (80) | 8.60 (97) |
lon | 105 (29%) | 0% (nan%) | 0.056514 | -73.92 (-0%) | 0.06 (-14%) | -74.16 (0) | -73.96 (29) | -73.92 (57) | -73.89 (76) | -73.73 (99) |
eng_tot_11 | 105 (29%) | 0% (nan%) | 0.070815 | 7.08 (+1%) | 0.53 (-2%) | 5.70 (1) | 6.80 (30) | 7.10 (56) | 7.50 (81) | 8.30 (99) |
rr_p | 105 (29%) | 0% (nan%) | 0.075437 | 40.40 (+4%) | 19.66 (+2%) | 9.00 (1) | 25.00 (26) | 39.00 (54) | 51.00 (76) | 90.00 (98) |
female_per | 105 (29%) | 0% (nan%) | 0.078164 | 51.71 (+2%) | 14.72 (+5%) | 15.70 (2) | 44.70 (23) | 50.20 (51) | 55.80 (74) | 100.00 (99) |
com_s_11 | 105 (29%) | 0% (nan%) | 0.096943 | 6.15 (+1%) | 0.50 (-6%) | 5.20 (1) | 5.80 (32) | 6.10 (56) | 6.40 (76) | 8.40 (99) |
hispanic_per | 105 (29%) | 0% (nan%) | 0.104693 | 46.68 (+6%) | 24.99 (+0%) | 4.50 (1) | 22.70 (27) | 53.80 (57) | 66.00 (79) | 100.00 (99) |
com_tot_11 | 105 (29%) | 0% (nan%) | 0.110361 | 6.81 (+1%) | 0.50 (-10%) | 5.30 (0) | 6.50 (32) | 6.80 (55) | 7.20 (78) | 8.10 (99) |
eng_p_11 | 105 (29%) | 0% (nan%) | 0.112140 | 7.57 (+1%) | 0.43 (-5%) | 6.70 (2) | 7.30 (30) | 7.60 (59) | 7.80 (76) | 9.40 (99) |
ell_percent | 105 (29%) | 0% (nan%) | 0.166164 | 16.67 (+25%) | 24.10 (+20%) | 0.00 (1) | 4.90 (36) | 8.80 (56) | 14.40 (75) | 92.90 (99) |
aca_p_11 | 105 (29%) | 0% (nan%) | 0.181744 | 7.90 (+1%) | 0.45 (-10%) | 6.80 (1) | 7.60 (34) | 7.90 (57) | 8.20 (78) | 9.50 (99) |
sped_percent | 105 (29%) | 0% (nan%) | 0.217671 | 15.84 (+11%) | 7.20 (-1%) | 0.00 (1) | 13.20 (37) | 17.30 (63) | 20.80 (82) | 28.40 (98) |
rr_t | 105 (29%) | 0% (nan%) | 0.220547 | 86.91 (+4%) | 15.59 (-6%) | 28.00 (0) | 83.00 (38) | 92.00 (59) | 97.00 (78) | 100.00 (90) |
com_p_11 | 105 (29%) | 0% (nan%) | 0.236069 | 7.74 (+2%) | 0.46 (-11%) | 6.60 (2) | 7.50 (40) | 7.80 (62) | 8.00 (77) | 9.20 (99) |
black_per | 105 (29%) | 0% (nan%) | 0.253546 | 43.16 (+18%) | 26.29 (+1%) | 0.00 (0) | 25.60 (39) | 33.20 (57) | 64.20 (80) | 93.50 (99) |
frl_percent | 105 (29%) | 0% (nan%) | 0.360121 | 72.03 (+9%) | 10.48 (-36%) | 35.40 (7) | 66.40 (40) | 73.80 (63) | 78.50 (78) | 100.00 (99) |
ap_per | 105 (29%) | 0% (nan%) | 0.710746 | 32.95 (+57%) | 5.30 (-68%) | 25.28 (60) | 28.74 (68) | 31.94 (77) | 35.94 (84) | 47.09 (93) |
schoolyear | 105 (29%) | 0% (nan%) | nan | 20,112,012.00 (+0%) | 0.00 (+nan%) | 20,112,012.00 (50) | 20,112,012.00 (50) | 20,112,012.00 (50) | 20,112,012.00 (50) | 20,112,012.00 (50) |
fl_percent | 105 (29%) | 0% (nan%) | nan | 0.00 (+nan%) | 0.00 (+nan%) | 0.00 (50) | 0.00 (50) | 0.00 (50) | 0.00 (50) | 0.00 (50) |
grade_span_max | 105 (29%) | 0% (nan%) | nan | 12.00 (+0%) | 0.00 (+nan%) | 12.00 (50) | 12.00 (50) | 12.00 (50) | 12.00 (50) | 12.00 (50) |
expgrade_span_max | 105 (29%) | 0% (nan%) | nan | 12.00 (+0%) | 0.00 (+nan%) | 12.00 (50) | 12.00 (50) | 12.00 (50) | 12.00 (50) | 12.00 (50) |
priority08 | 105 (29%) | 0% (nan%) | nan | 0.00 (+nan%) | 0.00 (+nan%) | 0.00 (50) | 0.00 (50) | 0.00 (50) | 0.00 (50) | 0.00 (50) |
priority09 | 105 (29%) | 0% (nan%) | nan | 0.00 (+nan%) | 0.00 (+nan%) | 0.00 (50) | 0.00 (50) | 0.00 (50) | 0.00 (50) | 0.00 (50) |
priority10 | 105 (29%) | 0% (nan%) | nan | 0.00 (+nan%) | 0.00 (+nan%) | 0.00 (50) | 0.00 (50) | 0.00 (50) | 0.00 (50) | 0.00 (50) |
The schools with above-average shares of AP test takers and low SAT scores achieve poor results across all three parts of the SAT (negative z-scores) - writing being the least below average.
Other noteworthy characteristics these schools have compared to NYC averages are:
# investigating non-numeric features
pd.set_option("display.max_rows", 95)
describe_plus(high_ap_low_sat,output="non_num")[:9].transpose().head()
# describe_plus(subset,output="non_num")
print("results not shown in final doc to conserve vertical space")
results not shown in final doc to conserve vertical space
For the subset with low average SAT scores but above-average AP shares, the "advancedplacement_courses" are very often "English Literature..." or "English Language and Composition".
I am surprised that these courses did not prepare students better in reading and writing. Let's look at what schools with high SAT scores do differently in terms of the AP courses they offer to students.
# extract and display a list of the most common AP courses and plot how often they appear
# first select other schools for comparison:
high_ap_high_sat = combined[(combined["ap_per"]>25)&
(combined["ap_per"]<50)&
(combined["sat_score"]>1300)]
# loop through the subsets and extract the relative distribution of AP courses
subsets = {"High AP, low SAT": high_ap_low_sat,
"High AP, high SAT": high_ap_high_sat}
subsets_ap_courses = pd.DataFrame()
for k, v in subsets.items():
ap_courses = v["advancedplacement_courses"].str.split(",",expand=True)
ap_courses_melted = ap_courses.melt()["value"].str.strip().value_counts(normalize=True).sort_values()
ap_courses_melted.name = k
subsets_ap_courses = pd.concat([subsets_ap_courses, ap_courses_melted], axis=1)
ax = (subsets_ap_courses*100).iplot(kind="bar",
#figsize=(10,7),
layout = dict(title="AP classes by schol types, normalized count in %",
xaxis=dict(automargin=True), hovermode="x", height=600))
Schools with high shares of students taking AP tests offer a narrower range of courses, with a focus on history and English literature / language / composition. Calculus is offered, but almost exclusively the AB variant, while the broader BC variant is less covered (info).
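As a quick follow-up check, the calculus rows can be isolated from the distribution computed above (a sketch - it assumes the course names literally contain "Calculus"):
# compare how often each calculus variant appears in the two subsets
print(subsets_ap_courses.filter(like="Calculus", axis=0) * 100)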
Aim of this section: identify "affordable" neighborhoods with good schools.
To make this analysis, we need to add the following information to our dataset:
# read-in property sales data from 'NYC Open Data'-server and clean it
# the dataset and an excellent API documentation can be accessed here:
# https://dev.socrata.com/foundry/data.cityofnewyork.us/5ebm-myj7
client = Socrata("data.cityofnewyork.us", None)
housing = pd.DataFrame.from_records(client.get("5ebm-myj7", limit=6000))
# remove new line characters causing trouble
housing = housing.replace(r'\\n',' ', regex=True)
# assign correct datatypes to numeric columns
num_col = housing.loc[:,"number_of_sales":"year"].columns.to_list()
for c in num_col:
housing[c] = pd.to_numeric(housing[c],errors="coerce")
# according to the data dictionary only three house types exist,
# but the spelling differs a bit between them and needs to be aligned
def home_type(e):
    if "01" in e:
        return "01 ONE FAMILY HOUSE"
    elif "02" in e:
        return "02 TWO FAMILY HOUSE"
    elif "03" in e:
        return "03 THREE FAMILY HOUSE"
    else:
        return "Something went wrong"
housing["type_of_home"] = housing["type_of_home"].apply(home_type)
WARNING:root:Requests made without an app_token will be subject to strict throttling limits.
# get some general statistics
housing.info()
housing.describe()
#housing.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5979 entries, 0 to 5978
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   borough             5979 non-null   object
 1   neighborhood        5978 non-null   object
 2   type_of_home        5979 non-null   object
 3   number_of_sales     5979 non-null   int64
 4   lowest_sale_price   5979 non-null   int64
 5   average_sale_price  5979 non-null   int64
 6   median_sale_price   5979 non-null   int64
 7   highest_sale_price  5979 non-null   float64
 8   year                5979 non-null   int64
dtypes: float64(1), int64(5), object(3)
memory usage: 420.5+ KB
 | number_of_sales | lowest_sale_price | average_sale_price | median_sale_price | highest_sale_price | year |
---|---|---|---|---|---|---|
count | 5,979.000 | 5,979.000 | 5,979.000 | 5,979.000 | 5,979.000 | 5,979.000 |
mean | 35.560 | 712,938.596 | 1,207,114.844 | 1,150,549.290 | 2,075,932.909 | 2,014.521 |
std | 50.761 | 1,509,407.390 | 1,989,596.662 | 1,868,417.469 | 3,667,126.216 | 2.860 |
min | 1.000 | 150,000.000 | 167,500.000 | 167,500.000 | 176,750.000 | 2,010.000 |
25% | 5.000 | 212,000.000 | 453,514.000 | 443,250.000 | 700,000.000 | 2,012.000 |
50% | 16.000 | 290,000.000 | 624,008.000 | 611,500.000 | 990,000.000 | 2,015.000 |
75% | 46.000 | 500,000.000 | 962,607.000 | 920,000.000 | 1,860,000.000 | 2,017.000 |
max | 519.000 | 26,893,000.000 | 26,893,000.000 | 26,893,000.000 | 77,100,000.000 | 2,019.000 |
The dataset contains info on home prices by:
While we definitely want all boroughs and neighborhoods, we need to decide which types of homes should be used as an indicator for property prices, and which years should be considered.
Home price information is broken down into:
For the purpose of this analysis we will only look at average prices.
# plotting with matplotlib for a change
temp_pivot = housing.pivot_table(columns="type_of_home",index="year",
values="number_of_sales", aggfunc=sum)
ax = temp_pivot.plot(kind="bar",
title="Number of recorded sales by year and type of home")
plt.legend(bbox_to_anchor=(1,1))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()
I will use one-family houses, because they are the most common type of property sold. They probably also have the most real-life relevance when a family considers moving.
Now we need to decide which years we want to include in our analysis.
# removing the columns and records we don't need
keep_cols = ['borough',
'neighborhood',
'type_of_home',
'number_of_sales',
'average_sale_price',
'year']
housing_ofh = housing.loc[housing["type_of_home"] == "01 ONE FAMILY HOUSE",keep_cols]
# checking by year for how many neighborhoods no sale was recorded
print("No of neighborhoods per year with no recorded one family house sale")
housing_ofh.pivot_table(index='neighborhood', values="type_of_home", columns="year", aggfunc="count").isna().sum()
No of neighborhoods per year with no recorded one family house sale
year
2010    23
2011    29
2012    19
2013    22
2014    25
2015    20
2016    21
2017    25
2018    23
2019    22
dtype: int64
By default the dataset has at most one record per year, type of home, and neighborhood. A quick analysis shows that for every year there are neighborhoods with no recorded price. I therefore decided to include all years, to have information on as many neighborhoods as possible.
Next up: Calculating an average price and number of sales for each neighborhood.
# calculate average price per neighborhood
ofh_price = housing_ofh.groupby(["neighborhood"]).mean().reset_index().drop(columns=["year","number_of_sales"])
ofh_price.rename(inplace=True, columns={"average_sale_price":"avg_sale_price"})
# calculate average number of sales per year per neighborhood
# this can't be done in the same step as the price, because in the original dataset
# years without a sale are not recorded as NaN but are missing entirely
ofh_num_sales = housing_ofh.groupby(["neighborhood"]).sum().reset_index().drop(columns=["year","average_sale_price"])
ofh_num_sales["sales_per_year"] = ofh_num_sales["number_of_sales"]/len(housing_ofh["year"].unique())
ofh_num_sales.drop(inplace=True,columns="number_of_sales")
# putting it together
ofh = pd.merge(left=ofh_price, right=ofh_num_sales, on="neighborhood", indicator=True)
# generate histogram for sales per year
ofh["sales_per_year"].iplot(kind="histogram", asFigure=True, bins=150,
title="NYC one family house sales by neighborhood",xTitle='Sales per year', yTitle='Count')
# generate histogram for average sale price
ofh["avg_sale_price"].iplot(kind="histogram", asFigure=True, bins=50,
title="NYC one family house prices by neighborhood",xTitle='Average price, USD', yTitle='Count')
# exclude neighborhoods
ofh = pd.merge(left=ofh_price, right=ofh_num_sales, on="neighborhood")
print("All neighborhoods: {}".format(ofh.shape[0]))
ofh = ofh[ofh["sales_per_year"] >= 5]
print("After threshold for # of sales: {}".format(ofh.shape[0]))
ofh = ofh[ofh["avg_sale_price"] < 1500000]
print("After threshold for price: {}".format(ofh.shape[0]))
All neighborhoods: 242
After threshold for # of sales: 187
After threshold for price: 175
Neighborhoods were excluded from the dataset for two reasons:
Applying these filters, 175 of 242 neighborhoods (72%) remain.
Geocoding was implemented with the help of this nice Dataquest user project (which also has some pointers on using Beautiful Soup for web scraping) and this Medium post.
# consider converting to a markdown cell to avoid rerunning - it takes quite long
# build coordinates as inputs for reverse geocoding
combined["coordinates"] = combined["lat"].astype(
str)+","+combined["lon"].astype(str)
# make API request
# read documentation https://geopy.readthedocs.io/
locator = Nominatim(user_agent="Geocoder_for_DQ", timeout=10)
# adding delays between queries and retrying failed requests
rgeocode = RateLimiter(locator.reverse, min_delay_seconds=0.1)
# getting FutureWarning from tqdm ¯\_(ツ)_/¯ used warnings lib
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=FutureWarning)
tqdm_notebook.pandas()
# progress_apply from tqdm for progressbar in Jupyter
combined["rgeocode"] = combined["coordinates"].progress_apply(rgeocode)
# paste info from the location object into separate columns
combined["rgeocode_raw"] = [x.raw for x in combined["rgeocode"]]
combined["address_dict"] = [x.get("address") for x in combined["rgeocode_raw"]]
combined["neighborhood"] = [x.get("neighbourhood") for x in combined["address_dict"]]
combined["suburb"] = [x.get("suburb") for x in combined["address_dict"]]
# take suburb if neighborhood not found
print ("impute {} missing neighborhoods from suburb".format(combined["neighborhood"].isna().sum()))
combined["neighborhood"] = combined["neighborhood"].fillna(combined["suburb"])
print ("{} missing neighborhoods remaining and filled with '0'".format(combined["neighborhood"].isna().sum()))
combined["neighborhood"] = combined["neighborhood"].fillna(0)
impute 103 missing neighborhoods from suburb
0 missing neighborhoods remaining and filled with '0'
# drop dict and location columns because they are not needed anymore
combined_incl_locations = combined.copy() # but keep them in a backup variable because the API call takes so long
combined = combined.drop(columns=["rgeocode","rgeocode_raw","address_dict"])
# prepare datatype and formatting for merging
ofh["neighborhood"] = ofh["neighborhood"].astype(str).str.lower().str.strip()
combined["neighborhood"] = combined["neighborhood"].astype(str).str.lower().str.strip()
# some schools' geocoded neighborhoods were not found in the property dataset
# some manual lookup led to this mapping for some of the unmatched schools
schools_new_neighborhoods = {'bedford park': 'bedford park/norwood',
'bedford-stuyvesant': 'bedford stuyvesant',
'castle hill': 'castle hill/unionport',
'cobble hill historic district': 'cobble hill-west',
'east flatbush': 'flatbush-east',
'flatbush': 'flatbush-central',
'harlem': 'harlem-central',
'kingsbridge': 'kingsbridge/jerome park',
'kingsbridge heights': 'kingsbridge hts/univ hts',
'melrose': 'melrose/concourse',
'midtown': 'midtown east',
'midtown south': 'midtown west',
'morris heights': 'highbridge/morris heights',
'morrisania': 'morrisania/longwood',
'mott haven': 'mott haven/port morris',
'norwood': 'bedford park/norwood',
'port morris': 'mott haven/port morris',
'saint george': 'new brighton-st. george',
'upper west side': 'upper west side (79-96)',
'west brighton': 'west new brighton',
'williams bridge': 'williamsbridge',
'williamsburg': 'williamsburg-central'}
combined["neighborhood"].replace(schools_new_neighborhoods,inplace=True)
# If I were to do it properly, I would get shape objects of all neighborhoods
# per the property dataset and do spatial matching with the schools' coordinates (see the sketch below)...
# now finally add property data to school dataset
combined_prop = pd.merge(left=combined, right=ofh, on="neighborhood",how="left")
share_prop = 100-combined_prop["avg_sale_price"].isna().sum()/combined_prop.shape[0]*100
print("{:,.0f}% of schools have property values associated with them".format(share_prop))
42% of schools have property values associated with them
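As noted in the comment two cells up, a cleaner alternative to the manual name mapping would be spatial matching. A minimal sketch with geopandas (not run here; the file name and the assumption that the polygon file carries the neighborhood names are hypothetical):
import geopandas as gpd
# schools as point geometries (lon/lat columns built earlier)
schools = gpd.GeoDataFrame(
    combined,
    geometry=gpd.points_from_xy(combined["lon"], combined["lat"]),
    crs="EPSG:4326")
# neighborhood polygons, e.g. exported from NYC Open Data (hypothetical local file)
hoods = gpd.read_file("nyc_neighborhoods.geojson").to_crs("EPSG:4326")
# assign each school the neighborhood polygon it falls into
# (older geopandas versions use op="within" instead of predicate="within")
schools_in_hoods = gpd.sjoin(schools, hoods, how="left", predicate="within")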
We now have information on the property values of (some) schools. Let's plot them and see where students score relatively well on SAT exams compared to how expensive property in the neighborhood is.
# compute average SAT-score by neighborhood
combined_prop = combined_prop[combined_prop["avg_sale_price"].notna()]
analysis_cols = ["neighborhood", "boro", "sat_score", "avg_sale_price",'saf_tot_11']
combined_prop_analysis = combined_prop.loc[:,analysis_cols]
hood_schools = combined_prop_analysis.groupby(["boro","neighborhood"]).mean().reset_index()
hood_schools["hovertext"] = hood_schools["neighborhood"]+" | avg. safety score: "+hood_schools["saf_tot_11"].round(1).astype(str)
with warnings.catch_warnings(): # to catch FutureWarning about np.module being used in pandas
df_temp = hood_schools
x_temp = "avg_sale_price"
y_temp = "sat_score"
cat_temp = "boro"
text_temp = "hovertext"
warnings.filterwarnings("ignore",category=FutureWarning)
df_temp.iplot(
x=x_temp,
y=y_temp,
categories=cat_temp,
text=text_temp,
vline= {"x":df_temp[x_temp].mean(),
"color":"#000000",
"dash": "dash"
},
hline={"y":df_temp[y_temp].mean(),
"color":"#000000",
"dash": "dash"
},
layout = dict(
hovermode="closest",
# customize labels!
title='Property value vs. SAT score by neighborhood <br> dashed lines: average of entire dataset'),
xTitle='Average property value one family home, USD',
yTitle='Avg. SAT score'
)
Sheepshead Bay and Jamaica both have relatively low property prices and good public schools. Bedford Park / Norwood students feel less safe, so this neighborhood should maybe not be the first choice. As the graph below shows, property prices in one of the recommended neighborhoods (Jamaica) have increased above average over the past decade.
Note: The obvious best pick "Oakwood" is home to the elite school "STATEN ISLAND TECHNICAL HIGH SCHOOL" which is open to all NYC residents and has strict admissions criteria.
ofh_average = housing_ofh.pivot_table(margins=True,
index="neighborhood",
columns="year",
values="average_sale_price")
recommended_hoods = ["SHEEPSHEAD BAY", "JAMAICA", "All"]
# build indexchart
price_index = ofh_average.div(ofh_average[2010], axis=0).loc[recommended_hoods,:]*100
title="Development of property prices in recommended neighborhoods, <br>compared to all NYC neighborhoods"
price_index.transpose().iplot(kind="line",title=title, yTitle="2010 = 100%", mode="lines+markers")
Note: This was mainly a technical challenge with the aim to learn data analysis techniques and tools in Python. Please refer to "results" at the beginning of this document or the respective headings of the individual sections for an overview of insights gained from the data.
In this guided project we have applied the learnings of Step 2 of dataquest's "Data Analyst" learning path. Some of the concepts include:
Additionally I explored: