Academic success in high schools: What drives differences in SAT-scores across NYC?

Abstract

Motivation

The results of the standardized SAT exam are used by US colleges to determine which high school graduates they admit.
Understanding whether there are unintended patterns that discriminate against certain groups of students is an important step toward making the US education system more equal - something the US Department of Education calls a "formidable challenge".
To be an informed citizen on this matter (and to learn some Python along the way), I want to know:

  • Which factors drive differences in SAT-scores and thus opportunities to attend (selective) US-colleges?
  • Bonus question: Where in NYC does one get the best high school education compared to property prices?

Methodology

This analysis will use descriptive statistics (e.g. means, correlations, z-scores) and data visualizations (e.g. scatter plots, bar charts, maps) to explore several datasets on NYC high schools: SAT results, AP results, class sizes, demographics, graduation outcomes, school surveys and the high school directory.

The ETL portion of this analysis was executed based on dataquest instructions.

Additional dataset: Property prices by NYC neighborhood


Results

The main findings are:

  • Consistent with serious research, hispanic and black students are at a disadvantage. They attend schools in less safe neighborhoods and are less often admitted to their borough's elite high schools.
  • STEM-focused elite schools have a high share of male students, while excellent high schools focused on non-STEM subjects tend to be predominantly attended by female students.
  • Advanced Placement (AP) courses are not always a good preparation for SAT-tests.
  • Sheepshead Bay and Jamaica are neighborhoods that offer good public education and have comparably low property cost.

Notes: The data is generally quite outdated (in part from 2011), and SAT scores are not the only relevant parameter for evaluating the quality of education.


[ETL] Reading data

In [1]:
# standard libs from dq and additional utilities (PEP-8 made me do it in this order)
from tqdm.notebook import tqdm_notebook  # progress bar for time-consuming geocoding
import warnings  # used here to suppress FutureWarning
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

# additional libs used for this project
from sodapy import Socrata  # api calls
import geopy  # reverse geocoding
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import plotly as py  # interactive plotting
import plotly.graph_objs as go
import cufflinks as cf  # interactive plotting made easy (pandas style)

# Using plotly + cufflinks in offline mode
from plotly.offline import iplot, init_notebook_mode
cf.go_offline(connected=True)
init_notebook_mode(connected=True)

# jupyter command to show plots in notebook
%matplotlib inline

# setting pandas display options
pd.options.display.float_format = '{:,.3f}'.format
pd.set_option("display.max_columns", 20)
#pd.set_option("display.max_rows", 20)
pd.set_option('display.max_colwidth', 50)


# read separate input files into a single dictionary
# csv-files
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

# survey data, different format
all_survey = pd.read_csv("schools/survey_all.txt",
                         delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt",
                         delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN",
    "rr_s",
    "rr_t",
    "rr_p",
    "N_s",
    "N_t",
    "N_p",
    "saf_p_11",
    "com_p_11",
    "eng_p_11",
    "aca_p_11",
    "saf_t_11",
    "com_t_11",
    "eng_t_11",
    "aca_t_11",
    "saf_s_11",
    "com_s_11",
    "eng_s_11",
    "aca_s_11",
    "saf_tot_11",
    "com_tot_11",
    "eng_tot_11",
    "aca_tot_11",
]
survey = survey.loc[:, survey_fields]
data["survey"] = survey

[ETL] Adding and converting features

In [2]:
# add DBN columns as unique identifier in datasets where they are not yet present
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

data["class_size"]["padded_csd"] = data["class_size"]["CSD"].astype(str).str.pad(2,fillchar="0")
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

# summing SAT-score
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = (data['sat_results'][cols[0]]
                                    + data['sat_results'][cols[1]]
                                    + data['sat_results'][cols[2]])

# extracting latitude and longitude to separate columns
pattern = r"\((.+),(.+)\)"
data["hs_directory"][["lat","lon"]] = data["hs_directory"]["Location 1"].str.extract(pattern).astype(float)

# convert AP scores to numeric
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

[ETL] Condensing and combining datasets

In [3]:
# selecting the classes relevant for SAT-analysis (high school, general education)
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

# compute average class size by school across all subjects
data["class_size"] = pd.pivot_table(class_size, values="AVERAGE CLASS SIZE", index="DBN").reset_index()

# select most recent data available in "demographics" and "graduation" dataset
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

# initiate combined dataset with our main variable we want to investigate
combined = data["sat_results"]

# left joins for somewhat sparse datasets, because we do not want to lose any SAT-info
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

# inner joins for other datasets that potentially contain very relevant explanatory features.
# if these features are not available for a high school, we want to exclude that school
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

# missing values for numeric columns filled with mean, non-numerics with zeros 
combined = combined.fillna(combined.mean(numeric_only=True)) # potential lever to improve quality of analysis, not in scope of this learning exercise
combined = combined.fillna(0)

# Add a school district column for later analysis
combined["school_dist"] = combined["DBN"].str[:2]

Survey results: Safety and students expectations are important

In [4]:
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
In [5]:
# compute all correlations with SAT scores
correlations = combined.corr()["sat_score"]

# build a barchart to visualize SAT and survey response correlation  
ax = correlations[survey_fields].sort_values().plot(kind="bar",
                                                    figsize=(8, 6),
                                                    title="Correlation of survey responses and SAT-scores")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
for p in ax.patches:
    ax.annotate(str(round(p.get_height(), 2)),
                (p.get_x()+0.1, max(p.get_height(), 0)+0.01),
                rotation=90
                )
ax.axhline(y=0.0, color='black', linestyle='-', linewidth=1)
plt.show()

Safety / Respect ("saf_...") and students' academic expectations ("aca_s_11") are correlated with good SAT scores.
This intuitively makes sense - if students feel safe and have confidence in themselves / set higher expectations, they (can) focus on achieving good academic results.

Further details:

  • aca_...: Academic expectations score based on students' (s), teachers' (t) and parents' (p) responses: it is interesting that aca_p (parents) and aca_t (teachers) exhibit a lower correlation with SAT scores than the expectations set by the students themselves.
  • saf_t_11, Safety and Respect score based on teachers' responses: the safety and respect perceived by teachers also seems relevant.

Crime: Missing safety in the Bronx and Brooklyn impedes high SAT-scores

In [6]:
# build a scatter plot to visualize correlation between SAT and Safety and Respect Rating (students)
ax = combined.plot(kind="scatter", x="saf_s_11", y="sat_score",
                   figsize=(6, 4),
                   title="Correlation of students Safety and Respect Score and SAT-scores, by school")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()

Below a certain safety-rating threshold (~6.5), there are no schools with high SAT scores. To me this looks like unsafe environments preclude excellent scores.
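A rough way to quantify this claim (a sketch; the 6.5 safety cutoff and the 1,600 "high SAT" threshold are read off the plot rather than taken from the source data):

unsafe = combined["saf_s_11"] < 6.5
high_scoring = combined["sat_score"] > 1600
print("Schools rated below 6.5 on safety with an SAT average above 1,600:",
      (unsafe & high_scoring).sum())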

In [7]:
# build a second scatter plot: SAT vs. students' Safety and Respect Rating, aggregated by school district
by_district = combined.groupby("school_dist").mean().reset_index()

ax = by_district.plot(kind="scatter", x="saf_s_11", y="sat_score",
                      figsize=(6, 4),
                      title="Correlation of students Safety and Respect Score and SAT-scores, by school district")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()

The relationship is also visible at the aggregate level - but the spread of SAT scores increases as the Safety / Respect Score increases. Again, this could be an indication that safety is a necessary but not sufficient factor for high SAT scores.

I could not install Basemap (soon to be deprecated) to generate maps, so I used Folium instead.

In [8]:
# code for map
import folium
import branca.colormap as cm

# set a simple baselayer and zoom-in on NY
school_map = folium.Map(location=[40.693943, -73.985880], 
                        #width=500,
                        #height=800,
                        zoom_start=10,
                        tiles='CartoDB positron')

# generate a title above the map
# from https://github.com/python-visualization/folium/issues/1202
map_title = "NY highschools: Safety / Respect Score and avg. enrollment by school district"
title_html = '''
             <h3 align="center" style="font-size:20px"><b>{}</b></h3>
             '''.format(map_title)
school_map.get_root().html.add_child(folium.Element(title_html))

# select column names of the viz parameters
size_parameter = "total_enrollment"
color_parameter = "saf_s_11"

# generate a colormap to shade the circles on the map
# from https://stackoverflow.com/questions/56876620/unsure-how-to-use-colormap-with-folium-marker-plot
start_col_val = by_district[color_parameter].min()
end_col_val = by_district[color_parameter].max()
start_col_hex = "#d01c8b" # used https://colorbrewer2.org/ 
end_col_hex = "#4dac26"
colormap = cm.LinearColormap(colors=[start_col_hex,
                                     #"white",
                                     end_col_hex],
                             vmin=start_col_val,
                             vmax=end_col_val
                            )
colormap.caption = "Safety / Respect Score, students response"

# create a circle for each school district
for index, row in by_district.iterrows():
    folium.Circle(location=[row["lat"],row["lon"]],
                  color=colormap(row[color_parameter]),
                  fill_color=colormap(row[color_parameter]),
                  fill_opacity=0.75,
                  radius=row[size_parameter],
                  weight=2,
                  tooltip="District "+row["school_dist"]+
                      " <br> Avg. Safety / Respect Score, students response (color): {:.1f}".format(row[color_parameter])+
                      " <br> Avg. enrollment per school (size): {:.0f}".format(row[size_parameter])
                 ).add_to(school_map)

# add a legend
school_map.add_child(colormap)
    
# display the map
school_map
Out[8]:

Especially students in northern Brooklyn and parts of the Bronx report below-average Safety / Respect Scores. This somewhat aligns with a New York Times article from around that time, which reports that the precincts with the highest crime rates are in these areas.
Up-to-date stats by precinct and crime type can be found here.

Race: Hispanics and blacks disadvantaged

In [9]:
# build bar chart to visualize correlation between SAT scores and the racial composition of schools
race_field = ["asian_per","black_per","hispanic_per","white_per"]

ax = correlations[race_field].sort_values().plot(kind="bar",
                                                    figsize=(8, 8),
                                                    title="Correlation of race to SAT-scores")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
for p in ax.patches: # set data point labels
    ax.annotate(str(round(p.get_height(), 2)),
                (p.get_x() + 0.2, max(p.get_height(), 0)+0.02),
                rotation=90
                )
ax.axhline(y=0.0, color='black', linestyle='-', linewidth=1)
plt.show()
  • Achievement gaps in the US are well documented (example1, example2 - both are from the same time period as our data). The general trend of hispanic and black students scoring lower than asian and white students also persists in this NYC dataset.
    The negative correlation is stronger for hispanics (about -0.4) than for blacks (about -0.3). Examples 1 and 2 above showed larger achievement gaps for blacks than for hispanics - data cited on Wikipedia is, however, consistent with our NYC data.
  • A slightly higher positive correlation for whites compared to asians was a bit surprising; it is commonly reported the other way around.

Trying out plotly and cufflinks to enable easier data discovery through interactive charts - inspiration, examples, more examples.

In [10]:
with warnings.catch_warnings(): # to catch warning about np.module in Python 3.8.2.
    warnings.filterwarnings("ignore",category=FutureWarning)
    combined.iplot(
        x='hispanic_per',
        y='sat_score',
        categories='boro',
        text="SCHOOL NAME",
        vline= {"x":combined["hispanic_per"].mean(),
                "color":"#000000",
                "dash": "dash"
               },
        hline={"y":combined["sat_score"].mean(),
               "color":"#000000",
               "dash": "dash"
              },
        xTitle='Share hispanics, %',
        yTitle='Avg. SAT score',
        title='Share of hispanic students vs. SAT score by school <br> dashed lines: average of entire dataset'
    )
print("Recommendation: choose option to 'show closest data point on hover' in top right corner")
Recommendation: choose option to 'show closest data point on hover' in top right corner
  • Almost all schools in the Bronx (double-click "Bronx" in the legend) exhibit above-average shares of hispanic students and below-average SAT scores. There might be confounding factors (crime or poverty) behind the low SAT scores of hispanics in NYC. Sadly, the only elite schools in the Bronx (Lehman, Bronx High School of Science) have very few hispanic students.

  • Schools with very low average SAT scores and very high shares of hispanic students specialize in students with limited English skills (they are often named "International" schools or similar). It is not surprising that these schools score below average, because the SAT tests both reading and writing.

  • The high schools with high SAT scores and low shares of hispanic students are all highly selective public schools with excellent reputations. I am interested in the racial makeup of these elite schools (avg. SAT > 1,800) compared to all schools.

In [11]:
# build stacked bar chart of the racial makeup of all schools vs. elite schools
num_race = ['asian_num', 'black_num', 'hispanic_num', 'white_num']
high_sat = combined["sat_score"] > 1800

race_dist = pd.DataFrame(data=[combined.loc[:, num_race].sum()/combined.loc[:, num_race].sum().sum(),
                               combined.loc[high_sat, num_race].sum()/combined.loc[high_sat, num_race].sum().sum()],
                         index=["All schools", "Elite schools"])*100
race_dist.columns = ["Asian","Black","Hispanic", "White"]
race_dist.astype(int).iplot(kind="bar", barmode="stack", yTitle='Share of all students in %',
                            title='Racial distribution NYC high schools, <br> all vs. elite (SAT avg. > 1,800)')

Black and hispanic students are clearly underrepresented in NYC's elite high schools - white and especially asian students are overrepresented compared to all NYC high schools. This is the case even though there are elite schools in all five boroughs.
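As a quick cross-check of that last point, the elite schools can be counted per borough (a small sketch reusing the high_sat mask defined in the cell above):

# number of elite schools (avg. SAT > 1,800) per borough
combined.loc[high_sat, "boro"].value_counts()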

Gender: Males over-represented in STEM elite schools, Queens and Manhattan with excellent predominantly female schools

In [12]:
# build bar chart to visualize correlation between SAT and gender of students
gender = ["male_per", "female_per"]

ax = correlations[gender].sort_values().plot(kind="bar",
                                                    figsize=(8, 8),
                                                    title="Correlation of gender to SAT-scores")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
for p in ax.patches: # set data point labels
    ax.annotate(str(round(p.get_height(), 2)),
                (p.get_x() + 0.2, max(p.get_height(), 0)+0.002),
                rotation=0
                )
ax.axhline(y=0.0, color='black', linestyle='-', linewidth=1)
plt.show()

Schools exhibit higher SAT scores when the share of female students is higher, but the correlation is pretty weak (~0.1).
It is generally sufficient to plot only n-1 dummies of a categorical variable. Gender has only two possible values here, so the correlation for female_per is always the exact mirror image of the one for male_per.
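A one-line sketch confirming this symmetry (male_per and female_per sum to 100 for every school, so the two correlations mirror each other exactly):

print(combined["sat_score"].corr(combined["female_per"]),
      combined["sat_score"].corr(combined["male_per"]))  # roughly +0.1 and -0.1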

In [13]:
with warnings.catch_warnings(): # to catch FutureWarning about np.module being used in pandas
    warnings.filterwarnings("ignore",category=FutureWarning)
    combined.iplot(
        x='female_per',
        y='sat_score',
        categories='boro',
        text="SCHOOL NAME",
        vline= {"x":combined["female_per"].mean(),
                "color":"#000000",
                "dash": "dash"
               },
        hline={"y":combined["sat_score"].mean(),
               "color":"#000000",
               "dash": "dash"
              },
        xTitle='Share females, %',
        yTitle='Avg. SAT score',
        title='Share of female students vs. SAT score by school <br> dashed lines: average of entire dataset'
    )
print("Recommendation: choose option to 'show closest data point on hover' in top right corner")
Recommendation: choose option to 'show closest data point on hover' in top right corner
  • All eleven schools with a share of female students above 60% and SAT scores above 1,400 are located in Manhattan and Queens. Girls in other boroughs may have poorer access to excellent education.
  • The high performing / mainly female schools are mostly humanities / arts focused.
  • Seven of the nine previously defined "elite schools" with average SAT scores above 1,800 are majority male - they are STEM focused.
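A quick check of the last point (a sketch; "majority male" is taken to mean male_per > 50):

elite = combined[combined["sat_score"] > 1800]
print((elite["male_per"] > 50).sum(), "of", len(elite), "elite schools are majority male")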

Custom function to help with analysis

In [14]:
# build custom function "describe_plus" to help with the investigation of subsets
# helper functions


def describe_nan(df):
    """
    Return transposed pandas describe with share of nan-values

    Parameters:
    df = pandas.DataFrame
    """
    df_des = df.describe().transpose()
    df_des["share_nan"] = 1-(df_des["count"]/df.shape[0])
    df_des_col_order = ["share_nan", "count", "mean",
                        "std", "min", "25%", "50%", "75%", "max"]
    df_des = df_des[df_des_col_order]
    return df_des


def describe_frequent(df, n=5):
    """
    Returns dataframe with most common values

    Parameters:
    df = pandas.DataFrame
    n = Optional number of most common values 

    """
    topn_df = pd.DataFrame()
    col_names = df.columns

    for col in col_names:
        col_name = df[col].name
        top_n = df[col].value_counts(dropna=False)[
            :n].reset_index().reset_index()
        top_n_norm = df[col].value_counts(dropna=False, normalize=True)[
            :n].reset_index().reset_index()
        top_n["norm"] = (top_n_norm[col_name]*100).astype(int)
        top_n["level_0"] = top_n["level_0"]+1
        top_n["string"] = "value: "+top_n["index"].astype(
            str)+" | count: "+top_n[col_name].astype(str)+" | norm %: "+top_n["norm"].astype(str)
        result = top_n["string"]
        result.rename(col_name, inplace=True)
        topn_df = pd.concat([topn_df, result], axis=1)
    topn_df.index = "Top"+(topn_df.index + 1).astype(str)
    return topn_df

# main function


def describe_plus(df, subset=None, output=None,
                  figsize=(15, 20), cmap="RdBu"):
    '''
    Returns a dataframe of descriptive strings.
    If only a dataframe is passed, the output of pandas' 'describe'-method is extended with the share of missing values in %.
    If also one subset is passed, the resulting subset is compared to the entire dataset.

    Parameters:
    df: Required pandas.DataFrame
    subset: Optional pandas.DataFrame with identical columns to compare to df
    output: Optional, if only one df is passed, "non_num" returns df with all 
            non numeric columns and adds descriptive stats as first rows.
            If subset is passed, "clean" returns a DataFrame without styling.
            If subset is passed, "sorted" returns a styled DataFrame, 
            sorted by z-score. 
    cmap: Optional parameter, default colormap "RdBu"

    Returns:
    (styled) pandas.Dataframe
    '''

    if subset is None:
        if output == "non_num":
            # get only the numeric columns from the dataframe
            all_cols = df.columns
            # describe only works on numerical columns
            num_cols = describe_nan(df).index
            non_num_cols = [col for col in all_cols if col not in num_cols]
            df_non_num = df[non_num_cols]

            # calc some stats on non-numeric columns
            df_stats = pd.DataFrame(columns=non_num_cols)
            df_stats.loc["Record count"] = df_non_num.shape[0]
            df_stats.loc["Count Nan"] = df_non_num.isna().sum()
            df_stats.loc["Share Nan"] = df_non_num.isna().sum() / \
                df_non_num.shape[0]
            df_stats.loc["Count unique"] = df_non_num.nunique()

            # add most common values
            df_stats = pd.concat([df_stats, describe_frequent(df_non_num)])

            # prepend stats to df
            df_non_num_up = pd.concat([df_stats, df_non_num])

            return df_non_num_up

        else:
            return describe_nan(df).style.background_gradient(subset="share_nan", low=0, high=1, cmap="Blues")

    else:
        # calculate descriptive stats
        subset_desc = describe_nan(subset)
        df_desc = describe_nan(df)

        # calculate percent change (used to be for sns.heatmap)
        hm_data = (subset_desc/df_desc)
        hm_data["z-score, subset mean"] = (subset_desc["mean"] -
                                           df_desc["mean"])/(df_desc["std"])
        hm_data["count"] = 0  # no color for count
        hm_data = hm_data[['count', "share_nan", 'z-score, subset mean',
                           'mean', 'std', 'min', '25%', '50%', '75%', 'max']]

        # format main annotation values
        main_values = pd.DataFrame()
        main_values = pd.concat(
            [main_values, subset_desc.loc[:, "mean":"max"]], axis=1)
        main_values = main_values.applymap('{:,.2f}'.format)
        main_values["z-score, subset mean"] = (
            subset_desc["mean"]-df_desc["mean"])/(df_desc["std"])
        main_values["count"] = subset_desc['count'].apply('{:,.0f}'.format)
        main_values["share_nan"] = (
            subset_desc['share_nan']*100).apply('{:,.0f}%'.format)
        main_values = main_values[[
            'count', "share_nan", 'z-score, subset mean', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']]

        # values of subset are compared to main dataset in annotation as follows:
        # count (share of total records)
        # share_Nan (% compared to entire dataset)
        # mean (% compared to entire dataset)
        # z-score of subset-mean (no comparison)
        # std (% compared to entire dataset)
        # min,25%,50%,75%,max (percentile rank of entire dataset)
        # dataframe has same columns, but z-score is also separate and shaded
        # percentages are formatted as integer with sign

        # pct rank: copy output from description method to original dataframe
        subset_desc_pct_rank = pd.concat(
            [df, subset_desc.loc[:, "min":"max"].transpose()]).dropna(axis=1, subset=["min"])

        # pct rank: make a column with percentile rank for every numeric column
        list_col_pct = []
        for col in subset_desc_pct_rank:
            col_pct = col+"_pct_rank"
            list_col_pct.append(col_pct)
            subset_desc_pct_rank[col_pct] = subset_desc_pct_rank[col].rank(
                pct=True)

        # pct rank: discard main dataset, apply proper names and shape
        subset_desc_pct_rank = (
            subset_desc_pct_rank.loc["min":"max", list_col_pct].transpose()*100).astype(int)
        subset_desc_pct_rank.index = subset_desc_pct_rank.index.str.replace(
            "_pct_rank", "")
        subset_desc_pct_rank.columns = subset_desc_pct_rank.columns+"_pct_rank"

        # format all values going into parenthesis
        paran = pd.DataFrame()
        paran["count"] = ((subset_desc/df_desc)["count"]
                          * 100).apply('{:,.0f}%'.format)
        paran["share_nan"] = hm_data["share_nan"].apply('{:,.0f}%'.format)
        paran["z-score, subset mean"] = "-"
        paran["mean"] = ((((subset_desc/df_desc)["mean"])-1)
                         * 100).apply('{:+,.0f}%'.format)
        paran["std"] = ((((subset_desc/df_desc)["std"])-1)
                        * 100).apply('{:+,.0f}%'.format)
        paran = pd.concat([paran, subset_desc_pct_rank], axis=1)
        paran.columns = paran.columns.str.replace("_pct_rank", "")

        # build the annotation
        annot = main_values.astype(str)+" ("+paran.astype(str)+")"
        annot["z-score, subset mean"] = main_values["z-score, subset mean"]

        # decided to redo column headers
        new_col_headers = {'count': "count (share of df)",
                           "share_nan": "share_nan (change to df)",
                           'z-score, subset mean': "z-score, subset mean",
                           'mean': " mean (change to df)",
                           'std': "std (change to df)",
                           'min': "min (pct rank in df)",
                           '25%': "25% (pct rank in df)",
                           '50%': "50% (pct rank in df)",
                           '75%': "75% (pct rank in df)",
                           'max': "max (pct rank in df)"
                           }
        annot.rename(inplace=True, columns=new_col_headers)

        if output == "clean":
            return annot

        elif output == "sorted":
            return annot.sort_values(by="z-score, subset mean").style.background_gradient(
                subset="z-score, subset mean", cmap="RdBu")

        else:
            return annot.style.background_gradient(
                subset="z-score, subset mean", cmap="RdBu")

Advanced Placement: Not always a good preparation for the SAT

In [15]:
# compute new feature: share of students that took Advanced Placement (AP) exams to earn college credit
combined["ap_per"] = combined["AP Test Takers "]/combined["total_enrollment"]*100
In [16]:
# build scatter plot investigating correlation between share of AP-Test takers and SAT-score
with warnings.catch_warnings(): # to catch FutureWarning about np.module being used in pandas
    warnings.filterwarnings("ignore",category=FutureWarning)
    combined.iplot(
        x='ap_per',
        y='sat_score',
        categories='boro',
        text="SCHOOL NAME",
        vline= {"x":combined["ap_per"].mean(),
                "color":"#000000",
                "dash": "dash"
               },
        hline={"y":combined["sat_score"].mean(),
               "color":"#000000",
               "dash": "dash"
              },
        xTitle='AP-Test Takers, %',
        yTitle='Avg. SAT score',
        title='Share of AP-Test takers vs. SAT score by school <br> dashed lines: average of entire dataset'
    )
print("Recommendation: choose option to 'show closest data point on hover' in top right corner")
Recommendation: choose option to 'show closest data point on hover' in top right corner

Advanced Placement (AP) is a program in the United States and Canada created by the College Board which offers college-level curricula and examinations to high school students.

I see three interesting patterns in our data:

  • SAT above 1,400: Among schools with high SAT scores (above 1,400) there seems to be a strong correlation between the share of AP-test takers and SAT scores.
    All these schools seemed somewhat selective when I researched them. A hypothesis might be that students attending these schools generally have above-average academic aptitude and benefit more from Advanced Placement than students at other schools.
  • SAT below 1,300 and above-average share of AP-test takers: As the detailed analysis below shows, these are schools that might score lower on the SAT because they cater to otherwise disadvantaged students (more free lunches, more special education) and serve generally disadvantaged parts of New York's population (hispanics / blacks). The AP courses they offer are less diverse and mainly focused on English literature and language - it is surprising that these schools nonetheless score low on the SAT's reading and writing sections.
  • Average SAT and very high shares of AP-test takers: Since almost all of these schools have SAT scores equal to the average of the dataset, I suspect that their SAT score was missing and imputed (see the quick check below). They are therefore not investigated any further.
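A quick sanity check of the imputation suspicion (a sketch; it simply looks for the one value that many schools share to several decimal places, which would be the mean filled in during the ETL step):

most_common = combined["sat_score"].value_counts().idxmax()
n_shared = (combined["sat_score"] == most_common).sum()
print("Most frequent sat_score is {:.1f}, shared by {} schools "
      "(very likely the imputed mean)".format(most_common, n_shared))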

All in all, it looks like the share of AP-test takers is not a strong indicator of SAT performance.

See some detailed analysis below.

In [17]:
# select only schools with high (but not extremely high) shares of AP test takers and low SAT scores
high_ap_low_sat = combined[(combined["ap_per"]>25)&
                           (combined["ap_per"]<50)&
                           (combined["sat_score"]<1300)]
describe_plus(combined,subset=high_ap_low_sat,output="sorted")
Out[17]:
count (share of df) share_nan (change to df) z-score, subset mean mean (change to df) std (change to df) min (pct rank in df) 25% (pct rank in df) 50% (pct rank in df) 75% (pct rank in df) max (pct rank in df)
AVERAGE CLASS SIZE 105 (29%) 0% (nan%) -0.560828 22.80 (-8%) 3.40 (-5%) 10.00 (0) 20.83 (11) 23.17 (32) 25.14 (53) 30.94 (96)
SAT Math Avg. Score 105 (29%) 0% (nan%) -0.525060 385.63 (-8%) 29.36 (-55%) 312.00 (0) 367.00 (15) 385.00 (33) 411.00 (52) 441.00 (76)
sat_score 105 (29%) 0% (nan%) -0.513529 1,131.92 (-7%) 80.00 (-55%) 913.00 (0) 1,080.00 (13) 1,143.00 (33) 1,199.00 (52) 1,273.00 (77)
SAT Critical Reading Avg. Score 105 (29%) 0% (nan%) -0.510077 374.80 (-7%) 29.49 (-49%) 287.00 (0) 360.00 (13) 381.00 (32) 400.00 (53) 418.00 (75)
SAT Writing Avg. Score 105 (29%) 0% (nan%) -0.476021 371.49 (-7%) 26.97 (-54%) 291.00 (0) 357.00 (15) 377.00 (36) 394.00 (55) 414.00 (76)
N_s 105 (29%) 0% (nan%) -0.459995 291.04 (-51%) 81.18 (-88%) 93.00 (2) 240.00 (17) 300.00 (29) 340.00 (42) 598.21 (78)
total_enrollment 105 (29%) 0% (nan%) -0.458799 396.04 (-50%) 64.15 (-93%) 141.00 (0) 356.00 (17) 401.00 (28) 445.00 (44) 500.00 (58)
total_students 105 (29%) 0% (nan%) -0.451494 409.50 (-47%) 85.48 (-90%) 110.00 (0) 351.00 (18) 412.00 (30) 466.00 (47) 664.00 (76)
male_num 105 (29%) 0% (nan%) -0.448658 189.65 (-53%) 64.14 (-86%) 0.00 (1) 159.00 (18) 195.00 (33) 225.00 (47) 402.00 (78)
female_num 105 (29%) 0% (nan%) -0.437781 206.38 (-47%) 73.66 (-83%) 61.00 (1) 163.00 (17) 196.00 (31) 239.00 (47) 495.00 (83)
Total Cohort 105 (29%) 0% (nan%) -0.420025 99.71 (-49%) 59.19 (-74%) 1.00 (0) 70.00 (16) 93.00 (33) 116.00 (53) 193.87 (75)
N_t 105 (29%) 0% (nan%) -0.418452 23.34 (-41%) 6.28 (-84%) 6.00 (0) 20.00 (23) 24.00 (39) 28.00 (55) 37.00 (75)
number_programs 105 (29%) 0% (nan%) -0.411742 1.14 (-38%) 0.43 (-75%) 1.00 (35) 1.00 (35) 1.00 (35) 1.00 (35) 3.00 (84)
N_p 105 (29%) 0% (nan%) -0.381221 139.54 (-44%) 80.69 (-72%) 28.00 (1) 78.00 (17) 124.00 (32) 195.00 (54) 408.00 (88)
hispanic_num 105 (29%) 0% (nan%) -0.366238 186.80 (-39%) 105.43 (-67%) 17.00 (1) 89.00 (19) 204.00 (44) 276.00 (61) 405.00 (81)
sped_num 105 (29%) 0% (nan%) -0.360001 61.67 (-40%) 28.84 (-74%) 0.00 (1) 49.00 (26) 65.00 (43) 80.00 (57) 125.00 (80)
white_per 105 (29%) 0% (nan%) -0.333332 3.75 (-55%) 8.02 (-43%) 0.00 (2) 0.70 (17) 1.40 (38) 3.10 (60) 60.00 (98)
white_num 105 (29%) 0% (nan%) -0.318748 15.54 (-86%) 34.60 (-88%) 0.00 (2) 3.00 (17) 5.00 (30) 12.00 (54) 245.00 (88)
asian_num 105 (29%) 0% (nan%) -0.317136 23.87 (-82%) 45.33 (-87%) 0.00 (1) 4.00 (19) 8.00 (33) 22.00 (54) 272.00 (88)
asian_per 105 (29%) 0% (nan%) -0.297371 5.77 (-43%) 10.68 (-27%) 0.00 (1) 1.00 (20) 2.20 (38) 5.10 (57) 64.50 (99)
black_num 105 (29%) 0% (nan%) -0.282365 167.32 (-29%) 102.75 (-58%) 0.00 (0) 95.00 (26) 127.00 (42) 251.00 (67) 413.00 (87)
ell_num 105 (29%) 0% (nan%) -0.168173 65.17 (-29%) 95.56 (-40%) 0.00 (1) 18.00 (28) 30.00 (43) 58.00 (67) 377.00 (93)
saf_t_11 105 (29%) 0% (nan%) -0.116720 7.04 (-1%) 0.88 (-2%) 4.40 (0) 6.50 (23) 7.20 (50) 7.60 (70) 9.00 (98)
male_per 105 (29%) 0% (nan%) -0.078240 48.28 (-2%) 14.72 (+5%) 0.00 (1) 44.20 (25) 49.80 (49) 55.30 (76) 84.30 (98)
saf_tot_11 105 (29%) 0% (nan%) -0.062090 7.27 (-1%) 0.57 (-10%) 5.60 (0) 6.90 (24) 7.30 (48) 7.70 (73) 8.60 (98)
saf_s_11 105 (29%) 0% (nan%) -0.054626 6.57 (-1%) 0.64 (-8%) 5.20 (1) 6.10 (25) 6.60 (51) 7.00 (72) 8.50 (99)
rr_s 105 (29%) 0% (nan%) -0.026024 80.53 (-0%) 14.90 (-4%) 0.00 (0) 75.00 (26) 84.00 (48) 90.00 (70) 99.00 (98)
aca_s_11 105 (29%) 0% (nan%) -0.022072 7.37 (-0%) 0.42 (-8%) 6.40 (0) 7.10 (27) 7.38 (50) 7.60 (72) 9.10 (99)
lat 105 (29%) 0% (nan%) -0.016571 40.74 (-0%) 0.08 (-0%) 40.58 (1) 40.67 (19) 40.74 (52) 40.82 (75) 40.89 (99)
aca_t_11 105 (29%) 0% (nan%) -0.015370 7.50 (-0%) 0.96 (+2%) 4.80 (0) 6.90 (24) 7.50 (48) 8.30 (77) 9.20 (98)
Number of Exams with scores 3 4 or 5 105 (29%) 0% (nan%) -0.011268 150.76 (-2%) 19.43 (-92%) 11.00 (5) 153.45 (61) 153.45 (61) 153.45 (61) 153.45 (61)
AP Test Takers 105 (29%) 0% (nan%) -0.008076 127.61 (-1%) 9.52 (-95%) 42.00 (24) 129.03 (63) 129.03 (63) 129.03 (63) 129.03 (63)
Total Exams Taken 105 (29%) 0% (nan%) -0.008049 194.54 (-1%) 16.17 (-95%) 53.00 (23) 197.04 (65) 197.04 (65) 197.04 (65) 197.04 (65)
com_t_11 105 (29%) 0% (nan%) 0.003475 6.53 (+0%) 1.10 (-5%) 3.50 (1) 5.70 (23) 6.60 (52) 7.50 (78) 8.60 (98)
zip 105 (29%) 0% (nan%) 0.016941 10,733.81 (+0%) 513.98 (-4%) 10,002.00 (1) 10,455.00 (31) 10,473.00 (51) 11,208.00 (67) 11,691.00 (99)
eng_t_11 105 (29%) 0% (nan%) 0.028632 7.02 (+0%) 1.02 (-1%) 4.10 (1) 6.40 (28) 7.10 (53) 7.80 (78) 9.10 (98)
eng_s_11 105 (29%) 0% (nan%) 0.029803 6.64 (+0%) 0.56 (+3%) 5.50 (1) 6.30 (28) 6.60 (50) 7.00 (76) 8.60 (99)
saf_p_11 105 (29%) 0% (nan%) 0.044588 8.21 (+0%) 0.53 (-11%) 6.30 (1) 7.90 (28) 8.30 (54) 8.60 (76) 9.80 (99)
aca_tot_11 105 (29%) 0% (nan%) 0.046239 7.59 (+0%) 0.49 (-5%) 6.10 (0) 7.30 (28) 7.60 (52) 8.00 (80) 8.60 (97)
lon 105 (29%) 0% (nan%) 0.056514 -73.92 (-0%) 0.06 (-14%) -74.16 (0) -73.96 (29) -73.92 (57) -73.89 (76) -73.73 (99)
eng_tot_11 105 (29%) 0% (nan%) 0.070815 7.08 (+1%) 0.53 (-2%) 5.70 (1) 6.80 (30) 7.10 (56) 7.50 (81) 8.30 (99)
rr_p 105 (29%) 0% (nan%) 0.075437 40.40 (+4%) 19.66 (+2%) 9.00 (1) 25.00 (26) 39.00 (54) 51.00 (76) 90.00 (98)
female_per 105 (29%) 0% (nan%) 0.078164 51.71 (+2%) 14.72 (+5%) 15.70 (2) 44.70 (23) 50.20 (51) 55.80 (74) 100.00 (99)
com_s_11 105 (29%) 0% (nan%) 0.096943 6.15 (+1%) 0.50 (-6%) 5.20 (1) 5.80 (32) 6.10 (56) 6.40 (76) 8.40 (99)
hispanic_per 105 (29%) 0% (nan%) 0.104693 46.68 (+6%) 24.99 (+0%) 4.50 (1) 22.70 (27) 53.80 (57) 66.00 (79) 100.00 (99)
com_tot_11 105 (29%) 0% (nan%) 0.110361 6.81 (+1%) 0.50 (-10%) 5.30 (0) 6.50 (32) 6.80 (55) 7.20 (78) 8.10 (99)
eng_p_11 105 (29%) 0% (nan%) 0.112140 7.57 (+1%) 0.43 (-5%) 6.70 (2) 7.30 (30) 7.60 (59) 7.80 (76) 9.40 (99)
ell_percent 105 (29%) 0% (nan%) 0.166164 16.67 (+25%) 24.10 (+20%) 0.00 (1) 4.90 (36) 8.80 (56) 14.40 (75) 92.90 (99)
aca_p_11 105 (29%) 0% (nan%) 0.181744 7.90 (+1%) 0.45 (-10%) 6.80 (1) 7.60 (34) 7.90 (57) 8.20 (78) 9.50 (99)
sped_percent 105 (29%) 0% (nan%) 0.217671 15.84 (+11%) 7.20 (-1%) 0.00 (1) 13.20 (37) 17.30 (63) 20.80 (82) 28.40 (98)
rr_t 105 (29%) 0% (nan%) 0.220547 86.91 (+4%) 15.59 (-6%) 28.00 (0) 83.00 (38) 92.00 (59) 97.00 (78) 100.00 (90)
com_p_11 105 (29%) 0% (nan%) 0.236069 7.74 (+2%) 0.46 (-11%) 6.60 (2) 7.50 (40) 7.80 (62) 8.00 (77) 9.20 (99)
black_per 105 (29%) 0% (nan%) 0.253546 43.16 (+18%) 26.29 (+1%) 0.00 (0) 25.60 (39) 33.20 (57) 64.20 (80) 93.50 (99)
frl_percent 105 (29%) 0% (nan%) 0.360121 72.03 (+9%) 10.48 (-36%) 35.40 (7) 66.40 (40) 73.80 (63) 78.50 (78) 100.00 (99)
ap_per 105 (29%) 0% (nan%) 0.710746 32.95 (+57%) 5.30 (-68%) 25.28 (60) 28.74 (68) 31.94 (77) 35.94 (84) 47.09 (93)
schoolyear 105 (29%) 0% (nan%) nan 20,112,012.00 (+0%) 0.00 (+nan%) 20,112,012.00 (50) 20,112,012.00 (50) 20,112,012.00 (50) 20,112,012.00 (50) 20,112,012.00 (50)
fl_percent 105 (29%) 0% (nan%) nan 0.00 (+nan%) 0.00 (+nan%) 0.00 (50) 0.00 (50) 0.00 (50) 0.00 (50) 0.00 (50)
grade_span_max 105 (29%) 0% (nan%) nan 12.00 (+0%) 0.00 (+nan%) 12.00 (50) 12.00 (50) 12.00 (50) 12.00 (50) 12.00 (50)
expgrade_span_max 105 (29%) 0% (nan%) nan 12.00 (+0%) 0.00 (+nan%) 12.00 (50) 12.00 (50) 12.00 (50) 12.00 (50) 12.00 (50)
priority08 105 (29%) 0% (nan%) nan 0.00 (+nan%) 0.00 (+nan%) 0.00 (50) 0.00 (50) 0.00 (50) 0.00 (50) 0.00 (50)
priority09 105 (29%) 0% (nan%) nan 0.00 (+nan%) 0.00 (+nan%) 0.00 (50) 0.00 (50) 0.00 (50) 0.00 (50) 0.00 (50)
priority10 105 (29%) 0% (nan%) nan 0.00 (+nan%) 0.00 (+nan%) 0.00 (50) 0.00 (50) 0.00 (50) 0.00 (50) 0.00 (50)

The schools with above-average shares of AP-test takers and low SAT scores achieve poor results across all three parts of the SAT (negative z-scores), with writing being the least far below average.

Other noteworthy characteristics these schools have compared to NYC averages are:

  • smaller class sizes and smaller schools (total_enrollment)
  • higher shares of students receiving free lunch (frl_percent)
  • higher shares of special education students (sped_percent)
  • higher shares of hispanic and especially black students (black_per, hispanic_per)
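These figures can be read off a focused slice of the comparison table, e.g. (a sketch reusing the describe_plus helper above with output="clean"):

cols_of_interest = ["AVERAGE CLASS SIZE", "total_enrollment",
                    "frl_percent", "sped_percent",
                    "hispanic_per", "black_per"]
describe_plus(combined, subset=high_ap_low_sat, output="clean").loc[cols_of_interest]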
In [18]:
# investigating non-numeric features
pd.set_option("display.max_rows", 95)
describe_plus(high_ap_low_sat,output="non_num")[:9].transpose().head()
# describe_plus(subset,output="non_num")
print("results not shown in final doc to conserve vertical space")
results not shown in final doc to conserve vertical space

For the subset with low average SAT scores but above-average AP shares, "advancedplacement_courses" are very often "English Literature..." or "English Language and Composition".
I am surprised that these courses did not prepare students better in reading and writing. Let's look at what schools with high SAT scores do differently in terms of the AP courses they offer to students.

In [19]:
# extract and display list of most common AP-courses and plot how often they appear
# first select other schools for comparison:
high_ap_high_sat = combined[(combined["ap_per"]>25)&
                            (combined["ap_per"]<50)&
                            (combined["sat_score"]>1300)]

# loop through the subsets and extract the relative distribution of ap_courses
subsets = {"High AP, low SAT": high_ap_low_sat,
           "High AP, high SAT": high_ap_high_sat}
subsets_ap_courses = pd.DataFrame()

for k, v in subsets.items():
    ap_courses = v["advancedplacement_courses"].str.split(",",expand=True)
    ap_courses_melted = ap_courses.melt()["value"].str.strip().value_counts(normalize=True).sort_values()
    ap_courses_melted.name = k
    subsets_ap_courses = pd.concat([subsets_ap_courses, ap_courses_melted], axis=1)

ax = (subsets_ap_courses*100).iplot(kind="bar", 
                            #figsize=(10,7), 
                            layout = dict(title="AP classes by schol types, normalized count in %",
                                          xaxis=dict(automargin=True), hovermode="x", height=600))

Schools with high shares of students taking AP tests offer a narrower range of courses, focused on history and English literature / language / composition. Calculus is offered, but almost exclusively the AB variant; the broader BC course is covered less often (info).

Neighborhoods: Sheepshead Bay and Jamaica offer good public education and have comparably low property cost

Aim of this section: identify "affordable" neighborhoods with good schools.
For this analysis, we need to add the following information to our dataset:

  • property prices by neighborhood
  • neighborhood of schools (which requires reverse geocoding the available coordinates)

Property prices by neighborhood

In [20]:
# read-in property sales data from 'NYC Open Data'-server and clean it 
# the dataset and an excellent API documentation can be accessed here:
# https://dev.socrata.com/foundry/data.cityofnewyork.us/5ebm-myj7
client = Socrata("data.cityofnewyork.us", None)
housing = pd.DataFrame.from_records(client.get("5ebm-myj7", limit=6000))

# remove new line characters causing trouble
housing = housing.replace(r'\\n',' ', regex=True) 

# assign correct datatypes to numeric columns
num_col = housing.loc[:,"number_of_sales":"year"].columns.to_list()
for c in num_col:
    housing[c] = pd.to_numeric(housing[c],errors="coerce")
    
# according to the data dictionary only three house types exist,
# but the spelling differs a bit between them and needs to be aligned
def home_type(e):
    if "01" in e:
        return "01 ONE FAMILY HOUSE"
    elif "02" in e:
        return "02 TWO FAMILY HOUSE"
    elif "03" in e:
        return "03 THREE FAMILY HOUSE"
    else:
        return "Something went wrong"
housing["type_of_home"] = housing["type_of_home"].apply(home_type)
WARNING:root:Requests made without an app_token will be subject to strict throttling limits.
In [21]:
# get some general statistics
housing.info()
housing.describe()
#housing.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5979 entries, 0 to 5978
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   borough             5979 non-null   object 
 1   neighborhood        5978 non-null   object 
 2   type_of_home        5979 non-null   object 
 3   number_of_sales     5979 non-null   int64  
 4   lowest_sale_price   5979 non-null   int64  
 5   average_sale_price  5979 non-null   int64  
 6   median_sale_price   5979 non-null   int64  
 7   highest_sale_price  5979 non-null   float64
 8   year                5979 non-null   int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 420.5+ KB
Out[21]:
number_of_sales lowest_sale_price average_sale_price median_sale_price highest_sale_price year
count 5,979.000 5,979.000 5,979.000 5,979.000 5,979.000 5,979.000
mean 35.560 712,938.596 1,207,114.844 1,150,549.290 2,075,932.909 2,014.521
std 50.761 1,509,407.390 1,989,596.662 1,868,417.469 3,667,126.216 2.860
min 1.000 150,000.000 167,500.000 167,500.000 176,750.000 2,010.000
25% 5.000 212,000.000 453,514.000 443,250.000 700,000.000 2,012.000
50% 16.000 290,000.000 624,008.000 611,500.000 990,000.000 2,015.000
75% 46.000 500,000.000 962,607.000 920,000.000 1,860,000.000 2,017.000
max 519.000 26,893,000.000 26,893,000.000 26,893,000.000 77,100,000.000 2,019.000

The dataset contains info on home prices by:

  • boroughs
  • neighborhoods
  • types of homes
  • years

While we definitely want all boroughs and neighborhoods, we need to decide which type of home should be used as an indicator for property prices, and which years should be considered.

Home price information is broken down into:

  • lowest price
  • average price
  • median price
  • highest price

For the purpose of this analysis we will only look at average prices.

In [22]:
# plotting with matplotlib for a change
temp_pivot = housing.pivot_table(columns="type_of_home",index="year",
                                 values="number_of_sales", aggfunc=sum)
ax = temp_pivot.plot(kind="bar",
               title="Number of recorded sales by year and type of home")
plt.legend(bbox_to_anchor=(1,1))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()

I will use one family houses, because they are the most common type of property sold. They probably also have the most real-life relevance when a family considers moving.
Now we need to decide which years to include in our analysis.

In [23]:
# removing the columns and records we don't need
keep_cols = ['borough',
 'neighborhood',
 'type_of_home',
 'number_of_sales',
 'average_sale_price',
  'year']
housing_ofh = housing.loc[housing["type_of_home"] == "01 ONE FAMILY HOUSE",keep_cols]

# checking by year for how many neighborhoods no sale was recorded
print("No of neighborhoods per year with no recorded one family house sale")
housing_ofh.pivot_table(index='neighborhood', values="type_of_home", columns="year", aggfunc="count").isna().sum()
No of neighborhoods per year with no recorded one family house sale
Out[23]:
year
2010    23
2011    29
2012    19
2013    22
2014    25
2015    20
2016    21
2017    25
2018    23
2019    22
dtype: int64

The dataset by default has at most one record per year, type of housing and neighborhood. A quick analysis shows that for every year there are neighborhoods with no recorded price. I have therefore decided to include all years, to hopefully have information on as many neighborhoods as possible.
Next up: calculating an average price and number of sales for each neighborhood.

In [24]:
# calculate average price per neighborhood
ofh_price = housing_ofh.groupby(["neighborhood"]).mean().reset_index().drop(columns=["year","number_of_sales"])
ofh_price.rename(inplace=True, columns={"average_sale_price":"avg_sale_price"})

# calculate average number of sales per year per neighborhood
# this can't be done in the same step as the price, because in the original dataset
# years without a sale are not recorded as NaN but are missing entirely
ofh_num_sales = housing_ofh.groupby(["neighborhood"]).sum().reset_index().drop(columns=["year","average_sale_price"])
ofh_num_sales["sales_per_year"] = ofh_num_sales["number_of_sales"]/len(housing_ofh["year"].unique())
ofh_num_sales.drop(inplace=True,columns="number_of_sales")

# putting it together
ofh = pd.merge(left=ofh_price, right=ofh_num_sales, on="neighborhood", indicator=True)
In [25]:
# generate histogram for sales per year
ofh["sales_per_year"].iplot(kind="histogram", asFigure=True, bins=150, 
                            title="NYC one family house sales by neighborhood",xTitle='Sales per year', yTitle='Count')
In [26]:
# generate histogram for average sale price per neighborhood
ofh["avg_sale_price"].iplot(kind="histogram", asFigure=True, bins=50, 
                            title="NYC one family house sales by neighborhood",xTitle='Average price, USD', yTitle='Count')
In [27]:
# exclude neighborhoods outside the price and liquidity thresholds
ofh = pd.merge(left=ofh_price, right=ofh_num_sales, on="neighborhood")
print("All neighborhoods: {}".format(ofh.shape[0]))
ofh = ofh[ofh["sales_per_year"] >= 5]
print("After threshold for # of sales: {}".format(ofh.shape[0]))
ofh = ofh[ofh["avg_sale_price"] < 1500000]
print("After threshold for price: {}".format(ofh.shape[0]))
All neighborhoods: 242
After threshold for # of sales: 187
After threshold for price: 175

Neighborhoods were excluded from the dataset for two reasons:

  • they are on average too expensive for the average person (more than USD 1.5m)
  • the markets themselves are not very liquid, i.e. too few houses come on the market to realistically buy one (fewer than 5 sales per year)

Applying these filters, 175 of 242 neighborhoods (72%) remain.

Getting schools' neighborhoods through reverse geocoding

Geocoding was implemented with the help of this nice dataquest user project (which also has some pointers on using Beautiful Soup for web scraping) and this medium post.

In [28]:
# consider converting to a markdown cell to avoid rerunning - it takes quite long
# build coordinates as inputs for reverse geocoding
combined["coordinates"] = combined["lat"].astype(
    str)+","+combined["lon"].astype(str)

# make API request
# read documentation https://geopy.readthedocs.io/
locator = Nominatim(user_agent="Geocoder_for_DQ", timeout=10)

# adding delays between queries and retrying failed requests
rgeocode = RateLimiter(locator.reverse, min_delay_seconds=0.1)

# getting FutureWarning from tqdm ¯\_(ツ)_/¯ used warnings lib
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=FutureWarning)
    tqdm_notebook.pandas()
    # progress_apply from tqdm for progressbar in Jupyter
    combined["rgeocode"] = combined["coordinates"].progress_apply(rgeocode)

In [29]:
# paste info from location object into separate columns
combined["rgeocode_raw"] = [x.raw for x in combined["rgeocode"]]
combined["address_dict"] = [x.get("address") for x in combined["rgeocode_raw"]]
combined["neighborhood"] = [x.get("neighbourhood") for x in combined["address_dict"]]
combined["suburb"] = [x.get("suburb") for x in combined["address_dict"]]

# take suburb if neighborhood not found  
print ("impute {} missing neighborhoods from suburb".format(combined["neighborhood"].isna().sum()))
combined["neighborhood"] = combined["neighborhood"].fillna(combined["suburb"])
print ("{} missing neighborhoods remaining and filled with '0'".format(combined["neighborhood"].isna().sum()))
combined["neighborhood"] = combined["neighborhood"].fillna(0)
impute 103 missing neighborhoods from suburb
0 missing neighborhoods remaining and filled with '0'
In [31]:
# drop dict and location columns because they are not needed anymore
combined_incl_locations = combined.copy() # but keep them in a backup variable because the API call takes so long
combined = combined.drop(columns=["rgeocode", "rgeocode_raw", "address_dict"])

# prepare datatype and formatting for merging 
ofh["neighborhood"] = ofh["neighborhood"].astype(str).str.lower().str.strip()
combined["neighborhood"] = combined["neighborhood"].astype(str).str.lower().str.strip()

# some schools' geocoded neighborhoods were not found in the property dataset
# I did some manual lookups to arrive at this mapping for some of the unmatched schools
schools_new_neighborhoods = {'bedford park': 'bedford park/norwood',
                              'bedford-stuyvesant': 'bedford stuyvesant',
                                'castle hill': 'castle hill/unionport',
                                'cobble hill historic district': 'cobble hill-west',
                                'east flatbush': 'flatbush-east',
                                'flatbush': 'flatbush-central',
                                'harlem': 'harlem-central',
                                'kingsbridge': 'kingsbridge/jerome park',
                                'kingsbridge heights': 'kingsbridge hts/univ hts',
                                'melrose': 'melrose/concourse',
                                'midtown': 'midtown east',
                                'midtown south': 'midtown west',
                                'morris heights': 'highbridge/morris heights',
                                'morrisania': 'morrisania/longwood',
                                'mott haven': 'mott haven/port morris',
                                'norwood': 'bedford park/norwood',
                                'port morris': 'mott haven/port morris',
                                'saint george': 'new brighton-st. george',
                                'upper west side': 'upper west side (79-96)',
                                'west brighton': 'west new brighton',
                                'williams bridge': 'williamsbridge',
                                'williamsburg': 'williamsburg-central'}
combined["neighborhood"].replace(schools_new_neighborhoods,inplace=True)

# If I were to do it properly, I would get shape objects of all neighborhoods
# in the property dataset and do spatial matching with the schools' coordinates
# (see the sketch after this cell)...

# now finally add property data to school dataset
combined_prop = pd.merge(left=combined, right=ofh, on="neighborhood",how="left")
share_prop = 100-combined_prop["avg_sale_price"].isna().sum()/combined_prop.shape[0]*100
print("{:,.0f}% of schools have property values associated with them".format(share_prop))
42% of schools have property values associated with them
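The "proper" spatial-matching approach mentioned in the comment above could look roughly like the sketch below. This is only an illustration: geopandas is not used anywhere else in this notebook, and "nyc_neighborhoods.geojson" is a hypothetical placeholder for a file with one polygon per neighborhood of the property dataset.

import geopandas as gpd

# hypothetical polygon file with a "neighborhood" attribute per shape
hoods = gpd.read_file("nyc_neighborhoods.geojson").to_crs("EPSG:4326")

# turn the school coordinates into point geometries
schools_gdf = gpd.GeoDataFrame(
    combined,
    geometry=gpd.points_from_xy(combined["lon"], combined["lat"]),
    crs="EPSG:4326",
)

# point-in-polygon join: every school is assigned the neighborhood polygon it falls into
# (older geopandas versions use op="within" instead of predicate="within")
schools_with_hood = gpd.sjoin(schools_gdf, hoods, how="left", predicate="within")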

We now have property values for (some) schools. Let's plot them and see where students score relatively well on SAT exams compared to how expensive property in the neighborhood is.

In [32]:
# compute average SAT-score by neighborhood
combined_prop = combined_prop[combined_prop["avg_sale_price"].notna()]
analysis_cols = ["neighborhood", "boro", "sat_score", "avg_sale_price",'saf_tot_11']
combined_prop_analysis = combined_prop.loc[:,analysis_cols]
hood_schools = combined_prop_analysis.groupby(["boro","neighborhood"]).mean().reset_index()
hood_schools["hovertext"] = hood_schools["neighborhood"]+" | avg. safety score: "+hood_schools["saf_tot_11"].round(1).astype(str)

with warnings.catch_warnings(): # to catch FutureWarning about np.module being used in pandas
    df_temp = hood_schools
    x_temp = "avg_sale_price"
    y_temp = "sat_score"
    cat_temp = "boro"
    text_temp = "hovertext"
    warnings.filterwarnings("ignore",category=FutureWarning)
    df_temp.iplot(
        x=x_temp,
        y=y_temp,
        categories=cat_temp,
        text=text_temp,
        vline= {"x":df_temp[x_temp].mean(),
                "color":"#000000",
                "dash": "dash"
               },
        hline={"y":df_temp[y_temp].mean(),
               "color":"#000000",
               "dash": "dash"
              },
        layout = dict(
            hovermode="closest",
            title='Property value vs. SAT score by neighborhood <br> dashed lines: average of entire dataset'),
        xTitle='Average property value one family home, USD',
        yTitle='Avg. SAT score'
    )

Sheepshead Bay and Jamaica both have relatively low property prices and good public schools. Students in Bedford Park / Norwood feel less safe, so that neighborhood should maybe not be the first choice. As the graph below shows, property prices in one of the recommended neighborhoods (Jamaica) have increased above average over the past decade.

Note: The obvious best pick "Oakwood" is home to the elite school "STATEN ISLAND TECHNICAL HIGH SCHOOL" which is open to all NYC residents and has strict admissions criteria.

In [33]:
ofh_average = housing_ofh.pivot_table(margins=True,
                                      index="neighborhood", 
                                      columns="year",
                                      values="average_sale_price")
recommended_hoods = ["SHEEPSHEAD BAY", "JAMAICA", "All"]

# build indexchart
price_index = ofh_average.div(ofh_average[2010], axis=0).loc[recommended_hoods,:]*100
title="Development of property prices in recommended neighborhoods, <br>compared to all NYC neighborhoods"
price_index.transpose().iplot(kind="line",title=title, yTitle="2010 = 100%", mode="lines+markers")

Conclusion

Note: This was mainly a technical challenge with the aim to learn data analysis techniques and tools in Python. Please refer to "results" at the beginning of this document or the respective headings of the individual sections for an overview of insights gained from the data.

In this guided project we have applied the learnings of Step 2 of dataquest's "Data Analyst" learning path. Some of the concepts include:

  • data aggregation using pandas methods like 'pivot_table' and 'groupby'
  • regular expressions
  • vectorized string operations
  • list comprehensions and the apply function
  • combining datasets with joins and concatenation
  • plotting with matplotlib

Additionally I explored:

  • interactive charts with plotly and cufflinks
  • interactive maps with folium
  • writing my own function to expedite data exploration ('describe_plus')
  • accessing data through APIs using sodapy
  • reverse geocoding with geopy