We have a number of datasets from NYC Open Data that can be used to explore various characteristics of schools in the city. Since SAT scores are available for 2012, all the other data cover roughly the same period. Overall, we have 8 distinct datasets, which we want to merge together based on school codes; the file names are listed in the code below.
Let's start with downloading the data and preparing the school codes for merging.
import pandas as pd
import numpy as np
import re
# set the list of file names for downloading
data_files = [
    "2010__AP__College_Board__School_Level_Results.csv",
    "2010-2011_Class_Size_-_School-level_detail.csv",
    "2006_-_2012_School_Demographics_and_Accountability_Snapshot.csv",
    "2005-2010_Graduation_Outcomes_-_School_Level.csv",
    "2012_SAT_Results.csv",
    "Archived_DOE_High_School_Directory_2014-2015.csv",
]
# create dictionary with downloaded datasets
data = {}
for f in data_files:
    d = pd.read_csv("/Users/mac/downloads/{0}".format(f))
    # keep a local copy of each dataset in the project folder
    d.to_csv("Projects_data/{0}".format(f))
    data[f.replace(".csv", "")] = d
# download additional data on attendance and surveys of students, parents and teachers
School_Attendance_and_Enrollment = pd.read_csv("/Users/mac/downloads/2010_-_2011_School_Attendance_and_Enrollment_Statistics_by_District.csv")
School_Attendance_and_Enrollment.to_csv("Projects_data/2010_-_2011_School_Attendance_and_Enrollment_Statistics_by_District.csv")
# districts are listed in order in this file, so row index + 1 gives the district (CSD) number
School_Attendance_and_Enrollment['CSD'] = School_Attendance_and_Enrollment.index + 1
pupil_teacher_ratio = data["2010-2011_Class_Size_-_School-level_detail"][data["2010-2011_Class_Size_-_School-level_detail"]['SCHOOLWIDE PUPIL-TEACHER RATIO'].notna()]
data["2010-2011_Class_Size_-_School-level_detail"] = pd.merge(data["2010-2011_Class_Size_-_School-level_detail"], pupil_teacher_ratio[['SCHOOL CODE','SCHOOLWIDE PUPIL-TEACHER RATIO']], on='SCHOOL CODE', how='left')
data["2010-2011_Class_Size_-_School-level_detail"] = data["2010-2011_Class_Size_-_School-level_detail"].rename(columns={'SCHOOLWIDE PUPIL-TEACHER RATIO_y': 'Student-Teacher Ratio'})
all_survey = pd.read_csv("/Users/mac/downloads/masterfile11_gened_final.txt", delimiter="\t", encoding='windows-1252')
all_survey.to_csv("Projects_data/masterfile11_gened_final.csv")
d75_survey = pd.read_csv("/Users/mac/downloads/masterfile11_d75_final.txt", delimiter="\t", encoding='windows-1252')
all_survey.to_csv("Projects_data/masterfile11_d75_final.csv")
survey = pd.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
# create a list of survey topics to extract for further analysis
survey_fields = [
    "DBN",
    "rr_s",
    "rr_t",
    "rr_p",
    "N_s",
    "N_t",
    "N_p",
    "saf_p_11",
    "com_p_11",
    "eng_p_11",
    "aca_p_11",
    "saf_t_11",
    "com_t_11",
    "eng_t_11",
    "aca_t_11",
    "saf_s_11",
    "com_s_11",
    "eng_s_11",
    "aca_s_11",
    "saf_tot_11",
    "com_tot_11",
    "eng_tot_11",
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
# make school codes comparable for further matching
data["Archived_DOE_High_School_Directory_2014-2015"]["DBN"] = data["Archived_DOE_High_School_Directory_2014-2015"]["dbn"]
def pad_csd(num):
    # left-pad single-digit district numbers with a zero (equivalent to str(num).zfill(2))
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
data["2010-2011_Class_Size_-_School-level_detail"]["padded_csd"] = data["2010-2011_Class_Size_-_School-level_detail"]["CSD"].apply(pad_csd)
data["2010-2011_Class_Size_-_School-level_detail"]["DBN"] = data["2010-2011_Class_Size_-_School-level_detail"]["padded_csd"] + data["2010-2011_Class_Size_-_School-level_detail"]["SCHOOL CODE"]
Next, we also need to make the SAT and AP scores numeric for further analysis and prepare the schools' longitude and latitude columns so that we can locate them on the New York City map.
# convert SAT subsections' scores into numeric values and calculate the total score
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["2012_SAT_Results"][c] = pd.to_numeric(data["2012_SAT_Results"][c], errors="coerce")
data['2012_SAT_Results']['sat_score'] = data['2012_SAT_Results'][cols[0]] + data['2012_SAT_Results'][cols[1]] + data['2012_SAT_Results'][cols[2]]
# convert AP scores into numeric values
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["2010__AP__College_Board__School_Level_Results"][col] = pd.to_numeric(data["2010__AP__College_Board__School_Level_Results"][col], errors="coerce")
# create functions to extract latitude and longitude values out of rows and apply them
def find_lat(loc):
    # extract the latitude from a location string ending in "(lat, lon)"
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat
def find_lon(loc):
    # extract the longitude from the same "(lat, lon)" pattern
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon
data["Archived_DOE_High_School_Directory_2014-2015"]["lat"] = data["Archived_DOE_High_School_Directory_2014-2015"]["Location 1"].apply(find_lat)
data["Archived_DOE_High_School_Directory_2014-2015"]["lon"] = data["Archived_DOE_High_School_Directory_2014-2015"]["Location 1"].apply(find_lon)
data["Archived_DOE_High_School_Directory_2014-2015"]["lat"] = pd.to_numeric(data["Archived_DOE_High_School_Directory_2014-2015"]["lat"], errors="coerce")
data["Archived_DOE_High_School_Directory_2014-2015"]["lon"] = pd.to_numeric(data["Archived_DOE_High_School_Directory_2014-2015"]["lon"], errors="coerce")
In our further analysis we will consider only students in grades 9-12 who are enrolled in the General Education program, together with school demographics data for the 2011-2012 school year. In addition, we select only the 2006 Total Cohort of students, i.e. those who entered 9th grade at a given school in 2006 and graduated in 2010.
# choose only students in grades 9-12 enrolled in the general education program
class_size = data["2010-2011_Class_Size_-_School-level_detail"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
# aggregate values in Class Size file by school codes
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["2010-2011_Class_Size_-_School-level_detail"] = class_size
# choose only values for 2011 and 2012 in School Demographics file
data["2006_-_2012_School_Demographics_and_Accountability_Snapshot"] = data["2006_-_2012_School_Demographics_and_Accountability_Snapshot"][data["2006_-_2012_School_Demographics_and_Accountability_Snapshot"]["schoolyear"] == 20112012]
# choose only the most recent cohort of students
data["2005-2010_Graduation_Outcomes_-_School_Level"] = data["2005-2010_Graduation_Outcomes_-_School_Level"][data["2005-2010_Graduation_Outcomes_-_School_Level"]["Cohort"] == "2006"]
data["2005-2010_Graduation_Outcomes_-_School_Level"] = data["2005-2010_Graduation_Outcomes_-_School_Level"][data["2005-2010_Graduation_Outcomes_-_School_Level"]["Demographic"] == "Total Cohort"]
Once all the datasets are prepared, we combine them into a single dataset for further analysis.
# create a merged dataset
combined = data["2012_SAT_Results"]
combined = combined.merge(data["2010__AP__College_Board__School_Level_Results"], on="DBN", how="left").copy()
combined = combined.merge(data["2005-2010_Graduation_Outcomes_-_School_Level"], on="DBN", how="left").copy()
# create list of datasets for merging and loop through it
to_merge = ["2010-2011_Class_Size_-_School-level_detail", "2006_-_2012_School_Demographics_and_Accountability_Snapshot", "survey", "Archived_DOE_High_School_Directory_2014-2015"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner").copy()
combined = combined.merge(School_Attendance_and_Enrollment, on="CSD", how="left").copy()
# create function to extract school district codes
def get_first_two_chars(dbn):
    return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
print('Number of rows and columns in combined dataset:', combined.shape)
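Before moving on, a quick sanity check of the merge result is worthwhile; a sketch looking at duplicated school codes and the most incomplete columns:
# duplicated DBNs would indicate a one-to-many merge gone wrong
print('Duplicated DBNs:', combined['DBN'].duplicated().sum())
print(combined.isnull().sum().sort_values(ascending=False).head())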
After compiling the dataset, let's check the variable of our main interest - the distribution of SAT scores among New York City schools.
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
fig = plt.figure(figsize=(17,8))
sns.set(palette='summer')
sns.distplot(combined['sat_score'], bins=60, hist=True)
plt.title("Distribution of SAT Scores among New York Schools",
fontweight="bold", fontsize=15, y=1.06)
plt.xlabel('SAT Score', labelpad=20)
plt.yticks([])
plt.show()
The histogram tells us that the distribution of SAT scores is unimodal and skewed to the right: not that many schools, on average, perform above the 1600 threshold.
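To back the visual impression with numbers, we can also check the summary statistics and the skewness coefficient; a minimal sketch using pandas' built-in skew():
# a positive skewness coefficient confirms the long right tail seen on the histogram
print(combined['sat_score'].describe())
print('Skewness:', round(combined['sat_score'].skew(), 2))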
Let's start our analysis by exploring correlations between the major variables. The NYC school survey contains data on teachers', students' and parents' perceptions of safety, communication, engagement and academic expectations. Investigating their relationship with SAT scores may help us better understand the determinants of school performance.
# create correlation matrix
correlations = combined.corr()
survey_fields.remove("DBN")
print('Variables correlation with the SAT score:\n\n',correlations["sat_score"])
fig = plt.figure(figsize=(15,12))
sns.barplot(y=survey_fields, x=correlations["sat_score"][survey_fields], palette='summer')
# list of variables to show on the graph with their respective correlations
y_ticks = [
    'Student Response Rate',
    'Teacher Response Rate',
    'Parent Response Rate',
    'Number of student respondents',
    'Number of teacher respondents',
    'Number of parent respondents',
    'Safety and Respect score based on parent responses',
    'Communication score based on parent responses',
    'Engagement score based on parent responses',
    'Academic expectations score based on parent responses',
    'Safety and Respect score based on teacher responses',
    'Communication score based on teacher responses',
    'Engagement score based on teacher responses',
    'Academic expectations score based on teacher responses',
    'Safety and Respect score based on student responses',
    'Communication score based on student responses',
    'Engagement score based on student responses',
    'Academic expectations score based on student responses',
    'Safety and Respect total score',
    'Communication total score',
    'Engagement total score',
    'Academic Expectations total score',
]
plt.title("School Survey Results' Correlation with SAT Scores" ,
fontweight="bold", fontsize = 20, y=1.03)
locs, labels=plt.yticks()
plt.yticks(locs, y_ticks)
plt.xlabel('Correlation with SAT Scores',labelpad=20, fontsize=15)
plt.show()
There is a significant positive correlation between SAT scores and the number of survey responses, which is closely related to the total number of students in a school: the best public schools usually enroll over 1,000 students, while the average enrollment per school in New York City is around 500. This result is not surprising, as many such schools are magnet schools, i.e. they may draw students from outside defined school zones and provide a specialised curriculum, while many parents all over the city, quite naturally, wish a better future for their children and try to place them in such schools.
We may also observe that the students' response rate is positively correlated with SAT scores, which might be explained by self-selection: students confident in their test results may also be more responsive to surveys. In addition, teachers' and students' perception of the safety and respect of their environment seems to be related to higher SAT scores, which is less the case for parents, who are not direct participants in the studying process. Therefore, safety or, say, the general well-being of the surroundings matters: the total score for this indicator has the highest correlation among the totals, indicating its higher overall (i.e. across all survey participants) significance. Finally, it is quite noteworthy that students' academic expectations correlate with SAT scores more strongly than the analogous indicators for teachers or parents. In theory, this may mean that students are able to assess their abilities more precisely than even the teachers who teach them.
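To make the ranking behind these observations explicit, we can sort the survey fields by their correlation with the SAT score; a short sketch reusing the correlation matrix computed above:
# rank survey indicators by their correlation with SAT scores
print(correlations['sat_score'][survey_fields].sort_values(ascending=False))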
Knowing that, among the totals considered, the Safety and Respect total score correlates the most with schools' SAT scores, let's explore this relationship in greater detail with a scatter plot.
fig = plt.figure(figsize=(7,5))
combined = combined.rename(columns={'saf_tot_11':'Safety and Respect Total Score', 'sat_score':'SAT Score'})
sns.regplot(x='Safety and Respect Total Score', y='SAT Score', data=combined)
plt.title("Environment Safety and Respect Perception Correlation with SAT Scores" ,
fontweight="bold", fontsize = 15, y=1.06)
plt.show()
We can see that there is some positive correlation between safety and SAT scores, with a number of outliers in the upper-right corner. At the same time, there is also a group of schools with a comparatively high safety level but lower SAT scores, in the lower-right part of the plot. Nonetheless, it doesn't work the other way around: none of the schools with a Safety and Respect total score lower than 6 has an SAT score higher than 1400. Presumably, this may be explained by the fact that unsafe districts are usually home to lower-income residents, and, generally speaking, it is quite rare to encounter schools with well-performing students there.
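This observation is easy to verify directly on the combined dataset; a quick check using the columns renamed in the cell above:
# maximum SAT score among schools rated below 6 on safety and respect
low_safety = combined[combined['Safety and Respect Total Score'] < 6]
print('Max SAT score among low-safety schools:', low_safety['SAT Score'].max())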
To check this conjecture, let's inspect how the Safety and Respect Score is distributed in New York City by school districts.
from mpl_toolkits.basemap import Basemap
# aggregated data by school districts for mapping
combined_grouped = combined.groupby('school_dist').agg(np.mean)
combined_grouped.reset_index(inplace=True)
fig = plt.figure(figsize = (18, 18))
longitudes = list(combined_grouped['lon'])
latitudes = list(combined_grouped['lat'])
ax = fig.add_subplot(121)
ax.set_title("Safety and Respect Score by School Districts", fontweight="bold", fontsize = 15, y=1.05)
# set map boundaries
map = Basemap(llcrnrlat=40.496044,
              urcrnrlat=40.915256,
              llcrnrlon=-74.255735,
              urcrnrlon=-73.700272, epsg=2263)
# set map background and locate school districts with their mean values
map.arcgisimage(service='World_Topo_Map', xpixels = 1500, verbose= True)
map.scatter(longitudes, latitudes, s=50, zorder=2, latlon=True, c=combined_grouped["Safety and Respect Total Score"], cmap='winter')
map.readshapefile('/Users/mac/downloads/geo_export_9b152510-8274-4e63-bfa9-0e6b8d75f118', name='NYC_districts', color='grey')
map.drawcounties(linewidth=0.5, linestyle='solid', color='k', antialiased=1, facecolor='none', ax=None, zorder=2, drawbounds=False)
# set borough names
plt.annotate('Brooklyn', xy=(0.45, 0.35), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Queens', xy=(0.75, 0.55), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Bronx', xy=(0.65, 0.85), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Manhattan', xy=(0.43, 0.65), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Staten Island', xy=(0.1, 0.25), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.colorbar(label='Safety and Respect Score',fraction=0.046, pad=0.04, orientation='horizontal')
# the same procedure is done for the second map
ax = fig.add_subplot(122)
ax.set_title('Average SAT Scores by School Districts', fontweight="bold", fontsize = 15, y=1.05)
map = Basemap(llcrnrlat=40.496044,
              urcrnrlat=40.915256,
              llcrnrlon=-74.255735,
              urcrnrlon=-73.700272, epsg=2263)
#http://server.arcgisonline.com/arcgis/rest/services
map.arcgisimage(service='World_Topo_Map', xpixels = 1500, verbose= True)
map.scatter(longitudes, latitudes, s=50, zorder=2, latlon=True, c=combined_grouped["SAT Score"], cmap='winter')
map.readshapefile('/Users/mac/downloads/geo_export_9b152510-8274-4e63-bfa9-0e6b8d75f118', name='NYC_districts', color='grey')
map.drawcounties(linewidth=0.5, linestyle='solid', color='k', antialiased=1, facecolor='none', ax=None, zorder=2, drawbounds=False)
plt.annotate('Brooklyn', xy=(0.45, 0.35), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Queens', xy=(0.75, 0.55), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Bronx', xy=(0.65, 0.85), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Manhattan', xy=(0.43, 0.65), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.annotate('Staten Island', xy=(0.1, 0.25), xycoords='axes fraction', fontweight="bold",fontsize = 15)
plt.colorbar(label='Average SAT Score',fraction=0.046, pad=0.04, orientation='horizontal')
plt.show()
From the above plots we may note that, overall, Brooklyn's school districts are perceived as the least safe, followed by the Bronx and Queens. Nonetheless, two of the safest school districts are also located in Brooklyn, indicating that this measure is not homogeneous within boroughs. On the other hand, the best-performing school district in terms of SAT scores is located in Brooklyn, closely followed by other districts in Staten Island and Queens. Overall, these maps show some divergence in the geographical distribution of the two variables, even though the overall correlation is high. It is then possible to assume that richer districts are often safer and have higher education-related funding, so the best-performing schools are located there; i.e. the correlation between standard of living and academic performance should be even more pronounced than that with safety perception alone.
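For reference, the districts behind these visual observations can be listed directly from the aggregated data; a short sketch using the columns created earlier:
# school districts ranked by perceived safety, with their average SAT scores
by_safety = combined_grouped.sort_values('Safety and Respect Total Score')
print(by_safety[['school_dist', 'Safety and Respect Total Score', 'SAT Score']].head(3))
print(by_safety[['school_dist', 'Safety and Respect Total Score', 'SAT Score']].tail(3))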
To confirm or reject the above proposition, we can explore in greater detail the average price per square foot across New York City as one possible proxy for the distribution of wealth at the neighborhood level. For that purpose we use 2012 annualized property sales data from the NYC Department of Finance. To make our analysis more representative and accurate, we get rid of unrealistically small and extremely large values for sale prices and footage, so as to take into account only typical properties. In addition, we supply the map with data on Adjusted Gross Income per Tax Return, reported by the Internal Revenue Service at the ZIP code level and defined as gross individual income reported by taxpayers, adjusted for education expenses, alimony payments and contributions to a retirement account.
# download property data
manhattan = pd.read_excel('https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2012/2012_manhattan.xls')
bronx = pd.read_excel('https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2012/2012_bronx.xls')
brooklyn = pd.read_excel('https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2012/2012_brooklyn.xls')
staten_island = pd.read_excel('https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2012/2012_statenisland.xls')
queens = pd.read_excel('https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2012/2012_queens.xls')
# merge data and get rid of outliers and unrealistic values
property_data = pd.concat([manhattan, bronx, brooklyn, queens, staten_island], axis=0)
property_data = property_data.iloc[:,[10,14,19]].dropna()
property_data.columns = ['ZIP Code','Land Square Feet','Sale price']
property_data.to_csv('Projects_data/NYC_property_data')
property_data = property_data[(property_data['ZIP Code'] != 'ZIP CODE\n') & (property_data['ZIP Code'] != 0)]
property_data = property_data.astype(int)
property_data = property_data[(property_data['Sale price']>50000) & (property_data['Sale price']<5000000)]
property_data = property_data[(property_data['Land Square Feet']>500) & (property_data['Land Square Feet']<5000)]
property_data['Price Per Square Foot'] = property_data['Sale price']/property_data['Land Square Feet']
property_data_aggregated = round(property_data.groupby('ZIP Code').agg(np.mean),2)
property_data_aggregated.reset_index(inplace=True)
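Before mapping, it can also be useful to peek at the most expensive ZIP codes in the aggregated table; a minimal sketch:
# ZIP codes with the highest average price per square foot
print(property_data_aggregated.sort_values('Price Per Square Foot', ascending=False).head())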
# download income data
adjusted_gross_income = pd.read_excel('https://www.irs.gov/pub/irs-soi/12zp33ny.xls')
adjusted_gross_income = adjusted_gross_income.iloc[5:,[0,1,2,9]]
adjusted_gross_income.columns = ['ZIP Code','Size of adjusted gross income','Number of Returns (AGI)','Adjusted Gross Income']
adjusted_gross_income.to_csv('Projects_data/NYC_adjusted_gross_income')
adjusted_gross_income['Size of adjusted gross income'] = adjusted_gross_income['Size of adjusted gross income'].fillna('Total')
adjusted_gross_income = adjusted_gross_income[(adjusted_gross_income['Number of Returns (AGI)']>0) & (adjusted_gross_income['Adjusted Gross Income']>0)]
adjusted_gross_income['Adjusted Gross Income per Tax Return'] = round(adjusted_gross_income['Adjusted Gross Income'].astype(int)*1000/adjusted_gross_income['Number of Returns (AGI)'].astype(int),2)
adjusted_gross_income = adjusted_gross_income[adjusted_gross_income['Size of adjusted gross income']=='Total'].iloc[:,[0,-1]]
combined = pd.merge(combined, adjusted_gross_income, left_on='zip', right_on='ZIP Code', how='left')
# ZIP Code Definitions of New York City Neighborhoods (NYC Department of Health) webfile was manually
# adjusted to make a correspondence between ZIP codes and neighborhoods for mapping
NYC_zips_dist = pd.read_excel('/Users/mac/downloads/NewYorkZips.xlsx')
NYC_zips_dist.to_csv('Projects_data/NYC_zips_dist')
property_data_merged = pd.merge(NYC_zips_dist, property_data_aggregated, on='ZIP Code', how='left')
combined = pd.merge(combined, property_data_aggregated, left_on='zip', right_on='ZIP Code', how='left')
# download locations of all ZIP codes in New York
import json
from urllib.request import urlopen
with urlopen('https://raw.githubusercontent.com/gdobler/nycep/master/d3/data/nyc-zip-code.json') as f:
    geodata = json.load(f)
# create list with all New York ZIP codes for complete mapping
zips = []
for feature in geodata['features']:
    zips.append(feature['properties']['ZIP'])
# create dataset for mapping
zips = pd.DataFrame(zips).astype(int)
zips.columns = ['ZIP Code']
GEO_zips = zips.drop_duplicates()
property_data_for_graph = pd.merge(GEO_zips, property_data_merged, on='ZIP Code', how='left')
property_data_for_graph = pd.merge(property_data_for_graph, adjusted_gross_income, on='ZIP Code', how='left')
import plotly.express as px
# set variable to show on the map
property_data_for_graph['Neighborhood'] = property_data_for_graph['Neighborhood'].fillna('No data')
property_data_for_graph['color'] = property_data_for_graph['Price Per Square Foot'].fillna(-500) # set minimal color for colorscale
property_data_for_graph['Avg. Price Per Square Foot'] = property_data_for_graph['Price Per Square Foot'].fillna('No data')
property_data_for_graph['Gross Income per Tax Return'] = property_data_for_graph['Adjusted Gross Income per Tax Return'].fillna('No data')
combined['School Average SAT Score'] = combined['SAT Score'].fillna('No data')
combined['SAT Critical Reading Score'] = combined['SAT Critical Reading Avg. Score'].fillna('No data')
combined['SAT Math Score'] = combined['SAT Math Avg. Score'].fillna('No data')
combined['SAT Writing Score'] = combined['SAT Writing Avg. Score'].fillna('No data')
combined['Borough'] = combined['boro']
combined['Total Enrollment'] = combined['total_enrollment']
combined['marker_size'] = combined['SAT Score'].fillna(800)/100 # adjust sizes of circles mapping schools
combined['color'] = combined['SAT Score']+550 # adjust school colors for the colorscale based on SAT scores
# plot the map
fig = px.choropleth_mapbox(property_data_for_graph, geojson=geodata, color="