#!/usr/bin/env python
# coding: utf-8

# # Analyzing NYC High School SAT Scores
# 
# The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's fairly important to perform well on it.
# 
# The test consists of three sections, each of which has 800 possible points. The combined score is out of 2,400 possible points (while this number has changed a few times, the data set for our project is based on 2,400 total points). Organizations often rank high schools by their average SAT scores. The scores are also considered a measure of overall school district quality.
# 
# In this project I am going to investigate the correlation between SAT scores in NYC high schools and various demographics such as race, gender and income.
# 
# All of the data sets used in this project have been taken from https://data.cityofnewyork.us/.
# 
# Here are the details all of the data sets I'll be using:
# 
# * SAT scores by school - SAT scores for each high school in New York City https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4
# 
# * School attendance - Attendance information for each school in New York City https://data.cityofnewyork.us/Education/2010-2011-School-Attendance-and-Enrollment-Statist/7z8d-msnt
# 
# * Class size - Information on class size for each school https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3
# 
# * AP test results - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject) https://data.cityofnewyork.us/Education/AP-College-Board-2010-School-Level-Results/itfs-ms3e
# 
# * Graduation outcomes - The percentage of students who graduated, and other outcome information https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a
# 
# * Demographics - Demographic information for each school https://data.cityofnewyork.us/Education/School-Demographics-and-Accountability-Snapshot-20/ihfw-zy9j
# 
# * School survey - Surveys of parents, teachers, and students at each school https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8
# 
# Basic research into New York, SAT's and the data has led to the following information:
# 
# Only high school students take the SAT, so I'llfocus on high schools.
# New York City is made up of five boroughs, which are essentially distinct regions.
# New York City schools fall within several different school districts, each of which can contains dozens of schools.
# Our data sets include several different types of schools. We'll need to clean them so that we can focus on high schools only.
# Each school in New York City has a unique code called a DBN, or district borough number.
# Aggregating data by district will allow us to use the district mapping data to plot district-by-district differences.
# To start with I am going to read the various data sets into pandas dataframes.

# # Read in the data

# In[1]:


import pandas as pd
import numpy
import re

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d


# # Read in the surveys

# In[2]:


all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey


# # Add DBN columns

# In[3]:


data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]


# # Convert columns to numeric

# In[4]:


cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

def find_lat(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")


# # Condense datasets

# In[5]:


class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]


# # Convert AP scores to numeric

# In[6]:


cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")


# # Combine the datasets

# In[7]:


combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

combined = combined.fillna(combined.mean())
combined = combined.fillna(0)


# # Add a school district column for mapping

# In[8]:


def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)


# # Find correlations

# In[9]:


correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)


# # Plotting survey correlations

# In[10]:


# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")


# In[27]:


get_ipython().run_line_magic('matplotlib', 'inline')
combined.corr()["sat_score"][survey_fields].sort_values(ascending = False).plot.bar()


# There are high correlations between N_s, N_t, N_p and sat_score. Since these columns are correlated with total_enrollment, it makes sense that they would be high.
# 
# It is more interesting that rr_s, the student response rate, or the percentage of students that completed the survey, correlates with sat_score. This might make sense because students who are more likely to fill out surveys may be more likely to also be doing well academically.
# 
# How students and teachers percieved safety (saf_t_11 and saf_s_11) correlate with sat_score. This make sense, as it's hard to teach or learn in an unsafe environment.
# 
# The last interesting correlation is the aca_s_11, which indicates how the student perceives academic standards, correlates with sat_score, but this is not true for aca_t_11, how teachers perceive academic standards, or aca_p_11, how parents perceive academic standards.

# # Exploring safety

# In[13]:


combined.plot.scatter("saf_s_11", "sat_score");


# There appears to be a correlation between SAT scores and safety, although it isn't thatstrong. It looks like there are a few schools with extremely high SAT scores and high safety scores. There are a few schools with low safety scores and low SAT scores. No school with a safety score lower than 6.5 has an average SAT score higher than 1500 or so.

# # Borough safety

# In[14]:


# Compute the average safety score for each borough.
boros = combined.groupby("boro").agg(numpy.mean)["saf_s_11"]
print(boros)


# It looks like Manhattan and Queens tend to have higher safety scores, whereas Brooklyn has low safety scores.

# https://github.com/dataquestio/solutions/blob/master/Mission217Solutions.ipynb

# # Racial differences in SAT scores

# In[15]:


# make bar plot
race_fields = ["white_per", "asian_per", "black_per", "hispanic_per"]
combined.corr()["sat_score"][race_fields].plot.bar();


# It looks like a higher percentage of white or asian students at a school correlates positively with sat score, whereas a higher percentage of black or hispanic students correlates negatively with sat score. This may be due to a lack of funding for schools in certain areas, which are more likely to have a higher percentage of black or hispanic students.

# In[16]:


# Make a scatter plot on low sat score and high values for hispanic
combined.plot.scatter("hispanic_per", "sat_score");


# This scatter plot shows a negative relation with sat score where sat score moving downward with increasing hispanic_per

# In[17]:


# Research any schools with a hispanic_per greater than 95%.
print(combined[combined["hispanic_per"] > 95]["SCHOOL NAME"])


# The schools listed above appear to primarily be geared towards recent immigrants to the US. These schools have a lot of students who are learning English, which would explain the lower SAT scores.

# In[18]:


# Research any schools with a hispanic_per less than 10% and an average SAT score greater than 1800
print(combined[(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)]["SCHOOL NAME"])


# Many of the schools above appear to be specialized science and technology schools that receive extra funding, and only admit students who pass an entrance exam. This doesn't explain the low hispanic_per, but it does explain why their students tend to do better on the SAT -- they are students from all over New York City who did well on a standardized test.

# # Gender differences in SAT scores

# In[20]:


gender_fields = ["male_per", "female_per"]
combined.corr()["sat_score"][gender_fields].plot.bar();


# n the plot above, we can see that a high percentage of females at a school positively correlates with SAT score, whereas a high percentage of males at a school negatively correlates with SAT score. Neither correlation is extremely strong.

# In[21]:


# Investigate schools with high SAT scores and a high female_per, make scatter plot.
combined.plot.scatter("female_per", "sat_score");


# Based on the scatterplot, there doesn't seem to be any real correlation between sat_score and female_per. However, there is a cluster of schools with a high percentage of females (60 to 80), and high SAT scores.

# In[23]:


# Research any schools with a female_per greater than 60% and 
# an average SAT score greater than 1700
print(combined[(combined["female_per"] > 60) & (combined["sat_score"] > 1700)]["SCHOOL NAME"])


# These schools appears to be very selective liberal arts schools that have high academic standards.

# # AP Exam Scores vs SAT Scores

# In[25]:


# calculate the percentage of students in each school that took an AP exam.

combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]

combined.plot.scatter(x='ap_per', y='sat_score');


# It looks like there is a relationship between the percentage of students in a school who take the AP exam, and their average SAT scores. It's not an extremely strong correlation, though.

# # summery of results
# Based on the research, these are our findings:
# 
# * Schools with bigger share of Hispanic and Black students tend o perform poorly in SAT.
# * High expection in academic by student has got a positive impact in SAT.
# * Schools with safer environment tend to produce student that correlate positively with the SAT.Most students in these schools performs really good in SAT.
# * Passing in AP test, projects a good perfomance in the SAT.
# * Those students where english isn't the native language tend to perform poorly in SAT.

# # Conclusion
# In this project, we cleaned,combined,visualized and analyzed different datasets containing informations about SAT scores and demographics in NYC public high schools.We have dig deep into SAT scores trying to find out how it correlates with diffirent fields.And here are the findings;
# 
# * Shools with high percentage of females tend to do better in SAT compared to those with higer percentage of men. This indicates that female do better in SAT compared to men
# * White and Asian students tend to do better in SAT compared to Hispanic and Black students.Poor perfomance of Hispanic students, is caused by recent immigartion in US.
# * Students who always engage positively with the survey also do well in SAT.
# * Those who didn't do well in AP tests tend to do poorly in SAT tests as well
# 
# We also confirm that, schools that do enjoy safety, ignited a good perfomaance in SAT.

# In[ ]: