Data Science Project 1: NYC Dept. of Education Data on NYS Math Exam results¶

Data source: https://nycopendata.socrata.com/data?cat=education
Data description:
NYS Math Exam Results for NYC between Grades 3-8 from 2006-2011.
Proficiency Levels 1 and 2 - below the level for that grade
Proficiency Level 3 - appropriate for that grade
Proficiency Level 4 - above the level appropriate for that grade

What am I exploring?¶

Exam Proficiency levels across all of NYC
Exam Proficiency levels within boroughs
Exam Proficiency levels across all of NYC by gender
Exam Proficiency level changes in NYC from 2006 to 2011

------------------------------------------------------------------------------------------------

1. Acquiring Data¶

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd 
import numpy as np

Manual data fixing done before reading the file here:
Removed the percentage columns (this I know how to do in Python)
Added the Level 1 and Level 2 columns together to create a new one, Level1&2 (this I don't know how to do in Python)
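Both manual steps can also be done in pandas, since adding two columns works element-wise. A minimal sketch, assuming the raw file has columns named `Level1`, `Level2` and `Level1Pct` (hypothetical names, since the original headers aren't shown here) and using a tiny made-up frame in place of the CSV:

```python
import pandas as pd

# Tiny stand-in for the raw CSV; column names and values are illustrative only.
raw = pd.DataFrame({
    "Level1": [10, 20],
    "Level2": [5, 7],
    "Level1Pct": [0.4, 0.5],
})

# Adding two columns works element-wise, row by row.
raw["Level1&2"] = raw["Level1"] + raw["Level2"]

# Drop the percentage column by name.
raw = raw.drop(columns=["Level1Pct"])
print(raw)
```

The same pattern should apply to the real file once the actual column names are substituted.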

In [2]:

BoroMath = pd.read_csv("MathTest_Boro2.csv")
#saved locally in the same folder

------------------------------------------------------------------------------------------------

2. Cleaning Data¶

In [3]:

BoroMath = BoroMath.rename(columns={'Level1&2': 'BelowAverage', 'Level3':'Average', 'Level4':'AboveAverage'})
#renamed the header for ease 

BoroMath.head(3)

Out[3]:

	Borough	Grade	Year	Category	Number Tested	Mean Scale Score	BelowAverage	Average	AboveAverage	Level3&4
0	BRONX	3	2006	Female	7984	664	2546	4232	1206	5438
1	BRONX	3	2006	Male	8461	663	2773	4386	1302	5688
2	BRONX	3	2007	Female	7803	675	1780	4410	1613	6023

3 rows × 10 columns

In [4]:

BoroMath.drop(['Level3&4'], axis=1, inplace=True)
#inplace=True changes BoroMath itself and returns None, so the result is not assigned to a new name
#Got rid of Level3&4 here for practice; could've deleted it in the file like the others.

Creating sorted groups:¶

GroupedMath stores all entries for that borough (grades 3-8, all grades)
AllMath stores all entries for a particular borough for "All Grades"
AllStudents stores all the entries for "All Students" for all of the boroughs

Q: What form does this become? List?
Q: When would I care?
Q: Can I do math operations with this form?
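On the questions above: filtering a DataFrame with a boolean condition gives back another DataFrame (not a list), and a single column of it is a Series. Both support arithmetic directly. A small sketch with made-up rows, not the real data:

```python
import pandas as pd

# Made-up rows that mimic the shape of the borough table.
df = pd.DataFrame({
    "Borough": ["BRONX", "BRONX", "QUEENS"],
    "Grade": ["3", "All Grades", "All Grades"],
    "Average": [100, 300, 500],
})

# A boolean filter returns a DataFrame holding only the matching rows.
bronx = df[df["Borough"] == "BRONX"]
print(type(bronx))

# One column of a DataFrame is a Series.
print(type(bronx["Average"]))

# Math works on the whole column at once.
total = bronx["Average"].sum()
doubled = bronx["Average"] * 2
```

The form matters mostly when choosing methods: a DataFrame has `groupby`, `describe` and column selection, while a plain Python list has none of these.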

In [5]:

GroupedBronxMath = BoroMath[BoroMath['Borough']=='BRONX']
#GroupedBronxMath is every row for a borough (grades 3-8 plus "All Grades")
AllBronxMath = GroupedBronxMath[GroupedBronxMath['Grade']=='All Grades']
#AllBronxMath is all grades for a borough

GroupedBrooklynMath = BoroMath[BoroMath['Borough']=='BROOKLYN']
AllBrooklynMath = GroupedBrooklynMath[GroupedBrooklynMath['Grade']=='All Grades']

GroupedManhattanMath = BoroMath[BoroMath['Borough']=='MANHATTAN']
AllManhattanMath = GroupedManhattanMath[GroupedManhattanMath['Grade']=='All Grades']

GroupedQueensMath = BoroMath[BoroMath['Borough']=='QUEENS']
AllQueensMath = GroupedQueensMath[GroupedQueensMath['Grade']=='All Grades']

GroupedSIMath = BoroMath[BoroMath['Borough']=='STATEN ISLAND']
AllSIMath = GroupedSIMath[GroupedSIMath['Grade']=='All Grades']

AllStudents = BoroMath[BoroMath['Grade']=='All Grades']

#There's 12 per borough
#6 per gender
#2 per year

All Students¶

In [6]:

AllStudents.head(12)

Out[6]:

	Borough	Grade	Year	Category	Number Tested	Mean Scale Score	BelowAverage	Average	AboveAverage
72	BRONX	All Grades	2006	Female	48244	644	26437	18403	3404
73	BRONX	All Grades	2006	Male	49616	642	27282	18682	3652
74	BRONX	All Grades	2007	Female	47011	654	21301	21034	4676
75	BRONX	All Grades	2007	Male	50061	651	23866	21195	5000
76	BRONX	All Grades	2008	Female	45661	662	15303	25117	5241
77	BRONX	All Grades	2008	Male	48700	659	18005	25219	5476
78	BRONX	All Grades	2009	Female	45423	671	10599	27585	7239
79	BRONX	All Grades	2009	Male	48521	668	13256	27851	7414
80	BRONX	All Grades	2010	Female	45466	670	26240	13502	5724
81	BRONX	All Grades	2010	Male	48687	668	28920	13910	5857
82	BRONX	All Grades	2011	Female	45598	672	24794	16042	4762
83	BRONX	All Grades	2011	Male	48423	669	27535	15843	5045

12 rows × 9 columns

In [7]:

AllStudents.describe()

Out[7]:

	Year	Number Tested	Mean Scale Score	BelowAverage	Average	AboveAverage
count	60.000000	60.000000	60.000000	60.000000	60.000000	60.000000
mean	2008.500000	42922.066667	672.683333	15044.266667	18976.133333	8901.666667
std	1.722237	20169.577332	11.342872	9136.840989	9767.739463	5341.662627
min	2006.000000	12431.000000	642.000000	1639.000000	4452.000000	2493.000000
25%	2007.000000	26797.000000	665.750000	6785.250000	9237.000000	4414.000000
50%	2008.500000	48333.500000	673.500000	13360.000000	19191.500000	6629.000000
75%	2010.000000	60109.250000	682.000000	22884.500000	27596.750000	13857.250000
max	2011.000000	70730.000000	688.000000	32470.000000	36905.000000	19667.000000

8 rows × 6 columns

This doesn't tell much: each row is a count for one borough, gender and year, so the mean and quartiles mix boroughs of very different sizes.

In [8]:

AllStudentsYr = AllStudents.groupby(['Year','Category']).sum()

AllStudentsYr

Out[8]:

		Number Tested	Mean Scale Score	BelowAverage	Average	AboveAverage
Year	Category
2006	Female	216807	3289	90995	93221	32591
2006	Male	223781	3282	96436	93633	33712
2007	Female	211962	3335	71104	99512	41346
2007	Male	222919	3322	80398	100880	41641
2008	Female	206949	3370	49665	111877	45407
2008	Male	217858	3357	59549	112195	46114
2009	Female	206320	3408	34282	116928	55110
2009	Male	217072	3394	42827	119581	54664
2010	Female	207155	3404	92805	66925	47425
2010	Male	218583	3393	103001	68683	46899
2011	Female	207503	3409	85770	77753	43980
2011	Male	218415	3398	95824	77380	45211

12 rows × 5 columns
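Note that summing `Mean Scale Score` across the five boroughs (for example 3289 for 2006 Female) does not produce a meaningful score. One possible alternative is `agg` with a different function per column, so that counts are summed while the score is averaged. A sketch on made-up rows; the simple mean used here is unweighted, which is an assumption and may not be what is wanted if boroughs differ a lot in size:

```python
import pandas as pd

# Made-up rows: two boroughs' worth of data for one year.
rows = pd.DataFrame({
    "Year": [2006, 2006, 2006, 2006],
    "Category": ["Female", "Male", "Female", "Male"],
    "Number Tested": [100, 120, 300, 280],
    "Mean Scale Score": [640, 650, 680, 670],
})

# Sum the counts, but average the scores instead of adding them up.
summary = rows.groupby(["Year", "Category"]).agg(
    {"Number Tested": "sum", "Mean Scale Score": "mean"}
)
print(summary)
```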

All FEMALE Students¶

In [9]:

Allgirls = AllStudents[AllStudents['Category']=='Female']
Allgirls.groupby(['Year']).sum()

Out[9]:

	Number Tested	Mean Scale Score	BelowAverage	Average	AboveAverage
Year
2006	216807	3289	90995	93221	32591
2007	211962	3335	71104	99512	41346
2008	206949	3370	49665	111877	45407
2009	206320	3408	34282	116928	55110
2010	207155	3404	92805	66925	47425
2011	207503	3409	85770	77753	43980

6 rows × 5 columns

All MALE Students¶

In [10]:

Allboys = AllStudents[AllStudents['Category']=='Male']
Allboys.groupby(['Year']).sum()

Out[10]:

	Number Tested	Mean Scale Score	BelowAverage	Average	AboveAverage
Year
2006	223781	3282	96436	93633	33712
2007	222919	3322	80398	100880	41641
2008	217858	3357	59549	112195	46114
2009	217072	3394	42827	119581	54664
2010	218583	3393	103001	68683	46899
2011	218415	3398	95824	77380	45211

6 rows × 5 columns

Looking at Boroughs¶

In [11]:

AllBronxMath.sum()

Out[11]:

Borough             BRONXBRONXBRONXBRONXBRONXBRONXBRONXBRONXBRONXB...
Grade               All GradesAll GradesAll GradesAll GradesAll Gr...
Year                                                            24102
Category            FemaleMaleFemaleMaleFemaleMaleFemaleMaleFemale...
Number Tested                                                  571411
Mean Scale Score                                                 7930
BelowAverage                                                   263538
Average                                                        244383
AboveAverage                                                    63490
dtype: object

Data sets sorted for 2011 and Boroughs¶

In [12]:

Bx_below = AllBronxMath[AllBronxMath['Year']==2011].sum().BelowAverage
Bx_avg = AllBronxMath[AllBronxMath['Year']==2011].sum().Average
Bx_above = AllBronxMath[AllBronxMath['Year']==2011].sum().AboveAverage

M_below = AllManhattanMath[AllManhattanMath['Year']==2011].sum().BelowAverage
M_avg = AllManhattanMath[AllManhattanMath['Year']==2011].sum().Average
M_above = AllManhattanMath[AllManhattanMath['Year']==2011].sum().AboveAverage

Bk_below = AllBrooklynMath[AllBrooklynMath['Year']==2011].sum().BelowAverage
Bk_avg = AllBrooklynMath[AllBrooklynMath['Year']==2011].sum().Average
Bk_above = AllBrooklynMath[AllBrooklynMath['Year']==2011].sum().AboveAverage

Q_below = AllQueensMath[AllQueensMath['Year']==2011].sum().BelowAverage
Q_avg = AllQueensMath[AllQueensMath['Year']==2011].sum().Average
Q_above = AllQueensMath[AllQueensMath['Year']==2011].sum().AboveAverage

SI_below = AllSIMath[AllSIMath['Year']==2011].sum().BelowAverage
SI_avg = AllSIMath[AllSIMath['Year']==2011].sum().Average
SI_above = AllSIMath[AllSIMath['Year']==2011].sum().AboveAverage

In [13]:

NYC_Below = [int(Bx_below), int(M_below), int(Bk_below), int(Q_below), int(SI_below)]
NYC_Avg = [int(Bx_avg), int(M_avg), int(Bk_avg), int(Q_avg), int(SI_avg)]
NYC_Above = [int(Bx_above), int(M_above), int(Bk_above), int(Q_above), int(SI_above)]

NYC_Bxtot= [float(Bx_below) + float(Bx_avg) + float(Bx_above)]
NYC_Mtot= [int(M_below) + int(M_avg) + int(M_above)]
NYC_Bktot= [int(Bk_below) + int(Bk_avg) + int(Bk_above)]
NYC_Qtot= [int(Q_below) + int(Q_avg) + int(Q_above)]
NYC_SItot= [int(SI_below) + int(SI_avg) + int(SI_above)]
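The per-borough variables above could also be produced in one step with `groupby`. A sketch on a made-up stand-in for `AllStudents` (values are illustrative only); note that `groupby` sorts boroughs alphabetically, which differs from the Bronx, Manhattan, Brooklyn, Queens, Staten Island order used in the lists above:

```python
import pandas as pd

# Made-up stand-in for AllStudents; the last row is a different year on purpose.
all_students = pd.DataFrame({
    "Borough": ["BRONX", "BRONX", "QUEENS", "QUEENS", "BRONX"],
    "Year": [2011, 2011, 2011, 2011, 2010],
    "BelowAverage": [10, 20, 5, 5, 99],
    "Average": [30, 30, 40, 50, 99],
    "AboveAverage": [5, 5, 20, 25, 99],
})

levels = ["BelowAverage", "Average", "AboveAverage"]

# Keep 2011 only, then add up Female + Male rows within each borough.
totals_2011 = (
    all_students[all_students["Year"] == 2011]
    .groupby("Borough")[levels]
    .sum()
)

# Row-wise total across the three proficiency levels.
totals_2011["Total"] = totals_2011[levels].sum(axis=1)

# A plain list, ready to pass to a bar chart.
nyc_below = totals_2011["BelowAverage"].tolist()
print(totals_2011)
```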

____________________________________________________________________________________________________________________________

Proficiency Level Changes over Time (2006-2011)¶

In [14]:

#Scores... proficiency Level...

plt.figure(figsize=(10,5))
plt.scatter(AllStudents.Year, AllStudents.BelowAverage, lw=10, alpha=.5, color='m')
plt.scatter(AllStudents.Year, AllStudents.Average, lw=10, alpha=.5, color='c')
plt.scatter(AllStudents.Year, AllStudents.AboveAverage, lw=10, alpha=.5, color='g')
plt.xlabel("Year")

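#plt.set_xticklabels(('2006', '2007', '2008', '2009', '2010', '2011') )
#Tried this from another plot, didn't work: set_xticklabels is an Axes method (ax.set_xticklabels), not a pyplot function; plt.xticks is the pyplot equivalent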

plt.ylabel("Students at Proficiency")
plt.title("Proficiency Over Time",fontsize='15')

plt.legend(('Below Average', 'Average', 'Above Average'), bbox_to_anchor = (1.3, 1))

Out[14]:

<matplotlib.legend.Legend at 0x10729f7d0>

Q: Why did I have to rename the column from two words to one word in order to sum it? Ex: It was Below Average, now it's BelowAverage
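On the question above: the rename is only needed for attribute-style access such as `AllStudents.BelowAverage`, because an attribute name must be a valid Python identifier and cannot contain a space. Bracket access works with any column name. A small sketch on made-up data:

```python
import pandas as pd

# Made-up frame with one two-word column name and one single-word name.
df = pd.DataFrame({"Below Average": [1, 2, 3], "Average": [4, 5, 6]})

# Bracket access works for any column name, including ones with spaces.
below_total = df["Below Average"].sum()

# Attribute access only works when the name is a valid Python identifier.
avg_total = df.Average.sum()

print(below_total, avg_total)
```

Writing `df.Below Average` would be a syntax error, so the original header could have been kept by using brackets throughout.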

____________________________________________________________________________________________________________________________

Female vs Male G3-8 Students at AVERAGE Math proficiency (2006-2011)¶

In [15]:

N = 6
BoysLevel3 = Allboys.groupby(['Year']).sum().Average
GirlsLevel3 = Allgirls.groupby(['Year']).sum().Average

ind = np.arange(N)
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(ind, BoysLevel3, width, color='b')
rects2 = ax.bar(ind+width, GirlsLevel3, width, color='y')

# add labels, title and axis ticks
ax.set_ylabel('Number of Students')
ax.set_title('Number of Students at Average Proficiency in Math')
ax.set_xticks(ind+width)
ax.set_xticklabels( ('2006', '2007', '2008', '2009', '2010', '2011') )

# pass the bar handles themselves; 'rects1'[0] was the string 'r', which triggered a legend warning
ax.legend( (rects1[0], rects2[0]), ('Boys', 'Girls'), loc='upper right' )

Out[15]:

<matplotlib.legend.Legend at 0x10783f1d0>

____________________________________________________________________________________________________________________________

Female vs Male G3-8 Students at BELOW AVERAGE proficiency (2006-2011)¶

In [16]:

N = 6
BoysLevel12 = Allboys.groupby(['Year']).sum().BelowAverage
GirlsLevel12 = Allgirls.groupby(['Year']).sum().BelowAverage

ind = np.arange(N)
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(ind, BoysLevel12, width, color='b')
rects2 = ax.bar(ind+width, GirlsLevel12, width, color='y')

# add labels, title and axis ticks
ax.set_ylabel('Number of Students')
ax.set_title('Number of Students at Below Average Proficiency in Math')
ax.set_xticks(ind+width)
ax.set_xticklabels( ('2006', '2007', '2008', '2009', '2010', '2011') )

# pass the bar handles themselves ('rects1'[0] was just the string 'r')
ax.legend( (rects1[0], rects2[0]), ('Boys', 'Girls'), bbox_to_anchor = (1.3, 1) )

Out[16]:

<matplotlib.legend.Legend at 0x107a0db90>

____________________________________________________________________________________________________________________________

Female vs Male G3-8 Students at ABOVE AVERAGE proficiency (2006-2011)¶

In [17]:

N = 6
BoysLevel4 = Allboys.groupby(['Year']).sum().AboveAverage
GirlsLevel4 = Allgirls.groupby(['Year']).sum().AboveAverage

ind = np.arange(N)
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(ind, BoysLevel4, width, color='b')
rects2 = ax.bar(ind+width, GirlsLevel4, width, color='y')

#Labels of Axes
ax.set_ylabel('Number of Students')
ax.set_title('Number of Students at Above Average Proficiency in Math')
ax.set_xticks(ind+width)
ax.set_xticklabels( ('2006', '2007', '2008', '2009', '2010', '2011') )

# pass the bar handles themselves ('rects1'[0] was just the string 'r')
ax.legend( (rects1[0], rects2[0]), ('Boys', 'Girls'), loc='upper right' )

Out[17]:

<matplotlib.legend.Legend at 0x107a53e50>

Q: But how do I fix the years... can I modify the tick names??
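Tick names can be modified. On an Axes object, `set_xticks` chooses the positions and `set_xticklabels` chooses the text shown at them, so the years can come straight from the data instead of being typed by hand. A minimal sketch with made-up yearly counts:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up yearly counts, indexed by year like the groupby(['Year']) results.
counts = pd.Series([10, 12, 15], index=[2006, 2007, 2008])

positions = np.arange(len(counts))
fig, ax = plt.subplots()
ax.bar(positions, counts.values, 0.35)

# Put one tick per bar and label it with the year taken from the index.
ax.set_xticks(positions)
ax.set_xticklabels([str(year) for year in counts.index])
```

For plots drawn with `plt.` calls rather than an Axes object, `plt.xticks(positions, labels)` does the same job.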

____________________________________________________________________________________________________________________________

2011 Math Proficiency in NYC (G3-8)¶

In [18]:

PIE1 = [181594, 155133, 89191]
# 2011 citywide totals (Below Average, Average, Above Average): the Female + Male sums for 2011 in Out[8]
labelsP = ('Below Average', 'Average', 'Above Average')
plt.subplot(aspect=True)
plt.pie(PIE1, labels=labelsP, colors = ('y', 'm', 'b'),autopct='%i%%')
plt.title("NYC Math Proficiency Level Grade 3-8, 2011")

Out[18]:

<matplotlib.text.Text at 0x107a711d0>

Q: How do I do percentage?!?!
I need to do: Bx_below/NYC_Bxtot *100
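One likely reason the division fails is that `NYC_Bxtot` was built with square brackets, which makes it a one-element list rather than a number. A sketch of the percentage calculation using the 2011 Bronx "All Grades" counts shown in Out[6] (the lowercase variable names are hypothetical, chosen so they don't overwrite the ones above):

```python
# 2011 Bronx "All Grades" counts, Female + Male, taken from Out[6].
bx_below = 24794 + 27535
bx_avg = 16042 + 15843
bx_above = 4762 + 5045

# A plain number, not a one-element list, so it can be used as a divisor.
bx_total = bx_below + bx_avg + bx_above

# Multiply by 100.0 so the division is floating point even under Python 2.
bx_below_pct = 100.0 * bx_below / bx_total
print(round(bx_below_pct, 1))  # 55.7
```

The total, 94021, matches the Bronx `Number Tested` for 2011 (45598 + 48423).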

____________________________________________________________________________________________________________________________

2011 Math Proficiency Breakdown by Boroughs (G3-8)¶

In [19]:

#NYC_Below = [Bx_below, M_below, Bk_below, Q_below, SI_below]
#NYC_Avg = [Bx_avg, M_avg, Bk_avg, Q_avg, SI_avg]
#NYC_Above = [Bx_above, M_above, Bk_above, Q_above, SI_above]

N = 5

ind = np.arange(N)

# three bars per borough; 3 * 0.25 leaves a gap between borough groups
# (the earlier formula (1.-2.*margin)/N with margin = 0.8 gave a negative width)
width = 0.25
fig, ax = plt.subplots(figsize=(10,5))

rects1 = ax.bar(ind, NYC_Below, width, color='y')
rects2 = ax.bar(ind+width, NYC_Avg, width, color='m')
rects3 = ax.bar(ind+2*width, NYC_Above, width, color='b')

ax.set_ylabel('Students')
ax.set_title('Student Proficiency Levels by Borough 2011', fontsize = 15)
ax.set_xticks(ind+width)
ax.set_xticklabels(('Bronx', 'Manhattan', 'Brooklyn', 'Queens', 'Staten Island'))

ax.legend( (rects1[0], rects2[0], rects3[0]), ('Below Average', 'Average', 'Above Average'), fontsize=10, loc='upper right' )

Out[19]:

<matplotlib.legend.Legend at 0x107d1ce90>

____________________________________________________________________________________________________________________________

2011 Math Proficiency in NYC (G3-8)¶

(Stacked View)

In [20]:

#NYC_Below = [Bx_below, M_below, Bk_below, Q_below, SI_below]
#NYC_Avg = [Bx_avg, M_avg, Bk_avg, Q_avg, SI_avg]
#NYC_Above = [Bx_above, M_above, Bk_above, Q_above, SI_above]

N = 5

Boro_Below= (55.7, 40.13, 43.5, 34.5, 34.7)
Std1= (1,1,1,1,1)

Boro_Avg= (34.0, 34.7, 36, 38.6, 41.0)
Std2= (1,1,1,1,1)

Boro_Above= (10.3, 25.2, 20.6, 26.9, 24.0)
Std3= (1,1,1,1,1)

ind = np.arange(N)
width = 0.35

margin = 0.8
#width = (1.-2.*margin)/N
#fig, ax = plt.subplots(figsize=(10,5))


p1 = plt.bar(ind,Boro_Below, width, color='y')
p2 = plt.bar(ind,Boro_Avg, width, color='m', bottom=Boro_Below)
p3 = plt.bar(ind,Boro_Above, width, color='b', bottom=[Boro_Below[j] + Boro_Avg[j] for j in range(len(Boro_Below))])


plt.ylabel('Students %')
plt.title('Student Proficiency Levels by Borough 2011', fontsize = 15)
plt.xticks(ind+width/2., ('Bronx', 'Manhattan', 'Brooklyn', 'Queens', 'Staten Island') )
#plt.yticks(np.arange(0,81,10))

plt.legend((p1[0], p2[0], p3[0]),('Below Average', 'Average', 'Above Average'),fontsize=10, bbox_to_anchor = (1.4, 1))

plt.show()
