Data scraping, data wrangling, data analytics, exploratory data analysis, advanced plotting and clustering with love.

In this tutorial, we are going to read the csv file and the json file we saved in the previous tutorial.

Then we are going to use pandas [1] to do data wrangling and manipulation of the DataFrame.

Finally, we are going to do some basic data analytics and basic plotting using matplotlib [2].

In [ ]:

In [2]:

%matplotlib inline

In [ ]:

In [3]:

import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns

In [ ]:

To load the csv file, we proceed as follows,

In [4]:

#shot_df.to_csv(path_or_buf='test.csv',mode='w')
shot_df = pd.read_csv(filepath_or_buffer='test.csv')
#shot_df = pd.read_csv('test.csv')

In [ ]:

To print the first 4 rows of the DataFrame, you can proceed as follows (when there are too many columns to fit, pandas abbreviates the middle ones with ...),

In [5]:

shot_df.head(4)

Out[5]:

	Unnamed: 0	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	...	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
0	0	Shot Chart Detail	21400018	4	2544	LeBron James	1610612739	Cleveland Cavaliers	1	11	...	Jump Shot	2PT Field Goal	Mid-Range	Right Side Center(RC)	16-24 ft.	18	114	148	1	0
1	1	Shot Chart Detail	21400018	33	2544	LeBron James	1610612739	Cleveland Cavaliers	1	6	...	Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-7	0	1	1
2	2	Shot Chart Detail	21400018	53	2544	LeBron James	1610612739	Cleveland Cavaliers	1	4	...	Fadeaway Jump Shot	2PT Field Goal	Mid-Range	Left Side(L)	8-16 ft.	12	-105	63	1	0
3	3	Shot Chart Detail	21400018	77	2544	LeBron James	1610612739	Cleveland Cavaliers	1	2	...	Jump Shot	3PT Field Goal	Right Corner 3	Right Side(R)	24+ ft.	22	227	-16	1	0

4 rows × 22 columns

Keep in mind that this is a large DataFrame; if you do not use .head(4), pandas will try to print every row (up to its display limit).

In [ ]:

To force pandas to display all the columns, you can proceed as follows (we only display the first 4 rows),

In [6]:

pd.set_option('display.max_columns', None)
shot_df.head(4)
#shot_df

Out[6]:

	Unnamed: 0	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
0	0	Shot Chart Detail	21400018	4	2544	LeBron James	1610612739	Cleveland Cavaliers	1	11	20	Missed Shot	Jump Shot	2PT Field Goal	Mid-Range	Right Side Center(RC)	16-24 ft.	18	114	148	1	0
1	1	Shot Chart Detail	21400018	33	2544	LeBron James	1610612739	Cleveland Cavaliers	1	6	30	Made Shot	Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-7	0	1	1
2	2	Shot Chart Detail	21400018	53	2544	LeBron James	1610612739	Cleveland Cavaliers	1	4	45	Missed Shot	Fadeaway Jump Shot	2PT Field Goal	Mid-Range	Left Side(L)	8-16 ft.	12	-105	63	1	0
3	3	Shot Chart Detail	21400018	77	2544	LeBron James	1610612739	Cleveland Cavaliers	1	2	31	Missed Shot	Jump Shot	3PT Field Goal	Right Corner 3	Right Side(R)	24+ ft.	22	227	-16	1	0

In [ ]:

To erase a column in a DataFrame we can proceed as follows.

Notice that we want to erase the column with the name Unnamed: 0, hence the notation ('Unnamed: 0', axis=1), where axis=1 tells pandas to look for the label among the columns rather than the rows.

In [7]:

shot_df1 = shot_df.drop('Unnamed: 0', axis=1)

In [ ]:

To display the information in shot_df1 (we only display the first 2 rows),

In [8]:

shot_df1.head(2)

Out[8]:

	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
0	Shot Chart Detail	21400018	4	2544	LeBron James	1610612739	Cleveland Cavaliers	1	11	20	Missed Shot	Jump Shot	2PT Field Goal	Mid-Range	Right Side Center(RC)	16-24 ft.	18	114	148	1	0
1	Shot Chart Detail	21400018	33	2544	LeBron James	1610612739	Cleveland Cavaliers	1	6	30	Made Shot	Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-7	0	1	1

As you can see, the column Unnamed: 0 is gone.
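
As a minimal sketch of the same idea on made-up data (the toy_df values below are illustrative and are not the tutorial data), a column can also be dropped with the columns= keyword, which names the axis explicitly. Note that drop returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Toy DataFrame that mimics the stray index column written by to_csv
toy_df = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'LOC_X': [114, -7, -105],
    'LOC_Y': [148, 0, 63],
})

# drop returns a new DataFrame; toy_df itself is left unchanged
toy_df1 = toy_df.drop(columns=['Unnamed: 0'])

print(list(toy_df1.columns))
print(list(toy_df.columns))
```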

In [ ]:

Alternatively, we can use column 0 of the csv file as the row labels of the DataFrame.

In [9]:

shot_df2 = pd.read_csv(filepath_or_buffer='test.csv',index_col=0)

In [ ]:

To display the information in shot_df2 (we only display the first 2 rows),

In [10]:

shot_df2.head(2)

Out[10]:

	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
0	Shot Chart Detail	21400018	4	2544	LeBron James	1610612739	Cleveland Cavaliers	1	11	20	Missed Shot	Jump Shot	2PT Field Goal	Mid-Range	Right Side Center(RC)	16-24 ft.	18	114	148	1	0
1	Shot Chart Detail	21400018	33	2544	LeBron James	1610612739	Cleveland Cavaliers	1	6	30	Made Shot	Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-7	0	1	1

In [ ]:

We can do indexing and slicing in a DataFrame.

To display all the columns belonging to rows 0 and 1 of the DataFrame shot_df2 (the slice 0:2 stops before position 2),

In [11]:

shot_df2.iloc[0:2,:]

Out[11]:

	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
0	Shot Chart Detail	21400018	4	2544	LeBron James	1610612739	Cleveland Cavaliers	1	11	20	Missed Shot	Jump Shot	2PT Field Goal	Mid-Range	Right Side Center(RC)	16-24 ft.	18	114	148	1	0
1	Shot Chart Detail	21400018	33	2544	LeBron James	1610612739	Cleveland Cavaliers	1	6	30	Made Shot	Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-7	0	1	1

In [ ]:

or we can display columns 10 to 12 of the first five rows (again, the slice 10:13 stops before position 13),

In [75]:

shot_df2.iloc[0:5,10:13]

Out[75]:

	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE
0	Missed Shot	Jump Shot	2PT Field Goal
1	Made Shot	Layup Shot	2PT Field Goal
2	Missed Shot	Fadeaway Jump Shot	2PT Field Goal
3	Missed Shot	Jump Shot	3PT Field Goal
4	Missed Shot	Jump Shot	3PT Field Goal
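
The slicing rules are easy to mix up, so here is a small sketch on a made-up DataFrame (toy_df and its column names are assumptions for illustration). It contrasts iloc, which works on positions and excludes the end of the slice, with loc, which works on labels and includes it.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'a': [10, 11, 12, 13], 'b': [20, 21, 22, 23], 'c': [30, 31, 32, 33]}
)

# iloc is position based and excludes the end of the slice:
# rows 0 and 1, columns at positions 1 and 2
by_position = toy_df.iloc[0:2, 1:3]

# loc is label based and includes the end of the slice:
# rows labelled 0, 1 and 2, columns 'b' through 'c'
by_label = toy_df.loc[0:2, 'b':'c']

print(by_position.shape)
print(by_label.shape)
```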

In [ ]:

To load a json file using the module json (the hard way),

In [12]:

#import json
#import pprint
#from pprint import pprint

#with open('data_json.json') as data_file:    
#    data_json = json.load(data_file)

#pprint(data_json)

In [13]:

#type(data_json)

In [14]:

#data_json['resultSets'][0]['headers']

In [15]:

#data_json['resultSets'][0]['rowSet']

In [16]:

#for x in data_json:
#    print (x)

In [17]:

#for result_set in data_json['resultSets']:
#    print (result_set)

In [18]:

#data_json
#type(data_json['resultSets'][0])
#print (data_json['resultSets'][0])
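
Since the cells above are commented out, here is a short runnable sketch of the json module approach. It parses a made-up, heavily shortened string (raw_text is a hypothetical stand-in for data_json.json that only mimics its layout) and walks the same resultSets, headers and rowSet keys.

```python
import json
import pandas as pd

# Hypothetical, heavily shortened stand-in for the content of data_json.json
raw_text = '''
{
  "resource": "shotchartdetail",
  "parameters": {"PlayerID": 2544},
  "resultSets": [
    {
      "headers": ["PLAYER_NAME", "LOC_X", "LOC_Y", "SHOT_MADE_FLAG"],
      "rowSet": [
        ["LeBron James", 114, 148, 0],
        ["LeBron James", -7, 0, 1]
      ]
    }
  ]
}
'''

# json.loads parses a string; json.load does the same for an open file
data_json = json.loads(raw_text)

# The result is a plain dict, so we navigate it with keys and list indices
headers = data_json['resultSets'][0]['headers']
rows = data_json['resultSets'][0]['rowSet']

toy_shots = pd.DataFrame(rows, columns=headers)
print(toy_shots)
```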

In [ ]:

To load the json file using pandas, we proceed as follows,

In [19]:

#pd_json=pd.read_json(path_or_buf='data_json.json')
pd_json=pd.read_json(path_or_buf='data_json.json',typ='series')

In [ ]:

Remember, you can inspect the json file using JSONView, as in the previous tutorial.

To know the type of the json file we just loaded,

In [20]:

type(pd_json)

Out[20]:

pandas.core.series.Series

Notice that it is a Series and not a DataFrame. Later on we are going to see how to convert a Series to a DataFrame; for the moment this does not cause any problems.

In [ ]:

The next line will print the content of the json file. This is the same information you see when you use JSONView, but now read through pandas.

In [21]:

pd_json

Out[21]:

parameters    {u'PlayerID': 2544, u'StartPeriod': None, u'St...
resource                                        shotchartdetail
resultSets    [{u'headers': [u'GRID_TYPE', u'GAME_ID', u'GAM...
dtype: object

In [ ]:

These lines will print the information inside each block of the json file. They are commented out because they print a lot of information.

In [22]:

#pd_json.parameters

In [23]:

#pd_json.resource

In [24]:

#pd_json.resultSets

In [ ]:

At this point, we can create the DataFrame using the json file we imported in pandas (pretty much as in the previous tutorial).

First we need to grab the headers and shot chart data.

In [25]:

# Grab the headers to be used as column headers for our DataFrame
json_headers = pd_json['resultSets'][0]['headers']

In [26]:

# Grab the shot chart data
json_shots = pd_json['resultSets'][0]['rowSet']

In [ ]:

To display the content of json_headers

In [27]:

json_headers

Out[27]:

[u'GRID_TYPE',
 u'GAME_ID',
 u'GAME_EVENT_ID',
 u'PLAYER_ID',
 u'PLAYER_NAME',
 u'TEAM_ID',
 u'TEAM_NAME',
 u'PERIOD',
 u'MINUTES_REMAINING',
 u'SECONDS_REMAINING',
 u'EVENT_TYPE',
 u'ACTION_TYPE',
 u'SHOT_TYPE',
 u'SHOT_ZONE_BASIC',
 u'SHOT_ZONE_AREA',
 u'SHOT_ZONE_RANGE',
 u'SHOT_DISTANCE',
 u'LOC_X',
 u'LOC_Y',
 u'SHOT_ATTEMPTED_FLAG',
 u'SHOT_MADE_FLAG']

In [ ]:

To display the content of json_shots (this line is commented out because it prints a lot of information),

In [28]:

#json_shots

In [ ]:

To create the DataFrame using json_headers and json_shots,

In [29]:

json_to_pd = pd.DataFrame(json_shots, columns=json_headers)

In [ ]:

and to display the first 5 rows of json_to_pd,

In [30]:

json_to_pd.head()

Out[30]:

	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
0	Shot Chart Detail	0021400018	4	2544	LeBron James	1610612739	Cleveland Cavaliers	1	11	20	Missed Shot	Jump Shot	2PT Field Goal	Mid-Range	Right Side Center(RC)	16-24 ft.	18	114	148	1	0
1	Shot Chart Detail	0021400018	33	2544	LeBron James	1610612739	Cleveland Cavaliers	1	6	30	Made Shot	Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-7	0	1	1
2	Shot Chart Detail	0021400018	53	2544	LeBron James	1610612739	Cleveland Cavaliers	1	4	45	Missed Shot	Fadeaway Jump Shot	2PT Field Goal	Mid-Range	Left Side(L)	8-16 ft.	12	-105	63	1	0
3	Shot Chart Detail	0021400018	77	2544	LeBron James	1610612739	Cleveland Cavaliers	1	2	31	Missed Shot	Jump Shot	3PT Field Goal	Right Corner 3	Right Side(R)	24+ ft.	22	227	-16	1	0
4	Shot Chart Detail	0021400018	82	2544	LeBron James	1610612739	Cleveland Cavaliers	1	1	51	Missed Shot	Jump Shot	3PT Field Goal	Above the Break 3	Right Side Center(RC)	24+ ft.	26	91	246	1	0

In [ ]:

To determine if json_to_pd is equal to shot_df1 (we only display the first 5 rows),

In [31]:

json_to_pd.head() == shot_df1.head()

Out[31]:

	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
0	True	False	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True
1	True	False	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True
2	True	False	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True
3	True	False	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True
4	True	False	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True

Notice that the only difference is the column GAME_ID. When loading the csv, pandas parsed GAME_ID as an integer and dropped the leading zeroes; the rest of the columns are equal.

This can be fixed, but I will leave it to you as an exercise.
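
If you want a hint, one possible approach (a sketch on made-up data, not necessarily the intended solution) is to tell read_csv to treat GAME_ID as text through its dtype argument. The csv_text string below is a hypothetical two-row stand-in for test.csv.

```python
import io
import pandas as pd

# Hypothetical two-row csv standing in for test.csv
csv_text = "GAME_ID,LOC_X,LOC_Y\n0021400018,114,148\n0021400018,-7,0\n"

# Default parsing infers an integer column and drops the leading zeroes
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing GAME_ID to be read as text preserves them
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'GAME_ID': str})

print(as_int.GAME_ID.iloc[0])
print(as_str.GAME_ID.iloc[0])
```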

In [ ]:

The DataFrame shot_df1 contains the shot chart data of all the field goal attempts LeBron James took during the 2014-15 regular season.

We are specifically interested in the data saved in the columns LOC_X, LOC_Y and SHOT_MADE_FLAG. The columns LOC_X and LOC_Y contain the coordinate values for each shot measured from the basket rim. The column SHOT_MADE_FLAG contains the outcome of the shot, 0 for missed and 1 for converted.

To extract the converted shots from the DataFrame shot_df1, that is, the rows where shot_df1.SHOT_MADE_FLAG == 1, we proceed as follows (we only display the first 2 rows),

In [32]:

shot_df1[shot_df1.SHOT_MADE_FLAG == 1].head(2)

Out[32]:

	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
1	Shot Chart Detail	21400018	33	2544	LeBron James	1610612739	Cleveland Cavaliers	1	6	30	Made Shot	Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-7	0	1	1
9	Shot Chart Detail	21400018	299	2544	LeBron James	1610612739	Cleveland Cavaliers	3	6	54	Made Shot	Jump Shot	3PT Field Goal	Above the Break 3	Center(C)	24+ ft.	25	26	249	1	1

In [ ]:

To save the previous output in the DataFrame converted,

In [33]:

converted = shot_df1[shot_df1.SHOT_MADE_FLAG == 1]

In [ ]:

To get the dimensions of the DataFrame converted,

In [34]:

converted.shape

Out[34]:

(624, 21)

In [ ]:

To save the output of the missed shots in the DataFrame missed,

In [35]:

missed = shot_df1[shot_df1.SHOT_MADE_FLAG == 0]
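
To see what the expression inside the brackets does, here is a small sketch on a made-up DataFrame (toy_df is an assumption for illustration). The comparison builds a boolean Series, and indexing with it keeps only the rows marked True; the ~ operator negates the mask.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'LOC_X': [114, -7, -105, 227],
    'SHOT_MADE_FLAG': [0, 1, 0, 0],
})

# The comparison produces a boolean Series, one entry per row
made_mask = toy_df.SHOT_MADE_FLAG == 1

toy_converted = toy_df[made_mask]   # rows where the mask is True
toy_missed = toy_df[~made_mask]     # ~ negates the mask

print(made_mask.tolist())
print(len(toy_converted), len(toy_missed))
```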

In [ ]:

Let's create a new DataFrame as follows,

In [36]:

new_df = pd.DataFrame()

In [ ]:

and let's extract some information from the DataFrame shot_df1 and put it on the DataFrame new_df.

Notice that we can access the information in a DataFrame as

shot_df1.LOC_X

or

shot_df1['LOC_X'].

In [37]:

new_df['LOC_X'] = shot_df1.LOC_X
new_df['LOC_Y'] = shot_df1['LOC_Y']
new_df['SHOT_MADE_FLAG'] = shot_df1['SHOT_MADE_FLAG']
new_df['SHOT_TYPE'] = shot_df1['SHOT_TYPE']
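
As an alternative sketch (on a made-up toy_df, so the values are illustrative only), several columns can be selected in one step by passing a list of names inside the brackets, instead of copying them one by one into an empty DataFrame.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'PLAYER_NAME': ['LeBron James', 'LeBron James'],
    'LOC_X': [114, -7],
    'LOC_Y': [148, 0],
    'SHOT_MADE_FLAG': [0, 1],
    'SHOT_TYPE': ['2PT Field Goal', '2PT Field Goal'],
})

wanted = ['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG', 'SHOT_TYPE']

# A list of names inside the brackets selects several columns at once;
# copy() makes the result independent of toy_df
toy_new_df = toy_df[wanted].copy()

print(toy_new_df.shape)
```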

In [ ]:

To display the information in the DataFrame new_df (we only display the first 5 rows),

In [38]:

#pprint(new_df.head(4))
new_df.head()

Out[38]:

	LOC_X	LOC_Y	SHOT_MADE_FLAG	SHOT_TYPE
0	114	148	0	2PT Field Goal
1	-7	0	1	2PT Field Goal
2	-105	63	0	2PT Field Goal
3	227	-16	0	3PT Field Goal
4	91	246	0	3PT Field Goal

In [ ]:

We can also extract information from a DataFrame using strings. From the DataFrame new_df, let's extract the SHOT_TYPE information as follows,

In [39]:

three_pointer = new_df[new_df.SHOT_TYPE == '3PT Field Goal']
two_pointer = new_df[new_df.SHOT_TYPE == '2PT Field Goal']

In [ ]:

To display the information contained in the DataFrames three_pointer and two_pointer (we only display the first 2 rows),

In [40]:

three_pointer.head(2)

Out[40]:

	LOC_X	LOC_Y	SHOT_MADE_FLAG	SHOT_TYPE
3	227	-16	0	3PT Field Goal
4	91	246	0	3PT Field Goal

In [41]:

two_pointer.head(2)

Out[41]:

	LOC_X	LOC_Y	SHOT_MADE_FLAG	SHOT_TYPE
0	114	148	0	2PT Field Goal
1	-7	0	1	2PT Field Goal

In [ ]:

To know the type of two_pointer,

In [42]:

type(two_pointer)

Out[42]:

pandas.core.frame.DataFrame

In [ ]:

Notice that if we extract the column two_pointer['LOC_X'] and save it in the object tmp_object, as follows,

In [43]:

tmp_object = two_pointer['LOC_X']

In [ ]:

The type of the object tmp_object is a pandas series instead of a DataFrame.

In [44]:

type(tmp_object)

Out[44]:

pandas.core.series.Series

In [ ]:

To convert tmp_object to a DataFrame, we proceed as follows.

Notice that the option name in to_frame(name='LOC_X'), corresponds to the name we want to assign to that column in the DataFrame tmp_object1.

In [45]:

tmp_object1 = two_pointer['LOC_X'].to_frame(name='LOC_X')

In [ ]:

which has the type

In [46]:

type(tmp_object1)

Out[46]:

pandas.core.frame.DataFrame
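
The difference between the two types can be sketched on a made-up DataFrame (toy_df is an assumption for illustration). Single brackets return a Series, double brackets return a one-column DataFrame, and to_frame converts a Series while letting us choose the column name.

```python
import pandas as pd

toy_df = pd.DataFrame({'LOC_X': [114, -7], 'LOC_Y': [148, 0]})

single = toy_df['LOC_X']             # single brackets: Series
double = toy_df[['LOC_X']]           # double brackets: one-column DataFrame
framed = single.to_frame(name='X')   # Series -> DataFrame, column named 'X'

print(type(single).__name__, type(double).__name__, type(framed).__name__)
print(list(framed.columns))
```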

In [ ]:

To convert a pandas series or DataFrame to a numpy array, we proceed as follows,

In [47]:

np_array0 = shot_df1.to_numpy()

np_array1 = two_pointer['LOC_X'].to_numpy()

In [ ]:

which has the type

In [48]:

type(np_array0)

Out[48]:

numpy.ndarray

In [49]:

type(np_array1)

Out[49]:

numpy.ndarray
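
Both results are numpy arrays, but their shape and dtype differ. The sketch below uses a made-up DataFrame (toy_df is an assumption for illustration) and the to_numpy method available in recent pandas versions: a DataFrame that mixes text and numbers becomes a 2-D array of generic Python objects, while a single numeric column becomes a 1-D numeric array.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'SHOT_TYPE': ['2PT Field Goal', '3PT Field Goal'],
    'LOC_X': [114, 227],
})

arr_2d = toy_df.to_numpy()            # whole DataFrame -> 2-D array
arr_1d = toy_df['LOC_X'].to_numpy()   # one column (Series) -> 1-D array

print(arr_2d.shape, arr_2d.dtype)
print(arr_1d.shape, arr_1d.dtype)
```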

In [ ]:

We can also do mathematical operations between columns (and rows) of a DataFrame, for example (we only display the first 5 rows),

In [50]:

(three_pointer['LOC_X']*100).head()
#(three_pointer*0).head()

Out[50]:

3     22700
4      9100
6     12200
9      2600
12   -12700
Name: LOC_X, dtype: int64

In [51]:

((three_pointer['LOC_X']-three_pointer.LOC_Y)/three_pointer.LOC_Y**2).head()

Out[51]:

3     0.949219
4    -0.002561
6    -0.002046
9    -0.003597
12   -0.006749
dtype: float64

In [52]:

(three_pointer['LOC_X']/three_pointer['LOC_X'].max()).head()

Out[52]:

3     0.941909
4     0.377593
6     0.506224
9     0.107884
12   -0.526971
Name: LOC_X, dtype: float64

In [ ]:

To create a new DataFrame with the normalized values of LOC_X and LOC_Y,

In [53]:

df_norm = pd.DataFrame()
c1=three_pointer['LOC_X']/three_pointer['LOC_X'].max()
c2=three_pointer['LOC_Y']/three_pointer['LOC_Y'].max()
df_norm['c1'] = c1
df_norm['c2'] = c2
df_norm.head()

Out[53]:

	c1	c2
3	0.941909	-0.038278
4	0.377593	0.588517
6	0.506224	0.562201
9	0.107884	0.595694
12	-0.526971	0.550239

Notice that we only display the first 5 rows.
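
The same normalization can be sketched for several columns at once (toy_df below is made up for illustration). Dividing a DataFrame by the Series returned by max() aligns on column names, so each column is divided by its own maximum. Dividing by the maximum makes the largest value in each column equal to 1; negative values are not bounded at -1.

```python
import pandas as pd

toy_df = pd.DataFrame({'LOC_X': [227, 91, -127], 'LOC_Y': [-16, 246, 230]})

# max() returns one value per column; the division is aligned by column name
toy_norm = toy_df / toy_df.max()

print(toy_norm)
```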

To generate summary statistics of a DataFrame,

In [54]:

shot_df1.describe()

Out[54]:

	GAME_ID	GAME_EVENT_ID	PLAYER_ID	TEAM_ID	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
count	1279.000000	1279.000000	1279	1279	1279.000000	1279.000000	1279.000000	1279.000000	1279.000000	1279.000000	1279	1279.000000
mean	21400602.060985	250.136044	2544	1610612739	2.448006	5.295543	28.170446	12.136826	-13.997654	83.936669	1	0.487881
std	346.471023	156.436191	0	0	1.137046	3.566744	17.278734	10.302508	104.266921	91.996852	0	0.500049
min	21400018.000000	2.000000	2544	1610612739	1.000000	0.000000	0.000000	0.000000	-245.000000	-30.000000	1	0.000000
25%	21400281.000000	116.000000	2544	1610612739	1.000000	2.000000	13.000000	1.000000	-94.000000	4.000000	1	0.000000
50%	21400643.000000	254.000000	2544	1610612739	2.000000	5.000000	29.000000	12.000000	-2.000000	42.000000	1	0.000000
75%	21400891.000000	379.000000	2544	1610612739	3.000000	8.000000	42.000000	23.000000	22.000000	164.000000	1	1.000000
max	21401203.000000	628.000000	2544	1610612739	5.000000	11.000000	59.000000	46.000000	241.000000	418.000000	1	1.000000

Notice that it computes the statistics only of the columns with numerical values.
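
If a summary of the text columns is also wanted, describe accepts an include argument. The sketch below runs on a made-up DataFrame (toy_df is an assumption for illustration); with include='all', text columns report count, unique, top and freq, and the rows that do not apply to a column are filled with NaN.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'SHOT_TYPE': ['2PT Field Goal', '2PT Field Goal', '3PT Field Goal'],
    'SHOT_DISTANCE': [18, 0, 22],
})

numeric_summary = toy_df.describe()            # numeric columns only
all_summary = toy_df.describe(include='all')   # numeric and text columns

print(numeric_summary)
print(all_summary)
```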

In [ ]:

To count the total number of 3 point shots attempted,

In [55]:

three_pointer.count()

Out[55]:

LOC_X             339
LOC_Y             339
SHOT_MADE_FLAG    339
SHOT_TYPE         339
dtype: int64

In [ ]:

To count the number of 3 point shots converted, using only the column SHOT_MADE_FLAG,

In [56]:

three_pointer.SHOT_MADE_FLAG[three_pointer.SHOT_MADE_FLAG == 1].count()

Out[56]:

In [ ]:

and to count the number of 3 point shots missed,

In [57]:

three_pointer.SHOT_MADE_FLAG[three_pointer.SHOT_MADE_FLAG == 0].count()

Out[57]:

In [ ]:
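
The two counts above can also be obtained in one call. This is a minimal sketch on made-up flags (toy_three is an assumption and does not reproduce the real totals): value_counts tallies each distinct value, and because the flag is 0 or 1, its mean is the fraction of shots converted.

```python
import pandas as pd

toy_three = pd.DataFrame({'SHOT_MADE_FLAG': [0, 0, 1, 1, 0]})

# value_counts tallies each distinct value in one call
counts = toy_three.SHOT_MADE_FLAG.value_counts()

made = counts.loc[1]     # number of rows with flag 1
missed = counts.loc[0]   # number of rows with flag 0

# Because the flag is 0 or 1, its mean is the fraction of shots made
fraction_made = toy_three.SHOT_MADE_FLAG.mean()

print(made, missed, fraction_made)
```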

At this point, let's do some basic plotting using matplotlib [2].

The first plot will be the converted and missed shots using the DataFrames we just created, that is, we are going to plot the shot charts.

In [58]:

sns.set_style("white")
sns.set_color_codes()

#shottrue = shot_df1[shot_df.SHOT_MADE_FLAG == 1]
#shotfalse = shot_df1[shot_df.SHOT_MADE_FLAG == 0]

plt.figure(figsize=(12,11))

plt.scatter(converted.LOC_X, converted.LOC_Y, color='green',label='converted',s=20,marker='o',alpha=0.5)
plt.scatter(missed.LOC_X, missed.LOC_Y, color='red',label='missed',s=20,marker='o',alpha=0.5)

#plt.scatter(shottrue.LOC_X, shottrue.LOC_Y)
#plt.scatter(shotfalse.LOC_X, shotfalse.LOC_Y)

plt.legend()
plt.grid()

plt.xlim(-300,300)
plt.ylim(-100,500)
#plt.grid()
plt.show()

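FYI, the values in LOC_X and LOC_Y appear to be in tenths of a foot rather than inches: the first shot, at (114, 148), is about 187 units from the basket, which matches its SHOT_DISTANCE of 18 ft.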

In [ ]:

Let's do a plot of all the 3 point shots.

In [59]:

sns.set_style("white")
sns.set_color_codes()

#shottrue = shot_df1[shot_df.SHOT_MADE_FLAG == 1]
#shotfalse = shot_df1[shot_df.SHOT_MADE_FLAG == 0]

plt.figure(figsize=(12,11))

plt.scatter(three_pointer.LOC_X, three_pointer.LOC_Y, color='green',label='3 point shots',s=20,marker='o',alpha=0.5)


#plt.scatter(shottrue.LOC_X, shottrue.LOC_Y)
#plt.scatter(shotfalse.LOC_X, shotfalse.LOC_Y)

plt.legend()

plt.xlim(-300,300)
plt.ylim(-100,500)
#plt.grid()
plt.show()

In [ ]:

We can also plot all the 3 point shots by converted and missed, as follows,

In [60]:

sns.set_style("white")
sns.set_color_codes()

#shottrue = shot_df1[shot_df.SHOT_MADE_FLAG == 1]
#shotfalse = shot_df1[shot_df.SHOT_MADE_FLAG == 0]

plt.figure(figsize=(12,11))

plt.scatter(three_pointer.LOC_X[three_pointer.SHOT_MADE_FLAG == 1], three_pointer.LOC_Y[three_pointer.SHOT_MADE_FLAG == 1], color='green',label='3 point shots converted',s=20,marker='o',alpha=0.5)
plt.scatter(three_pointer.LOC_X[three_pointer.SHOT_MADE_FLAG == 0], three_pointer.LOC_Y[three_pointer.SHOT_MADE_FLAG == 0], color='red',label='3 point shots missed',s=20,marker='o',alpha=0.5)


#plt.scatter(shottrue.LOC_X, shottrue.LOC_Y)
#plt.scatter(shotfalse.LOC_X, shotfalse.LOC_Y)

plt.legend()

plt.xlim(-300,300)
plt.ylim(-100,500)
#plt.grid()
plt.show()
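
The plotting cells above repeat most of their code, so they could be wrapped in a small helper. The sketch below is one hypothetical way to do it (plot_shots and toy_df are names introduced here for illustration). The Agg backend line is an assumption that lets the sketch run outside a notebook; drop it when using %matplotlib inline.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_shots(df, title):
    """Scatter converted shots in green and missed shots in red (hypothetical helper)."""
    made = df[df.SHOT_MADE_FLAG == 1]
    missed = df[df.SHOT_MADE_FLAG == 0]

    fig, ax = plt.subplots(figsize=(12, 11))
    ax.scatter(made.LOC_X, made.LOC_Y, color='green', label='converted', s=20, alpha=0.5)
    ax.scatter(missed.LOC_X, missed.LOC_Y, color='red', label='missed', s=20, alpha=0.5)
    ax.set_xlim(-300, 300)
    ax.set_ylim(-100, 500)
    ax.set_title(title)
    ax.legend()
    return ax


# Made-up shots, only to exercise the helper
toy_df = pd.DataFrame({
    'LOC_X': [114, -7, -105, 227],
    'LOC_Y': [148, 0, 63, -16],
    'SHOT_MADE_FLAG': [0, 1, 0, 0],
})
ax = plot_shots(toy_df, 'toy shot chart')
```

With the tutorial data, the same helper could be called as plot_shots(three_pointer, '3 point shots'), since three_pointer carries the LOC_X, LOC_Y and SHOT_MADE_FLAG columns.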

In [ ]:

Finally, we can use seaborn [3] to do more advanced plots,

In [61]:

# create our jointplot
# x and y are passed by keyword; height sets the size of the (square) figure in inches
joint_shot_chart = sns.jointplot(x=shot_df1.LOC_X, y=shot_df1.LOC_Y,
                                 kind='scatter', space=0.2, alpha=1,
                                 height=12, edgecolor='w', color='g').set_axis_labels("x location", "y location")

plt.show()

In [ ]:

In the next tutorial, we are going to work a little more on data analytics and advanced plotting.

In [ ]:

References

[1] http://pandas.pydata.org/

[2] http://matplotlib.org/

[3] http://stanford.edu/~mwaskom/software/seaborn/

In [ ]:

In [62]:

#import sys
#print('Python version:', sys.version_info)

#import IPython
#print('IPython version:', IPython.__version__)

#print('Requests version', requests.__version__)
#print('Pandas version:', pd.__version__)
#print('json version:', json.__version__)

#import matplotlib
#print('matplotlib version:', matplotlib.__version__)

#print('seaborn version:', sns.__version__)

In [ ]: