Data scraping, data wrangling, data analytics, exploratory data analysis, advaced plotting and clustering with love.¶

In this tutorial we mainly address advanced data analytics and plotting.

We are going to use pandas [1] to do data manipulation and data analytics of the DataFrame.

Finally, we are going to do advanced plotting using matplotlib [2] and seaborn [3].

In [ ]:

In [1]:

%matplotlib inline

In [ ]:

In [2]:

import pandas as pd
#import json
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [ ]:

To load the csv file and use column 0 as the row labels of the DataFrame, we proceed as follows,

In [3]:

#shot_df = pd.read_csv('test.csv')
#shot_df = pd.read_csv(filepath_or_buffer='test.csv')
shot_df = pd.read_csv(filepath_or_buffer='test.csv',index_col=0)

In [ ]:

To display the information in shot_df (we only display the first 10 rows), and force pandas to display all the columns, you can proceed as follows,

In [4]:

pd.set_option('display.max_columns', None)
shot_df.head(10)

Out[4]:

	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
0	Shot Chart Detail	21400018	4	2544	LeBron James	1610612739	Cleveland Cavaliers	1	11	20	Missed Shot	Jump Shot	2PT Field Goal	Mid-Range	Right Side Center(RC)	16-24 ft.	18	114	148	1	0
1	Shot Chart Detail	21400018	33	2544	LeBron James	1610612739	Cleveland Cavaliers	1	6	30	Made Shot	Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-7	0	1	1
2	Shot Chart Detail	21400018	53	2544	LeBron James	1610612739	Cleveland Cavaliers	1	4	45	Missed Shot	Fadeaway Jump Shot	2PT Field Goal	Mid-Range	Left Side(L)	8-16 ft.	12	-105	63	1	0
3	Shot Chart Detail	21400018	77	2544	LeBron James	1610612739	Cleveland Cavaliers	1	2	31	Missed Shot	Jump Shot	3PT Field Goal	Right Corner 3	Right Side(R)	24+ ft.	22	227	-16	1	0
4	Shot Chart Detail	21400018	82	2544	LeBron James	1610612739	Cleveland Cavaliers	1	1	51	Missed Shot	Jump Shot	3PT Field Goal	Above the Break 3	Right Side Center(RC)	24+ ft.	26	91	246	1	0
5	Shot Chart Detail	21400018	136	2544	LeBron James	1610612739	Cleveland Cavaliers	2	9	18	Missed Shot	Jump Bank Shot	2PT Field Goal	In The Paint (Non-RA)	Right Side(R)	8-16 ft.	9	70	58	1	0
6	Shot Chart Detail	21400018	202	2544	LeBron James	1610612739	Cleveland Cavaliers	2	3	16	Missed Shot	Jump Shot	3PT Field Goal	Above the Break 3	Right Side Center(RC)	24+ ft.	26	122	235	1	0
7	Shot Chart Detail	21400018	217	2544	LeBron James	1610612739	Cleveland Cavaliers	2	1	55	Missed Shot	Reverse Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-8	0	1	0
8	Shot Chart Detail	21400018	227	2544	LeBron James	1610612739	Cleveland Cavaliers	2	1	16	Missed Shot	Turnaround Jump Shot	2PT Field Goal	Mid-Range	Right Side(R)	8-16 ft.	13	135	22	1	0
9	Shot Chart Detail	21400018	299	2544	LeBron James	1610612739	Cleveland Cavaliers	3	6	54	Made Shot	Jump Shot	3PT Field Goal	Above the Break 3	Center(C)	24+ ft.	25	26	249	1	1

In [ ]:

We can create a new DataFrame using the DataFrame shot_df with some given columns, as follows,

In [5]:

shot_df1 = pd.DataFrame(shot_df, columns = ['PERIOD','SHOT_TYPE', 'SHOT_ZONE_BASIC', 'SHOT_MADE_FLAG', 'LOC_X', 'LOC_Y'])

In [ ]:

To display the information in shot_df1 (we only display the first 5 rows),

In [6]:

shot_df1.head()

Out[6]:

	PERIOD	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_MADE_FLAG	LOC_X	LOC_Y
0	1	2PT Field Goal	Mid-Range	0	114	148
1	1	2PT Field Goal	Restricted Area	1	-7	0
2	1	2PT Field Goal	Mid-Range	0	-105	63
3	1	3PT Field Goal	Right Corner 3	0	227	-16
4	1	3PT Field Goal	Above the Break 3	0	91	246

In [ ]:

Let's create a few DataFrame using shot_df.

Notice that we are using strings and logical operators to create the new DataFrame.

In [7]:

c1 = shot_df[(shot_df.SHOT_TYPE == '2PT Field Goal')]
c1c = shot_df[(shot_df.SHOT_TYPE == '2PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 1)]
c1m = shot_df[(shot_df.SHOT_TYPE == '2PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 0)]

c2 = shot_df[(shot_df.SHOT_TYPE == '3PT Field Goal')]
c2c = shot_df[(shot_df.SHOT_TYPE == '3PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 1)]
c2m = shot_df[(shot_df.SHOT_TYPE == '3PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 0)]

c3  = shot_df[(shot_df.SHOT_MADE_FLAG)]
c3c = shot_df[(shot_df.SHOT_MADE_FLAG == 1)]
c3m = shot_df[(shot_df.SHOT_MADE_FLAG == 0)]

In [ ]:

To compute the dimension of the DataFrame c3 or the total field goals attempted,

In [8]:

len(c3.index)

Out[8]:

In [ ]:

To compute the sum of the missed shots (c3m) and converted shots (c3c),

In [9]:

len(c3c.index) + len(c3m.index)

Out[9]:

In [ ]:

At this point, we can compute the statistics of the DataFrame we just created,

In [10]:

#shot_df1.LOC_X[(shot_df.SHOT_MADE_FLAG == 1)].sum()
shot_df1.sum()

Out[10]:

PERIOD                                                          3131
SHOT_TYPE          2PT Field Goal2PT Field Goal2PT Field Goal3PT ...
SHOT_ZONE_BASIC    Mid-RangeRestricted AreaMid-RangeRight Corner ...
SHOT_MADE_FLAG                                                   624
LOC_X                                                         -17903
LOC_Y                                                         107355
dtype: object

In [ ]:

or we can compute the sum of a single column,

In [11]:

#shot_df1.LOC_X[(shot_df.SHOT_MADE_FLAG == 1)].sum()
shot_df1.LOC_X.sum()

Out[11]:

-17903

In [ ]:

To compute the mean of a DataFrame,

In [12]:

shot_df1.LOC_X[(shot_df1.SHOT_MADE_FLAG == 1)].mean()

Out[12]:

-8.434294871794872

In [ ]:

To compute the cummulative sum of a DataFrame,

In [13]:

#shot_df1.LOC_Y[(shot_df.SHOT_MADE_FLAG == 1)].cumsum()
shot_df1.LOC_Y[(shot_df1.SHOT_MADE_FLAG == 1)].cumsum().tail()

Out[13]:

1269    38243
1270    38468
1271    38471
1273    38674
1275    38887
Name: LOC_Y, dtype: int64

In [ ]:

To count the number of non-NA values,

In [14]:

shot_df1.count()

Out[14]:

PERIOD             1279
SHOT_TYPE          1279
SHOT_ZONE_BASIC    1279
SHOT_MADE_FLAG     1279
LOC_X              1279
LOC_Y              1279
dtype: int64

In [ ]:

To compute the minimum value of a DataFrame,

In [15]:

shot_df1.min()

Out[15]:

PERIOD                             1
SHOT_TYPE             2PT Field Goal
SHOT_ZONE_BASIC    Above the Break 3
SHOT_MADE_FLAG                     0
LOC_X                           -245
LOC_Y                            -30
dtype: object

In [ ]:

To compute the maximum value of a DataFrame,

In [16]:

shot_df1.max()

Out[16]:

PERIOD                          5
SHOT_TYPE          3PT Field Goal
SHOT_ZONE_BASIC    Right Corner 3
SHOT_MADE_FLAG                  1
LOC_X                         241
LOC_Y                         418
dtype: object

In [ ]:

To compute the median of a DataFrame,

In [17]:

shot_df1.LOC_Y.median()

Out[17]:

42.0

In [ ]:

To compute the standard deviation of a DataFrame,

In [18]:

shot_df1.LOC_Y.std()

Out[18]:

91.99685249349442

In [ ]:

To compute the variance of a DataFrame,

In [19]:

shot_df1.LOC_Y.var()

Out[19]:

8463.420868709769

In [ ]:

To compute the skewness of a DataFrame,

In [20]:

shot_df1.LOC_Y.skew()

Out[20]:

0.81651481685383376

In [ ]:

To compute the kurtosis of a DataFrame,

In [21]:

shot_df1.LOC_X[(shot_df1.LOC_Y > 0)].kurt()

Out[21]:

-0.38142873041066466

In [ ]:

To compute the correlation matrix of a DataFrame,

In [22]:

shot_df1.corr()

Out[22]:

	PERIOD	SHOT_MADE_FLAG	LOC_X	LOC_Y
PERIOD	1.000000	0.008868	-0.028171	0.101323
SHOT_MADE_FLAG	0.008868	1.000000	0.052099	-0.229445
LOC_X	-0.028171	0.052099	1.000000	0.007676
LOC_Y	0.101323	-0.229445	0.007676	1.000000

In [ ]:

To compute a summary of the statistics of a DataFrame,

In [23]:

shot_df1.describe()

Out[23]:

	PERIOD	SHOT_MADE_FLAG	LOC_X	LOC_Y
count	1279.000000	1279.000000	1279.000000	1279.000000
mean	2.448006	0.487881	-13.997654	83.936669
std	1.137046	0.500049	104.266921	91.996852
min	1.000000	0.000000	-245.000000	-30.000000
25%	1.000000	0.000000	-94.000000	4.000000
50%	2.000000	0.000000	-2.000000	42.000000
75%	3.000000	1.000000	22.000000	164.000000
max	5.000000	1.000000	241.000000	418.000000

In [ ]:

We can also group data in a DataFrame using groupby. In this example, we want to group the information in the DataFrame shot_by by SHOT_ZONE_AREA

In [24]:

#gb=shot_df.groupby('SHOT_ZONE_AREA','SHOT_ATTEMPTED_FLAG')
gb=shot_df.groupby('SHOT_ZONE_AREA')

In [ ]:

which has the following type,

In [25]:

type(gb)

Out[25]:

pandas.core.groupby.DataFrameGroupBy

In [ ]:

To know the size of the DataFrameGroupBy (and the names of the groups),

In [26]:

gb.size()

Out[26]:

SHOT_ZONE_AREA
Back Court(BC)             3
Center(C)                670
Left Side Center(LC)     177
Left Side(L)             181
Right Side Center(RC)    149
Right Side(R)             99
dtype: int64

In [ ]:

We can use list() to view what that grouping looks like,

In [27]:

#The line is commented as it prints a lot information,
#list(gb)

In [ ]:

Now we can apply an operation to the group we just created. In this case we apply the operation to only one column,

In [28]:

#gb.describe()
#gb.SHOT_ATTEMPTED_FLAG.describe()

#gb['SHOT_ATTEMPTED_FLAG'].describe()
gb['SHOT_ATTEMPTED_FLAG'].describe().unstack()

Out[28]:

	count	mean	std	min	25%	50%	75%	max
SHOT_ZONE_AREA
Back Court(BC)	3	1	0	1	1	1	1	1
Center(C)	670	1	0	1	1	1	1	1
Left Side Center(LC)	177	1	0	1	1	1	1	1
Left Side(L)	181	1	0	1	1	1	1	1
Right Side Center(RC)	149	1	0	1	1	1	1	1
Right Side(R)	99	1	0	1	1	1	1	1

The unstack() option does kind of a pretty print.

You can also print the DataFrame without the unstack() option.

In [ ]:

Also, have in mind that when you apply an operation to a DataFrameGroupBy, it will return a DataFrame

In [29]:

type(gb['SHOT_ATTEMPTED_FLAG'].describe())

Out[29]:

pandas.core.series.Series

In [ ]:

We can group data using more than one variable,

In [30]:

gb1 = shot_df.groupby(['SHOT_ZONE_AREA','PERIOD'])

In [ ]:

To print the names of the groups in DataFrameGroupBy and their respectives size,

In [31]:

gb1.size()

Out[31]:

SHOT_ZONE_AREA         PERIOD
Back Court(BC)         2           3
Center(C)              1         203
                       2         182
                       3         143
                       4         136
                       5           6
Left Side Center(LC)   1          29
                       2          44
                       3          54
                       4          48
                       5           2
Left Side(L)           1          54
                       2          37
                       3          51
                       4          36
                       5           3
Right Side Center(RC)  1          33
                       2          32
                       3          43
                       4          40
                       5           1
Right Side(R)          1          27
                       2          26
                       3          20
                       4          26
dtype: int64

In [ ]:

To compute the summary of statistics of the group gb1,

In [32]:

pd.set_option('display.max_rows', None)
#gb1.describe()
gb1.describe().unstack()

Out[32]:

		GAME_EVENT_ID								GAME_ID								LOC_X								LOC_Y								MINUTES_REMAINING								PLAYER_ID								SECONDS_REMAINING								SHOT_ATTEMPTED_FLAG								SHOT_DISTANCE								SHOT_MADE_FLAG								TEAM_ID
		count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
SHOT_ZONE_AREA	PERIOD
Back Court(BC)	2	3	262.333333	13.576941	254	254.50	255.0	266.50	278	3	21400622.666667	361.950733	21400211	21400488.50	21400766.0	21400828.50	21400891	3	-36.000000	180.385698	-234	-113.50	7.0	63.00	119	3	408.333333	8.504901	402	403.50	405.0	411.50	418	3	0.000000	0.000000	0	0.00	0.0	0.00	0	3	2544	0	2544	2544	2544	2544	2544	3	0.000000	0.000000	0	0.00	0.0	0.00	0	3	1	0	1	1	1	1	1	3	43.000000	3.000000	40	41.50	43.0	44.50	46	3	0.000000	0.000000	0	0.00	0.0	0.00	0	3	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
Center(C)	1	203	56.103448	39.394478	2	21.50	47.0	91.00	154	203	21400595.044335	339.873321	21400018	21400292.00	21400658.0	21400854.00	21401203	203	-1.295567	24.519972	-84	-11.00	0.0	6.00	78	203	45.463054	80.599870	-13	1.00	7.0	44.50	291	203	5.630542	3.594062	0	2.00	6.0	9.00	11	203	2544	0	2544	2544	2544	2544	2544	203	26.275862	17.907152	0	9.00	27.0	40.00	59	203	1	0	1	1	1	1	1	203	4.896552	7.894988	0	0.00	1.0	5.00	29	203	0.596059	0.491899	0	0.00	1.0	1.00	1	203	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	2	182	195.423077	45.324881	107	156.50	195.0	236.25	292	182	21400600.357143	337.699397	21400018	21400308.00	21400643.0	21400883.00	21401203	182	-0.362637	20.689238	-87	-7.00	0.0	9.00	67	182	45.587912	80.545999	-17	1.00	9.0	38.50	373	182	4.961538	3.629781	0	1.25	5.0	8.00	11	182	2544	0	2544	2544	2544	2544	2544	182	27.082418	17.154766	0	11.00	29.0	41.00	58	182	1	0	1	1	1	1	1	182	4.747253	7.921010	0	0.00	1.0	5.00	37	182	0.604396	0.490329	0	0.00	1.0	1.00	1	182	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	3	143	324.566434	45.495327	227	288.00	322.0	357.50	448	143	21400552.496503	351.709585	21400018	21400238.00	21400453.0	21400854.00	21401203	143	-2.153846	24.325639	-65	-15.00	0.0	6.00	77	143	51.076923	84.421401	-19	1.00	11.0	57.00	301	143	5.195804	3.458524	0	2.00	5.0	8.00	11	143	2544	0	2544	2544	2544	2544	2544	143	30.202797	16.671456	0	17.50	30.0	44.00	59	143	1	0	1	1	1	1	1	143	5.426573	8.281567	0	0.00	1.0	6.00	30	143	0.636364	0.482737	0	0.00	1.0	1.00	1	143	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	4	136	450.992647	47.864701	339	420.75	451.0	486.25	581	136	21400645.411765	363.907934	21400018	21400308.00	21400700.0	21400995.00	21401203	136	-5.419118	26.283513	-76	-11.50	0.0	2.00	69	136	52.029412	84.447392	-11	1.00	11.5	44.25	306	136	5.455882	3.435906	0	3.00	5.5	8.00	11	136	2544	0	2544	2544	2544	2544	2544	136	30.345588	16.882767	1	15.75	30.0	44.25	59	136	1	0	1	1	1	1	1	136	5.602941	8.324243	0	0.00	1.0	6.00	30	136	0.588235	0.493972	0	0.00	1.0	1.00	1	136	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	5	6	577.666667	46.659047	522	540.75	577.5	618.75	628	6	21400452.500000	471.589122	21400022	21400022.00	21400452.5	21400883.00	21400883	6	10.000000	14.791890	-19	10.50	14.5	19.25	20	6	50.333333	95.088730	6	7.25	11.0	20.75	244	6	2.000000	1.414214	0	1.25	2.0	2.75	4	6	2544	0	2544	2544	2544	2544	2544	6	40.833333	10.515069	20	43.00	43.0	46.00	49	6	1	0	1	1	1	1	1	6	5.500000	9.071935	1	2.00	2.0	2.00	24	6	0.500000	0.547723	0	0.00	0.5	1.00	1	6	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
Left Side Center(LC)	1	29	51.620690	40.438148	4	21.00	35.0	76.00	144	29	21400619.517241	330.169288	21400125	21400336.00	21400620.0	21400839.00	21401178	29	-149.827586	40.551615	-226	-166.00	-144.0	-119.00	-82	29	182.896552	41.950945	105	148.00	184.0	217.00	255	29	6.275862	3.514222	0	4.00	7.0	9.00	11	29	2544	0	2544	2544	2544	2544	2544	29	28.034483	16.774732	0	16.00	29.0	43.00	52	29	1	0	1	1	1	1	1	29	23.689655	2.867346	16	24.00	25.0	25.00	28	29	0.448276	0.506120	0	0.00	0.0	1.00	1	29	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	2	44	202.000000	44.015853	117	163.75	207.0	238.00	274	44	21400591.386364	328.073603	21400055	21400315.25	21400623.0	21400829.25	21401092	44	-152.454545	42.163279	-239	-188.50	-147.0	-126.00	-64	44	183.181818	39.273799	97	153.25	189.5	213.75	257	44	4.318182	3.838439	0	1.00	4.0	8.25	11	44	2544	0	2544	2544	2544	2544	2544	44	26.681818	17.614440	0	9.50	28.0	40.50	58	44	1	0	1	1	1	1	1	44	23.795455	2.681370	17	24.00	25.0	25.00	28	44	0.295455	0.461522	0	0.00	0.0	1.00	1	44	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	3	54	322.277778	45.913543	248	290.75	314.0	342.00	456	54	21400668.277778	340.890256	21400066	21400374.00	21400709.0	21400907.00	21401203	54	-135.537037	31.051213	-215	-152.25	-131.0	-112.50	-70	54	190.833333	42.681687	101	151.50	203.0	219.00	304	54	5.888889	3.467731	0	4.00	7.0	9.00	11	54	2544	0	2544	2544	2544	2544	2544	54	28.222222	17.641789	0	12.25	30.0	42.00	59	54	1	0	1	1	1	1	1	54	23.203704	3.530886	16	20.00	24.0	25.00	32	54	0.425926	0.499126	0	0.00	0.0	1.00	1	54	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	4	48	456.229167	49.689350	355	419.50	452.5	490.25	551	48	21400611.729167	336.998626	21400018	21400418.25	21400667.0	21400887.25	21401148	48	-142.583333	43.279809	-245	-177.00	-131.0	-110.50	-75	48	195.666667	43.360602	88	157.00	210.0	230.25	265	48	5.395833	3.227143	0	3.00	5.0	8.00	11	48	2544	0	2544	2544	2544	2544	2544	48	30.687500	16.572415	1	17.25	32.0	43.25	56	48	1	0	1	1	1	1	1	48	24.312500	2.918439	16	24.00	25.0	26.00	28	48	0.354167	0.483321	0	0.00	0.0	1.00	1	48	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	5	2	579.500000	45.961941	547	563.25	579.5	595.75	612	2	21400923.500000	57.275649	21400883	21400903.25	21400923.5	21400943.75	21400964	2	-117.000000	33.941125	-141	-129.00	-117.0	-105.00	-93	2	216.500000	26.162951	198	207.25	216.5	225.75	235	2	1.500000	0.707107	1	1.25	1.5	1.75	2	2	2544	0	2544	2544	2544	2544	2544	2	43.000000	15.556349	32	37.50	43.0	48.50	54	2	1	0	1	1	1	1	1	2	24.500000	0.707107	24	24.25	24.5	24.75	25	2	0.500000	0.707107	0	0.25	0.5	0.75	1	2	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
Left Side(L)	1	54	55.314815	42.385840	2	19.75	41.0	93.50	144	54	21400598.314815	375.380452	21400018	21400241.75	21400642.0	21400897.75	21401203	54	-141.296296	45.644496	-235	-160.00	-137.0	-105.75	-48	54	38.888889	32.251176	-22	15.50	36.5	57.50	127	54	5.666667	3.865669	0	2.00	7.0	9.00	11	54	2544	0	2544	2544	2544	2544	2544	54	29.425926	17.608533	0	15.00	34.0	42.75	58	54	1	0	1	1	1	1	1	54	14.629630	4.057587	8	12.00	14.0	17.50	23	54	0.296296	0.460911	0	0.00	0.0	1.00	1	54	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	2	37	178.567568	48.461061	104	134.00	173.0	211.00	278	37	21400590.891892	376.121344	21400022	21400199.00	21400658.0	21400915.00	21401178	37	-141.108108	47.363713	-235	-168.00	-152.0	-100.00	-45	37	41.162162	32.009125	-6	19.00	37.0	63.00	115	37	6.459459	3.420482	0	4.00	7.0	9.00	11	37	2544	0	2544	2544	2544	2544	2544	37	27.891892	15.327506	1	16.00	27.0	39.00	57	37	1	0	1	1	1	1	1	37	14.567568	4.675828	8	10.00	15.0	19.00	23	37	0.351351	0.483978	0	0.00	0.0	1.00	1	37	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	3	51	326.823529	50.073428	237	285.00	324.0	362.50	448	51	21400555.392157	342.813540	21400018	21400224.50	21400453.0	21400815.00	21401203	51	-152.901961	43.258412	-235	-183.00	-147.0	-118.00	-78	51	41.882353	35.894650	-10	7.00	41.0	71.50	110	51	5.529412	3.951470	0	2.00	5.0	9.00	11	51	2544	0	2544	2544	2544	2544	2544	51	26.764706	17.214631	0	12.00	27.0	40.50	59	51	1	0	1	1	1	1	1	51	15.764706	3.982905	8	13.00	15.0	18.50	24	51	0.352941	0.482640	0	0.00	0.0	1.00	1	51	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	4	36	445.972222	40.500784	359	418.50	455.5	474.00	524	36	21400649.722222	376.531434	21400055	21400343.25	21400642.0	21400995.00	21401203	36	-146.027778	45.930684	-242	-171.00	-147.5	-114.00	-70	36	43.277778	35.611885	-13	16.25	40.5	58.50	124	36	5.527778	3.517191	0	2.00	5.0	9.00	11	36	2544	0	2544	2544	2544	2544	2544	36	29.583333	18.885179	0	12.25	30.0	46.25	58	36	1	0	1	1	1	1	1	36	15.138889	4.599085	8	11.75	15.0	18.00	24	36	0.388889	0.494413	0	0.00	0.0	1.00	1	36	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	5	3	571.000000	21.166010	547	563.00	579.0	583.00	587	3	21400022.000000	0.000000	21400022	21400022.00	21400022.0	21400022.00	21400022	3	-132.000000	38.000000	-164	-153.00	-142.0	-116.00	-90	3	67.333333	57.134345	4	43.50	83.0	99.00	115	3	1.000000	1.732051	0	0.00	0.0	1.50	3	3	2544	0	2544	2544	2544	2544	2544	3	27.000000	22.649503	6	15.00	24.0	37.50	51	3	1	0	1	1	1	1	1	3	15.333333	1.154701	14	15.00	16.0	16.00	16	3	0.000000	0.000000	0	0.00	0.0	0.00	0	3	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
Right Side Center(RC)	1	33	58.606061	36.952621	2	34.00	55.0	82.00	128	33	21400381.636364	323.949535	21400018	21400133.00	21400273.0	21400570.00	21401167	33	138.696970	36.666133	78	113.00	141.0	157.00	220	33	192.939394	44.687764	124	154.00	200.0	219.00	313	33	5.000000	3.526684	0	1.00	5.0	7.00	11	33	2544	0	2544	2544	2544	2544	2544	33	26.757576	16.414609	0	16.00	26.0	37.00	59	33	1	0	1	1	1	1	1	33	23.696970	3.566203	17	21.00	25.0	26.00	33	33	0.242424	0.435194	0	0.00	0.0	0.00	1	33	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	2	32	194.187500	45.258710	112	160.75	198.5	226.50	282	32	21400646.218750	313.708179	21400018	21400443.75	21400697.0	21400885.00	21401078	32	137.218750	39.067959	67	109.50	128.0	159.25	223	32	192.406250	58.923055	108	155.50	191.0	220.25	377	32	5.031250	3.864515	0	1.00	6.0	8.00	11	32	2544	0	2544	2544	2544	2544	2544	32	19.968750	16.395287	0	5.75	16.5	31.50	58	32	1	0	1	1	1	1	1	32	23.593750	5.375328	16	20.00	24.0	25.25	41	32	0.343750	0.482559	0	0.00	0.0	1.00	1	32	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	3	43	336.348837	43.759562	249	310.00	338.0	364.50	416	43	21400671.279070	310.432519	21400066	21400408.50	21400766.0	21400883.00	21401203	43	148.651163	38.434783	77	123.50	147.0	175.00	223	43	187.883721	43.155979	108	159.50	192.0	215.00	277	43	4.046512	3.272930	0	1.00	4.0	6.50	11	43	2544	0	2544	2544	2544	2544	2544	43	28.953488	20.499656	0	9.50	34.0	44.00	58	43	1	0	1	1	1	1	1	43	23.883721	3.325295	16	24.00	25.0	26.00	30	43	0.441860	0.502486	0	0.00	0.0	1.00	1	43	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	4	40	459.050000	46.408968	370	421.75	464.0	487.00	557	40	21400638.225000	281.721675	21400098	21400425.75	21400639.0	21400861.25	21401167	40	132.150000	39.003320	67	103.75	128.5	148.00	218	40	190.975000	53.965082	108	144.00	200.5	230.75	361	40	5.225000	3.555332	0	2.75	5.5	8.00	11	40	2544	0	2544	2544	2544	2544	2544	40	28.300000	16.469397	2	14.00	28.0	42.00	59	40	1	0	1	1	1	1	1	40	23.200000	4.569688	16	19.75	25.0	26.00	39	40	0.425000	0.500641	0	0.00	0.0	1.00	1	40	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	5	1	558.000000	NaN	558	558.00	558.0	558.00	558	1	21400964.000000	NaN	21400964	21400964.00	21400964.0	21400964.00	21400964	1	176.000000	NaN	176	176.00	176.0	176.00	176	1	186.000000	NaN	186	186.00	186.0	186.00	186	1	0.000000	NaN	0	0.00	0.0	0.00	0	1	2544	NaN	2544	2544	2544	2544	2544	1	32.000000	NaN	32	32.00	32.0	32.00	32	1	1	NaN	1	1	1	1	1	1	25.000000	NaN	25	25.00	25.0	25.00	25	1	1.000000	NaN	1	1.00	1.0	1.00	1	1	1610612739	NaN	1610612739	1610612739	1610612739	1610612739	1610612739
Right Side(R)	1	27	51.814815	32.643809	6	24.00	46.0	76.50	116	27	21400498.296296	323.769720	21400018	21400226.00	21400433.0	21400776.00	21401124	27	166.444444	53.197985	70	127.00	162.0	227.50	241	27	38.370370	36.061004	-16	8.00	31.0	68.00	116	27	5.703704	3.291187	0	3.50	6.0	8.00	11	27	2544	0	2544	2544	2544	2544	2544	27	30.629630	16.699028	1	22.50	31.0	38.00	59	27	1	0	1	1	1	1	1	27	17.222222	4.551697	9	14.00	16.0	22.00	24	27	0.333333	0.480384	0	0.00	0.0	1.00	1	27	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	2	26	196.961538	47.785756	129	153.75	197.5	226.25	301	26	21400670.923077	389.014542	21400018	21400286.50	21400714.5	21401023.00	21401178	26	136.000000	60.099251	62	76.25	135.0	172.50	239	26	50.000000	28.223395	-6	28.50	55.0	66.75	101	26	4.884615	3.502307	0	2.00	5.0	8.00	11	26	2544	0	2544	2544	2544	2544	2544	26	28.500000	19.146279	0	12.50	30.5	39.75	59	26	1	0	1	1	1	1	1	26	14.461538	5.558638	8	10.00	13.5	17.75	24	26	0.230769	0.429669	0	0.00	0.0	0.00	1	26	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	3	20	321.450000	40.158207	261	296.50	316.5	347.25	426	20	21400700.600000	329.374751	21400098	21400582.75	21400733.0	21400934.00	21401203	20	148.750000	57.570895	67	100.25	150.0	190.50	239	20	56.800000	35.402349	-5	40.00	54.5	77.75	118	20	5.350000	3.313052	0	2.75	5.0	8.25	11	20	2544	0	2544	2544	2544	2544	2544	20	29.500000	17.024750	3	17.75	25.5	39.50	59	20	1	0	1	1	1	1	1	20	15.850000	5.431632	8	10.50	16.0	21.25	23	20	0.450000	0.510418	0	0.00	0.0	1.00	1	20	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739
	4	26	446.576923	37.038545	367	419.00	452.5	467.25	527	26	21400685.884615	325.674602	21400211	21400319.75	21400667.0	21401023.00	21401167	26	132.461538	59.717489	45	90.25	108.0	168.25	241	26	38.807692	31.304976	-30	20.00	40.5	56.75	112	26	5.846154	3.294751	0	3.25	5.0	9.00	11	26	2544	0	2544	2544	2544	2544	2544	26	30.192308	13.608877	7	19.00	28.5	41.75	51	26	1	0	1	1	1	1	1	26	13.961538	5.325266	8	10.00	12.0	16.75	24	26	0.423077	0.503831	0	0.00	0.0	1.00	1	26	1610612739	0	1610612739	1610612739	1610612739	1610612739	1610612739

In [ ]:

Let's create another group,

In [33]:

gb2 = shot_df.groupby('SHOT_ZONE_AREA')

In [ ]:

and extract a specific sub-group of the DataFrameGroupBy as follows,

In [34]:

gb2.get_group('Center(C)').head(2)

Out[34]:

	GRID_TYPE	GAME_ID	GAME_EVENT_ID	PLAYER_ID	PLAYER_NAME	TEAM_ID	TEAM_NAME	PERIOD	MINUTES_REMAINING	SECONDS_REMAINING	EVENT_TYPE	ACTION_TYPE	SHOT_TYPE	SHOT_ZONE_BASIC	SHOT_ZONE_AREA	SHOT_ZONE_RANGE	SHOT_DISTANCE	LOC_X	LOC_Y	SHOT_ATTEMPTED_FLAG	SHOT_MADE_FLAG
1	Shot Chart Detail	21400018	33	2544	LeBron James	1610612739	Cleveland Cavaliers	1	6	30	Made Shot	Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-7	0	1	1
7	Shot Chart Detail	21400018	217	2544	LeBron James	1610612739	Cleveland Cavaliers	2	1	55	Missed Shot	Reverse Layup Shot	2PT Field Goal	Restricted Area	Center(C)	Less Than 8 ft.	0	-8	0	1	0

In [ ]:

To compute the number of entries in the group gb1,

In [35]:

#gb1.PERIOD.count()
gb1.PERIOD.count().unstack()

Out[35]:

PERIOD	1	2	3	4	5
SHOT_ZONE_AREA
Back Court(BC)	NaN	3	NaN	NaN	NaN
Center(C)	203	182	143	136	6
Left Side Center(LC)	29	44	54	48	2
Left Side(L)	54	37	51	36	3
Right Side Center(RC)	33	32	43	40	1
Right Side(R)	27	26	20	26	NaN

In [ ]:

To compute the number of entries in the group gb1 you can also proceed in the following way,

In [36]:

#gb1.size()
gb1.size().unstack()

Out[36]:

PERIOD	1	2	3	4	5
SHOT_ZONE_AREA
Back Court(BC)	NaN	3	NaN	NaN	NaN
Center(C)	203	182	143	136	6
Left Side Center(LC)	29	44	54	48	2
Left Side(L)	54	37	51	36	3
Right Side Center(RC)	33	32	43	40	1
Right Side(R)	27	26	20	26	NaN

In [ ]:

To compute the summary of statistics of the group gb1 for the column PERIOD,

In [37]:

#gb1.PERIOD.describe()
gb1.PERIOD.describe().unstack()

Out[37]:

		count	mean	std	min	25%	50%	75%	max
SHOT_ZONE_AREA	PERIOD
Back Court(BC)	2	3	2	0	2	2	2	2	2
Center(C)	1	203	1	0	1	1	1	1	1
	2	182	2	0	2	2	2	2	2
	3	143	3	0	3	3	3	3	3
	4	136	4	0	4	4	4	4	4
	5	6	5	0	5	5	5	5	5
Left Side Center(LC)	1	29	1	0	1	1	1	1	1
	2	44	2	0	2	2	2	2	2
	3	54	3	0	3	3	3	3	3
	4	48	4	0	4	4	4	4	4
	5	2	5	0	5	5	5	5	5
Left Side(L)	1	54	1	0	1	1	1	1	1
	2	37	2	0	2	2	2	2	2
	3	51	3	0	3	3	3	3	3
	4	36	4	0	4	4	4	4	4
	5	3	5	0	5	5	5	5	5
Right Side Center(RC)	1	33	1	0	1	1	1	1	1
	2	32	2	0	2	2	2	2	2
	3	43	3	0	3	3	3	3	3
	4	40	4	0	4	4	4	4	4
	5	1	5	NaN	5	5	5	5	5
Right Side(R)	1	27	1	0	1	1	1	1	1
	2	26	2	0	2	2	2	2	2
	3	20	3	0	3	3	3	3	3
	4	26	4	0	4	4	4	4	4

In [ ]:

At this point, let's do some advanced plotting using matplotlib and seaborn.¶

There are many ways to plot the data in a pandas DataFrame, let's use first matplotlib.

Let's plot all the field attempts,

In [38]:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
sns.set_color_codes()

plt.figure(figsize=(10,10))
plt.scatter( shot_df.LOC_X, shot_df.LOC_Y,s=20,marker='o',alpha=1)

plt.xlim(-300,300)
plt.ylim(-100,500)

plt.title('Lebron James shot chart 2014-2015 \n All field attempts', y = 1.02, fontsize=20)

plt.grid()
plt.show()

In [ ]:

Let's plot the converted and missed shots,

In [39]:

sns.set_style("white")
sns.set_color_codes()


plt.figure(figsize=(10,10))

plt.scatter(shot_df[shot_df.SHOT_MADE_FLAG == 1].LOC_X, shot_df[shot_df.SHOT_MADE_FLAG == 1].LOC_Y, color='green',label='shots converted',s=20,marker='o',alpha=1.0/2.0)
plt.scatter(shot_df[shot_df.SHOT_MADE_FLAG == 0].LOC_X, shot_df[shot_df.SHOT_MADE_FLAG == 0].LOC_Y, color='red',label='shots missed',s=20,marker='o',alpha=0.5)

plt.legend()

plt.xlim(-300,300)
plt.ylim(-100,500)

plt.title('Lebron James shot chart 2014-2015 \n Converted-missed shots', y = 1.02, fontsize=20)

plt.grid()
plt.show()

In [ ]:

Let's see in what position Lebron took more shots by using histograms.

First we need to create a DataFrame with the LOC_X and LOC_Y information,

In [40]:

h1=pd.DataFrame(shot_df, columns=['LOC_X', 'LOC_Y'])

In [ ]:

Now we use this DataFrame to plot the histograms for LOC_X and LOC_Y,

In [41]:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)

plt.figure(figsize=(10,10))

plt.subplot(2, 1, 1)
sns.distplot(h1.LOC_X,kde=False);

plt.subplot(2, 1, 2)
sns.distplot(h1.LOC_Y,kde=False);

As we can see from these histogram, Lebron took more shots around the rim.

We are doing exploratory data analysis or EDA. At this point, take your time and try to do different plots, there are many types and combinations.

In [ ]:

Let's plot all shots and the histograms in one single plot, for this we are going to use seaborn,

In [42]:

# create our jointplot
sns.set(color_codes=True)

joint_shot_chart = sns.jointplot(shot_df1.LOC_X, shot_df1.LOC_Y, stat_func=None,
                                 kind='scatter',space=0.2, alpha=1, 
                                 size=8, edgecolor='w', color='b').set_axis_labels("x location", "y location")

# A joint plot has 3 Axes, the first one called ax_joint 
# is the one we want to adjust
joint_shot_chart.fig.set_size_inches(10,10)
ax = joint_shot_chart.ax_joint

# Adjust the axis limits and orientation of the plot in order
ax.set_xlim(-250,250)
#ax.set_ylim(422.5, -47.5)
ax.set_ylim(-47.5, 422.5)

# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom='off', labelleft='off')

# Add a title
ax.set_title('Lebron James shot chart 2014-2015', 
             y=1.25, fontsize=14)

plt.show()

As we can see here, Lebron is very active close to the rim area.

What about his effectivity?, this is let to you as an exercise.

In [ ]:

Instead of using a scatter plot, we can use hexbins. In seaborn, you can proceed as follows,

In [43]:

# create our jointplot
sns.set_style("white")

cmap=plt.cm.gist_heat_r
joint_shot_chart = sns.jointplot(shot_df.LOC_X, shot_df.LOC_Y, stat_func=None,
                                 kind='hex',gridsize=40,space=0, color=cmap(0.2), cmap=cmap,vmin=0, vmax=50)

#joint_shot_chart = sns.jointplot(shot_df.LOC_X, shot_df.LOC_Y, stat_func=None,
#                                 kind='hex',space=0, color=cmap(0.2), cmap=cmap)

joint_shot_chart.fig.set_size_inches(12,11)

# A joint plot has 3 Axes, the first one called ax_joint 
# is the one we want to adjust
ax = joint_shot_chart.ax_joint

# Adjust the axis limits and orientation of the plot in order
ax.set_xlim(-250,250)
#ax.set_ylim(422.5, -47.5)
ax.set_ylim(-47.5, 422.5)

# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom='off', labelleft='off')

# Add a title
ax.set_title('Lebron James shot chart 2014-2015', 
             y=1.25, fontsize=14)


# Add James Harden's image to the top right
#img = OffsetImage(image, zoom=0.6)
#img.set_offset((625,621))
#ax.add_artist(img)

plt.show()

In [ ]:

Let's use boxplot to represent the field attempts, this kind of plots are very useful to identify outliers or anomalies in your observations.

In [44]:

sns.boxplot(shot_df.LOC_Y, orient="v")
plt.title('Lebron James - Boxplot of all field attempts for LOC_Y', y = 1.1, fontsize=20)

Out[44]:

<matplotlib.text.Text at 0x10a36ee10>

In this plot we can see that there are a few outliers or irregular observations.

In [ ]:

We can also plot the boxplot with the scatter point superimpose, as follows,

In [45]:

#sns.stripplot(data=shot_df.LOC_Y, jitter=True, color="white", edgecolor="gray")
sns.stripplot(data=shot_df.LOC_Y, jitter=0.3, color="white", edgecolor="gray")
sns.boxplot(shot_df.LOC_Y, orient="v")

plt.title('Lebron James - Boxplot of all field attempts for LOC_Y', y = 1.1, fontsize=20)

Out[45]:

<matplotlib.text.Text at 0x10a56f5d0>

Here we see better the outliers, which corresponds to the shots taken from far away, and presumably close to the end of a period. This is left as an exercise to the reader.

In [ ]:

We can do the same for LOC_X,

In [46]:

sns.stripplot(data=shot_df.LOC_X, jitter=0.3, color="white", edgecolor="gray")
sns.boxplot(shot_df.LOC_X, orient="v")

plt.title('Lebron James - Boxplot of all field attempts for LOC_X', y = 1.1, fontsize=20)

Out[46]:

<matplotlib.text.Text at 0x10a25ced0>

In [ ]:

At the point let's do some pretty plotting. We are going to add a background to the scatter plot and we are going to plot only the 3 point shots attempted.

To plot the backgorund image we are going to use scipy [4].

In [47]:

from scipy.misc import imread

plt.figure(figsize=(10,10))

#draw_court(outer_lines=True)
#plt.xlim(-300,300)
#plt.ylim(-100,500)

datafile = 'chart1.png'
#datafile = 'chart2.jpg'
#datafile = 'bg_court.jpg'
img = imread(datafile)

#plt.imshow(img, zorder=0, extent=[-260, 260, -60, 400])
plt.imshow(img, zorder=0, extent=[-260, 260, -400, 60])
plt.scatter(c2.LOC_X,-1*c2.LOC_Y,zorder=1,color='blue',label='3 point shots attempted',s=20,marker='o',alpha=0.6)

plt.title('Lebron James shot chart 2014-2015 \n 3 point shots attempted', y = 1.02, fontsize=20)

#plt.ylim(400, -60)
plt.show()

Much better no, here we see the importance of visualizing the data and using the right visual encoders.

Now we get a better idea of those outliers we identified with the boxplots.

In [ ]:

Let's do the same plot of the 2 point shots attempted,

In [48]:

from scipy.misc import imread

plt.figure(figsize=(10,10))

#draw_court(outer_lines=True)
#plt.xlim(-300,300)
#plt.ylim(-100,500)

datafile = 'chart1.png'
#datafile = 'chart2.jpg'
#datafile = 'bg_court.jpg'
img = imread(datafile)

#plt.imshow(img, zorder=0, extent=[-260, 260, -60, 400])
plt.imshow(img, zorder=0, extent=[-260, 260, -400, 60])
plt.scatter(c1.LOC_X,-1*c1.LOC_Y,zorder=1,color='green',label='2 point shots attempted',s=20,marker='o',alpha=0.6)

plt.title('Lebron James shot chart 2014-2015 \n 2 point shots attempted', y = 1.02, fontsize=20)

#plt.ylim(400, -60)
plt.show()

In [ ]:

Let's do a pretty plot of the 2 point shots attempted, grouped by missed and converted shots.

Pay attention of how we are manipulating the DataFrame at plot time.

In [49]:

from scipy.misc import imread

plt.figure(figsize=(10,10))

#draw_court(outer_lines=True)
#plt.xlim(-300,300)
#plt.ylim(-100,500)

datafile = 'chart1.png'
#datafile = 'chart2.jpg'
#datafile = 'bg_court.jpg'
img = imread(datafile)

#plt.imshow(img, zorder=0, extent=[-260, 260, -60, 400])
plt.imshow(img, zorder=0, extent=[-260, 260, -400, 60])
plt.scatter(shot_df1.LOC_X[(shot_df1.SHOT_MADE_FLAG == 1) & (shot_df.SHOT_TYPE == '2PT Field Goal')] ,-1*shot_df1.LOC_Y[(shot_df1.SHOT_MADE_FLAG == 1) & (shot_df.SHOT_TYPE == '2PT Field Goal')],
            zorder=1,color='green',label='2 point shots converted',s=20,marker='o',alpha=0.6)
plt.scatter(shot_df1.LOC_X[(shot_df1.SHOT_MADE_FLAG == 0) & (shot_df.SHOT_TYPE == '2PT Field Goal')] ,-1*shot_df1.LOC_Y[(shot_df1.SHOT_MADE_FLAG == 0) & (shot_df.SHOT_TYPE == '2PT Field Goal')],
            zorder=1,color='red',label='2 point shots converted',s=10,marker='o',alpha=0.6)



plt.title('Lebron James shot chart 2014-2015 \n 2 point shots attempted', y = 1.02, fontsize=20)

plt.legend(loc='lower left')

#plt.ylim(400, -60)
plt.show()

In [ ]:

In the previous plots, we used matplotlib and seaborn.

We can also plot using pandas as follows,

In [50]:

shot_df.plot(kind='scatter',x='LOC_X', y='LOC_Y',c='SHOT_MADE_FLAG',colorbar=False);

#plt.xlim(-300,300)
#plt.ylim(-100,500)

_ = plt.xlim(-300,300)
_ = plt.ylim(-100,500)

#The _ is used to hold the result of the last executed statement in an interactive interpreter session.
#If you do not use, it will print out the value of the last statement

In [ ]:

To plot hexbins using pandas we proceed as follows,

In [51]:

shot_df.plot(kind='hexbin', x='LOC_X', y='LOC_Y', gridsize=50, colorbar=False, vmin=0, vmax=10)
plt.xlim(-300,300)
_ = plt.ylim(-100,500)

#The _ is used to hold the result of the last executed statement in an interactive interpreter session.
#If you do not use, it will print out the value of the last statement

In [ ]:

We can also plot using the DataFrameGroupBy.

Let's create a group as follows,

In [52]:

#gb3 = shot_df.groupby('SHOT_ZONE_AREA')
gb3 = shot_df.groupby('SHOT_TYPE')

#gb3x = shot_df.groupby('SHOT_TYPE').LOC_X
#gb3y = shot_df.groupby('SHOT_TYPE').LOC_Y

In [ ]:

which has a size of,

In [53]:

gb3.size()

Out[53]:

SHOT_TYPE
2PT Field Goal    940
3PT Field Goal    339
dtype: int64

In [ ]:

and a type,

In [54]:

type(gb3)

Out[54]:

pandas.core.groupby.DataFrameGroupBy

In [ ]:

Now we can collect the coordinates LOC_X and LOC_Y of the group we just created,

In [55]:

x=gb3.LOC_X.get_group('2PT Field Goal')
y=gb3.LOC_Y.get_group('2PT Field Goal')
xx=gb3.LOC_X.get_group('3PT Field Goal')
yy=gb3.LOC_Y.get_group('3PT Field Goal')

In [ ]:

In [56]:

#x=gb3.LOC_X.head(2)
#y=gb3.LOC_Y.head(2)

In [ ]:

Now we can do the scatter plot as follows,

In [57]:

plt.scatter( x, y,s=20,marker='o',alpha=1,color='green')
plt.scatter( xx, yy,s=20,marker='o',alpha=1,color='red')

Out[57]:

<matplotlib.collections.PathCollection at 0x10b4a55d0>

In [ ]:

We can also directly plot the DataFrameGroupBy as follows,

In [58]:

gb3.plot(kind='scatter', x='LOC_X', y='LOC_Y',xlim=[-300,300], ylim=[-50,500]);

Notice that it automatically creates two plots, one corresponding to the two point shots and the other one to the three point shots.

With a little bit manipulation, you can put both plots in a single figure and use colors to represent the shot type, this is left to the reader as an exercise.

In [ ]:

We can also plot histograms of the DataFrameGroupBy,

In [59]:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
sns.set_color_codes()

gb3.PERIOD.plot(kind='hist')
plt.legend()

Out[59]:

<matplotlib.legend.Legend at 0x10877f410>

In [ ]:

Let's create a new DataFrameGroupBy as follows,

In [60]:

gb4 = shot_df.groupby('SHOT_ZONE_AREA')

In [61]:

gb4.size()

Out[61]:

SHOT_ZONE_AREA
Back Court(BC)             3
Center(C)                670
Left Side Center(LC)     177
Left Side(L)             181
Right Side Center(RC)    149
Right Side(R)             99
dtype: int64

In [ ]:

And to plot the histograms,

In [62]:

#gb4.LOC_X.hist()

#It does not print out data
_ = gb4.LOC_X.hist()

In [63]:

#gb4.LOC_Y.hist()

#It does not print out data
_ = gb4.LOC_Y.hist()

In [ ]:

Let's do the scatter plots for the DataFrameGroupBy,

In [64]:

gb4.plot(kind='scatter', x='LOC_X', y='LOC_Y',xlim=[-300,300], ylim=[-50,500]);

In [ ]:

We can choose to plot only the Center shots of the DataFrameGroupBy as follows,

In [65]:

from scipy.misc import imread

plt.figure(figsize=(10,10))

datafile = 'chart1.png'
img = imread(datafile)

#gb4.get_group('Center(C)').head(2)
#gb4.get_group('Center(C)').LOC_X

plt.imshow(img, zorder=0, extent=[-260, 260, -400, 60])

plt.scatter(gb4.get_group('Center(C)').LOC_X, -1*gb4.get_group('Center(C)').LOC_Y,s=20,marker='o',alpha=1,color='green')

#plt.xlim(-300,300)
#plt.ylim(-100,500)

Out[65]:

<matplotlib.collections.PathCollection at 0x10d92a6d0>

In [ ]:

As you can see, there are many ways to plot using Python, it is up to you to choose.¶

The main take away of this tutorial, is that by using exploratory data analysis (EDA) we can analyze data sets using visual methods.¶

In [ ]:

In the next tutorial, we are going to work a little bit more on advanced data analytics methods such as unsupervised methods or clustering techniques and a little bit more on advanced data visualization.¶

In [ ]:

References¶

[1] http://pandas.pydata.org/

[2] http://matplotlib.org/

[3] http://stanford.edu/~mwaskom/software/seaborn/

[4] http://www.scipy.org/

In [ ]:

In [66]:

#import sys
#print('Python version:', sys.version_info)

#import IPython
#print('IPython version:', IPython.__version__)

#print('Requests version', requests.__version__)
#print('Pandas version:', pd.__version__)
#print('json version:', json.__version__)

#import matplotlib
#print('matplotlib version:', matplotlib.__version__)

#print('seaborn version:', sns.__version__)

#import scipy
#print('scipy version:', scipy.__version__)

In [ ]: