Data scraping, data wrangling, data analytics, exploratory data analysis, advaced plotting and clustering with love.

In this tutorial we mainly address advanced data analytics and plotting.

We are going to use pandas [1] to do data manipulation and data analytics of the DataFrame.

Finally, we are going to do advanced plotting using matplotlib [2] and seaborn [3].

In [ ]:
 
In [1]:
%matplotlib inline
In [ ]:
 
In [2]:
import pandas as pd
#import json
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
In [ ]:
 

To load the csv file and use column 0 as the row labels of the DataFrame, we proceed as follows,

In [3]:
#shot_df = pd.read_csv('test.csv')
#shot_df = pd.read_csv(filepath_or_buffer='test.csv')
shot_df = pd.read_csv(filepath_or_buffer='test.csv',index_col=0)
In [ ]:
 

To display the information in shot_df (we only display the first 10 rows), and force pandas to display all the columns, you can proceed as follows,

In [4]:
pd.set_option('display.max_columns', None)
shot_df.head(10)
Out[4]:
GRID_TYPE GAME_ID GAME_EVENT_ID PLAYER_ID PLAYER_NAME TEAM_ID TEAM_NAME PERIOD MINUTES_REMAINING SECONDS_REMAINING EVENT_TYPE ACTION_TYPE SHOT_TYPE SHOT_ZONE_BASIC SHOT_ZONE_AREA SHOT_ZONE_RANGE SHOT_DISTANCE LOC_X LOC_Y SHOT_ATTEMPTED_FLAG SHOT_MADE_FLAG
0 Shot Chart Detail 21400018 4 2544 LeBron James 1610612739 Cleveland Cavaliers 1 11 20 Missed Shot Jump Shot 2PT Field Goal Mid-Range Right Side Center(RC) 16-24 ft. 18 114 148 1 0
1 Shot Chart Detail 21400018 33 2544 LeBron James 1610612739 Cleveland Cavaliers 1 6 30 Made Shot Layup Shot 2PT Field Goal Restricted Area Center(C) Less Than 8 ft. 0 -7 0 1 1
2 Shot Chart Detail 21400018 53 2544 LeBron James 1610612739 Cleveland Cavaliers 1 4 45 Missed Shot Fadeaway Jump Shot 2PT Field Goal Mid-Range Left Side(L) 8-16 ft. 12 -105 63 1 0
3 Shot Chart Detail 21400018 77 2544 LeBron James 1610612739 Cleveland Cavaliers 1 2 31 Missed Shot Jump Shot 3PT Field Goal Right Corner 3 Right Side(R) 24+ ft. 22 227 -16 1 0
4 Shot Chart Detail 21400018 82 2544 LeBron James 1610612739 Cleveland Cavaliers 1 1 51 Missed Shot Jump Shot 3PT Field Goal Above the Break 3 Right Side Center(RC) 24+ ft. 26 91 246 1 0
5 Shot Chart Detail 21400018 136 2544 LeBron James 1610612739 Cleveland Cavaliers 2 9 18 Missed Shot Jump Bank Shot 2PT Field Goal In The Paint (Non-RA) Right Side(R) 8-16 ft. 9 70 58 1 0
6 Shot Chart Detail 21400018 202 2544 LeBron James 1610612739 Cleveland Cavaliers 2 3 16 Missed Shot Jump Shot 3PT Field Goal Above the Break 3 Right Side Center(RC) 24+ ft. 26 122 235 1 0
7 Shot Chart Detail 21400018 217 2544 LeBron James 1610612739 Cleveland Cavaliers 2 1 55 Missed Shot Reverse Layup Shot 2PT Field Goal Restricted Area Center(C) Less Than 8 ft. 0 -8 0 1 0
8 Shot Chart Detail 21400018 227 2544 LeBron James 1610612739 Cleveland Cavaliers 2 1 16 Missed Shot Turnaround Jump Shot 2PT Field Goal Mid-Range Right Side(R) 8-16 ft. 13 135 22 1 0
9 Shot Chart Detail 21400018 299 2544 LeBron James 1610612739 Cleveland Cavaliers 3 6 54 Made Shot Jump Shot 3PT Field Goal Above the Break 3 Center(C) 24+ ft. 25 26 249 1 1
In [ ]:
 

We can create a new DataFrame using the DataFrame shot_df with some given columns, as follows,

In [5]:
shot_df1 = pd.DataFrame(shot_df, columns = ['PERIOD','SHOT_TYPE', 'SHOT_ZONE_BASIC', 'SHOT_MADE_FLAG', 'LOC_X', 'LOC_Y'])
In [ ]:
 

To display the information in shot_df1 (we only display the first 5 rows),

In [6]:
shot_df1.head()
Out[6]:
PERIOD SHOT_TYPE SHOT_ZONE_BASIC SHOT_MADE_FLAG LOC_X LOC_Y
0 1 2PT Field Goal Mid-Range 0 114 148
1 1 2PT Field Goal Restricted Area 1 -7 0
2 1 2PT Field Goal Mid-Range 0 -105 63
3 1 3PT Field Goal Right Corner 3 0 227 -16
4 1 3PT Field Goal Above the Break 3 0 91 246
In [ ]:
 

Let's create a few DataFrame using shot_df.

Notice that we are using strings and logical operators to create the new DataFrame.

In [7]:
c1 = shot_df[(shot_df.SHOT_TYPE == '2PT Field Goal')]
c1c = shot_df[(shot_df.SHOT_TYPE == '2PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 1)]
c1m = shot_df[(shot_df.SHOT_TYPE == '2PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 0)]

c2 = shot_df[(shot_df.SHOT_TYPE == '3PT Field Goal')]
c2c = shot_df[(shot_df.SHOT_TYPE == '3PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 1)]
c2m = shot_df[(shot_df.SHOT_TYPE == '3PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 0)]

c3  = shot_df[(shot_df.SHOT_MADE_FLAG)]
c3c = shot_df[(shot_df.SHOT_MADE_FLAG == 1)]
c3m = shot_df[(shot_df.SHOT_MADE_FLAG == 0)]
In [ ]:
 

To compute the dimension of the DataFrame c3 or the total field goals attempted,

In [8]:
len(c3.index)
Out[8]:
1279
In [ ]:
 

To compute the sum of the missed shots (c3m) and converted shots (c3c),

In [9]:
len(c3c.index) + len(c3m.index)
Out[9]:
1279
In [ ]:
 

At this point, we can compute the statistics of the DataFrame we just created,

In [10]:
#shot_df1.LOC_X[(shot_df.SHOT_MADE_FLAG == 1)].sum()
shot_df1.sum()
Out[10]:
PERIOD                                                          3131
SHOT_TYPE          2PT Field Goal2PT Field Goal2PT Field Goal3PT ...
SHOT_ZONE_BASIC    Mid-RangeRestricted AreaMid-RangeRight Corner ...
SHOT_MADE_FLAG                                                   624
LOC_X                                                         -17903
LOC_Y                                                         107355
dtype: object
In [ ]:
 

or we can compute the sum of a single column,

In [11]:
#shot_df1.LOC_X[(shot_df.SHOT_MADE_FLAG == 1)].sum()
shot_df1.LOC_X.sum()
Out[11]:
-17903
In [ ]:
 

To compute the mean of a DataFrame,

In [12]:
shot_df1.LOC_X[(shot_df1.SHOT_MADE_FLAG == 1)].mean()
Out[12]:
-8.434294871794872
In [ ]:
 

To compute the cummulative sum of a DataFrame,

In [13]:
#shot_df1.LOC_Y[(shot_df.SHOT_MADE_FLAG == 1)].cumsum()
shot_df1.LOC_Y[(shot_df1.SHOT_MADE_FLAG == 1)].cumsum().tail()
Out[13]:
1269    38243
1270    38468
1271    38471
1273    38674
1275    38887
Name: LOC_Y, dtype: int64
In [ ]:
 

To count the number of non-NA values,

In [14]:
shot_df1.count()
Out[14]:
PERIOD             1279
SHOT_TYPE          1279
SHOT_ZONE_BASIC    1279
SHOT_MADE_FLAG     1279
LOC_X              1279
LOC_Y              1279
dtype: int64
In [ ]:
 

To compute the minimum value of a DataFrame,

In [15]:
shot_df1.min()
Out[15]:
PERIOD                             1
SHOT_TYPE             2PT Field Goal
SHOT_ZONE_BASIC    Above the Break 3
SHOT_MADE_FLAG                     0
LOC_X                           -245
LOC_Y                            -30
dtype: object
In [ ]:
 

To compute the maximum value of a DataFrame,

In [16]:
shot_df1.max()
Out[16]:
PERIOD                          5
SHOT_TYPE          3PT Field Goal
SHOT_ZONE_BASIC    Right Corner 3
SHOT_MADE_FLAG                  1
LOC_X                         241
LOC_Y                         418
dtype: object
In [ ]:
 

To compute the median of a DataFrame,

In [17]:
shot_df1.LOC_Y.median()
Out[17]:
42.0
In [ ]:
 

To compute the standard deviation of a DataFrame,

In [18]:
shot_df1.LOC_Y.std()
Out[18]:
91.99685249349442
In [ ]:
 

To compute the variance of a DataFrame,

In [19]:
shot_df1.LOC_Y.var()
Out[19]:
8463.420868709769
In [ ]:
 

To compute the skewness of a DataFrame,

In [20]:
shot_df1.LOC_Y.skew()
Out[20]:
0.81651481685383376
In [ ]:
 

To compute the kurtosis of a DataFrame,

In [21]:
shot_df1.LOC_X[(shot_df1.LOC_Y > 0)].kurt()
Out[21]:
-0.38142873041066466
In [ ]:
 

To compute the correlation matrix of a DataFrame,

In [22]:
shot_df1.corr()
Out[22]:
PERIOD SHOT_MADE_FLAG LOC_X LOC_Y
PERIOD 1.000000 0.008868 -0.028171 0.101323
SHOT_MADE_FLAG 0.008868 1.000000 0.052099 -0.229445
LOC_X -0.028171 0.052099 1.000000 0.007676
LOC_Y 0.101323 -0.229445 0.007676 1.000000
In [ ]:
 

To compute a summary of the statistics of a DataFrame,

In [23]:
shot_df1.describe()
Out[23]:
PERIOD SHOT_MADE_FLAG LOC_X LOC_Y
count 1279.000000 1279.000000 1279.000000 1279.000000
mean 2.448006 0.487881 -13.997654 83.936669
std 1.137046 0.500049 104.266921 91.996852
min 1.000000 0.000000 -245.000000 -30.000000
25% 1.000000 0.000000 -94.000000 4.000000
50% 2.000000 0.000000 -2.000000 42.000000
75% 3.000000 1.000000 22.000000 164.000000
max 5.000000 1.000000 241.000000 418.000000
In [ ]:
 

We can also group data in a DataFrame using groupby. In this example, we want to group the information in the DataFrame shot_by by SHOT_ZONE_AREA

In [24]:
#gb=shot_df.groupby('SHOT_ZONE_AREA','SHOT_ATTEMPTED_FLAG')
gb=shot_df.groupby('SHOT_ZONE_AREA')
In [ ]:
 

which has the following type,

In [25]:
type(gb)
Out[25]:
pandas.core.groupby.DataFrameGroupBy
In [ ]:
 

To know the size of the DataFrameGroupBy (and the names of the groups),

In [26]:
gb.size()
Out[26]:
SHOT_ZONE_AREA
Back Court(BC)             3
Center(C)                670
Left Side Center(LC)     177
Left Side(L)             181
Right Side Center(RC)    149
Right Side(R)             99
dtype: int64
In [ ]:
 

We can use list() to view what that grouping looks like,

In [27]:
#The line is commented as it prints a lot information,
#list(gb)
In [ ]:
 

Now we can apply an operation to the group we just created. In this case we apply the operation to only one column,

In [28]:
#gb.describe()
#gb.SHOT_ATTEMPTED_FLAG.describe()

#gb['SHOT_ATTEMPTED_FLAG'].describe()
gb['SHOT_ATTEMPTED_FLAG'].describe().unstack()
Out[28]:
count mean std min 25% 50% 75% max
SHOT_ZONE_AREA
Back Court(BC) 3 1 0 1 1 1 1 1
Center(C) 670 1 0 1 1 1 1 1
Left Side Center(LC) 177 1 0 1 1 1 1 1
Left Side(L) 181 1 0 1 1 1 1 1
Right Side Center(RC) 149 1 0 1 1 1 1 1
Right Side(R) 99 1 0 1 1 1 1 1

The unstack() option does kind of a pretty print.

You can also print the DataFrame without the unstack() option.

In [ ]:
 

Also, have in mind that when you apply an operation to a DataFrameGroupBy, it will return a DataFrame

In [29]:
type(gb['SHOT_ATTEMPTED_FLAG'].describe())
Out[29]:
pandas.core.series.Series
In [ ]:
 

We can group data using more than one variable,

In [30]:
gb1 = shot_df.groupby(['SHOT_ZONE_AREA','PERIOD'])
In [ ]:
 

To print the names of the groups in DataFrameGroupBy and their respectives size,

In [31]:
gb1.size()
Out[31]:
SHOT_ZONE_AREA         PERIOD
Back Court(BC)         2           3
Center(C)              1         203
                       2         182
                       3         143
                       4         136
                       5           6
Left Side Center(LC)   1          29
                       2          44
                       3          54
                       4          48
                       5           2
Left Side(L)           1          54
                       2          37
                       3          51
                       4          36
                       5           3
Right Side Center(RC)  1          33
                       2          32
                       3          43
                       4          40
                       5           1
Right Side(R)          1          27
                       2          26
                       3          20
                       4          26
dtype: int64
In [ ]:
 

To compute the summary of statistics of the group gb1,

In [32]:
pd.set_option('display.max_rows', None)
#gb1.describe()
gb1.describe().unstack()
Out[32]:
GAME_EVENT_ID GAME_ID LOC_X LOC_Y MINUTES_REMAINING PLAYER_ID SECONDS_REMAINING SHOT_ATTEMPTED_FLAG SHOT_DISTANCE SHOT_MADE_FLAG TEAM_ID
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
SHOT_ZONE_AREA PERIOD
Back Court(BC) 2 3 262.333333 13.576941 254 254.50 255.0 266.50 278 3 21400622.666667 361.950733 21400211 21400488.50 21400766.0 21400828.50 21400891 3 -36.000000 180.385698 -234 -113.50 7.0 63.00 119 3 408.333333 8.504901 402 403.50 405.0 411.50 418 3 0.000000 0.000000 0 0.00 0.0 0.00 0 3 2544 0 2544 2544 2544 2544 2544 3 0.000000 0.000000 0 0.00 0.0 0.00 0 3 1 0 1 1 1 1 1 3 43.000000 3.000000 40 41.50 43.0 44.50 46 3 0.000000 0.000000 0 0.00 0.0 0.00 0 3 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
Center(C) 1 203 56.103448 39.394478 2 21.50 47.0 91.00 154 203 21400595.044335 339.873321 21400018 21400292.00 21400658.0 21400854.00 21401203 203 -1.295567 24.519972 -84 -11.00 0.0 6.00 78 203 45.463054 80.599870 -13 1.00 7.0 44.50 291 203 5.630542 3.594062 0 2.00 6.0 9.00 11 203 2544 0 2544 2544 2544 2544 2544 203 26.275862 17.907152 0 9.00 27.0 40.00 59 203 1 0 1 1 1 1 1 203 4.896552 7.894988 0 0.00 1.0 5.00 29 203 0.596059 0.491899 0 0.00 1.0 1.00 1 203 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
2 182 195.423077 45.324881 107 156.50 195.0 236.25 292 182 21400600.357143 337.699397 21400018 21400308.00 21400643.0 21400883.00 21401203 182 -0.362637 20.689238 -87 -7.00 0.0 9.00 67 182 45.587912 80.545999 -17 1.00 9.0 38.50 373 182 4.961538 3.629781 0 1.25 5.0 8.00 11 182 2544 0 2544 2544 2544 2544 2544 182 27.082418 17.154766 0 11.00 29.0 41.00 58 182 1 0 1 1 1 1 1 182 4.747253 7.921010 0 0.00 1.0 5.00 37 182 0.604396 0.490329 0 0.00 1.0 1.00 1 182 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
3 143 324.566434 45.495327 227 288.00 322.0 357.50 448 143 21400552.496503 351.709585 21400018 21400238.00 21400453.0 21400854.00 21401203 143 -2.153846 24.325639 -65 -15.00 0.0 6.00 77 143 51.076923 84.421401 -19 1.00 11.0 57.00 301 143 5.195804 3.458524 0 2.00 5.0 8.00 11 143 2544 0 2544 2544 2544 2544 2544 143 30.202797 16.671456 0 17.50 30.0 44.00 59 143 1 0 1 1 1 1 1 143 5.426573 8.281567 0 0.00 1.0 6.00 30 143 0.636364 0.482737 0 0.00 1.0 1.00 1 143 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
4 136 450.992647 47.864701 339 420.75 451.0 486.25 581 136 21400645.411765 363.907934 21400018 21400308.00 21400700.0 21400995.00 21401203 136 -5.419118 26.283513 -76 -11.50 0.0 2.00 69 136 52.029412 84.447392 -11 1.00 11.5 44.25 306 136 5.455882 3.435906 0 3.00 5.5 8.00 11 136 2544 0 2544 2544 2544 2544 2544 136 30.345588 16.882767 1 15.75 30.0 44.25 59 136 1 0 1 1 1 1 1 136 5.602941 8.324243 0 0.00 1.0 6.00 30 136 0.588235 0.493972 0 0.00 1.0 1.00 1 136 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
5 6 577.666667 46.659047 522 540.75 577.5 618.75 628 6 21400452.500000 471.589122 21400022 21400022.00 21400452.5 21400883.00 21400883 6 10.000000 14.791890 -19 10.50 14.5 19.25 20 6 50.333333 95.088730 6 7.25 11.0 20.75 244 6 2.000000 1.414214 0 1.25 2.0 2.75 4 6 2544 0 2544 2544 2544 2544 2544 6 40.833333 10.515069 20 43.00 43.0 46.00 49 6 1 0 1 1 1 1 1 6 5.500000 9.071935 1 2.00 2.0 2.00 24 6 0.500000 0.547723 0 0.00 0.5 1.00 1 6 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
Left Side Center(LC) 1 29 51.620690 40.438148 4 21.00 35.0 76.00 144 29 21400619.517241 330.169288 21400125 21400336.00 21400620.0 21400839.00 21401178 29 -149.827586 40.551615 -226 -166.00 -144.0 -119.00 -82 29 182.896552 41.950945 105 148.00 184.0 217.00 255 29 6.275862 3.514222 0 4.00 7.0 9.00 11 29 2544 0 2544 2544 2544 2544 2544 29 28.034483 16.774732 0 16.00 29.0 43.00 52 29 1 0 1 1 1 1 1 29 23.689655 2.867346 16 24.00 25.0 25.00 28 29 0.448276 0.506120 0 0.00 0.0 1.00 1 29 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
2 44 202.000000 44.015853 117 163.75 207.0 238.00 274 44 21400591.386364 328.073603 21400055 21400315.25 21400623.0 21400829.25 21401092 44 -152.454545 42.163279 -239 -188.50 -147.0 -126.00 -64 44 183.181818 39.273799 97 153.25 189.5 213.75 257 44 4.318182 3.838439 0 1.00 4.0 8.25 11 44 2544 0 2544 2544 2544 2544 2544 44 26.681818 17.614440 0 9.50 28.0 40.50 58 44 1 0 1 1 1 1 1 44 23.795455 2.681370 17 24.00 25.0 25.00 28 44 0.295455 0.461522 0 0.00 0.0 1.00 1 44 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
3 54 322.277778 45.913543 248 290.75 314.0 342.00 456 54 21400668.277778 340.890256 21400066 21400374.00 21400709.0 21400907.00 21401203 54 -135.537037 31.051213 -215 -152.25 -131.0 -112.50 -70 54 190.833333 42.681687 101 151.50 203.0 219.00 304 54 5.888889 3.467731 0 4.00 7.0 9.00 11 54 2544 0 2544 2544 2544 2544 2544 54 28.222222 17.641789 0 12.25 30.0 42.00 59 54 1 0 1 1 1 1 1 54 23.203704 3.530886 16 20.00 24.0 25.00 32 54 0.425926 0.499126 0 0.00 0.0 1.00 1 54 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
4 48 456.229167 49.689350 355 419.50 452.5 490.25 551 48 21400611.729167 336.998626 21400018 21400418.25 21400667.0 21400887.25 21401148 48 -142.583333 43.279809 -245 -177.00 -131.0 -110.50 -75 48 195.666667 43.360602 88 157.00 210.0 230.25 265 48 5.395833 3.227143 0 3.00 5.0 8.00 11 48 2544 0 2544 2544 2544 2544 2544 48 30.687500 16.572415 1 17.25 32.0 43.25 56 48 1 0 1 1 1 1 1 48 24.312500 2.918439 16 24.00 25.0 26.00 28 48 0.354167 0.483321 0 0.00 0.0 1.00 1 48 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
5 2 579.500000 45.961941 547 563.25 579.5 595.75 612 2 21400923.500000 57.275649 21400883 21400903.25 21400923.5 21400943.75 21400964 2 -117.000000 33.941125 -141 -129.00 -117.0 -105.00 -93 2 216.500000 26.162951 198 207.25 216.5 225.75 235 2 1.500000 0.707107 1 1.25 1.5 1.75 2 2 2544 0 2544 2544 2544 2544 2544 2 43.000000 15.556349 32 37.50 43.0 48.50 54 2 1 0 1 1 1 1 1 2 24.500000 0.707107 24 24.25 24.5 24.75 25 2 0.500000 0.707107 0 0.25 0.5 0.75 1 2 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
Left Side(L) 1 54 55.314815 42.385840 2 19.75 41.0 93.50 144 54 21400598.314815 375.380452 21400018 21400241.75 21400642.0 21400897.75 21401203 54 -141.296296 45.644496 -235 -160.00 -137.0 -105.75 -48 54 38.888889 32.251176 -22 15.50 36.5 57.50 127 54 5.666667 3.865669 0 2.00 7.0 9.00 11 54 2544 0 2544 2544 2544 2544 2544 54 29.425926 17.608533 0 15.00 34.0 42.75 58 54 1 0 1 1 1 1 1 54 14.629630 4.057587 8 12.00 14.0 17.50 23 54 0.296296 0.460911 0 0.00 0.0 1.00 1 54 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
2 37 178.567568 48.461061 104 134.00 173.0 211.00 278 37 21400590.891892 376.121344 21400022 21400199.00 21400658.0 21400915.00 21401178 37 -141.108108 47.363713 -235 -168.00 -152.0 -100.00 -45 37 41.162162 32.009125 -6 19.00 37.0 63.00 115 37 6.459459 3.420482 0 4.00 7.0 9.00 11 37 2544 0 2544 2544 2544 2544 2544 37 27.891892 15.327506 1 16.00 27.0 39.00 57 37 1 0 1 1 1 1 1 37 14.567568 4.675828 8 10.00 15.0 19.00 23 37 0.351351 0.483978 0 0.00 0.0 1.00 1 37 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
3 51 326.823529 50.073428 237 285.00 324.0 362.50 448 51 21400555.392157 342.813540 21400018 21400224.50 21400453.0 21400815.00 21401203 51 -152.901961 43.258412 -235 -183.00 -147.0 -118.00 -78 51 41.882353 35.894650 -10 7.00 41.0 71.50 110 51 5.529412 3.951470 0 2.00 5.0 9.00 11 51 2544 0 2544 2544 2544 2544 2544 51 26.764706 17.214631 0 12.00 27.0 40.50 59 51 1 0 1 1 1 1 1 51 15.764706 3.982905 8 13.00 15.0 18.50 24 51 0.352941 0.482640 0 0.00 0.0 1.00 1 51 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
4 36 445.972222 40.500784 359 418.50 455.5 474.00 524 36 21400649.722222 376.531434 21400055 21400343.25 21400642.0 21400995.00 21401203 36 -146.027778 45.930684 -242 -171.00 -147.5 -114.00 -70 36 43.277778 35.611885 -13 16.25 40.5 58.50 124 36 5.527778 3.517191 0 2.00 5.0 9.00 11 36 2544 0 2544 2544 2544 2544 2544 36 29.583333 18.885179 0 12.25 30.0 46.25 58 36 1 0 1 1 1 1 1 36 15.138889 4.599085 8 11.75 15.0 18.00 24 36 0.388889 0.494413 0 0.00 0.0 1.00 1 36 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
5 3 571.000000 21.166010 547 563.00 579.0 583.00 587 3 21400022.000000 0.000000 21400022 21400022.00 21400022.0 21400022.00 21400022 3 -132.000000 38.000000 -164 -153.00 -142.0 -116.00 -90 3 67.333333 57.134345 4 43.50 83.0 99.00 115 3 1.000000 1.732051 0 0.00 0.0 1.50 3 3 2544 0 2544 2544 2544 2544 2544 3 27.000000 22.649503 6 15.00 24.0 37.50 51 3 1 0 1 1 1 1 1 3 15.333333 1.154701 14 15.00 16.0 16.00 16 3 0.000000 0.000000 0 0.00 0.0 0.00 0 3 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
Right Side Center(RC) 1 33 58.606061 36.952621 2 34.00 55.0 82.00 128 33 21400381.636364 323.949535 21400018 21400133.00 21400273.0 21400570.00 21401167 33 138.696970 36.666133 78 113.00 141.0 157.00 220 33 192.939394 44.687764 124 154.00 200.0 219.00 313 33 5.000000 3.526684 0 1.00 5.0 7.00 11 33 2544 0 2544 2544 2544 2544 2544 33 26.757576 16.414609 0 16.00 26.0 37.00 59 33 1 0 1 1 1 1 1 33 23.696970 3.566203 17 21.00 25.0 26.00 33 33 0.242424 0.435194 0 0.00 0.0 0.00 1 33 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
2 32 194.187500 45.258710 112 160.75 198.5 226.50 282 32 21400646.218750 313.708179 21400018 21400443.75 21400697.0 21400885.00 21401078 32 137.218750 39.067959 67 109.50 128.0 159.25 223 32 192.406250 58.923055 108 155.50 191.0 220.25 377 32 5.031250 3.864515 0 1.00 6.0 8.00 11 32 2544 0 2544 2544 2544 2544 2544 32 19.968750 16.395287 0 5.75 16.5 31.50 58 32 1 0 1 1 1 1 1 32 23.593750 5.375328 16 20.00 24.0 25.25 41 32 0.343750 0.482559 0 0.00 0.0 1.00 1 32 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
3 43 336.348837 43.759562 249 310.00 338.0 364.50 416 43 21400671.279070 310.432519 21400066 21400408.50 21400766.0 21400883.00 21401203 43 148.651163 38.434783 77 123.50 147.0 175.00 223 43 187.883721 43.155979 108 159.50 192.0 215.00 277 43 4.046512 3.272930 0 1.00 4.0 6.50 11 43 2544 0 2544 2544 2544 2544 2544 43 28.953488 20.499656 0 9.50 34.0 44.00 58 43 1 0 1 1 1 1 1 43 23.883721 3.325295 16 24.00 25.0 26.00 30 43 0.441860 0.502486 0 0.00 0.0 1.00 1 43 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
4 40 459.050000 46.408968 370 421.75 464.0 487.00 557 40 21400638.225000 281.721675 21400098 21400425.75 21400639.0 21400861.25 21401167 40 132.150000 39.003320 67 103.75 128.5 148.00 218 40 190.975000 53.965082 108 144.00 200.5 230.75 361 40 5.225000 3.555332 0 2.75 5.5 8.00 11 40 2544 0 2544 2544 2544 2544 2544 40 28.300000 16.469397 2 14.00 28.0 42.00 59 40 1 0 1 1 1 1 1 40 23.200000 4.569688 16 19.75 25.0 26.00 39 40 0.425000 0.500641 0 0.00 0.0 1.00 1 40 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
5 1 558.000000 NaN 558 558.00 558.0 558.00 558 1 21400964.000000 NaN 21400964 21400964.00 21400964.0 21400964.00 21400964 1 176.000000 NaN 176 176.00 176.0 176.00 176 1 186.000000 NaN 186 186.00 186.0 186.00 186 1 0.000000 NaN 0 0.00 0.0 0.00 0 1 2544 NaN 2544 2544 2544 2544 2544 1 32.000000 NaN 32 32.00 32.0 32.00 32 1 1 NaN 1 1 1 1 1 1 25.000000 NaN 25 25.00 25.0 25.00 25 1 1.000000 NaN 1 1.00 1.0 1.00 1 1 1610612739 NaN 1610612739 1610612739 1610612739 1610612739 1610612739
Right Side(R) 1 27 51.814815 32.643809 6 24.00 46.0 76.50 116 27 21400498.296296 323.769720 21400018 21400226.00 21400433.0 21400776.00 21401124 27 166.444444 53.197985 70 127.00 162.0 227.50 241 27 38.370370 36.061004 -16 8.00 31.0 68.00 116 27 5.703704 3.291187 0 3.50 6.0 8.00 11 27 2544 0 2544 2544 2544 2544 2544 27 30.629630 16.699028 1 22.50 31.0 38.00 59 27 1 0 1 1 1 1 1 27 17.222222 4.551697 9 14.00 16.0 22.00 24 27 0.333333 0.480384 0 0.00 0.0 1.00 1 27 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
2 26 196.961538 47.785756 129 153.75 197.5 226.25 301 26 21400670.923077 389.014542 21400018 21400286.50 21400714.5 21401023.00 21401178 26 136.000000 60.099251 62 76.25 135.0 172.50 239 26 50.000000 28.223395 -6 28.50 55.0 66.75 101 26 4.884615 3.502307 0 2.00 5.0 8.00 11 26 2544 0 2544 2544 2544 2544 2544 26 28.500000 19.146279 0 12.50 30.5 39.75 59 26 1 0 1 1 1 1 1 26 14.461538 5.558638 8 10.00 13.5 17.75 24 26 0.230769 0.429669 0 0.00 0.0 0.00 1 26 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
3 20 321.450000 40.158207 261 296.50 316.5 347.25 426 20 21400700.600000 329.374751 21400098 21400582.75 21400733.0 21400934.00 21401203 20 148.750000 57.570895 67 100.25 150.0 190.50 239 20 56.800000 35.402349 -5 40.00 54.5 77.75 118 20 5.350000 3.313052 0 2.75 5.0 8.25 11 20 2544 0 2544 2544 2544 2544 2544 20 29.500000 17.024750 3 17.75 25.5 39.50 59 20 1 0 1 1 1 1 1 20 15.850000 5.431632 8 10.50 16.0 21.25 23 20 0.450000 0.510418 0 0.00 0.0 1.00 1 20 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
4 26 446.576923 37.038545 367 419.00 452.5 467.25 527 26 21400685.884615 325.674602 21400211 21400319.75 21400667.0 21401023.00 21401167 26 132.461538 59.717489 45 90.25 108.0 168.25 241 26 38.807692 31.304976 -30 20.00 40.5 56.75 112 26 5.846154 3.294751 0 3.25 5.0 9.00 11 26 2544 0 2544 2544 2544 2544 2544 26 30.192308 13.608877 7 19.00 28.5 41.75 51 26 1 0 1 1 1 1 1 26 13.961538 5.325266 8 10.00 12.0 16.75 24 26 0.423077 0.503831 0 0.00 0.0 1.00 1 26 1610612739 0 1610612739 1610612739 1610612739 1610612739 1610612739
In [ ]:
 

Let's create another group,

In [33]:
gb2 = shot_df.groupby('SHOT_ZONE_AREA')
In [ ]:
 

and extract a specific sub-group of the DataFrameGroupBy as follows,

In [34]:
gb2.get_group('Center(C)').head(2)
Out[34]:
GRID_TYPE GAME_ID GAME_EVENT_ID PLAYER_ID PLAYER_NAME TEAM_ID TEAM_NAME PERIOD MINUTES_REMAINING SECONDS_REMAINING EVENT_TYPE ACTION_TYPE SHOT_TYPE SHOT_ZONE_BASIC SHOT_ZONE_AREA SHOT_ZONE_RANGE SHOT_DISTANCE LOC_X LOC_Y SHOT_ATTEMPTED_FLAG SHOT_MADE_FLAG
1 Shot Chart Detail 21400018 33 2544 LeBron James 1610612739 Cleveland Cavaliers 1 6 30 Made Shot Layup Shot 2PT Field Goal Restricted Area Center(C) Less Than 8 ft. 0 -7 0 1 1
7 Shot Chart Detail 21400018 217 2544 LeBron James 1610612739 Cleveland Cavaliers 2 1 55 Missed Shot Reverse Layup Shot 2PT Field Goal Restricted Area Center(C) Less Than 8 ft. 0 -8 0 1 0
In [ ]:
 

To compute the number of entries in the group gb1,

In [35]:
#gb1.PERIOD.count()
gb1.PERIOD.count().unstack()
Out[35]:
PERIOD 1 2 3 4 5
SHOT_ZONE_AREA
Back Court(BC) NaN 3 NaN NaN NaN
Center(C) 203 182 143 136 6
Left Side Center(LC) 29 44 54 48 2
Left Side(L) 54 37 51 36 3
Right Side Center(RC) 33 32 43 40 1
Right Side(R) 27 26 20 26 NaN
In [ ]:
 

To compute the number of entries in the group gb1 you can also proceed in the following way,

In [36]:
#gb1.size()
gb1.size().unstack()
Out[36]:
PERIOD 1 2 3 4 5
SHOT_ZONE_AREA
Back Court(BC) NaN 3 NaN NaN NaN
Center(C) 203 182 143 136 6
Left Side Center(LC) 29 44 54 48 2
Left Side(L) 54 37 51 36 3
Right Side Center(RC) 33 32 43 40 1
Right Side(R) 27 26 20 26 NaN
In [ ]:
 

To compute the summary of statistics of the group gb1 for the column PERIOD,

In [37]:
#gb1.PERIOD.describe()
gb1.PERIOD.describe().unstack()
Out[37]:
count mean std min 25% 50% 75% max
SHOT_ZONE_AREA PERIOD
Back Court(BC) 2 3 2 0 2 2 2 2 2
Center(C) 1 203 1 0 1 1 1 1 1
2 182 2 0 2 2 2 2 2
3 143 3 0 3 3 3 3 3
4 136 4 0 4 4 4 4 4
5 6 5 0 5 5 5 5 5
Left Side Center(LC) 1 29 1 0 1 1 1 1 1
2 44 2 0 2 2 2 2 2
3 54 3 0 3 3 3 3 3
4 48 4 0 4 4 4 4 4
5 2 5 0 5 5 5 5 5
Left Side(L) 1 54 1 0 1 1 1 1 1
2 37 2 0 2 2 2 2 2
3 51 3 0 3 3 3 3 3
4 36 4 0 4 4 4 4 4
5 3 5 0 5 5 5 5 5
Right Side Center(RC) 1 33 1 0 1 1 1 1 1
2 32 2 0 2 2 2 2 2
3 43 3 0 3 3 3 3 3
4 40 4 0 4 4 4 4 4
5 1 5 NaN 5 5 5 5 5
Right Side(R) 1 27 1 0 1 1 1 1 1
2 26 2 0 2 2 2 2 2
3 20 3 0 3 3 3 3 3
4 26 4 0 4 4 4 4 4
In [ ]:
 
In [ ]:
 
In [ ]:
 

At this point, let's do some advanced plotting using matplotlib and seaborn.

There are many ways to plot the data in a pandas DataFrame, let's use first matplotlib.

Let's plot all the field attempts,

In [38]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
sns.set_color_codes()

plt.figure(figsize=(10,10))
plt.scatter( shot_df.LOC_X, shot_df.LOC_Y,s=20,marker='o',alpha=1)

plt.xlim(-300,300)
plt.ylim(-100,500)

plt.title('Lebron James shot chart 2014-2015 \n All field attempts', y = 1.02, fontsize=20)

plt.grid()
plt.show()
In [ ]:
 

Let's plot the converted and missed shots,

In [39]:
sns.set_style("white")
sns.set_color_codes()


plt.figure(figsize=(10,10))

plt.scatter(shot_df[shot_df.SHOT_MADE_FLAG == 1].LOC_X, shot_df[shot_df.SHOT_MADE_FLAG == 1].LOC_Y, color='green',label='shots converted',s=20,marker='o',alpha=1.0/2.0)
plt.scatter(shot_df[shot_df.SHOT_MADE_FLAG == 0].LOC_X, shot_df[shot_df.SHOT_MADE_FLAG == 0].LOC_Y, color='red',label='shots missed',s=20,marker='o',alpha=0.5)

plt.legend()

plt.xlim(-300,300)
plt.ylim(-100,500)

plt.title('Lebron James shot chart 2014-2015 \n Converted-missed shots', y = 1.02, fontsize=20)

plt.grid()
plt.show()
In [ ]:
 

Let's see in what position Lebron took more shots by using histograms.

First we need to create a DataFrame with the LOC_X and LOC_Y information,

In [40]:
h1=pd.DataFrame(shot_df, columns=['LOC_X', 'LOC_Y'])
In [ ]:
 

Now we use this DataFrame to plot the histograms for LOC_X and LOC_Y,

In [41]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)

plt.figure(figsize=(10,10))

plt.subplot(2, 1, 1)
sns.distplot(h1.LOC_X,kde=False);

plt.subplot(2, 1, 2)
sns.distplot(h1.LOC_Y,kde=False);

As we can see from these histogram, Lebron took more shots around the rim.

We are doing exploratory data analysis or EDA. At this point, take your time and try to do different plots, there are many types and combinations.

In [ ]:
 

Let's plot all shots and the histograms in one single plot, for this we are going to use seaborn,

In [42]:
# create our jointplot
sns.set(color_codes=True)

joint_shot_chart = sns.jointplot(shot_df1.LOC_X, shot_df1.LOC_Y, stat_func=None,
                                 kind='scatter',space=0.2, alpha=1, 
                                 size=8, edgecolor='w', color='b').set_axis_labels("x location", "y location")

# A joint plot has 3 Axes, the first one called ax_joint 
# is the one we want to adjust
joint_shot_chart.fig.set_size_inches(10,10)
ax = joint_shot_chart.ax_joint

# Adjust the axis limits and orientation of the plot in order
ax.set_xlim(-250,250)
#ax.set_ylim(422.5, -47.5)
ax.set_ylim(-47.5, 422.5)

# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom='off', labelleft='off')

# Add a title
ax.set_title('Lebron James shot chart 2014-2015', 
             y=1.25, fontsize=14)

plt.show()

As we can see here, Lebron is very active close to the rim area.

What about his effectivity?, this is let to you as an exercise.

In [ ]:
 

Instead of using a scatter plot, we can use hexbins. In seaborn, you can proceed as follows,

In [43]:
# create our jointplot
sns.set_style("white")

cmap=plt.cm.gist_heat_r
joint_shot_chart = sns.jointplot(shot_df.LOC_X, shot_df.LOC_Y, stat_func=None,
                                 kind='hex',gridsize=40,space=0, color=cmap(0.2), cmap=cmap,vmin=0, vmax=50)

#joint_shot_chart = sns.jointplot(shot_df.LOC_X, shot_df.LOC_Y, stat_func=None,
#                                 kind='hex',space=0, color=cmap(0.2), cmap=cmap)

joint_shot_chart.fig.set_size_inches(12,11)

# A joint plot has 3 Axes, the first one called ax_joint 
# is the one we want to adjust
ax = joint_shot_chart.ax_joint

# Adjust the axis limits and orientation of the plot in order
ax.set_xlim(-250,250)
#ax.set_ylim(422.5, -47.5)
ax.set_ylim(-47.5, 422.5)

# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom='off', labelleft='off')

# Add a title
ax.set_title('Lebron James shot chart 2014-2015', 
             y=1.25, fontsize=14)


# Add James Harden's image to the top right
#img = OffsetImage(image, zoom=0.6)
#img.set_offset((625,621))
#ax.add_artist(img)

plt.show()
In [ ]:
 

Let's use boxplot to represent the field attempts, this kind of plots are very useful to identify outliers or anomalies in your observations.

In [44]:
sns.boxplot(shot_df.LOC_Y, orient="v")
plt.title('Lebron James - Boxplot of all field attempts for LOC_Y', y = 1.1, fontsize=20)
Out[44]:
<matplotlib.text.Text at 0x10a36ee10>

In this plot we can see that there are a few outliers or irregular observations.

In [ ]:
 

We can also plot the boxplot with the scatter point superimpose, as follows,

In [45]:
#sns.stripplot(data=shot_df.LOC_Y, jitter=True, color="white", edgecolor="gray")
sns.stripplot(data=shot_df.LOC_Y, jitter=0.3, color="white", edgecolor="gray")
sns.boxplot(shot_df.LOC_Y, orient="v")

plt.title('Lebron James - Boxplot of all field attempts for LOC_Y', y = 1.1, fontsize=20)
Out[45]:
<matplotlib.text.Text at 0x10a56f5d0>

Here we see better the outliers, which corresponds to the shots taken from far away, and presumably close to the end of a period. This is left as an exercise to the reader.

In [ ]:
 

We can do the same for LOC_X,

In [46]:
sns.stripplot(data=shot_df.LOC_X, jitter=0.3, color="white", edgecolor="gray")
sns.boxplot(shot_df.LOC_X, orient="v")

plt.title('Lebron James - Boxplot of all field attempts for LOC_X', y = 1.1, fontsize=20)
Out[46]:
<matplotlib.text.Text at 0x10a25ced0>
In [ ]:
 

At the point let's do some pretty plotting. We are going to add a background to the scatter plot and we are going to plot only the 3 point shots attempted.

To plot the backgorund image we are going to use scipy [4].

In [47]:
from scipy.misc import imread

plt.figure(figsize=(10,10))

#draw_court(outer_lines=True)
#plt.xlim(-300,300)
#plt.ylim(-100,500)

datafile = 'chart1.png'
#datafile = 'chart2.jpg'
#datafile = 'bg_court.jpg'
img = imread(datafile)

#plt.imshow(img, zorder=0, extent=[-260, 260, -60, 400])
plt.imshow(img, zorder=0, extent=[-260, 260, -400, 60])
plt.scatter(c2.LOC_X,-1*c2.LOC_Y,zorder=1,color='blue',label='3 point shots attempted',s=20,marker='o',alpha=0.6)

plt.title('Lebron James shot chart 2014-2015 \n 3 point shots attempted', y = 1.02, fontsize=20)

#plt.ylim(400, -60)
plt.show()