This tutorial covers advanced data analytics and plotting.
We use pandas [1] to manipulate and analyze the data in a DataFrame.
Finally, we build more advanced plots with matplotlib [2] and seaborn [3].
%matplotlib inline
import pandas as pd
#import json
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
To load the CSV file and use column 0 as the row labels of the DataFrame, we proceed as follows,
#shot_df = pd.read_csv('test.csv')
#shot_df = pd.read_csv(filepath_or_buffer='test.csv')
shot_df = pd.read_csv(filepath_or_buffer='test.csv',index_col=0)
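To see what index_col=0 changes, here is a minimal sketch that reads the same text twice. The three-row CSV is made up for illustration and only assumes that test.csv was written with its row labels in the first, unnamed column.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for test.csv; the first, unnamed column
# holds the row labels.
csv_text = """,PERIOD,SHOT_TYPE,SHOT_MADE_FLAG
0,1,2PT Field Goal,1
1,1,3PT Field Goal,0
2,2,2PT Field Goal,0
"""

with_default_index = pd.read_csv(io.StringIO(csv_text))
with_label_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col the label column is kept as data under a generated name.
print(list(with_default_index.columns))
# With index_col=0 it becomes the index and only the real columns remain.
print(list(with_label_index.columns))
```

Without index_col, the label column would show up as an extra data column in every later summary.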
To display the first 10 rows of shot_df and force pandas to show all the columns, proceed as follows,
pd.set_option('display.max_columns', None)
shot_df.head(10)
We can create a new DataFrame from shot_df that keeps only a given set of columns, as follows,
shot_df1 = pd.DataFrame(shot_df, columns = ['PERIOD','SHOT_TYPE', 'SHOT_ZONE_BASIC', 'SHOT_MADE_FLAG', 'LOC_X', 'LOC_Y'])
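Selecting columns through the constructor and through bracket indexing should give the same result when every name exists. The sketch below, on a made-up miniature of shot_df (GAME_ID and LOC_Z are hypothetical names), also shows how the two differ when a name is misspelled.

```python
import pandas as pd

# Made-up miniature of shot_df, for illustration only.
shots = pd.DataFrame({
    "PERIOD": [1, 2, 4],
    "LOC_X": [-10, 35, 120],
    "LOC_Y": [5, 80, 210],
    "GAME_ID": [101, 101, 102],  # hypothetical extra column that we drop
})

wanted = ["PERIOD", "LOC_X", "LOC_Y"]

via_constructor = pd.DataFrame(shots, columns=wanted)
via_brackets = shots[wanted]
print(via_constructor.equals(via_brackets))

# A misspelled name is accepted by the constructor and yields an all-NaN
# column, whereas shots[["PERIOD", "LOC_Z"]] would raise a KeyError.
with_typo = pd.DataFrame(shots, columns=["PERIOD", "LOC_Z"])
print(with_typo["LOC_Z"].isna().all())
```

Bracket indexing is therefore the safer choice when a typo should fail loudly.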
To display the first 5 rows of shot_df1,
shot_df1.head()
Let's create a few DataFrames from shot_df.
Notice that we combine string comparisons and logical operators to select the rows of each new DataFrame.
c1 = shot_df[(shot_df.SHOT_TYPE == '2PT Field Goal')]
c1c = shot_df[(shot_df.SHOT_TYPE == '2PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 1)]
c1m = shot_df[(shot_df.SHOT_TYPE == '2PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 0)]
c2 = shot_df[(shot_df.SHOT_TYPE == '3PT Field Goal')]
c2c = shot_df[(shot_df.SHOT_TYPE == '3PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 1)]
c2m = shot_df[(shot_df.SHOT_TYPE == '3PT Field Goal') & (shot_df.SHOT_MADE_FLAG == 0)]
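# All field goal attempts, made or missed. Indexing with the raw 0/1 flag column
# would look up column labels instead of filtering rows, so we build a boolean mask.
c3 = shot_df[shot_df.SHOT_MADE_FLAG.isin([0, 1])]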
c3c = shot_df[(shot_df.SHOT_MADE_FLAG == 1)]
c3m = shot_df[(shot_df.SHOT_MADE_FLAG == 0)]
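The filters above are boolean masks combined with & (and); ~ negates a mask. A minimal sketch on made-up rows, where is_two and made are names introduced only for this example:

```python
import pandas as pd

# Made-up rows that reuse the column names of shot_df.
shots = pd.DataFrame({
    "SHOT_TYPE": ["2PT Field Goal", "2PT Field Goal", "3PT Field Goal",
                  "3PT Field Goal", "2PT Field Goal"],
    "SHOT_MADE_FLAG": [1, 0, 1, 0, 1],
})

# Each comparison returns a boolean Series with one entry per row.
is_two = shots.SHOT_TYPE == "2PT Field Goal"
made = shots.SHOT_MADE_FLAG == 1

# Parentheses matter when the comparisons are written inline, because &
# binds more tightly than ==.
two_made = shots[is_two & made]
two_missed = shots[is_two & ~made]
print(len(two_made), len(two_missed))
```

Naming the masks once avoids repeating the same comparison in every filter.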
To compute the number of rows in c3, that is, the total number of field goals attempted,
len(c3.index)
To compute the sum of the missed shots (c3m) and converted shots (c3c), which should match the total number of attempts,
len(c3c.index) + len(c3m.index)
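The same consistency check can be done in one pass with value_counts(). This is a small sketch on a made-up flag column, not the tutorial data.

```python
import pandas as pd

# Made-up made/missed flags: 1 is a converted shot, 0 is a miss.
flags = pd.Series([1, 0, 1, 1, 0, 1], name="SHOT_MADE_FLAG")

counts = flags.value_counts()
made = int(counts.get(1, 0))
missed = int(counts.get(0, 0))

# Converted plus missed shots should account for every attempt.
print(made, missed, made + missed == len(flags))
```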
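At this point, we can compute statistics for the DataFrame we just created. For example, to sum every numeric column,
#shot_df1.LOC_X[(shot_df.SHOT_MADE_FLAG == 1)].sum()
# numeric_only=True skips the text columns, which sum() would otherwise concatenate
shot_df1.sum(numeric_only=True)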
or we can compute the sum of a single column,
#shot_df1.LOC_X[(shot_df.SHOT_MADE_FLAG == 1)].sum()
shot_df1.LOC_X.sum()
To compute the mean of a column, here LOC_X restricted to the converted shots,
shot_df1.LOC_X[(shot_df1.SHOT_MADE_FLAG == 1)].mean()
To compute the cumulative sum of a column (only the last 5 values are shown),
#shot_df1.LOC_Y[(shot_df.SHOT_MADE_FLAG == 1)].cumsum()
shot_df1.LOC_Y[(shot_df1.SHOT_MADE_FLAG == 1)].cumsum().tail()
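A cumulative sum keeps one running total per row, so its last value equals the plain sum. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up rows that reuse two column names of shot_df1.
shots = pd.DataFrame({
    "LOC_Y": [10, 200, 35, 5],
    "SHOT_MADE_FLAG": [1, 0, 1, 1],
})

# Keep LOC_Y for the converted shots only, then accumulate.
made_y = shots.LOC_Y[shots.SHOT_MADE_FLAG == 1]
running = made_y.cumsum()

print(running.tolist())
print(running.iloc[-1] == made_y.sum())
```

The original row labels are preserved, which is why the filtered result skips the label of the missed shot.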
To count the number of non-NA values,
shot_df1.count()
To compute the minimum value of each column of a DataFrame,
shot_df1.min()
To compute the maximum value of each column of a DataFrame,
shot_df1.max()
To compute the median of a column,
shot_df1.LOC_Y.median()
To compute the standard deviation of a column,
shot_df1.LOC_Y.std()
To compute the variance of a column,
shot_df1.LOC_Y.var()
To compute the skewness of a column,
shot_df1.LOC_Y.skew()
To compute the kurtosis of a column, here LOC_X for the shots with LOC_Y > 0,
shot_df1.LOC_X[(shot_df1.LOC_Y > 0)].kurt()
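To compute the correlation matrix of the numeric columns of a DataFrame,
# numeric_only=True (pandas 1.5 or later) leaves out the text columns such as SHOT_TYPE
shot_df1.corr(numeric_only=True)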
To compute summary statistics for a DataFrame,
shot_df1.describe()
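describe() bundles several of the statistics computed one by one above. The sketch below checks that on a small made-up column; note that std() and var() use the sample (n - 1) definition by default.

```python
import pandas as pd

# Made-up, symmetric values so the statistics are easy to verify by hand.
loc_y = pd.Series([0, 10, 20, 30, 40], name="LOC_Y")

summary = loc_y.describe()

# The "50%" row of describe() is the median.
print(summary["mean"] == loc_y.mean())
print(summary["50%"] == loc_y.median())
print(summary["std"] == loc_y.std())

# Sum of squared deviations is 1000, divided by n - 1 = 4.
print(loc_y.var())
```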
We can also group the data in a DataFrame using groupby. In this example, we group the rows of shot_df by SHOT_ZONE_AREA,
#gb=shot_df.groupby('SHOT_ZONE_AREA','SHOT_ATTEMPTED_FLAG')
gb=shot_df.groupby('SHOT_ZONE_AREA')
which has the following type,
type(gb)
To get the size of each group in the DataFrameGroupBy (along with the names of the groups),
gb.size()
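size() counts rows per group, while count() counts non-missing values per column, so the two can disagree when data is missing. A minimal sketch on made-up rows ('Left Side(L)' is an assumed zone name used only for illustration):

```python
import numpy as np
import pandas as pd

# Made-up rows; one LOC_X value is deliberately missing.
shots = pd.DataFrame({
    "SHOT_ZONE_AREA": ["Center(C)", "Center(C)", "Left Side(L)", "Center(C)"],
    "LOC_X": [0.0, np.nan, -150.0, 12.0],
})

gb = shots.groupby("SHOT_ZONE_AREA")

sizes = gb.size()                 # rows per group, missing values included
non_missing = gb["LOC_X"].count() # non-NA LOC_X values per group

print(sizes.to_dict())
print(non_missing.to_dict())
```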
We can use list() to view what the grouping looks like,
# This line is commented out because it prints a lot of information
#list(gb)
Now we can apply an operation to the groups we just created. In this case we apply it to a single column,
#gb.describe()
#gb.SHOT_ATTEMPTED_FLAG.describe()
#gb['SHOT_ATTEMPTED_FLAG'].describe()
gb['SHOT_ATTEMPTED_FLAG'].describe().unstack()
The unstack() call pivots the innermost level of the row labels into columns, which can make the output easier to read.
You can also display the result without unstack().
Also, keep in mind that applying an operation such as describe() to a group returns a DataFrame,
type(gb['SHOT_ATTEMPTED_FLAG'].describe())
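How the output looks depends on the pandas version. In recent releases, as far as we can tell, describe() on a grouped column already returns a table with one row per group, and calling unstack() on that table flattens it into a long Series. A small check on made-up rows:

```python
import pandas as pd

# Made-up rows; 'Left Side(L)' is an assumed zone name.
shots = pd.DataFrame({
    "SHOT_ZONE_AREA": ["Center(C)", "Center(C)", "Left Side(L)"],
    "SHOT_ATTEMPTED_FLAG": [1, 1, 1],
})

summary = shots.groupby("SHOT_ZONE_AREA")["SHOT_ATTEMPTED_FLAG"].describe()

# One row per group, one column per statistic.
print(type(summary).__name__, list(summary.index))

# The row index has a single level, so unstack() returns a Series
# indexed by (statistic, group).
print(type(summary.unstack()).__name__)
```

If the unstacked output is harder to read on your installation, display the result of describe() directly.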
We can group data using more than one variable,
gb1 = shot_df.groupby(['SHOT_ZONE_AREA','PERIOD'])
To print the names of the groups in the DataFrameGroupBy and their respective sizes,
gb1.size()
To compute summary statistics for the groups in gb1,
pd.set_option('display.max_rows', None)
#gb1.describe()
gb1.describe().unstack()
Let's create another group,
gb2 = shot_df.groupby('SHOT_ZONE_AREA')
and extract a specific sub-group of the DataFrameGroupBy as follows,
gb2.get_group('Center(C)').head(2)
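get_group() should return the same rows as a boolean filter on the grouping column. A minimal sketch on made-up rows to confirm it:

```python
import pandas as pd

# Made-up rows; 'Left Side(L)' is an assumed zone name.
shots = pd.DataFrame({
    "SHOT_ZONE_AREA": ["Center(C)", "Left Side(L)", "Center(C)"],
    "LOC_X": [0, -150, 12],
})

gb = shots.groupby("SHOT_ZONE_AREA")

# The keys of .groups are the names that get_group() accepts.
print(sorted(gb.groups))

center = gb.get_group("Center(C)")
same_rows = shots[shots.SHOT_ZONE_AREA == "Center(C)"]
print(center.equals(same_rows))
```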
To count the entries in each group of gb1,
#gb1.PERIOD.count()
gb1.PERIOD.count().unstack()
The number of entries in each group of gb1 can also be computed in the following way,
#gb1.size()
gb1.size().unstack()
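With two grouping keys, size().unstack() turns the second key into columns. Combinations that never occur show up as NaN unless a fill value is given. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; there is no 'Left Side(L)' shot in period 2.
shots = pd.DataFrame({
    "SHOT_ZONE_AREA": ["Center(C)", "Center(C)", "Left Side(L)", "Center(C)"],
    "PERIOD": [1, 2, 1, 1],
})

# Rows are zones, columns are periods, cells are shot counts.
table = (shots.groupby(["SHOT_ZONE_AREA", "PERIOD"])
              .size()
              .unstack(fill_value=0))
print(table)
```

Using fill_value=0 keeps the counts as integers instead of promoting the table to floats.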
To compute summary statistics of the column PERIOD for the groups in gb1,
#gb1.PERIOD.describe()
gb1.PERIOD.describe().unstack()
There are many ways to plot the data in a pandas DataFrame. Let's start with matplotlib.
Let's plot all the field goal attempts,
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
sns.set_color_codes()
plt.figure(figsize=(10,10))
plt.scatter( shot_df.LOC_X, shot_df.LOC_Y,s=20,marker='o',alpha=1)
plt.xlim(-300,300)
plt.ylim(-100,500)
plt.title('Lebron James shot chart 2014-2015 \n All field attempts', y = 1.02, fontsize=20)
plt.grid()
plt.show()
Let's plot the converted and missed shots,
sns.set_style("white")
sns.set_color_codes()
plt.figure(figsize=(10,10))
plt.scatter(shot_df[shot_df.SHOT_MADE_FLAG == 1].LOC_X, shot_df[shot_df.SHOT_MADE_FLAG == 1].LOC_Y, color='green',label='shots converted',s=20,marker='o',alpha=1.0/2.0)
plt.scatter(shot_df[shot_df.SHOT_MADE_FLAG == 0].LOC_X, shot_df[shot_df.SHOT_MADE_FLAG == 0].LOC_Y, color='red',label='shots missed',s=20,marker='o',alpha=0.5)
plt.legend()
plt.xlim(-300,300)
plt.ylim(-100,500)
plt.title('Lebron James shot chart 2014-2015 \n Converted-missed shots', y = 1.02, fontsize=20)
plt.grid()
plt.show()
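The two scatter calls differ only in the subset, color, and label, so they can be driven from a short loop. This is one possible sketch on made-up coordinates; the Agg backend line is only there so it also runs without a display and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up shot locations that reuse the column names of shot_df.
shots = pd.DataFrame({
    "LOC_X": [-20, 5, 150, -230, 0],
    "LOC_Y": [10, 30, 200, 60, 250],
    "SHOT_MADE_FLAG": [1, 1, 0, 0, 1],
})

made = shots[shots.SHOT_MADE_FLAG == 1]
missed = shots[shots.SHOT_MADE_FLAG == 0]

fig, ax = plt.subplots(figsize=(5, 5))
for subset, color, label in [(made, "green", "shots converted"),
                             (missed, "red", "shots missed")]:
    ax.scatter(subset.LOC_X, subset.LOC_Y, color=color, label=label,
               s=20, marker="o", alpha=0.5)

ax.set_xlim(-300, 300)
ax.set_ylim(-100, 500)
ax.legend()
```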
Let's use histograms to see where on the court LeBron took the most shots.
First we need to create a DataFrame with the LOC_X and LOC_Y information,
h1=pd.DataFrame(shot_df, columns=['LOC_X', 'LOC_Y'])
Now we use this DataFrame to plot the histograms for LOC_X and LOC_Y,
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
plt.figure(figsize=(10,10))
plt.subplot(2, 1, 1)
# histplot replaces distplot(..., kde=False), which is deprecated since seaborn 0.11
sns.histplot(h1.LOC_X)
plt.subplot(2, 1, 2)
sns.histplot(h1.LOC_Y)
As we can see from these histograms, LeBron took most of his shots around the rim.
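A reading like this can be backed by numbers: np.histogram returns the bin counts that a histogram draws. The sketch below uses made-up LOC_Y values and an arbitrary bin width of 50 to find the busiest bin.

```python
import numpy as np

# Made-up LOC_Y values: most shots close to the basket, a few far away.
loc_y = np.array([0, 5, 10, 12, 15, 100, 200, 250])

# Bin edges every 50 units from 0 to 300 (an arbitrary choice).
edges = np.arange(0, 301, 50)
counts, edges = np.histogram(loc_y, bins=edges)

busiest = int(np.argmax(counts))
print(counts.tolist())
print("busiest bin:", edges[busiest], "to", edges[busiest + 1])
```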
This is exploratory data analysis (EDA). At this point, take your time and try different plots; there are many types and combinations.
Let's plot all shots and the histograms in a single figure. For this we are going to use seaborn,
# create our jointplot
sns.set(color_codes=True)
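# Recent seaborn releases require x= and y= keywords, use height= instead of
# size=, and no longer accept stat_func
joint_shot_chart = sns.jointplot(x=shot_df1.LOC_X, y=shot_df1.LOC_Y,
                                 kind='scatter', space=0.2, alpha=1,
                                 height=8, edgecolor='w', color='b').set_axis_labels("x location", "y location")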
# A joint plot has 3 Axes, the first one called ax_joint
# is the one we want to adjust
joint_shot_chart.fig.set_size_inches(10,10)
ax = joint_shot_chart.ax_joint
# Adjust the axis limits of the plot
ax.set_xlim(-250,250)
#ax.set_ylim(422.5, -47.5)
ax.set_ylim(-47.5, 422.5)
# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom=False, labelleft=False)
# Add a title
ax.set_title('Lebron James shot chart 2014-2015',
y=1.25, fontsize=14)
plt.show()
As we can see here, LeBron is very active close to the rim.
What about his efficiency? This is left to you as an exercise.
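One possible starting point for the exercise: because SHOT_MADE_FLAG is 0 or 1, its mean within a group is the fraction of shots converted. The rows and zone names below are made up for illustration and are not taken from the tutorial data.

```python
import pandas as pd

# Made-up rows; the zone names are assumptions, not values from test.csv.
shots = pd.DataFrame({
    "SHOT_ZONE_BASIC": ["Restricted Area"] * 4 + ["Mid-Range"] * 2
                       + ["Above the Break 3"] * 3,
    "SHOT_MADE_FLAG": [1, 1, 0, 1, 0, 1, 0, 0, 1],
})

# sum = shots made, count = shots attempted, mean = field goal percentage.
efficiency = (shots.groupby("SHOT_ZONE_BASIC")["SHOT_MADE_FLAG"]
                   .agg(made="sum", attempted="count", fg_pct="mean"))
print(efficiency.sort_values("fg_pct", ascending=False))
```

The same grouping can be applied to shot_df with any of its zone or period columns.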
Instead of a scatter plot, we can use hexbins. In seaborn, you can proceed as follows,
# create our jointplot
sns.set_style("white")
cmap=plt.cm.gist_heat_r
# Recent seaborn releases require x= and y= keywords and no longer accept stat_func
joint_shot_chart = sns.jointplot(x=shot_df.LOC_X, y=shot_df.LOC_Y,
                                 kind='hex', gridsize=40, space=0, color=cmap(0.2), cmap=cmap, vmin=0, vmax=50)
#joint_shot_chart = sns.jointplot(shot_df.LOC_X, shot_df.LOC_Y, stat_func=None,
# kind='hex',space=0, color=cmap(0.2), cmap=cmap)
joint_shot_chart.fig.set_size_inches(12,11)
# A joint plot has 3 Axes, the first one called ax_joint
# is the one we want to adjust
ax = joint_shot_chart.ax_joint
# Adjust the axis limits of the plot
ax.set_xlim(-250,250)
#ax.set_ylim(422.5, -47.5)
ax.set_ylim(-47.5, 422.5)
# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(labelbottom=False, labelleft=False)
# Add a title
ax.set_title('Lebron James shot chart 2014-2015',
y=1.25, fontsize=14)
# Add James Harden's image to the top right
#img = OffsetImage(image, zoom=0.6)
#img.set_offset((625,621))
#ax.add_artist(img)
plt.show()
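If only the hexagonal binning is needed, without the marginal histograms, matplotlib's own hexbin can be used directly. This is an alternative sketch, not the seaborn approach above, and it runs on randomly generated stand-in coordinates.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in coordinates, clustered near the origin.
rng = np.random.default_rng(0)
x = rng.normal(0, 80, size=500)
y = np.abs(rng.normal(0, 120, size=500))

fig, ax = plt.subplots(figsize=(6, 6))
hb = ax.hexbin(x, y, gridsize=40, cmap=plt.cm.gist_heat_r, vmin=0, vmax=50)
fig.colorbar(hb, ax=ax, label="shots per hexagon")

# Each point falls in exactly one hexagon, so the counts add up to the
# number of points.
counts = hb.get_array()
print(int(counts.sum()), len(x))
```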
Let's use a boxplot to represent the field goal attempts. This kind of plot is very useful for identifying outliers or anomalies in your observations.
sns.boxplot(shot_df.LOC_Y, orient="v")
plt.title('Lebron James - Boxplot of all field attempts for LOC_Y', y = 1.1, fontsize=20)
In this plot we can see that there are a few outliers or irregular observations.
We can also draw the boxplot with the individual points superimposed, as follows,
#sns.stripplot(data=shot_df.LOC_Y, jitter=True, color="white", edgecolor="gray")
sns.stripplot(data=shot_df.LOC_Y, jitter=0.3, color="white", edgecolor="gray")
sns.boxplot(shot_df.LOC_Y, orient="v")
plt.title('Lebron James - Boxplot of all field attempts for LOC_Y', y = 1.1, fontsize=20)
Here the outliers are easier to see. They correspond to shots taken from far away, presumably close to the end of a period. Checking this is left as an exercise to the reader.
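By default, a boxplot marks as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule on made-up LOC_Y values, which could be one way to list the outlying shots:

```python
import pandas as pd

# Made-up LOC_Y values with one very long shot.
loc_y = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80, 400], name="LOC_Y")

q1 = loc_y.quantile(0.25)
q3 = loc_y.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles, the usual boxplot default.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = loc_y[(loc_y < lower) | (loc_y > upper)]
print(lower, upper, outliers.tolist())
```

Filtering shot_df with the same mask would show the period and time of each outlying shot.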
We can do the same for LOC_X,
sns.stripplot(data=shot_df.LOC_X, jitter=0.3, color="white", edgecolor="gray")
sns.boxplot(shot_df.LOC_X, orient="v")
plt.title('Lebron James - Boxplot of all field attempts for LOC_X', y = 1.1, fontsize=20)
At this point let's do some pretty plotting. We are going to add a background image to the scatter plot and plot only the 3-point shots attempted.
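To load the background image we use imread. Older SciPy releases [4] provided scipy.misc.imread, which has since been removed, so we import the equivalent function from matplotlib.
from matplotlib.pyplot import imread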
plt.figure(figsize=(10,10))
#draw_court(outer_lines=True)
#plt.xlim(-300,300)
#plt.ylim(-100,500)
datafile = 'chart1.png'
#datafile = 'chart2.jpg'
#datafile = 'bg_court.jpg'
img = imread(datafile)
#plt.imshow(img, zorder=0, extent=[-260, 260, -60, 400])
plt.imshow(img, zorder=0, extent=[-260, 260, -400, 60])
plt.scatter(c2.LOC_X,-1*c2.LOC_Y,zorder=1,color='blue',label='3 point shots attempted',s=20,marker='o',alpha=0.6)
plt.title('Lebron James shot chart 2014-2015 \n 3 point shots attempted', y = 1.02, fontsize=20)
#plt.ylim(400, -60)
plt.show()
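The extent argument maps the image onto data coordinates as [left, right, bottom, top], which is why LOC_Y is negated to place the shots on a court drawn from y = -400 to y = 60. A minimal sketch of that alignment, where a blank array stands in for chart1.png and the coordinates are made up:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Blank white image standing in for chart1.png (an assumption).
court = np.ones((46, 52, 3))

# Made-up 3-point attempts in the coordinates used by shot_df.
loc_x = np.array([-220, 0, 230])
loc_y = np.array([30, 250, 40])

fig, ax = plt.subplots(figsize=(5, 5))
# extent = [left, right, bottom, top] in data coordinates.
ax.imshow(court, zorder=0, extent=[-260, 260, -400, 60])
# Negating LOC_Y moves the shots into the image's y range.
ax.scatter(loc_x, -1 * loc_y, zorder=1, color="blue", s=20, alpha=0.6,
           label="3 point shots attempted")
ax.legend()
```

The zorder values keep the image behind the points.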