In this final tutorial we address clustering and advanced plotting.
Clustering is a form of unsupervised learning: it consists in finding hidden structure in unlabelled data. We are going to use k-means clustering [1], which is included with scipy [2].
As we have already seen, the dataset is labelled. Nevertheless, we are going to compare the solution obtained with k-means clustering against the labelled data.
Finally, we are going to do advanced plotting using matplotlib [3] and seaborn [4].
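Before turning to scipy, it may help to see what k-means does at each iteration. The following is a minimal sketch in plain numpy, assuming a fixed number of iterations and user-supplied starting centroids; the function name simple_kmeans and the toy data are made up for this illustration, and this is not how scipy implements the algorithm internally.

```python
import numpy as np

def simple_kmeans(points, init_centroids, n_iter=10):
    """Illustrative k-means: alternate an assignment step and an update step."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids, labels

# Toy data: two tight groups around (0, 0) and (5, 5)
points = np.array([[0.0, 0.1], [0.2, 0.0], [-0.1, -0.1],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
centroids, labels = simple_kmeans(points, init_centroids=[[1.0, 1.0], [4.0, 4.0]])
print(centroids)
print(labels)
```

Each centroid ends up at the mean of the points assigned to it, which is the idea behind the scipy functions used in the rest of this tutorial.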
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#import json
#import seaborn as sns
Let's use pandas [5] to load the CSV file, using column 0 of the dataset as the row labels of the resulting DataFrame,
#shot_df = pd.read_csv('test.csv')
#shot_df = pd.read_csv(filepath_or_buffer='test.csv')
shot_df = pd.read_csv(filepath_or_buffer='test.csv',index_col=0)
To display the information in shot_df (we only display the first 4 rows),
shot_df.head(4)
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | ... | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 21400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | ... | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 21400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | ... | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
2 | Shot Chart Detail | 21400018 | 53 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 4 | 45 | ... | Fadeaway Jump Shot | 2PT Field Goal | Mid-Range | Left Side(L) | 8-16 ft. | 12 | -105 | 63 | 1 | 0 |
3 | Shot Chart Detail | 21400018 | 77 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 2 | 31 | ... | Jump Shot | 3PT Field Goal | Right Corner 3 | Right Side(R) | 24+ ft. | 22 | 227 | -16 | 1 | 0 |
4 rows × 21 columns
To force pandas to display all the columns, you can proceed as follows,
#uncomment this line to display all the columns
#pd.set_option('display.max_columns', None)
#shot_df.head(4)
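If you only want to show all the columns for a single display, pandas also offers option_context, which restores the previous setting when the block ends. A small sketch, assuming a made-up 30-column DataFrame called demo_df in place of shot_df:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for shot_df with many columns
demo_df = pd.DataFrame(np.arange(30).reshape(1, 30),
                       columns=[f"col_{i}" for i in range(30)])

default_max = pd.get_option('display.max_columns')

# Inside the block all columns are shown; the setting is restored afterwards
with pd.option_context('display.max_columns', None):
    print(demo_df.head(4))

# The global option is unchanged once the block has finished
print(pd.get_option('display.max_columns') == default_max)
```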
These are the scipy modules we need for clustering. We are also going to use numpy's rand function to generate random points.
#from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq,whiten
Let's illustrate clustering with a simple example. First we are going to create two datasets, and then we are going to apply k-means clustering.
# data generation
data1 = rand(50,2) + np.array([.5,.5])
data2 = rand(50,2) + np.array([0,-.2])
#data1 = rand(200,2) + np.array([.5,.5])
#data2 = rand(200,2) + np.array([0,-.2])
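Because rand draws new points on every run, the numbers printed in this tutorial will differ from the ones you obtain. If you want repeatable data, one option is to seed numpy's random generator first. A short sketch (the seed value 42 is an arbitrary choice):

```python
import numpy as np
from numpy.random import rand

np.random.seed(42)            # 42 is an arbitrary choice
first = rand(3, 2)

np.random.seed(42)            # re-seeding replays the same sequence
second = rand(3, 2)

print(np.array_equal(first, second))
```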
Now we plot the two datasets using matplotlib,
plt.plot(data1[:,0],data1[:,1],'ob',markersize=4, alpha=0.6)
plt.plot(data2[:,0],data2[:,1],'or',markersize=4, alpha=0.6)
plt.grid()
plt.show()
To do clustering, we need a single dataset. To join the datasets we can use the concatenate function,
jdata1=np.concatenate((data1,data2))
or we can use the vstack function,
jdata2=np.vstack((data1,data2))
To check the dimension of the newly created datasets,
jdata1.shape
(100, 2)
jdata2.shape
(100, 2)
To check the object type of the newly created datasets,
type(jdata1)
numpy.ndarray
type(jdata2)
numpy.ndarray
To check the data type of the numpy arrays,
jdata1.dtype
dtype('float64')
jdata2.dtype
dtype('float64')
And to check that the two arrays are equal,
np.array_equal(jdata1,jdata2)
True
or in the long way,
#uncomment this line to compare the arrays
#jdata1==jdata2
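The difference between these comparisons is easier to see on a tiny example. In this sketch the arrays a, b and c are made up; c differs from a only by a very small floating-point offset:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = a.copy()
c = a + 1e-12        # tiny floating-point difference

print(a == b)                  # element-wise boolean array
print((a == b).all())          # reduce it to a single answer
print(np.array_equal(a, c))    # exact comparison
print(np.allclose(a, c))       # comparison within a tolerance
```

For floating-point data that went through different computations, allclose is usually the safer check.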
As with a pandas DataFrame, we can also do indexing and slicing with a numpy array.
To display the first row of an array,
jdata1[0]
array([ 10.17426978, 1.08181619])
To display the first two rows of the array,
jdata1[0:2]
array([[ 10.17426978, 1.08181619], [ 10.66345049, 1.07132164]])
To display the first column of the first five rows of the array,
jdata1[0:5,0]
array([ 10.17426978, 10.66345049, 10.20683049, 10.97340282, 10.71436518])
To display the second column of the first five rows of the array,
jdata1[0:5,1]
array([ 1.08181619, 1.07132164, 1.91494455, 1.47372759, 1.48677703])
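Since the values in jdata1 are random, the same slicing patterns are shown here on a small array with known contents. The array demo is made up for this illustration:

```python
import numpy as np

demo = np.arange(10).reshape(5, 2)   # rows: [0 1], [2 3], [4 5], [6 7], [8 9]

print(demo[0])        # first row
print(demo[0:2])      # first two rows
print(demo[0:3, 0])   # first column of the first three rows
print(demo[0:3, 1])   # second column of the first three rows
```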
To plot the joined dataset using matplotlib,
# plot all the rows of the joined dataset (column 0 against column 1)
plt.plot(jdata1[:,0],jdata1[:,1],'ob',markersize=4, alpha=0.6)
plt.grid()
plt.show()
To compute the clusters using k-means we proceed as follows.
First we need to compute the centroids. In this case the initial centroids are chosen randomly; alternatively, the user has the option to give the initial positions of the centroids.
Here we want to use 2 centroids on the dataset jdata1, and the centroid positions are saved in the array centroids,
# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(jdata1,2)
To print the information in the array centroids,
#from pprint import pprint
#pprint(centroids)
#centroids
print(centroids)
[[ 0.48544814 0.34605385] [ 1.06113769 0.95313386]]
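As mentioned above, kmeans also accepts the initial centroid positions instead of a number of clusters. A minimal sketch of that option, where the toy array obs and the starting positions initial_guess are invented for this example:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Deterministic toy data: two groups of three points each
obs = np.array([[0.0, 0.0], [0.2, 0.0], [0.1, 0.3],
                [2.0, 2.0], [2.2, 2.0], [2.1, 2.3]])

# Passing an array instead of an integer uses it as the starting centroids
initial_guess = np.array([[0.5, 0.5], [1.5, 1.5]])
centroids_demo, distortion = kmeans(obs, initial_guess)

print(centroids_demo)   # one row per centroid
print(distortion)       # mean distance from the points to their nearest centroid
```

The second return value, which the tutorial discards with `_`, is the distortion; it can be useful when comparing different numbers of clusters.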
Now we proceed to label the data using the array centroids; for this we use the function vq.
This function assigns a label to each observation: each observation in the input array is compared with the centroids and assigned the label (the index) of the closest centroid.
# assign each sample to a cluster
idx,_ = vq(jdata1,centroids)
#whiten is used to normalize a group of observations on a per feature basis.
#whitened = whiten(jdata1)
#idx,_ = vq(whitened,centroids)
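A note on the commented-out whiten lines: whiten rescales each feature (column) by its standard deviation, which matters when the features have very different scales. If you use it, both the centroids and the labels should be computed from the whitened data. A sketch under those assumptions, with made-up observations:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical features on very different scales
obs = np.array([[1.0, 100.0], [2.0, 150.0], [8.0, 900.0], [9.0, 1000.0]])

whitened = whiten(obs)

# whiten divides each column by its standard deviation
print(np.allclose(whitened, obs / obs.std(axis=0)))
print(whitened.std(axis=0))     # every column now has unit standard deviation

# Centroids and labels must both come from the whitened data
guess = whitened[[0, 3]]        # start from two of the observations
centroids_w, _ = kmeans(whitened, guess)
labels_w, _ = vq(whitened, centroids_w)
print(labels_w)
```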
To print the information in the array idx,
print(idx)
#print(idx[0:4])
[1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
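To see what vq computes, we can reproduce the assignment by hand: measure the distance from every observation to every centroid and keep the index of the smallest one. In this sketch the arrays obs and code_book are invented for the illustration:

```python
import numpy as np
from scipy.cluster.vq import vq

obs = np.array([[0.1, 0.2], [0.9, 1.1], [0.4, 0.3], [1.2, 0.8]])
code_book = np.array([[0.0, 0.0], [1.0, 1.0]])   # hypothetical centroids

labels, dist = vq(obs, code_book)

# The same assignment written out with numpy broadcasting
all_dists = np.linalg.norm(obs[:, None, :] - code_book[None, :, :], axis=2)
manual_labels = all_dists.argmin(axis=1)

print(labels)
print(manual_labels)
print(np.allclose(dist, all_dists.min(axis=1)))
```

The second value returned by vq, discarded with `_` in the tutorial, is the distance from each observation to its closest centroid.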
Let's plot the clusters and the centroid information.
# some plotting using numpy's logical indexing
plt.plot(jdata1[idx==0,0],jdata1[idx==0,1],'ob',markersize=4, alpha=1)
plt.plot(jdata1[idx==1,0],jdata1[idx==1,1],'or',markersize=4, alpha=1)
plt.plot(centroids[:,0],centroids[:,1],'sk',markersize=6)
plt.grid()
plt.show()
Let's group the arrays jdata1 and idx in a DataFrame. First we need to create an empty DataFrame,
cluster1=pd.DataFrame()
Now we can copy the arrays into the DataFrame cluster1 as follows,
cluster1['X']=jdata1[:,0]
cluster1['Y']=jdata1[:,1]
cluster1['LABEL']=idx
To display the information in cluster1 (we display the first 10 rows),
cluster1.head(10)
| | X | Y | LABEL |
---|---|---|---|
0 | 1.084007 | 1.067920 | 1 |
1 | 1.118422 | 0.787775 | 1 |
2 | 1.393798 | 0.560966 | 1 |
3 | 0.665576 | 0.508579 | 0 |
4 | 1.013516 | 0.743290 | 1 |
5 | 1.348927 | 0.891368 | 1 |
6 | 1.417622 | 1.194459 | 1 |
7 | 1.285903 | 1.223910 | 1 |
8 | 1.000396 | 0.650109 | 1 |
9 | 1.194782 | 0.538871 | 1 |
To display the last 5 rows in cluster1,
cluster1.tail()
| | X | Y | LABEL |
---|---|---|---|
95 | 0.022988 | 0.236782 | 0 |
96 | 0.624744 | 0.722844 | 0 |
97 | 0.434906 | 0.025984 | 0 |
98 | 0.501344 | 0.169842 | 0 |
99 | 0.088473 | 0.799280 | 0 |
At this point we have created the DataFrame cluster1, which you can use to do some plotting or conduct some data analytics as we have done in the previous tutorials.
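As an example of such an analysis, we can count the points in each cluster and compute the mean position per cluster with groupby. The sketch below uses a small made-up DataFrame called demo in place of cluster1; it also shows that the DataFrame can be built in a single call from a dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for cluster1, built in one call from a dictionary
demo = pd.DataFrame({
    'X':     [0.1, 0.3, 0.2, 1.1, 1.3],
    'Y':     [0.2, 0.4, 0.0, 0.9, 1.1],
    'LABEL': [0, 0, 0, 1, 1],
})

sizes = demo.groupby('LABEL').size()               # points per cluster
means = demo.groupby('LABEL')[['X', 'Y']].mean()   # mean position per cluster

print(sizes)
print(means)
```

For a converged k-means solution, the per-cluster means should be close to the centroids.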
Let's complicate things a little bit more. In this example we are going to generate a bigger dataset with a different structure.
# data generation
data1 = rand(200,2) + np.array([10,1])
data2 = rand(200,2) + np.array([9,0.5])
data3 = rand(200,2) + np.array([9.5,0])
To plot the original datasets,
# plot each original dataset with its own colour
plt.plot(data1[:,0],data1[:,1],'ob',markersize=4, alpha=0.6)
plt.plot(data2[:,0],data2[:,1],'or',markersize=4, alpha=0.6)
plt.plot(data3[:,0],data3[:,1],'og',markersize=4, alpha=0.6)
plt.grid()
plt.show()
Remember that to do the clustering we need a single dataset. To concatenate the arrays we proceed as follows,
jdata1=np.concatenate((data1,data2,data3))
Now we can compute the clusters. Notice that we are using 3 centroids.
# computing K-Means with K = 3 (3 clusters)
centroids,_ = kmeans(jdata1,3)
# assign each sample to a cluster
idx,_ = vq(jdata1,centroids)
#vq(data,centroids)
Let's create a DataFrame,
cluster2=pd.DataFrame()
cluster2['X']=jdata1[:,0]
cluster2['Y']=jdata1[:,1]
cluster2['LABEL']=idx
Let's plot the clusters and the centroid information.
plt.plot(jdata1[idx==0,0],jdata1[idx==0,1],'ob',markersize=4, alpha=1)
plt.plot(jdata1[idx==1,0],jdata1[idx==1,1],'or',markersize=4, alpha=1)
plt.plot(jdata1[idx==2,0],jdata1[idx==2,1],'og',markersize=4, alpha=1)
plt.plot(centroids[:,0],centroids[:,1],'sk',markersize=6)
plt.grid()
plt.show()
To compare the original data and the computed clusters using subplot in matplotlib,
#fig = plt.figure()
plt.figure(figsize=(10, 4))
ax = plt.subplot(121)
ax.set_title("Original data or training data")
ax.plot(data1[:,0],data1[:,1],'ob',markersize=4, alpha=1)
ax.plot(data2[:,0],data2[:,1],'or',markersize=4, alpha=1)
ax.plot(data3[:,0],data3[:,1],'og',markersize=4, alpha=1)
ax.grid()
ax = plt.subplot(122)
ax.set_title("Clusters or test data")
#ax.plot(jdata1[idx==0,0],jdata1[idx==0,1],'ob',markersize=4, alpha=1)
#ax.plot(jdata1[idx==1,0],jdata1[idx==1,1],'or',markersize=4, alpha=1)
#ax.plot(jdata1[idx==2,0],jdata1[idx==2,1],'og',markersize=4, alpha=1)
ax.plot(cluster2.X[cluster2.LABEL == 0],cluster2.Y[cluster2.LABEL == 0],'ob',markersize=4, alpha=1)
ax.plot(cluster2.X[cluster2.LABEL == 1],cluster2.Y[cluster2.LABEL == 1],'or',markersize=4, alpha=1)
ax.plot(cluster2.X[cluster2.LABEL == 2],cluster2.Y[cluster2.LABEL == 2],'og',markersize=4, alpha=1)
ax.grid()
plt.show()
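Besides comparing the two plots by eye, the agreement can be tabulated. One way is a cross-tabulation of the known group against the cluster label. This is a sketch on freshly generated, well-separated data; the names centers, true_group and found_label are made up here, and the starting centroids are fixed so that the example is repeatable.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
# Three well-separated hypothetical groups, 50 points each
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.random((50, 2)) for c in centers])
true_group = np.repeat(['A', 'B', 'C'], 50)

# Start from one point of each group so the example is repeatable
found_centroids, _ = kmeans(points, points[[0, 50, 100]])
found_label, _ = vq(points, found_centroids)

# Rows: known group, columns: label assigned by k-means
table = pd.crosstab(pd.Series(true_group, name='true group'),
                    pd.Series(found_label, name='cluster'))
print(table)
```

A table with a single non-zero entry per row means that every group was recovered as one cluster. With overlapping groups, such as the ones plotted above, the counts will usually spread over several columns.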
To add reusability and automation to the plotting, let's add a loop when plotting the data. Let's use 10 centroids to illustrate the advantage of using a loop at plot time,
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
plt.figure(figsize=(10,10))
#number of centroids
cd=10
centroids,_ = kmeans(jdata1,cd)
idx,_ = vq(jdata1,centroids)
#colors = cm.rainbow(np.linspace(0, 1, cd))
colors=iter(cm.rainbow(np.linspace(0,1,cd)))
for tt in range(cd):
    c=next(colors)
    #plot(data_s1[idx== tt,0],data_s1[idx== tt,1],'o')
    plt.plot(jdata1[idx== tt,0],jdata1[idx== tt,1],'o', c=c)
plt.plot(centroids[:,0],centroids[:,1],'sk',markersize=8)
plt.grid()
plt.show()
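To push the reusability a bit further, the loop can be wrapped in a small function that draws on a given axis. This is only a sketch: the function name plot_clusters and the toy arrays are assumptions made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

def plot_clusters(ax, data, labels, centroids):
    """Draw one colour per cluster, then the centroids as black squares."""
    n_clusters = len(centroids)
    colors = cm.rainbow(np.linspace(0, 1, n_clusters))
    for k, color in enumerate(colors):
        members = data[labels == k]
        ax.plot(members[:, 0], members[:, 1], 'o', color=color, markersize=4)
    ax.plot(centroids[:, 0], centroids[:, 1], 'sk', markersize=8)
    ax.grid()
    return ax

# Hypothetical toy input
demo_data = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
demo_labels = np.array([0, 0, 1, 1])
demo_centroids = np.array([[0.05, 0.1], [1.05, 0.95]])

fig, ax = plt.subplots(figsize=(5, 5))
plot_clusters(ax, demo_data, demo_labels, demo_centroids)
```

In a notebook with inline plotting the figure is displayed automatically; in a script, call plt.show() afterwards. The same function could then be called once per subplot.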
Now that we have introduced the basic notions of clustering using scipy, let's compute the clusters for the data in shot_df, that is, the LeBron James shot chart.
The first thing to do is to create a DataFrame where we are going to put the information of LOC_X and LOC_Y.
stmp=pd.DataFrame()
stmp['LOC_X']=shot_df.LOC_X
stmp['LOC_Y']=shot_df.LOC_Y
To print a concise summary of the DataFrame shot_df,
shot_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1279 entries, 0 to 1278
Data columns (total 21 columns):
GRID_TYPE              1279 non-null object
GAME_ID                1279 non-null int64
GAME_EVENT_ID          1279 non-null int64
PLAYER_ID              1279 non-null int64
PLAYER_NAME            1279 non-null object
TEAM_ID                1279 non-null int64
TEAM_NAME              1279 non-null object
PERIOD                 1279 non-null int64
MINUTES_REMAINING      1279 non-null int64
SECONDS_REMAINING      1279 non-null int64
EVENT_TYPE             1279 non-null object
ACTION_TYPE            1279 non-null object
SHOT_TYPE              1279 non-null object
SHOT_ZONE_BASIC        1279 non-null object
SHOT_ZONE_AREA         1279 non-null object
SHOT_ZONE_RANGE        1279 non-null object
SHOT_DISTANCE          1279 non-null int64
LOC_X                  1279 non-null int64
LOC_Y                  1279 non-null int64
SHOT_ATTEMPTED_FLAG    1279 non-null int64
SHOT_MADE_FLAG         1279 non-null int64
dtypes: int64(12), object(9)
memory usage: 219.8+ KB
To print a concise summary of the DataFrame stmp,
stmp.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1279 entries, 0 to 1278
Data columns (total 2 columns):
LOC_X    1279 non-null int64
LOC_Y    1279 non-null int64
dtypes: int64(2)
memory usage: 30.0 KB
To display the first 5 rows of stmp,
stmp.head()
| | LOC_X | LOC_Y |
---|---|---|
0 | 114 | 148 |
1 | -7 | 0 |
2 | -105 | 63 |
3 | 227 | -16 |
4 | 91 | 246 |
To use k-means with scipy we need to convert the pandas DataFrame into a numpy array.
# .values returns the underlying numpy array (as_matrix was removed in pandas 1.0)
stmp1=stmp.values
To know the type of stmp1,
type(stmp1)
numpy.ndarray
which has the data type,
stmp1.dtype
dtype('int64')
However, k-means in scipy only supports the float or double data types. To convert an int64 array to float, we proceed as follows,
#type other than float or double not supported
data_s1=stmp1.astype(float)
data_s1.dtype
dtype('float64')
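With pandas 0.24 or later, the column selection and the conversion to float can also be done in one step with to_numpy. A sketch on a small made-up DataFrame, demo_df, standing in for shot_df:

```python
import pandas as pd

# Hypothetical stand-in for shot_df with integer coordinates
demo_df = pd.DataFrame({'LOC_X': [114, -7, -105],
                        'LOC_Y': [148, 0, 63],
                        'SHOT_MADE_FLAG': [0, 1, 0]})

# Select the two coordinate columns and convert to a float array in one step
coords = demo_df[['LOC_X', 'LOC_Y']].to_numpy(dtype=float)

print(coords.dtype)
print(coords.shape)
```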
To print the numpy array data_s1,
print(data_s1)
[[ 114.  148.]
 [  -7.    0.]
 [-105.   63.]
 ...,
 [-116.  168.]
 [ -48.   61.]
 [-199.   77.]]
Now we can compute the clusters (centroids and data labelling).
Notice that we are using 6 centroids.
#centroids,_ = kmeans(stmp1,10)
centroids,_ = kmeans(data_s1,6)
idx,_ = vq(data_s1,centroids)
To plot the positions of the centroids,
plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=5)
plt.xlim(-300,300)
plt.ylim(-100,500)
plt.grid()
plt.show()
Now we can plot the clusters.
Keep in mind that you can change the number of clusters with the variable cd; in this case we are using 6 clusters.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
plt.figure(figsize=(10,10))
cd=6
centroids,_ = kmeans(data_s1,cd)
idx,_ = vq(data_s1,centroids)
#colors = cm.rainbow(np.linspace(0, 1, cd))
colors=iter(cm.rainbow(np.linspace(0,1,cd)))
for tt in range(cd):
    c=next(colors)
    #plot(data_s1[idx== tt,0],data_s1[idx== tt,1],'o')
    plt.plot(data_s1[idx== tt,0],data_s1[idx== tt,1],'o', c=c)
plt.plot(centroids[:,0],centroids[:,1],'sk',markersize=8)
plt.xlim(-300,300)
plt.ylim(-100,500)
plt.grid()
plt.show()
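If you are unsure which value of cd to use, one common heuristic is to run kmeans for several values and look at how the distortion (the second value returned by kmeans) decreases. The sketch below uses made-up data with three obvious groups instead of the shot chart; the names centers, points and distortions are assumptions for this example.

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(1)
# Hypothetical data: three well-separated groups of 100 points
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.random((100, 2)) for c in centers])

distortions = {}
for k in range(1, 7):
    # kmeans returns the centroids and the mean distance to the nearest centroid
    _, distortion = kmeans(points, k)
    distortions[k] = distortion

for k, d in distortions.items():
    print(k, round(float(d), 3))
```

The distortion typically drops sharply until the number of clusters matches the structure of the data and flattens afterwards. Since the starting centroids are random, the exact numbers change between runs.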
At this point, we have labelled the data using k-means clustering (unsupervised learning).
Notice that the algorithm did a pretty good job, as the zones it found are similar to those used by the NBA to classify the shot zone area, stored as shot_df.SHOT_ZONE_AREA in our dataset.
shot_df.groupby('SHOT_ZONE_AREA').count().GRID_TYPE
SHOT_ZONE_AREA
Back Court(BC)             3
Center(C)                670
Left Side Center(LC)     177
Left Side(L)             181
Right Side Center(RC)    149
Right Side(R)             99
Name: GRID_TYPE, dtype: int64
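The same kind of count can be obtained without picking an arbitrary column such as GRID_TYPE, either with size() or with value_counts(). A sketch on a small made-up DataFrame, demo_df, standing in for shot_df:

```python
import pandas as pd

# Hypothetical stand-in for shot_df
demo_df = pd.DataFrame({
    'SHOT_ZONE_AREA': ['Center(C)', 'Left Side(L)', 'Center(C)',
                       'Right Side(R)', 'Center(C)', 'Left Side(L)'],
    'LOC_X': [0, -150, 5, 160, -3, -140],
})

by_group = demo_df.groupby('SHOT_ZONE_AREA').size()                  # rows per zone
by_counts = demo_df['SHOT_ZONE_AREA'].value_counts().sort_index()    # same counts

print(by_group)
print(by_group.tolist() == by_counts.tolist())
```

Note that count() only counts non-null entries of the chosen column, whereas size() counts rows; for this dataset, where no column has missing values, both give the same numbers.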
For completeness, let's plot the original labelled data.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
plt.figure(figsize=(10,10))
#To plot in the poor man's way
#plt.plot(shot_df.LOC_X[shot_df.SHOT_ZONE_AREA == "Back Court(BC)"],shot_df.LOC_Y[shot_df.SHOT_ZONE_AREA == "Back Court(BC)"],'o',color='b')
#plt.plot(shot_df.LOC_X[shot_df.SHOT_ZONE_AREA == "Center(C)"], shot_df.LOC_Y[shot_df.SHOT_ZONE_AREA == "Center(C)"],'o',color='g')
#plt.plot(shot_df.LOC_X[shot_df.SHOT_ZONE_AREA == "Left Side Center(LC)"], shot_df.LOC_Y[shot_df.SHOT_ZONE_AREA == "Left Side Center(LC)"],'o',color='c')
#plt.plot(shot_df.LOC_X[shot_df.SHOT_ZONE_AREA == "Left Side(L)"], shot_df.LOC_Y[shot_df.SHOT_ZONE_AREA == "Left Side(L)"],'o',color='m')
#plt.plot(shot_df.LOC_X[shot_df.SHOT_ZONE_AREA == "Right Side Center(RC)"], shot_df.LOC_Y[shot_df.SHOT_ZONE_AREA == "Right Side Center(RC)"],'o',color='k')
#plt.plot(shot_df.LOC_X[shot_df.SHOT_ZONE_AREA == "Right Side(R)"], shot_df.LOC_Y[shot_df.SHOT_ZONE_AREA == "Right Side(R)"],'o',color='y')
#plt.scatter(missed.LOC_X, missed.LOC_Y, color='red',label='missed',s=20,marker='o',alpha=0.5)
#To plot in the lazy way (smart)
li = ['Back Court(BC)','Center(C)','Left Side Center(LC)','Left Side(L)','Right Side Center(RC)','Right Side(R)']
sets=6
colors=iter(cm.rainbow(np.linspace(0,1,sets)))
#for s in li:
for s,c in zip(li,colors):
    plt.plot(shot_df.LOC_X[shot_df.SHOT_ZONE_AREA == s],shot_df.LOC_Y[shot_df.SHOT_ZONE_AREA == s],'o',color=c)
#plt.legend()
plt.xlim(-300,300)
plt.ylim(-100,500)
plt.grid()
plt.show()
and let's use subplots to compare both figures,
#fig = plt.figure()
plt.figure(figsize=(14, 6))
plt.subplot(121)
cd=6
centroids,_ = kmeans(data_s1,cd)
idx,_ = vq(data_s1,centroids)
colors=iter(cm.rainbow(np.linspace(0,1,cd)))
for tt in range(cd):
    c=next(colors)
    plt.plot(data_s1[idx== tt,0],data_s1[idx== tt,1],'o', c=c)
#plt.plot(centroids[:,0],centroids[:,1],'sk',markersize=5)
plt.xlim(-300,300)
plt.ylim(-100,500)
plt.xlabel('LOC_X')
plt.ylabel('LOC_Y')
plt.grid()
plt.title('Cluster data',fontsize=18)
#second plot
plt.subplot(122)
li = ['Back Court(BC)','Center(C)','Left Side Center(LC)','Left Side(L)','Right Side Center(RC)','Right Side(R)']
sets=6
colors=iter(cm.rainbow(np.linspace(0,1,sets)))
for s,c in zip(li,colors):
    plt.plot(shot_df.LOC_X[shot_df.SHOT_ZONE_AREA == s],shot_df.LOC_Y[shot_df.SHOT_ZONE_AREA == s],'o',color=c)
plt.xlim(-300,300)
plt.ylim(-100,500)
plt.xlabel('LOC_X')
#plt.ylabel('LOC_Y')
plt.grid()
plt.title('Original data',fontsize=18)
#plt.subplot(122).set_title("ax3")
plt.show()
#import sys
#print('Python version:', sys.version_info)
#import IPython
#print('IPython version:', IPython.__version__)
#print('Requests version', requests.__version__)
#print('Pandas version:', pd.__version__)
#print('json version:', json.__version__)
#import matplotlib
#print('matplotlib version:', matplotlib.__version__)
#print('seaborn version:', sns.__version__)
#import scipy
#print('scipy version:', scipy.__version__)