In [2]:

```
import pandas as pd
import numpy as np
```

We'll use `rpy2` (through IPython's R magic) to read the `.rda` file format.

In [ ]:

```
%load_ext rpy2.ipython
```

In [3]:

```
%%R
download.file("https://dl.dropbox.com/u/7710864/courseraPublic/samsungData.rda",destfile="../data/samsungData.rda",method="curl")
load("../data/samsungData.rda")
write.csv(samsungData,file="../data/samsungData.csv")
```

In [3]:

```
samsungData = pd.read_csv('../data/samsungData.csv')
samsungData = samsungData.drop(['Unnamed: 0'], axis=1)
```

A quick look at the column names and the types of activities in the data:

In [4]:

```
samsungData.columns[:12]
```

Out[4]:

Index([tBodyAcc-mean()-X, tBodyAcc-mean()-Y, tBodyAcc-mean()-Z, tBodyAcc-std()-X, tBodyAcc-std()-Y, tBodyAcc-std()-Z, tBodyAcc-mad()-X, tBodyAcc-mad()-Y, tBodyAcc-mad()-Z, tBodyAcc-max()-X, tBodyAcc-max()-Y, tBodyAcc-max()-Z], dtype=object)

In [14]:

```
samsungData['activity'].value_counts()
```

Out[14]:

```
laying      1407
standing    1374
sitting     1286
walk        1226
walkup      1073
walkdown     986
```

The scatter plots below show the first and second feature vectors for subject number 1, colour coded by activity. Almost all activities have first- and second-variable values clustered around zero, except LAYING, which is spread along the y-axis.

The conclusion from these plots is that we can't really classify the activities based on these two variables, since their values overlap too heavily.

In [8]:

```
subj1 = samsungData[samsungData['subject'] == 1]
numericActivity = subj1.groupby('activity')
cols = {'laying': 'b',
        'sitting': 'g',
        'standing': 'r',
        'walk': 'c',
        'walkdown': 'm',
        'walkup': 'y'}
f, (ax1, ax2) = subplots(ncols=2)
f.set_size_inches(10, 5)
for act, df in numericActivity:
    ax1.scatter(df.index, df.ix[:, 0], c=cols[act], label=act)
    ax2.scatter(df.index, df.ix[:, 1], c=cols[act], label=act)
ax1.set_ylabel(samsungData.columns[0])
ax2.set_ylabel(samsungData.columns[1])
ax2.legend(loc='lower right')
f.tight_layout();
```

Let's run hierarchical clustering on the data, using the first three variables, which are:

In [10]:

```
subj1.ix[:,:3].columns
```

Out[10]:

Index([tBodyAcc-mean()-X, tBodyAcc-mean()-Y, tBodyAcc-mean()-Z], dtype=object)

The dendrogram below again shows that there is no 'clustering' of the activities based on the variables used.

I use the character 'O' repeated n times to encode the activities on the x-axis of the dendrogram, where n runs from 1 to 6 in the alphabetical order of the activity names. That is, 'O' represents LAYING and 'OOOOOO' represents WALKUP.
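The encoding itself can be sketched with the modern pandas API (a minimal illustration, not part of the original notebook; `Categorical.codes` replaces the older `labels` attribute used in the cell below):

```python
import pandas as pd

# Listing the categories explicitly fixes the alphabetical code order:
# laying=0, sitting=1, standing=2, walk=3, walkdown=4, walkup=5
activities = ['laying', 'sitting', 'standing', 'walk', 'walkdown', 'walkup']
acts = pd.Categorical(['laying', 'walkdown', 'walkup'], categories=activities)

# Repeat 'O' (code + 1) times, exactly as the dendrogram's leaf_label_func does
labels = ['O' * (code + 1) for code in acts.codes]
print(labels)  # -> ['O', 'OOOOO', 'OOOOOO']
```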

In [17]:

```
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

actlabels = pd.Categorical.from_array(subj1['activity'])
distanceMatrix = pdist(subj1.ix[:, :3])
dendrogram(linkage(distanceMatrix, method='complete'),
           color_threshold=0.3,
           leaf_label_func=lambda x: 'O' * (actlabels.labels[x] + 1),
           leaf_font_size=6)
f = gcf()
f.set_size_inches(8, 4);
```

Let's see if variables number 9 and 10 could give us a better clustering. These are:

In [20]:

```
subj1.ix[:,[9, 10]].columns
```

Out[20]:

Index([tBodyAcc-max()-X, tBodyAcc-max()-Y], dtype=object)

In [22]:

```
f, (ax1, ax2) = subplots(ncols=2)
f.set_size_inches(10, 5)
for act, df in numericActivity:
    ax1.scatter(df.index, df.ix[:, 9], c=cols[act], label=act)
    ax2.scatter(df.index, df.ix[:, 10], c=cols[act], label=act)
ax1.set_ylabel(samsungData.columns[9])
ax2.set_ylabel(samsungData.columns[10])
ax2.legend(loc='lower right')
f.tight_layout();
```

As we can see, the `tBodyAcc-max()-X` values show some grouping for WALK, WALKUP, and WALKDOWN. In particular, the values for WALKDOWN are very well separated from the others, though we still can't separate WALK from WALKUP. The same goes for `tBodyAcc-max()-Y`.

If we look at the dendrogram built from variables number 9 to 12, we reach the same conclusion: WALKDOWN is very well separated from the others (its leaves are labelled with five 'O's on the horizontal axis), but we still can't see any grouping of the other activities.

In [23]:

```
distanceMatrix = pdist(subj1.ix[:, 9:12])
dendrogram(linkage(distanceMatrix, method='complete'),
           color_threshold=0.3,
           leaf_label_func=lambda x: 'O' * (actlabels.labels[x] + 1),
           leaf_font_size=6)
f = gcf()
f.set_size_inches(8, 4);
```

Now we'll use SVD to figure out which variables matter most, i.e. which ones best separate the activities. To do this we need to normalize the matrix that contains all variable values for all observations, excluding the subject ID and the activity label.

Such scaling functionality is available in `scikit-learn`, but since it's just a simple standardization that we need, we'll define a short function to do it:

In [24]:

```
# a simple scale function to normalize a matrix:
# subtract the column means and divide by the column standard deviations
def scale(matrix):
    from numpy import mean, std
    return (matrix - mean(matrix, axis=0)) / std(matrix, axis=0)
```
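For reference, the equivalent standardization in `scikit-learn` is `StandardScaler` (a short sketch, assuming scikit-learn is installed; the notebook's own `scale` function is what's used below):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# fit_transform computes (X - column mean) / column std, like scale() above
scaled = StandardScaler().fit_transform(X)

# Each column now has mean 0 and (population) standard deviation 1
print(scaled.mean(axis=0), scaled.std(axis=0))
```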

In [25]:

```
U, D, Vt = np.linalg.svd(subj1.ix[:,:-2].apply(scale), full_matrices=False)
```
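As a quick sanity check on what `np.linalg.svd` returns (a toy example, unrelated to the course data): with `full_matrices=False` the factors have shapes (n, k), (k,), and (k, p) with k = min(n, p), and multiplying them back together recovers the input.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))

U, D, Vt = np.linalg.svd(A, full_matrices=False)

# Shapes follow the 'economy' convention; singular values are sorted descending
assert U.shape == (8, 3) and D.shape == (3,) and Vt.shape == (3, 3)
assert np.all(D[:-1] >= D[1:])

# U * diag(D) * Vt reconstructs A up to floating-point error
assert np.allclose(U @ np.diag(D) @ Vt, A)
```

The rows of `Vt` are the right singular vectors, which is why we can read off each original variable's contribution to a component from a single row of `Vt`, as the cells below do.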

In [26]:

```
f, (ax1, ax2) = subplots(ncols=2)
f.set_size_inches(10, 5)
for lb, cl in zip(list(actlabels.levels), 'b g r c m y k'.split()):
    idx = subj1['activity'] == lb
    ax1.scatter(subj1.index[idx], U[idx, 0], c=cl, label=lb)
    ax2.scatter(subj1.index[idx], U[idx, 1], c=cl, label=lb)
ax1.set_ylabel('U[:,0]')
ax2.set_ylabel('U[:,1]')
ax2.legend(loc='lower right')
f.tight_layout();
```

In [27]:

```
plot(Vt[1,:], 'ok');
```

In [31]:

```
maxContrib = np.argmax(Vt[1,:])
maxContrib
```

Out[31]:

295

In [34]:

```
distanceMatrix = pdist(subj1.take(list(range(9, 12)) + [maxContrib], axis=1))
dendrogram(linkage(distanceMatrix, method='complete'),
           color_threshold=0.3,
           leaf_label_func=lambda x: 'O' * (actlabels.labels[x] + 1),
           leaf_font_size=6)
f = gcf()
f.set_size_inches(8, 4);
```

And it works! We can now see that WALKDOWN (five 'O's, on the left-hand side of the horizontal axis) is completely separated from WALK (four 'O's) and WALKUP (six 'O's).

Out of curiosity, let's check which variable it is that separates WALKDOWN from everything else, the one we discovered through SVD:

In [36]:

```
samsungData.columns[maxContrib]
```

Out[36]:

'fBodyAcc-meanFreq()-Z'

It's got something to do with the Z-axis, which kind of makes sense.

**K-means**

Another approach is to use K-means clustering. Here we'll simply feed the feature matrix to the K-means implementation from `scipy`, telling it we expect 6 cluster centres (i.e. 6 activities). Then we'll show in tabular form which activity lands in which cluster.
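The `scipy` API in a nutshell (a toy sketch on two synthetic blobs, not the Samsung data; note that `kmeans` uses random initialization, so results can vary between runs unless the global NumPy seed is fixed):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),   # blob around (0, 0)
                   rng.normal(3.0, 0.1, size=(20, 2))])  # blob around (3, 3)

np.random.seed(42)  # kmeans picks its initial centres at random
centers, distortion = kmeans(blobs, 2, iter=100)
clusters, _ = vq(blobs, centers)  # assign each point to its nearest centre
```

`kmeans` returns the centres and the mean distortion; `vq` then maps each observation to its nearest centre, which is exactly the `cluster` array cross-tabulated against the true activity labels below.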

In [37]:

```
from scipy.cluster.vq import kmeans, vq
data = np.matrix(subj1.ix[:,:-2])
centers, _ = kmeans(data, 6, iter=100)
cluster, _ = vq(data, centers)
df = pd.crosstab(cluster, subj1['activity'])
df
```

Out[37]:

| cluster (row_0) | laying | sitting | standing | walk | walkdown | walkup |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 48 | 0 |
| 1 | 4 | 0 | 0 | 0 | 0 | 53 |
| 2 | 0 | 0 | 0 | 95 | 1 | 0 |
| 3 | 27 | 0 | 0 | 0 | 0 | 0 |
| 4 | 19 | 13 | 3 | 0 | 0 | 0 |
| 5 | 0 | 34 | 50 | 0 | 0 | 0 |

In [44]:

```
idmax = np.argmax(df['laying'])
plot(centers[idmax,:10], 'ok')
ylabel('Cluster Center');
```

In [47]:

```
idmax = np.argmax(df['walk'])
plot(centers[idmax,:10], 'ok')
ylabel('Cluster Center');
```
