Now we're going to take a large and messy data set generated by a familiar everyday object and prepare it for analysis using Random Forests. Why do we want to use Random Forests? This will become clear very shortly.
We will use a data set of mobile phone accelerometer and gyroscope readings to create a predictive model. The data set is available in R Data form on Amazon S3 and in raw form at the UCI Repository. The readings encode the orientation of the phone and the motion of the person wearing it.
The subject is known to be doing one of six activities - sitting, standing, lying down, walking, walking upstairs, and walking downstairs.
Our goal is to predict, given one data point, which activity the subject is doing. We set ourselves the goal of creating a model with understandable variables rather than a black box. We could instead build a black box model that is just variables and coefficients: feed it a data point and out pops an answer. This generally works, but it is simply too much "magic" to help us build intuition or give us any opportunity to use our domain knowledge.
So we are going to open the box a bit: we will start from domain knowledge and bring in the massive power of Random Forests once we have some intuition going. We find that in the long run this is a much more satisfying approach and also, it appears, a much more powerful one.
We will reduce the independent variable set to 36 variables using domain knowledge alone and then use Random Forests to predict the variable 'activity'. This may not be the most accurate model possible, but we want to understand what is going on, and from that perspective it turns out to be much better.
We use the accuracy measures Positive and Negative Predictive Value, Sensitivity, and Specificity to rate our model.
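For reference, a minimal sketch of how these per-class measures can be computed from a confusion matrix is below. The function name class_measures and the use of scikit-learn's confusion_matrix are our own choices for illustration, not taken from the error-measure code referenced at the end of this section.

from sklearn.metrics import confusion_matrix

def class_measures(y_true, y_pred, labels):
    """Per-class sensitivity, specificity, PPV and NPV from a confusion matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    measures = {}
    for i, label in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp        # actual class i, predicted as something else
        fp = cm[:, i].sum() - tp        # predicted class i, actually something else
        tn = cm.sum() - tp - fn - fp
        measures[label] = {
            'sensitivity': tp / (tp + fn),   # true positive rate
            'specificity': tn / (tn + fp),   # true negative rate
            'ppv': tp / (tp + fp),           # positive predictive value
            'npv': tn / (tn + fn),           # negative predictive value
        }
    return measures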
Since we have 563 total columns, we will dispense with the step of creating a formal data dictionary and refer to feature_info.txt instead.
Initial exploration of the data shows that the column names are dirty, with a number of problems:
%pylab inline
import pandas as pd

df = pd.read_csv('../datasets/samsung/samsungdata.csv')
Populating the interactive namespace from numpy and matplotlib
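To see the problems for ourselves, a few quick looks at the raw column names are enough. The sketch below assumes the frame loaded above and that the exported names still carry the raw UCI-style punctuation (dashes, parentheses, commas); the exact strings depend on how the R data was written out to CSV.

# Peek at the raw column names; typical offenders look like
# 'tBodyAcc-mean()-X' or 'angle(X,gravityMean)' in the raw UCI naming.
print(df.shape)                       # (rows, 563 columns)
print(list(df.columns[:10]))          # first few raw names
dupes = df.columns[df.columns.duplicated()]
print(len(dupes), "duplicated column names")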
We want to create an interpretable model rather than use Random Forests as a black box. So we will need to understand our variables and leverage our intuition about them.
To plan the data exploration, the documentation of the data set on the UCI website is very useful, and we study it in detail. The file feature_info.txt is especially important for understanding our variables; it is, in effect, the data dictionary that we have avoided listing here, and it also explains the terminology we use. So going through it in some detail is critical.
Do each of the above data cleanup activities on the data set.
The major value of this data set lies in the implicit lesson it teaches: this particular data set may seem a little technical, but it could just as easily come from biology, finance, the mechanics of fractures, sports analytics, or any other domain - a data scientist should be willing to get hands and mind dirty. The most successful data scientists are, and will be, the ones willing to be interdisciplinary.
That's the implicit lesson here.
Aside from understanding what each variable represents, we also want some technical background on how each one is measured.
So we use the Android Developer Reference to educate ourselves about each of the physical parameters that matter. In this way we extend our domain knowledge so that we understand the language of the data - we allow it to come alive and, figuratively, speak to us and reveal its secrets. The more we learn about the background context from which the data comes, the better, faster, and deeper our exploration of the data will be.
In this case we see that the variables have X, Y, Z suffixes, and the Android Developer Reference gives us the specific reference frame in which these are measured. They are vector components of acceleration and jerk; the angles are measured with respect to the direction of gravity, or more precisely the acceleration vector due to gravity. We use this information and combine it with some intuition about motion, velocity, acceleration, and so on.
So we dig into the variables and make some quick notes.
Before we go further, you'll need to open a file in the dataset directory for the HAR data set called feature_info.txt. This file describes each feature and its physical significance, and also describes features that are derived from the raw data by averaging, sampling, or some other operation that gives a numerical result.
We want to look at
a) all the variable names
b) physical quantities
and take some time to understand these.
Once we spend some time doing all that, we can extract some useful guidelines using physical understanding and common sense.
Figure 1. Using a histogram of Body Acceleration Magnitude to evaluate that variable as a predictor of static vs dynamic activities. This is an example of data exploration in support of our heuristic variable selection using domain knowledge.
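A sketch of the kind of plot behind Figure 1 is below. The activity labels and the column name tBodyAccMag-mean() are assumptions based on the raw UCI feature names and may be spelled differently in samsungdata.csv, so treat them as placeholders to check against the actual columns.

# Histogram of body acceleration magnitude for static vs dynamic activities.
import matplotlib.pyplot as plt

static_acts = ['sitting', 'standing', 'laying']   # assumed activity labels
mag_col = 'tBodyAccMag-mean()'                     # assumed column name

is_static = df['activity'].isin(static_acts)
plt.hist(df.loc[is_static, mag_col], bins=50, alpha=0.5, label='static')
plt.hist(df.loc[~is_static, mag_col], bins=50, alpha=0.5, label='dynamic')
plt.xlabel('Body Acceleration Magnitude (mean)')
plt.legend()
plt.show()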
In dropping the -X, -Y, -Z variables (Cartesian coordinates) we removed a large number of confounding variables, as their information is strongly correlated with the Magnitude + Angle (polar coordinates) variables. There may still be some confounding influences, but the remaining effects are hard to interpret.
Common sense also tells us that the -min, -max, and -mad variables are correlated with the mean and standard deviation variables, so we drop these confounders as well. The number of variables is now reduced to 37, as below.
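A pandas sketch of this kind of substring-based dropping is below. The suffix spellings are assumptions based on the raw names, and this filtering alone does not land exactly on the hand-picked set of 37 variables - the final selection still involves inspecting the surviving list.

# Drop the Cartesian -X/-Y/-Z components and the -min/-max/-mad statistics,
# keeping magnitudes, angles, means and standard deviations.
drop_tokens = ['-X', '-Y', '-Z', '-min', '-max', '-mad']
keep_cols = [c for c in df.columns
             if not any(tok in c for tok in drop_tokens)]
df_small = df[keep_cols]
print(len(keep_cols), "columns remain")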
The name mapping below is somewhat tedious, but we do it to keep the semantics intact, since we want a "white box" model. If we did not care about interpretability we could simply rename the remaining variables v1, v2, ..., v37 in a couple of lines of code, but we would lose much of the value we derived from domain knowledge. So we soldier on for just one last step and then we are into the happy land of analysis.
To be able to explore the data easily we rename variables and simplify them for readability as follows.
We drop "Body" and "Mag" wherever they occur, as these are common to all our remaining variables, and we map 'mean' to Mean and 'std' to SD:
tAccBodyMag-mean -> tAccMean
fAccBodyMag-std -> fAccSD
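One way to script this renaming is sketched below. The replacement rules mirror the two examples above, and the exact raw spellings (with or without trailing parentheses) are assumptions to be checked against the actual columns.

# Drop 'Body' and 'Mag', map 'mean' -> 'Mean' and 'std' -> 'SD'.
def simplify(name):
    for old, new in [('Body', ''), ('Mag', ''),
                     ('-mean', 'Mean'), ('-std', 'SD')]:
        name = name.replace(old, new)
    return name

df_small = df_small.rename(columns=simplify)
# e.g. 'tAccBodyMag-mean' -> 'tAccMean', 'fAccBodyMag-std' -> 'fAccSD'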
The reduced set of selected variables with transformed names is now (with meaningful groupings):
Now after all these data cleanup calisthenics we raise our weary heads and notice something pleasantly surprising and positively encouraging.
These variables are primarily magnitudes of acceleration and jerk with their statistics, along with angle variables. This encourages us to think that our approach of focusing on domain knowledge, doing some extra reading and research and using some elementary physical intuition seems to be bearing fruit.
This is a set of variables that is semantically compact, interpretable and relatively easy to reason about.
We could do another round of winnowing down the variables - we might have a feeling that 37 variables is too many to hold in our mind at one time, and we would be right. But at this point we bring in the heavy artillery and let the modeling software do the work, using Random Forests on this variable set.
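A minimal sketch of that modeling step is below, assuming the cleaned variables live in a frame df_small alongside the 'activity' column. The train/test split and hyperparameters are illustrative defaults, not the settings behind the results discussed here.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed: df_small holds the renamed predictors plus the 'activity' label.
X = df_small.drop(columns=['activity'])
y = df_small['activity']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("OOB score:", rf.oob_score_)
print("Test accuracy:", rf.score(X_test, y_test))

# Let the forest rank our hand-picked variables by importance.
for imp, name in sorted(zip(rf.feature_importances_, X.columns), reverse=True)[:10]:
    print(name, round(imp, 3))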
Human Activity Recognition Using Smartphones http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
 Android Developer Reference http://developer.android.com/reference/android/hardware/Sensor.html
 Random Forests http://en.wikipedia.org/wiki/Random_forest
 Code for computation of error measures https://gist.github.com/nborwankar/5131870