INTRODUCTION TO PYTHON FOR DATA MINING

Python is a great language for data mining. It has a lot of great libraries for exploring, modeling, and visualizing data. To get started, I recommend downloading the Anaconda distribution. It comes with most of the libraries you will need and provides an IDE and a package manager.

I do most of my work from the command line, but Anaconda comes with a launcher app that can be found in the ~/anaconda directory. To get the launcher to work on a Mac, do the following:

Open your terminal (press Command-Space and then type terminal)
Type conda install -f launcher
After that runs, type conda install -f node-webkit

Now you can open the launcher and see:

glueviz - This lets you link multiple plots across files
IPython Notebook - A great way to display and work on your data mining projects
IPython qtconsole - Basically an IPython terminal for coding
Spyder - An IDE for scientific Python

IPython vs Python

IPython is an enhanced interactive shell for Python: you can type some code, get some results, and then type some more code. This is very useful for exploring data, because you don't always know what you are looking for and it is annoying to rerun your entire program every time you make a change.

Libraries You Should Know About

Pandas - Provides R-like data structures and a high-level API for working with data
NumPy - Provides fast numerical computing, such as arrays and linear algebra
SciPy - For scientific computing, such as drawing from statistical distributions
Matplotlib - For plotting
Seaborn - Builds on Matplotlib to make your plots look better
Scikit-Learn - For machine learning; great documentation and tutorials
Statsmodels - For more traditional statistics
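
To give a feel for how two of these libraries fit together, here is a minimal sketch that draws samples from a distribution with SciPy and summarizes them with NumPy. The distribution parameters and the seed are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy import stats

# SciPy: draw 1,000 samples from a normal distribution.
# loc (mean) and scale (standard deviation) are arbitrary illustration values.
samples = stats.norm.rvs(loc=20, scale=5, size=1000, random_state=0)

# NumPy: the result is a plain array, so vectorized statistics apply directly.
print(samples.shape)
print(round(samples.mean(), 2), round(samples.std(), 2))
```

The sample mean and standard deviation should land close to the loc and scale passed in, which is a quick sanity check when you start working with a new distribution.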

Getting Seaborn

In the terminal type pip install seaborn

An Example

Read in Data

I will use pandas to read in some data from the web and quickly remove the NA rows.

In [2]:

# __future__ imports must come first; this one only matters on Python 2
from __future__ import division

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import seaborn as sns
from sklearn import datasets, linear_model
# train_test_split moved from sklearn.cross_validation to sklearn.model_selection
from sklearn.model_selection import train_test_split

sns.set(style='ticks', palette='Set2')
%matplotlib inline

# The file has no header row and its columns are separated by runs of whitespace
data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original",
                   sep=r'\s+', header=None,
                   names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
                          'model', 'origin', 'car_name'])
print(data.shape)
data = data.dropna()
data.head()

(406, 9)

Out[2]:

	mpg	cylinders	displacement	horsepower	weight	acceleration	model	origin	car_name
0	18	8	307	130	3504	12.0	70	1	chevrolet chevelle malibu
1	15	8	350	165	3693	11.5	70	1	buick skylark 320
2	18	8	318	150	3436	11.0	70	1	plymouth satellite
3	16	8	304	150	3433	12.0	70	1	amc rebel sst
4	17	8	302	140	3449	10.5	70	1	ford torino
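
To see what dropna does in isolation, here is a minimal sketch on a made-up frame. The toy values are assumptions for illustration and are not taken from the auto-mpg data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (made up for illustration).
toy = pd.DataFrame({
    'mpg': [18.0, np.nan, 16.0, 17.0],
    'horsepower': [130.0, 165.0, np.nan, 140.0],
})

# Count the missing values per column before dropping anything.
print(toy.isnull().sum())

# dropna() returns a new frame without any row that contains a NaN.
clean = toy.dropna()
print(clean.shape)  # (2, 2)
```

Checking the per-column counts first is worthwhile, because dropna removes a whole row even when only one of its columns is missing.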

Scikit-Learn

Here is a quick intro to modeling with scikit-learn. Basically, you split the data into training and test sets, choose a model, fit it on the training data, and predict on the test data. Scikit-learn has great documentation, so check out their page.

In [3]:

indep_vars = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']
dep_vars = ['mpg']
indep_data = data[indep_vars]
dep_data = data[dep_vars]
indep_train, indep_test, dep_train, dep_test = train_test_split(indep_data, dep_data, test_size=0.33, random_state=42)

In [4]:

regr = linear_model.LinearRegression()
regr.fit(indep_train, dep_train)
# list() is needed on Python 3, where zip returns an iterator
print('Coefficients: {0}'.format(list(zip(indep_vars, np.squeeze(regr.coef_)))))

Coefficients: [('cylinders', -0.26226222120889731), ('displacement', -0.0019562996653683384), ('horsepower', -0.059849640079272119), ('weight', -0.0049401742792735473), ('acceleration', 0.0040334885752014889)]

In [5]:

regr_predict = regr.predict(indep_test)
# np.mean of the squared residuals is the mean squared error, not the residual sum of squares
print("Mean squared error: %.2f"
      % np.mean((regr_predict - dep_test.values) ** 2))

Mean squared error: 19.71
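
The same kind of evaluation can also be done with the helpers in sklearn.metrics. This is a minimal sketch on synthetic data (the true line y = 3x + 2 and the noise level are assumptions for illustration), importing train_test_split from sklearn.model_selection, which is where it lives in current scikit-learn releases.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration): y = 3x + 2 plus Gaussian noise.
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(scale=1.0, size=200)

# Same split / fit / predict workflow as in the cells above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Mean squared error, its square root (in the units of y), and R^2.
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, pred)
print("MSE: %.2f  RMSE: %.2f  R^2: %.3f" % (mse, rmse, r2))
```

The RMSE is often easier to interpret than the MSE, because it is expressed in the same units as the dependent variable.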

Tables

Here are some examples of how to use pandas to create summary statistics and tables.

In [6]:

data.groupby(['cylinders']).mpg.describe()

Out[6]:

cylinders       
3          count      4.000000
           mean      20.550000
           std        2.564501
           min       18.000000
           25%       18.750000
           50%       20.250000
           75%       22.050000
           max       23.700000
4          count    199.000000
           mean      29.283920
           std        5.670546
           min       18.000000
           25%       25.000000
           50%       28.400000
           75%       32.950000
           max       46.600000
5          count      3.000000
           mean      27.366667
           std        8.228204
           min       20.300000
           25%       22.850000
           50%       25.400000
           75%       30.900000
           max       36.400000
6          count     83.000000
           mean      19.973494
           std        3.828809
           min       15.000000
           25%       18.000000
           50%       19.000000
           75%       21.000000
           max       38.000000
8          count    103.000000
           mean      14.963107
           std        2.836284
           min        9.000000
           25%       13.000000
           50%       14.000000
           75%       16.000000
           max       26.600000
dtype: float64
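
When you only need a few statistics rather than everything describe reports, agg lets you name them explicitly. A minimal sketch on a made-up frame (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data (made up for illustration).
toy = pd.DataFrame({
    'cylinders': [4, 4, 6, 6, 8],
    'mpg': [30.0, 28.0, 20.0, 18.0, 14.0],
})

# One row per group, one column per requested statistic.
summary = toy.groupby('cylinders')['mpg'].agg(['count', 'mean', 'max'])
print(summary)
```

The result is a regular DataFrame indexed by cylinders, so it can be sorted, filtered, or plotted like any other table.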

In [7]:

pivot_table = data.pivot_table(index='cylinders', columns='acceleration', values='mpg', aggfunc='mean')
pivot_table.head()

Out[7]:

acceleration	8.0	8.5	9.0	9.5	10.0	10.5	11.0	11.1	11.2	11.3	...	21.5	21.7	21.8	21.9	22.1	22.2	23.5	23.7	24.6	24.8
cylinders
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	43.1	44.3	30	19	24.5	29.0	23	43.4	44	27.2
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	28.8	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	14	14.5	14	15.5	14.5	17	13.285714	16	18.1	NaN	...	NaN	NaN	NaN	NaN	NaN	23.9	NaN	NaN	NaN	NaN

5 rows × 95 columns
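
Pivoting on a continuous column such as acceleration gives one column per distinct value, so the table is wide and mostly NaN. One possible way around this is to bin the column first with pd.cut. This is a minimal sketch on made-up data; the bin edges and labels are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical toy data (made up for illustration).
toy = pd.DataFrame({
    'cylinders':    [4, 4, 4, 8, 8, 8],
    'acceleration': [15.0, 19.5, 21.0, 9.0, 11.5, 12.0],
    'mpg':          [30.0, 32.0, 40.0, 13.0, 14.0, 16.0],
})

# Bin the continuous column into three ranges: (0, 12], (12, 18], (18, 25].
# The edges and labels are arbitrary; astype(str) keeps the column a plain string.
toy['accel_bin'] = pd.cut(toy['acceleration'], bins=[0, 12, 18, 25],
                          labels=['low', 'mid', 'high']).astype(str)

# Pivot on the binned column instead of the raw values.
binned = toy.pivot_table(index='cylinders', columns='accel_bin',
                         values='mpg', aggfunc='mean')
print(binned)
```

Cells stay NaN where a cylinder count has no cars in a bin, but the table now has a handful of columns instead of one per distinct acceleration value.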

Plotting

Here are some examples of useful plots made with matplotlib and seaborn. The despine function removes the top and right axis spines, which are chart junk.

In [8]:

plt.hist(data.mpg)   # 10 equal-width bins by default
plt.title("MPG")
sns.despine()        # drop the top and right spines
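
If you want the numbers behind a histogram rather than the picture, np.histogram computes the bin counts and edges without drawing anything. A minimal sketch on made-up values (assumptions for illustration), using 10 bins to match the plt.hist default:

```python
import numpy as np

# Hypothetical values (made up for illustration).
values = np.array([9, 13, 14, 14, 16, 18, 20, 25, 28, 29, 33, 46], dtype=float)

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(values, bins=10)
print(counts)
print(edges)
```

There is always one more edge than there are bins, and the first and last edges are the minimum and maximum of the data.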

In [9]:

sns.lmplot(x="mpg", y="weight", data=data);

In [10]:

sns.lmplot(x="mpg", y="weight", data=data, order=2);
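
Setting order=2 asks for a second-degree polynomial fit instead of a straight line. To see what such a fit computes, here is a minimal sketch with np.polyfit on synthetic data; the quadratic and its coefficients are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (assumption for illustration): an exact quadratic, no noise.
x = np.linspace(0, 10, 50)
y = 0.5 * x ** 2 - 3 * x + 40

# Least-squares fit of a degree-2 polynomial; coefficients come highest power first.
coeffs = np.polyfit(x, y, deg=2)

# Evaluate the fitted polynomial at the original x values.
fitted = np.polyval(coeffs, x)
print(coeffs)
```

With noise-free data the fit recovers the coefficients almost exactly. On real data such as mpg against weight it gives the best-fitting curve in the least-squares sense.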

In [11]:

sns.jointplot(x="mpg", y="weight", data=data, kind="reg")

Out[11]:

<seaborn.axisgrid.JointGrid at 0x10f72c610>

In [12]:

sns.boxplot(data=data[['displacement', 'horsepower']])

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x10fc9a4d0>

In [13]:

sns.violinplot(data=data[['displacement', 'horsepower']])

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x10fdebfd0>

In [14]:

g = sns.FacetGrid(data, col="cylinders")
g.map(plt.hist, "mpg");