Lecturer: Ashish Mahabal
Jupyter Notebook Authors: Ashish Mahabal & Yuhan Yao
This is a Jupyter notebook lesson taken from the GROWTH Summer School 2019. For other lessons and their accompanying lectures, please see: http://growth.caltech.edu/growth-school-2019.html
Classify different classes of variable sources using (a) decision trees and (b) random forests
See the GROWTH school webpage for detailed instructions on how to install these modules and packages. Nominally, you should be able to install the python modules with `pip install <module>`. The external astromatic packages are easiest installed using package managers (e.g., `rpm`, `apt-get`).
Credit: Pavlos Protopapas (LSSDS notebook)
# For inline plots
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import io
import pydotplus
from IPython.display import Image
# Various scikit-learn modules
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
datadir = 'data'
# for decision tree
catalog = datadir + '/CatalinaVars.tbl.gz'
lightcurves = datadir + '/CRTS_6varclasses.csv.gz'
# for random forest
featuresfile = datadir + '/cvs_and_blazars.dat'
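Before reading anything, it can help to confirm that the data files are actually in place (a minimal sanity-check sketch, not part of the original notebook):

import os

# Quick check that the expected data files exist under datadir
for f in (catalog, lightcurves, featuresfile):
    print(f, 'found' if os.path.exists(f) else 'MISSING')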
lcs = pd.read_csv(lightcurves,
                  compression='gzip',
                  header=1,
                  sep=',',
                  skipinitialspace=True,
                  nrows=100000)
lcs.columns = ['ID', 'MJD', 'Mag', 'magerr', 'RA', 'Dec']
lcs.head()
|   | ID | MJD | Mag | magerr | RA | Dec |
|---|---|---|---|---|---|---|
| 0 | 1109065026725 | 53705.501925 | 16.943797 | 0.082004 | 182.25871 | 9.76580 |
| 1 | 1109065026725 | 53731.483314 | 16.645102 | 0.075203 | 182.25867 | 9.76585 |
| 2 | 1109065026725 | 53731.491406 | 16.693791 | 0.076497 | 182.25870 | 9.76574 |
| 3 | 1109065026725 | 53731.499465 | 16.793651 | 0.078755 | 182.25869 | 9.76576 |
| 4 | 1109065026725 | 53731.507529 | 16.767817 | 0.077436 | 182.25878 | 9.76581 |
len(lcs.groupby('ID'))
301
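Each of these IDs has a full light curve. As a quick look at the data (a minimal sketch, not part of the original notebook), per-object summary statistics can be pulled out with a groupby:

# Hypothetical per-object summary of the light curves loaded above
lc_summary = lcs.groupby('ID')['Mag'].agg(['size', 'mean', lambda m: m.max() - m.min()])
lc_summary.columns = ['n_obs', 'mean_mag', 'amplitude']
lc_summary.head()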
cat = pd.read_csv(catalog,
                  compression='gzip',
                  header=5,
                  sep=' ',
                  skipinitialspace=True,
                  )
columns = cat.columns[1:]
cat = cat[cat.columns[:-1]]
cat.columns = columns
cat.head()
|   | Catalina_Surveys_ID | Numerical_ID | RA_J2000 | Dec | V_mag | Period_days | Amplitude | Number_Obs | Var_Type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CSS_J000020.4+103118 | 1109001041232 | 00:00:20.41 | +10:31:18.9 | 14.62 | 1.491758 | 2.39 | 223 | 2 |
| 1 | CSS_J000031.5-084652 | 1009001044997 | 00:00:31.50 | -08:46:52.3 | 14.14 | 0.404185 | 0.12 | 163 | 1 |
| 2 | CSS_J000036.9+412805 | 1140001063366 | 00:00:36.94 | +41:28:05.7 | 17.39 | 0.274627 | 0.73 | 158 | 1 |
| 3 | CSS_J000037.5+390308 | 1138001069849 | 00:00:37.55 | +39:03:08.1 | 17.74 | 0.30691 | 0.23 | 219 | 1 |
| 4 | CSS_J000103.3+105724 | 1109001050739 | 00:01:03.37 | +10:57:24.4 | 15.25 | 1.5837582 | 0.11 | 223 | 8 |
vars6 = cat[ cat['Var_Type'].isin([2,4,5,6,8,13]) & (cat['Number_Obs']>100) ]
vars6.head()
|    | Catalina_Surveys_ID | Numerical_ID | RA_J2000 | Dec | V_mag | Period_days | Amplitude | Number_Obs | Var_Type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CSS_J000020.4+103118 | 1109001041232 | 00:00:20.41 | +10:31:18.9 | 14.62 | 1.491758 | 2.39 | 223 | 2 |
| 4 | CSS_J000103.3+105724 | 1109001050739 | 00:01:03.37 | +10:57:24.4 | 15.25 | 1.5837582 | 0.11 | 223 | 8 |
| 8 | CSS_J000131.5+324913 | 1132001052010 | 00:01:31.54 | +32:49:13.1 | 14.71 | 13.049549 | 0.17 | 188 | 8 |
| 16 | CSS_J000216.1-165109 | 1015001002091 | 00:02:16.16 | -16:51:09.7 | 16.07 | 0.30487 | 0.17 | 124 | 5 |
| 23 | CSS_J000309.5+193816 | 1118001060639 | 00:03:09.56 | +19:38:16.6 | 17.82 | 1.12582 | 0.59 | 206 | 2 |
print(vars6.shape)
plt.hist(vars6.Var_Type)
(13876, 9)
(array([4614., 1683., 5148., 491., 0., 1474., 0., 0., 0., 466.]), array([ 2. , 3.1, 4.2, 5.3, 6.4, 7.5, 8.6, 9.7, 10.8, 11.9, 13. ]), <a list of 10 Patch objects>)
plt.plot(pd.to_numeric(vars6.Period_days).values, pd.to_numeric(vars6.V_mag).values, 'r.')
[<matplotlib.lines.Line2D at 0x12099da90>]
vars2 = vars6[vars6['Var_Type'].isin([6, 13])].copy()
vars2.head()
|     | Catalina_Surveys_ID | Numerical_ID | RA_J2000 | Dec | V_mag | Period_days | Amplitude | Number_Obs | Var_Type |
|---|---|---|---|---|---|---|---|---|---|
| 115 | CSS_J001420.8+031214 | 1104002007409 | 00:14:20.84 | +03:12:14.0 | 17.45 | 0.3871100 | 0.56 | 174 | 6 |
| 174 | CSS_J001616.0-173612 | 1018002041429 | 00:16:16.00 | -17:36:12.4 | 13.87 | 215.564 | 1.17 | 111 | 13 |
| 198 | CSS_J001724.9+200542 | 1121002007726 | 00:17:24.90 | +20:05:42.2 | 16.64 | 0.3571291 | 0.39 | 224 | 6 |
| 214 | CSS_J001812.9+210201 | 1121002027610 | 00:18:12.97 | +21:02:01.5 | 14.54 | 0.41616 | 0.34 | 224 | 6 |
| 328 | CSS_J002230.8-183246 | 1018002024540 | 00:22:30.83 | -18:32:46.1 | 11.72 | 201.28718 | 1.26 | 111 | 13 |
Y = vars2['Var_Type'].values
Y = np.array([1 if y == 6 else 0 for y in Y])
X = vars2.drop(columns='Var_Type').values
vars2['target'] = (vars2['Var_Type'].values==6)*1
vars2.head()
|     | Catalina_Surveys_ID | Numerical_ID | RA_J2000 | Dec | V_mag | Period_days | Amplitude | Number_Obs | Var_Type | target |
|---|---|---|---|---|---|---|---|---|---|---|
| 115 | CSS_J001420.8+031214 | 1104002007409 | 00:14:20.84 | +03:12:14.0 | 17.45 | 0.3871100 | 0.56 | 174 | 6 | 1 |
| 174 | CSS_J001616.0-173612 | 1018002041429 | 00:16:16.00 | -17:36:12.4 | 13.87 | 215.564 | 1.17 | 111 | 13 | 0 |
| 198 | CSS_J001724.9+200542 | 1121002007726 | 00:17:24.90 | +20:05:42.2 | 16.64 | 0.3571291 | 0.39 | 224 | 6 | 1 |
| 214 | CSS_J001812.9+210201 | 1121002027610 | 00:18:12.97 | +21:02:01.5 | 14.54 | 0.41616 | 0.34 | 224 | 6 | 1 |
| 328 | CSS_J002230.8-183246 | 1018002024540 | 00:22:30.83 | -18:32:46.1 | 11.72 | 201.28718 | 1.26 | 111 | 13 | 0 |
Y[:10]
array([1, 0, 1, 1, 0, 0, 1, 1, 0, 0])
X[:,:3]
array([['CSS_J001420.8+031214', 1104002007409, '00:14:20.84'], ['CSS_J001616.0-173612', 1018002041429, '00:16:16.00'], ['CSS_J001724.9+200542', 1121002007726, '00:17:24.90'], ..., ['CSS_J235422.2+072256', 1107127033122, '23:54:22.25'], ['CSS_J235702.7+395916', 1140099015488, '23:57:02.79'], ['CSS_J235946.8+371110', 1138103015911, '23:59:46.82']], dtype=object)
# Create test/train mask
itrain, itest = train_test_split(range(vars2.shape[0]), train_size=0.6)
mask=np.ones(vars2.shape[0], dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)
mask[:10]
array([ True, True, True, True, True, False, True, True, False, False])
print("% Class 6 objects in Training:", np.mean(vars2.target[mask]), np.std((vars2.target[mask])))
print("% Class 13 objects in Testing:", np.mean(vars2.target[~mask]), np.std((vars2.target[~mask])))
% Class 6 objects in Training: 0.5087108013937283 0.499924116180725 % Class 13 objects in Testing: 0.5195822454308094 0.4996163885061093
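An alternative to building a boolean mask by hand (a minimal sketch, not part of the original notebook) is to let scikit-learn split and stratify on the label directly, which keeps the class fractions nearly identical in the training and test sets:

# Hypothetical stratified split; X2 and y_target are stand-ins for the
# two-feature matrix and the binary target built above
X2 = vars2[['V_mag', 'Period_days']].values
y_target = vars2['target'].values
Xtr, Xte, ytr, yte = train_test_split(X2, y_target, train_size=0.6,
                                      stratify=y_target, random_state=42)
print(ytr.mean(), yte.mean())  # class-6 fraction in train vs. test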
One builds a decision tree using one or more predictors:
Here we want to classify variable astronomical sources (e.g. EA, RRab, RRc, RRd, RS CVn, LPV etc.) using a few features we have access to, viz. V_mag, Period_days, Amplitude, and Number_Obs (note that not all of these may be good or useful for our purpose).
Generically, let's call them $X_{i1}, X_{i2}, \ldots, X_{ip}$ ($i$ for the source, $p$ for the number of predictors). We also have an observed label $Y_i$ for each source.
We first assign everyone to the same class, say $\hat{Y}_i = 1$. We can calculate the squared error $Err = \sum_i (\hat{Y}_i - Y_i)^2$. At each step of the algorithm we consider a list of possible decisions (or splits), for example $X_{i2} > 35$, i.e. the period is greater than 35. For each possible decision we recalculate the predictor for that rule, for example $\hat{Y}_i = 6$ if $X_{i2} > 35$ and $\hat{Y}_i = 13$ if $X_{i2} \le 35$. We recalculate the error for each possible decision: $Err = \sum_i (\hat{Y}_i - Y_i)^2$. We choose the decision that reduces the error by the largest amount, then keep going (if there are multiple predictors). The $Err$ term is our impurity/loss function.
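As a toy illustration of this greedy search (a minimal sketch under the assumptions above, not part of the original notebook), the following loops over candidate thresholds on a single predictor and picks the split that minimizes the squared error:

# Hypothetical single-predictor split search minimizing squared error
def best_split(x, y):
    """Return (threshold, error) for the best single split on x."""
    best_thr, best_err = None, np.inf
    for thr in np.unique(x):
        left, right = y[x <= thr], y[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        # predict the mean label on each side, then accumulate squared error
        err = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr, best_err

# e.g. best_split(pd.to_numeric(vars2['Period_days']).values, vars2['target'].values)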
def display_dt(dt):
    dummy_io = io.StringIO()
    tree.export_graphviz(dt, out_file=dummy_io, proportion=True)
    print(dummy_io.getvalue())

# This function creates images of tree models using pydotplus
# https://github.com/JWarmenhoven/ISLR-python
def print_tree(estimator, features, class_names=None, filled=True):
    dot_data = io.StringIO()
    export_graphviz(estimator, out_file=dot_data, feature_names=features,
                    proportion=True, class_names=class_names, filled=filled)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return graph
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #
# Important parameters
# indf - Input dataframe
# featurenames - vector of names of predictors
# targetname - name of column you want to predict (e.g. 0 or 1, 'M' or 'F',
# 'yes' or 'no')
# target1val - particular value you want to have as a 1 in the target
# mask - boolean vector indicating test set (~mask is training set)
# reuse_split - dictionary that contains training and testing dataframes
# (we'll use this to test different classifiers on the same
# test-train splits)
# score_func - we've used the accuracy as a way of scoring algorithms but
# this can be more general later on
# n_folds - Number of folds for cross validation
# n_jobs - used for parallelization
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #
def do_classify(clf, parameters, indf, featurenames, targetname, target1val,
                mask=None, reuse_split=None, score_func=None, n_folds=5, n_jobs=1):
    subdf = indf[featurenames]
    X = subdf.values
    y = (indf[targetname].values == target1val) * 1
    if mask is not None:
        print("using mask")
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split is not None:
        print("using reuse split")
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs,
                          n_folds=n_folds, score_func=score_func)
    clf = clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print("############# based on standard predict ################")
    print("Accuracy on training data: %0.2f" % (training_accuracy))
    print("Accuracy on test data: %0.2f" % (test_accuracy))
    print(confusion_matrix(ytest, clf.predict(Xtest)))
    print("########################################################")
    return clf, Xtrain, ytrain, Xtest, ytest
Let's first fit on two covariates to help us visualize what's going on. Have a look at the options on the help page: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html. We'll be optimizing over two options here: max_depth (the maximum depth of the tree) and min_samples_leaf (the minimum number of samples required to be at a leaf node).
Assuming the root node is $S$, and $S'$ ranges over the leaf nodes such that the union of the $S'$ is $S$, the Gini impurity measures $L(S') = |S'|\,\big(1 - p_{S'}^2 - (1 - p_{S'})^2\big)$, where $p_{S'}$ is the fraction of examples in $S'$ that are positive.
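To make the formula concrete (a minimal sketch, not part of the original notebook), this helper evaluates the size-weighted Gini impurity of a candidate split on one feature:

# Hypothetical Gini evaluation for a binary split at a given threshold
def gini_of_split(x, y, thr):
    total = 0.0
    for leaf in (y[x <= thr], y[x > thr]):
        if len(leaf) == 0:
            continue
        p = leaf.mean()  # fraction of positives in this leaf
        total += len(leaf) * (1 - p**2 - (1 - p)**2)
    return total

# e.g. gini_of_split(pd.to_numeric(vars2['Period_days']).values,
#                    vars2['target'].values, 35.0)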
clfTree1 = tree.DecisionTreeClassifier(max_depth=3, criterion='gini')
subdf=vars2[['V_mag', 'Period_days']]
X=subdf.values
y=(vars2['target'].values==1)*1
# TRAINING AND TESTING
Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
# FIT THE TREE
clf=clfTree1.fit(Xtrain, ytrain)
training_accuracy = clf.score(Xtrain, ytrain)
test_accuracy = clf.score(Xtest, ytest)
print("############# based on standard predict ################")
print("Accuracy on training data: %0.2f" % (training_accuracy))
print("Accuracy on test data: %0.2f" % (test_accuracy))
print(confusion_matrix(ytest, clf.predict(Xtest)))
print("########################################################")
############# based on standard predict ################
Accuracy on training data: 1.00
Accuracy on test data: 1.00
[[184   0]
 [  0 199]]
########################################################
len(Xtrain), len(Xtest), len(ytrain), len(ytest)
(574, 383, 574, 383)
display_dt(clf)
digraph Tree {
node [shape=box] ;
0 [label="X[1] <= 35.422\ngini = 0.5\nsamples = 100.0%\nvalue = [0.491, 0.509]"] ;
1 [label="gini = 0.0\nsamples = 50.9%\nvalue = [0.0, 1.0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="gini = 0.0\nsamples = 49.1%\nvalue = [1.0, 0.0]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}
graph3 = print_tree(clf, features=['V_mag', 'Period_days'], class_names=['No', 'Yes'])
Image(graph3.create_png())
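Since this tree uses only two covariates, its decision regions can also be drawn directly (a minimal sketch, not part of the original notebook; the grid ranges below are arbitrary assumptions):

# Hypothetical decision-region plot for the two-feature tree fit above
xx, yy = np.meshgrid(np.linspace(11, 20, 200),      # assumed V_mag range
                     np.linspace(0.1, 1000, 200))   # assumed Period_days range
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(Xtrain[:, 0].astype(float), Xtrain[:, 1].astype(float), c=ytrain, s=8)
plt.xlabel('V_mag'); plt.ylabel('Period_days'); plt.yscale('log')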
def dtclassify(allclasses, class1, class2, var1, var2):
    vars2 = allclasses[allclasses['Var_Type'].isin([class1, class2])].copy()
    vars2['target'] = (vars2['Var_Type'].values == class1) * 1

    # Create test/train mask
    itrain, itest = train_test_split(range(vars2.shape[0]), train_size=0.6)
    mask = np.ones(vars2.shape[0], dtype='int')
    mask[itrain] = 1
    mask[itest] = 0
    mask = (mask == 1)
    print("% Class ", class1, " objects in Training:", np.mean(vars2.target[mask]), np.std(vars2.target[mask]))
    print("% Class ", class1, " objects in Testing:", np.mean(vars2.target[~mask]), np.std(vars2.target[~mask]))

    clfTree1 = tree.DecisionTreeClassifier(max_depth=3, criterion='gini')
    subdf = vars2[[var1, var2]]
    X = subdf.values
    y = (vars2['target'].values == 1) * 1

    # TRAINING AND TESTING
    Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]

    # FIT THE TREE
    clf = clfTree1.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print("############# based on standard predict ################")
    print("Accuracy on training data: %0.2f" % (training_accuracy))
    print("Accuracy on test data: %0.2f" % (test_accuracy))
    print(confusion_matrix(ytest, clf.predict(Xtest)))
    print("########################################################")
    display_dt(clf)
    return [clf, var1, var2]
    # graph3 = print_tree(clf, features=[var1, var2], class_names=['No', 'Yes'])
    # Image(graph3.create_png())
# A generic function to do CV
def cv_optimize(clf, parameters, X, y, n_jobs=1, n_folds=5, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, n_jobs=n_jobs, cv=n_folds)
    gs.fit(X, y)
    best = gs.best_estimator_
    return best
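With cv_optimize in place, the do_classify helper defined earlier can cross-validate over the two options mentioned above (a minimal usage sketch, not part of the original notebook; the grid values are arbitrary assumptions):

# Hypothetical call: grid-search a small tree over max_depth and min_samples_leaf
parameters = {"max_depth": [1, 2, 3, 4, 5],
              "min_samples_leaf": [1, 2, 4, 8]}
clfTree_cv, Xtr, ytr, Xte, yte = do_classify(
    tree.DecisionTreeClassifier(criterion='gini'), parameters,
    vars2, ['V_mag', 'Period_days'], 'target', 1, mask=mask)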
dtclassify(vars6,6,13,'V_mag','Period_days')
% Class 6 objects in Training: 0.5209059233449478 0.4995627511825669
% Class 6 objects in Testing: 0.5013054830287206 0.4999982957111571
############# based on standard predict ################
Accuracy on training data: 1.00
Accuracy on test data: 1.00
[[191   0]
 [  0 192]]
########################################################
digraph Tree {
node [shape=box] ;
0 [label="X[1] <= 35.33\ngini = 0.499\nsamples = 100.0%\nvalue = [0.479, 0.521]"] ;
1 [label="gini = 0.0\nsamples = 52.1%\nvalue = [0.0, 1.0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="gini = 0.0\nsamples = 47.9%\nvalue = [1.0, 0.0]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best'), 'V_mag', 'Period_days']
graph3 = print_tree(clf, features=['V_mag', 'Period_days'], class_names=['No', 'Yes'])
Image(graph3.create_png())
[clf,var1,var2] = dtclassify(vars6,8,13,'V_mag','Amplitude')
% Class 8 objects in Training: 0.7525773195876289 0.4315144234320897
% Class 8 objects in Testing: 0.770618556701031 0.42043500897172076
############# based on standard predict ################
Accuracy on training data: 0.97
Accuracy on test data: 0.96
[[160  18]
 [ 11 587]]
########################################################
digraph Tree {
node [shape=box] ;
0 [label="X[1] <= 0.57\ngini = 0.372\nsamples = 100.0%\nvalue = [0.247, 0.753]"] ;
1 [label="X[1] <= 0.415\ngini = 0.078\nsamples = 76.5%\nvalue = [0.04, 0.96]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="X[0] <= 12.255\ngini = 0.031\nsamples = 71.2%\nvalue = [0.016, 0.984]"] ;
1 -> 2 ;
3 [label="gini = 0.5\nsamples = 0.2%\nvalue = [0.5, 0.5]"] ;
2 -> 3 ;
4 [label="gini = 0.029\nsamples = 71.0%\nvalue = [0.015, 0.985]"] ;
2 -> 4 ;
5 [label="X[0] <= 13.105\ngini = 0.47\nsamples = 5.2%\nvalue = [0.377, 0.623]"] ;
1 -> 5 ;
6 [label="gini = 0.133\nsamples = 1.2%\nvalue = [0.929, 0.071]"] ;
5 -> 6 ;
7 [label="gini = 0.335\nsamples = 4.0%\nvalue = [0.213, 0.787]"] ;
5 -> 7 ;
8 [label="X[0] <= 16.53\ngini = 0.148\nsamples = 23.5%\nvalue = [0.92, 0.08]"] ;
0 -> 8 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
9 [label="X[1] <= 0.665\ngini = 0.078\nsamples = 21.2%\nvalue = [0.96, 0.04]"] ;
8 -> 9 ;
10 [label="gini = 0.434\nsamples = 1.9%\nvalue = [0.682, 0.318]"] ;
9 -> 10 ;
11 [label="gini = 0.026\nsamples = 19.3%\nvalue = [0.987, 0.013]"] ;
9 -> 11 ;
12 [label="X[1] <= 0.85\ngini = 0.494\nsamples = 2.3%\nvalue = [0.556, 0.444]"] ;
8 -> 12 ;
13 [label="gini = 0.408\nsamples = 1.2%\nvalue = [0.286, 0.714]"] ;
12 -> 13 ;
14 [label="gini = 0.26\nsamples = 1.1%\nvalue = [0.846, 0.154]"] ;
12 -> 14 ;
}
graph3 = print_tree(clf, features=[var1, var2], class_names=['No', 'Yes'])
Image(graph3.create_png())