In this Case Study, we'll look at how a Decision Tree classifier deals with different shapes of data and the kinds of decision regions it forms in 2-D.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
Let's look at the angle data (angle_train.csv). It is linearly separable, meaning the two classes can be separated with just a line!
angle_train = pd.read_csv("angle_train.csv")
test_data = pd.read_csv("test_data.csv")
# plot the train data.
angle_train.plot.scatter(x="feat1",y="feat2",c="label", cmap="viridis", colorbar=False, figsize=(12,8));
# train the classifier
mx_depth = 1
clf = DecisionTreeClassifier(random_state=1, max_depth=mx_depth)
clf.fit(angle_train[["feat1", "feat2"]], angle_train["label"])
# predict the label on test data
test_data["pred"] = clf.predict(test_data[["feat1", "feat2"]])
# plot the training and test data in one figure
# test points are drawn semi-transparent (alpha=0.4)
# the darker, fully opaque points are the training points
ax1 = angle_train.plot.scatter(x="feat1",y="feat2",c="label", cmap="viridis", colorbar=False, figsize=(12,8));
test_data.plot.scatter(x="feat1",y="feat2",c="pred", cmap="viridis", colorbar=False, figsize=(12,8), ax=ax1, alpha=0.4);
print("Using max_depth =",mx_depth)
Using max_depth = 1
test_data contains all the points of the grid. What are some of the decision rules learned by this model?
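One way to answer this is to print the fitted tree's rules directly; sklearn.tree.export_text does exactly that (a quick check, reusing the clf fitted above):

from sklearn.tree import export_text
# print the split thresholds the depth-1 tree learned
print(export_text(clf, feature_names=["feat1", "feat2"]))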
Now let's look at how a Decision Tree classifier performs when the data is not linearly separable.
circles_train = pd.read_csv("circles_train.csv")
test_data = pd.read_csv("test_data.csv")
circles_train.plot.scatter(x="feat1",y="feat2",c="label", cmap="viridis", colorbar=False, figsize=(12,8));
mx_depth = 2
clf = DecisionTreeClassifier(random_state=1, max_depth=mx_depth)
clf.fit(circles_train[["feat1", "feat2"]], circles_train["label"])
test_data["pred"] = clf.predict(test_data[["feat1", "feat2"]])
ax1 = circles_train.plot.scatter(x="feat1",y="feat2",c="label", cmap="viridis", colorbar=False, figsize=(12,8));
test_data.plot.scatter(x="feat1",y="feat2",c="pred", cmap="viridis", colorbar=False, figsize=(12,8), ax=ax1, alpha=0.4)
print("Using max_depth =",mx_depth)
Using max_depth = 2
Great! Even with nonlinear data, the decision tree finds reasonable axis-aligned decision rules. Notice, though, that we'd need a max_depth of at least 4 to fully separate circular data like the one above: boxing in the inner circle takes two splits on feat1 and two splits on feat2 along a single path.
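One way to check this claim is to sweep max_depth and watch the training accuracy on the circles data (a quick sketch, reusing the circles_train frame loaded above):

# fit trees of increasing depth and report how well each fits the training set
for depth in range(1, 7):
    tree = DecisionTreeClassifier(random_state=1, max_depth=depth)
    tree.fit(circles_train[["feat1", "feat2"]], circles_train["label"])
    acc = tree.score(circles_train[["feat1", "feat2"]], circles_train["label"])
    print(f"max_depth={depth}: train accuracy={acc:.3f}")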
Now let's understand feature importance visually.
Which of the two features do you think is more important in the plot below?
vertical_train = pd.read_csv("vertical_train.csv")
test_data = pd.read_csv("test_data.csv")
vertical_train.plot.scatter(x="feat1",y="feat2",c="label", cmap="viridis", colorbar=False, figsize=(12,8));
feat2 seems to be much more important for classifying the above data: if we use the rule "if feat2 > 0, predict class_blue, otherwise class_yellow", we'd do a pretty good job of classifying it! Notice that we cannot find any such decision line with respect to feat1, hence it is not very important for this classification.
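We can sanity-check this hand-made rule directly. A minimal sketch, assuming the labels are encoded as 0/1 (the 0/1-to-color mapping is an assumption, so we take the better of the two orientations):

# apply the hand-made rule "feat2 > 0" and compare against the true labels
rule_pred = (vertical_train["feat2"] > 0).astype(int)
acc = (rule_pred == vertical_train["label"]).mean()
# max(acc, 1 - acc) handles the case where the assumed 0/1 mapping is flipped
print(f"hand-made rule accuracy: {max(acc, 1 - acc):.3f}")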
Let's visualize the decision tree classifier's predictions to see what rule it learns.
mx_depth = 1
clf = DecisionTreeClassifier(random_state=1, max_depth=mx_depth)
clf.fit(vertical_train[["feat1", "feat2"]], vertical_train["label"])
test_data["pred"] = clf.predict(test_data[["feat1", "feat2"]])
ax1 = vertical_train.plot.scatter(x="feat1",y="feat2",c="label", cmap="viridis", colorbar=False, figsize=(12,8));
test_data.plot.scatter(x="feat1",y="feat2",c="pred", cmap="viridis", colorbar=False, figsize=(12,8), ax=ax1, alpha=0.4)
print("Using max_depth =",mx_depth)
Using max_depth = 1
It does learn the decision region we hypothesized above!
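As an aside, recent scikit-learn versions (1.1+) can draw the same decision regions without a hand-built test grid; a sketch using DecisionBoundaryDisplay:

from sklearn.inspection import DecisionBoundaryDisplay
# shade the plane by the fitted tree's predictions, then overlay the training points
disp = DecisionBoundaryDisplay.from_estimator(clf, vertical_train[["feat1", "feat2"]], response_method="predict", cmap="viridis", alpha=0.4)
disp.ax_.scatter(vertical_train["feat1"], vertical_train["feat2"], c=vertical_train["label"], cmap="viridis")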
Now let's see what the decision tree classifier thinks about feature importances.
clf.feature_importances_
array([0., 1.])
Zero importance is assigned to feat1 and all of the importance to feat2, matching our visual intuition.
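For readability, the importances can be paired with the feature names (a small convenience, using the pandas import from above):

pd.Series(clf.feature_importances_, index=["feat1", "feat2"])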
Read the blobs_train.csv and test_data.csv datasets. Fit a DecisionTreeClassifier (random_state=1, max_depth=2) model on the training data. Predict on the testing data and plot the results. What rules does the decision tree learn for feat1 and feat2? Choose the correct answer from the options. Please use cmap="viridis" as we have above.
blobs_train = pd.read_csv("blobs_train.csv")
test_data = pd.read_csv("test_data.csv")
# your code here
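If you need a starting point, the pattern mirrors the earlier cells; a minimal sketch (one reasonable approach, not the only one):

from sklearn.tree import export_text
clf = DecisionTreeClassifier(random_state=1, max_depth=2)
clf.fit(blobs_train[["feat1", "feat2"]], blobs_train["label"])
test_data["pred"] = clf.predict(test_data[["feat1", "feat2"]])
# overlay semi-transparent test predictions on the opaque training points
ax1 = blobs_train.plot.scatter(x="feat1", y="feat2", c="label", cmap="viridis", colorbar=False, figsize=(12, 8))
test_data.plot.scatter(x="feat1", y="feat2", c="pred", cmap="viridis", colorbar=False, figsize=(12, 8), ax=ax1, alpha=0.4)
# read off the learned rules for feat1 and feat2
print(export_text(clf, feature_names=["feat1", "feat2"]))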
Read the moons_train.csv and test_data.csv train and test datasets. Fit a DecisionTreeClassifier (random_state=1, max_depth=10) model on the training data. Plot the training data, then predict on the testing data. Now plot both the training and testing data together and reason about the feature importance of the two features by looking at the plots. Then find the feature importances from the classifier and choose the correct answer(s).
moons_train = pd.read_csv("moons_train.csv")
test_data = pd.read_csv("test_data.csv")
# your code here
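Again, a minimal sketch following the same pattern as above, if you need a starting point:

clf = DecisionTreeClassifier(random_state=1, max_depth=10)
clf.fit(moons_train[["feat1", "feat2"]], moons_train["label"])
test_data["pred"] = clf.predict(test_data[["feat1", "feat2"]])
# training data and semi-transparent test predictions in one figure
ax1 = moons_train.plot.scatter(x="feat1", y="feat2", c="label", cmap="viridis", colorbar=False, figsize=(12, 8))
test_data.plot.scatter(x="feat1", y="feat2", c="pred", cmap="viridis", colorbar=False, figsize=(12, 8), ax=ax1, alpha=0.4)
# compare the numeric importances with what the plots suggested
pd.Series(clf.feature_importances_, index=["feat1", "feat2"])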