This is a walkthrough of some code for a simple decision-tree learner.
We'll import a dataset in which each row is a mushroom and each column specifies an attribute of that mushroom. We want to predict whether a mushroom is poisonous based on its other attributes.
NOTE: This is not a robust version of the algorithm: for example, it can only handle discrete/nominal attributes, it doesn't handle overfitting, etc. The purpose is for the code to be relatively simple and easy to understand.
import pandas as pd

# read_csv with index_col=0 reproduces the behavior of the old
# DataFrame.from_csv, which has since been removed from pandas
df_shroom = pd.read_csv('../datasets/mushroom_data.csv', index_col=0)
df_shroom
class | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | poisonous | convex | smooth | brown | bruises | pungent | free | close | narrow | black | ... | smooth | white | white | partial | white | one | pendant | black | scattered | urban |
1 | edible | convex | smooth | yellow | bruises | almond | free | close | broad | black | ... | smooth | white | white | partial | white | one | pendant | brown | numerous | grasses |
2 | edible | bell | smooth | white | bruises | anise | free | close | broad | brown | ... | smooth | white | white | partial | white | one | pendant | brown | numerous | meadows |
3 | poisonous | convex | scaly | white | bruises | pungent | free | close | narrow | brown | ... | smooth | white | white | partial | white | one | pendant | black | scattered | urban |
4 | edible | convex | smooth | gray | no | none | free | crowded | broad | black | ... | smooth | white | white | partial | white | one | evanescent | brown | abundant | grasses |
5 | edible | convex | scaly | yellow | bruises | almond | free | close | broad | brown | ... | smooth | white | white | partial | white | one | pendant | black | numerous | grasses |
6 | edible | bell | smooth | white | bruises | almond | free | close | broad | gray | ... | smooth | white | white | partial | white | one | pendant | black | numerous | meadows |
7 | edible | bell | scaly | white | bruises | anise | free | close | broad | brown | ... | smooth | white | white | partial | white | one | pendant | brown | scattered | meadows |
8 | poisonous | convex | scaly | white | bruises | pungent | free | close | narrow | pink | ... | smooth | white | white | partial | white | one | pendant | black | several | grasses |
9 | edible | bell | smooth | yellow | bruises | almond | free | close | broad | gray | ... | smooth | white | white | partial | white | one | pendant | black | scattered | meadows |
10 | edible | convex | scaly | yellow | bruises | anise | free | close | broad | gray | ... | smooth | white | white | partial | white | one | pendant | brown | numerous | grasses |
11 | edible | convex | scaly | yellow | bruises | almond | free | close | broad | brown | ... | smooth | white | white | partial | white | one | pendant | black | scattered | meadows |
12 | edible | bell | smooth | yellow | bruises | almond | free | close | broad | white | ... | smooth | white | white | partial | white | one | pendant | brown | scattered | grasses |
13 | poisonous | convex | scaly | white | bruises | pungent | free | close | narrow | black | ... | smooth | white | white | partial | white | one | pendant | brown | several | urban |
14 | edible | convex | fibrous | brown | no | none | free | crowded | broad | brown | ... | fibrous | white | white | partial | white | one | evanescent | black | abundant | grasses |
15 | edible | sunken | fibrous | gray | no | none | free | close | narrow | black | ... | smooth | white | white | partial | white | one | pendant | brown | solitary | urban |
16 | edible | flat | fibrous | white | no | none | free | crowded | broad | black | ... | smooth | white | white | partial | white | one | evanescent | brown | abundant | grasses |
17 | poisonous | convex | smooth | brown | bruises | pungent | free | close | narrow | brown | ... | smooth | white | white | partial | white | one | pendant | black | scattered | grasses |
18 | poisonous | convex | scaly | white | bruises | pungent | free | close | narrow | brown | ... | smooth | white | white | partial | white | one | pendant | brown | scattered | urban |
19 | poisonous | convex | smooth | brown | bruises | pungent | free | close | narrow | black | ... | smooth | white | white | partial | white | one | pendant | brown | scattered | urban |
20 | edible | bell | smooth | yellow | bruises | almond | free | close | broad | black | ... | smooth | white | white | partial | white | one | pendant | brown | scattered | meadows |
21 | poisonous | convex | scaly | brown | bruises | pungent | free | close | narrow | brown | ... | smooth | white | white | partial | white | one | pendant | brown | several | grasses |
22 | edible | bell | scaly | yellow | bruises | anise | free | close | broad | black | ... | smooth | white | white | partial | white | one | pendant | brown | scattered | meadows |
23 | edible | bell | scaly | white | bruises | almond | free | close | broad | white | ... | smooth | white | white | partial | white | one | pendant | brown | numerous | meadows |
24 | edible | bell | smooth | white | bruises | anise | free | close | broad | gray | ... | smooth | white | white | partial | white | one | pendant | black | scattered | meadows |
25 | poisonous | flat | smooth | white | bruises | pungent | free | close | narrow | brown | ... | smooth | white | white | partial | white | one | pendant | brown | several | grasses |
26 | edible | convex | scaly | yellow | bruises | almond | free | close | broad | brown | ... | smooth | white | white | partial | white | one | pendant | brown | numerous | meadows |
27 | edible | convex | scaly | white | bruises | anise | free | close | broad | white | ... | smooth | white | white | partial | white | one | pendant | brown | numerous | meadows |
28 | edible | flat | fibrous | brown | no | none | free | close | narrow | black | ... | smooth | white | white | partial | white | one | pendant | black | solitary | urban |
29 | edible | convex | smooth | yellow | bruises | almond | free | crowded | narrow | brown | ... | smooth | white | white | partial | white | one | pendant | brown | several | woods |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5614 | edible | flat | smooth | gray | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | solitary | paths |
5615 | edible | convex | smooth | pink | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | several | paths |
5616 | poisonous | conical | scaly | yellow | no | none | free | crowded | narrow | white | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
5617 | poisonous | knobbed | scaly | red | no | musty | attached | close | broad | yellow | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5618 | poisonous | convex | scaly | cinnamon | no | musty | free | close | broad | white | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5619 | edible | bell | scaly | brown | no | none | free | close | broad | white | ... | scaly | brown | brown | partial | white | two | pendant | white | solitary | woods |
5620 | edible | convex | scaly | brown | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | several | paths |
5621 | poisonous | convex | scaly | brown | no | musty | attached | close | broad | yellow | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5622 | poisonous | knobbed | scaly | yellow | no | none | free | crowded | narrow | white | ... | scaly | yellow | yellow | partial | yellow | one | evanescent | white | clustered | leaves |
5623 | edible | knobbed | smooth | brown | no | none | free | close | broad | white | ... | scaly | brown | brown | partial | white | two | pendant | white | solitary | woods |
5624 | poisonous | flat | scaly | cinnamon | no | musty | attached | close | broad | white | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5625 | poisonous | convex | scaly | brown | no | musty | free | close | broad | white | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5626 | poisonous | knobbed | scaly | red | no | musty | free | close | broad | white | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5627 | edible | convex | scaly | brown | no | none | free | close | broad | white | ... | scaly | brown | brown | partial | white | two | pendant | white | solitary | woods |
5628 | edible | flat | scaly | gray | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | several | paths |
5629 | poisonous | flat | scaly | brown | no | musty | attached | close | broad | yellow | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5630 | edible | flat | scaly | pink | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | several | paths |
5631 | edible | flat | smooth | brown | no | none | free | close | broad | white | ... | scaly | brown | brown | partial | white | two | pendant | white | solitary | paths |
5632 | edible | flat | scaly | gray | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | solitary | paths |
5633 | poisonous | knobbed | scaly | red | no | musty | attached | close | broad | white | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5634 | edible | convex | scaly | cinnamon | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | several | paths |
5635 | edible | flat | smooth | cinnamon | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | several | paths |
5636 | edible | convex | scaly | brown | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | solitary | paths |
5637 | poisonous | knobbed | scaly | cinnamon | no | musty | attached | close | broad | yellow | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5638 | edible | flat | smooth | brown | no | none | free | close | broad | white | ... | scaly | brown | brown | partial | white | two | pendant | white | solitary | woods |
5639 | edible | bell | scaly | brown | no | none | free | close | broad | white | ... | scaly | brown | brown | partial | white | two | pendant | white | solitary | paths |
5640 | edible | convex | scaly | brown | no | none | free | close | broad | white | ... | scaly | brown | brown | partial | white | two | pendant | white | solitary | paths |
5641 | edible | convex | scaly | gray | bruises | none | free | close | broad | white | ... | smooth | white | white | partial | white | two | pendant | white | solitary | paths |
5642 | poisonous | convex | scaly | cinnamon | no | musty | free | close | broad | yellow | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5643 | poisonous | flat | scaly | cinnamon | no | musty | attached | close | broad | yellow | ... | scaly | cinnamon | cinnamon | partial | white | none | none | white | clustered | woods |
5644 rows × 23 columns
The decision tree algorithm works by looking at all the attributes, and picking the 'best' one to split the data on, then running recursively on the split data.
The way we pick the 'best' attribute to split on is by picking the attribute that decreases 'entropy' the most.
What's entropy? Recall that we're trying to classify mushrooms as poisonous or edible. Entropy is a value that is low when a group of mushrooms mostly shares the same class (all poisonous or all edible) and high when a group of mushrooms is mixed (half poisonous, half edible). So a split of the data that minimizes entropy is a split that does a good job of separating the classes.
import math
from collections import Counter

def entropy(probs):
    '''
    Takes a list of probabilities and calculates their entropy.
    '''
    return sum(-prob * math.log(prob, 2) for prob in probs)

def entropy_of_list(a_list):
    '''
    Takes a list of items with discrete values (e.g., poisonous, edible)
    and returns the entropy for those items.
    '''
    # Tally up each value:
    cnt = Counter(a_list)
    # Convert counts to proportions:
    num_instances = len(a_list) * 1.0
    probs = [count / num_instances for count in cnt.values()]
    # Calculate entropy of the proportions:
    return entropy(probs)
# The initial entropy of the poisonous/edible class for our dataset:
total_entropy = entropy_of_list(df_shroom['class'])
print(total_entropy)
0.959441337353
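To see the behavior described above, here is a quick sanity check on two toy lists (hypothetical values, not from the dataset): a homogeneous group has entropy 0, and a 50/50 split has the maximum entropy of 1 bit. The helper is the same `entropy_of_list` defined above, inlined so the snippet stands alone:

```python
import math
from collections import Counter

def entropy_of_list(a_list):
    # Same computation as the entropy_of_list helper defined above
    cnt = Counter(a_list)
    n = float(len(a_list))
    return sum(-count / n * math.log(count / n, 2) for count in cnt.values())

print(entropy_of_list(['edible'] * 10))                     # homogeneous -> 0.0
print(entropy_of_list(['edible'] * 5 + ['poisonous'] * 5))  # 50/50 split -> 1.0
```

The dataset's entropy of ~0.96 tells us it is close to an even split between the two classes.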
To decide which attribute to split on, we want to quantify how much each attribute decreases the entropy.
We do this in a fairly intuitive way: we split our dataset by the possible values of an attribute, then do a weighted sum of the entropies for each of these split datasets, weighted by how big that sub-dataset is.
We'll make a function that quantifies the decrease in entropy, or conversely, the gain in information.
def information_gain(df, split_attribute_name, target_attribute_name, trace=0):
    '''
    Takes a DataFrame of attributes and quantifies the entropy of a target
    attribute after performing a split along the values of another attribute.
    '''
    # Split data by possible values of the attribute:
    df_split = df.groupby(split_attribute_name)

    # Calculate entropy of the target attribute, as well as the
    # proportion of observations in each data split:
    nobs = len(df.index) * 1.0
    df_agg_ent = df_split.agg({target_attribute_name: [entropy_of_list, lambda x: len(x) / nobs]})[target_attribute_name]
    df_agg_ent.columns = ['Entropy', 'PropObservations']
    if trace:  # helps to understand what this function is doing
        print(df_agg_ent)

    # Information gain = old entropy minus the weighted sum of post-split entropies:
    new_entropy = sum(df_agg_ent['Entropy'] * df_agg_ent['PropObservations'])
    old_entropy = entropy_of_list(df[target_attribute_name])
    return old_entropy - new_entropy
print('\nExample: Info-gain for best attribute is ' + str(information_gain(df_shroom, 'odor', 'class')))
Example: Info-gain for best attribute is 0.859670435885
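The weighted-sum arithmetic can be checked by hand. As a sketch, take a hypothetical group of 4 mushrooms (2 poisonous, 2 edible) where splitting on odor puts 1 edible mushroom in a 'none' branch and the other 3 in a 'pungent' branch:

```python
import math

# Entropy before the split: 2 poisonous / 2 edible -> exactly 1 bit
old_entropy = 1.0

# 'none' branch: 1 edible only -> entropy 0
# 'pungent' branch: 2 poisonous + 1 edible:
ent_pungent = -(2 / 3.0) * math.log(2 / 3.0, 2) - (1 / 3.0) * math.log(1 / 3.0, 2)

# Weight each branch's entropy by its share of the observations (1/4 and 3/4):
new_entropy = (1 / 4.0) * 0.0 + (3 / 4.0) * ent_pungent

print(old_entropy - new_entropy)  # information gain of splitting on odor, about 0.31
```

This is exactly what `information_gain` computes, with `groupby` and `agg` doing the per-branch bookkeeping.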
Now we'll write the decision tree algorithm itself, which is called "ID3".
def id3(df, target_attribute_name, attribute_names, default_class=None):
    ## Tally target attribute:
    from collections import Counter
    cnt = Counter(x for x in df[target_attribute_name])

    ## First check: Is this split of the dataset homogeneous?
    # (e.g., all mushrooms in this set are poisonous)
    # If yes, return that homogeneous label (e.g., 'poisonous')
    if len(cnt) == 1:
        return next(iter(cnt))

    ## Second check: Is this split of the dataset empty, or are we out of attributes?
    # If yes, return a default value
    elif df.empty or (not attribute_names):
        return default_class

    ## Otherwise: This dataset is ready to be divvied up!
    else:
        # Get default value for the next recursive call of this function:
        default_class = cnt.most_common(1)[0][0]  # most common value of target attribute in dataset

        # Choose best attribute to split on:
        gains = [information_gain(df, attr, target_attribute_name) for attr in attribute_names]
        best_attr = attribute_names[gains.index(max(gains))]

        # Create an empty tree, to be populated in a moment
        tree = {best_attr: {}}
        remaining_attribute_names = [i for i in attribute_names if i != best_attr]

        # Split the dataset on the best attribute; on each split, recursively
        # call this algorithm, and populate the empty tree with the subtrees
        # that the recursive calls return
        for attr_val, data_subset in df.groupby(best_attr):
            subtree = id3(data_subset,
                          target_attribute_name,
                          remaining_attribute_names,
                          default_class)
            tree[best_attr][attr_val] = subtree
        return tree
# Get Predictor Names (all but 'class')
attribute_names = list(df_shroom.columns)
attribute_names.remove('class')
# Run Algorithm:
from pprint import pprint
tree = id3(df_shroom, 'class', attribute_names)
pprint(tree)
{'odor':
  {'almond': 'edible',
   'anise': 'edible',
   'creosote': 'poisonous',
   'foul': 'poisonous',
   'musty': 'poisonous',
   'none':
     {'spore-print-color':
       {'black': 'edible',
        'brown': 'edible',
        'green': 'poisonous',
        'white':
          {'ring-type':
            {'evanescent':
              {'stalk-surface-above-ring':
                {'fibrous': 'edible', 'scaly': 'poisonous', 'smooth': 'edible'}},
             'pendant':
              {'stalk-surface-above-ring':
                {'scaly': 'edible',
                 'smooth':
                   {'stalk-surface-below-ring':
                     {'smooth':
                       {'stalk-color-above-ring':
                         {'white':
                           {'stalk-color-below-ring':
                             {'white':
                               {'stalk-shape':
                                 {'enlarging':
                                   {'gill-color':
                                     {'white':
                                       {'cap-color':
                                         {'brown': 'edible',
                                          'cinnamon': 'edible',
                                          'gray': 'edible',
                                          'pink': 'edible',
                                          'white': 'poisonous'}}}}}}}}}}}}}}}}}},
   'pungent': 'poisonous'}}
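The tree is just nested dicts: each outer key names an attribute, and each value under it is either a class label (a leaf) or another subtree. As a sketch of how to walk this structure, here's a leaf counter run on a small hypothetical tree of the same shape:

```python
def count_leaves(tree):
    # A leaf is a plain label (a string); anything else is a subtree dict.
    if not isinstance(tree, dict):
        return 1
    # tree maps one attribute name to a dict of {attribute value: subtree-or-label}
    return sum(count_leaves(subtree)
               for branches in tree.values()
               for subtree in branches.values())

# Hypothetical miniature tree, just for illustration:
toy_tree = {'odor': {'pungent': 'poisonous',
                     'none': {'cap-color': {'white': 'poisonous',
                                            'brown': 'edible'}}}}
print(count_leaves(toy_tree))  # -> 3
```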
Let's make sure our resulting tree accurately predicts the class, based on the features.
Below is a 'classify' function that takes an instance and classifies it by walking the tree.
def classify(instance, tree, default=None):
    attribute = next(iter(tree))  # root attribute of this (sub)tree
    if instance[attribute] in tree[attribute]:
        result = tree[attribute][instance[attribute]]
        if isinstance(result, dict):  # this is a subtree: delve deeper
            return classify(instance, result)
        else:
            return result  # this is a label
    else:
        return default
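As a quick sketch of how `classify` walks the tree, here it is run against a hypothetical miniature tree, with plain dicts standing in for DataFrame rows (both support the same `[]` lookup):

```python
def classify(instance, tree, default=None):
    # Same function as defined above, repeated so this snippet stands alone
    attribute = next(iter(tree))
    if instance[attribute] in tree[attribute]:
        result = tree[attribute][instance[attribute]]
        if isinstance(result, dict):
            return classify(instance, result)
        return result
    return default

# Hypothetical miniature tree, just for illustration:
toy_tree = {'odor': {'pungent': 'poisonous',
                     'none': {'cap-color': {'white': 'poisonous',
                                            'brown': 'edible'}}}}

print(classify({'odor': 'none', 'cap-color': 'brown'}, toy_tree))  # -> edible
print(classify({'odor': 'musty', 'cap-color': 'brown'}, toy_tree,
               default='poisonous'))  # unseen attribute value -> falls back to default
```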
# The classify function allows a default argument: when the tree doesn't have an answer
# for a particular combination of attribute values, we use 'poisonous' as the default
# guess (better safe than sorry!)
df_shroom['predicted'] = df_shroom.apply(classify, axis=1, args=(tree, 'poisonous'))
print('Accuracy is ' + str(sum(df_shroom['class'] == df_shroom['predicted']) / (1.0 * len(df_shroom.index))))
df_shroom[['class', 'predicted']]
Accuracy is 1.0
class | predicted | |
---|---|---|
0 | poisonous | poisonous |
1 | edible | edible |
2 | edible | edible |
3 | poisonous | poisonous |
4 | edible | edible |
5 | edible | edible |
6 | edible | edible |
7 | edible | edible |
8 | poisonous | poisonous |
9 | edible | edible |
10 | edible | edible |
11 | edible | edible |
12 | edible | edible |
13 | poisonous | poisonous |
14 | edible | edible |
15 | edible | edible |
16 | edible | edible |
17 | poisonous | poisonous |
18 | poisonous | poisonous |
19 | poisonous | poisonous |
20 | edible | edible |
21 | poisonous | poisonous |
22 | edible | edible |
23 | edible | edible |
24 | edible | edible |
25 | poisonous | poisonous |
26 | edible | edible |
27 | edible | edible |
28 | edible | edible |
29 | edible | edible |
... | ... | ... |
5614 | edible | edible |
5615 | edible | edible |
5616 | poisonous | poisonous |
5617 | poisonous | poisonous |
5618 | poisonous | poisonous |
5619 | edible | edible |
5620 | edible | edible |
5621 | poisonous | poisonous |
5622 | poisonous | poisonous |
5623 | edible | edible |
5624 | poisonous | poisonous |
5625 | poisonous | poisonous |
5626 | poisonous | poisonous |
5627 | edible | edible |
5628 | edible | edible |
5629 | poisonous | poisonous |
5630 | edible | edible |
5631 | edible | edible |
5632 | edible | edible |
5633 | poisonous | poisonous |
5634 | edible | edible |
5635 | edible | edible |
5636 | edible | edible |
5637 | poisonous | poisonous |
5638 | edible | edible |
5639 | edible | edible |
5640 | edible | edible |
5641 | edible | edible |
5642 | poisonous | poisonous |
5643 | poisonous | poisonous |
5644 rows × 2 columns
Of course, a more honest assessment of the algorithm is to train it on one subset of the data, then test it on a different, held-out subset.
training_data = df_shroom.iloc[:-1000]     # all but the last thousand instances
test_data = df_shroom.iloc[-1000:].copy()  # just the last thousand (a copy, so we can add a column)
train_tree = id3(training_data, 'class', attribute_names)
test_data['predicted2'] = test_data.apply(  # <---- test_data source
    classify,
    axis=1,
    args=(train_tree, 'poisonous'))         # <---- tree built from training_data
print('Accuracy is ' + str(sum(test_data['class'] == test_data['predicted2']) / (1.0 * len(test_data.index))))
Accuracy is 0.944
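Since the rows here are taken in file order, the last thousand rows may not be representative of the whole dataset. A common alternative is to shuffle before splitting. A minimal sketch using pandas `sample` (on a tiny synthetic stand-in for `df_shroom`, with made-up values just to show the mechanics):

```python
import pandas as pd

# Hypothetical stand-in for df_shroom, just to demonstrate the split:
df = pd.DataFrame({'class': ['edible', 'poisonous'] * 50,
                   'odor':  ['almond', 'pungent'] * 50})

# Shuffle all rows (frac=1), then hold out 20% for testing:
shuffled = df.sample(frac=1, random_state=0)
cutoff = int(0.8 * len(shuffled))
train = shuffled.iloc[:cutoff]
test = shuffled.iloc[cutoff:].copy()

print(len(train), len(test))  # -> 80 20
```

A fixed `random_state` keeps the split reproducible between runs.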