In this exercise, we'll use decision trees to predict whether a Pokémon has the "Legendary" moniker based on its stats alone.
> Legendary Pokémon are a group of incredibly rare and often very powerful Pokémon. They are often featured prominently in the legends and myths of the Pokémon world, with some even going so far as to view them as deities. ~ Bulbapedia
Your goal is to build the most accurate model possible, where accuracy is defined as:

$$ \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Positives} + \text{Negatives}} $$

# JUST RUN THIS
from google.colab import drive
import pandas as pd
drive.mount('/content/gdrive')
# Load the data
df = pd.read_csv('/content/gdrive/MyDrive/datasets/pokemon.csv')
df.sample(5)
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 740 | 672 | Skiddo | Grass | NaN | 350 | 66 | 65 | 48 | 62 | 57 | 52 | 6 | False |
| 688 | 627 | Rufflet | Normal | Flying | 350 | 70 | 83 | 50 | 37 | 50 | 60 | 5 | False |
| 251 | 232 | Donphan | Ground | NaN | 500 | 90 | 120 | 120 | 60 | 60 | 50 | 2 | False |
| 52 | 47 | Parasect | Bug | Grass | 405 | 60 | 95 | 80 | 60 | 80 | 30 | 1 | False |
| 469 | 422 | Shellos | Water | NaN | 325 | 76 | 48 | 48 | 57 | 62 | 34 | 4 | False |
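Because legendary Pokémon are rare, a model that predicts False for every row already scores a high accuracy, so it helps to know the class balance before judging your number. A quick optional check, assuming the Legendary column is boolean as in the sample above:

# Fraction of legendary vs. non-legendary Pokemon in the full dataset
print(df["Legendary"].value_counts(normalize=True))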
Click on any of the links below to see the scikit-learn documentation for that model.
# Option 1: decision tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

# Option 2: logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)

# Option 3: support vector classifier
from sklearn.svm import SVC
model = SVC()

# Optional: standardize the features (helps LogisticRegression and SVC)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
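Whichever model you pick, the scikit-learn workflow is the same: construct the classifier, fit it on the training data, then predict on the test data. Here's a minimal sketch of that pattern (decision tree shown; any of the models above can be swapped in), assuming X_train, X_test, and y_train are built as described in the next sections:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)      # learn from the training set
y_pred = model.predict(X_test)   # predict labels for the test set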
To ensure that we're all using the same train and test sets, and that each set contains a reasonable number of both legendary and non-legendary Pokémon, use the Pokémon from generations 1-5 as your train set and the Pokémon from generation 6 as your test set.
df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]
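As an optional sanity check, you can confirm that both splits contain legendary Pokémon:

# Count the legendary Pokemon in each split
print("Train legendaries:", df_train["Legendary"].sum())
print("Test legendaries: ", df_test["Legendary"].sum())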
Feature Matrix (X) and Label Vector (y)

Make sure that your feature matrix, X, contains only numeric features. You can use the double-bracket syntax [["col1", "col2"]] to extract multiple columns.
# Example with two features
X_train = df_train[["Attack", "Defense"]]
X_test = df_test[["Attack", "Defense"]]
# You can include more: ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
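If you'd rather not type column names by hand, one alternative sketch (assuming the column names shown in the sample above) selects every numeric column and drops the ones that shouldn't be features:

# Bool columns like Legendary are not picked up by include="number"
feature_cols = df_train.select_dtypes(include="number").columns.drop(["#", "Generation"])
X_train = df_train[feature_cols]
X_test = df_test[feature_cols]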
For the label vector, just use the Legendary column.
y_train = df_train['Legendary']
y_test = df_test['Legendary']
To try to improve your model, experiment with a few different forms of feature engineering to create new features:
df['Total Attack'] = df["Attack"] + df["Sp. Atk"]      # physical + special attack
df["Special Total"] = df["Sp. Atk"] + df["Sp. Def"]    # combined special stats
df['Attack over HP'] = df["Attack"] / df["HP"]         # ratio of attack to HP
df['Attack Scaled'] = (df["Attack"] - df["Attack"].min()) / (df["Attack"].max() - df["Attack"].min())  # min-max scaling to [0, 1]
df['Attack Normalized'] = (df["Attack"] - df["Attack"].mean()) / df["Attack"].std()  # z-score standardization
df['Attack Percent'] = df["Attack"] / df["Attack"].max()  # fraction of the maximum attack
df = pd.get_dummies(df, columns=["Type 1"])            # one-hot encode the primary type
Feature engineering may improve the performance of your model.
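Note that pd.get_dummies replaces Type 1 with one column per type, named like Type 1_Dragon, Type 1_Water, and so on. If you go that route, a prefix filter is one way to pull those columns into your feature list (a sketch, assuming the get_dummies call above runs before you split into train and test):

# Collect the one-hot columns produced by get_dummies
type_cols = [c for c in df.columns if c.startswith("Type 1_")]
features = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"] + type_cols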
Here's the code from the previous assignment for calculating the confusion matrix and checking your accuracy.
def calculate_confusion_matrix(y_test, y_pred):
    # Input: y_test and y_pred are boolean Series aligned on the same index
    # Output: returns tp, tn, fp, fn
    tp = ((y_test == True) & (y_pred == True)).sum()    # True Positive
    tn = ((y_test == False) & (y_pred == False)).sum()  # True Negative
    fp = ((y_test == False) & (y_pred == True)).sum()   # False Positive
    fn = ((y_test == True) & (y_pred == False)).sum()   # False Negative
    return tp, tn, fp, fn
# Calculate confusion matrix
tp, tn, fp, fn = calculate_confusion_matrix(y_test, y_pred)
print(" Predicted Positive | Predicted Negative")
print(f"Actual Positive |{tp:>19d} |{fn:>19d} ")
print(f"Actual Negative |{fp:>19d} |{tn:>19d} ")
print("")
# Calculate accuracy, precision, and recall
total = len(y_test)
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"Accuracy: {accuracy:>6.2%} (Correctly classified {tp + tn} out of {total})")
print(f"Precision: {precision:>6.2%} (When predicted positive, correct {precision:.0%} of the time)")
print(f"Recall: {recall:>6.2%} (Found {recall:.0%} of all positive cases)")
Make sure you convert y_pred to a pandas Series and set the index to match X_test.
y_pred = pd.Series(model.predict(X_test), index=X_test.index)
If you'd like to see which rows your test failed on, try this:
display(df_test[y_pred != y_test])
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
def name_length(name):
    return len(name)

df["Total"] = df["HP"] + df["Attack"] + df["Defense"] + df["Sp. Atk"] + df["Sp. Def"] + df["Speed"]  # base stat total
df["Atk Ratio"] = df["Attack"] / df["Sp. Atk"]      # physical vs. special attack balance
df["Type 1 Dragon"] = df["Type 1"] == "Dragon"      # True when the primary type is Dragon
df["Total is 600"] = df["Total"] == 600             # True when the base stat total is exactly 600
df["Name Length"] = df["Name"].apply(name_length)   # character length of the name
df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]
features = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed", "Total", "Atk Ratio", "Type 1 Dragon", "Total is 600", "Name Length"]
X_train = df_train[features]
X_test = df_test[features]
y_train = df_train['Legendary']
y_test = df_test['Legendary']
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test = scaler.transform(X_test)        # apply the same scaling to the test data
model = SVC()
model.fit(X_train, y_train)
y_pred = pd.Series(model.predict(X_test), index=y_test.index)
def calculate_confusion_matrix(y_test, y_pred):
    # Input: y_test and y_pred are boolean Series aligned on the same index
    # Output: returns tp, tn, fp, fn
    tp = ((y_test == True) & (y_pred == True)).sum()    # True Positive
    tn = ((y_test == False) & (y_pred == False)).sum()  # True Negative
    fp = ((y_test == False) & (y_pred == True)).sum()   # False Positive
    fn = ((y_test == True) & (y_pred == False)).sum()   # False Negative
    return tp, tn, fp, fn
# Calculate confusion matrix
tp, tn, fp, fn = calculate_confusion_matrix(y_test, y_pred)
print(" Predicted Positive | Predicted Negative")
print(f"Actual Positive |{tp:>19d} |{fn:>19d} ")
print(f"Actual Negative |{fp:>19d} |{tn:>19d} ")
print("")
# Calculate accuracy, precision, and recall
total = len(y_test)
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"Accuracy: {accuracy:>6.2%} (Correctly classified {tp + tn} out of {total})")
print(f"Precision: {precision:>6.2%} (When predicted positive, correct {precision:.0%} of the time)")
print(f"Recall: {recall:>6.2%} (Found {recall:.0%} of all positive cases)")
 Predicted Positive | Predicted Negative
Actual Positive |                  3 |                  5
Actual Negative |                  1 |                 73

Accuracy: 92.68% (Correctly classified 76 out of 82)
Precision: 75.00% (When predicted positive, correct 75% of the time)
Recall: 37.50% (Found 38% of all positive cases)
df_test["Pred. Legendary"] = y_pred
display(df_test[y_pred != y_test])
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Atk Ratio | Type 1 Dragon | Total is 600 | Name Length | Pred. Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 776 | 706 | Goodra | Dragon | NaN | 600 | 90 | 100 | 70 | 110 | 150 | 80 | 6 | False | 0.909091 | True | True | 6 | True |
| 794 | 718 | Zygarde50% Forme | Dragon | Ground | 600 | 108 | 100 | 121 | 81 | 95 | 95 | 6 | True | 1.234568 | True | True | 16 | False |
| 795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True | 1.000000 | False | True | 7 | False |
| 796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True | 1.000000 | False | False | 19 | False |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True | 0.733333 | False | True | 19 | False |
| 799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True | 0.846154 | False | True | 9 | False |
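Recall is the weak spot in this run: the model finds only 3 of the 8 generation-6 legendaries. One optional knob to experiment with is the class_weight parameter, which reweights the rare legendary class more heavily; a minimal sketch, assuming the same scaled features as above:

from sklearn.svm import SVC

# class_weight="balanced" weights classes inversely to their frequency,
# which can trade some precision for better recall on the rare class
model = SVC(class_weight="balanced")
model.fit(X_train, y_train)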