In this exercise, we'll use decision trees to predict whether a Pokémon is Legendary based on its stats alone.
Legendary Pokémon are a group of incredibly rare and often very powerful Pokémon. They are often featured prominently in the legends and myths of the Pokémon world, with some even going so far as to view them as deities. ~ Bulbapedia
Your goal is to get the most accurate model possible, where accuracy is defined as:
$$ \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Positives} + \text{Negatives}} $$

# JUST RUN THIS
from google.colab import drive
import pandas as pd
drive.mount('/content/gdrive')
# Load the data
df = pd.read_csv('/content/gdrive/MyDrive/datasets/pokemon.csv')
df.sample(5)
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 659 | 598 | Ferrothorn | Grass | Steel | 489 | 74 | 94 | 131 | 54 | 116 | 20 | 5 | False |
| 695 | 634 | Zweilous | Dark | Dragon | 420 | 72 | 85 | 70 | 65 | 70 | 58 | 5 | False |
| 734 | 666 | Vivillon | Bug | Flying | 411 | 80 | 52 | 50 | 90 | 50 | 89 | 6 | False |
| 697 | 636 | Larvesta | Bug | Fire | 360 | 55 | 85 | 55 | 50 | 55 | 60 | 5 | False |
| 38 | 33 | Nidorino | Poison | NaN | 365 | 61 | 72 | 57 | 55 | 55 | 65 | 1 | False |
Click any of the links below to see the scikit-learn documentation for that model.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
from sklearn.svm import SVC
model = SVC()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the train set only
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to the test set
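Putting the scaler and a model together, here is a minimal sketch of the scale-then-fit workflow. The numbers below are made-up stand-ins for the Pokémon stats, just to show the pattern:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data standing in for the real train/test frames
train = pd.DataFrame({"Attack": [50, 60, 150, 160],
                      "Defense": [40, 55, 140, 150],
                      "Legendary": [False, False, True, True]})
test = pd.DataFrame({"Attack": [55, 155], "Defense": [45, 145]})

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(train[["Attack", "Defense"]])  # fit on train only
X_test_scaled = scaler.transform(test[["Attack", "Defense"]])        # reuse train statistics

model = SVC()
model.fit(X_train_scaled, train["Legendary"])
preds = model.predict(X_test_scaled)
print(preds)  # one True/False prediction per test row
```

The key point is that the scaler is fit on the train set alone; calling fit_transform on the test set would leak test statistics into preprocessing.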
To ensure that we're all using the same train and test sets, and that each set contains a reasonable balance of legendary and non-legendary Pokémon, use the Pokémon from generations 1-5 as your train set and the Pokémon from generation 6 as your test set.
df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]
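Because this split is by generation rather than random, it's worth checking that each set actually contains some Legendary Pokémon. A quick sketch of that check, using a tiny made-up frame in place of the real df:

```python
import pandas as pd

# Hypothetical toy data standing in for the full Pokémon dataset
df = pd.DataFrame({"Generation": [1, 2, 6, 6, 5],
                   "Legendary": [False, True, False, True, False]})

df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]

# value_counts shows how many legendary vs. non-legendary rows landed in each split
print(df_train["Legendary"].value_counts())
print(df_test["Legendary"].value_counts())
```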
Feature Matrix (X) and Label Vector (y)

Make sure that your feature matrix, X, contains only numeric features. You can use the double bracket syntax [["col1", "col2"]] to extract multiple columns.
# Example with two features
X_train = df_train[["Attack", "Defense"]]
X_test = df_test[["Attack", "Defense"]]
# You can include more: ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
For the label vector, just use Legendary.
y_train = df_train['Legendary']
y_test = df_test['Legendary']
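With features and labels in place, fitting any of the models above follows the same two-step pattern. A sketch with hypothetical toy values standing in for the real X_train and y_train:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real feature matrix and labels
X_train = pd.DataFrame({"Attack": [50, 60, 150, 160],
                        "Defense": [40, 55, 140, 150]})
y_train = pd.Series([False, False, True, True])

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)  # learn threshold splits from the training data

# Predict for one unseen row with the same columns as X_train
pred = model.predict(pd.DataFrame({"Attack": [155], "Defense": [145]}))
print(pred)
```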
To try to improve your model, experiment with a few different forms of feature engineering to create new features:
df['Total Attack'] = df["Attack"] + df["Sp. Atk"]
df["Special Total"] = df["Sp. Atk"] + df["Sp. Def"]
df['Attack over HP'] = df["Attack"] / df["HP"]
df['Attack Scaled'] = (df["Attack"] - df["Attack"].min()) / (df["Attack"].max() - df["Attack"].min())
df['Attack Normalized'] = (df["Attack"] - df["Attack"].mean()) / df["Attack"].std()
df['Attack Percent'] = df["Attack"] / df["Attack"].max()
df = pd.get_dummies(df, columns=["Type 1"])
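One caveat with get_dummies: if you one-hot encode the train and test frames separately, a type that appears in only one of them produces mismatched columns. Encoding the full frame before splitting sidesteps this, as sketched below with made-up data:

```python
import pandas as pd

# Hypothetical toy data standing in for the full Pokémon dataset
df = pd.DataFrame({"Type 1": ["Grass", "Fire", "Water", "Fire"],
                   "Generation": [1, 2, 6, 6]})

# Encode on the full frame first, then split, so both sets share identical columns
df = pd.get_dummies(df, columns=["Type 1"])
df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]

print(list(df_train.columns) == list(df_test.columns))  # True
```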
Feature engineering may improve the performance of your model.
Here's the code from the previous assignment for calculating the confusion matrix and checking your accuracy.
def calculate_confusion_matrix(y_test, y_pred):
    # Input: y_test and y_pred are boolean Series aligned on the same index
    # Output: returns tp, tn, fp, fn
    tp = ((y_test == True) & (y_pred == True)).sum()    # True Positive
    tn = ((y_test == False) & (y_pred == False)).sum()  # True Negative
    fp = ((y_test == False) & (y_pred == True)).sum()   # False Positive
    fn = ((y_test == True) & (y_pred == False)).sum()   # False Negative
    return tp, tn, fp, fn
# Calculate confusion matrix
tp, tn, fp, fn = calculate_confusion_matrix(y_test, y_pred)
print(" Predicted Positive | Predicted Negative")
print(f"Actual Positive |{tp:>19d} |{fn:>19d} ")
print(f"Actual Negative |{fp:>19d} |{tn:>19d} ")
print("")
# Calculate accuracy, precision, and recall
total = len(y_test)
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"Accuracy: {accuracy:>6.2%} (Correctly classified {tp + tn} out of {total})")
print(f"Precision: {precision:>6.2%} (When predicted positive, correct {precision:.0%} of the time)")
print(f"Recall: {recall:>6.2%} (Found {recall:.0%} of all positive cases)")
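As a sanity check, the hand-computed accuracy should match sklearn.metrics.accuracy_score. A sketch with hard-coded hypothetical label vectors:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical toy label vectors standing in for the real y_test and y_pred
y_test = pd.Series([True, False, True, False, True])
y_pred = pd.Series([True, False, False, False, True])

tp = ((y_test == True) & (y_pred == True)).sum()
tn = ((y_test == False) & (y_pred == False)).sum()
accuracy = (tp + tn) / len(y_test)

print(accuracy == accuracy_score(y_test, y_pred))  # True
```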
Make sure you convert y_pred to a pandas Series and set the index to match X_test.
y_pred = pd.Series(model.predict(X_test), index=X_test.index)
If you'd like to see which rows your test failed on, try this:
display(df_test[y_pred != y_test])
# Your Code Here