In this exercise, we'll use decision trees to predict whether a Pokémon is Legendary based on its stats alone.
Legendary Pokémon are a group of incredibly rare and often very powerful Pokémon. They are often featured prominently in the legends and myths of the Pokémon world, with some even going so far as to view them as deities. ~ Bulbapedia
Your goal is to get the most accurate model possible, where accuracy is defined as:
$$ \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Positives} + \text{Negatives}} $$

# JUST RUN THIS
from google.colab import drive
import pandas as pd
drive.mount('/content/gdrive')
# Load the data
df = pd.read_csv('/content/gdrive/MyDrive/datasets/pokemon.csv')
df.sample(5)
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 659 | 598 | Ferrothorn | Grass | Steel | 489 | 74 | 94 | 131 | 54 | 116 | 20 | 5 | False |
| 695 | 634 | Zweilous | Dark | Dragon | 420 | 72 | 85 | 70 | 65 | 70 | 58 | 5 | False |
| 734 | 666 | Vivillon | Bug | Flying | 411 | 80 | 52 | 50 | 90 | 50 | 89 | 6 | False |
| 697 | 636 | Larvesta | Bug | Fire | 360 | 55 | 85 | 55 | 50 | 55 | 60 | 5 | False |
| 38 | 33 | Nidorino | Poison | NaN | 365 | 61 | 72 | 57 | 55 | 55 | 65 | 1 | False |
Click any of the links below to see the scikit-learn documentation for that model.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
from sklearn.svm import SVC
model = SVC()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the train set only
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to the test set
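Putting the scaler and a model together, here is a minimal sketch of the scale-then-fit workflow. The numbers below are made-up stand-ins for the Pokémon stats, just to show the pattern:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data standing in for the real train/test frames
train = pd.DataFrame({"Attack": [50, 60, 150, 160],
                      "Defense": [40, 55, 140, 150],
                      "Legendary": [False, False, True, True]})
test = pd.DataFrame({"Attack": [55, 155], "Defense": [45, 145]})

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(train[["Attack", "Defense"]])  # fit on train only
X_test_scaled = scaler.transform(test[["Attack", "Defense"]])        # reuse train statistics

model = SVC()
model.fit(X_train_scaled, train["Legendary"])
preds = model.predict(X_test_scaled)
print(preds)  # one True/False prediction per test row
```

The key point is that the scaler is fit on the train set alone; calling fit_transform on the test set would leak test statistics into preprocessing.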
To ensure that we're all using the same train and test sets, and that each set contains a reasonable balance of legendary and non-legendary Pokémon, use the Pokémon from generations 1-5 as your train set and the Pokémon from generation 6 as your test set.
df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]
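Because this split is by generation rather than random, it's worth checking that each set actually contains some Legendary Pokémon. A quick sketch of that check, using a tiny made-up frame in place of the real df:

```python
import pandas as pd

# Hypothetical toy data standing in for the full Pokémon dataset
df = pd.DataFrame({"Generation": [1, 2, 6, 6, 5],
                   "Legendary": [False, True, False, True, False]})

df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]

# value_counts shows how many legendary vs. non-legendary rows landed in each split
print(df_train["Legendary"].value_counts())
print(df_test["Legendary"].value_counts())
```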
Feature Matrix (X) and Label Vector (y)

Make sure that your feature matrix, X, contains only numeric features. You can use the double bracket syntax [["col1", "col2"]] to extract multiple columns.
# Example with two features
X_train = df_train[["Attack", "Defense"]]
X_test = df_test[["Attack", "Defense"]]
# You can include more: ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
For the label vector, just use Legendary.
y_train = df_train['Legendary']
y_test = df_test['Legendary']
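With features and labels in place, fitting any of the models above follows the same two-step pattern. A sketch with hypothetical toy values standing in for the real X_train and y_train:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real feature matrix and labels
X_train = pd.DataFrame({"Attack": [50, 60, 150, 160],
                        "Defense": [40, 55, 140, 150]})
y_train = pd.Series([False, False, True, True])

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)  # learn threshold splits from the training data

# Predict for one unseen row with the same columns as X_train
pred = model.predict(pd.DataFrame({"Attack": [155], "Defense": [145]}))
print(pred)
```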
To try to improve your model, experiment with a few different forms of feature engineering to create new features:
df['Total Attack'] = df["Attack"] + df["Sp. Atk"]
df["Special Total"] = df["Sp. Atk"] + df["Sp. Def"]
df['Attack over HP'] = df["Attack"] / df["HP"]
df['Attack Scaled'] = (df["Attack"] - df["Attack"].min()) / (df["Attack"].max() - df["Attack"].min())
df['Attack Normalized'] = (df["Attack"] - df["Attack"].mean()) / df["Attack"].std()
df['Attack Percent'] = df["Attack"] / df["Attack"].max()
df = pd.get_dummies(df, columns=["Type 1"])
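One caveat with get_dummies: if you one-hot encode the train and test frames separately, a type that appears in only one of them produces mismatched columns. Encoding the full frame before splitting sidesteps this, as sketched below with made-up data:

```python
import pandas as pd

# Hypothetical toy data standing in for the full Pokémon dataset
df = pd.DataFrame({"Type 1": ["Grass", "Fire", "Water", "Fire"],
                   "Generation": [1, 2, 6, 6]})

# Encode on the full frame first, then split, so both sets share identical columns
df = pd.get_dummies(df, columns=["Type 1"])
df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]

print(list(df_train.columns) == list(df_test.columns))  # True
```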
Feature engineering may improve the performance of your model.
Here's the code from the previous assignment for calculating the confusion matrix and checking your accuracy.
def calculate_confusion_matrix(y_test, y_pred):
    # Input: y_test and y_pred are boolean Series aligned on the same index
    # Output: returns tp, tn, fp, fn
    tp = ((y_test == True) & (y_pred == True)).sum()    # True Positive
    tn = ((y_test == False) & (y_pred == False)).sum()  # True Negative
    fp = ((y_test == False) & (y_pred == True)).sum()   # False Positive
    fn = ((y_test == True) & (y_pred == False)).sum()   # False Negative
    return tp, tn, fp, fn
# Calculate confusion matrix
tp, tn, fp, fn = calculate_confusion_matrix(y_test, y_pred)
print(" Predicted Positive | Predicted Negative")
print(f"Actual Positive |{tp:>19d} |{fn:>19d} ")
print(f"Actual Negative |{fp:>19d} |{tn:>19d} ")
print("")
# Calculate accuracy, precision, and recall
total = len(y_test)
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"Accuracy: {accuracy:>6.2%} (Correctly classified {tp + tn} out of {total})")
print(f"Precision: {precision:>6.2%} (When predicted positive, correct {precision:.0%} of the time)")
print(f"Recall: {recall:>6.2%} (Found {recall:.0%} of all positive cases)")
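As a sanity check, the hand-computed accuracy should match sklearn.metrics.accuracy_score. A sketch with hard-coded hypothetical label vectors:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical toy label vectors standing in for the real y_test and y_pred
y_test = pd.Series([True, False, True, False, True])
y_pred = pd.Series([True, False, False, False, True])

tp = ((y_test == True) & (y_pred == True)).sum()
tn = ((y_test == False) & (y_pred == False)).sum()
accuracy = (tp + tn) / len(y_test)

print(accuracy == accuracy_score(y_test, y_pred))  # True
```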
Make sure you convert y_pred to a pandas Series and set the index to match X_test.
y_pred = pd.Series(model.predict(X_test), index=X_test.index)
If you'd like to see which rows your test failed on, try this:
display(df_test[y_pred != y_test])
# Your Code Here