In this exercise, we'll use decision trees to predict whether a Pokémon has the "Legendary" moniker based on its stats alone.
> Legendary Pokémon are a group of incredibly rare and often very powerful Pokémon. They are often featured prominently in the legends and myths of the Pokémon world, with some even going so far as to view them as deities. ~ Bulbapedia
Your goal is to build the most accurate model possible, where accuracy is defined as:

$$ \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Positives} + \text{Negatives}} $$

# JUST RUN THIS
from google.colab import drive
import pandas as pd
drive.mount('/content/gdrive')
# Load the data
df = pd.read_csv('/content/gdrive/MyDrive/datasets/pokemon.csv')
df.sample(5)
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 740 | 672 | Skiddo | Grass | NaN | 350 | 66 | 65 | 48 | 62 | 57 | 52 | 6 | False |
| 688 | 627 | Rufflet | Normal | Flying | 350 | 70 | 83 | 50 | 37 | 50 | 60 | 5 | False |
| 251 | 232 | Donphan | Ground | NaN | 500 | 90 | 120 | 120 | 60 | 60 | 50 | 2 | False |
| 52 | 47 | Parasect | Bug | Grass | 405 | 60 | 95 | 80 | 60 | 80 | 30 | 1 | False |
| 469 | 422 | Shellos | Water | NaN | 325 | 76 | 48 | 48 | 57 | 62 | 34 | 4 | False |
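Because legendary Pokémon are rare, a model that predicts False for every row already scores a high accuracy, so it helps to know the class balance before judging your number. A quick optional check, assuming the Legendary column is boolean as in the sample above:

# Fraction of legendary vs. non-legendary Pokemon in the full dataset
print(df["Legendary"].value_counts(normalize=True))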
Click on any of the links below to see the scikit-learn documentation for that model.
# Option 1: decision tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

# Option 2: logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)

# Option 3: support vector classifier
from sklearn.svm import SVC
model = SVC()

# Optional: standardize the features (helps LogisticRegression and SVC)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
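Whichever model you pick, the scikit-learn workflow is the same: construct the classifier, fit it on the training data, then predict on the test data. Here's a minimal sketch of that pattern (decision tree shown; any of the models above can be swapped in), assuming X_train, X_test, and y_train are built as described in the next sections:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)      # learn from the training set
y_pred = model.predict(X_test)   # predict labels for the test set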
To ensure that we're all using the same train and test sets, and that each set contains a reasonable number of both legendary and non-legendary Pokémon, use the Pokémon from generations 1-5 as your train set and the Pokémon from generation 6 as your test set.
df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]
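As an optional sanity check, you can confirm that both splits contain legendary Pokémon:

# Count the legendary Pokemon in each split
print("Train legendaries:", df_train["Legendary"].sum())
print("Test legendaries: ", df_test["Legendary"].sum())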
Feature Matrix (X) and Label Vector (y)

Make sure that your feature matrix, X, contains only numeric features. You can use the double-bracket syntax [["col1", "col2"]] to extract multiple columns.
# Example with two features
X_train = df_train[["Attack", "Defense"]]
X_test = df_test[["Attack", "Defense"]]
# You can include more: ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
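If you'd rather not type column names by hand, one alternative sketch (assuming the column names shown in the sample above) selects every numeric column and drops the ones that shouldn't be features:

# Bool columns like Legendary are not picked up by include="number"
feature_cols = df_train.select_dtypes(include="number").columns.drop(["#", "Generation"])
X_train = df_train[feature_cols]
X_test = df_test[feature_cols]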
For the label vector, just use the Legendary column.
y_train = df_train['Legendary']
y_test = df_test['Legendary']
To try to improve your model, experiment with a few different forms of feature engineering to create new features:
df['Total Attack'] = df["Attack"] + df["Sp. Atk"]      # physical + special attack
df["Special Total"] = df["Sp. Atk"] + df["Sp. Def"]    # combined special stats
df['Attack over HP'] = df["Attack"] / df["HP"]         # ratio of attack to HP
df['Attack Scaled'] = (df["Attack"] - df["Attack"].min()) / (df["Attack"].max() - df["Attack"].min())  # min-max scaling to [0, 1]
df['Attack Normalized'] = (df["Attack"] - df["Attack"].mean()) / df["Attack"].std()  # z-score standardization
df['Attack Percent'] = df["Attack"] / df["Attack"].max()  # fraction of the maximum attack
df = pd.get_dummies(df, columns=["Type 1"])            # one-hot encode the primary type
Feature engineering may improve the performance of your model.
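Note that pd.get_dummies replaces Type 1 with one column per type, named like Type 1_Dragon, Type 1_Water, and so on. If you go that route, a prefix filter is one way to pull those columns into your feature list (a sketch, assuming the get_dummies call above runs before you split into train and test):

# Collect the one-hot columns produced by get_dummies
type_cols = [c for c in df.columns if c.startswith("Type 1_")]
features = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"] + type_cols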
Here's the code from the previous assignment for calculating the confusion matrix and checking your accuracy.
def calculate_confusion_matrix(y_test, y_pred):
    # Input: y_test and y_pred are boolean Series aligned on the same index
    # Output: returns tp, tn, fp, fn
    tp = ((y_test == True) & (y_pred == True)).sum()    # True Positive
    tn = ((y_test == False) & (y_pred == False)).sum()  # True Negative
    fp = ((y_test == False) & (y_pred == True)).sum()   # False Positive
    fn = ((y_test == True) & (y_pred == False)).sum()   # False Negative
    return tp, tn, fp, fn
# Calculate confusion matrix
tp, tn, fp, fn = calculate_confusion_matrix(y_test, y_pred)
print(" Predicted Positive | Predicted Negative")
print(f"Actual Positive |{tp:>19d} |{fn:>19d} ")
print(f"Actual Negative |{fp:>19d} |{tn:>19d} ")
print("")
# Calculate accuracy, precision, and recall
total = len(y_test)
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"Accuracy: {accuracy:>6.2%} (Correctly classified {tp + tn} out of {total})")
print(f"Precision: {precision:>6.2%} (When predicted positive, correct {precision:.0%} of the time)")
print(f"Recall: {recall:>6.2%} (Found {recall:.0%} of all positive cases)")
Make sure you convert y_pred to a pandas Series and set the index to match X_test.
y_pred = pd.Series(model.predict(X_test), index=X_test.index)
If you'd like to see which rows your test failed on, try this:
display(df_test[y_pred != y_test])
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
def name_length(name):
    return len(name)

df["Total"] = df["HP"] + df["Attack"] + df["Defense"] + df["Sp. Atk"] + df["Sp. Def"] + df["Speed"]  # base stat total
df["Atk Ratio"] = df["Attack"] / df["Sp. Atk"]      # physical vs. special attack balance
df["Type 1 Dragon"] = df["Type 1"] == "Dragon"      # True when the primary type is Dragon
df["Total is 600"] = df["Total"] == 600             # True when the base stat total is exactly 600
df["Name Length"] = df["Name"].apply(name_length)   # character length of the name
df_train = df[df["Generation"] != 6]
df_test = df[df["Generation"] == 6]
features = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed", "Total", "Atk Ratio", "Type 1 Dragon", "Total is 600", "Name Length"]
X_train = df_train[features]
X_test = df_test[features]
y_train = df_train['Legendary']
y_test = df_test['Legendary']
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test = scaler.transform(X_test)        # apply the same scaling to the test data
model = SVC()
model.fit(X_train, y_train)
y_pred = pd.Series(model.predict(X_test), index=y_test.index)
def calculate_confusion_matrix(y_test, y_pred):
    # Input: y_test and y_pred are boolean Series aligned on the same index
    # Output: returns tp, tn, fp, fn
    tp = ((y_test == True) & (y_pred == True)).sum()    # True Positive
    tn = ((y_test == False) & (y_pred == False)).sum()  # True Negative
    fp = ((y_test == False) & (y_pred == True)).sum()   # False Positive
    fn = ((y_test == True) & (y_pred == False)).sum()   # False Negative
    return tp, tn, fp, fn
# Calculate confusion matrix
tp, tn, fp, fn = calculate_confusion_matrix(y_test, y_pred)
print(" Predicted Positive | Predicted Negative")
print(f"Actual Positive |{tp:>19d} |{fn:>19d} ")
print(f"Actual Negative |{fp:>19d} |{tn:>19d} ")
print("")
# Calculate accuracy, precision, and recall
total = len(y_test)
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"Accuracy: {accuracy:>6.2%} (Correctly classified {tp + tn} out of {total})")
print(f"Precision: {precision:>6.2%} (When predicted positive, correct {precision:.0%} of the time)")
print(f"Recall: {recall:>6.2%} (Found {recall:.0%} of all positive cases)")
 Predicted Positive | Predicted Negative
Actual Positive |                  3 |                  5
Actual Negative |                  1 |                 73

Accuracy: 92.68% (Correctly classified 76 out of 82)
Precision: 75.00% (When predicted positive, correct 75% of the time)
Recall: 37.50% (Found 38% of all positive cases)
df_test["Pred. Legendary"] = y_pred
display(df_test[y_pred != y_test])
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | Atk Ratio | Type 1 Dragon | Total is 600 | Name Length | Pred. Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 776 | 706 | Goodra | Dragon | NaN | 600 | 90 | 100 | 70 | 110 | 150 | 80 | 6 | False | 0.909091 | True | True | 6 | True |
| 794 | 718 | Zygarde50% Forme | Dragon | Ground | 600 | 108 | 100 | 121 | 81 | 95 | 95 | 6 | True | 1.234568 | True | True | 16 | False |
| 795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True | 1.000000 | False | True | 7 | False |
| 796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True | 1.000000 | False | False | 19 | False |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True | 0.733333 | False | True | 19 | False |
| 799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True | 0.846154 | False | True | 9 | False |
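Recall is the weak spot in this run: the model finds only 3 of the 8 generation-6 legendaries. One optional knob to experiment with is the class_weight parameter, which reweights the rare legendary class more heavily; a minimal sketch, assuming the same scaled features as above:

from sklearn.svm import SVC

# class_weight="balanced" weights classes inversely to their frequency,
# which can trade some precision for better recall on the rare class
model = SVC(class_weight="balanced")
model.fit(X_train, y_train)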