Visualization of Pokémon Data and Battle Simulation

Contents

1. Data pre-processing

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

1.1 Introduction of the dataset

The raw_data used here was downloaded from Kaggle (https://www.kaggle.com/rounakbanik/pokemon) and contains information on 801 Pokémon from the first seven generations. (The latest edition is now in its eighth generation.) However, the full list of the first seven generations should contain 809 Pokémon, so the missing entries were filled in from https://wiki.52poke.com/wiki/%E4%B8%BB%E9%A1%B5. The dataset provides 41 indicators, including name, base stats, type, performance against other types and some other characteristics, but only a few key indicators will be analysed here.
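A minimal sketch of how the missing entries could have been appended: concatenate the supplementary rows with the Kaggle data and re-sort by Pokédex number. The two toy rows below are illustrative stand-ins, not the actual files or missing Pokémon.

```python
import pandas as pd

# toy stand-ins for the Kaggle file and the hand-collected supplement
kaggle = pd.DataFrame({'name': ['Bulbasaur'], 'pokedex_number': [1]})
missing = pd.DataFrame({'name': ['Meltan'], 'pokedex_number': [808]})

# concatenate and restore National Pokédex order
full = (pd.concat([kaggle, missing], ignore_index=True)
          .sort_values('pokedex_number')
          .reset_index(drop=True))
print(full)
```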

In [2]:
raw_data = pd.read_csv('Pokemon.csv')
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 809 entries, 0 to 808
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   abilities          809 non-null    object 
 1   against_bug        809 non-null    float64
 2   against_dark       809 non-null    float64
 3   against_dragon     809 non-null    float64
 4   against_electric   809 non-null    float64
 5   against_fairy      809 non-null    float64
 6   against_fighting   809 non-null    float64
 7   against_fire       809 non-null    float64
 8   against_flying     809 non-null    float64
 9   against_ghost      809 non-null    float64
 10  against_grass      809 non-null    float64
 11  against_ground     809 non-null    float64
 12  against_ice        809 non-null    float64
 13  against_normal     809 non-null    float64
 14  against_poison     809 non-null    float64
 15  against_psychic    809 non-null    float64
 16  against_rock       809 non-null    float64
 17  against_steel      809 non-null    float64
 18  against_water      809 non-null    float64
 19  attack             809 non-null    int64  
 20  base_egg_steps     809 non-null    int64  
 21  base_happiness     809 non-null    int64  
 22  base_total         809 non-null    int64  
 23  capture_rate       809 non-null    object 
 24  classfication      809 non-null    object 
 25  defense            809 non-null    int64  
 26  experience_growth  809 non-null    int64  
 27  height_m           794 non-null    float64
 28  hp                 809 non-null    int64  
 29  japanese_name      809 non-null    object 
 30  name               809 non-null    object 
 31  percentage_male    703 non-null    float64
 32  pokedex_number     809 non-null    int64  
 33  sp_attack          809 non-null    int64  
 34  sp_defense         809 non-null    int64  
 35  speed              809 non-null    int64  
 36  type1              809 non-null    object 
 37  type2              414 non-null    object 
 38  weight_kg          794 non-null    float64
 39  generation         809 non-null    int64  
 40  legendary          809 non-null    int64  
dtypes: float64(21), int64(13), object(7)
memory usage: 259.3+ KB

1.2 Data cleaning

The following indicators will be selected for further analysis:

  • 'generation': indicates the generation in which the Pokémon was first introduced
  • 'legendary': indicates whether a Pokémon is a legendary one; 1 means legendary, 0 means ordinary
  • 'name': English name of the Pokémon
  • 'type1', 'type2': the type(s) of the Pokémon; some have a dual type
  • 'hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'base_total': the six base stats and their sum
  • 'against_*': performance against other types ('*' stands for each type) (back to 3.1)

Since the 'against_' prefix makes the column names unwieldy, these columns will be renamed to just '*'.

In [3]:
# use `.replace` to cut out the "against_"
raw_data.columns = raw_data.columns.map(lambda x: x.replace("against_",""))

Use the following list main to extract the data from raw_data. It's worth noting that the dataset is already sorted in Pokédex order, so there is no need to reset the index. Moreover, type_list is a list of the performance-against-other-types columns, but it can also serve as a list of type names in section 2.2. And we can tell from raw_data.info() that only 'type2' has missing values, which means that 395 Pokémon have only one type. (back to 2.3)

In [4]:
# `main` is the concatenation of three lists
basic_list = ['generation', 'legendary', 'name', 'type1', 'type2']
stats_list = ['base_total', 'hp', 'attack', 'defense', 
              'sp_attack', 'sp_defense', 'speed']
type_list = list(raw_data.columns[1:19]) 
main = basic_list + stats_list + type_list 

data = raw_data.loc[:,main]
data.head()
Out[4]:
generation legendary name type1 type2 base_total hp attack defense sp_attack ... ghost grass ground ice normal poison psychic rock steel water
0 1 0 Bulbasaur grass poison 318 45 49 49 65 ... 1.0 0.25 1.0 2.0 1.0 1.0 2.0 1.0 1.0 0.5
1 1 0 Ivysaur grass poison 405 60 62 63 80 ... 1.0 0.25 1.0 2.0 1.0 1.0 2.0 1.0 1.0 0.5
2 1 0 Venusaur grass poison 625 80 100 123 122 ... 1.0 0.25 1.0 2.0 1.0 1.0 2.0 1.0 1.0 0.5
3 1 0 Charmander fire NaN 309 39 52 43 60 ... 1.0 0.50 2.0 0.5 1.0 1.0 1.0 2.0 0.5 2.0
4 1 0 Charmeleon fire NaN 405 58 64 58 80 ... 1.0 0.50 2.0 0.5 1.0 1.0 1.0 2.0 0.5 2.0

5 rows × 30 columns

2. Overall data visualization

2.1 Generation and legendary

We will count the number of Pokémon in each generation, using legendary status as a subdivision. Firstly, we create ordinary and legendary to store the counts of the two kinds of Pokémon.

In [5]:
%%time
# create two empty arrays to store the counts
ordinary = np.zeros(7)
legendary = np.zeros(7)

# run a loop for each generation
for g in range(7):
    for l in [0,1]:
        # give the condition to extract the corresponding data
        judge = (data['generation'] == g+1) & (data['legendary'] == l)
        
        if l == 0:
            # get the ordinary Pokémon in each generation
            ordinary[g] = data.loc[judge]['name'].count()
        else:
            # get the legendary Pokémon in each generation
            legendary[g] = data.loc[judge]['name'].count()

# the elements of the two arrays are floats,
# but counts should be ints
ordinary = ordinary.astype(int)
legendary = legendary.astype(int)

# create a dataframe
total = ordinary + legendary
index = np.array(('ordinary','legendary', 'total'))
columns = np.array([f'G{i+1}' for i in range(7)])
pd.DataFrame(data=(ordinary, legendary, total),
             index=index, columns=columns)
Wall time: 17.1 ms
Out[5]:
G1 G2 G3 G4 G5 G6 G7
ordinary 146 94 125 94 143 66 63
legendary 5 6 10 13 13 6 25
total 151 100 135 107 156 72 88
In [6]:
%%time
# a shorter version, equivalent to the cell above
ordinary_and_legendary = np.zeros((7,3))

# use `.groupby` to divide into different categories
gb_gen = data.groupby("generation")
for i in range(7):
    gg_gen = gb_gen.get_group(i+1)
    
    # the total must be counted after the first `.groupby`
    ordinary_and_legendary[i,2] = gg_gen['name'].count()
    
    # second `.groupby`
    gb_gen_leg = gg_gen.groupby('legendary')
    for j in range(2):
        gg_gen_leg = gb_gen_leg.get_group(j)
        ordinary_and_legendary[i,j] = gg_gen_leg['name'].count()
    
pd.DataFrame(data=ordinary_and_legendary.T.astype(int), 
             index=index, columns=columns)
Wall time: 15.3 ms
Out[6]:
G1 G2 G3 G4 G5 G6 G7
ordinary 146 94 125 94 143 66 63
legendary 5 6 10 13 13 6 25
total 151 100 135 107 156 72 88
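Both of the loop-based approaches above can also be collapsed into a single groupby/unstack chain. A sketch with a small stand-in dataframe (the toy values are illustrative, not the real counts):

```python
import pandas as pd

# toy stand-in for `data`, keeping only the two columns the count needs
data = pd.DataFrame({
    'generation': [1, 1, 1, 2, 2],
    'legendary':  [0, 0, 1, 0, 1],
})

# one chain instead of the nested loops:
# count the Pokémon in each (legendary, generation) cell
counts = (data.groupby(['legendary', 'generation'])
              .size()
              .unstack(fill_value=0))

# append the per-generation totals as a third row
counts.loc['total'] = counts.sum()
print(counts)
```

On the real data this reproduces the table above in a few lines, with rows 0 (ordinary), 1 (legendary) and 'total', and one column per generation.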

Then we will use a stacked bar chart to present the data.

In [7]:
# give the width and colors of the bars
width = 0.5
colors = ['#008080', '#48D1CC']

# create a figure and set the size
fig, ax = plt.subplots(figsize=[10,8]) 

# plot the stacked bar chart
p1 = ax.bar(range(7), ordinary, width, color=colors[0])
p2 = ax.bar(range(7), legendary, width, bottom=ordinary, color=colors[1])

# other setting of the chart
ax.set_title('Quantity of Pokémon in each generation', fontsize='xx-large')
ax.set_xlabel('Generation', fontsize='large')
ax.set_ylabel('Quantity', fontsize='large')
ax.set_xticks(range(7))
ax.set_xticklabels(columns)
ax.legend((p1, p2), ('Ordinary', 'Legendary'), fontsize='large')
ax.grid(axis='y', ls=':')

# mark the numbers in the bar chart
for i in range(7):
    a = ordinary[i]
    b = total[i]
    ax.text(i, a, f'{a}', fontsize='large', horizontalalignment='center')
    ax.text(i, b, f'{b}', fontsize='large', horizontalalignment='center')
  • We can find that the number of Pokémon does not vary systematically with generation, and the same holds for both kinds of Pokémon. Generation 5 has the most and generation 6 the least. In general, there is an upward trend in the proportion of legendary Pokémon.

2.2 Type

Although we could use the method from section 2.1 to process the type data and count the number of Pokémon of each type, there are two reasons not to. First, there are 18 types, which makes it hard to read off the number of each type when they are all stacked in a single bar. Second, some Pokémon have two types, and the information in 'type2' would be ignored by the method above. Here we extract the columns 'type1' and 'type2' from data to create type_df and count the number of Pokémon for every combination of types. (back to 3.1)

In [8]:
type_df = data.loc[:,['type1','type2']]

# For single-type Pokémon, 'type2' is missing at first,
# so set 'type2' equal to 'type1' to represent their type.

fill = type_df.loc[:,'type1']
type_df.loc[:,'type2'].fillna(fill, inplace = True)
type_df.T
Out[8]:
0 1 2 3 4 5 6 7 8 9 ... 799 800 801 802 803 804 805 806 807 808
type1 grass grass grass fire fire fire water water water bug ... psychic steel fighting poison poison rock fire electric steel steel
type2 poison poison poison fire fire flying water water water bug ... psychic fairy ghost poison dragon steel ghost electric steel steel

2 rows × 809 columns

Then we will check each Pokémon's type and count the number of each single type and each dual type. Here we use the type_list defined before to name origin's index and columns. (back to 1.2)

In [9]:
# create empty dataframe
data0 = np.zeros((18,18))
origin = pd.DataFrame(data=data0, index=type_list, 
                      columns=type_list, dtype='int32')

# get each Pokémon's type information
type1 = type_df.loc[:,'type1'].values
type2 = type_df.loc[:,'type2'].values

# fill in the information to the dataframe 
for i,j in zip(type1,type2):
    origin.loc[i,j] += 1

origin
Out[9]:
bug dark dragon electric fairy fighting fire flying ghost grass ground ice normal poison psychic rock steel water
bug 18 0 0 4 2 3 2 13 1 6 1 0 0 11 0 3 5 3
dark 0 9 4 0 0 2 2 5 1 0 0 2 0 0 2 0 2 0
dragon 0 0 12 1 0 2 1 4 0 0 4 1 0 0 2 0 0 0
electric 0 0 0 28 2 0 0 3 1 0 0 0 2 0 0 0 4 0
fairy 0 0 0 0 16 0 0 2 0 0 0 0 0 0 0 0 0 0
fighting 0 1 0 0 0 22 0 1 1 0 0 1 0 0 2 0 1 0
fire 0 1 1 0 0 6 28 6 1 0 2 2 2 0 1 1 1 1
flying 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
ghost 0 1 1 0 1 0 3 2 9 5 2 0 0 3 0 0 0 0
grass 0 3 0 0 5 3 0 6 1 38 1 2 0 14 2 0 3 0
ground 0 3 2 1 0 0 1 3 2 0 12 2 0 0 2 3 1 0
ice 0 0 0 0 0 0 0 2 1 0 3 12 0 0 2 0 0 3
normal 0 4 1 0 4 2 0 26 0 2 1 0 61 0 3 0 0 1
poison 1 3 2 0 0 2 2 3 0 0 2 0 0 16 0 0 0 3
psychic 0 0 0 0 6 1 1 6 2 1 0 0 0 0 35 0 1 0
rock 2 1 2 0 2 1 0 4 0 2 6 2 0 1 2 11 4 6
steel 0 0 1 0 3 1 0 2 3 0 1 0 0 0 6 3 6 0
water 2 4 2 2 4 2 0 7 2 3 9 3 0 3 5 4 1 61
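The double loop that fills origin can also be replaced by pd.crosstab, which builds the same type1-by-type2 count matrix in one call. A sketch with a small stand-in for type_df (toy rows, not the real counts):

```python
import pandas as pd

# toy stand-in for `type_df` after the fillna step
type_df = pd.DataFrame({
    'type1': ['grass', 'grass', 'fire', 'water'],
    'type2': ['poison', 'poison', 'fire', 'water'],
})

# crosstab counts every (type1, type2) combination in one call
origin = pd.crosstab(type_df['type1'], type_df['type2'])
print(origin)
```

Since crosstab only emits the combinations that actually occur, on the real data a `.reindex(index=type_list, columns=type_list, fill_value=0)` would be needed to recover the full 18 × 18 table.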

The table itself is not very intuitive, so we use a heatmap to present the data. imshow is one way of drawing a heatmap but is somewhat involved, so seaborn is used to streamline the code: it is built on top of matplotlib and creates such visualizations with less code.

In [10]:
# ignore the FutureWarning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# single type and dual type
single_type = (type_df.loc[:,'type1'] == type_df.loc[:,'type2']).sum()
total = origin.sum().sum()
print(f'The number of single-type Pokémon is {single_type}.')
print('The proportion of single-type Pokémon is {0:.2%}'.format(single_type/total))

# heatmap
fig,ax=plt.subplots(figsize=(18,18))
import seaborn as sns
sns.heatmap(origin,annot=True,cmap='GnBu',square=True)
ax.set_title('Type of the Pokémon', fontsize='xx-large');
The number of single-type Pokémon is 395.
The proportion of single-type Pokémon is 48.83%
  • The vertical axis represents the primary type and the horizontal axis represents the secondary type.
  • We can find that the squares on the diagonal are generally darker than the others, which indicates that Pokémon types are concentrated in single types, while the distribution of dual types is relatively scattered. The proportion of single-type Pokémon is 48.83%, and 'normal' and 'water' are the most numerous types.
  • Although there are many dual-type combinations, the differences in quantity are still large. For example, there are 26 'normal-flying' Pokémon while some other dual types have none, which means that some types are hard to combine.

2.3 Base stats

There are six base stats for each Pokémon that indicate its ability. Here we will give some general statistics of these values. The stats_list from section 1.2 will be used to extract the base stats from the data. (back to 2.4)

In [11]:
stats_df = data.loc[:,['name']+stats_list].set_index('name')
stats_df.head()
Out[11]:
base_total hp attack defense sp_attack sp_defense speed
name
Bulbasaur 318 45 49 49 65 65 45
Ivysaur 405 60 62 63 80 80 60
Venusaur 625 80 100 123 122 120 80
Charmander 309 39 52 43 60 50 65
Charmeleon 405 58 64 58 80 65 80
In [12]:
# statistic of base stats
stats_des = stats_df.describe()

# give a 'formatter' to keep two decimal places
formatter = '{:.2f}'.format
stats_des.applymap(formatter)
Out[12]:
base_total hp attack defense sp_attack sp_defense speed
count 809.00 809.00 809.00 809.00 809.00 809.00 809.00
mean 429.33 69.03 78.14 73.23 71.50 70.94 66.48
std 119.43 26.58 32.25 31.11 32.41 27.87 29.17
min 180.00 1.00 5.00 5.00 10.00 20.00 5.00
25% 323.00 50.00 55.00 50.00 46.00 50.00 45.00
50% 440.00 65.00 75.00 70.00 65.00 66.00 65.00
75% 509.00 80.00 100.00 90.00 92.00 90.00 87.00
max 780.00 255.00 185.00 230.00 194.00 230.00 180.00

stats_df is a DataFrame of base stats.

stats_des is some statistics on stats_df.

Then we will use a boxplot to present the data to get a general idea of the distribution of the base stats.

In [13]:
# some setting for boxplot
boxprops = {'facecolor':'#008080'}
flierprops = {'markerfacecolor':'#48D1CC'} 
medianprops = {'color':'orange'}
meanprops = {'marker':'D','markerfacecolor':'white'}

# ax0 for base total
ax0 = plt.axes([0.1, 0.1, 0.3, 1.2])
ax0.set_title('base total', fontsize='x-large')
ax0.set_xticklabels(['total'], fontsize='large')
ax0.grid(axis='y', ls=':')

ax0.boxplot(stats_df.loc[:, 'base_total'], patch_artist=True, 
            showmeans=True, flierprops=flierprops, 
            medianprops=medianprops, meanprops=meanprops)

# ax1 for six base stats
ax1 = plt.axes([0.5, 0.1, 1.8, 1.2])
ax1.set_title('base stats', fontsize='x-large')
ax1data = [stats_df.loc[:, i] for i in stats_list[1:]]
ax1.set_xticklabels(stats_list[1:], fontsize='large')
ax1.grid(axis='y', ls=':')

# patch_artist: fill the box with color ; showmeans: show the mean
# boxprops: box set; flierprops: extremum set; 
# medianprops: median set; meanprops: mean set
ax1.boxplot(ax1data, patch_artist=True, showmeans=True, 
            boxprops=boxprops, flierprops=flierprops, 
            medianprops=medianprops, meanprops=meanprops);
  • As we can see in the dataframe and the boxplot, base_total varies widely between Pokémon, ranging from 180 to 780.
  • Each individual stat also deviates widely, and extreme values occur more often in individual stats than in overall strength.
  • Generally speaking, a Pokémon does not reach very high levels in all stats but excels in some particular ability; it might have a high 'attack' but a low 'defense'. That is why the individual stats have so many extreme values.
  • Some Pokémon at an early evolutionary stage may have low values in every stat, while some legendary ones may be prominent in every stat.
In [14]:
# kernel density of 'base_total'
ax0 = plt.axes([0.1, 0.1, 1.1, 1.2])
stats_df['base_total'].plot.density(xlim=(0, 850), label='base_total')
ax0.legend(fontsize='x-large')
ax0.grid()

# kernel density of six stats
ax1 = plt.axes([1.4, 0.1, 1.6, 1.2])
for i in stats_df.columns[1:]:
    stats_df[i].plot.density(xlim=(0, 260), label=i)
ax1.legend(fontsize='x-large')
ax1.grid();
  • We can find two peaks at roughly 300 and 500, which indicates that weaker Pokémon's 'base_total' is concentrated around 300, while stronger Pokémon's is concentrated around 500.
  • The six stats are all right-skewed, which indicates that the mean lies to the right of the mode and large extreme values are relatively common. This is consistent with the conclusions above.
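The right-skew claim can be checked numerically: pandas' .skew() returns a positive sample skewness for a right-skewed distribution. A sketch on a toy sample (the values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# a toy right-skewed sample standing in for one stats column
s = pd.Series([40, 45, 50, 50, 55, 60, 65, 150, 180])

# positive skewness and mean > median both indicate right skew,
# matching the description of the six stats above
print(s.skew())
print(s.mean() > s.median())
```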

2.4 Categorizations by base stats

Pokémon can be roughly divided into categories according to their base stats (not an official rule). We only choose Pokémon with 'base_total' >= the 75% quantile for analysis, because a chart with more than 800 observations would be cluttered. This section uses the stats_df and stats_des from section 2.3.

  • aggressive_index = 0.6 * max{'attack', 'sp_attack'} + 0.4 * 'speed'
  • tough_index = 0.6 * max{'defense', 'sp_defense'} + 0.4 * 'hp'
  • If a Pokémon has higher 'aggressive_index', we will call it 'aggressive'.
  • If a Pokémon has higher 'tough_index', we will call it 'tough'.
  • If a Pokémon has exceptionally high 'aggressive_index' or 'tough_index', we will call it 'special'.
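The two definitions above can also be computed in a vectorized way instead of row by row. A sketch on a two-row stand-in for stats_df, with stats copied from the special_data table in section 2.5:

```python
import pandas as pd

# two-row stand-in for stats_df (values from special_data, Out[19])
stats_df = pd.DataFrame({
    'attack':     [50, 150],
    'sp_attack':  [175, 194],
    'speed':      [150, 140],
    'defense':    [65, 70],
    'sp_defense': [105, 120],
    'hp':         [55, 106],
}, index=['Alakazam', 'Mewtwo'])

# vectorized form of the two index definitions above
index = pd.DataFrame({
    'aggressive_index': 0.6 * stats_df[['attack', 'sp_attack']].max(axis=1)
                        + 0.4 * stats_df['speed'],
    'tough_index': 0.6 * stats_df[['defense', 'sp_defense']].max(axis=1)
                   + 0.4 * stats_df['hp'],
})
print(index)
```

The resulting values agree with the special_data table in section 2.5 (e.g. 165.0 and 85.0 for Alakazam).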
In [15]:
def aggressive_index(s):  
    if s['attack'] > s['sp_attack']:
        aggressive_index = 0.6*s['attack'] + 0.4*s['speed']
    else:
        aggressive_index = 0.6*s['sp_attack'] + 0.4*s['speed']
    return aggressive_index

def tough_index(s):
    if s['defense'] > s['sp_defense']:
        tough_index = 0.6*s['defense'] + 0.4*s['hp']
    else:
        tough_index = 0.6*s['sp_defense'] + 0.4*s['hp']
    return tough_index

Use .agg to apply the functions defined above, each of which combines multiple values into a single value, and we get a dataframe containing every Pokémon's 'aggressive_index' and 'tough_index'. Because .describe() doesn't report μ+3σ, we compute a_special and t_special as the critical values for 'special'.

In [16]:
# compute the two index
# stats_df is from section 2.3
index = stats_df.agg((aggressive_index, tough_index), axis=1)

# μ+3σ 'special' critical value
a_special = index['aggressive_index'].mean() + 3*index['aggressive_index'].std()
t_special = index['tough_index'].mean() + 3*index['tough_index'].std()

index.describe()
Out[16]:
aggressive_index tough_index
count 809.000000 809.000000
mean 79.812608 76.090729
std 26.664726 24.544133
min 8.000000 27.400000
25% 59.800000 57.200000
50% 78.000000 74.800000
75% 97.600000 92.000000
max 172.400000 183.000000
In [17]:
# choose the Pokémon which 'base_total' >= 75% quantile and get their index
# stats_df and stats_des are from section 2.3
stats_high = stats_df.loc[stats_df['base_total'] >= stats_des.loc['75%','base_total']]
index_high = stats_high.agg((aggressive_index, tough_index), axis=1)

stats_high is a subset of stats_df.

index_high is a subset of index.

In [18]:
fig, ax = plt.subplots(figsize=(14,8))

# scatter plot
ax.scatter(index_high['aggressive_index'], 
           index_high['tough_index'], color='#6A5ACD')

# 'aggressive_index' == 'tough_index'
ax.plot(range(45,180),range(45,180), 
        color="orange", linestyle=':', label='equilibrium')

# indicate the mean and special critical value
ax.vlines(index['aggressive_index'].mean(), 45, 180, 
          color="green", linestyle=':')
ax.hlines(index['tough_index'].mean(), 45, 180, 
          color="green", linestyle=':', label='mean')
ax.vlines(a_special, 45, 180, color="red", linestyle=':')
ax.hlines(t_special, 45, 180, color="red", linestyle=':', label='μ+3σ')

# other settings
ax.set_xlabel('aggressive index', fontsize='x-large')
ax.set_ylabel('tough index', fontsize='x-large')
ax.set_title('Aggressive and tough index', fontsize='xx-large')
ax.legend(fontsize='large')

# mark the special Pokémons
for i in index_high.index:
    a = index_high.loc[i,'aggressive_index']
    t = index_high.loc[i,'tough_index']
    
    # This pair is annotated separately to avoid overlapping annotation text.
    if (i == 'Regirock')|(i == 'Regice'):
        ax.annotate('Regirock/Regice',(a,t), xytext=(-150, 10), 
                    textcoords="offset points", fontsize='large',
                    arrowprops=dict(arrowstyle="->"))
        
    elif (a > a_special)|(t > t_special):
        ax.text(a, t, i, fontsize='large') 
  • Most of the selected Pokémon fall in the region enclosed by the green and red lines in the figure. That means a Pokémon with a high 'base_total' (>= 75% quantile) is more likely to have above-mean values of both indices.
  • Relatively speaking, the dots above the dotted yellow line indicate 'tough' Pokémon and those below it indicate 'aggressive' ones, so the 'aggressive' Pokémon are in the majority in this sample.
  • Nine dots fall outside the dotted red lines, which means those Pokémon are 'special' in specific stats. But this figure does not show how their six individual stats are distributed, so we conduct further analysis in section 2.5.

2.5 Stats radar map

This section selects the nine 'special' Pokémon's information as a sample special_data, and creates two more dataframes in the same way: all_data (all Pokémon) and high_data (the top 25% of Pokémon).

In [19]:
# get those special Pokémon's base stats and index
a = index_high.loc[:,'aggressive_index']
t = index_high.loc[:,'tough_index']
stats_special = stats_high.loc[(a > a_special)|(t > t_special)]
index_special = index_high.loc[(a > a_special)|(t > t_special)]

# create three dataframe
special_data = pd.merge(stats_special, index_special, 
                        left_on='name', right_index=True)
all_data = pd.merge(stats_df, index, 
                        left_on='name', right_index=True)
high_data = pd.merge(stats_high, index_high, 
                        left_on='name', right_index=True)

special_data
Out[19]:
base_total hp attack defense sp_attack sp_defense speed aggressive_index tough_index
name
Alakazam 600 55 50 65 175 105 150 165.0 85.0
Mewtwo 780 106 150 70 194 120 140 172.4 114.4
Steelix 610 75 125 230 55 95 30 87.0 168.0
Blissey 540 255 10 10 75 135 55 67.0 183.0
Aggron 630 70 140 230 60 80 50 104.0 166.0
Regirock 580 80 100 200 50 100 50 80.0 152.0
Regice 580 80 50 100 100 200 50 80.0 152.0
Zygarde 708 216 100 121 91 95 85 94.0 159.0
Stakataka 570 61 131 211 53 101 13 83.8 151.0
In [20]:
# 'aggresive'
# basic settings
ax0 = plt.axes([0.1, 0.1, 0.8, 0.9])
ax0.set_title('Alakazam aggressive performance', fontsize = 'x-large')

# get the specific value of the Pokémon
a_ = ['attack', 'sp_attack', 'speed', 'aggressive_index']
aggressive = special_data.loc['Alakazam'][a_]

# find the criteria for comparison
aggressive_all_mean = all_data.mean()[a_]
aggressive_high_mean = high_data.mean()[a_]

# scatter plot 
ax0.scatter(np.array(a_), aggressive, label='Alakazam')
ax0.scatter(np.array(a_), aggressive_all_mean, label='all_mean')
ax0.scatter(np.array(a_), aggressive_high_mean, label='high_mean')

# give some tag lines and legend
ax0.vlines(np.array(a_), ymin=0, ymax=aggressive, color='orange', 
           alpha=0.7, linewidth=2, linestyle=':')
ax0.legend(fontsize = 'medium')

# 'tough'
# nearly the same as above
ax1 = plt.axes([1.1, 0.1, 0.8, 0.9])
ax1.set_title('Blissey tough performance', fontsize = 'x-large')

t_ = ['defense', 'sp_defense', 'hp', 'tough_index']
tough = special_data.loc['Blissey'][t_]

tough_mean_all = all_data.mean()[t_]
tough_mean_high = high_data.mean()[t_]

ax1.scatter(np.array(t_), tough, label='Blissey')
ax1.scatter(np.array(t_), tough_mean_all, label='all_mean')
ax1.scatter(np.array(t_), tough_mean_high, label='high_mean')

ax1.vlines(np.array(t_), ymin=0, ymax=tough, color='orange', 
           alpha=0.7, linewidth=2, linestyle=':')
ax1.legend(fontsize = 'medium', loc='upper left');
  • From the data, it can be seen that some Pokémon are 'aggressive' thanks to their advantage in 'sp_attack' and 'speed' but are relatively mediocre in 'attack', such as Alakazam. Some Pokémon are 'tough' because of their superior 'hp' but are less able to resist attacks, such as Blissey.

Therefore, the simple categorization into 'aggressive' and 'tough' is a bit too coarse, and a more detailed categorization is clearly needed. Here we will plot the six stats on a radar map to see visually where each Pokémon stands out. To prevent labels from overlapping, we rename 'sp_attack' and 'sp_defense' to 'sp_a' and 'sp_d'.

In [21]:
# remove the 'base_total'
df0 = stats_special[stats_special.columns[1:]]
df0 = df0.rename(columns={'sp_attack':'sp_a', 'sp_defense':'sp_d'})
df0  # what we will input into the following function 
Out[21]:
hp attack defense sp_a sp_d speed
name
Alakazam 55 50 65 175 105 150
Mewtwo 106 150 70 194 120 140
Steelix 75 125 230 55 95 30
Blissey 255 10 10 75 135 55
Aggron 70 140 230 60 80 50
Regirock 80 100 200 50 100 50
Regice 80 50 100 100 200 50
Zygarde 216 100 121 91 95 85
Stakataka 61 131 211 53 101 13
In [22]:
# get the boundary values of the map
def boundary(df):
    # equivalent to scanning every cell: the radial axis
    # never starts above 0 or ends below 0
    minb = min(0, df.values.min())
    maxb = max(0, df.values.max())
    return minb, maxb

import matplotlib.colors as mcolors

# create radar map 
def base_stats_map(df, rows, columns):
    """
    create several radar map on base stats
    inputs:
        df: a dataframe contain six stats of Pokémons
            the index must be Pokémons' names
            the columns must be stats names
        rows: the number of rows on the figure
        columns: the number of columns on the figure
    outputs:
        (rows*columns) radar maps in one figure
    """
    # get all available colors
    colors = list(mcolors.TABLEAU_COLORS)
    
    minb, maxb = boundary(df)
    number = len(df.columns)
    
    # divide the circle into sections based on the number of stats
    # 0 to 2pi, divide for several times
    angles = np.linspace(0, 2*np.pi, number, endpoint=False)
    # go around the circle and come back to the origin
    angles = np.append(angles, angles[0])
    
    # create polar axes of rows * columns
    fig, ax = plt.subplots(rows, columns, figsize=(18,18), 
                           subplot_kw=dict(polar=True))
    
    # ax starts as a two-dimensional array of axes;
    # flatten it to one dimension for the loop below
    ax = ax.ravel()
    
    names = df.index
    all_stats = df.values
    
    for i, (name, stats) in enumerate(zip(names, all_stats)):
        # go around the circle and come back to the origin
        stats = np.append(np.array(stats), stats[0])
        
        # plot on each axis and fill color
        ax[i].plot(angles, stats, color=colors[i])
        ax[i].fill(angles, stats, alpha=0.7, color=colors[i])
        
        # set the labels of xtick
        ax[i].set_xticks(angles)
        ax[i].set_xticklabels(df.columns)
        
        # set the title (or the 'name')
        ax[i].set_title(name, size=10, color='black', position=(0.5, 0.4))
                
        # set the minimum and maximum r
        # +0.1 to prevent the outermost circle from appearing incompletely
        ax[i].set_rmin(minb)
        ax[i].set_rmax(maxb + 0.1)
    plt.show()
In [23]:
# plot the radar map
base_stats_map(df0,3,3)
  • We can clearly see each Pokémon's advantages and disadvantages from the radar map. Since the unit lengths in the different radar maps are the same, direct comparisons can be made between them. For example, Mewtwo has the highest 'sp_attack' and Aggron has the highest 'defense'.
  • The shaded area can roughly be read as the 'base_total' of the Pokémon: the larger the area, the stronger the overall strength.
  • This method can be applied to every Pokémon to describe its specific abilities.

3. Battle simulation

3.1 Battle mechanism

The data_battle used here is a copy of data with the same treatment as in section 2.2 (filling the NaN in 'type2'). In this section, we will create a specific battle mechanism to simulate battles between Pokémon.

In [24]:
fill = data.loc[:,'type1']
data_battle = data.copy()
data_battle.loc[:,'type2'].fillna(fill, inplace=True)

Here are the details:

  • We will select two Pokémon at random and make them fight, assuming that any pair of Pokémon is equally likely to meet and that the result of the battle is determined only by their types and base stats. We stipulate that a Pokémon will never battle itself.
  • Firstly, decide which Pokémon attacks first based on 'speed'. The faster one is 'first', the slower one is 'second'.
  • Secondly, compute how much damage they can do to their opponents.
    • 'dm': damage multiplier. As we saw in section 1.2, each Pokémon has 18 performance values against other types, which means that when others attack this Pokémon, the damage is scaled by the value for the corresponding type. For example, the 'dm' of a single-type (grass) attacker against a dual-type (grass-bug) Pokémon is 0.25; the 'dm' of a dual-type (grass-bug) attacker against a single-type (grass) Pokémon is 0.5 * 2 = 1. (See the code below.)
    • 'da': damage. Assume that when a Pokémon's 'attack' is at least as high as its 'sp_attack', it uses physical attacks and its opponent resists with 'defense'. If the attacker's 'attack' does not exceed the defender's 'defense', the attack deals 'da' = 0.5 * 'dm'; otherwise 'da' = (attacker's 'attack' - defender's 'defense') * 'dm'. The same rule applies, with 'sp_defense', to those with a higher 'sp_attack'.
  • Lastly, when attacked, a Pokémon loses some 'hp' (Δhp = -'da'). When one of the two runs out of 'hp', the battle is over and the number of rounds is recorded.
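The 'dm' example in the bullets above can be verified with a small sketch that mirrors how the damage multiplier reads the defender's against_* columns (the multiplier values below come from the standard type chart; plain dicts stand in for the dataframe rows):

```python
# toy mirror of the 'dm' rule: a defender is represented by its against_*
# multipliers exactly as stored in the dataset (already combined for
# dual-type defenders)
def dm(atk_type1, atk_type2, defender_against):
    if atk_type1 == atk_type2:                    # single-type attacker
        return defender_against[atk_type1]
    # dual-type attacker: multiply the defender's two multipliers
    return defender_against[atk_type1] * defender_against[atk_type2]

# a grass/bug Pokémon already stores against_grass = 0.5 * 0.5 = 0.25
grass_bug_defender = {'grass': 0.25}
# a pure grass Pokémon stores against_grass = 0.5 and against_bug = 2
grass_defender = {'grass': 0.5, 'bug': 2.0}

print(dm('grass', 'grass', grass_bug_defender))   # 0.25
print(dm('grass', 'bug', grass_defender))         # 0.5 * 2 = 1.0
```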
In [25]:
def choose_who(df1):
    a, b = np.random.randint(0,809,(2,))
    # not the same
    if a == b:
        if a == 0:
            a = a + 1
        else:
            a = a - 1
    p1 = df1.iloc[a]
    p2 = df1.iloc[b]
    
    # return all the information of the two Pokémons
    return p1, p2


def who_first(p1, p2):
    # based on their 'speed'
    if p1['speed'] >= p2['speed']:
        return p1, p2
    else:
        return p2, p1


def damage_multiplier(p1, p2):
    p1type1 = p1['type1']
    p1type2 = p1['type2']
    p2type1 = p2['type1']
    p2type2 = p2['type2']
    if p1type1 == p1type2:
        dm1 = p2[p1type1]
    else:
        dm1 = p2[p1type1] * p2[p1type2] 
    if p2type1 == p2type2:
        dm2 = p1[p2type1]
    else:
        dm2 = p1[p2type1] * p1[p2type2]
        
    # the battle must end
    # so dm1 = dm2 = 0 is not allowed
    if (dm1 == 0) & (dm2 == 0):
        dm1, dm2 = (0.05, 0.05)
    return dm1, dm2


def damage(p1, p2, dm1, dm2):
    if p1['attack'] >= p1['sp_attack']:
        if p1['attack'] <= p2['defense']:
            # negative damage is not allowed
            da1 = 0.5 * dm1
        else:
            da1 = (p1['attack']-p2['defense']) * dm1
    else:
        if p1['sp_attack'] <= p2['sp_defense']:
            da1 = 0.5 * dm1
        else:
            da1 = (p1['sp_attack']-p2['sp_defense']) * dm1
            
    # same as above
    if p2['attack'] >= p2['sp_attack']:
        if p2['attack'] <= p1['defense']:
            da2 = 0.5 * dm2
        else:
            da2 = (p2['attack']-p1['defense']) * dm2
    else:
        if p2['sp_attack'] <= p1['sp_defense']:
            da2 = 0.5 * dm2
        else:
            da2 = (p2['sp_attack']-p1['sp_defense']) * dm2
    return da1, da2


def bout(p1, p2, da1, da2):
    bround = 0
    hp1 = p1['hp']
    hp2 = p2['hp']
    while (hp1 > 0) & (hp2 > 0):
        # each pass through the while loop is one round
        bround += 1
        
        if da1 >= hp2:
            hp2 = 0
            winner = p1['name']
            loser = p2['name']
            break # when one's 'hp'=0, battle over
        hp2 = hp2 - da1    
        if da2 >= hp1:
            hp1 = 0
            winner = p2['name']
            loser = p1['name']
            break
        hp1 = hp1 - da2
    return (winner, round(hp1,2), 
            round(hp2,2), bround)
    
    
def battle(df, N):
    """
    get a df of all information of some Pokémons
    and simulate battles for N times to get the results
    return another dataframe
    """
    # collect one result row per battle
    results = []
    
    # simulate N battles
    for i in range(N):
        # draw two distinct Pokémon and order them by speed
        p1, p2 = choose_who(df)
        first, second = who_first(p1, p2)
        
        # type multipliers and per-round damage
        dm1, dm2 = damage_multiplier(first, second)
        da1, da2 = damage(first, second, dm1, dm2)
        
        # fight until one side runs out of 'hp'
        winner, hp1, hp2, bround = bout(first, second, da1, da2)
        
        # record the result of this battle
        results.append([first['name'], second['name'], winner, 
                        bround, first['hp'], second['hp'], 
                        hp1, hp2, dm1, dm2, da1, da2])
    
    # turn the results into a dataframe
    columns = ['first', 'second', 'winner', 'round', 
               'first_hp0', 'second_hp0', 'first_hp', 'second_hp', 
               'first_dm', 'second_dm', 'first_da', 'second_da']
    df0 = pd.DataFrame(data=results, columns=columns)
    
    return df0
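Before simulating at scale, the pipeline above can be traced by hand on two made-up Pokémon. All stats below are hypothetical, and the type-effectiveness values sit in plain type-name columns ('fire', 'grass'), as the indexing in damage_multiplier() implies:

```python
import pandas as pd

# Two hand-made rows; 'fire' here means "multiplier of fire-type
# attacks against this Pokémon" (a renamed 'against_fire' column).
A = pd.Series({'name': 'A', 'type1': 'fire', 'type2': 'fire', 'speed': 90,
               'attack': 80, 'sp_attack': 60, 'defense': 50,
               'sp_defense': 50, 'hp': 100, 'fire': 1.0, 'grass': 0.5})
B = pd.Series({'name': 'B', 'type1': 'grass', 'type2': 'grass', 'speed': 45,
               'attack': 60, 'sp_attack': 70, 'defense': 40,
               'sp_defense': 40, 'hp': 100, 'fire': 2.0, 'grass': 1.0})

# who_first: A is faster, so A moves first
first, second = (A, B) if A['speed'] >= B['speed'] else (B, A)

# damage_multiplier: both are single-typed, so one lookup each
dm1 = second[first['type1']]   # fire hitting grass -> 2.0
dm2 = first[second['type1']]   # grass hitting fire -> 0.5

# damage: A attacks physically (attack >= sp_attack), B specially
da1 = (first['attack'] - second['defense']) * dm1        # (80-40)*2.0 = 80.0
da2 = (second['sp_attack'] - first['sp_defense']) * dm2  # (70-50)*0.5 = 10.0

# bout: round 1 leaves B at 20 hp, round 2 knocks it to 0 -> A wins in 2 rounds
print(da1, da2)  # 80.0 10.0
```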
In [26]:
%%timeit
simulation_sample = battle(data_battle, 50)
60.8 ms ± 3.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Simulating 50 battles takes 60-80 ms, which is not very fast, and fewer simulations would not produce reliable results. Saving a CSV file with a large number of simulations (100,000 runs) is therefore not demonstrated live here; the code is shown as markdown instead:

simulation_of_battle = battle(data_battle, 100000)
simulation_of_battle.to_csv('simulation_of_battle.csv')
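One way to speed the simulation up (a sketch, not the notebook's actual code): pull the stat columns out as NumPy arrays and compute the damage for many pairings at once, instead of one scalar battle at a time. The arrays below are random stand-ins for columns that would be taken from data_battle with .to_numpy():

```python
import numpy as np

# Illustrative stand-in arrays for N attacker/defender pairings.
N = 5
rng = np.random.default_rng(1)
attack  = rng.integers(40, 120, N).astype(float)  # attackers' 'attack'
defense = rng.integers(40, 120, N).astype(float)  # defenders' 'defense'
dm      = rng.choice([0.5, 1.0, 2.0], N)          # type multipliers 'dm'

# same rule as damage(), applied to all N pairings in one expression
da = np.where(attack > defense, (attack - defense) * dm, 0.5 * dm)
print(da)
```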

3.2 Battle result

In [27]:
simulation_of_battle = pd.read_csv('simulation_of_battle.csv')
simulation_of_battle
Out[27]:
Unnamed: 0 first second winner round first_hp0 second_hp0 first_hp second_hp first_dm second_dm first_da second_da
0 0 Skiploom Starly Skiploom 6 55 40 5.00 0.00 0.50 2.00 7.500 10.00
1 1 Hoopa Timburr Hoopa 1 80 75 80.00 0.00 2.00 0.00 270.000 0.00
2 2 Lopunny Banette Banette 19 65 64 0.00 6.05 0.05 0.05 3.050 3.55
3 3 Vivillon Guzzlord Guzzlord 2 80 223 0.00 75.00 2.00 1.00 74.000 51.00
4 4 Eevee Aromatisse Aromatisse 2 55 101 0.00 100.00 1.00 1.00 0.500 34.00
... ... ... ... ... ... ... ... ... ... ... ... ... ...
99995 99995 Pidgeotto Gourgeist Gourgeist 28 63 85 0.00 84.30 0.05 0.05 0.025 2.25
99996 99996 Luvdisc Hippopotas Hippopotas 3 43 68 0.00 65.00 2.00 1.00 1.000 17.00
99997 99997 Shaymin Wimpod Shaymin 1 100 25 100.00 0.00 1.00 1.00 90.000 0.50
99998 99998 Doduo Slurpuff Slurpuff 1 35 82 0.00 81.50 1.00 1.00 0.500 50.00
99999 99999 Milotic Slowking Milotic 380 95 95 0.25 0.00 0.50 0.50 0.250 0.25

100000 rows × 13 columns

In [28]:
# Calculate how many times each Pokémon appears in the battle simulations
first_one = simulation_of_battle['first'].value_counts()
second_one = simulation_of_battle['second'].value_counts()
occur = first_one + second_one 
occur.isnull().sum()
Out[28]:
9
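The NaN values come from pandas index alignment: the '+' operator on two Series returns NaN for any label missing from one operand, so a Pokémon drawn only as 'first' (or only as 'second') gets NaN instead of its count. Series.add with fill_value=0 treats a missing label as zero instead; a toy illustration:

```python
import pandas as pd

# 'c' was drawn on only one side, mirroring the 9 NaN Pokémon above
first_one = pd.Series({'a': 2, 'b': 1})
second_one = pd.Series({'a': 1, 'b': 1, 'c': 3})

occur_plus = first_one + second_one                  # 'c' -> NaN
occur_add = first_one.add(second_one, fill_value=0)  # 'c' -> 3.0
print(occur_plus['c'], occur_add['c'])  # nan 3.0
```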
In [29]:
# compute the win rate of each Pokémon
winner = simulation_of_battle['winner'].value_counts()
rate = winner/occur
# NaN rates sort to the end by default, so step past them
# and keep the top 10 win rates in descending order
rate.sort_values().tail(21)[:10][::-1]
Out[29]:
Kyogre        0.995261
Tyranitar     0.991189
Regigigas     0.981900
Gyarados      0.972868
Slaking       0.961977
Darmanitan    0.959184
Scizor        0.958159
Wishiwashi    0.947368
Aggron        0.945946
Blastoise     0.940945
dtype: float64
  • Even after 100,000 simulations, 9 Pokémon still have NaN occurrence counts: they happened to appear only as 'first' or only as 'second', and adding two Series with '+' yields NaN for labels missing from one of them.
  • The simulation results can be visualized with the methods used in Section 2, which will not be repeated here.
  • The simulation results could be analyzed further and extended with machine learning.
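As a sketch of that machine-learning direction (entirely hypothetical: synthetic data and a plain-NumPy logistic regression, standing in for features that would really come from joining simulation_of_battle with the base stats):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic features: standardized [attack, defense, speed] differences
# between the two sides of 500 imaginary battles
X = rng.normal(0, 1, size=(500, 3))
# synthetic labels: the first side tends to win when its stats are higher
true_w = np.array([1.2, 0.6, 0.9])
y = (X @ true_w + rng.normal(0, 0.1, 500) > 0).astype(float)

# logistic regression fitted by gradient descent on the log-loss
w = np.zeros(3)
for _ in range(2000):
    z = np.clip(X @ w, -500, 500)       # clip to avoid overflow in exp
    p = 1 / (1 + np.exp(-z))            # predicted win probability
    w -= 0.1 * X.T @ (p - y) / len(y)   # gradient step

z = np.clip(X @ w, -500, 500)
accuracy = ((1 / (1 + np.exp(-z)) > 0.5) == y).mean()
print(round(accuracy, 2))
```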

4. Conclusion

Although the dataset used here is not academic in some respects, the data-visualization workflow can be applied to many other datasets. I chose this dataset with a playful attitude at the beginning, but after a period of hands-on work I found I could learn a great deal about data processing from it, which benefited me a lot.