How many Pro Players are out there?

On PlayGwent, only data from the top 2860 Pro players is shared, though it is likely there are (many) more. We'll try using some data science on the MMR scores of players from previous seasons to estimate how many players are actually out there. This can be done because in the very first season of Masters 2 (Season of the Wolf), the total number of active players (an MMR of 2400 or more indicates at least 25 games were played) was roughly equal to that cap of 2860: the lowest MMR listed that season is 2407, so the list covers nearly every active player.

Assuming the distribution of MMR scores remains similar across seasons, we can leverage that to estimate the total number of players for seasons where there were more players. The idea is to determine the percentage of players with an MMR higher than 9700, 9800, 9900, 10000 and 10100 in the first season and, using those percentages, extrapolate from the number of players above those thresholds to the total number of players in other seasons.
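The extrapolation itself comes down to a single division: the count of players above a cutoff in a later season, divided by the fraction of first-season players above that same cutoff. A minimal sketch of that idea, where the function name and the toy numbers are illustrative assumptions and not part of the notebook's data:

```python
import numpy as np


def estimate_total_players(reference_mmr, n_above_cutoff, cutoff):
    """Scale a count of players at or above `cutoff` up to a total player count.

    `reference_mmr` holds the scores of a season in which (nearly) all
    players are visible, such as the first season in this notebook.
    """
    reference_mmr = np.asarray(reference_mmr, dtype=float)
    # Fraction of the reference season's players at or above the cutoff.
    fraction_above = np.mean(reference_mmr >= cutoff)
    if fraction_above == 0:
        raise ValueError("no reference players at or above the cutoff")
    return n_above_cutoff / fraction_above


# Toy reference season: 10 players, 2 of them (20%) at or above 9700.
toy_reference = [2400, 3000, 4500, 6000, 7500, 8200, 9000, 9500, 9700, 10100]

# If a later season lists 500 players at or above 9700, the estimated
# total is 500 / 0.2 = 2500 players.
toy_estimate = estimate_total_players(toy_reference, n_above_cutoff=500, cutoff=9700)
print(toy_estimate)
```

The estimate is only as good as the assumption that the share of players above the cutoff is the same in both seasons.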

In [1]:

# # When running on Binder: Uncomment and execute this cell to install packages required
# import sys
# !conda install --yes --prefix {sys.prefix} seaborn
# !conda install --yes --prefix {sys.prefix} nb_black

In [2]:

%load_ext nb_black

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:

df = pd.read_excel("./output/player_stats.xlsx").drop(columns=["Unnamed: 0"])
seasons_df = (
    df.groupby(["season"])
    .agg(
        min_mmr=pd.NamedAgg("mmr", "min"),
        max_mmr=pd.NamedAgg("mmr", "max"),
        num_matches=pd.NamedAgg("matches", "sum"),
    )
    .reset_index()
)

seasons_df

Out[3]:

	season	min_mmr	max_mmr	num_matches
0	M2_01 Wolf 2020	2407	10484	699496
1	M2_02 Love 2020	7776	10537	769172
2	M2_03 Bear 2020	9427	10669	862283
3	M2_04 Elf 2020	9666	10751	1004603
4	M2_05 Viper 2020	9635	10622	859640
5	M2_06 Magic 2020	9624	10597	793013
6	M2_07 Griffin 2020	9698	10667	996516
7	M2_08 Draconid 2020	9666	10546	837545
8	M2_09 Dryad 2020	9678	10725	854593
9	M2_10 Cat 2020	9703	10804	928845
10	M2_11 Mahakam 2020	9706	10783	983150
11	M2_12 Wild Hunt 2020	9756	10724	1182353
12	M3_01 Wolf 2021	9637	10653	808651
13	M3_02 Love 2021	9684	10714	917027
14	M3_03 Bear 2021	9637	10576	766502
15	M3_04 Elf 2021	9686	10678	944323
16	M3_05 Viper 2021	9701	10753	956484
17	M3_06 Magic 2021	9681	10632	869262
18	M3_07 Griffin 2021	9669	10633	856103
19	M3_08 Draconid 2021	9681	10767	911273
20	M3_09 Dryad 2021	9688	10809	940655
21	M3_10 Cat 2021	9614	10366	719696
22	M3_11 Mahakam 2021	9725	10580	1017256
23	M3_12 Wild Hunt 2021	9735	10714	1044941
24	M4_01 Wolf 2022	9646	10684	883881

In [4]:

# histplot with kde=True and stat="density" replaces the deprecated distplot
sns.histplot(
    df[df["season"] == "M2_01 Wolf 2020"]["mmr"], bins=50, kde=True, stat="density"
)
plt.show()

In [5]:

percentiles = np.percentile(
    df[df["season"] == "M2_01 Wolf 2020"]["mmr"], [x / 2 for x in range(0, 200, 1)]
)
percentiles_df = pd.DataFrame(
    {"percentile": [x / 2 for x in range(0, 200, 1)], "mmr": percentiles}
)
percentiles_df.to_excel("temp.xlsx")
percentiles_df

Out[5]:

	percentile	mmr
0	0.0	2407.000
1	0.5	2413.285
2	1.0	2426.000
3	1.5	2447.855
4	2.0	2477.420
...	...	...
195	97.5	10037.575
196	98.0	10064.000
197	98.5	10088.145
198	99.0	10135.170
199	99.5	10290.865

200 rows × 2 columns

Test case

To test if the method actually worked, I got into Pro Rank during the Season of the Dryad and stopped playing ranked (actually stopped playing that season entirely, as I didn't have time to climb). The result was that I had an unimpressive MMR of 3360, good for a rank of 12816. If our estimates work, we should get a number that is in that order of magnitude.

We'll check the percentage of players at MMR 9678 during the first season: 74 percent are at or below this threshold, so 26 percent are above it. If the 2860 players listed correspond to 26% of all Pro players, the total number of players should be around 11000. That is still some way off the total I know from getting into Pro Rank and staying there, since a rank of 12816 means there were at least that many players. Using the same trick with my own MMR and position gives an approximation of 13782 Pro players with an MMR of 2400 or above that season. For now, though, this is close enough.

In [6]:

# MMR cutoff to appear on pro ladder during the Season of the Dryad = 9678
percentiles_df[percentiles_df.mmr >= 9678][:1]

Out[6]:

	percentile	mmr
148	74.0	9678.0

In [7]:

(2860 / 26) * 100

Out[7]:

11000.0
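The rank-based variant of the trick mentioned above can be sketched in the same way. The notebook does not show that calculation, so this is one plausible reading of it: divide a known rank by the fraction of first-season players at or above the corresponding MMR. The function name and the toy reference scores are assumptions made for illustration.

```python
import numpy as np


def estimate_total_from_rank(reference_mmr, my_mmr, my_rank):
    """Estimate the total player count from one known (MMR, rank) pair.

    If a fraction f of the reference season sits at or above `my_mmr`,
    then rank `my_rank` should be roughly f times the total.
    """
    reference_mmr = np.asarray(reference_mmr, dtype=float)
    fraction_at_or_above = np.mean(reference_mmr >= my_mmr)
    if fraction_at_or_above == 0:
        raise ValueError("no reference players at or above my_mmr")
    return my_rank / fraction_at_or_above


# Hypothetical reference season: 801 evenly spaced scores from 2400 to 10400.
toy_reference = np.arange(2400, 10401, 10)

# 705 of the 801 toy scores are at or above 3360, so a player ranked 705th
# at that MMR implies a total of about 801 players.
toy_total = estimate_total_from_rank(toy_reference, my_mmr=3360, my_rank=705)
print(round(toy_total))
```

With the notebook's data, the call would take the first-season `mmr` column as `reference_mmr`, together with the MMR of 3360 and the rank of 12816 from the test case.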

Using multiple cutoffs

As small changes in the number of players can cause a big shift, here we'll use multiple MMR cutoffs to make the estimates and do some statistics on them to see if we can get close to the total number of players we are aware of.

In [8]:

seasons = df["season"].unique()

output = []

for cutoff in [9700, 9800, 9900, 10000, 10100]:
    # Percentage of first-season players above the cutoff, read from the
    # half-percent grid. int() truncates (e.g. 2.5 -> 2), as in the original run.
    first_above = percentiles_df.loc[percentiles_df.mmr > cutoff, "percentile"].iloc[0]
    percent_above = int(100 - first_above)

    for season in seasons:
        # Number of listed players at or above the cutoff in this season
        players_above_threshold = int(
            ((df.season == season) & (df.mmr >= cutoff)).sum()
        )

        output.append(
            {
                "season": season,
                "mmr_cutoff": cutoff,
                "num_players": players_above_threshold,
                "total_players_est": players_above_threshold * 100 / percent_above,
            }
        )

estimates_df = pd.DataFrame(output).sort_values("season")

In [9]:

estimate_summary = (
    estimates_df.groupby(["season"])
    .agg(
        low_estimate=pd.NamedAgg("total_players_est", "min"),
        high_estimate=pd.NamedAgg("total_players_est", "max"),
        mean_estimate=pd.NamedAgg("total_players_est", "mean"),
        std_err=pd.NamedAgg("total_players_est", "sem"),
    )
    .reset_index()
)
estimate_summary.to_excel("./output/player_estimates.xlsx")
estimate_summary

Out[9]:

	season	low_estimate	high_estimate	mean_estimate	std_err
0	M2_01 Wolf 2020	2900.000000	3600.0	3117.636364	124.944153
1	M2_02 Love 2020	4566.666667	7100.0	5620.242424	441.315936
2	M2_03 Bear 2020	6036.363636	10300.0	7329.272727	760.229978
3	M2_04 Elf 2020	9927.272727	18000.0	12319.454545	1494.370140
4	M2_05 Viper 2020	7766.666667	11400.0	9372.060606	727.332197
5	M2_06 Magic 2020	6800.000000	9800.0	8320.181818	618.201230
6	M2_07 Griffin 2020	12836.363636	19900.0	14683.272727	1331.618216
7	M2_08 Draconid 2020	9566.666667	13300.0	11186.242424	696.853229
8	M2_09 Dryad 2020	9733.333333	12580.0	11218.666667	458.757501
9	M2_10 Cat 2020	12800.000000	14620.0	13774.181818	369.275995
10	M2_11 Mahakam 2020	12995.454545	18900.0	16041.757576	976.287680
11	M2_12 Wild Hunt 2020	12995.454545	35900.0	23051.757576	3678.743988
12	M3_01 Wolf 2021	7968.181818	13700.0	9929.636364	1046.105394
13	M3_02 Love 2021	11645.454545	19500.0	13977.757576	1430.087791
14	M3_03 Bear 2021	7466.666667	10900.0	9195.333333	715.137594
15	M3_04 Elf 2021	11586.363636	20300.0	14814.606061	1527.809485
16	M3_05 Viper 2021	12990.909091	20900.0	16159.515152	1322.547676
17	M3_06 Magic 2021	11263.636364	17200.0	13526.060606	1034.558786
18	M3_07 Griffin 2021	10140.909091	15800.0	12241.515152	1040.267969
19	M3_08 Draconid 2021	11200.000000	16900.0	12854.545455	1051.176279
20	M3_09 Dryad 2021	11868.181818	15900.0	13442.969697	746.264455
21	M3_10 Cat 2021	5100.000000	7770.0	6559.939394	575.186402
22	M3_11 Mahakam 2021	12981.818182	23300.0	17920.363636	1647.708090
23	M3_12 Wild Hunt 2021	12981.818182	27500.0	19567.696970	2331.781053
24	M4_01 Wolf 2022	8690.909091	18900.0	12063.515152	1789.893240

Results ... maybe ...

So here we can see that the minimum and maximum estimates of the number of Pro players can differ a lot, and for the Season of the Dryad even the max estimate is still at least 1000 players shy of the ground truth. Still, given the scarcity of data to work with and the assumptions we need to make (it is unlikely that the distribution remains exactly the same, as buffs/nerfs to cards and new cards could affect how easy it is to reach a certain fMMR with specific factions), it is the best that can be done with the data at hand, and it is probably close enough.
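One likely contributor to the spread is that the estimates divide by a small percentage, so any rounding of that percentage, or a few players more or less above a cutoff, is amplified. A small worked example with hypothetical numbers (none of these counts come from the data) illustrates the effect:

```python
def estimate_total(n_above, percent_above):
    """Total player estimate from a count and the percentage it represents."""
    return n_above * 100 / percent_above


# Hypothetical season: 300 players above a cutoff that really holds 2.5%.
n_above = 300
exact_percent = 2.5
truncated_percent = int(exact_percent)  # truncates to 2

exact_estimate = estimate_total(n_above, exact_percent)          # 12000.0
truncated_estimate = estimate_total(n_above, truncated_percent)  # 15000.0

# Truncating 2.5% to 2% inflates the estimate by a quarter.
inflation = truncated_estimate / exact_estimate  # 1.25

# At a 1% cutoff, every additional player above it adds 100 to the estimate.
per_player_shift = estimate_total(360, 1) - estimate_total(359, 1)  # 100.0

print(exact_estimate, truncated_estimate, inflation, per_player_shift)
```

The higher the cutoff, the smaller the percentage in the denominator and the larger this amplification becomes.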

One thing I wanted to check is whether there is a correlation between the number of games played by the top 2860 players and the estimated total number of players. This can easily be done by merging our estimates with the number of matches in the season summaries and plotting them in a scatter plot (or a regplot to have a regression line in the plot). We'll do this for the top 500 players separately as well, to see if the trend holds for that section of the players.

In [13]:

# same thing but only considering the top 500 players
seasons_top500only_df = (
    df[pd.to_numeric(df["rank"]) <= 500]
    .groupby(["season"])
    .agg(
        min_mmr=pd.NamedAgg("mmr", "min"),
        max_mmr=pd.NamedAgg("mmr", "max"),
        num_matches=pd.NamedAgg("matches", "sum"),
    )
    .reset_index()
)

merged_df = pd.merge(
    seasons_df,
    seasons_top500only_df,
    how="inner",
    on="season",
    suffixes=("", "_top500"),
)
merged_df = pd.merge(merged_df, estimate_summary, how="inner", on="season").drop(
    columns=["min_mmr_top500", "max_mmr_top500"]
)

merged_df["Masters"] = merged_df["season"].apply(lambda x: x.split("_")[0])

merged_df

Out[13]:

	season	min_mmr	max_mmr	num_matches	num_matches_top500	low_estimate	high_estimate	mean_estimate	std_err	Masters
0	M2_01 Wolf 2020	2407	10484	699496	178323	2900.000000	3600.0	3117.636364	124.944153	M2
1	M2_02 Love 2020	7776	10537	769172	183972	4566.666667	7100.0	5620.242424	441.315936	M2
2	M2_03 Bear 2020	9427	10669	862283	205439	6036.363636	10300.0	7329.272727	760.229978	M2
3	M2_04 Elf 2020	9666	10751	1004603	251712	9927.272727	18000.0	12319.454545	1494.370140	M2
4	M2_05 Viper 2020	9635	10622	859640	207622	7766.666667	11400.0	9372.060606	727.332197	M2
5	M2_06 Magic 2020	9624	10597	793013	188536	6800.000000	9800.0	8320.181818	618.201230	M2
6	M2_07 Griffin 2020	9698	10667	996516	259713	12836.363636	19900.0	14683.272727	1331.618216	M2
7	M2_08 Draconid 2020	9666	10546	837545	209785	9566.666667	13300.0	11186.242424	696.853229	M2
8	M2_09 Dryad 2020	9678	10725	854593	202099	9733.333333	12580.0	11218.666667	458.757501	M2
9	M2_10 Cat 2020	9703	10804	928845	213867	12800.000000	14620.0	13774.181818	369.275995	M2
10	M2_11 Mahakam 2020	9706	10783	983150	230710	12995.454545	18900.0	16041.757576	976.287680	M2
11	M2_12 Wild Hunt 2020	9756	10724	1182353	290718	12995.454545	35900.0	23051.757576	3678.743988	M2
12	M3_01 Wolf 2021	9637	10653	808651	224998	7968.181818	13700.0	9929.636364	1046.105394	M3
13	M3_02 Love 2021	9684	10714	917027	243266	11645.454545	19500.0	13977.757576	1430.087791	M3
14	M3_03 Bear 2021	9637	10576	766502	189128	7466.666667	10900.0	9195.333333	715.137594	M3
15	M3_04 Elf 2021	9686	10678	944323	242792	11586.363636	20300.0	14814.606061	1527.809485	M3
16	M3_05 Viper 2021	9701	10753	956484	240472	12990.909091	20900.0	16159.515152	1322.547676	M3
17	M3_06 Magic 2021	9681	10632	869262	215751	11263.636364	17200.0	13526.060606	1034.558786	M3
18	M3_07 Griffin 2021	9669	10633	856103	212764	10140.909091	15800.0	12241.515152	1040.267969	M3
19	M3_08 Draconid 2021	9681	10767	911273	232360	11200.000000	16900.0	12854.545455	1051.176279	M3
20	M3_09 Dryad 2021	9688	10809	940655	223832	11868.181818	15900.0	13442.969697	746.264455	M3
21	M3_10 Cat 2021	9614	10366	719696	149104	5100.000000	7770.0	6559.939394	575.186402	M3
22	M3_11 Mahakam 2021	9725	10580	1017256	247078	12981.818182	23300.0	17920.363636	1647.708090	M3
23	M3_12 Wild Hunt 2021	9735	10714	1044941	258469	12981.818182	27500.0	19567.696970	2331.781053	M3
24	M4_01 Wolf 2022	9646	10684	883881	226411	8690.909091	18900.0	12063.515152	1789.893240	M4

In [15]:

sns.regplot(
    data=merged_df, x="num_matches", y="high_estimate", scatter=False, color=".25",
)
sns.scatterplot(data=merged_df, x="num_matches", y="high_estimate", hue="Masters")
plt.show()

In [14]:

sns.regplot(
    data=merged_df,
    x="num_matches_top500",
    y="high_estimate",
    scatter=False,
    color=".25",
)
sns.scatterplot(
    data=merged_df, x="num_matches_top500", y="high_estimate", hue="Masters"
)
plt.show()

There is a clear and mostly linear correlation between the estimated number of Pro Rank players and the number of games played by both the top 2860 and the top 500 players. This suggests that the more players there are, the more matches someone has to play to get into the higher ranks.
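To put a number on "clear and mostly linear", one option is the Pearson correlation coefficient. A minimal sketch, using the `num_matches` and `high_estimate` values copied from the merged table above; the variable names are illustrative, and in the notebook itself `merged_df["num_matches"].corr(merged_df["high_estimate"])` should give the same figure:

```python
import numpy as np

# (num_matches, high_estimate) per season, copied from the merged table.
season_points = [
    (699496, 3600), (769172, 7100), (862283, 10300), (1004603, 18000),
    (859640, 11400), (793013, 9800), (996516, 19900), (837545, 13300),
    (854593, 12580), (928845, 14620), (983150, 18900), (1182353, 35900),
    (808651, 13700), (917027, 19500), (766502, 10900), (944323, 20300),
    (956484, 20900), (869262, 17200), (856103, 15800), (911273, 16900),
    (940655, 15900), (719696, 7770), (1017256, 23300), (1044941, 27500),
    (883881, 18900),
]

num_matches, high_estimate = np.array(season_points, dtype=float).T

# Pearson correlation: the off-diagonal entry of the 2x2 correlation matrix.
pearson_r = np.corrcoef(num_matches, high_estimate)[0, 1]
print(f"Pearson r = {pearson_r:.2f}")
```

A value close to 1 would support the linear reading of the plots. It does not by itself show which quantity drives the other.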