On PlayGwent, only data from the top 2860 pro players is shared, though there are likely many more. We'll apply some data science to the MMR scores of players from previous seasons to estimate how many players are actually out there. This is possible because in the very first season of Masters 2 (Season of the Wolf) the total number of active players (an MMR of 2400 or more, indicating at least 25 games were played) was about equal to that threshold: the lowest MMR listed that season is 2407.
Assuming the distribution of MMR scores remains similar across seasons, we can leverage it to estimate the total number of players for seasons with more players. The idea is to determine the percentage of first-season players with an MMR higher than 9700, 9800, 9900, 10000 and 10100 and then, using those percentages, extrapolate from the number of players above those thresholds in other seasons to the total number of players in those seasons.
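The extrapolation step described above boils down to a single division. A minimal sketch, assuming the share of players above a cutoff carries over from the reference season; the function name and argument names are illustrative and do not appear elsewhere in this notebook:

```python
def estimate_total_players(players_above: int, percent_above: float) -> float:
    """Scale a visible head-count up to an estimate of the full population.

    players_above: players above an MMR cutoff in the season of interest.
    percent_above: percentage (0-100] of the reference season's players
                   above the same cutoff.
    """
    if not 0 < percent_above <= 100:
        raise ValueError("percent_above must be in (0, 100]")
    return players_above * 100 / percent_above


# If the 2860 listed players were 26 % of everyone, the total would be:
print(estimate_total_players(2860, 26))
```

The smaller `percent_above` gets, the more a small error in it is amplified, which is one reason to compare several cutoffs rather than rely on one.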
# When running on Binder: uncomment and execute this cell to install the required packages
# import sys
# !conda install --yes --prefix {sys.prefix} seaborn
# !conda install --yes --prefix {sys.prefix} nb_black
%load_ext nb_black
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_excel("./output/player_stats.xlsx").drop(columns=["Unnamed: 0"])
seasons_df = (
    df.groupby(["season"])
    .agg(
        min_mmr=pd.NamedAgg("mmr", "min"),
        max_mmr=pd.NamedAgg("mmr", "max"),
        num_matches=pd.NamedAgg("matches", "sum"),
    )
    .reset_index()
)
seasons_df
| | season | min_mmr | max_mmr | num_matches |
---|---|---|---|---|
0 | M2_01 Wolf 2020 | 2407 | 10484 | 699496 |
1 | M2_02 Love 2020 | 7776 | 10537 | 769172 |
2 | M2_03 Bear 2020 | 9427 | 10669 | 862283 |
3 | M2_04 Elf 2020 | 9666 | 10751 | 1004603 |
4 | M2_05 Viper 2020 | 9635 | 10622 | 859640 |
5 | M2_06 Magic 2020 | 9624 | 10597 | 793013 |
6 | M2_07 Griffin 2020 | 9698 | 10667 | 996516 |
7 | M2_08 Draconid 2020 | 9666 | 10546 | 837545 |
8 | M2_09 Dryad 2020 | 9678 | 10725 | 854593 |
9 | M2_10 Cat 2020 | 9703 | 10804 | 928845 |
10 | M2_11 Mahakam 2020 | 9706 | 10783 | 983150 |
11 | M2_12 Wild Hunt 2020 | 9756 | 10724 | 1182353 |
12 | M3_01 Wolf 2021 | 9637 | 10653 | 808651 |
13 | M3_02 Love 2021 | 9684 | 10714 | 917027 |
14 | M3_03 Bear 2021 | 9637 | 10576 | 766502 |
15 | M3_04 Elf 2021 | 9686 | 10678 | 944323 |
16 | M3_05 Viper 2021 | 9701 | 10753 | 956484 |
17 | M3_06 Magic 2021 | 9681 | 10632 | 869262 |
18 | M3_07 Griffin 2021 | 9669 | 10633 | 856103 |
19 | M3_08 Draconid 2021 | 9681 | 10767 | 911273 |
20 | M3_09 Dryad 2021 | 9688 | 10809 | 940655 |
21 | M3_10 Cat 2021 | 9614 | 10366 | 719696 |
22 | M3_11 Mahakam 2021 | 9725 | 10580 | 1017256 |
23 | M3_12 Wild Hunt 2021 | 9735 | 10714 | 1044941 |
24 | M4_01 Wolf 2022 | 9646 | 10684 | 883881 |
# distplot is deprecated; histplot with a KDE overlay gives the equivalent plot
sns.histplot(df[df["season"] == "M2_01 Wolf 2020"]["mmr"], bins=50, kde=True, stat="density")
plt.show()
percentiles = np.percentile(
    df[df["season"] == "M2_01 Wolf 2020"]["mmr"], [x / 2 for x in range(0, 200, 1)]
)
percentiles_df = pd.DataFrame(
    {"percentile": [x / 2 for x in range(0, 200, 1)], "mmr": percentiles}
)
percentiles_df.to_excel("temp.xlsx")
percentiles_df
| | percentile | mmr |
---|---|---|
0 | 0.0 | 2407.000 |
1 | 0.5 | 2413.285 |
2 | 1.0 | 2426.000 |
3 | 1.5 | 2447.855 |
4 | 2.0 | 2477.420 |
... | ... | ... |
195 | 97.5 | 10037.575 |
196 | 98.0 | 10064.000 |
197 | 98.5 | 10088.145 |
198 | 99.0 | 10135.170 |
199 | 99.5 | 10290.865 |
200 rows × 2 columns
To test whether the method actually works, I got into Pro Rank during the Season of the Dryad and then stopped playing ranked (actually, I stopped playing that season entirely, as I didn't have time to climb). The result was an unimpressive MMR of 3360, good for rank 12816. If our estimates work, we should get a number of that order of magnitude.
We'll check the percentage of players at MMR 9678 during the first season: 74 percent are at or below this threshold, so 26 percent are above it. If the 2860 players listed correspond to 26 % of all Pro players, the total number of players should be around 11 000. That is still some way off the total I know from getting into Pro Rank and staying there. In fact, applying the same trick to my own MMR and rank gives an approximation of 13782 Pro players with an MMR of 2400 or above that season. For now, however, this is close enough.
# MMR cutoff to appear on pro ladder during the Season of the Dryad = 9678
percentiles_df[percentiles_df.mmr >= 9678][:1]
| | percentile | mmr |
---|---|---|
148 | 74.0 | 9678.0 |
(2860 / 26) * 100
11000.0
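The rank-based variant of the trick (one known MMR and rank instead of the ladder cutoff) can be sketched in the same way. This is only an illustration under the same assumption that the reference season's distribution carries over; the function name is hypothetical and the reference ladder below is made up purely to show the mechanics, so its output is not the 13782 quoted above:

```python
def estimate_total_from_rank(reference_mmr, my_mmr, my_rank):
    """Estimate a season's player count from one known (MMR, rank) pair.

    reference_mmr: MMR scores of all players in the reference season,
                   assumed to cover the whole ladder.
    my_mmr:        MMR of the known player in the season of interest.
    my_rank:       rank of that player in the season of interest.
    """
    # Share of the reference ladder sitting at or above the known MMR
    at_or_above = sum(1 for mmr in reference_mmr if mmr >= my_mmr)
    if at_or_above == 0:
        raise ValueError("my_mmr is above every score in the reference ladder")
    share_at_or_above = at_or_above / len(reference_mmr)
    # If that share holds, my_rank is that fraction of the total
    return my_rank / share_at_or_above


# Made-up, evenly spaced reference ladder: 2400, 2480, ..., 10320
toy_reference = [2400 + 80 * i for i in range(100)]
print(estimate_total_from_rank(toy_reference, my_mmr=3360, my_rank=12816))
```

In the notebook itself, the real first-season scores, `df[df["season"] == "M2_01 Wolf 2020"]["mmr"]`, would take the place of `toy_reference`.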
Because small changes in the number of players can cause a big shift, we'll use multiple MMR cutoffs to make the estimates and compute some statistics on them, to see whether we can get close to the total number of players we are aware of.
seasons = list(set(df["season"]))
output = []
for i in [9700, 9800, 9900, 10000, 10100]:
    # Percentage (truncated to an integer) of first-season players above cutoff i
    percentile = int(
        100 - percentiles_df[percentiles_df.mmr > i]["percentile"].iloc[0]
    )
    for season in seasons:
        # Number of listed players at or above cutoff i in this season
        players_above_threshold = ((df.season == season) & (df.mmr >= i)).sum()
        output.append(
            {
                "season": season,
                "mmr_cutoff": i,
                "num_players": players_above_threshold,
                "total_players_est": players_above_threshold * 100 / percentile,
            }
        )
estimates_df = pd.DataFrame(output).sort_values("season")
estimate_summary = (
    estimates_df.groupby(["season"])
    .agg(
        low_estimate=pd.NamedAgg("total_players_est", "min"),
        high_estimate=pd.NamedAgg("total_players_est", "max"),
        mean_estimate=pd.NamedAgg("total_players_est", "mean"),
        std_err=pd.NamedAgg("total_players_est", "sem"),
    )
    .reset_index()
)
estimate_summary.to_excel("./output/player_estimates.xlsx")
estimate_summary
| | season | low_estimate | high_estimate | mean_estimate | std_err |
---|---|---|---|---|---|
0 | M2_01 Wolf 2020 | 2900.000000 | 3600.0 | 3117.636364 | 124.944153 |
1 | M2_02 Love 2020 | 4566.666667 | 7100.0 | 5620.242424 | 441.315936 |
2 | M2_03 Bear 2020 | 6036.363636 | 10300.0 | 7329.272727 | 760.229978 |
3 | M2_04 Elf 2020 | 9927.272727 | 18000.0 | 12319.454545 | 1494.370140 |
4 | M2_05 Viper 2020 | 7766.666667 | 11400.0 | 9372.060606 | 727.332197 |
5 | M2_06 Magic 2020 | 6800.000000 | 9800.0 | 8320.181818 | 618.201230 |
6 | M2_07 Griffin 2020 | 12836.363636 | 19900.0 | 14683.272727 | 1331.618216 |
7 | M2_08 Draconid 2020 | 9566.666667 | 13300.0 | 11186.242424 | 696.853229 |
8 | M2_09 Dryad 2020 | 9733.333333 | 12580.0 | 11218.666667 | 458.757501 |
9 | M2_10 Cat 2020 | 12800.000000 | 14620.0 | 13774.181818 | 369.275995 |
10 | M2_11 Mahakam 2020 | 12995.454545 | 18900.0 | 16041.757576 | 976.287680 |
11 | M2_12 Wild Hunt 2020 | 12995.454545 | 35900.0 | 23051.757576 | 3678.743988 |
12 | M3_01 Wolf 2021 | 7968.181818 | 13700.0 | 9929.636364 | 1046.105394 |
13 | M3_02 Love 2021 | 11645.454545 | 19500.0 | 13977.757576 | 1430.087791 |
14 | M3_03 Bear 2021 | 7466.666667 | 10900.0 | 9195.333333 | 715.137594 |
15 | M3_04 Elf 2021 | 11586.363636 | 20300.0 | 14814.606061 | 1527.809485 |
16 | M3_05 Viper 2021 | 12990.909091 | 20900.0 | 16159.515152 | 1322.547676 |
17 | M3_06 Magic 2021 | 11263.636364 | 17200.0 | 13526.060606 | 1034.558786 |
18 | M3_07 Griffin 2021 | 10140.909091 | 15800.0 | 12241.515152 | 1040.267969 |
19 | M3_08 Draconid 2021 | 11200.000000 | 16900.0 | 12854.545455 | 1051.176279 |
20 | M3_09 Dryad 2021 | 11868.181818 | 15900.0 | 13442.969697 | 746.264455 |
21 | M3_10 Cat 2021 | 5100.000000 | 7770.0 | 6559.939394 | 575.186402 |
22 | M3_11 Mahakam 2021 | 12981.818182 | 23300.0 | 17920.363636 | 1647.708090 |
23 | M3_12 Wild Hunt 2021 | 12981.818182 | 27500.0 | 19567.696970 | 2331.781053 |
24 | M4_01 Wolf 2022 | 8690.909091 | 18900.0 | 12063.515152 | 1789.893240 |
Here we can see that the minimum and maximum estimated numbers of Pro players can differ a lot, and for the Season of the Dryad even the maximum estimate is still at least 1000 players shy of the ground truth. Still, given the scarcity of data and some of the assumptions we need to make (it is unlikely that the distribution remains exactly the same, as buffs and nerfs to cards, as well as new cards, can affect how easy it is to reach a certain fMMR with specific factions), this is the best that can be done with the data at hand, and it is probably close enough.
One thing I wanted to check is whether there is a correlation between the number of games played by the top 2860 players and the estimated total number of players. This can easily be done by merging our estimates with the number of matches in the season summaries and plotting them in a scatter plot (or a regplot, to get a regression line in the plot). We'll do this for the top 500 players separately as well, to see whether the trend holds for that section of the players.
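Part of that spread comes from the arithmetic itself: dividing by a small percentage amplifies any error in it, and the `int(...)` truncation in the estimation loop rounds, for example, 1.5 % down to 1 %. The sketch below uses a hypothetical head-count of 100 players and illustrative percentages to show how much truncation alone can inflate an estimate; `truncation_inflation` is a helper introduced only for this example:

```python
def truncation_inflation(percent_above: float) -> float:
    """Relative over-estimate caused by truncating the percentage to an int.

    Only meaningful for percent_above >= 1; below that, int() yields 0.
    """
    return percent_above / int(percent_above) - 1


players_above = 100  # hypothetical head-count above a high MMR cutoff
for exact_percent in (1.5, 2.5, 5.5, 25.5):
    exact = players_above * 100 / exact_percent
    truncated = players_above * 100 / int(exact_percent)
    print(
        f"{exact_percent:>4}% above: {exact:7.0f} exact, "
        f"{truncated:7.0f} truncated ({truncation_inflation(exact_percent):+.0%})"
    )
```

The effect is largest at the highest cutoffs, where only a few percent of the reference season sits above the threshold, which fits the high estimates being the most volatile column.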
# same thing but only considering the top 500 players
seasons_top500only_df = (
    df[pd.to_numeric(df["rank"]) <= 500]
    .groupby(["season"])
    .agg(
        min_mmr=pd.NamedAgg("mmr", "min"),
        max_mmr=pd.NamedAgg("mmr", "max"),
        num_matches=pd.NamedAgg("matches", "sum"),
    )
    .reset_index()
)
merged_df = pd.merge(
    seasons_df,
    seasons_top500only_df,
    how="inner",
    on="season",
    suffixes=("", "_top500"),
)
merged_df = pd.merge(merged_df, estimate_summary, how="inner", on="season").drop(
    columns=["min_mmr_top500", "max_mmr_top500"]
)
merged_df["Masters"] = merged_df["season"].apply(lambda x: x.split("_")[0])
merged_df
| | season | min_mmr | max_mmr | num_matches | num_matches_top500 | low_estimate | high_estimate | mean_estimate | std_err | Masters |
---|---|---|---|---|---|---|---|---|---|---|
0 | M2_01 Wolf 2020 | 2407 | 10484 | 699496 | 178323 | 2900.000000 | 3600.0 | 3117.636364 | 124.944153 | M2 |
1 | M2_02 Love 2020 | 7776 | 10537 | 769172 | 183972 | 4566.666667 | 7100.0 | 5620.242424 | 441.315936 | M2 |
2 | M2_03 Bear 2020 | 9427 | 10669 | 862283 | 205439 | 6036.363636 | 10300.0 | 7329.272727 | 760.229978 | M2 |
3 | M2_04 Elf 2020 | 9666 | 10751 | 1004603 | 251712 | 9927.272727 | 18000.0 | 12319.454545 | 1494.370140 | M2 |
4 | M2_05 Viper 2020 | 9635 | 10622 | 859640 | 207622 | 7766.666667 | 11400.0 | 9372.060606 | 727.332197 | M2 |
5 | M2_06 Magic 2020 | 9624 | 10597 | 793013 | 188536 | 6800.000000 | 9800.0 | 8320.181818 | 618.201230 | M2 |
6 | M2_07 Griffin 2020 | 9698 | 10667 | 996516 | 259713 | 12836.363636 | 19900.0 | 14683.272727 | 1331.618216 | M2 |
7 | M2_08 Draconid 2020 | 9666 | 10546 | 837545 | 209785 | 9566.666667 | 13300.0 | 11186.242424 | 696.853229 | M2 |
8 | M2_09 Dryad 2020 | 9678 | 10725 | 854593 | 202099 | 9733.333333 | 12580.0 | 11218.666667 | 458.757501 | M2 |
9 | M2_10 Cat 2020 | 9703 | 10804 | 928845 | 213867 | 12800.000000 | 14620.0 | 13774.181818 | 369.275995 | M2 |
10 | M2_11 Mahakam 2020 | 9706 | 10783 | 983150 | 230710 | 12995.454545 | 18900.0 | 16041.757576 | 976.287680 | M2 |
11 | M2_12 Wild Hunt 2020 | 9756 | 10724 | 1182353 | 290718 | 12995.454545 | 35900.0 | 23051.757576 | 3678.743988 | M2 |
12 | M3_01 Wolf 2021 | 9637 | 10653 | 808651 | 224998 | 7968.181818 | 13700.0 | 9929.636364 | 1046.105394 | M3 |
13 | M3_02 Love 2021 | 9684 | 10714 | 917027 | 243266 | 11645.454545 | 19500.0 | 13977.757576 | 1430.087791 | M3 |
14 | M3_03 Bear 2021 | 9637 | 10576 | 766502 | 189128 | 7466.666667 | 10900.0 | 9195.333333 | 715.137594 | M3 |
15 | M3_04 Elf 2021 | 9686 | 10678 | 944323 | 242792 | 11586.363636 | 20300.0 | 14814.606061 | 1527.809485 | M3 |
16 | M3_05 Viper 2021 | 9701 | 10753 | 956484 | 240472 | 12990.909091 | 20900.0 | 16159.515152 | 1322.547676 | M3 |
17 | M3_06 Magic 2021 | 9681 | 10632 | 869262 | 215751 | 11263.636364 | 17200.0 | 13526.060606 | 1034.558786 | M3 |
18 | M3_07 Griffin 2021 | 9669 | 10633 | 856103 | 212764 | 10140.909091 | 15800.0 | 12241.515152 | 1040.267969 | M3 |
19 | M3_08 Draconid 2021 | 9681 | 10767 | 911273 | 232360 | 11200.000000 | 16900.0 | 12854.545455 | 1051.176279 | M3 |
20 | M3_09 Dryad 2021 | 9688 | 10809 | 940655 | 223832 | 11868.181818 | 15900.0 | 13442.969697 | 746.264455 | M3 |
21 | M3_10 Cat 2021 | 9614 | 10366 | 719696 | 149104 | 5100.000000 | 7770.0 | 6559.939394 | 575.186402 | M3 |
22 | M3_11 Mahakam 2021 | 9725 | 10580 | 1017256 | 247078 | 12981.818182 | 23300.0 | 17920.363636 | 1647.708090 | M3 |
23 | M3_12 Wild Hunt 2021 | 9735 | 10714 | 1044941 | 258469 | 12981.818182 | 27500.0 | 19567.696970 | 2331.781053 | M3 |
24 | M4_01 Wolf 2022 | 9646 | 10684 | 883881 | 226411 | 8690.909091 | 18900.0 | 12063.515152 | 1789.893240 | M4 |
sns.regplot(
    data=merged_df, x="num_matches", y="high_estimate", scatter=False, color=".25",
)
sns.scatterplot(data=merged_df, x="num_matches", y="high_estimate", hue="Masters")
plt.show()
sns.regplot(
    data=merged_df,
    x="num_matches_top500",
    y="high_estimate",
    scatter=False,
    color=".25",
)
sns.scatterplot(
    data=merged_df, x="num_matches_top500", y="high_estimate", hue="Masters"
)
plt.show()
There is a clear and mostly linear correlation between the estimated number of Pro Rank players and the number of games played by both the top 2860 and the top 500 players. This suggests that the more players there are, the more matches someone has to play to reach the higher ranks.
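To put a number on what the plots show, one option is Pearson's correlation coefficient. The sketch below is an addition rather than part of the original analysis: it copies the `num_matches` and `high_estimate` columns from the merged table above and uses a small hand-written `pearson_r` helper so that it runs without pandas.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Columns copied from merged_df, in season order (M2_01 ... M4_01)
num_matches = [
    699496, 769172, 862283, 1004603, 859640, 793013, 996516, 837545, 854593,
    928845, 983150, 1182353, 808651, 917027, 766502, 944323, 956484, 869262,
    856103, 911273, 940655, 719696, 1017256, 1044941, 883881,
]
high_estimate = [
    3600, 7100, 10300, 18000, 11400, 9800, 19900, 13300, 12580,
    14620, 18900, 35900, 13700, 19500, 10900, 20300, 20900, 17200,
    15800, 16900, 15900, 7770, 23300, 27500, 18900,
]

r = pearson_r(num_matches, high_estimate)
print(f"Pearson r between num_matches and high_estimate: {r:.2f}")
```

With `merged_df` in memory, `merged_df["num_matches"].corr(merged_df["high_estimate"])` should give the same figure. Note that a high coefficient only describes the association; it does not by itself show that more players cause more matches.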