In [ ]:

```
import numpy as np
import pandas as pd
# Regression metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
np.random.seed(17)
```

** Leaderboard probing ** - it's a competition specific technick tightly connected with data leakages. There are two tipes of LB probing:

- extraction ground thruth from public part of LB by changing small subset of predictions in submission;
- extraction information about private part of LB by submitting to public LB (based on consistent categories).

In this tutorial we will concentrate on the first type of probing. In some cases specific exploit makes possible to find all y-values of the public leaderboard (LB) score. In other cases we can obtain some imformation about public LB y-values distribution.

Now, we are going to create some functions, which will help us in testing exploits.

In [ ]:

```
def generate_leaderboards(values):
"""
Generate public and private leaderboard from values.
"""
df = pd.DataFrame(values, columns=["target"])
public_lb, private_pb = train_test_split(df, test_size=0.3, shuffle=False)
return public_lb, private_pb
```

In [ ]:

```
def generate_submission(values, leaderboard):
"""
Generate sample submission from values for leaderboard.
"""
sample_submission = pd.DataFrame(leaderboard, copy=True)
sample_submission["target"] = values
return sample_submission
```

In [ ]:

```
def generate_data(values):
"""
Generate experimental environment: public and private leaderboards, zero sample submissio from values.
"""
public_lb, private_pb = generate_leaderboards(values)
zero_submission = generate_submission(0, public_lb)
n = public_lb.size
return public_lb, private_pb, zero_submission, n
```

In [ ]:

```
def make_submission(predicted_values, metric, leaderboard):
"""
Evaluate predicted values with metric for leaderboard.
"""
return metric(leaderboard["target"], predicted_values["target"])
```

Let's start with mean absolute error metric:

$$\large{MAE(y, \hat{y}) = \frac{\sum_{i=1}^n |y_i - \hat{y}_i|}{n} }$$

where

- $y$ - vector with true components;
- $y_i$ - true y-value;
- $\hat{y}$ - vector with predicted (i.e. submitted) values;
- $\hat{y_i}$ - predicted y-value;
- $n$ - number of data points.

**non negative** we can make zero submission to **obtain mean target value**:

$$\large{MAE(y, 0) = \frac{\sum_{i=1}^n |y_i - 0|}{n} = \frac{\sum_{i=1}^n y_i}{n} }$$

In [ ]:

```
# Generate environment with non negative target values
target_values = np.random.randint(0, 10, size=1000)
public_lb, private_lb, sample_submission, n = generate_data(target_values)
sample_submission.head(3)
```

In [ ]:

```
# Make zero submission
p_z = make_submission(sample_submission, mean_absolute_error, public_lb)
print("Zero submission score:", p_z)
print("Public leaderbord target mean:", public_lb["target"].mean())
```

Using this information we can modify our predictions to improve public LB score.

**find all y-values** for all data points used in computation of public LB score. Now we will talk about mean squared error:

$$\large{ MSE(y, \hat{y}) = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n} }$$

At first, let's make all zeros submission and denote result score as $p_z$:

$$\large{ p_{z} = \frac{\sum_{i=1}^n (y_i - 0)^2}{n} } = \frac{(y_1 - 0)^2}{n} + \frac{\sum_{i=2}^n (y_i - 0)^2}{n}$$

In [ ]:

```
# Generate environment
target_values = np.random.randint(-10, 10, size=1000)
public_lb, private_lb, sample_submission, n = generate_data(target_values)
sample_submission.head(3)
```

In [ ]:

```
# Make zero submission
p_z = make_submission(sample_submission, mean_squared_error, public_lb)
print("Zero submission score:", p_z)
```

In [ ]:

```
# Set first y-value to 100
sample_submission["target"][0] = 100
sample_submission.head(3)
```

We make a new submission where 1st y-value set to 100 and denote result score as $p^1_{100}$:

$$ \large{ p_{100}^{1} = \frac{(y_1 - 100)^2}{n} + \frac{\sum_{i=2}^n (y_i - 0)^2}{n} }$$

In [ ]:

```
p_1_100 = make_submission(sample_submission, mean_squared_error, public_lb)
print("Submission score:", p_1_100)
```

$$\large{ y_1 = \frac{100^2 - n*(p_{100}^1 - p_{z})}{2*100} }$$

In [ ]:

```
# Calculate y_1
y1 = (100 ** 2 - n * (p_1_100 - p_z)) / 200
print("Obtained y_1:", y1)
print("Actual y_1:", public_lb["target"][0])
```

More generally we can obtain $y_i$ by this formula:

$$\large{ y_i = \frac{100^2 - n*(p_{100}^i - p_{z})}{2*100} }$$

For example, for $y_2$:

In [ ]:

```
# Set second y-value to 100
sample_submission["target"][0] = 0
sample_submission["target"][1] = 100
# Make submission
p_2_100 = make_submission(sample_submission, mean_squared_error, public_lb)
# Calculate y_2
y2 = (100 ** 2 - n * (p_2_100 - p_z)) / 200
print("Obtained y_2:", y2)
print("Actual y_2:", public_lb["target"][1])
```

**variance of public LB** and **find all y-values** for all data points used in computation of public LB score. Metric definition:

$$\large{ R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y}_i)^2} }$$

$$\large{ S_{tot} = \sum_{i=1}^n (y_i - \bar{y}_i)^2 }$$

Let's look on the variance formula:

$$\large{ Var(y) = \frac{\sum_{i=1}^n (y_i - \bar{y}_i)^2}{n} = \frac{S_{tot}}{n} }$$

So, if we can find $S_{tot}$ in $R^2$ metric, then we can obtain the variance of public LB.

As always, let's start with zero submission:

$$\large{ p_z = 1 - \frac{\sum_{i=1}^n (y_i - 0)^2}{S_{tot}} = 1 - \frac{y_1^2}{S_{tot}} - \frac{\sum_{i=2}^n (y_i - 0)^2}{S_{tot}} }$$

In [ ]:

```
## Generate environment
data = np.random.randint(-15, 15, size=1000)
public_lb, private_pb, sample_submission, n = generate_data(data)
sample_submission.head(3)
```

In [ ]:

```
# Make zero submission
p_z = make_submission(sample_submission, r2_score, public_lb)
print("Zero submission score:", p_z)
```

Second probe will be with 1st y-value set to 100 (no suprise here):

$$\large{ p^1_{100} = 1 - \frac{(y_1 - 100)^2}{S_{tot}} - \frac{\sum_{i=2}^n (y_i - 0)^2}{S_{tot}} }$$

In [ ]:

```
# Set first y-value to 100
sample_submission["target"][0] = 100
sample_submission.head(3)
```

In [ ]:

```
# Make submission
p_1_100 = make_submission(sample_submission, r2_score, public_lb)
print("Submission score:", p_1_100)
```

$$\large{ y_1 = \frac{S_{tot}(p^1_{100} - p_z) + 100^2}{2*100} }$$

$$\large{ p^1_{200} = 1 - \frac{(y_1 - 200)^2}{S_{tot}} - \frac{\sum_{i=2}^n (y_i - 0)^2}{S_{tot}} }$$

In [ ]:

```
# Set first y-value to 200
sample_submission["target"][0] = 200
sample_submission.head(3)
```

In [ ]:

```
# Make submission
p_1_200 = make_submission(sample_submission, r2_score, public_lb)
print("Submission score:", p_1_200)
```

$$\large{ y_1 = \frac{S_{tot}(p^1_{200} - p_z) + 200^2}{2*200} }$$

Now we have two equations that can be solved over $S_{tot}$, by excluding $y_1$:

$$\large{ S_{tot} = \frac{200^2 - 2*100^2}{2*p^1_{100} - p_z - p^1_{200}} }$$

In [ ]:

```
# Calculate total sum of squares
s_tot = (200.0 ** 2 - 2 * 100.0 ** 2) / (2 * p_1_100 - p_z - p_1_200)
print("Total sum of squares:", s_tot)
```

Now we are ready to calculate the variance:

In [ ]:

```
# Calculate variance
var_y = s_tot / n
print("Obtained variance of public LB:", var_y)
print("Actual variance:", public_lb["target"].var(ddof=0))
```

And $y_1$:

In [ ]:

```
# Calculate y_1
y1 = (s_tot * (p_1_100 - p_z) + 100 ** 2) / (2 * 100.0)
print("Obtained y_1:", y1)
print("Actual y_1:", public_lb["target"][0])
```

Having now $S_{tot}$ we can find $y_i$ by this formula:

$$\large{ y_i = \frac{S_{tot}(p^i_{100} - p_z) + 100^2}{2*100} }$$

So base idea behind all of described methods is to making submissions which defferent by one value from each other and then solving equations (mostly by substracting one from another).

Knowledge obtained by described methods cannot be used to directly improve result in private LB, since we cannot probe the y-values from there. However if public and private LB data cames from the same distribution, obtained mean or variance of public LB can be useful.

And, of course, obtaining all y-values methods are limited by some number of submissions per day.

- Mean absolute error
- Mean squared error
- Coefficient of determination
- How to get the exact y-values of all data points used for the computation of the public leaderboard score in the ”Mercedes-Benz Greener Manufacturing” competition on Kaggle
- How to Win a Data Science Competition: Learn from Top Kagglers (Week 2)
- The "Perfect Score" Script