Policy Evaluation in Contextual Bandits

This IPython notebook illustrates the usage of the contextualbandits package's evaluation module through a simulation with public datasets.

Small note: if the TOC here is not clickable or the math symbols don't show properly, try visualizing this same notebook from nbviewer following this link.


1. Problem description

2. Methods

3. Experiments

4. References

1. Problem description

For a general description of the contextual bandits problem, see the first part of the package's guide Online Contextual Bandits.

The previous two guides Online Contextual Bandits and Off-policy Learning in Contextual Bandits evaluated the performance of different policies by looking at the actions they would have chosen in a fully-labeled dataset for multi-label classification.

However, in contextual bandits settings one doesn't have access to fully-labeled data, and the data that one has is usually very biased, as it is collected through some policy that aims to maximize rewards. In this situation, it is a lot more difficult to evaluate the performance of a new policy. This module deals with such problem.

2. Methods

This module implements three policy evaluation methods:

  • evaluateRejectionSampling (see "A contextual-bandit approach to personalized news article recommendation"), for both online and offline policies.

  • evaluateDoublyRobust (see "Doubly Robust Policy Evaluation and Learning").

  • evaluateNCIS (see "Offline a/b testing for recommender systems")

These should ideally be based on a train-test split - that is, the policy is trained with some data and evaluated on different data.

The best way to obtain a good estimate of the performance of a policy is to collect some data on which actions are chosen at random. When such data is available, one can iterate through it, let the policy choose an action for each observation, and if it matches with what was chosen, take it along with its rewards for evaluation purposes, skip it if not. This simple rejection sampling method is unbiased and lets you evaluate both online and offline algorithms. It must be stressed that evaluating data like this only works when the actions of this test sample are chosen at random, otherwise the estimates will be biased (and likely very wrong).

When such data is not available and there is reasonable variety of actions chosen, other options are doubly-robust estimates and inverse-propensity-adjusted estimates. The first one is meant for the case of continuous rewards though, and doesn't work as well with discrete rewards, especially when there are many labels, but can still be tried.

The doubly-robust estimate requires, as it names suggests, two estimates: one of the reward that each arm will give, and another of the probability or score that the policy that collected the data gave to each arm it chose for each observation.

The estimates based on inverse-prosensity corrections require probabilistic distributions over the chosen arms, which this package does not produce as it follows different paradigms, and as such, cannot really be used properly with the kind of data that is shown here, and its outputs will not have the same properties as described in the references, but it can still provide an improvement in the estimations.

There are different ways of building a reward estimator for the doubly-robust method. One option is to fit a (non-online) model to both the train and test sets to make reward estimates on the test set, or fit it only on the test set (while the policy to be evaluated is fitted to the training set); or perhaps even use the score estimates from the old policy (which chose the actions on the training and test data) or from the new policy. The function evaluateDoublyRobust provides an API that can accomodate all these methods.

3. Experiments

Just like in the previous guide Off-policy Learning in Contextual Bandits, I will simualate data generated from a policy by fitting a logistic regression model with a sample of the fully-labeled data, then let it choose actions for some more data, and take those actions and rewards as input for a new policy, along with the estimated reward probabilities for the actions that were chosen.

The new policy will then be evaluated on a test sample with actions already pre-selected, and the estimates from the methods here will be compared with the real rewards, which we can know because the data is fully labeled.

The data are again the Bibtext dataset, plus the larger Mediamill dataset.

Loading the Bibtex dataset again:

In [1]:
import pandas as pd, numpy as np, re
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.datasets import load_svmlight_file

def parse_data(filename):
    with open(filename, "rb") as f:
        infoline = f.readline()
        infoline = re.sub(r"^b'", "", str(infoline))
        n_features = int(re.sub(r"^\d+\s(\d+)\s\d+.*$", r"\1", infoline))
        features, labels = load_svmlight_file(f, n_features=n_features, multilabel=True)
    mlb = MultiLabelBinarizer()
    labels = mlb.fit_transform(labels)
    features = np.array(features.todense())
    features = np.ascontiguousarray(features)
    return features, labels

features, y = parse_data("Bibtex_data.txt")
(7395, 1836)
(7395, 159)

Simulating a stationary exploration policy and a test set:

In [2]:
from sklearn.linear_model import LogisticRegression

# the 'explorer' polcy will be fit to this small sample of the rows
st_seed = 0
end_seed = 2000

# then it will choose actions for this larger sample, which will be the input for the new policy
st_exploration = 0
end_exploration = 3000

# the new policy will be evaluated with a separate test set
st_test = 3000
end_test = 7395

# separating the covariates data for each case
Xseed = features[st_seed:end_seed, :]
Xexplore_sample = features[st_exploration:end_exploration, :]
Xtest = features[st_test:end_test, :]
nchoices = y.shape[1]

# now constructing an exploration policy as explained above, with fully-labeled data
explorer = LogisticRegression(C=0.1, solver="lbfgs", max_iter=15000)
explorer.fit(Xseed, np.argmax(y[st_seed:end_seed], axis=1))

# letting the exploration policy choose actions for the new policy input
rewards_explore_sample=y[st_exploration:end_exploration, :]\
                        [np.arange(end_exploration - st_exploration), actions_explore_sample]

# extracting the probabilities it estimated
ix_internal_actions = {j:i for i,j in enumerate(explorer.classes_)}
ix_internal_actions = [ix_internal_actions[i] for i in actions_explore_sample]
ix_internal_actions = np.array(ix_internal_actions)
prob_actions_explore = explorer.predict_proba(Xexplore_sample)[np.arange(Xexplore_sample.shape[0]),

# generating a test set with random actions
actions_test = np.random.randint(nchoices, size=end_test - st_test)
rewards_test = y[st_test:end_test, :][np.arange(end_test - st_test), actions_test]

Rejection sampling estimate:

In [3]:
from contextualbandits.online import SeparateClassifiers
from contextualbandits.evaluation import evaluateRejectionSampling

new_policy = SeparateClassifiers(LogisticRegression(C=0.1, solver="lbfgs", max_iter=15000),
                                 y.shape[1], smoothing=(1,2), beta_prior=None, random_state=123)
new_policy.fit(Xexplore_sample, actions_explore_sample, rewards_explore_sample)
est_r, ncases = evaluateRejectionSampling(new_policy, X=Xtest, a=actions_test, r=rewards_test,
real_r = np.mean(y[st_test:end_test,:][np.arange(end_test - st_test), new_policy.predict(Xtest)])

print('Test set Rejection Sampling mean reward estimate (new policy)')
print('Estimated mean reward: ',est_r)
print('Sample size: ', ncases)
print('Real mean reward: ', real_r)
Test set Rejection Sampling mean reward estimate (new policy)
Estimated mean reward:  0.13043478260869565
Sample size:  23
Real mean reward:  0.1447098976109215

We can also evaluate the exploration policy with the same method:

In [4]:
est_r, ncases = evaluateRejectionSampling(explorer, X=Xtest, a=actions_test, r=rewards_test, online=False)
real_r = np.mean(y[st_test:end_test, :][np.arange(end_test - st_test), explorer.predict(Xtest)])

print('Test set Rejection Sampling mean reward estimate (old policy)')
print('Estimated mean reward: ', est_r)
print('Sample size: ', ncases)
print('Real mean reward: ', real_r)
Test set Rejection Sampling mean reward estimate (old policy)
Estimated mean reward:  0.5789473684210527
Sample size:  19
Real mean reward:  0.4814562002275313

(Remember that the exploration policy was fit with a smaller set of fully-labeled data, thus it's no surprise it performs a lot better)

The estimates are not exact, but they are somewhat close to the real values as expected. They get better the more cases are successfully sampled, and their estimate should follow the central limit theorem.

To be stressed again, such an evaluation method only works when the data was collected by choosing actions at random. If we evaluate it with the actions chosen by the exploration policy, the results will be totally biased as demonstrated here:

In [5]:
actions_test_biased = explorer.predict(Xtest)
rewards_test_biased = y[st_test:end_test, :][np.arange(end_test - st_test), actions_test_biased]
est_r, ncases = evaluateRejectionSampling(new_policy, X=Xtest, a=actions_test_biased,
                                          r=rewards_test_biased, online=False)
real_r = np.mean(y[st_test:end_test, :][np.arange(end_test - st_test), new_policy.predict(Xtest)])

print('Biased Test set Rejection Sampling mean reward estimate (new policy)')
print('Estimated mean reward: ', est_r)
print('Sample size: ', ncases)
print('Real mean reward: ', real_r)
print("(Don't try rejection sampling on a biased test set)")
Biased Test set Rejection Sampling mean reward estimate (new policy)
Estimated mean reward:  1.0
Sample size:  551
Real mean reward:  0.1447098976109215
(Don't try rejection sampling on a biased test set)

We can also try Doubly-Robust estimates, but these work poorly for a dataset like this:

In [6]:
from contextualbandits.evaluation import evaluateDoublyRobust

# getting estimated probabilities for the biased test sample chosen by the old policy
ix_internal_actions = {j:i for i,j in enumerate(explorer.classes_)}
ix_internal_actions = [ix_internal_actions[i] for i in actions_test_biased]
ix_internal_actions = np.array(ix_internal_actions)
prob_actions_test_biased = explorer.predict_proba(Xtest)[np.arange(Xtest.shape[0]), ix_internal_actions]

# actions that the new policy will choose
pred = new_policy.predict(Xtest)

# method 1: estimating rewards by fitting another model to the whole data (train + test)
model_fit_on_all_data = SeparateClassifiers(LogisticRegression(C=0.1, solver="lbfgs", max_iter=15000),
                                            y.shape[1], random_state=123)
model_fit_on_all_data.fit(np.r_[Xexplore_sample, Xtest],
                          np.r_[actions_explore_sample, actions_test_biased],
                          np.r_[rewards_explore_sample, rewards_test_biased])
est_r_dr_whole = evaluateDoublyRobust(pred,
                                      X=Xtest, a=actions_test_biased, r=rewards_test_biased,
                                      p=prob_actions_test_biased, reward_estimator=model_fit_on_all_data,

# method 2: estimating rewards by fitting another model to the test data only
est_r_dr_test_only = evaluateDoublyRobust(pred, X=Xtest, a=actions_test_biased,
                                          r=rewards_test_biased, p=prob_actions_test_biased,
                                          reward_estimator=LogisticRegression(C=0.1, solver="lbfgs", max_iter=15000),
                                          nchoices=y.shape[1], random_state=123)

print('Biased Test set mean reward estimates (new policy)')
print('DR estimate (reward estimator fit on train+test): ', est_r_dr_whole)
print('DR estimate (reward estimator fit on test only): ', est_r_dr_test_only)
print('Real mean reward: ', real_r)
Biased Test set mean reward estimates (new policy)
DR estimate (reward estimator fit on train+test):  0.8688975703150306
DR estimate (reward estimator fit on test only):  0.8724866012102672
Real mean reward:  0.1447098976109215

Both estimates are very wrong, although less wrong than rejection sampling with a non-random test set. This is however a rather unlucky case as varying e.g. the random seed and regularization in the original policy can lead to far better estimates with the doubly-robust method.

In this situation, the NCIS method can provide a more realistic estimate - it should be stressed again that this is not the same as the NCIS method described in the references, as here we don't use probabilistic distributions over arms, and thus, it will not enjoy the same theoretical guarantees (see documentation for details).

In [7]:
from contextualbandits.evaluation import evaluateNCIS

rew_pred = new_policy.predict_proba_separate(Xtest)[np.arange(Xtest.shape[0]), actions_test_biased]
est_r_ncis = evaluateNCIS(rew_pred, rewards_test_biased, prob_actions_test_biased)
print('Approximate NCIS mean reward estimates (new policy)')
print('Estimated mean reward (approx. NCIS): ', est_r_ncis)
print('Real mean reward: ', real_r)
Approximate NCIS mean reward estimates (new policy)
Estimated mean reward (approx. NCIS):  0.3235873655002847
Real mean reward:  0.1447098976109215

Finally, rejection sampling can also be used to evaluate online policies - in this case though, be aware that the estimate will only be considered up to a certain number of rounds (as many as it accepts, but it will end up rejecting the majority), but online policies keep improving with time.

Here I will use the Mediamill dataset instead, as it has a lot more data - be aware that it takes a long time to evaluate:

In [8]:
from contextualbandits.online import BootstrappedUCB

features, y = parse_data("Mediamill_data.txt")
nchoices = y.shape[1]

actions_random = np.random.randint(nchoices, size = features.shape[0])
rewards_actions = y[np.arange(y.shape[0]), actions_random]

online_policy = BootstrappedUCB(LogisticRegression(C=0.1, solver="lbfgs", max_iter=15000),
                                y.shape[1], random_state=1234)
                          X = features,
                          a = actions_random,
                          r = rewards_actions,
                          online = True,
                          start_point_online = 'random',
                          batch_size = 10,
                          random_state = 5678)
Batches: 100%|██████████| 4391/4391 [08:48<00:00,  8.30it/s]
(0.45393258426966293, 445)

4. References

  • Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010, April). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web (pp. 661-670). ACM.

  • Dudík, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.