Notebook

In [ ]:

%load_ext autoreload
%autoreload 2

Installation¶

If you haven't installed Pearl, please make sure you install Pearl with the following cell. Otherwise, you can skip the cell below.

In [ ]:

%pip uninstall Pearl -y
%rm -rf Pearl
!git clone https://github.com/facebookresearch/Pearl.git
%cd Pearl
%pip install .
%cd ..

WARNING: Skipping Pearl as it is not installed.
Cloning into 'Pearl'...
remote: Enumerating objects: 5987, done.
remote: Counting objects: 100% (2196/2196), done.
remote: Compressing objects: 100% (675/675), done.
remote: Total 5987 (delta 1674), reused 1941 (delta 1503), pack-reused 3791
Receiving objects: 100% (5987/5987), 54.36 MiB | 14.03 MiB/s, done.
Resolving deltas: 100% (4001/4001), done.
/content/Pearl
Processing /content/Pearl
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: gym in /usr/local/lib/python3.10/dist-packages (from Pearl==0.1.0) (0.25.2)
Collecting gymnasium[accept-rom-license,atari,mujoco] (from Pearl==0.1.0)
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 953.9/953.9 kB 15.9 MB/s eta 0:00:00
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from Pearl==0.1.0) (1.25.2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from Pearl==0.1.0) (3.7.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from Pearl==0.1.0) (2.0.3)
Collecting parameterized (from Pearl==0.1.0)
  Downloading parameterized-0.9.0-py2.py3-none-any.whl (20 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from Pearl==0.1.0) (2.31.0)
Collecting mujoco (from Pearl==0.1.0)
  Downloading mujoco-3.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 71.6 MB/s eta 0:00:00
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from Pearl==0.1.0) (2.2.1+cu121)
Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from Pearl==0.1.0) (0.17.1+cu121)
Requirement already satisfied: torchaudio in /usr/local/lib/python3.10/dist-packages (from Pearl==0.1.0) (2.2.1+cu121)
Requirement already satisfied: cloudpickle>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from gym->Pearl==0.1.0) (2.2.1)
Requirement already satisfied: gym-notices>=0.0.4 in /usr/local/lib/python3.10/dist-packages (from gym->Pearl==0.1.0) (0.0.8)
Requirement already satisfied: typing-extensions>=4.3.0 in /usr/local/lib/python3.10/dist-packages (from gymnasium[accept-rom-license,atari,mujoco]->Pearl==0.1.0) (4.11.0)
Collecting farama-notifications>=0.0.1 (from gymnasium[accept-rom-license,atari,mujoco]->Pearl==0.1.0)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Collecting autorom[accept-rom-license]~=0.4.2 (from gymnasium[accept-rom-license,atari,mujoco]->Pearl==0.1.0)
  Downloading AutoROM-0.4.2-py3-none-any.whl (16 kB)
Collecting shimmy[atari]<1.0,>=0.1.0 (from gymnasium[accept-rom-license,atari,mujoco]->Pearl==0.1.0)
  Downloading Shimmy-0.2.1-py3-none-any.whl (25 kB)
Requirement already satisfied: imageio>=2.14.1 in /usr/local/lib/python3.10/dist-packages (from gymnasium[accept-rom-license,atari,mujoco]->Pearl==0.1.0) (2.31.6)
Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from mujoco->Pearl==0.1.0) (1.4.0)
Requirement already satisfied: etils[epath] in /usr/local/lib/python3.10/dist-packages (from mujoco->Pearl==0.1.0) (1.7.0)
Collecting glfw (from mujoco->Pearl==0.1.0)
  Downloading glfw-2.7.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38-none-manylinux2014_x86_64.whl (211 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 211.8/211.8 kB 33.2 MB/s eta 0:00:00
Requirement already satisfied: pyopengl in /usr/local/lib/python3.10/dist-packages (from mujoco->Pearl==0.1.0) (3.1.7)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->Pearl==0.1.0) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->Pearl==0.1.0) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->Pearl==0.1.0) (4.51.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->Pearl==0.1.0) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->Pearl==0.1.0) (24.0)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->Pearl==0.1.0) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->Pearl==0.1.0) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->Pearl==0.1.0) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->Pearl==0.1.0) (2023.4)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->Pearl==0.1.0) (2024.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->Pearl==0.1.0) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->Pearl==0.1.0) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->Pearl==0.1.0) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->Pearl==0.1.0) (2024.2.2)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch->Pearl==0.1.0) (3.14.0)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->Pearl==0.1.0) (1.12)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->Pearl==0.1.0) (3.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->Pearl==0.1.0) (3.1.4)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->Pearl==0.1.0) (2023.6.0)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->Pearl==0.1.0)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->Pearl==0.1.0)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->Pearl==0.1.0)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->Pearl==0.1.0)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->Pearl==0.1.0)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch->Pearl==0.1.0)
  Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch->Pearl==0.1.0)
  Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch->Pearl==0.1.0)
  Using cached nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch->Pearl==0.1.0)
  Using cached nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
Collecting nvidia-nccl-cu12==2.19.3 (from torch->Pearl==0.1.0)
  Using cached nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (166.0 MB)
Collecting nvidia-nvtx-cu12==12.1.105 (from torch->Pearl==0.1.0)
  Using cached nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
Requirement already satisfied: triton==2.2.0 in /usr/local/lib/python3.10/dist-packages (from torch->Pearl==0.1.0) (2.2.0)
Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch->Pearl==0.1.0)
  Using cached nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from autorom[accept-rom-license]~=0.4.2->gymnasium[accept-rom-license,atari,mujoco]->Pearl==0.1.0) (8.1.7)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from autorom[accept-rom-license]~=0.4.2->gymnasium[accept-rom-license,atari,mujoco]->Pearl==0.1.0) (4.66.4)
Collecting AutoROM.accept-rom-license (from autorom[accept-rom-license]~=0.4.2->gymnasium[accept-rom-license,atari,mujoco]->Pearl==0.1.0)
  Downloading AutoROM.accept-rom-license-0.6.1.tar.gz (434 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 434.7/434.7 kB 47.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->Pearl==0.1.0) (1.16.0)
Collecting ale-py~=0.8.1 (from shimmy[atari]<1.0,>=0.1.0->gymnasium[accept-rom-license,atari,mujoco]->Pearl==0.1.0)
  Downloading ale_py-0.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 36.7 MB/s eta 0:00:00
Requirement already satisfied: importlib_resources in /usr/local/lib/python3.10/dist-packages (from etils[epath]->mujoco->Pearl==0.1.0) (6.4.0)
Requirement already satisfied: zipp in /usr/local/lib/python3.10/dist-packages (from etils[epath]->mujoco->Pearl==0.1.0) (3.18.1)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->Pearl==0.1.0) (2.1.5)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->Pearl==0.1.0) (1.3.0)
Building wheels for collected packages: Pearl, AutoROM.accept-rom-license
  Building wheel for Pearl (pyproject.toml) ... done
  Created wheel for Pearl: filename=Pearl-0.1.0-py3-none-any.whl size=215044 sha256=c2d11dd3e65c5bb6720538a7d224f13e2f26f857c237ad6f718fa55a99983926
  Stored in directory: /tmp/pip-ephem-wheel-cache-mi9lzaly/wheels/83/80/1d/d9211ba70ee392341daf21a07252739e0cb2af9f95439a28cd
  Building wheel for AutoROM.accept-rom-license (pyproject.toml) ... done
  Created wheel for AutoROM.accept-rom-license: filename=AutoROM.accept_rom_license-0.6.1-py3-none-any.whl size=446659 sha256=50b64bf3c726e54dfa462a4cd8af9e666df0c9401e6bb1aedd4a0644e4abc2da
  Stored in directory: /root/.cache/pip/wheels/6b/1b/ef/a43ff1a2f1736d5711faa1ba4c1f61be1131b8899e6a057811
Successfully built Pearl AutoROM.accept-rom-license
Installing collected packages: glfw, farama-notifications, parameterized, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, gymnasium, ale-py, shimmy, nvidia-cusparse-cu12, nvidia-cudnn-cu12, AutoROM.accept-rom-license, autorom, nvidia-cusolver-cu12, mujoco, Pearl
Successfully installed AutoROM.accept-rom-license-0.6.1 Pearl-0.1.0 ale-py-0.8.1 autorom-0.4.2 farama-notifications-0.0.4 glfw-2.7.0 gymnasium-0.29.1 mujoco-3.1.5 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.19.3 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.1.105 parameterized-0.9.0 shimmy-0.2.1
/content

Import Modules¶

In [ ]:

from pearl.utils.functional_utils.experimentation.set_seed import set_seed
from pearl.action_representation_modules.one_hot_action_representation_module import OneHotActionTensorRepresentationModule
from pearl.replay_buffers.sequential_decision_making.fifo_off_policy_replay_buffer import FIFOOffPolicyReplayBuffer
from pearl.utils.functional_utils.train_and_eval.online_learning import online_learning
from pearl.pearl_agent import PearlAgent
from pearl.utils.uci_data import download_uci_data
from pearl.utils.instantiations.environments.contextual_bandit_uci_environment import (
    SLCBEnvironment,
)
from pearl.policy_learners.exploration_modules.contextual_bandits.squarecb_exploration import SquareCBExploration
from pearl.policy_learners.exploration_modules.contextual_bandits.ucb_exploration import (
    UCBExploration,
)
from pearl.policy_learners.exploration_modules.contextual_bandits.thompson_sampling_exploration import (
    ThompsonSamplingExplorationLinear,
)
from pearl.policy_learners.contextual_bandits.neural_bandit import NeuralBandit
from pearl.policy_learners.contextual_bandits.neural_linear_bandit import (
    NeuralLinearBandit,
)
import torch
import matplotlib.pyplot as plt
import numpy as np
import os

set_seed(0)

Load Environment¶

The environment which underlies the experiments to follow is a contextual bandit environment we added to Pearl that allows us to use UCI datasets (https://archive.ics.uci.edu/datasets).

The UCI datasets span a wide variety of prediction tasks. We use these tasks to construct a contexual bandit environment in which an agent receives an expected reward of 1 if it correctly labels a data point and 0 otherwise. Pearl currently supports the following datasets: pendigits, letter, satimage, yeast. Additional ones can be readily added.

In the following experiment we will test different types of contextual bandits algorithms on the pendigits UCI dataset.

In [ ]:

# load environment
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download UCI dataset if doesn't exist
uci_data_path = "./utils/instantiations/environments/uci_datasets"
if not os.path.exists(uci_data_path):
    os.makedirs(uci_data_path)
    download_uci_data(data_path=uci_data_path)

# Built CB environment using the pendigits UCI dataset
pendigits_uci_dict =  {
    "path_filename": os.path.join(uci_data_path, "pendigits/pendigits.tra"),
    "action_embeddings": "discrete",
    "delim_whitespace": False,
    "ind_to_drop": [],
    "target_column": 16,
}
env = SLCBEnvironment(**pendigits_uci_dict)

# experiment code
number_of_steps = 10000
record_period = 400

Contextual Bandits learners¶

The following sections show how to implement the neural versions of SquareCB, LinUCB and LinTS with Pearl.

Contextual Bandits learners: SquareCB¶

The SquareCB algorithm requires only a regression model with which it learns the reward function. Given the reward model, SquareCB executes the following policy: $$ \widehat{a}_*\in \arg\max_a\widehat{r}(x,a)\\ \widehat{r}_*\in \max_a\widehat{r}(x,a)\\ \text{If $a\neq \widehat{a}_*$}: \pi(a,x)= \frac{1}{A + \gamma (\widehat{r}_* - \widehat{r}(x,a))}\\ \text{If $a= \widehat{a}_*$}: \pi(a,x) = 1-\sum_{a'\neq \widehat{a}_*}\pi(a',x). $$ This policy balances exploration and exploitation in an intelligent way.

To use the SquareCB algrorithm in Pearl we set the policy learner as NeuralBandit. NeuralBandit is class supportings the estimation of the reward function with a neural architecture. With access to an estimated reward model, we then use an instance of SquareCBExploration as an exploration module.

To further highlight the versatility of the modular design of Pearl, we use the OneHotActionTensorRepresentationModule as the action representation module. This module internally converts actions from integers to one-hot-encoded vectors.

In [ ]:

# Create a Neural SquareCB pearl agent with 1-hot action representation
action_representation_module = OneHotActionTensorRepresentationModule(
    max_number_actions= env.unique_labels_num,
)

agent = PearlAgent(
    policy_learner=NeuralBandit(
        feature_dim = env.observation_dim + env.unique_labels_num,
        hidden_dims=[64, 16],
        training_rounds=10,
        learning_rate=0.01,
        action_representation_module=action_representation_module,
        exploration_module= SquareCBExploration(gamma = env.observation_dim * env.unique_labels_num * number_of_steps)
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(100_000),
    device_id=-1,
)


info = online_learning(
    agent=agent,
    env=env,
    number_of_steps=number_of_steps,
    print_every_x_steps=100,
    record_period=record_period,
    learn_after_episode=True,
)
torch.save(info["return"], "SquareCB-return.pt")
plt.plot(record_period * np.arange(len(info["return"])), info["return"], label="SquareCB")
plt.xlabel("time step")
plt.ylabel("return")
plt.legend()
plt.show()

/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)

episode 100, step 100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.02050408534705639
episode 200, step 200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.039917074143886566
episode 300, step 300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0325154066085815
episode 400, step 400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0192910432815552
episode 500, step 500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.020320570096373558
episode 600, step 600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0140972137451172
episode 700, step 700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1774886846542358
episode 800, step 800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0041024684906006
episode 900, step 900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9082531929016113
episode 1000, step 1000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0025124549865723
episode 1100, step 1100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9679498076438904
episode 1200, step 1200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9570890069007874
episode 1300, step 1300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9786309599876404
episode 1400, step 1400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0437270402908325
episode 1500, step 1500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1215401887893677
episode 1600, step 1600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0220673084259033
episode 1700, step 1700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.05659715086221695
episode 1800, step 1800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1713285446166992
episode 1900, step 1900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9805613160133362
episode 2000, step 2000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0700596570968628
episode 2100, step 2100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9755151867866516
episode 2200, step 2200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.06023371219635
episode 2300, step 2300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.2449081689119339
episode 2400, step 2400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9615944623947144
episode 2500, step 2500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1294598579406738
episode 2600, step 2600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0927162170410156
episode 2700, step 2700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0192159414291382
episode 2800, step 2800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.14172282814979553
episode 2900, step 2900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.020617127418518
episode 3000, step 3000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.07183039188385
episode 3100, step 3100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1373060941696167
episode 3200, step 3200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8844553828239441
episode 3300, step 3300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1407734155654907
episode 3400, step 3400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0968198776245117
episode 3500, step 3500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0457595586776733
episode 3600, step 3600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0209089517593384
episode 3700, step 3700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.016340732574463
episode 3800, step 3800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.076174259185791
episode 3900, step 3900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9254880547523499
episode 4000, step 4000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9435696601867676
episode 4100, step 4100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9403758645057678
episode 4200, step 4200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.062246013432741165
episode 4300, step 4300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.09749681502580643
episode 4400, step 4400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9377647042274475
episode 4500, step 4500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9589371681213379
episode 4600, step 4600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8841699361801147
episode 4700, step 4700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0646312236785889
episode 4800, step 4800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1062999963760376
episode 4900, step 4900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1038718223571777
episode 5000, step 5000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.7176197171211243
episode 5100, step 5100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8898045420646667
episode 5200, step 5200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.010150366462767124
episode 5300, step 5300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9760612845420837
episode 5400, step 5400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.06860487163066864
episode 5500, step 5500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.03486515209078789
episode 5600, step 5600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1883316040039062
episode 5700, step 5700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9038528800010681
episode 5800, step 5800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8557829260826111
episode 5900, step 5900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0221916437149048
episode 6000, step 6000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8543550968170166
episode 6100, step 6100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9892317056655884
episode 6200, step 6200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9139447212219238
episode 6300, step 6300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.0652153417468071
episode 6400, step 6400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0828280448913574
episode 6500, step 6500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9806367754936218
episode 6600, step 6600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.2393121719360352
episode 6700, step 6700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0103856325149536
episode 6800, step 6800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8305141925811768
episode 6900, step 6900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9939833879470825
episode 7000, step 7000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.7439374923706055
episode 7100, step 7100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9110734462738037
episode 7200, step 7200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.805903434753418
episode 7300, step 7300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0544805526733398
episode 7400, step 7400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1555981636047363
episode 7500, step 7500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9537562727928162
episode 7600, step 7600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0062040090560913
episode 7700, step 7700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8801354169845581
episode 7800, step 7800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.7815372943878174
episode 7900, step 7900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0681519508361816
episode 8000, step 8000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9116690158843994
episode 8100, step 8100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.01902437210083
episode 8200, step 8200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9659032225608826
episode 8300, step 8300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8814241290092468
episode 8400, step 8400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9224587082862854
episode 8500, step 8500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.2102456092834473
episode 8600, step 8600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9070967435836792
episode 8700, step 8700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.2006794214248657
episode 8800, step 8800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0592678785324097
episode 8900, step 8900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0463409423828125
episode 9000, step 9000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9944213032722473
episode 9100, step 9100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9995201230049133
episode 9200, step 9200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9861416220664978
episode 9300, step 9300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9023017287254333
episode 9400, step 9400, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9761253595352173
episode 9500, step 9500, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8212168216705322
episode 9600, step 9600, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9000095129013062
episode 9700, step 9700, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9509682655334473
episode 9800, step 9800, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.136844277381897
episode 9900, step 9900, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0328410863876343
episode 10000, step 10000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9436702728271484

Contextual Bandits learners: LinUCB¶

Next, we describe how to use the neural version of the LinUCB algorithm with Pearl, which uses UCB type of exploration with neural architectures. LinUCB and its neural version are generalizations of the seminal Upper Confidence Bound (UCB) algorithm. Both execute a policy of the following form: $$ \pi(a,x) \in \arg\max_a \widehat{r}(x,a) + \mathrm{score}(x,a), $$ that is, both use a function that estimates the expected reward with an additional bonus term that quantifies the potential of choosing an action given a certain context. A common way to estimate the score function in the linear case with features $\phi(x,a)$ is: $$ \mathrm{score}(x,a) = \alpha ||\phi(x,a) ||_{A^{-1}}\\ \text{where } A= \lambda I + \sum_{n\leq t} \phi(x_n,a_n)\phi^T(x_n,a_n). $$

To implement the LinUCB algorithm in Pearl, use the NeuralLinearBandit policy learner module. This module supports (i) learning a reward model, and (ii) calculating a score function by estimating the uncertainty using the last layer features. Further, we set the exploration module to an instance of UCBExploration and set the alpha hyper-parameter to enable the agent with the UCB-like update rule.

In [ ]:

# Create a Neural LinUCB pearl agent with 1-hot action representation

action_representation_module = OneHotActionTensorRepresentationModule(
    max_number_actions= env._action_space.n,
)

agent = PearlAgent(
    policy_learner=NeuralLinearBandit(
        feature_dim = env.observation_dim + env._action_space.n,
        hidden_dims=[64, 16],
        state_features_only=False,
        training_rounds=10,
        learning_rate=0.01,
        action_representation_module=action_representation_module,
        exploration_module= UCBExploration(alpha=1.0)
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(100_000),
    device_id=-1,
)


info = online_learning(
    agent=agent,
    env=env,
    number_of_steps=number_of_steps,
    print_every_x_steps=100,
    record_period=record_period,
    learn_after_episode=True,
)
torch.save(info["return"], "LinUCB-return.pt")
plt.plot(record_period * np.arange(len(info["return"])), info["return"], label="LinUCB")
plt.xlabel("time step")
plt.ylabel("return")
plt.legend()
plt.show()

/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)

episode 100, step 100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.20280151069164276
episode 200, step 200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.054456718266010284
episode 300, step 300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.13178661465644836
episode 400, step 400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9865397214889526
episode 500, step 500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.14140896499156952
episode 600, step 600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1187363862991333
episode 700, step 700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.2206898927688599
episode 800, step 800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.22980380058288574
episode 900, step 900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0912549495697021
episode 1000, step 1000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.09148281812667847
episode 1100, step 1100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9348994493484497
episode 1200, step 1200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.0068538920022547245
episode 1300, step 1300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9425393342971802
episode 1400, step 1400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.016242675483226776
episode 1500, step 1500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.09567844122648239
episode 1600, step 1600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9641198515892029
episode 1700, step 1700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0396792888641357
episode 1800, step 1800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9747456908226013
episode 1900, step 1900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8553079962730408
episode 2000, step 2000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9961886405944824
episode 2100, step 2100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0519009828567505
episode 2200, step 2200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.05198976397514343
episode 2300, step 2300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.960572361946106
episode 2400, step 2400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9912083745002747
episode 2500, step 2500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.055912960320711136
episode 2600, step 2600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9400607943534851
episode 2700, step 2700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0280994176864624
episode 2800, step 2800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.024195637553930283
episode 2900, step 2900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1863090991973877
episode 3000, step 3000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.147364616394043
episode 3100, step 3100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.197069525718689
episode 3200, step 3200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9864020347595215
episode 3300, step 3300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.899193286895752
episode 3400, step 3400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1261299848556519
episode 3500, step 3500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1639732122421265
episode 3600, step 3600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.833449125289917
episode 3700, step 3700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0843310356140137
episode 3800, step 3800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9897075891494751
episode 3900, step 3900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.2469022274017334
episode 4000, step 4000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.991430401802063
episode 4100, step 4100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.936269223690033
episode 4200, step 4200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0206660032272339
episode 4300, step 4300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9332488775253296
episode 4400, step 4400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9900529384613037
episode 4500, step 4500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0219541788101196
episode 4600, step 4600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8899850845336914
episode 4700, step 4700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9463319778442383
episode 4800, step 4800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.121014952659607
episode 4900, step 4900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.040822982788086
episode 5000, step 5000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.984503448009491
episode 5100, step 5100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9031474590301514
episode 5200, step 5200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0292550325393677
episode 5300, step 5300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8698779344558716
episode 5400, step 5400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.999628484249115
episode 5500, step 5500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9985622763633728
episode 5600, step 5600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.11453455686569214
episode 5700, step 5700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1535426378250122
episode 5800, step 5800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0082354545593262
episode 5900, step 5900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0808966159820557
episode 6000, step 6000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9922400712966919
episode 6100, step 6100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0792633295059204
episode 6200, step 6200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8470718860626221
episode 6300, step 6300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8804739117622375
episode 6400, step 6400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0601600408554077
episode 6500, step 6500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0142043828964233
episode 6600, step 6600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.948247492313385
episode 6700, step 6700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9545789957046509
episode 6800, step 6800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.010140061378479
episode 6900, step 6900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8813512921333313
episode 7000, step 7000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9689813852310181
episode 7100, step 7100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.060390591621399
episode 7200, step 7200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9932740330696106
episode 7300, step 7300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0427846908569336
episode 7400, step 7400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0150879621505737
episode 7500, step 7500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0593535900115967
episode 7600, step 7600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.04487299919128418
episode 7700, step 7700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0764495134353638
episode 7800, step 7800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8947196006774902
episode 7900, step 7900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0651267766952515
episode 8000, step 8000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.015225519426167011
episode 8100, step 8100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9122466444969177
episode 8200, step 8200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0724564790725708
episode 8300, step 8300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0051934719085693
episode 8400, step 8400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0062308311462402
episode 8500, step 8500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9350183606147766
episode 8600, step 8600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.920856773853302
episode 8700, step 8700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9818281531333923
episode 8800, step 8800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9964724183082581
episode 8900, step 8900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0790421962738037
episode 9000, step 9000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.0036260781344026327
episode 9100, step 9100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9669089913368225
episode 9200, step 9200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9350236058235168
episode 9300, step 9300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.08956778049469
episode 9400, step 9400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.76462721824646
episode 9500, step 9500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9117434620857239
episode 9600, step 9600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.015561874024569988
episode 9700, step 9700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0754772424697876
episode 9800, step 9800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9884140491485596
episode 9900, step 9900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9845787286758423
episode 10000, step 10000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9720855951309204

Contextual Bandits learners: Linear Thompson Sampling¶

Lastly, we describe how to use the neural version of the Linear Thompson Sampling (LinTS) algorithm with Pearl. The algorithm which uses Thompson sampling exploration with neural architectures. The LinTS sampling is closely related to the LinUCB algorithm, with a key modification that often improves its convergence in practice: sample the score function from a probability, instead of fixing it determinstically. Practically, this often reduces the over-exploring of arms, since the score may be smaller than in the LinUCB algorithm.

To implement the LinTS algorithm in Pearl, use the NeuralLinearBandit policy learner module combined with an exploration module of type ThompsonSamplingExplorationLinear. This enables the agent to sample the score based on its estimated uncertainty, rather than to fix it as in LinUCB algorithm.

In [ ]:

# Create a Neural LinTS pearl agent with 1-hot action representation

action_representation_module = OneHotActionTensorRepresentationModule(
    max_number_actions= env._action_space.n,
)

agent = PearlAgent(
    policy_learner=NeuralLinearBandit(
        feature_dim = env.observation_dim + env._action_space.n,
        hidden_dims=[64, 16],
        state_features_only=False,
        training_rounds=10,
        learning_rate=0.01,
        action_representation_module=action_representation_module,
        exploration_module= ThompsonSamplingExplorationLinear()
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(100_000),
    device_id=-1,
)

info = online_learning(
    agent=agent,
    env=env,
    number_of_steps=number_of_steps,
    print_every_x_steps=100,
    record_period=record_period,
    learn_after_episode=True,
)
torch.save(info["return"], "LinTS-return.pt")
plt.plot(record_period * np.arange(len(info["return"])), info["return"], label="LinTS")
plt.xlabel("time step")
plt.ylabel("return")
plt.legend()
plt.show()

episode 100, step 100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.09024851769208908
episode 200, step 200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.002633035881444812
episode 300, step 300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.07467412203550339
episode 400, step 400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0156103372573853
episode 500, step 500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.04409921169281
episode 600, step 600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9837733507156372
episode 700, step 700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9862121939659119
episode 800, step 800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.11129053682088852
episode 900, step 900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9800935983657837
episode 1000, step 1000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9999375939369202
episode 1100, step 1100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0631517171859741
episode 1200, step 1200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9909786581993103
episode 1300, step 1300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1609396934509277
episode 1400, step 1400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9601272940635681
episode 1500, step 1500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.1314946413040161
episode 1600, step 1600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0190272331237793
episode 1700, step 1700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.900674045085907
episode 1800, step 1800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0322463512420654
episode 1900, step 1900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.0550333634018898
episode 2000, step 2000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.131584644317627
episode 2100, step 2100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9966853857040405
episode 2200, step 2200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8430404663085938
episode 2300, step 2300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1651915311813354
episode 2400, step 2400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9456443786621094
episode 2500, step 2500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9995026588439941
episode 2600, step 2600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1561548709869385
episode 2700, step 2700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1483546495437622
episode 2800, step 2800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0468939542770386
episode 2900, step 2900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0259349346160889
episode 3000, step 3000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.853131115436554
episode 3100, step 3100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9850102663040161
episode 3200, step 3200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8096411228179932
episode 3300, step 3300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0070034265518188
episode 3400, step 3400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.771452784538269
episode 3500, step 3500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9490416646003723
episode 3600, step 3600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8729531764984131
episode 3700, step 3700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.12757402658462524
episode 3800, step 3800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0074385404586792
episode 3900, step 3900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.033884882926941
episode 4000, step 4000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.046915888786316
episode 4100, step 4100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0219038724899292
episode 4200, step 4200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.15234242379665375
episode 4300, step 4300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9609275460243225
episode 4400, step 4400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.034550666809082
episode 4500, step 4500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.2472548484802246
episode 4600, step 4600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9214726686477661
episode 4700, step 4700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.021003952249884605
episode 4800, step 4800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0047966241836548
episode 4900, step 4900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.034260630607605
episode 5000, step 5000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.7625596523284912
episode 5100, step 5100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9447240233421326
episode 5200, step 5200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.7146114706993103
episode 5300, step 5300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.044114351272583
episode 5400, step 5400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9947097897529602
episode 5500, step 5500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0141396522521973
episode 5600, step 5600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1748143434524536
episode 5700, step 5700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0289136171340942
episode 5800, step 5800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0836181640625
episode 5900, step 5900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.92894047498703
episode 6000, step 6000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9187952876091003
episode 6100, step 6100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8998274207115173
episode 6200, step 6200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: -0.02114623598754406
episode 6300, step 6300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0222702026367188
episode 6400, step 6400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0138862133026123
episode 6500, step 6500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0265398025512695
episode 6600, step 6600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8881807923316956
episode 6700, step 6700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9364562630653381
episode 6800, step 6800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0452048778533936
episode 6900, step 6900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8812369704246521
episode 7000, step 7000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0886296033859253
episode 7100, step 7100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8804692029953003
episode 7200, step 7200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.066596269607544
episode 7300, step 7300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0423859357833862
episode 7400, step 7400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0080853700637817
episode 7500, step 7500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.999132513999939
episode 7600, step 7600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9367015361785889
episode 7700, step 7700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.008610486984253
episode 7800, step 7800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9758915305137634
episode 7900, step 7900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1150867938995361
episode 8000, step 8000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9644365906715393
episode 8100, step 8100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0202032327651978
episode 8200, step 8200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0238661766052246
episode 8300, step 8300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.941068708896637
episode 8400, step 8400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.052219033241272
episode 8500, step 8500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9174469113349915
episode 8600, step 8600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9747493267059326
episode 8700, step 8700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0260030031204224
episode 8800, step 8800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9549689292907715
episode 8900, step 8900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9588802456855774
episode 9000, step 9000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0665802955627441
episode 9100, step 9100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0095921754837036
episode 9200, step 9200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9650076627731323
episode 9300, step 9300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.7871606945991516
episode 9400, step 9400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0182504653930664
episode 9500, step 9500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9524450302124023
episode 9600, step 9600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.0539082288742065
episode 9700, step 9700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 1.1573524475097656
episode 9800, step 9800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.9590475559234619
episode 9900, step 9900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.943342387676239
episode 10000, step 10000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets
return: 0.8112016916275024

Summary¶

In this example, we showed how to use popular contextual bandits algorithms in Pearl.