%load_ext autoreload
%autoreload 2
If Pearl is not installed yet, run the following cell to install it. Otherwise, skip the cell below.
%pip uninstall Pearl -y
%rm -rf Pearl
!git clone https://github.com/facebookresearch/Pearl.git
%cd Pearl
%pip install .
%cd ..
from pearl.utils.functional_utils.experimentation.set_seed import set_seed
from pearl.action_representation_modules.one_hot_action_representation_module import OneHotActionTensorRepresentationModule
from pearl.replay_buffers.sequential_decision_making.fifo_off_policy_replay_buffer import FIFOOffPolicyReplayBuffer
from pearl.utils.functional_utils.train_and_eval.online_learning import online_learning
from pearl.pearl_agent import PearlAgent
from pearl.utils.uci_data import download_uci_data
from pearl.utils.instantiations.environments.contextual_bandit_uci_environment import (
    SLCBEnvironment,
)
from pearl.policy_learners.exploration_modules.contextual_bandits.squarecb_exploration import SquareCBExploration
from pearl.policy_learners.exploration_modules.contextual_bandits.ucb_exploration import (
    UCBExploration,
)
from pearl.policy_learners.exploration_modules.contextual_bandits.thompson_sampling_exploration import (
    ThompsonSamplingExplorationLinear,
)
from pearl.policy_learners.contextual_bandits.neural_bandit import NeuralBandit
from pearl.policy_learners.contextual_bandits.neural_linear_bandit import (
    NeuralLinearBandit,
)
import torch
import matplotlib.pyplot as plt
import numpy as np
import os
set_seed(0)
The experiments that follow use a contextual bandit environment we added to Pearl, which is built on top of UCI datasets (https://archive.ics.uci.edu/datasets).
The UCI datasets span a wide variety of prediction tasks. We use these tasks to construct a contextual bandit environment in which an agent receives an expected reward of 1 if it correctly labels a data point and 0 otherwise. Pearl currently supports the following datasets: pendigits, letter, satimage, yeast. Additional ones can be readily added.
In the following experiment we test different types of contextual bandit algorithms on the pendigits UCI dataset.
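To make the reward structure concrete, here is a minimal sketch of how a labeled dataset can be read as a contextual bandit. It does not use Pearl's `SLCBEnvironment`; the toy data and the `bandit_reward` helper are hypothetical and only illustrate the noiseless expected reward described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 data points with 3 features each, labels in {0, 1, 2}.
contexts = rng.normal(size=(5, 3))
labels = np.array([2, 0, 1, 1, 0])


def bandit_reward(action: int, label: int) -> float:
    """Expected reward: 1 if the chosen action equals the true label, else 0."""
    return 1.0 if action == label else 0.0


# One interaction round: observe a context, pick an action, see only its reward.
t = 0
context = contexts[t]
action = 2  # the agent's guess for the label of this data point
reward = bandit_reward(action, int(labels[t]))
print(reward)  # 1.0, because labels[0] == 2
```

The agent never observes the true label directly. It only learns whether the action it chose was rewarded, which is what separates this setting from supervised classification.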
# load environment
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Download the UCI datasets if they are not already on disk
uci_data_path = "./utils/instantiations/environments/uci_datasets"
if not os.path.exists(uci_data_path):
    os.makedirs(uci_data_path)
    download_uci_data(data_path=uci_data_path)
# Build the CB environment using the pendigits UCI dataset
pendigits_uci_dict = {
    "path_filename": os.path.join(uci_data_path, "pendigits/pendigits.tra"),
    "action_embeddings": "discrete",
    "delim_whitespace": False,
    "ind_to_drop": [],
    "target_column": 16,
}
env = SLCBEnvironment(**pendigits_uci_dict)
# experiment code
number_of_steps = 10000
record_period = 400
The following sections show how to implement the neural versions of SquareCB, LinUCB and LinTS with Pearl.
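The SquareCB algorithm requires only a regression model, which it uses to learn the reward function. Given the estimated reward model $\widehat{r}$, SquareCB executes the following policy: $$ \widehat{a}_* \in \arg\max_a \widehat{r}(x,a), \qquad \widehat{r}_* = \max_a \widehat{r}(x,a),\\ \pi(a,x) = \frac{1}{A + \gamma\,(\widehat{r}_* - \widehat{r}(x,a))} \quad \text{if } a \neq \widehat{a}_*,\\ \pi(a,x) = 1 - \sum_{a' \neq \widehat{a}_*} \pi(a',x) \quad \text{if } a = \widehat{a}_*. $$ Here $A$ is the number of actions and $\gamma > 0$ is a hyperparameter. This policy balances exploration and exploitation: an action whose estimated reward is close to the best one is sampled with higher probability, and a larger $\gamma$ concentrates more probability on the greedy action $\widehat{a}_*$.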
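The policy above can be sketched in a few lines of NumPy. This is an illustrative stand-alone version, not Pearl's `SquareCBExploration` implementation; the function name `squarecb_probabilities` and the example reward estimates are assumptions made for the sketch.

```python
import numpy as np


def squarecb_probabilities(estimated_rewards: np.ndarray, gamma: float) -> np.ndarray:
    """Action probabilities from the inverse-gap weighting rule above."""
    num_actions = estimated_rewards.shape[0]  # A in the formula
    best_action = int(np.argmax(estimated_rewards))
    # Gap between the best estimated reward and each action's estimate (>= 0).
    gaps = estimated_rewards[best_action] - estimated_rewards
    probs = 1.0 / (num_actions + gamma * gaps)
    # The greedy action receives whatever probability mass is left over.
    probs[best_action] = 0.0
    probs[best_action] = 1.0 - probs.sum()
    return probs


# Hypothetical reward estimates for three actions in some context x.
r_hat = np.array([0.2, 0.9, 0.5])
for gamma in (1.0, 100.0):
    print(gamma, squarecb_probabilities(r_hat, gamma))
```

Each non-greedy action gets probability at most $1/A$, so the greedy action always keeps at least $1/A$ of the mass. Increasing `gamma` shrinks the exploration probabilities and moves the policy toward pure exploitation.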
To use the SquareCB algorithm in Pearl, we set the policy learner to `NeuralBandit`, a class that supports estimating the reward function with a neural architecture. With access to an estimated reward model, we then use an instance of `SquareCBExploration` as the exploration module.
To further highlight the versatility of Pearl's modular design, we use `OneHotActionTensorRepresentationModule` as the action representation module. This module internally converts actions from integers to one-hot-encoded vectors.
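The following sketch shows what a one-hot action representation does to the model input, and why the feature dimension of the reward model is the observation dimension plus the number of actions. It is written in plain NumPy for illustration and does not reproduce Pearl's tensor-based implementation; the `one_hot` helper is hypothetical.

```python
import numpy as np


def one_hot(action: int, num_actions: int) -> np.ndarray:
    """Encode an integer action as a one-hot vector of length num_actions."""
    vec = np.zeros(num_actions)
    vec[action] = 1.0
    return vec


# pendigits has 16 input features and 10 digit classes.
observation_dim, num_actions = 16, 10
context = np.zeros(observation_dim)  # placeholder context vector

# One model input per candidate action: the context followed by the action encoding.
features = np.stack(
    [np.concatenate([context, one_hot(a, num_actions)]) for a in range(num_actions)]
)
print(features.shape)  # (10, 26)
```

A reward model evaluated on each of these rows yields one estimated reward per action, which is the quantity the exploration module works with.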
# Create a Neural SquareCB pearl agent with 1-hot action representation
action_representation_module = OneHotActionTensorRepresentationModule(
    max_number_actions=env.unique_labels_num,
)
agent = PearlAgent(
    policy_learner=NeuralBandit(
        feature_dim=env.observation_dim + env.unique_labels_num,
        hidden_dims=[64, 16],
        training_rounds=10,
        learning_rate=0.01,
        action_representation_module=action_representation_module,
        exploration_module=SquareCBExploration(
            gamma=env.observation_dim * env.unique_labels_num * number_of_steps
        ),
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(100_000),
    device_id=-1,
)
info = online_learning(
    agent=agent,
    env=env,
    number_of_steps=number_of_steps,
    print_every_x_steps=100,
    record_period=record_period,
    learn_after_episode=True,
)
torch.save(info["return"], "SquareCB-return.pt")
plt.plot(record_period * np.arange(len(info["return"])), info["return"], label="SquareCB")
plt.xlabel("time step")
plt.ylabel("return")
plt.legend()
plt.show()
episode 100, step 100, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.02050408534705639
episode 200, step 200, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.039917074143886566
episode 300, step 300, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0325154066085815
...
episode 10000, step 10000, agent=PearlAgent with NeuralBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9436702728271484
Next, we describe how to use the neural version of the LinUCB algorithm with Pearl, which combines UCB-type exploration with a neural network architecture. LinUCB and its neural version generalize the seminal Upper Confidence Bound (UCB) algorithm. Both execute a policy of the following form: $$ \pi(x) \in \arg\max_a \left[ \widehat{r}(x,a) + \mathrm{score}(x,a) \right], $$ that is, both add to the estimated expected reward $\widehat{r}(x,a)$ a bonus term that quantifies the potential of choosing action $a$ in context $x$. A common way to define the score function in the linear case with features $\phi(x,a)$ is: $$ \mathrm{score}(x,a) = \alpha \|\phi(x,a)\|_{A^{-1}}, \quad \text{where } A = \lambda I + \sum_{n\leq t} \phi(x_n,a_n)\phi(x_n,a_n)^T. $$ Here $\|v\|_{A^{-1}} = \sqrt{v^T A^{-1} v}$, so the bonus is large for feature directions that have rarely been observed and shrinks as data accumulates.
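To make the bonus term concrete, here is a minimal sketch of the linear score $\alpha \sqrt{\phi^T A^{-1} \phi}$, written with the Python standard library only. The class name `LinUCBScore` and its methods are hypothetical names chosen for illustration and are not part of Pearl. The sketch keeps $A^{-1}$ directly and updates it with the Sherman-Morrison identity, which is one common way to avoid re-inverting $A$ after every observation.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


class LinUCBScore:
    """Hypothetical helper: tracks A^{-1} for A = lam * I + sum_n phi_n phi_n^T."""

    def __init__(self, dim, lam=1.0, alpha=1.0):
        self.alpha = alpha
        # With no data, A = lam * I, so A^{-1} = (1 / lam) * I.
        self.a_inv = [
            [1.0 / lam if i == j else 0.0 for j in range(dim)] for i in range(dim)
        ]

    def score(self, phi):
        """Exploration bonus alpha * sqrt(phi^T A^{-1} phi)."""
        return self.alpha * math.sqrt(dot(phi, mat_vec(self.a_inv, phi)))

    def update(self, phi):
        """Add phi phi^T to A by applying Sherman-Morrison to A^{-1}."""
        u = mat_vec(self.a_inv, phi)  # u = A^{-1} phi (A^{-1} is symmetric)
        denom = 1.0 + dot(phi, u)
        for i in range(len(phi)):
            for j in range(len(phi)):
                self.a_inv[i][j] -= u[i] * u[j] / denom


bonus = LinUCBScore(dim=2, lam=1.0, alpha=1.0)
phi_seen, phi_unseen = [1.0, 0.0], [0.0, 1.0]

before = bonus.score(phi_seen)   # bonus before any observation
bonus.update(phi_seen)           # observe phi_seen once
after = bonus.score(phi_seen)    # bonus for the observed direction shrinks
untouched = bonus.score(phi_unseen)  # orthogonal direction is unaffected
print(before, after, untouched)
```

In this toy run the bonus for the observed feature direction drops after a single update, while the bonus for the orthogonal, never-observed direction stays the same. That is the behaviour that steers a UCB-style agent toward under-explored actions.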
To implement the neural LinUCB algorithm in Pearl, use the `NeuralLinearBandit` policy learner module. This module supports (i) learning a reward model, and (ii) calculating a score function by estimating the uncertainty using the last-layer features. Further, we set the exploration module to an instance of `UCBExploration` and set its `alpha` hyper-parameter, which gives the agent the UCB-style action selection rule described above.
# Create a Neural LinUCB Pearl agent with a one-hot action representation
action_representation_module = OneHotActionTensorRepresentationModule(
    max_number_actions=env._action_space.n,
)
agent = PearlAgent(
    policy_learner=NeuralLinearBandit(
        feature_dim=env.observation_dim + env._action_space.n,
        hidden_dims=[64, 16],
        state_features_only=False,
        training_rounds=10,
        learning_rate=0.01,
        action_representation_module=action_representation_module,
        exploration_module=UCBExploration(alpha=1.0),
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(100_000),
    device_id=-1,
)
info = online_learning(
    agent=agent,
    env=env,
    number_of_steps=number_of_steps,
    print_every_x_steps=100,
    record_period=record_period,
    learn_after_episode=True,
)
torch.save(info["return"], "LinUCB-return.pt")
plt.plot(record_period * np.arange(len(info["return"])), info["return"], label="LinUCB")
plt.xlabel("time step")
plt.ylabel("return")
plt.legend()
plt.show()
episode 100, step 100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.20280151069164276 episode 200, step 200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.054456718266010284 episode 300, step 300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.13178661465644836 episode 400, step 400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9865397214889526 episode 500, step 500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.14140896499156952 episode 600, step 600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1187363862991333 episode 700, step 700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.2206898927688599 episode 800, step 800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.22980380058288574 episode 900, step 900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0912549495697021 episode 1000, step 1000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.09148281812667847 episode 1100, step 1100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9348994493484497 episode 1200, step 1200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.0068538920022547245 episode 1300, step 1300, agent=PearlAgent with NeuralLinearBandit, 
FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9425393342971802 episode 1400, step 1400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.016242675483226776 episode 1500, step 1500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.09567844122648239 episode 1600, step 1600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9641198515892029 episode 1700, step 1700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0396792888641357 episode 1800, step 1800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9747456908226013 episode 1900, step 1900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8553079962730408 episode 2000, step 2000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9961886405944824 episode 2100, step 2100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0519009828567505 episode 2200, step 2200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.05198976397514343 episode 2300, step 2300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.960572361946106 episode 2400, step 2400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9912083745002747 episode 2500, step 2500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB 
datasets return: 0.055912960320711136 episode 2600, step 2600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9400607943534851 episode 2700, step 2700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0280994176864624 episode 2800, step 2800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.024195637553930283 episode 2900, step 2900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1863090991973877 episode 3000, step 3000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.147364616394043 episode 3100, step 3100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.197069525718689 episode 3200, step 3200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9864020347595215 episode 3300, step 3300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.899193286895752 episode 3400, step 3400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1261299848556519 episode 3500, step 3500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1639732122421265 episode 3600, step 3600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.833449125289917 episode 3700, step 3700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0843310356140137 episode 3800, step 3800, 
agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9897075891494751 episode 3900, step 3900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.2469022274017334 episode 4000, step 4000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.991430401802063 episode 4100, step 4100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.936269223690033 episode 4200, step 4200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0206660032272339 episode 4300, step 4300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9332488775253296 episode 4400, step 4400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9900529384613037 episode 4500, step 4500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0219541788101196 episode 4600, step 4600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8899850845336914 episode 4700, step 4700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9463319778442383 episode 4800, step 4800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.121014952659607 episode 4900, step 4900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.040822982788086 episode 5000, step 5000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, 
env=Contextual bandits with CB datasets return: 0.984503448009491 episode 5100, step 5100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9031474590301514 episode 5200, step 5200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0292550325393677 episode 5300, step 5300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8698779344558716 episode 5400, step 5400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.999628484249115 episode 5500, step 5500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9985622763633728 episode 5600, step 5600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.11453455686569214 episode 5700, step 5700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1535426378250122 episode 5800, step 5800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0082354545593262 episode 5900, step 5900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0808966159820557 episode 6000, step 6000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9922400712966919 episode 6100, step 6100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0792633295059204 episode 6200, step 6200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8470718860626221 
episode 6300, step 6300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8804739117622375 episode 6400, step 6400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0601600408554077 episode 6500, step 6500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0142043828964233 episode 6600, step 6600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.948247492313385 episode 6700, step 6700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9545789957046509 episode 6800, step 6800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.010140061378479 episode 6900, step 6900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8813512921333313 episode 7000, step 7000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9689813852310181 episode 7100, step 7100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.060390591621399 episode 7200, step 7200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9932740330696106 episode 7300, step 7300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0427846908569336 episode 7400, step 7400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0150879621505737 episode 7500, step 7500, agent=PearlAgent with NeuralLinearBandit, 
FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0593535900115967 episode 7600, step 7600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.04487299919128418 episode 7700, step 7700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0764495134353638 episode 7800, step 7800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8947196006774902 episode 7900, step 7900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0651267766952515 episode 8000, step 8000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.015225519426167011 episode 8100, step 8100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9122466444969177 episode 8200, step 8200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0724564790725708 episode 8300, step 8300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0051934719085693 episode 8400, step 8400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0062308311462402 episode 8500, step 8500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9350183606147766 episode 8600, step 8600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.920856773853302 episode 8700, step 8700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB 
datasets return: 0.9818281531333923 episode 8800, step 8800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9964724183082581 episode 8900, step 8900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0790421962738037 episode 9000, step 9000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.0036260781344026327 episode 9100, step 9100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9669089913368225 episode 9200, step 9200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9350236058235168 episode 9300, step 9300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.08956778049469 episode 9400, step 9400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.76462721824646 episode 9500, step 9500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9117434620857239 episode 9600, step 9600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.015561874024569988 episode 9700, step 9700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0754772424697876 episode 9800, step 9800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9884140491485596 episode 9900, step 9900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9845787286758423 episode 10000, step 10000, 
agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9720855951309204
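The logged returns fluctuate noticeably from one record to the next, so a smoothed curve can be easier to read than the raw values. Below is a minimal sketch of a moving average over a list of returns. The helper `moving_average` is a hypothetical name, not a Pearl utility, and the sample values are the last five returns from the log above, rounded to three decimals.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` entries."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    # Slide the window one step at a time, updating the running sum.
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Rounded returns at steps 9600 to 10000 of the run above.
sample_returns = [0.016, 1.075, 0.988, 0.985, 0.972]
smoothed = moving_average(sample_returns, window=2)
print(smoothed)
```

The same helper could be applied to `info["return"]` (converted to a plain list of floats) before plotting. The window size is a judgment call: a larger window gives a smoother curve but hides short dips.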
Lastly, we describe how to use the neural version of the Linear Thompson Sampling (LinTS) algorithm with Pearl, which combines Thompson sampling exploration with a neural network architecture. LinTS is closely related to the LinUCB algorithm, with a key modification that often improves its convergence in practice: the score is sampled from a probability distribution instead of being fixed deterministically. In practice, this often reduces over-exploration of arms, since the sampled score may be smaller than the score used by LinUCB.
To implement the neural LinTS algorithm in Pearl, use the `NeuralLinearBandit` policy learner module combined with an exploration module of type `ThompsonSamplingExplorationLinear`. This enables the agent to sample the score based on its estimated uncertainty, rather than fixing it as in the LinUCB algorithm.
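The difference between fixing the score and sampling it can be illustrated with a small standalone sketch. It assumes each action has an estimated mean reward and an uncertainty (standard deviation), and that the Thompson-style rule draws one Gaussian sample per action. The function names and numbers are hypothetical, and this is not Pearl's internal implementation.

```python
import random


def ucb_choice(means, sigmas, alpha=1.0):
    """Deterministic rule: pick the action maximizing mean + alpha * sigma."""
    scores = [m + alpha * s for m, s in zip(means, sigmas)]
    return max(range(len(scores)), key=scores.__getitem__)


def thompson_choice(means, sigmas, rng):
    """Randomized rule: sample a reward per action, then pick the largest sample."""
    samples = [rng.gauss(m, s) for m, s in zip(means, sigmas)]
    return max(range(len(samples)), key=samples.__getitem__)


rng = random.Random(0)   # fixed seed so the sketch is reproducible
means = [1.0, 0.9]       # action 0 looks slightly better on average
sigmas = [0.05, 0.3]     # action 1 is far more uncertain

# UCB always adds the full bonus, so it picks the uncertain action every time.
ucb_action = ucb_choice(means, sigmas)

# Thompson sampling picks the uncertain action only part of the time.
counts = [0, 0]
for _ in range(1000):
    counts[thompson_choice(means, sigmas, rng)] += 1
print(ucb_action, counts)
```

With these numbers the UCB rule selects the uncertain action on every round until its bonus shrinks, whereas the sampling rule splits its choices between the two actions. This matches the intuition that a sampled score can be smaller than the deterministic bonus.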
# Create a Neural LinTS Pearl agent with a one-hot action representation
action_representation_module = OneHotActionTensorRepresentationModule(
    max_number_actions=env._action_space.n,
)
agent = PearlAgent(
    policy_learner=NeuralLinearBandit(
        feature_dim=env.observation_dim + env._action_space.n,
        hidden_dims=[64, 16],
        state_features_only=False,
        training_rounds=10,
        learning_rate=0.01,
        action_representation_module=action_representation_module,
        exploration_module=ThompsonSamplingExplorationLinear(),
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(100_000),
    device_id=-1,
)
info = online_learning(
    agent=agent,
    env=env,
    number_of_steps=number_of_steps,
    print_every_x_steps=100,
    record_period=record_period,
    learn_after_episode=True,
)
torch.save(info["return"], "LinTS-return.pt")
plt.plot(record_period * np.arange(len(info["return"])), info["return"], label="LinTS")
plt.xlabel("time step")
plt.ylabel("return")
plt.legend()
plt.show()
episode 100, step 100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.09024851769208908 episode 200, step 200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.002633035881444812 episode 300, step 300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.07467412203550339 episode 400, step 400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0156103372573853 episode 500, step 500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.04409921169281 episode 600, step 600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9837733507156372 episode 700, step 700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9862121939659119 episode 800, step 800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.11129053682088852 episode 900, step 900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9800935983657837 episode 1000, step 1000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9999375939369202 episode 1100, step 1100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0631517171859741 episode 1200, step 1200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9909786581993103 episode 1300, step 1300, agent=PearlAgent with NeuralLinearBandit, 
FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1609396934509277 episode 1400, step 1400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9601272940635681 episode 1500, step 1500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.1314946413040161 episode 1600, step 1600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0190272331237793 episode 1700, step 1700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.900674045085907 episode 1800, step 1800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0322463512420654 episode 1900, step 1900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.0550333634018898 episode 2000, step 2000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.131584644317627 episode 2100, step 2100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9966853857040405 episode 2200, step 2200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8430404663085938 episode 2300, step 2300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1651915311813354 episode 2400, step 2400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9456443786621094 episode 2500, step 2500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets 
return: 0.9995026588439941 episode 2600, step 2600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1561548709869385 episode 2700, step 2700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1483546495437622 episode 2800, step 2800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0468939542770386 episode 2900, step 2900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0259349346160889 episode 3000, step 3000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.853131115436554 episode 3100, step 3100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9850102663040161 episode 3200, step 3200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8096411228179932 episode 3300, step 3300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0070034265518188 episode 3400, step 3400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.771452784538269 episode 3500, step 3500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9490416646003723 episode 3600, step 3600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8729531764984131 episode 3700, step 3700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.12757402658462524 episode 3800, step 3800, 
agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0074385404586792 episode 3900, step 3900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.033884882926941 episode 4000, step 4000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.046915888786316 episode 4100, step 4100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0219038724899292 episode 4200, step 4200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.15234242379665375 episode 4300, step 4300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9609275460243225 episode 4400, step 4400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.034550666809082 episode 4500, step 4500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.2472548484802246 episode 4600, step 4600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9214726686477661 episode 4700, step 4700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: -0.021003952249884605 episode 4800, step 4800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0047966241836548 episode 4900, step 4900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.034260630607605 episode 5000, step 5000, agent=PearlAgent with NeuralLinearBandit, 
FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.7625596523284912 episode 5100, step 5100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9447240233421326 episode 5200, step 5200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.7146114706993103 episode 5300, step 5300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.044114351272583 episode 5400, step 5400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9947097897529602 episode 5500, step 5500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0141396522521973 episode 5600, step 5600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1748143434524536 episode 5700, step 5700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0289136171340942 episode 5800, step 5800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0836181640625 episode 5900, step 5900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.92894047498703 episode 6000, step 6000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9187952876091003 episode 6100, step 6100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8998274207115173 episode 6200, step 6200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets 
return: -0.02114623598754406 episode 6300, step 6300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0222702026367188 episode 6400, step 6400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0138862133026123 episode 6500, step 6500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0265398025512695 episode 6600, step 6600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8881807923316956 episode 6700, step 6700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9364562630653381 episode 6800, step 6800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0452048778533936 episode 6900, step 6900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8812369704246521 episode 7000, step 7000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0886296033859253 episode 7100, step 7100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8804692029953003 episode 7200, step 7200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.066596269607544 episode 7300, step 7300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0423859357833862 episode 7400, step 7400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0080853700637817 episode 7500, step 7500, 
agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.999132513999939 episode 7600, step 7600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9367015361785889 episode 7700, step 7700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.008610486984253 episode 7800, step 7800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9758915305137634 episode 7900, step 7900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1150867938995361 episode 8000, step 8000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9644365906715393 episode 8100, step 8100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0202032327651978 episode 8200, step 8200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0238661766052246 episode 8300, step 8300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.941068708896637 episode 8400, step 8400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.052219033241272 episode 8500, step 8500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9174469113349915 episode 8600, step 8600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9747493267059326 episode 8700, step 8700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, 
env=Contextual bandits with CB datasets return: 1.0260030031204224 episode 8800, step 8800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9549689292907715 episode 8900, step 8900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9588802456855774 episode 9000, step 9000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0665802955627441 episode 9100, step 9100, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0095921754837036 episode 9200, step 9200, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9650076627731323 episode 9300, step 9300, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.7871606945991516 episode 9400, step 9400, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0182504653930664 episode 9500, step 9500, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9524450302124023 episode 9600, step 9600, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.0539082288742065 episode 9700, step 9700, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 1.1573524475097656 episode 9800, step 9800, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.9590475559234619 episode 9900, step 9900, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.943342387676239 
episode 10000, step 10000, agent=PearlAgent with NeuralLinearBandit, FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets return: 0.8112016916275024
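The log above reports one return every 100 episodes, and individual values are noisy (for example, about -0.02 at episode 6200, surrounded by values near 1.0). A smoothed curve is usually easier to read. The following is a minimal sketch, assuming the printed output has been captured as a single string in the format shown above. `parse_returns` and `moving_average` are hypothetical helper names introduced here for illustration; they are not part of Pearl.

```python
import re

# One record looks like:
#   "episode 9800, step 9800, agent=..., env=... return: 0.959..."
# The pattern pairs each episode number with the first "return:" after it.
_RECORD = re.compile(
    r"episode (\d+),.*?return: (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)",
    re.DOTALL,
)


def parse_returns(log_text):
    """Extract (episode, return) pairs from captured log text."""
    return [(int(ep), float(ret)) for ep, ret in _RECORD.findall(log_text)]


def moving_average(values, window):
    """Trailing moving average; early entries average over fewer points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# The last three records of the log above, copied verbatim.
sample_log = (
    "episode 9800, step 9800, agent=PearlAgent with NeuralLinearBandit, "
    "FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets "
    "return: 0.9590475559234619 "
    "episode 9900, step 9900, agent=PearlAgent with NeuralLinearBandit, "
    "FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets "
    "return: 0.943342387676239 "
    "episode 10000, step 10000, agent=PearlAgent with NeuralLinearBandit, "
    "FIFOOffPolicyReplayBuffer, env=Contextual bandits with CB datasets "
    "return: 0.8112016916275024"
)

records = parse_returns(sample_log)
episodes = [ep for ep, _ in records]
smoothed = moving_average([ret for _, ret in records], window=2)
for ep, avg in zip(episodes, smoothed):
    print(f"episode {ep}: smoothed return {avg:.4f}")
```

A trailing window is used so that each smoothed value depends only on returns already observed, which matches how the agent is evaluated online.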
In this example, we showed how to use popular contextual bandit algorithms in Pearl.