"Reinforcement learning (RL)" is a very interesting sub-field of Machine Learning. There have been many new developments in RL in the last 5 years. Publication of "Deep Q-Networks" from DeepMind in particualr ushered in a new era for RL. An important concept in all RL algorithms is the tradeoff between exploration and exploitation. In this post we will simulate a problem called "Multi-armed bandit" and understand the details of this tradeoff.
There are several great resources on RL. Below are some of the best ones I found for practitioners like myself. These are good starting points for understanding the foundations and learning by doing. Richard Sutton and Andrew Barto's "second edition" is beautifully written. I really like how they provide intuitive explanation of algorithms and the pseudo-code. The pseudo-code is proving to be invaluable when I want to code up an algorithm and understand the details. Andrej Karpathy wrote an "excellent post" almost two years ago. It's a great general introduction and also a good starting point for a type of RL aglorithms called Policy gradient methods. Finally I found "this" course from Berkeley that is very recent with full lecture notes and videos available. I have only reviewed one lecture so far, but it looks very promising.
Code:
The following resources can be helpful in understanding the code. An introduction to OpenAI Gym can be found here: https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym
We will be using this library, built on top of OpenAI Gym, to simulate the 10-armed bandit problem: https://github.com/JKCooper2/gym-bandits
Before we look at the multi-armed bandit problem, let's take a quick look at the general RL problem setting. The picture below captures the general RL problem. There are two entities: the agent and the environment. At time t, the agent observes state $S_t$ from the environment and also receives a reward $R_t$. The agent then takes an action $A_t$. In response to action $A_t$, the environment provides the next state and reward pair, and the process continues. This setup is called a Markov Decision Process. The goal of the agent is to maximize the cumulative reward it receives from the environment.
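As a minimal sketch of this loop, using the OpenAI Gym interface that the rest of this post relies on (the CartPole environment here is only a placeholder), the interaction between agent and environment looks roughly like this:

```python
import gym

env = gym.make("CartPole-v0")   # placeholder environment; any Gym environment works

state = env.reset()             # observe the initial state S_0
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()            # a real agent would pick A_t based on S_t
    state, reward, done, info = env.step(action)  # environment returns S_{t+1} and R_{t+1}
    total_reward += reward                        # the agent's goal: maximize this cumulative reward
```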
The most distinguishing feature of RL compared to supervised learning is that there are no labels associated with actions; there is only reward for each action taken.
The multi-armed bandit problem is a simple RL problem. At every time step, the agent can choose one of K actions. The agent then receives a reward drawn from a probability distribution, unknown to the agent, corresponding to that action. The goal of the agent is to choose actions such that the total reward received within a certain number of time steps is maximized. The environment's state remains unchanged across time steps, which simplifies the problem considerably and makes successive time steps IID. This can be represented as shown below,
where $A_t \in \{1, 2, \ldots, K\}$ and the reward $R_t \sim \mathcal{N}(\mu_k,\,\sigma^{2})$, where $k$ is the action taken.
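To make this notation concrete, here is a minimal sketch of what such an environment looks like internally. The arm means are drawn at random purely for illustration; they are not the ones used by the gym-bandits environment below:

```python
import numpy as np

K = 10
rng = np.random.RandomState(0)
mu = rng.normal(loc=0.0, scale=1.0, size=K)  # arm means, unknown to the agent

def pull(k, sigma=1.0):
    """Reward for taking action k: R_t ~ N(mu_k, sigma^2)."""
    return rng.normal(loc=mu[k], scale=sigma)
```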
We can estimate the reward distribution for each action by simulating an agent that takes random actions at every time step. This is shown below for K=10
# imports
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("../")
import gym
import gym_bandits
import logging
import numpy as np
import os
import pandas as pd
from dotenv import load_dotenv, find_dotenv
from src.visualization.visualize import plot_rewards, plot_actions, dist_plots
from src.models.k_armed_bandit import Agent, play_wrapper
import plotly.offline as pyoffline
pyoffline.init_notebook_mode()
import plotly.plotly
load_dotenv(find_dotenv())
env = gym.make("BanditTenArmedGaussian-v0")
env.reset()
# epsilon = 1 => Explore at every time step by taking a random action
agent = Agent(env, 1)
df = play_wrapper('random agent', env, agent)
Our simulation runs for 3000 steps. We can see below that our random agent chose the actions approximately evenly.
df.groupby('random agent action').count()
| random agent action | r | e | random agent avg reward |
|---|---|---|---|
| 0 | 273 | 273 | 273 |
| 1 | 294 | 294 | 294 |
| 2 | 311 | 311 | 311 |
| 3 | 289 | 289 | 289 |
| 4 | 311 | 311 | 311 |
| 5 | 268 | 268 | 268 |
| 6 | 321 | 321 | 321 |
| 7 | 327 | 327 | 327 |
| 8 | 282 | 282 | 282 |
| 9 | 324 | 324 | 324 |
We can estimate the densities of the 10 distributions from this simulation data.
dfs = [df.loc[df['random agent action']==a,'r'] for a in range(10)]
labels = ['action_'+str(i) for i in range(10)]
_ = dist_plots(dfs, labels, "Distribution of reward by action taken",
bin_size=0.05,
show_hist=False)
As we can see from the plot, if our agent always selects the action that corresponds to the right-most distribution, it would maximize its reward. But instead of playing for a large number of time steps just to estimate the distributions, it would be better to have our agent estimate the expected reward of each action as it plays and use those estimates to choose actions.
A class of methods in RL called action-value methods does exactly this.
The K-armed bandit problem can be solved using a tabular method. It is quite simple: we maintain a table of actions and their estimated expected rewards. The agent then uses this table to take actions according to some policy.
At time step $t$, the estimate of the expected reward for taking action $a$ is the sample average of the rewards received for that action so far:

$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}[A_i = a]}{\sum_{i=1}^{t-1} \mathbb{1}[A_i = a]}$

i.e. the sum of rewards received when action $a$ was taken prior to $t$, divided by the number of times $a$ was taken (here $\mathbb{1}[\cdot]$ is 1 when its argument is true and 0 otherwise).
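This sample average can be maintained incrementally, so the agent never needs to store past rewards. A minimal sketch, with `Q` and `N` as plain NumPy arrays:

```python
import numpy as np

K = 10
Q = np.zeros(K)  # Q_t(a): running estimate of the expected reward of each action
N = np.zeros(K)  # number of times each action has been taken so far

def update(action, reward):
    # Incremental form of the sample average: Q <- Q + (R - Q) / N
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
```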
One of the simplest policies is the greedy policy where the agent always chooses the action with the maximum expected return.
$A_t=\underset{a}{\operatorname{argmax}} Q_t(a)$
Another approach is the epsilon-greedy policy, which takes the greedy action with probability $1-\epsilon$ and a random action with probability $\epsilon$. This ensures that the entire action space is explored.
We can define yet another policy, the decaying-epsilon-greedy policy, where we slowly decay epsilon over time.
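Putting these pieces together, an agent implementing all three policies might look roughly like the sketch below. It mirrors the `Agent(env, epsilon, decay)` constructor used in this notebook, but it is only an illustration; the actual implementation in the repository linked next may differ (in particular, the decay schedule here, multiplying epsilon by `1 - decay` at every step, is an assumption).

```python
import numpy as np

class Agent:
    """Sketch of a (decaying) epsilon-greedy agent for a K-armed bandit."""

    def __init__(self, env, epsilon, decay=0.0):
        self.k = env.action_space.n
        self.epsilon = epsilon          # exploration probability (0 => pure greedy)
        self.decay = decay              # assumed per-step decay rate of epsilon
        self.Q = np.zeros(self.k)       # estimated expected reward per action
        self.N = np.zeros(self.k)       # number of times each action was taken

    def act(self):
        if np.random.rand() < self.epsilon:
            action = np.random.randint(self.k)      # explore: random action
        else:
            action = int(np.argmax(self.Q))         # exploit: greedy action
        self.epsilon *= (1.0 - self.decay)          # no-op when decay == 0
        return action

    def observe(self, action, reward):
        # incremental sample-average update of Q_t(a)
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
```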
The code at https://github.com/manifoldai/reinforcement_learning_public/blob/master/src/models/k_armed_bandit.py trains three agents with these three policies. We can do the same inline in the Jupyter notebook.
# epsilon = 0 => No exploration, always greedy
agent = Agent(env, 0)
g_df = play_wrapper('greedy agent', env, agent)
# 10% exploration
agent = Agent(env, 0.1)
epsilon_g_df = play_wrapper('epsilon-greedy agent', env, agent)
# 10% exploration at the beginning and decaying later
agent = Agent(env, 0.1, 0.01)
decaying_epsilon_g_df = play_wrapper('decaying epsilon-greedy agent', env, agent)
df = pd.concat([g_df, epsilon_g_df, decaying_epsilon_g_df], axis=1)
Through simulation we can understand how these three algorithms explore the action space and exploit their knowledge at any given time step. This tradeoff is fundamental to many RL algorithms, so it is informative to look closely at this concept through the 10-armed bandit problem.
# plot the actions taken by the three policies over time
fig = plot_actions(df)
As we can see, the greedy policy explored very little and settled on action 7 very quickly, just by luck. The epsilon-greedy and decaying-epsilon-greedy algorithms found the optimal action early but continued to explore. However, the decaying-epsilon-greedy algorithm explores less and less over time; by time step 3000, it is choosing the optimal action almost all the time.
fig = plot_rewards(df)
Above, we can see the average reward for the three policies over time. The greedy algorithm converged to the optimal action early on by luck. The epsilon-greedy and decaying-epsilon-greedy algorithms converged to that action as well, not through luck but through systematic exploration of the action space. However, the epsilon-greedy algorithm continues to pay the price of exploration and therefore never catches up to the performance of the decaying-epsilon-greedy algorithm.
Note that due to randomness, the results may differ from one run to the next. For example, the greedy algorithm can lock onto a sub-optimal action early on and suffer for it. More generally, counting on luck is a bad strategy; over a large number of runs it will be sub-optimal.
Let's do one more run.
# epsilon = 0 => No exploration, always greedy
agent = Agent(env, 0)
g_df = play_wrapper('greedy agent', env, agent)
# 10% exploration
agent = Agent(env, 0.1)
epsilon_g_df = play_wrapper('epsilon-greedy agent', env, agent)
# 10% exploration at the beginning and decaying later
agent = Agent(env, 0.1, 0.01)
decaying_epsilon_g_df = play_wrapper('decaying epsilon-greedy agent', env, agent)
df = pd.concat([g_df, epsilon_g_df, decaying_epsilon_g_df], axis=1)
fig = plot_actions(df)
fig = plot_rewards(df)
We see that this time the greedy algorithm converged to a sub-optimal action with very little exploration. The other two algorithms converged to the optimal action (action 7). Of those two, the epsilon-greedy algorithm performed better initially, but it fell behind the decaying-epsilon policy around time step 372. This is because the epsilon-greedy algorithm continues to pay the cost of exploration while the decaying-epsilon policy reduces this cost over time.
In this post we looked at the multi-armed bandit problem in RL and simulated three different policies to solve it, studying the nuances between them through simulation. Specifically, we looked at the tradeoff between exploration and exploitation, a fundamental concept in RL.