Reinforcement Learning


"Reinforcement learning (RL)" is a sub-field of machine learning that has seen many new developments in the last 5 years. The publication of "Deep Q-Networks" from DeepMind in particular ushered in a new era for RL. An important concept in all RL algorithms is the tradeoff between exploration and exploitation. In this post we will simulate a problem called the "Multi-armed bandit" and use it to understand the details of this tradeoff.

Goals of the post

  • Share some good resources on RL.
  • Simulate the K-armed bandit problem.
  • Understand the tradeoff between Exploration and Exploitation in RL


There are several great resources on RL. Below are some of the best ones I found for practitioners like myself. These are good starting points for understanding the foundations and learning by doing. Richard Sutton and Andrew Barto's "second edition" is beautifully written. I really like how they provide intuitive explanations of the algorithms along with pseudo-code. The pseudo-code is proving to be invaluable when I want to code up an algorithm and understand its details. Andrej Karpathy wrote an "excellent post" almost two years ago. It's a great general introduction and also a good starting point for a family of RL algorithms called policy gradient methods. Finally, I found "this" course from Berkeley, which is very recent and has full lecture notes and videos available. I have only reviewed one lecture so far, but it looks very promising.


  1. All of the code for this post can be found here
  2. Jupyter notebook corresponding to this blog post can be found at

The following resources can be helpful in understanding the code. Introduction to OpenAI gym can be found here.

We will be using this library, which is built on top of OpenAI Gym, to simulate the 10-armed bandit problem.

Multi-armed bandit problem

General RL problem

Before we look at the multi-armed bandit problem, let's take a quick look at the general RL problem setting. The picture below captures the general RL problem. There are two entities: the agent and the environment. At time $t$, the agent observes state $S_t$ from the environment and also receives a reward $R_t$. The agent then takes an action $A_t$. In response to action $A_t$, the environment provides the next state and reward pair, and the process continues. This setup represents what is called a Markov Decision Process. The goal of the agent is to maximize the cumulative reward it receives from the environment.
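The interaction loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `CoinFlipEnv`, `run_episode`, and the `reset`/`step` method names are hypothetical stand-ins chosen for this sketch, not the API of OpenAI Gym or of the library used later in this post.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment with two states (0 and 1)."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.state = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.t = 0
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        # Reward is 1 when the action matches the current state, else 0.
        reward = 1.0 if action == self.state else 0.0
        # The next state is drawn at random, independent of the action.
        self.state = self.rng.randint(0, 1)
        self.t += 1
        done = self.t >= self.horizon
        return self.state, reward, done


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # agent chooses A_t from S_t
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
    return total_reward


# A policy that echoes the observed state matches it at every step.
total = run_episode(CoinFlipEnv(horizon=5), policy=lambda s: s)
print(total)
```

The agent only ever sees states and rewards; nothing tells it which action would have been the correct one.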

The most distinguishing feature of RL compared to supervised learning is that there are no labels associated with actions; there is only reward for each action taken.

Multi-armed bandit problem

The multi-armed bandit problem is a simple RL problem. At every time step, the agent can choose one of K actions. The agent then receives a reward that is drawn from a probability distribution, unknown to the agent, corresponding to that action. The goal of the agent is to choose actions such that the total reward received within a certain number of time steps is maximized. The environment's state remains unchanged for all time steps. This simplifies the problem considerably and makes successive time steps independent and identically distributed (IID). This can be represented as shown below.

Here $A_t \in \{1, 2, \ldots, K\}$ and the reward $R_t \sim \mathcal{N}(\mu_k, \sigma^{2})$, where $k$ is the action taken.
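To make the reward model concrete, here is a minimal NumPy sketch of such a bandit. The class name `GaussianBandit`, its `pull` method, and the choice to draw the hidden means $\mu_k$ from a standard normal are assumptions made for illustration; they are not taken from the library used in this post. Arms are indexed from 0 rather than from 1.

```python
import numpy as np


class GaussianBandit:
    """Hypothetical K-armed bandit with rewards R_t ~ N(mu_k, sigma^2)."""

    def __init__(self, k=10, sigma=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.k = k
        self.sigma = sigma
        # True means are hidden from the agent. Drawing them from N(0, 1)
        # is an assumption of this sketch.
        self.mu = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action):
        """Return one reward for the chosen arm (0-indexed)."""
        return self.rng.normal(self.mu[action], self.sigma)


bandit = GaussianBandit(k=10)

# Pulling the same arm repeatedly yields IID draws from its distribution.
rewards = np.array([bandit.pull(3) for _ in range(5000)])
print(f"true mean: {bandit.mu[3]:.3f}, sample mean: {rewards.mean():.3f}")
```

Because the state never changes, each pull of a given arm is a fresh draw from the same distribution, so the sample mean of many pulls approaches that arm's $\mu_k$.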

We can estimate the reward distribution for each action by simulating an agent that takes random actions at every time step. This is shown below for K=10.

# imports
%load_ext autoreload
%autoreload 2
import sys

import gym
import gym_bandits
import logging
import numpy as np
import os
import pandas as pd
from dotenv import load_dotenv, find_dotenv
from src.visualization.visualize import plot_rewards, plot_actions, dist_plots
from src.models.k_armed_bandit import Agent, play_wrapper
import plotly.offline as pyoffline
import plotly.plotly
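The imports above depend on this post's own `src` package and on `gym_bandits`. For readers without that setup, the random-agent experiment can be approximated with NumPy alone. The following is a stand-in sketch rather than the code used in this post; the variable names, the number of steps, and the unit reward variance are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10            # number of arms
n_steps = 20000   # number of random actions (assumed value)

# Hidden mean reward of each arm, assumed here to be drawn from N(0, 1).
true_means = rng.normal(0.0, 1.0, size=K)

# A purely exploratory agent: every action is chosen uniformly at random.
actions = rng.integers(K, size=n_steps)

# Each reward is drawn from N(mu_k, 1) for the arm k that was pulled.
rewards = rng.normal(true_means[actions], 1.0)

# Per-arm sample mean: sum of rewards per arm divided by pulls per arm.
pulls = np.bincount(actions, minlength=K)
estimates = np.bincount(actions, weights=rewards, minlength=K) / pulls

for k in range(K):
    print(f"arm {k}: true mean {true_means[k]:+.2f}, estimate {estimates[k]:+.2f}")
```

With roughly 2,000 pulls per arm, the estimates should land close to the true means, but the agent pays for that knowledge by spending most of its pulls on arms that are not the best one.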