Reinforcement learning is one of the hottest (if not the hottest) areas of deep learning, and it is possibly the closest we have come to achieving general AI at this point in time. The reason it is touted as a step towards general AI is that the same framework can be reused across different environments to maximise future rewards. Apart from the Atari games that made Deep Reinforcement Learning famous and its victory over the reigning world Go champion, it is now being used in finance and to teach robots to walk.
The general concept of RL is to simulate a game a large number of times and to learn which actions constitute a good move. One of the biggest challenges RL faces is inferring what a good move actually is: a good move in the current state might be a bad move for future states. This is certainly the case in chess, for example, where you constantly sacrifice pieces for larger future rewards.
The goal of reinforcement learning is to maximise the return. In this tutorial we use 'environments' from OpenAI Gym, which provide the game, the physics/rules by which the game is controlled, and the reward achieved for a given action at the current state of the game.
The Bellman equation plays a crucial role in reinforcement learning. \begin{align} Q(S_t)(a_t) = r_t + \gamma\max_{a_{t+1}}Q(S_{t+1})(a_{t+1}) \end{align} where $Q$ is a quality function, which takes the state $S_t$ and outputs the maximum possible discounted future reward for action $a_t$, and $\gamma$ is a discount factor. Most texts depict the quality function as $Q(S_t, a_t)$; however, I wish to stress that $Q(S_t)$ outputs the maximum discounted future reward for every possible action. In this tutorial $a_t$ is a one-hot encoded variable that selects the reward of the chosen action out of all possible actions. Notice how we could recursively expand this function.
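To make the recursion explicit (this expansion is my own illustration, assuming the greedy action is taken at every subsequent step), substituting the equation into itself gives \begin{align} Q(S_t)(a_t) = r_t + \gamma\left(r_{t+1} + \gamma\max_{a_{t+2}}Q(S_{t+2})(a_{t+2})\right) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \end{align} so the quality function is simply the discounted sum of all future rewards collected along that greedy trajectory.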
Reference: https://github.com/yenchenlin/DeepLearningFlappyBird
Initialize replay memory D to size N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialize state s_1
    for t = 1, T do
        With probability ϵ select a random action a_t
        otherwise select a_t = argmax_a Q(s_t, a)
        Execute action a_t in the emulator and observe r_t, s_(t+1) and whether the episode terminated
        Store transition (s_t, a_t, r_t, s_(t+1)) in D
        if episode > observation episodes:
            Sample a minibatch of transitions (s_j, a_j, r_j, s_(j+1)) from D
            Set y_j :=
                r_j                                         for terminal s_(j+1)
                r_j + γ * max_(a') Q(s_(j+1), a'; θ_i)      for non-terminal s_(j+1)
            Perform a gradient step on (y_j - Q(s_j, a_j))^2 (updating the model Q)
    end for
end for
!pip install gym
!conda install -y JSAnimation
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from keras.models import Sequential, Model
from keras.layers import Dense, Input, Dot
from keras.models import load_model, model_from_json
from keras.optimizers import Adam
import gym
from collections import deque
import time
# Imports specifically so we can render outputs in Jupyter.
from JSAnimation.IPython_display import display_animation
from matplotlib import animation
from IPython.display import display
Using TensorFlow backend.
def create_model(n_states, n_actions):
    """Build two Keras models that share the same layers:
    `model`  takes (state, one-hot action) and outputs Q(S_t)(a_t) -- this is the one we train;
    `model2` takes only the state and outputs Q(S_t), one value per action -- used for predictions.
    """
    # Q(S_t): estimated maximum discounted future reward for every action
    state = Input(shape=(n_states,))
    x1 = Dense(4, activation='relu')(state)
    x2 = Dense(4, activation='relu')(x1)
    out = Dense(n_actions)(x2)
    # Q(S_t)(a_t): the dot product with the one-hot action keeps only the chosen action's Q-value
    actions = Input(shape=(n_actions,))
    out2 = Dot(axes=-1)([out, actions])
    # wrap the above in the Keras Model class
    model = Model(inputs=[state, actions], outputs=out2)
    model.compile(loss='mse', optimizer='rmsprop')
    model2 = Model(inputs=state, outputs=out)
    return model, model2
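To see what the Dot layer is doing, here is a tiny NumPy sketch (my own illustration with made-up numbers, not part of the original notebook): the dot product of the predicted Q-values with a one-hot action simply picks out the Q-value of the action that was taken.

import numpy as np

q_values = np.array([1.7, 0.3])  # hypothetical Q(S_t) for CartPole's two actions
a_t = np.array([0.0, 1.0])       # one-hot encoding of the action taken (action 1)

# Dot(axes=-1)([out, actions]) reduces to this per-sample dot product,
# keeping only the Q-value of the chosen action.
q_s_a = np.dot(q_values, a_t)
print(q_s_a)                     # 0.3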
def train_data(minibatch, model):
    """Convert a minibatch of stored transitions into (inputs, targets) for a gradient step."""
    s_j_batch = np.array([d[0] for d in minibatch])        # states
    a_batch = np.array([d[1] for d in minibatch])          # one-hot actions
    r_batch = np.array([d[2] for d in minibatch])          # rewards
    s_j1_batch = np.array([d[3] for d in minibatch])       # next states
    terminal_batch = np.array([d[4] for d in minibatch])   # episode-ended flags
    # Q(S_(j+1)) for every action, predicted by the state-only model
    readout_j1_batch = model.predict(s_j1_batch, batch_size=BATCH)
    # Bellman target: r_j + gamma * max_a' Q(S_(j+1))(a')
    y_batch = r_batch + GAMMA * np.max(readout_j1_batch, axis=1)
    # for terminal transitions the target is just the reward
    y_batch[terminal_batch] = r_batch[terminal_batch]
    return s_j_batch, a_batch, y_batch
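As a quick sanity check of the target computation (again an illustration with made-up numbers), consider two transitions where the second one ends the episode:

import numpy as np

GAMMA = 0.9
r_batch = np.array([1.0, 1.0])
readout_j1_batch = np.array([[2.0, 3.0],   # Q(S_(j+1)) for transition 0
                             [5.0, 4.0]])  # Q(S_(j+1)) for transition 1 (terminal, so ignored)
terminal_batch = np.array([False, True])

y_batch = r_batch + GAMMA * np.max(readout_j1_batch, axis=1)  # [1 + 0.9*3, 1 + 0.9*5]
y_batch[terminal_batch] = r_batch[terminal_batch]             # terminal target is just r_j
print(y_batch)                                                # -> [3.7, 1.0]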
env = gym.make('CartPole-v0')
STATES, ACTIONS = env.observation_space.shape[0], env.action_space.n
model, out = create_model(STATES, ACTIONS)
INITIAL_EPSILON = 1e-1
FINAL_EPSILON = 1e-4
DECAY = 0.9
GAMMA = 0.9 # discount factor for future rewards
OBSERVE = 5000. # timesteps to observe before training
REPLAY_MEMORY = 5000 # number of previous transitions to remember
TIME_LIMIT = 100000
BATCH = 128
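One thing to note about these settings (my own observation, not from the original): because epsilon is multiplied by DECAY on every timestep after the observation phase, it falls from INITIAL_EPSILON to below FINAL_EPSILON in roughly 66 steps, so exploration effectively stops very early in training.

import math

# number of multiplicative decay steps for INITIAL_EPSILON * DECAY**n to drop below FINAL_EPSILON
n = math.ceil(math.log(FINAL_EPSILON / INITIAL_EPSILON) / math.log(DECAY))
print(n)  # 66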
# open up a game state to communicate with emulator
state = env.reset()
# store the previous observations in replay memory
D = deque(maxlen=REPLAY_MEMORY)
loss = []
# take one random action to obtain an initial state s_t
a_t = np.zeros(ACTIONS)
a_t[np.random.choice(ACTIONS)] = 1
s_t, r_0, terminal, _ = env.step(np.argmax(a_t))
# start training
epsilon = INITIAL_EPSILON
up_time = [0]
start = time.time()
for t in range(TIME_LIMIT):
    # choose an action (epsilon-greedy)
    readout_t = out.predict(s_t[None, :])
    a_t = np.zeros([ACTIONS])
    if np.random.random() <= epsilon:
        a_t[np.random.choice(ACTIONS)] = 1
    else:
        a_t[np.argmax(readout_t)] = 1
    # scale down epsilon once training has started
    if epsilon > FINAL_EPSILON and t > OBSERVE:
        epsilon *= DECAY
    # run the selected action, observe the next state and reward,
    # and store the transition in D
    s_t1, r_t, terminal, _ = env.step(np.argmax(a_t))
    D.append((s_t, a_t, r_t, s_t1, terminal))
    if terminal:
        up_time.append(0)
        # start a new episode; the reset observation becomes the next state
        s_t1 = env.reset()
    else:
        up_time[-1] += r_t
    # only train if done observing
    if t > OBSERVE:
        # sample a minibatch of stored transitions to train on
        idx = np.random.choice(REPLAY_MEMORY, BATCH, replace=False)
        minibatch = [D[i] for i in idx]
        # get the batch variables
        s_t_batch, a_batch, y_batch = train_data(minibatch, out)
        # perform a gradient step
        loss.append(model.train_on_batch([s_t_batch, a_batch], y_batch))
    # update the old state
    s_t = s_t1
    if t % (TIME_LIMIT // 20) == 0:
        print('Episode :', t, ', time taken: ', time.time() - start, 's, average up time: ', np.mean(up_time[-100:]))
        start = time.time()
[2017-10-29 15:57:44,293] Making new env: CartPole-v0
Episode : 0 , time taken: 0.04322099685668945 s, average up time: 1.0
Episode : 5000 , time taken: 1.8772480487823486 s, average up time: 97.0784313725
Episode : 10000 , time taken: 14.008546113967896 s, average up time: 36.98
Episode : 15000 , time taken: 13.076855897903442 s, average up time: 70.53
Episode : 20000 , time taken: 12.869633913040161 s, average up time: 104.74
Episode : 25000 , time taken: 12.837337017059326 s, average up time: 133.49
Episode : 30000 , time taken: 12.839324951171875 s, average up time: 146.25
Episode : 35000 , time taken: 13.132809162139893 s, average up time: 144.52
Episode : 40000 , time taken: 12.963682889938354 s, average up time: 167.73
Episode : 45000 , time taken: 13.547950029373169 s, average up time: 181.39
Episode : 50000 , time taken: 14.607528924942017 s, average up time: 170.48
Episode : 55000 , time taken: 15.984797954559326 s, average up time: 151.86
Episode : 60000 , time taken: 16.933092832565308 s, average up time: 146.43
Episode : 65000 , time taken: 17.319180965423584 s, average up time: 163.32
Episode : 70000 , time taken: 17.957252979278564 s, average up time: 184.96
Episode : 75000 , time taken: 21.94894814491272 s, average up time: 198.13
Episode : 80000 , time taken: 24.839221000671387 s, average up time: 171.12
Episode : 85000 , time taken: 24.760560035705566 s, average up time: 136.11
Episode : 90000 , time taken: 20.755791902542114 s, average up time: 116.67
Episode : 95000 , time taken: 21.71216917037964 s, average up time: 97.53
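The notebook does not persist the trained network, but since `model` and `out` share the same layers you can save the weights after training and restore them later. A minimal sketch (the filename is just an example, not from the original):

# save the trained weights (shared by both `model` and `out`)
model.save_weights('cartpole_dqn_weights.h5')

# later: rebuild the same architecture and load the weights back in
model, out = create_model(STATES, ACTIONS)
model.load_weights('cartpole_dqn_weights.h5')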
plt.figure(figsize=(12, 5))
plt.plot(up_time)
plt.show()
plt.figure(figsize=(12, 5))
plt.plot(loss)
plt.show()
Only run the following if you are running this locally and not in Docker, since rendering the environment requires a display.
def display_frames_as_gif(frames):
    """
    Displays a list of frames as a gif, with controls
    """
    #plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi = 72)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=50)
    display(display_animation(anim, default_mode='once'))
env = gym.make('CartPole-v0')
# Run a demo of the environment with random actions
observation = env.reset()
cum_reward = 0
frames = []
for t in range(5000):
    # Render into buffer.
    frames.append(env.render(mode='rgb_array'))
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    if done:
        break
env.render(close=True)
display_frames_as_gif(frames)
print(t)
[2017-10-29 16:04:01,470] Making new env: CartPole-v0
15
# Run a demo of the environment using the trained model
observation = env.reset()
cum_reward = 0
frames = []
# start with a random action, then follow the greedy policy
a_t = np.zeros([ACTIONS])
a_t[env.action_space.sample()] = 1
for t in range(5000):
    # step the environment and render into the frame buffer
    s_t, reward, done, info = env.step(np.argmax(a_t))
    frames.append(env.render(mode='rgb_array'))
    # pick the action with the highest predicted Q-value for the next step
    readout_t = out.predict(s_t[None, :])
    a_t = np.zeros([ACTIONS])
    a_t[np.argmax(readout_t)] = 1
    if done:
        break
env.render(close=True)
display_frames_as_gif(frames)
print(t)
114