We consider a simple environment as shown below. Our simple agent has the possibility to move West, East, North and South (except on the borders of the environment). Moreover it has two additional actions:
It can hit with its sword (which is essentially useful when facing the skeleton)
It can open the door and escape the room (useful when being in the upper rightmost cell)
As you can imagine, hitting while in an empty cell is a loss of energy and should be penalized. We will thus associate useless behaviors (like hitting when alone and opening non existing doors) with a negative reward of -10. Hitting the skeleton and opening the door when the skeleton has been killed should be associated with higher rewards of +30. Escaping without the kill will be associated with a negative reward of -10 (The agent should not exit before getting rid of all the evil in the room)
There are a total of 48 cells. Each cell can contain the skeleton, be empty or contain the door (exclusively) which makes a total of $48\times3 = 144$ states for your environment. Moreover, we want to keep track of whether or not the skeleton has been killed so we will take into account an additional variable encoding whether the kill has been completed. This thus leads to a total of $48\times 3\times2 = 288$.
Any move will be associated with a small penalty of -1 point
Start by coding a simple TD Q-learning agent which stores the Q-table entirely as a $288\times 6$ numpy array. Use an exploration function to avoid being stuck in a limited region of the environment. Recall that such a function can be defined as
\begin{align} f(Q_\theta[s,a], N[s,a]) = \left\{\begin{array}{ll} R^+& \text{if $N[s,a]<N_e$}\\ Q_\theta[s,a]&\text{otherwise} \end{array}\right. \end{align}where $R^+$ is an estimate of the best possible reward that can be obtained in any state.
Concretely for each new state and action, you should update the $Q$-table after incrementing the count $N[s,a]$ (of the number of times action $a$ has been taken in state $s$)
\begin{align} Q[s, a] & \leftarrow Q[s,a] +\eta \left(N_{sa}[s,a]\right)\left(r+\gamma \max_{a'} Q[s',a'] - Q[s,a]\right)\\ s,a,r &\leftarrow s', \underset{a'}{\operatorname{argmax}} f\left(Q[s',a'], N[s',a']\right), r' \end{align}\eta \left(N_{sa}[s,a]\right) is a learning rate that can be set to a sufficiently small constant or be evolving as the inverse of $N[s, a]$
Consider a couple of episodes to make sure that you fill out a sufficient number of entries in the table. Interlace the learning phases with a couple of evaluation phases during which you should store the number of steps needed to complete the kill and exit so that you will be able to plot the evolution of your agent at the end of the simulation.
You can associate a negative reward with the impossible actions corresponding to hitting walls on the borders of the environment.
import numpy as np
import matplotlib.pyplot as plt
num_evaluations = 10
Q = np.zeros((288,6))
time_to_completion = np.zeros((num_evaluations,))
for k in np.arange(num_evaluations):
# train for a couple of epochs
for episode in np.arange(200):
simulationEnded = False
while simulationEnded !=True:
'''complete with the TD-Q learning steps and learn the Q-table'''
# The evaluate the agent
simulationEnded = False
step_count = 0
while simulationEnded !=True:
'''complete with the TD-Q learning steps and learn the Q-table'''
step_count+=1
time_to_completion[k] = step_count
We now want to improve the generalization skills of our agent. We will encode our $Q$-table as a simple feedforward neural network. To avoid losing time with the implementation, we will rely on keras for the architecture (see below). We will consider a simple one hidden layer network that takes as input a vector encoding the current state of the environment (note that such a vector can be encoded in various ways, for example as a $5$-tuple including the cartesian coordinates of the agent and 3 variables indicating whether there is a door, skeleton,..) and returns as ouput the value associated with each of the actions. In other words we consider a parametric model $f_\theta\in \mathbb{R}^{|a|}$ such that $[f_{\theta}(s)]_a \approx Q[s, a]$. Recall that the TD update for such a model is given
\begin{align} \theta_i &\leftarrow \theta_i + \alpha \left[R[s] + \gamma \max_{a'} \hat{Q}_{\theta}[s',a'] - \hat{Q}[s,a]\right]\frac{\partial \hat{Q}_\theta[s,a]}{\partial \theta_i} \end{align}Start by updating the weights after each new step (in an SGD fashion) and evaluate the agent as before by interlacing your training with a couple of evaluation step where you should compute the time needed to exit with the kill completed.
We will then be a little wiser and keep track of a short memory before updating the network.
Consider the following steps :
Initialize the memory to an empty vector
As before, we will consider a number of episodes. For each episode, we will consider a full simulation (until exiting with the kill). Again you should rely on some exploration function to alternate between exploration and exploitation (selection of the action corresponding to the maximum output of the network for the current state).
After each action, observe the reward and store the tuple $(s_t, a_t, r_t, s_{t+1})$ and store it in the memory.
Select a random minibatch from the memory and update the networks through a gradient update on the minibatch (mb)
$$\min_{\theta_i} \frac{1}{\text{size}(\text{mb})}\sum_{t\in \text{mb}}\left(R[s_t] + \gamma \max_{a_{t+1}} \hat{Q}_{\theta}[s_{t+1},a_{t+1}] - \hat{Q}_\theta[s_t,a_t]\right)^2 $$
To see how to build a Feedforward neural network in keras, the simplest approach is to use the method 'add' such as in
model = keras.Sequential()
model.add(layers.Dense(n, activation="relu"))
you can see here for more details.
Once you have built the model, you need to select an optimizer before you can train it. This can be implemented through the 'compile method'
model.compile(optimizer='SGD',
loss='mean_squared_error',
metrics=['accuracy'])
Finally to train the network, you can rely on the fit function (see also here)
model.fit(train_data, train_labels,epochs=5)
Note that Keras (when using 'compile' with the SGD option) relies on a default batch of size 32 so either you should specify a smaller batch size or you should make sure your memory is sufficiently big. Also note that as written above we will in fact do a coule of SGD iterations (one for each sample in the minibatch, repeated epochs times).
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
num_evaluations = 10
time_to_completion = np.zeros((num_evaluations,))
for k in np.arange(num_evaluations):
# train for a couple of epochs
for episode in np.arange(200):
simulationEnded = False
while simulationEnded !=True:
'''complete with the TD-Q learning steps and learn the Q-table'''
# The evaluate the agent
simulationEnded = False
step_count = 0
while simulationEnded !=True:
'''complete with the TD-Q learning steps and learn the Q-table'''
step_count+=1
time_to_completion[k] = step_count
Compare and plot the improvement in the behavior of your agent for a couple of neural networks architectures. Don't make the network overly complicated. You can just try different activation functions with different numbers of units per layers (possibly also consider a small number of hidden layers)