We consider a simple Tic Tac Toe game such as the one shown below.
Implement a simple TD learning reinforcement learning framework for this game. Recall that the temporal difference (TD) update rule is given by
\begin{align*} U^\pi(s) \leftarrow U^\pi(s) + \alpha\left(R(s) + \gamma U^\pi(s')- U^\pi(s)\right) \end{align*}The agent can start by alternating between exploration and exploitation but should ultimately exploit the utility.
Hint: add a reward of -0.1 for each action, a reward of +100 for a win, and a reward of -100 for a loss.
Start by implementing a simple RL agent that plays against a random opponent. We will train the agent for a number of episodes (completed games) and then evaluate it on a few more games (by counting the total number of wins). How many possible states are there in Tic Tac Toe?
import numpy as np

n_episodes = 100                       # number of training episodes (games)
for episode in np.arange(n_episodes):
    won = False                        # should switch to True when three crosses are aligned
    while not won:
        # add your simulation here
        break                          # placeholder so the template terminates
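To make the exercise above more concrete, here is a minimal sketch of the two tabular building blocks: the TD update from the equation above, and an epsilon-greedy move choice over afterstates so that the agent alternates between exploration and exploitation. The board encoding (a tuple of nine values in {+1, 0, -1} for agent, empty, opponent) and the constants are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
U = {}                                   # utility table, keyed by board tuple

def td_update(s, r, s_next):
    """U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    U[s] = U.get(s, 0.0) + ALPHA * (r + GAMMA * U.get(s_next, 0.0) - U.get(s, 0.0))

def choose_move(board):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit U."""
    free = [i for i in range(9) if board[i] == 0]
    if rng.random() < EPSILON:
        return int(rng.choice(free))     # explore: random legal move
    def value(m):                        # exploit: utility of the resulting afterstate
        after = list(board)
        after[m] = 1
        return U.get(tuple(after), 0.0)
    return max(free, key=value)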
Modify your answer to the first question so that the utility is stored as a linear model over a set of features. Try different models and compare the results with your previous answer (a sketch of one possible feature model is given after the code cells below). Possible features include
-...
import numpy as np

n_episodes = 100                       # number of training episodes (games)
for episode in np.arange(n_episodes):
    won = False                        # should switch to True when three crosses are aligned
    while not won:
        # add your simulation here (first feature model)
        break                          # placeholder so the template terminates
import numpy as np

n_episodes = 100                       # number of training episodes (games)
for episode in np.arange(n_episodes):
    won = False                        # should switch to True when three crosses are aligned
    while not won:
        # add your simulation here (second feature model)
        break                          # placeholder so the template terminates
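As an illustration, here is a sketch of one possible linear utility model; the particular features (a bias term plus the number of open two-in-a-rows for each player) are only an assumption, and the TD error is used as the gradient signal for the weights.

import numpy as np

ALPHA, GAMMA = 0.01, 0.9
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
theta = np.zeros(3)

def phi(board):
    """board: tuple of nine values, +1 agent, -1 opponent, 0 empty."""
    own_two = opp_two = 0
    for i, j, k in LINES:
        vals = (board[i], board[j], board[k])
        if 0 in vals and sum(vals) == 2:
            own_two += 1                 # two agent marks and an empty cell
        if 0 in vals and sum(vals) == -2:
            opp_two += 1                 # two opponent marks and an empty cell
    return np.array([1.0, own_two, opp_two])

def utility(board):
    return theta @ phi(board)

def td_update_linear(board, r, board_next):
    """theta_i <- theta_i + alpha * (R + gamma * U(s') - U(s)) * dU/dtheta_i."""
    delta = r + GAMMA * utility(board_next) - utility(board)
    theta[:] = theta + ALPHA * delta * phi(board)    # dU/dtheta = phi(board)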
Competition can be good in training too, provided that the trainees are actually willing to learn. In this question we want two agents to train against each other.
import numpy as np

n_episodes = 100                       # number of training episodes (games)
for episode in np.arange(n_episodes):
    won = False                        # should switch to True when one of the agents aligns three marks
    while not won:
        # add your simulation here
        break                          # placeholder so the template terminates
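A possible self-play skeleton is sketched below: both players keep their own utility table, moves are chosen uniformly at random here (an assumption; the exercise asks you to add exploration/exploitation), and each player applies the tabular TD update along its own sequence of afterstates once the game ends, using the rewards from the hint.

import numpy as np

rng = np.random.default_rng(0)
ALPHA, GAMMA = 0.1, 0.9
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
U = {1: {}, -1: {}}                              # one utility table per player

def winner(b):
    for i, j, k in LINES:
        if b[i] != 0 and b[i] == b[j] == b[k]:
            return b[i]
    return 0

for episode in range(2000):
    board, player, history = [0] * 9, 1, {1: [], -1: []}
    w = 0
    while True:
        free = [i for i in range(9) if board[i] == 0]
        board[int(rng.choice(free))] = player
        history[player].append(tuple(board))     # afterstate seen by this player
        w = winner(board)
        if w != 0 or not any(v == 0 for v in board):
            break
        player = -player
    for p in (1, -1):                            # each agent learns from its own states
        final = 100.0 if w == p else (-100.0 if w == -p else -0.1)
        states = history[p]
        for t, s in enumerate(states):
            terminal = t + 1 == len(states)
            r = final if terminal else -0.1
            u_next = 0.0 if terminal else U[p].get(states[t + 1], 0.0)
            U[p][s] = U[p].get(s, 0.0) + ALPHA * (r + GAMMA * u_next - U[p].get(s, 0.0))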
We consider the obstacle-free environment below.
For this simple environment, implement a direct utility estimation agent for which the utility is stored first in tabular form and then through the simple linear model
\begin{align*} \hat{U}_\theta(x, y) = \theta_0 + \theta_1 x + \theta_2 y \end{align*}
Train the two agents for several episodes and compare their performance.
import numpy as np

n_episodes = 100                       # number of training episodes
for episode in np.arange(n_episodes):
    reached_exit = False               # should switch to True when the agent reaches the exit
    while not reached_exit:
        # add your simulation here
        break                          # placeholder so the template terminates
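The sketch below illustrates direct utility estimation under some assumptions: a 5x5 obstacle-free grid with the exit at (4, 4), a uniformly random policy, a -0.1 step reward, and a +100 reward at the exit. Each visited state is credited with the observed discounted return, and the utility is then estimated either as a per-state average (tabular) or by a least-squares fit of the linear model above.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
SIZE, GOAL, GAMMA = 5, (4, 4), 0.9
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]
returns = defaultdict(list)                      # state -> list of observed returns

for episode in range(500):
    x, y, traj = 0, 0, []
    while (x, y) != GOAL and len(traj) < 200:
        traj.append((x, y, -0.1))                # step reward for visiting this state
        dx, dy = MOVES[rng.integers(4)]
        x = min(max(x + dx, 0), SIZE - 1)        # stay inside the grid
        y = min(max(y + dy, 0), SIZE - 1)
    traj.append((x, y, 100.0 if (x, y) == GOAL else 0.0))
    g = 0.0
    for sx, sy, r in reversed(traj):             # backward pass: discounted return
        g = r + GAMMA * g
        returns[(sx, sy)].append(g)

U_tab = {s: np.mean(v) for s, v in returns.items()}          # tabular estimate

# Linear estimate: least-squares fit of the same observed returns against [1, x, y].
X = np.array([[1.0, sx, sy] for (sx, sy), v in returns.items() for _ in v])
y_obs = np.array([g for v in returns.values() for g in v])
theta, *_ = np.linalg.lstsq(X, y_obs, rcond=None)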
We now want to add obstacles back.
To handle the obstacles, we will consider a Q-learning approach. Implement both the traditional TD Q-learning agent, with update
\begin{align*} Q[s, a] \leftarrow Q[s, a] + \eta \left(R[s] + \gamma \max_{a'} Q[s', a'] - Q[s, a]\right) \end{align*}
and a SARSA agent, with update
\begin{align*} Q[s, a] \leftarrow Q[s, a] + \eta \left(R[s] + \gamma Q[s', a'] - Q[s, a]\right) \end{align*}
where, in SARSA, $a'$ is the action actually taken in $s'$ rather than the maximizing one. Use an exploration/exploitation framework or implement an exploration function of the form
\begin{align*} f(U, n) = \left\{\begin{array}{ll} R^+ & \text{if $n < N_e$}\\ U & \text{otherwise} \end{array}\right. \end{align*}
import numpy as np

n_episodes = 100                       # number of training episodes
for episode in np.arange(n_episodes):
    reached_exit = False               # should switch to True when the agent reaches the exit
    while not reached_exit:
        # add your simulation here
        break                          # placeholder so the template terminates
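For reference, here is a sketch of the two tabular updates together with the exploration function $f(U, n)$; the state/action encoding and the constants $R^+$ and $N_e$ are illustrative assumptions.

import numpy as np

ETA, GAMMA = 0.1, 0.9
R_PLUS, N_E = 100.0, 5            # optimistic reward and visit threshold for f(U, n)
ACTIONS = ["up", "down", "left", "right"]
Q = {}                            # Q-table, keyed by (state, action)
N = {}                            # visit counts, keyed by (state, action)

def q_learning_update(s, a, r, s_next):
    """Off-policy update: bootstrap on the best next action."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in ACTIONS)
    Q[(s, a)] = Q.get((s, a), 0.0) + ETA * (r + GAMMA * best_next - Q.get((s, a), 0.0))

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy update: bootstrap on the action actually taken in s'."""
    Q[(s, a)] = Q.get((s, a), 0.0) + ETA * (r + GAMMA * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0))

def f(u, n):
    """Exploration function: optimistic value while (s, a) has few visits."""
    return R_PLUS if n < N_E else u

def choose_action(s):
    """Greedy choice under the exploration function."""
    return max(ACTIONS, key=lambda a: f(Q.get((s, a), 0.0), N.get((s, a), 0)))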
We now want to store the Q-table as a parametric function. You can, for example, take this function to be of the form
\begin{align*} Q[s, a] = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 a \end{align*}Compare the performance of the two agents. Recall that for a parametric representation of a TD learning agent, the parameters are updated as
\begin{align*} \theta_i \leftarrow \theta_i + \eta \left[R[s] + \gamma \max_{a'} \hat{Q}_{\theta}[s', a'] - \hat{Q}_\theta[s, a]\right]\frac{\partial \hat{Q}_\theta[s, a]}{\partial \theta_i} \end{align*}
import numpy as np

n_episodes = 100                       # number of training episodes
for episode in np.arange(n_episodes):
    reached_exit = False               # should switch to True when the agent reaches the exit
    while not reached_exit:
        # add your simulation here
        break                          # placeholder so the template terminates
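A sketch of the linear parametric Q-function and its gradient TD update is given below; encoding the four moves as the integers 0 to 3 is an assumption made only so that the action can enter the linear model.

import numpy as np

ETA, GAMMA = 0.01, 0.9
ACTIONS = [0, 1, 2, 3]            # up, down, left, right (assumed encoding)
theta = np.zeros(4)

def features(x, y, a):
    return np.array([1.0, x, y, float(a)])

def q_hat(x, y, a):
    """Q_theta(s, a) = theta_0 + theta_1 x + theta_2 y + theta_3 a."""
    return theta @ features(x, y, a)

def td_update(x, y, a, r, x_next, y_next):
    best_next = max(q_hat(x_next, y_next, a2) for a2 in ACTIONS)
    delta = r + GAMMA * best_next - q_hat(x, y, a)
    theta[:] = theta + ETA * delta * features(x, y, a)   # dQ/dtheta_i = feature_i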
Encode the Q-table by means of a neural network that takes a position $(x, y)$ as input and returns, for each possible move, the probability that this move is optimal at that position.
import numpy as np

n_episodes = 100                       # number of training episodes
for episode in np.arange(n_episodes):
    reached_exit = False               # should switch to True when the agent reaches the exit
    while not reached_exit:
        # add your simulation here
        break                          # placeholder so the template terminates
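One possible sketch, using a small two-layer network written in plain NumPy: the network maps a position $(x, y)$ to a softmax distribution over the four moves, and a single cross-entropy gradient step pushes the distribution toward whichever move is currently believed to be optimal. The layer sizes, learning rate, and the way the target move is obtained are all assumptions.

import numpy as np

rng = np.random.default_rng(0)
N_HIDDEN, N_ACTIONS, LR = 16, 4, 0.01          # actions: up, down, left, right
W1 = rng.normal(0, 0.1, size=(2, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0, 0.1, size=(N_HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def forward(x, y):
    """Return (hidden activations, move probabilities) for position (x, y)."""
    h = np.tanh(np.array([x, y]) @ W1 + b1)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())            # numerically stable softmax
    return h, p / p.sum()

def train_step(x, y, best_action):
    """One cross-entropy gradient step toward the move believed to be optimal."""
    global W1, b1, W2, b2
    h, p = forward(x, y)
    target = np.zeros(N_ACTIONS)
    target[best_action] = 1.0
    dlogits = p - target                         # softmax + cross-entropy gradient
    dh = (W2 @ dlogits) * (1 - h ** 2)           # backprop through tanh
    W2 -= LR * np.outer(h, dlogits)
    b2 -= LR * dlogits
    W1 -= LR * np.outer(np.array([x, y]), dh)
    b1 -= LR * dh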