We consider a simple Tic Tac Toe game such as the one shown below.
Implement a simple TD learning reinforcement learning framework for this game. Recall that the temporal difference (TD) update rule is given by
\begin{align*} U^\pi(s) \leftarrow U^\pi(s) + \alpha\left(R(s) + \gamma U^\pi(s')- U^\pi(s)\right) \end{align*}The agent can start by alternating between exploration and exploitation but should ultimately exploit the utility.
Hint: add a reward of -0.1 for each action, a reward of +100 for a win, and a reward of -100 for a loss.
Start by implementing a simple RL agent that plays against a random opponent. We will train the agent for a number of episodes (completed games) and then evaluate it on a few more games (by counting the total number of wins). How many possible states are there in Tic Tac Toe?
import numpy as np

n_episodes = 100                       # number of training episodes (games)
for episode in np.arange(n_episodes):
    won = False                        # should switch to True when three crosses are aligned
    while not won:
        # add your simulation here
        break                          # placeholder so the template terminates
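To make the exercise above more concrete, here is a minimal sketch of the two tabular building blocks: the TD update from the equation above, and an epsilon-greedy move choice over afterstates so that the agent alternates between exploration and exploitation. The board encoding (a tuple of nine values in {+1, 0, -1} for agent, empty, opponent) and the constants are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
U = {}                                   # utility table, keyed by board tuple

def td_update(s, r, s_next):
    """U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    U[s] = U.get(s, 0.0) + ALPHA * (r + GAMMA * U.get(s_next, 0.0) - U.get(s, 0.0))

def choose_move(board):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit U."""
    free = [i for i in range(9) if board[i] == 0]
    if rng.random() < EPSILON:
        return int(rng.choice(free))     # explore: random legal move
    def value(m):                        # exploit: utility of the resulting afterstate
        after = list(board)
        after[m] = 1
        return U.get(tuple(after), 0.0)
    return max(free, key=value)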
Modify your answer to the first question so that the utility is stored as a linear model over a set of features. Try different models and compare the results with your previous answer (a sketch of one possible feature model is given after the code cells below). Possible features include
-...
import numpy as np

n_episodes = 100                       # number of training episodes (games)
for episode in np.arange(n_episodes):
    won = False                        # should switch to True when three crosses are aligned
    while not won:
        # add your simulation here (first feature model)
        break                          # placeholder so the template terminates
import numpy as np

n_episodes = 100                       # number of training episodes (games)
for episode in np.arange(n_episodes):
    won = False                        # should switch to True when three crosses are aligned
    while not won:
        # add your simulation here (second feature model)
        break                          # placeholder so the template terminates
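As an illustration, here is a sketch of one possible linear utility model; the particular features (a bias term plus the number of open two-in-a-rows for each player) are only an assumption, and the TD error is used as the gradient signal for the weights.

import numpy as np

ALPHA, GAMMA = 0.01, 0.9
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
theta = np.zeros(3)

def phi(board):
    """board: tuple of nine values, +1 agent, -1 opponent, 0 empty."""
    own_two = opp_two = 0
    for i, j, k in LINES:
        vals = (board[i], board[j], board[k])
        if 0 in vals and sum(vals) == 2:
            own_two += 1                 # two agent marks and an empty cell
        if 0 in vals and sum(vals) == -2:
            opp_two += 1                 # two opponent marks and an empty cell
    return np.array([1.0, own_two, opp_two])

def utility(board):
    return theta @ phi(board)

def td_update_linear(board, r, board_next):
    """theta_i <- theta_i + alpha * (R + gamma * U(s') - U(s)) * dU/dtheta_i."""
    delta = r + GAMMA * utility(board_next) - utility(board)
    theta[:] = theta + ALPHA * delta * phi(board)    # dU/dtheta = phi(board)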
Competition can be good in training too, provided that the trainees are actually willing to learn. In this question we want two agents to train against each other.
import numpy as np

n_episodes = 100                       # number of training episodes (games)
for episode in np.arange(n_episodes):
    won = False                        # should switch to True when one of the agents aligns three marks
    while not won:
        # add your simulation here
        break                          # placeholder so the template terminates
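A possible self-play skeleton is sketched below: both players keep their own utility table, moves are chosen uniformly at random here (an assumption; the exercise asks you to add exploration/exploitation), and each player applies the tabular TD update along its own sequence of afterstates once the game ends, using the rewards from the hint.

import numpy as np

rng = np.random.default_rng(0)
ALPHA, GAMMA = 0.1, 0.9
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
U = {1: {}, -1: {}}                              # one utility table per player

def winner(b):
    for i, j, k in LINES:
        if b[i] != 0 and b[i] == b[j] == b[k]:
            return b[i]
    return 0

for episode in range(2000):
    board, player, history = [0] * 9, 1, {1: [], -1: []}
    w = 0
    while True:
        free = [i for i in range(9) if board[i] == 0]
        board[int(rng.choice(free))] = player
        history[player].append(tuple(board))     # afterstate seen by this player
        w = winner(board)
        if w != 0 or not any(v == 0 for v in board):
            break
        player = -player
    for p in (1, -1):                            # each agent learns from its own states
        final = 100.0 if w == p else (-100.0 if w == -p else -0.1)
        states = history[p]
        for t, s in enumerate(states):
            terminal = t + 1 == len(states)
            r = final if terminal else -0.1
            u_next = 0.0 if terminal else U[p].get(states[t + 1], 0.0)
            U[p][s] = U[p].get(s, 0.0) + ALPHA * (r + GAMMA * u_next - U[p].get(s, 0.0))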
We consider the obstacle-free environment below.
For this simple environment, implement a direct utility estimation agent for which the utility is stored first in tabular form and then through the simple linear model
\begin{align*} \hat{U}_\theta(x, y) = \theta_0 + \theta_1 x + \theta_2 y \end{align*}
Train the two agents for several episodes and compare their performance.
import numpy as np

n_episodes = 100                       # number of training episodes
for episode in np.arange(n_episodes):
    reached_exit = False               # should switch to True when the agent reaches the exit
    while not reached_exit:
        # add your simulation here
        break                          # placeholder so the template terminates
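The sketch below illustrates direct utility estimation under some assumptions: a 5x5 obstacle-free grid with the exit at (4, 4), a uniformly random policy, a -0.1 step reward, and a +100 reward at the exit. Each visited state is credited with the observed discounted return, and the utility is then estimated either as a per-state average (tabular) or by a least-squares fit of the linear model above.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
SIZE, GOAL, GAMMA = 5, (4, 4), 0.9
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]
returns = defaultdict(list)                      # state -> list of observed returns

for episode in range(500):
    x, y, traj = 0, 0, []
    while (x, y) != GOAL and len(traj) < 200:
        traj.append((x, y, -0.1))                # step reward for visiting this state
        dx, dy = MOVES[rng.integers(4)]
        x = min(max(x + dx, 0), SIZE - 1)        # stay inside the grid
        y = min(max(y + dy, 0), SIZE - 1)
    traj.append((x, y, 100.0 if (x, y) == GOAL else 0.0))
    g = 0.0
    for sx, sy, r in reversed(traj):             # backward pass: discounted return
        g = r + GAMMA * g
        returns[(sx, sy)].append(g)

U_tab = {s: np.mean(v) for s, v in returns.items()}          # tabular estimate

# Linear estimate: least-squares fit of the same observed returns against [1, x, y].
X = np.array([[1.0, sx, sy] for (sx, sy), v in returns.items() for _ in v])
y_obs = np.array([g for v in returns.values() for g in v])
theta, *_ = np.linalg.lstsq(X, y_obs, rcond=None)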
We now want to add obstacles back.
To handle the obstacles, we will consider a Q-learning approach. Implement both the traditional TD Q-learning agent, with update
\begin{align*} Q[s, a] \leftarrow Q[s, a] + \eta \left(R[s] + \gamma \max_{a'} Q[s', a'] - Q[s, a]\right) \end{align*}
and a SARSA agent, with update
\begin{align*} Q[s, a] \leftarrow Q[s, a] + \eta \left(R[s] + \gamma Q[s', a'] - Q[s, a]\right) \end{align*}
where, in SARSA, $a'$ is the action actually taken in $s'$ rather than the maximizing one. Use an exploration/exploitation framework or implement an exploration function of the form
\begin{align*} f(U, n) = \left\{\begin{array}{ll} R^+ & \text{if $n < N_e$}\\ U & \text{otherwise} \end{array}\right. \end{align*}
import numpy as np

n_episodes = 100                       # number of training episodes
for episode in np.arange(n_episodes):
    reached_exit = False               # should switch to True when the agent reaches the exit
    while not reached_exit:
        # add your simulation here
        break                          # placeholder so the template terminates
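For reference, here is a sketch of the two tabular updates together with the exploration function $f(U, n)$; the state/action encoding and the constants $R^+$ and $N_e$ are illustrative assumptions.

import numpy as np

ETA, GAMMA = 0.1, 0.9
R_PLUS, N_E = 100.0, 5            # optimistic reward and visit threshold for f(U, n)
ACTIONS = ["up", "down", "left", "right"]
Q = {}                            # Q-table, keyed by (state, action)
N = {}                            # visit counts, keyed by (state, action)

def q_learning_update(s, a, r, s_next):
    """Off-policy update: bootstrap on the best next action."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in ACTIONS)
    Q[(s, a)] = Q.get((s, a), 0.0) + ETA * (r + GAMMA * best_next - Q.get((s, a), 0.0))

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy update: bootstrap on the action actually taken in s'."""
    Q[(s, a)] = Q.get((s, a), 0.0) + ETA * (r + GAMMA * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0))

def f(u, n):
    """Exploration function: optimistic value while (s, a) has few visits."""
    return R_PLUS if n < N_E else u

def choose_action(s):
    """Greedy choice under the exploration function."""
    return max(ACTIONS, key=lambda a: f(Q.get((s, a), 0.0), N.get((s, a), 0)))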
We now want to store the Q-table as a parametric function. You can, for example, take this function to be of the form
\begin{align*} Q[s, a] = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 a \end{align*}Compare the performance of the two agents. Recall that for a parametric representation of a TD learning agent, the parameters are updated as
\begin{align*} \theta_i \leftarrow \theta_i + \eta \left[R[s] + \gamma \max_{a'} \hat{Q}_{\theta}[s', a'] - \hat{Q}_\theta[s, a]\right]\frac{\partial \hat{Q}_\theta[s, a]}{\partial \theta_i} \end{align*}
import numpy as np

n_episodes = 100                       # number of training episodes
for episode in np.arange(n_episodes):
    reached_exit = False               # should switch to True when the agent reaches the exit
    while not reached_exit:
        # add your simulation here
        break                          # placeholder so the template terminates
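A sketch of the linear parametric Q-function and its gradient TD update is given below; encoding the four moves as the integers 0 to 3 is an assumption made only so that the action can enter the linear model.

import numpy as np

ETA, GAMMA = 0.01, 0.9
ACTIONS = [0, 1, 2, 3]            # up, down, left, right (assumed encoding)
theta = np.zeros(4)

def features(x, y, a):
    return np.array([1.0, x, y, float(a)])

def q_hat(x, y, a):
    """Q_theta(s, a) = theta_0 + theta_1 x + theta_2 y + theta_3 a."""
    return theta @ features(x, y, a)

def td_update(x, y, a, r, x_next, y_next):
    best_next = max(q_hat(x_next, y_next, a2) for a2 in ACTIONS)
    delta = r + GAMMA * best_next - q_hat(x, y, a)
    theta[:] = theta + ETA * delta * features(x, y, a)   # dQ/dtheta_i = feature_i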
Encode the Q-table by means of a neural network that takes a position $(x, y)$ as input and returns, for each possible move, the probability that this move is optimal at that position.
import numpy as np

n_episodes = 100                       # number of training episodes
for episode in np.arange(n_episodes):
    reached_exit = False               # should switch to True when the agent reaches the exit
    while not reached_exit:
        # add your simulation here
        break                          # placeholder so the template terminates
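One possible sketch, using a small two-layer network written in plain NumPy: the network maps a position $(x, y)$ to a softmax distribution over the four moves, and a single cross-entropy gradient step pushes the distribution toward whichever move is currently believed to be optimal. The layer sizes, learning rate, and the way the target move is obtained are all assumptions.

import numpy as np

rng = np.random.default_rng(0)
N_HIDDEN, N_ACTIONS, LR = 16, 4, 0.01          # actions: up, down, left, right
W1 = rng.normal(0, 0.1, size=(2, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0, 0.1, size=(N_HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def forward(x, y):
    """Return (hidden activations, move probabilities) for position (x, y)."""
    h = np.tanh(np.array([x, y]) @ W1 + b1)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())            # numerically stable softmax
    return h, p / p.sum()

def train_step(x, y, best_action):
    """One cross-entropy gradient step toward the move believed to be optimal."""
    global W1, b1, W2, b2
    h, p = forward(x, y)
    target = np.zeros(N_ACTIONS)
    target[best_action] = 1.0
    dlogits = p - target                         # softmax + cross-entropy gradient
    dh = (W2 @ dlogits) * (1 - h ** 2)           # backprop through tanh
    W2 -= LR * np.outer(h, dlogits)
    b2 -= LR * dlogits
    W1 -= LR * np.outer(np.array([x, y]), dh)
    b1 -= LR * dh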