NYU CSCI-UA 9472 Artificial Intelligence

Markov Decision Processes and RL

This week we will continue to work on reinforcement learning. In particular, in this session, we will cover Markov Decision Processes as well as

Exercise 1. Simple maze and the Bellman update

We consider the simple world given below. The objective is to reach the terminal state which has an associated reward of +1000 and to avoid the trap which has an associated negative reward of -1000. All the other states have an associated negative reward of -.04

Implement each of the following approaches:

  • Value iteration. In this case, we simply update the utility by means of the Bellman equation \begin{align*} U'[s]\leftarrow R(s) + \gamma\max_{a\in A(s)}\sum_{s'}P(s'|s,a)U[s'] \end{align*}
  • Policy iteration. In this case, we alternate between a policy evaluation step (in which we compute the utility $U_i = U^{\pi_i}$, which can be done through simple linear algebra solvers as the equations become linear) and a policy improvement step which computes a new MEU policy $\pi_{i+1}$ using a one step look ahead based on $U_i$, i.e.
\begin{align*} \pi[s]\leftarrow \underset{a\in A(s)}{\operatorname{argmax}} \sum_{s'}P(s'|s,a)U[s'] \end{align*}

Each time your agent reaches the exit, it goes back to the start cell.

Repeat the simulation for a couple of start-to-finish simulations and plot the evolution of the total time to completion as a function of time.

For each framework, once you have the utlity, display the simulation of a trained agent that leverage the MEU policy

\begin{align*} \pi^*(s) = \underset{a\in A(s)}{\operatorname{argmax}} \sum_{s'}P(s'|s,a)U(s') \end{align*}
In [ ]:
# put your code here

Exercise 2. DUE and ADP

We still consider the 10x14 world given above. we consider a fixed policy that consists in moving up whenever possible, then moving right when the agent has reached the upper limit of the world. Finally, when we reach an obstacle, teh policy consists in moving along that obstacle (i.e. not away from the obstacle). In short if the 'above' and 'right' cells are not available and the agent has an obstacle on its right, it will move in the only direction that is not away from the obstacle first as show by the cell highlighted in green below.

Despite this policy, we also assume an uncertain environment such as show by the diagram next to the map where the "intended" outcomes only occur with probability 0.8 while with probability .2 the agent moves at right angles with respect to the intended location.

The rewards are as before (-.04 for every states except the goal state and the trap). Given this policy, implement each of the following approaches:

  • A Direct Utility estimation (DUE) agent that updates the utity of each state once after each start-to-finish cycle has been completed.
  • An Adaptive Dynamic Programming (ADP) agent for which, at each new step from state $s$ to state $s'$, we want to increment the transition model from $s$ to any state $t$ that can be reached from $s$ as \begin{align*} P(t|s,a)\leftarrow N_{s'|s,a}[t, s,a]/N_{sa}[s,a] \end{align*} where $N[s,a]$ indicate the number of times action a was selected in state $s$ and $N_{s'|sa}[t,s,a]$ indicate the number of times state $t$ was reached from state $s$ after applying action $a$ when in this last state. Then evaluate the utility.
In [ ]:
# put your code here

Exercise 3.

Still in the 10x14 world above, for the policy of Exercise 2, implement a TD agent. Recall that in this case, we update the utility as \begin{align*} U[s]\leftarrow U[s] + \eta(N_s[s])(r+\gamma U[s'] - U[s]) \end{align*}

In [ ]:
# put your code here

Exercise 4.

We now consider a different policy. The agent should move at random except if it has an estimate of the utility of adjacent cells. If so with probability .1 it should move at random and with probability .9 it should move to the cell with the highest utility.

In [ ]:
# put your code here

Exercise 5.

We now don't assume any policy anymore. Implement an $\varepsilon$-greedy active RL agent that moves a random (on one of the adjacent cells) with probability $\varepsilon$ (take a small $\epsilon$ such as $.1$ to start) and behaves in a greedy way otherwise, taking the action that will maximize its utility.

In [ ]:
# put your code here