DiscreteDP¶

*Implementation Details*

Daisuke Oyama
Faculty of Economics, University of Tokyo

This notebook describes the implementation details of the DiscreteDP class.

For the theoretical background and notation, see the lecture Discrete Dynamic Programming.

Solution methods¶

The DiscreteDP class currently implements the following solution algorithms:

value iteration;
policy iteration (default);
modified policy iteration.

Policy iteration computes an exact optimal policy in finitely many iterations, while value iteration and modified policy iteration return an $\varepsilon$-optimal policy for a prespecified value of $\varepsilon$.

Value iteration relies on (only) the fact that the Bellman operator $T$ is a contraction mapping and thus iterative application of $T$ to any initial function $v^0$ converges to its unique fixed point $v^*$.

Policy iteration more closely exploits the particular structure of the problem, where each iteration consists of a policy evaluation step, which computes the value $v_{\sigma}$ of a policy $\sigma$ by solving the linear equation $v = T_{\sigma} v$, and a policy improvement step, which computes a $v_{\sigma}$-greedy policy.

Modified policy iteration replaces the policy evaluation step in policy iteration with "partial policy evaluation", which computes an approximation of the value of a policy $\sigma$ by iterating $T_{\sigma}$ for a specified number of times.

Below we describe our implementation of these algorithms more in detail.
(While not explicit, in the actual implementation each algorithm is terminated when the number of iterations reaches iter_max.)

Value iteration¶

DiscreteDP.value_iteration(v_init, epsilon, iter_max)

Choose any $v^0 \in \mathbb{R}^n$, and specify $\varepsilon > 0$; set $i = 0$.
Compute $v^{i+1} = T v^i$.
If $\lVert v^{i+1} - v^i\rVert < [(1 - \beta) / (2\beta)] \varepsilon$, then go to step 4; otherwise, set $i = i + 1$ and go to step 2.
Compute a $v^{i+1}$-greedy policy $\sigma$, and return $v^{i+1}$ and $\sigma$.

Given $\varepsilon > 0$, the value iteration algorithm terminates in a finite number of iterations, and returns an $\varepsilon/2$-approximation of the optimal value funciton and an $\varepsilon$-optimal policy function (unless iter_max is reached).

Policy iteration¶

DiscreteDP.policy_iteration(v_init, iter_max)

Choose any $v^0 \in \mathbb{R}^n$ and compute a $v^0$-greedy policy $\sigma^0$; set $i = 0$.
[Policy evaluation] Compute the value $v_{\sigma^i}$ by solving the equation $v = T_{\sigma^i} v$.
[Policy improvement] Compute a $v_{\sigma^i}$-greedy policy $\sigma^{i+1}$; let $\sigma^{i+1} = \sigma^i$ if possible.
If $\sigma^{i+1} = \sigma^i$, then return $v_{\sigma^i}$ and $\sigma^{i+1}$; otherwise, set $i = i + 1$ and go to step 2.

The policy iteration algorithm terminates in a finite number of iterations, and returns an optimal value function and an optimal policy function (unless iter_max is reached).

Modified policy iteration¶

DiscreteDP.modified_policy_iteration(v_init, epsilon, iter_max, k)

Choose any $v^0 \in \mathbb{R}^n$, and specify $\varepsilon > 0$ and $k \geq 0$; set $i = 0$.
[Policy improvement] Compute a $v^i$-greedy policy $\sigma^{i+1}$; let $\sigma^{i+1} = \sigma^i$ if possible (for $i \geq 1$).
Compute $u = T v^i$ ($= T_{\sigma^{i+1}} v^i$). If $\mathrm{span}(u - v^i) < [(1 - \beta) / \beta] \varepsilon$, then go to step 5; otherwise go to step 4.
[Partial policy evaluation] Compute $v^{i+1} = (T_{\sigma^{i+1}})^k u$ ($= (T_{\sigma^{i+1}})^{k+1} v^i$). Set $i = i + 1$ and go to step 2.
Return $v = u + [\beta / (1 - \beta)] [(\min(u - v^i) + \max(u - v^i)) / 2] \mathbf{1}$ and $\sigma_{i+1}$.

Given $\varepsilon > 0$, provided that $v^0$ is such that $T v^0 \geq v^0$, the modified policy iteration algorithm terminates in a finite number of iterations, and returns an $\varepsilon/2$-approximation of the optimal value funciton and an $\varepsilon$-optimal policy function (unless iter_max is reached).

Remarks

Here we employ the termination criterion based on the span semi-norm, where $\mathrm{span}(z) = \max(z) - \min(z)$ for $z \in \mathbb{R}^n$. Since $\mathrm{span}(T v - v) \leq 2\lVert T v - v\rVert$, this reaches $\varepsilon$-optimality faster than the norm-based criterion as employed in the value iteration above.
Except for the termination criterion, modified policy is equivalent to value iteration if $k = 0$ and to policy iteration in the limit as $k \to \infty$.
Thus, if one would like to have value iteration with the span-based rule, run modified policy iteration with $k = 0$.
In returning a value function, our implementation is slightly different from that by Puterman (2005), Section 6.6.3, pp.201-202, which uses $u + [\beta / (1 - \beta)] \min(u - v^i) \mathbf{1}$.
The condition for convergence, $T v^0 \geq v^0$, is satisfied for example when $v^0 = v_{\sigma}$ for some policy $\sigma$, or when $v^0(s) = \min_{(s', a)} r(s', a)$ for all $s$. If v_init is not specified, it is set to the latter, $\min_{(s', a)} r(s', a))$.

Illustration¶

We illustrate the algorithms above by the simple example from Puterman (2005), Section 3.1, pp.33-35.

In [1]:

import numpy as np
import pandas as pd
from quantecon.markov import DiscreteDP

In [2]:

n = 2  # Number of states
m = 2  # Number of actions

# Reward array
R = [[5, 10],
     [-1, -float('inf')]]

# Transition probability array
Q = [[(0.5, 0.5), (0, 1)],
     [(0, 1), (0.5, 0.5)]] # Probabilities in Q[1, 1] are arbitrary

# Discount rate
beta = 0.95

ddp = DiscreteDP(R, Q, beta)

Analytical solution:

In [3]:

def sigma_star(beta):
    sigma = np.empty(2, dtype=int)
    sigma[1] = 0
    if beta > 10/11:
        sigma[0] = 0
    else:
        sigma[0] = 1
    return sigma

def v_star(beta):
    v = np.empty(2)
    v[1] = -1 / (1 - beta)
    if beta > 10/11:
        v[0] = (5 - 5.5*beta) / ((1 - 0.5*beta) * (1 - beta))
    else:
        v[0] = (10 - 11*beta) / (1 - beta)
    return v

In [4]:

sigma_star(beta=beta)

Out[4]:

array([0, 0])

In [5]:

v_star(beta=beta)

Out[5]:

array([ -8.57142857, -20.        ])

Value iteration¶

Solve the problem by value iteration; see Example 6.3.1, p.164 in Puterman (2005).

In [6]:

epsilon = 1e-2
v_init = [0, 0]
res_vi = ddp.solve(method='value_iteration', v_init=v_init, epsilon=epsilon)

The number of iterations required to satisfy the termination criterion:

In [7]:

res_vi.num_iter

Out[7]:

The returned value function:

In [8]:

res_vi.v

Out[8]:

array([ -8.5665053 , -19.99507673])

It is indeed an $\varepsilon/2$-approximation of $v^*$:

In [9]:

np.abs(res_vi.v - v_star(beta=beta)).max() < epsilon/2

Out[9]:

True

The returned policy function:

In [10]:

res_vi.sigma

Out[10]:

array([0, 0])

Value iteration converges very slowly. Let us replicate Table 6.3.1 on p.165:

In [11]:

num_reps = 164
values = np.empty((num_reps, n))
diffs = np.empty(num_reps)
spans = np.empty(num_reps)
v = np.array([0, 0])

values[0] = v
diffs[0] = np.nan
spans[0] = np.nan

for i in range(1, num_reps):
    v_new = ddp.bellman_operator(v)
    values[i] = v_new
    diffs[i] = np.abs(v_new - v).max()
    spans[i] = (v_new - v).max() - (v_new - v).min()
    v = v_new

In [12]:

df = pd.DataFrame()
df[0], df[1], df[2], df[3] = values[:, 0], values[:, 1], diffs, spans
df.columns = '$v^i(0)$', '$v^i(1)$', \
             '$\\lVert v^i - v^{i-1}\\rVert$', '$\\mathrm{span}(v^i - v^{i-1})$'

iter_nums = pd.Series(list(range(num_reps)), name='$i$')
df.index = iter_nums

display_nums = \
    list(range(10)) + [10*i for i in range(1, 16)] + [160+i for i in range(4)]
df.iloc[display_nums, [0, 1, 2]]

Out[12]:

	$v^i(0)$	$v^i(1)$	$\lVert v^i - v^{i-1}\rVert$
$i$
0	0.000000	0.000000	NaN
1	10.000000	-1.000000	10.000000
2	9.275000	-1.950000	0.950000
3	8.479375	-2.852500	0.902500
4	7.672766	-3.709875	0.857375
5	6.882373	-4.524381	0.814506
6	6.120046	-5.298162	0.773781
7	5.390395	-6.033254	0.735092
8	4.694642	-6.731591	0.698337
9	4.032449	-7.395012	0.663420
10	3.402783	-8.025261	0.630249
20	-1.401710	-12.830282	0.377354
30	-4.278653	-15.707225	0.225936
40	-6.001185	-17.429757	0.135276
50	-7.032529	-18.461100	0.080995
60	-7.650033	-19.078604	0.048495
70	-8.019755	-19.448326	0.029035
80	-8.241121	-19.669693	0.017385
90	-8.373661	-19.802233	0.010409
100	-8.453018	-19.881589	0.006232
110	-8.500532	-19.929103	0.003731
120	-8.528980	-19.957551	0.002234
130	-8.546013	-19.974584	0.001338
140	-8.556211	-19.984783	0.000801
150	-8.562317	-19.990889	0.000480
160	-8.565973	-19.994545	0.000287
161	-8.566246	-19.994818	0.000273
162	-8.566505	-19.995077	0.000259
163	-8.566751	-19.995323	0.000246

On the other hand, the span decreases faster than the norm; the following replicates Table 6.6.1, page 205:

In [13]:

df.iloc[list(range(1, 13)) + [10*i for i in  range(2, 7)], [2, 3]]

Out[13]:

	$\lVert v^i - v^{i-1}\rVert$	$\mathrm{span}(v^i - v^{i-1})$
$i$
1	10.000000	1.100000e+01
2	0.950000	2.250000e-01
3	0.902500	1.068750e-01
4	0.857375	5.076562e-02
5	0.814506	2.411367e-02
6	0.773781	1.145399e-02
7	0.735092	5.440647e-03
8	0.698337	2.584307e-03
9	0.663420	1.227546e-03
10	0.630249	5.830844e-04
11	0.598737	2.769651e-04
12	0.568800	1.315584e-04
20	0.377354	3.409318e-07
30	0.225936	1.993445e-10
40	0.135276	1.154632e-13
50	0.080995	0.000000e+00
60	0.048495	1.776357e-15

The span-based termination criterion is satisfied when $i = 11$:

In [14]:

epsilon * (1-beta) / beta

Out[14]:

0.0005263157894736847

In [15]:

spans[11] < epsilon * (1-beta) / beta

Out[15]:

True

In fact, modified policy iteration with $k = 0$ terminates with $11$ iterations:

In [16]:

epsilon = 1e-2
v_init = [0, 0]
k = 0
res_mpi_1 = ddp.solve(method='modified_policy_iteration',
                      v_init=v_init, epsilon=epsilon, k=k)

In [17]:

res_mpi_1.num_iter

Out[17]:

In [18]:

res_mpi_1.v

Out[18]:

array([ -8.56904799, -19.99736883])

Policy iteration¶

If $\{\sigma^i\}$ is the sequence of policies obtained by policy iteration with an initial policy $\sigma^0$, one can show that $T^i v_{\sigma^0} \leq v_{\sigma^i}$ ($\leq v^*$), so that the number of iterations required for policy iteration is smaller than that for value iteration at least weakly, and indeed in many cases, the former is significantly smaller than the latter.

In [19]:

v_init = [0, 0]
res_pi = ddp.solve(method='policy_iteration', v_init=v_init)

In [20]:

res_pi.num_iter

Out[20]:

Policy iteration returns the exact optimal value function (up to rounding errors):

In [21]:

res_pi.v

Out[21]:

array([ -8.57142857, -20.        ])

In [22]:

np.abs(res_pi.v - v_star(beta=beta)).max()

Out[22]:

3.5527136788005009e-15

To look into the iterations:

In [23]:

v = np.array([0, 0])
sigma = np.array([-1, -1])  # Dummy
sigma_new = ddp.compute_greedy(v)
i = 0

while True:
    print('Iterate {0}'.format(i))
    print(' value:  {0}'.format(v))
    print(' policy: {0}'.format(sigma_new))
    if np.array_equal(sigma_new, sigma):
        break
    sigma[:] = sigma_new
    v = ddp.evaluate_policy(sigma)
    sigma_new = ddp.compute_greedy(v)
    i += 1

print('Terminated')

Iterate 0
 value:  [0 0]
 policy: [1 0]
Iterate 1
 value:  [ -9. -20.]
 policy: [0 0]
Iterate 2
 value:  [ -8.57142857 -20.        ]
 policy: [0 0]
Terminated

See Example 6.4.1, pp.176-177.

Modified policy iteration¶

The evaluation step in policy iteration which solves the linear equation $v = T_{\sigma} v$ to obtain the policy value $v_{\sigma}$ can be expensive for problems with a large number of states. Modified policy iteration is to reduce the cost of this step by using an approximation of $v_{\sigma}$ obtained by iteration of $T_{\sigma}$. The tradeoff is that this approach only computes an $\varepsilon$-optimal policy, and for small $\varepsilon$, takes a larger number of iterations than policy iteration (but much smaller than value iteration).

In [24]:

epsilon = 1e-2
v_init = [0, 0]
k = 6
res_mpi = ddp.solve(method='modified_policy_iteration',
                    v_init=v_init, epsilon=epsilon, k=k)

In [25]:

res_mpi.num_iter

Out[25]:

The returned value function:

In [26]:

res_mpi.v

Out[26]:

array([ -8.57137101, -19.99993638])

It is indeed an $\varepsilon/2$-approximation of $v^*$:

In [27]:

np.abs(res_mpi.v - v_star(beta=beta)).max() < epsilon/2

Out[27]:

True

To look into the iterations:

In [28]:

epsilon = 1e-2
v = np.array([0, 0])
k = 6
i = 0
print('Iterate {0}'.format(i))
print(' v: {0}'.format(v))

sigma = np.empty(n, dtype=int)  # Store the policy function

while True:
    i += 1
    u = ddp.bellman_operator(v, sigma=sigma)
    diff = u - v
    span = diff.max() - diff.min()
    print('Iterate {0}'.format(i))
    print(' sigma: {0}'.format(sigma))
    print(' T_sigma(v): {0}'.format(u))
    print(' span: {0}'.format(span))
    if span < epsilon * (1-ddp.beta) / ddp.beta:
        v = u + ((diff.max() + diff.min()) / 2) * \
            (ddp.beta / (1 - ddp.beta))
        break
    ddp.operator_iteration(ddp.T_sigma(sigma), v=u, max_iter=k)
    v = u
    print(' T_sigma^k+1(v): {0}'.format(v))

print('Terminated')
print(' sigma: {0}'.format(sigma))
print(' v: {0}'.format(v))

Iterate 0
 v: [0 0]
Iterate 1
 sigma: [1 0]
 T_sigma(v): [ 10.  -1.]
 span: 11.0
 T_sigma^k+1(v): [ 4.96674592 -6.03325408]
Iterate 2
 sigma: [0 0]
 T_sigma(v): [ 4.49340863 -6.73159137]
 span: 0.22499999999999964
 T_sigma^k+1(v): [  1.17973283 -10.24650042]
Iterate 3
 sigma: [0 0]
 T_sigma(v): [  0.69328539 -10.7341754 ]
 span: 0.0012275460282897832
 T_sigma^k+1(v): [ -1.7602088  -13.18876747]
Iterate 4
 sigma: [0 0]
 T_sigma(v): [ -2.10076373 -13.5293291 ]
 span: 6.6971966727891186e-06
Terminated
 sigma: [0 0]
 v: [ -8.57137101 -19.99993638]

Compare this with the implementation with the norm-based termination rule as described in Example 6.5.1, pp.187-188.

Reference¶

M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley-Interscience, 2005.

In [ ]: