by Fedor Iskhakov, ANU
Description: Policy iterations solution method for infinite horizon dynamic models. Solving the stochastic inventory management problem with policy iterations.
Also known as the Howard policy improvement algorithm.
Area of application: infinite horizon dynamic decision problems with discrete (or discretized) state and choice spaces.
Break the search for the fixed point (of the Bellman or Coleman-Reffett operators) into two steps: an iterative (back and forth) approach instead of jointly finding the value and policy functions in one fixed point problem. Recall that the optimal choices are revealed along the way while solving the Bellman equation, as the decisions which solve the maximization problem on the right hand side!
1. Policy evaluation: for a fixed policy function (decisions) solve for the function $ V(\text{state}) $:
$$ V(\text{state}) = U(\text{state},\text{decision}) + \beta \mathbb{E}\big\{ V(\text{next state}) \big| \text{state},\text{decision} \big\} $$
2. Policy improvement: for the fixed value function $ V_0(\text{state}) $ find the optimal policy while computing
$$ V(\text{state}) = \max_{\text{decisions}} \big[ U(\text{state},\text{decision}) + \beta \mathbb{E}\big\{ V_0(\text{next state}) \big| \text{state},\text{decision} \big\} \big] $$
Rate of convergence! Policy iterations are equivalent to Newton's method applied to the Bellman equation and converge quadratically, whereas value function iterations converge only linearly at rate $ \beta $.
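On a finite (discretized) state space the first step is what makes the method attractive: with the policy fixed, the max operator disappears and the equation is linear in the unknown value function. Stacking the states into vectors, with $ \bar{C} $ the expected current rewards under the fixed policy and $ \bar{M} $ the transition probability matrix induced by it,
$$ V = \bar{C} + \beta \bar{M} V \quad\Longrightarrow\quad V = (I - \beta \bar{M})^{-1} \bar{C}, $$
so policy evaluation amounts to a single linear solve rather than a fixed point iteration. This is exactly how the policy_evaluation() method is implemented below.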
Question: can linear interpolation of the value function on a fixed grid be represented as a matrix multiplication? What about Gaussian quadrature? Or the 3-dimensional array algebra used in the stochastic consumption-savings problem with discretized choices?
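For linear interpolation the answer is yes: the interpolation weights depend only on the two grids, not on the function values, so the operation is linear. A minimal sketch (the grid and evaluation points are illustrative):
import numpy as np
# linear interpolation from a fixed grid to fixed points as matrix multiplication
grid = np.linspace(0,1,6) # fixed grid
xp = np.array([0.05,0.33,0.5,0.91]) # fixed evaluation points
j = np.clip(np.searchsorted(grid,xp,side='right')-1,0,grid.size-2) # left bracketing index
w = (xp - grid[j])/(grid[j+1] - grid[j]) # weight on the right bracketing point
W = np.zeros((xp.size,grid.size)) # interpolation weights matrix
W[range(xp.size),j] = 1-w
W[range(xp.size),j+1] = w
f = grid**2 # any function values on the grid
print(np.allclose(W@f, np.interp(xp,grid,f))) # True: W@f is linear interpolation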
Recall the inventory management problem (video 37 and exercise 11)
Parameters in the model: fixed cost of placing an order $ c $, profit per unit of good sold $ p $, storage cost per unit of good $ r $, discount factor $ \beta $, and the distribution of demand $ d $ (geometric with parameter dp in the stochastic version).
With $ s = \min(x,d) $ denoting sales and $ x' = \max(x-d,0) + q $ the next period stock, the Bellman equation in the value function $ V(x,d) $ is
$$ V(x,d) = \max_{q \ge 0} \Big\{ sp - x' r - c \mathbb{1}\{q>0\} + \beta \mathbb{E}\big[ V(x',d') \big] \Big\} $$
Denote $ EV(x) = \mathbb{E}\big[ V(x,d) \big] $. Taking the expectation with respect to $ d $ on both sides, we get
$$ EV(x) = \mathbb{E}\Big[ \max_{q \ge 0} \Big\{ sp - x' r - c \mathbb{1}\{q>0\} + \beta EV(x') \Big\} \Big] $$
$$ EV(x) = \sum_{d} \Big[ \max_{q \ge 0} \Big\{ sp - x' r - c \mathbb{1}\{q>0\} + \beta EV(x') \Big\} \Big] pr(d) $$
Sequence of events in each time period (timing assumptions in the model):
1. The period starts with inventory $ x $ on stock.
2. Demand $ d $ is realized, and sales $ s = \min(x,d) $ take place at profit $ p $ per unit.
3. The new order $ q \ge 0 $ is placed based on the remaining stock $ \max(x-d,0) $, incurring the fixed cost $ c $ if $ q > 0 $.
4. The inventory to be carried over, $ x' = \max(x-d,0) + q $, is stored at cost $ r $ per unit.
Note that the ordering decision is made based on the stock remaining after trading is over, $ \max\{x-d,0\} $, rather than on $ x $ and $ d $ independently!
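To see this, substitute $ x' = y + q $ with $ y = \max(x-d,0) $ into the maximand: the sales revenue $ sp = p\min(x,d) $ does not depend on $ q $ and can be taken out of the max,
$$ EV(x) = \sum_{d} pr(d) \Big[ p\min(x,d) + \max_{q \ge 0} \big\{ -r(y+q) - c \mathbb{1}\{q>0\} + \beta EV(y+q) \big\} \Big], $$
so the inner maximization problem depends on $ (x,d) $ only through the post trade stock $ y $. This is the form implemented in the optimal_policy() method below (which also drops the constant $ -ry $, since it does not affect the argmax).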
The policy iterations solver is therefore applicable! Note also that re-indexing $ EV $ by $ y+q(y) $ is itself a matrix multiplication:
# example of re-indexing as matrix multiplication
import numpy as np
ev = np.array([11,21,32,43,54,65,76]) # vector EV
j = [1,0,2,2,4,3,5] # represent y+q(y)
n = ev.size
M = np.zeros((n,n),dtype=int) # start with all zeros
M[range(n),j] = 1 # mark position j[row] in each row with a one
ev1 = M@ev
print('EV=',ev,'y+q(y)=',j,'re-indexing matrix =',M,'EV(y+q(y))',ev1,sep='\n')
EV=
[11 21 32 43 54 65 76]
y+q(y)=
[1, 0, 2, 2, 4, 3, 5]
re-indexing matrix =
[[0 1 0 0 0 0 0]
 [1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0]
 [0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0]]
EV(y+q(y))
[21 11 32 32 54 43 65]
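Each such re-indexing matrix $ M $ has a single unit entry in every row. Averaging these matrices over the distribution of demand produces the transition probability matrix $ \bar{M} $ that the policy evaluation step of the solver below relies on.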
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [12, 8]
# COPY of the code from exercise 11, only infinite horizon version, combined into single class
class inventory_model:
'''Small class to hold model fundamentals and its solution'''
def __init__(self,label='noname',
max_inventory=10, # upper bound on the state space
c = 3.2, # fixed cost of order
p = 2.5, # profit per unit of good
r = 0.5, # storage cost per unit of good
β = 0.95, # discount factor
dp = 0.5, # parameter in geometric distribution of demand
demand = 4 # fixed demand
):
'''Create model with default parameters'''
self.label=label # label for the model instance
self.c, self.p, self.r, self.β, self.dp= c, p, r, β, dp
self.demand = demand
        # create dependent attributes (it would be better to have them updated when underlying parameters change)
self.n = max_inventory+1 # number of inventory levels
self.upper = int(max_inventory) # upper boundary on inventory
self.x = np.arange(self.n,dtype=int) # all possible values of inventory and demand (state space)
def __repr__(self):
'''String representation of the model'''
        return 'Inventory model labeled "{}"\nParameters (c,p,r,β) = ({},{},{},{})\nDemand={}\nUpper bound on inventory {}' \
.format (self.label,self.c,self.p,self.r,self.β,self.demand,self.upper)
def sales(self,x,d):
'''Sales in given period'''
return np.minimum(x,d)
def next_x(self,x,d,q):
'''Inventory to be stored, becomes next period state'''
return x - self.sales(x,d) + q
def profit(self,x,d,q):
'''Profit in given period'''
return self.p * self.sales(x,d) - self.r * self.next_x(x,d,q) - self.c * (q>0)
def demand_pr(self,plot=False):
'''Computes stochastic demand probs'''
k = np.arange(self.n) # all possible values of demand
pr = (1-self.dp)**k *self.dp
pr[-1] = 1 - pr[:-1].sum() # update last prob to ensure sum=1
if plot:
plt.step(self.x,pr,where='mid')
plt.title('Distribution of demand')
plt.show()
return pr
def bellman(self,ev0):
'''Bellman operator for inventory model
Inputs: model object
next period EXPECTED value function
'''
pr = self.demand_pr()
ev1 = np.zeros(shape=self.x.shape)
for j,d in enumerate(self.x): # over all values of demand
# create the grid of choices (same as x), column-vector
q = self.x[:,np.newaxis]
# compute current period profit (relying on numpy broadcasting to get the matrix with choices in rows)
p = self.profit(self.x,d,q)
# indexes for next period value with extrapolation using last value
i = np.minimum(self.next_x(self.x,d,q),self.upper)
# compute the Bellman maximand
vm = p + self.β*ev0[i]
# find max and argmax
v1 = np.amax(vm,axis=0) # maximum in every column
ev1 = ev1 + pr[j]*v1
q1 = self.optimal_policy(ev1)
return ev1, q1
def optimal_policy(self,ev):
'''Computes the optimal policy function as function of post trade stock for the stochastic
inventory dynamics model for given EV function'''
        # idea: 2-dim array with q in axis 0, y = max(x-d,0) in axis 1
q = self.x[:,np.newaxis] # choices
y = self.x[np.newaxis,:] # post trading stock
# indexes for next period value with extrapolation using last value
i = np.minimum(y+q,self.upper)
# compute the Bellman maximand
vm = -self.r*q -self.c*(q>0) + self.β*ev[i]
        # find the argmax
        return np.argmax(vm,axis=0) # index of the maximum in every column
def solve_vfi(self,tol=1e-6,maxiter=500,callback=None):
'''Solves the model using value function iterations
'''
ev0 = np.zeros(self.n) # initial point for VFI
for i in range(maxiter): # main loop
ev1, q1 = self.bellman(ev0) # update approximation
err = np.amax(np.abs(ev0-ev1))
            if callback is not None: callback(iter=i,err=err,ev1=ev1,ev0=ev0,q1=q1,model=self)
if err<tol:
break # break out if converged
            ev0 = ev1 # get ready for the next iteration
else:
raise RuntimeError('Failed to converge in %d iterations'%maxiter)
return ev1, q1
def policy_evaluation(self,q):
'''The expected value of the given policy
Returns a vector ev for all points in the state space
'''
pr = self.demand_pr()
Cbar = np.zeros(self.n)
Mbar = np.zeros((self.n,self.n))
for j,d in enumerate(self.x): # over all values of demand
y = self.x - self.sales(self.x,d) # post trade stock for all values of x | d
p = self.profit(self.x,d,q[y]) # profits for all values of x | d, q()
Cbar += pr[j]*p
i = np.minimum(y+q[y],self.upper) # next period stock for all x | d, q()
M = np.zeros((self.n,self.n),dtype=int) # start with all zeros
M[range(self.n),i] = 1 # transformation matrix
Mbar += pr[j]*M
A = np.eye(self.n) - self.β*Mbar
ev = np.linalg.solve(A,Cbar)
return ev
def solve_policyiter(self,tol=1e-6,maxiter=500,callback=None):
'''Solves the model using policy iterations
'''
q0 = np.zeros(self.n,dtype=int) # initial point for policy iterations
for i in range(maxiter): # main loop
# step 1: policy evaluation
ev = self.policy_evaluation(q0)
# step 2: policy improvement
q1 = self.optimal_policy(ev)
err = np.amax(np.abs(q0-q1))
            if callback is not None: callback(iter=i,err=err,ev1=ev,q1=q1,model=self)
if err<tol:
break # break out if converged
            q0 = q1 # get ready for the next iteration
else:
raise RuntimeError('Failed to converge in %d iterations'%maxiter)
ev1 = self.policy_evaluation(q1)
return ev1,q1
def solve_show(self,method='vfi',maxiter=1000,tol=1e-6,**kvargs):
'''Illustrate solution'''
fig1, (ax1,ax2) = plt.subplots(1,2,figsize=(14,8))
        ax1.grid(True, which='both', color='0.65', linestyle='-')
        ax2.grid(True, which='both', color='0.65', linestyle='-')
ax1.set_xlabel('Inventory')
ax2.set_xlabel('Post trade inventory')
ax1.set_title('Value function')
ax2.set_title('Optimal new orders')
def callback(**argvars):
mod, ev, q = argvars['model'],argvars['ev1'],argvars['q1']
ax1.step(mod.x,ev,color='k',alpha=0.25,where='mid')
ax2.step(mod.x,q,color='k',alpha=0.25,where='mid')
print(argvars['iter'],end=' ')
if method=='vfi':
ev,pk = self.solve_vfi(maxiter=maxiter,tol=tol,callback=callback,**kvargs)
elif method=='policyiter':
ev,pk = self.solve_policyiter(maxiter=maxiter,tol=tol,callback=callback,**kvargs)
else:
raise RuntimeError('Unknown method in solve_show()')
# add solutions
ax1.step(self.x,ev,color='r',linewidth=2.5,where='mid')
ax2.step(self.x,pk,color='r',linewidth=2.5,where='mid')
plt.show()
return ev,pk
mod = inventory_model(max_inventory=25)
mod.dp=.25
mod.c = .25
mod.p = 3.5
mod.r = 0.4
mod.β = 0.9
mod.demand_pr(plot=True)
ev1,q1 = mod.solve_vfi()
mod.solve_show()
ev2,q2 = mod.solve_policyiter()
mod.solve_show(method='policyiter');
print('Diff in value functions = %1.3e\nDiff in policy functions = %1.5f'%(np.amax(ev1-ev2),np.amax(q1-q2)))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149
0 1 2 3
Diff in value functions = -8.159e-06
Diff in policy functions = 0.00000
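Policy iterations converged in 4 steps here against 150 for VFI. Each policy iteration step is more expensive, however (it requires a linear solve), so fewer iterations need not translate one-for-one into less time. A quick illustrative wall-clock check, reusing the mod instance from above:
import time
# illustrative timing comparison of the two solvers on the same model instance
for name, solver in [('VFI',mod.solve_vfi),('policy iterations',mod.solve_policyiter)]:
    t0 = time.perf_counter()
    solver()
    print('%s solved in %1.4f seconds' % (name,time.perf_counter()-t0))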
def optimal_policy(m,ev):
    '''Computes the optimal policy function as function of post trade stock for the stochastic
    inventory dynamics model for given EV function'''
    # idea: 2-dim array with q in axis 0, y = max(x-d,0) in axis 1
    q = m.x[:,np.newaxis] # choices
    y = m.x[np.newaxis,:] # post trading stock
    # indexes for next period value with extrapolation using last value
    i = np.minimum(y+q,m.upper)
    # compute the Bellman maximand
    vm = -m.r*q -m.c*(q>0) + m.β*ev[i]
    # find the argmax
    return np.argmax(vm,axis=0) # index of the maximum in every column
ev,_ = mod.solve_vfi()
q = optimal_policy(mod,ev)
print('Optimal orders of new inventory for y = max(x-d,0):')
print(q)
Optimal orders of new inventory for y = max(x-d,0):
[7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
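The optimal policy has an order-up-to structure: for post trade stock $ y \le 5 $ the order exactly replenishes inventory to $ y + q(y) = 7 $ units, while for $ y \ge 6 $ no order is placed and the fixed cost $ c $ is avoided.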
# COPY from exercise 11
def optimal_policy_old(m,ev):
'''Computes the optimal policy function for the stochastic
inventory dynamics model for given EV function'''
# idea: 3-dim array with q in axes 0, d in axis 1 and x in axis 2
q = m.x[:,np.newaxis,np.newaxis] # choices
d = m.x[np.newaxis,:,np.newaxis] # demand
x = m.x[np.newaxis,np.newaxis,:] # inventories
# compute current period profit (relying on numpy broadcasting to get the matrix with choices in rows)
p = m.profit(x,d,q) # 3-dim array
# indexes for next period value with extrapolation using last value
i = np.minimum(m.next_x(x,d,q),m.upper)
# compute the Bellman maximand
vm = p + m.β*ev[i]
    # find the argmax
    return np.argmax(vm,axis=0) # index of the maximum along the choice axis
q = optimal_policy_old(mod,ev)
print('Optimal orders of new inventory for d,x:\n(d in rows, x in columns)')
print(q)
# Note the symmetry in the optimal policy!
# This implies that knowing both x and d is not necessary for the
# optimal new order, it's enough to condition on the inventory
# remaining after sales, i.e. x-min(x,d) = max(0,x-d)
Optimal orders of new inventory for d,x:
(d in rows, x in columns)
[[7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2 0]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3 2]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 3]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 5]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6]
 [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7]]
In problems where the first step (policy evaluation) is hard to perform efficiently, it can be replaced by a finite number of applications of the Bellman operator under the current (fixed) policy, yielding an approximate value of that policy. This variant is known as modified (or optimistic) policy iteration.
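A minimal sketch of this variant, reusing the construction from the policy_evaluation() method above; the function name policy_evaluation_approx and the number of steps m are illustrative choices, not part of the lecture code:
def policy_evaluation_approx(model,q,m=25):
    '''Approximate policy evaluation: apply the fixed-policy Bellman operator
    ev -> Cbar + β*Mbar@ev m times instead of solving the linear system exactly'''
    pr = model.demand_pr()
    Cbar = np.zeros(model.n)
    Mbar = np.zeros((model.n,model.n))
    for j,d in enumerate(model.x): # same construction as in policy_evaluation()
        y = model.x - model.sales(model.x,d) # post trade stock
        Cbar += pr[j]*model.profit(model.x,d,q[y]) # expected current profit
        i = np.minimum(y+q[y],model.upper) # next period stock
        M = np.zeros((model.n,model.n))
        M[range(model.n),i] = 1
        Mbar += pr[j]*M
    ev = np.zeros(model.n) # start from zero value
    for _ in range(m): # m applications of the fixed-policy Bellman operator
        ev = Cbar + model.β * Mbar @ ev
    return ev

# approximation error relative to the exact linear solve vanishes at rate β^m
print(np.amax(np.abs(policy_evaluation_approx(mod,q2,m=100) - mod.policy_evaluation(q2))))
In genuinely large problems even $ \bar{M} $ would not be built explicitly; the application of the operator would instead be approximated by forward simulation of the model under the current policy, as the text above suggests.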