# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
#sns.set(font='SimHei', font_scale=2.5)
#plt.rcParams['axes.grid'] = False
import numpy as np
import pandas as pd
#pd.options.display.max_rows = 20
#import sklearn
#import itertools
#import logging
#logger = logging.getLogger()
#from IPython.display import SVG
from IPython.display import Image
MDPs (Markov Decision Processes): actions influence not just immediate rewards, but also subsequent situations (states) and, through them, future rewards.
Image('./res/fig3_1.png')
finite MDP: the sets of states, actions and rewards all have a finite number of elements.
\begin{align} & p(s', r \mid s, a) \doteq \operatorname{Pr} \{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \} \\ & \displaystyle \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1 \quad \text{, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$} \end{align}

The agent-environment boundary represents the limit of the agent's absolute control, not of its knowledge.
# Transition Graph
Image('./res/ex3_3.png')
$r(S_t, a) \; \pi(a \mid S_t)$
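A concrete instance of the dynamics function and its normalization, sketched on the assumption that the figure above is the recycling-robot transition graph of Example 3.3 (the numeric values of `alpha`, `beta`, `r_search`, `r_wait` are hypothetical placeholders):

alpha, beta = 0.8, 0.6       # hypothetical success probabilities
r_search, r_wait = 1.0, 0.2  # hypothetical expected rewards

# p[(s, a)] -> list of (s', r, p(s', r | s, a)) triples
p = {
    ('high', 'search'):   [('high', r_search, alpha), ('low', r_search, 1 - alpha)],
    ('high', 'wait'):     [('high', r_wait, 1.0)],
    ('low',  'search'):   [('low', r_search, beta), ('high', -3.0, 1 - beta)],  # -3: rescue penalty
    ('low',  'wait'):     [('low', r_wait, 1.0)],
    ('low',  'recharge'): [('high', 0.0, 1.0)],
}

# each (s, a) row of the table must sum to one
assert all(abs(sum(prob for _, _, prob in outcomes) - 1) < 1e-12
           for outcomes in p.values())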
goal: to maximize the total amount of reward the agent receives.
In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do.
maze example: 0 for escaping from the maze and -1 at all other times (this encourages the agent to escape as quickly as possible)
exercise ($\gamma = 0.9$; reward sequence $R_1 = 2$ followed by an infinite run of 7s):

$G_1 = \frac{7}{1 - \gamma} = 70$

$G_0 = R_1 + \gamma G_1 = 2 + 0.9 \times 70 = 65$
$G_t \doteq \sum_{k=t+1}^T \gamma^{k-t-1} R_k$, including the possibility that $T = \infty$ or $\gamma = 1$.
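A quick numeric check of this definition, computed backwards via the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ (a long tail of 7s stands in for $T = \infty$, since $\gamma^k$ vanishes):

def discounted_return(rewards, gamma):
    """Compute G_0 for a reward sequence R_1, R_2, ... via G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [2] + [7] * 1000  # R_1 = 2, then an (effectively infinite) run of 7s
print(discounted_return(rewards, gamma=0.9))      # G_0 ≈ 65
print(discounted_return(rewards[1:], gamma=0.9))  # G_1 ≈ 70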
Bellman equation for $v_\pi$:
\begin{equation} v_\pi(s) \doteq \displaystyle \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left [ r + \gamma v_\pi(s') \right ] \quad \text{, for all $s \in \mathcal{S}$} \end{equation}

The value functions $v_\pi$ and $q_\pi$ can be estimated from experience, e.g. by Monte Carlo methods (averaging many sample returns observed from each state).
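A minimal Monte Carlo sketch of that estimate, on a hypothetical 5-state random walk (start in the middle, move left or right with equal probability, reward +1 only on reaching the right end): $v_\pi$ of the start state is just the average of sampled returns.

import random

def sample_return(start=2, n_states=5, gamma=1.0):
    """Run the equiprobable random walk from `start` until an end state; return G."""
    s, g, discount = start, 0.0, 1.0
    while 0 < s < n_states - 1:
        s += random.choice([-1, 1])            # random policy
        r = 1.0 if s == n_states - 1 else 0.0  # +1 on the right exit only
        g += discount * r
        discount *= gamma
    return g

returns = [sample_return() for _ in range(10000)]
print(sum(returns) / len(returns))  # ≈ 0.5 by symmetry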
# Example 3.5
from scipy.signal import convolve2d

# iterative evaluation of the equiprobable random policy on the 5x5 gridworld
reward_matrix = np.zeros((5, 5))
# kernel: sums the values of the four neighbouring cells
kernel = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])
iteration_nums = 100
for _ in range(iteration_nums):
    # off-grid neighbours contribute the fill value -1 (the penalty for leaving the grid)
    reward = convolve2d(reward_matrix, kernel, mode='same', boundary='fill', fillvalue=-1)
    reward /= 4.0  # average over the four equiprobable actions
    # A -> A': every action from A yields +10 and jumps to A'
    reward[0, 1] = 10 + reward[-1, 1]
    # B -> B': every action from B yields +5 and jumps to B'
    reward[0, -2] = 5 + reward[2, -2]
    reward_matrix = reward

pd.DataFrame(reward_matrix)
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 2.253150 | 9.592586 | 4.524364 | 6.244556 | 1.271214 |
| 1 | 1.420013 | 3.970976 | 3.260312 | 2.897431 | 0.840299 |
| 2 | 0.455929 | 1.610993 | 1.648482 | 1.244556 | 0.192553 |
| 3 | -0.207290 | 0.368589 | 0.478065 | 0.239763 | -0.314645 |
| 4 | -0.653676 | -0.407414 | -0.344568 | -0.448925 | -0.690892 |
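These values differ from the book's Figure 3.2 because the cell above skips discounting and approximates walls by the constant fill value. A corrected sketch (assuming $\gamma = 0.9$ and the book's rule that an off-grid move leaves the agent in place with reward -1; reuses `kernel` and the imports from above) should converge to the published values:

gamma = 0.9
v = np.zeros((5, 5))
# number of actions that leave the grid from each cell (1 on edges, 2 in corners)
off_grid = 4 - convolve2d(np.ones((5, 5)), kernel, mode='same')
for _ in range(1000):
    # in-grid moves: reward 0, land on the neighbouring cell
    neighbor_sum = convolve2d(v, kernel, mode='same', boundary='fill', fillvalue=0)
    # off-grid moves: reward -1, agent stays in place
    v_new = (gamma * neighbor_sum + off_grid * (-1 + gamma * v)) / 4.0
    v_new[0, 1] = 10 + gamma * v[-1, 1]  # A -> A'
    v_new[0, 3] = 5 + gamma * v[2, 3]    # B -> B'
    v = v_new
pd.DataFrame(np.round(v, 1))  # expect v(A) ≈ 8.8 and v(B) ≈ 5.3, as in Figure 3.2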
optimal state-value function:
\begin{align} v_\ast(s) & \doteq \displaystyle \max_\pi v_\pi(s) \quad \text{ for all } s \in \mathcal{S} \\ & = \max_a \sum_{s', r} p(s', r \mid s, a) \left [ r + \gamma v_\ast(s') \right ] \end{align}

Any policy that is greedy with respect to the optimal evaluation function $v_\ast$ is an optimal policy.
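Swapping the average in the evaluation sweep for a max over actions gives a value-iteration sketch; on the same gridworld (again assuming $\gamma = 0.9$) it should recover the optimal values of Figure 3.5, e.g. $v_\ast(A) \approx 24.4$:

gamma = 0.9
v = np.zeros((5, 5))
for _ in range(1000):
    qs = []
    for shift, axis, edge in [(1, 0, (0, slice(None))),     # up: invalid on the top row
                              (-1, 0, (-1, slice(None))),   # down: invalid on the bottom row
                              (1, 1, (slice(None), 0)),     # left: invalid on the left column
                              (-1, 1, (slice(None), -1))]:  # right: invalid on the right column
        q = gamma * np.roll(v, shift, axis=axis)  # in-grid move: reward 0, land on neighbour
        q[edge] = -1 + gamma * v[edge]            # off-grid move: reward -1, stay in place
        qs.append(q)
    v_new = np.max(qs, axis=0)  # greedy: take the best of the four actions
    v_new[0, 1] = 10 + gamma * v[-1, 1]  # A -> A'
    v_new[0, 3] = 5 + gamma * v[2, 3]    # B -> B'
    v = v_new
pd.DataFrame(np.round(v, 1))  # expect v*(A) ≈ 24.4, as in Figure 3.5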
optimal action-value function:
\begin{align} q_\ast(s, a) & \doteq \max_\pi q_\pi(s, a) \quad \text{ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$} \\ & = \mathbb{E} [ R_{t+1} + \gamma v_\ast(S_{t+1}) \mid S_t = s, A_t = a ] \\ & = \sum_{s', r} p(s', r \mid s, a) \left [ r + \gamma \max_{a'} q_\ast (s', a') \right ] \end{align}

Explicitly solving the Bellman optimality equation relies on at least three assumptions that are rarely true in practice:
1. we accurately know the dynamics of the environment;
2. we have enough computational resources to complete the calculation;
3. the states have the Markov property.

extreme computational cost, memory => approximations
The online nature of reinforcement learning makes it possible to put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states.