#!/usr/bin/env python
# coding: utf-8

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)

from IPython.display import Image


# Chapter 9 On-policy Prediction with Approximation
# =========
#
# approximate value function: a parameterized function $\hat{v}(s, w) \approx v_\pi(s)$
#
# + applicable to partially observable problems.

# ### 9.1 Value-function Approximation
#
# $s \to u$: $s$ is the state updated and $u$ is the update target that $s$'s estimated value is shifted toward.
#
# We use machine learning methods and pass to them the $s \to u$ of each update as a training example. Then we interpret the approximate function they produce as an estimated value function.
#
# Not all function approximation methods are equally well suited for use in reinforcement learning. They must:
# + learn efficiently from incrementally acquired data: many traditional methods assume a static training set over which multiple passes are made.
# + be able to handle nonstationary target functions.

# ### 9.2 The Prediction Objective ($\overline{VE}$)
#
# Which states do we care most about? A state distribution $\mu(s) \geq 0$, $\sum_s \mu(s) = 1$.
# + Often $\mu(s)$ is chosen to be the fraction of time spent in $s$.
#
# objective function, the Mean Squared Value Error, denoted $\overline{VE}$:
#
# \begin{equation}
# \overline{VE}(w) \doteq \sum_{s \in \mathcal{S}} \mu(s) \left [ v_\pi (s) - \hat{v}(s, w) \right ]^2
# \end{equation}
#
# where $v_\pi(s)$ is the true value and $\hat{v}(s, w)$ is the approximate value.
#
# Note that the best $\overline{VE}$ is no guarantee of our ultimate purpose: to find a better policy. Depending on the method, we may obtain:
# + a global optimum,
# + a local optimum,
# + no convergence, or even divergence.

# ### 9.3 Stochastic-gradient and Semi-gradient Methods
#
# SGD: well suited to online reinforcement learning.
#
# \begin{align}
# w_{t+1} &\doteq w_t - \frac{1}{2} \alpha \nabla \left [ v_\pi(S_t) - \hat{v}(S_t, w_t) \right ]^2 \\
# &= w_t + \alpha \left [ \color{blue}{v_\pi(S_t)} - \hat{v}(S_t, w_t) \right ] \nabla \hat{v}(S_t, w_t) \\
# &\approx w_t + \alpha \left [ \color{blue}{U_t} - \hat{v}(S_t, w_t) \right ] \nabla \hat{v}(S_t, w_t)
# \end{align}
#
# In the update $S_t \to U_t$, the target $U_t$ is not the true value $v_\pi(S_t)$, but some, possibly random, approximation to it (the update targets built up by the methods of the earlier chapters):
# + If $U_t$ is an unbiased estimate (e.g. the Monte Carlo return $G_t$), $w_t$ is guaranteed to converge to a local optimum.
# + Otherwise, as with a bootstrapping target or a DP target => semi-gradient methods (which may not converge as robustly as gradient methods), but they:
#     - learn significantly faster.
#     - enable learning to be continual and online.
#
# state aggregation: states are grouped together, with one estimated value for each group (a code sketch follows the feature list below).

# ### 9.4 Linear Methods
#
# For every state $s$, there is a real-valued feature vector $x(s) \doteq (x_1(s), x_2(s), \dots, x_d(s))^T$:
#
# \begin{equation}
# \hat{v}(s, w) \doteq w^T x(s) \doteq \sum_{i=1}^d w_i x_i(s)
# \end{equation}

# ### 9.5 Feature Construction for Linear Methods
#
# Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learning systems.
#
# + Polynomials
# + Fourier Basis: low dimension, easy to select, global properties
# + Coarse Coding
# + Tile Coding: coarse coding with multiple overlapping tilings (cf. a convolution kernel?)
# + Radial Basis Functions
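
# A minimal sketch (not the book's code) tying Sections 9.3-9.5 together: semi-gradient TD(0)
# with state aggregation as the linear feature map $\hat{v}(s, w) = w^T x(s)$. The environment
# object `env`, its `reset()`/`step(state)` interface, and the 1000-states/10-groups split are
# hypothetical choices made here for illustration.

# In[2]:


import numpy as np

def aggregate_features(state, n_states=1000, n_groups=10):
    """State aggregation: one-hot feature vector x(s) with one component per group of states."""
    x = np.zeros(n_groups)
    x[state * n_groups // n_states] = 1.0
    return x

def semi_gradient_td0(env, n_episodes=100, alpha=0.1, gamma=1.0,
                      n_states=1000, n_groups=10):
    """Semi-gradient TD(0) for estimating v_pi with a linear approximator.
    `env` is assumed to expose reset() -> state and step(state) -> (next_state, reward, done),
    with the policy pi applied inside step() (on-policy prediction)."""
    w = np.zeros(n_groups)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(state)
            x = aggregate_features(state, n_states, n_groups)
            # Bootstrapped (semi-gradient) target U_t = R_{t+1} + gamma * v_hat(S_{t+1}, w);
            # the value of a terminal state is 0.
            v_next = 0.0 if done else w @ aggregate_features(next_state, n_states, n_groups)
            # For a linear v_hat, the gradient with respect to w is just the feature vector x(s).
            w += alpha * (reward + gamma * v_next - w @ x) * x
            state = next_state
    return w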
# ### 9.6 Selecting Step-Size Parameters Manually
#
# A good rule of thumb for setting the step-size parameter of linear SGD methods is $\alpha \doteq (\tau \mathbf{E}[x^T x])^{-1}$, where $x$ is a random feature vector drawn from the same distribution as the inputs and $\tau$ is the number of experiences (with substantially the same feature vector) within which you would like to learn.

# ### 9.7 Nonlinear Function Approximation: Artificial Neural Networks
#
# + ANNs, CNNs

# ### 9.8 Least-Squares TD
#
# $w_{TD} = A^{-1} b$: data efficient, but computationally expensive (see the sketch in the final cell).

# ### 9.9 Memory-based Function Approximation
#
# e.g. the nearest neighbor method

# ### 9.10 Kernel-based Function Approximation
#
# e.g. the RBF kernel

# ### 9.11 Looking Deeper at On-policy Learning: Interest and Emphasis
#
# We may be more interested in some states than others:
# + interest $I_t$: the degree to which we are interested in accurately valuing the state at time $t$.
# + emphasis $M_t$: a scalar that multiplies the learning update, emphasizing or de-emphasizing the learning done at time $t$:
#
# \begin{align}
# w_{t+n} & \doteq w_{t+n-1} + \alpha M_t \left [ G_{t:t+n} - \hat{v}(S_t, w_{t+n-1}) \right ] \nabla \hat{v}(S_t, w_{t+n-1}) \\
# M_t & = I_t + \gamma^n M_{t-n}, \qquad 0 \leq t < T
# \end{align}
#
# with $M_t = 0$ for all $t < 0$.

# In[ ]:
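

# A minimal LSTD sketch (an illustration under assumptions, not the book's code): it computes
# w_TD = A^{-1} b from a batch of on-policy transitions. The (x, reward, x_next) tuple format
# and the epsilon * I regularizer are choices made here for illustration; x_next is taken to be
# the zero vector on terminal transitions.

import numpy as np

def lstd(transitions, d, gamma=1.0, epsilon=1e-3):
    """Least-Squares TD: solve A w = b with
       A = sum_t x_t (x_t - gamma * x_{t+1})^T + epsilon * I,   b = sum_t R_{t+1} x_t."""
    A = epsilon * np.eye(d)   # epsilon * I keeps A invertible before enough data have arrived
    b = np.zeros(d)
    for x, reward, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += reward * x
    # Solving directly is O(d^3); an incremental Sherman-Morrison form would give O(d^2) per step.
    return np.linalg.solve(A, b)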