ReinforcementLearning-Principle-Day12

Preface

Got two certificates from the RL courses at Alberta, and I feel I understand more concepts in RL now. Keep going! The third part, as I see it, is related to gradient descent & function approximation.

  • Introduction
  • Value-function Approximation
  • The Prediction Objective (VE)

Introduction

The control problem is the task of improving a policy. So if we only need to evaluate states, it's not a control problem.

Can we represent the value function with a table? => No; gradient descent / average reward.

The novelty in this chapter is that the approximate value function is represented not as a table but as a parameterized functional form with weight vector $\mathbf{w} \in \mathbb{R}^d$.

In fact, all the theoretical results for methods using function approximation presented in this part of the book apply equally well to cases of partial observability.

9.1 Value-function Approximation

Instead of the table update, we now permit arbitrarily complex and sophisticated methods to implement the update.

**In supervised learning we have a true label. But with TD methods we don't have the true label; we only have the next state's estimated value, which itself keeps changing.**

9.2 The Prediction Objective (VE)

By assumption we have far more states than weights, so making one state's estimate more accurate invariably means making others' less accurate. We are obligated then to say which states we care most about. We must specify a state distribution $\mu(s) \ge 0, \sum_s \mu(s) = 1$, representing how much we care about the error in each state $s$. The general loss function, the mean squared value error, is then $\overline{VE}(\mathbf{w}) = \sum_s \mu(s)\,[v_\pi(s) - \hat{v}(s,\mathbf{w})]^2$.

Often $\mu(s)$ is chosen to be the fraction of time spent in $s$. Under on-policy training this is called the on-policy distribution.
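
As a minimal sketch of this objective (NumPy, with hypothetical toy numbers), the weighted squared error can be computed directly:

```python
import numpy as np

def value_error(mu, v_pi, v_hat):
    """Mean squared value error: VE(w) = sum_s mu(s) * (v_pi(s) - v_hat(s, w))^2.

    mu    : state weighting, non-negative and summing to 1
    v_pi  : true values under the target policy
    v_hat : approximate values, one entry per state
    """
    mu, v_pi, v_hat = map(np.asarray, (mu, v_pi, v_hat))
    return float(np.sum(mu * (v_pi - v_hat) ** 2))

# Hypothetical example: three states, most of the time spent in state 0,
# so its error counts the most.
print(value_error(mu=[0.7, 0.2, 0.1], v_pi=[1.0, 2.0, 3.0], v_hat=[0.9, 2.5, 3.0]))
```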

9.3 Stochastic-gradient and Semi-gradient Methods

Sometimes the target may not be a perfect target. For example, $U_t$ might be a noise-corrupted version of $v_\pi(S_t)$, or it might be one of the bootstrapping targets using $\hat{v}$ mentioned in the previous section.

Because the true value of a state is the expected value of the return following it, the Monte Carlo target $U_t = G_t$ is by definition an unbiased estimate of $v_\pi(S_t)$.

Bootstrapping methods are not in fact instances of true gradient descent (Barnard, 1993). They take into account the effect of changing the weight vector $\mathbf{w}_t$ on the estimate, but ignore its effect on the target. They include only a part of the gradient and, accordingly, we call them semi-gradient methods.

Note that the TD(0) algorithm uses $R + \gamma\hat{v}(S',\mathbf{w})$ as the target, so the target itself also contains the weights, unlike in supervised learning.

The value of a state is estimated as its group's component, and when the state is updated, that component alone is updated. State aggregation is a special case of SGD (9.7) in which the gradient, $\nabla\hat{v}(S_t,\mathbf{w}_t)$, is 1 for $S_t$'s group's component and 0 for the other components.
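
A minimal sketch of this (the `group_of` mapping, splitting 100 states evenly into 10 groups, is a hypothetical example): the value of a state is its group's weight, and the gradient is a one-hot vector over groups.

```python
import numpy as np

n_groups, n_states = 10, 100
w = np.zeros(n_groups)                   # one value component per group

def group_of(state):
    return state * n_groups // n_states  # hypothetical even grouping of states

def v_hat(state, w):
    return w[group_of(state)]            # a state's value = its group's component

def grad_v_hat(state, w):
    g = np.zeros(n_groups)
    g[group_of(state)] = 1.0             # 1 for the state's group, 0 elsewhere
    return g
```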

In my understanding, the difference between TD(0) and Monte Carlo is the target. TD(0) uses the next state's value as the target, but Monte Carlo uses the actual return. Even with function approximation, the difference is still there. TD(0) contains bias: when performing gradient descent, our target is the next state's value, which is itself calculated by the value function, so the target is not the true value function. That's the reason we call it semi-gradient.

$$\mathbf{w_{t+1}} = \mathbf{w_{t}} + \alpha \delta_t \nabla \hat{v}(S_t,\mathbf{w_{t}})$$

At each time step, we update the weights in the direction $g_t = \delta_t \nabla \hat{v}(S_t,\mathbf{w_t})$ using a fixed step-size $\alpha$. $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w_{t}}) - \hat{v}(S_t,\mathbf{w_t})$ is the TD-error. $\nabla \hat{v}(S_t,\mathbf{w_{t}})$ is the gradient of the value function with respect to the weights.
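
Putting these two formulas together, here is a minimal sketch of one semi-gradient TD(0) step (the generic `v_hat`/`grad_v_hat` callables and toy numbers are assumptions for illustration):

```python
import numpy as np

def semi_gradient_td0_step(w, S, R, S_next, done, v_hat, grad_v_hat, alpha=0.1, gamma=0.99):
    """One semi-gradient TD(0) update: w <- w + alpha * delta_t * grad v_hat(S_t, w_t)."""
    target = R if done else R + gamma * v_hat(S_next, w)  # bootstrapped target, itself depends on w
    delta = target - v_hat(S, w)                           # TD error delta_t
    return w + alpha * delta * grad_v_hat(S, w)            # gradient of the estimate only, not the target

# Usage with a tiny linear model v_hat(s, w) = w[0] + w[1] * s on a scalar state.
v_hat = lambda s, w: w[0] + w[1] * s
grad_v = lambda s, w: np.array([1.0, s])
w = np.zeros(2)
w = semi_gradient_td0_step(w, S=0.5, R=1.0, S_next=0.6, done=False,
                           v_hat=v_hat, grad_v_hat=grad_v)
print(w)   # -> [0.1, 0.05]
```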

9.4 Linear Methods

When using the semi-gradient method, if I give each state its own weight, i.e. $\hat{v}(s,\mathbf{w}) = w_s$, then the gradient of that component is 1. Performing gradient descent, the update becomes $\alpha(G_t - \hat{v}(S_t,\mathbf{w}))$, which is the tabular formula we had before. So the tabular method is a special case of the linear semi-gradient method.

Linear methods approximate the state-value function by the inner product between $\mathbf{w}$ and $\mathbf{x}(s)$: $\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s)$.

Almost all useful convergence results for learning systems of all kinds are for linear (or simpler) function approximation methods.

An interesting point here is that if the system converges, it must converge to the point where $\mathbf{b} - \mathbf{A}\mathbf{w}_t = 0$.

$\mathbb{E}[\mathbf{w}_{t+1}\mid\mathbf{w}_t] = \mathbf{w}_t + \alpha(\mathbf{b}-\mathbf{A}\mathbf{w}_t)$

In general, $\mathbf{w}_t$ will be reduced toward zero whenever $\mathbf{A}$ is positive definite, meaning $\mathbf{y}^\top\mathbf{A}\mathbf{y} > 0$ for any real vector $\mathbf{y} \neq 0$.

On the other hand, recall that the TD methods are often of vastly reduced variance compared to Monte Carlo methods, and thus faster, as we saw in Chapters 6 and 7.

$\overline{VE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma}\min_{\mathbf{w}}\overline{VE}(\mathbf{w})$ is the asymptotic error bound for the TD method. Linear semi-gradient DP with updates according to the on-policy distribution will also converge to the TD fixed point. (TD includes the next step's value in the current step's update; semi-gradient methods use the gradient to update the TD value function.)

9.5 Feature Construction for Linear Methods

9.5.1 Polynomials

Attention: $S$ represents the current state, and $\{s_1, s_2, s_3, \cdots\}$ represents the feature vector for state $S$.

9.5.2 Fourier Basis

9.5.3 Coarse Coding

Features with large receptive fields give broad generalization, but might also seem to limit the learned function to a coarse approximation, unable to make discriminations much finer than the width of the receptive fields. Happily, this is not the case. Initial generalization from one point to another is indeed controlled by the size and shape of the receptive fields, but acuity, the finest discrimination ultimately possible, is controlled more by the total number of features.

So it's fundamentally like a feature function: we feed a complex, high-dimensional space into the function, and the function outputs a simple, low-dimensional space. Coarse coding uses several smaller circles to cover the whole feature space. So why not use a convolutional neural network to do this? We could use convolutions to gather the feature space.

The receptive field of each feature is quite large. This means we can approximate the rough shape of the function with relatively few samples. As we sample the function more, our estimate forms a better and better approximation of the true function. The broad generalization of the longer intervals made learning faster.
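
A minimal 1-D sketch of coarse coding (the centers and interval width are hypothetical): each feature is 1 when the state falls inside that feature's receptive field, and the fields overlap.

```python
import numpy as np

def coarse_code(s, centers, width):
    """Binary coarse-coding features: feature i is active (1.0) if s lies inside
    the receptive field of the given width centered at centers[i]."""
    centers = np.asarray(centers)
    return (np.abs(s - centers) <= width / 2).astype(float)

# Hypothetical: 10 overlapping intervals of width 0.3 covering [0, 1].
centers = np.linspace(0.0, 1.0, 10)
print(coarse_code(0.42, centers, width=0.3))   # several features active at once -> broad generalization
```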

9.5.4 Tile Coding

Is it like a CNN encoding the state space? Or: four active tiles/features overlap the point and are used to represent it. Does that mean it's not a CNN, and instead we use four tiles to code the feature space?

I am still confused about this whole chapter. It's a chapter about building features for linear methods, so these are ways to represent the features.

Using Tile Coding in TD

So tile coding is simply used to encode the feature space. Tile coding has overlap between the receptive fields of two different tilings, but in state aggregation each state falls into exactly one group. So tile coding is more complex than state aggregation.
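
For intuition only, here is a minimal 1-D tile-coder sketch (not the course's tile-coding software; the offsets and sizes are assumptions for illustration): each tiling contributes exactly one active tile, and the tilings are shifted so their tiles overlap.

```python
import numpy as np

def tile_code(s, n_tilings=4, tiles_per_tiling=8, low=0.0, high=1.0):
    """Return the flat indices of the active tile in each tiling for a 1-D state
    s in [low, high]. Each tiling gets one extra tile to cover its offset."""
    s = (s - low) / (high - low)                         # normalize to [0, 1]
    tile_width = 1.0 / tiles_per_tiling
    active = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings              # each tiling shifted a little
        idx = min(int((s + offset) / tile_width), tiles_per_tiling)
        active.append(t * (tiles_per_tiling + 1) + idx)  # flat feature index
    return active

print(tile_code(0.37))   # 4 active features, one per tiling, e.g. [2, 12, 21, 30]
```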

Recall that when we created the tile coder we had to set several parameters, the size and shape of the tiles and the number of tilings. These parameters were fixed before learning. In a neural network, we have similar parameters corresponding to the number of layers, the number of nodes in each layer, and the activation functions. These are all, typically, fixed before learning.

One simple yet effective initialization strategy is to randomly sample the initial weights from a normal distribution with small variance.

Momentum

If recent gradients have all been in similar directions, then we gain momentum in that direction. This means we make a large step in that direction. If recent updates have conflicting directions, then the momentum is killed: the momentum term will have little impact on the update and we will make a regular gradient descent step.

So we add a new term called momentum, which accumulates the past gradients. When the gradients alternate between positive and negative, the momentum is killed.
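
A minimal sketch of that idea, using the same exponentially weighted form as the Adam equations below (the `+ alpha * m` convention matches the TD updates above; the names and values are my own):

```python
import numpy as np

def momentum_step(w, m, g, alpha=0.01, beta=0.9):
    """One update with momentum: m accumulates past gradients, so consistent
    directions build up while conflicting directions cancel out."""
    m = beta * m + (1 - beta) * g
    w = w + alpha * m
    return w, m

w, m = np.zeros(3), np.zeros(3)
w, m = momentum_step(w, m, g=np.array([0.1, -0.2, 0.0]))
```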

Adam Algorithm

In this assignment, instead of using SGD for updating the weights, we use a more advanced algorithm called Adam. The Adam algorithm improves the SGD update with two concepts: adaptive vector step-sizes and momentum. It keeps estimates of the mean and second moment of the updates, denoted by $\mathbf{m}$ and $\mathbf{v}$ respectively:
$$\mathbf{m}_t = \beta_m \mathbf{m}_{t-1} + (1 - \beta_m) g_t \\
\mathbf{v}_t = \beta_v \mathbf{v}_{t-1} + (1 - \beta_v) g^2_t$$

Given that $\mathbf{m}$ and $\mathbf{v}$ are initialized to zero, they are biased toward zero. To get unbiased estimates of the mean and second moment, Adam defines $\mathbf{\hat{m}}$ and $\mathbf{\hat{v}}$ as:
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_m^t} \\
\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_v^t}$$

We can see that $\mathbf{m}_t$ has replaced the raw gradient: it is an accumulation of gradients where, at each step, we keep a $\beta_m$ fraction of the old estimate and add a $1-\beta_m$ fraction of the new gradient. And $\hat{\mathbf{v}}_t$ adapts the step size for each weight, as in the update below.
The weights are then updated as follows:
$$ \mathbf{w_t} = \mathbf{w_{t-1}} + \frac{\alpha}{\sqrt{\mathbf{\hat{v_t}}}+\epsilon} \mathbf{\hat{m_t}}
$$
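
For reference, here is a minimal sketch of these Adam equations (my own NumPy version, not the implementation provided in the assignment; the beta values are typical defaults):

```python
import numpy as np

class Adam:
    """Adam as written above: moment estimates m and v, bias correction,
    and the update w <- w + alpha / (sqrt(v_hat) + eps) * m_hat."""
    def __init__(self, n_weights, alpha=0.001, beta_m=0.9, beta_v=0.999, eps=1e-8):
        self.alpha, self.beta_m, self.beta_v, self.eps = alpha, beta_m, beta_v, eps
        self.m = np.zeros(n_weights)   # estimate of the mean of the updates
        self.v = np.zeros(n_weights)   # estimate of the second moment of the updates
        self.t = 0

    def update(self, w, g):
        self.t += 1
        self.m = self.beta_m * self.m + (1 - self.beta_m) * g
        self.v = self.beta_v * self.v + (1 - self.beta_v) * g ** 2
        m_hat = self.m / (1 - self.beta_m ** self.t)   # correct the bias toward zero
        v_hat = self.v / (1 - self.beta_v ** self.t)
        return w + self.alpha / (np.sqrt(v_hat) + self.eps) * m_hat   # per-weight step size

opt = Adam(n_weights=2)
w = opt.update(np.zeros(2), g=np.array([0.5, -0.1]))
```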

When implementing the agent you will use the Adam algorithm instead of SGD because it is more efficient. We have already provided you the implementation of the Adam algorithm in the cell below. You will use it when implementing your agent.

Vector Step Size

A vector step size is a vector of step-size parameters: each weight has its own step size.

Overall, on the one hand, neural networks are powerful function approximators capable of representing a wide class of functions. They are also capable of producing features without exclusively relying on hand-crafted mechanisms. On the other hand, compared to a linear function approximator with tile coding, neural networks can be less sample efficient. When implementing your own Reinforcement Learning agents, you may consider these strengths and weaknesses to choose the proper function approximator for your problems.

9.5.5 Radial Basis Functions

Radial basis functions (RBFs) are the natural generalization of coarse coding to continuous-valued features. A typical RBF feature, $x_i$, has a Gaussian (bell-shaped) response $x_i(s)$ dependent only on the distance between the state, $s$, and the feature's prototypical or center state, $c_i$, and relative to the feature's width, $\sigma_i$: $x_i(s) = \exp\!\left(-\frac{\lVert s-c_i\rVert^2}{2\sigma_i^2}\right)$
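
A minimal sketch of computing RBF features from that formula (the centers and the shared width are hypothetical):

```python
import numpy as np

def rbf_features(s, centers, sigma):
    """Gaussian RBF features: x_i(s) = exp(-||s - c_i||^2 / (2 * sigma_i^2))."""
    s, centers = np.atleast_1d(s), np.atleast_2d(centers)
    dist2 = np.sum((centers - s) ** 2, axis=1)
    return np.exp(-dist2 / (2 * np.asarray(sigma) ** 2))

# Hypothetical: 5 centers spread over [0, 1], all with width sigma = 0.2.
centers = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
print(rbf_features(0.3, centers, sigma=0.2))   # response decays with distance from each center
```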

9.6 Selecting Step-Size Parameters Manually

A good rule of thumb for setting the step-size parameter of linear SGD methods is then
$\alpha = \left(\tau\,\mathbb{E}\!\left[\mathbf{x}^\top\mathbf{x}\right]\right)^{-1}$,
where $\mathbf{x}$ is a random feature vector chosen from the same distribution as input vectors will be in the SGD, and $\tau$ is roughly the number of experiences with substantially the same feature vector over which you would like to learn. This method works best if the feature vectors do not vary greatly in length; ideally $\mathbf{x}^\top\mathbf{x}$ is a constant.
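
A minimal sketch of applying this rule of thumb (the binary tile-coding style example and the `tau` value are assumptions): estimate $\mathbb{E}[\mathbf{x}^\top\mathbf{x}]$ from sample feature vectors and invert.

```python
import numpy as np

def rule_of_thumb_alpha(feature_vectors, tau=10):
    """alpha = (tau * E[x^T x])^{-1}, where tau is roughly the number of
    presentations over which you want learning to take place."""
    X = np.asarray(feature_vectors, dtype=float)
    expected_norm2 = np.mean(np.sum(X ** 2, axis=1))   # estimate of E[x^T x]
    return 1.0 / (tau * expected_norm2)

# With binary tile-coding features, x^T x equals the number of active tiles (8 here),
# so the rule gives alpha = 1 / (tau * 8).
X = np.zeros((4, 32)); X[:, :8] = 1.0
print(rule_of_thumb_alpha(X, tau=10))   # -> 0.0125
```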

9.7 Nonlinear Function Approximation: Artificial Neural Networks

Training the hidden layers of an ANN is therefore a way to automatically create features appropriate for a given problem so that hierarchical representations can be produced without relying exclusively on hand-crafted features.

9.8 Least-Squares TD

Assume $\mathbf{A} = \mathbb{E}\!\left[\mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top\right]$ and $\mathbf{b} = \mathbb{E}\!\left[R_{t+1}\mathbf{x}_t\right]$. We can then calculate $\mathbf{w}_{TD}$ as $\mathbf{A}^{-1}\mathbf{b}$.

And we can also estimate $\mathbf{A}$ at time $t$ directly from data: $\mathbf{A}_t=\sum_{k=0}^{t-1}\mathbf{x}_k(\mathbf{x}_k - \gamma\mathbf{x}_{k+1})^\top$, and similarly $\mathbf{b}_t=\sum_{k=0}^{t-1}R_{k+1}\mathbf{x}_k$.
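
A minimal LSTD sketch based on these sums (the small regularizer `eps * I` and the toy transition are assumptions, used to keep $\mathbf{A}_t$ invertible and the example runnable):

```python
import numpy as np

def lstd(transitions, d, gamma=0.99, eps=1e-3):
    """Accumulate A_t = sum_k x_k (x_k - gamma x_{k+1})^T and b_t = sum_k R_{k+1} x_k
    over observed transitions, then solve A_t w = b_t directly."""
    A = eps * np.eye(d)                 # small ridge term so A is invertible early on
    b = np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)        # w_TD = A_t^{-1} b_t

# Hypothetical 2-feature example with a single observed transition.
w = lstd([(np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0]))], d=2)
print(w)
```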

9.9 Memory-based Function Approximation

They simply save training examples in memory as they arrive (or at least save a subset of the examples) without updating any parameters. Then, whenever a query state’s value estimate is needed, a set of examples is retrieved from memory and used to compute a value estimate for the query state.
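
A minimal sketch of this memory-based idea as a k-nearest-neighbor value estimator (the class, its names, and the choice of k are my own illustration, not a method spelled out in the chapter):

```python
import numpy as np

class NearestNeighborValue:
    """Store (state, target) examples as they arrive; at query time, average the
    targets of the k stored states closest to the query state."""
    def __init__(self, k=3):
        self.k, self.states, self.targets = k, [], []

    def add(self, state, target):
        self.states.append(np.asarray(state, dtype=float))
        self.targets.append(float(target))

    def query(self, state):
        if not self.states:
            return 0.0                  # no examples stored yet
        dists = np.linalg.norm(np.asarray(self.states) - np.asarray(state, dtype=float), axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean(np.asarray(self.targets)[nearest]))

memory = NearestNeighborValue(k=1)
memory.add([0.1, 0.2], 1.0)
memory.add([0.9, 0.8], 3.0)
print(memory.query([0.2, 0.2]))   # -> 1.0, the value stored for the nearest example
```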