
ReinforcementLearning-Principle-Day2

Reinforcement Learning Day 2 (Multi-armed Bandits)

  • Q* Formula
  • Nonstationary problem
  • Optimistic Initial Value
  • Gradient Bandit Algorithms
  • Associative Search
  • Summary
  • Overall

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.

This chapter implements RL in a simple setting. I also learned this chapter in the Coursera course fundamentals-of-reinforcement-learning by Martha and Adam at the University of Alberta.

I learned three algorithms in Coursera, two of them fully. The first is the formula to compute Q* together with the epsilon-greedy algorithm. Studying this case enables us to see most clearly how evaluative feedback differs from, and yet can be combined with, instructive feedback.

I found Richard keeps coming back to the difference between RL and other learning algorithms.

Q* Formula

(Assuming we already understand what the k-armed bandit problem is)

The Q* function in the k-armed bandit problem is defined as

$$q_*(a) = \mathbb{E}[R_t \mid A_t=a]$$

We denote the estimated value of action a at time step t as $Q_t(a)$. We would like $Q_{t}(a)$ to be close to $q_*(a)$.

$NewEstimate\leftarrow OldEstimate+Stepsize[Target-OldEstimate]$
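As a concrete illustration of this incremental rule, here is a minimal sketch of my own (not from the book or the course notebooks) using sample-average estimates with epsilon-greedy selection; the toy true values `q_star`, the horizon, and the noise model are assumptions:

```python
import numpy as np

def run_epsilon_greedy(q_star, steps=1000, epsilon=0.1, seed=0):
    """Incremental sample-average estimates with epsilon-greedy selection."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)      # estimated action values
    N = np.zeros(k)      # how many times each action was taken
    rewards = []
    for _ in range(steps):
        # explore with probability epsilon, otherwise exploit the current estimate
        if rng.random() < epsilon:
            a = rng.integers(k)
        else:
            a = int(np.argmax(Q))
        r = rng.normal(q_star[a], 1.0)       # noisy reward around the true value
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # NewEstimate = Old + (1/N)[Target - Old]
        rewards.append(r)
    return Q, np.mean(rewards)

# toy 4-armed bandit with made-up true values
print(run_epsilon_greedy(q_star=[0.2, -0.5, 1.0, 0.4]))
```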

Nonstationary Problem

The averaging methods above are appropriate for stationary bandit problems, that is, for bandit problems in which the reward probabilities do not change over time. Many problems we actually encounter, however, are effectively nonstationary.

Therefore, we need to give more weight to recent rewards than to long-past rewards. One of the most popular ways of doing this is to use a constant step-size parameter.

This turns into our familiar formula:

$$Q_{n+1} = Q_n + \alpha * [R_n - Q_n] $$

We take this step's error, $R_n - Q_n$, scale it by $\alpha$, and add it to our estimated value.

And keep expanding it:

$$= \alpha * R_n + (1 - \alpha) * Q_n $$
$$= \alpha * R_n + (1 - \alpha) * (\alpha * R_{n-1} + (1-\alpha) * Q_{n-1})$$

and keep reducing it until we reach $Q_1$:

$$= (1- \alpha )^n*Q_1 + \sum_{i=1}^n \alpha *(1- \alpha )^{n-i} * R_i$$

If $1 - \alpha = 0$, then all the weight goes on the very last reward, Rn, because of the convention that $0^0= 1$.

If we set $\alpha = \frac{1}{n}$, then as the game goes on each new reward gets a smaller and smaller step size, so later rewards move the estimate less and less. Does that make sense for this problem?

It means that the further along we are, the more we trust our accumulated past experience, and the estimate converges: with $\alpha_n = \frac{1}{n}$ the update is exactly the sample average. And through the code assignment we see how important the choice of $\frac{1}{n}$ is!

Oh, and the reason it seemed wrong is that the assignment's problem is nonstationary!

Does this mean that 1/N(A) is always the best? When might it not be? One possible setting where it might not be as effective is in non-stationary problems. You learned about non-stationarity in the lessons. Non-stationarity means that the environment may change over time. This could manifest itself as continual change over time of the environment, or a sudden change in the environment.

In the nonstationary case, a step size that has shrunk too small cannot keep the estimated value close to the (moving) actual value, while a bigger constant step size tracks the change but fluctuates around it. Therefore, we need to keep the step size bigger (constant) in the nonstationary problem.
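To see the tradeoff concretely, here is a small sketch of my own (not the actual course assignment) that tracks a single drifting arm with both a $\frac{1}{N}$ step size and a constant one; the drift rate, noise model, and horizon are made-up assumptions:

```python
import numpy as np

def track_drifting_arm(steps=2000, alpha=0.1, drift=0.01, seed=0):
    """Compare 1/N and constant-alpha estimates on a nonstationary arm."""
    rng = np.random.default_rng(seed)
    q_true = 0.0                        # true value performs a random walk
    q_avg, q_const, n = 0.0, 0.0, 0
    err_avg, err_const = 0.0, 0.0
    for _ in range(steps):
        q_true += drift * rng.normal()  # the environment drifts
        r = q_true + rng.normal()       # noisy reward
        n += 1
        q_avg += (r - q_avg) / n        # sample average (step size 1/N)
        q_const += alpha * (r - q_const)  # constant step size
        err_avg += abs(q_true - q_avg)
        err_const += abs(q_true - q_const)
    return err_avg / steps, err_const / steps

print(track_drifting_arm())   # constant alpha usually tracks the drift better
```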

Conclusion from the Coursera assignment

These are the types of tradeoffs we have to think about in reinforcement learning. A larger step size moves us more quickly toward the true value, but can make our estimated values oscillate around the expected value. A step size that reduces over time can converge to close to the expected value, without oscillating. On the other hand, such a decaying stepsize is not able to adapt to changes in the environment. Nonstationarity—and the related concept of partial observability—is a common feature of reinforcement learning problems and when learning online.

Therefore, the content above describes the choice of step size. And Sutton tells us that a well-known result in stochastic approximation theory gives the conditions required to assure convergence with probability 1:
$$ \sum_{n=1}^{\infty}\alpha_n(a) = \infty$$
$$ \sum_{n=1}^{\infty}\alpha^2_n(a) < \infty$$
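As a quick check of these conditions (my own worked example, following Sutton's remark): the sample-average choice $\alpha_n = \frac{1}{n}$ satisfies both,

$$\sum_{n=1}^{\infty}\frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6} < \infty,$$

so those estimates converge, while a constant $\alpha$ satisfies only the first condition (since $\sum_n \alpha^2 = \infty$), so the estimates never fully converge and keep responding to the most recent rewards, which is exactly what we want in a nonstationary problem.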

Optimistic Initial Value

In the language of statistics, these methods are biased by their initial estimates.

We call this technique for encouraging exploration optimistic initial values. We regard it as a simple trick that can be quite effective on stationary problems, but it is far from being a generally useful approach to encouraging exploration.

The beginning of time occurs only once, and thus we should not focus on it too much.

BTW: UCB selects actions by $Q$ plus an uncertainty bonus, $A_t = \arg\max_a \big[\,Q_t(a) + c\sqrt{\ln t / N_t(a)}\,\big]$, which means we may choose an arm either because its expected value is high or because its uncertainty is still high.
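A minimal sketch of UCB selection in that spirit (my own illustration; the exploration constant `c`, the reward model, and the toy arm values are assumptions):

```python
import numpy as np

def ucb_bandit(q_star, steps=1000, c=2.0, seed=0):
    """Upper-confidence-bound action selection with sample-average estimates."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)          # estimated values
    N = np.zeros(k)          # visit counts
    total = 0.0
    for t in range(1, steps + 1):
        # untried arms get an infinite bonus, so each arm is tried at least once
        bonus = np.where(N > 0, c * np.sqrt(np.log(t) / np.maximum(N, 1)), np.inf)
        a = int(np.argmax(Q + bonus))
        r = rng.normal(q_star[a], 1.0)   # noisy reward around the true value
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
        total += r
    return Q, total / steps

# same toy 4-armed bandit as above
print(ucb_bandit(q_star=[0.2, -0.5, 1.0, 0.4]))
```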

Gradient Bandit Algorithms

In this section we consider learning a numerical preference for each action a, which we denote $H_t(a) \in \mathbb{R}$.

$$\Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} = \pi_t(a)$$

$\pi_t(a)$, for the probability of taking action a at time t. (policy)

There is a natural learning algorithm for soft-max action preferences based on the idea of stochastic gradient ascent. On each step, after selecting action At and receiving the reward Rt, the action preferences are updated by:

$$H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t-R_{average})(1-\pi_t(A_t))$$
$$H_{t+1}(a) = H_t(a) - \alpha(R_t-R_{average})\,\pi_t(a) \qquad \text{for all } a \neq A_t$$

(derived via stochastic gradient ascent)

We want to move the preferences toward whatever makes the received rewards large: $R_t$ plays the role of the observed target, and $R_{average}$ is the baseline we compare it against (much like a target and a running prediction in a neural network).

$H_{t+1}(a) = H_t(a) + \alpha \,\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}$

We want to adjust $H_t(a)$ so that the expected reward reaches its maximum, i.e. we climb this gradient with respect to $H_t(a)$ until it vanishes (on average the $(R_t - R_{average})$ correction becomes zero).

$$\mathbb{E}[R_t] = \sum_x \pi_t(x)\, q_*(x)$$

And from this formula we can derive the update rule above.

Sutton’s derivation is gorgeous [page 39 - 40]

And the trick is to multiply and divide by $\pi_t(x)$ at the same time, and to subtract a baseline from $q_*(x)$ without changing the gradient.
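A condensed restatement of those steps (my own paraphrase of Sutton's derivation, writing the baseline as $B_t = R_{average}$):

$$\frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)} = \sum_x \big(q_*(x) - B_t\big)\frac{\partial \pi_t(x)}{\partial H_t(a)} = \sum_x \pi_t(x)\,\big(q_*(x) - B_t\big)\,\frac{\partial \pi_t(x)}{\partial H_t(a)} \Big/ \pi_t(x)$$

$$= \mathbb{E}\Big[\big(R_t - R_{average}\big)\,\frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t)\Big] = \mathbb{E}\big[\big(R_t - R_{average}\big)\big(\mathbb{1}_{a=A_t} - \pi_t(a)\big)\big]$$

Subtracting the baseline is allowed because $\sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)} = 0$, $q_*(A_t)$ can be replaced by its sample $R_t$, and the last step uses the softmax derivative $\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)\big(\mathbb{1}_{a=x} - \pi_t(a)\big)$. Sampling this expectation one step at a time gives exactly the update rule above.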

Now I understand the policy gradient algorithm: we directly update the preferences that determine the policy's probabilities.
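Putting the update rule into code, here is a minimal sketch of my own (the reward model, step size, and horizon are assumptions):

```python
import numpy as np

def gradient_bandit(q_star, steps=1000, alpha=0.1, seed=0):
    """Softmax action preferences updated by the gradient bandit rule."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    H = np.zeros(k)          # action preferences
    r_avg, t = 0.0, 0        # incremental average reward = the baseline
    for _ in range(steps):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                       # softmax policy pi_t
        a = rng.choice(k, p=pi)              # sample an action from the policy
        r = rng.normal(q_star[a], 1.0)
        t += 1
        r_avg += (r - r_avg) / t             # update the baseline R_average
        one_hot = np.zeros(k)
        one_hot[a] = 1.0
        H += alpha * (r - r_avg) * (one_hot - pi)   # gradient bandit update
    return pi

# preferences end up concentrated on the best arm of this toy bandit
print(gradient_bandit(q_star=[0.2, -0.5, 1.0, 0.4]))
```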

Associative Search

So far these have been nonassociative tasks with only one situation. In associative search (a contextual bandit) there is more than one situation, and the goal is to learn a policy that maps each situation to the action that is best in it. We present this problem in the next chapter and consider its ramifications throughout the rest of the book.

(and once the current action affects the next situation as well as the reward, we have the full RL problem)

Summary

The Gittins-index approach is an instance of Bayesian methods, which assume a known initial distribution over the action values and then update the distribution exactly after each step (assuming that the true action values are stationary). In general, the update computations can be very complex, but for certain special distributions (called conjugate priors) they are easy. One possibility is to then select actions at each step according to their posterior probability of being the best action. This method, sometimes called posterior sampling or Thompson sampling, often performs similarly to the best of the distribution-free methods we have presented in this chapter.

(I see many algorithms use the Gittins index as a benchmark.) (Is it the optimal solution under those assumptions?) Let me keep reading this book.
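As a tiny illustration of that posterior-sampling idea, here is a sketch of my own, assuming Bernoulli rewards with Beta priors (the standard conjugate case; the arm probabilities and horizon are made up):

```python
import numpy as np

def thompson_bernoulli(p_true, steps=1000, seed=0):
    """Thompson sampling with Beta(1, 1) priors over Bernoulli arm probabilities."""
    rng = np.random.default_rng(seed)
    k = len(p_true)
    wins = np.ones(k)      # Beta alpha parameters (successes + 1)
    losses = np.ones(k)    # Beta beta parameters (failures + 1)
    total = 0
    for _ in range(steps):
        # sample one plausible value per arm from its posterior, act greedily on the sample
        samples = rng.beta(wins, losses)
        a = int(np.argmax(samples))
        r = rng.random() < p_true[a]          # Bernoulli reward
        wins[a] += r
        losses[a] += 1 - r
        total += r
    return total / steps

print(thompson_bernoulli(p_true=[0.3, 0.5, 0.7]))
```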

Overall

In this chapter, we learned some simple but important concepts, including the Q-value, $\epsilon$-greedy, the step size, stationary vs. nonstationary problems, optimistic initial values, gradient-based methods (the policy gradient idea), and associative search. I found that, having learned some RL already, this chapter in fact describes simple versions of the more complex algorithms. And it hinted at some points I never knew before (initial values, as in MAML). This chapter is a great beginning for RL. Let me keep exploring more specific and complex algorithms.