ReinforcementLearning-Principle-Day7

  • What is Monte Carlo
  • Using Monte Carlo for Prediction
  • Using Monte Carlo for Action Values
  • Using Monte Carlo methods for generalized policy iteration
  • Solving the Blackjack Example

What is Monte Carlo

DP methods need to know the transition probabilities and have high time complexity. Monte Carlo methods avoid this by estimating values from sampled episodes of experience instead of from a model.

I notice that MC still uses the discount factor $\gamma$: the rewards that follow a state are discounted by powers of $\gamma$ and summed to form the return, and the value of the state is estimated by averaging these returns.

Using Monte Carlo for Prediction
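
A minimal first-visit Monte Carlo prediction sketch, following the return calculation described above. The episode format (a list of (state, reward) pairs) and the `generate_episode` callback are assumptions for illustration, not a fixed API.

```python
from collections import defaultdict

def mc_prediction(generate_episode, gamma=0.9, num_episodes=1000):
    """First-visit Monte Carlo prediction (sketch).

    `generate_episode` is an assumed callback that returns a list of
    (state, reward) pairs sampled by following the policy being evaluated.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for s, _ in episode]
        G = 0.0
        # Work backwards through the episode: G_t = R_{t+1} + gamma * G_{t+1}
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # First-visit check: only update on the first occurrence of the state
            if state not in states[:t]:
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```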

Using Monte Carlo for Action Values

Exploring starts: at the start of each episode, every state–action pair must have a nonzero probability of being chosen, so that all action values keep being estimated.
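
A minimal sketch of the exploring-starts idea: each episode begins from a randomly chosen state–action pair, and only afterwards does the policy take over. The environment interface (`env.set_state`, `env.step`) and the state/action lists are assumptions for illustration.

```python
import random

def generate_episode_with_exploring_starts(env, policy, states, actions):
    """Sample one episode whose first state-action pair is chosen uniformly
    at random (exploring starts); afterwards, actions follow `policy`.

    `env.set_state(s)` and `env.step(a) -> (next_state, reward, done)` are
    assumed interfaces for illustration.
    """
    # Exploring start: random state and random action, independent of the policy
    state = random.choice(states)
    action = random.choice(actions)
    env.set_state(state)

    episode = []  # list of (state, action, reward) triples
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        action = policy(state)  # after the first step, follow the policy
    return episode
```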

Using Monte Carlo methods for generalized policy iteration

Monte Carlo control methods combine policy improvement and policy evaluation on an episode-by-episode basis.
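
A sketch of this episode-by-episode generalized policy iteration, in the spirit of Monte Carlo ES: after each episode, the sampled returns update Q, and the policy is made greedy in the states that were visited. It reuses the hypothetical `generate_episode_with_exploring_starts` helper from the sketch above.

```python
from collections import defaultdict

def mc_control_es(env, states, actions, gamma=0.9, num_episodes=10000):
    """Monte Carlo control with exploring starts (sketch).

    Alternates policy evaluation (updating Q from sampled returns) and
    policy improvement (greedification) after every episode.
    """
    Q = defaultdict(float)                           # Q[(state, action)]
    returns_count = defaultdict(int)
    policy_table = {s: actions[0] for s in states}   # arbitrary initial policy

    def policy(s):
        return policy_table[s]

    for _ in range(num_episodes):
        episode = generate_episode_with_exploring_starts(env, policy, states, actions)
        visited = [(s, a) for s, a, _ in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            if (s, a) not in visited[:t]:            # first-visit update
                returns_count[(s, a)] += 1
                # Incremental average of the observed returns
                Q[(s, a)] += (G - Q[(s, a)]) / returns_count[(s, a)]
                # Policy improvement: act greedily with respect to Q in state s
                policy_table[s] = max(actions, key=lambda act: Q[(s, act)])
    return Q, policy_table
```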

Solving the Blackjack Example

Epsilon-soft policies

I understand why exploring starts can be problematic in real problems: in self-driving, for example, we cannot try every action or start from arbitrary states at the beginning of an episode.

If our policy always assigns at least $\frac{\epsilon}{|\mathcal{A}(s)|}$ probability to every action (an $\epsilon$-soft policy), it cannot converge to a deterministic optimal policy; the best it can reach is the best $\epsilon$-soft policy.
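
A minimal $\epsilon$-greedy action-selection sketch (one common $\epsilon$-soft policy): every action keeps probability at least $\epsilon / |\mathcal{A}|$, which removes the need for exploring starts but, as noted above, means the learned policy stays stochastic. The `Q` and `actions` arguments mirror the hypothetical structures used in the sketches above.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick an action from an epsilon-soft policy.

    With probability epsilon the action is uniform random, so every action
    has probability at least epsilon / len(actions); otherwise the action is
    greedy with respect to Q (assumed to be a dict keyed by (state, action)).
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```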

Why does off-policy learning matter?

$\epsilon$-soft policies are neither optimal for obtaining reward nor optimal for exploring to find the best actions.

Target policy $\pi$: the policy whose values we want to learn.

Behavior policy $b$: the policy used to generate the episodes.

Importance Sampling

We call $\rho(x)$ the importance sampling ratio:

$\rho(x) = \frac{\pi(x)}{b(x)}$

It lets us rewrite an expectation under the target policy $\pi$ as an expectation under the behavior policy $b$:

$E_{\pi}[X] = \sum_{x \in X} x\,\pi(x) = \sum_{x \in X} x\,\rho(x)\,b(x) = E_b[X\,\rho(X)]$
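
A toy numerical check of this identity, using made-up target and behavior distributions over three outcomes: sampling $X$ from $b$ and weighting each sample by $\rho(x)$ recovers (approximately) the expectation under $\pi$.

```python
import random

# Hypothetical discrete distributions over outcomes 1, 2, 3
outcomes = [1, 2, 3]
pi = {1: 0.7, 2: 0.2, 3: 0.1}   # target distribution
b  = {1: 0.3, 2: 0.3, 3: 0.4}   # behavior distribution

true_value = sum(x * pi[x] for x in outcomes)   # E_pi[X]

# Sample from b, then weight each sample by the importance sampling ratio
n = 100_000
samples = random.choices(outcomes, weights=[b[x] for x in outcomes], k=n)
estimate = sum(x * (pi[x] / b[x]) for x in samples) / n   # estimates E_b[X * rho(X)]

print(f"E_pi[X] = {true_value:.3f}, importance-sampled estimate = {estimate:.3f}")
```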

Off-Policy Monte Carlo Prediction

To average the returns collected under the behavior policy, we now weight each return by the importance sampling ratio $\rho$ before averaging.
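
A sketch of off-policy Monte Carlo prediction along these lines. The `generate_episode_b` callback (episodes sampled by following $b$) and the `pi(a, s)` / `b(a, s)` probability functions are assumptions for illustration; the `weighted` flag switches between ordinary and weighted importance sampling.

```python
from collections import defaultdict

def off_policy_mc_prediction(generate_episode_b, pi, b, gamma=0.9,
                             num_episodes=10000, weighted=True):
    """Off-policy MC prediction of v_pi from episodes generated by b (sketch).

    `generate_episode_b` returns a list of (state, action, reward) triples
    sampled by following b; `pi(a, s)` and `b(a, s)` return the policies'
    action probabilities.
    """
    numer = defaultdict(float)   # per-state sum of rho * G
    denom = defaultdict(float)   # per-state sum of rho (weighted) or visit count (ordinary)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode_b()
        G, rho = 0.0, 1.0
        # Work backwards so rho covers exactly the steps from t to the end
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            rho *= pi(action, state) / b(action, state)
            numer[state] += rho * G
            denom[state] += rho if weighted else 1.0
            if denom[state] > 0:
                V[state] = numer[state] / denom[state]
    return V
```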

Batch Reinforcement Learning

Emma Brunskill

Chapter Summary

A potential problem is that this method learns only from the tails of episodes, when all of the remaining actions in the episode are greedy. If nongreedy actions are common, then learning will be slow, particularly for states appearing in the early portions of long episodes. Potentially, this could greatly slow learning. There has been insufficient experience with off-policy Monte Carlo methods to assess how serious this problem is. If it is serious, the most important way to address it is probably by incorporating temporal-difference learning, the algorithmic idea developed in the next chapter. Alternatively, if $\gamma$ is less than 1, then the idea developed in the next section may also help significantly.

A fourth advantage of Monte Carlo methods, which we discuss later in the book, is that they may be less harmed by violations of the Markov property. This is because they do not update their value estimates on the basis of the value estimates of successor states. In other words, it is because they do not bootstrap.

Ordinary importance sampling uses a simple average of the weighted returns, whereas weighted importance sampling uses a weighted average.
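
For reference, in Sutton and Barto's notation, with $\mathcal{T}(s)$ the set of time steps at which $s$ is visited and $\rho_{t:T(t)-1}$ the importance sampling ratio from $t$ to the end of that episode, the two estimators are:

$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$ (ordinary importance sampling)

$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$ (weighted importance sampling)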

In this chapter, I learned how to use Monte Carlo methods and off-policy learning. I think these two ideas are at the core of this book.