
  • 8.4 Prioritized Sweeping
  • 8.5 Expected vs. Sample Updates
  • 8.6 Trajectory Sampling
  • 8.7 Real-time Dynamic Programming
  • 8.8 Planning at Decision Time
  • 8.9 Heuristic Search

8.4 Prioritized Sweeping

In general, we want to work back not just from goal states but from any state whose value has changed.

In this way one can work backward from arbitrary states that have changed in value, either performing useful updates or terminating the propagation. This general idea might be termed backward focusing of planning computations.

For this algorithm, one additional process is added: a priority queue of state–action pairs, ordered by how much their estimated values would change if they were updated; pairs are popped from the queue and updated in priority order.
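Below is a minimal tabular sketch of this priority-queue process (my own illustration, not the book's boxed algorithm): it assumes a deterministic learned model stored as `model[(s, a)] = (reward, next_state)`, a helper `predecessors(s)` returning the state–action pairs predicted to lead to `s`, and `Q` as a dict of per-action value dicts; all names are illustrative.

```python
import heapq
import itertools

def prioritized_sweeping_planning(Q, model, predecessors, n_steps,
                                  alpha=0.1, gamma=0.95, theta=1e-4):
    """Planning loop: repeatedly update the state-action pair whose value would change the most."""
    pqueue, tie = [], itertools.count()  # max-priority queue as a min-heap of (-priority, tiebreak, pair)

    def push(s, a):
        r, s_next = model[(s, a)]
        priority = abs(r + gamma * max(Q[s_next].values()) - Q[s][a])
        if priority > theta:
            heapq.heappush(pqueue, (-priority, next(tie), (s, a)))

    # Seed the queue from the modeled pairs (in Dyna this happens as real experience arrives).
    for (s, a) in model:
        push(s, a)

    for _ in range(n_steps):
        if not pqueue:
            break
        _, _, (s, a) = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        # One-step Q-learning update using the model's predicted transition.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
        # Backward focusing: pairs predicted to lead into s may now need updating too.
        for s_bar, a_bar in predecessors(s):
            push(s_bar, a_bar)
    return Q
```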

Read more »

  • Preface
  • 8.1 Models and Planning
  • 8.2 Dyna: Integrated Planning, Acting, and Learning
  • 8.3 When the Model Is Wrong
  • What is a Model?
  • Comparing Sample and Distribution Models
  • Random Tabular Q-planning
  • The Dyna Architecture
  • The Dyna Algorithm
  • Dyna & Q-learning in a Simple Maze
  • What if the model is inaccurate?
  • In-depth with changing environments
  • Drew Bagnell: self-driving, robotics, and Model Based RL
  • Week 4 Summary
  • Programming Assignment: Dyna-Q and Dyna-Q+
Read more »

MetaLearning Learning Note - 5

  • Recap

Recap

Read more »

  • Sarsa: On-policy TD Control
  • Q-learning: Off-policy TD Control
  • Maximization Bias and Double Learning
  • Games, Afterstates, and Other Special Cases
  • Summary
  • Sarsa: GPI with TD
  • Sarsa in the Windy Grid World
  • What is Q-learning
  • Q-learning in the Windy Grid World
  • How is Q-learning off-policy
  • Expected Sarsa
  • Expected Sarsa in the Cliff World
  • Generality of Expected Sarsa
  • Week 3 Summary
  • Programming Assignment
Read more »

  • Abstract
  • TD Prediction
  • Advantages of TD Prediction Methods
  • Optimality of TD(0)

Abstract

TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
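As a concrete illustration of bootstrapping, here is a minimal tabular TD(0) prediction sketch; the `env` interface (`reset`/`step`) and the `policy` callable are hypothetical placeholders, not something defined in the text.

```python
def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]."""
    V = {}  # state-value estimates; unseen states default to 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # Bootstrapping: the target uses the current estimate V(S') instead of
            # waiting for the final outcome of the episode (as Monte Carlo would).
            target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
            V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
            state = next_state
    return V
```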

Read more »

  • What is Monte Carlo
  • Using Monte Carlo for Prediction
  • Using Monte Carlo for Action Values
  • Using Monte Carlo methods for generalized policy iteration
  • Solving the Blackjack Example

What is Monte Carlo

DP methods need to know the transition probabilities and have high time complexity.

I observe that in MC there is still a discount factor $\gamma$: the rewards following a state are discounted by powers of $\gamma$ and summed into the return, and these returns are averaged to estimate the state's value function.
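To make the discounting-and-averaging step concrete, here is a minimal first-visit Monte Carlo prediction sketch; it assumes each episode has already been collected as a list of `(state, reward)` pairs, where the reward is the one received on leaving that state (names are illustrative).

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V(s) by averaging the discounted return that follows the first visit to s."""
    returns = defaultdict(list)              # state -> list of observed returns
    for episode in episodes:                 # episode: [(state, reward), ...]
        # Sweep backward so that at step t, G = R_{t+1} + gamma*R_{t+2} + ...
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns_at[t] = G
        # Record the return only at the first visit to each state.
        first_visits = {s: t for t, (s, _) in reversed(list(enumerate(episode)))}
        for s, t in first_visits.items():
            returns[s].append(returns_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```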

Read more »

Preface

While doing systems research, I found it worthwhile to understand the real implementation details of every component (file system, memory management, etc.). So I am starting a new chapter here to record my notes and experience from reading Understanding the Linux Kernel, Third Edition by Daniel P. Bovet. I hope that after reading this book I can understand the papers at OSDI and come up with more useful, novel ideas, rather than thinking in the abstract without considering the real problems and architecture of the operating system.

Read more »

Preface

We are starting the Coursera course Sample-based Learning Methods now. As before, I will excerpt sentences from Sutton's book, but this time I will mark my own comprehension in red.

ReinforcementLearning-Principle-Day6: Monte-Carlo

Introduction

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component.

Monte Carlo methods sample and average returns for each state–action pair much like the bandit methods we explored in Chapter 2 sample and average rewards for each action. The main difference is that now there are multiple states, each acting like a different bandit problem (like an associative-search or contextual bandit) and the different bandit problems are interrelated. That is, the return after taking an action in one state depends on the actions taken in later states in the same episode. Because all the action selections are undergoing learning, the problem becomes nonstationary from the point of view of the earlier state.

Read more »

Reinforcement Learning Day 4 (Policy Evaluation)

  • Abstract
  • Policy Evaluation
  • Policy Improvement
  • Policy Iteration
  • Value Iteration
  • Asynchronous Dynamic Programming
  • Generalized Policy Iteration
  • Efficiency of Dynamic Programming
  • Summary
  • Policy Evaluation vs. Control
  • Iterative Policy Evaluation
  • Policy Improvement
  • Policy Iteration
  • Flexibility of the Policy Iteration Framework
  • Efficiency of Dynamic Programming
  • Approximate Dynamic Programming for Fleet Management
Read more »

MetaLearning Learning Note - 4

  • Optimization-based meta-learning
  • Non-parametric few-shot learning
  • Properties of meta-learning algorithms

Recap: Optimization-Based Meta-Learning

  1. Fine-tuning => MAML: our pre-trained parameters are fine-tuned at test time, $\theta \rightarrow \phi$. The adapted parameters $\phi$ are initialized from $\theta$, so before adaptation the two variables hold the same values (see the sketch after this recap).

  2. Probabilistic Interpretation of Optimization-Based Inference

The meta-parameters $\theta$ serve as a prior;

$$\max_{\theta} \log \prod_i p(D_i \mid \theta)$$

If we use $\theta$ to initialize $\phi$, we have:

$$\log \prod_i \int p(D_i \mid \phi_i)\, p(\phi_i \mid \theta)\, d\phi_i$$

$$\approx \log \prod_i p(D_i \mid \overline{\phi}_i)$$

which is the empirical Bayes approximation, with $\overline{\phi}_i$ a point estimate of the task parameters.
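A minimal first-order sketch of this fine-tuning view (my own illustration, not from the lecture): parameters are flat lists of floats, `grad_loss(params, data)` is a user-supplied gradient function, and the second-order term through the inner update is dropped.

```python
def maml_outer_step(theta, tasks, grad_loss, inner_lr=0.01, outer_lr=0.001, inner_steps=1):
    """One meta-update: adapt theta -> phi_i on each task's support set,
    then move theta using the query-set gradient evaluated at phi_i."""
    meta_grad = [0.0] * len(theta)
    for support, query in tasks:
        # Inner loop: phi_i is initialized from theta (the same values before adaptation).
        phi = list(theta)
        for _ in range(inner_steps):
            g = grad_loss(phi, support)
            phi = [p - inner_lr * gi for p, gi in zip(phi, g)]
        # Outer loop: the prior theta is updated so that the adapted phi_i does well on the query set.
        g_query = grad_loss(phi, query)
        meta_grad = [m + gq for m, gq in zip(meta_grad, g_query)]
    return [t - outer_lr * m / len(tasks) for t, m in zip(theta, meta_grad)]
```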

Read more »

Reinforcement Learning Day 4 (Finite Markov Decision Processes's Coursera Video Notes)

  • Specifying Policies
  • Value Functions
  • Action-value function
  • Bellman Equation Derivation
  • Intuition - Bellman Equation
  • Optimal Policy
  • Summary

Specifying Policies

  • Policies can depend only on the current state, not on previous states or on time.

Value Functions

  • There are state-value functions and action-value (state-action) functions.

Value functions summarize the return an agent can expect to obtain, as formalized below.

The state-value function $v_\pi(s)$ gives the expected return when starting from state $s$ and following policy $\pi$.

The action-value function $q_\pi(s, a)$ gives the expected return when taking action $a$ in state $s$ and then following $\pi$ in all subsequent states.
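For reference, the standard definitions in Sutton & Barto's notation are:

$$v_\pi(s) \doteq \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$$

$$q_\pi(s, a) \doteq \mathbb{E}_\pi\!\left[ G_t \mid S_t = s, A_t = a \right]$$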

Read more »

Reinforcement Learning Day 3 (Finite Markov Decision Processes)

  • Return, Policy and Value Function
  • Optimal Policies and Optimal Value Functions
  • Coursera False Questions
  • Optimality and Approximation
  • Summary

Goal of Reinforcement Learning

  • Michael Littman: identify where reward signals come from;
  • develop algorithms that search the space of behaviors to maximize reward signals;

MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards.

In MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections.
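For reference, these optimal value functions satisfy the Bellman optimality equations:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]$$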

Read more »