
  • 8.4 Prioritized Sweeping
  • 8.5 Expected vs. Sample Updates
  • 8.6 Trajectory Sampling
  • 8.7 Real-time Dynamic Programming
  • 8.8 Planning at Decision Time
  • 8.9 Heuristic Search

8.4 Prioritized Sweeping

In general, we want to work back not just from goal states but from any state whose value has changed.

In this way one can work backward from arbitrary states that have changed in value, either performing useful updates or terminating the propagation. This general idea might be termed backward focusing of planning computations.

For this algorithm, one additional process is added: a priority queue of state–action pairs, ordered by how much their estimated values would change if they were updated; pairs are popped from the queue and updated in priority order.
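Below is a minimal tabular sketch of this priority-queue process (my own illustration, not the book's boxed algorithm): it assumes a deterministic learned model stored as `model[(s, a)] = (reward, next_state)`, a helper `predecessors(s)` returning the state–action pairs predicted to lead to `s`, and `Q` as a dict of per-action value dicts; all names are illustrative.

```python
import heapq
import itertools

def prioritized_sweeping_planning(Q, model, predecessors, n_steps,
                                  alpha=0.1, gamma=0.95, theta=1e-4):
    """Planning loop: repeatedly update the state-action pair whose value would change the most."""
    pqueue, tie = [], itertools.count()  # max-priority queue as a min-heap of (-priority, tiebreak, pair)

    def push(s, a):
        r, s_next = model[(s, a)]
        priority = abs(r + gamma * max(Q[s_next].values()) - Q[s][a])
        if priority > theta:
            heapq.heappush(pqueue, (-priority, next(tie), (s, a)))

    # Seed the queue from the modeled pairs (in Dyna this happens as real experience arrives).
    for (s, a) in model:
        push(s, a)

    for _ in range(n_steps):
        if not pqueue:
            break
        _, _, (s, a) = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        # One-step Q-learning update using the model's predicted transition.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
        # Backward focusing: pairs predicted to lead into s may now need updating too.
        for s_bar, a_bar in predecessors(s):
            push(s_bar, a_bar)
    return Q
```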

Read more »

  • Preface
  • 8.1 Models and Planning
  • 8.2 Dyna: Integrated Planning, Acting, and Learning
  • 8.3 When the Model Is Wrong
  • What is a Model?
  • Comparing Sample and Distribution Models
  • Random Tabular Q-planning
  • The Dyna Architecture
  • The Dyna Algorithm
  • Dyna & Q-learning in a Simple Maze
  • What if the model is inaccurate?
  • In-depth with changing environments
  • Drew Bagnell: self-driving, robotics, and Model Based RL
  • Week 4 Summary
  • Programming Assignment: Dyna-Q and Dyna-Q+
Read more »

MetaLearning Learning Note - 5

  • Recap

Recap

Read more »

  • Sarsa: On-policy TD Control
  • Q-learning: Off-policy TD Control
  • Maximization Bias and Double Learning
  • Games, Afterstates, and Other Special Cases
  • Summary
  • Sarsa: GPI with TD
  • Sarsa in the Windy Grid World
  • What is Q-learning
  • Q-learning in the Windy Grid World
  • How is Q-learning off-policy
  • Expected Sarsa
  • Expected Sarsa in the Cliff World
  • Generality of Expected Sarsa
  • Week 3 Summary
  • Programming Assignment
Read more »

  • Abstract
  • TD Prediction
  • Advantages of TD Prediction Methods
  • Optimality of TD(0)

Abstract

TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
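As a concrete illustration of bootstrapping, here is a minimal tabular TD(0) prediction sketch; the `env` interface (`reset`/`step`) and the `policy` callable are hypothetical placeholders, not something defined in the text.

```python
def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]."""
    V = {}  # state-value estimates; unseen states default to 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # Bootstrapping: the target uses the current estimate V(S') instead of
            # waiting for the final outcome of the episode (as Monte Carlo would).
            target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
            V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
            state = next_state
    return V
```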

Read more »

  • What is Monte Carlo
  • Using Monte Carlo for Prediction
  • Using Monte Carlo for Action Values
  • Using Monte Carlo methods for generalized policy iteration
  • Solving the Blackjack Example

What is Monte Carlo

DP methods need to know the transition probabilities and have high time complexity.

I observe that in MC there is still a discount factor $\gamma$: the rewards following a state are discounted by powers of $\gamma$ and summed into the return, and these returns are averaged to estimate the state's value function.
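To make the discounting-and-averaging step concrete, here is a minimal first-visit Monte Carlo prediction sketch; it assumes each episode has already been collected as a list of `(state, reward)` pairs, where the reward is the one received on leaving that state (names are illustrative).

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V(s) by averaging the discounted return that follows the first visit to s."""
    returns = defaultdict(list)              # state -> list of observed returns
    for episode in episodes:                 # episode: [(state, reward), ...]
        # Sweep backward so that at step t, G = R_{t+1} + gamma*R_{t+2} + ...
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns_at[t] = G
        # Record the return only at the first visit to each state.
        first_visits = {s: t for t, (s, _) in reversed(list(enumerate(episode)))}
        for s, t in first_visits.items():
            returns[s].append(returns_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```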

Read more »

Preface

While doing systems research, I found it worthwhile to understand the real implementation details of every component (file system, memory management, etc.). So I am starting a new chapter here to record my notes and experience from reading Understanding the Linux Kernel, Third Edition by Daniel P. Bovet. I hope that after reading this book I can understand the papers at OSDI and come up with more useful, novel ideas, rather than thinking in the abstract without considering the real problems and architecture of the operating system.

Read more »

Preface

We are starting the Coursera course Sample-based Learning Methods now. As before, I will excerpt sentences from Sutton's book, but this time I will mark my own comprehension in red.

ReinforcementLearning-Principle-Day6: Monte-Carlo

Introduction

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component.

Monte Carlo methods sample and average returns for each state–action pair much like the bandit methods we explored in Chapter 2 sample and average rewards for each action. The main difference is that now there are multiple states, each acting like a different bandit problem (like an associative-search or contextual bandit) and the different bandit problems are interrelated. That is, the return after taking an action in one state depends on the actions taken in later states in the same episode. Because all the action selections are undergoing learning, the problem becomes nonstationary from the point of view of the earlier state.

Read more »

Reinforcement Learning Day 4 (Policy Evaluation)

  • Abstract
  • Policy Evaluation
  • Policy Improvement
  • Policy Iteration
  • Value Iteration
  • Asynchronous Dynamic Programming
  • Generalized Policy Iteration
  • Efficiency of Dynamic Programming
  • Summary
  • Policy Evaluation vs. Control
  • Iterative Policy Evaluation
  • Policy Improvement
  • Policy Iteration
  • Flexibility of the Policy Iteration Framework
  • Efficiency of Dynamic Programming
  • Approximate Dynamic Programming for Fleet Management
Read more »

MetaLearning Learning Note - 4

  • Optimization-based meta-learning
  • Non-parametric few-shot learning
  • Properties of meta-learning algorithms

Recap: Optimization-Based Meta-Learning

  1. Fine-tuning => MAML: our pre-trained parameters are fine-tuned at test time, $\theta \rightarrow \phi$. The adapted parameters $\phi$ are initialized from $\theta$, so before adaptation the two variables hold the same values (see the sketch after this recap).

  2. Probabilistic Interpretation of Optimization-Based Inference

The meta-parameters $\theta$ serve as a prior;

$$\max_{\theta} \log \prod_i p(D_i \mid \theta)$$

If we use $\theta$ to initialize $\phi$, we have:

$$\log \prod_i \int p(D_i \mid \phi_i)\, p(\phi_i \mid \theta)\, d\phi_i$$

$$\approx \log \prod_i p(D_i \mid \overline{\phi}_i)$$

which is the empirical Bayes approximation, with $\overline{\phi}_i$ a point estimate of the task parameters.
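A minimal first-order sketch of this fine-tuning view (my own illustration, not from the lecture): parameters are flat lists of floats, `grad_loss(params, data)` is a user-supplied gradient function, and the second-order term through the inner update is dropped.

```python
def maml_outer_step(theta, tasks, grad_loss, inner_lr=0.01, outer_lr=0.001, inner_steps=1):
    """One meta-update: adapt theta -> phi_i on each task's support set,
    then move theta using the query-set gradient evaluated at phi_i."""
    meta_grad = [0.0] * len(theta)
    for support, query in tasks:
        # Inner loop: phi_i is initialized from theta (the same values before adaptation).
        phi = list(theta)
        for _ in range(inner_steps):
            g = grad_loss(phi, support)
            phi = [p - inner_lr * gi for p, gi in zip(phi, g)]
        # Outer loop: the prior theta is updated so that the adapted phi_i does well on the query set.
        g_query = grad_loss(phi, query)
        meta_grad = [m + gq for m, gq in zip(meta_grad, g_query)]
    return [t - outer_lr * m / len(tasks) for t, m in zip(theta, meta_grad)]
```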

Read more »

Reinforcement Learning Day 4 (Finite Markov Decision Processes's Coursera Video Notes)

  • Specifying Policies
  • Value Functions
  • Action-value function
  • Bellman Equation Derivation
  • Intuition - Bellman Equation
  • Optimal Policy
  • Summary

Specifying Policies

  • Policies can depend only on the current state, not on previous states or on time.

Value Functions

  • There are state-value functions and action-value (state-action) functions.

Value functions summarize the return an agent can expect to obtain, as formalized below.

The state-value function $v_\pi(s)$ gives the expected return when starting from state $s$ and following policy $\pi$.

The action-value function $q_\pi(s, a)$ gives the expected return when taking action $a$ in state $s$ and then following $\pi$ in all subsequent states.
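For reference, the standard definitions in Sutton & Barto's notation are:

$$v_\pi(s) \doteq \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$$

$$q_\pi(s, a) \doteq \mathbb{E}_\pi\!\left[ G_t \mid S_t = s, A_t = a \right]$$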

Read more »

Reinforcement Learning Day 3 (Finite Markov Decision Processes)

  • Return, Policy and Value Function
  • Optimal Policies and Optimal Value Functions
  • Coursera False Questions
  • Optimality and Approximation
  • Summary

Goal of Reinforcement Learning

  • Michael Littman: identify where reward signals come from;
  • develop algorithms that search the space of behaviors to maximize reward signals;

MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards.

In MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections.
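For reference, these optimal value functions satisfy the Bellman optimality equations:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]$$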

Read more »