MetaLearning-Stanford-Lecture2

Preface

I chose to learn meta-learning through Stanford's coursework. Prof. Song and Dr. Xu asked me to do some work on reinforcement learning and meta-learning, so I am following the Stanford course taught by Chelsea Finn.

MetaLearning Learning Note - 1

Stanford CS330: Multi-Task and Meta-Learning, 2019 | Lecture 2 - Multi-Task & Meta-Learning Basics

First the lecture introduces some notation for multi-task and meta-learning. Each task comes with a task descriptor, which can include features used for personalization. The objective is summarized below:

$$\sum_{i=1}^{T} \mathbb{L}(\theta, \mathcal{D}_i)$$
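To make the summed objective concrete, here is a minimal sketch in plain Python. The toy setup is my own assumption and not from the lecture: every task is a 1-D linear regression, and all tasks share a single weight `theta`. The names `task_loss` and `multi_task_objective` are hypothetical.

```python
# Assumed toy setup: each task is 1-D linear regression with one shared weight theta.
def task_loss(theta, dataset):
    """Mean squared error of the shared model y = theta * x on one task's dataset."""
    return sum((y - theta * x) ** 2 for x, y in dataset) / len(dataset)


def multi_task_objective(theta, datasets):
    """Sum of the per-task losses over all T tasks."""
    return sum(task_loss(theta, d) for d in datasets)


datasets = [
    [(1.0, 2.0), (2.0, 4.0)],  # task 1 follows y = 2x
    [(1.0, 3.0), (2.0, 6.0)],  # task 2 follows y = 3x
]

# A shared theta between the two task optima lowers the summed loss
# compared with fitting task 1 alone.
loss_fit_task1 = multi_task_objective(2.0, datasets)
loss_compromise = multi_task_objective(2.5, datasets)
```

With one shared weight, no value of `theta` fits both tasks perfectly, which is one simple way to see why sharing everything can hurt.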

The course notes that the task descriptor can be as simple as a one-hot task index, and that how the network is conditioned on it restricts which parameters are shared.

Ways to condition on the task: concatenation-based and additive conditioning, multi-head architectures, and multiplicative conditioning.
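A small sketch of these conditioning choices, in plain Python with made-up weights (all names and numbers here are my assumptions). It illustrates that, for a single linear layer, concatenating the task descriptor `z` to the input gives the same result as adding a task-dependent offset, and that a one-hot multiplicative gate switches hidden units on or off per task.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def concat_conditioning(W, x, z):
    """Linear layer applied to the concatenation [x; z]."""
    return matvec(W, x + z)


def additive_conditioning(W_x, W_z, x, z):
    """Linear layer on x plus a task-dependent offset computed from z."""
    return [a + b for a, b in zip(matvec(W_x, x), matvec(W_z, z))]


def multiplicative_conditioning(h, gate):
    """Element-wise gating of hidden units by a task-dependent gate."""
    return [a * g for a, g in zip(h, gate)]


x = [1.0, 2.0]
z = [0.0, 1.0]  # one-hot descriptor: task 2 of 2
W = [[1.0, 0.5, 0.2, -0.3],
     [0.0, 1.0, 0.4, 0.1]]
W_x = [row[:2] for row in W]  # columns of W that act on x
W_z = [row[2:] for row in W]  # columns of W that act on z

h_concat = concat_conditioning(W, x, z)
h_add = additive_conditioning(W_x, W_z, x, z)
h_gated = multiplicative_conditioning(h_concat, z)  # only task 2's unit stays active
```

Here the one-hot vector is reused directly as the gate only to keep the example short; in practice the gate would be computed from `z` by a learned layer.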

  • It turns into problem-dependent neural network tuning
  • largely guided by intuition and knowledge of the problem
  • more of an art than a science

Then the lecture introduces some challenges:

  • Negative transfer: sometimes independent networks work better, because of optimization challenges and limited representational capacity.

One option here is soft parameter sharing.

  • Overfitting: this means you are not sharing enough.
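Soft parameter sharing, mentioned above, can be sketched as a penalty that pulls per-task parameters toward each other instead of forcing them to be identical. The pairwise squared-distance form and the name `soft_sharing_penalty` are my assumptions for illustration; other penalty forms are possible.

```python
def soft_sharing_penalty(task_params, strength):
    """strength * sum over task pairs of the squared distance between parameter vectors."""
    total = 0.0
    for i in range(len(task_params)):
        for j in range(i + 1, len(task_params)):
            total += sum((a - b) ** 2 for a, b in zip(task_params[i], task_params[j]))
    return strength * total


# Identical per-task parameters incur no penalty; distant ones are penalized.
no_penalty = soft_sharing_penalty([[1.0, 2.0], [1.0, 2.0]], strength=0.1)
some_penalty = soft_sharing_penalty([[0.0, 0.0], [3.0, 4.0]], strength=0.1)
```

Adding this term to the sum of task losses gives a knob between the two failure modes: a larger `strength` shares more (helpful against overfitting), a smaller one shares less (helpful against negative transfer).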

Case study: recommending what video to watch next on YouTube. The input is what the user is currently watching plus user features. The system first generates a pool of candidate videos.

The inputs are the query video, the candidate video, and user and context features. The model predicts several objectives: binary classification tasks such as whether the user clicks, and regression tasks such as time spent. YouTube uses softmax gating, a multi-gate mixture-of-experts technique, so that each task weighs the shared features softly.
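The softmax gating idea can be sketched as follows. This is a toy illustration under my own assumptions (two hand-written "expert" outputs and fixed gate scores), not YouTube's actual model: each task has its own gate, and the gate's softmax weights decide how much of each shared expert the task uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mixture(expert_outputs, gate_scores):
    """Weighted sum of expert output vectors, with weights softmax(gate_scores)."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, expert_outputs)) for k in range(dim)]


# Two shared experts, each producing a 2-D feature vector (values are made up).
expert_outputs = [[1.0, 0.0], [0.0, 1.0]]

# One gate per task: the click task leans on expert 0, the watch-time task on expert 1.
click_features = gated_mixture(expert_outputs, [2.0, 0.0])
watch_time_features = gated_mixture(expert_outputs, [0.0, 2.0])
```

In a real model the gate scores would come from a learned layer applied to the input, so the mixing can change per example as well as per task.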


Then the lecture introduces meta-learning.

Two ways to view meta-learning

  • mechanistic view

  • probabilistic view

In supervised learning, under the probabilistic view, we maximize the posterior over the parameters, $\arg\max_\phi \log p(\phi \mid \mathcal{D})$, which means finding the most probable parameters given the data. By Bayes' rule this equals $\arg\max_\phi \log p(\mathcal{D} \mid \phi) + \log p(\phi)$, where $p(\mathcal{D} \mid \phi)$ tells us how probable the observed data is under parameters $\phi$; we call it the likelihood.

$\phi$ denotes the parameters for the new task. Meta-learning can therefore be summarized as below:

$$\theta^* = \arg\max_\theta \log p(\theta \mid \mathcal{D}_{meta\_train})$$

$$\phi^* = \arg\max_\phi \log p(\phi \mid \mathcal{D}, \theta^*)$$

We call the first step meta-learning and the second step adaptation.
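The two steps can be tied together with a short derivation. This is a sketch under two assumptions that I state explicitly: the task parameters $\phi$ depend on the meta-training data only through the meta-parameters $\theta$, and the integral over $\theta$ is approximated by a single point estimate $\theta^*$.

```latex
\begin{aligned}
\log p(\phi \mid \mathcal{D}, \mathcal{D}_{meta\_train})
  &= \log \int_{\Theta} p(\phi \mid \mathcal{D}, \theta)\,
     p(\theta \mid \mathcal{D}_{meta\_train})\, d\theta \\
  &\approx \log p(\phi \mid \mathcal{D}, \theta^*)
     + \log p(\theta^* \mid \mathcal{D}_{meta\_train})
\end{aligned}
```

The second term does not depend on $\phi$, so maximizing over $\phi$ only involves $\log p(\phi \mid \mathcal{D}, \theta^*)$. That is the adaptation step, while finding $\theta^*$ is the meta-learning step.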

The key idea of meta-learning is that "test and train conditions must match".

$\phi$ denotes the task-specific parameters, which differ from task to task, while $\theta$ denotes the meta-parameters that we want to optimize. Given $\theta$, we can obtain $\phi$ for each task.

$$\theta^* = \arg\max_\theta \log p(\phi \mid \mathcal{D}^{test}), \quad \text{where } \phi = f_\theta(\mathcal{D}^{train})$$

The figure above shows the different parts of the meta-dataset, and we want to find the best $\theta^*$.

$\theta$ can be interpreted broadly, for example as hyperparameters or a network architecture.

Question: what is $\phi$ before we learn $\theta$?

Sol: Oh! Within the meta-train set, we use each task's training data to learn $\phi$, then use this $\phi$ on the task's test data to learn $\theta^*$.
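The answer above can be sketched as a tiny meta-training loop. Everything in this toy is my own assumption rather than the lecture's algorithm: tasks are 1-D regressions $y = \text{slope} \cdot x$, $\theta$ acts as a prior mean for the task weight, and adaptation is a ridge-style closed form that pulls $\phi$ toward $\theta$. The names `adapt`, `held_out_loss`, `meta_loss`, and `meta_train` are hypothetical.

```python
def adapt(theta, train, lam=1.0):
    """Adaptation: phi minimizes sum (y - phi*x)^2 + lam*(phi - theta)^2 (closed form)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return (sxy + lam * theta) / (sxx + lam)


def held_out_loss(phi, test):
    """Squared error of the adapted parameter phi on a task's test set."""
    return sum((y - phi * x) ** 2 for x, y in test)


def meta_loss(theta, tasks, lam=1.0):
    """Sum over tasks of the test loss after adapting on that task's training set."""
    return sum(held_out_loss(adapt(theta, tr, lam), ts) for tr, ts in tasks)


def meta_train(tasks, theta=0.0, lam=1.0, lr=0.05, steps=200):
    """Gradient descent on theta through the adaptation step (chain rule by hand)."""
    for _ in range(steps):
        grad = 0.0
        for train, test in tasks:
            phi = adapt(theta, train, lam)
            dphi_dtheta = lam / (sum(x * x for x, _ in train) + lam)
            dloss_dphi = sum(-2.0 * x * (y - phi * x) for x, y in test)
            grad += dloss_dphi * dphi_dtheta
        theta -= lr * grad
    return theta


def make_task(slope):
    """One task: a (train set, test set) pair drawn from y = slope * x."""
    return ([(1.0, slope * 1.0)], [(2.0, slope * 2.0)])


tasks = [make_task(s) for s in (1.0, 2.0, 3.0)]   # the meta-train set
theta_star = meta_train(tasks)                     # meta-learning
phi_new = adapt(theta_star, [(1.0, 2.5)])          # adaptation on a new task
```

Note that $\phi$ never needs to exist before $\theta$: at every step, $\phi$ is recomputed from the current $\theta$ and the task's training data, and only the test loss is used to update $\theta$.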

Summary

Today I learned the basics of multi-task learning and meta-learning. It looks interesting. I figured out meta-learning's core property: we use the training data of each task in the meta-train set to learn $\phi$, and then use $\phi$ on the test data of the meta-train set to learn $\theta^*$. Finally, with $\theta^*$ in hand, we learn $\phi$ on the training set of a new task and evaluate it on that task's test set. I will keep learning and review this part in future work.