
MetaLearning-Stanford-Lecture3

Meta-Learning Learning Note - 2

Stanford CS330: Multi-Task and Meta-Learning, 2019 | Lecture 3 - Optimization-Based Meta-Learning

  • Recap of the probabilistic formulation of meta-learning
  • General recipe of a meta-learning algorithm
  • Black-box adaptation approaches
  • Optimization-based meta-learning algorithms

Recap

Meta-learning first learns $\theta$, then uses it to learn $\phi$ on each task's training dataset.

We will use the Omniglot dataset: 1623 characters from 50 different alphabets.

In her introduction to meta-supervised learning:

The input is $D^{tr}, x^{test}, \theta$.

The steps to design a meta-learning algorithm

  1. choosing a form of $p(\phi \mid D^{tr}_i, \theta)$
  2. choosing how to optimize $\theta$

Treat $p(\phi \mid D^{tr}_i, \theta)$ as an inference problem?

Black-Box Adaptation

We can see that we use $D^{tr}_i$ as training data and generate the resulting parameters $\phi$.

In my opinion: they train $\theta$ to generate $\phi_i$, then use $\phi_i$ to predict $y^{test}$, and we need to minimize this test loss!

Yes, the steps of black-box training:

  1. Sample task $T_i$
  2. Sample disjoint datasets $D^{tr}_i, D^{test}_i$
  3. Compute $\phi \leftarrow f_\theta(D^{tr}_i)$
  4. Update $\theta$ using $\nabla_\theta \mathcal{L}(\phi, D^{test}_i)$
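The four steps above can be sketched in code. This is a minimal toy sketch of my own (not from the lecture): tasks are 1-D noiseless regressions $y = wx$, the "black box" $f_\theta$ is a single linear map from a hand-picked summary of $D^{tr}$ to a scalar $\phi$, and finite differences stand in for backprop so the loop stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # Step 1: each task is a 1-D regression y = w * x with its own slope w.
    w = rng.uniform(-2.0, 2.0)
    def sample_set(n):
        x = rng.normal(size=n)
        return x, w * x
    return sample_set

def f_theta(theta, x_tr, y_tr):
    # Step 3: the "black box" maps a summary of D^tr to the task
    # parameter phi (a real model would use an RNN/transformer).
    s = np.array([np.mean(x_tr * y_tr), np.mean(x_tr ** 2)])
    return theta @ s

def meta_loss(theta, x_tr, y_tr, x_te, y_te):
    phi = f_theta(theta, x_tr, y_tr)
    return np.mean((phi * x_te - y_te) ** 2)   # loss of phi on D^test

theta, lr, eps = np.zeros(2), 0.05, 1e-5
for step in range(500):
    task = sample_task()
    x_tr, y_tr = task(20)                      # Step 2: disjoint D^tr
    x_te, y_te = task(20)                      #         and D^test
    # Step 4: update theta with the gradient of the test loss
    # (central finite differences instead of autodiff, for brevity).
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta); e[j] = eps
        grad[j] = (meta_loss(theta + e, x_tr, y_tr, x_te, y_te)
                   - meta_loss(theta - e, x_tr, y_tr, x_te, y_te)) / (2 * eps)
    theta -= lr * grad
```

After meta-training, the learned $\theta$ should give a much lower test loss on fresh tasks than an untrained $\theta = 0$.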

Challenge: outputting all neural net parameters does not seem scalable.

We only output sufficient statistics (SNAIL!). Oh, I have read this paper but I don't understand it. Let me try to review it tomorrow.

The first homework is on Omniglot.

The problem with black-box adaptation: it needs a large amount of meta-training data - it is data-inefficient.

Optimization-Based Inference

Motivation: in the approach above, to keep $\theta$ scalable, we only generate a few fixed parameters from our meta-learner. How can we generate all the parameters?

First we get the formula:

$\max_{\phi_i} \log p(D^{tr}_i \mid \phi_i) + \log p(\phi_i \mid \theta)$

Why does $\theta$ appear in the second term of the formula?

The meta-parameters serve as a prior. (Recap: fine-tune your parameters on the new task's data!)

Choosing how to pre-train is more of an art than a science.

Our goal:

$\min_\theta \sum_{\text{task } i} \mathcal{L}(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, D^{tr}_i), D^{test}_i)$

and MAML:

Key idea: acquire $\phi_i$ through optimization.

  1. Sample task $T_i$
  2. Sample disjoint datasets $D^{tr}_i, D^{test}_i$ from $D_i$
  3. Optimize $\phi_i \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta, D^{tr}_i)$
  4. Use $\phi_i$ from step 3 to update $\theta$ with $\nabla_\theta \mathcal{L}(\phi_i, D^{test}_i)$
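The MAML steps above can be sketched on a toy problem where every derivative is available in closed form, so the second-order term in the outer update is explicit. This is my own illustration (scalar $\theta$, noiseless 1-D tasks $y = wx$), not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 0.3, 0.05            # inner- and outer-loop learning rates

def sample_task():
    # Step 1: each task is y = w * x with its own slope w.
    w = rng.uniform(0.5, 2.5)
    def sample_set(n):
        x = rng.normal(size=n)
        return x, w * x
    return sample_set

def grad_L(theta, x, y):
    # dL/dtheta of the squared loss L = mean((theta * x - y)^2)
    return 2 * np.mean(x * (theta * x - y))

def maml_step(theta):
    task = sample_task()
    x_tr, y_tr = task(10)          # Step 2: disjoint D^tr and D^test
    x_te, y_te = task(10)
    # Step 3: inner adaptation  phi = theta - alpha * dL(theta, D^tr)
    phi = theta - alpha * grad_L(theta, x_tr, y_tr)
    # Step 4: outer update via the chain rule:
    # d/dtheta L(phi, D^test) = dL/dphi(phi, D^test) * (1 - alpha * H),
    # where H = d^2L/dtheta^2 on D^tr (a scalar Hessian here).
    hessian = 2 * np.mean(x_tr ** 2)
    outer_grad = grad_L(phi, x_te, y_te) * (1 - alpha * hessian)
    return theta - beta * outer_grad

theta = 0.0
for _ in range(2000):
    theta = maml_step(theta)
```

The learned initialization should give a lower post-adaptation test loss on fresh tasks than adapting from the starting point $\theta = 0$.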

Specifically, $\phi_i$ is initialized as $\theta$ itself. MAML sums all the tasks' gradients to take one outer gradient step on $\theta$ and find the best initialization.

Do we need to care about the second-order derivative?

Following her formula:

$\phi = u(\theta, D^{tr})$, where $d$ denotes the total derivative and $\nabla$ the partial derivative.

$\min_\theta \mathcal{L}(\phi, D^{test}) = \min_\theta \mathcal{L}(u(\theta, D^{tr}), D^{test})$

Let us look at the gradient computation for this objective.

$\frac{d}{d\theta}\mathcal{L}(\phi, D^{test}) = \nabla_\phi \mathcal{L}(\phi, D^{test})\big|_{\phi = u(\theta, D^{tr})} \, d_\theta u(\theta, D^{tr})$

I think the first-order derivative with respect to $\phi$ becomes a second-order derivative with respect to $\theta$?

Let $u(\theta, D^{tr}) = \theta - \alpha \, d_\theta \mathcal{L}(\theta, D^{tr})$.

Then $d_\theta u(\theta, D^{tr}) = I - \alpha \, d^2_\theta \mathcal{L}(\theta, D^{tr})$, where $d^2_\theta \mathcal{L}(\theta, D^{tr})$ is the Hessian matrix for this problem.
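This identity is easy to check numerically. In this small sketch of my own (a scalar $\theta$ and a toy quadratic loss, so the "Hessian" is just $2\,\overline{x^2}$), a finite-difference derivative of $u$ should match $1 - \alpha \cdot d^2_\theta \mathcal{L}$:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.1
x = rng.normal(size=50)
y = 1.7 * x                        # one task's toy D^tr (assumed data)

def dL(theta):
    # first derivative of L(theta) = mean((theta * x - y)^2)
    return 2 * np.mean(x * (theta * x - y))

def u(theta):
    # one inner gradient step: u(theta) = theta - alpha * dL(theta)
    return theta - alpha * dL(theta)

theta0, eps = 0.3, 1e-6
# finite-difference estimate of d_theta u
fd = (u(theta0 + eps) - u(theta0 - eps)) / (2 * eps)
# analytic value: I - alpha * Hessian, with Hessian = 2 * mean(x^2)
analytic = 1 - alpha * 2 * np.mean(x ** 2)
```

Since the loss is quadratic, the two values agree up to floating-point error.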

I am a little confused about $\phi$ and $\theta$: which one is the meta-parameters? ($\theta$ is the meta-parameter; each $\phi_i$ is task-specific.)

So repeating the inner update means computing higher-order derivatives.