Preface
Prof. Song and Dr. Xu asked me to do some work around reinforcement learning and meta-learning. Therefore, I chose to learn meta-learning through Stanford's coursework taught by Chelsea Finn.
Meta-Learning Learning Note - 1
Stanford CS330: Multi-Task and Meta-Learning, 2019 | Lecture 2 - Multi-Task & Meta-Learning Basics
First the lecture introduces some notation. Each task also has a task descriptor, which can carry features, for example user features for personalization. The objective sums the per-task losses and is minimized over the parameters $\theta$:
$$\min_\theta \sum_{i=1}^{T}{\mathcal{L}_i(\theta,\mathcal{D}_i)}$$
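To make the summed objective concrete for myself, here is a minimal sketch. It assumes a single shared scalar parameter `theta`, a squared-error loss, and two tiny hand-made tasks. The names `task_loss` and `multitask_objective` and the data are all hypothetical and not from the lecture.

```python
# Toy sketch of the multi-task objective: a sum of per-task losses.
# Assumption: one shared scalar parameter theta and the model y = theta * x.

def task_loss(theta, dataset):
    """Mean squared error of y = theta * x on one task's dataset."""
    return sum((theta * x - y) ** 2 for x, y in dataset) / len(dataset)

def multitask_objective(theta, datasets):
    """Sum of the per-task losses over all T tasks."""
    return sum(task_loss(theta, d) for d in datasets)

# Two hypothetical tasks: targets follow y = 1.0 * x and y = 3.0 * x.
datasets = [
    [(1.0, 1.0), (2.0, 2.0)],
    [(1.0, 3.0), (2.0, 6.0)],
]

# Crude grid search over theta in {0.0, 0.1, ..., 4.0}.
candidates = [i / 10 for i in range(0, 41)]
best_theta = min(candidates, key=lambda t: multitask_objective(t, datasets))
print(best_theta, multitask_objective(best_theta, datasets))
```

Because everything is shared here, the best `theta` is a compromise between the two tasks and fits neither exactly, which is one way to picture the limited-capacity problem that shows up later in these notes.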
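The course says that conditioning the network on a one-hot task vector can restrict which parameters are shared across tasks.
The conditioning options are concatenation-based and additive conditioning, multi-head architectures, and multiplicative conditioning. Choosing among them has some drawbacks:
- It turns into problem-dependent neural network tuning
- It is largely guided by intuition and knowledge of the problem
- It is more an art than a science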
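The difference between concatenation-based and multiplicative conditioning can be sketched as below. This is only a toy illustration under my own assumptions: a one-hot task descriptor `z`, plain Python lists instead of tensors, and a hand-written `gates` table. All function names are hypothetical.

```python
# Toy sketch of two ways to condition on a one-hot task descriptor z.

def one_hot(task_id, num_tasks):
    """One-hot task descriptor, e.g. task 1 of 2 -> [0.0, 1.0]."""
    return [1.0 if i == task_id else 0.0 for i in range(num_tasks)]

def concat_condition(x, z):
    """Concatenation-based conditioning: append z to the input features.
    Every later weight is then shared by all tasks."""
    return list(x) + list(z)

def multiplicative_condition(hidden, z, gates):
    """Multiplicative conditioning: scale each hidden unit by a task gate.
    gates[k][j] is the (hypothetical) gate of hidden unit j for task k."""
    gate = [sum(z[k] * gates[k][j] for k in range(len(z)))
            for j in range(len(hidden))]
    return [h * g for h, g in zip(hidden, gate)]

x = [0.3, 0.7]        # input features
hidden = [0.5, 2.0]   # some hidden activations
# Binary gates: task 0 keeps only unit 0, task 1 keeps only unit 1.
gates = [[1.0, 0.0], [0.0, 1.0]]

concat_out = concat_condition(x, one_hot(1, 2))
task0_hidden = multiplicative_condition(hidden, one_hot(0, 2), gates)
task1_hidden = multiplicative_condition(hidden, one_hot(1, 2), gates)
print(concat_out, task0_hidden, task1_hidden)
```

With these binary gates the two tasks use disjoint hidden units, so nothing is shared from that point on, while concatenation keeps all weights shared. Real gates would be learned and could sit anywhere between the two extremes.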
Then the lecture introduces some challenges:
- Negative transfer: independent networks sometimes work better, because of optimization challenges or limited representational capacity. One remedy is to share less, for example with soft parameter sharing.
- Overfitting: this means you are not sharing enough.
Case study: recommending what video to watch next on YouTube. The input is what the user is currently watching plus user features. The system first generates a pool of candidate videos.
The inputs for ranking are the query video, the candidate video, and user and context features. The tasks include binary classification, such as whether the user clicks, and regression, such as time spent. YouTube uses a multi-gate technique with softmax gating, so that each task softly weighs the shared representations built from these features.
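To see what softmax gating with one gate per task does, here is a rough sketch. It makes strong simplifying assumptions: the expert outputs and the gate scores are fixed numbers, whereas in a real system both would be computed from the input by learned layers. The function and variable names are hypothetical, and this is not YouTube's actual implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_mixture(expert_outputs, gate_scores):
    """Softly combine expert outputs using softmax gate weights."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]

# Two shared experts, each producing a 2-d representation (made-up values).
expert_outputs = [[1.0, 0.0], [0.0, 1.0]]

# One gate per task: each task has its own scores over the same experts.
click_gate_scores = [0.0, 0.0]       # click task weighs both experts equally
watch_time_gate_scores = [2.0, 0.0]  # watch-time task prefers expert 0

click_repr = gated_mixture(expert_outputs, click_gate_scores)
watch_repr = gated_mixture(expert_outputs, watch_time_gate_scores)
print(click_repr, watch_repr)
```

Each task then feeds its own mixture into its own output head, so the experts are shared but how much each task relies on them is not.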
The lecture then introduces meta-learning.
Two ways to view meta-learning:
- mechanistic view
- probabilistic view
Viewed probabilistically, supervised learning maximizes the posterior over the parameters, $\arg\max_\phi \log p(\phi|\mathcal{D})$, which means finding the most probable parameters given the data. By Bayes' rule this equals $\arg\max_\phi \log p(\mathcal{D}|\phi) + \log p(\phi)$, where $p(\mathcal{D}|\phi)$ is the probability of the data under those parameters, which we call the likelihood, and $p(\phi)$ is a prior.
$\phi$ denotes the parameters for the new task and $\theta$ the meta-parameters. Therefore, meta-learning can be summarized as below:
$$\theta^* = \arg\max_\theta \log p(\theta|\mathcal{D}_{\text{meta-train}})$$
$$\phi^* = \arg\max_\phi \log p(\phi|\mathcal{D},\theta^*)$$
We call the first step meta-learning and the second step adaptation.
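As I understand it, the split into two steps comes from the following sketch. It assumes $\theta$ are the meta-parameters, $\phi$ the parameters of the new task with data $\mathcal{D}$, that $\phi$ depends on $\mathcal{D}_{\text{meta-train}}$ only through $\theta$, and that the posterior over $\theta$ is replaced by a point estimate $\theta^*$:

$$\log p(\phi|\mathcal{D},\mathcal{D}_{\text{meta-train}}) = \log \int_\Theta p(\phi|\mathcal{D},\theta)\,p(\theta|\mathcal{D}_{\text{meta-train}})\,d\theta$$
$$\approx \log p(\phi|\mathcal{D},\theta^*) + \log p(\theta^*|\mathcal{D}_{\text{meta-train}})$$

The second term does not depend on $\phi$, so maximizing over $\phi$ only needs $\log p(\phi|\mathcal{D},\theta^*)$, while $\theta^*$ itself is found once from the meta-training data.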
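The key idea of meta-learning is “test and train conditions must match”.
$\phi$ denotes the task-specific parameters, which differ from task to task, and our goal is to maximize their probability on each task's test data. $\theta$ in turn leads to $\phi$ through the task's training data:
$$\theta^* = \arg\max_\theta \sum_{i=1}^{n} \log p(\phi_i|\mathcal{D}_i^{\text{test}}), \quad \phi_i = f_\theta(\mathcal{D}_i^{\text{train}})$$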
The figure above shows how the datasets inside the meta-dataset differ, and we want to find the best $\theta^*$.
$\theta$ can be taken to be, for example, the hyperparameters or the network architecture.
Question: what is $\phi$ before we learn $\theta$?
Sol: Oh! We use the meta-train set to learn $\phi$, then use this $\phi$ to learn $\theta^*$.
Summary
Today I learned the introduction to multi-task learning and meta-learning. It looks interesting. I figured out meta-learning's core property: we learn $\phi$ on the train data of the meta-train set and then use $\phi$ to learn $\theta^*$ on the test data of the meta-train set. Finally, once we have $\theta^*$, we learn $\phi$ on the training set of the new task and test it on the test set. I will keep learning and review this part in future work.