Bayes meets Bernstein at the Meta Level: an Analysis of Fast Rates in Meta-Learning with PAC-Bayes
Abstract
Bernstein’s condition is a key assumption that guarantees fast rates in machine learning. For example, the Gibbs algorithm with prior $\pi$ has an excess risk in $O(d_{\pi}/n)$, as opposed to the standard $O(\sqrt{d_{\pi}/n})$, where $n$ denotes the number of observations and $d_{\pi}$ is a complexity parameter which depends on the prior $\pi$. In this paper, we examine the Gibbs algorithm in the context of meta-learning, i.e., when learning the prior $\pi$ from $T$ tasks (with $n$ observations each) generated by a meta distribution. Our main result is that Bernstein’s condition always holds at the meta level, regardless of its validity at the observation level. This implies that the additional cost to learn the Gibbs prior $\pi$, which will reduce the term $d_{\pi}$ across tasks, is in $O(1/T)$, instead of the expected $O(1/\sqrt{T})$. We further illustrate how this result improves on the standard rates in three different settings: discrete priors, Gaussian priors and mixture of Gaussians priors.
Keywords: Bernstein’s condition, meta-learning, fast rates, PAC-Bayes bounds, information bounds, Gibbs algorithm, variational approximations.
1 Introduction
One of the greatest promises of artificial intelligence is the ability to design autonomous systems that can learn from different situations throughout their lives and adapt quickly to new environments, as humans, animals and other living things naturally do. Based on the intuition that a new problem often has significant similarities to previously encountered tasks, the use of past experience is particularly important in areas such as computer vision (Quattoni et al., 2008; Kulis et al., 2011; Li et al., 2018; Achille et al., 2019), natural language processing (Huang et al., 2018; Gu et al., 2018; Dou et al., 2019; Qian and Yu, 2019) and reinforcement learning (Finn et al., 2017; Mishra et al., 2018; Wang et al., 2016; Yu et al., 2020), where the learner has access to only a few training examples for the task of interest, but for which a vast amount of datasets from a variety of related tasks is available. In the area of digit recognition, for example, it is possible to leverage the experience gained from millions of similar open source image classification datasets, as the key features needed to distinguish cats from dogs or pants from shirts can be used to classify handwritten digits. This idea is at the heart of meta-learning (Thrun and Pratt, 1998; Baxter, 2000; Vanschoren, 2019), a field that has recently attracted a lot of attention due to its huge success in real-world applications, and which aims to improve performance on a particular task by transferring the knowledge contained in different but related tasks.
Meta-learning has been widely studied in the recent literature, although the term has been used to refer to a wide range of situations, and providing a precise definition of meta-learning is a challenge. In particular, the terms transfer learning and multi-task learning, although distinct, are often used interchangeably with meta-learning. Transfer learning is a very general concept that involves two tasks that share similarities, a source and a target, and consists in transferring the knowledge acquired on the source dataset to better process the target data (Pan and Yang, 2010; Zhuang et al., 2020). In practice, this can be formulated in many different ways, but the most popular approach is to pre-train a model on the source data, e.g., images of cats and dogs, and then to fine-tune it on the target training dataset, e.g., images of handwritten digits. In particular, a challenging problem in transfer learning is to find a measure that quantifies the similarity between the source and target tasks. Multi-task learning adopts a different framework, where multiple learning tasks are considered and the goal is to learn a model that can handle all tasks simultaneously (Caruana, 1997; Zhang and Yang, 2021). The model usually has a common representation, e.g., the first layers of a deep neural network, and a task-specific component, e.g., the last layer of the network. Meta-learning also considers a collection of datasets from a variety of tasks, but unlike multi-task learning, we are not interested in solving a fixed set of tasks, but rather in being prepared for future tasks that are not yet given. Also, unlike transfer learning, meta-learning exploits the commonality of previous tasks rather than the similarity between some specific source and target tasks. We use these metadata to design a meta-procedure that adaptively learns a predictor for any new learning task that is a priori unknown; the goal is to quickly learn to adapt a learning procedure from past experience. Meta-learning is therefore sometimes referred to as learning-to-learn, or lifelong learning in the online context. The implementation of this learning-to-learn mechanism can take different forms, which we briefly describe in the following paragraph.
As the name suggests, meta-learning involves two levels of abstraction to improve learning over time: a meta-level and a within-task level. At the within-task level, the new task of interest is presented and the corresponding pattern is learned from the training data set of the task at hand. This learning process is greatly accelerated by a meta-learner, which has distilled the knowledge accumulated in previous tasks into the within-task model. The meta-learning procedure can accelerate the within-task algorithm in various ways, and three main categories stand out in the literature: metric-based methods, which are based on non-parametric predictive models governed by a metric that is learned using the meta-training dataset (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018); model-based methods, which quickly update the parameters in a few learning steps, which can be achieved by the model’s internal architecture or another meta-learning model (Santoro et al., 2016; Munkhdalai and Yu, 2017; Mishra et al., 2018); and optimisation-based methods, which mainly involve learning the hyper-parameters of a within-task algorithm using the meta-training set for fast adaptation (Hochreiter et al., 2001; Ravi and Larochelle, 2017; Finn et al., 2017; Nichol et al., 2018; Qiao et al., 2018; Gidaris and Komodakis, 2018). Due to their performance and ease of implementation, the optimisation-based family is the dominant class in the recent literature, exploiting the idea that well-chosen hyperparameters can greatly speed up the learning process and allow model parameters to be quickly adapted to new tasks with little data. For example, it is possible to learn the task of interest using a gradient descent whose initialisation and learning rate would have been learned from the metadata. Among the best known meta-strategies is the model agnostic meta-learning procedure (MAML) (Finn et al., 2017) and its variants implicit MAML (Rajeswaran et al., 2019), Bayesian MAML (Grant et al., 2018; Yoon et al., 2018; Nguyen et al., 2020) and Reptile (Nichol et al., 2018). We refer the interested reader to the recent review by Chen et al. (2023) for more details.
2 Approach and Contributions
In this paper, we focus on Gibbs algorithms within tasks, or their variational approximations. Gibbs algorithms, also known as Gibbs posteriors (Alquier et al., 2016) or exponentially weighted aggregation (Dalalyan and Tsybakov, 2008), can also be interpreted in the framework of Bayesian statistics as a kind of generalized posterior (Bissiri et al., 2016). PAC-Bayes bounds were developed to control the risk and the excess risk of such procedures (Shawe-Taylor and Williamson, 1997; McAllester, 1998; Catoni, 2004; Zhang, 2006; Catoni, 2007; Yang et al., 2019); see Guedj (2019); Alquier (2021) for recent surveys. More recently, the related mutual information bounds (Russo and Zou, 2019; Haghifam et al., 2021) were also used to study the excess risk of Gibbs algorithms (Xu and Raginsky, 2017). Gibbs posteriors are often intractable, and it is then easier to compute a variational approximation of such a posterior. It appears that PAC-Bayes bounds can also be used on such approximations (Alquier et al., 2016). Many recent publications have built the foundations of meta-learning through PAC-Bayes and information bounds (Pentina and Lampert, 2014; Amit and Meir, 2018; Ding et al., 2021; Liu et al., 2021; Rothfuss et al., 2021; Farid and Majumdar, 2021; Rothfuss et al., 2022; Guan and Lu, 2022a; Rezazadeh, 2022). These works and the related literature are discussed in detail in Section 6. Most of these papers proved empirical PAC-Bayes bounds for meta-learning. These bounds can be minimized, yielding both a practical meta-learning procedure and a numerical certificate on its generalization. However, these works did not focus on the rate of convergence of the excess risk.
Bernstein’s condition is a low-noise assumption reflecting the inherent difficulty of the learning task (Mammen and Tsybakov, 1999; Tsybakov, 2004; Bartlett and Mendelson, 2006). While it was initially designed to study the ERM (Bartlett and Mendelson, 2006), it characterizes the learning rate of algorithms beyond the ERM. PAC-Bayes bounds and mutual information bounds show that the excess risk of the Gibbs algorithm is in $O(d_{\pi}/n)$ when Bernstein’s condition is satisfied (Catoni, 2007; Grünwald and Mehta, 2020), as opposed to the slow rate $O(\sqrt{d_{\pi}/n})$ in the general case. The quantity $d_{\pi}$ measures the complexity of the task. Importantly, it also depends on the prior distribution $\pi$ used in the algorithm. Similar results hold when we replace the Gibbs algorithm by a variational approximation (Alquier et al., 2016).
In the meta-learning setting, we are given $T$ tasks simultaneously. Using the Gibbs algorithm with a fixed prior $\pi$ in all tasks leads to an average excess risk in $\mathcal{E}[(d_{\pi}/n)^{\kappa}]$, where $\kappa = 1$ when Bernstein’s condition holds for each task, and $\kappa = 1/2$ otherwise. Here, $\mathcal{E}$ denotes the expectation with respect to a future (out-of-sample) task. This approach is referred to as “learning in isolation”, because each task is solved regardless of the others. Of course, in meta-learning we want to take advantage of the multiple tasks. For example, Pentina and Lampert (2014) used the Gibbs algorithm at the meta level, in order to learn a better prior. The expected benefit is to reduce the complexity term $d_{\pi}$.
Overview of the paper:

- In Section 3, we recall existing results on the excess risk of the Gibbs algorithm when learning tasks in isolation, and we introduce Bernstein’s condition, a fundamental assumption under which the fast rate is achieved by the Gibbs algorithm.
- In Section 4, we prove that, regardless of its validity at the within-task level, Bernstein’s condition is always satisfied at the meta level. As a consequence of this result, we show that a meta-level Gibbs algorithm achieves an excess risk in $\mathcal{E}[(d_{\pi}/n)^{\kappa}] + O(1/T)$, with $\kappa = 1$ if Bernstein’s condition is satisfied within tasks, and $\kappa = 1/2$ otherwise. We further raise the open question of the generalization of this result to its variational approximations.
- In Section 5, we apply the previous results to various settings: learning a discrete prior, learning a Gaussian prior and learning a mixture of Gaussians prior. We show that the gain brought by meta-learning is substantial, as in some favorable situations the complexity term can even become negligible.
- In Section 6, we provide a deeper comparison with the rich literature on meta-learning.
3 Problem Definition and Notations
Let $\mathcal{Z}$ be a space of observations, $\Theta$ be a decision space and $\ell : \Theta \times \mathcal{Z} \to \mathbb{R}$ be a bounded loss function defined on these sets. Let $\mathcal{P}(\Theta)$ denote the set of all probability distributions on $\Theta$, equipped with a suitable $\sigma$-field. The learner has to solve $T$ tasks. Given a task $i \in \{1,\dots,T\}$, the learner receives the observations $Z_{i,1},\dots,Z_{i,n}$, assumed to be drawn independently from a distribution $P_i$ on the space of observations $\mathcal{Z}$. The objective of the learner is to find a parameter $\theta$ in the parameter space $\Theta$ which minimizes the so-called prediction risk associated to $\theta$ on task $i$, defined as
$$R_i(\theta) = \mathbb{E}_{Z \sim P_i}\big[\ell(\theta, Z)\big].$$
We denote by $R^*_i$ the minimum of $R_i$ and by $\theta^*_i$ its corresponding minimizer:
$$R^*_i = R_i(\theta^*_i) = \min_{\theta \in \Theta} R_i(\theta).$$
In Bayesian approaches, we rather seek a distribution $\rho \in \mathcal{P}(\Theta)$ such that
$$\mathbb{E}_{\theta \sim \rho}\big[R_i(\theta)\big]$$
is small.
is small. Defining, for any , the empirical risk as
a standard choice for $\rho$ is the so-called Gibbs posterior, given by
$$\hat{\rho}_i(\mathrm{d}\theta) = \frac{e^{-\alpha \hat{r}_i(\theta)}\,\pi(\mathrm{d}\theta)}{\int_{\Theta} e^{-\alpha \hat{r}_i(\vartheta)}\,\pi(\mathrm{d}\vartheta)}, \qquad (1)$$
where $\pi \in \mathcal{P}(\Theta)$ is the prior distribution on the parameter space and $\alpha > 0$ is some parameter which will be made explicit later. The corresponding risk estimate is defined by
$$\hat{R}_i(\pi) = -\frac{1}{\alpha}\log \int_{\Theta} e^{-\alpha \hat{r}_i(\theta)}\,\pi(\mathrm{d}\theta).$$
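To fix ideas, here is a minimal numerical sketch (our own illustration; the finite grid and variable names are not from the paper) of the Gibbs posterior (1) over a finite set of parameters, computed in a numerically stable way.

```python
import numpy as np

# Sketch: the Gibbs posterior (1) on a finite grid of parameters.
def gibbs_posterior(emp_risk, prior, alpha):
    # emp_risk: (m,) empirical risks r_hat(theta) on the grid
    # prior:    (m,) prior weights pi(theta), summing to 1
    logw = -alpha * emp_risk + np.log(prior)
    logw -= logw.max()          # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()

# Example: risks of 4 candidate parameters under a uniform prior.
emp_risk = np.array([0.30, 0.10, 0.25, 0.50])
prior = np.full(4, 0.25)
print(gibbs_posterior(emp_risk, prior, alpha=10.0))
```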
More generally, we can also consider variational approximations, defined by
$$\hat{\rho}_i^{\mathcal{F}} = \arg\min_{\rho \in \mathcal{F}}\left\{\mathbb{E}_{\theta \sim \rho}\big[\hat{r}_i(\theta)\big] + \frac{\mathrm{KL}(\rho\|\pi)}{\alpha}\right\}, \qquad (2)$$
where $\mathcal{F} \subseteq \mathcal{P}(\Theta)$. Adequate choices for $\mathcal{F}$ lead to feasible minimization problems, and do not affect the generalization properties (Alquier et al., 2016). With these notations, $\hat{\rho}_i = \hat{\rho}_i^{\mathcal{P}(\Theta)}$.
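As a hedged illustration of (2) (a toy example of ours, with an assumed quadratic empirical risk and a standard Gaussian prior, none of which come from the paper), the variational problem over the Gaussian family can be solved numerically; in this particular case the Gibbs posterior is itself Gaussian, so the variational approximation is exact.

```python
import numpy as np
from scipy.optimize import minimize

# Toy variational approximation (2) over F = {N(m, s^2)}, prior pi = N(0, 1),
# quadratic empirical risk r_hat(theta) = (theta - c)^2 (all assumptions ours).
def objective(params, c, alpha):
    m, log_s = params
    s2 = np.exp(2 * log_s)
    expected_risk = (m - c) ** 2 + s2          # E_{N(m,s^2)}[(theta - c)^2]
    kl = 0.5 * (m ** 2 + s2 - 1.0) - log_s     # KL(N(m,s^2), N(0,1))
    return expected_risk + kl / alpha

c, alpha = 1.0, 5.0
res = minimize(objective, x0=[0.0, 0.0], args=(c, alpha))
m_hat, s_hat = res.x[0], np.exp(res.x[1])
# Here the Gibbs posterior is itself Gaussian, so the variational solution
# matches it exactly: mean 2*alpha*c/(2*alpha + 1), variance 1/(2*alpha + 1).
print(m_hat, s_hat ** 2)
```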
3.1 Assumptions on the loss and Bernstein’s condition
Recall that we assumed the loss function to be bounded: there exists a constant $C > 0$ such that
$$0 \le \ell(\theta, z) \le C \quad \text{for any } (\theta, z) \in \Theta \times \mathcal{Z}. \qquad (3)$$
PAC-Bayes bounds for unbounded losses are well known; see the discussion at the end of Alquier (2021). However, many of those bounds are somewhat more complicated. Thus, the choice to work with bounded variables is made here for the sake of clarity, and is not due to a fundamental limitation of our method.
We define the variance term for task $i$ by
$$V_i(\theta) = \mathbb{E}_{Z \sim P_i}\Big[\big(\ell(\theta, Z) - \ell(\theta^*_i, Z)\big)^2\Big]$$
for any $\theta \in \Theta$. The following condition is crucial in the study of the risk of Gibbs posterior.
Assumption 1 (Bernstein’s condition)
There exists a constant $c > 0$ such that, for any task $i$ and any $\theta \in \Theta$,
$$V_i(\theta) \le c\,\big(R_i(\theta) - R_i(\theta^*_i)\big).$$
This assumption characterizes the excess risk of Gibbs posterior, see Theorem 1 below. In this paper, we will provide a bound on the excess risk both under this condition and without it.
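As a quick numerical illustration (a toy example of ours: squared loss with Gaussian data, which is unbounded, so it illustrates Assumption 1 itself rather than the bounded-loss setting (3)), both sides of Bernstein’s condition can be estimated by Monte Carlo; on a bounded parameter set the ratio stays bounded, which yields the constant $c$.

```python
import numpy as np

# For l(theta, z) = (theta - z)^2 with z ~ N(0, 1), Bernstein's condition is
# known to hold on bounded parameter sets. We estimate V(theta) and the
# excess risk R(theta) - R(theta*) by Monte Carlo on a grid of theta values.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=200_000)
theta_star = 0.0  # minimizer of R(theta) = E[(theta - z)^2] = theta^2 + 1

for theta in [0.1, 0.5, 1.0, 2.0]:
    excess = (theta - z) ** 2 - (theta_star - z) ** 2  # l(theta,Z) - l(theta*,Z)
    variance_term = np.mean(excess ** 2)               # V(theta)
    excess_risk = np.mean(excess)                      # R(theta) - R(theta*)
    print(f"theta={theta}: V={variance_term:.3f}, "
          f"excess={excess_risk:.3f}, ratio={variance_term / excess_risk:.3f}")
```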
In some of the applications developed in Section 5, we will also make the following assumption: there exists a constant $G > 0$ such that, for any task $i$ and any $\theta \in \Theta$,
$$R_i(\theta) - R_i(\theta^*_i) \le G\,\|\theta - \theta^*_i\|^2. \qquad (4)$$
Intuitively, Taylor’s expansion of $R_i$ around $\theta^*_i$ gives
$$R_i(\theta) \approx R_i(\theta^*_i) + \frac{1}{2}\,(\theta - \theta^*_i)^\top \nabla^2 R_i(\theta^*_i)\,(\theta - \theta^*_i),$$
since $\nabla R_i(\theta^*_i) = 0$,
and thus we can expect (4) to be satisfied when the risk is smooth enough. However, note that this assumption is not necessary for the main results of this paper to hold, and will only be used in very specific applications.
3.2 Learning in Isolation
In the process of learning in isolation, we consider each of the $T$ tasks separately. We then fix some task $i \in \{1,\dots,T\}$ and denote by $S_i$ the set of observations from task $i$: $S_i = (Z_{i,1},\dots,Z_{i,n})$. We recall the following result, which can be found, e.g., in Alquier (2021), and whose proof is recalled in Appendix B for the sake of completeness.
Theorem 1
When specifying a model and a prior, a derivation of the right-hand side in Theorem 1 leads to explicit rates of convergence. For example, a classical assumption in the Bayesian literature is that there are constants $c_i, d_i > 0$ such that $\pi\big(\{\theta \in \Theta : R_i(\theta) - R_i(\theta^*_i) \le \varepsilon\}\big) \ge (c_i\,\varepsilon)^{d_i}$ for all $\varepsilon \in (0,1]$ (Ghosal and Van der Vaart, 2017). As this condition is usually applied to one task, with a specific prior, the notation does not reflect the dependence with respect to $i$ or to $\pi$. However, in our context, this dependence will be crucial, so we will write $d_i(\pi)$ instead of $d_i$. Under such an assumption, the right-hand side in Theorem 1 can be made more explicit. For simplicity, we state this result for Gibbs posteriors only.
Corollary 2
Assume that, almost surely on , . Then,
In particular, under Bernstein’s condition, the choice gives
On the other hand, without Bernstein’s condition,
and in particular, for , we obtain
4 Main Results
From this section onward, we focus on meta-learning. As opposed to learning in isolation, meta-learning considers all the tasks together and takes advantage of possible similarities between them to improve learning in each task. More precisely, while assuming that, for any task $i$, the observations $Z_{i,1},\dots,Z_{i,n}$ are drawn independently from the distribution $P_i$, we also assume that the distributions $P_1,\dots,P_T$ are themselves drawn independently from a certain distribution $\mathcal{T}$. A future (out-of-sample) task $P_{T+1}$ will also be drawn from $\mathcal{T}$, and the observations $Z_{T+1,1},\dots,Z_{T+1,n}$ will be drawn independently from $P_{T+1}$. This task will be solved by the Gibbs algorithm $\hat{\rho}_{T+1}$. Our objective is to learn the prior $\pi$ using the tasks $i = 1,\dots,T$, in order to make the meta-risk
$$\mathcal{E}(\pi) = \mathbb{E}\Big[\mathbb{E}_{\theta \sim \hat{\rho}_{T+1}}\big[R_{T+1}(\theta)\big]\Big]$$
as small as possible. We will compare it to the so-called oracle meta-risk
$$\mathcal{E}^* = \mathbb{E}\big[R_{T+1}(\theta^*_{T+1})\big],$$
which can only be reached by an oracle who would know the best classifier in each task in advance.
4.1 Bernstein’s condition at the meta level
In this subsection, we prove a version of Bernstein’s condition at the meta level. Let
for any prior , and let be the distribution minimizing the expectation of the above quantity:
Surprisingly enough, and in contrast to learning in isolation, Bernstein’s condition is always satisfied at the meta level, provided that we use Gibbs posteriors within tasks.
Theorem 3
Proof It classically holds (see, e.g., Alquier (2021)) that
(5)
for any fixed . By the boundedness assumption, it holds that, for any ,
We next use the following lemma, whose proof can be found in Appendix D.
Lemma 4
Let . Then, for any ,
An application of Lemma 4 to and gives
(these derivations are directly inspired by the proof technique introduced by Bartlett et al. (2006)).
Taking expectations with respect to on both sides
Integrating with respect to yields
(6)
By definition of , it holds that, for any ,
In particular, this holds for , and plugging this into the right-hand side of (6) gives
The (optimal) choice gives the desired bound
4.2 PAC-Bayes Bound for Meta-learning
We now seek a prior which allows us to obtain a small meta-risk
In order to do so, we will fix a set $\mathcal{G}$ of possible priors and a set $\mathcal{M}$ of distributions on these priors.¹ Given a probability distribution $\Lambda \in \mathcal{M}$, called the “prior on priors”, we define the Gibbs distribution on priors similarly as in (1), but at the meta level:
$$\hat{\Pi}(\mathrm{d}\pi) = \frac{\exp\big(-\beta\sum_{i=1}^{T}\hat{R}_i(\pi)\big)\,\Lambda(\mathrm{d}\pi)}{\int_{\mathcal{G}}\exp\big(-\beta\sum_{i=1}^{T}\hat{R}_i(\pi')\big)\,\Lambda(\mathrm{d}\pi')}, \qquad (7)$$
where $\beta > 0$ is some parameter.

¹ Note that measurability issues can arise when the set $\mathcal{G}$ is nonparametric. However, in all our examples, the set $\mathcal{G}$ is parametric.
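A hedged sketch of the meta-level Gibbs distribution (7) on a finite set of candidate priors follows; the discretization, helper names and the form of the within-task risk estimate $\hat{R}_i$ follow our reconstruction above, not necessarily the paper’s exact notation.

```python
import numpy as np
from scipy.special import logsumexp

# Sketch: weights of the meta-level Gibbs distribution (7) over K candidate
# priors on a finite parameter grid, with the within-task risk estimate
# R_hat_i(pi) = -(1/alpha) log E_pi[exp(-alpha r_hat_i)].
def gibbs_risk_estimate(emp_risk, prior, alpha):
    # emp_risk: (m,) empirical risks on the grid; prior: (m,) prior weights
    return -logsumexp(-alpha * emp_risk + np.log(prior)) / alpha

def meta_gibbs_weights(emp_risks, candidate_priors, alpha, beta):
    # emp_risks: (T, m) risks of m parameters on T tasks
    # candidate_priors: (K, m) rows are candidate priors over the m parameters
    scores = np.array([sum(gibbs_risk_estimate(r, p, alpha) for r in emp_risks)
                       for p in candidate_priors])
    logw = -beta * scores
    w = np.exp(logw - logw.max())   # stabilized exponentiation
    return w / w.sum()              # distribution over the K candidate priors

emp_risks = np.array([[0.2, 0.5], [0.1, 0.6]])   # two tasks, two parameters
priors = np.array([[0.5, 0.5], [0.9, 0.1]])      # two candidate priors
print(meta_gibbs_weights(emp_risks, priors, alpha=5.0, beta=2.0))
```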
As a consequence of Theorem 3 comes the next result, whose proof is given in Appendix E.
Theorem 5
Assume that the loss satisfies (3). Defining , it holds, for any ,
While the bound given in Theorem 5 depends on some subset chosen for computational reasons, the bound is on the excess risk of , which itself is based on the exact Gibbs posterior , and not on its variational approximation , based on and defined as
In some settings, is tractable and its Gibbs-based counterpart is not, and it is a fundamental open question to determine under what condition on we can replace by in Theorem 5.
Open Question 1
Under what conditions on can we replace by in the left-hand side of Theorem 5?
4.3 A Toy Application of Theorem 5: Concurrent Priors
This subsection gives a toy application of Theorem 5, just to fix ideas. Here, we study the case where several statisticians each propose a different prior, all of which are assumed to satisfy a prior mass condition as in Corollary 2. We denote by $\mathcal{G}$ the (finite) set of these priors, choose $\Lambda$ as the uniform distribution on $\mathcal{G}$, and take $\mathcal{M}$ as the set of all distributions on $\mathcal{G}$. Here again, for the sake of simplicity, we assume that Bernstein’s condition (see Assumption 1) is satisfied at the task level.
5 Applications of Theorem 5
In this section, by an application of Theorem 5, we derive explicit bounds on the excess risk of the Gibbs algorithm in the case of discrete priors (the parameter set is finite; Subsection 5.1), Gaussian priors (Subsection 5.2) and mixtures of Gaussian priors (Subsection 5.3).
5.1 Learning Discrete Priors
In this subsection, we assume that $\Theta$ is finite, with $|\Theta| = m$. Following Meunier and Alquier (2021), we define $\Theta^*$ as the smallest possible subset of $\Theta$ such that
$$\mathbb{P}\big(\theta^*_{T+1} \in \Theta^*\big) = 1, \qquad (8)$$
and we denote $m^* = |\Theta^*|$. In general, $\Theta^* \subseteq \Theta$ and $m^* \le m$. However, in some favorable situations, $\Theta^*$ is small and $m^* \ll m$, in which case the meta-learning may improve upon the learning in isolation. In the setting considered, Bernstein’s condition is trivially satisfied and the excess risk of the Gibbs algorithm is of order $\log(m)/n$.
We define our set of priors $\mathcal{G}$ as the set of probability distributions which are uniform on a subset of $\Theta$, parameterized by the subset $S$:
$$\mathcal{G} = \big\{\mathcal{U}(S) : \emptyset \ne S \subseteq \Theta\big\},$$
and $\mathcal{M}$ is the set of all distributions on $\mathcal{G}$. Our “prior on priors” $\Lambda$ is then defined as follows: we draw a cardinality $s \in \{1,\dots,m\}$ with probability $p_s$, then, given $s$, draw a subset $S \subseteq \Theta$ of cardinality $s$ uniformly at random, and take $\pi = \mathcal{U}(S)$. In other words, $\Lambda$ is a distribution on $\mathcal{G}$ such that $\Lambda\big(\{\mathcal{U}(S)\}\big) = p_{|S|}\big/\binom{m}{|S|}$.
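A hedged sampler for this “prior on priors” (the specific size distribution $p_s$ below is an assumption made for the sake of the example, not the paper’s choice):

```python
import numpy as np

# Draw a cardinality s, then a uniform prior on a uniformly chosen subset
# of Theta of size s, as in the construction of Lambda above.
rng = np.random.default_rng(0)

def sample_uniform_prior(m, p_sizes):
    s = rng.choice(np.arange(1, m + 1), p=p_sizes)  # cardinality of the support
    support = rng.choice(m, size=s, replace=False)  # subset of Theta, uniform
    prior = np.zeros(m)
    prior[support] = 1.0 / s                        # uniform prior on the subset
    return prior

m = 10
p_sizes = np.array([2.0 ** -s for s in range(1, m + 1)])  # assumed p_s
p_sizes /= p_sizes.sum()
print(sample_uniform_prior(m, p_sizes))
```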
Proposition 6
The excess risk of the meta predictor defined in (7) is bounded as follows:
Remark 7
Let us now compare the meta-learning rate above to the rate achieved by learning in isolation. In the unfavorable case , the meta-learning bound is only marginally larger than the learning-in-isolation bound, by a term which vanishes rapidly as $T$ grows.
In the favorable case, however, meta-learning considerably improves upon learning in isolation for large values of . In the extreme case where , we have
Thus, the benefits of meta-learning are mainly expected in the regime where $T$ is large. They are huge when the tasks are very similar, and close to zero when the tasks are totally unrelated. This is in line with the results of Meunier and Alquier (2021) in the online setting.
Proof We first consider learning in isolation. The classical choice is to take $\pi$ uniform on $\Theta$ in each task. An application of Theorem 1 gives
In the meta-learning case, an application of Theorem 5 gives
Under the assumption (8), it holds that
We conclude by using the classic bound $\binom{m}{m^*} \le \left(\frac{em}{m^*}\right)^{m^*}$.
5.2 Learning Gaussian priors
In this subsection, we consider the set of all isotropic Gaussian distributions
$$\mathcal{G} = \big\{\mathcal{N}(\mu, \sigma^2 I_d) : \mu \in \mathbb{R}^d,\ \sigma^2 > 0\big\}. \qquad (9)$$
So, priors on priors are actually defined as priors on and given by:
(10)
and we choose the prior on priors as .
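For this family, the KL terms appearing in the meta-level bound are available in closed form via (17) in Appendix A; a small helper (our notation and a sketch, not the paper’s code):

```python
import numpy as np

# Closed-form KL divergence between two isotropic Gaussian priors
# N(mu1, s1^2 I_d) and N(mu2, s2^2 I_d), as in equation (17).
def kl_isotropic_gaussians(mu1, s1, mu2, s2):
    d = mu1.shape[0]
    return (d * np.log(s2 / s1)
            + (d * s1**2 + np.sum((mu1 - mu2) ** 2)) / (2 * s2**2)
            - d / 2)

mu = np.zeros(3)
print(kl_isotropic_gaussians(mu, 1.0, mu + 0.5, 2.0))  # small when priors are close
```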
From now on and until the end of this section, we assume that both (4) and Assumption 1 (Bernstein’s condition) hold, and we are looking for a prior on priors from which to sample a prior $\pi$, such that $\pi$ concentrates as much as possible around the best parameters. Denoting and , the following holds.
Proposition 8
Interestingly enough, in the favorable case , a proper choice of the prior on priors leads to the very fast rate of convergence , which considerably improves upon the fast rate when . The detailed proof of this proposition, as well as explicit expressions of and , are given in Appendix F.
5.3 Learning Mixtures of Gaussian priors
In this subsection, we generalize the result of the previous section to priors that are mixtures of Gaussians. We still assume that (4) and Assumption 1 (Bernstein’s condition) hold. We first assume that the number $K$ of components in the mixture is known. Under these hypotheses, the set of possible priors is
(11)
We add a Dirichlet prior on the weights of the components in the mixture, and the set of priors on priors becomes
(12)
while the initial prior on priors is chosen as . We define
Proposition 9
Let us analyze each of the terms of the above bound. The first term is the bound we had in the finite case of Subsection 5.1. Visualizing our mixtures as the points in the finite case, this term is the time required by our estimator to select the right mixture. While this term makes the convergence rate notably worse than the one we might hope for, it is essentially unavoidable, as it already appears in the much simpler model of a finite set of parameters described in Subsection 5.1.
The next term is (similar to) the bound obtained in the Gaussian case of Subsection 5.2, with the exception that is replaced by , and scales with the convergence time to the best Gaussian for every task .
Finally, the last term is the convergence term at the meta level and is in $O(1/T)$. This is the cost of meta-learning compared to learning in isolation. When $T$ is large, this term is very small and justifies the use of meta-learning.
Remark 10
One may think, looking at the bounds in the two previous cases, that the convergence rate in the case of mixtures of Gaussians is slower than the one for Gaussians, which can be as fast as . In reality, the rate of convergence is (naturally) faster for the model of mixtures of Gaussians: in the case of mixtures of Gaussians, the convergence term is obtained under the assumption that , while in the Gaussian case, the much stronger assumption is required. Under this stronger assumption, the same rate is naturally achieved.
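At the meta level, the proofs (Appendix G) control the KL divergence between two mixture priors via the log sum inequality, which yields the closed-form upper bound sketched below; component matching and isotropic components are assumptions of this illustration, not requirements of the paper.

```python
import numpy as np

# Log sum inequality bound: for mixtures sum_k w_k N(mu_k, s_k^2 I) and
# sum_k v_k N(nu_k, t_k^2 I) with matched components,
# KL(mixture1, mixture2) <= KL(w, v) + sum_k w_k KL(component_k).
def kl_gauss(mu1, s1, mu2, s2):
    d = mu1.shape[0]
    return (d * np.log(s2 / s1)
            + (d * s1**2 + np.sum((mu1 - mu2) ** 2)) / (2 * s2**2) - d / 2)

def kl_mixture_upper_bound(w, mus1, s1s, v, mus2, s2s):
    kl_weights = np.sum(w * np.log(w / v))
    kl_comps = sum(w[k] * kl_gauss(mus1[k], s1s[k], mus2[k], s2s[k])
                   for k in range(len(w)))
    return kl_weights + kl_comps

w, v = np.array([0.6, 0.4]), np.array([0.5, 0.5])
mus1 = [np.zeros(2), np.ones(2)]
mus2 = [np.zeros(2), 2 * np.ones(2)]
print(kl_mixture_upper_bound(w, mus1, [1.0, 1.0], v, mus2, [1.0, 2.0]))
```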
We now consider the case when the number of components $K$ is unknown. The set of priors hence becomes the set of all (finite) mixtures of Gaussians:
(13)
In the definition of the set of priors on priors , we assume that (otherwise, there is a high chance of overfitting). We then set a prior on the number of components $K$ and, given $K$, use the same model as before (including the Dirichlet prior on the weights). Formally,
(14)
and we set the prior on priors . The application of Theorem 5 yields the bound
Proposition 11
Our estimator takes to find the optimal number of components at the meta level. This is the price to pay for having the infimum over in the bound. In the limit , and when no prior information on is available, this clearly improves upon the bound of Proposition 9, and hence justifies setting a prior on $K$ at the meta level rather than choosing an isolated value, again pleading in favor of meta-learning. The proofs of all the results of this subsection are given in Appendix G.
6 Discussion
In recent years, the statistical guarantees of meta-learning have received increasing attention. In the following paragraphs, we present a short review of the literature on the statistical theory of meta-learning, followed by a brief discussion of three papers that are closely related to our analysis.
Theoretical bounds in meta-learning. The first theoretical analysis of meta-learning goes back to Baxter (2000), who introduced the notion of task environment and derived a uniform generalization bound based on the capacity and covering number of the model. Following this i.i.d. task environment setting, many other generalization bounds have since been provided for different strategies and proof techniques, including VC theory (Baxter, 2000; Ben-David and Schuller, 2003; Maurer, 2009; Maurer et al., 2016; Guan and Lu, 2022b), algorithmic stability (Maurer and Jaakkola, 2005; Chen et al., 2020; Al-Shedivat et al., 2021; Guan et al., 2022) and information theory (Jose and Simeone, 2021; Chen et al., 2021; Jose et al., 2021; Rezazadeh et al., 2021; Hellström and Durisi, 2022). A related approach for deriving such bounds is based on PAC-Bayes theory. First proposed in the meta-learning framework in the pioneering paper of Pentina and Lampert (2014), this idea of learning a hyper-posterior that generates a prior for the new task has been taken up several times in the recent years (Amit and Meir, 2018; Ding et al., 2021; Liu et al., 2021; Rothfuss et al., 2021; Farid and Majumdar, 2021; Rothfuss et al., 2022; Guan and Lu, 2022a; Rezazadeh, 2022). In particular, Amit and Meir (2018) derived a new PAC-Bayes bound, which they applied to the optimization of deep neural networks, albeit with computational limitations. This latter concern was partially addressed by Rothfuss et al. (2021), who also specified the hyper-posterior and extended the results to unbounded losses, and further investigated their study in Rothfuss et al. (2022). Some papers combined ideas from different literatures, such as Farid and Majumdar (2021), who explored the link between PAC-Bayes and uniform stability in meta-learning, and provided a precise analysis of stability and generalization. Excess risk bounds have also been provided in the i.i.d. task environment framework, see Maurer et al. (2016); Denevi et al. (2018a, b, 2019a, 2019b, 2020); Balcan et al. (2019); Bai et al. (2021); Chen and Chen (2022). The task environment assumption has recently been challenged, for example by Du et al. (2020) and Tripuraneni et al. (2021), who proposed to use assumptions on the distributional similarity between the features and the diversity of tasks to control the excess risk, an idea further explored by Fallah et al. (2021) who exploited a notion of diversity between the new task and training tasks using the total variation distance. Finally, a detailed analysis of regret bounds in lifelong learning has been carried out in recent years (Alquier et al., 2017; Denevi et al., 2018b, 2019b; Balcan et al., 2019; Khodak et al., 2019; Finn et al., 2019; Meunier and Alquier, 2021).
Comparison to Denevi et al. (2019a). Denevi et al. (2019a) is probably the study most closely related to our paper. The authors provide statistical guarantees for Ridge regression with a meta-learned bias, and focus on the usefulness of their strategy relative to single-task learning, proving that their method outperforms the standard $\ell_2$-regularized empirical risk minimizer. In particular, they can achieve an excess risk rate of order in the favorable case , where is a variance term similar to the one we defined in our Gaussian example.
Comparison to Guan et al. (2022). To the best of our knowledge, Guan et al. (2022) is the only work in the meta-learning literature that addresses fast rates with respect to the number of tasks under the task environment assumption. However, we actually show in our paper that there is no need to extend Bernstein’s condition when using exact Bayesian inference, and that the final posterior naturally satisfies the extended Bernstein assumption, thus giving fast rates with respect to $T$, while Guan et al. (2022) require an additional Polyak-Łojasiewicz condition to achieve fast rates. Furthermore, their analysis is very different in nature, relying on stability arguments to derive generalization bounds, while we use PAC-Bayes theory to control the excess risk.
Comparison to Guan and Lu (2022a) and Rezazadeh (2022). Finally, Guan and Lu (2022a) and Rezazadeh (2022) provide fast rate generalization bounds based on Catoni’s PAC-Bayes inequality. Nevertheless, their notion of fast rates differs substantially from ours: our rates quantify the rate of convergence of the excess risk, in line with the classic statistical learning literature. In contrast, the fast rates of Guan and Lu (2022a) are only available for a variant of the generalization gap that compares the theoretical risk to a multiple of the empirical risk. In order to obtain a sharp factor equal to $1$ in front of the empirical risk, one needs to fine-tune a parameter in Guan and Lu (2022a)’s inequality, which would then return a slow rate.
7 Conclusion and open problems
We provided an analysis of the excess risk in meta-learning the prior via PAC-Bayes bounds. Surprisingly, at the meta-level, conditions for fast rates are always satisfied if one uses exact Gibbs posteriors at the task level. An important problem is to extend this result to variational approximations of Gibbs posteriors.
References
- Achille et al. (2019) Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6430–6439, 2019.
- Al-Shedivat et al. (2021) Maruan Al-Shedivat, Liam Li, Eric Xing, and Ameet Talwalkar. On data efficiency of meta-learning. In International Conference on Artificial Intelligence and Statistics, pages 1369–1377. PMLR, 2021.
- Alquier et al. (2016) P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(239):1–41, 2016. URL http://jmlr.org/papers/v17/15-290.html.
- Alquier (2021) Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.
- Alquier et al. (2017) Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret Bounds for Lifelong Learning. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 261–269, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
- Amit and Meir (2018) R. Amit and R. Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, pages 205–214. PMLR, 2018.
- Bai et al. (2021) Yu Bai, Minshuo Chen, Pan Zhou, Tuo Zhao, Jason Lee, Sham Kakade, Huan Wang, and Caiming Xiong. How important is the train-validation split in meta-learning? In International Conference on Machine Learning, pages 543–553. PMLR, 2021.
- Balcan et al. (2019) Maria-Florina Balcan, Mikhail Khodak, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424–433. PMLR, 2019.
- Bartlett et al. (2006) P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- Bartlett and Mendelson (2006) Peter L Bartlett and Shahar Mendelson. Empirical minimization. Probability theory and related fields, 135(3):311–334, 2006.
- Baxter (2000) Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198, 2000.
- Ben-David and Schuller (2003) Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 567–580. Springer, 2003.
- Bissiri et al. (2016) P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
- Boucheron et al. (2013) S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities. Oxford University Press, 2013.
- Caruana (1997) R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
- Catoni (2004) O. Catoni. Statistical Learning Theory and Stochastic Optimization. Saint-Flour Summer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathematics. Springer, 2004.
- Catoni (2007) O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes – Monograph Series, 56. Institute of Mathematical Statistics, Beachwood, OH, 2007.
- Chen et al. (2020) Jiaxin Chen, Xiao-Ming Wu, Yanke Li, Qimai Li, Li-Ming Zhan, and Fu-lai Chung. A closer look at the training strategy for modern meta-learning. Advances in Neural Information Processing Systems, 33:396–406, 2020.
- Chen and Chen (2022) Lisha Chen and Tianyi Chen. Is bayesian model-agnostic meta learning better than model-agnostic meta learning, provably? In International Conference on Artificial Intelligence and Statistics, pages 1733–1774. PMLR, 2022.
- Chen et al. (2023) Lisha Chen, Sharu Theresa Jose, Ivana Nikoloska, Sangwoo Park, Tianyi Chen, and Osvaldo Simeone. Learning with limited samples: Meta-learning and applications to communication systems. Foundations and Trends® in Signal Processing, 17(2):79–208, 2023.
- Chen et al. (2021) Qi Chen, Changjian Shui, and Mario Marchand. Generalization bounds for meta-learning: An information-theoretic analysis. Advances in Neural Information Processing Systems, 34:25878–25890, 2021.
- Cover and Thomas T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. Wiley.
- Dalalyan and Tsybakov (2008) A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1-2):39–61, 2008.
- Denevi et al. (2018a) G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Learning to learn around a common mean. Advances in neural information processing systems, 31, 2018a.
- Denevi et al. (2018b) G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018b.
- Denevi et al. (2019a) G. Denevi, C. Ciliberto, R. Grazzi, and M. Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575. PMLR, 2019a.
- Denevi et al. (2019b) Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, and Massimiliano Pontil. Online-within-online meta-learning. Advances in Neural Information Processing Systems, 32, 2019b.
- Denevi et al. (2020) Giulia Denevi, Massimiliano Pontil, and Carlo Ciliberto. The advantage of conditional meta-learning for biased regularization and fine tuning. Advances in Neural Information Processing Systems, 33:964–974, 2020.
- Ding et al. (2021) N. Ding, X. Chen, T. Levinboim, S. Goodman, and R. Soricut. Bridging the gap between practice and pac-bayes theory in few-shot meta-learning. Advances in neural information processing systems, 34:29506–29516, 2021.
- Donsker and Varadhan (1976) M. D. Donsker and S. S. Varadhan. Asymptotic evaluation of certain markov process expectations for large time. iii. Communications on Pure and Applied Mathematics, 28:389–461, 1976.
- Dou et al. (2019) Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. Investigating meta-learning algorithms for low-resource natural language understanding tasks. arXiv preprint arXiv:1908.10423, 2019.
- Du et al. (2020) Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
- Fallah et al. (2021) Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. Advances in Neural Information Processing Systems, 34:5469–5480, 2021.
- Farid and Majumdar (2021) A. Farid and A. Majumdar. Generalization bounds for meta-learning via pac-bayes and uniform stability. Advances in neural information processing systems, 34:2173–2186, 2021.
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
- Finn et al. (2019) Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920–1930. PMLR, 2019.
- Ghosal and Van der Vaart (2017) S. Ghosal and A. Van der Vaart. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017.
- Gidaris and Komodakis (2018) Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4367–4375, 2018.
- Grant et al. (2018) Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
- Grünwald and Mehta (2020) Peter D Grünwald and Nishant A Mehta. Fast rates for general unbounded loss functions: from erm to generalized bayes. The Journal of Machine Learning Research, 21(1):2040–2119, 2020.
- Gu et al. (2018) Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437, 2018.
- Guan and Lu (2022a) J. Guan and Z. Lu. Fast-rate pac-bayesian generalization bounds for meta-learning. In International conference on machine learning, pages 7930–7948. PMLR, 2022a.
- Guan et al. (2022) J. Guan, Y. Liu, and Z. Lu. Fine-grained analysis of stability and generalization for modern meta learning algorithms. Advances in neural information processing systems, 2022.
- Guan and Lu (2022b) Jiechao Guan and Zhiwu Lu. Task relatedness-based generalization bounds for meta learning. In International Conference on Learning Representations, 2022b.
- Guedj (2019) B. Guedj. A primer on PAC-Bayesian learning. In Proceedings of the second congress of the French Mathematical Society, 2019.
- Haghifam et al. (2021) Mahdi Haghifam, Gintare Karolina Dziugaite, Shay Moran, and Dan Roy. Towards a unified information-theoretic framework for generalization. Advances in Neural Information Processing Systems, 34:26370–26381, 2021.
- Hellström and Durisi (2022) Fredrik Hellström and Giuseppe Durisi. Evaluated cmi bounds for meta learning: Tightness and expressiveness. arXiv preprint arXiv:2210.06511, 2022.
- Hochreiter et al. (2001) Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In Artificial Neural Networks—ICANN 2001: International Conference Vienna, Austria, August 21–25, 2001 Proceedings 11, pages 87–94. Springer, 2001.
- Huang et al. (2018) Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. Natural language to structured query generation via meta-learning. arXiv preprint arXiv:1803.02400, 2018.
- Jose and Simeone (2021) Sharu Theresa Jose and Osvaldo Simeone. Information-theoretic generalization bounds for meta-learning and applications. Entropy, 23(1):126, 2021.
- Jose et al. (2021) Sharu Theresa Jose, Osvaldo Simeone, and Giuseppe Durisi. Transfer meta-learning: Information-theoretic bounds and information meta-risk minimization. IEEE Transactions on Information Theory, 68(1):474–501, 2021.
- Khodak et al. (2019) Mikhail Khodak, Maria-Florina F Balcan, and Ameet S Talwalkar. Adaptive gradient-based meta-learning methods. Advances in Neural Information Processing Systems, 32, 2019.
- Koch et al. (2015) Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2. Lille, 2015.
- Kulis et al. (2011) Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR 2011, pages 1785–1792. IEEE, 2011.
- Li et al. (2018) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
- Liu et al. (2021) Tianyu Liu, Jie Lu, Zheng Yan, and Guangquan Zhang. Pac-bayes bounds for meta-learning with data-dependent prior. arXiv preprint arXiv:2102.03748, 2021.
- Mammen and Tsybakov (1999) E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. the Annals of Statistics, 27(6):1808–1829, 1999.
- Maurer (2009) A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
- Maurer and Jaakkola (2005) A. Maurer and T. Jaakkola. Algorithmic stability and meta-learning. The Journal of Machine Learning Research, 6(6), 2005.
- Maurer et al. (2016) A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(81):1–32, 2016.
- McAllester (1998) D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234, New York, 1998. ACM.
- Meunier and Alquier (2021) D. Meunier and P. Alquier. Meta-strategy for learning tuning parameters with guarantees. Entropy, 23(10), 2021.
- Mishra et al. (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. ICLR, 2018.
- Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International conference on machine learning, pages 2554–2563. PMLR, 2017.
- Nguyen et al. (2020) Cuong Nguyen, Thanh-Toan Do, and Gustavo Carneiro. Uncertainty in model-agnostic meta-learning using variational inference. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3090–3100, 2020.
- Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
- Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
- Pentina and Lampert (2014) A. Pentina and C. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991–999. PMLR, 2014.
- Qian and Yu (2019) Kun Qian and Zhou Yu. Domain adaptive dialog generation via meta learning. arXiv preprint arXiv:1906.03520, 2019.
- Qiao et al. (2018) Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7229–7238, 2018.
- Quattoni et al. (2008) Ariadna Quattoni, Michael Collins, and Trevor Darrell. Transfer learning for image classification with sparse prototype representations. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
- Rajeswaran et al. (2019) Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019.
- Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International conference on learning representations, 2017.
- Rezazadeh (2022) A. Rezazadeh. A unified view on pac-bayes bounds for meta-learning. In International conference on machine learning, pages 18576–18595. PMLR, 2022.
- Rezazadeh et al. (2021) Arezou Rezazadeh, Sharu Theresa Jose, Giuseppe Durisi, and Osvaldo Simeone. Conditional mutual information-based generalization bound for meta learning. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 1176–1181. IEEE, 2021.
- Rothfuss et al. (2021) J. Rothfuss, V. Fortuin, M. Josifoski, and A. Krause. Pacoh: Bayes-optimal meta-learning with pac-guarantees. In International conference on machine learning, pages 9116–9126. PMLR, 2021.
- Rothfuss et al. (2022) J. Rothfuss, M. Josifoski, V. Fortuin, and A. Krause. Pac-bayesian meta-learning: From theory to practice. arXiv preprint arXiv:2211.07206, 2022.
- Russo and Zou (2019) D. Russo and J. Zou. How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
- Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850. PMLR, 2016.
- Shawe-Taylor and Williamson (1997) J. Shawe-Taylor and R. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 2–9, New York, 1997. ACM.
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
- Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208, 2018.
- Thrun and Pratt (1998) Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. Kluwer Academic Publishers, pages 3–17, 1998.
- Tripuraneni et al. (2021) Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434–10443. PMLR, 2021.
- Tsybakov (2004) A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. the Annals of Statistics, 32(1):135–166, 2004.
- Vanschoren (2019) Joaquin Vanschoren. Meta-learning. Automated machine learning: methods, systems, challenges, pages 35–61, 2019.
- Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
- Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Xu and Raginsky (2017) A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 2521–2530, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
- Yang et al. (2019) Jun Yang, Shengyang Sun, and Daniel M Roy. Fast-rate pac-bayes generalization bounds via shifted rademacher processes. Advances in Neural Information Processing Systems, 32, 2019.
- Yoon et al. (2018) Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. Advances in neural information processing systems, 31, 2018.
- Yu et al. (2020) Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020.
- Zhang (2006) Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.
- Zhang and Yang (2021) Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
- Zhuang et al. (2020) Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.
Appendix A Some Useful Formulas
The following (known) results are used throughout the text. They are recalled here without proof.
A.1 Concentration Inequalities
For Hoeffding and Bernstein see Boucheron et al. (2013). For Donsker and Varadhan, see for example Catoni (2007).
Lemma 12 (Hoeffding’s inequality)
Let $U_1,\dots,U_n$ be i.i.d. random variables taking values in an interval $[a, b]$. Then, for any $\lambda > 0$,
$$\mathbb{E}\left[e^{\lambda\left(\mathbb{E}[U_1] - \frac{1}{n}\sum_{i=1}^{n}U_i\right)}\right] \le e^{\frac{\lambda^2 (b-a)^2}{8n}}.$$
Lemma 13 (Bernstein’s inequality)
Let $U_1,\dots,U_n$ be i.i.d. random variables such that, for some constants $v, b > 0$ and any integer $k \ge 2$,
$$\mathbb{E}\big[|U_1|^k\big] \le \frac{k!}{2}\,v\,b^{k-2}. \qquad (15)$$
Then, for any $\lambda \in \big(0, \frac{n}{b}\big)$,
$$\mathbb{E}\left[e^{\lambda\left(\mathbb{E}[U_1] - \frac{1}{n}\sum_{i=1}^{n}U_i\right)}\right] \le \exp\left(\frac{v\lambda^2}{2n\big(1 - \frac{b\lambda}{n}\big)}\right).$$
Note that, in particular, if $|U_1| \le c$ almost surely, (15) always holds with $v = \mathbb{E}[U_1^2]$ and $b = c$.
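A quick Monte Carlo sanity check of this moment-generating-function form of Bernstein’s inequality, for bounded (uniform) variables, where (15) holds with $v = \mathbb{E}[U_1^2]$ and $b = c$; the check is ours, with the constants of the lemma as reconstructed above.

```python
import numpy as np

# Empirical MGF of the centered mean vs. the Bernstein bound, for
# U ~ Uniform(-c, c), so E[U] = 0, v = E[U^2] = c^2 / 3 and b = c.
rng = np.random.default_rng(0)
n, c = 50, 1.0
U = rng.uniform(-c, c, size=(200_000, n))   # i.i.d. bounded samples
v = c**2 / 3

for lam in [1.0, 5.0, 10.0]:                # lam must stay below n / c
    mgf = np.mean(np.exp(lam * (0.0 - U.mean(axis=1))))
    bound = np.exp(v * lam**2 / (2 * n * (1 - c * lam / n)))
    print(f"lambda={lam}: MGF={mgf:.4f} <= bound={bound:.4f}")
```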
A.2 Donsker and Varadhan’s Lemma
Lemma 14 (Donsker and Varadhan’s variational inequality Donsker and Varadhan (1976))
Let $\pi$ be a probability measure on $\Theta$. For any measurable, bounded function $h : \Theta \to \mathbb{R}$, we have:
$$\log \mathbb{E}_{\theta \sim \pi}\big[e^{h(\theta)}\big] = \sup_{\rho \in \mathcal{P}(\Theta)}\Big\{\mathbb{E}_{\theta \sim \rho}\big[h(\theta)\big] - \mathrm{KL}(\rho\|\pi)\Big\}.$$
Moreover, the supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure $\pi_h$, defined by its density with respect to $\pi$:
$$\frac{\mathrm{d}\pi_h}{\mathrm{d}\pi}(\theta) = \frac{e^{h(\theta)}}{\mathbb{E}_{\vartheta \sim \pi}\big[e^{h(\vartheta)}\big]}.$$
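A short numerical check of Lemma 14 on a finite parameter space (our illustration): the Gibbs measure attains the supremum, and the optimal value equals $\log \mathbb{E}_{\pi}[e^{h}]$.

```python
import numpy as np

# Donsker-Varadhan on a finite space: compare log E_pi[exp(h)] with the
# value of the variational objective at the Gibbs measure pi_h.
rng = np.random.default_rng(0)
m = 5
pi = rng.dirichlet(np.ones(m))            # reference measure pi
h = rng.normal(size=m)                    # bounded function h

lhs = np.log(np.sum(pi * np.exp(h)))      # log E_pi[exp(h)]

gibbs = pi * np.exp(h)
gibbs /= gibbs.sum()                      # Gibbs measure pi_h
kl = np.sum(gibbs * np.log(gibbs / pi))   # KL(pi_h, pi)
rhs = np.sum(gibbs * h) - kl              # E_{pi_h}[h] - KL(pi_h, pi)

print(np.isclose(lhs, rhs))               # True: the supremum is attained
```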
A.3 KL Divergence of some known Distributions
Denoting by $H(p) = -\sum_{k} p_k \log p_k$ the entropy of $p$, recall that the KL divergence between a multinomial distribution of parameters $p = (p_1,\dots,p_m)$ and a multinomial distribution of parameters $q = (q_1,\dots,q_m)$ is
$$\mathrm{KL}(p\|q) = \sum_{k=1}^{m} p_k \log\frac{p_k}{q_k} = -H(p) - \sum_{k=1}^{m} p_k \log q_k. \qquad (16)$$
Recall that the KL divergence between two normal distributions $\mathcal{N}(\mu_1, \sigma_1^2 I_d)$ and $\mathcal{N}(\mu_2, \sigma_2^2 I_d)$ is
$$\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2 I_d)\,\big\|\,\mathcal{N}(\mu_2, \sigma_2^2 I_d)\big) = \frac{\|\mu_1 - \mu_2\|^2 + d\,\sigma_1^2}{2\sigma_2^2} + \frac{d}{2}\log\frac{\sigma_2^2}{\sigma_1^2} - \frac{d}{2}. \qquad (17)$$
Recall that the KL divergence between two Gamma distributions (in the shape/rate parameterization) is
$$\mathrm{KL}\big(\Gamma(a_1, b_1)\,\big\|\,\Gamma(a_2, b_2)\big) = (a_1 - a_2)\,\psi(a_1) - \log\Gamma(a_1) + \log\Gamma(a_2) + a_2\log\frac{b_1}{b_2} + a_1\,\frac{b_2 - b_1}{b_1}, \qquad (18)$$
where $\psi$ is the digamma function.
Recall that the KL divergence between a Dirichlet distribution of parameter $\alpha = (\alpha_1,\dots,\alpha_K)$ and a Dirichlet distribution of parameter $\beta = (\beta_1,\dots,\beta_K)$ is
$$\mathrm{KL}\big(\mathrm{Dir}(\alpha)\,\big\|\,\mathrm{Dir}(\beta)\big) = \log\frac{\Gamma(\bar\alpha)}{\Gamma(\bar\beta)} + \sum_{k=1}^{K}\log\frac{\Gamma(\beta_k)}{\Gamma(\alpha_k)} + \sum_{k=1}^{K}(\alpha_k - \beta_k)\big(\psi(\alpha_k) - \psi(\bar\alpha)\big), \qquad (19)$$
where $\bar\alpha = \sum_{k}\alpha_k$ and $\bar\beta = \sum_{k}\beta_k$.
Appendix B Proof of Theorem 1
We mostly follow the proof technique developed in Catoni (2007). For any , fix and let for any . We are going to distinguish two cases, depending on whether or not Bernstein’s condition is satisfied.
If Bernstein’s condition is satisfied, we apply Lemma 13 to the variables $\ell(\theta, Z_{i,j}) - \ell(\theta^*_i, Z_{i,j})$. Note that in this case, $v$ is actually the variance term $V_i(\theta)$. So, for any ,
We let , which gives
Making use of Bernstein’s condition gives
If Bernstein’s condition is not satisfied, we apply Lemma 12 to the same variables, which gives, for any ,
Letting gives
Defining the general bound
(20)
where is equal to if Bernstein’s condition is satisfied, and otherwise, it holds in either case that
Rearranging the terms gives
Next, integrating this bound with respect to and using Fubini’s theorem to exchange both integrals gives
We then apply Lemma 14 to the argument of the expectation with respect to the sample, and we have
Jensen’s inequality implies
in other words,
At this stage, we can replace by its value given in (20) to obtain the bound:
Next, we rearrange the terms and replace the supremum on by :
We then replace by , and, by definition of the Gibbs posterior , the above bound is the same as
In particular, under Bernstein’s condition, i.e., if , the choice gives
Without Bernstein’s condition, i.e., if , rewriting the bound and taking the minimum over yields
and this concludes the proof.
Appendix C Proof of Corollary 2
First,
Now, define as the restriction of to the set . Then,
by assumption. An optimization with respect to leads to and we obtain the first statement:
Plugging this into Theorem 1 leads immediately to the other statements.
Appendix D Proof of Lemma 4
First, we note that is differentiable on and . As a consequence, is maximized at and its maximum is . This implies that is -Lipschitz, that is, for any ,
Taking the square of both sides of the inequality yields
(21)
Then, and thus, (minimum reached for ). This implies that is -strongly convex, that is, for any we have, for any ,
We apply this inequality to and rearrange terms, and we obtain
(22)
Appendix E Proof of Theorem 5
The proof of Theorem 5 is structured as follows: we first bound the excess risk by the expectation of the infimum of the empirical risk in Lemma 15. Using classic techniques, we then turn this bound into a bound on the prediction risk.
E.1 Lemma
Lemma 15
Assume that the loss satisfies the boundedness assumption (3). Then, the following bound holds with the choice :
Proof For any , let
Note that
where is a shortcut notation for . Besides, note that, by the boundedness assumption on , it almost surely holds that . Applying Lemma 13 to gives, for any ,
where . This factor can be bounded as by Theorem 3, which states that Bernstein’s condition is satisfied at the meta level, so that the bound becomes
Integrating with respect to the prior and using Fubini’s theorem yields
Next, by an application of Lemma 14, the left-hand side becomes
We then make use of Jensen’s inequality and arrange terms, so that the bound becomes
We replace the supremum on by an evaluation of the term in and arrange terms, so that the bound becomes
which is identical to
Then, we replace by its value, yielding the bound
Since the term does not depend on , we can simplify it on both sides of the inequality, which gives
(23)
Theorem 1 provides the following lower bound for any :
(24)
This further implies that
(25)
for any . In particular, applying (24) to and (25) to , and plugging the results into the left-hand side of (23) gives
We remove from both sides of the inequality and arrange terms, so that the bound becomes
By definition, is the minimizer of the integrand of the right-hand side, and therefore,
and the choice yields
which concludes the proof of the lemma.
E.2 Proof of Theorem 5
Appendix F Application of Theorem 5 to the Gaussian Case
Assume that is -Lipschitz. As a result, the risks are also -Lipschitz. We choose the prior (and hence, the variances are the same for all coordinates). A straightforward application of (17) gives
In this case, , where
We now define as the family of distributions on , where
Fix a prior on priors . We choose , where
where $\Gamma$ is the gamma function and $\psi$ is the digamma function. We apply Theorem 5:
We next apply Assumption 4, and the bound becomes
Idea: define, for any , the value that minimizes , that is: , and define . If -a.s., all the tasks have the same solution. On the other hand, if has a lot of variation, then the tasks have very unrelated solutions. In the infimum above, take and . Then,
Let , this quantity will be very important in the rate. Therefore,
An exact optimization in gives and after simplifications,
We now consider separately two cases: or for some (very small) that we will choose later. First case: . This is the case where we do not expect a significant improvement on the bound from the meta-learning. Simply take to get
In this case, an accurate optimization with respect to and will not improve the bound significantly, and for this reason, we simply take and . Also note that for we have and thus and . Thus,
Second case: . In this case, we expect a significant improvement in the bound from the meta-learning, and in order to take advantage of it, we choose very small: . We obtain
In order to take advantage of the meta-learning, we are going to make sure that is very small by tuning and adequately. But then we have , and we obtain
Choose (make sure that ) and an optimization with respect to gives . Reinjecting in the bound gives
The last step is to choose . Interestingly enough, for we recover the rate of learning in isolation: . Now, an optimization with respect to gives and thus
In this regime, this is a significant improvement compared to learning in isolation, in which case we can simplify the bound as
where
As a conclusion,
Appendix G Application of Theorem 5 to the Case of Mixtures of Gaussians
We first consider priors that are mixtures of Gaussians, where the number of components $K$ is known:
We set the prior . Then, denoting by the pdf of the normal distribution , (17) implies, for any ,
where the inequality on the second line follows from the log sum inequality of Cover and Thomas, and the bound from Theorem 5 becomes, at ,
Assumption (4) implies that
It follows that the previous bound with the choice becomes
While the choice may seem less meaningful than in the Gaussian case (with one single component), it is completely natural, as the best possible choice for the parameter is . In the computation, each component of the mixture brings an error term which can be decomposed into a bias term and a variance term,
for which the choice minimizes the first order error term. Next, we set the family of distributions on :
where is the Dirichlet distribution of parameter . We set the prior on priors , where and . Then, using (17), (18) and (19),
where is the digamma function. We can next use the bound from Theorem 5 and we have
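Assuming (19) is the KL divergence between Dirichlet distributions (which would explain the digamma terms above; the exact display is not recoverable here), the standard closed form, recalled for reference, is: with $\alpha_0 = \sum_k \alpha_k$ and $\beta_0 = \sum_k \beta_k$,
\[
\mathrm{KL}\big(\mathrm{Dir}(\alpha)\,\|\,\mathrm{Dir}(\beta)\big) \;=\; \log\frac{\Gamma(\alpha_0)}{\Gamma(\beta_0)} + \sum_{k=1}^K \log\frac{\Gamma(\beta_k)}{\Gamma(\alpha_k)} + \sum_{k=1}^K (\alpha_k - \beta_k)\big(\psi(\alpha_k) - \psi(\alpha_0)\big).
\]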
Next, minimizing over gives the optimal value , and replacing it in the above bound gives
We are going to restrict the infimum to the set of such that for any . In other words, we are selecting only the best component of the mixture in the optimization bound. The reader can check that this is actually the exact solution to the minimization problem in the above bound. As a result of this minimization, the bound becomes
Please note that the term inside the expectation is, up to the minimum on , identical to the one we had in the case of a single Gaussian, except for the term , which may be seen as a penalty for the choice of the component in the mixture. We then bound the expectation term in the above bound by first using Fubini’s theorem, and then inverting the minimum and the second expectation:
(26)
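The inversion of the minimum and the expectation is licensed by the elementary inequality
\[
\mathbb{E}\Big[\min_{1 \le k \le K} X_k\Big] \;\leq\; \min_{1 \le k \le K}\,\mathbb{E}[X_k],
\]
which holds because, for each fixed $k$, $\min_j X_j \le X_k$ pointwise (the notation $X_k$ is generic here).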
We can then bound the expectation term, which we decompose as
Jensen’s inequality helps to bound both the first term
and the second term
in the decomposition. The third term can be bounded as follows
The bound on the expectation then becomes
In our final bound, we wish to have as few terms as possible in while the terms in are not so problematic, because they correspond to the fast convergence rate at the meta-level. For this reason, we are going to take out of the infimum:
•
the term , which is unavoidable and corresponds to the main term of the bound in the worst case, with a speed of convergence;
•
the term , which will be handled through an optimization in and will be a term.
An exact optimization in gives
and replacing it in the bound yields
From here, we set for any , which implies
because is increasing. Please also note that
We can then deduce the bound
Let
it is clear that
By choosing minimizing , the previous bound becomes
(28)
Please note that are characteristic of the distribution . Intuitively, if the distribution has modes or Gaussian mixture components, then correspond to the centers of these modes or components, up to a permutation. Consequently, they do not scale with or , but can be regarded as problem parameters of constant order.
Let , to be specified later; we are going to consider two separate cases:
•
either , implying that the distribution is concentrated around and the variance of the local distribution around each of those points is smaller than ;
•
or .
G.1 First Case:
In this case, we expect the distribution to be well concentrated around (which are the centers of the mixtures). As a result, the optimal parameter in the new task is going to be close to one of and we can infer it from the previous tasks. This is precisely the case where we expect a significant improvement from meta-learning over learning in isolation.
A closer look at the bound above shows that, besides the term (for which we will provide an interpretation later in this section), all the terms are , which is the fast rate of convergence at the meta level, except possibly for the terms and . Nevertheless, the assumption on shows that the second of those terms is very small (of order ). For this reason, we are going to choose small enough so that the first term becomes small, and we set
Replacing in the bound gives
The second term in the bound can easily be controlled using the inequality , which gives
and replacing by its upper bound gives
From here, the choice of the rate yields the final bound:
where we denoted
and
Remark 16
Please note that
and
This bound comes as no surprise, because the process of learning the parameter in the mixture of Gaussians framework consists of two different steps:
•
first, identifying the centers of the mixtures, which, similarly to the finite case, is captured in the term;
•
then, identifying the right parameters of the Gaussian components centered at the points identified in the previous step. Similarly to the Gaussian case, this is captured in , in the most favorable case.
Remark 17
It may seem paradoxical that the model of mixtures of Gaussians achieves, in the best possible case, a rate of convergence slower than the rate achieved by a single Gaussian in the regime under optimal conditions. In reality, the latter rate of convergence is also achievable in the model of mixtures of Gaussians but, as in the case of a single Gaussian component, it requires the strong assumption that
which is much more restrictive. On the other hand, many distributions only satisfy
for some , in which case the rate of convergence achieved here is much faster than , which is the best possible rate achieved in the single Gaussian model in general.
G.2 Second Case:
In this case, is not smaller than and therefore, the choice of a large parameter would make the term too large to achieve the desired rate of convergence. As a result, we set
Replacing in the bound (28) gives
G.3 In Summary
In any case, the bound can be written as
where
is the convergence term in the finite case (to find the centers of the mixtures), and
is equal to the convergence term in the Gaussian case (with one component) and is a if there exist points such that the distribution is concentrated at a rate around those points, and otherwise. The remaining term is
and it is the convergence term at the meta level. Please note that it is a .
G.4 What if the number of components in the mixture is unknown?
In practice, we do not know in advance how to choose the number of components in the prior. In this case, we are going to include inside all the mixtures of Gaussians, i.e.,
Note that the sum inside the definition of is finite, since for any beyond a certain rank . We still denote by the prior in each task. By definition, for any . It still holds that, for any ,
where we denoted
To be clear, the KL term remains identical to the case where is known, except that the sums over no longer stop at a pre-determined . This difference aside, the bound remains identical to the one in the case where is known, and the bound from Theorem 1 becomes, at ,
Next, we are going to define a prior on within the prior of priors as follows. We assume that the number of mixtures is smaller than , because even if it were not, it would be impossible to identify them with enough confidence. We define the set of priors on priors
where is the prior distribution on and
and we set the prior of priors as . We also need to re-compute the KL divergence between the priors of priors, which becomes
using (16). Please note that in any optimization on , we optimize first in and then on conditionally on . This means that the latter parameters are allowed to depend on . While the infimum on of any quantity should be written , we will adopt the shortcut notation . We can next use the bound from Theorem 5 and we have
The optimization on may be performed exactly by setting , and the bound becomes
We restrict the optimization in to the set , and the bound becomes
Next, as is classical, we decompose the expectation as
Applying Fubini’s theorem and inverting the infimum and the expectation yields the bound
The bound (27) from the previous part still holds:
and we can inject it in the computation so that the bound becomes
An exact optimization in yields and we plug it back into the bound:
We choose for any and noting that both
and
we deduce the following bound:
Recall the (unchanged) definition of :
Recalling that (as well as ) is allowed to depend on , we define as the minimizer (up to a permutation) of . It follows that the bound becomes
From here, we are going to distinguish two different cases, as before:
•
first case: there exists a reasonably small such that ;
•
second case: for any reasonable , .
G.4.1 First Case
Again, this is the case where we expect some improvement from the meta-learning. Setting
the bound becomes
and after the simplification
we deduce
Then, we restrict the infimum over all multinomial distributions to the Dirac masses, i.e., to all such that there exists with . It follows that , and we deduce that
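The simplification obtained from the restriction to Dirac masses can be made explicit: if the weight vector $p$ puts all its mass on a single component $j$, i.e., $p_j = 1$ and $p_k = 0$ for $k \neq j$, then the discrete KL term collapses (with the convention $0\log 0 = 0$) to
\[
\sum_{k=1}^K p_k \log\frac{p_k}{q_k} \;=\; \log\frac{1}{q_j}.
\]
(The notation $p$, $q$ is generic here, standing for the weight vectors of the two mixtures.)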
In particular, if there exists a relatively small such that , then the bound becomes
G.4.2 Second Case
Recall the bound:
In this case, we do not expect much improvement from meta-learning, and we will simply choose
The bound then becomes
As before, we restrict the minimization to the set of Dirac distributions and deduce
G.4.3 Overall Bound
In summary, the bound can be written as
where and are exactly the same terms as in the case where the number of mixtures is known, and the convergence term at the meta level becomes
Even when the number of mixtures is unknown, the same bound as in the case of known can be achieved, up to a term which is of the order of the cost of identifying the optimal number of components in the mixture at the meta level.
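As a purely illustrative sketch of the order of this extra term (the actual prior on the number of components is not reproduced here, so the following is an assumption rather than the paper’s exact choice): if the prior puts mass $\pi(K) \propto 2^{-K}$ on $K \in \{1, 2, \dots\}$, then the complexity cost of the unknown number of components in the KL term is
\[
\log\frac{1}{\pi(K^\star)} \;=\; K^\star \log 2 + O(1),
\]
so that, once divided by the number of tasks, it only contributes a term proportional to the optimal number of components $K^\star$ at the fast meta-level rate, consistent with the conclusion above.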