Bayes meets Bernstein at the Meta Level: an Analysis of Fast Rates in Meta-Learning with PAC-Bayes
Abstract
Bernstein’s condition is a key assumption that guarantees fast rates in machine learning. For example, the Gibbs algorithm with prior $\pi$ has an excess risk in $O(d_{\pi}/n)$, as opposed to the standard $O(\sqrt{d_{\pi}/n})$, where $n$ denotes the number of observations and $d_{\pi}$ is a complexity parameter which depends on the prior $\pi$. In this paper, we examine the Gibbs algorithm in the context of meta-learning, i.e., when learning the prior $\pi$ from $T$ tasks (with $n$ observations each) generated by a meta distribution. Our main result is that Bernstein’s condition always holds at the meta level, regardless of its validity at the observation level. This implies that the additional cost to learn the Gibbs prior $\pi$, which will reduce the term $d_{\pi}$ across tasks, is in $O(1/T)$, instead of the expected $O(1/\sqrt{T})$. We further illustrate how this result improves on the standard rates in three different settings: discrete priors, Gaussian priors and mixture of Gaussians priors.
Keywords: Bernstein’s condition, meta-learning, fast rates, PAC-Bayes bounds, information bounds, Gibbs algorithm, variational approximations.
1 Introduction
One of the greatest promises of artificial intelligence is the ability to design autonomous systems that can learn from different situations throughout their lives and adapt quickly to new environments, as humans, animals and other living things naturally do. Based on the intuition that a new problem often has significant similarities to previously encountered tasks, the use of past experience is particularly important in areas such as computer vision (Quattoni et al., 2008; Kulis et al., 2011; Li et al., 2018; Achille et al., 2019), natural language processing (Huang et al., 2018; Gu et al., 2018; Dou et al., 2019; Qian and Yu, 2019) and reinforcement learning (Finn et al., 2017; Mishra et al., 2018; Wang et al., 2016; Yu et al., 2020), where the learner has access to only a few training examples for the task of interest, but for which a vast amount of datasets from a variety of related tasks is available. In the area of digit recognition, for example, it is possible to leverage the experience gained from millions of similar open source image classification datasets, as the key features needed to distinguish cats from dogs or pants from shirts can be used to classify handwritten digits. This idea is at the heart of meta-learning (Thrun and Pratt, 1998; Baxter, 2000; Vanschoren, 2019), a field that has recently attracted a lot of attention due to its huge success in real-world applications, and which aims to improve performance on a particular task by transferring the knowledge contained in different but related tasks.
Meta-learning has been widely studied in the recent literature, although the term has been used to refer to a wide range of situations, and providing a precise definition of meta-learning is a challenge. In particular, the terms transfer learning and multi-task learning, although distinct, are often used interchangeably with meta-learning. Transfer learning is a very general concept that involves two tasks that share similarities, a source and a target, and consists in transferring the knowledge acquired on the source dataset to better process the target data (Pan and Yang, 2010; Zhuang et al., 2020). In practice, this can be formulated in many different ways, but the most popular approach is to pre-train a model on the source data, e.g., images of cats and dogs, and then to fine-tune it on the target training dataset, e.g., images of handwritten digits. In particular, a challenging problem in transfer learning is to find a measure that quantifies the similarity between the source and target tasks. Multi-task learning adopts a different framework, where multiple learning tasks are considered and the goal is to learn a model that can handle all tasks simultaneously (Caruana, 1997; Zhang and Yang, 2021). The model usually has a common representation, e.g., the first layers of a deep neural network, and a task-specific component, e.g., the last layer of the network. Meta-learning also considers a collection of datasets from a variety of tasks, but unlike multi-task learning, we are not interested in solving a fixed set of tasks, but rather in being prepared for future tasks that are not yet given. Also, unlike transfer learning, meta-learning exploits the commonality of previous tasks rather than the similarity between some specific source and target tasks. We use these metadata to design a meta-procedure that adaptively learns a predictor for any new learning task that is a priori unknown; the goal is to quickly learn to adapt a learning procedure from past experience. Meta-learning is therefore sometimes referred to as learning-to-learn, or lifelong learning in the online context. The implementation of this learning-to-learn mechanism can take different forms, which we briefly describe in the following paragraph.
As the name suggests, meta-learning involves two levels of abstraction to improve learning over time: a meta-level and a within-task level. At the within-task level, the new task of interest is presented and the corresponding pattern is learned from the training data set of the task at hand. This learning process is greatly accelerated by a meta-learner, which has distilled the knowledge accumulated in previous tasks into the within-task model. The meta-learning procedure can accelerate the within-task algorithm in various ways, and three main categories stand out in the literature: metric-based methods, which are based on non-parametric predictive models governed by a metric that is learned using the meta-training dataset (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018); model-based methods, which quickly update the parameters in a few learning steps, which can be achieved by the model’s internal architecture or another meta-learning model (Santoro et al., 2016; Munkhdalai and Yu, 2017; Mishra et al., 2018); and optimisation-based methods, which mainly involve learning the hyper-parameters of a within-task algorithm using the meta-training set for fast adaptation (Hochreiter et al., 2001; Ravi and Larochelle, 2017; Finn et al., 2017; Nichol et al., 2018; Qiao et al., 2018; Gidaris and Komodakis, 2018). Due to their performance and ease of implementation, the optimisation-based family is the dominant class in the recent literature, exploiting the idea that well-chosen hyperparameters can greatly speed up the learning process and allow model parameters to be quickly adapted to new tasks with little data. For example, it is possible to learn the task of interest using a gradient descent whose initialisation and learning rate would have been learned from the metadata. Among the best known meta-strategies is the model agnostic meta-learning procedure (MAML) (Finn et al., 2017) and its variants implicit MAML (Rajeswaran et al., 2019), Bayesian MAML (Grant et al., 2018; Yoon et al., 2018; Nguyen et al., 2020) and Reptile (Nichol et al., 2018). We refer the interested reader to the recent review by Chen et al. (2023) for more details.
2 Approach and Contributions
In this paper, we focus on Gibbs algorithms within tasks, or their variational approximations. Gibbs algorithms, also known as Gibbs posteriors (Alquier et al., 2016) or exponentially weighted aggregation (Dalalyan and Tsybakov, 2008), can also be interpreted in the framework of Bayesian statistics as a kind of generalized posterior (Bissiri et al., 2016). PAC-Bayes bounds were developed to control the risk and the excess risk of such procedures (Shawe-Taylor and Williamson, 1997; McAllester, 1998; Catoni, 2004; Zhang, 2006; Catoni, 2007; Yang et al., 2019); see Guedj (2019); Alquier (2021) for recent surveys. More recently, the related mutual information bounds (Russo and Zou, 2019; Haghifam et al., 2021) were also used to study the excess risk of Gibbs algorithms (Xu and Raginsky, 2017). Gibbs posteriors are often intractable, and it is then easier to compute a variational approximation of such a posterior. It appears that PAC-Bayes bounds can also be used on such approximations (Alquier et al., 2016). Many recent publications have built the foundations of meta-learning through PAC-Bayes and information bounds (Pentina and Lampert, 2014; Amit and Meir, 2018; Ding et al., 2021; Liu et al., 2021; Rothfuss et al., 2021; Farid and Majumdar, 2021; Rothfuss et al., 2022; Guan and Lu, 2022a; Rezazadeh, 2022). These works and the related literature are discussed in detail in Section 6. Most of these papers proved empirical PAC-Bayes bounds for meta-learning. These bounds can be minimized, yielding both a practical meta-learning procedure and a numerical certificate on its generalization. However, these works did not focus on the rate of convergence of the excess risk.
Bernstein’s condition is a low-noise assumption reflecting the inherent difficulty of the learning task (Mammen and Tsybakov, 1999; Tsybakov, 2004; Bartlett and Mendelson, 2006). While it was initially designed to study the ERM (Bartlett and Mendelson, 2006), it characterizes the learning rate of algorithms beyond the ERM. PAC-Bayes bounds and mutual information bounds show that the excess risk of the Gibbs algorithm is in $O(d_{\pi}/n)$ when Bernstein’s condition is satisfied (Catoni, 2007; Grünwald and Mehta, 2020), as opposed to the slow rate $O(\sqrt{d_{\pi}/n})$ in the general case. The quantity $d_{\pi}$ measures the complexity of the task. Importantly, it also depends on the prior distribution $\pi$ used in the algorithm. Similar results hold when we replace the Gibbs algorithm by a variational approximation (Alquier et al., 2016).
In the meta-learning setting, we are given $T$ tasks simultaneously. Using the Gibbs algorithm with a fixed prior $\pi$ in all tasks leads to an average excess risk in $\mathcal{E}[(d_{\pi}/n)^{\kappa}]$, where $\kappa = 1$ when Bernstein’s condition holds for each task, and $\kappa = 1/2$ otherwise. Here, $\mathcal{E}$ denotes the expectation with respect to a future (out-of-sample) task. This approach is referred to as “learning in isolation”, because each task is solved regardless of the others. Of course, in meta-learning we want to take advantage of the multiple tasks. For example, Pentina and Lampert (2014) used the Gibbs algorithm at the meta level, in order to learn a better prior. The expected benefit is to reduce the complexity term $d_{\pi}$.
Overview of the paper:

- In Section 3, we recall existing results on the excess risk of the Gibbs algorithm when learning tasks in isolation, and we introduce Bernstein’s condition, a fundamental assumption under which the fast rate is achieved by the Gibbs algorithm.
- In Section 4, we prove that, regardless of its validity at the within-task level, Bernstein’s condition is always satisfied at the meta level. As a consequence of this result, we show that a meta-level Gibbs algorithm achieves an excess risk in $\mathcal{E}[(d_{\pi}/n)^{\kappa}] + O(1/T)$, with $\kappa = 1$ if Bernstein’s condition is satisfied within tasks, and $\kappa = 1/2$ otherwise. We further raise the open question of the generalization of this result to its variational approximations.
- In Section 5, we apply the previous results to various settings: learning a discrete prior, learning a Gaussian prior and learning a mixture of Gaussians prior. We show that the gain brought by meta-learning is substantial, as in some favorable situations the complexity term can even become negligible.
- In Section 6, we provide a deeper comparison with the rich literature on meta-learning.
3 Problem Definition and Notations
Let $\mathcal{Z}$ be a space of observations, $\Theta$ be a decision space and $\ell : \Theta \times \mathcal{Z} \to \mathbb{R}$ be a bounded loss function defined on these sets. Let $\mathcal{P}(\Theta)$ denote the set of all probability distributions on $\Theta$, equipped with a suitable $\sigma$-field. The learner has to solve $T$ tasks. Given a task $i \in \{1,\dots,T\}$, the learner receives the observations $Z_{i,1},\dots,Z_{i,n}$, assumed to be drawn independently from a distribution $P_i$ on the space of observations $\mathcal{Z}$. The objective of the learner is to find a parameter $\theta$ in the parameter space $\Theta$ which minimizes the so-called prediction risk associated to $\theta$ on task $i$, defined as
$$R_i(\theta) = \mathbb{E}_{Z \sim P_i}\big[\ell(\theta, Z)\big].$$
We denote by $R^*_i$ the minimum of $R_i$ and by $\theta^*_i$ its corresponding minimizer:
$$R^*_i = R_i(\theta^*_i) = \min_{\theta \in \Theta} R_i(\theta).$$
In Bayesian approaches, we rather seek a distribution $\rho \in \mathcal{P}(\Theta)$ such that
$$\mathbb{E}_{\theta \sim \rho}\big[R_i(\theta)\big]$$
is small.
is small. Defining, for any , the empirical risk as
a standard choice for $\rho$ is the so-called Gibbs posterior, given by
$$\hat{\rho}_i(\mathrm{d}\theta) = \frac{e^{-\alpha \hat{r}_i(\theta)}\,\pi(\mathrm{d}\theta)}{\int_{\Theta} e^{-\alpha \hat{r}_i(\vartheta)}\,\pi(\mathrm{d}\vartheta)}, \qquad (1)$$
where $\pi \in \mathcal{P}(\Theta)$ is the prior distribution on the parameter space and $\alpha > 0$ is some parameter which will be made explicit later. The corresponding risk estimate is defined by
$$\hat{R}_i(\pi) = -\frac{1}{\alpha}\log \int_{\Theta} e^{-\alpha \hat{r}_i(\theta)}\,\pi(\mathrm{d}\theta).$$
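To fix ideas, here is a minimal numerical sketch (our own illustration; the finite grid and variable names are not from the paper) of the Gibbs posterior (1) over a finite set of parameters, computed in a numerically stable way.

```python
import numpy as np

# Sketch: the Gibbs posterior (1) on a finite grid of parameters.
def gibbs_posterior(emp_risk, prior, alpha):
    # emp_risk: (m,) empirical risks r_hat(theta) on the grid
    # prior:    (m,) prior weights pi(theta), summing to 1
    logw = -alpha * emp_risk + np.log(prior)
    logw -= logw.max()          # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()

# Example: risks of 4 candidate parameters under a uniform prior.
emp_risk = np.array([0.30, 0.10, 0.25, 0.50])
prior = np.full(4, 0.25)
print(gibbs_posterior(emp_risk, prior, alpha=10.0))
```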
More generally, we can also consider variational approximations, defined by
$$\hat{\rho}_i^{\mathcal{F}} = \arg\min_{\rho \in \mathcal{F}}\left\{\mathbb{E}_{\theta \sim \rho}\big[\hat{r}_i(\theta)\big] + \frac{\mathrm{KL}(\rho\|\pi)}{\alpha}\right\}, \qquad (2)$$
where $\mathcal{F} \subseteq \mathcal{P}(\Theta)$. Adequate choices for $\mathcal{F}$ lead to feasible minimization problems, and do not affect the generalization properties (Alquier et al., 2016). With these notations, $\hat{\rho}_i = \hat{\rho}_i^{\mathcal{P}(\Theta)}$.
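As a hedged illustration of (2) (a toy example of ours, with an assumed quadratic empirical risk and a standard Gaussian prior, none of which come from the paper), the variational problem over the Gaussian family can be solved numerically; in this particular case the Gibbs posterior is itself Gaussian, so the variational approximation is exact.

```python
import numpy as np
from scipy.optimize import minimize

# Toy variational approximation (2) over F = {N(m, s^2)}, prior pi = N(0, 1),
# quadratic empirical risk r_hat(theta) = (theta - c)^2 (all assumptions ours).
def objective(params, c, alpha):
    m, log_s = params
    s2 = np.exp(2 * log_s)
    expected_risk = (m - c) ** 2 + s2          # E_{N(m,s^2)}[(theta - c)^2]
    kl = 0.5 * (m ** 2 + s2 - 1.0) - log_s     # KL(N(m,s^2), N(0,1))
    return expected_risk + kl / alpha

c, alpha = 1.0, 5.0
res = minimize(objective, x0=[0.0, 0.0], args=(c, alpha))
m_hat, s_hat = res.x[0], np.exp(res.x[1])
# Here the Gibbs posterior is itself Gaussian, so the variational solution
# matches it exactly: mean 2*alpha*c/(2*alpha + 1), variance 1/(2*alpha + 1).
print(m_hat, s_hat ** 2)
```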
3.1 Assumptions on the loss and Bernstein’s condition
Recall that we assumed the loss function to be bounded: there exists a constant $C > 0$ such that
$$0 \le \ell(\theta, z) \le C \quad \text{for any } (\theta, z) \in \Theta \times \mathcal{Z}. \qquad (3)$$
PAC-Bayes bounds for unbounded losses are well known; see the discussion at the end of Alquier (2021). However, many of those bounds are somewhat more complicated. Thus, the choice to work with bounded variables is made here for the sake of clarity, and is not due to a fundamental limitation of our method.
We define the variance term for task $i$ by
$$V_i(\theta) = \mathbb{E}_{Z \sim P_i}\Big[\big(\ell(\theta, Z) - \ell(\theta^*_i, Z)\big)^2\Big]$$
for any $\theta \in \Theta$. The following condition is crucial in the study of the risk of Gibbs posterior.
Assumption 1 (Bernstein’s condition)
There exists a constant $c > 0$ such that, for any task $i$ and any $\theta \in \Theta$,
$$V_i(\theta) \le c\,\big(R_i(\theta) - R_i(\theta^*_i)\big).$$
This assumption characterizes the excess risk of Gibbs posterior, see Theorem 1 below. In this paper, we will provide a bound on the excess risk both under this condition and without it.
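As a quick numerical illustration (a toy example of ours: squared loss with Gaussian data, which is unbounded, so it illustrates Assumption 1 itself rather than the bounded-loss setting (3)), both sides of Bernstein’s condition can be estimated by Monte Carlo; on a bounded parameter set the ratio stays bounded, which yields the constant $c$.

```python
import numpy as np

# For l(theta, z) = (theta - z)^2 with z ~ N(0, 1), Bernstein's condition is
# known to hold on bounded parameter sets. We estimate V(theta) and the
# excess risk R(theta) - R(theta*) by Monte Carlo on a grid of theta values.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=200_000)
theta_star = 0.0  # minimizer of R(theta) = E[(theta - z)^2] = theta^2 + 1

for theta in [0.1, 0.5, 1.0, 2.0]:
    excess = (theta - z) ** 2 - (theta_star - z) ** 2  # l(theta,Z) - l(theta*,Z)
    variance_term = np.mean(excess ** 2)               # V(theta)
    excess_risk = np.mean(excess)                      # R(theta) - R(theta*)
    print(f"theta={theta}: V={variance_term:.3f}, "
          f"excess={excess_risk:.3f}, ratio={variance_term / excess_risk:.3f}")
```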
In some of the applications developed in Section 5, we will also make the following assumption: there exists a constant $G > 0$ such that, for any task $i$ and any $\theta \in \Theta$,
$$R_i(\theta) - R_i(\theta^*_i) \le G\,\|\theta - \theta^*_i\|^2. \qquad (4)$$
Intuitively, Taylor’s expansion of $R_i$ around $\theta^*_i$ gives
$$R_i(\theta) \approx R_i(\theta^*_i) + \frac{1}{2}\,(\theta - \theta^*_i)^\top \nabla^2 R_i(\theta^*_i)\,(\theta - \theta^*_i),$$
since $\nabla R_i(\theta^*_i) = 0$,
and thus we can expect (4) to be satisfied when the risk is smooth enough. However, note that this assumption is not necessary for the main results of this paper to hold, and will only be used in very specific applications.
3.2 Learning in Isolation
In the process of learning in isolation, we consider each of the $T$ tasks separately. We then fix some task $i \in \{1,\dots,T\}$ and denote by $S_i$ the set of observations from task $i$: $S_i = (Z_{i,1},\dots,Z_{i,n})$. We recall the following result, which can be found, e.g., in Alquier (2021), and whose proof is recalled in Appendix B for the sake of completeness.
Theorem 1
When specifying a model and a prior, a derivation of the right-hand side in Theorem 1 leads to explicit rates of convergence. For example, a classical assumption in the Bayesian literature is that there are constants $c_i, d_i > 0$ such that $\pi\big(\{\theta \in \Theta : R_i(\theta) - R_i(\theta^*_i) \le \varepsilon\}\big) \ge (c_i\,\varepsilon)^{d_i}$ for all $\varepsilon \in (0,1]$ (Ghosal and Van der Vaart, 2017). As this condition is usually applied to one task, with a specific prior, the notation does not reflect the dependence with respect to $i$ or to $\pi$. However, in our context, this dependence will be crucial, so we will write $d_i(\pi)$ instead of $d_i$. Under such an assumption, the right-hand side in Theorem 1 can be made more explicit. For simplicity, we state this result for Gibbs posteriors only.
Corollary 2
Assume that, almost surely on , . Then,
In particular, under Bernstein’s condition, the choice gives
On the other hand, without Bernstein’s condition,
and in particular, for , we obtain
4 Main Results
From this section onward, we focus on meta-learning. As opposed to learning in isolation, meta-learning considers all the tasks together and takes advantage of possible similarities between them to improve learning in each task. More precisely, while assuming that, for any task $i$, the observations $Z_{i,1},\dots,Z_{i,n}$ are drawn independently from the distribution $P_i$, we also assume that the distributions $P_1,\dots,P_T$ are themselves drawn independently from a certain distribution $\mathcal{T}$. A future (out-of-sample) task $P_{T+1}$ will also be drawn from $\mathcal{T}$, and the observations $Z_{T+1,1},\dots,Z_{T+1,n}$ will be drawn independently from $P_{T+1}$. This task will be solved by the Gibbs algorithm $\hat{\rho}_{T+1}$. Our objective is to learn the prior $\pi$ using the tasks $i = 1,\dots,T$, in order to make the meta-risk
$$\mathcal{E}(\pi) = \mathbb{E}\Big[\mathbb{E}_{\theta \sim \hat{\rho}_{T+1}}\big[R_{T+1}(\theta)\big]\Big]$$
as small as possible. We will compare it to the so-called oracle meta-risk
$$\mathcal{E}^* = \mathbb{E}\big[R_{T+1}(\theta^*_{T+1})\big],$$
which can only be reached by an oracle who would know the best classifier in each task in advance.
4.1 Bernstein’s condition at the meta level
In this subsection, we prove a version of Bernstein’s condition at the meta level. Let
for any prior , and let be the distribution minimizing the expectation of the above quantity:
Surprisingly enough, and in contrast to learning in isolation, Bernstein’s condition is always satisfied at the meta level, provided that we use Gibbs posteriors within tasks.
Theorem 3
Proof It classically holds (see, e.g., Alquier (2021)) that
(5)
for any fixed . By the boundedness assumption, it holds that, for any ,
We next use the following lemma, whose proof can be found in Appendix D.
Lemma 4
Let . Then, for any ,
An application of Lemma 4 to and gives
(these derivations are directly inspired by the proof technique introduced by Bartlett et al. (2006)).
Taking expectations with respect to on both sides
Integrating with respect to yields
(6)
By definition of , it holds that, for any ,
In particular, this holds for , and plugging this into the right-hand side of (6) gives
The (optimal) choice gives the desired bound
4.2 PAC-Bayes Bound for Meta-learning
We now seek a prior which allows us to obtain a small meta-risk
In order to do so, we will fix a set $\mathcal{G}$ of possible priors and a set $\mathcal{M}$ of distributions on these priors.¹ Given a probability distribution $\Lambda \in \mathcal{M}$, called the “prior on priors”, we define the Gibbs distribution on priors similarly as in (1), but at the meta level:
$$\hat{\Pi}(\mathrm{d}\pi) = \frac{\exp\big(-\beta\sum_{i=1}^{T}\hat{R}_i(\pi)\big)\,\Lambda(\mathrm{d}\pi)}{\int_{\mathcal{G}}\exp\big(-\beta\sum_{i=1}^{T}\hat{R}_i(\pi')\big)\,\Lambda(\mathrm{d}\pi')}, \qquad (7)$$
where $\beta > 0$ is some parameter.

¹ Note that measurability issues can arise when the set $\mathcal{G}$ is nonparametric. However, in all our examples, the set $\mathcal{G}$ is parametric.
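A hedged sketch of the meta-level Gibbs distribution (7) on a finite set of candidate priors follows; the discretization, helper names and the form of the within-task risk estimate $\hat{R}_i$ follow our reconstruction above, not necessarily the paper’s exact notation.

```python
import numpy as np
from scipy.special import logsumexp

# Sketch: weights of the meta-level Gibbs distribution (7) over K candidate
# priors on a finite parameter grid, with the within-task risk estimate
# R_hat_i(pi) = -(1/alpha) log E_pi[exp(-alpha r_hat_i)].
def gibbs_risk_estimate(emp_risk, prior, alpha):
    # emp_risk: (m,) empirical risks on the grid; prior: (m,) prior weights
    return -logsumexp(-alpha * emp_risk + np.log(prior)) / alpha

def meta_gibbs_weights(emp_risks, candidate_priors, alpha, beta):
    # emp_risks: (T, m) risks of m parameters on T tasks
    # candidate_priors: (K, m) rows are candidate priors over the m parameters
    scores = np.array([sum(gibbs_risk_estimate(r, p, alpha) for r in emp_risks)
                       for p in candidate_priors])
    logw = -beta * scores
    w = np.exp(logw - logw.max())   # stabilized exponentiation
    return w / w.sum()              # distribution over the K candidate priors

emp_risks = np.array([[0.2, 0.5], [0.1, 0.6]])   # two tasks, two parameters
priors = np.array([[0.5, 0.5], [0.9, 0.1]])      # two candidate priors
print(meta_gibbs_weights(emp_risks, priors, alpha=5.0, beta=2.0))
```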
As a consequence of Theorem 3 comes the next result, whose proof is given in Appendix E.
Theorem 5
Assume that the loss satisfies (3). Defining , it holds, for any ,
While the bound given in Theorem 5 depends on some subset chosen for computational reasons, the bound is on the excess risk of , which itself is based on the exact Gibbs posterior , and not on its variational approximation , based on and defined as
In some settings, is tractable and its Gibbs-based counterpart is not, and it is a fundamental open question to determine under what condition on we can replace by in Theorem 5.
Open Question 1
Under what conditions on can we replace by in the left-hand side of Theorem 5?
4.3 A Toy Application of Theorem 5: Concurrent Priors
This subsection gives a toy application of Theorem 5, just to fix ideas. Here, we study the case where several statisticians each propose a different prior, all of which are assumed to satisfy a prior mass condition as in Corollary 2. We denote by $\mathcal{G}$ the (finite) set of these priors, choose $\Lambda$ as the uniform distribution on $\mathcal{G}$, and take $\mathcal{M}$ as the set of all distributions on $\mathcal{G}$. Here again, for the sake of simplicity, we assume that Bernstein’s condition (see Assumption 1) is satisfied at the task level.
5 Applications of Theorem 5
In this section, by an application of Theorem 5, we derive explicit bounds on the excess risk of the Gibbs algorithm in the case of discrete priors (the parameter set is finite; Subsection 5.1), Gaussian priors (Subsection 5.2) and mixtures of Gaussian priors (Subsection 5.3).
5.1 Learning Discrete Priors
In this subsection, we assume that $\Theta$ is finite, with $|\Theta| = m$. Following Meunier and Alquier (2021), we define $\Theta^*$ as the smallest possible subset of $\Theta$ such that
$$\mathbb{P}\big(\theta^*_{T+1} \in \Theta^*\big) = 1, \qquad (8)$$
and we denote $m^* = |\Theta^*|$. In general, $\Theta^* \subseteq \Theta$ and $m^* \le m$. However, in some favorable situations, $\Theta^*$ is small and $m^* \ll m$, in which case the meta-learning may improve upon the learning in isolation. In the setting considered, Bernstein’s condition is trivially satisfied and the excess risk of the Gibbs algorithm is of order $\log(m)/n$.
We define our set of priors $\mathcal{G}$ as the set of probability distributions which are uniform on a subset of $\Theta$, parameterized by the subset $S$:
$$\mathcal{G} = \big\{\mathcal{U}(S) : \emptyset \ne S \subseteq \Theta\big\},$$
and $\mathcal{M}$ is the set of all distributions on $\mathcal{G}$. Our “prior on priors” $\Lambda$ is then defined as follows: we draw a cardinality $s \in \{1,\dots,m\}$ with probability $p_s$, then, given $s$, draw a subset $S \subseteq \Theta$ of cardinality $s$ uniformly at random, and take $\pi = \mathcal{U}(S)$. In other words, $\Lambda$ is a distribution on $\mathcal{G}$ such that $\Lambda\big(\{\mathcal{U}(S)\}\big) = p_{|S|}\big/\binom{m}{|S|}$.
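A hedged sampler for this “prior on priors” (the specific size distribution $p_s$ below is an assumption made for the sake of the example, not the paper’s choice):

```python
import numpy as np

# Draw a cardinality s, then a uniform prior on a uniformly chosen subset
# of Theta of size s, as in the construction of Lambda above.
rng = np.random.default_rng(0)

def sample_uniform_prior(m, p_sizes):
    s = rng.choice(np.arange(1, m + 1), p=p_sizes)  # cardinality of the support
    support = rng.choice(m, size=s, replace=False)  # subset of Theta, uniform
    prior = np.zeros(m)
    prior[support] = 1.0 / s                        # uniform prior on the subset
    return prior

m = 10
p_sizes = np.array([2.0 ** -s for s in range(1, m + 1)])  # assumed p_s
p_sizes /= p_sizes.sum()
print(sample_uniform_prior(m, p_sizes))
```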
Proposition 6
The excess risk of the meta predictor defined in (7) is bounded as follows:
Remark 7
Let us now compare the meta-learning rate above to the rate achieved by learning in isolation. In the unfavorable case , the meta-learning bound is only marginally larger than the learning-in-isolation bound, by a term which vanishes rapidly as $T$ grows.
In the favorable case, however, meta-learning considerably improves upon learning in isolation for large values of . In the extreme case where , we have
Thus, the benefits of meta-learning are mainly expected in the regime where $T$ is large. They are huge when the tasks are very similar, and close to zero when the tasks are totally unrelated. This is in line with the results of Meunier and Alquier (2021) in the online setting.
Proof We first consider learning in isolation. The classical choice is to take $\pi$ uniform on $\Theta$ in each task. An application of Theorem 1 gives
In the meta-learning case, an application of Theorem 5 gives
Under the assumption (8), it holds that
We conclude by using the classic bound $\binom{m}{m^*} \le \left(\frac{em}{m^*}\right)^{m^*}$.
5.2 Learning Gaussian priors
In this subsection, we consider the set of all isotropic Gaussian distributions
$$\mathcal{G} = \big\{\mathcal{N}(\mu, \sigma^2 I_d) : \mu \in \mathbb{R}^d,\ \sigma^2 > 0\big\}. \qquad (9)$$
So, priors on priors are actually defined as priors on and given by:
(10)
and we choose the prior on priors as .
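For this family, the KL terms appearing in the meta-level bound are available in closed form via (17) in Appendix A; a small helper (our notation and a sketch, not the paper’s code):

```python
import numpy as np

# Closed-form KL divergence between two isotropic Gaussian priors
# N(mu1, s1^2 I_d) and N(mu2, s2^2 I_d), as in equation (17).
def kl_isotropic_gaussians(mu1, s1, mu2, s2):
    d = mu1.shape[0]
    return (d * np.log(s2 / s1)
            + (d * s1**2 + np.sum((mu1 - mu2) ** 2)) / (2 * s2**2)
            - d / 2)

mu = np.zeros(3)
print(kl_isotropic_gaussians(mu, 1.0, mu + 0.5, 2.0))  # small when priors are close
```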
From now on and until the end of this section, we assume that both (4) and Assumption 1 (Bernstein’s condition) hold, and we are looking for a prior on priors from which to sample a prior $\pi$, such that $\pi$ concentrates as much as possible around the best parameters. Denoting and , the following holds.
Proposition 8
Interestingly enough, in the favorable case , a proper choice of the prior on priors leads to the very fast rate of convergence , which considerably improves upon the fast rate when . The detailed proof of this proposition, as well as explicit expressions of and , are given in Appendix F.
5.3 Learning Mixtures of Gaussian priors
In this subsection, we generalize the result of the previous section to priors that are mixtures of Gaussians. We still assume that (4) and Assumption 1 (Bernstein’s condition) hold. We first assume that the number $K$ of components in the mixture is known. Under these hypotheses, the set of possible priors is
(11)
We add a Dirichlet prior on the weights of the components in the mixture, and the set of priors on priors becomes
(12)
while the initial prior on priors is chosen as . We define
Proposition 9
Let us analyze each of the terms of the above bound. The first term is the bound we had in the finite case of Subsection 5.1. Visualizing our mixtures as the points in the finite case, this term is the time required by our estimator to select the right mixture. While this term makes the convergence rate notably worse than the one we might hope for, it is essentially unavoidable, as it already appears in the much simpler model of a finite set of parameters described in Subsection 5.1.
The next term is (similar to) the bound obtained in the Gaussian case of Subsection 5.2, with the exception that is replaced by , and scales with the convergence time to the best Gaussian for every task .
Finally, the last term is the convergence term at the meta level and is in $O(1/T)$. This is the cost of meta-learning compared to learning in isolation. When $T$ is large, this term is very small and justifies the use of meta-learning.
Remark 10
One may think, looking at the bounds in the two previous cases, that the convergence rate in the case of mixtures of Gaussians is slower than the one for Gaussians, which can be as fast as . In reality, the rate of convergence is (naturally) faster for the model of mixtures of Gaussians: in the case of mixtures of Gaussians, the convergence term is obtained under the assumption that , while in the Gaussian case, the much stronger assumption is required. Under this stronger assumption, the same rate is naturally achieved.
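At the meta level, the proofs (Appendix G) control the KL divergence between two mixture priors via the log sum inequality, which yields the closed-form upper bound sketched below; component matching and isotropic components are assumptions of this illustration, not requirements of the paper.

```python
import numpy as np

# Log sum inequality bound: for mixtures sum_k w_k N(mu_k, s_k^2 I) and
# sum_k v_k N(nu_k, t_k^2 I) with matched components,
# KL(mixture1, mixture2) <= KL(w, v) + sum_k w_k KL(component_k).
def kl_gauss(mu1, s1, mu2, s2):
    d = mu1.shape[0]
    return (d * np.log(s2 / s1)
            + (d * s1**2 + np.sum((mu1 - mu2) ** 2)) / (2 * s2**2) - d / 2)

def kl_mixture_upper_bound(w, mus1, s1s, v, mus2, s2s):
    kl_weights = np.sum(w * np.log(w / v))
    kl_comps = sum(w[k] * kl_gauss(mus1[k], s1s[k], mus2[k], s2s[k])
                   for k in range(len(w)))
    return kl_weights + kl_comps

w, v = np.array([0.6, 0.4]), np.array([0.5, 0.5])
mus1 = [np.zeros(2), np.ones(2)]
mus2 = [np.zeros(2), 2 * np.ones(2)]
print(kl_mixture_upper_bound(w, mus1, [1.0, 1.0], v, mus2, [1.0, 2.0]))
```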
We now consider the case when the number of components $K$ is unknown. The set of priors hence becomes the set of all (finite) mixtures of Gaussians:
(13)
In the definition of the set of priors on priors , we assume that (otherwise, there is a high chance of overfitting). We then set a prior on the number of components $K$ and, given $K$, use the same model as before (including the Dirichlet prior on the weights). Formally,
(14)
and we set the prior on priors . The application of Theorem 5 yields the bound
Proposition 11
Our estimator takes to find the optimal number of components at the meta level. This is the price to pay for having the infimum over in the bound. In the limit , and when no prior information on is available, this clearly improves upon the bound of Proposition 9, and hence justifies setting a prior on $K$ at the meta level rather than choosing an isolated value, again pleading in favor of meta-learning. The proofs of all the results of this subsection are given in Appendix G.
6 Discussion
In recent years, the statistical guarantees of meta-learning have received increasing attention. In the following paragraphs, we present a short review of the literature on the statistical theory of meta-learning, followed by a brief discussion of three papers that are closely related to our analysis.
Theoretical bounds in meta-learning. The first theoretical analysis of meta-learning goes back to Baxter (2000), who introduced the notion of task environment and derived a uniform generalization bound based on the capacity and covering number of the model. Following this i.i.d. task environment setting, many other generalization bounds have since been provided for different strategies and proof techniques, including VC theory (Baxter, 2000; Ben-David and Schuller, 2003; Maurer, 2009; Maurer et al., 2016; Guan and Lu, 2022b), algorithmic stability (Maurer and Jaakkola, 2005; Chen et al., 2020; Al-Shedivat et al., 2021; Guan et al., 2022) and information theory (Jose and Simeone, 2021; Chen et al., 2021; Jose et al., 2021; Rezazadeh et al., 2021; Hellström and Durisi, 2022). A related approach for deriving such bounds is based on PAC-Bayes theory. First proposed in the meta-learning framework in the pioneering paper of Pentina and Lampert (2014), this idea of learning a hyper-posterior that generates a prior for the new task has been taken up several times in the recent years (Amit and Meir, 2018; Ding et al., 2021; Liu et al., 2021; Rothfuss et al., 2021; Farid and Majumdar, 2021; Rothfuss et al., 2022; Guan and Lu, 2022a; Rezazadeh, 2022). In particular, Amit and Meir (2018) derived a new PAC-Bayes bound, which they applied to the optimization of deep neural networks, albeit with computational limitations. This latter concern was partially addressed by Rothfuss et al. (2021), who also specified the hyper-posterior and extended the results to unbounded losses, and further investigated their study in Rothfuss et al. (2022). Some papers combined ideas from different literatures, such as Farid and Majumdar (2021), who explored the link between PAC-Bayes and uniform stability in meta-learning, and provided a precise analysis of stability and generalization. Excess risk bounds have also been provided in the i.i.d. task environment framework, see Maurer et al. (2016); Denevi et al. (2018a, b, 2019a, 2019b, 2020); Balcan et al. (2019); Bai et al. (2021); Chen and Chen (2022). The task environment assumption has recently been challenged, for example by Du et al. (2020) and Tripuraneni et al. (2021), who proposed to use assumptions on the distributional similarity between the features and the diversity of tasks to control the excess risk, an idea further explored by Fallah et al. (2021) who exploited a notion of diversity between the new task and training tasks using the total variation distance. Finally, a detailed analysis of regret bounds in lifelong learning has been carried out in recent years (Alquier et al., 2017; Denevi et al., 2018b, 2019b; Balcan et al., 2019; Khodak et al., 2019; Finn et al., 2019; Meunier and Alquier, 2021).
Comparison to Denevi et al. (2019a). Denevi et al. (2019a) is probably the study most closely related to our paper. The authors provide statistical guarantees for Ridge regression with a meta-learned bias, and focus on the usefulness of their strategy relative to single-task learning, proving that their method outperforms the standard $\ell_2$-regularized empirical risk minimizer. In particular, they can achieve an excess risk rate of order in the favorable case , where is a variance term similar to the one we defined in our Gaussian example.
Comparison to Guan et al. (2022). To the best of our knowledge, Guan et al. (2022) is the only work in the meta-learning literature that addresses fast rates with respect to the number of tasks under the task environment assumption. However, we actually show in our paper that there is no need to extend Bernstein’s condition when using exact Bayesian inference, and that the final posterior naturally satisfies the extended Bernstein assumption, thus giving fast rates with respect to $T$, while Guan et al. (2022) require an additional Polyak-Łojasiewicz condition to achieve fast rates. Furthermore, their analysis is very different in nature, relying on stability arguments to derive generalization bounds, while we use PAC-Bayes theory to control the excess risk.
Comparison to Guan and Lu (2022a) and Rezazadeh (2022). Finally, Guan and Lu (2022a) and Rezazadeh (2022) provide fast rate generalization bounds based on Catoni’s PAC-Bayes inequality. Nevertheless, their notion of fast rates differs substantially from ours: our rates quantify the rate of convergence of the excess risk, in line with the classic statistical learning literature. In contrast, the fast rates of Guan and Lu (2022a) are only available for a variant of the generalization gap that compares the theoretical risk to a multiple of the empirical risk. In order to obtain a sharp factor equal to $1$ in front of the empirical risk, one needs to fine-tune a parameter in Guan and Lu (2022a)’s inequality, which would then return a slow rate.
7 Conclusion and open problems
We provided an analysis of the excess risk in meta-learning the prior via PAC-Bayes bounds. Surprisingly, at the meta-level, conditions for fast rates are always satisfied if one uses exact Gibbs posteriors at the task level. An important problem is to extend this result to variational approximations of Gibbs posteriors.
References
- Achille et al. (2019) Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6430–6439, 2019.
- Al-Shedivat et al. (2021) Maruan Al-Shedivat, Liam Li, Eric Xing, and Ameet Talwalkar. On data efficiency of meta-learning. In International Conference on Artificial Intelligence and Statistics, pages 1369–1377. PMLR, 2021.
- Alquier et al. (2016) P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(239):1–41, 2016. URL http://jmlr.org/papers/v17/15-290.html.
- Alquier (2021) Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.
- Alquier et al. (2017) Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret Bounds for Lifelong Learning. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 261–269, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
- Amit and Meir (2018) R. Amit and R. Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, pages 205–214. PMLR, 2018.
- Bai et al. (2021) Yu Bai, Minshuo Chen, Pan Zhou, Tuo Zhao, Jason Lee, Sham Kakade, Huan Wang, and Caiming Xiong. How important is the train-validation split in meta-learning? In International Conference on Machine Learning, pages 543–553. PMLR, 2021.
- Balcan et al. (2019) Maria-Florina Balcan, Mikhail Khodak, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424–433. PMLR, 2019.
- Bartlett et al. (2006) P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- Bartlett and Mendelson (2006) Peter L Bartlett and Shahar Mendelson. Empirical minimization. Probability theory and related fields, 135(3):311–334, 2006.
- Baxter (2000) Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198, 2000.
- Ben-David and Schuller (2003) Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 567–580. Springer, 2003.
- Bissiri et al. (2016) P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
- Boucheron et al. (2013) S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities. Oxford University Press, 2013.
- Caruana (1997) R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
- Catoni (2004) O. Catoni. Statistical Learning Theory and Stochastic Optimization. Saint-Flour Summer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathematics. Springer, 2004.
- Catoni (2007) O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes – Monograph Series, 56. Institute of Mathematical Statistics, Beachwood, OH, 2007.
- Chen et al. (2020) Jiaxin Chen, Xiao-Ming Wu, Yanke Li, Qimai Li, Li-Ming Zhan, and Fu-lai Chung. A closer look at the training strategy for modern meta-learning. Advances in Neural Information Processing Systems, 33:396–406, 2020.
- Chen and Chen (2022) Lisha Chen and Tianyi Chen. Is bayesian model-agnostic meta learning better than model-agnostic meta learning, provably? In International Conference on Artificial Intelligence and Statistics, pages 1733–1774. PMLR, 2022.
- Chen et al. (2023) Lisha Chen, Sharu Theresa Jose, Ivana Nikoloska, Sangwoo Park, Tianyi Chen, and Osvaldo Simeone. Learning with limited samples: Meta-learning and applications to communication systems. Foundations and Trends® in Signal Processing, 17(2):79–208, 2023.
- Chen et al. (2021) Qi Chen, Changjian Shui, and Mario Marchand. Generalization bounds for meta-learning: An information-theoretic analysis. Advances in Neural Information Processing Systems, 34:25878–25890, 2021.
- Cover and Thomas T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. Wiley.
- Dalalyan and Tsybakov (2008) A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1-2):39–61, 2008.
- Denevi et al. (2018a) G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Learning to learn around a common mean. Advances in neural information processing systems, 31, 2018a.
- Denevi et al. (2018b) G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018b.
- Denevi et al. (2019a) G. Denevi, C. Ciliberto, R. Grazzi, and M. Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575. PMLR, 2019a.
- Denevi et al. (2019b) Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, and Massimiliano Pontil. Online-within-online meta-learning. Advances in Neural Information Processing Systems, 32, 2019b.
- Denevi et al. (2020) Giulia Denevi, Massimiliano Pontil, and Carlo Ciliberto. The advantage of conditional meta-learning for biased regularization and fine tuning. Advances in Neural Information Processing Systems, 33:964–974, 2020.
- Ding et al. (2021) N. Ding, X. Chen, T. Levinboim, S. Goodman, and R. Soricut. Bridging the gap between practice and pac-bayes theory in few-shot meta-learning. Advances in neural information processing systems, 34:29506–29516, 2021.
- Donsker and Varadhan (1976) M. D. Donsker and S. S. Varadhan. Asymptotic evaluation of certain markov process expectations for large time. iii. Communications on Pure and Applied Mathematics, 28:389–461, 1976.
- Dou et al. (2019) Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. Investigating meta-learning algorithms for low-resource natural language understanding tasks. arXiv preprint arXiv:1908.10423, 2019.
- Du et al. (2020) Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
- Fallah et al. (2021) Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. Advances in Neural Information Processing Systems, 34:5469–5480, 2021.
- Farid and Majumdar (2021) A. Farid and A. Majumdar. Generalization bounds for meta-learning via pac-bayes and uniform stability. Advances in neural information processing systems, 34:2173–2186, 2021.
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
- Finn et al. (2019) Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920–1930. PMLR, 2019.
- Ghosal and Van der Vaart (2017) S. Ghosal and A. Van der Vaart. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017.
- Gidaris and Komodakis (2018) Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4367–4375, 2018.
- Grant et al. (2018) Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
- Grünwald and Mehta (2020) Peter D Grünwald and Nishant A Mehta. Fast rates for general unbounded loss functions: from erm to generalized bayes. The Journal of Machine Learning Research, 21(1):2040–2119, 2020.
- Gu et al. (2018) Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437, 2018.
- Guan and Lu (2022a) J. Guan and Z. Lu. Fast-rate pac-bayesian generalization bounds for meta-learning. In International conference on machine learning, pages 7930–7948. PMLR, 2022a.
- Guan et al. (2022) J. Guan, Y. Liu, and Z. Lu. Fine-grained analysis of stability and generalization for modern meta learning algorithms. Advances in neural information processing systems, 2022.
- Guan and Lu (2022b) Jiechao Guan and Zhiwu Lu. Task relatedness-based generalization bounds for meta learning. In International Conference on Learning Representations, 2022b.
- Guedj (2019) B. Guedj. A primer on PAC-Bayesian learning. In Proceedings of the second congress of the French Mathematical Society, 2019.
- Haghifam et al. (2021) Mahdi Haghifam, Gintare Karolina Dziugaite, Shay Moran, and Dan Roy. Towards a unified information-theoretic framework for generalization. Advances in Neural Information Processing Systems, 34:26370–26381, 2021.
- Hellström and Durisi (2022) Fredrik Hellström and Giuseppe Durisi. Evaluated cmi bounds for meta learning: Tightness and expressiveness. arXiv preprint arXiv:2210.06511, 2022.
- Hochreiter et al. (2001) Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In Artificial Neural Networks—ICANN 2001: International Conference Vienna, Austria, August 21–25, 2001 Proceedings 11, pages 87–94. Springer, 2001.
- Huang et al. (2018) Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. Natural language to structured query generation via meta-learning. arXiv preprint arXiv:1803.02400, 2018.
- Jose and Simeone (2021) Sharu Theresa Jose and Osvaldo Simeone. Information-theoretic generalization bounds for meta-learning and applications. Entropy, 23(1):126, 2021.
- Jose et al. (2021) Sharu Theresa Jose, Osvaldo Simeone, and Giuseppe Durisi. Transfer meta-learning: Information-theoretic bounds and information meta-risk minimization. IEEE Transactions on Information Theory, 68(1):474–501, 2021.
- Khodak et al. (2019) Mikhail Khodak, Maria-Florina F Balcan, and Ameet S Talwalkar. Adaptive gradient-based meta-learning methods. Advances in Neural Information Processing Systems, 32, 2019.
- Koch et al. (2015) Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2. Lille, 2015.
- Kulis et al. (2011) Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR 2011, pages 1785–1792. IEEE, 2011.
- Li et al. (2018) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
- Liu et al. (2021) Tianyu Liu, Jie Lu, Zheng Yan, and Guangquan Zhang. Pac-bayes bounds for meta-learning with data-dependent prior. arXiv preprint arXiv:2102.03748, 2021.
- Mammen and Tsybakov (1999) E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. the Annals of Statistics, 27(6):1808–1829, 1999.
- Maurer (2009) A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
- Maurer and Jaakkola (2005) A. Maurer and T. Jaakkola. Algorithmic stability and meta-learning. The Journal of Machine Learning Research, 6(6), 2005.
- Maurer et al. (2016) A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(81):1–32, 2016.
- McAllester (1998) D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234, New York, 1998. ACM.
- Meunier and Alquier (2021) D. Meunier and P. Alquier. Meta-strategy for learning tuning parameters with guarantees. Entropy, 23(10), 2021.
- Mishra et al. (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. ICLR, 2018.
- Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International conference on machine learning, pages 2554–2563. PMLR, 2017.
- Nguyen et al. (2020) Cuong Nguyen, Thanh-Toan Do, and Gustavo Carneiro. Uncertainty in model-agnostic meta-learning using variational inference. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3090–3100, 2020.
- Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
- Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
- Pentina and Lampert (2014) A. Pentina and C. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991–999. PMLR, 2014.
- Qian and Yu (2019) Kun Qian and Zhou Yu. Domain adaptive dialog generation via meta learning. arXiv preprint arXiv:1906.03520, 2019.
- Qiao et al. (2018) Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7229–7238, 2018.
- Quattoni et al. (2008) Ariadna Quattoni, Michael Collins, and Trevor Darrell. Transfer learning for image classification with sparse prototype representations. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
- Rajeswaran et al. (2019) Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019.
- Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International conference on learning representations, 2017.
- Rezazadeh (2022) A. Rezazadeh. A unified view on pac-bayes bounds for meta-learning. In International conference on machine learning, pages 18576–18595. PMLR, 2022.
- Rezazadeh et al. (2021) Arezou Rezazadeh, Sharu Theresa Jose, Giuseppe Durisi, and Osvaldo Simeone. Conditional mutual information-based generalization bound for meta learning. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 1176–1181. IEEE, 2021.
- Rothfuss et al. (2021) J. Rothfuss, V. Fortuin, M. Josifoski, and A. Krause. Pacoh: Bayes-optimal meta-learning with pac-guarantees. In International conference on machine learning, pages 9116–9126. PMLR, 2021.
- Rothfuss et al. (2022) J. Rothfuss, M. Josifoski, V. Fortuin, and A. Krause. Pac-bayesian meta-learning: From theory to practice. arXiv preprint arXiv:2211.07206, 2022.
- Russo and Zou (2019) D. Russo and J. Zou. How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
- Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850. PMLR, 2016.
- Shawe-Taylor and Williamson (1997) J. Shawe-Taylor and R. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 2–9, New York, 1997. ACM.
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
- Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208, 2018.
- Thrun and Pratt (1998) Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. Kluwer Academic Publishers, pages 3–17, 1998.
- Tripuraneni et al. (2021) Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434–10443. PMLR, 2021.
- Tsybakov (2004) A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. the Annals of Statistics, 32(1):135–166, 2004.
- Vanschoren (2019) Joaquin Vanschoren. Meta-learning. Automated machine learning: methods, systems, challenges, pages 35–61, 2019.
- Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
- Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Xu and Raginsky (2017) A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 2521–2530, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
- Yang et al. (2019) Jun Yang, Shengyang Sun, and Daniel M Roy. Fast-rate pac-bayes generalization bounds via shifted rademacher processes. Advances in Neural Information Processing Systems, 32, 2019.
- Yoon et al. (2018) Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. Advances in neural information processing systems, 31, 2018.
- Yu et al. (2020) Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020.
- Zhang (2006) Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.
- Zhang and Yang (2021) Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
- Zhuang et al. (2020) Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.
Appendix A Some Useful Formulas
The following (known) results are used throughout the text. They are recalled here without proof.
A.1 Concentration Inequalities
For Hoeffding and Bernstein see Boucheron et al. (2013). For Donsker and Varadhan, see for example Catoni (2007).
Lemma 12 (Hoeffding’s inequality)
Let $U_1,\dots,U_n$ be i.i.d. random variables taking values in an interval $[a, b]$. Then, for any $\lambda > 0$,
$$\mathbb{E}\left[e^{\lambda\left(\mathbb{E}[U_1] - \frac{1}{n}\sum_{i=1}^{n}U_i\right)}\right] \le e^{\frac{\lambda^2 (b-a)^2}{8n}}.$$
Lemma 13 (Bernstein’s inequality)
Let $U_1,\dots,U_n$ be i.i.d. random variables such that, for some constants $v, b > 0$ and any integer $k \ge 2$,
$$\mathbb{E}\big[|U_1|^k\big] \le \frac{k!}{2}\,v\,b^{k-2}. \qquad (15)$$
Then, for any $\lambda \in \big(0, \frac{n}{b}\big)$,
$$\mathbb{E}\left[e^{\lambda\left(\mathbb{E}[U_1] - \frac{1}{n}\sum_{i=1}^{n}U_i\right)}\right] \le \exp\left(\frac{v\lambda^2}{2n\big(1 - \frac{b\lambda}{n}\big)}\right).$$
Note that, in particular, if $|U_1| \le c$ almost surely, (15) always holds with $v = \mathbb{E}[U_1^2]$ and $b = c$.
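A quick Monte Carlo sanity check of this moment-generating-function form of Bernstein’s inequality, for bounded (uniform) variables, where (15) holds with $v = \mathbb{E}[U_1^2]$ and $b = c$; the check is ours, with the constants of the lemma as reconstructed above.

```python
import numpy as np

# Empirical MGF of the centered mean vs. the Bernstein bound, for
# U ~ Uniform(-c, c), so E[U] = 0, v = E[U^2] = c^2 / 3 and b = c.
rng = np.random.default_rng(0)
n, c = 50, 1.0
U = rng.uniform(-c, c, size=(200_000, n))   # i.i.d. bounded samples
v = c**2 / 3

for lam in [1.0, 5.0, 10.0]:                # lam must stay below n / c
    mgf = np.mean(np.exp(lam * (0.0 - U.mean(axis=1))))
    bound = np.exp(v * lam**2 / (2 * n * (1 - c * lam / n)))
    print(f"lambda={lam}: MGF={mgf:.4f} <= bound={bound:.4f}")
```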
A.2 Donsker and Varadhan’s Lemma
Lemma 14 (Donsker and Varadhan’s variational inequality Donsker and Varadhan (1976))
Let $\pi$ be a probability measure on $\Theta$. For any measurable, bounded function $h : \Theta \to \mathbb{R}$, we have:
$$\log \mathbb{E}_{\theta \sim \pi}\big[e^{h(\theta)}\big] = \sup_{\rho \in \mathcal{P}(\Theta)}\Big\{\mathbb{E}_{\theta \sim \rho}\big[h(\theta)\big] - \mathrm{KL}(\rho\|\pi)\Big\}.$$
Moreover, the supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure $\pi_h$, defined by its density with respect to $\pi$:
$$\frac{\mathrm{d}\pi_h}{\mathrm{d}\pi}(\theta) = \frac{e^{h(\theta)}}{\mathbb{E}_{\vartheta \sim \pi}\big[e^{h(\vartheta)}\big]}.$$
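A short numerical check of Lemma 14 on a finite parameter space (our illustration): the Gibbs measure attains the supremum, and the optimal value equals $\log \mathbb{E}_{\pi}[e^{h}]$.

```python
import numpy as np

# Donsker-Varadhan on a finite space: compare log E_pi[exp(h)] with the
# value of the variational objective at the Gibbs measure pi_h.
rng = np.random.default_rng(0)
m = 5
pi = rng.dirichlet(np.ones(m))            # reference measure pi
h = rng.normal(size=m)                    # bounded function h

lhs = np.log(np.sum(pi * np.exp(h)))      # log E_pi[exp(h)]

gibbs = pi * np.exp(h)
gibbs /= gibbs.sum()                      # Gibbs measure pi_h
kl = np.sum(gibbs * np.log(gibbs / pi))   # KL(pi_h, pi)
rhs = np.sum(gibbs * h) - kl              # E_{pi_h}[h] - KL(pi_h, pi)

print(np.isclose(lhs, rhs))               # True: the supremum is attained
```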
A.3 KL Divergence of some known Distributions
Denoting by $H(p) = -\sum_{k} p_k \log p_k$ the entropy of $p$, recall that the KL divergence between a multinomial distribution of parameters $p = (p_1,\dots,p_m)$ and a multinomial distribution of parameters $q = (q_1,\dots,q_m)$ is
$$\mathrm{KL}(p\|q) = \sum_{k=1}^{m} p_k \log\frac{p_k}{q_k} = -H(p) - \sum_{k=1}^{m} p_k \log q_k. \qquad (16)$$
Recall that the KL divergence between two normal distributions $\mathcal{N}(\mu_1, \sigma_1^2 I_d)$ and $\mathcal{N}(\mu_2, \sigma_2^2 I_d)$ is
$$\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2 I_d)\,\big\|\,\mathcal{N}(\mu_2, \sigma_2^2 I_d)\big) = \frac{\|\mu_1 - \mu_2\|^2 + d\,\sigma_1^2}{2\sigma_2^2} + \frac{d}{2}\log\frac{\sigma_2^2}{\sigma_1^2} - \frac{d}{2}. \qquad (17)$$
Recall that the KL divergence between two Gamma distributions (in the shape/rate parameterization) is
$$\mathrm{KL}\big(\Gamma(a_1, b_1)\,\big\|\,\Gamma(a_2, b_2)\big) = (a_1 - a_2)\,\psi(a_1) - \log\Gamma(a_1) + \log\Gamma(a_2) + a_2\log\frac{b_1}{b_2} + a_1\,\frac{b_2 - b_1}{b_1}, \qquad (18)$$
where $\psi$ is the digamma function.
Recall that the KL divergence between a Dirichlet distribution of parameter $\alpha = (\alpha_1,\dots,\alpha_K)$ and a Dirichlet distribution of parameter $\beta = (\beta_1,\dots,\beta_K)$ is
$$\mathrm{KL}\big(\mathrm{Dir}(\alpha)\,\big\|\,\mathrm{Dir}(\beta)\big) = \log\frac{\Gamma(\bar\alpha)}{\Gamma(\bar\beta)} + \sum_{k=1}^{K}\log\frac{\Gamma(\beta_k)}{\Gamma(\alpha_k)} + \sum_{k=1}^{K}(\alpha_k - \beta_k)\big(\psi(\alpha_k) - \psi(\bar\alpha)\big), \qquad (19)$$
where $\bar\alpha = \sum_{k}\alpha_k$ and $\bar\beta = \sum_{k}\beta_k$.
Appendix B Proof of Theorem 1
We mostly follow the proof technique developed in Catoni (2007). For any , fix and let for any . We are going to distinguish two cases, depending on whether or not Bernstein’s condition is satisfied.
If Bernstein’s condition is satisfied, we apply Lemma 13 to the variables $\ell(\theta, Z_{i,j}) - \ell(\theta^*_i, Z_{i,j})$. Note that in this case, $v$ is actually the variance term $V_i(\theta)$. So, for any ,
We let , which gives
Making use of Bernstein’s condition gives
If Bernstein’s condition is not satisfied, we apply Lemma 12 to the same variables, which gives, for any ,
Letting gives
Defining the general bound
(20)
where is equal to if Bernstein’s condition is satisfied, and otherwise, it holds in either case that
Rearranging the terms gives
Next, integrating this bound with respect to and using Fubini’s theorem to exchange both integrals gives
We then apply Lemma 14 to the argument of the expectation with respect to the sample, and we have
Jensen’s inequality implies
in other words,
At this stage, we can replace by its value given in (20) to obtain the bound:
Next, we rearrange the terms and replace the supremum on by :
We then replace by , and, by definition of the Gibbs posterior , the above bound is the same as
In particular, under Bernstein’s condition, i.e., if , the choice gives
Without Bernstein’s condition, i.e., if , rewriting the bound and taking the minimum over yields
and this concludes the proof.
Appendix C Proof of Corollary 2
First,
Now, define as the restriction of to the set . Then,
by assumption. An optimization with respect to leads to and we obtain the first statement:
Plugging this into Theorem 1 leads immediately to the other statements.
Appendix D Proof of Lemma 4
First, we note that is differentiable on and . As a consequence, is maximized at and its maximum is . This implies that is -Lipschitz, that is, for any ,
Taking the square of both sides of the inequality yields
(21)
Then, and thus, (minimum reached for ). This implies that is -strongly convex, that is, for any we have, for any ,
We apply this inequality to and rearrange terms, and we obtain
(22)
Appendix E Proof of Theorem 5
The proof of Theorem 5 is structured as follows: we first bound the excess risk by the expectation of the infimum of the empirical risk in Lemma 15. Using classic techniques, we then turn this bound into a bound on the prediction risk.
E.1 Lemma
Lemma 15
Assume that the loss satisfies the boundedness assumption (3). Then, the following bound holds with the choice :
Proof For any , let
Note that
where is a shortcut notation for . Besides, note that, by the boundedness assumption on , it almost surely holds that . Applying Lemma 13 to gives, for any ,
where . This factor can be bounded as by Theorem 3, which states that Bernstein’s condition is satisfied at the meta level, so that the bound becomes
Integrating with respect to the prior and using Fubini’s theorem yields
Next, by an application of Lemma 14, the left-hand side becomes
We then make use of Jensen’s inequality and arrange terms, so that the bound becomes
We replace the supremum on by an evaluation of the term in and arrange terms, so that the bound becomes
which is identical to
Then, we replace by its value, yielding the bound
Since the term does not depend on , we can simplify it on both sides of the inequality, which gives
(23)
Theorem 1 provides the following lower bound for any :
(24)
This further implies that
(25)
for any . In particular, applying (24) to and (25) to , and plugging the results into the left-hand side of (23) gives
We remove from both sides of the inequality and arrange terms, so that the bound becomes
By definition, is the minimizer of the integrand of the right-hand side, and therefore,
and the choice yields
which concludes the proof of the lemma.
E.2 Proof of Theorem 5
Appendix F Application of Theorem 5 to the Gaussian Case
Assume that is -Lipschitz. As a result, the risks are also -Lipschitz. We choose the prior (and hence, the variances are the same for all coordinates). A straightforward application of (17) gives
In this case, , where
We now define as the family of distributions on , where
Fix a prior on priors . We choose , where
where $\Gamma$ is the gamma function and $\psi$ is the digamma function. We apply Theorem 5:
We next apply Assumption 4, and the bound becomes
Idea: define, for any , the value that minimizes , that is: , and define . If -a.s., all the tasks have the same solution. On the other hand, if has a lot of variation, then the tasks have very unrelated solutions. In the infimum above, take and . Then,
Let , this quantity will be very important in the rate. Therefore,
An exact optimization in gives and after simplifications,
We now consider separately two cases: or for some (very small) that we will choose later. First case: . This is the case where we do not expect a significant improvement on the bound from the meta-learning. Simply take to get
In this case, an accurate optimization with respect to and will not improve the bound significantly, and for this reason, we simply take and . Also note that for we have and thus and . Thus,
Second case: . In this case, we expect a significant improvement in the bound from the meta-learning, and in order to take advantage of it, we choose very small: . We obtain
In order to take advantage of the meta-learning, we are going to make sure that is very small by tuning and adequately. But then we have , and we obtain
Choose (make sure that ) and an optimization with respect to gives . Reinjecting in the bound gives
The last step is to choose . Interestingly enough, for we recover the rate of learning in isolation: . Now, an optimization with respect to gives and thus
In this regime, this is a significant improvement compared to learning in isolation, in which case we can simplify the bound as
where
As a conclusion,
Appendix G Application of Theorem 5 to the Case of Mixtures of Gaussians
We first consider priors that are mixtures of Gaussians, where the number of components $K$ is known:
We set the prior . Then, denoting by the pdf of the normal distribution , (17) implies, for any ,
where the inequality on the second line follows from the log sum inequality of Cover and Thomas, and the bound from Theorem 5 becomes, at ,
Assumption (4) implies that
It follows that the previous bound with the choice becomes
While the choice may seem less meaningful than in the Gaussian case (with one single component), it is completely natural, as the best possible choice for the parameter is . In the computation, each component of the mixture brings an error term which can be decomposed into a bias term and a variance term,
for which the choice minimizes the first order error term. Next, we set the family of distributions on :
where is the Dirichlet distribution of parameter . We set the prior on priors , where and . Then, using (17), (18) and (19),
where is the digamma function. We can next use the bound from Theorem 5 and we have
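Assuming (19) is the KL divergence between Dirichlet distributions (which would explain the digamma terms above; the exact display is not recoverable here), the standard closed form, recalled for reference, is: with $\alpha_0 = \sum_k \alpha_k$ and $\beta_0 = \sum_k \beta_k$,
\[
\mathrm{KL}\big(\mathrm{Dir}(\alpha)\,\|\,\mathrm{Dir}(\beta)\big) \;=\; \log\frac{\Gamma(\alpha_0)}{\Gamma(\beta_0)} + \sum_{k=1}^K \log\frac{\Gamma(\beta_k)}{\Gamma(\alpha_k)} + \sum_{k=1}^K (\alpha_k - \beta_k)\big(\psi(\alpha_k) - \psi(\alpha_0)\big).
\]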
Next, minimizing over gives the optimal value , and replacing it in the above bound gives
We are going to restrict the infimum to the set of such that for any . In other words, we are selecting only the best component of the mixture in the optimization bound. The reader can check that this is actually the exact solution to the minimization problem in the above bound. As a result of this minimization, the bound becomes
Please note that the term inside the expectation is, up to the minimum on , identical to the one we had in the case of a single Gaussian, except for the term , which may be seen as a penalty for the choice of the component in the mixture. We then bound the expectation term in the above bound by first using Fubini’s theorem, and then inverting the minimum and the second expectation:
(26)
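The inversion of the minimum and the expectation is licensed by the elementary inequality
\[
\mathbb{E}\Big[\min_{1 \le k \le K} X_k\Big] \;\leq\; \min_{1 \le k \le K}\,\mathbb{E}[X_k],
\]
which holds because, for each fixed $k$, $\min_j X_j \le X_k$ pointwise (the notation $X_k$ is generic here).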
We can then bound the expectation term, which we decompose as
Jensen’s inequality helps to bound both the first term
and the second term
in the decomposition. The third term can be bounded as follows
The bound on the expectation then becomes
In our final bound, we wish to have as few terms as possible in while the terms in are not so problematic, because they correspond to the fast convergence rate at the meta-level. For this reason, we are going to take out of the infimum:
•
the term , which is unavoidable and corresponds to the main term of the bound in the worst case, with a speed of convergence;
•
the term , which will be handled through an optimization in and will be a term.
An exact optimization in gives
and replacing it in the bound yields
From here, we set for any , which implies
because is increasing. Please also note that
We can then deduce the bound
Let
it is clear that
By choosing minimizing , the previous bound becomes
(28)
Please note that are characteristic of the distribution . Intuitively, if the distribution has modes or Gaussian mixture components, then correspond to the centers of these modes or components, up to a permutation. Consequently, they do not scale with or , but can be regarded as problem parameters of constant order.
Let , to be specified later; we are going to consider two separate cases:
•
either , implying that the distribution is concentrated around and the variance of the local distribution around each of those points is smaller than ;
•
or .
G.1 First Case:
In this case, we expect the distribution to be well concentrated around (which are the centers of the mixtures). As a result, the optimal parameter in the new task is going to be close to one of and we can infer it from the previous tasks. This is precisely the case where we expect a significant improvement from meta-learning over learning in isolation.
A closer look at the bound above shows that, besides the term (for which we will provide an interpretation later in this section), all the terms are , which is the fast rate of convergence at the meta level, except possibly for the terms and . Nevertheless, the assumption on shows that the second of those terms is very small (of order ). For this reason, we are going to choose small enough so that the first term becomes small, and we set
Replacing in the bound gives
The second term in the bound can easily be controlled using the inequality , which gives
and replacing by its upper bound gives
From here, the choice of the rate yields the final bound:
where we denoted
and
Remark 16
Please note that
and
This bound comes as no surprise, because the process of learning the parameter in the mixture of Gaussians framework consists of two different steps:
•
first, identifying the centers of the mixtures, which, similarly to the finite case, is captured in the term;
•
then, identifying the right parameters of the Gaussian components centered at the points identified in the previous step. Similarly to the Gaussian case, this is captured in , in the most favorable case.
Remark 17
It may seem paradoxical that the model of mixtures of Gaussians achieves, in the best possible case, a rate of convergence slower than the rate achieved by a single Gaussian in the regime under optimal conditions. In reality, the latter rate of convergence is also achievable in the model of mixtures of Gaussians but, as in the case of a single Gaussian component, it requires the strong assumption that
which is much more restrictive. On the other hand, many distributions only satisfy
for some , in which case the rate of convergence achieved here is much faster than , which is the best possible rate achieved in the single Gaussian model in general.
G.2 Second Case:
In this case, is not smaller than and therefore, the choice of a large parameter would make the term too large to achieve the desired rate of convergence. As a result, we set
Replacing in the bound (28) gives
G.3 In Summary
In any case, the bound can be written as
where
is the convergence term in the finite case (to find the centers of the mixtures), and
is equal to the convergence term in the Gaussian case (with one component) and is a if there exist points such that the distribution is concentrated at a rate around those points, and otherwise. The remaining term is
and it is the convergence term at the meta level. Please note that it is a .
G.4 What if the number of components in the mixture is unknown?
In practice, we do not know in advance how to choose the number of components in the prior. In this case, we are going to include inside all the mixtures of Gaussians, i.e.,
Note that the sum inside the definition of is finite, since for any beyond a certain rank . We still denote by the prior in each task. By definition, for any . It still holds that, for any ,
where we denoted
To be clear, the KL term remains identical to the case where is known, except that the sums over no longer stop at a pre-determined . This difference aside, the bound remains identical to the one in the case where is known, and the bound from Theorem 1 becomes, at ,
Next, we are going to define a prior on within the prior of priors as follows. We assume that the number of mixtures is smaller than , because even if it were not, it would be impossible to identify them with enough confidence. We define the set of priors on priors
where is the prior distribution on and
and we set the prior of priors as . We also need to re-compute the KL divergence between the priors of priors, which becomes
using (16). Please note that in any optimization on , we optimize first in and then on conditionally on . This means that the latter parameters are allowed to depend on . While the infimum on of any quantity should be written , we will adopt the shortcut notation . We can next use the bound from Theorem 5 and we have
The optimization on may be performed exactly by setting , and the bound becomes
We restrict the optimization in to the set , and the bound becomes
Next, as is classical, we decompose the expectation as
Applying Fubini’s theorem and inverting the infimum and the expectation yields the bound
The bound (27) from the previous part still holds:
and we can inject it in the computation so that the bound becomes
An exact optimization in yields and we plug it back into the bound:
We choose for any and noting that both
and
we deduce the following bound:
Recall the (unchanged) definition of :
Recalling that (as well as ) is allowed to depend on , we define as the minimizer (up to a permutation) of . It follows that the bound becomes
From here, we are going to distinguish two different cases, as before:
•
first case: there exists a reasonably small such that ;
•
second case: for any reasonable , .
G.4.1 First Case
Again, this is the case where we expect some improvement from the meta-learning. Setting
the bound becomes
and after the simplification
we deduce
Then, we restrict the infimum over all multinomial distributions to the Dirac masses, i.e., to all such that there exists with . It follows that , and we deduce that
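The simplification obtained from the restriction to Dirac masses can be made explicit: if the weight vector $p$ puts all its mass on a single component $j$, i.e., $p_j = 1$ and $p_k = 0$ for $k \neq j$, then the discrete KL term collapses (with the convention $0\log 0 = 0$) to
\[
\sum_{k=1}^K p_k \log\frac{p_k}{q_k} \;=\; \log\frac{1}{q_j}.
\]
(The notation $p$, $q$ is generic here, standing for the weight vectors of the two mixtures.)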
In particular, if there exists a relatively small such that , then the bound becomes
G.4.2 Second Case
Recall the bound:
In this case, we do not expect much improvement from meta-learning, and we will simply choose
The bound then becomes
As before, we restrict the minimization to the set of Dirac distributions and deduce
G.4.3 Overall Bound
In summary, the bound can be written as
where and are exactly the same terms as in the case where the number of mixtures is known, and the convergence term at the meta level becomes
Even when the number of mixtures is unknown, the same bound as in the case of known can be achieved, up to a term which is of the order of the cost of identifying the optimal number of components in the mixture at the meta level.
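As a purely illustrative sketch of the order of this extra term (the actual prior on the number of components is not reproduced here, so the following is an assumption rather than the paper’s exact choice): if the prior puts mass $\pi(K) \propto 2^{-K}$ on $K \in \{1, 2, \dots\}$, then the complexity cost of the unknown number of components in the KL term is
\[
\log\frac{1}{\pi(K^\star)} \;=\; K^\star \log 2 + O(1),
\]
so that, once divided by the number of tasks, it only contributes a term proportional to the optimal number of components $K^\star$ at the fast meta-level rate, consistent with the conclusion above.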