
Bayes meets Bernstein at the Meta Level: an Analysis of Fast Rates in Meta-Learning with PAC-Bayes

Charles Riou ([email protected])
The University of Tokyo & RIKEN Center for AIP, Tokyo, Japan

Pierre Alquier ([email protected])
ESSEC Business School, Asia-Pacific campus, Singapore

Badr-Eddine Chérief-Abdellatif ([email protected])
CNRS, LPSM, Sorbonne Université & Université Paris Cité
Abstract

Bernstein’s condition is a key assumption that guarantees fast rates in machine learning. For example, the Gibbs algorithm with prior $\pi$ has an excess risk in $O(d_{\pi}/n)$, as opposed to the standard $O(\sqrt{d_{\pi}/n})$, where $n$ denotes the number of observations and $d_{\pi}$ is a complexity parameter which depends on the prior $\pi$. In this paper, we examine the Gibbs algorithm in the context of meta-learning, i.e., when learning the prior $\pi$ from $T$ tasks (with $n$ observations each) generated by a meta distribution. Our main result is that Bernstein’s condition always holds at the meta level, regardless of its validity at the observation level. This implies that the additional cost to learn the Gibbs prior $\pi$, which will reduce the term $d_{\pi}$ across tasks, is in $O(1/T)$, instead of the expected $O(1/\sqrt{T})$. We further illustrate how this result improves on the standard rates in three different settings: discrete priors, Gaussian priors and mixtures of Gaussians priors.

Keywords: Bernstein’s condition, meta-learning, fast rates, PAC-Bayes bounds, information bounds, the Gibbs algorithm, variational approximations.

1 Introduction

One of the greatest promises of artificial intelligence is the ability to design autonomous systems that can learn from different situations throughout their lives and adapt quickly to new environments, as humans, animals and other living things naturally do. Based on the intuition that a new problem often has significant similarities to previously encountered tasks, the use of past experience is particularly important in areas such as computer vision (Quattoni et al., 2008; Kulis et al., 2011; Li et al., 2018; Achille et al., 2019), natural language processing (Huang et al., 2018; Gu et al., 2018; Dou et al., 2019; Qian and Yu, 2019) and reinforcement learning (Finn et al., 2017; Mishra et al., 2018; Wang et al., 2016; Yu et al., 2020) where the learner has access to only a few training examples for the task of interest, but for which a vast amount of datasets from a variety of related tasks is available. In the area of digit recognition for example, it is possible to leverage the experience gained from millions of similar open source image classification datasets, as the key features needed to classify cats from dogs or pants from shirts can be used to classify handwritten digits. This idea is at the heart of meta-learning (Thrun and Pratt, 1998; Baxter, 2000; Vanschoren, 2019), a field that has recently attracted a lot of attention due to its huge success in real-world applications, and which aims to improve performance on a particular task by transferring the knowledge contained in different but related tasks.

Meta-learning has been widely studied in the recent literature, where the term has been used to refer to a wide range of situations; providing a precise definition of meta-learning is therefore a challenge. In particular, the terms transfer learning and multi-task learning, although distinct, are often used interchangeably with meta-learning. Transfer learning is a very general concept that involves two tasks that share similarities - a source and a target - and consists in transferring the knowledge acquired on the source dataset to better process the target data (Pan and Yang, 2010; Zhuang et al., 2020). In practice, this can be formulated in many different ways, but the most popular approach is to pre-train a model on the source data, e.g., images of cats and dogs, and then to fine-tune it on the target training dataset, e.g., images of handwritten digits. In particular, a challenging problem in transfer learning is to find a measure that quantifies the similarity between the source and target tasks. Multi-task learning adopts a different framework, where multiple learning tasks are considered and the goal is to learn a model that can handle all tasks simultaneously (Caruana, 1997; Zhang and Yang, 2021). The model usually has a common representation, e.g., the first layers of a deep neural network, and a task-specific component, e.g., the last layer of the network. Meta-learning also considers a collection of datasets from a variety of tasks, but unlike multi-task learning, we are not interested in learning a fixed set of tasks, but rather in being prepared for future tasks that are not yet given. Also, unlike transfer learning, meta-learning exploits the commonality of previous tasks rather than the similarity between some specific source and target tasks. We use these metadata to design a meta-procedure that adaptively learns a predictor for any new learning task that is a priori unknown, and the goal is to quickly learn to adapt a learning procedure from past experience. Meta-learning is therefore sometimes referred to as learning-to-learn, or lifelong learning in the online context. The implementation of this learning-to-learn mechanism can take different forms, which we briefly describe in the following paragraph.

As the name suggests, meta-learning involves two levels of abstraction to improve learning over time: a meta-level and a within-task level. At the within-task level, the new task of interest is presented and the corresponding pattern is learned from the training data set of the task at hand. This learning process is greatly accelerated by a meta-learner, which has distilled the knowledge accumulated in previous tasks into the within-task model. The meta-learning procedure can accelerate the within-task algorithm in various ways, and three main categories stand out in the literature: metric-based methods, which are based on non-parametric predictive models governed by a metric that is learned using the meta-training dataset (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018); model-based methods, which quickly update the parameters in a few learning steps, which can be achieved by the model’s internal architecture or another meta-learning model (Santoro et al., 2016; Munkhdalai and Yu, 2017; Mishra et al., 2018); and optimisation-based methods, which mainly involve learning the hyper-parameters of a within-task algorithm using the meta-training set for fast adaptation (Hochreiter et al., 2001; Ravi and Larochelle, 2017; Finn et al., 2017; Nichol et al., 2018; Qiao et al., 2018; Gidaris and Komodakis, 2018). Due to their performance and ease of implementation, the optimisation-based family is the dominant class in the recent literature, exploiting the idea that well-chosen hyperparameters can greatly speed up the learning process and allow model parameters to be quickly adapted to new tasks with little data. For example, it is possible to learn the task of interest using a gradient descent whose initialisation and learning rate would have been learned from the metadata. Among the best known meta-strategies is the model agnostic meta-learning procedure (MAML) (Finn et al., 2017) and its variants implicit MAML (Rajeswaran et al., 2019), Bayesian MAML (Grant et al., 2018; Yoon et al., 2018; Nguyen et al., 2020) and Reptile (Nichol et al., 2018). We refer the interested reader to the recent review by Chen et al. (2023) for more details.

2 Approach and Contributions

In this paper, we focus on Gibbs algorithms within tasks, or their variational approximations. Gibbs algorithms, also known as Gibbs posteriors (Alquier et al., 2016) or exponentially weighted aggregation (Dalalyan and Tsybakov, 2008), can also be interpreted in the framework of Bayesian statistics as a kind of generalized posterior (Bissiri et al., 2016). PAC-Bayes bounds were developed to control the risk and the excess risk of such procedures (Shawe-Taylor and Williamson, 1997; McAllester, 1998; Catoni, 2004; Zhang, 2006; Catoni, 2007; Yang et al., 2019), see Guedj (2019); Alquier (2021) for recent surveys. More recently, the related mutual information bounds (Russo and Zou, 2019; Haghifam et al., 2021) were also used to study the excess risk of Gibbs algorithms (Xu and Raginsky, 2017). Gibbs posteriors are often intractable, and it is then easier to compute a variational approximation of such a posterior. It appears that PAC-Bayes bounds can also be used on such approximations (Alquier et al., 2016). Many recent publications built the foundations of meta-learning through PAC-Bayes and information bounds (Pentina and Lampert, 2014; Amit and Meir, 2018; Ding et al., 2021; Liu et al., 2021; Rothfuss et al., 2021; Farid and Majumdar, 2021; Rothfuss et al., 2022; Guan and Lu, 2022a; Rezazadeh, 2022). These works and the related literature are discussed in detail in Section 6. Most of these papers proved empirical PAC-Bayes bounds for meta-learning. Minimizing these bounds yields both a practical meta-learning procedure and a numerical certificate on its generalization. However, these works did not focus on the rate of convergence of the excess risk.

Bernstein’s condition is a low-noise assumption reflecting the inherent difficulty of the learning task (Mammen and Tsybakov, 1999; Tsybakov, 2004; Bartlett and Mendelson, 2006). While it was initially designed to study the ERM (Bartlett and Mendelson, 2006), it characterizes the learning rate of algorithms beyond the ERM. PAC-Bayes bounds and mutual information bounds show that the excess risk of the Gibbs algorithm is in $O(d_{\pi,t}/n)$ when Bernstein’s condition is satisfied (Catoni, 2007; Grünwald and Mehta, 2020), as opposed to the slow rate $O((d_{\pi,t}/n)^{1/2})$ in the general case. The quantity $d_{\pi,t}$ measures the complexity of task $t$. Importantly, it also depends on the prior distribution $\pi$ used in the algorithm. Similar results hold when we replace the Gibbs algorithm by a variational approximation (Alquier et al., 2016).

In the meta-learning setting, we are given $T$ tasks simultaneously. Using the Gibbs algorithm with a fixed $\pi$ in all tasks leads to an average excess risk in $O(\mathbb{E}_{t}[(d_{\pi,t}/n)^{\alpha}])$, where $\alpha=1$ when Bernstein’s condition holds for each task $t\in\{1,\dots,T\}$, and $\alpha=1/2$ otherwise. Here, $\mathbb{E}_{t}$ denotes the expectation with respect to a future (out-of-sample) task $t$. This approach is referred to as “learning in isolation”, because each task is solved regardless of the others. Of course, in meta-learning we want to take advantage of the multiple tasks. For example, Pentina and Lampert (2014) used the Gibbs algorithm at the meta-level, in order to learn a better prior. The expected benefit is to reduce the complexity term $d_{\pi,t}$.

Overview of the paper:

  • In Section 3, we recall existing results on the excess risk of the Gibbs algorithm when learning tasks in isolation, and we introduce Bernstein’s condition, a fundamental assumption under which the fast rate $O(\mathbb{E}_{t}[d_{\pi,t}/n])$ is achieved by the Gibbs algorithm.

  • In Section 4, we prove that regardless of its validity at the within-task level, Bernstein’s condition is always satisfied at the meta level. As a consequence of this result, we show that the Gibbs algorithm at the meta level achieves the excess risk $O(\inf_{\pi\in\mathcal{M}}\mathbb{E}_{t}[\left(d_{\pi,t}/n\right)^{\alpha}]+1/T)$, with $\alpha=1$ if Bernstein’s condition is satisfied, and $\alpha=1/2$ otherwise. We further raise the open question of the generalization of this result to variational approximations of the Gibbs algorithm.

  • In Section 5, we apply the previous results to various settings: learning a discrete prior, learning a Gaussian prior and learning a mixture of Gaussians prior. We show that the gain brought by meta-learning can be substantial, as in some favorable situations, one can even have $\inf_{\pi\in\mathcal{M}}\mathbb{E}_{t}[d_{\pi,t}/n]=0$.

  • In Section 6, we provide a deeper comparison with the rich literature on meta-learning.

3 Problem Definition and Notations

Let $\mathcal{Z}$ be a space of observations, $\Theta$ be a decision space and $\ell:\mathcal{Z}\times\Theta\rightarrow\mathbb{R}_{+}$ be a bounded loss function defined on the previously defined sets. Let $\mathcal{P}(\Theta)$ denote the set of all probability distributions on $\Theta$ equipped with a suitable $\sigma$-field. The learner has to solve $T$ tasks. Given a task $t\in\{1,\dots,T\}$, the learner receives the observations $Z_{t,i}$, $i=1,\dots,n$, assumed to be drawn independently from a distribution $P_{t}$ on the space of observations $\mathcal{Z}$. The objective of the learner is to find a parameter $\theta$ in the parameter space $\Theta$ which minimizes the so-called prediction risk associated to $P_{t}$, defined as

R_{P_{t}}(\theta)=\mathbb{E}_{Z\sim P_{t}}[\ell(Z,\theta)].

We denote by $R^{*}_{P_{t}}$ the minimum of $R_{P_{t}}(\theta)$ and by $\theta_{t}^{*}$ its corresponding minimizer:

R^{*}_{P_{t}}=\inf_{\theta\in\Theta}R_{P_{t}}(\theta)=R_{P_{t}}(\theta^{*}_{t}).

In Bayesian approaches, we rather seek a distribution $\rho_{t}\in\mathcal{P}(\Theta)$ such that

\mathbb{E}_{\theta\sim\rho_{t}}[R_{P_{t}}(\theta)]

is small. Defining, for any $\theta\in\Theta$, the empirical risk as

\hat{R}_{t}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\ell(Z_{t,i},\theta),

a standard choice for $\rho_{t}$ is the so-called Gibbs posterior given by

\rho_{t}(\pi,\alpha)=\operatorname{argmin}_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}\left[\hat{R}_{t}(\theta)\right]+\frac{\mathrm{KL}(\rho\|\pi)}{\alpha n}\right\}, (1)

where $\pi$ is the prior distribution on the parameter $\theta$ and $\alpha$ is some parameter which will be made explicit later. The corresponding risk estimate is defined by

\hat{\mathcal{R}}_{t}(\rho,\pi,\alpha)=\mathbb{E}_{\theta\sim\rho}\left[\hat{R}_{t}(\theta)\right]+\frac{\mathrm{KL}(\rho\|\pi)}{\alpha n}.

More generally, we can also consider variational approximations, defined by

\rho_{t}(\pi,\alpha,\mathcal{F})=\operatorname{argmin}_{\rho\in\mathcal{F}}\left\{\mathbb{E}_{\theta\sim\rho}\left[\hat{R}_{t}(\theta)\right]+\frac{\mathrm{KL}(\rho\|\pi)}{\alpha n}\right\}, (2)

where $\mathcal{F}\subseteq\mathcal{P}(\Theta)$. Adequate choices for $\mathcal{F}$ lead to feasible minimization problems, and do not affect the generalization properties (Alquier et al., 2016). With these notations, $\rho_{t}(\pi,\alpha)=\rho_{t}(\pi,\alpha,\mathcal{P}(\Theta))$.
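To make these definitions concrete, here is a minimal numerical sketch (ours, not from the paper) for the simplest case where $\Theta$ is finite: the Gibbs posterior (1) then has the closed form $\rho_{t}(\pi,\alpha)(\theta)\propto\pi(\theta)\,{\rm e}^{-\alpha n\hat{R}_{t}(\theta)}$, and the risk estimate $\hat{\mathcal{R}}_{t}$ can be evaluated directly. All numerical values below are illustrative.

import numpy as np

def gibbs_posterior(prior, emp_risks, alpha, n):
    # Closed form of (1) on a finite parameter set:
    # rho(theta) is proportional to prior(theta) * exp(-alpha * n * R_hat_t(theta)).
    log_w = np.log(prior) - alpha * n * emp_risks
    log_w -= log_w.max()                     # for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

# Illustrative task: 5 candidate parameters, uniform prior, made-up empirical risks.
n, alpha = 100, 0.1
prior = np.full(5, 1.0 / 5)
emp_risks = np.array([0.50, 0.45, 0.20, 0.48, 0.60])
rho = gibbs_posterior(prior, emp_risks, alpha, n)

# Risk estimate of the text: E_{theta~rho}[R_hat_t(theta)] + KL(rho || prior) / (alpha * n).
risk_estimate = rho @ emp_risks + (rho @ (np.log(rho) - np.log(prior))) / (alpha * n)
print(rho, risk_estimate)

Restricting the argmin in (2) to a parametric family $\mathcal{F}$ replaces this closed form by a finite-dimensional optimization problem.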

3.1 Assumptions on the loss and Bernstein’s condition

Recall that we assumed the loss function to be bounded: there exists a constant $C>0$ such that

\forall(z,\theta)\in\mathcal{Z}\times\Theta,\ \ell(z,\theta)\leq C. (3)

PAC-Bayes bounds for unbounded losses are well-known, see the discussion at the end of Alquier (2021). However, many of those bounds are more involved. Thus, the choice to work with bounded losses is made here for the sake of clarity, and is not due to a fundamental limitation of our method.

We define the variance term for task $t$ by

V_{t}(\theta,\theta^{*}_{t}):=\mathbb{E}_{P_{t}}\left[\left|\ell(Z,\theta)-\ell(Z,\theta^{*}_{t})\right|^{2}\right]

for any $\theta\in\Theta$. The following condition is crucial in the study of the risk of the Gibbs posterior.

Assumption 1 (Bernstein’s condition)

There exists a constant $c>0$ such that, for any $\theta\in\Theta$,

V_{t}(\theta,\theta^{*}_{t})\leq c\left(R_{P_{t}}(\theta)-R_{P_{t}}^{*}\right).

This assumption characterizes the excess risk of the Gibbs posterior; see Theorem 1 below. In this paper, we provide bounds on the excess risk both under this condition and without it.
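As an illustration (a classical example, not taken from this paper), Assumption 1 can be checked directly for well-specified bounded least squares. Assume $Z=(X,Y)$, $\ell(z,\theta)=(y-\langle\theta,x\rangle)^{2}$, $|Y|\leq B$ and $\sup_{\theta\in\Theta}|\langle\theta,X\rangle|\leq B$ almost surely, and $\mathbb{E}[Y|X]=\langle\theta^{*}_{t},X\rangle$. Then

\ell(Z,\theta)-\ell(Z,\theta^{*}_{t})=\langle\theta^{*}_{t}-\theta,X\rangle\left(2Y-\langle\theta+\theta^{*}_{t},X\rangle\right),

so that

V_{t}(\theta,\theta^{*}_{t})\leq 16B^{2}\,\mathbb{E}\left[\langle\theta-\theta^{*}_{t},X\rangle^{2}\right]=16B^{2}\left(R_{P_{t}}(\theta)-R_{P_{t}}^{*}\right),

where the equality uses $\mathbb{E}[(Y-\langle\theta,X\rangle)^{2}]-\mathbb{E}[(Y-\langle\theta^{*}_{t},X\rangle)^{2}]=\mathbb{E}[\langle\theta-\theta^{*}_{t},X\rangle^{2}]$ under well-specification. Hence Bernstein’s condition holds with $c=16B^{2}$.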

In some of the applications developed in Section 5, we will also make the following assumption: there exists $L>0$ such that, for any $P_{t}\sim\mathcal{P}$ and any $\theta\in\Theta$,

R_{P_{t}}(\theta)-R_{P_{t}}^{*}\leq L\|\theta-\theta^{*}_{t}\|^{2}. (4)

Intuitively, Taylor’s expansion of $R_{P_{t}}$ gives

R_{P_{t}}(\theta)=R_{P_{t}}(\theta^{*}_{t})+\underbrace{dR_{P_{t}}(\theta^{*}_{t})\cdot(\theta-\theta^{*}_{t})}_{0}+O(\|\theta-\theta_{t}^{*}\|^{2})=R_{P_{t}}(\theta^{*}_{t})+O(\|\theta-\theta_{t}^{*}\|^{2}).

and thus we can expect (4) to be satisfied when the risk is smooth enough. However, note that this assumption is not necessary for the main results of this paper to hold, and will only be used in very specific applications.

3.2 Learning in Isolation

In the process of learning in isolation, we consider each of the tasks separately. We then fix some $t\in\{1,\dots,T\}$ and denote by $\mathcal{S}_{t}$ the set of observations from task $t$: $Z_{t,1},\dots,Z_{t,n}$. We recall the following result, which can be found, e.g., in Alquier (2021), and whose proof is recalled in Appendix B for the sake of completeness.

Theorem 1

Assume that the loss $\ell$ satisfies (3). Then, the following bound holds, for any $\alpha>0$:

\mathbb{E}_{\mathcal{S}_{t}}\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha,\mathcal{F})}\left[R_{P_{t}}(\theta)\right]-R_{P_{t}}^{*}\leq\frac{1}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\left(\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{F}}\hat{\mathcal{R}}_{t}(\rho,\pi,\alpha)-\hat{R}_{t}(\theta^{*}_{t})\right]+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\right),

where $\mathbb{I}_{B}$ is equal to $1$ if Bernstein’s condition (in Assumption 1) is satisfied, and $0$ otherwise. In particular, under Bernstein’s condition, the choice $\alpha=\frac{1}{c+C}$ yields the bound

\mathbb{E}_{\mathcal{S}_{t}}\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha,\mathcal{F})}[R_{P_{t}}(\theta)]-R_{P_{t}}^{*}\leq 2\,\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{F}}\hat{\mathcal{R}}_{t}(\rho,\pi,\alpha)-\hat{R}_{t}(\theta^{*}_{t})\right].

When specifying a model and a prior, a derivation of the right-hand side in Theorem 1 leads to explicit rates of convergence. For example, a classical assumption in the Bayesian literature is that there are constants $\kappa,d\geq 0$ such that $\pi(\{\theta:R_{t}(\theta)-R_{t}(\theta_{t}^{*})\leq s\})\geq s^{d}/\kappa$ (Ghosal and Van der Vaart, 2017). As this condition is usually applied on one task, with a specific prior, the notation $d$ does not reflect the dependence with respect to $\pi$ or to $t$. However, in our context, this dependence will be crucial, so we will write $d_{\pi,t}$ instead of $d$. Under such an assumption, the right-hand side in Theorem 1 can be made more explicit. For simplicity, we state this result for Gibbs posteriors $\rho_{t}(\pi,\alpha)$ only.

Corollary 2

Assume that, almost surely on $P_{t}$, $\pi(\{\theta:R_{t}(\theta)-R_{t}(\theta_{t}^{*})\leq s\})\geq s^{d_{\pi,t}}/\kappa_{\pi,t}$. Then,

\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{P}(\Theta)}\hat{\mathcal{R}}_{t}(\rho,\pi,\alpha)-\hat{R}_{t}(\theta^{*}_{t})\right]\leq\frac{d_{\pi,t}\log\frac{n\alpha}{d_{\pi,t}}+\log\kappa_{\pi,t}}{\alpha n}.

In particular, under Bernstein’s condition, the choice $\alpha=1/(c+C)$ gives

\mathbb{E}_{\mathcal{S}_{t}}\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha)}[R_{P_{t}}(\theta)]\leq R_{P_{t}}^{*}+2\,\frac{d_{\pi,t}\log\frac{n\alpha}{d_{\pi,t}}+\log\kappa_{\pi,t}}{\alpha n}.

On the other hand, without Bernstein’s condition,

\mathbb{E}_{\mathcal{S}_{t}}\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha)}[R_{P_{t}}(\theta)]\leq R_{P_{t}}^{*}+\frac{d_{\pi,t}\log\frac{n\alpha}{d_{\pi,t}}+\log\kappa_{\pi,t}}{\alpha n}+\frac{\alpha C^{2}}{8},

and in particular, for $\alpha=2\sqrt{2d_{\pi,t}}/(\sqrt{n}C)$, we obtain

\mathbb{E}_{\mathcal{S}_{t}}\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha)}[R_{P_{t}}(\theta)]\leq R_{P_{t}}^{*}+\frac{C}{2}\sqrt{\frac{d_{\pi,t}}{2n}}\left(\frac{1}{2}\log\frac{8e^{2}n}{d_{\pi,t}C^{2}}+\frac{1}{d_{\pi,t}}\log\kappa_{\pi,t}\right).
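For a rough numerical feel of the gap between the two regimes of Corollary 2, here is a small sketch (purely illustrative values, chosen by us):

import numpy as np

# Illustrative values only: d_{pi,t} = 5, log kappa_{pi,t} = 0, C = c = 1, n = 1000.
d, log_kappa, C, c, n = 5.0, 0.0, 1.0, 1.0, 1000

# Fast-rate bound of Corollary 2, with alpha = 1 / (c + C) (Bernstein's condition holds).
alpha_fast = 1.0 / (c + C)
fast = 2 * (d * np.log(n * alpha_fast / d) + log_kappa) / (alpha_fast * n)

# Slow-rate bound, with alpha = 2 * sqrt(2 d_{pi,t}) / (sqrt(n) C) (no Bernstein's condition).
slow = (C / 2) * np.sqrt(d / (2 * n)) * (0.5 * np.log(8 * np.e**2 * n / (d * C**2)) + log_kappa / d)

print(f"fast-rate bound ~ {fast:.3f}, slow-rate bound ~ {slow:.3f}")
# The first bound decays like d*log(n)/n, the second like sqrt(d/n)*log(n).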

4 Main Results

From this section onward, we focus on meta-learning. As opposed to learning in isolation, meta-learning considers all the tasks $t\in\{1,\dots,T\}$ and takes advantage of possible similarities between the $T$ tasks to improve learning in each task. More precisely, while assuming that for any $t\in\{1,\dots,T\}$, $Z_{t,1},\dots,Z_{t,n}$ are drawn independently from distribution $P_{t}$, we also assume that the distributions $P_{1},\dots,P_{T}$ are drawn independently from a certain distribution $\mathcal{P}$. A future (out-of-sample) task $P_{T+1}$ will also be drawn from $\mathcal{P}$ and $Z_{T+1,1},\dots,Z_{T+1,n}$ will be drawn independently from $P_{T+1}$. This task will be solved by the Gibbs algorithm $\rho_{T+1}(\pi,\alpha)$. Our objective is to learn the prior $\pi$ using the tasks $t\in\{1,\dots,T\}$, in order to make the meta-risk

\mathcal{E}(\pi)=\mathbb{E}_{P_{T+1}\sim\mathcal{P}}\mathbb{E}_{\mathcal{S}_{T+1}\sim P_{T+1}}\mathbb{E}_{\theta\sim\rho_{T+1}}[R_{P_{T+1}}(\theta)]

as small as possible. We will compare it to the so-called oracle meta-risk

\mathcal{E}^{*}=\mathbb{E}_{P_{T+1}\sim\mathcal{P}}[R_{P_{T+1}}^{*}],

which can only be reached by an oracle who would know the best classifier in each task in advance.

4.1 Bernstein’s condition at the meta level

In this subsection, we prove a version of Bernstein’s condition at the meta level. Let

\mathcal{R}_{t}(\nu,\alpha)=\mathbb{E}_{\mathcal{S}_{t}\sim P_{t}}\left[\inf_{\rho\in\mathcal{P}(\Theta)}\hat{\mathcal{R}}_{t}(\rho,\nu,\alpha)\right]

for any prior $\nu$, and let $\pi^{*}_{\alpha}$ be the distribution minimizing the expectation of the above quantity:

\pi^{*}_{\alpha}=\operatorname{argmin}_{\nu}\mathbb{E}_{P_{t}\sim\mathcal{P}}\left[\mathcal{R}_{t}(\nu,\alpha)\right].

Surprisingly enough, in contrast to learning in isolation, Bernstein’s condition is always satisfied at the meta level, provided that we use Gibbs posteriors $\rho_{t}(\pi,\alpha)$ within tasks.

Theorem 3

Assume that the loss $\ell$ satisfies the boundedness assumption (3). Then, for any $\pi\in\mathcal{P}(\Theta)$,

\mathbb{E}_{P_{t}}\mathbb{E}_{S_{t}}\left[\left(\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right)^{2}\right]\leq c\,\mathbb{E}_{P_{t}}\mathbb{E}_{S_{t}}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right],

where $c=8eC$.

Proof  It classically holds (e.g. Alquier (2021)) that

\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)=-\frac{1}{n\alpha}\log\mathbb{E}_{\theta\sim\pi}\left[{\rm e}^{-n\alpha\hat{R}_{t}(\theta)}\right]=-\frac{1}{\tau}\log\left(\mathbb{E}_{\theta\sim\pi}\left[{\rm e}^{-n\alpha\hat{R}_{t}(\theta)}\right]^{\frac{\tau}{n\alpha}}\right) (5)

for any fixed $\tau>0$. By the boundedness assumption, it holds that, for any $\pi\in\mathcal{P}(\Theta)$,

\exp(-C\tau)\leq\mathbb{E}_{\theta\sim\pi}\left[{\rm e}^{-n\alpha\hat{R}_{t}(\theta)}\right]^{\frac{\tau}{n\alpha}}\leq 1.

We next use the following lemma, whose proof can be found in Appendix D.

Lemma 4

Let $f:x\mapsto-\frac{1}{\tau}\log x$. Then, for any $x,y\in\left[\exp(-C\tau),1\right]$,

(f(x)-f(y))^{2}\leq\frac{8\exp(2C\tau)}{\tau}\left(\frac{f(x)+f(y)}{2}-f\left(\frac{x+y}{2}\right)\right).

An application of Lemma 4 to $x=\mathbb{E}_{\theta\sim\pi}\left[{\rm e}^{-n\alpha\hat{R}_{t}(\theta)}\right]^{\frac{\tau}{n\alpha}}$ and $y=\mathbb{E}_{\theta\sim\pi^{*}_{\alpha}}\left[{\rm e}^{-n\alpha\hat{R}_{t}(\theta)}\right]^{\frac{\tau}{n\alpha}}$ gives

\left(\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right)^{2}
=(f(x)-f(y))^{2}
\leq\frac{8\exp(2C\tau)}{\tau}\left(\frac{f(x)+f(y)}{2}-f\left(\frac{x+y}{2}\right)\right)
=\frac{8\exp(2C\tau)}{\tau}\left[\frac{\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)+\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)}{2}+\frac{1}{\tau}\log\left(\frac{x+y}{2}\right)\right]

(these derivations are directly inspired by the proof technique introduced by Bartlett et al. (2006)).

Taking expectations with respect to $S_{t}\sim P_{t}$ on both sides yields

\mathbb{E}_{S_{t}}\left[\left(\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right)^{2}\right]\leq\frac{8\exp(2C\tau)}{\tau}\left(\frac{\mathcal{R}_{t}(\pi,\alpha)+\mathcal{R}_{t}(\pi^{*}_{\alpha},\alpha)}{2}-\mathcal{R}_{t}\left(\frac{\pi+\pi^{*}_{\alpha}}{2},\alpha\right)\right).

Integrating with respect to $P_{t}\sim\mathcal{P}$ yields

\mathbb{E}_{P_{t}}\mathbb{E}_{S_{t}}\left[\left(\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right)^{2}\right]\leq\frac{8\exp(2C\tau)}{\tau}\left(\frac{\mathbb{E}_{P_{t}}\mathcal{R}_{t}(\pi,\alpha)+\mathbb{E}_{P_{t}}\mathcal{R}_{t}(\pi^{*}_{\alpha},\alpha)}{2}-\mathbb{E}_{P_{t}}\mathcal{R}_{t}\left(\frac{\pi+\pi^{*}_{\alpha}}{2},\alpha\right)\right). (6)

By definition of $\pi^{*}_{\alpha}$, it holds that, for any $\pi^{\prime}\in\mathcal{P}(\Theta)$,

\mathbb{E}_{P_{t}}[\mathcal{R}_{t}(\pi^{*}_{\alpha},\alpha)]\leq\mathbb{E}_{P_{t}}[\mathcal{R}_{t}(\pi^{\prime},\alpha)].

In particular, this holds for $\pi^{\prime}=(\pi+\pi^{*}_{\alpha})/2$ and plugging this into the right-hand side of (6) gives

\mathbb{E}_{P_{t}}\mathbb{E}_{S_{t}}\left[\left(\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right)^{2}\right]\leq\frac{4\exp(2C\tau)}{\tau}\left(\mathbb{E}_{P_{t}}\mathcal{R}_{t}(\pi,\alpha)-\mathbb{E}_{P_{t}}\mathcal{R}_{t}(\pi^{*}_{\alpha},\alpha)\right).

The (optimal) choice $\tau=\frac{1}{2C}$, for which $\frac{4\exp(2C\tau)}{\tau}=8eC=c$, gives the desired bound

\mathbb{E}_{P_{t}}\mathbb{E}_{S_{t}}\left[\left(\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right)^{2}\right]\leq c\,\mathbb{E}_{P_{t}}\mathbb{E}_{S_{t}}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right].

 

4.2 PAC-Bayes Bound for Meta-learning

We will now seek a prior $\pi$ which allows us to obtain a small meta-risk

\mathbb{E}_{\pi\sim\Pi}[\mathcal{E}(\pi)].

In order to do so, we will fix a set of possible priors $\mathcal{M}$ and a set of distributions $\mathcal{G}$ on these priors: $\mathcal{G}\subseteq\mathcal{P}(\mathcal{M})$ (note that measurability issues can arise when the set $\mathcal{M}$ is nonparametric; however, in all our examples, this set is parametric). Given a probability distribution $\Lambda\in\mathcal{G}$ called “prior on priors”, we define the Gibbs distribution on priors similarly as in (1), but at the meta-level:

\hat{\Pi}=\operatorname{argmin}_{\Pi\in\mathcal{G}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi\sim\Pi}\left[\hat{\mathcal{R}}_{t}\left(\rho_{t}(\pi,\alpha),\pi,\alpha\right)\right]+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}\right\}, (7)

where $\beta>0$ is some parameter. As a consequence of Theorem 3, we obtain the next result, whose proof is given in Appendix E.

Theorem 5

Assume that the loss $\ell$ satisfies (3). Defining $\beta=\frac{1}{C+c}$, it holds, for any $\mathcal{F}\subseteq\mathcal{P}(\Theta)$,

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\mathcal{E}(\pi)\right]-\mathcal{E}^{*}\leq\frac{2}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\inf_{\Pi\in\mathcal{G}}\mathbb{E}_{P_{T+1}}\Biggl[\mathbb{E}_{\pi\sim\Pi}\left[\inf_{\rho\in\mathcal{F}}\left\{\mathbb{E}_{\theta\sim\rho}[R_{T+1}(\theta)-R_{T+1}(\theta_{T+1}^{*})]+\frac{{\rm KL}(\rho\|\pi)}{\alpha n}\right\}\right]+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\Biggr].

While the bound given in Theorem 5 depends on some subset $\mathcal{F}\subseteq\mathcal{P}(\Theta)$ chosen for computational reasons, the bound is on the excess risk of $\hat{\Pi}$, which itself is based on the exact Gibbs posterior $\rho_{t}(\pi,\alpha)$, and not on its variational approximation $\hat{\Pi}(\mathcal{F})$, based on $\rho_{t}(\pi,\alpha,\mathcal{F})$ and defined as

\hat{\Pi}(\mathcal{F})=\operatorname{argmin}_{\Pi\in\mathcal{G}}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi\sim\Pi}\left[\hat{\mathcal{R}}_{t}\left(\rho_{t}(\pi,\alpha,\mathcal{F}),\pi,\alpha\right)\right]+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}\right\}.

In some settings, $\hat{\Pi}(\mathcal{F})$ is tractable while its Gibbs-based counterpart $\hat{\Pi}$ is not, and it is a fundamental open question to determine under what condition on $\mathcal{F}$ we can replace $\hat{\Pi}$ by $\hat{\Pi}(\mathcal{F})$ in Theorem 5.

Open Question 1

Under what conditions on $\mathcal{F}$ can we replace $\hat{\Pi}$ by $\hat{\Pi}(\mathcal{F})$ in the left-hand side of Theorem 5?

4.3 A Toy Application of Theorem 5: Concurrent Priors

This subsection gives a toy application of Theorem 5, just to fix ideas. Here, we study the case where $M$ statisticians each propose a different prior, all of which are assumed to satisfy a prior mass condition as in Corollary 2. We denote by $\mathcal{M}=\{\pi_{1},\dots,\pi_{M}\}$ the set of priors. We choose $\Lambda$ as the uniform distribution on $\mathcal{M}$ and $\mathcal{G}=\mathcal{P}(\mathcal{M})$. Here again, for the sake of simplicity, we assume that Bernstein’s condition (see Assumption 1) is satisfied at the task level.

A direct application of Theorem 5 and Corollary 2 gives

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\mathcal{E}(\pi)\right]-\mathcal{E}^{*}\leq 4\min_{\pi\in\mathcal{M}}\mathbb{E}_{P_{T+1}}\frac{d_{\pi,T+1}\log\frac{n\alpha}{d_{\pi,T+1}}+\log\kappa_{\pi,T+1}}{\alpha n}+\frac{4\log M}{\beta T}.

In other words, we obtain the rate of convergence provided by the best prior among $\{\pi_{1},\dots,\pi_{M}\}$, at the price of an additional $\log(M)/T$ term.
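When $\Theta$ is also finite, every quantity in this toy example is available in closed form, since $\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)=-\frac{1}{n\alpha}\log\mathbb{E}_{\theta\sim\pi}[{\rm e}^{-n\alpha\hat{R}_{t}(\theta)}]$ (see (5) in the proof of Theorem 3) and $\hat{\Pi}$ in (7) is itself a Gibbs distribution over the $M$ candidate priors. A minimal sketch, with illustrative data of our own:

import numpy as np

def log_sum_exp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def task_objective(prior, emp_risks, alpha, n):
    # \hat{R}_t(rho_t(pi, alpha), pi, alpha) = -(1/(n*alpha)) log E_{theta~pi}[exp(-n*alpha*R_hat_t(theta))].
    return -log_sum_exp(np.log(prior) - n * alpha * emp_risks) / (n * alpha)

rng = np.random.default_rng(0)
T, n, alpha, beta = 50, 100, 0.1, 0.1
emp_risks = rng.uniform(0.2, 0.8, size=(T, 10))     # R_hat_t(theta), T tasks, |Theta| = 10

# M = 3 candidate priors proposed by the "statisticians" (one uniform, two peaked).
priors = [np.full(10, 0.1),
          np.eye(10)[3] * 0.90 + 0.01,
          np.eye(10)[7] * 0.90 + 0.01]

avg_obj = np.array([np.mean([task_objective(p, emp_risks[t], alpha, n) for t in range(T)])
                    for p in priors])

# Meta-level Gibbs distribution (7), with Lambda uniform over the M candidates:
# Pi_hat(pi_j) is proportional to exp(-beta * T * average objective of pi_j).
log_Pi = -beta * T * avg_obj
log_Pi -= log_Pi.max()
Pi_hat = np.exp(log_Pi) / np.exp(log_Pi).sum()
print(Pi_hat)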

5 Applications of Theorem 5

In this section, by an application of Theorem 5, we derive explicit bounds on the excess risk of the Gibbs algorithm in the case of discrete priors (the parameter set $\Theta$ is finite; Subsection 5.1), Gaussian priors (Subsection 5.2) and mixtures of Gaussian priors (Subsection 5.3).

5.1 Learning Discrete Priors

In this subsection, we assume that $|\Theta|=M<\infty$. Following Meunier and Alquier (2021), we define $A^{*}$ as the smallest possible subset of $\Theta$ such that

\forall P\sim\mathcal{P},\ \theta^{*}:=\operatorname{argmin}_{\theta}R_{P}(\theta)\in A^{*}, (8)

and we denote $m^{*}:=|A^{*}|$. In general, $A^{*}=\Theta$ and $m^{*}=M$. However, in some favorable situations, $A^{*}\neq\Theta$ and $m^{*}\ll M$, in which case the meta-learning may improve upon the learning in isolation. In the setting considered, Bernstein’s condition is trivially satisfied and the excess risk of the Gibbs algorithm is bounded by $\frac{4\log(M)}{\alpha n}$.

We define our set of priors $\mathcal{M}$ as the set of probability distributions $\pi_{A}$ which are uniform on a subset $A\subseteq\Theta$ and parameterized by $A$:

\mathcal{M}=\left\{\pi_{A}\,|\,A\subseteq\Theta\right\},

and $\mathcal{G}$ is the set of all distributions on $\mathcal{M}$. Our “prior on priors” $\Lambda$ is then defined as follows: we draw $m\in\{1,\dots,M\}$ with probability $\frac{2^{M-m}}{2^{M}-1}\approx 2^{-m}$ (for $M\gg 1$), then, given $m$, draw a subset $A\subseteq\Theta$ of cardinality $m$ uniformly at random, and take $\pi_{A}$. In other words, $\Lambda$ is a distribution defined on $\mathcal{M}$ such that $P_{\pi\sim\Lambda}(\pi=\pi_{A})=\frac{2^{M-m}}{2^{M}-1}\times\frac{1}{\binom{M}{m}}$.
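A small sketch of how one may sample a prior $\pi_{A}$ from this $\Lambda$ (illustrative, with $M=20$):

import numpy as np

rng = np.random.default_rng(1)
M = 20

# P(m) = 2^{M-m} / (2^M - 1), m = 1, ..., M (roughly 2^{-m} for large M).
log_weights = (M - np.arange(1, M + 1)) * np.log(2.0)
probs = np.exp(log_weights - log_weights.max())
probs /= probs.sum()
m = rng.choice(np.arange(1, M + 1), p=probs)

# Given m, draw a subset A of cardinality m uniformly at random, and take pi_A uniform on A.
A = rng.choice(M, size=m, replace=False)
pi_A = np.zeros(M)
pi_A[A] = 1.0 / m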

Proposition 6

The excess risk of the meta predictor $\hat{\Pi}$ defined in (7) is bounded as follows:

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+\frac{4\log m^{*}}{\alpha n}+\frac{4m^{*}\log\frac{2{\rm e}M}{m^{*}}}{\beta T}.
Remark 7

Let us now compare the meta-learning rate above to the $\frac{4\log M}{\alpha n}$ rate achieved by the learning in isolation. In the unfavorable case $m^{*}\sim M$, the meta-learning bound is larger than the learning-in-isolation one, by a term $O(M/T)$ which vanishes rapidly when $T\to+\infty$.

In the favorable case $m^{*}\ll M$ however, the meta-learning considerably improves upon the learning in isolation for large values of $T$. In the extreme case where $m^{*}=1$, we have

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+\frac{4\log(2{\rm e}M)}{\beta T}.

Thus, the benefits of the meta-learning are mainly expected in the $T\gg n$ regime. They are huge when the tasks are very similar, and close to zero when the tasks are totally unrelated. This is in line with the results of Meunier and Alquier (2021) in the online setting.

Proof  We first consider learning in isolation. The classical choice is to take $\pi$ as uniform in each task. An application of Theorem 1 gives

\mathbb{E}_{\mathcal{S}_{t}}\left[\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha)}[R_{P_{t}}(\theta)]\right]-\mathcal{E}^{*}\leq 2\inf_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}\left[R_{P_{t}}(\theta)\right]-R_{P_{t}}^{*}+\frac{\mathrm{KL}(\rho\|\pi)}{\alpha n}\right\}
\leq 2\inf_{\rho=\delta_{\vartheta}}\left\{\mathbb{E}_{\theta\sim\rho}\left[R_{P_{t}}(\theta)\right]-R_{P_{t}}^{*}+\frac{\mathrm{KL}(\rho\|\pi)}{\alpha n}\right\}
=\frac{2\log M}{\alpha n}.

In the meta-learning case, an application of Theorem 5 gives

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{1\leq m\leq M}\inf_{|A|=m}\mathbb{E}_{P_{T+1}\sim\mathcal{P}}\bigg[\inf_{\theta\in A}\left\{R_{P_{T+1}}(\theta)-R_{P_{T+1}}^{*}\right\}+\frac{\log(m)}{\alpha n}+\frac{m\log 2+\log\binom{M}{m}}{\beta T}\bigg].

Under the assumption (8), it holds that

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+\frac{4\log(m^{*})}{\alpha n}+\frac{4m^{*}\log(2)+4\log\binom{M}{m^{*}}}{\beta T}.

We conclude by using the classic bound $\log\binom{M}{m}\leq m\log\frac{M{\rm e}}{m}$.

5.2 Learning Gaussian priors

In this subsection, we consider the set of all Gaussian distributions

\mathcal{M}=\left\{p_{\mu,\sigma^{2}}=\bigotimes_{i=1}^{d}\mathcal{N}(\mu_{i},\sigma_{i}^{2}),\ \mu=(\mu_{1},\dots,\mu_{d})\in\mathbb{R}^{d},\ \sigma^{2}=(\sigma_{1}^{2},\dots,\sigma_{d}^{2})\in(\mathbb{R}_{+}^{*})^{d}\right\}. (9)

So, priors on priors are actually defined as priors on $\mu$ and $\sigma^{2}$, given by:

\mathcal{G}=\Bigg\{q_{\tau,\xi^{2},a,b}=\bigotimes_{\begin{subarray}{c}k\in[K]\\ i\in[d]\end{subarray}}\mathcal{N}(\tau_{k,i},\xi_{k}^{2})\otimes\bigotimes_{k=1}^{K}\Gamma(a_{k},b_{k})\Bigg\}, (10)

and we choose the prior on priors as $\Lambda=q_{0,\bar{\bar{\xi}}^{2},\bar{\bar{a}},\bar{\bar{b}}}$.

From now on and until the end of this section, we assume that both (4) and Assumption 1 (Bernstein’s condition) hold, and we look for a prior on priors $\Pi$ from which to sample $\pi$, such that $\rho_{T+1}(\pi,\alpha)$ concentrates as much as possible around the best parameter. Denoting $\mu^{*}:=\mathbb{E}_{P_{T+1}\sim\mathcal{P}}[\mu_{P_{T+1}}]$ and $\Sigma(\mathcal{P}):=\mathbb{E}_{P_{T+1}\sim\mathcal{P}}\left[\|\mu_{P_{T+1}}-\mu^{*}\|^{2}\right]$, the following holds.

Proposition 8

Under Assumption 1 and condition (4), the excess risk of $\hat{\Pi}$ defined in (7) for $\mathcal{M}$ and $\mathcal{G}$ defined in (9) and (10) is bounded as follows:

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\begin{cases}G(C,\mu^{*},\bar{\bar{\xi}},L,\bar{\bar{a}},\bar{\bar{b}})\,\frac{d+\log(T)}{T}&\mbox{if }\Sigma(\mathcal{P})\leq\frac{n}{T};\\ F(C,\mu^{*},\bar{\bar{\xi}},L,\bar{\bar{a}},\bar{\bar{b}})\left(\frac{d\log n+\Sigma(\mathcal{P})}{n}+\frac{d}{T}\right)&\mbox{otherwise,}\end{cases}

where $F$ and $G$ are functions of the problem parameters.

Interestingly enough, in the favorable case $\Sigma(\mathcal{P})\leq\frac{n}{T}$, a proper choice of the prior on priors leads to the very fast rate of convergence $O\left(\frac{\log T}{T}\right)$, which considerably improves upon the fast rate $O\left(\frac{1}{n}+\frac{1}{T}\right)$ when $n\ll T$. The detailed proof of this proposition, as well as explicit expressions of $F$ and $G$, are given in Appendix F.

5.3 Learning Mixtures of Gaussian priors

In this subsection, we generalize the result of the previous section to priors that are mixtures of Gaussians. We still assume that condition (4) and Assumption 1 (Bernstein’s condition) hold. We first assume that the number of components in the mixture is known. Under these hypotheses, the set of possible priors $\pi$ is

\mathcal{M}=\left\{p_{w,\mu,\sigma^{2}}=\sum_{k=1}^{K}w_{k}\bigotimes_{i=1}^{d}\mathcal{N}(\mu_{k,i},\sigma^{2}_{k,i}):\forall k\in[K],\ w_{k}\geq 0,\ 1^{\top}w=1\right\}. (11)

We add a Dirichlet prior on the weights $w$ of the components in the mixture, and the set of priors on priors becomes

\mathcal{G}=\Bigg\{q_{\delta,\tau,\xi^{2},b}=\mathrm{Dir}(\delta)\otimes\bigotimes_{\begin{subarray}{c}k\in[K]\\ i\in[d]\end{subarray}}\mathcal{N}(\tau_{k,i},\xi_{k}^{2})\otimes\bigotimes_{k=1}^{K}\Gamma(2,b_{k}):\delta=(\delta_{1},\dots,\delta_{K})\in\mathbb{R}^{K}\Bigg\}, (12)

while the initial prior on priors is chosen as $\Lambda=q_{\mathbb{I}_{K},0,\bar{\bar{\xi}}^{2},\bar{\bar{b}}}$. We define

\Sigma_{K}(\mathcal{P}):=\inf_{\tau_{1},\dots,\tau_{K}}\mathbb{E}_{P_{T+1}\sim\mathcal{P}}\left[\min_{k\in[K]}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}\right].
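Note that $\Sigma_{K}(\mathcal{P})$ is exactly a population $K$-means objective on the task-specific parameters $\mu_{P}$. As an illustration of what it measures (assuming, purely for this sketch, direct access to per-task parameters), its empirical counterpart can be approximated by Lloyd’s algorithm:

import numpy as np

def kmeans_objective(mus, K, n_iter=50, seed=0):
    # Approximates inf_{tau_1..tau_K} (1/T) * sum_t min_k ||mu_t - tau_k||^2
    # with a few Lloyd iterations (a heuristic for the infimum, not an exact solver).
    rng = np.random.default_rng(seed)
    tau = mus[rng.choice(len(mus), size=K, replace=False)]
    for _ in range(n_iter):
        d2 = ((mus[:, None, :] - tau[None, :, :]) ** 2).sum(-1)   # (T, K) squared distances
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                tau[k] = mus[labels == k].mean(axis=0)
    d2 = ((mus[:, None, :] - tau[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()

# Illustrative per-task parameters drawn around two well-separated centers:
rng = np.random.default_rng(0)
mus = np.vstack([rng.normal(0.0, 0.1, size=(30, 2)), rng.normal(5.0, 0.1, size=(30, 2))])
print(kmeans_objective(mus, K=1), kmeans_objective(mus, K=2))   # Sigma_1 is large, Sigma_2 is tiny

In particular, $\Sigma_{1}(\mathcal{P})=\Sigma(\mathcal{P})$, as noted in Remark 10 below.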
Proposition 9

Under Assumption 1 and condition (4), the excess risk of $\hat{\Pi}$ defined in (7) for $\mathcal{M}$ and $\mathcal{G}$ defined in (11) and (12) is bounded as follows:

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\mathrm{CV}_{\mathrm{finite}}(K,n)+K\times\mathrm{CV}_{\mathrm{Gaussian}}\left(d,\Sigma_{K}(\mathcal{P}),n,T\right)+\mathrm{CV}_{\mathrm{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau),

where

\mathrm{CV}_{\mathrm{finite}}(K,n)=\frac{4\log(2K)}{\alpha n};\qquad\mathrm{CV}_{\mathrm{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)=O\left(\frac{dK\log T}{T}\right);

\mathrm{CV}_{\mathrm{Gaussian}}\left(d,\Sigma_{K}(\mathcal{P}),n,T\right)=\begin{cases}\frac{8Ld}{T}+\frac{4}{\alpha T}&\mbox{if }\Sigma_{K}(\mathcal{P})\leq\frac{n}{T^{2}};\\ \frac{2d}{\alpha n}\log\left(1+4\alpha Ln\right)+\frac{4\Sigma_{K}(\mathcal{P})}{\alpha n}&\mbox{otherwise.}\end{cases}

Let us analyze each of the terms of the above bound. The first term $\mathrm{CV}_{\mathrm{finite}}(K,n)$ is the bound we had in the finite case of Subsection 5.1. Viewing our $K$ mixture components as the $K$ points in the finite case, this term is the price paid by our estimator to select the right component. While this term makes the convergence rate of $O\left(\frac{1}{n}+\frac{1}{T}\right)$ notably worse than the $O\left(\frac{1}{T}\right)$ we might hope for, it is essentially unavoidable, as it already appears in the much simpler model of a finite set of $K$ parameters described in Subsection 5.1.

The next term $\mathrm{CV}_{\mathrm{Gaussian}}\left(d,\Sigma_{K}(\mathcal{P}),n,T\right)$ is similar to the bound obtained in the Gaussian case of Subsection 5.2, with the exception that $\Sigma(\mathcal{P})$ is replaced by $\Sigma_{K}(\mathcal{P})$; it reflects the cost of converging to the best Gaussian component for each task $t$.

Finally, the last term $\mathrm{CV}_{\mathrm{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)$ is the convergence term at the meta level and is of order $O\left(\frac{dK\log T}{T}\right)$. This is the cost of meta-learning compared to learning in isolation. When $T\gg n$, this term is very small and justifies the use of meta-learning.

Remark 10

One may think, looking at the bounds in the two previous cases considered, that the $O\left(\frac{1}{n}+\frac{1}{T}\right)$ convergence rate in the case of mixtures of Gaussians is slower than the one for Gaussians, which can be as fast as $O\left(\frac{1}{T}\right)$. In reality, the rate of convergence is naturally faster for the model of mixtures of Gaussians: in that case, the convergence term $\mathrm{CV}_{\mathrm{Gaussian}}\left(d,\Sigma_{K}(\mathcal{P}),n,T\right)$ is $O\left(\frac{1}{T}\right)$ under the assumption that $\Sigma_{K}(\mathcal{P})\leq\frac{n}{T}$, while in the Gaussian case, the much stronger assumption $\Sigma(\mathcal{P})=\Sigma_{1}(\mathcal{P})\leq\frac{n}{T}$ is required. Under this assumption, the same rate $O\left(\frac{1}{T}\right)$ is naturally achieved.

We now consider the case where the number of components $K$ is unknown. The set of priors hence becomes the set of all (finite) mixtures of Gaussians:

\mathcal{M}=\left\{p_{w,\mu,\sigma^{2}}=\sum_{k=1}^{+\infty}w_{k}\bigotimes_{i=1}^{d}\mathcal{N}(\mu_{k,i},\sigma^{2}_{k,i}):\exists K\geq 1:\forall k\geq K+1,\ w_{k}=0\right\}. (13)

In the definition of the set of priors on priors $\mathcal{G}$, we assume that $K\leq T$ (otherwise, there is a high chance of overfitting). We then set a Dirichlet prior on the number of components $K$, and given $K$, set the same model as before. Formally,

\mathcal{G}=\Bigg\{q_{x,\delta,\tau,\xi^{2},b}=q_{x}\times q_{\delta,\tau,\xi^{2},b|K}\Bigg\}, (14)

and we set the prior on priors $\Lambda=q_{\frac{1}{T}\mathbb{I}_{T},\mathbb{I}_{K},0,\bar{\bar{\xi}}^{2},\bar{\bar{b}}}$. The application of Theorem 5 yields the following bound.

Proposition 11

Under the same conditions and using the same notations as Proposition 9, the excess risk of $\hat{\Pi}$ defined in (7) for $\mathcal{M}$ and $\mathcal{G}$ defined in (13) and (14) is bounded as follows:

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\inf_{K\in[T]}\Bigg\{\mathrm{CV}_{\mathrm{finite}}(K,n)+K\times\mathrm{CV}_{\mathrm{Gaussian}}\left(d,\Sigma_{K}(\mathcal{P}),n,T\right)+\mathrm{CV}_{\mathrm{meta}}^{\mathrm{unknown}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)\Bigg\},

where the convergence term at the meta level becomes

\mathrm{CV}_{\mathrm{meta}}^{\mathrm{unknown}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)=\mathrm{CV}_{\mathrm{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)+\frac{2\log T}{\beta T}.

Our estimator pays an additional $\frac{2\log T}{\beta T}$ to find the optimal number of components at the meta level. This is the price to pay to have the infimum over $K$ in the bound. In the regime $T\gg n$, and when no prior information on $K$ is available, this clearly improves upon the bound of Proposition 9, and hence justifies setting a prior on $K$ at the meta level rather than choosing an isolated $K$, pleading again in favor of meta-learning. The proofs of all the results of this subsection are given in Appendix G.

6 Discussion

In recent years, the statistical guarantees of meta-learning have received increasing attention. In the following paragraphs, we present a short review of the literature on the statistical theory of meta-learning, followed by a brief discussion of three papers that are closely related to our analysis.

Theoretical bounds in meta-learning. The first theoretical analysis of meta-learning goes back to Baxter (2000), who introduced the notion of task environment and derived a uniform generalization bound based on the capacity and covering number of the model. Following this i.i.d. task environment setting, many other generalization bounds have since been provided for different strategies and proof techniques, including VC theory (Baxter, 2000; Ben-David and Schuller, 2003; Maurer, 2009; Maurer et al., 2016; Guan and Lu, 2022b), algorithmic stability (Maurer and Jaakkola, 2005; Chen et al., 2020; Al-Shedivat et al., 2021; Guan et al., 2022) and information theory (Jose and Simeone, 2021; Chen et al., 2021; Jose et al., 2021; Rezazadeh et al., 2021; Hellström and Durisi, 2022). A related approach for deriving such bounds is based on PAC-Bayes theory. First proposed in the meta-learning framework in the pioneering paper of Pentina and Lampert (2014), this idea of learning a hyper-posterior that generates a prior for the new task has been taken up several times in the recent years (Amit and Meir, 2018; Ding et al., 2021; Liu et al., 2021; Rothfuss et al., 2021; Farid and Majumdar, 2021; Rothfuss et al., 2022; Guan and Lu, 2022a; Rezazadeh, 2022). In particular, Amit and Meir (2018) derived a new PAC-Bayes bound, which they applied to the optimization of deep neural networks, albeit with computational limitations. This latter concern was partially addressed by Rothfuss et al. (2021), who also specified the hyper-posterior and extended the results to unbounded losses, and further investigated their study in Rothfuss et al. (2022). Some papers combined ideas from different literatures, such as Farid and Majumdar (2021), who explored the link between PAC-Bayes and uniform stability in meta-learning, and provided a precise analysis of stability and generalization. Excess risk bounds have also been provided in the i.i.d. task environment framework, see Maurer et al. (2016); Denevi et al. (2018a, b, 2019a, 2019b, 2020); Balcan et al. (2019); Bai et al. (2021); Chen and Chen (2022). The task environment assumption has recently been challenged, for example by Du et al. (2020) and Tripuraneni et al. (2021), who proposed to use assumptions on the distributional similarity between the features and the diversity of tasks to control the excess risk, an idea further explored by Fallah et al. (2021) who exploited a notion of diversity between the new task and training tasks using the total variation distance. Finally, a detailed analysis of regret bounds in lifelong learning has been carried out in recent years (Alquier et al., 2017; Denevi et al., 2018b, 2019b; Balcan et al., 2019; Khodak et al., 2019; Finn et al., 2019; Meunier and Alquier, 2021).

Comparison to Denevi et al. (2019a). Denevi et al. (2019a) is probably the study that is the most related to our paper. The authors provide statistical guarantees for Ridge regression with a meta-learned bias, and focus on the usefulness of their strategy relative to single-task learning, proving that their method outperforms the standard $\ell_{2}$-regularized empirical risk minimizer. In particular, they can achieve an excess risk rate of order $O\left(1/\sqrt{T}\right)$ in the favorable case $\Sigma(\mathcal{P})\leq\frac{n}{T}$, where $\Sigma(\mathcal{P})$ is a variance term similar to the one we defined in our Gaussian example.

Comparison to Guan et al. (2022). To the best of our knowledge, Guan et al. (2022) is the only work in the meta-learning literature that addresses fast rates with respect to the number of tasks $T$ under the task environment assumption. However, we actually show in our paper that there is no need to extend Bernstein’s condition when using exact Bayesian inference and that the final posterior naturally satisfies the extended Bernstein assumption, thus giving fast rates with respect to $T$, while Guan et al. (2022) require an additional Polyak-Łojasiewicz condition to achieve fast rates. Furthermore, their analysis is very different in nature, relying on stability arguments to derive generalization bounds, while we use PAC-Bayes theory to control the excess risk.

Comparison to Guan and Lu (2022a) and Rezazadeh (2022). Finally, Guan and Lu (2022a) and Rezazadeh (2022) provide fast rate generalization bounds based on Catoni’s PAC-Bayes inequality. Nevertheless, their notion of fast rates differs substantially from ours: our rates quantify the rate of convergence of the excess risk, in line with the classic statistical learning literature. In contrast, the fast rates of Guan and Lu (2022a) are only available for a variant of the generalization gap that compares the theoretical risk to a factor of the empirical risk. In order to obtain a sharp factor equal to $1$ in front of the empirical risk, one needs to fine-tune a parameter in Guan and Lu (2022a)’s inequality, which would then return a slow rate.

7 Conclusion and Open Problems

We provided an analysis of the excess risk in meta-learning the prior via PAC-Bayes bounds. Surprisingly, at the meta-level, conditions for fast rates are always satisfied if one uses exact Gibbs posteriors at the task level. An important problem is to extend this result to variational approximations of Gibbs posteriors.

References

  • Achille et al. (2019) Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6430–6439, 2019.
  • Al-Shedivat et al. (2021) Maruan Al-Shedivat, Liam Li, Eric Xing, and Ameet Talwalkar. On data efficiency of meta-learning. In International Conference on Artificial Intelligence and Statistics, pages 1369–1377. PMLR, 2021.
  • Alquier et al. (2016) P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(239):1–41, 2016. URL http://jmlr.org/papers/v17/15-290.html.
  • Alquier (2021) Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.
  • Alquier et al. (2017) Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret Bounds for Lifelong Learning. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 261–269, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
  • Amit and Meir (2018) R. Amit and R. Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, pages 205–214. PMLR, 2018.
  • Bai et al. (2021) Yu Bai, Minshuo Chen, Pan Zhou, Tuo Zhao, Jason Lee, Sham Kakade, Huan Wang, and Caiming Xiong. How important is the train-validation split in meta-learning? In International Conference on Machine Learning, pages 543–553. PMLR, 2021.
  • Balcan et al. (2019) Maria-Florina Balcan, Mikhail Khodak, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424–433. PMLR, 2019.
  • Bartlett et al. (2006) P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
  • Bartlett and Mendelson (2006) Peter L Bartlett and Shahar Mendelson. Empirical minimization. Probability theory and related fields, 135(3):311–334, 2006.
  • Baxter (2000) Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198, 2000.
  • Ben-David and Schuller (2003) Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 567–580. Springer, 2003.
  • Bissiri et al. (2016) P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
  • Boucheron et al. (2013) S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities. Oxford University Press, 2013.
  • Caruana (1997) R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • Catoni (2004) O. Catoni. Statistical Learning Theory and Stochastic Optimization. Saint-Flour Summer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathematics. Springer, 2004.
  • Catoni (2007) O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes – Monograph Series, 56. Institute of Mathematical Statistics, Beachwood, OH, 2007.
  • Chen et al. (2020) Jiaxin Chen, Xiao-Ming Wu, Yanke Li, Qimai Li, Li-Ming Zhan, and Fu-lai Chung. A closer look at the training strategy for modern meta-learning. Advances in Neural Information Processing Systems, 33:396–406, 2020.
  • Chen and Chen (2022) Lisha Chen and Tianyi Chen. Is bayesian model-agnostic meta learning better than model-agnostic meta learning, provably? In International Conference on Artificial Intelligence and Statistics, pages 1733–1774. PMLR, 2022.
  • Chen et al. (2023) Lisha Chen, Sharu Theresa Jose, Ivana Nikoloska, Sangwoo Park, Tianyi Chen, and Osvaldo Simeone. Learning with limited samples: Meta-learning and applications to communication systems. Foundations and Trends® in Signal Processing, 17(2):79–208, 2023.
  • Chen et al. (2021) Qi Chen, Changjian Shui, and Mario Marchand. Generalization bounds for meta-learning: An information-theoretic analysis. Advances in Neural Information Processing Systems, 34:25878–25890, 2021.
  • Cover and Thomas T. Cover and J. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. Wiley.
  • Dalalyan and Tsybakov (2008) A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1-2):39–61, 2008.
  • Denevi et al. (2018a) G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Learning to learn around a common mean. Advances in neural information processing systems, 31, 2018a.
  • Denevi et al. (2018b) G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018b.
  • Denevi et al. (2019a) G. Denevi, C. Ciliberto, R. Grazzi, and M. Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575. PMLR, 2019a.
  • Denevi et al. (2019b) Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, and Massimiliano Pontil. Online-within-online meta-learning. Advances in Neural Information Processing Systems, 32, 2019b.
  • Denevi et al. (2020) Giulia Denevi, Massimiliano Pontil, and Carlo Ciliberto. The advantage of conditional meta-learning for biased regularization and fine tuning. Advances in Neural Information Processing Systems, 33:964–974, 2020.
  • Ding et al. (2021) N. Ding, X. Chen, T. Levinboim, S. Goodman, and R. Soricut. Bridging the gap between practice and pac-bayes theory in few-shot meta-learning. Advances in neural information processing systems, 34:29506–29516, 2021.
  • Donsker and Varadhan (1976) M. D. Donsker and S. S. Varadhan. Asymptotic evaluation of certain markov process expectations for large time. iii. Communications on Pure and Applied Mathematics, 28:389–461, 1976.
  • Dou et al. (2019) Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. Investigating meta-learning algorithms for low-resource natural language understanding tasks. arXiv preprint arXiv:1908.10423, 2019.
  • Du et al. (2020) Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
  • Fallah et al. (2021) Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. Advances in Neural Information Processing Systems, 34:5469–5480, 2021.
  • Farid and Majumdar (2021) A. Farid and A. Majumdar. Generalization bounds for meta-learning via pac-bayes and uniform stability. Advances in neural information processing systems, 34:2173–2186, 2021.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
  • Finn et al. (2019) Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920–1930. PMLR, 2019.
  • Ghosal and Van der Vaart (2017) S. Ghosal and A. Van der Vaart. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017.
  • Gidaris and Komodakis (2018) Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4367–4375, 2018.
  • Grant et al. (2018) Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
  • Grünwald and Mehta (2020) Peter D Grünwald and Nishant A Mehta. Fast rates for general unbounded loss functions: from erm to generalized bayes. The Journal of Machine Learning Research, 21(1):2040–2119, 2020.
  • Gu et al. (2018) Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437, 2018.
  • Guan and Lu (2022a) J. Guan and Z. Lu. Fast-rate pac-bayesian generalization bounds for meta-learning. In International conference on machine learning, pages 7930–7948. PMLR, 2022a.
  • Guan et al. (2022) J. Guan, Y. Liu, and Z. Lu. Fine-grained analysis of stability and generalization for modern meta learning algorithms. Advances in neural information processing systems, 2022.
  • Guan and Lu (2022b) Jiechao Guan and Zhiwu Lu. Task relatedness-based generalization bounds for meta learning. In International Conference on Learning Representations, 2022b.
  • Guedj (2019) B. Guedj. A primer on PAC-Bayesian learning. In Proceedings of the second congress of the French Mathematical Society, 2019.
  • Haghifam et al. (2021) Mahdi Haghifam, Gintare Karolina Dziugaite, Shay Moran, and Dan Roy. Towards a unified information-theoretic framework for generalization. Advances in Neural Information Processing Systems, 34:26370–26381, 2021.
  • Hellström and Durisi (2022) Fredrik Hellström and Giuseppe Durisi. Evaluated cmi bounds for meta learning: Tightness and expressiveness. arXiv preprint arXiv:2210.06511, 2022.
  • Hochreiter et al. (2001) Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In Artificial Neural Networks—ICANN 2001: International Conference Vienna, Austria, August 21–25, 2001 Proceedings 11, pages 87–94. Springer, 2001.
  • Huang et al. (2018) Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. Natural language to structured query generation via meta-learning. arXiv preprint arXiv:1803.02400, 2018.
  • Jose and Simeone (2021) Sharu Theresa Jose and Osvaldo Simeone. Information-theoretic generalization bounds for meta-learning and applications. Entropy, 23(1):126, 2021.
  • Jose et al. (2021) Sharu Theresa Jose, Osvaldo Simeone, and Giuseppe Durisi. Transfer meta-learning: Information-theoretic bounds and information meta-risk minimization. IEEE Transactions on Information Theory, 68(1):474–501, 2021.
  • Khodak et al. (2019) Mikhail Khodak, Maria-Florina F Balcan, and Ameet S Talwalkar. Adaptive gradient-based meta-learning methods. Advances in Neural Information Processing Systems, 32, 2019.
  • Koch et al. (2015) Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2. Lille, 2015.
  • Kulis et al. (2011) Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR 2011, pages 1785–1792. IEEE, 2011.
  • Li et al. (2018) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
  • Liu et al. (2021) Tianyu Liu, Jie Lu, Zheng Yan, and Guangquan Zhang. Pac-bayes bounds for meta-learning with data-dependent prior. arXiv preprint arXiv:2102.03748, 2021.
  • Mammen and Tsybakov (1999) E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. the Annals of Statistics, 27(6):1808–1829, 1999.
  • Maurer (2009) A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
  • Maurer and Jaakkola (2005) A. Maurer and T. Jaakkola. Algorithmic stability and meta-learning. The Journal of Machine Learning Research, 6(6), 2005.
  • Maurer et al. (2016) A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(81):1–32, 2016.
  • McAllester (1998) D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234, New York, 1998. ACM.
  • Meunier and Alquier (2021) D. Meunier and P. Alquier. Meta-strategy for learning tuning parameters with guarantees. Entropy, 23(10), 2021.
  • Mishra et al. (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. ICLR, 2018.
  • Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International conference on machine learning, pages 2554–2563. PMLR, 2017.
  • Nguyen et al. (2020) Cuong Nguyen, Thanh-Toan Do, and Gustavo Carneiro. Uncertainty in model-agnostic meta-learning using variational inference. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3090–3100, 2020.
  • Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • Pentina and Lampert (2014) A. Pentina and C. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991–999. PMLR, 2014.
  • Qian and Yu (2019) Kun Qian and Zhou Yu. Domain adaptive dialog generation via meta learning. arXiv preprint arXiv:1906.03520, 2019.
  • Qiao et al. (2018) Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7229–7238, 2018.
  • Quattoni et al. (2008) Ariadna Quattoni, Michael Collins, and Trevor Darrell. Transfer learning for image classification with sparse prototype representations. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
  • Rajeswaran et al. (2019) Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019.
  • Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International conference on learning representations, 2017.
  • Rezazadeh (2022) A. Rezazadeh. A unified view on pac-bayes bounds for meta-learning. In International conference on machine learning, pages 18576–18595. PMLR, 2022.
  • Rezazadeh et al. (2021) Arezou Rezazadeh, Sharu Theresa Jose, Giuseppe Durisi, and Osvaldo Simeone. Conditional mutual information-based generalization bound for meta learning. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 1176–1181. IEEE, 2021.
  • Rothfuss et al. (2021) J. Rothfuss, V. Fortuin, M. Josifoski, and A. Krause. Pacoh: Bayes-optimal meta-learning with pac-guarantees. In International conference on machine learning, pages 9116–9126. PMLR, 2021.
  • Rothfuss et al. (2022) J. Rothfuss, M. Josifoski, V. Fortuin, and A. Krause. Pac-bayesian meta-learning: From theory to practice. arXiv preprint arXiv:2211.07206, 2022.
  • Russo and Zou (2019) D. Russo and J. Zou. How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
  • Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850. PMLR, 2016.
  • Shawe-Taylor and Williamson (1997) J. Shawe-Taylor and R. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 2–9, New York, 1997. ACM.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
  • Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208, 2018.
  • Thrun and Pratt (1998) Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. Kluwer Academic Publishers, pages 3–17, 1998.
  • Tripuraneni et al. (2021) Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434–10443. PMLR, 2021.
  • Tsybakov (2004) A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. the Annals of Statistics, 32(1):135–166, 2004.
  • Vanschoren (2019) Joaquin Vanschoren. Meta-learning. Automated machine learning: methods, systems, challenges, pages 35–61, 2019.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
  • Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • Xu and Raginsky (2017) A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 2521–2530, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
  • Yang et al. (2019) Jun Yang, Shengyang Sun, and Daniel M Roy. Fast-rate pac-bayes generalization bounds via shifted rademacher processes. Advances in Neural Information Processing Systems, 32, 2019.
  • Yoon et al. (2018) Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. Advances in neural information processing systems, 31, 2018.
  • Yu et al. (2020) Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020.
  • Zhang (2006) Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.
  • Zhang and Yang (2021) Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
  • Zhuang et al. (2020) Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.

Appendix A Some Useful Formulas

The following (known) results are used throughout the text. They are recalled here without proof.

A.1 Concentration Inequalities

For Hoeffding's and Bernstein's inequalities, see Boucheron et al. (2013). For Donsker and Varadhan's variational lemma, see for example Catoni (2007).

Lemma 12 (Hoeffding’s inequality)

Let U1,,UnU_{1},\dots,U_{n} be i.i.d random variables taking values in an interval [a,b][a,b]. Then, for any s>0s>0,

𝔼[esi=1n[Ui𝔼(Ui)]]ens2(ba)28.\mathbb{E}\left[{\rm e}^{s\sum_{i=1}^{n}[U_{i}-\mathbb{E}(U_{i})]}\right]\leq{\rm e}^{\frac{ns^{2}(b-a)^{2}}{8}}.
Lemma 13 (Bernstein’s inequality)

Let U1,,UnU_{1},\dots,U_{n} be i.i.d random variables such that for any k2k\geq 2,

𝔼[|Ui|k]k!2VCk2.\mathbb{E}\left[|U_{i}|^{k}\right]\leq\frac{k!}{2}VC^{k-2}. (15)

Then, for any s(0,1/C]s\in(0,1/C],

𝔼[esi=1n[Ui𝔼(Ui)]]ens2V2(1sC).\mathbb{E}\left[{\rm e}^{s\sum_{i=1}^{n}[U_{i}-\mathbb{E}(U_{i})]}\right]\leq{\rm e}^{\frac{ns^{2}V}{2(1-sC)}}.

Note in particular that, if |Ui|C|U_{i}|\leq C almost surely, then (15) always holds with V=𝔼(Ui2)V=\mathbb{E}(U_{i}^{2}).
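As a quick sanity check of Lemma 13 in this bounded setting, the following Python snippet (a purely illustrative sketch; the values of n, C and the law of the U_i are arbitrary choices) compares a Monte Carlo estimate of the moment generating function with the bound of the lemma.

```python
import numpy as np

# Illustrative Monte Carlo check of Lemma 13 for bounded variables, where (15)
# holds with V = E[U_i^2]. The values of n, C and the law of the U_i are
# arbitrary choices made for this sketch.
rng = np.random.default_rng(0)
n, C = 20, 1.0
U = rng.uniform(-C, C, size=(200_000, n))   # centered draws with |U_i| <= C
V = C ** 2 / 3.0                            # E[U_i^2] for Uniform[-C, C]

for s in (0.2, 0.5, 0.9):                   # stay strictly below 1/C so the bound is finite
    mgf = np.mean(np.exp(s * U.sum(axis=1)))
    bound = np.exp(n * s ** 2 * V / (2 * (1 - s * C)))
    print(s, mgf, bound, mgf <= bound)      # the last column should be True
```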

A.2 Donsker and Varadhan’s Lemma

Lemma 14 (Donsker and Varadhan’s variational inequality Donsker and Varadhan (1976))

Let μ\mu be a probability measure on Θ\Theta. For any measurable, bounded function h:Θh:\Theta\rightarrow\mathbb{R}, we have:

log𝔼θμ[eh(θ)]=supρ𝒫(Θ){𝔼θρ[h(θ)]KL(ρμ)}.\log\mathbb{E}_{\theta\sim\mu}\left[{\rm e}^{h(\theta)}\right]=\sup_{\rho\in\mathcal{P}(\Theta)}\Bigl{\{}\mathbb{E}_{\theta\sim\rho}[h(\theta)]-\mathrm{KL}(\rho\|\mu)\Bigr{\}}.

Moreover, the supremum with respect to ρ\rho in the right-hand side is reached for the Gibbs measure μh\mu_{h} defined by its density with respect to μ\mu

dμhdμ(ϑ)=eh(ϑ)𝔼θμ[eh(θ)].\frac{{\rm d}\mu_{h}}{{\rm d}\mu}(\vartheta)=\frac{{\rm e}^{h(\vartheta)}}{\mathbb{E}_{\theta\sim\mu}\left[{\rm e}^{h(\theta)}\right]}.
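On a finite parameter set, the identity of Lemma 14 can be verified numerically. The short sketch below is purely illustrative (the reference measure mu and the function h are arbitrary choices); it checks that the value of the variational problem at the Gibbs measure matches the log-moment generating function.

```python
import numpy as np

# Numerical check of Donsker and Varadhan's identity on a finite parameter set.
# The measure mu and the function h below are arbitrary illustrative choices.
rng = np.random.default_rng(1)
K = 5
mu = rng.dirichlet(np.ones(K))             # reference probability measure on Theta = {1,...,K}
h = rng.normal(size=K)                     # bounded function h on Theta

lhs = np.log(np.sum(mu * np.exp(h)))       # log E_{theta ~ mu}[ exp(h(theta)) ]

gibbs = mu * np.exp(h)
gibbs /= gibbs.sum()                       # Gibbs measure mu_h
kl = np.sum(gibbs * np.log(gibbs / mu))    # KL(mu_h || mu)
rhs = np.sum(gibbs * h) - kl               # value of the variational problem at mu_h

assert np.isclose(lhs, rhs)                # the supremum is attained at mu_h
```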

A.3 KL Divergence of some known Distributions

Denoting by H(x)H(x) the entropy of (x1,,xT)(x_{1},\dots,x_{T}), recall that the KL divergence between a multinomial distribution of parameters (x1,,xT)(x_{1},\dots,x_{T}) and a multinomial distribution of parameters (1T,,1T)\left(\frac{1}{T},\dots,\frac{1}{T}\right) is

KL(Mult(x)Mult(1T))=logTH(x).\mathrm{KL}\left(\mathrm{Mult}(x)\|\mathrm{Mult}\left(\frac{1}{T}\right)\right)=\log T-H(x). (16)

Recall that the KL divergence between 2 normal distributions is

KL(𝒩(μ,σ2)𝒩(μ¯,σ¯2))=12((μμ¯)2σ¯2+σ2σ2¯1+logσ¯2σ2).\mathrm{KL}\left(\mathcal{N}(\mu,\sigma^{2})\|\mathcal{N}(\bar{\mu},\bar{\sigma}^{2})\right)=\frac{1}{2}\left(\frac{(\mu-\bar{\mu})^{2}}{\bar{\sigma}^{2}}+\frac{\sigma^{2}}{\bar{\sigma^{2}}}-1+\log\frac{\bar{\sigma}^{2}}{\sigma^{2}}\right). (17)

Recall that the KL divergence between 2 Gamma distributions is

KL(Γ(a,b)Γ(a¯¯,b¯¯))=(aa¯¯)ψ(a)+logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bb.\mathrm{KL}\left(\Gamma(a,b)\|\Gamma\left(\bar{\bar{a}},\bar{\bar{b}}\right)\right)=\left(a-\bar{\bar{a}}\right)\psi(a)+\log\frac{\Gamma\left(\bar{\bar{a}}\right)}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}. (18)

Recall that the KL divergence between a Dirichlet distribution of parameter δ\delta and a Dirichlet distribution of parameter 1K=(1,,1)1_{K}=(1,\dots,1) is

KL(Dir(δ)Dir(1K))=logΓ(1δ)Γ(K)×k=1KΓ(δk)+k=1K(δk1)(ψ(δk)ψ(1δ)).\mathrm{KL}\left(\mathrm{Dir}(\delta)\|\mathrm{Dir}(1_{K})\right)=\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right). (19)
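These closed-form expressions are easy to check numerically. As an example, the sketch below (with arbitrary illustrative parameter values) compares the Gaussian KL formula (17) with a Monte Carlo estimate of the expected log-density ratio.

```python
import numpy as np

# Monte Carlo sanity check of the Gaussian KL formula (17); all parameter
# values below are arbitrary illustrative choices.
rng = np.random.default_rng(2)
mu, sigma2 = 0.3, 0.5
mu_bar, sigma2_bar = -0.2, 2.0

closed_form = 0.5 * ((mu - mu_bar) ** 2 / sigma2_bar
                     + sigma2 / sigma2_bar - 1.0
                     + np.log(sigma2_bar / sigma2))

x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)
log_p = -0.5 * ((x - mu) ** 2 / sigma2 + np.log(2 * np.pi * sigma2))
log_q = -0.5 * ((x - mu_bar) ** 2 / sigma2_bar + np.log(2 * np.pi * sigma2_bar))
mc_estimate = np.mean(log_p - log_q)

print(closed_form, mc_estimate)   # the two values should agree to a few decimals
```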

Appendix B Proof of Theorem 1

We mostly follow the proof technique developed in Catoni (2007). For any s>0s>0, fix θΘ\theta\in\Theta and let Ui=𝔼[(Zt,i,θ)](Zt,i,θ)𝔼[(Zt,i,θt)]+(Zt,i,θt)U_{i}=\mathbb{E}[\ell(Z_{t,i},\theta)]-\ell(Z_{t,i},\theta)-\mathbb{E}[\ell(Z_{t,i},\theta^{*}_{t})]+\ell(Z_{t,i},\theta^{*}_{t}) for any i{1,,n}i\in\{1,\dots,n\}. We distinguish two cases, depending on whether or not Bernstein's assumption is satisfied.

If Bernstein's assumption is satisfied, we apply Lemma 13 to UiU_{i}. Note that in this case, VV is the variance term Vt(θ,θt)V_{t}(\theta,\theta_{t}^{*}). Then, for any s>0s>0,

𝔼𝒮t[esn(RPt(θ)R^t(θ)RPt+R^t(θt))]ens2V(θ,θt)2(1sC).\mathbb{E}_{\mathcal{S}_{t}}\left[{\rm e}^{sn\left(R_{P_{t}}(\theta)-\hat{R}_{t}(\theta)-R_{P_{t}}^{*}+\hat{R}_{t}(\theta^{*}_{t})\right)}\right]\leq{\rm e}^{\frac{ns^{2}V(\theta,\theta^{*}_{t})}{2(1-sC)}}.

We let s=λ/ns=\lambda/n, which gives

𝔼𝒮t[eλ(RPt(θ)R^t(θ)RPt+R^t(θt))]eλ2V(θ,θt)2(nCλ).\mathbb{E}_{\mathcal{S}_{t}}\left[{\rm e}^{\lambda\left(R_{P_{t}}(\theta)-\hat{R}_{t}(\theta)-R_{P_{t}}^{*}+\hat{R}_{t}(\theta^{*}_{t})\right)}\right]\leq{\rm e}^{\frac{\lambda^{2}V(\theta,\theta^{*}_{t})}{2(n-C\lambda)}}.

Making use of Bernstein's hypothesis gives

𝔼𝒮t[eλ(RPt(θ)R^t(θ)RPt+R^t(θt))]eλ2c(RPt(θ)RPt)2(nCλ).\mathbb{E}_{\mathcal{S}_{t}}\left[{\rm e}^{\lambda\left(R_{P_{t}}(\theta)-\hat{R}_{t}(\theta)-R_{P_{t}}^{*}+\hat{R}_{t}(\theta^{*}_{t})\right)}\right]\leq{\rm e}^{\frac{\lambda^{2}c\left(R_{P_{t}}(\theta)-R_{P_{t}}^{*}\right)}{2(n-C\lambda)}}.

If Bernstein's hypothesis is not satisfied, we apply Lemma 12 to UiU_{i}, which gives, for any s>0s>0,

𝔼𝒮t[esn(RPt(θ)R^t(θ)RPt+R^t(θt))]ens2C28.\mathbb{E}_{\mathcal{S}_{t}}\left[{\rm e}^{sn\left(R_{P_{t}}(\theta)-\hat{R}_{t}(\theta)-R_{P_{t}}^{*}+\hat{R}_{t}(\theta^{*}_{t})\right)}\right]\leq{\rm e}^{\frac{ns^{2}C^{2}}{8}}.

Letting s=λ/ns=\lambda/n gives

𝔼𝒮t[eλ(RPt(θ)R^t(θ)RPt+R^t(θt))]eλ2C28n.\mathbb{E}_{\mathcal{S}_{t}}\left[{\rm e}^{\lambda\left(R_{P_{t}}(\theta)-\hat{R}_{t}(\theta)-R_{P_{t}}^{*}+\hat{R}_{t}(\theta^{*}_{t})\right)}\right]\leq{\rm e}^{\frac{\lambda^{2}C^{2}}{8n}}.

Defining the general bound

W=λ2c(RPt(θ)RPt)2(nCλ)𝕀B+λ2C28n(1𝕀B),W=\frac{\lambda^{2}c\left(R_{P_{t}}(\theta)-R_{P_{t}}^{*}\right)}{2(n-C\lambda)}\mathbb{I}_{B}+\frac{\lambda^{2}C^{2}}{8n}(1-\mathbb{I}_{B}), (20)

where 𝕀B\mathbb{I}_{B} is equal to 11 if Bernstein's assumption is satisfied and to 0 otherwise, it holds in either case that

𝔼𝒮t[eλ(RPt(θ)R^t(θ)RPt+R^t(θt))]eW.\mathbb{E}_{\mathcal{S}_{t}}\left[{\rm e}^{\lambda\left(R_{P_{t}}(\theta)-\hat{R}_{t}(\theta)-R_{P_{t}}^{*}+\hat{R}_{t}(\theta^{*}_{t})\right)}\right]\leq{\rm e}^{W}.

Rearranging the terms gives

𝔼𝒮t[eλ(RPt(θ)RPtWλR^t(θ)+R^t(θt))]1.\mathbb{E}_{\mathcal{S}_{t}}\left[{\rm e}^{\lambda\left(R_{P_{t}}(\theta)-R_{P_{t}}^{*}-\frac{W}{\lambda}-\hat{R}_{t}(\theta)+\hat{R}_{t}(\theta^{*}_{t})\right)}\right]\leq 1.

Next, integrating this bound with respect to π\pi and using Fubini’s theorem to exchange both integrals gives

𝔼𝒮t𝔼θπ[eλ(RPt(θ)RPtWλR^t(θ)+R^t(θt))]1.\mathbb{E}_{\mathcal{S}_{t}}\mathbb{E}_{\theta\sim\pi}\left[{\rm e}^{\lambda\left(R_{P_{t}}(\theta)-R_{P_{t}}^{*}-\frac{W}{\lambda}-\hat{R}_{t}(\theta)+\hat{R}_{t}(\theta^{*}_{t})\right)}\right]\leq 1.

We then apply Lemma 14 to the argument of the expectation with respect to the sample, and we have

𝔼𝒮t[esupρ𝒫(Θ){λ𝔼θρ[RPt(θ)RPtWλR^t(θ)+R^t(θt)]KL(ρπ)}]1.\mathbb{E}_{\mathcal{S}_{t}}\left[{\rm e}^{\sup_{\rho\in\mathcal{P}(\Theta)}\left\{\lambda\mathbb{E}_{\theta\sim\rho}\left[R_{P_{t}}(\theta)-R_{P_{t}}^{*}-\frac{W}{\lambda}-\hat{R}_{t}(\theta)+\hat{R}_{t}(\theta^{*}_{t})\right]-\mathrm{KL}(\rho\|\pi)\right\}}\right]\leq 1.

Jensen’s inequality implies

eλ𝔼𝒮t[supρ𝒫(Θ){𝔼θρ[RPt(θ)RPtWλR^t(θ)+R^t(θt)]KL(ρπ)λ}]1,{\rm e}^{\lambda\mathbb{E}_{\mathcal{S}_{t}}\left[\sup_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}\left[R_{P_{t}}(\theta)-R_{P_{t}}^{*}-\frac{W}{\lambda}-\hat{R}_{t}(\theta)+\hat{R}_{t}(\theta^{*}_{t})\right]-\frac{\mathrm{KL}(\rho\|\pi)}{\lambda}\right\}\right]}\leq 1,

in other words,

𝔼𝒮t[supρ𝒫(Θ){𝔼θρ[RPt(θ)RPtWλR^t(θ)+R^t(θt)]KL(ρπ)λ}]0.\mathbb{E}_{\mathcal{S}_{t}}\left[\sup_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}\left[R_{P_{t}}(\theta)-R_{P_{t}}^{*}-\frac{W}{\lambda}-\hat{R}_{t}(\theta)+\hat{R}_{t}(\theta^{*}_{t})\right]-\frac{\mathrm{KL}(\rho\|\pi)}{\lambda}\right\}\right]\leq 0.

At this stage, we can replace WW by its value given in (20) to obtain the bound:

𝔼𝒮t[supρ𝒫(Θ){𝔼θρ[(1λc𝕀B2(nCλ))(RPt(θ)RPt)λC2(1𝕀B)8nR^t(θ)+R^t(θt)]KL(ρπ)λ}]0.\mathbb{E}_{\mathcal{S}_{t}}\Bigg{[}\sup_{\rho\in\mathcal{P}(\Theta)}\bigg{\{}\mathbb{E}_{\theta\sim\rho}\bigg{[}\left(1-\frac{\lambda c\mathbb{I}_{B}}{2(n-C\lambda)}\right)\left(R_{P_{t}}(\theta)-R_{P_{t}}^{*}\right)\\ -\frac{\lambda C^{2}(1-\mathbb{I}_{B})}{8n}-\hat{R}_{t}(\theta)+\hat{R}_{t}(\theta^{*}_{t})\bigg{]}-\frac{\mathrm{KL}(\rho\|\pi)}{\lambda}\bigg{\}}\Bigg{]}\leq 0.

Next, we rearrange the terms and replace the supremum on ρ\rho by ρt(π,α)\rho_{t}(\pi,\alpha):

𝔼𝒮t𝔼θρt(π,α)[RPt(θ)]RPt11λc𝕀B2(nCλ)×(𝔼𝒮t[𝔼θρt(π,α)[R^t(θ)]R^t(θt)+KL(ρt(π,α)π)λ]+λC2(1𝕀B)8n).\mathbb{E}_{\mathcal{S}_{t}}\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha)}\left[R_{P_{t}}(\theta)\right]-R_{P_{t}}^{*}\leq\frac{1}{1-\frac{\lambda c\mathbb{I}_{B}}{2(n-C\lambda)}}\\ \times\left(\mathbb{E}_{\mathcal{S}_{t}}\left[\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha)}\left[\hat{R}_{t}(\theta)\right]-\hat{R}_{t}(\theta^{*}_{t})+\frac{\mathrm{KL}(\rho_{t}(\pi,\alpha)\|\pi)}{\lambda}\right]+\frac{\lambda C^{2}(1-\mathbb{I}_{B})}{8n}\right).

We then replace λ\lambda by αn\alpha n and by definition of Gibbs posterior ρt(π,α)\rho_{t}(\pi,\alpha), the above bound is the same as

𝔼𝒮t𝔼θρt(π,α)[RPt(θ)]RPt11αc𝕀B2(1Cα)×(𝔼𝒮t[infρ{𝔼θρ[R^t(θ)]R^t(θt)+KL(ρπ)αn}]+αC2(1𝕀B)8).\mathbb{E}_{\mathcal{S}_{t}}\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha)}\left[R_{P_{t}}(\theta)\right]-R_{P_{t}}^{*}\leq\frac{1}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\\ \times\left(\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{F}}\left\{\mathbb{E}_{\theta\sim\rho}\left[\hat{R}_{t}(\theta)\right]-\hat{R}_{t}(\theta^{*}_{t})+\frac{\mathrm{KL}(\rho\|\pi)}{\alpha n}\right\}\right]+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\right).

In particular, under Bernstein's assumption, i.e., if 𝕀B=1\mathbb{I}_{B}=1, the choice α=1C+c\alpha=\frac{1}{C+c} gives

𝔼𝒮t[𝔼θρt(π,α)[RPt(θ)]]RPt2𝔼𝒮t[infρ{𝔼θρ[R^t(θ)]R^t(θt)+KL(ρπ)αn}].\mathbb{E}_{\mathcal{S}_{t}}\left[\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha)}[R_{P_{t}}(\theta)]\right]-R_{P_{t}}^{*}\leq 2\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{F}}\left\{\mathbb{E}_{\theta\sim\rho}\left[\hat{R}_{t}(\theta)\right]-\hat{R}_{t}(\theta^{*}_{t})+\frac{\mathrm{KL}(\rho\|\pi)}{\alpha n}\right\}\right].

Without Bernstein's assumption, i.e., if 𝕀B=0\mathbb{I}_{B}=0, rewriting the bound and taking the minimum over α\alpha yields

𝔼𝒮t[𝔼θρt(π,α)[RPt(θ)]]RPtinfα𝔼𝒮t[infρ{𝔼θρ[R^t(θ)]R^t(θt)+KL(ρπ)αn+αC28}],\mathbb{E}_{\mathcal{S}_{t}}\left[\mathbb{E}_{\theta\sim\rho_{t}(\pi,\alpha)}[R_{P_{t}}(\theta)]\right]-R_{P_{t}}^{*}\leq\inf_{\alpha}\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{F}}\left\{\mathbb{E}_{\theta\sim\rho}\left[\hat{R}_{t}(\theta)\right]-\hat{R}_{t}(\theta^{*}_{t})+\frac{\mathrm{KL}(\rho\|\pi)}{\alpha n}+\frac{\alpha C^{2}}{8}\right\}\right],

and this concludes the proof. \blacksquare

Appendix C Proof of Corollary 2

First,

𝔼𝒮t[infρ𝒫(Θ)^t(ρ,π,α)R^t(θt)]\displaystyle\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{P}(\Theta)}\hat{\mathcal{R}}_{t}(\rho,\pi,\alpha)-\hat{R}_{t}(\theta^{*}_{t})\right] =𝔼𝒮t[infρ𝒫(Θ)(𝔼θρ[R^t(θ)]+KL(ρπ)αn)R^t(θt)]\displaystyle=\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{P}(\Theta)}\left(\mathbb{E}_{\theta\sim\rho}[\hat{R}_{t}(\theta)]+\frac{{\rm KL}(\rho\|\pi)}{\alpha n}\right)-\hat{R}_{t}(\theta^{*}_{t})\right]
infρ𝒫(Θ)𝔼𝒮t[𝔼θρ[R^t(θ)]+KL(ρπ)αnR^t(θt)]\displaystyle\leq\inf_{\rho\in\mathcal{P}(\Theta)}\mathbb{E}_{\mathcal{S}_{t}}\left[\mathbb{E}_{\theta\sim\rho}[\hat{R}_{t}(\theta)]+\frac{{\rm KL}(\rho\|\pi)}{\alpha n}-\hat{R}_{t}(\theta^{*}_{t})\right]
=infρ𝒫(Θ)[𝔼θρ[Rt(θ)]Rt(θt)+KL(ρπ)αn].\displaystyle=\inf_{\rho\in\mathcal{P}(\Theta)}\left[\mathbb{E}_{\theta\sim\rho}[R_{t}(\theta)]-R_{t}(\theta_{t}^{*})+\frac{{\rm KL}(\rho\|\pi)}{\alpha n}\right].

Now, define ρs\rho_{s} as the restriction of π\pi to the set {θ:Rt(θ)Rt(θt)s}\{\theta:R_{t}(\theta)-R_{t}(\theta_{t}^{*})\leq s\}. Then,

𝔼𝒮t[infρ𝒫(Θ)^t(ρ,π,α)R^t(θt)]\displaystyle\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{P}(\Theta)}\hat{\mathcal{R}}_{t}(\rho,\pi,\alpha)-\hat{R}_{t}(\theta^{*}_{t})\right] infs>0[𝔼θρs[Rt(θ)]Rt(θt)+KL(ρsπ)αn]\displaystyle\leq\inf_{s>0}\left[\mathbb{E}_{\theta\sim\rho_{s}}[R_{t}(\theta)]-R_{t}(\theta_{t}^{*})+\frac{{\rm KL}(\rho_{s}\|\pi)}{\alpha n}\right]
infs>0[s+log1π({θ:Rt(θ)Rt(θt)s})αn]\displaystyle\leq\inf_{s>0}\left[s+\frac{\log\frac{1}{\pi(\{\theta:R_{t}(\theta)-R_{t}(\theta_{t}^{*})\leq s\})}}{\alpha n}\right]
infs>0[s+dπ,tlog1s+logκπ,tαn]\displaystyle\leq\inf_{s>0}\left[s+\frac{d_{\pi,t}\log\frac{1}{s}+\log\kappa_{\pi,t}}{\alpha n}\right]

by assumption. Setting the derivative of the right-hand side with respect to ss to zero leads to s=dπ,t/(αn)s=d_{\pi,t}/(\alpha n), and we obtain the first statement:

𝔼𝒮t[infρ𝒫(Θ)^t(ρ,π,α)R^t(θt)]dπ,tlognαdπ+logκπ,tαn.\mathbb{E}_{\mathcal{S}_{t}}\left[\inf_{\rho\in\mathcal{P}(\Theta)}\hat{\mathcal{R}}_{t}(\rho,\pi,\alpha)-\hat{R}_{t}(\theta^{*}_{t})\right]\leq\frac{d_{\pi,t}\log\frac{n\alpha}{d_{\pi}}+\log\kappa_{\pi,t}}{\alpha n}.

Plugging this into Theorem 1 leads immediately to the other statements.
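As a side check, the optimization over s used above is easy to confirm numerically; the sketch below (with arbitrary illustrative values of d, kappa, alpha and n) verifies that s = d/(alpha n) minimizes the bound.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Quick numerical confirmation (illustrative values of d, kappa, alpha, n) that
# s = d/(alpha n) minimizes s + (d log(1/s) + log kappa) / (alpha n).
d, kappa, alpha, n = 3.0, 5.0, 0.2, 100

def g(s):
    return s + (d * np.log(1.0 / s) + np.log(kappa)) / (alpha * n)

s_closed_form = d / (alpha * n)
s_numeric = minimize_scalar(g, bounds=(1e-8, 10.0), method="bounded").x
print(s_closed_form, s_numeric)   # should coincide
```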

Appendix D Proof of Lemma 4

First, we note that f:x1τlog(x)f:x\mapsto-\frac{1}{\tau}\log(x) is differentiable on [exp(Cτ),1][\exp(-C\tau),1] and f(x)=1τxf^{\prime}(x)=-\frac{1}{\tau x}. As a consequence, |f(x)|=1/(τx)|f^{\prime}(x)|=1/(\tau x) is maximized at x=exp(Cτ)x=\exp(-C\tau) and its maximum is exp(Cτ)/τ\exp(C\tau)/\tau. This implies that ff is exp(Cτ)/τ\exp(C\tau)/\tau-Lipschitz, that is, for any (x,y)[exp(Cτ),1]2(x,y)\in[\exp(-C\tau),1]^{2},

|f(x)f(y)|exp(Cτ)τ|xy|.|f(x)-f(y)|\leq\frac{\exp(C\tau)}{\tau}|x-y|.

Taking the square of both sides of the inequality yields

(x,y)[exp(Cτ),1]2,(f(x)f(y))2exp(2Cτ)τ2(xy)2.\forall(x,y)\in[\exp(-C\tau),1]^{2},\ (f(x)-f(y))^{2}\leq\frac{\exp(2C\tau)}{\tau^{2}}(x-y)^{2}. (21)

Then, f′′(x)=1/(τx2)f^{\prime\prime}(x)=1/(\tau x^{2}) and thus |f′′(x)|1/τ|f^{\prime\prime}(x)|\geq 1/\tau (the minimum is reached at x=1x=1). This implies that ff is 1/τ1/\tau-strongly convex, that is, for any (x,y)[exp(Cτ),1]2(x,y)\in[\exp(-C\tau),1]^{2} we have, for any θ[0,1]\theta\in[0,1],

f(θx+(1θ)y)θf(x)+(1θ)f(y)θ(1θ)2τ(xy)2.f(\theta x+(1-\theta)y)\leq\theta f(x)+(1-\theta)f(y)-\frac{\theta(1-\theta)}{2\tau}(x-y)^{2}.

We apply this inequality to θ=1/2\theta=1/2 and rearrange terms, and we obtain

(x,y)[exp(Cτ),1]2,18τ(xy)2f(x)+f(y)2f(x+y2).\forall(x,y)\in[\exp(-C\tau),1]^{2},\ \frac{1}{8\tau}(x-y)^{2}\leq\frac{f(x)+f(y)}{2}-f\left(\frac{x+y}{2}\right). (22)

Finally, combining (21) with (22) yields

(x,y)[exp(Cτ),1]2,[f(x)f(y)]28exp(2Cτ)τ[f(x)+f(y)2f(x+y2)],\forall(x,y)\in[\exp(-C\tau),1]^{2},[f(x)-f(y)]^{2}\leq\frac{8\exp(2C\tau)}{\tau}\left[\frac{f(x)+f(y)}{2}-f\left(\frac{x+y}{2}\right)\right],

concluding the proof of the lemma. \blacksquare
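The final inequality lends itself to a simple numerical spot-check. The sketch below (arbitrary values of C and tau, uniform draws of (x, y) in the square [exp(-C tau), 1]^2) tests it on random pairs.

```python
import numpy as np

# Numerical spot-check of the final inequality of Lemma 4's proof for
# f(x) = -(1/tau) log(x) on [exp(-C tau), 1]; C and tau are arbitrary here.
rng = np.random.default_rng(3)
C, tau = 1.0, 0.7
lo = np.exp(-C * tau)

def f(x):
    return -np.log(x) / tau

x = rng.uniform(lo, 1.0, size=100_000)
y = rng.uniform(lo, 1.0, size=100_000)

lhs = (f(x) - f(y)) ** 2
rhs = (8 * np.exp(2 * C * tau) / tau) * ((f(x) + f(y)) / 2 - f((x + y) / 2))

assert np.all(lhs <= rhs + 1e-12)   # the inequality obtained from (21) and (22) holds
```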

Appendix E Proof of Theorem 5

The proof of Theorem 5 is structured as follows: in Lemma 15, we first bound the excess risk by the expectation of the infimum of the empirical risk ^(ρ,π,α)R^(θt)\hat{\mathcal{R}}(\rho,\pi,\alpha)-\hat{R}(\theta_{t}^{*}). Using classical techniques, we then turn this bound on the empirical risk into a bound on the prediction risk.

E.1 Lemma

Lemma 15

Assume that the loss \ell satisfies the boundedness assumption (3). Then, the following bound holds with the choice β=1C+c\beta=\frac{1}{C+c}:

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]21αc𝕀B2(1Cα)𝔼P1,,PT𝔼S1,,ST[infΠ𝒢{1Tt=1T(𝔼πΠ[^t(ρt(π,α),π,α)]R^t(θt))+KL(ΠΛ)βT}+αC2(1𝕀B)8],\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\mathcal{E}(\pi)\right]-\mathcal{E}^{*}\leq\frac{2}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\Bigg{[}\\ \inf_{\Pi\in\mathcal{G}}\left\{\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}_{\pi\sim\Pi}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)\right]-\hat{R}_{t}(\theta_{t}^{*})\right)+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}\right\}+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\Bigg{]},

Proof  For any t[T]t\in[T], let

Ut:=^t(ρt(πα,α),πα,α)^t(ρt(π,α),π,α).U_{t}:=\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha).

Note that

𝔼[Ut]=𝔼Pt[t(πα)]𝔼Pt[t(π)],\mathbb{E}[U_{t}]=\mathbb{E}_{P_{t}}\left[\mathcal{R}_{t}(\pi^{*}_{\alpha})\right]-\mathbb{E}_{P_{t}}\left[\mathcal{R}_{t}(\pi)\right],

where 𝔼[Ut]\mathbb{E}[U_{t}] is a shortcut notation for 𝔼Pt𝒫𝔼StPt[Ut]\mathbb{E}_{P_{t}\sim\mathcal{P}}\mathbb{E}_{S_{t}\sim P_{t}}[U_{t}]. Moreover, by the boundedness assumption on \ell, it holds almost surely that |Ut|C|U_{t}|\leq C. Applying Lemma 13 to UtU_{t} gives, for any β>0\beta>0,

𝔼P1,,PT𝔼S1,,ST[eβt=1T(Ut𝔼St[Ut])]eβ2TV~(π)2(1βC),\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[e^{\beta\sum_{t=1}^{T}\left(U_{t}-\mathbb{E}_{S_{t}}[U_{t}]\right)}\right]\leq e^{\frac{\beta^{2}T\tilde{V}(\pi)}{2(1-\beta C)}},

where V~(π)=𝔼PT+1𝔼ST+1[(^T+1(ρT+1(π,α),π,α)^T+1(ρT+1(πα,α),πα,α))2]\tilde{V}(\pi)=\mathbb{E}_{P_{T+1}}\mathbb{E}_{S_{T+1}}\left[\left(\hat{\mathcal{R}}_{T+1}(\rho_{T+1}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{T+1}(\rho_{T+1}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right)^{2}\right]. This factor can be bounded as V~(π)c𝔼[UT+1]\tilde{V}(\pi)\leq c\mathbb{E}[-U_{T+1}] by Theorem 3, which states that Bernstein hypothesis is satisfied at the meta level, so that the bound becomes

𝔼P1,,PT𝔼S1,,ST[eβt=1T(Ut𝔼[Ut])+cβ2T𝔼[UT+1]2(1βC)]1.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[e^{\beta\sum_{t=1}^{T}\left(U_{t}-\mathbb{E}[U_{t}]\right)+\frac{c\beta^{2}T\mathbb{E}[U_{T+1}]}{2(1-\beta C)}}\right]\leq 1.

Integrating with respect to the prior πΛ\pi\sim\Lambda and using Fubini’s theorem yields

𝔼P1,,PT𝔼S1,,ST𝔼πΛ[eβt=1T(Ut𝔼[Ut])+cβ2T𝔼[UT+1]2(1βC)]1.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\Lambda}\left[e^{\beta\sum_{t=1}^{T}\left(U_{t}-\mathbb{E}[U_{t}]\right)+\frac{c\beta^{2}T\mathbb{E}[U_{T+1}]}{2(1-\beta C)}}\right]\leq 1.

Next, by an application of Lemma 14, the left-hand side becomes

𝔼P1,,PT𝔼S1,,ST[esupΠ𝒫(𝒫(θ)){𝔼πΠ[βt=1T(Ut𝔼[Ut])+cβ2T𝔼[UT+1]2(1βC)]KL(ΠΛ)}]1.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[e^{\sup_{\Pi\in\mathcal{P}(\mathcal{P}(\theta))}\left\{\mathbb{E}_{\pi\sim\Pi}\left[\beta\sum_{t=1}^{T}\left(U_{t}-\mathbb{E}[U_{t}]\right)+\frac{c\beta^{2}T\mathbb{E}[U_{T+1}]}{2(1-\beta C)}\right]-\mathrm{KL}(\Pi\|\Lambda)\right\}}\right]\leq 1.

We then make use of Jensen’s inequality and arrange terms, so that the bound becomes

𝔼P1,,PT𝔼S1,,ST[supΠ𝒫(𝒫(θ)){𝔼πΠ[1Tt=1T(Ut𝔼[Ut])+cβ𝔼[UT+1]2(1βC)]KL(ΠΛ)βT}]0.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[\sup_{\Pi\in\mathcal{P}(\mathcal{P}(\theta))}\left\{\mathbb{E}_{\pi\sim\Pi}\left[\frac{1}{T}\sum_{t=1}^{T}\left(U_{t}-\mathbb{E}[U_{t}]\right)+\frac{c\beta\mathbb{E}[U_{T+1}]}{2(1-\beta C)}\right]-\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}\right\}\right]\leq 0.

We replace the supremum on Π\Pi by an evaluation of the term in Π^\hat{\Pi} and arrange terms, so that the bound becomes

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[1Tt=1T𝔼[Ut]+cβ𝔼[UT+1]2(1βC)]𝔼P1,,PT𝔼S1,,ST[1Tt=1T𝔼πΠ^[Ut]+KL(Π^Λ)βT],\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[-\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[U_{t}]+\frac{c\beta\mathbb{E}[U_{T+1}]}{2(1-\beta C)}\right]\\ \leq\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[-\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi\sim\hat{\Pi}}[U_{t}]+\frac{\mathrm{KL}(\hat{\Pi}\|\Lambda)}{\beta T}\right],

which is identical to

(1cβ2(1Cβ))𝔼P1,,PT𝔼S1,,ST𝔼πΠ^𝔼PT+1𝔼ST+1[UT+1]𝔼P1,,PT𝔼S1,,ST[1Tt=1T𝔼πΠ^[Ut]+KL(Π^Λ)βT].\left(1-\frac{c\beta}{2(1-C\beta)}\right)\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\mathbb{E}_{P_{T+1}}\mathbb{E}_{S_{T+1}}\left[U_{T+1}\right]\leq\\ \mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[-\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi\sim\hat{\Pi}}[U_{t}]+\frac{\mathrm{KL}(\hat{\Pi}\|\Lambda)}{\beta T}\right].

Then, we replace 𝔼[UT+1]\mathbb{E}[U_{T+1}] by its value, yielding the bound

(1cβ2(1Cβ))𝔼P1,,PT𝔼S1,,ST𝔼πΠ^𝔼PT+1𝔼ST+1[^T+1(ρT+1(π,α),π,α)^T+1(ρT+1(πα,α),πα,α)]𝔼P1,,PT𝔼S1,,ST[1Tt=1T𝔼πΠ^[^t(ρt(π,α),π,α)^t(ρt(πα,α),πα,α)]+KL(Π^Λ)βT].\left(1-\frac{c\beta}{2(1-C\beta)}\right)\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\mathbb{E}_{P_{T+1}}\mathbb{E}_{S_{T+1}}\Bigg{[}\\ \hat{\mathcal{R}}_{T+1}(\rho_{T+1}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{T+1}(\rho_{T+1}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\Bigg{]}\leq\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\Bigg{[}\\ -\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)-\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right]+\frac{\mathrm{KL}(\hat{\Pi}\|\Lambda)}{\beta T}\Bigg{]}.

Since the term ^t(ρt(πα,α),πα,α)\hat{\mathcal{R}}_{t}(\rho_{t}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha) does not depend on πΠ^\pi\sim\hat{\Pi}, we can simplify it on both sides of the inequality, which gives

(1cβ2(1Cβ))𝔼P1,,PT𝔼S1,,ST𝔼πΠ^𝔼PT+1𝔼ST+1[^T+1(ρT+1(π,α),π,α)]+cβ2(1Cβ)𝔼PT+1𝔼ST+1[^T+1(ρT+1(πα,α),πα,α)]𝔼P1,,PT𝔼S1,,ST[1Tt=1T𝔼πΠ^[^t(ρt(π,α),π,α)]+KL(Π^Λ)βT].\left(1-\frac{c\beta}{2(1-C\beta)}\right)\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\mathbb{E}_{P_{T+1}}\mathbb{E}_{S_{T+1}}\left[\hat{\mathcal{R}}_{T+1}(\rho_{T+1}(\pi,\alpha),\pi,\alpha)\right]\\ +\frac{c\beta}{2(1-C\beta)}\mathbb{E}_{P_{T+1}}\mathbb{E}_{S_{T+1}}\left[\hat{\mathcal{R}}_{T+1}(\rho_{T+1}(\pi^{*}_{\alpha},\alpha),\pi^{*}_{\alpha},\alpha)\right]\\ \leq\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[-\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)\right]+\frac{\mathrm{KL}(\hat{\Pi}\|\Lambda)}{\beta T}\right]. (23)

Theorem 1 provides the following lower bound for any π\pi^{\prime}:

𝔼ST+1[^T+1(ρT+1(π,α),π,α)](1αc𝕀B2(1Cα))𝔼ST+1𝔼θρT+1(π,α)[RPT+1(θ)]+αc𝕀B2(1Cα)RPT+1αC2(1𝕀B)8.\mathbb{E}_{S_{T+1}}\left[\hat{\mathcal{R}}_{T+1}(\rho_{T+1}(\pi^{\prime},\alpha),\pi^{\prime},\alpha)\right]\geq\left(1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}\right)\mathbb{E}_{S_{T+1}}\mathbb{E}_{\theta\sim\rho_{T+1}(\pi^{\prime},\alpha)}\left[R_{P_{T+1}}(\theta)\right]\\ +\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}R_{P_{T+1}}^{*}-\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}. (24)

This further implies that

𝔼ST+1[^T+1(ρT+1(π,α),π,α)](1αc𝕀B2(1Cα))RPT+1+αc𝕀B2(1Cα)RPT+1αC2(1𝕀B)8\mathbb{E}_{S_{T+1}}\left[\hat{\mathcal{R}}_{T+1}(\rho_{T+1}(\pi^{\prime},\alpha),\pi^{\prime},\alpha)\right]\geq\left(1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}\right)R_{P_{T+1}}^{*}\\ +\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}R_{P_{T+1}}^{*}-\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8} (25)

for any π\pi^{\prime}. In particular, applying (24) to π=π\pi^{\prime}=\pi and (25) to π=πα\pi^{\prime}=\pi^{*}_{\alpha}, and plugging the results into the left-hand side of (23) gives

(1αc𝕀B2(1Cα))(cβ2(1Cβ)𝔼PT+1[RPT+1]+(1cβ2(1Cβ))𝔼P1,,PT𝔼S1,,ST𝔼πΠ^𝔼PT+1𝔼ST+1𝔼θρT+1(π,α)[RPT+1(θ)])+αc𝕀B2(1Cα)𝔼PT+1[RPT+1]αC2(1𝕀B)8𝔼P1,,PT𝔼S1,,ST[1Tt=1T𝔼πΠ^[^t(ρt(π,α),π,α)]+KL(Π^Λ)βT].\left(1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}\right)\Bigg{(}\frac{c\beta}{2(1-C\beta)}\mathbb{E}_{P_{T+1}}\left[R_{P_{T+1}}^{*}\right]\\ +\left(1-\frac{c\beta}{2(1-C\beta)}\right)\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\mathbb{E}_{P_{T+1}}\mathbb{E}_{S_{T+1}}\mathbb{E}_{\theta\sim\rho_{T+1}(\pi^{\prime},\alpha)}\left[R_{P_{T+1}}(\theta)\right]\Bigg{)}\\ +\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}\mathbb{E}_{P_{T+1}}\left[R_{P_{T+1}}^{*}\right]-\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\\ \leq\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[-\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)\right]+\frac{\mathrm{KL}(\hat{\Pi}\|\Lambda)}{\beta T}\right].

We remove =𝔼PT+1[RPT+1]\mathcal{E}^{*}=\mathbb{E}_{P_{T+1}}\left[R_{P_{T+1}}^{*}\right] from both sides of the inequality and arrange terms, so that the bound becomes

(1αc𝕀B2(1Cα))(1cβ2(1Cβ))(𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)])αC2(1𝕀B)8𝔼P1,,PT𝔼S1,,ST[1Tt=1T𝔼πΠ^[^t(ρt(π,α),π,α)]+KL(Π^Λ)βT].\left(1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}\right)\left(1-\frac{c\beta}{2(1-C\beta)}\right)\Bigg{(}\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\mathcal{E}(\pi)\right]-\mathcal{E}^{*}\Bigg{)}-\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\\ \leq\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[-\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)\right]-\mathcal{E}^{*}+\frac{\mathrm{KL}(\hat{\Pi}\|\Lambda)}{\beta T}\right].

By definition, Π^\hat{\Pi} is the minimizer of the integrand of the right-hand side, and therefore,

(1αc𝕀B2(1Cα))(1cβ2(1Cβ))(𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)])𝔼P1,,PT𝔼S1,,ST[infΠ𝒢{1Tt=1T(𝔼πΠ[^t(ρt(π,α),π,α)]R^t(θt))+KL(ΠΛ)βT}]+αC2(1𝕀B)8,\left(1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}\right)\left(1-\frac{c\beta}{2(1-C\beta)}\right)\Bigg{(}\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\mathcal{E}(\pi)\right]-\mathcal{E}^{*}\Bigg{)}\leq\\ \mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\left[\inf_{\Pi\in\mathcal{G}}\left\{\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}_{\pi\sim\Pi}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)\right]-\hat{R}_{t}(\theta_{t}^{*})\right)+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}\right\}\right]\\ +\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8},

and the choice β=1c+C\beta=\frac{1}{c+C} yields

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]21αc𝕀B2(1Cα)𝔼P1,,PT𝔼S1,,ST[infΠ𝒢{1Tt=1T(𝔼πΠ[^t(ρt(π,α),π,α)]R^t(θt))+KL(ΠΛ)βT}+αC2(1𝕀B)8],\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\mathcal{E}(\pi)\right]-\mathcal{E}^{*}\leq\frac{2}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\Bigg{[}\\ \inf_{\Pi\in\mathcal{G}}\left\{\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}_{\pi\sim\Pi}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)\right]-\hat{R}_{t}(\theta_{t}^{*})\right)+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}\right\}+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\Bigg{]},

which concludes the proof of the lemma.  

E.2 Proof of Theorem 5

From Lemma 15,

𝔼P1,,PT\displaystyle\mathbb{E}_{P_{1},\dots,P_{T}} 𝔼S1,,ST𝔼πΠ^[(π)]\displaystyle\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}\left[\mathcal{E}(\pi)\right]-\mathcal{E}^{*}
21αc𝕀B2(1Cα)𝔼P1,,PT𝔼S1,,ST[infΠ𝒢{1Tt=1T(𝔼πΠ[^t(ρt(π,α),π,α)]R^t(θt))\displaystyle\leq\frac{2}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\Bigg{[}\inf_{\Pi\in\mathcal{G}}\Biggl{\{}\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}_{\pi\sim\Pi}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)\right]-\hat{R}_{t}(\theta_{t}^{*})\right)
+KL(ΠΛ)βT}+αC2(1𝕀B)8]\displaystyle\quad\quad\quad+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}\Biggr{\}}+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\Bigg{]}
21αc𝕀B2(1Cα)infΠ𝒢𝔼P1,,PT𝔼S1,,ST{1Tt=1T(𝔼πΠ[^t(ρt(π,α),π,α)]R^t(θt))\displaystyle\leq\frac{2}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\inf_{\Pi\in\mathcal{G}}\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\Biggl{\{}\frac{1}{T}\sum_{t=1}^{T}\left(\mathbb{E}_{\pi\sim\Pi}\left[\hat{\mathcal{R}}_{t}(\rho_{t}(\pi,\alpha),\pi,\alpha)\right]-\hat{R}_{t}(\theta_{t}^{*})\right)
+KL(ΠΛ)βT+αC2(1𝕀B)8}\displaystyle\quad\quad\quad+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\Biggr{\}}
=21αc𝕀B2(1Cα)infΠ𝒢𝔼PT+1𝔼ST+1{(𝔼πΠ[^T+1(ρT+1(π,α),π,α)]R^T+1(θT+1))\displaystyle=\frac{2}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\inf_{\Pi\in\mathcal{G}}\mathbb{E}_{P_{T+1}}\mathbb{E}_{S_{T+1}}\Biggl{\{}\left(\mathbb{E}_{\pi\sim\Pi}\left[\hat{\mathcal{R}}_{T+1}(\rho_{T+1}(\pi,\alpha),\pi,\alpha)\right]-\hat{R}_{T+1}(\theta_{T+1}^{*})\right)
+KL(ΠΛ)βT+αC2(1𝕀B)8}\displaystyle\quad\quad\quad+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\Biggr{\}}
21αc𝕀B2(1Cα)infΠ𝒢𝔼PT+1𝔼ST+1{𝔼πΠinfρ𝒫(Θ)[𝔼θρ[R^T+1(θ)]+KL(ρπ)αn\displaystyle\leq\frac{2}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\inf_{\Pi\in\mathcal{G}}\mathbb{E}_{P_{T+1}}\mathbb{E}_{S_{T+1}}\Biggl{\{}\mathbb{E}_{\pi\sim\Pi}\inf_{\rho\in\mathcal{P}(\Theta)}\Biggl{[}\mathbb{E}_{\theta\sim\rho}[\hat{R}_{T+1}(\theta)]+\frac{{\rm KL}(\rho\|\pi)}{\alpha n}
R^T+1(θT+1)]+KL(ΠΛ)βT+αC2(1𝕀B)8}\displaystyle\quad\quad\quad-\hat{R}_{T+1}(\theta_{T+1}^{*})\Biggr{]}+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\Biggr{\}}
21αc𝕀B2(1Cα)infΠ𝒢𝔼PT+1{𝔼πΠinfρ𝒫(Θ)𝔼ST+1[𝔼θρ[R^T+1(θ)]+KL(ρπ)αn\displaystyle\leq\frac{2}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\inf_{\Pi\in\mathcal{G}}\mathbb{E}_{P_{T+1}}\Biggl{\{}\mathbb{E}_{\pi\sim\Pi}\inf_{\rho\in\mathcal{P}(\Theta)}\mathbb{E}_{S_{T+1}}\Biggl{[}\mathbb{E}_{\theta\sim\rho}[\hat{R}_{T+1}(\theta)]+\frac{{\rm KL}(\rho\|\pi)}{\alpha n}
R^T+1(θT+1)]+KL(ΠΛ)βT+αC2(1𝕀B)8}\displaystyle\quad\quad\quad-\hat{R}_{T+1}(\theta_{T+1}^{*})\Biggr{]}+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\Biggr{\}}
≤21αc𝕀B2(1Cα)infΠ𝒢𝔼PT+1{𝔼πΠinfρ𝒫(Θ)[𝔼θρ[RT+1(θ)RT+1(θT+1)]+KL(ρπ)αn]\displaystyle\leq\frac{2}{1-\frac{\alpha c\mathbb{I}_{B}}{2(1-C\alpha)}}\inf_{\Pi\in\mathcal{G}}\mathbb{E}_{P_{T+1}}\Biggl{\{}\mathbb{E}_{\pi\sim\Pi}\inf_{\rho\in\mathcal{P}(\Theta)}\Biggl{[}\mathbb{E}_{\theta\sim\rho}[R_{T+1}(\theta)-R_{T+1}(\theta_{T+1}^{*})]+\frac{{\rm KL}(\rho\|\pi)}{\alpha n}\Biggr{]}
+KL(ΠΛ)βT+αC2(1𝕀B)8}.\displaystyle\quad\quad\quad+\frac{\mathrm{KL}(\Pi\|\Lambda)}{\beta T}+\frac{\alpha C^{2}(1-\mathbb{I}_{B})}{8}\Biggr{\}}.

This ends the proof. \blacksquare

Appendix F Application of Theorem 5 to the Gaussian Case

Assume that \ell is LL-Lipschitz. As a result, the risks RPtR_{P_{t}} are also LL-Lipschitz. We choose the prior pμ¯,(σ¯2,,σ¯2)p_{\bar{\mu},(\bar{\sigma}^{2},\dots,\bar{\sigma}^{2})}\in\mathcal{F} (and hence, the variances are the same for all coordinates). A straightforward application of (17) gives

KL(pμ,σ2,pμ¯,(σ¯2,,σ¯2))=12i=1d[(μiμ¯i)2σ¯2+σi2σ¯21+log(σ¯2σi2)].\mathrm{KL}(p_{\mu,\sigma^{2}},p_{\bar{\mu},(\bar{\sigma}^{2},\dots,\bar{\sigma}^{2})})=\frac{1}{2}\sum_{i=1}^{d}\left[\frac{(\mu_{i}-\bar{\mu}_{i})^{2}}{\bar{\sigma}^{2}}+\frac{\sigma_{i}^{2}}{\bar{\sigma}^{2}}-1+\log\left(\frac{\bar{\sigma}^{2}}{\sigma_{i}^{2}}\right)\right].

In this case, ρt(π,α)=pμ(t),σ2(t)\rho_{t}(\pi,\alpha)=p_{\mu(t),\sigma^{2}(t)}, where

(μ(t),σ2(t))=argminμ,σ2{𝔼θ𝒩(μ,σ2)[R^t(θ)]+12αni=1d[(μiμ¯i)2σ¯2+σi2σ¯21+log(σ¯2σi2)]}.(\mu(t),\sigma^{2}(t))=\operatorname{argmin}_{\mu,\sigma^{2}}\left\{\mathbb{E}_{\theta\sim\mathcal{N}(\mu,\sigma^{2})}\left[\hat{R}_{t}(\theta)\right]+\frac{1}{2\alpha n}\sum_{i=1}^{d}\left[\frac{(\mu_{i}-\bar{\mu}_{i})^{2}}{\bar{\sigma}^{2}}+\frac{\sigma_{i}^{2}}{\bar{\sigma}^{2}}-1+\log\left(\frac{\bar{\sigma}^{2}}{\sigma_{i}^{2}}\right)\right]\right\}.
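In practice, this per-task minimization is typically carried out by stochastic gradient descent with the reparameterization trick. The sketch below is a minimal illustration of this idea, not the procedure analyzed in the paper: the function toy_empirical_risk, the target theta_star, the learning rate and all numerical values are hypothetical placeholders.

```python
import numpy as np

# Minimal sketch (not the paper's implementation) of the per-task variational
# problem above, solved by reparameterized stochastic gradient descent for a
# toy quadratic empirical risk. toy_empirical_risk, theta_star, the learning
# rate and all numerical values are hypothetical placeholders.
rng = np.random.default_rng(4)
d, n, alpha = 2, 50, 0.1
mu_bar, sigma2_bar = np.zeros(d), 1.0        # prior parameters (shared variance)
theta_star = np.array([1.0, -0.5])

def toy_empirical_risk(theta):
    # stand-in for the empirical risk of task t
    return np.mean((theta - theta_star) ** 2)

mu, log_sigma2 = np.zeros(d), np.zeros(d)    # variational parameters
lr = 0.05
for _ in range(2000):
    sigma = np.exp(0.5 * log_sigma2)
    eps = rng.normal(size=d)
    theta = mu + sigma * eps                 # reparameterized sample
    grad_theta = 2 * (theta - theta_star) / d      # gradient of the toy risk at theta
    grad_mu = grad_theta
    grad_log_sigma2 = 0.5 * grad_theta * sigma * eps
    # gradient of KL(N(mu, sigma^2) || N(mu_bar, sigma2_bar I)) / (alpha n)
    grad_mu = grad_mu + (mu - mu_bar) / (sigma2_bar * alpha * n)
    grad_log_sigma2 = grad_log_sigma2 + 0.5 * (np.exp(log_sigma2) / sigma2_bar - 1.0) / (alpha * n)
    mu -= lr * grad_mu
    log_sigma2 -= lr * grad_log_sigma2

# mu ends up close to theta_star, shrunk toward the prior mean mu_bar
print(mu, np.exp(log_sigma2), toy_empirical_risk(mu))
```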

We now define 𝒢\mathcal{G} as the family of distributions qτ,ξ2,a,bq_{\tau,\xi^{2},a,b} on (μ¯,σ¯2)(\bar{\mu},\bar{\sigma}^{2}), where

qτ,ξ2,a,b(μ¯,σ¯2)=[i=1d𝒩(μ¯i;τi,ξ2)]Γ(σ¯2;a,b).q_{\tau,\xi^{2},a,b}(\bar{\mu},\bar{\sigma}^{2})=\left[\bigotimes_{i=1}^{d}\mathcal{N}(\bar{\mu}_{i};\tau_{i},\xi^{2})\right]\otimes\Gamma(\bar{\sigma}^{2};a,b).

Fix a prior on priors Λ=q0,ξ¯¯2,a¯¯,b¯¯\Lambda=q_{0,\bar{\bar{\xi}}^{2},\bar{\bar{a}},\bar{\bar{b}}}. We choose Π^=qτ^,ξ^2,a^,b^\hat{\Pi}=q_{\hat{\tau},\hat{\xi}^{2},\hat{a},\hat{b}}, where

(τ^,ξ^2,a^,b^)=argminτ,ξ2,a,b{𝔼(μ¯,σ¯2)qτ,ξ2,a,b[1Tt=1Tminμ(t),σ2(t){𝔼θ𝒩(μ(t),σ2(t))[R^t(θ)]+12αni=1d[(μi(t)μ¯i)2σ¯2+σi2(t)σ¯21+log(σ¯2σi2(t))]}]+12βTi=1d[τi2ξ¯¯2+ξ2ξ¯¯21+log(ξ¯¯2ξ2)]+(aa¯¯)ψ(a)+logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bb2βT},(\hat{\tau},\hat{\xi}^{2},\hat{a},\hat{b})=\operatorname{argmin}_{\tau,\xi^{2},a,b}\Biggl{\{}\mathbb{E}_{(\bar{\mu},\bar{\sigma}^{2})\sim q_{\tau,\xi^{2},a,b}}\Biggl{[}\frac{1}{T}\sum_{t=1}^{T}\min_{\mu(t),\sigma^{2}(t)}\Biggl{\{}\mathbb{E}_{\theta\sim\mathcal{N}(\mu(t),\sigma^{2}(t))}\left[\hat{R}_{t}(\theta)\right]\\ +\frac{1}{2\alpha n}\sum_{i=1}^{d}\left[\frac{(\mu_{i}(t)-\bar{\mu}_{i})^{2}}{\bar{\sigma}^{2}}+\frac{\sigma_{i}^{2}(t)}{\bar{\sigma}^{2}}-1+\log\left(\frac{\bar{\sigma}^{2}}{\sigma_{i}^{2}(t)}\right)\right]\Biggr{\}}\Biggr{]}\\ +\frac{1}{2\beta T}\sum_{i=1}^{d}\left[\frac{\tau_{i}^{2}}{\bar{\bar{\xi}}^{2}}+\frac{\xi^{2}}{\bar{\bar{\xi}}^{2}}-1+\log\left(\frac{\bar{\bar{\xi}}^{2}}{\xi^{2}}\right)\right]\\ +\frac{(a-\bar{\bar{a}})\psi(a)+\log\frac{\Gamma(\bar{\bar{a}})}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}}{2\beta T}\Biggr{\}},

where Γ()\Gamma(\cdot) is the gamma function and ψ()=Γ()/Γ()\psi(\cdot)=\Gamma^{\prime}(\cdot)/\Gamma(\cdot) is the digamma function. We apply Theorem 5:

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]4infτ,ξ2,a,b{𝔼(μ¯,σ¯2)qτ,ξ2,a,b𝔼PT+1[minμ(T+1),σ2(T+1){𝔼θ𝒩(μ(T+1),σ2(T+1))[RPT+1(θ)]RPT+1(θt)+12αni=1d[(μi(T+1)μ¯i)2σ¯2+σi2(T+1)σ¯21+log(σ¯2σi2(T+1))]}]+12βTi=1d[τi2ξ¯¯2+ξ2ξ¯¯21+log(ξ¯¯2ξ2)]+(aa¯¯)ψ(a)+logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bb2βT}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{\tau,\xi^{2},a,b}\Biggl{\{}\\ \mathbb{E}_{(\bar{\mu},\bar{\sigma}^{2})\sim q_{\tau,\xi^{2},a,b}}\mathbb{E}_{P_{T+1}}\Biggl{[}\min_{\mu(T+1),\sigma^{2}(T+1)}\Biggl{\{}\mathbb{E}_{\theta\sim\mathcal{N}(\mu(T+1),\sigma^{2}(T+1))}[R_{P_{T+1}}(\theta)]-R_{P_{T+1}}(\theta^{*}_{t})\\ +\frac{1}{2\alpha n}\sum_{i=1}^{d}\left[\frac{(\mu_{i}(T+1)-\bar{\mu}_{i})^{2}}{\bar{\sigma}^{2}}+\frac{\sigma_{i}^{2}(T+1)}{\bar{\sigma}^{2}}-1+\log\left(\frac{\bar{\sigma}^{2}}{\sigma_{i}^{2}(T+1)}\right)\right]\Biggr{\}}\Biggr{]}\\ +\frac{1}{2\beta T}\sum_{i=1}^{d}\left[\frac{\tau_{i}^{2}}{\bar{\bar{\xi}}^{2}}+\frac{\xi^{2}}{\bar{\bar{\xi}}^{2}}-1+\log\left(\frac{\bar{\bar{\xi}}^{2}}{\xi^{2}}\right)\right]+\frac{(a-\bar{\bar{a}})\psi(a)+\log\frac{\Gamma(\bar{\bar{a}})}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}}{2\beta T}\Biggr{\}}.

We next apply Assumption 4, and the bound becomes

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]4infτ,ξ2,a,b{𝔼(μ¯,σ¯2)qτ,ξ2,a,b𝔼PT+1[minμ,σ2{Lσ2+12αni=1d[(μiμ¯i)2σ¯2+σi2σ¯21+log(σ¯2σi2(T+1))]}]+12βTi=1d[τi2ξ¯¯2+ξ2ξ¯¯21+log(ξ¯¯2ξ2)]+(aa¯¯)ψ(a)+logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bb2βT}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{\tau,\xi^{2},a,b}\Biggl{\{}\mathbb{E}_{(\bar{\mu},\bar{\sigma}^{2})\sim q_{\tau,\xi^{2},a,b}}\mathbb{E}_{P_{T+1}}\Biggl{[}\min_{\mu,\sigma^{2}}\Biggl{\{}L\|\sigma\|^{2}\\ +\frac{1}{2\alpha n}\sum_{i=1}^{d}\left[\frac{(\mu_{i}-\bar{\mu}_{i})^{2}}{\bar{\sigma}^{2}}+\frac{\sigma_{i}^{2}}{\bar{\sigma}^{2}}-1+\log\left(\frac{\bar{\sigma}^{2}}{\sigma_{i}^{2}(T+1)}\right)\right]\Biggr{\}}\Biggr{]}\\ +\frac{1}{2\beta T}\sum_{i=1}^{d}\left[\frac{\tau_{i}^{2}}{\bar{\bar{\xi}}^{2}}+\frac{\xi^{2}}{\bar{\bar{\xi}}^{2}}-1+\log\left(\frac{\bar{\bar{\xi}}^{2}}{\xi^{2}}\right)\right]\\ +\frac{(a-\bar{\bar{a}})\psi(a)+\log\frac{\Gamma(\bar{\bar{a}})}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}}{2\beta T}\Biggr{\}}.

The idea is the following: for any PP, define μP\mu_{P} as a minimizer of RPR_{P}, that is, RP(μP)=RPR_{P}(\mu_{P})=R_{P}^{*}, and define μ=𝔼P𝒫(μP)\mu^{*}=\mathbb{E}_{P\sim\mathcal{P}}(\mu_{P}). If μP=μ\mu_{P}=\mu^{*} 𝒫\mathcal{P}-a.s., all the tasks have the same solution. On the other hand, if μP\mu_{P} varies a lot across tasks, then the tasks have very unrelated solutions. In the infimum above, take μ=μPT+1\mu=\mu_{P_{T+1}} and τ=μ\tau=\mu^{*}. Then,

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]+infξ2,a,b{𝔼σ¯2Γ1(a,b)[minσ2{Lσ2+𝔼PT+1𝒫(μPT+1μ2)+ξ2d2σ¯2αn+12αni=1d[σi2σ¯21+log(σ¯2σi2)]}]+βC28T+μ22βξ¯¯2+d2β[ξ2ξ¯¯21+log(ξ¯¯2ξ2)]+(aa¯¯)ψ(a)+logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bb2βT}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+\inf_{\xi^{2},a,b}\Biggl{\{}\mathbb{E}_{\bar{\sigma}^{2}\sim\Gamma^{-1}(a,b)}\Bigg{[}\min_{\sigma^{2}}\Biggl{\{}L\|\sigma\|^{2}\\ +\frac{\mathbb{E}_{P_{T+1}\sim\mathcal{P}}\left(\|\mu_{P_{T+1}}-\mu^{*}\|^{2}\right)+\xi^{2}d}{2\bar{\sigma}^{2}\alpha n}+\frac{1}{2\alpha n}\sum_{i=1}^{d}\left[\frac{\sigma_{i}^{2}}{\bar{\sigma}^{2}}-1+\log\left(\frac{\bar{\sigma}^{2}}{\sigma_{i}^{2}}\right)\right]\Biggr{\}}\Bigg{]}\\ +\frac{\beta C^{2}}{8T}+\frac{\|\mu^{*}\|^{2}}{2\beta\bar{\bar{\xi}}^{2}}+\frac{d}{2\beta}\left[\frac{\xi^{2}}{\bar{\bar{\xi}}^{2}}-1+\log\left(\frac{\bar{\bar{\xi}}^{2}}{\xi^{2}}\right)\right]\\ +\frac{(a-\bar{\bar{a}})\psi(a)+\log\frac{\Gamma(\bar{\bar{a}})}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}}{2\beta T}\Biggr{\}}.

Let Σ(𝒫)=𝔼PT+1𝒫[μPT+1μ2]\Sigma(\mathcal{P})=\mathbb{E}_{P_{T+1}\sim\mathcal{P}}\left[\|\mu_{P_{T+1}}-\mu^{*}\|^{2}\right]; this quantity, which measures how spread out the task-specific solutions are, will drive the final rate. Therefore,

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]4infξ2,a,b,β{𝔼σ¯2Γ1(a,b)[minσ2{Lσ2+Σ(𝒫)+ξ2d2σ¯2αn+12αni=1d[σi2σ¯21+log(σ¯2σi2)]}]+μ22βξ¯¯2T+d2β[ξ2ξ¯¯21+log(ξ¯¯2ξ2)]+(aa¯¯)ψ(a)+logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bb2βT}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\\ 4\inf_{\xi^{2},a,b,\beta}\Biggl{\{}\mathbb{E}_{\bar{\sigma}^{2}\sim\Gamma^{-1}(a,b)}\Bigg{[}\min_{\sigma^{2}}\Biggl{\{}L\|\sigma\|^{2}+\frac{\Sigma(\mathcal{P})+\xi^{2}d}{2\bar{\sigma}^{2}\alpha n}+\frac{1}{2\alpha n}\sum_{i=1}^{d}\left[\frac{\sigma_{i}^{2}}{\bar{\sigma}^{2}}-1+\log\left(\frac{\bar{\sigma}^{2}}{\sigma_{i}^{2}}\right)\right]\Biggr{\}}\Bigg{]}\\ +\frac{\|\mu^{*}\|^{2}}{2\beta\bar{\bar{\xi}}^{2}T}+\frac{d}{2\beta}\left[\frac{\xi^{2}}{\bar{\bar{\xi}}^{2}}-1+\log\left(\frac{\bar{\bar{\xi}}^{2}}{\xi^{2}}\right)\right]+\frac{(a-\bar{\bar{a}})\psi(a)+\log\frac{\Gamma(\bar{\bar{a}})}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}}{2\beta T}\Biggr{\}}.

An exact optimization in σ2\sigma^{2}, obtained by setting the derivative of each coordinate's contribution to zero, gives σi2=σ¯2(2αLσ¯2n+1)\sigma_{i}^{2}=\frac{\bar{\sigma}^{2}}{(2\alpha L\bar{\sigma}^{2}n+1)} and, after simplifications,

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]+4infξ2,a,b{𝔼σ¯2Γ1(a,b)[Σ(𝒫)+ξ2d2σ¯2αn+dlog(2αLσ¯2n+1)2αn]+μ22βξ¯¯2T+d2β[ξ2ξ¯¯21+log(ξ¯¯2ξ2)]+(aa¯¯)ψ(a)+logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bb2βT}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\\ \leq\mathcal{E}^{*}+4\inf_{\xi^{2},a,b}\Biggl{\{}\mathbb{E}_{\bar{\sigma}^{2}\sim\Gamma^{-1}(a,b)}\left[\frac{\Sigma(\mathcal{P})+\xi^{2}d}{2\bar{\sigma}^{2}\alpha n}+\frac{d\log\left(2\alpha L\bar{\sigma}^{2}n+1\right)}{2\alpha n}\right]\\ +\frac{\|\mu^{*}\|^{2}}{2\beta\bar{\bar{\xi}}^{2}T}+\frac{d}{2\beta}\left[\frac{\xi^{2}}{\bar{\bar{\xi}}^{2}}-1+\log\left(\frac{\bar{\bar{\xi}}^{2}}{\xi^{2}}\right)\right]+\frac{(a-\bar{\bar{a}})\psi(a)+\log\frac{\Gamma(\bar{\bar{a}})}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}}{2\beta T}\Biggr{\}}.
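The closed-form expression for the optimal variance can be double-checked numerically; the snippet below (with arbitrary illustrative values of L, alpha, n and the prior variance) compares it with a one-dimensional numerical minimization of a single coordinate's contribution to the bound.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sanity check (arbitrary illustrative values of L, alpha, n, sigma_bar^2) of the
# closed-form minimizer sigma_i^2 = sigma_bar^2 / (2 alpha L sigma_bar^2 n + 1).
L_const, alpha, n, sigma2_bar = 3.0, 0.1, 40, 0.8

def per_coordinate_objective(s2):
    return L_const * s2 + (s2 / sigma2_bar - 1 + np.log(sigma2_bar / s2)) / (2 * alpha * n)

closed_form = sigma2_bar / (2 * alpha * L_const * sigma2_bar * n + 1)
numeric = minimize_scalar(per_coordinate_objective, bounds=(1e-8, 10.0), method="bounded").x
print(closed_form, numeric)   # should agree
```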

We now consider two cases separately, Σ(𝒫)>dϵ\Sigma(\mathcal{P})>d\epsilon and Σ(𝒫)dϵ\Sigma(\mathcal{P})\leq d\epsilon, for some (very small) ϵ\epsilon that will be chosen later. First case: Σ(𝒫)>dϵ\Sigma(\mathcal{P})>d\epsilon. In this case, we do not expect a significant improvement of the bound from meta-learning. Simply take ξ=ξ¯¯\xi=\bar{\bar{\xi}} to get

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]4infa,b{𝔼σ¯2Γ1(a,b)[Σ(𝒫)+dξ¯¯2σ¯2αn+dlog(2αLσ¯2n+1)2αn]+2μ2βξ¯¯2T+(aa¯¯)ψ(a)+2logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bbβT}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\\ \leq 4\inf_{a,b}\Biggl{\{}\mathbb{E}_{\bar{\sigma}^{2}\sim\Gamma^{-1}(a,b)}\left[\frac{\Sigma(\mathcal{P})+d\bar{\bar{\xi}}^{2}}{\bar{\sigma}^{2}\alpha n}+\frac{d\log\left(2\alpha L\bar{\sigma}^{2}n+1\right)}{2\alpha n}\right]\\ +\frac{2\|\mu^{*}\|^{2}}{\beta\bar{\bar{\xi}}^{2}T}+\frac{(a-\bar{\bar{a}})\psi(a)+2\log\frac{\Gamma(\bar{\bar{a}})}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}}{\beta T}\Biggr{\}}.

In this case, an accurate optimization with respect to aa and bb will not improve the bound significantly, and for this reason, we simply take a=a¯¯a=\bar{\bar{a}} and b=b¯¯b=\bar{\bar{b}}. Also note that for UΓ(a,b)U\sim\Gamma(a,b) we have 𝔼(Ux)=Γ(a+x)/[bxΓ(a)]\mathbb{E}(U^{x})=\Gamma(a+x)/[b^{x}\Gamma(a)] and thus 𝔼(1/σ¯2)=b/(a1)\mathbb{E}(1/\bar{\sigma}^{2})=b/(a-1) and, by Jensen's inequality, 𝔼(log(2αLσ¯2n+1))log(2αL𝔼(σ¯2)n+1)=log(2αLna/b+1)\mathbb{E}(\log(2\alpha L\bar{\sigma}^{2}n+1))\leq\log(2\alpha L\mathbb{E}(\bar{\sigma}^{2})n+1)=\log(2\alpha Lna/b+1). Thus,

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]+4b¯¯[dξ¯¯2+Σ(𝒫)](a¯¯1)αn+2dlog(2a¯¯αLnb¯¯+1)αn+2μ2βξ¯¯2T.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+\frac{4\bar{\bar{b}}\left[d\bar{\bar{\xi}}^{2}+\Sigma(\mathcal{P})\right]}{(\bar{\bar{a}}-1)\alpha n}+\frac{2d\log\left(\frac{2\bar{\bar{a}}\alpha Ln}{\bar{\bar{b}}}+1\right)}{\alpha n}+\frac{2\|\mu^{*}\|^{2}}{\beta\bar{\bar{\xi}}^{2}T}.

Second case: Σ(𝒫)dϵ\Sigma(\mathcal{P})\leq d\epsilon. In this case, we expect a significant improvement of the bound from meta-learning, and in order to take advantage of it, we choose ξ\xi very small, namely ξ2=ϵ\xi^{2}=\epsilon. We obtain

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]+4infa,b{𝔼σ¯2Γ1(a,b){dϵσ¯2αn+dlog(2αLσ¯2n+1)2αn}+μ22βξ¯¯2T+d2βT[ϵξ¯¯21+log(ξ¯¯2ϵ)]+(aa¯¯)ψ(a)+logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bb2βT}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+4\inf_{a,b}\Biggl{\{}\mathbb{E}_{\bar{\sigma}^{2}\sim\Gamma^{-1}(a,b)}\Biggl{\{}\frac{d\epsilon}{\bar{\sigma}^{2}\alpha n}+\frac{d\log\left(2\alpha L\bar{\sigma}^{2}n+1\right)}{2\alpha n}\Biggr{\}}\\ +\frac{\|\mu^{*}\|^{2}}{2\beta\bar{\bar{\xi}}^{2}T}+\frac{d}{2\beta T}\left[\frac{\epsilon}{\bar{\bar{\xi}}^{2}}-1+\log\left(\frac{\bar{\bar{\xi}}^{2}}{\epsilon}\right)\right]\\ +\frac{(a-\bar{\bar{a}})\psi(a)+\log\frac{\Gamma(\bar{\bar{a}})}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}}{2\beta T}\Biggr{\}}.

In order to take advantage of meta-learning, we are going to make sure that $\bar{\sigma}^{2}$ is very small by tuning $a$ and $b$ adequately. But then, using $\log(1+x)\leq x$, we have $\mathbb{E}(\log(2\alpha L\bar{\sigma}^{2}n+1))\leq 2\alpha L\mathbb{E}(\bar{\sigma}^{2})n=2\alpha Lna/b$, and we obtain

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]+4infa,b,βG{bdϵ(a1)αn+dLab+μ22βξ¯¯2T+d2βT[ϵξ¯¯21+log(ξ¯¯2ϵ)]+(aa¯¯)ψ(a)+logΓ(a¯¯)Γ(a)+a¯¯logbb¯¯+ab¯¯bb2βT}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+4\inf_{a,b,\beta\in G}\Biggl{\{}\frac{bd\epsilon}{(a-1)\alpha n}+\frac{dLa}{b}+\frac{\|\mu^{*}\|^{2}}{2\beta\bar{\bar{\xi}}^{2}T}\\ +\frac{d}{2\beta T}\left[\frac{\epsilon}{\bar{\bar{\xi}}^{2}}-1+\log\left(\frac{\bar{\bar{\xi}}^{2}}{\epsilon}\right)\right]+\frac{(a-\bar{\bar{a}})\psi(a)+\log\frac{\Gamma(\bar{\bar{a}})}{\Gamma(a)}+\bar{\bar{a}}\log\frac{b}{\bar{\bar{b}}}+a\frac{\bar{\bar{b}}-b}{b}}{2\beta T}\Biggr{\}}.

Choose $a=\bar{\bar{a}}$ (we assume here that $\bar{\bar{a}}>1$), and an optimization with respect to $b$ gives $b=\sqrt{\frac{\alpha La(a-1)n}{\epsilon}}$. Plugging this value back into the bound gives

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]+4da¯¯Lα(a¯¯1)ϵn+2μ2βξ¯¯2T+2dβT[ϵξ¯¯21+log(ξ¯¯2ϵ)]+2βT(a¯¯log(1b¯¯nαLa¯¯(a¯¯1)ϵ)+a¯¯b¯¯ϵnαLa¯¯(a¯¯1)).\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+4d\sqrt{\frac{\bar{\bar{a}}L}{\alpha(\bar{\bar{a}}-1)}}\sqrt{\frac{\epsilon}{n}}+\frac{2\|\mu^{*}\|^{2}}{\beta\bar{\bar{\xi}}^{2}T}\\ +\frac{2d}{\beta T}\left[\frac{\epsilon}{\bar{\bar{\xi}}^{2}}-1+\log\left(\frac{\bar{\bar{\xi}}^{2}}{\epsilon}\right)\right]+\frac{2}{\beta T}\left(\bar{\bar{a}}\log\left(\frac{1}{\bar{\bar{b}}}\sqrt{\frac{n\alpha L\bar{\bar{a}}(\bar{\bar{a}}-1)}{\epsilon}}\right)+\bar{\bar{a}}\bar{\bar{b}}\sqrt{\frac{\epsilon}{n\alpha L\bar{\bar{a}}(\bar{\bar{a}}-1)}}\right).
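
For completeness, the value of $b$ above solves the first-order condition of the two $b$-dependent terms of the previous display:

\frac{\partial}{\partial b}\left[\frac{bd\epsilon}{(a-1)\alpha n}+\frac{dLa}{b}\right]=\frac{d\epsilon}{(a-1)\alpha n}-\frac{dLa}{b^{2}}=0\quad\Longleftrightarrow\quad b=\sqrt{\frac{\alpha La(a-1)n}{\epsilon}}.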

The last step is to choose $\epsilon$. Interestingly enough, for $\epsilon=1/n$ we recover the rate of learning in isolation, $d/n$. Now, optimizing with respect to $\epsilon$ leads to the choice $\epsilon=n/T^{2}$, and thus

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]+4dTa¯¯Lα(a¯¯1)+2μ2βξ¯¯2T+2dβT(nξ¯¯2T21+log(ξ¯¯2T2))+2βT(a¯¯log(Tb¯¯αLa¯¯(a¯¯1))+a¯¯b¯¯TαLa¯¯(a¯¯1)).\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+\frac{4d}{T}\sqrt{\frac{\bar{\bar{a}}L}{\alpha(\bar{\bar{a}}-1)}}+\frac{2\|\mu^{*}\|^{2}}{\beta\bar{\bar{\xi}}^{2}T}\\ +\frac{2d}{\beta T}\left(\frac{n}{\bar{\bar{\xi}}^{2}T^{2}}-1+\log\left(\bar{\bar{\xi}}^{2}T^{2}\right)\right)+\frac{2}{\beta T}\left(\bar{\bar{a}}\log\left(\frac{T}{\bar{\bar{b}}}\sqrt{\alpha L\bar{\bar{a}}(\bar{\bar{a}}-1)}\right)+\frac{\bar{\bar{a}}\bar{\bar{b}}}{T}\sqrt{\alpha L\bar{\bar{a}}(\bar{\bar{a}}-1)}\right).
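
As a sanity check on the choice $\epsilon=n/T^{2}$: the two leading $\epsilon$-dependent terms of the previous bound scale as $\sqrt{\epsilon/n}$ and $\frac{1}{\beta T}\log(1/\epsilon)$, and balancing them gives $\epsilon$ of order $n/T^{2}$ up to constants; moreover, with this choice,

\sqrt{\frac{\epsilon}{n}}=\frac{1}{T},\qquad\log\frac{\bar{\bar{\xi}}^{2}}{\epsilon}=\log\frac{\bar{\bar{\xi}}^{2}T^{2}}{n}\leq\log\left(\bar{\bar{\xi}}^{2}T^{2}\right)\ \text{ for }n\geq 1,

which explains the terms appearing in the display above.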

In the regime $T>n$, this is a significant improvement over learning in isolation. In this case, we can simplify the bound as

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]+G(C,μ,ξ¯¯,L,a¯¯,b¯¯)d+log(T)T,\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+G(C,\mu^{*},\bar{\bar{\xi}},L,\bar{\bar{a}},\bar{\bar{b}})\frac{d+\log(T)}{T},

where

G(C,μ,ξ¯¯,L,a¯¯,b¯¯)=4a¯¯Lα(a¯¯1)+2β(μ2ξ¯¯2+nξ¯¯2T2+1+logξ¯¯2+a¯¯+a¯¯2logαLa¯¯(a¯¯1)b¯¯2+a¯¯b¯¯TαLa¯¯(a¯¯1)).G(C,\mu^{*},\bar{\bar{\xi}},L,\bar{\bar{a}},\bar{\bar{b}})=4\sqrt{\frac{\bar{\bar{a}}L}{\alpha(\bar{\bar{a}}-1)}}\\ +\frac{2}{\beta}\left(\frac{\|\mu^{*}\|^{2}}{\bar{\bar{\xi}}^{2}}+\frac{n}{\bar{\bar{\xi}}^{2}T^{2}}+1+\log\bar{\bar{\xi}}^{2}+\bar{\bar{a}}+\frac{\bar{\bar{a}}}{2}\log\frac{\alpha L\bar{\bar{a}}(\bar{\bar{a}}-1)}{\bar{\bar{b}}^{2}}+\frac{\bar{\bar{a}}\bar{\bar{b}}}{T}\sqrt{\alpha L\bar{\bar{a}}(\bar{\bar{a}}-1)}\right).

As a conclusion,

𝔼P1,,PT𝔼𝒮1,,𝒮T𝔼πΠ^[(π)]+min(b¯¯[dξ¯¯2+Σ(𝒫)](a¯¯1)αn+dlog(2a¯¯αLnb¯¯+1)2αn+2μ2βξ¯¯2T,G(C,μ,ξ¯¯,L,a¯¯,b¯¯)d+log(T)T).\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{\mathcal{S}_{1},\dots,\mathcal{S}_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]\leq\mathcal{E}^{*}+\min\Biggl{(}\frac{\bar{\bar{b}}\left[d\bar{\bar{\xi}}^{2}+\Sigma(\mathcal{P})\right]}{(\bar{\bar{a}}-1)\alpha n}+\frac{d\log\left(\frac{2\bar{\bar{a}}\alpha Ln}{\bar{\bar{b}}}+1\right)}{2\alpha n}+\frac{2\|\mu^{*}\|^{2}}{\beta\bar{\bar{\xi}}^{2}T},\\ G(C,\mu^{*},\bar{\bar{\xi}},L,\bar{\bar{a}},\bar{\bar{b}})\frac{d+\log(T)}{T}\Biggr{)}.

Appendix G Application of Theorem 5 to the Case of Mixtures of Gaussians

We first assume that the priors are mixtures of $K$ Gaussians, where $K$ is known:

={pw,μ,σ2=k=1Kwki=1d𝒩(μk,i,σ2k,i):(i,k)[d]×[K],μk,i,σ2k,i+,wk0,1w=1}.\mathcal{M}=\Bigg{\{}p_{w,\mu,\sigma^{2}}=\sum_{k=1}^{K}w_{k}\bigotimes_{i=1}^{d}\mathcal{N}(\mu_{k,i},\sigma^{2}_{k,i}):\\ \forall(i,k)\in[d]\times[K],\mu_{k,i}\in\mathbb{R},\sigma^{2}_{k,i}\in\mathbb{R}^{*}_{+},w_{k}\geq 0,1^{\top}w=1\Bigg{\}}.

We set the prior π=k=1Kw¯k𝒩(μ¯k,σ¯k2Id)\pi=\sum_{k=1}^{K}\bar{w}_{k}\mathcal{N}(\bar{\mu}_{k},\bar{\sigma}_{k}^{2}I_{d}). Then, denoting by g(x;μ,σ2)g(x;\mu,\sigma^{2}) the pdf of the normal distribution 𝒩(μ,σ2)\mathcal{N}(\mu,\sigma^{2}), (17) implies, for any w,μ,σ2w,\mu,\sigma^{2},

KL(pw,μ,σ2π)\displaystyle\mathrm{KL}(p_{w,\mu,\sigma^{2}}\|\pi) =dlogk=1Kwkg(x;μk,σ2k)k=1Kw¯kg(x;μ¯k,σ¯k2Id)k=1Kwkg(x;μk,σ2k)dx\displaystyle=\int_{\mathbb{R}^{d}}\log\frac{\sum_{k=1}^{K}w_{k}g(x;\mu_{k},\sigma^{2}_{k})}{\sum_{k=1}^{K}\bar{w}_{k}g(x;\bar{\mu}_{k},\bar{\sigma}_{k}^{2}I_{d})}\sum_{k=1}^{K}w_{k}g(x;\mu_{k},\sigma^{2}_{k})dx
dk=1Klogwkg(x;μk,σ2k)w¯kg(x;μ¯k,σ¯k2Id)wkg(x;μk,σ2k)dx\displaystyle\leq\int_{\mathbb{R}^{d}}\sum_{k=1}^{K}\log\frac{w_{k}g(x;\mu_{k},\sigma^{2}_{k})}{\bar{w}_{k}g(x;\bar{\mu}_{k},\bar{\sigma}_{k}^{2}I_{d})}w_{k}g(x;\mu_{k},\sigma^{2}_{k})dx
=k=1Kwklogwkw¯k+k=1KwkKL(𝒩(μk,σ2k)𝒩(μ¯k,σ¯k2Id))\displaystyle=\sum_{k=1}^{K}w_{k}\log\frac{w_{k}}{\bar{w}_{k}}+\sum_{k=1}^{K}w_{k}\mathrm{KL}(\mathcal{N}(\mu_{k},\sigma^{2}_{k})\|\mathcal{N}(\bar{\mu}_{k},\bar{\sigma}_{k}^{2}I_{d}))
=KL(ww¯)+12k=1Kwki=1d((μk,iμ¯k,i)2σ¯k2+σk,i2σ¯k21+logσ¯k2σk,i2),\displaystyle=\mathrm{KL}(w\|\bar{w})+\frac{1}{2}\sum_{k=1}^{K}w_{k}\sum_{i=1}^{d}\left(\frac{(\mu_{k,i}-\bar{\mu}_{k,i})^{2}}{\bar{\sigma}_{k}^{2}}+\frac{\sigma_{k,i}^{2}}{\bar{\sigma}_{k}^{2}}-1+\log\frac{\bar{\sigma}_{k}^{2}}{\sigma_{k,i}^{2}}\right),

where the inequality on the second line follows from the log-sum inequality of Cover and Thomas, and the bound from Theorem 5 becomes, at $t=T+1$,

𝔼ST+1𝔼θρT+1(π,α)[RPT+1(θ)]RPT+12infw,μ,σ2{𝔼θpw,μ,σ2[RPT+1(θ)]RPT+1+KL(ww¯)αn+12αnk=1Kwki=1d((μk,iμ¯k,i)2σ¯k2+σk,i2σ¯k21+logσ¯k2σk,i2)}.\mathbb{E}_{S_{T+1}}\mathbb{E}_{\theta\sim\rho_{T+1}(\pi,\alpha)}[R_{P_{T+1}}(\theta)]-R^{*}_{P_{T+1}}\leq 2\inf_{w,\mu,\sigma^{2}}\Bigg{\{}\mathbb{E}_{\theta\sim p_{w,\mu,\sigma^{2}}}[R_{P_{T+1}}(\theta)]-R_{P_{T+1}}^{*}+\frac{\mathrm{KL}(w\|\bar{w})}{\alpha n}\\ +\frac{1}{2\alpha n}\sum_{k=1}^{K}w_{k}\sum_{i=1}^{d}\left(\frac{(\mu_{k,i}-\bar{\mu}_{k,i})^{2}}{\bar{\sigma}_{k}^{2}}+\frac{\sigma_{k,i}^{2}}{\bar{\sigma}_{k}^{2}}-1+\log\frac{\bar{\sigma}_{k}^{2}}{\sigma_{k,i}^{2}}\right)\Bigg{\}}.

Assumption (4) implies that

𝔼θ𝒩(μ,σ2)[RPT+1(θ)]RPT+1L𝔼θ𝒩(μ,σ2)[θμPT+12].\mathbb{E}_{\theta\sim\mathcal{N}(\mu,\sigma^{2})}[R_{P_{T+1}}(\theta)]-R^{*}_{P_{T+1}}\leq L\mathbb{E}_{\theta\sim\mathcal{N}(\mu,\sigma^{2})}\left[\|\theta-\mu_{P_{T+1}}\|^{2}\right].

It follows that the previous bound with the choice μ1==μK=μPT+1\mu_{1}=\dots=\mu_{K}=\mu_{P_{T+1}} becomes

𝔼ST+1𝔼θρT+1(π,α)[RPT+1(θ)]RPT+12infw,σ2{Lk=1Kwkσk2+KL(ww¯)αn+12αnk=1Kwki=1d((μPT+1,iμ¯k,i)2σ¯k2+σk,i2σ¯k21+logσ¯k2σk,i2)}.\mathbb{E}_{S_{T+1}}\mathbb{E}_{\theta\sim\rho_{T+1}(\pi,\alpha)}[R_{P_{T+1}}(\theta)]-R^{*}_{P_{T+1}}\leq 2\inf_{w,\sigma^{2}}\Bigg{\{}L\sum_{k=1}^{K}w_{k}\|\sigma_{k}\|^{2}+\frac{\mathrm{KL}(w\|\bar{w})}{\alpha n}\\ +\frac{1}{2\alpha n}\sum_{k=1}^{K}w_{k}\sum_{i=1}^{d}\left(\frac{(\mu_{P_{T+1},i}-\bar{\mu}_{k,i})^{2}}{\bar{\sigma}_{k}^{2}}+\frac{\sigma_{k,i}^{2}}{\bar{\sigma}_{k}^{2}}-1+\log\frac{\bar{\sigma}_{k}^{2}}{\sigma_{k,i}^{2}}\right)\Bigg{\}}.

While the choice $\mu_{1}=\dots=\mu_{K}=\mu_{P_{T+1}}$ may seem less meaningful than in the Gaussian case (with a single component), it is completely natural, as the best possible choice for the parameter $\theta$ is $\mu_{P_{T+1}}$. In the computation, each component $\mathcal{N}(\mu_{k},\sigma_{k}^{2})$ of the mixture contributes an error term which can be decomposed into a bias term and a variance term,

\mathbb{E}_{\theta\sim\mathcal{N}(\mu_{k},\sigma_{k}^{2})}\left[\|\theta-\mu_{P_{T+1}}\|^{2}\right]=\underbrace{\|\mu_{k}-\mu_{P_{T+1}}\|^{2}}_{\text{bias term (first order)}}+\underbrace{\|\sigma_{k}\|^{2}}_{\text{variance term (second order)}},

for which the choice μk=μPT+1\mu_{k}=\mu_{P_{T+1}} minimizes the first order error term. Next, we set the family 𝒢\mathcal{G} of distributions on \mathcal{F}:

𝒢={qδ,τ,ξ2,b=Dir(δ)k[K]i[d]𝒩(τk,i,ξk2)k=1KΓ(2,bk):δ=(δ1,,δK)K,(k,i),ξk2>0,τk,i,bk>0,δk>0},\mathcal{G}=\Bigg{\{}q_{\delta,\tau,\xi^{2},b}=\mathrm{Dir}(\delta)\otimes\bigotimes_{\begin{subarray}{c}k\in[K]\\ i\in[d]\end{subarray}}\mathcal{N}(\tau_{k,i},\xi_{k}^{2})\otimes\bigotimes_{k=1}^{K}\Gamma(2,b_{k}):\\ \delta=(\delta_{1},\dots,\delta_{K})\in\mathbb{R}^{K},\forall(k,i),\xi_{k}^{2}>0,\tau_{k,i}\in\mathbb{R},b_{k}>0,\delta_{k}>0\Bigg{\}},

where Dir(δ)\mathrm{Dir}(\delta) is the Dirichlet distribution of parameter δ\delta. We set the prior on priors Λ=q1K,0,ξ¯¯2,b¯¯\Lambda=q_{1_{K},0,\bar{\bar{\xi}}^{2},\bar{\bar{b}}}, where 1K=(1,,1)1_{K}=(1,\dots,1) and ξ¯¯2=(ξ¯¯12,,ξ¯¯K2)\bar{\bar{\xi}}^{2}=\left(\bar{\bar{\xi}}_{1}^{2},\dots,\bar{\bar{\xi}}_{K}^{2}\right). Then, using (17), (18) and (19),

KL(qδ,τ,ξ2,bΛ)=logΓ(1δ)Γ(K)×k=1KΓ(δk)+k=1K(δk1)(ψ(δk)ψ(1δ))+12k,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+2k=1K(logbkb¯¯k+b¯¯kbkbk),\mathrm{KL}(q_{\delta,\tau,\xi^{2},b}\|\Lambda)=\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\frac{1}{2}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right),

where ψ\psi is the digamma function. We can next use the bound from Theorem 5 and we have

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infδ,τ,ξ2,b{𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b𝔼PT+1[infw,σ2{Lk=1Kwkσk2+KL(ww¯)αn+12αnk=1Kwki=1d((μPT+1,iμ¯k,i)2σ¯k2+σk,i2σ¯k21+logσ¯k2σk,i2)}]+12βTlogΓ(1δ)Γ(K)×k=1KΓ(δk)+12βTk=1K(δk1)(ψ(δk)ψ(1δ))+14βTk,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+1βTk=1K(logbkb¯¯k+b¯¯kbkbk)}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\\ 4\inf_{\delta,\tau,\xi^{2},b}\Bigg{\{}\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b}}\mathbb{E}_{P_{T+1}}\Bigg{[}\inf_{w,\sigma^{2}}\Bigg{\{}L\sum_{k=1}^{K}w_{k}\|\sigma_{k}\|^{2}+\frac{\mathrm{KL}(w\|\bar{w})}{\alpha n}\\ +\frac{1}{2\alpha n}\sum_{k=1}^{K}w_{k}\sum_{i=1}^{d}\left(\frac{(\mu_{P_{T+1},i}-\bar{\mu}_{k,i})^{2}}{\bar{\sigma}_{k}^{2}}+\frac{\sigma_{k,i}^{2}}{\bar{\sigma}_{k}^{2}}-1+\log\frac{\bar{\sigma}_{k}^{2}}{\sigma_{k,i}^{2}}\right)\Bigg{\}}\Bigg{]}\\ +\frac{1}{2\beta T}\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2\beta T}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\frac{1}{4\beta T}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+\frac{1}{\beta T}\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{\}}.
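
For completeness, the Gamma terms $2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)$ appearing in $\mathrm{KL}(q_{\delta,\tau,\xi^{2},b}\|\Lambda)$ and in the bound above can be checked directly from the shape--rate Gamma density: for shapes $a_{1},a_{2}>0$ and rates $b_{1},b_{2}>0$,

\mathrm{KL}(\Gamma(a_{1},b_{1})\|\Gamma(a_{2},b_{2}))=(a_{1}-a_{2})\psi(a_{1})+\log\frac{\Gamma(a_{2})}{\Gamma(a_{1})}+a_{2}\log\frac{b_{1}}{b_{2}}+a_{1}\frac{b_{2}-b_{1}}{b_{1}},

which reduces to $2\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+2\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}$ when $a_{1}=a_{2}=2$, $b_{1}=b_{k}$ and $b_{2}=\bar{\bar{b}}_{k}$.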

Next, minimizing over σk,i2\sigma_{k,i}^{2} gives the optimal value σ¯k22αnLσ¯k2+1\frac{\bar{\sigma}_{k}^{2}}{2\alpha nL\bar{\sigma}_{k}^{2}+1}, and replacing in the above bound gives

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infδ,τ,ξ2,b{𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b𝔼PT+1[infw{KL(ww¯)αn+d2αnk=1Kwklog(2αnLσ¯k2+1)+12αnk=1KwkμPT+1μ¯k2σ¯k2}]+12βTlogΓ(1δ)Γ(K)×k=1KΓ(δk)+12βTk=1K(δk1)(ψ(δk)ψ(1δ))+14βTk,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+1βTk=1K(logbkb¯¯k+b¯¯kbkbk)}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{\delta,\tau,\xi^{2},b}\Bigg{\{}\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b}}\mathbb{E}_{P_{T+1}}\Bigg{[}\\ \inf_{w}\left\{\frac{\mathrm{KL}(w\|\bar{w})}{\alpha n}+\frac{d}{2\alpha n}\sum_{k=1}^{K}w_{k}\log\left(2\alpha nL\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\sum_{k=1}^{K}w_{k}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right\}\Bigg{]}\\ +\frac{1}{2\beta T}\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2\beta T}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\frac{1}{4\beta T}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+\frac{1}{\beta T}\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{\}}.
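
For the reader's convenience, the value of $\sigma_{k,i}^{2}$ used in this step solves the first-order condition of the $\sigma_{k,i}^{2}$-dependent part of the previous bound (the weight $w_{k}$ factors out):

\frac{\partial}{\partial\sigma_{k,i}^{2}}\left[L\sigma_{k,i}^{2}+\frac{1}{2\alpha n}\left(\frac{\sigma_{k,i}^{2}}{\bar{\sigma}_{k}^{2}}-1+\log\frac{\bar{\sigma}_{k}^{2}}{\sigma_{k,i}^{2}}\right)\right]=L+\frac{1}{2\alpha n}\left(\frac{1}{\bar{\sigma}_{k}^{2}}-\frac{1}{\sigma_{k,i}^{2}}\right)=0\quad\Longleftrightarrow\quad\sigma_{k,i}^{2}=\frac{\bar{\sigma}_{k}^{2}}{2\alpha nL\bar{\sigma}_{k}^{2}+1},

and plugging this value back in produces the $\frac{d}{2\alpha n}\log\left(2\alpha nL\bar{\sigma}_{k}^{2}+1\right)$ term of the display above.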

We now restrict the infimum $\inf_{w}$ to the set of $w$ such that $w_{k}\in\{0,1\}$ for every $k\in[K]$; in other words, we select only the best single component of the mixture. Since this only restricts the feasible set, the resulting expression is still an upper bound. As a result of this restriction, the bound becomes

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infδ,τ,ξ2,b{𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b𝔼PT+1[mink[K]{1αnlog1w¯k+d2αnlog(2αnLσ¯k2+1)+12αnμPT+1μ¯k2σ¯k2}]+12βTlogΓ(1δ)Γ(K)×k=1KΓ(δk)+12βTk=1K(δk1)(ψ(δk)ψ(1δ))+14βTk,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+1βTk=1K(logbkb¯¯k+b¯¯kbkbk)}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{\delta,\tau,\xi^{2},b}\Bigg{\{}\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b}}\mathbb{E}_{P_{T+1}}\Bigg{[}\\ \min_{k\in[K]}\left\{\frac{1}{\alpha n}\log\frac{1}{\bar{w}_{k}}+\frac{d}{2\alpha n}\log\left(2\alpha nL\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right\}\Bigg{]}\\ +\frac{1}{2\beta T}\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2\beta T}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\frac{1}{4\beta T}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+\frac{1}{\beta T}\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{\}}.

Please note that the term inside the expectation is, up to the minimum over $k\in[K]$, identical to the one we had in the case of a single Gaussian component, except for the term $\frac{1}{\alpha n}\log\frac{1}{\bar{w}_{k}}$, which may be seen as a penalty for the choice of the component $k\in[K]$ in the mixture. We then bound the expectation term in the above bound by first using Fubini's theorem, and then exchanging the minimum and the inner expectation (which can only increase the bound):

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infδ,τ,ξ2,b{𝔼PT+1[mink[K]{𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b[1αnlog1w¯k+d2αnlog(2αnLσ¯k2+1)+12αnμPT+1μ¯k2σ¯k2]}]+12βTlogΓ(1δ)Γ(K)×k=1KΓ(δk)+12βTk=1K(δk1)(ψ(δk)ψ(1δ))+14βTk,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+1βTk=1K(logbkb¯¯k+b¯¯kbkbk)}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{\delta,\tau,\xi^{2},b}\Bigg{\{}\mathbb{E}_{P_{T+1}}\Bigg{[}\min_{k\in[K]}\Bigg{\{}\\ \mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b}}\left[\frac{1}{\alpha n}\log\frac{1}{\bar{w}_{k}}+\frac{d}{2\alpha n}\log\left(2\alpha nL\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right]\Bigg{\}}\Bigg{]}\\ +\frac{1}{2\beta T}\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2\beta T}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\frac{1}{4\beta T}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+\frac{1}{\beta T}\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{\}}. (26)

We can then bound the expectation term, which we decompose as

𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b[1αnlog1w¯k+d2αnlog(2αnLσ¯k2+1)+12αnμPT+1μ¯k2σ¯k2]=1αn𝔼w¯Dir(δ)[log1w¯k]+d2αn𝔼σ¯k2Γ(2,bk)[log(2αnLσ¯k2+1)]+12αn𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b[μPT+1μ¯k2σ¯k2].\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b}}\left[\frac{1}{\alpha n}\log\frac{1}{\bar{w}_{k}}+\frac{d}{2\alpha n}\log\left(2\alpha nL\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right]\\ =\frac{1}{\alpha n}\mathbb{E}_{\bar{w}\sim\mathrm{Dir}(\delta)}\left[\log\frac{1}{\bar{w}_{k}}\right]+\frac{d}{2\alpha n}\mathbb{E}_{\bar{\sigma}_{k}^{2}\sim\Gamma(2,b_{k})}\left[\log\left(2\alpha nL\bar{\sigma}_{k}^{2}+1\right)\right]\\ +\frac{1}{2\alpha n}\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b}}\left[\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right].

Jensen’s inequality helps to bound both the first term

1αn𝔼w¯Dir(δ)[log1w¯k]\displaystyle\frac{1}{\alpha n}\mathbb{E}_{\bar{w}\sim\mathrm{Dir}(\delta)}\left[\log\frac{1}{\bar{w}_{k}}\right] 1αnlog𝔼w¯Dir(δ)[1w¯k]\displaystyle\leq\frac{1}{\alpha n}\log\mathbb{E}_{\bar{w}\sim\mathrm{Dir}(\delta)}\left[\frac{1}{\bar{w}_{k}}\right]
=1αnlog1δ1δk1\displaystyle=\frac{1}{\alpha n}\log\frac{1^{\top}\delta-1}{\delta_{k}-1}

and the second term

d2αn𝔼σ¯k2Γ(2,bk)[log(2αnLσ¯k2+1)]\displaystyle\frac{d}{2\alpha n}\mathbb{E}_{\bar{\sigma}_{k}^{2}\sim\Gamma(2,b_{k})}\left[\log\left(2\alpha nL\bar{\sigma}_{k}^{2}+1\right)\right] d2αnlog(2αnL𝔼σ¯k2Γ(2,bk)[σ¯k2]+1)\displaystyle\leq\frac{d}{2\alpha n}\log\left(2\alpha nL\mathbb{E}_{\bar{\sigma}_{k}^{2}\sim\Gamma(2,b_{k})}\left[\bar{\sigma}_{k}^{2}\right]+1\right)
=d2αnlog(4Lαnbk+1)\displaystyle=\frac{d}{2\alpha n}\log\left(\frac{4L\alpha n}{b_{k}}+1\right)

in the decomposition. The third term can be bounded as follows

12αn𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b[μPT+1μ¯k2σ¯k2]\displaystyle\frac{1}{2\alpha n}\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b}}\left[\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right]
=bk2αn𝔼μ¯k𝒩(τk,ξk2Id)[μPT+1μ¯k2]\displaystyle=\frac{b_{k}}{2\alpha n}\mathbb{E}_{\bar{\mu}_{k}\sim\mathcal{N}(\tau_{k},\xi_{k}^{2}I_{d})}\left[\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}\right]
bkαn(μPT+1τk2+𝔼μ¯k𝒩(τk,ξk2Id)[τkμ¯k2])\displaystyle\leq\frac{b_{k}}{\alpha n}\left(\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+\mathbb{E}_{\bar{\mu}_{k}\sim\mathcal{N}(\tau_{k},\xi_{k}^{2}I_{d})}\left[\|\tau_{k}-\bar{\mu}_{k}\|^{2}\right]\right)
=bkαn(μPT+1τk2+dξk2).\displaystyle=\frac{b_{k}}{\alpha n}\left(\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+d\xi_{k}^{2}\right).
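
For completeness, the three moments used in these computations are standard: the Dirichlet marginal $\bar{w}_{k}\sim\mathrm{Beta}(\delta_{k},1^{\top}\delta-\delta_{k})$ gives the first one (finite as soon as $\delta_{k}>1$), and the $\Gamma(2,b_{k})$ distribution (shape--rate convention) gives the other two:

\mathbb{E}_{\bar{w}\sim\mathrm{Dir}(\delta)}\left[\frac{1}{\bar{w}_{k}}\right]=\frac{1^{\top}\delta-1}{\delta_{k}-1},\qquad\mathbb{E}_{\bar{\sigma}_{k}^{2}\sim\Gamma(2,b_{k})}\left[\bar{\sigma}_{k}^{2}\right]=\frac{2}{b_{k}},\qquad\mathbb{E}_{\bar{\sigma}_{k}^{2}\sim\Gamma(2,b_{k})}\left[\frac{1}{\bar{\sigma}_{k}^{2}}\right]=b_{k}.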

The bound on the expectation then becomes

\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b}}\left[\frac{1}{\alpha n}\log\frac{1}{\bar{w}_{k}}+\frac{d}{2\alpha n}\log\left(2\alpha nL\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right]\\ \leq\frac{1}{\alpha n}\log\frac{1^{\top}\delta-1}{\delta_{k}-1}+\frac{d}{2\alpha n}\log\left(\frac{4L\alpha n}{b_{k}}+1\right)+\frac{b_{k}}{\alpha n}\left(\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+d\xi_{k}^{2}\right).

In our final bound, we wish to have as few terms as possible in $O\left(\frac{1}{n}\right)$, while the terms in $O\left(\frac{1}{T}\right)$ are not so problematic, because they correspond to the fast convergence rate at the meta level. For this reason, we are going to pull the following terms out of the minimum over $k\in[K]$:

  • the term d2αnlog(4Lαnbk+1)\frac{d}{2\alpha n}\log\left(\frac{4L\alpha n}{b_{k}}+1\right), which is unavoidable and corresponds to the main term of the bound in the worst case, with a O(1n)O\left(\frac{1}{n}\right) speed of convergence;

  • the term bkdξk2αn\frac{b_{k}d\xi_{k}^{2}}{\alpha n}, which will be handled through an optimization in ξk2\xi_{k}^{2} and will be a O(1T)O\left(\frac{1}{T}\right) term.

As a consequence, we bound the minimum on [K][K] by

mink[K]{𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b[1αnlog1w¯k+d2αnlog(2αnLσ¯k2+1)+12αnμPT+1μ¯k2σ¯k2]}d2αnk=1Klog(4αLnbk+1)+k=1Kbkdξk2αn+1αnmink[K]{bkμPT+1τk2+log1δ1δk1},\min_{k\in[K]}\left\{\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b}}\left[\frac{1}{\alpha n}\log\frac{1}{\bar{w}_{k}}+\frac{d}{2\alpha n}\log\left(2\alpha nL\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right]\right\}\leq\\ \frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)+\sum_{k=1}^{K}\frac{b_{k}d\xi_{k}^{2}}{\alpha n}+\frac{1}{\alpha n}\min_{k\in[K]}\left\{b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+\log\frac{1^{\top}\delta-1}{\delta_{k}-1}\right\}, (27)

and plugging this result in (26) gives

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infδ,τ,ξ2,b{d2αnk=1Klog(4αLnbk+1)+k=1Kbkdξk2αn+1αn𝔼PT+1[mink[K]{bkμPT+1τk2+log1δ1δk1}]+12βTlogΓ(1δ)Γ(K)×k=1KΓ(δk)+12βTk=1K(δk1)(ψ(δk)ψ(1δ))+14βTk=1Kτk2ξ¯¯k2+d4βTk=1K(ξk2ξ¯¯k21+logξ¯¯k2ξk2)+1βTk=1K(logbkb¯¯k+b¯¯kbkbk)}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{\delta,\tau,\xi^{2},b}\Bigg{\{}\frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)+\sum_{k=1}^{K}\frac{b_{k}d\xi_{k}^{2}}{\alpha n}\\ +\frac{1}{\alpha n}\mathbb{E}_{P_{T+1}}\left[\min_{k\in[K]}\left\{b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+\log\frac{1^{\top}\delta-1}{\delta_{k}-1}\right\}\right]\\ +\frac{1}{2\beta T}\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2\beta T}\sum_{k=1}^{K}\left(\delta_{k}-1\right)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\frac{1}{4\beta T}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{4\beta T}\sum_{k=1}^{K}\left(\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+\frac{1}{\beta T}\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{\}}.
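
The elementary inequality behind (27) is simply that, for nonnegative $(A_{k})_{k}$, $(B_{k})_{k}$ and arbitrary $(C_{k})_{k}$,

\min_{k\in[K]}\left\{A_{k}+B_{k}+C_{k}\right\}\leq\sum_{k=1}^{K}A_{k}+\sum_{k=1}^{K}B_{k}+\min_{k\in[K]}C_{k},

applied here with $A_{k}=\frac{d}{2\alpha n}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)$, $B_{k}=\frac{b_{k}d\xi_{k}^{2}}{\alpha n}$ and $C_{k}=\frac{1}{\alpha n}\left(b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+\log\frac{1^{\top}\delta-1}{\delta_{k}-1}\right)$.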

An exact optimization in ξk2\xi_{k}^{2} gives

ξk2=ξ¯¯k21+4bkξ¯¯k2βTαn\xi_{k}^{2}=\frac{\bar{\bar{\xi}}_{k}^{2}}{1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}}

and replacing in the bound yields

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infδ,τ,b{d2αnk=1Klog(4αLnbk+1)+1αn𝔼PT+1[mink[K]{bkμPT+1τk2+log1δ1δk1}]+12βTlogΓ(1δ)Γ(K)×k=1KΓ(δk)+12βTk=1K(δk1)(ψ(δk)ψ(1δ))+14βTk=1Kτk2ξ¯¯k2+d4βTk=1Klog(1+4bkξ¯¯k2βTαn)+1βTk=1K(logbkb¯¯k+b¯¯kbkbk)}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{\delta,\tau,b}\Bigg{\{}\frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)\\ +\frac{1}{\alpha n}\mathbb{E}_{P_{T+1}}\left[\min_{k\in[K]}\left\{b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+\log\frac{1^{\top}\delta-1}{\delta_{k}-1}\right\}\right]\\ +\frac{1}{2\beta T}\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2\beta T}\sum_{k=1}^{K}\left(\delta_{k}-1\right)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\frac{1}{4\beta T}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{4\beta T}\sum_{k=1}^{K}\log\left(1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)+\frac{1}{\beta T}\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{\}}.
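
For the reader's convenience, this value of $\xi_{k}^{2}$ solves the first-order condition of the two $\xi_{k}^{2}$-dependent terms,

\frac{\partial}{\partial\xi_{k}^{2}}\left[\frac{b_{k}d\xi_{k}^{2}}{\alpha n}+\frac{d}{4\beta T}\left(\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)\right]=0\quad\Longleftrightarrow\quad\frac{1}{\xi_{k}^{2}}=\frac{1}{\bar{\bar{\xi}}_{k}^{2}}+\frac{4b_{k}\beta T}{\alpha n},

and substituting it back turns these two terms into $\frac{d}{4\beta T}\log\left(1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)$, which is the term appearing in the bound above.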

From here, we set δk=2\delta_{k}=2 for any k[K]k\in[K], which implies

12βTk=1K(δk1)(ψ(δk)ψ(1δ))0\frac{1}{2\beta T}\sum_{k=1}^{K}\left(\delta_{k}-1\right)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\leq 0

because ψ\psi is increasing. Please also note that

logΓ(1δ)Γ(K)×k=1KΓ(δk)=logΓ(2K)Γ(K)Klog(2K).\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}=\log\frac{\Gamma(2K)}{\Gamma(K)}\leq K\log(2K).
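
Indeed, writing the ratio of Gamma functions as a product of integers gives

\frac{\Gamma(2K)}{\Gamma(K)}=\prod_{j=K}^{2K-1}j\leq(2K)^{K}.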

We can then deduce the bound

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infτ,b{d2αnk=1Klog(4αLnbk+1)+log(2K)αn+1αn𝔼PT+1[mink[K]{bkμPT+1τk2}]+Klog(2K)2βT+14βTk=1Kτk2ξ¯¯k2+d4βTk=1Klog(1+4bkξ¯¯k2βTαn)+1βTk=1K(logbkb¯¯k+b¯¯kbkbk)}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{\tau,b}\Bigg{\{}\frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)+\frac{\log(2K)}{\alpha n}\\ +\frac{1}{\alpha n}\mathbb{E}_{P_{T+1}}\left[\min_{k\in[K]}\left\{b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}\right\}\right]+\frac{K\log(2K)}{2\beta T}+\frac{1}{4\beta T}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}\\ +\frac{d}{4\beta T}\sum_{k=1}^{K}\log\left(1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)+\frac{1}{\beta T}\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{\}}.

Let

ΣK(𝒫):=infτ1,,τK𝔼PT+1𝒫[mink[K]μPT+1τk2],\Sigma_{K}(\mathcal{P}):=\inf_{\tau_{1},\dots,\tau_{K}}\mathbb{E}_{P_{T+1}\sim\mathcal{P}}\left[\min_{k\in[K]}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}\right],

Then, choosing $\tau_{1},\dots,\tau_{K}$ that achieve the infimum above, it is clear that

𝔼PT+1[mink[K]{bkμPT+1τk2}]ΣK(𝒫)k=1Kbk.\mathbb{E}_{P_{T+1}}\left[\min_{k\in[K]}\left\{b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}\right\}\right]\leq\Sigma_{K}(\mathcal{P})\sum_{k=1}^{K}b_{k}.

With this choice of $\tau_{1},\dots,\tau_{K}$, the previous bound becomes

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infb{log(2K)αn+d2αnk=1Klog(4αLnbk+1)+ΣK(𝒫)αnk=1Kbk+Klog(2K)2βT+14βTk=1Kτk2ξ¯¯k2+d4βTk=1Klog(1+4bkξ¯¯k2βTαn)+1βTk=1K(logbkb¯¯k+b¯¯kbkbk)}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{b}\Bigg{\{}\frac{\log(2K)}{\alpha n}+\frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)\\ +\frac{\Sigma_{K}(\mathcal{P})}{\alpha n}\sum_{k=1}^{K}b_{k}+\frac{K\log(2K)}{2\beta T}+\frac{1}{4\beta T}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}\\ +\frac{d}{4\beta T}\sum_{k=1}^{K}\log\left(1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)+\frac{1}{\beta T}\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{\}}. (28)

Please note that $\tau_{1},\dots,\tau_{K}$ are characteristic of the distribution $\mathcal{P}$. Intuitively, if the distribution $\mathcal{P}$ has $K$ modes, or $K$ mixture components, then $\tau_{1},\dots,\tau_{K}$ correspond to the centers of these modes or components, up to a permutation. Consequently, they do not scale with $n$ or $T$, but can be regarded as problem parameters of constant order.
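
As a simple illustration (not needed for the proof), if $\mathcal{P}$ draws the task mean $\mu_{P_{T+1}}$ uniformly from $K$ fixed vectors $m_{1},\dots,m_{K}$, then taking $\tau_{k}=m_{k}$ gives

\Sigma_{K}(\mathcal{P})=\mathbb{E}_{P_{T+1}\sim\mathcal{P}}\left[\min_{k\in[K]}\|\mu_{P_{T+1}}-m_{k}\|^{2}\right]=0,

while $\Sigma_{1}(\mathcal{P})$ can remain of constant order when the $m_{k}$ are far apart; this is the scenario in which the mixture prior is expected to help.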

Let $\epsilon>0$ be a constant that we will specify later. We consider two separate cases:

  • either $\Sigma_{K}(\mathcal{P})\leq d\epsilon$, implying that the distribution is concentrated around $\tau_{1},\dots,\tau_{K}$ and the variance of the local distribution around each of those points is smaller than $\epsilon$;

  • or $\Sigma_{K}(\mathcal{P})>d\epsilon$.

G.1 First Case: ΣK(𝒫)dϵ\Sigma_{K}(\mathcal{P})\leq d\epsilon

In this case, we expect the distribution to be well concentrated around $\tau_{1},\dots,\tau_{K}$ (which are the centers of the mixture components). As a result, the optimal parameter in the new task $T+1$ is going to be close to one of $\tau_{1},\dots,\tau_{K}$, and we can infer it from the previous tasks. This is precisely the case where we expect a significant improvement from meta-learning over learning in isolation.

A closer look at the bound above shows that, besides the term $\frac{\log(2K)}{\alpha n}$ (for which we will provide an interpretation later in this section), all the terms are $O\left(\frac{\log T}{T}\right)$, which is the fast rate of convergence at the meta level, except possibly for the terms $\frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)$ and $\frac{\Sigma_{K}(\mathcal{P})}{\alpha n}\sum_{k=1}^{K}b_{k}$. Nevertheless, the assumption on $\Sigma_{K}(\mathcal{P})$ ensures that the second of those terms remains small (its $k$-th summand is at most $\frac{d\epsilon b_{k}}{\alpha n}$) even for large $b_{k}$. For this reason, we are going to choose $b_{k}$ large enough for the first term to become small, and we set

k[K],bk=T.\forall k\in[K],\ b_{k}=T.

Replacing in the bound gives

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4{log(2K)αn+d2αnk=1Klog(4αLnT+1)+TKΣK(𝒫)αn+Klog(2K)2βT+14βTk=1Kτk2ξ¯¯k2+d4βTk=1Klog(1+4ξ¯¯k2βT2αn)+1βTk=1K(logTb¯¯k+b¯¯kTT)}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\Bigg{\{}\frac{\log(2K)}{\alpha n}+\frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{T}+1\right)\\ +\frac{TK\Sigma_{K}(\mathcal{P})}{\alpha n}+\frac{K\log(2K)}{2\beta T}+\frac{1}{4\beta T}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}\\ +\frac{d}{4\beta T}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T^{2}}{\alpha n}\right)+\frac{1}{\beta T}\sum_{k=1}^{K}\left(\log\frac{T}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-T}{T}\right)\Bigg{\}}.

We can easily bound the second term in the bound using the inequality $\log(1+x)\leq x$, which gives

d2αnk=1Klog(4αLnT+1)2LdKT,\frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{T}+1\right)\leq\frac{2LdK}{T},

and replacing $\Sigma_{K}(\mathcal{P})$ by its upper bound $d\epsilon$ gives

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\frac{4\log(2K)}{\alpha n}+\frac{8LdK}{T}+\frac{4dTK\epsilon}{\alpha n}\\ +\frac{2K\log(2K)}{\beta T}+\frac{1}{\beta T}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{\beta T}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T^{2}}{\alpha n}\right)+\frac{4}{\beta T}\sum_{k=1}^{K}\left(\log\frac{T}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-T}{T}\right).

From here, the choice of the rate ϵ=nT2\epsilon=\frac{n}{T^{2}} yields the final bound:

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]CVfinite(K,n)+CVGaussian(K,d,T)+CVmeta(T,n,d,K,b¯¯,ξ¯¯2,τ),\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\\ \text{CV}_{\text{finite}}(K,n)+\text{CV}_{\text{Gaussian}}(K,d,T)+\text{CV}_{\text{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau),

where we denoted

\text{CV}_{\text{finite}}(K,n)=\frac{4\log(2K)}{\alpha n},\quad\text{CV}_{\text{Gaussian}}(K,d,T)=\frac{8LdK}{T}+\frac{4dK}{\alpha T}

and

CVmeta(T,n,d,K,b¯¯,ξ¯¯2,τ)=2Klog(2K)βT+1βTk=1Kτk2ξ¯¯k2+dβTk=1Klog(1+4ξ¯¯k2βT2αn)+4βTk=1K(logTb¯¯k+b¯¯kTT).\text{CV}_{\text{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)=\frac{2K\log(2K)}{\beta T}+\frac{1}{\beta T}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{\beta T}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T^{2}}{\alpha n}\right)\\ +\frac{4}{\beta T}\sum_{k=1}^{K}\left(\log\frac{T}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-T}{T}\right).
Remark 16

Please note that

CVfinite(K,n)=O(logKn)\text{CV}_{\text{finite}}(K,n)=O\left(\frac{\log K}{n}\right)

and

CVGaussian(K,d,T)+CVmeta(T,n,d,K,b¯¯,ξ¯¯2,τ)=O(dKlogTT).\text{CV}_{\text{Gaussian}}(K,d,T)+\text{CV}_{\text{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)=O\left(\frac{dK\log T}{T}\right).

This bound comes as no surprise, because the process of learning the parameter $\theta$ in the mixture-of-Gaussians framework consists of two different steps:

  • first, identifying the $K$ centers of the mixture which, similarly to the finite case, is captured in the $\frac{\log K}{n}$ term;

  • then, identifying the right parameters of the Gaussian components centered on the points identified in the previous step; similarly to the Gaussian case, this is captured in $O\left(\frac{\log T}{T}\right)$ in the most favorable case.

Remark 17

It may seem paradoxical that the model of mixtures of Gaussians achieves, in the best possible case, a rate of convergence $O\left(\frac{\log K}{n}\right)$ that is slower than the $O\left(\frac{\log T}{T}\right)$ rate achieved by a single Gaussian in the regime $n\ll T$ under optimal conditions. In reality, the latter rate of convergence is also achievable in the model of mixtures of Gaussians but, as in the case of a single Gaussian component, it requires the strong assumption that

Σ1(𝒫)dϵ,\Sigma_{1}(\mathcal{P})\leq d\epsilon,

which is much more restrictive. On the other hand, many distributions only satisfy

ΣK(𝒫)dϵ\Sigma_{K}(\mathcal{P})\leq d\epsilon

for some K2K\geq 2, in which case the rate of convergence O(logKn)O\left(\frac{\log K}{n}\right) achieved here is much faster than O(dlognn)O\left(\frac{d\log n}{n}\right), which is the best possible rate achieved in the single Gaussian model in general.

G.2 Second Case: ΣK(𝒫)>dϵ\Sigma_{K}(\mathcal{P})>d\epsilon

In this case, ΣK(𝒫)\Sigma_{K}(\mathcal{P}) is not smaller than dϵd\epsilon and therefore, the choice of a large parameter bkb_{k} would make the term ΣK(𝒫)αnk=1Kbk\frac{\Sigma_{K}(\mathcal{P})}{\alpha n}\sum_{k=1}^{K}b_{k} too large to achieve the desired rate of convergence. As a result, we set

k[K],bk=1.\forall k\in[K],\ b_{k}=1.

Replacing in the bound (28) gives

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4log(2K)αn+2dKαnlog(1+4αLn)+4KΣK(𝒫)αn+2Klog(2K)βT+1βTk=1Kτk2ξ¯¯k2+dβTk=1Klog(1+4ξ¯¯k2βT2αn)+4βTk=1K(log1b¯¯k+b¯¯k1).\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\frac{4\log(2K)}{\alpha n}+\frac{2dK}{\alpha n}\log\left(1+4\alpha Ln\right)+\frac{4K\Sigma_{K}(\mathcal{P})}{\alpha n}\\ +\frac{2K\log(2K)}{\beta T}+\frac{1}{\beta T}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{\beta T}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T^{2}}{\alpha n}\right)+\frac{4}{\beta T}\sum_{k=1}^{K}\left(\log\frac{1}{\bar{\bar{b}}_{k}}+\bar{\bar{b}}_{k}-1\right).

G.3 In Summary

In any case, the bound can be written as

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]CVfinite(K,n)+K×CVGaussian(d,ΣK(𝒫),n,T)+CVmeta(T,n,d,K,b¯¯,ξ¯¯2,τ),\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\text{CV}_{\text{finite}}(K,n)+K\times\text{CV}_{\text{Gaussian}}\left(d,\Sigma_{K}(\mathcal{P}),n,T\right)\\ +\text{CV}_{\text{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau),

where

CVfinite(K,n)=4log(2K)αn\text{CV}_{\text{finite}}(K,n)=\frac{4\log(2K)}{\alpha n}

is the convergence term in the finite case (to find the centers of the mixtures), and

CVGaussian(d,K,ΣK(𝒫),n,T)={8LdT+4αTif ΣK(𝒫)nT2;2dαnlog(1+4αLn)+4ΣK(𝒫)αnotherwise;\text{CV}_{\text{Gaussian}}\left(d,K,\Sigma_{K}(\mathcal{P}),n,T\right)=\left\{\begin{array}[]{ll}\frac{8Ld}{T}+\frac{4}{\alpha T}&\mbox{if }\ \Sigma_{K}(\mathcal{P})\leq\frac{n}{T^{2}};\\ \frac{2d}{\alpha n}\log\left(1+4\alpha Ln\right)+\frac{4\Sigma_{K}(\mathcal{P})}{\alpha n}&\mbox{otherwise;}\end{array}\right.

is equal to the convergence term in the Gaussian case (with one component); it is $O\left(\frac{1}{T}\right)$ if there exist $K$ points such that the distribution $\mathcal{P}$ is concentrated at rate $\epsilon=\frac{n}{T^{2}}$ around those $K$ points, and $O\left(\frac{d\log n}{n}\right)$ otherwise. The remaining term is

CVmeta(T,n,d,K,b¯¯,ξ¯¯2,τ)=2Klog(2K)βT+1βTk=1Kτk2ξ¯¯k2+dβTk=1Klog(1+4ξ¯¯k2βT2αn)+4βTk=1K(logTb¯¯k+b¯¯k1)\text{CV}_{\text{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)=\frac{2K\log(2K)}{\beta T}+\frac{1}{\beta T}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{\beta T}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T^{2}}{\alpha n}\right)\\ +\frac{4}{\beta T}\sum_{k=1}^{K}\left(\log\frac{T}{\bar{\bar{b}}_{k}}+\bar{\bar{b}}_{k}-1\right)

and it is the convergence term at the meta level. Please note that it is $O\left(\frac{dK\log T}{T}\right)$.
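
As a quick sanity check on the first case in the definition of $\text{CV}_{\text{Gaussian}}$ above (obtained with $b_{k}=T$ for every $k$), note that

\frac{d}{2\alpha n}\log\left(\frac{4\alpha Ln}{T}+1\right)\leq\frac{2Ld}{T}\qquad\text{and}\qquad\frac{T\,\Sigma_{K}(\mathcal{P})}{\alpha n}\leq\frac{1}{\alpha T}\ \text{ when }\ \Sigma_{K}(\mathcal{P})\leq\frac{n}{T^{2}},

which, once multiplied by the overall factor $4$ and summed over the $K$ components, gives the $K\left(\frac{8Ld}{T}+\frac{4}{\alpha T}\right)$ contribution.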

G.4 What if the number of components in the mixture is unknown?

In practice, we do not know in advance how to choose the number of components $K$ in the prior. In this case, we include in $\mathcal{M}$ all the (finite) mixtures of Gaussians, i.e.,

={pw,μ,σ2=k=1+wki=1d𝒩(μk,i,σ2k,i):K1:kK+1,wk=0}.\mathcal{M}=\left\{p_{w,\mu,\sigma^{2}}=\sum_{k=1}^{+\infty}w_{k}\bigotimes_{i=1}^{d}\mathcal{N}(\mu_{k,i},\sigma^{2}_{k,i}):\exists K\geq 1:\forall k\geq K+1,w_{k}=0\right\}.

Note that the sum inside the definition of $\mathcal{M}$ is finite, since $w_{k}=0$ for any $k$ beyond a certain rank $K$. We still denote by $\pi=\sum_{k=1}^{\bar{K}}\bar{w}_{k}\mathcal{N}(\bar{\mu}_{k},\bar{\sigma}_{k}^{2}I_{d})$ the prior in each task; by definition, $\bar{w}_{k}=0$ for any $k\geq\bar{K}+1$. It still holds that, for any $w,\mu,\sigma^{2}$,

KL(pw,μ,σ2π)KL(ww¯)+12k=1wki=1d((μk,iμ¯k,i)2σ¯k2+σk,i2σ¯k21+logσ¯k2σk,i2),\mathrm{KL}(p_{w,\mu,\sigma^{2}}\|\pi)\leq\mathrm{KL}(w\|\bar{w})+\frac{1}{2}\sum_{k=1}^{\infty}w_{k}\sum_{i=1}^{d}\left(\frac{(\mu_{k,i}-\bar{\mu}_{k,i})^{2}}{\bar{\sigma}_{k}^{2}}+\frac{\sigma_{k,i}^{2}}{\bar{\sigma}_{k}^{2}}-1+\log\frac{\bar{\sigma}_{k}^{2}}{\sigma_{k,i}^{2}}\right),

where we denoted

KL(ww¯)=k=1wklogwkw¯k.\mathrm{KL}(w\|\bar{w})=\sum_{k=1}^{\infty}w_{k}\log\frac{w_{k}}{\bar{w}_{k}}.

To be clear, the KL remains identical to the case where $K$ is known, except that the sums over $k$ no longer stop at a pre-determined $K$. This difference aside, the bound remains identical to the one in the case where $K$ is known, and the bound from Theorem 1 becomes, at $t=T+1$,

𝔼ST+1𝔼θρT+1(π,α)[RPT+1(θ)]RPT+12infw,μ,σ2{𝔼θpw,μ,σ2[RPT+1(θ)]RPT+1+KL(ww¯)αn+12αnk=1wki=1d((μk,iμ¯k,i)2σ¯k2+σk,i2σ¯k21+logσ¯k2σk,i2)}.\mathbb{E}_{S_{T+1}}\mathbb{E}_{\theta\sim\rho_{T+1}(\pi,\alpha)}[R_{P_{T+1}}(\theta)]-R^{*}_{P_{T+1}}\leq 2\inf_{w,\mu,\sigma^{2}}\Bigg{\{}\mathbb{E}_{\theta\sim p_{w,\mu,\sigma^{2}}}[R_{P_{T+1}}(\theta)]-R_{P_{T+1}}^{*}+\frac{\mathrm{KL}(w\|\bar{w})}{\alpha n}\\ +\frac{1}{2\alpha n}\sum_{k=1}^{\infty}w_{k}\sum_{i=1}^{d}\left(\frac{(\mu_{k,i}-\bar{\mu}_{k,i})^{2}}{\bar{\sigma}_{k}^{2}}+\frac{\sigma_{k,i}^{2}}{\bar{\sigma}_{k}^{2}}-1+\log\frac{\bar{\sigma}_{k}^{2}}{\sigma_{k,i}^{2}}\right)\Bigg{\}}.

Next, we are going to define a prior on $K$ within the prior on priors, as follows. We assume that the number of mixture components $K$ is at most $T$: even if it were larger, it would be impossible to identify all the components with enough confidence. We define the set of priors on priors

𝒢={qx,δ,τ,ξ2,b=qx×qδ,τ,ξ2,b|K},\mathcal{G}=\Bigg{\{}q_{x,\delta,\tau,\xi^{2},b}=q_{x}\times q_{\delta,\tau,\xi^{2},b|K}\Bigg{\}},

where qx=Mult(x1,,xT)q_{x}=\mathrm{Mult}(x_{1},\dots,x_{T}) is the prior distribution on KK and

qδ,τ,ξ2,b|K=Dir(δ1,,δK)k[K]i[d]𝒩(τk,i,ξk2)k=1KΓ(2,bk).q_{\delta,\tau,\xi^{2},b|K}=\mathrm{Dir}(\delta_{1},\dots,\delta_{K})\otimes\bigotimes_{\begin{subarray}{c}k\in[K]\\ i\in[d]\end{subarray}}\mathcal{N}(\tau_{k,i},\xi_{k}^{2})\otimes\bigotimes_{k=1}^{K}\Gamma(2,b_{k}).

We then set the prior on priors $\Lambda=q_{\frac{1}{T}1_{T},1_{K},0,\bar{\bar{\xi}}^{2},\bar{\bar{b}}}$. We also need to re-compute the KL divergence between the priors on priors, which becomes

\mathrm{KL}(q_{x,\delta,\tau,\xi^{2},b}\|\Lambda)=\log T-H(x)+\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{]},

using (16). Please note that in any optimization on 𝒢\mathcal{G}, we optimize first in (x1,,xT)(x_{1},\dots,x_{T}) and then on δ,τ,ξ2,b\delta,\tau,\xi^{2},b conditionally on KK. This means that the latter parameters are allowed to depend on KK. While the infimum on 𝒢\mathcal{G} of any quantity should be written infxinfδ,τ,ξ2,bσ(K)\inf_{x}\inf_{\delta,\tau,\xi^{2},b\in\sigma(K)}, we will adopt the shortcut notation infx,δ,τ,ξ2,b\inf_{x,\delta,\tau,\xi^{2},b}. We can next use the bound from Theorem 5 and we have

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx,δ,τ,ξ2,b{𝔼(w¯,μ¯,σ¯2)qx,δ,τ,ξ2,b𝔼PT+1[infw,σ2{Lk=1wkσk2+KL(ww¯)αn+12αnk=1wki=1d((μk,iμ¯k,i)2σ¯k2+σk,i2σ¯k21+logσ¯k2σk,i2)}]+12βT(logTH(x))+12βT𝔼KMult(x)[k=1K(δk1)(ψ(δk)ψ(1δ))+logΓ(1δ)Γ(K)×k=1KΓ(δk)+12k,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+2k=1K(logbkb¯¯k+b¯¯kbkbk)]}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\\ 4\inf_{x,\delta,\tau,\xi^{2},b}\Bigg{\{}\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{x,\delta,\tau,\xi^{2},b}}\mathbb{E}_{P_{T+1}}\Bigg{[}\inf_{w,\sigma^{2}}\Bigg{\{}L\sum_{k=1}^{\infty}w_{k}\|\sigma_{k}\|^{2}+\frac{\mathrm{KL}(w\|\bar{w})}{\alpha n}\\ +\frac{1}{2\alpha n}\sum_{k=1}^{\infty}w_{k}\sum_{i=1}^{d}\left(\frac{(\mu_{k,i}-\bar{\mu}_{k,i})^{2}}{\bar{\sigma}_{k}^{2}}+\frac{\sigma_{k,i}^{2}}{\bar{\sigma}_{k}^{2}}-1+\log\frac{\bar{\sigma}_{k}^{2}}{\sigma_{k,i}^{2}}\right)\Bigg{\}}\Bigg{]}\\ +\frac{1}{2\beta T}\left(\log T-H(x)\right)+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{]}\Bigg{\}}.
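
For completeness, the term $\log T-H(x)$ appearing above is simply the KL divergence between the distribution of $K$ and its uniform prior:

\mathrm{KL}\left(\mathrm{Mult}(x)\,\middle\|\,\mathrm{Mult}\left(\tfrac{1}{T}1_{T}\right)\right)=\sum_{K=1}^{T}x_{K}\log\left(Tx_{K}\right)=\log T-H(x),

where $H(x)=-\sum_{K=1}^{T}x_{K}\log x_{K}$ is the Shannon entropy of $x$.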

The optimization on σk,i2\sigma_{k,i}^{2} may be performed exactly by setting σk,i2=σ¯k22αLnσ¯k2+1\sigma_{k,i}^{2}=\frac{\bar{\sigma}_{k}^{2}}{2\alpha Ln\bar{\sigma}_{k}^{2}+1}, and the bound becomes

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx,δ,τ,ξ2,b{𝔼(w¯,μ¯,σ¯2)qx,δ,τ,ξ2,b𝔼PT+1[infw{KL(ww¯)αn+d2αnk=1wklog(2αLnσ¯k2+1)+12αnk=1wkμPT+1μ¯k2σ¯k2}]+12βT(logTH(x))+12βT𝔼KMult(x)[k=1K(δk1)(ψ(δk)ψ(1δ))+logΓ(1δ)Γ(K)×k=1KΓ(δk)+12k,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+2k=1K(logbkb¯¯k+b¯¯kbkbk)]}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x,\delta,\tau,\xi^{2},b}\Bigg{\{}\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{x,\delta,\tau,\xi^{2},b}}\mathbb{E}_{P_{T+1}}\Bigg{[}\\ \inf_{w}\left\{\frac{\mathrm{KL}(w\|\bar{w})}{\alpha n}+\frac{d}{2\alpha n}\sum_{k=1}^{\infty}w_{k}\log\left(2\alpha Ln\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\sum_{k=1}^{\infty}w_{k}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right\}\Bigg{]}\\ +\frac{1}{2\beta T}\left(\log T-H(x)\right)+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{]}\Bigg{\}}.

We restrict the optimization in ww to the set {(wk)k1:k0K¯:wk0=1,kk0,wk=0}\{(w_{k})_{k\geq 1}:\exists k_{0}\leq\bar{K}:w_{k_{0}}=1,\forall k\neq k_{0},w_{k}=0\}, and the bound becomes

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx,δ,τ,ξ2,b{𝔼(w¯,μ¯,σ¯2)qx,δ,τ,ξ2,b𝔼PT+1[mink[K]{1αnlog1w¯k+d2αnlog(2αLnσ¯k2+1)+12αnμPT+1μ¯k2σ¯k2}]+12βT(logTH(x))+12βT𝔼KMult(x)[k=1K(δk1)(ψ(δk)ψ(1δ))+logΓ(1δ)Γ(K)×k=1KΓ(δk)+12k,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+2k=1K(logbkb¯¯k+b¯¯kbkbk)]}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x,\delta,\tau,\xi^{2},b}\Bigg{\{}\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{x,\delta,\tau,\xi^{2},b}}\mathbb{E}_{P_{T+1}}\Bigg{[}\\ \min_{k\in[K]}\left\{\frac{1}{\alpha n}\log\frac{1}{\bar{w}_{k}}+\frac{d}{2\alpha n}\log\left(2\alpha Ln\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right\}\Bigg{]}\\ +\frac{1}{2\beta T}\left(\log T-H(x)\right)+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{]}\Bigg{\}}.

Next, we classically decompose the expectation 𝔼(w¯,μ¯,σ¯2)qx,δ,τ,ξ2,b[X]\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{x,\delta,\tau,\xi^{2},b}}[X] as

𝔼(w¯,μ¯,σ¯2)qx,δ,τ,ξ2,b[X]=𝔼KMult(x)[𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b|K[X]].\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{x,\delta,\tau,\xi^{2},b}}[X]=\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b|K}}[X]\right].

Applying Fubini's theorem and exchanging the minimum over $k$ and the inner expectation yields the bound

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx,δ,τ,ξ2,b{𝔼KMult(x)𝔼PT+1[mink[K]{𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b|K[1αnlog1w¯k+d2αnlog(2αLnσ¯k2+1)+12αnμPT+1μ¯k2σ¯k2]}]+12βT(logTH(x))+12βT𝔼KMult(x)[k=1K(δk1)(ψ(δk)ψ(1δ))+logΓ(1δ)Γ(K)×k=1KΓ(δk)+12k,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+2k=1K(logbkb¯¯k+b¯¯kbkbk)]}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x,\delta,\tau,\xi^{2},b}\Bigg{\{}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\mathbb{E}_{P_{T+1}}\Bigg{[}\min_{k\in[K]}\Bigg{\{}\\ \mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b|K}}\left[\frac{1}{\alpha n}\log\frac{1}{\bar{w}_{k}}+\frac{d}{2\alpha n}\log\left(2\alpha Ln\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right]\Bigg{\}}\Bigg{]}\\ +\frac{1}{2\beta T}\left(\log T-H(x)\right)+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{]}\Bigg{\}}.

The bound (27) from the previous part still holds:

mink[K]{𝔼(w¯,μ¯,σ¯2)qδ,τ,ξ2,b|K[1αnlog1w¯k+d2αnlog(2αLnσ¯k2+1)+12αnμPT+1μ¯k2σ¯k2]}d2αnk=1Klog(4αLnbk+1)+k=1Kbkdξk2αn+1αnmink[K]{bkμPT+1τk2+log1δ1δk1},\min_{k\in[K]}\left\{\mathbb{E}_{(\bar{w},\bar{\mu},\bar{\sigma}^{2})\sim q_{\delta,\tau,\xi^{2},b|K}}\left[\frac{1}{\alpha n}\log\frac{1}{\bar{w}_{k}}+\frac{d}{2\alpha n}\log\left(2\alpha Ln\bar{\sigma}_{k}^{2}+1\right)+\frac{1}{2\alpha n}\frac{\|\mu_{P_{T+1}}-\bar{\mu}_{k}\|^{2}}{\bar{\sigma}_{k}^{2}}\right]\right\}\leq\\ \frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)+\sum_{k=1}^{K}\frac{b_{k}d\xi_{k}^{2}}{\alpha n}+\frac{1}{\alpha n}\min_{k\in[K]}\left\{b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+\log\frac{1^{\top}\delta-1}{\delta_{k}-1}\right\},

and we can inject it in the computation so that the bound becomes

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx,δ,τ,ξ2,b{d2αn𝔼KMult(x)[k=1Klog(4αLnbk+1)]+dαn𝔼KMult(x)[k=1Kbkξk2]+1αn𝔼KMult(x)𝔼PT+1[mink[K]{bkμPT+1τk2+log1δ1δk1}]+12βT(logTH(x))+12βT𝔼KMult(x)[k=1K(δk1)(ψ(δk)ψ(1δ))+logΓ(1δ)Γ(K)×k=1KΓ(δk)+12k,i(τk,i2ξ¯¯k2+ξk2ξ¯¯k21+logξ¯¯k2ξk2)+2k=1K(logbkb¯¯k+b¯¯kbkbk)]}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x,\delta,\tau,\xi^{2},b}\Bigg{\{}\frac{d}{2\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)\right]\\ +\frac{d}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\sum_{k=1}^{K}b_{k}\xi_{k}^{2}\right]+\frac{1}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\mathbb{E}_{P_{T+1}}\left[\min_{k\in[K]}\left\{b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+\log\frac{1^{\top}\delta-1}{\delta_{k}-1}\right\}\right]\\ +\frac{1}{2\beta T}\left(\log T-H(x)\right)+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\\ +\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}+\frac{1}{2}\sum_{k,i}\left(\frac{\tau_{k,i}^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{\xi_{k}^{2}}{\bar{\bar{\xi}}_{k}^{2}}-1+\log\frac{\bar{\bar{\xi}}_{k}^{2}}{\xi_{k}^{2}}\right)+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{]}\Bigg{\}}.

An exact optimization in ξk2\xi_{k}^{2} yields ξk2=ξ¯¯k21+4bkξ¯¯k2βTαn\xi_{k}^{2}=\frac{\bar{\bar{\xi}}_{k}^{2}}{1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}} and we can replace in the bound

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx,δ,τ,b{d2αn𝔼KMult(x)[k=1Klog(4αLnbk+1)]+1αn𝔼KMult(x)𝔼PT+1[mink[K]{bkμPT+1τk2+log1δ1δk1}]+12βT(logTH(x))+12βT𝔼KMult(x)[k=1K(δk1)(ψ(δk)ψ(1δ))+logΓ(1δ)Γ(K)×k=1KΓ(δk)+12k=1Kτk2ξ¯¯k2+d2k=1Klog(1+4bkξ¯¯k2βTαn)+2k=1K(logbkb¯¯k+b¯¯kbkbk)]}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x,\delta,\tau,b}\Bigg{\{}\frac{d}{2\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)\right]\\ +\frac{1}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\mathbb{E}_{P_{T+1}}\left[\min_{k\in[K]}\left\{b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}+\log\frac{1^{\top}\delta-1}{\delta_{k}-1}\right\}\right]+\frac{1}{2\beta T}\left(\log T-H(x)\right)\\ +\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)+\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}\\ +\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{]}\Bigg{\}}.

We choose $\delta_{k}=2$ for any $k\geq 1$; noting that both

k=1K(δk1)(ψ(δk)ψ(1δ))0\sum_{k=1}^{K}(\delta_{k}-1)\left(\psi(\delta_{k})-\psi(1^{\top}\delta)\right)\leq 0

and

logΓ(1δ)Γ(K)×k=1KΓ(δk)=logΓ(2K)Γ(K)Klog(2K),\log\frac{\Gamma(1^{\top}\delta)}{\Gamma(K)\times\prod_{k=1}^{K}\Gamma(\delta_{k})}=\log\frac{\Gamma(2K)}{\Gamma(K)}\leq K\log(2K),

we deduce the following bound:

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx,τ,b{d2αn𝔼KMult(x)[k=1Klog(4αLnbk+1)]+1αn𝔼KMult(x)𝔼PT+1[mink[K]{bkμPT+1τk2}]+1αn𝔼KMult(x)[log(2K)]+12βT(logTH(x))+12βT𝔼KMult(x)[Klog(2K)+12k=1Kτk2ξ¯¯k2+d2k=1Klog(1+4bkξ¯¯k2βTαn)+2k=1K(logbkb¯¯k+b¯¯kbkbk)]}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x,\tau,b}\Bigg{\{}\frac{d}{2\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)\right]\\ +\frac{1}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\mathbb{E}_{P_{T+1}}\left[\min_{k\in[K]}\left\{b_{k}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}\right\}\right]\\ +\frac{1}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\log(2K)\right]+\frac{1}{2\beta T}\left(\log T-H(x)\right)+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}K\log(2K)\\ +\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{]}\Bigg{\}}.

Recall the (unchanged) definition of ΣK(𝒫)\Sigma_{K}(\mathcal{P}):

ΣK(𝒫)=infτ1,,τK𝔼PT+1𝒫[mink[K]μPT+1τk2].\Sigma_{K}(\mathcal{P})=\inf_{\tau_{1},\dots,\tau_{K}}\mathbb{E}_{P_{T+1}\sim\mathcal{P}}\left[\min_{k\in[K]}\|\mu_{P_{T+1}}-\tau_{k}\|^{2}\right].

Recalling that $\tau$ (as well as $b$) is allowed to depend on $K$, we define $(\tau_{1},\dots,\tau_{K})$ as a minimizer (up to a permutation) in the definition of $\Sigma_{K}(\mathcal{P})$. It follows that the bound becomes

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx,b{d2αn𝔼KMult(x)[k=1Klog(4αLnbk+1)]+1αn𝔼KMult(x)[k=1KbkΣK(𝒫)+log(2K)]+12βT(logTH(x))+12βT𝔼KMult(x)[Klog(2K)+12k=1Kτk2ξ¯¯k2+d2k=1Klog(1+4bkξ¯¯k2βTαn)+2k=1K(logbkb¯¯k+b¯¯kbkbk)]}.\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x,b}\Bigg{\{}\frac{d}{2\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)\right]\\ +\frac{1}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\sum_{k=1}^{K}b_{k}\Sigma_{K}(\mathcal{P})+\log(2K)\right]+\frac{1}{2\beta T}\left(\log T-H(x)\right)\\ +\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}K\log(2K)+\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)\\ +2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg{]}\Bigg{\}}.

From here, we are going to distinguish two different cases, as we did before:

  • first case: there exists a reasonably small $K$ such that $\Sigma_{K}(\mathcal{P})\leq d\epsilon$;

  • second case: for any reasonable $K$, $\Sigma_{K}(\mathcal{P})>d\epsilon$.

G.4.1 First Case

Again, this is the case where we expect some improvement from the meta-learning. Setting

k[K],bk=T,\forall k\in[K],b_{k}=T,

the bound becomes

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx{d2αn𝔼KMult(x)[k=1Klog(4αLnT+1)]+1αn𝔼KMult(x)[KTΣK(𝒫)+log(2K)]+12βT(logTH(x))+12βT𝔼KMult(x)[Klog(2K)+12k=1Kτk2ξ¯¯k2+d2k=1Klog(1+4ξ¯¯k2βT2αn)+2k=1K(logTb¯¯k+b¯¯kTT)]},\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x}\Bigg{\{}\frac{d}{2\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{T}+1\right)\right]\\ +\frac{1}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[KT\Sigma_{K}(\mathcal{P})+\log(2K)\right]+\frac{1}{2\beta T}\left(\log T-H(x)\right)+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}K\log(2K)\\ +\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T^{2}}{\alpha n}\right)+2\sum_{k=1}^{K}\left(\log\frac{T}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-T}{T}\right)\Bigg{]}\Bigg{\}},

and after the simplification

d2αnk=1Klog(4αLnT+1)2LdKT,\frac{d}{2\alpha n}\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{T}+1\right)\leq\frac{2LdK}{T},

we deduce

𝔼P1,,PT𝔼S1,,ST𝔼πΠ^[(π)]4infx{2LdT𝔼KMult(x)[K]+1αn𝔼KMult(x)[log(2K)]+Tαn𝔼KMult(x)[KΣK(𝒫)]+12βT(logTH(x))+12βT𝔼KMult(x)[Klog(2K)+12k=1Kτk2ξ¯¯k2+d2k=1Klog(1+4ξ¯¯k2βT2αn)+2k=1K(logTb¯¯k+b¯¯kTT)]},\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x}\Bigg{\{}\frac{2Ld}{T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[K\right]+\frac{1}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\log(2K)\right]\\ +\frac{T}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[K\Sigma_{K}(\mathcal{P})\right]+\frac{1}{2\beta T}\left(\log T-H(x)\right)+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg{[}K\log(2K)\\ +\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T^{2}}{\alpha n}\right)+2\sum_{k=1}^{K}\left(\log\frac{T}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-T}{T}\right)\Bigg{]}\Bigg{\}},

Then, we restrict the infimum over all multinomial distributions $\mathrm{Mult}(x_{1},\dots,x_{T})$ to the Dirac masses, i.e., to all the $(x_{1},\dots,x_{T})$ for which there exists $K\in\{1,\dots,T\}$ such that $x_{K}=1$. For such an $x$, $H(x)=0$, and we deduce that

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\inf_{K\in[T]}\Bigg\{\frac{4\log(2K)}{\alpha n}+\frac{8LdK}{T}+\frac{4TK\Sigma_{K}(\mathcal{P})}{\alpha n}+\frac{2\log T}{\beta T}\\
+\frac{2}{\beta T}\Bigg[K\log(2K)+\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T^{2}}{\alpha n}\right)+2\sum_{k=1}^{K}\left(\log\frac{T}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-T}{T}\right)\Bigg]\Bigg\}.
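As a quick check of this specialization (it only unpacks the restriction to Dirac masses): if $x$ puts mass $1$ on some fixed $K$, then, with the convention $0\log 0=0$,

H(x)=-\sum_{j=1}^{T}x_{j}\log x_{j}=-1\cdot\log 1=0,\qquad\mathbb{E}_{K'\sim\mathrm{Mult}(x)}\left[f(K')\right]=\sum_{j=1}^{T}x_{j}f(j)=f(K)

for any function $f$, so each expectation over $\mathrm{Mult}(x)$ in the previous display is simply evaluated at the chosen $K$, and the factor $4$ is distributed over the terms.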

In particular, if there exists a relatively small $K$ such that $\Sigma_{K}(\mathcal{P})\leq d\epsilon=\frac{dn}{T^{2}}$, then the bound becomes

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\inf_{K\in[T]}\Bigg\{\frac{4\log(2K)}{\alpha n}+\frac{8LdK}{T}+\frac{4dK}{\alpha T}+\frac{2\log T}{\beta T}\\
+\frac{2}{\beta T}\Bigg[K\log(2K)+\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T^{2}}{\alpha n}\right)+2\sum_{k=1}^{K}\left(\log\frac{T}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-T}{T}\right)\Bigg]\Bigg\}.
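For the record, the only change with respect to the previous display is in the third term, where the assumption $\Sigma_{K}(\mathcal{P})\leq d\epsilon=\frac{dn}{T^{2}}$ yields

\frac{4TK\Sigma_{K}(\mathcal{P})}{\alpha n}\leq\frac{4TK}{\alpha n}\cdot\frac{dn}{T^{2}}=\frac{4dK}{\alpha T}.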

G.4.2 Second Case

Recall the bound:

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x,b}\Bigg\{\frac{d}{2\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\sum_{k=1}^{K}\log\left(\frac{4\alpha Ln}{b_{k}}+1\right)\right]\\
+\frac{1}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[\sum_{k=1}^{K}b_{k}\Sigma_{K}(\mathcal{P})+\log(2K)\right]+\frac{1}{2\beta T}\left(\log T-H(x)\right)\\
+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg[K\log(2K)+\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4b_{k}\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)\\
+2\sum_{k=1}^{K}\left(\log\frac{b_{k}}{\bar{\bar{b}}_{k}}+\frac{\bar{\bar{b}}_{k}-b_{k}}{b_{k}}\right)\Bigg]\Bigg\}.

In this case, we do not expect much improvement from meta-learning, and we will simply choose

\forall k\in\{1,\dots,T\},\quad b_{k}=1.

The bound then becomes

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq 4\inf_{x}\Bigg\{\frac{d\log\left(1+4\alpha Ln\right)}{2\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[K\right]\\
+\frac{1}{\alpha n}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\left[K\Sigma_{K}(\mathcal{P})+\log(2K)\right]+\frac{1}{2\beta T}\left(\log T-H(x)\right)+\frac{1}{2\beta T}\mathbb{E}_{K\sim\mathrm{Mult}(x)}\Bigg[K\log(2K)\\
+\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)+2\sum_{k=1}^{K}\left(\log\frac{1}{\bar{\bar{b}}_{k}}+\bar{\bar{b}}_{k}-1\right)\Bigg]\Bigg\}.

As before, we restrict the minimization to the set of Dirac distributions and deduce

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\\
\leq\inf_{K\in[T]}\Bigg\{\frac{4\log(2K)}{\alpha n}+\frac{2dK\log\left(1+4\alpha Ln\right)}{\alpha n}+\frac{4K\Sigma_{K}(\mathcal{P})}{\alpha n}+\frac{2\log T}{\beta T}\\
+\frac{2}{\beta T}\left[K\log(2K)+\frac{1}{2}\sum_{k=1}^{K}\frac{\|\tau_{k}\|^{2}}{\bar{\bar{\xi}}_{k}^{2}}+\frac{d}{2}\sum_{k=1}^{K}\log\left(1+\frac{4\bar{\bar{\xi}}_{k}^{2}\beta T}{\alpha n}\right)+2\sum_{k=1}^{K}\left(\log\frac{1}{\bar{\bar{b}}_{k}}+\bar{\bar{b}}_{k}-1\right)\right]\Bigg\}.
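As a side remark on signs, the last sum in this bound is a nonnegative penalty: for every $u>0$,

\log\frac{1}{u}+u-1=u-1-\log u\geq 0,

with equality if and only if $u=1$, so each term $\log\frac{1}{\bar{\bar{b}}_{k}}+\bar{\bar{b}}_{k}-1$ vanishes exactly when $\bar{\bar{b}}_{k}=1$.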

G.4.3 Overall Bound

In summary, the bound can be written as

\mathbb{E}_{P_{1},\dots,P_{T}}\mathbb{E}_{S_{1},\dots,S_{T}}\mathbb{E}_{\pi\sim\hat{\Pi}}[\mathcal{E}(\pi)]-\mathcal{E}^{*}\leq\\
\inf_{K\in[T]}\Bigg\{\text{CV}_{\text{finite}}(K,n)+K\times\text{CV}_{\text{Gaussian}}\left(d,\Sigma_{K}(\mathcal{P}),n,T\right)+\text{CV}_{\text{meta}}^{\text{unknown}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)\Bigg\},

where $\text{CV}_{\text{finite}}(K,n)$ and $\text{CV}_{\text{Gaussian}}\left(d,\Sigma_{K}(\mathcal{P}),n,T\right)$ are exactly the same terms as in the case where the number of mixture components $K$ is known, and the convergence term at the meta level becomes

\text{CV}_{\text{meta}}^{\text{unknown}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)=\text{CV}_{\text{meta}}(T,n,d,K,\bar{\bar{b}},\bar{\bar{\xi}}^{2},\tau)+\frac{2\log T}{\beta T}.

Even when the number of mixture components $K$ is unknown, the same bound as in the case of known $K$ can be achieved up to an additional $\frac{2\log T}{\beta T}$ term, which corresponds to the cost of finding the optimal number of components in the mixture at the meta level.
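As a purely illustrative sketch, the first-case bound of Section G.4.1 is explicit enough to be evaluated numerically once the quantities $\Sigma_{K}(\mathcal{P})$, $\|\tau_{k}\|^{2}$, $\bar{\bar{\xi}}_{k}^{2}$ and $\bar{\bar{b}}_{k}$ are supplied for a candidate $K$; the function below, its name and its calling convention are assumptions made only for this illustration and are not part of the analysis.

import math
from typing import Sequence

def first_case_bound(K: int, T: int, n: int, d: int,
                     alpha: float, beta: float, L: float,
                     Sigma_K: float,
                     tau_sq_norms: Sequence[float],
                     xi_bar_sq: Sequence[float],
                     b_bar: Sequence[float]) -> float:
    # Illustrative only: evaluates the right-hand side of the first-case bound
    # (b_k = T, Dirac mass at a fixed K).  The three sequences hold
    # ||tau_k||^2, xi-double-bar_k^2 and b-double-bar_k for k = 1, ..., K.
    main = (4 * math.log(2 * K) / (alpha * n)
            + 8 * L * d * K / T
            + 4 * T * K * Sigma_K / (alpha * n)
            + 2 * math.log(T) / (beta * T))
    bracket = K * math.log(2 * K)
    bracket += 0.5 * sum(t / x for t, x in zip(tau_sq_norms, xi_bar_sq))
    bracket += 0.5 * d * sum(math.log(1 + 4 * x * beta * T ** 2 / (alpha * n))
                             for x in xi_bar_sq)
    bracket += 2 * sum(math.log(T / b) + (b - T) / T for b in b_bar)
    return main + 2 * bracket / (beta * T)

Minimizing over $K\in[T]$ then amounts to evaluating this quantity for each candidate $K$, with its associated component-wise quantities, and keeping the smallest value.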