
Learning Algorithm Generalization Error Bounds via Auxiliary Distributions

Gholamali Aminian\star, Saeed Masiha\star, Laura Toni, and Miguel R. D. Rodrigues. \star Equal contribution. One of the ideas of this work was presented, in part, at the 2020 IEEE Information Theory Workshop (ITW) [1]. G. Aminian is with the Alan Turing Institute, London NW1 2DB, U.K. (e-mail: [email protected]); L. Toni and M. R. D. Rodrigues are with the Electronic and Electrical Engineering Department, University College London, London WC1E 6BT, U.K. (e-mail: {l.toni, m.rodrigues}@ucl.ac.uk); and S. Masiha is with the École Polytechnique Fédérale de Lausanne, Lausanne 1015, Switzerland (e-mail: [email protected]).
Abstract

Generalization error bounds are essential for understanding how well machine learning models perform. In this work, we suggest a novel method, i.e., the Auxiliary Distribution Method, that leads to new upper bounds on expected generalization errors that are appropriate for supervised learning scenarios. We show that our general upper bounds can be specialized under some conditions to new bounds involving the α-Jensen-Shannon information and the α-Rényi information (0<α<1) between a random variable modeling the set of training samples and another random variable modeling the hypothesis. Our upper bounds based on α-Jensen-Shannon information are also finite. Additionally, we demonstrate how our auxiliary distribution method can be used to derive upper bounds on the excess risk of some learning algorithms in the supervised learning context, and on the generalization error under the distribution mismatch scenario in supervised learning, where the distribution mismatch is modeled via the α-Jensen-Shannon or α-Rényi divergence between the distributions of the test and training data samples. We also outline the conditions under which our proposed upper bounds might be tighter than earlier upper bounds.

Index Terms:
Expected generalization error bounds, population risk upper bound, mutual information, α-Jensen-Shannon information, α-Rényi information, distribution mismatch.

I Introduction

Numerous methods have been proposed to characterize the generalization error of learning algorithms. These include VC-based bounds [2], algorithmic stability-based bounds [3], algorithmic robustness-based bounds [4], and PAC-Bayesian bounds [5]. Nevertheless, for a number of reasons, many of these generalization error bounds are unable to describe how different machine learning techniques generalize: some of the bounds depend only on the hypothesis class and not on the learning algorithm; existing bounds do not easily exploit dependencies between different hypotheses; or they do not exploit dependencies between the learning algorithm input and output.

More recently, methods that use information-theoretic tools have also been developed to describe the generalization of learning techniques. Such methods frequently incorporate the various components of the learning problem by expressing the expected generalization error in terms of certain information measures between the learning algorithm input (the training dataset) and output (the hypothesis). In particular, building upon pioneering work by Russo and Zou [6], Xu and Raginsky [7] have derived expected generalization error bounds involving the mutual information between the training set and the hypothesis. Bu et al. [8] have derived tighter expected generalization error bounds involving the mutual information between each individual sample in the training set and the hypothesis. Meanwhile, bounds using chaining mutual information have been proposed in [9, 10]. Other authors have also constructed information-theoretic expected generalization error bounds based on other information measures, such as the α-Rényi divergence for α>1, f-divergence, and maximal leakage [11]. In [12], an upper bound based on α-Rényi divergence for 0<α<1 is derived by using the variational representation of α-Rényi divergence. Bounds based on the Wasserstein distance between the training sample data and the output of a randomized learning algorithm are proposed in [13], [14], bounds based on the Wasserstein distance between the distribution of an individual data sample and the output of the learning algorithm are proposed in [15], and tighter upper bounds via convexity of the Wasserstein distance are proposed in [16]. Upper bounds based on conditional mutual information and individual sample conditional mutual information are proposed in [17] and [18], respectively. The combination of conditioning and processing techniques can provide tighter expected generalization error upper bounds [19]. An exact characterization of the expected generalization error for the Gibbs algorithm in terms of symmetrized KL information is provided in [20]. Finally, [21] provides information-theoretic expected generalization error upper bounds in the presence of training/test data distribution mismatch, using rate-distortion theory.

Generalization error bounds have also been developed to address scenarios where the training data distribution differs from the test data distribution, known as distribution mismatch. This scenario – which also links to out-of-distribution generalization – has attracted various contributions in recent years, such as [22, 23, 24]. In particular, Masiha et al. [21] provide information-theoretic generalization error upper bounds in the presence of training/test data distribution mismatch, using rate-distortion theory.

In this work, we propose an auxiliary distribution method (ADM) to characterize the expected generalization error upper bound of supervised learning algorithms in terms of novel information measures. Our new bounds offer three advantages over existing ones: (1) some of our bounds – such as the α-JS information ones – are always finite, whereas conventional mutual information ones (e.g., [7]) may not be; (2) in contrast to mutual information-based bounds, some of our bounds – such as the α-Rényi information ones for 0<α<1 – are finite for some deterministic supervised learning algorithms; (3) we also apply ADM to provide an upper bound on the population risk of supervised learning algorithms.

In summary, our main contributions are as follows:

  1.

    We suggest a novel method, i.e., ADM, that uses auxiliary distributions over the parameter and data sample spaces to obtain upper bounds on the expected generalization error.

  2.

    Using ADM, we derive new expected generalization error bounds expressed via α\alpha-JS\mathrm{JS} divergence, which is known to be finite.

  3.

    Using ADM, we offer an upper bound based on α\alpha-Rényi divergence for 0<α<10<\alpha<1 with the same convergence rate as the mutual information-based upper bound. Furthermore, in contrast to the mutual information-based bounds, the α\alpha-Rényi divergence bounds for 0<α<10<\alpha<1 can be finite when the hypothesis (output of the learning algorithm) is a deterministic function of at least one data sample.

  4.

    Using our upper bounds on expected generalization error, we also provide upper bounds on excess risk of some learning algorithms as solutions to regularized empirical risk minimization by α\alpha-Rényi or α\alpha-Jensen-Shannon divergences.

  5.

    Using ADM, we also provide generalization error upper bound under training and test data distribution mismatch. It turns out that training and test distribution mismatch is captured in our upper bounds via α\alpha-Jensen-Shannon or α\alpha-Rényi divergences.

It is noteworthy that, although the α-JS measure does not appear to have been used to characterize the generalization ability of learning algorithms, this information-theoretic quantity, as well as the α-Rényi measure for 0<α<1, has been employed to study some machine learning problems, including the use of

  • α\alpha-JS\mathrm{JS} as a loss function under label noise scenario [25], and Jensen-Shannon divergence ( α\alpha-JS\mathrm{JS} divergence for α=1/2\alpha=1/2) in adversarial learning [26] and active learning [27].

  • α\alpha-Rényi divergence in feature extraction  [28] and image segmentation based on clustering [29].

II Problem Formulation

II-A Notations

In this work, we adopt the following notation. Calligraphic letters denote spaces (e.g., 𝒵), upper-case letters denote random variables (e.g., Z), and lower-case letters denote realizations of random variables (e.g., z). We denote the distribution of a random variable Z by P_Z, the joint distribution of two random variables (Z_1, Z_2) by P_{Z_1,Z_2}, and the α-convex combination of the joint distribution P_{Z_1,Z_2} and the product of the two marginals P_{Z_1}⊗P_{Z_2}, i.e., αP_{Z_1}⊗P_{Z_2}+(1−α)P_{Z_1,Z_2} for α∈(0,1), by P^{(α)}_{Z_1,Z_2}. The set of distributions (measures) over a space 𝒳 is denoted by 𝒫(𝒳). We denote the derivative of a real-valued function f(x) with respect to its argument by f′(⋅). We adopt the notation log(⋅) for the natural logarithm. A function f is L_f-Lipschitz if |f(x_1)−f(x_2)|≤L_f‖x_1−x_2‖_2, where ‖⋅‖_2 is the L_2-norm. 𝒩(a,B) denotes the Gaussian distribution over ℝ^d with mean a∈ℝ^d and covariance matrix B∈ℝ^{d×d}.

II-B Framework of Statistical Learning

We analyze a standard supervised learning setting where we wish to learn a hypothesis given a set of input-output examples that can then be used to predict a new output given a new input.

In particular, in order to formalize this setting, we model the input data (also known as features) using a random variable X𝒳X\in\mathcal{X} where 𝒳\mathcal{X} is the input space, and we model the output data (also known as predictors or labels) using a random variable Y𝒴Y\in\mathcal{Y} where 𝒴\mathcal{Y} is the output space. We also model input-output data pairs using a random variable Z=(X,Y)𝒵=𝒳×𝒴Z=(X,Y)\in\mathcal{Z}=\mathcal{X}\times\mathcal{Y} where ZZ is drawn from 𝒵\mathcal{Z} per some unknown distribution μ\mu. We also let S={Zi}i=1nS=\{Z_{i}\}_{i=1}^{n} be a training set consisting of nn input-output data points drawn i.i.d. from 𝒵\mathcal{Z} according to μ\mathcal{\mu}.

Our goal is to learn a parameterized function, fW:𝒳𝒴f_{W}:\mathcal{X}\rightarrow\mathcal{Y}, where the parameters are a random variable W𝒲dW\in\mathcal{W}\subset\mathbb{R}^{d} and 𝒲\mathcal{W} is a parameter space. Finally, we represent a learning algorithm via a Markov kernel that maps a given training set SS onto parameter WW defined on the parameter space 𝒲\mathcal{W} according to the probability law PW|SP_{W|S}.

We introduce a (non-negative) loss function :𝒲×𝒵+\ell:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}^{+} that measures how well a hypothesis (parameterized function) predicts an output given an input. We can define the population risk and the empirical risk associated with a given hypothesis as follows:

Lμ(w):=𝒵(w,z)𝑑μ(z),\displaystyle L_{\mu}(w):=\int_{\mathcal{Z}}\ell(w,z)d\mu(z), (1)
LE(w,s):=1ni=1n(w,zi),\displaystyle L_{E}(w,s):=\frac{1}{n}\sum_{i=1}^{n}\ell(w,z_{i}), (2)

respectively. We can also define the (expected) generalization error,

gen¯(PW|S,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu) :=𝔼PW,S[gen(W,S,μ)],\displaystyle:=\mathbb{E}_{P_{W,S}}[\text{gen}(W,S,\mu)], (3)

where gen(w,s,μ):=Lμ(w)LE(w,s)\text{gen}(w,s,\mu):=L_{\mu}(w)-L_{E}(w,s). This (expected) generalization error quantifies by how much the population risk deviates from the empirical risk. This quantity cannot be computed directly because μ\mu is unknown, but it can often be (upper) bounded, thereby providing a means to gauge various learning algorithms’ performance. We are also interested in excess risk under the learning algorithm PW|SP_{W|S},

r(PW|S,μ)\displaystyle\mathcal{E}_{r}(P_{W|S},\mu) :=𝔼PW,S[Lμ(W)]infw𝒲Lμ(w).\displaystyle:=\mathbb{E}_{P_{W,S}}[L_{\mu}(W)]-\inf_{w\in\mathcal{W}}L_{\mu}(w). (4)

Note that the excess risk can be decomposed as follows,

r(PW|S,μ)=gen¯(PW|S,μ)+𝔼PW,S[LE(W,S)]infw𝒲Lμ(w),\begin{split}\mathcal{E}_{r}(P_{W|S},\mu)=\overline{\operatorname*{gen}}(P_{W|S},\mu)+\mathbb{E}_{P_{W,S}}[L_{E}(W,S)]-\inf_{w\in\mathcal{W}}L_{\mu}(w),\end{split}

where the first term is the expected generalization error and the second term, 𝔼_{P_{W,S}}[L_E(W,S)]−inf_{w∈𝒲}L_μ(w), is the statistical excess risk.

Furthermore, we analyse a supervised learning scenario under distribution mismatch (a.k.a. out-of-distribution), where training and test data are drawn from different distributions (μ and μ′, respectively). In particular, we define the population risk based on the test distribution μ′ as,

LP(w,μ)𝒵(w,z)𝑑μ(z).\displaystyle L_{P}(w,\mu^{\prime})\triangleq\int_{\mathcal{Z}}\ell(w,z)d\mu^{\prime}(z). (5)

We define the mismatched (expected) generalization error as

gen¯(PW|S,μ,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime}) 𝔼PW,S[gen(W,S,μ,μ)],\displaystyle\triangleq\mathbb{E}_{P_{W,S}}[\text{gen}(W,S,\mu,\mu^{\prime})], (6)

where gen(w,s,μ,μ)LP(w,μ)LE(w,s)\text{gen}(w,s,\mu,\mu^{\prime})\triangleq L_{P}(w,\mu^{\prime})-L_{E}(w,s).

Our goal in the sequel will be to derive (upper) bounds on the expected generalization errors (3) and the excess risk (4) in terms of various information-theoretic measures.
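As a concrete illustration of these quantities (our own toy example, not part of the paper's analysis), the following minimal Python sketch estimates the expected generalization error (3) by Monte Carlo for a simple setting: squared loss, Gaussian data, and ERM given by the sample mean. The closed-form value 2σ²/n for this particular case is used only as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 1.0, 20, 20000

# Squared loss: ell(w, z) = (w - z)^2; the ERM hypothesis is the sample mean.
# Population risk: L_mu(w) = sigma^2 + (w - mu)^2 (known in closed form here).
gen_gap = []
for _ in range(trials):
    s = rng.normal(mu, sigma, size=n)      # training set S drawn i.i.d. from mu
    w = s.mean()                           # W = A(S), a deterministic algorithm
    emp_risk = np.mean((w - s) ** 2)       # L_E(w, S)
    pop_risk = sigma**2 + (w - mu) ** 2    # L_mu(w)
    gen_gap.append(pop_risk - emp_risk)

# For this toy example, E[gen(W, S, mu)] = 2*sigma^2/n.
print("Monte Carlo expected generalization error:", np.mean(gen_gap))
print("Closed-form value 2*sigma^2/n            :", 2 * sigma**2 / n)
```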

II-C Auxiliary Distribution Method

We describe our main method to derive upper bounds on the expected generalization error, i.e., the ADM. Consider PP and QQ as two distributions defined on a measurable space 𝒳\mathcal{X} and let f:𝒳f:\mathcal{X}\to\mathbb{R} be a measurable function. Assume that we can use an asymmetric information measure T(PQ)T(P\|Q) between PP and QQ to construct the following upper bound:

|𝔼P[f(X)]𝔼Q[f(X)]|F(T(PQ)),\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq F(T(P\|Q)), (7)

where F()F(\cdot) is a given non-decreasing concave function.

Consider RR as an auxiliary distribution on the same space 𝒳\mathcal{X}. We can use the following upper bound instead of (7):

|𝔼P[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq
|𝔼P[f(X)]𝔼R[f(X)]|+|𝔼Q[f(X)]𝔼R[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{R}[f(X)]\right|+\left|\mathbb{E}_{Q}[f(X)]-\mathbb{E}_{R}[f(X)]\right|
F(T(PR))+F(T(QR))\displaystyle\leq F(T(P\|R))+F(T(Q\|R)) (8)

From concavity of FF, we have

F(T(PR))+F(T(QR))\displaystyle F(T(P\|R))+F(T(Q\|R))\leq
2F(T(PR)/2+T(QR)/2)\displaystyle 2F\bigg{(}T(P\|R)/2+T(Q\|R)/2\bigg{)} (9)

We assume that TT satisfies a reverse triangle inequality as follows:

minR𝒫(X)T(PR)+T(QR)T(PQ).\displaystyle\min_{R\in\mathcal{P}(X)}T(P\|R)+T(Q\|R)\leq T(P\|Q). (10)

Considering RargminRT(PR)+T(QR)R^{*}\in\operatorname*{arg\,min}_{R}T(P\|R)+T(Q\|R), we have

|𝔼P[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq
2F(T(PR)/2+T(QR)/2).\displaystyle 2F\bigg{(}T(P\|R^{*})/2+T(Q\|R^{*})/2\bigg{)}. (11)

We can also provide another upper bound based on T(RP)T(R\|P) and T(RQ)T(R\|Q) instead of T(PR)T(P\|R) and T(QR)T(Q\|R):

|𝔼P[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq
|𝔼R[f(X)]𝔼P[f(X)]|+|𝔼R[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{R}[f(X)]-\mathbb{E}_{P}[f(X)]\right|+\left|\mathbb{E}_{R}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|
F(T(RP))+F(T(RQ)).\displaystyle\leq F(T(R\|P))+F(T(R\|Q)). (12)

Considering R~argminR𝒫(X)T(RP)+T(RQ)\tilde{R}\in\operatorname*{arg\,min}_{R\in\mathcal{P}(X)}T(R\|P)+T(R\|Q), we have

|𝔼P[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq
2F(T(R~P)/2+T(R~Q)/2).\displaystyle 2F\bigg{(}T(\tilde{R}\|P)/2+T(\tilde{R}\|Q)/2\bigg{)}. (13)

Via this ADM approach – taking T()T(\cdot\|\cdot) to be a KL divergence – we can derive expected generalization error upper bounds involving KL divergences as follows:

αKL(PW,ZiP^W,Zi)+(1α)KL(PWμP^W,Zi),\displaystyle\alpha\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}}), (14)
αKL(P^W,ZiPW,Zi)+(1α)KL(P^W,ZiPWμ),\displaystyle\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu), (15)

where P^W,Zi\widehat{P}_{W,Z_{i}}, PW,ZiP_{W,Z_{i}} and PWμP_{W}\otimes\mu are an auxiliary joint distribution over the space 𝒵×𝒲\mathcal{Z}\times\mathcal{W}, the true joint distribution of the random variables WW and ZiZ_{i} and the product of marginal distributions of random variables WW and ZiZ_{i}, respectively. Inspired by the ADM, we use the fact that KL divergence is asymmetric and satisfies the reverse triangle inequality [30]. Hence, we can choose the auxiliary joint distribution, P^W,Zi\widehat{P}_{W,Z_{i}}, to derive new upper bounds which are finite or tighter under some conditions.
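To make the ADM step concrete, the following sketch (an illustration we add here, not taken from the paper) checks the reverse triangle inequality (10) for T = KL on a small discrete example, using the equal-weight mixture R = (P+Q)/2 as the auxiliary distribution; for this choice, KL(P‖R)+KL(Q‖R) equals twice the Jensen-Shannon divergence.

```python
import numpy as np

def kl(p, q):
    # KL divergence between discrete distributions (assumes full support of q).
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])

r_mix = 0.5 * (p + q)               # auxiliary distribution R = (P + Q) / 2
lhs = kl(p, r_mix) + kl(q, r_mix)   # equals 2 * JSD(P, Q)
rhs = kl(p, q)

print(f"KL(P||R) + KL(Q||R) = {lhs:.4f}")
print(f"KL(P||Q)            = {rhs:.4f}")
assert lhs <= rhs + 1e-12           # reverse triangle inequality (10) for T = KL
```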

II-D Information Measures

In our characterization of the expected generalization error upper bounds, we will use the information measures between two distributions P_X and P_{X′} on a common measurable space 𝒳, summarized in Table I. The main divergences discussed in this paper are the α-JS divergence (a.k.a. capacitory discrimination [31] for α=1/2) and the α-Rényi divergence, which can be characterized by (14) and (15), respectively (see their characterizations as convex combinations of KL divergences in Lemmas 2 and 3); both are defined in Table I. The KL divergence, symmetrized KL divergence, Bhattacharyya distance, and Jensen-Shannon divergence can be obtained as special cases of the first three divergences in Table I.

TABLE I: Divergence Measures Definitions
Divergence Measure Definition
KL divergence [32] KL(PXPX):=𝒳PX(x)log(PX(x)PX(x))𝑑x\mathrm{KL}(P_{X}\|P_{X^{\prime}}):=\int_{\mathcal{X}}P_{X}(x)\log\left(\frac{P_{X}(x)}{P_{X^{\prime}}(x)}\right)dx
α\alpha-JS\mathrm{JS} divergence [33, 34] JSα(PXPX):=\mathrm{JS}_{\alpha}(P_{X^{\prime}}\|P_{X}):= αKL(PXαPX+(1α)PX)+(1α)KL(PXαPX+(1α)PX)\alpha\mathrm{KL}\left(P_{X}\|\alpha P_{X}+(1-\alpha)P_{X^{\prime}}\right)+(1-\alpha)\mathrm{KL}\left(P_{X^{\prime}}\|\alpha P_{X}+(1-\alpha)P_{X^{\prime}}\right)
Jensen-Shannon divergence [34] JSD(PXPX):=JS1/2(PXPX)\mathrm{JSD}(P_{X^{\prime}}\|P_{X}):=\mathrm{JS}_{1/2}(P_{X^{\prime}}\|P_{X}) =12KL(PX||PX+PX2)+12KL(PX||PX+PX2)=\frac{1}{2}\mathrm{KL}\left(P_{X}\Big{|}\Big{|}\frac{P_{X}+P_{X^{\prime}}}{2}\right)+\frac{1}{2}\mathrm{KL}\left(P_{X^{\prime}}\Big{|}\Big{|}\frac{P_{X}+P_{X^{\prime}}}{2}\right)
α\alpha-Rényi divergence for α[0,)\alpha\in[0,\infty) [35] Rα(PXPX):=1α1log(𝒳PXα(x)PX1α(x)𝑑x)\mathrm{R}_{\alpha}(P_{X^{\prime}}\|P_{X}):=\frac{1}{\alpha-1}\log\left(\int_{\mathcal{X}}P_{X}^{\alpha}(x)P_{X^{\prime}}^{1-\alpha}(x)dx\right)
Bhattacharyya distance [36] D_{B}(P_{X^{\prime}}\|P_{X}):=\frac{1}{2}\mathrm{R}_{1/2}(P_{X^{\prime}}\|P_{X})=-\log\left(\int_{\mathcal{X}}\sqrt{P_{X}(x)P_{X^{\prime}}(x)}\,dx\right)
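For concreteness, a small Python sketch of the discrete versions of these divergences is given below (added for illustration; the paper works with general measures). It checks, in particular, that JS_{1/2} coincides with the Jensen-Shannon divergence.

```python
import numpy as np

def kl(p, q):
    # Discrete KL divergence; assumes p, q are full-support probability vectors.
    return float(np.sum(p * np.log(p / q)))

def js_alpha(p_prime, p, alpha):
    # alpha-JS divergence JS_alpha(P' || P) from Table I.
    m = alpha * p + (1 - alpha) * p_prime
    return alpha * kl(p, m) + (1 - alpha) * kl(p_prime, m)

def renyi(p_prime, p, alpha):
    # alpha-Renyi divergence (0 < alpha < 1), with the formula as written in Table I.
    return float(np.log(np.sum(p**alpha * p_prime**(1 - alpha))) / (alpha - 1))

p = np.array([0.6, 0.3, 0.1])
p_prime = np.array([0.3, 0.3, 0.4])

jsd = 0.5 * kl(p, 0.5 * (p + p_prime)) + 0.5 * kl(p_prime, 0.5 * (p + p_prime))
assert np.isclose(js_alpha(p_prime, p, 0.5), jsd)   # JS_{1/2} = Jensen-Shannon divergence
print(js_alpha(p_prime, p, 0.5), renyi(p_prime, p, 0.5))
```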

In addition, in our expected generalization error characterizations, we will also use various information measures between two random variables XX and XX^{\prime} with joint distribution PXXP_{XX^{\prime}} and marginals PXP_{X} and PXP_{X^{\prime}}. These information measures are summarized in Table II. Note that all these information measures are zero if and only if the random variables XX and XX^{\prime} are independent.

TABLE II: Information Measures Definitions
Information Measure Definition
Mutual information I(X;X):=KL(PX,XPXPX)I(X;X^{\prime}):=\mathrm{KL}(P_{X,X^{\prime}}\|P_{X}\otimes P_{X^{\prime}})
Lautum information [37] L(X;X):=KL(PXPXPX,X)L(X;X^{\prime}):=\mathrm{KL}(P_{X}\otimes P_{X^{\prime}}\|P_{X,X^{\prime}})
α\alpha-JS\mathrm{JS} information (0<α<10<\alpha<1) IJSα(X;X):=JSα(PX,XPXPX)I_{\mathrm{JS}}^{\alpha}(X;X^{\prime}):=\mathrm{JS}_{\alpha}(P_{X,X^{\prime}}\|P_{X}\otimes P_{X^{\prime}})
Jensen-Shannon information [38] IJS(X;X):=JSD(PX,XPXPX)I_{\mathrm{JS}}(X;X^{\prime}):=\mathrm{JSD}(P_{X,X^{\prime}}\|P_{X}\otimes P_{X^{\prime}})
α\alpha-Rényi information IRα(X;X):=Rα(PX,XPXPX)I_{\mathrm{R}}^{\alpha}(X;X^{\prime}):=\mathrm{R}_{\alpha}(P_{X,X^{\prime}}\|P_{X}\otimes P_{X^{\prime}})
Sibson’s α\alpha-Mutual information [39] ISα(X;X):=minQXRα(PX,XPXQX)I_{\mathrm{S}}^{\alpha}(X;X^{\prime}):=\min_{Q_{X^{\prime}}}\mathrm{R}_{\alpha}(P_{X,X^{\prime}}\|P_{X}\otimes Q_{X^{\prime}})
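Continuing the toy discrete illustration (again, our own sketch and not from the paper), the code below evaluates some of the information measures of Table II on a discrete joint distribution and checks that they vanish when the two variables are independent.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def info_measures(joint, alpha=0.3):
    # joint: 2-D array with entries P_{X,X'}(x, x'); marginals obtained by summation.
    px = joint.sum(axis=1, keepdims=True)
    pxp = joint.sum(axis=0, keepdims=True)
    prod = (px * pxp).ravel()                   # product of marginals
    j = joint.ravel()
    m = alpha * prod + (1 - alpha) * j          # alpha-convex combination P^{(alpha)}
    mi = kl(j, prod)                            # mutual information
    lautum = kl(prod, j)                        # lautum information
    js_info = alpha * kl(prod, m) + (1 - alpha) * kl(j, m)   # alpha-JS information
    return mi, lautum, js_info

dependent = np.array([[0.4, 0.1], [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.3, 0.7])
print(info_measures(dependent))    # all strictly positive
print(info_measures(independent))  # all (numerically) zero
```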

II-E Definitions

We offer some standard definitions that will guide our analysis in the sequel.

Definition 1

The cumulant generating function (CGF) of a random variable XX is defined as

ΛX(λ):=log𝔼[eλ(X𝔼[X])].\Lambda_{X}(\lambda):=\log\mathbb{E}[e^{\lambda(X-\mathbb{E}[X])}]. (16)

Assuming ΛX(λ)\Lambda_{X}(\lambda) exists, it can be verified that ΛX(0)=ΛX(0)=0\Lambda_{X}(0)=\Lambda_{X}^{\prime}(0)=0, and that it is convex.

Definition 2

For a convex function ψ\psi defined on the interval [0,b)[0,b), where 0<b0<b\leq\infty, its Legendre dual ψ\psi^{\star} is defined as

ψ(x):=supλ[0,b)(λxψ(λ)).\psi^{\star}(x):=\sup_{\lambda\in[0,b)}\big{(}\lambda x-\psi(\lambda)\big{)}. (17)

The following lemma characterizes a useful property of the Legendre dual and its inverse function.

Lemma 1

[40, Lemma 2.4] Assume that ψ(0)=ψ(0)=0\psi(0)=\psi^{\prime}(0)=0. Then, the Legendre dual ψ(x)\psi^{\star}(x) of ψ(x)\psi(x) defined above is a non-negative convex and non-decreasing function on [0,)[0,\infty) with ψ(0)=0\psi^{\star}(0)=0. Moreover, its inverse function ψ1(y)=inf{x0:ψ(x)y}\psi^{\star-1}(y)=\inf\{x\geq 0:\psi^{\star}(x)\geq y\} is concave, and can be written as

ψ1(y)=infλ[0,b)(y+ψ(λ)λ),b>0.\psi^{\star-1}(y)=\inf_{\lambda\in[0,b)}\Big{(}\frac{y+\psi(\lambda)}{\lambda}\Big{)},\quad b>0. (18)

Importantly, using these results, we can characterize the tail behaviour of Sub-Gaussian random variables. A random variable XX is σ\sigma-sub-Gaussian, if ψ(λ)=σ2λ22\psi(\lambda)=\frac{\sigma^{2}\lambda^{2}}{2} is an upper bound on ΛX(λ)\Lambda_{X}(\lambda), for λ\lambda\in\mathbb{R}. Then by Lemma 1,

ψ1(y)=2σ2y.\displaystyle\psi^{\star-1}(y)=\sqrt{2\sigma^{2}y}. (19)

The tail behaviours of sub-Exponential and sub-Gamma random variables are introduced in [20].
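For completeness, the inverse in (19) follows from a one-line optimization over λ (a standard step, spelled out here for the reader): minimizing (y+σ²λ²/2)/λ over λ>0 gives the optimizer λ=\sqrt{2y/\sigma^{2}}, and hence

\psi^{\star-1}(y)=\inf_{\lambda>0}\frac{y+\sigma^{2}\lambda^{2}/2}{\lambda}=\sqrt{2\sigma^{2}y}.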

III Upper Bounds on the Expected Generalization Error via ADM

We provide a series of bounds on the expected generalization error of supervised learning algorithms based on different information measures using the ADM coupled with KL divergence.

III-A α\alpha-Jensen-Shannon- based Upper Bound

In the following theorem, we provide a new expected generalization error upper bound by applying ADM with the KL divergence terms KL(P_W⊗μ‖P̂_{W,Z_i}) and KL(P_{W,Z_i}‖P̂_{W,Z_i}). Some proof details are deferred to Appendix A.

Theorem 1

Assume that, under an auxiliary joint distribution P̂_{W,Z_i}∈𝒫(𝒲×𝒵), the CGF Λ_{ℓ(W,Z_i)}(λ) exists and is upper bounded by ψ_+(λ) for λ∈[0,b_+), 0<b_+<+∞, and by ψ_−(−λ) for λ∈(b_−,0], for all i=1,⋯,n. Also assume that ψ_+(λ) and ψ_−(λ) are convex functions with ψ_−(0)=ψ_+(0)=ψ_+′(0)=ψ_−′(0)=0. Then, it holds that:

gen¯(PW|S,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1n(ψ+1(Ai)+ψ1(Bi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left(\psi_{+}^{\star-1}(A_{i})+\psi_{-}^{\star-1}(B_{i})\right), (20)
gen¯(PW|S,μ)\displaystyle-\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1n(ψ1(Ai)+ψ+1(Bi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left(\psi_{-}^{\star-1}(A_{i})+\psi_{+}^{\star-1}(B_{i})\right), (21)

where Ai=KL(PWμP^W,Zi)A_{i}=\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}}), Bi=KL(PW,ZiP^W,Zi)B_{i}=\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}), ψ1(x)=infλ[0,b)x+ψ(λ)λ\psi_{-}^{\star-1}(x)=\inf_{\lambda\in[0,-b_{-})}\frac{x+\psi_{-}(\lambda)}{\lambda} and ψ+1(x)=infλ[0,b+)x+ψ+(λ)λ.\psi_{+}^{\star-1}(x)=\inf_{\lambda\in[0,b_{+})}\frac{x+\psi_{+}(\lambda)}{\lambda}\,.

Proof:

The proofs of the bounds on gen¯(P_{W|S},μ) and −gen¯(P_{W|S},μ) are similar; therefore, we focus on the former.

Let us consider the Donsker–Varadhan variational representation of KL divergence between two probability distributions α\alpha and β\beta on a common space Ψ\Psi given by [41]:

KL(αβ)=supfΨf𝑑αlogΨef𝑑β,\displaystyle\mathrm{KL}(\alpha\|\beta)=\sup_{f}\int_{\Psi}fd\alpha-\log\int_{\Psi}e^{f}d\beta, (22)

where f={f:Ψ s.t. 𝔼β[ef]<}f\in\mathcal{F}=\{f:\Psi\rightarrow\mathbb{R}\text{ s.t. }\mathbb{E}_{\beta}[e^{f}]<\infty\}.

We use the Donsker-Varadhan representation to lower bound KL(P_{W,Z_i}‖P̂_{W,Z_i}) for λ∈(b_−,0] as follows:

KL(PW,ZiP^W,Zi)\displaystyle\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})\geq (23)
𝔼PW,Zi[λ(W,Zi)]log𝔼P^W,Zi[eλ(W,Zi)]\displaystyle\mathbb{E}_{P_{W,Z_{i}}}[\lambda\ell(W,Z_{i})]-\log\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[e^{\lambda\ell(W,{Z}_{i})}]\geq
λ(𝔼PW,Zi[(W,Zi)]𝔼P^W,Zi[(W,Zi)])ψ(λ),\displaystyle\lambda(\mathbb{E}_{P_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,{Z}_{i})])-\psi_{-}(-\lambda), (24)

where the last inequality is due to:

Λ(W,Zi)(λ)=\displaystyle\Lambda_{\ell({W},{Z}_{i})}(\lambda)= (25)
log(𝔼P^W,Zi[eλ((W,Zi)𝔼P^W,Zi[(W,Zi)])])ψ(λ).\displaystyle\log\left(\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[e^{\lambda(\ell(W,{Z}_{i})-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,{Z}_{i})])}]\right)\leq\psi_{-}(-\lambda).

It can then be shown from (24) that the following holds for λ(b,0]\lambda\in(b_{-},0]:

𝔼P^W,Zi[(W,Zi)]𝔼PW,Zi[(W,Zi)]\displaystyle\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,{Z}_{i})]-\mathbb{E}_{{P}_{W,Z_{i}}}[\ell(W,Z_{i})]\leq (26)
infλ[0,b)KL(PW,ZiP^W,Zi)+ψ(λ)λ=\displaystyle\inf_{\lambda\in[0,-b_{-})}\frac{\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+\psi_{-}(\lambda)}{\lambda}=
ψ1(KL(PW,ZiP^W,Zi)).\displaystyle\psi_{-}^{\star-1}(\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})). (27)

It can likewise also be shown by adopting similar steps that the following holds for λ[0,b+)\lambda\in[0,b_{+}):

\mathbb{E}_{P_{W,Z_i}}[\ell(W,Z_i)]-\mathbb{E}_{\widehat{P}_{W,Z_i}}[\ell(W,Z_i)]\leq (28)
\inf_{\lambda\in[0,b_{+})}\frac{\mathrm{KL}(P_{W,Z_i}\|\widehat{P}_{W,Z_i})+\psi_{+}(\lambda)}{\lambda}=
\psi_{+}^{\star-1}(\mathrm{KL}(P_{W,Z_i}\|\widehat{P}_{W,Z_i})). (29)

We can similarly show using an identical procedure that:

𝔼PWμ[(W,Zi)]𝔼P^W,Zi[(W,Z^i)]\displaystyle\mathbb{E}_{P_{W}\otimes\mu}[\ell({W},{Z}_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,\widehat{Z}_{i})]
ψ+1(KL(PWμP^W,Zi))\displaystyle\leq\psi_{+}^{\star-1}(\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})) (30)
𝔼P^W,Zi[(W,Z^i)]𝔼PWμ[(W,Zi)]\displaystyle\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,\widehat{Z}_{i})]-\mathbb{E}_{P_{W}\otimes\mu}[\ell({W},{Z}_{i})]
ψ1(KL(PWμP^W,Zi)).\displaystyle\leq\psi_{-}^{\star-1}(\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})). (31)

Finally, we can immediately bound the expected generalization error by leveraging (III-A) and (26) as follows:

gen¯(PW|S,μ)=1ni=1n𝔼PWμ[(W,Zi)]𝔼PW,Zi[(W,Zi)]\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}\otimes\mu}[\ell({W},{Z}_{i})]-\mathbb{E}_{{P}_{W,Z_{i}}}[\ell(W,Z_{i})]
=1ni=1n𝔼PWμ[(W,Zi)]𝔼P^W,Zi[(W,Zi)]+\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}\otimes\mu}[\ell({W},{Z}_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell({W},{Z}_{i})]+
𝔼P^W,Zi[(W,Zi)]𝔼PW,Zi[(W,Zi)]\displaystyle~{}\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell({W},{Z}_{i})]-\mathbb{E}_{{P}_{W,Z_{i}}}[\ell(W,Z_{i})]
1ni=1n(ψ+1(Ai)+ψ1(Bi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left(\psi_{+}^{\star-1}(A_{i})+\psi_{-}^{\star-1}(B_{i})\right),

where Ai=KL(PWμP^W,Zi)A_{i}=\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}}) and Bi=KL(PW,ZiP^W,Zi)B_{i}=\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}). ∎

Note that Theorem 1 can be applied under the sub-Gaussian assumption (19), as well as under sub-Exponential and sub-Gamma assumptions on the loss function CGF, as introduced in [20].

We can utilize Theorem 1 to recover existing expected generalization error bounds and offer new ones. For example, we can immediately recover the mutual information bound [7] from the following results.

Example 1

Choose P^W,Zi=PWμ\widehat{P}_{W,Z_{i}}=P_{W}\otimes\mu for i=1,,ni=1,\cdots,n. It follows immediately from Theorem 1 that:

gen¯(PW|S,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1nψ1(I(W;Zi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\psi_{-}^{\star-1}(I(W;Z_{i})), (32)
gen¯(PW|S,μ)\displaystyle-\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1nψ+1(I(W;Zi)).\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\psi_{+}^{\star-1}(I(W;Z_{i})). (33)
Example 2

Choose P^W,Zi=PW,Zi\widehat{P}_{W,Z_{i}}=P_{W,Z_{i}} for i=1,,ni=1,\cdots,n. It also follows immediately from Theorem 1 that:

gen¯(PW|S,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1nψ+1(L(W;Zi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\psi_{+}^{\star-1}(L(W;Z_{i})), (34)
gen¯(PW|S,μ)\displaystyle-\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1nψ1(L(W;Zi)).\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\psi_{-}^{\star-1}(L(W;Z_{i})). (35)

The result in Example 1 is the same as a result appearing in [8] whereas the result in Example 2 extends the result appearing in [42].

The conclusion in Theorem 1 can be extended to multiple auxiliary distributions by repeatedly applying ADM. In this work, we consider just one auxiliary distribution and apply ADM once.

Building upon Theorem 1, we are also able to provide an expected generalization error upper bound based on a convex combination of KL terms, i.e.,

αKL(PWμP^W,Zi)+(1α)KL(PW,ZiP^W,Zi),\alpha\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}),

that relies on a certain σ\sigma-sub-Gaussian tail assumption.

Proposition 1

Assume that the loss function is σ̂-sub-Gaussian under the distribution P̂_{W,Z_i} for all i=1,⋯,n. Then, it holds for all α∈(0,1) that:

|gen¯(PW|S,μ)|1ni=1n2σ^2(αAi+(1α)Bi)α(1α),\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\hat{\sigma}^{2}\frac{\left(\alpha A_{i}+(1-\alpha)B_{i}\right)}{\alpha(1-\alpha)}}, (36)

where Ai=KL(PW,ZiP^W,Zi)A_{i}=\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}) and Bi=KL(PWμP^W,Zi)B_{i}=\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}}) .

Proof:

The assumption that the loss function is σ̂-sub-Gaussian under the distribution P̂_{W,Z_i} implies that ψ_−^{⋆−1}(y)=ψ_+^{⋆−1}(y)=√(2σ̂²y) [8]. Consider arbitrary auxiliary distributions {P̂_{W,Z_i}}_{i=1}^{n} defined on 𝒲×𝒵.

gen¯(μ,PW|S)=𝔼PWPS[LE(W,S)]𝔼PW,S[LE(W,S)]\displaystyle\overline{\text{\rm{gen}}}(\mu,P_{W|S})=\mathbb{E}_{P_{W}P_{S}}[L_{E}(W,S)]-\mathbb{E}_{P_{W,S}}[L_{E}(W,S)]
=1ni=1n𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})] (37)
1ni=1n|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right| (38)

Using the assumption that the loss function (w,zi)\ell(w,z_{i}) is σ^2\hat{\sigma}^{2}-sub-Gaussian under distribution P^W,Zi\widehat{P}_{W,Z_{i}} and Donsker-Varadhan representation for KL(PWZiP^W,Zi)\mathrm{KL}(P_{WZ_{i}}\|\widehat{P}_{W,Z_{i}}), we have:

λ(𝔼PW,Zi[(W,Zi)]𝔼P^W,Zi[(W,Zi)])\displaystyle\lambda\left(\mathbb{E}_{P_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]\right)\leq (39)
KL(PWZiP^W,Zi)+λ2σ^22.λ\displaystyle\quad\mathrm{KL}(P_{WZ_{i}}\|\widehat{P}_{W,Z_{i}})+\frac{\lambda^{2}\hat{\sigma}^{2}}{2}.\quad\forall\lambda\in\mathbb{R}

Using the assumption that the loss function ℓ(w,z_i) is σ̂-sub-Gaussian under the distribution P̂_{W,Z_i} and the Donsker-Varadhan representation for KL(P_W⊗P_{Z_i}‖P̂_{W,Z_i}), we have,

λ(𝔼PWPZi[(W,Zi)]𝔼P^W,Zi[(W,Zi)])\displaystyle\lambda^{\prime}\left(\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]\right)\leq (40)
KL(PWPZiP^W,Zi)+λ2σ^22,λ.\displaystyle\quad\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})+\frac{{\lambda^{\prime}}^{2}\hat{\sigma}^{2}}{2},\quad\forall\lambda^{\prime}\in\mathbb{R}.

Assuming λ<0, we can choose λ′=(α/(α−1))λ>0. Hence, we have,

𝔼P^W,Zi[(W,Zi)]𝔼PW,Zi[(W,Zi)]\displaystyle\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W,Z_{i}}}[\ell(W,Z_{i})]\leq (41)
KL(PW,ZiP^W,Zi)|λ|+|λ|σ^22,λ,\displaystyle\quad\frac{\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})}{|\lambda|}+\frac{|\lambda|\hat{\sigma}^{2}}{2},\quad\forall\lambda\in\mathbb{R^{-}},

and,

𝔼PWPZi[(W,Zi)]𝔼P^W,Zi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]\leq (42)
KL(PWPZiP^W,Zi)λ+λσ^22,λ+.\displaystyle\quad\frac{\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})}{\lambda^{\prime}}+\frac{\lambda^{\prime}\hat{\sigma}^{2}}{2},\quad\forall\lambda^{\prime}\in\mathbb{R^{+}}.

Merging the two inequalities (41) and (42) with λ′=(α/(α−1))λ, we have

𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\leq (43)
αKL(PW,ZiP^W,Zi)+(1α)KL(PWPZiP^W,Zi)α|λ|+\displaystyle\frac{\alpha\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})}{\alpha|\lambda|}+
|λ|σ^22+|λ|α1ασ^22,λ.\displaystyle\quad\frac{|\lambda|\hat{\sigma}^{2}}{2}+\frac{|\lambda|\frac{\alpha}{1-\alpha}\hat{\sigma}^{2}}{2},\quad\forall\lambda\in\mathbb{R^{-}}.

Similarly, using an identical approach, we also obtain,

(𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)])\displaystyle-\left(\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right)\leq (44)
αKL(PW,ZiP^W,Zi)+(1α)KL(PWPZiP^W,Zi)αλ+\displaystyle\frac{\alpha\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})}{\alpha\lambda}+
λσ^22+λα1ασ^22,λ+.\displaystyle\quad\frac{\lambda\hat{\sigma}^{2}}{2}+\frac{\lambda\frac{\alpha}{1-\alpha}\hat{\sigma}^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Considering (43) and (44) and multiplying through by |λ|, we obtain a non-negative parabola in |λ|, whose discriminant must therefore be non-positive; hence, for all α∈(0,1),

|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|2\displaystyle\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right|^{2}\leq (45)
2σ^2(αKL(PW,ZiP^W,Zi)+(1α)KL(PWPZiP^W,Zi))α(1α).\displaystyle 2\hat{\sigma}^{2}\frac{\left(\alpha\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})\right)}{\alpha(1-\alpha)}.

Using (III-A) completes the proof. ∎

We propose a Lemma connecting certain KL divergences to the α\alpha-JS\mathrm{JS} information.

Lemma 2

Consider an auxiliary distribution P^W,Zi𝒫(𝒲×𝒵)\widehat{P}_{W,Z_{i}}\in\mathcal{P}(\mathcal{W}\times\mathcal{Z}). Then, the following equality holds:

αKL(PWμP^W,Zi)+(1α)KL(PW,ZiP^W,Zi)=\displaystyle\alpha\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})=
IJSα(W;Zi)+KL(PW,Zi(α)P^W,Zi).\displaystyle I_{\mathrm{JS}}^{\alpha}(W;Z_{i})+\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|\widehat{P}_{W,Z_{i}}).

Note that the proof is inspired by [43].
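To see Lemma 2 in action, the following sketch (our own numerical check on a toy discrete joint distribution, not from the paper) verifies the identity and the fact that the auxiliary distribution P^{(α)}_{W,Z_i} minimizes the convex combination of KL terms.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
alpha = 0.3

# Toy discrete joint P_{W,Z_i} and its product of marginals, flattened to vectors.
joint2d = np.array([[0.4, 0.1], [0.1, 0.4]])
prod2d = np.outer(joint2d.sum(axis=1), joint2d.sum(axis=0))
j, a = joint2d.ravel(), prod2d.ravel()

mix = alpha * a + (1 - alpha) * j                        # P^{(alpha)}_{W,Z_i}
js_info = alpha * kl(a, mix) + (1 - alpha) * kl(j, mix)  # I_JS^alpha(W; Z_i)

# A generic auxiliary distribution P_hat (random, full support).
p_hat = rng.dirichlet(np.ones(4))

lhs = alpha * kl(a, p_hat) + (1 - alpha) * kl(j, p_hat)
rhs = js_info + kl(mix, p_hat)
assert np.isclose(lhs, rhs)   # the identity in Lemma 2
assert lhs >= js_info         # hence P_hat = P^{(alpha)} minimizes the left-hand side
print(lhs, js_info)
```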

Using the result in Proposition 1 together with ADM, we can provide a tighter upper bound. Specifically, Lemma 2 paves the way to choose an appropriate auxiliary distribution, yielding a tighter version of the expected generalization error bound in Proposition 1, as well as recovering existing bounds.

Theorem 2

Assume that the loss function is σ_(α)-sub-Gaussian under the distribution P^{(α)}_{W,Z_i} for all i=1,⋯,n. Then, it holds for all α∈(0,1) that:

|gen¯(PW|S,μ)|1ni=1n2σ(α)2IJSα(W;Zi)α(1α),α(0,1).|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}{\alpha(1-\alpha)}},\quad\forall\alpha\in(0,1).

The bound in Theorem 2 results from minimizing the term αKL(P_W⊗μ‖P̂_{W,Z_i})+(1−α)KL(P_{W,Z_i}‖P̂_{W,Z_i}) in the upper bound (36), presented in Proposition 1, over the joint auxiliary distribution P̂_{W,Z_i}. The optimal joint auxiliary distribution is P^{(α)}_{W,Z_i}:=αP_W⊗P_{Z_i}+(1−α)P_{W,Z_i}. Note that the sub-Gaussianity parameter, denoted σ̂ in Proposition 1, depends on P̂_{W,Z_i}. Consequently, the upper bound in Theorem 2 is not necessarily the minimum of the upper bound in Proposition 1. However, for a bounded loss function, the upper bound in Theorem 2 becomes the minimum of the upper bound in Proposition 1.
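As a sanity check of Theorem 2, the sketch below is our own toy example with a single sample (n=1) and a loss taking values in [0,1], so that σ_(α)=1/2 under any distribution; it compares the per-sample generalization term with the α-JS bound.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

alpha = 0.5
# Toy joint P_{W,Z} over a 2x2 space and a bounded loss ell(w, z) in [0, 1].
joint = np.array([[0.40, 0.10],
                  [0.10, 0.40]])
loss = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))   # product of marginals

gap = abs(np.sum(prod * loss) - np.sum(joint * loss))   # per-sample generalization term

mix = alpha * prod + (1 - alpha) * joint
js_info = (alpha * kl(prod.ravel(), mix.ravel())
           + (1 - alpha) * kl(joint.ravel(), mix.ravel()))
sigma_alpha = 0.5                                        # (b - a)/2 for a loss in [0, 1]
bound = np.sqrt(2 * sigma_alpha**2 * js_info / (alpha * (1 - alpha)))

print(f"gap = {gap:.4f}, Theorem 2 bound = {bound:.4f}")
assert gap <= bound
```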

It turns out that we can immediately recover existing bounds from Theorem 2 depending on how we choose α\alpha.

Remark 1 (Recovering upper bound based on Jensen-Shannon information)

The expected generalization error upper bound based on Jensen-Shannon information in [1] can be immediately recovered by considering α=12\alpha=\frac{1}{2} in Theorem 2.

Remark 2 (Recovering upper bounds based on mutual information and lautum information)

The expected generalization error upper bound based on mutual information in [8] and lautum information in [42] can be immediately recovered by considering α1\alpha\rightarrow 1 and α0\alpha\rightarrow 0 in Theorem 2, respectively.

Note that we can also establish how the bound in Theorem 2 behaves as a function of the number of training samples. This can be done by using P̂_{W,Z_i}=P_W⊗μ in Lemma 2, leading to

(1α)I(W;Zi)=IJSα(W;Zi)+KL(PW,Zi(α)PWμ).\displaystyle(1-\alpha)I(W;Z_{i})=I_{\mathrm{JS}}^{\alpha}(W;Z_{i})+\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|P_{W}\otimes\mu). (46)

and in turn to the following inequality

IJSα(W;Zi)(1α)I(W;Zi),α(0,1).\displaystyle I_{\mathrm{JS}}^{\alpha}(W;Z_{i})\leq(1-\alpha)I(W;Z_{i}),\quad\forall\alpha\in(0,1). (47)

We prove the convergence rate of the upper bound in Theorem 2 using (47).

Proposition 2

Assume the hypothesis space is finite and the data samples, {Zi}i=1n\{Z_{i}\}_{i=1}^{n}, are i.i.d. Then, the bound in Theorem 2 has a convergence rate of 𝒪(1n)\mathcal{O}(\frac{1}{\sqrt{n}}).

The value of the new bound in Theorem 2 relative to existing bounds can be further appreciated via two additional results.

Proposition 3

Consider the assumptions in Theorem 2. Then, it follows that:

|gen¯(PW|S,μ)|σ(α)2h(α)α(1α),α(0,1),\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\sigma_{(\alpha)}\sqrt{2\frac{h(\alpha)}{\alpha(1-\alpha)}},\quad\forall\alpha\in(0,1), (48)

where h(α)=αlog(α)(1α)log(1α)h(\alpha)=-\alpha\log(\alpha)-(1-\alpha)\log(1-\alpha).

This proposition shows that, unlike the existing mutual information-based and lautum information-based generalization bounds (e.g., [7], [8], [9], and [11]), the proposed α-JS information generalization bound is always finite. We can also optimize the bound in (48) with respect to α; the minimum is achieved at α=1/2.

Corollary 1

Consider the assumptions in Theorem 2. Then, it follows that:

|gen¯(PW|S,μ)|2σ(1/2)2log(2).\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq 2\sigma_{(1/2)}\sqrt{2\log(2)}. (49)

This result applies regardless of whether the loss function is bounded. Naturally, for any loss function bounded within the interval [a,b], the absolute value of the expected generalization error is always upper bounded as |gen¯(P_{W|S},μ)|≤(b−a). If we consider bounded loss functions in the interval [a,b], then our upper bound (49) becomes √(2log(2))(b−a), which is smaller than the total-variation-based upper bound 2(b−a) presented in [44, 15].
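A quick numerical comparison of these constants (added here for illustration) confirms that the constant in (48) is minimized at α=1/2 and that, for a loss in [a,b], the resulting bound √(2 log 2)(b−a) ≈ 1.18(b−a) improves on the total-variation constant 2(b−a).

```python
import numpy as np

alphas = np.linspace(0.01, 0.99, 99)
h = -alphas * np.log(alphas) - (1 - alphas) * np.log(1 - alphas)
constant = np.sqrt(2 * h / (alphas * (1 - alphas)))   # factor multiplying sigma_(alpha) in (48)

best = alphas[np.argmin(constant)]
# Minimum at alpha = 1/2, with value 2*sqrt(2*log(2)) ~ 2.355 (Corollary 1).
print(best, constant.min(), 2 * np.sqrt(2 * np.log(2)))
# For a loss in [a, b], sigma_(alpha) = (b - a)/2, so the bound is sqrt(2 log 2)*(b - a) ~ 1.18 (b - a).
```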

It is worth mentioning that our result cannot be immediately recovered from existing approaches such as [11, Theorem 2]. For example, if we consider the upper bound based on Jensen-Shannon information, there exists an f-divergence representation of the Jensen-Shannon divergence as follows:

JSD(PX,PX)=dPXf(dPXdPX),\displaystyle\mathrm{JSD}(P_{X},P_{X^{\prime}})=\int\mathrm{d}P_{X}f\left(\frac{\mathrm{d}P_{X^{\prime}}}{\mathrm{d}P_{X}}\right), (50)

with f(t)=tlog(t)(1+t)log(1+t2)f(t)=t\log(t)-(1+t)\log(\frac{1+t}{2}). However, [11, Theorem. 2] requires that the function f(t)f(t) associated with the ff-divergence is non-decreasing within the interval [0,+)[0,+\infty), but such a requirement is naturally violated by the function f(t)=tlog(t)(1+t)log(1+t2)f(t)=t\log(t)-(1+t)\log(\frac{1+t}{2}) associated with the Jensen-Shannon divergence.

III-B α\alpha-Rényi-based Upper Bound

Next, we provide a new expected generalization error upper bound by applying ADM with the following KL divergence terms: KL(P̂_{W,Z_i}‖P_W⊗μ) and KL(P̂_{W,Z_i}‖P_{W,Z_i}). All the proof details are deferred to Appendix B.

Proposition 4

Suppose that Λ(W,Z)(λ)γ+(λ)\Lambda_{\ell(W,Z)}(\lambda)\leq\gamma_{+}(\lambda) and Λ(W,Zi)(λ)ϕ+(λ),i=1,,n\Lambda_{\ell(W,Z_{i})}(\lambda)\leq\phi_{+}(\lambda),\quad i=1,\cdots,n for λ[0,a+)\lambda\in[0,a_{+}), 0<a+<+0<a_{+}<+\infty and λ[0,c+)\lambda\in[0,c_{+}), 0<c+<+0<c_{+}<+\infty, under PWμP_{W}\otimes\mu and PW,ZiP_{W,Z_{i}}, resp. We also have Λ(W,Z)(λ)γ(λ)\Lambda_{\ell(W,Z)}(\lambda)\leq\gamma_{-}(-\lambda) and Λ(W~,Z~i)(λ)ϕ(λ),i=1,,n\Lambda_{\ell(\widetilde{W},\widetilde{Z}_{i})}(\lambda)\leq\phi_{-}(-\lambda),\quad i=1,\cdots,n for λ(a,0]\lambda\in(a_{-},0], <a<0-\infty<a_{-}<0 and λ(c,0]\lambda\in(c_{-},0], <c<0-\infty<c_{-}<0 under PWμP_{W}\otimes\mu and PW,ZiP_{W,Z_{i}}, resp. Assume that γ+(λ)\gamma_{+}(\lambda), ϕ+(λ)\phi_{+}(\lambda), γ(λ)\gamma_{-}(\lambda) and ϕ(λ)\phi_{-}(\lambda) are convex functions, γ(0)=γ+(0)=γ+(0)=γ(0)=0\gamma_{-}(0)=\gamma_{+}(0)=\gamma_{+}^{\prime}(0)=\gamma_{-}^{\prime}(0)=0 and ϕ(0)=ϕ+(0)=ϕ+(0)=ϕ(0)=0\phi_{-}(0)=\phi_{+}(0)=\phi_{+}^{\prime}(0)=\phi_{-}^{\prime}(0)=0. Then, the following upper bounds hold,

gen¯(PW|S,μ)1ni=1n(γ1(Di)+ϕ+1(Ci)),\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)\leq\frac{1}{n}\sum_{i=1}^{n}\left(\gamma_{-}^{\star-1}(D_{i})+\phi_{+}^{\star-1}(C_{i})\right), (51)
gen¯(PW|S,μ)1ni=1n(ϕ1(Ci)+γ+1(Di)),\displaystyle-\overline{\operatorname*{gen}}(P_{W|S},\mu)\leq\frac{1}{n}\sum_{i=1}^{n}\left(\phi_{-}^{\star-1}(C_{i})+\gamma_{+}^{\star-1}(D_{i})\right), (52)

where Di=KL(P^W,ZiPWμ)D_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu), Ci=KL(P^W,ZiPW,Zi)C_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}}), γ1(x)=infλ[0,a)x+γ(λ)λ\gamma_{-}^{\star-1}(x)=\inf_{\lambda\in[0,-a_{-})}\frac{x+\gamma_{-}(\lambda)}{\lambda},
γ+1(x)=infλ[0,a+)x+γ+(λ)λ\gamma_{+}^{\star-1}(x)=\inf_{\lambda\in[0,a_{+})}\frac{x+\gamma_{+}(\lambda)}{\lambda}, ϕ1(x)=infλ[0,c)x+ϕ(λ)λ\phi_{-}^{\star-1}(x)=\inf_{\lambda\in[0,-c_{-})}\frac{x+\phi_{-}(\lambda)}{\lambda} and ϕ+1(x)=infλ[0,c+)x+ϕ+(λ)λ\phi_{+}^{\star-1}(x)=\inf_{\lambda\in[0,c_{+})}\frac{x+\phi_{+}(\lambda)}{\lambda}.

Proof:

The proof approach is similar to Theorem 1 by considering different cumulant generating functions and their upper bounds. ∎

Inspired by the upper bound in Proposition 4, we can immediately provide an upper bound on the expected generalization error that depends on the convex combination of KL divergence terms, i.e.,

αKL(P^W,ZiPW,Zi)+(1α)KL(P^W,ZiPWμ),\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu),

under a σ-sub-Gaussian tail assumption.

Proposition 5 (Upper bound with Sub-Gaussian assumption)

Assume that the loss function is σ-sub-Gaussian under the distribution P_W⊗μ and γ-sub-Gaussian under P_{W,Z_i} for all i=1,⋯,n. Then, it holds for all α∈(0,1) that,

|gen¯(PW|S,μ)|\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq (53)
1ni=1n2(ασ2+(1α)γ2)(αCi+(1α)Di)α(1α),\displaystyle\quad\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\left(\alpha C_{i}+(1-\alpha)D_{i}\right)}{\alpha(1-\alpha)}},

where Ci=KL(P^W,ZiPWμ)C_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu) and Di=KL(P^W,ZiPW,Zi)D_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}}) .

Akin to Proposition 1, the result in Proposition 5 paves the way to a new, tighter expected generalization error upper bound via ADM. We next offer a lemma connecting certain KL divergences to the α-Rényi information [35, Theorem 30].

Lemma 3

Consider an arbitrary distribution P^W,Zi\widehat{P}_{W,Z_{i}}. Then, the following equality holds for α(0,1)\forall\alpha\in(0,1),

αKL(P^W,ZiPWμ)+(1α)KL(P^W,ZiPW,Zi)=\displaystyle\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu)+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})= (54)
(1α)IRα(W;Zi)\displaystyle\quad(1-\alpha)I_{\mathrm{R}}^{\alpha}(W;Z_{i})
+KL(P^W,Zi(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)).\displaystyle\quad+\mathrm{KL}\left(\widehat{P}_{W,Z_{i}}\|\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right).
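Analogously to the check after Lemma 2, the sketch below (our own toy verification on a discrete joint, not from the paper) confirms the identity of Lemma 3 and that the normalized geometric mixture is the minimizing auxiliary distribution.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
alpha = 0.4

joint2d = np.array([[0.4, 0.1], [0.1, 0.4]])
prod2d = np.outer(joint2d.sum(axis=1), joint2d.sum(axis=0))   # product of marginals
joint, prod = joint2d.ravel(), prod2d.ravel()

# alpha-Renyi information I_R^alpha(W;Z_i), with the formulas of Tables I and II as written.
renyi_info = np.log(np.sum(prod**alpha * joint**(1 - alpha))) / (alpha - 1)

geo = prod**alpha * joint**(1 - alpha)
geo /= geo.sum()                          # normalized geometric mixture appearing in Lemma 3

p_hat = rng.dirichlet(np.ones(4))         # generic auxiliary distribution
lhs = alpha * kl(p_hat, prod) + (1 - alpha) * kl(p_hat, joint)
rhs = (1 - alpha) * renyi_info + kl(p_hat, geo)
assert np.isclose(lhs, rhs)               # the identity in Lemma 3
assert lhs >= (1 - alpha) * renyi_info    # so the geometric mixture minimizes the left-hand side
print(lhs, (1 - alpha) * renyi_info)
```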

Via ADM and Lemma 3, we obtain a tighter version of the expected generalization error bound in Proposition 5.

Theorem 3 (Upper bound based on α\alpha-Rényi information)

Consider the same assumptions as in Proposition 5. Then, the following upper bound holds for all α∈(0,1),

|gen¯(PW|S,μ)|1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α.\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}}. (55)

The bound in Theorem 3 results from minimizing the bound in Proposition 5 over the joint auxiliary distribution P^W,Zi𝒫(𝒲×𝒵)\hat{P}_{W,Z_{i}}\in\mathcal{P}(\mathcal{W}\times\mathcal{Z}). Such an optimal joint auxiliary distribution is

P^W,Zi=(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α).\widehat{P}_{W,Z_{i}}=\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}.
Remark 3 (Deterministic algorithms per sample)

If the parameter W is a deterministic function of the data sample Z_i, then I(W;Z_i) is not well-defined, as P_{W,Z_i} is not absolutely continuous with respect to P_W⊗P_{Z_i} (we say μ≪ν, i.e., μ is absolutely continuous with respect to ν, if ν(A)=0 for some measurable A implies μ(A)=0). However, by considering the α-Rényi information for α∈[0,1), we do not need to assume absolute continuity.

Remark 4 (Upper bound based on the Bhattacharyya distance)

We can derive the expected generalization error upper bound based on Bhattacharyya distance by considering α=1/2\alpha=1/2 in Theorem 3,

|gen¯(PW|S,μ)|2ni=1n(σ2+γ2)DB(PW,ZiPWμ),\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{2}{n}\sum_{i=1}^{n}\sqrt{(\sigma^{2}+\gamma^{2})D_{B}(P_{W,Z_{i}}\|P_{W}\otimes\mu)},
Remark 5 (Recovering the upper bound based on mutual information and lautum information)

We can recover the expected generalization error upper bound based on mutual information in [7] and lautum information in [42] by considering α1\alpha\rightarrow 1 and α0\alpha\rightarrow 0 in Theorem 3, respectively.

By considering P^W,Zi=PW,Zi\widehat{P}_{W,Z_{i}}=P_{W,Z_{i}}, we have,

αI(W;Zi)=(1α)IRα(W;Zi)\displaystyle\alpha I(W;Z_{i})=(1-\alpha)I_{\mathrm{R}}^{\alpha}(W;Z_{i}) (56)
+KL(PW,Zi(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)).\displaystyle\quad+\mathrm{KL}\left(P_{W,Z_{i}}\|\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right).

Since the KL divergence is non-negative, based on Lemma 3 and the monotonicity of R_α with respect to α, we have,

IRα(W;Zi)min{1,α1α}I(W;Zi).\displaystyle I_{\mathrm{R}}^{\alpha}(W;Z_{i})\leq\min\left\{1,\frac{\alpha}{1-\alpha}\right\}I(W;Z_{i}). (57)

The result in (57) implies that our expected generalization error bound based on α-Rényi information in Theorem 3 exhibits the same convergence rate as the upper bound based on mutual information in [7].

Proposition 6 (Convergence rate of upper bound based on α\alpha-Rényi information)

Assume the hypothesis space is finite and the data samples are i.i.d. Then, the upper bounds based on α\alpha-Rényi information in Theorem 3 have a convergence rate of 𝒪(1n)\mathcal{O}(\frac{1}{\sqrt{n}}).

We can also provide an upper bound based on Sibson’s α\alpha-mutual information.

Theorem 4 (Upper bound based on Sibson’s α\alpha mutual information)

Assume that the loss function is σ\sigma-sub-Gaussian under distribution μ\mu for all w𝒲w\in\mathcal{W} and γ\gamma-sub-Gaussian under PW,ZiP_{W,Z_{i}}, i=1,,n\forall i=1,\cdots,n. Then, it holds that:

|gen¯(PW|S,μ)|1ni=1n2(ασ2+(1α)γ2)ISα(W;Zi)α.\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{S}}^{\alpha}(W;Z_{i})}{\alpha}}.

The upper bound based on α-Rényi divergence could also be derived using the variational representation of α-Rényi divergence in [45]. This approach is applied in [12] by considering sub-Gaussianity under P_{Z_i} and P_{Z_i|W}. Our approach is more general, paving the way to the upper bound based on Sibson's α-mutual information in Theorem 4, which is derived via ADM. Since

ISα(W;Zi)\displaystyle I_{\mathrm{S}}^{\alpha}(W;Z_{i}) =minQW𝒫(𝒲)Rα(PW,ZiQWμ)\displaystyle=\min_{Q_{W}\in\mathcal{P}(\mathcal{W})}\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|Q_{W}\otimes\mu) (58)
Rα(PW,ZiPWμ)=IRα(W;Zi),\displaystyle\leq\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|P_{W}\otimes\mu)=I_{\mathrm{R}}^{\alpha}(W;Z_{i}), (59)

the upper bound in Theorem 4 is tighter than the upper bound in Theorem 3. It is worth mentioning that in Theorem 3 we assume the loss function is σ-sub-Gaussian under the distribution P_W⊗μ, whereas in Theorem 4 we assume the loss function is σ-sub-Gaussian under μ for all w∈𝒲.

We can also apply generalized Pinsker’s inequality [35] to bounded loss functions for bounding the expected generalization error using the α\alpha-Rényi information between data samples, SS, and hypothesis, WW.

Proposition 7

Let ℓ(w,z) be a bounded loss function, i.e., |ℓ(w,z)|≤b. Then,

|gen¯(PW|S,μ)|1ni=1n2b2αIRα(W;Zi),α(0,1].\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{\frac{2b^{2}}{\alpha}I_{\mathrm{R}}^{\alpha}(W;Z_{i})},\quad\forall\alpha\in(0,1]. (60)

Considering a bounded loss function, an upper bound based on Sibson's α-mutual information between S and W can be provided via an approach similar to Proposition 7.

III-C Comparison of Proposed Upper Bounds

A summary of upper bounds on expected generalization error under various σ\sigma-sub-Gaussian assumptions is provided in Table III.

TABLE III: Expected generalization error upper bounds. We compare our bounds with the mutual information and lautum information bounds in terms of finiteness and the distribution under which sub-Gaussianity is assumed.
Upper Bound Measure sub-Gaussian Assumption Bound Is finite?
Mutual information ([8]) PWμP_{W}\otimes\mu 1ni=1n2σ2I(W;Zi)\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma^{2}I(W;Z_{i})} No
Lautum information ([42]) PW,ZiP_{W,Z_{i}}, i=1,,n\forall i=1,\ldots,n 1ni=1n2γ2L(W;Zi)\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\gamma^{2}L(W;Z_{i})} No
α\alpha-JS\mathrm{JS} information (Proposition 3) PW,Zi(α)P_{W,Z_{i}}^{(\alpha)}, i=1,,n\forall i=1,\ldots,n 1ni=1n2σ(α)2IJSα(W;Zi)α(1α)\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}{\alpha(1-\alpha)}} Yes (σ(α)2h(α)α(1α)\sigma_{(\alpha)}\sqrt{2\frac{h(\alpha)}{\alpha(1-\alpha)}})
α\alpha-Rényi information (0α<10\leq\alpha<1) (Theorem 3) PWμP_{W}\otimes\mu and PW,ZiP_{W,Z_{i}}, i=1,,n\forall i=1,\ldots,n 1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}} No
Remark 6 (Bounded loss function)

The bounded loss function l:𝒲×𝒵[a,b]l:\mathcal{W}\times\mathcal{Z}\rightarrow[a,b] is (ba2)(\frac{b-a}{2})-sub-Gaussian under all distributions [7]. In fact, for bounded functions, we have,

σ=γ=σ(α)=(ba)2.\displaystyle\sigma=\gamma=\sigma_{(\alpha)}=\frac{(b-a)}{2}. (61)

We next compare the upper bounds based on α\alpha-JS\mathrm{JS} information, Theorem 2, with the upper bounds based on α\alpha-Rényi information, Theorem 3. The next proposition showcases that the α\alpha-JS\mathrm{JS} information bound can be tighter than the α\alpha-Rényi based upper bound under certain conditions. The proof details are deferred to Appendix C.

Proposition 8 (Comparison of upper bounds based on α\alpha^{\prime}-Jensen-Shannon and α\alpha-Rényi information measures)

Consider the same assumptions in Theorem 2. Then, it follows that α\alpha^{\prime}-Jensen-Shannon bound given by:

|gen¯(PW|S,μ)|1ni=1n2σ(α)2IJSα(W;Zi)α(1α),0α1|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha^{\prime})}^{2}\frac{I_{\mathrm{JS}}^{\alpha^{\prime}}(W;Z_{i})}{\alpha^{\prime}(1-\alpha^{\prime})}},\quad 0\leq\alpha^{\prime}\leq 1 (62)

is tighter than the α\alpha-Rényi based upper bound for 0α1\quad 0\leq\alpha\leq 1, given by,

|gen¯(PW|S,μ)|1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α,\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}}, (63)

provided that αh(α)(1α)αIRα(W;Zi)\frac{\alpha h(\alpha^{\prime})}{(1-\alpha^{\prime})\alpha^{\prime}}\leq I_{\mathrm{R}}^{\alpha}(W;Z_{i}) holds for i=1,,ni=1,\cdots,n and σ(α)=σ=γ\sigma_{(\alpha^{\prime})}=\sigma=\gamma.

Remark 7

The condition in Proposition 8, i.e., αh(α′)/((1−α′)α′)≤I_R^α(W;Z_i), is least restrictive for α′=1/2, i.e., when considering the upper bound based on Jensen-Shannon information.

Remark 8

If we consider α1\alpha\rightarrow 1 and α=12\alpha^{\prime}=\frac{1}{2} in Proposition 8, then the upper bound based on Jensen-Shannon information is tighter than ones based on mutual information [8] provided that 4log(2)I(W;Zi)4\log(2)\leq I(W;Z_{i}) for all i=1,,ni=1,\cdots,n and σ=σJS\sigma=\sigma_{JS}.

IV Upper Bounds on Excess Risk

This section provides upper bounds on the excess risk of learning algorithms obtained as solutions to empirical risk minimization (ERM) regularized by the α-Rényi or α-JS divergence.

IV-A α\alpha-JS\mathrm{JS}-Regularized ERM

It is interesting to consider the regularized ERM with α\alpha-JS\mathrm{JS} information between dataset SS, and hypothesis WW,

minPW|S𝔼[LE(W,S)]+1βIJSα(W;S),\min_{P_{W|S}}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}I_{\mathrm{JS}}^{\alpha}(W;S), (64)

where β>0\beta>0 is a parameter that balances fitting and generalization. Since the optimization problem in (64) is dependent on the data generating distribution, μ\mu, we relax the problem and replace α\alpha-JS\mathrm{JS} information with the α\alpha-JS\mathrm{JS} divergence JSα(PW|SQW|PS)\mathrm{JS}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), as follows,

minPW|S𝔼[LE(W,S)]+1βJSα(PW|SQW|PS),\displaystyle\min_{P_{W|S}}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}\mathrm{JS}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), (65)

where QW𝒫(𝒲)Q_{W}\in\mathcal{P}(\mathcal{W}) is a prior distribution over the hypothesis space.

Lemma 4 (Solution existence of α\alpha-JS\mathrm{JS}-regularized ERM)

The optimization problem in (65) is a convex optimization problem and has a solution.

Proof:

The first term in the objective, 𝔼[LE(W,S)]\mathbb{E}[L_{E}(W,S)], is linear in PW|SP_{W|S}, and the second term, 1βJSα(PW|SQW|PS)\frac{1}{\beta}\mathrm{JS}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), is convex in PW|SP_{W|S} for 0<α<10<\alpha<1 due to [46]. Therefore, the problem is convex and a solution exists. ∎

Let us define the solution of (65),

PW|S,β,JSα:=argminPW|S𝒫(𝒲)𝔼[LE(W,S)]+1βJSα(PW|SQW|PS).P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}}:=\arg\min_{P_{W|S}\in\mathcal{P}(\mathcal{W})}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}\mathrm{JS}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}).
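As an illustration of the relaxed problem (65), the following Python sketch solves it by a simple grid search for a toy two-hypothesis class with fixed, hypothetical empirical risks and a uniform prior; it assumes the convention JS_α(P∥Q) = αKL(Q∥M) + (1−α)KL(P∥M) with M = αQ + (1−α)P, consistent with the mixture P^(α) used earlier, and it is only a sketch, not the paper's algorithm.

import numpy as np

def kl(p, q):
    # KL divergence between two strictly positive finite distributions (nats)
    return float(np.sum(p * np.log(p / q)))

def js_alpha(p, q, alpha):
    # assumed convention: JS_alpha(P||Q) = alpha*KL(Q||M) + (1-alpha)*KL(P||M), M = alpha*Q + (1-alpha)*P
    m = alpha * q + (1 - alpha) * p
    return alpha * kl(q, m) + (1 - alpha) * kl(p, m)

emp_risk = np.array([0.30, 0.10])    # hypothetical L_E(w_1, S), L_E(w_2, S) for a fixed dataset S
prior = np.array([0.5, 0.5])         # prior Q_W
alpha, beta = 0.5, 10.0

grid = np.linspace(1e-3, 1 - 1e-3, 999)
objective = [p1 * emp_risk[0] + (1 - p1) * emp_risk[1]
             + js_alpha(np.array([p1, 1 - p1]), prior, alpha) / beta for p1 in grid]
p_star = grid[int(np.argmin(objective))]
print("posterior mass on w_1:", round(p_star, 3))   # mass shifts toward the lower-risk hypothesis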

In the following, we provide an upper bound on the excess risk of PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}} viewed as a learning algorithm.

Theorem 5 (Upper bound on excess risk under PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}})

Assume that the loss function is bounded, i.e., |(w,z)|b|\ell(w,z)|\leq b for all (w,z)𝒲×𝒵(w,z)\in\mathcal{W}\times\mathcal{Z}, and L~\tilde{L}-Lipschitz in ww. Then, the following upper bound holds on the excess risk under PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},

r(PW|S,β,JSα,μ)2b2nα(1α)i=1nIJSα(W;Zi)+L~dβ+JSα(𝒩(w,β1Id)Q)β,\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}\\ \quad&+\frac{\tilde{L}\sqrt{d}}{\beta}+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta},\end{split}

where w=argminw𝒲Lμ(w)w^{\star}=\arg\min_{w\in\mathcal{W}}L_{\mu}(w) and IdI_{d} is the identity matrix of size dd.

Corollary 2 (Convergence rate of excess risk under PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}})

Under the same assumptions as in Theorem 5, assuming that the hypothesis space is finite and β\beta is of order n\sqrt{n}, the following upper bound holds on the excess risk of PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}}, with convergence rate 𝒪(n1/2)\mathcal{O}(n^{-1/2}):

r(PW|S,β,JSα,μ)2b2nα(1α)i=1nIJSα(W;Zi)+L~dn+h(α)n,\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}\\ &\quad+\frac{\tilde{L}\sqrt{d}}{\sqrt{n}}+\frac{h(\alpha)}{\sqrt{n}},\end{split}
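As a rough numerical illustration of this 𝒪(n^{−1/2}) behaviour, the following Python sketch evaluates the three terms of the bound for increasing n; all constants are placeholders, and the information term is upper bounded using Σ_i I_JS^α(W;Z_i) ≤ (1−α)log k for a finite hypothesis space of size k, which follows from the inequalities used in Appendix A.

import numpy as np

b, L_tilde, d, alpha, log_k = 1.0, 2.0, 10, 0.5, np.log(100)     # placeholder constants
h_alpha = -alpha * np.log(alpha) - (1 - alpha) * np.log(1 - alpha)
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    info_term = np.sqrt(2 * b ** 2 * log_k / (n * alpha))        # uses sum_i I_JS^alpha <= (1-alpha) log k
    lip_term = L_tilde * np.sqrt(d) / np.sqrt(n)
    reg_term = h_alpha / np.sqrt(n)
    print(n, round(info_term, 4), round(lip_term, 4), round(reg_term, 4))   # each term shrinks like n**-0.5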
Remark 9 (Comparison to the Gibbs algorithm)

The convergence rate of our upper bound on the excess risk under PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}}, namely 𝒪(n1/2)\mathcal{O}(n^{-1/2}), is faster than the 𝒪(n1/4)\mathcal{O}(n^{-1/4}) convergence rate of the upper bound on the excess risk under the Gibbs algorithm, i.e., the solution of KL-regularized empirical risk minimization [7, Corollary 3], [47].

IV-B α\alpha-Rényi-Regularized ERM

Similarly, it is interesting to consider the regularized ERM with the α\alpha-Rényi information between the dataset SS and the hypothesis WW, for 0<α<10<\alpha<1,

minPW|S𝔼[LE(W,S)]+1βIRα(W;S),\displaystyle\min_{P_{W|S}}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}I_{\mathrm{R}}^{\alpha}(W;S), (66)

where β>0\beta>0 is a parameter that balances fitting and generalization.

Since the optimization problem in (66) depends on the data-generating distribution μ\mu, we relax the problem in (66) by replacing the α\alpha-Rényi information, i.e., IRα(W;S)I_{\mathrm{R}}^{\alpha}(W;S), with Rα(PW|SQW|PS)\mathrm{R}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), as follows,

minPW|S𝔼[LE(W,S)]+1βRα(PW|SQW|PS),\displaystyle\min_{P_{W|S}}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}\mathrm{R}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), (67)

where QW𝒫(𝒲)Q_{W}\in\mathcal{P}(\mathcal{W}) is a prior distribution over the hypothesis space.

Lemma 5 (Solution existence of α\alpha-Rényi-regularized ERM)

The optimization problem in (67) is a convex optimization problem and has a solution.

Proof:

The first term in the objective, 𝔼[LE(W,S)]\mathbb{E}[L_{E}(W,S)], is linear in PW|SP_{W|S}, and the second term, 1βRα(PW|SQW|PS)\frac{1}{\beta}\mathrm{R}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), is convex in PW|SP_{W|S} for 0<α<10<\alpha<1 due to [35, Theorem 11]. Therefore, the problem is convex and a solution exists. ∎

Let us define

PW|S,β,Rα:=argminPW|S𝒫(𝒲)𝔼[LE(W,S)]+1βRα(PW|SQW|PS),P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}}:=\arg\min_{P_{W|S}\in\mathcal{P}(\mathcal{W})}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}\mathrm{R}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}),

as the solution of the convex optimization problem (67).

Theorem 6 (Upper bound on excess risk under PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}})

Assume that the loss function is bounded, i.e., |(w,z)|b|\ell(w,z)|\leq b for all (w,z)𝒲×𝒵(w,z)\in\mathcal{W}\times\mathcal{Z}, and L~\tilde{L}-Lipschitz in ww. Then, the following upper bound holds on the excess risk under PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},

r(PW|S,β,Rα,μ)2b2nαi=1nIRα(W;Zi)+L~dβ+Rα(𝒩(w,β1Id)Q)β,\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}\\ &\quad+\frac{\tilde{L}\sqrt{d}}{\beta}+\frac{\mathrm{R}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta},\end{split}

where w=argminw𝒲Lμ(w)w^{\star}=\arg\min_{w\in\mathcal{W}}L_{\mu}(w) and IdI_{d} is the identity matrix of size dd.
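To make the bound of Theorem 6 concrete, the following Python sketch evaluates its three terms for entirely hypothetical values of b, L̃, d, n, α, ∥w⋆∥² and of the individual-sample Rényi informations, using the closed-form Rényi divergence between 𝒩(w⋆,β^{−1}I_d) and a standard normal prior given in Appendix D (from [49]); it is an illustration only, not a statement about any particular learning problem.

import numpy as np

# all values below are placeholders chosen only to evaluate the terms of the bound
b, L_tilde, d, n, alpha = 1.0, 2.0, 10, 10_000, 0.5
beta = np.sqrt(n)
w_star_norm2 = 4.0                      # ||w*||_2^2
renyi_info = np.full(n, 0.05)           # placeholder values for I_R^alpha(W;Z_i)

# closed-form Renyi divergence between N(w*, beta^{-1} I_d) and the prior N(0, I_d) (see Appendix D, [49])
renyi_prior = (alpha / 2) * w_star_norm2 / (alpha + (1 - alpha) / beta) \
    + d / (2 * (alpha - 1)) * np.log(beta ** (alpha - 1) / (alpha + (1 - alpha) / beta))

bound = (np.sqrt(2 * b ** 2 / (n * alpha) * renyi_info.sum())
         + L_tilde * np.sqrt(d) / beta + renyi_prior / beta)
print(round(float(bound), 4))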

Corollary 3 (Convergence rate of excess risk under PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}})

Under the same assumptions as in Theorem 6, assuming that the hypothesis space is finite and β\beta is of order n\sqrt{n}, the following upper bound holds on the excess risk of PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}}, with convergence rate 𝒪(log(n)/n)\mathcal{O}(\log(n)/\sqrt{n}):

r(PW|S,β,Rα,μ)2b2nαi=1nIRα(W;Zi)+L~dn+12nw22+d4nlog(n)+d2n(1α)log(α).\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}\\ &\quad+\frac{\tilde{L}\sqrt{d}}{\sqrt{n}}+\frac{1}{2\sqrt{n}}\|w^{\star}\|_{2}^{2}+\frac{d}{4\sqrt{n}}\log\big{(}n\big{)}+\frac{d}{2\sqrt{n}(1-\alpha)}\log\big{(}\alpha\big{)}.\end{split}

V Expected Generalization Error Upper Bounds Under Distribution Mismatch

In this section, we extend our results from Section III to the distribution mismatch setting, where the training data distribution differs from the test data distribution. All proof details are deferred to Appendix E.

Proposition 9

Assume that the loss function is σ(α)\sigma_{(\alpha)}-sub-Gaussian under the distributions PW,Zi(α)P_{W,Z_{i}}^{(\alpha)} for all i=1,,ni=1,\cdots,n, and under αμ+(1α)μ\alpha\mu+(1-\alpha)\mu^{\prime} for all w𝒲w\in\mathcal{W}. Then, under distribution mismatch (6), it holds for all α(0,1)\alpha\in(0,1) that:

|gen¯(PW|S,μ,μ)|2σ(α)2JSα(μμ)α(1α)\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime})|\leq\sqrt{2\sigma_{(\alpha)}^{2}\frac{\mathrm{JS}_{\alpha}(\mu^{\prime}\|\mu)}{\alpha(1-\alpha)}} (68)
+1ni=1n2σ(α)2IJSα(W;Zi)α(1α),α(0,1).\displaystyle\quad+\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}{\alpha(1-\alpha)}},\quad\forall\alpha\in(0,1).
Proposition 10

Assume that the loss function is σ\sigma-sub-Gaussian under the distributions μ\mu and μ\mu^{\prime} for all w𝒲w\in\mathcal{W}, and also γ\gamma-sub-Gaussian under PW,ZiP_{W,Z_{i}} for all i=1,,ni=1,\cdots,n. Then, the following upper bound holds for all α(0,1)\alpha\in(0,1):

|gen¯(PW|S,μ,μ)|2(ασ2+(1α)γ2)Rα(μμ)α\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime})|\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\mathrm{R}_{\alpha}(\mu^{\prime}\|\mu)}{\alpha}} (69)
+1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α.\displaystyle\quad+\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}}.

The mismatch between the test and training sample distributions is characterized in [21, Theorem 5] by the KL divergence between them, i.e., KL(μμ)\mathrm{KL}(\mu^{\prime}\|\mu). However, assuming that the loss function is σ(α)\sigma_{(\alpha)}-sub-Gaussian under αμ+(1α)μ\alpha\mu+(1-\alpha)\mu^{\prime} for all w𝒲w\in\mathcal{W}, Proposition 9 allows us to express the distribution mismatch in terms of the α\alpha-Jensen-Shannon divergence, which is always finite.

In Proposition 10, the distribution mismatch is presented in terms of the α\alpha-Rényi divergence, i.e., Rα(μμ)\mathrm{R}_{\alpha}(\mu^{\prime}\|\mu). If μ\mu^{\prime} is not absolutely continuous with respect to μ\mu, then KL(μμ)=\mathrm{KL}(\mu^{\prime}\|\mu)=\infty. However, for the α\alpha-Rényi divergence (0<α<10<\alpha<1) to be finite, it suffices that mutual singularity [35], i.e., μμ\mu^{\prime}\perp\!\!\!\perp\mu, does not hold, which is a less restrictive condition on μ\mu^{\prime} than absolute continuity.
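A minimal Python sketch of this distinction, using two illustrative uniform distributions that are not taken from the paper: with μ = Uniform[0,1] as the training distribution and μ′ = Uniform[0,2] as the test distribution, μ′ is not absolutely continuous with respect to μ, so KL(μ′∥μ) = ∞, while R_α(μ′∥μ) stays finite (it equals αlog(2)/(1−α)).

import numpy as np

alpha = 0.5
x = np.linspace(0.0, 2.0, 200001)
p_test = np.full_like(x, 0.5)                 # density of mu' = Uniform[0, 2]
p_train = np.where(x <= 1.0, 1.0, 0.0)        # density of mu  = Uniform[0, 1]

# R_alpha(mu'||mu) = (alpha - 1)^{-1} * log( integral of p_test^alpha * p_train^(1 - alpha) )
renyi = np.log(np.trapz(p_test ** alpha * p_train ** (1 - alpha), x)) / (alpha - 1)
print(round(float(renyi), 4))                 # approx alpha*log(2)/(1 - alpha) = log(2) for alpha = 0.5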

Similar to Remark 6, the sub-Gaussianity assumptions in Propositions 9 and 10 hold for bounded loss functions.

VI Numerical Example

In this section, we illustrate that some of our proposed bounds can be tighter than existing ones in a simple toy example. We consider the α\alpha-JS\mathrm{JS} and α\alpha-Rényi information only. Our example setting involves the estimation of the mean of a Gaussian random variable Z𝒩(β,σ2)Z\sim\mathcal{N}(\beta,\sigma^{2}) based on two i.i.d. samples Z1Z_{1} and Z2Z_{2}. We consider the hypothesis (estimate) given by W=tZ1+(1t)Z2W=tZ_{1}+(1-t)Z_{2} for 0<t<10<t<1. We also consider the loss function given by (w,z)=min((wz)2,c2)\ell(w,z)=\min((w-z)^{2},c^{2}).

Since the loss function is bounded within the interval [0,c2][0,c^{2}], it is c22\frac{c^{2}}{2}-sub-Gaussian under all distributions. Therefore, we can apply the expected generalization error upper bounds based on mutual information, α\alpha-JS\mathrm{JS} information, and α\alpha-Rényi information, for all α(0,1)\alpha\in(0,1), as follows:

gen¯(PW|Z1,Z2,PZ)c24(2I(W;Z1)+2I(W;Z2)),\displaystyle\overline{\operatorname*{gen}}(P_{W|Z_{1},Z_{2}},P_{Z})\leq\frac{c^{2}}{4}\big{(}\sqrt{2I(W;Z_{1})}+\sqrt{2I(W;Z_{2})}\big{)}, (70)
gen¯(PW|Z1,Z2,PZ)c24(2IJSα(W;Z1)α(1α)+2IJSα(W;Z2)α(1α)),\displaystyle\overline{\operatorname*{gen}}(P_{W|Z_{1},Z_{2}},P_{Z})\leq\frac{c^{2}}{4}\Big{(}\sqrt{2\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{1})}{\alpha(1-\alpha)}}+\sqrt{2\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{2})}{\alpha(1-\alpha)}}\Big{)}, (71)
gen¯(PW|Z1,Z2,PZ)c24(2IRα(W;Z1)α+2IRα(W;Z2)α).\displaystyle\overline{\operatorname*{gen}}(P_{W|Z_{1},Z_{2}},P_{Z})\leq\frac{c^{2}}{4}\Big{(}\sqrt{2\frac{I_{\textit{R}}^{\alpha}(W;Z_{1})}{\alpha}}+\sqrt{2\frac{I_{\textit{R}}^{\alpha}(W;Z_{2})}{\alpha}}\Big{)}. (72)

It can be immediately shown that W𝒩(β,σ2(t2+(1t)2))W\sim\mathcal{N}(\beta,\sigma^{2}(t^{2}+(1-t)^{2})), and that (W,Z1)(W,Z_{1}) and (W,Z2)(W,Z_{2}) are jointly Gaussian with correlation coefficients ρ1=tt2+(1t)2\rho_{1}=\frac{t}{\sqrt{t^{2}+(1-t)^{2}}} and ρ2=(1t)t2+(1t)2\rho_{2}=\frac{(1-t)}{\sqrt{t^{2}+(1-t)^{2}}}. Therefore, the mutual information appearing above is given by I(W;Z1)=12log(1ρ12)I(W;Z_{1})=-\frac{1}{2}\log(1-\rho_{1}^{2}) and I(W;Z2)=12log(1ρ22)I(W;Z_{2})=-\frac{1}{2}\log(1-\rho_{2}^{2}). In turn, the α\alpha-JS\mathrm{JS} information appearing above can be computed via an extension of the entropy-based formulation of the Jensen-Shannon measure, as follows [34]:

IJSα(W;Zi)=\displaystyle I_{\mathrm{JS}}^{\alpha}(W;Z_{i})= (73)
h(PW,Zi(α))(αh(PW)+αh(PZi)+(1α)h(PW,Zi)),\displaystyle\quad h\left(P_{W,Z_{i}}^{(\alpha)}\right)-(\alpha h(P_{W})+\alpha h(P_{Z_{i}})+(1-\alpha)h(P_{W,Z_{i}})),

– with h()h(\cdot) denoting the differential entropy – where

h(PZi)=12log(2πeσ2),\displaystyle h(P_{Z_{i}})=\frac{1}{2}\log(2\pi e\sigma^{2}),
h(PW)=12log(2πeσ2(t2+(1t)2)),\displaystyle h(P_{W})=\frac{1}{2}\log(2\pi e\sigma^{2}(t^{2}+(1-t)^{2})),
h(PW,Zi)=log(2πeσ2(t2+(1t)2)(1ρi2)),\displaystyle h(P_{W,Z_{i}})=\log\left(2\pi e\sigma^{2}\sqrt{(t^{2}+(1-t)^{2})(1-\rho_{i}^{2})}\right),

whereas h(PW,Zi(α))h\left(P_{W,Z_{i}}^{(\alpha)}\right) can be computed numerically.
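For reproducibility, the following Python sketch illustrates how one such point can be computed: it evaluates the mutual information bound (70) and the α-JS information bound (71) for a single illustrative choice of t, σ², and α, computing all differential entropies, including h(P^(α)), by numerical integration over a finite grid (so the result is approximate); the specific values of t, σ², and α below are ours and are not meant to reproduce Fig. 1 exactly.

import numpy as np
from scipy.stats import multivariate_normal

t, sigma2, m, alpha = 0.3, 1.0, 1.0, 0.5          # illustrative values only
c = np.sqrt(sigma2) / 4.0
var_w = sigma2 * (t ** 2 + (1 - t) ** 2)
rho = np.array([t, 1 - t]) / np.sqrt(t ** 2 + (1 - t) ** 2)

mi = -0.5 * np.log(1 - rho ** 2)                   # I(W;Z_1), I(W;Z_2)
mi_bound = c ** 2 / 4 * np.sum(np.sqrt(2 * mi))    # bound (70)

lim = 6.0 * np.sqrt(max(sigma2, var_w))            # integration window: mean +/- 6 std
grid = np.linspace(m - lim, m + lim, 600)
w, z = np.meshgrid(grid, grid, indexing="ij")
pts = np.dstack([w, z])
dA = (grid[1] - grid[0]) ** 2

def entropy(p):
    # differential entropy approximated on the grid
    return -np.sum(p[p > 0] * np.log(p[p > 0])) * dA

js_terms = []
for r in rho:
    cov_wz = r * np.sqrt(var_w * sigma2)
    p_joint = multivariate_normal([m, m], [[var_w, cov_wz], [cov_wz, sigma2]]).pdf(pts)
    p_prod = multivariate_normal([m, m], [[var_w, 0.0], [0.0, sigma2]]).pdf(pts)
    mix = alpha * p_prod + (1 - alpha) * p_joint   # density of P^(alpha)
    i_js = entropy(mix) - alpha * entropy(p_prod) - (1 - alpha) * entropy(p_joint)
    js_terms.append(np.sqrt(2 * i_js / (alpha * (1 - alpha))))
js_bound = c ** 2 / 4 * np.sum(js_terms)           # bound (71)

print("MI bound:", round(float(mi_bound), 4), " alpha-JS bound:", round(float(js_bound), 4))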

Fig. 1 depicts the true generalization error, the mutual information based bound in (70), and the α\alpha-JS\mathrm{JS} information based bound in (71) for α=0.25,0.5,0.75\alpha=0.25,0.5,0.75, over values of t(0,0.5]t\in(0,0.5], considering σ2=1,10\sigma^{2}=1,10, β=1\beta=1, and c=σ4c=\frac{\sigma}{4}.

It can be seen that for α=0.75\alpha=0.75 the α\alpha-JS\mathrm{JS} information bound is tighter than the mutual information bound. For α=0.5\alpha=0.5, which corresponds to the traditional Jensen-Shannon information, the Jensen-Shannon information bound is tighter than the mutual information bound for t<0.25t<0.25; in contrast, for t>0.25t>0.25, the mutual information bound is slightly better than the Jensen-Shannon information bound. This shows that our proposed bounds can be tighter than existing ones in some regimes.

Figure 1: True generalization error, α\alpha-JS\mathrm{JS} based bound for α=0.25,0.5,0.75\alpha=0.25,0.5,0.75, and Mutual Information based bound.

Fig. 2 also depicts the true generalization error, the mutual information based bound in (70), and the α\alpha-Rényi information based bound in (72) for α=0.25,0.5,0.75\alpha=0.25,0.5,0.75. It can be seen that the α\alpha-Rényi based bound is looser than the mutual information based bound. In our experimental setup, when t0t\to 0 (or t1t\to 1), we have I(W;Z2)I(W;Z_{2})\to\infty (or I(W;Z1)I(W;Z_{1})\to\infty), whereas the α\alpha-Rényi based bound remains finite.

Figure 2: True generalization error, α\alpha-Rényi based bound for α=0.25,0.5,0.75\alpha=0.25,0.5,0.75, and Mutual Information based bound.

VII Conclusion and Future Works

We have presented the Auxiliary Distribution Method, a novel approach for deriving information-theoretic upper bounds on the generalization error within the context of supervised learning problems. Our method offers the flexibility to recover existing bounds while also enabling the derivation of new bounds grounded in the α\alpha-JS\mathrm{JS} and α\alpha-Rényi information measures. Notably, our upper bounds, which are rooted in the α\alpha-JS\mathrm{JS} information measure, are finite, in contrast to mutual information-based bounds. Moreover, our upper bound based on α\alpha-Rényi information, for α(0,1)\alpha\in(0,1), remains finite when considering a deterministic learning process. An intriguing observation is that our newly introduced α\alpha-JS\mathrm{JS} information measure can, in certain regimes, yield tighter bounds compared to existing approaches. We also establish the existence of solutions to the α\alpha-JS\mathrm{JS}-regularized and α\alpha-Rényi-regularized empirical risk minimization problems and provide upper bounds on the excess risk of the resulting algorithms, where the upper bound on the excess risk under α\alpha-JS\mathrm{JS}-regularized empirical risk minimization is tighter than other well-known upper bounds on the excess risk. Furthermore, we provide an upper bound on the generalization error in a mismatch scenario, where the distributions of the test and training datasets differ, via our auxiliary distribution method.

As a direction for future research, we propose extending our bounds to the PAC-Bayesian framework, leveraging the α\alpha-JS\mathrm{JS} and α\alpha-Rényi divergences for 0<α<10<\alpha<1. Additionally, the conditional technique based on individual sample measures, as described in [18], could be applied to improve the effectiveness of our upper bounds.

Acknowledgment

Gholamali Aminian is supported in part by the Royal Society Newton International Fellowship, grant no. NIF\R1 \192656, the UKRI Prosperity Partnership Scheme (FAIR) under the EPSRC Grant EP/V056883/1, and the Alan Turing Institute. Saeed Masiha worked on this project at Sharif University of Technology before joining EPFL.

References

  • [1] G. Aminian, L. Toni, and M. R. D. Rodrigues, “Jensen-Shannon information based characterization of the generalization error of learning algorithms,” in 2020 IEEE Information Theory Workshop (ITW), pp. 1–5, 2021.
  • [2] V. N. Vapnik, “An overview of statistical learning theory,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999.
  • [3] O. Bousquet and A. Elisseeff, “Stability and generalization,” Journal of Machine Learning Research, vol. 2, no. Mar, pp. 499–526, 2002.
  • [4] H. Xu and S. Mannor, “Robustness and generalization,” Machine Learning, vol. 86, no. 3, pp. 391–423, 2012.
  • [5] D. A. McAllester, “PAC-Bayesian stochastic model selection,” Machine Learning, vol. 51, no. 1, pp. 5–21, 2003.
  • [6] D. Russo and J. Zou, “How much does your data exploration overfit? controlling bias via information usage,” IEEE Transactions on Information Theory, vol. 66, no. 1, pp. 302–323, 2019.
  • [7] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” in Advances in Neural Information Processing Systems, pp. 2524–2533, 2017.
  • [8] Y. Bu, S. Zou, and V. V. Veeravalli, “Tightening mutual information-based bounds on generalization error,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 121–130, 2020.
  • [9] A. Asadi, E. Abbe, and S. Verdú, “Chaining mutual information and tightening generalization bounds,” in Advances in Neural Information Processing Systems, pp. 7234–7243, 2018.
  • [10] A. R. Asadi and E. Abbe, “Chaining meets chain rule: Multilevel entropic regularization and training of neural networks,” Journal of Machine Learning Research, vol. 21, no. 139, pp. 1–32, 2020.
  • [11] A. R. Esposito, M. Gastpar, and I. Issa, “Generalization error bounds via Rényi-, f-divergences and maximal leakage,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 4986–5004, 2021.
  • [12] E. Modak, H. Asnani, and V. M. Prabhakaran, “Rényi divergence based bounds on generalization error,” in 2021 IEEE Information Theory Workshop (ITW), pp. 1–6, 2021.
  • [13] A. T. Lopez and V. Jog, “Generalization error bounds using Wasserstein distances,” in 2018 IEEE Information Theory Workshop (ITW), pp. 1–5, IEEE, 2018.
  • [14] H. Wang, M. Diaz, J. C. S. Santos Filho, and F. P. Calmon, “An information-theoretic view of generalization via Wasserstein distance,” in 2019 IEEE International Symposium on Information Theory (ISIT), pp. 577–581, IEEE, 2019.
  • [15] B. R. Gálvez, G. Bassi, R. Thobaben, and M. Skoglund, “Tighter expected generalization error bounds via Wasserstein distance,” in Advances in Neural Information Processing Systems, 2021.
  • [16] G. Aminian, Y. Bu, G. Wornell, and M. Rodrigues, “Tighter expected generalization error bounds via convexity of information measures,” in IEEE International Symposium on Information Theory (ISIT), 2022.
  • [17] T. Steinke and L. Zakynthinou, “Reasoning about generalization via conditional mutual information,” in Conference on Learning Theory, pp. 3437–3452, PMLR, 2020.
  • [18] R. Zhou, C. Tian, and T. Liu, “Individually conditional individual mutual information bound on generalization error,” IEEE Transactions on Information Theory, pp. 1–1, 2022.
  • [19] H. Hafez-Kolahi, Z. Golgooni, S. Kasaei, and M. Soleymani, “Conditioning and processing: Techniques to improve information-theoretic generalization bounds,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [20] G. Aminian, Y. Bu, L. Toni, M. Rodrigues, and G. Wornell, “An exact characterization of the generalization error for the Gibbs algorithm,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [21] M. S. Masiha, A. Gohari, M. H. Yassaee, and M. R. Aref, “Learning under distribution mismatch and model misspecification,” in IEEE International Symposium on Information Theory (ISIT), 2021.
  • [22] Y. Mansour, M. Mohri, and A. Rostamizadeh, “Domain adaptation: Learning bounds and algorithms,” arXiv preprint arXiv:0902.3430, 2009.
  • [23] Z. Wang, “Theoretical guarantees of transfer learning,” 2018.
  • [24] X. Wu, J. H. Manton, U. Aickelin, and J. Zhu, “Information-theoretic analysis for transfer learning,” in 2020 IEEE International Symposium on Information Theory (ISIT), pp. 2819–2824, IEEE, 2020.
  • [25] E. Englesson and H. Azizpour, “Generalized Jensen-Shannon divergence loss for learning with noisy labels,” Advances in Neural Information Processing Systems, vol. 34, pp. 30284–30297, 2021.
  • [26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
  • [27] P. Melville, S. M. Yang, M. Saar-Tsechansky, and R. Mooney, “Active learning for probability estimation using Jensen-Shannon divergence,” in European Conference on Machine Learning, pp. 268–279, Springer, 2005.
  • [28] E. Choi and C. Lee, “Feature extraction based on the Bhattacharyya distance,” Pattern Recognition, vol. 36, no. 8, pp. 1703–1709, 2003.
  • [29] G. B. Coleman and H. C. Andrews, “Image segmentation by clustering,” Proceedings of the IEEE, vol. 67, no. 5, pp. 773–785, 1979.
  • [30] F. Topsøe, “Information theory at the service of science,” in Entropy, Search, Complexity, pp. 179–207, Springer, 2007.
  • [31] F. Topsoe, “Some inequalities for information divergence and related measures of discrimination,” IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1602–1609, 2000.
  • [32] T. M. Cover, Elements of information theory. John Wiley & Sons, 1999.
  • [33] F. Nielsen, “On a generalization of the Jensen-Shannon divergence and the Jensen-Shannon centroid,” Entropy, vol. 22, no. 2, p. 221, 2020.
  • [34] J. Lin, “Divergence measures based on the Shannon entropy,” IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
  • [35] T. Van Erven and P. Harremos, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
  • [36] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52–60, 1967.
  • [37] D. P. Palomar and S. Verdú, “Lautum information,” IEEE Transactions on Information Theory, vol. 54, no. 3, pp. 964–975, 2008.
  • [38] I. Sason and S. Verdú, “ff-divergence inequalities,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.
  • [39] S. Verdú, “α\alpha-mutual information,” in 2015 Information Theory and Applications Workshop (ITA), pp. 1–6, IEEE, 2015.
  • [40] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
  • [41] P. Dupuis and R. S. Ellis, A weak convergence approach to the theory of large deviations, vol. 902. John Wiley & Sons, 2011.
  • [42] M. Gastpar, A. R. Esposito, and I. Issa, “Information measures, learning and generalization,” 5th London Symposium on Information Theory, 2019.
  • [43] F. Topsoe, “Inequalities for the Jensen-Shannon divergence,” draft available at http://www.math.ku.dk/topsoe, 2002.
  • [44] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of stability and bias of learning algorithms,” in 2016 IEEE Information Theory Workshop (ITW), pp. 26–30, IEEE, 2016.
  • [45] V. Anantharam, “A variational characterization of rényi divergences,” IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 6979–6989, 2018.
  • [46] I. Sason, “On f-divergences: Integral representations, local behavior, and inequalities,” Entropy, vol. 20, no. 5, p. 383, 2018.
  • [47] I. Kuzborskij, N. Cesa-Bianchi, and C. Szepesvári, “Distribution-dependent analysis of Gibbs-ERM principle,” in Conference on Learning Theory, pp. 2028–2054, PMLR, 2019.
  • [48] F. Topsøe, “Jenson-Shannon divergence and norm-based measures of discrimination and variation,” preprint, 2003.
  • [49] M. Gil, F. Alajaji, and T. Linder, “Rényi divergence measures for commonly used univariate continuous distributions,” Information Sciences, vol. 249, pp. 124–131, 2013.

Appendix A Proofs of Section III-A

Proof:
αKL(PWPZiP^W,Zi)+(1α)KL(PW,ZiP^W,Zi)\displaystyle\alpha\mathrm{KL}(P_{W}\otimes P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}) (74)
=𝒲×𝒵α(dPWdPZi)log(dPWdPZi)\displaystyle=\int_{\mathcal{W}\times\mathcal{Z}}\alpha(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})\log(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}) (75)
+𝒲×𝒵(1α)dPW,Zilog(dPW,Zi)\displaystyle\quad+\int_{\mathcal{W}\times\mathcal{Z}}(1-\alpha)\mathrm{d}P_{W,Z_{i}}\log(\mathrm{d}P_{W,Z_{i}})
𝒲×𝒵((α(dPWdPZi)+(1α)dPW,Zi)log(dP^W,Zi))\displaystyle\quad-\int_{\mathcal{W}\times\mathcal{Z}}((\alpha(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})+(1-\alpha)\mathrm{d}P_{W,Z_{i}})\log(d\widehat{P}_{W,Z_{i}}))
=𝒲×𝒵α(dPWdPZi)log(dPWdPZi)\displaystyle=\int_{\mathcal{W}\times\mathcal{Z}}\alpha(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})\log(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}) (76)
+𝒲×𝒵(1α)dPW,Zilog(dPW,Zi)dPW,Zi(α)log(dP^W,Zi)\displaystyle\quad+\int_{\mathcal{W}\times\mathcal{Z}}(1-\alpha)\mathrm{d}P_{W,Z_{i}}\log(\mathrm{d}P_{W,Z_{i}})-\mathrm{d}P_{W,Z_{i}}^{(\alpha)}\log(d\widehat{P}_{W,Z_{i}})
+𝒲×𝒵dPW,Zi(α)log(dPW,Zi(α))dPW,Zi(α)log(dPW,Zi(α))\displaystyle\quad+\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W,Z_{i}}^{(\alpha)}\log(\mathrm{d}P_{W,Z_{i}}^{(\alpha)})-\mathrm{d}P_{W,Z_{i}}^{(\alpha)}\log(\mathrm{d}P_{W,Z_{i}}^{(\alpha)})
=IJSα(W;Zi)+KL(PW,Zi(α)P^W,Zi).\displaystyle=I_{\mathrm{JS}}^{\alpha}(W;Z_{i})+\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|\widehat{P}_{W,Z_{i}}). (77)

Proof:

As shown in [48], and by considering Lemma 2, we have

minP^W,ZiαKL(PWμP^W,Zi)+(1α)KL(PW,ZiP^W,Zi)=\displaystyle\min_{\widehat{P}_{W,Z_{i}}}\alpha\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})= (78)
minP^W,ZiIJSα(W;Zi)+KL(PW,Zi(α)P^W,Zi).\displaystyle\min_{\widehat{P}_{W,Z_{i}}}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})+\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|\widehat{P}_{W,Z_{i}}).

Since 0KL(PW,Zi(α)P^W,Zi)0\leq\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|\widehat{P}_{W,Z_{i}}), the minimum in (36) is achieved at P^W,Zi=PW,Zi(α)\widehat{P}_{W,Z_{i}}=P_{W,Z_{i}}^{(\alpha)}. Considering P^W,Zi=PW,Zi(α)\widehat{P}_{W,Z_{i}}=P_{W,Z_{i}}^{(\alpha)} in Proposition 1 then completes the proof. ∎

Proof:

Using (46),

IJSα(W;Zi)(1α)I(W;Zi),\displaystyle I_{\mathrm{JS}}^{\alpha}(W;Z_{i})\leq(1-\alpha)I(W;Z_{i}), (79)

we have:

|gen¯(PW|S,μ)|\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)| 1ni=1n2σ(α)2IJSα(W;Zi)α(1α)\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}{\alpha(1-\alpha)}} (80)
1ni=1n2σ(α)2I(W;Zi)α\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I(W;Z_{i})}{\alpha}} (81)
2σ(α)2i=1nI(W;Zi)αn\displaystyle\leq\sqrt{2\sigma_{(\alpha)}^{2}\frac{\sum_{i=1}^{n}I(W;Z_{i})}{\alpha n}} (82)
2σ(α)2I(W;S)αn\displaystyle\leq\sqrt{2\sigma_{(\alpha)}^{2}\frac{I(W;S)}{\alpha n}} (83)
2σ(α)2H(W)αn,\displaystyle\leq\sqrt{2\sigma_{(\alpha)}^{2}\frac{H(W)}{\alpha n}}, (84)

where the final result follows by upper bounding H(W)H(W) by the logarithm of the cardinality of the finite hypothesis space. ∎

Proof:

This proposition follows from the fact that IJSα(W;Zi)h(α)I_{\mathrm{JS}}^{\alpha}(W;Z_{i})\leq h(\alpha) for i=1,,ni=1,\cdots,n, which we now prove.

IJSα(W,Zi)=\displaystyle I_{\mathrm{JS}}^{\alpha}(W,Z_{i})= (85)
αKL(PWPZiPW,Zi(α))+(1α)KL(PW,ZiPW,Zi(α))\displaystyle\quad\alpha\mathrm{KL}(P_{W}\otimes P_{Z_{i}}\|P_{W,Z_{i}}^{(\alpha)})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|P_{W,Z_{i}}^{(\alpha)})
=α𝒲×𝒵dPWdPZilog(dPWdPZidPW,Zi(α))\displaystyle=\alpha\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}\log\left(\frac{\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}}{\mathrm{d}P_{W,Z_{i}}^{(\alpha)}}\right) (86)
+(1α)𝒲×𝒵dPW,Zilog(dPW,ZidPW,Zi(α))\displaystyle\quad+(1-\alpha)\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W,Z_{i}}\log\left(\frac{\mathrm{d}P_{W,Z_{i}}}{\mathrm{d}P_{W,Z_{i}}^{(\alpha)}}\right)
α𝒲×𝒵dPWdPZilog(dPWdPZiα(dPWdPZi))\displaystyle\leq\alpha\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}\log\left(\frac{\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}}{\alpha(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})}\right) (87)
+(1α)𝒲×𝒵dPW,Zilog(dPW,Zi(1α)dPW,Zi)\displaystyle\quad+(1-\alpha)\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W,Z_{i}}\log\left(\frac{\mathrm{d}P_{W,Z_{i}}}{(1-\alpha)\mathrm{d}P_{W,Z_{i}}}\right)
=αlog(α)(1α)log(1α)\displaystyle=-\alpha\log(\alpha)-(1-\alpha)\log(1-\alpha) (88)
=h(α).\displaystyle=h(\alpha). (89)
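The inequality I_JS^α(W;Z_i) ≤ h(α) proved above can also be checked numerically; the following small Python sketch does so for a randomly generated discrete joint distribution (an arbitrary example, unrelated to any construction in the paper).

import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 5))
joint /= joint.sum()                                       # a random joint P_{W, Z_i} with positive entries
prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))      # product of marginals P_W x P_{Z_i}

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

for alpha in (0.1, 0.5, 0.9):
    mix = alpha * prod + (1 - alpha) * joint               # the mixture P^(alpha)
    i_js = alpha * kl(prod, mix) + (1 - alpha) * kl(joint, mix)
    h = -alpha * np.log(alpha) - (1 - alpha) * np.log(1 - alpha)
    assert i_js <= h + 1e-12
    print(round(i_js, 4), "<=", round(h, 4))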

Proof:

We first compute the derivative of h(α)α(1α)\frac{h(\alpha)}{\alpha(1-\alpha)} with respect to α(0,1)\alpha\in(0,1)

dh(α)α(1α)dα=log(1α)α2log(α)(1α)2.\displaystyle\frac{d\frac{h(\alpha)}{\alpha(1-\alpha)}}{d\alpha}=\frac{\log(1-\alpha)}{\alpha^{2}}-\frac{\log(\alpha)}{(1-\alpha)^{2}}. (90)

Now for α=12\alpha=\frac{1}{2}, we have dh(α)α(1α)dα=0\frac{d\frac{h(\alpha)}{\alpha(1-\alpha)}}{d\alpha}=0; since this derivative is negative for α<12\alpha<\frac{1}{2} and positive for α>12\alpha>\frac{1}{2}, the ratio h(α)α(1α)\frac{h(\alpha)}{\alpha(1-\alpha)} attains its minimum at α=12\alpha=\frac{1}{2}. ∎
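A one-line numerical check of h(α)/(α(1−α)) over a grid (illustrative only) confirms that the minimum is attained at α = 1/2 with value 4log(2).

import numpy as np

a = np.linspace(1e-4, 1 - 1e-4, 9999)
f = (-a * np.log(a) - (1 - a) * np.log(1 - a)) / (a * (1 - a))
print(round(float(a[int(np.argmin(f))]), 3), round(float(f.min()), 4))   # 0.5 and 4*log(2) ~ 2.7726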

Appendix B Proofs of Section III-B

Proof:

Consider arbitrary auxiliary distributions {P^W,Zi}i=1n\{\widehat{P}_{W,Z_{i}}\}_{i=1}^{n} defined on 𝒲×𝒵\mathcal{W}\times\mathcal{Z}. Then,

gen¯(PW|S,μ)=𝔼PWPS[LE(W,S)]𝔼PW,S[LE(W,S)]\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)=\mathbb{E}_{P_{W}P_{S}}[L_{E}(W,S)]-\mathbb{E}_{P_{W,S}}[L_{E}(W,S)]
=1ni=1n𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})] (91)
1ni=1n|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right| (92)
1ni=1n|𝔼P^W,Zi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\bigl{|}\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\bigr{|} (93)
+|𝔼P^W,Zi[(W,Zi)]𝔼PWPZi[(W,Zi)]|.\displaystyle\quad\quad+\bigl{|}\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]\bigr{|}.

Using the assumption that the loss function (w,zi)\ell(w,z_{i}) is γ\gamma-sub-Gaussian under the distribution PW,ZiP_{W,Z_{i}}, together with the Donsker-Varadhan representation, we have:

λ(𝔼P^W,Zi[(W,Zi)]𝔼PWZi[(W,Zi)])\displaystyle\lambda\left(\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right)\leq (94)
KL(P^W,ZiPWZi)+λ2γ22,λ.\displaystyle\quad\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{WZ_{i}})+\frac{\lambda^{2}\gamma^{2}}{2},\quad\forall\lambda\in\mathbb{R}.

Using the assumption that (w,Z)\ell(w,Z) is σ\sigma-sub-Gaussian under PWPZiP_{W}\otimes P_{Z_{i}}, and again the Donsker-Varadhan representation, we have:

λ(𝔼P^W,Zi[(W,Zi)]𝔼PWPZi[(W,Zi)])\displaystyle\lambda^{\prime}\left(\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]\right)\leq (95)
KL(P^W,ZiPWPZi)+λ2σ22.λ\displaystyle\quad\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}P_{Z_{i}})+\frac{{\lambda^{\prime}}^{2}\sigma^{2}}{2}.\quad\forall\lambda^{\prime}\in\mathbb{R}

Note that 𝔼PWPZi[(W,Zi)𝔼PZi[(W,Zi)]]=0\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]=0.

Now, for λ>0\lambda>0, we choose λ=αα1λ\lambda^{\prime}=\frac{\alpha}{\alpha-1}\lambda (so that λ<0\lambda^{\prime}<0 for 0<α<10<\alpha<1). Hence, we have

𝔼P^W,Zi[(W,Zi)𝔼PW,Zi[(W,Zi)]]\displaystyle\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{W,Z_{i}}}[\ell(W,Z_{i})]]\leq (96)
KL(P^W,ZiPWZi)λ+λγ22,λ+.\displaystyle\quad\frac{\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{WZ_{i}})}{\lambda}+\frac{\lambda\gamma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Using the assumption that (w,Z)\ell(w,Z) is σ\sigma-sub-Gaussian under PWPZiP_{W}\otimes P_{Z_{i}}, and again the Donsker-Varadhan representation,

𝔼P^W,Zi[(W,Zi)𝔼PWPZi[(W,Zi)]]\displaystyle-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]]\leq (97)
KL(P^W,ZiPWPZi)|λ|+|λ|σ22,λ.\displaystyle\quad\frac{\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}P_{Z_{i}})}{|\lambda^{\prime}|}+\frac{|\lambda^{\prime}|\sigma^{2}}{2},\quad\forall\lambda^{\prime}\in\mathbb{R^{-}}.

Summing the two inequalities (96) and (97), we obtain

𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\leq (98)
αKL(P^W,ZiPW,Zi)+(1α)KL(P^W,ZiPWPZi)αλ+\frac{\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}P_{Z_{i}})}{\alpha\lambda}+
λγ22+λα1ασ22,λ+.\displaystyle\quad\frac{\lambda\gamma^{2}}{2}+\frac{\lambda\frac{\alpha}{1-\alpha}\sigma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Considering (98) as a nonnegative parabola in λ\lambda, whose discriminant must therefore be nonpositive, we have:

|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right|\leq (99)
2(ασ2+(1α)γ2)(αCi+(1α)Di)α(1α),\displaystyle\quad\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\left(\alpha C_{i}+(1-\alpha)D_{i}\right)}{\alpha(1-\alpha)}},

where Ci=KL(P^W,ZiPWμ)C_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu) and Di=KL(P^W,ZiPW,Zi)D_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}}). Finally, summing over i=1,,ni=1,\cdots,n and combining with the decomposition at the beginning of the proof proves the claim. ∎

Proof:

Using Lemma 3, we have:

minP^W,ZiαKL(P^W,ZiPWμ)+(1α)KL(P^W,ZiPW,Zi)=\displaystyle\min_{\widehat{P}_{W,Z_{i}}}\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu)+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})=
(1α)IRα(W;Zi)\displaystyle(1-\alpha)I_{\mathrm{R}}^{\alpha}(W;Z_{i})
+minP^W,ZiKL(P^W,Zi(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)).\displaystyle+\min_{\widehat{P}_{W,Z_{i}}}\mathrm{KL}\left(\widehat{P}_{W,Z_{i}}\|\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right).

Now, by choosing dP^W,Zi=(dPZidPW)α(dPW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)d\widehat{P}_{W,Z_{i}}=\frac{(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}, the KL term equals zero. The final result then follows from Proposition 5. ∎

Proof:

Our proof is based on [35, Theorem 30]. For 0α10\leq\alpha\leq 1, we have:

αKL(P^W,ZiPWμ)+(1α)KL(P^W,ZiPW,Zi)\displaystyle\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu)+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}}) (100)
=𝒲×𝒵𝑑P^W,Zilog(dP^W,Zi)\displaystyle=\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log(d\widehat{P}_{W,Z_{i}}) (101)
𝒲×𝒵P^W,Zilog((dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad-\int_{\mathcal{W}\times\mathcal{Z}}\widehat{P}_{W,Z_{i}}\log((dP_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)})
=𝒲×𝒵𝑑P^W,Zilog(dP^W,Zi)\displaystyle=\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log(d\widehat{P}_{W,Z_{i}}) (102)
𝒲×𝒵𝑑P^W,Zilog((dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad-\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log\left((\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}\right)
+log(𝒲×𝒵(dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad+\log\left(\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}\right)
log(𝒲×𝒵(dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad-\log\left(\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}\right)
=log(𝒲×𝒵(PWdPZi)α(dPW,Zi)(1α))\displaystyle=-\log\left(\int_{\mathcal{W}\times\mathcal{Z}}(P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}\right) (103)
+𝒲×𝒵𝑑P^W,Zilog(dP^W,Zi)\displaystyle\quad+\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log(d\widehat{P}_{W,Z_{i}})
𝒲×𝒵𝑑P^W,Zilog((dPWdPZi)α(dPW,Zi)(1α)𝒲×𝒵(dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad-\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log\left(\frac{(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right)
=(1α)IRα(W;Zi)\displaystyle=(1-\alpha)I_{\mathrm{R}}^{\alpha}(W;Z_{i}) (104)
+KL(P^W,Zi(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)).\displaystyle\quad+\mathrm{KL}\left(\widehat{P}_{W,Z_{i}}\|\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right).
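The variational identity behind this proof, namely that the infimum over the auxiliary distribution of αKL(·∥P₁) + (1−α)KL(·∥P₂) equals (1−α)R_α(P₁∥P₂) and is attained at the normalized geometric mixture ([35, Theorem 30]), is easy to verify numerically for discrete distributions; the Python sketch below uses two randomly generated distributions purely as an illustration.

import numpy as np

rng = np.random.default_rng(0)
P1 = rng.random(6)
P1 /= P1.sum()
P2 = rng.random(6)
P2 /= P2.sum()
alpha = 0.3

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

renyi = np.log(np.sum(P1 ** alpha * P2 ** (1 - alpha))) / (alpha - 1)
geo = P1 ** alpha * P2 ** (1 - alpha)
geo /= geo.sum()                                                   # normalized geometric mixture
lhs = alpha * kl(geo, P1) + (1 - alpha) * kl(geo, P2)
print(round(lhs, 6), round((1 - alpha) * renyi, 6))                # the two values coincide

R = rng.random(6)
R /= R.sum()                                                       # any other auxiliary distribution
assert alpha * kl(R, P1) + (1 - alpha) * kl(R, P2) >= lhs - 1e-12  # gives a larger objective value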

Proof:

Using (57),

IRα(W;Zi)α1αI(W;Zi),\displaystyle I_{\mathrm{R}}^{\alpha}(W;Z_{i})\leq\frac{\alpha}{1-\alpha}I(W;Z_{i}), (105)

and considering that the hypothesis space is finite, together with the upper bound in Theorem 3, we have:

|gen¯(PW|S,μ)|1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}} (106)
1ni=1n2(ασ2+(1α)γ2)min{1α,11α}I(W;Zi)\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}I(W;Z_{i})} (107)
2(ασ2+(1α)γ2)min{1α,11α}i=1nI(W;Zi)n\displaystyle\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}\frac{\sum_{i=1}^{n}I(W;Z_{i})}{n}} (108)
2(ασ2+(1α)γ2)min{1α,11α}I(W;S)n\displaystyle\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}\frac{I(W;S)}{n}} (109)
2(ασ2+(1α)γ2)min{1α,11α}H(W)n\displaystyle\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}\frac{H(W)}{n}} (110)
2(ασ2+(1α)γ2)min{1α,11α}log(k)n,\displaystyle\quad\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}\frac{\log(k)}{n}},

where (108) follows from Jensen's inequality and (109) follows from the i.i.d. assumption on the ZiZ_{i}'s. ∎

Proof:

Consider arbitrary auxiliary distributions {P~W,Zi}i=1n\{\tilde{P}_{W,Z_{i}}\}_{i=1}^{n} defined on 𝒲×𝒵\mathcal{W}\times\mathcal{Z}.

gen¯(PW|S,μ)=𝔼PWPS[LE(W,S)]𝔼PW,S[LE(W,S)]\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)=\mathbb{E}_{P_{W}P_{S}}[L_{E}(W,S)]-\mathbb{E}_{P_{W,S}}[L_{E}(W,S)]
=1ni=1n𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})] (111)
1ni=1n|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|.\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right|. (112)

Using the assumption that the centered loss function (w,zi)𝔼PZi[(w,Zi)]\ell(w,z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(w,Z_{i})] is γ\gamma-sub-Gaussian under the distribution PW,ZiP_{W,Z_{i}}, and applying the Donsker-Varadhan representation to this function, we have:

λ(𝔼P~W,Zi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle\lambda\bigg{(}\mathbb{E}_{\tilde{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]] (113)
𝔼PWZi[(W,Zi)𝔼PZi[(W,Zi)]])\displaystyle\quad-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]\bigg{)}
KL(P~W,ZiPWZi)+λ2γ22.λ\displaystyle\leq\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|P_{WZ_{i}})+\frac{\lambda^{2}\gamma^{2}}{2}.\quad\forall\lambda\in\mathbb{R}

Note that 𝔼PWZi[(W,Zi)𝔼PZi[(W,Zi)]]=𝔼PWZi[(W,Zi)]𝔼PWPZi[(W,Zi)]\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]=\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})].

Using the assumption that (w,Z)\ell(w,Z) is σ\sigma-sub-Gaussian under PZiP_{Z_{i}} for all w𝒲w\in\mathcal{W}, and again applying the Donsker-Varadhan representation to the function (w,zi)𝔼PZi[(w,Zi)]\ell(w,z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(w,Z_{i})], we have:

λ(𝔼P~W,Zi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle\lambda^{\prime}\bigg{(}\mathbb{E}_{\tilde{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]] (114)
𝔼QWPZi[(W,Zi)𝔼PZi[(W,Zi)]])\displaystyle\quad-\mathbb{E}_{Q_{W}P_{Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]\bigg{)}
KL(P~W,ZiQWPZi)+λ2σ22.λ\displaystyle\quad\leq\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|Q_{W}P_{Z_{i}})+\frac{{\lambda^{\prime}}^{2}\sigma^{2}}{2}.\quad\forall\lambda^{\prime}\in\mathbb{R}

Note that 𝔼QWPZi[(W,Zi)𝔼PZi[(W,Zi)]]=0\mathbb{E}_{Q_{W}P_{Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]=0.

Now, for λ>0\lambda>0, we choose λ=αα1λ\lambda^{\prime}=\frac{\alpha}{\alpha-1}\lambda (so that λ<0\lambda^{\prime}<0 for 0<α<10<\alpha<1). Hence, we have

𝔼P~W,Zi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle\mathbb{E}_{\tilde{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]] (115)
𝔼PWZi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle\quad-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]
KL(P~W,ZiPWZi)λ+λγ22,λ+.\displaystyle\quad\leq\frac{\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|P_{WZ_{i}})}{\lambda}+\frac{\lambda\gamma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Using the assumption that (w,Z)\ell(w,Z) is σ\sigma-sub-Gaussian and again the Donsker-Varadhan representation,

𝔼P~W,Zi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle-\mathbb{E}_{\tilde{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]\leq (116)
KL(P~W,ZiQWPZi)|λ|+|λ|σ22,λ.\displaystyle\quad\frac{\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|Q_{W}P_{Z_{i}})}{|\lambda^{\prime}|}+\frac{|\lambda^{\prime}|\sigma^{2}}{2},\quad\forall\lambda^{\prime}\in\mathbb{R^{-}}.

Summing the two inequalities (115) and (116), we obtain

𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\leq (117)
αKL(P~W,ZiPW,Zi)+(1α)KL(P~W,ZiQWPZi)αλ+\displaystyle\frac{\alpha\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|Q_{W}P_{Z_{i}})}{\alpha\lambda}+
λγ22+λα1ασ22,λ+.\displaystyle\quad\frac{\lambda\gamma^{2}}{2}+\frac{\lambda\frac{\alpha}{1-\alpha}\sigma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Taking the infimum over P~W,Zi\tilde{P}_{W,Z_{i}} and using [35, Theorem 30], which states that

(1α)Rα(P1P2)=infR{αKL(RP1)+(1α)KL(RP2)}(1-\alpha)\mathrm{R}_{\alpha}(P_{1}\|P_{2})=\inf_{R}\{\alpha\mathrm{KL}(R\|P_{1})+(1-\alpha)\mathrm{KL}(R\|P_{2})\}

Now, we have:

(1α)Rα(PW,ZiQWPZi)=\displaystyle(1-\alpha)\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|Q_{W}P_{Z_{i}})= (118)
infP~W,Zi{αKL(P~W,ZiPW,Zi)+(1α)KL(P~W,ZiQWPZi)}\displaystyle{\inf_{\tilde{P}_{W,Z_{i}}}\{\alpha\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|Q_{W}P_{Z_{i}})\}}

and taking infimum on QWQ_{W}, we have:

infQWRα(PW,ZiQWPZi)=ISα(Zi;W).\displaystyle\inf_{Q_{W}}\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|Q_{W}P_{Z_{i}})=I_{\mathrm{S}}^{\alpha}(Z_{i};W). (119)

Using (119) in (117), we get:

𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\leq (120)
(1α)Iαs(Zi;W)λα+λγ22+λα1ασ22λ+.\displaystyle\quad\frac{(1-\alpha)I^{s}_{\alpha}(Z_{i};W)}{\lambda\alpha}+\frac{\lambda\gamma^{2}}{2}+\frac{\lambda\frac{\alpha}{1-\alpha}\sigma^{2}}{2}\quad\forall\lambda\in\mathbb{R^{+}}.

Using the same approach for λR\lambda\in R^{-}, we have:

𝔼PWZi[(W,Zi)]𝔼PWPZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]\leq (121)
(1α)Iαs(Zi;W)|λ|α+|λ|γ22+|λ|α1ασ22,λ.\displaystyle\quad\frac{(1-\alpha)I^{s}_{\alpha}(Z_{i};W)}{|\lambda|\alpha}+\frac{|\lambda|\gamma^{2}}{2}+\frac{|\lambda|\frac{\alpha}{1-\alpha}\sigma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{-}}.

Considering (120) and (121) as a non-negative parabola in λ\lambda, whose discriminant must therefore be non-positive, we have:

|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right|\leq (122)
2(ασ2+(1α)γ2)ISα(Zi;W)α.\displaystyle\quad\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{S}}^{\alpha}(Z_{i};W)}{\alpha}}.

Finally, summing over i=1,,ni=1,\cdots,n and combining with the decomposition at the beginning of the proof proves the claim. ∎

Proof:

The generalized Pinsker inequality is given in [35] as follows,

𝕋𝕍(P,Q)22αRα(PQ),α(0,1],\displaystyle\mathbb{TV}(P,Q)^{2}\leq\frac{2}{\alpha}\mathrm{R}_{\alpha}(P\|Q),\quad\alpha\in(0,1], (123)

where 𝕋𝕍(P,Q)=𝒳|P(dx)Q(dx)|\mathbb{TV}(P,Q)=\int_{\mathcal{X}}|P(\mathrm{d}x)-Q(\mathrm{d}x)|. Let f:𝒳f:\mathcal{X}\to\mathbb{R} be a bounded function with |f|L|f|\leq L; then

𝔼P[f(X)]𝔼Q[f(X)]=\displaystyle\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]= (124)
f(x)(P(dx)Q(dx))\displaystyle\quad\int f(x)(P(dx)-Q(dx))\leq
supx|f(x)||P(dx)Q(dx)|L2αRα(PQ).\displaystyle\quad\sup_{x}|f(x)|\cdot\int|P(dx)-Q(dx)|\leq L\sqrt{\frac{2}{\alpha}\mathrm{R}_{\alpha}(P\|Q)}.

Let P=PW,ZP=P_{W,Z}, Q=PWPZQ=P_{W}P_{Z} and f(w,z)=Lμ(w)(w,z)f(w,z)=L_{\mu}(w)-\ell(w,z). Then, we have the final result,

gen¯(PW|S,μ)=1ni=1n𝔼[Lμ(W)(W,Zi)]\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[L_{\mu}(W)-\ell(W,Z_{i})]
1ni=1n2b2αRα(PW,ZiPWPZi).\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{\frac{2b^{2}}{\alpha}\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|P_{W}P_{Z_{i}})}.

Appendix C Proof of Section III-C

Proof:

Since IJSα(W;Zi)h(α)I_{\mathrm{JS}}^{\alpha^{\prime}}(W;Z_{i})\leq h(\alpha^{\prime}), each term of the α\alpha^{\prime}-Jensen-Shannon bound in (62) satisfies

2σ(α)2IJSα(W;Zi)α(1α)2σ(α)2h(α)α(1α).\sqrt{2\sigma_{(\alpha^{\prime})}^{2}\frac{I_{\mathrm{JS}}^{\alpha^{\prime}}(W;Z_{i})}{\alpha^{\prime}(1-\alpha^{\prime})}}\leq\sqrt{2\sigma_{(\alpha^{\prime})}^{2}\frac{h(\alpha^{\prime})}{\alpha^{\prime}(1-\alpha^{\prime})}}.

Hence, if αh(α)(1α)αIRα(W;Zi)\frac{\alpha h(\alpha^{\prime})}{(1-\alpha^{\prime})\alpha^{\prime}}\leq I_{\mathrm{R}}^{\alpha}(W;Z_{i}) holds for i=1,,ni=1,\cdots,n and σ(α)=σ=γ\sigma_{(\alpha^{\prime})}=\sigma=\gamma, each term of (62) is no larger than the corresponding term of the α\alpha-Rényi bound in (63), which completes the proof. ∎

Appendix D Proofs of Section IV

Proof:

Let us define P𝒩:=𝒩(w,β1Id)P_{\mathcal{N}}:=\mathcal{N}(w^{\star},\beta^{-1}I_{d}) and w:=arginfw𝒲Lμ(w)w^{\star}:=\arg\inf_{w\in\mathcal{W}}L_{\mu}(w).

r(PW|S,β,JSα,μ)\displaystyle\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)
|gen¯(PW|S,β,JSα,μ)|+𝔼PSPW|S,β,JSα[LE(W,S)]Lμ(w)\displaystyle\leq\big{|}\overline{\operatorname*{gen}}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)\big{|}+\mathbb{E}_{P_{S}\otimes P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}}}[L_{E}(W,S)]-L_{\mu}(w^{\star})
2b2nα(1α)i=1nIJSα(W;Zi)+𝔼PSP𝒩[LE(W,S)]Lμ(w)\displaystyle\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\mathbb{E}_{P_{S}\otimes P_{\mathcal{N}}}[L_{E}(W,S)]-L_{\mu}(w^{\star})
+JSα(𝒩(w,β1Id)Q)β\displaystyle\quad+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta}
2b2nα(1α)i=1nIJSα(W;Zi)+𝔼P𝒩[Lμ(W)]Lμ(w)\displaystyle\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\mathbb{E}_{P_{\mathcal{N}}}[L_{\mu}(W)]-L_{\mu}(w^{\star})
+JSα(𝒩(w,β1Id)Q)β\displaystyle\quad+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta}
2b2nα(1α)i=1nIJSα(W;Zi)+𝔼P𝒩[Lμ(w)\displaystyle\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\mathbb{E}_{P_{\mathcal{N}}}[L_{\mu}(w^{\star})
+L~Ww2]Lμ(w)+JSα(𝒩(w,β1Id)Q)β\displaystyle\quad+\tilde{L}\|W-w^{\star}\|_{2}]-L_{\mu}(w^{\star})+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta}
2b2nα(1α)i=1nIJSα(W;Zi)+L~dβ\displaystyle\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\frac{\tilde{L}\sqrt{d}}{\beta}
+JSα(𝒩(w,β1Id)Q)β,\displaystyle\quad+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta},

Note that L~𝔼P𝒩[Ww2]=L~dβ\tilde{L}\mathbb{E}_{P_{\mathcal{N}}}[\|W-w^{\star}\|_{2}]=\frac{\tilde{L}d}{\beta}. ∎

Proof:

The proof is similar to the proof of Theorem 5, replacing the second inequality with the following one,

r(PW|S,β,Rα,μ)|gen¯(PW|S,β,Rα,μ)|+𝔼PSPW|S,β,Rα[LE(W,S)]Lμ(w)2b2nαi=1nIRα(W;Zi)+𝔼PSP𝒩[LE(W,S)]Lμ(w)+Rα(𝒩(w,β1Id)Q)β\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\leq\big{|}\overline{\operatorname*{gen}}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\big{|}\\ &\quad+\mathbb{E}_{P_{S}\otimes P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}}}[L_{E}(W,S)]-L_{\mu}(w^{\star})\\ &\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}\\ &\quad+\mathbb{E}_{P_{S}\otimes P_{\mathcal{N}}}[L_{E}(W,S)]-L_{\mu}(w^{\star})+\frac{\mathrm{R}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta}\end{split}

Proof:

Using the boundedness of the α\alpha-JS\mathrm{JS} divergence in Theorem 5, we have,

r(PW|S,β,JSα,μ)2b2nα(1α)i=1nIJSα(W;Zi)+L~dβ+h(α)β,\begin{split}\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)&\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\frac{\tilde{L}\sqrt{d}}{\beta}+\frac{h(\alpha)}{\beta},\end{split}

where h(α)=αlog(α)(1α)log(1α)h(\alpha)=-\alpha\log(\alpha)-(1-\alpha)\log(1-\alpha). Therefore, by setting β=n1/2\beta=n^{1/2}, the convergence rate of excess risk is 𝒪(1/n)\mathcal{O}(1/\sqrt{n}). ∎

Proof:

We consider the standard normal distribution as the prior, i.e., Q=𝒩(0,Id)Q=\mathcal{N}(0,I_{d}), in Theorem 6. The α\alpha-Rényi divergence between two multivariate Gaussian distributions can then be computed in closed form [49],

Rα(𝒩(w,β1Id)𝒩(0,Id))=α2w22(α+(1α)β1)1+d2(α1)log(βα1α+(1α)β1).\begin{split}&\mathrm{R}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|\mathcal{N}(0,I_{d}))=\frac{\alpha}{2}\|w^{\star}\|_{2}^{2}(\alpha+(1-\alpha)\beta^{-1})^{-1}\\ &\quad+\frac{d}{2(\alpha-1)}\log\Big{(}\frac{\beta^{\alpha-1}}{\alpha+(1-\alpha)\beta^{-1}}\Big{)}.\end{split}

Then, the following upper bound holds on the excess risk under PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},

r(PW|S,β,Rα,μ)2b2nαi=1nIRα(W;Zi)+L~dβ+α2βw22(α+(1α)β1)1+d2β(α1)log(β(α1)α+(1α)β1)2b2nαi=1nIRα(W;Zi)+L~dβ+12βw22+d2βlog(β)+d2β(1α)log(α).\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}+\frac{\tilde{L}\sqrt{d}}{\beta}\\ &\quad+\frac{\alpha}{2\beta}\|w^{\star}\|_{2}^{2}(\alpha+(1-\alpha)\beta^{-1})^{-1}\\ &\quad+\frac{d}{2\beta(\alpha-1)}\log\Big{(}\frac{\beta^{(\alpha-1)}}{\alpha+(1-\alpha)\beta^{-1}}\Big{)}\\ &\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}+\frac{\tilde{L}\sqrt{d}}{\beta}\\ &\quad+\frac{1}{2\beta}\|w^{\star}\|_{2}^{2}+\frac{d}{2\beta}\log\big{(}\beta\big{)}+\frac{d}{2\beta(1-\alpha)}\log\big{(}\alpha\big{)}.\end{split}

Appendix E Proofs of Section V

We first state the following lemma, which provides an upper bound on the expected generalization error under distribution mismatch.

Lemma 6

The following upper bound holds on the expected generalization error under distribution mismatch between the test and training distributions:

|gen¯(PW|S,μ,μ)|\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime})|\leq (125)
|gen¯(PW|S,μ)|+|𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]|.\displaystyle\quad|\overline{\operatorname*{gen}}(P_{W|S},\mu)|+|\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]|.
Proof:

We have:

|gen¯(PW|S,μ,μ)|\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime})| (126)
=|𝔼PW,S[LP(W,μ)LP(W,μ)+LP(W,μ)LE(W,S)]|\displaystyle=|\mathbb{E}_{P_{W,S}}[L_{P}(W,\mu^{\prime})-L_{P}(W,\mu)+L_{P}(W,\mu)-L_{E}(W,S)]|
|𝔼PW,S[LP(W,μ)LP(W,μ)]|+|gen¯(PW|S,μ)|\displaystyle\leq|\mathbb{E}_{P_{W,S}}[L_{P}(W,\mu^{\prime})-L_{P}(W,\mu)]|+|\overline{\operatorname*{gen}}(P_{W|S},\mu)|
=|𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]|+|gen¯(PW|S,μ)|\displaystyle=|\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]|+|\overline{\operatorname*{gen}}(P_{W|S},\mu)|

Proof:

By Lemma 6, the generalization error under distribution mismatch is upper bounded by two terms. The term |gen¯(PW|S,μ)||\overline{\operatorname*{gen}}(P_{W|S},\mu)| can be upper bounded using the α\alpha-Jensen-Shannon information bound of Theorem 2. The term |𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]||\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]| in Lemma 6 can be upper bounded by applying the ADM, with an approach similar to that of Theorem 2 and using the α\alpha-Jensen-Shannon divergence, as follows:

|𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]|\displaystyle|\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]|\leq (127)
2σ(α)2JSα(PWμPWμ)α(1α)=\displaystyle\sqrt{2\sigma_{(\alpha)}^{2}\frac{\mathrm{JS}_{\alpha}(P_{W}\otimes\mu^{\prime}\|P_{W}\otimes\mu)}{\alpha(1-\alpha)}}= (128)
2σ(α)2JSα(μμ)α(1α).\displaystyle\sqrt{2\sigma_{(\alpha)}^{2}\frac{\mathrm{JS}_{\alpha}(\mu^{\prime}\|\mu)}{\alpha(1-\alpha)}}. (129)

Proof:

Based on Lemma 6, the generalization error is upper bounded by two terms (see Equation (125)). The term |gen¯(PW|S,μ)||\overline{\operatorname*{gen}}(P_{W|S},\mu)| can be upper bounded using the α\alpha-Rényi information bound of Theorem 3. The term |𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]||\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]| can be upper bounded by applying the ADM, with an approach similar to that of Theorem 3 and using the α\alpha-Rényi divergence, as follows:

|𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]|\displaystyle|\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]|\leq (130)
2(ασ2+(1α)γ2)Rα(PWμPWμ)α=\displaystyle\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\mathrm{R}_{\alpha}(P_{W}\otimes\mu^{\prime}\|P_{W}\otimes\mu)}{\alpha}}= (131)
2(ασ2+(1α)γ2)Rα(μμ)α.\displaystyle\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\mathrm{R}_{\alpha}(\mu^{\prime}\|\mu)}{\alpha}}. (132)