
Learning Algorithm Generalization Error Bounds via Auxiliary Distributions

Gholamali Aminian\star, Saeed Masiha\star, Laura Toni, and Miguel R. D. Rodrigues. \star Equal contribution. One of the ideas of this work was presented, in part, at the 2020 IEEE Information Theory Workshop (ITW) [1]. G. Aminian is with the Alan Turing Institute, London NW1 2DB, U.K. (e-mail: [email protected]); L. Toni and M. R. D. Rodrigues are with the Electronic and Electrical Engineering Department, University College London, London WC1E 6BT, U.K. (e-mail: {l.toni, m.rodrigues}@ucl.ac.uk); and S. Masiha is with the École Polytechnique Fédérale de Lausanne, Lausanne 1015, Switzerland (e-mail: [email protected]).
Abstract

Generalization error bounds are essential for understanding how well machine learning models perform. In this work, we suggest a novel method, i.e., the Auxiliary Distribution Method, that leads to new upper bounds on expected generalization errors that are appropriate for supervised learning scenarios. We show that our general upper bounds can be specialized under some conditions to new bounds involving the α-Jensen-Shannon information and the α-Rényi information (0<α<1) between a random variable modeling the set of training samples and another random variable modeling the hypothesis. Our upper bounds based on α-Jensen-Shannon information are also finite. Additionally, we demonstrate how our auxiliary distribution method can be used to derive upper bounds on the excess risk of some learning algorithms in the supervised learning context, and on the generalization error under the distribution mismatch scenario in supervised learning, where the distribution mismatch is modeled via the α-Jensen-Shannon or α-Rényi divergence between the distributions of the test and training data samples. We also outline the conditions under which our proposed upper bounds might be tighter than earlier upper bounds.

Index Terms:
Expected generalization error bounds, population risk upper bound, mutual information, α-Jensen-Shannon information, α-Rényi information, distribution mismatch.

I Introduction

Numerous methods have been proposed to characterize the generalization error of learning algorithms. These include VC-based bounds [2], algorithmic stability-based bounds [3], algorithmic robustness-based bounds [4], and PAC-Bayesian bounds [5]. Nevertheless, for a number of reasons, many of these generalization error bounds are unable to describe how different machine learning techniques generalize: some of the bounds depend only on the hypothesis class and not on the learning algorithm; existing bounds do not easily exploit dependencies between different hypotheses; or they do not exploit dependencies between the learning algorithm input and output.

More recently, methods that use information-theoretic tools have also been developed to describe the generalization of learning techniques. Such methods frequently incorporate the various components of the learning problem by expressing the expected generalization error in terms of certain information measures between the learning algorithm input (the training dataset) and output (the hypothesis). In particular, building upon pioneering work by Russo and Zou [6], Xu and Raginsky [7] have derived expected generalization error bounds involving the mutual information between the training set and the hypothesis. Bu et al. [8] have derived tighter expected generalization error bounds involving the mutual information between each individual sample in the training set and the hypothesis. Meanwhile, bounds using chaining mutual information have been proposed in [9, 10]. Other authors have also constructed information-theoretic expected generalization error bounds based on other information measures, such as the α-Rényi divergence for α>1, f-divergence, and maximal leakage [11]. In [12], an upper bound based on α-Rényi divergence for 0<α<1 is derived by using the variational representation of α-Rényi divergence. Bounds based on the Wasserstein distance between the training sample data and the output of a randomized learning algorithm are proposed in [13], [14], bounds based on the Wasserstein distance between the distribution of an individual data sample and the output of the learning algorithm are proposed in [15], and tighter upper bounds via convexity of the Wasserstein distance are proposed in [16]. Upper bounds based on conditional mutual information and individual sample conditional mutual information are proposed in [17] and [18], respectively. The combination of conditioning and processing techniques can provide tighter expected generalization error upper bounds [19]. An exact characterization of the expected generalization error for the Gibbs algorithm in terms of symmetrized KL information is provided in [20]. Finally, [21] provides information-theoretic expected generalization error upper bounds in the presence of training/test data distribution mismatch, using rate-distortion theory.

Generalization error bounds have also been developed to address scenarios where the training data distribution differs from the test data distribution, known as distribution mismatch. This scenario – which also links to out-of-distribution generalization – has attracted various contributions in recent years, such as [22, 23, 24]. In particular, Masiha et al. [21] provide information-theoretic generalization error upper bounds in the presence of training/test data distribution mismatch, using rate-distortion theory.

In this work, we propose an auxiliary distribution method (ADM) to characterize the expected generalization error upper bound of supervised learning algorithms in terms of novel information measures. Our new bounds offer three advantages over existing ones: (1) some of our bounds – such as the α-JS information ones – are always finite, whereas conventional mutual information ones (e.g., [7]) may not be; (2) in contrast to mutual information-based bounds, some of our bounds – such as the α-Rényi information ones for 0<α<1 – are finite for some deterministic supervised learning algorithms; (3) we also apply ADM to provide an upper bound on the population risk of supervised learning algorithms.

In summary, our main contributions are as follows:

  1.

    We suggest a novel method, i.e., ADM, that uses auxiliary distributions over the parameter and data sample spaces to obtain upper bounds on the expected generalization error.

  2.

    Using ADM, we derive new expected generalization error bounds expressed via α\alpha-JS\mathrm{JS} divergence, which is known to be finite.

  3.

    Using ADM, we offer an upper bound based on α\alpha-Rényi divergence for 0<α<10<\alpha<1 with the same convergence rate as the mutual information-based upper bound. Furthermore, in contrast to the mutual information-based bounds, the α\alpha-Rényi divergence bounds for 0<α<10<\alpha<1 can be finite when the hypothesis (output of the learning algorithm) is a deterministic function of at least one data sample.

  4.

    Using our upper bounds on expected generalization error, we also provide upper bounds on excess risk of some learning algorithms as solutions to regularized empirical risk minimization by α\alpha-Rényi or α\alpha-Jensen-Shannon divergences.

  5.

    Using ADM, we also provide generalization error upper bound under training and test data distribution mismatch. It turns out that training and test distribution mismatch is captured in our upper bounds via α\alpha-Jensen-Shannon or α\alpha-Rényi divergences.

It is noteworthy that, although the α-JS measure does not appear to have been used to characterize the generalization ability of learning algorithms, this information-theoretic quantity, as well as the α-Rényi measure for 0<α<1, has been employed to study some machine learning problems, including the use of

  • α\alpha-JS\mathrm{JS} as a loss function under label noise scenario [25], and Jensen-Shannon divergence ( α\alpha-JS\mathrm{JS} divergence for α=1/2\alpha=1/2) in adversarial learning [26] and active learning [27].

  • α\alpha-Rényi divergence in feature extraction  [28] and image segmentation based on clustering [29].

II Problem Formulation

II-A Notations

In this work, we adopt the following notation. Calligraphic letters denote spaces (e.g., 𝒵), upper-case letters denote random variables (e.g., Z), and lower-case letters denote realizations of random variables (e.g., z). We denote the distribution of a random variable Z by P_Z, the joint distribution of two random variables (Z_1, Z_2) by P_{Z_1,Z_2}, and the α-convex combination of the joint distribution P_{Z_1,Z_2} and the product of the two marginals P_{Z_1}⊗P_{Z_2}, i.e., αP_{Z_1}⊗P_{Z_2}+(1−α)P_{Z_1,Z_2} for α∈(0,1), by P^{(α)}_{Z_1,Z_2}. The set of distributions (measures) over a space 𝒳 is denoted by 𝒫(𝒳). We denote the derivative of a real-valued function f(x) with respect to its argument by f′(⋅). We adopt the notation log(⋅) for the natural logarithm. A function f is L_f-Lipschitz if |f(x_1)−f(x_2)|≤L_f‖x_1−x_2‖_2, where ‖⋅‖_2 is the L_2-norm. 𝒩(a,B) denotes the Gaussian distribution over ℝ^d with mean a∈ℝ^d and covariance matrix B∈ℝ^{d×d}.

II-B Framework of Statistical Learning

We analyze a standard supervised learning setting where we wish to learn a hypothesis given a set of input-output examples that can then be used to predict a new output given a new input.

In particular, in order to formalize this setting, we model the input data (also known as features) using a random variable X𝒳X\in\mathcal{X} where 𝒳\mathcal{X} is the input space, and we model the output data (also known as predictors or labels) using a random variable Y𝒴Y\in\mathcal{Y} where 𝒴\mathcal{Y} is the output space. We also model input-output data pairs using a random variable Z=(X,Y)𝒵=𝒳×𝒴Z=(X,Y)\in\mathcal{Z}=\mathcal{X}\times\mathcal{Y} where ZZ is drawn from 𝒵\mathcal{Z} per some unknown distribution μ\mu. We also let S={Zi}i=1nS=\{Z_{i}\}_{i=1}^{n} be a training set consisting of nn input-output data points drawn i.i.d. from 𝒵\mathcal{Z} according to μ\mathcal{\mu}.

Our goal is to learn a parameterized function, fW:𝒳𝒴f_{W}:\mathcal{X}\rightarrow\mathcal{Y}, where the parameters are a random variable W𝒲dW\in\mathcal{W}\subset\mathbb{R}^{d} and 𝒲\mathcal{W} is a parameter space. Finally, we represent a learning algorithm via a Markov kernel that maps a given training set SS onto parameter WW defined on the parameter space 𝒲\mathcal{W} according to the probability law PW|SP_{W|S}.

We introduce a (non-negative) loss function :𝒲×𝒵+\ell:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}^{+} that measures how well a hypothesis (parameterized function) predicts an output given an input. We can define the population risk and the empirical risk associated with a given hypothesis as follows:

Lμ(w):=𝒵(w,z)𝑑μ(z),\displaystyle L_{\mu}(w):=\int_{\mathcal{Z}}\ell(w,z)d\mu(z), (1)
LE(w,s):=1ni=1n(w,zi),\displaystyle L_{E}(w,s):=\frac{1}{n}\sum_{i=1}^{n}\ell(w,z_{i}), (2)

respectively. We can also define the (expected) generalization error,

gen¯(PW|S,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu) :=𝔼PW,S[gen(W,S,μ)],\displaystyle:=\mathbb{E}_{P_{W,S}}[\text{gen}(W,S,\mu)], (3)

where gen(w,s,μ):=Lμ(w)LE(w,s)\text{gen}(w,s,\mu):=L_{\mu}(w)-L_{E}(w,s). This (expected) generalization error quantifies by how much the population risk deviates from the empirical risk. This quantity cannot be computed directly because μ\mu is unknown, but it can often be (upper) bounded, thereby providing a means to gauge various learning algorithms’ performance. We are also interested in excess risk under the learning algorithm PW|SP_{W|S},

r(PW|S,μ)\displaystyle\mathcal{E}_{r}(P_{W|S},\mu) :=𝔼PW,S[Lμ(W)]infw𝒲Lμ(w).\displaystyle:=\mathbb{E}_{P_{W,S}}[L_{\mu}(W)]-\inf_{w\in\mathcal{W}}L_{\mu}(w). (4)

Note that the excess risk can be decomposed as follows,

r(PW|S,μ)=gen¯(PW|S,μ)+𝔼PW,S[LE(W,S)]infw𝒲Lμ(w),\begin{split}\mathcal{E}_{r}(P_{W|S},\mu)=\overline{\operatorname*{gen}}(P_{W|S},\mu)+\mathbb{E}_{P_{W,S}}[L_{E}(W,S)]-\inf_{w\in\mathcal{W}}L_{\mu}(w),\end{split}

where the first term is the expected generalization error and the second term, 𝔼_{P_{W,S}}[L_E(W,S)]−inf_{w∈𝒲}L_μ(w), is the statistical excess risk.

Furthermore, we analyse a supervised learning scenario under distribution mismatch (a.k.a. out-of-distribution), where training and test data are drawn from different distributions (μ and μ′, respectively). In particular, we define the population risk based on the test distribution μ′ as,

LP(w,μ)𝒵(w,z)𝑑μ(z).\displaystyle L_{P}(w,\mu^{\prime})\triangleq\int_{\mathcal{Z}}\ell(w,z)d\mu^{\prime}(z). (5)

We define the mismatched (expected) generalization error as

gen¯(PW|S,μ,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime}) 𝔼PW,S[gen(W,S,μ,μ)],\displaystyle\triangleq\mathbb{E}_{P_{W,S}}[\text{gen}(W,S,\mu,\mu^{\prime})], (6)

where gen(w,s,μ,μ)LP(w,μ)LE(w,s)\text{gen}(w,s,\mu,\mu^{\prime})\triangleq L_{P}(w,\mu^{\prime})-L_{E}(w,s).

Our goal in the sequel will be to derive (upper) bounds on the expected generalization errors (3) and the excess risk (4) in terms of various information-theoretic measures.
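As a concrete illustration of these quantities (our own toy example, not part of the paper's analysis), the following minimal Python sketch estimates the expected generalization error (3) by Monte Carlo for a simple setting: squared loss, Gaussian data, and ERM given by the sample mean. The closed-form value 2σ²/n for this particular case is used only as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 1.0, 20, 20000

# Squared loss: ell(w, z) = (w - z)^2; the ERM hypothesis is the sample mean.
# Population risk: L_mu(w) = sigma^2 + (w - mu)^2 (known in closed form here).
gen_gap = []
for _ in range(trials):
    s = rng.normal(mu, sigma, size=n)      # training set S drawn i.i.d. from mu
    w = s.mean()                           # W = A(S), a deterministic algorithm
    emp_risk = np.mean((w - s) ** 2)       # L_E(w, S)
    pop_risk = sigma**2 + (w - mu) ** 2    # L_mu(w)
    gen_gap.append(pop_risk - emp_risk)

# For this toy example, E[gen(W, S, mu)] = 2*sigma^2/n.
print("Monte Carlo expected generalization error:", np.mean(gen_gap))
print("Closed-form value 2*sigma^2/n            :", 2 * sigma**2 / n)
```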

II-C Auxiliary Distribution Method

We describe our main method to derive upper bounds on the expected generalization error, i.e., the ADM. Consider PP and QQ as two distributions defined on a measurable space 𝒳\mathcal{X} and let f:𝒳f:\mathcal{X}\to\mathbb{R} be a measurable function. Assume that we can use an asymmetric information measure T(PQ)T(P\|Q) between PP and QQ to construct the following upper bound:

|𝔼P[f(X)]𝔼Q[f(X)]|F(T(PQ)),\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq F(T(P\|Q)), (7)

where F()F(\cdot) is a given non-decreasing concave function.

Consider RR as an auxiliary distribution on the same space 𝒳\mathcal{X}. We can use the following upper bound instead of (7):

|𝔼P[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq
|𝔼P[f(X)]𝔼R[f(X)]|+|𝔼Q[f(X)]𝔼R[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{R}[f(X)]\right|+\left|\mathbb{E}_{Q}[f(X)]-\mathbb{E}_{R}[f(X)]\right|
F(T(PR))+F(T(QR))\displaystyle\leq F(T(P\|R))+F(T(Q\|R)) (8)

From concavity of FF, we have

F(T(PR))+F(T(QR))\displaystyle F(T(P\|R))+F(T(Q\|R))\leq
2F(T(PR)/2+T(QR)/2)\displaystyle 2F\bigg{(}T(P\|R)/2+T(Q\|R)/2\bigg{)} (9)

We assume that TT satisfies a reverse triangle inequality as follows:

minR𝒫(X)T(PR)+T(QR)T(PQ).\displaystyle\min_{R\in\mathcal{P}(X)}T(P\|R)+T(Q\|R)\leq T(P\|Q). (10)

Considering RargminRT(PR)+T(QR)R^{*}\in\operatorname*{arg\,min}_{R}T(P\|R)+T(Q\|R), we have

|𝔼P[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq
2F(T(PR)/2+T(QR)/2).\displaystyle 2F\bigg{(}T(P\|R^{*})/2+T(Q\|R^{*})/2\bigg{)}. (11)

We can also provide another upper bound based on T(RP)T(R\|P) and T(RQ)T(R\|Q) instead of T(PR)T(P\|R) and T(QR)T(Q\|R):

|𝔼P[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq
|𝔼R[f(X)]𝔼P[f(X)]|+|𝔼R[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{R}[f(X)]-\mathbb{E}_{P}[f(X)]\right|+\left|\mathbb{E}_{R}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|
F(T(RP))+F(T(RQ)).\displaystyle\leq F(T(R\|P))+F(T(R\|Q)). (12)

Considering R~argminR𝒫(X)T(RP)+T(RQ)\tilde{R}\in\operatorname*{arg\,min}_{R\in\mathcal{P}(X)}T(R\|P)+T(R\|Q), we have

|𝔼P[f(X)]𝔼Q[f(X)]|\displaystyle\left|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]\right|\leq
2F(T(R~P)/2+T(R~Q)/2).\displaystyle 2F\bigg{(}T(\tilde{R}\|P)/2+T(\tilde{R}\|Q)/2\bigg{)}. (13)

Via this ADM approach – taking T()T(\cdot\|\cdot) to be a KL divergence – we can derive expected generalization error upper bounds involving KL divergences as follows:

αKL(PW,ZiP^W,Zi)+(1α)KL(PWμP^W,Zi),\displaystyle\alpha\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}}), (14)
αKL(P^W,ZiPW,Zi)+(1α)KL(P^W,ZiPWμ),\displaystyle\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu), (15)

where P^W,Zi\widehat{P}_{W,Z_{i}}, PW,ZiP_{W,Z_{i}} and PWμP_{W}\otimes\mu are an auxiliary joint distribution over the space 𝒵×𝒲\mathcal{Z}\times\mathcal{W}, the true joint distribution of the random variables WW and ZiZ_{i} and the product of marginal distributions of random variables WW and ZiZ_{i}, respectively. Inspired by the ADM, we use the fact that KL divergence is asymmetric and satisfies the reverse triangle inequality [30]. Hence, we can choose the auxiliary joint distribution, P^W,Zi\widehat{P}_{W,Z_{i}}, to derive new upper bounds which are finite or tighter under some conditions.
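To make the ADM step concrete, the following sketch (an illustration we add here, not taken from the paper) checks the reverse triangle inequality (10) for T = KL on a small discrete example, using the equal-weight mixture R = (P+Q)/2 as the auxiliary distribution; for this choice, KL(P‖R)+KL(Q‖R) equals twice the Jensen-Shannon divergence.

```python
import numpy as np

def kl(p, q):
    # KL divergence between discrete distributions (assumes full support of q).
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])

r_mix = 0.5 * (p + q)               # auxiliary distribution R = (P + Q) / 2
lhs = kl(p, r_mix) + kl(q, r_mix)   # equals 2 * JSD(P, Q)
rhs = kl(p, q)

print(f"KL(P||R) + KL(Q||R) = {lhs:.4f}")
print(f"KL(P||Q)            = {rhs:.4f}")
assert lhs <= rhs + 1e-12           # reverse triangle inequality (10) for T = KL
```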

II-D Information Measures

In our characterization of the expected generalization error upper bounds, we will use the information measures between two distributions P_X and P_{X′} on a common measurable space 𝒳, summarized in Table I. The main divergences discussed in this paper are the α-JS divergence (a.k.a. capacitory discrimination [31] for α=1/2) and the α-Rényi divergence, which can be characterized by (14) and (15), respectively (see their characterizations as convex combinations of KL divergences in Lemmas 2 and 3); both are defined in Table I. The KL divergence, symmetrized KL divergence, Bhattacharyya distance, and Jensen-Shannon divergence can be obtained as special cases of the first three divergences in Table I.

TABLE I: Divergence Measures Definitions
Divergence Measure Definition
KL divergence [32] KL(PXPX):=𝒳PX(x)log(PX(x)PX(x))𝑑x\mathrm{KL}(P_{X}\|P_{X^{\prime}}):=\int_{\mathcal{X}}P_{X}(x)\log\left(\frac{P_{X}(x)}{P_{X^{\prime}}(x)}\right)dx
α\alpha-JS\mathrm{JS} divergence [33, 34] JSα(PXPX):=\mathrm{JS}_{\alpha}(P_{X^{\prime}}\|P_{X}):= αKL(PXαPX+(1α)PX)+(1α)KL(PXαPX+(1α)PX)\alpha\mathrm{KL}\left(P_{X}\|\alpha P_{X}+(1-\alpha)P_{X^{\prime}}\right)+(1-\alpha)\mathrm{KL}\left(P_{X^{\prime}}\|\alpha P_{X}+(1-\alpha)P_{X^{\prime}}\right)
Jensen-Shannon divergence [34] JSD(PXPX):=JS1/2(PXPX)\mathrm{JSD}(P_{X^{\prime}}\|P_{X}):=\mathrm{JS}_{1/2}(P_{X^{\prime}}\|P_{X}) =12KL(PX||PX+PX2)+12KL(PX||PX+PX2)=\frac{1}{2}\mathrm{KL}\left(P_{X}\Big{|}\Big{|}\frac{P_{X}+P_{X^{\prime}}}{2}\right)+\frac{1}{2}\mathrm{KL}\left(P_{X^{\prime}}\Big{|}\Big{|}\frac{P_{X}+P_{X^{\prime}}}{2}\right)
α\alpha-Rényi divergence for α[0,)\alpha\in[0,\infty) [35] Rα(PXPX):=1α1log(𝒳PXα(x)PX1α(x)𝑑x)\mathrm{R}_{\alpha}(P_{X^{\prime}}\|P_{X}):=\frac{1}{\alpha-1}\log\left(\int_{\mathcal{X}}P_{X}^{\alpha}(x)P_{X^{\prime}}^{1-\alpha}(x)dx\right)
Bhattacharyya distance [36] D_{B}(P_{X^{\prime}}\|P_{X}):=\frac{1}{2}\mathrm{R}_{1/2}(P_{X^{\prime}}\|P_{X})=-\log\left(\int_{\mathcal{X}}\sqrt{P_{X}(x)P_{X^{\prime}}(x)}\,dx\right)
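For concreteness, a small Python sketch of the discrete versions of these divergences is given below (added for illustration; the paper works with general measures). It checks, in particular, that JS_{1/2} coincides with the Jensen-Shannon divergence.

```python
import numpy as np

def kl(p, q):
    # Discrete KL divergence; assumes p, q are full-support probability vectors.
    return float(np.sum(p * np.log(p / q)))

def js_alpha(p_prime, p, alpha):
    # alpha-JS divergence JS_alpha(P' || P) from Table I.
    m = alpha * p + (1 - alpha) * p_prime
    return alpha * kl(p, m) + (1 - alpha) * kl(p_prime, m)

def renyi(p_prime, p, alpha):
    # alpha-Renyi divergence (0 < alpha < 1), with the formula as written in Table I.
    return float(np.log(np.sum(p**alpha * p_prime**(1 - alpha))) / (alpha - 1))

p = np.array([0.6, 0.3, 0.1])
p_prime = np.array([0.3, 0.3, 0.4])

jsd = 0.5 * kl(p, 0.5 * (p + p_prime)) + 0.5 * kl(p_prime, 0.5 * (p + p_prime))
assert np.isclose(js_alpha(p_prime, p, 0.5), jsd)   # JS_{1/2} = Jensen-Shannon divergence
print(js_alpha(p_prime, p, 0.5), renyi(p_prime, p, 0.5))
```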

In addition, in our expected generalization error characterizations, we will also use various information measures between two random variables XX and XX^{\prime} with joint distribution PXXP_{XX^{\prime}} and marginals PXP_{X} and PXP_{X^{\prime}}. These information measures are summarized in Table II. Note that all these information measures are zero if and only if the random variables XX and XX^{\prime} are independent.

TABLE II: Information Measures Definitions
Information Measure Definition
Mutual information I(X;X):=KL(PX,XPXPX)I(X;X^{\prime}):=\mathrm{KL}(P_{X,X^{\prime}}\|P_{X}\otimes P_{X^{\prime}})
Lautum information [37] L(X;X):=KL(PXPXPX,X)L(X;X^{\prime}):=\mathrm{KL}(P_{X}\otimes P_{X^{\prime}}\|P_{X,X^{\prime}})
α\alpha-JS\mathrm{JS} information (0<α<10<\alpha<1) IJSα(X;X):=JSα(PX,XPXPX)I_{\mathrm{JS}}^{\alpha}(X;X^{\prime}):=\mathrm{JS}_{\alpha}(P_{X,X^{\prime}}\|P_{X}\otimes P_{X^{\prime}})
Jensen-Shannon information [38] IJS(X;X):=JSD(PX,XPXPX)I_{\mathrm{JS}}(X;X^{\prime}):=\mathrm{JSD}(P_{X,X^{\prime}}\|P_{X}\otimes P_{X^{\prime}})
α\alpha-Rényi information IRα(X;X):=Rα(PX,XPXPX)I_{\mathrm{R}}^{\alpha}(X;X^{\prime}):=\mathrm{R}_{\alpha}(P_{X,X^{\prime}}\|P_{X}\otimes P_{X^{\prime}})
Sibson’s α\alpha-Mutual information [39] ISα(X;X):=minQXRα(PX,XPXQX)I_{\mathrm{S}}^{\alpha}(X;X^{\prime}):=\min_{Q_{X^{\prime}}}\mathrm{R}_{\alpha}(P_{X,X^{\prime}}\|P_{X}\otimes Q_{X^{\prime}})
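Continuing the toy discrete illustration (again, our own sketch and not from the paper), the code below evaluates some of the information measures of Table II on a discrete joint distribution and checks that they vanish when the two variables are independent.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def info_measures(joint, alpha=0.3):
    # joint: 2-D array with entries P_{X,X'}(x, x'); marginals obtained by summation.
    px = joint.sum(axis=1, keepdims=True)
    pxp = joint.sum(axis=0, keepdims=True)
    prod = (px * pxp).ravel()                   # product of marginals
    j = joint.ravel()
    m = alpha * prod + (1 - alpha) * j          # alpha-convex combination P^{(alpha)}
    mi = kl(j, prod)                            # mutual information
    lautum = kl(prod, j)                        # lautum information
    js_info = alpha * kl(prod, m) + (1 - alpha) * kl(j, m)   # alpha-JS information
    return mi, lautum, js_info

dependent = np.array([[0.4, 0.1], [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.3, 0.7])
print(info_measures(dependent))    # all strictly positive
print(info_measures(independent))  # all (numerically) zero
```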

II-E Definitions

We offer some standard definitions that will guide our analysis in the sequel.

Definition 1

The cumulant generating function (CGF) of a random variable XX is defined as

ΛX(λ):=log𝔼[eλ(X𝔼[X])].\Lambda_{X}(\lambda):=\log\mathbb{E}[e^{\lambda(X-\mathbb{E}[X])}]. (16)

Assuming ΛX(λ)\Lambda_{X}(\lambda) exists, it can be verified that ΛX(0)=ΛX(0)=0\Lambda_{X}(0)=\Lambda_{X}^{\prime}(0)=0, and that it is convex.

Definition 2

For a convex function ψ\psi defined on the interval [0,b)[0,b), where 0<b0<b\leq\infty, its Legendre dual ψ\psi^{\star} is defined as

ψ(x):=supλ[0,b)(λxψ(λ)).\psi^{\star}(x):=\sup_{\lambda\in[0,b)}\big{(}\lambda x-\psi(\lambda)\big{)}. (17)

The following lemma characterizes a useful property of the Legendre dual and its inverse function.

Lemma 1

[40, Lemma 2.4] Assume that ψ(0)=ψ(0)=0\psi(0)=\psi^{\prime}(0)=0. Then, the Legendre dual ψ(x)\psi^{\star}(x) of ψ(x)\psi(x) defined above is a non-negative convex and non-decreasing function on [0,)[0,\infty) with ψ(0)=0\psi^{\star}(0)=0. Moreover, its inverse function ψ1(y)=inf{x0:ψ(x)y}\psi^{\star-1}(y)=\inf\{x\geq 0:\psi^{\star}(x)\geq y\} is concave, and can be written as

ψ1(y)=infλ[0,b)(y+ψ(λ)λ),b>0.\psi^{\star-1}(y)=\inf_{\lambda\in[0,b)}\Big{(}\frac{y+\psi(\lambda)}{\lambda}\Big{)},\quad b>0. (18)

Importantly, using these results, we can characterize the tail behaviour of Sub-Gaussian random variables. A random variable XX is σ\sigma-sub-Gaussian, if ψ(λ)=σ2λ22\psi(\lambda)=\frac{\sigma^{2}\lambda^{2}}{2} is an upper bound on ΛX(λ)\Lambda_{X}(\lambda), for λ\lambda\in\mathbb{R}. Then by Lemma 1,

ψ1(y)=2σ2y.\displaystyle\psi^{\star-1}(y)=\sqrt{2\sigma^{2}y}. (19)

The tail behaviours of sub-Exponential and sub-Gamma random variables are introduced in [20].
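For completeness, the inverse in (19) follows from a one-line optimization over λ (a standard step, spelled out here for the reader): minimizing (y+σ²λ²/2)/λ over λ>0 gives the optimizer λ=\sqrt{2y/\sigma^{2}}, and hence

\psi^{\star-1}(y)=\inf_{\lambda>0}\frac{y+\sigma^{2}\lambda^{2}/2}{\lambda}=\sqrt{2\sigma^{2}y}.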

III Upper Bounds on the Expected Generalization Error via ADM

We provide a series of bounds on the expected generalization error of supervised learning algorithms based on different information measures using the ADM coupled with KL divergence.

III-A α\alpha-Jensen-Shannon- based Upper Bound

In the following theorem, we provide a new expected generalization error upper bound by applying ADM with the KL divergence terms KL(P_W⊗μ‖P̂_{W,Z_i}) and KL(P_{W,Z_i}‖P̂_{W,Z_i}). Some proof details are deferred to Appendix A.

Theorem 1

Assume that, under an auxiliary joint distribution P̂_{W,Z_i}∈𝒫(𝒲×𝒵), the CGF Λ_{ℓ(W,Z_i)}(λ) exists and is upper bounded by ψ_+(λ) for λ∈[0,b_+), 0<b_+<+∞, and by ψ_−(−λ) for λ∈(b_−,0], for all i=1,⋯,n. Also assume that ψ_+(λ) and ψ_−(λ) are convex functions with ψ_−(0)=ψ_+(0)=ψ_+′(0)=ψ_−′(0)=0. Then, it holds that:

gen¯(PW|S,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1n(ψ+1(Ai)+ψ1(Bi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left(\psi_{+}^{\star-1}(A_{i})+\psi_{-}^{\star-1}(B_{i})\right), (20)
gen¯(PW|S,μ)\displaystyle-\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1n(ψ1(Ai)+ψ+1(Bi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left(\psi_{-}^{\star-1}(A_{i})+\psi_{+}^{\star-1}(B_{i})\right), (21)

where Ai=KL(PWμP^W,Zi)A_{i}=\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}}), Bi=KL(PW,ZiP^W,Zi)B_{i}=\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}), ψ1(x)=infλ[0,b)x+ψ(λ)λ\psi_{-}^{\star-1}(x)=\inf_{\lambda\in[0,-b_{-})}\frac{x+\psi_{-}(\lambda)}{\lambda} and ψ+1(x)=infλ[0,b+)x+ψ+(λ)λ.\psi_{+}^{\star-1}(x)=\inf_{\lambda\in[0,b_{+})}\frac{x+\psi_{+}(\lambda)}{\lambda}\,.

Proof:

The proofs of the bounds on gen¯(P_{W|S},μ) and −gen¯(P_{W|S},μ) are similar; therefore, we focus on the former.

Let us consider the Donsker–Varadhan variational representation of KL divergence between two probability distributions α\alpha and β\beta on a common space Ψ\Psi given by [41]:

KL(αβ)=supfΨf𝑑αlogΨef𝑑β,\displaystyle\mathrm{KL}(\alpha\|\beta)=\sup_{f}\int_{\Psi}fd\alpha-\log\int_{\Psi}e^{f}d\beta, (22)

where f={f:Ψ s.t. 𝔼β[ef]<}f\in\mathcal{F}=\{f:\Psi\rightarrow\mathbb{R}\text{ s.t. }\mathbb{E}_{\beta}[e^{f}]<\infty\}.

We use the Donsker-Varadhan representation to lower bound KL(P_{W,Z_i}‖P̂_{W,Z_i}) for λ∈(b_−,0] as follows:

KL(PW,ZiP^W,Zi)\displaystyle\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})\geq (23)
𝔼PW,Zi[λ(W,Zi)]log𝔼P^W,Zi[eλ(W,Zi)]\displaystyle\mathbb{E}_{P_{W,Z_{i}}}[\lambda\ell(W,Z_{i})]-\log\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[e^{\lambda\ell(W,{Z}_{i})}]\geq
λ(𝔼PW,Zi[(W,Zi)]𝔼P^W,Zi[(W,Zi)])ψ(λ),\displaystyle\lambda(\mathbb{E}_{P_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,{Z}_{i})])-\psi_{-}(-\lambda), (24)

where the last inequality is due to:

Λ(W,Zi)(λ)=\displaystyle\Lambda_{\ell({W},{Z}_{i})}(\lambda)= (25)
log(𝔼P^W,Zi[eλ((W,Zi)𝔼P^W,Zi[(W,Zi)])])ψ(λ).\displaystyle\log\left(\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[e^{\lambda(\ell(W,{Z}_{i})-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,{Z}_{i})])}]\right)\leq\psi_{-}(-\lambda).

It can then be shown from (24) that the following holds for λ(b,0]\lambda\in(b_{-},0]:

𝔼P^W,Zi[(W,Zi)]𝔼PW,Zi[(W,Zi)]\displaystyle\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,{Z}_{i})]-\mathbb{E}_{{P}_{W,Z_{i}}}[\ell(W,Z_{i})]\leq (26)
infλ[0,b)KL(PW,ZiP^W,Zi)+ψ(λ)λ=\displaystyle\inf_{\lambda\in[0,-b_{-})}\frac{\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+\psi_{-}(\lambda)}{\lambda}=
ψ1(KL(PW,ZiP^W,Zi)).\displaystyle\psi_{-}^{\star-1}(\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})). (27)

It can likewise also be shown by adopting similar steps that the following holds for λ[0,b+)\lambda\in[0,b_{+}):

\mathbb{E}_{P_{W,Z_i}}[\ell(W,Z_i)]-\mathbb{E}_{\widehat{P}_{W,Z_i}}[\ell(W,Z_i)]\leq (28)
\inf_{\lambda\in[0,b_{+})}\frac{\mathrm{KL}(P_{W,Z_i}\|\widehat{P}_{W,Z_i})+\psi_{+}(\lambda)}{\lambda}=
\psi_{+}^{\star-1}(\mathrm{KL}(P_{W,Z_i}\|\widehat{P}_{W,Z_i})). (29)

We can similarly show using an identical procedure that:

𝔼PWμ[(W,Zi)]𝔼P^W,Zi[(W,Z^i)]\displaystyle\mathbb{E}_{P_{W}\otimes\mu}[\ell({W},{Z}_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,\widehat{Z}_{i})]
ψ+1(KL(PWμP^W,Zi))\displaystyle\leq\psi_{+}^{\star-1}(\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})) (30)
𝔼P^W,Zi[(W,Z^i)]𝔼PWμ[(W,Zi)]\displaystyle\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,\widehat{Z}_{i})]-\mathbb{E}_{P_{W}\otimes\mu}[\ell({W},{Z}_{i})]
ψ1(KL(PWμP^W,Zi)).\displaystyle\leq\psi_{-}^{\star-1}(\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})). (31)

Finally, we can immediately bound the expected generalization error by leveraging (III-A) and (26) as follows:

gen¯(PW|S,μ)=1ni=1n𝔼PWμ[(W,Zi)]𝔼PW,Zi[(W,Zi)]\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}\otimes\mu}[\ell({W},{Z}_{i})]-\mathbb{E}_{{P}_{W,Z_{i}}}[\ell(W,Z_{i})]
=1ni=1n𝔼PWμ[(W,Zi)]𝔼P^W,Zi[(W,Zi)]+\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}\otimes\mu}[\ell({W},{Z}_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell({W},{Z}_{i})]+
𝔼P^W,Zi[(W,Zi)]𝔼PW,Zi[(W,Zi)]\displaystyle~{}\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell({W},{Z}_{i})]-\mathbb{E}_{{P}_{W,Z_{i}}}[\ell(W,Z_{i})]
1ni=1n(ψ+1(Ai)+ψ1(Bi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left(\psi_{+}^{\star-1}(A_{i})+\psi_{-}^{\star-1}(B_{i})\right),

where Ai=KL(PWμP^W,Zi)A_{i}=\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}}) and Bi=KL(PW,ZiP^W,Zi)B_{i}=\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}). ∎

Note that Theorem 1 can be applied under the sub-Gaussian assumption (19), as well as under sub-Exponential and sub-Gamma assumptions on the loss function CGF, as introduced in [20].

We can utilize Theorem 1 to recover existing expected generalization error bounds and offer new ones. For example, we can immediately recover the mutual information bound [7] from the following results.

Example 1

Choose P^W,Zi=PWμ\widehat{P}_{W,Z_{i}}=P_{W}\otimes\mu for i=1,,ni=1,\cdots,n. It follows immediately from Theorem 1 that:

gen¯(PW|S,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1nψ1(I(W;Zi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\psi_{-}^{\star-1}(I(W;Z_{i})), (32)
gen¯(PW|S,μ)\displaystyle-\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1nψ+1(I(W;Zi)).\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\psi_{+}^{\star-1}(I(W;Z_{i})). (33)
Example 2

Choose P^W,Zi=PW,Zi\widehat{P}_{W,Z_{i}}=P_{W,Z_{i}} for i=1,,ni=1,\cdots,n. It also follows immediately from Theorem 1 that:

gen¯(PW|S,μ)\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1nψ+1(L(W;Zi)),\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\psi_{+}^{\star-1}(L(W;Z_{i})), (34)
gen¯(PW|S,μ)\displaystyle-\overline{\operatorname*{gen}}(P_{W|S},\mu) 1ni=1nψ1(L(W;Zi)).\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\psi_{-}^{\star-1}(L(W;Z_{i})). (35)

The result in Example 1 is the same as a result appearing in [8] whereas the result in Example 2 extends the result appearing in [42].

The conclusion in Theorem 1 can be extended to multiple auxiliary distributions by repeatedly applying ADM. In this work, we consider just one auxiliary distribution and apply ADM once.

Building upon Theorem 1, we are also able to provide an expected generalization error upper bound based on a convex combination of KL terms, i.e.,

αKL(PWμP^W,Zi)+(1α)KL(PW,ZiP^W,Zi),\alpha\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}),

that relies on a certain σ\sigma-sub-Gaussian tail assumption.

Proposition 1

Assume that the loss function is σ̂-sub-Gaussian under the distribution P̂_{W,Z_i} for all i=1,⋯,n. Then, it holds for all α∈(0,1) that:

|gen¯(PW|S,μ)|1ni=1n2σ^2(αAi+(1α)Bi)α(1α),\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\hat{\sigma}^{2}\frac{\left(\alpha A_{i}+(1-\alpha)B_{i}\right)}{\alpha(1-\alpha)}}, (36)

where Ai=KL(PW,ZiP^W,Zi)A_{i}=\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}) and Bi=KL(PWμP^W,Zi)B_{i}=\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}}) .

Proof:

The assumption that the loss function is σ̂-sub-Gaussian under the distribution P̂_{W,Z_i} implies that ψ_−^{⋆−1}(y)=ψ_+^{⋆−1}(y)=√(2σ̂²y) [8]. Consider arbitrary auxiliary distributions {P̂_{W,Z_i}}_{i=1}^{n} defined on 𝒲×𝒵.

gen¯(μ,PW|S)=𝔼PWPS[LE(W,S)]𝔼PW,S[LE(W,S)]\displaystyle\overline{\text{\rm{gen}}}(\mu,P_{W|S})=\mathbb{E}_{P_{W}P_{S}}[L_{E}(W,S)]-\mathbb{E}_{P_{W,S}}[L_{E}(W,S)]
=1ni=1n𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})] (37)
1ni=1n|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right| (38)

Using the assumption that the loss function (w,zi)\ell(w,z_{i}) is σ^2\hat{\sigma}^{2}-sub-Gaussian under distribution P^W,Zi\widehat{P}_{W,Z_{i}} and Donsker-Varadhan representation for KL(PWZiP^W,Zi)\mathrm{KL}(P_{WZ_{i}}\|\widehat{P}_{W,Z_{i}}), we have:

λ(𝔼PW,Zi[(W,Zi)]𝔼P^W,Zi[(W,Zi)])\displaystyle\lambda\left(\mathbb{E}_{P_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]\right)\leq (39)
KL(PWZiP^W,Zi)+λ2σ^22.λ\displaystyle\quad\mathrm{KL}(P_{WZ_{i}}\|\widehat{P}_{W,Z_{i}})+\frac{\lambda^{2}\hat{\sigma}^{2}}{2}.\quad\forall\lambda\in\mathbb{R}

Using the assumption that the loss function ℓ(w,z_i) is σ̂-sub-Gaussian under the distribution P̂_{W,Z_i} and the Donsker-Varadhan representation for KL(P_W⊗P_{Z_i}‖P̂_{W,Z_i}), we have,

λ(𝔼PWPZi[(W,Zi)]𝔼P^W,Zi[(W,Zi)])\displaystyle\lambda^{\prime}\left(\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]\right)\leq (40)
KL(PWPZiP^W,Zi)+λ2σ^22,λ.\displaystyle\quad\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})+\frac{{\lambda^{\prime}}^{2}\hat{\sigma}^{2}}{2},\quad\forall\lambda^{\prime}\in\mathbb{R}.

Assuming λ<0, we can choose λ′=(α/(α−1))λ>0. Hence, we have,

𝔼P^W,Zi[(W,Zi)]𝔼PW,Zi[(W,Zi)]\displaystyle\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W,Z_{i}}}[\ell(W,Z_{i})]\leq (41)
KL(PW,ZiP^W,Zi)|λ|+|λ|σ^22,λ,\displaystyle\quad\frac{\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})}{|\lambda|}+\frac{|\lambda|\hat{\sigma}^{2}}{2},\quad\forall\lambda\in\mathbb{R^{-}},

and,

𝔼PWPZi[(W,Zi)]𝔼P^W,Zi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]\leq (42)
KL(PWPZiP^W,Zi)λ+λσ^22,λ+.\displaystyle\quad\frac{\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})}{\lambda^{\prime}}+\frac{\lambda^{\prime}\hat{\sigma}^{2}}{2},\quad\forall\lambda^{\prime}\in\mathbb{R^{+}}.

Merging the two inequalities (41) and (42) with λ′=(α/(α−1))λ, we have

𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\leq (43)
αKL(PW,ZiP^W,Zi)+(1α)KL(PWPZiP^W,Zi)α|λ|+\displaystyle\frac{\alpha\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})}{\alpha|\lambda|}+
|λ|σ^22+|λ|α1ασ^22,λ.\displaystyle\quad\frac{|\lambda|\hat{\sigma}^{2}}{2}+\frac{|\lambda|\frac{\alpha}{1-\alpha}\hat{\sigma}^{2}}{2},\quad\forall\lambda\in\mathbb{R^{-}}.

Similarly, using an identical approach, we also obtain,

(𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)])\displaystyle-\left(\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right)\leq (44)
αKL(PW,ZiP^W,Zi)+(1α)KL(PWPZiP^W,Zi)αλ+\displaystyle\frac{\alpha\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})}{\alpha\lambda}+
λσ^22+λα1ασ^22,λ+.\displaystyle\quad\frac{\lambda\hat{\sigma}^{2}}{2}+\frac{\lambda\frac{\alpha}{1-\alpha}\hat{\sigma}^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Considering (43) and (44) and multiplying through by |λ|, we obtain a non-negative parabola in |λ|, whose discriminant must therefore be non-positive; hence, for all α∈(0,1),

|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|2\displaystyle\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right|^{2}\leq (45)
2σ^2(αKL(PW,ZiP^W,Zi)+(1α)KL(PWPZiP^W,Zi))α(1α).\displaystyle 2\hat{\sigma}^{2}\frac{\left(\alpha\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W}P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})\right)}{\alpha(1-\alpha)}.

Using (III-A) completes the proof. ∎

We propose a Lemma connecting certain KL divergences to the α\alpha-JS\mathrm{JS} information.

Lemma 2

Consider an auxiliary distribution P^W,Zi𝒫(𝒲×𝒵)\widehat{P}_{W,Z_{i}}\in\mathcal{P}(\mathcal{W}\times\mathcal{Z}). Then, the following equality holds:

αKL(PWμP^W,Zi)+(1α)KL(PW,ZiP^W,Zi)=\displaystyle\alpha\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})=
IJSα(W;Zi)+KL(PW,Zi(α)P^W,Zi).\displaystyle I_{\mathrm{JS}}^{\alpha}(W;Z_{i})+\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|\widehat{P}_{W,Z_{i}}).

Note that the proof is inspired by [43].
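To see Lemma 2 in action, the following sketch (our own numerical check on a toy discrete joint distribution, not from the paper) verifies the identity and the fact that the auxiliary distribution P^{(α)}_{W,Z_i} minimizes the convex combination of KL terms.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
alpha = 0.3

# Toy discrete joint P_{W,Z_i} and its product of marginals, flattened to vectors.
joint2d = np.array([[0.4, 0.1], [0.1, 0.4]])
prod2d = np.outer(joint2d.sum(axis=1), joint2d.sum(axis=0))
j, a = joint2d.ravel(), prod2d.ravel()

mix = alpha * a + (1 - alpha) * j                        # P^{(alpha)}_{W,Z_i}
js_info = alpha * kl(a, mix) + (1 - alpha) * kl(j, mix)  # I_JS^alpha(W; Z_i)

# A generic auxiliary distribution P_hat (random, full support).
p_hat = rng.dirichlet(np.ones(4))

lhs = alpha * kl(a, p_hat) + (1 - alpha) * kl(j, p_hat)
rhs = js_info + kl(mix, p_hat)
assert np.isclose(lhs, rhs)   # the identity in Lemma 2
assert lhs >= js_info         # hence P_hat = P^{(alpha)} minimizes the left-hand side
print(lhs, js_info)
```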

Using the result in Proposition 1 together with ADM, we can provide a tighter upper bound. Specifically, Lemma 2 paves the way to choose an appropriate auxiliary distribution, yielding a tighter version of the expected generalization error bound in Proposition 1, as well as recovering existing bounds.

Theorem 2

Assume that the loss function is σ_(α)-sub-Gaussian under the distribution P^{(α)}_{W,Z_i} for all i=1,⋯,n. Then, it holds for all α∈(0,1) that:

|gen¯(PW|S,μ)|1ni=1n2σ(α)2IJSα(W;Zi)α(1α),α(0,1).|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}{\alpha(1-\alpha)}},\quad\forall\alpha\in(0,1).

The bound in Theorem 2 results from minimizing the term αKL(P_W⊗μ‖P̂_{W,Z_i})+(1−α)KL(P_{W,Z_i}‖P̂_{W,Z_i}) in the upper bound (36), presented in Proposition 1, over the joint auxiliary distribution P̂_{W,Z_i}. The optimal joint auxiliary distribution is P^{(α)}_{W,Z_i}:=αP_W⊗P_{Z_i}+(1−α)P_{W,Z_i}. Note that the sub-Gaussianity parameter, denoted σ̂ in Proposition 1, depends on P̂_{W,Z_i}. Consequently, the upper bound in Theorem 2 is not necessarily the minimum of the upper bound in Proposition 1. However, for a bounded loss function, the upper bound in Theorem 2 becomes the minimum of the upper bound in Proposition 1.
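As a sanity check of Theorem 2, the sketch below is our own toy example with a single sample (n=1) and a loss taking values in [0,1], so that σ_(α)=1/2 under any distribution; it compares the per-sample generalization term with the α-JS bound.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

alpha = 0.5
# Toy joint P_{W,Z} over a 2x2 space and a bounded loss ell(w, z) in [0, 1].
joint = np.array([[0.40, 0.10],
                  [0.10, 0.40]])
loss = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))   # product of marginals

gap = abs(np.sum(prod * loss) - np.sum(joint * loss))   # per-sample generalization term

mix = alpha * prod + (1 - alpha) * joint
js_info = (alpha * kl(prod.ravel(), mix.ravel())
           + (1 - alpha) * kl(joint.ravel(), mix.ravel()))
sigma_alpha = 0.5                                        # (b - a)/2 for a loss in [0, 1]
bound = np.sqrt(2 * sigma_alpha**2 * js_info / (alpha * (1 - alpha)))

print(f"gap = {gap:.4f}, Theorem 2 bound = {bound:.4f}")
assert gap <= bound
```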

It turns out that we can immediately recover existing bounds from Theorem 2 depending on how we choose α\alpha.

Remark 1 (Recovering upper bound based on Jensen-Shannon information)

The expected generalization error upper bound based on Jensen-Shannon information in [1] can be immediately recovered by considering α=12\alpha=\frac{1}{2} in Theorem 2.

Remark 2 (Recovering upper bounds based on mutual information and lautum information)

The expected generalization error upper bound based on mutual information in [8] and lautum information in [42] can be immediately recovered by considering α1\alpha\rightarrow 1 and α0\alpha\rightarrow 0 in Theorem 2, respectively.

Note that we can also establish how the bound in Theorem 2 behaves as a function of the number of training samples. This can be done by using P̂_{W,Z_i}=P_W⊗μ in Lemma 2, leading to

(1α)I(W;Zi)=IJSα(W;Zi)+KL(PW,Zi(α)PWμ).\displaystyle(1-\alpha)I(W;Z_{i})=I_{\mathrm{JS}}^{\alpha}(W;Z_{i})+\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|P_{W}\otimes\mu). (46)

and in turn to the following inequality

IJSα(W;Zi)(1α)I(W;Zi),α(0,1).\displaystyle I_{\mathrm{JS}}^{\alpha}(W;Z_{i})\leq(1-\alpha)I(W;Z_{i}),\quad\forall\alpha\in(0,1). (47)

We prove the convergence rate of the upper bound in Theorem 2 using (47).

Proposition 2

Assume the hypothesis space is finite and the data samples, {Zi}i=1n\{Z_{i}\}_{i=1}^{n}, are i.i.d. Then, the bound in Theorem 2 has a convergence rate of 𝒪(1n)\mathcal{O}(\frac{1}{\sqrt{n}}).

The value of the new bound in Theorem 2 relative to existing bounds can be further appreciated via two additional results.

Proposition 3

Consider the assumptions in Theorem 2. Then, it follows that:

|gen¯(PW|S,μ)|σ(α)2h(α)α(1α),α(0,1),\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\sigma_{(\alpha)}\sqrt{2\frac{h(\alpha)}{\alpha(1-\alpha)}},\quad\forall\alpha\in(0,1), (48)

where h(α)=αlog(α)(1α)log(1α)h(\alpha)=-\alpha\log(\alpha)-(1-\alpha)\log(1-\alpha).

This proposition shows that, unlike the existing mutual information-based and lautum information-based generalization bounds (e.g., [7], [8], [9], and [11]), the proposed α-JS information generalization bound is always finite. We can also optimize the bound in (48) with respect to α; the minimum is achieved at α=1/2.

Corollary 1

Consider the assumptions in Theorem 2. Then, it follows that:

|gen¯(PW|S,μ)|2σ(1/2)2log(2).\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq 2\sigma_{(1/2)}\sqrt{2\log(2)}. (49)

This result applies regardless of whether the loss function is bounded. Naturally, for any loss function bounded within the interval [a,b], the absolute value of the expected generalization error is always upper bounded as |gen¯(P_{W|S},μ)|≤(b−a). If we consider bounded loss functions in the interval [a,b], then our upper bound (49) becomes √(2log(2))(b−a), which is smaller than the total-variation-based upper bound 2(b−a) presented in [44, 15].
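A quick numerical comparison of these constants (added here for illustration) confirms that the constant in (48) is minimized at α=1/2 and that, for a loss in [a,b], the resulting bound √(2 log 2)(b−a) ≈ 1.18(b−a) improves on the total-variation constant 2(b−a).

```python
import numpy as np

alphas = np.linspace(0.01, 0.99, 99)
h = -alphas * np.log(alphas) - (1 - alphas) * np.log(1 - alphas)
constant = np.sqrt(2 * h / (alphas * (1 - alphas)))   # factor multiplying sigma_(alpha) in (48)

best = alphas[np.argmin(constant)]
# Minimum at alpha = 1/2, with value 2*sqrt(2*log(2)) ~ 2.355 (Corollary 1).
print(best, constant.min(), 2 * np.sqrt(2 * np.log(2)))
# For a loss in [a, b], sigma_(alpha) = (b - a)/2, so the bound is sqrt(2 log 2)*(b - a) ~ 1.18 (b - a).
```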

It is worth mentioning that our result cannot be immediately recovered from existing approaches such as [11, Theorem 2]. For example, if we consider the upper bound based on Jensen-Shannon information, there exists an f-divergence representation of the Jensen-Shannon divergence as follows:

JSD(PX,PX)=dPXf(dPXdPX),\displaystyle\mathrm{JSD}(P_{X},P_{X^{\prime}})=\int\mathrm{d}P_{X}f\left(\frac{\mathrm{d}P_{X^{\prime}}}{\mathrm{d}P_{X}}\right), (50)

with f(t)=tlog(t)(1+t)log(1+t2)f(t)=t\log(t)-(1+t)\log(\frac{1+t}{2}). However, [11, Theorem. 2] requires that the function f(t)f(t) associated with the ff-divergence is non-decreasing within the interval [0,+)[0,+\infty), but such a requirement is naturally violated by the function f(t)=tlog(t)(1+t)log(1+t2)f(t)=t\log(t)-(1+t)\log(\frac{1+t}{2}) associated with the Jensen-Shannon divergence.

III-B α\alpha-Rényi-based Upper Bound

Next, we provide a new expected generalization error upper bound by applying ADM with the following KL divergence terms: KL(P̂_{W,Z_i}‖P_W⊗μ) and KL(P̂_{W,Z_i}‖P_{W,Z_i}). All the proof details are deferred to Appendix B.

Proposition 4

Suppose that Λ(W,Z)(λ)γ+(λ)\Lambda_{\ell(W,Z)}(\lambda)\leq\gamma_{+}(\lambda) and Λ(W,Zi)(λ)ϕ+(λ),i=1,,n\Lambda_{\ell(W,Z_{i})}(\lambda)\leq\phi_{+}(\lambda),\quad i=1,\cdots,n for λ[0,a+)\lambda\in[0,a_{+}), 0<a+<+0<a_{+}<+\infty and λ[0,c+)\lambda\in[0,c_{+}), 0<c+<+0<c_{+}<+\infty, under PWμP_{W}\otimes\mu and PW,ZiP_{W,Z_{i}}, resp. We also have Λ(W,Z)(λ)γ(λ)\Lambda_{\ell(W,Z)}(\lambda)\leq\gamma_{-}(-\lambda) and Λ(W~,Z~i)(λ)ϕ(λ),i=1,,n\Lambda_{\ell(\widetilde{W},\widetilde{Z}_{i})}(\lambda)\leq\phi_{-}(-\lambda),\quad i=1,\cdots,n for λ(a,0]\lambda\in(a_{-},0], <a<0-\infty<a_{-}<0 and λ(c,0]\lambda\in(c_{-},0], <c<0-\infty<c_{-}<0 under PWμP_{W}\otimes\mu and PW,ZiP_{W,Z_{i}}, resp. Assume that γ+(λ)\gamma_{+}(\lambda), ϕ+(λ)\phi_{+}(\lambda), γ(λ)\gamma_{-}(\lambda) and ϕ(λ)\phi_{-}(\lambda) are convex functions, γ(0)=γ+(0)=γ+(0)=γ(0)=0\gamma_{-}(0)=\gamma_{+}(0)=\gamma_{+}^{\prime}(0)=\gamma_{-}^{\prime}(0)=0 and ϕ(0)=ϕ+(0)=ϕ+(0)=ϕ(0)=0\phi_{-}(0)=\phi_{+}(0)=\phi_{+}^{\prime}(0)=\phi_{-}^{\prime}(0)=0. Then, the following upper bounds hold,

gen¯(PW|S,μ)1ni=1n(γ1(Di)+ϕ+1(Ci)),\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)\leq\frac{1}{n}\sum_{i=1}^{n}\left(\gamma_{-}^{\star-1}(D_{i})+\phi_{+}^{\star-1}(C_{i})\right), (51)
gen¯(PW|S,μ)1ni=1n(ϕ1(Ci)+γ+1(Di)),\displaystyle-\overline{\operatorname*{gen}}(P_{W|S},\mu)\leq\frac{1}{n}\sum_{i=1}^{n}\left(\phi_{-}^{\star-1}(C_{i})+\gamma_{+}^{\star-1}(D_{i})\right), (52)

where Di=KL(P^W,ZiPWμ)D_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu), Ci=KL(P^W,ZiPW,Zi)C_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}}), γ1(x)=infλ[0,a)x+γ(λ)λ\gamma_{-}^{\star-1}(x)=\inf_{\lambda\in[0,-a_{-})}\frac{x+\gamma_{-}(\lambda)}{\lambda},
γ+1(x)=infλ[0,a+)x+γ+(λ)λ\gamma_{+}^{\star-1}(x)=\inf_{\lambda\in[0,a_{+})}\frac{x+\gamma_{+}(\lambda)}{\lambda}, ϕ1(x)=infλ[0,c)x+ϕ(λ)λ\phi_{-}^{\star-1}(x)=\inf_{\lambda\in[0,-c_{-})}\frac{x+\phi_{-}(\lambda)}{\lambda} and ϕ+1(x)=infλ[0,c+)x+ϕ+(λ)λ\phi_{+}^{\star-1}(x)=\inf_{\lambda\in[0,c_{+})}\frac{x+\phi_{+}(\lambda)}{\lambda}.

Proof:

The proof approach is similar to Theorem 1 by considering different cumulant generating functions and their upper bounds. ∎

Inspired by the upper bound in Proposition 4, we can immediately provide an upper bound on the expected generalization error that depends on the convex combination of KL divergence terms, i.e.,

αKL(P^W,ZiPW,Zi)+(1α)KL(P^W,ZiPWμ),\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu),

under a σ-sub-Gaussian tail assumption.

Proposition 5 (Upper bound with Sub-Gaussian assumption)

Assume that the loss function is σ-sub-Gaussian under the distribution P_W⊗μ and γ-sub-Gaussian under P_{W,Z_i} for all i=1,⋯,n. Then, it holds for all α∈(0,1) that,

|gen¯(PW|S,μ)|\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq (53)
1ni=1n2(ασ2+(1α)γ2)(αCi+(1α)Di)α(1α),\displaystyle\quad\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\left(\alpha C_{i}+(1-\alpha)D_{i}\right)}{\alpha(1-\alpha)}},

where Ci=KL(P^W,ZiPWμ)C_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu) and Di=KL(P^W,ZiPW,Zi)D_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}}) .

Akin to Proposition 1, the result in Proposition 5 paves the way to a new, tighter expected generalization error upper bound via ADM. We next offer a lemma connecting certain KL divergences to the α-Rényi information [35, Theorem 30].

Lemma 3

Consider an arbitrary distribution P^W,Zi\widehat{P}_{W,Z_{i}}. Then, the following equality holds for α(0,1)\forall\alpha\in(0,1),

αKL(P^W,ZiPWμ)+(1α)KL(P^W,ZiPW,Zi)=\displaystyle\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu)+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})= (54)
(1α)IRα(W;Zi)\displaystyle\quad(1-\alpha)I_{\mathrm{R}}^{\alpha}(W;Z_{i})
+KL(P^W,Zi(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)).\displaystyle\quad+\mathrm{KL}\left(\widehat{P}_{W,Z_{i}}\|\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right).
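Analogously to the check after Lemma 2, the sketch below (our own toy verification on a discrete joint, not from the paper) confirms the identity of Lemma 3 and that the normalized geometric mixture is the minimizing auxiliary distribution.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
alpha = 0.4

joint2d = np.array([[0.4, 0.1], [0.1, 0.4]])
prod2d = np.outer(joint2d.sum(axis=1), joint2d.sum(axis=0))   # product of marginals
joint, prod = joint2d.ravel(), prod2d.ravel()

# alpha-Renyi information I_R^alpha(W;Z_i), with the formulas of Tables I and II as written.
renyi_info = np.log(np.sum(prod**alpha * joint**(1 - alpha))) / (alpha - 1)

geo = prod**alpha * joint**(1 - alpha)
geo /= geo.sum()                          # normalized geometric mixture appearing in Lemma 3

p_hat = rng.dirichlet(np.ones(4))         # generic auxiliary distribution
lhs = alpha * kl(p_hat, prod) + (1 - alpha) * kl(p_hat, joint)
rhs = (1 - alpha) * renyi_info + kl(p_hat, geo)
assert np.isclose(lhs, rhs)               # the identity in Lemma 3
assert lhs >= (1 - alpha) * renyi_info    # so the geometric mixture minimizes the left-hand side
print(lhs, (1 - alpha) * renyi_info)
```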

Via ADM and Lemma 3, we obtain a tighter version of the expected generalization error bound in Proposition 5.

Theorem 3 (Upper bound based on α\alpha-Rényi information)

Consider the same assumptions as in Proposition 5. Then, the following upper bound holds for all α∈(0,1),

|gen¯(PW|S,μ)|1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α.\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}}. (55)

The bound in Theorem 3 results from minimizing the bound in Proposition 5 over the joint auxiliary distribution P^W,Zi𝒫(𝒲×𝒵)\hat{P}_{W,Z_{i}}\in\mathcal{P}(\mathcal{W}\times\mathcal{Z}). Such an optimal joint auxiliary distribution is

P^W,Zi=(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α).\widehat{P}_{W,Z_{i}}=\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}.
Remark 3 (Deterministic algorithms per sample)

If the parameter W is a deterministic function of the data sample Z_i, then I(W;Z_i) is not well-defined, as P_{W,Z_i} is not absolutely continuous with respect to P_W⊗P_{Z_i} (we say μ≪ν, i.e., μ is absolutely continuous with respect to ν, if ν(A)=0 for some measurable A implies μ(A)=0). However, by considering the α-Rényi information for α∈[0,1), we do not need to assume absolute continuity.

Remark 4 (Upper bound based on the Bhattacharyya distance)

We can derive the expected generalization error upper bound based on Bhattacharyya distance by considering α=1/2\alpha=1/2 in Theorem 3,

|gen¯(PW|S,μ)|2ni=1n(σ2+γ2)DB(PW,ZiPWμ),\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{2}{n}\sum_{i=1}^{n}\sqrt{(\sigma^{2}+\gamma^{2})D_{B}(P_{W,Z_{i}}\|P_{W}\otimes\mu)},
Remark 5 (Recovering the upper bound based on mutual information and lautum information)

We can recover the expected generalization error upper bound based on mutual information in [7] and lautum information in [42] by considering α1\alpha\rightarrow 1 and α0\alpha\rightarrow 0 in Theorem 3, respectively.

By considering P^W,Zi=PW,Zi\widehat{P}_{W,Z_{i}}=P_{W,Z_{i}}, we have,

αI(W;Zi)=(1α)IRα(W;Zi)\displaystyle\alpha I(W;Z_{i})=(1-\alpha)I_{\mathrm{R}}^{\alpha}(W;Z_{i}) (56)
+KL(PW,Zi(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)).\displaystyle\quad+\mathrm{KL}\left(P_{W,Z_{i}}\|\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right).

Since the KL divergence is non-negative, based on Lemma 3 and the monotonicity of R_α with respect to α, we have,

IRα(W;Zi)min{1,α1α}I(W;Zi).\displaystyle I_{\mathrm{R}}^{\alpha}(W;Z_{i})\leq\min\left\{1,\frac{\alpha}{1-\alpha}\right\}I(W;Z_{i}). (57)

The result in (57) implies that our expected generalization error bound based on α-Rényi information in Theorem 3 exhibits the same convergence rate as the upper bound based on mutual information in [7].

Proposition 6 (Convergence rate of upper bound based on α\alpha-Rényi information)

Assume the hypothesis space is finite and the data samples are i.i.d. Then, the upper bounds based on α\alpha-Rényi information in Theorem 3 have a convergence rate of 𝒪(1n)\mathcal{O}(\frac{1}{\sqrt{n}}).

We can also provide an upper bound based on Sibson’s α\alpha-mutual information.

Theorem 4 (Upper bound based on Sibson’s α\alpha mutual information)

Assume that the loss function is σ\sigma-sub-Gaussian under distribution μ\mu for all w𝒲w\in\mathcal{W} and γ\gamma-sub-Gaussian under PW,ZiP_{W,Z_{i}}, i=1,,n\forall i=1,\cdots,n. Then, it holds that:

|gen¯(PW|S,μ)|1ni=1n2(ασ2+(1α)γ2)ISα(W;Zi)α.\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{S}}^{\alpha}(W;Z_{i})}{\alpha}}.

The upper bound based on α-Rényi divergence could also be derived using the variational representation of α-Rényi divergence in [45]. This approach is applied in [12] by considering sub-Gaussianity under P_{Z_i} and P_{Z_i|W}. Our approach is more general, paving the way to the upper bound based on Sibson's α-mutual information in Theorem 4, which is derived via ADM. Since

ISα(W;Zi)\displaystyle I_{\mathrm{S}}^{\alpha}(W;Z_{i}) =minQW𝒫(𝒲)Rα(PW,ZiQWμ)\displaystyle=\min_{Q_{W}\in\mathcal{P}(\mathcal{W})}\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|Q_{W}\otimes\mu) (58)
Rα(PW,ZiPWμ)=IRα(W;Zi),\displaystyle\leq\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|P_{W}\otimes\mu)=I_{\mathrm{R}}^{\alpha}(W;Z_{i}), (59)

the upper bound in Theorem 4 is tighter than the upper bound in Theorem 3. It is worth mentioning that in Theorem 3 we assume the loss function is σ-sub-Gaussian under the distribution P_W⊗μ, whereas in Theorem 4 we assume the loss function is σ-sub-Gaussian under μ for all w∈𝒲.

We can also apply generalized Pinsker’s inequality [35] to bounded loss functions for bounding the expected generalization error using the α\alpha-Rényi information between data samples, SS, and hypothesis, WW.

Proposition 7

Let ℓ(w,z) be a bounded loss function, i.e., |ℓ(w,z)|≤b. Then,

|gen¯(PW|S,μ)|1ni=1n2b2αIRα(W;Zi),α(0,1].\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{\frac{2b^{2}}{\alpha}I_{\mathrm{R}}^{\alpha}(W;Z_{i})},\quad\forall\alpha\in(0,1]. (60)

Considering a bounded loss function, an upper bound based on Sibson's α-mutual information between S and W can be provided via an approach similar to Proposition 7.

III-C Comparison of Proposed Upper Bounds

A summary of upper bounds on expected generalization error under various σ\sigma-sub-Gaussian assumptions is provided in Table III.

TABLE III: Expected generalization error upper bounds. We compare our bounds with the mutual information and lautum information bounds in terms of finiteness and the distribution under which sub-Gaussianity is assumed.
Upper Bound Measure sub-Gaussian Assumption Bound Is finite?
Mutual information ([8]) PWμP_{W}\otimes\mu 1ni=1n2σ2I(W;Zi)\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma^{2}I(W;Z_{i})} No
Lautum information ([42]) PW,ZiP_{W,Z_{i}}, i=1,,n\forall i=1,\ldots,n 1ni=1n2γ2L(W;Zi)\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\gamma^{2}L(W;Z_{i})} No
α\alpha-JS\mathrm{JS} information (Proposition 3) PW,Zi(α)P_{W,Z_{i}}^{(\alpha)}, i=1,,n\forall i=1,\ldots,n 1ni=1n2σ(α)2IJSα(W;Zi)α(1α)\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}{\alpha(1-\alpha)}} Yes (σ(α)2h(α)α(1α)\sigma_{(\alpha)}\sqrt{2\frac{h(\alpha)}{\alpha(1-\alpha)}})
α\alpha-Rényi information (0α<10\leq\alpha<1) (Theorem 3) PWμP_{W}\otimes\mu and PW,ZiP_{W,Z_{i}}, i=1,,n\forall i=1,\ldots,n 1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}} No
Remark 6 (Bounded loss function)

The bounded loss function l:𝒲×𝒵[a,b]l:\mathcal{W}\times\mathcal{Z}\rightarrow[a,b] is (ba2)(\frac{b-a}{2})-sub-Gaussian under all distributions [7]. In fact, for bounded functions, we have,

σ=γ=σ(α)=(ba)2.\displaystyle\sigma=\gamma=\sigma_{(\alpha)}=\frac{(b-a)}{2}. (61)

We next compare the upper bounds based on α\alpha-JS\mathrm{JS} information, Theorem 2, with the upper bounds based on α\alpha-Rényi information, Theorem 3. The next proposition showcases that the α\alpha-JS\mathrm{JS} information bound can be tighter than the α\alpha-Rényi based upper bound under certain conditions. The proof details are deferred to Appendix C.

Proposition 8 (Comparison of upper bounds based on α\alpha^{\prime}-Jensen-Shannon and α\alpha-Rényi information measures)

Consider the same assumptions in Theorem 2. Then, it follows that α\alpha^{\prime}-Jensen-Shannon bound given by:

|gen¯(PW|S,μ)|1ni=1n2σ(α)2IJSα(W;Zi)α(1α),0α1|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha^{\prime})}^{2}\frac{I_{\mathrm{JS}}^{\alpha^{\prime}}(W;Z_{i})}{\alpha^{\prime}(1-\alpha^{\prime})}},\quad 0\leq\alpha^{\prime}\leq 1 (62)

is tighter than the α\alpha-Rényi based upper bound for 0α1\quad 0\leq\alpha\leq 1, given by,

|gen¯(PW|S,μ)|1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α,\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}}, (63)

provided that αh(α)(1α)αIRα(W;Zi)\frac{\alpha h(\alpha^{\prime})}{(1-\alpha^{\prime})\alpha^{\prime}}\leq I_{\mathrm{R}}^{\alpha}(W;Z_{i}) holds for i=1,,ni=1,\cdots,n and σ(α)=σ=γ\sigma_{(\alpha^{\prime})}=\sigma=\gamma.

Remark 7

The condition in Proposition 8, i.e., αh(α′)/((1−α′)α′)≤I_R^α(W;Z_i), is least restrictive for α′=1/2, i.e., when considering the upper bound based on Jensen-Shannon information.

Remark 8

If we consider α1\alpha\rightarrow 1 and α=12\alpha^{\prime}=\frac{1}{2} in Proposition 8, then the upper bound based on Jensen-Shannon information is tighter than ones based on mutual information [8] provided that 4log(2)I(W;Zi)4\log(2)\leq I(W;Z_{i}) for all i=1,,ni=1,\cdots,n and σ=σJS\sigma=\sigma_{JS}.

IV Upper Bounds on Excess Risk

This section provides upper bounds on the excess risk of learning algorithms obtained as solutions to empirical risk minimization (ERM) regularized by the α-Rényi or α-JS divergence.

IV-A α\alpha-JS\mathrm{JS}-Regularized ERM

It is interesting to consider the regularized ERM with α\alpha-JS\mathrm{JS} information between dataset SS, and hypothesis WW,

minPW|S𝔼[LE(W,S)]+1βIJSα(W;S),\min_{P_{W|S}}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}I_{\mathrm{JS}}^{\alpha}(W;S), (64)

where β>0\beta>0 is a parameter that balances fitting and generalization. Since the optimization problem in (64) is dependent on the data generating distribution, μ\mu, we relax the problem and replace α\alpha-JS\mathrm{JS} information with the α\alpha-JS\mathrm{JS} divergence JSα(PW|SQW|PS)\mathrm{JS}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), as follows,

minPW|S𝔼[LE(W,S)]+1βJSα(PW|SQW|PS),\displaystyle\min_{P_{W|S}}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}\mathrm{JS}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), (65)

where QW𝒫(𝒲)Q_{W}\in\mathcal{P}(\mathcal{W}) is a prior distribution over the hypothesis space.

Lemma 4 (Solution existence of α\alpha-JS\mathrm{JS}-regularized ERM)

The optimization problem in (65) is a convex optimization problem and has a solution.

Proof:

The first term in the objective, 𝔼[LE(W,S)]\mathbb{E}[L_{E}(W,S)], is linear in PW|SP_{W|S}, and the second term, 1βJSα(PW|SQW|PS)\frac{1}{\beta}\mathrm{JS}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), is convex in PW|SP_{W|S} for 0<α<10<\alpha<1 due to [46]. Therefore, the problem is convex and a solution exists. ∎

Let us define the solution of (65),

PW|S,β,JSα:=argminPW|S𝒫(𝒲)𝔼[LE(W,S)]+1βJSα(PW|SQW|PS).P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}}:=\arg\min_{P_{W|S}\in\mathcal{P}(\mathcal{W})}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}\mathrm{JS}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}).
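As an illustration of the relaxed problem (65), the following Python sketch solves it by a simple grid search for a toy two-hypothesis class with fixed, hypothetical empirical risks and a uniform prior; it assumes the convention JS_α(P∥Q) = αKL(Q∥M) + (1−α)KL(P∥M) with M = αQ + (1−α)P, consistent with the mixture P^(α) used earlier, and it is only a sketch, not the paper's algorithm.

import numpy as np

def kl(p, q):
    # KL divergence between two strictly positive finite distributions (nats)
    return float(np.sum(p * np.log(p / q)))

def js_alpha(p, q, alpha):
    # assumed convention: JS_alpha(P||Q) = alpha*KL(Q||M) + (1-alpha)*KL(P||M), M = alpha*Q + (1-alpha)*P
    m = alpha * q + (1 - alpha) * p
    return alpha * kl(q, m) + (1 - alpha) * kl(p, m)

emp_risk = np.array([0.30, 0.10])    # hypothetical L_E(w_1, S), L_E(w_2, S) for a fixed dataset S
prior = np.array([0.5, 0.5])         # prior Q_W
alpha, beta = 0.5, 10.0

grid = np.linspace(1e-3, 1 - 1e-3, 999)
objective = [p1 * emp_risk[0] + (1 - p1) * emp_risk[1]
             + js_alpha(np.array([p1, 1 - p1]), prior, alpha) / beta for p1 in grid]
p_star = grid[int(np.argmin(objective))]
print("posterior mass on w_1:", round(p_star, 3))   # mass shifts toward the lower-risk hypothesis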

In the following, we provide an upper bound on the excess risk of PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}} viewed as a learning algorithm.

Theorem 5 (Upper bound on excess risk under PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}})

Assume that the loss function is bounded, i.e., |(w,z)|b|\ell(w,z)|\leq b for all (w,z)𝒲×𝒵(w,z)\in\mathcal{W}\times\mathcal{Z}, and L~\tilde{L}-Lipschitz in ww. Then, the following upper bound holds on the excess risk under PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},

r(PW|S,β,JSα,μ)2b2nα(1α)i=1nIJSα(W;Zi)+L~dβ+JSα(𝒩(w,β1Id)Q)β,\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}\\ \quad&+\frac{\tilde{L}\sqrt{d}}{\beta}+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta},\end{split}

where w=argminw𝒲Lμ(w)w^{\star}=\arg\min_{w\in\mathcal{W}}L_{\mu}(w) and IdI_{d} is the identity matrix of size dd.

Corollary 2 (Convergence rate of excess risk under PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}})

Under the same assumptions as in Theorem 5, assuming that the hypothesis space is finite and β\beta is of order n\sqrt{n}, the following upper bound holds on the excess risk of PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}}, with convergence rate 𝒪(n1/2)\mathcal{O}(n^{-1/2}):

r(PW|S,β,JSα,μ)2b2nα(1α)i=1nIJSα(W;Zi)+L~dn+h(α)n,\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}\\ &\quad+\frac{\tilde{L}\sqrt{d}}{\sqrt{n}}+\frac{h(\alpha)}{\sqrt{n}},\end{split}
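As a rough numerical illustration of this 𝒪(n^{−1/2}) behaviour, the following Python sketch evaluates the three terms of the bound for increasing n; all constants are placeholders, and the information term is upper bounded using Σ_i I_JS^α(W;Z_i) ≤ (1−α)log k for a finite hypothesis space of size k, which follows from the inequalities used in Appendix A.

import numpy as np

b, L_tilde, d, alpha, log_k = 1.0, 2.0, 10, 0.5, np.log(100)     # placeholder constants
h_alpha = -alpha * np.log(alpha) - (1 - alpha) * np.log(1 - alpha)
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    info_term = np.sqrt(2 * b ** 2 * log_k / (n * alpha))        # uses sum_i I_JS^alpha <= (1-alpha) log k
    lip_term = L_tilde * np.sqrt(d) / np.sqrt(n)
    reg_term = h_alpha / np.sqrt(n)
    print(n, round(info_term, 4), round(lip_term, 4), round(reg_term, 4))   # each term shrinks like n**-0.5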
Remark 9 (Comparison to the Gibbs algorithm)

The convergence rate of our upper bound on the excess risk under PW|S,β,JSαP_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}}, namely 𝒪(n1/2)\mathcal{O}(n^{-1/2}), is faster than the 𝒪(n1/4)\mathcal{O}(n^{-1/4}) convergence rate of the upper bound on the excess risk under the Gibbs algorithm, i.e., the solution of KL-regularized empirical risk minimization [7, Corollary 3], [47].

IV-B α\alpha-Rényi-Regularized ERM

Similarly, it is interesting to consider the regularized ERM with the α\alpha-Rényi information between the dataset SS and the hypothesis WW, for 0<α<10<\alpha<1,

minPW|S𝔼[LE(W,S)]+1βIRα(W;S),\displaystyle\min_{P_{W|S}}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}I_{\mathrm{R}}^{\alpha}(W;S), (66)

where β>0\beta>0 is a parameter that balances fitting and generalization.

Since the optimization problem in (66) depends on the data-generating distribution μ\mu, we relax the problem in (66) by replacing the α\alpha-Rényi information, i.e., IRα(W;S)I_{\mathrm{R}}^{\alpha}(W;S), with Rα(PW|SQW|PS)\mathrm{R}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), as follows,

minPW|S𝔼[LE(W,S)]+1βRα(PW|SQW|PS),\displaystyle\min_{P_{W|S}}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}\mathrm{R}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), (67)

where QW𝒫(𝒲)Q_{W}\in\mathcal{P}(\mathcal{W}) is a prior distribution over the hypothesis space.

Lemma 5 (Solution existence of α\alpha-Rényi-regularized ERM)

The optimization problem in (67) is a convex optimization problem and has a solution.

Proof:

The first term in the objective, 𝔼[LE(W,S)]\mathbb{E}[L_{E}(W,S)], is linear in PW|SP_{W|S}, and the second term, 1βRα(PW|SQW|PS)\frac{1}{\beta}\mathrm{R}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}), is convex in PW|SP_{W|S} for 0<α<10<\alpha<1 due to [35, Theorem 11]. Therefore, the problem is convex and a solution exists. ∎

Let us define

PW|S,β,Rα:=argminPW|S𝒫(𝒲)𝔼[LE(W,S)]+1βRα(PW|SQW|PS),P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}}:=\arg\min_{P_{W|S}\in\mathcal{P}(\mathcal{W})}\mathbb{E}[L_{E}(W,S)]+\frac{1}{\beta}\mathrm{R}_{\alpha}(P_{W|S}\|Q_{W}|P_{S}),

as the solution of the convex optimization problem (67).

Theorem 6 (Upper bound on excess risk under PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}})

Assume that the loss function is bounded, i.e., |(w,z)|b|\ell(w,z)|\leq b for all (w,z)𝒲×𝒵(w,z)\in\mathcal{W}\times\mathcal{Z}, and L~\tilde{L}-Lipschitz in ww. Then, the following upper bound holds on the excess risk under PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},

r(PW|S,β,Rα,μ)2b2nαi=1nIRα(W;Zi)+L~dβ+Rα(𝒩(w,β1Id)Q)β,\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}\\ &\quad+\frac{\tilde{L}\sqrt{d}}{\beta}+\frac{\mathrm{R}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta},\end{split}

where w=argminw𝒲Lμ(w)w^{\star}=\arg\min_{w\in\mathcal{W}}L_{\mu}(w) and IdI_{d} is the identity matrix of size dd.
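To make the bound of Theorem 6 concrete, the following Python sketch evaluates its three terms for entirely hypothetical values of b, L̃, d, n, α, ∥w⋆∥² and of the individual-sample Rényi informations, using the closed-form Rényi divergence between 𝒩(w⋆,β^{−1}I_d) and a standard normal prior given in Appendix D (from [49]); it is an illustration only, not a statement about any particular learning problem.

import numpy as np

# all values below are placeholders chosen only to evaluate the terms of the bound
b, L_tilde, d, n, alpha = 1.0, 2.0, 10, 10_000, 0.5
beta = np.sqrt(n)
w_star_norm2 = 4.0                      # ||w*||_2^2
renyi_info = np.full(n, 0.05)           # placeholder values for I_R^alpha(W;Z_i)

# closed-form Renyi divergence between N(w*, beta^{-1} I_d) and the prior N(0, I_d) (see Appendix D, [49])
renyi_prior = (alpha / 2) * w_star_norm2 / (alpha + (1 - alpha) / beta) \
    + d / (2 * (alpha - 1)) * np.log(beta ** (alpha - 1) / (alpha + (1 - alpha) / beta))

bound = (np.sqrt(2 * b ** 2 / (n * alpha) * renyi_info.sum())
         + L_tilde * np.sqrt(d) / beta + renyi_prior / beta)
print(round(float(bound), 4))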

Corollary 3 (Convergence rate of excess risk under PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}})

Under the same assumptions as in Theorem 6, assuming that the hypothesis space is finite and β\beta is of order n\sqrt{n}, the following upper bound holds on the excess risk of PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}}, with convergence rate 𝒪(log(n)/n)\mathcal{O}(\log(n)/\sqrt{n}):

r(PW|S,β,Rα,μ)2b2nαi=1nIRα(W;Zi)+L~dn+12nw22+d4nlog(n)+d2n(1α)log(α).\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}\\ &\quad+\frac{\tilde{L}\sqrt{d}}{\sqrt{n}}+\frac{1}{2\sqrt{n}}\|w^{\star}\|_{2}^{2}+\frac{d}{4\sqrt{n}}\log\big{(}n\big{)}+\frac{d}{2\sqrt{n}(1-\alpha)}\log\big{(}\alpha\big{)}.\end{split}

V Expected Generalization Error Upper Bounds Under Distribution Mismatch

In this section, we extend our results from Section III to the distribution mismatch setting, where the training data distribution differs from the test data distribution. All proof details are deferred to Appendix E.

Proposition 9

Assume that the loss function is σ(α)\sigma_{(\alpha)}-sub-Gaussian under the distributions PW,Zi(α)P_{W,Z_{i}}^{(\alpha)} for all i=1,,ni=1,\cdots,n, and under αμ+(1α)μ\alpha\mu+(1-\alpha)\mu^{\prime} for all w𝒲w\in\mathcal{W}. Then, under distribution mismatch (6), it holds for all α(0,1)\alpha\in(0,1) that:

|gen¯(PW|S,μ,μ)|2σ(α)2JSα(μμ)α(1α)\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime})|\leq\sqrt{2\sigma_{(\alpha)}^{2}\frac{\mathrm{JS}_{\alpha}(\mu^{\prime}\|\mu)}{\alpha(1-\alpha)}} (68)
+1ni=1n2σ(α)2IJSα(W;Zi)α(1α),α(0,1).\displaystyle\quad+\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}{\alpha(1-\alpha)}},\quad\forall\alpha\in(0,1).
Proposition 10

Assume that the loss function is σ\sigma-sub-Gaussian under the distributions μ\mu and μ\mu^{\prime} for all w𝒲w\in\mathcal{W}, and also γ\gamma-sub-Gaussian under PW,ZiP_{W,Z_{i}} for all i=1,,ni=1,\cdots,n. Then, the following upper bound holds for all α(0,1)\alpha\in(0,1):

|gen¯(PW|S,μ,μ)|2(ασ2+(1α)γ2)Rα(μμ)α\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime})|\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\mathrm{R}_{\alpha}(\mu^{\prime}\|\mu)}{\alpha}} (69)
+1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α.\displaystyle\quad+\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}}.

The mismatch between the test and training sample distributions is characterized in [21, Theorem 5] by the KL divergence between them, i.e., KL(μμ)\mathrm{KL}(\mu^{\prime}\|\mu). However, assuming that the loss function is σ(α)\sigma_{(\alpha)}-sub-Gaussian under αμ+(1α)μ\alpha\mu+(1-\alpha)\mu^{\prime} for all w𝒲w\in\mathcal{W}, Proposition 9 allows us to express the distribution mismatch in terms of the α\alpha-Jensen-Shannon divergence, which is always finite.

In Proposition 10, the distribution mismatch is presented in terms of the α\alpha-Rényi divergence, i.e., Rα(μμ)\mathrm{R}_{\alpha}(\mu^{\prime}\|\mu). If μ\mu^{\prime} is not absolutely continuous with respect to μ\mu, then KL(μμ)=\mathrm{KL}(\mu^{\prime}\|\mu)=\infty. However, for the α\alpha-Rényi divergence (0<α<10<\alpha<1) to be finite, it suffices that mutual singularity [35], i.e., μμ\mu^{\prime}\perp\!\!\!\perp\mu, does not hold, which is a less restrictive condition on μ\mu^{\prime} than absolute continuity.
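A minimal Python sketch of this distinction, using two illustrative uniform distributions that are not taken from the paper: with μ = Uniform[0,1] as the training distribution and μ′ = Uniform[0,2] as the test distribution, μ′ is not absolutely continuous with respect to μ, so KL(μ′∥μ) = ∞, while R_α(μ′∥μ) stays finite (it equals αlog(2)/(1−α)).

import numpy as np

alpha = 0.5
x = np.linspace(0.0, 2.0, 200001)
p_test = np.full_like(x, 0.5)                 # density of mu' = Uniform[0, 2]
p_train = np.where(x <= 1.0, 1.0, 0.0)        # density of mu  = Uniform[0, 1]

# R_alpha(mu'||mu) = (alpha - 1)^{-1} * log( integral of p_test^alpha * p_train^(1 - alpha) )
renyi = np.log(np.trapz(p_test ** alpha * p_train ** (1 - alpha), x)) / (alpha - 1)
print(round(float(renyi), 4))                 # approx alpha*log(2)/(1 - alpha) = log(2) for alpha = 0.5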

Similar to Remark 6, the sub-Gaussianity assumptions in Propositions 9 and 10 hold for bounded loss functions.

VI Numerical Example

In this section, we illustrate that some of our proposed bounds can be tighter than existing ones in a simple toy example. We consider the α\alpha-JS\mathrm{JS} and α\alpha-Rényi information only. Our example setting involves the estimation of the mean of a Gaussian random variable Z𝒩(β,σ2)Z\sim\mathcal{N}(\beta,\sigma^{2}) based on two i.i.d. samples Z1Z_{1} and Z2Z_{2}. We consider the hypothesis (estimate) given by W=tZ1+(1t)Z2W=tZ_{1}+(1-t)Z_{2} for 0<t<10<t<1. We also consider the loss function given by (w,z)=min((wz)2,c2)\ell(w,z)=\min((w-z)^{2},c^{2}).

Since the loss function is bounded within the interval [0,c2][0,c^{2}], it is c22\frac{c^{2}}{2}-sub-Gaussian under all distributions. Therefore, we can apply the expected generalization error upper bounds based on mutual information, α\alpha-JS\mathrm{JS} information, and α\alpha-Rényi information, for all α(0,1)\alpha\in(0,1), as follows:

gen¯(PW|Z1,Z2,PZ)c24(2I(W;Z1)+2I(W;Z2)),\displaystyle\overline{\operatorname*{gen}}(P_{W|Z_{1},Z_{2}},P_{Z})\leq\frac{c^{2}}{4}\big{(}\sqrt{2I(W;Z_{1})}+\sqrt{2I(W;Z_{2})}\big{)}, (70)
gen¯(PW|Z1,Z2,PZ)c24(2IJSα(W;Z1)α(1α)+2IJSα(W;Z2)α(1α)),\displaystyle\overline{\operatorname*{gen}}(P_{W|Z_{1},Z_{2}},P_{Z})\leq\frac{c^{2}}{4}\Big{(}\sqrt{2\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{1})}{\alpha(1-\alpha)}}+\sqrt{2\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{2})}{\alpha(1-\alpha)}}\Big{)}, (71)
gen¯(PW|Z1,Z2,PZ)c24(2IRα(W;Z1)α+2IRα(W;Z2)α).\displaystyle\overline{\operatorname*{gen}}(P_{W|Z_{1},Z_{2}},P_{Z})\leq\frac{c^{2}}{4}\Big{(}\sqrt{2\frac{I_{\textit{R}}^{\alpha}(W;Z_{1})}{\alpha}}+\sqrt{2\frac{I_{\textit{R}}^{\alpha}(W;Z_{2})}{\alpha}}\Big{)}. (72)

It can be immediately shown that W𝒩(β,σ2(t2+(1t)2))W\sim\mathcal{N}(\beta,\sigma^{2}(t^{2}+(1-t)^{2})), and that (W,Z1)(W,Z_{1}) and (W,Z2)(W,Z_{2}) are jointly Gaussian with correlation coefficients ρ1=tt2+(1t)2\rho_{1}=\frac{t}{\sqrt{t^{2}+(1-t)^{2}}} and ρ2=(1t)t2+(1t)2\rho_{2}=\frac{(1-t)}{\sqrt{t^{2}+(1-t)^{2}}}. Therefore, the mutual information appearing above is given by I(W;Z1)=12log(1ρ12)I(W;Z_{1})=-\frac{1}{2}\log(1-\rho_{1}^{2}) and I(W;Z2)=12log(1ρ22)I(W;Z_{2})=-\frac{1}{2}\log(1-\rho_{2}^{2}). In turn, the α\alpha-JS\mathrm{JS} information appearing above can be computed via an extension of the entropy-based formulation of the Jensen-Shannon measure, as follows [34]:

IJSα(W;Zi)=\displaystyle I_{\mathrm{JS}}^{\alpha}(W;Z_{i})= (73)
h(PW,Zi(α))(αh(PW)+αh(PZi)+(1α)h(PW,Zi)),\displaystyle\quad h\left(P_{W,Z_{i}}^{(\alpha)}\right)-(\alpha h(P_{W})+\alpha h(P_{Z_{i}})+(1-\alpha)h(P_{W,Z_{i}})),

– with h()h(\cdot) denoting the differential entropy – where

h(PZi)=12log(2πeσ2),\displaystyle h(P_{Z_{i}})=\frac{1}{2}\log(2\pi e\sigma^{2}),
h(PW)=12log(2πeσ2(t2+(1t)2)),\displaystyle h(P_{W})=\frac{1}{2}\log(2\pi e\sigma^{2}(t^{2}+(1-t)^{2})),
h(PW,Zi)=log(2πeσ2(t2+(1t)2)(1ρi2)),\displaystyle h(P_{W,Z_{i}})=\log\left(2\pi e\sigma^{2}\sqrt{(t^{2}+(1-t)^{2})(1-\rho_{i}^{2})}\right),

whereas h(PW,Zi(α))h\left(P_{W,Z_{i}}^{(\alpha)}\right) can be computed numerically.
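For reproducibility, the following Python sketch illustrates how one such point can be computed: it evaluates the mutual information bound (70) and the α-JS information bound (71) for a single illustrative choice of t, σ², and α, computing all differential entropies, including h(P^(α)), by numerical integration over a finite grid (so the result is approximate); the specific values of t, σ², and α below are ours and are not meant to reproduce Fig. 1 exactly.

import numpy as np
from scipy.stats import multivariate_normal

t, sigma2, m, alpha = 0.3, 1.0, 1.0, 0.5          # illustrative values only
c = np.sqrt(sigma2) / 4.0
var_w = sigma2 * (t ** 2 + (1 - t) ** 2)
rho = np.array([t, 1 - t]) / np.sqrt(t ** 2 + (1 - t) ** 2)

mi = -0.5 * np.log(1 - rho ** 2)                   # I(W;Z_1), I(W;Z_2)
mi_bound = c ** 2 / 4 * np.sum(np.sqrt(2 * mi))    # bound (70)

lim = 6.0 * np.sqrt(max(sigma2, var_w))            # integration window: mean +/- 6 std
grid = np.linspace(m - lim, m + lim, 600)
w, z = np.meshgrid(grid, grid, indexing="ij")
pts = np.dstack([w, z])
dA = (grid[1] - grid[0]) ** 2

def entropy(p):
    # differential entropy approximated on the grid
    return -np.sum(p[p > 0] * np.log(p[p > 0])) * dA

js_terms = []
for r in rho:
    cov_wz = r * np.sqrt(var_w * sigma2)
    p_joint = multivariate_normal([m, m], [[var_w, cov_wz], [cov_wz, sigma2]]).pdf(pts)
    p_prod = multivariate_normal([m, m], [[var_w, 0.0], [0.0, sigma2]]).pdf(pts)
    mix = alpha * p_prod + (1 - alpha) * p_joint   # density of P^(alpha)
    i_js = entropy(mix) - alpha * entropy(p_prod) - (1 - alpha) * entropy(p_joint)
    js_terms.append(np.sqrt(2 * i_js / (alpha * (1 - alpha))))
js_bound = c ** 2 / 4 * np.sum(js_terms)           # bound (71)

print("MI bound:", round(float(mi_bound), 4), " alpha-JS bound:", round(float(js_bound), 4))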

Fig. 1 depicts the true generalization error, the mutual information based bound in (70), and the α\alpha-JS\mathrm{JS} information based bound in (71) for α=0.25,0.5,0.75\alpha=0.25,0.5,0.75, over values of t(0,0.5]t\in(0,0.5], considering σ2=1,10\sigma^{2}=1,10, β=1\beta=1, and c=σ4c=\frac{\sigma}{4}.

It can be seen that for α=0.75\alpha=0.75 the α\alpha-JS\mathrm{JS} information bound is tighter than the mutual information bound. For α=0.5\alpha=0.5, which corresponds to the traditional Jensen-Shannon information, the Jensen-Shannon information bound is tighter than the mutual information bound for t<0.25t<0.25; in contrast, for t>0.25t>0.25, the mutual information bound is slightly better than the Jensen-Shannon information bound. This shows that our proposed bounds can be tighter than existing ones in some regimes.

Figure 1: True generalization error, α\alpha-JS\mathrm{JS} based bound for α=0.25,0.5,0.75\alpha=0.25,0.5,0.75, and Mutual Information based bound.

Fig. 2 also depicts the true generalization error, the mutual information based bound in (70), and the α\alpha-Rényi information based bound in (72) for α=0.25,0.5,0.75\alpha=0.25,0.5,0.75. It can be seen that the α\alpha-Rényi based bound is looser than the mutual information based bound. In our experimental setup, when t0t\to 0 (or t1t\to 1), we have I(W;Z2)I(W;Z_{2})\to\infty (or I(W;Z1)I(W;Z_{1})\to\infty), whereas the α\alpha-Rényi based bound remains finite.

Figure 2: True generalization error, α\alpha-Rényi based bound for α=0.25,0.5,0.75\alpha=0.25,0.5,0.75, and Mutual Information based bound.

VII Conclusion and Future Works

We have presented the Auxiliary Distribution Method, a novel approach for deriving information-theoretic upper bounds on the generalization error within the context of supervised learning problems. Our method offers the flexibility to recover existing bounds while also enabling the derivation of new bounds grounded in the α\alpha-JS\mathrm{JS} and α\alpha-Rényi information measures. Notably, our upper bounds, which are rooted in the α\alpha-JS\mathrm{JS} information measure, are finite, in contrast to mutual information-based bounds. Moreover, our upper bound based on α\alpha-Rényi information, for α(0,1)\alpha\in(0,1), remains finite when considering a deterministic learning process. An intriguing observation is that our newly introduced α\alpha-JS\mathrm{JS} information measure can, in certain regimes, yield tighter bounds compared to existing approaches. We also establish the existence of solutions to the α\alpha-JS\mathrm{JS}-regularized and α\alpha-Rényi-regularized empirical risk minimization problems and provide upper bounds on the excess risk of the resulting algorithms, where the upper bound on the excess risk under α\alpha-JS\mathrm{JS}-regularized empirical risk minimization is tighter than other well-known upper bounds on the excess risk. Furthermore, we provide an upper bound on the generalization error in a mismatch scenario, where the distributions of the test and training datasets differ, via our auxiliary distribution method.

As a direction for future research, we propose extending our bounds to the PAC-Bayesian framework, leveraging the α\alpha-JS\mathrm{JS} and α\alpha-Rényi divergences for 0<α<10<\alpha<1. Additionally, the conditional technique based on individual sample measures, as described in [18], could be applied to improve the effectiveness of our upper bounds.

Acknowledgment

Gholamali Aminian is supported in part by the Royal Society Newton International Fellowship, grant no. NIF\R1 \192656, the UKRI Prosperity Partnership Scheme (FAIR) under the EPSRC Grant EP/V056883/1, and the Alan Turing Institute. Saeed Masiha worked on this project at Sharif University of Technology before joining EPFL.

References

  • [1] G. Aminian, L. Toni, and M. R. D. Rodrigues, “Jensen-Shannon information based characterization of the generalization error of learning algorithms,” in 2020 IEEE Information Theory Workshop (ITW), pp. 1–5, 2021.
  • [2] V. N. Vapnik, “An overview of statistical learning theory,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999.
  • [3] O. Bousquet and A. Elisseeff, “Stability and generalization,” Journal of Machine Learning Research, vol. 2, no. Mar, pp. 499–526, 2002.
  • [4] H. Xu and S. Mannor, “Robustness and generalization,” Machine Learning, vol. 86, no. 3, pp. 391–423, 2012.
  • [5] D. A. McAllester, “PAC-Bayesian stochastic model selection,” Machine Learning, vol. 51, no. 1, pp. 5–21, 2003.
  • [6] D. Russo and J. Zou, “How much does your data exploration overfit? controlling bias via information usage,” IEEE Transactions on Information Theory, vol. 66, no. 1, pp. 302–323, 2019.
  • [7] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” in Advances in Neural Information Processing Systems, pp. 2524–2533, 2017.
  • [8] Y. Bu, S. Zou, and V. V. Veeravalli, “Tightening mutual information-based bounds on generalization error,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 121–130, 2020.
  • [9] A. Asadi, E. Abbe, and S. Verdú, “Chaining mutual information and tightening generalization bounds,” in Advances in Neural Information Processing Systems, pp. 7234–7243, 2018.
  • [10] A. R. Asadi and E. Abbe, “Chaining meets chain rule: Multilevel entropic regularization and training of neural networks,” Journal of Machine Learning Research, vol. 21, no. 139, pp. 1–32, 2020.
  • [11] A. R. Esposito, M. Gastpar, and I. Issa, “Generalization error bounds via Rényi-, f-divergences and maximal leakage,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 4986–5004, 2021.
  • [12] E. Modak, H. Asnani, and V. M. Prabhakaran, “Rényi divergence based bounds on generalization error,” in 2021 IEEE Information Theory Workshop (ITW), pp. 1–6, 2021.
  • [13] A. T. Lopez and V. Jog, “Generalization error bounds using Wasserstein distances,” in 2018 IEEE Information Theory Workshop (ITW), pp. 1–5, IEEE, 2018.
  • [14] H. Wang, M. Diaz, J. C. S. Santos Filho, and F. P. Calmon, “An information-theoretic view of generalization via Wasserstein distance,” in 2019 IEEE International Symposium on Information Theory (ISIT), pp. 577–581, IEEE, 2019.
  • [15] B. R. Gálvez, G. Bassi, R. Thobaben, and M. Skoglund, “Tighter expected generalization error bounds via Wasserstein distance,” in Advances in Neural Information Processing Systems, 2021.
  • [16] G. Aminian, Y. Bu, G. Wornell, and M. Rodrigues, “Tighter expected generalization error bounds via convexity of information measures,” in IEEE International Symposium on Information Theory (ISIT), 2022.
  • [17] T. Steinke and L. Zakynthinou, “Reasoning about generalization via conditional mutual information,” in Conference on Learning Theory, pp. 3437–3452, PMLR, 2020.
  • [18] R. Zhou, C. Tian, and T. Liu, “Individually conditional individual mutual information bound on generalization error,” IEEE Transactions on Information Theory, pp. 1–1, 2022.
  • [19] H. Hafez-Kolahi, Z. Golgooni, S. Kasaei, and M. Soleymani, “Conditioning and processing: Techniques to improve information-theoretic generalization bounds,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [20] G. Aminian, Y. Bu, L. Toni, M. Rodrigues, and G. Wornell, “An exact characterization of the generalization error for the Gibbs algorithm,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [21] M. S. Masiha, A. Gohari, M. H. Yassaee, and M. R. Aref, “Learning under distribution mismatch and model misspecification,” in IEEE International Symposium on Information Theory (ISIT), 2021.
  • [22] Y. Mansour, M. Mohri, and A. Rostamizadeh, “Domain adaptation: Learning bounds and algorithms,” arXiv preprint arXiv:0902.3430, 2009.
  • [23] Z. Wang, “Theoretical guarantees of transfer learning,” 2018.
  • [24] X. Wu, J. H. Manton, U. Aickelin, and J. Zhu, “Information-theoretic analysis for transfer learning,” in 2020 IEEE International Symposium on Information Theory (ISIT), pp. 2819–2824, IEEE, 2020.
  • [25] E. Englesson and H. Azizpour, “Generalized Jensen-Shannon divergence loss for learning with noisy labels,” Advances in Neural Information Processing Systems, vol. 34, pp. 30284–30297, 2021.
  • [26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
  • [27] P. Melville, S. M. Yang, M. Saar-Tsechansky, and R. Mooney, “Active learning for probability estimation using Jensen-Shannon divergence,” in European Conference on Machine Learning, pp. 268–279, Springer, 2005.
  • [28] E. Choi and C. Lee, “Feature extraction based on the Bhattacharyya distance,” Pattern Recognition, vol. 36, no. 8, pp. 1703–1709, 2003.
  • [29] G. B. Coleman and H. C. Andrews, “Image segmentation by clustering,” Proceedings of the IEEE, vol. 67, no. 5, pp. 773–785, 1979.
  • [30] F. Topsøe, “Information theory at the service of science,” in Entropy, Search, Complexity, pp. 179–207, Springer, 2007.
  • [31] F. Topsoe, “Some inequalities for information divergence and related measures of discrimination,” IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1602–1609, 2000.
  • [32] T. M. Cover, Elements of information theory. John Wiley & Sons, 1999.
  • [33] F. Nielsen, “On a generalization of the Jensen-Shannon divergence and the Jensen-Shannon centroid,” Entropy, vol. 22, no. 2, p. 221, 2020.
  • [34] J. Lin, “Divergence measures based on the Shannon entropy,” IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
  • [35] T. Van Erven and P. Harremos, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
  • [36] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52–60, 1967.
  • [37] D. P. Palomar and S. Verdú, “Lautum information,” IEEE Transactions on Information Theory, vol. 54, no. 3, pp. 964–975, 2008.
  • [38] I. Sason and S. Verdú, “ff-divergence inequalities,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.
  • [39] S. Verdú, “α\alpha-mutual information,” in 2015 Information Theory and Applications Workshop (ITA), pp. 1–6, IEEE, 2015.
  • [40] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
  • [41] P. Dupuis and R. S. Ellis, A weak convergence approach to the theory of large deviations, vol. 902. John Wiley & Sons, 2011.
  • [42] M. Gastpar, A. R. Esposito, and I. Issa, “Information measures, learning and generalization,” 5th London Symposium on Information Theory, 2019.
  • [43] F. Topsoe, “Inequalities for the Jensen-Shannon divergence,” draft available at http://www.math.ku.dk/topsoe, 2002.
  • [44] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of stability and bias of learning algorithms,” in 2016 IEEE Information Theory Workshop (ITW), pp. 26–30, IEEE, 2016.
  • [45] V. Anantharam, “A variational characterization of rényi divergences,” IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 6979–6989, 2018.
  • [46] I. Sason, “On f-divergences: Integral representations, local behavior, and inequalities,” Entropy, vol. 20, no. 5, p. 383, 2018.
  • [47] I. Kuzborskij, N. Cesa-Bianchi, and C. Szepesvári, “Distribution-dependent analysis of Gibbs-ERM principle,” in Conference on Learning Theory, pp. 2028–2054, PMLR, 2019.
  • [48] F. Topsøe, “Jenson-Shannon divergence and norm-based measures of discrimination and variation,” preprint, 2003.
  • [49] M. Gil, F. Alajaji, and T. Linder, “Rényi divergence measures for commonly used univariate continuous distributions,” Information Sciences, vol. 249, pp. 124–131, 2013.

Appendix A Proofs of Section III-A

Proof:
αKL(PWPZiP^W,Zi)+(1α)KL(PW,ZiP^W,Zi)\displaystyle\alpha\mathrm{KL}(P_{W}\otimes P_{Z_{i}}\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}}) (74)
=𝒲×𝒵α(dPWdPZi)log(dPWdPZi)\displaystyle=\int_{\mathcal{W}\times\mathcal{Z}}\alpha(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})\log(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}) (75)
+𝒲×𝒵(1α)dPW,Zilog(dPW,Zi)\displaystyle\quad+\int_{\mathcal{W}\times\mathcal{Z}}(1-\alpha)\mathrm{d}P_{W,Z_{i}}\log(\mathrm{d}P_{W,Z_{i}})
𝒲×𝒵((α(dPWdPZi)+(1α)dPW,Zi)log(dP^W,Zi))\displaystyle\quad-\int_{\mathcal{W}\times\mathcal{Z}}((\alpha(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})+(1-\alpha)\mathrm{d}P_{W,Z_{i}})\log(d\widehat{P}_{W,Z_{i}}))
=𝒲×𝒵α(dPWdPZi)log(dPWdPZi)\displaystyle=\int_{\mathcal{W}\times\mathcal{Z}}\alpha(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})\log(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}) (76)
+𝒲×𝒵(1α)dPW,Zilog(dPW,Zi)dPW,Zi(α)log(dP^W,Zi)\displaystyle\quad+\int_{\mathcal{W}\times\mathcal{Z}}(1-\alpha)\mathrm{d}P_{W,Z_{i}}\log(\mathrm{d}P_{W,Z_{i}})-\mathrm{d}P_{W,Z_{i}}^{(\alpha)}\log(d\widehat{P}_{W,Z_{i}})
+𝒲×𝒵dPW,Zi(α)log(dPW,Zi(α))dPW,Zi(α)log(dPW,Zi(α))\displaystyle\quad+\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W,Z_{i}}^{(\alpha)}\log(\mathrm{d}P_{W,Z_{i}}^{(\alpha)})-\mathrm{d}P_{W,Z_{i}}^{(\alpha)}\log(\mathrm{d}P_{W,Z_{i}}^{(\alpha)})
=IJSα(W;Zi)+KL(PW,Zi(α)P^W,Zi).\displaystyle=I_{\mathrm{JS}}^{\alpha}(W;Z_{i})+\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|\widehat{P}_{W,Z_{i}}). (77)

Proof:

As shown in [48], and by considering Lemma 2, we have

minP^W,ZiαKL(PWμP^W,Zi)+(1α)KL(PW,ZiP^W,Zi)=\displaystyle\min_{\widehat{P}_{W,Z_{i}}}\alpha\mathrm{KL}(P_{W}\otimes\mu\|\widehat{P}_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|\widehat{P}_{W,Z_{i}})= (78)
minP^W,ZiIJSα(W;Zi)+KL(PW,Zi(α)P^W,Zi).\displaystyle\min_{\widehat{P}_{W,Z_{i}}}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})+\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|\widehat{P}_{W,Z_{i}}).

Since 0KL(PW,Zi(α)P^W,Zi)0\leq\mathrm{KL}(P_{W,Z_{i}}^{(\alpha)}\|\widehat{P}_{W,Z_{i}}), the minimum in (36) is achieved at P^W,Zi=PW,Zi(α)\widehat{P}_{W,Z_{i}}=P_{W,Z_{i}}^{(\alpha)}. Considering P^W,Zi=PW,Zi(α)\widehat{P}_{W,Z_{i}}=P_{W,Z_{i}}^{(\alpha)} in Proposition 1 then completes the proof. ∎

Proof:

Using (46),

IJSα(W;Zi)(1α)I(W;Zi),\displaystyle I_{\mathrm{JS}}^{\alpha}(W;Z_{i})\leq(1-\alpha)I(W;Z_{i}), (79)

we have:

|gen¯(PW|S,μ)|\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)| 1ni=1n2σ(α)2IJSα(W;Zi)α(1α)\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}{\alpha(1-\alpha)}} (80)
1ni=1n2σ(α)2I(W;Zi)α\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{(\alpha)}^{2}\frac{I(W;Z_{i})}{\alpha}} (81)
2σ(α)2i=1nI(W;Zi)αn\displaystyle\leq\sqrt{2\sigma_{(\alpha)}^{2}\frac{\sum_{i=1}^{n}I(W;Z_{i})}{\alpha n}} (82)
2σ(α)2I(W;S)αn\displaystyle\leq\sqrt{2\sigma_{(\alpha)}^{2}\frac{I(W;S)}{\alpha n}} (83)
2σ(α)2H(W)αn,\displaystyle\leq\sqrt{2\sigma_{(\alpha)}^{2}\frac{H(W)}{\alpha n}}, (84)

where the final result follows by upper bounding H(W)H(W) by the logarithm of the cardinality of the finite hypothesis space. ∎

Proof:

This proposition follows from the fact that IJSα(W;Zi)h(α)I_{\mathrm{JS}}^{\alpha}(W;Z_{i})\leq h(\alpha) for i=1,,ni=1,\cdots,n, which we now prove.

IJSα(W,Zi)=\displaystyle I_{\mathrm{JS}}^{\alpha}(W,Z_{i})= (85)
αKL(PWPZiPW,Zi(α))+(1α)KL(PW,ZiPW,Zi(α))\displaystyle\quad\alpha\mathrm{KL}(P_{W}\otimes P_{Z_{i}}\|P_{W,Z_{i}}^{(\alpha)})+(1-\alpha)\mathrm{KL}(P_{W,Z_{i}}\|P_{W,Z_{i}}^{(\alpha)})
=α𝒲×𝒵dPWdPZilog(dPWdPZidPW,Zi(α))\displaystyle=\alpha\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}\log\left(\frac{\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}}{\mathrm{d}P_{W,Z_{i}}^{(\alpha)}}\right) (86)
+(1α)𝒲×𝒵dPW,Zilog(dPW,ZidPW,Zi(α))\displaystyle\quad+(1-\alpha)\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W,Z_{i}}\log\left(\frac{\mathrm{d}P_{W,Z_{i}}}{\mathrm{d}P_{W,Z_{i}}^{(\alpha)}}\right)
α𝒲×𝒵dPWdPZilog(dPWdPZiα(dPWdPZi))\displaystyle\leq\alpha\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}\log\left(\frac{\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}}}{\alpha(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})}\right) (87)
+(1α)𝒲×𝒵dPW,Zilog(dPW,Zi(1α)dPW,Zi)\displaystyle\quad+(1-\alpha)\int_{\mathcal{W}\times\mathcal{Z}}\mathrm{d}P_{W,Z_{i}}\log\left(\frac{\mathrm{d}P_{W,Z_{i}}}{(1-\alpha)\mathrm{d}P_{W,Z_{i}}}\right)
=αlog(α)(1α)log(1α)\displaystyle=-\alpha\log(\alpha)-(1-\alpha)\log(1-\alpha) (88)
=h(α).\displaystyle=h(\alpha). (89)
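The inequality I_JS^α(W;Z_i) ≤ h(α) proved above can also be checked numerically; the following small Python sketch does so for a randomly generated discrete joint distribution (an arbitrary example, unrelated to any construction in the paper).

import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 5))
joint /= joint.sum()                                       # a random joint P_{W, Z_i} with positive entries
prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))      # product of marginals P_W x P_{Z_i}

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

for alpha in (0.1, 0.5, 0.9):
    mix = alpha * prod + (1 - alpha) * joint               # the mixture P^(alpha)
    i_js = alpha * kl(prod, mix) + (1 - alpha) * kl(joint, mix)
    h = -alpha * np.log(alpha) - (1 - alpha) * np.log(1 - alpha)
    assert i_js <= h + 1e-12
    print(round(i_js, 4), "<=", round(h, 4))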

Proof:

We first compute the derivative of h(α)α(1α)\frac{h(\alpha)}{\alpha(1-\alpha)} with respect to α(0,1)\alpha\in(0,1)

dh(α)α(1α)dα=log(1α)α2log(α)(1α)2.\displaystyle\frac{d\frac{h(\alpha)}{\alpha(1-\alpha)}}{d\alpha}=\frac{\log(1-\alpha)}{\alpha^{2}}-\frac{\log(\alpha)}{(1-\alpha)^{2}}. (90)

Now for α=12\alpha=\frac{1}{2}, we have dh(α)α(1α)dα=0\frac{d\frac{h(\alpha)}{\alpha(1-\alpha)}}{d\alpha}=0; since this derivative is negative for α<12\alpha<\frac{1}{2} and positive for α>12\alpha>\frac{1}{2}, the ratio h(α)α(1α)\frac{h(\alpha)}{\alpha(1-\alpha)} attains its minimum at α=12\alpha=\frac{1}{2}. ∎
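A one-line numerical check of h(α)/(α(1−α)) over a grid (illustrative only) confirms that the minimum is attained at α = 1/2 with value 4log(2).

import numpy as np

a = np.linspace(1e-4, 1 - 1e-4, 9999)
f = (-a * np.log(a) - (1 - a) * np.log(1 - a)) / (a * (1 - a))
print(round(float(a[int(np.argmin(f))]), 3), round(float(f.min()), 4))   # 0.5 and 4*log(2) ~ 2.7726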

Appendix B Proofs of Section III-B

Proof:

Consider arbitrary auxiliary distributions {P^W,Zi}i=1n\{\widehat{P}_{W,Z_{i}}\}_{i=1}^{n} defined on 𝒲×𝒵\mathcal{W}\times\mathcal{Z}. Then,

gen¯(PW|S,μ)=𝔼PWPS[LE(W,S)]𝔼PW,S[LE(W,S)]\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)=\mathbb{E}_{P_{W}P_{S}}[L_{E}(W,S)]-\mathbb{E}_{P_{W,S}}[L_{E}(W,S)]
=1ni=1n𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})] (91)
1ni=1n|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right| (92)
1ni=1n|𝔼P^W,Zi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\bigl{|}\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\bigr{|} (93)
+|𝔼P^W,Zi[(W,Zi)]𝔼PWPZi[(W,Zi)]|.\displaystyle\quad\quad+\bigl{|}\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]\bigr{|}.

Using the assumption that the loss function (w,zi)\ell(w,z_{i}) is γ\gamma-sub-Gaussian under the distribution PW,ZiP_{W,Z_{i}}, together with the Donsker-Varadhan representation, we have:

λ(𝔼P^W,Zi[(W,Zi)]𝔼PWZi[(W,Zi)])\displaystyle\lambda\left(\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right)\leq (94)
KL(P^W,ZiPWZi)+λ2γ22,λ.\displaystyle\quad\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{WZ_{i}})+\frac{\lambda^{2}\gamma^{2}}{2},\quad\forall\lambda\in\mathbb{R}.

Using the assumption that (w,Z)\ell(w,Z) is σ\sigma-sub-Gaussian under PWPZiP_{W}\otimes P_{Z_{i}}, and again the Donsker-Varadhan representation, we have:

λ(𝔼P^W,Zi[(W,Zi)]𝔼PWPZi[(W,Zi)])\displaystyle\lambda^{\prime}\left(\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]\right)\leq (95)
KL(P^W,ZiPWPZi)+λ2σ22.λ\displaystyle\quad\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}P_{Z_{i}})+\frac{{\lambda^{\prime}}^{2}\sigma^{2}}{2}.\quad\forall\lambda^{\prime}\in\mathbb{R}

Note that 𝔼PWPZi[(W,Zi)𝔼PZi[(W,Zi)]]=0\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]=0.

Now, for λ>0\lambda>0, we choose λ=αα1λ\lambda^{\prime}=\frac{\alpha}{\alpha-1}\lambda (so that λ<0\lambda^{\prime}<0 for 0<α<10<\alpha<1). Hence, we have

𝔼P^W,Zi[(W,Zi)𝔼PW,Zi[(W,Zi)]]\displaystyle\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{W,Z_{i}}}[\ell(W,Z_{i})]]\leq (96)
KL(P^W,ZiPWZi)λ+λγ22,λ+.\displaystyle\quad\frac{\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{WZ_{i}})}{\lambda}+\frac{\lambda\gamma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Using the assumption that (w,Z)\ell(w,Z) is σ\sigma-sub-Gaussian under PWPZiP_{W}\otimes P_{Z_{i}}, and again the Donsker-Varadhan representation,

𝔼P^W,Zi[(W,Zi)𝔼PWPZi[(W,Zi)]]\displaystyle-\mathbb{E}_{\widehat{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]]\leq (97)
KL(P^W,ZiPWPZi)|λ|+|λ|σ22,λ.\displaystyle\quad\frac{\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}P_{Z_{i}})}{|\lambda^{\prime}|}+\frac{|\lambda^{\prime}|\sigma^{2}}{2},\quad\forall\lambda^{\prime}\in\mathbb{R^{-}}.

Summing the two inequalities (96) and (97), we obtain

𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\leq (98)
αKL(P^W,ZiPW,Zi)+(1α)KL(P^W,ZiPWPZi)αλ+\frac{\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}P_{Z_{i}})}{\alpha\lambda}+
λγ22+λα1ασ22,λ+.\displaystyle\quad\frac{\lambda\gamma^{2}}{2}+\frac{\lambda\frac{\alpha}{1-\alpha}\sigma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Considering (98) as a nonnegative parabola in λ\lambda, whose discriminant must therefore be nonpositive, we have:

|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right|\leq (99)
2(ασ2+(1α)γ2)(αCi+(1α)Di)α(1α),\displaystyle\quad\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\left(\alpha C_{i}+(1-\alpha)D_{i}\right)}{\alpha(1-\alpha)}},

where Ci=KL(P^W,ZiPWμ)C_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu) and Di=KL(P^W,ZiPW,Zi)D_{i}=\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}}). Finally, summing over i=1,,ni=1,\cdots,n and combining with the decomposition at the beginning of the proof proves the claim. ∎

Proof:

Using Lemma 3, we have:

minP^W,ZiαKL(P^W,ZiPWμ)+(1α)KL(P^W,ZiPW,Zi)=\displaystyle\min_{\widehat{P}_{W,Z_{i}}}\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu)+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}})=
(1α)IRα(W;Zi)\displaystyle(1-\alpha)I_{\mathrm{R}}^{\alpha}(W;Z_{i})
+minP^W,ZiKL(P^W,Zi(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)).\displaystyle+\min_{\widehat{P}_{W,Z_{i}}}\mathrm{KL}\left(\widehat{P}_{W,Z_{i}}\|\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right).

Now, by choosing dP^W,Zi=(dPZidPW)α(dPW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)d\widehat{P}_{W,Z_{i}}=\frac{(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}, the KL term equals zero. The final result then follows from Proposition 5. ∎

Proof:

Our proof is based on [35, Theorem 30]. For 0α10\leq\alpha\leq 1, we have:

αKL(P^W,ZiPWμ)+(1α)KL(P^W,ZiPW,Zi)\displaystyle\alpha\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W}\otimes\mu)+(1-\alpha)\mathrm{KL}(\widehat{P}_{W,Z_{i}}\|P_{W,Z_{i}}) (100)
=𝒲×𝒵𝑑P^W,Zilog(dP^W,Zi)\displaystyle=\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log(d\widehat{P}_{W,Z_{i}}) (101)
𝒲×𝒵P^W,Zilog((dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad-\int_{\mathcal{W}\times\mathcal{Z}}\widehat{P}_{W,Z_{i}}\log((dP_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)})
=𝒲×𝒵𝑑P^W,Zilog(dP^W,Zi)\displaystyle=\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log(d\widehat{P}_{W,Z_{i}}) (102)
𝒲×𝒵𝑑P^W,Zilog((dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad-\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log\left((\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}\right)
+log(𝒲×𝒵(dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad+\log\left(\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}\right)
log(𝒲×𝒵(dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad-\log\left(\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}\right)
=log(𝒲×𝒵(PWdPZi)α(dPW,Zi)(1α))\displaystyle=-\log\left(\int_{\mathcal{W}\times\mathcal{Z}}(P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}\right) (103)
+𝒲×𝒵𝑑P^W,Zilog(dP^W,Zi)\displaystyle\quad+\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log(d\widehat{P}_{W,Z_{i}})
𝒲×𝒵𝑑P^W,Zilog((dPWdPZi)α(dPW,Zi)(1α)𝒲×𝒵(dPWdPZi)α(dPW,Zi)(1α))\displaystyle\quad-\int_{\mathcal{W}\times\mathcal{Z}}d\widehat{P}_{W,Z_{i}}\log\left(\frac{(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{W}\otimes\mathrm{d}P_{Z_{i}})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right)
=(1α)IRα(W;Zi)\displaystyle=(1-\alpha)I_{\mathrm{R}}^{\alpha}(W;Z_{i}) (104)
+KL(P^W,Zi(PZiPW)α(PW,Zi)(1α)𝒲×𝒵(dPZidPW)α(dPW,Zi)(1α)).\displaystyle\quad+\mathrm{KL}\left(\widehat{P}_{W,Z_{i}}\|\frac{(P_{Z_{i}}\otimes P_{W})^{\alpha}(P_{W,Z_{i}})^{(1-\alpha)}}{\int_{\mathcal{W}\times\mathcal{Z}}(\mathrm{d}P_{Z_{i}}\otimes\mathrm{d}P_{W})^{\alpha}(\mathrm{d}P_{W,Z_{i}})^{(1-\alpha)}}\right).
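The variational identity behind this proof, namely that the infimum over the auxiliary distribution of αKL(·∥P₁) + (1−α)KL(·∥P₂) equals (1−α)R_α(P₁∥P₂) and is attained at the normalized geometric mixture ([35, Theorem 30]), is easy to verify numerically for discrete distributions; the Python sketch below uses two randomly generated distributions purely as an illustration.

import numpy as np

rng = np.random.default_rng(0)
P1 = rng.random(6)
P1 /= P1.sum()
P2 = rng.random(6)
P2 /= P2.sum()
alpha = 0.3

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

renyi = np.log(np.sum(P1 ** alpha * P2 ** (1 - alpha))) / (alpha - 1)
geo = P1 ** alpha * P2 ** (1 - alpha)
geo /= geo.sum()                                                   # normalized geometric mixture
lhs = alpha * kl(geo, P1) + (1 - alpha) * kl(geo, P2)
print(round(lhs, 6), round((1 - alpha) * renyi, 6))                # the two values coincide

R = rng.random(6)
R /= R.sum()                                                       # any other auxiliary distribution
assert alpha * kl(R, P1) + (1 - alpha) * kl(R, P2) >= lhs - 1e-12  # gives a larger objective value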

Proof:

Using (57),

IRα(W;Zi)α1αI(W;Zi),\displaystyle I_{\mathrm{R}}^{\alpha}(W;Z_{i})\leq\frac{\alpha}{1-\alpha}I(W;Z_{i}), (105)

and considering that the hypothesis space is finite, together with the upper bound in Theorem 3, we have:

|gen¯(PW|S,μ)|1ni=1n2(ασ2+(1α)γ2)IRα(W;Zi)α\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu)|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{R}}^{\alpha}(W;Z_{i})}{\alpha}} (106)
1ni=1n2(ασ2+(1α)γ2)min{1α,11α}I(W;Zi)\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}I(W;Z_{i})} (107)
2(ασ2+(1α)γ2)min{1α,11α}i=1nI(W;Zi)n\displaystyle\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}\frac{\sum_{i=1}^{n}I(W;Z_{i})}{n}} (108)
2(ασ2+(1α)γ2)min{1α,11α}I(W;S)n\displaystyle\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}\frac{I(W;S)}{n}} (109)
2(ασ2+(1α)γ2)min{1α,11α}H(W)n\displaystyle\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}\frac{H(W)}{n}} (110)
2(ασ2+(1α)γ2)min{1α,11α}log(k)n,\displaystyle\quad\leq\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\min\left\{\frac{1}{\alpha},\frac{1}{1-\alpha}\right\}\frac{\log(k)}{n}},

where (108) follows from Jensen's inequality and (109) follows from the i.i.d. assumption on the ZiZ_{i}'s. ∎

Proof:

Consider arbitrary auxiliary distributions {P~W,Zi}i=1n\{\tilde{P}_{W,Z_{i}}\}_{i=1}^{n} defined on 𝒲×𝒵\mathcal{W}\times\mathcal{Z}.

gen¯(PW|S,μ)=𝔼PWPS[LE(W,S)]𝔼PW,S[LE(W,S)]\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)=\mathbb{E}_{P_{W}P_{S}}[L_{E}(W,S)]-\mathbb{E}_{P_{W,S}}[L_{E}(W,S)]
=1ni=1n𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})] (111)
1ni=1n|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|.\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right|. (112)

Using the assumption that the centered loss function (w,zi)𝔼PZi[(w,Zi)]\ell(w,z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(w,Z_{i})] is γ\gamma-sub-Gaussian under the distribution PW,ZiP_{W,Z_{i}}, and applying the Donsker-Varadhan representation to this function, we have:

λ(𝔼P~W,Zi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle\lambda\bigg{(}\mathbb{E}_{\tilde{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]] (113)
𝔼PWZi[(W,Zi)𝔼PZi[(W,Zi)]])\displaystyle\quad-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]\bigg{)}
KL(P~W,ZiPWZi)+λ2γ22.λ\displaystyle\leq\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|P_{WZ_{i}})+\frac{\lambda^{2}\gamma^{2}}{2}.\quad\forall\lambda\in\mathbb{R}

Note that 𝔼PWZi[(W,Zi)𝔼PZi[(W,Zi)]]=𝔼PWZi[(W,Zi)]𝔼PWPZi[(W,Zi)]\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]=\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})].

Using the assumption that (w,Z)\ell(w,Z) is σ\sigma-sub-Gaussian under PZiP_{Z_{i}} for all w𝒲w\in\mathcal{W}, and again applying the Donsker-Varadhan representation to the function (w,zi)𝔼PZi[(w,Zi)]\ell(w,z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(w,Z_{i})], we have:

λ(𝔼P~W,Zi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle\lambda^{\prime}\bigg{(}\mathbb{E}_{\tilde{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]] (114)
𝔼QWPZi[(W,Zi)𝔼PZi[(W,Zi)]])\displaystyle\quad-\mathbb{E}_{Q_{W}P_{Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]\bigg{)}
KL(P~W,ZiQWPZi)+λ2σ22.λ\displaystyle\quad\leq\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|Q_{W}P_{Z_{i}})+\frac{{\lambda^{\prime}}^{2}\sigma^{2}}{2}.\quad\forall\lambda^{\prime}\in\mathbb{R}

Note that 𝔼QWPZi[(W,Zi)𝔼PZi[(W,Zi)]]=0\mathbb{E}_{Q_{W}P_{Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]=0.

Now, for λ>0\lambda>0, we choose λ=αα1λ\lambda^{\prime}=\frac{\alpha}{\alpha-1}\lambda (so that λ<0\lambda^{\prime}<0 for 0<α<10<\alpha<1). Hence, we have

𝔼P~W,Zi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle\mathbb{E}_{\tilde{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]] (115)
𝔼PWZi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle\quad-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]
KL(P~W,ZiPWZi)λ+λγ22,λ+.\displaystyle\quad\leq\frac{\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|P_{WZ_{i}})}{\lambda}+\frac{\lambda\gamma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Using the assumption that (w,Z)\ell(w,Z) is σ\sigma-sub-Gaussian and again the Donsker-Varadhan representation,

𝔼P~W,Zi[(W,Zi)𝔼PZi[(W,Zi)]]\displaystyle-\mathbb{E}_{\tilde{P}_{W,Z_{i}}}[\ell(W,Z_{i})-\mathbb{E}_{P_{Z_{i}}}[\ell(W,Z_{i})]]\leq (116)
KL(P~W,ZiQWPZi)|λ|+|λ|σ22,λ.\displaystyle\quad\frac{\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|Q_{W}P_{Z_{i}})}{|\lambda^{\prime}|}+\frac{|\lambda^{\prime}|\sigma^{2}}{2},\quad\forall\lambda^{\prime}\in\mathbb{R^{-}}.

Summing the two inequalities (115) and (116), we obtain

𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\leq (117)
αKL(P~W,ZiPW,Zi)+(1α)KL(P~W,ZiQWPZi)αλ+\displaystyle\frac{\alpha\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|Q_{W}P_{Z_{i}})}{\alpha\lambda}+
λγ22+λα1ασ22,λ+.\displaystyle\quad\frac{\lambda\gamma^{2}}{2}+\frac{\lambda\frac{\alpha}{1-\alpha}\sigma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{+}}.

Taking the infimum over P~W,Zi\tilde{P}_{W,Z_{i}} and using [35, Theorem 30], which states that

(1α)Rα(P1P2)=infR{αKL(RP1)+(1α)KL(RP2)}(1-\alpha)\mathrm{R}_{\alpha}(P_{1}\|P_{2})=\inf_{R}\{\alpha\mathrm{KL}(R\|P_{1})+(1-\alpha)\mathrm{KL}(R\|P_{2})\}

Now, we have:

(1α)Rα(PW,ZiQWPZi)=\displaystyle(1-\alpha)\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|Q_{W}P_{Z_{i}})= (118)
infP~W,Zi{αKL(P~W,ZiPW,Zi)+(1α)KL(P~W,ZiQWPZi)}\displaystyle{\inf_{\tilde{P}_{W,Z_{i}}}\{\alpha\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|P_{W,Z_{i}})+(1-\alpha)\mathrm{KL}(\tilde{P}_{W,Z_{i}}\|Q_{W}P_{Z_{i}})\}}

and taking infimum on QWQ_{W}, we have:

infQWRα(PW,ZiQWPZi)=ISα(Zi;W).\displaystyle\inf_{Q_{W}}\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|Q_{W}P_{Z_{i}})=I_{\mathrm{S}}^{\alpha}(Z_{i};W). (119)

Using (119) in (117), we get:

𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\leq (120)
(1α)Iαs(Zi;W)λα+λγ22+λα1ασ22λ+.\displaystyle\quad\frac{(1-\alpha)I^{s}_{\alpha}(Z_{i};W)}{\lambda\alpha}+\frac{\lambda\gamma^{2}}{2}+\frac{\lambda\frac{\alpha}{1-\alpha}\sigma^{2}}{2}\quad\forall\lambda\in\mathbb{R^{+}}.

Using the same approach for λR\lambda\in R^{-}, we have:

𝔼PWZi[(W,Zi)]𝔼PWPZi[(W,Zi)]\displaystyle\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]\leq (121)
(1α)Iαs(Zi;W)|λ|α+|λ|γ22+|λ|α1ασ22,λ.\displaystyle\quad\frac{(1-\alpha)I^{s}_{\alpha}(Z_{i};W)}{|\lambda|\alpha}+\frac{|\lambda|\gamma^{2}}{2}+\frac{|\lambda|\frac{\alpha}{1-\alpha}\sigma^{2}}{2},\quad\forall\lambda\in\mathbb{R^{-}}.

Considering (120) and (121) as a non-negative parabola in λ\lambda, whose discriminant must therefore be non-positive, we have:

|𝔼PWPZi[(W,Zi)]𝔼PWZi[(W,Zi)]|\displaystyle\left|\mathbb{E}_{P_{W}P_{Z_{i}}}[\ell(W,Z_{i})]-\mathbb{E}_{P_{WZ_{i}}}[\ell(W,Z_{i})]\right|\leq (122)
2(ασ2+(1α)γ2)ISα(Zi;W)α.\displaystyle\quad\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{I_{\mathrm{S}}^{\alpha}(Z_{i};W)}{\alpha}}.

Finally, summing over i=1,,ni=1,\cdots,n and combining with the decomposition at the beginning of the proof proves the claim. ∎

Proof:

The generalized Pinsker inequality is given in [35] as follows,

𝕋𝕍(P,Q)22αRα(PQ),α(0,1],\displaystyle\mathbb{TV}(P,Q)^{2}\leq\frac{2}{\alpha}\mathrm{R}_{\alpha}(P\|Q),\quad\alpha\in(0,1], (123)

where 𝕋𝕍(P,Q)=𝒳|P(dx)Q(dx)|\mathbb{TV}(P,Q)=\int_{\mathcal{X}}|P(\mathrm{d}x)-Q(\mathrm{d}x)|. Let f:𝒳f:\mathcal{X}\to\mathbb{R} be a bounded function with |f|L|f|\leq L; then

𝔼P[f(X)]𝔼Q[f(X)]=\displaystyle\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]= (124)
f(x)(P(dx)Q(dx))\displaystyle\quad\int f(x)(P(dx)-Q(dx))\leq
supx|f(x)||P(dx)Q(dx)|L2αRα(PQ).\displaystyle\quad\sup_{x}|f(x)|\cdot\int|P(dx)-Q(dx)|\leq L\sqrt{\frac{2}{\alpha}\mathrm{R}_{\alpha}(P\|Q)}.

Let P=PW,ZP=P_{W,Z}, Q=PWPZQ=P_{W}P_{Z} and f(w,z)=Lμ(w)(w,z)f(w,z)=L_{\mu}(w)-\ell(w,z). Then, we have the final result,

gen¯(PW|S,μ)=1ni=1n𝔼[Lμ(W)(W,Zi)]\displaystyle\overline{\operatorname*{gen}}(P_{W|S},\mu)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[L_{\mu}(W)-\ell(W,Z_{i})]
1ni=1n2b2αRα(PW,ZiPWPZi).\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{\frac{2b^{2}}{\alpha}\mathrm{R}_{\alpha}(P_{W,Z_{i}}\|P_{W}P_{Z_{i}})}.

Appendix C Proof of Section III-C

Proof:

Since IJSα(W;Zi)h(α)I_{\mathrm{JS}}^{\alpha^{\prime}}(W;Z_{i})\leq h(\alpha^{\prime}), each term of the α\alpha^{\prime}-Jensen-Shannon bound in (62) satisfies

2σ(α)2IJSα(W;Zi)α(1α)2σ(α)2h(α)α(1α).\sqrt{2\sigma_{(\alpha^{\prime})}^{2}\frac{I_{\mathrm{JS}}^{\alpha^{\prime}}(W;Z_{i})}{\alpha^{\prime}(1-\alpha^{\prime})}}\leq\sqrt{2\sigma_{(\alpha^{\prime})}^{2}\frac{h(\alpha^{\prime})}{\alpha^{\prime}(1-\alpha^{\prime})}}.

Hence, if αh(α)(1α)αIRα(W;Zi)\frac{\alpha h(\alpha^{\prime})}{(1-\alpha^{\prime})\alpha^{\prime}}\leq I_{\mathrm{R}}^{\alpha}(W;Z_{i}) holds for i=1,,ni=1,\cdots,n and σ(α)=σ=γ\sigma_{(\alpha^{\prime})}=\sigma=\gamma, each term of (62) is no larger than the corresponding term of the α\alpha-Rényi bound in (63), which completes the proof. ∎

Appendix D Proofs of Section IV

Proof:

Let us define P𝒩:=𝒩(w,β1Id)P_{\mathcal{N}}:=\mathcal{N}(w^{\star},\beta^{-1}I_{d}) and w:=arginfw𝒲Lμ(w)w^{\star}:=\arg\inf_{w\in\mathcal{W}}L_{\mu}(w).

r(PW|S,β,JSα,μ)\displaystyle\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)
|gen¯(PW|S,β,JSα,μ)|+𝔼PSPW|S,β,JSα[LE(W,S)]Lμ(w)\displaystyle\leq\big{|}\overline{\operatorname*{gen}}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)\big{|}+\mathbb{E}_{P_{S}\otimes P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}}}[L_{E}(W,S)]-L_{\mu}(w^{\star})
2b2nα(1α)i=1nIJSα(W;Zi)+𝔼PSP𝒩[LE(W,S)]Lμ(w)\displaystyle\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\mathbb{E}_{P_{S}\otimes P_{\mathcal{N}}}[L_{E}(W,S)]-L_{\mu}(w^{\star})
+JSα(𝒩(w,β1Id)Q)β\displaystyle\quad+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta}
2b2nα(1α)i=1nIJSα(W;Zi)+𝔼P𝒩[Lμ(W)]Lμ(w)\displaystyle\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\mathbb{E}_{P_{\mathcal{N}}}[L_{\mu}(W)]-L_{\mu}(w^{\star})
+JSα(𝒩(w,β1Id)Q)β\displaystyle\quad+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta}
2b2nα(1α)i=1nIJSα(W;Zi)+𝔼P𝒩[Lμ(w)\displaystyle\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\mathbb{E}_{P_{\mathcal{N}}}[L_{\mu}(w^{\star})
+L~Ww2]Lμ(w)+JSα(𝒩(w,β1Id)Q)β\displaystyle\quad+\tilde{L}\|W-w^{\star}\|_{2}]-L_{\mu}(w^{\star})+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta}
2b2nα(1α)i=1nIJSα(W;Zi)+L~dβ\displaystyle\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\frac{\tilde{L}\sqrt{d}}{\beta}
+JSα(𝒩(w,β1Id)Q)β,\displaystyle\quad+\frac{\mathrm{JS}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta},

Note that L~𝔼P𝒩[Ww2]=L~dβ\tilde{L}\mathbb{E}_{P_{\mathcal{N}}}[\|W-w^{\star}\|_{2}]=\frac{\tilde{L}d}{\beta}. ∎

Proof:

The proof is similar to the proof of Theorem 5, replacing the second inequality with the following one,

r(PW|S,β,Rα,μ)|gen¯(PW|S,β,Rα,μ)|+𝔼PSPW|S,β,Rα[LE(W,S)]Lμ(w)2b2nαi=1nIRα(W;Zi)+𝔼PSP𝒩[LE(W,S)]Lμ(w)+Rα(𝒩(w,β1Id)Q)β\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\leq\big{|}\overline{\operatorname*{gen}}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\big{|}\\ &\quad+\mathbb{E}_{P_{S}\otimes P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}}}[L_{E}(W,S)]-L_{\mu}(w^{\star})\\ &\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}\\ &\quad+\mathbb{E}_{P_{S}\otimes P_{\mathcal{N}}}[L_{E}(W,S)]-L_{\mu}(w^{\star})+\frac{\mathrm{R}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|Q)}{\beta}\end{split}

Proof:

Using the boundedness of the α\alpha-JS\mathrm{JS} divergence in Theorem 5, we have,

r(PW|S,β,JSα,μ)2b2nα(1α)i=1nIJSα(W;Zi)+L~dβ+h(α)β,\begin{split}\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{JS}_{\alpha}},\mu)&\leq\sqrt{\frac{2b^{2}}{n\alpha(1-\alpha)}\sum_{i=1}^{n}I_{\mathrm{JS}}^{\alpha}(W;Z_{i})}+\frac{\tilde{L}\sqrt{d}}{\beta}+\frac{h(\alpha)}{\beta},\end{split}

where h(α)=αlog(α)(1α)log(1α)h(\alpha)=-\alpha\log(\alpha)-(1-\alpha)\log(1-\alpha). Therefore, by setting β=n1/2\beta=n^{1/2}, the convergence rate of excess risk is 𝒪(1/n)\mathcal{O}(1/\sqrt{n}). ∎

Proof:

We consider the standard normal distribution as the prior, i.e., Q=𝒩(0,Id)Q=\mathcal{N}(0,I_{d}), in Theorem 6. The α\alpha-Rényi divergence between two multivariate Gaussian distributions can then be computed in closed form [49],

Rα(𝒩(w,β1Id)𝒩(0,Id))=α2w22(α+(1α)β1)1+d2(α1)log(βα1α+(1α)β1).\begin{split}&\mathrm{R}_{\alpha}(\mathcal{N}(w^{\star},\beta^{-1}I_{d})\|\mathcal{N}(0,I_{d}))=\frac{\alpha}{2}\|w^{\star}\|_{2}^{2}(\alpha+(1-\alpha)\beta^{-1})^{-1}\\ &\quad+\frac{d}{2(\alpha-1)}\log\Big{(}\frac{\beta^{\alpha-1}}{\alpha+(1-\alpha)\beta^{-1}}\Big{)}.\end{split}

Then, the following upper bound holds on the excess risk under PW|S,β,RαP_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},

r(PW|S,β,Rα,μ)2b2nαi=1nIRα(W;Zi)+L~dβ+α2βw22(α+(1α)β1)1+d2β(α1)log(β(α1)α+(1α)β1)2b2nαi=1nIRα(W;Zi)+L~dβ+12βw22+d2βlog(β)+d2β(1α)log(α).\begin{split}&\mathcal{E}_{r}(P_{W|S}^{\star,\beta,\mathrm{R}_{\alpha}},\mu)\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}+\frac{\tilde{L}\sqrt{d}}{\beta}\\ &\quad+\frac{\alpha}{2\beta}\|w^{\star}\|_{2}^{2}(\alpha+(1-\alpha)\beta^{-1})^{-1}\\ &\quad+\frac{d}{2\beta(\alpha-1)}\log\Big{(}\frac{\beta^{(\alpha-1)}}{\alpha+(1-\alpha)\beta^{-1}}\Big{)}\\ &\leq\sqrt{\frac{2b^{2}}{n\alpha}\sum_{i=1}^{n}I_{\mathrm{R}}^{\alpha}(W;Z_{i})}+\frac{\tilde{L}\sqrt{d}}{\beta}\\ &\quad+\frac{1}{2\beta}\|w^{\star}\|_{2}^{2}+\frac{d}{2\beta}\log\big{(}\beta\big{)}+\frac{d}{2\beta(1-\alpha)}\log\big{(}\alpha\big{)}.\end{split}

Appendix E Proofs of Section V

We first state the following lemma, which provides an upper bound on the expected generalization error under distribution mismatch.

Lemma 6

The following upper bound holds on the expected generalization error under distribution mismatch between the test and training distributions:

|gen¯(PW|S,μ,μ)|\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime})|\leq (125)
|gen¯(PW|S,μ)|+|𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]|.\displaystyle\quad|\overline{\operatorname*{gen}}(P_{W|S},\mu)|+|\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]|.
Proof:

We have:

|gen¯(PW|S,μ,μ)|\displaystyle|\overline{\operatorname*{gen}}(P_{W|S},\mu,\mu^{\prime})| (126)
=|𝔼PW,S[LP(W,μ)LP(W,μ)+LP(W,μ)LE(W,S)]|\displaystyle=|\mathbb{E}_{P_{W,S}}[L_{P}(W,\mu^{\prime})-L_{P}(W,\mu)+L_{P}(W,\mu)-L_{E}(W,S)]|
|𝔼PW,S[LP(W,μ)LP(W,μ)]|+|gen¯(PW|S,μ)|\displaystyle\leq|\mathbb{E}_{P_{W,S}}[L_{P}(W,\mu^{\prime})-L_{P}(W,\mu)]|+|\overline{\operatorname*{gen}}(P_{W|S},\mu)|
=|𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]|+|gen¯(PW|S,μ)|\displaystyle=|\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]|+|\overline{\operatorname*{gen}}(P_{W|S},\mu)|

Proof:

By Lemma 6, the generalization error under distribution mismatch is upper bounded by two terms. The term |gen¯(PW|S,μ)||\overline{\operatorname*{gen}}(P_{W|S},\mu)| can be upper bounded using the α\alpha-Jensen-Shannon information bound of Theorem 2. The term |𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]||\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]| in Lemma 6 can be upper bounded by applying the ADM, with an approach similar to that of Theorem 2 and using the α\alpha-Jensen-Shannon divergence, as follows:

|𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]|\displaystyle|\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]|\leq (127)
2σ(α)2JSα(PWμPWμ)α(1α)=\displaystyle\sqrt{2\sigma_{(\alpha)}^{2}\frac{\mathrm{JS}_{\alpha}(P_{W}\otimes\mu^{\prime}\|P_{W}\otimes\mu)}{\alpha(1-\alpha)}}= (128)
2σ(α)2JSα(μμ)α(1α).\displaystyle\sqrt{2\sigma_{(\alpha)}^{2}\frac{\mathrm{JS}_{\alpha}(\mu^{\prime}\|\mu)}{\alpha(1-\alpha)}}. (129)

Proof:

Based on Lemma 6, the generalization error is upper bounded by two terms (see Equation (125)). The term |gen¯(PW|S,μ)||\overline{\operatorname*{gen}}(P_{W|S},\mu)| can be upper bounded using the α\alpha-Rényi information bound of Theorem 3. The term |𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]||\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]| can be upper bounded by applying the ADM, with an approach similar to that of Theorem 3 and using the α\alpha-Rényi divergence, as follows:

|𝔼PWμ[(W,Z)]𝔼PWμ[(W,Z)]|\displaystyle|\mathbb{E}_{P_{W}\otimes\mu^{\prime}}[\ell(W,Z)]-\mathbb{E}_{P_{W}\otimes\mu}[\ell(W,Z)]|\leq (130)
2(ασ2+(1α)γ2)Rα(PWμPWμ)α=\displaystyle\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\mathrm{R}_{\alpha}(P_{W}\otimes\mu^{\prime}\|P_{W}\otimes\mu)}{\alpha}}= (131)
2(ασ2+(1α)γ2)Rα(μμ)α.\displaystyle\sqrt{2(\alpha\sigma^{2}+(1-\alpha)\gamma^{2})\frac{\mathrm{R}_{\alpha}(\mu^{\prime}\|\mu)}{\alpha}}. (132)