
An Information-Theoretic Framework for Out-of-Distribution Generalization
with Applications to
Stochastic Gradient Langevin Dynamics

Wenliang Liu, Guanding Yu, Lele Wang§, and Renjie Liao§
Abstract

We study Out-of-Distribution (OOD) generalization in machine learning and propose a general framework that establishes information-theoretic generalization bounds. Our framework interpolates freely between Integral Probability Metric (IPM) and $f$-divergence, which naturally recovers some known results (including Wasserstein- and KL-bounds), as well as yields new generalization bounds. Additionally, we show that our framework admits an optimal transport interpretation. When evaluated in two concrete examples, the proposed bounds either strictly improve upon existing bounds in some cases or match the best existing OOD generalization bounds. Moreover, by focusing on $f$-divergence and combining it with the Conditional Mutual Information (CMI) methods, we derive a family of CMI-based generalization bounds, which include the state-of-the-art ICIMI bound as a special instance. Finally, leveraging these findings, we analyze the generalization of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, showing that our derived generalization bounds outperform existing information-theoretic generalization bounds in certain scenarios.

Index Terms:
Generalization, Out-of-Distribution, $f$-Divergence, Integral Probability Metric (IPM), Stochastic Gradient Langevin Dynamics (SGLD)
footnotetext: Wenliang Liu and Guanding Yu are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (email: {liuwenliang, yuguanding}@zju.edu.cn).
footnotetext: Lele Wang and Renjie Liao are with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T1Z4, Canada (email: {lelewang, rjliao}@ece.ubc.ca).
footnotetext: Renjie Liao is a faculty member at Vector Institute and a Canada CIFAR AI Chair.
§footnotetext: Co-corresponding authors.
footnotetext: This work was accepted in part at the 2024 IEEE International Symposium on Information Theory [1] and the 2024 Canadian Workshop on Information Theory.

I Introduction

In machine learning, generalization is the ability of a model to make accurate predictions on data unseen during training. How to analyze and improve the generalization of models is a core subject of machine learning. In the past decades, a series of mathematical tools have been invented or applied to bound the generalization gap, i.e., the difference between the testing and training performance of a model, such as the VC dimension [2], Rademacher complexity [3], covering numbers [4], algorithmic stability [5], and PAC-Bayes [6]. Recently, there have been attempts to bound the generalization gap using information-theoretic tools. The idea is to regard the learning algorithm as a communication channel that maps the input set of samples $S$ to the output hypothesis (model weights) $W$. In the pioneering work [7, 8], the generalization gap is bounded by the Mutual Information (MI) between $W$ and $S$, which reflects the intuition that a learning algorithm generalizes well if the learned model weights $W$ leak little information about the training samples. However, the generalization bound becomes vacuous whenever the MI term is infinite. The remedy is to replace the MI term with other, smaller quantities that reflect the information $S$ leaks to $W$, and this has been fulfilled by two orthogonal works, exploiting the Individual Mutual Information (IMI) [9] and the Conditional Mutual Information (CMI) [10], respectively. Specifically, the IMI method bounds the generalization gap using the mutual information between $W$ and each individual training datum $Z_{i}$, rather than the MI between $W$ and the whole set of samples. In certain scenarios the MI term is infinite while the IMI term is finite, so the resulting IMI bound significantly improves upon the MI bound [9]. Meanwhile, the CMI method studies generalization through a set of super-samples (also known as ghost samples), a pair of independent and identically distributed (i.i.d.) copies $Z_{i}^{+}$ and $Z_{i}^{-}$, and then uses a Rademacher random variable $R_{i}$ to choose $Z_{i}^{+}$ or $Z_{i}^{-}$ as the $i$-th training datum. The resulting generalization bound is captured by the CMI between $W$ and the samples’ identities $R_{i}$, conditioned on the super-samples. Since then, a line of work [11, 12, 13, 14, 15, 16, 17] has tightened information-theoretic generalization bounds, and as a remarkable application, these information-theoretic methods have been employed to study the generalization performance of learning algorithms, e.g., Langevin dynamics and its variants.

In practice, it is often the case that the training data suffer from selection biases and that the data distribution shifts over time, e.g., in continual learning, causing the distribution of the test data to differ from the training distribution. This motivates researchers to study Out-of-Distribution (OOD) generalization. It is common practice to extract invariant features to improve OOD performance [18], and an OOD generalization theory was established in [19] from the perspective of invariant features. As a sub-field of OOD generalization, domain adaptation was systematically studied in [20, 21, 22, 23]. In the information-theoretic regime, the mismatch between the training and testing distributions yields an additional penalty term in the OOD generalization bounds, and such a penalty can be measured by either the Kullback–Leibler (KL) divergence [24, 25, 26] or the total variation (TV) as well as the Wasserstein distances [24, 26]. In this paper, we establish a universal framework for OOD generalization analysis, which interpolates between the Integral Probability Metric (IPM) and the $f$-divergence. Our OOD generalization bounds include most existing bounds as special cases and are shown to strictly improve upon existing bounds in some cases. The framework is expressed in terms of individual data and thus can be regarded as a natural extension of the classical IMI method. Meanwhile, we demonstrate that part of these results can be adapted to the CMI setting, which gives rise to an extension and improvement of the cutting-edge ICIMI method to its $f$-divergence variants. Finally, we demonstrate that our findings lead to several improved generalization bounds for the Stochastic Gradient Langevin Dynamics (SGLD) algorithm. To this end, we note that the common way to derive generalization bounds for SGLD, as the IMI and CMI methods do, is to leverage the chain rule of the KL divergence. However, such an approach is inapplicable to our methods in general, simply because the chain rule does not hold for general $f$-divergences. The remedy is to exploit the subadditivity of the $f$-divergence, which leads to an asymptotically tighter generalization bound for SGLD.

I-A Related Works

Information-theoretic generalization bounds have been established in the previous works [24] and [26], in the context of transfer learning and domain adaptation, respectively. KL-bounds are derived in [25] utilizing rate-distortion theory. If we ignore the minor differences between the models underlying these generalization bounds, their results can be regarded as natural corollaries of our framework. Moreover, [27] also studied generalization bounds using $f$-divergence, but it only considered the in-distribution case, and the results are given in high-probability form. Furthermore, both [28] and our work use convex analysis (the Legendre-Fenchel dual) to study generalization. However, our work restricts the dependence measure to $f$-divergence or IPM. In contrast, [28] did not designate the specific form of the dependence measure, but relied on its strong convexity; this assumption does not hold for IPM and for some $f$-divergences. Besides, [28] did not consider OOD generalization either. In the final writing phase of this paper, we noticed two independent works [29, 30] that use a similar technique to perform the generalization analysis. In particular, [29] derived PAC-Bayes bounds based on the $(f,\Gamma)$-divergence, which can be regarded as a high-probability counterpart of our paper. [30] focused on the domain adaptation problem using $f$-divergence, and thus their models and methods are partially covered by this paper. Readers are referred to [30] for more specific results in the context of domain adaptation. Additionally, [31] studied the generalization error at a high level through the method of gaps, a technique for deriving the closed-form generalization error in terms of information-theoretic measures. The $f$-divergence has also been employed to regularize the empirical risk minimization algorithm, as explored in [32] and the references therein.

I-B Contributions

  1. We develop a theoretical framework for establishing information-theoretic generalization bounds in the OOD scenario, which allows us to interpolate freely between IPM and $f$-divergence. In addition to reproducing existing results, such as the generalization bounds based on the Wasserstein distance [24, 26] and the KL divergence [24, 26, 25], our framework also yields new generalization bounds. Notably, our OOD generalization bounds apply to the in-distribution case by simply setting the testing distribution equal to the training distribution.

  2. Our framework is designed to work with the CMI methods for bounded loss functions, and extends the state-of-the-art CMI-based result, known as the Individually Conditional Individual Mutual Information (ICIMI) bound, to a range of $f$-divergence-based ICIMI bounds. This enables us to properly select the function $f$ to derive tighter generalization bounds compared to the standard ICIMI bound.

  3. We leverage the above results to analyze the generalization of the SGLD algorithm for bounded and Lipschitz loss functions. First, we improve upon an existing SGLD generalization bound in [9] by introducing an additional domination term and extending it to the OOD setting. Next, we employ the squared Hellinger divergence and the ICIMI method to derive an alternative bound, highlighting its advantages under the assumption of without-replacement sampling. Finally, we relax this without-replacement sampling assumption to an asymptotic condition, showing that the resulting generalization bound is tighter than existing information-theoretic bounds for SGLD in the asymptotic regime.

I-C Notation and Organization

We denote the set of real numbers and the set of non-negative real numbers by $\mathbb{R}$ and $\mathbb{R}_{+}$, respectively. Sets, random variables, and their realizations are respectively represented by calligraphic fonts (e.g., $\mathcal{X}$), uppercase letters (e.g., $X$), and lowercase letters (e.g., $x$). Let $\mathcal{P}(\mathcal{X})$ be the set of probability distributions over a set $\mathcal{X}$ and $\mathcal{M}(\mathcal{X})$ be the set of measurable functions over $\mathcal{X}$. Given $P,Q\in\mathcal{P}(\mathcal{X})$, we write $P\perp Q$ if $P$ is singular to $Q$ and $P\ll Q$ if $P$ is absolutely continuous w.r.t. $Q$. We write $\frac{\mathrm{d}P}{\mathrm{d}Q}$ for the Radon-Nikodym derivative.

The rest of this paper is organized as follows. Section II introduces the problem formulation and some technical preliminaries. In Section III, we propose a general theorem on OOD generalization bounds and illustrate its optimal transport interpretation. Examples and special cases of the general theorem, including both known generalization bounds and new results, are given in Section IV. In Section V, we restate and improve the CMI methods for bounding the generalization gap, and the result is applied to the generalization analysis of the SGLD algorithm in Section VI. Finally, the paper is concluded in Section VII.

II Problem Formulation

In this section, we introduce our problem formulation and some preliminaries.

II-A Problem Formulation

Denote by $\mathcal{W}$ the hypothesis space and $\mathcal{Z}$ the space of data (i.e., input and output pairs). We assume the training data $(Z_{1},\ldots,Z_{n})$ are i.i.d. following the distribution $\nu$. Let $\ell\colon\mathcal{W}\times\mathcal{Z}\to\mathbb{R}_{+}$ be the loss function. From the Bayesian perspective, our target is to learn a posterior distribution of hypotheses over $\mathcal{W}$, based on the observed data sampled from $\mathcal{Z}$, such that the expected loss is minimized. Specifically, we assume the prior distribution $Q_{W}$ of hypotheses is known at the beginning of the learning. Upon observing $n$ samples, $z^{n}=\left(z_{1},\cdots,z_{n}\right)\in\mathcal{Z}^{n}$, a learning algorithm outputs one $w\in\mathcal{W}$ through a process like Empirical Risk Minimization (ERM) [33]. The learning algorithm is either deterministic (e.g., gradient descent with fixed hyperparameters) or stochastic (e.g., stochastic gradient descent). Thus, the learning algorithm can be characterized by a probability kernel $P_{W|Z^{n}}$ (given $z^{n}\in\mathcal{Z}^{n}$, $P_{W|Z^{n}=z^{n}}$ is a probability measure over $\mathcal{W}$), and its output is regarded as one sample from the posterior distribution $P_{W|Z^{n}=z^{n}}$.

In this paper, we consider the OOD generalization setting where the training distribution $\nu$ differs from the testing distribution $\mu$. Given a set of samples $z^{n}$ and the algorithm’s output $w$, the incurred generalization gap is

\mathrm{gen}\left(w,z^{n}\right)\coloneq\mathbb{E}_{\mu}\left[\ell\left(w,Z\right)\right]-\frac{1}{n}\sum_{i=1}^{n}\ell\left(w,z_{i}\right). (1)

Finally, we define the generalization gap of the learning algorithm by taking the expectation w.r.t. $w$ and $z^{n}$, i.e.,

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\coloneq\mathbb{E}\left[\mathrm{gen}\left(W,Z^{n}\right)\right], (2)

where the expectation is w.r.t. the joint distribution of $(W,Z^{n})$, given by $P_{W|Z^{n}}\otimes\nu^{\otimes n}$. An alternative approach to defining the generalization gap is to replace the empirical loss in (2) with the population loss w.r.t. the training distribution $\nu$, i.e.,

\mathrm{\widetilde{gen}}\left(P_{W|Z^{n}},\nu,\mu\right)\coloneq\mathbb{E}_{P_{W}}\left[\mathbb{E}_{\mu}\left[\ell\left(W,Z\right)\right]-\mathbb{E}_{\nu}\left[\ell\left(W,Z\right)\right]\right], (3)

where $P_{W}$ denotes the marginal distribution of $W$. By convention, we refer to (2) as the Population-Empirical (PE) generalization gap and to (3) as the Population-Population (PP) generalization gap. In the next two sections, we focus on bounding both the PP and the PE generalization gaps using information-theoretic tools.

II-B Preliminaries

Definition 1 ($f$-Divergence [34]).

Let $f\colon(0,+\infty)\to\mathbb{R}$ be a convex function satisfying $f(1)=0$. Given two distributions $P,Q\in\mathcal{P}(\mathcal{X})$, decompose $P=P_{c}+P_{s}$, where $P_{c}\ll Q$ and $P_{s}\perp Q$. The $f$-divergence between $P$ and $Q$ is defined by

D_{f}\left(P||Q\right)\coloneq\mathbb{E}_{Q}\left[f\left(\frac{\mathrm{d}P_{c}}{\mathrm{d}Q}\right)\right]+f^{\prime}(\infty)P_{s}(\mathcal{X}), (4)

where $f^{\prime}(\infty)=\lim_{x\to+\infty}f(x)/x$. If $f$ is super-linear, i.e., $f^{\prime}(\infty)=+\infty$, then the $f$-divergence has the form

D_{f}\left(P||Q\right)=\begin{cases}\mathbb{E}_{Q}\left[f\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\right],&\text{if}\ P\ll Q,\\ +\infty,&\text{otherwise}.\end{cases} (5)
Definition 2 (Generalized Cumulant Generating Function (CGF) [35, 36]).

Let $f$ be defined as above and $g$ be a measurable function. The generalized cumulant generating function of $g$ w.r.t. $f$ and $Q$ is defined by

\Lambda_{f;Q}\left(g\right)\coloneq\inf_{\lambda\in\mathbb{R}}\bigl\{\lambda+\mathbb{E}_{Q}\left[f^{*}(g-\lambda)\right]\bigr\}, (6)

where $f^{*}$ represents the Legendre-Fenchel dual of $f$,

f^{*}(y)\coloneq\sup_{x\in\mathbb{R}}\bigl\{xy-f(x)\bigr\}. (7)
Remark 1.

Taking $f(x)=x\log x-(x-1)$ in the $f$-divergence yields the KL divergence (here we choose $f$ to be standard, i.e., $f^{\prime}(1)=f(1)=0$). A direct calculation shows $f^{*}(y)=e^{y}-1$. The infimum is achieved at $\lambda=\log\mathbb{E}_{Q}\left[e^{g}\right]$ and thus $\Lambda_{f;Q}\left(g\right)=\log\mathbb{E}_{Q}\left[e^{g}\right]$. This means $\Lambda_{f;Q}\left(t(g-\mathbb{E}_{Q}\left[g\right])\right)$ degenerates to the classical cumulant generating function of $g$.
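The following snippet is a small numerical sanity check of this remark (our illustration, not part of the paper): for the KL generator, minimizing $\lambda+\mathbb{E}_{Q}\left[f^{*}(g-\lambda)\right]$ over $\lambda$ should recover $\log\mathbb{E}_{Q}\left[e^{g}\right]$; the reference distribution and test function below are arbitrary choices.

```python
# Numerical check (illustrative only) that the generalized CGF with the KL
# generator f(x) = x*log(x) - (x - 1), i.e., inf_lambda {lambda + E_Q[e^(g-lambda) - 1]},
# coincides with the classical CGF log E_Q[e^g].
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)        # samples from a reference distribution Q = N(0, 1)
g = np.sin(x) + 0.3 * x             # an arbitrary test function g

def dual_objective(lam):
    # lambda + E_Q[f*(g - lambda)] with f*(y) = e^y - 1
    return lam + np.mean(np.exp(g - lam) - 1.0)

res = minimize_scalar(dual_objective, bounds=(-10.0, 10.0), method="bounded")
print("inf over lambda:", res.fun)
print("log E_Q[e^g]   :", np.log(np.mean(np.exp(g))))
# Both values agree up to Monte-Carlo error.
```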

If we refer to $Q$ as a fixed reference distribution and regard $D_{f}\left(P||Q\right)$ as a function of the distribution $P$, then the $f$-divergence and the generalized CGF form a Legendre-Fenchel dual pair, as clarified in Lemma 1.

Lemma 1 (Variational Representation of $f$-Divergence [34]).
D_{f}\left(P||Q\right)=\sup_{g}\bigl\{\mathbb{E}_{P}\left[g\right]-\Lambda_{f;Q}\left(g\right)\bigr\}, (8)

where the supremum can be taken over either

  1. the set of all simple functions, or

  2. $\mathcal{M}(\mathcal{X})$, the set of all measurable functions, or

  3. $L_{Q}^{\infty}(\mathcal{X})$, the set of all $Q$-almost-surely bounded functions.

In particular, we recover the Donsker-Varadhan variational representation of the KL divergence by combining Remark 1 and Lemma 1:

D_{\mathrm{KL}}\left(P||Q\right)=\sup_{g}\bigl\{\mathbb{E}_{P}\left[g\right]-\log\mathbb{E}_{Q}\left[e^{g}\right]\bigr\}. (9)
Definition 3 ($\Gamma$-Integral Probability Metric [37]).

Let $\Gamma\subseteq\mathcal{M}(\mathcal{X})$ be a subset of measurable functions; then the $\Gamma$-Integral Probability Metric (IPM) between $P$ and $Q$ is defined by

W^{\Gamma}\left(P,Q\right)\coloneq\sup_{g\in\Gamma}\bigl\{\mathbb{E}_{P}\left[g\right]-\mathbb{E}_{Q}\left[g\right]\bigr\}. (10)

Examples of $\Gamma$-IPM include the 1-Wasserstein distance, the Dudley metric, and the maximum mean discrepancy. In general, if $\mathcal{X}$ is a Polish space with metric $\rho$, then the $p$-Wasserstein distance between $P$ and $Q$ is defined through

W_{p}(P,Q)=\Bigl(\inf_{\eta\in\mathcal{C}(P,Q)}\mathbb{E}_{(X,Y)\sim\eta}\left[\rho(X,Y)^{p}\right]\Bigr)^{1/p}, (11)

where $\mathcal{C}(P,Q)$ is the set of couplings of $P$ and $Q$. For the special case $p=1$, the Wasserstein distance can be expressed as an IPM due to the Kantorovich-Rubinstein duality

W_{1}(P,Q)=\sup_{\|g\|_{\mathrm{Lip}}\leq 1}\bigl\{\mathbb{E}_{P}\left[g\right]-\mathbb{E}_{Q}\left[g\right]\bigr\}, (12)

where $\|g\|_{\mathrm{Lip}}\coloneq\sup_{x,y\in\mathcal{X}}\frac{g(x)-g(y)}{\rho(x,y)}$ is the Lipschitz norm of $g$.
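As a concrete and purely illustrative example of the $p=1$ case, the following sketch estimates $W_{1}$ between two one-dimensional distributions from samples; the Gaussian choices below are ours and only serve to visualize the duality-based definition (12).

```python
# Illustrative sketch: empirical 1-Wasserstein distance on the real line,
# where rho(x, y) = |x - y|. For two equal-variance Gaussians the optimal
# transport is a mean shift, so W1 should be close to |0.5 - 0.0| = 0.5.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
nu_samples = rng.normal(loc=0.0, scale=1.0, size=50_000)   # "training" distribution nu
mu_samples = rng.normal(loc=0.5, scale=1.0, size=50_000)   # "testing" distribution mu

print("empirical W1(nu, mu):", wasserstein_distance(nu_samples, mu_samples))
```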

III Main Results

In this section, we first propose an inequality regarding the generalization gap in Subsection III-A, which leads to one of our main results, a general theorem on the generalization bounds in Subsection III-B. Finally, we show the theorem admits an optimal transport interpretation in Subsection III-C.

III-A An Inequality on the Generalization Gap

In this subsection, we show that the generalization gap can be bounded from above using the $\Gamma$-IPM, the $f$-divergence, and the generalized CGF. For simplicity, we denote $P_{i}=P_{W|Z_{i}}\otimes\nu$ and $Q=Q_{W}\otimes\mu$. Moreover, we define the (negative) re-centered loss function as $\bar{\ell}\left(w,z\right)\coloneq\mathbb{E}_{\mu}\left[\ell\left(w,Z\right)\right]-\ell\left(w,z\right)$.

Proposition 1.

Let $\bar{\Gamma}\subseteq\mathcal{M}\left(\mathcal{W}\times\mathcal{Z}\right)$ be a class of measurable functions and assume $\bar{\ell}\in\bar{\Gamma}$. Then for arbitrary probability distributions $\eta_{i}\in\mathcal{P}\left(\mathcal{W}\times\mathcal{Z}\right)$ and arbitrary positive real numbers $t_{i}>0$, $i\in[n]$, we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\frac{1}{n}\sum_{i=1}^{n}\Bigl(W^{\bar{\Gamma}}\left(P_{i},\eta_{i}\right)+\frac{1}{t_{i}}D_{f}\left(\eta_{i}||Q\right)+\frac{1}{t_{i}}\Lambda_{f;Q}\left(t_{i}\bar{\ell}\left(W,Z\right)\right)\Bigr). (13)
Proof of Proposition 1.

If $F^{*}$ is the Legendre dual of some functional $F:\mathcal{X}\to\mathbb{R}$, then we have

(tF)^{*}(x^{*})=tF^{*}\left(\frac{1}{t}x^{*}\right), (14)

for all $t\in\mathbb{R}_{+}$ and $x^{*}\in\mathcal{X}^{*}$, the dual space of $\mathcal{X}$. Let $Q$ be a fixed reference distribution, $\eta$ be a probability distribution, and $g$ be a measurable function. Combining the above fact with Lemma 1 yields the following Fenchel-Young inequality:

\mathbb{E}_{\eta}\left[g\right]\leq\frac{1}{t}D_{f}\left(\eta||Q\right)+\frac{1}{t}\Lambda_{f;Q}\left(tg\right),\quad t\in\mathbb{R}_{+}. (15)

As a consequence, we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right) =\mathbb{E}_{P_{W|Z^{n}}\otimes\nu^{\otimes n}}\left[\mathbb{E}_{\mu}\left[\ell\left(W,Z\right)\right]-\frac{1}{n}\sum_{i=1}^{n}\ell\left(W,Z_{i}\right)\right] (16)
=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{i}}\left[\bar{\ell}\left(W,Z_{i}\right)\right] (17)
\leq\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{i}}\left[\bar{\ell}\left(W,Z_{i}\right)\right]-\mathbb{E}_{\eta_{i}}\left[\bar{\ell}\left(W,Z_{i}\right)\right]+\frac{1}{t_{i}}\Bigl(D_{f}\left(\eta_{i}||Q\right)+\Lambda_{f;Q}\left(t_{i}\bar{\ell}\left(W,Z_{i}\right)\right)\Bigr) (18)
\leq\frac{1}{n}\sum_{i=1}^{n}\sup_{g\in\bar{\Gamma}}\bigl\{\mathbb{E}_{P_{i}}\left[g\right]-\mathbb{E}_{\eta_{i}}\left[g\right]\bigr\}+\frac{1}{t_{i}}\Bigl(D_{f}\left(\eta_{i}||Q\right)+\Lambda_{f;Q}\left(t_{i}\bar{\ell}\left(W,Z_{i}\right)\right)\Bigr) (19)
=\text{RHS of }(13).

Here, inequality (18) follows from (15), inequality (19) follows since $\bar{\ell}\in\bar{\Gamma}$, and the last equality follows by Definition 3. ∎

Proposition 1 has a close relationship with the $(f,\Gamma)$-divergence [35]. In Appendix A-A, we provide an alternative proof of Proposition 1 using the $(f,\Gamma)$-divergence. Furthermore, we show in Appendix A-B that the inequality in Proposition 1 is tight.

III-B A Theorem for OOD Generalization

It is common that the generalized CGF $\Lambda_{f;Q}\left(t\bar{\ell}\right)$ does not admit an analytical expression, so the bound in Proposition 1 may lack a closed form. This problem can be remedied by finding a convex upper bound $\psi(t)$ of $\Lambda_{f;Q}\left(t\bar{\ell}\right)$, as clarified in Theorem 1. Section IV provides many cases where $\psi$ is quadratic and Theorem 1 can be further simplified.

Theorem 1.

Let $\bar{\ell}\in\bar{\Gamma}\subseteq\mathcal{M}(\mathcal{W}\times\mathcal{Z})$ and $0<b\leq+\infty$. Suppose there exists a continuous convex function $\psi:[0,+\infty)\to[0,+\infty)$ satisfying $\psi(0)=\psi^{\prime}(0)=0$ and $\Lambda_{f;Q}\left(t\bar{\ell}\right)\leq\psi(t)$ for all $t\in(0,b)$. Then we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\frac{1}{n}\sum_{i=1}^{n}\inf_{\eta_{i}\in\mathcal{P}(\mathcal{W}\times\mathcal{Z})}\Bigl\{W^{\bar{\Gamma}}\left(P_{i},\eta_{i}\right)+(\psi^{*})^{-1}\left(D_{f}\left(\eta_{i}||Q\right)\right)\Bigr\}, (20)

where $\psi^{*}$ denotes the Legendre dual of $\psi$ and $(\psi^{*})^{-1}$ denotes the generalized inverse of $\psi^{*}$.

Remark 2.

Technically, we can replace $\bar{\ell}$ with $-\bar{\ell}$ and prove an upper bound on $-\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)$ by a similar argument. This result, together with Theorem 1, can be regarded as an extension of the previous result [9, Theorem 2]. Specifically, the extension is two-fold. First, [9] only considered the KL divergence, while our result interpolates freely between IPM and $f$-divergence. Second, [9] only considered in-distribution generalization, while our result applies to OOD generalization, including the case where the training distribution is not absolutely continuous w.r.t. the testing distribution.

Proof of Theorem 1.

We first invoke a key lemma.

Lemma 2 (Lemma 2.4 in [38]).

Let $\psi$ be a convex and continuously differentiable function defined on the interval $[0,b)$, where $0<b\leq+\infty$. Assume that $\psi(0)=\psi^{\prime}(0)=0$ and, for every $t\geq 0$, let $\psi^{*}(t)=\sup_{\lambda\in(0,b)}\left\{\lambda t-\psi(\lambda)\right\}$ be the Legendre dual of $\psi$. Then the generalized inverse of $\psi^{*}$, defined by $\left(\psi^{*}\right)^{-1}(y)\coloneq\inf\left\{t\geq 0:\psi^{*}(t)>y\right\}$, can also be written as

\left(\psi^{*}\right)^{-1}(y)=\inf_{\lambda\in(0,b)}\frac{y+\psi(\lambda)}{\lambda}. (21)

As a consequence of Lemma 2, we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right) \leq\frac{1}{n}\sum_{i=1}^{n}\inf_{\eta_{i}\in\mathcal{P}(\mathcal{W}\times\mathcal{Z}),\;t_{i}\in\mathbb{R}_{+}}\Bigl\{W^{\bar{\Gamma}}\left(P_{i},\eta_{i}\right)+\frac{1}{t_{i}}D_{f}\left(\eta_{i}||Q\right)+\frac{1}{t_{i}}\Lambda_{f;Q}\left(t_{i}\bar{\ell}\left(W,Z\right)\right)\Bigr\} (22)
\leq\frac{1}{n}\sum_{i=1}^{n}\inf_{\eta_{i}}\inf_{t_{i}}\biggl\{W^{\bar{\Gamma}}\left(P_{i},\eta_{i}\right)+\frac{D_{f}\left(\eta_{i}||Q\right)+\psi(t_{i})}{t_{i}}\biggr\} (23)
=\text{RHS of }(20),

where the first inequality follows by Proposition 1 and the last equality follows by Lemma 2. ∎

In general, compared with checking $\bar{\ell}\in\bar{\Gamma}$, it is more direct to check that $\ell\in\Gamma$ for some $\Gamma\subseteq\mathcal{M}(\mathcal{W}\times\mathcal{Z})$. If so, we can choose $\bar{\Gamma}=\Gamma-\Gamma$ (note that $\Gamma-\Gamma\neq 0$; it is the set consisting of $g-g^{\prime}$ such that both $g$ and $g^{\prime}$ belong to $\Gamma$). If we further assume that $\Gamma$ is symmetric, i.e., $\Gamma=-\Gamma$, then we have $\bar{\Gamma}=2\Gamma$ and thus

W^{\bar{\Gamma}}\left(P_{i},\eta_{i}\right)=2W^{\Gamma}\left(P_{i},\eta_{i}\right). (24)

The following corollary shows that, when (24) is inserted into the generalization bound (20), the coefficient 2 can be removed under certain conditions.

Corollary 1.

Suppose $\ell\in\Gamma\subseteq\mathcal{M}(\mathcal{W}\times\mathcal{Z})$ and $\Gamma$ is symmetric. Let $P_{W}$ be the distribution of the algorithmic output $W$ and $\mathcal{C}\left(P_{W},\cdot\right)\subseteq\mathcal{P}\left(\mathcal{W}\times\mathcal{Z}\right)$ be the class of distributions whose marginal distribution on $\mathcal{W}$ is $P_{W}$. Then we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\frac{1}{n}\sum_{i=1}^{n}\inf_{\eta_{i}\in\mathcal{C}\left(P_{W},\cdot\right)}\Bigl\{W^{\Gamma}\left(P_{i},\eta_{i}\right)+(\psi^{*})^{-1}\left(D_{f}\left(\eta_{i}||Q\right)\right)\Bigr\}. (25)
Proof.

By inequality (18), it suffices to prove

\mathbb{E}_{P_{i}}\left[\bar{\ell}\left(W,Z_{i}\right)\right]-\mathbb{E}_{\eta_{i}}\left[\bar{\ell}\left(W,Z_{i}\right)\right]\leq W^{\Gamma}\left(P_{i},\eta_{i}\right). (26)

If so, (25) will follow by exploiting Lemma 2 and optimizing over $t_{i}$ in (18). Since $\eta_{i}\in\mathcal{C}\left(P_{W},\cdot\right)$, the left-hand side of (26) is exactly $\mathbb{E}_{\eta_{i}}\left[\ell\right]-\mathbb{E}_{P_{i}}\left[\ell\right]$. Thus (26) follows by $\ell\in\Gamma$ and the symmetry of $\Gamma$. ∎

III-C An Optimal Transport Interpretation of Theorem 1

Intuitively, a learning algorithm generalizes well in the OOD setting if the following two conditions hold simultaneously: 1. the training distribution $\nu$ is close to the testing distribution $\mu$; 2. the posterior distribution $P_{W|Z_{i}}$ is close to the prior distribution $Q_{W}$. The second condition can be interpreted as “algorithmic stability” and has been studied by a line of work [39, 40]. The two conditions together imply that the learning algorithm generalizes well if $P_{i}$ is close to $Q$. The right-hand side of (20) can be regarded as a characterization of the “closeness” between $P_{i}$ and $Q$. Moreover, inspired by [35], we provide an optimal transport interpretation of the generalization bound (20). Consider the task of moving (or reshaping) a pile of dirt whose shape is characterized by the distribution $Q$ into another pile whose shape is characterized by $P_{i}$. Decompose the task into two phases as follows. During the first phase, we move $Q$ to $\eta_{i}$, which incurs an $f$-divergence-type transport cost $\left(\psi^{*}\right)^{-1}\left(D_{f}\left(\eta_{i}||Q\right)\right)$, a monotonically increasing transformation of $D_{f}\left(\eta_{i}||Q\right)$ (see Lemma 2). During the second phase, we move $\eta_{i}$ to $P_{i}$, which incurs an IPM-type transport cost $W^{\Gamma}\left(P_{i},\eta_{i}\right)$. The total cost is the sum of the two phases' costs and is optimized over all intermediate distributions $\eta_{i}$.

In particular, we can say more if both $f$ and $\psi$ are super-linear. By assumption, the $f$-divergence is given by (5) and we have $\left(\psi^{*}\right)^{-1}(+\infty)=+\infty$. This implies we require $\eta_{i}\ll Q$ to ensure the cost is finite. In other words, $\eta_{i}$ is a “continuous deformation” of $Q$ and cannot assign mass outside the support of $Q$. On the other hand, if we decompose $P_{i}$ into $P_{i}=P_{i}^{c}+P_{i}^{s}$, where $P_{i}^{c}\ll Q$ and $P_{i}^{s}\perp Q$, then all the mass of $P_{i}^{s}$ is transported during the second phase.

IV Special Cases and Examples

In this section, we demonstrate how a series of generalization bounds, including both PP-type and PE-type, can be derived from Theorem 1 and Corollary 1. For simplicity, we defer all proofs in this section to Appendix B.

IV-A Population-Empirical Generalization Bounds

This subsection bounds the PE generalization gap defined in (2). In particular, the PE bounds can be divided into two classes: IPM-type bounds and $f$-divergence-type bounds.

IV-A1 IPM-Type Bounds

From the Bayesian perspective, $Q_{W}$ is the prior distribution of the hypothesis and is thus fixed at the beginning. Technically, however, the derived generalization bounds hold for arbitrary $Q_{W}$, so we can optimize over $Q_{W}$ to further tighten them. In particular, set $Q_{W}=P_{W}$, $\eta_{i}=Q$, and let $\Gamma$ be the set of $(L_{W},L_{Z})$-Lipschitz functions. Applying Corollary 1 establishes the Wasserstein distance generalization bound.

Corollary 2 (Wasserstein Distance Bounds for Lipschitz Loss Functions).

If the loss function is $(L_{W},L_{Z})$-Lipschitz, i.e., $\ell$ is $L_{W}$-Lipschitz on $\mathcal{W}$ for all $z\in\mathcal{Z}$ and $L_{Z}$-Lipschitz on $\mathcal{Z}$ for all $w\in\mathcal{W}$, then we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq L_{Z}W_{1}(\nu,\mu)+\frac{L_{W}}{n}\sum_{i=1}^{n}\mathbb{E}_{\nu}\left[W_{1}\left(P_{W|Z_{i}},P_{W}\right)\right]. (27)

Set $Q_{W}=P_{W}$, $\eta_{i}=Q$, and $\Gamma=\left\{g:a\leq g\leq b\right\}$. Applying Corollary 1 establishes the total variation generalization bound.

Corollary 3 (Total Variation Bounds for Bounded Loss Functions).

If the loss function is uniformly bounded, $\ell\left(w,z\right)\in[a,b]$ for all $w\in\mathcal{W}$ and $z\in\mathcal{Z}$, then

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right) \leq\frac{b-a}{n}\sum_{i=1}^{n}\mathrm{TV}\left(P_{i},Q\right) (28)
\leq(b-a)\,\mathrm{TV}\left(\nu,\mu\right)+\frac{b-a}{n}\sum_{i=1}^{n}\mathbb{E}_{\nu}\left[\mathrm{TV}\left(P_{W|Z_{i}},P_{W}\right)\right]. (29)

Similar results have been obtained by others in the context of domain adaptation [26, Theorem 5.2 and Corollary 5.2] and transfer learning [24, Theorem 5 and Corollary 6]. In essence, these results are equivalent.

IV-A2 ff-Divergence-Type Bounds

Set $f(x)=x\log x-(x-1)$ and $\eta_{i}=P_{i}$. For $\sigma$-sub-Gaussian loss functions, we can choose $\psi(t)=\frac{1}{2}\sigma^{2}t^{2}$ and thus $\left(\psi^{*}\right)^{-1}(y)=\sqrt{2\sigma^{2}y}$. This recovers the KL-divergence generalization bound [25, 24, 26].
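For completeness, the closed form of this inverse follows directly from Lemma 2 applied to the quadratic $\psi$:

\left(\psi^{*}\right)^{-1}(y)=\inf_{t>0}\frac{y+\frac{1}{2}\sigma^{2}t^{2}}{t}=\sqrt{2\sigma^{2}y},

where the infimum is attained at $t=\sqrt{2y}/\sigma$.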

Corollary 4 (KL Bounds for sub-Gaussian Loss Functions).

If the loss function is $\sigma$-sub-Gaussian for all $w\in\mathcal{W}$, we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma^{2}\left(I(W;Z_{i})+D_{\mathrm{KL}}\left(\nu||\mu\right)\right)}, (30)

where $I(W;Z_{i})$ is the mutual information between $W$ and $Z_{i}$.

If the loss function is $(\sigma,c)$-sub-gamma, we can choose $\psi(t)=\frac{\sigma^{2}t^{2}}{2(1-ct)}$, $t\in[0,\frac{1}{c})$, and thus $\left(\psi^{*}\right)^{-1}(y)=\sqrt{2\sigma^{2}y}+cy$. In particular, the sub-Gaussian case corresponds to $c=0$.

Corollary 5 (KL Bounds for sub-gamma Loss Functions).

If the loss function is $(\sigma,c)$-sub-gamma for all $w\in\mathcal{W}$, we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\frac{1}{n}\sum_{i=1}^{n}\Bigl[\sqrt{2\sigma^{2}\left(I(W;Z_{i})+D_{\mathrm{KL}}\left(\nu||\mu\right)\right)}+c\bigl(I(W;Z_{i})+D_{\mathrm{KL}}\left(\nu||\mu\right)\bigr)\Bigr]. (31)

Setting $f(x)=(x-1)^{2}$ and $\eta_{i}=P_{i}$, we establish the $\chi^{2}$-divergence bound.

Corollary 6 ($\chi^{2}$ Bounds).

If the variance $\mathrm{Var}_{\mu}\,\ell\left(w,Z\right)\leq\sigma^{2}$ for all $w\in\mathcal{W}$, we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{\sigma^{2}\chi^{2}\left(P_{i}||Q\right)}. (32)

In particular, by the chain rule of the $\chi^{2}$-divergence, we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\frac{1}{n}\sum_{i=1}^{n}\sigma\sqrt{\Bigl(1+\sup_{z\in\mathcal{Z}}\chi^{2}\left(P_{W|Z_{i}=z}||Q_{W}\right)\Bigr)\bigl(1+\chi^{2}\left(\nu||\mu\right)\bigr)-1}. (33)

In the remaining part of this subsection, we focus on bounded loss functions. Thanks to Theorem 1, it suffices to find a convex upper bound $\psi(t)$ of the generalized CGF $\Lambda_{f;Q}\left(t\bar{\ell}\right)$. The following lemma shows that $\psi(t)$ can be taken quadratic if $f$ satisfies certain conditions.

Lemma 3 (Corollary 92 in [36]).

Suppose the loss function $\ell(w,z)\in[a,b]$ for all $w\in\mathcal{W}$, $z\in\mathcal{Z}$, and that $f$ is strictly convex and twice differentiable on its domain, thrice differentiable at 1, and

\dfrac{27f^{\prime\prime}(1)}{\left(3-xf^{\prime\prime\prime}(1)/f^{\prime\prime}(1)\right)^{3}}\leq f^{\prime\prime}(1+x), (34)

for all $x\geq-1$. Then $\Lambda_{f;Q}\left(t\bar{\ell}\right)\leq\frac{1}{2}\sigma_{f}^{2}t^{2}$, where $\sigma_{f}=\dfrac{b-a}{2\sqrt{f^{\prime\prime}(1)}}$.

In Table I, we summarize some common $f$-divergences and check whether condition (34) is satisfied. As a result of Lemma 3, we have the following corollary.

Corollary 7.

Let $\ell\left(w,z\right)\in[a,b]$ for all $w\in\mathcal{W}$ and $z\in\mathcal{Z}$. If $f$ satisfies the conditions in Lemma 3, we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_{f}^{2}D_{f}\left(P_{i}||Q\right)}. (35)

Some common $f$-divergences and the corresponding coefficients $\sigma_{f}$ are given in Table II.

TABLE I: Comparison Between $f$-Divergences
$f$-Divergence | $f(x)$ | Condition (34) holds?
$\alpha$-Divergence | $\dfrac{x^{\alpha}-\alpha x+\alpha-1}{\alpha(\alpha-1)}$ | Only for $\alpha\in[-1,2]$
$\chi^{2}$-Divergence | $(x-1)^{2}$ | Yes
KL-Divergence | $x\log x-(x-1)$ | Yes
Squared Hellinger | $(\sqrt{x}-1)^{2}$ | Yes
Reversed KL | $-\log x+x-1$ | Yes
Jensen-Shannon (with parameter $\theta$) | $\theta x\log x-(\theta x+1-\theta)\log(\theta x+1-\theta)$ | Yes
Le Cam | $\dfrac{1-x}{2(1+x)}+\dfrac{1}{4}(x-1)$ | Yes

  • All $f$ in Table I are set to be standard, i.e., $f^{\prime}(1)=f(1)=0$.

  • Both the $\chi^{2}$-divergence and the squared Hellinger divergence are $\alpha$-divergences, up to a multiplicative constant. In particular, we have $\chi^{2}=2D_{2}$ and $H^{2}=\frac{1}{2}D_{1/2}$. The $\theta$-Jensen-Shannon divergence has the form $D_{\mathrm{JS}(\theta)}(P||Q)=\theta D_{\mathrm{KL}}\left(P||R(\theta)\right)+(1-\theta)D_{\mathrm{KL}}\left(Q||R(\theta)\right)$, where $R(\theta)\coloneq\theta P+(1-\theta)Q$ and $\theta\in(0,1)$. The classical Jensen-Shannon divergence corresponds to $\theta=1/2$.

TABLE II: Correspondence of $D_{f}$ and $\sigma_{f}$
$D_{f}$ | $D_{\alpha}$ ($\alpha\in[-1,2]$) | KL | $\chi^{2}$ | $H^{2}$ | Reversed KL | JS($\theta$) | Le Cam
$\sigma_{f}$ | $\dfrac{b-a}{2}$ | $\dfrac{b-a}{2}$ | $\dfrac{b-a}{2\sqrt{2}}$ | $\dfrac{b-a}{\sqrt{2}}$ | $\dfrac{b-a}{2}$ | $\dfrac{b-a}{2\sqrt{\theta(1-\theta)}}$ | $b-a$

We end this subsection with some remarks. First, since Corollary 3 also considers the bounded loss function, it is natural to ask whether we can compare (28) and (35). The answer is affirmative and we always have

\mathrm{TV}\left(P_{i},Q\right)\leq\sqrt{2\sigma_{f}^{2}D_{f}\left(P_{i}||Q\right)}. (36)

This Pinsker-type inequality is given in [36]. Thus the bound in (28) is always tighter than that in (35). Secondly, as previously mentioned, we can optimize $Q_{W}$ to tighten the bounds. In some examples, e.g., the KL-bound, the optimum is achieved at $Q_{W}=P_{W}$, but this is not always the case, e.g., for the $\chi^{2}$-bound. Lastly, all the results derived in this subsection encompass in-distribution generalization as a special case, obtained by simply setting $\nu=\mu$. If we further set $Q_{W}=P_{W}$, then we obtain a series of in-distribution generalization bounds by replacing $D_{f}\left(P_{i}||Q\right)$ with $I_{f}(W;Z_{i})$, the $f$-mutual information between $W$ and $Z_{i}$.

IV-B Population-Population Generalization Bounds

By setting $Q_{W}=P_{W}$, $\eta_{i}=P_{W}\otimes\nu$, and $\bar{\Gamma}=\{\bar{\ell}\}$, Theorem 1 specializes to a family of $f$-divergence-type PP generalization bounds. See Appendix B-E for the proof.

Corollary 8 (PP Generalization Bounds).

Let $\psi$ be defined as in Theorem 1. If $\Lambda_{f;Q}\left(t\bar{\ell}(W,Z)\right)\leq\psi(t)$, then we have

\mathrm{\widetilde{gen}}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\left(\psi^{*}\right)^{-1}\left(D_{f}\left(\nu||\mu\right)\right). (37)

By Corollary 8, each $f$-divergence-type PE bound provided in Section IV-A2 possesses a PP generalization bound counterpart, with $D_{f}\left(P_{i}||Q\right)$ replaced by $D_{f}\left(\nu||\mu\right)$. In particular, in the KL case we recover the result in [26, Theorem 4.1] if the loss function is $\sigma$-sub-Gaussian:

|\mathrm{\widetilde{gen}}\left(P_{W|Z^{n}},\nu,\mu\right)|\leq\sqrt{2\sigma^{2}D_{\mathrm{KL}}\left(\nu||\mu\right)}, (38)

where the absolute value comes from the symmetry of the sub-Gaussian distribution. The remaining PP generalization bounds are summarized in Table III.

TABLE III: $f$-Divergence Bounds of the PP Generalization Gap
Assumptions | PP Generalization Bounds
$\ell$ is $(\sigma,c)$-sub-gamma | $\sqrt{2\sigma^{2}D_{\mathrm{KL}}\left(\nu||\mu\right)}+cD_{\mathrm{KL}}\left(\nu||\mu\right)$
$\mathrm{Var}_{\mu}\,\ell\left(w,Z\right)\leq\sigma^{2}$, $\forall w\in\mathcal{W}$ | $\sqrt{\sigma^{2}\chi^{2}\left(\nu||\mu\right)}$
$\ell\in[a,b]$, $\alpha\in[-1,2]$ | $(b-a)\sqrt{D_{\alpha}(\nu||\mu)/2}$
$\ell\in[a,b]$ | $(b-a)\sqrt{H^{2}(\nu||\mu)}$
$\ell\in[a,b]$ | $(b-a)\sqrt{D_{\mathrm{KL}}\left(\mu||\nu\right)/2}$
$\ell\in[a,b]$ | $(b-a)\sqrt{\frac{D_{\mathrm{JS}(\theta)}(\nu||\mu)}{2\theta(1-\theta)}}$
$\ell\in[a,b]$ | $(b-a)\sqrt{2D_{\mathrm{LC}}(\nu||\mu)}$
Remark 3.

Corollary 8 coincides with the previous result of [36], which studies the optimal bounds between $f$-divergences and IPMs. Specifically, the authors of [36] proved that $\Lambda_{f;Q}\left(tg\right)-t\mathbb{E}_{Q}\left[g\right]\leq\psi(t)$ if and only if $D_{f}\left(P||Q\right)\geq\psi^{*}(\mathbb{E}_{P}\left[g\right]-\mathbb{E}_{Q}\left[g\right])$. In our context, $g$ is replaced with $\bar{\ell}$ and thus $\mathbb{E}_{Q}\left[g\right]=0$. Thus, Corollary 8 can be regarded as an application of the general result of [36] to the OOD setting.

IV-C Examples

In this subsection, we examine the above results in the context of two simple “learning” problems. We demonstrate that our new $\chi^{2}$-bound surpasses the existing KL-bound in the first example, and that the recovered TV-bound is the tightest in the second example.

IV-C1 Estimating the Gaussian Mean

Consider the task of estimating the mean of Gaussian random variables. We assume the training samples come from the distribution $\mathcal{N}(m,\sigma^{2})$, and the testing distribution is $\mathcal{N}(m^{\prime},(\sigma^{\prime})^{2})$. We define the loss function as $\ell\left(w,z\right)=(w-z)^{2}$; then the ERM algorithm yields the estimate $w=\frac{1}{n}\sum_{i=1}^{n}z_{i}$. Under the above settings, the loss function is sub-Gaussian with parameter $2((\sigma^{\prime})^{2}+\sigma^{2}/n)$, and thus Corollary 4 and Corollary 6 apply. The known KL-bounds and the newly derived $\chi^{2}$-bounds are compared in Fig. 1(a) and Fig. 1(b), where we set $(m,\sigma^{2})=(1,1)$. In Fig. 1(a) the two bounds are compared under the in-distribution setting, i.e., $m^{\prime}=m$ and $\sigma^{\prime}=\sigma$. A rigorous analysis in Appendix B-F shows that both the $\chi^{2}$- and KL-bounds decay at the rate $\mathcal{O}(1/\sqrt{n})$, while the true generalization gap decays at the rate $\mathcal{O}(1/n)$. This additional square root comes from the $(\psi^{*})^{-1}$ term in Theorem 1. Moreover, the KL-bound has the form $c\sqrt{\log(1+\frac{1}{n})}$ while the $\chi^{2}$-bound has the form $c\sqrt{1/n}$. Thus the KL-bound is tighter than the $\chi^{2}$-bound, and they are asymptotically equivalent as $n\to\infty$. On the other hand, we compare the OOD case in Fig. 1(b), where we set $m^{\prime}=1$ and $(\sigma^{\prime})^{2}=2$. We observe that the $\chi^{2}$-bound is tighter than the KL-bound from the very beginning. In Fig. 1(c), we fix the “covariance error”, i.e., $\|\mathbf{\Sigma}^{\prime}-\mathbf{\Sigma}\|_{\mathrm{F}}=(\sigma^{\prime})^{2}-\sigma^{2}=1$, and extend this example to $d$ dimensions with $d=8$. We observe that the $\chi^{2}$-bound is tighter once $n$ is sufficiently large, and more samples are needed in the higher-dimensional case for the $\chi^{2}$-bound to surpass the KL-bound. Mathematically, by comparing the $\chi^{2}$-bound (32) and the KL-bound (30), we conclude that the $\chi^{2}$-bound is tighter than the KL-bound whenever $\chi^{2}\left(P_{i}||Q\right)<2D_{\mathrm{KL}}\left(P_{i}||Q\right)$, since the variance of a random variable is no more than its sub-Gaussian parameter. See Appendix B-F for more details.

Figure 1: Generalization Bounds of Estimating Gaussian Means. (a) In-distribution, $m=1$, $\sigma^{2}=1$, $d=1$. (b) OOD, $m^{\prime}=1$, $(\sigma^{\prime})^{2}=2$, $d=1$. (c) OOD, $\mathbf{m}^{\prime}=\mathbf{m}$, $\|\mathbf{\Sigma}^{\prime}-\mathbf{\Sigma}\|_{\mathrm{F}}=1$, $d=8$.
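To make the comparison concrete, the following sketch (ours, not the authors' experiment code) evaluates only the distribution-shift penalties $D_{\mathrm{KL}}\left(\nu||\mu\right)$ and $\chi^{2}\left(\nu||\mu\right)$ for the one-dimensional Gaussian setting above; the full curves in Fig. 1 additionally involve the posterior terms $P_{W|Z_{i}}$, which are omitted here.

```python
# Simplified sketch (illustrative only): distribution-shift penalties for
# nu = N(m, s2) and mu = N(m', s2p). The full bounds in Fig. 1 also involve
# the posterior P_{W|Z_i}, which this sketch does not reproduce.
import numpy as np

def kl_gauss(m0, s0_sq, m1, s1_sq):
    # D_KL( N(m0, s0^2) || N(m1, s1^2) )
    return 0.5 * (np.log(s1_sq / s0_sq) + (s0_sq + (m0 - m1) ** 2) / s1_sq - 1.0)

def chi2_gauss(m0, s0_sq, m1, s1_sq):
    # chi^2( N(m0, s0^2) || N(m1, s1^2) ); finite only when 2*s1^2 > s0^2
    denom = 2.0 * s1_sq - s0_sq
    if denom <= 0:
        return np.inf
    return s1_sq / np.sqrt(s0_sq * denom) * np.exp((m0 - m1) ** 2 / denom) - 1.0

m, s2 = 1.0, 1.0      # training distribution nu = N(1, 1)
mp, s2p = 1.0, 2.0    # testing distribution mu = N(1, 2), as in Fig. 1(b)

kl, chi2 = kl_gauss(m, s2, mp, s2p), chi2_gauss(m, s2, mp, s2p)
print("D_KL(nu||mu) :", kl)
print("chi^2(nu||mu):", chi2)
# The comparison chi^2 < 2*D_KL discussed above (there stated for P_i and Q):
print("chi^2 < 2*D_KL:", chi2 < 2 * kl)
```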

IV-C2 Estimating the Bernoulli Mean

Consider the previous example where the Gaussian distribution is replaced with the Bernoulli distribution. We assume the training samples are generated from the distribution $(\mathrm{Bern}(p))^{\otimes n}$ and the test data follow $\mathrm{Bern}(p^{\prime})$. Again we define the loss function as $\ell\left(w,z\right)=(w-z)^{2}$ and choose the estimate $w=\frac{1}{n}\sum_{i=1}^{n}z_{i}$.

Under the above settings, the loss function is bounded with $\ell\in[0,1]$. If we define the Hamming distance over $\mathcal{W}$ and $\mathcal{Z}$, then the total variation bound coincides with the Wasserstein distance bound. In Fig. 2(a), we set $p=0.3$ and $p^{\prime}=0.1$, and we see that the squared Hellinger, Jensen-Shannon, and Le Cam bounds are tighter than the KL-bound. On the contrary, the $\chi^{2}$- and $\alpha$-divergence bounds are tighter than the KL-bound in Fig. 2(b), where we set $p=0.3$ and $p^{\prime}=0.5$. But all these $f$-divergence-type generalization bounds are looser than the total variation bound, as implied by (36). Additionally, there exists an approximately monotone relationship between the $\chi^{2}$-divergence, $\alpha$-divergence ($\alpha=3/2$), KL-divergence, and squared Hellinger bounds. This is because all of them are of $\alpha$-divergence type, with the KL-divergence corresponding to $\alpha=1$ (strictly speaking, $D_{\mathrm{KL}}=R_{1}$, the Rényi-$\alpha$-divergence with $\alpha=1$, and $R_{\alpha}$ is a $\log$-transformation of the $\alpha$-divergence). Moreover, we observe that the Le Cam bound is always tighter than the Jensen-Shannon bound. This is because the generator $f$ of Le Cam is smaller than that of Jensen-Shannon, while they share the same coefficient $\sigma_{f}=1$. Finally, we consider the extreme case in Fig. 2(c), where $n=10$, $p=0.6$, and we let $p^{\prime}$ decay to 0. When $p^{\prime}$ is sufficiently small, the KL-bound (along with the $\alpha$-divergence ($\alpha=3/2$) and $\chi^{2}$-bounds) exceeds $1$ and thus becomes vacuous. In contrast, the squared Hellinger, Jensen-Shannon, Le Cam, and total variation bounds do not suffer from this problem. See Appendix B-F for derivations.

Figure 2: Generalization Bounds of Estimating Bernoulli Means. (a) $p=0.3$, $p^{\prime}=0.1$. (b) $p=0.3$, $p^{\prime}=0.5$. (c) $n=10$, $p=0.6$.
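The following sketch (ours, not the paper's experiment code) reproduces only the distribution-shift penalties $\sqrt{2\sigma_{f}^{2}D_{f}\left(\nu||\mu\right)}$ for $\nu=\mathrm{Bern}(p)$, $\mu=\mathrm{Bern}(p^{\prime})$ and a loss bounded in $[0,1]$, which is enough to see the ordering among the divergences and the TV term; the full Fig. 2 curves also include posterior terms that are omitted here.

```python
# Illustrative sketch: penalties sqrt(2 * sigma_f^2 * D_f(nu || mu)) for
# nu = Bern(p), mu = Bern(p'), loss in [a, b] = [0, 1], with sigma_f taken
# from Table II via sigma_f = (b - a) / (2 * sqrt(f''(1))).
import numpy as np

def f_divergence(p, q, f):
    """D_f(Bern(p) || Bern(q)) for a standard generator f."""
    P, Q = np.array([1 - p, p]), np.array([1 - q, q])
    return float(np.sum(Q * f(P / Q)))

# standard generators (f(1) = f'(1) = 0) and their f''(1), cf. Tables I and II
generators = {
    "KL":        (lambda x: x * np.log(x) - (x - 1),               1.0),
    "chi2":      (lambda x: (x - 1) ** 2,                          2.0),
    "Hellinger": (lambda x: (np.sqrt(x) - 1) ** 2,                 0.5),
    "JS(1/2)":   (lambda x: 0.5 * x * np.log(x)
                  - 0.5 * (x + 1) * np.log(0.5 * (x + 1)),         0.25),
    "Le Cam":    (lambda x: (1 - x) / (2 * (1 + x)) + (x - 1) / 4, 0.25),
}

p, p_prime, b_minus_a = 0.3, 0.1, 1.0      # the setting of Fig. 2(a)
for name, (f, fpp1) in generators.items():
    D = f_divergence(p, p_prime, f)
    sigma_f_sq = (b_minus_a / (2 * np.sqrt(fpp1))) ** 2
    print(f"{name:10s} penalty = {np.sqrt(2 * sigma_f_sq * D):.4f}")
print(f"{'TV':10s} penalty = {b_minus_a * abs(p - p_prime):.4f}  (the smallest, consistent with (36))")
```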

V The CMI Method and its Improvement

In this section, we study OOD generalization for bounded loss functions using the CMI methods. We first recall some CMI results in Subsection V-A and then improve the state-of-the-art result in Subsection V-B using $f$-divergence. Unless otherwise stated, we assume that $\ell\left(w,z\right)\in[a,b]$ for all $w\in\mathcal{W}$, $z\in\mathcal{Z}$.

V-A An Overview of CMI-based In-Distribution Generalization Bounds

Instead of using training samples $Z^{n}=(Z_{1},\ldots,Z_{n})$ generated from the distribution $\nu^{\otimes n}$, the CMI method generates a set of super-samples $Z^{n,\pm}=(Z_{1}^{\pm},\ldots,Z_{n}^{\pm})$ from the distribution $\nu^{\otimes 2n}$, where $Z_{i}^{\pm}=(Z_{i}^{+},Z_{i}^{-})$ consists of a positive sample $Z_{i}^{+}$ and a negative sample $Z_{i}^{-}$. During training, the CMI framework randomly chooses $Z_{i}^{+}$ or $Z_{i}^{-}$ as the $i$-th training datum $Z_{i}$ with the help of a Rademacher random variable $R_{i}$, i.e., $R_{i}$ takes the value $1$ or $-1$ with equal probability $1/2$. Specifically, we choose $Z_{i}=Z_{i}^{+}$ if $R_{i}=1$ and $Z_{i}=Z_{i}^{-}$ otherwise, as illustrated by the sketch below.
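The following sketch (our illustration, with arbitrary data) only shows how the Rademacher variables select the training set from the super-samples.

```python
# Illustrative super-sample construction for the CMI framework: draw 2n i.i.d.
# samples, pair them as (Z_i^+, Z_i^-), and use Rademacher variables R_i to
# pick the i-th training datum.
import numpy as np

rng = np.random.default_rng(0)
n = 8
z_plus = rng.normal(size=n)          # positive super-samples Z_i^+
z_minus = rng.normal(size=n)         # negative super-samples Z_i^-
r = rng.choice([-1, 1], size=n)      # Rademacher identifiers R_i

z_train = np.where(r == 1, z_plus, z_minus)   # Z_i = Z_i^+ if R_i = 1, else Z_i^-
print("R_i    :", r)
print("Z_train:", z_train)
```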

Based on the above construction, [10] established the first in-distribution CMI generalization bound using the mutual information between the output hypothesis $W$ and the identifiers $R^{n}$, conditioned on the super-samples $Z^{n,\pm}$:

\mathrm{gen}\left(P_{W|Z^{n}},\nu\right)\leq(b-a)\sqrt{\frac{2I\left(W;R^{n}|Z^{n,\pm}\right)}{n}}. (39)

Here we use $\mathrm{gen}\left(P_{W|Z^{n}},\nu\right)$ to denote the expected in-distribution generalization gap, obtained by setting $\mu=\nu$. The CMI bound can be improved by combining it with the ‘individual’ technique of [9]. In particular, the Conditionally Individual Mutual Information (CIMI) bound [11] is obtained by individualizing only on the data identifier $R_{i}$,

\mathrm{gen}\left(P_{W|Z^{n}},\nu\right)\leq\frac{b-a}{n}\sum_{i=1}^{n}\sqrt{2I\left(W;R_{i}|Z^{n,\pm}\right)}, (40)

and the Individually Conditional Individual Mutual Information (ICIMI) bound [15] is obtained by individualizing on both $R_{i}$ and the super-sample $Z_{i}^{\pm}$,

\mathrm{gen}\left(P_{W|Z^{n}},\nu\right)\leq\frac{b-a}{n}\sum_{i=1}^{n}\sqrt{2I\left(W;R_{i}|Z_{i}^{\pm}\right)}. (41)

It is natural to ask which of these generalization bounds is tighter. If we only consider the mutual information terms in these generalization bounds, it follows that $\mathrm{ICIMI}\leq\mathrm{CIMI}\leq\mathrm{CMI}\leq\mathrm{MI}$ and $\mathrm{ICIMI}\leq\mathrm{IMI}\leq\mathrm{MI}$ [15]. This is mathematically written as

  1. $I\left(W;R_{i}|Z_{i}^{\pm}\right)\leq I\left(W;R_{i}|Z^{n,\pm}\right)\leq I\left(W;R^{n}|Z^{n,\pm}\right)\leq I(W;Z^{n})$.

  2. $I\left(W;R_{i}|Z_{i}^{\pm}\right)\leq I(W;Z_{i})\leq I(W;Z^{n})$.

When it comes to the generalization bounds themselves, it was proved in [11] that the CIMI bound is tighter than the CMI bound. It follows that the ICIMI bound is the best among these CMI-based bounds. However, the ICIMI bound is not necessarily better than the IMI bound. Since a bounded random variable $X\in[a,b]$ is $\frac{b-a}{2}$-sub-Gaussian, the IMI bound reduces to

\mathrm{gen}\left(P_{W|Z^{n}},\nu\right)\leq\frac{b-a}{n}\sum_{i=1}^{n}\sqrt{\frac{I(W;Z_{i})}{2}}, (42)

for bounded loss functions $\ell\in[a,b]$. As a consequence, the ICIMI bound is tighter than the IMI bound only if $I\left(W;R_{i}|Z_{i}^{\pm}\right)<\frac{1}{4}I(W;Z_{i})$, which does not necessarily hold in general.

V-B f-ICIMI Generalization Bounds

We note that all the above generalization bounds were established for the in-distribution case. When extending to the OOD case, however, the idea of conditioning on super-samples no longer works, since the disagreement between the training distribution $\nu$ and the test distribution $\mu$ breaks the symmetry of the expected generalization gap. To tackle this problem, we decompose the OOD generalization gap as

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right) =\mathbb{E}\left[\mathbb{E}_{\mu}\left[\ell\left(W,Z\right)\right]-\mathbb{E}_{\nu}\left[\ell\left(W,Z\right)\right]+\mathbb{E}_{\nu}\left[\ell\left(W,Z\right)\right]-\frac{1}{n}\sum_{i=1}^{n}\ell\left(W,Z_{i}\right)\right] (43)
=\mathrm{\widetilde{gen}}\left(P_{W|Z^{n}},\nu,\mu\right)+\mathrm{gen}\left(P_{W|Z^{n}},\nu\right). (44)

Here the outer expectation is taken w.r.t. the joint distribution of $W$ and $Z^{n}$. The first term of (44) is the PP generalization gap and thus can be bounded using the total variation or by $\sqrt{2\sigma_{f}^{2}D_{f}\left(\nu||\mu\right)}$ via Corollary 8. The second term of (44) is the in-distribution generalization gap and thus can be bounded by the CMI-based methods introduced above. In particular, by combining the $f$-divergence with the ICIMI method, we obtain the following improved generalization bound. The proof is postponed to the end of this subsection.

Theorem 2 ($f$-ICIMI Generalization Bound).

Let the loss function $\ell(w,z)\in[a,b]$ for all $w\in\mathcal{W}$, $z\in\mathcal{Z}$, and let $f$ be a generator of some $f$-divergence satisfying the condition in Lemma 3. Then, we have

\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right) \leq(b-a)\,\mathrm{TV}\left(\nu,\mu\right)+\frac{b-a}{n}\sum_{i=1}^{n}\mathbb{E}_{Z_{i}^{\pm}}\left[\sqrt{\frac{2I_{f}^{Z_{i}^{\pm}}\left(W;R_{i}\right)}{f^{\prime\prime}(1)}}\,\right] (45)
\leq(b-a)\,\mathrm{TV}\left(\nu,\mu\right)+\frac{b-a}{n}\sum_{i=1}^{n}\sqrt{\frac{2I_{f}\left(W;R_{i}|Z_{i}^{\pm}\right)}{f^{\prime\prime}(1)}}. (46)

In the above, $I_{f}$ denotes the $f$-mutual information and $I_{f}^{z}$ denotes the disintegrated $f$-mutual information, i.e., $I_{f}^{z}\left(X;Y\right)=I_{f}\left(X;Y|Z=z\right)$.

For the in-distribution case, we recover the ICIMI bound by setting $f=f_{\mathrm{KL}}\coloneqq x\log x$, the generator of the KL-divergence. Meanwhile, there may exist other choices of $f$ leading to a tighter generalization bound (45) or (46). This is equivalent to finding an $f$ that satisfies $\frac{I_{f}\left(W;R_{i}|Z_{i}^{\pm}\right)}{f^{\prime\prime}(1)}\leq\frac{I_{\mathrm{KL}}\left(W;R_{i}|Z_{i}^{\pm}\right)}{f_{\mathrm{KL}}^{\prime\prime}(1)}$. Nonetheless, the existence of such an $f$ is nontrivial, since there is a trade-off between $I_{f}\left(W;R_{i}|Z_{i}^{\pm}\right)$ and $1/f^{\prime\prime}(1)$; see Fig. 3. On the one hand, the $f$-mutual information is essentially an expectation of the function $f$, hence a “steep” $f$ (like the generator of the KL- or $\chi^{2}$-divergence) yields a larger $I_{f}\left(W;R_{i}|Z_{i}^{\pm}\right)$ term. On the other hand, $f^{\prime\prime}(1)$ reflects the curvature of the graph of $f$ at the point $(1,0)$, hence a steep $f$ also results in a larger $f^{\prime\prime}(1)$ term. Similarly, for a “flat” $f$ (like the generator of the Jensen-Shannon or squared Hellinger divergence), both the $f$-mutual information and the curvature become smaller. This makes selecting $f$ to minimize the quotient $I_{f}\left(W;R_{i}|Z_{i}^{\pm}\right)/f^{\prime\prime}(1)$ a nontrivial task. However, Theorem 2 provides more options beyond the KL divergence, allowing us to identify a better choice in certain scenarios, which we will discuss in the next section.

Figure 3: Graphs of some common $f$: a trade-off between $I_{f}\left(W;R_{i}|Z_{i}^{\pm}\right)$ and $1/f^{\prime\prime}(1)$.
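The following toy computation (ours; the joint distribution is hypothetical) illustrates the trade-off numerically: for a $2\times 2$ joint distribution of $(W,R)$ we evaluate $I_{f}$ and the quotient $I_{f}/f^{\prime\prime}(1)$ for several generators.

```python
# Toy illustration of the trade-off between I_f and 1/f''(1). The 2x2 joint
# distribution of a binary W and a Rademacher R below is hypothetical.
import numpy as np

P_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
P_w = P_joint.sum(axis=1, keepdims=True)
P_r = P_joint.sum(axis=0, keepdims=True)
P_prod = P_w @ P_r                               # product of marginals

generators = {
    "KL":        (lambda x: x * np.log(x) - (x - 1), 1.0),
    "chi2":      (lambda x: (x - 1) ** 2,            2.0),
    "Hellinger": (lambda x: (np.sqrt(x) - 1) ** 2,   0.5),
}

for name, (f, fpp1) in generators.items():
    # I_f(W; R) = D_f(P_{WR} || P_W x P_R)
    I_f = float(np.sum(P_prod * f(P_joint / P_prod)))
    print(f"{name:10s} I_f = {I_f:.4f},  I_f / f''(1) = {I_f / fpp1:.4f}")
# Steeper generators give a larger I_f but also a larger f''(1); which
# quotient is smallest depends on the joint distribution.
```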
Proof of Theorem 2.

By the decomposition (44), it suffices to consider the expected in-distribution generalization gap, which can be rewritten as

\mathrm{gen}\left(P_{W|Z^{n}},\nu\right) =\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{Z_{i}^{\pm}}\left[\mathbb{E}_{W,R_{i}|Z_{i}^{\pm}}\left[R_{i}\left(\ell\left(W,Z_{i}^{-}\right)-\ell\left(W,Z_{i}^{+}\right)\right)\right]\right] (47)
\leq\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{Z_{i}^{\pm}}\left[\inf_{t>0}\frac{D_{f}\left(P_{W,R_{i}|Z_{i}^{\pm}}||Q_{W,R_{i}|Z_{i}^{\pm}}\right)+\Lambda_{f;Q_{W,R_{i}|Z_{i}^{\pm}}}\left(tR_{i}\left(\ell\left(W,Z_{i}^{-}\right)-\ell\left(W,Z_{i}^{+}\right)\right)\right)}{t}\right]. (48)

Choose $Q_{W,R_{i}|Z_{i}^{\pm}}=P_{W|Z_{i}^{\pm}}P_{R_{i}|Z_{i}^{\pm}}$, so that $D_{f}\left(P_{W,R_{i}|Z_{i}^{\pm}}||Q_{W,R_{i}|Z_{i}^{\pm}}\right)=I_{f}^{Z_{i}^{\pm}}\left(W;R_{i}\right)$. By the assumption $\ell\in[a,b]$, we have $R_{i}\left(\ell\left(W,Z_{i}^{-}\right)-\ell\left(W,Z_{i}^{+}\right)\right)\in[-(b-a),b-a]$, and thus Lemma 3 gives

\Lambda_{f;Q_{W,R_{i}|Z_{i}^{\pm}}}\left(tR_{i}\left(\ell\left(W,Z_{i}^{-}\right)-\ell\left(W,Z_{i}^{+}\right)\right)\right)\leq\frac{(b-a)^{2}t^{2}}{2f^{\prime\prime}(1)}. (49)

As a consequence, we have

gen(PW|Zn,ν)\displaystyle\mathrm{gen}\left(P_{W|Z^{n}},\nu\right) 1ni=1n𝔼Zi±[inft>01t(IfZi±(W;Ri)+(ba)2t22f′′(1))]\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{Z_{i}^{\pm}}\left[\inf_{t>0}\frac{1}{t}\left(I_{f}^{Z_{i}^{\pm}}\left(W;R_{i}\right)+\frac{(b-a)^{2}t^{2}}{2f^{\prime\prime}(1)}\right)\right] (50)
=(ba)ni=1n𝔼Zi±[2IfZi±(W;Ri)f′′(1)]\displaystyle=\frac{(b-a)}{n}\sum_{i=1}^{n}\mathbb{E}_{Z_{i}^{\pm}}\left[\sqrt{\frac{2I_{f}^{Z_{i}^{\pm}}\left(W;R_{i}\right)}{f^{\prime\prime}(1)}}\ \right] (51)
(ba)ni=1n2If(W;Ri|Zi±)f′′(1),\displaystyle\leq\frac{(b-a)}{n}\sum_{i=1}^{n}\sqrt{\frac{2I_{f}\left(W;R_{i}|Z_{i}^{\pm}\right)}{f^{\prime\prime}(1)}}, (52)

where the last inequality follows by Jensen’s inequality. ∎

VI Applications to the SGLD Algorithm

In this section, we demonstrate an application of the preceding results to the stochastic gradient Langevin dynamics (SGLD) algorithm [41]. SGLD is a learning algorithm that combines stochastic gradient descent (SGD), a Robbins–Monro optimization algorithm [42], with Langevin dynamics, a mathematical extension of molecular dynamics models. Essentially, each iteration of SGLD is an SGD update with added isotropic Gaussian noise. The noise enables SGLD to escape local minima and asymptotically converge to global minima for sufficiently regular non-convex loss functions. The wide application of SGLD in machine learning has prompted a series of studies on its generalization performance [9, 15, 13, 11, 14]. The remaining part of this section is organized as follows. After a brief introduction to SGLD in VI-A, we exhibit improved SGLD generalization bounds in the following three subsections. Motivated by the observation in IV-C that the TV-based bound is the tightest, we establish a refined SGLD generalization bound in VI-B and show its advantages whenever the Lipschitz constant of the loss function is large. Next, we turn to applying the CMI methods to SGLD. In VI-C, an ff-ICIMI type generalization bound for SGLD is established on the basis of the squared Hellinger divergence. Due to the nature of the chain rule of the squared Hellinger divergence, this generalization bound is superior under the without-replacement sampling scheme, but may become worse when sampling with replacement is allowed. The latter problem is remedied in VI-D, where we utilize a new technique, the sub-additivity of the ff-divergence, to establish an asymptotically tighter generalization bound for SGLD.

VI-A Stochastic Gradient Langevin Dynamics

The SGLD algorithm starts with an initialization W0W_{0} and iteratively updates WtW_{t} through a noisy and stochastic gradient descent process. Mathematically, we write

Wt=Wt1ηtW(Wt1,ZUt)+σtξt,t[T],W_{t}=W_{t-1}-\eta_{t}\nabla_{W}\ell(W_{t-1},Z_{U_{t}})+\sigma_{t}\xi_{t},\ t\in[T], (53)

where ξt𝒩(𝟎,𝐈)\xi_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) is the standard Gaussian random vector and σt\sigma_{t} is the scale parameter. In the standard SGLD setting, we have σt=2ηtβ1\sigma_{t}=\sqrt{2\eta_{t}\beta^{-1}}, where β\beta denotes the inverse temperature. Additionally, Ut[n]U_{t}\in[n] specifies which datum is used for calculating the gradient in the tt-th iteration (i.e., Ut=iU_{t}=i if ZiZ_{i} is taken), and finally the algorithm outputs WTW_{T} after TT iterations. We say U[T]=(U1,,UT)U_{[T]}=(U_{1},\ldots,U_{T}) is a sample path, which is generated by some random process such as sampling over [n][n] with replacement. Let 𝒯i\mathcal{T}_{i} be the set of steps tt at which ZiZ_{i} is used for updating WtW_{t}. Then, 𝒯i\mathcal{T}_{i} is a deterministic function of the sample path U[T]U_{[T]}, and we write 𝒯i=𝒯i(U[T])\mathcal{T}_{i}=\mathcal{T}_{i}(U_{[T]}).
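
For concreteness, the following is a minimal Python sketch of the update rule (53) on a toy quadratic loss; the loss, the dimension, the step-size schedule, and the inverse temperature are illustrative choices of ours rather than settings used in the analysis.

import numpy as np

rng = np.random.default_rng(0)

# Toy data and squared-error loss ell(w, z) = 0.5 * ||w - z||^2 (illustrative choice).
n, d, T = 100, 5, 500
Z = rng.normal(loc=1.0, scale=1.0, size=(n, d))      # training samples Z_1, ..., Z_n
grad = lambda w, z: w - z                            # gradient of the toy loss w.r.t. w

beta = 10.0                                          # inverse temperature
w = np.zeros(d)                                      # initialization W_0
for t in range(1, T + 1):
    eta_t = 0.1 / t                                  # step-size schedule (illustrative)
    sigma_t = np.sqrt(2.0 * eta_t / beta)            # standard SGLD noise scale
    u_t = rng.integers(n)                            # U_t: index sampled with replacement
    xi_t = rng.standard_normal(d)                    # isotropic Gaussian noise
    w = w - eta_t * grad(w, Z[u_t]) + sigma_t * xi_t # update (53)

print("W_T ~", np.round(w, 3))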

The generalization of the SGLD algorithm was also studied in [9, 15], which we will discuss later. Other generalization analyses of the non-stochastic Langevin dynamics and the mini-batch SGLD are given in [13, 11, 14], where a data-dependent estimate of the prior distribution is used; their results are not directly comparable to ours.

VI-B Generalization Bounds on SGLD for Bounded and Lipschitz Loss Functions

The authors of [9] utilized the IMI to prove that, if the loss function is σ\sigma-sub-Gaussian and LL-Lipschitz, the in-distribution generalization gap of the SGLD algorithm can be bounded by

gen(PW|Zn,ν)σni=1n𝔼U[T][t𝒯i(U[T])ηt2L2σt2].\mathrm{gen}\left(P_{W|Z^{n}},\nu\right)\leq\frac{\sigma}{n}\sum_{i=1}^{n}\mathbb{E}_{U_{[T]}}\left[\sqrt{\sum_{t\in\mathcal{T}_{i}(U_{[T]})}\frac{\eta_{t}^{2}L^{2}}{\sigma_{t}^{2}}}\right]. (54)

When the loss function is bounded in [a,b][a,b], one can simply replace the σ\sigma in (54) with (ba)/2(b-a)/2. However, as we saw in IV-C, the total variation yields a tighter bound than the KL-based one. Thus, the above result may be further improved. To this end, we invoke Corollary 3 and obtain the following result under the OOD setting. The proof is postponed to Appendix C-A.

Theorem 3.

Let WW be the output of the SGLD algorithm as defined in (53). Suppose the loss function is LL-Lipschitz and bounded in [a,b][a,b]. We have

gen(PW|Zn,ν,μ)(ba)TV(ν,μ)+bani=1n𝔼U[T][min{12t𝒯i(U[T])ηt2L2σt2,1exp(t𝒯i(U[T])ηt2L22σt2)}].\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq(b-a)\mathrm{TV}\left(\nu,\mu\right)+\frac{b-a}{n}\sum_{i=1}^{n}\mathbb{E}_{U_{[T]}}\left[\min\left\{\frac{1}{2}\sqrt{\sum_{t\in\mathcal{T}_{i}(U_{[T]})}\frac{\eta_{t}^{2}L^{2}}{\sigma_{t}^{2}}},\sqrt{1-\exp\left(-\sum_{t\in\mathcal{T}_{i}(U_{[T]})}\frac{\eta_{t}^{2}L^{2}}{2\sigma_{t}^{2}}\right)}\right\}\right]. (55)

The generalization bound (55) consists of two parts. The first part, (ba)TV(ν,μ)(b-a)\mathrm{TV}\left(\nu,\mu\right), captures the distribution shift and can further be bounded using ff-divergence between ν\nu and μ\mu. The remaining part of (55) captures the in-distribution generalization gap of the SGLD algorithm. In particular, we notice that the first term inside the min{,}\min\{\cdot,\cdot\} coincides with the previous result (54). This is also a direct consequence of the Pinsker inequality, as shown in Appendix C-A. In contrast, the second term inside the min{,}\min\{\cdot,\cdot\} follows from the Bretagnolle-Huber inequality [43], an exponential KL upper bound on the total variation.

We compare the two terms inside the min{,}\min\{\cdot,\cdot\} in Fig. 4, where the x-axis is set to t𝒯iηt2L2σt2\sum_{t\in\mathcal{T}_{i}}\frac{\eta_{t}^{2}L^{2}}{\sigma_{t}^{2}}. Since the parameters ηt\eta_{t} and σt\sigma_{t} depend only on the SGLD algorithm, we keep them fixed and only consider the effect of the Lipschitz constant LL of the loss function. From Fig. 4, we conclude that, if LL is small enough such that t𝒯iηt2L2σt2<δ\sum_{t\in\mathcal{T}_{i}}\dfrac{\eta_{t}^{2}L^{2}}{\sigma_{t}^{2}}<\delta, then the Pinsker-type term is tighter. Here δ\delta is the unique non-zero solution of the equation 12x=1ex/2\frac{1}{2}\sqrt{x}=\sqrt{1-e^{-x/2}}, and we have δ=2W0(2e2)+43.187\delta=2W_{0}\left(-\dfrac{2}{e^{2}}\right)+4\approx 3.187, where, with a slight abuse of notation, W0W_{0} denotes the principal branch of the Lambert WW function. On the other hand, if LL is sufficiently large such that t𝒯iηt2L2σt2>δ\sum_{t\in\mathcal{T}_{i}}\dfrac{\eta_{t}^{2}L^{2}}{\sigma_{t}^{2}}>\delta, the Pinsker-type term grows without bound, whereas the newly added term saturates and thus establishes a tighter generalization bound.

Figure 4: Comparison between the two terms in (55).
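
The threshold δ and the crossover between the two terms can be reproduced numerically, e.g., with the following sketch (scipy.special.lambertw evaluates the principal branch W0; the sample values of the abscissa are arbitrary illustrative choices).

import numpy as np
from scipy.special import lambertw

# Threshold where (1/2) * sqrt(x) = sqrt(1 - exp(-x/2)):
delta = 2.0 * lambertw(-2.0 / np.e**2, 0).real + 4.0   # ~ 3.187
print(f"delta = {delta:.4f}")

pinsker = lambda x: 0.5 * np.sqrt(x)                   # Pinsker-type term
bh      = lambda x: np.sqrt(1.0 - np.exp(-x / 2.0))    # Bretagnolle-Huber-type term

for x in [0.5, 1.0, delta, 5.0, 20.0]:
    winner = "Pinsker" if pinsker(x) <= bh(x) else "Bretagnolle-Huber"
    print(f"x = {x:6.3f}: Pinsker = {pinsker(x):.3f}, BH = {bh(x):.3f} -> {winner}")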

VI-C SGLD using Without-Replacement Sampling Method

In the subsequent two subsections, we turn to the CMI methods for the SGLD. The following theorem is a consequence of combining the ff-ICIMI bound (Theorem 2) with the squared Hellinger divergence. The proof is postponed to Appendix C-B.

Theorem 4.

Let WW be the output of the SGLD algorithm as defined in (53). Suppose the loss function is LL-Lipschitz and bounded in [a,b][a,b]. We have

gen(PW|Zn,ν,μ)(ba)TV(ν,μ)+2(ba)nt[T]1exp(ηt2L216σt2).\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq(b-a)\mathrm{TV}\left(\nu,\mu\right)+\frac{2(b-a)}{n}\sum_{t\in[T]}\sqrt{1-\exp\left(-\frac{\eta_{t}^{2}L^{2}}{16\sigma_{t}^{2}}\right)}. (56)

To compare it with other in-distribution results, it suffices to consider the case ν=μ\nu=\mu so that the first TV term becomes 0. By writing the summation t[T]\sum_{t\in[T]} as i=1nt𝒯i\sum_{i=1}^{n}\sum_{t\in\mathcal{T}_{i}}, we see that the sum over t𝒯it\in\mathcal{T}_{i} lies outside the square root in (56), whereas it lies inside the square root in (54) and (55). By the inequality x+yx+y\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}, the generalization bound (56) is therefore larger than those of (54) and (55) in general. However, Theorem 4 shows its benefit if SGLD is performed using the without-replacement sampling scheme, i.e., each datum ZiZ_{i} is used exactly once for updating WW. Under this setting, 𝒯i\mathcal{T}_{i} becomes a singleton and the sum over t𝒯it\in\mathcal{T}_{i} reduces to a single term. Thus, under the without-replacement sampling scheme, the generalization bound in Theorem 4 is strictly superior to the previous result (54). This is a direct consequence of the inequality 1ex<x1-e^{-x}<x when x>0x>0.
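
The following sketch illustrates this comparison numerically for a single step (|T_i| = 1), i.e., the without-replacement case; the values of b - a, L, eta_t, and beta are arbitrary illustrative choices of ours.

import numpy as np

b_minus_a, L = 1.0, 5.0                    # illustrative loss range and Lipschitz constant
eta, beta = 0.01, 10.0                     # illustrative SGLD parameters
sigma2 = 2.0 * eta / beta                  # sigma_t^2 for the standard SGLD noise
x = eta**2 * L**2 / sigma2                 # the per-step quantity eta_t^2 L^2 / sigma_t^2

# Per-datum in-distribution terms when |T_i| = 1 (without-replacement sampling):
imi_term = (b_minus_a / 2.0) * np.sqrt(x)                       # from (54) with sigma = (b-a)/2
hel_term = 2.0 * b_minus_a * np.sqrt(1.0 - np.exp(-x / 16.0))   # from (56)

print(f"x = {x:.4f}")
print(f"(54) term = {imi_term:.4f},  (56) term = {hel_term:.4f}")
# Since 1 - exp(-x/16) < x/16 for x > 0, the (56) term stays below the (54) term here.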

VI-D SGLD with Asymptotic Assumption

As discussed in the previous subsection, the sum over t𝒯it\in\mathcal{T}_{i} in Theorem 4 lies outside the square root, which generally leads to a larger generalization bound when sampling with replacement is allowed. Technically, this is a consequence of applying the chain rule of the squared Hellinger divergence in the proof of Theorem 4. Although the chain rule is a common technique in a series of KL-based SGLD generalization bounds, it does not perform as well when other ff-divergences are involved (at least for the squared Hellinger divergence). In this subsection, we drop the chain rule and utilize the sub-additivity of the squared Hellinger divergence. With this new technique at hand, we are able to move the sum over t𝒯it\in\mathcal{T}_{i} inside the square root and derive an asymptotically tighter generalization bound for SGLD.

We begin with an intuitive observation. As the number of training data goes to infinity, the algorithm output becomes asymptotically independent of any individual sample. Under the CMI model, this translates to the statement that the conditional probability of RiR_{i} (the identity of the ii-th training sample), given the super-sample and some history of the learning algorithm's outputs, asymptotically converges to its marginal distribution 12δ1+12δ1\frac{1}{2}\delta_{1}+\frac{1}{2}\delta_{-1}, where δ1\delta_{1} is a Dirac delta distribution located at 11. The following assumption posits the existence of a function gg that captures this convergence behavior. Recall that nn is the total number of training samples.

Assumption 1.

t𝒯i,PRi=1|Wt1,Zi±=12+𝒪(g(n))\forall t\in\mathcal{T}_{i},\ P_{R_{i}=1|W_{t-1},Z_{i}^{\pm}}=\dfrac{1}{2}+\mathcal{O}\left(g(n)\right), where g:+g\colon\mathbb{N}^{+}\to\mathbb{R} is some deterministic function satisfying limng(n)=0\lim_{n\to\infty}g(n)=0.

Remark 4.

One can imagine g(n)g(n) as a polynomial nγn^{-\gamma} for some γ>0\gamma>0, but we do not impose such a restriction. Additionally, one special case is the without-replacement sampling discussed in the previous subsection. Since the datum indexed by RiR_{i} is used for updating WW only at the tt-th iteration, Wt1W_{t-1} is independent of RiR_{i}, and thus PRi=1|Wt1,Zi±=12=PRi=1|Wt1,Zi±P_{R_{i}=1|W_{t-1},Z_{i}^{\pm}}=\frac{1}{2}=P_{R_{i}=-1|W_{t-1},Z_{i}^{\pm}} and g(n)0g(n)\equiv 0.

Theorem 5.

Let WW be the output of the SGLD algorithm as defined in (53). Suppose the loss function is LL-Lipschitz and bounded in [a,b][a,b]. Under Assumption 1, we have

gen(PW|Zn,ν,μ)(ba)TV(ν,μ)+2(ba)ni=1n𝔼U[T][t𝒯i(U[T])(1exp(ηt2L216σt2))]+𝒪(g2(n)).\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq(b-a)\mathrm{TV}\left(\nu,\mu\right)+\frac{2(b-a)}{n}\sum_{i=1}^{n}\mathbb{E}_{U_{[T]}}\left[\sqrt{\sum_{t\in\mathcal{T}_{i}(U_{[T]})}\left(1-\exp\left(-\frac{\eta_{t}^{2}L^{2}}{16\sigma_{t}^{2}}\right)\right)}\right]+\mathcal{O}\left(g^{2}(n)\right). (57)
Remark 5.

Under the without-replacement sampling method, we have g(n)0g(n)\equiv 0 and we exactly recover the generalization bound (56). Therefore, Theorem 4 can be regarded as a special case of Theorem 5.

Again, by the inequality 1exx1-e^{-x}\leq x, the generalization bound (57) is asymptotically tighter than the IMI bound (54) (with σ=(ba)/2\sigma=(b-a)/2). Additionally, [15] also derived an ICIMI-based bound for the SGLD algorithm. However, it relies on a precise estimate of the posterior probability of RiR_{i} (that is, α\alpha in our proof), which is computationally intractable; in the worst case, the ICIMI bound can be twice as large as the IMI bound. In contrast, the generalization bound in Theorem 5 is independent of the specific value of α\alpha and relies only on its asymptotic behavior, captured by the function g(n)g(n), incurring an 𝒪(g2(n))\mathcal{O}(g^{2}(n)) error in the generalization bound. We conjecture that gg can be polynomial (e.g., g(n)=1/ng(n)=1/n) under certain sampling or training schemes, so that (57) decays quickly. We leave the study of the function gg for future work.

The remaining part of this section focuses on proving Theorem 5. As mentioned before, we exploit the sub-additivity, rather than the chain rule, of the ff-divergence. We start with its definition.

Definition 4 (Sub-additivity of divergence [44]).

Let 𝒢\mathcal{G} be a directed acyclic graph that defines a Bayes network. We say a divergence DD is sub-additive if, for all probability distributions PP and QQ over the Bayes network, we have

D(P||Q)v𝒢D(P{v}Πv||Q{v}Πv),D(P||Q)\leq\sum_{v\in\mathcal{G}}D(P_{\{v\}\cup\Pi_{v}}||Q_{\{v\}\cup\Pi_{v}}), (58)

where vv denotes the vertex of 𝒢\mathcal{G} and Πv\Pi_{v} denotes the set of parents of vv.

Many ff-divergences are sub-additive, but here we only need the sub-additivity of the squared Hellinger divergence.

Lemma 4 (Theorem 2 of [44]).

The squared Hellinger divergence (c.f. Table I) is sub-additive on Bayes networks.
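
As a quick sanity check of (58) for the squared Hellinger divergence, the sketch below builds two random Bayes networks P and Q on the binary chain X1 -> X2 -> X3 and compares both sides of the inequality; the construction is purely illustrative and of our own choosing.

import itertools
import numpy as np

rng = np.random.default_rng(1)

def random_chain():
    """Random Bayes net on the chain X1 -> X2 -> X3 with binary variables."""
    p1 = rng.dirichlet(np.ones(2))              # P(X1)
    p2 = rng.dirichlet(np.ones(2), size=2)      # P(X2 | X1)
    p3 = rng.dirichlet(np.ones(2), size=2)      # P(X3 | X2)
    joint = np.zeros((2, 2, 2))
    for x1, x2, x3 in itertools.product(range(2), repeat=3):
        joint[x1, x2, x3] = p1[x1] * p2[x1, x2] * p3[x2, x3]
    return joint

def H2(p, q):
    """Squared Hellinger divergence H^2(p||q) = sum (sqrt(p) - sqrt(q))^2."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

P, Q = random_chain(), random_chain()

lhs = H2(P, Q)
rhs = (H2(P.sum(axis=(1, 2)), Q.sum(axis=(1, 2)))   # vertex X1 (no parents)
       + H2(P.sum(axis=2), Q.sum(axis=2))           # vertex X2 with parent X1
       + H2(P.sum(axis=0), Q.sum(axis=0)))          # vertex X3 with parent X2
print(f"H^2(P||Q) = {lhs:.4f}  <=  sum over vertices = {rhs:.4f}")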

Combining the sub-additivity with SGLD yields the following useful lemma. The underlying reason why the sub-additivity can be applied to SGLD is that one can construct a Bayes network from the trajectory of SGLD. The proof is deferred to Appendix C-C.

Lemma 5 (Sub-additivity of Conditional ff-Mutual Information).

Let DfD_{f} be a sub-additive ff-divergence, and W[0:T](W0,,WT)W_{[0:T]}\coloneqq(W_{0},\ldots,W_{T}) be the trajectory generated by the SGLD algorithm under some fixed sample path U[T]=(u1,,uT)U_{[T]}=(u_{1},\ldots,u_{T}). Again, let RiR_{i} and Zi±Z_{i}^{\pm} respectively be the data identifier and super-sample, as illustrated in the CMI model. We have

If(W[0:T];Ri|Zi±)t𝒯i𝔼Wt1,Zi±[Df(PWt,Ri|Wt1,Zi±||PWt|Wt1,Zi±PRi)],I_{f}\left(W_{[0:T]};R_{i}|Z_{i}^{\pm}\right)\leq\sum_{t\in\mathcal{T}_{i}}\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[D_{f}\left(P_{W_{t},R_{i}|W_{t-1},Z_{i}^{\pm}}||P_{W_{t}|W_{t-1},Z_{i}^{\pm}}P_{R_{i}}\right)\right], (59)

where PRiP_{R_{i}} denotes the uniform distribution over the set {1,+1}\left\{-1,+1\right\}.

Now we are ready to prove the main result of this subsection, i.e., Theorem 5.

Proof of Theorem 5.

By Theorem 2, we have

gen(PW|Zn,ν)\displaystyle\mathrm{gen}\left(P_{W|Z^{n}},\nu\right) bani=1n2If(WT;Ri|Zi±)f′′(1)\displaystyle\leq\frac{b-a}{n}\sum_{i=1}^{n}\sqrt{\frac{2I_{f}\left(W_{T};R_{i}|Z_{i}^{\pm}\right)}{f^{\prime\prime}(1)}} (60)
bani=1n2If(W[0:T];Ri|Zi±)f′′(1)\displaystyle\leq\frac{b-a}{n}\sum_{i=1}^{n}\sqrt{\frac{2I_{f}\left(W_{[0:T]};R_{i}|Z_{i}^{\pm}\right)}{f^{\prime\prime}(1)}} (61)
bani=1n2f′′(1)t𝒯i𝔼Wt1,Zi±[Df(PWt,Ri|Wt1,Zi±||PWt|Wt1,Zi±PRi)],\displaystyle\leq\frac{b-a}{n}\sum_{i=1}^{n}\sqrt{\frac{2}{f^{\prime\prime}(1)}\sum_{t\in\mathcal{T}_{i}}\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[D_{f}\left(P_{W_{t},R_{i}|W_{t-1},Z_{i}^{\pm}}||P_{W_{t}|W_{t-1},Z_{i}^{\pm}}P_{R_{i}}\right)\right]}, (62)

where the last inequality follows by Lemma 5. Let PRi|Wt1,Zi±=αδ1+(1α)δ1P_{R_{i}|W_{t-1},Z_{i}^{\pm}}=\alpha\delta_{1}+(1-\alpha)\delta_{-1}. By Assumption 1 we can write

α(Wt1,Zi±)=12(1+ϵ(Wt1,Zi±)),ϵ𝒪(g(n)).\alpha(W_{t-1},Z_{i}^{\pm})=\frac{1}{2}\left(1+\epsilon(W_{t-1},Z_{i}^{\pm})\right),\ \epsilon\in\mathcal{O}(g(n)). (63)

It follows that PWt|Wt1,Zi±,RiP_{W_{t}|W_{t-1},Z_{i}^{\pm},R_{i}} is Gaussian and PWt|Wt1,Zi±P_{W_{t}|W_{t-1},Z_{i}^{\pm}} is a Gaussian mixture model (c.f. (145) and (146) in Appendix C-B). We have

PWt,Ri|Wt1,Zi±\displaystyle P_{W_{t},R_{i}|W_{t-1},Z_{i}^{\pm}} =α𝒩+×δ1+(1α)𝒩×δ1,\displaystyle=\alpha\mathcal{N}^{+}\times\delta_{1}+(1-\alpha)\mathcal{N}^{-}\times\delta_{-1}, (64)
PWt|Wt1,Zi±PRi\displaystyle P_{W_{t}|W_{t-1},Z_{i}^{\pm}}P_{R_{i}} =(α𝒩++(1α)𝒩)×(12δ1+12δ1)\displaystyle=\left(\alpha\mathcal{N}^{+}+(1-\alpha)\mathcal{N}^{-}\right)\times\left(\frac{1}{2}\delta_{1}+\frac{1}{2}\delta_{-1}\right) (65)
=α𝒩+×(12δ1+12δ1)+(1α)𝒩×(12δ1+12δ1).\displaystyle=\alpha\mathcal{N}^{+}\times\left(\frac{1}{2}\delta_{1}+\frac{1}{2}\delta_{-1}\right)+(1-\alpha)\mathcal{N}^{-}\times\left(\frac{1}{2}\delta_{1}+\frac{1}{2}\delta_{-1}\right). (66)

In what follows, we set f(x)=(x1)2f(x)=\left(\sqrt{x}-1\right)^{2} and thus Df=H2D_{f}=H^{2}, but we still use the ff-notation for those expressions holding for arbitrary ff-divergence. Plugging (64) and (66) into (62) yields

𝔼Wt1,Zi±[Df(PWt,Ri|Wt1,Zi±||PWt|Wt1,Zi±PRi)]\displaystyle\quad\ \mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[D_{f}\left(P_{W_{t},R_{i}|W_{t-1},Z_{i}^{\pm}}||P_{W_{t}|W_{t-1},Z_{i}^{\pm}}P_{R_{i}}\right)\right]
𝔼Wt1,Zi±[αDf(PWt,Ri|Wt1,Zi±||𝒩+×(12δ1+12δ1))+(1α)Df(PWt,Ri|Wt1,Zi±||𝒩×(12δ1+12δ1))]\displaystyle\leq\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[\alpha D_{f}\left(P_{W_{t},R_{i}|W_{t-1},Z_{i}^{\pm}}||\mathcal{N}^{+}\times\left(\frac{1}{2}\delta_{1}+\frac{1}{2}\delta_{-1}\right)\right)+(1-\alpha)D_{f}\left(P_{W_{t},R_{i}|W_{t-1},Z_{i}^{\pm}}||\mathcal{N}^{-}\times\left(\frac{1}{2}\delta_{1}+\frac{1}{2}\delta_{-1}\right)\right)\right] (67)
=𝔼Wt1,Zi±[α2(f(2α)+𝔼𝒩+[f(2(1α)d𝒩d𝒩+)])+1α2(f(2(1α))+𝔼𝒩[f(2αd𝒩+d𝒩)])]\displaystyle=\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[\frac{\alpha}{2}\left(f(2\alpha)+\mathbb{E}_{\mathcal{N}^{+}}\left[f\left(2(1-\alpha)\frac{\mathrm{d}\mathcal{N}^{-}}{\mathrm{d}\mathcal{N}^{+}}\right)\right]\right)+\frac{1-\alpha}{2}\left(f(2(1-\alpha))+\mathbb{E}_{\mathcal{N}^{-}}\left[f\left(2\alpha\frac{\mathrm{d}\mathcal{N}^{+}}{\mathrm{d}\mathcal{N}^{-}}\right)\right]\right)\right] (68)
=𝔼Wt1,Zi±[22(α3/2+(1α)3/2+(α1α+(1α)α)𝒲d𝒩+d𝒩)]\displaystyle=\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[2-\sqrt{2}\left(\alpha^{3/2}+(1-\alpha)^{3/2}+\left(\alpha\sqrt{1-\alpha}+(1-\alpha)\sqrt{\alpha}\right)\int_{\mathcal{W}}\sqrt{\mathrm{d}\mathcal{N}^{+}\mathrm{d}\mathcal{N}^{-}}\right)\right] (69)
=𝔼Wt1,Zi±[1𝒲d𝒩+d𝒩]+𝒪(ϵ2)\displaystyle=\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[1-\int_{\mathcal{W}}\sqrt{\mathrm{d}\mathcal{N}^{+}\mathrm{d}\mathcal{N}^{-}}\right]+\mathcal{O}\left(\epsilon^{2}\right) (70)
=𝔼Wt1,Zi±[1exp(ηt2W(Wt1,Zi+)W(Wt1,Zi)28σt2)]+𝒪(ϵ2)\displaystyle=\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[1-\exp\left(-\frac{\eta_{t}^{2}\|\nabla_{W}\ell(W_{t-1},Z_{i}^{+})-\nabla_{W}\ell(W_{t-1},Z_{i}^{-})\|^{2}}{8\sigma_{t}^{2}}\right)\right]+\mathcal{O}\left(\epsilon^{2}\right) (71)
1exp(ηt2L216σt2)+𝒪(ϵ2).\displaystyle\leq 1-\exp\left(-\frac{\eta_{t}^{2}L^{2}}{16\sigma_{t}^{2}}\right)+\mathcal{O}\left(\epsilon^{2}\right). (72)

In the above, inequality (67) follows since the ff-divergence is jointly (and thus separately) convex, and (68) follows by the definition of ff-divergence. We derive (69) since f(x)=(x1)2f(x)=(\sqrt{x}-1)^{2}. To derive (70), we use

α3/2\displaystyle\alpha^{3/2} =(12)3/2(1+32ϵ)+𝒪(ϵ2),\displaystyle=\left(\frac{1}{2}\right)^{3/2}\left(1+\frac{3}{2}\epsilon\right)+\mathcal{O}\left(\epsilon^{2}\right), (73a)
(1α)3/2\displaystyle(1-\alpha)^{3/2} =(12)3/2(132ϵ)+𝒪(ϵ2),\displaystyle=\left(\frac{1}{2}\right)^{3/2}\left(1-\frac{3}{2}\epsilon\right)+\mathcal{O}\left(\epsilon^{2}\right), (73b)
α1α\displaystyle\alpha\sqrt{1-\alpha} =(12)3/2(1+12ϵ)+𝒪(ϵ2),\displaystyle=\left(\frac{1}{2}\right)^{3/2}\left(1+\frac{1}{2}\epsilon\right)+\mathcal{O}\left(\epsilon^{2}\right), (73c)
(1α)α\displaystyle(1-\alpha)\sqrt{\alpha} =(12)3/2(112ϵ)+𝒪(ϵ2),\displaystyle=\left(\frac{1}{2}\right)^{3/2}\left(1-\frac{1}{2}\epsilon\right)+\mathcal{O}\left(\epsilon^{2}\right), (73d)

where these equations come from (63) and the fact that (1+ϵ)1/2=1+ϵ/2+𝒪(ϵ2)(1+\epsilon)^{1/2}=1+\epsilon/2+\mathcal{O}(\epsilon^{2}). Expression (71) follows by Lemma 11 in Appendix C-B since 1𝒲d𝒩+d𝒩=12H2(𝒩+||𝒩)1-\int_{\mathcal{W}}\sqrt{\mathrm{d}\mathcal{N}^{+}\mathrm{d}\mathcal{N}^{-}}=\frac{1}{2}H^{2}\left(\mathcal{N}^{+}||\mathcal{N}^{-}\right), and the last inequality (72) follows by (155) in Appendix C-B.

Finally, inserting (72) into (62) and taking expectation w.r.t. the sample path U[T]U_{[T]}, the desired result (57) follows since ϵ2𝒪(g2(n))\epsilon^{2}\in\mathcal{O}\left(g^{2}(n)\right). ∎

VII Conclusions

This paper proposes an information-theoretic framework for analyzing OOD generalization, which seamlessly interpolates between IPM and ff-divergence, reproducing known results as well as yielding new generalization bounds. Next, the framework is integrated with the CMI methods, resulting in a class of ff-ICIMI bounds, which include the state-of-the-art ICIMI bound as a special instance. These results give a natural extension and improvement of both the IMI and the CMI methods. Lastly, we demonstrate their applications by proving generalization bounds for the SGLD algorithm, under the premise that the loss function is both bounded and Lipschitz. Although these results are derived under the OOD setting, they also demonstrate benefits in the in-distribution case. In particular, the study gives special attention to two specific scenarios: SGLD employing without-replacement sampling and SGLD with an asymptotic posterior probability. Compared with the traditional chain-rule-based methods, our approach also sheds light on an alternative way to study the generalization of SGLD algorithms through the sub-additivity of ff-divergence.

Appendix A Proofs in Section III

A-A Proof of Proposition 13

We provide an alternative proof of Proposition 13, demonstrating its relationship with (f,Γ)(f,\Gamma)-divergence [35]. We start with its definition.

Definition 5 ((f,Γ)(f,\Gamma)-Divergence [35]).

Let 𝒳\mathcal{X} be a probability space. Suppose P,Q𝒫(𝒳)P,Q\in\mathcal{P}(\mathcal{X}) and Γ(𝒳)\Gamma\subseteq\mathcal{M}(\mathcal{X}), ff be the convex function that induces the ff-divergence. The (f,Γ)(f,\Gamma)-divergence between distribution PP and QQ is defined by

DfΓ(P||Q):-supgΓ{𝔼P[g]Λf;Q(g)}.D_{f}^{\Gamma}\left(P||Q\right)\coloneq\sup_{g\in\Gamma}\bigl{\{}\mathbb{E}_{P}\left[g\right]-\Lambda_{f;Q}\left(g\right)\bigr{\}}. (74)
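
As a small numerical illustration of Definition 5, the sketch below evaluates the objective in (74) for the KL generator on a three-point space, assuming (consistently with the KL case of Remark 6 and the argument in Appendix A-B) that Λf;Q(g) reduces to the cumulant log E_Q[e^g]; the distributions and the finite family Γ are toy choices of ours.

import numpy as np

P = np.array([0.70, 0.20, 0.10])
Q = np.array([0.30, 0.40, 0.30])

Gamma = [np.array([1.0, 0.0, -1.0]),       # a few bounded test functions g (illustrative)
         np.array([0.5, -0.5, 0.0]),
         np.log(P / Q)]                    # the optimizer of the unconstrained KL problem

# Lambda_{f;Q}(g) = log E_Q[exp(g)] in the KL case (assumption, cf. Remark 6).
objective = lambda g: P @ g - np.log(Q @ np.exp(g))

D_f_Gamma = max(objective(g) for g in Gamma)
D_KL = np.sum(P * np.log(P / Q))
print(f"D_f^Gamma(P||Q) = {D_f_Gamma:.4f}  <=  D_KL(P||Q) = {D_KL:.4f}")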

The (f,Γ)(f,\Gamma)-divergence admits an upper bound, which interpolates between Γ\Gamma-IPM and ff-divergence.

Lemma 6.

(​[35, Theorem 8])

DfΓ(P||Q)infη𝒫(𝒳){WΓ(P,η)+Df(η||Q)}.D_{f}^{\Gamma}\left(P||Q\right)\leq\inf_{\eta\in\mathcal{P}(\mathcal{X})}\left\{W^{\Gamma}\left(P,\eta\right)+D_{f}\left(\eta||Q\right)\right\}. (75)

Now we are ready to prove Proposition 13.

Proof of Proposition 13 using (f,Γ)(f,\Gamma)-Divergence.
gen(PW|Zn,ν,μ)\displaystyle\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right) =1ni=1n𝔼Pi[¯(W,Zi)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{P_{i}}\left[\bar{\ell}\left(W,Z_{i}\right)\right] (76)
=1ni=1n1ti𝔼Pi[ti¯(W,Zi)]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{t_{i}}\mathbb{E}_{P_{i}}\left[t_{i}\bar{\ell}\left(W,Z_{i}\right)\right] (77)
1ni=1n1ti(DftiΓ¯(Pi||Q)+Λf;Q(ti¯(W,Zi)))\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\frac{1}{t_{i}}\left(D_{f}^{t_{i}\bar{\Gamma}}\left(P_{i}||Q\right)+\Lambda_{f;Q}\left(t_{i}\bar{\ell}\left(W,Z_{i}\right)\right)\right) (78)
1ni=1n1tiinfηi𝒫(𝒲×𝒵){WtiΓ¯(Pi,ηi)+Df(ηi||Q)+Λf;Q(ti¯(W,Zi))}\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\frac{1}{t_{i}}\inf_{\eta_{i}\in\mathcal{P}\left(\mathcal{W}\times\mathcal{Z}\right)}\Bigl{\{}W^{t_{i}\bar{\Gamma}}\left(P_{i},\eta_{i}\right)+D_{f}\left(\eta_{i}||Q\right)+\Lambda_{f;Q}\left(t_{i}\bar{\ell}\left(W,Z_{i}\right)\right)\Bigr{\}} (79)
=RHS of(13).\displaystyle=\text{RHS of}~{}\eqref{equation::fundamental inequality}.

Here equality (76) follows by (17), inequality (78) follows by Definition 5 and the condition ti¯tiΓ¯t_{i}\bar{\ell}\in t_{i}\bar{\Gamma}, inequality (79) follows by Lemma 6, and the last equality follows by the fact that 1tWtΓ¯(Pi,ηi)=WΓ¯(Pi,ηi)\dfrac{1}{t}W^{t\bar{\Gamma}}\left(P_{i},\eta_{i}\right)=W^{\bar{\Gamma}}\left(P_{i},\eta_{i}\right), for all t+t\in\mathbb{R}_{+}. ∎

A-B Tightness of Proposition 13

The following proposition says that the equality in Proposition 13 can be achieved under certain conditions.

Proposition 2.

The upper bound in Proposition 13 achieves equality if the following two conditions hold simultaneously.

  1.

    Γ¯\bar{\Gamma} is a singleton, i.e., ¯\bar{\ell} is the only element of Γ¯\bar{\Gamma}.

  2.

    For each i=1,,ni=1,\ldots,n, the distribution ηi\eta_{i} and the parameter tit_{i} are related through

    dηidQ=(f)(ti¯(w,z)λi),\frac{\mathrm{d}\eta_{i}}{\mathrm{d}Q}=(f^{*})^{\prime}\left(t_{i}\bar{\ell}\left(w,z\right)-\lambda_{i}\right), (80)

    where λi\lambda_{i}\in\mathbb{R} makes (80) a probability density:

    𝔼Q[(f)(ti¯(W,Z)λi)]=1.\mathbb{E}_{Q}\left[(f^{*})^{\prime}\left(t_{i}\bar{\ell}\left(W,Z\right)-\lambda_{i}\right)\right]=1. (81)
Remark 6.

Under the case of KL-divergence (see Remark 2), we have (f)(x)=ex(f^{*})^{\prime}(x)=e^{x} and thus λi=log𝔼Q[eti¯(W,Z)]\lambda_{i}=\log\mathbb{E}_{Q}\left[e^{t_{i}\bar{\ell}(W,Z)}\right]. Therefore, the optimal ηi\eta_{i} has the form of

dηidQ(w,z)=eti¯(w,z)𝔼Q[eti¯(W,Z)]=eti(w,z)𝔼Q[eti(W,Z)].\frac{\mathrm{d}\eta_{i}}{\mathrm{d}Q}(w,z)=\frac{e^{t_{i}\bar{\ell}(w,z)}}{\mathbb{E}_{Q}\left[e^{t_{i}\bar{\ell}(W,Z)}\right]}=\frac{e^{-t_{i}\ell(w,z)}}{\mathbb{E}_{Q}\left[e^{-t_{i}\ell(W,Z)}\right]}. (82)

This means that the optimal ηi\eta_{i} is achieved exactly at the Gibbs posterior distribution, with tit_{i} acting as the inverse temperature.
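
On a finite space, the Gibbs form (82) is easy to visualize: the sketch below tilts a reference distribution Q by e^{-t ell} and normalizes, for a small arbitrary loss table of our own choosing; as t grows, the mass concentrates on low-loss points, matching the inverse-temperature reading of t_i.

import numpy as np

# Toy finite space of (w, z) pairs with an arbitrary loss table (illustrative).
losses = np.array([0.1, 0.5, 1.0, 2.0])     # ell(w, z) on a 4-point space
Q = np.full(4, 0.25)                        # reference distribution Q (uniform)

for t in [0.5, 2.0, 8.0]:                   # t_i acts as an inverse temperature
    eta = Q * np.exp(-t * losses)
    eta /= eta.sum()                        # Gibbs posterior d(eta_i)/dQ, cf. (82)
    print(f"t = {t:4.1f}:  eta = {np.round(eta, 3)},  E_eta[ell] = {eta @ losses:.3f}")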

Proof of Proposition 2.

By the first condition of Proposition 2, we have WΓ¯(Pi,ηi)=𝔼Pi[¯]𝔼ηi[¯]W^{\bar{\Gamma}}\left(P_{i},\eta_{i}\right)=\mathbb{E}_{P_{i}}\left[\bar{\ell}\right]-\mathbb{E}_{\eta_{i}}\left[\bar{\ell}\right], and thus Proposition 13 becomes

gen(PW|Zn,ν,μ)1ni=1n(𝔼Pi[¯]𝔼ηi[¯]+1tiDf(ηi||Q)+1tiΛf;Q(ti¯(W,Z))).\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)\leq\frac{1}{n}\sum_{i=1}^{n}\Bigl{(}\mathbb{E}_{P_{i}}\left[\bar{\ell}\right]-\mathbb{E}_{\eta_{i}}\left[\bar{\ell}\right]\\ +\frac{1}{t_{i}}D_{f}\left(\eta_{i}||Q\right)+\frac{1}{t_{i}}\Lambda_{f;Q}\left(t_{i}\bar{\ell}\left(W,Z\right)\right)\Bigr{)}. (83)

As a consequence, it suffices to prove

𝔼η[g]=1tDf(η||Q)+1tΛf;Q(tg),\mathbb{E}_{\eta}\left[g\right]=\frac{1}{t}D_{f}\left(\eta||Q\right)+\frac{1}{t}\Lambda_{f;Q}\left(tg\right), (84)

under the conditions that

dηdQ=(f)(t(gλ)),\displaystyle\frac{\mathrm{d}\eta}{\mathrm{d}Q}=(f^{*})^{\prime}\left(t(g-\lambda)\right), (85a)
𝔼Q[(f)(t(gλ))]=1,\displaystyle\mathbb{E}_{Q}\left[(f^{*})^{\prime}\left(t(g-\lambda)\right)\right]=1, (85b)

where η,Q𝒫(𝒳)\eta,Q\in\mathcal{P}(\mathcal{X}), g(𝒳)g\in\mathcal{M}(\mathcal{X}), and t+t\in\mathbb{R}_{+}. If this is the case, then Proposition 2 follows by setting 𝒳=𝒲×𝒵\mathcal{X}=\mathcal{W}\times\mathcal{Z}, η=ηi\eta=\eta_{i}, t=tit=t_{i}, g=¯g=\bar{\ell}, and λ=1tiλi\lambda=\frac{1}{t_{i}}\lambda_{i}. To see that (84) holds, we need the following lemma.

Lemma 7.

([35, Lemma 48])

f((f)(y))=y(f)(y)f(y).f\bigl{(}(f^{*})^{\prime}(y)\bigr{)}=y(f^{*})^{\prime}(y)-f^{*}(y). (86)

Then the subsequent argument is very similar to that of [35, Theorem 82]. We have

supP𝒫(𝒳){𝔼P[g]1tDf(P||Q)}\displaystyle\quad\ \sup_{P\in\mathcal{P}(\mathcal{X})}\Bigl{\{}\mathbb{E}_{P}\left[g\right]-\frac{1}{t}D_{f}\left(P||Q\right)\Bigr{\}} (87)
λ+𝔼η[gλ]1tDf(η||Q)\displaystyle\geq\lambda+\mathbb{E}_{\eta}\left[g-\lambda\right]-\frac{1}{t}D_{f}\left(\eta||Q\right) (88)
=λ+𝔼Q[(f)(t(gλ))(gλ)]1tDf(η||Q)\displaystyle=\lambda+\mathbb{E}_{Q}\left[(f^{*})^{\prime}(t(g-\lambda))(g-\lambda)\right]-\frac{1}{t}D_{f}\left(\eta||Q\right) (89)
=1t(tλ+𝔼Q[f(t(gλ))])\displaystyle=\frac{1}{t}\bigl{(}t\lambda+\mathbb{E}_{Q}\left[f^{*}(t(g-\lambda))\right]\bigr{)} (90)
1tΛf;Q(tg)\displaystyle\geq\frac{1}{t}\Lambda_{f;Q}\left(tg\right) (91)
=supP𝒫(𝒳){𝔼P[g]1tDf(P||Q)}.\displaystyle=\sup_{P\in\mathcal{P}(\mathcal{X})}\Bigl{\{}\mathbb{E}_{P}\left[g\right]-\frac{1}{t}D_{f}\left(P||Q\right)\Bigr{\}}. (92)

In the above, equality (89) follows by (85a), equality (90) follows by Lemma 7, inequality (91) follows by Definition 7, and equality (92) follows by Lemma 1 and equality (14). Therefore, all the inequalities above hold with equality. This proves (84). ∎

Appendix B Proofs in Section IV

B-A Proof of Corollary 27

Proof.

By Corollary 25, we have

gen(PW|Zn,ν,μ)\displaystyle\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right) 1ni=1nsupgΓ{𝔼Pi[g]𝔼Q[g]}\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sup_{g\in\Gamma}\bigl{\{}\mathbb{E}_{P_{i}}\left[g\right]-\mathbb{E}_{Q}\left[g\right]\bigr{\}} (93)
=1ni=1nsupgΓ{𝔼Pi[g]𝔼PWν[g]+𝔼PWν[g]𝔼Q[g]}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\sup_{g\in\Gamma}\bigl{\{}\mathbb{E}_{P_{i}}\left[g\right]-\mathbb{E}_{P_{W}\otimes\nu}\left[g\right]+\mathbb{E}_{P_{W}\otimes\nu}\left[g\right]-\mathbb{E}_{Q}\left[g\right]\bigr{\}} (94)
1ni=1nsupgΓ{𝔼ν[𝔼PW|Zi[g]𝔼PW[g]]+𝔼PW[𝔼ν[g]𝔼μ[g]]}\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sup_{g\in\Gamma}\Bigl{\{}\mathbb{E}_{\nu}\left[\mathbb{E}_{P_{W|Z_{i}}}\left[g\right]-\mathbb{E}_{P_{W}}\left[g\right]\right]+\mathbb{E}_{P_{W}}\left[\mathbb{E}_{\nu}\left[g\right]-\mathbb{E}_{\mu}\left[g\right]\right]\Bigr{\}} (95)
1ni=1n𝔼ν[LWW1(PW|Zi,PW)]+LZW1(ν,μ).\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\nu}\left[L_{W}W_{1}(P_{W|Z_{i}},P_{W})\right]+L_{Z}W_{1}(\nu,\mu). (96)

In the above, inequality (95) follows by the tower property of conditional expectation, and inequality (96) follows by the Kantorovich-Rubinstein duality (12). ∎

B-B Proof of Corollary 3

Proof.

By assumption we have Γ\ell\in\Gamma and thus

gen(PW|Zn,ν,μ)\displaystyle\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right) 1ni=1nWΓ(Pi,Q)\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}W^{\Gamma}\left(P_{i},Q\right) (97)
=1ni=1nWΓ(ba)/2(Pi,Q)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}W^{\Gamma-(b-a)/2}(P_{i},Q) (98)
=bani=1nTV(Pi,Q).\displaystyle=\frac{b-a}{n}\sum_{i=1}^{n}\mathrm{TV}\left(P_{i},Q\right). (99)

In the above, inequality (97) follows by Corollary 25, equality (98) follows by the translation invariance of IPM, and equality (99) follows by the variational representation of total variation:

TV(P,Q)=supg12{𝔼P[g]𝔼Q[g]}.\mathrm{TV}\left(P,Q\right)=\sup_{\|g\|_{\infty}\leq\frac{1}{2}}\bigl{\{}\mathbb{E}_{P}\left[g\right]-\mathbb{E}_{Q}\left[g\right]\bigr{\}}. (100)

Thus we have proved (28). Then (29) follows from the chain rule of total variation, whose general form is given by

TV(PXm,QXm)i=1m𝔼PXi1[TV(PXi|Xi1,QXi|Xi1)].\mathrm{TV}\left(P_{X^{m}},Q_{X^{m}}\right)\leq\sum_{i=1}^{m}\mathbb{E}_{P_{X^{i-1}}}\left[\mathrm{TV}\left(P_{X_{i}|X^{i-1}},Q_{X_{i}|X^{i-1}}\right)\right]. (101)

B-C Proof of Corollaries 4 and 31

Proof.

It suffices to prove Corollary 4; Corollary 31 then follows by a similar argument. By Theorem 20, we have

gen(PW|Zn,ν,μ)\displaystyle\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right) 1ni=1n2σ2DKL(Pi||Q)\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma^{2}D_{\mathrm{KL}}\left(P_{i}||Q\right)} (102)
=1ni=1n2σ2(DKL(PW|Zi||QW|ν)+DKL(ν||μ)),\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma^{2}\bigl{(}D_{\mathrm{KL}}\left(P_{W|Z_{i}}||Q_{W}|\nu\right)+D_{\mathrm{KL}}\left(\nu||\mu\right)\bigr{)}}, (103)

where the equality follows from the chain rule of KL divergence. Taking the infimum over QWQ_{W} yields (30), by virtue of the following lemma.

Lemma 8 (Theorem 4.1 in [34]).

Suppose (W,Z)(W,Z) is a pair of random variables with marginal distribution PWP_{W} and let QWQ_{W} be an arbitrary distribution of WW. If DKL(PW||QW)<D_{\mathrm{KL}}\left(P_{W}||Q_{W}\right)<\infty, then

I(W;Z)=DKL(PW|Z||QW|Z)DKL(PW||QW).I(W;Z)=D_{\mathrm{KL}}\left(P_{W|Z}||Q_{W}|Z\right)-D_{\mathrm{KL}}\left(P_{W}||Q_{W}\right). (104)

By the non-negativity of KL divergence, the infimum is achieved at QW=PWQ_{W}=P_{W} and thus I(W;Zi)=DKL(PW|Zi||PW|ν)I(W;Z_{i})=D_{\mathrm{KL}}\left(P_{W|Z_{i}}||P_{W}|\nu\right). ∎

B-D Proof of Corollary 33

Proof.

A direct calculation shows f(y)=14y2+yf^{*}(y)=\frac{1}{4}y^{2}+y for f(x)=(x1)2f(x)=(x-1)^{2}, and thus Λf;μ(t¯(w,Z))=14Varμ(w,Z)t2\Lambda_{f;\mu}\left(t\bar{\ell}(w,Z)\right)=\frac{1}{4}\mathrm{Var}_{\mu}\ell\left(w,Z\right)t^{2}. Therefore, we can choose ψ(t)=14σ2t2\psi(t)=\frac{1}{4}\sigma^{2}t^{2} and thus (ψ)1(y)=σ2y(\psi^{*})^{-1}(y)=\sqrt{\sigma^{2}y}. Applying Theorem 20 yields (32). ∎

B-E Proof of Corollary 37

Proof.

Since Γ¯={¯}\bar{\Gamma}=\{\bar{\ell}\}, we have

WΓ¯(Pi,ηi)\displaystyle W^{\bar{\Gamma}}\left(P_{i},\eta_{i}\right) =𝔼Pi[¯]𝔼ηi[¯]\displaystyle=\mathbb{E}_{P_{i}}\left[\bar{\ell}\right]-\mathbb{E}_{\eta_{i}}\left[\bar{\ell}\right] (105)
=𝔼PWν[]𝔼PW|Ziν[].\displaystyle=\mathbb{E}_{P_{W}\otimes\nu}\left[\ell\right]-\mathbb{E}_{P_{W|Z_{i}}\otimes\nu}\left[\ell\right]. (106)

Inserting (106) into Theorem 20 and rearranging terms yields

𝔼PWμ[]𝔼PWν[]\displaystyle\mathbb{E}_{P_{W}\otimes\mu}\left[\ell\right]-\mathbb{E}_{P_{W}\otimes\nu}\left[\ell\right] (ψ)1(Df(PWν||PWμ))\displaystyle\leq(\psi^{*})^{-1}(D_{f}\left(P_{W}\otimes\nu||P_{W}\otimes\mu\right)) (107)
=(ψ)1(Df(ν||μ)).\displaystyle=(\psi^{*})^{-1}(D_{f}\left(\nu||\mu\right)). (108)

B-F Derivations for the Examples of Estimating the Gaussian and Bernoulli Means

To calculate the generalization bounds, we need the distributions PiP_{i} and QQ. All the following results are given in the general dd-dimensional case, where we let the training distribution be 𝒩(𝐦,σ2𝐈d)\mathcal{N}(\mathbf{m},\sigma^{2}\mathbf{I}_{d}) and the testing distribution be 𝒩(𝐦,(σ)2𝐈d)\mathcal{N}(\mathbf{m}^{\prime},(\sigma^{\prime})^{2}\mathbf{I}_{d}).

One can check that both PiP_{i} and QQ are jointly Gaussian. Writing the random vector as [𝐙T,𝐖T]T\displaystyle[\mathbf{Z}^{\mathrm{T}},\mathbf{W}^{\mathrm{T}}]^{\mathrm{T}}, PiP_{i} and QQ are given by

Pi=𝒩([𝐦𝐦],[σ2𝐈d,1nσ2𝐈d1nσ2𝐈d,1nσ2𝐈d]),\displaystyle P_{i}=\mathcal{N}\left(\left[\begin{array}[]{cc}\mathbf{m}\\[4.0pt] \mathbf{m}\end{array}\right],\left[\begin{array}[]{cc}\sigma^{2}\mathbf{I}_{d},&\frac{1}{n}\sigma^{2}\mathbf{I}_{d}\\[4.0pt] \frac{1}{n}\sigma^{2}\mathbf{I}_{d},&\frac{1}{n}\sigma^{2}\mathbf{I}_{d}\end{array}\right]\right), (113)
Q=𝒩([𝐦𝐦],[(σ)2𝐈d,𝟎𝟎,1nσ2𝐈d]).\displaystyle Q=\mathcal{N}\left(\left[\begin{array}[]{cc}\mathbf{m}^{\prime}\\[2.0pt] \mathbf{m}\end{array}\right],\left[\begin{array}[]{cc}(\sigma^{\prime})^{2}\mathbf{I}_{d},&\mathbf{0}\\[2.0pt] \mathbf{0},&\frac{1}{n}\sigma^{2}\mathbf{I}_{d}\end{array}\right]\right). (118)

The KL divergence between PiP_{i} and QQ is given by

DKL(Pi||Q)=12(logdet𝚺Qdet𝚺Pi2d+Tr(𝚺Q1𝚺Pi)+(𝐦Pi𝐦Q)T𝚺Q1(𝐦Pi𝐦Q)),D_{\mathrm{KL}}\left(P_{i}||Q\right)=\frac{1}{2}\Bigl{(}\log\frac{\det\mathbf{\Sigma}_{Q}}{\det\mathbf{\Sigma}_{P_{i}}}-2d+\mathrm{Tr}(\mathbf{\Sigma}_{Q}^{-1}\mathbf{\Sigma}_{P_{i}})\\ +(\mathbf{m}_{P_{i}}-\mathbf{m}_{Q})^{\mathrm{T}}\mathbf{\Sigma}_{Q}^{-1}(\mathbf{m}_{P_{i}}-\mathbf{m}_{Q})\Bigr{)}, (119)

where 𝐦Pi\mathbf{m}_{P_{i}} (resp., 𝐦Q\mathbf{m}_{Q}) denotes the mean vector of PiP_{i} (resp., QQ), and 𝚺Pi\mathbf{\Sigma}_{P_{i}} (resp., 𝚺Q\mathbf{\Sigma}_{Q}) denotes the covariance matrix of PiP_{i} (resp., QQ). The χ2\chi^{2} divergence between PiP_{i} and QQ is given by

χ2(Pi||Q)=det𝚺Qdet𝚺Pidet(2𝚺Q𝚺Pi)exp((𝐦Pi𝐦Q)T(2𝚺Q𝚺Pi)1(𝐦Pi𝐦Q))1.\chi^{2}\left(P_{i}||Q\right)=\frac{\det\mathbf{\Sigma}_{Q}}{\sqrt{\det\mathbf{\Sigma}_{P_{i}}}\sqrt{\det\bigl{(}2\mathbf{\Sigma}_{Q}-\mathbf{\Sigma}_{P_{i}}\bigr{)}}}\cdot\\ \exp\Bigl{(}(\mathbf{m}_{P_{i}}-\mathbf{m}_{Q})^{\mathrm{T}}\bigl{(}2\mathbf{\Sigma}_{Q}-\mathbf{\Sigma}_{P_{i}}\bigr{)}^{-1}(\mathbf{m}_{P_{i}}-\mathbf{m}_{Q})\Bigr{)}-1. (120)

Finally, the true generalization gap is given by

gen(PW|Zn,ν,μ)=((σ)2σ2)d+2σ2dn+𝐦𝐦22.\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)=\left((\sigma^{\prime})^{2}-\sigma^{2}\right)d+\frac{2\sigma^{2}d}{n}+\|\mathbf{m}-\mathbf{m}^{\prime}\|_{2}^{2}. (121)
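
Formula (121) can be checked by simulation. The sketch below assumes the squared-error loss ell(w,z) = ||w - z||^2 and the empirical-mean estimator W = (1/n) * sum_i Z_i, which is the reading consistent with (113), (118), and (121); the numerical parameters are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 3, 10, 200_000
m, sigma = np.zeros(d), 1.0                 # training distribution N(m, sigma^2 I_d)
m_p, sigma_p = 0.5 * np.ones(d), 1.5        # testing distribution N(m', sigma'^2 I_d)

Z = rng.normal(m, sigma, size=(trials, n, d))         # training samples
W = Z.mean(axis=1)                                    # W = empirical mean
Z_test = rng.normal(m_p, sigma_p, size=(trials, d))   # fresh testing samples

train_loss = ((W[:, None, :] - Z) ** 2).sum(axis=2).mean()   # average in-sample loss
test_loss = ((W - Z_test) ** 2).sum(axis=1).mean()           # out-of-sample loss

closed_form = (sigma_p**2 - sigma**2) * d + 2 * sigma**2 * d / n + np.sum((m - m_p) ** 2)
print(f"Monte Carlo gap ~ {test_loss - train_loss:.4f},  formula (121) = {closed_form:.4f}")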

As for the example of estimating the Bernoulli mean, a direct calculation shows

Pi(Zi=1W=kn)={(n1k1)pk(1p)nk,1kn,0,k=0,P_i\left(\begin{aligned} &Z_{i}=1\\ &W=\frac{k}{n}\end{aligned}\right)=\left\{\begin{aligned} &\binom{n-1}{k-1}p^{k}(1-p)^{n-k},1\leq k\leq n,\\ &0,k=0,\end{aligned}\right. (122)
Pi(Zi=0W=kn)={(n1k)pk(1p)nk,0kn1,0,k=n.P_{i}\left(\begin{aligned} &Z_{i}=0\\ &W=\frac{k}{n}\end{aligned}\right)=\left\{\begin{aligned} &\binom{n-1}{k}p^{k}(1-p)^{n-k},0\leq k\leq n-1,\\ &0,k=n.\end{aligned}\right. (123)

The distribution QQ is the product of Bern(p)\mathrm{Bern}(p^{\prime}) and the binomial distribution with parameter (n,p)(n,p). Then the ff-divergence can be directly calculated by definition. Finally, the true generalization gap is given by

gen(PW|Zn,ν,μ)=2k=1n(n1k1)pk(1p)nkkn+(12p)pp.\mathrm{gen}\left(P_{W|Z^{n}},\nu,\mu\right)=2\sum_{k=1}^{n}\binom{n-1}{k-1}p^{k}(1-p)^{n-k}\frac{k}{n}\\ +(1-2p)p^{\prime}-p. (124)
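
Similarly, (124) can be checked by a direct computation under the same reading (squared-error loss and the empirical-mean estimator W = (1/n) * sum_j Z_j); the sketch below compares the exact expectation with the closed form, for arbitrary illustrative parameters.

import math

n, p, p_prime = 10, 0.3, 0.6                      # illustrative parameters

# Exact expectations for W = (1/n) * sum_j Z_j with Z_j ~ Bern(p).
E_W2 = p * (1 - p) / n + p**2
E_WZi = (p + (n - 1) * p**2) / n                  # E[W * Z_i]

test_risk = E_W2 - 2 * p * p_prime + p_prime      # E[(W - Z')^2], Z' ~ Bern(p')
train_risk = E_W2 - 2 * E_WZi + p                 # E[(W - Z_i)^2] under P_i
gap = test_risk - train_risk

closed_form = (2 * sum(math.comb(n - 1, k - 1) * p**k * (1 - p)**(n - k) * k / n
                       for k in range(1, n + 1))
               + (1 - 2 * p) * p_prime - p)
print(f"direct computation = {gap:.6f},  formula (124) = {closed_form:.6f}")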

Appendix C Proofs in Section VI

C-A Proof of Theorem 3

The following lemma, which is implicitly contained in [9], will be useful in the proof.

Lemma 9.

Suppose the loss function is LL-Lipschitz, then,

I(Wt;Zi|Wt1)ηt2L22σt2.I\left(W_{t};Z_{i}|W_{t-1}\right)\leq\frac{\eta_{t}^{2}L^{2}}{2\sigma_{t}^{2}}. (125)
Proof of Theorem 3.

Invoking Corollary 3, it suffices to bound the in-distribution term 𝔼ν[TV(PW|Zi,PW)]\mathbb{E}_{\nu}\left[\mathrm{TV}\left(P_{W|Z_{i}},P_{W}\right)\right]. Fixing the sample path U[T]U_{[T]}, we have

𝔼ν[TV(PW|Zi,PW)]\displaystyle\mathbb{E}_{\nu}\left[\mathrm{TV}\left(P_{W|Z_{i}},P_{W}\right)\right] 𝔼ν[TV2(PW|Zi,PW)]\displaystyle\leq\sqrt{\mathbb{E}_{\nu}\left[\mathrm{TV}^{2}\left(P_{W|Z_{i}},P_{W}\right)\right]} (126)
𝔼ν[1exp(DKL(PW|Zi||PW))]\displaystyle\leq\sqrt{\mathbb{E}_{\nu}\left[1-\exp\left(-D_{\mathrm{KL}}\left(P_{W|Z_{i}}||P_{W}\right)\right)\right]} (127)
1exp(𝔼ν[DKL(PW|Zi||PW)])\displaystyle\leq\sqrt{1-\exp\left(-\mathbb{E}_{\nu}\left[D_{\mathrm{KL}}\left(P_{W|Z_{i}}||P_{W}\right)\right]\right)} (128)
=1exp(I(W;Zi))\displaystyle=\sqrt{1-\exp\left(-I\left(W;Z_{i}\right)\right)} (129)
1exp(I(W[0:T];Zi))\displaystyle\leq\sqrt{1-\exp\left(-I\left(W_{[0:T]};Z_{i}\right)\right)} (130)
=1exp(t=1TI(Wt;Zi|Wt1))\displaystyle=\sqrt{1-\exp\left(-\sum_{t=1}^{T}I\left(W_{t};Z_{i}|W_{t-1}\right)\right)} (131)
1exp(t𝒯iηt2L22σt2).\displaystyle\leq\sqrt{1-\exp\left(-\sum_{t\in\mathcal{T}_{i}}\frac{\eta_{t}^{2}L^{2}}{2\sigma_{t}^{2}}\right)}. (132)

In the above, inequalities (126) and (128) follow by Jensen's inequality, combined with the fact that the functions x\sqrt{x} and 1ex1-e^{-x} are both concave. Inequality (127) follows by the Bretagnolle-Huber inequality. To obtain inequality (130), we use the inequality I(W;Zi)I(W[0:T];Zi)I(W;Z_{i})\leq I(W_{[0:T]};Z_{i}) and the fact that the function 1ex1-e^{-x} is monotonically increasing. Equality (131) follows by the chain rule of mutual information, and the last inequality follows by Lemma 9 and the fact that WtW_{t} is independent of ZiZ_{i} given Wt1W_{t-1} if t𝒯it\notin\mathcal{T}_{i}.

On the other hand, we use Pinsker’s inequality to yield

𝔼ν[TV(PW|Zi,PW)]\displaystyle\mathbb{E}_{\nu}\left[\mathrm{TV}\left(P_{W|Z_{i}},P_{W}\right)\right] 𝔼ν[12DKL(PW|Zi||PW)]\displaystyle\leq\mathbb{E}_{\nu}\left[\sqrt{\frac{1}{2}D_{\mathrm{KL}}\left(P_{W|Z_{i}}||P_{W}\right)}\right] (133)
I(W;Zi)2\displaystyle\leq\sqrt{\frac{I(W;Z_{i})}{2}} (134)
12t𝒯iηt2L2σt2,\displaystyle\leq\frac{1}{2}\sqrt{\sum_{t\in\mathcal{T}_{i}}\frac{\eta_{t}^{2}L^{2}}{\sigma_{t}^{2}}}, (135)

where the last inequality follows by an argument similar to (129)–(132). Therefore, the 𝔼ν[TV(PW|Zi,PW)]\mathbb{E}_{\nu}\left[\mathrm{TV}\left(P_{W|Z_{i}},P_{W}\right)\right] term is bounded by the minimum of (132) and (135). Plugging this result into Corollary 3 and taking the expectation w.r.t. the sample path U[T]U_{[T]} completes the proof. ∎

C-B Proof of Theorem 4

For the proof of Theorem 4, the following lemmas will be useful.

Lemma 10 (Chain Rule of Hellinger Distance).

Let PP and QQ be two probability distributions on the random vector X[n]=(X1,,Xn)X_{[n]}=(X_{1},\ldots,X_{n}). Then,

H(PX[n]||QX[n])i=1n𝔼PX[i1][H2(PXi|X[i1]||QXi|X[i1])].H\left(P_{X_{[n]}}||Q_{X_{[n]}}\right)\leq\sum_{i=1}^{n}\sqrt{\mathbb{E}_{P_{X_{[i-1]}}}\left[H^{2}\left(P_{X_{i}|X_{[i-1]}}||Q_{X_{i}|X_{[i-1]}}\right)\right]}. (136)
Lemma 11 (Squared Hellinger Divergence between two Gaussian Distribution).

Let P=𝒩(μ1,𝚺1)P=\mathcal{N}(\mathbf{\mu}_{1},\mathbf{\Sigma}_{1}) and Q=𝒩(μ2,𝚺2)Q=\mathcal{N}(\mathbf{\mu}_{2},\mathbf{\Sigma}_{2}) be two multivariate Gaussian distributions. Then the squared Hellinger divergence of PP and QQ can be written as

H2(P||Q)=22det(𝚺1)1/4det(𝚺2)1/4det(𝚺1+𝚺22)1/2exp(18(μ1μ2)T(𝚺1+𝚺22)1(μ1μ2)).H^{2}\left(P||Q\right)=2-2\frac{\det(\mathbf{\Sigma}_{1})^{1/4}\det(\mathbf{\Sigma}_{2})^{1/4}}{\det\left(\frac{\mathbf{\Sigma}_{1}+\mathbf{\Sigma}_{2}}{2}\right)^{1/2}}\exp\left(-\frac{1}{8}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{\mathrm{T}}\left(\dfrac{\mathbf{\Sigma}_{1}+\mathbf{\Sigma}_{2}}{2}\right)^{-1}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})\right). (137)
Proof Sketch.

It follows from the Gaussian integral nexp(12𝐱T𝐀𝐱+𝐛T𝐱)d𝐱=(2π)ndet(𝐀)exp(12𝐛T𝐀1𝐛)\displaystyle\int_{\mathbb{R}^{n}}\exp\left(-\frac{1}{2}\mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x}+\mathbf{b}^{\mathrm{T}}\mathbf{x}\right)\mathrm{d}\mathbf{x}=\sqrt{\frac{(2\pi)^{n}}{\det(\mathbf{A})}}\exp\left(\frac{1}{2}\mathbf{b}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{b}\right) and the equality (𝚺11+𝚺21)1=𝚺1(𝚺1+𝚺2)1𝚺2=𝚺2(𝚺1+𝚺2)1𝚺1\left(\mathbf{\Sigma}_{1}^{-1}+\mathbf{\Sigma}_{2}^{-1}\right)^{-1}=\mathbf{\Sigma}_{1}\left(\mathbf{\Sigma}_{1}+\mathbf{\Sigma}_{2}\right)^{-1}\mathbf{\Sigma}_{2}=\mathbf{\Sigma}_{2}\left(\mathbf{\Sigma}_{1}+\mathbf{\Sigma}_{2}\right)^{-1}\mathbf{\Sigma}_{1}. The last two equalities follow by the Woodbury matrix identity (𝐀+𝐔𝐂𝐕)1=𝐀1𝐀1𝐔(𝐂1+𝐕𝐀1𝐔)1𝐕𝐀1\displaystyle\left(\mathbf{A}+\mathbf{U}\mathbf{C}\mathbf{V}\right)^{-1}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{U}\left(\mathbf{C}^{-1}+\mathbf{V}\mathbf{A}^{-1}\mathbf{U}\right)^{-1}\mathbf{V}\mathbf{A}^{-1}. Then the Lemma 11 follows by direct calculation. ∎
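
A quick one-dimensional numerical check of (137) (our own sanity check; it compares the closed form against a direct evaluation of 2 - 2 * integral of sqrt(p*q) on a grid, for arbitrary parameters):

import numpy as np

mu1, s1 = 0.0, 1.0
mu2, s2 = 1.5, 2.0

# Closed form (137) specialized to one-dimensional Gaussians.
coeff = np.sqrt(2 * s1 * s2 / (s1**2 + s2**2))
closed = 2 - 2 * coeff * np.exp(-(mu1 - mu2) ** 2 / (4 * (s1**2 + s2**2)))

# Direct numerical evaluation of 2 - 2 * integral sqrt(p * q) on a fine grid.
x = np.linspace(-25.0, 25.0, 500_001)
dx = x[1] - x[0]
pdf = lambda t, mu, s: np.exp(-(t - mu) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
numeric = 2 - 2 * np.sum(np.sqrt(pdf(x, mu1, s1) * pdf(x, mu2, s2))) * dx

print(f"closed form (137) = {closed:.6f},  numerical = {numeric:.6f}")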

Lemma 12.

Suppose the loss function is LL-Lipschitz, i.e., supw𝒲,z𝒵w(w,z)L\displaystyle\sup_{w\in\mathcal{W},z\in\mathcal{Z}}\|\nabla_{w}\ell(w,z)\|\leq L. Then, we have

𝔼W,Zi±[W(W,Zi+)W(W,Zi)2]L22.\mathbb{E}_{W,Z_{i}^{\pm}}\left[\|\nabla_{W}\ell(W,Z_{i}^{+})-\nabla_{W}\ell(W,Z_{i}^{-})\|^{2}\right]\leq\frac{L^{2}}{2}. (138)
Proof.

Since W(W,Zi+)\nabla_{W}\ell(W,Z_{i}^{+}) and W(W,Zi)\nabla_{W}\ell(W,Z_{i}^{-}) are independent and identically distributed conditioned on WW, we have

𝔼W,Zi±[W(W,Zi+)W(W,Zi)2]\displaystyle\mathbb{E}_{W,Z_{i}^{\pm}}\left[\|\nabla_{W}\ell(W,Z_{i}^{+})-\nabla_{W}\ell(W,Z_{i}^{-})\|^{2}\right] =2𝔼W[𝔼Z|W[W(W,Z)2](𝔼Z|W[W(W,Z)])2]\displaystyle=2\mathbb{E}_{W}\left[\mathbb{E}_{Z|W}\left[\|\nabla_{W}\ell(W,Z)\|^{2}\right]-\left(\mathbb{E}_{Z|W}\left[\|\nabla_{W}\ell(W,Z)\|\right]\right)^{2}\right] (139)
=2𝔼W[VarZ|W(W(W,Z))]\displaystyle=2\mathbb{E}_{W}\left[\mathrm{Var}_{Z|W}(\|\nabla_{W}\ell(W,Z)\|)\right] (140)
L2/2,\displaystyle\leq L^{2}/2, (141)

where we remove the superscript and subscript of ZZ for simplicity. The last inequality follows since the variance of a bounded random variable X[a,b]X\in[a,b] is no more than (ba)2/4(b-a)^{2}/4. ∎

Now we are ready to prove the main result of this subsection.

Proof of Theorem 4.

By Theorem 2, we have

gen(PW|Zn,ν)\displaystyle\mathrm{gen}\left(P_{W|Z^{n}},\nu\right) bani=1n2If(WT;Ri|Zi±)f′′(1)\displaystyle\leq\frac{b-a}{n}\sum_{i=1}^{n}\sqrt{\frac{2I_{f}\left(W_{T};R_{i}|Z_{i}^{\pm}\right)}{f^{\prime\prime}(1)}} (142)
bani=1n2If(W[0:T];Ri|Zi±)f′′(1)\displaystyle\leq\frac{b-a}{n}\sum_{i=1}^{n}\sqrt{\frac{2I_{f}\left(W_{[0:T]};R_{i}|Z_{i}^{\pm}\right)}{f^{\prime\prime}(1)}} (143)
2(ba)ni=1nt𝒯iIH2(Wt;Ri|Wt1,Zi±),\displaystyle\leq\frac{2(b-a)}{n}\sum_{i=1}^{n}\sum_{t\in\mathcal{T}_{i}}\sqrt{I_{H^{2}}\left(W_{t};R_{i}|W_{t-1},Z_{i}^{\pm}\right)}, (144)

where the last inequality follows by Lemma 10. As a consequence, it is sufficient to bound the IH2(Wt;Ri|Wt1,Zi±)I_{H^{2}}\left(W_{t};R_{i}|W_{t-1},Z_{i}^{\pm}\right) term. Let PRi|Wt1,Zi±=αδ1+(1α)δ1P_{R_{i}|W_{t-1},Z_{i}^{\pm}}=\alpha\delta_{1}+(1-\alpha)\delta_{-1}, i.e., RiR_{i} takes 1 with probability α\alpha and takes 1-1 with probability (1α)(1-\alpha), given Wt1W_{t-1} and Zi±Z_{i}^{\pm}. On the other hand, we have

PWt|Wt1,Zi±,Ri={𝒩(Wt1ηtW(Wt1,Zi+),σt2𝐈)=𝒩+,ifRi=1,𝒩(Wt1ηtW(Wt1,Zi),σt2𝐈)=𝒩,ifRi=1,P_{W_{t}|W_{t-1},Z_{i}^{\pm},R_{i}}=\begin{cases}\mathcal{N}\left(W_{t-1}-\eta_{t}\nabla_{W}\ell(W_{t-1},Z_{i}^{+}),\sigma_{t}^{2}\mathbf{I}\right)\overset{\triangle}{=}\mathcal{N}^{+},&\text{if}\ R_{i}=1,\\ \mathcal{N}\left(W_{t-1}-\eta_{t}\nabla_{W}\ell(W_{t-1},Z_{i}^{-}),\sigma_{t}^{2}\mathbf{I}\right)\overset{\triangle}{=}\mathcal{N}^{-},&\text{if}\ R_{i}=-1,\end{cases} (145)

and thus PWt|Wt1,Zi±P_{W_{t}|W_{t-1},Z_{i}^{\pm}} is mixture Gaussian, expressed by

PWt|Wt1,Zi±=α𝒩++(1α)𝒩.P_{W_{t}|W_{t-1},Z_{i}^{\pm}}=\alpha\mathcal{N}^{+}+(1-\alpha)\mathcal{N}^{-}. (146)

Keep in mind that α\alpha is a function of Wt1W_{t-1} and Zi±Z_{i}^{\pm}, and the Gaussian components 𝒩+\mathcal{N}^{+} and 𝒩\mathcal{N}^{-} are also parameterized by Wt1W_{t-1} and Zi±Z_{i}^{\pm}. We drop these dependencies from the notation for abbreviation. Plugging (145) and (146) into (144) yields

IH2(Wt;Ri|Wt1,Zi±)\displaystyle I_{H^{2}}\left(W_{t};R_{i}|W_{t-1},Z_{i}^{\pm}\right) =𝔼Wt1,Zi±[𝔼Ri|Wt1,Zi±[H2(PWt|Wt1,Zi±,Ri||PWt|Wt1,Zi±)]]\displaystyle=\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[\mathbb{E}_{R_{i}|W_{t-1},Z_{i}^{\pm}}\left[H^{2}\left(P_{W_{t}|W_{t-1},Z_{i}^{\pm},R_{i}}||P_{W_{t}|W_{t-1},Z_{i}^{\pm}}\right)\right]\right] (147)
=𝔼Wt1,Zi±[αH2(𝒩+||α𝒩++(1α)𝒩)+(1α)H2(𝒩||α𝒩++(1α)𝒩)]\displaystyle=\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[\alpha H^{2}\left(\mathcal{N}^{+}||\alpha\mathcal{N}^{+}+(1-\alpha)\mathcal{N}^{-}\right)+(1-\alpha)H^{2}\left(\mathcal{N}^{-}||\alpha\mathcal{N}^{+}+(1-\alpha)\mathcal{N}^{-}\right)\right] (148)
𝔼Wt1,Zi±[α(1α)(H2(𝒩+||𝒩)+H2(𝒩||𝒩+))]\displaystyle\leq\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[\alpha(1-\alpha)\left(H^{2}\left(\mathcal{N}^{+}||\mathcal{N}^{-}\right)+H^{2}\left(\mathcal{N}^{-}||\mathcal{N}^{+}\right)\right)\right] (149)
𝔼Wt1,Zi±[14(H2(𝒩+||𝒩)+H2(𝒩||𝒩+))].\displaystyle\leq\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[\frac{1}{4}\left(H^{2}\left(\mathcal{N}^{+}||\mathcal{N}^{-}\right)+H^{2}\left(\mathcal{N}^{-}||\mathcal{N}^{+}\right)\right)\right]. (150)

To obtain the first two inequalities, we use the fact that any ff-divergence Df(||)D_{f}\left(\cdot||\cdot\right), and in particular the squared Hellinger divergence, is jointly convex w.r.t. its arguments, and thus separately convex. To obtain the last inequality, we use the fundamental inequality xyx+y2\sqrt{xy}\leq\frac{x+y}{2} to eliminate α\alpha. By Lemma 11, we have

H2(𝒩+||𝒩)\displaystyle H^{2}\left(\mathcal{N}^{+}||\mathcal{N}^{-}\right) =H2(𝒩||𝒩+)\displaystyle=H^{2}\left(\mathcal{N}^{-}||\mathcal{N}^{+}\right) (151)
=22exp(ηt2W(Wt1,Zi+)W(Wt1,Zi)28σt2),\displaystyle=2-2\exp\left(-\frac{\eta_{t}^{2}\|\nabla_{W}\ell(W_{t-1},Z_{i}^{+})-\nabla_{W}\ell(W_{t-1},Z_{i}^{-})\|^{2}}{8\sigma_{t}^{2}}\right), (152)

and thus

IH2(Wt;Ri|Wt1,Zi±)\displaystyle I_{H^{2}}\left(W_{t};R_{i}|W_{t-1},Z_{i}^{\pm}\right) 𝔼Wt1,Zi±[1exp(ηt2W(Wt1,Zi+)W(Wt1,Zi)28σt2)]\displaystyle\leq\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[1-\exp\left(-\frac{\eta_{t}^{2}\|\nabla_{W}\ell(W_{t-1},Z_{i}^{+})-\nabla_{W}\ell(W_{t-1},Z_{i}^{-})\|^{2}}{8\sigma_{t}^{2}}\right)\right] (153)
1exp(ηt28σt2𝔼Wt1,Zi±[W(Wt1,Zi+)W(Wt1,Zi)2])\displaystyle\leq 1-\exp\left(-\frac{\eta_{t}^{2}}{8\sigma_{t}^{2}}\mathbb{E}_{W_{t-1},Z_{i}^{\pm}}\left[\|\nabla_{W}\ell(W_{t-1},Z_{i}^{+})-\nabla_{W}\ell(W_{t-1},Z_{i}^{-})\|^{2}\right]\right) (154)
1exp(ηt2L216σt2).\displaystyle\leq 1-\exp\left(-\frac{\eta_{t}^{2}L^{2}}{16\sigma_{t}^{2}}\right). (155)

To obtain the second inequality we use the concavity of the function 1ex1-e^{-x}, and to obtain the last inequality we use Lemma 12 and the fact that 1ex1-e^{-x} is monotonically increasing. Plugging (155) into (144) yields

gen(PW|Zn,ν)\displaystyle\mathrm{gen}\left(P_{W|Z^{n}},\nu\right) 2(ba)ni=1nt𝒯i1exp(ηt2L216σt2)\displaystyle\leq\frac{2(b-a)}{n}\sum_{i=1}^{n}\sum_{t\in\mathcal{T}_{i}}\sqrt{1-\exp\left(-\frac{\eta_{t}^{2}L^{2}}{16\sigma_{t}^{2}}\right)} (156)
=2(ba)nt[T]1exp(ηt2L216σt2),\displaystyle=\frac{2(b-a)}{n}\sum_{t\in[T]}\sqrt{1-\exp\left(-\frac{\eta_{t}^{2}L^{2}}{16\sigma_{t}^{2}}\right)},

which is the desired result. ∎

C-C Proof of Lemma 5

Proof of Lemma 5.
Figure 5: The Bayes network structure of the SGLD algorithm (with Zi±=zi±Z_{i}^{\pm}=z_{i}^{\pm} fixed and only RiR_{i} involved).

Let the super-sample Zi±=zi±Z_{i}^{\pm}=z_{i}^{\pm} be fixed. The SGLD optimization trajectory induces a Bayes network 𝒢\mathcal{G}, as illustrated in Fig. 5, where only RiR_{i} is considered. Since DfD_{f} is sub-additive, we have

Df(P||Q)\displaystyle D_{f}\left(P||Q\right) v𝒢Df(P{v}Πv||Q{v}Πv)\displaystyle\leq\sum_{v\in\mathcal{G}}D_{f}\left(P_{\{v\}\cup\Pi_{v}}||Q_{\{v\}\cup\Pi_{v}}\right) (157)
=Df(PRi||QRi)+Df(PW0||QW0)+t𝒯iDf(PWt,Wt1,Ri||QWt,Wt1,Ri)+t𝒯iDf(PWt,Wt1||QWt,Wt1),\displaystyle=D_{f}\left(P_{R_{i}}||Q_{R_{i}}\right)+D_{f}\left(P_{W_{0}}||Q_{W_{0}}\right)+\sum_{t\in\mathcal{T}_{i}}D_{f}\left(P_{W_{t},W_{t-1},R_{i}}||Q_{W_{t},W_{t-1},R_{i}}\right)+\sum_{t\notin\mathcal{T}_{i}}D_{f}\left(P_{W_{t},W_{t-1}}||Q_{W_{t},W_{t-1}}\right), (158)

for all distributions PP and QQ over 𝒲T+1×{1,+1}\mathcal{W}^{T+1}\times\{-1,+1\}. Now define PP as the joint distribution of W[0:T]W_{[0:T]} and RiR_{i} induced by the SGLD algorithm, i.e.,

P\displaystyle P PRiPW[0:T]|Ri\displaystyle\coloneqq P_{R_{i}}\otimes P_{W_{[0:T]}|R_{i}} (159)
=PRiPW0t𝒯iPWt|Wt1t𝒯iPWt|Wt1,Ri,\displaystyle=P_{R_{i}}P_{W_{0}}\prod_{t\notin\mathcal{T}_{i}}P_{W_{t}|W_{t-1}}\prod_{t\in\mathcal{T}_{i}}P_{W_{t}|W_{t-1},R_{i}}, (160)

and define QQ as the product of marginal distribution of W[0:T]W_{[0:T]} and RiR_{i}, i.e.,

Q\displaystyle Q PRiPW[0:T]\displaystyle\coloneqq P_{R_{i}}\otimes P_{W_{[0:T]}} (161)
=PRiPW0t[T]PWt|Wt1.\displaystyle=P_{R_{i}}P_{W_{0}}\prod_{t\in[T]}P_{W_{t}|W_{t-1}}. (162)

By induction over tt, one can prove that PP and QQ have the same marginal distribution at each vertex in the graph 𝒢\mathcal{G}, i.e., PRi=QRiP_{R_{i}}=Q_{R_{i}}, PW0=QW0P_{W_{0}}=Q_{W_{0}}, and PWt=QWt,t[T]P_{W_{t}}=Q_{W_{t}},\ \forall t\in[T]. Therefore, we have

Df(PRi||QRi)\displaystyle D_{f}\left(P_{R_{i}}||Q_{R_{i}}\right) =0,\displaystyle=0, (163a)
Df(PW0||QW0)\displaystyle D_{f}\left(P_{W_{0}}||Q_{W_{0}}\right) =0,\displaystyle=0, (163b)
Df(PWt,Wt1||QWt,Wt1)=Df(PWt1PWt|Wt1||QWt1QWt|Wt1)=0,t𝒯i,\displaystyle\begin{split}D_{f}\left(P_{W_{t},W_{t-1}}||Q_{W_{t},W_{t-1}}\right)&=D_{f}\left(P_{W_{t-1}}P_{W_{t}|W_{t-1}}||Q_{W_{t-1}}Q_{W_{t}|W_{t-1}}\right)\\ &=0,\ t\notin\mathcal{T}_{i},\end{split} (163c)
Df(PWt,Wt1,Ri||QWt,Wt1,Ri)=𝔼Wt1[Df(PWt,Ri|Wt1||QWt,Ri|Wt1)],=𝔼Wt1[Df(PWt,Ri|Wt1||PWt|Wt1PRi)],t𝒯i.\displaystyle\begin{split}D_{f}\left(P_{W_{t},W_{t-1},R_{i}}||Q_{W_{t},W_{t-1},R_{i}}\right)&=\mathbb{E}_{W_{t-1}}\left[D_{f}\left(P_{W_{t},R_{i}|W_{t-1}}||Q_{W_{t},R_{i}|W_{t-1}}\right)\right],\\ &=\mathbb{E}_{W_{t-1}}\left[D_{f}\left(P_{W_{t},R_{i}|W_{t-1}}||P_{W_{t}|W_{t-1}}P_{R_{i}}\right)\right],\ t\in\mathcal{T}_{i}.\end{split} (163d)

Adding all terms together and using the fact that Df(P||Q)=If(W[0:T];Ri)D_{f}\left(P||Q\right)=I_{f}(W_{[0:T]};R_{i}), we have

If(W[0:T];Ri)t𝒯i𝔼Wt1[Df(PWt,Ri|Wt1||PWt|Wt1PRi)].I_{f}(W_{[0:T]};R_{i})\leq\sum_{t\in\mathcal{T}_{i}}\mathbb{E}_{W_{t-1}}\left[D_{f}\left(P_{W_{t},R_{i}|W_{t-1}}||P_{W_{t}|W_{t-1}}P_{R_{i}}\right)\right]. (164)

Keep in mind that (164) is established based on the condition Zi±=zi±Z_{i}^{\pm}=z_{i}^{\pm}. Thus, the desired result (59) follows by taking expectation over zi±z_{i}^{\pm}. ∎

References

  • [1] W. Liu, G. Yu, L. Wang, and R. Liao, “An information-theoretic framework for out-of-distribution generalization,” arXiv preprint arXiv:2403.19895, 2024.
  • [2] V. Vapnik and A. Y. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” Theory of Probability & Its Applications, vol. 16, no. 2, pp. 264–280, 1971.
  • [3] P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 463–482, 2002.
  • [4] D. Pollard, Convergence of stochastic processes.   David Pollard, 1984.
  • [5] O. Bousquet and A. Elisseeff, “Stability and generalization,” The Journal of Machine Learning Research, vol. 2, pp. 499–526, 2002.
  • [6] D. A. McAllester, “Some pac-bayesian theorems,” in Proceedings of the eleventh annual conference on Computational learning theory, 1998, pp. 230–234.
  • [7] D. Russo and J. Zou, “Controlling bias in adaptive data analysis using information theory,” in Artificial Intelligence and Statistics.   PMLR, 2016, pp. 1232–1240.
  • [8] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [9] Y. Bu, S. Zou, and V. V. Veeravalli, “Tightening mutual information-based bounds on generalization error,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 121–130, 2020.
  • [10] T. Steinke and L. Zakynthinou, “Reasoning about generalization via conditional mutual information,” in Conference on Learning Theory.   PMLR, 2020, pp. 3437–3452.
  • [11] M. Haghifam, J. Negrea, A. Khisti, D. M. Roy, and G. K. Dziugaite, “Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms,” Advances in Neural Information Processing Systems, vol. 33, pp. 9925–9935, 2020.
  • [12] F. Hellström and G. Durisi, “Generalization bounds via information density and conditional information density,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 824–839, 2020.
  • [13] J. Negrea, M. Haghifam, G. K. Dziugaite, A. Khisti, and D. M. Roy, “Information-theoretic generalization bounds for sgld via data-dependent estimates,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [14] B. Rodríguez-Gálvez, G. Bassi, R. Thobaben, and M. Skoglund, “On random subset generalization error bounds and the stochastic gradient langevin dynamics algorithm,” in 2020 IEEE Information Theory Workshop (ITW).   IEEE, 2021, pp. 1–5.
  • [15] R. Zhou, C. Tian, and T. Liu, “Individually conditional individual mutual information bound on generalization error,” IEEE Transactions on Information Theory, vol. 68, no. 5, pp. 3304–3316, 2022.
  • [16] G. Aminian, S. Masiha, L. Toni, and M. R. Rodrigues, “Learning algorithm generalization error bounds via auxiliary distributions,” arXiv preprint arXiv:2210.00483, 2022.
  • [17] Y. Chu and M. Raginsky, “A unified framework for information-theoretic generalization bounds,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [18] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz, “Invariant risk minimization,” arXiv preprint arXiv:1907.02893, 2019.
  • [19] H. Ye, C. Xie, T. Cai, R. Li, Z. Li, and L. Wang, “Towards a theoretical framework of out-of-distribution generalization,” Advances in Neural Information Processing Systems, vol. 34, pp. 23 519–23 531, 2021.
  • [20] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” Advances in neural information processing systems, vol. 19, 2006.
  • [21] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, pp. 151–175, 2010.
  • [22] M. Sugiyama, S. Nakajima, H. Kashima, P. Buenau, and M. Kawanabe, “Direct importance estimation with model selection and its application to covariate shift adaptation,” Advances in neural information processing systems, vol. 20, 2007.
  • [23] Y. Mansour, M. Mohri, and A. Rostamizadeh, “Multiple source adaptation and the rényi divergence,” arXiv preprint arXiv:1205.2628, 2012.
  • [24] X. Wu, J. H. Manton, U. Aickelin, and J. Zhu, “Information-theoretic analysis for transfer learning,” in 2020 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2020, pp. 2819–2824.
  • [25] M. S. Masiha, A. Gohari, M. H. Yassaee, and M. R. Aref, “Learning under distribution mismatch and model misspecification,” in 2021 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2021, pp. 2912–2917.
  • [26] Z. Wang and Y. Mao, “Information-theoretic analysis of unsupervised domain adaptation,” arXiv preprint arXiv:2210.00706, 2022.
  • [27] A. R. Esposito, M. Gastpar, and I. Issa, “Generalization error bounds via Rényi-, ff-divergences and maximal leakage,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 4986–5004, 2021.
  • [28] G. Lugosi and G. Neu, “Generalization bounds via convex analysis,” in Conference on Learning Theory.   PMLR, 2022, pp. 3524–3546.
  • [29] P. Viallard, M. Haddouche, U. Şimşekli, and B. Guedj, “Tighter generalisation bounds via interpolation,” arXiv preprint arXiv:2402.05101, 2024.
  • [30] Z. Wang and Y. Mao, “On ff-divergence principled domain adaptation: An improved framework,” arXiv preprint arXiv:2402.01887, 2024.
  • [31] S. M. Perlaza and X. Zou, “The generalization error of machine learning algorithms,” arXiv preprint arXiv:2411.12030, 2024.
  • [32] F. Daunas, I. Esnaola, S. M. Perlaza, and H. V. Poor, “Equivalence of the empirical risk minimization to regularization on the family of f-divergences,” arXiv preprint arXiv:2402.00501, 2024.
  • [33] V. Vapnik, “Principles of risk minimization for learning theory,” Advances in neural information processing systems, vol. 4, 1991.
  • [34] Y. Polyanskiy and Y. Wu, “Information theory: From coding to learning,” Book draft, 2022.
  • [35] J. Birrell, P. Dupuis, M. A. Katsoulakis, Y. Pantazis, and L. Rey-Bellet, “(f,Γ)(f,{\Gamma})-divergences: Interpolating between ff-divergences and integral probability metrics,” Journal of machine learning research, vol. 23, no. 39, pp. 1–70, 2022.
  • [36] R. Agrawal and T. Horel, “Optimal bounds between ff-divergences and integral probability metrics,” The Journal of Machine Learning Research, vol. 22, no. 1, pp. 5662–5720, 2021.
  • [37] A. Müller, “Integral probability metrics and their generating classes of functions,” Advances in applied probability, vol. 29, no. 2, pp. 429–443, 1997.
  • [38] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence.   Oxford University Press, 2013.
  • [39] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of stability and bias of learning algorithms,” in 2016 IEEE Information Theory Workshop (ITW).   IEEE, 2016, pp. 26–30.
  • [40] V. Feldman and T. Steinke, “Calibrating noise to variance in adaptive data analysis,” in Conference On Learning Theory.   PMLR, 2018, pp. 535–544.
  • [41] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient langevin dynamics,” in Proceedings of the 28th international conference on machine learning (ICML-11).   Citeseer, 2011, pp. 681–688.
  • [42] H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951.
  • [43] J. Bretagnolle and C. Huber, “Estimation des densités: risque minimax,” Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 47, pp. 119–137, 1979.
  • [44] M. Ding, C. Daskalakis, and S. Feizi, “Gans with conditional independence graphs: On subadditivity of probability divergences,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2021, pp. 3709–3717.