
Towards Robust Out-of-Distribution
Generalization Bounds via Sharpness

Yingtian Zou, Kenji Kawaguchi, Yingnan Liu, Jiashuo Liu, Mong-Li Lee, Wynne Hsu
School of Computing, National University of Singapore
Institute of Data Science, National University of Singapore
Department of Computer Science & Technology, Tsinghua University, China
Abstract

Generalizing to out-of-distribution (OOD) data or unseen domains, termed OOD generalization, still lacks appropriate theoretical guarantees. Canonical OOD bounds focus on different distance measures between source and target domains but fail to consider the optimization properties of the learned model. As empirically shown in recent work, the sharpness of learned minima influences OOD generalization. To bridge this gap between optimization and OOD generalization, we study the effect of sharpness on how a model tolerates data changes under domain shift, which is usually captured by "robustness" in generalization. In this paper, we give a rigorous connection between sharpness and robustness, which yields better OOD guarantees for robust algorithms. It also provides theoretical backing for the claim that "flat minima lead to better OOD generalization". Overall, we propose a sharpness-based OOD generalization bound that takes robustness into account, resulting in a tighter bound than non-robust guarantees. Our findings are supported by experiments on a ridge regression model as well as on deep learning classification tasks.

1 Introduction

Machine learning systems are typically trained on data drawn from a given distribution and achieve good performance on new, unseen data that follows the same distribution as the training data. Out-of-Distribution (OOD) generalization requires machine learning systems trained on a source domain to generalize to unseen data or target domains whose distributions differ from the source domain. A myriad of algorithms (Sun & Saenko, 2016; Arjovsky et al., 2019; Sagawa et al., 2019; Koyama & Yamaguchi, 2020; Pezeshki et al., 2021; Ahuja et al., 2021) aim to learn components that remain invariant under distribution shift. Optimization-based methods such as (El Ghaoui & Lebret, 1997; Duchi & Namkoong, 2018; Liu et al., 2021; Rame et al., 2022) focus on maximizing robustness by optimizing the worst-case error over an uncertainty set of distributions. While these methods are sophisticated, they do not always perform better than Empirical Risk Minimization (ERM) when evaluated across different datasets (Gulrajani & Lopez-Paz, 2021; Wiles et al., 2022). This raises the question of how to understand the OOD generalization of algorithms and which criteria should be used to select models that are provably better (Gulrajani & Lopez-Paz, 2021). These questions highlight the need for more theoretical research in the field of OOD generalization (Ye et al., 2021).

To characterize the generalization gap between the source domain and the target domain, a canonical method (Blitzer et al., 2007) from domain adaptation theory decouples this gap into an In-Distribution (ID) generalization error and a hypothesis-specific Out-of-Distribution (OOD) distance. However, this distance is based on the notion of VC-dimension (Kifer et al., 2004), resulting in a loose bound due to the large size of the hypothesis class in modern overparameterized neural networks. Subsequent works improve the bound based on Rademacher Complexity (Du et al., 2017), whereas Germain et al. (2016) improve the bound based on PAC-Bayes. Unlike the present paper, these works did not consider algorithmic robustness, which has a natural interpretation and advantages under distribution shifts. In this work, we use algorithmic robustness to derive an OOD generalization bound. The key idea is to partition the input space into $K$ non-overlapping subspaces such that the difference in the model's error between any pair of points in each subspace is bounded by some constant $\epsilon$. Within each subspace, any distributional shift is considered subtle for the robust model and thus has less impact on OOD generalization. Figure 1 illustrates this with two distributions where the target domain has a distributional shift from the source domain. Compared to existing non-robust OOD generalization bounds (Zhao et al., 2018), our new generalization error does not depend on the hypothesis class size, which is more reliable in the overparameterized regime. Our goal is to measure the generalizability of a model by considering how robust it is to this shift, achieving a tighter bound than existing works.

Figure 1: An example of a target distribution (red) directly translated from a source distribution (blue). The 1D density reflects the marginal distribution. Unlike existing works (left), we divide the distributions into disjoint partitions, as a small change in distribution is negligible for a robust model (right). The sharpness of the model determines the tolerance to change and thus affects the partitions. If two sub-distributions $S,T$ have small shifts such that they fall into the same partition (red grid), their distance measure $d^{\prime}(S,T)$, which accounts for robustness, will be zero.

Although robustness captures the tolerance to distributional shift, the robustness constant $\epsilon$ is intractable to compute due to the inaccessibility of the target distribution. The robustness definition in Xu & Mannor (2012) indicates that the loss landscape induced by the model's parameters is closely tied to its robustness. To gain a deeper understanding of robustness, we further study the learned model from an optimization perspective. As shown in (Lyu et al., 2022; Petzka et al., 2021), when the loss landscape is "flat", generalization is good, which is also observed in OOD settings (Izmailov et al., 2018; Cha et al., 2021). However, the relationship between robustness and this geometric property of the loss landscape, termed sharpness, remains an open question. In this paper, we establish a provable dependence between robustness and sharpness for ReLU random neural network classes. This allows us to replace the robustness constant $\epsilon$ with the sharpness of the learned model, which can be computed from the training dataset alone, thereby addressing the problem mentioned above. Our result on the interplay between robustness and sharpness can be applied to both ID and OOD generalization bounds. We also show an example that generalizes our result beyond our assumptions and validate it empirically.

Our main contributions can be summarized as follows:

  • We propose a new framework for Out-of-Distribution/Out-of-Domain generalization bounds. In this framework, we use robustness to capture the tolerance to distribution shift, which generally leads to tighter upper bounds.

  • We reveal the underlying connection between the robustness and the sharpness of the loss landscape and use this connection to enrich our robust OOD bounds for one-hidden-layer ReLU NNs. This is the first optimization-based bound for Out-of-Distribution/Domain generalization.

  • We study two cases, ridge regression and classification, which support and generalize our main theorem. All experimental results corroborate our findings.

2 Preliminary

Notations

We use $[n]$ to denote the integer set $\{i\}_{i=1}^{n}$. We use $\|\cdot\|$ to denote the $\ell_{2}$-norm (Euclidean norm). In vector form, $\bm{w}_{i}$ denotes the $i$-th instance while $w_{j}$ is the $j$-th element of the vector $\bm{w}$, and we use $|\bm{w}|$ for the element-wise absolute value of $\bm{w}$. We use $n,d$ for the training set size and input dimension. $\mathcal{O}$ is the Big-O notation.

2.1 Problem formulation

Consider a source domain and a target domain in the OOD generalization problem, where we use $\mathcal{D}_{S}$, $\mathcal{D}_{T}$ to represent the source and target distributions respectively. Let each $\mathcal{D}$ be the probability measure of a sample $\bm{z}$ from the sample space $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ with $\mathcal{X}\subseteq\mathbb{R}^{d}$. In the source domain, we have a training set $S=\{\bm{z}_{i}\}_{i=1}^{n}$ with $\bm{z}_{i}\sim\mathcal{D}_{S}$ for all $i\in[n]$, and the goal is to learn a model $f\in\mathcal{F}$ from $S$ with parameters $\bm{\theta}\in\Theta$, where $f:\Theta\times\mathcal{X}\mapsto\mathbb{R}$, that generalizes well. Given a loss function $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_{+}$, the expected risk over the source distribution $\mathcal{D}_{S}$ and its empirical counterpart are

$$\mathcal{L}_{S}(f_{\bm{\theta}})\triangleq\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}[\ell_{\bm{\theta}}(\bm{z})]=\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}[\ell(f(\bm{\theta},\bm{x}),y)],\qquad\widehat{\mathcal{L}}_{S}(f_{\bm{\theta}})\triangleq\frac{1}{n}\sum_{\bm{z}_{i}\in S}\ell_{\bm{\theta}}(\bm{z}_{i}).$$

We use $\ell_{\bm{\theta}}(\bm{z})$ as shorthand. OOD generalization measures the gap between the target-domain expected risk $\mathcal{L}_{T}(f_{\bm{\theta}})$ and the source-domain empirical risk $\widehat{\mathcal{L}}_{S}(f_{\bm{\theta}})$, which involves two parts: (1) the in-domain generalization gap between the empirical risk and the expected risk $\mathcal{L}_{S}(f_{\bm{\theta}})$ in the source domain; (2) the out-of-domain distance between the source and target domains. A model-agnostic example in Zhao et al. (2018) gives the following uniform bound:

Proposition 2.1 (Zhao et al. (2018), Theorems 2 & 3.2).

With hypothesis class $\mathcal{F}$ and pseudo dimension $\operatorname{Pdim}(\mathcal{F})=d^{\prime}$, and unlabeled empirical datasets $\widehat{\mathcal{D}}_{S}$ and $\widehat{\mathcal{D}}_{T}$ from the source and target distributions with size $n$ each, with probability at least $1-\delta$, for all $f\in\mathcal{F}$,

$$\mathcal{L}_{T}(f)\leq\widehat{\mathcal{L}}_{S}(f)+\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+\mathcal{O}\left(\sqrt{d^{\prime}/n}\right)$$

where $d_{\mathcal{F}\Delta\mathcal{F}}(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}):=2\sup_{A_{f}\in\mathcal{A}_{\{f(x)\oplus f^{\prime}(x):f,f^{\prime}\in\mathcal{F}\}}}\left|\mathbb{P}_{\widehat{\mathcal{D}}_{S}}[A_{f}]-\mathbb{P}_{\widehat{\mathcal{D}}_{T}}[A_{f}]\right|$ and $\oplus$ is the XOR operator. The specific form of $\mathcal{O}(\sqrt{|\mathcal{F}|/n})$ is defined in Appendix C.6.

2.2 Algorithmic robustness

Definition 2.2 (Robustness, Xu & Mannor (2012)).

A learning model $f_{\bm{\theta}}$ trained on set $S$ is $(K,\epsilon(\cdot))$-robust, for $K\in\mathbb{N}$, if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted by $\{C_{i}\}_{i=1}^{K}$, such that $\forall\bm{s}\in S$ we have

$$\bm{s},\bm{z}\in C_{i},\ \forall i\in[K]\ \Rightarrow\ \left|\ell_{\bm{\theta}}(\bm{s})-\ell_{\bm{\theta}}(\bm{z})\right|\leq\epsilon(S).$$

This definition captures the robustness of the model with respect to the input. Within each partitioned set $C_{i}$, the loss difference between any sample $\bm{z}$ belonging to $C_{i}$ and any training sample $\bm{s}\in C_{i}$ is upper bounded by the robustness constant $\epsilon(S)$. The generalization result of Xu & Mannor (2012) provides a framework to bound the expected risk by the empirical risk via algorithmic robustness, which is stated in Appendix C. Based on this framework, we are able to reframe the existing OOD generalization theory.
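As a concrete illustration of Definition 2.2, the following minimal Python sketch estimates the robustness constant $\epsilon(S)$ of a trained model for a given partition. The use of K-means cells as the partition $\{C_i\}$ and the `loss_fn` interface are our own illustrative assumptions, not part of the paper's construction.

```python
# Minimal sketch: estimate the (K, eps)-robustness constant of Definition 2.2
# for a trained model, given a partition of the input space. Using K-means
# cells as the partition {C_i} is an illustrative assumption.
import numpy as np
from sklearn.cluster import KMeans

def estimate_robustness(loss_fn, X_train, y_train, X_probe, y_probe, K=10, seed=0):
    """loss_fn(X, y) -> array of per-sample losses of the trained model f_theta."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X_train)
    train_cells = km.predict(X_train)
    probe_cells = km.predict(X_probe)           # probe points play the role of z in C_i
    train_losses = loss_fn(X_train, y_train)
    probe_losses = loss_fn(X_probe, y_probe)
    eps = 0.0
    for i in range(K):
        s_loss = train_losses[train_cells == i]
        z_loss = probe_losses[probe_cells == i]
        if len(s_loss) == 0 or len(z_loss) == 0:  # empty cell contributes nothing
            continue
        # max |l(s) - l(z)| over all pairs (s, z) sharing cell C_i
        eps = max(eps, np.max(np.abs(s_loss[:, None] - z_loss[None, :])))
    return eps
```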

3 Main Results

In this section, we propose a new Out-of-Distribution (OOD) generalization bound for robust algorithms, which has not been extensively studied yet. We then compare our result to the existing domain shift bound in Proposition 2.1 and discuss its implications for OOD and domain generalization problems when algorithmic robustness is taken into account. To further explain the introduced robustness, we connect it to the sharpness of the minimum (a geometric property of wide interest in optimization) by showing a rigorous dependence between robustness and sharpness. This interplay gives us a better understanding of the OOD generalization problem and, at the same time, provides more information in the final generalization bound. Detailed assumptions are clarified in Appendix B.1.

3.1 Robust OOD Generalization Bound

The main concern in OOD generalization is to measure the domain shift of a learned model. However, existing methods fail to consider intrinsic properties of the model, such as robustness. Definition 2.2 gives us a new robustness measure of a model trained on dataset $S$, where the training set $S$ is a collection of $n$ i.i.d. data pairs $(\bm{x},y)$ sampled from the source distribution $\mathcal{D}_{S}$. The measure provides the intuition that if a test sample from the target domain is similar to a specific group of training samples, their losses will be similar as well. In other words, the model's robustness reflects its ability to generalize to unseen data. Our first theorem shows that by utilizing this "robustness" measure, we can handle domain shifts more effectively by setting a tolerance range for distribution changes. This robustness measure, therefore, provides a useful tool for addressing OOD generalization.

Theorem 3.1.

Let $\hat{\mathcal{D}}_{T}$ be the empirical distribution of size $n$ drawn from $\mathcal{D}_{T}$. Assume that the loss $\ell$ is upper bounded by $M$. With probability at least $1-\delta$ (over the choice of the samples $S$), for every $f_{\bm{\theta}}$ trained on $S$ satisfying $(K,\epsilon(S))$-robustness, we have

$$\mathcal{L}_{T}(f_{\bm{\theta}})\leq\widehat{\mathcal{L}}_{S}(f_{\bm{\theta}})+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+2\epsilon(S)+3M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}} \qquad (1)$$

where the total variation distance $d_{(\epsilon,K)}$ between discrete empirical distributions is defined by

$$\forall i\in[K],\ n_{i}(S):=\#(\bm{z}\in S\cap C_{i}),\qquad d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}):=\sum_{i=1}^{K}\left|\frac{n_{i}(S)}{n}-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right| \qquad (2)$$

and $n_{i}(S),n_{i}(\hat{\mathcal{D}}_{T})$ are the numbers of samples from $S$ and $\hat{\mathcal{D}}_{T}$ that fall into the set $C_{i}$, respectively.

Remark.

The result can be decomposed into the in-domain generalization error and the out-of-domain distance $|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})|$ (please refer to Lemma C.1). Both of them depend on the robustness $\epsilon(S)$.

See the proof in Appendix C. The last three terms on the RHS of (1) are the distribution distance, the robustness constant, and an error term, respectively. Unlike traditional distribution measures, we partition the sample space and measure the distributional shift separately within the $K$ sub-groups instead of measuring it point-wise. We argue that $d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})$ can be zero if all small changes happen within the same partition; a 2D illustrative case is shown in Figure 1. Under these circumstances, our distribution distance term will be significantly smaller than that of Proposition 2.1, as the target distribution is essentially a perturbation of the source distribution. As a robust OOD generalization measure, our bound characterizes how robust the learned model is to negligible distributional perturbations. To prevent the bound from expanding excessively, we also propose an alternative tailored for non-robust algorithms ($K\to\infty$) as follows.
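For completeness, here is a minimal sketch of how the distance term $d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})$ of Eq. (2) could be computed once every sample has been assigned to a partition cell; the cell-assignment step (e.g., the K-means predictor sketched earlier) is an assumption for illustration.

```python
# Minimal sketch of the distance term d_{(eps,K)}(S, \hat{D}_T) in Eq. (2):
# a total-variation distance between the histograms of source and target
# samples over the K partition cells.
import numpy as np

def partition_tv_distance(cells_S, cells_T, K, n):
    """cells_S, cells_T: cell indices of the n source / n target samples."""
    hist_S = np.bincount(cells_S, minlength=K) / n
    hist_T = np.bincount(cells_T, minlength=K) / n
    return np.abs(hist_S - hist_T).sum()   # sum_i |n_i(S)/n - n_i(D_T)/n|
```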

Corollary 3.2.

If $K\to\infty$, the domain shift bound $|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})|$ can be replaced by the distribution distance in Proposition 2.1, where

$$|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})|\leq\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}(S;\mathcal{D}_{T})\leq\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}(S;\widehat{\mathcal{D}}_{T})+\mathcal{O}(\sqrt{d^{\prime}/n}) \qquad (3)$$

and the pseudo dimension is $\operatorname{Pdim}(\mathcal{F})=d^{\prime}$.

The proof is in Appendix C.1. As dictated by Theorem 3.1, when $K\to\infty$, using an infinite number of partitions of the data distribution renders the robustness meaningless. However, Eq. 25 suggests that our bound can be replaced by $d_{\mathcal{F}\Delta\mathcal{F}}(\mathcal{D}_{S};\mathcal{D}_{T})$ in the limit of infinite $K$. This avoids computing a vacuous bound for non-robust algorithms. In summary, Theorem 3.1 presents a novel approach for quantifying distributional shifts by incorporating the concept of robustness. Our framework is particularly beneficial when a robust algorithm is able to adapt to local shifts in the distribution. Additionally, our data-dependent result remains valid and useful in the overparameterized regime, since $K$ does not depend on the model size.

3.2 Sharpness and Robustness

Clearly, robustness is inherently tied to the optimization properties of a model, particularly the curvature of the loss landscape. One direct approach to characterizing this geometric curvature, referred to as "sharpness," involves analyzing the Hessian matrix (Foret et al., 2020; Cohen et al., 2021). Recent research (Petzka et al., 2021) has shown that the concept of "relative flatness", which we call sharpness in this paper, has a strong correlation with model generalization. However, the impact of relative flatness on OOD generalization remains uncertain, even in the convex setting. To address this problem, we investigate the interplay between robustness and sharpness. With the following definition of sharpness, we endeavor to establish an OOD generalization bound rooted in optimization principles.

Definition 3.3 (Sharpness, Petzka et al. (2021)).

For a twice differentiable loss function $\mathcal{L}(\bm{w})=\sum_{\bm{s}\in S}\ell_{\bm{w}}(\bm{s})$, $\bm{w}\in\mathbb{R}^{m}$, with a sample set $S$, the sharpness is defined by

$$\kappa(\bm{w},S,A):=\left\langle\bm{w},\bm{w}\right\rangle\cdot\operatorname{tr}\left(H_{S,A}(\bm{w})\right) \qquad (4)$$

where $H_{S,A}$ is the Hessian matrix of the loss $\mathcal{L}(\bm{w})$ w.r.t. $\bm{w}$ with hypothesis set $A$ and input set $S$.
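The following minimal sketch computes the sharpness of Eq. (4) for a small random-feature ReLU model with squared loss; the specific model, data, and the use of `torch.autograd.functional.hessian` are illustrative choices, not the paper's exact experimental procedure.

```python
# Minimal sketch of the sharpness in Eq. (4): kappa = <w, w> * tr(H_S(w)),
# where H_S(w) is the Hessian of the training loss w.r.t. the trainable
# parameters w. The model below (random-feature ReLU regression with squared
# loss) is an illustrative stand-in.
import torch

def sharpness(loss_of_w, w_hat):
    """loss_of_w: callable w -> scalar training loss; w_hat: learned parameters."""
    H = torch.autograd.functional.hessian(loss_of_w, w_hat)   # (m, m)
    return (w_hat @ w_hat) * torch.trace(H)

# Example: f(w, A, x) = (1/sqrt(d)) * sum_i w_i * relu(<x, a_i>)
d, m, n = 10, 50, 200
A = torch.randn(m, d)
A = (d ** 0.5) * A / A.norm(dim=1, keepdim=True)   # rows approx. on S^{d-1}(sqrt(d))
X, y = torch.randn(n, d), torch.randn(n)
w_hat = torch.randn(m)                             # stand-in for the trained weights

def train_loss(w):
    preds = torch.relu(X @ A.T) @ w / (d ** 0.5)
    return ((preds - y) ** 2).mean()

print(sharpness(train_loss, w_hat).item())
```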

As per Eq. 4, sharpness is characterized by the sum of all eigenvalues of the Hessian matrix, scaled by the parameter norm. Each eigenvalue of the Hessian reflects the rate of change of the loss derivative in the corresponding eigenspace. Therefore, a smaller value of $\kappa$ indicates a flatter minimum. Cha et al. (2021) suggest that flatter minima improve OOD generalization, but do not deliver an elaborate analysis of the Hessian matrix. In this section, we begin with random ReLU neural networks parameterized by $\bm{\theta}=(\{\bm{a}_{i}\}_{i\in[m]},\bm{w})$ where $\bm{w}=[w_{1},...,w_{m}]^{\top}$ is the trainable parameter. Let $A=[\bm{a}_{1},...,\bm{a}_{m}]$; the whole function class is defined as

$$f(\bm{w},A,\bm{x})\triangleq\frac{1}{\sqrt{d}}\sum_{i=1}^{m}w_{i}\sigma\left(\langle\bm{x},\bm{a}_{i}\rangle\right):\ w_{i}\in\mathbb{R},\ \bm{a}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})),\ i\in[m] \qquad (5)$$

where $\sigma(\cdot)$ is the ReLU activation function and the $\bm{a}_{i}$ are random vectors uniformly distributed on the hypersphere $\mathbb{S}^{d-1}(\sqrt{d})$, whose surface is a $(d-1)$-dimensional manifold. We then consider any convex loss function $\ell(f(\bm{w},A,\bm{x}),y):\mathbb{R}\times\mathbb{R}\to\mathbb{R}_{+}$. The corresponding empirical minimizer in the source domain is $\bm{\hat{w}}=\arg\min_{\bm{w}}\frac{1}{n}\sum_{i=1}^{n}\ell(f(\bm{w},A,\bm{x}_{i}),y_{i})$. With $\bm{\hat{w}}$, we are interested in the loss geometry over the sample domain ($\ell_{\bm{\hat{w}},A}(\bm{z})$ for short). Intuitively, a flatter minimum on the loss landscape is expected to be more robust to varying inputs. Suppose the sample space $\mathcal{Z}$ can be partitioned into $K$ disjoint sets. For each set $C_{i},i\in[K]$, the loss difference is upper bounded by $\epsilon(S,A)$. Given $\bm{z}\in S$, we have

$$\epsilon(S,A)\triangleq\max_{i\in[K]}\sup_{\bm{z},\bm{z}^{\prime}\in C_{i}}\left|\ell_{\bm{\hat{w}},A}(\bm{z})-\ell_{\bm{\hat{w}},A}(\bm{z}^{\prime})\right|. \qquad (6)$$

As an alternative form of robustness, $\epsilon(S,A)$ in (6) captures the maximum loss difference between any two samples in each partition and depends on the convexity and smoothness of the loss function in the input domain. Given a training set $S$ and any initialization $\bm{w}_{0}$, the robustness $\epsilon(S,A)$ of a learned model $f_{\bm{\hat{w}}}$ is determined. It explicitly reflects the smoothness of the loss function in each (pre-)partitioned set. Nevertheless, its connection to the sharpness of the loss function in parameter space remains unclear. To address this gap, we establish a connection between sharpness and robustness in Eq. 7. Notably, this interplay holds implications not only for OOD but also for in-distribution generalization.

Theorem 3.4.

Assume for any $A$, the loss function $\ell_{\bm{\hat{w}},A}(\bm{z})$ w.r.t. sample $\bm{z}$ satisfies the $L$-Hessian Lipschitz continuity (refer to Definition B.2) within every set $C_{i},\forall i\in[K]$. Let $\bm{z}_{i}(A)=\arg\max_{\bm{z}\in C_{i}\cap S}\ell_{\bm{\hat{w}},A}(\bm{z})$. Define $\mathcal{M}_{i}$ to be the set of global minima in $C_{i}$, and suppose $\exists\bm{z}_{i}^{*}(A)\in\mathcal{M}_{i}$ such that for some $\rho_{i}(L)>0$, $\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\frac{\rho_{i}(L)}{L}$ almost surely. Then, letting $\rho_{\max}(L)=\max\{\rho_{i}(L),i\in[K]\}$, $\|\bm{x}\|^{2}\equiv R(d)$ and $n^{\prime}\leq n$, $n^{\prime}\in\mathbb{N}^{+}$, with probability $p=\min\left\{\frac{2}{\pi}\arccos(R(d)^{-\frac{1}{2}}),\left|1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}e^{\frac{1}{4d-9}}\right|\right\}$ over $\{\bm{a}_{i}\}_{i=1}^{m}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ we have

$$\epsilon(S,A)\leq\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(\left[n^{\prime}+\mathcal{O}\left(\frac{d}{m}\right)\right]\kappa(\bm{\hat{w}},S,A)+\frac{4\rho_{\max}(L)}{3}\right). \qquad (7)$$
Remark.

Given the training set $S$, we can estimate a factor $\hat{n}$ with $n^{\prime}\leq\hat{n}$ by comparing the maximum Hessian norm w.r.t. $\bm{z}_{j}$ to the sum of all the Hessian norms over $\{\bm{z}_{i}\}_{i\in[n]}$. Note that the smoothness condition only applies to every partitioned set (locally), which is much weaker than requiring the loss function to satisfy it globally. We also discuss the difference between our results and Petzka et al. (2021) in Appendix F. The family of loss functions to which our theorem applies can be found in Appendix B.1.

Corollary 3.5.

Let $\hat{w}_{\min}$ be the minimum value of $|\bm{\hat{w}}|$. Suppose $\forall\bm{x}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $|\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)/\partial f^{2}|$ is bounded in $[\tilde{M}_{1},\tilde{M}_{2}]$. If $m=\mathrm{Poly}(d)$, $d>2$, and $\rho_{\max}(L)<(\hat{w}^{2}_{\min}\tilde{M}_{1}\tilde{\sigma}(d,m))/(2d)$, then taking the expectation over all $\bm{x}_{j}\in S,j\in[n]$ and all $\bm{a}_{i}\in A\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})),\forall i\in[m]$, we have

$$\mathbb{E}_{S,A}\left[\epsilon(S,A)\right]\leq\mathbb{E}_{S,A}\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\tilde{M}_{2}\right). \qquad (8)$$

where $\tilde{\sigma}(d,m)=\mathbb{E}_{\bm{a}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}\lambda_{\min}(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii})>0$ is the minimum eigenvalue and $G_{ii}$ is the product constant of Gegenbauer polynomials (the definition can be found in Appendix B).

See the proof in Appendix D. From Eq. 7 we can see that the robustness constant $\epsilon(S,A)$ is (pointwise) upper bounded by the sharpness of the learned model, as measured by the quantity $\kappa(\bm{\hat{w}},S,A)$, and the parameter $\rho_{\max}(L)$. It should be noted that $\rho_{\max}(L)$ depends on the partition: as the number of partitions increases, the region of the input domain covered by $\rho_{\max}(L)$ becomes smaller, thereby making sharpness the dominant term reflecting the model's robustness within the small local area. In Corollary 3.5, we show a stronger connection when the partition satisfies some conditions. Overall, this bound states that the larger the sharpness $\kappa(\bm{\hat{w}},S,A)$ of the model, the larger the upper bound on the robustness parameter $\epsilon(S,A)$. This result aligns with the intuition that a sharper model is more prone to overfitting the training domain and is less robust in the unseen domain. While the dependence is not exact, it can still be regarded as an alternative approach that avoids the explicit computation of the intractable robustness term. By substituting this upper bound for $\epsilon(S,A)$ into Theorem 3.1, we derive a sharpness-based OOD generalization bound. This implies that the OOD generalization error will, with high probability, be small if the learned model is flat enough. Unlike existing works, our generalization bound provides more information about how optimization properties influence performance when generalizing to OOD data. It bridges the gap between robustness and sharpness, and can also be generalized to non-OOD learning problems. Moreover, we provide a better theoretical grounding for the empirical observation that flat minima improve domain generalization (Cha et al., 2021) by pinpointing a clear dependence on sharpness.

3.3 Case Study

To better demonstrate the relationship between sharpness and robustness, we provide two specific examples: (1) linear ridge regression; (2) two-layer diagonal neural networks for classification.

Example 3.6.

In ridge regression models, $\epsilon(S,A)$ has an inverse relationship with the regularization parameter $\beta$: as $\beta$ increases, the learned model is more likely to have a flatter minimum (smaller $\kappa$) and lower sensitivity (smaller $\epsilon$). Following the previous notation, $\exists\tilde{c}_{1}>0$ such that $\epsilon(S,A)\leq\tilde{c}_{1}\kappa(\bm{\hat{\theta}},S)+\tilde{o}_{d}$, where $\tilde{o}_{d}$ has a smaller order than $\kappa(\hat{\bm{\theta}},S)$ for large $d$ (for the proof, refer to Appendix E.1).

As suggested in Ali et al. (2019), let us consider a generic response model $\bm{y}|\bm{\theta}_{*}\sim(X\bm{\theta}_{*},\sigma^{2}I)$ where $X\in\mathbb{R}^{n\times d}$, $d>n$. The least-squares empirical minimizer of the ridge regression problem is:

$$\bm{\hat{\theta}}=\arg\min_{\bm{\theta}}\frac{1}{2n}\|X\bm{\theta}-\bm{y}\|^{2}+\frac{\beta}{2}\|\bm{\theta}\|^{2}=(X^{\top}X+n\beta I_{n})^{-1}X^{\top}\bm{y}=M^{-1}X^{\top}\bm{y} \qquad (9)$$

Let $S$ be the training set. It is straightforward to obtain the sharpness of the quadratic loss function:

$$\kappa(\bm{\hat{\theta}},S)=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}(X^{\top}X/n+\beta I_{n})=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}(M) \qquad (10)$$

Both of the above equations depend on the same matrix $M=X^{\top}X/n+\beta I_{n}$. For fixed training samples $X$, we have $\kappa(\bm{\hat{\theta}},S)=\mathcal{O}(\beta^{-1})$ in the limit of large $\beta$. It is then clear that a higher penalty $\beta$ leads to a flatter minimum. This intuition is rigorously proven in Appendix E.1. According to Theorem 3.1 and Eq. 7, a flatter minimum is likely associated with a lower robustness constant $\epsilon(S,A)$ and thus enjoys a lower OOD generalization error gap. In ridge regression, this phenomenon is reflected by the regularization coefficient $\beta$. In general, the larger $\beta$ is, the lower the sharpness $\kappa(\bm{\hat{\theta}},S)$ and the variance are. As a consequence, a larger $\beta$ learns a more robust model, resulting in a lower OOD generalization error gap. This is verified in the distributional shift experiments shown in Figure 2.
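A minimal numpy sketch of this example is given below, evaluating the closed-form minimizer of Eq. (9) and the sharpness of Eq. (10) for several penalties $\beta$; the problem sizes and the $d\times d$ identity (used for dimensional consistency) are our own illustrative choices.

```python
# Minimal sketch of Example 3.6: the closed-form ridge minimizer (Eq. 9) and
# its sharpness (Eq. 10), evaluated for several penalties beta to illustrate
# that larger beta yields a flatter minimum (smaller kappa).
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 100
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
y = X @ theta_star

for beta in [1e-3, 1e-2, 1e-1, 1.0]:
    M = X.T @ X / n + beta * np.eye(d)
    theta_hat = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)  # Eq. (9)
    kappa = np.dot(theta_hat, theta_hat) * np.trace(M)                    # Eq. (10)
    print(f"beta={beta:g}  sharpness kappa={kappa:.3f}")
```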

Example 3.7.

We consider a classification problem using a 2-layer diagonal linear network with the exponential loss. The robustness $\epsilon(S,A)$ satisfies a relationship similar to Eq. 7. Given training set $S$, after $t>T_{\epsilon}$ iterations, $\exists\tilde{c}_{2}>0$ such that $\epsilon(S,A)\leq\tilde{c}_{2}\sup_{t\geq T_{\epsilon}}\kappa(\bm{\theta}(t),S)$.

In addition to regression and linear models, we obtain a similar relationship for 2-layer diagonal linear networks, which are commonly used in the kernel and rich regimes as well as in intermediate settings (Moroshko et al., 2020). Example 3.7 demonstrates that the relationship also holds when the model is well-trained, even though the exponential loss does not satisfy the PŁ condition. By extending our theorems to these more complex settings, we go beyond our initial assumptions and offer insights into broader applications. Later experiments on non-linear NNs also support our statements. However, a unified theorem for general function classes with fewer assumptions is still needed.

4 Related Work

Despite various methods (Sun & Saenko, 2016; Sagawa et al., 2019; Shi et al., 2021; Shafieezadeh Abadeh et al., 2015; Li et al., 2018; Cha et al., 2021; Du et al., 2020; Zhang et al., 2022) that have been proposed to overcome the poor generalization caused by unknown distribution shifts, the underlying principles and theories remain underexplored. As pointed out in Redko et al. (2020); Miller et al. (2021), different tasks that address distributional shifts, such as domain adaptation, OOD, and domain generalization, are collectively referred to as "transfer transductive learning" and share similar generalization theories. In general, the desired generalization bound is split into an In-Distribution/Domain (ID) generalization error and an Out-of-Distribution/Domain (OOD) distance. Since Blitzer et al. (2007) established a VC-dimension-based framework to estimate the domain shift gap by a divergence term, many subsequent works have sought to improve this term, e.g., via Discrepancy (Mansour et al., 2009), Wasserstein measures (Courty et al., 2017; Shen et al., 2018), Integral Probability Metrics (IPM) (Zhang et al., 2019b; Ye et al., 2021) and $\beta$-divergence (Germain et al., 2016). Among them, generalization tools such as PAC-Bayes, Rademacher Complexity, and Stability are also applied. However, few of them discuss how sharpness reacts to data distributional shifts.

Beyond this canonical framework, Ye et al. (2021) reformulate the OOD generalization problem and provide a generalization bound using the concepts of "variation" and "informativeness." The causal framework proposed in Peters et al. (2017); Rojas-Carulla et al. (2018) focuses on the impact of interventions on robust optimization over test distributions. However, none of these frameworks consider the optimization process of a model and how it affects OOD generalization. Inspired by previous investigation on the effect of sharpness on ID generalization (Lyu et al., 2022; Petzka et al., 2021), recent work in Cha et al. (2021) found that flatter minima can also improve OOD generalization. Nevertheless, they lack a sufficient theoretical foundation for the relationship between the "sharpness" of a model and OOD generalization, but end with a union bound of Blitzer et al. (2007)’s result. In this paper, we aim to provide a rigorous examination of this relationship.

5 Experiments

In light of space constraints, we present only a portion of our experimental results to support the validity of our theorems and findings. For comprehensive results, please refer to Appendix G.

5.1 Ridge regression in distributional shifting

Following Duchi & Namkoong (2021), we investigate ridge regression under distributional shift. We randomly generate $\theta^{*}_{0}\in\mathbb{R}^{d}$ on the sphere and generate data from the following process: $X\overset{\text{iid}}{\sim}\mathcal{N}(0,1)$, $\bm{y}=X\theta^{*}_{0}$. To simulate distributional shift, we randomly generate a unit vector $\theta_{0}^{\perp}$ perpendicular to $\theta^{*}_{0}$. Taking $\theta_{0}^{\perp},\theta^{*}_{0}$ as basis vectors, the shifted ground truth is computed as $\theta^{*}_{\alpha}=\theta^{*}_{0}\cdot\cos(\alpha)+\theta^{\perp}_{0}\cdot\sin(\alpha)$. For the source domain, we use $\theta^{*}_{0}$ as our training distribution. We randomly sample 50 data points and train a linear model with gradient descent for 3000 iterations. By minimizing the objective function in (9), we obtain the empirical optimum $\hat{\theta}$. We then gradually shift the distribution by increasing $\alpha$ to obtain different target domains. As the distribution shifts, the test loss $\ell(\hat{\theta},\bm{y}_{\alpha})$ increases. As shown in Figure 2, the test loss peaks at around 3 rad, where the distribution shift is maximal. Comparing different levels of regularization, we find that a larger L2 penalty $\beta$ yields a lower OOD generalization error, shown as darker purple lines. This plot bears out our intuition from the previous section. As stated in the case above, the sharpness of ridge regression should depend inversely on $\beta$. Correspondingly, we compute sharpness using the definition in Eq. (4), averaging over ten different runs. For each trial, we use the same training and test data for every $\beta$. The sharpness of each ridge regressor is shown in the legend of Figure 2. As we can see, larger $\beta$ leads to lower sharpness.
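A minimal sketch of this shifting procedure is given below; the dimension, sample sizes, and penalty value are illustrative choices rather than the exact experimental configuration.

```python
# Minimal sketch of the Section 5.1 setup: rotate the ground-truth parameter
# within the span of (theta_0, theta_0_perp) by angle alpha and track the test
# loss of a ridge model fit on the alpha = 0 data.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 50
theta0 = rng.normal(size=d)
theta0 /= np.linalg.norm(theta0)
v = rng.normal(size=d)
v -= (v @ theta0) * theta0
theta0_perp = v / np.linalg.norm(v)                 # unit vector orthogonal to theta0

X = rng.normal(size=(n, d))
y = X @ theta0                                      # source domain (alpha = 0)
beta = 0.1
theta_hat = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)

for alpha in np.linspace(0, np.pi, 7):
    theta_alpha = theta0 * np.cos(alpha) + theta0_perp * np.sin(alpha)
    X_test = rng.normal(size=(200, d))
    y_test = X_test @ theta_alpha
    test_loss = np.mean((X_test @ theta_hat - y_test) ** 2) / 2
    print(f"alpha={alpha:.2f} rad  test loss={test_loss:.3f}")
```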

Figure 2: OOD test losses increase along the distributional shift. The X-axis is the shifting angle $\alpha$ and the Y-axis is the test loss of the model trained on the distribution with $\alpha=0$. Lines are average test losses and shaded regions are variances over 10 trials. Larger regularization $\beta$ (darker color) causes a lower increase in test loss and corresponds to smaller sharpness.
Figure 3: The relationship between out-of-domain test accuracy and model sharpness on the RotatedMNIST dataset. We show 4 different OOD environments, with $15^{\circ},30^{\circ},45^{\circ},60^{\circ}$ rotations as the OOD test sets respectively. Each marker denotes a minimum of an algorithm with a specific seed. The marker style indicates models trained in the same environment.

5.2 Sharper minimum hurts OOD generalization

In our results, we proved that the upper bound on the OOD generalization error involves the sharpness of the trained model. Here we empirically verify this theoretical insight. We follow the experimental setting of DomainBed (Gulrajani & Lopez-Paz, 2021). To compute sharpness easily, we choose a 4-layer MLP on the RotatedMNIST dataset, a rotation of the MNIST handwritten digit dataset (LeCun, 1998) with angles ranging over $[0^{\circ},15^{\circ},30^{\circ},45^{\circ},60^{\circ},75^{\circ}]$. In this codebase, each environment refers to selecting one domain (a specific rotation angle) as the test domain/OOD test dataset while training on all other domains. After obtaining the trained model for each environment, we compute the sharpness using all domain training sets based on the implementation of Petzka et al. (2021). To this end, we plot the performance of Empirical Risk Minimization (ERM), SWAD (Cha et al., 2021), Mixup (Yan et al., 2020) and GroupDRO (Sagawa et al., 2019) with 6 seeds each, and measure the sharpness of all these minima. Figure 3 shows the relationship between model sharpness and out-of-domain accuracy. The tendency is clear: flatter minima give better OOD performance. In general, different environments cannot be plotted together due to different training sets. However, we found that the middle 4 environments are similar tasks and thus plot them together for a clearer trend. In addition, different algorithms lead to different feature scales, which may affect the scale of the sharpness. To address this, we align their scales when plotting them together. For more individual results, please refer to Figure 8 in the appendix.

5.3 Comparison of generalization bounds

To analyze our generalization bounds, we follow the toy example experiments in Sagawa et al. (2020). In this experiment, the distribution shift terms and generalization error terms can be explicitly computed. Furthermore, their synthetic experiment considers spurious correlations across distribution shifts, which is now a common formulation of OOD generalization (Wald et al., 2021; Aubin et al., 2021; Yao et al., 2022). Consider data $x=[x_{\text{core}},x_{\text{spu}}]\in\mathbb{R}^{2d}$ consisting of two features: a core feature and a spurious feature. The features are generated from the following rule:

$$x_{\text{core}}\mid y\sim\mathcal{N}\left(y\mathbf{1},\sigma_{\text{core}}^{2}I_{d}\right),\qquad x_{\text{spu}}\mid a\sim\mathcal{N}\left(a\mathbf{1},\sigma_{\mathrm{spu}}^{2}I_{d}\right)$$

where $y\in\{-1,1\}$ is the label and $a\in\{-1,1\}$ is the spurious attribute. Data with $y=a$ form the majority group of size $n_{\text{maj}}$, and data with $y=-a$ form the minority group of size $n_{\text{min}}$. The total number of training points is $n=n_{\text{maj}}+n_{\text{min}}$. The spurious correlation probability $p_{\text{maj}}=\frac{n_{\text{maj}}}{n}$ defines the probability of $y=a$ in the training data. In testing, we always have $p_{\text{maj}}=0.5$. The metric, worst-group error (Sagawa et al., 2019), is defined as

$$\operatorname{Err}_{\mathrm{wg}}(w):=\max_{i\in[4]}\mathbb{E}_{x,y\mid g_{i}}\left[\ell_{0-1}(w;(x,y))\right]$$

where $\ell_{0-1}$ is the $0$-$1$ loss in binary classification. Here we compare our proposed OOD generalization bound with the baseline in Proposition 2.1. We also give comparisons to other baselines, such as the PAC-Bayes DA bound, in Appendix G.
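Below is a minimal sketch of this data-generating process and of the worst-group $0$-$1$ error; the group indexing over $(y,a)$ pairs and the helper names are illustrative assumptions.

```python
# Minimal sketch of the synthetic data-generating process of Section 5.3 and
# the worst-group 0-1 error over the four (y, a) groups.
import numpy as np

def make_data(n, d, p_maj, sigma_core, sigma_spu, rng):
    y = rng.choice([-1, 1], size=n)
    is_maj = rng.random(n) < p_maj
    a = np.where(is_maj, y, -y)                     # majority: a = y, minority: a = -y
    x_core = y[:, None] + sigma_core * rng.normal(size=(n, d))
    x_spu = a[:, None] + sigma_spu * rng.normal(size=(n, d))
    return np.hstack([x_core, x_spu]), y, a

def worst_group_error(w, X, y, a):
    preds = np.sign(X @ w)
    errs = []
    for yy in (-1, 1):
        for aa in (-1, 1):
            g = (y == yy) & (a == aa)
            if g.any():
                errs.append(np.mean(preds[g] != y[g]))  # 0-1 loss within group g_i
    return max(errs)

rng = np.random.default_rng(0)
X, y, a = make_data(n=500, d=100, p_maj=0.9, sigma_core=10.0, sigma_spu=1.0, rng=rng)
```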

Figure 4: Spurious feature synthetic experiment. Each dot represents a trained model. The dashed curves are smoothed functions fit to the test data points. The baseline is Proposition 2.1. (a),(d): the generalization error of the logistic regression models with increasing model size/correlation probability. (b): concentration error term in the domain shift bound. (e): comparison of distribution distance bounds. (c),(f): comparisons of generalization bounds. Note that model size $>500$ is the overparameterized regime. The further the correlation probability is from $0.5$, the greater the distributional shift.
Along model size

We plot the generalization error of random feature logistic regression as the model size increases in Figure 4(a). In this experiment, we follow the hyperparameter setup of Sagawa et al. (2020), setting the number of points $n=500$, data dimension $2d=200$ with $100$ for each feature, majority fraction $p_{\text{maj}}=0.9$ and noises $\sigma_{\text{spu}}^{2}=1,\sigma_{\text{core}}^{2}=100$. The worst-group error turns out to be nearly constant as the model size increases. However, as shown in Figure 4(b), the error term in the domain shift bound of Proposition 2.1 keeps increasing as the model size increases. In contrast, our domain shift bound, of order $\sqrt{K}$, is independent of the model size, which addresses this limitation of their bound. We follow Kawaguchi et al. (2022) to compute $K$ via an inverse image of the $\epsilon$-covering in a randomly projected space (see details in the appendix), and set the same value $K=1{,}000$ in our experiment. Unlike the baseline, $K$ is data dependent and leads to a constant concentration error term as the model size increases. Analogously, our OOD generalization bound does not explode as the model size increases (shown in Figure 4(c)).
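As a rough illustration of how such a data-dependent $K$ could be estimated, the sketch below counts occupied cells of an $\epsilon$-grid after a random low-dimensional projection of the inputs; this is a simplified reading of the construction credited to Kawaguchi et al. (2022) in the text, not their exact procedure.

```python
# Rough sketch of one way to estimate a data-dependent K: count occupied cells
# of an eps-grid after a random low-dimensional projection of the inputs.
import numpy as np

def estimate_K(X, proj_dim=10, eps=0.5, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(X.shape[1], proj_dim)) / np.sqrt(proj_dim)  # random projection
    Z = X @ P
    cells = np.floor(Z / eps).astype(int)            # eps-grid cell index per sample
    return len({tuple(c) for c in cells})            # number of occupied cells
```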

Along distribution shift

In addition, we are interested in characterizing OOD generalization when the test distribution shifts from the training distribution, by varying the correlation probability $p_{\text{maj}}$ during data generation. As shown in Figure 4(d), when $p_{\text{maj}}=0.5$, there is no distributional shift between training and test data since no spurious feature is correlated with the training data. Thus, the training and test distributions become increasingly aligned as $p_{\text{maj}}<0.5$ increases toward $0.5$, resulting in an initial decrease in the test error of the worst-case group. However, as $p_{\text{maj}}>0.5$ deviates from $0.5$, introducing spurious features, a shift in the distribution occurs. This deviation is likely to impact the worst-case group differently, leading to an increase in the test error. As displayed in Figure 4(e) and Figure 4(f), our distribution distance and generalization bound capture the distribution shifts while remaining tighter than the baseline.

6 Conclusion

In this paper, we provide a more interpretable and informative theory for understanding Out-of-Distribution (OOD) generalization. Based on the notion of robustness, we propose a robust OOD bound that effectively captures algorithmic robustness in the presence of shifting data distributions. In addition, our in-depth analysis of the relationship between robustness and sharpness further illustrates that sharpness has a negative impact on generalization. Overall, our results advance the understanding of OOD generalization and the principles that govern it.

7 Acknowledgement

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-GC-2019-001-2A, AISG Award No: AISG2-TC-2023-010-SGIL) and the Singapore Ministry of Education Academic Research Fund Tier 1 (Award No: T1 251RES2207). We appreciate the contributions of Fusheng Liu, a Ph.D. student at NUS.

References

  • Ahuja et al. (2021) Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization. Advances in Neural Information Processing Systems, 34:3438–3450, 2021.
  • Ali et al. (2019) Alnur Ali, J Zico Kolter, and Ryan J Tibshirani. A continuous-time view of early stopping for least squares regression. In The 22nd international conference on artificial intelligence and statistics, pp.  1370–1378. PMLR, 2019.
  • Arjovsky et al. (2019) Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Aubin et al. (2021) Benjamin Aubin, Agnieszka Słowik, Martin Arjovsky, Leon Bottou, and David Lopez-Paz. Linear unit-tests for invariance discovery. arXiv preprint arXiv:2102.10867, 2021.
  • Bandi et al. (2018) Peter Bandi, Oscar Geessink, Quirine Manson, Marcory Van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge. IEEE Transactions on Medical Imaging, 2018.
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79:151–175, 2010.
  • Blitzer et al. (2007) John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. Advances in neural information processing systems, 20, 2007.
  • Bourin et al. (2013) Jean-Christophe Bourin, Eun-Young Lee, and Minghua Lin. Positive matrices partitioned into a small number of hermitian blocks. Linear Algebra and its Applications, 438(5):2591–2598, 2013.
  • Cha et al. (2021) Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405–22418, 2021.
  • Cohen et al. (2021) Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
  • Courty et al. (2017) Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 2017.
  • Du et al. (2017) Simon S Du, Jayanth Koushik, Aarti Singh, and Barnabás Póczos. Hypothesis transfer learning via transformation functions. Advances in neural information processing systems, 30, 2017.
  • Du et al. (2020) Yingjun Du, Xiantong Zhen, Ling Shao, and Cees GM Snoek. Metanorm: Learning to normalize few-shot batches across domains. In International Conference on Learning Representations, 2020.
  • Duchi & Namkoong (2018) John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
  • Duchi & Namkoong (2021) John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
  • El Ghaoui & Lebret (1997) Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on matrix analysis and applications, 1997.
  • Foret et al. (2020) Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
  • Germain et al. (2013) Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A pac-bayesian approach for domain adaptation with specialization to linear classifiers. In International conference on machine learning. PMLR, 2013.
  • Germain et al. (2016) Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A new pac-bayesian perspective on domain adaptation. In International conference on machine learning. PMLR, 2016.
  • Gulrajani & Lopez-Paz (2021) Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In International Conference on Learning Representations, 2021.
  • Izmailov et al. (2018) Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
  • Kawaguchi et al. (2022) Kenji Kawaguchi, Zhun Deng, Kyle Luh, and Jiaoyang Huang. Robustness implies generalization via data-dependent generalization bounds. In International Conference on Machine Learning. PMLR, 2022.
  • Kifer et al. (2004) Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In VLDB, volume 4, pp.  180–191. Toronto, Canada, 2004.
  • Koh et al. (2021) Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp.  5637–5664. PMLR, 2021.
  • Koyama & Yamaguchi (2020) Masanori Koyama and Shoichiro Yamaguchi. Out-of-distribution generalization with maximal invariant predictor. 2020.
  • LeCun (1998) Yann LeCun. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998.
  • Li et al. (2017) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pp.  5542–5550, 2017.
  • Li et al. (2018) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI conference on artificial intelligence, 2018.
  • Liu et al. (2021) Jiashuo Liu, Zheyuan Hu, Peng Cui, Bo Li, and Zheyan Shen. Heterogeneous risk minimization. In International Conference on Machine Learning, pp.  6804–6814. PMLR, 2021.
  • Lyu et al. (2022) Kaifeng Lyu, Zhiyuan Li, and Sanjeev Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction. In Advances in Neural Information Processing Systems, 2022.
  • Mansour et al. (2009) Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  • Mei & Montanari (2022) Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
  • Miller et al. (2021) John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning. PMLR, 2021.
  • Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Moroshko et al. (2020) Edward Moroshko, Blake E Woodworth, Suriya Gunasekar, Jason D Lee, Nati Srebro, and Daniel Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Advances in neural information processing systems, 33:22182–22193, 2020.
  • Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • Petzka et al. (2021) Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative flatness and generalization. Advances in Neural Information Processing Systems, 34:18420–18432, 2021.
  • Pezeshki et al. (2021) Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems, 34:1256–1272, 2021.
  • Rame et al. (2022) Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In International Conference on Machine Learning, pp.  18347–18377. PMLR, 2022.
  • Redko et al. (2020) Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv preprint arXiv:2004.11829, 2020.
  • Robbins (1955) Herbert Robbins. A remark on stirling’s formula. The American mathematical monthly, 62(1):26–29, 1955.
  • Rojas-Carulla et al. (2018) Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 2018.
  • Sagawa et al. (2019) Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
  • Sagawa et al. (2020) Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning. PMLR, 2020.
  • Shafieezadeh Abadeh et al. (2015) Soroosh Shafieezadeh Abadeh, Peyman M Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. Advances in Neural Information Processing Systems, 2015.
  • Shen et al. (2018) Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • Shi et al. (2021) Yuge Shi, Jeffrey Seely, Philip HS Torr, N Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. arXiv preprint arXiv:2104.09937, 2021.
  • Sun & Saenko (2016) Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision, pp.  443–450. Springer, 2016.
  • Wald et al. (2021) Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. Advances in neural information processing systems, 2021.
  • Wiles et al. (2022) Olivia Wiles, Sven Gowal, Florian Stimberg, Sylvestre-Alvise Rebuffi, Ira Ktena, Krishnamurthy Dj Dvijotham, and Ali Taylan Cemgil. A fine-grained analysis on distribution shift. In International Conference on Learning Representations, 2022.
  • Xu & Mannor (2012) Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
  • Yan et al. (2020) Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training. arXiv preprint arXiv:2001.00677, 2020.
  • Yao et al. (2022) Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. arXiv preprint arXiv:2201.00299, 2022.
  • Ye et al. (2021) Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization. Advances in Neural Information Processing Systems, 2021.
  • Zhang et al. (2019a) Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer relu networks via gradient descent. In The 22nd international conference on artificial intelligence and statistics, pp.  1524–1534. PMLR, 2019a.
  • Zhang et al. (2022) Xingxuan Zhang, Linjun Zhou, Renzhe Xu, Peng Cui, Zheyan Shen, and Haoxin Liu. Towards unsupervised domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4910–4920, 2022.
  • Zhang et al. (2019b) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning. PMLR, 2019b.
  • Zhao et al. (2018) Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. Advances in neural information processing systems, 31, 2018.
  • Zhong et al. (2017) Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International conference on machine learning, pp.  4140–4149. PMLR, 2017.

Appendix A Additional Experiments

A.1 Sharpness v.s. OOD Generalization on PACS and Wilds-Camelyon17

Figure 5: The relationship between out-of-distribution (OOD) test accuracy on the test environment and model sharpness (of the last FC layer) on the Wilds-Camelyon17 dataset. Each marker denotes a model trained using ERM with a different seed and hyperparameters.
Figure 6: The relationship between OOD test accuracy and model sharpness on the PACS dataset. Each marker denotes a model trained using ERM with different seeds and hyperparameters. The marker style shows the out-of-distribution test environment of the model.

To evaluate our theorem more deeply, we examine the relationship between our defined sharpness and the OOD generalization error on larger-scale real-world datasets, Wilds-Camelyon17 (Bandi et al., 2018; Koh et al., 2021) and PACS (Li et al., 2017). The Wilds-Camelyon17 dataset includes 455,954 tumor and normal tissue slide images from five hospitals (environments). One of the hospitals is assigned as the test environment by the dataset publisher. Distribution shift arises from variations in patient population, slide staining, and image acquisition. The PACS dataset contains 9,991 images of 7 objects in 4 visual styles (environments): art painting, cartoon, photo, and sketch. Following the common setup in Gulrajani & Lopez-Paz (2021), each environment is used as the test environment in turn. We follow the practice in Petzka et al. (2021) and compute the sharpness using the Hessian matrix of the last Fully-Connected (FC) layer of each model. For the Wilds-Camelyon17 dataset, we test the sharpness of 18 ERM models trained with different random seeds and hyperparameters; Figure 5 shows the result. For the PACS dataset, we run 60 ERM models with different random seeds and hyperparameters for each test environment. To get a clearer correlation, we align the points from the 4 environments by their mean performance; Figure 6 shows the result. From the two figures, we observe a clear correlation between sharpness and out-of-distribution (OOD) accuracy: sharpness tends to hurt the OOD performance of the model. The result is consistent with what we report in Figure 3 and shows that the correlation between sharpness and OOD accuracy can also be observed on large-scale datasets.

Appendix B Notations and Definitions

Notations

We use [n] to denote the integer set \{i\}_{i=1}^{n}, and \|\cdot\| represents the \ell_{2}-norm \|\cdot\|_{2} for short. Without loss of generality, we use \ell(f(\bm{\theta},\bm{x}),\bm{y}) for the loss function of model f_{\bm{\theta}} on the data pair \bm{z}=(\bm{x},\bm{y}), which is denoted as \ell_{\bm{\theta}}(\bm{z}), and we use n,d for the training set size and input dimension. Note that we generally follow the notations in the original papers.

  • S,T\mathcal{L}_{S},\mathcal{L}_{T}: expected risk of the source domain and target domain, respectively. The corresponding empirical version will be ^S,^T\widehat{\mathcal{L}}_{S},\widehat{\mathcal{L}}_{T}

  • {Ci}i=1K\{C_{i}\}_{i=1}^{K}: KK partitions on sample space and CiC_{i} denotes each partitioned set.

  • 𝒟S,𝒟T\mathcal{D}_{S},\mathcal{D}_{T}: distributions of source and target domain. Their sampled dataset will be denoted as 𝒟^S,𝒟^T\hat{\mathcal{D}}_{S},\hat{\mathcal{D}}_{T} accordingly.

  • \bm{\theta}: In our setting, \bm{\theta}=(\bm{w},\{\bm{a}_{i}\}_{i=1}^{m}) denotes the model parameters, where \bm{w} is the trainable parameter vector and \{\bm{a}_{i}\}_{i=1}^{m} are the random features (we also write A=[\bm{a}_{1},...,\bm{a}_{m}] for short in many places). \bm{\hat{w}} is the minimizer of the empirical loss.

  • MM: upper bound of the loss function.

  • S=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n},~X=[\bm{x}_{1},...,\bm{x}_{n}]: training data of size n and the corresponding input matrix.

Definition B.1 (Robustness, Xu & Mannor (2012)).

A learning algorithm \mathcal{A} on training set S is (K,\epsilon(\cdot))-robust, for K\in\mathbb{N}, if \mathcal{Z} can be partitioned into K disjoint sets, denoted by \left\{C_{k}\right\}_{k=1}^{K}, such that for all s\in S and z\in\mathcal{Z} we have

\forall k\in[K]:~s,z\in C_{k}~\Rightarrow~\left|\ell\left(\mathcal{A}_{S},s\right)-\ell\left(\mathcal{A}_{S},z\right)\right|\leq\epsilon(S).
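A minimal sketch of how this robustness constant could be estimated empirically follows. The K-means partition of the sample space and the helper names (model, loss_fn, the held-out pool) are illustrative assumptions, not part of the definition.

# Hedged sketch: estimate the robustness constant eps of Definition B.1
# for a fitted model, using a K-means partition of the inputs as one
# possible (assumed) choice of {C_k}.
import numpy as np
from sklearn.cluster import KMeans

def empirical_robustness(model, loss_fn, X_train, y_train, X_pool, y_pool, K=20):
    km = KMeans(n_clusters=K, n_init=10).fit(np.vstack([X_train, X_pool]))
    c_train = km.predict(X_train)
    c_pool = km.predict(X_pool)
    l_train = loss_fn(model, X_train, y_train)   # per-sample losses, shape (n,)
    l_pool = loss_fn(model, X_pool, y_pool)
    eps = 0.0
    for k in range(K):
        lt, lp = l_train[c_train == k], l_pool[c_pool == k]
        if len(lt) and len(lp):
            # worst within-cell gap between a training point and any other point
            gap = max(abs(lt.max() - lp.min()), abs(lp.max() - lt.min()))
            eps = max(eps, gap)
    return eps  # the (K, eps)-robustness constant over this partition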
Definition B.2 (Hessian Lipschitz continuous).

For a twice differentiable function f:\mathbb{R}^{n}\rightarrow\mathbb{R}, it has L_{i}-Lipschitz continuous Hessian on the domain C_{i} if, for all vectors \bm{x},\bm{y}\in C_{i},

2f(𝒚)2f(𝒙)Li𝒚𝒙\left\|\nabla^{2}f(\bm{y})-\nabla^{2}f(\bm{x})\right\|\leq L_{i}\|\bm{y}-\bm{x}\|

where Li>0L_{i}>0 depends on input domain CiC_{i} and \|\cdot\| is L2L_{2} norm. Then for all KK domains i=1KCi\cup_{i=1}^{K}C_{i}, let L:=max{Li|i[K]}L:=\max\{L_{i}|i\in[K]\} be the uniform Lipschitz constant, so we have

2f(𝒚)2f(𝒙)Li𝒚𝒙L𝒚𝒙,i[K],(𝒙,𝒚)Ci\left\|\nabla^{2}f(\bm{y})-\nabla^{2}f(\bm{x})\right\|\leq L_{i}\|\bm{y}-\bm{x}\|\leq L\|\bm{y}-\bm{x}\|,\forall i\in[K],(\bm{x},\bm{y})\in C_{i}

which is uniformly bounded with LL.

Lemma B.3 (Hessian Lipschitz Lemma).

If ff is twice differentiable and has LL-Lipschitz continuous Hessian, then

|f(𝒚)f(𝒙)f(𝒙),𝒚𝒙122f(𝒙)(𝒚𝒙),(𝒚𝒙)|L6𝒚𝒙3.\left|f(\bm{y})-f(\bm{x})-\langle\nabla f(\bm{x}),\bm{y}-\bm{x}\rangle-\frac{1}{2}\left\langle\nabla^{2}f(\bm{x})(\bm{y}-\bm{x}),(\bm{y}-\bm{x})\right\rangle\right|\leq\frac{L}{6}\|\bm{y}-\bm{x}\|^{3}.

Gegenbauer Polynomials

We briefly define Gegenbauer polynomials here; details can be found in the Appendix of Mei & Montanari (2022). First, we denote \mathbb{S}^{d-1}(r)=\{\bm{x}\in\mathbb{R}^{d}:\|\bm{x}\|=r\} as the sphere of radius r, a (d-1)-dimensional manifold equipped with the uniform distribution. Let \mu_{d} be the uniform probability measure on \mathbb{S}^{d-1}(\sqrt{d}), and denote the inner product and norm in the functional space L^{2}(\mathbb{S}^{d-1}(\sqrt{d}),\mu_{d}) by \langle\cdot,\cdot\rangle_{L^{2}} and \|\cdot\|_{L^{2}}:

f,gL2𝕊d1(d)f(𝒙)g(𝒙)μd(d𝒙).\langle f,g\rangle_{L^{2}}\equiv\int_{\mathbb{S}^{d-1}(\sqrt{d})}f(\bm{x})g(\bm{x})\mu_{d}(\mathrm{~{}d}\bm{x}).

For any function \sigma\in L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d}), where \tau_{d} is the distribution of \langle\bm{x},\bm{y}\rangle/\sqrt{d}~(\bm{x},\bm{y}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))), the Gegenbauer polynomials \{Q^{d}_{t}\} of degree t~(t\geq 0) form an orthogonal basis, and the spherical harmonics coefficients \lambda_{d,t}(\sigma) can be expressed as:

\lambda_{d,t}(\sigma)=\int_{[-\sqrt{d},\sqrt{d}]}\sigma(x)Q_{t}^{(d)}(\sqrt{d}x)\,\tau_{d}(\mathrm{d}x),

then the Gegenbauer generating function holds in L2([d,d],τd)L^{2}\left([-\sqrt{d},\sqrt{d}],\tau_{d}\right) sense

\sigma(x)=\sum_{t=0}^{\infty}\lambda_{d,t}(\sigma)N_{d,t}Q_{t}^{(d)}(\sqrt{d}x)

where Nd,tN_{d,t} is the normalized factor depending on the norm of input.
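A small Monte Carlo sketch of the objects above (sampling Unif(\mathbb{S}^{d-1}(\sqrt{d})), the L^{2} inner product, and the distribution \tau_{d}) is given below. Sampling the sphere by normalizing Gaussian vectors is a standard trick, and the test functions are arbitrary illustrative choices.

# Monte Carlo sketch of the L^2 inner product over Unif(S^{d-1}(sqrt(d)))
# and of the distribution tau_d of <x, y>/sqrt(d).
import numpy as np

def sphere(n, d, rng):
    g = rng.standard_normal((n, d))
    return np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n = 50, 200_000
x = sphere(n, d, rng)

relu = lambda t: np.maximum(t, 0.0)
a = sphere(1, d, rng)[0]
f = relu(x @ a / np.sqrt(d))          # sigma(<a, x>/sqrt(d))
g = x[:, 0]                           # a simple test function of x
print("Monte Carlo <f, g>_{L^2}:", np.mean(f * g))

# tau_d: distribution of <x, y>/sqrt(d) for independent x, y on the sphere.
y = sphere(n, d, rng)
tau_samples = np.einsum("ij,ij->i", x, y) / np.sqrt(d)
print("mean, var of tau_d samples:", tau_samples.mean(), tau_samples.var())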

B.1 Assumptions

We discuss and list all assumptions we used in our theorems. The purposes are to offer clarity regarding the specific assumptions required for each theorem and ensure that the assumptions made in our theorems are well-founded and reasonable, reinforcing the validity and reliability of our results.

OOD Generalization

(Setting): Given a full sample space 𝒵\mathcal{Z}, source and target distributions are two different measures over this whole sample domain 𝒵\mathcal{Z}. The purpose is to study the robust algorithms in the OOD generalization setting.

(Assumptions): For any sample 𝒛𝒵\forall\bm{z}\in\mathcal{Z}, the loss function is bounded 𝜽(𝒛)[0,M]\ell_{\bm{\theta}}(\bm{z})\in[0,M]. This assumption generally follows the original paper Xu & Mannor (2012). While it is possible to relax this assumption and derive improved bounds, our primary objective is to formulate a framework for robust OOD generalization and establish a clear connection with the optimization properties of the model.

Robustness and Sharpness

(Setting): To give a fine-grained analysis, we follow the common choice of a two-layer ReLU neural network function class, which is widely analyzed in the literature, e.g., the Neural Tangent Kernel, the non-kernel (rich) regime Moroshko et al. (2020), and random feature models. Among them, we select the following random feature model as our function class:

f(\bm{w},A,\bm{x})\triangleq\frac{1}{\sqrt{d}}\sum_{i=1}^{m}w_{i}\sigma\left(\langle\bm{x},\bm{a}_{i}\rangle\right):w_{i}\in\mathbb{R},\bm{a}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})),i\in[m]

where m is the hidden size and A=[\bm{a}_{1},...,\bm{a}_{m}] contains random vectors uniformly distributed on the d-dimensional hypersphere, whose surface is a (d-1)-manifold. \sigma(\bm{a}^{\top}\bm{x})=(\bm{a}^{\top}\bm{x})\mathbb{I}\{\bm{a}^{\top}\bm{x}\geq 0\} denotes the ReLU activation function and \mathbb{I} is the indicator function. \bm{w}=[w_{1},...,w_{m}]^{\top} is the trainable parameter. We consider the following common loss functions: (1) losses with homogeneity in regression; (2) the (binary) cross-entropy loss; (3) the negative log-likelihood (NLL) loss. A minimal NumPy sketch of this function class is given below.
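The ridge-regression fit used to obtain \bm{\hat{w}} in this sketch is an illustrative choice of empirical minimizer and is not prescribed by the analysis.

# Minimal NumPy sketch of the random-feature ReLU class f(w, A, x):
# a_i ~ Unif(S^{d-1}(sqrt(d))) are frozen; only w is trained (here by ridge
# regression, an illustrative assumption).
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 20, 200

def sphere(k, dim):
    g = rng.standard_normal((k, dim))
    return np.sqrt(dim) * g / np.linalg.norm(g, axis=1, keepdims=True)

A = sphere(m, d)                    # random features, rows a_i
X = sphere(n, d)                    # inputs on the same sphere
y = np.sin(X[:, 0])                 # toy regression target

def features(X):
    return np.maximum(X @ A.T, 0.0) / np.sqrt(d)   # shape (n, m)

Phi = features(X)
lam = 1e-3
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)

f_pred = features(X) @ w_hat        # f(w_hat, A, x) on the training inputs
print("train MSE:", np.mean((f_pred - y) ** 2))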

(Assumptions):

(i) Let C_{i},i\in[K] be any set from the whole partition \cup_{i=1}^{K}C_{i}. We assume that for all \bm{z}\in C_{i}, the loss function \ell_{\bm{\hat{w}},A}(\bm{z}) satisfies the L-Hessian Lipschitz condition for all i\in[K] (see details in Definition B.2). Note that we only require this assumption to hold within each partition, instead of globally. In general, requiring smoothness and convexity only within each cell amounts to local convexity, which is a weak assumption for most function classes.

(ii) Consider an optimization problem in each partition C_{i},i\in[K]. Let one of the training points \bm{z}_{i}(A)\in S\cap C_{i} be the initial point and \bm{z}_{i}^{*}(A)\in\mathcal{M}_{i} be the corresponding nearest local minimum, where \mathcal{M}_{i} is the local minima set of partition C_{i}. For some \rho_{i}(L)>0, we assume \|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\rho_{i}(L)/L holds a.s. This ensures that the Hessian norm \|H(\bm{z}_{i}(A))\| has a lower bound. Similar conditions and estimations can be found in Zhang et al. (2019a).

(iii) To simplify the computation of probability, we assume ξi=𝒂i𝒙\xi_{i}=\bm{a}_{i}^{\top}\bm{x} obeys a rotationally invariant distribution.

(iv) For Corollary 3.5, we make the additional assumption that the loss function \ell(\cdot) satisfies a boundedness condition: for all \bm{x}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})), the second derivative |\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)/\partial f^{2}| with respect to its argument f(\bm{\hat{w}},A,\bm{x}) is bounded within [\tilde{M}_{1},\tilde{M}_{2}]. Note that we consider the case where \bm{x}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})), while m=\text{Poly}(d) is to ensure the positive definiteness of \sum_{i=1}^{m}\bm{a}_{i}\bm{a}_{i}^{\top}\in\mathbb{R}^{d\times d} almost surely.

Appendix C Proof to Domain shift

Lemma C.1.

Let 𝒟^T\hat{\mathcal{D}}_{T} be the empirical distribution of size nn drawn from 𝒟T\mathcal{D}_{T}. The loss \ell is upper bounded by M. With probability at least 1δ1-\delta (over the choice of the samples), for every f𝛉f_{\bm{\theta}} trained on SS, we have

T(f𝜽)S(f𝜽)+Md(ϵ,K)(S,𝒟^T)+ϵ(S)+2M2Kln2+2ln(1/δ)n\mathcal{L}_{T}(f_{\bm{\theta}})\leq\mathcal{L}_{S}(f_{\bm{\theta}})+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+\epsilon(S)+2M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}} (11)

where

\forall i\in[K],~n_{i}(S):=\#(\bm{z}\in S\cap C_{i}),~{}~{}~{}d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}):=\sum_{i=1}^{K}\left|\frac{n_{i}(S)}{n}-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right| (12)

and ni(S),ni(𝒟^T)n_{i}(S),n_{i}(\hat{\mathcal{D}}_{T}) are the number of samples from SS and 𝒟^T\hat{\mathcal{D}}_{T} that fall into the set CiC_{i}, respectively.
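Before the proof, a short sketch of how the empirical partition distance d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}) of Eq. 12 could be computed is given below. The partition-assignment function assign is a hypothetical helper mapping each sample to its cell index in [K]; both samples are assumed to have the same size n.

# Sketch of the empirical partition distance of Eq. 12.
import numpy as np

def partition_distance(Z_source, Z_target, assign, K):
    n = len(Z_source)
    assert len(Z_target) == n, "Lemma C.1 assumes equal sample sizes"
    cs = np.bincount(assign(Z_source), minlength=K)   # n_i(S)
    ct = np.bincount(assign(Z_target), minlength=K)   # n_i(D_T_hat)
    return np.abs(cs / n - ct / n).sum()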

Proof.

In the following generalization statement, we use (f𝜽,𝒛)\ell(f_{\bm{\theta}},\bm{z}) to denote the error obtained with input 𝒛\bm{z} and hypothesis function f𝜽f_{\bm{\theta}} for better illustration. By definition we have,

T(f𝜽)S(f𝜽):=𝔼𝒛𝒟T(f𝜽,𝒛)𝔼𝒛𝒟S(f𝜽,𝒛).\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}}):=\mathbb{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime})-\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z}). (13)

Then we partition the source distribution \mathcal{D}_{S} into the K sets. Let n_{i} be the number of points that fall into the partition C_{i}; (n_{1},...,n_{K}) follows a multinomial distribution with parameters (p_{s}(C_{1}),...,p_{s}(C_{K})). We use the parallel notation n^{\prime}_{i}\sim(p_{t}(C_{1}),...,p_{t}(C_{K})) for the target distribution \mathcal{D}_{T}. Since

\mathrm{E}_{\bm{z}\sim\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z}) =\sum_{i=1}^{K}\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}(\ell(f_{\bm{\theta}},\bm{z})|\bm{z}\in C_{i})p_{s}(C_{i}) (14)
E𝒛𝒟T(f𝜽,𝒛)\displaystyle\mathrm{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime}) =i=1K𝔼𝒛𝒟T((f𝜽,𝒛)|𝒛Ci)pt(Ci)\displaystyle=\sum_{i=1}^{K}\mathbb{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})p_{t}(C_{i})

and thus we have

T(f𝜽)S(f𝜽)=\displaystyle\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})= i=1K𝔼((f𝜽,𝒛)|𝒛Ci)pt(Ci)𝔼((f𝜽,𝒛)|𝒛Ci)ps(Ci)\displaystyle\sum_{i=1}^{K}\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})p_{t}(C_{i})-\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z})|\bm{z}\in C_{i})p_{s}(C_{i}) (15)
±𝔼((f𝜽,𝒛)|𝒛Ci)ps(Ci)\displaystyle\pm\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})p_{s}(C_{i})
=\displaystyle= i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(pt(Ci)ps(Ci))\displaystyle\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)(p_{t}(C_{i})-p_{s}(C_{i}))
+i=1K[𝔼((f𝜽,𝒛)|𝒛Ci)𝔼((f𝜽,𝒛)|𝒛Ci)]ps(Ci)\displaystyle+\sum_{i=1}^{K}\left[\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})-\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z})|\bm{z}\in C_{i})\right]p_{s}(C_{i})
\displaystyle\leq i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(pt(Ci)ps(Ci))+ϵ(S,A).\displaystyle\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)(p_{t}(C_{i})-p_{s}(C_{i}))+\epsilon(S,A).

Now sample empirical distributions S and \hat{\mathcal{D}}_{T}, of size n each, from \mathcal{D}_{S} and \mathcal{D}_{T}, respectively. Let (n_{1},...,n_{K}) be the multinomial counts of samples falling into the cells C_{i}, and use the parallel notation n^{\prime}_{i} for the target distribution. Define

d(ϵ,K)(S,𝒟^T):=iK|ni(S)nni(𝒟^T)n|.d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}):=\sum_{i}^{K}\left|\frac{n_{i}(S)}{n}-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right|. (16)

Further, we have

i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(pt(Ci)ps(Ci))i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(ni(𝒟^T)nni(S)n)\displaystyle\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)(p_{t}(C_{i})-p_{s}(C_{i}))-\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)\left(\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}-\frac{n_{i}(S)}{n}\right) (17)
=i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(pt(Ci)ni(𝒟^T)n)i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(ps(Ci)ni(S)n)\displaystyle=\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)\left(p_{t}(C_{i})-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right)-\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)\left(p_{s}(C_{i})-\frac{n_{i}(S)}{n}\right)
Mi=1K|pt(Ci)ni(𝒟^T)n|+Mi=1K|ps(Ci)ni(S)n|.\displaystyle\ \leq M\sum_{i=1}^{K}\left|p_{t}(C_{i})-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right|+M\sum_{i=1}^{K}\left|p_{s}(C_{i})-\frac{n_{i}(S)}{n}\right|.

With the Bretagnolle-Huber-Carol inequality we have

i=1K|ni(S)nps(Ci)|2Kln2+2ln(1/δ)n.\sum_{i=1}^{K}\left|\frac{n_{i}(S)}{n}-p_{s}(C_{i})\right|\leq\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}}. (18)

To integrate these two inequalities, we have

i=1K(𝔼((f𝜽,𝒛)\displaystyle\sum_{i=1}^{K}(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime}) |𝒛Ci))(pt(Ci)ps(Ci))\displaystyle|\bm{z}^{\prime}\in C_{i}))(p_{t}(C_{i})-p_{s}(C_{i})) (19)
i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(ni(𝒟^T)nni(S)n)+2M2Kln2+2ln(1/δ)n\displaystyle\leq\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)\left(\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}-\frac{n_{i}(S)}{n}\right)+2M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}}
Md(ϵ,K)(S,𝒟^T)+2M2Kln2+2ln(1/δ)n.\displaystyle\leq Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+2M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}}.

In summary with probability 1δ1-\delta we have

T(f𝜽)S(f𝜽)+Md(ϵ,K)(S,𝒟^T)+ϵ(S)+2M2Kln2+2ln(1/δ)n\mathcal{L}_{T}(f_{\bm{\theta}})\leq\mathcal{L}_{S}(f_{\bm{\theta}})+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+\epsilon(S)+2M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}} (20)

which completes the proof. ∎
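As a side note, the Bretagnolle-Huber-Carol concentration used in Eq. 18 can be checked numerically. The following Monte Carlo sketch, with an arbitrary choice of cell probabilities, estimates the violation rate, which should not exceed \delta.

# Monte Carlo sanity check (not part of the proof) of the
# Bretagnolle-Huber-Carol bound in Eq. 18.
import numpy as np

rng = np.random.default_rng(0)
K, n, delta, trials = 10, 2000, 0.05, 5000
p = rng.dirichlet(np.ones(K))                      # an arbitrary p_s(C_1..C_K)
bound = np.sqrt((2 * K * np.log(2) + 2 * np.log(1 / delta)) / n)

counts = rng.multinomial(n, p, size=trials)        # n_i(S) for each trial
l1_dev = np.abs(counts / n - p).sum(axis=1)
print("empirical violation rate:", np.mean(l1_dev > bound), "<= delta =", delta)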

With the result on the domain (distribution) shift and the relationship between sharpness and robustness, we can move forward to the final OOD generalization error bound. First, we state the ID robustness bound of Xu & Mannor (2012) as follows.

Lemma C.2 (Xu et al.Xu & Mannor (2012)).

Assume that for all hh\in\mathcal{H} and z𝒵z\in\mathcal{Z}, the loss is upper bounded by MM i.e., (h,z)M\ell(h,z)\leq M. If the learning algorithm 𝒜\mathcal{A} is (K,ϵ())(K,\epsilon(\cdot))-robust, then for any δ>0\delta>0, with probability at least 1δ1-\delta over an iid draw of nn samples S=(zi)i=1nS=\left(z_{i}\right)_{i=1}^{n}, it holds that:

𝔼z[(𝒜S,z)]1ni=1n(𝒜S,zi)+ϵ(S)+M2Kln2+2ln(1/δ)n\mathbb{E}_{z}\left[\ell\left(\mathcal{A}_{S},z\right)\right]\leq\frac{1}{n}\sum_{i=1}^{n}\ell\left(\mathcal{A}_{S},z_{i}\right)+\epsilon(S)+M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}}

As the concluding result, we prove the following by combining Lemma C.2 and Lemma C.1.

Theorem C.3 (Restatement of Theorem 3.1).

Let 𝒟^T\hat{\mathcal{D}}_{T} be the empirical distribution of size nn drawn from 𝒟T\mathcal{D}_{T}. The loss \ell is upper bounded by M. With probability at least 1δ1-\delta (over the choice of the samples), for every f𝛉f_{\bm{\theta}} trained on SS, we have

T(𝜽)^S(𝜽)+Md(ϵ,K)(S,𝒟^T)+2ϵ(S)+3M2Kln2+2ln(2/δ)n.\mathcal{L}_{T}(\bm{\theta})\leq\widehat{\mathcal{L}}_{S}(\bm{\theta})+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+2\epsilon(S)+3M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}}. (21)
Proof.

Firstly, with Lemma C.1 and probability at least 1-\frac{\delta}{2}, we have

T(f𝜽)S(f𝜽)+Md(ϵ,K)(𝒟^S,𝒟^T)+2M2Kln2+2ln(2/δ)n+ϵ(S)\mathcal{L}_{T}(f_{\bm{\theta}})\leq\mathcal{L}_{S}(f_{\bm{\theta}})+Md_{(\epsilon,K)}(\hat{\mathcal{D}}_{S},\hat{\mathcal{D}}_{T})+2M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}}+\epsilon(S)

Secondly, with Lemma C.2 (Theorem 3 of Xu & Mannor (2012)) and probability at least 1-\frac{\delta}{2}, we have

|S(f𝜽)^S(f𝜽)|ϵ(S)+M2Kln2+2ln(2/δ)n\left|\mathcal{L}_{S}(f_{\bm{\theta}})-\hat{\mathcal{L}}_{S}(f_{\bm{\theta}})\right|\leq\epsilon(S)+M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}}

By taking the union bound, we conclude our final result that with probability at least 1δ1-\delta

T(f𝜽)\displaystyle\mathcal{L}_{T}(f_{\bm{\theta}}) ^S(f𝜽)+3M2Kln2+2ln(2/δ)n+2ϵ(S)+Md(ϵ,K)(S,𝒟^T)\displaystyle\leq\hat{\mathcal{L}}_{S}(f_{\bm{\theta}})+3M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}}+2\epsilon(S)+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}) (22)

Here ϵ(S)\epsilon(S) is the robustness constant that we can replace with any sharpness measure.

C.1 Proof to Corollary 3.2

Definition C.4.

dΔ(𝒟T;𝒟S):=2sup𝒜(f)𝒜Δ|Pr𝒟S(𝒜(f))Pr𝒟T(𝒜(f))|d_{\mathcal{F}\Delta\mathcal{F}}\left(\mathcal{D}_{T};\mathcal{D}_{S}\right):=2\sup_{\mathcal{A}(f)\in\mathcal{A}_{\mathcal{F}\Delta\mathcal{F}}}|\operatorname{Pr}_{\mathcal{D}_{S}}(\mathcal{A}(f))-\operatorname{Pr}_{\mathcal{D}_{T}}(\mathcal{A}(f))| and Δ\mathcal{F}\Delta\mathcal{F} is defined as:

Δ:={f(x)f(x):f,f}\mathcal{F}\Delta\mathcal{F}:=\{f(x)\oplus f^{\prime}(x):f,f^{\prime}\in\mathcal{F}\}

where \oplus is the XOR operator e.g. 𝕀(f(x)f(x))\mathbb{I}(f^{\prime}(x)\neq f(x)).
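One common way to approximate such a hypothesis-class divergence from finite samples is a domain-classifier proxy in the spirit of the proxy A-distance. The sketch below is an illustration of that idea only, not the exact \mathcal{F}\Delta\mathcal{F} construction analyzed in Zhao et al. (2018).

# Hedged illustration: a proxy estimate of a hypothesis-class divergence
# between two samples via a domain classifier (proxy A-distance style).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_divergence(X_src, X_tgt, seed=0):
    X = np.vstack([X_src, X_tgt])
    dom = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]   # domain labels
    Xtr, Xte, ytr, yte = train_test_split(X, dom, test_size=0.5, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    err = 1.0 - clf.score(Xte, yte)       # domain-classification test error
    return 2.0 * (1.0 - 2.0 * err)        # larger value = bigger shift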

Lemma C.5 (Lemma 2 Zhao et al. (2018)).

If Pdim()=d\operatorname{Pdim}(\mathcal{F})=d^{\prime}, then VCdim(Δ)2d\operatorname{VCdim}(\mathcal{F}\Delta\mathcal{F})\leq 2d^{\prime} .

Proposition C.6 (Zhao et al. (2018)).

Let \mathcal{F} be a hypothesis class with pseudo dimension \operatorname{Pdim}(\mathcal{F})=d^{\prime}. If \widehat{\mathcal{D}}_{S} is the empirical distribution generated with n i.i.d. samples from the source domain, and \widehat{\mathcal{D}}_{T} is the empirical distribution on the target domain generated from n samples without labels, then with probability at least 1-\delta, for all f\in\mathcal{F}, we have:

T(f)^S(f)\displaystyle\mathcal{L}_{T}(f)\leq\widehat{\mathcal{L}}_{S}(f) ++2dlogendn+log2δ2n\displaystyle+\mathcal{E}^{*}+\sqrt{\frac{2d^{\prime}\log\frac{en}{d^{\prime}}}{n}}+\sqrt{\frac{\log\frac{2}{\delta}}{2n}} (23)
+12dΔ(𝒟^T;𝒟^S)+42dln(2n)+ln4δn(Empirical div Error)\displaystyle+\underbrace{\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+4\sqrt{\frac{2d^{\prime}\ln(2n)+\ln\frac{4}{\delta}}{n}}}_{(\text{Empirical div Error})}

where =^S(f)+^T(f)\mathcal{E}^{*}=\widehat{\mathcal{L}}_{S}(f^{*})+\widehat{\mathcal{L}}_{T}(f^{*}) is the total error of best hypothesis ff^{*} over source and target domain.

Proof.

With Lemma 4 (Zhao et al., 2018), we have

T(f)^S(f)+12dΔ+\mathcal{L}_{T}(f)\leq\widehat{\mathcal{L}}_{S}(f)+\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}+\mathcal{E}^{*}

where

=inffS(f)+T(f).\mathcal{E}^{*}=\inf_{f^{\prime}\in\mathcal{F}}\mathcal{L}_{S}(f^{\prime})+\mathcal{L}_{T}(f^{\prime}).

Lemma 6 (Zhao et al., 2018), which is actually Lemma 1 in (Ben-David et al., 2010), shows the following results

dΔ(𝒟T;𝒟S)dΔ(𝒟^T;𝒟^S)+4VCdim(Δ)ln(2n)+ln2δn.d_{\mathcal{F}\Delta\mathcal{F}}\left(\mathcal{D}_{T};\mathcal{D}_{S}\right)\leq d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+4\sqrt{\frac{\text{VCdim}(\mathcal{F}\Delta\mathcal{F})\ln(2n)+\ln\frac{2}{\delta}}{n}}.

As suggested in Zhao et al. (2018), \text{VCdim}(\mathcal{F}\Delta\mathcal{F}) is at most 2d^{\prime}. Further, with Theorem 2 of Ben-David et al. (2010), with probability at least 1-\frac{\delta}{2} we have

T(f)\displaystyle\mathcal{L}_{T}(f) S(f)+12dΔ(𝒟^T;𝒟^S)+4VCdim(Δ)ln(2n)+ln2δn+\displaystyle\leq\mathcal{L}_{S}(f)+\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+4\sqrt{\frac{\text{VCdim}(\mathcal{F}\Delta\mathcal{F})\ln(2n)+\ln\frac{2}{\delta}}{n}}+\mathcal{E}^{*} (24)
S(f)+12dΔ(𝒟^T;𝒟^S)+42dln(2n)+ln2δn+\displaystyle\leq\mathcal{L}_{S}(f)+\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+4\sqrt{\frac{2d^{\prime}\ln(2n)+\ln\frac{2}{\delta}}{n}}+\mathcal{E}^{*}

Using the in-domain generalization bound of Lemma 11.6 in Mohri et al. (2018), with probability at least 1-\frac{\delta}{2} we obtain

S(f)L^S(f)+M2dlogendn+Mlog1δ2n\mathcal{L}_{S}(f)\leq\widehat{L}_{S}(f)+M\sqrt{\frac{2d^{\prime}\log\frac{en}{d^{\prime}}}{n}}+M\sqrt{\frac{\log\frac{1}{\delta}}{2n}}

Note that in Zhao et al. (2018), M=1 for the normalized regression loss. Combining them all, we conclude the proof. ∎

Corollary C.7.

If K\to\infty and M=1, the domain shift bound |\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})| reduces to the (Empirical div Error) term in Proposition C.6, i.e.,

|T(f𝜽)S(f𝜽)|12dΔ(𝒟S;𝒟T)(Empirical div Error)|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})|\leq\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}(\mathcal{D}_{S};\mathcal{D}_{T})\leq\text{(Empirical div Error)} (25)
Proof.

According to the decomposition in Eq. 15, we have

T(f𝜽)S(f𝜽)=iK𝔼((f𝜽,𝒛)|𝒛Ci)pt(Ci)𝔼((f𝜽,𝒛)|𝒛Ci)ps(Ci)\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})=\sum_{i}^{K}\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})p_{t}(C_{i})-\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z})|\bm{z}\in C_{i})p_{s}(C_{i}) (26)

If K\to\infty, define the domain U:=\bigcup_{i=1}^{\infty}C_{i}. Equation (26) then becomes

T(f𝜽)S(f𝜽)\displaystyle\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}}) =𝒛U(f𝜽,𝒛)pt(𝒛)𝑑𝒛𝒛U(f𝜽,𝒛)ps(𝒛)𝑑𝒛\displaystyle=\int_{\bm{z}^{\prime}\in U}\ell(f_{\bm{\theta}},\bm{z}^{\prime})p_{t}(\bm{z}^{\prime})d\bm{z}-\int_{\bm{z}\in U}\ell(f_{\bm{\theta}},\bm{z})p_{s}(\bm{z})d\bm{z} (27)
=𝒛𝒟T(f𝜽,𝒛)pt(𝒛)𝑑𝒛𝒛𝒟S(f𝜽,𝒛)ps(𝒛)𝑑𝒛\displaystyle=\int_{\bm{z}^{\prime}\in\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime})p_{t}(\bm{z}^{\prime})d\bm{z}-\int_{\bm{z}\in\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z})p_{s}(\bm{z})d\bm{z}
=𝔼𝒛𝒟T(f𝜽,𝒛)𝔼𝒛𝒟S(f𝜽,𝒛).\displaystyle=\mathbb{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime})-\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z}).

In this case, we have,

|T(f𝜽)S(f𝜽)|\displaystyle\left|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})\right| (28)
=|𝔼𝒛𝒟T(f𝜽,𝒛)𝔼𝒛𝒟S(f𝜽,𝒛)|\displaystyle=\left|\mathbb{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime})-\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z})\right|
\displaystyle\leq\int_{0}^{\infty}\left|\operatorname{Pr}_{\mathcal{D}_{T}}\left(\ell(f(\bm{\theta},\bm{x}^{\prime}),y^{\prime})>t\right)-\operatorname{Pr}_{\mathcal{D}_{S}}\left(\ell(f(\bm{\theta},\bm{x}),y)>t\right)\right|dt
=01|Pr𝒟T((f(𝜽,𝒙),y)>t)Pr𝒟S((f(𝜽,𝒙),y)>t)|𝑑t(M=1)\displaystyle=\int_{0}^{1}\left|\operatorname{Pr}_{\mathcal{D}_{T}}\left(\ell(f(\bm{\theta},\bm{x}^{\prime}),y^{\prime})>t\right)-\operatorname{Pr}_{\mathcal{D}_{S}}\left(\ell(f(\bm{\theta},\bm{x}),y)>t\right)\right|dt~{}~{}~{}(M=1)
supt[0,1]supf(𝜽,)|Pr𝒟T((f(𝜽,𝒙),y)>t)Pr𝒟S((f(𝜽,𝒙),y)>t)|\displaystyle\leq\sup_{t\in[0,1]}\sup_{f(\bm{\theta},\cdot)\in\mathcal{F}}\left|\operatorname{Pr}_{\mathcal{D}_{T}}\left(\ell(f(\bm{\theta},\bm{x}^{\prime}),y^{\prime})>t\right)-\operatorname{Pr}_{\mathcal{D}_{S}}\left(\ell(f(\bm{\theta},\bm{x}),y)>t\right)\right|
sup𝒜(f)𝒜Δ|Pr𝒟T(𝒜(f))Pr𝒟S(𝒜(f))|\displaystyle\leq\sup_{\mathcal{A}(f)\in\mathcal{A}_{\mathcal{F}\Delta\mathcal{F}}}\left|\operatorname{Pr}_{\mathcal{\mathcal{D}}_{T}}(\mathcal{A}(f))-\operatorname{Pr}_{\mathcal{\mathcal{D}}_{S}}(\mathcal{A}(f))\right|
=12dΔ(𝒟S;𝒟T)(Empirical div error)\displaystyle=\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}(\mathcal{D}_{S};\mathcal{D}_{T})\leq(\text{Empirical div error})

where 𝒜Δ\mathcal{A}_{\mathcal{F}\Delta\mathcal{F}} represents a learning algorithm under the hypothesis Δ={f(x)f(x):f,f}\mathcal{F}\Delta\mathcal{F}=\{f(x)\oplus f^{\prime}(x):f,f^{\prime}\in\mathcal{F}\}, which completes the proof. ∎

Appendix D Sharpness and Robustness

Lemma D.1 (positive definiteness of Hessian).

Let \hat{w}_{\min} be the minimum absolute value of the entries of \bm{\hat{w}} and \bm{x}^{*}=\arg\min_{\bm{x}\in\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}\ell(f({\bm{\hat{w}}},A,\bm{x}),\bm{y}). For any A=(\bm{a}_{1},...,\bm{a}_{m}) with \bm{a}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))~\forall i\in[m], let \tilde{\sigma}(d,m)=\lambda_{\min}(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii})>0 denote the minimum eigenvalue, where G_{ij}=\sum_{t=0}^{\infty}\lambda^{2}_{d,t}(\sigma)N^{2}_{d,t}Q_{t}^{(d)}(\langle\bm{a}_{i},\bm{a}_{j}\rangle/\sqrt{d}) is the polynomial product constant. If m=\text{Poly}(d), the Hessian H(\bm{x}^{*}) can be lower bounded by

𝔼𝒙Unif(𝕊d1(d))H(𝒙)w^min2σ~(d,m)M~1dId.\mathbb{E}_{\bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}H(\bm{x}^{*})\succeq\frac{\hat{w}^{2}_{\min}\tilde{\sigma}(d,m)\tilde{M}_{1}}{d}I_{d}. (29)
Proof.

As suggested in Lemma D.6 of Zhong et al. (2017), we have a similar result to bound the local positive definiteness of the Hessian. By the previous definition, the Hessian w.r.t. \bm{x} satisfies the following partial order

H(\bm{x}^{*}) =\frac{D^{2}_{f}(\bm{x}^{*},y^{*})}{d}\left(\sum_{i=1}^{m}\sum_{j=1}^{m}\hat{w}_{i}\hat{w}_{j}\bm{a}_{i}\bm{a}^{\top}_{j}\sigma^{\prime}(\bm{a}_{i}^{\top}\bm{x}^{*})\sigma^{\prime}(\bm{a}_{j}^{\top}\bm{x}^{*})\right) (30)
\succeq\frac{\tilde{M}_{1}}{d}\left(\sum_{i=1}^{m}\sum_{j=1}^{m}\hat{w}_{i}\hat{w}_{j}\bm{a}_{i}\bm{a}^{\top}_{j}\sigma^{\prime}(\bm{a}_{i}^{\top}\bm{x}^{*})\sigma^{\prime}(\bm{a}_{j}^{\top}\bm{x}^{*})\right)
\succeq\frac{\hat{w}^{2}_{\min}\tilde{M}_{1}}{d}\left(\sum_{i=1}^{m}\sum_{j=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{j}\sigma^{\prime}(\bm{a}_{i}^{\top}\bm{x}^{*})\sigma^{\prime}(\bm{a}_{j}^{\top}\bm{x}^{*})\right)

For the ReLU activation function, we further have

σ(𝒂𝒙)σ(𝒂𝒙)\sigma(\bm{a}^{\top}\bm{x}^{*})\geq\sigma^{\prime}(\bm{a}^{\top}\bm{x}^{*}) (31)

We expand \sigma\in L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d}) (where \tau_{d} is the distribution of \langle\bm{x}_{1},\bm{x}_{2}\rangle/\sqrt{d}) in Gegenbauer polynomials:

σ(x)=t=0λd,t(σ)Nd,tQt(d)(dx).\sigma(x)=\sum_{t=0}^{\infty}\lambda_{d,t}(\sigma)N_{d,t}Q_{t}^{(d)}(\sqrt{d}x). (32)

Let A=(\bm{a}_{1},...,\bm{a}_{m})\in\mathbb{R}^{m\times d} and assume \bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})). Lemma C.7 in Mei & Montanari (2022) suggests that

𝑼=(𝔼𝒙Unif(𝕊d1(d))[σ(𝒂i,𝒙/d)σ(𝒂i,𝒙/d)])i,j[m]m×m\displaystyle\bm{U}=\left(\mathbb{E}_{\bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}[\sigma(\langle\bm{a}_{i},\bm{x}^{*}\rangle/\sqrt{d})\sigma(\langle\bm{a}_{i},\bm{x}^{*}\rangle/\sqrt{d})]\right)_{i,j\in[m]}\in\mathbb{R}^{m\times m} (33)

which shows that the matrix U is positive definite. Similarly, taking the expectation over \bm{x}^{*}, the term inside the bracket on the RHS of (30) can be rewritten as

𝔼𝒙Unif(𝕊d1(d))\displaystyle\operatorname{\mathbb{E}}_{\bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))} (i=1mj=1m𝒂i𝒂jσ(𝒂i𝒙)σ(𝒂j𝒙))\displaystyle\left(\sum_{i=1}^{m}\sum_{j=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{j}\sigma^{\prime}(\bm{a}_{i}^{\top}\bm{x}^{*})\sigma^{\prime}(\bm{a}_{j}^{\top}\bm{x}^{*})\right) (34)
i=1mj=1m𝒂i𝒂j𝔼𝒙[σ(𝒂i𝒙/d)σ(𝒂j𝒙/d)]\displaystyle\succeq\sum_{i=1}^{m}\sum_{j=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{j}\mathbb{E}_{\bm{x}^{*}}[\sigma(\bm{a}_{i}^{\top}\bm{x}^{*}/\sqrt{d})\sigma(\bm{a}_{j}^{\top}\bm{x}^{*}/\sqrt{d})]

Besides, we have the following property of Gegenbauer polynomials,

  1.

    For 𝒙,𝒚𝕊d1(d)\bm{x},\bm{y}\in\mathbb{S}^{d-1}(\sqrt{d})

    Qj(d)(𝒙,),Qk(d)(𝒚,)L2(𝕊d1(d),γd)=1Nd,kδjkQk(d)(𝒙,𝒚).\left\langle Q_{j}^{(d)}(\langle\bm{x},\cdot\rangle),Q_{k}^{(d)}(\langle\bm{y},\cdot\rangle)\right\rangle_{L^{2}\left(\mathbb{S}^{d-1}(\sqrt{d}),\gamma_{d}\right)}=\frac{1}{N_{d,k}}\delta_{jk}Q_{k}^{(d)}(\langle\bm{x},\bm{y}\rangle).
  2.

    For 𝒙,𝒚𝕊d1(d)\bm{x},\bm{y}\in\mathbb{S}^{d-1}(\sqrt{d})

    Qk(d)(𝒙,𝒚)=1Nd,ki=1Nd,kYki(d)(𝒙)Yki(d)(𝒚).Q_{k}^{(d)}(\langle\bm{x},\bm{y}\rangle)=\frac{1}{N_{d,k}}\sum_{i=1}^{N_{d,k}}Y_{ki}^{(d)}(\bm{x})Y_{ki}^{(d)}(\bm{y}).

where the spherical harmonics \{Y^{(d)}_{lj}\}_{1\leq j\leq N_{d,l}} form an orthonormal basis, which gives the following result

𝔼𝒙[σ(𝒂i𝒙/d)σ(𝒂j𝒙/d)]\displaystyle\mathbb{E}_{\bm{x}^{*}}[\sigma(\bm{a}_{i}^{\top}\bm{x}^{*}/\sqrt{d})\sigma(\bm{a}_{j}^{\top}\bm{x}^{*}/\sqrt{d})] =t=0λd,t2(σ)Nd,t2𝔼𝒙Qt(d)(𝒂i,𝒙/d)Qt(d)(𝒂j,𝒙/d)\displaystyle=\sum_{t=0}^{\infty}\lambda^{2}_{d,t}(\sigma)N^{2}_{d,t}\mathbb{E}_{\bm{x}^{*}}Q_{t}^{(d)}(\langle\bm{a}_{i},\bm{x}^{*}\rangle/\sqrt{d})Q_{t}^{(d)}(\langle\bm{a}_{j},\bm{x}^{*}\rangle/\sqrt{d}) (35)
=t=0λd,t2(σ)Nd,t2Qt(d)(𝒂i,𝒂j/d)=Gij<.\displaystyle=\sum_{t=0}^{\infty}\lambda^{2}_{d,t}(\sigma)N^{2}_{d,t}Q_{t}^{(d)}(\langle\bm{a}_{i},\bm{a}_{j}\rangle/\sqrt{d})=G_{ij}<\infty.

Hence, we have

i=1mj=1m𝒂i𝒂j𝔼𝒙[σ(𝒂i𝒙/d)σ(𝒂j𝒙/d)]=i=1m𝒂i𝒂iGii+𝒪(1/d)Var(𝒂)\displaystyle\sum_{i=1}^{m}\sum_{j=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{j}\mathbb{E}_{\bm{x}^{*}}[\sigma(\bm{a}_{i}^{\top}\bm{x}^{*}/\sqrt{d})\sigma(\bm{a}_{j}^{\top}\bm{x}^{*}/\sqrt{d})]=\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii}+\mathcal{O}(1/d)\text{Var}(\bm{a}) (36)

Since m=\text{Poly}(d) and \{\bm{a}_{i}\}_{i\in[m]} are i.i.d., we have \text{rank}\left(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii}\right)=\text{rank}\left(AA^{\top}\right)=d. Letting \tilde{\sigma}(d,m)=\mathbb{E}_{\bm{a}}\lambda_{\min}(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii})>0, we have

𝔼𝒙H(𝒙)w^min2σ~(d,m)M~1dId\mathbb{E}_{\bm{x}^{*}}H(\bm{x}^{*})\succeq\frac{\hat{w}^{2}_{\min}\tilde{\sigma}(d,m)\tilde{M}_{1}}{d}I_{d} (37)

Lemma D.2.

Let \bigcup_{k=1}^{K}C_{k} be the whole domain; the notion of (\epsilon,K)-robustness is described by

ϵ(S,A)maxCik=1KCksup𝒛,𝒛iCi,𝒛S|𝒘^,A(𝒛)𝒘^,A(𝒛i)|.\epsilon(S,A)\triangleq\max_{C_{i}\subset\bigcup_{k=1}^{K}C_{k}}\sup_{\bm{z},\bm{z}_{i}^{\prime}\in C_{i},\bm{z}\in S}\left|\ell_{\bm{\hat{w}},A}(\bm{z})-\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{\prime})\right|.

Define \mathcal{M}_{i} to be the set of global minima of \ell_{\bm{\hat{w}},A} within C_{i}, where

\mathcal{M}_{i}\triangleq\{\bm{z}^{*}(A)\,|\,\bm{z}^{*}(A)\in\arg\min_{\bm{z}\in C_{i}}\ell_{\bm{\hat{w}},A}(\bm{z})\}.

Suppose that for a maximum training-loss point

\bm{z}_{i}(A)\in\arg\max_{\bm{z}\in C_{i}\cap S}\ell_{\bm{\hat{w}},A}(\bm{z}),

there exists \bm{z}_{i}^{*}(A), defined by

\bm{z}_{i}^{*}(A)\triangleq\arg\min_{\bm{z}\in\mathcal{M}_{i}}\|\bm{z}-\bm{z}_{i}(A)\|,

such that \|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\frac{\rho_{i}(L)}{L} holds almost surely for any A\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})), and that for any A, \ell_{\bm{\hat{w}},A}(\bm{z}) is L-Hessian Lipschitz continuous. Then \epsilon(S,A) can be bounded by

ϵ(S,A)maxi[K]ρi(L)22L2(2𝒘^,A(𝒛i(A))+4ρi(L)3)\epsilon(S,A)\leq\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(\left\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\right\|+\frac{4\rho_{i}(L)}{3}\right)
Proof.

Let \bm{z}\in S denote a training pair (\bm{x},y) and \bm{z}_{i}^{\prime} denote any point in the set C_{i}. We define the local minima set \mathcal{M}_{i} (which is the set of global minima within C_{i}). Assume that for some maximum training-loss point \bm{z}_{i}(A)\in\arg\max_{\bm{z}\in C_{i}\cap S}\ell_{\bm{\hat{w}},A}(\bm{z}), there exists \bm{z}_{i}^{*}(A)\in\mathcal{M}_{i} almost surely for all A\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) such that

\bm{z}_{i}^{*}(A)=\arg\min_{\bm{z}\in\mathcal{M}_{i}}\|\bm{z}_{i}(A)-\bm{z}\|~{}~{}\text{s.t.}~{}~{}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\frac{\rho_{i}(L)}{L} (38)

By definition, ϵ(S,A)\epsilon(S,A) can be rewritten as

ϵ(S,A)\displaystyle\epsilon(S,A) =maxi[K]sup𝒛,𝒛iCi,𝒛S|𝒘^,A(𝒛)𝒘^,A(𝒛i)|\displaystyle=\max_{i\in[K]}\sup_{\bm{z},\bm{z}_{i}^{\prime}\in C_{i},\bm{z}\in S}\left|\ell_{\bm{\hat{w}},A}(\bm{z})-\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{\prime})\right| (39)
=maxi[K]sup𝒛CiS,𝒛i𝒘^,A(𝒛)𝒘^,A(𝒛)\displaystyle=\max_{i\in[K]}\sup_{\bm{z}\in C_{i}\cap S,\bm{z}^{*}\in\mathcal{M}_{i}}\ell_{\bm{\hat{w}},A}(\bm{z})-\ell_{\bm{\hat{w}},A}(\bm{z}^{*})
=maxi[K]𝒘^,A(𝒛i(A))𝒘^,A(𝒛i(A)).\displaystyle=\max_{i\in[K]}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))-\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A)).

According to Lemma B.3, we have

ϵ(S,A)=\displaystyle\epsilon(S,A)= maxi[K]𝒘^,A(𝒛i(A))𝒘^,A(𝒛i(A))\displaystyle\max_{i\in[K]}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))-\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A)) (40)
(i)\displaystyle\overset{(i)}{\leq} maxi[K]𝒘^,A(𝒛i(A)),𝒛i(A)𝒛i(A)\displaystyle\max_{i\in[K]}\left\langle\nabla\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A)),\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\right\rangle
+122𝒘^,A(𝒛i(A))(𝒛i(A)𝒛i(A)),𝒛i(A)𝒛i(A)+L6𝒛i(A)𝒛3\displaystyle+\frac{1}{2}\left\langle\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))(\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)),\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\right\rangle+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}^{*}\|^{3}
=\displaystyle= maxi[K]122𝒘^,A(𝒛i(A))(𝒛i(A)𝒛i(A)),𝒛i(A)𝒛i(A)+L6𝒛i(A)𝒛i(A)3\displaystyle\max_{i\in[K]}\frac{1}{2}\left\langle\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))(\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)),\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\right\rangle+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{3}
\displaystyle\leq maxi[K]122𝒘^,A(𝒛i(A))𝒛i(A)𝒛i(A)2\displaystyle\max_{i\in[K]}\frac{1}{2}\left\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))\right\|\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{2}
+L6𝒛i(A)𝒛i(A)3(Cauchy-Schwarz)\displaystyle+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{3}~{}~{}~{}~{}~{}(\textit{Cauchy-Schwarz})

where (i) is supported by the fact that \nabla\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))=0. With the Lipschitz continuity of the Hessian we have

2𝒘^,A(𝒛i(A))L𝒛i(A)𝒛i(A)+2𝒘^,A(𝒛i(A)).\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))\|\leq L\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|+\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\|. (41)

Overall, we have

ϵ(S,A)\displaystyle\epsilon(S,A) maxi[K]122𝒘^,A(𝒛i(A))𝒛i(A)𝒛i(A)2+L6𝒛i(A)𝒛i(A)3\displaystyle\leq\max_{i\in[K]}\frac{1}{2}\left\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))\right\|\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{2}+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{3} (42)
maxi[K]12(2𝒘^,A(𝒛i(A))+L𝒛i(A)𝒛i(A))𝒛i(A)𝒛i(A)2\displaystyle\leq\max_{i\in[K]}\frac{1}{2}\left(\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\|+L\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\right)\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{2}
+L6𝒛i(A)𝒛i(A)3\displaystyle+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{3}
maxi[K]12(2𝒘^,A(𝒛i(A))+ρi(L))ρi(L)2L2+ρi(L)36L2\displaystyle\leq\max_{i\in[K]}\frac{1}{2}\left(\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\|+\rho_{i}(L)\right)\frac{\rho_{i}(L)^{2}}{L^{2}}+\frac{\rho_{i}(L)^{3}}{6L^{2}}
=maxi[K]ρi(L)22L22𝒘^,A(𝒛i(A))+2ρi(L)33L2\displaystyle=\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\|+\frac{2\rho_{i}(L)^{3}}{3L^{2}}

which completes the proof. ∎
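A one-dimensional numeric sanity check of this bound is given below, using the toy loss \ell(z)=\frac{1}{2}hz^{2}+\frac{c}{6}z^{3}, whose Hessian is |c|-Lipschitz with local minimum z^{*}=0; the constants h and c are arbitrary illustrative choices, and \rho is set to L\,|z_{i}-z^{*}| so that the assumption of the lemma holds with equality.

# 1-D sanity check of Lemma D.2's bound for a toy loss with Hessian-Lipschitz
# constant L = |c| and local minimum z* = 0 (illustrative example only).
import numpy as np

h, c = 2.0, 0.5                      # ell(z) = 0.5*h*z^2 + (c/6)*z^3
L = abs(c)
ell = lambda z: 0.5 * h * z**2 + (c / 6.0) * z**3
hess = lambda z: h + c * z

rng = np.random.default_rng(0)
for z_i in rng.uniform(0.0, 1.0, size=5):
    rho = L * abs(z_i)               # so |z_i - z*| = rho / L exactly
    gap = ell(z_i) - ell(0.0)        # epsilon-type loss gap within the cell
    bound = rho**2 / (2 * L**2) * (abs(hess(z_i)) + 4 * rho / 3)
    print(f"z_i={z_i:.3f}  gap={gap:.4f}  bound={bound:.4f}  ok={gap <= bound}")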

Lemma D.3 (Lemma 2.1 Bourin et al. (2013)).

For every matrix in 𝕄n+m+\mathbb{M}_{n+m}^{+} partitioned into blocks, we have a decomposition

[AXXB]=U[A000]U+V[000B]V\left[\begin{array}[]{cc}A&X\\ X^{*}&B\end{array}\right]=U\left[\begin{array}[]{cc}A&0\\ 0&0\end{array}\right]U^{*}+V\left[\begin{array}[]{ll}0&0\\ 0&B\end{array}\right]V^{*}

for some unitaries U,V𝕄n+mU,V\in\mathbb{M}_{n+m}.

Lemma D.4.

Given an arbitrary positive semi-definite matrix partitioned into blocks as in Lemma D.3, we have

[AXXB]sAs+Bs\left\|\left[\begin{array}[]{cc}A&X\\ X^{*}&B\end{array}\right]\right\|_{s}\leq\|A\|_{s}+\|B\|_{s}

for all symmetric norms.

Proof.

In lemma D.3 we have

[AXXB]=U[A000]U+V[000B]V\left[\begin{array}[]{cc}A&X\\ X^{*}&B\end{array}\right]=U\left[\begin{array}[]{cc}A&0\\ 0&0\end{array}\right]U^{*}+V\left[\begin{array}[]{cc}0&0\\ 0&B\end{array}\right]V^{*}

for some unitaries U,V\in\mathbb{M}_{n+m}. The result then follows from the fact that symmetric norms are unitarily invariant, non-decreasing functions of the singular values: writing f=\|\cdot\|_{s}:\mathbb{M}\mapsto\mathbb{R} and applying the triangle inequality, we have

f([AXXB])f(U[A000]U)+f(V[000B]V)f\left(\left[\begin{array}[]{cc}A&X\\ X^{*}&B\end{array}\right]\right)\leq f\left(U\left[\begin{array}[]{cc}A&0\\ 0&0\end{array}\right]U^{*}\right)+f\left(V\left[\begin{array}[]{cc}0&0\\ 0&B\end{array}\right]V^{*}\right)
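As a quick numeric illustration of Lemma D.4, the sketch below draws a random positive semi-definite block matrix and compares spectral norms (the spectral norm being one example of a symmetric norm).

# Numeric check of Lemma D.4 for the spectral norm.
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
M = rng.standard_normal((n + m, n + m))
P = M @ M.T                                   # PSD matrix in M_{n+m}^+
A, B = P[:n, :n], P[n:, n:]
lhs = np.linalg.norm(P, 2)                    # spectral norm of the block matrix
rhs = np.linalg.norm(A, 2) + np.linalg.norm(B, 2)
print(lhs <= rhs + 1e-12, lhs, rhs)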

Lemma D.5.

For \bm{a}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) and a vector \bm{x}\in\mathbb{R}^{d} with norm \|\bm{x}\|\equiv\sqrt{R(d)}, where R(d)\geq d, we have

(𝒙,𝒂2𝒂2)min{2arccos(1R(d))π,|12d4πR(d)exp(14d9)|}\mathbb{P}(\langle\bm{x},\bm{a}\rangle^{2}\geq\|\bm{a}\|^{2})\geq\min\left\{\frac{2\arccos(\frac{1}{\sqrt{R(d)}})}{\pi},\left|1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}\exp(\frac{1}{4d-9})\right|\right\} (43)
Proof.

Since \|\bm{a}\|=\sqrt{d}, writing \bm{a}=\sqrt{d}\,\bm{e} with \bm{e} the unit vector in the direction of \bm{a}, we have

(𝒙,𝒂2𝒂2)=(𝒙,𝒆21)\mathbb{P}(\langle\bm{x},\bm{a}\rangle^{2}\geq\|\bm{a}\|^{2})=\mathbb{P}(\langle\bm{x},\bm{e}\rangle^{2}\geq 1) (44)

Similarly, writing \bm{x}=\sqrt{R(d)}\,\bm{s} with \bm{s} a unit vector, we have

(𝒙,𝒆21)=(𝒔,𝒆21R(d))\mathbb{P}(\langle\bm{x},\bm{e}\rangle^{2}\geq 1)=\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right) (45)

Solving 𝒔,𝒆2=1R(d)\langle\bm{s},\bm{e}\rangle^{2}=\frac{1}{R(d)}, we get

𝒔,𝒆2=cos2ϕ=1R(d)ϕ=arccos±1R(d)\langle\bm{s},\bm{e}\rangle^{2}=\cos^{2}\phi=\frac{1}{R(d)}\Rightarrow\phi=\arccos\pm\frac{1}{\sqrt{R(d)}} (46)

In this case, the probability converges to 1 as R(d) increases. As is well known, the surface area of \mathbb{S}^{d-1} with radius r equals

Ad=rd12πd/2Γ(d2)A_{d}=r^{d-1}\frac{2\pi^{d/2}}{\Gamma\left(\frac{d}{2}\right)} (47)

The area A_{d}^{\mathrm{cap}}(r) of the spherical cap with polar angle \phi equals

Adcap(r)=0ϕAd1(rsinθ)rdθ=2π(d1)/2Γ(d12)rd10ϕsind2θdθ.A_{d}^{\mathrm{cap}}(r)=\int_{0}^{\phi}A_{d-1}(r\sin\theta)r\mathrm{d}\theta=\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}r^{d-1}\int_{0}^{\phi}\sin^{d-2}\theta\mathrm{d}\theta. (48)

where Γ(n12)=(2(n1))!22(n1)(n1)!π\Gamma\left(n-\frac{1}{2}\right)=\frac{(2(n-1))!}{2^{2(n-1)}(n-1)!}\sqrt{\pi}.

1) When d=1d=1, almost surely we have

((xe)2e2)=(x21)=1.\mathbb{P}((x\cdot e)^{2}\geq e^{2})=\mathbb{P}(x^{2}\geq 1)=1. (49)

2) When d=2, the unit vector \bm{e} is uniformly distributed on the circle \mathbb{S}^{1} (r=1), and the probability equals the fraction of the circle spanned by the admissible angles between \bm{s} and \bm{e}:

(𝒔,𝒆21R(d))=20ϕr𝑑θπ=2ϕπ.\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)=\frac{2\int_{0}^{\phi}rd\theta}{\pi}=\frac{2\phi}{\pi}. (50)

3) When d3d\geq 3, the probability equals

(|𝒔,𝒆|1R(d))=Adcap(r)12Ad=1A~dcap(r)12Ad\mathbb{P}\left(|\langle\bm{s},\bm{e}\rangle|\geq\frac{1}{\sqrt{R(d)}}\right)=\frac{A_{d}^{\mathrm{cap}}(r)}{\frac{1}{2}A_{d}}=1-\frac{\tilde{A}_{d}^{\mathrm{cap}}(r)}{\frac{1}{2}A_{d}} (51)

where A~dcap(r)\tilde{A}_{d}^{\mathrm{cap}}(r) is the remaining area of cutting the hyperspherical caps in half of the sphere,

A~dcap(r)\displaystyle\tilde{A}_{d}^{\mathrm{cap}}(r) =2π(d1)/2Γ(d12)ϕπ2sind2θdθ\displaystyle=\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}\int_{\phi}^{\frac{\pi}{2}}\sin^{d-2}\theta\mathrm{d}\theta (52)
2π(d1)/2Γ(d12)ϕπ2sinθdθ\displaystyle\leq\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}\int_{\phi}^{\frac{\pi}{2}}\sin\theta\mathrm{d}\theta
2π(d1)/2Γ(d12)(cosπ2+cosϕ)\displaystyle\leq\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}(-\cos\frac{\pi}{2}+\cos\phi)
=2π(d1)/2Γ(d12)cosϕ.\displaystyle=\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}\cos\phi.

3-a) If dd is even then Γ(d2)=(d21)!\Gamma\left(\frac{d}{2}\right)=\left(\frac{d}{2}-1\right)!, so

Γ(d12)Γ(d2)=(d2)!π2d2(d21)!2=π2d2(d2d22).\frac{\Gamma\left(\frac{d-1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)}=\frac{(d-2)!\sqrt{\pi}}{2^{d-2}\left(\frac{d}{2}-1\right)!^{2}}=\frac{\sqrt{\pi}}{2^{d-2}}\left(\begin{array}[]{c}d-2\\ \frac{d-2}{2}\end{array}\right). (53)

Robbins’ bounds (Robbins, 1955) imply that for any positive integer dd

4dπdexp(18d1)<(2dd)<4dπdexp(18d+1).\frac{4^{d}}{\sqrt{\pi d}}\exp\left(-\frac{1}{8d-1}\right)<\left(\begin{array}[]{c}2d\\ d\end{array}\right)<\frac{4^{d}}{\sqrt{\pi d}}\exp\left(-\frac{1}{8d+1}\right). (54)

So we have

\frac{2\tilde{A}_{d}^{\mathrm{cap}}(r)}{A_{d}} \leq\frac{2\Gamma\left(\frac{d}{2}\right)}{\sqrt{\pi}\Gamma\left(\frac{d-1}{2}\right)}\cos\phi=\frac{2^{d-1}}{\pi}\left(\begin{array}{c}d-2\\ \frac{d-2}{2}\end{array}\right)^{-1}\cos\phi (55)
<2d1ππ(d2)/22d2exp(14d9)cosϕ\displaystyle<\frac{2^{d-1}}{\pi}\frac{\sqrt{\pi(d-2)/2}}{2^{d-2}\exp(-\frac{1}{4d-9})}\cos\phi
<2d4πexp(14d9)cosϕ.\displaystyle<\frac{\sqrt{2d-4}}{\sqrt{\pi}\exp(-\frac{1}{4d-9})}\cos\phi.

So the probability will be

(𝒔,𝒆21R(d))>12d4πexp(14d9)cosϕ.\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)>1-\frac{\sqrt{2d-4}}{\sqrt{\pi}}\exp(\frac{1}{4d-9})\cos\phi. (56)

Suppose R(d)=kd,k>1R(d)=kd,k>1, we have

limd2d4πkdexp(14d9)=2kπ.\lim_{d\to\infty}\frac{\sqrt{2d-4}}{\sqrt{\pi kd}\exp(-\frac{1}{4d-9})}=\sqrt{\frac{2}{k\pi}}. (57)

3-b) Similarly, if dd is odd, then Γ(d12)=(d32)!\Gamma\left(\frac{d-1}{2}\right)=\left(\frac{d-3}{2}\right)!, so

Γ(d12)Γ(d2)=2d2(d32)!2(d2)!π=2d2(d2)π(d3d32)1.\frac{\Gamma\left(\frac{d-1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)}=\frac{2^{d-2}\left(\frac{d-3}{2}\right)!^{2}}{(d-2)!\sqrt{\pi}}=\frac{2^{d-2}}{(d-2)\sqrt{\pi}}\left(\begin{array}[]{c}d-3\\ \frac{d-3}{2}\end{array}\right)^{-1}. (58)

If d=3d=3 then

(𝒔,𝒆21R(d))12π2πcosϕ=11R(d).\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)\geq 1-\frac{2\sqrt{\pi}}{2\sqrt{\pi}}\cos\phi=1-\frac{1}{\sqrt{R(d)}}. (59)

If d>3d>3 then Robbins’ bounds imply that

πΓ(d12)2Γ(d2)=2d4d2(d3d32)1>π(d3)/2d2exp(14d11).\displaystyle\frac{\sqrt{\pi}\Gamma\left(\frac{d-1}{2}\right)}{2\Gamma\left(\frac{d}{2}\right)}=\frac{2^{d-4}}{d-2}\left(\begin{array}[]{c}d-3\\ \frac{d-3}{2}\end{array}\right)^{-1}>\frac{\sqrt{\pi(d-3)/2}}{d-2}\exp\left(\frac{1}{4d-11}\right). (60)

Thus, the probability will be at least

(𝒔,𝒆21R(d))>1d2π(d3)exp(14d11)cosϕ.\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)>1-\frac{d-2}{\sqrt{\pi(d-3)}}\exp(-\frac{1}{4d-11})\cos\phi. (61)

To simplify the result, we lower bound the probability uniformly for all d\geq 3:

\displaystyle\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right) >1-\frac{d-2}{\sqrt{\pi(d-3)}}\exp(-\frac{1}{4d-11})\cos\phi (62)
\displaystyle\geq 1-\frac{d-2}{\sqrt{\pi(d-3)}}\exp(\frac{1}{4d-11})\cos\phi
\displaystyle=1-\frac{d-2}{\sqrt{\pi(d-3)R(d)}}\exp(\frac{1}{4d-11})
\displaystyle>1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}\exp(\frac{1}{4d-9})

Overall, we have d+\forall d\in\mathbb{N}_{+},

(𝒔,𝒆21R(d))>min{2arccos(1R(d))π,|12d4πR(d)exp(14d9)|}.\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)>\min\left\{\frac{2\arccos(\frac{1}{\sqrt{R(d)}})}{\pi},\left|1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}\exp(\frac{1}{4d-9})\right|\right\}. (63)
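A Monte Carlo sketch of the probability bounded in Lemma D.5 is given below; the choice R(d)=kd follows the discussion around Eq. 57, and the particular values of d and k are illustrative.

# Monte Carlo sketch of P(<x, a>^2 >= ||a||^2) for a ~ Unif(S^{d-1}(sqrt(d)))
# and a fixed x with ||x||^2 = R(d) = k*d, compared with the claimed bound.
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 20, 4.0, 200_000
R = k * d

g = rng.standard_normal((trials, d))
a = np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)   # ||a|| = sqrt(d)
x = np.zeros(d); x[0] = np.sqrt(R)                              # ||x||^2 = R(d)

emp = np.mean((a @ x) ** 2 >= d)                                # ||a||^2 = d
lower = min(2 * np.arccos(1 / np.sqrt(R)) / np.pi,
            abs(1 - np.sqrt(2 * d - 4) / np.sqrt(np.pi * R) * np.exp(1 / (4 * d - 9))))
print("empirical probability:", emp, "claimed lower bound:", lower)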

D.1 Proof to Eq. 7

Proof.

Let \mathcal{A}^{d}=\left(\bm{a}_{i}\right)_{i\in[m]}\overset{i.i.d.}{\sim}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})). We consider the random ReLU NN function class to be

relu(𝒜d)={f(𝒘,A,𝒙)=1di=1mwiσ(𝒙𝒂i):wi,i[m]}\mathcal{F}_{relu}(\mathcal{A}^{d})=\left\{f(\bm{w},A,\bm{x})=\frac{1}{\sqrt{d}}\sum_{i=1}^{m}w_{i}\sigma\left(\bm{x}^{\top}\bm{a}_{i}\right):w_{i}\in\mathbb{R},i\in[m]\right\}

where A=[𝒂1,,𝒂m]d×mA=[\bm{a}_{1},...,\bm{a}_{m}]\in\mathbb{R}^{d\times m}. The empirical minimizer of the source domain is

\bm{\hat{w}}=\arg\min_{\bm{w}\in\mathbb{R}^{m}}\frac{1}{n}\sum_{(\bm{x}_{i},y_{i})\in S}\ell(f(\bm{w},A,\bm{x}_{i}),y_{i})=\arg\min_{\bm{w}\in\mathbb{R}^{m}}\frac{1}{n}\sum_{\bm{z}_{i}\in S}\ell_{\bm{w},A}(\bm{z}_{i}). (64)

Then with the chain rule, the first derivative of any input 𝒙\bm{x} at 𝒘^\bm{\hat{w}} will be

\nabla_{\bm{x}}\ell\left(f(\bm{\hat{w}},A,\bm{x}),y\right) =\frac{\partial\ell\left(f(\bm{\hat{w}},A,\bm{x}),y\right)}{\partial f(\bm{\hat{w}},A,\bm{x})}\frac{\partial f(\bm{\hat{w}},A,\bm{x})}{\partial\sigma(A\bm{x})}\frac{\partial\sigma(A\bm{x})}{\partial\bm{x}} (65)
=Df1(𝒙,y)di=1mw^i𝒂i𝕀{𝒂i𝒙0}\displaystyle=\frac{D^{1}_{f}(\bm{x},y)}{\sqrt{d}}\sum_{i=1}^{m}\hat{w}_{i}\bm{a}_{i}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}\geq 0\right\}
=Df1(𝒙,y)d𝒘^σ(A𝒙)\displaystyle=\frac{D^{1}_{f}(\bm{x},y)}{\sqrt{d}}\bm{\hat{w}}^{\top}\sigma^{\prime}(A\bm{x})

where the short notation D^{1}_{f}(\bm{x},y) denotes the first-order derivative of \ell with respect to f(\bm{\hat{w}},A,\bm{x}), and \sigma^{\prime}(A\bm{x})\in\mathbb{R}^{m\times d} is the Jacobian matrix of \sigma(A\bm{x}) w.r.t. the input \bm{x}. Analogously, the second-order derivative \partial^{2}\ell/\partial f^{2} is denoted by D^{2}_{f}(\bm{x},y); thus the Hessian is

𝒙2(f(𝒘^,A,𝒙),y)\displaystyle\nabla^{2}_{\bm{x}}\ell(f(\bm{\hat{w}},A,\bm{x}),y) =Df2(𝒙,y)di=1mj=1mw^iw^j𝒂i𝒂j𝕀{𝒂i𝒙0,𝒂j𝒙0}\displaystyle=\frac{D^{2}_{f}(\bm{x},y)}{d}\sum_{i=1}^{m}\sum_{j=1}^{m}\hat{w}_{i}\hat{w}_{j}\bm{a}_{i}\bm{a}^{\top}_{j}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}\geq 0~{},\bm{a}_{j}^{\top}\bm{x}\geq 0\right\} (66)
=Df2(𝒙,y)dσ(A𝒙)𝒘^𝒘^σ(A𝒙)\displaystyle=\frac{D^{2}_{f}(\bm{x},y)}{d}\sigma^{\prime}(A\bm{x})^{\top}\bm{\hat{w}}\bm{\hat{w}}^{\top}\sigma^{\prime}(A\bm{x})

Similarly, we have

y2(f(𝒘^,A,𝒙),y)=Dy2(𝒙,y)(sgn(y))2Df2(𝒙,y)\nabla^{2}_{y}\ell(f(\bm{\hat{w}},A,\bm{x}),y)=D^{2}_{y}(\bm{x},y)\cdot(\mathrm{sgn}(y))^{2}\overset{*}{\leq}D^{2}_{f}(\bm{x},y) (67)

where \mathrm{sgn}(y) is the sign function. The starred inequality (*) holds under our choice of the family of loss functions:

  1.

    Homogeneity in regression, i.e. L1, MSE, MAE, Huber Loss, we have |Dy2(𝒙,y)|=|Df2(𝒙,y)||D^{2}_{y}(\bm{x},y)|=|D^{2}_{f}(\bm{x},y)|;

  2.

    (Binary) Cross-Entropy Loss:

    D^{2}_{y}(\bm{x},y)=\partial^{2}\ell/\partial y^{2}=0, since the (binary) cross-entropy loss is linear in the label y;
  3.

    Negative Log Likelihood (NLL) loss: Dy2(𝒙,y)=0D^{2}_{y}(\bm{x},y)=0.

Besides, as a convex loss function, Dy2(𝒙,y)0D^{2}_{y}(\bm{x},y)\geq 0. Hence, the range of Dy2(𝒙,y)D^{2}_{y}(\bm{x},y) will be [0,Df2(𝒙,y)][0,D^{2}_{f}(\bm{x},y)].

To combine with robustness, we denote \bm{z}=(\bm{x};y)\in\mathbb{R}^{d+1}. Therefore, the Hessian w.r.t. \bm{z} will be

H(𝒛|S,A):=[𝒙2(f(𝒘^,A,𝒙),y)2(f(𝒘^,A,𝒙),y))y𝒙(2(f(𝒘^,A,𝒙),y)y𝒙)y2((f(𝒘^,A,𝒙),y))].H(\bm{z}|S,A):=\begin{bmatrix}\nabla^{2}_{\bm{x}}\ell(f(\bm{\hat{w}},A,\bm{x}),y)&\frac{\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y))}{\partial y\partial\bm{x}}\\ \left(\frac{\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)}{\partial y\partial\bm{x}}\right)^{\top}&\nabla^{2}_{y}(\ell(f(\bm{\hat{w}},A,\bm{x}),y))\end{bmatrix}. (68)

With Lemma D.4, the spectral norm of Hessian 𝒛\bm{z} will be bounded by

H(𝒛)𝒙2(f(𝒘^,A,𝒙),y)+|y2(f(𝒘^,A,𝒙),y)|.\left\|H(\bm{z})\right\|\leq\left\|\nabla^{2}_{\bm{x}}\ell(f(\bm{\hat{w}},A,\bm{x}),y)\right\|+\left|\nabla^{2}_{y}\ell(f(\bm{\hat{w}},A,\bm{x}),y)\right|. (69)

The first term in (69) can be further bounded by

Df2(𝒙,y)σ(A𝒙)𝒘^𝒘^σ(A𝒙)\displaystyle\left\|D^{2}_{f}(\bm{x},y)\sigma^{\prime}(A\bm{x})^{\top}\bm{\hat{w}}\bm{\hat{w}}^{\top}\sigma^{\prime}(A\bm{x})\right\| |Df2(𝒙,y)|𝒘^𝒘^σ(A𝒙)σ(A𝒙)\displaystyle\leq|D^{2}_{f}(\bm{x},y)|\left\|\bm{\hat{w}}\bm{\hat{w}}^{\top}\right\|\left\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right\| (70)
=Df2(𝒙,y)𝒘^2σ(A𝒙)σ(A𝒙)\displaystyle=D^{2}_{f}(\bm{x},y)\|\bm{\hat{w}}\|^{2}\left\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right\|

where the convexity of the loss function (\forall\bm{x},y,~D^{2}_{f}(\bm{x},y)\geq 0) supports the last equality. For the rightmost factor we have

σ(A𝒙)σ(A𝒙)σ(A𝒙)σ(A𝒙)F=tr(σ(A𝒙)σ(A𝒙)).\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\|\leq\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\|_{F}=\operatorname{tr}\left(\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right). (71)

In summary, we have the following inequality:

H(𝒛)\displaystyle\left\|H(\bm{z})\right\| Df2(𝒙,y)(1d𝒘^2σ(A𝒙)σ(A𝒙)+1)\displaystyle\leq D^{2}_{f}(\bm{x},y)\left(\frac{1}{d}\|\bm{\hat{w}}\|^{2}\left\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right\|+1\right) (72)
Df2(𝒙,y)(1d𝒘^2tr(σ(A𝒙)σ(A𝒙))+1)\displaystyle\leq D^{2}_{f}(\bm{x},y)\left(\frac{1}{d}\|\bm{\hat{w}}\|^{2}\operatorname{tr}\left(\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right)+1\right)
=Df2(𝒙,y)(1d𝒘^2j=1m𝒂j2𝕀{𝒂j𝒙0}+1)\displaystyle=D^{2}_{f}(\bm{x},y)\left(\frac{1}{d}\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}\geq 0\right\}+1\right)

By Lemma D.2, the bound depends on some \bm{z}_{i}=(\bm{x}_{i},y_{i})\in S\cap C_{i}:

ϵ(S,A)\displaystyle\epsilon(S,A) maxi[K]ρi(L)22L2(H(𝒛i(A))+4ρi(L)3)\displaystyle\leq\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(\|H(\bm{z}_{i}(A))\|+\frac{4\rho_{i}(L)}{3}\right) (73)
\displaystyle\leq\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(D^{2}_{f}(\bm{x}_{i},y_{i})\left(\frac{\|\bm{\hat{w}}\|^{2}}{d}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{i}\geq 0\right\}+1\right)+\frac{4\rho_{i}(L)}{3}\right)
ρmax(L)22L2(Df2(𝒙k,yk)d𝒘^2j=1m𝒂j2𝕀{𝒂j𝒙k0}+4ρmax(L)3+o~κ)\displaystyle\leq\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(\frac{D^{2}_{f}(\bm{x}_{k},y_{k})}{d}\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\right\}+\frac{4\rho_{\max}(L)}{3}+\tilde{o}_{\kappa}\right)

where \rho_{\max}(L)=\max\{\rho_{i}(L)\}_{i=1}^{K}, and \tilde{o}_{\kappa}=\mathcal{O}(\ell^{\prime\prime}(f,\bm{x}_{k},y_{k})\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\right\}d/m) is a lower-order term compared to the first term since m\gg d. In the last inequality, the maximum is attained at some (\bm{x}_{k},y_{k})\in\hat{C}\in\{C_{i}\}_{i\in[K]}. Because (\bm{x}_{k},y_{k})\in S is one of the training samples, k\in[n], there must exist n^{\prime}\in[0,n] such that

Df2(𝒙k,yk)𝒘^2j=1m\displaystyle D^{2}_{f}(\bm{x}_{k},y_{k})\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m} 𝒂j2𝕀{𝒂j𝒙k0}\displaystyle\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\right\} (74)
=𝒘^2nnk=1nDf2(𝒙k,yk)dj=1m𝒂j2𝕀{𝒂j𝒙k0}.\displaystyle=\|\bm{\hat{w}}\|^{2}\frac{n^{\prime}}{n}\sum_{k=1}^{n}\frac{D^{2}_{f}(\bm{x}_{k},y_{k})}{d}\sum_{j=1}^{m}\|\bm{a}_{j}\|^{2}\mathbb{I}\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\}.

Recall that the sharpness of parameter 𝒘^\bm{\hat{w}} is defined by

κ(𝒘^,S,A)\displaystyle\kappa(\bm{\hat{w}},S,A) :=𝒘^2tr[HS,A(𝒘^)]\displaystyle:=\|\bm{\hat{w}}\|^{2}\operatorname{tr}[H_{S,A}(\bm{\hat{w}})] (75)
=\|\bm{\hat{w}}\|^{2}\frac{1}{n}\sum_{j=1}^{n}\frac{D^{2}_{f}(\bm{x}_{j},y_{j})}{d}\operatorname{tr}\left(\sigma(A\bm{x}_{j})\sigma(A\bm{x}_{j})^{\top}\right)
=𝒘^21nj=1nDf2(𝒙j,yj)di=1m(𝒂i𝒙j)2𝕀{𝒂i𝒙j0}.\displaystyle=\|\bm{\hat{w}}\|^{2}\frac{1}{n}\sum_{j=1}^{n}\frac{D^{2}_{f}(\bm{x}_{j},y_{j})}{d}\sum_{i=1}^{m}(\bm{a}_{i}^{\top}\bm{x}_{j})^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\}.

Let \xi_{i}=\bm{a}_{i}^{\top}\bm{x}\sim D(\xi) and let q_{i} denote the expected number of indices with \xi_{i}>0, where D(\xi) is some rotationally invariant distribution, e.g., a uniform or normal distribution. Under this circumstance, the sample mean of the \xi_{i} still obeys a distribution from the same family, D(t\xi). Thus, we have

(j=1mξi𝕀{ξi0}j=1m𝒂j2𝕀{ξi0})\displaystyle\mathbb{P}\left(\sum_{j=1}^{m}\xi_{i}\mathbb{I}\left\{\xi_{i}\geq 0\right\}\geq\sum_{j=1}^{m}\|\bm{a}_{j}\|^{2}\mathbb{I}\left\{\xi_{i}\geq 0\right\}\right) =(j=1qiξij=1qi𝒂j2)\displaystyle=\mathbb{P}\left(\sum_{j=1}^{q_{i}}\xi_{i}\geq\sum_{j=1}^{q_{i}}\|\bm{a}_{j}\|^{2}\right) (76)
=((𝒂𝒙)2𝒂2)=𝔼𝒙p(𝒙)\displaystyle=\mathbb{P}((\bm{a}^{\top}\bm{x})^{2}\geq\|\bm{a}\|^{2})=\mathbb{E}_{\bm{x}}p(\bm{x})

With Eq. 43, with probability at least

\mathbb{P}((\bm{a}^{\top}\bm{x})^{2}\geq\|\bm{a}\|^{2})\geq\min\left\{\frac{2}{\pi}\arccos(R(d)^{-\frac{1}{2}}),\left|1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}e^{\frac{1}{4d-9}}\right|\right\} (77)

the following inequality holds,

ϵ(S,A)\displaystyle\epsilon(S,A) ρmax(L)22L2(nκ(𝒘^,S,A)+4ρmax(L)3+o~κ)\displaystyle\leq\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\frac{4\rho_{\max}(L)}{3}+\tilde{o}_{\kappa}\right) (78)
=ρmax(L)22L2([n+𝒪(dm)]κ(𝒘^,S,A)+43ρmax(L))\displaystyle=\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(\left[n^{\prime}+\mathcal{O}\left(\frac{d}{m}\right)\right]\kappa(\bm{\hat{w}},S,A)+\frac{4}{3}\rho_{\max}(L)\right)
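As a sanity check of the closed form in Eq. 75, the sketch below computes \kappa(\bm{\hat{w}},S,A) for the random-feature model under the squared loss \ell=(f-y)^{2} (so D^{2}_{f}=2), once via the explicitly formed empirical-loss Hessian in \bm{w} and once via the closed form; the toy data and the ridge fit are illustrative assumptions.

# Sketch: sharpness kappa(w_hat, S, A) of Eq. 75 for the random-feature ReLU
# model under squared loss. The trace of the explicit Hessian in w should
# match the closed form.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 300, 15, 60

def sphere(k, dim):
    g = rng.standard_normal((k, dim))
    return np.sqrt(dim) * g / np.linalg.norm(g, axis=1, keepdims=True)

A, X = sphere(m, d), sphere(n, d)
y = X[:, 0] ** 2
Phi = np.maximum(X @ A.T, 0.0) / np.sqrt(d)          # rows: sigma(A x_j)/sqrt(d)
w_hat = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(m), Phi.T @ y)

# Empirical-loss Hessian w.r.t. w: (1/n) sum_j 2 * phi_j phi_j^T.
H_w = 2.0 * Phi.T @ Phi / n
kappa_trace = np.linalg.norm(w_hat) ** 2 * np.trace(H_w)

# Closed form of Eq. 75 with D^2_f = 2.
pre = X @ A.T                                        # (n, m) of a_i^T x_j
kappa_closed = (np.linalg.norm(w_hat) ** 2 / n) * np.sum(2.0 / d * (pre ** 2) * (pre >= 0))
print(kappa_trace, kappa_closed)                     # the two should match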

D.2 Proof to Corollary 3.5

Corollary D.6 (Restatement of Corollary 3.5).

Let w^min\hat{w}_{\min} be the minimum value of |𝐰^||\bm{\hat{w}}|. Suppose 𝐱Unif(𝕊d1(d))\forall\bm{x}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) and |2(f(𝐰^,A,𝐱),y)/f2||\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)/\partial f^{2}| is bounded by [M~1,M~2][\tilde{M}_{1},\tilde{M}_{2}]. If m>dm>d, ρmax(L)<(w^min2M~1σ~(d,m))/(2d)\rho_{\max}(L)<(\hat{w}^{2}_{\min}\tilde{M}_{1}\tilde{\sigma}(d,m))/(2d) for any A=(𝐚1,,𝐚m),𝐚iUnif(𝕊d1(d))i[m]A=(\bm{a}_{1},...,\bm{a}_{m}),\bm{a}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))\forall i\in[m], taking expectation over all 𝐱jUnif(𝕊d1(d))\bm{x}_{j}\in\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) in SS, we have

𝔼S,A[ϵ(S,A)]𝔼S,A7ρmax(L)26L2(nκ(𝒘^,S,A)+M~2).\mathbb{E}_{S,A}\left[\epsilon(S,A)\right]\leq\mathbb{E}_{S,A}\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\tilde{M}_{2}\right). (79)

where \tilde{\sigma}(d,m)=\mathbb{E}_{\bm{a}}\lambda_{\min}(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii})>0 is the minimum eigenvalue and G_{ii} is the product constant of the Gegenbauer polynomials

Gij=t=0λd,t2(σ)Nd,t2Qt(d)(𝒂i,𝒂j/d).G_{ij}=\sum_{t=0}^{\infty}\lambda^{2}_{d,t}(\sigma)N^{2}_{d,t}Q_{t}^{(d)}(\langle\bm{a}_{i},\bm{a}_{j}\rangle/\sqrt{d}).
Proof.

In our main theorem, with some probability, we have the following relation

ϵ(S,A)ρmax(L)22L2(nκ(𝒘^,S,A)+4ρmax(L)3+dκ).\epsilon(S,A)\leq\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\frac{4\rho_{\max}(L)}{3}+d_{\kappa}\right).

So we are concerned with the relation between \kappa(\bm{\hat{w}},S,A) and the second term. If \rho_{i}(L)<\kappa(\bm{\hat{w}},S,A), the RHS is dominated by the sharpness term \kappa(\bm{\hat{w}},S,A), i.e., the main effect comes from the sharpness. As suggested in Lemma D.1, we have

𝔼𝒙Unif(𝕊d1(d))H(𝒙)w^min2M~1σ~A(m,d)dId\mathbb{E}_{\bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}H(\bm{x}^{*})\succeq\frac{\hat{w}^{2}_{\min}\tilde{M}_{1}\tilde{\sigma}_{A}(m,d)}{d}I_{d} (80)

where 𝒙\bm{x}^{*} is the global minimum over the whole set. As defined in (38), the following condition holds true

𝒛i(A)i,𝒛i(A)𝒛i(A)ρi(L)L,\exists\bm{z}_{i}^{*}(A)\in\mathcal{M}_{i},\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\frac{\rho_{i}(L)}{L}, (81)

and, by the Lipschitz continuity of the Hessian, the following relation holds almost surely for arbitrary \bm{x}:

𝔼𝒛i(A)H(𝒛i(A))H(𝒛i(A))\displaystyle\mathbb{E}_{\bm{z}_{i}^{*}(A)}\|H(\bm{z}_{i}(A))-H(\bm{z}_{i}^{*}(A))\| L𝒛i(A)𝒛i(A)ρi(L)\displaystyle\leq L\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\rho_{i}(L) (82)
w^min2M~1σ~(d,m)2dId\displaystyle\leq\left\|\frac{\hat{w}^{2}_{\min}\tilde{M}_{1}\tilde{\sigma}(d,m)}{2d}I_{d}\right\|
𝔼A,𝒙12H(𝒙)\displaystyle\leq\mathbb{E}_{A,\bm{x}^{*}}\frac{1}{2}\|H(\bm{x}^{*})\|
𝔼𝒛i(A)12H(𝒛i(A)).\displaystyle\leq\mathbb{E}_{\bm{z}_{i}^{*}(A)}\frac{1}{2}\|H(\bm{z}_{i}^{*}(A))\|.

Obviously, H(𝒛i(A))>2ρi(L)\|H(\bm{z}_{i}^{*}(A))\|>2\rho_{i}(L). Following Lemma A.2 of Zhang et al. (2019a), for 𝒛i(A),𝒛(A)Ci\bm{z}_{i}^{*}(A),\forall\bm{z}(A)\in C_{i}, we have a similar result that

σ~min(H(𝒛(A)))σ~min(H(𝒛i(A)))H(𝒛(A))H(𝒛i(A))ρi(L)\tilde{\sigma}_{\min}(H(\bm{z}(A)))\geq\tilde{\sigma}_{\min}(H(\bm{z}_{i}^{*}(A)))-\|H(\bm{z}(A))-H(\bm{z}_{i}^{*}(A))\|\geq\rho_{i}(L) (83)

where σ~min\tilde{\sigma}_{\min} denotes the minimum singular value. With Lemma D.2, we know that

𝔼S,Aϵ(S,A)𝔼S,Amaxi[K]ρi(L)22L2(H(𝒛i(A))+4ρi(L)3).\mathbb{E}_{S,A}\epsilon(S,A)\leq\mathbb{E}_{S,A}\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(\|H(\bm{z}_{i}(A))\|+\frac{4\rho_{i}(L)}{3}\right). (84)

We also have another condition that

|Df2(𝒙,y)|:=|2(f(𝒘^,A,𝒙),y)f2|[M~1,M~2],𝒙,y.|D^{2}_{f}(\bm{x},y)|:=\left|\frac{\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)}{\partial f^{2}}\right|\in[\tilde{M}_{1},\tilde{M}_{2}],\forall\bm{x},y. (85)

Combining all these results, we finally have

𝔼S,Aϵ(S,A)𝔼S,Amaxi[K]ρi(L)22L2(1+43)H(𝒛i(A))\displaystyle\mathbb{E}_{S,A}\epsilon(S,A)\leq\mathbb{E}_{S,A}\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(1+\frac{4}{3}\right)\|H(\bm{z}_{i}(A))\| (86)
𝔼S,A7ρmax(L)26L2(Df2(𝒙k,yk)d𝒘^2j=1m𝒂j2𝕀{𝒂j𝒙k0}+M~2).\displaystyle\leq\mathbb{E}_{S,A}\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(\frac{D^{2}_{f}(\bm{x}_{k},y_{k})}{d}\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\right\}+\tilde{M}_{2}\right).

Recall the definition of κ(𝒘^,S,A)\kappa(\bm{\hat{w}},S,A) in the main theorem that

𝔼S,Aκ(𝒘^,S,A)𝔼{𝒙}n𝒘^21nj=1nDf2(𝒙j,yj)di=1m(𝒂i𝒙j)2𝕀{𝒂i𝒙j0}\mathbb{E}_{S,A}\kappa(\bm{\hat{w}},S,A)\leq\mathbb{E}_{\{\bm{x}\}^{n}}\|\bm{\hat{w}}\|^{2}\frac{1}{n}\sum_{j=1}^{n}\frac{D^{2}_{f}(\bm{x}_{j},y_{j})}{d}\sum_{i=1}^{m}(\bm{a}_{i}^{\top}\bm{x}_{j})^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\} (87)

Looking at the second sum, we have

𝔼{𝒙j}n,{𝒂i}mi=1m(𝒂i𝒙j)2𝕀{𝒂i𝒙j0}\displaystyle\mathbb{E}_{\{\bm{x}_{j}\}^{n},\{\bm{a}_{i}\}^{m}}\sum_{i=1}^{m}(\bm{a}_{i}^{\top}\bm{x}_{j})^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\} =i=1m𝔼𝒙j𝔼𝒂i𝒂i2𝒙j2cos2(β)𝕀{𝒂i𝒙j0}\displaystyle=\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\|\bm{x}_{j}\|^{2}\cos^{2}(\beta)\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\} (88)
=i=1m𝔼𝒙j𝔼𝒂i𝒂i2𝕀{𝒂i𝒙j0}dcos2(β).\displaystyle=\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\}d\cos^{2}(\beta).

Suppose 𝒙\bm{x} and 𝒂\bm{a} are i.i.d. from Unif(𝕊d1(1))\text{Unif}(\mathbb{S}^{d-1}(1)), let u=𝒙,𝒂u=\langle\bm{x},\bm{a}\rangle, we have a well-known result that

𝔼u[u2]\displaystyle\mathbb{E}_{u}[u^{2}] =𝔼[𝒙,𝒂2]\displaystyle=\mathbb{E}[\langle\bm{x},\bm{a}\rangle^{2}] (89)
=𝔼[𝒙2𝒂2cos2(β)]\displaystyle=\mathbb{E}[\|\bm{x}\|^{2}\|\bm{a}\|^{2}\cos^{2}(\beta)]
=𝔼[cos2(β)]=1d,𝒙,𝒂d\displaystyle=\mathbb{E}[\cos^{2}(\beta)]=\frac{1}{d},~{}~{}~{}~{}\bm{x},\bm{a}\in\mathbb{R}^{d}

Therefore, in (88),

i=1m𝔼𝒙j𝔼𝒂i𝒂i2𝕀{𝒂i𝒙j0}dcos2(β)=i=1m𝔼𝒙j𝔼𝒂i𝒂i2𝕀{𝒂i𝒙j0}\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\}d\cos^{2}(\beta)=\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\} (90)

and we have (based on proof of main theorem),

𝔼S,Aϵ(S,A)\displaystyle\mathbb{E}_{S,A}\epsilon(S,A) (91)
7ρmax(L)26L2(𝒘^2nnj=1nDf2(𝒙j,yj)di=1m𝔼𝒙j𝔼𝒂i𝒂i2𝕀{𝒂i𝒙j0}+M~2)\displaystyle\leq\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(\|\bm{\hat{w}}\|^{2}\frac{n^{\prime}}{n}\sum_{j=1}^{n}\frac{D^{2}_{f}(\bm{x}_{j},y_{j})}{d}\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\mathbb{I}\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\}+\tilde{M}_{2}\right)
𝔼S,A7ρmax(L)26L2(nκ(𝒘^,S,A)+M~2)\displaystyle\leq\mathbb{E}_{S,A}\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\tilde{M}_{2}\right)
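As a quick numerical sanity check of the identity 𝔼u[u2]=1/d\mathbb{E}_{u}[u^{2}]=1/d used in the proof above (an illustration only, not part of the formal argument), the following minimal Python sketch estimates the expectation by Monte Carlo; the dimension and sample size are arbitrary choices.

import numpy as np

# Monte Carlo check of E[<x, a>^2] = 1/d for x, a drawn i.i.d. from Unif(S^{d-1}(1)).
rng = np.random.default_rng(0)
d, n_samples = 50, 200_000

def unit_sphere(n, d):
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

x, a = unit_sphere(n_samples, d), unit_sphere(n_samples, d)
print(np.mean(np.sum(x * a, axis=1) ** 2), 1 / d)   # the two values should nearly match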

Appendix E Case Study

To better illustrate our theorems, we present two cases that make the intuition concrete. The first is a very basic model, ridge regression. Ridge regression penalizes the 2\ell_{2} norm of the weights, providing a straightforward way to reduce the "variance" of the model and avoid overfitting. In this case, the mechanism is equivalent to reducing the model's sharpness.

Example E.1.

In ridge regression models, the robustness constant ϵ\epsilon has an inverse relationship with the regularization parameter β\beta: as β\beta\uparrow, the learned minimum tends to be flatter (κ\kappa\downarrow) and the model less sensitive (ϵ\epsilon\downarrow). Following the previous notation, where ϵ(S,A)\epsilon(S,A) denotes the robustness and κ(𝛉^,S)\kappa(\bm{\hat{\theta}},S) the sharpness on training set SS, we have

τ>0,c(0,n],ϵ(S,A)cκ(𝜽^,S)+o~d\tau>0,~{}~{}c\in(0,n],~{}~{}\epsilon(S,A)\leq c\kappa(\bm{\hat{\theta}},S)+\tilde{o}_{d}

where o~d\tilde{o}_{d} is a much smaller order than κ(𝛉^,S)\kappa(\bm{\hat{\theta}},S).

E.1 Ridge regression

We consider a generic response model as stated in Ali et al. (2019).

𝒚|𝜽(X𝜽,σ2I)\bm{y}|\bm{\theta}_{*}\sim(X\bm{\theta}_{*},\sigma^{2}I)

The ridge regression minimization problem is defined by

min𝜽12nX𝜽𝒚2+β2𝜽2,Xn×d,n<d.\min_{\bm{\theta}}\frac{1}{2n}\|X\bm{\theta}-\bm{y}\|^{2}+\frac{\beta}{2}\|\bm{\theta}\|^{2},X\in\mathbb{R}^{n\times d},n<d. (92)

The least-square solution of ridge regression is

𝜽^=(XX+nβI)1X𝒚.\bm{\hat{\theta}}=\left(X^{\top}X+n\beta I\right)^{-1}X^{\top}\bm{y}. (93)
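For concreteness, the following minimal numpy sketch (illustrative only; the data sizes and noise level are assumptions) computes the ridge estimator of Eq. (93) and the sharpness expression derived below in Eq. (103).

import numpy as np

rng = np.random.default_rng(0)
n, d, beta, sigma = 40, 100, 0.5, 0.1

X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + sigma * rng.standard_normal(n)

# Least-squares ridge solution: theta_hat = (X^T X + n*beta*I)^{-1} X^T y, cf. Eq. (93)
theta_hat = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)

# Sharpness kappa(theta_hat, S) = ||theta_hat||^2 * (mean_j ||x_j||^2 + beta), cf. Eq. (103)
kappa = np.sum(theta_hat ** 2) * (np.mean(np.sum(X ** 2, axis=1)) + beta)
print(kappa)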

With minimizer 𝜽^\bm{\hat{\theta}}, we now focus on its geometry w.r.t. 𝒙\bm{x}. Let S={𝒛}in=(X,𝒚)S=\{\bm{z}\}_{i}^{n}=(X,\bm{y}) be the training set, (𝒵,Σ,ρ)(\mathcal{Z},\Sigma,\rho) be a measure space. Consider the bounded sample set 𝒵\mathcal{Z} such that

M>0,ρ(𝒵)<+.\exists M>0,~{}~{}\rho(\mathcal{Z})<+\infty. (94)

The 𝒵\mathcal{Z} can be partitioned into KK disjoint sets {Ci}i[K]\{C_{i}\}_{i\in[K]}. By definition, we have robustness defined by each partition CiC_{i},

𝒛,𝒛Ci,|(𝜽^,𝒛)(𝜽^,𝒛)|ϵ(S,A).\forall\bm{z},\bm{z}^{\prime}\in C_{i},\left|\ell(\bm{\hat{\theta}},\bm{z})-\ell(\bm{\hat{\theta}},\bm{z}^{\prime})\right|\leq\epsilon(S,A). (95)

For this convex function (𝜽^,𝒛)\ell(\bm{\hat{\theta}},\bm{z}), we have the following upper bound in the whole sample domain

ϵ(S)\displaystyle\epsilon(S) =maxi[K]sup𝒛,𝒛Ci|(𝜽^,𝒛)(𝜽^,𝒛)|\displaystyle=\max_{i\in[K]}\sup_{\bm{z},\bm{z}^{\prime}\in C_{i}}\left|\ell(\bm{\hat{\theta}},\bm{z})-\ell(\bm{\hat{\theta}},\bm{z}^{\prime})\right| (96)
=maxi[K]sup𝒛Ci(𝜽^,𝒛)(𝜽^,𝒛iCi)\displaystyle=\max_{i\in[K]}\sup_{\bm{z}\in C_{i}}\ell(\bm{\hat{\theta}},\bm{z})-\ell(\bm{\hat{\theta}},\bm{z}_{i}^{*}\in C_{i})
sup𝒛j𝒵S(𝜽^,𝒛j)(𝜽^,𝒛)\displaystyle\leq\sup_{\bm{z}_{j}\in\mathcal{Z}\cap S}\ell(\bm{\hat{\theta}},\bm{z}_{j})-\ell(\bm{\hat{\theta}},\bm{z}^{*})

where 𝒛i,𝒛\bm{z}_{i}^{*},\bm{z}^{*} are the global minimum points in CiC_{i} and in the whole domain 𝒵=iKCi\mathcal{Z}=\bigcup_{i}^{K}C_{i}, respectively. 𝒛j\bm{z}_{j} is the training data point with the maximum loss difference from the optimum. Specifically, it and the augmented form of 𝜽^\bm{\hat{\theta}} can be expressed as

𝒛=[x1,,xd,y],𝜽^+=[θ^1,,θ^d,1],d+1.\bm{z}=[x_{1},...,x_{d},y]^{\top},~{}~{}~{}\bm{\hat{\theta}}_{+}=[\hat{\theta}_{1},...,\hat{\theta}_{d},-1]^{\top},\in\mathbb{R}^{d+1}.

Then the loss difference can be rewritten as

𝜽^+(𝒛)=(𝜽^+𝒛)2=(𝜽^𝒙y)2H(𝜽^+(𝒛)) is a P.S.D. matrix.\ell_{\bm{\hat{\theta}}_{+}}(\bm{z})=(\bm{\hat{\theta}}_{+}^{\top}\bm{z})^{2}=(\bm{\hat{\theta}}^{\top}\bm{x}-y)^{2}\Rightarrow H(\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}))\text{ is a P.S.D. matrix}.

It is a convex function with regards to 𝒛\bm{z} such that

ϵ(S)\displaystyle\epsilon(S) sup𝒛jS𝜽^+(𝒛j)𝜽^+(𝒛)\displaystyle\leq\sup_{\bm{z}_{j}\in S}\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}_{j})-\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}) (97)
=sup𝒛jS𝜽^+(𝒛)(𝒛j𝒛)+12(𝒛j𝒛)H(𝜽^(𝒛))(𝒛j𝒛)\displaystyle=\sup_{\bm{z}_{j}\in S}\nabla\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*})^{\top}(\bm{z}_{j}-\bm{z}^{*})+\frac{1}{2}(\bm{z}_{j}-\bm{z}^{*})^{\top}H(\ell_{\bm{\hat{\theta}}}(\bm{z}^{*}))(\bm{z}_{j}-\bm{z}^{*})
=sup𝒛jS12(𝒛j𝒛)H(𝜽^+(𝒛))(𝒛j𝒛)\displaystyle=\sup_{\bm{z}_{j}\in S}\frac{1}{2}(\bm{z}_{j}-\bm{z}^{*})^{\top}H(\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}))(\bm{z}_{j}-\bm{z}^{*})
sup𝒛jS12H(𝜽^+(𝒛))𝒛j𝒛2\displaystyle\leq\sup_{\bm{z}_{j}\in S}\frac{1}{2}\|H(\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}))\|\|\bm{z}_{j}-\bm{z}^{*}\|^{2}

where the first equality follows from the exact second-order expansion of the quadratic loss around \bm{z}^{*}, and the second equality holds because the gradient \nabla\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}) vanishes at the global minimum (where \ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*})=0). Further, with Lemma D.4, we have

\|H(\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}))\|=\left\|\begin{bmatrix}\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}&\frac{\partial^{2}\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*})}{\partial y\partial\bm{x}}\\ \left(\frac{\partial^{2}\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*})}{\partial\bm{x}\partial y}\right)^{\top}&1\end{bmatrix}\right\|\leq\left\|\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right\|+1. (98)

In (97), we can also bound the norm of the input difference by

𝒛j𝒛2𝒙j𝒙2+(yjy)2\|\bm{z}_{j}-\bm{z}^{*}\|^{2}\leq\|\bm{x}_{j}-\bm{x}^{*}\|^{2}+(y_{j}-y^{*})^{2} (99)

and then for simplicity, assume 𝒙2𝒙j2\|\bm{x}^{*}\|^{2}\leq\|\bm{x}_{j}\|^{2} we have

ϵ(S)\displaystyle\epsilon(S) sup𝒙jS12(𝜽^𝜽^+1)(𝒙j𝒙2+(yjy)2)\displaystyle\leq\sup_{\bm{x}_{j}\in S}\frac{1}{2}\left(\left\|\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right\|+1\right)\left(\|\bm{x}_{j}-\bm{x}^{*}\|^{2}+(y_{j}-y^{*})^{2}\right) (100)
sup𝒙jS12(𝜽^2+1)(𝒙j2+𝒙2+(yjy)2)\displaystyle\leq\sup_{\bm{x}_{j}\in S}\frac{1}{2}\left(\left\|\bm{\hat{\theta}}\right\|^{2}+1\right)\left(\|\bm{x}_{j}\|^{2}+\|\bm{x}^{*}\|^{2}+(y_{j}-y^{*})^{2}\right)
sup𝒙jS12𝜽^2(𝒙j2+𝒙2)+𝒪(d)\displaystyle\leq\sup_{\bm{x}_{j}\in S}\frac{1}{2}\left\|\bm{\hat{\theta}}\right\|^{2}\left(\|\bm{x}_{j}\|^{2}+\|\bm{x}^{*}\|^{2}\right)+\mathcal{O}(d)
sup𝒙jS𝜽^2𝒙j2+𝒪(d)\displaystyle\leq\sup_{\bm{x}_{j}\in S}\left\|\bm{\hat{\theta}}\right\|^{2}\|\bm{x}_{j}\|^{2}+\mathcal{O}(d)

where 𝜽^2𝒙j2=𝒪(d2)\left\|\bm{\hat{\theta}}\right\|^{2}\|\bm{x}_{j}\|^{2}=\mathcal{O}(d^{2}) is the dominant term for large dd. Now, let us look at the relation to sharpness. By definition,

κ(𝜽^,S)=𝜽^2tr(H𝜽^((𝜽^,S)))=𝜽^2tr(XXn+βI).\displaystyle\kappa(\bm{\hat{\theta}},S)=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}\left(H_{\bm{\hat{\theta}}}(\ell(\bm{\hat{\theta}},S))\right)=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}\left(\frac{X^{\top}X}{n}+\beta I\right). (101)

Since

tr(XXn+βI)=tr(XXn)+tr(βI)=tr(XXn)+β=1njn𝒙j2+β,\operatorname{tr}\left(\frac{X^{\top}X}{n}+\beta I\right)=\operatorname{tr}\left(\frac{X^{\top}X}{n}\right)+\operatorname{tr}\left(\beta I\right)=\operatorname{tr}\left(\frac{XX^{\top}}{n}\right)+\beta=\frac{1}{n}\sum_{j}^{n}\|\bm{x}_{j}\|^{2}+\beta, (102)

so we have

κ(𝜽^,S)=𝜽^21njn𝒙j2+β𝜽^2.\kappa(\bm{\hat{\theta}},S)=\|\bm{\hat{\theta}}\|^{2}\frac{1}{n}\sum_{j}^{n}\|\bm{x}_{j}\|^{2}+\beta\|\bm{\hat{\theta}}\|^{2}. (103)

The "variance" of the ridge estimator Ali et al. (2019) can be defined by

Var(𝜽^)tr(𝜽^𝜽^)=tr((XTX+nβI)1XT𝒚𝒚X(XTX+nβI)1).\operatorname{Var}(\bm{\hat{\theta}})\triangleq\operatorname{tr}\left(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right)=\operatorname{tr}\left(\left(X^{T}X+n\beta I\right)^{-1}X^{T}\bm{y}\bm{y}^{\top}X\left(X^{T}X+n\beta I\right)^{-1}\right). (104)

Note that 𝜽^𝜽^\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top} is a PSD matrix with rank(𝜽^𝜽^)=1\text{rank}(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top})=1, thus it has only one nonzero eigenvalue λ(𝜽^𝜽^)=𝜽^2>0\lambda(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top})=\|\bm{\hat{\theta}}\|^{2}>0.

Var(θ)=tr(𝜽^𝜽^)=𝜽^2=𝒪(β2)\operatorname{Var}(\theta)=\operatorname{tr}\left(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right)=\|\bm{\hat{\theta}}\|^{2}=\mathcal{O}(\beta^{-2}) (105)

By definition, the covariance matrix 𝔼[𝒚𝒚]\mathbb{E}[\bm{y}\bm{y}^{\top}] is a diagonal matrix with entries σ2\sigma^{2}. Taking the expectation, we have

tr(𝜽^𝜽^)\displaystyle\operatorname{tr}\left(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right) =σ2tr[(XX+nβI)1XX(XX+nβI)1]\displaystyle=\sigma^{2}\operatorname{tr}\left[\left(X^{\top}X+n\beta I\right)^{-1}X^{\top}X\left(X^{\top}X+n\beta I\right)^{-1}\right] (106)
=σ2ntr[XXn(XXn+βI)2]\displaystyle=\frac{\sigma^{2}}{n}\operatorname{tr}\left[\frac{X^{\top}X}{n}\left(\frac{X^{\top}X}{n}+\beta I\right)^{-2}\right]
=σ2ni=1dλi(XX/n)(λi(XX/n)+β)2\displaystyle=\frac{\sigma^{2}}{n}\sum_{i=1}^{d}\frac{\lambda_{i}(X^{\top}X/n)}{\left(\lambda_{i}(X^{\top}X/n)+\beta\right)^{2}}

where XXn\frac{X^{\top}X}{n} and (XTXn+βI)1\left(\frac{X^{T}X}{n}+\beta I\right)^{-1} are simultaneously diagonalizable and commute. Therefore, the greater β\beta is, the smaller tr(𝜽^𝜽^)\operatorname{tr}\left(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right) is.
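A small numerical check of this expected-variance identity can be sketched as follows, assuming pure-noise responses so that 𝔼[yyᵀ]=σ²I holds exactly; the problem sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
n, d, beta, sigma = 30, 60, 1.0, 0.5
X = rng.standard_normal((n, d))
M = np.linalg.inv(X.T @ X + n * beta * np.eye(d)) @ X.T

# Empirical average of tr(theta_hat theta_hat^T) over fresh noise draws y ~ N(0, sigma^2 I)
emp = np.mean([np.sum((M @ (sigma * rng.standard_normal(n))) ** 2) for _ in range(5000)])

# Closed form: (sigma^2 / n) * sum_i lambda_i / (lambda_i + beta)^2 with lambda_i = eig(X^T X / n)
lam = np.linalg.eigvalsh(X.T @ X / n)
closed = sigma ** 2 / n * np.sum(lam / (lam + beta) ** 2)
print(emp, closed)   # the two numbers should agree closely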

Conclusions

From our above analysis, we have the following conditions.

  • Upper bound of robustness ϵ(S)sup𝒙jS𝜽^2𝒙j2+𝒪(d).\epsilon(S)\leq\sup_{\bm{x}_{j}\in S}\|\bm{\hat{\theta}}\|^{2}\|\bm{x}_{j}\|^{2}+\mathcal{O}(d).

  • Sharpness expression κ(𝜽^,S)=𝜽^2(1n𝒙j,yjS𝒙j2+β).\kappa(\bm{\hat{\theta}},S)=\|\bm{\hat{\theta}}\|^{2}\left(\frac{1}{n}\sum_{\bm{x}_{j},y_{j}\in S}\|\bm{x}_{j}\|^{2}+\beta\right).

  • Variance expression Var(θ)=𝜽^2.\text{Var}(\theta)=\|\bm{\hat{\theta}}\|^{2}.

1) First, let’s discuss how β\beta influences sharpness.

κ(𝜽^,S)\displaystyle\kappa(\bm{\hat{\theta}},S) =𝜽^2tr(H𝜽^((𝜽^,S)))\displaystyle=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}\left(H_{\bm{\hat{\theta}}}(\ell(\bm{\hat{\theta}},S))\right) (107)
=𝒚X(XX+nβI)2X𝒚(1n𝒙j,yjS𝒙j2+β)\displaystyle=\bm{y}^{\top}X\left(X^{\top}X+n\beta I\right)^{-2}X^{\top}\bm{y}\left(\frac{1}{n}\sum_{\bm{x}_{j},y_{j}\in S}\|\bm{x}_{j}\|^{2}+\beta\right)
=𝒪(β1).\displaystyle=\mathcal{O}(\beta^{-1}).

As indicated by the above equation, κ(𝜽^,S)=𝒪(β1)\kappa(\bm{\hat{\theta}},S)=\mathcal{O}(\beta^{-1}), i.e., the sharpness has an inverse relationship with β\beta.

2) Now we can obtain the relationship between robustness and sharpness by combining the first two points. Because the 𝒙j,yjS\bm{x}_{j},y_{j}\in S attaining the supremum is one of the training samples, there exists a constant c<nc<n and o~d=o(d2)\tilde{o}_{d}=o(d^{2}) such that

ϵ(S)\displaystyle\epsilon(S) cnjn(𝜽^2𝒙j2)+o~d\displaystyle\leq\frac{c}{n}\sum_{j}^{n}\left(\|\bm{\hat{\theta}}\|^{2}\|\bm{x}_{j}\|^{2}\right)+\tilde{o}_{d} (108)
cnjn(𝜽^2𝒙j2)+cβ𝜽^2+o~d\displaystyle\leq\frac{c}{n}\sum_{j}^{n}\left(\|\bm{\hat{\theta}}\|^{2}\|\bm{x}_{j}\|^{2}\right)+c\beta\|\bm{\hat{\theta}}\|^{2}+\tilde{o}_{d}
=cκ(𝜽^,S)+o~d.\displaystyle=c\kappa(\bm{\hat{\theta}},S)+\tilde{o}_{d}.

This relation is consistent with Eq. 7, where the robustness is upper bounded by nn^{\prime} times the sharpness κ(𝜽^,S)\kappa(\bm{\hat{\theta}},S). Besides, the relation is simpler here: the robustness depends only on the sharpness, without additional coefficients in front of κ(𝜽^,S)\kappa(\bm{\hat{\theta}},S).

3) Finally, the relation between β\beta and robustness will be

ϵ(S)\displaystyle\epsilon(S) cnjn(Var(θ)𝒙j2)+o~d=cσ2ni=1dλi(XX/n)(λi(XX/n)+β)2jn𝒙j2n+o~d\displaystyle\leq\frac{c}{n}\sum_{j}^{n}\left(\text{Var}(\theta)\|\bm{x}_{j}\|^{2}\right)+\tilde{o}_{d}=\frac{c\sigma^{2}}{n}\sum_{i=1}^{d}\frac{\lambda_{i}(X^{\top}X/n)}{\left(\lambda_{i}(X^{\top}X/n)+\beta\right)^{2}}\frac{\sum_{j}^{n}\|\bm{x}_{j}\|^{2}}{n}+\tilde{o}_{d} (109)
=cσ2ni=1dλi(XX/n)(λi(XX/n)+β)2jdλj(XX/n)n+o~d\displaystyle=\frac{c\sigma^{2}}{n}\sum_{i=1}^{d}\frac{\lambda_{i}(X^{\top}X/n)}{\left(\lambda_{i}(X^{\top}X/n)+\beta\right)^{2}}\frac{\sum_{j}^{d}\lambda_{j}(X^{\top}X/n)}{n}+\tilde{o}_{d}
=𝒪(β2).\displaystyle=\mathcal{O}(\beta^{-2}).

where ϵ(S)\epsilon(S) is of order 𝒪(β2)\mathcal{O}(\beta^{-2}) in the limit of large β\beta. The greater β\beta is, the less sensitive (more robust) the learned model becomes. In practice, however, being "over-robust" may hurt the model's performance (we show this empirically in Figure 7).
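To illustrate these three relations together, the following sketch sweeps β on synthetic data (an assumed toy setup, not the experiment of Appendix G) and prints the sharpness κ(θ̂,S) and the dominant robustness term sup_j ||θ̂||²||x_j||² derived above; both shrink as β grows.

import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 50, 80, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + sigma * rng.standard_normal(n)

for beta in [0.1, 0.5, 1.0, 2.0, 4.0]:
    theta_hat = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)
    w2 = np.sum(theta_hat ** 2)                              # ||theta_hat||^2 = Var(theta)
    kappa = w2 * (np.mean(np.sum(X ** 2, axis=1)) + beta)    # sharpness, Eq. (103)
    eps_bound = w2 * np.max(np.sum(X ** 2, axis=1))          # dominant term of the robustness bound
    print(f"beta={beta:3.1f}  kappa={kappa:9.3f}  robustness bound={eps_bound:9.3f}")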

E.2 2-layer Diagonal Linear Network Classification

Our main theorem assumes that the loss function satisfies the Polyak-Łojasiewicz (PŁ) condition. To extend our result to a more general case, we here study a 2-layer diagonal linear network classification problem whose loss is exponential and does not satisfy the PŁ condition.

Example E.2.

We consider a classification problem using a 2-layer diagonal linear network with the exponential loss. The robustness ϵ(S,A)\epsilon(S,A) has a relationship similar to Eq. 7. Given a training set SS, after t>Tϵt>T_{\epsilon} iterations, C2>0\exists C_{2}>0 such that ϵ(S,A)C2suptTϵκ(𝛉(t),S)\epsilon(S,A)\leq C_{2}\sup_{t\geq T_{\epsilon}}\kappa(\bm{\theta}(t),S).

Given a training set S=(X,y),X=[𝒙1,,𝒙n],Xd×nS=(X,y),X=[\bm{x}_{1},...,\bm{x}_{n}],X\in\mathbb{R}^{d\times n}, a depth-22 diagonal linear network with parameters 𝒖=[𝒖+,𝒖]2d\bm{u}=[\bm{u}_{+},\bm{u}_{-}]^{\top}\in\mathbb{R}^{2d} is specified by:

f(𝒖,𝒙)=𝒖+2𝒖2,𝒙f(\bm{u},\bm{x})=\langle\bm{u}_{+}^{2}-\bm{u}_{-}^{2},\bm{x}\rangle

We consider the exponential loss (t)=1ni=1nexp(𝒙i𝜽(t)yi)\mathcal{L}(t)=\frac{1}{n}\sum_{i=1}^{n}\text{exp}(-\bm{x}_{i}^{\top}\bm{\theta}(t)y_{i}) with yi{1,1}y_{i}\in\{-1,1\}. It has the same tail behavior and thus similar asymptotic properties as the logistic or cross-entropy loss. WLOG, we assume i[n]:yi=1\forall i\in[n]:y_{i}=1 such that 𝒙i=yi𝒙i\bm{x}_{i}=y_{i}\bm{x}_{i}. As suggested by Moroshko et al. (2020), we have

𝜽(t)=2α2sinh(4X0t𝐫(s)𝑑s)\bm{\theta}(t)=2\alpha^{2}\operatorname{sinh}\left(4X\int^{t}_{0}\mathbf{r}(s)ds\right)

where 𝐫(t)=1nexp(X𝜽(t))\mathbf{r}(t)=\frac{1}{n}\text{exp}(-X^{\top}\bm{\theta}(t)) and 𝐫(t)1=(t)\|\mathbf{r}(t)\|_{1}=\mathcal{L}(t). Note that

d𝜽(t)dt=4n𝜽2(t)+4α4𝟏Xexp(X𝜽(t))=A(t)X𝐫(t)\frac{d\bm{\theta}(t)}{dt}=\frac{4}{n}\sqrt{\bm{\theta}^{2}(t)+4\alpha^{4}\mathbf{1}}\circ X\exp\left(-X^{\top}\bm{\theta}(t)\right)=A(t)X\mathbf{r}(t) (110)

where A(t)=diag(4𝜽2(t)+4α4𝟏)A(t)=\operatorname{diag}\left(4\sqrt{\bm{\theta}^{2}(t)+4\alpha^{4}\mathbf{1}}\right).

Part i, sharpness. First derivative and Hessian can be obtained by

𝜽(t)\displaystyle\nabla_{\bm{\theta}}\mathcal{L}(t) =(X𝐫(t))\displaystyle=-(X\mathbf{r}(t))^{\top} (111)
H𝜽((t))\displaystyle H_{\bm{\theta}}(\mathcal{L}(t)) =(X𝐫(t))𝜽(t)=i=1n𝐫i(t)XiXi.\displaystyle=-\frac{\partial(X\mathbf{r}(t))^{\top}}{\partial\bm{\theta}(t)}=\sum_{i=1}^{n}\mathbf{r}_{i}(t)X_{i}X_{i}^{\top}.

With the definition of sharpness, we can get

κ(𝜽(t),S)=𝜽(t)2tr(H𝜽[(t)])=𝜽(t)2tr(i=1n𝐫i(t)XiXi)=𝜽(t)2tr(i=1n𝐫i(t)XiXi)=𝜽(t)2(i=1n𝐫i(t)𝒙i2).\begin{split}\kappa(\bm{\theta}(t),S)&=\|\bm{\theta}(t)\|^{2}\operatorname{tr}(H_{\bm{\theta}}[\mathcal{L}(t)])\\ &=\|\bm{\theta}(t)\|^{2}\operatorname{tr}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)X_{i}X_{i}^{\top}\right)\\ &=\|\bm{\theta}(t)\|^{2}\operatorname{tr}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)X_{i}^{\top}X_{i}\right)\\ &=\|\bm{\theta}(t)\|^{2}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{i}\|^{2}\right).\end{split} (112)
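The quantities above are easy to track numerically. The sketch below (an assumed toy instance with all labels y_i = 1, using a simple Euler discretization of the gradient flow in Eq. (110)) computes the loss L(t) and the sharpness κ(θ(t),S) of Eq. (112).

import numpy as np

rng = np.random.default_rng(3)
d, n, alpha, dt = 30, 20, 0.1, 1e-2
X = rng.standard_normal((d, n))          # columns are the x_i; labels assumed y_i = 1
theta = np.zeros(d)

for _ in range(5000):
    r = np.exp(-X.T @ theta) / n                       # r(t) = (1/n) exp(-X^T theta(t))
    A = 4.0 * np.sqrt(theta ** 2 + 4 * alpha ** 4)     # diagonal of A(t)
    theta = theta + dt * A * (X @ r)                   # Euler step of d(theta)/dt = A(t) X r(t)

r = np.exp(-X.T @ theta) / n
kappa = np.sum(theta ** 2) * np.sum(r * np.sum(X ** 2, axis=0))   # sharpness, Eq. (112)
print("loss L(t):", r.sum(), "sharpness kappa:", kappa)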

Part ii, robustness. We now apply the same discussion of the robustness constant as in the previous case. Following the previous definition of ϵ(S,A)\epsilon(S,A), after some iteration number TϵT_{\epsilon} it is defined by

ϵ(S,A):=supj[n],tTϵ|n𝐫j(t)𝐫(t)|\epsilon(S,A):=\sup_{j\in[n],t\geq T_{\epsilon}}\left|n\mathbf{r}_{j}(t)-\mathbf{r}^{*}(t)\right| (113)

where n𝐫j(t)n\mathbf{r}_{j}(t) is the (denormalized) point-wise loss of 𝒙j\bm{x}_{j} and 𝐫(t)\mathbf{r}^{*}(t) denotes the minimum loss, attained at the point 𝒙\bm{x}^{*}. Then there exists n<nn^{\prime}<n such that

ϵ(S,A)=suptTϵn(𝐫(t)1𝐫(t))suptTϵn𝐫(t)1.\epsilon(S,A)=\sup_{t\geq T_{\epsilon}}n^{\prime}\left(\|\mathbf{r}(t)\|_{1}-\mathbf{r}^{*}(t)\right)\leq\sup_{t\geq T_{\epsilon}}n^{\prime}\|\mathbf{r}(t)\|_{1}. (114)

Let 𝒙min=mini[n]𝒙i\|\bm{x}_{\min}\|=\min_{i\in[n]}\|\bm{x}_{i}\|; the above equation becomes

ϵ(S,A)\displaystyle\epsilon(S,A) 1𝒙min2suptTϵ(n𝐫(t)𝒙min2)\displaystyle\leq\frac{1}{\|\bm{x}_{\min}\|^{2}}\sup_{t\geq T_{\epsilon}}(n^{\prime}\|\mathbf{r}(t)\|\|\bm{x}_{\min}\|^{2}) (115)
=n𝒙min2suptTϵ(i=1n𝐫i(t)𝒙min2)\displaystyle=\frac{n^{\prime}}{\|\bm{x}_{\min}\|^{2}}\sup_{t\geq T_{\epsilon}}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{\min}\|^{2}\right)
n𝒙min2suptTϵ(i=1n𝐫i(t)𝒙i2).\displaystyle\leq\frac{n^{\prime}}{\|\bm{x}_{\min}\|^{2}}\sup_{t\geq T_{\epsilon}}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{i}\|^{2}\right).

Part iii, connection. Comparing the last parts of (112) and (115), we find that robustness and sharpness depend on the same term. Further, for any step tt, 𝜽(t)2\|\bm{\theta}(t)\|^{2} has an upper bound.

From Lemma 11 of Moroshko et al. (2020), we have the following inequality,

𝜽(t)2α2sinh(x¯2γ22α2γ~(t))\|\bm{\theta}(t)\|_{\infty}\leq 2\alpha^{2}\sinh\left(\frac{\bar{x}}{2\gamma_{2}^{2}\alpha^{2}}\tilde{\gamma}(t)\right) (116)

where (t)=exp(γ~(t))\mathcal{L}(t)=\exp(-\tilde{\gamma}(t)). Then, we can bound the 𝜽(t)2\|\bm{\theta}(t)\|^{2} via:

C1𝜽(t)2d𝜽(t)2=4dα4sinh2(x¯2γ22α2γ~(t))<C_{1}\leq\|\bm{\theta}(t)\|^{2}\leq d\|\bm{\theta}(t)\|^{2}_{\infty}=4d\alpha^{4}\sinh^{2}\left(\frac{\bar{x}}{2\gamma_{2}^{2}\alpha^{2}}\tilde{\gamma}(t)\right)<\infty (117)

Note that C1>0C_{1}>0. In summary, we have

ϵ(S,A)nC1𝒙min2suptTϵκ(𝜽(t),S)C2suptTϵκ(𝜽(t),S)\epsilon(S,A)\leq\frac{n^{\prime}}{C_{1}\|\bm{x}_{\min}\|^{2}}\sup_{t\geq T_{\epsilon}}\kappa(\bm{\theta}(t),S)\leq C_{2}\sup_{t\geq T_{\epsilon}}\kappa(\bm{\theta}(t),S) (118)

where C2=nC1𝒙min2>0C_{2}=\frac{n^{\prime}}{C_{1}\|\bm{x}_{\min}\|^{2}}>0 is a constant.

Part iv, sanity check: asymptotics. Asymptotically, as tt\to\infty, (t)0\mathcal{L}(t)\to 0 while 𝜽(t)2\|\bm{\theta}(t)\|_{2} explodes. So if the sharpness κ(𝜽(t),X)\kappa(\bm{\theta}(t),X)\to\infty, it would fail to imply robustness.

κ(𝜽(t),S)\displaystyle\kappa(\bm{\theta}(t),S) =𝜽(t)2(i=1n𝐫i(t)𝒙i2)\displaystyle=\|\bm{\theta}(t)\|^{2}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{i}\|^{2}\right) (119)
𝜽(t)2(i=1n𝐫i(t)𝒙max2)\displaystyle\leq\|\bm{\theta}(t)\|^{2}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{\max}\|^{2}\right)
=𝜽(t)2(𝜽(t))𝒙max2\displaystyle=\|\bm{\theta}(t)\|^{2}\mathcal{L}(\bm{\theta}(t))\|\bm{x}_{\max}\|^{2}
=κmax(𝜽(t),S)\displaystyle=\kappa_{\max}(\bm{\theta}(t),S)

Let κmax(𝜽(t),X)\kappa_{\max}(\bm{\theta}(t),X) be an upper bound of κ(𝜽(t),X)\kappa(\bm{\theta}(t),X) at any time step tt. The dynamics are

dκmax(𝜽(t),S)dt=\displaystyle\frac{d\kappa_{\max}(\bm{\theta}(t),S)}{dt}= 𝜽κmax(𝜽(t),S)𝜽˙(t)\displaystyle\nabla_{\bm{\theta}}\kappa_{\max}(\bm{\theta}(t),S)\dot{\bm{\theta}}(t) (120)
=\displaystyle= 𝒙max2tr(XX)(𝜽(t)2˙(t)+2(t)𝜽(t)𝜽˙(t))\displaystyle\|\bm{x}_{\max}\|^{2}\operatorname{tr}(XX^{\top})\left(\|\bm{\theta}(t)\|^{2}\dot{\mathcal{L}}(t)+2\mathcal{L}(t)\bm{\theta}(t)^{\top}\dot{\bm{\theta}}(t)\right)
=\displaystyle= 𝒙max2tr(XX)(𝜽(t)2(X𝒓(t))A(t)X𝒓(t)+2(t)𝜽(t)A(t)X𝒓(t))\displaystyle\|\bm{x}_{\max}\|^{2}\operatorname{tr}(XX^{\top})\left(-\|\bm{\theta}(t)\|^{2}(X\bm{r}(t))^{\top}A(t)X\bm{r}(t)+2\mathcal{L}(t)\bm{\theta}(t)^{\top}A(t)X\bm{r}(t)\right)
=\displaystyle= 𝒙max2tr(XX)(2(t)𝜽(t)𝜽(t)2(X𝒓(t)))A(t)X𝒓(t)\displaystyle\|\bm{x}_{\max}\|^{2}\operatorname{tr}(XX^{\top})\left(2\mathcal{L}(t)\bm{\theta}(t)^{\top}-\|\bm{\theta}(t)\|^{2}(X\bm{r}(t))^{\top}\right)A(t)X\bm{r}(t)

As tt\to\infty, we have (t)0\mathcal{L}(t)\to 0; thus it is easy to see that

limtdκmax(𝜽(t),S)dt=𝒙max2𝜽(t)2(X𝒓(t))A(t)X𝒓(t)=𝒙max2𝜽(t)2˙(t)=0\lim_{t\to\infty}\frac{d\kappa_{\max}(\bm{\theta}(t),S)}{dt}=-\|\bm{x}_{\max}\|^{2}\|\bm{\theta}(t)\|^{2}(X\bm{r}(t))^{\top}A(t)X\bm{r}(t)=-\|\bm{x}_{\max}\|^{2}\|\bm{\theta}(t)\|^{2}\dot{\mathcal{L}}(t)=0 (121)

As we can see, the derivative of κmax(𝜽(t),X)\kappa_{\max}(\bm{\theta}(t),X) decreases to 0 as tt\to\infty, which means the sharpness κ(𝜽(t),X)\kappa(\bm{\theta}(t),X) is upper bounded by a converged value κ(𝜽(t),X)<C<\kappa(\bm{\theta}(t\to\infty),X)<C_{\infty}<\infty. So the sharpness will not explode.

Appendix F Comparison to feature robustness

F.1 Their results

Definition F.1 (Feature robustness, Petzka et al. (2021)).

Let :𝒴×𝒴+\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}_{+} denote a loss function, ϵ\epsilon and δ\delta two positive (small) real numbers, S𝒳×𝒴S\subseteq\mathcal{X}\times\mathcal{Y} a finite sample set, and Am×mA\in\mathbb{R}^{m\times m} a matrix. A model f(x)=(ψϕ)(x)f(x)=(\psi\circ\phi)(x) with ϕ(𝒳)m\phi(\mathcal{X})\subseteq\mathbb{R}^{m} is called ((𝜹,S,A),ϵ)((\bm{\delta},S,A),\epsilon)-feature robust, if |ϕ(f,S,αA)|ϵ\left|\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,\alpha A)\right|\leq\epsilon for all 0αδ0\leq\alpha\leq\delta. More generally, for a probability distribution 𝒜\mathcal{A} on perturbation matrices in m\mathbb{R}^{m}, we define

ϕ(f,S,𝒜)=𝔼A𝒜[ϕ(f,S,A)],\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,\mathcal{A})=\mathbb{E}_{A\sim\mathcal{A}}\left[\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,A)\right],

and call the model ((𝜹,𝑺,𝒜),ϵ)((\bm{\delta},\bm{S},\mathcal{A}),\bm{\epsilon})-feature robust on average over 𝒜\mathcal{A}, if |ϕ(f,S,α𝒜)|ϵ\left|\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,\alpha\mathcal{A})\right|\leq\epsilon for all 0αδ0\leq\alpha\leq\delta.

Theorem F.2 (Theorem 5 of Petzka et al. (2021)).

Consider a model f(x,𝐰)=g(𝐰ϕ(x))f(x,\mathbf{w})=g(\mathbf{w}\phi(x)) as above, a loss function \ell and a sample set SS, and let Omm×mO_{m}\subset\mathbb{R}^{m\times m} denote the set of orthogonal matrices. Let δ\delta be a positive (small) real number and 𝐰=ωd×m\mathbf{w}=\omega\in\mathbb{R}^{d\times m} denote parameters at a local minimum of the empirical risk on a sample set SS. If the labels satisfy that y(ϕδA(xi))=y(ϕ(xi))=yiy\left(\phi_{\delta A}\left(x_{i}\right)\right)=y\left(\phi\left(x_{i}\right)\right)=y_{i} for all (xi,yi)S\left(x_{i},y_{i}\right)\in S and all A1\|A\|\leq 1, then f(x,ω)f(x,\omega) is ((δ,S,Om),ϵ)\left(\left(\delta,S,O_{m}\right),\epsilon\right)-feature robust on average over OmO_{m} for ϵ=δ22mκϕ(ω)+𝒪(δ3)\epsilon=\frac{\delta^{2}}{2m}\kappa^{\phi}(\omega)+\mathcal{O}\left(\delta^{3}\right).

Proof sketch

  1.

    With the assumption that y[ϕδA(xi)]=yiy[\phi_{\delta A}(x_{i})]=y_{i} for all (xi,yi)S(x_{i},y_{i})\in S and all A1\|A\|\leq 1, the feature perturbation around ww is

    ϕ(f,S,αA)+emp(𝒘,S)=emp(𝒘+α𝒘A,S)\mathcal{E}^{\phi}_{\mathcal{F}}(f,S,\alpha A)+\mathcal{E}_{emp}(\bm{w},S)=\mathcal{E}_{emp}(\bm{w}+\alpha\bm{w}A,S)
  2.

    Since the Taylor expansion at the local minimum 𝒘=ω\bm{w}=\omega retains only the second-order term, we have

    emp(𝒘+α𝒘A,S)=emp(ω,S)+α22s,t=1d(ωsA)Hs,t(ω,ϕ(S))(ωtA)\mathcal{E}_{emp}(\bm{w}+\alpha\bm{w}A,S)=\mathcal{E}_{emp}(\omega,S)+\frac{\alpha^{2}}{2}\sum^{d}_{s,t=1}(\omega_{s}A)H_{s,t}(\omega,\phi(S))(\omega_{t}A)^{\top}
  3.

    With basic algebra, one can easily get

    𝔼AOm[ϕ(f,S,αA)]\displaystyle\mathbb{E}_{A\sim O_{m}}\left[\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,\alpha A)\right] δ22s,t=1d𝔼AOm[(ωsA)Hs,t(ωtA)T]+𝒪(δ3)\displaystyle\leq\frac{\delta^{2}}{2}\sum_{s,t=1}^{d}\mathbb{E}_{A\sim O_{m}}\left[\left(\omega_{s}A\right)H_{s,t}\left(\omega_{t}A\right)^{T}\right]+\mathcal{O}\left(\delta^{3}\right)
    =δ22ms,t=1dωs,ωtTr(Hs,t)+𝒪(δ3)\displaystyle=\frac{\delta^{2}}{2m}\sum_{s,t=1}^{d}\left\langle\omega_{s},\omega_{t}\right\rangle\cdot\operatorname{Tr}\left(H_{s,t}\right)+\mathcal{O}\left(\delta^{3}\right)
    =δ22mκϕ(ω)+𝒪(δ3)\displaystyle=\frac{\delta^{2}}{2m}\kappa^{\phi}(\omega)+\mathcal{O}\left(\delta^{3}\right)

F.2 Comparison to our results

We first summarize the commonalities and differences between their results and ours:

  • Both works consider the robustness of the model, but they define feature robustness while we study the loss robustness Xu & Mannor (2012), which has been studied for many years.

  • They consider a non-standard generalization gap by decomposing it into representativeness and the expected deviation of the loss around the sample points, whereas we strive to integrate sharpness into general generalization guarantees.

For point 1, their feature robustness trivially depends on the sharpness, because the sharpness (the curvature information) is defined precisely by the perturbation region around the point of interest. From step 2 of the above proof sketch, the Hessian w.r.t. ω\omega is exactly the second-order expansion of the perturbed empirical risk, so this definition provides less additional information about the optimization landscape. In contrast, we consider the loss robustness for two reasons: 1) it is easy to obtain in practice without first constructing the orthogonal matrices OmO_{m}; 2) it highlights the dependence on the data manifold.

For point 2, we try to integrate this optimization property (sharpness) into standard generalization frameworks in order to obtain a clearer interplay. Unlike feature robustness, robustness defined via the loss function is easier to analyze with standard generalization tools, because it is hard and vague to define a "feature" in general. Besides, our result also benefits data-dependent bounds Xu & Mannor (2012); Kawaguchi et al. (2022).

Appendix G Experiments and details

G.1 Ridge regression with distributional shifting

As stated before, we follow Duchi & Namkoong (2021) to investigate ridge regression under distributional shift. We randomly generate θ0d\theta^{*}_{0}\in\mathbb{R}^{d} on the sphere, and generate data from

Xiid𝒩(0,1),𝒚=Xθ0X\overset{\text{iid}}{\sim}\mathcal{N}(0,1),~{}~{}~{}\bm{y}=X\theta^{*}_{0} (122)

To simulate distributional shift, we randomly generate a unit vector θ0\theta_{0}^{\perp} perpendicular to θ0\theta^{*}_{0}. Let θ0,θ0\theta_{0}^{\perp},\theta^{*}_{0} be the basis vectors; the shifted ground truth is then computed from this basis by θα=θ0cos(α)+θ0sin(α)\theta^{*}_{\alpha}=\theta^{*}_{0}\cdot\cos(\alpha)+\theta^{\perp}_{0}\cdot\sin(\alpha). For the source domain, we use θ0\theta^{*}_{0} as our training distribution. We randomly sample 50 data points and train a linear model with gradient descent for 3000 iterations. Starting from α=0\alpha=0, we gradually increase α\alpha to generate different distributional shifts. A sketch of this data-generating procedure is given below.
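The following sketch mirrors the description above; the dimension, sample size, and example shift angle are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 50

theta0 = rng.standard_normal(d)
theta0 /= np.linalg.norm(theta0)                     # theta*_0 on the unit sphere

v = rng.standard_normal(d)                           # random direction, made perpendicular to theta*_0
v -= (v @ theta0) * theta0
theta_perp = v / np.linalg.norm(v)                   # theta_0^perp

def shifted_target(alpha_deg):
    a = np.deg2rad(alpha_deg)
    return theta0 * np.cos(a) + theta_perp * np.sin(a)   # theta*_alpha

X = rng.standard_normal((n, d))                      # X ~ N(0, 1) i.i.d.
y_source = X @ shifted_target(0)                     # source domain (alpha = 0)
y_target = X @ shifted_target(45)                    # an example shifted domain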

From the left panel of Figure 7, we can see that a larger penalty leads to a smaller increase in loss as β\beta ranges from 0 to 22. Since we consider a cyclic shift of the ground truth, 180180^{\circ} corresponds to the maximum shift and thus the highest loss increase. According to our analysis of ridge regression, a larger β\beta means a flatter minimum and more robustness, resulting in better OOD generalization. This experiment verifies our theoretical analysis. However, too large an 2\ell_{2} regularization coefficient hurts performance: as shown in Figure 7 (right panel), underfitting (indicated by brown colors) appears when β>4\beta>4.

Figure 7: OOD test losses are increasing along distributional shifting. The X-axis is the shifting angle α\alpha and the Y-axis is the test loss of the model which is trained on distribution α=0\alpha=0. Left: The larger regularization β\beta causes a lower increase in test loss (Darker purple lines have lower test losses). Right: Too large 2\ell_{2} penalty coefficient β\beta brings poor fitting and thus fails to generalize on both ID and OOD datasets (orange lines). Best viewed in colors.

G.2 Additional results on sharpness

Here we show more experimental results on RotatedMNIST, which is a rotation of the MNIST handwritten digit dataset with angles ranging over 0,15,30,45,60,750^{\circ},15^{\circ},30^{\circ},45^{\circ},60^{\circ},75^{\circ} (6 domains/distributions). For each environment, we select one domain as the test domain and train the model on all other domains. The OOD generalization test accuracy is then reported on the test domain. In our experiments, we run each algorithm with 1212 different seeds, thus obtaining 1212 trained models at different minima. We then compute the sharpness (see Algorithm 1) for all these models and plot them in Figure 8. For algorithms, we choose Empirical Risk Minimization (ERM) and Stochastic Weight Averaging (SWA) as the plain OOD generalization algorithms, shown in the first column. For robust optimization algorithms, we choose Group Distributional Robust Optimization Sagawa et al. (2019). We also choose CORAL Sun & Saenko (2016) as a multi-source domain generalization algorithm. Across these different types of out-of-domain generalization algorithms, we can conclude that sharpness affects the test accuracy on the OOD dataset.

Experimental configurations are listed in Table 1. For each model, we run 50005000 iterations and choose the last checkpoint as the test model. To ease the computational burden, we use a 33-layer MLP to compute the sharpness.

Algorithms | Optimizer | lr | WD | batch size | MLP size | eta | MMD γ\gamma
ERM(SWA) | Adam | 0.001 | 0 | 64 | 265*3 | - | -
DRO | Adam | 0.001 | 0 | 64 | 265*3 | 0.01 | -
CORAL | Adam | 0.001 | 0 | 64 | 265*3 | - | 1
Table 1: Hyperparameters we use for the different DG algorithms in the experiments.

From Figure 8, we can see that the sharpness has an inverse relationship with the out-of-domain generalization performance for every model in each individual environment. To make this clear, we plot similar tasks from environments 11 to 44 together in the last row, where a clearer tendency can be seen across all algorithms. This verifies our Theorem 3.1. Note that different algorithms have different feature scales; one may need to normalize the results of different algorithms when plotting them together.

Algorithm 1 Pseudocode of model sharpness computation
Require: feature layer f(x)f(x), training loss \ell
Ensure: sharpness SS
  Get Jacobian matrix w.r.t. feature layer 𝐉=(f(x),y)\mathbf{J}=\nabla\ell(f(x),y)
  for each gradient vector 𝐉i\mathbf{J}_{i} in 𝐉\mathbf{J}^{\top} do
     Compute Hessian w.r.t. element i,ji,j of f(x)f(x) by 𝐉ifl(x)\frac{\partial\mathbf{J}_{i}}{\partial f_{l}(x)}
  end for
  We store Hessian in the variable 𝐇\mathbf{H}
  Initialize sharpness S=0S=0
  for ii in feature layer.shape[0] do
     for jj in feature layer.shape[0] do
        Retrieve the hessian value of i,ji,j element via h𝐇[:,j,i,:]h\leftarrow\mathbf{H}[:,j,i,:]
        Sharpness sijTrace(h)fij(x)2s_{ij}\leftarrow Trace(h)*f_{ij}(x)^{2}
        S=S+sijS=S+s_{ij}
     end for
  end for
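For reference, a minimal PyTorch-style sketch of one way to realize Algorithm 1 is given below. It assumes the model is split into a feature extractor and a classifier head and uses a cross-entropy training loss; these names and choices are illustrative, not the exact experimental code, and the curvature is accumulated through the diagonal of the Hessian taken w.r.t. the feature activations.

import torch
import torch.nn.functional as F

def feature_sharpness(feature_extractor, head, x, y):
    feats = feature_extractor(x).detach()          # feature layer output f(x)
    flat = feats.reshape(-1)

    def loss_of_features(v):
        # training loss as a function of the feature activations only
        return F.cross_entropy(head(v.view_as(feats)), y)

    # Hessian of the loss w.r.t. the flattened feature activations
    H = torch.autograd.functional.hessian(loss_of_features, flat)

    # Accumulate curvature * f(x)^2, mirroring s_ij = Trace(h) * f_ij(x)^2 in Algorithm 1
    return torch.sum(torch.diagonal(H) * flat ** 2).item()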
Figure 8: Domain generalization test accuracy on RotatedMNIST. From top to the bottom: environment 0, 1, 2, 3, 4, 5 with angles: [0,15,30,45,60,75][0^{\circ},15^{\circ},30^{\circ},45^{\circ},60^{\circ},75^{\circ}] and a plot together. Each column shows the 1212 runs of each algorithm.

G.3 Compare our robust bound to other OOD generalization bounds

G.3.1 Computation of our bounds

First, we follow Kawaguchi et al. (2022) to compute KK via the inverse image of an ϵ\epsilon-covering in a randomly projected space. The main idea is to partition the input space through a projected space with a transformation matrix A~\tilde{A}. The specific steps are: (1) To generate a random matrix A~\tilde{A}, we i.i.d. sample each entry from the uniform distribution 𝒰(0,1)\mathcal{U}(0,1). (2) Each row of the random matrix 𝒜3×d\mathcal{A}\in\mathbb{R}^{3\times d} is then normalized so that Ax[0,1]3Ax\in[0,1]^{3}, i.e., Aij=A~ij/j=1dA~ijA_{ij}=\tilde{A}_{ij}/\sum_{j=1}^{d}\tilde{A}_{ij}. (3) After generating the random matrix AA, we use the ϵ\epsilon-covering of the space of u=Axu=Ax to define the pre-partition {C~i}i=1K\{\tilde{C}_{i}\}_{i=1}^{K}. A sketch of this procedure is given below.
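The sketch below assumes the inputs are flattened and scaled to [0,1]^d (e.g., pixel intensities), which is why the row normalization keeps the projection inside [0,1]^3; the dimension and covering radius are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, eps = 784, 0.1

A_tilde = rng.uniform(0.0, 1.0, size=(3, d))
A = A_tilde / A_tilde.sum(axis=1, keepdims=True)     # rows sum to 1, so A @ x stays in [0, 1]^3

def cell_index(x):
    """Index of the epsilon-covering cell that contains the projection A @ x."""
    u = A @ x
    return tuple(np.minimum((u / eps).astype(int), int(1 / eps) - 1))

# Example: count the occupied pre-partition cells for a batch of inputs in [0, 1]^d
X = rng.uniform(0.0, 1.0, size=(100, d))
K = len({cell_index(x) for x in X})                  # at most (1/eps)^3 cells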

G.3.2 Computation of PAC-Bayes bound

We follow the definition to compute expected disρ\textit{dis}_{\rho} in Germain et al. (2013) where

Definition G.1.

Let \mathcal{H} be a hypothesis class. For any marginal distributions DSD_{S} and DTD_{T} over XX, any distribution ρ\rho on \mathcal{H}, the domain disagreement disρ(DS,DT)\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right) between DSD_{S} and DTD_{T} is defined by,

disρ(DS,DT)= def |𝐄h,hρ2[RDT(h,h)RDS(h,h)]|.\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right)\stackrel{{\scriptstyle\text{ def }}}{{=}}\left|\underset{h,h^{\prime}\sim\rho^{2}}{\mathbf{E}}\left[R_{D_{T}}\left(h,h^{\prime}\right)-R_{D_{S}}\left(h,h^{\prime}\right)\right]\right|.

Since the disρ(DS,DT)\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right) is defined as the expected distance, we can compute its empirical version according to their theoretical upper bound as follows.

Proposition G.2 (Germain et al. Germain et al. (2013) Theorem 3).

For any distributions DSD_{S} and DTD_{T} over XX, any set of hypothesis \mathcal{H}, any prior distribution π\pi over \mathcal{H}, any δ(0,1]\delta\in(0,1], and any real number α>0\alpha>0, with a probability at least 1δ1-\delta over the choice of S×TS\times T\sim (DS×DT)m\left(D_{S}\times D_{T}\right)^{m}, for every ρ\rho on \mathcal{H}, we have

disρ(DS,DT)2α[disρ(S,T)+2KL(ρπ)ln2δm×α+1]11e2α\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right)\leq\frac{2\alpha\left[\operatorname{dis}_{\rho}(S,T)+\frac{2\mathrm{KL}(\rho\|\pi)\ln\frac{2}{\delta}}{m\times\alpha}+1\right]-1}{1-e^{-2\alpha}}

With disρ(S,T)\operatorname{dis}_{\rho}(S,T), we can then compute the final generalization bound by the following inequality

ρ on ,RPT(Gρ)RPT(GρT)RPS(Gρ)\displaystyle\forall\rho\text{ on }\mathcal{H},R_{P_{T}}\left(G_{\rho}\right)-R_{P_{T}}\left(G_{\rho_{T}^{*}}\right)\leq R_{P_{S}}\left(G_{\rho}\right)
+disρ(DS,DT)+RDT(Gρ,GρT)+RDS(Gρ,GρT)\displaystyle\quad+\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right)+R_{D_{T}}\left(G_{\rho},G_{\rho_{T}^{*}}\right)+R_{D_{S}}\left(G_{\rho},G_{\rho_{T}^{*}}\right)

with ρT=argminρRPT(Gρ)\rho_{T}^{*}=\operatorname{argmin}_{\rho}R_{P_{T}}\left(G_{\rho}\right) is the best target posterior, and RD(Gρ,GρT)=EhρEhρTRD(h,h)R_{D}\left(G_{\rho},G_{\rho_{T}^{*}}\right)=E_{h\sim\rho}E_{h^{\prime}\sim\rho_{T}^{*}}R_{D}\left(h,h^{\prime}\right).

Note that we ignore the expected errors over the best hypothesis by assuming the RD(Gρ,GρT)=0R_{D}\left(G_{\rho},G_{\rho_{T}^{*}}\right)=0. We apply the same operation in \mathcal{E}^{*} of Proposition C.6 as well.
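For completeness, a small sketch of how the bound of Proposition G.2 is evaluated from its empirical ingredients is given below; the numerical inputs are placeholders rather than the values used in our experiments.

import numpy as np

def dis_rho_upper_bound(dis_emp, kl, m, delta=0.05, alpha=1.0):
    """Upper bound on dis_rho(D_S, D_T) from Proposition G.2, given the empirical
    disagreement dis_rho(S, T), the KL term KL(rho || pi), and the sample size m."""
    inner = dis_emp + 2.0 * kl * np.log(2.0 / delta) / (m * alpha) + 1.0
    return (2.0 * alpha * inner - 1.0) / (1.0 - np.exp(-2.0 * alpha))

print(dis_rho_upper_bound(dis_emp=0.12, kl=3.5, m=5000))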

G.3.3 Comparisons

In this section, we add additional experiments comparing our bound to other baselines, i.e., PAC-Bayes bounds Germain et al. (2013). As shown in the first row of Figure LABEL:fig_app:spurious_features, our robust framework has a smaller distribution distance in the bound compared to the two baselines as the model size increases. In the second row, we have similar results for the final generalization bounds. From the third and fourth rows, we can see that our bound is tighter than the baselines under distributional shifts.