
Towards Robust Out-of-Distribution
Generalization Bounds via Sharpness

Yingtian Zou, Kenji Kawaguchi, Yingnan Liu, Jiashuo Liu, Mong-Li Lee, Wynne Hsu
School of Computing, National University of Singapore
Institute of Data Science, National University of Singapore
Department of Computer Science & Technology, Tsinghua University, China
Abstract

Generalizing to out-of-distribution (OOD) data or unseen domains, termed OOD generalization, still lacks appropriate theoretical guarantees. Canonical OOD bounds focus on different distance measures between source and target domains but fail to consider the optimization properties of the learned model. As empirically shown in recent work, the sharpness of learned minima influences OOD generalization. To bridge this gap between optimization and OOD generalization, we study the effect of sharpness on how a model tolerates data changes under domain shift, which is usually captured by "robustness" in generalization. In this paper, we give a rigorous connection between sharpness and robustness, which yields better OOD guarantees for robust algorithms. It also provides theoretical backing for the claim that "flat minima lead to better OOD generalization". Overall, we propose a sharpness-based OOD generalization bound that takes robustness into account, resulting in a tighter bound than non-robust guarantees. Our findings are supported by experiments on a ridge regression model as well as on deep learning classification tasks.

1 Introduction

Machine learning systems are typically trained on data drawn from a given distribution and achieve good performance on new, unseen data that follows the same distribution as the training data. Out-of-Distribution (OOD) generalization requires machine learning systems trained on a source domain to generalize to unseen data or target domains whose distributions differ from the source domain. A myriad of algorithms (Sun & Saenko, 2016; Arjovsky et al., 2019; Sagawa et al., 2019; Koyama & Yamaguchi, 2020; Pezeshki et al., 2021; Ahuja et al., 2021) aim to learn components that remain invariant under distribution shift. Optimization-based methods such as (El Ghaoui & Lebret, 1997; Duchi & Namkoong, 2018; Liu et al., 2021; Rame et al., 2022) focus on maximizing robustness by optimizing the worst-case error over an uncertainty set of distributions. While these methods are sophisticated, they do not always perform better than Empirical Risk Minimization (ERM) when evaluated across different datasets (Gulrajani & Lopez-Paz, 2021; Wiles et al., 2022). This raises the question of how to understand the OOD generalization of algorithms and which criteria should be used to select models that are provably better (Gulrajani & Lopez-Paz, 2021). These questions highlight the need for more theoretical research in the field of OOD generalization (Ye et al., 2021).

To characterize the generalization gap between the source domain and the target domain, a canonical method (Blitzer et al., 2007) from domain adaptation theory decouples this gap into an In-Distribution (ID) generalization error and a hypothesis-specific Out-of-Distribution (OOD) distance. However, this distance is based on the notion of VC-dimension (Kifer et al., 2004), resulting in a loose bound due to the large size of the hypothesis class in modern overparameterized neural networks. Subsequent works improve the bound based on Rademacher Complexity (Du et al., 2017), whereas Germain et al. (2016) improve the bound based on PAC-Bayes. Unlike the present paper, these works did not consider algorithmic robustness, which has a natural interpretation and advantages under distribution shifts. In this work, we use algorithmic robustness to derive an OOD generalization bound. The key idea is to partition the input space into $K$ non-overlapping subspaces such that the difference in the model's error between any pair of points in each subspace is bounded by some constant $\epsilon$. Within each subspace, any distributional shift is considered subtle for the robust model and thus has less impact on OOD generalization. Figure 1 illustrates this with two distributions where the target domain has a distributional shift from the source domain. Compared to existing non-robust OOD generalization bounds (Zhao et al., 2018), our new generalization error does not depend on the hypothesis class size, which is more reliable in the overparameterized regime. Our goal is to measure the generalizability of a model by considering how robust it is to this shift, achieving a tighter bound than existing works.

Figure 1: An example of a target distribution (red) directly translated from a source distribution (blue). The 1D density reflects the marginal distribution. Unlike existing works (left), we divide the distributions into disjoint partitions, as a small change in distribution is negligible for a robust model (right). The sharpness of the model determines the tolerance to change and thus affects the partitions. If two sub-distributions $S,T$ have small shifts such that they fall into the same partition (red grid), their distance measure $d^{\prime}(S,T)$, which accounts for robustness, will be zero.

Although robustness captures the tolerance to distributional shift, the robustness constant $\epsilon$ is intractable to compute due to the inaccessibility of the target distribution. The robustness definition in Xu & Mannor (2012) indicates that the loss landscape induced by the model's parameters is closely tied to its robustness. To gain a deeper understanding of robustness, we further study the learned model from an optimization perspective. As shown in (Lyu et al., 2022; Petzka et al., 2021), when the loss landscape is "flat", generalization is good, which is also observed in OOD settings (Izmailov et al., 2018; Cha et al., 2021). However, the relationship between robustness and this geometric property of the loss landscape, termed sharpness, remains an open question. In this paper, we establish a provable dependence between robustness and sharpness for ReLU random neural network classes. This allows us to replace the robustness constant $\epsilon$ with the sharpness of the learned model, which can be computed from the training dataset alone, thereby addressing the problem mentioned above. Our result on the interplay between robustness and sharpness can be applied to both ID and OOD generalization bounds. We also show an example that generalizes our result beyond our assumptions and validate it empirically.

Our main contributions can be summarized as follows:

  • We propose a new framework for Out-of-Distribution/Out-of-Domain generalization bounds. In this framework, we use robustness to capture the tolerance to distribution shift, which generally leads to tighter upper bounds.

  • We reveal the underlying connection between the robustness and the sharpness of the loss landscape and use this connection to enrich our robust OOD bounds for one-hidden-layer ReLU NNs. This is the first optimization-based bound for Out-of-Distribution/Domain generalization.

  • We study two cases, ridge regression and classification, which support and generalize our main theorem. All experimental results corroborate our findings.

2 Preliminary

Notations

We use $[n]$ to denote the integer set $\{i\}_{i=1}^{n}$. We use $\|\cdot\|$ to denote the $\ell_{2}$-norm (Euclidean norm). In vector form, $\bm{w}_{i}$ denotes the $i$-th instance while $w_{j}$ is the $j$-th element of the vector $\bm{w}$, and we use $|\bm{w}|$ for the element-wise absolute value of $\bm{w}$. We use $n,d$ for the training set size and input dimension. $\mathcal{O}$ is the Big-O notation.

2.1 Problem formulation

Consider a source domain and a target domain in the OOD generalization problem, where we use $\mathcal{D}_{S}$, $\mathcal{D}_{T}$ to represent the source and target distributions respectively. Let each $\mathcal{D}$ be the probability measure of a sample $\bm{z}$ from the sample space $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ with $\mathcal{X}\subseteq\mathbb{R}^{d}$. In the source domain, we have a training set $S=\{\bm{z}_{i}\}_{i=1}^{n}$ with $\bm{z}_{i}\sim\mathcal{D}_{S}$ for all $i\in[n]$, and the goal is to learn a model $f\in\mathcal{F}$ from $S$ with parameters $\bm{\theta}\in\Theta$, where $f:\Theta\times\mathcal{X}\mapsto\mathbb{R}$, that generalizes well. Given a loss function $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_{+}$, the expected risk over the source distribution $\mathcal{D}_{S}$ and its empirical counterpart are

$$\mathcal{L}_{S}(f_{\bm{\theta}})\triangleq\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}[\ell_{\bm{\theta}}(\bm{z})]=\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}[\ell(f(\bm{\theta},\bm{x}),y)],\qquad\widehat{\mathcal{L}}_{S}(f_{\bm{\theta}})\triangleq\frac{1}{n}\sum_{\bm{z}_{i}\in S}\ell_{\bm{\theta}}(\bm{z}_{i}).$$

We use $\ell_{\bm{\theta}}(\bm{z})$ as shorthand. OOD generalization measures the gap between the target-domain expected risk $\mathcal{L}_{T}(f_{\bm{\theta}})$ and the source-domain empirical risk $\widehat{\mathcal{L}}_{S}(f_{\bm{\theta}})$, which involves two parts: (1) the in-domain generalization gap between the empirical risk and the expected risk $\mathcal{L}_{S}(f_{\bm{\theta}})$ in the source domain; (2) the out-of-domain distance between the source and target domains. A model-agnostic example in Zhao et al. (2018) gives the following uniform bound:

Proposition 2.1 (Zhao et al. (2018), Theorems 2 & 3.2).

With hypothesis class $\mathcal{F}$ and pseudo dimension $\operatorname{Pdim}(\mathcal{F})=d^{\prime}$, and unlabeled empirical datasets $\widehat{\mathcal{D}}_{S}$ and $\widehat{\mathcal{D}}_{T}$ from the source and target distributions with size $n$ each, with probability at least $1-\delta$, for all $f\in\mathcal{F}$,

$$\mathcal{L}_{T}(f)\leq\widehat{\mathcal{L}}_{S}(f)+\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+\mathcal{O}\left(\sqrt{d^{\prime}/n}\right)$$

where $d_{\mathcal{F}\Delta\mathcal{F}}(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}):=2\sup_{A_{f}\in\mathcal{A}_{\{f(x)\oplus f^{\prime}(x):f,f^{\prime}\in\mathcal{F}\}}}\left|\mathbb{P}_{\widehat{\mathcal{D}}_{S}}[A_{f}]-\mathbb{P}_{\widehat{\mathcal{D}}_{T}}[A_{f}]\right|$ and $\oplus$ is the XOR operator. The specific form of $\mathcal{O}(\sqrt{|\mathcal{F}|/n})$ is defined in Appendix C.6.

2.2 Algorithmic robustness

Definition 2.2 (Robustness, Xu & Mannor (2012)).

A learning model $f_{\bm{\theta}}$ trained on set $S$ is $(K,\epsilon(\cdot))$-robust, for $K\in\mathbb{N}$, if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted by $\{C_{i}\}_{i=1}^{K}$, such that $\forall\bm{s}\in S$ we have

$$\bm{s},\bm{z}\in C_{i},\ \forall i\in[K]\ \Rightarrow\ \left|\ell_{\bm{\theta}}(\bm{s})-\ell_{\bm{\theta}}(\bm{z})\right|\leq\epsilon(S).$$

This definition captures the robustness of the model with respect to the input. Within each partitioned set $C_{i}$, the loss difference between any sample $\bm{z}$ belonging to $C_{i}$ and any training sample $\bm{s}\in C_{i}$ is upper bounded by the robustness constant $\epsilon(S)$. The generalization result of Xu & Mannor (2012) provides a framework to bound the expected risk by the empirical risk via algorithmic robustness, which is stated in Appendix C. Based on this framework, we are able to reframe the existing OOD generalization theory.
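As a concrete illustration of Definition 2.2, the following minimal Python sketch estimates the robustness constant $\epsilon(S)$ of a trained model for a given partition. The use of K-means cells as the partition $\{C_i\}$ and the `loss_fn` interface are our own illustrative assumptions, not part of the paper's construction.

```python
# Minimal sketch: estimate the (K, eps)-robustness constant of Definition 2.2
# for a trained model, given a partition of the input space. Using K-means
# cells as the partition {C_i} is an illustrative assumption.
import numpy as np
from sklearn.cluster import KMeans

def estimate_robustness(loss_fn, X_train, y_train, X_probe, y_probe, K=10, seed=0):
    """loss_fn(X, y) -> array of per-sample losses of the trained model f_theta."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X_train)
    train_cells = km.predict(X_train)
    probe_cells = km.predict(X_probe)           # probe points play the role of z in C_i
    train_losses = loss_fn(X_train, y_train)
    probe_losses = loss_fn(X_probe, y_probe)
    eps = 0.0
    for i in range(K):
        s_loss = train_losses[train_cells == i]
        z_loss = probe_losses[probe_cells == i]
        if len(s_loss) == 0 or len(z_loss) == 0:  # empty cell contributes nothing
            continue
        # max |l(s) - l(z)| over all pairs (s, z) sharing cell C_i
        eps = max(eps, np.max(np.abs(s_loss[:, None] - z_loss[None, :])))
    return eps
```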

3 Main Results

In this section, we propose a new Out-of-Distribution (OOD) generalization bound for robust algorithms, which has not been extensively studied yet. We then compare our result to the existing domain shift bound in Proposition 2.1 and discuss its implications for OOD and domain generalization problems when algorithmic robustness is taken into account. To further explain the introduced robustness, we connect it to the sharpness of the minimum (a geometric property of wide interest in optimization) by showing a rigorous dependence between robustness and sharpness. This interplay gives us a better understanding of the OOD generalization problem and, at the same time, provides more information in the final generalization bound. Detailed assumptions are clarified in Appendix B.1.

3.1 Robust OOD Generalization Bound

The main concern in OOD generalization is to measure the domain shift of a learned model. However, existing methods fail to consider intrinsic properties of the model, such as robustness. Definition 2.2 gives us a new robustness measure of a model trained on dataset $S$, where the training set $S$ is a collection of $n$ i.i.d. data pairs $(\bm{x},y)$ sampled from the source distribution $\mathcal{D}_{S}$. The measure provides the intuition that if a test sample from the target domain is similar to a specific group of training samples, their losses will be similar as well. In other words, the model's robustness reflects its ability to generalize to unseen data. Our first theorem shows that by utilizing this "robustness" measure, we can handle domain shifts more effectively by setting a tolerance range for distribution changes. This robustness measure, therefore, provides a useful tool for addressing OOD generalization.

Theorem 3.1.

Let $\hat{\mathcal{D}}_{T}$ be the empirical distribution of size $n$ drawn from $\mathcal{D}_{T}$. Assume that the loss $\ell$ is upper bounded by $M$. With probability at least $1-\delta$ (over the choice of the samples $S$), for every $f_{\bm{\theta}}$ trained on $S$ satisfying $(K,\epsilon(S))$-robustness, we have

$$\mathcal{L}_{T}(f_{\bm{\theta}})\leq\widehat{\mathcal{L}}_{S}(f_{\bm{\theta}})+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+2\epsilon(S)+3M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}} \qquad (1)$$

where the total variation distance $d_{(\epsilon,K)}$ between discrete empirical distributions is defined by

$$\forall i\in[K],\ n_{i}(S):=\#(\bm{z}\in S\cap C_{i}),\qquad d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}):=\sum_{i=1}^{K}\left|\frac{n_{i}(S)}{n}-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right| \qquad (2)$$

and $n_{i}(S),n_{i}(\hat{\mathcal{D}}_{T})$ are the numbers of samples from $S$ and $\hat{\mathcal{D}}_{T}$ that fall into the set $C_{i}$, respectively.

Remark.

The result can be decomposed into the in-domain generalization error and the out-of-domain distance $|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})|$ (please refer to Lemma C.1). Both of them depend on the robustness $\epsilon(S)$.

See the proof in Appendix C. The last three terms on the RHS of (1) are the distribution distance, the robustness constant, and an error term, respectively. Unlike traditional distribution measures, we partition the sample space and measure the distributional shift separately within the $K$ sub-groups instead of measuring it point-wise. We argue that $d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})$ can be zero if all small changes happen within the same partition; a 2D illustrative case is shown in Figure 1. Under these circumstances, our distribution distance term will be significantly smaller than that of Proposition 2.1, as the target distribution is essentially a perturbation of the source distribution. As a robust OOD generalization measure, our bound characterizes how robust the learned model is to negligible distributional perturbations. To prevent the bound from expanding excessively, we also propose an alternative tailored for non-robust algorithms ($K\to\infty$) as follows.
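For completeness, here is a minimal sketch of how the distance term $d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})$ of Eq. (2) could be computed once every sample has been assigned to a partition cell; the cell-assignment step (e.g., the K-means predictor sketched earlier) is an assumption for illustration.

```python
# Minimal sketch of the distance term d_{(eps,K)}(S, \hat{D}_T) in Eq. (2):
# a total-variation distance between the histograms of source and target
# samples over the K partition cells.
import numpy as np

def partition_tv_distance(cells_S, cells_T, K, n):
    """cells_S, cells_T: cell indices of the n source / n target samples."""
    hist_S = np.bincount(cells_S, minlength=K) / n
    hist_T = np.bincount(cells_T, minlength=K) / n
    return np.abs(hist_S - hist_T).sum()   # sum_i |n_i(S)/n - n_i(D_T)/n|
```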

Corollary 3.2.

If $K\to\infty$, the domain shift bound $|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})|$ can be replaced by the distribution distance in Proposition 2.1, where

$$|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})|\leq\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}(S;\mathcal{D}_{T})\leq\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}(S;\widehat{\mathcal{D}}_{T})+\mathcal{O}(\sqrt{d^{\prime}/n}) \qquad (3)$$

and the pseudo dimension is $\operatorname{Pdim}(\mathcal{F})=d^{\prime}$.

The proof is in Appendix C.1. As dictated by Theorem 3.1, when $K\to\infty$, using an infinite number of partitions of the data distribution renders the robustness meaningless. However, Eq. 25 suggests that our bound can be replaced by $d_{\mathcal{F}\Delta\mathcal{F}}(\mathcal{D}_{S};\mathcal{D}_{T})$ in the limit of infinite $K$. This avoids computing a vacuous bound for non-robust algorithms. In summary, Theorem 3.1 presents a novel approach for quantifying distributional shifts by incorporating the concept of robustness. Our framework is particularly beneficial when a robust algorithm is able to adapt to local shifts in the distribution. Additionally, our data-dependent result remains valid and useful in the overparameterized regime, since $K$ does not depend on the model size.

3.2 Sharpness and Robustness

Clearly, robustness is inherently tied to the optimization properties of a model, particularly the curvature of the loss landscape. One direct approach to characterizing this geometric curvature, referred to as "sharpness," involves analyzing the Hessian matrix (Foret et al., 2020; Cohen et al., 2021). Recent research (Petzka et al., 2021) has shown that the concept of "relative flatness", which we call sharpness in this paper, has a strong correlation with model generalization. However, the impact of relative flatness on OOD generalization remains uncertain, even in the convex setting. To address this problem, we investigate the interplay between robustness and sharpness. With the following definition of sharpness, we endeavor to establish an OOD generalization bound rooted in optimization principles.

Definition 3.3 (Sharpness, Petzka et al. (2021)).

For a twice differentiable loss function $\mathcal{L}(\bm{w})=\sum_{\bm{s}\in S}\ell_{\bm{w}}(\bm{s})$, $\bm{w}\in\mathbb{R}^{m}$, with a sample set $S$, the sharpness is defined by

$$\kappa(\bm{w},S,A):=\left\langle\bm{w},\bm{w}\right\rangle\cdot\operatorname{tr}\left(H_{S,A}(\bm{w})\right) \qquad (4)$$

where $H_{S,A}$ is the Hessian matrix of the loss $\mathcal{L}(\bm{w})$ w.r.t. $\bm{w}$ with hypothesis set $A$ and input set $S$.
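The following minimal sketch computes the sharpness of Eq. (4) for a small random-feature ReLU model with squared loss; the specific model, data, and the use of `torch.autograd.functional.hessian` are illustrative choices, not the paper's exact experimental procedure.

```python
# Minimal sketch of the sharpness in Eq. (4): kappa = <w, w> * tr(H_S(w)),
# where H_S(w) is the Hessian of the training loss w.r.t. the trainable
# parameters w. The model below (random-feature ReLU regression with squared
# loss) is an illustrative stand-in.
import torch

def sharpness(loss_of_w, w_hat):
    """loss_of_w: callable w -> scalar training loss; w_hat: learned parameters."""
    H = torch.autograd.functional.hessian(loss_of_w, w_hat)   # (m, m)
    return (w_hat @ w_hat) * torch.trace(H)

# Example: f(w, A, x) = (1/sqrt(d)) * sum_i w_i * relu(<x, a_i>)
d, m, n = 10, 50, 200
A = torch.randn(m, d)
A = (d ** 0.5) * A / A.norm(dim=1, keepdim=True)   # rows approx. on S^{d-1}(sqrt(d))
X, y = torch.randn(n, d), torch.randn(n)
w_hat = torch.randn(m)                             # stand-in for the trained weights

def train_loss(w):
    preds = torch.relu(X @ A.T) @ w / (d ** 0.5)
    return ((preds - y) ** 2).mean()

print(sharpness(train_loss, w_hat).item())
```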

As per Eq. 4, sharpness is characterized by the sum of all eigenvalues of the Hessian matrix, scaled by the parameter norm. Each eigenvalue of the Hessian reflects the rate of change of the loss derivative in the corresponding eigenspace. Therefore, a smaller value of $\kappa$ indicates a flatter minimum. Cha et al. (2021) suggest that flatter minima improve OOD generalization, but do not deliver an elaborate analysis of the Hessian matrix. In this section, we begin with random ReLU neural networks parameterized by $\bm{\theta}=(\{\bm{a}_{i}\}_{i\in[m]},\bm{w})$ where $\bm{w}=[w_{1},...,w_{m}]^{\top}$ is the trainable parameter. Let $A=[\bm{a}_{1},...,\bm{a}_{m}]$; the whole function class is defined as

$$f(\bm{w},A,\bm{x})\triangleq\frac{1}{\sqrt{d}}\sum_{i=1}^{m}w_{i}\sigma\left(\langle\bm{x},\bm{a}_{i}\rangle\right):\ w_{i}\in\mathbb{R},\ \bm{a}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})),\ i\in[m] \qquad (5)$$

where $\sigma(\cdot)$ is the ReLU activation function and the $\bm{a}_{i}$ are random vectors uniformly distributed on the hypersphere $\mathbb{S}^{d-1}(\sqrt{d})$, whose surface is a $(d-1)$-dimensional manifold. We then consider any convex loss function $\ell(f(\bm{w},A,\bm{x}),y):\mathbb{R}\times\mathbb{R}\to\mathbb{R}_{+}$. The corresponding empirical minimizer in the source domain is $\bm{\hat{w}}=\arg\min_{\bm{w}}\frac{1}{n}\sum_{i=1}^{n}\ell(f(\bm{w},A,\bm{x}_{i}),y_{i})$. With $\bm{\hat{w}}$, we are interested in the loss geometry over the sample domain ($\ell_{\bm{\hat{w}},A}(\bm{z})$ for short). Intuitively, a flatter minimum on the loss landscape is expected to be more robust to varying inputs. Suppose the sample space $\mathcal{Z}$ can be partitioned into $K$ disjoint sets. For each set $C_{i},i\in[K]$, the loss difference is upper bounded by $\epsilon(S,A)$. Given $\bm{z}\in S$, we have

$$\epsilon(S,A)\triangleq\max_{i\in[K]}\sup_{\bm{z},\bm{z}^{\prime}\in C_{i}}\left|\ell_{\bm{\hat{w}},A}(\bm{z})-\ell_{\bm{\hat{w}},A}(\bm{z}^{\prime})\right|. \qquad (6)$$

As an alternative form of robustness, $\epsilon(S,A)$ in (6) captures the maximum loss difference between any two samples in each partition and depends on the convexity and smoothness of the loss function in the input domain. Given a training set $S$ and any initialization $\bm{w}_{0}$, the robustness $\epsilon(S,A)$ of a learned model $f_{\bm{\hat{w}}}$ is determined. It explicitly reflects the smoothness of the loss function in each (pre-)partitioned set. Nevertheless, its connection to the sharpness of the loss function in parameter space remains unclear. To address this gap, we establish a connection between sharpness and robustness in Eq. 7. Notably, this interplay holds implications not only for OOD but also for in-distribution generalization.

Theorem 3.4.

Assume for any $A$, the loss function $\ell_{\bm{\hat{w}},A}(\bm{z})$ w.r.t. sample $\bm{z}$ satisfies the $L$-Hessian Lipschitz continuity (refer to Definition B.2) within every set $C_{i},\forall i\in[K]$. Let $\bm{z}_{i}(A)=\arg\max_{\bm{z}\in C_{i}\cap S}\ell_{\bm{\hat{w}},A}(\bm{z})$. Define $\mathcal{M}_{i}$ to be the set of global minima in $C_{i}$, and suppose $\exists\bm{z}_{i}^{*}(A)\in\mathcal{M}_{i}$ such that for some $\rho_{i}(L)>0$, $\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\frac{\rho_{i}(L)}{L}$ almost surely. Then, letting $\rho_{\max}(L)=\max\{\rho_{i}(L),i\in[K]\}$, $\|\bm{x}\|^{2}\equiv R(d)$ and $n^{\prime}\leq n$, $n^{\prime}\in\mathbb{N}^{+}$, with probability $p=\min\left\{\frac{2}{\pi}\arccos(R(d)^{-\frac{1}{2}}),\left|1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}e^{\frac{1}{4d-9}}\right|\right\}$ over $\{\bm{a}_{i}\}_{i=1}^{m}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ we have

$$\epsilon(S,A)\leq\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(\left[n^{\prime}+\mathcal{O}\left(\frac{d}{m}\right)\right]\kappa(\bm{\hat{w}},S,A)+\frac{4\rho_{\max}(L)}{3}\right). \qquad (7)$$
Remark.

Given the training set $S$, we can estimate a factor $\hat{n}$ with $n^{\prime}\leq\hat{n}$ by comparing the maximum Hessian norm w.r.t. $\bm{z}_{j}$ to the sum of all the Hessian norms over $\{\bm{z}_{i}\}_{i\in[n]}$. Note that the smoothness condition only applies to every partitioned set (locally), which is much weaker than requiring the loss function to satisfy it globally. We also discuss the difference between our results and Petzka et al. (2021) in Appendix F. The family of loss functions to which our theorem applies can be found in Appendix B.1.

Corollary 3.5.

Let $\hat{w}_{\min}$ be the minimum value of $|\bm{\hat{w}}|$. Suppose $\forall\bm{x}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $|\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)/\partial f^{2}|$ is bounded in $[\tilde{M}_{1},\tilde{M}_{2}]$. If $m=\mathrm{Poly}(d)$, $d>2$, and $\rho_{\max}(L)<(\hat{w}^{2}_{\min}\tilde{M}_{1}\tilde{\sigma}(d,m))/(2d)$, then taking the expectation over all $\bm{x}_{j}\in S,j\in[n]$ and all $\bm{a}_{i}\in A\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})),\forall i\in[m]$, we have

$$\mathbb{E}_{S,A}\left[\epsilon(S,A)\right]\leq\mathbb{E}_{S,A}\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\tilde{M}_{2}\right). \qquad (8)$$

where $\tilde{\sigma}(d,m)=\mathbb{E}_{\bm{a}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}\lambda_{\min}(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii})>0$ is the minimum eigenvalue and $G_{ii}$ is the product constant of Gegenbauer polynomials (the definition can be found in Appendix B).

See the proof in Appendix D. From Eq. 7 we can see that the robustness constant $\epsilon(S,A)$ is (pointwise) upper bounded by the sharpness of the learned model, as measured by the quantity $\kappa(\bm{\hat{w}},S,A)$, and the parameter $\rho_{\max}(L)$. It should be noted that $\rho_{\max}(L)$ depends on the partition: as the number of partitions increases, the region of the input domain covered by $\rho_{\max}(L)$ becomes smaller, thereby making sharpness the dominant term reflecting the model's robustness within the small local area. In Corollary 3.5, we show a stronger connection when the partition satisfies some conditions. Overall, this bound states that the larger the sharpness $\kappa(\bm{\hat{w}},S,A)$ of the model, the larger the upper bound on the robustness parameter $\epsilon(S,A)$. This result aligns with the intuition that a sharper model is more prone to overfitting the training domain and is less robust in the unseen domain. While the dependence is not exact, it can still be regarded as an alternative approach that avoids the explicit computation of the intractable robustness term. By substituting this upper bound for $\epsilon(S,A)$ into Theorem 3.1, we derive a sharpness-based OOD generalization bound. This implies that the OOD generalization error will, with high probability, be small if the learned model is flat enough. Unlike existing works, our generalization bound provides more information about how optimization properties influence performance when generalizing to OOD data. It bridges the gap between robustness and sharpness, and can also be generalized to non-OOD learning problems. Moreover, we provide a better theoretical grounding for the empirical observation that flat minima improve domain generalization (Cha et al., 2021) by pinpointing a clear dependence on sharpness.

3.3 Case Study

To better demonstrate the relationship between sharpness and robustness, we provide two specific examples: (1) linear ridge regression; (2) two-layer diagonal neural networks for classification.

Example 3.6.

In ridge regression models, $\epsilon(S,A)$ has an inverse relationship with the regularization parameter $\beta$: as $\beta$ increases, the learned model is more likely to have a flatter minimum (smaller $\kappa$) and lower sensitivity (smaller $\epsilon$). Following the previous notation, $\exists\tilde{c}_{1}>0$ such that $\epsilon(S,A)\leq\tilde{c}_{1}\kappa(\bm{\hat{\theta}},S)+\tilde{o}_{d}$, where $\tilde{o}_{d}$ has a smaller order than $\kappa(\hat{\bm{\theta}},S)$ for large $d$ (for the proof, refer to Appendix E.1).

As suggested in Ali et al. (2019), let us consider a generic response model $\bm{y}|\bm{\theta}_{*}\sim(X\bm{\theta}_{*},\sigma^{2}I)$ where $X\in\mathbb{R}^{n\times d}$, $d>n$. The least-squares empirical minimizer of the ridge regression problem is:

$$\bm{\hat{\theta}}=\arg\min_{\bm{\theta}}\frac{1}{2n}\|X\bm{\theta}-\bm{y}\|^{2}+\frac{\beta}{2}\|\bm{\theta}\|^{2}=(X^{\top}X+n\beta I_{n})^{-1}X^{\top}\bm{y}=M^{-1}X^{\top}\bm{y} \qquad (9)$$

Let $S$ be the training set. It is straightforward to obtain the sharpness of the quadratic loss function:

$$\kappa(\bm{\hat{\theta}},S)=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}(X^{\top}X/n+\beta I_{n})=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}(M) \qquad (10)$$

Both of the above equations depend on the same matrix $M=X^{\top}X/n+\beta I_{n}$. For fixed training samples $X$, we have $\kappa(\bm{\hat{\theta}},S)=\mathcal{O}(\beta^{-1})$ in the limit of large $\beta$. It is then clear that a higher penalty $\beta$ leads to a flatter minimum. This intuition is rigorously proven in Appendix E.1. According to Theorem 3.1 and Eq. 7, a flatter minimum is likely associated with a lower robustness constant $\epsilon(S,A)$ and thus enjoys a lower OOD generalization error gap. In ridge regression, this phenomenon is reflected by the regularization coefficient $\beta$. In general, the larger $\beta$ is, the lower the sharpness $\kappa(\bm{\hat{\theta}},S)$ and the variance are. As a consequence, a larger $\beta$ learns a more robust model, resulting in a lower OOD generalization error gap. This is verified in the distributional shift experiments shown in Figure 2.
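A minimal numpy sketch of this example is given below, evaluating the closed-form minimizer of Eq. (9) and the sharpness of Eq. (10) for several penalties $\beta$; the problem sizes and the $d\times d$ identity (used for dimensional consistency) are our own illustrative choices.

```python
# Minimal sketch of Example 3.6: the closed-form ridge minimizer (Eq. 9) and
# its sharpness (Eq. 10), evaluated for several penalties beta to illustrate
# that larger beta yields a flatter minimum (smaller kappa).
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 100
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
y = X @ theta_star

for beta in [1e-3, 1e-2, 1e-1, 1.0]:
    M = X.T @ X / n + beta * np.eye(d)
    theta_hat = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)  # Eq. (9)
    kappa = np.dot(theta_hat, theta_hat) * np.trace(M)                    # Eq. (10)
    print(f"beta={beta:g}  sharpness kappa={kappa:.3f}")
```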

Example 3.7.

We consider a classification problem using a 2-layer diagonal linear network with the exponential loss. The robustness $\epsilon(S,A)$ satisfies a relationship similar to Eq. 7. Given training set $S$, after $t>T_{\epsilon}$ iterations, $\exists\tilde{c}_{2}>0$ such that $\epsilon(S,A)\leq\tilde{c}_{2}\sup_{t\geq T_{\epsilon}}\kappa(\bm{\theta}(t),S)$.

In addition to regression and linear models, we obtain a similar relationship for 2-layer diagonal linear networks, which are commonly used in the kernel and rich regimes as well as in intermediate settings (Moroshko et al., 2020). Example 3.7 demonstrates that the relationship also holds when the model is well-trained, even though the exponential loss does not satisfy the PŁ condition. By extending our theorems to these more complex settings, we go beyond our initial assumptions and offer insights into broader applications. Later experiments on non-linear NNs also support our statements. However, a unified theorem for general function classes with fewer assumptions is still needed.

4 Related Work

Despite various methods (Sun & Saenko, 2016; Sagawa et al., 2019; Shi et al., 2021; Shafieezadeh Abadeh et al., 2015; Li et al., 2018; Cha et al., 2021; Du et al., 2020; Zhang et al., 2022) that have been proposed to overcome the poor generalization caused by unknown distribution shifts, the underlying principles and theories remain underexplored. As pointed out in Redko et al. (2020); Miller et al. (2021), different tasks that address distributional shifts, such as domain adaptation, OOD, and domain generalization, are collectively referred to as "transfer transductive learning" and share similar generalization theories. In general, the desired generalization bound is split into an In-Distribution/Domain (ID) generalization error and an Out-of-Distribution/Domain (OOD) distance. Since Blitzer et al. (2007) established a VC-dimension-based framework to estimate the domain shift gap by a divergence term, many subsequent works have sought to improve this term, e.g., via Discrepancy (Mansour et al., 2009), Wasserstein measures (Courty et al., 2017; Shen et al., 2018), Integral Probability Metrics (IPM) (Zhang et al., 2019b; Ye et al., 2021) and $\beta$-divergence (Germain et al., 2016). Among them, generalization tools such as PAC-Bayes, Rademacher Complexity, and Stability are also applied. However, few of them discuss how sharpness reacts to data distributional shifts.

Beyond this canonical framework, Ye et al. (2021) reformulate the OOD generalization problem and provide a generalization bound using the concepts of "variation" and "informativeness." The causal framework proposed in Peters et al. (2017); Rojas-Carulla et al. (2018) focuses on the impact of interventions on robust optimization over test distributions. However, none of these frameworks consider the optimization process of a model and how it affects OOD generalization. Inspired by previous investigation on the effect of sharpness on ID generalization (Lyu et al., 2022; Petzka et al., 2021), recent work in Cha et al. (2021) found that flatter minima can also improve OOD generalization. Nevertheless, they lack a sufficient theoretical foundation for the relationship between the "sharpness" of a model and OOD generalization, but end with a union bound of Blitzer et al. (2007)’s result. In this paper, we aim to provide a rigorous examination of this relationship.

5 Experiments

In light of space constraints, we present only a portion of our experimental results to support the validity of our theorems and findings. For comprehensive results, please refer to Appendix G.

5.1 Ridge regression in distributional shifting

Following Duchi & Namkoong (2021), we investigate ridge regression under distributional shift. We randomly generate $\theta^{*}_{0}\in\mathbb{R}^{d}$ on the sphere and generate data from the following process: $X\overset{\text{iid}}{\sim}\mathcal{N}(0,1)$, $\bm{y}=X\theta^{*}_{0}$. To simulate distributional shift, we randomly generate a unit vector $\theta_{0}^{\perp}$ perpendicular to $\theta^{*}_{0}$. Taking $\theta_{0}^{\perp},\theta^{*}_{0}$ as basis vectors, the shifted ground truth is computed as $\theta^{*}_{\alpha}=\theta^{*}_{0}\cdot\cos(\alpha)+\theta^{\perp}_{0}\cdot\sin(\alpha)$. For the source domain, we use $\theta^{*}_{0}$ as our training distribution. We randomly sample 50 data points and train a linear model with gradient descent for 3000 iterations. By minimizing the objective function in (9), we obtain the empirical optimum $\hat{\theta}$. We then gradually shift the distribution by increasing $\alpha$ to obtain different target domains. As the distribution shifts, the test loss $\ell(\hat{\theta},\bm{y}_{\alpha})$ increases. As shown in Figure 2, the test loss peaks at around 3 rad, where the distribution shift is maximal. Comparing different levels of regularization, we find that a larger L2 penalty $\beta$ yields a lower OOD generalization error, shown as darker purple lines. This plot bears out our intuition from the previous section. As stated in the case above, the sharpness of ridge regression should depend inversely on $\beta$. Correspondingly, we compute sharpness using the definition in Eq. (4), averaging over ten different runs. For each trial, we use the same training and test data for every $\beta$. The sharpness of each ridge regressor is shown in the legend of Figure 2. As we can see, larger $\beta$ leads to lower sharpness.
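A minimal sketch of this shifting procedure is given below; the dimension, sample sizes, and penalty value are illustrative choices rather than the exact experimental configuration.

```python
# Minimal sketch of the Section 5.1 setup: rotate the ground-truth parameter
# within the span of (theta_0, theta_0_perp) by angle alpha and track the test
# loss of a ridge model fit on the alpha = 0 data.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 50
theta0 = rng.normal(size=d)
theta0 /= np.linalg.norm(theta0)
v = rng.normal(size=d)
v -= (v @ theta0) * theta0
theta0_perp = v / np.linalg.norm(v)                 # unit vector orthogonal to theta0

X = rng.normal(size=(n, d))
y = X @ theta0                                      # source domain (alpha = 0)
beta = 0.1
theta_hat = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)

for alpha in np.linspace(0, np.pi, 7):
    theta_alpha = theta0 * np.cos(alpha) + theta0_perp * np.sin(alpha)
    X_test = rng.normal(size=(200, d))
    y_test = X_test @ theta_alpha
    test_loss = np.mean((X_test @ theta_hat - y_test) ** 2) / 2
    print(f"alpha={alpha:.2f} rad  test loss={test_loss:.3f}")
```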

Figure 2: OOD test losses increase along the distributional shift. The X-axis is the shifting angle $\alpha$ and the Y-axis is the test loss of the model trained on the distribution with $\alpha=0$. Lines are average test losses and shaded regions are variances over 10 trials. Larger regularization $\beta$ (darker color) causes a lower increase in test loss and corresponds to smaller sharpness.
Figure 3: The relationship between out-of-domain test accuracy and model sharpness on the RotatedMNIST dataset. We show 4 different OOD environments, with $15^{\circ},30^{\circ},45^{\circ},60^{\circ}$ rotations as the OOD test sets respectively. Each marker denotes a minimum of an algorithm with a specific seed. The marker style indicates models trained in the same environment.

5.2 Sharper minimum hurts OOD generalization

In our results, we proved that the upper bound on the OOD generalization error involves the sharpness of the trained model. Here we empirically verify this theoretical insight. We follow the experimental setting of DomainBed (Gulrajani & Lopez-Paz, 2021). To compute sharpness easily, we choose a 4-layer MLP on the RotatedMNIST dataset, a rotation of the MNIST handwritten digit dataset (LeCun, 1998) with angles ranging over $[0^{\circ},15^{\circ},30^{\circ},45^{\circ},60^{\circ},75^{\circ}]$. In this codebase, each environment refers to selecting one domain (a specific rotation angle) as the test domain/OOD test dataset while training on all other domains. After obtaining the trained model for each environment, we compute the sharpness using all domain training sets based on the implementation of Petzka et al. (2021). To this end, we plot the performance of Empirical Risk Minimization (ERM), SWAD (Cha et al., 2021), Mixup (Yan et al., 2020) and GroupDRO (Sagawa et al., 2019) with 6 seeds each, and measure the sharpness of all these minima. Figure 3 shows the relationship between model sharpness and out-of-domain accuracy. The tendency is clear: flatter minima give better OOD performance. In general, different environments cannot be plotted together due to different training sets. However, we found that the middle 4 environments are similar tasks and thus plot them together for a clearer trend. In addition, different algorithms lead to different feature scales, which may affect the scale of the sharpness. To address this, we align their scales when plotting them together. For more individual results, please refer to Figure 8 in the appendix.

5.3 Comparison of generalization bounds

To analyze our generalization bounds, we follow the toy example experiments in Sagawa et al. (2020). In this experiment, the distribution shift terms and generalization error terms can be explicitly computed. Furthermore, their synthetic experiment considers spurious correlations across distribution shifts, which is now a common formulation of OOD generalization (Wald et al., 2021; Aubin et al., 2021; Yao et al., 2022). Consider data $x=[x_{\text{core}},x_{\text{spu}}]\in\mathbb{R}^{2d}$ consisting of two features: a core feature and a spurious feature. The features are generated from the following rule:

$$x_{\text{core}}\mid y\sim\mathcal{N}\left(y\mathbf{1},\sigma_{\text{core}}^{2}I_{d}\right),\qquad x_{\text{spu}}\mid a\sim\mathcal{N}\left(a\mathbf{1},\sigma_{\mathrm{spu}}^{2}I_{d}\right)$$

where $y\in\{-1,1\}$ is the label and $a\in\{-1,1\}$ is the spurious attribute. Data with $y=a$ form the majority group of size $n_{\text{maj}}$, and data with $y=-a$ form the minority group of size $n_{\text{min}}$. The total number of training points is $n=n_{\text{maj}}+n_{\text{min}}$. The spurious correlation probability $p_{\text{maj}}=\frac{n_{\text{maj}}}{n}$ defines the probability of $y=a$ in the training data. In testing, we always have $p_{\text{maj}}=0.5$. The metric, worst-group error (Sagawa et al., 2019), is defined as

$$\operatorname{Err}_{\mathrm{wg}}(w):=\max_{i\in[4]}\mathbb{E}_{x,y\mid g_{i}}\left[\ell_{0-1}(w;(x,y))\right]$$

where $\ell_{0-1}$ is the $0$-$1$ loss in binary classification. Here we compare our proposed OOD generalization bound with the baseline in Proposition 2.1. We also give comparisons to other baselines, such as the PAC-Bayes DA bound, in Appendix G.
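Below is a minimal sketch of this data-generating process and of the worst-group $0$-$1$ error; the group indexing over $(y,a)$ pairs and the helper names are illustrative assumptions.

```python
# Minimal sketch of the synthetic data-generating process of Section 5.3 and
# the worst-group 0-1 error over the four (y, a) groups.
import numpy as np

def make_data(n, d, p_maj, sigma_core, sigma_spu, rng):
    y = rng.choice([-1, 1], size=n)
    is_maj = rng.random(n) < p_maj
    a = np.where(is_maj, y, -y)                     # majority: a = y, minority: a = -y
    x_core = y[:, None] + sigma_core * rng.normal(size=(n, d))
    x_spu = a[:, None] + sigma_spu * rng.normal(size=(n, d))
    return np.hstack([x_core, x_spu]), y, a

def worst_group_error(w, X, y, a):
    preds = np.sign(X @ w)
    errs = []
    for yy in (-1, 1):
        for aa in (-1, 1):
            g = (y == yy) & (a == aa)
            if g.any():
                errs.append(np.mean(preds[g] != y[g]))  # 0-1 loss within group g_i
    return max(errs)

rng = np.random.default_rng(0)
X, y, a = make_data(n=500, d=100, p_maj=0.9, sigma_core=10.0, sigma_spu=1.0, rng=rng)
```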

Figure 4: Spurious feature synthetic experiment. Each dot represents a trained model. The dashed curves are smoothed functions fit to the test data points. The baseline is Proposition 2.1. (a),(d): the generalization error of the logistic regression models with increasing model size/correlation probability. (b): concentration error term in the domain shift bound. (e): comparison of distribution distance bounds. (c),(f): comparisons of generalization bounds. Note that model size $>500$ is the overparameterized regime. The further the correlation probability is from $0.5$, the greater the distributional shift.
Along model size

We plot the generalization error of random feature logistic regression as the model size increases in Figure 4(a). In this experiment, we follow the hyperparameter setup of Sagawa et al. (2020), setting the number of points $n=500$, data dimension $2d=200$ with $100$ for each feature, majority fraction $p_{\text{maj}}=0.9$ and noises $\sigma_{\text{spu}}^{2}=1,\sigma_{\text{core}}^{2}=100$. The worst-group error turns out to be nearly constant as the model size increases. However, as shown in Figure 4(b), the error term in the domain shift bound of Proposition 2.1 keeps increasing as the model size increases. In contrast, our domain shift bound, of order $\sqrt{K}$, is independent of the model size, which addresses this limitation of their bound. We follow Kawaguchi et al. (2022) to compute $K$ via an inverse image of the $\epsilon$-covering in a randomly projected space (see details in the appendix), and set the same value $K=1{,}000$ in our experiment. Unlike the baseline, $K$ is data dependent and leads to a constant concentration error term as the model size increases. Analogously, our OOD generalization bound does not explode as the model size increases (shown in Figure 4(c)).
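As a rough illustration of how such a data-dependent $K$ could be estimated, the sketch below counts occupied cells of an $\epsilon$-grid after a random low-dimensional projection of the inputs; this is a simplified reading of the construction credited to Kawaguchi et al. (2022) in the text, not their exact procedure.

```python
# Rough sketch of one way to estimate a data-dependent K: count occupied cells
# of an eps-grid after a random low-dimensional projection of the inputs.
import numpy as np

def estimate_K(X, proj_dim=10, eps=0.5, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(X.shape[1], proj_dim)) / np.sqrt(proj_dim)  # random projection
    Z = X @ P
    cells = np.floor(Z / eps).astype(int)            # eps-grid cell index per sample
    return len({tuple(c) for c in cells})            # number of occupied cells
```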

Along distribution shift

In addition, we are interested in characterizing OOD generalization when the test distribution shifts from the training distribution, by varying the correlation probability $p_{\text{maj}}$ during data generation. As shown in Figure 4(d), when $p_{\text{maj}}=0.5$, there is no distributional shift between training and test data since no spurious feature is correlated with the training data. Thus, the training and test distributions become increasingly aligned as $p_{\text{maj}}<0.5$ increases toward $0.5$, resulting in an initial decrease in the test error of the worst-case group. However, as $p_{\text{maj}}>0.5$ deviates from $0.5$, introducing spurious features, a shift in the distribution occurs. This deviation is likely to impact the worst-case group differently, leading to an increase in the test error. As displayed in Figure 4(e) and Figure 4(f), our distribution distance and generalization bound capture the distribution shifts while remaining tighter than the baseline.

6 Conclusion

In this paper, we provide a more interpretable and informative theory for understanding Out-of-Distribution (OOD) generalization. Based on the notion of robustness, we propose a robust OOD bound that effectively captures algorithmic robustness in the presence of shifting data distributions. In addition, our in-depth analysis of the relationship between robustness and sharpness further illustrates that sharpness has a negative impact on generalization. Overall, our results advance the understanding of OOD generalization and the principles that govern it.

7 Acknowledgement

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-GC-2019-001-2A, AISG Award No: AISG2-TC-2023-010-SGIL) and the Singapore Ministry of Education Academic Research Fund Tier 1 (Award No: T1 251RES2207). We appreciate the contributions of Fusheng Liu, a Ph.D. student at NUS.

References

  • Ahuja et al. (2021) Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization. Advances in Neural Information Processing Systems, 34:3438–3450, 2021.
  • Ali et al. (2019) Alnur Ali, J Zico Kolter, and Ryan J Tibshirani. A continuous-time view of early stopping for least squares regression. In The 22nd international conference on artificial intelligence and statistics, pp.  1370–1378. PMLR, 2019.
  • Arjovsky et al. (2019) Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Aubin et al. (2021) Benjamin Aubin, Agnieszka Słowik, Martin Arjovsky, Leon Bottou, and David Lopez-Paz. Linear unit-tests for invariance discovery. arXiv preprint arXiv:2102.10867, 2021.
  • Bandi et al. (2018) Peter Bandi, Oscar Geessink, Quirine Manson, Marcory Van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge. IEEE Transactions on Medical Imaging, 2018.
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79:151–175, 2010.
  • Blitzer et al. (2007) John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. Advances in neural information processing systems, 20, 2007.
  • Bourin et al. (2013) Jean-Christophe Bourin, Eun-Young Lee, and Minghua Lin. Positive matrices partitioned into a small number of hermitian blocks. Linear Algebra and its Applications, 438(5):2591–2598, 2013.
  • Cha et al. (2021) Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405–22418, 2021.
  • Cohen et al. (2021) Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
  • Courty et al. (2017) Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 2017.
  • Du et al. (2017) Simon S Du, Jayanth Koushik, Aarti Singh, and Barnabás Póczos. Hypothesis transfer learning via transformation functions. Advances in neural information processing systems, 30, 2017.
  • Du et al. (2020) Yingjun Du, Xiantong Zhen, Ling Shao, and Cees GM Snoek. Metanorm: Learning to normalize few-shot batches across domains. In International Conference on Learning Representations, 2020.
  • Duchi & Namkoong (2018) John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
  • Duchi & Namkoong (2021) John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
  • El Ghaoui & Lebret (1997) Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on matrix analysis and applications, 1997.
  • Foret et al. (2020) Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
  • Germain et al. (2013) Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A pac-bayesian approach for domain adaptation with specialization to linear classifiers. In International conference on machine learning. PMLR, 2013.
  • Germain et al. (2016) Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A new pac-bayesian perspective on domain adaptation. In International conference on machine learning. PMLR, 2016.
  • Gulrajani & Lopez-Paz (2021) Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In International Conference on Learning Representations, 2021.
  • Izmailov et al. (2018) Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
  • Kawaguchi et al. (2022) Kenji Kawaguchi, Zhun Deng, Kyle Luh, and Jiaoyang Huang. Robustness implies generalization via data-dependent generalization bounds. In International Conference on Machine Learning. PMLR, 2022.
  • Kifer et al. (2004) Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In VLDB, volume 4, pp.  180–191. Toronto, Canada, 2004.
  • Koh et al. (2021) Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp.  5637–5664. PMLR, 2021.
  • Koyama & Yamaguchi (2020) Masanori Koyama and Shoichiro Yamaguchi. Out-of-distribution generalization with maximal invariant predictor. 2020.
  • LeCun (1998) Yann LeCun. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998.
  • Li et al. (2017) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pp.  5542–5550, 2017.
  • Li et al. (2018) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI conference on artificial intelligence, 2018.
  • Liu et al. (2021) Jiashuo Liu, Zheyuan Hu, Peng Cui, Bo Li, and Zheyan Shen. Heterogeneous risk minimization. In International Conference on Machine Learning, pp.  6804–6814. PMLR, 2021.
  • Lyu et al. (2022) Kaifeng Lyu, Zhiyuan Li, and Sanjeev Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction. In Advances in Neural Information Processing Systems, 2022.
  • Mansour et al. (2009) Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  • Mei & Montanari (2022) Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
  • Miller et al. (2021) John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning. PMLR, 2021.
  • Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Moroshko et al. (2020) Edward Moroshko, Blake E Woodworth, Suriya Gunasekar, Jason D Lee, Nati Srebro, and Daniel Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Advances in neural information processing systems, 33:22182–22193, 2020.
  • Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • Petzka et al. (2021) Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative flatness and generalization. Advances in Neural Information Processing Systems, 34:18420–18432, 2021.
  • Pezeshki et al. (2021) Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems, 34:1256–1272, 2021.
  • Rame et al. (2022) Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In International Conference on Machine Learning, pp.  18347–18377. PMLR, 2022.
  • Redko et al. (2020) Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv preprint arXiv:2004.11829, 2020.
  • Robbins (1955) Herbert Robbins. A remark on stirling’s formula. The American mathematical monthly, 62(1):26–29, 1955.
  • Rojas-Carulla et al. (2018) Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 2018.
  • Sagawa et al. (2019) Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
  • Sagawa et al. (2020) Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning. PMLR, 2020.
  • Shafieezadeh Abadeh et al. (2015) Soroosh Shafieezadeh Abadeh, Peyman M Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. Advances in Neural Information Processing Systems, 2015.
  • Shen et al. (2018) Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • Shi et al. (2021) Yuge Shi, Jeffrey Seely, Philip HS Torr, N Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. arXiv preprint arXiv:2104.09937, 2021.
  • Sun & Saenko (2016) Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision, pp.  443–450. Springer, 2016.
  • Wald et al. (2021) Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. Advances in neural information processing systems, 2021.
  • Wiles et al. (2022) Olivia Wiles, Sven Gowal, Florian Stimberg, Sylvestre-Alvise Rebuffi, Ira Ktena, Krishnamurthy Dj Dvijotham, and Ali Taylan Cemgil. A fine-grained analysis on distribution shift. In International Conference on Learning Representations, 2022.
  • Xu & Mannor (2012) Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
  • Yan et al. (2020) Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training. arXiv preprint arXiv:2001.00677, 2020.
  • Yao et al. (2022) Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. arXiv preprint arXiv:2201.00299, 2022.
  • Ye et al. (2021) Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization. Advances in Neural Information Processing Systems, 2021.
  • Zhang et al. (2019a) Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer relu networks via gradient descent. In The 22nd international conference on artificial intelligence and statistics, pp.  1524–1534. PMLR, 2019a.
  • Zhang et al. (2022) Xingxuan Zhang, Linjun Zhou, Renzhe Xu, Peng Cui, Zheyan Shen, and Haoxin Liu. Towards unsupervised domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4910–4920, 2022.
  • Zhang et al. (2019b) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning. PMLR, 2019b.
  • Zhao et al. (2018) Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. Advances in neural information processing systems, 31, 2018.
  • Zhong et al. (2017) Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International conference on machine learning, pp.  4140–4149. PMLR, 2017.

Appendix A Additional Experiments

A.1 Sharpness v.s. OOD Generalization on PACS and Wilds-Camelyon17

Figure 5: The relationship between out-of-distribution (OOD) test accuracy on the test environment and model sharpness (of the last FC layer) on the Wilds-Camelyon17 dataset. Each marker denotes a model trained using ERM with a different seed and hyperparameters.
Figure 6: The relationship between OOD test accuracy and model sharpness on the PACS dataset. Each marker denotes a model trained using ERM with different seeds and hyperparameters. The marker style shows the out-of-distribution test environment of the model.

To evaluate our theorem more deeply, we examine the relationship between our defined sharpness and the OOD generalization error on larger-scale real-world datasets, Wilds-Camelyon17 (Bandi et al., 2018; Koh et al., 2021) and PACS (Li et al., 2017). The Wilds-Camelyon17 dataset includes 455,954 tumor and normal tissue slide images from five hospitals (environments). One of the hospitals is assigned as the test environment by the dataset publisher. Distribution shift arises from variations in patient population, slide staining, and image acquisition. The PACS dataset contains 9,991 images of 7 objects in 4 visual styles (environments): art painting, cartoon, photo, and sketch. Following the common setup in Gulrajani & Lopez-Paz (2021), each environment is used as the test environment in turn. We follow the practice in Petzka et al. (2021) and compute the sharpness using the Hessian matrix of the last Fully-Connected (FC) layer of each model. For the Wilds-Camelyon17 dataset, we test the sharpness of 18 ERM models trained with different random seeds and hyperparameters; Figure 5 shows the result. For the PACS dataset, we run 60 ERM models with different random seeds and hyperparameters for each test environment. To get a clearer correlation, we align the points from the 4 environments by their mean performance; Figure 6 shows the result. From the two figures, we observe a clear correlation between sharpness and out-of-distribution (OOD) accuracy: sharpness tends to hurt the OOD performance of the model. The result is consistent with what we report in Figure 3 and shows that the correlation between sharpness and OOD accuracy can also be observed on large-scale datasets.

Appendix B Notations and Definitions

Notations

We use [n] to denote the integer set \{i\}_{i=1}^{n}, and \|\cdot\| represents the \ell_{2}-norm \|\cdot\|_{2} for short. Without loss of generality, we use \ell(f(\bm{\theta},\bm{x}),\bm{y}) for the loss function of model f_{\bm{\theta}} on the data pair \bm{z}=(\bm{x},\bm{y}), which is denoted as \ell_{\bm{\theta}}(\bm{z}), and we use n,d for the training set size and input dimension. Note that we generally follow the notations in the original papers.

  • S,T\mathcal{L}_{S},\mathcal{L}_{T}: expected risk of the source domain and target domain, respectively. The corresponding empirical version will be ^S,^T\widehat{\mathcal{L}}_{S},\widehat{\mathcal{L}}_{T}

  • {Ci}i=1K\{C_{i}\}_{i=1}^{K}: KK partitions on sample space and CiC_{i} denotes each partitioned set.

  • 𝒟S,𝒟T\mathcal{D}_{S},\mathcal{D}_{T}: distributions of source and target domain. Their sampled dataset will be denoted as 𝒟^S,𝒟^T\hat{\mathcal{D}}_{S},\hat{\mathcal{D}}_{T} accordingly.

  • \bm{\theta}: In our setting, \bm{\theta}=(\bm{w},\{\bm{a}_{i}\}_{i=1}^{m}) denotes the model parameters, where \bm{w} is the trainable parameter vector and \{\bm{a}_{i}\}_{i=1}^{m} are the random features (we also write A=[\bm{a}_{1},...,\bm{a}_{m}] for short in many places). \bm{\hat{w}} is the minimizer of the empirical loss.

  • MM: upper bound of the loss function.

  • S=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n},~X=[\bm{x}_{1},...,\bm{x}_{n}]: training data of size n and the corresponding input matrix.

Definition B.1 (Robustness, Xu & Mannor (2012)).

A learning algorithm \mathcal{A} on training set S is (K,\epsilon(\cdot))-robust, for K\in\mathbb{N}, if \mathcal{Z} can be partitioned into K disjoint sets, denoted by \left\{C_{k}\right\}_{k=1}^{K}, such that for all s\in S and z\in\mathcal{Z} we have

\forall k\in[K]:~s,z\in C_{k}~\Rightarrow~\left|\ell\left(\mathcal{A}_{S},s\right)-\ell\left(\mathcal{A}_{S},z\right)\right|\leq\epsilon(S).
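A minimal sketch of how this robustness constant could be estimated empirically follows. The K-means partition of the sample space and the helper names (model, loss_fn, the held-out pool) are illustrative assumptions, not part of the definition.

# Hedged sketch: estimate the robustness constant eps of Definition B.1
# for a fitted model, using a K-means partition of the inputs as one
# possible (assumed) choice of {C_k}.
import numpy as np
from sklearn.cluster import KMeans

def empirical_robustness(model, loss_fn, X_train, y_train, X_pool, y_pool, K=20):
    km = KMeans(n_clusters=K, n_init=10).fit(np.vstack([X_train, X_pool]))
    c_train = km.predict(X_train)
    c_pool = km.predict(X_pool)
    l_train = loss_fn(model, X_train, y_train)   # per-sample losses, shape (n,)
    l_pool = loss_fn(model, X_pool, y_pool)
    eps = 0.0
    for k in range(K):
        lt, lp = l_train[c_train == k], l_pool[c_pool == k]
        if len(lt) and len(lp):
            # worst within-cell gap between a training point and any other point
            gap = max(abs(lt.max() - lp.min()), abs(lp.max() - lt.min()))
            eps = max(eps, gap)
    return eps  # the (K, eps)-robustness constant over this partition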
Definition B.2 (Hessian Lipschitz continuous).

For a twice differentiable function f:\mathbb{R}^{n}\rightarrow\mathbb{R}, it has L_{i}-Lipschitz continuous Hessian on the domain C_{i} if, for all vectors \bm{x},\bm{y}\in C_{i},

2f(𝒚)2f(𝒙)Li𝒚𝒙\left\|\nabla^{2}f(\bm{y})-\nabla^{2}f(\bm{x})\right\|\leq L_{i}\|\bm{y}-\bm{x}\|

where Li>0L_{i}>0 depends on input domain CiC_{i} and \|\cdot\| is L2L_{2} norm. Then for all KK domains i=1KCi\cup_{i=1}^{K}C_{i}, let L:=max{Li|i[K]}L:=\max\{L_{i}|i\in[K]\} be the uniform Lipschitz constant, so we have

2f(𝒚)2f(𝒙)Li𝒚𝒙L𝒚𝒙,i[K],(𝒙,𝒚)Ci\left\|\nabla^{2}f(\bm{y})-\nabla^{2}f(\bm{x})\right\|\leq L_{i}\|\bm{y}-\bm{x}\|\leq L\|\bm{y}-\bm{x}\|,\forall i\in[K],(\bm{x},\bm{y})\in C_{i}

which is uniformly bounded with LL.

Lemma B.3 (Hessian Lipschitz Lemma).

If ff is twice differentiable and has LL-Lipschitz continuous Hessian, then

|f(𝒚)f(𝒙)f(𝒙),𝒚𝒙122f(𝒙)(𝒚𝒙),(𝒚𝒙)|L6𝒚𝒙3.\left|f(\bm{y})-f(\bm{x})-\langle\nabla f(\bm{x}),\bm{y}-\bm{x}\rangle-\frac{1}{2}\left\langle\nabla^{2}f(\bm{x})(\bm{y}-\bm{x}),(\bm{y}-\bm{x})\right\rangle\right|\leq\frac{L}{6}\|\bm{y}-\bm{x}\|^{3}.

Gegenbauer Polynomials

We briefly define Gegenbauer polynomials here; details can be found in the Appendix of Mei & Montanari (2022). First, we denote \mathbb{S}^{d-1}(r)=\{\bm{x}\in\mathbb{R}^{d}:\|\bm{x}\|=r\} as the sphere of radius r, a (d-1)-dimensional manifold equipped with the uniform distribution. Let \mu_{d} be the uniform probability measure on \mathbb{S}^{d-1}(\sqrt{d}), and denote the inner product and norm in the functional space L^{2}(\mathbb{S}^{d-1}(\sqrt{d}),\mu_{d}) by \langle\cdot,\cdot\rangle_{L^{2}} and \|\cdot\|_{L^{2}}:

f,gL2𝕊d1(d)f(𝒙)g(𝒙)μd(d𝒙).\langle f,g\rangle_{L^{2}}\equiv\int_{\mathbb{S}^{d-1}(\sqrt{d})}f(\bm{x})g(\bm{x})\mu_{d}(\mathrm{~{}d}\bm{x}).

For any function \sigma\in L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d}), where \tau_{d} is the distribution of \langle\bm{x},\bm{y}\rangle/\sqrt{d}~(\bm{x},\bm{y}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))), the Gegenbauer polynomials \{Q^{d}_{t}\} of degree t~(t\geq 0) form an orthogonal basis, and the spherical harmonics coefficients \lambda_{d,t}(\sigma) can be expressed as:

\lambda_{d,t}(\sigma)=\int_{[-\sqrt{d},\sqrt{d}]}\sigma(x)Q_{t}^{(d)}(\sqrt{d}x)\,\tau_{d}(\mathrm{d}x),

then the Gegenbauer generating function holds in L2([d,d],τd)L^{2}\left([-\sqrt{d},\sqrt{d}],\tau_{d}\right) sense

\sigma(x)=\sum_{t=0}^{\infty}\lambda_{d,t}(\sigma)N_{d,t}Q_{t}^{(d)}(\sqrt{d}x)

where Nd,tN_{d,t} is the normalized factor depending on the norm of input.
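A small Monte Carlo sketch of the objects above (sampling Unif(\mathbb{S}^{d-1}(\sqrt{d})), the L^{2} inner product, and the distribution \tau_{d}) is given below. Sampling the sphere by normalizing Gaussian vectors is a standard trick, and the test functions are arbitrary illustrative choices.

# Monte Carlo sketch of the L^2 inner product over Unif(S^{d-1}(sqrt(d)))
# and of the distribution tau_d of <x, y>/sqrt(d).
import numpy as np

def sphere(n, d, rng):
    g = rng.standard_normal((n, d))
    return np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n = 50, 200_000
x = sphere(n, d, rng)

relu = lambda t: np.maximum(t, 0.0)
a = sphere(1, d, rng)[0]
f = relu(x @ a / np.sqrt(d))          # sigma(<a, x>/sqrt(d))
g = x[:, 0]                           # a simple test function of x
print("Monte Carlo <f, g>_{L^2}:", np.mean(f * g))

# tau_d: distribution of <x, y>/sqrt(d) for independent x, y on the sphere.
y = sphere(n, d, rng)
tau_samples = np.einsum("ij,ij->i", x, y) / np.sqrt(d)
print("mean, var of tau_d samples:", tau_samples.mean(), tau_samples.var())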

B.1 Assumptions

We discuss and list all assumptions we used in our theorems. The purposes are to offer clarity regarding the specific assumptions required for each theorem and ensure that the assumptions made in our theorems are well-founded and reasonable, reinforcing the validity and reliability of our results.

OOD Generalization

(Setting): Given a full sample space 𝒵\mathcal{Z}, source and target distributions are two different measures over this whole sample domain 𝒵\mathcal{Z}. The purpose is to study the robust algorithms in the OOD generalization setting.

(Assumptions): For any sample 𝒛𝒵\forall\bm{z}\in\mathcal{Z}, the loss function is bounded 𝜽(𝒛)[0,M]\ell_{\bm{\theta}}(\bm{z})\in[0,M]. This assumption generally follows the original paper Xu & Mannor (2012). While it is possible to relax this assumption and derive improved bounds, our primary objective is to formulate a framework for robust OOD generalization and establish a clear connection with the optimization properties of the model.

Robustness and Sharpness

(Setting): To give a fine-grained analysis, we follow the common choice of a two-layer ReLU neural network function class, which is widely analyzed in the literature, e.g., the Neural Tangent Kernel, the non-kernel (rich) regime Moroshko et al. (2020), and random feature models. Among them, we select the following random feature model as our function class:

f(\bm{w},A,\bm{x})\triangleq\frac{1}{\sqrt{d}}\sum_{i=1}^{m}w_{i}\sigma\left(\langle\bm{x},\bm{a}_{i}\rangle\right):w_{i}\in\mathbb{R},\bm{a}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})),i\in[m]

where m is the hidden size and A=[\bm{a}_{1},...,\bm{a}_{m}] contains random vectors uniformly distributed on the d-dimensional hypersphere, whose surface is a (d-1)-manifold. \sigma(\bm{a}^{\top}\bm{x})=(\bm{a}^{\top}\bm{x})\mathbb{I}\{\bm{a}^{\top}\bm{x}\geq 0\} denotes the ReLU activation function and \mathbb{I} is the indicator function. \bm{w}=[w_{1},...,w_{m}]^{\top} is the trainable parameter. We consider the following common loss functions: (1) losses with homogeneity in regression; (2) the (binary) cross-entropy loss; (3) the negative log-likelihood (NLL) loss. A minimal NumPy sketch of this function class is given below.
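The ridge-regression fit used to obtain \bm{\hat{w}} in this sketch is an illustrative choice of empirical minimizer and is not prescribed by the analysis.

# Minimal NumPy sketch of the random-feature ReLU class f(w, A, x):
# a_i ~ Unif(S^{d-1}(sqrt(d))) are frozen; only w is trained (here by ridge
# regression, an illustrative assumption).
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 20, 200

def sphere(k, dim):
    g = rng.standard_normal((k, dim))
    return np.sqrt(dim) * g / np.linalg.norm(g, axis=1, keepdims=True)

A = sphere(m, d)                    # random features, rows a_i
X = sphere(n, d)                    # inputs on the same sphere
y = np.sin(X[:, 0])                 # toy regression target

def features(X):
    return np.maximum(X @ A.T, 0.0) / np.sqrt(d)   # shape (n, m)

Phi = features(X)
lam = 1e-3
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)

f_pred = features(X) @ w_hat        # f(w_hat, A, x) on the training inputs
print("train MSE:", np.mean((f_pred - y) ** 2))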

(Assumptions):

(i) Let C_{i},i\in[K] be any set from the whole partition \cup_{i=1}^{K}C_{i}. We assume that for all \bm{z}\in C_{i}, the loss function \ell_{\bm{\hat{w}},A}(\bm{z}) satisfies the L-Hessian Lipschitz condition for all i\in[K] (see details in Definition B.2). Note that we only require this assumption to hold within each partition, instead of globally. In general, requiring smoothness and convexity only within each cell amounts to local convexity, which is a weak assumption for most function classes.

(ii) Consider an optimization problem in each partition C_{i},i\in[K]. Let one of the training points \bm{z}_{i}(A)\in S\cap C_{i} be the initial point and \bm{z}_{i}^{*}(A)\in\mathcal{M}_{i} be the corresponding nearest local minimum, where \mathcal{M}_{i} is the local minima set of partition C_{i}. For some \rho_{i}(L)>0, we assume \|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\rho_{i}(L)/L holds a.s. This ensures that the Hessian norm \|H(\bm{z}_{i}(A))\| has a lower bound. Similar conditions and estimations can be found in Zhang et al. (2019a).

(iii) To simplify the computation of probability, we assume ξi=𝒂i𝒙\xi_{i}=\bm{a}_{i}^{\top}\bm{x} obeys a rotationally invariant distribution.

(iv) For Corollary 3.5, we make the additional assumption that the loss function \ell(\cdot) satisfies a boundedness condition: for all \bm{x}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})), the second derivative |\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)/\partial f^{2}| with respect to its argument f(\bm{\hat{w}},A,\bm{x}) is bounded within [\tilde{M}_{1},\tilde{M}_{2}]. Note that we consider the case where \bm{x}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})), while m=\text{Poly}(d) is to ensure the positive definiteness of \sum_{i=1}^{m}\bm{a}_{i}\bm{a}_{i}^{\top}\in\mathbb{R}^{d\times d} almost surely.

Appendix C Proof to Domain shift

Lemma C.1.

Let 𝒟^T\hat{\mathcal{D}}_{T} be the empirical distribution of size nn drawn from 𝒟T\mathcal{D}_{T}. The loss \ell is upper bounded by M. With probability at least 1δ1-\delta (over the choice of the samples), for every f𝛉f_{\bm{\theta}} trained on SS, we have

T(f𝜽)S(f𝜽)+Md(ϵ,K)(S,𝒟^T)+ϵ(S)+2M2Kln2+2ln(1/δ)n\mathcal{L}_{T}(f_{\bm{\theta}})\leq\mathcal{L}_{S}(f_{\bm{\theta}})+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+\epsilon(S)+2M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}} (11)

where

\forall i\in[K],~n_{i}(S):=\#(\bm{z}\in S\cap C_{i}),~{}~{}~{}d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}):=\sum_{i=1}^{K}\left|\frac{n_{i}(S)}{n}-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right| (12)

and ni(S),ni(𝒟^T)n_{i}(S),n_{i}(\hat{\mathcal{D}}_{T}) are the number of samples from SS and 𝒟^T\hat{\mathcal{D}}_{T} that fall into the set CiC_{i}, respectively.
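Before the proof, a short sketch of how the empirical partition distance d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}) of Eq. 12 could be computed is given below. The partition-assignment function assign is a hypothetical helper mapping each sample to its cell index in [K]; both samples are assumed to have the same size n.

# Sketch of the empirical partition distance of Eq. 12.
import numpy as np

def partition_distance(Z_source, Z_target, assign, K):
    n = len(Z_source)
    assert len(Z_target) == n, "Lemma C.1 assumes equal sample sizes"
    cs = np.bincount(assign(Z_source), minlength=K)   # n_i(S)
    ct = np.bincount(assign(Z_target), minlength=K)   # n_i(D_T_hat)
    return np.abs(cs / n - ct / n).sum()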

Proof.

In the following generalization statement, we use (f𝜽,𝒛)\ell(f_{\bm{\theta}},\bm{z}) to denote the error obtained with input 𝒛\bm{z} and hypothesis function f𝜽f_{\bm{\theta}} for better illustration. By definition we have,

T(f𝜽)S(f𝜽):=𝔼𝒛𝒟T(f𝜽,𝒛)𝔼𝒛𝒟S(f𝜽,𝒛).\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}}):=\mathbb{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime})-\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z}). (13)

Then we partition the source distribution \mathcal{D}_{S} into the K sets. Let n_{i} be the number of points that fall into the partition C_{i}; (n_{1},...,n_{K}) follows a multinomial distribution with parameters (p_{s}(C_{1}),...,p_{s}(C_{K})). We use the parallel notation n^{\prime}_{i}\sim(p_{t}(C_{1}),...,p_{t}(C_{K})) for the target distribution \mathcal{D}_{T}. Since

\mathrm{E}_{\bm{z}\sim\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z}) =\sum_{i=1}^{K}\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}(\ell(f_{\bm{\theta}},\bm{z})|\bm{z}\in C_{i})p_{s}(C_{i}) (14)
E𝒛𝒟T(f𝜽,𝒛)\displaystyle\mathrm{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime}) =i=1K𝔼𝒛𝒟T((f𝜽,𝒛)|𝒛Ci)pt(Ci)\displaystyle=\sum_{i=1}^{K}\mathbb{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})p_{t}(C_{i})

and thus we have

T(f𝜽)S(f𝜽)=\displaystyle\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})= i=1K𝔼((f𝜽,𝒛)|𝒛Ci)pt(Ci)𝔼((f𝜽,𝒛)|𝒛Ci)ps(Ci)\displaystyle\sum_{i=1}^{K}\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})p_{t}(C_{i})-\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z})|\bm{z}\in C_{i})p_{s}(C_{i}) (15)
±𝔼((f𝜽,𝒛)|𝒛Ci)ps(Ci)\displaystyle\pm\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})p_{s}(C_{i})
=\displaystyle= i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(pt(Ci)ps(Ci))\displaystyle\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)(p_{t}(C_{i})-p_{s}(C_{i}))
+i=1K[𝔼((f𝜽,𝒛)|𝒛Ci)𝔼((f𝜽,𝒛)|𝒛Ci)]ps(Ci)\displaystyle+\sum_{i=1}^{K}\left[\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})-\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z})|\bm{z}\in C_{i})\right]p_{s}(C_{i})
\displaystyle\leq i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(pt(Ci)ps(Ci))+ϵ(S,A).\displaystyle\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)(p_{t}(C_{i})-p_{s}(C_{i}))+\epsilon(S,A).

Now sample empirical distributions S and \hat{\mathcal{D}}_{T}, of size n each, from \mathcal{D}_{S} and \mathcal{D}_{T}, respectively. Let (n_{1},...,n_{K}) be the multinomial counts of samples falling into the cells C_{i}, and use the parallel notation n^{\prime}_{i} for the target distribution. Define

d(ϵ,K)(S,𝒟^T):=iK|ni(S)nni(𝒟^T)n|.d_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}):=\sum_{i}^{K}\left|\frac{n_{i}(S)}{n}-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right|. (16)

Further, we have

i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(pt(Ci)ps(Ci))i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(ni(𝒟^T)nni(S)n)\displaystyle\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)(p_{t}(C_{i})-p_{s}(C_{i}))-\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)\left(\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}-\frac{n_{i}(S)}{n}\right) (17)
=i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(pt(Ci)ni(𝒟^T)n)i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(ps(Ci)ni(S)n)\displaystyle=\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)\left(p_{t}(C_{i})-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right)-\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)\left(p_{s}(C_{i})-\frac{n_{i}(S)}{n}\right)
Mi=1K|pt(Ci)ni(𝒟^T)n|+Mi=1K|ps(Ci)ni(S)n|.\displaystyle\ \leq M\sum_{i=1}^{K}\left|p_{t}(C_{i})-\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}\right|+M\sum_{i=1}^{K}\left|p_{s}(C_{i})-\frac{n_{i}(S)}{n}\right|.

With the Bretagnolle-Huber-Carol inequality we have

i=1K|ni(S)nps(Ci)|2Kln2+2ln(1/δ)n.\sum_{i=1}^{K}\left|\frac{n_{i}(S)}{n}-p_{s}(C_{i})\right|\leq\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}}. (18)

To integrate these two inequalities, we have

i=1K(𝔼((f𝜽,𝒛)\displaystyle\sum_{i=1}^{K}(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime}) |𝒛Ci))(pt(Ci)ps(Ci))\displaystyle|\bm{z}^{\prime}\in C_{i}))(p_{t}(C_{i})-p_{s}(C_{i})) (19)
i=1K(𝔼((f𝜽,𝒛)|𝒛Ci))(ni(𝒟^T)nni(S)n)+2M2Kln2+2ln(1/δ)n\displaystyle\leq\sum_{i=1}^{K}\left(\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})\right)\left(\frac{n_{i}(\hat{\mathcal{D}}_{T})}{n}-\frac{n_{i}(S)}{n}\right)+2M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}}
Md(ϵ,K)(S,𝒟^T)+2M2Kln2+2ln(1/δ)n.\displaystyle\leq Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+2M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}}.

In summary with probability 1δ1-\delta we have

T(f𝜽)S(f𝜽)+Md(ϵ,K)(S,𝒟^T)+ϵ(S)+2M2Kln2+2ln(1/δ)n\mathcal{L}_{T}(f_{\bm{\theta}})\leq\mathcal{L}_{S}(f_{\bm{\theta}})+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+\epsilon(S)+2M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}} (20)

which completes the proof. ∎
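As a side note, the Bretagnolle-Huber-Carol concentration used in Eq. 18 can be checked numerically. The following Monte Carlo sketch, with an arbitrary choice of cell probabilities, estimates the violation rate, which should not exceed \delta.

# Monte Carlo sanity check (not part of the proof) of the
# Bretagnolle-Huber-Carol bound in Eq. 18.
import numpy as np

rng = np.random.default_rng(0)
K, n, delta, trials = 10, 2000, 0.05, 5000
p = rng.dirichlet(np.ones(K))                      # an arbitrary p_s(C_1..C_K)
bound = np.sqrt((2 * K * np.log(2) + 2 * np.log(1 / delta)) / n)

counts = rng.multinomial(n, p, size=trials)        # n_i(S) for each trial
l1_dev = np.abs(counts / n - p).sum(axis=1)
print("empirical violation rate:", np.mean(l1_dev > bound), "<= delta =", delta)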

With the result on the domain (distribution) shift and the relationship between sharpness and robustness, we can move forward to the final OOD generalization error bound. First, we state the ID robustness bound of Xu & Mannor (2012) as follows.

Lemma C.2 (Xu et al.Xu & Mannor (2012)).

Assume that for all hh\in\mathcal{H} and z𝒵z\in\mathcal{Z}, the loss is upper bounded by MM i.e., (h,z)M\ell(h,z)\leq M. If the learning algorithm 𝒜\mathcal{A} is (K,ϵ())(K,\epsilon(\cdot))-robust, then for any δ>0\delta>0, with probability at least 1δ1-\delta over an iid draw of nn samples S=(zi)i=1nS=\left(z_{i}\right)_{i=1}^{n}, it holds that:

𝔼z[(𝒜S,z)]1ni=1n(𝒜S,zi)+ϵ(S)+M2Kln2+2ln(1/δ)n\mathbb{E}_{z}\left[\ell\left(\mathcal{A}_{S},z\right)\right]\leq\frac{1}{n}\sum_{i=1}^{n}\ell\left(\mathcal{A}_{S},z_{i}\right)+\epsilon(S)+M\sqrt{\frac{2K\ln 2+2\ln(1/\delta)}{n}}

As the concluding result, we prove the following by combining Lemma C.2 and Lemma C.1.

Theorem C.3 (Restatement of Theorem 3.1).

Let 𝒟^T\hat{\mathcal{D}}_{T} be the empirical distribution of size nn drawn from 𝒟T\mathcal{D}_{T}. The loss \ell is upper bounded by M. With probability at least 1δ1-\delta (over the choice of the samples), for every f𝛉f_{\bm{\theta}} trained on SS, we have

T(𝜽)^S(𝜽)+Md(ϵ,K)(S,𝒟^T)+2ϵ(S)+3M2Kln2+2ln(2/δ)n.\mathcal{L}_{T}(\bm{\theta})\leq\widehat{\mathcal{L}}_{S}(\bm{\theta})+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T})+2\epsilon(S)+3M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}}. (21)
Proof.

Firstly, with Lemma C.1 and probability at least 1-\frac{\delta}{2}, we have

T(f𝜽)S(f𝜽)+Md(ϵ,K)(𝒟^S,𝒟^T)+2M2Kln2+2ln(2/δ)n+ϵ(S)\mathcal{L}_{T}(f_{\bm{\theta}})\leq\mathcal{L}_{S}(f_{\bm{\theta}})+Md_{(\epsilon,K)}(\hat{\mathcal{D}}_{S},\hat{\mathcal{D}}_{T})+2M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}}+\epsilon(S)

Secondly, with Lemma C.2 (Theorem 3 of Xu & Mannor (2012)) and probability at least 1-\frac{\delta}{2}, we have

|S(f𝜽)^S(f𝜽)|ϵ(S)+M2Kln2+2ln(2/δ)n\left|\mathcal{L}_{S}(f_{\bm{\theta}})-\hat{\mathcal{L}}_{S}(f_{\bm{\theta}})\right|\leq\epsilon(S)+M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}}

By taking the union bound, we conclude our final result that with probability at least 1δ1-\delta

T(f𝜽)\displaystyle\mathcal{L}_{T}(f_{\bm{\theta}}) ^S(f𝜽)+3M2Kln2+2ln(2/δ)n+2ϵ(S)+Md(ϵ,K)(S,𝒟^T)\displaystyle\leq\hat{\mathcal{L}}_{S}(f_{\bm{\theta}})+3M\sqrt{\frac{2K\ln 2+2\ln(2/\delta)}{n}}+2\epsilon(S)+Md_{(\epsilon,K)}(S,\hat{\mathcal{D}}_{T}) (22)

Here ϵ(S)\epsilon(S) is the robustness constant that we can replace with any sharpness measure.

C.1 Proof to Corollary 3.2

Definition C.4.

dΔ(𝒟T;𝒟S):=2sup𝒜(f)𝒜Δ|Pr𝒟S(𝒜(f))Pr𝒟T(𝒜(f))|d_{\mathcal{F}\Delta\mathcal{F}}\left(\mathcal{D}_{T};\mathcal{D}_{S}\right):=2\sup_{\mathcal{A}(f)\in\mathcal{A}_{\mathcal{F}\Delta\mathcal{F}}}|\operatorname{Pr}_{\mathcal{D}_{S}}(\mathcal{A}(f))-\operatorname{Pr}_{\mathcal{D}_{T}}(\mathcal{A}(f))| and Δ\mathcal{F}\Delta\mathcal{F} is defined as:

Δ:={f(x)f(x):f,f}\mathcal{F}\Delta\mathcal{F}:=\{f(x)\oplus f^{\prime}(x):f,f^{\prime}\in\mathcal{F}\}

where \oplus is the XOR operator e.g. 𝕀(f(x)f(x))\mathbb{I}(f^{\prime}(x)\neq f(x)).
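One common way to approximate such a hypothesis-class divergence from finite samples is a domain-classifier proxy in the spirit of the proxy A-distance. The sketch below is an illustration of that idea only, not the exact \mathcal{F}\Delta\mathcal{F} construction analyzed in Zhao et al. (2018).

# Hedged illustration: a proxy estimate of a hypothesis-class divergence
# between two samples via a domain classifier (proxy A-distance style).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_divergence(X_src, X_tgt, seed=0):
    X = np.vstack([X_src, X_tgt])
    dom = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]   # domain labels
    Xtr, Xte, ytr, yte = train_test_split(X, dom, test_size=0.5, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    err = 1.0 - clf.score(Xte, yte)       # domain-classification test error
    return 2.0 * (1.0 - 2.0 * err)        # larger value = bigger shift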

Lemma C.5 (Lemma 2 Zhao et al. (2018)).

If Pdim()=d\operatorname{Pdim}(\mathcal{F})=d^{\prime}, then VCdim(Δ)2d\operatorname{VCdim}(\mathcal{F}\Delta\mathcal{F})\leq 2d^{\prime} .

Proposition C.6 (Zhao et al. (2018)).

Let \mathcal{F} be a hypothesis class with pseudo dimension \operatorname{Pdim}(\mathcal{F})=d^{\prime}. If \widehat{\mathcal{D}}_{S} is the empirical distribution generated with n i.i.d. samples from the source domain, and \widehat{\mathcal{D}}_{T} is the empirical distribution on the target domain generated from n samples without labels, then with probability at least 1-\delta, for all f\in\mathcal{F}, we have:

T(f)^S(f)\displaystyle\mathcal{L}_{T}(f)\leq\widehat{\mathcal{L}}_{S}(f) ++2dlogendn+log2δ2n\displaystyle+\mathcal{E}^{*}+\sqrt{\frac{2d^{\prime}\log\frac{en}{d^{\prime}}}{n}}+\sqrt{\frac{\log\frac{2}{\delta}}{2n}} (23)
+12dΔ(𝒟^T;𝒟^S)+42dln(2n)+ln4δn(Empirical div Error)\displaystyle+\underbrace{\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+4\sqrt{\frac{2d^{\prime}\ln(2n)+\ln\frac{4}{\delta}}{n}}}_{(\text{Empirical div Error})}

where =^S(f)+^T(f)\mathcal{E}^{*}=\widehat{\mathcal{L}}_{S}(f^{*})+\widehat{\mathcal{L}}_{T}(f^{*}) is the total error of best hypothesis ff^{*} over source and target domain.

Proof.

With Lemma 4 (Zhao et al., 2018), we have

T(f)^S(f)+12dΔ+\mathcal{L}_{T}(f)\leq\widehat{\mathcal{L}}_{S}(f)+\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}+\mathcal{E}^{*}

where

=inffS(f)+T(f).\mathcal{E}^{*}=\inf_{f^{\prime}\in\mathcal{F}}\mathcal{L}_{S}(f^{\prime})+\mathcal{L}_{T}(f^{\prime}).

Lemma 6 (Zhao et al., 2018), which is actually Lemma 1 in (Ben-David et al., 2010), shows the following results

dΔ(𝒟T;𝒟S)dΔ(𝒟^T;𝒟^S)+4VCdim(Δ)ln(2n)+ln2δn.d_{\mathcal{F}\Delta\mathcal{F}}\left(\mathcal{D}_{T};\mathcal{D}_{S}\right)\leq d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+4\sqrt{\frac{\text{VCdim}(\mathcal{F}\Delta\mathcal{F})\ln(2n)+\ln\frac{2}{\delta}}{n}}.

As suggested in Zhao et al. (2018), \text{VCdim}(\mathcal{F}\Delta\mathcal{F}) is at most 2d^{\prime}. Further, with Theorem 2 of Ben-David et al. (2010), with probability at least 1-\frac{\delta}{2} we have

T(f)\displaystyle\mathcal{L}_{T}(f) S(f)+12dΔ(𝒟^T;𝒟^S)+4VCdim(Δ)ln(2n)+ln2δn+\displaystyle\leq\mathcal{L}_{S}(f)+\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+4\sqrt{\frac{\text{VCdim}(\mathcal{F}\Delta\mathcal{F})\ln(2n)+\ln\frac{2}{\delta}}{n}}+\mathcal{E}^{*} (24)
S(f)+12dΔ(𝒟^T;𝒟^S)+42dln(2n)+ln2δn+\displaystyle\leq\mathcal{L}_{S}(f)+\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}\left(\widehat{\mathcal{D}}_{T};\widehat{\mathcal{D}}_{S}\right)+4\sqrt{\frac{2d^{\prime}\ln(2n)+\ln\frac{2}{\delta}}{n}}+\mathcal{E}^{*}

Using the in-domain generalization bound of Lemma 11.6 in Mohri et al. (2018), with probability at least 1-\frac{\delta}{2} we obtain

S(f)L^S(f)+M2dlogendn+Mlog1δ2n\mathcal{L}_{S}(f)\leq\widehat{L}_{S}(f)+M\sqrt{\frac{2d^{\prime}\log\frac{en}{d^{\prime}}}{n}}+M\sqrt{\frac{\log\frac{1}{\delta}}{2n}}

Note that in Zhao et al. (2018), M=1 for the normalized regression loss. Combining them all, we conclude the proof. ∎

Corollary C.7.

If K\to\infty and M=1, the domain shift bound |\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})| reduces to the (Empirical div Error) term in Proposition C.6, i.e.,

|T(f𝜽)S(f𝜽)|12dΔ(𝒟S;𝒟T)(Empirical div Error)|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})|\leq\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}(\mathcal{D}_{S};\mathcal{D}_{T})\leq\text{(Empirical div Error)} (25)
Proof.

According to the decomposition in Eq. 15, we have

T(f𝜽)S(f𝜽)=iK𝔼((f𝜽,𝒛)|𝒛Ci)pt(Ci)𝔼((f𝜽,𝒛)|𝒛Ci)ps(Ci)\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})=\sum_{i}^{K}\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z}^{\prime})|\bm{z}^{\prime}\in C_{i})p_{t}(C_{i})-\mathbb{E}(\ell(f_{\bm{\theta}},\bm{z})|\bm{z}\in C_{i})p_{s}(C_{i}) (26)

If K\to\infty, define the domain U:=\bigcup_{i=1}^{\infty}C_{i}. Equation (26) then becomes

T(f𝜽)S(f𝜽)\displaystyle\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}}) =𝒛U(f𝜽,𝒛)pt(𝒛)𝑑𝒛𝒛U(f𝜽,𝒛)ps(𝒛)𝑑𝒛\displaystyle=\int_{\bm{z}^{\prime}\in U}\ell(f_{\bm{\theta}},\bm{z}^{\prime})p_{t}(\bm{z}^{\prime})d\bm{z}-\int_{\bm{z}\in U}\ell(f_{\bm{\theta}},\bm{z})p_{s}(\bm{z})d\bm{z} (27)
=𝒛𝒟T(f𝜽,𝒛)pt(𝒛)𝑑𝒛𝒛𝒟S(f𝜽,𝒛)ps(𝒛)𝑑𝒛\displaystyle=\int_{\bm{z}^{\prime}\in\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime})p_{t}(\bm{z}^{\prime})d\bm{z}-\int_{\bm{z}\in\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z})p_{s}(\bm{z})d\bm{z}
=𝔼𝒛𝒟T(f𝜽,𝒛)𝔼𝒛𝒟S(f𝜽,𝒛).\displaystyle=\mathbb{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime})-\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z}).

In this case, we have,

|T(f𝜽)S(f𝜽)|\displaystyle\left|\mathcal{L}_{T}(f_{\bm{\theta}})-\mathcal{L}_{S}(f_{\bm{\theta}})\right| (28)
=|𝔼𝒛𝒟T(f𝜽,𝒛)𝔼𝒛𝒟S(f𝜽,𝒛)|\displaystyle=\left|\mathbb{E}_{\bm{z}^{\prime}\sim\mathcal{D}_{T}}\ell(f_{\bm{\theta}},\bm{z}^{\prime})-\mathbb{E}_{\bm{z}\sim\mathcal{D}_{S}}\ell(f_{\bm{\theta}},\bm{z})\right|
\displaystyle\leq\int_{0}^{\infty}\left|\operatorname{Pr}_{\mathcal{D}_{T}}\left(\ell(f(\bm{\theta},\bm{x}^{\prime}),y^{\prime})>t\right)-\operatorname{Pr}_{\mathcal{D}_{S}}\left(\ell(f(\bm{\theta},\bm{x}),y)>t\right)\right|dt
=01|Pr𝒟T((f(𝜽,𝒙),y)>t)Pr𝒟S((f(𝜽,𝒙),y)>t)|𝑑t(M=1)\displaystyle=\int_{0}^{1}\left|\operatorname{Pr}_{\mathcal{D}_{T}}\left(\ell(f(\bm{\theta},\bm{x}^{\prime}),y^{\prime})>t\right)-\operatorname{Pr}_{\mathcal{D}_{S}}\left(\ell(f(\bm{\theta},\bm{x}),y)>t\right)\right|dt~{}~{}~{}(M=1)
supt[0,1]supf(𝜽,)|Pr𝒟T((f(𝜽,𝒙),y)>t)Pr𝒟S((f(𝜽,𝒙),y)>t)|\displaystyle\leq\sup_{t\in[0,1]}\sup_{f(\bm{\theta},\cdot)\in\mathcal{F}}\left|\operatorname{Pr}_{\mathcal{D}_{T}}\left(\ell(f(\bm{\theta},\bm{x}^{\prime}),y^{\prime})>t\right)-\operatorname{Pr}_{\mathcal{D}_{S}}\left(\ell(f(\bm{\theta},\bm{x}),y)>t\right)\right|
sup𝒜(f)𝒜Δ|Pr𝒟T(𝒜(f))Pr𝒟S(𝒜(f))|\displaystyle\leq\sup_{\mathcal{A}(f)\in\mathcal{A}_{\mathcal{F}\Delta\mathcal{F}}}\left|\operatorname{Pr}_{\mathcal{\mathcal{D}}_{T}}(\mathcal{A}(f))-\operatorname{Pr}_{\mathcal{\mathcal{D}}_{S}}(\mathcal{A}(f))\right|
=12dΔ(𝒟S;𝒟T)(Empirical div error)\displaystyle=\frac{1}{2}d_{\mathcal{F}\Delta\mathcal{F}}(\mathcal{D}_{S};\mathcal{D}_{T})\leq(\text{Empirical div error})

where 𝒜Δ\mathcal{A}_{\mathcal{F}\Delta\mathcal{F}} represents a learning algorithm under the hypothesis Δ={f(x)f(x):f,f}\mathcal{F}\Delta\mathcal{F}=\{f(x)\oplus f^{\prime}(x):f,f^{\prime}\in\mathcal{F}\}, which completes the proof. ∎

Appendix D Sharpness and Robustness

Lemma D.1 (positive definiteness of Hessian).

Let \hat{w}_{\min} be the minimum absolute value of the entries of \bm{\hat{w}} and \bm{x}^{*}=\arg\min_{\bm{x}\in\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}\ell(f({\bm{\hat{w}}},A,\bm{x}),\bm{y}). For any A=(\bm{a}_{1},...,\bm{a}_{m}) with \bm{a}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))~\forall i\in[m], let \tilde{\sigma}(d,m)=\lambda_{\min}(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii})>0 denote the minimum eigenvalue, where G_{ij}=\sum_{t=0}^{\infty}\lambda^{2}_{d,t}(\sigma)N^{2}_{d,t}Q_{t}^{(d)}(\langle\bm{a}_{i},\bm{a}_{j}\rangle/\sqrt{d}) is the polynomial product constant. If m=\text{Poly}(d), the Hessian H(\bm{x}^{*}) can be lower bounded by

𝔼𝒙Unif(𝕊d1(d))H(𝒙)w^min2σ~(d,m)M~1dId.\mathbb{E}_{\bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}H(\bm{x}^{*})\succeq\frac{\hat{w}^{2}_{\min}\tilde{\sigma}(d,m)\tilde{M}_{1}}{d}I_{d}. (29)
Proof.

As suggested in Lemma D.6 of Zhong et al. (2017), we have a similar result to bound the local positive definiteness of the Hessian. By the previous definition, the Hessian w.r.t. \bm{x} satisfies the following partial order

H(\bm{x}^{*}) =\frac{D^{2}_{f}(\bm{x}^{*},y^{*})}{d}\left(\sum_{i=1}^{m}\sum_{j=1}^{m}\hat{w}_{i}\hat{w}_{j}\bm{a}_{i}\bm{a}^{\top}_{j}\sigma^{\prime}(\bm{a}_{i}^{\top}\bm{x}^{*})\sigma^{\prime}(\bm{a}_{j}^{\top}\bm{x}^{*})\right) (30)
\succeq\frac{\tilde{M}_{1}}{d}\left(\sum_{i=1}^{m}\sum_{j=1}^{m}\hat{w}_{i}\hat{w}_{j}\bm{a}_{i}\bm{a}^{\top}_{j}\sigma^{\prime}(\bm{a}_{i}^{\top}\bm{x}^{*})\sigma^{\prime}(\bm{a}_{j}^{\top}\bm{x}^{*})\right)
\succeq\frac{\hat{w}^{2}_{\min}\tilde{M}_{1}}{d}\left(\sum_{i=1}^{m}\sum_{j=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{j}\sigma^{\prime}(\bm{a}_{i}^{\top}\bm{x}^{*})\sigma^{\prime}(\bm{a}_{j}^{\top}\bm{x}^{*})\right)

For the ReLU activation function, we further have

σ(𝒂𝒙)σ(𝒂𝒙)\sigma(\bm{a}^{\top}\bm{x}^{*})\geq\sigma^{\prime}(\bm{a}^{\top}\bm{x}^{*}) (31)

We expand \sigma\in L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d}) (where \tau_{d} is the distribution of \langle\bm{x}_{1},\bm{x}_{2}\rangle/\sqrt{d}) in Gegenbauer polynomials:

σ(x)=t=0λd,t(σ)Nd,tQt(d)(dx).\sigma(x)=\sum_{t=0}^{\infty}\lambda_{d,t}(\sigma)N_{d,t}Q_{t}^{(d)}(\sqrt{d}x). (32)

Let A=(\bm{a}_{1},...,\bm{a}_{m})\in\mathbb{R}^{m\times d} and assume \bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})). Lemma C.7 in Mei & Montanari (2022) suggests that

𝑼=(𝔼𝒙Unif(𝕊d1(d))[σ(𝒂i,𝒙/d)σ(𝒂i,𝒙/d)])i,j[m]m×m\displaystyle\bm{U}=\left(\mathbb{E}_{\bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}[\sigma(\langle\bm{a}_{i},\bm{x}^{*}\rangle/\sqrt{d})\sigma(\langle\bm{a}_{i},\bm{x}^{*}\rangle/\sqrt{d})]\right)_{i,j\in[m]}\in\mathbb{R}^{m\times m} (33)

which shows that the matrix U is positive definite. Similarly, taking the expectation over \bm{x}^{*}, the term inside the bracket on the RHS of (30) can be rewritten as

𝔼𝒙Unif(𝕊d1(d))\displaystyle\operatorname{\mathbb{E}}_{\bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))} (i=1mj=1m𝒂i𝒂jσ(𝒂i𝒙)σ(𝒂j𝒙))\displaystyle\left(\sum_{i=1}^{m}\sum_{j=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{j}\sigma^{\prime}(\bm{a}_{i}^{\top}\bm{x}^{*})\sigma^{\prime}(\bm{a}_{j}^{\top}\bm{x}^{*})\right) (34)
i=1mj=1m𝒂i𝒂j𝔼𝒙[σ(𝒂i𝒙/d)σ(𝒂j𝒙/d)]\displaystyle\succeq\sum_{i=1}^{m}\sum_{j=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{j}\mathbb{E}_{\bm{x}^{*}}[\sigma(\bm{a}_{i}^{\top}\bm{x}^{*}/\sqrt{d})\sigma(\bm{a}_{j}^{\top}\bm{x}^{*}/\sqrt{d})]

Besides, we have the following property of Gegenbauer polynomials,

  1.

    For 𝒙,𝒚𝕊d1(d)\bm{x},\bm{y}\in\mathbb{S}^{d-1}(\sqrt{d})

    Qj(d)(𝒙,),Qk(d)(𝒚,)L2(𝕊d1(d),γd)=1Nd,kδjkQk(d)(𝒙,𝒚).\left\langle Q_{j}^{(d)}(\langle\bm{x},\cdot\rangle),Q_{k}^{(d)}(\langle\bm{y},\cdot\rangle)\right\rangle_{L^{2}\left(\mathbb{S}^{d-1}(\sqrt{d}),\gamma_{d}\right)}=\frac{1}{N_{d,k}}\delta_{jk}Q_{k}^{(d)}(\langle\bm{x},\bm{y}\rangle).
  2.

    For 𝒙,𝒚𝕊d1(d)\bm{x},\bm{y}\in\mathbb{S}^{d-1}(\sqrt{d})

    Qk(d)(𝒙,𝒚)=1Nd,ki=1Nd,kYki(d)(𝒙)Yki(d)(𝒚).Q_{k}^{(d)}(\langle\bm{x},\bm{y}\rangle)=\frac{1}{N_{d,k}}\sum_{i=1}^{N_{d,k}}Y_{ki}^{(d)}(\bm{x})Y_{ki}^{(d)}(\bm{y}).

where the spherical harmonics \{Y^{(d)}_{lj}\}_{1\leq j\leq N_{d,l}} form an orthonormal basis, which gives the following result

𝔼𝒙[σ(𝒂i𝒙/d)σ(𝒂j𝒙/d)]\displaystyle\mathbb{E}_{\bm{x}^{*}}[\sigma(\bm{a}_{i}^{\top}\bm{x}^{*}/\sqrt{d})\sigma(\bm{a}_{j}^{\top}\bm{x}^{*}/\sqrt{d})] =t=0λd,t2(σ)Nd,t2𝔼𝒙Qt(d)(𝒂i,𝒙/d)Qt(d)(𝒂j,𝒙/d)\displaystyle=\sum_{t=0}^{\infty}\lambda^{2}_{d,t}(\sigma)N^{2}_{d,t}\mathbb{E}_{\bm{x}^{*}}Q_{t}^{(d)}(\langle\bm{a}_{i},\bm{x}^{*}\rangle/\sqrt{d})Q_{t}^{(d)}(\langle\bm{a}_{j},\bm{x}^{*}\rangle/\sqrt{d}) (35)
=t=0λd,t2(σ)Nd,t2Qt(d)(𝒂i,𝒂j/d)=Gij<.\displaystyle=\sum_{t=0}^{\infty}\lambda^{2}_{d,t}(\sigma)N^{2}_{d,t}Q_{t}^{(d)}(\langle\bm{a}_{i},\bm{a}_{j}\rangle/\sqrt{d})=G_{ij}<\infty.

Hence, we have

i=1mj=1m𝒂i𝒂j𝔼𝒙[σ(𝒂i𝒙/d)σ(𝒂j𝒙/d)]=i=1m𝒂i𝒂iGii+𝒪(1/d)Var(𝒂)\displaystyle\sum_{i=1}^{m}\sum_{j=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{j}\mathbb{E}_{\bm{x}^{*}}[\sigma(\bm{a}_{i}^{\top}\bm{x}^{*}/\sqrt{d})\sigma(\bm{a}_{j}^{\top}\bm{x}^{*}/\sqrt{d})]=\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii}+\mathcal{O}(1/d)\text{Var}(\bm{a}) (36)

Since m=\text{Poly}(d) and \{\bm{a}_{i}\}_{i\in[m]} are i.i.d., we have \text{rank}\left(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii}\right)=\text{rank}\left(AA^{\top}\right)=d. Letting \tilde{\sigma}(d,m)=\mathbb{E}_{\bm{a}}\lambda_{\min}(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii})>0, we have

𝔼𝒙H(𝒙)w^min2σ~(d,m)M~1dId\mathbb{E}_{\bm{x}^{*}}H(\bm{x}^{*})\succeq\frac{\hat{w}^{2}_{\min}\tilde{\sigma}(d,m)\tilde{M}_{1}}{d}I_{d} (37)

Lemma D.2.

Let \bigcup_{k=1}^{K}C_{k} be the whole domain; the notion of (\epsilon,K)-robustness is described by

ϵ(S,A)maxCik=1KCksup𝒛,𝒛iCi,𝒛S|𝒘^,A(𝒛)𝒘^,A(𝒛i)|.\epsilon(S,A)\triangleq\max_{C_{i}\subset\bigcup_{k=1}^{K}C_{k}}\sup_{\bm{z},\bm{z}_{i}^{\prime}\in C_{i},\bm{z}\in S}\left|\ell_{\bm{\hat{w}},A}(\bm{z})-\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{\prime})\right|.

Define \mathcal{M}_{i} to be the set of global minima of \ell_{\bm{\hat{w}},A} within C_{i}, where

\mathcal{M}_{i}\triangleq\{\bm{z}^{*}(A)\,|\,\bm{z}^{*}(A)\in\arg\min_{\bm{z}\in C_{i}}\ell_{\bm{\hat{w}},A}(\bm{z})\}.

Suppose that for a maximum training-loss point

\bm{z}_{i}(A)\in\arg\max_{\bm{z}\in C_{i}\cap S}\ell_{\bm{\hat{w}},A}(\bm{z}),

there exists \bm{z}_{i}^{*}(A), defined by

\bm{z}_{i}^{*}(A)\triangleq\arg\min_{\bm{z}\in\mathcal{M}_{i}}\|\bm{z}-\bm{z}_{i}(A)\|,

such that \|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\frac{\rho_{i}(L)}{L} holds almost surely for any A\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})), and that for any A, \ell_{\bm{\hat{w}},A}(\bm{z}) is L-Hessian Lipschitz continuous. Then \epsilon(S,A) can be bounded by

ϵ(S,A)maxi[K]ρi(L)22L2(2𝒘^,A(𝒛i(A))+4ρi(L)3)\epsilon(S,A)\leq\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(\left\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\right\|+\frac{4\rho_{i}(L)}{3}\right)
Proof.

Let \bm{z}\in S denote a training pair (\bm{x},y) and \bm{z}_{i}^{\prime} denote any point in the set C_{i}. We define the local minima set \mathcal{M}_{i} (which is the set of global minima within C_{i}). Assume that for some maximum training-loss point \bm{z}_{i}(A)\in\arg\max_{\bm{z}\in C_{i}\cap S}\ell_{\bm{\hat{w}},A}(\bm{z}), there exists \bm{z}_{i}^{*}(A)\in\mathcal{M}_{i} almost surely for all A\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) such that

\bm{z}_{i}^{*}(A)=\arg\min_{\bm{z}\in\mathcal{M}_{i}}\|\bm{z}_{i}(A)-\bm{z}\|~{}~{}\text{s.t.}~{}~{}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\frac{\rho_{i}(L)}{L} (38)

By definition, ϵ(S,A)\epsilon(S,A) can be rewritten as

ϵ(S,A)\displaystyle\epsilon(S,A) =maxi[K]sup𝒛,𝒛iCi,𝒛S|𝒘^,A(𝒛)𝒘^,A(𝒛i)|\displaystyle=\max_{i\in[K]}\sup_{\bm{z},\bm{z}_{i}^{\prime}\in C_{i},\bm{z}\in S}\left|\ell_{\bm{\hat{w}},A}(\bm{z})-\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{\prime})\right| (39)
=maxi[K]sup𝒛CiS,𝒛i𝒘^,A(𝒛)𝒘^,A(𝒛)\displaystyle=\max_{i\in[K]}\sup_{\bm{z}\in C_{i}\cap S,\bm{z}^{*}\in\mathcal{M}_{i}}\ell_{\bm{\hat{w}},A}(\bm{z})-\ell_{\bm{\hat{w}},A}(\bm{z}^{*})
=maxi[K]𝒘^,A(𝒛i(A))𝒘^,A(𝒛i(A)).\displaystyle=\max_{i\in[K]}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))-\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A)).

According to Lemma B.3, we have

ϵ(S,A)=\displaystyle\epsilon(S,A)= maxi[K]𝒘^,A(𝒛i(A))𝒘^,A(𝒛i(A))\displaystyle\max_{i\in[K]}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))-\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A)) (40)
(i)\displaystyle\overset{(i)}{\leq} maxi[K]𝒘^,A(𝒛i(A)),𝒛i(A)𝒛i(A)\displaystyle\max_{i\in[K]}\left\langle\nabla\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A)),\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\right\rangle
+122𝒘^,A(𝒛i(A))(𝒛i(A)𝒛i(A)),𝒛i(A)𝒛i(A)+L6𝒛i(A)𝒛3\displaystyle+\frac{1}{2}\left\langle\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))(\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)),\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\right\rangle+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}^{*}\|^{3}
=\displaystyle= maxi[K]122𝒘^,A(𝒛i(A))(𝒛i(A)𝒛i(A)),𝒛i(A)𝒛i(A)+L6𝒛i(A)𝒛i(A)3\displaystyle\max_{i\in[K]}\frac{1}{2}\left\langle\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))(\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)),\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\right\rangle+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{3}
\displaystyle\leq maxi[K]122𝒘^,A(𝒛i(A))𝒛i(A)𝒛i(A)2\displaystyle\max_{i\in[K]}\frac{1}{2}\left\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))\right\|\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{2}
+L6𝒛i(A)𝒛i(A)3(Cauchy-Schwarz)\displaystyle+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{3}~{}~{}~{}~{}~{}(\textit{Cauchy-Schwarz})

where (i) is supported by the fact that \nabla\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))=0. With the Lipschitz continuity of the Hessian we have

2𝒘^,A(𝒛i(A))L𝒛i(A)𝒛i(A)+2𝒘^,A(𝒛i(A)).\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))\|\leq L\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|+\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\|. (41)

Overall, we have

ϵ(S,A)\displaystyle\epsilon(S,A) maxi[K]122𝒘^,A(𝒛i(A))𝒛i(A)𝒛i(A)2+L6𝒛i(A)𝒛i(A)3\displaystyle\leq\max_{i\in[K]}\frac{1}{2}\left\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}^{*}(A))\right\|\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{2}+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{3} (42)
maxi[K]12(2𝒘^,A(𝒛i(A))+L𝒛i(A)𝒛i(A))𝒛i(A)𝒛i(A)2\displaystyle\leq\max_{i\in[K]}\frac{1}{2}\left(\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\|+L\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\right)\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{2}
+L6𝒛i(A)𝒛i(A)3\displaystyle+\frac{L}{6}\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|^{3}
maxi[K]12(2𝒘^,A(𝒛i(A))+ρi(L))ρi(L)2L2+ρi(L)36L2\displaystyle\leq\max_{i\in[K]}\frac{1}{2}\left(\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\|+\rho_{i}(L)\right)\frac{\rho_{i}(L)^{2}}{L^{2}}+\frac{\rho_{i}(L)^{3}}{6L^{2}}
=maxi[K]ρi(L)22L22𝒘^,A(𝒛i(A))+2ρi(L)33L2\displaystyle=\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\|\nabla^{2}\ell_{\bm{\hat{w}},A}(\bm{z}_{i}(A))\|+\frac{2\rho_{i}(L)^{3}}{3L^{2}}

which completes the proof. ∎
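A one-dimensional numeric sanity check of this bound is given below, using the toy loss \ell(z)=\frac{1}{2}hz^{2}+\frac{c}{6}z^{3}, whose Hessian is |c|-Lipschitz with local minimum z^{*}=0; the constants h and c are arbitrary illustrative choices, and \rho is set to L\,|z_{i}-z^{*}| so that the assumption of the lemma holds with equality.

# 1-D sanity check of Lemma D.2's bound for a toy loss with Hessian-Lipschitz
# constant L = |c| and local minimum z* = 0 (illustrative example only).
import numpy as np

h, c = 2.0, 0.5                      # ell(z) = 0.5*h*z^2 + (c/6)*z^3
L = abs(c)
ell = lambda z: 0.5 * h * z**2 + (c / 6.0) * z**3
hess = lambda z: h + c * z

rng = np.random.default_rng(0)
for z_i in rng.uniform(0.0, 1.0, size=5):
    rho = L * abs(z_i)               # so |z_i - z*| = rho / L exactly
    gap = ell(z_i) - ell(0.0)        # epsilon-type loss gap within the cell
    bound = rho**2 / (2 * L**2) * (abs(hess(z_i)) + 4 * rho / 3)
    print(f"z_i={z_i:.3f}  gap={gap:.4f}  bound={bound:.4f}  ok={gap <= bound}")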

Lemma D.3 (Lemma 2.1 Bourin et al. (2013)).

For every matrix in 𝕄n+m+\mathbb{M}_{n+m}^{+} partitioned into blocks, we have a decomposition

[AXXB]=U[A000]U+V[000B]V\left[\begin{array}[]{cc}A&X\\ X^{*}&B\end{array}\right]=U\left[\begin{array}[]{cc}A&0\\ 0&0\end{array}\right]U^{*}+V\left[\begin{array}[]{ll}0&0\\ 0&B\end{array}\right]V^{*}

for some unitaries U,V𝕄n+mU,V\in\mathbb{M}_{n+m}.

Lemma D.4.

Given an arbitrary positive semi-definite matrix partitioned into blocks as in Lemma D.3, we have

[AXXB]sAs+Bs\left\|\left[\begin{array}[]{cc}A&X\\ X^{*}&B\end{array}\right]\right\|_{s}\leq\|A\|_{s}+\|B\|_{s}

for all symmetric norms.

Proof.

In lemma D.3 we have

[AXXB]=U[A000]U+V[000B]V\left[\begin{array}[]{cc}A&X\\ X^{*}&B\end{array}\right]=U\left[\begin{array}[]{cc}A&0\\ 0&0\end{array}\right]U^{*}+V\left[\begin{array}[]{cc}0&0\\ 0&B\end{array}\right]V^{*}

for some unitaries U,V\in\mathbb{M}_{n+m}. The result then follows from the fact that symmetric norms are unitarily invariant, non-decreasing functions of the singular values: writing f=\|\cdot\|_{s}:\mathbb{M}\mapsto\mathbb{R} and applying the triangle inequality, we have

f([AXXB])f(U[A000]U)+f(V[000B]V)f\left(\left[\begin{array}[]{cc}A&X\\ X^{*}&B\end{array}\right]\right)\leq f\left(U\left[\begin{array}[]{cc}A&0\\ 0&0\end{array}\right]U^{*}\right)+f\left(V\left[\begin{array}[]{cc}0&0\\ 0&B\end{array}\right]V^{*}\right)
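As a quick numeric illustration of Lemma D.4, the sketch below draws a random positive semi-definite block matrix and compares spectral norms (the spectral norm being one example of a symmetric norm).

# Numeric check of Lemma D.4 for the spectral norm.
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
M = rng.standard_normal((n + m, n + m))
P = M @ M.T                                   # PSD matrix in M_{n+m}^+
A, B = P[:n, :n], P[n:, n:]
lhs = np.linalg.norm(P, 2)                    # spectral norm of the block matrix
rhs = np.linalg.norm(A, 2) + np.linalg.norm(B, 2)
print(lhs <= rhs + 1e-12, lhs, rhs)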

Lemma D.5.

For \bm{a}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) and a vector \bm{x}\in\mathbb{R}^{d} with norm \|\bm{x}\|\equiv\sqrt{R(d)}, where R(d)\geq d, we have

(𝒙,𝒂2𝒂2)min{2arccos(1R(d))π,|12d4πR(d)exp(14d9)|}\mathbb{P}(\langle\bm{x},\bm{a}\rangle^{2}\geq\|\bm{a}\|^{2})\geq\min\left\{\frac{2\arccos(\frac{1}{\sqrt{R(d)}})}{\pi},\left|1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}\exp(\frac{1}{4d-9})\right|\right\} (43)
Proof.

Since \|\bm{a}\|=\sqrt{d}, writing \bm{a}=\sqrt{d}\,\bm{e} with \bm{e} the unit vector in the direction of \bm{a}, we have

(𝒙,𝒂2𝒂2)=(𝒙,𝒆21)\mathbb{P}(\langle\bm{x},\bm{a}\rangle^{2}\geq\|\bm{a}\|^{2})=\mathbb{P}(\langle\bm{x},\bm{e}\rangle^{2}\geq 1) (44)

Similarly, writing \bm{x}=\sqrt{R(d)}\,\bm{s} with \bm{s} a unit vector, we have

(𝒙,𝒆21)=(𝒔,𝒆21R(d))\mathbb{P}(\langle\bm{x},\bm{e}\rangle^{2}\geq 1)=\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right) (45)

Solving 𝒔,𝒆2=1R(d)\langle\bm{s},\bm{e}\rangle^{2}=\frac{1}{R(d)}, we get

𝒔,𝒆2=cos2ϕ=1R(d)ϕ=arccos±1R(d)\langle\bm{s},\bm{e}\rangle^{2}=\cos^{2}\phi=\frac{1}{R(d)}\Rightarrow\phi=\arccos\pm\frac{1}{\sqrt{R(d)}} (46)

In this case, the probability converges to 1 as R(d) increases. As is well known, the surface area of \mathbb{S}^{d-1} with radius r equals

Ad=rd12πd/2Γ(d2)A_{d}=r^{d-1}\frac{2\pi^{d/2}}{\Gamma\left(\frac{d}{2}\right)} (47)

The area A_{d}^{\mathrm{cap}}(r) of the spherical cap with polar angle \phi equals

Adcap(r)=0ϕAd1(rsinθ)rdθ=2π(d1)/2Γ(d12)rd10ϕsind2θdθ.A_{d}^{\mathrm{cap}}(r)=\int_{0}^{\phi}A_{d-1}(r\sin\theta)r\mathrm{d}\theta=\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}r^{d-1}\int_{0}^{\phi}\sin^{d-2}\theta\mathrm{d}\theta. (48)

where Γ(n12)=(2(n1))!22(n1)(n1)!π\Gamma\left(n-\frac{1}{2}\right)=\frac{(2(n-1))!}{2^{2(n-1)}(n-1)!}\sqrt{\pi}.

1) When d=1d=1, almost surely we have

((xe)2e2)=(x21)=1.\mathbb{P}((x\cdot e)^{2}\geq e^{2})=\mathbb{P}(x^{2}\geq 1)=1. (49)

2) When d=2, the unit vector \bm{e} is uniformly distributed on the circle \mathbb{S}^{1} (r=1), and the probability equals the fraction of the circle spanned by the admissible angles between \bm{s} and \bm{e}:

(𝒔,𝒆21R(d))=20ϕr𝑑θπ=2ϕπ.\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)=\frac{2\int_{0}^{\phi}rd\theta}{\pi}=\frac{2\phi}{\pi}. (50)

3) When d3d\geq 3, the probability equals

(|𝒔,𝒆|1R(d))=Adcap(r)12Ad=1A~dcap(r)12Ad\mathbb{P}\left(|\langle\bm{s},\bm{e}\rangle|\geq\frac{1}{\sqrt{R(d)}}\right)=\frac{A_{d}^{\mathrm{cap}}(r)}{\frac{1}{2}A_{d}}=1-\frac{\tilde{A}_{d}^{\mathrm{cap}}(r)}{\frac{1}{2}A_{d}} (51)

where A~dcap(r)\tilde{A}_{d}^{\mathrm{cap}}(r) is the remaining area of cutting the hyperspherical caps in half of the sphere,

A~dcap(r)\displaystyle\tilde{A}_{d}^{\mathrm{cap}}(r) =2π(d1)/2Γ(d12)ϕπ2sind2θdθ\displaystyle=\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}\int_{\phi}^{\frac{\pi}{2}}\sin^{d-2}\theta\mathrm{d}\theta (52)
2π(d1)/2Γ(d12)ϕπ2sinθdθ\displaystyle\leq\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}\int_{\phi}^{\frac{\pi}{2}}\sin\theta\mathrm{d}\theta
2π(d1)/2Γ(d12)(cosπ2+cosϕ)\displaystyle\leq\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}(-\cos\frac{\pi}{2}+\cos\phi)
=2π(d1)/2Γ(d12)cosϕ.\displaystyle=\frac{2\pi^{(d-1)/2}}{\Gamma\left(\frac{d-1}{2}\right)}\cos\phi.

3-a) If dd is even then Γ(d2)=(d21)!\Gamma\left(\frac{d}{2}\right)=\left(\frac{d}{2}-1\right)!, so

Γ(d12)Γ(d2)=(d2)!π2d2(d21)!2=π2d2(d2d22).\frac{\Gamma\left(\frac{d-1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)}=\frac{(d-2)!\sqrt{\pi}}{2^{d-2}\left(\frac{d}{2}-1\right)!^{2}}=\frac{\sqrt{\pi}}{2^{d-2}}\left(\begin{array}[]{c}d-2\\ \frac{d-2}{2}\end{array}\right). (53)

Robbins’ bounds (Robbins, 1955) imply that for any positive integer dd

4dπdexp(18d1)<(2dd)<4dπdexp(18d+1).\frac{4^{d}}{\sqrt{\pi d}}\exp\left(-\frac{1}{8d-1}\right)<\left(\begin{array}[]{c}2d\\ d\end{array}\right)<\frac{4^{d}}{\sqrt{\pi d}}\exp\left(-\frac{1}{8d+1}\right). (54)

So we have

\frac{2\tilde{A}_{d}^{\mathrm{cap}}(r)}{A_{d}} \leq\frac{2\Gamma\left(\frac{d}{2}\right)}{\sqrt{\pi}\Gamma\left(\frac{d-1}{2}\right)}\cos\phi=\frac{2^{d-1}}{\pi}\left(\begin{array}{c}d-2\\ \frac{d-2}{2}\end{array}\right)^{-1}\cos\phi (55)
<2d1ππ(d2)/22d2exp(14d9)cosϕ\displaystyle<\frac{2^{d-1}}{\pi}\frac{\sqrt{\pi(d-2)/2}}{2^{d-2}\exp(-\frac{1}{4d-9})}\cos\phi
<2d4πexp(14d9)cosϕ.\displaystyle<\frac{\sqrt{2d-4}}{\sqrt{\pi}\exp(-\frac{1}{4d-9})}\cos\phi.

So the probability will be

(𝒔,𝒆21R(d))>12d4πexp(14d9)cosϕ.\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)>1-\frac{\sqrt{2d-4}}{\sqrt{\pi}}\exp(\frac{1}{4d-9})\cos\phi. (56)

Suppose R(d)=kd,k>1R(d)=kd,k>1, we have

limd2d4πkdexp(14d9)=2kπ.\lim_{d\to\infty}\frac{\sqrt{2d-4}}{\sqrt{\pi kd}\exp(-\frac{1}{4d-9})}=\sqrt{\frac{2}{k\pi}}. (57)

3-b) Similarly, if dd is odd, then Γ(d12)=(d32)!\Gamma\left(\frac{d-1}{2}\right)=\left(\frac{d-3}{2}\right)!, so

Γ(d12)Γ(d2)=2d2(d32)!2(d2)!π=2d2(d2)π(d3d32)1.\frac{\Gamma\left(\frac{d-1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)}=\frac{2^{d-2}\left(\frac{d-3}{2}\right)!^{2}}{(d-2)!\sqrt{\pi}}=\frac{2^{d-2}}{(d-2)\sqrt{\pi}}\left(\begin{array}[]{c}d-3\\ \frac{d-3}{2}\end{array}\right)^{-1}. (58)

If d=3d=3 then

(𝒔,𝒆21R(d))12π2πcosϕ=11R(d).\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)\geq 1-\frac{2\sqrt{\pi}}{2\sqrt{\pi}}\cos\phi=1-\frac{1}{\sqrt{R(d)}}. (59)

If d>3d>3 then Robbins’ bounds imply that

πΓ(d12)2Γ(d2)=2d4d2(d3d32)1>π(d3)/2d2exp(14d11).\displaystyle\frac{\sqrt{\pi}\Gamma\left(\frac{d-1}{2}\right)}{2\Gamma\left(\frac{d}{2}\right)}=\frac{2^{d-4}}{d-2}\left(\begin{array}[]{c}d-3\\ \frac{d-3}{2}\end{array}\right)^{-1}>\frac{\sqrt{\pi(d-3)/2}}{d-2}\exp\left(\frac{1}{4d-11}\right). (60)

Thus, the probability will be at least

(𝒔,𝒆21R(d))>1d2π(d3)exp(14d11)cosϕ.\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)>1-\frac{d-2}{\sqrt{\pi(d-3)}}\exp(-\frac{1}{4d-11})\cos\phi. (61)

To simplify the result, we lower bound the probability uniformly for all d\geq 3:

\displaystyle\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right) >1-\frac{d-2}{\sqrt{\pi(d-3)}}\exp(-\frac{1}{4d-11})\cos\phi (62)
\displaystyle\geq 1-\frac{d-2}{\sqrt{\pi(d-3)}}\exp(\frac{1}{4d-11})\cos\phi
\displaystyle=1-\frac{d-2}{\sqrt{\pi(d-3)R(d)}}\exp(\frac{1}{4d-11})
\displaystyle>1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}\exp(\frac{1}{4d-9})

Overall, we have d+\forall d\in\mathbb{N}_{+},

(𝒔,𝒆21R(d))>min{2arccos(1R(d))π,|12d4πR(d)exp(14d9)|}.\mathbb{P}\left(\langle\bm{s},\bm{e}\rangle^{2}\geq\frac{1}{R(d)}\right)>\min\left\{\frac{2\arccos(\frac{1}{\sqrt{R(d)}})}{\pi},\left|1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}\exp(\frac{1}{4d-9})\right|\right\}. (63)
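A Monte Carlo sketch of the probability bounded in Lemma D.5 is given below; the choice R(d)=kd follows the discussion around Eq. 57, and the particular values of d and k are illustrative.

# Monte Carlo sketch of P(<x, a>^2 >= ||a||^2) for a ~ Unif(S^{d-1}(sqrt(d)))
# and a fixed x with ||x||^2 = R(d) = k*d, compared with the claimed bound.
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 20, 4.0, 200_000
R = k * d

g = rng.standard_normal((trials, d))
a = np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)   # ||a|| = sqrt(d)
x = np.zeros(d); x[0] = np.sqrt(R)                              # ||x||^2 = R(d)

emp = np.mean((a @ x) ** 2 >= d)                                # ||a||^2 = d
lower = min(2 * np.arccos(1 / np.sqrt(R)) / np.pi,
            abs(1 - np.sqrt(2 * d - 4) / np.sqrt(np.pi * R) * np.exp(1 / (4 * d - 9))))
print("empirical probability:", emp, "claimed lower bound:", lower)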

D.1 Proof to Eq. 7

Proof.

Let \mathcal{A}^{d}=\left(\bm{a}_{i}\right)_{i\in[m]}\overset{i.i.d.}{\sim}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})). We consider the random ReLU NN function class to be

relu(𝒜d)={f(𝒘,A,𝒙)=1di=1mwiσ(𝒙𝒂i):wi,i[m]}\mathcal{F}_{relu}(\mathcal{A}^{d})=\left\{f(\bm{w},A,\bm{x})=\frac{1}{\sqrt{d}}\sum_{i=1}^{m}w_{i}\sigma\left(\bm{x}^{\top}\bm{a}_{i}\right):w_{i}\in\mathbb{R},i\in[m]\right\}

where A=[𝒂1,,𝒂m]d×mA=[\bm{a}_{1},...,\bm{a}_{m}]\in\mathbb{R}^{d\times m}. The empirical minimizer of the source domain is

\bm{\hat{w}}=\arg\min_{\bm{w}\in\mathbb{R}^{m}}\frac{1}{n}\sum_{(\bm{x}_{i},y_{i})\in S}\ell(f(\bm{w},A,\bm{x}_{i}),y_{i})=\arg\min_{\bm{w}\in\mathbb{R}^{m}}\frac{1}{n}\sum_{\bm{z}_{i}\in S}\ell_{\bm{w},A}(\bm{z}_{i}). (64)

Then with the chain rule, the first derivative of any input 𝒙\bm{x} at 𝒘^\bm{\hat{w}} will be

\nabla_{\bm{x}}\ell\left(f(\bm{\hat{w}},A,\bm{x}),y\right) =\frac{\partial\ell\left(f(\bm{\hat{w}},A,\bm{x}),y\right)}{\partial f(\bm{\hat{w}},A,\bm{x})}\frac{\partial f(\bm{\hat{w}},A,\bm{x})}{\partial\sigma(A\bm{x})}\frac{\partial\sigma(A\bm{x})}{\partial\bm{x}} (65)
=Df1(𝒙,y)di=1mw^i𝒂i𝕀{𝒂i𝒙0}\displaystyle=\frac{D^{1}_{f}(\bm{x},y)}{\sqrt{d}}\sum_{i=1}^{m}\hat{w}_{i}\bm{a}_{i}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}\geq 0\right\}
=Df1(𝒙,y)d𝒘^σ(A𝒙)\displaystyle=\frac{D^{1}_{f}(\bm{x},y)}{\sqrt{d}}\bm{\hat{w}}^{\top}\sigma^{\prime}(A\bm{x})

where the short notation D^{1}_{f}(\bm{x},y) denotes the first-order derivative of \ell with respect to f(\bm{\hat{w}},A,\bm{x}), and \sigma^{\prime}(A\bm{x})\in\mathbb{R}^{m\times d} is the Jacobian matrix of \sigma(A\bm{x}) w.r.t. the input \bm{x}. Analogously, the second-order derivative \partial^{2}\ell/\partial f^{2} is denoted by D^{2}_{f}(\bm{x},y); thus the Hessian is

𝒙2(f(𝒘^,A,𝒙),y)\displaystyle\nabla^{2}_{\bm{x}}\ell(f(\bm{\hat{w}},A,\bm{x}),y) =Df2(𝒙,y)di=1mj=1mw^iw^j𝒂i𝒂j𝕀{𝒂i𝒙0,𝒂j𝒙0}\displaystyle=\frac{D^{2}_{f}(\bm{x},y)}{d}\sum_{i=1}^{m}\sum_{j=1}^{m}\hat{w}_{i}\hat{w}_{j}\bm{a}_{i}\bm{a}^{\top}_{j}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}\geq 0~{},\bm{a}_{j}^{\top}\bm{x}\geq 0\right\} (66)
=Df2(𝒙,y)dσ(A𝒙)𝒘^𝒘^σ(A𝒙)\displaystyle=\frac{D^{2}_{f}(\bm{x},y)}{d}\sigma^{\prime}(A\bm{x})^{\top}\bm{\hat{w}}\bm{\hat{w}}^{\top}\sigma^{\prime}(A\bm{x})

Similarly, we have

y2(f(𝒘^,A,𝒙),y)=Dy2(𝒙,y)(sgn(y))2Df2(𝒙,y)\nabla^{2}_{y}\ell(f(\bm{\hat{w}},A,\bm{x}),y)=D^{2}_{y}(\bm{x},y)\cdot(\mathrm{sgn}(y))^{2}\overset{*}{\leq}D^{2}_{f}(\bm{x},y) (67)

where \mathrm{sgn}(y) is the sign function. The starred inequality (*) holds under our choice of the family of loss functions:

  1.

    Homogeneity in regression, i.e. L1, MSE, MAE, Huber Loss, we have |Dy2(𝒙,y)|=|Df2(𝒙,y)||D^{2}_{y}(\bm{x},y)|=|D^{2}_{f}(\bm{x},y)|;

  2.

    (Binary) Cross-Entropy Loss:

    D^{2}_{y}(\bm{x},y)=\partial^{2}\ell/\partial y^{2}=0, since the (binary) cross-entropy loss is linear in the label y;
  3.

    Negative Log Likelihood (NLL) loss: Dy2(𝒙,y)=0D^{2}_{y}(\bm{x},y)=0.

Besides, as a convex loss function, Dy2(𝒙,y)0D^{2}_{y}(\bm{x},y)\geq 0. Hence, the range of Dy2(𝒙,y)D^{2}_{y}(\bm{x},y) will be [0,Df2(𝒙,y)][0,D^{2}_{f}(\bm{x},y)].

To combine with robustness, we denote \bm{z}=(\bm{x};y)\in\mathbb{R}^{d+1}. Therefore, the Hessian w.r.t. \bm{z} will be

H(𝒛|S,A):=[𝒙2(f(𝒘^,A,𝒙),y)2(f(𝒘^,A,𝒙),y))y𝒙(2(f(𝒘^,A,𝒙),y)y𝒙)y2((f(𝒘^,A,𝒙),y))].H(\bm{z}|S,A):=\begin{bmatrix}\nabla^{2}_{\bm{x}}\ell(f(\bm{\hat{w}},A,\bm{x}),y)&\frac{\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y))}{\partial y\partial\bm{x}}\\ \left(\frac{\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)}{\partial y\partial\bm{x}}\right)^{\top}&\nabla^{2}_{y}(\ell(f(\bm{\hat{w}},A,\bm{x}),y))\end{bmatrix}. (68)

With Lemma D.4, the spectral norm of Hessian 𝒛\bm{z} will be bounded by

H(𝒛)𝒙2(f(𝒘^,A,𝒙),y)+|y2(f(𝒘^,A,𝒙),y)|.\left\|H(\bm{z})\right\|\leq\left\|\nabla^{2}_{\bm{x}}\ell(f(\bm{\hat{w}},A,\bm{x}),y)\right\|+\left|\nabla^{2}_{y}\ell(f(\bm{\hat{w}},A,\bm{x}),y)\right|. (69)

The first term in (69) can be further bounded by

Df2(𝒙,y)σ(A𝒙)𝒘^𝒘^σ(A𝒙)\displaystyle\left\|D^{2}_{f}(\bm{x},y)\sigma^{\prime}(A\bm{x})^{\top}\bm{\hat{w}}\bm{\hat{w}}^{\top}\sigma^{\prime}(A\bm{x})\right\| |Df2(𝒙,y)|𝒘^𝒘^σ(A𝒙)σ(A𝒙)\displaystyle\leq|D^{2}_{f}(\bm{x},y)|\left\|\bm{\hat{w}}\bm{\hat{w}}^{\top}\right\|\left\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right\| (70)
=Df2(𝒙,y)𝒘^2σ(A𝒙)σ(A𝒙)\displaystyle=D^{2}_{f}(\bm{x},y)\|\bm{\hat{w}}\|^{2}\left\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right\|

where the convexity of the loss function (\forall\bm{x},y,~D^{2}_{f}(\bm{x},y)\geq 0) supports the last equality. For the rightmost factor we have

σ(A𝒙)σ(A𝒙)σ(A𝒙)σ(A𝒙)F=tr(σ(A𝒙)σ(A𝒙)).\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\|\leq\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\|_{F}=\operatorname{tr}\left(\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right). (71)

In summary, we have the following inequality:

H(𝒛)\displaystyle\left\|H(\bm{z})\right\| Df2(𝒙,y)(1d𝒘^2σ(A𝒙)σ(A𝒙)+1)\displaystyle\leq D^{2}_{f}(\bm{x},y)\left(\frac{1}{d}\|\bm{\hat{w}}\|^{2}\left\|\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right\|+1\right) (72)
Df2(𝒙,y)(1d𝒘^2tr(σ(A𝒙)σ(A𝒙))+1)\displaystyle\leq D^{2}_{f}(\bm{x},y)\left(\frac{1}{d}\|\bm{\hat{w}}\|^{2}\operatorname{tr}\left(\sigma^{\prime}(A\bm{x})\sigma^{\prime}(A\bm{x})^{\top}\right)+1\right)
=Df2(𝒙,y)(1d𝒘^2j=1m𝒂j2𝕀{𝒂j𝒙0}+1)\displaystyle=D^{2}_{f}(\bm{x},y)\left(\frac{1}{d}\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}\geq 0\right\}+1\right)

By Lemma D.2, the bound depends on some \bm{z}_{i}=(\bm{x}_{i},y_{i})\in S\cap C_{i}:

ϵ(S,A)\displaystyle\epsilon(S,A) maxi[K]ρi(L)22L2(H(𝒛i(A))+4ρi(L)3)\displaystyle\leq\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(\|H(\bm{z}_{i}(A))\|+\frac{4\rho_{i}(L)}{3}\right) (73)
\displaystyle\leq\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(D^{2}_{f}(\bm{x}_{i},y_{i})\left(\frac{\|\bm{\hat{w}}\|^{2}}{d}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{i}\geq 0\right\}+1\right)+\frac{4\rho_{i}(L)}{3}\right)
ρmax(L)22L2(Df2(𝒙k,yk)d𝒘^2j=1m𝒂j2𝕀{𝒂j𝒙k0}+4ρmax(L)3+o~κ)\displaystyle\leq\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(\frac{D^{2}_{f}(\bm{x}_{k},y_{k})}{d}\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\right\}+\frac{4\rho_{\max}(L)}{3}+\tilde{o}_{\kappa}\right)

where \rho_{\max}(L)=\max\{\rho_{i}(L)\}_{i=1}^{K}, and \tilde{o}_{\kappa}=\mathcal{O}(\ell^{\prime\prime}(f,\bm{x}_{k},y_{k})\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\right\}d/m) is a lower-order term compared to the first term since m\gg d. In the last inequality, the maximum is attained at some (\bm{x}_{k},y_{k})\in\hat{C}\in\{C_{i}\}_{i\in[K]}. Because (\bm{x}_{k},y_{k})\in S is one of the training samples, k\in[n], there must exist n^{\prime}\in[0,n] such that

Df2(𝒙k,yk)𝒘^2j=1m\displaystyle D^{2}_{f}(\bm{x}_{k},y_{k})\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m} 𝒂j2𝕀{𝒂j𝒙k0}\displaystyle\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\right\} (74)
=𝒘^2nnk=1nDf2(𝒙k,yk)dj=1m𝒂j2𝕀{𝒂j𝒙k0}.\displaystyle=\|\bm{\hat{w}}\|^{2}\frac{n^{\prime}}{n}\sum_{k=1}^{n}\frac{D^{2}_{f}(\bm{x}_{k},y_{k})}{d}\sum_{j=1}^{m}\|\bm{a}_{j}\|^{2}\mathbb{I}\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\}.

Recall that the sharpness of parameter 𝒘^\bm{\hat{w}} is defined by

κ(𝒘^,S,A)\displaystyle\kappa(\bm{\hat{w}},S,A) :=𝒘^2tr[HS,A(𝒘^)]\displaystyle:=\|\bm{\hat{w}}\|^{2}\operatorname{tr}[H_{S,A}(\bm{\hat{w}})] (75)
=\|\bm{\hat{w}}\|^{2}\frac{1}{n}\sum_{j=1}^{n}\frac{D^{2}_{f}(\bm{x}_{j},y_{j})}{d}\operatorname{tr}\left(\sigma(A\bm{x}_{j})\sigma(A\bm{x}_{j})^{\top}\right)
=𝒘^21nj=1nDf2(𝒙j,yj)di=1m(𝒂i𝒙j)2𝕀{𝒂i𝒙j0}.\displaystyle=\|\bm{\hat{w}}\|^{2}\frac{1}{n}\sum_{j=1}^{n}\frac{D^{2}_{f}(\bm{x}_{j},y_{j})}{d}\sum_{i=1}^{m}(\bm{a}_{i}^{\top}\bm{x}_{j})^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\}.

Let \xi_{i}=\bm{a}_{i}^{\top}\bm{x}\sim D(\xi) and let q_{i} denote the expected number of indices with \xi_{i}>0, where D(\xi) is some rotationally invariant distribution, e.g., a uniform or normal distribution. Under this circumstance, the sample mean of the \xi_{i} still obeys a distribution from the same family, D(t\xi). Thus, we have

(j=1mξi𝕀{ξi0}j=1m𝒂j2𝕀{ξi0})\displaystyle\mathbb{P}\left(\sum_{j=1}^{m}\xi_{i}\mathbb{I}\left\{\xi_{i}\geq 0\right\}\geq\sum_{j=1}^{m}\|\bm{a}_{j}\|^{2}\mathbb{I}\left\{\xi_{i}\geq 0\right\}\right) =(j=1qiξij=1qi𝒂j2)\displaystyle=\mathbb{P}\left(\sum_{j=1}^{q_{i}}\xi_{i}\geq\sum_{j=1}^{q_{i}}\|\bm{a}_{j}\|^{2}\right) (76)
=((𝒂𝒙)2𝒂2)=𝔼𝒙p(𝒙)\displaystyle=\mathbb{P}((\bm{a}^{\top}\bm{x})^{2}\geq\|\bm{a}\|^{2})=\mathbb{E}_{\bm{x}}p(\bm{x})

With Eq. 43, with probability at least

\mathbb{P}((\bm{a}^{\top}\bm{x})^{2}\geq\|\bm{a}\|^{2})\geq\min\left\{\frac{2}{\pi}\arccos(R(d)^{-\frac{1}{2}}),\left|1-\frac{\sqrt{2d-4}}{\sqrt{\pi R(d)}}e^{\frac{1}{4d-9}}\right|\right\} (77)

the following inequality holds,

ϵ(S,A)\displaystyle\epsilon(S,A) ρmax(L)22L2(nκ(𝒘^,S,A)+4ρmax(L)3+o~κ)\displaystyle\leq\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\frac{4\rho_{\max}(L)}{3}+\tilde{o}_{\kappa}\right) (78)
=ρmax(L)22L2([n+𝒪(dm)]κ(𝒘^,S,A)+43ρmax(L))\displaystyle=\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(\left[n^{\prime}+\mathcal{O}\left(\frac{d}{m}\right)\right]\kappa(\bm{\hat{w}},S,A)+\frac{4}{3}\rho_{\max}(L)\right)
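As a sanity check of the closed form in Eq. 75, the sketch below computes \kappa(\bm{\hat{w}},S,A) for the random-feature model under the squared loss \ell=(f-y)^{2} (so D^{2}_{f}=2), once via the explicitly formed empirical-loss Hessian in \bm{w} and once via the closed form; the toy data and the ridge fit are illustrative assumptions.

# Sketch: sharpness kappa(w_hat, S, A) of Eq. 75 for the random-feature ReLU
# model under squared loss. The trace of the explicit Hessian in w should
# match the closed form.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 300, 15, 60

def sphere(k, dim):
    g = rng.standard_normal((k, dim))
    return np.sqrt(dim) * g / np.linalg.norm(g, axis=1, keepdims=True)

A, X = sphere(m, d), sphere(n, d)
y = X[:, 0] ** 2
Phi = np.maximum(X @ A.T, 0.0) / np.sqrt(d)          # rows: sigma(A x_j)/sqrt(d)
w_hat = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(m), Phi.T @ y)

# Empirical-loss Hessian w.r.t. w: (1/n) sum_j 2 * phi_j phi_j^T.
H_w = 2.0 * Phi.T @ Phi / n
kappa_trace = np.linalg.norm(w_hat) ** 2 * np.trace(H_w)

# Closed form of Eq. 75 with D^2_f = 2.
pre = X @ A.T                                        # (n, m) of a_i^T x_j
kappa_closed = (np.linalg.norm(w_hat) ** 2 / n) * np.sum(2.0 / d * (pre ** 2) * (pre >= 0))
print(kappa_trace, kappa_closed)                     # the two should match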

D.2 Proof to Corollary 3.5

Corollary D.6 (Restatement of Corollary 3.5).

Let w^min\hat{w}_{\min} be the minimum value of |𝐰^||\bm{\hat{w}}|. Suppose 𝐱Unif(𝕊d1(d))\forall\bm{x}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) and |2(f(𝐰^,A,𝐱),y)/f2||\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)/\partial f^{2}| is bounded by [M~1,M~2][\tilde{M}_{1},\tilde{M}_{2}]. If m>dm>d, ρmax(L)<(w^min2M~1σ~(d,m))/(2d)\rho_{\max}(L)<(\hat{w}^{2}_{\min}\tilde{M}_{1}\tilde{\sigma}(d,m))/(2d) for any A=(𝐚1,,𝐚m),𝐚iUnif(𝕊d1(d))i[m]A=(\bm{a}_{1},...,\bm{a}_{m}),\bm{a}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))\forall i\in[m], taking expectation over all 𝐱jUnif(𝕊d1(d))\bm{x}_{j}\in\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) in SS, we have

𝔼S,A[ϵ(S,A)]𝔼S,A7ρmax(L)26L2(nκ(𝒘^,S,A)+M~2).\mathbb{E}_{S,A}\left[\epsilon(S,A)\right]\leq\mathbb{E}_{S,A}\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\tilde{M}_{2}\right). (79)

where \tilde{\sigma}(d,m)=\mathbb{E}_{\bm{a}}\lambda_{\min}(\sum_{i=1}^{m}\bm{a}_{i}\bm{a}^{\top}_{i}G_{ii})>0 is the minimum eigenvalue and G_{ii} is the product constant of the Gegenbauer polynomials

Gij=t=0λd,t2(σ)Nd,t2Qt(d)(𝒂i,𝒂j/d).G_{ij}=\sum_{t=0}^{\infty}\lambda^{2}_{d,t}(\sigma)N^{2}_{d,t}Q_{t}^{(d)}(\langle\bm{a}_{i},\bm{a}_{j}\rangle/\sqrt{d}).
Proof.

In our main theorem, with some probability, we have the following relation

ϵ(S,A)ρmax(L)22L2(nκ(𝒘^,S,A)+4ρmax(L)3+dκ).\epsilon(S,A)\leq\frac{\rho_{\max}(L)^{2}}{2L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\frac{4\rho_{\max}(L)}{3}+d_{\kappa}\right).

So we are concerned with the relation between \kappa(\bm{\hat{w}},S,A) and the second term. If \rho_{i}(L)<\kappa(\bm{\hat{w}},S,A), the RHS is dominated by the sharpness term \kappa(\bm{\hat{w}},S,A), i.e., the main effect comes from the sharpness. As suggested in Lemma D.1, we have

𝔼𝒙Unif(𝕊d1(d))H(𝒙)w^min2M~1σ~A(m,d)dId\mathbb{E}_{\bm{x}^{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))}H(\bm{x}^{*})\succeq\frac{\hat{w}^{2}_{\min}\tilde{M}_{1}\tilde{\sigma}_{A}(m,d)}{d}I_{d} (80)

where 𝒙\bm{x}^{*} is the global minimum over the whole set. As defined in (38), the following condition holds true

𝒛i(A)i,𝒛i(A)𝒛i(A)ρi(L)L,\exists\bm{z}_{i}^{*}(A)\in\mathcal{M}_{i},\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\frac{\rho_{i}(L)}{L}, (81)

and, by the Lipschitz continuity of the Hessian, the following relation holds almost surely for arbitrary \bm{x}:

𝔼𝒛i(A)H(𝒛i(A))H(𝒛i(A))\displaystyle\mathbb{E}_{\bm{z}_{i}^{*}(A)}\|H(\bm{z}_{i}(A))-H(\bm{z}_{i}^{*}(A))\| L𝒛i(A)𝒛i(A)ρi(L)\displaystyle\leq L\|\bm{z}_{i}(A)-\bm{z}_{i}^{*}(A)\|\leq\rho_{i}(L) (82)
w^min2M~1σ~(d,m)2dId\displaystyle\leq\left\|\frac{\hat{w}^{2}_{\min}\tilde{M}_{1}\tilde{\sigma}(d,m)}{2d}I_{d}\right\|
𝔼A,𝒙12H(𝒙)\displaystyle\leq\mathbb{E}_{A,\bm{x}^{*}}\frac{1}{2}\|H(\bm{x}^{*})\|
𝔼𝒛i(A)12H(𝒛i(A)).\displaystyle\leq\mathbb{E}_{\bm{z}_{i}^{*}(A)}\frac{1}{2}\|H(\bm{z}_{i}^{*}(A))\|.

Obviously, H(𝒛i(A))>2ρi(L)\|H(\bm{z}_{i}^{*}(A))\|>2\rho_{i}(L). Following Lemma A.2 of Zhang et al. (2019a), for 𝒛i(A),𝒛(A)Ci\bm{z}_{i}^{*}(A),\forall\bm{z}(A)\in C_{i}, we have a similar result that

σ~min(H(𝒛(A)))σ~min(H(𝒛i(A)))H(𝒛(A))H(𝒛i(A))ρi(L)\tilde{\sigma}_{\min}(H(\bm{z}(A)))\geq\tilde{\sigma}_{\min}(H(\bm{z}_{i}^{*}(A)))-\|H(\bm{z}(A))-H(\bm{z}_{i}^{*}(A))\|\geq\rho_{i}(L) (83)

where σ~min\tilde{\sigma}_{\min} denotes the minimum singular value. With Lemma D.2, we know that

𝔼S,Aϵ(S,A)𝔼S,Amaxi[K]ρi(L)22L2(H(𝒛i(A))+4ρi(L)3).\mathbb{E}_{S,A}\epsilon(S,A)\leq\mathbb{E}_{S,A}\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(\|H(\bm{z}_{i}(A))\|+\frac{4\rho_{i}(L)}{3}\right). (84)

We also have another condition that

|Df2(𝒙,y)|:=|2(f(𝒘^,A,𝒙),y)f2|[M~1,M~2],𝒙,y.|D^{2}_{f}(\bm{x},y)|:=\left|\frac{\partial^{2}\ell(f(\bm{\hat{w}},A,\bm{x}),y)}{\partial f^{2}}\right|\in[\tilde{M}_{1},\tilde{M}_{2}],\forall\bm{x},y. (85)

Combining all these results, we finally have

𝔼S,Aϵ(S,A)𝔼S,Amaxi[K]ρi(L)22L2(1+43)H(𝒛i(A))\displaystyle\mathbb{E}_{S,A}\epsilon(S,A)\leq\mathbb{E}_{S,A}\max_{i\in[K]}\frac{\rho_{i}(L)^{2}}{2L^{2}}\left(1+\frac{4}{3}\right)\|H(\bm{z}_{i}(A))\| (86)
𝔼S,A7ρmax(L)26L2(Df2(𝒙k,yk)d𝒘^2j=1m𝒂j2𝕀{𝒂j𝒙k0}+M~2).\displaystyle\leq\mathbb{E}_{S,A}\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(\frac{D^{2}_{f}(\bm{x}_{k},y_{k})}{d}\|\bm{\hat{w}}\|^{2}\sum_{j=1}^{m}\left\|\bm{a}_{j}\right\|^{2}\mathbb{I}\left\{\bm{a}_{j}^{\top}\bm{x}_{k}\geq 0\right\}+\tilde{M}_{2}\right).

Recall the definition of κ(𝒘^,S,A)\kappa(\bm{\hat{w}},S,A) in the main theorem that

𝔼S,Aκ(𝒘^,S,A)𝔼{𝒙}n𝒘^21nj=1nDf2(𝒙j,yj)di=1m(𝒂i𝒙j)2𝕀{𝒂i𝒙j0}\mathbb{E}_{S,A}\kappa(\bm{\hat{w}},S,A)\leq\mathbb{E}_{\{\bm{x}\}^{n}}\|\bm{\hat{w}}\|^{2}\frac{1}{n}\sum_{j=1}^{n}\frac{D^{2}_{f}(\bm{x}_{j},y_{j})}{d}\sum_{i=1}^{m}(\bm{a}_{i}^{\top}\bm{x}_{j})^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\} (87)

Looking at the second sum, we have

𝔼{𝒙j}n,{𝒂i}mi=1m(𝒂i𝒙j)2𝕀{𝒂i𝒙j0}\displaystyle\mathbb{E}_{\{\bm{x}_{j}\}^{n},\{\bm{a}_{i}\}^{m}}\sum_{i=1}^{m}(\bm{a}_{i}^{\top}\bm{x}_{j})^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\} =i=1m𝔼𝒙j𝔼𝒂i𝒂i2𝒙j2cos2(β)𝕀{𝒂i𝒙j0}\displaystyle=\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\|\bm{x}_{j}\|^{2}\cos^{2}(\beta)\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\} (88)
=i=1m𝔼𝒙j𝔼𝒂i𝒂i2𝕀{𝒂i𝒙j0}dcos2(β).\displaystyle=\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\}d\cos^{2}(\beta).

Suppose 𝒙\bm{x} and 𝒂\bm{a} are i.i.d. from Unif(𝕊d1(1))\text{Unif}(\mathbb{S}^{d-1}(1)), let u=𝒙,𝒂u=\langle\bm{x},\bm{a}\rangle, we have a well-known result that

𝔼u[u2]\displaystyle\mathbb{E}_{u}[u^{2}] =𝔼[𝒙,𝒂2]\displaystyle=\mathbb{E}[\langle\bm{x},\bm{a}\rangle^{2}] (89)
=𝔼[𝒙2𝒂2cos2(β)]\displaystyle=\mathbb{E}[\|\bm{x}\|^{2}\|\bm{a}\|^{2}\cos^{2}(\beta)]
=𝔼[cos2(β)]=1d,𝒙,𝒂d\displaystyle=\mathbb{E}[\cos^{2}(\beta)]=\frac{1}{d},~{}~{}~{}~{}\bm{x},\bm{a}\in\mathbb{R}^{d}

Therefore, in (88),

i=1m𝔼𝒙j𝔼𝒂i𝒂i2𝕀{𝒂i𝒙j0}dcos2(β)=i=1m𝔼𝒙j𝔼𝒂i𝒂i2𝕀{𝒂i𝒙j0}\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\}d\cos^{2}(\beta)=\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\mathbb{I}\left\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\right\} (90)

and we have (based on proof of main theorem),

𝔼S,Aϵ(S,A)\displaystyle\mathbb{E}_{S,A}\epsilon(S,A) (91)
7ρmax(L)26L2(𝒘^2nnj=1nDf2(𝒙j,yj)di=1m𝔼𝒙j𝔼𝒂i𝒂i2𝕀{𝒂i𝒙j0}+M~2)\displaystyle\leq\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(\|\bm{\hat{w}}\|^{2}\frac{n^{\prime}}{n}\sum_{j=1}^{n}\frac{D^{2}_{f}(\bm{x}_{j},y_{j})}{d}\sum_{i=1}^{m}\mathbb{E}_{\bm{x}_{j}}\mathbb{E}_{\bm{a}_{i}}\|\bm{a}_{i}\|^{2}\mathbb{I}\{\bm{a}_{i}^{\top}\bm{x}_{j}\geq 0\}+\tilde{M}_{2}\right)
𝔼S,A7ρmax(L)26L2(nκ(𝒘^,S,A)+M~2)\displaystyle\leq\mathbb{E}_{S,A}\frac{7\rho_{\max}(L)^{2}}{6L^{2}}\left(n^{\prime}\kappa(\bm{\hat{w}},S,A)+\tilde{M}_{2}\right)
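As a quick numerical sanity check of the identity 𝔼u[u2]=1/d\mathbb{E}_{u}[u^{2}]=1/d used in the proof above (an illustration only, not part of the formal argument), the following minimal Python sketch estimates the expectation by Monte Carlo; the dimension and sample size are arbitrary choices.

import numpy as np

# Monte Carlo check of E[<x, a>^2] = 1/d for x, a drawn i.i.d. from Unif(S^{d-1}(1)).
rng = np.random.default_rng(0)
d, n_samples = 50, 200_000

def unit_sphere(n, d):
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

x, a = unit_sphere(n_samples, d), unit_sphere(n_samples, d)
print(np.mean(np.sum(x * a, axis=1) ** 2), 1 / d)   # the two values should nearly match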

Appendix E Case Study

To better illustrate our theorems, we present two cases that make the intuition concrete. The first is a very basic model, ridge regression. Ridge regression penalizes the 2\ell_{2} norm of the weights, providing a straightforward way to reduce the "variance" of the model and avoid overfitting. In this case, the mechanism is equivalent to reducing the model's sharpness.

Example E.1.

In ridge regression models, the robustness constant ϵ\epsilon has an inverse relationship with the regularization parameter β\beta: as β\beta\uparrow, the learned minimum tends to be flatter (κ\kappa\downarrow) and the model less sensitive (ϵ\epsilon\downarrow). Following the previous notation, where ϵ(S,A)\epsilon(S,A) denotes the robustness and κ(𝛉^,S)\kappa(\bm{\hat{\theta}},S) the sharpness on training set SS, we have

τ>0,c(0,n],ϵ(S,A)cκ(𝜽^,S)+o~d\tau>0,~{}~{}c\in(0,n],~{}~{}\epsilon(S,A)\leq c\kappa(\bm{\hat{\theta}},S)+\tilde{o}_{d}

where o~d\tilde{o}_{d} is a much smaller order than κ(𝛉^,S)\kappa(\bm{\hat{\theta}},S).

E.1 Ridge regression

We consider a generic response model as stated in Ali et al. (2019).

𝒚|𝜽(X𝜽,σ2I)\bm{y}|\bm{\theta}_{*}\sim(X\bm{\theta}_{*},\sigma^{2}I)

The ridge regression minimization problem is defined by

min𝜽12nX𝜽𝒚2+β2𝜽2,Xn×d,n<d.\min_{\bm{\theta}}\frac{1}{2n}\|X\bm{\theta}-\bm{y}\|^{2}+\frac{\beta}{2}\|\bm{\theta}\|^{2},X\in\mathbb{R}^{n\times d},n<d. (92)

The least-square solution of ridge regression is

𝜽^=(XX+nβI)1X𝒚.\bm{\hat{\theta}}=\left(X^{\top}X+n\beta I\right)^{-1}X^{\top}\bm{y}. (93)
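For concreteness, the following minimal numpy sketch (illustrative only; the data sizes and noise level are assumptions) computes the ridge estimator of Eq. (93) and the sharpness expression derived below in Eq. (103).

import numpy as np

rng = np.random.default_rng(0)
n, d, beta, sigma = 40, 100, 0.5, 0.1

X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + sigma * rng.standard_normal(n)

# Least-squares ridge solution: theta_hat = (X^T X + n*beta*I)^{-1} X^T y, cf. Eq. (93)
theta_hat = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)

# Sharpness kappa(theta_hat, S) = ||theta_hat||^2 * (mean_j ||x_j||^2 + beta), cf. Eq. (103)
kappa = np.sum(theta_hat ** 2) * (np.mean(np.sum(X ** 2, axis=1)) + beta)
print(kappa)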

With minimizer 𝜽^\bm{\hat{\theta}}, we now focus on its geometry w.r.t. 𝒙\bm{x}. Let S={𝒛}in=(X,𝒚)S=\{\bm{z}\}_{i}^{n}=(X,\bm{y}) be the training set, (𝒵,Σ,ρ)(\mathcal{Z},\Sigma,\rho) be a measure space. Consider the bounded sample set 𝒵\mathcal{Z} such that

M>0,ρ(𝒵)<+.\exists M>0,~{}~{}\rho(\mathcal{Z})<+\infty. (94)

The 𝒵\mathcal{Z} can be partitioned into KK disjoint sets {Ci}i[K]\{C_{i}\}_{i\in[K]}. By definition, we have robustness defined by each partition CiC_{i},

𝒛,𝒛Ci,|(𝜽^,𝒛)(𝜽^,𝒛)|ϵ(S,A).\forall\bm{z},\bm{z}^{\prime}\in C_{i},\left|\ell(\bm{\hat{\theta}},\bm{z})-\ell(\bm{\hat{\theta}},\bm{z}^{\prime})\right|\leq\epsilon(S,A). (95)

For this convex function (𝜽^,𝒛)\ell(\bm{\hat{\theta}},\bm{z}), we have the following upper bound in the whole sample domain

ϵ(S)\displaystyle\epsilon(S) =maxi[K]sup𝒛,𝒛Ci|(𝜽^,𝒛)(𝜽^,𝒛)|\displaystyle=\max_{i\in[K]}\sup_{\bm{z},\bm{z}^{\prime}\in C_{i}}\left|\ell(\bm{\hat{\theta}},\bm{z})-\ell(\bm{\hat{\theta}},\bm{z}^{\prime})\right| (96)
=maxi[K]sup𝒛Ci(𝜽^,𝒛)(𝜽^,𝒛iCi)\displaystyle=\max_{i\in[K]}\sup_{\bm{z}\in C_{i}}\ell(\bm{\hat{\theta}},\bm{z})-\ell(\bm{\hat{\theta}},\bm{z}_{i}^{*}\in C_{i})
sup𝒛j𝒵S(𝜽^,𝒛j)(𝜽^,𝒛)\displaystyle\leq\sup_{\bm{z}_{j}\in\mathcal{Z}\cap S}\ell(\bm{\hat{\theta}},\bm{z}_{j})-\ell(\bm{\hat{\theta}},\bm{z}^{*})

where 𝒛i,𝒛\bm{z}_{i}^{*},\bm{z}^{*} are the global minimum points in CiC_{i} and in the whole domain 𝒵=iKCi\mathcal{Z}=\bigcup_{i}^{K}C_{i}, respectively. 𝒛j\bm{z}_{j} is the training data point with the maximum loss difference from the optimum. Specifically, it and the augmented form of 𝜽^\bm{\hat{\theta}} can be expressed as

𝒛=[x1,,xd,y],𝜽^+=[θ^1,,θ^d,1],d+1.\bm{z}=[x_{1},...,x_{d},y]^{\top},~{}~{}~{}\bm{\hat{\theta}}_{+}=[\hat{\theta}_{1},...,\hat{\theta}_{d},-1]^{\top},\in\mathbb{R}^{d+1}.

Then the loss difference can be rewritten as

𝜽^+(𝒛)=(𝜽^+𝒛)2=(𝜽^𝒙y)2H(𝜽^+(𝒛)) is a P.S.D. matrix.\ell_{\bm{\hat{\theta}}_{+}}(\bm{z})=(\bm{\hat{\theta}}_{+}^{\top}\bm{z})^{2}=(\bm{\hat{\theta}}^{\top}\bm{x}-y)^{2}\Rightarrow H(\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}))\text{ is a P.S.D. matrix}.

It is a convex function with regards to 𝒛\bm{z} such that

ϵ(S)\displaystyle\epsilon(S) sup𝒛jS𝜽^+(𝒛j)𝜽^+(𝒛)\displaystyle\leq\sup_{\bm{z}_{j}\in S}\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}_{j})-\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}) (97)
=sup𝒛jS𝜽^+(𝒛)(𝒛j𝒛)+12(𝒛j𝒛)H(𝜽^(𝒛))(𝒛j𝒛)\displaystyle=\sup_{\bm{z}_{j}\in S}\nabla\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*})^{\top}(\bm{z}_{j}-\bm{z}^{*})+\frac{1}{2}(\bm{z}_{j}-\bm{z}^{*})^{\top}H(\ell_{\bm{\hat{\theta}}}(\bm{z}^{*}))(\bm{z}_{j}-\bm{z}^{*})
=sup𝒛jS12(𝒛j𝒛)H(𝜽^+(𝒛))(𝒛j𝒛)\displaystyle=\sup_{\bm{z}_{j}\in S}\frac{1}{2}(\bm{z}_{j}-\bm{z}^{*})^{\top}H(\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}))(\bm{z}_{j}-\bm{z}^{*})
sup𝒛jS12H(𝜽^+(𝒛))𝒛j𝒛2\displaystyle\leq\sup_{\bm{z}_{j}\in S}\frac{1}{2}\|H(\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}))\|\|\bm{z}_{j}-\bm{z}^{*}\|^{2}

where the first equality follows from the exact second-order expansion of the quadratic loss around \bm{z}^{*}, and the second equality holds because the gradient \nabla\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}) vanishes at the global minimum (where \ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*})=0). Further, with Lemma D.4, we have

\|H(\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*}))\|=\left\|\begin{bmatrix}\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}&\frac{\partial^{2}\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*})}{\partial y\partial\bm{x}}\\ \left(\frac{\partial^{2}\ell_{\bm{\hat{\theta}}_{+}}(\bm{z}^{*})}{\partial\bm{x}\partial y}\right)^{\top}&1\end{bmatrix}\right\|\leq\left\|\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right\|+1. (98)

In (97), we can also bound the norm of the input difference by

𝒛j𝒛2𝒙j𝒙2+(yjy)2\|\bm{z}_{j}-\bm{z}^{*}\|^{2}\leq\|\bm{x}_{j}-\bm{x}^{*}\|^{2}+(y_{j}-y^{*})^{2} (99)

and then for simplicity, assume 𝒙2𝒙j2\|\bm{x}^{*}\|^{2}\leq\|\bm{x}_{j}\|^{2} we have

ϵ(S)\displaystyle\epsilon(S) sup𝒙jS12(𝜽^𝜽^+1)(𝒙j𝒙2+(yjy)2)\displaystyle\leq\sup_{\bm{x}_{j}\in S}\frac{1}{2}\left(\left\|\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right\|+1\right)\left(\|\bm{x}_{j}-\bm{x}^{*}\|^{2}+(y_{j}-y^{*})^{2}\right) (100)
sup𝒙jS12(𝜽^2+1)(𝒙j2+𝒙2+(yjy)2)\displaystyle\leq\sup_{\bm{x}_{j}\in S}\frac{1}{2}\left(\left\|\bm{\hat{\theta}}\right\|^{2}+1\right)\left(\|\bm{x}_{j}\|^{2}+\|\bm{x}^{*}\|^{2}+(y_{j}-y^{*})^{2}\right)
sup𝒙jS12𝜽^2(𝒙j2+𝒙2)+𝒪(d)\displaystyle\leq\sup_{\bm{x}_{j}\in S}\frac{1}{2}\left\|\bm{\hat{\theta}}\right\|^{2}\left(\|\bm{x}_{j}\|^{2}+\|\bm{x}^{*}\|^{2}\right)+\mathcal{O}(d)
sup𝒙jS𝜽^2𝒙j2+𝒪(d)\displaystyle\leq\sup_{\bm{x}_{j}\in S}\left\|\bm{\hat{\theta}}\right\|^{2}\|\bm{x}_{j}\|^{2}+\mathcal{O}(d)

where 𝜽^2𝒙j2=𝒪(d2)\left\|\bm{\hat{\theta}}\right\|^{2}\|\bm{x}_{j}\|^{2}=\mathcal{O}(d^{2}) is the dominant term for large dd. Now, let us look at the relation to sharpness. By definition,

κ(𝜽^,S)=𝜽^2tr(H𝜽^((𝜽^,S)))=𝜽^2tr(XXn+βI).\displaystyle\kappa(\bm{\hat{\theta}},S)=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}\left(H_{\bm{\hat{\theta}}}(\ell(\bm{\hat{\theta}},S))\right)=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}\left(\frac{X^{\top}X}{n}+\beta I\right). (101)

Since

tr(XXn+βI)=tr(XXn)+tr(βI)=tr(XXn)+β=1njn𝒙j2+β,\operatorname{tr}\left(\frac{X^{\top}X}{n}+\beta I\right)=\operatorname{tr}\left(\frac{X^{\top}X}{n}\right)+\operatorname{tr}\left(\beta I\right)=\operatorname{tr}\left(\frac{XX^{\top}}{n}\right)+\beta=\frac{1}{n}\sum_{j}^{n}\|\bm{x}_{j}\|^{2}+\beta, (102)

so we have

κ(𝜽^,S)=𝜽^21njn𝒙j2+β𝜽^2.\kappa(\bm{\hat{\theta}},S)=\|\bm{\hat{\theta}}\|^{2}\frac{1}{n}\sum_{j}^{n}\|\bm{x}_{j}\|^{2}+\beta\|\bm{\hat{\theta}}\|^{2}. (103)

The "variance" of the ridge estimator Ali et al. (2019) can be defined by

Var(𝜽^)tr(𝜽^𝜽^)=tr((XTX+nβI)1XT𝒚𝒚X(XTX+nβI)1).\operatorname{Var}(\bm{\hat{\theta}})\triangleq\operatorname{tr}\left(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right)=\operatorname{tr}\left(\left(X^{T}X+n\beta I\right)^{-1}X^{T}\bm{y}\bm{y}^{\top}X\left(X^{T}X+n\beta I\right)^{-1}\right). (104)

Note that 𝜽^𝜽^\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top} is a PSD matrix with rank(𝜽^𝜽^)=1\text{rank}(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top})=1, thus it has only one nonzero eigenvalue λ(𝜽^𝜽^)=𝜽^2>0\lambda(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top})=\|\bm{\hat{\theta}}\|^{2}>0.

Var(θ)=tr(𝜽^𝜽^)=𝜽^2=𝒪(β2)\operatorname{Var}(\theta)=\operatorname{tr}\left(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right)=\|\bm{\hat{\theta}}\|^{2}=\mathcal{O}(\beta^{-2}) (105)

By definition, the covariance matrix 𝔼[𝒚𝒚]\mathbb{E}[\bm{y}\bm{y}^{\top}] is a diagonal matrix with entries σ2\sigma^{2}. Taking the expectation, we have

tr(𝜽^𝜽^)\displaystyle\operatorname{tr}\left(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right) =σ2tr[(XX+nβI)1XX(XX+nβI)1]\displaystyle=\sigma^{2}\operatorname{tr}\left[\left(X^{\top}X+n\beta I\right)^{-1}X^{\top}X\left(X^{\top}X+n\beta I\right)^{-1}\right] (106)
=σ2ntr[XXn(XXn+βI)2]\displaystyle=\frac{\sigma^{2}}{n}\operatorname{tr}\left[\frac{X^{\top}X}{n}\left(\frac{X^{\top}X}{n}+\beta I\right)^{-2}\right]
=σ2ni=1dλi(XX/n)(λi(XX/n)+β)2\displaystyle=\frac{\sigma^{2}}{n}\sum_{i=1}^{d}\frac{\lambda_{i}(X^{\top}X/n)}{\left(\lambda_{i}(X^{\top}X/n)+\beta\right)^{2}}

where XXn\frac{X^{\top}X}{n} and (XTXn+βI)1\left(\frac{X^{T}X}{n}+\beta I\right)^{-1} are simultaneously diagonalizable and commute. Therefore, the greater β\beta is, the smaller tr(𝜽^𝜽^)\operatorname{tr}\left(\bm{\hat{\theta}}\bm{\hat{\theta}}^{\top}\right) is.
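A small numerical check of this expected-variance identity can be sketched as follows, assuming pure-noise responses so that 𝔼[yyᵀ]=σ²I holds exactly; the problem sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
n, d, beta, sigma = 30, 60, 1.0, 0.5
X = rng.standard_normal((n, d))
M = np.linalg.inv(X.T @ X + n * beta * np.eye(d)) @ X.T

# Empirical average of tr(theta_hat theta_hat^T) over fresh noise draws y ~ N(0, sigma^2 I)
emp = np.mean([np.sum((M @ (sigma * rng.standard_normal(n))) ** 2) for _ in range(5000)])

# Closed form: (sigma^2 / n) * sum_i lambda_i / (lambda_i + beta)^2 with lambda_i = eig(X^T X / n)
lam = np.linalg.eigvalsh(X.T @ X / n)
closed = sigma ** 2 / n * np.sum(lam / (lam + beta) ** 2)
print(emp, closed)   # the two numbers should agree closely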

Conclusions

From our above analysis, we have the following conditions.

  • Upper bound of robustness ϵ(S)sup𝒙jS𝜽^2𝒙j2+𝒪(d).\epsilon(S)\leq\sup_{\bm{x}_{j}\in S}\|\bm{\hat{\theta}}\|^{2}\|\bm{x}_{j}\|^{2}+\mathcal{O}(d).

  • Sharpness expression κ(𝜽^,S)=𝜽^2(1n𝒙j,yjS𝒙j2+β).\kappa(\bm{\hat{\theta}},S)=\|\bm{\hat{\theta}}\|^{2}\left(\frac{1}{n}\sum_{\bm{x}_{j},y_{j}\in S}\|\bm{x}_{j}\|^{2}+\beta\right).

  • Variance expression Var(θ)=𝜽^2.\text{Var}(\theta)=\|\bm{\hat{\theta}}\|^{2}.

1) First, let’s discuss how β\beta influences sharpness.

κ(𝜽^,S)\displaystyle\kappa(\bm{\hat{\theta}},S) =𝜽^2tr(H𝜽^((𝜽^,S)))\displaystyle=\|\bm{\hat{\theta}}\|^{2}\operatorname{tr}\left(H_{\bm{\hat{\theta}}}(\ell(\bm{\hat{\theta}},S))\right) (107)
=𝒚X(XX+nβI)2X𝒚(1n𝒙j,yjS𝒙j2+β)\displaystyle=\bm{y}^{\top}X\left(X^{\top}X+n\beta I\right)^{-2}X^{\top}\bm{y}\left(\frac{1}{n}\sum_{\bm{x}_{j},y_{j}\in S}\|\bm{x}_{j}\|^{2}+\beta\right)
=𝒪(β1).\displaystyle=\mathcal{O}(\beta^{-1}).

As indicated by the above equation, κ(𝜽^,S)=𝒪(β1)\kappa(\bm{\hat{\theta}},S)=\mathcal{O}(\beta^{-1}), i.e., the sharpness has an inverse relationship with β\beta.

2) Now we can obtain the relationship between robustness and sharpness by combining the first two points. Because the 𝒙j,yjS\bm{x}_{j},y_{j}\in S attaining the supremum is one of the training samples, there exists a constant c<nc<n and o~d=o(d2)\tilde{o}_{d}=o(d^{2}) such that

ϵ(S)\displaystyle\epsilon(S) cnjn(𝜽^2𝒙j2)+o~d\displaystyle\leq\frac{c}{n}\sum_{j}^{n}\left(\|\bm{\hat{\theta}}\|^{2}\|\bm{x}_{j}\|^{2}\right)+\tilde{o}_{d} (108)
cnjn(𝜽^2𝒙j2)+cβ𝜽^2+o~d\displaystyle\leq\frac{c}{n}\sum_{j}^{n}\left(\|\bm{\hat{\theta}}\|^{2}\|\bm{x}_{j}\|^{2}\right)+c\beta\|\bm{\hat{\theta}}\|^{2}+\tilde{o}_{d}
=cκ(𝜽^,S)+o~d.\displaystyle=c\kappa(\bm{\hat{\theta}},S)+\tilde{o}_{d}.

This relation is consistent with Eq. 7, where the robustness is upper bounded by nn^{\prime} times the sharpness κ(𝜽^,S)\kappa(\bm{\hat{\theta}},S). Besides, the relation is simpler here: the robustness depends only on the sharpness, without additional coefficients in front of κ(𝜽^,S)\kappa(\bm{\hat{\theta}},S).

3) Finally, the relation between β\beta and robustness will be

ϵ(S)\displaystyle\epsilon(S) cnjn(Var(θ)𝒙j2)+o~d=cσ2ni=1dλi(XX/n)(λi(XX/n)+β)2jn𝒙j2n+o~d\displaystyle\leq\frac{c}{n}\sum_{j}^{n}\left(\text{Var}(\theta)\|\bm{x}_{j}\|^{2}\right)+\tilde{o}_{d}=\frac{c\sigma^{2}}{n}\sum_{i=1}^{d}\frac{\lambda_{i}(X^{\top}X/n)}{\left(\lambda_{i}(X^{\top}X/n)+\beta\right)^{2}}\frac{\sum_{j}^{n}\|\bm{x}_{j}\|^{2}}{n}+\tilde{o}_{d} (109)
=cσ2ni=1dλi(XX/n)(λi(XX/n)+β)2jdλj(XX/n)n+o~d\displaystyle=\frac{c\sigma^{2}}{n}\sum_{i=1}^{d}\frac{\lambda_{i}(X^{\top}X/n)}{\left(\lambda_{i}(X^{\top}X/n)+\beta\right)^{2}}\frac{\sum_{j}^{d}\lambda_{j}(X^{\top}X/n)}{n}+\tilde{o}_{d}
=𝒪(β2).\displaystyle=\mathcal{O}(\beta^{-2}).

where ϵ(S)\epsilon(S) is of order 𝒪(β2)\mathcal{O}(\beta^{-2}) in the limit of large β\beta. The greater β\beta is, the less sensitive (more robust) the learned model becomes. In practice, however, being "over-robust" may hurt the model's performance (we show this empirically in Figure 7).
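To illustrate these three relations together, the following sketch sweeps β on synthetic data (an assumed toy setup, not the experiment of Appendix G) and prints the sharpness κ(θ̂,S) and the dominant robustness term sup_j ||θ̂||²||x_j||² derived above; both shrink as β grows.

import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 50, 80, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + sigma * rng.standard_normal(n)

for beta in [0.1, 0.5, 1.0, 2.0, 4.0]:
    theta_hat = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)
    w2 = np.sum(theta_hat ** 2)                              # ||theta_hat||^2 = Var(theta)
    kappa = w2 * (np.mean(np.sum(X ** 2, axis=1)) + beta)    # sharpness, Eq. (103)
    eps_bound = w2 * np.max(np.sum(X ** 2, axis=1))          # dominant term of the robustness bound
    print(f"beta={beta:3.1f}  kappa={kappa:9.3f}  robustness bound={eps_bound:9.3f}")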

E.2 2-layer Diagonal Linear Network Classification

Our main theorem assumes that the loss function satisfies the Polyak-Łojasiewicz (PŁ) condition. To extend our result to a more general case, we here study a 2-layer diagonal linear network classification problem whose loss is exponential and does not satisfy the PŁ condition.

Example E.2.

We consider a classification problem using a 2-layer diagonal linear network with the exponential loss. The robustness ϵ(S,A)\epsilon(S,A) has a relationship similar to Eq. 7. Given a training set SS, after t>Tϵt>T_{\epsilon} iterations, C2>0\exists C_{2}>0 such that ϵ(S,A)C2suptTϵκ(𝛉(t),S)\epsilon(S,A)\leq C_{2}\sup_{t\geq T_{\epsilon}}\kappa(\bm{\theta}(t),S).

Given a training set S=(X,y),X=[𝒙1,,𝒙n],Xd×nS=(X,y),X=[\bm{x}_{1},...,\bm{x}_{n}],X\in\mathbb{R}^{d\times n}, a depth-22 diagonal linear network with parameters 𝒖=[𝒖+,𝒖]2d\bm{u}=[\bm{u}_{+},\bm{u}_{-}]^{\top}\in\mathbb{R}^{2d} is specified by:

f(𝒖,𝒙)=𝒖+2𝒖2,𝒙f(\bm{u},\bm{x})=\langle\bm{u}_{+}^{2}-\bm{u}_{-}^{2},\bm{x}\rangle

We consider the exponential loss (t)=1ni=1nexp(𝒙i𝜽(t)yi)\mathcal{L}(t)=\frac{1}{n}\sum_{i=1}^{n}\text{exp}(-\bm{x}_{i}^{\top}\bm{\theta}(t)y_{i}) with yi{1,1}y_{i}\in\{-1,1\}. It has the same tail behavior and thus similar asymptotic properties as the logistic or cross-entropy loss. WLOG, we assume i[n]:yi=1\forall i\in[n]:y_{i}=1 such that 𝒙i=yi𝒙i\bm{x}_{i}=y_{i}\bm{x}_{i}. As suggested by Moroshko et al. (2020), we have

𝜽(t)=2α2sinh(4X0t𝐫(s)𝑑s)\bm{\theta}(t)=2\alpha^{2}\operatorname{sinh}\left(4X\int^{t}_{0}\mathbf{r}(s)ds\right)

where 𝐫(t)=1nexp(X𝜽(t))\mathbf{r}(t)=\frac{1}{n}\text{exp}(-X^{\top}\bm{\theta}(t)) and 𝐫(t)1=(t)\|\mathbf{r}(t)\|_{1}=\mathcal{L}(t). Note that

d𝜽(t)dt=4n𝜽2(t)+4α4𝟏Xexp(X𝜽(t))=A(t)X𝐫(t)\frac{d\bm{\theta}(t)}{dt}=\frac{4}{n}\sqrt{\bm{\theta}^{2}(t)+4\alpha^{4}\mathbf{1}}\circ X\exp\left(-X^{\top}\bm{\theta}(t)\right)=A(t)X\mathbf{r}(t) (110)

where A(t)=diag(4𝜽2(t)+4α4𝟏)A(t)=\operatorname{diag}\left(4\sqrt{\bm{\theta}^{2}(t)+4\alpha^{4}\mathbf{1}}\right).

Part i, sharpness. First derivative and Hessian can be obtained by

𝜽(t)\displaystyle\nabla_{\bm{\theta}}\mathcal{L}(t) =(X𝐫(t))\displaystyle=-(X\mathbf{r}(t))^{\top} (111)
H𝜽((t))\displaystyle H_{\bm{\theta}}(\mathcal{L}(t)) =(X𝐫(t))𝜽(t)=i=1n𝐫i(t)XiXi.\displaystyle=-\frac{\partial(X\mathbf{r}(t))^{\top}}{\partial\bm{\theta}(t)}=\sum_{i=1}^{n}\mathbf{r}_{i}(t)X_{i}X_{i}^{\top}.

With the definition of sharpness, we can get

κ(𝜽(t),S)=𝜽(t)2tr(H𝜽[(t)])=𝜽(t)2tr(i=1n𝐫i(t)XiXi)=𝜽(t)2tr(i=1n𝐫i(t)XiXi)=𝜽(t)2(i=1n𝐫i(t)𝒙i2).\begin{split}\kappa(\bm{\theta}(t),S)&=\|\bm{\theta}(t)\|^{2}\operatorname{tr}(H_{\bm{\theta}}[\mathcal{L}(t)])\\ &=\|\bm{\theta}(t)\|^{2}\operatorname{tr}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)X_{i}X_{i}^{\top}\right)\\ &=\|\bm{\theta}(t)\|^{2}\operatorname{tr}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)X_{i}^{\top}X_{i}\right)\\ &=\|\bm{\theta}(t)\|^{2}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{i}\|^{2}\right).\end{split} (112)
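The quantities above are easy to track numerically. The sketch below (an assumed toy instance with all labels y_i = 1, using a simple Euler discretization of the gradient flow in Eq. (110)) computes the loss L(t) and the sharpness κ(θ(t),S) of Eq. (112).

import numpy as np

rng = np.random.default_rng(3)
d, n, alpha, dt = 30, 20, 0.1, 1e-2
X = rng.standard_normal((d, n))          # columns are the x_i; labels assumed y_i = 1
theta = np.zeros(d)

for _ in range(5000):
    r = np.exp(-X.T @ theta) / n                       # r(t) = (1/n) exp(-X^T theta(t))
    A = 4.0 * np.sqrt(theta ** 2 + 4 * alpha ** 4)     # diagonal of A(t)
    theta = theta + dt * A * (X @ r)                   # Euler step of d(theta)/dt = A(t) X r(t)

r = np.exp(-X.T @ theta) / n
kappa = np.sum(theta ** 2) * np.sum(r * np.sum(X ** 2, axis=0))   # sharpness, Eq. (112)
print("loss L(t):", r.sum(), "sharpness kappa:", kappa)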

Part ii, robustness. We now apply the same discussion of the robustness constant as in the previous case. Following the previous definition of ϵ(S,A)\epsilon(S,A), after some iteration number TϵT_{\epsilon} it is defined by

ϵ(S,A):=supj[n],tTϵ|n𝐫j(t)𝐫(t)|\epsilon(S,A):=\sup_{j\in[n],t\geq T_{\epsilon}}\left|n\mathbf{r}_{j}(t)-\mathbf{r}^{*}(t)\right| (113)

where n𝐫j(t)n\mathbf{r}_{j}(t) is the (denormalized) point-wise loss of 𝒙j\bm{x}_{j} and 𝐫(t)\mathbf{r}^{*}(t) denotes the minimum loss, attained at the point 𝒙\bm{x}^{*}. Then there exists n<nn^{\prime}<n such that

ϵ(S,A)=suptTϵn(𝐫(t)1𝐫(t))suptTϵn𝐫(t)1.\epsilon(S,A)=\sup_{t\geq T_{\epsilon}}n^{\prime}\left(\|\mathbf{r}(t)\|_{1}-\mathbf{r}^{*}(t)\right)\leq\sup_{t\geq T_{\epsilon}}n^{\prime}\|\mathbf{r}(t)\|_{1}. (114)

Let 𝒙min=mini[n]𝒙i\|\bm{x}_{\min}\|=\min_{i\in[n]}\|\bm{x}_{i}\|; the above equation becomes

ϵ(S,A)\displaystyle\epsilon(S,A) 1𝒙min2suptTϵ(n𝐫(t)𝒙min2)\displaystyle\leq\frac{1}{\|\bm{x}_{\min}\|^{2}}\sup_{t\geq T_{\epsilon}}(n^{\prime}\|\mathbf{r}(t)\|\|\bm{x}_{\min}\|^{2}) (115)
=n𝒙min2suptTϵ(i=1n𝐫i(t)𝒙min2)\displaystyle=\frac{n^{\prime}}{\|\bm{x}_{\min}\|^{2}}\sup_{t\geq T_{\epsilon}}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{\min}\|^{2}\right)
n𝒙min2suptTϵ(i=1n𝐫i(t)𝒙i2).\displaystyle\leq\frac{n^{\prime}}{\|\bm{x}_{\min}\|^{2}}\sup_{t\geq T_{\epsilon}}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{i}\|^{2}\right).

Part iii, connection. Comparing the last parts of (112) and (115), we find that robustness and sharpness depend on the same term. Further, for any step tt, 𝜽(t)2\|\bm{\theta}(t)\|^{2} has an upper bound.

From Lemma 11 of Moroshko et al. (2020), we have the following inequality,

𝜽(t)2α2sinh(x¯2γ22α2γ~(t))\|\bm{\theta}(t)\|_{\infty}\leq 2\alpha^{2}\sinh\left(\frac{\bar{x}}{2\gamma_{2}^{2}\alpha^{2}}\tilde{\gamma}(t)\right) (116)

where (t)=exp(γ~(t))\mathcal{L}(t)=\exp(-\tilde{\gamma}(t)). Then, we can bound the 𝜽(t)2\|\bm{\theta}(t)\|^{2} via:

C1𝜽(t)2d𝜽(t)2=4dα4sinh2(x¯2γ22α2γ~(t))<C_{1}\leq\|\bm{\theta}(t)\|^{2}\leq d\|\bm{\theta}(t)\|^{2}_{\infty}=4d\alpha^{4}\sinh^{2}\left(\frac{\bar{x}}{2\gamma_{2}^{2}\alpha^{2}}\tilde{\gamma}(t)\right)<\infty (117)

Note that C1>0C_{1}>0. In summary, we have

ϵ(S,A)nC1𝒙min2suptTϵκ(𝜽(t),S)C2suptTϵκ(𝜽(t),S)\epsilon(S,A)\leq\frac{n^{\prime}}{C_{1}\|\bm{x}_{\min}\|^{2}}\sup_{t\geq T_{\epsilon}}\kappa(\bm{\theta}(t),S)\leq C_{2}\sup_{t\geq T_{\epsilon}}\kappa(\bm{\theta}(t),S) (118)

where C2=nC1𝒙min2>0C_{2}=\frac{n^{\prime}}{C_{1}\|\bm{x}_{\min}\|^{2}}>0 is a constant.

Part iv, sanity check: asymptotics. Asymptotically, as tt\to\infty, (t)0\mathcal{L}(t)\to 0 while 𝜽(t)2\|\bm{\theta}(t)\|_{2} explodes. So if the sharpness κ(𝜽(t),X)\kappa(\bm{\theta}(t),X)\to\infty, it would fail to imply robustness.

κ(𝜽(t),S)\displaystyle\kappa(\bm{\theta}(t),S) =𝜽(t)2(i=1n𝐫i(t)𝒙i2)\displaystyle=\|\bm{\theta}(t)\|^{2}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{i}\|^{2}\right) (119)
𝜽(t)2(i=1n𝐫i(t)𝒙max2)\displaystyle\leq\|\bm{\theta}(t)\|^{2}\left(\sum_{i=1}^{n}\mathbf{r}_{i}(t)\|\bm{x}_{\max}\|^{2}\right)
=𝜽(t)2(𝜽(t))𝒙max2\displaystyle=\|\bm{\theta}(t)\|^{2}\mathcal{L}(\bm{\theta}(t))\|\bm{x}_{\max}\|^{2}
=κmax(𝜽(t),S)\displaystyle=\kappa_{\max}(\bm{\theta}(t),S)

Let κmax(𝜽(t),X)\kappa_{\max}(\bm{\theta}(t),X) be an upper bound of κ(𝜽(t),X)\kappa(\bm{\theta}(t),X) at any time step tt. The dynamics are

dκmax(𝜽(t),S)dt=\displaystyle\frac{d\kappa_{\max}(\bm{\theta}(t),S)}{dt}= 𝜽κmax(𝜽(t),S)𝜽˙(t)\displaystyle\nabla_{\bm{\theta}}\kappa_{\max}(\bm{\theta}(t),S)\dot{\bm{\theta}}(t) (120)
=\displaystyle= 𝒙max2tr(XX)(𝜽(t)2˙(t)+2(t)𝜽(t)𝜽˙(t))\displaystyle\|\bm{x}_{\max}\|^{2}\operatorname{tr}(XX^{\top})\left(\|\bm{\theta}(t)\|^{2}\dot{\mathcal{L}}(t)+2\mathcal{L}(t)\bm{\theta}(t)^{\top}\dot{\bm{\theta}}(t)\right)
=\displaystyle= 𝒙max2tr(XX)(𝜽(t)2(X𝒓(t))A(t)X𝒓(t)+2(t)𝜽(t)A(t)X𝒓(t))\displaystyle\|\bm{x}_{\max}\|^{2}\operatorname{tr}(XX^{\top})\left(-\|\bm{\theta}(t)\|^{2}(X\bm{r}(t))^{\top}A(t)X\bm{r}(t)+2\mathcal{L}(t)\bm{\theta}(t)^{\top}A(t)X\bm{r}(t)\right)
=\displaystyle= 𝒙max2tr(XX)(2(t)𝜽(t)𝜽(t)2(X𝒓(t)))A(t)X𝒓(t)\displaystyle\|\bm{x}_{\max}\|^{2}\operatorname{tr}(XX^{\top})\left(2\mathcal{L}(t)\bm{\theta}(t)^{\top}-\|\bm{\theta}(t)\|^{2}(X\bm{r}(t))^{\top}\right)A(t)X\bm{r}(t)

As tt\to\infty, we have (t)0\mathcal{L}(t)\to 0; thus it is easy to see that

limtdκmax(𝜽(t),S)dt=𝒙max2𝜽(t)2(X𝒓(t))A(t)X𝒓(t)=𝒙max2𝜽(t)2˙(t)=0\lim_{t\to\infty}\frac{d\kappa_{\max}(\bm{\theta}(t),S)}{dt}=-\|\bm{x}_{\max}\|^{2}\|\bm{\theta}(t)\|^{2}(X\bm{r}(t))^{\top}A(t)X\bm{r}(t)=-\|\bm{x}_{\max}\|^{2}\|\bm{\theta}(t)\|^{2}\dot{\mathcal{L}}(t)=0 (121)

As we can see, the derivative of κmax(𝜽(t),X)\kappa_{\max}(\bm{\theta}(t),X) decreases to 0 as tt\to\infty, which means the sharpness κ(𝜽(t),X)\kappa(\bm{\theta}(t),X) is upper bounded by a converged value κ(𝜽(t),X)<C<\kappa(\bm{\theta}(t\to\infty),X)<C_{\infty}<\infty. So the sharpness will not explode.

Appendix F Comparison to feature robustness

F.1 Their results

Definition F.1 (Feature robustness, Petzka et al. (2021)).

Let :𝒴×𝒴+\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}_{+} denote a loss function, ϵ\epsilon and δ\delta two positive (small) real numbers, S𝒳×𝒴S\subseteq\mathcal{X}\times\mathcal{Y} a finite sample set, and Am×mA\in\mathbb{R}^{m\times m} a matrix. A model f(x)=(ψϕ)(x)f(x)=(\psi\circ\phi)(x) with ϕ(𝒳)m\phi(\mathcal{X})\subseteq\mathbb{R}^{m} is called ((𝜹,S,A),ϵ)((\bm{\delta},S,A),\epsilon)-feature robust, if |ϕ(f,S,αA)|ϵ\left|\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,\alpha A)\right|\leq\epsilon for all 0αδ0\leq\alpha\leq\delta. More generally, for a probability distribution 𝒜\mathcal{A} on perturbation matrices in m\mathbb{R}^{m}, we define

ϕ(f,S,𝒜)=𝔼A𝒜[ϕ(f,S,A)],\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,\mathcal{A})=\mathbb{E}_{A\sim\mathcal{A}}\left[\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,A)\right],

and call the model ((𝜹,𝑺,𝒜),ϵ)((\bm{\delta},\bm{S},\mathcal{A}),\bm{\epsilon})-feature robust on average over 𝒜\mathcal{A}, if |ϕ(f,S,α𝒜)|ϵ\left|\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,\alpha\mathcal{A})\right|\leq\epsilon for all 0αδ0\leq\alpha\leq\delta.

Theorem F.2 (Theorem 5 of Petzka et al. (2021)).

Consider a model f(x,𝐰)=g(𝐰ϕ(x))f(x,\mathbf{w})=g(\mathbf{w}\phi(x)) as above, a loss function \ell and a sample set SS, and let Omm×mO_{m}\subset\mathbb{R}^{m\times m} denote the set of orthogonal matrices. Let δ\delta be a positive (small) real number and 𝐰=ωd×m\mathbf{w}=\omega\in\mathbb{R}^{d\times m} denote parameters at a local minimum of the empirical risk on a sample set SS. If the labels satisfy that y(ϕδA(xi))=y(ϕ(xi))=yiy\left(\phi_{\delta A}\left(x_{i}\right)\right)=y\left(\phi\left(x_{i}\right)\right)=y_{i} for all (xi,yi)S\left(x_{i},y_{i}\right)\in S and all A1\|A\|\leq 1, then f(x,ω)f(x,\omega) is ((δ,S,Om),ϵ)\left(\left(\delta,S,O_{m}\right),\epsilon\right)-feature robust on average over OmO_{m} for ϵ=δ22mκϕ(ω)+𝒪(δ3)\epsilon=\frac{\delta^{2}}{2m}\kappa^{\phi}(\omega)+\mathcal{O}\left(\delta^{3}\right).

Proof sketch

  1.

    With the assumption that y[ϕδA(xi)]=yiy[\phi_{\delta A}(x_{i})]=y_{i} for all (xi,yi)S(x_{i},y_{i})\in S and all A1\|A\|\leq 1, the feature perturbation around ww is

    ϕ(f,S,αA)+emp(𝒘,S)=emp(𝒘+α𝒘A,S)\mathcal{E}^{\phi}_{\mathcal{F}}(f,S,\alpha A)+\mathcal{E}_{emp}(\bm{w},S)=\mathcal{E}_{emp}(\bm{w}+\alpha\bm{w}A,S)
  2.

    Since the Taylor expansion at the local minimum 𝒘=ω\bm{w}=\omega retains only the second-order term, we have

    emp(𝒘+α𝒘A,S)=emp(ω,S)+α22s,t=1d(ωsA)Hs,t(ω,ϕ(S))(ωtA)\mathcal{E}_{emp}(\bm{w}+\alpha\bm{w}A,S)=\mathcal{E}_{emp}(\omega,S)+\frac{\alpha^{2}}{2}\sum^{d}_{s,t=1}(\omega_{s}A)H_{s,t}(\omega,\phi(S))(\omega_{t}A)^{\top}
  3.

    With basic algebra, one can easily get

    𝔼AOm[ϕ(f,S,αA)]\displaystyle\mathbb{E}_{A\sim O_{m}}\left[\mathcal{E}_{\mathcal{F}}^{\phi}(f,S,\alpha A)\right] δ22s,t=1d𝔼AOm[(ωsA)Hs,t(ωtA)T]+𝒪(δ3)\displaystyle\leq\frac{\delta^{2}}{2}\sum_{s,t=1}^{d}\mathbb{E}_{A\sim O_{m}}\left[\left(\omega_{s}A\right)H_{s,t}\left(\omega_{t}A\right)^{T}\right]+\mathcal{O}\left(\delta^{3}\right)
    =δ22ms,t=1dωs,ωtTr(Hs,t)+𝒪(δ3)\displaystyle=\frac{\delta^{2}}{2m}\sum_{s,t=1}^{d}\left\langle\omega_{s},\omega_{t}\right\rangle\cdot\operatorname{Tr}\left(H_{s,t}\right)+\mathcal{O}\left(\delta^{3}\right)
    =δ22mκϕ(ω)+𝒪(δ3)\displaystyle=\frac{\delta^{2}}{2m}\kappa^{\phi}(\omega)+\mathcal{O}\left(\delta^{3}\right)

F.2 Comparison to our results

We first summarize the commonalities and differences between their results and ours:

  • Both works consider the robustness of the model, but they define feature robustness while we study the loss robustness Xu & Mannor (2012), which has been studied for many years.

  • They consider a non-standard generalization gap by decomposing it into representativeness and the expected deviation of the loss around the sample points, whereas we strive to integrate sharpness into general generalization guarantees.

For point 1, their feature robustness trivially depends on the sharpness, because the sharpness (the curvature information) is defined precisely by the perturbation region around the point of interest. From step 2 of the above proof sketch, the Hessian w.r.t. ω\omega is exactly the second-order expansion of the perturbed empirical risk, so this definition provides less additional information about the optimization landscape. In contrast, we consider the loss robustness for two reasons: 1) it is easy to obtain in practice without first constructing the orthogonal matrices OmO_{m}; 2) it highlights the dependence on the data manifold.

For point 2, we try to integrate this optimization property (sharpness) into standard generalization frameworks in order to obtain a clearer interplay. Unlike feature robustness, robustness defined via the loss function is easier to analyze with standard generalization tools, because it is hard and vague to define a "feature" in general. Besides, our result also benefits data-dependent bounds Xu & Mannor (2012); Kawaguchi et al. (2022).

Appendix G Experiments and details

G.1 Ridge regression with distributional shifting

As stated before, we follow Duchi & Namkoong (2021) to investigate ridge regression under distributional shift. We randomly generate θ0d\theta^{*}_{0}\in\mathbb{R}^{d} on the sphere, and generate data from

Xiid𝒩(0,1),𝒚=Xθ0X\overset{\text{iid}}{\sim}\mathcal{N}(0,1),~{}~{}~{}\bm{y}=X\theta^{*}_{0} (122)

To simulate distributional shift, we randomly generate a unit vector θ0\theta_{0}^{\perp} perpendicular to θ0\theta^{*}_{0}. Let θ0,θ0\theta_{0}^{\perp},\theta^{*}_{0} be the basis vectors; the shifted ground truth is then computed from this basis by θα=θ0cos(α)+θ0sin(α)\theta^{*}_{\alpha}=\theta^{*}_{0}\cdot\cos(\alpha)+\theta^{\perp}_{0}\cdot\sin(\alpha). For the source domain, we use θ0\theta^{*}_{0} as our training distribution. We randomly sample 50 data points and train a linear model with gradient descent for 3000 iterations. Starting from α=0\alpha=0, we gradually increase α\alpha to generate different distributional shifts. A sketch of this data-generating procedure is given below.
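The following sketch mirrors the description above; the dimension, sample size, and example shift angle are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 50

theta0 = rng.standard_normal(d)
theta0 /= np.linalg.norm(theta0)                     # theta*_0 on the unit sphere

v = rng.standard_normal(d)                           # random direction, made perpendicular to theta*_0
v -= (v @ theta0) * theta0
theta_perp = v / np.linalg.norm(v)                   # theta_0^perp

def shifted_target(alpha_deg):
    a = np.deg2rad(alpha_deg)
    return theta0 * np.cos(a) + theta_perp * np.sin(a)   # theta*_alpha

X = rng.standard_normal((n, d))                      # X ~ N(0, 1) i.i.d.
y_source = X @ shifted_target(0)                     # source domain (alpha = 0)
y_target = X @ shifted_target(45)                    # an example shifted domain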

From the left panel of Figure 7, we can see that a larger penalty leads to a smaller increase in loss as β\beta ranges from 0 to 22. Since we consider a cyclic shift of the ground truth, 180180^{\circ} corresponds to the maximum shift and thus the highest loss increase. According to our analysis of ridge regression, a larger β\beta means a flatter minimum and more robustness, resulting in better OOD generalization. This experiment verifies our theoretical analysis. However, too large an 2\ell_{2} regularization coefficient hurts performance: as shown in Figure 7 (right panel), underfitting (indicated by brown colors) appears when β>4\beta>4.

Figure 7: OOD test losses are increasing along distributional shifting. The X-axis is the shifting angle α\alpha and the Y-axis is the test loss of the model which is trained on distribution α=0\alpha=0. Left: The larger regularization β\beta causes a lower increase in test loss (Darker purple lines have lower test losses). Right: Too large 2\ell_{2} penalty coefficient β\beta brings poor fitting and thus fails to generalize on both ID and OOD datasets (orange lines). Best viewed in colors.

G.2 Additional results on sharpness

Here we show more experimental results on RotatedMNIST, which is a rotation of the MNIST handwritten digit dataset with angles ranging over 0,15,30,45,60,750^{\circ},15^{\circ},30^{\circ},45^{\circ},60^{\circ},75^{\circ} (6 domains/distributions). For each environment, we select one domain as the test domain and train the model on all other domains. The OOD generalization test accuracy is then reported on the test domain. In our experiments, we run each algorithm with 1212 different seeds, thus obtaining 1212 trained models at different minima. We then compute the sharpness (see Algorithm 1) for all these models and plot them in Figure 8. For algorithms, we choose Empirical Risk Minimization (ERM) and Stochastic Weight Averaging (SWA) as the plain OOD generalization algorithms, shown in the first column. For robust optimization algorithms, we choose Group Distributional Robust Optimization Sagawa et al. (2019). We also choose CORAL Sun & Saenko (2016) as a multi-source domain generalization algorithm. Across these different types of out-of-domain generalization algorithms, we can conclude that sharpness affects the test accuracy on the OOD dataset.

Experimental configurations are listed in Table 1. For each model, we run 50005000 iterations and choose the last checkpoint as the test model. To ease the computational burden, we use a 33-layer MLP to compute the sharpness.

Algorithms | Optimizer | lr | WD | batch size | MLP size | eta | MMD γ\gamma
ERM(SWA) | Adam | 0.001 | 0 | 64 | 265*3 | - | -
DRO | Adam | 0.001 | 0 | 64 | 265*3 | 0.01 | -
CORAL | Adam | 0.001 | 0 | 64 | 265*3 | - | 1
Table 1: Hyperparameters we use for the different DG algorithms in the experiments.

From Figure 8, we can see that the sharpness has an inverse relationship with the out-of-domain generalization performance for every model in each individual environment. To make this clear, we plot similar tasks from environments 11 to 44 together in the last row, where a clearer tendency can be seen across all algorithms. This verifies our Theorem 3.1. Note that different algorithms have different feature scales; one may need to normalize the results of different algorithms when plotting them together.

Algorithm 1 Pseudocode of model sharpness computation
Require: feature layer f(x)f(x), training loss \ell
Ensure: sharpness SS
  Get Jacobian matrix w.r.t. feature layer 𝐉=(f(x),y)\mathbf{J}=\nabla\ell(f(x),y)
  for each gradient vector 𝐉i\mathbf{J}_{i} in 𝐉\mathbf{J}^{\top} do
     Compute Hessian w.r.t. element i,ji,j of f(x)f(x) by 𝐉ifl(x)\frac{\partial\mathbf{J}_{i}}{\partial f_{l}(x)}
  end for
  We store Hessian in the variable 𝐇\mathbf{H}
  Initialize sharpness S=0S=0
  for ii in feature layer.shape[0] do
     for jj in feature layer.shape[0] do
        Retrieve the hessian value of i,ji,j element via h𝐇[:,j,i,:]h\leftarrow\mathbf{H}[:,j,i,:]
        Sharpness sijTrace(h)fij(x)2s_{ij}\leftarrow Trace(h)*f_{ij}(x)^{2}
        S=S+sijS=S+s_{ij}
     end for
  end for
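For reference, a minimal PyTorch-style sketch of one way to realize Algorithm 1 is given below. It assumes the model is split into a feature extractor and a classifier head and uses a cross-entropy training loss; these names and choices are illustrative, not the exact experimental code, and the curvature is accumulated through the diagonal of the Hessian taken w.r.t. the feature activations.

import torch
import torch.nn.functional as F

def feature_sharpness(feature_extractor, head, x, y):
    feats = feature_extractor(x).detach()          # feature layer output f(x)
    flat = feats.reshape(-1)

    def loss_of_features(v):
        # training loss as a function of the feature activations only
        return F.cross_entropy(head(v.view_as(feats)), y)

    # Hessian of the loss w.r.t. the flattened feature activations
    H = torch.autograd.functional.hessian(loss_of_features, flat)

    # Accumulate curvature * f(x)^2, mirroring s_ij = Trace(h) * f_ij(x)^2 in Algorithm 1
    return torch.sum(torch.diagonal(H) * flat ** 2).item()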
Figure 8: Domain generalization test accuracy on RotatedMNIST. From top to the bottom: environment 0, 1, 2, 3, 4, 5 with angles: [0,15,30,45,60,75][0^{\circ},15^{\circ},30^{\circ},45^{\circ},60^{\circ},75^{\circ}] and a plot together. Each column shows the 1212 runs of each algorithm.

G.3 Compare our robust bound to other OOD generalization bounds

G.3.1 Computation of our bounds

First, we follow Kawaguchi et al. (2022) to compute KK via the inverse image of an ϵ\epsilon-covering in a randomly projected space. The main idea is to partition the input space through a projected space with a transformation matrix A~\tilde{A}. The specific steps are: (1) To generate a random matrix A~\tilde{A}, we i.i.d. sample each entry from the uniform distribution 𝒰(0,1)\mathcal{U}(0,1). (2) Each row of the random matrix 𝒜3×d\mathcal{A}\in\mathbb{R}^{3\times d} is then normalized so that Ax[0,1]3Ax\in[0,1]^{3}, i.e., Aij=A~ij/j=1dA~ijA_{ij}=\tilde{A}_{ij}/\sum_{j=1}^{d}\tilde{A}_{ij}. (3) After generating the random matrix AA, we use the ϵ\epsilon-covering of the space of u=Axu=Ax to define the pre-partition {C~i}i=1K\{\tilde{C}_{i}\}_{i=1}^{K}. A sketch of this procedure is given below.
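The sketch below assumes the inputs are flattened and scaled to [0,1]^d (e.g., pixel intensities), which is why the row normalization keeps the projection inside [0,1]^3; the dimension and covering radius are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, eps = 784, 0.1

A_tilde = rng.uniform(0.0, 1.0, size=(3, d))
A = A_tilde / A_tilde.sum(axis=1, keepdims=True)     # rows sum to 1, so A @ x stays in [0, 1]^3

def cell_index(x):
    """Index of the epsilon-covering cell that contains the projection A @ x."""
    u = A @ x
    return tuple(np.minimum((u / eps).astype(int), int(1 / eps) - 1))

# Example: count the occupied pre-partition cells for a batch of inputs in [0, 1]^d
X = rng.uniform(0.0, 1.0, size=(100, d))
K = len({cell_index(x) for x in X})                  # at most (1/eps)^3 cells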

G.3.2 Computation of PAC-Bayes bound

We follow the definition to compute expected disρ\textit{dis}_{\rho} in Germain et al. (2013) where

Definition G.1.

Let \mathcal{H} be a hypothesis class. For any marginal distributions DSD_{S} and DTD_{T} over XX, any distribution ρ\rho on \mathcal{H}, the domain disagreement disρ(DS,DT)\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right) between DSD_{S} and DTD_{T} is defined by,

disρ(DS,DT)= def |𝐄h,hρ2[RDT(h,h)RDS(h,h)]|.\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right)\stackrel{{\scriptstyle\text{ def }}}{{=}}\left|\underset{h,h^{\prime}\sim\rho^{2}}{\mathbf{E}}\left[R_{D_{T}}\left(h,h^{\prime}\right)-R_{D_{S}}\left(h,h^{\prime}\right)\right]\right|.

Since the disρ(DS,DT)\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right) is defined as the expected distance, we can compute its empirical version according to their theoretical upper bound as follows.

Proposition G.2 (Germain et al. Germain et al. (2013) Theorem 3).

For any distributions DSD_{S} and DTD_{T} over XX, any set of hypothesis \mathcal{H}, any prior distribution π\pi over \mathcal{H}, any δ(0,1]\delta\in(0,1], and any real number α>0\alpha>0, with a probability at least 1δ1-\delta over the choice of S×TS\times T\sim (DS×DT)m\left(D_{S}\times D_{T}\right)^{m}, for every ρ\rho on \mathcal{H}, we have

disρ(DS,DT)2α[disρ(S,T)+2KL(ρπ)ln2δm×α+1]11e2α\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right)\leq\frac{2\alpha\left[\operatorname{dis}_{\rho}(S,T)+\frac{2\mathrm{KL}(\rho\|\pi)\ln\frac{2}{\delta}}{m\times\alpha}+1\right]-1}{1-e^{-2\alpha}}

With disρ(S,T)\operatorname{dis}_{\rho}(S,T), we can then compute the final generalization bound by the following inequality

ρ on ,RPT(Gρ)RPT(GρT)RPS(Gρ)\displaystyle\forall\rho\text{ on }\mathcal{H},R_{P_{T}}\left(G_{\rho}\right)-R_{P_{T}}\left(G_{\rho_{T}^{*}}\right)\leq R_{P_{S}}\left(G_{\rho}\right)
+disρ(DS,DT)+RDT(Gρ,GρT)+RDS(Gρ,GρT)\displaystyle\quad+\operatorname{dis}_{\rho}\left(D_{S},D_{T}\right)+R_{D_{T}}\left(G_{\rho},G_{\rho_{T}^{*}}\right)+R_{D_{S}}\left(G_{\rho},G_{\rho_{T}^{*}}\right)

with ρT=argminρRPT(Gρ)\rho_{T}^{*}=\operatorname{argmin}_{\rho}R_{P_{T}}\left(G_{\rho}\right) is the best target posterior, and RD(Gρ,GρT)=EhρEhρTRD(h,h)R_{D}\left(G_{\rho},G_{\rho_{T}^{*}}\right)=E_{h\sim\rho}E_{h^{\prime}\sim\rho_{T}^{*}}R_{D}\left(h,h^{\prime}\right).

Note that we ignore the expected errors over the best hypothesis by assuming the RD(Gρ,GρT)=0R_{D}\left(G_{\rho},G_{\rho_{T}^{*}}\right)=0. We apply the same operation in \mathcal{E}^{*} of Proposition C.6 as well.
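For completeness, a small sketch of how the bound of Proposition G.2 is evaluated from its empirical ingredients is given below; the numerical inputs are placeholders rather than the values used in our experiments.

import numpy as np

def dis_rho_upper_bound(dis_emp, kl, m, delta=0.05, alpha=1.0):
    """Upper bound on dis_rho(D_S, D_T) from Proposition G.2, given the empirical
    disagreement dis_rho(S, T), the KL term KL(rho || pi), and the sample size m."""
    inner = dis_emp + 2.0 * kl * np.log(2.0 / delta) / (m * alpha) + 1.0
    return (2.0 * alpha * inner - 1.0) / (1.0 - np.exp(-2.0 * alpha))

print(dis_rho_upper_bound(dis_emp=0.12, kl=3.5, m=5000))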

G.3.3 Comparisons

In this section, we add additional experiments comparing our bound to other baselines, i.e., PAC-Bayes bounds Germain et al. (2013). As shown in the first row of Figure LABEL:fig_app:spurious_features, our robust framework has a smaller distribution distance in the bound compared to the two baselines as the model size increases. In the second row, we have similar results for the final generalization bounds. From the third and fourth rows, we can see that our bound is tighter than the baselines under distributional shifts.