
Towards Understanding Clean Generalization and Robust Overfitting in Adversarial Training

Binghui Li    Yuanzhi Li
Abstract

Similar to the surprising performance of standard deep learning, deep networks trained by adversarial training also generalize well on unseen clean data (natural data). However, although adversarial training can achieve low robust training error, a significant robust generalization gap remains. We call this phenomenon Clean Generalization and Robust Overfitting (CGRO). In this work, we study the CGRO phenomenon in adversarial training from two views: representation complexity and training dynamics. Specifically, we consider a binary classification setting with $N$ separated training data points. First, we prove that, under the assumption that there exists a $\operatorname{poly}(D)$-size clean classifier (where $D$ is the data dimension), a ReLU network with only $\tilde{O}(ND)$ extra parameters is able to leverage robust memorization to achieve CGRO, while a robust classifier still requires exponential representation complexity in the worst case. Next, we focus on a structured-data case to analyze training dynamics, where we train a two-layer convolutional network of width $\tilde{O}(ND)$ against adversarial perturbations. We then show that a three-stage phase transition occurs during the learning process, and the network provably converges to the robust memorization regime, which thereby results in CGRO. Finally, we empirically verify our theoretical analysis with experiments on real-world image recognition datasets.

Machine Learning, ICML

1 Introduction

Nowadays, deep neural networks have achieved excellent performance in a variety of disciplines, including computer vision (Krizhevsky et al., 2012; Dosovitskiy et al., 2020; Kirillov et al., 2023) and natural language processing (Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022). However, it is well known that small but adversarial perturbations of natural data can fool well-trained classifiers (Biggio et al., 2013; Szegedy et al., 2013; Goodfellow et al., 2014), which potentially gives rise to reliability and security problems in real-world applications and motivates the design of adversarially robust learning algorithms.

In practice, adversarial training methods (Goodfellow et al., 2014; Madry et al., 2017; Shafahi et al., 2019; Zhang et al., 2019; Pang et al., 2022) are widely used to improve the robustness of models by treating perturbed data as training data. However, while these robust learning algorithms are able to achieve high robust training accuracy (Gao et al., 2019), they still suffer from a non-negligible robust generalization gap (Raghunathan et al., 2019), which is also called robust overfitting (Rice et al., 2020; Yu et al., 2022).

To explain this puzzling phenomenon, a series of works have attempted to provide theoretical understanding from different perspectives. Although these works seem to provide convincing evidence from theoretical viewpoints in different settings, a gap between theory and practice remains, for at least two reasons.

First, although previous works have shown that adversarially robust generalization requires more data and larger models (Schmidt et al., 2018; Gowal et al., 2021; Li et al., 2022; Bubeck & Sellke, 2023), it is unclear what mechanism during the adversarial training process directly causes robust overfitting. A line of work on uniform algorithmic stability (Xing et al., 2021; Xiao et al., 2022), under Lipschitzian smoothness assumptions, also suggests that the robust generalization gap increases as the number of training iterations grows. Moreover, there is no robust generalization gap for a trivial model that merely guesses labels at random (e.g., a deep neural network at random initialization), which implies that the learning process must be taken into account when analyzing robust generalization.

Second, and most importantly, while some works (Tsipras et al., 2018; Zhang et al., 2019; Hassani & Javanmard, 2022) point out that achieving robustness may hurt clean test accuracy, in most cases the drop in robust test accuracy is observed to be much larger than the drop in clean test accuracy in adversarial training (Madry et al., 2017; Schmidt et al., 2018; Raghunathan et al., 2019) (see Figure 1, where the clean test accuracy exceeds $80\%$ but the robust test accuracy only reaches roughly $50\%$). Namely, a weak version of benign overfitting (Zhang et al., 2017), which means that overparameterized deep neural networks can both fit random data powerfully and generalize well on unseen clean data, persists after adversarial training.



Figure 1: The learning curves of adversarial training on CIFAR10 with $\ell_\infty$-perturbation radius $\delta=8/255$ (Rice et al., 2020).

Therefore, it is natural to ask the following question:

What is the underlying mechanism that results in both Clean Generalization and Robust Overfitting (CGRO) during adversarial training?

In this paper, we provide a theoretical understanding of this question. Precisely, we make the following contributions:

  • In Section 3, we first present some useful notations used in our work, and then provide the formal definition of the CGRO classifier (Definition 3.4).

  • In Section 4, we study CGRO classifiers from the view of representation complexity. Based on data assumptions observed in practice, we prove that achieving a CGRO classifier only requires linearly many extra parameters by leveraging robust memorization (Theorem 4.4), whereas a robust classifier requires exponential model capacity in the worst case (Theorem 4.7).

  • In Section 5, under our theoretical framework of adversarial training, we apply a two-layer convolutional network to learn the structured data. We propose a three-stage analysis technique to decouple the complicated training dynamics of adversarial training, which shows that the network learner provably converges to the CGRO regime (Theorem 5.9).

  • In Section 6, we empirically demonstrate our theoretical results from Section 4 and Section 5 by experiments on real-world data and synthetic data, respectively.

2 Additional Related Work

Adversarial Examples and Adversarial Training Methods. Szegedy et al. (2013) first made an intriguing discovery: even state-of-the-art deep neural networks are vulnerable to adversarial examples. Goodfellow et al. (2014) then proposed FGSM to find adversarial examples and also introduced the adversarial training framework to strengthen the defense of models against adversarial perturbations.

Empirical Works on Robust Overfitting. One surprising behavior of deep learning is that over-parameterized neural networks can generalize well, a phenomenon also called benign overfitting: deep models exhibit not only powerful memorization but also good performance on unseen data (Zhang et al., 2017; Belkin et al., 2019). However, in contrast to standard (non-robust) generalization, Rice et al. (2020) empirically investigate the robust performance of models trained by adversarial training methods, which are used to improve adversarial robustness (Szegedy et al., 2013; Madry et al., 2017), and show that robust overfitting can be observed on multiple datasets, including CIFAR10 and ImageNet.

Theoretical Works on Robust Overfitting. A line of works (Schmidt et al., 2018; Balaji et al., 2019; Dan et al., 2020) studies the sample complexity of adversarial robustness, showing that adversarially robust generalization requires more data than the standard setting, which explains the robust generalization gap from the perspective of statistical learning theory. Another line of works (Tsipras et al., 2018; Zhang et al., 2019) proposes a principle called the robustness-accuracy trade-off and proves it theoretically in different settings, which mainly explains the widely observed drop in robust test accuracy as a trade-off between adversarial robustness and clean accuracy. Recently, Li et al. (2022) investigated the robust expressive power of neural networks and showed that robust generalization requires exponentially large models.

Feature Learning Theory of Deep Learning. The feature learning theory of neural networks (Allen-Zhu & Li, 2020a, b, 2022; Shen et al., 2022; Jelassi & Li, 2022; Jelassi et al., 2022; Chen et al., 2022; Chidambaram et al., 2023) studies how features are learned in deep learning tasks, providing a theoretical analysis paradigm beyond the neural tangent kernel (NTK) theory (Jacot et al., 2018; Du et al., 2018, 2019; Allen-Zhu et al., 2019; Arora et al., 2019). In this work, we make a first step towards understanding the clean generalization and robust overfitting (CGRO) phenomenon in adversarial training by analyzing the feature learning process under our theoretical framework for structured data.

3 Preliminaries

In this section, we first introduce some useful notations that are used in this paper, and then present our problem setup.

3.1 Notations

Throughout this work, we use regular letters for scalars and bold letters for vectors. We use $[k]$ to denote the index set $\{1,2,\cdots,k\}$. The indicator of an event is defined as $\mathbb{I}\{\cdot\}=1$ if the event holds and $\mathbb{I}\{\cdot\}=0$ otherwise. We use $\operatorname{sgn}(\cdot)$ to denote the sign function, and $\operatorname{span}(\bm{v}_1,\bm{v}_2,\cdots,\bm{v}_n)$ to denote the linear span of the vectors $\bm{v}_1,\bm{v}_2,\cdots,\bm{v}_n\in\mathbb{R}^{D}$.

For any two sequences $\{A_n\}_{n=0}^{\infty}$ and $\{B_n\}_{n=0}^{\infty}$, we write $A_n=O(B_n)$ if there exist absolute constants $C_1>0$ and $N_1>0$ such that $|A_n|\leq C_1|B_n|$ for all $n\geq N_1$. Similarly, we write $A_n=\Omega(B_n)$ if there exist $C_2>0$ and $N_2>0$ such that $|A_n|\geq C_2|B_n|$ for all $n>N_2$. We say $A_n=\Theta(B_n)$ if both $A_n=O(B_n)$ and $A_n=\Omega(B_n)$ hold. We use $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$, and $\widetilde{\Theta}(\cdot)$ to hide logarithmic factors in these notations. Moreover, we write $A_n=\operatorname{poly}(B_n)$ if $A_n=O(B_n^{K})$ for some positive constant $K$, and $A_n=\operatorname{polylog}(B_n)$ if $A_n=\operatorname{poly}(\log(B_n))$. We also say $A_n=o(B_n)$ if for every constant $C_3>0$ there exists $N_3>0$ such that $|A_n|<C_3|B_n|$ for all $n>N_3$. An event is said to happen with high probability (w.h.p. for short) if it happens with probability at least $1-o(1)$.

We use $\|\cdot\|_p$, $p\in[1,+\infty]$, to denote the $\ell_p$ norm on the vector space $\mathbb{R}^{D}$. For two sets $\mathcal{A},\mathcal{B}\subset\mathbb{R}^{D}$, we define the $\ell_p$-distance between $\mathcal{A}$ and $\mathcal{B}$ as $\operatorname{dist}_p(\mathcal{A},\mathcal{B}):=\inf\left\{\|\bm{X}-\bm{Y}\|_p:\bm{X}\in\mathcal{A},\bm{Y}\in\mathcal{B}\right\}$. For $r>0$, $\mathbb{B}_p(\bm{X},r):=\left\{\bm{Z}\in\mathbb{R}^{D}:\left\|\bm{Z}-\bm{X}\right\|_p\leq r\right\}$ is the $\ell_p$-ball of radius $r$ centered at $\bm{X}$.

A multi-layer neural network is a function from inputs in $\mathbb{R}^{D}$ to outputs in $\mathbb{R}^{m}$, defined as follows:

$$\bm{F}_{\bm{W},\bm{B}}\left(\bm{X}\right):=\bm{W}_L\,\sigma\left(\bm{W}_{L-1}\,\sigma\left(\cdots\bm{W}_1\,\sigma\left(\bm{W}_0\bm{X}+\bm{B}_0\right)+\bm{B}_1\cdots\right)+\bm{B}_{L-1}\right)+\bm{B}_L$$

and

$$\bm{W}_0\in\mathbb{R}^{m_0\times D},\quad \bm{W}_i\in\mathbb{R}^{m_i\times m_{i-1}},\ 1\leq i\leq L-1,\quad \bm{W}_L\in\mathbb{R}^{m\times m_{L-1}},$$
$$\bm{B}_i\in\mathbb{R}^{m_i},\ 0\leq i\leq L-1,\quad \bm{B}_L\in\mathbb{R}^{m},\quad \bm{X}\in\mathbb{R}^{D},\quad \bm{F}_{\bm{W},\bm{B}}\left(\bm{X}\right)\in\mathbb{R}^{m},$$

where the weights $\bm{W}=[\bm{W}_0,\bm{W}_1,\cdots,\bm{W}_L]$ and the biases $\bm{B}=[\bm{B}_0,\bm{B}_1,\cdots,\bm{B}_L]$ are learnable parameters, and $\max\{D,m_0,m_1,\cdots,m_{L-1},m\}$ and $L+1$ are the width and depth of the neural network, respectively. We use $\sigma(\cdot)$ to denote the (non-linear) entry-wise activation function; the ReLU activation is defined as $\sigma(\cdot):=\max(\cdot,0)$.
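For concreteness, the following minimal numpy sketch (our illustration; the function names and data layout are assumptions, not part of the formal setup) implements the forward pass of $\bm{F}_{\bm{W},\bm{B}}$ with ReLU activation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(Ws, Bs, X):
    # Ws = [W_0, ..., W_L] and Bs = [B_0, ..., B_L] as in the definition above;
    # the activation is applied after every layer except the output layer.
    h = X
    for W, B in zip(Ws[:-1], Bs[:-1]):
        h = relu(W @ h + B)
    return Ws[-1] @ h + Bs[-1]
```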

3.2 Problem Setup

We consider a binary classification setting, where $\bm{X}\in\mathcal{X}\subset\mathbb{R}^{D}$ denotes the data input and the binary label $y$ lies in $\mathcal{Y}=\{-1,1\}$. Given a data distribution $\mathcal{D}$, a joint distribution over the support $\mathcal{X}\times\mathcal{Y}$, and a function $f:\mathcal{X}\rightarrow\mathbb{R}$ as the classifier, we define the following measurements of the clean (robust) classification performance of the classifier $f$ on $\mathcal{D}$.

Definition 3.1.

(Clean Test Error) The clean test error of the classifier $f$ w.r.t. the data distribution $\mathcal{D}$ is defined as $\mathcal{L}_{\mathcal{D}}(f):=\mathbb{P}_{(\bm{X},y)\sim\mathcal{D}}\left[\operatorname{sgn}(f(\bm{X}))\neq y\right]$.

Definition 3.2.

(Robust Test Error) Given an $\ell_p$-robust radius $\delta\geq 0$, the robust test error of the classifier $f$ w.r.t. the data distribution $\mathcal{D}$ and $\delta$ under the $\ell_p$ norm is defined as $\mathcal{L}_{\mathcal{D}}^{p,\delta}(f):=\mathbb{E}_{(\bm{X},y)\sim\mathcal{D}}\left[\max_{\|\bm{X}'-\bm{X}\|_p\leq\delta}\mathbb{I}\{\operatorname{sgn}(f(\bm{X}'))\neq y\}\right]$.

In our work, we mainly focus on the cases $p=2$ and $p=\infty$; our results can be extended to the general $p$ case as well.

In adversarial training, with access to a training dataset $\mathcal{S}=\{(\bm{X}_1,y_1),(\bm{X}_2,y_2),\dots,(\bm{X}_N,y_N)\}$ randomly sampled from the data distribution $\mathcal{D}$, we aim to minimize the following robust training error to derive a robust classifier.

Definition 3.3.

(Robust Training Error) Given an $\ell_p$-robust radius $\delta\geq 0$, the robust training error of the classifier $f$ w.r.t. the training dataset $\mathcal{S}$ and $\delta$ under the $\ell_p$ norm is defined as $\mathcal{L}_{\mathcal{S}}^{p,\delta}(f):=\frac{1}{N}\sum_{i=1}^{N}\max_{\|\bm{X}_i'-\bm{X}_i\|_p\leq\delta}\mathbb{I}\{\operatorname{sgn}(f(\bm{X}_i'))\neq y_i\}$.

Now, we present the concept of a CGRO classifier, the main object of study in this paper, in the following definition.

Definition 3.4.

(CGRO Classifier) Given an $\ell_p$-robust radius $\delta\geq 0$, we say a classifier $f$ is a CGRO classifier w.r.t. the data distribution $\mathcal{D}$ and the training dataset $\mathcal{S}$ if it satisfies $\mathcal{L}_{\mathcal{D}}(f)=o(1)$ and $\mathcal{L}_{\mathcal{S}}^{p,\delta}(f)=o(1)$ but $\mathcal{L}_{\mathcal{D}}^{p,\delta}(f)=\Omega(1)$.

Remark 3.5.

In the above definition of a CGRO classifier, the asymptotic notations $o(1)$ and $\Omega(1)$ are both defined with respect to the data dimension $D$, which means that a CGRO classifier has good clean test performance but poor robust generalization at the same time.
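To make these quantities concrete, the following PyTorch sketch estimates them empirically for a trained binary classifier. The PGD inner maximization and all names here are our illustrative assumptions; Definitions 3.1-3.3 use the exact maxima, which PGD only approximates.

```python
import torch

def pgd_attack(f, X, y, delta, steps=10, lr=None):
    # Approximate the inner max over the l_inf ball of radius delta (the p = inf case).
    lr = lr or 2.5 * delta / steps
    X_adv = X.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = torch.nn.functional.softplus(-y * f(X_adv)).sum()  # logistic surrogate
        grad, = torch.autograd.grad(loss, X_adv)
        with torch.no_grad():
            X_adv += lr * grad.sign()                          # steepest ascent step
            X_adv.copy_(X + (X_adv - X).clamp(-delta, delta))  # project onto the ball
    return X_adv.detach()

def clean_and_robust_error(f, X, y, delta):
    clean = (torch.sign(f(X)) != y).float().mean().item()       # estimates L_D(f)
    X_adv = pgd_attack(f, X, y, delta)
    robust = (torch.sign(f(X_adv)) != y).float().mean().item()  # estimates L_D^{p,delta}(f)
    return clean, robust
```

Empirically, a model is in the CGRO regime when these estimates give a small robust error on the training set, while on held-out data the clean error is small but the robust error stays bounded away from zero.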

4 Analyzing the CGRO Phenomenon from the View of Representation Complexity

In this section, we provide a theoretical understanding of the CGRO phenomenon from the view of representation complexity. First, we present our data assumptions as follows.

For the data distribution $\mathcal{D}$ and $\ell_p$-robust radius $\delta>0$, the support $\mathcal{X}$ of the data input can be divided into two sets $\mathcal{X}_+$ and $\mathcal{X}_-$ corresponding to the positive and negative classes, respectively, i.e., $\mathcal{X}_+:=\{\bm{X}\in\mathcal{X}:(\bm{X},y)\in\mathcal{D},y=1\}$ and $\mathcal{X}_-:=\{\bm{X}\in\mathcal{X}:(\bm{X},y)\in\mathcal{D},y=-1\}$.

Assumption 4.1.

(Bounded) There exists an absolute constant $\mathcal{R}>0$ such that, with high probability over the data distribution $\mathcal{D}$, the data input satisfies $\bm{X}\in[-\mathcal{R},\mathcal{R}]^{D}$.

Recall the definition of a CGRO classifier (Definition 3.4). W.l.o.g., we focus only on the case $\mathcal{X}\subset[-\mathcal{R},\mathcal{R}]^{D}$.

Assumption 4.2.

(Well-Separated) We assume that the separation between the positive and negative classes is larger than twice the robust radius, i.e., $\operatorname{dist}_p(\mathcal{X}_+,\mathcal{X}_-)>2\delta$.

This assumption is necessary to ensure the existence of a robust classifier, and it is indeed empirically observed in real-world image classification data (Yang et al., 2020).

Assumption 4.3.

(Polynomial-Size Clean Classifier Exists) We assume that there exists a clean classifier $f_{\textit{clean}}:\mathcal{X}\rightarrow\mathbb{R}$, represented as a ReLU network with $\operatorname{poly}(D)$ parameters, such that $\mathcal{L}_{\mathcal{D}}(f_{\textit{clean}})=o(1)$ but $\mathcal{L}_{\mathcal{D}}^{p,\delta}(f_{\textit{clean}})=\Omega(1)$.

In Assumption 4.3, the asymptotic notations $o(1)$ and $\Omega(1)$ are defined w.r.t. the data dimension $D$. The assumption means that a clean classifier of moderate size can perfectly classify the natural data but fails on adversarially perturbed data, which is consistent with the common observation that well-trained neural networks are vulnerable to adversarial examples (Szegedy et al., 2013; Raghunathan et al., 2019).

Now, we present our main result on the upper bound of the representation complexity of CGRO classifiers as follows.

Theorem 4.4.

(Polynomial Upper Bound for CGRO) Under Assumptions 4.1, 4.2 and 4.3, with an $N$-sample training dataset $\mathcal{S}=\{(\bm{X}_1,y_1),(\bm{X}_2,y_2),\dots,(\bm{X}_N,y_N)\}$ drawn from the data distribution $\mathcal{D}$, there exists a CGRO classifier $f_{\textit{CGRO}}:\mathcal{X}\rightarrow\mathbb{R}$ that can be represented as a ReLU network with at most $\operatorname{poly}(D)+\tilde{O}(ND)$ parameters.

Proof Sketch. First, we show the existence of a CGRO classifier w.r.t. the data distribution $\mathcal{D}$ and the training dataset $\mathcal{S}$.

Lemma 4.5.

Given the data distribution $\mathcal{D}$ satisfying Assumptions 4.1, 4.2 and 4.3, and a randomly sampled training dataset $\mathcal{S}=\{(\bm{X}_1,y_1),(\bm{X}_2,y_2),\dots,(\bm{X}_N,y_N)\}$, consider the following function $f_{\mathcal{S}}:\mathcal{X}\rightarrow\mathbb{R}$ defined as

$$f_{\mathcal{S}}(\bm{X})=\underbrace{f_{\textit{clean}}(\bm{X})\left(1-\mathbb{I}\{\bm{X}\in\cup_{i=1}^{N}\mathbb{B}_p(\bm{X}_i,\delta)\}\right)}_{\text{clean classification on unseen test data}}+\underbrace{\sum_{i=1}^{N}y_i\,\mathbb{I}\{\bm{X}\in\mathbb{B}_p(\bm{X}_i,\delta)\}}_{\text{robust memorization on training data}}.$$

Then the function $f_{\mathcal{S}}$ is a CGRO classifier.

Due to Assumption 4.2, we have $\mathbb{B}_p(\bm{X}_i,\delta)\cap\mathbb{B}_p(\bm{X}_j,\delta)=\emptyset$ for any distinct $i,j\in[N]$. Thus, $f_{\mathcal{S}}$ correctly classifies unseen clean test data via the first summand (Assumption 4.3) and robustly memorizes the training data via the second summand, so it is a CGRO classifier.

Back to the proof of Theorem 4.4: the key idea is to approximate the classifier $f_{\mathcal{S}}$ proposed in Lemma 4.5 by a ReLU network. Indeed, the function $f_{\mathcal{S}}$ can be rewritten as

$$f_{\mathcal{S}}(\bm{X})=\underbrace{\sum_{i=1}^{N}\left(y_i-f_{\textit{clean}}(\bm{X})\right)\mathbb{I}\{\|\bm{X}-\bm{X}_i\|_p\leq\delta\}}_{\text{weighted sum of robust local indicators}}+\underbrace{f_{\textit{clean}}(\bm{X})}_{\operatorname{poly}(D)}.$$

Therefore, we use ReLU networks to approximate the distance function $d_i(\bm{X}):=\|\bm{X}-\bm{X}_i\|_p$ on $[-\mathcal{R},\mathcal{R}]^{D}$ efficiently, and we note that the indicator $\mathbb{I}\{\cdot\}$ can be approximated by a soft indicator represented by two ReLU neurons. Combining these results, there exists a ReLU network $f_{\textit{CGRO}}$ with $\operatorname{poly}(D)+\tilde{O}(ND)$ parameters such that $\|f_{\textit{CGRO}}-f_{\mathcal{S}}\|_{\infty}=o(1)$, which implies Theorem 4.4. $\square$
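The following numpy sketch illustrates this construction: a soft indicator built from two ReLU neurons, plugged into the rewriting of $f_{\mathcal{S}}$ above. The width parameter `eps` and the exact distance computation are our simplifications (the formal proof approximates $d_i$ itself by a ReLU network).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def soft_indicator(dist, delta, eps=1e-2):
    # Two ReLU neurons: equals 1 when dist <= delta, 0 when dist >= delta + eps,
    # and interpolates linearly in between.
    return (relu(delta + eps - dist) - relu(delta - dist)) / eps

def f_S(X, f_clean, X_train, y_train, delta, p=2):
    dists = np.linalg.norm(X_train - X, ord=p, axis=1)   # d_i(X) = ||X - X_i||_p
    ind = soft_indicator(dists, delta)                   # robust local indicators
    # Weighted sum of local indicators plus the clean classifier:
    return np.sum((y_train - f_clean(X)) * ind) + f_clean(X)
```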

Remark 4.6.

Theorem 4.4 manifests that a ReLU network with only $\tilde{O}(ND)$ extra parameters is able to leverage robust memorization (Lemma 4.5) to achieve the CGRO regime.

However, achieving robust generalization seems to require much higher complexity. We generalize the result of Li et al. (2022) from the linearly separable setting to Assumption 4.3.

Theorem 4.7.

(Exponential Lower Bound for Robust Classifier) Let $\mathcal{F}_M$ be the family of functions represented by ReLU networks with at most $M$ parameters. There exists a number $M_D=\Omega(\exp(D))$ and a distribution $\mathcal{D}$ satisfying Assumptions 4.1, 4.2 and 4.3 such that, for any classifier in the family $\mathcal{F}_{M_D}$, the robust test error w.r.t. $\mathcal{D}$ is at least $\Omega(1)$.

Proof Sketch. The main idea of the proof of Theorem 4.7 is the linear region decomposition of ReLU networks (Montufar et al., 2014), combined with a technique for bounding the VC dimension on local regions, similar to Li et al. (2022). $\square$

According to Assumption 4.3, Theorem 4.4 and Theorem 4.7, we obtain the following comparison of the representation complexity of classifiers in the different regimes.

$$\underbrace{\textit{Clean Classifier}}_{\operatorname{poly}(D)}\ \lesssim\ \underbrace{\textit{CGRO Classifier}}_{\operatorname{poly}(D)+\tilde{O}(ND)}\ \ll\ \underbrace{\textit{Robust Classifier}}_{\Omega(\exp(D))}.$$
Remark 4.8.

These inequalities state that, while CGRO classifiers have mildly higher representation complexity than clean classifiers, adversarial robustness requires excessively higher complexity, which may drive the classifier trained by adversarial training toward the CGRO regime. We also verify this empirically with adversarial training experiments across different model sizes in Section 6.

5 Analyzing Dynamics of Adversarial Training on Structured Data

In the previous section, we showed the efficiency of CGRO classifiers in terms of representation complexity. However, it remains unclear how a classifier trained by adversarial training converges to the CGRO regime. In this section, we focus on a structured-data setting to study the learning process of adversarial training. First, we introduce our structured-data setting as follows, and then provide a fine-grained analysis of adversarial training under our theoretical framework.

5.1 Structured Data Setting

In this section, we consider the case where the data input $\bm{X}\in\mathbb{R}^{D}$ has the following patch structure, similar to a line of theoretical works on feature learning (Allen-Zhu & Li, 2020b; Chen et al., 2021; Jelassi & Li, 2022; Jelassi et al., 2022; Kou et al., 2023; Chidambaram et al., 2023).

Definition 5.1.

(Patch Data Distribution) We define a data distribution $\mathcal{D}$ in which each instance consists of an input $\bm{X}\in\mathcal{X}=\mathbb{R}^{D}$ and a label $y\in\mathcal{Y}=\{-1,1\}$ generated as follows.

  1. The label $y$ is uniformly drawn from $\{-1,1\}$.

  2. The input $\bm{X}=(\bm{X}[1],\ldots,\bm{X}[P])$, where each patch $\bm{X}[j]\in\mathbb{R}^{d}$ and $P=D/d$ is the number of patches (we assume that $D/d$ is an integer and $P=\operatorname{polylog}(d)$).

  3. Meaningful signal patch: each instance has one and only one meaningful patch $\operatorname{signal}(\bm{X})\in[P]$ satisfying $\bm{X}[\operatorname{signal}(\bm{X})]=\alpha y\bm{w}^{*}$, where $\bm{w}^{*}\in\mathbb{R}^{d}$ ($\|\bm{w}^{*}\|_2=1$) is the unit meaningful signal vector and $\alpha>0$ is the norm of the signal.

  4. Noisy patches: $\bm{X}[j]\sim\mathcal{N}\left(0,\left(\mathbf{I}_d-\bm{w}^{*}\bm{w}^{*\top}\right)\sigma_p^{2}\right)$ for $j\in[P]\setminus\{\operatorname{signal}(\bm{X})\}$, where $\sigma_p^{2}>0$ denotes the variance of the noise.

Remark 5.2.

The patch structure we leverage can be viewed as a simplification of real-world vision recognition datasets. Indeed, images divide into signal patches that are meaningful for classification, such as the whiskers of a cat or the nose of a dog, and noisy patches, such as the uninformative background of a photo. Our patch data assumption can also be generalized to the case where a set of patches is meaningful, but analyzing the learning process would then become too complicated and would obscure the main idea we want to present. Therefore, we focus on the single-meaningful-patch case in this work.
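As a concrete illustration, the following numpy sketch samples from the distribution of Definition 5.1; fixing $\bm{w}^{*}=\bm{e}_1$ and the uniform placement of the signal patch are our arbitrary choices.

```python
import numpy as np

def sample_patch_data(n, d, P, alpha, sigma_p, seed=0):
    rng = np.random.default_rng(seed)
    w_star = np.zeros(d); w_star[0] = 1.0        # unit signal vector w* (here e_1, w.l.o.g.)
    proj = np.eye(d) - np.outer(w_star, w_star)  # I_d - w* w*^T: noise orthogonal to w*
    X = np.empty((n, P, d))
    y = rng.choice([-1, 1], size=n)              # labels uniform on {-1, 1}
    for i in range(n):
        X[i] = rng.normal(0.0, sigma_p, size=(P, d)) @ proj  # noisy patches
        j = rng.integers(P)                      # position of the meaningful patch
        X[i, j] = alpha * y[i] * w_star          # signal patch: alpha * y * w*
    return X, y
```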

To learn our structured data, we use a two-layer convolutional neural network (CNN) (LeCun et al., 1998; Krizhevsky et al., 2012) with non-linear activation as the learner network.

Definition 5.3.

(Two-Layer Convolutional Network) For a given input $\bm{X}\in\mathbb{R}^{Pd}$, our network learner $f_{\bm{W}}:\mathbb{R}^{Pd}\rightarrow\mathbb{R}$ is defined as

$$f_{\bm{W}}(\bm{X})=\sum_{r=1}^{m}\sum_{j=1}^{P}a_r^{+}\,\sigma\left(\left\langle\bm{w}_r^{+},\bm{X}[j]\right\rangle\right)-\sum_{r=1}^{m}\sum_{j=1}^{P}a_r^{-}\,\sigma\left(\left\langle\bm{w}_r^{-},\bm{X}[j]\right\rangle\right),$$

where $\bm{W}=[\{a_r^{+}\}_{r=1}^{m},\{a_r^{-}\}_{r=1}^{m},\{\bm{w}_r^{+}\}_{r=1}^{m},\{\bm{w}_r^{-}\}_{r=1}^{m}]$ are learnable parameters, and $\sigma(\cdot)$ is the $\operatorname{ReLU}^{q}$ activation function defined as $\sigma(\cdot)=\left(\max(\cdot,0)\right)^{q}$ with $q\geq 2$.

$\operatorname{ReLU}^{q}$ activation functions are useful for addressing the non-smoothness of the ReLU function at zero and are widely applied in the deep learning theory literature (Kileel et al., 2019; Allen-Zhu & Li, 2020a, b; Chen et al., 2021; Jelassi & Li, 2022; Jelassi et al., 2022). To simplify our analysis, we fix the second-layer weights as $a_r^{+}=a_r^{-}=\frac{1}{2}$. We also assume that $q$ is odd and that $\bm{w}_r:=\bm{w}_r^{+}=-\bm{w}_r^{-}$ holds during the optimization process. At initialization, we sample $\bm{w}_r$ i.i.d. from $\mathcal{N}(0,\sigma_0^{2}\mathbf{I}_d)$. Our results can be extended to the case where $q$ is even and $\bm{w}_r^{+}$, $-\bm{w}_r^{-}$ differ.
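Under these simplifications, the learner reduces to the following numpy sketch (our illustration): with $a_r^{+}=a_r^{-}=\frac{1}{2}$, tied weights, and odd $q$, the two sums combine into $f_{\bm{W}}(\bm{X})=\frac{1}{2}\sum_{r,j}\langle\bm{w}_r,\bm{X}[j]\rangle^{q}$.

```python
import numpy as np

def relu_q(z, q):
    return np.maximum(z, 0.0) ** q

def f_W(W, X, q=3):
    # W: (m, d) matrix whose rows are the filters w_r; X: (P, d) matrix of patches.
    pre = X @ W.T                                     # <w_r, X[j]> for all j, r
    return 0.5 * (relu_q(pre, q) - relu_q(-pre, q)).sum()
```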

The Role of Non-linearity. A series of recent theoretical works (Li et al., 2019; Chen et al., 2021; Javanmard & Soltanolkotabi, 2022) show that linear models can achieve robust generalization under adversarial training in certain settings, which fails to explain the CGRO phenomenon observed in practice. To mitigate this gap, we improve the expressive power of the model by using a non-linear activation, which characterizes the data structure and the learning process more precisely.

In adversarial training, we aim to minimize the following adversarial training loss, a trade-off between natural risk and robust regularization, defined as follows.

Definition 5.4.

(Adversarial Training Loss) For a hyperparameter $\lambda>0$ and an $\ell_p$-robust radius $\delta>0$, the adversarial training loss of the network $f_{\bm{W}}$ w.r.t. the training dataset $\mathcal{S}=\{(\bm{X}_1,y_1),(\bm{X}_2,y_2),\dots,(\bm{X}_N,y_N)\}$ is defined as

$$\widehat{\mathcal{L}}_{\textit{adv}}(\bm{W}):=\frac{1}{N}\sum_{i=1}^{N}\Bigg[\underbrace{\mathcal{L}\left(\bm{W};\bm{X}_i,y_i\right)}_{\text{natural risk}}+\lambda\cdot\underbrace{\max_{\widehat{\bm{X}}_i\in\mathbb{A}_p\left(\bm{X}_i,\delta\right)}\left[\mathcal{L}\left(\bm{W};\widehat{\bm{X}}_i,y_i\right)-\mathcal{L}\left(\bm{W};\bm{X}_i,y_i\right)\right]}_{\text{robust regularization}}\Bigg],$$

where $\mathcal{L}(\bm{W};\bm{X},y)$ denotes the single-point loss of $f_{\bm{W}}$ on $(\bm{X},y)$, and $\mathbb{A}_p\left(\bm{X},\delta\right)$ denotes the perturbation range of the data point $\bm{X}$ with $\ell_p$-radius $\delta$.

Remark 5.5.

The adversarial training loss in Definition 5.4 gives a general form of adversarial training methods (Goodfellow et al., 2014; Madry et al., 2017; Zhang et al., 2019) for different values of the hyperparameter $\lambda$ and different choices of the loss function $\mathcal{L}(\bm{W};\bm{X},y)$ and perturbation range $\mathbb{A}_p\left(\cdot,\delta\right)$.

Here, we use the logistic loss defined as $\mathcal{L}(\bm{W};\bm{X},y):=\log\left(1+\exp\{-yf_{\bm{W}}(\bm{X})\}\right)$, and we apply the perturbation range $\mathbb{A}_p\left(\bm{X},\delta\right):=\mathbb{B}_p\left(\bm{X},\delta\right)\cap\left(\bm{X}+\bm{\Delta}\left(\bm{X}\right)\right)$, where $\bm{\Delta}\left(\bm{X}\right)\subset\mathbb{R}^{Pd}$ satisfies

$$\bm{\Delta}\left(\bm{X}\right)[j]=\begin{cases}\operatorname{span}(\bm{w}^{*}), & \text{if }j=\operatorname{signal}(\bm{X}),\\ \{\bm{0}\}, & \text{otherwise.}\end{cases}$$

This perturbation range ensures that the adversarial perturbations used to generate adversarial examples are always aligned with the meaningful signal vector $\bm{w}^{*}$ during adversarial training, which simplifies the optimization analysis.

Definition 5.6.

(Adversarial Training Algorithm) We run standard gradient descent to update the network parameters $\{\bm{W}^{(t)}\}_{t=0}^{T}$ for $T$ iterations w.r.t. the adversarial training loss, which can be rewritten as $\widehat{\mathcal{L}}_{\textit{adv}}^{(t)}(\bm{W})=\frac{1}{N}\sum_{i=1}^{N}(1-\lambda)\mathcal{L}\left(\bm{W};\bm{X}_i,y_i\right)+\lambda\mathcal{L}\left(\bm{W};\bm{X}_i^{\textit{adv},(t)},y_i\right)$, where $\bm{X}_i^{\textit{adv},(t)}$ denotes the adversarial example generated from the $i$-th training point $\bm{X}_i$ at the $t$-th iteration. Concretely, the adversarial examples $\bm{X}_i^{\textit{adv},(t)}$ and the network parameters $\bm{W}^{(t)}$ are updated alternately as

$$\begin{cases}\bm{X}_i^{\textit{adv},(t)}=\operatorname{argmax}_{\widehat{\bm{X}}_i\in\mathbb{A}_p\left(\bm{X}_i,\delta\right)}\mathcal{L}\left(\bm{W}^{(t)};\widehat{\bm{X}}_i,y_i\right),\quad i\in[N],\\ \bm{W}^{(t+1)}=\bm{W}^{(t)}-\eta\nabla_{\bm{W}}\widehat{\mathcal{L}}_{\textit{adv}}^{(t)}\left(\bm{W}^{(t)}\right),\end{cases}$$

where $\eta>0$ is the learning rate.
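A minimal numpy sketch of one iteration of this algorithm for the learner above (reusing `f_W`): because $\mathbb{A}_p(\bm{X},\delta)$ only allows perturbations along $\operatorname{span}(\bm{w}^{*})$ in the signal patch, the inner argmax reduces to a one-dimensional problem over $\beta\in[-\delta,\delta]$, which we solve here by grid search; the grid search and the analytic gradient are our simplifications, not the paper's analysis.

```python
import numpy as np

def logistic_loss(W, X, y, q=3):
    return np.log1p(np.exp(-y * f_W(W, X, q)))        # L(W; X, y)

def adv_example(W, X, y, j_signal, w_star, delta, q=3, grid=51):
    # Inner maximization over A_p(X, delta): perturb only along span(w*).
    best_X, best_loss = X, -np.inf
    for beta in np.linspace(-delta, delta, grid):
        X_hat = X.copy(); X_hat[j_signal] += beta * w_star
        loss = logistic_loss(W, X_hat, y, q)
        if loss > best_loss:
            best_X, best_loss = X_hat, loss
    return best_X

def grad_loss(W, X, y, q=3):
    # Analytic gradient for the tied-weight learner, where f_W(X) = 0.5 * sum <w_r, X[j]>^q.
    pre = X @ W.T                                     # (P, m) pre-activations
    psi = 1.0 / (1.0 + np.exp(y * f_W(W, X, q)))      # negative sigmoid psi(y f)
    df_dW = 0.5 * q * (pre ** (q - 1)).T @ X          # (m, d): rows are df/dw_r
    return -y * psi * df_dW

def adv_train_step(W, data, w_star, delta, lam, eta, q=3):
    # One descent step on (1/N) sum_i (1-lam) L(W;X_i,y_i) + lam L(W;X_i^adv,y_i).
    grad = np.zeros_like(W)
    for X, y, j_sig in data:                          # data: list of (X, y, signal index)
        X_adv = adv_example(W, X, y, j_sig, w_star, delta, q)
        grad += ((1 - lam) * grad_loss(W, X, y, q)
                 + lam * grad_loss(W, X_adv, y, q)) / len(data)
    return W - eta * grad
```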

Next, we make the following assumptions about the hyperparameters introduced in our structured-data setting.

Assumption 5.7.

(Choice of Hyperparameters) We assume that:

$$\alpha=\Theta(d^{c_\alpha}),\quad \sigma_p=\Theta(d^{-c_p}),\quad m=\Theta(N)=\operatorname{poly}(d),\quad \sigma_0=\frac{\operatorname{polylog}(d)}{\sqrt{d}},$$
$$\delta=\alpha\left(1-\frac{1}{\sqrt{d}\operatorname{polylog}(d)}\right),\quad \lambda\in\left[\frac{1}{\operatorname{poly}(d)},1\right),\quad \eta\in\left(0,\frac{1}{\operatorname{poly}(d)}\right],$$

where $c_\alpha,c_p\in(0,1)$ are constants satisfying $c_\alpha+c_p>\frac{1}{2}$.

Discussion of Hyperparameter Choices. The choices of these hyperparameters are not unique; we make concrete choices above for the sake of the calculations in our proofs, but only the relationships between them matter. Namely, since the norm of the signal patch is $\alpha$ and the norm of a noise patch is w.h.p. $\Theta(\sigma_p\sqrt{d})$, our choices ensure that the meaningful signal is stronger than the noise. Without this assumption, i.e., if the signal-to-noise ratio is very low, even clean generalization fails, as has been shown theoretically under similar patch-structured data settings (Cao et al., 2021; Chen et al., 2021; Frei et al., 2022). The width of the network learner is $\tilde{O}(ND)$, matching the mild over-parameterization mentioned in Theorem 4.4. The separation in Assumption 4.2 also holds since $\delta<\alpha$.

5.2 Main Results

First, we introduce a concept called feature learning to characterize what the model learns from the training dynamics.

Definition 5.8.

(Feature Learning) Given a training data point $(\bm{X},y)\sim\mathcal{D}$ and the network learner $f_{\bm{W}}$, we focus on the following two types of feature learning.

  • True Feature Learning. We project the weights $\bm{W}$ onto the meaningful signal vector to measure the correlation between the model and the true feature:

    $$\mathcal{U}:=\sum_{r=1}^{m}\left\langle\bm{w}_r,\bm{w}^{*}\right\rangle^{q}.$$

  • Spurious Feature Learning. We project the weights $\bm{W}$ onto the random noise to measure the correlation between the model and the spurious features:

    $$\mathcal{V}:=\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{w}_r,y\bm{X}[j]\right\rangle^{q}.$$

Thus, the model correctly classifies the clean data in two cases. One is that the model learns the true feature and ignores the spurious features, i.e., $\mathcal{U}=\Omega(1)\gg|\mathcal{V}|$. The other is that the model does not learn the true feature but memorizes the spurious features, i.e., $\mathcal{U}=o(1)$ and $\mathcal{V}=\Omega(1)\gg 0$. Perturbed data can be analyzed similarly.
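The following numpy sketch (names and shapes are our assumptions) computes these quantities, together with the weight-wise components $u$ and $v_{i,j}$ used later in Section 5.3, for the learner above.

```python
import numpy as np

def true_feature_U(W, w_star, q=3):
    return np.sum((W @ w_star) ** q)                  # U = sum_r <w_r, w*>^q

def spurious_feature_V(W, X, y, j_signal, q=3):
    noise = np.delete(X, j_signal, axis=0)            # patches j != signal(X)
    return np.sum((y * (noise @ W.T)) ** q)           # V = sum_{r,j} <w_r, y X[j]>^q

def signal_component_u(W, w_star):
    return np.max(W @ w_star)                         # u = max_r <w_r, w*>

def noise_component_v(W, X, y, j):
    return np.max(W @ (y * X[j]))                     # v_{i,j} = max_r <w_r, y X[j]>
```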

Now, we state our main result about the feature learning process.

Theorem 5.9.

Under Assumption 5.7, we run the adversarial training algorithm in Definition 5.6 to update the weights of the network learner for $T=\Omega(\operatorname{poly}(d))$ iterations. Then, with high probability, the network learner

  1. partially learns the true feature, i.e., $\mathcal{U}^{(T)}=\Theta(\alpha^{-q})$;

  2. exactly memorizes the spurious features, i.e., for each $i\in[N]$, $\mathcal{V}_i^{(T)}=\Theta(1)$,

where $\mathcal{U}^{(t)}$ and $\mathcal{V}_i^{(t)}$ are defined for the $i$-th instance $(\bm{X}_i,y_i)$ and the $t$-th iteration as in Definition 5.8. Consequently, the clean test error and the robust training error are both $o(1)$, but the robust test error is at least $\frac{1}{2}-o(1)$, which means that $f_{\bm{W}^{(T)}}$ is a CGRO classifier.

Remark 5.10.

Theorem 5.9 states that, during adversarial training, the neural network partially learns the true feature of the target classes and exactly memorizes the spurious features tied to specific training data. Consequently, the network learner is able to correctly classify clean data using the partially learned meaningful signal (clean generalization), but fails to classify unseen perturbed data, since it leverages only data-wise random noise to memorize the training adversarial examples (robust overfitting). We also conduct numerical simulations to confirm these results in Section 6.

Figure 2: (a)(b): The effect of network capacity on performance; we train networks of varying capacity on MNIST (a) and CIFAR10 (b). (c): Feature learning process of the two-layer convolutional network on the structured data.

5.3 Analysis of Learning Process

Next, we provide a proof sketch of Theorem 5.9. To obtain a detailed analysis of the learning process, we consider the following objects, which can be viewed as weight-wise versions of $\mathcal{U}^{(t)}$ and $\mathcal{V}_i^{(t)}$. We define $u^{(t)}$ and $v_{i,j}^{(t)}$ as

$$\begin{cases}u^{(t)}:=\max_{r\in[m]}\left\langle\bm{w}_r^{(t)},\bm{w}^{*}\right\rangle & (\textit{Signal Component}),\\ v_{i,j}^{(t)}:=\max_{r\in[m]}\left\langle\bm{w}_r^{(t)},y_i\bm{X}_i[j]\right\rangle & (\textit{Noise Component}),\end{cases}$$

for each $i\in[N]$ and $j\in[P]\setminus\{\operatorname{signal}(\bm{X}_i)\}$.

We then propose a three-stage analysis technique to decouple the complicated feature learning process as follows.

Phase I: At the beginning, the signal component of the lottery-ticket winners, $u^{(t)}$, increases quadratically (Lemma 5.11).

Lemma 5.11.

(Lower Bound of Signal Component Growth) For any $t\geq 0$, the signal component grows as

$$u^{(t+1)}\geq u^{(t)}+\Theta(\eta\alpha^{q})\left(u^{(t)}\right)^{q-1}\psi\left(\alpha^{q}\,\mathcal{U}^{(t)}\right),$$

where $\psi(\cdot)$ denotes the negative sigmoid function $\psi(z)=\frac{1}{1+e^{z}}$, as in Lemmas 5.12 and 5.13.

Lemma 5.11 manifests that the signal component increases quadratically from initialization. Therefore, after $T_0=\Theta\left(\eta^{-1}\alpha^{-q}\sigma_0^{-1}\right)$ iterations, the signal component $u^{(T_0)}$ attains the order $\tilde{\Omega}(\alpha^{-1})$, which implies that the model has learned the partial true feature at this point.

Phase II: Once the signal component $u^{(t)}$ attains the order $\tilde{\Omega}(\alpha^{-1})$, it nearly stops growing, since its increment is now mostly dominated by the noise component (Lemma 5.12).

Lemma 5.12.

(Upper Bound of Signal Component Growth) For $T_0=\Theta\left(\eta^{-1}\alpha^{-q}\sigma_0^{-1}\right)$ and any $t\in[T_0,T]$, the signal component is upper bounded as

$$u^{(t)}\leq\tilde{O}\left(\frac{\eta(\alpha-\delta)^{q}}{N}\right)\sum_{s=T_0}^{t-1}\sum_{i=1}^{N}\psi\left((\alpha-\delta)^{q}\,\mathcal{U}^{(s)}+\mathcal{V}_i^{(s)}\right)+\tilde{O}(\alpha^{-1}).$$

Lemma 5.12 shows that, after partial true feature learning, the increment of the signal component is mostly dominated by the noise component $\mathcal{V}_i^{(t)}$, which implies that the growth of the signal component stalls once $\mathcal{V}_i^{(t)}=\Omega(1)$.

Phase III: After that, by the quadratic increment of the noise component (Lemma 5.13), the total noise $\mathcal{V}_i^{(t)}$ eventually attains the order $\Omega(1)$, which implies that the model memorizes the spurious features (data-wise noise) in the end.

Lemma 5.13.

(Lower Bound of Noise Component Growth) For each $i\in[N]$, $j\in[P]\setminus\{\operatorname{signal}(\bm{X}_i)\}$ and any $t\geq 1$, the noise component grows as

$$v_{i,j}^{(t)}\geq v_{i,j}^{(0)}+\Theta\left(\frac{\eta\sigma_p^{2}d}{N}\right)\sum_{s=0}^{t-1}\psi\left(\mathcal{V}_i^{(s)}\right)\left(v_{i,j}^{(s)}\right)^{q-1}-\tilde{O}\left(P\sigma_p^{2}\alpha^{-1}\sqrt{d}\right).$$

The practical implication of Lemma 5.13 is two-fold. First, by the quadratic increment of the noise component, we derive that, after $T_1=\Theta\left(N\eta^{-1}\sigma_0^{-1}\sigma_p^{-3}d^{-\frac{3}{2}}\right)$ iterations, the total noise memorization $\mathcal{V}_i^{(T)}$ attains the order $\Omega(1)$, which suggests that the model robustly classifies the training adversarial examples by memorizing the data-wise noise. Second, combined with Lemma 5.12, the signal component $u^{(t)}$ maintains the order $\Theta(\alpha^{-1})$, which immediately implies the main conclusion of Theorem 5.9. The full detailed proof of Theorem 5.9 can be found in Appendix E.
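To visualize the three phases, the following toy numpy simulation iterates crudely simplified versions of the recursions in Lemmas 5.11 and 5.13 (all constants set by hand, a single training point, and the proxies $\mathcal{U}^{(t)}\approx(\alpha u^{(t)})^{q}$, $\mathcal{V}^{(t)}\approx(v^{(t)})^{q}$); it illustrates the growth pattern only and is not the paper's formal dynamics.

```python
import numpy as np

def psi(z):
    return 1.0 / (1.0 + np.exp(z))       # negative sigmoid from the lemmas

q, eta, alpha = 3, 0.05, 4.0
noise_rate = 1.0                          # stands in for Theta(sigma_p^2 d / N)
u, v = 1e-2, 1e-2                         # small positive initializations
for t in range(6001):
    U, V = (alpha * u) ** q, v ** q       # crude proxies for U^{(t)} and V^{(t)}
    du = eta * alpha**q * u**(q - 1) * psi(U + V)   # signal growth (Lemmas 5.11/5.12)
    dv = eta * noise_rate * v**(q - 1) * psi(V)     # noise growth (Lemma 5.13)
    u, v = u + du, v + dv
    if t % 1000 == 0:
        print(f"t={t:5d}  u={u:.4f}  v={v:.4f}")
# Typical pattern: u rises quickly and then plateaus at the Theta(1/alpha) scale
# (Phases I-II), while v keeps growing until V = Omega(1) (Phase III).
```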

6 Experiments

Table 1: Performance of models with different sizes (accuracy, %)

Dataset        MNIST                                CIFAR10
Model Size     1      4      8      12     16       1      2      5      10
Clean Test     11.35  11.35  11.35  95.06  94.85    82.56  84.92  85.83  86.05
Robust Test    11.35  11.35  11.35  77.96  83.43    43.39  43.74  46.25  50.08
Robust Train   11.70  11.70  11.70  99.30  99.50    64.19  79.82  97.37  99.57

In this section, we empirically verify our theoretical results from Section 4 and Section 5 with numerical experiments on real-world image recognition datasets and on synthetic structured data, respectively.

6.1 Effect of Different Model Sizes

Experiment Settings. For the MNIST dataset, we consider a simple convolutional network, LeNet-5 (LeCun et al., 1998), and study how its performance changes as we increase the size of the network (i.e., we expand the number of convolutional filters and the size of the fully connected layer by the size multipliers 1, 4, 8, 12, 16). The original network has a convolutional layer with 1 filter, followed by another convolutional layer with 2 filters, and a fully connected hidden layer with 32 units; each convolutional layer is followed by a $2\times 2$ max-pooling layer. For the CIFAR10 dataset, we apply WideResNet-34 (Zagoruyko & Komodakis, 2016) with widen factors 1, 2, 5, 10. We use standard projected gradient descent (PGD) (Madry et al., 2017) for adversarial training. We choose the classical $\ell_\infty$-perturbation with radius $0.3$ for MNIST and $8/255$ for CIFAR10. All models are trained for 100 epochs on MNIST and 200 epochs on CIFAR10 on a single NVIDIA RTX 4090 GPU.
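For reference, a schematic PyTorch sketch of the PGD adversarial training loop is given below; `model`, `loader`, and the step-size choices are placeholders rather than our exact training configuration.

```python
import torch
import torch.nn.functional as F

def pgd(model, X, y, eps, alpha, steps):
    X_adv = X + torch.empty_like(X).uniform_(-eps, eps)    # random start in the ball
    for _ in range(steps):
        X_adv.requires_grad_(True)
        loss = F.cross_entropy(model(X_adv), y)
        grad, = torch.autograd.grad(loss, X_adv)
        X_adv = (X_adv + alpha * grad.sign()).detach()     # l_inf ascent step
        X_adv = (X + (X_adv - X).clamp(-eps, eps)).clamp(0, 1)  # project, keep valid pixels
    return X_adv

def train_epoch(model, loader, opt, eps=8/255, alpha=2/255, steps=10):
    model.train()
    for X, y in loader:
        X_adv = pgd(model, X, y, eps, alpha, steps)
        opt.zero_grad()
        F.cross_entropy(model(X_adv), y).backward()        # minimize the robust loss
        opt.step()
```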

Experiment Results. We report the performance (clean test accuracy, robust test accuracy and robust training accuracy) of models of different sizes in Table 1 and Figure 2 (a)(b). As the model size grows, the robust training loss first decreases while the robust generalization gap remains large; when the model gets even larger, the robust generalization gap gradually decreases. We also find that, in the small-size cases (LeNet with size multipliers 1, 4, 8), adversarial training converges to a trivial solution that always predicts a fixed class, whereas standard training can still learn an accurate clean classifier. These observations correspond to the theoretical results in Theorem 4.4 and Theorem 4.7.

6.2 Synthetic Structured Data

Experiment Settings. Here we generate synthetic data exactly following Definition 5.1 and apply the two-layer convolutional network of Definition 5.3. We train the network with the adversarial training algorithm of Definition 5.6. We choose the hyperparameters as $d=100$, $P=2$, $\alpha=10$, $\sigma_p=1$, $\sigma_0=0.01$, $q=3$, $N=20$, $m=10$, $p=2$, $\delta=10$, $\lambda=0.9$, $\eta=0.1$, $T=100$, a feasible choice under Assumption 5.7. We characterize true feature learning and noise memorization by computing $\mathcal{U}^{(t)}$ and the smallest/largest/average of $\{\mathcal{V}_i^{(t)}\}_{i\in[N]}$. We compute the robust train and test accuracy of the model using the standard PGD attack.

Table 2: Performance on synthetic data (accuracy, %)

Accuracy   Train   Test
Clean      100.0   98.5
Robust     100.0   17.5

Experiment Results. We plot the dynamics of true feature learning and noise memorization in Figure 2 (c). A three-stage phase transition clearly happens during adversarial training, which is consistent with our theoretical analysis of the learning process (Lemmas 5.11, 5.12 and 5.13); finally, the model partially learns the true feature but exactly memorizes all training data (Theorem 5.9). The performance of the model is presented in Table 2, which shows that the network eventually converges to a CGRO classifier.

7 Conclusion

To summarize, we study the puzzling clean generalization and robust overfitting (CGRO) phenomenon in adversarial training and present theoretical explanations. From the perspective of representation complexity, we prove that a CGRO classifier is efficient to achieve by leveraging robust memorization of the training data, while robust generalization requires excessively large model capacity in the worst case, which may lead adversarial training to the CGRO regime. From the perspective of training dynamics, we propose a three-stage analysis technique for the feature learning process of adversarial training under our structured-data framework, showing that a two-layer neural network trained by adversarial training provably learns the partial true feature but memorizes the random noise in the training data, which thereby causes the CGRO regime. On the empirical side, we confirm these theoretical findings on real-world vision datasets and with synthetic data simulations.

References

  • Allen-Zhu & Li (2020a) Allen-Zhu, Z. and Li, Y. Backward feature correction: How deep learning performs deep learning. arXiv preprint arXiv:2001.04413, 2020a.
  • Allen-Zhu & Li (2020b) Allen-Zhu, Z. and Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020b.
  • Allen-Zhu & Li (2022) Allen-Zhu, Z. and Li, Y. Feature purification: How adversarial training performs robust deep learning. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp.  977–988. IEEE, 2022.
  • Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp.  242–252. PMLR, 2019.
  • Anthony et al. (1999) Anthony, M., Bartlett, P. L., Bartlett, P. L., et al. Neural network learning: Theoretical foundations, volume 9. cambridge university press Cambridge, 1999.
  • Arora et al. (2019) Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp.  322–332. PMLR, 2019.
  • Balaji et al. (2019) Balaji, Y., Goldstein, T., and Hoffman, J. Instance adaptive adversarial training: Improved accuracy tradeoffs in neural nets. arXiv preprint arXiv:1910.08051, 2019.
  • Bartlett et al. (2019) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301, 2019.
  • Belkin et al. (2019) Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • Biggio et al. (2013) Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pp.  387–402. Springer, 2013.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Bubeck & Sellke (2023) Bubeck, S. and Sellke, M. A universal law of robustness via isoperimetry. Journal of the ACM, 70(2):1–18, 2023.
  • Cao et al. (2021) Cao, Y., Gu, Q., and Belkin, M. Risk bounds for over-parameterized maximum margin classification on sub-gaussian mixtures. Advances in Neural Information Processing Systems, 34:8407–8418, 2021.
  • Chen et al. (2021) Chen, J., Cao, Y., and Gu, Q. Benign overfitting in adversarially robust linear classification. arXiv preprint arXiv:2112.15250, 2021.
  • Chen et al. (2022) Chen, Z., Deng, Y., Wu, Y., Gu, Q., and Li, Y. Towards understanding mixture of experts in deep learning. arXiv preprint arXiv:2208.02813, 2022.
  • Chidambaram et al. (2023) Chidambaram, M., Wang, X., Wu, C., and Ge, R. Provably learning diverse features in multi-view data with midpoint mixup. In International Conference on Machine Learning, pp.  5563–5599. PMLR, 2023.
  • Dan et al. (2020) Dan, C., Wei, Y., and Ravikumar, P. Sharp statistical guarantees for adversarially robust gaussian classification. In International Conference on Machine Learning, pp.  2345–2355. PMLR, 2020.
  • Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Du et al. (2019) Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp.  1675–1685. PMLR, 2019.
  • Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Frei et al. (2022) Frei, S., Chatterji, N. S., and Bartlett, P. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Conference on Learning Theory, pp.  2668–2703. PMLR, 2022.
  • Gao et al. (2019) Gao, R., Cai, T., Li, H., Hsieh, C.-J., Wang, L., and Lee, J. D. Convergence of adversarial training in overparametrized neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Gowal et al. (2021) Gowal, S., Rebuffi, S.-A., Wiles, O., Stimberg, F., Calian, D. A., and Mann, T. A. Improving robustness using generated data. Advances in Neural Information Processing Systems, 34:4218–4233, 2021.
  • Hassani & Javanmard (2022) Hassani, H. and Javanmard, A. The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. arXiv preprint arXiv:2201.05149, 2022.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Javanmard & Soltanolkotabi (2022) Javanmard, A. and Soltanolkotabi, M. Precise statistical analysis of classification accuracies for adversarial training. The Annals of Statistics, 50(4):2127–2156, 2022.
  • Jelassi & Li (2022) Jelassi, S. and Li, Y. Towards understanding how momentum improves generalization in deep learning. In International Conference on Machine Learning, pp.  9965–10040. PMLR, 2022.
  • Jelassi et al. (2022) Jelassi, S., Sander, M., and Li, Y. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35:37822–37836, 2022.
  • Karimi et al. (2016) Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the polyak-lojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp.  795–811. Springer, 2016.
  • Kileel et al. (2019) Kileel, J., Trager, M., and Bruna, J. On the expressive power of deep polynomial neural networks. Advances in neural information processing systems, 32, 2019.
  • Kirillov et al. (2023) Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • Kou et al. (2023) Kou, Y., Chen, Z., Chen, Y., and Gu, Q. Benign overfitting for two-layer relu networks. arXiv preprint arXiv:2303.04145, 2023.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2022) Li, B., Jin, J., Zhong, H., Hopcroft, J. E., and Wang, L. Why robust generalization in deep learning is difficult: Perspective of expressive power. arXiv preprint arXiv:2205.13863, 2022.
  • Li et al. (2019) Li, Y., Fang, E. X., Xu, H., and Zhao, T. Inductive bias of gradient descent based adversarial training on separable data. arXiv preprint arXiv:1906.02931, 2019.
  • Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. Advances in neural information processing systems, 27, 2014.
  • Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Pang et al. (2022) Pang, T., Lin, M., Yang, X., Zhu, J., and Yan, S. Robustness and accuracy could be reconcilable by (proper) definition. In International Conference on Machine Learning, pp.  17258–17277. PMLR, 2022.
  • Petzka et al. (2020) Petzka, H., Kamp, M., Adilova, L., Boley, M., and Sminchisescu, C. Relative flatness and generalization in the interpolation regime. arXiv preprint arXiv:2001.00939, 2020.
  • Raghunathan et al. (2019) Raghunathan, A., Xie, S. M., Yang, F., Duchi, J. C., and Liang, P. Adversarial training can hurt generalization. arXiv preprint arXiv:1906.06032, 2019.
  • Rice et al. (2020) Rice, L., Wong, E., and Kolter, Z. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pp.  8093–8104. PMLR, 2020.
  • Schmidt et al. (2018) Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. Advances in neural information processing systems, 31, 2018.
  • Shafahi et al. (2019) Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019.
  • Shen et al. (2022) Shen, R., Bubeck, S., and Gunasekar, S. Data augmentation as feature manipulation: a story of desert cows and grass cows. arXiv preprint arXiv:2203.01572, 2022.
  • Silverman (2018) Silverman, B. W. Density estimation for statistics and data analysis. Routledge, 2018.
  • Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Tsipras et al. (2018) Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
  • Wong et al. (2020) Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994, 2020.
  • Xiao et al. (2022) Xiao, J., Fan, Y., Sun, R., Wang, J., and Luo, Z.-Q. Stability analysis and generalization bounds of adversarial training. arXiv preprint arXiv:2210.00960, 2022.
  • Xing et al. (2021) Xing, Y., Song, Q., and Cheng, G. On the algorithmic stability of adversarial training. Advances in neural information processing systems, 34:26523–26535, 2021.
  • Yang et al. (2020) Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R. R., and Chaudhuri, K. A closer look at accuracy vs. robustness. Advances in neural information processing systems, 33:8588–8601, 2020.
  • Yarotsky (2017) Yarotsky, D. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
  • Yu et al. (2022) Yu, C., Han, B., Shen, L., Yu, J., Gong, C., Gong, M., and Liu, T. Understanding robust overfitting of adversarial training and beyond. In International Conference on Machine Learning, pp.  25595–25610. PMLR, 2022.
  • Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2017.
  • Zhang et al. (2019) Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pp.  7472–7482. PMLR, 2019.

Appendix A Additional Experiments Regarding Robust Memorization

Figure 3: Experiment results ($\ell_{\infty}$ perturbation radius $\epsilon_{0}=0.1$ on MNIST and $\epsilon_{0}=8/255$ on CIFAR10). Panels: (a) MNIST: Grad; (b) MNIST: Change; (c) CIFAR10: Grad; (d) CIFAR10: Change; (e) MNIST: Train; (f) MNIST: Test; (g) CIFAR10: Train; (h) CIFAR10: Test.

In this section, we demonstrate that adversarial training converges to classifiers similar to the construction $f_{\mathcal{S}}$ of Lemma 4.5 on real image datasets, which results in CGRO. To do so, we verify that models trained by adversarial training tend to memorize data by approximating local robust indicators on the training points.

Concretely, for a given loss $\mathcal{L}(\cdot,\cdot)$, instance $(\bm{X},y)$ and model $f$, we use two measurements: the maximum gradient norm within the neighborhood of a training point,

$\max_{\|\bm{\xi}\|_{\infty}\leq\delta}\|\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}+\bm{\xi}),y)\|_{1},$

and the maximum loss change,

$\max_{\|\bm{\xi}\|_{\infty}\leq\delta}\left[\mathcal{L}(f(\bm{X}+\bm{\xi}),y)-\mathcal{L}(f(\bm{X}),y)\right].$

The former measures $\delta$-local flatness at $(\bm{X},y)$, and the latter measures $\delta$-local adversarial robustness at $(\bm{X},y)$; both capture key information about the loss landscape over the input.
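In our experiments we estimate both quantities by approximating the inner maximum with projected gradient ascent. The following PyTorch sketch illustrates the computation on one batch (the function name, the PGD step count, and the use of cross-entropy are illustrative assumptions rather than part of the formal definitions, and PGD only yields a lower bound on each maximum):

```python
import torch
import torch.nn.functional as F

def local_flatness_stats(model, X, y, delta, steps=10):
    """Estimate (i) the max gradient l1-norm and (ii) the max loss change
    within the l_inf ball of radius delta, via PGD ascent on one batch."""
    model.eval()
    with torch.no_grad():
        clean_loss = F.cross_entropy(model(X), y)
    xi = torch.zeros_like(X, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(X + xi), y)
        grad, = torch.autograd.grad(loss, xi)
        # signed gradient step, projected back onto the l_inf ball
        xi = (xi.detach() + (delta / steps) * grad.sign()).clamp(-delta, delta)
        xi.requires_grad_(True)
    X_adv = (X + xi).detach().requires_grad_(True)
    adv_loss = F.cross_entropy(model(X_adv), y)
    g, = torch.autograd.grad(adv_loss, X_adv)
    # per-example l1 gradient norms (undo the 1/batch factor of the mean loss)
    grad_norms = X.size(0) * g.flatten(1).abs().sum(dim=1)
    return grad_norms.mean().item(), (adv_loss - clean_loss).item()
```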

Experiment Settings. In our numerical experiments, we focus on two common real-image datasets, MNIST and CIFAR10. During adversarial training, we use cyclical learning rates and the mixed-precision technique (Wong et al., 2020). For MNIST, we use a LeNet architecture trained for 20 epochs in total; for CIFAR10, we use a ResNet architecture trained for 15 epochs in total.
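For reference, a minimal sketch of this training procedure (PGD-based adversarial training with a cyclical learning-rate schedule; the optimizer settings, attack hyperparameters, and helper names are our own illustrative choices, and the mixed-precision machinery of Wong et al. (2020) is omitted):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, X, y, eps, step_size, steps):
    """Craft l_inf-bounded adversarial examples with random-start PGD."""
    xi = torch.empty_like(X).uniform_(-eps, eps)
    for _ in range(steps):
        xi.requires_grad_(True)
        loss = F.cross_entropy(model(X + xi), y)
        grad, = torch.autograd.grad(loss, xi)
        xi = (xi.detach() + step_size * grad.sign()).clamp(-eps, eps)
    return (X + xi).detach()

def adversarial_train(model, loader, epochs, eps, lr_max=0.2):
    opt = torch.optim.SGD(model.parameters(), lr=lr_max, momentum=0.9)
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt, max_lr=lr_max, total_steps=epochs * len(loader))  # cyclical LR
    for _ in range(epochs):
        for X, y in loader:
            X_adv = pgd_attack(model, X, y, eps, step_size=eps / 4, steps=10)
            opt.zero_grad()
            F.cross_entropy(model(X_adv), y).backward()
            opt.step()
            sched.step()
    return model
```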

Numerical Results. First, we apply the adversarial training method to train models with a fixed perturbation radius $\epsilon_{0}$, and then we compute the empirical average of the maximum gradient norm and the maximum loss change on training data within different perturbation radii $\epsilon$. The numerical results in Figure 3 (a)–(d) show that the loss landscape is flat within the training radius but very sharp outside it, which practically demonstrates our conjecture on real-image datasets.

Learning Process. We also focus on the dynamics of the loss landscape over the input during the adversarial learning process. Thus, we compute the empirical average of the maximum gradient norm within different perturbation radii $\epsilon$ at different training epochs. The numerical results are plotted in Figure 3 (e)–(h). On both MNIST and CIFAR10, as the number of epochs increases, the training curve descends within the training perturbation radius, which implies that models learn the local robust indicators to robustly memorize training data. However, the test curve on CIFAR10 instead ascends within the training radius, which is consistent with our theoretical analysis in Section 5.

Robust Generalization Bound. Moreover, we prove a robust generalization bound based on the global flatness of the loss landscape (see Appendix B). We show that, while adversarial training achieves local flatness by robust memorization, the model lacks global flatness, which causes robust overfitting.

Appendix B Robust Generalization Bound Based on Global Flatness

In this section, we prove a novel robust generalization bound that mainly depends on the global flatness of the loss landscape. We consider $\ell_{p}$-adversarial robustness with perturbation radius $\delta$, and we use $\mathcal{L}_{\text{clean}}(f)$, $\mathcal{L}_{\text{adv}}(f)$ and $\widehat{\mathcal{L}}_{\text{adv}}(f)$ to denote the clean test risk, the adversarial test risk and the adversarial empirical risk of the model $f$, respectively. We also assume $\frac{1}{p}+\frac{1}{q}=1$ in the results below.

Theorem B.1.

(Robust Generalization Bound) Let $\mathcal{D}$ be the underlying distribution with a smooth density function, and let the $N$-sample training dataset $\mathcal{S}=\{(\bm{X}_{1},y_{1}),(\bm{X}_{2},y_{2}),\dots,(\bm{X}_{N},y_{N})\}$ be drawn i.i.d. from $\mathcal{D}$. Then, with high probability, it holds that

adv(f)^adv(f)+N1D+2O(𝔼(𝑿,y)𝒟[max𝝃pδ𝑿(f(𝑿+𝝃),y)q]global flatness).\displaystyle\mathcal{L}_{\text{adv}}(f)\leq\widehat{\mathcal{L}}_{\text{adv}}(f)+N^{-\frac{1}{D+2}}O\left(\underbrace{\mathbb{E}_{(\bm{X},y)\sim\mathcal{D}}\left[\max_{\|\bm{\xi}\|_{p}\leq\delta}\|\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}+\bm{\xi}),y)\|_{q}\right]}_{\text{global flatness}}\right).

This generalization bound shows that the robust generalization gap is dominated by the global flatness of the loss landscape. We also have the following lower bound on the robust generalization gap.

Proposition B.2.

Let $\mathcal{D}$ be the underlying distribution with a smooth density function. Then we have

adv(f)clean(f)=Ω(δ𝔼(𝑿,y)𝒟[𝑿(f(𝑿),y)q]).\mathcal{L}_{\text{adv}}(f)-\mathcal{L}_{\text{clean}}(f)=\Omega\left(\delta\mathbb{E}_{(\bm{X},y)\sim\mathcal{D}}\left[\|\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}),y)\|_{q}\right]\right).
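For intuition, Proposition B.2 can be read off from a first-order Taylor heuristic (a sketch only; the formal proof controls the remainder term using the smoothness of the density). Expanding the loss around $\bm{X}$,

$\mathcal{L}(f(\bm{X}+\bm{\xi}),y)\approx\mathcal{L}(f(\bm{X}),y)+\left\langle\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}),y),\bm{\xi}\right\rangle,$

and maximizing the linear term over $\|\bm{\xi}\|_{p}\leq\delta$ via Hölder duality (with $\frac{1}{p}+\frac{1}{q}=1$) gives

$\max_{\|\bm{\xi}\|_{p}\leq\delta}\left\langle\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}),y),\bm{\xi}\right\rangle=\delta\left\|\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}),y)\right\|_{q};$

taking the expectation over $(\bm{X},y)\sim\mathcal{D}$ then yields the $\Omega\left(\delta\,\mathbb{E}\left[\|\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}),y)\|_{q}\right]\right)$ lower bound on $\mathcal{L}_{\text{adv}}(f)-\mathcal{L}_{\text{clean}}(f)$.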

Theorem B.1 and Proposition B.2 manifest that the robust generalization gap is closely related to global flatness. However, although adversarial training achieves good local flatness by robust memorization on training data, the model lacks global flatness, which leads to robust overfitting.

This point is also verified by a numerical experiment on CIFAR10 (see the results in Figure 4). First, global flatness grows much faster than local flatness in practice. Second, as global flatness increases during the training process, the robust generalization gap increases as well.

Figure 4: Left: local and global flatness during adversarial training on CIFAR10; Right: the relation between the robust generalization gap and global flatness on CIFAR10.

Appendix C Preliminary Lemmas

First, we present a technique called the Tensor Power Method, proposed by Allen-Zhu & Li (2020a, b).

Lemma C.1.

Let $\{z^{(t)}\}_{t=0}^{T}$ be a positive sequence defined by the following recursions

$\left\{\begin{array}{l}z^{(t+1)}\geq z^{(t)}+m\left(z^{(t)}\right)^{2}\\ z^{(t+1)}\leq z^{(t)}+M\left(z^{(t)}\right)^{2}\end{array}\right.,$

where $z^{(0)}>0$ is the initialization and $m,M>0$. Let $v>0$ be such that $z^{(0)}\leq v$. Then the time $t_{0}$ such that $z^{(t)}\geq v$ for all $t\geq t_{0}$ satisfies

$t_{0}=\frac{3}{mz^{(0)}}+\frac{8M}{m}\left\lceil\frac{\log\left(v/z^{(0)}\right)}{\log 2}\right\rceil.$
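As a quick sanity check of Lemma C.1, one can iterate the recursion numerically and compare the hitting time with the stated bound (a sketch with arbitrarily chosen constants):

```python
import numpy as np

def tensor_power_hitting_time(z0, m_lo, M_hi, v, T=10**6):
    """Iterate z(t+1) = z(t) + c*z(t)^2 with a growth rate c drawn from
    [m_lo, M_hi] each step, and return the first t with z(t) >= v."""
    z = z0
    for t in range(T):
        if z >= v:
            return t
        c = np.random.uniform(m_lo, M_hi)
        z = z + c * z * z
    return None

z0, m_lo, M_hi, v = 1e-3, 0.1, 0.2, 1.0
t_hit = tensor_power_hitting_time(z0, m_lo, M_hi, v)
bound = 3 / (m_lo * z0) + (8 * M_hi / m_lo) * np.ceil(np.log2(v / z0))
print(t_hit, "<=", bound)  # the empirical hitting time respects the bound
```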
Lemma C.2.

Let $\{z^{(t)}\}_{t=0}^{T}$ be a positive sequence defined by the following recursion

$\left\{\begin{array}{l}z^{(t)}\geq z^{(0)}+A\sum_{s=0}^{t-1}\left(z^{(s)}\right)^{2}-C\\ z^{(t)}\leq z^{(0)}+A\sum_{s=0}^{t-1}\left(z^{(s)}\right)^{2}+C\end{array}\right.$

where $A,C>0$ and $z^{(0)}>0$ is the initialization. Assume that $C\leq z^{(0)}/2$. Let $v>0$ be such that $z^{(0)}\leq v$. Then the time $t$ such that $z^{(t)}\geq v$ is upper bounded by

$t_{0}=8\left\lceil\frac{\log\left(v/z^{(0)}\right)}{\log 2}\right\rceil+\frac{21}{z^{(0)}A}.$
Lemma C.3.

Let $\mathcal{T}\geq 0$, and let $\left(z^{(t)}\right)_{t>\mathcal{T}}$ be a non-negative sequence that satisfies the recursion $z^{(t+1)}\leq z^{(t)}-A\left(z^{(t)}\right)^{2}$ for some $A>0$. Then, for any time $t>\mathcal{T}$, it is bounded as

$z^{(t)}\leq\frac{1}{A(t-\mathcal{T})}.$

Then, we provide a probability inequality proved by Jelassi & Li (2022).

Lemma C.4.

Let $\left\{\bm{v}_{r}\right\}_{r=1}^{m}$ be vectors in $\mathbb{R}^{d}$ such that there exists a unit-norm vector $\bm{x}$ satisfying $\left|\sum_{r=1}^{m}\left\langle\bm{v}_{r},\bm{x}\right\rangle^{3}\right|\geq 1$. Then, for $\bm{\xi}_{1},\ldots,\bm{\xi}_{P}\sim\mathcal{N}\left(0,\sigma^{2}\mathbf{I}_{d}\right)$ i.i.d., we have:

$\mathbb{P}\left[\left|\sum_{j=1}^{P}\sum_{r=1}^{m}\left\langle\bm{v}_{r},\bm{\xi}_{j}\right\rangle^{3}\right|\geq\tilde{\Omega}\left(\sigma^{3}\right)\right]\geq 1-\frac{O(d)}{2^{d}}.$

Next, we introduce some concepts from learning theory.

Definition C.5 (growth function).

Let $\mathcal{F}$ be a class of functions from $\mathcal{X}\subset\mathbb{R}^{D}$ to $\{-1,+1\}$. For any integer $m\geq 0$, we define the growth function of $\mathcal{F}$ as

$\Pi_{\mathcal{F}}(m)=\max_{x_{i}\in\mathcal{X},1\leq i\leq m}\left|\left\{(f(x_{1}),f(x_{2}),\cdots,f(x_{m})):f\in\mathcal{F}\right\}\right|.$

In particular, if |{(f(x1),f(x2),,f(xm)):f}|=2m\left|\left\{(f(x_{1}),f(x_{2}),\cdots,f(x_{m})):f\in\mathcal{F}\right\}\right|=2^{m}, then (x1,x2,,xm)(x_{1},x_{2},\cdots,x_{m}) is said to be shattered by \mathcal{F}.

Definition C.6 (Vapnik-Chervonenkis dimension).

Let $\mathcal{F}$ be a class of functions from $\mathcal{X}\subset\mathbb{R}^{D}$ to $\{-1,+1\}$. The VC-dimension of $\mathcal{F}$, denoted by $\text{VC-dim}(\mathcal{F})$, is defined as the largest integer $m\geq 0$ such that $\Pi_{\mathcal{F}}(m)=2^{m}$. For a real-valued function class $\mathcal{H}$, we define $\text{VC-dim}(\mathcal{H}):=\text{VC-dim}(\mathrm{sgn}(\mathcal{H}))$.

The following result gives a nearly-tight upper bound on the VC-dimension of neural networks.

Lemma C.7.

(Bartlett et al., 2019) Consider a ReLU network with LL layers and WW total parameters. Let FF be the set of (real-valued) functions computed by this network. Then we have VC-dim(F)=O(WLlog(W))\text{VC-dim}(F)=O(WL\log(W)).

The growth function is connected to the VC-dimension via the following lemma; see e.g. Anthony et al. (1999).

Lemma C.8.

Suppose that $\text{VC-dim}(\mathcal{F})=k$. Then $\Pi_{\mathcal{F}}(m)\leq\sum_{i=0}^{k}\binom{m}{i}$. In particular, we have $\Pi_{\mathcal{F}}(m)\leq\left(em/k\right)^{k}$ for all $m>k+1$.
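As a concrete instance of Definitions C.5–C.6 and Lemma C.8, consider the class of 1-D threshold classifiers $\mathrm{sgn}(x-b)$, whose VC-dimension is 1 (one point is shattered, but no pair can receive the labeling $(+,-)$). The sketch below (our own illustration, not code used in the paper) enumerates the realized labelings and checks the bound $\Pi_{\mathcal{F}}(m)\leq m+1$:

```python
import numpy as np

def growth_function_thresholds(xs):
    """Count the labelings {sign(x - b) : b in R} realized on the points xs."""
    xs = np.sort(np.asarray(xs))
    labelings = set()
    # The labeling is constant as b varies between consecutive sample points,
    # so one candidate b per gap (plus the two extremes) suffices.
    candidates = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    for b in candidates:
        labelings.add(tuple(xs - b >= 0))
    return len(labelings)

m = 5
print(growth_function_thresholds(np.random.randn(m)), "<=", m + 1)
```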

Appendix D Feature Learning

In this section, we provide a full introduction to feature learning, which has been widely applied in theoretical works (Allen-Zhu & Li, 2020a, b, 2022; Shen et al., 2022; Jelassi & Li, 2022; Jelassi et al., 2022; Chen et al., 2022) to explore what and how neural networks learn in different tasks. In this work, we leverage feature learning theory for the first time to explain the CGRO phenomenon in adversarial training. Specifically, for an arbitrary clean training data point $(\bm{X},y)\sim\mathcal{D}$ and a given model $f_{\bm{W}}$, we focus on

  • True Feature Learning. We project the weight 𝑾\bm{W} on the meaningful signal vector to measure the correlation between the model and the true feature as

    𝒰:=r=1m𝒘r,𝒘q.\mathcal{U}:=\sum_{r=1}^{m}\left\langle\bm{w}_{r},\bm{w}^{*}\right\rangle^{q}.
  • Spurious Feature Learning. We project the weight 𝑾\bm{W} on the random noise to measure the correlation between the model and the spurious feature as

    𝒱:=yr=1mj[P]signal(𝑿)𝒘r,𝑿[j]q.\mathcal{V}:=y\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{w}_{r},\bm{X}[j]\right\rangle^{q}.

We then calculate the model’s classification correctness on certain clean data point as

yf𝑾(𝑿)\displaystyle yf_{\bm{W}}(\bm{X}) =yr=1m𝒘r,𝑿[signal(𝑿)]q+yr=1mj[P]signal(𝑿)𝒘r,𝑿[j]q\displaystyle=y\sum_{r=1}^{m}\left\langle\bm{w}_{r},\bm{X}[\operatorname{signal}(\bm{X})]\right\rangle^{q}+y\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{w}_{r},\bm{X}[j]\right\rangle^{q}
=yr=1m𝒘r,αy𝒘qαq𝒰+yr=1mj[P]signal(𝑿)𝒘r,𝑿[j]q𝒱.\displaystyle=\underbrace{y\sum_{r=1}^{m}\left\langle\bm{w}_{r},\alpha y\bm{w}^{*}\right\rangle^{q}}_{{\alpha}^{q}\mathcal{U}}+\underbrace{y\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{w}_{r},\bm{X}[j]\right\rangle^{q}}_{\mathcal{V}}.

Thus, the model correctly classifies the data if and only if ${\alpha}^{q}\mathcal{U}+\mathcal{V}\geq 0$, which holds in at least two cases. One is that the model learns the true feature and ignores the spurious features, where $\mathcal{U}=\Omega(1)\gg|\mathcal{V}|$. The other is that the model does not learn the true feature but memorizes the spurious features, where $|\mathcal{U}|=o(1)$ and $\mathcal{V}=\Omega(1)>0$.

Therefore, this analysis tells us that the model generalizes well on unseen data if it learns the true feature, but overfits the training data if it only memorizes the spurious features: the data-specific random noises are independent across distinct instances, so with high probability $\mathcal{V}=o(1)$ for unseen data $(\bm{X},y)$.

We also calculate the model’s classification correctness on perturbed data point, where we use attack proposed in definition 5.6 to generate adversarial example as

$\bm{X}^{\text{adv}}[j]=\left\{\begin{array}{ll}\alpha(1-\gamma)y\bm{w}^{*},&j=\operatorname{signal}(\bm{X})\\ \bm{X}[j],&j\in[P]\setminus\operatorname{signal}(\bm{X})\end{array}\right.$

We then derive the correctness as

yf𝑾(𝑿adv)\displaystyle yf_{\bm{W}}(\bm{X}^{\text{adv}}) =yr=1m𝒘r,𝑿adv[signal(𝑿)]q+yr=1mj[P]signal(𝑿)𝒘r,𝑿adv[j]q\displaystyle=y\sum_{r=1}^{m}\left\langle\bm{w}_{r},\bm{X}^{\text{adv}}[\operatorname{signal}(\bm{X})]\right\rangle^{q}+y\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{w}_{r},\bm{X}^{\text{adv}}[j]\right\rangle^{q}
=yr=1m𝒘r,α(1γ)y𝒘qαq(1γ)q𝒰+yr=1mj[P]signal(𝑿)𝒘r,𝑿[j]q𝒱.\displaystyle=\underbrace{y\sum_{r=1}^{m}\left\langle\bm{w}_{r},\alpha(1-\gamma)y\bm{w}^{*}\right\rangle^{q}}_{{\alpha}^{q}(1-\gamma)^{q}\mathcal{U}}+\underbrace{y\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{w}_{r},\bm{X}[j]\right\rangle^{q}}_{\mathcal{V}}.

Thus, the model correctly classifies the perturbed data if and only if $\alpha^{q}(1-\gamma)^{q}\mathcal{U}+\mathcal{V}\geq 0$, so the perturbed data can be analyzed in the same way as the clean data.
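To make the decomposition concrete, the following NumPy sketch evaluates $\mathcal{U}$, $\mathcal{V}$ and the correctness condition on a toy instance of the patch data model (all sizes, constants and helper names here are hypothetical choices for illustration; the model is the $q$-power network $f_{\bm{W}}(\bm{X})=\sum_{r}\sum_{j}\langle\bm{w}_{r},\bm{X}[j]\rangle^{q}$ used above):

```python
import numpy as np

def correctness_decomposition(W, w_star, X, y, signal_idx, q=3):
    """Split y * f_W(X) into alpha^q * U (true feature) + V (spurious)."""
    scores = W @ X.T                            # (m, P) inner products <w_r, X[j]>
    noise_cols = [j for j in range(X.shape[0]) if j != signal_idx]
    U = np.sum((W @ w_star) ** q)               # true-feature correlation
    V = y * np.sum(scores[:, noise_cols] ** q)  # spurious-feature correlation
    signal_term = y * np.sum(scores[:, signal_idx] ** q)  # equals alpha^q * U
    return signal_term, U, V

# Toy instance of the data model (dimensions and scales are arbitrary):
d, m, P, alpha, sigma, q = 64, 8, 5, 1.0, 0.1, 3
w_star = np.random.randn(d); w_star /= np.linalg.norm(w_star)
y = 1
X = sigma * np.random.randn(P, d)               # noise patches
X[0] = alpha * y * w_star                       # signal patch
W = 0.01 * np.random.randn(m, d)                # weights at initialization
s, U, V = correctness_decomposition(W, w_star, X, y, signal_idx=0)
print("correctly classified:", alpha ** q * U + V >= 0)
```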

Appendix E Proof for Section 5

In this section, we present the full proof for Section 5. To simplify the proof, we focus on the case $q=3$; it extends easily to the general case $q\geq 2$. We also need the following re-defined notation.

For r[m]r\in[m], i[N]i\in[N] and j[P]signal(𝑿i)j\in[P]\setminus\operatorname{signal}(\bm{X}_{i}), we define ur(t)u_{r}^{(t)} and vi,j,r(t)v_{i,j,r}^{(t)} as

Signal Component. ur(t):=𝒘r(t),𝒘u_{r}^{(t)}:=\left\langle\bm{w}_{r}^{(t)},\bm{w}^{*}\right\rangle, thus 𝒰(t)=r[m](ur(t))3\mathcal{U}^{(t)}=\sum_{r\in[m]}\left(u_{r}^{(t)}\right)^{3}.

Noise Component. vi,j,r(t):=yi𝒘r(t),𝑿i[j]v_{i,j,r}^{(t)}:=y_{i}\left\langle\bm{w}_{r}^{(t)},\bm{X}_{i}[j]\right\rangle, thus 𝒱i(t)=r[m]j[P]signal(𝑿i)(vi,j,r(t))3\mathcal{V}_{i}^{(t)}=\sum_{r\in[m]}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{3}.

First, we give detailed proofs of Lemma 5.11, Lemma 5.12 and Lemma 5.13. Then, we prove Theorem 5.9 based on these lemmas.

We prove our main results by induction. More specifically, we maintain the following hypotheses for each iteration $t<T$.

Hypothesis E.1.

Throughout the learning process using the adversarial training update for t<Tt<T, we maintain that:

  • (Uniform Bound for Signal Component) For each r[m]r\in[m], we assume ur(t)O~(α1)u_{r}^{(t)}\leq\tilde{O}(\alpha^{-1}).

  • (Uniform Bound for Noise Component) For each r[m]r\in[m], i[N]i\in[N] and j[P]signal(𝑿i)j\in[P]\setminus\operatorname{signal}(\bm{X}_{i}), we assume |vi,j,r(t)|O~(1)|v_{i,j,r}^{(t)}|\leq\tilde{O}(1).

In what follows, we assume these induction hypotheses for t<Tt<T to prove our main results. We then prove these hypotheses for iteration t=Tt=T in Lemma E.11.

Now, we first give the proof details for Lemma 5.11.

Theorem E.2.

(Restatement of Lemma 5.11) For each r[m]r\in[m] and any t0t\geq 0, the signal component grows as

ur(t+1)ur(t)+Θ(ηα3)(ur(t))2ψ(α3k[m](uk(t))3),u_{r}^{(t+1)}\geq u_{r}^{(t)}+\Theta(\eta\alpha^{3})\left(u_{r}^{(t)}\right)^{2}\psi\left(\alpha^{3}\sum_{k\in[m]}\left(u_{k}^{(t)}\right)^{3}\right),

where we use $\psi(\cdot)$ to denote the negative sigmoid function $\psi(z)=\frac{1}{1+e^{z}}$, as in Lemmas 5.12 and 5.13.

Proof.

First, we calculate the gradient of the adversarial loss with respect to $\bm{w}_{r}$ ($r\in[m]$) as

𝒘r^adv(𝑾(t))\displaystyle\nabla_{\bm{w}_{r}}\widehat{\mathcal{L}}_{\text{adv}}(\bm{W}^{(t)}) =3Ni=1Nj=1P((1λ)yi𝒘r(t),𝑿i[j]21+exp(yif𝑾(t)(𝑿i))𝑿i[j]+λyi𝒘r(t),𝑿iadv[j]21+exp(yif𝑾(t)(𝑿iadv))𝑿iadv[j])\displaystyle=-\frac{3}{N}\sum_{i=1}^{N}\sum_{j=1}^{P}\left(\frac{(1-\lambda)y_{i}\left\langle\bm{w}_{r}^{(t)},\bm{X}_{i}[j]\right\rangle^{2}}{1+\exp\left(y_{i}f_{\bm{W}^{(t)}}\left(\bm{X}_{i}\right)\right)}\bm{X}_{i}[j]+\frac{\lambda y_{i}\left\langle\bm{w}_{r}^{(t)},\bm{X}_{i}^{\text{adv}}[j]\right\rangle^{2}}{1+\exp\left(y_{i}f_{\bm{W}^{(t)}}\left(\bm{X}_{i}^{\text{adv}}\right)\right)}\bm{X}_{i}^{\text{adv}}[j]\right)
=3N((ur(t))2(i=1N(1λ)α3ψ(yif𝑾(t)(𝑿i))+λα3(1γ)3ψ(yif𝑾(t)(𝑿iadv)))𝒘\displaystyle=-\frac{3}{N}\left(\left(u_{r}^{(t)}\right)^{2}\left(\sum_{i=1}^{N}(1-\lambda)\alpha^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))+\lambda\alpha^{3}(1-\gamma)^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))\right)\bm{w}^{*}\right.
+i=1Njsignal(𝑿i)(vi,j,r(t))2((1λ)ψ(yif𝑾(t)(𝑿i))+λψ(yif𝑾(t)(𝑿iadv)))𝑿i[j]).\displaystyle+\left.\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}\left((1-\lambda)\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))+\lambda\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))\right)\bm{X}_{i}[j]\right).

Then, we project the gradient descent update $\bm{W}^{(t+1)}=\bm{W}^{(t)}-\eta\nabla_{\bm{W}}\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right)$ onto the signal vector $\bm{w}^{*}$. We derive the following result, using that $\bm{X}_{i}[j]\perp\bm{w}^{*}$ for $j\in[P]\setminus\operatorname{signal}(\bm{X}_{i})$:

ur(t+1)\displaystyle u_{r}^{(t+1)} =ur(t)+3ηN(ur(t))2i=1N((1λ)α3ψ(yif𝑾(t)(𝑿i))+λα3(1γ)3ψ(yif𝑾(t)(𝑿iadv)))\displaystyle=u_{r}^{(t)}+\frac{3\eta}{N}\left(u_{r}^{(t)}\right)^{2}\sum_{i=1}^{N}\left((1-\lambda)\alpha^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))+\lambda\alpha^{3}(1-\gamma)^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))\right)
ur(t)+3ηα3(1λ)N(ur(t))2i=1Nψ(yif𝑾(t)(𝑿i))\displaystyle\geq u_{r}^{(t)}+\frac{3\eta\alpha^{3}(1-\lambda)}{N}\left(u_{r}^{(t)}\right)^{2}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))
ur(t)+Θ(ηα3)(ur(t))2ψ(α3k[m](uk(t))3),\displaystyle\geq u_{r}^{(t)}+\Theta(\eta\alpha^{3})\left(u_{r}^{(t)}\right)^{2}\psi\left(\alpha^{3}\sum_{k\in[m]}\left(u_{k}^{(t)}\right)^{3}\right),

where the last inequality uses $\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))=\Theta(1)\psi\left(\alpha^{3}\sum_{k\in[m]}\left(u_{k}^{(t)}\right)^{3}\right)$, which follows from Hypothesis E.1. ∎
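To visualize this growth, one can iterate the lower-bound recursion with equality (a simulation sketch with arbitrary constants, not a proof step): the maximum signal component first grows through the tensor-power effect and then saturates at order $\alpha^{-1}$, consistent with Lemma E.3 below and with the cap in Hypothesis E.1.

```python
import numpy as np

def signal_dynamics(m=8, alpha=2.0, eta=0.05, sigma0=1e-2, T=20000):
    """Iterate u_r <- u_r + eta*alpha^3*u_r^2 * psi(alpha^3 * sum_k u_k^3)."""
    psi = lambda z: 1.0 / (1.0 + np.exp(z))     # negative sigmoid
    u = np.abs(np.random.randn(m)) * sigma0     # positive initialization
    traj = []
    for _ in range(T):
        traj.append(u.max())
        u = u + eta * alpha ** 3 * u ** 2 * psi(alpha ** 3 * np.sum(u ** 3))
    return np.array(traj)

traj = signal_dynamics()
# growth followed by saturation near alpha^{-1} = 0.5
print(traj[0], traj[-1])
```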

Consequently, we have the following result, which shows the order of the maximum signal component.

Lemma E.3.

During adversarial training, with high probability, it holds that, after T0=Θ~(1ηα3σ0)T_{0}=\tilde{\Theta}\left(\frac{1}{\eta\alpha^{3}\sigma_{0}}\right) iterations, for all t[T0,T]t\in\left[T_{0},T\right], we have maxr[m]ur(t)Ω~(α1)\max_{r\in[m]}u_{r}^{(t)}\geq\tilde{\Omega}(\alpha^{-1}).

Proof.

From the proof of Theorem E.2, we know that

ur(t+1)ur(t)=3ηN(ur(t))2i=1N((1λ)α3ψ(yif𝑾(t)(𝑿i))+λα3(1γ)3ψ(yif𝑾(t)(𝑿iadv))).u_{r}^{(t+1)}-u_{r}^{(t)}=\frac{3\eta}{N}\left(u_{r}^{(t)}\right)^{2}\sum_{i=1}^{N}\left((1-\lambda)\alpha^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))+\lambda\alpha^{3}(1-\gamma)^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))\right).

By applying Hypothesis E.1, we can simplify the above equation to the following inequalities.

{ur(t+1)ur(t)+A(ur(t))2ur(t+1)ur(t)+B(ur(t))2\left\{\begin{array}[]{l}u_{r}^{(t+1)}\leq u_{r}^{(t)}+A\left(u_{r}^{(t)}\right)^{2}\\ u_{r}^{(t+1)}\geq u_{r}^{(t)}+B\left(u_{r}^{(t)}\right)^{2}\end{array}\right.

where AA and BB are respectively defined as:

A:=Θ~(η)((1λ)α3+λα3(1γ)3)\displaystyle A:=\tilde{\Theta}(\eta)\left((1-\lambda)\alpha^{3}+\lambda\alpha^{3}(1-\gamma)^{3}\right)
B:=Θ~(η)(1λ)α3.\displaystyle B:=\tilde{\Theta}(\eta)(1-\lambda)\alpha^{3}.

At initialization, since we choose the weights 𝒘r(0)𝒩(0,σ02𝐈d)\bm{w}_{r}^{(0)}\sim\mathcal{N}\left(0,\sigma_{0}^{2}\mathbf{I}_{d}\right), we know the initial signal components ur(0)u_{r}^{(0)} are i.i.d. zero-mean Gaussian random variables, which implies that the probability that at least one of the ur(0)u_{r}^{(0)} is non-negative is 1(12)m=1o(1)1-\left(\frac{1}{2}\right)^{m}=1-o(1).

Thus, with high probability, there exists an initial signal component $u_{r^{\prime}}^{(0)}\geq 0$. By using the Tensor Power Method (Lemma C.1) and setting $v=\tilde{\Theta}(\alpha^{-1})$, we obtain the threshold iteration $T_{0}$ as

$T_{0}=\frac{\tilde{\Theta}(1)}{\eta\alpha^{3}\sigma_{0}}+\frac{\tilde{\Theta}(1)\left((1-\lambda)\alpha^{3}+\lambda\alpha^{3}(1-\gamma)^{3}\right)}{(1-\lambda)\alpha^{3}}\left\lceil\frac{-\log\left(\tilde{\Theta}\left(\sigma_{0}\alpha\right)\right)}{\log 2}\right\rceil,$

which is of order $\tilde{\Theta}\left(\frac{1}{\eta\alpha^{3}\sigma_{0}}\right)$. ∎

Next, we prove Lemma 5.12, which gives an upper bound on the growth of the signal components.

Theorem E.4.

(Restatement of Lemma 5.12) For T0=Θ(1ηα3σ0)T_{0}=\Theta\left(\frac{1}{\eta\alpha^{3}\sigma_{0}}\right) and any t[T0,T]t\in[T_{0},T], the signal component is upper bounded as

maxr[m]ur(t)O~(α1)+O~(ηα3(1γ)3N)s=T0t1i=1Nψ(α3(1γ)3k[m](uk(s))3+𝒱i(s)).\max_{r\in[m]}u_{r}^{(t)}\leq\tilde{O}(\alpha^{-1})+\tilde{O}\left(\frac{\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{s=T_{0}}^{t-1}\sum_{i=1}^{N}\psi\left(\alpha^{3}(1-\gamma)^{3}\sum_{k\in[m]}\left(u_{k}^{(s)}\right)^{3}+\mathcal{V}_{i}^{(s)}\right).
Proof.

First, we analyze the upper bound on the derivative generated by the clean data. Following the proof of Theorem E.2, we know that

maxr[m]ur(t+1)\displaystyle\max_{r\in[m]}u_{r}^{(t+1)} maxr[m]ur(t)+3ηα3(1λ)N(maxr[m]ur(t))2i=1Nψ(yif𝑾(t)(𝑿i))\displaystyle\geq\max_{r\in[m]}u_{r}^{(t)}+\frac{3\eta\alpha^{3}(1-\lambda)}{N}\left(\max_{r\in[m]}u_{r}^{(t)}\right)^{2}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))
maxr[m]ur(t)+Ω~(ηα)1λNi=1Nψ(yif𝑾(t)(𝑿i)),\displaystyle\geq\max_{r\in[m]}u_{r}^{(t)}+\tilde{\Omega}(\eta\alpha)\frac{1-\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i})),

where we obtain the first inequality from the definitions of $\max_{r\in[m]}u_{r}^{(t)}$ and $\max_{r\in[m]}u_{r}^{(t+1)}$, and the last inequality uses $\max_{r\in[m]}u_{r}^{(t)}\geq\tilde{\Omega}(\alpha^{-1})$ from Lemma E.3. Thus, we have

1λNi=1Nψ(yif𝑾(t)(𝑿i))O~(η1α1)(maxr[m]ur(t+1)maxr[m]ur(t)).\frac{1-\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))\leq\tilde{O}(\eta^{-1}\alpha^{-1})\left(\max_{r\in[m]}u_{r}^{(t+1)}-\max_{r\in[m]}u_{r}^{(t)}\right).

Now, we focus on maxr[m]ur(t+1)maxr[m]ur(t)\max_{r\in[m]}u_{r}^{(t+1)}-\max_{r\in[m]}u_{r}^{(t)}. By the non-decreasing property of ur(t)u_{r}^{(t)}, we have

maxr[m]ur(t+1)maxr[m]ur(t)r[m](ur(t+1)ur(t))\displaystyle\max_{r\in[m]}u_{r}^{(t+1)}-\max_{r\in[m]}u_{r}^{(t)}\leq\sum_{r\in[m]}\left(u_{r}^{(t+1)}-u_{r}^{(t)}\right)
(1λ)Θ(ηα)ψ(α3r[m](ur(t))3)r[m](αur(t))2+λΘ(ηα3(1γ)3N)i=1Nψ(yif𝑾(t)(𝑿iadv))\displaystyle\leq(1-\lambda)\Theta(\eta\alpha)\psi\left(\alpha^{3}\sum_{r\in[m]}\left(u_{r}^{(t)}\right)^{3}\right)\sum_{r\in[m]}\left(\alpha u_{r}^{(t)}\right)^{2}+\lambda\Theta\left(\frac{\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))
(1λ)O~(ηα2)ϕ(α3r[m](ur(t))3)+λΘ(ηα3(1γ)3N)i=1Nψ(yif𝑾(t)(𝑿iadv)),\displaystyle\leq(1-\lambda)\tilde{O}(\eta\alpha^{2})\phi\left(\alpha^{3}\sum_{r\in[m]}\left(u_{r}^{(t)}\right)^{3}\right)+\lambda\Theta\left(\frac{\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i})),

where we use $\phi(\cdot)$ to denote the logistic loss $\phi(z)=\log(1+\exp(-z))$, and the last inequality follows from Hypothesis E.1. Then, we know

1λNi=1Nψ(yif𝑾(t)(𝑿i))\displaystyle\frac{1-\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i})) (1λ)O~(α)ϕ(α3r[m](ur(t))3)\displaystyle\leq(1-\lambda)\tilde{O}(\alpha)\phi\left(\alpha^{3}\sum_{r\in[m]}\left(u_{r}^{(t)}\right)^{3}\right)
+λΘ(α2(1γ)3N)i=1Nψ(yif𝑾(t)(𝑿iadv)).\displaystyle+\lambda\Theta\left(\frac{\alpha^{2}(1-\gamma)^{3}}{N}\right)\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i})).

Then, we derive the following result by Hypothesis E.1 and the above inequality.

maxr[m]ur(t+1)\displaystyle\max_{r\in[m]}u_{r}^{(t+1)} maxr[m]ur(t)+3ηN(maxr[m]ur(t))2i=1N((1λ)α3ψ(yif𝑾(t)(𝑿i))\displaystyle\leq\max_{r\in[m]}u_{r}^{(t)}+\frac{3\eta}{N}\left(\max_{r\in[m]}u_{r}^{(t)}\right)^{2}\sum_{i=1}^{N}\left((1-\lambda)\alpha^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))\right.
+λα3(1γ)3ψ(yif𝑾(t)(𝑿iadv)))\displaystyle+\left.\lambda\alpha^{3}(1-\gamma)^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))\right)
maxr[m]ur(t)+Θ~(ηα)1λNi=1Nψ(yif𝑾(t)(𝑿i))+Θ~(ηα(1γ)3N)i=1Nψ(yif𝑾(t)(𝑿iadv))\displaystyle\leq\max_{r\in[m]}u_{r}^{(t)}+\tilde{\Theta}(\eta\alpha)\frac{1-\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))+\tilde{\Theta}\left(\frac{\eta\alpha(1-\gamma)^{3}}{N}\right)\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))
maxr[m]ur(t)+(1λ)O~(ηα2)ϕ(α3r[m](ur(t))3)+Θ~(ηα3(1γ)3N)i=1Nψ(yif𝑾(t)(𝑿iadv))\displaystyle\leq\max_{r\in[m]}u_{r}^{(t)}+(1-\lambda)\tilde{O}(\eta\alpha^{2})\phi\left(\alpha^{3}\sum_{r\in[m]}\left(u_{r}^{(t)}\right)^{3}\right)+\tilde{\Theta}\left(\frac{\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))
maxr[m]ur(t)+(1λ)O~(ηα2)1+exp(α3r[m](ur(t))3)+Θ~(ηα3(1γ)3N)i=1Nψ(yif𝑾(t)(𝑿iadv))\displaystyle\leq\max_{r\in[m]}u_{r}^{(t)}+\frac{(1-\lambda)\tilde{O}(\eta\alpha^{2})}{1+\exp\left(\alpha^{3}\sum_{r\in[m]}\left(u_{r}^{(t)}\right)^{3}\right)}+\tilde{\Theta}\left(\frac{\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))
maxr[m]ur(t)+(1λ)O~(ηα2)1+exp(Ω~(1))+Θ~(ηα3(1γ)3N)i=1Nψ(yif𝑾(t)(𝑿iadv)).\displaystyle\leq\max_{r\in[m]}u_{r}^{(t)}+\frac{(1-\lambda)\tilde{O}(\eta\alpha^{2})}{1+\exp\left(\tilde{\Omega}(1)\right)}+\tilde{\Theta}\left(\frac{\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i})).

By summing over iterations $s=T_{0},\dots,t-1$, we have the following result:

maxr[m]ur(t)\displaystyle\max_{r\in[m]}u_{r}^{(t)} maxr[m]ur(T0)+s=T0t1(1λ)O~(ηα2)1+exp(Ω~(1))+(ηα3(1γ)3N)s=T0t1i=1Nψ(yif𝑾(s)(𝑿iadv))\displaystyle\leq\max_{r\in[m]}u_{r}^{(T_{0})}+\sum_{s=T_{0}}^{t-1}\frac{(1-\lambda)\tilde{O}(\eta\alpha^{2})}{1+\exp\left(\tilde{\Omega}(1)\right)}+\left(\frac{\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{s=T_{0}}^{t-1}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}^{\text{adv}}_{i}))
O~(α1)+O~(ηα3(1γ)3N)s=T0t1i=1Nψ(α3(1γ)3k[m](uk(s))3+𝒱i(s)).\displaystyle\leq\tilde{O}(\alpha^{-1})+\tilde{O}\left(\frac{\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{s=T_{0}}^{t-1}\sum_{i=1}^{N}\psi\left(\alpha^{3}(1-\gamma)^{3}\sum_{k\in[m]}\left(u_{k}^{(s)}\right)^{3}+\mathcal{V}_{i}^{(s)}\right).

Therefore, we derive the conclusion of Theorem E.4. ∎

Next, we prove the following lemma about the update of the noise components.

Lemma E.5.

For each $r\in[m]$, $i\in[N]$ and $j\in[P]\setminus\operatorname{signal}(\bm{X}_{i})$, and any iterations $t_{0},t$ such that $t_{0}<t\leq T$, with high probability, it holds that

|vi,j,r(t)vi,j,r(t0)Θ(ησ2dN)s=t0t1ψ~i(s)(vi,j,r(s))2|\displaystyle\left|v_{i,j,r}^{(t)}-v_{i,j,r}^{(t_{0})}-\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\sum_{s=t_{0}}^{t-1}\tilde{\psi}_{i}^{(s)}\left(v_{i,j,r}^{(s)}\right)^{2}\right| O~(ληα3(1γ)3N)s=t0t1i=1Nψ(yif𝑾(s)(𝑿iadv))\displaystyle\leq\tilde{O}\left(\frac{\lambda\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{s=t_{0}}^{t-1}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}^{\text{adv}}_{i}))
+O~(Pσ2α1d),\displaystyle+\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d}),

where we use the notation ψ~i(s)\tilde{\psi}_{i}^{(s)} to denote (1λ)ψ(yif𝐖(s)(𝐗i))+λψ(yif𝐖(s)(𝐗iadv))(1-\lambda)\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}_{i}))+\lambda\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}^{\text{adv}}_{i})).

Proof.

To obtain Lemma E.5, we prove the following stronger result by induction w.r.t. iteration tt.

|vi,j,r(t)vi,j,r(t0)Θ(ησ2dN)\displaystyle\left|v_{i,j,r}^{(t)}-v_{i,j,r}^{(t_{0})}-\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\right. s=t0t1ψ~i(s)(vi,j,r(s))2|O~(Pσ2α1d)(1+λαη+ασ2d)q=0tt01(P1d)q\displaystyle\left.\sum_{s=t_{0}}^{t-1}\tilde{\psi}_{i}^{(s)}\left(v_{i,j,r}^{(s)}\right)^{2}\right|\leq\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d})\left(1+\lambda\alpha\eta+\frac{\alpha}{\sigma^{2}d}\right)\sum_{q=0}^{t-t_{0}-1}(P^{-1}\sqrt{d})^{-q} (1)
+O~(ληα3(1γ)3N)q=0tt01s=t0tqi=1N(P1d)qψ(yif𝑾(s)(𝑿iadv))\displaystyle+\tilde{O}\left(\frac{\lambda\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{q=0}^{t-t_{0}-1}\sum_{s=t_{0}}^{t-q}\sum_{i=1}^{N}(P^{-1}\sqrt{d})^{-q}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}^{\text{adv}}_{i}))

First, we project the training update onto the noise patch $\bm{X}_{i}[j]$ to verify the above inequality for $t=t_{0}+1$:

$\left|v_{i,j,r}^{(t_{0}+1)}-v_{i,j,r}^{(t_{0})}-\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\tilde{\psi}_{i}^{(t_{0})}\left(v_{i,j,r}^{(t_{0})}\right)^{2}\right|\leq\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\tilde{\psi}_{a}^{(t_{0})}\left(v_{a,b,r}^{(t_{0})}\right)^{2}$
Θ(ηPσ2d)1λNi=1Nψ(yif𝑾(t0)(𝑿i))\displaystyle\leq\Theta(\eta P\sigma^{2}\sqrt{d})\frac{1-\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t_{0})}}(\bm{X}_{i}))
+Θ(ηPσ2d)λNi=1Nψ(yif𝑾(t0)(𝑿iadv))\displaystyle+\Theta(\eta P\sigma^{2}\sqrt{d})\frac{\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t_{0})}}(\bm{X}_{i}^{\text{adv}}))
O~(Pσ2α1d)(1+λαη+ασ2d),\displaystyle\leq\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d})\left(1+\lambda\alpha\eta+\frac{\alpha}{\sigma^{2}d}\right),

where we apply 1λNi=1Nψ(yif𝑾(t0)(𝑿i))O~(η1α1)(maxr[m]ur(t0+1)maxr[m]ur(t0))O~(η1α2)\frac{1-\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t_{0})}}(\bm{X}_{i}))\leq\tilde{O}(\eta^{-1}\alpha^{-1})\left(\max_{r\in[m]}u_{r}^{(t_{0}+1)}-\max_{r\in[m]}u_{r}^{(t_{0})}\right)\leq\tilde{O}(\eta^{-1}\alpha^{-2}) and i=1Nψ(yif𝑾(t0)(𝑿iadv))O~(1)\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t_{0})}}(\bm{X}_{i}^{\text{adv}}))\leq\tilde{O}(1) to derive the last inequality.

Next, we assume that the stronger result holds for iteration $t$, and then we prove it for iteration $t+1$ as follows.

|vi,j,r(t+1)vi,j,r(t0)\displaystyle\left|v_{i,j,r}^{(t+1)}-v_{i,j,r}^{(t_{0})}\right. Θ(ησ2dN)s=t0t1ψ~i(s)(vi,j,r(s))2|Θ(ησ2dN)s=t0t1a=1Nbsignal(𝑿a)ψ~a(s)(va,b,r(s))2\displaystyle\left.-\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\sum_{s=t_{0}}^{t-1}\tilde{\psi}_{i}^{(s)}\left(v_{i,j,r}^{(s)}\right)^{2}\right|\leq\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\sum_{s=t_{0}}^{t-1}\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\tilde{\psi}_{a}^{(s)}\left(v_{a,b,r}^{(s)}\right)^{2}
+Θ(ηPσ2d)1λNi=1Nψ(yif𝑾(t)(𝑿i))+Θ(ηPσ2d)λNi=1Nψ(yif𝑾(t)(𝑿iadv)).\displaystyle+\Theta(\eta P\sigma^{2}\sqrt{d})\frac{1-\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))+\Theta(\eta P\sigma^{2}\sqrt{d})\frac{\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}^{\text{adv}})).

Then, we bound the first term on the right-hand side of the above inequality using our induction hypothesis for $t$, and we derive

|vi,j,r(t+1)vi,j,r(t0)\displaystyle\left|v_{i,j,r}^{(t+1)}-v_{i,j,r}^{(t_{0})}\right. Θ(ησ2dN)s=t0t1ψ~i(s)(vi,j,r(s))2|O~(Pσ2α1d)(1+λαη+ασ2d)q=0tt01(P1d)q\displaystyle\left.-\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\sum_{s=t_{0}}^{t-1}\tilde{\psi}_{i}^{(s)}\left(v_{i,j,r}^{(s)}\right)^{2}\right|\leq\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d})\left(1+\lambda\alpha\eta+\frac{\alpha}{\sigma^{2}d}\right)\sum_{q=0}^{t-t_{0}-1}(P^{-1}\sqrt{d})^{-q}
+O~(ληα3(1γ)3N)q=0tt01s=t0tqi=1N(P1d)qψ(yif𝑾(s)(𝑿iadv))\displaystyle+\tilde{O}\left(\frac{\lambda\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{q=0}^{t-t_{0}-1}\sum_{s=t_{0}}^{t-q}\sum_{i=1}^{N}(P^{-1}\sqrt{d})^{-q}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}^{\text{adv}}_{i}))
+O~(Pσ2α1d)(1+λαη+ασ2d)+Θ(ηPσ2d)λNi=1Nψ(yif𝑾(t)(𝑿iadv)).\displaystyle+\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d})\left(1+\lambda\alpha\eta+\frac{\alpha}{\sigma^{2}d}\right)+\Theta(\eta P\sigma^{2}\sqrt{d})\frac{\lambda}{N}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}^{\text{adv}})).

By summing up the terms, we prove the stronger result for $t+1$.

Finally, we simplify the form of the stronger result by using $\sum_{q=0}^{\infty}(P^{-1}\sqrt{d})^{-q}=(1-P/\sqrt{d})^{-1}=\Theta(1)$, which implies the conclusion of Lemma E.5. ∎

Now, we prove Lemma 5.13 based on Lemma E.5, as follows.

Theorem E.6.

(Restatement of Lemma 5.13) For each $i\in[N]$, $r\in[m]$ and $j\in[P]\setminus\operatorname{signal}(\bm{X}_{i})$, and any $t\geq 1$, the noise component grows as

vi,j,r(t)vi,j,r(0)+Θ(ησ2dN)s=0t1ψ(𝒱i(s))(vi,j,r(s))2O~(Pσ2α1d).v_{i,j,r}^{(t)}\geq v_{i,j,r}^{(0)}+\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\sum_{s=0}^{t-1}\psi(\mathcal{V}_{i}^{(s)})\left(v_{i,j,r}^{(s)}\right)^{2}-\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d}).
Proof.

By applying the one-side inequality of Lemma E.5, we have

vi,j,r(t)vi,j,r(t0)Θ(ησ2dN)s=t0t1ψ~i(s)(vi,j,r(s))2\displaystyle v_{i,j,r}^{(t)}-v_{i,j,r}^{(t_{0})}-\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\sum_{s=t_{0}}^{t-1}\tilde{\psi}_{i}^{(s)}\left(v_{i,j,r}^{(s)}\right)^{2} O~(ληα3(1γ)3N)s=t0t1i=1Nψ(yif𝑾(s)(𝑿iadv))\displaystyle\geq-\tilde{O}\left(\frac{\lambda\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{s=t_{0}}^{t-1}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}^{\text{adv}}_{i}))
O~(Pσ2α1d).\displaystyle-\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d}).

Thus, we obtain Theorem E.6 by using O~(ληα3(1γ)3N)s=t0t1i=1Nψ(yif𝑾(s)(𝑿iadv))O~(ληTα3(1γ)3)O~(Pσ2α1d)\tilde{O}\left(\frac{\lambda\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{s=t_{0}}^{t-1}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}^{\text{adv}}_{i}))\leq\tilde{O}(\lambda\eta T\alpha^{3}(1-\gamma)^{3})\leq\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d}) and ψ~i(s)=Θ(1)ψ(𝒱i(s))\tilde{\psi}_{i}^{(s)}=\Theta(1)\psi(\mathcal{V}_{i}^{(s)}) derived by Hypothesis E.1. ∎

Consequently, we show that the total noise components reach constant order, as follows.

Lemma E.7.

During adversarial training, with high probability, it holds that, after $T_{1}=\Theta\left(\frac{N}{\eta\sigma_{0}\sigma^{3}d^{\frac{3}{2}}}\right)$ iterations, for all $t\in[T_{1},T]$ and each $i\in[N]$, we have $\mathcal{V}_{i}^{(t)}\geq\tilde{\Omega}(1)$.

Proof.

By applying Lemma E.5 in the same way as in the proof of Theorem E.6, we know that

|vi,j,r(t)vi,j,r(t0)Θ(ησ2dN)s=t0t1ψ~i(s)(vi,j,r(s))2|O~(Pσ2α1d),\left|v_{i,j,r}^{(t)}-v_{i,j,r}^{(t_{0})}-\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\sum_{s=t_{0}}^{t-1}\tilde{\psi}_{i}^{(s)}\left(v_{i,j,r}^{(s)}\right)^{2}\right|\leq\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d}),

which implies that, for any iteration tTt\leq T, we have

{vi,j,r(t)vi,j,r(0)+As=0t1(vi,j,r(s))2Cvi,j,r(t)vi,j,r(0)+As=0t1(vi,j,r(s))2+C,\left\{\begin{array}[]{l}v_{i,j,r}^{(t)}\geq v_{i,j,r}^{(0)}+A\sum_{s=0}^{t-1}\left(v_{i,j,r}^{(s)}\right)^{2}-C\\ v_{i,j,r}^{(t)}\leq v_{i,j,r}^{(0)}+A\sum_{s=0}^{t-1}\left(v_{i,j,r}^{(s)}\right)^{2}+C\end{array},\right.

where A,C>0A,C>0 are constants defined as

A=Θ~(ησ2d)N,C=O~(Pσ2α1d).A=\frac{\tilde{\Theta}\left(\eta\sigma^{2}d\right)}{N},\quad C=\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d}).

At initialization, since we choose the weights $\bm{w}_{r}^{(0)}\sim\mathcal{N}\left(0,\sigma_{0}^{2}\mathbf{I}_{d}\right)$ and $\bm{X}_{i}[j]\sim\mathcal{N}\left(0,\sigma^{2}\mathbf{I}_{d}\right)$, the initial noise components $v_{i,j,r}^{(0)}$ are i.i.d. zero-mean Gaussian random variables, which implies that, with high probability, there exists at least one index $r^{\prime}$ such that $v_{i,j,r^{\prime}}^{(0)}\geq\tilde{\Omega}(P\sigma^{2}\alpha^{-1}\sqrt{d})$. By using the Tensor Power Method (Lemma C.2) and setting $v=\tilde{\Theta}(1)$, we obtain the threshold iteration $T_{1}$ as

T1=21NΘ~(ησ2d)vi,j,r(0)+8NΘ~(ησ2d)(vi,j,r(0))log(O~(1)vi,j,r(0))log(2).T_{1}=\frac{21N}{\tilde{\Theta}\left(\eta\sigma^{2}d\right)v_{i,j,r}^{(0)}}+\frac{8N}{\tilde{\Theta}\left(\eta\sigma^{2}d\right)\left(v_{i,j,r}^{(0)}\right)}\left\lceil\frac{\log\left(\frac{\tilde{O}(1)}{v_{i,j,r}^{(0)}}\right)}{\log(2)}\right\rceil.

Therefore, we get $T_{1}=\Theta\left(\frac{N}{\eta\sigma_{0}\sigma^{3}d^{\frac{3}{2}}}\right)$, and we use $\mathcal{V}_{i}^{(t)}=\sum_{r\in[m]}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{3}$ to derive $\mathcal{V}_{i}^{(t)}\geq\tilde{\Omega}(1)$. ∎

Indeed, our objective $\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}\right)$ is non-convex due to the non-linearity of the CNN model $f_{\bm{W}}$. To analyze the convergence of the gradient algorithm, we prove the following condition, which is used to establish global convergence in non-convex optimization (Karimi et al., 2016; Li et al., 2019).

Lemma E.8.

(Lojasiewicz Inequality for Non-convex Optimization) During adversarial training, with high probability, it holds that, after T1=Θ(Nησ0σ3d32)T_{1}=\Theta\left(\frac{N}{\eta\sigma_{0}\sigma^{3}d^{\frac{3}{2}}}\right) iterations, for all t[T1,T]t\in[T_{1},T], we have

𝑾^adv(𝑾(t))2Ω~(1)^adv(𝑾(t)).\left\|\nabla_{\bm{W}}\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right)\right\|_{2}\geq\tilde{\Omega}(1)\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right).
Proof.

To prove Lojasiewicz Inequality, we first recall the gradient w.r.t. 𝒘r\bm{w}_{r} as

𝒘r^adv(𝑾(t))\displaystyle\nabla_{\bm{w}_{r}}\widehat{\mathcal{L}}_{\text{adv}}(\bm{W}^{(t)}) =3N((ur(t))2(i=1N(1λ)α3ψ(yif𝑾(t)(𝑿i))+λα3(1γ)3ψ(yif𝑾(t)(𝑿iadv)))𝒘\displaystyle=-\frac{3}{N}\left(\left(u_{r}^{(t)}\right)^{2}\left(\sum_{i=1}^{N}(1-\lambda)\alpha^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))+\lambda\alpha^{3}(1-\gamma)^{3}\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))\right)\bm{w}^{*}\right.
+i=1Njsignal(𝑿i)(vi,j,r(t))2((1λ)ψ(yif𝑾(t)(𝑿i))+λψ(yif𝑾(t)(𝑿iadv)))𝑿i[j]).\displaystyle+\left.\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}\left((1-\lambda)\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))+\lambda\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))\right)\bm{X}_{i}[j]\right).

Then, we project the gradient on the signal direction and total noise, respectively.

For the signal component, we have

𝒘r^adv(𝑾(t))22\displaystyle\left\|\nabla_{\bm{w}_{r}}\widehat{\mathcal{L}}_{\text{adv}}(\bm{W}^{(t)})\right\|_{2}^{2} 𝒘r^adv(𝑾(t)),𝒘2\displaystyle\geq\left\langle\nabla_{\bm{w}_{r}}\widehat{\mathcal{L}}_{\text{adv}}(\bm{W}^{(t)}),\bm{w}^{*}\right\rangle^{2}
Ω~(1)((1λ)α3(ur(t))2ψ(α3k[m](uk(t))3))2.\displaystyle\geq\tilde{\Omega}(1)\left((1-\lambda)\alpha^{3}\left(u_{r}^{(t)}\right)^{2}\psi\left(\alpha^{3}\sum_{k\in[m]}\left(u_{k}^{(t)}\right)^{3}\right)\right)^{2}.

For the total noise component, we have

𝒘r^adv(𝑾(t))22𝒘r^adv(𝑾(t)),i=1Njsignal(𝑿i)𝑿i[j]i=1Njsignal(𝑿i)𝑿i[j]22\displaystyle\left\|\nabla_{\bm{w}_{r}}\widehat{\mathcal{L}}_{\text{adv}}(\bm{W}^{(t)})\right\|_{2}^{2}\geq\left\langle\nabla_{\bm{w}_{r}}\widehat{\mathcal{L}}_{\text{adv}}(\bm{W}^{(t)}),\frac{\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\bm{X}_{i}[j]}{\left\|\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\bm{X}_{i}[j]\right\|_{2}}\right\rangle^{2}
=3Ni=1Njsignal(𝑿i)(vi,j,r(t))2((1λ)ψ(yif𝑾(t)(𝑿i))+λψ(yif𝑾(t)(𝑿iadv)))𝑿i[j],\displaystyle=\left\langle-\frac{3}{N}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}\left((1-\lambda)\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))\right.+\left.\lambda\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}^{\text{adv}}_{i}))\right)\bm{X}_{i}[j],\right.
a=1Nbsignal(𝑿a)𝑿a[b]a=1Nbsignal(𝑿a)𝑿a[b]22\displaystyle\left.\right.\left.\frac{\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}}\right\rangle^{2}
=(3Ni=1Njsignal(𝑿i)(vi,j,r(t))2(1λ)ψ(yif𝑾(t)(𝑿i))𝑿i[j],a=1Nbsignal(𝑿a)𝑿a[b]a=1Nbsignal(𝑿a)𝑿a[b]2\displaystyle=\left(\left\langle-\frac{3}{N}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}(1-\lambda)\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))\bm{X}_{i}[j]\right.,\frac{\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}}\right\rangle
+3Ni=1Njsignal(𝑿i)(vi,j,r(t))2λψ(yif𝑾(t)(𝑿iadv))𝑿i[j],a=1Nbsignal(𝑿a)𝑿a[b]a=1Nbsignal(𝑿a)𝑿a[b]2)2.\displaystyle+\left\langle-\frac{3}{N}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}\lambda\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}^{\text{adv}}))\bm{X}_{i}[j],\left.\frac{\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}}\right\rangle\right)^{2}.

For the first term, with high probability, it holds that

3Ni=1Njsignal(𝑿i)(vi,j,r(t))2(1λ)ψ(yif𝑾(t)(𝑿i))𝑿i[j],a=1Nbsignal(𝑿a)𝑿a[b]a=1Nbsignal(𝑿a)𝑿a[b]2\displaystyle\left\langle-\frac{3}{N}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}(1-\lambda)\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))\bm{X}_{i}[j],\frac{\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}}\right\rangle
O~(σ)(1λ)Ni=1Njsignal(𝑿i)(vi,j,r(t))2(1λ)ψ(yif𝑾(t)(𝑿i)),\displaystyle\geq-\frac{\tilde{O}(\sigma)(1-\lambda)}{N}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}(1-\lambda)\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i})),

where we use that 𝑿i[j],a=1Nbsignal(𝑿a)𝑿a[b]a=1Nbsignal(𝑿a)𝑿a[b]2\left\langle\bm{X}_{i}[j],\frac{\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}}\right\rangle is a sub-Gaussian random variable of parameter σ\sigma, which implies w.h.p. |𝑿i[j],a=1Nbsignal(𝑿a)𝑿a[b]a=1Nbsignal(𝑿a)𝑿a[b]2|O~(σ)\left|\left\langle\bm{X}_{i}[j],\frac{\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}}\right\rangle\right|\leq\tilde{O}(\sigma).

For the second term, with high probability, it holds that

3Ni=1Njsignal(𝑿i)(vi,j,r(t))2λψ(yif𝑾(t)(𝑿iadv))𝑿i[j],a=1Nbsignal(𝑿a)𝑿a[b]a=1Nbsignal(𝑿a)𝑿a[b]2\displaystyle\left\langle\frac{3}{N}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}\lambda\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}^{\text{adv}}))\bm{X}_{i}[j],\frac{\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}}\right\rangle
=Θ(1)Ni=1Njsignal(𝑿i)(vi,j,r(t))2λψ(yif𝑾(t)(𝑿iadv))𝑿i[j]22a=1Nbsignal(𝑿a)𝑿a[b]2\displaystyle=\frac{\Theta(1)}{N}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}\lambda\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}^{\text{adv}}))\frac{\left\|\bm{X}_{i}[j]\right\|_{2}^{2}}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}}
=Θ(σd)Ni=1Njsignal(𝑿i)(vi,j,r(t))2λψ(yif𝑾(t)(𝑿iadv)),\displaystyle=\frac{\Theta(\sigma\sqrt{d})}{N}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}\lambda\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}^{\text{adv}})),

where we use w.h.p. 𝑿i[j],𝑿i[j]a=1Nbsignal(𝑿a)𝑿a[b]2Θ(1d)𝑿i[j]22a=1Nbsignal(𝑿a)𝑿a[b]2\frac{\left\langle\bm{X}_{i}[j],\bm{X}_{i^{\prime}}[j^{\prime}]\right\rangle}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}}\leq\frac{\Theta\left(\frac{1}{\sqrt{d}}\right)\left\|\bm{X}_{i}[j]\right\|_{2}^{2}}{\left\|\sum_{a=1}^{N}\sum_{b\neq\operatorname{signal}(\bm{X}_{a})}\bm{X}_{a}[b]\right\|_{2}} for (i,j)(i,j)(i,j)\neq(i^{\prime},j^{\prime}).

Now, combining the above bounds, we derive

r=1m𝒘r^adv(𝑾(t))22\displaystyle\sum_{r=1}^{m}\left\|\nabla_{\bm{w}_{r}}\widehat{\mathcal{L}}_{\text{adv}}(\bm{W}^{(t)})\right\|_{2}^{2} r=1m𝒘r^adv(𝑾(t)),𝒘2\displaystyle\geq\sum_{r=1}^{m}\left\langle\nabla_{\bm{w}_{r}}\widehat{\mathcal{L}}_{\text{adv}}(\bm{W}^{(t)}),\bm{w}^{*}\right\rangle^{2}
+r=1m𝒘r^adv(𝑾(t)),i=1Njsignal(𝑿i)𝑿i[j]i=1Njsignal(𝑿i)𝑿i[j]22\displaystyle+\sum_{r=1}^{m}\left\langle\nabla_{\bm{w}_{r}}\widehat{\mathcal{L}}_{\text{adv}}(\bm{W}^{(t)}),\frac{\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\bm{X}_{i}[j]}{\left\|\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\bm{X}_{i}[j]\right\|_{2}}\right\rangle^{2}
Ω(1m)((1λ)α3r=1m(ur(t))2ψ(α3k[m](uk(t))3)\displaystyle\geq\Omega\left(\frac{1}{m}\right)\left((1-\lambda)\alpha^{3}\sum_{r=1}^{m}\left(u_{r}^{(t)}\right)^{2}\psi\left(\alpha^{3}\sum_{k\in[m]}\left(u_{k}^{(t)}\right)^{3}\right)\right.
+Θ(σd)Nr=1mi=1Njsignal(𝑿i)(vi,j,r(t))2λψ(yif𝑾(t)(𝑿iadv))\displaystyle+\frac{\Theta(\sigma\sqrt{d})}{N}\sum_{r=1}^{m}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}\lambda\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}^{\text{adv}}))
O~(σ)(1λ)Nr=1mi=1Njsignal(𝑿i)(vi,j,r(t))2(1λ)ψ(yif𝑾(t)(𝑿i)))2\displaystyle\left.-\frac{\tilde{O}(\sigma)(1-\lambda)}{N}\sum_{r=1}^{m}\sum_{i=1}^{N}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(t)}\right)^{2}(1-\lambda)\psi(y_{i}f_{\bm{W}^{(t)}}(\bm{X}_{i}))\right)^{2}
Ω~(1)((1λ)ϕ(α3r[m](ur(t))3)+λNi=1Nϕ(𝒱i(t)))2\displaystyle\geq\tilde{\Omega}(1)\left((1-\lambda)\phi\left(\alpha^{3}\sum_{r\in[m]}\left(u_{r}^{(t)}\right)^{3}\right)+\frac{\lambda}{N}\sum_{i=1}^{N}\phi\left(\mathcal{V}_{i}^{(t)}\right)\right)^{2}
$\geq\tilde{\Omega}(1)\left(\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right)\right)^{2},$

which yields the claimed inequality after taking square roots. ∎

Consequently, we derive the following sub-linear convergence result by applying Lojasiewicz Inequality.

Lemma E.9.

(Sub-linear Convergence for Adversarial Training) During adversarial training, with high probability, it holds that, after T1=Θ(Nησ0σ3d32)T_{1}=\Theta\left(\frac{N}{\eta\sigma_{0}\sigma^{3}d^{\frac{3}{2}}}\right) iterations, the adversarial training loss sub-linearly converges to zero as

^adv(𝑾(t))O~(1)η(tT1+1).\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right)\leq\frac{\tilde{O}(1)}{\eta(t-T_{1}+1)}.
Proof.

Due to the smoothness of the loss function $\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}\right)$ and the learning rate $\eta=\tilde{O}(1)$, we have

$\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t+1)}\right)\leq\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right)-\frac{\eta}{2}\left\|\nabla_{\bm{W}}\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right)\right\|_{2}^{2}$
^adv(𝑾(t))Ω~(η)(^adv(𝑾(t)))2,\displaystyle\leq\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right)-\tilde{\Omega}(\eta)\left(\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right)\right)^{2},

where we use the Lojasiewicz inequality (Lemma E.8) in the last step. Then, by applying Lemma C.3, we obtain the sub-linear convergence rate. ∎
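A quick numerical check of this rate (a sketch with arbitrary constants): iterating the loss recursion with equality stays below the $\frac{1}{A(t-\mathcal{T})}$ envelope of Lemma C.3.

```python
import numpy as np

A, T = 0.1, 1000
L = np.empty(T)
L[0] = 1.0                        # initial loss (arbitrary)
for t in range(T - 1):
    L[t + 1] = L[t] - A * L[t] ** 2
t = np.arange(1, T)
assert np.all(L[1:] <= 1.0 / (A * t))   # sub-linear O(1/t) decay holds
print(L[-1], 1.0 / (A * (T - 1)))
```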

Now, we present the following result to bound the derivative generated by the training-time adversarial examples.

Lemma E.10.

During adversarial training, with high probability, it holds that, after T1=Θ(Nησ0σ3d32)T_{1}=\Theta\left(\frac{N}{\eta\sigma_{0}\sigma^{3}d^{\frac{3}{2}}}\right) iterations, we have λNs=0ti=1Nψ(yif𝐖(s)(𝐗iadv))O~(η1σ01).\frac{\lambda}{N}\sum_{s=0}^{t}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}_{i}^{\text{adv}}))\leq\tilde{O}(\eta^{-1}\sigma_{0}^{-1}).

Proof.

First, we bound the total derivative over iterations $s=T_{1},\dots,t$. By applying the conclusion of Lemma E.5, we have

λNs=T1ti=1Nψ(yif𝑾(s)(𝑿iadv))\displaystyle\frac{\lambda}{N}\sum_{s=T_{1}}^{t}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}_{i}^{\text{adv}})) O~(1)Ns=T1t1i=1Nψ~i(s)(vi,j,r(s))2\displaystyle\leq\frac{\tilde{O}(1)}{N}\sum_{s=T_{1}}^{t-1}\sum_{i=1}^{N}\tilde{\psi}_{i}^{(s)}\left(v_{i,j,r}^{(s)}\right)^{2}
+O~(λα3(1γ)3Nσ2d)s=T1t1i=1Nψ(yif𝑾(s)(𝑿iadv))+O~(Pηαd).\displaystyle+\tilde{O}\left(\frac{\lambda\alpha^{3}(1-\gamma)^{3}}{N\sigma^{2}d}\right)\sum_{s=T_{1}}^{t-1}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}_{i}^{\text{adv}}))+\tilde{O}\left(\frac{P}{\eta\alpha\sqrt{d}}\right).

Due to O~(α3(1γ)3σ2d)1\tilde{O}\left(\frac{\alpha^{3}(1-\gamma)^{3}}{\sigma^{2}d}\right)\ll 1, we know

λNs=T1ti=1Nψ(yif𝑾(s)(𝑿iadv))\displaystyle\frac{\lambda}{N}\sum_{s=T_{1}}^{t}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}_{i}^{\text{adv}})) O~(1)Ns=T1t1i=1Nψ~i(s)(vi,j,r(s))2+O~(Pηαd)\displaystyle\leq\frac{\tilde{O}(1)}{N}\sum_{s=T_{1}}^{t-1}\sum_{i=1}^{N}\tilde{\psi}_{i}^{(s)}\left(v_{i,j,r}^{(s)}\right)^{2}+\tilde{O}\left(\frac{P}{\eta\alpha\sqrt{d}}\right)
O~(1)Ns=T1t1i=1Nϕ(𝒱i(s))+O~(Pηαd)\displaystyle\leq\frac{\tilde{O}(1)}{N}\sum_{s=T_{1}}^{t-1}\sum_{i=1}^{N}\phi\left(\mathcal{V}_{i}^{(s)}\right)+\tilde{O}\left(\frac{P}{\eta\alpha\sqrt{d}}\right)
O~(1)s=T1t1^adv(𝑾(t))+O~(Pηαd)\displaystyle\leq\tilde{O}(1)\sum_{s=T_{1}}^{t-1}\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(t)}\right)+\tilde{O}\left(\frac{P}{\eta\alpha\sqrt{d}}\right)
O~(1)s=T1t1O~(1)η(tT1+1)+O~(Pηαd)O~(η1).\displaystyle\leq\tilde{O}(1)\sum_{s=T_{1}}^{t-1}\frac{\tilde{O}(1)}{\eta(t-T_{1}+1)}+\tilde{O}\left(\frac{P}{\eta\alpha\sqrt{d}}\right)\leq\tilde{O}(\eta^{-1}).

Thus, we obtain $\frac{\lambda}{N}\sum_{s=0}^{t}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}_{i}^{\text{adv}}))\leq\tilde{O}(\sigma_{0}^{-1})+\tilde{O}(\eta^{-1})\leq\tilde{O}(\eta^{-1}\sigma_{0}^{-1})$. ∎

Consequently, we have the following lemma, which verifies Hypothesis E.1 for $t=T$.

Lemma E.11.

During adversarial training, with high probability, it holds that, for any tTt\leq T, we have maxr[m]ur(t)O~(α1)\max_{r\in[m]}u_{r}^{(t)}\leq\tilde{O}(\alpha^{-1}) and |vi,j,r(t)|O~(1)|v_{i,j,r}^{(t)}|\leq\tilde{O}(1) for each r[m],i[N],j[P]signal(𝐗i)r\in[m],i\in[N],j\in[P]\setminus\operatorname{signal}(\bm{X}_{i}).

Proof.

Combining Theorem E.4 and Lemma E.10, we derive $\max_{r\in[m]}u_{r}^{(T)}\leq\tilde{O}(\alpha^{-1})$.

By applying Lemma E.5, we have

|vi,j,r(T)|\displaystyle|v_{i,j,r}^{(T)}| |vi,j,r(0)|+Θ(ησ2dN)s=t0t1ψ~i(s)(vi,j,r(s))2\displaystyle\leq|v_{i,j,r}^{(0)}|+\Theta\left(\frac{\eta\sigma^{2}d}{N}\right)\sum_{s=t_{0}}^{t-1}\tilde{\psi}_{i}^{(s)}\left(v_{i,j,r}^{(s)}\right)^{2}
+O~(ληα3(1γ)3N)s=t0t1i=1Nψ(yif𝑾(s)(𝑿iadv))+O~(Pσ2α1d)\displaystyle+\tilde{O}\left(\frac{\lambda\eta\alpha^{3}(1-\gamma)^{3}}{N}\right)\sum_{s=t_{0}}^{t-1}\sum_{i=1}^{N}\psi(y_{i}f_{\bm{W}^{(s)}}(\bm{X}^{\text{adv}}_{i}))+\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d})
O~(1)+O~(σ2d)+O~(α3(1γ)3σ01)+O~(Pσ2α1d)O~(1).\displaystyle\leq\tilde{O}(1)+\tilde{O}(\sigma^{2}d)+\tilde{O}(\alpha^{3}(1-\gamma)^{3}\sigma_{0}^{-1})+\tilde{O}(P\sigma^{2}\alpha^{-1}\sqrt{d})\leq\tilde{O}(1).

Therefore, our Hypothesis E.1 holds for iteration t=Tt=T. ∎

Finally, we prove our main result, as follows.

Theorem E.12.

(Restatement of Theorem 5.9) Under Assumption 5.7, we run the adversarial training algorithm to update the weight of the simplified CNN model for T=Ω(poly(d))T=\Omega(\operatorname{poly}(d)) iterations. Then, with high probability, it holds that the CNN model

  1. partially learns the true feature, i.e. $\mathcal{U}^{(T)}=\Theta({\alpha}^{-3})$;

  2. exactly memorizes the spurious features, i.e., for each $i\in[N]$, $\mathcal{V}_{i}^{(T)}=\Theta(1)$,

where $\mathcal{U}^{(t)}$ and $\mathcal{V}_{i}^{(t)}$ are defined for the $i$-th instance $(\bm{X}_{i},y_{i})$ and the $t$-th iteration as in (5.8). Consequently, the clean test error and the robust training error are both at most $o(1)$, but the robust test error is at least $\frac{1}{2}-o(1)$.

Proof.

First, by applying Lemma E.3, Lemma E.7 and Lemma E.11, we know for any i[N]i\in[N]

𝒰(T)\displaystyle\mathcal{U}^{(T)} =r[m](ur(T))3=Θ(α3)\displaystyle=\sum_{r\in[m]}\left(u_{r}^{(T)}\right)^{3}=\Theta(\alpha^{-3})
𝒱i(T)\displaystyle\mathcal{V}_{i}^{(T)} =r[m]jsignal(𝑿i)(vi,j,r(T))3=Θ(1).\displaystyle=\sum_{r\in[m]}\sum_{j\neq\operatorname{signal}(\bm{X}_{i})}\left(v_{i,j,r}^{(T)}\right)^{3}=\Theta(1).

Then, since the adversarial loss sub-linearly converges to zero, i.e. $\widehat{\mathcal{L}}_{\text{adv}}\left(\bm{W}^{(T)}\right)\leq\frac{\tilde{O}(1)}{\eta(T-T_{1}+1)}\leq\tilde{O}\left(\frac{1}{\operatorname{poly}(d)}\right)=o(1)$, the robust training error is also at most $o(1)$.

To analyze test errors, we decompose 𝒘r(T)\bm{w}_{r}^{(T)} into 𝒘r(T)=μr(T)𝒘+𝜷r(T)\bm{w}_{r}^{(T)}=\mu_{r}^{(T)}\bm{w}^{*}+\bm{\beta}_{r}^{(T)} for each r[m]r\in[m], where 𝜷r(T)(span(𝒘))\bm{\beta}_{r}^{(T)}\in\left(\operatorname{span}(\bm{w}^{*})\right)^{\perp}. Due to 𝒱i(T)=Θ(1)\mathcal{V}_{i}^{(T)}=\Theta(1), we know 𝜷r(T)2=Θ(1)\|\bm{\beta}_{r}^{(T)}\|_{2}=\Theta(1).

For the clean test error, we have

(𝑿,y)𝒟[yf𝑾(T)(𝑿)<0]\displaystyle\mathbb{P}_{(\bm{X},y)\sim\mathcal{D}}\left[yf_{\bm{W}^{(T)}}(\bm{X})<0\right] =(𝑿,y)𝒟[α3r=1m(ur(T))3+yr=1mj[P]signal(𝑿)𝒘r(T),𝑿[j]3<0]\displaystyle=\mathbb{P}_{(\bm{X},y)\sim\mathcal{D}}\left[\alpha^{3}\sum_{r=1}^{m}\left(u_{r}^{(T)}\right)^{3}+y\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{w}_{r}^{(T)},\bm{X}[j]\right\rangle^{3}<0\right]
(𝑿,y)𝒟[r=1mj[P]signal(𝑿)𝜷r(T),𝑿[j]3Ω~(1)]\displaystyle\leq\mathbb{P}_{(\bm{X},y)\sim\mathcal{D}}\left[\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{\beta}_{r}^{(T)},\bm{X}[j]\right\rangle^{3}\geq\tilde{\Omega}(1)\right]
exp(Ω~(1)σ6r=1m𝜷r(T)26)O(1poly(d))=o(1),\displaystyle\leq\exp\left(-\frac{\tilde{\Omega}(1)}{\sigma^{6}\sum_{r=1}^{m}\|\bm{\beta}_{r}^{(T)}\|_{2}^{6}}\right)\leq O\left(\frac{1}{\operatorname{poly}(d)}\right)=o(1),

where we use the fact that r=1mj[P]signal(𝑿)𝜷r(T),𝑿[j]3\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{\beta}_{r}^{(T)},\bm{X}[j]\right\rangle^{3} is a sub-Gaussian random variable with parameter σ3(P1)r=1m𝜷r(T)26\sigma^{3}\sqrt{(P-1)\sum_{r=1}^{m}\|\bm{\beta}_{r}^{(T)}\|_{2}^{6}}.

For the robust test error, we use $\mathcal{A}(\cdot)$ to denote the attack in Definition 5.6, and then we derive

\mathbb{P}_{(\bm{X},y)\sim\mathcal{D}}\left[\min_{\|\bm{\xi}\|_{2}\leq\delta}yf_{\bm{W}^{(T)}}(\bm{X}+\bm{\xi})<0\right]\geq\mathbb{P}_{(\bm{X},y)\sim\mathcal{D}}\left[yf_{\bm{W}^{(T)}}(\mathcal{A}(\bm{X}))<0\right]
=\mathbb{P}_{(\bm{X},y)\sim\mathcal{D}}\left[\alpha^{3}\sum_{r=1}^{m}\left(u_{r}^{(T)}\right)^{3}(1-\gamma)^{3}+y\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{w}_{r}^{(T)},\bm{X}[j]\right\rangle^{3}<0\right]
\geq\frac{1}{2}\mathbb{P}_{(\bm{X},y)\sim\mathcal{D}}\left[\left|\sum_{r=1}^{m}\sum_{j\in[P]\setminus\operatorname{signal}(\bm{X})}\left\langle\bm{\beta}_{r}^{(T)},\bm{X}[j]\right\rangle^{3}\right|\geq\tilde{\Omega}\left((1-\gamma)^{3}\right)\right]\geq\frac{1}{2}\left(1-\frac{\tilde{O}(d)}{2^{d}}\right)=\frac{1}{2}-o(1),

where we use Lemma C.4 in the last inequality. ∎

Appendix F Proof for Section 4

We prove Theorem 4.4 by using a ReLU network to approximate the classifier $f_{\mathcal{S}}$ proposed in Section 1.

Theorem F.1.

(Restatement of Theorem 4.4) Under Assumptions 4.1, 4.2 and 4.3, with an $N$-sample training dataset $\mathcal{S}=\{(\bm{X}_{1},y_{1}),(\bm{X}_{2},y_{2}),\dots,(\bm{X}_{N},y_{N})\}$ drawn from the data distribution $\mathcal{D}$, there exists a CGRO classifier that can be represented as a ReLU network with $\operatorname{poly}(D)+\tilde{O}(ND)$ parameters; that is, under the distribution $\mathcal{D}$ and dataset $\mathcal{S}$, the network achieves zero clean test and robust training errors, but its robust test error is at least $\Omega(1)$.

Proof.

First, we give the following useful results about function approximation by ReLU nets.

Lemma F.2.

(Yarotsky, 2017) The function $f(x)=x^{2}$ on the segment $[0,1]$ can be approximated with any error $\epsilon>0$ by a ReLU network having depth and number of weights and computation units $O(\log(1/\epsilon))$.
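For intuition, the construction behind Lemma F.2 is the well-known sawtooth trick: the hat function $g$ composed $s$ times is a sawtooth with $2^{s-1}$ teeth, and $x-\sum_{s=1}^{m}g_{s}(x)/4^{s}$ approximates $x^{2}$ with error $2^{-2(m+1)}$ using $O(m)$ weights. The following minimal numerical sketch (our own illustration, not code from the paper) verifies this rate:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # Hat (tooth) function on [0, 1], built from three ReLU neurons:
    # g(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1].
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def square_approx(x, m):
    """Yarotsky's approximation of x^2 on [0, 1]:
    f_m(x) = x - sum_{s=1}^{m} g_s(x) / 4^s, where g_s = g composed s times.
    Realizable by a ReLU net of size O(m), with error at most 2^{-2(m+1)}."""
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for s in range(1, m + 1):
        g = hat(g)            # g_s(x): sawtooth with 2^{s-1} teeth
        out = out - g / 4**s
    return out

xs = np.linspace(0.0, 1.0, 10001)
for m in [2, 4, 8]:
    err = np.abs(square_approx(xs, m) - xs**2).max()
    print(f"m = {m}: max error = {err:.2e} (bound 2^-{2*(m+1)} = {2.0**(-2*(m+1)):.2e})")
```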

Lemma F.3.

(Yarotsky, 2017) Let $\epsilon>0$, $0<a<b$ and $B\geq 1$ be given. There exists a function $\widetilde{\times}:[0,B]^{2}\to[0,B^{2}]$ computed by a ReLU network with $O\left(\log^{2}\left(\epsilon^{-1}B\right)\right)$ parameters such that

\sup_{x,y\in[0,B]}\left|\widetilde{\times}(x,y)-xy\right|\leq\epsilon,

and $\widetilde{\times}(x,y)=0$ if $xy=0$.
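The reduction behind Lemma F.3 is the polarization identity; we record it here as a worked step (the accuracy bookkeeping is our own paraphrase). For $x,y\in[0,B]$,

xy=2B^{2}\left(\left(\frac{x+y}{2B}\right)^{2}-\left(\frac{x}{2B}\right)^{2}-\left(\frac{y}{2B}\right)^{2}\right),

where all three arguments lie in $[0,1]$, so three copies of the squaring network of Lemma F.2, each run at accuracy $\epsilon/(6B^{2})$, compute $\widetilde{\times}$ with total error at most $\epsilon$. The property $\widetilde{\times}(x,y)=0$ when $xy=0$ follows since the squaring network is exact at $0$ and the first two terms cancel identically when $y=0$ (symmetrically for $x=0$).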

Since for any $\bm{X}_{0}\in[0,1]^{D}$ the squared $\ell_{2}$-distance function satisfies $\|\bm{X}-\bm{X}_{0}\|^{2}=\sum_{i=1}^{D}|\bm{X}^{(i)}-\bm{X}_{0}^{(i)}|^{2}$, by using Lemma F.2 there exists a function $\phi_{1}$ computed by a ReLU network with $O\left(D\log\left(\varepsilon_{1}^{-1}D\right)\right)$ parameters such that $\sup_{\bm{X}\in[0,1]^{D}}\left|\phi_{1}(\bm{X})-\|\bm{X}-\bm{X}_{0}\|^{2}\right|\leq\varepsilon_{1}$.

Returning to our main proof: functions computed by ReLU networks are piecewise linear, but indicator functions are not continuous, so we relax the indicator to a soft version $\hat{I}_{\text{soft}}$ with $\hat{I}_{\text{soft}}(x)=1$ for $x\leq\delta+\epsilon_{0}$, $\hat{I}_{\text{soft}}(x)=0$ for $x\geq R-\delta-\epsilon_{0}$, and $\hat{I}_{\text{soft}}$ linear on $(\delta+\epsilon_{0},R-\delta-\epsilon_{0})$, using only two ReLU neurons, where $\epsilon_{0}$ is sufficiently small for the approximation.
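Concretely, one such two-neuron relaxation (spelled out here; the paper leaves the formula implicit) takes $a=\delta+\epsilon_{0}$ and $b=R-\delta-\epsilon_{0}$ and sets

\hat{I}_{\text{soft}}(x)=\frac{\operatorname{ReLU}(b-x)-\operatorname{ReLU}(a-x)}{b-a},

which equals $1$ for $x\leq a$, equals $0$ for $x\geq b$, and interpolates linearly on $(a,b)$.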

Now, we notice that the constructed function $f_{\mathcal{S}}$ can be rewritten as

f_{\mathcal{S}}(\bm{X})=f_{\text{clean}}(\bm{X})\left(1-\mathbb{I}\{\bm{X}\in\cup_{i=1}^{N}\mathbb{B}_{2}(\bm{X}_{i},\delta)\}\right)+\sum_{i=1}^{N}y_{i}\mathbb{I}\{\bm{X}\in\mathbb{B}_{2}(\bm{X}_{i},\delta)\}
=f_{\text{clean}}(\bm{X})+\sum_{i=1}^{N}(y_{i}-f_{\text{clean}}(\bm{X}))\,\mathbb{I}\{\|\bm{X}-\bm{X}_{i}\|_{2}^{2}\leq\delta^{2}\}.

Combining Lemma F.2, Lemma F.3 and the relaxed indicator, we know that there exists a ReLU net $h$ with at most $\operatorname{poly}(D)+\tilde{O}(ND)$ parameters such that $|h-f_{\mathcal{S}}|=o(1)$ for all inputs $\bm{X}\in[0,1]^{D}$. Thus, it is easy to check that $h$ belongs to the class of CGRO classifiers. ∎
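To illustrate the construction just used, here is a minimal executable sketch of $f_{\mathcal{S}}$ on synthetic data (entirely our own toy setup: the linear `clean_model`, the uniform data, and the hard indicator standing in for the two-neuron soft indicator are illustrative assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, delta = 20, 50, 0.1

# Toy "clean" classifier: a linear model that generalizes on natural data.
w_star = rng.normal(size=D); w_star /= np.linalg.norm(w_star)
def clean_model(X):
    return np.sign(X @ w_star)

# Training set labeled by the ground truth.
X_train = rng.uniform(0, 1, size=(N, D))
y_train = clean_model(X_train)

def f_S(X):
    """CGRO classifier: clean prediction outside the delta-balls around
    training points, memorized label inside them."""
    out = clean_model(X)
    d2 = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)  # squared l2 distances
    hit = d2 <= delta ** 2
    for i in range(N):
        out[hit[:, i]] = y_train[i]
    return out

# Any perturbation of norm <= delta stays inside the memorized ball, so f_S
# is robustly correct on training points while inheriting clean accuracy elsewhere.
X_adv = X_train + delta * rng.normal(size=(N, D)) / np.sqrt(D)
X_adv = X_train + (X_adv - X_train) * np.minimum(           # project back
    1.0, delta / np.linalg.norm(X_adv - X_train, axis=1, keepdims=True))
print("robust train acc:", (f_S(X_adv) == y_train).mean())
```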

Next, we prove Theorem 4.7 by using VC-dimension theory.

Theorem F.4.

(Restatement of Theorem 4.7) Let $\mathcal{F}_{M}$ be the family of functions represented by ReLU networks with at most $M$ parameters. There exists a number $M_{D}=\Omega(\exp(D))$ and a distribution $\mathcal{D}$ satisfying Assumptions 4.1, 4.2 and 4.3 such that, for any classifier in the family $\mathcal{F}_{M_{D}}$, under the distribution $\mathcal{D}$, the robust test error is at least $\Omega(1)$.

Proof.

Now, we notice that ReLU networks compute piecewise linear functions. Montufar et al. (2014) study the number of their linear regions, which provides the following result.

Proposition F.5.

The maximal number of linear regions of the functions computed by any ReLU network with a total of $n$ hidden units is bounded from above by $2^{n}$.

Thus, for a given clean classifier $f_{\text{clean}}$ represented by a ReLU net with $\operatorname{poly}(D)$ parameters, we know there exists at least one local region $V$ such that the decision boundary of $f_{\text{clean}}$ is a linear hyperplane in $V$. Without loss of generality, we assume that the hyperplane is $\bm{X}^{(D)}=\frac{1}{2}$.

Then, let $V^{\prime}$ be the projection of $V$ on the decision boundary of $f_{\text{clean}}$, and let $\mathcal{P}$ be a $2\delta$-packing of $V^{\prime}$. The packing number satisfies $\mathcal{P}(V^{\prime},\|\cdot\|_{2},2\delta)\geq\mathcal{C}(V^{\prime},\|\cdot\|_{2},2\delta)=\exp(\Omega(D))$, where $\mathcal{C}(\Theta,\|\cdot\|,\delta)$ denotes the $\delta$-covering number of a set $\Theta$. For any $\epsilon_{0}\in(0,1)$, we consider the construction

S_{\phi}=\left\{\left(\bm{x},\frac{1}{2}+\epsilon_{0}\cdot\phi(\bm{x})\right):\bm{x}\in\mathcal{P}\right\},

where $\phi:\mathcal{P}\to\{-1,+1\}$ is an arbitrary mapping. It is easy to see that all points in $S_{\phi}$ whose first $D-1$ components satisfy $\left\|\bm{x}\right\|_{2}\leq\sqrt{1-\epsilon_{0}^{2}}$ lie in $V$, so by choosing $\epsilon_{0}$ sufficiently small, we can guarantee that $\left|S_{\phi}\cap V\right|=\exp(\Omega(D))$. For convenience, we replace $S_{\phi}$ with $S_{\phi}\cap V$ from now on.

Let $A_{\phi}=S_{\phi}\cap\left\{\bm{X}\in V:\bm{X}^{(D)}>\frac{1}{2}\right\}$ and $B_{\phi}=S_{\phi}\setminus A_{\phi}$. It is easy to see that, for arbitrary $\phi$, the construction is linearly separable and satisfies $2\delta$-separability.

Assume that for any choice of $\phi$, the induced sets $A_{\phi}$ and $B_{\phi}$ can always be robustly classified with $(O(\delta),1-\mu)$-accuracy by a ReLU network with at most $M$ parameters. Then, we can construct an enveloping network $F_{\theta}$ with $M-1$ hidden layers, $M$ neurons per layer and at most $M^{3}$ parameters such that any network with at most $M$ parameters can be embedded into this enveloping network. As a result, $F_{\theta}$ is capable of $(O(\delta),1-\mu)$-robustly classifying any sets $A_{\phi},B_{\phi}$ induced by arbitrary choices of $\phi$. We use $R_{\phi}$ to denote a subset of $S_{\phi}=A_{\phi}\cup B_{\phi}$ with $\left|R_{\phi}\right|=(1-\mu)\left|S_{\phi}\right|=\exp(\Omega(D))$ such that $R_{\phi}$ can be $O(\delta)$-robustly classified.

Next, we estimate lower and upper bounds for the cardinality of the vector set

R:=\left\{(f(\bm{x}))_{\bm{x}\in\mathcal{P}}\;\middle|\;f\in\mathcal{F}_{M_{D}}\right\}.

Let $n$ denote $|\mathcal{P}|$; then we have

R=\left\{(f(\bm{x}_{1}),f(\bm{x}_{2}),\dots,f(\bm{x}_{n}))\;\middle|\;f\in\mathcal{F}_{M_{D}}\right\},

where $\mathcal{P}=\{\bm{x}_{1},\bm{x}_{2},\dots,\bm{x}_{n}\}$.

On the one hand, we know that for any $u\in\{-1,1\}^{n}$, there exists $v\in R$ such that $d_{H}(u,v)\leq\mu n$, where $d_{H}(\cdot,\cdot)$ denotes the Hamming distance; then we have

|R|\geq\mathcal{N}(\{-1,1\}^{n},d_{H},\mu n)\geq\frac{2^{n}}{\sum_{i=0}^{\mu n}\binom{n}{i}}.

On the other hand, by applying Lemma C.8, we have

\frac{2^{n}}{\sum_{i=0}^{\mu n}\binom{n}{i}}\leq|R|\leq\Pi_{\mathcal{F}_{M_{D}}}(n)\leq\sum_{j=0}^{l}\binom{n}{j},

where $l$ is the VC-dimension of $\mathcal{F}_{M_{D}}$. In fact, we can derive $l=\Omega(n)$ when $\mu$ is a small constant. Assume that $l<n-1$; then we have $\sum_{j=0}^{l}\binom{n}{j}\leq(en/l)^{l}$ and $\sum_{i=0}^{\mu n}\binom{n}{i}\leq(e/\mu)^{\mu n}$, so

\frac{2^{n}}{\left(e/\mu\right)^{\mu n}}\leq|R|\leq(en/l)^{l}.

We define the function $h(x)=(e/x)^{x}$; then we derive

2\leq\left(\frac{e}{\mu}\right)^{\mu}\left(\frac{e}{l/n}\right)^{l/n}=h(\mu)\,h(l/n).

When $\mu$ is sufficiently small, $l/n\geq C(\mu)$ for some constant $C(\mu)>0$ depending only on $\mu$, which implies $l=\Omega(n)$. Finally, by using Lemma C.7 and $n=|\mathcal{P}|=\exp(\Omega(D))$, we know $M_{D}=\exp(\Omega(D))$. ∎
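As a numerical sanity check of the last step (our own illustration, not part of the proof): $\ln h(x)=x(1-\ln x)$ is increasing on $(0,1]$ with $h(0^{+})=1$ and $h(1)=e$, so whenever $h(\mu)<2$ the constraint $h(l/n)\geq 2/h(\mu)$ forces $l/n\geq C(\mu):=h^{-1}(2/h(\mu))>0$. The sketch below computes $C(\mu)$ by bisection:

```python
import math

def h(x):
    # h(x) = (e/x)^x, increasing on (0, 1] with h(0+) = 1 and h(1) = e
    return (math.e / x) ** x

def C(mu, tol=1e-12):
    """Smallest c with h(c) >= 2 / h(mu), found by bisection on (0, 1]."""
    target = 2.0 / h(mu)
    assert target > 1.0, "mu must be small enough that h(mu) < 2"
    lo, hi = 1e-9, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

for mu in [0.01, 0.05, 0.1]:
    print(f"mu = {mu:.2f}  ->  l/n >= C(mu) ~ {C(mu):.4f}")
```

For instance, $\mu=0.1$ gives $C(\mu)\approx 0.11$, consistent with $l=\Omega(n)$.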

Appendix G Proof for Section B

Theorem G.1.

(Restatement of Theorem B.1) Let $\mathcal{D}$ be the underlying distribution with a smooth density function, and let the $N$-sample training dataset $\mathcal{S}=\{(\bm{X}_{1},y_{1}),(\bm{X}_{2},y_{2}),\dots,(\bm{X}_{N},y_{N})\}$ be drawn i.i.d. from $\mathcal{D}$. Then, with high probability, it holds that

\mathcal{L}_{\text{adv}}(f)\leq\widehat{\mathcal{L}}_{\text{adv}}(f)+N^{-\frac{1}{D+2}}\cdot O\left(\underbrace{\mathbb{E}_{(\bm{X},y)\sim\mathcal{D}}\left[\max_{\|\bm{\xi}\|_{p}\leq\delta}\|\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}+\bm{\xi}),y)\|_{q}\right]}_{\text{global flatness}}\right).
Proof.

Indeed, we notice the following loss decomposition,

\mathcal{L}_{\text{adv}}(f)-\widehat{\mathcal{L}}_{\text{adv}}(f)=\left(\mathcal{L}_{\text{clean}}(f)-\widehat{\mathcal{L}}_{\text{adv}}(f)\right)+\left(\mathcal{L}_{\text{adv}}(f)-\mathcal{L}_{\text{clean}}(f)\right).

To bound the first term, we use $\lambda_{i}$ to denote the kernel density estimator (KDE) proposed in Petzka et al. (2020); then we derive

\mathcal{L}_{\text{clean}}(f)-\widehat{\mathcal{L}}_{\text{adv}}(f)=\mathbb{E}_{(\bm{X},y)\sim\mathcal{D}}[\mathcal{L}(f(\bm{X}),y)]-\frac{1}{N}\sum_{i=1}^{N}\max_{\|\bm{\xi}\|_{p}\leq\delta}\mathcal{L}(f(\bm{X}_{i}+\bm{\xi}),y_{i})
\leq\mathbb{E}_{(\bm{X},y)\sim\mathcal{D}}[\mathcal{L}(f(\bm{X}),y)]-\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\bm{\xi}\sim\lambda_{i}}[\mathcal{L}(f(\bm{X}_{i}+\bm{\xi}),y_{i})]
=\int_{\bm{X}}p_{\mathcal{D}}(\bm{X})\mathcal{L}(f(\bm{X}),y)\,d\bm{X}-\int_{\bm{X}}p_{\mathcal{S}}(\bm{X})\mathcal{L}(f(\bm{X}),y)\,d\bm{X}
\leq\underbrace{\left|\int_{\bm{X}}(p_{\mathcal{D}}(\bm{X})-\mathbb{E}_{\mathcal{S}}[p_{\mathcal{S}}(\bm{X})])\mathcal{L}(f(\bm{X}),y(\bm{X}))\,d\bm{X}\right|}_{(I)}+\underbrace{\left|\int_{\bm{X}}(\mathbb{E}_{\mathcal{S}}[p_{\mathcal{S}}(\bm{X})]-p_{\mathcal{S}}(\bm{X}))\mathcal{L}(f(\bm{X}),y(\bm{X}))\,d\bm{X}\right|}_{(II)},

where $p_{\mathcal{D}}(\bm{X})$ is the density function of the distribution $\mathcal{D}$, and $p_{\mathcal{S}}(\bm{X})$ is the KDE evaluated at the point $\bm{X}$.

By the smoothness of the density function of $\mathcal{D}$ and Silverman (2018), we know that $(I)=O(\delta^{2})$.

For $(II)$, by using Chebyshev's inequality and Silverman (2018), with probability $1-\Delta$ we have

(II)=O\left(\Delta^{-\frac{1}{2}}N^{-\frac{1}{2}}\delta^{-\frac{D}{2}}+N^{-2}\right).

On the other hand, by Taylor expansion, we know

\mathcal{L}_{\text{adv}}(f)-\mathcal{L}_{\text{clean}}(f)\leq O(\delta)\cdot\mathbb{E}_{(\bm{X},y)\sim\mathcal{D}}\left[\max_{\|\bm{\xi}\|_{p}\leq\delta}\|\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}+\bm{\xi}),y)\|_{q}\right].
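In more detail (a standard mean-value argument, spelled out here for completeness): letting $\bm{\xi}^{*}$ attain the inner maximum in $\mathcal{L}_{\text{adv}}$, there exists $t\in(0,1)$ such that

\mathcal{L}(f(\bm{X}+\bm{\xi}^{*}),y)-\mathcal{L}(f(\bm{X}),y)=\left\langle\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}+t\bm{\xi}^{*}),y),\bm{\xi}^{*}\right\rangle\leq\delta\max_{\|\bm{\xi}\|_{p}\leq\delta}\left\|\nabla_{\bm{X}}\mathcal{L}(f(\bm{X}+\bm{\xi}),y)\right\|_{q},

where the inequality combines Hölder's inequality ($\frac{1}{p}+\frac{1}{q}=1$) with $\|\bm{\xi}^{*}\|_{p}\leq\delta$; taking the expectation over $(\bm{X},y)\sim\mathcal{D}$ gives the display above.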

Combining the bounds for $(I)$ and $(II)$ with this first-order bound, and taking the bandwidth/perturbation scale $\delta=\Theta(N^{-\frac{1}{D+2}})$ so that every term is $O(N^{-\frac{1}{D+2}})$ up to the global flatness factor, we derive Theorem B.1. ∎