Robust Network Learning via Inverse Scale Variational Sparsification
Abstract
While neural networks have made significant strides in many AI tasks, they remain vulnerable to a range of noise types, including natural corruptions, adversarial noise, and low-resolution artifacts. Many existing approaches focus on enhancing robustness against specific noise types, limiting their adaptability to others. Previous studies have addressed general robustness by adopting a spectral perspective, which tends to blur crucial features like texture and object contours. Our proposed solution, however, introduces an inverse scale variational sparsification framework within a time-continuous inverse scale space formulation. This framework progressively learns finer-scale features by discerning variational differences between pixels, ultimately preserving only large-scale features in the smoothed image. Unlike frequency-based methods, our approach not only removes noise by smoothing small-scale features where corruptions often occur but also retains high-contrast details such as textures and object contours. Moreover, our framework offers simplicity and efficiency in implementation. By integrating this algorithm into neural network training, we guide the model to prioritize learning large-scale features. We show the efficacy of our approach through enhanced robustness against various noise types.
1 Introduction
Despite the significant achievements of deep learning models in various imaging tasks [1, 2, 3, 4], they are vulnerable to different types of noise, including adversarial noise [5, 6], natural corruptions [7], and compression artifacts in low-resolution images [8, 9]. This vulnerability can lead to significant safety issues in practical applications, posing a major barrier to model deployment.
Many studies have aimed to enhance robustness against each individual type of noise [10, 11, 8, 12, 13, 14, 15]. Most of these rely on data augmentation, using methods tailored to specific types of noise. For example, some works [10, 11, 12] generated images with natural corruptions for training or self-supervised learning. Similarly, adversarial training [13, 14] involves adding small, often imperceptible, perturbations to create adversarial samples for training. Unfortunately, in real-world situations we cannot predict the types of noise we will face, so focusing on a single type is impractical and limits the usefulness of such robust networks. Therefore, we need a way to improve network robustness against all types of noise in a unified manner.
Several efforts have achieved general robustness by either learning domain-invariant representations for out-of-distribution generalization [16, 17] or by smoothing out high-frequency components [18, 19]. However, learning invariant representations requires data from multiple domains, which is often hard to obtain. Conversely, while frequency-based methods reduce high-frequency noise, they often blur important features like edges and textures, limiting their effectiveness.
On the other hand, some recent pilot studies aim to improve general robustness by focusing on learning large-scale information (or 'features'). For example, they emphasize larger gradient components of low-dimensional manifolds in natural images [20] and higher entropy of feature activations [21]. Aligning with this idea, we re-introduce the classical inverse scale space theory [22], a generalized form of Tikhonov-Morozov regularization, which iteratively refines a sequence of inverse scale-space variations of natural images, gradually evolving toward the noisy ones (this idea essentially resembles the diffusion model). This method effectively removes small-scale information, like subtle intensity variations and fine-grained patterns [23], and captures large-scale features, like general shapes, as shown in Fig. 1 (a).
Formally, we generalize the inverse scale space method, and propose a novel inverse scale variational sparsification method that can effectively smooth out noise components without blurring important features encoded in the image. To enforce variational sparsity, we leverage the Total-Variation (TV) regularization [24, 25] in a newly proposed ordinary differential equation in the inverse scale space [23, 22, 26]. Starting from a blank image without any information, this equation will generate a TV-regularized image path, where large-scale features are learned faster than small-scale ones. Therefore, with a proper early-stopping time, it can effectively eliminate small-scale features while preserving large-scale ones in the resulting regularized image, as illustrated in Fig. 1 (b). Furthermore, equipped with a simple discretization and an efficient sparse projection method, our dynamics can be easily implemented with an iterative algorithm, dubbed as Vision Robust Linearized Bregman Iteration (ViRoLBI) in this paper.

To further enhance the robustness, we introduce several training procedures to integrate ViRoLBI, including a fixed training procedure and an iterative training procedure. Specifically, the fixed training directly trains the model parameters on smoothed data with fixed sparsity, while the iterative training alternately runs the instance smoothing algorithm and optimizes the model parameters. Critically, we can also apply the above procedures to tune any trained model. To validate the effectiveness of the proposed pipeline, we conduct extensive experiments on various types of noise, including adversarial noise, low-resolution images, and natural corruption.
Our main contributions are summarized as follows.
• We introduce an image sparsification approach as a differential inclusion with Total Variation regularization, which can effectively remove small-scale features.
• We introduce a sparse projection technique to derive the Total Variation regularized image, enhancing the algorithm's implementation efficiency.
• We present several training procedures that seamlessly integrate our sparsification algorithm into the training process, which further enhances the robustness of visual models.
• Our model demonstrates promising results in robustness tasks, including noisy classification, adversarial defense, and low-resolution classification. Additionally, we employ visualization to illustrate our method's capacity to capture semantic features.
2 Related Work
Vision Robustness. The robustness issue in vision tasks has been intensively studied. Most of these works only focused on specific types of noise, mainly including natural corruption [28, 29], adversarial noise [30, 31], and low-resolution artifacts [8, 9]. For each specific type considered, these works exploited properties of the noise and tailored their methods accordingly to achieve robustness [32]. Specifically, since artifacts in natural corruption can be easily modeled, one could employ data augmentation [10, 11, 33] or self-supervised learning [12, 34] to improve robustness. The data augmentation method was applied to low-resolution images by generating resolution-degraded images with compression artifacts [8, 35]. Similarly, for adversarial noise, it is common to perform adversarial training [36, 13, 14, 37] using data generated by adversarial attacks, which differs from data augmentation in that it is an end-to-end optimization.
There are also some works focused on general robustness, by either learning domain-invariant representations [16, 17] or identifying only low-frequency features for prediction [18, 38, 39]. However, the representation learning methods commonly rely on un-pooled data from multiple domains, while smoothing high-frequency components induces compression artifacts akin to those in low-resolution images, thus losing important features such as the contour and content of the object [40, 41, 42]. In contrast, we provide a new perspective of inverse scale space, which can efficiently capture important, large-scale features while removing the small-scale ones where corruptions frequently occur, thereby achieving robustness uniformly against various types of noise.
Total Variation and Inverse Scale Space methods. Total Variation (TV), proposed by [24], has been successfully applied in various vision tasks including denoising [43, 44], deconvolution [45], deblurring [43], super-resolution [46], structure-texture decomposition [47], and segmentation [48]. Recently, [49] has shown the benefit that a TV optimization layer brings to deep learning models. On the other hand, the inverse scale space methods [23, 22] were first proposed for image denoising. Different from previous methods that first smooth small-scale features of a noisy image, these methods start from a blank image and progressively learn finer-scale information, until the clean image is recovered from the noisy one. Later, this property was integrated into an ordinary differential equation for sparse recovery [26, 50], as the continuous limit of the Linearized Bregman Iteration (LBI) [51] for image processing. Besides, these works established model selection consistency for this ODE, i.e., the important features are selected earlier than others.
In this paper, we explore the total variation sparsity in the inverse scale space, termed as inverse scale variational sparsification, such that the TV-regularized images, as solutions of a newly proposed differential equation at early iterations, are able to contain only large-scale information.
3 Methodology
Problem Setup. Given a clean dataset $\{(x^i, y^i)\}_{i=1}^n$, our goal is to learn a predictor $f: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ represents the image space and $\mathcal{Y}$ represents the label space, such that it can generalize well to new data that may be corrupted by various types of noise.
To achieve this goal, we present a unified framework to effectively extract large-scale features for training. In Sec. 3.1, we first introduce a differential equation induced by Total Variation regularization, as an image sparsification method to obtain, for each training image $x$, a smoothed image $\tilde{\beta}$ with only large-scale information. Then in Sec. 3.2, we introduce our training procedures for learning robust neural networks.
3.1 Inverse Scale Variational Sparsification
To smooth out small-scale features from an image $x \in \mathbb{R}^p$ ($p = h \times w$ denotes the size of the image vector, with $h$ and $w$ resp. denoting the height and width), we consider the following variational problem, known as the Rudin-Osher-Fatemi (ROF) functional [24, 52], which was proposed for image denoising:
$$\beta_\lambda := \arg\min_{\beta \in \mathbb{R}^p} \ \frac{1}{2}\|x - \beta\|_2^2 + \lambda\,\|\beta\|_{TV}, \tag{1}$$
where $\|\beta\|_{TV}$ denotes the bounded variation norm. In the image space $\mathbb{R}^p$, this norm can be expressed as $\|\beta\|_{TV} = \|D\beta\|_1$, where $D \in \mathbb{R}^{|E| \times p}$ is a graph difference matrix associated with a graph $G = (V, E)$, with $V$ the set of pixels and $E$ denoting the set of adjacent pixel pairs, such that $(D\beta)_{(i,j)} = \beta_i - \beta_j$ for each $(i,j) \in E$. Here, $\lambda$ denotes the scale parameter. A large scale parameter will smooth out those small-scale features, which refer to non-significant variational differences among adjacent pixels. By varying $\lambda$ from $0$ to $\infty$, Eq. (1) generates a smoothed image path $\{\beta_\lambda\}$, where $\beta_\lambda$ with larger $\lambda$ is more variationally sparse and, consequently, more small-scale features are smoothed out. However, the optimization to obtain such an image path is very expensive, especially when $p$ is large [53].
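For concreteness, the matrix $D$ for a 4-neighbor grid graph can be assembled in a few lines. The following is a minimal sketch (our own helper in Python/SciPy, not the paper's code), assuming row-major vectorization of the image:

```python
import numpy as np
import scipy.sparse as sp

def grid_difference_matrix(h: int, w: int) -> sp.csr_matrix:
    """Graph difference matrix D of shape (|E|, h*w) for the 4-neighbor grid:
    each row encodes one adjacent pixel pair (i, j) with entries +1 and -1."""
    idx = np.arange(h * w).reshape(h, w)
    pairs = np.concatenate([
        np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1),  # horizontal edges
        np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1),  # vertical edges
    ])
    m = len(pairs)
    rows = np.repeat(np.arange(m), 2)
    cols = pairs.ravel()
    vals = np.tile([1.0, -1.0], m)
    return sp.csr_matrix((vals, (rows, cols)), shape=(m, h * w))
```

For a grayscale image `x` of shape `(h, w)`, `np.abs(grid_difference_matrix(h, w) @ x.ravel()).sum()` then evaluates the TV term $\|D\beta\|_1$ in Eq. (1).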
To improve the efficiency, we introduce the variable splitting scheme [54], which leads to the following objective:
$$\ell(\beta, \gamma) := \frac{1}{2}\|x - \beta\|_2^2 + \frac{1}{2\nu}\|D\beta - \gamma\|_2^2. \tag{2}$$
Here, $\gamma$ is an augmented parameter constrained to exhibit sparsity and proximity to $D\beta$, where the latter constraint is achieved through the term $\frac{1}{2\nu}\|D\beta - \gamma\|_2^2$. To enforce sparsity on $\gamma$, we consider the following dynamics:
$$\dot{\beta}_t = -\nabla_\beta\, \ell(\beta_t, \gamma_t), \tag{3a}$$
$$\dot{\rho}_t = -\nabla_\gamma\, \ell(\beta_t, \gamma_t), \tag{3b}$$
$$\rho_t \in \partial\|\gamma_t\|_1. \tag{3c}$$
Remark 1.
When the data-fidelity term becomes the squared loss of a linear model ($\frac{1}{2n}\|y - X\beta\|_2^2$, where $y$ and $X$ denote the response vector and covariate matrix), our dynamics in Eq. (3) degenerates to the Split Bregman Inverse Scale Space [55] that was proposed for signal recovery from noisy measurements. In contrast, our goal is to smooth small-scale information from a clean image.
Starting from $\beta_0 = 0$ and $\rho_0 = \gamma_0 = 0$, such a differential equation generates a regularization solution path as $t$ increases, where $\gamma_t$ changes from sparse to dense, leading to a smoothed image flow starting from a blank image and evolving toward the original image $x$. Therefore, $t$ plays a similar role to $1/\lambda$, and hence is called the inverse scale space parameter. That means the image at a small $t$ will preserve only large-scale features while smoothing out small-scale ones.
Remark 2.
To understand in detail why $t$ is called the inverse scale space parameter, we consider the following objective with the scale space parameter $\lambda$:
$$(\beta_\lambda, \gamma_\lambda) := \arg\min_{\beta, \gamma}\ \ell(\beta, \gamma) + \lambda\|\gamma\|_1. \tag{4}$$
As the scale parameter $\lambda$ decreases from $\infty$ to $0$, $\gamma_\lambda$ in Eq. (4) goes from sparse to dense, with $\beta_\lambda$ and $\gamma_\lambda$ approaching the minimizers $x$ and $Dx$ of $\ell$ in Eq. (2). During this process, the solution path will progressively learn finer-scale features as $\lambda$ decreases, which is similar to the behavior of $\gamma_t$ in Eq. (3) as $t$ increases. Therefore, $t$ plays a similar role to $1/\lambda$, hence is called the inverse scale space parameter.
To explain, we note from Eq. (3b) that $\rho_t$ follows a gradient descent flow, starting from $\rho_0 = 0$. This implies that $\rho_t \in [-1, 1]^{|E|}$, according to the definition of the subgradient in Eq. (3c). As $t$ grows, more elements of $\rho_t$ tend to hit the boundary of $[-1, 1]$, making the corresponding elements of $\gamma_t$ non-zero according to Eq. (3c). Because $\gamma_t$ is sparse at each $t$, we can obtain a variationally sparse image $\tilde{\beta}_t$ by projecting $\beta_t$ onto the subspace spanned by the support set $S_t := \mathrm{supp}(\gamma_t)$ of $\gamma_t$. After projection, $(D\tilde{\beta}_t)_{S_t^c} = 0$, indicating that $\tilde{\beta}_t$ smooths out those non-significant variational differences outside $S_t$. Consequently, as $t$ increases, $S_t$ expands, enabling $\tilde{\beta}_t$ to learn finer-scale features. Therefore, we need a proper time $t$ to stop the dynamics, ensuring that small-scale features are removed in $\tilde{\beta}_t$.
Remark 3.
The early stopping time $t$ is a trade-off between robustness and accuracy in the standard setting. With access to noisy data, we can determine $t$ through cross-validation based on the reconstruction error. In a general setting where noisy data are not accessible, we determine it according to the sparsity level of the smoothed image. Empirically, we observe that our algorithm consistently achieves robustness across a diverse range of sparsity levels, provided that the level remains below 0.9.
Discretization for Implementation. We follow [51, 55] to propose an iterative form of Eq. (3), by introducing an elastic net penalty $\|\gamma\|_1 + \frac{1}{2\kappa}\|\gamma\|_2^2$ to approximate the original $\ell_1$ penalty, with $\kappa$ set large enough. Letting $z_k := \rho_k + \gamma_k / \kappa$, we have the following iteration:
$$\beta_{k+1} = \beta_k - \kappa\,\alpha\,\nabla_\beta\, \ell(\beta_k, \gamma_k), \tag{5a}$$
$$z_{k+1} = z_k - \alpha\,\nabla_\gamma\, \ell(\beta_k, \gamma_k), \tag{5b}$$
$$\gamma_{k+1} = \kappa \cdot \mathcal{S}(z_{k+1}, 1), \tag{5c}$$
where $\mathcal{S}(z, 1) := \mathrm{sign}(z) \cdot \max(|z| - 1, 0)$ gives an explicit form of $\gamma_{k+1}$ from $z_{k+1}$. Here, $\alpha$ is the step size approximating the gradient flow. If $\kappa \to \infty$ and $\alpha \to 0$, the above iteration converges to the original dynamics in Eq. (3). Besides, $\alpha$ should satisfy $\alpha\,\kappa\,\|H\|_2 \le 2$, with $H$ denoting the Hessian of $\ell$, in order to ensure that the loss decreases as the iteration proceeds. We term this iteration Vision Robust Linearized Bregman Iteration (ViRoLBI). Compared to running Eq. (1) for several values of $\lambda$, this iteration can obtain a whole smoothed image path with a single run of Eq. (5), as sketched below.
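To make the iteration concrete, below is a minimal sketch of Eq. (5) for the split objective of Eq. (2), with our own variable names; the hyperparameter values are illustrative placeholders, not the paper's settings (the paper's step size follows the condition above):

```python
import numpy as np

def shrink(z):
    """Soft-thresholding S(z, 1), the explicit map of Eq. (5c)."""
    return np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)

def virolbi(x, D, kappa=10.0, nu=1.0, alpha=1e-3,
            stop_sparsity=0.6, max_iter=50_000):
    """Run Eq. (5) from a blank image until gamma reaches a target sparsity."""
    p, m = x.size, D.shape[0]
    beta, z, gamma = np.zeros(p), np.zeros(m), np.zeros(m)
    for _ in range(max_iter):
        resid = D @ beta - gamma
        beta = beta - kappa * alpha * ((beta - x) + D.T @ resid / nu)  # Eq. (5a)
        z = z + alpha * resid / nu                                     # Eq. (5b)
        gamma = kappa * shrink(z)                                      # Eq. (5c)
        if np.mean(gamma != 0) >= stop_sparsity:  # early stopping in scale
            break
    return beta, gamma
```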
To obtain a smoothed image at each iteration $k$, we project $\beta_k$ onto the subspace induced by the support set $S := \mathrm{supp}(\gamma_k)$ of $\gamma_k$, such that $(D\tilde{\beta}_k)_{S^c} = 0$:
$$\tilde{\beta}_k := \arg\min_{\beta}\ \frac{1}{2}\|\beta - \beta_k\|_2^2 \quad \text{s.t.} \quad D_{S^c}\,\beta = 0. \tag{6}$$
We can obtain the closed-form solution of Eq. (6) as $\tilde{\beta}_k = (I - D_{S^c}^{\dagger} D_{S^c})\,\beta_k$, where for a general matrix $A$ we denote $A_S$ as the sub-matrix of $A$ with rows indexed by $S$, and $A^{\dagger}$ as the pseudo-inverse of $A$. However, this computation is expensive, as the complexity of the pseudo-inverse is at the scale of $O(p^3)$, which can be significantly higher than the gradient descent step's cost of $O(p)$, especially when $\gamma_k$ is sparse in the early iterations. To accelerate, we propose an efficient projection algorithm that exploits the graph structure of $D$. Specifically, note that $D_{S^c}$ corresponds to the sub-graph $G_{S^c} = (V, E_{S^c})$ with $E_{S^c} := \{(i,j) \in E : \gamma_{k,(i,j)} = 0\}$, such that the constraint $D_{S^c}\beta = 0$ enforces $\beta_i = \beta_j$ for every $(i,j) \in E_{S^c}$.
That means we have $(\tilde{\beta}_k)_i = (\tilde{\beta}_k)_j$ as long as pixels $i$ and $j$ are connected in $G_{S^c}$. We then propose to identify all connected components, since $\tilde{\beta}_k$ shares the same value for all elements within each component. To minimize Eq. (6), this shared value should be the average of $\beta_k$ over the elements in that component. Since the complexity of finding the connected components of a $p$-node graph is $O(p)$, this projection has the same complexity as the gradient descent step, as summarized below.
Proposition 1.
Given $\beta_k$ and $S = \mathrm{supp}(\gamma_k)$, if $G_{S^c}$ has connected components $C_1, \ldots, C_m$ such that $V = \bigcup_{l=1}^m C_l$, then $\tilde{\beta}_k$ in Eq. (6) can be determined as follows, with a complexity of $O(p)$:
$$(\tilde{\beta}_k)_i = \frac{1}{|C_l|} \sum_{j \in C_l} (\beta_k)_j, \quad \text{for each } i \in C_l,\ l = 1, \ldots, m.$$
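A sketch of this projection using SciPy's connected-components routine is given below; `pairs` is the edge list aligned with the rows of $D$ (as in the earlier sketch) and `support` is a boolean mask over edges marking the nonzero entries of $\gamma_k$, both names being our own:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

def project_onto_support(beta, pairs, support, p):
    """Implement Proposition 1: average beta over the connected components of
    the sub-graph G_{S^c} whose edges carry zero entries of gamma."""
    keep = pairs[~support]  # edges (i, j) with gamma_(i,j) = 0
    adj = sp.coo_matrix((np.ones(len(keep)), (keep[:, 0], keep[:, 1])),
                        shape=(p, p))
    _, labels = connected_components(adj, directed=False)
    sums = np.bincount(labels, weights=beta)
    counts = np.bincount(labels)
    return (sums / counts)[labels]  # each pixel takes its component's mean
```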
Extension to colored images via group sparsity. For a colored image, we have $x \in \mathbb{R}^{p \times 3}$. This means each pixel is a 3-d vector in the RGB channels. Correspondingly, we enforce group sparsity on $\gamma \in \mathbb{R}^{|E| \times 3}$, where each group corresponds to a vector $\gamma^{(i,j)} \in \mathbb{R}^3$:
$$\|\gamma\|_{1,2} := \sum_{(i,j) \in E} \|\gamma^{(i,j)}\|_2. \tag{7}$$
Letting $z_k := \rho_k + \gamma_k / \kappa$ as before, we obtain $\gamma_{k+1}$ from $z_{k+1}$ as follows:
$$\gamma_{k+1}^{(i,j)} = \kappa \cdot \max\!\left(0,\ 1 - \frac{1}{\|z_{k+1}^{(i,j)}\|_2}\right) z_{k+1}^{(i,j)}, \quad (i,j) \in E, \tag{8}$$
which can replace Eq. (5c) to generate the smoothed image path for colored images; a sketch follows.
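In code, the group shrinkage of Eq. (8) is a one-line change from the entry-wise shrinkage (a sketch with our naming, where `z` stacks one RGB difference vector per edge):

```python
import numpy as np

def group_shrink(z, kappa=10.0):
    """Group soft-thresholding of Eq. (8); z has shape (|E|, 3)."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - 1.0 / np.maximum(norms, 1e-12))
    return kappa * scale * z  # zeroes whole groups whose norm is below 1
```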
3.2 Robust Network Learning
We introduce two training strategies to learn robust neural networks: fixed training and iterative training. Let $f_\theta$ denote the neural network parameterized by $\theta$, which is typically trained via Empirical Risk Minimization (ERM) with a loss $\mathcal{L}$.
Fixed training. We directly train the model on the smoothed images obtained when the sparsity level of $\gamma_k$ (i.e., the proportion of non-zero elements of $\gamma_k$) reaches a fixed value, e.g., 0.6. Since only large-scale information is preserved in the data, we expect the trained network to smooth out small-scale features.
Iterative training. We train the network parameter $\theta$ and run ViRoLBI in an alternating manner, as shown below:
$$(\beta_{k+1}^i,\, z_{k+1}^i,\, \gamma_{k+1}^i) \leftarrow \text{one step of Eq. (5) on each training image } x^i, \tag{9a}$$
$$\tilde{\beta}_{k+1}^i = \mathrm{Proj}_{S_{k+1}^i}\big(\beta_{k+1}^i\big) \text{ via Eq. (6)}, \quad i = 1, \ldots, n, \tag{9b}$$
$$\theta_{k+1} = \theta_k - \eta\, \nabla_\theta\, \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(f_{\theta_k}(\tilde{\beta}_{k+1}^i),\, y^i\big), \tag{9c}$$
where Eq. (9c) can be implemented with other optimizers such as SGD or Adam. As the iteration proceeds, the network will first learn large-scale features, followed by small-scale ones. Instead of training a network for each scale level of the data as in fixed training, we can efficiently obtain a family of neural networks that progressively learn finer-scale features. If we stop at a proper iteration, the network will also learn only large-scale features. Empirically, both training procedures perform well on various types of noise, including natural corruption, adversarial noise, and low-resolution images. A schematic of the alternation is sketched below.
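The following schematic sketches the alternation of Eq. (9); `virolbi_step`, `project`, and `support_fraction` are hypothetical wrappers around Eqs. (5) and (6), and the full-batch update is for clarity only:

```python
import torch
import torch.nn.functional as F

def iterative_training(model, optimizer, data, smoothers, stop_sparsity=0.6):
    """data: list of (image, label); smoothers[i] holds (beta, z, gamma) for image i."""
    while True:
        smoothed, levels = [], []
        for i, (x, _) in enumerate(data):
            smoothers[i] = virolbi_step(x, smoothers[i])  # Eq. (9a): one LBI step
            smoothed.append(project(smoothers[i]))        # Eq. (9b): Prop. 1 projection
            levels.append(support_fraction(smoothers[i]))
        if sum(levels) / len(levels) >= stop_sparsity:    # stop in inverse scale
            break
        inputs = torch.stack(smoothed)
        labels = torch.tensor([y for _, y in data])
        loss = F.cross_entropy(model(inputs), labels)     # Eq. (9c): SGD/Adam step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```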
During testing, we can implement ViRoLBI to preprocess test data at a chosen sparsity level, in order to align the distribution to the training data.
4 Experiments
Applications. By effectively separating large-scale structural information from intricate details, our approach shows promise in enhancing robustness and explainability. Empirically, we evaluate scenarios with natural corruptions, adversarial attacks, and low-resolution images. In addition, we show the potential of our framework against high-frequency perturbations. Visualization results are also provided.
Datasets. In order to demonstrate the generalization and scalability of our method, we adopt extensive benchmarks including CIFAR10 [56], CIFAR100, and ImageNet100 [57, 58, 59]. Their noisy variants, which contain different kinds of noisy and corrupted images, are built using the same methods as in [60] for robustness testing.
We offer visualizations of the regularized image paths of instances from ImageNet [61] and the COCO dataset [62] in Appx. F due to the space limit.
Backbone. We use ResNet18 for CIFAR10 and CIFAR100 and ResNet50 [63] for ImageNet100 in our experiments. To further demonstrate the impact on the transformer model, we leverage the ViT-tiny [3] model for all datasets.
Competitors. (1) Vanilla. We train models on the original clean images from each dataset. (2) Blur. We train models on images preprocessed by Gaussian blur. For images from CIFAR10 and CIFAR100, we use a kernel size of 3 and strength of 2; for images from ImageNet100, a kernel size of 7 and strength of 2 (see the sketch below). (3) TV layer. We train models following [49]; this method is only applied to ResNet. (4) Fix. We train models on a fixed set of smoothed images generated by our method. (5) Iterative. We train models following the strategy in Eq. (9). Specifically, for ViT-tiny and ResNet50, we fine-tune them from pretrained weights.
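For reference, the Blur baseline's preprocessing can be written with torchvision, interpreting "strength" as the Gaussian sigma (our assumption):

```python
from torchvision import transforms

blur_cifar = transforms.GaussianBlur(kernel_size=3, sigma=2)     # CIFAR10/100
blur_imagenet = transforms.GaussianBlur(kernel_size=7, sigma=2)  # ImageNet100
```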
Selection of Early Stopping Time. An essential component of our method is selecting the early stopping time based on the sparsity level, which increases over time. This selection is highly dependent on the image size. Empirically, for Fixed and Iterative training, we stop at a sparsity level of 0.6 for images from CIFAR10 and CIFAR100, and at a level of 0.3 for images from ImageNet100, as larger image sizes are more susceptible to natural corruptions. We further discuss it in Appx. C.
4.1 Robustness Against Natural Corruptions
Table 1: Classification accuracy (%) on noisy test data. For each noise type, the first row ("w/o") reports accuracy without test-time preprocessing and the second ("w/") with preprocessing by our smoothing algorithm.

| Model | Noise | Prep. | CIFAR10: Vanilla | Blur | TV layer | Fix | Iterative | CIFAR100: Vanilla | Blur | Fix | Iterative | IN100: Vanilla | Fix | Iterative |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet | Gaussian | w/o | 45.90 | 11.57 | 49.97 | 23.91 | 33.20 | 7.35 | 1.12 | 1.56 | 3.05 | 40.31 | 35.72 | 39.96 |
| ResNet | Gaussian | w/ | 72.57 | 61.91 | 76.15 | 75.34 | 78.46 | 18.83 | 10.30 | 22.28 | 17.01 | 45.33 | 45.39 | 48.15 |
| ResNet | Shot | w/o | 59.08 | 13.48 | 62.14 | 33.41 | 42.39 | 6.65 | 1.11 | 1.56 | 2.59 | 36.51 | 32.10 | 37.36 |
| ResNet | Shot | w/ | 76.34 | 70.14 | 78.39 | 83.93 | 83.69 | 15.28 | 9.31 | 25.11 | 28.72 | 37.82 | 33.67 | 39.26 |
| ResNet | Impulse | w/o | 51.43 | 14.16 | 58.75 | 34.08 | 35.62 | 10.06 | 1.15 | 2.48 | 3.22 | 31.54 | 26.70 | 34.28 |
| ResNet | Impulse | w/ | 51.30 | 56.18 | 59.15 | 69.51 | 71.86 | 10.06 | 9.04 | 20.87 | 25.62 | 34.26 | 30.21 | 39.53 |
| ViT | Gaussian | w/o | 58.75 | 28.32 | - | 48.25 | 50.41 | 6.58 | 2.07 | 5.43 | 4.40 | 35.42 | 40.87 | 42.29 |
| ViT | Gaussian | w/ | 67.58 | 68.59 | - | 80.19 | 81.07 | 13.23 | 15.09 | 23.68 | 23.74 | 48.17 | 52.94 | 48.39 |
| ViT | Shot | w/o | 68.04 | 37.77 | - | 60.60 | 58.66 | 5.73 | 2.14 | 5.02 | 3.95 | 31.77 | 38.84 | 38.76 |
| ViT | Shot | w/ | 71.29 | 73.90 | - | 82.11 | 82.66 | 11.06 | 13.20 | 21.66 | 21.38 | 36.64 | 44.38 | 40.35 |
| ViT | Impulse | w/o | 61.75 | 61.95 | - | 69.02 | 59.16 | 11.87 | 6.02 | 10.28 | 8.50 | 30.44 | 37.48 | 38.79 |
| ViT | Impulse | w/ | 69.73 | 64.74 | - | 72.96 | 71.82 | 13.24 | 12.54 | 18.89 | 19.76 | 35.57 | 45.23 | 44.09 |
We consider noisy images from CIFAR10-C, CIFAR100-C and ImageNet100-C [60], specifically with Gaussian noise, shot noise, and impulse noise. For each noise type, we consider 5 severity levels for CIFAR10 and CIFAR100 and 2 severity levels for ImageNet100, reporting the average accuracy. To demonstrate the efficacy of our proposed method on noisy images, we compare our model with the vanilla model, Blur, and TV layer on CIFAR10-C and CIFAR100-C, and with the vanilla method on ImageNet100-C to demonstrate the potential of our method on large-scale datasets. In the test stage, we consider two scenarios for all methods: with and without preprocessing (test images are preprocessed via our instance smoothing algorithm in Eq. (5) at sparsity 0.6 for CIFAR10 and CIFAR100 and 0.3 for ImageNet100). For the Blur model, we blur the test image as preprocessing.
We present the classification accuracy in Tab. 1. Our models outperform other baselines across almost all noise types and datasets, even without the benefit of preprocessing. Additionally, our model achieves further improvement over others during the testing phase, where preprocessing removes small-scale information to enhance performance. Furthermore, when employed as a preprocessing technique, our sparsification framework significantly enhances the accuracy of nearly all models across various types of noisy data.
4.2 Robustness against Adversarial Attack
In this section, we show the robustness of our method against adversarial attacks. The attacked data are generated via the commonly-used FGSM [5] and PGD [64] attacks, whose details are given in Appx. G (a sketch of FGSM follows below). During the test stage, similar to Sec. 4.1, we smooth each test image at a sparsity level of 0.6 for CIFAR10 and CIFAR100, and 0.3 for ImageNet100.
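As a reference for the evaluation protocol, a standard FGSM [5] generator looks as follows (a sketch, not the authors' exact attack code; the valid pixel range is assumed to be [0, 1]):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step attack: perturb x along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()
```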
We report the accuracy at attack strengths $\epsilon = 8/255$ and $16/255$ on all datasets in Tab. 2, where $\epsilon$ stands for the attack strength on normalized images. Apart from these results, we additionally report the results of PNI [65] with ResNet18 as the backbone. We first note that for all methods, applying our variational sparsification framework to preprocess test data brings a significant robustness improvement, which suggests its utility in smoothing noise components. Besides, it is also interesting to see that all variants of our method outperform the Vanilla method by a large margin, which further demonstrates the utility of our robust learning framework.
Table 2: Classification accuracy (%) under adversarial attacks at strengths $\epsilon = 8/255$ and $16/255$. For each strength, the first row ("w/o") is without test-time preprocessing and the second ("w/") with it.

| Model | Strength | Prep. | CIFAR10: Vanilla | Blur | PNI | Fix | Iterative | CIFAR100: Vanilla | Blur | PNI | Fix | Iterative | IN100: Vanilla | Fix | Iterative |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet | 8/255 | w/o | 37.95 | 3.66 | 41.07 | 21.04 | 18.67 | 2.15 | 0.55 | 20.78 | 1.57 | 2.21 | 21.18 | 5.10 | 18.22 |
| ResNet | 8/255 | w/ | 49.45 | 60.22 | 51.21 | 62.72 | 60.39 | 14.61 | 36.57 | 23.06 | 36.73 | 33.06 | 24.41 | 44.82 | 32.00 |
| ResNet | 16/255 | w/o | 23.97 | 5.10 | 26.05 | 11.37 | 12.87 | 1.69 | 0.55 | 8.26 | 0.93 | 1.25 | 15.82 | 8.20 | 17.32 |
| ResNet | 16/255 | w/ | 42.27 | 25.91 | 43.35 | 48.51 | 45.51 | 9.36 | 12.73 | 13.39 | 23.16 | 19.78 | 21.32 | 27.30 | 22.34 |
| ViT | 8/255 | w/o | 16.63 | 14.62 | - | 24.30 | 12.76 | 6.74 | 3.46 | - | 5.18 | 3.57 | 4.64 | 1.14 | 0.82 |
| ViT | 8/255 | w/ | 40.81 | 51.95 | - | 64.33 | 58.34 | 12.29 | 33.18 | - | 37.42 | 32.22 | 13.46 | 33.34 | 11.72 |
| ViT | 16/255 | w/o | 16.48 | 12.58 | - | 23.20 | 16.96 | 6.18 | 2.26 | - | 4.76 | 3.09 | 3.48 | 0.48 | 0.38 |
| ViT | 16/255 | w/ | 31.10 | 37.44 | - | 51.63 | 45.27 | 9.75 | 20.40 | - | 26.65 | 21.56 | 8.06 | 20.54 | 6.10 |
4.3 Robustness against Low Resolution
To illustrate the robustness of our method against low-resolution data, we apply our method to the task of classifying low-resolution images. We first downsample the original images and then upsample to the original size via the nearest interpolation. The smaller intermediate size will result in a lower-resolution image.
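A sketch of this degradation on batched tensors (the downsampling mode is our assumption; the paper only specifies nearest interpolation for the upsampling):

```python
import torch.nn.functional as F

def degrade(x, scale=0.25):
    """Downsample by `scale`, then upsample back to the original size.
    x: tensor of shape (N, C, H, W)."""
    h, w = x.shape[-2:]
    small = F.interpolate(x, scale_factor=scale, mode='nearest')
    return F.interpolate(small, size=(h, w), mode='nearest')
```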
Tab. 3 presents the results of different models on various datasets for test data with different scale factors. As shown, all variants of our method outperform the vanilla model on lower-resolution images, especially the fixed training model. This result suggests the effectiveness of our sparsification framework in learning large-scale information during training, as low-resolution images smooth out details while maintaining the object's shape and contour.
Table 3: Classification accuracy (%) on low-resolution test data with different scale factors.

| Method | ResNet C10 1/4 | C10 1/2 | C100 1/4 | C100 1/2 | IN100 1/7 | IN100 1/4 | ViT C10 1/4 | C10 1/2 | C100 1/4 | C100 1/2 | IN100 1/7 | IN100 1/4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla | 23.33 | 38.50 | 5.25 | 19.48 | 10.16 | 28.60 | 16.54 | 42.27 | 3.83 | 17.43 | 6.53 | 24.18 |
| Blur | 29.69 | 33.03 | 6.43 | 8.30 | 8.68 | 16.82 | 25.32 | 62.76 | 4.82 | 20.75 | 10.94 | 21.36 |
| Fix | 30.00 | 41.12 | 7.97 | 14.53 | 13.24 | 29.53 | 24.52 | 64.64 | 6.90 | 25.58 | 12.28 | 25.76 |
| Iterative | 32.00 | 48.68 | 13.13 | 23.33 | 11.50 | 31.16 | 27.95 | 65.25 | 9.98 | 26.92 | 17.52 | 26.52 |
4.4 Extension to High-frequency Perturbations
To further demonstrate the capability of our method in defending against general noise, we apply it to high-frequency perturbed data, following the scenario of [41]. Specifically, we first decompose the images into low-frequency and high-frequency components, as shown in Fig. 2 (a) and sketched below, and then test the accuracy of the models on the high- and low-frequency components respectively.
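A sketch of this decomposition via a centered FFT and a disk mask of radius r (r is a hyperparameter we assume; [41] uses an analogous thresholding):

```python
import numpy as np

def freq_decompose(img, r=12):
    """Split a (H, W) image into low- and high-frequency spatial components."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= r ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * ~mask)).real
    return low, high
```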

As indicated in Tab. 4, our models perform better than the vanilla model on low-frequency components, suggesting the capability of our framework to smooth high-frequency information. As a further verification, we visualize the frequencies in the first layer's feature maps during training in Fig. 2 (b). As shown, the vanilla model (bottom) tends to learn high-frequency features, while our method first learns low-frequency features and then high-frequency ones during training. Apart from that, in Fig. 2 (c) we visualize the expected difference in the frequency domain as proposed in [39]. To be concrete, we calculate $\mathbb{E}\,|\mathcal{F}(x_1) - \mathcal{F}(x_2)|$ on CIFAR10, where $\mathcal{F}$ stands for the Fourier transformation and $x_1$, $x_2$ stand for different images. One can find that if we stop at a higher sparsity level, more high-frequency features are learned. With early stopping, the images contain more low-frequency features, which further supports the ability of our method to preserve low-frequency information while smoothing out high-frequency components.
Table 4: Classification accuracy (%) on the high- and low-frequency components of CIFAR10 and CIFAR100 test images.

| Model | Component | CIFAR10: Vanilla | Blur | Fix | Iterative | CIFAR100: Vanilla | Blur | Fix | Iterative |
|---|---|---|---|---|---|---|---|---|---|
| ResNet | High | 41.05 | 17.94 | 20.93 | 11.08 | 4.64 | 1.97 | 2.01 | 1.98 |
| ResNet | Low | 47.37 | 93.14 | 73.29 | 73.63 | 13.69 | 72.91 | 41.96 | 48.30 |
| ViT | High | 35.90 | 29.23 | 33.09 | 19.67 | 8.75 | 3.61 | 4.46 | 4.07 |
| ViT | Low | 84.04 | 93.11 | 82.96 | 84.97 | 25.41 | 76.88 | 54.61 | 57.94 |
4.5 Visualization
We visualize learned features via Layer-Wise Relevance Propagation (LRP) [66], which indicates the importance of features by backpropagating the relevance and overlaying the normalized results onto the input image. Appx. D presents more results implemented via Grad-CAM visualization [27].
The results on three image variants (original images, blurred images, and sparse images generated by our algorithm) are shown in Fig. 3, where brighter colors typically indicate higher importance.
It is evident that on the original image, all the models can capture the main object (the bird), but pixels in the background are also assigned high importance. After preprocessing with our method, all the models focus more on the main object without overly relying on the background texture.

5 Conclusions and Discussions
We propose a variational sparsification algorithm that exploits Total Variation sparsity in the inverse scale space. By employing early stopping, this method efficiently smooths out the small-scale features where noise typically occurs, while effectively preserving important high-contrast features. With discretization and sparse projection, it admits a simple iterative implementation. We demonstrate its utility in several robustness tasks.
Limitations. Although the complexity of sparse projection is comparable to gradient descent, its running time is significantly longer due to the current CPU-only implementation of the sparse projection algorithm. We believe our method can be applied to feature maps with TV regularization. Future work will explore this extension and optimize memory usage.
References
- [1] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. Advances in neural information processing systems, 25, 2012.
- [2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
- [3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [4] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- [5] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- [6] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. Advances in neural information processing systems, 32, 2019.
- [7] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018.
- [8] Adrian Bulat, Jing Yang, and Georgios Tzimiropoulos. To learn image super-resolution, use a gan to learn how to do image degradation first. In Proceedings of the European conference on computer vision (ECCV), pages 185–200, 2018.
- [9] Andreas Lugmayr, Martin Danelljan, and Radu Timofte. Unsupervised learning for real-world super-resolution. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3408–3416. IEEE, 2019.
- [10] Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2019.
- [11] Raphael Gontijo Lopes, Dong Yin, Ben Poole, Justin Gilmer, and Ekin D Cubuk. Improving robustness without sacrificing accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611, 2019.
- [12] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International conference on machine learning, pages 9229–9248. PMLR, 2020.
- [13] Chaithanya Kumar Mummadi, Thomas Brox, and Jan Hendrik Metzen. Defending against universal perturbations with shared adversarial training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4928–4937, 2019.
- [14] Nupur Kumari, Mayank Singh, Abhishek Sinha, Harshitha Machiraju, Balaji Krishnamurthy, and Vineeth N Balasubramanian. Harnessing the vulnerability of latent layers in adversarially trained models. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 2779–2785, 2019.
- [15] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
- [16] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European conference on computer vision (ECCV), pages 624–639, 2018.
- [17] Xinwei Sun, Botong Wu, Xiangyu Zheng, Chang Liu, Wei Chen, Tao Qin, and Tie-Yan Liu. Recovering latent causal factor for generalization to distributional shifts. Advances in Neural Information Processing Systems, 34:16846–16859, 2021.
- [18] Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, and Pinar Duygulu. Hybridaugment++: Unified frequency spectra perturbations for model robustness. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5718–5728, 2023.
- [19] Jiachen Sun, Akshay Mehra, Bhavya Kailkhura, Pin-Yu Chen, Dan Hendrycks, Jihun Hamm, and Z Morley Mao. A spectral view of randomized smoothing under common corruptions: Benchmarking and improving certified robustness. In European Conference on Computer Vision, pages 654–671. Springer, 2022.
- [20] Yueru Li, Shuyu Cheng, Hang Su, and Jun Zhu. Defense against adversarial attacks via controlling gradient leaking on embedded manifolds. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII 16, pages 753–769. Springer, 2020.
- [21] Pei Wang, Yijun Li, and Nuno Vasconcelos. Rethinking and improving the robustness of image style transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 124–133, 2021.
- [22] Otmar Scherzer and Chuck Groetsch. Inverse scale space theory for inverse problems. In International Conference on Scale-Space Theories in Computer Vision, pages 317–325. Springer, 2001.
- [23] Martin Burger, Stanley Osher, Jinjun Xu, and Guy Gilboa. Nonlinear inverse scale space methods for image restoration. In VLSM, volume 5, pages 25–36. Springer, 2005.
- [24] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
- [25] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40(1):120–145, 2011.
- [26] Stanley Osher, Feng Ruan, Jiechao Xiong, Yuan Yao, and Wotao Yin. Sparse recovery via differential inclusions. Applied and Computational Harmonic Analysis, 41(2):436–469, 2016. arXiv:1406.7728.
- [27] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
- [28] Mohammad Momeny, Ali Mohammad Latif, Mehdi Agha Sarram, Razieh Sheikhpour, and Yu Dong Zhang. A noise robust convolutional neural network for image classification. Results in Engineering, 10:100225, 2021.
- [29] Eric Mintun, Alexander Kirillov, and Saining Xie. On interaction between augmentations and corruptions in natural corruption robustness. Advances in Neural Information Processing Systems, 34:3571–3583, 2021.
- [30] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- [31] Joana C Costa, Tiago Roxo, Hugo Proença, and Pedro RM Inácio. How deep learning sees the world: A survey on adversarial attacks & defenses. IEEE Access, 2024.
- [32] Alfred Laugros. Synthetic, Adversarial and Natural Corruptions: Image Classifier Robustness Transfers From one Distribution Shift to an Other. PhD thesis, Université Grenoble Alpes [2020-….], 2022.
- [33] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021.
- [34] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2020.
- [35] Gwantae Kim, Jaihyun Park, Kanghyu Lee, Junyeop Lee, Jeongki Min, Bokyeung Lee, David K Han, and Hanseok Ko. Unsupervised real-world super resolution with cycle generative adversarial network and domain discriminator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 456–457, 2020.
- [36] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.
- [37] Eric Wong, Leslie Rice, and J Zico Kolter. Fast is better than free: Revisiting adversarial training. In International Conference on Learning Representations, 2019.
- [38] Kunyu Wang, Juluan Shi, and Wenxuan Wang. Lfaa: Crafting transferable targeted adversarial examples with low-frequency perturbations. arXiv preprint arXiv:2310.20175, 2023.
- [39] Dong Yin, Raphael Gontijo Lopes, Jon Shlens, Ekin Dogus Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. Advances in Neural Information Processing Systems, 32, 2019.
- [40] Shishira R Maiya, Max Ehrlich, Vatsal Agarwal, Ser-Nam Lim, Tom Goldstein, and Abhinav Shrivastava. A frequency perspective of adversarial robustness. arXiv preprint arXiv:2111.00861, 2021.
- [41] Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8684–8694, 2020.
- [42] Zhuang Zhang, Dejian Meng, Lijun Zhang, Wei Xiao, and Wei Tian. The range of harmful frequency for dnn corruption robustness. Neurocomputing, 481:294–309, 2022.
- [43] Amir Beck and Marc Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434, 2009.
- [44] Antonin Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision, 20(1):89–97, 2004.
- [45] T.F. Chan and Chiu-Kwong Wong. Total variation blind deconvolution. IEEE Transactions on Image Processing, 7(3):370–375, 1998.
- [46] Antonio Marquina and Stanley J Osher. Image super-resolution by tv-regularization and bregman iteration. Journal of Scientific Computing, 37:367–382, 2008.
- [47] Jean-François Aujol, Guy Gilboa, Tony Chan, and Stanley Osher. Structure-texture image decomposition—modeling, algorithms, and parameter selection. International journal of computer vision, 67:111–136, 2006.
- [48] Michael Donoser, Martin Urschler, Martin Hirzer, and Horst Bischof. Saliency driven total variation segmentation. In 2009 IEEE 12th International Conference on Computer Vision, pages 817–824, 2009.
- [49] Raymond A Yeh, Yuan-Ting Hu, Zhongzheng Ren, and Alexander G Schwing. Total variation optimization layers for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 711–721, 2022.
- [50] Chendi Huang, Xinwei Sun, Jiechao Xiong, and Yuan Yao. Split lbi: An iterative regularization path with structural sparsity. Advances In Neural Information Processing Systems, 29, 2016.
- [51] Wotao Yin, Stanley Osher, Jerome Darbon, and Donald Goldfarb. Bregman iterative algorithms for compressed sensing and related problems. SIAM Journal on Imaging sciences, 1(1):143–168, 2008.
- [52] Stanley Osher, Martin Burger, Donald Goldfarb, Jinjun Xu, and Wotao Yin. An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation, 4(2):460–489, 2005.
- [53] Tony F Chan, Gene H Golub, and Pep Mulet. A nonlinear primal-dual method for total variation-based image restoration. SIAM journal on scientific computing, 20(6):1964–1977, 1999.
- [54] Tom Goldstein and Stanley Osher. The split bregman method for l1-regularized problems. SIAM journal on imaging sciences, 2(2):323–343, 2009.
- [55] Chendi Huang, Xinwei Sun, Jiechao Xiong, and Yuan Yao. Boosting with structural sparsity: A differential inclusion approach. Applied and Computational Harmonic Analysis, 2018. arXiv preprint arXiv:1704.04833.
- [56] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. In Canadian Conference on Computer and Robot Vision, pages 1–6. IEEE, 2009.
- [57] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
- [58] Zelin Zang, Siyuan Li, Di Wu, Ge Wang, Kai Wang, Lei Shang, Baigui Sun, Hao Li, and Stan Z Li. Dlme: Deep local-flatness manifold embedding. In European Conference on Computer Vision, pages 576–592. Springer, 2022.
- [59] Krisna Pinasthika, Blessius Sheldo Putra Laksono, Riyandi Banovbi Putera Irsal, Novanto Yudistira, et al. Sparseswin: Swin transformer with sparse transformer block. Neurocomputing, page 127433, 2024.
- [60] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
- [61] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- [62] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- [63] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [64] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- [65] Zhezhi He, Adnan Siraj Rakin, and Deliang Fan. Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 588–597, 2019.
- [66] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
- [67] Alessandro Lulli, Emanuele Carlini, Patrizio Dazzi, Claudio Lucchese, and Laura Ricci. Fast connected components computation in large graphs by vertex pruning. IEEE Transactions on Parallel and Distributed systems, 28(3):760–773, 2016.
- [68] Lang Huang, Chao Zhang, and Hongyang Zhang. Self-adaptive training: beyond empirical risk minimization. Advances in neural information processing systems, 33:19365–19376, 2020.
- [69] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in neural information processing systems, 33:2958–2969, 2020.
Appendix A Proof of Proposition 1
Notations. We project $\beta$ onto the sparse subspace induced by $\gamma$, i.e., $\{\beta : D_{S^c}\beta = 0\}$, where $S := \mathrm{supp}(\gamma)$ indexes the nonzero elements of $\gamma$. We denote $D_{S^c}$ as the sub-matrix of $D$ with rows indexed by $S^c$. Specifically, $D_{S^c}$ is the graph difference matrix of the sub-graph $G_{S^c} = (V, E_{S^c})$, such that $(D_{S^c}\beta)_{(i,j)} = \beta_i - \beta_j$ if $(i,j) \in E_{S^c}$. Here, $E_{S^c}$ is the edge set of $G_{S^c}$, composed of adjacent pairs of pixels.
Proof of Prop. 1.
Suppose $G_{S^c}$ has connected components $C_1, \ldots, C_m$, such that $V = \bigcup_{l=1}^m C_l$. If two nodes $i$ and $j$ are in the same component, the corresponding elements of $\tilde{\beta}$ have the same value, i.e., $\tilde{\beta}_i = \tilde{\beta}_j$. Then for each component $C_l$, $\tilde{\beta}$ shares the same value over $C_l$. If we denote it as $c_l$, then the $c_l$ that minimizes $\sum_{i \in C_l} (\beta_i - c_l)^2$ equals the average of $\{\beta_i\}_{i \in C_l}$, i.e., $c_l = \frac{1}{|C_l|}\sum_{i \in C_l} \beta_i$. Using the connected-components algorithm proposed in [67], the decomposition into connected components costs $O(p)$. ∎

Remark 4.
After obtaining the connected components, we need to compute the average within each component, which has a complexity of $O(p)$ and is comparable to the gradient descent step. Since the complexity of the soft-thresholding in Eq. (5c) is also linear in the problem size, the overall complexity of our instance smoothing algorithm in Eq. (5) has the same order as gradient descent.
Appendix B Implementation Details
We use ResNet18 for CIFAR10 and CIFAR100 and ResNet50 [63] for ImageNet100. To further demonstrate the impact on transformer models, we leverage the ViT-tiny [3] model for all the datasets. Specifically, we use the pre-trained ResNet50 and ViT-tiny (the weights are available at https://huggingface.co/WinKawaks/vit-tiny-patch16-224) and further fine-tune them on our datasets with the fixed and iterative training strategies. For hyperparameters, the step size $\alpha$ is chosen to satisfy $\alpha\,\kappa\,\|H\|_2 \le 2$, where $H$ is the Hessian matrix of the loss function. We denote "Vanilla" for the vanilla model, "Fix" for the fixed training model, and "Iterative" for the iterative training model.
Appendix C Classification on clean data
In this experiment, we apply our method to the standard classification on CIFAR10 and ImageNet100 [61].
Experiment Setup. We adopt ResNet-18 as the backbone on CIFAR10 and ResNet-50 on ImageNet100. For fixed training, we try different sparsity levels for preprocessing images. For iterative training, we use the strategy in Eq. (9), with sparsity levels from 0.3 to 0.8.
Results. We report the classification accuracy on CIFAR10 in Tab. 5. As shown, both fixed and iterative training offer results comparable to the vanilla model once the sparsity reaches 0.6, suggesting that the loss of information is limited. Moreover, fixed training slightly outperforms the vanilla model at some sparsity levels. We also report the classification accuracy on ImageNet100 in Tab. 6.
Further Discussion of Sparsity Level. As shown in Tab. 5 and Tab. 6, images with a sparsity level of 0.6 or higher capture the important features for classification. Also, from the results in Secs. 4.1, 4.2, and 4.3, we can empirically notice that training with those images improves the robustness of the model.

When the images are processed with a sparsity level of 0.6, the fine-grained small-scale information has been eliminated while the structural information is kept. When the sparsity level is lower, only shape information (smooth information) is maintained, and some detailed semantic information is lost. However, when the sparsity level is close to 1.0, noise and confusing textures show up, which deteriorates robustness.
Table 5: Classification accuracy on clean CIFAR10 at different sparsity levels.

| Sparsity Level | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 (Vanilla) | Iterative |
|---|---|---|---|---|---|---|---|---|
| Accuracy | | | | | | | | |
Table 6: Classification accuracy on clean ImageNet100 at different sparsity levels.

| Sparsity Level | 0.4 | 0.6 | 0.8 | 1.0 (Vanilla) | Iterative |
|---|---|---|---|---|---|
| Accuracy | | | | | |
Appendix D More Visualization Results
D.1 Visualizations with Grad-CAM
In this experiment, we apply Grad-CAM [27] to visualize learned features during iterative training. We consider the model trained using the strategy in Eq. (9), with sparsity levels ranging from 0.3 to 0.8. As shown in Fig. 5, the features learned by our model in the early epochs are more concentrated on class-dependent regions (e.g., the cat's face in the top-left image and the dog's body in the bottom-left image). As the iteration proceeds, finer-scale information is learned; thus the feature map expands as the information becomes more complete. The larger saliency map of our model shows that it learns more shape information than the vanilla model.

Appendix E Running Time of the Algorithm
In this experiment, we compare the running time of our algorithm on gray-scale images from the miniImagenet dataset, run up to a sparsity level of 0.6. We consider the matrix factorization methods and our graph method for the sparse projection in Eq. (9b). We run this test on an NVIDIA Tesla V100 (32GB) and an Intel Gold 6240 CPU @ 2.60GHz.
Results. Other methods that can obtain the closed-form solution, such as singular value decomposition (SVD) or QR decomposition, suffer from high computational costs. Assume $p$ is the dimension of $\beta$. For example, the complexity of SVD is $O(p^3)$, which is much more expensive than gradient descent. In contrast, the complexity of the graph projection is only $O(p)$. To illustrate, we compare our graph projection method with these alternatives, as well as gradient descent, in terms of running time. We report the running time for 15,000 iterations on an 84x84 grayscale image in Tab. 7. As shown, our graph projection method is much more efficient than the others.
Table 7: Running time of different projection methods for 15,000 iterations on an 84x84 grayscale image.

| Projection Method | SVD | LSQR | Graph Algorithm |
|---|---|---|---|
| Running Time (s) | | | |
Appendix F Total Variation regularized Image Path
In this experiment, we choose more cases from ImageNet [61] and the COCO dataset [62], a multi-object image dataset, to visualize the regularized image path.
Results. As shown in Fig. 6, as the sparsity level increases, the image path first identifies large-scale structural information and then small-scale detailed information. Such large-scale information can refer to the object's shape or contour, as in the first three rows where the object as a whole has a convex and smooth boundary; in the last three rows, with irregular and complex contours, such structural information can refer to key parts of the object, e.g., the plow of a plow truck in the fifth row and the umbrellas in the last row.

As shown in Fig. 7, when our method is applied to multi-object images, the shapes of the objects pop out at the beginning of the image path, and more detailed texture is gradually added to the background and the objects.
[Figure 6: Total Variation regularized image paths of ImageNet instances at sparsity levels 0.2, 0.4, 0.6, 0.8, and 1.0.]
[Figure 7: Total Variation regularized image paths of multi-object COCO instances at sparsity levels 0.2, 0.4, 0.6, 0.8, and 1.0.]
Appendix G More Results of Adversarial Robustness
We additionally compare our method with several adversarial defense methods, including PNI [65], SAT [68], AWP [69], and the adversarial training results from [41], with ResNet18 as the backbone.
Appendix H Social Impact
The work in this paper makes networks more robust to various types of noise, and may thus improve the safety of devices such as self-driving cars. It therefore has the potential to benefit society.