
Towards Large Certified Radius in Randomized Smoothing using
Quasiconcave Optimization

Bo-Han Kung, Shang-Tse Chen
Abstract

Randomized smoothing is currently the state-of-the-art method that provides certified robustness for deep neural networks. However, due to its excessively conservative nature, this method of incomplete verification often cannot achieve an adequate certified radius on real-world datasets. One way to obtain a larger certified radius is to use an input-specific algorithm instead of using a fixed Gaussian filter for all data points. Several methods based on this idea have been proposed, but they either suffer from high computational costs or gain marginal improvement in certified radius. In this work, we show that by exploiting the quasiconvex problem structure, we can find the optimal certified radii for most data points with slight computational overhead. This observation leads to an efficient and effective input-specific randomized smoothing algorithm. We conduct extensive experiments and empirical analysis on CIFAR-10 and ImageNet. The results show that the proposed method significantly enhances the certified radii with low computational overhead.

1 Introduction

Although deep learning has achieved tremendous success in various fields (Wang, Bochkovskiy, and Liao 2022; Zhai et al. 2022), it is known to be vulnerable to adversarial attacks (Szegedy et al. 2013). This kind of attack crafts an imperceptible perturbation on images (Goodfellow, Shlens, and Szegedy 2014) or audio (Carlini and Wagner 2018) to make an AI system predict incorrectly. Many defense methods have been proposed to counter adversarial attacks; they can be categorized into empirical defenses and theoretical defenses. Common empirical defenses include adversarial training (Madry et al. 2017; Shafahi et al. 2019; Wong, Rice, and Kolter 2020) and preprocessing-based methods (Samangouei, Kabkab, and Chellappa 2018; Das et al. 2018). Though effective, empirical defenses cannot guarantee robustness.

Different from empirical defenses, theoretical defenses provide robustness verification with mathematical guarantees (Li, Xie, and Li 2023). The verification must be sound (no false positives), but it may be either complete (no false negatives) or incomplete. Theoretical defenses can further be categorized into deterministic and probabilistic verification. Deterministic verification methods, such as mixed-integer programming (MIP) (Tjeng, Xiao, and Tedrake 2018), interval bound propagation (IBP) (Ehlers 2017; Weng et al. 2018; Gowal et al. 2018; Mueller et al. 2022), and Lipschitz-bounded networks (Singla and Feizi 2021; Singla, Singla, and Feizi 2022; Xu, Li, and Li 2022), offer deterministic robustness guarantees for a DNN model. In contrast, probabilistic verification methods, such as randomized smoothing (Cohen, Rosenfeld, and Kolter 2019; Lecuyer et al. 2019; Yang et al. 2020), provide a provable defense ensuring that there are no adversarial examples within a ball of radius $R$. Notably, this guarantee is probabilistic, and the randomness is independent of the input images.

Among these methods, only randomized smoothing (RS) can scale to state-of-the-art deep neural networks and real-world datasets. Randomized smoothing first builds a smoothed classifier for a given data point via a Gaussian filter and Monte Carlo sampling, and then estimates a confidence lower bound on the probability of the highest-probability class. Finally, it determines a certified radius for that class and guarantees that there is no adversarial example within this radius.

Although randomized smoothing is effective, it suffers from two main disadvantages. First, randomized smoothing applies the same constant-variance Gaussian filter to every data point when constructing a smoothed classifier, which makes the certified radius dramatically underestimated. Second, randomized smoothing adopts a confidence lower bound (the Clopper-Pearson lower bound) to estimate the probability of the highest-probability class, which also limits the certified radius. As a result, a sharp drop often occurs at the end of the radius-accuracy curve (see the gray curve in the upper right of Fig. 1). This is called the truncation effect or waterfall effect (Súkeník, Kuvshinov, and Günnemann 2021) and reflects the conservative nature of randomized smoothing. Other issues such as fairness (Mohapatra et al. 2021), dimensionality (Kumar et al. 2020b), and time efficiency (Chen et al. 2022) also limit its applicability.

Figure 1: Overview of the proposed QCRS algorithm. QCRS finds a better sigma value for each smoothed classifier using quasiconcave optimization and thus provides larger certified radii than classical randomized smoothing. In this paper, we discuss the quasiconcavity of the certified radius with respect to $\sigma$, i.e., $R(\sigma)$.

To alleviate the truncation effect and improve the certified radii, a more precise workflow is necessary. Prior work (Chen et al. 2021; Alfarra et al. 2022) proposed input-specific methods that assign different Gaussian filters to different data points. These methods attempt to maximize the radius by finding the optimal variance $\sigma^{2}$ of the Gaussian filter. In this work, we first delve into randomized smoothing and discover a useful property of the sigma-radius curve: quasiconcavity. We then define a probabilistic quasiconcavity condition and develop a novel algorithm called Quasiconcavity-based Randomized Smoothing (QCRS) that optimizes the certified radius with respect to sigma. The overview of QCRS is illustrated in Fig. 1. QCRS significantly improves the certified radii with little computational overhead compared to existing methods (Chen et al. 2021; Alfarra et al. 2022), enjoying the advantages of both performance and time efficiency. The main technical contributions are summarized as follows:

  • We discover and empirically demonstrate that the sigma-radius curves are quasiconcave for most data points. In our experiments, approximately 99% of the data points satisfy our proposed quasiconcavity condition.

  • Based on the observed quasiconcavity property, we propose a novel and efficient input-specific algorithm called QCRS. This algorithm aims to enhance certified radii and alleviate the truncation effect in randomized smoothing.

  • Through extensive experiments, we demonstrate the effectiveness of our proposed QCRS method on the CIFAR-10 and ImageNet datasets. Furthermore, by combining QCRS with a training-based method, we achieve state-of-the-art certified radii.

2 Related Work

Randomized smoothing utilizes a spatial low-pass Gaussian filter to construct a smoothed model (Cohen, Rosenfeld, and Kolter 2019). Based on the Neyman-Pearson lemma, this smoothed model provides a provable radius $R$ that guarantees robustness on large-scale datasets. To improve randomized smoothing, some works (Yang et al. 2020; Zhang et al. 2020; Levine and Feizi 2021) proposed general methods using different smoothing distributions for different $\ell_{p}$ balls, while others provided better and tighter certification (Kumar et al. 2020a; Levine et al. 2020).

Improving RS during the training phase. To further enlarge the radius $R$, some works used training-based methods (Salman et al. 2019; Zhai et al. 2019; Jeong et al. 2021; Anderson and Sojoudi 2022). These models are specifically designed for randomized smoothing. For example, MACER (Zhai et al. 2019) makes the computation of the certified radius differentiable and adds it to the standard cross-entropy loss. Thus, the average certified radius of MACER outperforms that of the Gaussian-augmentation model used by the original randomized smoothing.

Improving RS during the inference phase. Different from training-based methods, some works utilize different smoothing strategies to enhance the certified region. Chen et al. (2021) proposed a multiple-start search algorithm to find the best parameter for building smoothed classifiers. Alfarra et al. (2022) adopted a memory-based approach to optimize the Gaussian filter of each input data point. Chen et al. (2022) proposed an input-specific sampling acceleration method that controls the number of samples and provides fast and effective certification. Li et al. (2022) proposed double sampling randomized smoothing, which utilizes additional smoothing information for tighter certification. These inference-time methods are the most relevant to our work; see Section 4.1 for a more detailed description.

3 Preliminaries

Let $x\in\mathbb{R}^{d}$ be a data point, where $d$ is the input dimension. $\mathcal{C}=\{1,2,\ldots,c\}$ is the set of classes. $F:\mathbb{R}^{d}\rightarrow\mathbb{R}^{c}$ is a general predictor such as a neural network. We define the base classifier as

f(x)=e_{\xi},\quad\xi=\operatorname*{arg\,max}_{j}F_{j}(x), \qquad (1)

where $e_{j}$ denotes a one-hot vector whose $j$-th component is 1 and all other components are 0. The smoothed classifier (Cohen, Rosenfeld, and Kolter 2019) $g:\mathbb{R}^{d}\rightarrow\mathcal{C}$ is defined as

g(x)=\operatorname*{arg\,max}_{c\in\mathcal{C}}\Pr[f(x+\epsilon)=e_{c}],\quad\epsilon\sim\mathcal{N}(0,\sigma^{2}I), \qquad (2)

where $\mathcal{N}$ is the Gaussian distribution and $\epsilon$ is a noise vector sampled from $\mathcal{N}$. Cohen, Rosenfeld, and Kolter (2019) proposed a provable method to calculate the certified radius as follows:

R=\frac{\sigma}{2}\cdot[\Phi^{-1}(\underline{p_{A}})-\Phi^{-1}(\overline{p_{B}})], \qquad (3)
p_{A}=\Pr[f(x+\epsilon)=e_{A}],\ \text{ and }\ p_{B}=\Pr[f(x+\epsilon)=e_{B}],

where $A$ is the highest-probability class of the smoothed classifier and $B$ is the runner-up class. $\underline{p_{A}}$ and $\overline{p_{B}}$ are the Clopper-Pearson lower/upper bounds of $p_{A}$ and $p_{B}$, which can be estimated by Monte Carlo (MC) sampling with confidence level $1-\alpha$. $R$ is the certified radius: the smoothed classifier predicts class $A$ for every point inside this radius, i.e., $g(x+\delta)=g(x)$ for all $\|\delta\|_{2}\leq R$. In practice, Cohen et al. replace $\overline{p_{B}}$ with $1-\underline{p_{A}}$, so Eq. 3 is usually reformulated as $R=\sigma\cdot\Phi^{-1}(\underline{p_{A}})$. If $\underline{p_{A}}<0.5$, no radius can be certified for this data point.
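To make this procedure concrete, the following is a minimal sketch of the certification step: it estimates $\underline{p_{A}}$ with a one-sided Clopper-Pearson bound and returns $R=\sigma\cdot\Phi^{-1}(\underline{p_{A}})$. It assumes a generic PyTorch base classifier and, for brevity, collapses the usual two-stage (selection plus estimation) sampling into a single sample set; the helper name and sampling counts are illustrative, not the original implementation.

import torch
from scipy.stats import norm, beta

def certify(base_model, x, sigma, n=100_000, alpha=0.001, batch=1_000):
    # Returns (predicted_class, certified_radius); radius is 0.0 when abstaining.
    base_model.eval()
    counts = None
    with torch.no_grad():
        remaining = n
        while remaining > 0:
            b = min(batch, remaining)
            noise = torch.randn(b, *x.shape, device=x.device) * sigma
            logits = base_model(x.unsqueeze(0) + noise)          # (b, num_classes)
            preds = logits.argmax(dim=1)
            c = torch.bincount(preds, minlength=logits.shape[1])
            counts = c if counts is None else counts + c
            remaining -= b
    top_class = int(counts.argmax())
    k = int(counts[top_class])
    # One-sided Clopper-Pearson lower bound on p_A with confidence 1 - alpha.
    p_a_lower = beta.ppf(alpha, k, n - k + 1) if k > 0 else 0.0
    if p_a_lower < 0.5:
        return top_class, 0.0                                     # cannot certify
    return top_class, sigma * norm.ppf(p_a_lower)                 # R = sigma * Phi^{-1}(p_A_lower)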

The smoothed classifier $g$ is constructed from the base classifier $f$ by adding perturbations $\epsilon$ to $x$. Therefore, $g$ can be regarded as a spatial smoothing of the original base classifier $f$ with a Gaussian kernel $\mathcal{G}$, i.e., $g=f\star\mathcal{G}$, where $\star$ is the convolution operator. From a signal-processing perspective, $g$ is the Weierstrass transform of $f$: the smoothed classifier is obtained by applying a low-pass filter $\mathcal{G}$ to $f$. Randomized smoothing constructs this smoothed classifier to provide a certifiable robustness guarantee.

4 QCRS Methodology

4.1 Observation and Motivation

As mentioned earlier, several existing methods attempt to address the truncation effect. Some focus on training the base model to enlarge certified radii, while others use a different Gaussian kernel $\mathcal{G}(\sigma)$ for each image to construct $g$. We follow the latter approach and propose an input-specific algorithm that finds the optimal $\mathcal{G}$ for most data points. Intuitively, for a data point $x$ of class $c_{A}$, if most neighboring points belong to the same class $c_{A}$, we can use a $\mathcal{G}$ with larger variance to convolve the data space of $f$. In contrast, if the neighborhood is full of samples from other classes, $\mathcal{G}$ needs a small variance to prevent misclassification. The proposed method draws inspiration from this concept and aims to optimize the selection of $\mathcal{G}(\sigma)$. Below, we discuss the input-specific search algorithms used in prior works, which inform the development of our approach.

DDRS (Alfarra et al. 2022) assumes that the sigma-radius curve, $R(\sigma)=\sigma\cdot\Phi^{-1}(\underline{p_{A}}(\sigma))$, is concave and uses gradient-based convex optimization, along with some relaxation and approximation, to find the optimal $\sigma$ value. However, in our experiments, almost no sigma-radius curve is concave. We select 200 images from the CIFAR-10 dataset and compute the certified radii with respect to $\sigma$ for each image. Among these 200 images, at least 189 do not satisfy concavity. Thus, gradient-based convex optimization may not work well for this task. Instead of relying on the concavity assumption, Insta-RS (Chen et al. 2021) uses a multi-start search algorithm to optimize $\sigma$, but the multi-start procedure incurs high computational overhead. In our work, we observe an intriguing quasiconcave property of the sigma-radius curves, which helps us develop an effective and efficient algorithm to optimize sigma. To assess the generality of the quasiconcave property, we conduct evaluations on two datasets and four distinct models. Table 1 illustrates the prevalence of concavity and quasiconcavity of the sigma-radius curves; quasiconcavity is significantly more common than concavity on real-world datasets. We discuss this table further in Subsection 4.3.

Dataset | CIFAR-10 | CIFAR-10 | ImageNet | ImageNet
Trained Models | $\sigma=0.12$ | $\sigma=0.25$ | $\sigma=0.25$ | $\sigma=0.50$
Concave | 16% | 6.71% | 0% | 0%
Quasiconcave | 97% | 98.17% | 92.95% | 90.47%
Table 1: We evaluate the generality of the concave and quasiconcave properties of the sigma-radius curve on two datasets and four trained models. A significant number of data points do not exhibit a concave sigma-radius curve, while over 90% of the data points exhibit a quasiconcave sigma-radius curve.

4.2 Quasiconcavity

This section lists the definition and two fundamental lemmas about quasiconcavity that will be used in designing our QCRS method. Quasiconcavity is a generalization of concavity, defined as follows:

Definition 1.

(Quasiconcavity (Boyd and Vandenberghe 2004)). A function $h:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is quasiconcave if $\operatorname{dom}h$ is convex and, for any $\theta\in[0,1]$ and $x,y\in\operatorname{dom}h$,

h(\theta x+(1-\theta)y)\geq\min\{h(x),h(y)\}.

Furthermore, a function $h$ is strictly quasiconcave if $\operatorname{dom}h$ is convex and, for any $x\neq y$ with $x,y\in\operatorname{dom}h$ and $\theta\in(0,1)$,

h(\theta x+(1-\theta)y)>\min\{h(x),h(y)\}.
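As a simple illustration (ours, not part of the original definition), consider the one-dimensional Gaussian bump

h(x)=e^{-x^{2}},\qquad h''(x)=(4x^{2}-2)\,e^{-x^{2}}>0\ \ \text{for }|x|>\tfrac{1}{\sqrt{2}},

so $h$ is not concave; yet $h$ is strictly increasing for $x<0$ and strictly decreasing for $x>0$, hence strictly quasiconcave. This shows that quasiconcavity is a strictly weaker requirement than concavity.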

In this paper, we mainly use strict quasiconcavity. Below, we list lemmas on strict quasiconcavity that we will use later.

Lemma 1.

If a function $h$ is strictly quasiconcave, then any locally optimal solution of $h$ is also globally optimal.

Lemma 2.

Suppose $h:\mathbb{R}\rightarrow\mathbb{R}$ is differentiable, and let $x^{*}$ be the optimal solution of $h$. The function $h$ is strictly quasiconcave if and only if the following two statements hold:

\nabla h(x)>0,\ \forall x\in(-\infty,x^{*})\quad\text{and}\quad\nabla h(x)<0,\ \forall x\in(x^{*},\infty).

We defer the proofs of these two lemmas to Appendix D.

4.3 Design

In this section, we establish the quasiconcavity of the sigma-radius curve $R(\sigma)$. First, we differentiate $R(\sigma)$:

\nabla_{\sigma}R(\sigma)=\frac{\partial R(\sigma)}{\partial\sigma}=\Phi^{-1}(\underline{p_{A}}(\sigma))+\sigma\cdot\frac{\partial\Phi^{-1}(\underline{p_{A}}(\sigma))}{\partial\sigma}

Assume that $\sigma^{*}$ exists, where $\sigma^{*}=\operatorname*{arg\,max}_{\sigma}R(\sigma)$. Then, according to Lemma 2, strict quasiconcavity holds if the gradient $\nabla_{\sigma}\Phi^{-1}$ obeys the following lower and upper bounds:

\begin{cases}\nabla_{\sigma}\Phi^{-1}=\dfrac{\partial\Phi^{-1}(\underline{p_{A}}(\sigma))}{\partial\sigma}>-\dfrac{\Phi^{-1}(\underline{p_{A}}(\sigma))}{\sigma}&\text{for }\sigma<\sigma^{*}\\[6pt]\nabla_{\sigma}\Phi^{-1}=\dfrac{\partial\Phi^{-1}(\underline{p_{A}}(\sigma))}{\partial\sigma}<-\dfrac{\Phi^{-1}(\underline{p_{A}}(\sigma))}{\sigma}&\text{for }\sigma>\sigma^{*}\end{cases}

Intuitively, this means that $\nabla_{\sigma}\Phi^{-1}$ has a negative lower bound when $\sigma<\sigma^{*}$ and a negative upper bound when $\sigma>\sigma^{*}$.

Computing a classifier's decision boundary is NP-hard, rendering the exact calculation of $\nabla_{\sigma}\Phi^{-1}$ impractical (Katz et al. 2017). Randomized smoothing employs probabilistic techniques, such as MC sampling, to address this issue. Similarly, instead of evaluating $\nabla_{\sigma}\Phi^{-1}$ exactly, we define a probabilistic condition based on Lemma 2 to characterize the quasiconcavity of $R(\sigma)$:

Definition 2.

(($\upsilon^{-},\upsilon^{+}$)-SQC condition). Given an optimal $\sigma^{*}$, we say the sigma-radius curve satisfies the ($\upsilon^{-},\upsilon^{+}$)-Strict Quasiconcave Condition (($\upsilon^{-},\upsilon^{+}$)-SQC condition) if, for any $\sigma\in\{\sigma\,|\,R(\sigma)>0\}$, $\nabla R(\sigma)$ satisfies the following:

\Pr_{\sigma<\sigma^{*}}[\nabla R(\sigma)>0]=\upsilon^{-},\qquad\Pr_{\sigma>\sigma^{*}}[\nabla R(\sigma)<0]=\upsilon^{+}.

Intuitively, if $\upsilon^{-}=\upsilon^{+}=1$, the condition states that the slope of the sigma-radius curve is positive on the left side of the optimal solution and negative on the right side. Note that this condition is weaker and more general than the concentration assumption used in (Li et al. 2022), which requires additional assumptions on the distribution of data points. It is also weaker than the concavity assumption used in DDRS (Alfarra et al. 2022). Since the ($\upsilon^{-},\upsilon^{+}$)-SQC condition is weak, we expect more data points to satisfy it. We therefore conduct experiments on CIFAR-10 and ImageNet, detailed in Appendix B. For each data point, we first employ grid search to determine the optimal sigma $\sigma^{*}$. Subsequently, we uniformly sample 20 distinct sigma values and evaluate the ($\upsilon^{-},\upsilon^{+}$)-SQC condition; for concavity, we evaluate the Hessian at these 20 points. The results are illustrated in Table 1. Taking the model with $\sigma=0.25$ on CIFAR-10 as an example, we draw the following conclusions: 1) At most 98.17% of the data points satisfy the (1,1)-SQC condition, while only at most 6.7% satisfy the concavity assumption. 2) For these 98.17% of data points, both $\upsilon^{-}$ and $\upsilon^{+}$ lie within the interval [0.8609, 1] at the 95% confidence level, determined by the Clopper-Pearson interval; that is, these data points at least satisfy the (0.86, 0.86)-SQC condition. Notably, in our experiments, the mean values of $\upsilon^{-}$ and $\upsilon^{+}$ are greater than 0.99. 3) If we set $\upsilon^{-}$ and $\upsilon^{+}$ to 0.99, we can expect 95% of the data points to reach the optimal sigma, since the proposed method requires five iterations to converge ($0.99^{5}\approx 0.95$). We discuss convergence in the next section.
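The following sketch illustrates how such a check can be carried out on a grid of sigma values: it estimates the optimal sigma by grid search and then computes the fraction of sampled points whose finite-difference gradient sign agrees with Lemma 2. The helper radius_fn and the toy curve in the usage comment are placeholders for the MC-based certification routine and the sampling scheme detailed in Appendix B.

import numpy as np

def sqc_fractions(radius_fn, sigmas):
    # Empirically estimate (v-, v+) for one data point.
    # radius_fn(sigma) -> certified radius R(sigma) (e.g., via MC sampling).
    # sigmas: sorted 1-D array of sigma values to probe.
    radii = np.array([radius_fn(s) for s in sigmas])
    certified = radii > 0
    if not certified.any():
        return None                        # point cannot be certified at any sigma
    best = int(np.argmax(radii))           # grid-search estimate of sigma*
    grads = np.gradient(radii, sigmas)     # finite-difference slope of the curve
    left = grads[:best][certified[:best]]
    right = grads[best + 1:][certified[best + 1:]]
    v_minus = float(np.mean(left > 0)) if left.size else 1.0
    v_plus = float(np.mean(right < 0)) if right.size else 1.0
    return v_minus, v_plus

# Example usage with a toy, unimodal radius curve (illustration only):
# sigmas = np.linspace(0.15, 0.5, 20)
# v_minus, v_plus = sqc_fractions(lambda s: max(0.0, 0.5 - 4.0 * (s - 0.3) ** 2), sigmas)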

Figure 2: The comparison between Cohen, grid search, and the proposed QCRS on the CIFAR-10 dataset. The models are trained by Gaussian augmentation with sigma (a) 0.12, (b) 0.25, and (c) 0.50. The proposed QCRS outperforms the baseline method and is very close to grid search. In addition, the truncation effect is visible on the curves of Cohen.

Now assume that a data point satisfies the ($\upsilon^{-},\upsilon^{+}$)-SQC condition with $\upsilon^{-}$ and $\upsilon^{+}$ close to one. According to Lemma 2, if we detect that the gradient at a point is positive, we can assert that the unique optimal sigma lies to its right. Based on these rules, we design a time-efficient algorithm, shown in Algorithm 1, that finds the optimal $\sigma$ via binary search; by Lemma 1, this sigma is the globally optimal solution. On the other hand, sigma values within the non-certified interval $\{\sigma\,|\,R(\sigma)=0\}$ cannot be the solution. The gradients $\nabla R(\sigma)$ are likely to be zero in this interval because the curve is a horizontal line with $R(\sigma)=0$ there, which leads to a vanishing-gradient issue in Algorithm 1. To circumvent this issue, we use a momentum term $M$ to guide the search direction. Algorithm 1 is guaranteed to find the same optimal solution as grid search if the curve satisfies the (1,1)-SQC condition. The time complexity is $N$ for grid search and $\log N$ for Algorithm 1, where $N$ is the number of points on the grid. Therefore, the proposed method is significantly faster than grid search, while both can reach the same optimal $\sigma$.

Algorithm 1 Our proposed QCRS method

Input: Search region $\sigma_{max}$ and $\sigma_{min}$; suboptimal interval $\varepsilon$; original sigma $\sigma_{0}$; gradient step $\tau$
Parameter: momentum $M\leftarrow 0$
Output: The optimal $\sigma$

1:  while $\sigma_{max}-\sigma_{min}>\varepsilon$ do
2:     $\sigma\leftarrow(\sigma_{min}+\sigma_{max})/2$
3:     Calculate the gradient:
4:        $\nabla_{\sigma}R(\sigma)\leftarrow R(\sigma+\tau)-R(\sigma-\tau)$
5:     if $\operatorname{sign}(\nabla_{\sigma}R(\sigma))>0$ then
6:        $\sigma_{min}\leftarrow\sigma$; $M\leftarrow 1$
7:     else if $\operatorname{sign}(\nabla_{\sigma}R(\sigma))<0$ then
8:        $\sigma_{max}\leftarrow\sigma$; $M\leftarrow -1$
9:     else
10:        if $M\geq 0$ then
11:           $\sigma_{max}\leftarrow\sigma$; $M\leftarrow -1$
12:        else
13:           $\sigma_{min}\leftarrow\sigma$; $M\leftarrow 1$
14:        end if
15:     end if
16:  end while
17:  $\hat{\sigma}\leftarrow(\sigma_{min}+\sigma_{max})/2$
18:  return $\sigma\leftarrow\operatorname*{arg\,max}_{\sigma\in\{\hat{\sigma},\sigma_{0}\}}R(\sigma)$

Prior work uses backpropagation to compute gradients, which is time-consuming, and the computed gradient is unstable due to MC sampling. Therefore, we compute the gradient with forward passes, taking the difference of two neighboring points, since we only care about the gradient sign rather than its exact value. In the last stage of Algorithm 1, we employ a rejection policy that compares the resulting $\sigma$ to the original $\sigma_{0}$ and returns the one yielding the larger radius.

The proposed method is time-efficient compared to Insta-RS (Chen et al. 2021) and DDRS (Alfarra et al. 2022). DDRS uses a low MC sampling number (one or eight) due to the expensive computation and may therefore obtain unstable gradients. To verify this, we analyze the gradient values under different MC sampling numbers; detailed results can be found in Appendix H. The results reveal that the gradient values vary dramatically when using low MC sampling numbers, so a low MC sampling number cannot accurately estimate the gradient, which affects gradient-based optimization. In contrast, the proposed QCRS only uses the gradient sign, which is more stable than the gradient value: we observe that the sign of the gradient is consistent and rarely changes once the MC sampling number exceeds 500.
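A compact sketch of this search procedure is given below. It mirrors the structure of Algorithm 1, using a sign-based forward difference and the momentum rule for the zero-gradient case. Here, certify_radius is a placeholder for the MC-based estimate of $R(\sigma)$ for the current input, and the default hyper-parameters are illustrative rather than the exact values used in our experiments.

def qcrs_search(certify_radius, sigma0, sigma_min=0.12, sigma_max=0.50,
                eps=0.006, tau=0.01):
    # Sketch of the QCRS binary search (Algorithm 1).
    # certify_radius(sigma) -> certified radius R(sigma) for the current input.
    momentum = 0
    lo, hi = sigma_min, sigma_max
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        # Forward-difference gradient; only its sign is used.
        grad = certify_radius(mid + tau) - certify_radius(mid - tau)
        if grad > 0:
            lo, momentum = mid, 1            # optimum lies to the right
        elif grad < 0:
            hi, momentum = mid, -1           # optimum lies to the left
        else:
            # Zero gradient (non-certified plateau): follow the momentum rule.
            if momentum >= 0:
                hi, momentum = mid, -1
            else:
                lo, momentum = mid, 1
    sigma_hat = (lo + hi) / 2.0
    # Rejection policy: never do worse than the original sigma.
    return max((sigma_hat, sigma0), key=certify_radius)

With the CIFAR-10 search region [0.12, 0.50] and eps = 0.006, this loop terminates after six bisection steps, consistent with the cost analysis in Section 5.3.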

4.4 Convergence Analysis

The proposed QCRS enjoys a convergence guarantee with the convergence rate analyzed below.

Theorem 1.

Suppose the (1,1)-SQC condition holds. Given hyper-parameters $\sigma_{min}$ and $\sigma_{max}$, let $\sigma_{t}$ be the $\sigma$ value after $t$ iterations of Algorithm 1. Algorithm 1 converges to the optimal $\sigma^{*}$ as follows:

\frac{\sigma_{max}-\sigma_{min}}{2^{t}}\geq|\sigma_{t}-\sigma^{*}|.

Theorem 1 means the convergence rate of Algorithm 1 is $\mathcal{O}((\frac{1}{2})^{t})$. We defer the proof to Appendix D.

On the other hand, gradient-descent-based methods require much stricter assumptions, such as $L$-smoothness and $\mu$-strong concavity, to guarantee convergence. When these conditions are satisfied, the convergence rate is as follows (Nesterov 2018):

|\sigma_{t}-\sigma^{*}|^{2}\leq\left(\frac{L-\mu}{L+\mu}\right)^{t-1}|\sigma_{1}-\sigma^{*}|^{2}.

The convergence rate is $\mathcal{O}((\frac{L-\mu}{L+\mu})^{t})$, where $L$ and $\mu$ depend on the data points. Note that empirically only at most 6.7% of data points satisfy the concavity assumption, and even fewer satisfy $L$-smoothness and $\mu$-strong concavity simultaneously. Thus, concave optimization methods cannot converge to the optimal sigma value for most data points.

ACR | $\sigma=0.12$ | $\sigma=0.25$ | $\sigma=0.50$ | Time Cost (Sec.)
Cohen | 0.270 | 0.429 | 0.538 | 6.50±0.021
DDRS (Alfarra et al. 2022) | 0.310 | 0.448 | 0.540 | 7.39±0.016
Grid Search | 0.378 | 0.492 | 0.633 | 155.80±0.50
QCRS | 0.400 | 0.509 | 0.658 | 6.96±0.017
Table 2: ACR and time cost for CIFAR-10.

5 Experimental Results

We evaluate the proposed QCRS and present experimental results on CIFAR-10 (Krizhevsky, Hinton et al. 2009) and ImageNet (Russakovsky et al. 2015). We also verify that QCRS can be combined with training-based techniques like MACER (Zhai et al. 2019) to produce state-of-the-art certification results. Detailed implementation information is available in Appendix E (code: https://github.com/ntuaislab/QCRS). We also present additional experiments in the Appendix, including an ablation study, error analysis, and more. Following (Zhai et al. 2019), we use the average certified radius (ACR) as the performance metric, defined as $ACR=\frac{1}{|\mathcal{D}_{test}|}\sum_{x\in\mathcal{D}_{test}}R(x,y;g)$, where $\mathcal{D}_{test}$ is the test dataset and $R(x,y;g)$ is the certified radius obtained by the smoothed classifier $g$.
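As a small illustration (a helper of our own, not from the released code), ACR can be computed from per-example certification results as follows, where misclassified or abstained examples contribute a radius of zero:

def average_certified_radius(results):
    # results: list of (correct: bool, radius: float), one entry per test example.
    # A radius contributes to ACR only when the smoothed prediction is correct.
    if not results:
        return 0.0
    return sum(r if correct else 0.0 for correct, r in results) / len(results)

# Example: three certified examples and one abstention/misclassification.
# acr = average_certified_radius([(True, 0.5), (True, 0.75), (False, 0.0), (True, 0.25)])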

5.1 CIFAR-10 and ImageNet

Certified Accuracy
Certified radius $R$ | 0.25 | 0.5 | 0.75 | 1.0 | 1.25 | 1.5 | 1.75 | 2.0 | 2.25
Cohen | 0.55 | 0.41 | 0.32 | 0.23 | 0.15 | 0.09 | 0.05 | 0.00 | 0.00
DDRS (Alfarra et al. 2022) | 0.56 | 0.44 | 0.34 | 0.23 | 0.15 | 0.08 | 0.04 | 0.00 | 0.00
DSRS (Li et al. 2022) | 0.55 | 0.42 | 0.30 | 0.19 | 0.09 | 0.04 | 0.00 | 0.00 | 0.00
Grid Search (24 points) | 0.58 | 0.51 | 0.42 | 0.30 | 0.18 | 0.12 | 0.07 | 0.04 | 0.01
QCRS (Proposed) | 0.64 | 0.54 | 0.43 | 0.31 | 0.20 | 0.11 | 0.05 | 0.02 | 0.01
Table 3: Certified accuracy under different radii $R$ for DSRS, DDRS, Grid Search, and the proposed QCRS.

Fig. 2 compares the radius-accuracy curves of different methods on the CIFAR-10 dataset, along with the corresponding ACR, which is the area under the radius-accuracy curve. Table 2 presents the ACR of different methods and their runtime cost. The proposed QCRS outperforms the original randomized smoothing method (Cohen, Rosenfeld, and Kolter 2019) by significant margins: 48%, 18%, and 22% for the models trained with $\sigma=\{0.12,0.25,0.50\}$, respectively. The main performance gain comes from reducing the truncation effect (the waterfall effect) on the radius-accuracy curve. We also compare QCRS to grid search. Since grid search is extremely computationally expensive, we only test the images with id = 0, 50, 100, ..., 9950 in CIFAR-10. Despite using 24 points, which costs approximately 24 times more runtime than QCRS, grid search is still outperformed by QCRS. This is because QCRS is more time-efficient, which enables a broader and more precise search region than grid search. Furthermore, QCRS is guaranteed to reach the same optimum as grid search if the (1,1)-SQC condition holds. Regarding computational cost, the proposed method only takes about 7% additional inference time compared to the original Cohen method, as shown in Table 2.

We also compare the proposed QCRS with two state-of-the-art randomized smoothing methods, DSRS (Li et al. 2022) and DDRS (Alfarra et al. 2022). We follow their settings to evaluate the proposed method on CIFAR-10 for fair comparison and defer other minor comparisons and analyses to Appendix C. As Table 3 shows, for the certified accuracy at radius 0.5, DSRS and DDRS improve Cohen by 2.4% and 7.3%, respectively, while the proposed QCRS improves Cohen by 31.7%. Therefore, among the methods that boost certified radii, QCRS improves Cohen most effectively.

Figure 3: The comparison between Cohen, grid search, and QCRS on ImageNet. Following Cohen, we only use 500 images from the validation set. The models are trained by Gaussian augmentation with $\sigma=$ (a) 0.25 and (b) 0.50.

Fig. 3 shows the results on ImageNet. Following Cohen, only 500 images with id = 0, 100, 200, ..., 49900 from the validation set are used. We avoid the $\sigma=1.0$ model because (Mohapatra et al. 2021) revealed that a large $\sigma$ leads to serious fairness concerns, necessitating a restriction on $\sigma$ in randomized smoothing; we observe a similar fairness issue, discussed later. For the model with $\sigma=0.25$, the proposed method improves the ACR from 0.477 to 0.541, roughly 13.4%. Similarly, for the model with $\sigma=0.50$, the proposed method improves the ACR from 0.733 to 0.805, roughly 9.8%. In addition, the proposed method overcomes the truncation effect, providing a larger certified radius than Cohen. As for grid search, similar to CIFAR-10, it is computationally expensive, so we set the number of search points to 11 on ImageNet. As mentioned earlier, although grid search can provide the optimal certified radius if cost does not matter, its search region and precision are limited in practical applications. That is why the proposed method achieves a slightly higher ACR than the brute-force grid search in Fig. 3 while running roughly 11 times faster.

5.2 MACER

Figure 4: The performance of QCRS combined with training-based methods. We use MACER models with (a) $\sigma=0.25$ and (b) $\sigma=0.50$. Both QCRS and MACER provide similar improvements over Cohen, but QCRS incurs little computational overhead. Combining QCRS and MACER provides state-of-the-art certified radii.

The proposed method enhances randomized smoothing while building the smoothed classifier, so it is orthogonal to approaches that boost certified radii during the training stage and can be combined with them. The most representative training-based method for enlarging the certified radius is MACER. We apply the proposed method to models trained by MACER and observe significant improvements in the certified radius. Fig. 4 illustrates the radius-accuracy curves, and Table 4 shows the detailed comparison. As Table 4 shows, for the model trained with $\sigma=0.25$, Cohen achieves 0.423 ACR, and MACER improves this to 0.518, roughly 22.5%. Our proposed QCRS further improves the MACER ACR from 0.518 to 0.715, roughly 38%. Thus, QCRS and MACER together boost the original Cohen RS by roughly 69%. Similarly, for the model trained with $\sigma=0.50$, QCRS and MACER enhance Cohen's RS from 0.534 to 0.786, approximately +47.2%. Furthermore, Table 4 also includes the results of DDRS combined with MACER. It can be observed that MACER has a positive impact on the performance of inference-phase randomized smoothing methods; among the evaluated methods, QCRS combined with MACER achieves state-of-the-art performance. On the other hand, QCRS and MACER individually improve Cohen to 0.512 and 0.518, respectively. That is, QCRS can enlarge the certified radius to the same extent that MACER does, but without any training procedure. As datasets become larger and larger, re-training may be computationally prohibitive, so QCRS benefits from an efficient workflow that enlarges the certified radius with negligible cost.

(a) $\sigma=0.25$
Test \ Training | Cohen | MACER
Cohen | 0.423 | 0.518
DDRS | 0.448 (+6%) | 0.561 (+8%)
QCRS | 0.512 (+21%) | 0.715 (+38%)

(b) $\sigma=0.50$
Test \ Training | Cohen | MACER
Cohen | 0.534 | 0.669
DDRS | 0.540 (+1%) | 0.702 (+5%)
QCRS | 0.662 (+24%) | 0.786 (+18%)

Table 4: ACR results of different inference-phase RS methods combined with different training-phase RS methods. The training-phase RS is Cohen or MACER with (a) $\sigma=0.25$ and (b) $\sigma=0.50$. The inference-phase RS is Cohen, DDRS, or QCRS. The table also includes the improvement ratios relative to Cohen.

5.3 Computational Cost

We briefly analyze the computational cost relative to Cohen. The sigma search region of Algorithm 1 is $0.5-0.12=0.38$. Because the convergence rate of Algorithm 1 satisfies $\frac{\sigma_{max}-\sigma_{min}}{2^{t}}\geq|\sigma_{t}-\sigma^{*}|$, with $t\geq 6$ we achieve 0.006-optimality, i.e., $|\sigma-\sigma^{*}|<0.006$. The gradient computation requires 1,000 forward propagations per iteration, so we roughly need an additional 6,000 forward propagations per data point to reach the optimal sigma value. Standard RS needs 100,000 forward propagations, so the overhead of the proposed QCRS is approximately 6%. Empirically, we observe an approximately 7% overhead, as Table 2 illustrates.

We compare the computational cost of QCRS with those of DDRS (Alfarra et al. 2022), DSRS (Li et al. 2022), and Insta-RS (Chen et al. 2021). DDRS adopts the radius as its objective function and uses gradient descent to optimize it directly, which requires computing the gradient of the radius multiple times via backpropagation to update the sigma value. As shown in Table 2, certifying an image with DDRS takes roughly 7.39 seconds, an overhead of 14%, which is twice that of the proposed QCRS. DSRS, in our experiments, incurs roughly 100% overhead compared to Cohen. Finally, Insta-RS employs multi-start gradient descent, so its computational cost is the highest among the methods considered in this paper.

5.4 Constant Sigma during Deployment

A constant sigma may be required during deployment, which presents a limitation for input-specific RS. The memory-bank strategy proposed by DDRS (Alfarra et al. 2022), which is compatible with our QCRS, can address this issue. It is important to note that QCRS can still apply a constant sigma during deployment; this does not undermine the soundness of the certification, although it may limit practical utility in real-world scenarios. Specifically, in binary scenarios, the proposed method enables the customization of an optimal sigma for specific classes, thereby effectively enhancing their certified radii. For further details, please see Appendix J.

5.5 Fairness Issue

As discussed in (Mohapatra et al. 2021), randomized smoothing suffers from a fairness issue: the unbounded class in the data space dominates as $\sigma$ increases. We investigate the optimal $\sigma$ for each class in CIFAR-10. As Table 5 illustrates, the class "cars" exhibits a larger optimal sigma value than the other classes. This discrepancy may arise because "cars" is the unbounded class, so increasing the sigma value leads to a continuous expansion of the radius. Exploring potential solutions to this issue remains an intriguing direction.

airplanes | cars | birds | cats | deer
0.53±0.21 | 0.84±0.41 | 0.37±0.10 | 0.36±0.11 | 0.36±0.10
dogs | frogs | horses | ships | trucks
0.50±0.17 | 0.53±0.08 | 0.53±0.12 | 0.42±0.08 | 0.61±0.14
Table 5: The optimal sigma values for different classes in CIFAR-10. The base model is trained with $\sigma=0.5$.

6 Conclusion

In this work, we exploit and empirically demonstrate the quasiconcavity of the sigma-radius curve. The ($\upsilon^{-},\upsilon^{+}$)-SQC condition is general and easy to satisfy, so most data points (approximately 99%) conform to it. Based on this condition, we develop an efficient input-specific method called QCRS that finds the optimal $\sigma$ for building the smoothed classifier, enhancing traditional randomized smoothing significantly. Unlike previous inference-time randomized smoothing methods, which suffer from marginal improvement or high computational overhead, the proposed method enjoys better certification results at lower cost. We conduct extensive experiments on CIFAR-10 and ImageNet, and the results show that the proposed method significantly boosts the average certified radius with 7% overhead. QCRS improves the ACR, overcoming the trade-off on the radius-accuracy curve and eliminating the truncation effect. In addition, we combine the proposed QCRS with a training-based technique, and the results demonstrate state-of-the-art average certified radii on CIFAR-10 and ImageNet. A direction for future work is to generalize the proposed method to other $\ell_{p}$ balls and different distributions. A better training approach tailored to QCRS is also an interesting future research direction.

Acknowledgement

This work was supported in part by the National Science and Technology Council under Grants MOST 110-2634-F002-051, MOST 110-2222-E-002-014-MY3, NSTC 113-2923-E-002-010-MY2, NSTC-112-2634-F-002-002-MBK, by National Taiwan University under Grant NTU-CC-112L891006, and by Center of Data Intelligence: Technologies, Applications, and Systems under Grant NTU-112L900903.

References

  • Alfarra et al. (2022) Alfarra, M.; Bibi, A.; Torr, P. H.; and Ghanem, B. 2022. Data dependent randomized smoothing. In Uncertainty in Artificial Intelligence (UAI), 64–74. PMLR.
  • Anderson and Sojoudi (2022) Anderson, B. G.; and Sojoudi, S. 2022. Certified robustness via locally biased randomized smoothing. In Learning for Dynamics and Control Conference, 207–220. PMLR.
  • Boyd and Vandenberghe (2004) Boyd, S. P.; and Vandenberghe, L. 2004. Convex optimization. Cambridge university press.
  • Carlini and Wagner (2018) Carlini, N.; and Wagner, D. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE security and privacy workshops (SPW), 1–7. IEEE.
  • Chen et al. (2021) Chen, C.; Kong, K.; Yu, P.; Luque, J.; Goldstein, T.; and Huang, F. 2021. Insta-RS: Instance-wise Randomized Smoothing for Improved Robustness and Accuracy. arXiv preprint arXiv:2103.04436.
  • Chen et al. (2022) Chen, R.; Li, J.; Yan, J.; Li, P.; and Sheng, B. 2022. Input-specific robustness certification for randomized smoothing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 6295–6303.
  • Cohen, Rosenfeld, and Kolter (2019) Cohen, J.; Rosenfeld, E.; and Kolter, Z. 2019. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (ICML), 1310–1320. PMLR.
  • Das et al. (2018) Das, N.; Shanbhogue, M.; Chen, S.-T.; Hohman, F.; Li, S.; Chen, L.; Kounavis, M. E.; and Chau, D. H. 2018. Shield: Fast, practical defense and vaccination for deep learning using jpeg compression. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 196–204.
  • Ehlers (2017) Ehlers, R. 2017. Formal verification of piece-wise linear feed-forward neural networks. In International Symposium on Automated Technology for Verification and Analysis, 269–286. Springer.
  • Goodfellow, Shlens, and Szegedy (2014) Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  • Gowal et al. (2018) Gowal, S.; Dvijotham, K.; Stanforth, R.; Bunel, R.; Qin, C.; Uesato, J.; Arandjelovic, R.; Mann, T.; and Kohli, P. 2018. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715.
  • Jeong et al. (2021) Jeong, J.; Park, S.; Kim, M.; Lee, H.-C.; Kim, D.; and Shin, J. 2021. SmoothMix: Training Confidence-calibrated Smoothed Classifiers for Certified Adversarial Robustness. In ICML 2021 Workshop on Adversarial Machine Learning.
  • Katz et al. (2017) Katz, G.; Barrett, C.; Dill, D. L.; Julian, K.; and Kochenderfer, M. J. 2017. Reluplex: An efficient SMT solver for verifying deep neural networks. In International conference on computer aided verification, 97–117. Springer.
  • Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Citeseer.
  • Kumar et al. (2020a) Kumar, A.; Levine, A.; Feizi, S.; and Goldstein, T. 2020a. Certifying confidence via randomized smoothing. Advances in Neural Information Processing Systems (NeurIPS), 33: 5165–5177.
  • Kumar et al. (2020b) Kumar, A.; Levine, A.; Goldstein, T.; and Feizi, S. 2020b. Curse of dimensionality on randomized smoothing for certifiable robustness. In International Conference on Machine Learning (ICML), 5458–5467. PMLR.
  • Lecuyer et al. (2019) Lecuyer, M.; Atlidakis, V.; Geambasu, R.; Hsu, D.; and Jana, S. 2019. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), 656–672. IEEE.
  • Levine et al. (2020) Levine, A.; Kumar, A.; Goldstein, T.; and Feizi, S. 2020. Tight Second-Order Certificates for Randomized Smoothing. arXiv preprint arXiv:2010.10549.
  • Levine and Feizi (2021) Levine, A. J.; and Feizi, S. 2021. Improved, Deterministic Smoothing for L_1 Certified Robustness. In International Conference on Machine Learning (ICML), 6254–6264. PMLR.
  • Li, Xie, and Li (2023) Li, L.; Xie, T.; and Li, B. 2023. Sok: Certified robustness for deep neural networks. In 2023 IEEE Symposium on Security and Privacy (SP), 1289–1310. IEEE.
  • Li et al. (2022) Li, L.; Zhang, J.; Xie, T.; and Li, B. 2022. Double Sampling Randomized Smoothing. In International Conference on Machine Learning (ICML), 13163–13208. PMLR.
  • Madry et al. (2017) Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
  • Mohapatra et al. (2021) Mohapatra, J.; Ko, C.-Y.; Weng, L.; Chen, P.-Y.; Liu, S.; and Daniel, L. 2021. Hidden cost of randomized smoothing. In International Conference on Artificial Intelligence and Statistics, 4033–4041. PMLR.
  • Mueller et al. (2022) Mueller, M. N.; Eckert, F.; Fischer, M.; and Vechev, M. 2022. Certified Training: Small Boxes are All You Need. In International Conference on Learning Representations (ICLR).
  • Nesterov (2018) Nesterov, Y. 2018. Lectures on convex optimization, volume 137. Springer.
  • Russakovsky et al. (2015) Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision (IJCV), 115(3): 211–252.
  • Salman et al. (2019) Salman, H.; Li, J.; Razenshteyn, I.; Zhang, P.; Zhang, H.; Bubeck, S.; and Yang, G. 2019. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems (NeurIPS), 32.
  • Samangouei, Kabkab, and Chellappa (2018) Samangouei, P.; Kabkab, M.; and Chellappa, R. 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations (ICLR).
  • Shafahi et al. (2019) Shafahi, A.; Najibi, M.; Ghiasi, M. A.; Xu, Z.; Dickerson, J.; Studer, C.; Davis, L. S.; Taylor, G.; and Goldstein, T. 2019. Adversarial training for free! Advances in Neural Information Processing Systems (NeurIPS), 32.
  • Singla and Feizi (2021) Singla, S.; and Feizi, S. 2021. Skew orthogonal convolutions. In International Conference on Machine Learning (ICML), 9756–9766. PMLR.
  • Singla, Singla, and Feizi (2022) Singla, S.; Singla, S.; and Feizi, S. 2022. Improved deterministic l2 robustness on CIFAR-10 and CIFAR-100. In International Conference on Learning Representations (ICLR).
  • Súkeník, Kuvshinov, and Günnemann (2021) Súkeník, P.; Kuvshinov, A.; and Günnemann, S. 2021. Intriguing Properties of Input-dependent Randomized Smoothing. arXiv preprint arXiv:2110.05365.
  • Szegedy et al. (2013) Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR).
  • Tjeng, Xiao, and Tedrake (2018) Tjeng, V.; Xiao, K. Y.; and Tedrake, R. 2018. Evaluating Robustness of Neural Networks with Mixed Integer Programming. In International Conference on Learning Representations (ICLR).
  • Wang, Bochkovskiy, and Liao (2022) Wang, C.-Y.; Bochkovskiy, A.; and Liao, H.-Y. M. 2022. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696.
  • Weng et al. (2018) Weng, L.; Zhang, H.; Chen, H.; Song, Z.; Hsieh, C.-J.; Daniel, L.; Boning, D.; and Dhillon, I. 2018. Towards fast computation of certified robustness for relu networks. In International Conference on Machine Learning (ICML), 5276–5285. PMLR.
  • Wong, Rice, and Kolter (2020) Wong, E.; Rice, L.; and Kolter, J. Z. 2020. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994.
  • Xu, Li, and Li (2022) Xu, X.; Li, L.; and Li, B. 2022. Lot: Layer-wise orthogonal training on improving l2 certified robustness. Advances in Neural Information Processing Systems (NeurIPS), 35: 18904–18915.
  • Yang et al. (2020) Yang, G.; Duan, T.; Hu, J. E.; Salman, H.; Razenshteyn, I.; and Li, J. 2020. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning (ICML), 10693–10705. PMLR.
  • Zhai et al. (2019) Zhai, R.; Dan, C.; He, D.; Zhang, H.; Gong, B.; Ravikumar, P.; Hsieh, C.-J.; and Wang, L. 2019. MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius. In International Conference on Learning Representations (ICLR).
  • Zhai et al. (2022) Zhai, X.; Kolesnikov, A.; Houlsby, N.; and Beyer, L. 2022. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12104–12113.
  • Zhang et al. (2020) Zhang, D.; Ye, M.; Gong, C.; Zhu, Z.; and Liu, Q. 2020. Black-box certification with randomized smoothing: A functional optimization based framework. Advances in Neural Information Processing Systems (NeurIPS), 33: 2316–2326.

Appendix

Appendix A Quasiconcavity of Sigma-Radius Curves

In this section, we aim to provide an intuitive explanation of the quasiconcavity observed in the sigma-radius curves. We examine the sigma-radius curve, $R(\sigma)=\sigma\cdot\Phi^{-1}(\underline{p_{A}}(\sigma))$, from the perspective of the data space. Here, $\underline{p_{A}}(\sigma)$ denotes the lower bound of the Gaussian measure of class $A$ for a specific $\sigma$, and $\Phi^{-1}$ is the inverse Gaussian cumulative distribution function. As depicted in Fig. 5, while $\sigma$ gradually increases from zero, $\underline{p_{A}}(\sigma)$ remains close to 1 because most of the measure falls within the class boundary; consequently, $R$ increases with $\sigma$. As $\sigma$ continues to increase, $\underline{p_{A}}(\sigma)$ begins to decrease, especially once the Gaussian kernel incorporates more samples from the wrong class. This results in a significant decline in $\Phi^{-1}(\underline{p_{A}}(\sigma))$ and consequently causes $R$ to decrease. Thus, there is a negative relationship between $\sigma$ and $\Phi^{-1}(\underline{p_{A}})$, causing the radius first to increase and then decrease as the sigma value varies from zero to infinity. The objective is to identify the value of $\sigma$ that maximizes the radius $R$. This observation inspires us to leverage the quasiconcave structure of the problem to find the optimal $\sigma$ effectively.
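The following toy computation illustrates this behavior. We assume a hypothetical, monotonically decreasing class probability $p_A(\sigma)$ (a stand-in for $\underline{p_{A}}(\sigma)$, not derived from any real classifier) and evaluate $R(\sigma)=\sigma\cdot\Phi^{-1}(p_A(\sigma))$, which first rises and then falls.

import numpy as np
from scipy.stats import norm

# Hypothetical class-A probability that decays as sigma grows (illustration only).
def p_a(sigma):
    return 0.999 - 0.6 * sigma ** 2

sigmas = np.linspace(0.05, 1.2, 100)
probs = np.clip(p_a(sigmas), 1e-6, 1 - 1e-6)
radii = np.where(probs > 0.5, sigmas * norm.ppf(probs), 0.0)  # no certificate when p_A <= 0.5

best = sigmas[np.argmax(radii)]
print(f"radius peaks at sigma = {best:.2f}")  # the curve rises, then falls back to zero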

Appendix B Evaluating Quasiconcavity

We conduct empirical evaluations to assess the concavity and the quasiconcavity of data points on CIFAR-10 and ImageNet datasets.

Concavity: To mitigate the computational burden, we sample 20 values of $\sigma$ and evaluate the objective function $R(\sigma)$ at each using 100,000 Monte Carlo samples. The specific 20 sigma values are given later in this section. We then compute the Hessian value at these 20 sampled points. Note that for a concave function, the Hessian of $R(\sigma)$ must be non-positive; thus, any data point exhibiting a positive Hessian indicates a non-concave function.

Quasiconcavity: The assessment of quasiconcavity leverages Lemma 2 as detailed in the main paper. We first employ grid search to identify the optimal value of $\sigma$. Analogous to the concavity evaluation, we sample the same set of 20 sigma values and compute the gradient sign of $R(\sigma)$ using the forward method introduced in the main paper, with 100,000 Monte Carlo samples. Failure to satisfy the conditions stipulated in Lemma 2 indicates that $R(\sigma)$ is not a quasiconcave function.
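A sketch of the concavity test on such a sampled grid is given below; radii is assumed to be the array of MC-estimated $R(\sigma)$ values at the sampled sigmas, and the finite-difference second derivative is our own simple approximation of the Hessian test described above (the quasiconcavity check follows the gradient-sign routine sketched in Section 4.3).

import numpy as np

def is_concave_on_grid(sigmas, radii):
    # Finite-difference concavity test on a sampled sigma grid.
    # sigmas: sorted sampled sigma values; radii: MC-estimated R(sigma) at those sigmas.
    # A concave R(sigma) must have a non-positive second derivative everywhere,
    # so any positive value rules concavity out.
    sigmas, radii = np.asarray(sigmas), np.asarray(radii)
    grads = np.gradient(radii, sigmas)
    hessians = np.gradient(grads, sigmas)
    return bool(np.all(hessians <= 0))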

Figure 5: Illustration of the relation between $R$ and $\sigma$. As $\sigma$ gradually increases from zero, $\Phi^{-1}(\underline{p_{A}})$ does not change much because most sampled points are still within the boundary, leading to an increase in $R$. However, as $\sigma$ further increases, the other class catches up, and $\Phi^{-1}(\underline{p_{A}})$ decreases rapidly, causing $R$ to decrease. Thus, a negative correlation exists between $\Phi^{-1}(\underline{p_{A}})$ and $\sigma$. The objective is to identify the value of $\sigma$ that maximizes the radius $R$.

Note that our evaluation guarantees no false negatives: if a data point is identified as non-concave or non-quasiconcave, it is indeed non-concave or non-quasiconcave. However, false positives may exist. Therefore, Table 1 in the main paper gives an upper bound on the number of samples that satisfy concavity/quasiconcavity. We show that concavity is hard to satisfy for real-world data, whereas quasiconcavity is far more prevalent.

To further investigate the generality of quasiconcavity, we use MC sampling to estimate $\upsilon^{-}$ and $\upsilon^{+}$. We use 20 different sigma values per data point, and for each sigma value we use 100,000 MC samples to compute the radius. Averaged over all data points, including both quasiconcave and non-quasiconcave cases, the estimated $\upsilon^{-}$ and $\upsilon^{+}$ are 99.77% on CIFAR-10 and 99.11% on ImageNet. That is, approximately $0.9977^{5}=98.85\%$ of data points on CIFAR-10 are expected to converge to the optimal sigma, and for ImageNet this percentage is 95.62%.

The experimental settings for Table 1 in the main paper are as follows: 1) For the model trained on CIFAR-10 with $\sigma=0.12$, we sample the sigma values 0.020, 0.030, 0.040, 0.050, 0.060, 0.070, 0.080, 0.090, 0.100, 0.110, 0.120, 0.130, 0.140, 0.150, 0.160, 0.170, 0.180, 0.190, 0.200, and 0.220; 2) For the model trained on CIFAR-10 with $\sigma=0.25$, we sample the sigma values 0.15, 0.18, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.35, 0.4, 0.45, and 0.5; 3) For the model trained on ImageNet with $\sigma=0.25$, we sample the sigma values 0.100, 0.115, 0.130, 0.145, 0.160, 0.175, 0.190, 0.205, 0.220, 0.235, 0.250, 0.265, 0.280, 0.295, 0.310, 0.325, 0.340, 0.355, 0.370, 0.385, and 0.400; 4) For the model trained on ImageNet with $\sigma=0.50$, we sample the sigma values 0.3, 0.32, 0.34, 0.36, 0.38, 0.40, 0.42, 0.44, 0.46, 0.48, 0.50, 0.52, 0.54, 0.56, 0.58, 0.60, 0.62, 0.64, and 0.66. Note that we increase the sampling density around the $\sigma$ with which the model is trained.

Figure 6: The sigma-radius curves on CIFAR-10, in which quasiconcavity can be observed. Different curves indicate different data points. We evaluate the sigma-radius curves of 164 images from CIFAR-10 that can be certified by Cohen and check their concavity and quasiconcavity numerically: 6.7% and 98.2% of the data points are concave and quasiconcave, respectively. The bold red curves are non-quasiconcave; all others are quasiconcave.
Figure 7: The sigma-radius curves on ImageNet, in which quasiconcavity can be observed. Different curves indicate different data points. We evaluate the sigma-radius curves of 210 images from ImageNet that can be certified by Cohen and check their concavity and quasiconcavity numerically: 0% and 93% of the data points are concave and quasiconcave, respectively. The bold red curves are non-quasiconcave; all others are quasiconcave. Different from CIFAR-10, the number of non-quasiconcave and non-concave data points rises on ImageNet. One plausible explanation is the heightened complexity of the data manifold for ImageNet classifiers, due to the larger number of classes and higher dimensionality, which leads to intricate and non-smooth decision boundaries.

To visualize the sigma-radius curves, we plot them in Fig. 6, derived from the CIFAR-10 model with $\sigma=0.25$. The figure clearly shows that only three curves, drawn in bold red, fail to satisfy the quasiconcave condition. Furthermore, Fig. 7 shows the sigma-radius curves on ImageNet. Notably, a larger number of data points do not satisfy concavity and quasiconcavity there. A plausible explanation for this discrepancy lies in the increased complexity of the data manifold for classifiers operating on ImageNet, attributable to its higher number of classes and dimensions; this elevated complexity results in intricate and non-smooth decision boundaries, so using a Gaussian kernel to smooth the data space can lead to unstable results. Nonetheless, we underscore that a significantly larger number of data points adhere to quasiconcavity than to concavity. This observation highlights the advantage of leveraging quasiconcavity for optimization problems within the domain of randomized smoothing.

Appendix C Comparison with Prior Work

Detailed comparisons with DDRS: The proposed QCRS method is novel in that it requires a much looser ($\upsilon^{-},\upsilon^{+}$)-SQC assumption and obtains the optimal sigma with a much faster convergence rate than prior work. Table 6 compares QCRS to DDRS (Alfarra et al. 2022), a gradient-based state-of-the-art method for finding the optimal sigma.

Table 6: Detailed comparisons with DDRS.
 | QCRS | DDRS
Assumption | ($\upsilon^{-},\upsilon^{+}$)-SQC | concavity
Generality | high (99%) | low (7%)
Convergence | $\mathcal{O}((\frac{1}{2})^{t})$ | no
Back-propagation | no | yes
Time cost | low | high

For the generality of assumptions, almost all data points satisfy the (υ⁻, υ⁺)-SQC condition, while only 6.7% of data points satisfy concavity. In addition, the proposed QCRS guarantees convergence, and its complexity is log N to find the optimal solution on an N-point grid. Furthermore, QCRS is computationally cheap: it does not employ back-propagation, making it much less expensive than DDRS.

Relative comparisons with prior work: When reproducing the results of prior work, it is important to consider that randomized smoothing involves random components, such as Monte Carlo sampling, and different studies may have slight variations in parameter selection. These factors can contribute to variations in the obtained results, even when attempting to reproduce the same experiments. To provide a comprehensive comparison, in addition to reporting our reproduced DDRS and DSRS results, we also include the original results from the DDRS and DSRS papers as a reference. Since these papers commonly use the original Cohen method as a baseline, we can measure the improvement achieved by DDRS/DSRS compared to the Cohen baseline in their respective papers. This improvement is referred to as the relative improvement. Table 7 presents the relative improvement in certified accuracies under different radii for DSRS and DDRS, with the values reported directly from the original DDRS/DSRS papers.

Table 7: Certified accuracy under different radii r of DSRS, DDRS, Grid Search, and the proposed QCRS. The “Cohen” values (Li et al. 2022; Alfarra et al. 2022) of DSRS/DDRS are reported from their original papers. “+%” indicates the relative improvement compared to the respective Cohen baseline.
Certified Accuracy
Certified radii r 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.25
Cohen (Li et al. 2022) 0.56 0.41 0.28 0.19 0.15 0.10 0.08 0.04 0.02
+DSRS (Li et al. 2022) 0.57 0.43 0.31 0.21 0.16 0.13 0.08 0.06 0.04
     (+%) 1.8% 4.9% 10.7% 10.5% 6.7% 30.0% 0.0% 50% 100%
Cohen (Alfarra et al. 2022) 0.58 0.40 0.29 0.20 0.13 0.07 0.03 0.00 0.00
+DDRS (Alfarra et al. 2022) 0.65 0.48 0.38 0.28 0.17 0.08 0.03 0.01 0.00
     (+%) 12.1% 20.0% 31.0% 40.0% 30.8% 14.3% 0.0% NA 0.0%
Cohen (Ours) 0.55 0.41 0.32 0.23 0.15 0.09 0.05 0.00 0.00
+Grid Search (24 points) 0.58 0.51 0.42 0.30 0.18 0.12 0.07 0.04 0.01
     (+%) 5.5% 24.4% 31.2% 30.4% 20.0% 33.3% 40% NA NA
Cohen (Ours) 0.55 0.41 0.32 0.23 0.15 0.09 0.05 0.00 0.00
+QCRS (Proposed) 0.64 0.54 0.43 0.31 0.20 0.11 0.05 0.02 0.01
     (+%) 16.4% 31.7% 34.4% 34.8% 33.3% 22.2% 0.0% NA NA

Experiments on DSRS: DSRS (Li et al. 2022) uses double-sampling with two different generalized Gaussian distributions to estimate the probability of class A. The sigma values of those generalized Gaussian distributions are σ = 0.5 and σ = 0.4. It is important to note that their models are trained using the generalized Gaussian, which is different from Cohen’s method. To evaluate DSRS, we compare two trained models and three different sampling methods. First, we compare the original Cohen’s model trained by Gaussian noise (denoted as “G-”) and the DSRS model trained by generalized Gaussian noise according to DSRS (denoted as “GG-”). Second, for randomized smoothing sampling, we compare three methods:

  1. “-G” indicates the original Gaussian smoothing with a Monte Carlo sampling number of 100,000.

  2. “-GG” indicates the generalized Gaussian smoothing with a Monte Carlo sampling number of 100,000.

  3. “-DS” indicates the double-sampling generalized Gaussian smoothing: one distribution with σ = 0.5 and a Monte Carlo sampling number of 50,000, and the other with σ = 0.4 and a Monte Carlo sampling number of 50,000.

Fig. 8 and Table 8 show the radius-accuracy curves and ACR results, respectively. However, the results of the double-sampling method did not align with our expectations, despite following the suggestions and source code provided in the DSRS paper. We suspect that certain hyperparameters may not have been optimized properly in our experiments. In the main paper, we report the last row of Table 8, as the hyperparameters used in that row align with the recommendations provided by DSRS. In addition, we also report the original results from the DSRS paper, as Table 7 shows.

Table 8: The ACR results of DSRS on CIFAR-10.
Exp. Training Sampling ACR
G-G Gaussian Gaussian 0.510
G-GG Gaussian Generalized Gaussian 0.382
G-DS Gaussian Double 0.417
GG-G Generalized Gaussian Gaussian 0.512
GG-GG Generalized Gaussian Generalized Gaussian 0.439
GG-DS Generalized Gaussian Double 0.466
Figure 8: ACR results of DSRS with different settings. The experiments are labeled according to Table 8.

Appendix D Proofs

D.1 Proof of Lemma 1

We first prove Lemma 1 of the main paper.

Proof.

Suppose h is a strictly quasiconcave function and x* is a local maximum of h. We want to show that x* is also a global maximum of h. Assume, for the sake of contradiction, that x* is not a global maximum. Then there exists a point y in the domain of h such that h(y) > h(x*). Since x* is a local maximum, there exists a neighborhood of x* in which h(x*) ≥ h(z) for every z. Consider the point z = λx* + (1 − λ)y, where λ ∈ (0, 1) is chosen close enough to 1 that z lies in this neighborhood. By the definition of strict quasiconcavity, we have

h(z)>\min\{h(x^{*}),h(y)\}.

Since h(y) > h(x*), the right-hand side equals h(x*), so h(z) > h(x*). This contradicts h(x*) ≥ h(z) for points z in the neighborhood of x*. Therefore, the assumption that x* is not a global maximum must be false. Hence, we conclude that if h is a strictly quasiconcave function and x* is a local maximum of h, then x* is also a global maximum of h. ∎

D.2 Proof of Lemma 2

We begin by proving the “\Rightarrow” direction, followed by the “\Leftarrow” direction. The proof of Lemma 2 simplifies when we apply the first-order condition of quasiconcavity outlined in (Boyd and Vandenberghe 2004). The first-order condition is as follows:

Proposition 1.

(First-order condition (Boyd and Vandenberghe 2004)) Suppose h: ℝⁿ → ℝ is differentiable. Then h is strictly quasiconcave if and only if

h(y)>h(x)\Rightarrow\nabla h(x)(y-x)>0.

First, we employ Proposition 1 to prove the “\Rightarrow” direction of Lemma 2.

Proof.

Suppose h is a strictly quasiconcave function and x* is a local maximum of h. By Lemma 1, x* is also the global maximum, so for any other point x, it holds that h(x*) > h(x), and Proposition 1 then gives ∇h(x)(x* − x) > 0. Consequently, when x < x*, the inequality ∇h(x)(x* − x) > 0 ensures that ∇h(x) > 0. Conversely, if x > x*, the inequality ∇h(x)(x* − x) > 0 ensures that ∇h(x) < 0. That is, ∇h(x) is positive on the left side of the optimal solution and negative on the right side. ∎

Next, we employ Proposition 1 to prove the “\Leftarrow” direction of Lemma 2.

Proof.

Consider a function h with a unique optimal solution x*, and suppose h satisfies ∇h(z) > 0 for z < x* and ∇h(z) < 0 for z > x*. Consider two points x and y with h(y) > h(x). If x < x*, then y must be greater than x: since ∇h(z) > 0 for all z < x*, h is increasing to the left of x*, so no point to the left of x can attain a larger value than h(x). Because ∇h(x) > 0 and y − x > 0, it follows that ∇h(x)(y − x) > 0. On the other hand, if x > x*, then y must be smaller than x because ∇h(z) < 0 for z > x*, and since ∇h(x) < 0 and y − x < 0, we again obtain ∇h(x)(y − x) > 0. (If x = x*, no point y satisfies h(y) > h(x), so the condition holds vacuously.) Thus, according to Proposition 1, h is a strictly quasiconcave function. ∎
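As a concrete one-dimensional illustration of Lemma 2 (our example, not from the main paper), consider

h(\sigma)=e^{-(\sigma-\sigma^{*})^{2}},\qquad h^{\prime}(\sigma)=-2(\sigma-\sigma^{*})\,e^{-(\sigma-\sigma^{*})^{2}}.

The derivative is positive for σ < σ* and negative for σ > σ*, so by Lemma 2 the function is strictly quasiconcave with the unique maximizer σ*. It is, however, not concave, since h''(σ) = (4(σ − σ*)² − 2)e^{-(σ−σ*)²} > 0 whenever |σ − σ*| > 1/√2. This is exactly the kind of curve for which quasiconcavity-based reasoning applies while concavity-based reasoning fails.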

D.3 Proofs of convergence analysis

In this section, we theoretically analyze the convergence of the proposed methods and provide proofs. The convergence of the gradient-based method is also discussed.

Without loss of generality, we discuss convexity rather than concavity here. First, we prove Theorem 1 in the main paper as follows:

Proof.

Let σ_t be the σ value after t iterations. Suppose that R satisfies the (υ⁻, υ⁺)-SQC condition with (υ⁻, υ⁺) = (1, 1), and that there exists a σ* ∈ [σ_min, σ_max]. Then, for the first iteration σ_1 = (σ_max + σ_min)/2, we have

\frac{\sigma_{max}-\sigma_{min}}{2}\geq|\sigma_{1}-\sigma^{*}|,

because σ_1 is the midpoint of σ_min and σ_max. Without loss of generality, we assume σ_min ≤ σ* ≤ σ_1. Thus, according to Algorithm 1, σ_2 = (σ_min + σ_1)/2, and

\frac{\sigma_{max}-\sigma_{min}}{2^{2}}\geq|\sigma_{2}-\sigma^{*}|.

If we run t iterations, we can conclude that

\frac{\sigma_{max}-\sigma_{min}}{2^{t}}\geq|\sigma_{t}-\sigma^{*}|.

∎

Therefore, to achieve a δ-optimal solution, the convergence rate of the proposed method is 𝒪((1/2)^t).
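For concreteness, the following is a minimal sketch (ours, not the authors' released code) of the bisection behind Algorithm 1: it brackets σ* and halves the interval using only the sign of a finite-difference estimate of dR/dσ, which is why the 𝒪((1/2)^t) rate above applies. Here radius_fn is an assumed Monte Carlo radius oracle.

import math

def qcrs_bisection(radius_fn, sigma_min, sigma_max, eps=0.01, tau=0.05):
    # Bisection on the sigma-radius curve R(sigma); only the *sign* of the
    # finite-difference gradient is used, which is far more stable than its
    # magnitude (see Appendix H).
    lo, hi = sigma_min, sigma_max
    # |sigma_t - sigma*| <= (sigma_max - sigma_min) / 2^t  =>  t = O(log(1/eps))
    num_iters = math.ceil(math.log2((hi - lo) / eps))
    for _ in range(num_iters):
        mid = (lo + hi) / 2.0
        grad_sign = radius_fn(mid + tau) - radius_fn(mid - tau)
        if grad_sign > 0:   # radius still increasing: sigma* lies to the right
            lo = mid
        else:               # radius decreasing: sigma* lies to the left
            hi = mid
    return (lo + hi) / 2.0

With the settings reported in Appendix E (e.g., search region [0.15, 0.7] and ε = 0.01), the loop above runs ⌈log₂(0.55/0.01)⌉ = 6 iterations, each costing two small-sample radius estimates.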

Next, we further analyze the convergence of the gradient-based method. L-smoothness or μ-strong convexity is required to guarantee convergence. We first show the convergence rate under L-smoothness:

Proposition 2.

Suppose a function R(σ) is L-smooth for some L > 0 with respect to σ. Then, if we run gradient descent for t iterations, it converges as follows (Nesterov 2018):

R(\sigma_{t})-R(\sigma^{*})\leq\frac{L|\sigma_{1}-\sigma^{*}|^{2}}{2(t-1)}.

In addition, if both L-smoothness and μ-strong convexity hold, the convergence rate is as follows:

Proposition 3.

Suppose a function R(σ) is L-smooth and μ-strongly convex for some L, μ > 0 with respect to σ, and σ* is the optimal sigma. Then, if we run gradient descent for t iterations, it converges as follows (Nesterov 2018):

|\sigma_{t}-\sigma^{*}|^{2}\leq\left(\frac{L-\mu}{L+\mu}\right)^{t-1}|\sigma_{1}-\sigma^{*}|^{2}.

Proposition 2 shows the convergence rate under the convex and L-smooth condition. R with L-smoothness alone cannot guarantee a δ-optimal solution, i.e., |σ − σ*| ≤ δ. On the other hand, Proposition 3 shows the convergence rate under the L-smooth and μ-strongly convex condition, which is faster but stricter than Proposition 2. If we want to achieve a δ-optimal σ, the convergence rate is 𝒪(((L − μ)/(L + μ))^t), where t is the number of iterations.
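To make the comparison explicit, the iteration counts implied by the two bounds can be derived as follows (a short calculation of ours). For the bisection, (σ_max − σ_min)/2^t ≤ δ gives

t\geq\log_{2}\frac{\sigma_{max}-\sigma_{min}}{\delta},

while rearranging the bound in Proposition 3, ((L−μ)/(L+μ))^{(t-1)/2}\,|\sigma_{1}-\sigma^{*}|\leq\delta gives

t\geq 1+\frac{2\ln\!\left(|\sigma_{1}-\sigma^{*}|/\delta\right)}{\ln\frac{L+\mu}{L-\mu}}.

Both are logarithmic in 1/δ, but the first requires only the (υ⁻, υ⁺)-SQC condition, whereas the second additionally requires L-smoothness and μ-strong convexity and degrades as the condition number L/μ grows.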

Appendix E Implementation Details

We use an NVIDIA GeForce® RTX 3090 GPU and an AMD Ryzen 5 5600X CPU with 32 GB DRAM to run the experiments. Following prior work, we use ResNet110 for CIFAR-10 and ResNet50 for ImageNet. We use 500 as the MC sampling number to estimate gradients in Algorithm 1. The terminal condition of Algorithm 1 depends on ε, which indicates an ε-optimal solution, i.e., |σ − σ*| ≤ ε. The ε we use in our experiments is 0.01, and τ (the step to compute the gradient) is ±0.05. Regarding grid search, we use 24 points for CIFAR-10 and 11 points for ImageNet. The search region is 0.08 to 0.50 for σ = 0.12, 0.15 to 0.7 for σ = 0.25, and 0.25 to 1.0 for σ = 0.50.
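As a rough illustration of the grid-search baseline (our sketch; the names and structure are hypothetical, not the authors' code), the search regions above translate directly into an evenly spaced sigma grid:

import numpy as np

# Search regions from this appendix, keyed by the sigma used to train the model.
SEARCH_REGIONS = {0.12: (0.08, 0.50), 0.25: (0.15, 0.70), 0.50: (0.25, 1.00)}

def grid_search_sigma(radius_fn, train_sigma, num_points=24):
    # Evaluate the (estimated) certified radius on an evenly spaced grid and
    # return the sigma with the largest radius.
    lo, hi = SEARCH_REGIONS[train_sigma]
    grid = np.linspace(lo, hi, num_points)
    radii = [radius_fn(s) for s in grid]
    return float(grid[int(np.argmax(radii))])

On an N-point grid this costs N radius estimates per input, whereas the bisection sketched in the previous appendix needs only about 2⌈log₂((σ_max − σ_min)/ε)⌉, which is the source of QCRS's lower overhead.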

Appendix F Ablation Study

A comprehensive ablation study for hyperparameter tuning is shown in Table 9. The terms “left” and “right” denote the endpoints σ_min and σ_max of the search region; that is, Algorithm 1 searches for the optimal sigma value within the interval [σ_min, σ_max]. Notably, the results illustrate that a larger search region leads to a notable enhancement in ACR. On the other hand, the choice of step distance for the gradient computation is important: excessively small values introduce imprecise gradient directions, while excessively large values struggle to capture local information effectively. These experimental evaluations were conducted on the test set of CIFAR-10, specifically on the images satisfying id % 50 == 0.

sigma left (σ_min) right (σ_max) step (τ) ACR
0.5 0.12 0.8 0.4 0.619
0.5 0.12 1.2 0.4 0.627
0.5 0.12 1.9 0.4 0.638
0.5 0.15 0.9 0.4 0.624
0.5 0.20 0.9 0.4 0.624
0.5 0.20 1.0 0.4 0.625
0.5 0.20 1.2 0.4 0.628
0.5 0.20 1.5 0.4 0.629
0.5 0.20 1.7 0.4 0.639
0.5 0.30 1.0 0.4 0.622
0.5 0.30 1.5 0.4 0.629
0.5 0.3 1.2 0.2 0.640
0.5 0.3 1.2 0.4 0.634
0.5 0.3 1.2 0.6 0.630
0.5 0.3 1.2 0.8 0.627
0.5 0.3 1.2 1.2 0.615
0.5 0.12 1.7 0.05 0.572
0.5 0.12 1.7 0.1 0.640
0.5 0.12 1.7 0.2 0.644
0.5 0.12 1.7 0.4 0.639
Table 9: Hyperparameter tuning ablation study for QCRS. The terms “left” and “right” indicate the bounds of the search region. Evidently, broader search regions yield improved ACR. On the other hand, the step distance for gradient computation also plays a crucial role. Smaller values result in less accurate gradient directions, while larger values struggle to capture local information effectively. These experiments were conducted on CIFAR-10 with the condition id % 50 == 0.

Appendix G Error on Sigma

Figure 9: We demonstrate four sigma values found by different algorithms and parameters. The purple points represent sigma values found by grid search. The other points are sigma values found by the proposed QCRS. We test different ε values, which indicate ε-optimal solutions. The ε values also serve as the terminal conditions of Algorithm 1.

We assess the accuracy of the sigma values obtained by QCRS by comparing them to those obtained through grid search. Assuming that the optimal sigma value obtained through grid search represents the true optimum, we evaluate how effectively QCRS finds the optimal sigma. We randomly selected several data points for evaluation, and the results are illustrated in Fig. 9. The sigma values found by QCRS are in close proximity to those obtained through grid search, suggesting that QCRS effectively approximates the optimal sigma value. Furthermore, we test the ε value in Algorithm 1, which serves as the terminal condition. Using a smaller ε value leads to sigma values closer to the optimum found through grid search. However, even with different ε values, the sigma values obtained by QCRS and grid search remain quite close. This demonstrates the ability of QCRS to optimize the sigma value for randomized smoothing effectively.

Appendix H Gradient Stability

As mentioned in the main paper, using the radius calculated by Monte Carlo sampling as the objective function to optimize the sigma value can result in unstable gradients. To verify this, we analyze the gradient values under different MC sampling numbers, as shown in Fig. 10. The different curves indicate different data points. We find that the gradient values exhibit significant variations when using low MC sampling numbers. Consequently, using a low number of Monte Carlo (MC) samples may lead to inaccurate gradient estimates, which can adversely impact gradient-based optimization methods. In contrast, the proposed QCRS method relies solely on the sign of the gradient, which is considerably more stable than the actual gradient values. This can be observed in Fig. 10, where the gradient sign remains unchanged while the gradient values exhibit instability.

Figure 10: Gradient values with respect to different MC sampling numbers. The different curves indicate different data points. The gradient values with regard to different MC sampling numbers are unstable, but their signs remain the same in almost all cases. The proposed QCRS is stable in the optimization as it only relies on the signs of gradient values.
(a) MC sampling
(b) Zoom in
Figure 11: We plot the confidence interval of \underline{p_{A}}(\sigma) under different MC sampling numbers. The number of MC samples significantly affects the confidence interval: lower MC sample numbers result in wider confidence intervals. The red region in the figure represents the confidence interval of \underline{p_{A}}(\sigma) with a 1 − α confidence level and 500 MC samples. It is wider than the one with 100,000 MC samples. This shows that a low MC sampling number makes the certified region uncertain. When estimating gradient values, if the value depends on MC sampling, the corresponding gradient will be unstable, making the optimization difficult to converge. Prior work, such as DDRS, used a very low MC sampling number (as low as eight) when optimizing the sigma value, leading to unstable optimization.

In addition, as mentioned in the main paper, the MC sampling number significantly affects the estimation of \underline{p_{A}}(\sigma). As shown in Fig. 11, when the MC sampling number is 500, the possible interval is represented by the red region with a confidence level of 1 − α. The wide extent of the red region indicates a high level of uncertainty in the estimation of \underline{p_{A}}(\sigma), resulting in an unstable estimate. Previous work often employs low MC sampling numbers to minimize the computational cost of backpropagation, which further contributes to unstable computed gradients and inaccurate \underline{p_{A}}(\sigma) estimation. These factors can ultimately lead to poor optimization of σ.
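To make the effect of the MC sampling number concrete, the following is a minimal sketch (ours) of the standard one-sided Clopper-Pearson lower bound and the resulting binary-setting radius R = σΦ⁻¹(\underline{p_{A}}); the function names and the 90% empirical accuracy used below are illustrative assumptions.

from scipy.stats import beta, norm

def clopper_pearson_lower(successes, n, alpha=0.001):
    # One-sided (1 - alpha) lower confidence bound on p_A from n noisy votes.
    if successes == 0:
        return 0.0
    return float(beta.ppf(alpha, successes, n - successes + 1))

def certified_radius(p_a_lower, sigma):
    # Binary-setting radius R = sigma * Phi^{-1}(p_A_lower); zero if p_A_lower <= 1/2.
    return 0.0 if p_a_lower <= 0.5 else float(sigma * norm.ppf(p_a_lower))

# With the same empirical top-class frequency of 90%, the lower bound (and
# hence the radius at sigma = 0.5) is noticeably looser with 500 samples
# than with 100,000 samples.
for n in (500, 100_000):
    p_low = clopper_pearson_lower(int(0.9 * n), n)
    print(n, round(p_low, 4), round(certified_radius(p_low, 0.5), 4))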

Appendix I Limitations

Generality of quasiconcavity: The theoretical generality of the quasiconcavity assumption for data points remains unknown. While we empirically demonstrate the presence of quasiconcave sigma-radius curves in most data points in CIFAR-10 and ImageNet, a rigorous theoretical proof is lacking. A further theoretical investigation is required to determine the extent to which the quasiconcavity property holds across different datasets and problem domains.

Convergence assumption: The convergence guarantee of Algorithm 1 relies on the (υ⁻, υ⁺)-SQC condition. If this condition fails for a specific data point, the proposed QCRS algorithm may not be able to find the optimal sigma value. Therefore, the effectiveness of the algorithm is contingent on the validity of this assumption.

Computational cost: While the proposed algorithm aims to be efficient, it still introduces additional computational overhead compared to traditional randomized smoothing techniques. Furthermore, randomized smoothing itself is computationally expensive. As a result, the overall computational cost of the proposed method may limit its applicability in resource-constrained environments or real-time applications. Additionally, tuning the hyperparameters in QCRS may require additional computational resources and experimentation. Therefore, it is important to carefully consider the computational requirements and practical feasibility of the proposed method in real-world scenarios.

Appendix J Constant Sigma

A consistent sigma may be required during deployment, presenting a limitation and challenge for input-specific RS. The memory bank approach suggested by DDRS is a simple and effective solution and is compatible with our proposed method. With regard to the proposed QCRS, employing a constant sigma during deployment is indeed feasible, though its applicability is confined to particular scenarios. For example, consider binary certified robust classification. (The certification experiment in Cohen’s paper also employs a binary setting: they use p_B = 1 − p_A.) In such cases, our approach can transition from a datapoint-wise perspective to a broader class-wise perspective. Before deployment, we utilize QCRS to estimate an optimized sigma tailored to a specific class, say “cat”. At deployment, we consistently employ this optimized constant sigma value to certify and predict “cat” and “non-cat”. Table 10 illustrates the class-wise ACR of this scenario on CIFAR-10. Each class in Table 10 has a tailored sigma value (the values come from Table 5 in the main paper), and this class-wise constant sigma value is used to certify and predict for the binarized task. This approach substantially increases the certified radii, by approximately 40%, for classes with smaller radii, including cats, birds, and deer. In CIFAR-10, only one class, cars, does not show equally favorable results. This is because “cars” already have an adequate radius with the standard sigma; a larger sigma may lead to misclassifications and zero radii for certain “car” instances.

Furthermore, to illustrate how to use a constant sigma at deployment, we take the binary face-recognition task used to unlock mobile phones as a practical example. When vendors distribute the model weights to individual users, we employ the proposed method to compute a new sigma value tailored to each user. This customization allows us to seamlessly construct a more robust smoothed classifier. In this scenario, every user has their own optimized sigma value, which remains constant. By introducing the sigma value as a new parameter, we can easily improve the certified radii of the system via a single parameter instead of all the parameters in the base classifier. That is, the proposed method incurs negligible additional cost and provides larger radii at deployment compared to other methods, such as grid search or retraining the base classifier. Given the increasing complexity of DNN models and the impracticality of retraining or fine-tuning, our low-cost QCRS emerges as a practical and efficient alternative.
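A minimal deployment sketch (ours; the wrapper and the sigma values below are hypothetical) of this single-extra-parameter idea:

import numpy as np

# Per-user (or per-class) sigma table shipped alongside the shared model
# weights; these values are illustrative only.
USER_SIGMA = {"alice": 0.42, "bob": 0.57}

def smoothed_binary_predict(base_predict, x, sigma, n_samples=1000, rng=None):
    # Majority vote of the binary base classifier under Gaussian noise with
    # the stored constant sigma; the base model weights stay untouched.
    rng = np.random.default_rng() if rng is None else rng
    votes = sum(int(base_predict(x + rng.normal(scale=sigma, size=x.shape)))
                for _ in range(n_samples))
    return int(2 * votes > n_samples)

Updating the robustness of a deployed user then amounts to overwriting one scalar in USER_SIGMA rather than retraining or redistributing the base classifier.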

ACR birds cats deer trucks
σ = 0.5 .317 .250 .217 .705
Class-wise σ .457 .337 .302 .788
Table 10: Class-wise QCRS. We use the constant sigma in Table 5 of the main paper for different classes throughout the binary tasks.