

Incremental Randomized Smoothing Certification

Shubham Ugare$^1$,  Tarun Suresh$^1$,  Debangshu Banerjee$^1$,  Gagandeep Singh$^{1,2}$,  Sasa Misailovic$^1$
$^1$University of Illinois Urbana-Champaign,   $^2$VMware Research
{sugare2, tsuresh3, db21, ggnds, misailo}@illinois.edu
Abstract

Randomized smoothing-based certification is an effective approach for obtaining robustness certificates of deep neural networks (DNNs) against adversarial attacks. This method constructs a smoothed DNN model and certifies its robustness through statistical sampling, but it is computationally expensive, especially when certifying with a large number of samples. Furthermore, when the smoothed model is modified (e.g., quantized or pruned), certification guarantees may not hold for the modified DNN, and recertifying from scratch can be prohibitively expensive.

We present IRS, the first approach for incremental robustness certification for randomized smoothing. We show how to reuse the certification guarantees for the original smoothed model to certify an approximated model with very few samples. IRS significantly reduces the computational cost of certifying modified DNNs while maintaining strong robustness guarantees. We experimentally demonstrate the effectiveness of our approach, showing up to 4.1x certification speedup over certification that applies randomized smoothing to the approximate model from scratch.

1 Introduction

Ensuring the robustness of deep neural networks (DNNs) to input perturbations is gaining increased attention from both users and regulators in various application domains Bojarski et al. (2016); Amato et al. (2013); Julian et al. (2018); ISO. Of the many techniques for obtaining robustness certificates, statistical methods currently offer the greatest scalability. Randomized smoothing (RS) is a popular statistical certification method that constructs a smoothed model $g$ from a base network $f$ under noise Cohen et al. (2019). To certify the model $g$ on an input, RS certification checks whether the estimated lower bound on the probability of the top class is greater than the upper bound on the probability of the runner-up class (with high confidence). RS certification computes the certified accuracy of the DNN on a set of test inputs as a proxy for DNN robustness. However, despite its effectiveness, RS-based certification can be computationally expensive, as it requires DNN inference on a large number of corruptions per input.

The high cost of certification complicates the DNN deployment process, which has become increasingly iterative: networks are often modified post-training to improve their execution time and/or accuracy. In particular, deploying DNNs on real-world systems with bounded computing resources (e.g., edge devices or GPUs with limited memory) has led to various techniques for approximating DNNs. Common approximation techniques include quantization – reducing the numerical precision of weights (Fiesler et al., 1990; Balzer et al., 1991) – and pruning – removing weights that have minimal impact on accuracy (Janowsky, 1989; Reed, 1993).

Common to all of these approximations is that the network behavior (e.g., the classifications) remains the same on most inputs, its architecture does not change, and many weights are only slightly changed. When a user seeks to select a robust and accurate DNN from these possible approximations, RS needs to be performed to compute the robustness of all candidate networks. For instance, in the context of approximation tuning, there are multiple choices for approximation where different quantization or pruning strategies are applied at different layers. Tools such as Chen et al. (2018b; a); Sharif et al. (2019); Zhao et al. (2023) use approximations iteratively and test the network at each step. To ensure DNN robustness when using such tools, one would need to check certified accuracy, computed using RS on test data in each step. However, performing RS to compute certified accuracy from scratch can take hours as shown in our experiments even for a single network (with only 500 test images). Therefore, a major encumbrance of almost all existing RS-based certification practices in the above setting, is that the expensive certification needs to be re-run from scratch for each approximate network. Overcoming this main limitation requires addressing the following fundamental problem:

How can we leverage the information generated while certifying a given network to speed up the certification of similar networks?

This Work.

We present the first incremental RS-based certification framework, called Incremental Randomized Smoothing (IRS), to answer this question. The primary objective of our work is to improve the sample complexity of the certification process for similar networks on a predefined test set. Improved sample complexity results in an overall speedup in certification, and it reduces the energy requirement and memory footprint of the certification. Given a network $f$ and its smoothed version $g$, and a modified network $f^p$ with its smoothed version $g^p$, IRS incrementally certifies the robustness of $g^p$ by reusing the information from the execution of RS certification on $g$.

IRS optimizes the process of certifying the robustness of the smoothed classifier $g^p$ on an input $x$ by estimating the disparity $\zeta_x$ – the upper bound on the probability that the outputs of $f$ and $f^p$ are distinct. Our new algorithm is based on three key insights about disparity:

  1. Common approximations yield small $\zeta_x$ values – for instance, it is below 0.01 for int8 quantization for multiple large networks in our experiments.

  2. Estimating $\zeta_x$ through a binomial confidence interval requires fewer samples because it is close to 0 – it is, therefore, less expensive to certify with this probability than to work directly with the lower and upper probability bounds in the original RS algorithm.

  3. We can leverage $\zeta_x$ alongside the bounds in the certified radius of $g$ around $x$ to compute the certified radius for $g^p$ – thus soundly reusing the samples from certifying $g$.

We extensively evaluate the performance of IRS when applying several common DNN approximations such as pruning and quantization on state-of-the-art DNNs on CIFAR10 (ResNet-20, ResNet-110) and ImageNet (ResNet-50) datasets.

Contributions. The main contributions of this paper are:

  • We propose a novel concept of incremental RS certification of the robustness of the updated smoothed classifier by reusing the certification guarantees for the original smoothed classifier.

  • We design IRS, the first algorithm for incremental RS, which efficiently computes the certified radius of the updated smoothed classifier.

  • We present an extensive evaluation of IRS, demonstrating speedups of up to 4.1x over the standard non-incremental RS baseline on state-of-the-art classification models.

2 Background

Randomized Smoothing.

Let $f:\mathbb{R}^{m}\to\mathcal{Y}$ be an ordinary classifier. A smoothed classifier $g:\mathbb{R}^{m}\to\mathcal{Y}$ can be obtained by taking the most likely result of $f(x+\epsilon)$ where $\epsilon\sim\mathcal{N}(0,\sigma^{2}I)$:

$$g(x):=\operatorname*{arg\,max}_{c\in\mathcal{Y}}\mathbb{P}_{\epsilon}(f(x+\epsilon)=c)$$

The smoothed network $g$ satisfies the following guarantee for a single network $f$:

Theorem 1 (From Cohen et al. (2019)). Suppose $c_{A}\in\mathcal{Y}$ and $\underline{p_{A}},\overline{p_{B}}\in[0,1]$. If

$$\mathbb{P}_{\epsilon}(f(x+\epsilon)=c_{A})\geq\underline{p_{A}}\geq\overline{p_{B}}\geq\max_{c\neq c_{A}}\mathbb{P}_{\epsilon}(f(x+\epsilon)=c),$$

then $g(x+\delta)=c_{A}$ for all $\delta$ satisfying $\|\delta\|_{2}\leq\frac{\sigma}{2}(\Phi^{-1}(\underline{p_{A}})-\Phi^{-1}(\overline{p_{B}}))$, where $\Phi^{-1}$ denotes the inverse of the standard Gaussian CDF.

Algorithm 1 RS certification Cohen et al. (2019)

Inputs: $f$: DNN, $\sigma$: standard deviation, $x$: input to the DNN, $n_{0}$: number of samples to predict the top class, $n$: number of samples for computing $\underline{p_{A}}$, $\alpha$: confidence parameter

1: function Certify($f,\sigma,x,n_{0},n,\alpha$)
2:   $counts_{0}\leftarrow\text{SampleUnderNoise}(f,x,n_{0},\sigma)$
3:   $\hat{c}_{A}\leftarrow$ top index in $counts_{0}$
4:   $counts\leftarrow\text{SampleUnderNoise}(f,x,n,\sigma)$
5:   $\underline{p_{A}}\leftarrow\text{LowerConfidenceBound}(counts[\hat{c}_{A}],n,1-\alpha)$
6:   if $\underline{p_{A}}>\frac{1}{2}$ then
7:     return prediction $\hat{c}_{A}$ and radius $\sigma\cdot\Phi^{-1}(\underline{p_{A}})$
8:   else
9:     return ABSTAIN

Computing the exact probabilities $\mathbb{P}_{\epsilon}(f(x+\epsilon)=c)$ is generally intractable. Thus, for practical applications, CERTIFY Cohen et al. (2019) (Algorithm 1) utilizes sampling: First, it takes $n_{0}$ samples to determine the majority class, then $n$ samples to compute a lower bound $\underline{p_{A}}$ on the success probability with confidence $1-\alpha$ via the Clopper-Pearson lemma Clopper and Pearson (1934). If $\underline{p_{A}}>0.5$, we set $\overline{p_{B}}=1-\underline{p_{A}}$ and obtain the radius $R=\sigma\cdot\Phi^{-1}(\underline{p_{A}})$ via Theorem 1; otherwise we return ABSTAIN.
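For concreteness, the following is a minimal, runnable Python sketch of CERTIFY; the classifier interface (a PyTorch model returning logits), the batching, and the helper names are our assumptions, not part of the original pseudocode.

```python
# A sketch of CERTIFY (Algorithm 1); `f`, batching, and `num_classes`
# are our assumptions about the classifier interface.
import torch
from scipy.stats import beta, norm

def sample_under_noise(f, x, n, sigma, num_classes, batch=1000):
    """Count the classes f predicts on n Gaussian corruptions of x."""
    counts = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for i in range(0, n, batch):
            b = min(batch, n - i)
            noise = torch.randn(b, *x.shape) * sigma
            preds = f(x.unsqueeze(0) + noise).argmax(dim=1)
            counts += torch.bincount(preds, minlength=num_classes)
    return counts

def lower_confidence_bound(k, n, alpha):
    """One-sided (1 - alpha) Clopper-Pearson lower bound for Binomial(n, p)."""
    return beta.ppf(alpha, k, n - k + 1) if k > 0 else 0.0

def certify(f, x, sigma, n0, n, alpha, num_classes):
    counts0 = sample_under_noise(f, x, n0, sigma, num_classes)
    c_hat = counts0.argmax().item()
    counts = sample_under_noise(f, x, n, sigma, num_classes)
    p_a = lower_confidence_bound(counts[c_hat].item(), n, alpha)
    if p_a > 0.5:
        return c_hat, sigma * norm.ppf(p_a)  # certified L2 radius
    return None, 0.0                         # ABSTAIN
```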

DNN approximation.

For deployment on edge devices, DNN weights often need to be quantized to an appropriate datatype. DNN approximations are used to compress the model size at deployment time, allowing inference speedup and energy savings without significant accuracy loss. While IRS can work with most of these approximations, for the evaluation we focus on quantization and pruning, as these are the most common ones (Zhou et al., 2017; Frankle and Carbin, 2019).

3 Incremental Randomized Smoothing

Figure 1: Workflow of IRS from left to right. IRS takes the classifier $f$ and input $x$. IRS reuses the $\underline{p_{A}}$ and $\overline{p_{B}}$ estimates computed for $f$ on $x$ by RS, and estimates $\zeta_{x}$ from $f$ and $f^{p}$. For the smoothed classifier $g^{p}$ obtained from any of the approximate classifiers $f^{p}$, it computes the certified radius by combining $\underline{p_{A}}$ and $\overline{p_{B}}$ with $\zeta_{x}$.

Figure 1 illustrates the high-level idea behind IRS. It takes as input the classifier $f$, the updated classifier $f^{p}$, and an input $x$. Let $g$ and $g^{p}$ denote the smoothed networks obtained from $f$ and $f^{p}$ using RS, respectively. IRS reuses the $\underline{p_{A}}$ and $\overline{p_{B}}$ estimates computed for $g$ to compute the certified radius for $g^{p}$.

3.1 Motivation

Insight 1: Similarity in approximate networks. We observe that for many practical approximations, $f$ and $f^{p}$ produce the same result on most inputs.

Table 1: Average $\zeta_{x}$ with $n=1000$ samples for various approximations.

           CIFAR10       ImageNet
           ResNet-110    ResNet-50
int8       0.009         0.006
prune10    0.010         0.008

In this experiment, we estimate the disparity between $f$ and $f^{p}$ on Gaussian corruptions of the input $x$. We compute an upper confidence bound $\zeta_{x}$ such that $\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^{p}(x+\epsilon))\leq\zeta_{x}$ for $\epsilon\sim\mathcal{N}(0,\sigma^{2}I)$.

Table 1 presents the empirical average $\zeta_{x}$ for int8 quantization and for pruning the $10\%$ lowest-magnitude weights, for some of the networks in our experiments, computed over $500$ inputs. We compute each $\zeta_{x}$ value as the binomial upper confidence limit using the Clopper and Pearson (1934) method with $n=1000$ samples and $\sigma=1$. The results show that the $\zeta_{x}$ value is quite small in all cases.

Insight 2: Sample reduction through $\zeta_{x}$ estimation. We demonstrate that $\zeta_{x}$ estimation for approximate networks is more efficient than running certification from scratch. Fig. 2 shows that for a fixed target error $\chi$, confidence $(1-\alpha)$, and estimation technique, the number of samples required for estimation peaks when the actual parameter value is around $0.5$ and is smallest around the boundaries. For example, when $\chi=0.5\%$ and $\alpha=0.01$, estimating the unknown binomial proportion takes $41{,}500$ samples if the actual parameter value is $0.05$, while achieving the same target error and confidence takes $216{,}900$ samples ($5.22$x more) if the actual parameter value is $0.5$. As observed above, $\zeta_{x}$'s value for many practical approximations is close to 0.

Leveraging the observation shown in Fig. 2, and given that the actual value of $\zeta_{x}$ is close to 0, estimating $\zeta_{x}$ with existing binomial proportion estimation techniques is efficient and requires a small number of samples. In Appendix A.7, we show the distribution of $\underline{p_{A}}$ and $\overline{p_{B}}$ for various cases. We see that $\underline{p_{A}}$ and $\overline{p_{B}}$ do not always lie close to 0 or 1 and have a more dispersed distribution; estimating them therefore requires more samples. Prior work Thulin (2013) has theoretically shown that the expected length of the Clopper-Pearson confidence interval follows a similar trend as in Fig. 2, supporting our observation. We show in Appendix A.1 that this observation is not contingent on a specific estimation method and holds for other popular estimation techniques, e.g., (Wilson, 1927); (Agresti and Coull, 1998). A small empirical check appears after Fig. 2 below.

Figure 2: The number of samples for the Clopper-Pearson method to achieve a target error $\chi$ with confidence $(1-\alpha)$.
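The following small script (our sketch, using statsmodels' exact Clopper-Pearson interval via method="beta") reproduces the qualitative trend: far fewer samples are needed when the true proportion is near 0 or 1 than near 0.5. The grid-search helper and its parameters are our own illustration, not the paper's code.

```python
# Illustration of Insight 2: samples needed for a Clopper-Pearson interval
# of half-width chi, as a function of the true proportion (our sketch).
from statsmodels.stats.proportion import proportion_confint

def samples_needed(p_true, chi, alpha, step=500, max_n=10**6):
    """Smallest n (on a grid) whose Clopper-Pearson interval has
    half-width <= chi, idealizing the observed count as round(p_true * n)."""
    n = step
    while n <= max_n:
        k = round(p_true * n)
        lo, hi = proportion_confint(k, n, alpha=alpha, method="beta")
        if (hi - lo) / 2 <= chi:
            return n
        n += step
    return None

for p in [0.01, 0.05, 0.25, 0.5]:
    print(p, samples_needed(p, chi=0.005, alpha=0.01))
# The required sample count peaks near p = 0.5 and shrinks toward 0 and 1.
```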

Insight 3: Computing the approximate network's certified radius using $\zeta_{x}$. For certification of the approximate network $g^{p}$, our main insight is that estimating $\zeta_{x}$ and using that value to compute the certified radius is more efficient than computing the RS certified radius from scratch. The next theorem shows how to use the estimated value of $\zeta_{x}$ to certify $g^{p}$ (the proof is in Appendix A.2):

Theorem 2. If a classifier $f^{p}$ is such that for all $x\in\mathbb{R}^{m}$, $\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^{p}(x+\epsilon))\leq\zeta_{x}$, classifier $f$ satisfies $\mathbb{P}_{\epsilon}(f(x+\epsilon)=c_{A})\geq\underline{p_{A}}\geq\overline{p_{B}}\geq\max_{c\neq c_{A}}\mathbb{P}_{\epsilon}(f(x+\epsilon)=c)$, and $\underline{p_{A}}-\zeta_{x}\geq\overline{p_{B}}+\zeta_{x}$, then $g^{p}$ satisfies $g^{p}(x+\delta)=c_{A}$ for all $\delta$ satisfying $\|\delta\|_{2}\leq\frac{\sigma}{2}(\Phi^{-1}(\underline{p_{A}}-\zeta_{x})-\Phi^{-1}(\overline{p_{B}}+\zeta_{x}))$.

Theorem 1 considers standard RS for a single network. Our Theorem 2 shows how to use the estimated value of $\zeta_{x}$ to transfer the certification guarantees across the two networks $f$ and $f^{p}$.
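As a quick illustration, the radius of Theorem 2 can be computed with a one-line helper (a sketch; the symbol names mirror the text):

```python
# Certified radius of Theorem 2 for the approximate network (a sketch).
from scipy.stats import norm

def irs_radius(sigma, p_a_lower, p_b_upper, zeta):
    """sigma/2 * (Phi^{-1}(p_A - zeta) - Phi^{-1}(p_B + zeta)); only valid
    when the margin condition p_A - zeta >= p_B + zeta holds."""
    assert p_a_lower - zeta >= p_b_upper + zeta
    return 0.5 * sigma * (norm.ppf(p_a_lower - zeta) - norm.ppf(p_b_upper + zeta))

# Example: p_A = 0.9, p_B = 0.1, zeta = 0.01 shrinks the radius only slightly.
print(irs_radius(sigma=1.0, p_a_lower=0.9, p_b_upper=0.1, zeta=0.01))
```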

3.2 IRS Certification Algorithm

Algorithm 2 IRS algorithm: Certification with cache

Inputs: $f^{p}$: DNN obtained from approximating $f$, $\sigma$: standard deviation, $x$: input to the DNN, $n_{p}$: number of Gaussian samples used for certification, $\mathcal{C}_{f}$: stores the information to be reused from certification of $f$, $\alpha$ and $\alpha_{\zeta}$: confidence parameters, $\gamma$: threshold hyperparameter to switch between estimation methods

1: function CertifyIRS($f^{p},\sigma,x,n_{p},\mathcal{C}_{f},\alpha,\alpha_{\zeta},\gamma$)
2:   $\hat{c}_{A}\leftarrow$ top index in $\mathcal{C}_{f}[x]$
3:   $\underline{p_{A}}\leftarrow$ lower confidence bound of $f$ from $\mathcal{C}_{f}[x]$
4:   if $\underline{p_{A}}<\gamma$ then
5:     $\zeta_{x}\leftarrow\text{EstimateZeta}(f^{p},\sigma,x,n_{p},\mathcal{C}_{f},\alpha_{\zeta})$
6:     if $\underline{p_{A}}-\zeta_{x}>\frac{1}{2}$ then
7:       return prediction $\hat{c}_{A}$ and radius $\sigma\Phi^{-1}(\underline{p_{A}}-\zeta_{x})$
8:   else
9:     $counts\leftarrow\text{SampleUnderNoise}(f^{p},x,n_{p},\sigma)$
10:    $p^{\prime}_{A}\leftarrow\text{LowerConfidenceBound}(counts[\hat{c}_{A}],n_{p},1-(\alpha+\alpha_{\zeta}))$
11:    if $p^{\prime}_{A}>\frac{1}{2}$ then
12:      return prediction $\hat{c}_{A}$ and radius $\sigma\Phi^{-1}(p^{\prime}_{A})$
13:  return ABSTAIN

Algorithm 2 presents the pseudocode for the IRS algorithm, which extends RS certification from Algorithm 1. The algorithm takes the modified classifier $f^{p}$ and certifies the robustness of $g^{p}$ around $x$. The input $n_{p}$ denotes the number of Gaussian corruptions used by the algorithm.

The IRS algorithm utilizes a cache $\mathcal{C}_{f}$, which stores information obtained from the RS execution of the classifier $f$ for each input $x$. The cached information is crucial for the operation of IRS. $\mathcal{C}_{f}$ stores the top predicted class index $\hat{c}_{A}$ and its lower confidence bound $\underline{p_{A}}$ for $f$ on input $x$.

The standard RS algorithm takes a conservative value of $\overline{p_{B}}$ by letting $\overline{p_{B}}=1-\underline{p_{A}}$. This works reasonably well in practice and simplifies the certified radius $\frac{\sigma}{2}(\Phi^{-1}(\underline{p_{A}})-\Phi^{-1}(\overline{p_{B}}))$ to $\sigma\Phi^{-1}(\underline{p_{A}})$. We make a similar choice in IRS, which simplifies the certified radius calculation from $\frac{\sigma}{2}(\Phi^{-1}(\underline{p_{A}}-\zeta_{x})-\Phi^{-1}(\overline{p_{B}}+\zeta_{x}))$ of Theorem 2 to $\sigma\Phi^{-1}(\underline{p_{A}}-\zeta_{x})$, as we state in the next theorem (the proof is in Appendix A.2):

Theorem 3. If $\underline{p_{A}}-\zeta_{x}\geq\frac{1}{2}$, then $\sigma\Phi^{-1}(\underline{p_{A}}-\zeta_{x})\leq\frac{\sigma}{2}(\Phi^{-1}(\underline{p_{A}}-\zeta_{x})-\Phi^{-1}(\overline{p_{B}}+\zeta_{x}))$.
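To see the intuition, with the conservative choice $\overline{p_B}=1-\underline{p_A}$ and the identity $\Phi^{-1}(1-t)=-\Phi^{-1}(t)$, the two radii coincide (our one-line check):

$$\frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A}-\zeta_x)-\Phi^{-1}(1-\underline{p_A}+\zeta_x)\right)=\frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A}-\zeta_x)+\Phi^{-1}(\underline{p_A}-\zeta_x)\right)=\sigma\,\Phi^{-1}(\underline{p_A}-\zeta_x).$$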

As per our Insight 2 (Section 3.1), binomial confidence interval estimation requires fewer samples when the actual probability is close to 0 or 1. IRS can take advantage of this when $\underline{p_{A}}$ is not close to 1. However, when $\underline{p_{A}}$ is close to 1, there is no benefit in using the $\zeta_{x}$-based certified radius for $g^{p}$. Therefore, the algorithm uses a threshold hyperparameter $\gamma$ close to 1 to switch between the certified radius from Theorem 2 and standard RS from Theorem 1.

If $\underline{p_{A}}$ is less than the threshold $\gamma$, then an estimate of $\zeta_{x}$ for the classifiers $f^{p}$ and $f$ is computed using the EstimateZeta function, which we discuss in the next section. If $\underline{p_{A}}-\zeta_{x}$ is greater than $\frac{1}{2}$, then the top predicted class in the cache is returned as the prediction with the radius $\sigma\Phi^{-1}(\underline{p_{A}}-\zeta_{x})$, as computed in Theorem 3.

If $\underline{p_{A}}$ is at least the threshold $\gamma$, then, similar to standard RS, the IRS algorithm draws $n_{p}$ samples of $f^{p}(x+\epsilon)$ by running $n_{p}$ noise-corrupted copies of $x$ through the classifier $f^{p}$. The function $\text{SampleUnderNoise}(f^{p},x,n_{p},\sigma)$ in the pseudocode draws $n_{p}$ samples of noise, $\epsilon_{1}\dots\epsilon_{n_{p}}\sim\mathcal{N}(0,\sigma^{2}I)$, runs each $x+\epsilon_{i}$ through the classifier $f^{p}$, and returns a vector of class counts. If the lower confidence bound $p^{\prime}_{A}$ is greater than $\frac{1}{2}$, the top predicted class is returned as the prediction with a radius based on $p^{\prime}_{A}$.

If the function fails to certify the input in both of the above cases, it returns ABSTAIN.

The hyperparameters $\alpha$ and $\alpha_{\zeta}$ denote the confidence of IRS results. The IRS algorithm's result is correct with confidence at least $1-(\alpha+\alpha_{\zeta})$. For the case $\underline{p_{A}}\geq\gamma$, this holds since we follow the same steps as standard RS. The function $\text{LowerConfidenceBound}(counts[\hat{c}_{A}],n_{p},1-(\alpha+\alpha_{\zeta}))$ in the pseudocode returns a one-sided $1-(\alpha+\alpha_{\zeta})$ lower confidence interval for the Binomial parameter $p$ given a sample $counts[\hat{c}_{A}]\sim\text{Binomial}(n_{p},p)$. We next state the theorem that shows the confidence of IRS results in the other case, when $\underline{p_{A}}<\gamma$ (the proof is in Appendix A.2):

Theorem 4. Suppose $\mathbb{P}_{\epsilon}(f(x+\epsilon)=f^{p}(x+\epsilon))>1-\zeta_{x}$ with confidence at least $1-\alpha_{\zeta}$, and classifier $f$ satisfies $\mathbb{P}_{\epsilon}(f(x+\epsilon)=c_{A})\geq\underline{p_{A}}$ with confidence at least $1-\alpha$. Then for classifier $f^{p}$, $\mathbb{P}_{\epsilon}(f^{p}(x+\epsilon)=c_{A})\geq\underline{p_{A}}-\zeta_{x}$ with confidence at least $1-(\alpha+\alpha_{\zeta})$.
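Putting the pieces together, a runnable sketch of Algorithm 2 could look as follows; it reuses sample_under_noise and lower_confidence_bound from the CERTIFY sketch above, assumes the cache entry is a plain dict (our choice), and defers estimate_zeta to the sketch in Section 3.3.

```python
# A sketch of CertifyIRS (Algorithm 2); the dict-based cache layout is
# our assumption. estimate_zeta is sketched in Section 3.3.
from scipy.stats import norm

def certify_irs(fp, sigma, x, n_p, cache_entry, alpha, alpha_z, gamma,
                num_classes):
    c_hat = cache_entry["top_class"]    # cached top class of f on x
    p_a = cache_entry["p_a_lower"]      # cached lower bound for f on x
    if p_a < gamma:
        zeta = estimate_zeta(fp, sigma, x, n_p, cache_entry, alpha_z)
        if p_a - zeta > 0.5:
            return c_hat, sigma * norm.ppf(p_a - zeta)   # Theorem 3 radius
    else:
        counts = sample_under_noise(fp, x, n_p, sigma, num_classes)
        p_a2 = lower_confidence_bound(counts[c_hat].item(), n_p,
                                      alpha + alpha_z)
        if p_a2 > 0.5:
            return c_hat, sigma * norm.ppf(p_a2)         # standard RS radius
    return None, 0.0  # ABSTAIN
```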

3.3 Estimating the Upper Confidence Bound $\zeta_{x}$

In this section, we present our method for estimating $\zeta_{x}$ such that $\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^{p}(x+\epsilon))\leq\zeta_{x}$ with high confidence (Algorithm 3). We use the Clopper-Pearson Clopper and Pearson (1934) method to estimate the upper confidence bound $\zeta_{x}$.

Algorithm 3 Estimate $\zeta_{x}$

Inputs: $f^{p}$: DNN obtained from approximating $f$, $\sigma$: standard deviation, $x$: input to the DNN, $n_{p}$: number of Gaussian samples used for estimating $\zeta_{x}$, $\mathcal{C}_{f}$: stores the information to be reused from certification of $f$, $\alpha_{\zeta}$: confidence parameter
Output: Estimated value of $\zeta_{x}$

1: function EstimateZeta($f^{p},\sigma,x,n_{p},\mathcal{C}_{f},\alpha_{\zeta}$)
2:   $n_{\Delta}\leftarrow 0$
3:   seeds $\leftarrow$ seeds for original samples from $\mathcal{C}_{f}[x]$
4:   predictions $\leftarrow$ $f$'s predictions on samples from $\mathcal{C}_{f}[x]$
5:   for $i\in\{1,\dots,n_{p}\}$ do
6:     $\epsilon\sim\mathcal{N}(0,\sigma^{2}I)$ using seeds[$i$]
7:     $c_{f}\leftarrow$ predictions[$i$]
8:     $c_{f^{p}}\leftarrow f^{p}(x+\epsilon)$
9:     $n_{\Delta}\leftarrow n_{\Delta}+\mathbb{I}(c_{f}\neq c_{f^{p}})$
10:  return UpperConfidenceBound($n_{\Delta}$, $n_{p}$, $1-\alpha_{\zeta}$)

We store the seeds used for randomly generating Gaussian samples while certifying the function $f$ in the cache, and we reuse these seeds to generate the same Gaussian samples. seeds[$i$] stores the seed used for generating the $i$-th sample in the RS execution of $f$, and predictions[$i$] stores the prediction of $f$ on the corresponding $x+\epsilon$. We evaluate $f^{p}$ on each corruption $\epsilon$ generated from the seeds and match its predictions to those of $f$. $c_{f}$ and $c_{f^{p}}$ represent the top class predicted by $f$ and $f^{p}$, respectively. $n_{\Delta}$ counts the corruptions $\epsilon$ on which $f$ and $f^{p}$ do not match.

The function UpperConfidenceBound($n_{\Delta}$, $n_{p}$, $1-\alpha_{\zeta}$) in the pseudocode returns a one-sided $1-\alpha_{\zeta}$ upper confidence interval for the Binomial parameter $p$ given a sample $n_{\Delta}\sim\text{Binomial}(n_{p},p)$. We compute this upper confidence bound using the Clopper-Pearson method, analogously to how the lower confidence bound is computed in the standard RS Algorithm 1. It is sound since the Clopper-Pearson method is conservative.
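In code, the Clopper-Pearson upper limit is one line (our sketch, via the Beta-quantile form of the interval):

```python
# One-sided (1 - alpha) Clopper-Pearson upper bound for Binomial(n, p),
# given k observed disagreements (a sketch).
from scipy.stats import beta

def upper_confidence_bound(k, n, alpha):
    return beta.ppf(1 - alpha, k + 1, n - k) if k < n else 1.0
```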

Reusing the seeds for generating noisy samples does not change the certified radius and is 2x faster compared to a naive Monte Carlo estimation of $\zeta_{x}$ with fresh Gaussian samples. Storing the seeds in the cache results in a small memory overhead (less than 2 MB for our largest benchmark). We use the same Gaussian samples for the estimations of $\underline{p_{A}}$ and $\zeta_{x}$. This is equivalent to estimating two functions, $p(X)$ and $q(X)$, of a random variable $X$, where the same set of samples of $X$ can be employed for their respective estimations. Theorem 4 makes no assumptions about the independence of the estimates of $\underline{p_{A}}$ and $\zeta_{x}$, so we can soundly reuse the same Gaussian samples for both estimations.
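The following sketch implements Algorithm 3 with seed reuse via torch.Generator; the cache layout (per-sample seeds and $f$'s cached predictions) is our assumption about how $\mathcal{C}_{f}$ is stored.

```python
# A sketch of EstimateZeta (Algorithm 3) with deterministic seed reuse.
import torch

def estimate_zeta(fp, sigma, x, n_p, cache_entry, alpha_z):
    n_delta = 0
    with torch.no_grad():
        for i in range(n_p):
            # Regenerate exactly the noise sample used when certifying f.
            gen = torch.Generator().manual_seed(cache_entry["seeds"][i])
            eps = torch.randn(x.shape, generator=gen) * sigma
            c_fp = fp((x + eps).unsqueeze(0)).argmax(dim=1).item()
            if c_fp != cache_entry["predictions"][i]:
                n_delta += 1
    return upper_confidence_bound(n_delta, n_p, alpha_z)
```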

4 Experimental Methodology

Networks and Datasets. We evaluate IRS on CIFAR-10 (Krizhevsky et al.) and ImageNet (Deng et al., 2009). On each dataset, we use several classifiers, each trained for a different $\sigma$. For an experiment that adds Gaussian corruptions with standard deviation $\sigma$ to the input, we use the network that was trained with Gaussian augmentation with variance $\sigma^{2}$. On CIFAR-10, our base classifiers are 20-layer and 110-layer residual networks. On ImageNet, our base classifier is a ResNet-50.

Network Approximations. We evaluate IRS on multiple approximations. We consider float16 (fp16), bfloat16 (bf16), and int8 quantizations (Section 5.1). We show the effectiveness of IRS on the pruning approximation in Section 5.3. For int8 quantization, we use the dynamic per-channel quantization mode from the PyTorch library (Paszke et al., 2019). For float16 and bfloat16 quantization, we change the data type of the DNN weights from float32 to the respective types. We perform float32, float16, and bfloat16 inferences on the GPU and int8 inferences on the CPU, since PyTorch does not yet support int8 quantization on GPUs (PyTorch). For the pruning experiment, we perform lowest-weight-magnitude (LWM) pruning. The top-1 accuracy of the networks used in the evaluation and of the approximate networks is discussed in Appendix A.3.

Experimental Setup. We ran experiments on a 48-core Intel Xeon Silver 4214R CPU with 2 NVIDIA RTX A5000 GPUs. IRS is implemented in Python and uses PyTorch 2.0.1 (Paszke et al., 2019).

Hyperparameters. We use confidence parameters $\alpha=0.001$ for the certification of $g$ and $\alpha_{\zeta}=0.001$ for the estimation of $\zeta_{x}$. To establish a fair comparison, we set the baseline confidence to $\alpha_{b}=\alpha+\alpha_{\zeta}=0.002$. This choice ensures that both the baseline and IRS provide certified radii with equal confidence. We use grid search to choose an effective value for $\gamma$; a detailed description of our hyperparameter search and its results appears in Section 5.4.

Average Certified Radius. We record the certified radius $r$ when the certification algorithm does not abstain and returns the correct class with radius $r$, for both IRS (Algorithm 2) and the baseline (Algorithm 1). In all other cases, we set the certified radius $r=0$. We compute the average certified radius (ACR) by taking the mean of the certified radii computed for inputs in the test set. A higher ACR indicates stronger robustness certification guarantees.
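For clarity, the metric is simply a mean over per-input radii with abstentions and misclassifications counted as zero (our sketch):

```python
# Average certified radius (ACR) over a test set; `results` holds
# (radius, correct) pairs, one per input (our sketch).
def average_certified_radius(results):
    return sum(r if correct else 0.0 for r, correct in results) / len(results)

print(average_certified_radius([(0.5, True), (0.0, False), (1.2, True)]))
```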

Speedup. IRS is applicable while certifying multiple similar networks, where it can reuse the certification of one of the networks for faster certification of all other similar networks. We demonstrate the effectiveness of IRS by comparing IRS’s certification time for these other similar networks with the baseline certification from scratch. We do not include the certification time of the first network in the comparison as it adds the same time for both IRS and baseline.

5 Experimental Results

We now present our main evaluation results. We consider the float32 representation of the DNN as $f$ and a particular approximation as $f^{p}$. However, IRS can be used with any similar $f$ and $f^{p}$, e.g., where $f$ is an int8-quantized network and $f^{p}$ is the float32 network. In all of our experiments, we follow a specific procedure:

  1. We certify the smoothed classifier $g$ using standard RS with a sample size of $n$.

  2. We approximate the base classifier $f$ with $f^{p}$.

  3. Using IRS, we certify the smoothed classifier $g^{p}$ by employing Algorithm 2 and utilizing the cached information $\mathcal{C}_{f}$ obtained from the certification of $g$.

We compare IRS to the baseline that uses standard non-incremental RS (Algorithm 1) to certify $g^{p}$. Our results compare the ACR and certification time between IRS and the baseline for various $n_{p}$ values.

5.1 Effectiveness of IRS

We compare the ACR and the certification time of the baseline and IRS for the common int8 quantization. We use $n=10^{5}$ samples for the certification of $g$. For certifying $g^{p}$, we consider $n_{p}$ values from $\{5\%,\dots,50\%\}$ of $n$ and $\sigma=1$. We perform experiments on 500 images and compute the total time for certifying $g^{p}$.

(a) ResNet-110 on CIFAR-10
(b) ResNet-50 on ImageNet
Figure 3: Total certification time versus ACR with $\sigma=1.0$.

Figure 3 presents the comparison between IRS and RS for int8 quantization. The x-axis displays the ACR and the y-axis the certification time. The plot consists of 10 markers each for IRS and the baseline, each representing a specific value of $n_{p}$; the marker coordinates denote the ACR and the time for an experiment. Expectedly, the higher the value of $n_{p}$, the higher the average time and ACR. In all cases, IRS consistently takes less certification time to obtain the same ACR.

Figure 3(a), for ResNet-110 on CIFAR10, shows that IRS reduced the certification time from 2.91 hours (baseline) to 0.71 hours, resulting in time savings of 2.12 hours (4.1x faster). Moreover, we see that IRS achieves an ACR of more than 0.565, whereas the baseline does not reach this ACR for any of the npn_{p} values in our experiments.

Figure 3(b), for ResNet-50 on ImageNet, shows that for certifying an ACR of 0.875, IRS substantially reduced certification time from 27.82 hours (baseline) to 22.45 hours, saving approximately 5.36 hours (1.24x faster). Additionally, IRS achieved an ACR of 0.90 and reduced the certification time from 53.93 hours (baseline) to 40.58 hours, resulting in substantial time savings of 13.35 hours (1.33x faster).

5.2 IRS speedups on different quantizations

Table 2: Average IRS speedup for combinations of quantizations and $\sigma$'s.

Dataset    Architecture   $\sigma$   fp16    bf16    int8
CIFAR10    ResNet-20      0.25       1.37x   1.29x   1.3x
CIFAR10    ResNet-20      0.5        1.79x   1.7x    1.77x
CIFAR10    ResNet-20      1.0        2.85x   2.41x   2.65x
CIFAR10    ResNet-110     0.25       1.42x   1.35x   1.29x
CIFAR10    ResNet-110     0.5        1.97x   1.74x   1.77x
CIFAR10    ResNet-110     1.0        3.02x   2.6x    2.6x
ImageNet   ResNet-50      0.5        1.2x    1.14x   1.19x
ImageNet   ResNet-50      1.0        1.43x   1.31x   1.43x
ImageNet   ResNet-50      2.0        2.04x   1.69x   1.80x

Next, we study whether IRS can handle other kinds of quantization. We perform experiments for 10 different values of $n_{p}$ along with distinct approximations and 3 values of $\sigma$. Since this would take months of experiment time with the $n$ and $n_{p}$ values from Section 5.1, for the rest of the experiments we use smaller values for these parameters and compute the relative speedup of IRS over the baseline. We use $n=10^{4}$ samples for the certification of $g$. For certifying $g^{p}$, we consider $n_{p}$ values from $\{1\%,\dots,10\%\}$ of $n$. For CIFAR10, we consider $\sigma\in\{0.25,0.5,1.0\}$, and for ImageNet, we consider $\sigma\in\{0.5,1.0,2.0\}$, as in previous work Cohen et al. (2019). We validated that the speedups for int8 quantization in this section for ResNet-50-ImageNet and ResNet-110-CIFAR10 are similar to those in Section 5.1.

To quantify IRS’s average speedup over the baseline, we employ an approximate area under the curve (AOC) analysis. Specifically, we plot the certification time against the ACR. In most cases, IRS certifies a larger ACR compared to the baseline, resulting in regions on the x-axis where IRS exists but the baseline does not. To ensure a conservative estimation, we calculate the speedup only within the range where both IRS and the baseline exist. We determine the speedup by computing the ratio of the AOC for IRS to the AOC for the baseline within this common range. Table 2 summarizes the average speedups for all quantization experiments.

We observe that IRS obtains a larger speedup for smoothing with larger $\sigma$, since the $\underline{p_{A}}$ values are smaller on average; Appendix A.7 presents further justification for this observation. Appendix A.9 presents further experiments with all combinations of DNNs, $\sigma$, and quantizations.

5.3 IRS speedups on Pruned Models

Table 3: Average IRS speedup for combinations of pruning ratios and $\sigma$'s.

Dataset    Architecture   $\sigma$   Prune 5%   Prune 10%   Prune 20%
CIFAR10    ResNet-20      0.25       1.3x       1.25x       0.99x
CIFAR10    ResNet-20      0.5        1.63x      1.39x       1.13x
CIFAR10    ResNet-20      1.0        2.5x       2.09x       1.39x
CIFAR10    ResNet-110     0.25       1.35x      1.24x       1.04x
CIFAR10    ResNet-110     0.5        1.83x      1.6x        1.23x
CIFAR10    ResNet-110     1.0        2.7x       2.25x       1.63x
ImageNet   ResNet-50      0.5        1.19x      1.04x       0.87x
ImageNet   ResNet-50      1.0        1.36x      1.15x       0.87x
ImageNet   ResNet-50      2.0        1.87x      1.54x       1.01x

In this experiment, we study IRS's ability to certify beyond quantized models. We employ $l_{1}$ unstructured pruning, which prunes the given fraction of lowest-$l_{1}$-magnitude weights from the DNN. Table 3 presents the average IRS speedup for DNNs obtained by pruning $5\%$, $10\%$, and $20\%$ of the weights. The speedups range from 0.99x to 2.7x. As the DNN is pruned more aggressively, IRS's speedup is expected to be lower, due to the higher values of $\zeta_{x}$ associated with aggressive pruning. In Appendix A.4, we provide average $\zeta_{x}$ values for all approximations. Compared to pruning, quantization typically yields smaller $\zeta_{x}$ values, making IRS more effective for quantization.

5.4 Ablation Studies

Next, we show the effect of $\gamma$ on ACR. In Appendix A.5, we show the IRS speedup for distinct values of $n$.

Table 4: ACR for each $\gamma$.

$\gamma$   CIFAR10 ResNet-20   CIFAR10 ResNet-110   ImageNet ResNet-50
0.9        0.438               0.436                0.458
0.95       0.442               0.439                0.464
0.975      0.445               0.441                0.465
0.99       0.446               0.443                0.466
0.995      0.445               0.442                0.467
0.999      0.444               0.442                0.464

Sensitivity to threshold $\gamma$. For each DNN architecture, we chose the hyperparameter $\gamma$ by running IRS to certify the int8-quantized DNN on a small subset of the validation set images and comparing the ACR. The choice of $\gamma$ has no effect on certification time, as we perform $n_{p}$ inferences in both cases, $\underline{p_{A}}<\gamma$ and $\underline{p_{A}}\geq\gamma$. We use the same $\gamma$ for each DNN irrespective of the approximation and $\sigma$. We use grid search to choose the best value of $\gamma$ from the set $\{0.9,0.95,0.975,0.99,0.995,0.999\}$. Table 4 presents the ACR obtained for each $\gamma$. We chose $\gamma=0.99$ for the CIFAR10 networks and $\gamma=0.995$ for the ImageNet networks, since these result in the highest ACR.

6 Related Work

Incremental Program Verification. The scalability of traditional program verification has been significantly improved by incremental verification, which has been applied on an industrial scale (Johnson et al., 2013; O’Hearn, 2018; Stein et al., 2021). Incremental program analysis tasks achieve faster analysis of individual commits by reusing partial results (Yang et al., 2009), constraints (Visser et al., 2012), and precision information (Beyer et al., 2013) from previous runs.

Incremental DNN Certification. Several methods have been introduced in recent years to certify the properties of DNNs deterministically (Tjeng et al., 2017; Bunel et al., 2020; Katz et al., 2017; Wang et al., 2021b; Laurel et al., 2022; 2023) and probabilistically (Cohen et al., 2019). Researchers have used incremental certification to speed up DNN certification (Fischer et al., 2022b; Ugare et al., 2022; Wei and Liu, 2021; Ugare et al., 2023) – these works apply complete and incomplete deterministic certification based on formal logic, which cannot scale to, e.g., ImageNet. In contrast, we propose incremental probabilistic certification with Randomized Smoothing, which enables much greater scalability.

Randomized Smoothing. Cohen et al. (2019) introduced the addition of Gaussian noise to achieve $l_{2}$-robustness results. Several extensions of this technique utilize different types of noise distributions and radius calculations to determine certificates for general $l_{p}$-balls. Yang et al. (2020) and Zhang et al. (2020) derived recipes for determining certificates for $p=1,2$, and $\infty$. Lee et al. (2019), Wang et al. (2021a), and Schuchardt et al. (2021) presented extensions to discrete perturbations such as $l_{0}$-perturbations, while Bojchevski et al. (2023), Gao et al. (2020), Levine and Feizi (2020), and Liu et al. (2021) explored extensions to graphs, patches, and point cloud manipulations. Dvijotham et al. (2020) presented theoretical derivations for the application of both continuous and discrete smoothing measures, while Mohapatra et al. (2020) improved certificates by using gradient information. Horváth et al. (2022) used ensembles to improve the certificate.

Beyond norm-ball certificates, Fischer et al. (2020) and Li et al. (2021) presented how geometric operations such as rotation or translation can be certified via Randomized Smoothing. yeh Chiang et al. (2022) and Fischer et al. (2022a) demonstrated how the certificates can be extended from classification to regression (and object detection) and segmentation, respectively. For classification, Jia et al. (2020) extended certificates from just the top-1 class to the top-k classes, while Kumar et al. (2020) certified the confidence of the classifier, not just the top-class prediction. Rosenfeld et al. (2020) used Randomized Smoothing to defend against data poisoning attacks. These RS extensions (using different noise distributions, perturbations, and geometric operations) are orthogonal to the standard RS approach of Cohen et al. (2019). While these extensions have been shown to improve the overall breadth of RS, IRS is complementary to them.

7 Limitations

We showed that IRS is effective at certifying the smoothed version of the approximated DNN. However, there are certain limitations to the effectiveness of IRS. First, the IRS algorithm requires a cache with the top predicted class index, its lower confidence bound, and the seeds for Gaussian corruptions obtained from the RS execution of the original classifier. However, storing this additional information is reasonable since it has negligible memory overhead and is a byproduct of certification (as trustworthy ML matures, we anticipate that this information will be shipped with pre-certified networks for reproducibility purposes).

The smoothing parameter $\sigma$ used in IRS affects its efficiency, with larger values of $\sigma$ generally leading to better results. As a consequence, we observed a smaller speedup when using a smaller value of $\sigma$ (e.g., 0.25 on CIFAR10) compared to a larger value (e.g., 1 on CIFAR10). The value of $\sigma$ offers a trade-off between robustness and accuracy: choosing a larger $\sigma$ can improve robustness but may lead to a loss of accuracy in the model.

IRS targets fast certification while maintaining a sufficiently large radius. Therefore, we considered $n_{p}$ smaller than $50\%$ of $n$ in our evaluation. The IRS certified radius can, however, be smaller than that of non-incremental RS when the user has a larger sample budget. In our experiment in Appendix A.6, we test IRS with larger $n_{p}$ and observe that IRS is better than the baseline for $n_{p}$ less than $70\%$ of $n$. IRS is thus particularly advantageous when computational resources are limited.

8 Conclusion

We propose IRS, the first incremental approach for probabilistic DNN certification. IRS leverages the certification guarantees obtained for the smoothed model to certify a smoothed approximated model with very few samples. Reusing the computation of the original guarantees significantly reduces the computational cost of certification while maintaining strong robustness guarantees. IRS speeds up certification by up to 4.1x over the standard non-incremental RS baseline on state-of-the-art classification models. We anticipate that IRS can be particularly useful for approximation tuning, where users need to analyze the robustness of multiple similar networks. Further, one can easily ship the certification cache to allow other users to modify these networks for their specific device and application needs and recertify the new network. We believe that our approach paves the way for efficient and effective certification of DNNs in real-world applications.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their comments. This research was supported in part by NSF Grants No. CCF-1846354, CCF-2217144, CCF-2238079, CCF-2313028, CCF-2316233, CNS-2148583, USDA NIFA Grant No. NIFA-2024827 and Google Research Scholar award.

References

  • Agresti and Coull [1998] Alan Agresti and Brent A. Coull. Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2):119–126, 1998. doi: 10.1080/00031305.1998.10480550. URL https://doi.org/10.1080/00031305.1998.10480550.
  • Amato et al. [2013] Filippo Amato, Alberto López, Eladia María Peña-Méndez, Petr Vaňhara, Aleš Hampl, and Josef Havel. Artificial neural networks in medical diagnosis. Journal of Applied Biomedicine, 11(2):47–58, 2013.
  • Balzer et al. [1991] Wolfgang Balzer, Masanobu Takahashi, Jun Ohta, and Kazuo Kyuma. Weight quantization in boltzmann machines. Neural Networks, 4(3):405–409, 1991.
  • Beyer et al. [2013] Dirk Beyer, Stefan Löwe, Evgeny Novikov, Andreas Stahlbauer, and Philipp Wendler. Precision reuse for efficient regression verification. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, page 389–399, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450322379. doi: 10.1145/2491411.2491429. URL https://doi.org/10.1145/2491411.2491429.
  • Bojarski et al. [2016] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
  • Bojchevski et al. [2023] Aleksandar Bojchevski, Johannes Gasteiger, and Stephan Günnemann. Efficient robustness certificates for discrete data: Sparsity-aware randomized smoothing for graphs, images and more, 2023.
  • Bunel et al. [2020] Rudy Bunel, Jingyue Lu, Ilker Turkaslan, Pushmeet Kohli, P Torr, and P Mudigonda. Branch and bound for piecewise linear neural network verification. Journal of Machine Learning Research, 21(2020), 2020.
  • Chen et al. [2018a] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Tvm: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, OSDI’18, page 579–594, USA, 2018a. USENIX Association. ISBN 9781931971478.
  • Chen et al. [2018b] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 3393–3404, Red Hook, NY, USA, 2018b. Curran Associates Inc.
  • Clopper and Pearson [1934] C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934. ISSN 00063444. URL http://www.jstor.org/stable/2331986.
  • Cohen et al. [2019] Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1310–1320. PMLR, 2019. URL http://proceedings.mlr.press/v97/cohen19c.html.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dvijotham et al. [2020] Krishnamurthy (Dj) Dvijotham, Jamie Hayes, Borja Balle, Zico Kolter, Chongli Qin, Andras Gyorgy, Kai Xiao, Sven Gowal, and Pushmeet Kohli. A framework for robustness certification of smoothed classifiers using f-divergences. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJlKrkSFPH.
  • Fiesler et al. [1990] Emile Fiesler, Amar Choudry, and H John Caulfield. Weight discretization paradigm for optical neural networks. In Optical interconnections and networks, volume 1281, pages 164–173. SPIE, 1990.
  • Fischer et al. [2020] Marc Fischer, Maximilian Baader, and Martin Vechev. Certified defense to image transformations via randomized smoothing. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8404–8417. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/5fb37d5bbdbbae16dea2f3104d7f9439-Paper.pdf.
  • Fischer et al. [2022a] Marc Fischer, Maximilian Baader, and Martin Vechev. Scalable certified segmentation via randomized smoothing, 2022a.
  • Fischer et al. [2022b] Marc Fischer, Christian Sprecher, Dimitar I. Dimitrov, Gagandeep Singh, and Martin T. Vechev. Shared certificates for neural network verification. In Sharon Shoham and Yakir Vizel, editors, Computer Aided Verification - 34th International Conference, CAV 2022, Haifa, Israel, August 7-10, 2022, Proceedings, Part I, volume 13371 of Lecture Notes in Computer Science, pages 127–148. Springer, 2022b. doi: 10.1007/978-3-031-13185-1\_7. URL https://doi.org/10.1007/978-3-031-13185-1_7.
  • Frankle and Carbin [2019] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Proc. International Conference on Learning Representations (ICLR), 2019.
  • Gao et al. [2020] Zhidong Gao, Rui Hu, and Yanmin Gong. Certified robustness of graph classification against topology attack with randomized smoothing. In GLOBECOM 2020 - 2020 IEEE Global Communications Conference, pages 1–6, 2020. doi: 10.1109/GLOBECOM42002.2020.9322576.
  • Horváth et al. [2022] Miklós Z. Horváth, Mark Niklas Mueller, Marc Fischer, and Martin Vechev. Boosting randomized smoothing with variance reduced classifiers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=mHu2vIds_-b.
  • [21] ISO. Assessment of the robustness of neural networks. Standard, International Organization for Standardization, March 2021.
  • Janowsky [1989] Steven A Janowsky. Pruning versus clipping in neural networks. Physical Review A, 39(12):6600, 1989.
  • Jia et al. [2020] Jinyuan Jia, Xiaoyu Cao, Binghui Wang, and Neil Zhenqiang Gong. Certified robustness for top-k predictions against adversarial perturbations via randomized smoothing. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BkeWw6VFwr.
  • Johnson et al. [2013] Kenneth Johnson, Radu Calinescu, and Shinji Kikuchi. An incremental verification framework for component-based software systems. In Proceedings of the 16th International ACM Sigsoft Symposium on Component-Based Software Engineering, CBSE ’13, page 33–42, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450321228. doi: 10.1145/2465449.2465456. URL https://doi.org/10.1145/2465449.2465456.
  • Julian et al. [2018] Kyle D. Julian, Mykel J. Kochenderfer, and Michael P. Owen. Deep neural network compression for aircraft collision avoidance systems. CoRR, abs/1810.04240, 2018.
  • Katz et al. [2017] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I, volume 10426 of Lecture Notes in Computer Science, 2017. doi: 10.1007/978-3-319-63387-9\_5.
  • [27] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Kumar et al. [2020] Aounon Kumar, Alexander Levine, Soheil Feizi, and Tom Goldstein. Certifying confidence via randomized smoothing. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5165–5177. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/37aa5dfc44dddd0d19d4311e2c7a0240-Paper.pdf.
  • Laurel et al. [2022] Jacob Laurel, Rem Yang, Shubham Ugare, Robert Nagel, Gagandeep Singh, and Sasa Misailovic. A general construction for abstract interpretation of higher-order automatic differentiation. Proc. ACM Program. Lang., 6(OOPSLA2), oct 2022. doi: 10.1145/3563324. URL https://doi.org/10.1145/3563324.
  • Laurel et al. [2023] Jacob Laurel, Siyuan Brant Qian, Gagandeep Singh, and Sasa Misailovic. Synthesizing precise static analyzers for automatic differentiation. Proc. ACM Program. Lang., 7(OOPSLA2), oct 2023. doi: 10.1145/3622867. URL https://doi.org/10.1145/3622867.
  • Lee et al. [2019] Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/fa2e8c4385712f9a1d24c363a2cbe5b8-Paper.pdf.
  • Levine and Feizi [2020] Alexander Levine and Soheil Feizi. (de)randomized smoothing for certifiable defense against patch attacks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/47ce0875420b2dbacfc5535f94e68433-Abstract.html.
  • Li et al. [2021] Linyi Li, Maurice Weber, Xiaojun Xu, Luka Rimanic, Bhavya Kailkhura, Tao Xie, Ce Zhang, and Bo Li. Tss: Transformation-specific smoothing for robustness certification. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS ’21, page 535–557, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384544. doi: 10.1145/3460120.3485258. URL https://doi.org/10.1145/3460120.3485258.
  • Liu et al. [2021] Hongbin Liu, Jinyuan Jia, and Neil Zhenqiang Gong. Pointguard: Provably robust 3d point cloud classification, 2021.
  • Mohapatra et al. [2020] Jeet Mohapatra, Ching-Yun Ko, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Higher-order certification for randomized smoothing, 2020.
  • O’Hearn [2018] Peter W. O’Hearn. Continuous reasoning: Scaling the impact of formal methods. In Anuj Dawar and Erich Grädel, editors, Proceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2018, Oxford, UK, July 09-12, 2018, pages 13–25. ACM, 2018. doi: 10.1145/3209108.3209109. URL https://doi.org/10.1145/3209108.3209109.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
  • [38] PyTorch. Torch quantization support. https://github.com/pytorch/pytorch/issues/87395.
  • Reed [1993] Russell Reed. Pruning algorithms-a survey. IEEE transactions on Neural Networks, 4(5):740–747, 1993.
  • Rosenfeld et al. [2020] Elan Rosenfeld, Ezra Winston, Pradeep Ravikumar, and J. Zico Kolter. Certified robustness to label-flipping attacks via randomized smoothing, 2020.
  • Schuchardt et al. [2021] Jan Schuchardt, Aleksandar Bojchevski, Johannes Gasteiger, and Stephan Günnemann. Collective robustness certificates: Exploiting interdependence in graph neural networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=ULQdiUTHe3y.
  • Sharif et al. [2019] Hashim Sharif, Prakalp Srivastava, Muhammad Huzaifa, Maria Kotsifakou, Keyur Joshi, Yasmin Sarita, Nathan Zhao, Vikram S. Adve, Sasa Misailovic, and Sarita Adve. Approxhpvm: A portable compiler ir for accuracy-aware optimizations. Proc. ACM Program. Lang., 3(OOPSLA), oct 2019. doi: 10.1145/3360612. URL https://doi.org/10.1145/3360612.
  • Stein et al. [2021] Benno Stein, Bor-Yuh Evan Chang, and Manu Sridharan. Demanded abstract interpretation. In Stephen N. Freund and Eran Yahav, editors, PLDI ’21: 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, Virtual Event, Canada, June 20-25, 2021, pages 282–295. ACM, 2021. doi: 10.1145/3453483.3454044. URL https://doi.org/10.1145/3453483.3454044.
  • Thulin [2013] Måns Thulin. The cost of using exact confidence intervals for a binomial proportion. Electronic Journal of Statistics, 8, 03 2013. doi: 10.1214/14-EJS909.
  • Tjeng et al. [2017] Vincent Tjeng, Kai Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356, 2017.
  • [46] Shubham Ugare, Debangshu Banerjee, Tarun Suresh, Sasa Misailovic, and Gagandeep Singh. Toward continuous verification of dnns.
  • Ugare et al. [2022] Shubham Ugare, Gagandeep Singh, and Sasa Misailovic. Proof transfer for fast certification of multiple approximate neural networks. Proc. ACM Program. Lang., 6(OOPSLA):1–29, 2022. doi: 10.1145/3527319. URL https://doi.org/10.1145/3527319.
  • Ugare et al. [2023] Shubham Ugare, Debangshu Banerjee, Sasa Misailovic, and Gagandeep Singh. Incremental verification of neural networks. Proc. ACM Program. Lang., 7(PLDI), jun 2023. doi: 10.1145/3591299. URL https://doi.org/10.1145/3591299.
  • Visser et al. [2012] Willem Visser, Jaco Geldenhuys, and Matthew B. Dwyer. Green: Reducing, reusing and recycling constraints in program analysis. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, New York, NY, USA, 2012. Association for Computing Machinery. ISBN 9781450316149. doi: 10.1145/2393596.2393665. URL https://doi.org/10.1145/2393596.2393665.
  • Wang et al. [2021a] Binghui Wang, Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. Certified robustness of graph neural networks against adversarial structural perturbation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, page 1645–1653, New York, NY, USA, 2021a. Association for Computing Machinery. ISBN 9781450383325. doi: 10.1145/3447548.3467295. URL https://doi.org/10.1145/3447548.3467295.
  • Wang et al. [2021b] Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Jana, Cho-Jui Hsieh, and J Zico Kolter. Beta-crown: Efficient bound propagation with per-neuron split constraints for complete and incomplete neural network verification. arXiv preprint arXiv:2103.06624, 2021b.
  • Wei and Liu [2021] Tianhao Wei and Changliu Liu. Online verification of deep neural networks under domain or weight shift. CoRR, abs/2106.12732, 2021. URL https://arxiv.org/abs/2106.12732.
  • Wilson [1927] Edwin B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212, 1927. ISSN 01621459. URL http://www.jstor.org/stable/2276774.
  • Yang et al. [2020] Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes, 2020.
  • Yang et al. [2009] Guowei Yang, Matthew B. Dwyer, and Gregg Rothermel. Regression model checking. In 2009 IEEE International Conference on Software Maintenance, pages 115–124, 2009. doi: 10.1109/ICSM.2009.5306334.
  • Chiang et al. [2022] Ping-yeh Chiang, Michael J. Curry, Ahmed Abdelkader, Aounon Kumar, John Dickerson, and Tom Goldstein. Detection as regression: Certified object detection by median smoothing, 2022.
  • Zhang et al. [2020] Dinghuai Zhang, Mao Ye, Chengyue Gong, Zhanxing Zhu, and Qiang Liu. Black-box certification with randomized smoothing: A functional optimization based framework. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2316–2326. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1896a3bf730516dd643ba67b4c447d36-Paper.pdf.
  • Zhao et al. [2023] Yifan Zhao, Hashim Sharif, Peter Pao-Huang, Vatsin Shah, Arun Narenthiran Sivakumar, Mateus Valverde Gasparino, Abdulrahman Mahmoud, Nathan Zhao, Sarita Adve, Girish Chowdhary, et al. Approxcaliper: A programmable framework for application-aware neural network optimization. Proceedings of Machine Learning and Systems, 5, 2023.
  • Zhou et al. [2017] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=HyQJ-mclg.

Appendix A Appendix

A.1 Observation for Binomial Confidence Interval Methods

In this section, we show plots of the number of samples required to estimate an unknown binomial proportion parameter with two popular estimation techniques: the Wilson [Wilson, 1927] and Agresti-Coull [Agresti and Coull, 1998] methods. For this experiment, we use three values of the target error $\chi$ = 0.5%, 0.75%, and 1.0% and a fixed confidence value $(1-\alpha)=0.99$ for both estimation methods. As shown in Fig. 4, for a fixed target error $\chi$, confidence $(1-\alpha)$, and estimation technique, the number of samples required for estimation peaks when the actual parameter value is around 0.5 and is smallest near the boundaries. This is consistent with the observation described in Section 3.1.

Figure 4 ((a) Agresti-Coull method; (b) Wilson method): The number of samples for the Agresti-Coull and Wilson methods to achieve a target error $\chi$ with confidence $(1-\alpha)$ where $\alpha=0.01$. The number of required samples for both methods peaks at 0.5 and decreases towards the boundaries.

A.2 Theorems

See 2

Proof.

If $f(x+\epsilon)=c_A$ and $f^p(x+\epsilon)=f(x+\epsilon)$ then $f^p(x+\epsilon)=c_A$.
Thus, if $f^p(x+\epsilon)\neq c_A$ then $f(x+\epsilon)\neq c_A$ or $f^p(x+\epsilon)\neq f(x+\epsilon)$.
Using the union bound,

$\mathbb{P}_{\epsilon}(f^p(x+\epsilon)\neq c_A)\leq\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq c_A)+\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^p(x+\epsilon))$
$(1-\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c_A))\leq(1-\mathbb{P}_{\epsilon}(f(x+\epsilon)=c_A))+\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^p(x+\epsilon))$
$\mathbb{P}_{\epsilon}(f(x+\epsilon)=c_A)\leq\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c_A)+\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^p(x+\epsilon))$

Since $\mathbb{P}_{\epsilon}(f(x+\epsilon)=c_A)\geq\underline{p_A}$ and $\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^p(x+\epsilon))\leq\zeta_x$,

$\underline{p_A}-\zeta_x\leq\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c_A)$

Similarly, if $f(x+\epsilon)\neq c$ then $f^p(x+\epsilon)\neq c$ or $f^p(x+\epsilon)\neq f(x+\epsilon)$.
Hence, using the union bound,

$\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq c)\leq\mathbb{P}_{\epsilon}(f^p(x+\epsilon)\neq c)+\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^p(x+\epsilon))$
$(1-\mathbb{P}_{\epsilon}(f(x+\epsilon)=c))\leq(1-\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c))+\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^p(x+\epsilon))$
$\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c)\leq\mathbb{P}_{\epsilon}(f(x+\epsilon)=c)+\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^p(x+\epsilon))$
$\max_{c\neq c_A}\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c)\leq\max_{c\neq c_A}\mathbb{P}_{\epsilon}(f(x+\epsilon)=c)+\zeta_x$
$\max_{c\neq c_A}\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c)\leq\overline{p_B}+\zeta_x$

Hence, using Theorem 2, $g^p$ satisfies $g^p(x+\delta)=c_A$ for all $\delta$ satisfying $\|\delta\|_2\leq\frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A}-\zeta_x)-\Phi^{-1}(\overline{p_B}+\zeta_x)\right)$.

See 3.2

Proof.

Since $\underline{p_A}-\zeta_x\geq\frac{1}{2}$, $0\leq\underline{p_A}\leq 1$, and $\zeta_x\geq 0$, we get $0\leq\underline{p_A}-\zeta_x\leq 1$.

And since $1-\underline{p_A}\geq\overline{p_B}$ and $\underline{p_A}-\zeta_x\geq\frac{1}{2}$ (so $1-\underline{p_A}\leq\frac{1}{2}-\zeta_x$), we get $\overline{p_B}+\zeta_x\leq\frac{1}{2}$, and thus $0\leq\overline{p_B}+\zeta_x\leq 1$.

Since $\Phi^{-1}(1-t)=-\Phi^{-1}(t)$,

$\Phi^{-1}(\overline{p_B}+\zeta_x)=-\Phi^{-1}(1-(\overline{p_B}+\zeta_x))=-\Phi^{-1}((1-\overline{p_B})-\zeta_x)$

Since $1-\underline{p_A}\geq\overline{p_B}$, we have $(1-\overline{p_B})-\zeta_x\geq\underline{p_A}-\zeta_x$; as $\Phi^{-1}$ is monotonically increasing,

$\Phi^{-1}(\overline{p_B}+\zeta_x)\leq-\Phi^{-1}(\underline{p_A}-\zeta_x)$

Hence,

$\Phi^{-1}(\underline{p_A}-\zeta_x)\leq-\Phi^{-1}(\overline{p_B}+\zeta_x)$
$\frac{\sigma}{2}\Phi^{-1}(\underline{p_A}-\zeta_x)\leq-\frac{\sigma}{2}\Phi^{-1}(\overline{p_B}+\zeta_x)$

Adding $\frac{\sigma}{2}\Phi^{-1}(\underline{p_A}-\zeta_x)$ to both sides,

$\sigma\Phi^{-1}(\underline{p_A}-\zeta_x)\leq\frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A}-\zeta_x)-\Phi^{-1}(\overline{p_B}+\zeta_x)\right)$

See 3.2

Proof.

Suppose $f$ and $f^p$ are classifiers such that for a fixed $x\in\mathbb{R}^m$, $\mathbb{P}_{\epsilon}(f(x+\epsilon)=c_A)\geq\underline{p_A}$ and $\mathbb{P}_{\epsilon}(f(x+\epsilon)=f^p(x+\epsilon))>1-\zeta_x$. Note that this holds by the definition of $\underline{p_A}$, with a separate $\underline{p_A}$ for each $x$; the statement does not hold for all $x$ with a single $\underline{p_A}$.
Let $E_1$ denote the event that $\mathbb{P}_{\epsilon}(f(x+\epsilon)=c_A)\geq\underline{p_A}$.
Let $E_2$ denote the event that $\mathbb{P}_{\epsilon}(f(x+\epsilon)=f^p(x+\epsilon))>1-\zeta_x$.
By Theorem 2,

$\mathbb{P}_{\epsilon}(f(x+\epsilon)=c_A)\leq\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c_A)+\mathbb{P}_{\epsilon}(f(x+\epsilon)\neq f^p(x+\epsilon))$
$\underline{p_A}-\zeta_x\leq\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c_A)$

Let $E_3$ denote the event that $\underline{p_A}-\zeta_x\leq\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c_A)$.
Since $E_1$ and $E_2$ imply $E_3$, i.e., $E_1\cap E_2\subseteq E_3$,

$\mathbb{P}(E_3)\geq\mathbb{P}(E_1\cap E_2)$

By the additive rule of probability,

$\mathbb{P}(E_1\cap E_2)=\mathbb{P}(E_1)+\mathbb{P}(E_2)-\mathbb{P}(E_1\cup E_2)$

Since $\mathbb{P}(E_1)\geq 1-\alpha$, $\mathbb{P}(E_2)\geq 1-\alpha_\zeta$, and $\mathbb{P}(E_1\cup E_2)\leq 1$,

$\mathbb{P}(E_3)\geq(1-\alpha)+(1-\alpha_\zeta)-1$
$\mathbb{P}(E_3)\geq 1-(\alpha+\alpha_\zeta)$

Hence, for classifier $f^p$, $\mathbb{P}_{\epsilon}(f^p(x+\epsilon)=c_A)\geq\underline{p_A}-\zeta_x$ holds with confidence at least $1-(\alpha+\alpha_\zeta)$.

A.3 Evaluation Networks

Table 5 and Table 6 present the standard top-1 accuracy of the original and approximated base classifiers and of the smoothed classifiers, respectively.

Table 5: Standard top-1 accuracy (%) for (non-smoothed) networks for combinations of approximations and $\sigma$ values.

Dataset   Architecture  $\sigma$  Original  fp16  bf16  int8  Prune 5%  Prune 10%  Prune 20%
CIFAR10   ResNet-20     0.25      67.2      67.2  66.8  67.2  67.4      66.6       66.6
CIFAR10   ResNet-20     0.5       56.8      56.8  57.2  56.8  57        57.4       58
CIFAR10   ResNet-20     1.0       47.2      47.2  47.0  47.2  47        46.2       45.2
CIFAR10   ResNet-110    0.25      69.0      69.0  69.4  69.0  69.2      68.8       68.2
CIFAR10   ResNet-110    0.5       59.4      59.4  59.4  59.4  59.6      59         58.8
CIFAR10   ResNet-110    1.0       47.0      47.0  46.8  46.8  46.8      47.2       47
ImageNet  ResNet-50     0.5       24.2      24.2  24.4  24.2  24.2      24.4       24.2
ImageNet  ResNet-50     1.0       9.6       9.6   9.6   9.6   9.6       9.6        9.6
ImageNet  ResNet-50     2.0       6.4       6.4   6.4   6.4   6.4       6.4        6.4
Table 6: Standard top-1 accuracy (%) for smoothed networks for combinations of approximations and $\sigma$ values.

Dataset   Architecture  $\sigma$  Original  fp16  bf16  int8  Prune 5%  Prune 10%  Prune 20%
CIFAR10   ResNet-20     0.25      77.2      77    77.2  77.2  77.6      77.2       77.6
CIFAR10   ResNet-20     0.5       67.8      67.4  67.8  67.8  67.8      67.4       67.8
CIFAR10   ResNet-20     1.0       55.6      55.6  55.6  55.8  54.8      55.2       55.6
CIFAR10   ResNet-110    0.25      76.6      76.4  76.2  76.4  76.2      76.2       76.4
CIFAR10   ResNet-110    0.5       66.2      67    68    66.4  67        66.8       66.6
CIFAR10   ResNet-110    1.0       55.6      55.4  56.2  56.2  55        55         54.8
ImageNet  ResNet-50     0.5       63.8      63.4  63.2  63.4  63.6      64         63
ImageNet  ResNet-50     1.0       48.8      48.6  48.8  48.6  48.8      48.6       47.8
ImageNet  ResNet-50     2.0       34.4      34.2  33.8  34.2  34.2      34.4       33.4

A.4 $\zeta_x$ evaluation

We compute the $\zeta_x$ value as the upper limit of a binomial confidence interval, using the Clopper-Pearson method [Clopper and Pearson, 1934] with $n=1000$ samples. For an experiment that adds Gaussian corruptions with standard deviation $\sigma$ to the input, we use the network trained with Gaussian data augmentation with variance $\sigma^2$.

Table 7: Average IRS speedup for combinations of $n$, $\sigma$, and quantizations for ResNet-20 on CIFAR10.

$n$     $\sigma$  fp16   bf16   int8
$10^4$  0.25      1.37x  1.29x  1.3x
$10^4$  0.5       1.79x  1.7x   1.77x
$10^4$  1.0       2.85x  2.41x  2.65x
$10^5$  0.25      1.22x  1.11x  1.27x
$10^5$  0.5       1.73x  1.4x   1.86x
$10^5$  1.0       3.88x  2.40x  4.31x
$10^6$  0.25      1.12x  0.93x  1.15x
$10^6$  0.5       1.97x  1.04x  2.25x
$10^6$  1.0       4.58x  1.25x  5.85x
Table 8: $\zeta_x$ for approximate networks trained with different Gaussian augmentation $\sigma$ values.

Dataset   Architecture  $\sigma$  fp16   bf16   int8   Prune 5%  Prune 10%  Prune 20%
CIFAR10   ResNet-20     0.25      0.01   0.01   0.006  0.01      0.02       0.04
CIFAR10   ResNet-20     0.5       0.006  0.008  0.01   0.01      0.02       0.03
CIFAR10   ResNet-20     1.0       0.006  0.007  0.006  0.007     0.02       0.02
CIFAR10   ResNet-110    0.25      0.006  0.01   0.006  0.009     0.02       0.04
CIFAR10   ResNet-110    0.5       0.006  0.006  0.006  0.008     0.02       0.03
CIFAR10   ResNet-110    1.0       0.006  0.008  0.009  0.007     0.01       0.02
ImageNet  ResNet-50     0.5       0.006  0.009  0.006  0.01      0.02       0.09
ImageNet  ResNet-50     1.0       0.007  0.01   0.006  0.01      0.02       0.08
ImageNet  ResNet-50     2.0       0.006  0.01   0.006  0.007     0.02       0.07

A.5 Sensitivity to changing nn

In Section 5, to save time given the large number of approximations and DNNs tested, we used $n=10^4$ samples for $g$'s certification. Here, we present the effect of certifying with a larger $n$ by comparing ACR vs. certification time for IRS and the baseline on ResNet-20 on CIFAR10. On average, for larger $n$, we observe greater speedup at larger $\sigma$. For instance, for int8 quantization with $\sigma=1.0$, certifying with $n=10^6$ samples gave a 5.85x speedup, compared to a 2.65x speedup with $n=10^4$. However, for smaller $\sigma$, certification with a larger $n$ results in less speedup: for $\sigma=0.25$, we observe speedups from 1.29x to 1.37x with $n=10^4$ but only 0.93x to 1.15x with $n=10^6$.

Figure 5: ACR vs. mean certification time for CIFAR10 ResNet-20 with $\sigma=1$, for $n_p\in\{5\%,10\%,\dots,80\%\}$ of $n$.

A.6 Evaluation with larger npn_{p}

The objective of IRS is to certify the approximated DNN with few samples; thus, our main experiments consider $n_p$ ranging from $1\%$ to $10\%$ of $n$. Nevertheless, in this ablation study we check the effectiveness of IRS for larger $n_p$ values.

IRS certifies the radius $\sigma\Phi^{-1}(\underline{p_A}-\zeta_x)$, which is always smaller than the original certified radius. Hence, when $n_p=n$, the baseline running from scratch should perform better than IRS, as it reaches a certified radius close to $\sigma\Phi^{-1}(\underline{p_A})$.

In this experiment, on CIFAR10 ResNet-20 with $\sigma=1$, we let $n_p\in\{5\%,10\%,\dots,80\%\}$ of $n$. Figure 5 shows the ACR vs. mean time plot for the baseline and IRS. IRS still yields a speedup at $n_p=70\%$; however, for $n_p=75\%$ and $n_p=80\%$, the baseline ACR is higher and IRS cannot achieve that ACR.

A.7 Effect of standard deviation $\sigma$ on IRS speedup

Figure 6: Distribution of $\underline{p_A}$ values greater than 0.5 for ResNet-110 on CIFAR-10: (a) $\sigma=0.25$, (b) $\sigma=1.0$.
Figure 7: Distribution of $\underline{p_A}$ values greater than 0.5 for ResNet-50 on ImageNet: (c) $\sigma=1.0$, (d) $\sigma=2.0$.

Figure 6 and Figure 7 present the $\underline{p_A}$ distribution between 0.5 and 1, for ResNet-110 on CIFAR-10 and ResNet-50 on ImageNet respectively. The x-axis represents the range of $\underline{p_A}$ values and the y-axis their respective proportions. The results show that when certifying with larger $\sigma$, the $\underline{p_A}$ values are smaller on average. As shown in Figure 6(a), for $\sigma=0.25$, less than $35\%$ of $\underline{p_A}$ values are smaller than 0.95. On the other hand, in Figure 6(b), when $\sigma=1.0$, the distribution is less left-skewed, as nearly $75\%$ of $\underline{p_A}$ values are less than 0.95. When $\sigma$ is larger, the values of $\underline{p_A}$ tend to be farther away from 1. Therefore, the estimation of $\underline{p_A}$ is less precise in such cases, as observed in insight 2. As a result, non-incremental RS performs poorly compared to IRS in these situations, leading to a greater speedup with IRS.

A.8 Threshold Parameter γ\gamma

Table 9 presents the proportion of cases for which $\underline{p_A}>\gamma$, for the $\gamma$ chosen through the hyperparameter search in Section 5.4, for different $\sigma$ and networks.

Table 9: Proportion of $\underline{p_A}>\gamma$ for different $\sigma$ and networks.

Dataset   Architecture  $\gamma$  $\sigma$  Proportion of $\underline{p_A}>\gamma$
CIFAR10   ResNet-20     0.99      0.25      0.346
CIFAR10   ResNet-20     0.99      0.5       0.162
CIFAR10   ResNet-20     0.99      1.0       0.034
CIFAR10   ResNet-110    0.99      0.25      0.362
CIFAR10   ResNet-110    0.99      0.5       0.146
CIFAR10   ResNet-110    0.99      1.0       0.034
ImageNet  ResNet-50     0.995     0.5       0.292
ImageNet  ResNet-50     0.995     1.0       0.14
ImageNet  ResNet-50     0.995     2.0       0.04

For CIFAR10 ResNet-20, the proportion of cases with $\underline{p_A}>\gamma$ is 0.346 when $\sigma=0.25$ and 0.034 when $\sigma=1.0$. Similarly, for ImageNet ResNet-50, the proportion is 0.292 when $\sigma=0.5$ and 0.04 when $\sigma=2.0$. As shown in Section 5, certifying with larger $\sigma$ yields smaller $\underline{p_A}$ on average. As expected, we therefore see a smaller proportion of cases with $\underline{p_A}>\gamma$ for larger $\sigma$, and vice versa.

A.9 Quantization Plots

In this section, we present the ACR vs. time plots for all the quantization experiments. We use $n=10^4$ samples for the certification of $g$. For certifying $g^p$, we consider $n_p$ values from $\{1\%,\dots,10\%\}$ of $n$. Note that these values of $n$ and $n_p$, smaller than those in Section 5.1, allow us to perform a large number of experiments.

Figure 8: ResNet-20 on CIFAR10 with $\sigma=0.25$: (a) fp16, (b) bf16, (c) int8.
Figure 9: ResNet-20 on CIFAR10 with $\sigma=0.5$: (a) fp16, (b) bf16, (c) int8.
Figure 10: ResNet-20 on CIFAR10 with $\sigma=1.0$: (a) fp16, (b) bf16, (c) int8.
Figure 11: ResNet-110 on CIFAR10 with $\sigma=0.25$: (a) fp16, (b) bf16, (c) int8.
Figure 12: ResNet-110 on CIFAR10 with $\sigma=0.5$: (a) fp16, (b) bf16, (c) int8.
Figure 13: ResNet-110 on CIFAR10 with $\sigma=1.0$: (a) fp16, (b) bf16, (c) int8.
Figure 14: ResNet-50 on ImageNet with $\sigma=0.5$: (a) fp16, (b) bf16, (c) int8.
Figure 15: ResNet-50 on ImageNet with $\sigma=1.0$: (a) fp16, (b) bf16, (c) int8.
Figure 16: ResNet-50 on ImageNet with $\sigma=2.0$: (a) fp16, (b) bf16, (c) int8.