On Multi-head Ensemble of Smoothed Classifiers for Certified Robustness

Kun Fang Department of Automation, Shanghai Jiao Tong University
{fanghenshao,yingwen_wu,li.tao,xiaolinhuang,jieyang}@sjtu.edu.cn Qinghua Tao ESAT-STADIUS, KU Leuven, Belgium
[email protected] Yingwen Wu Department of Automation, Shanghai Jiao Tong University
{fanghenshao,yingwen_wu,li.tao,xiaolinhuang,jieyang}@sjtu.edu.cn Tao Li Department of Automation, Shanghai Jiao Tong University
{fanghenshao,yingwen_wu,li.tao,xiaolinhuang,jieyang}@sjtu.edu.cn Xiaolin Huang Department of Automation, Shanghai Jiao Tong University
{fanghenshao,yingwen_wu,li.tao,xiaolinhuang,jieyang}@sjtu.edu.cn Jie Yang Department of Automation, Shanghai Jiao Tong University
{fanghenshao,yingwen_wu,li.tao,xiaolinhuang,jieyang}@sjtu.edu.cn

Abstract

Randomized Smoothing (RS) is a promising technique for certified robustness, and recently in RS the ensemble of multiple deep neural networks (DNNs) has shown state-of-the-art performances. However, such an ensemble brings heavy computation burdens in both training and certification, and yet under-exploits individual DNNs and their mutual effects, as the communication between these classifiers is commonly ignored in optimization. In this work, starting from a single DNN, we augment the network with multiple heads, each of which pertains a classifier for the ensemble. A novel training strategy, namely Self-PAced Circular-TEaching (SPACTE), is proposed accordingly. SPACTE enables a circular communication flow among those augmented heads, i.e., each head teaches its neighbor with the self-paced learning using smoothed losses, which are specifically designed in relation to certified robustness. The deployed multi-head structure and the circular-teaching scheme of SPACTE jointly contribute to diversify and enhance the classifiers in augmented heads for ensemble, leading to even stronger certified robustness than ensembling multiple DNNs (effectiveness) at the cost of much less computational expenses (efficiency), verified by extensive experiments and discussions.

1 Introduction

Deep neural networks (DNNs) have been widely applied in various fields [1, 2], but at the same time showed devastating vulnerability to adversarial samples. That is, images crafted with maliciously-designed perturbations, namely adversarial examples, can easily mislead well-trained DNNs into wrong predictions [3, 4]. To resist adversarial examples, there has been a series of empirical defenses against adversarial attacks, e.g., adversarial training and its variants[5, 6, 7]. Besides, various methods for certified robustness [8, 9, 10] have also attracted increasing attention in recent years, where the robustness of DNNs for images perturbed by Gaussian noises $\mathcal{N}(0,\sigma^{2}I)$ is focused. A certifiably-robust classifier guarantees that for any input $\mathbf{x}$ , the predictions are kept constant within its perturbed neighborhood bounded commonly by the $\ell_{2}$ norm.

Randomized smoothing [11, 12, 13] (RS) is considered as one of the most effective $\ell_{2}$ -norm certifiably-robust defenses. With RS, any base classifier (DNN) can be formulated into a smoothed and certifiably-robust one by giving the most probable prediction over the Gaussian corruptions of the input. Cohen et al. [12] firstly proved a tight robustness guarantee of RS and achieved state-of-the-art results on large-scale DNNs and complex datasets. Various related works have been developed for a more robust base classifier with RS. These works can be mainly categorized as two types: extra regularization terms [14, 15, 16] and enhanced data augmentation techniques [17, 18].

Refer to caption — Figure 1: An illustration on the proposed SPACTE on a 5-head DNN. In our efficient structure for the ensemble, a common backbone $g$ is shared by the augmented 5 heads $h^{1},\cdots,h^{5}$ , which are trained to be mutually orthogonal to each other for diversified classifiers. In the novel training with circular communication flow between classifiers, the augmented head $h^{k+1}$ are optimized via the circular-teaching by its peer classifier $h^{k}$ with easy and hard samples in relation to the certified radii ( $h^{0}\coloneqq h^{5}$ ).

Recently, the ensemble of multiple DNNs being the base classifier in RS has shown great potentials in certified robustness. The weighted logit ensemble of several DNNs trained from different random seeds [19] or max-margin ensemble of fine-tuning multiple pretrained DNNs with extra constraints [20] have achieved significant improvements in higher certified accuracy on the predictions for $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ . However, such an ensemble scheme demands considerably higher computational sources to train multiple large DNNs. In particular, the resulting computational burden becomes more pronounced in the certification phase, because certifying each sample $\mathbf{x}$ can even require tens of thousands of samplings within $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ [12], resulting in numerous inferences in forward propagation. In this case, such certification time gets exacerbated approximately by the number of ensembled DNNs. Apart from the sacrifice of efficiency, these ensemble methods also ignore a straightforward and yet crucial fact that the current training of ensembled DNNs under-exploits the potentials of each individual DNN at hand and omits the mutual effects of each other in promoting the certified robustness.

Starting from an individual DNN, we in this paper augments the standard DNN with multiple heads. On such a multi-head structure basis we propose a novel ensemble training scheme, namely the Self-PAced Circular-TEaching (SPACTE), where multiple classifiers through mutually-orthogonal heads get ensembled and trained jointly with the common backbone coupling all heads. In the proposed circular teaching scheme, all heads are simultaneously optimized and meanwhile communicate with each other, instead of the separate training of individual DNNs in ensemble [19]. To be specific, during the joint training, each augmented head teaches its next neighboring head with selected samples by our modified self-paced learning, which progresses the optimization from “easy” samples to “hard” ones with specific considerations to certified robustness. In the following, the proposed SPACTE is elaborated:

•

In the modelling, our mutually-orthogonal heads leverage the essential idea of inducing diversity among classifiers for variance reduction as in the classic ensemble [21, 22, 23], but manage to greatly alleviate computational burdens both in storage and time expenses.
•

The concept of communication between ensembled classifiers is introduced in a joint optimization, enabling information exchange to exploit potentials of individual classifiers and providing a novel perspective on the interaction in ensemble training.
•

In the optimization, each head selects the mini-batch of samples to teach its next neighboring head to update. The sample selection in such a circular-teaching scheme is conducted with self-paced learning [24, 25], which proceeds the optimization from “easy” samples to “hard” ones, specifically defined in relation to the proposed smoothed loss for certified robustness.

Rather than the ensemble after the separate training of multiple DNNs, the proposed method instead considers a novel ensemble-based training method for a single DNN with augmented heads. On the one hand, the common high-level features are expected to be exploited by all classifiers in the ensemble; such high-level feature learning can be reflected in the shared backbone in our multi-head structure. On the other hand, classifiers in ensemble methods are designed to be diverse from each other and are capable of learning different subtle features, for the sake of greater reliability and variance reduction [19]; in our method, such diversity is embodied from two folds: (i) the parameters of the augmented heads are imposed to be mutually orthogonal, and (ii) the optimization of each head is conducted via the mini-batch samples that are taught by the self-paced learning of its neighboring head, not by itself. SPACTE is compatible with all single-model-based methods, which can be plugged into SPACTE via regularization [14, 15, 16] or data augmentation [17, 18]. Extensive experiments demonstrate that the proposed SPACTE greatly improves the certified robustness over the state-of-the-art (SOTA) methods with much less computation overhead than the existing ensemble methods. SPACTE acts as an efficient and effective ensemble method, which successfully exemplifies a bridge connecting the methods of improving single models and integrating multiple models for certified robustness.

In the following, Sec. 2 outlines related works. Section 3 introduces the proposed SPACTE, in terms of the multi-head structure and its optimization. Section 4 presents extensive experiments on the certified robustness from different aspects. Section 5 briefly concludes the paper.

2 Related work

2.1 Randomized smoothing

Consider an arbitrary classifier $f:\mathbb{R}^{d}\mapsto\mathbb{R}^{K}$ that takes the input $\mathbf{x}\in\mathbb{R}^{d}$ and gives $K$ logits w.r.t. each class. Then, denote $F(\mathbf{x})\coloneqq\arg\max_{k}f_{k}(\mathbf{x})$ as the function $\mathbb{R}^{d}\mapsto\mathcal{K}\triangleq\{1,\cdots,K\}$ predicting the class of $\mathbf{x}$ .

In RS, a smoothed classifier $G$ is obtained as:

G(\mathbf{x})\coloneqq\mathop{\arg\max}\nolimits_{c}\mathcal{P}_{\boldsymbol{\delta}\sim\mathcal{N}(0,\sigma^{2}I)}\left(F(\mathbf{x}+\boldsymbol{\delta})=c\right).

(1)

In Eq. 1, via numerous Gaussian samplings within the neighborhood of $\mathbf{x}$ , the smoothed classifier $G$ determines the top-1 class of $\mathbf{x}$ with probability $p_{A}$ and the top-2 with probability $p_{B}$ . Based on the Neyman-Pearson lemma [26], Cohen et al. [12] proved a lower bound $R(G;\mathbf{x},y)$ of the $\ell_{2}$ certified radius of $G$ around $\mathbf{x}$ with true label $y$ , such that for all $||\boldsymbol{\delta}||_{2}\leq R(G;\mathbf{x},y)$ , the robustness of $G$ , i.e., $G(\mathbf{x}+\boldsymbol{\delta})=y$ , is guaranteed in relation to $p_{A}$ and $p_{B}$ [12].

Since the pioneering work on RS [11, 12], there have been various researches on training a more robust base classifier $f$ . The stability training [14] required similar predictions on both original samples and noisy ones. MACER [15] directly maximized the certified radius via a hinge loss. In [16], the predictive consistency was imposed within the input neighborhood. These defenses [14, 15, 16] can be viewed as regularization-based schemes, since additional terms are devised into the loss to train the base classifier. Similar to the empirical defense of adversarial training [5], Salman et al. [17] and Jeong et al. [18] proposed to attack the smoothed classifier and meanwhile included the resulting adversarial examples in training the base classifier to enhance the certified robustness, which are regarded as data-augmentation-based techniques.

The aforementioned works on RS enhance individual DNNs with a single classifier, while two concurrent works by Horváth et al. [19] and Yang et al. [20] investigated how the ensemble of multiple DNNs improve the certified robustness. In [19], some theoretical aspects were given that ensemble models boost the certified robustness of RS via variance reduction, and experiments also showed that the average on the logits of multiple DNNs trained with different random seeds could achieve distinctive improvements. In [20], instead of the end-to-end training, it focused on a fine-tuning strategy for base RS classifiers of multiple pretrained DNNs, where diversified gradients and large confidence margins were used for the fine-tuning. The overhead computation in its pretraining for multiple DNNs is already expensive, and the fine-tuning method in [20] further amplifies such computational expenses. Despite of the improved accuracy, these ensemble methods greatly increase the computational burdens, i.e., amplifying both storage and algorithmic complexities at least by the number of ensembled classifiers. Our work in contrast bridges methods using a single model and ensembling multiple models by proposing an efficient (cheaper training and certification) and effective (stronger certified robustness) ensemble training strategy on an augmented multi-head structure to achieve a certifiably-robust single model.

2.2 Co-teaching and self-paced learning

Co-teaching was proposed in [27] for combating label noise, and some variants were also proposed later on [28, 29, 30]. Co-teaching involves two DNNs and requires that each network selects its small-loss samples to teach (train) the peer network, as it assumes that the small-loss samples are more likely to be the data with true labels. Therefore, two networks help each other select the possibly clean samples to prevent overfitting the noisy labels in training. Co-teaching is commonly suggested for fine-tuning two well-trained DNNs.

Self-paced learning [24, 31, 32] (SPL) has been proved as a useful technique in avoiding bad local optima for better optimization results in machine learning. In SPL, the model is trained with samples in an order of increasing difficulty, which facilitates the learning process. The “difficulty” of samples is defined in a similar way as how Co-teaching thinks about the data with noisy labels: the smaller the loss, the easier the sample. SPL can be conducted by hard weighting [24] or soft weighting [25]. The former conducts the optimization by only involving the easiest samples within the mini-batch in each iteration, while the latter gives a sample-wisely weighted mini-batch during the iterative updates in training.

Note that although the proposed SPACTE does the teaching to its peer classifiers, it is essentially different from the existing Co-teaching [27]. In our method, multiple classifiers are designed to formulate a novel network structure on a single DNN for more economic and effective ensembles. Our circular-teaching is intrinsically designed for allowing communications among classifiers during the end-to-end training for certified robustness, so as to exploit more potentials of individual classifiers in the joint optimization for the ensemble. Our teaching mechanism is designed with the modified SPL in a specific relation to the smoothed loss, not only for better optimization solutions but also for enhanced diversities among ensembled heads. Appendix A gives more discussions on the related works.

3 Methodology

The multi-head structure is introduced in Sec. 3.1 with its efficient ensemble prediction and certification. Then, the circular-teaching training is elaborated in Sec. 3.2.

3.1 Augmented multi-head network

Efficient structure. The proposed method deploys a single DNN augmented with multiple heads, each of which pertains a classifier for certified robustness. This multi-head structure achieves the ensemble predictions using multiple classifiers from augmented heads. The ensemble is thereby realized in a much more efficient way and is no more restricted to the framework of integrating multiple individual DNNs. In contrast, our multi-head structure substantially mitigates such heavy computation resources required by the existing ensemble methods [19, 20].

The deployed multi-head DNN consists of a backbone $g$ , which is shared by $L$ augmented heads $h^{1},\cdots,h^{L}$ with $h^{i}(g(\cdot)):\mathbb{R}^{d}\mapsto\mathbb{R}^{K},i=1,\cdots,L$ , as shown in Fig. 1. In prediction and certification, the model output is taken as the ensemble of these augmented heads by averaging the logits from the classifiers:

f_{\mathrm{ens}}(\mathbf{x})=\dfrac{1}{L}\sum\nolimits_{k=1}^{L}h^{k}(g(\mathbf{x})).

(2)

It can be seen that the prediction and certification of the multi-head DNN can follow the functions Predict and Certify in [12], respectively. Our $f_{\mathrm{ens}}(\cdot)$ can be directly employed as the base classifier in RS without extra modifications. Note that our forward propagation process differs in a more efficient way, from the existing ensemble methods, because only the computation in the augmented heads is required, rather than the complete computation using multiple DNNs. Such forward propagation process in our method directly contributes to a substantial reduction in the certification time compared with those ensemble methods computing multiple DNNs, since certifying each sample generally requires 100,000 samplings within the neighborhood of $\mathbf{x}$ [12]. Table 1 shows that over 60% certification time is reduced from 91.3s to 30.3s per sample, and more evidences on our efficiency can refer to Sec. 4.2, with comparisons on FLOPs, training time, etc.

Methods	Models	Per sample (s)	Total (h)
Gaussian [12]	1 DNN	17.3	2.41
Ensemble [19]	5 DNNs	91.3	12.68
Ours	5 heads	30.3	4.20

Table 1: Certification time of different methods. Ours distinctively improves the certification efficiency than the existing ensemble.

Diversifying parameters. In ensemble methods, the variance reduction via leveraging predictions of multiple classifiers distinctively improves the performance for certified robustness [19]. Correspondingly, diversities between the classifiers in our augmented multi-head DNN are encouraged. Especially in such a structure with a common backbone, the diversities across heads are of particular interest and should be designed delicately, avoiding extracting the same features and converging to identical heads.

A cosine constraint is imposed on the parameters of all paired heads, i.e., the augmented heads are mutually orthogonal with each other:

\mathcal{L}_{\rm cos}\coloneqq\sum_{i=1}^{L}\sum_{j=1,j\neq i}^{L}\frac{{\left\langle h^{i},h^{j}\right\rangle}^{2}}{||h^{i}||\cdot||h^{j}||}.

(3)

This constraint indicates that the features learned by the shared backbone $g(\cdot)$ are required to simultaneously well adapt to the augmented heads $h^{k}(\cdot),k=1,\cdots,L$ , which are mutually orthogonal and all optimized for certified robustness with RS. Thus in the considered network structure, a more adaptive backbone with diversified classifiers can be attained for the ensemble.

During training, the weighted loss of multiple heads is optimized in the objective, as commonly used in the ensemble training [33, 22]. In each iteration, the weight $\omega^{k}$ w.r.t. the $k$ -th head $h^{k}$ is set as $1-\epsilon$ if $h^{k}$ gets the smallest loss on the Gaussian corrupted mini-batch $(\mathbf{x}+\boldsymbol{\delta},y)$ :

\omega^{k}=\left\{\begin{array}[]{ll}1-\epsilon,&k=\mathop{\arg\min}_{k}\mathcal{L}_{\rm ce}(h^{k}(g(\mathbf{x}+\boldsymbol{\delta})),y),\\ \frac{\epsilon}{L-1},&\mathrm{otherwise},\end{array}\right.

(4)

where $\epsilon\in(0,1)$ and $\mathcal{L}_{\rm ce}$ denotes the cross-entropy loss.

3.2 Training with SPACTE

Through the lens of peer-teaching classifiers, we propose a novel training scheme, i.e., the self-paced circular-teaching (SPACTE), which allows communication and information exchange between the classifiers in augmented heads. With our modified SPL, the optimization of each head is taught by its neighboring head to progressively proceed from easy samples to hard ones, defined by the smoothed loss for certified robustness.

3.2.1 Optimization framework

Circular-teaching mechanism. As shown in Fig. 1, information circulates among heads by peer-teaching its next neighboring head with the selected samples. To be specific, we denote $f^{k}(\cdot)\coloneqq h^{k}(g(\cdot)):\mathbb{R}^{d}\mapsto\mathbb{R}^{K}$ as the classifier for the $k$ -th head, and then the classifier $f^{k}$ selects samples to teach the training of $f^{k+1}$ via SPL.

The proposed circular-teaching makes the $(k+1)$ -th classifier $f^{k+1}$ to be trained by the sample mini-batch fed by its neighboring head $f^{k}$ . Hence, $f^{k+1}$ conducts SPL in the view of $f^{k}$ , and at the same time $f^{k+1}$ analogously teaches its next neighboring head $f^{k+2}$ , and so on, until the last head $f^{L}$ circulates the teaching to the first head $f^{1}$ . With the circular-teaching, a knowledge communication flow is formed in the joint optimization of multiple classifiers for the ensemble (see the orange flow in Fig. 1).

We plot the log-probability distributions of two multi-head DNNs in Fig. 2 for the cases with and without the circular information flow, respectively, showing how such a communication flow helps the predictions within $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ . Given our ensemble prediction $f_{\rm ens}$ of the multi-head DNN in Eq. 2, the log-probability gap [16] is calculated as $\log f^{\Delta}_{y}(\mathbf{x}+\boldsymbol{\delta})-\mathop{\max}_{c\neq y}\log f^{\Delta}_{c}(\mathbf{x}+\boldsymbol{\delta})$ , where the probability outputs are denoted as $f^{\Delta}$ by using the softmax function on $f_{\rm ens}$ . Via numerous samplings of $\boldsymbol{\delta}$ from $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ , the distribution of log-probability gap reflects the prediction variance of $f_{\rm ens}$ at $\mathbf{x}$ and the certified robustness of $G$ at $\mathbf{x}$ . Figure 2 presents a distinctively smaller area in the negative half on the horizontal-axis of the model with the circular communication, indicating a lower variance and stronger certified robustness.

Sample selections via self-paced learning. The sample selection in our circular-teaching is achieved by SPL with soft thresholds, indicating that a sample-wisely weighted mini-batch is fed to teach the neighboring classifier. Given a sample $(\mathbf{x}_{n},y_{n})$ from the mini-batch in the current iteration for updating $f^{k+1}$ , SPL assigns a weight $\nu_{n}^{k}$ to its optimization objective, where $\nu_{n}^{k}$ is determined (taught) by the peer classifier $f^{k}$ . We leave the details of determining $\nu_{n}^{k}$ in Sec. 3.2.2 where our SPL specified for certified robustness is introduced in detail.

In implementation, the novel circular-teaching in the proposed SPACTE can be efficiently realized by shifting the weights $\boldsymbol{\nu}^{k}=[\nu_{1}^{k},\ldots,\nu_{N}^{k}]^{T}$ learned by $f^{k}$ to it neighboring classifier $f^{k+1}$ , Thus, the optimization objective of our SPACTE is formulated as follows:

\mathop{\min}_{g,h_{1},\cdots,h_{L}}\frac{1}{NL}\sum_{k=1}^{L}\omega^{k}\sum_{n=1}^{N}\nu_{n}^{k-1}\mathcal{L}_{\rm ce}(f^{k}(\mathbf{x}_{n}+\boldsymbol{\delta}),y_{n})+\mathcal{L}_{\rm cos},

(5)

where $\{(\mathbf{x}_{n},y_{n})\}^{N}_{n=1}$ denotes a training mini-batch with perturbations $\boldsymbol{\delta}\sim\mathcal{N}(0,\sigma^{2}I)$ , and $\nu_{n}^{0}\coloneqq\nu_{n}^{L}$ . Under this circular communication flow, each head can access more diversified information, i.e., justified by its peer heads, which further helps boost the diversities among heads and variance reduction in the ensemble predictions.

3.2.2 Easy samples V.S. hard samples

In SPL [24], the samples with smaller losses are viewed as easy samples in the optimization. However, in the task of certified robustness, an easier sample should be the one with a larger certified radius. As proven in [12], the certified radius of a sample $\mathbf{x}$ is reflected by the predictions within its perturbed neighborhood $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ , instead of the prediction on a single point $\mathbf{x}$ . For those harder samples with smaller certified radii, the predictions on the perturbed neighborhood $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ can vary drastically. Therefore, the loss alone can not be directly employed to determine whether a sample is easy or hard in the optimization for certified robustness.

Considering a threshold for certified radius $\bar{R}=0.2$ , we take the hard samples by $R\leq\bar{R}$ and the easy ones by $R>\bar{R}$ , and give an illustration in Fig. 3. The figure on the left top panel shows the loss and accuracy averaged on a batch of hard samples w.r.t. 10 samplings from their perturbed neighborhoods. A large variance occurs, where the highest accuracy reaches over $75\%$ (with the lowest loss $\approx 0.6$ ) while the lowest accuracy shows only less than $65\%$ (with the largest loss $\approx 1.0$ ). In contrast, the variance is distinctively smaller on easy samples, as shown in the left bottom panel of Fig. 3. Correspondingly, a metric, namely the smoothed loss, is established to justify easy and hard samples in certified robustness on $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ .

Definition 1

(Smoothed loss). Given a classifier $f(\cdot):\mathbb{R}^{d}\mapsto\mathbb{R}^{K}$ and the cross-entropy loss $\mathcal{L}_{\rm ce}$ , for any sample $(\mathbf{x},y)\sim\mathbb{R}^{d}\times\mathcal{K}$ , the smoothed loss $\overline{\mathcal{L}_{\mathbf{x}}}$ is defined by

\overline{\mathcal{L}_{\mathbf{x}}}\coloneqq\frac{1}{m}\sum\nolimits_{i=1}^{m}\mathcal{L}_{\rm ce}(f(\mathbf{x}+\boldsymbol{\delta}_{i}),y),\ \boldsymbol{\delta}_{i}\sim\mathcal{N}(0,\sigma^{2}I).

(6)

The constructed metric in Eq. 6 is computed by averaging the losses w.r.t. $m$ samplings on $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ . Since an easier sample indicates more accurate and stable predictions under the perturbation in a larger neighborhood domain (a larger $\bar{R}$ ) and vice versa, such a smoothed loss can be linked to the certified radius $\bar{R}$ . As shown in the right panel of Fig. 3, as $\bar{R}$ increases, the easier the sample is, the smaller the smoothed loss $\overline{\mathcal{L}_{\mathbf{x}}}$ is. The smoothed losses of easy and hard samples differ significantly and thereby are easily distinguishable.

SPL [24] progressively includes samples into the optimization in a meaningful order, i.e., only these samples with losses smaller than a given hard threshold are used in the current iteration. SPL can also be extended with soft thresholds, i.e., the losses of samples are weighted in accordance with their easiness or hardness by the adapted logistic function [25]. It has been proven theoretically and empirically that SPL helps avoid bad local optima and facilitate optimization [34]. We utilize SPL with soft thresholds and modify it for the circular-teaching in our multi-head DNNs. With the metric $\overline{\mathcal{L}_{\mathbf{x}}}$ in Definition 1 for certified robustness, the weight $\nu^{k}_{n}$ assigned to the sample $\mathbf{x}_{n}$ is determined by the $k$ -th classifier $f^{k}$ , such that

\nu^{k}_{n}=\left\{\begin{array}[]{ll}1&\overline{\mathcal{L}^{k}_{\mathbf{x}_{n}}}\leq\lambda,\\ \frac{1+e^{-\lambda}}{1+e^{\overline{\mathcal{L}^{k}_{\mathbf{x}_{n}}}-\lambda}}&\mathrm{otherwise}.\end{array}\right.

(7)

In Eq. 7, samples with $\overline{\mathcal{L}^{k}_{\mathbf{x}_{n}}}\leq\lambda$ under $f^{k}$ are viewed as easy samples and are fully involved, i.e., $\nu^{k}_{n}=1$ , while hard samples are weighted by the adapted logistic function [25] modified based on the smoothed loss in Eq. 6. As shown in Problem (5), $\nu_{n}^{k}$ is then used to update $f^{k+1}$ in the circular-teaching, flexibly realizing the communication flow in the joint training and making our SPACTE easily implementable and user-friendly. More details of the proposed algorithm is provided in Appendix B.

4 Numerical experiments

Extensive experiments are presented to evaluate the proposed SPACTE¹¹1Codes are available at https://github.com/fanghenshaometeor/Circular-teaching. on CIFAR10 with comparisons to the state-of-the-arts methods, in terms of certified accuracy, computational cost, and ablation studies.

Metrics. Both the average certified radius (ACR) [15] over the test set, and the certified accuracy at different certified radii $r$ are adopted for evaluations. As proposed in [12], the Monte Carlo method is used to estimate the certified radius of each test sample. The certified accuracy at a specific radius $r$ is calculated by the fraction of the correctly-predicted samples with certified radii larger than $r$ .

Baseline methods. For single-model-based methods, 3 representatives are compared, i.e., Gaussian [12]: the pioneering work trained on Gaussian corruptions of samples, Consistency [16]: the SOTA regularization-based method by adding a Kullback-Leibler (KL) divergence loss to regularize the prediction consistency within $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ , and SmoothMix [18]: the SOTA data-augmentation-based method by mixing the adversarial examples. The current SOTA ensemble methods are also compared, i.e.,, Horváth et al. [19]: the ensemble method via averaging the logits of multiple separately-trained DNNs. Other popular RS-based methods are also included into comparisons, including MACER [15] and SmoothAdv [17].

Setups. The networks and datasets are used as ResNet-110 [35] on CIFAR10 [36] with noise levels varied on $\sigma\in\{0.25,0.5,1.0\}$ , which are commonly set for certified robustness in related works [12, 16, 18] for fair comparisons. The training and certification adopt the same noise level $\sigma$ . Appendix C introduces more implementation details of models, algorithms, and hyper-parameters.

4.1 Main results

CIFAR10. Our method augments a single DNN with multiple heads trained with SPACTE, and can be seamlessly applied to the existing methods based on single models. In Tab. 2 and Tab. 3, similar to the ensemble method Horváth et al. [19] using 5 DNNs, our method augments the 3 representative single-model-based methods (Gaussian [12], Consistency [16] and SmoothMix [18]) with 5 heads ( $L=5$ ) and is compared with all the mentioned baseline models. The compared methods are evaluated based on the checkpoints released by Horváth et al. [19].

Table 3 illustrates that new SOTA results of ACR are achieved by our method. More specifically, for the 3 methods using a single DNN, both the existing Horváth et al. [19] and our SPACTE substantially improve the accuracy, showing the effectiveness of the ensemble for certified robustness. Compared to the existing ensemble method [19] using 5 individual DNNs, our SPACTE further achieves higher accuracy, verifying the advantageous of the proposed novel ensemble strategy with augmented heads instead of individual DNNs. Table 2 further lists the certified accuracy at given radii when $\sigma=1.00$ , where our SPACTE still shows the highest certified accuracy over a large range of certified radii. Due to space limitation, detailed performances on noise levels $\sigma\in\{0.25,0.5\}$ and more experimental results are given in Appendix D.

Methods	# DNNs	ACR	Radius $r$
Methods	# DNNs	ACR	0.00	0.25	0.50	0.75	1.00	1.25	1.50	1.75	2.00	2.25
Gaussian [12]	1	0.532	48.0	40.0	34.4	26.6	22.0	17.2	13.8	11.0	9.0	5.8
Horváth et al. [19]	5	0.601	49.0	43.0	36.4	29.8	24.4	19.8	16.4	12.8	11.2	9.2
SPACTE (5-head)	1	0.656	50.6	45.2	37.8	31.6	27.8	22.4	18.4	14.8	12.0	10.6
Consistency [16]	1	0.778	45.4	41.6	37.4	33.6	28.0	25.6	23.4	19.6	17.4	16.2
Horváth et al. [19]	5	0.809	46.6	42.0	37.6	33.0	29.6	25.8	23.2	21.0	17.6	16.2
SPACTE (5-head)	1	0.832	46.2	43.6	40.2	35.8	33.0	28.2	25.6	22.0	18.8	16.0
SmoothMix [18]	1	0.826	43.4	39.8	36.8	33.6	30.4	28.4	24.8	21.6	18.6	16.2
Horváth et al. [19]	5	0.831	43.8	40.6	38.2	34.8	31.2	27.6	24.0	22.0	19.2	15.8
SPACTE (5-head)	1	0.870	38.4	36.2	33.8	32.0	30.2	28.4	26.2	24.2	22.0	19.0
MACER [15]	1	0.797	42.8	40.6	37.4	34.4	31.0	28.0	25.0	21.4	18.4	15.0
SmoothAdv [17]	1	0.844	45.4	41.0	38.0	34.8	32.2	28.4	25.0	22.4	19.4	16.6

Table 2: Comparison results on ACR and the approximated certified accuracy (%) at different radii

r

. “# DNNs” denotes the number of DNNs employed in each method. Our results improving upon baseline methods are in bold and the overall best results are underlined.

Methods	Models	$\sigma$
Methods	Models	0.25	0.50	1.00
Gaussian [12]	1 DNN	0.450	0.535	0.532
Horváth et al. [19]	5 DNNs	0.530	0.634	0.601
SPACTE	5 heads	0.537	0.668	0.656
Consistency [16]	1 DNN	0.546	0.708	0.778
Horváth et al. [19]	5 DNNs	0.577	0.741	0.809
SPACTE	5 heads	0.580	0.745	0.830
SmoothMix [18]	1 DNN	0.545	0.728	0.826
Horváth et al. [19]	5 DNNs	0.585	0.755	0.831
SPACTE	5 heads	0.589	0.759	0.870
MACER [15]	1 DNN	0.518	0.668	0.797
SmoothAdv [17]	1 DNN	0.527	0.707	0.844

Table 3: Comparison results on ACR. The bold text indicates the best accuracy improved from a single model, and the underlined text is the new SOTA performance for each

\sigma

4.2 Runtime analysis

In addition to the certified accuracy, the efficiency of the proposed method is also comprehensively evaluated, where the single-model methods [12, 16] and the existing ensemble defense [19] are demonstrated for comparisons. For a fair comparison, each experiment here is executed on one single GPU of NVIDIA GeForce RTX 3060 (Ampere) with 12GB memory. The compared single models Gaussian [12] and Consistency [16] are based on ResNet-110 trained on CIFAR10 for $150$ epochs with a noise level $\sigma=1.0$ and a batch size of $256$ . The ensemble method [19] trains 5 individual models with Gaussian [12] and Consistency [16], respectively, and then performs the ensemble; the proposed SPACTE is instead augmented with $5$ heads on single models accordingly. The compared statistics mainly include the ACR, FLOPs, and time-consuming of training and certification phases. Following previous works [12, 19], we certify the CIFAR10 test set by evaluating every 20 samples. We leave the comparison with the fine-tune method [20] in Appendix D.1 for a fair and thorough discussion, considering its different training scheme and certification settings.

Table 4 shows that the training and certification time of the proposed SPACTE are both significantly reduced. For both Gaussian [12] and Consistency [16], our novel ensemble reduces the certification time by $>50\%$ than the existing ensemble [19], and the training time is also greatly reduced, particularly in the case of Consistency [16]. The FLOPs also indicate a much cheaper algorithm complexity of our method. The efficiency advantage of such a multi-head structure in our method can be further pronounced by paralleling the forward propagation of all heads, which would be interesting in future implementations.

Methods	ACR	Models	GFLOPs	Training time		Certification time
Methods	ACR	Models	GFLOPs	per epoch (s)	total (h)	per sample (s)	total (h)
Gaussian [12]	0.532	1 DNN	$0.256$	$39.6$	$1.65$	$17.3$	$2.41$
Horváth et al. [19]	0.601	5 DNNs	$1.280$	$198.0$	$8.25$	$91.3$	$12.68$
SPACTE	0.656	5 heads	$0.594$	$155.5$	$6.48$	$30.3$	$4.20$
Consistency [16]	0.778	1 DNN	$0.256$	$78.2$	$3.26$	$17.2$	$2.40$
Horváth et al. [19]	0.809	5 DNNs	$1.280$	$390.8$	$16.3$	$90.9$	$12.63$
SPACTE	0.830	5 heads	$0.594$	$154.6$	$6.44$	$30.4$	$4.22$

Table 4: Runtime comparisons. Our method greatly saves computational cost over the ensemble [19] and distinctively improves ACR over both single-model [12, 16] and ensemble [19] baselines. Horváth et al. [19] amplifies the running time of the single-model baseline by

\approx 5

4.3 Ablation studies

In ablation studies, we mainly bear 3 questions in mind:

Q1: does the multi-head structure in SPACTE help improve the certified robustness over single models?

Q2: does the proposed circular-teaching contribute to the certified robustness on top of our multi-head DNN?

Q3: can the proposed circular-teaching be extended to help train multiple individual DNNs in the ensemble?

Figure 4 is drawn with $\sigma\in\{0.25,0.50,1.00\}$ based on the classic RS method Gaussian [12]: the method “w/ 5-head” augments the single model in Gaussian [12] with 5 heads, and is optimized only with regular training, answering Q1; the method “w/ 5-head+CT” indicates that the proposed circular-teaching (CT) is further employed, answering Q2. Firstly, Fig. 4 shows that simply augmenting an individual model (Sec. 3.1) without circular-teaching can improve the baseline certified accuracy on different radii, i.e., the dashed curves are consistently over the dotted curves. It indicates that the diversity induced only by the augmented heads itself helps variance reduction on the ensemble predictions within $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ . Secondly, with our multi-head DNN, the proposed circular-teaching scheme further improves the performance on varied radii and different $\sigma$ , reflected by the solid curves. This verifies that the proposed circular-teaching scheme further enhances the diversity among heads and successfully boosts the certified robustness. Details of the certified accuracy in Fig. 4 at each radius are attached in Appendix D.2, where other baseline methods (Consistency [16] and SmoothMix [18]) are also discussed and all support the conclusions above.

To answer Q3, we apply our proposed training scheme, i.e., SPACTE, to the ensemble of 5 individual DNNs, denoted as “SPACTE (5 DNNs)”, which can actually be regarded as a special case of our method by augmenting the heads after the input layer. In this way, these 5 DNNs are not only jointly trained but also enable communications through the circular-teaching via SPACTE. In Tab. 5, applying our proposed SPACTE to train the existing ensemble method [19] consistently improves the certified accuracy, particularly in the cases of $\sigma\in\{0.50,1.00\}$ , achieving distinctive improvements. This further verifies that the communication flow between different classifiers in the ensemble is indeed effective, providing a promising perspective on boosting general ensemble methods. Thus, our proposed method is not only an extension on augmenting a single model, but can also be flexibly applied to enhance the training of ensemble methods. We can see that even with the more efficient 5-head structure and much less running time, our method already achieves comparable accuracy with SPACTE (5 DNNs). Therefore, training multiple DNNs via circular-teaching shall be recommended when certified robustness is pursued in the foremost priority and sufficient computation budgets are allowed.

Methods	Models	$\sigma$
Methods	Models	$0.25$	$0.50$	$1.00$
Gaussian [12]	1 DNN	0.450	0.535	0.532
Horváth et al. [19]	5 DNNs	0.530	0.634	0.601
SPACTE	5 DNNs	0.535	0.673	0.675
SPACTE	5 heads	0.537	0.668	0.656

Table 5: ACR comparisons with different methods and varied

\sigma

5 Conclusion

Based on RS for certified robustness, we propose (i) to augment multiple heads in a single DNN, and (ii) to construct an associated ensemble training strategy SPACTE. The multi-head structure seeks diversified classifiers for the ensemble with imposed orthogonal constraints and greatly alleviates the heavy computational cost in the existing ensemble methods using multiple DNNs. The proposed SPACTE allows communication and information exchange among heads via the self-paced learning, which is modified with the proposed smoothed loss in specific relations to the certified robustness of samples. With SPACTE on the multi-head network, each head selects the mini-batch samples to teach the training of its neighboring head, forming a circular communication flow for better optimization results and enhanced diversities. In extensive experiments, our proposed method achieves SOTA results over the existing ensemble methods at the cost of distinctively less computational expenses. In future, different ensemble mechanisms are worthy to be investigated to further boost certified robustness and efficiency.

References

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[2] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT Press, 2016.
[3] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
[4] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
[5] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[6] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pages 7472–7482. PMLR, 2019.
[7] Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pages 8093–8104. PMLR, 2020.
[8] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5286–5295. PMLR, 2018.
[9] Huan Zhang, Hongge Chen, Chaowei Xiao, Sven Gowal, Robert Stanforth, Bo Li, Duane Boning, and Cho-Jui Hsieh. Towards stable and efficient training of verifiably robust neural networks. In International Conference on Learning Representations, 2019.
[10] Bohang Zhang, Du Jiang, Di He, and Liwei Wang. Boosting the certified robustness of l-infinity distance nets. In International Conference on Learning Representations, 2022.
[11] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy, pages 656–672, 2019.
[12] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.
[13] Greg Yang, Tony Duan, J Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pages 10693–10705. PMLR, 2020.
[14] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. Advances in Neural Information Processing Systems, 32:9464–9474, 2019.
[15] Runtian Zhai, Chen Dan, Di He, Huan Zhang, Boqing Gong, Pradeep Ravikumar, Cho-Jui Hsieh, and Liwei Wang. MACER: Attack-free and scalable robust training via maximizing certified radius. In International Conference on Learning Representations, 2019.
[16] Jongheon Jeong and Jinwoo Shin. Consistency regularization for certified robustness of smoothed classifiers. Advances in Neural Information Processing Systems, 33:10558–10570, 2020.
[17] Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32:11292–11303, 2019.
[18] Jongheon Jeong, Sejun Park, Minkyu Kim, Heung-Chang Lee, Do-Guk Kim, and Jinwoo Shin. SmoothMix: Training confidence-calibrated smoothed classifiers for certified robustness. Advances in Neural Information Processing Systems, 34:30153–30168, 2021.
[19] Miklós Z Horváth, Mark Niklas Mueller, Marc Fischer, and Martin Vechev. Boosting randomized smoothing with variance reduced classifiers. In International Conference on Learning Representations, 2021.
[20] Zhuolin Yang, Linyi Li, Xiaojun Xu, Bhavya Kailkhura, Tao Xie, and Bo Li. On the certified robustness for ensemble models and beyond. In International Conference on Learning Representations, 2021.
[21] Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall, and Dhruv Batra. Why m heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314, 2015.
[22] Christian Rupprecht, Iro Laina, Robert DiPietro, Maximilian Baust, Federico Tombari, Nassir Navab, and Gregory D Hager. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In IEEE international Conference on Computer Vision, pages 3591–3600, 2017.
[23] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
[24] M Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. Advances in Neural Information Processing Systems, 23:1189–1197, 2010.
[25] Chang Xu, Dacheng Tao, and Chao Xu. Multi-view self-paced learning for clustering. In International Joint Conference on Artificial Intelligence, 2015.
[26] Jerzy Neyman and Egon Sharpe Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933.
[27] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in Neural Information Processing Systems, 31, 2018.
[28] Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama. How does disagreement help generalization against label corruption? In International Conference on Machine Learning, pages 7164–7173. PMLR, 2019.
[29] Junnan Li, Richard Socher, and Steven CH Hoi. DivideMix: Learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations, 2019.
[30] Yingyi Chen, Shell Xu Hu, Xi Shen, Chunrong Ai, and Johan AK Suykens. Compressing features for learning with noisy labels. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[31] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. Advances in Neural Information Processing Systems, 27:2078–2086, 2014.
[32] Vithursan Thangarasa and Graham W. Taylor. Self-paced learning with adaptive deep visual embeddings. In British Machine Vision Conference, page 276, 2018.
[33] Jasper Linmans, Jeroen van der Laak, and Geert Litjens. Efficient out-of-distribution detection in digital pathology using multi-head convolutional neural networks. In Medical Imaging with Deep Learning, pages 465–478. PMLR, 2020.
[34] Deyu Meng, Qian Zhao, and Lu Jiang. A theoretical understanding of self-paced learning. Information Sciences, 414:319–328, 2017.
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[36] A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
[37] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
[38] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[39] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30:6405–6416, 2017.
[40] Kun Fang, Qinghua Tao, Yingwen Wu, Tao Li, Jia Cai, Feipeng Cai, Xiaolin Huang, and Jie Yang. Towards robust neural networks via orthogonal diversity. arXiv preprint arXiv:2010.12190, 2020.

Appendix A More related works and compared methods

The general concept of multi-head structures with DNNs is employed in different fields. In the prevailing vision transformer [37, 38], multi-head attention is devised, where each attention head performs scaled dot-products on queries, keys and values, instead of the common convolutions on features. Attention from multiple heads then presents the model with information from different representation subspaces at different positions to capture dependencies hidden in the sequence of image patches. Such revised attention blocks are intrinsically different from our multi-head structure in both technical aspects and tackled tasks. In medical image analysis, Linmans et al. [33] considers the out-of-distribution detection for normal lymph node tissue from breast cancer metastases, and deploys a multi-head convolutional neural network for more efficient computations [39]. This work simply pursues the efficiency for the binary classification on very high-resolution images, where no considerations are involved either in the optimization techniques or for the specific tasks. In against adversarial attacks, an empirical defense is proposed in [40] by simply using multiple last linear layers, enhancing the network robustness both with and without adversarial training. This method is actually not claimed as an ensemble method, since only a single last linear layer is used for predictions at each forward inference, as in single-model methods.

In the existing ensemble methods for certified robustness, multiple individual DNNs are employed for ensemble prediction, and each of those individual DNNs in fact simply realizes any existing single-model-based method with RS [19]. Similarly in our proposed method, each head pertaining a classifier can also be implemented with different methods used in single models. To be more specific, we give details on how the proposed SPACTE augments the 3 representative single-model-based certified defenses in each head as done in the experiments.

Gaussian. The training of the so-called Gaussian method with RS [12] simply includes the cross-entropy loss on Gaussian corrupted samples. The proposed SPACTE leveraging the Gaussian method in each head is described in Algorithm 1, which will be introduced in detail in the subsequent section. Following the notations in Algorithm 1, we exemplify the corresponding optimization problem in the form of expected risk minimization as follows:

\mathop{\min}_{g,h_{1},\cdots,h_{L}}\mathop{\mathbb{E}}_{(\mathbf{x},y),\boldsymbol{\delta}}\left[\sum_{k=1}^{L}\omega^{k}\nu^{k-1}\mathcal{L}_{\rm ce}\left(f^{k}(\mathbf{x}+\boldsymbol{\delta}),y\right)\right]+\mathcal{L}_{\rm cos}.

(8)

Consistency. The training of the Consistency method with RS [16] introduces two regularization terms to additionally restrict the predictive consistency in $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ . Given a classifier $f:\mathbb{R}^{d}\mapsto\mathbb{R}^{K}$ , the optimization problem of Consistency is formulated as:

\begin{split}&\mathop{\min}_{f}\ \mathop{\mathbb{E}}_{(\mathbf{x},y),\boldsymbol{\delta}}\left[\mathcal{L}_{\rm clf}+\mathcal{L}_{\rm con}\right],\\ &\mathcal{L}_{\rm clf}=\mathcal{L}_{\rm ce}\left(f(\mathbf{x}+\boldsymbol{\delta}),y\right),\\ &\mathcal{L}_{\rm con}=c_{1}\cdot\mathrm{KL}\left(f(\mathbf{x}+\boldsymbol{\delta})||\hat{f}(\mathbf{x})\right)+c_{2}\cdot\mathrm{H}\left(\hat{f}(\mathbf{x})\right),\end{split}

(9)

where $\hat{f}(\mathbf{x})\coloneqq\mathbb{E}_{\boldsymbol{\delta}}\left[f(\mathbf{x}+\boldsymbol{\delta})\right]$ . In Problem (9), aside of the classification loss $\mathcal{L}_{\rm clf}$ , the additional loss $\mathcal{L}_{\rm con}$ forces a small KL divergence between the predictions on one noisy sample $f(\mathbf{x}+\boldsymbol{\delta})$ and the mean predictions on multiple noisy samples $\mathbb{E}_{\boldsymbol{\delta}}\left[f(\mathbf{x}+\boldsymbol{\delta})\right]$ , so as to pursue the predictive consistency on $\mathcal{N}(\mathbf{x},\sigma^{2}I)$ for larger certified radii. The mean predictions $\hat{f}(\mathbf{x})$ are also restricted by the entropy function $\rm H(\cdot)$ . $c_{1}$ and $c_{2}$ are in correspondence with the coefficients $\lambda$ and $\eta$ , respectively, in the original paper [16]. Notice that the additional term $\mathcal{L}_{\rm con}$ is sample-wisely determined and thus can be weighted via the SPL strategy in SPACTE. The optimization problem of SPACTE based on Consistency is written as:

\begin{split}&\mathop{\min}_{g,h_{1},\cdots,h_{L}}\mathop{\mathbb{E}}_{(\mathbf{x},y),\boldsymbol{\delta}}\left[\sum_{k=1}^{L}\omega^{k}\nu^{k-1}\left(\mathcal{L}^{k}_{\rm clf}+\mathcal{L}^{k}_{\rm con}\right)\right]+\mathcal{L}_{\rm cos},\\ &\mathcal{L}^{k}_{\rm clf}=\mathcal{L}_{\rm ce}\left(f^{k}(\mathbf{x}+\boldsymbol{\delta}),y\right),\\ &\mathcal{L}^{k}_{\rm con}=c_{1}\cdot\mathrm{KL}\left(f^{k}(\mathbf{x}+\boldsymbol{\delta})||\hat{f}^{k}(\mathbf{x})\right)+c_{2}\cdot\mathrm{H}\left(\hat{f}^{k}(\mathbf{x})\right).\end{split}

(10)

SmoothMix. The training of the SmoothMix method [18] follows the idea of adversarial training [5], where the adversarial examples generated on the smoothed classifier are incorporated into training the base classifier. Given a classifier $f:\mathbb{R}^{d}\mapsto\mathbb{R}^{K}$ , the optimization problem of Smoothmix is formulated as:

\begin{split}&\mathop{\min}_{f}\ \mathop{\mathbb{E}}_{(\mathbf{x},y),\boldsymbol{\delta}}\left[\mathcal{L}_{\rm clf}+\mathcal{L}_{\rm mix}\right],\\ &\mathcal{L}_{\rm mix}=c_{3}\cdot\mathrm{KL}\left(f\left(\mathbf{x}^{\rm mix}\right)||y^{\rm mix}\right),\\ &\mathbf{x}^{\rm mix}=(1-c_{4})\cdot\mathbf{x}+c_{4}\cdot\mathbf{\tilde{x}}^{(T)},\\ &y^{\rm mix}=(1-c_{4})\cdot\tilde{f}(\mathbf{x})+c_{4}\cdot\frac{1}{K},c_{4}\sim\mathcal{U}(\left[0,\frac{1}{2}\right]).\end{split}

(11)

In Problem (11), the additional term $\mathcal{L}_{\rm mix}$ exploits the mix-up between the adversarial examples and clean images, so as to calibrate those over-confident examples via linear interpolation for re-balancing certified radii. $\mathbf{\tilde{x}}^{(T)}$ indicates the PGD-based $T$ -step adversarial examples on the smoothed classifier, and $\tilde{f}(\mathbf{x})$ denotes the soft predictions with softmax on $\mathbb{E}_{\boldsymbol{\delta}}\left[f(\mathbf{x}+\boldsymbol{\delta})\right]$ . $c_{3}$ and $c_{4}$ are in correspondence with the coefficients $\eta$ and $\lambda$ , respectively, in the original paper [18]. Details of generating $\mathbf{\tilde{x}}^{(T)}$ can refer to [18]. Similar to the case in Consistency, the SPL strategy in SPACTE can also be applied to weight the sample-wise loss $\mathcal{L}_{\rm mix}$ . Then, the optimization problem of SPACTE based on SmoothMix is written as

\begin{split}&\mathop{\min}_{g,h_{1},\cdots,h_{L}}\mathop{\mathbb{E}}_{(\mathbf{x},y),\boldsymbol{\delta}}\left[\sum_{k=1}^{L}\omega^{k}\nu^{k-1}\left(\mathcal{L}^{k}_{\rm clf}+\mathcal{L}^{k}_{\rm mix}\right)\right]+\mathcal{L}_{\rm cos},\\ &\mathcal{L}^{k}_{\rm mix}=c_{3}\cdot\mathrm{KL}\left(f^{k}\left(\mathbf{x}^{\rm mix}\right)||y^{\rm mix}\right).\end{split}

(12)

Notice that here $(\mathbf{x}^{\rm mix},y^{\rm mix})$ is accordingly modified to be generated based on the ensemble of multiple heads.

Appendix B Algorithm details

Algorithm 1 outlines the key steps in each iteration of the proposed algorithm SPACTE in training the deployed multi-head DNN for certified robustness. Given a mini-batch of training samples, the smoothed loss of each sample w.r.t. each classifier (head) is calculated firstly, which is hereafter used to determine the corresponding coefficient assigned differently to easy and hard samples. These coefficients then contribute to the sample-wisely cross-entropy loss in our modified SPL with soft thresholds. Secondly, with the circular-teaching scheme in SPACTE, the coefficients learned by $f^{k}$ are shifted to its next neighboring classifier $f^{k+1}$ , which flexibly and efficiently achieves the teaching and forms a circular communication flow in the joint optimization for our ensemble method. Finally, the cross-entropy loss and the cosine loss imposed on the augmented heads together help update the multi-head DNN.

Notice that the computation of $\mathcal{L}_{\rm w}$ in line 4 in Algorithm 1 slightly differs from Problem (5). To present a clear and succinct pipeline of the main method, the $m$ samplings $\boldsymbol{\delta}_{1},\cdots,\boldsymbol{\delta}_{m}$ for the smoothed loss are not explicitly presented in Problem (5), since we leave the introduction of the smoothed loss in the subsequent subsection. In our modified SPL to distinguish easy sample and hard ones, $\boldsymbol{\delta}_{1},\cdots,\boldsymbol{\delta}_{m}$ required by the smoothed loss can be simply implemented in computing $\mathcal{L}_{\rm w}$ , as shown in Algorithm 1. In this paper, we fix $m=2$ for all experiments.

Algorithm 1 Training algorithm of the proposed SPACTE.

1:A mini-batch of training samples

{\{(\mathbf{x}_{n},y_{n})\}}^{N}_{n=1}

, number of heads

L

, sampling times

m

, thresholds

\epsilon

and

\lambda

, learning rate

\eta

, noise level

\sigma

. Note that

\theta

denotes the to-be-optimized parameters of

g,h^{1},...,h^{L}

2:A deep neural network with augmented heads

f^{k}(\cdot)\coloneqq h^{k}(g(\cdot)),k\in\{1,2,...,L\}

3:Sample

\boldsymbol{\delta}_{1},\cdots,\boldsymbol{\delta}_{m}

from

\mathcal{N}(0,\sigma^{2}I)

4:Determine smoothed loss

\overline{\mathcal{L}^{k}_{\mathbf{x}_{n}}}

for each

f^{k}

(\mathbf{x}_{n},y_{n})

5:Determine SPL coefficient

\nu^{k}_{n}

as Eq. 7 based on

\overline{\mathcal{L}^{k}_{\mathbf{x}_{n}}}

6:Determine the sample-weighted cross-entropy loss as

\mathcal{L}_{\rm w}=\frac{1}{mNL}\sum_{k=1}^{L}\omega^{k}\sum_{n=1}^{N}\sum_{i=1}^{m}{\color[rgb]{1,0,0}\nu_{n}^{k-1}}\mathcal{L}_{\rm ce}({\color[rgb]{1,0,0}f^{k}}(\mathbf{x}_{n}+\boldsymbol{\delta}_{i}),y_{n}).

7:Determine

\mathcal{L}_{\mathrm{cos}}

as Eq. 3.

8:Update parameters

\theta\leftarrow\theta-\eta\cdot\nabla_{\theta}(\mathcal{L}_{\rm w}+\mathcal{L}_{\rm cos})

Appendix C Experiment details

More details on the experiments are provided to cover as comprehensive aspects as possible, e.g., location of augmented heads, selection of hyper-parameters, etc.

C.1 Dataset and network

In CIFAR10 [36], there are 50,000 training images and 10,000 test images of $32\times 32\times 3$ size categorized into 10 classes. In all the certifications, if not specified, we follow the same settings in previous works [12, 19], which is also regarded as the typical settings for certified robustness, i.e., we certify a subset of 500 images by sampling every 20 images without shuffling.

We augment multiple heads in the ResNet-110 on CIFAR10 [35]. In all the experiments, we adopt 5 heads, i.e., $L=5$ . Figure 5 gives more structural details on a 5-head DNN. The dotted black box frames out the typical structure of a standard residual network: a concatenation of a convolution-batchnorm-ReLU layer and 3 residual blocks followed by an average pooling layer and the last linear layer. The certain location where the heads get augmented is placed right after the second residual block. That is, each augmented head consists of the third residual block, the average pooling layer and the linear layer, represented by the green shaded area in Fig. 5. Accordingly, the backbone, shown by the blue shaded area, contains the convolution-batchnorm-ReLU layer and the first 2 residual blocks.

We now dive into more details on such a 5-head network. Without loss of generality, we exemplify the deployed ResNet-110. The 1st, 2nd and 3rd residual blocks contain 34, 37 and 37 convolution layers, respectively. Then the number of augmented layers in one head is only $37+1=38$ (the 3rd block and the last linear layer), accounting for approximately $38/110\approx 34\%$ in a 110-layer residual network. Though, it is worth mentioning that the parameters are not uniformly distributed in a residual network. The back layers near outputs generally have more parameters than the front layers near inputs due to the increasing number of channels along layers. Accordingly, each head saves $\approx 66\%$ network layers and $\approx 25\%$ parameters in a 110-layer residual network with in total 1,730,714 parameters. Thus, such a multi-head network helps save lots of computations compared with the ensemble of multiple individual DNNs, which has been empirically elaborated in detail from several aspects in Sec. 4.2. In later parts of this appendix, some empirical evidences on the performance of our multi-head structure augmented in different layers are also supplemented and can refer to Fig. 7 for details.

C.2 Training setups and hyper-parameters

Following the common settings in all previous works [12, 16, 18, 19, 20], the models on CIFAR10 are trained for 150 epochs with a batch size of 256. The stochastic gradient descent optimizer is adopted with Nesterov momentum (weight 0.9, no dampening) with weight decay 0.0001. The initial learning rate is 0.1, which is reduced by a factor of 10 every 50 epochs. We give a general description on the selection of hyper-parameters in SPACTE. Besides, the training commands for each experiment are all provided in the codes, where the specified values of the hyper-parameters are also presented.

In the experiments, without specifications, the number of heads is set as $L=5$ with $\epsilon=0.8$ in Eq. 4 controlling the gradient allocation on the 5 heads. The sampling number $m$ to calculate the smoothed loss is set as $m=2$ , which is the same as that in Consistency [16] and SmoothMix [18]. The threshold $\lambda$ in Eq. 7 to control the SPL weights is scheduled dynamically during training. $\lambda$ varies in a logarithmic shape with base 10. Given the initial value $\lambda_{\rm ini}$ at the 1st epoch and the last value $\lambda_{\rm lst}$ at the last epoch, the values of $\lambda$ at every training epoch can be determined. We choose the initial value of $\lambda$ as $\lambda_{\rm ini}=\ln{K}$ , where $K$ denotes the number of classes, while $\lambda_{\rm lst}$ could be further tuned according to practitioners’ demands. An example of the variation of $\lambda$ is shown in Fig. 6.

C.3 Certification setups

The ensemble of multiple heads $f_{\rm ens}$ in Eq. 2 is employed as the base classifier in RS. As typically done in previous works [16, 18, 19, 20], we follow the function Certify in [12] to perform certification with the same associated settings: $n_{0}=100,n=100,000,\alpha=0.001$ .

Appendix D Details of complete results

The complete empirical results are listed, covering the comparisons on ACR, and approximated certified accuracy at each given radius, and more comprehensive ablation studies.

D.1 Complete comparisons on CIFAR10

Table 6 shows the complete comparisons on all noise levels $\sigma\in\{0.25,0.50,1.00\}$ , as a supplement to Tab. 2 and Tab. 3. Among the compared methods in Tab. 6, the results of those denoted by $\dagger$ are directly from the original paper of Horváth et al. [19]. For the ensemble method [19] of 5 DNNs, the results are obtained by running the codes ²²2https://github.com/eth-sri/smoothing-ensembles. provided by [19] on 5 of the released models from [19]. For SmoothMix [18], we independently train 5 models from 5 different random seeds respectively following the codes ³³3https://github.com/jh-jeong/smoothmix. provided by [18] and again run the codes released by [19] to obtain the corresponding results of one single DNN and ensembled 5 DNNs.

In Tab. 6, both the ensemble method [19] and ours improve the ACR and the approximated certified accuracy from the 3 baseline methods (Gaussian [12], Consistency [16] and SmoothMix [18]). Particularly, our SPACTE with 5 heads shows better results than Horváth et al. [19] that ensembles 5 individual DNNs, achieving new SOTA results on ACR. In some cases, e.g., Gaussian baseline with $\sigma\in\{0.50,1.00\}$ and Consistency baseline with $\sigma=1.00$ , SPACTE ensembling 5 heads in one single network even exceed the performance of ensembling up to 10 individual DNNs [19], highly illustrating the superiority of SPACTE in certified robustness.

$\sigma$	Models	# DNNs	ACR	Radius $r$
$\sigma$	Models	# DNNs	ACR	0.00	0.25	0.50	0.75	1.00	1.25	1.50	1.75	2.00	2.25
0.25	Gaussian^† [12]	1	0.450	77.6	60.6	45.6	30.6	0.0	0.0	0.0	0.0	0.0	0.0
	Horváth et al. [19]	5	0.530	82.2	69.4	53.8	40.6	0.0	0.0	0.0	0.0	0.0	0.0
	SPACTE (5-head)	1	0.537	83.4	71.6	54.4	40.4	0.0	0.0	0.0	0.0	0.0	0.0
	Consistency^† [16]	1	0.546	75.6	65.8	57.2	46.4	0.0	0.0	0.0	0.0	0.0	0.0
	Horváth et al. [19]	5	0.577	75.6	69.4	60.0	51.0	0.0	0.0	0.0	0.0	0.0	0.0
	SPACTE (5-head)	1	0.580	77.6	69.4	61.4	50.6	0.0	0.0	0.0	0.0	0.0	0.0
	SmoothMix [18]	1	0.545	77.6	68.4	56.6	44.2	0.0	0.0	0.0	0.0	0.0	0.0
	Horváth et al. [19]	5	0.585	80.0	70.4	59.8	50.4	0.0	0.0	0.0	0.0	0.0	0.0
	SPACTE (5-head)	1	0.589	77.8	70.4	62.6	52.2	0.0	0.0	0.0	0.0	0.0	0.0
	MACER^† [15]	1	0.518	77.4	69.0	52.6	39.4	0.0	0.0	0.0	0.0	0.0	0.0
	SmoothAdv^† [17]	1	0.527	70.4	62.8	54.2	48.0	0.0	0.0	0.0	0.0	0.0	0.0
0.50	Gaussian^† [12]	1	0.535	65.8	54.2	42.2	32.4	22.0	14.8	10.8	6.6	0.0	0.0
	Horváth et al. [19]	5	0.634	68.8	60.6	47.8	39.2	28.6	20.0	13.8	8.4	0.0	0.0
	Horváth et al.^† [19]	10	0.648	69.0	60.4	49.8	40.0	29.8	19.8	15.0	9.6	0.0	0.0
	SPACTE (5-head)	1	0.668	69.8	61.8	50.8	42.6	31.2	21.6	15.2	9.4	0.0	0.0
	Consistency^† [16]	1	0.708	63.2	54.8	48.8	42.0	36.0	29.8	22.4	16.4	0.0	0.0
	Horváth et al. [19]	5	0.741	64.6	57.2	49.2	44.4	37.8	31.6	25.0	18.6	0.0	0.0
	SPACTE (5-head)	1	0.745	65.2	59.6	52.4	45.0	37.6	30.4	23.8	17.6	0.0	0.0
	SmoothMix [18]	1	0.728	61.4	53.4	48.0	42.4	37.2	32.8	26.0	20.6	0.0	0.0
	Horváth et al. [19]	5	0.755	59.8	53.8	48.2	44.0	39.4	34.6	28.0	22.6	0.0	0.0
	SPACTE (5-head)	1	0.759	53.4	50.2	46.8	44.8	40.0	35.8	31.2	25.6	0.0	0.0
	MACER^† [15]	1	0.668	62.4	54.4	48.2	40.2	33.2	26.8	19.8	13.0	0.0	0.0
	SmoothAdv^† [17]	1	0.707	52.6	47.6	46.0	41.2	37.2	31.8	28.0	23.4	0.0	0.0
1.00	Gaussian^† [12]	1	0.532	48.0	40.0	34.4	26.6	22.0	17.2	13.8	11.0	9.0	5.8
	Horváth et al. [19]	5	0.601	49.0	43.0	36.4	29.8	24.4	19.8	16.4	12.8	11.2	9.2
	Horváth et al.^† [19]	10	0.607	49.4	44.0	37.6	29.6	24.8	20.0	16.4	13.6	11.2	9.4
	SPACTE (5-head)	1	0.656	50.6	45.2	37.8	31.6	27.8	22.4	18.4	14.8	12.0	10.6
	Consistency^† [16]	1	0.778	45.4	41.6	37.4	33.6	28.0	25.6	23.4	19.6	17.4	16.2
	Horváth et al. [19]	5	0.809	46.6	42.0	37.6	33.0	29.6	25.8	23.2	21.0	17.6	16.2
	Horváth et al.^† [19]	10	0.809	46.4	42.6	37.2	33.0	29.4	25.6	23.2	21.0	17.6	16.2
	SPACTE (5-head)	1	0.830	46.2	43.6	40.2	35.8	33.0	28.2	25.6	22.0	18.8	16.0
	SmoothMix [18]	1	0.826	43.4	39.8	36.8	33.6	30.4	28.4	24.8	21.6	18.6	16.2
	Horváth et al. [19]	5	0.831	43.8	40.6	38.2	34.8	31.2	27.6	24.0	22.0	19.2	15.8
	SPACTE (5-head)	1	0.870	38.4	36.2	33.8	32.0	30.2	28.4	26.2	24.2	22.0	19.0
	MACER^† [15]	1	0.797	42.8	40.6	37.4	34.4	31.0	28.0	25.0	21.4	18.4	15.0
	SmoothAdv^† [17]	1	0.844	45.4	41.0	38.0	34.8	32.2	28.4	25.0	22.4	19.4	16.6

Table 6: A comprehensive comparison of various certified defenses on ACR and the approximated certified accuracy (%) at different radii

r

on CIFAR10 with noise levels

\sigma\in\{0.25,0.50,1.00\}

. Our results improving upon baseline methods are in bold and the overall best results are underlined. Results of methods denoted by

\dagger

are directly from the original paper of Horváth et al. [19].

$\sigma$	Methods	ACR	Base methods $\times k$	GFLOPs $\times k$	Training
$\sigma$	Methods	ACR	Base methods $\times k$	GFLOPs $\times k$	mem. (MiB)	per epoch (s)	per sample (s)
0.25	Horváth et al. [19]	0.575	Consistency $\bf\times 10$	0.256 $\bf\times 10$	4,197 $\bf\times 10$	43.0 $\bf\times 10$	78.5
	Yang et al. [20]	0.551	Gaussian $\bf\times 3$	0.256 $\bf\times 3$	21,143	21.5 $\bf\times 3$ + 1055.0	24.4
	SPACTE (5-head)	0.577	SmoothMix $\bf\times 1$	0.594 $\bf\times 1$	9,267	461.7	12.3
0.50	Horváth et al. [19]	0.754	Consistency $\bf\times 10$	0.256 $\bf\times 10$	4,197 $\bf\times 10$	43.4 $\bf\times 10$	78.5
	Yang et al. [20]	0.760	SmoothAdv $\bf\times 3$	0.256 $\bf\times 3$	21,351	21.7 $\bf\times 3$ +2,192.8	24.3
	SPACTE (5-head)	0.753	SmoothMix $\bf\times 1$	0.594 $\bf\times 1$	9,267	460.7	12.2
1.00	Horváth et al. [19]	0.815	SmoothAdv $\bf\times 10$	0.256 $\bf\times 10$	4,219 $\bf\times 10$	251.0 $\bf\times 10$	78.3
	Yang et al. [20]	0.868	SmoothAdv $\bf\times 3$	0.256 $\bf\times 3$	21,351	21.5 $\bf\times 3$ +2,178.1	24.5
	SPACTE (5-head)	0.818	SmoothMix $\bf\times 1$	0.594 $\bf\times 1$	9,267	464.1	12.4

Table 7: Runtime comparisons among Horváth et al. [19], Yang et al. [20] and our SPACTE.

k

denotes the number of deployed DNNs.

Yang et al. [20] gave theoretical discussions on the sufficient and necessary conditions for certifiably-robust ensemble models, i.e., diversified gradient and large confidence margin, and accordingly proposed an ensemble training strategy, namely DRT, by leveraging the two conditions to train ensemble models. Despite of the theoretical analyses, the proposed DRT in [20] is not computational friendly for its completely different training scheme from other certified defenses. In existing single-model-based [12, 16, 18, 15, 17] or ensemble-based [19] certifiably-robust defenses, models are all trained from scratch. In contrast, DRT starts from multiple well-pretrained models, e.g., $k$ DNNs trained on CIFAR10 for 150 epochs as [12], and particularly, continues to fine-tune these $k$ well-pretrained DNNs for another 150 epochs. The training (pretraining) of multiple DNNs already requires quite a lot computational cost, and yet their proposed fine-tuning method needs much more computational expenses than the overhead computation in the pretraining. As the gradient diversity loss term in DRT involves gradients on input images, the resulting GPU memory occupation during training also inevitably increases.

Table 7 gives a comprehensive runtime comparison on the certifiably-robust ensemble defense [19] and our SPACTE, which are both end-to-end training methods, and the fine-tuning ensemble method [20] is also compared here. Because the models and their corresponding settings of [20] are not released, we employ the best ACR values reported in their paper. In [20], the certification settings are different as they certify the test set every 10 images. For a fair comparison with the results released by[20], we re-certify for Horváth et al. [19] and SPACTE following the same certification settings in [20]. For Horváth et al. [19], certifications are re-executed based on the setups with the best results reported in their paper. In Tab. 7, each experiment is executed on one single GPU of NVIDIA GeForce RTX 3090 (Ampere) with 24GB memory, as our previously used NVIDIA GeForce RTX 3060 (Ampere, 12GB memory) lacks capacity for the newly compared method [20]. All the models are trained on CIFAR10 with a batch size of 256. The pretraining of [20] all requires 4189 $\bf\times 3$ MiB memory, which is omitted in the table for a more succinct comparison.

In Tab. 7, as a single DNN augmented with 5 heads, SPACTE achieves competitive or even better ACR over the ensemble of up to 10 individual DNNs [19], with the much saved time expenses in certification. On the other hand, SPACTE can still outperform [20] when $\sigma=0.25$ . Notice that such ACR gain is from the fine-tuning multiple well-pretrained DNNs at a very high memory occupation and training time-consuming, as shown in Tab. 7. In this paper, we mainly focus on the efficient ensemble method trained in an end-to-end way, involving a circular communication flow along classifiers. It would also be of interest in future works to apply SPACTE on multiple DNNs, either by the training from scratch or by the fine-tuning.

D.2 More results of ablation studies

Table 8 gives the quantitative values corresponding to the plots in Fig. 4 of ablation studies in Sec. 4.3, and further includes the results relating to the other 2 baseline methods, Consistency [16] and SmoothMix [18]. As shown in Tab. 8, in all the 3 cases, Gaussian, Consistency and SmoothMix, the multi-head structure in SPACTE can help enhance the certified robustness over single models, in terms of both the ACR and approximated certified accuracy. Besides, on top of the multi-head structure, the circular-teaching scheme of SPACTE among multiple heads further brings higher values of ACR and approximated certified accuracy. Both the 2 key parts in our defense, the multi-head structure and the circular-teaching, are comprehensively verified empirically-effective in strengthening certified robustness, supporting the answers for Q1 and Q2 in Sec. 4.3.

$\sigma$	Models	ACR	Radius $r$
$\sigma$	Models	ACR	0.00	0.25	0.50	0.75	1.00	1.25	1.50	1.75	2.00	2.25
0.25	Gaussian [12]	0.450	77.6	60.6	45.6	30.6	0.0	0.0	0.0	0.0	0.0	0.0
	+ w/ 5-head	0.508	80.8	67.6	52.0	37.2	0.0	0.0	0.0	0.0	0.0	0.0
	+ w/ 5-head + CT	0.537	83.4	71.6	54.4	40.4	0.0	0.0	0.0	0.0	0.0	0.0
	Consistency [16]	0.546	75.6	65.8	57.2	46.4	0.0	0.0	0.0	0.0	0.0	0.0
	+ w/ 5-head	0.573	80.4	71.0	59.4	49.0	0.0	0.0	0.0	0.0	0.0	0.0
	+ w/ 5-head + CT	0.580	77.6	69.4	61.4	50.6	0.0	0.0	0.0	0.0	0.0	0.0
	SmoothMix [18]	0.545	77.6	68.4	56.6	44.2	0.0	0.0	0.0	0.0	0.0	0.0
	+ w/ 5-head	0.588	80.0	72.2	60.2	51.2	0.0	0.0	0.0	0.0	0.0	0.0
	+ w/ 5-head + CT	0.589	77.8	70.4	62.6	52.2	0.0	0.0	0.0	0.0	0.0	0.0
0.50	Gaussian [12]	0.535	65.8	54.2	42.2	32.4	22.0	14.8	10.8	6.6	0.0	0.0
	+ w/ 5-head	0.629	68.0	59.8	49.6	38.8	30.2	19.4	13.8	8.0	0.0	0.0
	+ w/ 5-head + CT	0.668	69.8	61.8	50.8	42.6	31.2	21.6	15.2	9.4	0.0	0.0
	Consistency [16]	0.708	63.2	54.8	48.8	42.0	36.0	29.8	22.4	16.4	0.0	0.0
	+ w/ 5-head	0.744	66.8	60.6	52.0	45.4	37.2	29.2	23.0	16.6	0.0	0.0
	+ w/ 5-head + CT	0.745	65.2	59.6	52.4	45.0	37.6	30.4	23.8	17.6	0.0	0.0
	SmoothMix [18]	0.728	61.4	53.4	48.0	42.4	37.2	32.8	26.0	20.6	0.0	0.0
	+ w/ 5-head	0.750	59.0	53.6	48.2	42.6	38.8	34.0	29.2	23.0	0.0	0.0
	+ w/ 5-head + CT	0.759	53.4	50.2	46.8	44.8	40.0	35.8	31.2	25.6	0.0	0.0
1.00	Gaussian [12]	0.532	48.0	40.0	34.4	26.6	22.0	17.2	13.8	11.0	9.0	5.8
	+ w/ 5-head	0.590	49.4	44.2	37.0	29.0	23.0	19.2	16.2	13.2	10.8	8.0
	+ w/ 5-head + CT	0.656	50.6	45.2	37.8	31.6	27.8	22.4	18.4	14.8	12.0	10.6
	Consistency [16]	0.778	45.4	41.6	37.4	33.6	28.0	25.6	23.4	19.6	17.4	16.2
	+ w/ 5-head	0.774	47.2	43.4	38.6	34.4	29.4	25.8	22.6	19.2	16.8	14.6
	+ w/ 5-head + CT	0.830	46.2	43.6	40.2	35.8	33.0	28.2	25.6	22.0	18.8	16.0
	SmoothMix [18]	0.826	43.4	39.8	36.8	33.6	30.4	28.4	24.8	21.6	18.6	16.2
	+ w/ 5-head	0.859	44.0	41.2	37.8	34.8	31.4	28.2	25.6	22.2	20.2	17.4
	+ w/ 5-head + CT	0.870	38.4	36.2	33.8	32.0	30.2	28.4	26.2	24.2	22.0	19.0

Table 8: A comprehensive ablation study on the separate effectiveness of the proposed multi-head structure (+ w/

L

-head) and the circular-teaching scheme of SPACTE (+ w/

L

-head + CT) based on the 3 baseline methods. The considered noise levels of CIFAR10 are

\sigma\in\{0.25,0.50,1.00\}

Aside from the ablation studies answering the raised 3 questions in Sec. 4.3, we further explore the influence of different locations where heads get augmented. As shown in Fig. 5, generally in this paper, the certain location to augment heads is placed right after the 2nd residual block, resulting in $\approx 34\%$ layers being augmented in each. In this experiment, we choose different augmenting positions, including the middle of the 3rd residual block, where each head accounts for $\approx 17\%$ layers in the 110-layer residual network. Figure 7 shows the ACR results in correspondence to different augmentation locations in ResNet-110 on CIFAR10 with noise levels $\sigma\in\{0.25,0.50,1.00\}$ , where the training baseline method Gaussian [12] is employed. As discussed in Sec. 4.3, we consider two cases: a multi-head DNN trained with and without SPACTE, denoted as “SPACTE” and “Multi-head” in Fig. 7, respectively. The horizontal axis indicates the percentage of layers in one head over total layers: 0%,17%,34%, and 100%. The “0%” simply implies a standard DNN, and “100%” indicates multiple individual DNNs.

As shown in Fig. 7, as the number of augmented layers increases, the ACR of both SPACTE and Multi-head naturally improves a lot over one standard DNN. Besides, in all the 3 different augmentation proportions (17%, 34% and 100%), SPACTE all shows consistent superiority over the naive ensemble of multiple heads, indicating that the communication among heads benefits diversities of heads and thus boosts certified robustness. Specifically, Fig. 7 also illustrates that by only augmenting nearly or less than 30% of total layers with appropriate carefully-designed communication mechanisms, one multi-head DNN can possibly outperform the ensemble of multiple individual DNNs [19] (shown by the orange dashed lines) in certified robustness, further verifying our SPACTE as an efficient and effective ensemble method for certified robustness.