
1 Radboud University Nijmegen, The Netherlands · 2 University of Antwerp, Belgium · 3 TNO & Leiden University, The Netherlands

Robustness Verification for Classifier Ensembles
Thanks: Research funded by the project NWA.BIGDATA.2019.002: “EXoDuS - EXplainable Data Science” and the FWO G030020N project “SAILor”.

Dennis Gross 1, Nils Jansen 1, Guillermo A. Pérez 2, Stephan Raaijmakers 3
Abstract

We give a formal verification procedure that decides whether a classifier ensemble is robust against arbitrary randomized attacks. Such attacks consist of a set of deterministic attacks and a distribution over this set. The robustness-checking problem consists of assessing, given a set of classifiers and a labelled data set, whether there exists a randomized attack that induces a certain expected loss against all classifiers. We show the NP-hardness of the problem and provide an upper bound on the number of attacks that is sufficient to form an optimal randomized attack. These results provide an effective way to reason about the robustness of a classifier ensemble. We provide SMT and MILP encodings to compute optimal randomized attacks or prove that there is no attack inducing a certain expected loss. In the latter case, the classifier ensemble is provably robust. Our prototype implementation verifies multiple neural-network ensembles trained for image-classification tasks. The experimental results using the MILP encoding are promising both in terms of scalability and the general applicability of our verification procedure.

Keywords:
Adversarial attacks · Ensemble classifiers · Robustness

1 Introduction

Recent years have seen rapid progress in machine learning (ML), with a strong impact on fields like autonomous systems, computer vision, and robotics. As a consequence, many systems employing ML increasingly interact with aspects of our everyday life; consider, for instance, autonomous cars operating amongst pedestrians and bicycles. While studies indicate that self-driving cars, which inherently rely on ML techniques, make around 80% fewer traffic mistakes than human drivers [19], verifiable safety remains a major open challenge [26, 13, 23, 5].

In the context of self-driving cars, for instance, certain camera data may contain noise that can be introduced randomly or actively via so-called adversarial attacks. We focus on the particular problem of such attacks in image classification. A successful attack perturbs the original image in such a way that a human does not recognize any difference while the ML model misclassifies the image. A measure of the difference between the ground truth classification, for instance by a human, and the output of a potentially attacked ML classifier is referred to as the loss.

A standard way to render image classification more robust against adversarial attacks is to employ a set of classifiers, also referred to as classifier ensembles [2, 20, 21, 3]. The underlying idea is to obscure the actual classifier from the attacker. One possible formalization of the competition between an adversarial attacker and the ensemble is that of a zero-sum game [20]: The attacker chooses first, the ensemble tries to react to the attack with minimal loss — that is, choosing a classifier that induces maximal classification accuracy.

Figure 1: We depict a single data point in \mathbb{R}^{2} with label 1. The region to the left of the solid vertical line corresponds to points labelled with 2 by one classifier; the region to the right of the dashed line, those labelled with 2 by the other classifier. Both (linear) classifiers label all other points in \mathbb{R}^{2} with 1. (Hence, they correctly label the data point with 1.) The dotted attack, moving the data point left, induces a misclassification of the point by one of the classifiers; the solid attack, moving the data point right, induces a misclassification by the other classifier. Note that every attack has a classifier which is “robust” to it, i.e. it does not misclassify the perturbed point. However, if the attacker chooses an attack uniformly at random, both of them misclassify the point with probability 1/2.

In this setting, the attacker may need to use randomization to behave optimally (see Fig. 1, cf. [6]). Such an attack is called optimal if the expected loss is maximized regardless of the choice of classifier.

Inspired by previous approaches for single classifiers [15, 12], we develop a formal verification procedure that decides if a classifier ensemble is robust against any randomized attack. In particular, the formal problem is the following. Given a set of classifiers and a labelled data set, we want to find a probability distribution and a set of attacks that together form an optimal randomized attack. Akin to the setting in [20], one can provide thresholds on the potential perturbations of data points and on the minimum shift in classification values. With such thresholds in place, it may happen that no such attack exists, in which case we call the classifier ensemble robust. Our aim is the development of a principled and effective method that is able to either find the optimal attack or prove that the ensemble is robust with respect to the predefined thresholds.

To that end, we establish a number of theoretical results. First, we show that the underlying formal problem is NP-hard. Second, towards computational tractability, we show that there exists an upper bound on the number of deterministic attacks needed to form an optimal randomized attack. Using these results, we provide an SMT encoding that computes suitable randomized attacks for a set of convolutional neural networks with ReLU activation functions and a labelled data set. In case there is no solution to that problem, the set of neural networks forms a robust classifier ensemble, see Fig. 2. Together with the state-of-the-art SMT solver Z3 [9], this encoding provides a complete method to solve the problem at hand. Yet, our experiments reveal that it scales only to small examples. We therefore also outline the necessary steps to formulate the problem as a mixed-integer linear program (MILP), enabling the use of efficient optimization solvers like Gurobi [14].

[Diagram: the verifier takes as inputs a set of classifiers, a set of data points, the number of attacks, the maximal attack strength, and the minimal prediction perturbation; it outputs ROBUST / NOT ROBUST and, in the latter case, the probabilities and attacks.]
Figure 2: The verifier takes as input a set of classifiers, a set of labelled data points, the number of attacks, and the attack properties. If the verifier does not find a solution, we can be sure the ensemble is robust against any attack with the specified properties. Otherwise, it returns the optimal attack.

In our experiments, we show the applicability of our approach by means of a benchmark set of binary classifiers, which were trained on the MNIST and German traffic sign datasets [10, 25].

Related work

It is noteworthy that there is some recent work on robustness checking of decision-tree ensembles [22]. However, their approach is based on abstract interpretation and is thus not complete. Other approaches for robustness checking of machine learning classifiers focus on single classifiers (see, e.g., [15, 12, 7, 18, 24]). Akin to our approach, some of these works employ SMT solving. In [4], MILP solving is used for verification tasks on (single) recurrent neural networks. In contrast, our framework allows us to compute attacks for classifier ensembles.

In [11], Dreossi et al. propose a robustness framework which unifies the optimization and verification views on the robustness-checking problem and encompasses several existing approaches. They explicitly mention that their framework applies to local robustness and argue that most of the existing work on finding adversarial examples and verifying robustness fits their framework. Our work, when restricted to a single classifier and a singleton data set, fits precisely into their framework. However, we generalize in those two dimensions by averaging over the robustness target value (in their jargon) for all points in a data set, and by considering classifier ensembles. This means that our view of the adversarial environment is neither that of a white-box attacker nor that of a black-box attacker. Indeed, we know the set of classifiers but we do not know which strategy is used to choose which (convex combination of) classifiers to apply. Our environment is thus a gray-box attacker.

2 Preliminaries

Let \mathbf{x} be a vector (x_{1},\dots,x_{d})\in\mathbb{R}^{d}. We write \lVert\mathbf{x}\rVert_{1} for the “Manhattan norm” of \mathbf{x}, that is, \sum_{i=1}^{d}|\mathbf{x}_{i}|.

We will make use of a partial inverse of the \max function. Consider a totally ordered set Y and a function f\colon X\to Y. Throughout this work we define the \operatorname{arg\,max} (arguments of the maxima) partial function as follows. For all S\subseteq X we set \operatorname{arg\,max}_{s\in S}f(s):=m if m is the unique element of S such that f(m)=\max_{s\in S}f(s). If more than one element of S witnesses the maximum then \operatorname{arg\,max}_{s\in S}f(s) is undefined.
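To illustrate, the following small Python sketch (our own illustration, not part of any tool described here) mirrors this partial arg max: it returns the unique maximizer of a sequence of values and None when the maximum is not unique.

```python
def partial_argmax(values):
    """Index of the unique maximum of `values` (0-based here), or None on ties."""
    m = max(values)
    winners = [i for i, v in enumerate(values) if v == m]
    return winners[0] if len(winners) == 1 else None

# A tie leaves the arg max (and hence, later, the output label) undefined.
assert partial_argmax([0.1, 0.7, 0.2]) == 1
assert partial_argmax([0.5, 0.5]) is None
```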

A probability distribution over a finite set D is a function \mu\colon D\rightarrow[0,\,1]\subseteq\mathbb{R} with \sum_{x\in D}\mu(x)=1. The set of all distributions on D is \mathit{Distr}(D).

2.1 Neural networks

We loosely follow the neural-network notation from [15, 12]. A feed-forward neural network (NN for short) with d inputs and \ell outputs encodes a function f\colon\mathbb{R}^{d}\to\mathbb{R}^{\ell}. We focus on NNs with ReLU activation functions. Formally, the function f is given in the form of

  • a sequence \mathbf{W}^{(1)},\dots,\mathbf{W}^{(k)} of weight matrices with \mathbf{W}^{(i)}\in\mathbb{R}^{d_{i}\times d_{i-1}}, for all i=1,\dots,k, and

  • a sequence \mathbf{B}^{(1)},\dots,\mathbf{B}^{(k)} of bias vectors with \mathbf{B}^{(i)}\in\mathbb{R}^{d_{i}}, for all i=1,\dots,k.

Additionally, we have that d_{0},\ldots,d_{k}\in\mathbb{N} with d_{0}=d and d_{k}=\ell. We then set f(\mathbf{x})=g^{(k)}(\mathbf{x}) for all \mathbf{x}\in\mathbb{R}^{d}, where for all i=1,\dots,k we define

g^{(i)}(\mathbf{x}):=\mathrm{ReLU}(\mathbf{W}^{(i)}g^{(i-1)}(\mathbf{x})+\mathbf{B}^{(i)}),

and g^{(0)}(\mathbf{x}):=\mathbf{x}. The ReLU function on vectors \mathbf{u} is the element-wise maximum between 0 and the vector entries, that is, if \mathbf{v}=\mathrm{ReLU}(\mathbf{u}) then \mathbf{v}_{i}=\max(0,\mathbf{u}_{i}).

We sometimes refer to each g^{(i)} as a layer. Note that each layer is fully determined by its corresponding weight and bias matrices.
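To make the layered definition concrete, the following NumPy sketch (ours; the weights and biases are arbitrary placeholders) evaluates f(\mathbf{x})=g^{(k)}(\mathbf{x}) exactly as defined above, applying ReLU after every layer.

```python
import numpy as np

def relu(u):
    # Element-wise maximum between 0 and the vector entries.
    return np.maximum(0.0, u)

def nn_forward(weights, biases, x):
    """Evaluate f(x) = g^(k)(x) for weight matrices W^(1..k) and bias vectors B^(1..k)."""
    g = np.asarray(x, dtype=float)   # g^(0)(x) = x
    for W, B in zip(weights, biases):
        g = relu(W @ g + B)          # g^(i)(x) = ReLU(W^(i) g^(i-1)(x) + B^(i))
    return g

# Toy network with d = 2 inputs and l = 2 outputs (placeholder parameters).
W1, B1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0])
W2, B2 = np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0.1, 0.0])
print(nn_forward([W1, W2], [B1, B2], [3.0, 1.0]))
```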

2.2 Neural-network classifiers

A data set X\subseteq\mathbb{R}^{d} is a finite set of (real-valued) data points \mathbf{x}\in\mathbb{R}^{d} of dimension d\in\mathbb{N}_{>0}. A classifier c\colon X\rightarrow[\ell] is a partial function that attaches to each data point a label from [\ell]=\{1,\dots,\ell\}, the set of labels. We denote the set of all classifiers over X by \mathbb{C}. An NN-encoded classifier is simply a partial function given as an NN f\colon\mathbb{R}^{d}\to\mathbb{R}^{\ell} that assigns to each data point \mathbf{x}\in\mathbb{R}^{d} the label \operatorname{arg\,max}_{i\in[\ell]}h(i) where h(i)=f(\mathbf{x})_{i}. Intuitively, the label is the index of the largest entry in the vector resulting from applying f to \mathbf{x}. Note that if the image of \mathbf{x} according to f has several maximal entries then the \operatorname{arg\,max} and the output label are undefined.

Definition 1 (Labelled data set)

A labelled data set \mathcal{X}=(X,c_{t}) consists of a data set X and a total classifier c_{t} for X, i.e. c_{t} is a total function.

In particular, c_{t}(\mathbf{x}) is defined for all \mathbf{x}\in X and considered to be the “ground truth” classification for the whole data set X.

3 Problem Statement

We state the formal problem and provide the required notation. Recall that in our setting we assume we are given an ensemble of classifiers C\subseteq\mathbb{C}. Such an ensemble is attacked by a set of attacks that are selected randomly.

Definition 2 (Deterministic attack)

A deterministic attack for a labelled data set (X,c_{t}) and a classifier c\colon X\rightarrow[\ell] is a function \delta\colon X\rightarrow\mathbb{R}^{d}. An attack \delta induces a misclassification for \mathbf{x}\in X and c if c(\mathbf{x}+\delta(\mathbf{x}))\neq c_{t}(\mathbf{x}) or if c(\mathbf{x}+\delta(\mathbf{x})) is undefined. The set of all deterministic attacks is \Delta. An attack is \varepsilon-bounded if \lVert\delta(\mathbf{x})\rVert_{1}\leq\varepsilon holds for all \mathbf{x}\in X.

We sometimes call the value \mathbf{x}+\delta(\mathbf{x}) the attack point. Note that if the classifier c is not perfect, that is, c(\mathbf{x})\neq c_{t}(\mathbf{x}) for some \mathbf{x}\in X, then already the zero attack \delta(\mathbf{x})=0 leads to a misclassification.

We extend deterministic attacks by means of probabilities.

Definition 3 (Randomized attack)

A finite set A\subseteq\Delta of deterministic attacks together with a probability distribution \mathbb{P}\in\mathit{Distr}(A) is a randomized attack (A,\mathbb{P}). A randomized attack is \varepsilon-bounded if for all attacks \delta\in A with \mathbb{P}(\delta)>0 it holds that \lVert\delta(\mathbf{x})\rVert_{1}\leq\varepsilon for all \mathbf{x}\in X.

Figure 3: The dotted diamond contains the set of all 2-bounded attacks in the setting described in Fig. 1. Both \delta_{1} and \delta_{2} are therefore 2-bounded (deterministic) attacks. Hence, any randomized attack with A=\{\delta_{1},\delta_{2}\} is also 2-bounded.

In general, a loss function \ell\colon\mathbb{C}\times\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R} describes the penalty incurred by a classifier with respect to a labelled data point and an attack. In this work, we will focus on the widely used zero-one loss.

Definition 4 (Zero-one loss function)

The (0\text{-}1)-loss function \ell_{\mathrm{0-1}}\colon\mathbb{C}\times\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\{0,1\} for a labelled data set (X,c_{t}), a classifier c\colon X\rightarrow[\ell], and a deterministic attack \delta\in A is given by the following for all \mathbf{x}\in X:

\ell_{\mathrm{0-1}}(c,\mathbf{x},\delta(\mathbf{x}))=\begin{cases}0&\text{if }c(\mathbf{x}+\delta(\mathbf{x}))=c_{t}(\mathbf{x})\\ 1&\text{otherwise.}\end{cases}

In particular, the loss function yields one if the image of the classifier c is undefined for the attack point \mathbf{x}+\delta(\mathbf{x}). Note furthermore that the loss is measured with respect to the ground truth classifier c_{t}. Thereby, the classifier c together with the zero function as deterministic attack does not necessarily induce a loss of zero with respect to c_{t}. This is a realistic assumption: while we expect classifiers to perform well with regard to the ground truth, we cannot assume perfect classification in practice.

We now connect a randomized attack to an ensemble, that is, a finite set C\subseteq\mathbb{C} of classifiers. In particular, we quantify the overall value a randomized attack induces with respect to the loss function and the ensemble.

Definition 5 (Misclassification value)

The misclassification value of a randomized attack (A,\mathbb{P}) with respect to a labelled data set \mathcal{X}=(X,c_{t}) and a finite set of classifiers C\subseteq\mathbb{C} is given by

\mathbf{Val}(A,\mathbb{P}):=\min_{c\in C}\,\frac{1}{|X|}\sum_{\mathbf{x}\in X}\,\mathbb{E}_{\delta\sim\mathbb{P}}[\ell_{\mathrm{0-1}}(c,\mathbf{x},\delta(\mathbf{x}))].   (1)

This value is the minimum (over all classifiers) mean expected loss with respect to the randomized attack and the classifiers from C. An optimal adversarial attack against C\subseteq\mathbb{C} with respect to a labelled data set (X,c_{t}) is a randomized attack which maximizes the value \mathbf{Val}(A,\mathbb{P}) in Equation (1).
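For concreteness, a short Python sketch (ours; classifiers and attacks are assumed to be callables returning a label or None, and a perturbation vector, respectively) that evaluates \mathbf{Val}(A,\mathbb{P}) for the zero-one loss reads as follows.

```python
def zero_one_loss(c, c_t, x, delta_x):
    """0-1 loss: 1 if c misclassifies (or is undefined on) the attack point x + delta(x)."""
    label = c(x + delta_x)            # a classifier returns a label, or None if undefined
    return 0 if label is not None and label == c_t(x) else 1

def misclassification_value(classifiers, data, c_t, attacks, prob):
    """Val(A, P): minimum over all classifiers of the mean expected loss, as in Eq. (1)."""
    values = []
    for c in classifiers:
        total = sum(prob[i] * zero_one_loss(c, c_t, x, attacks[i](x))
                    for x in data for i in range(len(attacks)))
        values.append(total / len(data))
    return min(values)
```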

We are now ready to formalize a notion of robustness in terms of \varepsilon-bounded attacks and a robustness bound \alpha\in\mathbb{R}, as proposed in [20] for a set of classifiers.

Definition 6 (Bounded robustness)

A set of classifiers C\subseteq\mathbb{C} for a labelled data set (X,c_{t}) is called robust bounded by \varepsilon and \alpha ((\varepsilon,\alpha)-robust) if it holds that

\forall(A,\mathbb{P})\in 2^{\Delta}\times\mathit{Distr}(A).\,\mathbf{Val}(A,\mathbb{P})<\alpha,   (2)

where (A,\mathbb{P}) ranges over all \varepsilon-bounded randomized attacks.

[Plot: the expected loss \mathbb{E}(\ell_{0-1}(\cdot)) as a function of \mathbb{P}(\delta_{1}), with one line per attack and its corresponding classifier, (\delta_{1},c_{1}) and (\delta_{2},c_{2}); the marked values are 0.2, 0.3, and 0.5.]
Figure 4: Continuing with the example from Figs. 1 and 3, we now plot the expected loss per attack and the corresponding classifier. (That is, the classifier which misclassifies the perturbed point.) On the horizontal axis we have the probability x assigned to \delta_{1} and we assume \mathbb{P}(\delta_{2})=1-x. Note that x=0.2 is such that the minimal expected loss, i.e. the misclassification value, is strictly less than 0.3. Indeed, one classifier manages to correctly classify the perturbed point with probability 0.8 in this case. With x=0.5 we see that the misclassification value is 0.5. Hence, the ensemble is not (2,0.5)-robust.

In other words, an (\varepsilon,\alpha)-robust ensemble is such that for all possible \varepsilon-bounded randomized attacks (A,\mathbb{P}), there is at least one classifier c\in C from the ensemble such that \sum_{\mathbf{x}\in X}\,\mathbb{E}_{\delta\sim\mathbb{P}}[\ell_{\mathrm{0-1}}(c,\mathbf{x},\delta(\mathbf{x}))]<\alpha|X|. Conversely, an ensemble is not (\varepsilon,\alpha)-robust if there is an \varepsilon-bounded randomized attack with a misclassification value of at least \alpha.

4 Theoretical Results

In this section we establish two key results that carry essential practical implications for our setting. First, we show that in order to obtain an optimal randomized attack, only a bounded number of deterministic attacks is needed. (An analogue of this property had already been observed by Perdomo and Singer in [20, Section 3] in the case where classifiers are chosen randomly.) Thereby, we only need to take a bounded number of potential attacks into account in order to prove the \alpha-robustness of a set of classifiers and a given labelled data set. Second, we establish that our problem is in fact NP-hard, justifying the use of SMT and MILP solvers to (1) compute optimal randomized attacks and, more importantly, to (2) prove robustness against any such attack.

4.1 Bounding the number of attacks

In the following, we assume a fixed labelled data set (X,c_{t}). For every classifier c\in C and every deterministic attack \delta\in\Delta, let us write M_{c}(\delta) to denote the value \sum_{\mathbf{x}\in X}\ell_{\mathrm{0-1}}(c,\mathbf{x},\delta(\mathbf{x})). Observe that 0\leq M_{c}(\delta)\leq|X| for all c\in C and \delta\in\Delta. Furthermore, for all c\in C and randomized attacks (A,\mathbb{P}) it holds that:

\sum_{\mathbf{x}\in X}\,\mathbb{E}_{\delta\sim\mathbb{P}}[\ell_{\mathrm{0-1}}(c,\mathbf{x},\delta(\mathbf{x}))]=\sum_{\mathbf{x}\in X}\sum_{\delta\in A}\mathbb{P}(\delta)\cdot\ell_{\mathrm{0-1}}(c,\mathbf{x},\delta(\mathbf{x}))
=\sum_{\delta\in A}\mathbb{P}(\delta)\underbrace{\left(\sum_{\mathbf{x}\in X}\ell_{\mathrm{0-1}}(c,\mathbf{x},\delta(\mathbf{x}))\right)}_{=M_{c}(\delta)}=\sum_{\delta\in A}\mathbb{P}(\delta)\cdot M_{c}(\delta).

We get that Equation (2) from Def. 6 is false if and only if the following holds.

\exists(A,\mathbb{P})\in 2^{\Delta}\times\mathit{Distr}(A).\,\forall c\in C.\,\sum_{\delta\in A}\mathbb{P}(\delta)\cdot M_{c}(\delta)\geq\alpha|X|   (3)
Proposition 1 (Bounded number of attacks)

Let \alpha\in\mathbb{R} and consider the labelled data set (X,c_{t}) together with a finite set of classifiers C\subseteq\mathbb{C}. For all randomized attacks (A,\mathbb{P}), there exists a randomized attack (A^{\prime},\mathbb{P}^{\prime}) such that

  • |A^{\prime}|\leq(|X|+1)^{|C|},

  • \mathbf{Val}(A^{\prime},\mathbb{P}^{\prime})=\mathbf{Val}(A,\mathbb{P}), and

  • (A^{\prime},\mathbb{P}^{\prime}) is \varepsilon-bounded if (A,\mathbb{P}) is \varepsilon-bounded.

Proof

We proceed by contradiction. Let (A,\mathbb{P}) be an \varepsilon-bounded randomized attack with a misclassification value of \alpha such that |A|>(|X|+1)^{|C|}. Further suppose that (A,\mathbb{P}) is minimal (with respect to the size of A) amongst all such randomized attacks. Since each M_{c}(\delta) takes one of at most |X|+1 values, it follows that there are attacks \delta,\delta^{\prime}\in A such that M_{c}(\delta)=M_{c}(\delta^{\prime}) for all c\in C. We thus have that

\mathbb{P}(\delta)\cdot M_{c}(\delta)+\mathbb{P}(\delta^{\prime})\cdot M_{c}(\delta^{\prime})=\left(\mathbb{P}(\delta)+\mathbb{P}(\delta^{\prime})\right)\cdot M_{c}(\delta).

Consider now the randomized attack (A^{\prime},\mathbb{P}^{\prime}) obtained by modifying (A,\mathbb{P}) so that \mathbb{P}^{\prime}(\delta)=\mathbb{P}(\delta)+\mathbb{P}(\delta^{\prime}) and \delta^{\prime} is removed from A. From the above discussion and Equation (3) it follows that (A^{\prime},\mathbb{P}^{\prime}), just like (A,\mathbb{P}), has a misclassification value of \alpha. Furthermore, since A^{\prime}\subseteq A, we have that (A^{\prime},\mathbb{P}^{\prime}) is \varepsilon-bounded and that |A^{\prime}|<|A|. This contradicts our assumption regarding the minimality of (A,\mathbb{P}) amongst \varepsilon-bounded randomized attacks with the same value \alpha.∎
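The merging step used in this proof can be phrased as a small procedure: group attacks by their signature (M_{c}(\delta))_{c\in C} and sum the probabilities within each group. A Python sketch (ours; `signature` is a hypothetical helper returning that tuple) follows.

```python
from collections import defaultdict

def merge_attacks(attacks, prob, signature):
    """Keep one representative per signature (M_c(delta))_{c in C} and add up probabilities.

    The merged randomized attack has the same misclassification value and at most
    (|X|+1)^{|C|} attacks, mirroring the argument in Proposition 1.
    """
    groups = defaultdict(list)
    for delta in attacks:
        groups[signature(delta)].append(delta)
    merged_attacks, merged_prob = [], {}
    for deltas in groups.values():
        rep = deltas[0]
        merged_attacks.append(rep)
        merged_prob[rep] = sum(prob[d] for d in deltas)
    return merged_attacks, merged_prob
```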

4.2 NP hardness of non-robustness checking

It is known that checking whether linear properties hold for a given NN with ReLU activation functions is NP-hard [15]. We restate this using our notation.

Proposition 2 (From [15, Appendix I])

The following problem is NP-hard: Given an NN-encoded function f\colon\mathbb{R}^{n}\to\mathbb{R}^{m} and closed nonnegative intervals (I_{k})_{1}^{n},(O_{\ell})_{1}^{m}, decide whether there exists \mathbf{x}\in\prod_{k=1}^{n}I_{k} such that f(\mathbf{x})\in\prod_{\ell=1}^{m}O_{\ell}.

Intuitively, determining whether there exists a point in a given box — that is, a hypercube defined by a Cartesian product of intervals — from \mathbb{R}^{n} whose image according to f lies in a given box from \mathbb{R}^{m} is NP-hard. We will now reduce this problem to the problem of determining if there is a randomized attack whose misclassification value is at least a given threshold.

Theorem 4.1

The following problem is NP-hard: For a labelled data set \mathcal{X}=(X,c_{t}), a set C of classifiers, and a value \alpha\in\mathbb{Q}, decide if there exists an \varepsilon-bounded randomized attack (A,\mathbb{P}) w.r.t. \mathcal{X} and C such that \mathbf{Val}(A,\mathbb{P})\geq\alpha.

Proof

We use Proposition 2 and show how to construct, for any NN-encoded function g\colon\mathbb{R}^{n}\to\mathbb{R} and any constraint \ell\leq g(\mathbf{x})\leq u, two classifiers c_{\ell},c_{u} such that a single deterministic attack \delta causes \mathbf{0}, the single data point, to be misclassified by both c_{\ell} and c_{u} if and only if the constraint holds. Note that the \varepsilon bound can be chosen large enough so that it contains the box \prod_{k=1}^{n}I_{k}, and that the identity function over nonnegative numbers is NN-encodable, that is, using the identity matrix as weight matrix \mathbf{W} and a zero bias vector \mathbf{B}. For every instance of the problem from Proposition 2 we can therefore construct 2(n+m) NNs that encode all input and output constraints: 2n of them based on the identity function to encode the input constraints and 2m based on the NN from the given instance. It follows that determining if there exists an \varepsilon-bounded deterministic attack \delta that causes \mathbf{0} to be simultaneously misclassified by a given classifier ensemble is NP-hard. Hence, to conclude, it suffices to argue that the latter problem reduces to our robustness-value threshold problem. The result follows from Lemmas 1 and 2.∎

4.2.1 Enforcing interval constraints.

Let g\colon\mathbb{R}^{n}\to\mathbb{R} be an NN-encoded function and \ell,u\in\mathbb{R} with \ell\leq u. Consider now the constraint \ell\leq g(\mathbf{x})\leq u. Henceforth we will focus on the labelled data set (X,c_{t}) with X=\{\mathbf{0}\} and c_{t}(\mathbf{x})=1.

Lower-bound constraint.

We obtain c_{\ell} by adding to the NN encoding of g a new final layer with weight matrix and bias vector

\mathbf{W}=\begin{pmatrix}0\\ 1\end{pmatrix},\quad\mathbf{B}=\begin{pmatrix}\ell\\ 0\end{pmatrix}

to obtain the NN-encoded function \underline{g}\colon\mathbb{R}^{n}\to\mathbb{R}^{2}. Note that \underline{g}(\mathbf{v})=(\ell,g(\mathbf{v}))^{\intercal} for all \mathbf{v}\in\mathbb{R}^{n}. It follows that c_{\ell}(\mathbf{v})=2 if g(\mathbf{v})>\ell and c_{\ell}(\mathbf{v}) is undefined if g(\mathbf{v})=\ell. In all other cases the classifier yields 1.

Upper-bound constraint.

To obtain c_{u} we add to the NN encoding of g two new layers. The corresponding weight matrices and bias vectors are as follows.

\mathbf{W}^{(1)}=(1),\ \mathbf{B}^{(1)}=(-u),\quad\mathbf{W}^{(2)}=\begin{pmatrix}0\\ -1\end{pmatrix},\ \mathbf{B}^{(2)}=\begin{pmatrix}1\\ 1\end{pmatrix}

Let us denote by \overline{g}\colon\mathbb{R}^{n}\to\mathbb{R}^{2} the resulting function. Observe that \overline{g}(\mathbf{v})=(1,\max(0,1-\max(0,g(\mathbf{v})-u)))^{\intercal} for all \mathbf{v}\in\mathbb{R}^{n}. Hence, we have that c_{u}(\mathbf{v}) is undefined if and only if g(\mathbf{v})\leq u and yields 1 otherwise.
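As a sanity check of this construction (ours, not part of the formal development), the following NumPy sketch wraps a placeholder NN-encoded function g with the extra layers from above and reads off the induced labels via the partial arg max; the bounds are placeholders.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def label_of(scores):
    """Partial arg max as in Sect. 2: 1-based label, or None if the maximum is not unique."""
    winners = np.flatnonzero(scores == scores.max())
    return int(winners[0]) + 1 if len(winners) == 1 else None

def c_lower(g, lo):
    """c_l: misclassifies the data point 0 (ground truth 1) iff lo <= g(v)."""
    def classify(v):
        scores = relu(np.array([[0.0], [1.0]]) @ np.atleast_1d(g(v)) + np.array([lo, 0.0]))
        return label_of(scores)
    return classify

def c_upper(g, up):
    """c_u: misclassifies the data point 0 iff g(v) <= up (the label is undefined there)."""
    def classify(v):
        h = relu(np.atleast_1d(g(v)) - up)                           # W^(1) = (1), B^(1) = (-u)
        scores = relu(np.array([[0.0], [-1.0]]) @ h + np.array([1.0, 1.0]))
        return label_of(scores)
    return classify

# Placeholder g(v) = v_1 + v_2 with bounds lo = 1 and up = 3.
g = lambda v: np.array([v[0] + v[1]])
cl, cu = c_lower(g, 1.0), c_upper(g, 3.0)
for v in ([0.2, 0.3], [1.0, 1.0], [2.5, 2.0]):
    print(v, cl(v), cu(v))   # a misclassification is any label other than 1 (including None)
```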

Lemma 1

Let g\colon\mathbb{R}^{n}\to\mathbb{R} be an NN-encoded function and consider the constraint \ell\leq g(\mathbf{x})\leq u. One can construct NN-encoded classifiers c_{\ell} and c_{u}, of size linear with respect to g, for the labelled data set (\{\mathbf{0}\},\{\mathbf{0}\mapsto 1\}) such that the deterministic attack \delta\colon\mathbf{0}\mapsto\mathbf{v}

  • induces a misclassification of \mathbf{0} with respect to c_{\ell} if and only if \ell\leq g(\mathbf{v}), and

  • induces a misclassification of \mathbf{0} with respect to c_{u} if and only if g(\mathbf{v})\leq u.

We have thus shown how to modify the NN encoding of g so as to obtain classifiers c_{\ell},c_{u} such that the attack point \mathbf{v} is misclassified by both c_{\ell} and c_{u} if and only if the constraint holds.

4.2.2 Enforcing universal misclassification.

A randomized attack with misclassification value 1 can be assumed to be deterministic. Indeed, from Equation (3) it follows that for any such randomized attack we must have M_{c}(\delta)=|X| for all \delta\in A and all c\in C. Hence, we can choose any such \delta\in A and consider the randomized attack (\{\delta\},\{\delta\mapsto 1\}), which also has misclassification value 1.

Lemma 2

Consider the labelled data set \mathcal{X}=(X,c_{t}) with the finite set of classifiers C\subseteq\mathbb{C}. There exists an \varepsilon-bounded randomized attack (A,\mathbb{P}) with \mathbf{Val}(A,\mathbb{P})=1 if and only if there exists a deterministic attack \delta such that

  • \lVert\delta(\mathbf{x})\rVert_{1}\leq\varepsilon for all \mathbf{x}\in X, and

  • for all \mathbf{x}\in X and all c\in C we have that either c(\mathbf{x}+\delta(\mathbf{x})) is undefined or it is not equal to c_{t}(\mathbf{x}).

With Lemmas 1 and 2 established, the proof of Theorem 4.1 is now complete.

5 SMT and MILP Encodings

In this section, we describe the main elements of our SMT and MILP encodings to compute (optimal) randomized attacks or prove the robustness of classifier ensembles. We start with a base encoding and will afterwards explain how to explicitly encode the classifiers and the loss function.

5.1 Base problem encoding

First, we assume a labelled data set \mathcal{X}=(X,c_{t}), the attack bound \varepsilon, and the robustness bound \alpha are input to the problem. In particular, the data set X=\{\mathbf{x}^{1},\ldots,\mathbf{x}^{|X|}\}\subseteq\mathbb{R}^{d} has data points \mathbf{x}^{j}=(x_{1}^{j},\ldots,x_{d}^{j})\in\mathbb{R}^{d} for 1\leq j\leq|X|. Furthermore, we assume the number |A| of attacks that shall be computed is fixed. Recall that, to show that the set of classifiers is robust, we can compute a sufficient bound on the number of attacks — see Sec. 4.1.

For readability, we assume the classifiers C and the loss function \ell_{\mathrm{0-1}} are given as functions that can directly be used in the encodings, and we will use the absolute value |x| for x\in\mathbb{R}. Afterwards, we discuss how to actually encode classifiers and the loss function. We use the following variables:

  • For the attacks from A, we introduce \delta_{1},\ldots,\delta_{|A|} with \delta_{i}\in\mathbb{R}^{|X|\times d} for 1\leq i\leq|A|. Specifically, \delta_{i} shall be assigned all attack values for the i-th attack from A. That is, \delta_{i}^{j} is the attack for the data point \mathbf{x}^{j}=(x_{1}^{j},\ldots,x_{d}^{j})\in\mathbb{R}^{d} with \delta_{i}^{j}=(\delta_{i}^{j,1},\ldots,\delta_{i}^{j,d}) for 1\leq j\leq|X|.

  • We introduce p_{1},\ldots,p_{|A|} to form a probability distribution over deterministic attacks; p_{i} is assigned the probability to execute attack \delta_{i}.

The classifier ensemble C is not (\varepsilon,\alpha)-robust as in Definition 6 if and only if the following constraints are satisfiable.

\forall c\in C.\quad\sum_{j=1}^{|X|}\sum_{i=1}^{|A|}\left(p_{i}\cdot\ell_{\mathrm{0-1}}(c,\mathbf{x}^{j},\delta_{i}^{j})\right)\geq\alpha\cdot|X|   (4)
\forall i\in\{1,\dots,|A|\},\ j\in\{1,\dots,|X|\}.\quad\sum_{k=1}^{d}|\delta^{j,k}_{i}|\leq\varepsilon   (5)
\sum_{i=1}^{|A|}p_{i}=1   (6)
\forall i\in\{1,\dots,|A|\}.\quad p_{i}\geq 0   (7)

Indeed, (4) enforces the misclassification value to be at least α\alpha; (5) ensures an ε\varepsilon-bounded randomized attack; finally, by (6) and (7) the probability variables induce a valid probability distribution.
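A minimal gurobipy sketch of the linear part of this encoding — constraints (5)–(7), with the 1-norm bound linearized via auxiliary variables — is shown below; all sizes are placeholders, and the loss terms needed for (4) require the “big-M” tricks discussed next.

```python
import gurobipy as gp
from gurobipy import GRB

# Placeholder sizes: |A| attacks, |X| data points of dimension d, attack bound eps.
n_attacks, n_points, d, eps = 2, 4, 64, 100.0

m = gp.Model("base-encoding")
delta = m.addVars(n_attacks, n_points, d, lb=-GRB.INFINITY, name="delta")
t = m.addVars(n_attacks, n_points, d, lb=0.0, name="t")        # t >= |delta|
p = m.addVars(n_attacks, lb=0.0, ub=1.0, name="p")             # constraint (7)
m.addConstr(p.sum() == 1.0, name="distribution")               # constraint (6)

# Constraint (5): every attack is eps-bounded in the 1-norm.
for i in range(n_attacks):
    for j in range(n_points):
        for k in range(d):
            m.addConstr(t[i, j, k] >= delta[i, j, k])
            m.addConstr(t[i, j, k] >= -delta[i, j, k])
        m.addConstr(gp.quicksum(t[i, j, k] for k in range(d)) <= eps)

# Constraint (4) additionally needs, per classifier, the products p_i * loss(c, x^j, delta_i^j);
# these are linearized with the big-M trick shown below.
```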

Specific encodings.

For the SMT encoding, we can make use of the native \max(\cdot) operation to implement the absolute value. For the MILP encoding, however, we employ so-called “big-M” tricks to encode max functions and a (restricted) product operation (cf. [7]). Specifically, the product is required to obtain the value resulting from the multiplication of the loss function and the probability variables.

As an example of a “big-M” trick, suppose we have variables a\in\mathbb{Q}\cap[0,1], b\in\{0,1\}, and a constant M\in\mathbb{Q} such that M>a+b. We introduce a variable c\in\mathbb{Q}\cap[0,1] and add the following constraints, which clearly enforce that c=ab.

c\geq a-M(1-b),\quad c\leq a+M(1-b),\quad c\leq 0+Mb.

Note that M can be chosen to be the constant 2 in this case.
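The same constraints can be stated almost verbatim in gurobipy; the sketch below (ours, with placeholder variable names) forces c = ab for a continuous a in [0,1] and a binary b.

```python
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("big-M-product")
a = m.addVar(lb=0.0, ub=1.0, name="a")        # continuous, e.g. a probability p_i
b = m.addVar(vtype=GRB.BINARY, name="b")      # binary, e.g. a 0-1 loss value
c = m.addVar(lb=0.0, ub=1.0, name="c")        # forced to equal a * b
M = 2.0                                       # any M > a + b works here

m.addConstr(c >= a - M * (1 - b))
m.addConstr(c <= a + M * (1 - b))
m.addConstr(c <= M * b)
# b = 1 forces c = a; b = 0 forces c = 0, hence c = a * b in every feasible solution.
```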

Encoding the loss function.

We encode the zero-one loss function from Def. 4 as an if-then-else expression making use of the ground truth classifier c_{t}. We introduce one variable \ell^{c}_{i,j} per classifier c\in C for all attacks \delta_{i}\in A and all data points \mathbf{x}^{j}\in X. In the SMT encoding, we can then define

\ell^{c}_{i,j}=\mathit{ITE}((c(\mathbf{x}^{j}+\delta^{j}_{i})=c_{t}(\mathbf{x}^{j})),0,1)   (8)

so that \ell^{c}_{i,j}=\ell_{\mathrm{0-1}}(c,\mathbf{x}^{j},\delta_{i}^{j}). In our MILP encoding we have to simulate the ITE primitive using constraints similar to the ones mentioned above.
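In Z3's Python API, Equation (8) can be written directly with If; a minimal sketch (ours, with a stand-in term for the classifier output) looks as follows.

```python
from z3 import Real, If, Solver

# Stand-ins: `predicted_label` abbreviates the (arithmetic/ITE) term encoding c(x^j + delta_i^j),
# and `true_label` is the ground-truth label c_t(x^j).
predicted_label = Real("predicted_label")
true_label = 1

loss = Real("loss_c_i_j")                                     # the variable l^c_{i,j}
s = Solver()
s.add(loss == If(predicted_label == true_label, 0, 1))        # Equation (8)
# The loss variables are then summed, weighted by the p_i, inside constraint (4).
```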

5.2 Classifier encoding

As mentioned in the preliminaries, neural networks implement functions by way of layer composition. Intuitively, the input of a layer is given by the output of the previous layer. When values are fed forward, they are multiplied by weights, and bias values are added. Matrix operations realized by a neural network can thus be encoded as linear functions. For max-pooling operations and the ReLU activation function, one can use the native \max(\cdot) operation or implement a maximum using a “big-M” trick. For this, a suitable constant M has to be obtained beforehand (cf. [7]). We also encode the (discrete) convolution operation as a linear function.
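For reference, the standard big-M encoding of a single ReLU neuron y = max(0, x) — a generic sketch under the assumption that M bounds |x| on the relevant domain, not necessarily the exact constraints used by our tool — reads as follows in gurobipy.

```python
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("relu-big-M")
M = 10.0                                   # assumed upper bound on |x| over the input domain
x = m.addVar(lb=-M, ub=M, name="x")        # pre-activation value
y = m.addVar(lb=0.0, ub=M, name="y")       # post-activation value, y = max(0, x)
z = m.addVar(vtype=GRB.BINARY, name="z")   # z = 1 encodes the "active" phase (x >= 0)

m.addConstr(y >= x)
m.addConstr(y <= x + M * (1 - z))
m.addConstr(y <= M * z)
# Together with the lower bound y >= 0, these constraints force y = max(0, x).
```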

6 Experiments

In Sect. 4, we showed that our problem of neural-network robustness verification is NP-hard. Meaningful comparison between approaches, therefore, needs to be experimental. To that end, we use classifiers trained on multiple image data sets and report on the comparison between the SMT and MILP encodings. In what follows, we analyze the running-time behavior of the different verifiers, the generated attacks and the misclassification value for the given data points, and whether a set of classifiers is robust against predefined thresholds.

6.1 Experimental setup

For each experiment, we define a set of classifiers C, our data points \mathcal{X}, the number of attacks |A|, and both the \varepsilon- and \alpha-values. Then, we generate the attacks A and the probability distribution \mathbb{P} using SMT and MILP solvers. If no randomized attack (A,\mathbb{P}) is found (UNSAT), we have shown that our set of classifiers is (\varepsilon,\alpha)-robust with respect to the predefined thresholds.

Toolchain.

Our NN robustness verifier (available at https://tinyurl.com/ensemble-robustness) is implemented as part of a Python 3.x toolchain. We use the SMT solver Z3 [9] and the MILP solver Gurobi [14] with their standard settings. To support arbitrary classifiers, we created a generic pipeline using the TensorFlow API, and support max-pooling layers, convolutional layers, and dense layers with ReLU activation functions [1]. We focus on binary classifiers by defining certain classification boundaries. We train the classifiers using the Adam optimizer [17] as well as stochastic gradient descent.

Data sets.

MNIST consists of 70 000 images of handwritten digits [10] and is widely used for benchmarking in the field of machine learning [16, 8]. We trained classifiers to have a test accuracy of at least 90%.

German traffic sign is a multi-class and single-image classification data set, containing more than 50 000 images and more than 40 classes [25]. Traffic sign recognition and potential attacks are of utmost importance for self-driving cars [27]. We extracted the images for “Give way” and “priority” traffic signs from the data set and trained classifiers on this subset to have an accuracy of at least 80%.

Benchmark Information | SMT | MILP
ID  Name  |C|  |A|  |X|  dim  α  ε  |  Time (s)  Val(A,P)  |  Time (s)  Val(A,P)
2 mnist_0_1 3 2 4 7x7 0.2 100 -TO- 12.43 0.25
4 mnist_0_1_2convs 3 2 4 8x8 0.4 100 -TO- 53.92 0.4
7 mnist_0_1 3 2 4 8x8 0.4 1000 -TO- 0.34 0.4
8 mnist_0_1 3 2 4 8x8 0.9 1000 -TO- 50.09 0.9
9 mnist_0_1 3 3 4 8x8 0.9 60 -TO- 34.02 1
13 mnist_4_5 3 4 4 10x10 0.9 50 -TO- 144.32 1
14 mnist_7_8 3 4 4 6x6 0.1 60 -TO- 18.94 0.25
16 mnist_4_5 3 2 4 10x10 0.1 1000 155.73 0.38 101.16 0.1
17 mnist_4_5 3 3 4 10x10 0.1 80 403.25 0.25 101.47 0.25
18 mnist_4_5 3 2 4 10x10 0.15 80 216.65 0.38 44.26 0.15
19 mnist_4_5 3 2 4 10x10 0.2 100 156.63 0.38 54.36 0.25
22 mnist_7_8 3 2 4 6x6 0.9 0.1 -TO- 4 robust
26 traffic_signs 3 2 4 10x10 1 0.01 -TO- 17 robust
27 traffic_signs 3 2 4 10x10 1 0.1 -TO- -TO-
Table 1: SMT versus MILP
Benchmark Information | MILP | MaxMILP
ID  Name  |C|  |A|  |X|  dim  α  ε  |  Time (s)  Val(A,P)  |  Time (s)  Val(A,P)
1 mnist_0_1 3 2 4 7x7 0.1 1000 57.79 0.25 46.23 1*
3 mnist_0_1_2convs 3 2 4 8x8 0.2 1000 738.76 0.5 93.54 1*
7 mnist_0_1 3 2 4 8x8 0.4 1000 0.34 0.4 0.34 1*
10 mnist_0_1 3 4 4 8x8 0.9 60 51.39 1 51.39 1*
14 mnist_7_8 3 4 4 6x6 0.1 60 18.94 0.25 21.20 1
17 mnist_4_5 3 3 4 10x10 0.1 80 101.47 0.25 88.39 1
20 mnist_3_6 2 9 2 8x8 1 0.005 7 robust 7 robust
21 mnist_7_8 3 2 4 6x6 1 0.1 4 robust 4 robust
24 mnist_0_2 3 27 4 9x9 1 0.01 108 robust 108 robust
25 mnist_0_2 3 30 4 9x9 1 0.01 120 robust 120 robust
28 traffic_signs 3 3 4 10x10 1 0.01 45 robust 45 robust
29 traffic_signs 3 3 4 10x10 1 0.01 –TO– - –TO–
Table 2: MILP versus MaxMILP
Optimal attacks.

As MILP inherently solves optimization problems, we augment the base encoding from Equations (4)–(7) with the following objective function:

\max\quad\sum_{k=1}^{|C|}\sum_{j=1}^{|\mathcal{X}|}\sum_{i=1}^{|A|}\left(p_{i}\cdot\ell_{\mathrm{0-1}}(c_{k},\mathbf{x}^{j},\delta_{i}^{j})\right).

An optimal solution with respect to the objective may yield a randomized attack inducing the maximal misclassification value among all \varepsilon-bounded attacks. (Note that we sum over classifiers instead of minimizing, as required in \mathbf{Val}(A,\mathbb{P}).)
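Assuming product variables prod[k, i, j] representing p_i · \ell_{0-1}(c_k, x^j, \delta_i^j) have already been introduced via the big-M construction (the names here are ours), setting this objective in gurobipy is a one-liner.

```python
import gurobipy as gp
from gurobipy import GRB

def set_maxmilp_objective(m, prod, n_classifiers, n_attacks, n_points):
    # prod[k, i, j] is assumed to already encode p_i * loss(c_k, x^j, delta_i^j).
    obj = gp.quicksum(prod[k, i, j]
                      for k in range(n_classifiers)
                      for i in range(n_attacks)
                      for j in range(n_points))
    m.setObjective(obj, GRB.MAXIMIZE)
```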

Alternative attacks.

Our method generates attacks taking the whole ensemble of classifiers into account, which is computationally harder than just considering single classifiers [15, 12] due to an increased number of constraints and variables in the underlying encodings. To showcase the need for our approach, we implemented two other ways to generate attacks that are based on individual classifiers and subsequently lifted to the whole ensemble. Recall that we assume the attacker does not know which classifier from the ensemble will be chosen.

First, for the classifier ensemble C we compute — using a simplified version of our MILP encoding — an optimal attack \delta_{c} for each classifier c\in C. Each such \delta_{c} maximizes the misclassification value, that is, the loss, for the classifier c. The attack set A_{C}=\{\delta_{c}\mid c\in C\} together with a uniform distribution \mathit{Distr}_{A} over A_{C} forms the so-called uniform attacker (A_{C},\mathit{Distr}_{A}).

Second, to compare with deterministic attacks, we calculate for each attack from A_{C} the misclassification value over all classifiers. The best deterministic attack is any attack from A_{C} inducing a maximal misclassification value.
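Given the per-classifier attacks from A_C and their misclassification counts M_c(\delta), both baselines reduce to simple arithmetic. A Python sketch (ours; M is assumed to be a nested dictionary M[c][delta]) is shown below.

```python
def uniform_attacker_value(M, num_points):
    """Misclassification value of the uniform distribution over the attacks in A_C."""
    classifiers = list(M.keys())
    attacks = list(next(iter(M.values())).keys())
    return min(sum(M[c][d] for d in attacks) / (len(attacks) * num_points)
               for c in classifiers)

def best_deterministic_value(M, num_points):
    """Best single attack from A_C: the largest value min_c M_c(delta) / |X|."""
    classifiers = list(M.keys())
    attacks = list(next(iter(M.values())).keys())
    return max(min(M[c][d] for c in classifiers) / num_points for d in attacks)
```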

6.2 Evaluation

We report on our experimental results using the aforementioned data sets and attacks. For all experiments we used a timeout of 7200 seconds (TO). Each benchmark has an ID, a name, the number |C| of classifiers, the number |A| of attacks, the size |\mathcal{X}| of the data set, the dimension of the image (dim), the robustness bound \alpha, and the attack bound \varepsilon. The names of the MNIST benchmarks are of the form “mnist_x_y”, where x and y are the labels; the additional suffix “_nconvs” indicates that the classifier has n convolutional layers. We provide an excerpt of our experiments; the full tables are available in the appendix.

SMT versus MILP.

In Table 1, we report on the comparison between SMT and MILP. Note that for these benchmarks, the MILP solver just checks the feasibility of the constraints without an objective function. We list for both solvers the time in seconds and the misclassification value \mathbf{Val}(A,\mathbb{P}), rounded to 2 decimal places, for the generated randomized attack (A,\mathbb{P}), if it can be computed before the timeout. If there is no solution, the classifier ensemble C is (\varepsilon,\alpha)-robust, and instead of a misclassification value we list “robust”.

We observe that SMT is only able to find solutions within the timeout for small \alpha and large \varepsilon values. Moreover, if the ensemble is robust, that is, the problem is not satisfiable, the SMT solver does not terminate for any benchmark. Nevertheless, for some benchmarks (see Table 1, entries 16–19) the SMT solver yields a higher misclassification value than the MILP solver — that is, it finds a better attack. The MILP solver, on the other hand, solves most of our benchmarks within less than a minute, including the robust ones. Despite the reasonably low timeout of 7200 seconds, we are thus able to verify the robustness of NNs with around 6 layers. Running times also visibly differ with other factors such as layer types.

MILP versus MaxMILP.

Table 2 compares some of the MILP results to those where we optimize the mentioned objective function, denoted by MaxMILP. The MILP solver Gurobi offers the possibility of so-called callbacks, that is, while an intermediate solution is not proven to be optimal, it may already be feasible. In case optimality cannot be shown within the timeout, we list the current (feasible) solution, and mark optimal solutions with *. The misclassification value for the MaxMILP solver is always 1. For robust ensembles, it is interesting to see that the MaxMILP encoding sometimes needs less time.

MaxMILP versus alternative attacks.
Benchmark Information | UA | BDA | MaxMILP
ID  Name  |C|  |A|  |X|  dim  α  ε  |  Val(A,P)  |  Val(A,P)  |  Val(A,P)
3 mnist_0_1_2convs 3 2 4 8x8 0.2 1000 0.33 0.25 1*
7 mnist_0_1 3 2 4 8x8 0.4 1000 0.33 0.5 1*
8 mnist_0_1 3 2 4 8x8 0.9 1000 0.33 0.5 1*
9 mnist_0_1 3 3 4 8x8 0.9 60 0.33 0.5 1*
10 mnist_0_1 3 4 4 8x8 0.9 60 0.33 0.5 1*
11 mnist_4_5 3 2 4 10x10 0.2 100 0.33 0.25 1*
14 mnist_7_8 3 4 4 6x6 0.1 60 0.33 0.5 1
15 mnist_7_8 3 10 4 6x6 0.1 60 0.33 0.5 1
16 mnist_4_5 3 2 4 10x10 0.1 1000 0.33 0.75 1*
17 mnist_4_5 3 3 4 10x10 0.1 80 0.33 0.5 1
18 mnist_4_5 3 2 4 10x10 0.15 80 0.33 0.5 1*
Table 3: Attacker Comparison

In Table 3, we compare the MaxMILP method to the uniform attacker (UA) and the best deterministic attacker (BDA). We can see that the best deterministic attacker usually achieves higher misclassification values than the uniform attacker, but neither of them is able to reach the actual optimum of 1.

Discussion of the results.

Within the timeout, our method is able to generate optimal results for medium-sized neural networks. The running time is mainly influenced by the number and type of the layers used; in particular, it is governed by convolutional and max-pooling layers, as these involve more matrix operations than dense layers. As expected, larger values of the robustness bound \alpha and smaller values of the attack bound \varepsilon typically increase the running times.

7 Conclusion and Future Work

We presented a new method to formally verify the robustness of an ensemble of classifiers or, conversely, to compute optimal attacks against it. Despite the theoretical hardness, we were able to provide results for meaningful benchmarks, in particular by using MILP solving. In future work, we will render our method more scalable towards a standalone verification tool for neural-network ensembles. Moreover, we will explore settings where we do not have white-box access to the classifiers and employ state-of-the-art classifier-stealing methods.

References

  • [1] Abadi, M.: Tensorflow: learning functions at scale. In: Garrigue, J., Keller, G., Sumii, E. (eds.) ICFP. p. 1. ACM (2016)
  • [2] Abbasi, M., Gagné, C.: Robustness to adversarial examples through an ensemble of specialists. In: ICLR (Workshop). OpenReview.net (2017)
  • [3] Abbasi, M., Rajabi, A., Gagné, C., Bobba, R.B.: Toward adversarial robustness by diversity in an ensemble of specialized deep neural networks. In: Canadian Conference on AI. LNCS, vol. 12109, pp. 1–14. Springer (2020)
  • [4] Akintunde, M.E., Kevorchian, A., Lomuscio, A., Pirovano, E.: Verification of RNN-based neural agent-environment systems. In: AAAI. pp. 6006–6013. AAAI Press (2019)
  • [5] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in AI safety. CoRR abs/1606.06565 (2016)
  • [6] Apt, K.R., Grädel, E.: Lectures in game theory for computer scientists. Cambridge University Press (2011)
  • [7] Bunel, R., Turkaslan, I., Torr, P.H.S., Kohli, P., Mudigonda, P.K.: A unified view of piecewise linear neural network verification. In: NeurIPS. pp. 4795–4804 (2018)
  • [8] Cohen, G., Afshar, S., Tapson, J., van Schaik, A.: EMNIST: extending MNIST to handwritten letters. In: IJCNN. pp. 2921–2926. IEEE (2017)
  • [9] De Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: TACAS. pp. 337–340. Springer (2008)
  • [10] Deng, L.: The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag. 29(6), 141–142 (2012)
  • [11] Dreossi, T., Ghosh, S., Sangiovanni-Vincentelli, A.L., Seshia, S.A.: A formalization of robustness for deep neural networks. CoRR abs/1903.10033 (2019)
  • [12] Ehlers, R.: Formal verification of piece-wise linear feed-forward neural networks. In: ATVA. LNCS, vol. 10482, pp. 269–286. Springer (2017)
  • [13] Freedman, R.G., Zilberstein, S.: Safety in AI-HRI: Challenges complementing user experience quality. In: AAAI Fall Symposium Series (2016)
  • [14] Gurobi Optimization, Inc.: Gurobi optimizer reference manual. http://www.gurobi.com (2013)
  • [15] Katz, G., Barrett, C.W., Dill, D.L., Julian, K., Kochenderfer, M.J.: Reluplex: An efficient SMT solver for verifying deep neural networks. In: CAV. LNCS, vol. 10426, pp. 97–117. Springer (2017)
  • [16] Keysers, D.: Comparison and combination of state-of-the-art techniques for handwritten character recognition: topping the mnist benchmark. arXiv preprint arXiv:0710.2231 (2007)
  • [17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) ICLR (2015), http://arxiv.org/abs/1412.6980
  • [18] Kwiatkowska, M.Z.: Safety verification for deep neural networks with provable guarantees (invited paper). In: CONCUR. LIPIcs, vol. 140, pp. 1:1–1:5. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)
  • [19] Nyholm, S.: The ethics of crashes with self-driving cars: A roadmap, ii. Philosophy Compass 13(7) (2018)
  • [20] Perdomo, J.C., Singer, Y.: Robust attacks against multiple classifiers. CoRR abs/1906.02816 (2019)
  • [21] Pinot, R., Ettedgui, R., Rizk, G., Chevaleyre, Y., Atif, J.: Randomization matters. how to defend against strong adversarial attacks. CoRR abs/2002.11565 (2020)
  • [22] Ranzato, F., Zanella, M.: Robustness verification of decision tree ensembles. In: OVERLAY@AI*IA. vol. 2509, pp. 59–64. CEUR-WS.org (2019)
  • [23] National Science and Technology Council (NSTC): Preparing for the Future of Artificial Intelligence (2016)
  • [24] Singh, G., Gehr, T., Püschel, M., Vechev, M.T.: An abstract domain for certifying neural networks. Proc. ACM Program. Lang. 3(POPL), 41:1–41:30 (2019). https://doi.org/10.1145/3290354, https://doi.org/10.1145/3290354
  • [25] Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C.: The german traffic sign recognition benchmark: a multi-class classification competition. In: IJCNN. pp. 1453–1460. IEEE (2011)
  • [26] Stoica, I., Song, D., Popa, R.A., Patterson, D., Mahoney, M.W., Katz, R., Joseph, A.D., Jordan, M., Hellerstein, J.M., Gonzalez, J.E., et al.: A Berkeley view of systems challenges for AI. CoRR abs/1712.05855 (2017)
  • [27] Yan, C., Xu, W., Liu, J.: Can you trust autonomous vehicles: Contactless attacks against sensors of self-driving vehicle. DEF CON 24(8),  109 (2016)

Appendix A Full Experimental Results

Benchmark Information | SMT | MILP
ID  Name  |C|  |A|  |X|  dim  α  ε  |  Time (s)  Val(A,P)  |  Time (s)  Val(A,P)
1 mnist_0_1 3 2 4 7x7 0.1 1000 -TO- 57.79 0.25
2 mnist_0_1 3 2 4 7x7 0.2 100 -TO- 12.43 0.25
3 mnist_0_1_2convs 3 2 4 8x8 0.2 1000 -TO- 738.76 0.5
4 mnist_0_1_2convs 3 2 4 8x8 0.4 100 -TO- 53.92 0.4
5 mnist_0_1_2convs 3 2 4 8x8 0.9 100 -TO- 44.24 1
6 mnist_0_1_2convs 3 3 4 8x8 0.9 100 -TO- 96.78 1
7 mnist_0_1 3 2 4 8x8 0.4 1000 -TO- 0.34 0.4
8 mnist_0_1 3 2 4 8x8 0.9 1000 -TO- 50.09 0.9
9 mnist_0_1 3 3 4 8x8 0.9 60 -TO- 34.02 1
10 mnist_0_1 3 4 4 8x8 0.9 60 -TO- 51.39 1
11 mnist_4_5 3 2 4 10x10 0.2 100 -TO- 57.92 0.25
12 mnist_4_5 3 4 4 10x10 0.3 50 -TO- 143.86 0.5
13 mnist_4_5 3 4 4 10x10 0.9 50 -TO- 144.32 1
14 mnist_7_8 3 4 4 6x6 0.1 60 -TO- 18.94 0.25
15 mnist_7_8 3 10 4 6x6 0.1 60 -TO- 76.25 0.5
16 mnist_4_5 3 2 4 10x10 0.1 1000 155.73 0.38 101.16 0.1
17 mnist_4_5 3 3 4 10x10 0.1 80 403.25 0.25 101.47 0.25
18 mnist_4_5 3 2 4 10x10 0.15 80 216.65 0.38 44.26 0.15
19 mnist_4_5 3 2 4 10x10 0.2 100 156.63 0.38 54.36 0.25
20 mnist_3_6 2 9 2 8x8 1 0.005 -TO- 7 robust
21 mnist_7_8 3 2 4 6x6 1 0.1 -TO- 4 robust
22 mnist_7_8 3 2 4 6x6 0.9 0.1 -TO- 4 robust
23 mnist_7_8 3 7 4 6x6 0.9 0.3 -TO- 71 robust
24 mnist_0_2 3 27 4 9x9 1 0.01 -TO- 108 robust
25 mnist_0_2 3 30 4 9x9 1 0.01 -TO- 120 robust
26 traffic_signs 3 2 4 10x10 1 0.01 -TO- 17 robust
27 traffic_signs 3 2 4 10x10 1 0.1 -TO- -TO-
28 traffic_signs 3 3 4 10x10 1 0.01 -TO- 45 robust
29 traffic_signs 3 3 4 10x10 1 0.01 -TO- -TO-
Table 4: SMT versus MILP
Benchmark Information | MILP | MaxMILP
ID  Name  |C|  |A|  |X|  dim  α  ε  |  Time (s)  Val(A,P)  |  Time (s)  Val(A,P)
1 mnist_0_1 3 2 4 7x7 0.1 1000 57.79 0.25 46.23 1*
2 mnist_0_1 3 2 4 7x7 0.2 100 12.43 0.25 12.43 1*
3 mnist_0_1_2convs 3 2 4 8x8 0.2 1000 738.76 0.5 93.54 1*
4 mnist_0_1_2convs 3 2 4 8x8 0.4 100 53.92 0.4 53.92 1*
5 mnist_0_1_2convs 3 2 4 8x8 0.9 100 44.24 1 44.24 1*
6 mnist_0_1_2convs 3 3 4 8x8 0.9 100 96.78 1 96.78 1*
7 mnist_0_1 3 2 4 8x8 0.4 1000 0.34 0.4 0.34 1*
8 mnist_0_1 3 2 4 8x8 0.9 1000 50.09 0.9 50.09 1*
9 mnist_0_1 3 3 4 8x8 0.9 60 34.02 1 34.02 1*
10 mnist_0_1 3 4 4 8x8 0.9 60 51.39 1 51.39 1*
11 mnist_4_5 3 2 4 10x10 0.2 100 57.92 0.25 52.95 1*
12 mnist_4_5 3 4 4 10x10 0.3 50 143.86 0.5 158.20 1
13 mnist_4_5 3 4 4 10x10 0.9 50 144.32 1 144.32 1
14 mnist_7_8 3 4 4 6x6 0.1 60 18.94 0.25 21.20 1
15 mnist_7_8 3 10 4 6x6 0.1 60 76.25 0.5 76.25 1
16 mnist_4_5 3 2 4 10x10 0.1 1000 101.16 0.1 106.95 1*
17 mnist_4_5 3 3 4 10x10 0.1 80 101.47 0.25 88.39 1
18 mnist_4_5 3 2 4 10x10 0.15 80 44.26 0.15 52.00 1*
19 mnist_4_5 3 2 4 10x10 0.2 100 54.36 0.25 44.00 1*
20 mnist_3_6 2 9 2 8x8 1 0.005 7 robust 7 robust
21 mnist_7_8 3 2 4 6x6 1 0.1 4 robust 4 robust
22 mnist_7_8 3 2 4 6x6 0.9 0.1 4 robust 4 robust
23 mnist_7_8 3 7 4 6x6 0.9 0.3 71 robust 71 robust
24 mnist_0_2 3 27 4 9x9 1 0.01 108 robust 108 robust
25 mnist_0_2 3 30 4 9x9 1 0.01 120 robust 120 robust
26 traffic_signs 3 2 4 10x10 1 0.01 17 robust 17 robust
27 traffic_signs 3 2 4 10x10 1 0.1 –TO– - –TO–
28 traffic_signs 3 3 4 10x10 1 0.01 45 robust 45 robust
29 traffic_signs 3 3 4 10x10 1 0.01 –TO– - –TO–
Table 5: MILP versus MaxMILP
Benchmark Information | UA | BDA | MaxMILP
ID  Name  |C|  |A|  |X|  dim  α  ε  |  Val(A,P)  |  Val(A,P)  |  Val(A,P)
1 mnist_0_1 3 2 4 7x7 0.1 1000 0.33 0.5 1*
2 mnist_0_1 3 2 4 7x7 0.2 100 0.33 0.5 1*
3 mnist_0_1_2convs 3 2 4 8x8 0.2 1000 0.33 0.5 1*
4 mnist_0_1_2convs 3 2 4 8x8 0.4 100 0.33 0.25 1*
5 mnist_0_1_2convs 3 2 4 8x8 0.9 100 0.33 0.25 1*
6 mnist_0_1_2convs 3 3 4 8x8 0.9 100 0.33 0.25 1*
7 mnist_0_1 3 2 4 8x8 0.4 1000 0.33 0.5 1*
8 mnist_0_1 3 2 4 8x8 0.9 1000 0.33 0.5 1*
9 mnist_0_1 3 3 4 8x8 0.9 60 0.33 0.5 1*
10 mnist_0_1 3 4 4 8x8 0.9 60 0.33 0.5 1*
11 mnist_4_5 3 2 4 10x10 0.2 100 0.33 0.25 1*
12 mnist_4_5 3 4 4 10x10 0.3 50 0.33 0.75 1
13 mnist_4_5 3 4 4 10x10 0.9 50 0.33 0.5 1
14 mnist_7_8 3 4 4 6x6 0.1 60 0.33 0.5 1
15 mnist_7_8 3 10 4 6x6 0.1 60 0.33 0.5 1
16 mnist_4_5 3 2 4 10x10 0.1 1000 0.33 0.75 1*
17 mnist_4_5 3 3 4 10x10 0.1 80 0.33 0.5 1
18 mnist_4_5 3 2 4 10x10 0.15 80 0.33 0.5 1*
19 mnist_4_5 3 2 4 10x10 0.2 100 0.33 0.5 1*
Table 6: Attacker Comparison