
Second-Order Provable Defenses against Adversarial Attacks

Sahil Singla    Soheil Feizi
Abstract

A robustness certificate is the minimum distance of a given input to the decision boundary of the classifier (or a lower bound on it). For any input perturbation with a magnitude smaller than the certificate value, the classification output will provably remain unchanged. Exactly computing the robustness certificates for neural networks is difficult since it requires solving a non-convex optimization. In this paper, we provide computationally-efficient robustness certificates for neural networks with differentiable activation functions in two steps. First, we show that if the eigenvalues of the Hessian of the network are bounded, we can compute a robustness certificate in the $\ell_{2}$ norm efficiently using convex optimization. Second, we derive a computationally-efficient differentiable upper bound on the curvature of a deep network. We also use the curvature bound as a regularization term during the training of the network to boost its certified robustness. Putting these results together leads to our proposed Curvature-based Robustness Certificate (CRC) and Curvature-based Robust Training (CRT). Our numerical results show that CRT leads to significantly higher certified robust accuracy compared to interval-bound propagation (IBP) based training. We achieve certified robust accuracies of 69.79%, 57.78% and 53.19% while IBP-based methods achieve 44.96%, 44.74% and 44.66% on 2, 3 and 4 layer networks, respectively, on the MNIST dataset.

Machine Learning, ICML

1 Introduction

Modern neural networks achieve high accuracy on tasks such as image classification and speech recognition, but are known to be brittle to small, adversarially chosen perturbations of their inputs [46]. A classifier which correctly classifies an image $\mathbf{x}$ can be fooled by an adversary to misclassify an adversarial example $\mathbf{x}+\delta$, such that $\mathbf{x}+\delta$ is indistinguishable from $\mathbf{x}$ to a human. Adversarial examples can also fool systems when they are printed out on paper and photographed with a smart phone [23]. Even in a black box threat model, where the adversary has no access to the model parameters, attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would interpret as a 'yield' or another sign [36]. This trend is worrisome and suggests that adversarial vulnerabilities need to be appropriately addressed before neural networks can be deployed in security-critical applications.

In this work, we propose a new approach for developing provable defenses against $\ell_{2}$-bounded adversarial attacks as well as computing robustness certificates of pre-trained deep networks with differentiable activations. In contrast to the existing certificates [55, 50] that use first-order information (upper and lower bounds on the slope), our approach is based on second-order information (upper and lower bounds on curvature values, i.e., eigenvalues of the Hessian). Our approach is based on two key theoretically-justified steps: First, in Theorems 1 and 2, we show that if the eigenvalues of the Hessian of the network (curvatures of the network) are bounded (globally or locally), we can efficiently compute a robustness certificate and develop a defense method against $\ell_{2}$-bounded adversarial attacks using convex optimization. Second, in Theorem 4, we derive a computationally-efficient differentiable bound on the curvature (eigenvalues of the Hessian) of a deep network. We derive this bound by explicitly characterizing the Hessian of a deep network in Lemma 1.

Although the problem of finding the closest adversarial example to a given point for deep nets leads to a non-convex optimization problem, our proposed Curvature-based Robustness Certificate (CRC), under some verifiable conditions, is able to compute points on the decision boundary that are provably closest to the input. That is, it provides the tightest certificate in those cases. For example, for 2, 3 and 4 layer networks trained on MNIST, we can find provably closest adversarial points in 44.17%, 22.59% and 19.53% of cases, respectively (Table 2). To the best of our knowledge, our method is the first approach that can efficiently compute provably closest adversarial examples for a significant fraction of examples in non-trivial neural networks.

We note that un-regularized networks, especially deep ones, can have large curvature bounds which can lead to small robustness certificates. However, by using the derived curvature bound as a regularizer during training, we significantly decrease curvature values of the network, with little or no decrease in its performance (Table 5(b), Figure 1). Using this technique, our method significantly outperforms interval-bound propagation (IBP) [57, 52] and achieves state-of-the-art certified accuracy (Tables 3 and 4). In particular, our method achieves certified robust accuracies of 69.79%, 57.78% and 53.19% while IBP-based methods achieve 44.96%, 44.74% and 44.66% on 2, 3 and 4 layer networks, respectively, on the MNIST dataset (with similar results for Fashion-MNIST).

Other recent works (e.g. Moosavi Dezfooli et al. [34], Qin et al. [38]) empirically show that using an estimate of the curvature at inputs as a regularizer leads to empirical robustness on par with adversarial training. In this work, however, we use a bound on the absolute value of the curvature (and not an estimate) as a regularizer and show that it results in high certified robustness. Moreover, previous works have tried to certify robustness by bounding the Lipschitz constant of the neural network [46, 37, 56, 1, 20]. Our approach, however, is based on bounding the Lipschitz constant of the gradient, which in turn leads to bounds on the eigenvalues of the Hessian of deep neural networks.

Table 1: A summary of various primal and dual concepts used in the paper. $f$ denotes the function of the decision boundary, i.e. $\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}$ where $y$ is the true label and $t$ is the attack target. $m$ and $M$ are lower and upper bounds on the smallest and largest eigenvalues of the Hessian of $f$, respectively.
Quantity | Certificate problem ($(-)=cert$) | Attack problem ($(-)=attack$)
primal problem, $p^{*}_{(-)}$ | $\min_{f(\mathbf{x})=0}\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{(0)}\|^{2}$ | $\min_{\|\mathbf{x}-\mathbf{x}^{(0)}\|\leq\rho}f(\mathbf{x})$
dual function, $d_{(-)}(\eta)$ | $\min_{\mathbf{x}}\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{(0)}\|^{2}+\eta f(\mathbf{x})$ | $\min_{\mathbf{x}}f(\mathbf{x})+\frac{\eta}{2}(\|\mathbf{x}-\mathbf{x}^{(0)}\|^{2}-\rho^{2})$
When is the dual solvable? | $-1/M\leq\eta\leq-1/m$ | $-m\leq\eta$
dual problem, $d^{*}_{(-)}$ | $\max_{-1/M\leq\eta\leq-1/m}d_{cert}(\eta)$ | $\max_{-m\leq\eta}d_{attack}(\eta)$
When is primal $=$ dual? | $f(\mathbf{x}^{(cert)})=0$ | $\|\mathbf{x}^{(attack)}-\mathbf{x}^{(0)}\|=\rho$

In summary, we make the following contributions:

  • We derive a closed-form expression for the Hessian of a deep network with differentiable activation functions (Lemma 1) and derive bounds on the curvature using this closed-form formula (Theorems 3 and 4).

  • We develop computationally efficient methods for both the robustness certification as well as the adversarial attack problems (Theorems 1 and 2).

  • We provide verifiable conditions under which our method is able to compute points on the decision boundary that are provably closest to the input. Empirically, we show that this condition holds for a significant fraction of examples (Table 2).

  • We show that using our proposed curvature bounds as a regularizer during training leads to improved certified accuracy on 2, 3 and 4 layer networks (on the MNIST and Fashion-MNIST datasets) compared to IBP-based adversarial training [51, 57] (Tables 3 and 4). Our robustness certificate (CRC) outperforms CROWN's certificate [55] significantly when trained with our regularizer (Table 5(b)).

To the best of our knowledge, this is the first work that (a) demonstrates the utility of second-order information for provable robustness, (b) derives a framework to find the exact robustness certificate in the $\ell_{2}$ norm and the exact worst-case adversarial perturbation in an $\ell_{2}$ ball of a given radius under some conditions, and (c) derives an exact closed-form expression for the Hessian and bounds on the curvature values using the same.

2 Related work

In the last couple of years, several empirical defenses have been proposed for training classifiers to be robust against adversarial perturbations [30, 42, 58, 35, 24, 32, 59]. Although these defenses robustify classifiers against particular types of attacks, they can still be vulnerable to stronger attacks [3, 7, 47, 2]. For example, [3] showed that most of the empirical defenses proposed in ICLR 2018 can be broken by developing tailored attacks for each of them.

To end the cycle between defenses and attacks, a line of work on certified defenses has gained attention where the goal is to train classifiers whose predictions are provably robust within some given region [21, 22, 15, 8, 9, 29, 12, 17, 5, 48, 51, 49, 52, 40, 39, 13, 14, 11, 44, 19, 18, 31, 55, 50, 57]. These methods, however, do not scale to large and practical networks used in solving modern machine learning problems. Another line of defense work focuses on randomized smoothing where the prediction is robust within some region around the input with a user-chosen probability [28, 6, 26, 27, 10, 41]. Although these methods can scale to large networks, certifying robustness with probability close to 1 often requires generating a large number of noisy samples around the input, which leads to high inference-time computational complexity. We discuss existing works in more detail in Appendix A.

3 Notation

Consider a fully connected neural network with $L$ layers and $N_{I}$ neurons in the $I^{th}$ layer ($L\geq 2$ and $I\in[L]$) for a multi-label classification problem with $C$ classes ($N_{L}=C$). The corresponding function of the neural network is $\mathbf{z}^{(L)}:\mathbb{R}^{D}\to\mathbb{R}^{C}$ where $D$ is the dimension of the input. For an input $\mathbf{x}$, we use $\mathbf{z}^{(I)}(\mathbf{x})\in\mathbb{R}^{N_{I}}$ and $\mathbf{a}^{(I)}(\mathbf{x})\in\mathbb{R}^{N_{I}}$ to denote the input (before applying the activation function) and output (after applying the activation function) of neurons in the $I^{th}$ hidden layer of the network, respectively. To simplify notation and when no confusion arises, we make the dependency of $\mathbf{z}^{(I)}$ and $\mathbf{a}^{(I)}$ on $\mathbf{x}$ implicit. We define $\mathbf{a}^{(0)}(\mathbf{x})=\mathbf{x}$ and $N_{0}=D$.

With a fully connected architecture, each $\mathbf{z}^{(I)}$ and $\mathbf{a}^{(I)}$ is computed using a transformation matrix $\mathbf{W}^{(I)}\in\mathbb{R}^{N_{I}\times N_{I-1}}$, a bias vector $\mathbf{b}^{(I)}\in\mathbb{R}^{N_{I}}$ and an activation function $\sigma(\cdot)$ as follows:

$\mathbf{z}^{(I)}=\mathbf{W}^{(I)}\mathbf{a}^{(I-1)}+\mathbf{b}^{(I)},\qquad\mathbf{a}^{(I)}=\sigma\left(\mathbf{z}^{(I)}\right)$

We use $(\mathbf{z}^{(L)}_{i}-\mathbf{z}^{(L)}_{j})(\mathbf{x})$ as a shorthand for $\mathbf{z}^{(L)}_{i}(\mathbf{x})-\mathbf{z}^{(L)}_{j}(\mathbf{x})$.

We use $[p]$ to denote the set $\{1,\dotsc,p\}$ and $[p,q],\ p\leq q$, to denote the set $\{p,p+1,\dotsc,q\}$. We use lowercase letters $i,j,k$, etc., to denote the index over a vector or rows of a matrix and capital letters $I,J$ to denote the index over layers of the network. The element in the $i^{th}$ position of a vector $\mathbf{v}$ is given by $\mathbf{v}_{i}$, the vector in the $i^{th}$ row of a matrix $\mathbf{A}$ is $\mathbf{A}_{i}$, and the element in the $i^{th}$ row and $j^{th}$ column of $\mathbf{A}$ is $\mathbf{A}_{i,j}$. We use $\|\mathbf{v}\|$ and $\|\mathbf{A}\|$ to denote the 2-norm and the operator 2-norm of the vector $\mathbf{v}$ and the matrix $\mathbf{A}$, respectively. We use $|\mathbf{v}|$ and $|\mathbf{A}|$ to denote the vector and matrix constructed by taking the elementwise absolute values. We use $\lambda_{max}(\mathbf{A})$ and $\lambda_{min}(\mathbf{A})$ to denote the largest and smallest eigenvalues of a symmetric matrix $\mathbf{A}$. We use $diag(\mathbf{v})$ to denote the diagonal matrix constructed by placing each element of $\mathbf{v}$ along the diagonal. We use $\odot$ to denote the Hadamard product and $\mathbf{I}$ to denote the identity matrix. We use $\preccurlyeq$ and $\succcurlyeq$ to denote linear matrix inequalities (LMIs): for two symmetric matrices $\mathbf{A}$ and $\mathbf{B}$, $\mathbf{A}\succcurlyeq\mathbf{B}$ means that $\mathbf{A}-\mathbf{B}$ is positive semi-definite (PSD).

4 Using duality to solve the attack and certificate problems

Consider an input $\mathbf{x}^{(0)}$ with true label $y$ and attack target $t$. In the certificate problem, our goal is to find a lower bound on the minimum $\ell_{2}$ distance between $\mathbf{x}^{(0)}$ and the decision boundary $f(\mathbf{x})=0$ where $f(\mathbf{x})=(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})(\mathbf{x})$. The problem of computing the exact distance (the primal) can be written as:

$p^{*}_{cert}=\min_{f(\mathbf{x})=0}\left[\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}\right]$
$p^{*}_{cert}=\min_{\mathbf{x}}\max_{\eta}\left[\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta f(\mathbf{x})\right]$ (1)

However, solving the above problem can be hard in general. Using the minimax theorem (primal $\geq$ dual), we can write the dual of the above problem as follows:

$p^{*}_{cert}\geq\max_{\eta}d_{cert}(\eta)$
$d_{cert}(\eta)=\min_{\mathbf{x}}\left[\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta f(\mathbf{x})\right]$ (2)

From the theory of duality, we know that $d_{cert}(\eta)$ for each value of $\eta$ gives a lower bound on the exact certification value (the primal solution) $p^{*}_{cert}$. However, since $f$ is non-convex, solving $d_{cert}(\eta)$ for every $\eta$ can be difficult. In the next section, we will prove that the curvature of the function $f$ is bounded globally:

$m\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}f\preccurlyeq M\mathbf{I}\qquad\forall\mathbf{x}\in\mathbb{R}^{D}$ (3)

In this case, we have the following theorem ($d^{*}_{cert}$ is defined in Table 1):

Theorem 1.

$d_{cert}(\eta)$ is a convex optimization problem for $-1/M\leq\eta\leq-1/m$. Moreover, if $\mathbf{x}^{(cert)}$ is the solution to $d^{*}_{cert}$ such that $f(\mathbf{x}^{(cert)})=0$, then $p^{*}_{cert}=d^{*}_{cert}$.

Below, we briefly outline the proof; the full proof is presented in Appendix E.1. The Hessian of the objective function of the dual $d_{cert}(\eta)$, i.e., the function inside the $\min_{\mathbf{x}}$, is given by:

$\nabla_{\mathbf{x}}^{2}\left[\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta f(\mathbf{x})\right]=\mathbf{I}+\eta\nabla^{2}_{\mathbf{x}}f$

From equation (3), we know that the eigenvalues of $\mathbf{I}+\eta\nabla^{2}_{\mathbf{x}}f$ are bounded in $(1+\eta m,1+\eta M)$ if $\eta\geq 0$, and in $(1+\eta M,1+\eta m)$ if $\eta\leq 0$. In both cases, we can see that for $-1/M\leq\eta\leq-1/m$, all eigenvalues will be non-negative, making the objective function convex. When $\mathbf{x}^{(cert)}$ satisfies $f(\mathbf{x}^{(cert)})=0$, we have $d^{*}_{cert}=1/2\|\mathbf{x}^{(cert)}-\mathbf{x}^{(0)}\|^{2}$. Using the duality theorem we have $d^{*}_{cert}\leq p^{*}_{cert}$ and from the definition of $p^{*}_{cert}$, we have $p^{*}_{cert}\leq d^{*}_{cert}$. Combining the two inequalities, we get $p^{*}_{cert}=d^{*}_{cert}$.
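
As a small illustrative example (ours, not from the paper), consider a one-dimensional boundary function $f(x)=1-x^{2}$ with $x^{(0)}=0$, so that $\nabla^{2}_{x}f=-2$ and $m=M=-2$. The admissible dual interval collapses to the single point $\eta=-1/M=-1/m=1/2$, and

$d_{cert}(1/2)=\min_{x}\left[\frac{1}{2}x^{2}+\frac{1}{2}(1-x^{2})\right]=\frac{1}{2},$

with $x^{(cert)}=1$ among the minimizers and $f(x^{(cert)})=0$, so Theorem 1 gives $p^{*}_{cert}=d^{*}_{cert}=1/2$, i.e., a certified distance $\|x^{(cert)}-x^{(0)}\|=1$, which indeed matches the true distance from $0$ to the boundary points $x=\pm 1$.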

Next, we consider the attack problem. The goal here is to find an adversarial example inside an $\ell_{2}$ ball of radius $\rho$ such that $f(\mathbf{x})$ is minimized. Using similar arguments, we can get the following theorem for the attack problem ($p^{*}_{attack}$, $d^{*}_{attack}$ and $d_{attack}$ are defined in Table 1):

Theorem 2.

$d_{attack}(\eta)$ is a convex optimization problem for $-m\leq\eta$. Moreover, if $\mathbf{x}^{(attack)}$ is the solution to $d^{*}_{attack}$ such that $\left\|\mathbf{x}^{(attack)}-\mathbf{x}^{(0)}\right\|=\rho$, then $p^{*}_{attack}=d^{*}_{attack}$.

The proof is presented in Appendix E.2. We note that both Theorems 1 and 2 hold for any non-convex function with continuous gradients. Thus they can also be of interest in problems such as optimization of neural nets.

Using Theorems 1 and 2, we have the following definitions for certification and attack optimizations:

Definition 1.

(Curvature-based Certificate Optimization) Given an input $\mathbf{x}^{(0)}$ with true label $y$ and false target $t$, we define $(\eta^{(cert)},\mathbf{x}^{(cert)})$ as the solution of the following max-min optimization:

$\max_{-1/M\leq\eta\leq-1/m}\min_{\mathbf{x}}\left[\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta f(\mathbf{x})\right]$

We refer to $\left\|\mathbf{x}^{(cert)}-\mathbf{x}^{(0)}\right\|$ as the Curvature-based Robustness Certificate (CRC).

Definition 2.

(Curvature-based Attack Optimization) Given an input $\mathbf{x}^{(0)}$ with label $y$, false target $t$, and the $\ell_{2}$ ball radius $\rho$, we define $(\eta^{(attack)},\mathbf{x}^{(attack)})$ as the solution of the following optimization:

$\max_{\eta\geq-m}\min_{\mathbf{x}}\left[\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)+f(\mathbf{x})\right]$

When 𝐱(attack)\mathbf{x}^{(attack)} is used for training in an adversarial training framework, we call the method the Curvature-based Robust Training (CRT).

A direct implication of Theorems 1 and 2 is that the tightness of our robustness certificate crucially depends on the tightness of our curvature bounds, $m$ and $M$. If $m$ and $M$ are very large compared to the true eigenvalue bounds of the Hessian of the network, the resulting robustness certificate will be vacuous. In Table 5(b) (and Figure 1), we show that by adding the derived bound as a regularization term during training, we can significantly decrease the curvature bounds of the network, with little or no decrease in its performance. This leads to high robustness certifications against adversarial attacks.

5 Curvature Bounds for deep networks

In this section, we provide a computationally efficient approach to compute the curvature bounds for neural networks with differentiable activation functions. To the best of our knowledge, there is no prior work on finding provable bounds on the curvature values of deep neural networks.

5.1 Closed form expression for the Hessian

Using the chain rule for second derivatives, we can derive $\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}$ as a sum of matrix products:

Lemma 1.

Given an $L$ layer neural network, the Hessian of the $i^{th}$ hidden unit with respect to the input $\mathbf{x}$, i.e. $\nabla_{\mathbf{x}}^{2}\mathbf{z}^{(L)}_{i}$, is given by the following formula:

$\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}=\sum_{I=1}^{L-1}\left(\mathbf{B}^{(I)}\right)^{T}diag\left(\mathbf{F}^{(L,I)}_{i}\odot\sigma^{\prime\prime}\left(\mathbf{z}^{(I)}\right)\right)\mathbf{B}^{(I)}$

where $\mathbf{B}^{(I)}$ is the Jacobian of $\mathbf{z}^{(I)}$ with respect to $\mathbf{x}$ (dimensions $N_{I}\times D$), and $\mathbf{F}^{(L,I)}$ is the Jacobian of $\mathbf{z}^{(L)}$ with respect to $\mathbf{a}^{(I)}$ (dimensions $N_{L}\times N_{I}$).

The proof is presented in Appendix E.3. Using the chain rule, we can compute the $\mathbf{B}^{(I)}$ and $\mathbf{F}^{(L,I)}$ matrices in Lemma 1 recursively as follows:

$\mathbf{B}^{(I)}=\begin{cases}\mathbf{W}^{(1)},&I=1\\ \mathbf{W}^{(I)}diag\left(\sigma^{\prime}\left(\mathbf{z}^{(I-1)}\right)\right)\mathbf{B}^{(I-1)},&I\geq 2\end{cases}$
$\mathbf{F}^{(L,I)}=\begin{cases}\mathbf{W}^{(L)},&I=L-1\\ \mathbf{W}^{(L)}diag\left(\sigma^{\prime}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{F}^{(L-1,I)},&I\leq L-2\end{cases}$

This leads to a fast back-propagation-like method that can be used to compute the Hessian. Note that Lemma 1 only assumes a matrix multiplication operation from $\mathbf{a}^{(I-1)}$ to $\mathbf{z}^{(I)}$. Since a convolution operation can also be expressed as a matrix multiplication, we can directly extend this lemma to deep convolutional networks. Furthermore, Lemma 1 can also be of independent interest in other related problems such as second-order interpretation methods for deep learning (e.g. [45]).
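
As an illustration of Lemma 1 and the recursions above, the following is a minimal NumPy sketch (ours, not the authors' code) that computes the Hessian of one output logit of a small sigmoid network; the function name and the choice of sigmoid activations are assumptions for the example. On small networks, the result can be checked against automatic differentiation (e.g., torch.autograd.functional.hessian).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian_of_logit(weights, biases, x, i):
    """Hessian of z^(L)_i w.r.t. x via Lemma 1 (illustrative sketch, ours).

    weights[I-1], biases[I-1] hold W^(I), b^(I); sigmoid activations assumed."""
    L = len(weights)
    # Forward pass: store pre-activations z^(1), ..., z^(L)
    a, zs = x, []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
    s = [sigmoid(z) * (1 - sigmoid(z)) for z in zs]            # sigma'(z^(I))
    s2 = [sp * (1 - 2 * sigmoid(z)) for sp, z in zip(s, zs)]   # sigma''(z^(I))
    # B^(I): Jacobian of z^(I) w.r.t. x, via the forward recursion
    B = [weights[0]]
    for I in range(2, L):
        B.append(weights[I - 1] @ np.diag(s[I - 2]) @ B[-1])
    # F^(L,I): Jacobian of z^(L) w.r.t. a^(I), accumulated backwards
    F = {L - 1: weights[-1]}
    for I in range(L - 2, 0, -1):
        F[I] = F[I + 1] @ np.diag(s[I]) @ weights[I]
    # Lemma 1: sum over hidden layers of B^T diag(F_i * sigma'') B
    D = x.shape[0]
    H = np.zeros((D, D))
    for I in range(1, L):
        H += B[I - 1].T @ np.diag(F[I][i] * s2[I - 1]) @ B[I - 1]
    return H
```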

5.2 Curvature bounds for Two Layer networks

For a two-layer network, using Lemma 1, $\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)$ is given by:

$\left(\mathbf{W}^{(1)}\right)^{T}diag\left(\left(\mathbf{W}^{(2)}_{y}-\mathbf{W}^{(2)}_{t}\right)\odot\sigma^{\prime\prime}\left(\mathbf{z}^{(1)}\right)\right)\mathbf{W}^{(1)}$

In the above equation, note that only the term $\sigma^{\prime\prime}(\mathbf{z}^{(1)})$ depends on $\mathbf{x}$. We can maximize and minimize each element in the diag term, $(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i})\sigma^{\prime\prime}(\mathbf{z}^{(1)}_{i})$, independently subject to the constraint that $\sigma^{\prime\prime}(\cdot)$ is bounded. Using this procedure, we construct matrices $\mathbf{P}$ and $\mathbf{N}$ that satisfy the properties given in the following theorem:

Theorem 3.

Given a two layer network whose activation function has a bounded second derivative:

$h_{L}\leq\sigma^{\prime\prime}(x)\leq h_{U}\quad\forall x\in\mathbb{R}$

(a) We have the following linear matrix inequalities (LMIs):

$\mathbf{N}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\preccurlyeq\mathbf{P}\qquad\forall\mathbf{x}\in\mathbb{R}^{D}$

(b) If $h_{U}\geq 0$ and $h_{L}\leq 0$, then $\mathbf{P}$ is a PSD matrix and $\mathbf{N}$ is an NSD matrix.

(c) This gives the following global bounds on the eigenvalues of the Hessian:

$m\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\preccurlyeq M\mathbf{I}$ (4)

where $M=\lambda_{max}(\mathbf{P}),\ m=\lambda_{min}(\mathbf{N})$.

$\mathbf{P}$ and $\mathbf{N}$ are independent of $\mathbf{x}$ and defined in equations (58) and (59) in Appendix E.4.

The proof is presented in Appendix E.4. Because power iteration finds the eigenvalue with the largest magnitude, we can use it to find $m$ and $M$ only when $\mathbf{P}$ is PSD and $\mathbf{N}$ is NSD. We solve for $h_{U}$ and $h_{L}$ for the sigmoid, tanh and softplus activation functions in Appendix F and show that this is in fact the case for them.
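
To make the construction concrete, the following is a hedged sketch (ours) of one way to build matrices with the properties stated in Theorem 3, following the elementwise bounding of the diag term described above; the paper's exact definitions of $\mathbf{P}$ and $\mathbf{N}$ are equations (58) and (59) in Appendix E.4 (not reproduced here), and for brevity we use a dense eigendecomposition rather than power iteration.

```python
import numpy as np

def two_layer_curvature_bounds(W1, W2, y, t, h_L, h_U):
    """Candidate P, N and the bounds M, m of Theorem 3 (hedged sketch, ours).

    Each diagonal entry c_i * sigma''(z^(1)_i) lies in
    [min(c_i*h_L, c_i*h_U), max(c_i*h_L, c_i*h_U)], so bounding it elementwise
    yields x-independent matrices sandwiching the two-layer Hessian."""
    c = W2[y] - W2[t]                      # W^(2)_y - W^(2)_t
    p = np.maximum(c * h_L, c * h_U)       # elementwise upper bound
    n = np.minimum(c * h_L, c * h_U)       # elementwise lower bound
    P = W1.T @ np.diag(p) @ W1
    N = W1.T @ np.diag(n) @ W1
    M = np.linalg.eigvalsh(P)[-1]          # largest eigenvalue of P
    m = np.linalg.eigvalsh(N)[0]           # smallest eigenvalue of N
    return P, N, M, m
```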

We note that this result does not hold for ReLU networks since the ReLU function is not differentiable everywhere. However, in Appendix G, we devise a method to compute the certificate for a two layer ReLU network by finding a quadratic function that is a provable lower bound for $\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}$. We show that the resulting method significantly outperforms CROWN-Ada (see Appendix Table 15).

5.3 Curvature bounds for Deep networks

Using Lemma 1, we know that $\nabla_{\mathbf{x}}^{2}\mathbf{z}^{(L)}_{i}$ is a sum of products of the matrices $\mathbf{B}^{(I)}$ and $\mathbf{F}^{(L,I)}_{i}$. Thus, if we can find upper bounds for $\|\mathbf{B}^{(I)}\|$ and $\|\mathbf{F}^{(L,I)}_{i}\|_{\infty}$, we can get upper bounds for $\|\nabla_{\mathbf{x}}^{2}\mathbf{z}^{(L)}_{i}\|$. Using this intuition (the proof is presented in Appendix E.5), we have the following result:

Theorem 4.

Given an $L$ layer neural network whose activation function satisfies:

$|\sigma^{\prime}(x)|\leq g,\ |\sigma^{\prime\prime}(x)|\leq h\qquad\forall x\in\mathbb{R},$

the absolute values of the eigenvalues of $\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}$ are globally bounded by the following quantity:

$\left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\|\leq h\sum_{I=1}^{L-1}\left(r^{(I)}\right)^{2}\max_{j}\left(\mathbf{S}^{(L,I)}_{i,j}\right),\quad\forall\mathbf{x}\in\mathbb{R}^{D}$

where $r^{(I)}$ and $\mathbf{S}^{(L,I)}$ are independent of $\mathbf{x}$ and defined recursively as:

$r^{(I)}=\begin{cases}\left\|\mathbf{W}^{(1)}\right\|,&I=1\\ g\left\|\mathbf{W}^{(I)}\right\|r^{(I-1)},&I\geq 2\end{cases}$ (5)
$\mathbf{S}^{(L,I)}=\begin{cases}\left|\mathbf{W}^{(L)}\right|,&I=L-1\\ g\left|\mathbf{W}^{(L)}\right|\mathbf{S}^{(L-1,I)},&I\leq L-2\end{cases}$ (6)

The above expressions allow for an extremely efficient computation of the curvature bounds for deep networks. We consider simplifications of this result for sigmoid, tanh and softplus activations in Appendix F. The curvature bounds for $\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}$ can be computed by replacing $\mathbf{W}^{(L)}_{i}$ with $\mathbf{W}^{(L)}_{y}-\mathbf{W}^{(L)}_{t}$ in Theorem 4. The resulting bound is independent of $\mathbf{x}$, and only depends on the network weights $\mathbf{W}^{(I)}$, the true label $y$, and the target $t$. We denote it by $K(\mathbf{W},y,t)$. To simplify notation, when no confusion arises we denote it by $K$. In our experiments, for two layer networks, we use $M$, $m$ from Theorem 3 since it provides tighter curvature bounds. For deeper networks ($L\geq 3$), we use $M=K,\ m=-K$.
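
The recursions in equations (5) and (6) translate directly into code. Below is a minimal sketch (ours, not the authors' implementation) of the bound $K(\mathbf{W},y,t)$ for a fully connected network; the function name, the use of NumPy, and the way the weight list is passed in are assumptions for the example, and $g$, $h$ are the activation-dependent constants derived in Appendix F (e.g., $g=1/4$ for sigmoid).

```python
import numpy as np

def curvature_bound(weights, y, t, g, h):
    """Global curvature bound K(W, y, t) from Theorem 4 (sketch, ours).

    weights[i] holds W^(i+1) as a 2-D array; g and h bound |sigma'|, |sigma''|."""
    L = len(weights)
    w_last = weights[-1][y] - weights[-1][t]        # W^(L)_y - W^(L)_t
    # r^(I) from equation (5): products of operator norms
    r = [np.linalg.norm(weights[0], 2)]
    for I in range(2, L):
        r.append(g * np.linalg.norm(weights[I - 1], 2) * r[-1])
    K = 0.0
    for I in range(1, L):                            # I = 1, ..., L-1
        if I == L - 1:
            S = np.abs(w_last)                       # S^(L,L-1) with the (y - t) row
        else:
            S = np.abs(weights[I])                   # |W^(I+1)|
            for J in range(I + 2, L):                # apply g|W^(J)|, J = I+2,...,L-1
                S = g * np.abs(weights[J - 1]) @ S
            S = g * np.abs(w_last) @ S               # final layer uses the (y - t) row
        K += h * (r[I - 1] ** 2) * np.max(S)
    return K
```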

Figure 1: Illustration of lower ($K_{lb}$) and upper ($K_{ub}$) bounds on the curvature of 2 layer (panel (a)) and 3 layer (panel (b)) networks with sigmoid activations trained on MNIST. Without any curvature regularization ($\gamma=0$), curvature bounds increase significantly for deeper networks. Similarly, with $\gamma=0$, networks adversarially trained with PGD have high curvature as well (note the log scale of the $y$-axis). However, using our curvature bound as a regularizer, the bound becomes tight and CRC gives high certificate values (Table 5(b)). We report the curvature bounds ($K_{lb}$ and $K_{ub}$) for networks with different depths in Appendix Table 16.

6 Adversarial training with curvature regularization

Since the term $\mathbf{B}^{(I)}$ in Lemma 1 is the Jacobian of $\mathbf{z}^{(I)}$ with respect to $\mathbf{x}$, $\|\mathbf{B}^{(I)}\|$ is equal to the Lipschitz constant of the neural network constructed from the first $I$ layers of the original network. Finding tight bounds on the Lipschitz constant is an active area of research [16, 50, 43] and the product of the operator norms of the weight matrices is known to be a loose bound on the Lipschitz constant for deep networks. Since we use the same product to compute the bound for $\|\mathbf{B}^{(I)}\|$ in Theorem 4, the resulting curvature bound is likely to be loose for very deep networks.

In Figure 1, we observe the same trend: as the depth of the network increases, the upper bound $K_{ub}$ computed using Theorem 4 becomes significantly larger than the lower bound $K_{lb}$ (computed by taking the maximum of the largest eigenvalue of the Hessian across all test images with label $y$ and the second largest logit $t$, then averaging across different $(y,t)$). However, by regularizing the network to have small curvature during training, the bound becomes significantly tighter. Interestingly, using curvature regularization, even with this loose curvature bound for deep nets, we achieve significantly higher robust accuracy than the current state-of-the-art methods while enjoying significantly higher standard accuracy as well (see Tables 3 and 4).

To regularize the network to have small curvature values, we penalize the curvature bound $K$ during training. To compute the gradient of $K$ with respect to the network weights, note that using Theorem 4, we can compute $K$ using absolute values, matrix multiplications, and operator norms. Since the gradient of the operator norm does not exist in standard libraries, we created a new layer where the gradient of $\|\mathbf{W}^{(I)}\|$, i.e., $\nabla_{\mathbf{W}^{(I)}}\|\mathbf{W}^{(I)}\|$, is given by:

$\nabla_{\mathbf{W}^{(I)}}\|\mathbf{W}^{(I)}\|=\mathbf{u}^{(I)}\left(\mathbf{v}^{(I)}\right)^{T}$
where $\mathbf{W}^{(I)}\mathbf{v}^{(I)}=\|\mathbf{W}^{(I)}\|\mathbf{u}^{(I)}$

Note that $\|\mathbf{W}^{(I)}\|$, $\mathbf{u}^{(I)}$ and $\mathbf{v}^{(I)}$ can be computed using power iteration. Since the network weights do not change significantly during a single training step, we can use the singular vectors $\mathbf{u}^{(I)}$ and $\mathbf{v}^{(I)}$ computed in the previous training step to update $\|\mathbf{W}^{(I)}\|$ using one iteration of the power method. This approach to computing the gradient of the largest singular value of a matrix has also been used in previously published work [33]. Thus, the per-sample loss for training with curvature regularization is given by:

$\ell\left(\mathbf{z}^{(L)}(\mathbf{x}^{(0)}),\ y\right)+\gamma K(\mathbf{W},\ y,\ t)$ (7)

where $\ell$ denotes the cross entropy loss, $y$ is the true label of the input $\mathbf{x}^{(0)}$, $t$ is the attack target and $\gamma$ is the regularization coefficient for penalizing large curvature values. Similar to adversarial training, in CRT, we use $\mathbf{x}^{(attack)}$ instead of $\mathbf{x}^{(0)}$ in equation (7).
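
As a sketch of how these pieces fit together in PyTorch (ours; the authors' code and exact function names may differ), the one-step power-iteration spectral norm below lets autograd recover the $\mathbf{u}^{(I)}(\mathbf{v}^{(I)})^{T}$ gradient automatically, and the per-sample objective follows equation (7); assembling $K$ from these norms would follow Theorem 4.

```python
import torch
import torch.nn.functional as F

def spectral_norm(W, u):
    """||W|| from one power-iteration step, reusing u from the previous
    training step; the gradient of the returned scalar w.r.t. W is u v^T."""
    with torch.no_grad():
        v = F.normalize(W.t() @ u, dim=0)   # right singular vector estimate
        u = F.normalize(W @ v, dim=0)       # left singular vector estimate
    return u @ W @ v, u                      # sigma ~= ||W||, updated u

def crt_loss(logits, y, K, gamma):
    """Per-sample objective of equation (7): cross entropy plus gamma * K,
    where K is the curvature bound built from spectral_norm outputs and,
    during CRT, logits are computed at x^(attack) instead of x^(0)."""
    return F.cross_entropy(logits, y) + gamma * K
```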

7 Experiments

The certified robust accuracy is the fraction of correctly classified test samples whose robustness certificates (computed using CRC) are greater than a pre-specified radius $\rho$. Unless otherwise specified, we use the class with the second largest logit as the attack target (i.e., the class $t$). The notation ($L\times[1024]$, activation) denotes a neural network with $L$ layers and the specified activation, ($\gamma=c$) denotes standard training with $\gamma$ set to $c$, while (CRT, $c$) denotes CRT training with $\gamma=c$. Certificates are computed over 150 randomly chosen correctly classified images. We use a single NVIDIA GeForce RTX 2080 Ti GPU.

7.1 Fraction of inputs with tightest robustness certificate

Using the verifiable conditions of Theorems 1 and 2, our approach is able to (1) find points that are provably the worst-case adversarial perturbations (in the $\ell_{2}$ norm) in the attack problem and (2) find points on the decision boundary that are provably closest to the input in the $\ell_{2}$ norm in the certification problem. In particular, in Table 2, we observe that for curvature regularized networks, our approach finds provably worst-case adversarial perturbations for all of the inputs with a small drop in accuracy. Moreover, for 2, 3 and 4 layer networks, our method finds provably closest adversarial examples for 44.17%, 22.59% and 19.53% of inputs in the MNIST test set, respectively.

Table 2: Certificate success rate denotes the fraction of points satisfying $\mathbf{z}_{y}-\mathbf{z}_{t}=0$, and attack success rate denotes the fraction of points $\mathbf{x}^{(0)}$ satisfying $\|\mathbf{x}^{(attack)}-\mathbf{x}^{(0)}\|_{2}=\rho=0.5$, implying primal $=$ dual in Theorems 1 and 2, respectively. We use the MNIST dataset.
Network | $\gamma$ | Accuracy | Attack success | Certificate success
$2\times[1024]$, sigmoid | 0.00 | 98.77% | 5.05% | 2.24%
$2\times[1024]$, sigmoid | 0.03 | 98.30% | 100% | 44.17%
$3\times[1024]$, sigmoid | 0.00 | 98.52% | 0% | 0.12%
$3\times[1024]$, sigmoid | 0.05 | 97.60% | 100% | 22.59%
$4\times[1024]$, sigmoid | 0.00 | 98.22% | 0% | 0.01%
$4\times[1024]$, sigmoid | 0.07 | 95.24% | 100% | 19.53%

Note that the technique presented in this work is not applicable to ReLU networks due to the absence of curvature information. However, since verifying the robustness property in an $\ell_{2}$ ball around the input is known to be an NP-complete problem for ReLU networks [22], it is computationally infeasible to even verify that a given adversarial perturbation is the worst-case perturbation in polynomial time unless P=NP. We however show that it is possible to find (and not just verify) the exact worst-case perturbation (and robustness certificate) for neural networks with smooth activation functions. We believe that these theoretical and empirical results provide evidence that bounding the curvature of the network and using smooth activation functions can be critical to achieve high robustness guarantees.

7.2 Comparison with existing provable defenses

We compare against the certified defense techniques proposed in Wong et al. [52] and Zhang et al. [57] in Table 3 for the MNIST dataset [25] and Table 4 for the Fashion-MNIST dataset [53] with $\ell_{2}$ radius $\rho=1.58$. Even though our proposed CRT requires fully differentiable activation functions such as softplus, sigmoid, tanh, etc., we include a comparison with ReLU networks because the methods proposed in Wong et al. [52], Zhang et al. [57] use ReLU. Since CROWN-IBP can be trained using the softplus activation function, we include it in our comparison. A similar comparison with $\ell_{2}$ radius $\rho=0.5$ is given in Appendix Table 8 (MNIST dataset) and Table 9 (Fashion-MNIST dataset). We observe that CRT (certified with CRC) gives significantly higher certified robust accuracy as well as standard accuracy compared to either of the methods on both MNIST and Fashion-MNIST datasets for both values of $\rho$. Since shallow fully connected networks are known to perform poorly on the CIFAR-10 dataset, we do not include those results in our comparison.

Table 3: Comparison between interval-bound propagation based adversarial training methods (COAP, i.e., Convex Outer Adversarial Polytope [52], and CROWN-IBP [57]) and Curvature-based Robust Training (ours) with attack radius $\rho=1.58$ on MNIST. For CROWN-IBP, we vary the final_beta hyperparameter between 0.8 and 3, and use the model with the best certified accuracy. Results with $\rho=0.5$ are in Appendix Table 8.
Network | Training | Standard Accuracy | Certified Robust Accuracy
$2\times[1024]$, softplus | CRT, 0.01 | 98.68% | 69.79%
$2\times[1024]$, softplus | CROWN-IBP | 88.48% | 42.36%
$2\times[1024]$, relu | COAP | 89.33% | 44.29%
$2\times[1024]$, relu | CROWN-IBP | 89.49% | 44.96%
$3\times[1024]$, softplus | CRT, 0.05 | 97.43% | 57.78%
$3\times[1024]$, softplus | CROWN-IBP | 86.58% | 42.14%
$3\times[1024]$, relu | COAP | 89.12% | 44.21%
$3\times[1024]$, relu | CROWN-IBP | 87.77% | 44.74%
$4\times[1024]$, softplus | CRT, 0.07 | 95.60% | 53.19%
$4\times[1024]$, softplus | CROWN-IBP | 82.74% | 41.34%
$4\times[1024]$, relu | COAP | 90.17% | 44.66%
$4\times[1024]$, relu | CROWN-IBP | 84.4% | 43.83%
Table 4: Comparison between COAP [52], CROWN-IBP [57] and Curvature-based Robust Training (ours) with attack radius $\rho=1.58$ on Fashion-MNIST. Results with $\rho=0.5$ are in Appendix Table 9.
Network | Training | Standard Accuracy | Certified Robust Accuracy
$2\times[1024]$, softplus | CRT, 0.01 | 80.31% | 54.39%
$2\times[1024]$, softplus | CROWN-IBP | 69.23% | 47.19%
$2\times[1024]$, relu | COAP | 74.1% | 46.3%
$2\times[1024]$, relu | CROWN-IBP | 70.73% | 48.61%
$3\times[1024]$, softplus | CRT, 0.05 | 78.39% | 53.4%
$3\times[1024]$, softplus | CROWN-IBP | 68.72% | 46.52%
$3\times[1024]$, relu | COAP | 73.9% | 46.3%
$3\times[1024]$, relu | CROWN-IBP | 70.79% | 48.69%
$4\times[1024]$, softplus | CRT, 0.07 | 75.61% | 49.6%
$4\times[1024]$, softplus | CROWN-IBP | 68.31% | 46.21%
$4\times[1024]$, relu | COAP | 73.6% | 45.1%
$4\times[1024]$, relu | CROWN-IBP | 70.21% | 48.08%

In Appendix Table 11, we compare CRT with Randomized Smoothing [10]. For 2 and 3 layer networks, we achieve higher robust accuracy. However, we note that since our certificate is deterministic while the smoothing-based certificate is probabilistic (although it holds with high probability), the results are not directly comparable. As a separate result, we also prove that randomized smoothing bounds the curvature of the network (Theorem 5 in Appendix E.6). We also include a comparison with empirical defense methods, namely PGD and TRADES, in Appendix Table 14.

Table 5: Effect of curvature regularization and CRT on certified robust accuracy and the robustness certificate.
Network | Training | Standard Accuracy | Certified Robust Accuracy
$2\times[1024]$, sigmoid | standard | 98.37% | 54.17%
$2\times[1024]$, sigmoid | $\gamma=0.01$ | 98.08% | 83.53%
$2\times[1024]$, sigmoid | CRT, 0.01 | 98.57% | 95.59%
$3\times[1024]$, sigmoid | standard | 98.37% | 0.00%
$3\times[1024]$, sigmoid | $\gamma=0.01$ | 97.71% | 88.33%
$3\times[1024]$, sigmoid | CRT, 0.01 | 97.23% | 94.99%
$4\times[1024]$, sigmoid | standard | 98.39% | 0.00%
$4\times[1024]$, sigmoid | $\gamma=0.01$ | 97.41% | 89.61%
$4\times[1024]$, sigmoid | CRT, 0.01 | 97.83% | 93.41%
(a) Effect of $\gamma$ on certified robust accuracy
Network | Training | CROWN certificate (mean) | CRC certificate (mean)
$2\times[1024]$, sigmoid | standard | 0.28395 | 0.48500
$2\times[1024]$, sigmoid | $\gamma=0.01$ | 0.32548 | 0.84719
$2\times[1024]$, sigmoid | CRT, 0.01 | 0.43061 | 1.54673
$3\times[1024]$, sigmoid | standard | 0.24644 | 0.06874
$3\times[1024]$, sigmoid | $\gamma=0.01$ | 0.39799 | 1.07842
$3\times[1024]$, sigmoid | CRT, 0.01 | 0.39799 | 1.07842
$4\times[1024]$, sigmoid | standard | 0.19501 | 0.00454
$4\times[1024]$, sigmoid | $\gamma=0.01$ | 0.40620 | 1.05323
$4\times[1024]$, sigmoid | CRT, 0.01 | 0.40327 | 1.06208
(b) Comparison between CROWN-general [54] and CRC.

7.3 Comparison with existing certificates

In Table 5(b), we compare CRC with CROWN-general [54]. For 2-layer networks, CRC outperforms CROWN significantly. For deeper networks, CRC works better than CROWN when the network is trained with curvature regularization. Moreover, even with a small $\gamma=0.01$, we see a significant increase in CRC with only a very small drop in the test accuracy (without any adversarial training). We can see that with $\gamma=0.01$, non-trivial certified accuracies of 83.53%, 88.33% and 89.61% can be achieved on 2, 3 and 4 layer sigmoid networks, respectively, without any adversarial training. Adversarial training using CRT further boosts certified accuracy to 95.59%, 94.99% and 93.41%, respectively. We show some results on the CIFAR-10 dataset in Appendix Table 13. We again observe improvements in the robustness certificate and certified robust accuracy using CRC and CRT.

7.4 Results using local curvature bounds

From Theorems 1 and 2, we can observe that if the curvature is locally bounded within a convex region around the input (we call it the "safe" region), then the corresponding dual problems ($d^{*}_{cert}$, $d^{*}_{attack}$) are again convex optimization problems, provided the optimization trajectory does not escape the safe region.

Theorem 3 can be directly extended to compute the local curvature bound using bounds on the second derivatives, i.e., $\sigma^{\prime\prime}(\mathbf{z}^{(1)})$, in the local region. In Table 6, we show significant improvements in the CRC certificate for two-layer sigmoid networks on the MNIST dataset for $\gamma=0$. However, with curvature regularization, the difference is insignificant. We also observe that the certified accuracy for (CRT, 0.0) improves from 95.04% to 95.31% and for standard training improves from 54.17% to 58.06%. The certified accuracy remains the same for the other cases. Implementation details are in Appendix Section C.6.

Computing local curvature bounds for deeper networks, however, is more challenging due to the presence of terms involving multiplication of first and second derivatives. A straightforward extension of Theorem 4, wherein we compute the upper bounds on $\sigma^{\prime}$ and $\sigma^{\prime\prime}$ in a local region around the input across all neurons in all layers, does not yield significant improvements over the global method; therefore, we do not include those results in our comparison.

Table 6: Comparison between certified robust accuracy and CRC for 2 layer sigmoid and tanh networks using global and local curvature bounds on the MNIST dataset with $\rho=0.5$.
Network | Training | CRC (Global) | CRC (Local)
$2\times[1024]$, sigmoid | standard | 0.5013 | 0.5847
$2\times[1024]$, sigmoid | CRT, 0.0 | 1.0011 | 1.1741
$2\times[1024]$, sigmoid | CRT, 0.01 | 1.5705 | 1.6047
$2\times[1024]$, sigmoid | CRT, 0.02 | 1.6720 | 1.6831

8 Conclusion

In this paper, we develop computationally-efficient convex relaxations for the robustness certification and adversarial attack problems, provided the classifier has bounded curvature. We also show that this convex relaxation is tight under some general verifiable conditions. To be able to use the proposed certification and attack convex optimizations, we derive global curvature bounds for deep networks with differentiable activation functions. This result is a consequence of a closed-form expression that we derive for the Hessian of a deep network. Adversarial training using our attack method coupled with curvature regularization results in significantly higher certified robust accuracy than the existing provable defense methods. Our proposed curvature-based robustness certificate significantly outperforms the CROWN certificate when trained with our regularizer. Scaling up our proposed curvature-based robustness certification and training methods as well as further tightening the derived curvature bounds are interesting directions for future work.

Appendix

Appendix A Related work

Many defenses have been proposed to make neural networks robust against adversarial examples. These methods can be classified into empirical defenses which empirically seem to be robust against known adversarial attacks, and certified defenses, which are provably robust against such attacks.

Empirical defenses The best known empirical defense is adversarial training [24, 30, 58]. In this method, a neural network is trained to minimize the worst-case loss over a region around the input. Although such defenses seem to work on existing attacks, there is no guarantee that a more powerful attack would not break them. In fact, most such defenses proposed in the literature were later broken by stronger attacks [3, 7, 47, 2]. To end this arms race between defenses and attacks, a number of works have tried to focus on certified defenses that have formal robustness guarantees.

Certified defenses A classifier is said to be certifiably robust if one can easily obtain a guarantee that the classifier's prediction remains constant within some region around the input. Such defenses typically rely on certification methods which are either exact or conservative. Exact methods report whether or not there exists an adversarial perturbation inside some $\ell_{p}$ norm ball. In contrast, conservative methods either certify that no adversarial perturbation exists or decline to make a certification; they may decline even when no such perturbation exists. Exact methods are usually based on Satisfiability Modulo Theories [21, 22, 15, 8] and mixed integer linear programming [9, 29, 12, 17, 5]. Unfortunately, they are computationally inefficient and difficult to scale up to even moderately sized neural networks. In contrast, conservative methods are more scalable and efficient, which makes them useful for building certified defenses [48, 51, 49, 52, 40, 39, 13, 14, 11, 44, 19, 18, 31, 55, 50]. However, even these methods have not been shown to scale to practical networks that are large and expressive enough to perform well on ImageNet, for example. To scale to such large networks, randomized smoothing has been proposed as a probabilistically certified defense.

Randomized smoothing Randomized smoothing was previously proposed by several works [28, 6] as an empirical defense without any formal guarantees. [26] first proved robustness guarantees for the randomized smoothing classifier using inequalities from differential privacy. [27] improved upon the same using tools from information theory. Recently, [10] provided an even tighter robustness guarantee for randomized smoothing. [41] proposed a method of adversarial training for the randomized smoothing classifier, giving state-of-the-art results in the $\ell_{2}$ norm metric.

Appendix B The Attack problem

For a given input $\mathbf{x}^{(0)}$ with true label $y$ and attack target $t$, consider the attack problem. We are given that the eigenvalues of the Hessian $\nabla^{2}_{\mathbf{x}}(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})$ are bounded below, i.e.:

$m\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\qquad\forall\mathbf{x}\in\mathbb{R}^{D}$

Here $m<0$ (since $\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}$ is not convex in general).

The goal here is to find an adversarial example inside an $\ell_{2}$ ball of radius $\rho$ such that $(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})(\mathbf{x})$ is minimized. That is, we want to solve the following optimization:

$p^{*}_{attack}=\min_{\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|\leq\rho}\left[\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})\right]$
$=\min_{\mathbf{x}}\max_{\eta\geq 0}\left[\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)\right]$ (8)

This optimization can be hard in general. Using the max-min inequality (primal $\geq$ dual), we have:

$p^{*}_{attack}\geq\max_{\eta\geq 0}d_{attack}(\eta)$
$d_{attack}(\eta)=\min_{\mathbf{x}}\left[\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)\right]$ (9)

We know that for every $\eta\geq 0$, $d_{attack}(\eta)$ gives a lower bound to the primal solution $p^{*}_{attack}$. But solving $d_{attack}(\eta)$ for any $\eta\geq 0$ can be hard unless the objective is convex. We prove that if the eigenvalues of the Hessian are bounded below, i.e.:

$m\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\qquad\forall\mathbf{x}\in\mathbb{R}^{D}$

where, in general, $m<0$ (since $(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})$ is non-convex), then $d_{attack}(\eta)$ is a convex optimization problem for $-m\leq\eta$. Equivalently, the objective function, i.e., the function inside the $\min_{\mathbf{x}}$:

$\left[\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)\right]$

is a convex function in $\mathbf{x}$ for $-m\leq\eta$.
The Hessian of the above function is given by:

$\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)+\eta\mathbf{I}$

Since the eigenvalues of $\nabla^{2}_{\mathbf{x}}(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})$ are bounded below by $m$, the eigenvalues of the above Hessian are $\geq\eta+m$. For $\eta\geq-m$, the eigenvalues are non-negative, implying that the objective function is convex.

Since $d_{attack}(\eta)$ gives a lower bound to $p^{*}_{attack}$ for every $\eta\geq 0$, we get the following result:

$p^{*}_{attack}\geq d^{*}_{attack}\text{ where }d^{*}_{attack}=\max_{-m\leq\eta}d_{attack}(\eta)$ (10)

Note that if $\mathbf{x}^{(attack)}$ is the solution to $d^{*}_{attack}$ such that $\left\|\mathbf{x}^{(attack)}-\mathbf{x}^{(0)}\right\|=\rho$, then by the definition of $d^{*}_{attack}$:

$d^{*}_{attack}=\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(attack)})$

But then, by the definition of $p^{*}_{attack}$, $p^{*}_{attack}\leq d^{*}_{attack}$, implying that the duality gap is zero, i.e., $p^{*}_{attack}=d^{*}_{attack}$. This argument leads to Theorem 2.

Appendix C Implementation Details

C.1 Computing the derivative of largest singular value

Our objective is to compute the derivative of the largest singular value, i.e., $\|\mathbf{W}^{(I)}\|$, with respect to $\mathbf{W}^{(I)}$. Let $\mathbf{u}^{(I)},\mathbf{v}^{(I)}$ be the singular vectors such that $\mathbf{W}^{(I)}\mathbf{v}^{(I)}=\|\mathbf{W}^{(I)}\|\mathbf{u}^{(I)}$. Then the derivative is given by:

$\nabla_{\mathbf{W}^{(I)}}\|\mathbf{W}^{(I)}\|=\mathbf{u}^{(I)}\left(\mathbf{v}^{(I)}\right)^{T}$

$\mathbf{v}^{(I)}$ and $\|\mathbf{W}^{(I)}\|^{2}$ can be computed by running power iteration on $\left(\mathbf{W}^{(I)}\right)^{T}\mathbf{W}^{(I)}$. $\mathbf{u}^{(I)}$ can then be computed using the identity:

$\mathbf{u}^{(I)}=\frac{\mathbf{W}^{(I)}\mathbf{v}^{(I)}}{\|\mathbf{W}^{(I)}\|}$

We use 25 iterations of the power method to compute the above quantities.
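
A minimal NumPy sketch of this procedure (ours; the function name and the random initialization are assumptions) is:

```python
import numpy as np

def singular_triplet(W, iters=25):
    """Power iteration on W^T W, as in C.1: returns the largest singular
    value ||W|| and singular vectors u, v with W v = ||W|| u."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):                 # 25 iterations, as in the text
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    sigma = np.linalg.norm(W @ v)          # ~ ||W||
    u = (W @ v) / sigma
    return sigma, u, v                     # gradient of ||W|| w.r.t. W is u v^T
```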

C.2 Update equation for the certificate problem

Our goal is to minimize $\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|$ such that $\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})=0$. We know that the Hessian satisfies the following LMIs:

$m\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\preccurlyeq M\mathbf{I}$ (11)

$K$ is given by Theorem 4 for a neural network of any depth ($L\geq 2$). For 2 layer networks, $M$ and $m$ are given by Theorem 3, while for deeper networks ($L\geq 3$), $M=K$, $m=-K$. In either case, $K\geq\max(|m|,|M|)$. Thus, we also have:

$-K\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\preccurlyeq K\mathbf{I}$ (12)

We will solve the dual ($d^{*}_{cert}$) of the certificate problem ($p^{*}_{cert}$).

The primal problem ($p^{*}_{cert}$) is given by:

$p^{*}_{cert}=\min_{\mathbf{z}^{(L)}_{y}(\mathbf{x})=\mathbf{z}^{(L)}_{t}(\mathbf{x})}\left[\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}\right]$
$p^{*}_{cert}=\min_{\mathbf{x}}\max_{\eta}\left[\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})\right]$

Using inequality (11) and Theorem 1 part (a), we know that the dual of the above problem is convex when $-1/M\leq\eta\leq-1/m$.

The corresponding dual problem ($d^{*}_{cert}$) is given by:

$d^{*}_{cert}=\max_{-1/M\leq\eta\leq-1/m}d_{cert}(\eta)$
$d_{cert}(\eta)=\min_{\mathbf{x}}\left[\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\left(\mathbf{x}\right)\right]$

For a given $\eta$, we have the following optimization:

$d_{cert}(\eta)=\min_{\mathbf{x}}\left[\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{(0)}\|^{2}+\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})\right]$

We will use majorization-minimization to solve this optimization.

At a point $\mathbf{x}^{(k)}$, we aim to solve for a point $\mathbf{x}^{(k+1)}$ that decreases the objective function. Using Taylor's theorem at the point $\mathbf{x}^{(k)}$, we have:

$\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\left(\mathbf{x}\right)=\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\left(\mathbf{x}^{(k)}\right)+\left(\mathbf{g}^{(k)}\right)^{T}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)+\frac{1}{2}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)^{T}\mathbf{H}^{(\xi)}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)$

where $\mathbf{g}^{(k)}$ is the gradient of $(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})$ at $\mathbf{x}^{(k)}$ and $\mathbf{H}^{(\xi)}$ is the Hessian at a point $\xi$ on the line connecting $\mathbf{x}$ and $\mathbf{x}^{(k)}$.

Multiplying both sides by $\eta$, we get the following equation:

$\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\left(\mathbf{x}\right)=\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\left(\mathbf{x}^{(k)}\right)+\eta\left(\mathbf{g}^{(k)}\right)^{T}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)+\frac{\eta}{2}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)^{T}\mathbf{H}^{(\xi)}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)$ (13)

Using inequality (12), we know that $-K\mathbf{I}\preccurlyeq\mathbf{H}^{(\xi)}\preccurlyeq K\mathbf{I}\quad\forall\xi\in\mathbb{R}^{D}$, so

$\frac{\eta}{2}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)^{T}\mathbf{H}^{(\xi)}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)\leq\frac{\left|\eta K\right|}{2}\left\|\mathbf{x}-\mathbf{x}^{(k)}\right\|^{2}$ (14)

Using equation (13) and inequality (14):

$\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})\leq\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(k)})+\eta\left(\mathbf{g}^{(k)}\right)^{T}(\mathbf{x}-\mathbf{x}^{(k)})+\frac{|\eta K|}{2}\left\|\mathbf{x}-\mathbf{x}^{(k)}\right\|^{2}$

Adding $\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{(0)}\|^{2}$ to both sides, we get the following inequality:

$\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})\leq\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(k)})+\eta\left(\mathbf{g}^{(k)}\right)^{T}(\mathbf{x}-\mathbf{x}^{(k)})+\frac{|\eta K|}{2}\left\|\mathbf{x}-\mathbf{x}^{(k)}\right\|^{2}$

The LHS is the objective function of $d_{cert}(\eta)$ and the RHS is an upper bound on it. In majorization-minimization, we minimize an upper bound on the objective function. Thus, we set the gradient of the RHS with respect to $\mathbf{x}$ to zero and solve for $\mathbf{x}$:

$\nabla_{\mathbf{x}}\left[\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(k)})+\eta\left(\mathbf{g}^{(k)}\right)^{T}(\mathbf{x}-\mathbf{x}^{(k)})+\frac{|\eta K|}{2}\left\|\mathbf{x}-\mathbf{x}^{(k)}\right\|^{2}\right]=0$
$\mathbf{x}-\mathbf{x}^{(0)}+\eta\mathbf{g}^{(k)}+\left|\eta K\right|\left(\mathbf{x}-\mathbf{x}^{(k)}\right)=0$
$(1+\left|\eta K\right|)\mathbf{x}-\mathbf{x}^{(0)}+\eta\mathbf{g}^{(k)}-\left|\eta K\right|\mathbf{x}^{(k)}=0$
$\mathbf{x}=-(1+|\eta K|)^{-1}\left(\eta\mathbf{g}^{(k)}-|\eta K|\mathbf{x}^{(k)}-\mathbf{x}^{(0)}\right)$

This gives the following iterative equation:

$\mathbf{x}^{(k+1)}=-(1+|\eta K|)^{-1}\left(\eta\mathbf{g}^{(k)}-|\eta K|\mathbf{x}^{(k)}-\mathbf{x}^{(0)}\right)$ (15)
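
A minimal sketch (ours, not the authors' code) of this inner loop for a fixed $\eta$ is shown below; here f and grad_f stand for $(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})$ and its gradient (e.g., obtained via autograd), and the outer maximization over $\eta\in[-1/M,-1/m]$ can be done with a one-dimensional search.

```python
import numpy as np

def dual_inner_minimize(x0, f, grad_f, eta, K, steps=200):
    """Approximately solve d_cert(eta) = min_x 0.5*||x - x0||^2 + eta*f(x)
    with the majorization-minimization update of equation (15)."""
    x = x0.copy()
    for _ in range(steps):
        g = grad_f(x)
        # x_{k+1} = -(1 + |eta*K|)^{-1} (eta*g - |eta*K|*x_k - x0)
        x = -(eta * g - abs(eta * K) * x - x0) / (1.0 + abs(eta * K))
    return x, 0.5 * np.sum((x - x0) ** 2) + eta * f(x)
```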

C.3 Update equation for the attack problem

Our goal is to minimize $\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}$ within an $\ell_{2}$ ball of radius $\rho$. We know that the Hessian satisfies the following LMIs:

$m\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\preccurlyeq M\mathbf{I}$ (16)

$K$ is given by Theorem 4 for a neural network of any depth ($L\geq 2$). For 2 layer networks, $M$ and $m$ are given by Theorem 3, while for deeper networks ($L\geq 3$), $M=K$, $m=-K$. In either case, $K\geq\max(|m|,|M|)$. Thus, we also have:

$-K\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\preccurlyeq K\mathbf{I}$ (17)

We solve the dual ($d^{*}_{attack}$) of the attack problem ($p^{*}_{attack}$) for the given radius $\rho$.

The primal problem ($p^{*}_{attack}$) is given by:

$p^{*}_{attack}=\min_{\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|\leq\rho}\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}$
$p^{*}_{attack}=\min_{\mathbf{x}}\max_{\eta\geq 0}\left[\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)\right]$

Using inequality (16) and Theorem 2 part (a), we know that dattack(η)d_{attack}(\eta) is a convex optimization problem when mη-m\leq\eta.

The corresponding dual problem (dattack)(d^{*}_{attack}) is given by:

dattack=maxηmdattack(η)\displaystyle d^{*}_{attack}=\max_{\eta\geq-m}d_{attack}(\eta)

where dattack(η)d_{attack}(\eta) is given as follows:

dattack(η)=min𝐱[\displaystyle d_{attack}(\eta)=\min_{\mathbf{x}}\bigg{[} (𝐳y(L)𝐳t(L))(𝐱)\displaystyle\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})
+η2(𝐱𝐱(0)2ρ2)]\displaystyle+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)\bigg{]}

For a given η\eta, we have the following optimization:

dattack(η)=min𝐱[\displaystyle d_{attack}(\eta)=\min_{\mathbf{x}}\bigg{[} (𝐳y(L)𝐳t(L))(𝐱)\displaystyle\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\left(\mathbf{x}\right)
+η2(𝐱𝐱(0)2ρ2)]\displaystyle+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)\bigg{]}

We will use majorization-minimization to solve this optimization.

At a point 𝐱(k)\mathbf{x}^{(k)}, we have to solve for a point 𝐱(k+1)\mathbf{x}^{(k+1)} that decreases the objective function. Using Taylor's theorem at the point 𝐱(k)\mathbf{x}^{(k)}, we have:

(𝐳y(L)𝐳t(L))(𝐱)\displaystyle\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\left(\mathbf{x}\right)
=(𝐳y(L)𝐳t(L))(𝐱(k))+(𝐠(k))T(𝐱𝐱(k))\displaystyle=\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\left(\mathbf{x}^{(k)}\right)+\left(\mathbf{g}^{(k)}\right)^{T}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)
+12(𝐱𝐱(k))T𝐇(ξ)(𝐱𝐱(k))\displaystyle+\frac{1}{2}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)^{T}\mathbf{H}^{(\xi)}\left(\mathbf{x}-\mathbf{x}^{(k)}\right) (18)

where 𝐠(k)\mathbf{g}^{(k)} is the gradient of (𝐳y(L)𝐳t(L))(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}) at 𝐱(k)\mathbf{x}^{(k)} and 𝐇(ξ)\mathbf{H}^{(\xi)} is the Hessian at a point ξ\xi on the line connecting 𝐱\mathbf{x} and 𝐱(k)\mathbf{x}^{(k)}.

Using inequality (17), we know that K𝐈𝐇(ξ)K𝐈ξD-K\mathbf{I}\preccurlyeq\mathbf{H}^{(\xi)}\preccurlyeq K\mathbf{I}\quad\forall\xi\in\mathbb{R}^{D},

12(𝐱𝐱(k))T𝐇(ξ)(𝐱𝐱(k))K2𝐱𝐱(k)2\displaystyle\frac{1}{2}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)^{T}\mathbf{H}^{(\xi)}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)\leq\frac{K}{2}\left\|\mathbf{x}-\mathbf{x}^{(k)}\right\|^{2} (19)

Using equation (18) and inequality (19):

(𝐳y(L)𝐳t(L))(𝐱)\displaystyle\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})
[(𝐳y(L)𝐳t(L))(𝐱(k))\displaystyle\leq\bigg{[}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(k)})
+(𝐠(k))T(𝐱𝐱(k))+K2𝐱𝐱(k)2]\displaystyle+\left(\mathbf{g}^{(k)}\right)^{T}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)\ +\frac{K}{2}\left\|\mathbf{x}-\mathbf{x}^{(k)}\right\|^{2}\bigg{]}

Adding η/2(𝐱𝐱(0)2ρ2)\eta/2(\|\mathbf{x}-\mathbf{x}^{(0)}\|^{2}-\rho^{2}) to both sides, we get the following inequality:

(𝐳y(L)𝐳t(L))(𝐱)+η2(𝐱𝐱(0)2ρ2)\displaystyle\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)
[(𝐳y(L)𝐳t(L))(𝐱(k))+(𝐠(k))T(𝐱𝐱(k))\displaystyle\leq\ \bigg{[}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(k)})+\left(\mathbf{g}^{(k)}\right)^{T}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)
+K2𝐱𝐱(k)2+η2(𝐱𝐱(0)2ρ2)]\displaystyle+\frac{K}{2}\left\|\mathbf{x}-\mathbf{x}^{(k)}\right\|^{2}+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)\bigg{]}

LHS is the objective function of dattack(η)d_{attack}(\eta) and RHS is an upper bound. In majorization-minimization, we minimize an upper bound on the objective function. Thus we set the gradient of RHS with respect to 𝐱\mathbf{x} to zero and solve for 𝐱\mathbf{x}:

𝐱[\displaystyle\nabla_{\mathbf{x}}\bigg{[} (𝐳y(L)𝐳t(L))(𝐱(k))+(𝐠(k))T(𝐱𝐱(k))\displaystyle\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(k)})+\left(\mathbf{g}^{(k)}\right)^{T}\left(\mathbf{x}-\mathbf{x}^{(k)}\right)
+K2𝐱𝐱(k)2+η2(𝐱𝐱(0)2ρ2)]=0\displaystyle+\frac{K}{2}\left\|\mathbf{x}-\mathbf{x}^{(k)}\right\|^{2}+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)\bigg{]}=0

Rearranging the above equation, we get:

𝐠(k)+K(𝐱𝐱(k))+η(𝐱𝐱(0))=0\displaystyle\mathbf{g}^{(k)}+K\left(\mathbf{x}-\mathbf{x}^{(k)}\right)+\eta\left(\mathbf{x}-\mathbf{x}^{(0)}\right)=0
(K+η)𝐱+𝐠(k)K𝐱(k)η𝐱(0)=0\displaystyle(K+\eta)\mathbf{x}+\mathbf{g}^{(k)}-K\mathbf{x}^{(k)}-\eta\mathbf{x}^{(0)}=0
𝐱=(K+η)1(𝐠(k)K𝐱(k)η𝐱(0))\displaystyle\mathbf{x}=-(K+\eta)^{-1}\left(\mathbf{g}^{(k)}-K\mathbf{x}^{(k)}-\eta\mathbf{x}^{(0)}\right)

This gives the following iterative equation:

𝐱(k+1)=(K+η)1(𝐠(k)K𝐱(k)η𝐱(0))\displaystyle\mathbf{x}^{(k+1)}=-(K+\eta)^{-1}\left(\mathbf{g}^{(k)}-K\mathbf{x}^{(k)}-\eta\mathbf{x}^{(0)}\right) (20)

C.4 Algorithm to compute the certificate

We start with the following initial values of 𝐱,η,ηmin,ηmax\mathbf{x},\ \eta,\ \eta_{min},\ \eta_{max}:

ηmin=1/M,ηmax=1/m\displaystyle\eta_{min}=-1/M,\qquad\eta_{max}=-1/m
η=12(ηmin+ηmax),𝐱=𝐱(0)\displaystyle\eta=\frac{1}{2}(\eta_{min}+\eta_{max}),\qquad\mathbf{x}=\mathbf{x}^{(0)}

To solve the dual for a given value of η\eta, we run 2020 iterations of the following update (derived in Appendix C.2):

𝐱(k+1)=(1+|ηK|)1(η𝐠(k)|ηK|𝐱(k)𝐱(0))\displaystyle\mathbf{x}^{(k+1)}=-(1+|\eta K|)^{-1}\left(\eta\mathbf{g}^{(k)}-|\eta K|\mathbf{x}^{(k)}-\mathbf{x}^{(0)}\right)

To maximize the dual dcert(η)d_{cert}(\eta) over η\eta in the range [1/M,1/m][-1/M,\ -1/m], we use a bisection method: if the solution 𝐱\mathbf{x} for a given value of η\eta satisfies (𝐳y(L)𝐳t(L))(𝐱)>0(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})(\mathbf{x})>0, set ηmin=η\eta_{min}=\eta; otherwise set ηmax=η\eta_{max}=\eta. Set the new η=(ηmin+ηmax)/2\eta=(\eta_{min}+\eta_{max})/2 and repeat. The maximum number of updates to η\eta is set to 30. This method exhibits linear convergence. The routine to compute the certificate is given in Algorithm 1; a NumPy sketch of the same routine follows it.

Algorithm 1 Certificate optimization
0:  input 𝐱(0)\mathbf{x}^{(0)}, label yy, target tt
  m,M,Kcompute_bounds(𝐳y(L)𝐳t(L))m,M,K\leftarrow compute\_bounds(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})
  ηmin1/M\eta_{min}\leftarrow-1/M
  ηmax1/m\eta_{max}\leftarrow-1/m
  η1/2(ηmin+ηmax)\eta\leftarrow 1/2(\eta_{min}+\eta_{max})
  𝐱𝐱(0)\mathbf{x}\leftarrow\mathbf{x}^{(0)}
  for ii in [1,,30][1,\dots,30] do
     for jj in [1,,20][1,\dots,20] do
        𝐠compute_gradient(𝐳y(L)𝐳t(L),𝐱)\mathbf{g}\leftarrow compute\_gradient(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t},\mathbf{x})
        if η𝐠+(𝐱𝐱(0))<105\|\eta\mathbf{g}+(\mathbf{x}-\mathbf{x}^{(0)})\|<10^{-5} then
           break
        end if
        𝐱(1+|ηK|)1(η𝐠|ηK|𝐱𝐱(0))\mathbf{x}\leftarrow-(1+|\eta K|)^{-1}\left(\eta\mathbf{g}-|\eta K|\mathbf{x}-\mathbf{x}^{(0)}\right)
     end for
     if (𝐳y(L)𝐳t(L))(𝐱)>0(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})(\mathbf{x})>0 then
        ηminη\eta_{min}\leftarrow\eta
     else
        ηmaxη\eta_{max}\leftarrow\eta
     end if
     η(ηmin+ηmax)/2\eta\leftarrow(\eta_{min}+\eta_{max})/2
  end for
  return 𝐱\mathbf{x}
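A minimal NumPy sketch of Algorithm 1 is given below. The helpers f (evaluating the logit difference), grad_f (its gradient), and compute_bounds (returning m, M, K) are assumed to be supplied by the caller; they stand in for the corresponding routines and are not part of the paper's code.

import numpy as np

def certificate(x0, f, grad_f, compute_bounds, n_eta=30, n_inner=20, tol=1e-5):
    """Sketch of Algorithm 1: bisection over eta with inner updates from (15)."""
    m, M, K = compute_bounds(f)
    eta_min, eta_max = -1.0 / M, -1.0 / m
    eta = 0.5 * (eta_min + eta_max)
    x = np.array(x0, dtype=float)
    for _ in range(n_eta):                              # bisection over eta
        for _ in range(n_inner):                        # inner loop: update (15)
            g = grad_f(x)
            if np.linalg.norm(eta * g + (x - x0)) < tol:
                break
            x = -(eta * g - abs(eta * K) * x - x0) / (1.0 + abs(eta * K))
        if f(x) > 0:
            eta_min = eta
        else:
            eta_max = eta
        eta = 0.5 * (eta_min + eta_max)
    return x

By equation (23), the returned point gives the certificate value 1/2‖x − x^(0)‖².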

C.5 Algorithm to compute the attack

We start with the following initial values of 𝐱,η,ηmin,ηmax\mathbf{x},\ \eta,\ \eta_{min},\ \eta_{max}:

ηmin=m,ηmax=20(1m)\displaystyle\eta_{min}=-m,\qquad\eta_{max}=20(1-m)
η=12(ηmin+ηmax),𝐱=𝐱(0)\displaystyle\eta=\frac{1}{2}(\eta_{min}+\eta_{max}),\qquad\mathbf{x}=\mathbf{x}^{(0)}

To solve the dual for a given value of η\eta, we run 2020 iterations of the following update (derived in Appendix C.3):

𝐱(k+1)=(K+η)1(𝐠(k)K𝐱(k)η𝐱(0))\displaystyle\mathbf{x}^{(k+1)}=-(K+\eta)^{-1}\left(\mathbf{g}^{(k)}-K\mathbf{x}^{(k)}-\eta\mathbf{x}^{(0)}\right)

To maximize the dual dattack(η)d_{attack}(\eta) over η\eta in the range [m, 20(1m)][-m,\ 20(1-m)], we use a bisection method: if the solution 𝐱\mathbf{x} for a given value of η\eta satisfies 𝐱𝐱(0)ρ\|\mathbf{x}-\mathbf{x}^{(0)}\|\leq\rho, set ηmax=η\eta_{max}=\eta; otherwise set ηmin=η\eta_{min}=\eta. Set the new η=(ηmin+ηmax)/2\eta=(\eta_{min}+\eta_{max})/2 and repeat. The maximum number of updates to η\eta is set to 30. This method exhibits linear convergence. The routine to compute the attack example is given in Algorithm 2; a NumPy sketch of the same routine follows it.

Algorithm 2 Attack optimization
0:  input 𝐱(0)\mathbf{x}^{(0)}, label yy, target tt , radius ρ\rho
  m,M,Kcompute_bounds(𝐳y(L)𝐳t(L))m,M,K\leftarrow compute\_bounds(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})
  ηminm\eta_{min}\leftarrow-m
  ηmax20(1m)\eta_{max}\leftarrow 20(1-m)
  η1/2(ηmin+ηmax)\eta\leftarrow 1/2(\eta_{min}+\eta_{max})
  𝐱𝐱(0)\mathbf{x}\leftarrow\mathbf{x}^{(0)}
  for ii in [1,,30][1,\dots,30] do
     for jj in [1,,20][1,\dots,20] do
        𝐠compute_gradient(𝐳y(L)𝐳t(L),𝐱)\mathbf{g}\leftarrow compute\_gradient(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t},\mathbf{x})
        if 𝐠+η(𝐱𝐱(0))<105\|\mathbf{g}+\eta(\mathbf{x}-\mathbf{x}^{(0)})\|<10^{-5} then
           break
        end if
        𝐱(K+η)1(𝐠K𝐱η𝐱(0))\mathbf{x}\leftarrow-(K+\eta)^{-1}\left(\mathbf{g}-K\mathbf{x}-\eta\mathbf{x}^{(0)}\right)
     end for
     if 𝐱𝐱(0)<ρ\|\mathbf{x}-\mathbf{x}^{(0)}\|<\rho then
        ηmaxη\eta_{max}\leftarrow\eta
     else
        ηminη\eta_{min}\leftarrow\eta
     end if
     η(ηmin+ηmax)/2\eta\leftarrow(\eta_{min}+\eta_{max})/2
  end for
  return 𝐱\mathbf{x}
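The attack routine admits the same kind of sketch; f, grad_f, and compute_bounds play the same (assumed) roles as in the certificate sketch above, and the inner loop uses update (20).

import numpy as np

def attack(x0, rho, f, grad_f, compute_bounds, n_eta=30, n_inner=20, tol=1e-5):
    """Sketch of Algorithm 2: bisection over eta with inner updates from (20)."""
    m, M, K = compute_bounds(f)
    eta_min, eta_max = -m, 20.0 * (1.0 - m)
    eta = 0.5 * (eta_min + eta_max)
    x = np.array(x0, dtype=float)
    for _ in range(n_eta):                              # bisection over eta
        for _ in range(n_inner):                        # inner loop: update (20)
            g = grad_f(x)
            if np.linalg.norm(g + eta * (x - x0)) < tol:
                break
            x = -(g - K * x - eta * x0) / (K + eta)
        if np.linalg.norm(x - x0) < rho:
            eta_max = eta
        else:
            eta_min = eta
        eta = 0.5 * (eta_min + eta_max)
    return x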

C.6 Computing certificate using local curvature bounds

To compute the robustness certificate in a local region around the input, we first compute the certificate using the global bounds on the curvature. Using this certificate as the initial l2l_{2} radius of the safe region, we can refine our certificate: since the curvature bounds computed over this smaller region are no larger than the global bounds, the refined certificate can only increase. We then use the new robustness certificate as the new l2l_{2} radius of the safe region and repeat. We iterate this process 55 times to compute the local version of our robustness certificate (a sketch of the refinement loop is given below).

To ensure that the optimization trajectory does not escape the safe region, whenever a gradient descent step lands outside the "safe" region, we reduce the step size by a factor of two until the step lies inside the region.
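The refinement loop of this subsection can be sketched as follows; local_bounds(x0, radius) (returning curvature bounds valid inside the given l2 ball) and certificate_from_bounds (computing a certificate from those bounds) are assumed helper routines, and the step-size halving safeguard described above is omitted for brevity.

import numpy as np

def local_certificate(x0, certificate_from_bounds, local_bounds, n_refine=5):
    """Sketch of the local refinement in C.6."""
    # Start from the certificate obtained with the global curvature bounds.
    radius = certificate_from_bounds(x0, *local_bounds(x0, np.inf))
    for _ in range(n_refine):
        # Curvature bounds restricted to the current safe ball are no larger,
        # so the recomputed certificate can only grow.
        radius = certificate_from_bounds(x0, *local_bounds(x0, radius))
    return radius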

Appendix D Summary Table comparing our certification method against existing methods

Table 7 provides a summary table comparing our certification method against the existing methods.

Table 7: Comparison of methods for providing provable robustness certification. Note that [10] is a probabilistic certificate.
Method    Activation functions    Norm
[46]      All                     l2l_{2}
[22]      ReLU                    ll_{\infty}
[20]      Differentiable          l2l_{2}
[39]      ReLU                    ll_{\infty}
[51]      ReLU                    ll_{\infty}
[50]      ReLU                    l1,l2,ll_{1},l_{2},l_{\infty}
[55]      All                     l1,l2,ll_{1},l_{2},l_{\infty}
[10]      All                     l2l_{2}
Ours      Differentiable          l2l_{2}

Appendix E Proofs

E.1 Proof of Theorem 1

  1. (a)
    dcert(η)=min𝐱[\displaystyle d_{cert}(\eta)=\min_{\mathbf{x}}\bigg{[} 12𝐱𝐱(0)2\displaystyle\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}
    +η(𝐳y(L)(𝐱)𝐳t(L)(𝐱))]\displaystyle+\eta\left(\mathbf{z}^{(L)}_{y}(\mathbf{x})-\mathbf{z}^{(L)}_{t}(\mathbf{x})\right)\bigg{]}
    𝐱2[12𝐱𝐱(0)2+η(𝐳y(L)(𝐱)𝐳t(L)(𝐱))]\displaystyle\nabla^{2}_{\mathbf{x}}\bigg{[}\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}+\eta\left(\mathbf{z}^{(L)}_{y}(\mathbf{x})-\mathbf{z}^{(L)}_{t}(\mathbf{x})\right)\bigg{]}
    =𝐈+η𝐱2(𝐳y(L)𝐳t(L))\displaystyle=\mathbf{I}+\eta\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)

    We are given that the Hessian 𝐱2(𝐳y(L)𝐳t(L))\nabla^{2}_{\mathbf{x}}(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}) satisfies the following LMIs:

    m𝐈𝐱2(𝐳y(L)𝐳t(L))M𝐈𝐱nm\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\preccurlyeq M\mathbf{I}\qquad\forall\mathbf{x}\in\mathbb{R}^{n}

    The eigenvalues of 𝐈+η𝐱2(𝐳y(L)𝐳t(L))\mathbf{I}+\eta\nabla^{2}_{\mathbf{x}}(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}) are bounded between:

    (1+ηM, 1+ηm), if η<0(1+\eta M,\ 1+\eta m),\text{ if }\eta<0
    (1+ηm, 1+ηM), if η>0(1+\eta m,\ 1+\eta M),\text{ if }\eta>0

    We are given that η\eta satisfies the following inequalities, where m<0,M>0m<0,M>0 since (𝐳y(L)𝐳t(L))(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}) is neither convex nor concave as a function of 𝐱\mathbf{x}:

    1Mη1m,m<0,M>0\frac{-1}{M}\leq\eta\leq\frac{-1}{m},\qquad m<0,M>0

    We have the following inequalities:

    1+ηM0, 1+ηm01+\eta M\geq 0,\ 1+\eta m\geq 0

    Thus, 𝐈+η𝐱2(𝐳y(L)𝐳t(L))\mathbf{I}+\eta\nabla^{2}_{\mathbf{x}}(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}) is a PSD matrix for all 𝐱D\mathbf{x}\in\mathbb{R}^{D} when 1/Mη1/m-1/M\leq\eta\leq-1/m .
    Thus 1/2𝐱𝐱(0)2+η(𝐳y(L)𝐳t(L))(𝐱)1/2\|\mathbf{x}-\mathbf{x}^{(0)}\|^{2}+\eta(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})(\mathbf{x}) is a convex function in 𝐱\mathbf{x} and dcert(η)d_{cert}(\eta) is a convex optimization problem.

  2. (b)

    For every value of η\eta, dcert(η)d_{cert}(\eta) is a lower bound for pcertp^{*}_{cert}. Thus dcert=max1/Mη1/mdcert(η)d^{*}_{cert}=\max_{-1/M\leq\ \eta\ \leq-1/m}d_{cert}(\eta) is a lower bound for pcertp^{*}_{cert}, i.e:

    dcertpcert\displaystyle d^{*}_{cert}\leq p^{*}_{cert} (21)

    Let η(cert),𝐱(cert)\eta^{(cert)},\mathbf{x}^{(cert)} be the solution of the above dual optimization (dcertd^{*}_{cert}) such that

    𝐳y(L)(𝐱(cert))=𝐳t(L)(𝐱(cert))\displaystyle\mathbf{z}^{(L)}_{y}(\mathbf{x}^{(cert)})=\mathbf{z}^{(L)}_{t}(\mathbf{x}^{(cert)}) (22)

    dcertd^{*}_{cert} is given by the following:

    dcert=[\displaystyle d^{*}_{cert}=\bigg{[} 12𝐱(cert)𝐱(0)2\displaystyle\frac{1}{2}\left\|\mathbf{x}^{(cert)}-\mathbf{x}^{(0)}\right\|^{2}
    +η(cert)(𝐳y(L)(𝐱(cert))𝐳t(L)(𝐱(cert)))=0]\displaystyle+\eta^{(cert)}\underbrace{\left(\mathbf{z}^{(L)}_{y}(\mathbf{x}^{(cert)})-\mathbf{z}^{(L)}_{t}(\mathbf{x}^{(cert)})\right)}_{=0}\bigg{]}

    Since we are given that 𝐳y(L)(𝐱(cert))=𝐳t(L)(𝐱(cert))\mathbf{z}^{(L)}_{y}(\mathbf{x}^{(cert)})=\mathbf{z}^{(L)}_{t}(\mathbf{x}^{(cert)}), we get the following equation for dcertd^{*}_{cert}:

    dcert\displaystyle d^{*}_{cert} =12𝐱(cert)𝐱(0)2\displaystyle=\frac{1}{2}\left\|\mathbf{x}^{(cert)}-\mathbf{x}^{(0)}\right\|^{2} (23)

    pcertp^{*}_{cert} is given by the following equation:

    pcert=min𝐳y(L)(𝐱)=𝐳t(L)(𝐱)[12𝐱𝐱(0)2]\displaystyle p^{*}_{cert}=\min_{\mathbf{z}^{(L)}_{y}(\mathbf{x})=\mathbf{z}^{(L)}_{t}(\mathbf{x})}\bigg{[}\frac{1}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}\bigg{]} (24)

    Using equations (22) and (24), 𝐱(cert)\mathbf{x}^{(cert)} is feasible for the minimization defining pcertp^{*}_{cert}, so pcertp^{*}_{cert} is at most the objective value at 𝐱(cert)\mathbf{x}^{(cert)}:

    pcert12𝐱(cert)𝐱(0)2\displaystyle p^{*}_{cert}\leq\frac{1}{2}\left\|\mathbf{x}^{(cert)}-\mathbf{x}^{(0)}\right\|^{2} (25)

    From equation (23), we know that dcert=1/2𝐱(cert)𝐱(0)2d^{*}_{cert}=1/2\|\mathbf{x}^{(cert)}-\mathbf{x}^{(0)}\|^{2}. Thus, we get:

    pcertdcert\displaystyle p^{*}_{cert}\leq d^{*}_{cert} (26)

    Using equation (21) we have dcertpcertd^{*}_{cert}\leq p^{*}_{cert}, and using (26), pcertdcertp^{*}_{cert}\leq d^{*}_{cert}. Combining the two inequalities:

    pcert=dcertp^{*}_{cert}=d^{*}_{cert}

E.2 Proof of Theorem 2

  1. (a)
    dattack(η)=min𝐱[\displaystyle d_{attack}(\eta)=\min_{\mathbf{x}}\bigg{[} (𝐳y(L)𝐳t(L))(𝐱)\displaystyle\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})
    +η2(𝐱𝐱(0)2ρ2)]\displaystyle+\frac{\eta}{2}\left(\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)\bigg{]}
    𝐱2[(𝐳y(L)𝐳t(L))(𝐱)+η2𝐱𝐱(0)2]\displaystyle\nabla^{2}_{\mathbf{x}}\bigg{[}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})+\frac{\eta}{2}\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}\bigg{]}
    =𝐱2(𝐳y(L)𝐳t(L))+η𝐈\displaystyle=\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)+\eta\mathbf{I}

    Since the Hessian 𝐱2(𝐳y(L)𝐳t(L))\nabla^{2}_{\mathbf{x}}(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}) is bounded below:

    m𝐈𝐱2(𝐳y(L)𝐳t(L))𝐱nm\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)\qquad\forall\mathbf{x}\in\mathbb{R}^{n}

    The eigenvalues of 𝐱2(𝐳y(L)𝐳t(L))+η𝐈\nabla^{2}_{\mathbf{x}}(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})+\eta\mathbf{I} are bounded below:

    (m+η)𝐈𝐱2(𝐳y(L)𝐳t(L))+η𝐈(m+\eta)\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)+\eta\mathbf{I}

    Since ηm\eta\geq-m, we have:

    η+m0\eta+m\geq 0

    Thus 𝐱2(𝐳y(L)𝐳t(L))+η𝐈\nabla^{2}_{\mathbf{x}}(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})+\eta\mathbf{I} is a PSD matrix for all 𝐱D\mathbf{x}\in\mathbb{R}^{D} when ηm\eta\geq-m.
    Thus (𝐳y(L)𝐳t(L))(𝐱)+η/2(𝐱𝐱(0)2ρ2)(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})(\mathbf{x})+\eta/2(\|\mathbf{x}-\mathbf{x}^{(0)}\|^{2}-\rho^{2}) is a convex function in 𝐱\mathbf{x} and dattack(η)d_{attack}(\eta) is a convex optimization problem.

  2. (b)

    For every value of η\eta, dattack(η)d_{attack}(\eta) is a lower bound for pattackp^{*}_{attack}. Thus dattack=maxmηdattack(η)d^{*}_{attack}=\max_{-m\leq\eta}d_{attack}(\eta) is a lower bound for pattackp^{*}_{attack}:

    dattackpattack\displaystyle d^{*}_{attack}\leq p^{*}_{attack} (27)

    Let η(attack),𝐱(attack)\eta^{(attack)},\mathbf{x}^{(attack)} be the solution of the above dual optimization (dattackd^{*}_{attack}) such that

    𝐱(attack)𝐱(0)=ρ\displaystyle\left\|\mathbf{x}^{(attack)}-\mathbf{x}^{(0)}\right\|=\rho (28)

    dattackd^{*}_{attack} is given by the following:

    dattack=[\displaystyle d^{*}_{attack}=\bigg{[} (𝐳y(L)𝐳t(L))(𝐱(attack))\displaystyle\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(attack)}) (29)
    +η(attack)2(𝐱(attack)𝐱(0)2ρ2)=0]\displaystyle+\frac{\eta^{(attack)}}{2}\underbrace{\left(\left\|\mathbf{x}^{(attack)}-\mathbf{x}^{(0)}\right\|^{2}-\rho^{2}\right)}_{=0}\bigg{]}

    Since we are given that 𝐱(attack)𝐱(0)=ρ\|\mathbf{x}^{(attack)}-\mathbf{x}^{(0)}\|=\rho, we get the following equation for dattackd^{*}_{attack}:

    dattack\displaystyle d^{*}_{attack} =(𝐳y(L)𝐳t(L))(𝐱(attack))\displaystyle=\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(attack)}) (30)

    pattackp^{*}_{attack} is given by the following equation:

    pattack=min𝐱𝐱(0)ρ[(𝐳y(L)𝐳t(L))(𝐱)]\displaystyle p^{*}_{attack}=\min_{\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|\leq\rho}\bigg{[}\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x})\bigg{]} (31)

    Using equations (28) and (31), 𝐱(attack)\mathbf{x}^{(attack)} is feasible for the minimization defining pattackp^{*}_{attack}, so pattackp^{*}_{attack} is at most the objective value at 𝐱(attack)\mathbf{x}^{(attack)}:

    pattack(𝐳y(L)𝐳t(L))(𝐱(attack))\displaystyle p^{*}_{attack}\leq\left(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t}\right)(\mathbf{x}^{(attack)}) (32)

    From equation (30), we know that dattack=(𝐳y(L)𝐳t(L))(𝐱(attack))d^{*}_{attack}=(\mathbf{z}^{(L)}_{y}-\mathbf{z}^{(L)}_{t})(\mathbf{x}^{(attack)}). Thus, we get:

    pattackdattack\displaystyle p^{*}_{attack}\leq d^{*}_{attack} (33)

    Using equation (27) we have dattackpattackd^{*}_{attack}\leq p^{*}_{attack}, and using (33), pattackdattackp^{*}_{attack}\leq d^{*}_{attack}. Combining the two inequalities:

    pattack=dattackp^{*}_{attack}=d^{*}_{attack}

E.3 Proof of Lemma 1

We have to prove that for an LL-layer neural network, the Hessian of the ithi^{th} hidden unit in the LthL^{th} layer with respect to the input 𝐱\mathbf{x}, i.e., 𝐱2𝐳i(L)\nabla_{\mathbf{x}}^{2}\mathbf{z}^{(L)}_{i}, is given by the following formula:

𝐱2𝐳i(L)=I=1L1(𝐁(I))Tdiag(𝐅i(L,I)σ′′(𝐳(I)))𝐁(I)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}=\sum_{I=1}^{L-1}\left(\mathbf{B}^{(I)}\right)^{T}diag\bigg{(}\mathbf{F}^{(L,I)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\bigg{)}\mathbf{B}^{(I)} (34)

where 𝐁(I)\mathbf{B}^{(I)}, I[L]I\in[L] is a matrix of size NI×DN_{I}\times D defined as follows:

𝐁(I)=[𝐱𝐳1(I),𝐱𝐳2(I),,𝐱𝐳NI(I)]T\displaystyle\mathbf{B}^{(I)}=\bigg{[}\nabla_{\mathbf{x}}\mathbf{z}^{(I)}_{1},\nabla_{\mathbf{x}}\mathbf{z}^{(I)}_{2},\dotsc,\nabla_{\mathbf{x}}\mathbf{z}^{(I)}_{N_{I}}\bigg{]}^{T} (35)

and 𝐅(L,I),I[L1]\mathbf{F}^{(L,I)},\ I\in[L-1] is a matrix of size NL×NIN_{L}\times N_{I} defined as follows:

𝐅(L,I)=[𝐚(I)𝐳1(L),𝐚(I)𝐳2(L),,𝐚(I)𝐳NL(L)]T\displaystyle\mathbf{F}^{(L,I)}=\bigg{[}\nabla_{\mathbf{a}^{(I)}}\mathbf{z}^{(L)}_{1},\nabla_{\mathbf{a}^{(I)}}\mathbf{z}^{(L)}_{2},\dotsc,\nabla_{\mathbf{a}^{(I)}}\mathbf{z}^{(L)}_{N_{L}}\bigg{]}^{T} (36)

𝐱2𝐳i(L)\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i} can be written in terms of the activations of the previous layer using the following formula:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i} =j=1NI1𝐖i,j(L)(𝐱2𝐚j(L1))\displaystyle=\sum_{j=1}^{N_{I-1}}\mathbf{W}^{(L)}_{i,j}\left(\nabla^{2}_{\mathbf{x}}\mathbf{a}^{(L-1)}_{j}\right) (37)

Using the chain rule of the Hessian and 𝐚(I)=σ(𝐳(I))\mathbf{a}^{(I)}=\sigma(\mathbf{z}^{(I)}), we can write 𝐱2𝐚j(L1)\nabla^{2}_{\mathbf{x}}\mathbf{a}^{(L-1)}_{j} in terms of 𝐱𝐳j(L1)\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j} and 𝐱2𝐳j(L1)\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j} as the following:

𝐱2𝐚j(L1)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{a}^{(L-1)}_{j} =σ′′(𝐳j(L1))(𝐱𝐳j(L1))(𝐱𝐳j(L1))T\displaystyle=\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)^{T}
+σ(𝐳j(L1))(𝐱2𝐳j(L1))\displaystyle+\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right) (38)

Replacing 𝐱2𝐚j(L1)\nabla^{2}_{\mathbf{x}}\mathbf{a}^{(L-1)}_{j} using equation (38) into equation (37), we get:

𝐱2(𝐳i(L))=\displaystyle\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{i}\right)=
j=1NL1𝐖i,j(L)[σ′′(𝐳j(L1))(𝐱𝐳j(L1))(𝐱𝐳j(L1))T\displaystyle\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\bigg{[}\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)^{T}
+σ(𝐳j(L1))(𝐱2𝐳j(L1))]\displaystyle+\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)\bigg{]}
𝐱2(𝐳i(L))=\displaystyle\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{i}\right)= (39)
j=1NL1𝐖i,j(L)σ′′(𝐳j(L1))(𝐱𝐳j(L1))(𝐱𝐳j(L1))T\displaystyle\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)^{T}
+j=1NL1𝐖i,j(L)σ(𝐳j(L1))(𝐱2𝐳j(L1))\displaystyle+\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right) (40)

For each I[2,L],iNII\in[2,L],\ i\in N_{I}, we define the matrix 𝐀i(I)\mathbf{A}^{(I)}_{i} as the following:

𝐱2(𝐳i(I))\displaystyle\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(I)}_{i}\right)
=j=1NI1𝐖i,j(I)σ′′(𝐳j(I1))(𝐱𝐳j(I1))(𝐱𝐳j(I1))T𝐀i(I)\displaystyle=\underbrace{\sum_{j=1}^{N_{I-1}}\mathbf{W}^{(I)}_{i,j}\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(I-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(I-1)}_{j}\right)^{T}}_{\mathbf{A}^{(I)}_{i}}
+j=1NI1𝐖i,j(I)σ(𝐳j(I1))(𝐱2𝐳j(I1))\displaystyle+\sum_{j=1}^{N_{I-1}}\mathbf{W}^{(I)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(I-1)}_{j}\right)\left(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(I-1)}_{j}\right) (41)
𝐀i(I)=j=1NI1𝐖i,j(I)σ′′(𝐳j(I1))(𝐱𝐳j(I1))(𝐱𝐳j(I1))T\displaystyle\mathbf{A}^{(I)}_{i}=\sum_{j=1}^{N_{I-1}}\mathbf{W}^{(I)}_{i,j}\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(I-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(I-1)}_{j}\right)^{T} (42)

Substituting 𝐀i(L)\mathbf{A}^{(L)}_{i} using equation (42)\eqref{L_def} into equation (40), we get:

𝐱2(𝐳i(L))\displaystyle\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(L)}_{i}\right) =𝐀i(L)+j=1NL1𝐖i,j(L)σ(𝐳j(L1))(𝐱2𝐳j(L1))=\mathbf{A}^{(L)}_{i}+\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right) (43)

We first simplify the expression for 𝐀i(L)\mathbf{A}^{(L)}_{i}. Note that 𝐀i(L)\mathbf{A}^{(L)}_{i} is a sum of symmetric rank one matrices (𝐱𝐳j(L1))(𝐱𝐳j(L1))T\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)^{T} with the coefficient 𝐖i,j(L)σ′′(𝐳j(L1))\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right) for each jj. We create a diagonal matrix for the coefficients and another matrix 𝐁(L1)\mathbf{B}^{(L-1)} such that each jthj^{th} row of 𝐁(L1)\mathbf{B}^{(L-1)} is the vector 𝐱𝐳j(L1)\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}. This leads to the following equation:

𝐀i(L)\displaystyle\mathbf{A}^{(L)}_{i} =j=1NL1𝐖i,j(L)σ′′(𝐳j(L1))(𝐱𝐳j(L1))(𝐱𝐳j(L1))T\displaystyle=\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)^{T}
=(𝐁(L1))Tdiag(𝐖i(L)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{W}^{(L)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)} (44)

𝐁(I)\mathbf{B}^{(I)} where I[L]I\in[L] is a matrix of size NI×DN_{I}\times D defined as follows:

𝐁(I)\displaystyle\mathbf{B}^{(I)} =[𝐱𝐳1(I),𝐱𝐳2(I),,𝐱𝐳NI(I)]T,I[L]\displaystyle=\bigg{[}\nabla_{\mathbf{x}}\mathbf{z}^{(I)}_{1},\nabla_{\mathbf{x}}\mathbf{z}^{(I)}_{2},\dotsc,\nabla_{\mathbf{x}}\mathbf{z}^{(I)}_{N_{I}}\bigg{]}^{T},\qquad I\in[L]

Thus 𝐁(I)\mathbf{B}^{(I)} is the jacobian of 𝐳(I)\mathbf{z}^{(I)} with respect to the input 𝐱\mathbf{x}.
Using the chain rule of the gradient, we have the following properties of 𝐁(I)\mathbf{B}^{(I)}:

𝐁(1)=𝐖(1)\displaystyle\mathbf{B}^{(1)}=\mathbf{W}^{(1)} (45)
𝐁(I)=𝐖(I)diag(σ(𝐳(I1)))𝐁(I1)\displaystyle\mathbf{B}^{(I)}=\mathbf{W}^{(I)}diag\left(\sigma^{{}^{\prime}}\left(\mathbf{z}^{(I-1)}\right)\right)\mathbf{B}^{(I-1)} (46)

Similarly, 𝐅(I,J)\mathbf{F}^{(I,J)} where I[L],J[I1]I\in[L],\ J\in[I-1] is a matrix of size NI×NJN_{I}\times N_{J} defined as follows:

𝐅(I,J)\displaystyle\mathbf{F}^{(I,J)} =[𝐚(J)𝐳1(I),𝐚(J)𝐳2(I),,𝐚(J)𝐳NI(I)]T\displaystyle=\bigg{[}\nabla_{\mathbf{a}^{(J)}}\mathbf{z}^{(I)}_{1},\nabla_{\mathbf{a}^{(J)}}\mathbf{z}^{(I)}_{2},\dotsc,\nabla_{\mathbf{a}^{(J)}}\mathbf{z}^{(I)}_{N_{I}}\bigg{]}^{T}

Thus 𝐅(I,J)\mathbf{F}^{(I,J)} is the jacobian of 𝐳(I)\mathbf{z}^{(I)} with respect to the activations 𝐚(J)\mathbf{a}^{(J)}.
Using the chain rule of the gradient, we have the following properties for 𝐅(L,I)\mathbf{F}^{(L,I)}:

𝐅(L,L1)=𝐖(L)\displaystyle\mathbf{F}^{(L,L-1)}=\mathbf{W}^{(L)} (47)
𝐅(L,I)=𝐖(L)diag(σ(𝐳(L1)))𝐅(L1,I)\displaystyle\mathbf{F}^{(L,I)}=\mathbf{W}^{(L)}diag\left(\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{F}^{(L-1,I)} (48)

Recall that in our notation, for a matrix 𝐄\mathbf{E}, 𝐄i\mathbf{E}_{i} denotes the column vector constructed by taking the transpose of the ithi^{th} row of the matrix 𝐄\mathbf{E}. Thus the ithi^{th} row of 𝐖(L)\mathbf{W}^{(L)} is (𝐖i(L))T\left(\mathbf{W}^{(L)}_{i}\right)^{T} and the ithi^{th} row of 𝐅(L,I)\mathbf{F}^{(L,I)} is (𝐅i(L,I))T\left(\mathbf{F}^{(L,I)}_{i}\right)^{T}. Equating the ithi^{th} rows in equation (48), we get:

(𝐅i(L,I))T=(𝐖i(L))Tdiag(σ(𝐳(L1)))𝐅(L1,I)\displaystyle\left(\mathbf{F}^{(L,I)}_{i}\right)^{T}=\left(\mathbf{W}^{(L)}_{i}\right)^{T}diag\left(\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{F}^{(L-1,I)}

Taking the transpose of both the sides and expressing the RHS as a summation, we get:

𝐅i(L,I)=((𝐖i(L))Tdiag(σ(𝐳(L1)))𝐅(L1,I))T\displaystyle\mathbf{F}^{(L,I)}_{i}=\left(\left(\mathbf{W}^{(L)}_{i}\right)^{T}diag\left(\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{F}^{(L-1,I)}\right)^{T}
𝐅i(L,I)=j=1NL1𝐖i,j(L)σ(𝐳j(L1))𝐅j(L1,I)\displaystyle\mathbf{F}^{(L,I)}_{i}=\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\mathbf{F}^{(L-1,I)}_{j} (49)

Substituting 𝐖(L)\mathbf{W}^{(L)} using equation (47) into equation (44), we get:

𝐀i(L)\displaystyle\mathbf{A}^{(L)}_{i} =(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)} (50)

Substituting 𝐀i(L)\mathbf{A}^{(L)}_{i} using equation (50) into (43), we get:

𝐱2𝐳i(L)=\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}=
[(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle\bigg{[}\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)}
+j=1NL1𝐖i,j(L)σ(𝐳j(L1))(𝐱2𝐳j(L1))]\displaystyle+\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)\bigg{]} (51)

Thus, equation (51) allows us to write the hessian of ithi^{th} unit at layer LL, i.e (𝐱2𝐳i(L))\left(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right) in terms of the hessian of jthj^{th} unit at layer L1L-1, i.e (𝐱2𝐳j(L1))\left(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right).
We will prove the following using induction:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i} =I=1L1(𝐁(I))Tdiag(𝐅i(L,I)σ′′(𝐳(I)))𝐁(I)\displaystyle=\sum_{I=1}^{L-1}\left(\mathbf{B}^{(I)}\right)^{T}diag\left(\mathbf{F}^{(L,I)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\right)\mathbf{B}^{(I)} (52)

Note that for L=2,𝐱2𝐳j(L1)=0,jN1L=2,\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}=0,\ \forall j\in N_{1}. Thus using (51) we have:

𝐱2𝐳i(2)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(2)}_{i} =(𝐁(1))Tdiag(𝐅i(2,1)σ′′(𝐳(1)))𝐁(1)\displaystyle=\left(\mathbf{B}^{(1)}\right)^{T}diag\left(\mathbf{F}^{(2,1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}\right)\right)\mathbf{B}^{(1)}

Hence the induction hypothesis (52) is true for L=2L=2.
Now we will assume (52) is true for L1L-1. Thus we have:

𝐱2𝐳j(L1)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}
=I=1L2(𝐁(I))Tdiag(𝐅j(L1,I)σ′′(𝐳(I)))𝐁(I)\displaystyle=\sum_{I=1}^{L-2}\left(\mathbf{B}^{(I)}\right)^{T}diag\left(\mathbf{F}^{(L-1,I)}_{j}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\right)\mathbf{B}^{(I)}
jNL1\displaystyle\quad\forall j\in N_{L-1} (53)

We will prove the same for LL.
Using equation (51), we have:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}
=(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)}
+j=1NL1𝐖i,j(L)σ(𝐳j(L1))(𝐱2𝐳j(L1))\displaystyle+\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\left(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j}\right)

In the next set of steps, we will be working with the second term of the above equation, i.e:  j=1NL1𝐖i,j(L)σ(𝐳j(L1))(𝐱2𝐳j(L1))\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}(\mathbf{z}^{(L-1)}_{j})(\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j})
Substituting 𝐱2𝐳j(L1)\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L-1)}_{j} using equation (53) we get:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}
=(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)}
+j=1NL1𝐖i,j(L)σ(𝐳j(L1))[\displaystyle+\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\bigg{[} (54)
I=1L2(𝐁(I))Tdiag(𝐅j(L1,I)σ′′(𝐳(I)))𝐁(I)]\sum_{I=1}^{L-2}\left(\mathbf{B}^{(I)}\right)^{T}diag\left(\mathbf{F}^{(L-1,I)}_{j}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\right)\mathbf{B}^{(I)}\bigg{]}

Combining the two summations in the second term, we get:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}
=(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)}
+j=1NL1I=1L2[𝐖i,j(L)σ(𝐳j(L1))\displaystyle+\sum_{j=1}^{N_{L-1}}\sum_{I=1}^{L-2}\bigg{[}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)
(𝐁(I))Tdiag(𝐅j(L1,I)σ′′(𝐳(I)))𝐁(I)]\displaystyle\left(\mathbf{B}^{(I)}\right)^{T}diag\left(\mathbf{F}^{(L-1,I)}_{j}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\right)\mathbf{B}^{(I)}\bigg{]}

Exchanging the summation over II and summation over jj:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}
=(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)}
+I=1L2j=1NL1𝐖i,j(L)σ(𝐳j(L1))[\displaystyle+\sum_{I=1}^{L-2}\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\bigg{[}
(𝐁(I))Tdiag(𝐅j(L1,I)σ′′(𝐳(I)))𝐁(I)]\displaystyle\left(\mathbf{B}^{(I)}\right)^{T}diag\left(\mathbf{F}^{(L-1,I)}_{j}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\right)\mathbf{B}^{(I)}\bigg{]}

Since 𝐁(I)\mathbf{B}^{(I)} is independent of jj, we take it out of the summation over jj:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}
=(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)}
+I=1L2(𝐁(I))T[\displaystyle+\sum_{I=1}^{L-2}\left(\mathbf{B}^{(I)}\right)^{T}\bigg{[}
j=1NL1𝐖i,j(L)σ(𝐳j(L1))diag(𝐅j(L1,I)σ′′(𝐳(I)))]𝐁(I)\displaystyle\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)diag\left(\mathbf{F}^{(L-1,I)}_{j}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\right)\bigg{]}\mathbf{B}^{(I)}

Using the property, α(diag(𝐮))+β(diag(𝐯))=diag(α𝐮+β𝐯)α,β,𝐮,𝐯n\alpha\left(diag(\mathbf{u})\right)+\beta\left(diag(\mathbf{v})\right)=diag\left(\alpha\mathbf{u}+\beta\mathbf{v}\right)\ \forall\alpha,\beta\in\mathbb{R},\mathbf{u},\mathbf{v}\in\mathbb{R}^{n}; we can move the summation inside the diagonal:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i} =(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)}
+I=1L2(𝐁(I))Tdiag[\displaystyle+\sum_{I=1}^{L-2}\left(\mathbf{B}^{(I)}\right)^{T}diag\bigg{[}
j=1NL1𝐖i,j(L)σ(𝐳j(L1))(𝐅j(L1,I)σ′′(𝐳(I)))]𝐁(I)\displaystyle\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\bigg{(}\mathbf{F}^{(L-1,I)}_{j}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\bigg{)}\bigg{]}\mathbf{B}^{(I)}

Since σ′′(𝐳(I))\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right) is independent of jj, we can take it out of the summation over jj:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i} =(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)}
+I=1L2(𝐁(I))Tdiag[\displaystyle+\sum_{I=1}^{L-2}\left(\mathbf{B}^{(I)}\right)^{T}diag\bigg{[}
(j=1NL1𝐖i,j(L)σ(𝐳j(L1))𝐅j(L1,I))σ′′(𝐳(I))]𝐁(I)\displaystyle\bigg{(}\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\mathbf{F}^{(L-1,I)}_{j}\bigg{)}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\bigg{]}\mathbf{B}^{(I)}

Using equation (49), we can replace j=1NL1𝐖i,j(L)σ(𝐳j(L1))𝐅j(L1,I)\sum_{j=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,j}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{j}\right)\mathbf{F}^{(L-1,I)}_{j} with 𝐅i(L,I)\mathbf{F}^{(L,I)}_{i}:

𝐱2𝐳i(L)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}
=(𝐁(L1))Tdiag(𝐅i(L,L1)σ′′(𝐳(L1)))𝐁(L1)\displaystyle=\left(\mathbf{B}^{(L-1)}\right)^{T}diag\left(\mathbf{F}^{(L,L-1)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(L-1)}\right)\right)\mathbf{B}^{(L-1)}
+I=1L2(𝐁(I))Tdiag(𝐅i(L,I)σ′′(𝐳(I)))𝐁(I)\displaystyle+\sum_{I=1}^{L-2}\left(\mathbf{B}^{(I)}\right)^{T}diag\bigg{(}\mathbf{F}^{(L,I)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\bigg{)}\mathbf{B}^{(I)}
𝐱2𝐳i(L)=I=1L1(𝐁(I))Tdiag(𝐅i(L,I)σ′′(𝐳(I)))𝐁(I)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}=\sum_{I=1}^{L-1}\left(\mathbf{B}^{(I)}\right)^{T}diag\bigg{(}\mathbf{F}^{(L,I)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\bigg{)}\mathbf{B}^{(I)}
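The closed-form Hessian of Lemma 1 can be checked numerically. The sketch below builds a small randomly-weighted 3-layer tanh network (an illustrative choice, not the paper's architecture), assembles the Hessian of one output logit from equations (34) and (45)-(48), and compares it against a finite-difference Hessian.

import numpy as np

rng = np.random.default_rng(0)
D, N1, N2, N3 = 4, 5, 6, 3                      # a small 3-layer tanh network
W1, b1 = rng.normal(size=(N1, D)), rng.normal(size=N1)
W2, b2 = rng.normal(size=(N2, N1)), rng.normal(size=N2)
W3, b3 = rng.normal(size=(N3, N2)), rng.normal(size=N3)

sigma    = np.tanh
d_sigma  = lambda z: 1.0 - np.tanh(z) ** 2
dd_sigma = lambda z: -2.0 * np.tanh(z) * (1.0 - np.tanh(z) ** 2)

def forward(x):
    z1 = W1 @ x + b1
    z2 = W2 @ sigma(z1) + b2
    z3 = W3 @ sigma(z2) + b3
    return z1, z2, z3

def hessian_lemma1(x, i):
    """Hessian of z^{(3)}_i from equation (34), with B and F built via (45)-(48)."""
    z1, z2, _ = forward(x)
    B1  = W1                                    # equation (45)
    B2  = W2 @ np.diag(d_sigma(z1)) @ B1        # equation (46)
    F32 = W3                                    # equation (47)
    F31 = W3 @ np.diag(d_sigma(z2)) @ W2        # equation (48), using F^{(2,1)} = W2
    return (B2.T @ np.diag(F32[i] * dd_sigma(z2)) @ B2
            + B1.T @ np.diag(F31[i] * dd_sigma(z1)) @ B1)

def hessian_fd(x, i, eps=1e-4):
    """Central finite-difference Hessian of z^{(3)}_i, for comparison."""
    f = lambda v: forward(v)[2][i]
    E = np.eye(D) * eps
    H = np.zeros((D, D))
    for p in range(D):
        for q in range(D):
            H[p, q] = (f(x + E[p] + E[q]) - f(x + E[p] - E[q])
                       - f(x - E[p] + E[q]) + f(x - E[p] - E[q])) / (4 * eps ** 2)
    return H

x = rng.normal(size=D)
print(np.abs(hessian_lemma1(x, 0) - hessian_fd(x, 0)).max())   # small (~1e-6 or below)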

E.4 Proof of Theorem 3

Using Lemma 1, we have the following formula for 𝐱2(𝐳y(2)𝐳t(2))\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right):

𝐱2(𝐳y(2)𝐳t(2))\displaystyle\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)
=(𝐖(1))Tdiag((𝐖y(2)𝐖t(2))σ′′(𝐳(1)))𝐖(1)\displaystyle=\left(\mathbf{W}^{(1)}\right)^{T}diag\bigg{(}\left(\mathbf{W}^{(2)}_{y}-\mathbf{W}^{(2)}_{t}\right)\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}\right)\bigg{)}\mathbf{W}^{(1)}
=i=1N1(𝐖y,i(2)𝐖t,i(2))σ′′(𝐳i(1))𝐖i(1)(𝐖i(1))T\displaystyle=\sum_{i=1}^{N_{1}}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)\mathbf{W}^{(1)}_{i}\big{(}\mathbf{W}^{(1)}_{i}\big{)}^{T} (55)

We are also given that the activation function σ\sigma satisfies the following property:

hLσ′′(x)hUx\displaystyle h_{L}\leq\sigma^{{}^{\prime\prime}}(x)\leq h_{U}\quad\forall x\in\mathbb{R} (56)
  1. (a)

    We have to prove the following linear matrix inequalities (LMIs):

    𝐍𝐱2(𝐳y(2)𝐳t(2))𝐏𝐱D\displaystyle\mathbf{N}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\preccurlyeq\mathbf{P}\qquad\forall\mathbf{x}\in\mathbb{R}^{D} (57)

    where 𝐏\mathbf{P} and 𝐍\mathbf{N} are given as following:

    𝐏=i=1N1pi(𝐖y,i(2)𝐖t,i(2))𝐖i(1)(𝐖i(1))T\displaystyle\mathbf{P}=\sum_{i=1}^{N_{1}}p_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T} (58)
    𝐍=i=1N1ni(𝐖y,i(2)𝐖t,i(2))𝐖i(1)(𝐖i(1))T\displaystyle\mathbf{N}=\sum_{i=1}^{N_{1}}n_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T} (59)
    pi={hU,𝐖y,i(2)𝐖t,i(2)0hL,𝐖y,i(2)𝐖t,i(2)0},\displaystyle p_{i}=\left\{\begin{array}[]{@{}lr@{}}h_{U},&\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\geq 0\\ h_{L},&\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\leq 0\\ \end{array}\right\}, (62)
    ni={hL,𝐖y,i(2)𝐖t,i(2)0hU,𝐖y,i(2)𝐖t,i(2)0}\displaystyle n_{i}=\left\{\begin{array}[]{@{}lr@{}}h_{L},&\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\geq 0\\ h_{U},&\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\leq 0\\ \end{array}\right\} (65)

    We first prove: 𝐍𝐱2(𝐳y(2)𝐳t(2))𝐱D\mathbf{N}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\quad\forall\mathbf{x}\in\mathbb{R}^{D}:
    We substitute 𝐱2(𝐳y(2)𝐳t(2))\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right) and 𝐍\mathbf{N} from equations (55) and (59) respectively in 𝐱2(𝐳y(2)𝐳t(2))𝐍\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)-\mathbf{N}:

    𝐱2(𝐳y(2)𝐳t(2))𝐍\displaystyle\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)-\mathbf{N}
    =i=1N1(𝐖y,i(2)𝐖t,i(2))(σ′′(𝐳i(1))ni)𝐖i(1)(𝐖i(1))T\displaystyle=\sum_{i=1}^{N_{1}}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\left(\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)-n_{i}\right)\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T}

    Thus 𝐱2(𝐳y(2)𝐳t(2))𝐍\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)-\mathbf{N} is a weighted sum of symmetric rank one PSD matrices, i.e., 𝐖i(1)(𝐖i(1))T\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T}, and it is PSD if the coefficient of each rank one matrix, i.e., (𝐖y,i(2)𝐖t,i(2))(σ′′(𝐳i(1))ni)\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\left(\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)-n_{i}\right), is non-negative. Using equations (56) and (65), we have the following:

    (𝐖y,i(2)𝐖t,i(2))0ni=hL\displaystyle\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\geq 0\implies n_{i}=h_{L}
    (σ′′(𝐳i(1))ni)0i[N1],𝐱D\displaystyle\implies\left(\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)-n_{i}\right)\geq 0\qquad\forall i\in[N_{1}],\ \forall\mathbf{x}\in\mathbb{R}^{D}
    (𝐖y,i(2)𝐖t,i(2))0ni=hU\displaystyle\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\leq 0\implies n_{i}=h_{U}
    (σ′′(𝐳i(1))ni)0i[N1],𝐱D\displaystyle\implies\left(\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)-n_{i}\right)\leq 0\qquad\forall i\in[N_{1}],\ \forall\mathbf{x}\in\mathbb{R}^{D}

    Putting the above results together we have:

    (𝐖y,i(2)𝐖t,i(2))(σ′′(𝐳i(1))ni)0\displaystyle\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\left(\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)-n_{i}\right)\geq 0
    i[N1],𝐱D\displaystyle\forall i\in[N_{1}],\ \forall\mathbf{x}\in\mathbb{R}^{D} (66)

    Thus 𝐱2(𝐳y(2)𝐳t(2))𝐍\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)-\mathbf{N} is a PSD matrix i.e:

    𝐱2(𝐳y(2)𝐳t(2))𝐍\displaystyle\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)-\mathbf{N}
    =i=1N1(𝐖y,i(2)𝐖t,i(2))(σ′′(𝐳i(1))ni)always non-negative using eq.(66)𝐖i(1)(𝐖i(1))T\displaystyle=\sum_{i=1}^{N_{1}}\underbrace{\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\left(\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)-n_{i}\right)}_{\text{always non-negative using eq.}\ \eqref{n_always_pos}}\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T}
    𝐍𝐱2(𝐳y(2)𝐳t(2))𝐱D\displaystyle\implies\mathbf{N}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\qquad\forall\mathbf{x}\in\mathbb{R}^{D} (67)

    Now we prove that 𝐱2(𝐳y(2)𝐳t(2))𝐏𝐱D\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\preccurlyeq\mathbf{P}\quad\forall\mathbf{x}\in\mathbb{R}^{D}:
    We substitute 𝐱2(𝐳y(2)𝐳t(2))\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right) and 𝐏\mathbf{P} from equations (55) and (59) respectively in 𝐏𝐱2(𝐳y(2)𝐳t(2))\mathbf{P}-\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right):

    𝐏𝐱2(𝐳y(2)𝐳t(2))\displaystyle\mathbf{P}-\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)
    =i=1N1(𝐖y,i(2)𝐖t,i(2))(piσ′′(𝐳i(1)))𝐖i(1)(𝐖i(1))T\displaystyle=\sum_{i=1}^{N_{1}}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\left(p_{i}-\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)\right)\mathbf{W}^{(1)}_{i}\big{(}\mathbf{W}^{(1)}_{i}\big{)}^{T}

    Thus 𝐏𝐱2(𝐳y(2)𝐳t(2))\mathbf{P}-\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right) is a weighted sum of symmetric rank one PSD matrices, i.e., 𝐖i(1)(𝐖i(1))T\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T}, and it is PSD if the coefficient of each rank one matrix, i.e., (𝐖y,i(2)𝐖t,i(2))(piσ′′(𝐳i(1)))\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\left(p_{i}-\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)\right), is non-negative. Using equations (56) and (65), we have the following:

    (𝐖y,i(2)𝐖t,i(2))0pi=hU\displaystyle\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\geq 0\implies p_{i}=h_{U}
    (piσ′′(𝐳i(1)))0iN1,𝐱D\displaystyle\implies\left(p_{i}-\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)\right)\geq 0\qquad\forall i\in N_{1},\ \mathbf{x}\in\mathbb{R}^{D}
    (𝐖y,i(2)𝐖t,i(2))0pi=hL\displaystyle\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\leq 0\implies p_{i}=h_{L}
    (piσ′′(𝐳i(1)))0iN1,𝐱D\displaystyle\implies\left(p_{i}-\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)\right)\leq 0\qquad\forall i\in N_{1},\ \mathbf{x}\in\mathbb{R}^{D}

    Putting the above results together we have:

    (𝐖y,i(2)𝐖t,i(2))(piσ′′(𝐳i(1)))0\displaystyle\implies\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\left(p_{i}-\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)\right)\geq 0
    i[N1],𝐱D\displaystyle\forall i\in[N_{1}],\ \mathbf{x}\in\mathbb{R}^{D} (68)

    Thus 𝐏𝐱2(𝐳y(2)𝐳t(2))\mathbf{P}-\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right) is PSD matrix i.e:

    𝐏𝐱2(𝐳y(2)𝐳t(2))\displaystyle\mathbf{P}-\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)
    =i=1N1(𝐖y,i(2)𝐖t,i(2))(piσ′′(𝐳i(1)))always non-negative using eq.(68)𝐖i(1)(𝐖i(1))T\displaystyle=\sum_{i=1}^{N_{1}}\underbrace{\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\left(p_{i}-\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(1)}_{i}\right)\right)}_{\text{always non-negative using eq.}\ \eqref{p_always_pos}}\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T}
    𝐏𝐱2(𝐳y(2)𝐳t(2))𝐱D\displaystyle\implies\mathbf{P}\succcurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\qquad\forall\mathbf{x}\in\mathbb{R}^{D} (69)

    Thus by proving the LMIs (67) and (69), we prove (57).

  2. (b)

    We have to prove that if hU0h_{U}\geq 0 and hL0h_{L}\leq 0, then 𝐏\mathbf{P} is a PSD matrix and 𝐍\mathbf{N} is an NSD matrix.
    We are given hU0h_{U}\geq 0 and hL0h_{L}\leq 0. Using equation (65), we have the following:

    (𝐖y,i(2)𝐖t,i(2))0pi=hU0\displaystyle\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\geq 0\implies p_{i}=h_{U}\geq 0
    pi(𝐖y,i(2)𝐖t,i(2))0\displaystyle\implies p_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\geq 0
    (𝐖y,i(2)𝐖t,i(2))0pi=hL0\displaystyle\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\leq 0\implies p_{i}=h_{L}\leq 0
    pi(𝐖y,i(2)𝐖t,i(2))0\displaystyle\implies p_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\geq 0

    Putting these results together we have:

    pi(𝐖y,i(2)𝐖t,i(2))0i[N1]\displaystyle\implies p_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\geq 0\qquad\forall i\in[N_{1}] (70)

    Thus 𝐏\mathbf{P} is a weighted sum of symmetric rank one PSD matrices, i.e., 𝐖i(1)(𝐖i(1))T\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T}, and each coefficient pi(𝐖y,i(2)𝐖t,i(2))p_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right) is non-negative.

    𝐏=i=1N1pi(𝐖y,i(2)𝐖t,i(2))always non-negative using eq. (70)𝐖i(1)(𝐖i(1))T0\displaystyle\mathbf{P}=\sum_{i=1}^{N_{1}}\underbrace{p_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)}_{\text{always non-negative using eq. \eqref{p_wdiff}}}\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T}\succcurlyeq 0

    Using equation (65), we have the following:

    (𝐖y,i(2)𝐖t,i(2))0ni=hL0\displaystyle\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\geq 0\implies n_{i}=h_{L}\leq 0
    ni(𝐖y,i(2)𝐖t,i(2))0\displaystyle\implies n_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\leq 0
    (𝐖y,i(2)𝐖t,i(2))0ni=hU0\displaystyle\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\leq 0\implies n_{i}=h_{U}\geq 0
    ni(𝐖y,i(2)𝐖t,i(2))0\displaystyle\implies n_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\leq 0

    Putting these results together we have:

    ni(𝐖y,i(2)𝐖t,i(2))0i[N1]\displaystyle\implies n_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\leq 0\qquad\forall i\in[N_{1}] (71)
    𝐍=i=1N1ni(𝐖y,i(2)𝐖t,i(2))always non-positive using eq. (71)𝐖i(1)(𝐖i(1))T0\displaystyle\mathbf{N}=\sum_{i=1}^{N_{1}}\underbrace{n_{i}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)}_{\text{always non-positive using eq. \eqref{n_wdiff}}}\mathbf{W}^{(1)}_{i}\left(\mathbf{W}^{(1)}_{i}\right)^{T}\preccurlyeq 0

    Thus 𝐏\mathbf{P} is a PSD matrix and 𝐍\mathbf{N} is an NSD matrix if hU0h_{U}\geq 0 and hL0h_{L}\leq 0.

  3. (c)

    We have to prove the following global bounds on the eigenvalues of 𝐱2(𝐳y(2)𝐳t(2))\nabla^{2}_{\mathbf{x}}(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}):

    m𝐈𝐱2(𝐳y(2)𝐳t(2))M𝐈,\displaystyle m\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\preccurlyeq M\mathbf{I},
    where M=max𝐯=1𝐯T𝐏𝐯,m=min𝐯=1𝐯T𝐍𝐯\displaystyle\text{where }M=\max_{\|\mathbf{v}\|=1}\mathbf{v}^{T}\mathbf{P}\mathbf{v},\ m=\min_{\|\mathbf{v}\|=1}\mathbf{v}^{T}\mathbf{N}\mathbf{v}

    Since 𝐱2(𝐳y(2)𝐳t(2))𝐏𝐱D\nabla_{\mathbf{x}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\preccurlyeq\mathbf{P}\quad\forall\mathbf{x}\in\mathbb{R}^{D}:

    𝐯T[𝐱2(𝐳y(2)𝐳t(2))]𝐯𝐯T𝐏𝐯\displaystyle\mathbf{v}^{T}\left[\nabla_{\mathbf{x}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\right]\mathbf{v}\leq\mathbf{v}^{T}\mathbf{P}\mathbf{v}
    𝐯D,𝐱D\displaystyle\forall\mathbf{v}\in\mathbb{R}^{D},\ \forall\mathbf{x}\in\mathbb{R}^{D} (72)

    Let 𝐯\mathbf{v}^{*}, 𝐱\mathbf{x}^{*} be vectors such that:

    (𝐯)T[𝐱2(𝐳y(2)𝐳t(2))]𝐯\displaystyle(\mathbf{v}^{*})^{T}\left[\nabla_{\mathbf{x}^{*}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\right]\mathbf{v}^{*}
    =max𝐱max𝐯=1𝐯T[𝐱2(𝐳y(2)𝐳t(2))]𝐯\displaystyle=\max_{\mathbf{x}}\max_{\|\mathbf{v}\|=1}\mathbf{v}^{T}\left[\nabla_{\mathbf{x}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\right]\mathbf{v}

    Thus using inequality (72):

    (𝐯)T[𝐱2(𝐳y(2)𝐳t(2))]𝐯max𝐯=1𝐯T𝐏𝐯\displaystyle(\mathbf{v}^{*})^{T}\left[\nabla_{\mathbf{x}^{*}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\right]\mathbf{v}^{*}\leq\max_{\|\mathbf{v}\|=1}\mathbf{v}^{T}\mathbf{P}\mathbf{v} (73)

    Since 𝐍𝐱2(𝐳y(2)𝐳t(2))𝐱D\mathbf{N}\preccurlyeq\nabla_{\mathbf{x}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\quad\forall\mathbf{x}\in\mathbb{R}^{D}:

    𝐯T𝐍𝐯𝐯T[𝐱2(𝐳y(2)𝐳t(2))]𝐯\displaystyle\mathbf{v}^{T}\mathbf{N}\mathbf{v}\leq\mathbf{v}^{T}\left[\nabla_{\mathbf{x}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\right]\mathbf{v}
    𝐯D,𝐱D\displaystyle\forall\mathbf{v}\in\mathbb{R}^{D},\ \forall\mathbf{x}\in\mathbb{R}^{D} (74)

    Let 𝐯\mathbf{v}^{*}, 𝐱\mathbf{x}^{*} be vectors such that:

    (𝐯)T[𝐱2(𝐳y(2)𝐳t(2))]𝐯\displaystyle(\mathbf{v}^{*})^{T}\left[\nabla_{\mathbf{x}^{*}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\right]\mathbf{v}^{*}
    =min𝐱min𝐯=1𝐯T[𝐱2(𝐳y(2)𝐳t(2))]𝐯\displaystyle=\min_{\mathbf{x}}\min_{\|\mathbf{v}\|=1}\mathbf{v}^{T}\left[\nabla_{\mathbf{x}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\right]\mathbf{v}

    Thus using inequality (74):

    (𝐯)T[𝐱2(𝐳y(2)𝐳t(2))]𝐯min𝐯=1𝐯T𝐍𝐯\displaystyle(\mathbf{v}^{*})^{T}\left[\nabla_{\mathbf{x}^{*}}^{2}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\right]\mathbf{v}^{*}\geq\min_{\|\mathbf{v}\|=1}\mathbf{v}^{T}\mathbf{N}\mathbf{v} (75)

    Using the inequalities (73) and (75), we get:

    m𝐈𝐱2(𝐳y(2)𝐳t(2))M𝐈m\mathbf{I}\preccurlyeq\nabla^{2}_{\mathbf{x}}\left(\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\right)\preccurlyeq M\mathbf{I}

    where M=max𝐯=1𝐯T𝐏𝐯,m=min𝐯=1𝐯T𝐍𝐯M=\max_{\|\mathbf{v}\|=1}\mathbf{v}^{T}\mathbf{P}\mathbf{v},\ m=\min_{\|\mathbf{v}\|=1}\mathbf{v}^{T}\mathbf{N}\mathbf{v}
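As a numerical illustration of Theorem 3, the sketch below builds 𝐏 and 𝐍 from equations (58)-(65) for a random 2-layer network and extracts M and m as their extreme eigenvalues; the weights are placeholders and tanh is used as an example activation, for which |σ''| ≤ 4/(3√3), so one may take hU = −hL ≈ 0.77.

import numpy as np

rng = np.random.default_rng(1)
D, N1, N2 = 4, 8, 3                      # input dim, hidden width, number of classes
W1 = rng.normal(size=(N1, D))            # W^{(1)}
W2 = rng.normal(size=(N2, N1))           # W^{(2)}
h_U = 4.0 / (3.0 * np.sqrt(3.0))         # for tanh, h_L <= sigma'' <= h_U with h_L = -h_U
h_L = -h_U
y, t = 0, 1                              # true and target classes

w = W2[y] - W2[t]                        # W^{(2)}_y - W^{(2)}_t, one entry per hidden unit
p = np.where(w >= 0, h_U, h_L)           # p_i, equation (62)
n = np.where(w >= 0, h_L, h_U)           # n_i, equation (65)

rank1 = np.einsum('i,ij,ik->ijk', w, W1, W1)   # (W^{(2)}_y - W^{(2)}_t)_i * W^{(1)}_i (W^{(1)}_i)^T
P = np.tensordot(p, rank1, axes=1)             # equation (58)
N = np.tensordot(n, rank1, axes=1)             # equation (59)

M = np.linalg.eigvalsh(P).max()          # M = max_{||v||=1} v^T P v
m = np.linalg.eigvalsh(N).min()          # m = min_{||v||=1} v^T N v
print(m, M)                              # m <= 0 <= M, consistent with part (b)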

E.5 Proof of Theorem 4

We are given that the activation function σ\sigma is such that σ,σ′′\sigma^{{}^{\prime}},\ \sigma^{{}^{\prime\prime}} are bounded, i.e:

|σ(x)|g,|σ′′(x)|hx\displaystyle|\sigma^{{}^{\prime}}(x)|\leq g,\ |\sigma^{{}^{\prime\prime}}(x)|\leq h\qquad\forall x\in\mathbb{R} (76)

We have to prove the following:

𝐱2𝐳i(L)hI=1L1(r(I))2maxj(𝐒i,j(L,I))𝐱D\displaystyle\left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\|\leq h\sum_{I=1}^{L-1}\left(r^{(I)}\right)^{2}\max_{j}\left(\mathbf{S}^{(L,I)}_{i,j}\right)\ \ \ \forall\mathbf{x}\in\mathbb{R}^{D}

where 𝐒(L,I)\mathbf{S}^{(L,I)} is a matrix of size NL×NIN_{L}\times N_{I} defined as follows:

𝐒(L,I)\displaystyle\mathbf{S}^{(L,I)} ={|𝐖(L)| I=L1g|𝐖(L)|𝐒(L1,I) I[L2]\displaystyle=\begin{dcases*}\left|\mathbf{W}^{(L)}\right|&\quad$I=L-1$\\ g\left|\mathbf{W}^{(L)}\right|\mathbf{S}^{(L-1,I)}&\quad$I\in[L-2]$\end{dcases*} (77)

and r(I)r^{(I)} is a scalar defined as follows:

r(I)\displaystyle r^{(I)} ={𝐖(1) I=1g𝐖(I)r(I1) I[2,L1]\displaystyle=\begin{dcases*}\left\|\mathbf{W}^{(1)}\right\|&\quad$I=1$\\ g\left\|\mathbf{W}^{(I)}\right\|r^{(I-1)}&\quad$I\in[2,L-1]$\end{dcases*} (78)

We will prove the same in 33 steps.
In step (a), we will prove:

|𝐅i,j(L,I)|𝐒i,j(L,I)𝐱D\displaystyle\left|\mathbf{F}^{(L,I)}_{i,j}\right|\leq\mathbf{S}^{(L,I)}_{i,j}\qquad\forall\mathbf{x}\in\mathbb{R}^{D} (79)

In step (b), we will prove:

𝐁(I)r(I),𝐱D\displaystyle\left\|\mathbf{B}^{(I)}\right\|\leq r^{(I)},\qquad\forall\mathbf{x}\in\mathbb{R}^{D} (80)

In step (c), we will use (a) and (b) to prove:

𝐱2𝐳i(L)hI=1L1(r(I))2maxj(𝐒i,j(L,I))\displaystyle\left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\|\leq h\sum_{I=1}^{L-1}\left(r^{(I)}\right)^{2}\max_{j}\left(\mathbf{S}^{(L,I)}_{i,j}\right) (81)

Note that 𝐁(I)\mathbf{B}^{(I)} and 𝐅(L,I)\mathbf{F}^{(L,I)} are defined using (35) and (36) respectively.
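For reference, the quantities r^(I), S^(L,I) and the resulting curvature bound can be assembled directly from the weight matrices. In the NumPy sketch below, weights is assumed to be the list [W^(1), ..., W^(L)], and g, h bound |σ'| and |σ''| as in (76) (for tanh one may take g = 1 and h = 4/(3√3) ≈ 0.77).

import numpy as np

def S_bound(weights, g, L, I):
    """S^{(L,I)} from equation (77); weights[k-1] holds W^{(k)} (layers are 1-indexed)."""
    S = np.abs(weights[I])                     # S^{(I+1,I)} = |W^{(I+1)}|
    for J in range(I + 2, L + 1):              # S^{(J,I)} = g |W^{(J)}| S^{(J-1,I)}
        S = g * np.abs(weights[J - 1]) @ S
    return S

def curvature_bound(weights, g, h):
    """Theorem 4 upper bound on the Hessian norm of z^{(L)}_i, for every output unit i."""
    L = len(weights)
    r = [np.linalg.norm(weights[0], 2)]        # r^{(1)} = ||W^{(1)}||, equation (78)
    for I in range(2, L):                      # r^{(I)} = g ||W^{(I)}|| r^{(I-1)}
        r.append(g * np.linalg.norm(weights[I - 1], 2) * r[-1])
    K = np.zeros(weights[-1].shape[0])
    for I in range(1, L):                      # sum over I = 1, ..., L-1
        S = S_bound(weights, g, L, I)          # shape N_L x N_I
        K += h * (r[I - 1] ** 2) * S.max(axis=1)
    return K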

  1. (a)

    We have to prove that for L2,I[L1],iNL,jNIL\geq 2,\ I\in[L-1],\ i\in N_{L},\ j\in N_{I}:

    |𝐅i,j(L,I)|𝐒i,j(L,I)𝐱D\displaystyle\left|\mathbf{F}^{(L,I)}_{i,j}\right|\leq\mathbf{S}^{(L,I)}_{i,j}\qquad\forall\mathbf{x}\in\mathbb{R}^{D}

    where 𝐒(L,I)\mathbf{S}^{(L,I)} is a matrix of size NL×NIN_{L}\times N_{I} defined as follows:

    𝐒(L,I)\displaystyle\mathbf{S}^{(L,I)} ={|𝐖(L)| I=L1g|𝐖(L)|𝐒(L1,I) I[L2]=\begin{dcases*}\left|\mathbf{W}^{(L)}\right|&\quad$I=L-1$\\ g\left|\mathbf{W}^{(L)}\right|\mathbf{S}^{(L-1,I)}&\quad$I\in[L-2]$\end{dcases*}

    We first prove the case when I=L1I=L-1.
    Using equation (47), 𝐅i,j(L,L1)=𝐖i,j(L)\mathbf{F}^{(L,L-1)}_{i,j}=\mathbf{W}^{(L)}_{i,j}.
    Since 𝐒i,j(L,L1)=|𝐖i,j(L)|\mathbf{S}^{(L,L-1)}_{i,j}=\left|\mathbf{W}^{(L)}_{i,j}\right|:

    |𝐅i,j(L,L1)|=𝐒i,j(L,L1)\left|\mathbf{F}^{(L,L-1)}_{i,j}\right|=\mathbf{S}^{(L,L-1)}_{i,j}

    Thus for L2,I=L1L\geq 2,\ I=L-1, we have equality in (79), which proves this case.
    Now, we will use proof by induction.
    To prove the base case L=2L=2, note that I=L1=1I=L-1=1 is the only possible value for II. Thus, using the result for I=L1I=L-1, the theorem holds for L=2L=2. This proves the base case.
    Now we assume the induction hypothesis is true for depth =L1,I[L2]=L-1,\ I\in[L-2], and prove it for depth =L,I[L1]=L,\ I\in[L-1]. Since the case I=L1I=L-1 has already been proven, we only need to prove it for IL2I\leq L-2.
    Using equation (49), we have the following formula for 𝐅i(L,I)\mathbf{F}^{(L,I)}_{i}:

    𝐅i(L,I)=k=1NL1𝐖i,k(L)σ(𝐳k(L1))𝐅k(L1,I)\displaystyle\mathbf{F}^{(L,I)}_{i}=\sum_{k=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,k}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{k}\right)\mathbf{F}^{(L-1,I)}_{k}

    Taking the jthj^{th} element of the vectors on both sides:

    𝐅i,j(L,I)=k=1NL1𝐖i,k(L)σ(𝐳k(L1))𝐅k,j(L1,I)\displaystyle\mathbf{F}^{(L,I)}_{i,j}=\sum_{k=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,k}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{k}\right)\mathbf{F}^{(L-1,I)}_{k,j} (82)

    By induction hypothesis, we know that:

    |𝐅k,j(L1,I)|𝐒k,j(L1,I)\displaystyle\left|\mathbf{F}^{(L-1,I)}_{k,j}\right|\leq\mathbf{S}^{(L-1,I)}_{k,j} (83)

    Using the absolute value properties for equation (82), we have:

    |𝐅i,j(L,I)|=|k=1NL1𝐖i,k(L)σ(𝐳k(L1))𝐅k,j(L1,I)|\displaystyle\left|\mathbf{F}^{(L,I)}_{i,j}\right|=\left|\sum_{k=1}^{N_{L-1}}\mathbf{W}^{(L)}_{i,k}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{k}\right)\mathbf{F}^{(L-1,I)}_{k,j}\right|
    |𝐅i,j(L,I)|k=1NL1|𝐖i,k(L)σ(𝐳k(L1))𝐅k,j(L1,I)|\displaystyle\left|\mathbf{F}^{(L,I)}_{i,j}\right|\leq\sum_{k=1}^{N_{L-1}}\left|\mathbf{W}^{(L)}_{i,k}\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{k}\right)\mathbf{F}^{(L-1,I)}_{k,j}\right|
    |𝐅i,j(L,I)|k=1NL1|𝐖i,k(L)||σ(𝐳k(L1))||𝐅k,j(L1,I)|\displaystyle\left|\mathbf{F}^{(L,I)}_{i,j}\right|\leq\sum_{k=1}^{N_{L-1}}\left|\mathbf{W}^{(L)}_{i,k}\right|\left|\sigma^{{}^{\prime}}\left(\mathbf{z}^{(L-1)}_{k}\right)\right|\left|\mathbf{F}^{(L-1,I)}_{k,j}\right|

    Using |σ(x)|gx|\sigma^{{}^{\prime}}(x)|\leq g\quad\forall x\in\mathbb{R}\ (inequality (76)) :

    |𝐅i,j(L,I)|gk=1NL1|𝐖i,k(L)||𝐅k,j(L1,I)|\left|\mathbf{F}^{(L,I)}_{i,j}\right|\leq g\sum_{k=1}^{N_{L-1}}\left|\mathbf{W}^{(L)}_{i,k}\right|\left|\mathbf{F}^{(L-1,I)}_{k,j}\right|

    Using the induction hypothesis (inequality (83)):

    |𝐅i,j(L,I)|gk=1NL1|𝐖i,k(L)||𝐒k,j(L1,I)|\left|\mathbf{F}^{(L,I)}_{i,j}\right|\leq g\sum_{k=1}^{N_{L-1}}\left|\mathbf{W}^{(L)}_{i,k}\right|\left|\mathbf{S}^{(L-1,I)}_{k,j}\right|

    Using equation (77) for the definition of $\mathbf{S}^{(L,I)}_{i,j}$:

    |𝐅i,j(L,I)|𝐒i,j(L,I)\left|\mathbf{F}^{(L,I)}_{i,j}\right|\leq\mathbf{S}^{(L,I)}_{i,j}

    Hence we prove (79) for all L2L\geq 2 and IL1I\leq L-1 using induction.

  2. (b)

    We have to prove that for $1\leq I\leq L-1$:

    𝐁(I)r(I),𝐱D\displaystyle\left\|\mathbf{B}^{(I)}\right\|\leq r^{(I)},\qquad\forall\mathbf{x}\in\mathbb{R}^{D}

    where r(I)r^{(I)} is a scalar given as follows:

    r(I)\displaystyle r^{(I)} ={𝐖(1) I=1g𝐖(I)r(I1) I[2,L1]\displaystyle=\begin{dcases*}\left\|\mathbf{W}^{(1)}\right\|&\quad$I=1$\\ g\left\|\mathbf{W}^{(I)}\right\|r^{(I-1)}&\quad$I\in[2,L-1]$\end{dcases*}

    Using equation (45), for I=1I=1 we have:

    𝐁(1)\displaystyle\left\|\mathbf{B}^{(1)}\right\| =𝐖(1)=r(1)\displaystyle=\left\|\mathbf{W}^{(1)}\right\|=r^{(1)} (84)

    Using equation (46), for I>1I>1, we have:

    𝐁(I)\displaystyle\left\|\mathbf{B}^{(I)}\right\| =𝐖(I)diag(σ(𝐳(I1)))𝐁(I1)\displaystyle=\left\|\mathbf{W}^{(I)}diag\left(\sigma^{{}^{\prime}}\left(\mathbf{z}^{(I-1)}\right)\right)\mathbf{B}^{(I-1)}\right\|
    𝐁(I)\displaystyle\left\|\mathbf{B}^{(I)}\right\| 𝐖(I)diag(σ(𝐳(I1)))𝐁(I1)\displaystyle\leq\left\|\mathbf{W}^{(I)}\right\|\left\|diag\left(\sigma^{{}^{\prime}}\left(\mathbf{z}^{(I-1)}\right)\right)\right\|\left\|\mathbf{B}^{(I-1)}\right\|

    Since diag(σ(𝐳(I1)))=maxj|σ(𝐳j(I1))|\left\|diag\left(\sigma^{{}^{\prime}}\left(\mathbf{z}^{(I-1)}\right)\right)\right\|=\max_{j}\left|\sigma^{{}^{\prime}}\left(\mathbf{z}^{(I-1)}_{j}\right)\right|, using equation (76):

    𝐁(I)\displaystyle\left\|\mathbf{B}^{(I)}\right\| g𝐖(I)𝐁(I1)g𝐖(I)r(I1)\displaystyle\leq g\left\|\mathbf{W}^{(I)}\right\|\left\|\mathbf{B}^{(I-1)}\right\|\leq g\left\|\mathbf{W}^{(I)}\right\|r^{(I-1)} (85)

    Using inequalities (84) and (85), the proof follows using induction.

  3. (c)

    We have to prove that:

    \left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\|\leq h\sum_{I=1}^{L-1}\left(r^{(I)}\right)^{2}\max_{j}\left(\mathbf{S}^{(L,I)}_{i,j}\right)

    Using Lemma 1, we have the following equation for 𝐱2𝐳i(L)\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}:

    𝐱2𝐳i(L)=I=1L1(𝐁(I))Tdiag(𝐅i(L,I)σ′′(𝐳(I)))𝐁(I)\displaystyle\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}=\sum_{I=1}^{L-1}\left(\mathbf{B}^{(I)}\right)^{T}diag\bigg{(}\mathbf{F}^{(L,I)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\bigg{)}\mathbf{B}^{(I)}

    Using the triangle inequality and the sub-multiplicativity of the spectral norm (so that $\left\|(\mathbf{B}^{(I)})^{T}\mathbf{D}\mathbf{B}^{(I)}\right\|\leq\left\|\mathbf{D}\right\|\left\|\mathbf{B}^{(I)}\right\|^{2}$ for any matrix $\mathbf{D}$), we have:

    𝐱2𝐳i(L)\displaystyle\left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\|
    =I=1L1(𝐁(I))Tdiag(𝐅i(L,I)σ′′(𝐳(I)))𝐁(I)\displaystyle=\left\|\sum_{I=1}^{L-1}\left(\mathbf{B}^{(I)}\right)^{T}diag\left(\mathbf{F}^{(L,I)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\right)\mathbf{B}^{(I)}\right\|
    I=1L1diag(𝐅i(L,I)σ′′(𝐳(I)))𝐁(I)2\displaystyle\leq\sum_{I=1}^{L-1}\left\|diag\left(\mathbf{F}^{(L,I)}_{i}\odot\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}\right)\right)\right\|\left\|\mathbf{B}^{(I)}\right\|^{2}
    I=1L1maxj(|𝐅i,j(L,I)σ′′(𝐳j(I))|)𝐁(I)2\displaystyle\leq\sum_{I=1}^{L-1}\max_{j}\bigg{(}\left|\mathbf{F}^{(L,I)}_{i,j}\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}_{j}\right)\right|\bigg{)}\left\|\mathbf{B}^{(I)}\right\|^{2}

    In the last inequality, we use the fact that the spectral norm of a diagonal matrix equals the maximum absolute value of its diagonal entries. Using the product property of the absolute value, we get:

    𝐱2𝐳i(L)\displaystyle\left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\| I=1L1maxj(|𝐅i,j(L,I)||σ′′(𝐳j(I))|)𝐁(I)2\displaystyle\leq\sum_{I=1}^{L-1}\max_{j}\bigg{(}\left|\mathbf{F}^{(L,I)}_{i,j}\right|\left|\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}_{j}\right)\right|\bigg{)}\left\|\mathbf{B}^{(I)}\right\|^{2}

    Since $\left|\mathbf{F}^{(L,I)}_{i,j}\right|$ and $\left|\sigma^{\prime\prime}\left(\mathbf{z}^{(I)}_{j}\right)\right|$ are non-negative, the maximum of their product is at most the product of their maxima:

    𝐱2𝐳i(L)\displaystyle\left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\|
    I=1L1maxj(|𝐅i,j(L,I)|)maxj(|σ′′(𝐳j(I))|)𝐁(I)2\displaystyle\leq\sum_{I=1}^{L-1}\max_{j}\bigg{(}\left|\mathbf{F}^{(L,I)}_{i,j}\right|\bigg{)}\max_{j}\bigg{(}\left|\sigma^{{}^{\prime\prime}}\left(\mathbf{z}^{(I)}_{j}\right)\right|\bigg{)}\left\|\mathbf{B}^{(I)}\right\|^{2}

    Since $\left|\sigma^{\prime\prime}(x)\right|\leq h$ for all $x\in\mathbb{R}$:

    𝐱2𝐳i(L)\displaystyle\left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\| hI=1L1maxj(|𝐅i,j(L,I)|)𝐁(I)2\displaystyle\leq h\sum_{I=1}^{L-1}\max_{j}\bigg{(}\left|\mathbf{F}^{(L,I)}_{i,j}\right|\bigg{)}\left\|\mathbf{B}^{(I)}\right\|^{2}

    Using inequality (79):

    \left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\|\leq h\sum_{I=1}^{L-1}\max_{j}\left(\mathbf{S}^{(L,I)}_{i,j}\right)\left\|\mathbf{B}^{(I)}\right\|^{2}

    Using inequality (80):

    \left\|\nabla^{2}_{\mathbf{x}}\mathbf{z}^{(L)}_{i}\right\|\leq h\sum_{I=1}^{L-1}\left(r^{(I)}\right)^{2}\max_{j}\left(\mathbf{S}^{(L,I)}_{i,j}\right)\quad\forall\mathbf{x}\in\mathbb{R}^{D}
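For concreteness, the bound in (81) can be evaluated directly from the weight matrices once $g$ and $h$ are known. The following is a minimal numpy sketch (function and variable names are ours and purely illustrative), which unrolls the recursions for $r^{(I)}$ and $\mathbf{S}^{(L,I)}$ used in steps (a) and (b):

```python
import numpy as np

def curvature_bound(weights, g, h, i):
    """Evaluate the Hessian-norm bound of (81) for output unit i.

    weights : list [W1, ..., WL] of 2-D arrays, W^(I) of shape N_I x N_{I-1}
    g, h    : bounds on |sigma'| and |sigma''| (see Appendix F)
    """
    L = len(weights)
    # r^(1) = ||W^(1)||,  r^(I) = g ||W^(I)|| r^(I-1)   (spectral norms)
    r = [np.linalg.norm(weights[0], 2)]
    for I in range(2, L):
        r.append(g * np.linalg.norm(weights[I - 1], 2) * r[-1])

    # S^(L,L-1) = |W^(L)|; unrolling the recursion of step (a) gives
    # S^(L,I) = g * S^(L,I+1) |W^(I+1)| for I < L-1.
    S = np.abs(weights[-1])                       # shape N_L x N_{L-1}
    bound = h * r[L - 2] ** 2 * S[i].max()        # term I = L-1
    for I in range(L - 2, 0, -1):
        S = g * S @ np.abs(weights[I])            # now S = S^(L,I)
        bound += h * r[I - 1] ** 2 * S[i].max()
    return bound
```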

E.6 Proof of Theorem 5

Theorem 5.

For a binary classifier ff, let gg denote the indicator function such that g(𝐱)=1f(𝐱)>0,g(𝐱)=0otherwiseg(\mathbf{x})=1\iff f(\mathbf{x})>0,\ g(\mathbf{x})=0\ \text{otherwise}. Let g^\hat{g} be the function constructed by applying randomized smoothing on gg such that:

g^(𝐮)=1(2πs2)n/2Dg(𝐯)exp(𝐯𝐮22s2)𝑑𝐯\hat{g}\left(\mathbf{u}\right)=\frac{1}{(2\pi s^{2})^{n/2}}\int_{\mathbb{R}^{D}}g(\mathbf{v})\exp\left(-\frac{\|\mathbf{v}-\mathbf{u}\|^{2}}{2s^{2}}\right)d\mathbf{v}

then the curvature of the resulting function $\hat{g}$ is bounded, i.e.:

𝐈s2𝐮2g^𝐈s2-\frac{\mathbf{I}}{s^{2}}\preccurlyeq\nabla^{2}_{\mathbf{u}}\ \hat{g}\preccurlyeq\frac{\mathbf{I}}{s^{2}}
Proof.
𝐮g^(𝐮)\displaystyle\nabla_{\mathbf{u}}\ \hat{g}\left(\mathbf{u}\right)
=1(2πs2)n/2Dg(𝐯)(𝐯𝐮)s2exp(𝐯𝐮22s2)𝑑𝐯\displaystyle=\frac{1}{(2\pi s^{2})^{n/2}}\int_{\mathbb{R}^{D}}g(\mathbf{v})\frac{(\mathbf{v}-\mathbf{u})}{s^{2}}\exp\left(-\frac{\|\mathbf{v}-\mathbf{u}\|^{2}}{2s^{2}}\right)d\mathbf{v}
𝐮2g^(𝐮)\displaystyle\nabla^{2}_{\mathbf{u}}\ \hat{g}\left(\mathbf{u}\right)
=1(2πs2)n/2Dg(𝐯)𝐈s2exp(𝐯𝐮22s2)𝑑𝐯\displaystyle=\frac{1}{(2\pi s^{2})^{n/2}}\int_{\mathbb{R}^{D}}g(\mathbf{v})\frac{-\mathbf{I}}{s^{2}}\exp\left(-\frac{\|\mathbf{v}-\mathbf{u}\|^{2}}{2s^{2}}\right)d\mathbf{v}
+1(2πs2)n/2Dg(𝐯)(𝐯𝐮)(𝐯𝐮)Ts4[\displaystyle+\frac{1}{(2\pi s^{2})^{n/2}}\int_{\mathbb{R}^{D}}g(\mathbf{v})\frac{(\mathbf{v}-\mathbf{u})(\mathbf{v}-\mathbf{u})^{T}}{s^{4}}\bigg{[}
exp(𝐯𝐮22s2)]d𝐯\displaystyle\quad\exp\left(-\frac{\|\mathbf{v}-\mathbf{u}\|^{2}}{2s^{2}}\right)\bigg{]}d\mathbf{v}

Since $0\leq g(\mathbf{v})\leq 1$, $-\mathbf{I}/s^{2}\preccurlyeq 0$, $(\mathbf{v}-\mathbf{u})(\mathbf{v}-\mathbf{u})^{T}\succcurlyeq 0$ and $\exp(x)\geq 0\ \forall x$:

𝐮2g^(𝐮)\displaystyle\nabla^{2}_{\mathbf{u}}\ \hat{g}\left(\mathbf{u}\right) =1(2πs2)n/2Dg(𝐯)𝐈s2exp(𝐯𝐮22s2)d𝐯Negative Semi-Definite\displaystyle=\frac{1}{(2\pi s^{2})^{n/2}}\int_{\mathbb{R}^{D}}\underbrace{g(\mathbf{v})\frac{-\mathbf{I}}{s^{2}}\exp\left(-\frac{\|\mathbf{v}-\mathbf{u}\|^{2}}{2s^{2}}\right)d\mathbf{v}}_{\text{Negative Semi-Definite}}
+1(2πs2)n/2Dg(𝐯)(𝐯𝐮)(𝐯𝐮)Ts4Positive Semi-Definite[\displaystyle+\frac{1}{(2\pi s^{2})^{n/2}}\int_{\mathbb{R}^{D}}\underbrace{g(\mathbf{v})\frac{(\mathbf{v}-\mathbf{u})(\mathbf{v}-\mathbf{u})^{T}}{s^{4}}}_{\text{Positive Semi-Definite}}\bigg{[}
exp(𝐯𝐮22s2)]d𝐯\displaystyle\exp\left(-\frac{\|\mathbf{v}-\mathbf{u}\|^{2}}{2s^{2}}\right)\bigg{]}d\mathbf{v}
𝐮2g^(𝐮)\displaystyle\nabla^{2}_{\mathbf{u}}\ \hat{g}\left(\mathbf{u}\right) 1(2πs2)n/2D(𝐯𝐮)(𝐯𝐮)Ts4[\displaystyle\preccurlyeq\frac{1}{(2\pi s^{2})^{n/2}}\int_{\mathbb{R}^{D}}\frac{(\mathbf{v}-\mathbf{u})(\mathbf{v}-\mathbf{u})^{T}}{s^{4}}\bigg{[}
exp(𝐯𝐮22s2)]d𝐯\displaystyle\exp\left(-\frac{\|\mathbf{v}-\mathbf{u}\|^{2}}{2s^{2}}\right)\bigg{]}d\mathbf{v}
𝐮2g^(𝐮)\displaystyle\nabla^{2}_{\mathbf{u}}\ \hat{g}\left(\mathbf{u}\right) 1(2πs2)n/2D𝐪𝐪Ts4exp(𝐪22s2)𝑑𝐪\displaystyle\preccurlyeq\frac{1}{(2\pi s^{2})^{n/2}}\int_{\mathbb{R}^{D}}\frac{\mathbf{q}\mathbf{q}^{T}}{s^{4}}\exp\left(-\frac{\|\mathbf{q}\|^{2}}{2s^{2}}\right)d\mathbf{q}
𝐮2g^(𝐮)\displaystyle\nabla^{2}_{\mathbf{u}}\ \hat{g}\left(\mathbf{u}\right) 𝐈s2\displaystyle\preccurlyeq\frac{\mathbf{I}}{s^{2}}
𝐮2g^(𝐮)\displaystyle\nabla^{2}_{\mathbf{u}}\ \hat{g}\left(\mathbf{u}\right) 1(2πs2)n/2D𝐈s2exp(𝐯𝐮22s2)𝑑𝐯\displaystyle\succcurlyeq\frac{1}{(2\pi s^{2})^{n/2}}\int_{\mathbb{R}^{D}}\frac{-\mathbf{I}}{s^{2}}\exp\left(-\frac{\|\mathbf{v}-\mathbf{u}\|^{2}}{2s^{2}}\right)d\mathbf{v}
𝐮2g^(𝐮)\displaystyle\nabla^{2}_{\mathbf{u}}\ \hat{g}\left(\mathbf{u}\right) 𝐈s2\displaystyle\succcurlyeq-\frac{\mathbf{I}}{s^{2}}
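As a one-dimensional illustration of Theorem 5 (a sanity check, not part of the proof): for $g(x)=1\{x>0\}$, the smoothed function has the closed form $\hat{g}(u)=\Phi(u/s)$, whose second derivative peaks at $\phi(1)/s^{2}\approx 0.242/s^{2}$, comfortably inside the $\pm 1/s^{2}$ bound. A short numerical check, assuming scipy is available:

```python
import numpy as np
from scipy.stats import norm

s = 0.5                                  # smoothing standard deviation
u = np.linspace(-5, 5, 2001)
g_hat = norm.cdf(u / s)                  # smoothed step function E[1{u + s*N(0,1) > 0}]
d2 = np.gradient(np.gradient(g_hat, u), u)
print(np.abs(d2).max(), "<=", 1 / s**2)  # ~0.97 <= 4.0 for s = 0.5
```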

Appendix F Computing g,h,g,h, hUh_{U} and hLh_{L} for different activation functions

F.1 Softplus activation

For softplus activation, we have the following. We use S(x)S(x) to denote sigmoid:

σ(x)=log(1+exp(x))\displaystyle\sigma(x)=\log(1+\exp(x))
σ(x)=S(x)\displaystyle\sigma^{{}^{\prime}}(x)=S(x)
σ′′(x)=S(x)(1S(x))\displaystyle\sigma^{{}^{\prime\prime}}(x)=S(x)(1-S(x))

To bound S(x)(1S(x))S(x)(1-S(x)), let α\alpha denote S(x)S(x). We know that 0α10\leq\alpha\leq 1:

α(1α)=14(12α)2\alpha(1-\alpha)=\frac{1}{4}-\bigg{(}\frac{1}{2}-\alpha\bigg{)}^{2}

Thus, $S(x)(1-S(x))$ attains its maximum value of 0.25 at $S(x)=1/2$ and its minimum value of 0 at $S(x)=0$ and $S(x)=1$.

0S(x)(1S(x))0.250σ′′(x)0.250\leq S(x)(1-S(x))\leq 0.25\implies 0\leq\sigma^{{}^{\prime\prime}}(x)\leq 0.25

Thus, hU=0.25,hL=0h_{U}=0.25,\ h_{L}=0 (for use in Theorem 3) and g=1,h=0.25g=1,\ h=0.25 (for use in Theorem 4).

F.2 Sigmoid activation

For sigmoid activation, we have the following. We use S(x)S(x) to denote sigmoid:

σ(x)=S(x)=11+exp(x)\displaystyle\sigma(x)=S(x)=\frac{1}{1+\exp(-x)}
σ(x)=S(x)(1S(x))\displaystyle\sigma^{{}^{\prime}}(x)=S(x)(1-S(x))
σ′′(x)=S(x)(1S(x))(12S(x))\displaystyle\sigma^{{}^{\prime\prime}}(x)=S(x)(1-S(x))(1-2S(x))

The second derivative of sigmoid (σ′′(x))(\sigma^{{}^{\prime\prime}}(x)) can be bounded using standard differentiation. Let α\alpha denote S(x)S(x). We know that 0α10\leq\alpha\leq 1:

hLσ′′(x)hUh_{L}\leq\sigma^{{}^{\prime\prime}}(x)\leq h_{U}
hL=min0α1α(1α)(12α)h_{L}=\min_{0\leq\alpha\leq 1}\alpha(1-\alpha)(1-2\alpha)
hU=max0α1α(1α)(12α)h_{U}=\max_{0\leq\alpha\leq 1}\alpha(1-\alpha)(1-2\alpha)

To solve for both hLh_{L} and hUh_{U}, we first differentiate α(1α)(12α)\alpha(1-\alpha)(1-2\alpha) with respect to α\alpha:

α(α(1α)(12α))=α(2α33α2+α)\displaystyle\nabla_{\alpha}\left(\alpha(1-\alpha)(1-2\alpha)\right)=\nabla_{\alpha}\left(2\alpha^{3}-3\alpha^{2}+\alpha\right)
=(6α26α+1)\displaystyle=\left(6\alpha^{2}-6\alpha+1\right)

Solving for 6α26α+1=06\alpha^{2}-6\alpha+1=0, we get the solutions:

α=(3+36),(336)\alpha=\bigg{(}\frac{3+\sqrt{3}}{6}\bigg{)},\bigg{(}\frac{3-\sqrt{3}}{6}\bigg{)}

Since both $(3+\sqrt{3})/6$ and $(3-\sqrt{3})/6$ lie between 0 and 1, we check the second derivative:

α2(α(1α)(12α))=α(6α26α+1)\displaystyle\nabla^{2}_{\alpha}\left(\alpha(1-\alpha)(1-2\alpha)\right)=\nabla_{\alpha}\left(6\alpha^{2}-6\alpha+1\right)
=12α6=6(2α1)\displaystyle=12\alpha-6=6(2\alpha-1)

At α=(3+3)/6\alpha=(3+\sqrt{3})/6, α2=6(2α1)=23>0\nabla^{2}_{\alpha}=6(2\alpha-1)=2\sqrt{3}>0.
At α=(33)/6\alpha=(3-\sqrt{3})/6, α2=6(2α1)=23<0\nabla^{2}_{\alpha}=6(2\alpha-1)=-2\sqrt{3}<0.
Thus $\alpha=(3+\sqrt{3})/6$ is a local minimum and $\alpha=(3-\sqrt{3})/6$ is a local maximum.
Substituting the two critical points into α(1α)(12α)\alpha(1-\alpha)(1-2\alpha), we get hU=9.623×102h_{U}=9.623\times 10^{-2}, hL=9.623×102h_{L}=-9.623\times 10^{-2}.
Thus, hU=9.623×102,hL=9.623×102h_{U}=9.623\times 10^{-2},\ h_{L}=-9.623\times 10^{-2} (for use in Theorem 3) and g=0.25,h=0.09623g=0.25,\ h=0.09623 (for use in Theorem 4).

F.3 Tanh activation

For tanh activation, we have the following:

σ(x)=tanh(x)=exp(x)exp(x)exp(x)+exp(x)\displaystyle\sigma(x)=\tanh(x)=\frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}
σ(x)=(1tanh(x))(1+tanh(x))\displaystyle\sigma^{{}^{\prime}}(x)=\left(1-\tanh(x)\right)\left(1+\tanh(x)\right)
σ′′(x)=2tanh(x)(1tanh(x))(1+tanh(x))\displaystyle\sigma^{{}^{\prime\prime}}(x)=-2\tanh(x)\left(1-\tanh(x)\right)\left(1+\tanh(x)\right)

The second derivative of tanh, i.e. $\sigma^{\prime\prime}(x)$, can be bounded using standard differentiation. Let $\alpha$ denote $\tanh(x)$. We know that $-1\leq\alpha\leq 1$:

hLσ′′(x)hUh_{L}\leq\sigma^{{}^{\prime\prime}}(x)\leq h_{U}
h_{L}=\min_{-1\leq\alpha\leq 1}-2\alpha(1-\alpha)(1+\alpha)
h_{U}=\max_{-1\leq\alpha\leq 1}-2\alpha(1-\alpha)(1+\alpha)

To solve for both hLh_{L} and hUh_{U}, we first differentiate 2α(1α)(1+α)-2\alpha(1-\alpha)(1+\alpha) with respect to α\alpha:

α(2α(1α)(1+α))=α(2α32α)=(6α22)\nabla_{\alpha}\left(-2\alpha(1-\alpha)(1+\alpha)\right)=\nabla_{\alpha}\left(2\alpha^{3}-2\alpha\right)=\left(6\alpha^{2}-2\right)

Solving for 6α22=06\alpha^{2}-2=0, we get the solutions:

α=13,13\alpha=-\frac{1}{\sqrt{3}},\frac{1}{\sqrt{3}}

Since both $-1/\sqrt{3}$ and $1/\sqrt{3}$ lie between $-1$ and $1$, we check the second derivative:

α2(2α(1α)(1+α))=α(6α22)=12α\nabla^{2}_{\alpha}\left(-2\alpha(1-\alpha)(1+\alpha)\right)=\nabla_{\alpha}\left(6\alpha^{2}-2\right)=12\alpha

At α=1/3\alpha=-1/\sqrt{3}, α2=12α=43<0\nabla^{2}_{\alpha}=12\alpha=-4\sqrt{3}<0.
At α=1/3\alpha=1/\sqrt{3}, α2=12α=43>0\nabla^{2}_{\alpha}=12\alpha=4\sqrt{3}>0.
Thus $\alpha=1/\sqrt{3}$ is a local minimum and $\alpha=-1/\sqrt{3}$ is a local maximum.
Substituting the two critical points into 2α(1α)(1+α)-2\alpha(1-\alpha)(1+\alpha), we get hU=0.76981h_{U}=0.76981, hL=0.76981h_{L}=-0.76981.
Thus, hU=0.76981,hL=0.76981h_{U}=0.76981,\ h_{L}=-0.76981 (for use in Theorem 3) and g=1,h=0.76981g=1,\ h=0.76981 (for use in Theorem 4).
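The values of $g$, $h$, $h_{U}$ and $h_{L}$ derived in this appendix can be sanity-checked numerically with finite differences on a dense grid. A minimal numpy sketch (the grid range and resolution are arbitrary choices):

```python
import numpy as np

x = np.linspace(-20, 20, 400001)          # dense grid; spacing 1e-4
acts = {
    # numerically stable softplus log(1 + e^x)
    "softplus": lambda z: np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0),
    "sigmoid":  lambda z: 1.0 / (1.0 + np.exp(-z)),
    "tanh":     np.tanh,
}
for name, f in acts.items():
    d1 = np.gradient(f(x), x)             # sigma'
    d2 = np.gradient(d1, x)               # sigma''
    print(name, "g ~", np.abs(d1).max(), "h_L ~", d2.min(), "h_U ~", d2.max())
# Expected (up to discretization error):
#   softplus: g ~ 1,    h_L ~ 0,        h_U ~ 0.25
#   sigmoid:  g ~ 0.25, h_L ~ -0.09623, h_U ~ 0.09623
#   tanh:     g ~ 1,    h_L ~ -0.76981, h_U ~ 0.76981
```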

Appendix G Quadratic bounds for two-layer ReLU networks

For a 2 layer network with ReLU activation, such that the input 𝐱\mathbf{x} lies in the ball 𝐱𝐱(0)ρ\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|\leq\rho, we can compute the bounds over 𝐳(1)\mathbf{z}^{(1)} directly:

𝐖i(1)𝐱(0)+𝐛i(1)ρ𝐖i(1)𝐳i(1)\displaystyle\mathbf{W}^{(1)}_{i}\mathbf{x}^{(0)}+\mathbf{b}^{(1)}_{i}-\rho\left\|\mathbf{W}^{(1)}_{i}\right\|\leq\mathbf{z}^{(1)}_{i}
𝐳i(1)𝐖i(1)𝐱(0)+𝐛i(1)+ρ𝐖i(1)\displaystyle\mathbf{z}^{(1)}_{i}\leq\mathbf{W}^{(1)}_{i}\mathbf{x}^{(0)}+\mathbf{b}^{(1)}_{i}+\rho\left\|\mathbf{W}^{(1)}_{i}\right\|

Thus we can get a lower bound and upper bound for each 𝐳i(1)\mathbf{z}^{(1)}_{i}. We define did_{i} and uiu_{i} as the following:

di=𝐖i(1)𝐱(0)+𝐛i(1)ρ𝐖i(1)\displaystyle d_{i}=\mathbf{W}^{(1)}_{i}\mathbf{x}^{(0)}+\mathbf{b}^{(1)}_{i}-\rho\left\|\mathbf{W}^{(1)}_{i}\right\| (86)
ui=𝐖i(1)𝐱(0)+𝐛i(1)+ρ𝐖i(1)\displaystyle u_{i}=\mathbf{W}^{(1)}_{i}\mathbf{x}^{(0)}+\mathbf{b}^{(1)}_{i}+\rho\left\|\mathbf{W}^{(1)}_{i}\right\| (87)

We can derive the following quadratic lower and upper bounds for each 𝐚i(1)\mathbf{a}^{(1)}_{i}:

𝐚i(1){di(uidi)2(𝐳i(1))2+ui2+di2(uidi)2𝐳i(1)ui2di(uidi)2 if |di||ui|ui(uidi)2(𝐳i(1))22uidi(uidi)2𝐳i(1)+uidi2(uidi)2if |di||ui|\displaystyle\mathbf{a}^{(1)}_{i}\leq\begin{dcases}\frac{-d_{i}}{(u_{i}-d_{i})^{2}}\left(\mathbf{z}^{(1)}_{i}\right)^{2}+\frac{u_{i}^{2}+d_{i}^{2}}{(u_{i}-d_{i})^{2}}\mathbf{z}^{(1)}_{i}-\frac{u_{i}^{2}d_{i}}{(u_{i}-d_{i})^{2}}\\ \qquad\qquad\qquad\qquad\qquad\qquad\qquad\text{ if }|d_{i}|\leq|u_{i}|\\ \frac{u_{i}}{(u_{i}-d_{i})^{2}}\left(\mathbf{z}^{(1)}_{i}\right)^{2}-\frac{2u_{i}d_{i}}{(u_{i}-d_{i})^{2}}\mathbf{z}^{(1)}_{i}+\frac{u_{i}d_{i}^{2}}{(u_{i}-d_{i})^{2}}\\ \qquad\qquad\qquad\qquad\qquad\qquad\qquad\text{if }|d_{i}|\geq|u_{i}|\end{dcases}
𝐚i(1){02|di||ui|𝐳i(1)|di|2|ui|1uidi(𝐳i(1))2diuidi𝐳i(1)otherwise\displaystyle\mathbf{a}^{(1)}_{i}\geq\begin{dcases}0\qquad&2|d_{i}|\leq|u_{i}|\\ \mathbf{z}^{(1)}_{i}\qquad&|d_{i}|\geq 2|u_{i}|\\ \frac{1}{u_{i}-d_{i}}\left(\mathbf{z}^{(1)}_{i}\right)^{2}-\frac{d_{i}}{u_{i}-d_{i}}\mathbf{z}^{(1)}_{i}\qquad&\text{otherwise}\end{dcases}

The above steps are exactly the same as the quadratic upper and lower bounds used in [54].
Using the above two inequalities and the identity:

𝐳y(2)𝐳t(2)=i=1N1(𝐖y,i(2)𝐖t,i(2))𝐚i(1)\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}=\sum_{i=1}^{N_{1}}\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\mathbf{a}^{(1)}_{i}

we can compute a quadratic lower bound for $\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}$ in terms of $\mathbf{z}^{(1)}_{i}$ by taking the lower bound for $\mathbf{a}^{(1)}_{i}$ when $\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)>0$ and the upper bound when $\left(\mathbf{W}^{(2)}_{y,i}-\mathbf{W}^{(2)}_{t,i}\right)\leq 0$. Furthermore, since $\mathbf{z}^{(1)}_{i}=\mathbf{W}^{(1)}_{i}\mathbf{x}+\mathbf{b}^{(1)}_{i}$, we can express the resulting quadratic in terms of $\mathbf{x}$. Thus, we get the following quadratic function:

\mathbf{z}^{(2)}_{y}-\mathbf{z}^{(2)}_{t}\geq\frac{1}{2}\mathbf{x}^{T}\mathbf{P}\mathbf{x}+\mathbf{q}^{T}\mathbf{x}+r

The coefficients $\mathbf{P}$, $\mathbf{q}$ and $r$ can be determined using the above procedure. Note that unlike in [54], the RHS can be a non-convex function.
Thus, we obtain an optimization problem whose goal is to minimize the squared distance $1/2\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|^{2}$ subject to the RHS (which is quadratic in $\mathbf{x}$) being zero. That is, both our objective and our constraint are quadratic functions. In the optimization literature, this problem is handled via the S-procedure and is one of the few non-convex problems that can be solved efficiently [4].
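As an illustration of this construction, the following numpy sketch assembles $\mathbf{P}$, $\mathbf{q}$ and $r$ from the piecewise bounds on $\mathbf{a}^{(1)}_{i}$ above; the function name and interface are ours, and a final-layer bias term, if present, would simply be added to $r$:

```python
import numpy as np

def quadratic_coeffs(W1, b1, W2, x0, rho, y, t):
    """Assemble P, q, r so that z_y^(2) - z_t^(2) >= 0.5 x^T P x + q^T x + r
    inside ||x - x0|| <= rho (a sketch of the procedure described above)."""
    row_norms = np.linalg.norm(W1, axis=1)
    d = W1 @ x0 + b1 - rho * row_norms            # eq. (86)
    u = W1 @ x0 + b1 + rho * row_norms            # eq. (87)
    gamma = W2[y] - W2[t]
    D = W1.shape[1]
    P, q, r = np.zeros((D, D)), np.zeros(D), 0.0
    for i, gi in enumerate(gamma):
        di, ui = d[i], u[i]
        if gi > 0:                                # use the lower bound on a_i
            if 2 * abs(di) <= abs(ui):
                c2, c1, c0 = 0.0, 0.0, 0.0
            elif abs(di) >= 2 * abs(ui):
                c2, c1, c0 = 0.0, 1.0, 0.0
            else:
                c2, c1, c0 = 1.0 / (ui - di), -di / (ui - di), 0.0
        else:                                     # use the upper bound on a_i
            if abs(di) <= abs(ui):
                c2 = -di / (ui - di) ** 2
                c1 = (ui ** 2 + di ** 2) / (ui - di) ** 2
                c0 = -(ui ** 2) * di / (ui - di) ** 2
            else:
                c2 = ui / (ui - di) ** 2
                c1 = -2 * ui * di / (ui - di) ** 2
                c0 = ui * di ** 2 / (ui - di) ** 2
        # substitute z_i^(1) = W1[i] x + b1[i] and collect terms in x
        w, bi = W1[i], b1[i]
        P += 2 * gi * c2 * np.outer(w, w)
        q += gi * (2 * c2 * bi + c1) * w
        r += gi * (c2 * bi ** 2 + c1 * bi + c0)
    return P, q, r
```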
We maintain two values, $\rho_{low}$ (initialized to 0) and $\rho_{high}$ (initialized to 5).
We set $\rho=1/2\left(\rho_{low}+\rho_{high}\right)$ and use it to compute $d_{i}$ (eq. (86)) and $u_{i}$ (eq. (87)). If the final distance after solving the S-procedure is less than $\rho$, we set $\rho_{low}=\rho$; if the final distance is greater than $\rho$, we set $\rho_{high}=\rho$. We then set the new $\rho=1/2\left(\rho_{low}+\rho_{high}\right)$ and repeat until convergence.
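A sketch of this bisection loop, assuming a hypothetical helper solve_s_procedure(P, q, r, x0) (e.g., an SDP-based S-procedure solver) that returns the minimum of $\left\|\mathbf{x}-\mathbf{x}^{(0)}\right\|$ subject to the quadratic constraint being zero:

```python
def certificate_by_bisection(W1, b1, W2, x0, y, t, solve_s_procedure,
                             rho_low=0.0, rho_high=5.0, tol=1e-3):
    """Bisection over rho as described above (solve_s_procedure is assumed given)."""
    while rho_high - rho_low > tol:
        rho = 0.5 * (rho_low + rho_high)
        P, q, r = quadratic_coeffs(W1, b1, W2, x0, rho, y, t)
        dist = solve_s_procedure(P, q, r, x0)
        if dist < rho:
            rho_low = rho
        else:
            rho_high = rho
    return 0.5 * (rho_low + rho_high)
```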

Appendix H Additional experiments

Empirical accuracy means the fraction of test samples that are correctly classified after running a PGD attack [30] with an $l_{2}$ bound on the adversarial perturbations. Certified accuracy means the fraction of test samples that are classified correctly initially and whose robustness certificate is greater than a pre-specified attack radius $\rho$. Unless otherwise specified, we use $\rho=0.5$ for both empirical and certified accuracy, we use the class with the second largest logit as the attack target for the given input (i.e., the class $t$), and the experiments are run on the MNIST dataset, while noting that our results scale to more complex datasets. The notation ($L\times[1024]$, activation) denotes a neural network with $L$ layers and the specified activation function, ($\gamma=c$) denotes standard training with $\gamma$ set to $c$, and (CRT, $c$) denotes CRT training with $\gamma=c$. The certificates CROWN and CRC are computed over 150 correctly classified images.
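For reference, a minimal PyTorch-style sketch of an $l_{2}$-bounded PGD attack of the kind used to measure empirical accuracy (the step size and number of iterations are illustrative choices, and the targeted variant simply replaces the loss with the logit gap $\mathbf{z}_{t}-\mathbf{z}_{y}$):

```python
import torch

def l2_pgd(model, x0, label, rho, steps=100, alpha=0.05):
    """Gradient ascent on the loss with projection onto ||x - x0|| <= rho (a sketch)."""
    x0 = x0.detach()
    x = x0.clone().requires_grad_(True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(model(x), label)
        grad, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x += alpha * grad / (grad.norm() + 1e-12)   # normalized ascent step
            delta = x - x0
            if delta.norm() > rho:                      # l2 projection
                x.copy_(x0 + delta * (rho / delta.norm()))
    return x.detach()
```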

H.1 Computing KlbK_{lb} and KubK_{ub}

First, note that KK does not depend on the input, but on network weights 𝐖(I)\mathbf{W}^{(I)}, label yy and target tt. Different images may still have different KK because label yy and target tt may be different.

To compute $K_{lb}$ in the table, we first find, for each pair $(y,t)$, the largest eigenvalue of the Hessian at every test image whose label is $y$ and whose second largest logit is class $t$, and then take the maximum of these eigenvalues over all such images. This gives a rough estimate of the largest curvature in the vicinity of test images with label $y$ and target $t$. We could directly take the mean across all such pairs to compute $K_{lb}$. However, some pairs $(y,t)$ are infrequent (with only one or two test images), so for those pairs we cannot get a good estimate of the largest curvature in the vicinity. We therefore select all pairs $(y,t)$ that contain at least 100 images and compute $K_{lb}$ by taking the mean across these pairs.

To compute $K_{ub}$ in the table, we compute $K$ for all pairs $(y,t)$ that contain at least 100 images, i.e., at least 100 test images have label $y$ and target $t$, and then take the mean across all such $K$. This is done for a fair comparison with $K_{lb}$. Figure 2 shows a plot of $K_{ub}$ and $K_{lb}$ with increasing $\gamma$ for a sigmoid network (with 4 layers).

Figure 2: Effect of γ\gamma on KubK_{ub} and KlbK_{lb} for a 4 layer network. We observe a similar trend as in 2 and 3 layer networks (Figure 1). At γ=0\gamma=0, we observe Kub15418×KlbK_{ub}\approx 15418\times K_{lb}.
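The largest Hessian eigenvalue at a single test point can be computed with automatic differentiation; a sketch is given below (it assumes a fully-connected model acting on flattened inputs, and the names are illustrative). $K_{lb}$ for a pair $(y,t)$ is then the maximum of this quantity over the test images of that pair, averaged over pairs as described above.

```python
import torch

def max_hessian_eigenvalue(model, x, y, t):
    """Largest eigenvalue of the Hessian of z_y - z_t at x
    (dense Hessian; fine for MNIST-sized inputs)."""
    x = x.detach().flatten()

    def logit_gap(inp):
        out = model(inp.view(1, -1))
        return out[0, y] - out[0, t]

    H = torch.autograd.functional.hessian(logit_gap, x)   # D x D
    return torch.linalg.eigvalsh(H).max().item()
```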

H.2 Comparison with provable defenses

In this section, we compare Curvature-based Robust Training (Ours) against state-of-the-art interval-bound propagation based adversarial training methods, namely COAP, i.e., Convex Outer Adversarial Polytope [51], and CROWN-IBP [57], with different attack radii on the MNIST and Fashion-MNIST datasets. For CROWN-IBP, we vary the final_beta parameter from 0.5 to 3 (using an interval of 0.1) and choose the model with the best certified accuracy.

Table 8: Comparison with interval-bound propagation based adversarial training methods with attack radius ρ=0.5\rho=0.5 on MNIST dataset. Note that the certified accuracy of softplus network with CROWN-IBP is significantly less than that of a similar ReLU network.
Network Training Standard Accuracy Certified Accuracy
2×[1024]2\times[1024], softplus CRT, 0.01 98.69% 95.5%
CROWN-IBP 98.72% 89.31%
2×[1024]2\times[1024], relu CROWN-IBP 98.69% 91.38%
COAP 98.8% 90.2%
3×[1024]3\times[1024], softplus CRT, 0.01 98.56% 94.44%
CROWN-IBP 98.55% 88.67%
3×[1024]3\times[1024], relu CROWN-IBP 98.9% 90.67%
COAP 98.9% 89.0%
4×[1024]4\times[1024], softplus CRT, 0.01 98.43% 93.35%
CROWN-IBP 98.34% 87.41%
4×[1024]4\times[1024], relu CROWN-IBP 98.78% 90.45%
COAP 98.9% 89.0%
Table 9: Comparison with interval-bound propagation based adversarial training methods with attack radius ρ=0.5\rho=0.5 on Fashion-MNIST dataset.
Network Training Standard Accuracy Certified Robust Accuracy
2×[1024]2\times[1024], softplus CRT, 0.01 88.45% 78.45%
2×[1024]2\times[1024], relu COAP 86.0% 74.0%
CROWN-IBP 85.89% 74.62%
3×[1024]3\times[1024], softplus CRT, 0.01 86.21% 76.94%
3×[1024]3\times[1024], relu COAP 85.9% 74.3%
CROWN-IBP 86.27% 74.56%
4×[1024]4\times[1024], softplus CRT, 0.01 86.37% 75.02%
4×[1024]4\times[1024], relu COAP 85.9% 74.2%
CROWN-IBP 86.03% 74.38%
Table 10: Comparison with interval-bound propagation based adversarial training methods with attack radius ρ=1.58\rho=1.58 on MNIST dataset. We again observe that the certified accuracy of softplus network with CROWN-IBP is significantly less than that of a similar ReLU network.
Network Training Standard Accuracy Certified Robust Accuracy
2×[1024]2\times[1024], softplus CRT, 0.01 98.68% 69.79%
CROWN-IBP 88.48% 42.36%
2×[1024]2\times[1024], relu COAP 89.33% 44.29%
CROWN-IBP 89.49% 44.96%
3×[1024]3\times[1024], softplus CRT, 0.01 98.26% 14.21%
CRT, 0.030.03 97.82% 50.72%
CRT, 0.05 97.43% 57.78%
CROWN-IBP 86.58% 42.14%
3×[1024]3\times[1024], relu COAP 89.12% 44.21%
CROWN-IBP 87.77% 44.74%
4×[1024]4\times[1024], softplus CRT, 0.010.01 97.80% 6.25%
CRT, 0.030.03 97.09% 29.64%
CRT, 0.050.05 96.33% 44.44%
CRT, 0.07 95.60% 53.19%
CROWN-IBP 82.74% 41.34%
4×[1024]4\times[1024], relu COAP 90.17% 44.66%
CROWN-IBP 84.4% 43.83%

H.3 Comparing Randomized Smoothing with CRT

Table 11: Comparison between CRT and Randomized Smoothing [10]. $s$ denotes the standard deviation used for smoothing. We use $\rho=0.5$. For CRT, we use $\gamma=0.01$.
Network Randomized Smoothing CRT
s=0.25s=0.25 s=0.50s=0.50 s=1.0s=1.0 _
2×[1024]2\times[1024], sigmoid 93.75% 93.09% 88.91% 95.61%
2×[1024]2\times[1024], tanh 94.61% 93.08% 82.26% 95.00%
3×[1024]3\times[1024], sigmoid 94.00% 93.03% 86.58% 94.99%
3×[1024]3\times[1024], tanh 93.69% 91.68% 80.55% 94.16%
4×[1024]4\times[1024], sigmoid 93.68% 92.45% 84.99% 93.41%
4×[1024]4\times[1024], tanh 93.57% 92.19% 83.90% 91.37%

Since randomized smoothing is designed for the untargeted attack setting while CRT is designed for targeted attacks, we make the following changes to randomized smoothing. First, we use $n_{0}=100$ initial samples to select the label class ($y$) and the false target class ($t$). The number of samples used for estimation is $n=100{,}000$ and the failure probability is $\alpha=0.001$. We then use the binary version of randomized smoothing for estimation, i.e., we classify between $y$ and $t$. To find the adversarial example for adversarial training, we use the cross-entropy loss over the two classes ($y$ and $t$).
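For clarity, a minimal sketch of the smoothing-based certification step, following the standard randomized smoothing recipe [10]; the predict helper (returning hard labels under Gaussian noise) is assumed, and the selection of the target class $t$ and the restriction of the estimation step to the two classes $y$ and $t$ are omitted for brevity:

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportion_confint

def certify_smoothing(predict, x, s, n0=100, n=100_000, alpha=0.001):
    """Certified l2 radius of the smoothed classifier at x (a simplified sketch)."""
    # selection: pick the top class from n0 noisy samples
    labels0 = predict(x + s * np.random.randn(n0, *x.shape))
    y = np.bincount(labels0).argmax()
    # estimation: lower confidence bound on P(class = y) under noise
    labels = predict(x + s * np.random.randn(n, *x.shape))
    count_y = int((labels == y).sum())
    p_lb = proportion_confint(count_y, n, alpha=2 * alpha, method="beta")[0]
    if p_lb <= 0.5:
        return y, 0.0                    # abstain: no certified radius
    return y, s * norm.ppf(p_lb)         # certified l2 radius
```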

H.4 Additional experiments

Table 12: Success rates ($primal=dual$) for different values of $\gamma$. Certificate success rate denotes the fraction of points $\mathbf{x}^{(0)}$ satisfying $\mathbf{z}_{y}-\mathbf{z}_{t}=0$, and attack success rate denotes the fraction of points $\mathbf{x}^{(0)}$ satisfying $\|\mathbf{x}^{(attack)}-\mathbf{x}^{(0)}\|_{2}=\rho$, implying $primal=dual$ in Theorems 1 and 2 respectively. We observe that as we increase $\gamma$, the fraction of points satisfying $primal=dual$ increases for both the certificate and the attack problems. This can be attributed to the curvature bound $K(\mathbf{W},y,t)$ becoming tighter as $\gamma$ increases.
Network γ\gamma Accuracy Attack success rate Certificate success rate
2×[1024]2\times[1024], sigmoid 0. 98.77% 5.05% 2.24%
0.01 98.57% 100% 15.68%
0.02 98.59% 100% 31.56%
0.03 98.30% 100% 44.17%
3×[1024]3\times[1024], sigmoid 0. 98.52% 0.% 0.12%
0.01 98.23% 44.86% 3.34%
0.03 97.86% 100% 11.51%
0.05 97.60% 100% 22.59%
4×[1024]4\times[1024], sigmoid 0. 98.22% 0.% 0.01%
0.01 97.24% 24.42% 2.68%
0.03 96.27% 44.42% 6.45%
0.05 95.77% 99.97% 12.40%
0.06 95.52% 100% 15.87%
0.07 95.24% 100% 19.53%
Table 13: Results for CIFAR-10 dataset (only curvature regularization, no CRT training)
Network Training Standard Accuracy Empirical Robust Accuracy Certified Robust Accuracy Certificate (mean)
CROWN CRC
2×[1024]2\times[1024], sigmoid standard 46.23% 37.82% 14.10% 0.37219 0.38173
γ=0.01\gamma=0.01 45.42% 38.17% 26.50% 0.40540 0.55010
3×[1024]3\times[1024], sigmoid standard 48.57% 34.80% 0.00% 0.19127 0.01404
γ=0.01\gamma=0.01 50.31% 39.87% 18.28% 0.24778 0.37895
4×[1024]4\times[1024], sigmoid standard 46.04% 34.38% 0.00% 0.19340 0.00191
γ=0.01\gamma=0.01 48.28% 40.10% 21.07% 0.29654 0.40005
Table 14: Comparison between CRT, PGD [30] and TRADES [58] for sigmoid and tanh networks. CRC outperforms CROWN significantly for 2 layer networks and when trained with our regularizer for deeper networks. CRT outperforms TRADES and PGD giving higher certified accuracy.
Network Training Standard Accuracy Empirical Robust Accuracy Certified Robust Accuracy Certificate (mean)
CROWN CRC
2×[1024]2\times[1024], sigmoid PGD 98.80% 96.26% 93.37% 0.37595 0.82702
TRADES 98.87% 96.76% 95.13% 0.41358 0.92300
CRT, 0.010.01 98.57% 96.28% 95.59% 0.43061 1.54673
2×[1024]2\times[1024], tanh PGD 98.76% 95.79% 84.11% 0.30833 0.61340
TRADES 98.63% 96.20% 93.72% 0.40601 0.86287
CRT, 0.010.01 98.52% 95.90% 95.00% 0.37691 1.47016
3×[1024]3\times[1024], sigmoid PGD 98.84% 96.14% 0.00% 0.29632 0.07290
TRADES 98.95% 96.79% 0.00% 0.30576 0.09108
CRT, 0.010.01 98.23% 95.70% 94.99% 0.39603 1.24100
3×[1024]3\times[1024], tanh PGD 98.78% 94.92% 0.00% 0.12706 0.03036
TRADES 98.16% 94.78% 0.00% 0.15875 0.02983
CRT, 0.010.01 98.15% 95.00% 94.16% 0.28004 1.14995
4×[1024]4\times[1024], sigmoid PGD 98.84% 96.26% 0.00% 0.25444 0.00658
TRADES 98.76% 96.67% 0.00% 0.26128 0.00625
CRT, 0.010.01 97.83% 94.65% 93.41% 0.40327 1.06208
4×[1024]4\times[1024], tanh PGD 98.53% 94.53% 0.00% 0.07439 0.00140
TRADES 97.08% 92.85% 0.00% 0.11889 0.00068
CRT, 0.010.01 97.24% 93.05% 91.37% 0.33649 0.93890
Table 15: Comparison between CRC and CROWN-general (CROWN-Ada for relu) for different targets. For CRT training, we use γ=0.01\gamma=0.01. We compare CRC with CROWN-general for different targets for 150 correctly classified images. Runner-up means class with second highest logit is considered as adversarial class. Random means any random class other than the label is considered adversarial. Least means class with smallest logit is adversarial. For 2-layer networks, CRC outperforms CROWN-general significantly even without adversarial training. For deeper networks (3 and 4 layers), CRC works better on networks that are trained with curvature regularization. Both CROWN and CRC are computed on CPU but the running time numbers mentioned here are not directly comparable because our CRC implementation uses a batch of images while the CROWN implementation uses a single image at a time.
Network Training Target Certificate (mean) Time per Image (s)
CROWN CRC CROWN CRC
2×[1024]2\times[1024], relu standard runner-up 0.50110 0.59166 0.1359 2.3492
random 0.68506 0.83080 0.2213 3.5942
least 0.86386 1.04883 0.1904 3.0292
2×[1024]2\times[1024], sigmoid standard runner-up 0.28395 0.48500 0.1818 0.1911
random 0.38501 0.69087 0.1870 0.1912
least 0.47639 0.85526 0.1857 0.1920
CRT, 0.010.01 runner-up 0.43061 1.54673 0.1823 0.1910
random 0.52847 1.99918 0.1853 0.1911
least 0.62319 2.41047 0.1873 0.1911
2×[1024]2\times[1024], tanh standard runner-up 0.23928 0.40047 0.1672 0.1973
random 0.31281 0.52025 0.1680 0.1986
least 0.38964 0.63081 0.1726 0.1993
CRT, 0.010.01 runner-up 0.37691 1.47016 0.1633 0.1963
random 0.45896 1.87571 0.1657 0.1982
least 0.52800 2.21704 0.1697 0.1981
3×[1024]3\times[1024], sigmoid standard runner-up 0.24644 0.06874 1.6356 0.5012
random 0.29496 0.08275 1.5871 0.5090
least 0.33436 0.09771 1.6415 0.5056
CRT, 0.010.01 runner-up 0.39603 1.24100 1.5625 0.5013
random 0.46808 1.54622 1.6142 0.4974
least 0.51906 1.75916 1.6054 0.4967
3×[1024]3\times[1024], tanh standard runner-up 0.08174 0.01169 1.4818 0.4908
random 0.10012 0.01432 1.5906 0.4963
least 0.12132 0.01757 1.5888 0.5076
CRT, 0.010.01 runner-up 0.28004 1.14995 1.4832 0.4926
random 0.32942 1.41032 1.5637 0.4957
least 0.38023 1.65692 1.5626 0.4930
4×[1024]4\times[1024], sigmoid standard runner-up 0.19501 0.00454 4.7814 0.8107
random 0.21417 0.00542 4.6313 0.8377
least 0.22706 0.00609 4.7973 0.8313
CRT, 0.010.01 runner-up 0.40327 1.06208 4.1830 0.8088
random 0.47038 1.29095 4.3922 0.7333
least 0.52249 1.49521 4.4676 0.7879
4×[1024]4\times[1024], tanh standard runner-up 0.03554 0.00028 5.7016 0.8836
random 0.04247 0.00036 5.8379 0.8602
least 0.04895 0.00044 5.8298 0.9045
CRT, 0.010.01 runner-up 0.33649 0.93890 3.8815 0.8182
random 0.41617 1.18956 4.0013 0.8215
least 0.47778 1.41429 4.3856 0.8311
Table 16: In this table, we measure the effect of increasing γ\gamma, when the network is trained with CRT on standard, empirical, certified robust accuracy, KlbK_{lb} and KubK_{ub} (defined in subsection H.1) for different depths (2, 3, 4 layer) and activations (sigmoid, tanh). We find that for all networks γ=0.01\gamma=0.01 works best. We find that the lower bound, KlbK_{lb} increases (for γ=0\gamma=0) for deeper networks suggesting that deep networks have higher curvature. Furthermore, for a given γ\gamma (say 0.0050.005), we find that the gap between KubK_{ub} and KlbK_{lb} increases as we increase the depth suggesting that KK is not a tight bound for deeper networks.
Network γ\gamma Standard Accuracy Empirical Robust Accuracy Certified Robust Accuracy Curvature bound (mean)
KlbK_{lb} KubK_{ub}
2×[1024]2\times[1024], sigmoid 0.0 98.77% 96.17% 95.04% 7.2031 72.0835
0.005 98.82% 96.33% 95.61% 3.8411 8.2656
0.01 98.57% 96.28% 95.59% 2.8196 5.4873
0.02 98.59% 95.97% 95.22% 2.2114 3.7228
0.03 98.30% 95.73% 94.94% 1.8501 2.9219
2×[1024]2\times[1024], tanh 0.0 98.65% 95.48% 92.69% 12.8434 107.5689
0.005 98.71% 95.88% 94.76% 4.8116 10.1860
0.01 98.52% 95.90% 95.00% 3.4269 6.3529
0.02 98.35% 95.71% 94.77% 2.3943 4.1513
0.03 98.29% 95.39% 94.54% 1.9860 3.933
3×[1024]3\times[1024], sigmoid 0. 98.52% 90.26% 0.00% 19.2131 3294.9070
0.005 98.41% 95.81% 94.91% 2.6249 13.4985
0.01 98.23% 95.70% 94.99% 1.9902 8.6654
0.02 97.99% 95.33% 94.64% 1.4903 5.4380
0.03 97.86% 94.98% 94.15% 1.2396 4.1409
0.04 97.73% 94.60% 93.88% 1.0886 3.3354
0.05 97.60% 94.45% 93.65% 0.9677 2.7839
3×[1024]3\times[1024], tanh 0. 98.19% 86.38% 0.00% 133.7992 17767.5918
0.005 98.13% 94.56% 93.01% 3.2461 17.5500
0.01 98.15% 95.00% 94.16% 2.2347 10.8635
0.02 97.84% 94.79% 94.05% 1.6556 6.7072
0.03 97.70% 94.19% 93.42% 1.3546 5.0533
0.04 97.57% 94.04% 92.95% 1.1621 4.0071
0.05 97.31% 93.66% 92.65% 1.0354 3.3439
4×[1024]4\times[1024], sigmoid 0. 98.22% 83.04% 0.00% 86.9974 343582.3125
0.01 97.83% 94.65% 93.41% 1.6823 10.2289
0.02 97.33% 94.02% 92.94% 1.2089 6.5573
0.03 97.07% 93.52% 92.65% 1.0144 4.9576
0.04 96.70% 92.78% 91.95% 0.8840 3.9967
0.05 96.38% 92.29% 91.33% 0.7890 3.4183
0.07 96.08% 91.83% 90.67% 0.6614 2.6905
4×[1024]4\times[1024], tanh 0. 97.45% 75.18% 0.00% 913.6984 37148156
0.01 97.24% 93.05% 91.37% 1.9114 12.2148
0.02 96.82% 92.65% 91.35% 1.3882 7.1771
0.03 96.27% 91.43% 90.09% 1.1643 5.1671
0.04 95.62% 90.69% 89.41% 0.9620 3.9061
0.05 95.77% 90.69% 89.40% 0.9160 3.2909
0.07 95.24% 89.51% 87.91% 0.7540 2.5635
Table 17: In this table, we measure the impact of increasing curvature regularization (γ)(\gamma) on accuracy, empirical robust accuracy, certified robust accuracy, CROWN-general and CRC when the network is trained without any adversarial training. We find that adding a very small amount of curvature regularization has a minimal impact on the accuracy but significantly increases CRC. Increase in CROWN certificate is not of similar magnitude. Somewhat surprisingly, we observe that even without any adversarial training, we can get nontrivial certified accuracies of 84.73%,88.66%,89.61%84.73\%,88.66\%,89.61\% on 2,3,4 layer sigmoid networks respectively.
Network γ\gamma Standard Accuracy Empirical Robust Accuracy Certified Robust Accuracy Certificate (mean)
CROWN CRC
2×[1024]2\times[1024], sigmoid 0. 98.37% 76.28% 54.17% 0.28395 0.48500
0.005 97.96% 88.65% 82.68% 0.36125 0.83367
0.01 98.08% 88.82% 83.53% 0.32548 0.84719
0.02 97.88% 88.90% 83.68% 0.34744 0.86632
0.03 97.73% 89.28% 84.73% 0.35387 0.90490
2×[1024]2\times[1024], tanh 0. 98.34% 79.10% 14.42% 0.23938 0.40047
0.005 98.01% 89.95% 85.70% 0.27262 0.89672
0.01 97.99% 90.17% 86.18% 0.28647 0.93819
0.02 97.64% 90.13% 86.40% 0.30075 0.99166
0.03 97.52% 89.96% 86.22% 0.30614 0.98771
3×[1024]3\times[1024], sigmoid 0. 98.37% 85.19% 0.00% 0.24644 0.06874
0.005 97.98% 91.93% 88.66% 0.38030 0.99044
0.01 97.71% 91.49% 88.33% 0.39799 1.07842
0.02 97.50% 91.34% 88.38% 0.38091 1.08396
0.03 97.16% 91.10% 88.63% 0.41015 1.15505
0.04 97.03% 90.96% 88.48% 0.42704 1.18073
0.05 96.76% 90.65% 88.30% 0.43884 1.19296
3×[1024]3\times[1024], tanh 0. 97.91% 77.40% 0.00% 0.08174 0.01169
0.005 97.45% 91.32% 88.57% 0.28196 0.95367
0.01 97.29% 90.98% 88.31% 0.31237 1.05915
0.02 97.04% 90.21% 87.77% 0.30901 1.08607
0.03 96.88% 90.02% 87.52% 0.34148 1.11717
0.04 96.53% 89.61% 86.87% 0.36583 1.11307
0.05 96.31% 89.25% 86.26% 0.38519 1.11689
4×[1024]4\times[1024], sigmoid 0. 98.39% 83.27% 0.00% 0.19501 0.00454
0.01 97.41% 91.71% 89.61% 0.40620 1.05323
0.02 96.47% 90.03% 87.77% 0.45074 1.14219
0.03 96.24% 90.40% 88.14% 0.47961 1.30671
0.04 95.65% 89.61% 87.54% 0.49987 1.35129
0.05 95.36% 89.10% 87.09% 0.51187 1.36064
0.07 95.23% 88.03% 85.93% 0.54754 1.27948
4×[1024]4\times[1024], tanh 0. 97.65% 69.20% 0.00% 0.03554 0.00028
0.01 96.52% 89.38% 86.40% 0.34778 0.97365
0.02 96.09% 88.79% 86.09% 0.41662 1.10860
0.03 95.74% 88.36% 85.65% 0.44981 1.17400
0.04 95.10% 87.50% 84.74% 0.48356 1.21957
0.05 95.14% 87.72% 84.77% 0.49113 1.25076
0.07 94.34% 86.67% 83.90% 0.49750 1.24198

References

  • Anil et al. [2018] Anil, C., Lucas, J., and Grosse, R. B. Sorting out lipschitz function approximation. In ICML, 2018.
  • Athalye & Carlini [2018] Athalye, A. and Carlini, N. On the robustness of the cvpr 2018 white-box adversarial example defenses. ArXiv, abs/1804.03286, 2018.
  • Athalye et al. [2018] Athalye, A., Carlini, N., and Wagner, D. A. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
  • Boyd & Vandenberghe [2004] Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. ISBN 0521833787.
  • Bunel et al. [2017] Bunel, R., Turkaslan, I., Torr, P. H. S., Kohli, P., and Mudigonda, P. K. A unified view of piecewise linear neural network verification. In NeurIPS, 2017.
  • Cao & Gong [2017] Cao, X. and Gong, N. Z. Mitigating evasion attacks to deep neural networks via region-based classification. ArXiv, abs/1709.05583, 2017.
  • Carlini & Wagner [2017] Carlini, N. and Wagner, D. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, 2017.
  • Carlini et al. [2017] Carlini, N., Katz, G., Barrett, C. E., and Dill, D. L. Provably minimally-distorted adversarial examples. 2017.
  • Cheng et al. [2017] Cheng, C.-H., Nührenberg, G., and Ruess, H. Maximum resilience of artificial neural networks. In ATVA, 2017.
  • Cohen et al. [2019] Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. In ICML, 2019.
  • Croce et al. [2018] Croce, F., Andriushchenko, M., and Hein, M. Provable robustness of relu networks via maximization of linear regions. ArXiv, abs/1810.07481, 2018.
  • Dutta et al. [2018] Dutta, S., Jha, S., Sankaranarayanan, S., and Tiwari, A. Output range analysis for deep feedforward neural networks. In NFM, 2018.
  • Dvijotham et al. [2018a] Dvijotham, K., Gowal, S., Stanforth, R., Arandjelovic, R., O’Donoghue, B., Uesato, J., and Kohli, P. Training verified learners with learned verifiers. ArXiv, abs/1805.10265, 2018a.
  • Dvijotham et al. [2018b] Dvijotham, K., Stanforth, R., Gowal, S., Mann, T. A., and Kohli, P. A dual approach to scalable verification of deep networks. In UAI, 2018b.
  • Ehlers [2017] Ehlers, R. Formal verification of piece-wise linear feed-forward neural networks. ArXiv, abs/1705.01320, 2017.
  • Fazlyab et al. [2019] Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. J. Efficient and accurate estimation of lipschitz constants for deep neural networks. CoRR, abs/1906.04893, 2019. URL http://arxiv.org/abs/1906.04893.
  • Fischetti & Jo [2018] Fischetti, M. and Jo, J. Deep neural networks and mixed integer linear optimization. Constraints, 23:296–309, 2018.
  • Gehr et al. [2018] Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., and Vechev, M. T. Ai2: Safety and robustness certification of neural networks with abstract interpretation. 2018 IEEE Symposium on Security and Privacy (SP), pp.  3–18, 2018.
  • Gowal et al. [2018] Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Arandjelovic, R., Mann, T. A., and Kohli, P. On the effectiveness of interval bound propagation for training verifiably robust models. ArXiv, abs/1810.12715, 2018.
  • Hein & Andriushchenko [2017] Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp.  2266–2276. 2017.
  • Huang et al. [2016] Huang, X., Kwiatkowska, M. Z., Wang, S., and Wu, M. Safety verification of deep neural networks. ArXiv, abs/1610.06940, 2016.
  • Katz et al. [2017] Katz, G., Barrett, C. W., Dill, D. L., Julian, K., and Kochenderfer, M. J. Reluplex: An efficient smt solver for verifying deep neural networks. In CAV, 2017.
  • Kurakin et al. [2016a] Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial examples in the physical world. ArXiv, abs/1607.02533, 2016a.
  • Kurakin et al. [2016b] Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial machine learning at scale. ArXiv, abs/1611.01236, 2016b.
  • LeCun & Cortes [2010] LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
  • Lécuyer et al. [2018] Lécuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. K. K. Certified robustness to adversarial examples with differential privacy. In IEEE S&P 2019, 2018.
  • Li et al. [2018] Li, B. H., Chen, C., Wang, W., and Carin, L. Certified adversarial robustness with additive gaussian noise. 2018.
  • Liu et al. [2017] Liu, X., Cheng, M., Zhang, H., and Hsieh, C.-J. Towards robust neural networks via random self-ensemble. ArXiv, abs/1712.00673, 2017.
  • Lomuscio & Maganti [2017] Lomuscio, A. and Maganti, L. An approach to reachability analysis for feed-forward relu neural networks. ArXiv, abs/1706.07351, 2017.
  • Madry et al. [2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
  • Mirman et al. [2018] Mirman, M., Gehr, T., and Vechev, M. T. Differentiable abstract interpretation for provably robust neural networks. In ICML, 2018.
  • Miyato et al. [2017] Miyato, T., ichi Maeda, S., Koyama, M., and Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:1979–1993, 2017.
  • Miyato et al. [2018] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.
  • Moosavi Dezfooli et al. [2019] Moosavi Dezfooli, S. M., Fawzi, A., Uesato, J., and Frossard, P. Robustness via curvature regularization, and vice versa. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Papernot et al. [2016] Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597, May 2016. doi: 10.1109/SP.2016.41.
  • Papernot et al. [2016] Papernot, N., McDaniel, P. D., Goodfellow, I. J., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against deep learning systems using adversarial examples. CoRR, abs/1602.02697, 2016. URL http://arxiv.org/abs/1602.02697.
  • Peck et al. [2017] Peck, J., Roels, J., Goossens, B., and Saeys, Y. Lower bounds on the robustness to adversarial perturbations. In NIPS, 2017.
  • Qin et al. [2019] Qin, C., Martens, J., Gowal, S., Krishnan, D., Fawzi, A., De, S., Stanforth, R., and Kohli, P. Adversarial robustness through local linearization. arXiv preprint arXiv:1907.02610, 2019.
  • Raghunathan et al. [2018a] Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. ArXiv, abs/1801.09344, 2018a.
  • Raghunathan et al. [2018b] Raghunathan, A., Steinhardt, J., and Liang, P. Semidefinite relaxations for certifying robustness to adversarial examples. In NeurIPS, 2018b.
  • Salman et al. [2019] Salman, H., Yang, G., Li, J., Zhang, P., Zhang, H., Razenshteyn, I. P., and Bubeck, S. Provably robust deep learning via adversarially trained smoothed classifiers. ArXiv, abs/1906.04584, 2019.
  • Samangouei et al. [2018] Samangouei, P., Kabkab, M., and Chellappa, R. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BkJ3ibb0-.
  • Scaman & Virmaux [2018] Scaman, K. and Virmaux, A. Lipschitz regularity of deep neural networks: analysis and efficient estimation, 2018.
  • Singh et al. [2018] Singh, G., Gehr, T., Mirman, M., Püschel, M., and Vechev, M. T. Fast and effective robustness certification. In NeurIPS, 2018.
  • Singla et al. [2019] Singla, S., Wallace, E., Feng, S., and Feizi, S. Understanding impacts of high-order loss approximations and features in deep learning interpretation. In ICML, 2019.
  • Szegedy et al. [2014] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.
  • Uesato et al. [2018] Uesato, J., O’Donoghue, B., Kohli, P., and van den Oord, A. Adversarial risk and the dangers of evaluating against weak attacks. In ICML, 2018.
  • Wang et al. [2018a] Wang, S., Chen, Y., Abdou, A., and Jana, S. K. K. Mixtrain: Scalable training of verifiably robust neural networks. 2018a.
  • Wang et al. [2018b] Wang, S., Pei, K., Whitehouse, J., Yang, J., and Jana, S. K. K. Efficient formal safety analysis of neural networks. In NeurIPS, 2018b.
  • Weng et al. [2018] Weng, T.-W., Zhang, H., Chen, H., Song, Z., Hsieh, C.-J., Boning, D. S., Dhillon, I. S., and Daniel, L. Towards fast computation of certified robustness for relu networks. ArXiv, abs/1804.09699, 2018.
  • Wong & Kolter [2017] Wong, E. and Kolter, J. Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. ArXiv, abs/1711.00851, 2017.
  • Wong et al. [2018] Wong, E., Schmidt, F. R., Metzen, J. H., and Kolter, J. Z. Scaling provable adversarial defenses. In NeurIPS, 2018.
  • Xiao et al. [2017] Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017. URL http://arxiv.org/abs/1708.07747.
  • Zhang et al. [2018a] Zhang, H., Weng, T.-W., Chen, P.-Y., Hsieh, C.-J., and Daniel, L. Efficient neural network robustness certification with general activation functions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp.  4939–4948. Curran Associates, Inc., 2018a.
  • Zhang et al. [2018b] Zhang, H., Weng, T.-W., Chen, P.-Y., Hsieh, C.-J., and Daniel, L. Efficient neural network robustness certification with general activation functions. ArXiv, abs/1811.00866, 2018b.
  • Zhang et al. [2018c] Zhang, H., Zhang, P., and Hsieh, C.-J. Recurjac: An efficient recursive algorithm for bounding jacobian matrix of neural networks and its applications. In AAAI, 2018c.
  • Zhang et al. [2019a] Zhang, H., Chen, H., Xiao, C., Li, B., Boning, D. S., and Hsieh, C. Towards stable and efficient training of verifiably robust neural networks. CoRR, abs/1906.06316, 2019a. URL http://arxiv.org/abs/1906.06316.
  • Zhang et al. [2019b] Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019b.
  • Zheng et al. [2016] Zheng, S., Song, Y., Leung, T., and Goodfellow, I. J. Improving the robustness of deep neural networks via stability training. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  4480–4488, 2016.