
Analytical bounds on the local Lipschitz constants of affine-ReLU functions

Trevor Avant
University of Washington
[email protected]
Kristi A. Morgansen
University of Washington
[email protected]
Abstract

In this paper, we determine analytical bounds on the local Lipschitz constants of affine functions composed with rectified linear units (ReLUs). Affine-ReLU functions represent a widely used layer in deep neural networks, due to the fact that convolution, fully-connected, and normalization functions are all affine, and are often followed by a ReLU activation function. Using an analytical approach, we mathematically determine upper bounds on the local Lipschitz constant of an affine-ReLU function, show how these bounds can be combined to determine a bound on an entire network, and discuss how the bounds can be efficiently computed, even for larger layers and networks. We show several examples by applying our results to AlexNet, as well as several smaller networks based on the MNIST and CIFAR-10 datasets. The results show that our method produces tighter bounds than the standard conservative bound (i.e. the product of the spectral norms of the layers' linear matrices), especially for small perturbations.

1 Introduction

1.1 Introduction

The huge successes of deep neural networks have been accompanied by the unfavorable property of high sensitivity. As a result, for many networks a small perturbation of the input can produce a huge change in the output [13]. These sensitivity properties are still not completely theoretically understood, and raise significant concerns when applying neural networks to safety-critical and other applications. This establishes a strong motivation to obtain a better theoretical understanding of sensitivity.

One of the main tools in analyzing the sensitivity of neural networks is the Lipschitz constant, which is a measure of how much the output of a function can change with respect to changes in the input. Analytically computing the exact Lipschitz constant of neural networks has so far been unattainable due to the complexity and high-dimensionality of most networks. As feedforward neural networks consist of a composition of functions, a conservative upper bound on the Lipschitz constant can be determined by calculating the product of each individual function’s Lipschitz constant [13]. Unfortunately, this method usually results in a very conservative bound.

Calculating or estimating tighter bounds on Lipschitz constants has recently been approached using optimization-based methods [11, 8, 3, 17, 6]. The downside to these approaches is that they usually can only be applied to small networks, and also often have to be relaxed, which invalidates any guarantee on the bound.

In summary, the current state of Lipschitz analysis of neural networks is that the function-by-function approach yields bounds which are too loose, and holistic approaches are too expensive for larger networks. In this paper, we explore a middle ground between these two approaches by analyzing the composite of two functions, the affine-ReLU function, which represents a common layer used in modern neural networks. This function is simple enough to obtain analytical results, but complex enough to provide tighter bounds than the function-by-function analysis. We can also combine the constants between layers to compute a Lipschitz constant for the entire network. Furthermore, our analytical approach leads us to develop intuition behind the structure of neural network layers, and shows how the different components of the layer (e.g. the linear operator, bias, nominal input, and size of the perturbation) contribute to sensitivity.

1.2 Related work

The high sensitivity of deep neural networks has been noted as early as [13]. This work conceived the idea of adversarial examples, which have since become a popular area of research [4]. One tool that has been used to study the sensitivity of networks is the input-output Jacobian [9, 12], which gives a local estimate of sensitivity but generally provides no guarantees.

Lipschitz constants are also a common tool to study sensitivity. Recently, several studies have explored using optimization-based approaches to compute the Lipschitz constant of neural networks. The work of [11] presents two algorithms, AutoLip and SeqLip, to compute the Lipschitz constant. AutoLip reduces to the standard conservative approach, while SeqLip is an optimization which requires a greedy approximation for larger networks. The work of [8] presents a sparse polynomial optimization (LiPopt) method which relies on the network being sparse. To apply this technique to larger networks, the authors have to first prune the network to increase sparsity. A semidefinite programming technique (LipSDP) is used in [3], but in order to apply the results to larger networks, a relaxation must be used which invalidates the guarantee. Another approach is that of [17], in which linear programming is used to estimate Lipschitz constants. Finally, [6] proposes exactly computing the Lipschitz constant using mixed integer programming, which is very expensive and can only be applied to very small networks.

Other work has considered computing Lipschitz constants in the context of adversarial examples [10, 15, 16]. While these works use Lipschitz constants and similar mathematical analysis, their focus is on classification, and it is not clear how or if these techniques can be adapted to provide guaranteed upper Lipschitz bounds for larger networks. Additionally, other work has considered constraining the Lipschitz constant as a means to regularize a network [5, 14, 1]. Finally, we note that in this work we study affine-ReLU functions which are commonly used in neural networks, but we are not aware of any work that has directly analyzed these functions except for [2].

1.3 Contributions

Our main contribution is the development of analytical upper bounds on the local Lipschitz constant of an affine-ReLU function. We show how these bounds can be combined to create a bound on an entire feedforward network, and also how we can compute our bounds even for large layers.

1.4 Notation

In this paper, we use non-bold lowercase and uppercase letters ($a$ and $A$) to denote scalars, bold lowercase ($\mathbf{a}$) to denote vectors, and bold uppercase ($\mathbf{A}$) to denote matrices. Similarly, we use non-bold to denote scalar-valued functions ($f(\cdot)$) and bold to denote vector-valued functions ($\mathbf{f}(\cdot)$). We will use inequalities to compare vectors, and say that $\mathbf{a}>\mathbf{b}$ holds if all corresponding pairs of elements satisfy the inequality. Additionally, unless otherwise specified, we let $\lVert\cdot\rVert$ denote the 2-norm.

2 Affine & ReLU functions

2.1 Affine functions

Affine functions are ubiquitous in deep neural networks, as convolutional, fully-connected, and normalization functions are all affine. An affine function can be written as $\mathbf{f}(\mathbf{x})=\mathbf{A}\mathbf{x}+\mathbf{b}$ where $\mathbf{A}\in\mathbb{R}^{m\times n}$, $\mathbf{x}\in\mathbb{R}^{n}$, and $\mathbf{b}\in\mathbb{R}^{m}$. Note that since we are considering neural networks, without loss of generality we can define $\mathbf{x}$ to be a tensor that has been reshaped into a 1D vector. We note that we can redefine the origin of the domain of an affine function to correspond to any point $\mathbf{x}_{0}$. More specifically, if we consider the system $\mathbf{A}\mathbf{x}+\mathbf{b}_{0}$ and nominal input $\mathbf{x}_{0}$, we can redefine $\mathbf{x}$ as $\mathbf{x}_{0}+\mathbf{x}$, in which case the affine function becomes $\mathbf{A}\mathbf{x}+\mathbf{A}\mathbf{x}_{0}+\mathbf{b}_{0}$. We then redefine the bias as $\mathbf{b}\coloneqq\mathbf{b}_{0}+\mathbf{A}\mathbf{x}_{0}$, which gives us the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$. In this paper, we will use the form $\mathbf{A}\mathbf{x}+\mathbf{b}$ to represent an affine function that has been shifted so that the origin is at $\mathbf{x}_{0}$, and $\mathbf{x}$ represents a perturbation from $\mathbf{x}_{0}$.

2.2 Rectified Linear Units (ReLUs)

The rectified linear unit (ReLU) is widely used as an activation function in deep neural networks. The ReLU is simply the elementwise maximum of an input and zero: $\mathbf{relu}(\mathbf{y})=\max(\mathbf{0},\mathbf{y})$. The ReLU is piecewise linear, and can therefore be represented by a piecewise linear matrix. To define this matrix, we will first define the following function, which indicates whether a value is non-negative: $\text{ind}(y)=\{1~\text{if}~y\geq 0,~~0~\text{if}~y<0\}$. We will define the elementwise version of this function as $\mathbf{ind}:\mathbb{R}^{m}\rightarrow\{0,1\}^{m}$. By defining $\mathbf{diag}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{m\times m}$ as the function which creates a diagonal matrix from a vector, we define the ReLU matrix $\mathbf{R}_{\mathbf{y}}$ and ReLU function $\mathbf{relu}$ as follows

𝐑𝐲=𝐝𝐢𝐚𝐠(𝐢𝐧𝐝(𝐲)),𝐫𝐞𝐥𝐮(𝐲)=𝐑𝐲𝐲.\boldsymbol{\mathbf{R}}_{\boldsymbol{\mathbf{y}}}=\boldsymbol{\mathbf{diag}}(\boldsymbol{\mathbf{ind}}(\boldsymbol{\mathbf{y}})),~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\boldsymbol{\mathbf{relu}}(\boldsymbol{\mathbf{y}})=\boldsymbol{\mathbf{R}}_{\boldsymbol{\mathbf{y}}}\boldsymbol{\mathbf{y}}. (1)

Note that $\mathbf{R}_{\mathbf{y}}$ is a function of $\mathbf{y}$, but to make our notation clear we choose to denote the dependence on $\mathbf{y}$ using a subscript rather than parentheses.

Note that ReLUs are naturally related to the geometric concept of orthants, which are a higher-dimensional generalization of quadrants in $\mathbb{R}^{2}$ and octants in $\mathbb{R}^{3}$. The $m$-dimensional space $\mathbb{R}^{m}$ has $2^{m}$ orthants. Furthermore, the matrix $\mathbf{R}$ can be interpreted as an orthogonal projection matrix, which projects $\mathbf{y}$ onto a lower-dimensional subspace of $\mathbb{R}^{m}$. In $\mathbb{R}^{3}$ for example, $\mathbf{R}$ will represent a projection onto either the origin (when $\mathbf{R}=\mathbf{0}$), a coordinate axis, a coordinate plane, or all of $\mathbb{R}^{3}$ (when $\mathbf{R}=\mathbf{I}$). Each orthant in $\mathbb{R}^{m}$ corresponds to a linear region of the ReLU function, so since there are $2^{m}$ orthants there are $2^{m}$ linear regions.
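To make the matrix form in (1) concrete, the following sketch (our own NumPy code, purely illustrative) builds $\mathbf{R}_{\mathbf{y}}$ for a small vector and checks that $\mathbf{R}_{\mathbf{y}}\mathbf{y}$ equals the elementwise ReLU:

```python
import numpy as np

def ind(y):
    # elementwise indicator of non-negativity: 1 where y_i >= 0, else 0
    return (y >= 0).astype(float)

def relu_matrix(y):
    # R_y = diag(ind(y)): the linear map realizing the ReLU on y's orthant
    return np.diag(ind(y))

y = np.array([1.5, -0.3, 0.0, -2.0])
R_y = relu_matrix(y)
assert np.allclose(R_y @ y, np.maximum(0.0, y))  # relu(y) = R_y y
```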

2.3 Affine-ReLU Functions

Figure 1: The unit ball transformed by affine and ReLU functions in $\mathbb{R}^{2}$. Different colors represent different orthants after the affine operator. As shown in the rightmost diagram, the ReLU projects the domain onto the non-negative orthant.

We define an affine-ReLU function as a ReLU composed with an affine function. In a neural network, these represent one layer, e.g. a convolution or fully-connected function with a ReLU activation. Using the notation in (1), we can write an affine-ReLU function as

$\mathbf{relu}(\mathbf{A}\mathbf{x}+\mathbf{b})=\mathbf{R}_{\mathbf{A}\mathbf{x}+\mathbf{b}}(\mathbf{A}\mathbf{x}+\mathbf{b}).$ (2)
Figure 2: A line segment transformed by affine and ReLU functions in $\mathbb{R}^{2}$.

Using Fig. 2 as a reference, we note that the vector $\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{b}$ will lie in several different orthants of $\mathbb{R}^{m}$, which are the linear regions of the ReLU function. As a result, $\mathbf{relu}(\mathbf{A}\mathbf{x}+\mathbf{b})$ can be represented as a sum across the linear regions (i.e. orthants) of the function. For each $\mathbf{x}$, there will be some number $p$ of linear regions, and we define the points at which $\mathbf{y}$ transitions between linear regions as $\mathbf{y}_{i}=\alpha_{i}\mathbf{A}\mathbf{x}+\mathbf{b}$ for $i=1,\ldots,p-1$, where $0<\alpha_{i}<1$ and $\alpha_{i}>\alpha_{i-1}$. We also define $\alpha_{0}=0$ and $\alpha_{p}=1$ so that $\mathbf{y}_{0}=\mathbf{b}$ and $\mathbf{y}_{p}=\mathbf{y}$. The transition vectors $\mathbf{y}_{i}$ can be determined for a given $\mathbf{x}$ by determining the values of $\alpha_{i}$ for which elements of $\mathbf{y}_{i}=\alpha_{i}\mathbf{A}\mathbf{x}+\mathbf{b}$ equal zero. With these vectors defined, we note that any vectors $\mathbf{y}_{i-1}$ and $\mathbf{y}_{i}$ will lie in or at the boundary of the same orthant, and we define $\mathbf{R}_{i}$ as the ReLU matrix corresponding to that orthant. Therefore, we can write the net change of the affine-ReLU function across the orthant adjacent to $\mathbf{y}_{i-1}$ and $\mathbf{y}_{i}$ as

$\mathbf{R}_{i}(\mathbf{y}_{i}-\mathbf{y}_{i-1})=\mathbf{R}_{i}\big(\alpha_{i}\mathbf{A}\mathbf{x}+\mathbf{b}-(\alpha_{i-1}\mathbf{A}\mathbf{x}+\mathbf{b})\big)=(\alpha_{i}-\alpha_{i-1})\mathbf{R}_{i}\mathbf{A}\mathbf{x}.$ (3)

Next, we define $\Delta\alpha_{i}\coloneqq\alpha_{i}-\alpha_{i-1}$ and note that $\sum_{i=1}^{p}\Delta\alpha_{i}=1$. Noting that the change over the segment of $\mathbf{y}$ from $\mathbf{0}$ to $\mathbf{b}$ is $\mathbf{R}_{\mathbf{b}}(\mathbf{b}-\mathbf{0})=\mathbf{R}_{\mathbf{b}}\mathbf{b}$ (see Fig. 2), we can write the affine-ReLU function as

$\mathbf{relu}(\mathbf{A}\mathbf{x}+\mathbf{b})=\mathbf{R}_{\mathbf{b}}\mathbf{b}+\sum_{i=1}^{p}\Delta\alpha_{i}\,\mathbf{R}_{i}\mathbf{A}\mathbf{x}.$ (4)
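The decomposition in (4) can be checked numerically. The sketch below (our own code and naming) computes the crossing values $\alpha_{i}$ at which coordinates of $\alpha\mathbf{A}\mathbf{x}+\mathbf{b}$ change sign, evaluates each orthant's ReLU matrix at an interior point of the corresponding segment, and verifies that the sum reproduces $\mathbf{relu}(\mathbf{A}\mathbf{x}+\mathbf{b})$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
x = rng.normal(size=3)
Ax = A @ x

# crossing values alpha_i in (0, 1) at which a coordinate of alpha*A x + b hits zero
alphas = sorted(a for a in (-b[Ax != 0] / Ax[Ax != 0]) if 0 < a < 1)
alphas = [0.0] + alphas + [1.0]

R = lambda v: np.diag((v >= 0).astype(float))   # ReLU matrix of v's orthant

total = R(b) @ b                                # relu(b) = R_b b
for a_prev, a_next in zip(alphas[:-1], alphas[1:]):
    a_mid = 0.5 * (a_prev + a_next)             # interior point of segment i
    R_i = R(a_mid * Ax + b)                     # ReLU matrix of that orthant
    total += (a_next - a_prev) * (R_i @ Ax)     # Delta alpha_i * R_i A x

assert np.allclose(total, np.maximum(0.0, A @ x + b))   # matches (4)
```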

3 Lipschitz Constants

3.1 Local Lipschitz Constant

We will analyze the sensitivity of affine-ReLU functions using a Lipschitz constant, which measures how much the output of a function can change with respect to changes in the input. The Lipschitz constant of a function $\mathbf{f}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ is $L=\sup_{\mathbf{x}_{0}\neq\mathbf{x}_{1}}\lVert\mathbf{f}(\mathbf{x}_{1})-\mathbf{f}(\mathbf{x}_{0})\rVert/\lVert\mathbf{x}_{1}-\mathbf{x}_{0}\rVert$. Our goal is to analyze the sensitivity of the affine-ReLU function by considering a nominal input and perturbation. As mentioned in Section 2.1, we consider the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$ to be shifted so that the origin $\mathbf{x}=\mathbf{0}$ corresponds to the nominal input $\mathbf{x}_{0}$, and $\mathbf{x}$ corresponds to a perturbation. We define the local Lipschitz constant as a modified version of the standard Lipschitz constant:

$L(\mathbf{x}_{0},\mathcal{X})\coloneqq\max_{\mathbf{x}\in\mathcal{X}}\frac{\lVert\mathbf{f}(\mathbf{x})-\mathbf{f}(\mathbf{0})\rVert}{\lVert\mathbf{x}\rVert}.$ (5)

The set $\mathcal{X}\subseteq\mathbb{R}^{n}$ represents the set of all permissible perturbations. In this paper, we will be most interested in the case that $\mathcal{X}$ is the Euclidean ball ($\mathcal{X}=\{\mathbf{x}~|~\lVert\mathbf{x}\rVert\leq\epsilon\}$) or the positive part of the Euclidean ball ($\mathcal{X}=\{\mathbf{x}~|~\lVert\mathbf{x}\rVert\leq\epsilon,~\mathbf{x}\geq\mathbf{0}\}$). See Appendix A.2 for more information. As $\mathcal{X}$ denotes the domain of the affine function (and the affine-ReLU function), we will define the range of the affine function similarly as

$\mathcal{Y}=\{\mathbf{A}\mathbf{x}+\mathbf{b}~|~\mathbf{x}\in\mathcal{X}\}.$ (6)

Applying the local Lipschitz constant to the affine-ReLU function we have

$L(\mathbf{x}_{0},\mathcal{X})=\max_{\mathbf{x}\in\mathcal{X}}\frac{\lVert\mathbf{relu}(\mathbf{A}\mathbf{x}+\mathbf{b})-\mathbf{relu}(\mathbf{b})\rVert}{\lVert\mathbf{x}\rVert}.$ (7)
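While (7) cannot be maximized exactly in high dimensions, it can be lower-bounded empirically by sampling; this is the idea behind the naive lower bound used in Section 6. A minimal sketch (our own function names, assuming the Euclidean-ball domain):

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def sampled_lower_bound(A, b, eps, n_samples=10000, seed=0):
    """Lower bound on L(x0, X), X the eps-ball, by sampling perturbations."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    best = 0.0
    for _ in range(n_samples):
        x = rng.normal(size=n)
        x *= eps / np.linalg.norm(x)          # sample on the eps-sphere
        ratio = np.linalg.norm(relu(A @ x + b) - relu(b)) / np.linalg.norm(x)
        best = max(best, ratio)
    return best
```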

Determining the Lipschitz constant above is a difficult problem due to the piecewise nature of the ReLU. We are not aware of a way to do this computation for high dimensional spaces, which prohibits us from exactly computing the Lipschitz constant. Instead, we will try to come up with a conservative bound. In this paper, we will present several bounds on the Lipschitz constant. The following lemma will serve as a starting point for several of our bounds.

Lemma 1.

Consider the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$, its domain $\mathcal{X}$, and the piecewise representation of the affine-ReLU function in (4). We have the following upper bound on the affine-ReLU function's local Lipschitz constant: $L(\mathbf{x}_{0},\mathcal{X})\leq\max_{\mathbf{x}\in\mathcal{X}}\sum_{i=1}^{p}\Delta\alpha_{i}\lVert\mathbf{R}_{i}\mathbf{A}\rVert$.

The proof is shown in Appendix A.1.

3.2 Naive and intractable upper bounds

We will now approach the task of deriving an analytical upper bound on the local Lipschitz constant of the affine-ReLU function. We start by presenting a standard naive bound.

Proposition 1.

Consider the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$ and its domain $\mathcal{X}$. The spectral norm of $\mathbf{A}$ is an upper bound on the affine-ReLU function's local Lipschitz constant: $L(\mathbf{x}_{0},\mathcal{X})\leq\lVert\mathbf{A}\rVert$.

The proof is shown in Appendix A.1. This is a standard conservative bound that is often used in determining the Lipschitz constants of a full neural network. Next, we will attempt to create a tighter bound. Consider the term $\lVert\mathbf{R}_{i}\mathbf{A}\rVert$ in the inequality in Lemma 1. The ReLU matrices $\mathbf{R}_{i}$ are those that correspond to the vectors $\mathbf{y}\in\mathcal{Y}$. So if we can determine all possible ReLU matrices for the vectors in $\mathcal{Y}$, then we can determine a tighter bound on the Lipschitz constant. We start by defining the matrix $\mathbf{R}_{max}$ as

$\mathbf{R}_{max}\coloneqq\{\mathbf{R}_{\mathbf{y}}~|~\lVert\mathbf{R}_{\mathbf{y}}\mathbf{A}\rVert\geq\lVert\mathbf{R}_{\mathbf{w}}\mathbf{A}\rVert,~\mathbf{y}\in\mathcal{Y},~\forall\mathbf{w}\in\mathcal{Y}\}.$ (8)
Proposition 2.

Consider the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$, its domain $\mathcal{X}$, and the matrix $\mathbf{R}_{max}$ defined in (8). The spectral norm of $\mathbf{R}_{max}\mathbf{A}$ is an upper bound on the affine-ReLU function's local Lipschitz constant: $L(\mathbf{x}_{0},\mathcal{X})\leq\lVert\mathbf{R}_{max}\mathbf{A}\rVert$.

Proof.

Consider the inequality in Lemma 1, and note that the ReLU matrices $\mathbf{R}_{i}$ correspond to vectors $\mathbf{y}\in\mathcal{Y}$. So, by definition of $\mathbf{R}_{max}$, we have $\lVert\mathbf{R}_{max}\mathbf{A}\rVert\geq\lVert\mathbf{R}_{i}\mathbf{A}\rVert$ for all $\mathbf{R}_{i}$ and all $\mathbf{x}\in\mathcal{X}$. Since $\sum_{i=1}^{p}\Delta\alpha_{i}=1$, we have $L(\mathbf{x}_{0},\mathcal{X})\leq\lVert\mathbf{R}_{max}\mathbf{A}\rVert$. ∎

While this proposition would provide an upper bound on the Lipschitz constant, in practice it requires determining all possible ReLU matrices $\mathbf{R}_{i}$ corresponding to all vectors $\mathbf{y}\in\mathcal{Y}$. Since $\mathbb{R}^{m}$ has $2^{m}$ orthants, this method would most likely be intractable except for very small $m$ (due to the large number of matrices we would need to compare). As we do not know of a way that avoids computing a large number of spectral norms, this motivates us to look for an even more conservative bound that is more easily computable.
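For intuition only, the following sketch (our own code) approximates $\mathbf{R}_{max}$ for a tiny layer by sampling the domain and collecting the distinct ReLU patterns that are reached; the number of candidate patterns is what makes this enumeration intractable as $m$ grows, and sampling may of course miss orthants:

```python
import numpy as np

def approx_R_max(A, b, eps, n_samples=20000, seed=0):
    """Sample x in the eps-ball, collect distinct ReLU patterns of Ax + b,
    and return the pattern whose R maximizes ||R A|| (approximation only)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    patterns = set()
    for _ in range(n_samples):
        x = rng.normal(size=n)
        x *= eps * rng.random() ** (1.0 / n) / np.linalg.norm(x)  # point in eps-ball
        patterns.add(tuple(((A @ x + b) >= 0).astype(int)))
    norms = {p: np.linalg.norm(np.diag(p) @ A, 2) for p in patterns}
    best = max(norms, key=norms.get)
    return np.diag(np.array(best, dtype=float)), norms[best]
```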

4 Upper bounding regions

Figure 3: Left: Diagram of the range $\mathcal{Y}$ of the affine function (in this case, the transformation of the domain $\mathcal{X}$ when it is the Euclidean ball), its bounding region $\mathcal{H}$ with lengths $l_{i}$, and the upper bounding vertex $\overline{\mathbf{y}}$. Right: Diagram of a bounding region $\mathcal{H}$ and its scaled bounding region $\beta_{1}\mathcal{H}$. Note that there will also be a larger bounding region $\beta_{2}\mathcal{H}$ aligned with the vertical axis, but it is not shown.

4.1 Bounding regions

Our approach in determining a more easily computable bound is based on the idea that we can find a ReLU matrix $\overline{\mathbf{R}}$ such that $\lVert\overline{\mathbf{R}}\mathbf{A}\rVert\geq\lVert\mathbf{R}_{\mathbf{y}}\mathbf{A}\rVert$ for all $\mathbf{y}\in\mathcal{Y}$. To find this matrix, we will create a coordinate-axis-aligned bounding region around the set $\mathcal{Y}$ (see the left side of Fig. 3 for a diagram). We define the upper bounding vertex of this region as $\overline{\mathbf{y}}$, its associated ReLU matrix as $\overline{\mathbf{R}}$, and the upper bounding region as $\mathcal{H}$:

$\overline{\mathbf{y}}\coloneqq\{\mathbf{y}~|~\mathbf{y}\geq\mathbf{A}\mathbf{x}+\mathbf{b},~\forall\mathbf{x}\in\mathcal{X}\},\qquad\overline{\mathbf{R}}\coloneqq\mathbf{R}_{\overline{\mathbf{y}}},\qquad\mathcal{H}\coloneqq\{\mathbf{y}~|~\mathbf{y}\leq\overline{\mathbf{y}}\}.$ (9)

We define $\boldsymbol{l}\coloneqq\overline{\mathbf{y}}-\mathbf{b}$, which represents the distance in each coordinate direction from $\mathbf{b}$ to the border of the bounding region (see Fig. 3). This region will function as a conservative estimate around $\mathcal{Y}$ with regard to the ReLU operation. For information about how $\overline{\mathbf{y}}$ and $\boldsymbol{l}$ can be computed for various domains $\mathcal{X}$, see Appendix A.2.

Recalling the definition of $\mathbf{R}$ in (1), we note that $\mathbf{R}$ is a diagonal matrix of 0's and 1's. Since the matrix $\overline{\mathbf{R}}$ is computed with respect to $\overline{\mathbf{y}}$, it is a reflection of the "most positive" orthant in $\mathcal{H}$, and will have a 1 anywhere any matrix $\mathbf{R}_{\mathbf{y}}$ for $\mathbf{y}\in\mathcal{H}$ has a 1. The matrix $\overline{\mathbf{R}}$ can also be interpreted as the logical disjunction of all matrices $\mathbf{R}_{\mathbf{y}}$ for $\mathbf{y}\in\mathcal{H}$.
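As a concrete illustration (our own helper names), for the Euclidean-ball domain the lengths satisfy $l_{i}=\epsilon\lVert\mathbf{a}_{i}\rVert$ by the Cauchy–Schwarz inequality (cf. Appendix A.2), which gives a direct way to construct $\overline{\mathbf{y}}$ and $\overline{\mathbf{R}}$:

```python
import numpy as np

def bounding_region(A, b, eps):
    """Upper vertex y_bar, lengths l, and R_bar for the domain X = {x : ||x|| <= eps}."""
    l = eps * np.linalg.norm(A, axis=1)   # l_i = max_{||x||<=eps} a_i^T x  (Cauchy-Schwarz)
    y_bar = b + l
    R_bar = np.diag((y_bar >= 0).astype(float))
    return y_bar, l, R_bar
```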

4.2 Nested bounding regions

We have described the concept of an upper bounding region $\mathcal{H}$, which will lead us to develop an upper bound on the local Lipschitz constant. However, we will be able to develop an even tighter bound by noting that within a bounding region there will be some number of smaller, "nested" bounding regions, each with its own matrix $\overline{\mathbf{R}}$ (see the right side of Fig. 3). We then note that the vector $\mathbf{y}$ can be described using a piecewise representation, in which pieces of $\mathbf{y}$ closer to $\mathbf{b}$ are contained in smaller bounding regions. We consider some number $q$ of these bounding regions, which we index by $i=1,\ldots,q$. We define these bounding regions using scalars $0\leq\beta_{i}\leq 1$ where $\beta_{i}>\beta_{i-1}$. For a given $\mathbf{x}\in\mathcal{X}$, we define the affine transformation of $\beta_{i}\mathbf{x}$ as $\mathbf{y}_{i}\coloneqq\beta_{i}\mathbf{A}\mathbf{x}+\mathbf{b}$. Furthermore, we define $\beta_{0}=0$ and $\beta_{q}=1$ so that $\mathbf{y}_{0}=\mathbf{b}$ and $\mathbf{y}_{q}=\mathbf{y}$. We define the scaled bounding region to be the region that bounds $\mathbf{y}_{i}$ for all $\mathbf{x}\in\mathcal{X}$. As in (9), we define the scaled bounding region as $\beta_{i}\mathcal{H}$, the bounding vertex as $\overline{\mathbf{y}}_{i}$, and its corresponding ReLU matrix as $\overline{\mathbf{R}}_{i}$:

$\overline{\mathbf{y}}_{i}\coloneqq\{\mathbf{y}~|~\mathbf{y}\geq\beta_{i}\mathbf{A}\mathbf{x}+\mathbf{b},~\forall\mathbf{x}\in\mathcal{X}\},\qquad\overline{\mathbf{R}}_{i}\coloneqq\mathbf{R}_{\overline{\mathbf{y}}_{i}},\qquad\beta_{i}\mathcal{H}\coloneqq\{\mathbf{y}~|~\mathbf{y}\leq\overline{\mathbf{y}}_{i}\}.$ (10)

Note that the distance from $\mathbf{b}$ to $\overline{\mathbf{y}}$ is $\boldsymbol{l}$ for $\mathcal{H}$, and the distance from $\mathbf{b}$ to $\overline{\mathbf{y}}_{i}$ is $\beta_{i}\boldsymbol{l}$ for $\beta_{i}\mathcal{H}$. It is most sensible to define the scalars $\beta_{i}$ to occur at the points at which the scaled region enters positive space for each coordinate, which are the locations at which $\overline{\mathbf{R}}_{i}$ changes. These values can be found by determining when $\mathbf{b}+\beta_{i}\boldsymbol{l}$ equals zero in each coordinate. Also, we define the difference in $\beta_{i}$ values as $\Delta\beta_{i}\coloneqq\beta_{i}-\beta_{i-1}$ for $i=1,\ldots,q$. Lastly, we state the following lemma, which we will use later to create our bound.
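The breakpoints $\beta_{i}$ can be read off directly from $\mathbf{b}$ and $\boldsymbol{l}$: the $i^{th}$ coordinate of $\mathbf{b}+\beta\boldsymbol{l}$ crosses zero at $\beta=-b_{i}/l_{i}$, so the distinct such values in $(0,1)$, together with $\beta_{q}=1$, define the nested regions. A sketch under our own naming:

```python
import numpy as np

def nested_betas(b, l):
    """Scalars beta_i at which a coordinate of b + beta*l enters positive space."""
    crossings = {-bi / li for bi, li in zip(b, l) if li > 0 and 0 < -bi / li < 1}
    return sorted(crossings) + [1.0]

def R_bar_nested(b, l, beta):
    """Bounding ReLU matrix of the scaled region beta*H (upper vertex b + beta*l)."""
    return np.diag(((b + beta * l) >= 0).astype(float))
```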

Lemma 2.

Consider a bounding region $\mathcal{H}$ and its bounding ReLU matrix $\overline{\mathbf{R}}$. For any two points $\mathbf{y}_{a},\mathbf{y}_{b}\in\mathcal{H}$, the following inequality holds: $\lVert\overline{\mathbf{R}}(\mathbf{y}_{b}-\mathbf{y}_{a})\rVert\geq\lVert\mathbf{R}_{\mathbf{y}_{b}}\mathbf{y}_{b}-\mathbf{R}_{\mathbf{y}_{a}}\mathbf{y}_{a}\rVert$.

The proof is shown in Appendix A.1.

5 Upper bounds

5.1 Looser and tighter upper bounds

We are now ready to present the main mathematical results of the paper.

Theorem 1.

Consider the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$, its domain $\mathcal{X}$, and its bounding ReLU matrix $\overline{\mathbf{R}}$ from (9). The spectral norm of $\overline{\mathbf{R}}\mathbf{A}$ is an upper bound on the affine-ReLU function's local Lipschitz constant: $L(\mathbf{x}_{0},\mathcal{X})\leq\lVert\overline{\mathbf{R}}\mathbf{A}\rVert$.

Proof.

From Lemma 1 we have $L(\mathbf{x}_{0},\mathcal{X})\leq\max_{\mathbf{x}\in\mathcal{X}}\sum_{i=1}^{p}\Delta\alpha_{i}\lVert\mathbf{R}_{i}\mathbf{A}\rVert$. We note that the matrices $\mathbf{R}_{i}$ correspond to vectors which are inside the bounding region $\mathcal{H}$. Recall that the ReLU matrices $\mathbf{R}$ are diagonal matrices with 0's and 1's on the diagonal, and that $\overline{\mathbf{R}}$ will have a 1 anywhere any matrix $\mathbf{R}_{\mathbf{y}}$ for $\mathbf{y}\in\mathcal{H}$ has a 1. Therefore the non-zero elements of $\mathbf{R}_{i}\mathbf{w}$ will be a subset of the non-zero elements of $\overline{\mathbf{R}}\mathbf{w}$ for all $i$ and all $\mathbf{w}\in\mathbb{R}^{m}$, which implies $\lVert\overline{\mathbf{R}}\mathbf{A}\rVert\geq\lVert\mathbf{R}_{i}\mathbf{A}\rVert$ for all $i$ and all $\mathbf{x}\in\mathcal{X}$. Since $\sum_{i=1}^{p}\Delta\alpha_{i}=1$, we have $L(\mathbf{x}_{0},\mathcal{X})\leq\lVert\overline{\mathbf{R}}\mathbf{A}\rVert$. ∎

Computing this bound for a given domain 𝒳\mathcal{X} will be quick if we can quickly compute 𝐑¯\overline{\boldsymbol{\mathbf{R}}} and 𝐑¯𝐀\lVert\overline{\boldsymbol{\mathbf{R}}}\boldsymbol{\mathbf{A}}\rVert. However, we can create an even tighter bound by using the idea of nested regions in Section 4.2.

Theorem 2.

Consider the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$ and its domain $\mathcal{X}$. Consider the nested bounding regions $\beta_{i}\mathcal{H}$, their scale factors $\Delta\beta_{i}$, and their bounding ReLU matrices $\overline{\mathbf{R}}_{i}$ as described in Section 4.2. The following is an upper bound on the affine-ReLU function's local Lipschitz constant: $L(\mathbf{x}_{0},\mathcal{X})\leq\sum_{i=1}^{q}\Delta\beta_{i}\lVert\overline{\mathbf{R}}_{i}\mathbf{A}\rVert$.

The proof is shown in Appendix A.1. This is the last bound we will derive. To summarize the bounds from Proposition 1 and Theorems 1 & 2, we have

$\lVert\mathbf{A}\rVert\geq\lVert\overline{\mathbf{R}}\mathbf{A}\rVert\geq\sum_{i=1}^{q}\Delta\beta_{i}\lVert\overline{\mathbf{R}}_{i}\mathbf{A}\rVert\geq L(\mathbf{x}_{0},\mathcal{X}).$ (11)

Note that the bound in Proposition 2 may be less than or greater than the bound in Theorem 2 so we cannot include it in the inequality above.
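For a small dense $\mathbf{A}$, where spectral norms can be taken directly, the three bounds in (11) can be evaluated as follows (our own code, assuming the Euclidean-ball domain):

```python
import numpy as np

def lipschitz_bounds(A, b, eps):
    spec = lambda M: np.linalg.norm(M, 2)          # spectral norm
    l = eps * np.linalg.norm(A, axis=1)            # Euclidean-ball domain
    naive = spec(A)                                # Proposition 1
    R_bar = np.diag(((b + l) >= 0).astype(float))
    loose = spec(R_bar @ A)                        # Theorem 1
    betas = sorted({-bi / li for bi, li in zip(b, l)
                    if li > 0 and 0 < -bi / li < 1}) + [1.0]
    tight, beta_prev = 0.0, 0.0
    for beta in betas:                             # Theorem 2
        R_i = np.diag(((b + beta * l) >= 0).astype(float))
        tight += (beta - beta_prev) * spec(R_i @ A)
        beta_prev = beta
    return naive, loose, tight                     # naive >= loose >= tight
```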

5.2 Bounds for Multiple Layers

So far, our analysis has applied to a single affine-ReLU function, which represents one layer (e.g. convolution-ReLU) of a network. We now describe how these bounds can be combined for multiple layers. First, assume that we have a bound $\epsilon$ on the size of the perturbation $\mathbf{x}$, i.e. $\lVert\mathbf{x}\rVert\leq\epsilon$ for all $\mathbf{x}\in\mathcal{X}$. We can rearrange the local Lipschitz constant equation in (5) by moving the denominator to the left-hand side and applying the $\epsilon$ bound as follows

$\epsilon L(\mathbf{x}_{0},\mathcal{X})\geq\lVert\mathbf{f}(\mathbf{x})-\mathbf{f}(\mathbf{0})\rVert,\quad\forall\mathbf{x}\in\mathcal{X}.$ (12)

Recall that $\mathbf{f}(\mathbf{0})$ represents the nominal input of the next layer, so $\mathbf{f}(\mathbf{x})-\mathbf{f}(\mathbf{0})$ represents the perturbation with respect to the next layer. Defining this perturbation as $\mathbf{z}\coloneqq\mathbf{f}(\mathbf{x})-\mathbf{f}(\mathbf{0})$, we have

$\epsilon L(\mathbf{x}_{0},\mathcal{X})\geq\lVert\mathbf{z}\rVert.$ (13)

This gives us a bound on perturbations of the nominal input of the next layer. We can therefore compute the local Lipschitz bounds in an iterative fashion, by propagating the perturbation bounds through each layer of the network. More specifically, if we start with $\epsilon$, we can compute the Lipschitz constant of the current layer and then determine the bound for the next layer. We can continue this process for subsequent layers. Using these perturbation bounds, we will consider the domains for each layer of the network to be of the form $\mathcal{X}=\{\mathbf{x}~|~\lVert\mathbf{x}\rVert\leq\epsilon\}$ or $\mathcal{X}=\{\mathbf{x}~|~\lVert\mathbf{x}\rVert\leq\epsilon,~\mathbf{x}\geq\mathbf{0}\}$ when the layer is preceded by a ReLU. Also note that for other types of layers such as max pooling, if we can't compute a local bound, we can use the global bound, which is what we will do in our simulations.
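A sketch of this layer-by-layer propagation (our own loop structure; local_lipschitz_bound stands in for the per-layer bound of Theorem 2, with a global bound substituted for layers such as max pooling where no local bound is available):

```python
def network_lipschitz_bound(layers, eps0, local_lipschitz_bound):
    """Multiply per-layer local Lipschitz bounds, propagating the perturbation
    bound to the next layer via eps_{k+1} = eps_k * L_k, as in (13)."""
    eps, total = eps0, 1.0
    for layer in layers:
        L_k = local_lipschitz_bound(layer, eps)   # bound for this layer's domain
        total *= L_k                              # running product over layers
        eps *= L_k                                # bound on next layer's perturbation
    return total   # upper bound on the full network's local Lipschitz constant
```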

6 Simulation

6.1 Spectral norm computation

Our results rely on computing the spectral norm $\lVert\mathbf{R}\mathbf{A}\rVert$ for various ReLU matrices $\mathbf{R}$. The matrix $\mathbf{A}$ will correspond to either a convolution or fully-connected function. For larger convolution functions, the $\mathbf{A}$ matrices are often too large to define explicitly. The only way we can compute $\lVert\mathbf{R}\mathbf{A}\rVert$ for larger layers is by using a power iteration method.

To compute the spectral norm of a matrix $\mathbf{M}$, we can note that the largest singular value of $\mathbf{M}$ is the square root of the largest eigenvalue of $\mathbf{M}^{T}\mathbf{M}$. So, we can find the spectral norm of $\mathbf{M}$ by applying a power iteration to the operator $\mathbf{M}^{T}\mathbf{M}$. In our case, $\mathbf{M}=\mathbf{R}\mathbf{A}$ and $\mathbf{M}^{T}\mathbf{M}=\mathbf{A}^{T}\mathbf{R}^{T}\mathbf{R}\mathbf{A}=\mathbf{A}^{T}\mathbf{R}\mathbf{A}$ (for various $\mathbf{R}$ matrices). We can compute the operations corresponding to the $\mathbf{A}$, $\mathbf{A}^{T}$, and $\mathbf{R}$ matrices in code using convolution to apply $\mathbf{A}$, transposed convolution to apply $\mathbf{A}^{T}$, and zeroing of appropriate elements to apply $\mathbf{R}$. In all of our simulations, we used 100 iterations, which we verified to be accurate for smaller systems for which an SVD can be computed for comparison.
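A minimal matrix-free version of this power iteration (our own implementation sketch; for convolutional layers the maps $\mathbf{v}\mapsto\mathbf{A}\mathbf{v}$ and $\mathbf{u}\mapsto\mathbf{A}^{T}\mathbf{u}$ could be realized with convolution and transposed convolution, e.g. torch.nn.functional.conv2d and conv_transpose2d):

```python
import numpy as np

def spectral_norm_RA(apply_A, apply_AT, relu_mask, n, iters=100, seed=0):
    """Estimate ||R A|| by power iteration on A^T R A, where R = diag(relu_mask);
    apply_A / apply_AT are matrix-free maps (e.g. convolution / transposed convolution)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = relu_mask * apply_A(v)     # R A v  (R zeros out the inactive outputs)
        v = apply_AT(u)                # A^T R A v
        v /= np.linalg.norm(v)
    return np.linalg.norm(relu_mask * apply_A(v))   # ~= largest singular value of R A
```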

6.2 Simulations

In our simulations, we compared three different networks: a 7-layer network trained on MNIST, an 8-layer network trained on CIFAR-10, and AlexNet [7] (11 layers, trained on ImageNet). See Appendix A.3 for the exact architectures we used. We trained the MNIST and CIFAR-10 networks ourselves, while we used the trained version of AlexNet from PyTorch's torchvision package. For all of our simulations, we used nominal input images which achieved good classification. However, we noticed that we obtained similar trends using random images. We compared the upper bounds of Proposition 1, Theorem 1, and Theorem 2, as well as a naive lower bound based on randomly sampling 10,000 perturbation vectors from an $\epsilon$-sized sphere. Figure 4 shows the results for different layers of the MNIST network. Figure 5 shows the full-network local Lipschitz constants using the method discussed in Section 5.2. Table 1 shows the computation times.

Figure 4: Upper bounds (UB) and lower bounds (LB) on the local Lipschitz constants for the 4 affine-ReLU layers (convolution, convolution, fully-connected, fully-connected) of the MNIST net. Note that these results are computed with respect to a particular nominal input image.
Figure 5: Upper bounds (UB) and lower bounds (LB) on the local Lipschitz constants of the MNIST, CIFAR-10, and AlexNet networks for various perturbation sizes. Note that these results are computed with respect to particular nominal input images.
Table 1: Computation times for the local Lipschitz bounds. Computations were performed on a desktop computer with an Nvidia GTX 1080 Ti card.
network:            MNIST net    CIFAR-10 net    AlexNet
computation time:   2 sec        58 sec          72 min

The results show that our Lipschitz bounds increase with the size of the perturbation $\epsilon$, and approach the spectral norm of $\mathbf{A}$ for large $\epsilon$. For small perturbations, the bound is significantly lower than the naive bound.

7 Conclusion

We have presented the idea of computing upper bounds on the local Lipschitz constant of an affine-ReLU function, which represents one layer of a neural network. We described how these bounds can be combined to determine the Lipschitz constant of a full network, and also how they can be computed in an efficient way, even for large networks. The results show that our bounds are tighter than the naive bounds for a full network, especially for small perturbations.

We believe that the most important direction of future work regarding our method is to more effectively apply it to multiple layers. While we can combine our layer-specific bounds as we have in Fig. 5, it almost certainly leads to an overly conservative bound, especially for deeper networks. We also suspect that our bounds become more conservative for larger perturbations. However, calculating tight Lipschitz bounds for large neural networks is still an open and challenging problem, and we believe our results provide a useful step forward.

8 Broader Impact

We classify this work as basic mathematical analysis that applies to functions commonly used in neural networks. We believe this work could benefit those who are interested in developing more robust algorithms for safety-critical or other applications. It does not seem to us that this research puts anyone at a disadvantage. Additionally, since our bounds are provable, our method should not fail unless it is implemented incorrectly. Finally, we do not believe our method leverages any biases in data.

Acknowledgments and Disclosure of Funding

This work was supported by ONR grant N000141712623.

References

  • [1] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6240–6249. Curran Associates, Inc., 2017.
  • [2] Sören Dittmer, Emily J. King, and Peter Maass. Singular values for ReLU layers. CoRR, abs/1812.02566, 2018.
  • [3] Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George Pappas. Efficient and accurate estimation of Lipschitz constants for deep neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 11427–11438. Curran Associates, Inc., 2019.
  • [4] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2014.
  • [5] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael Cree. Regularisation of neural networks by enforcing Lipschitz continuity, 2018.
  • [6] Matt Jordan and Alexandros G. Dimakis. Exactly computing the local Lipschitz constant of ReLU networks, 2020.
  • [7] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.
  • [8] Fabian Latorre, Paul Rolland, and Volkan Cevher. Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations, 2020.
  • [9] Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018.
  • [10] Jonathan Peck, Joris Roels, Bart Goossens, and Yvan Saeys. Lower bounds on the robustness to adversarial perturbations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 804–813. Curran Associates, Inc., 2017.
  • [11] Kevin Scaman and Aladin Virmaux. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3835–3844. Curran Associates, Inc., 2018.
  • [12] J. Sokolić, R. Giryes, G. Sapiro, and M. R. D. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, Aug 2017.
  • [13] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks, 2013.
  • [14] Dávid Terjék. Adversarial Lipschitz regularization. In International Conference on Learning Representations, 2020.
  • [15] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6541–6550. Curran Associates, Inc., 2018.
  • [16] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. In International Conference on Learning Representations, 2018.
  • [17] D. Zou, R. Balan, and M. Singh. On Lipschitz bounds of general convolutional neural networks. IEEE Transactions on Information Theory, 66(3):1738–1759, 2020.

Appendix A Appendix

A.1 Proofs

Lemma 1.

Consider the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$, its domain $\mathcal{X}$, and the piecewise representation of the affine-ReLU function in (4). We have the following upper bound on the affine-ReLU function's local Lipschitz constant: $L(\mathbf{x}_{0},\mathcal{X})\leq\max_{\mathbf{x}\in\mathcal{X}}\sum_{i=1}^{p}\Delta\alpha_{i}\lVert\mathbf{R}_{i}\mathbf{A}\rVert$.

Proof.

We can start with (7) and plug in (4):

$L(\mathbf{x}_{0},\mathcal{X}) = \max_{\mathbf{x}\in\mathcal{X}}\frac{\lVert\mathbf{relu}(\mathbf{A}\mathbf{x}+\mathbf{b})-\mathbf{relu}(\mathbf{b})\rVert}{\lVert\mathbf{x}\rVert}$
$\qquad = \max_{\mathbf{x}\in\mathcal{X}}\frac{\lVert(\mathbf{R}_{\mathbf{b}}\mathbf{b}+\sum_{i=1}^{p}\Delta\alpha_{i}\mathbf{R}_{i}\mathbf{A}\mathbf{x})-\mathbf{R}_{\mathbf{b}}\mathbf{b}\rVert}{\lVert\mathbf{x}\rVert}$
$\qquad = \max_{\mathbf{x}\in\mathcal{X}}\frac{\lVert\sum_{i=1}^{p}\Delta\alpha_{i}\mathbf{R}_{i}\mathbf{A}\mathbf{x}\rVert}{\lVert\mathbf{x}\rVert}$
$\qquad \leq \max_{\mathbf{x}\in\mathcal{X}}\frac{\sum_{i=1}^{p}\Delta\alpha_{i}\lVert\mathbf{R}_{i}\mathbf{A}\mathbf{x}\rVert}{\lVert\mathbf{x}\rVert}$
$\qquad \leq \max_{\mathbf{x}\in\mathcal{X}}\frac{\sum_{i=1}^{p}\Delta\alpha_{i}\lVert\mathbf{R}_{i}\mathbf{A}\rVert\lVert\mathbf{x}\rVert}{\lVert\mathbf{x}\rVert}$
$\qquad = \max_{\mathbf{x}\in\mathcal{X}}\sum_{i=1}^{p}\Delta\alpha_{i}\lVert\mathbf{R}_{i}\mathbf{A}\rVert.$ ∎

Proposition 1.

Consider the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$ and its domain $\mathcal{X}$. The spectral norm of $\mathbf{A}$ is an upper bound on the affine-ReLU function's local Lipschitz constant: $L(\mathbf{x}_{0},\mathcal{X})\leq\lVert\mathbf{A}\rVert$.

Proof.

Consider the inequality from Lemma 1: $L(\mathbf{x}_{0},\mathcal{X})\leq\max_{\mathbf{x}\in\mathcal{X}}\sum_{i=1}^{p}\Delta\alpha_{i}\lVert\mathbf{R}_{i}\mathbf{A}\rVert$. Note that the ReLU matrix $\mathbf{R}_{i}$ is a diagonal matrix of 0's and 1's, so for any $\mathbf{v}\in\mathbb{R}^{n}$, the non-zero elements of $\mathbf{R}_{i}\mathbf{A}\mathbf{v}$ will be a subset of the non-zero elements of $\mathbf{A}\mathbf{v}$. Therefore, $\lVert\mathbf{A}\rVert\geq\lVert\mathbf{R}_{i}\mathbf{A}\rVert$ for all $\mathbf{R}_{i}$ and all $\mathbf{x}\in\mathcal{X}$, and we can rearrange the inequality from Lemma 1 as follows:

$L(\mathbf{x}_{0},\mathcal{X}) \leq \max_{\mathbf{x}\in\mathcal{X}}\sum_{i=1}^{p}\Delta\alpha_{i}\lVert\mathbf{R}_{i}\mathbf{A}\rVert$
$\qquad \leq \max_{\mathbf{x}\in\mathcal{X}}\sum_{i=1}^{p}\Delta\alpha_{i}\lVert\mathbf{A}\rVert$
$\qquad = \lVert\mathbf{A}\rVert,$

where in the last step we have used the fact that $\sum_{i=1}^{p}\Delta\alpha_{i}=1$. ∎

Lemma 2.

Consider a bounding region $\mathcal{H}$ and its bounding ReLU matrix $\overline{\mathbf{R}}$. For any two points $\mathbf{y}_{a},\mathbf{y}_{b}\in\mathcal{H}$, the following inequality holds: $\lVert\overline{\mathbf{R}}(\mathbf{y}_{b}-\mathbf{y}_{a})\rVert\geq\lVert\mathbf{R}_{\mathbf{y}_{b}}\mathbf{y}_{b}-\mathbf{R}_{\mathbf{y}_{a}}\mathbf{y}_{a}\rVert$.

Proof.

Since the $\mathbf{R}$ matrices are diagonal, we can consider this problem on an element-by-element basis, and show that each element of $\overline{\mathbf{R}}(\mathbf{y}_{b}-\mathbf{y}_{a})$ is at least as great in magnitude as its counterpart in $\mathbf{R}_{\mathbf{y}_{b}}\mathbf{y}_{b}-\mathbf{R}_{\mathbf{y}_{a}}\mathbf{y}_{a}$. For a given $i$, let $y_{a}$ and $y_{b}$ denote the $i^{th}$ entries of $\mathbf{y}_{a}$ and $\mathbf{y}_{b}$, respectively, and let $R_{a}$, $R_{b}$, and $\overline{R}$ denote the $(i,i)^{th}$ entries of $\mathbf{R}_{\mathbf{y}_{a}}$, $\mathbf{R}_{\mathbf{y}_{b}}$, and $\overline{\mathbf{R}}$, respectively. We can write the $i^{th}$ element of $\overline{\mathbf{R}}(\mathbf{y}_{b}-\mathbf{y}_{a})$ as $\overline{R}(y_{b}-y_{a})$ and the corresponding element of $\mathbf{R}_{\mathbf{y}_{b}}\mathbf{y}_{b}-\mathbf{R}_{\mathbf{y}_{a}}\mathbf{y}_{a}$ as $R_{b}y_{b}-R_{a}y_{a}$.

Since $R_{a}=1$ or $R_{b}=1$ implies $\overline{R}=1$, there are five possible cases we have to consider, which are shown in the table below.

         $R_{a}$   $R_{b}$   $\overline{R}$   $\lvert\overline{R}(y_{b}-y_{a})\rvert$   $\lvert R_{b}y_{b}-R_{a}y_{a}\rvert$
case 1   0         0         0                $0$                                       $0$
case 2   0         0         1                $\lvert y_{b}-y_{a}\rvert$                $0$
case 3   1         0         1                $\lvert y_{b}-y_{a}\rvert$                $\lvert y_{a}\rvert$
case 4   0         1         1                $\lvert y_{b}-y_{a}\rvert$                $\lvert y_{b}\rvert$
case 5   1         1         1                $\lvert y_{b}-y_{a}\rvert$                $\lvert y_{b}-y_{a}\rvert$

For cases 1, 2, and 5 it is clear that $\lvert\overline{R}(y_{b}-y_{a})\rvert\geq\lvert R_{b}y_{b}-R_{a}y_{a}\rvert$. For case 3, we note that if $R_{a}=1$ and $R_{b}=0$, then $y_{a}\geq 0$ and $y_{b}<0$, which means that $y_{b}-y_{a}$ is a non-positive number that has magnitude equal to or greater than the magnitude of $y_{a}$. This implies that $\lvert\overline{R}(y_{b}-y_{a})\rvert\geq\lvert R_{b}y_{b}-R_{a}y_{a}\rvert$ for case 3. Similar logic can be applied to case 4, except in this case $y_{b}-y_{a}$ is a non-negative number that has magnitude equal to or greater than the magnitude of $y_{b}$. This implies that $\lvert\overline{R}(y_{b}-y_{a})\rvert\geq\lvert R_{b}y_{b}-R_{a}y_{a}\rvert$ for case 4.

We showed that for all $i$, each element of $\overline{\mathbf{R}}(\mathbf{y}_{b}-\mathbf{y}_{a})$ is at least as great in magnitude as the corresponding element of $\mathbf{R}_{\mathbf{y}_{b}}\mathbf{y}_{b}-\mathbf{R}_{\mathbf{y}_{a}}\mathbf{y}_{a}$, i.e. $\lvert\overline{R}(y_{b}-y_{a})\rvert\geq\lvert R_{b}y_{b}-R_{a}y_{a}\rvert$. This implies $\lVert\overline{\mathbf{R}}(\mathbf{y}_{b}-\mathbf{y}_{a})\rVert\geq\lVert\mathbf{R}_{\mathbf{y}_{b}}\mathbf{y}_{b}-\mathbf{R}_{\mathbf{y}_{a}}\mathbf{y}_{a}\rVert$. ∎

Theorem 2.

Consider the affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$ and its domain $\mathcal{X}$. Consider the nested bounding regions $\beta_{i}\mathcal{H}$, their scale factors $\Delta\beta_{i}$, and their bounding ReLU matrices $\overline{\mathbf{R}}_{i}$ as described in Section 4.2. The following is an upper bound on the affine-ReLU function's local Lipschitz constant: $L(\mathbf{x}_{0},\mathcal{X})\leq\sum_{i=1}^{q}\Delta\beta_{i}\lVert\overline{\mathbf{R}}_{i}\mathbf{A}\rVert$.

Proof.

First, define the affine transformation of $\beta_{i}\mathbf{x}$ to be $\mathbf{y}_{i}=\beta_{i}\mathbf{A}\mathbf{x}+\mathbf{b}$. We can write the function $\mathbf{relu}(\mathbf{A}\mathbf{x}+\mathbf{b})$ as the sum of the differences of the function taken across each segment. For the segment from $i{-}1$ to $i$, the difference in the affine-ReLU function is $\mathbf{R}_{\mathbf{y}_{i}}\mathbf{y}_{i}-\mathbf{R}_{\mathbf{y}_{i-1}}\mathbf{y}_{i-1}$. So we can write the total function as

$\mathbf{relu}(\mathbf{A}\mathbf{x}+\mathbf{b}) = (\mathbf{R}_{\mathbf{b}}\mathbf{b}-\mathbf{0})+\sum_{i=1}^{q}\left(\mathbf{R}_{\mathbf{y}_{i}}\mathbf{y}_{i}-\mathbf{R}_{\mathbf{y}_{i-1}}\mathbf{y}_{i-1}\right)$
$\qquad = \mathbf{R}_{\mathbf{b}}\mathbf{b}+\sum_{i=1}^{q}\left(\mathbf{R}_{\mathbf{y}_{i}}\mathbf{y}_{i}-\mathbf{R}_{\mathbf{y}_{i-1}}\mathbf{y}_{i-1}\right).$

Plugging the equation above into (7), we can write the local Lipschitz constant as

$L(\mathbf{x}_{0},\mathcal{X}) = \max_{\mathbf{x}\in\mathcal{X}}\frac{\lVert\mathbf{relu}(\mathbf{A}\mathbf{x}+\mathbf{b})-\mathbf{relu}(\mathbf{b})\rVert}{\lVert\mathbf{x}\rVert}$
$\qquad = \max_{\mathbf{x}\in\mathcal{X}}\frac{\lVert\mathbf{R}_{\mathbf{b}}\mathbf{b}+\sum_{i=1}^{q}(\mathbf{R}_{\mathbf{y}_{i}}\mathbf{y}_{i}-\mathbf{R}_{\mathbf{y}_{i-1}}\mathbf{y}_{i-1})-\mathbf{R}_{\mathbf{b}}\mathbf{b}\rVert}{\lVert\mathbf{x}\rVert}$
$\qquad = \max_{\mathbf{x}\in\mathcal{X}}\frac{\lVert\sum_{i=1}^{q}(\mathbf{R}_{\mathbf{y}_{i}}\mathbf{y}_{i}-\mathbf{R}_{\mathbf{y}_{i-1}}\mathbf{y}_{i-1})\rVert}{\lVert\mathbf{x}\rVert}$
$\qquad \leq \max_{\mathbf{x}\in\mathcal{X}}\frac{\sum_{i=1}^{q}\lVert\mathbf{R}_{\mathbf{y}_{i}}\mathbf{y}_{i}-\mathbf{R}_{\mathbf{y}_{i-1}}\mathbf{y}_{i-1}\rVert}{\lVert\mathbf{x}\rVert}.$

Recalling that $\mathbf{y}_{i-1}=\beta_{i-1}\mathbf{A}\mathbf{x}+\mathbf{b}$ and $\mathbf{y}_{i}=\beta_{i}\mathbf{A}\mathbf{x}+\mathbf{b}$, and that $\beta_{i-1}<\beta_{i}$, we know that $\mathbf{y}_{i-1},\mathbf{y}_{i}\in\beta_{i}\mathcal{H}$. So, using Lemma 2 we can rearrange the equation above as follows:

\begin{align*}
L(\mathbf{x}_0, \mathcal{X}) &\leq \max_{\mathbf{x}\in\mathcal{X}} \frac{\sum_{i=1}^{q} \lVert \overline{\mathbf{R}}_i(\mathbf{y}_i - \mathbf{y}_{i-1}) \rVert}{\lVert \mathbf{x} \rVert} \\
&= \max_{\mathbf{x}\in\mathcal{X}} \frac{\sum_{i=1}^{q} \lVert \overline{\mathbf{R}}_i\left(\beta_i\mathbf{A}\mathbf{x} + \mathbf{b} - (\beta_{i-1}\mathbf{A}\mathbf{x} + \mathbf{b})\right) \rVert}{\lVert \mathbf{x} \rVert} \\
&= \max_{\mathbf{x}\in\mathcal{X}} \frac{\sum_{i=1}^{q} \Delta\beta_i \lVert \overline{\mathbf{R}}_i\mathbf{A}\mathbf{x} \rVert}{\lVert \mathbf{x} \rVert} \\
&\leq \frac{\sum_{i=1}^{q} \Delta\beta_i \lVert \overline{\mathbf{R}}_i\mathbf{A} \rVert \lVert \mathbf{x} \rVert}{\lVert \mathbf{x} \rVert} \\
&= \sum_{i=1}^{q} \Delta\beta_i \lVert \overline{\mathbf{R}}_i\mathbf{A} \rVert.
\end{align*}
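As a quick numerical check of the telescoping decomposition used above, the following is a minimal NumPy sketch. It assumes that $\mathbf{R}_{\mathbf{y}}$ is the diagonal 0/1 matrix selecting the positive entries of $\mathbf{y}$ and that the breakpoints satisfy $0 = \beta_0 < \beta_1 < \dots < \beta_q = 1$; the variable names are illustrative only.

```python
# Minimal numerical check of the telescoping decomposition (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
m, n, q = 5, 4, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
beta = np.linspace(0.0, 1.0, q + 1)           # assumed: 0 = beta_0 < ... < beta_q = 1

R = lambda y: np.diag((y > 0).astype(float))  # assumed: R_y selects the positive entries of y
y = [bi * (A @ x) + b for bi in beta]         # y_0 = b, ..., y_q = Ax + b

lhs = np.maximum(A @ x + b, 0.0)              # relu(Ax + b)
rhs = R(b) @ b + sum(R(y[i]) @ y[i] - R(y[i - 1]) @ y[i - 1] for i in range(1, q + 1))
assert np.allclose(lhs, rhs)
```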

A.2 Bounding region determination for various domains

We have presented the idea of considering an affine function $\mathbf{A}\mathbf{x}+\mathbf{b}$ with domain $\mathcal{X}$ and range $\mathcal{Y}$. We are interested in determining the axis-aligned bounding region $\mathcal{H}$ around $\mathcal{Y}$. We will now show how to tightly compute this region for various domains. Recall from Section 4.1 that we denote the upper bounding vertex of the region as $\overline{\mathbf{y}}$, and define $\boldsymbol{l} \coloneqq \overline{\mathbf{y}} - \mathbf{b}$ (the distance from $\mathbf{b}$ to $\overline{\mathbf{y}}$). We let $\mathbf{a}_1^T, \dots, \mathbf{a}_m^T$ denote the rows of $\mathbf{A}$ (so each $\mathbf{a}_i \in \mathbb{R}^n$), and $a_{ij}$ denote the $j$th element of $\mathbf{a}_i$. Similarly, we let $\overline{y}_i$ and $l_i$ denote the $i$th elements of $\overline{\mathbf{y}}$ and $\boldsymbol{l}$, respectively. Denoting $\mathbf{e}_i \in \mathbb{R}^m$ as the $i$th standard basis vector, we can write this problem as

\begin{align*}
\overline{y}_i &= \max_{\mathbf{x}\in\mathcal{X}} \ \mathbf{e}_i^T(\mathbf{A}\mathbf{x}+\mathbf{b}) \\
&= \max_{\mathbf{x}\in\mathcal{X}} \ \mathbf{a}_i^T\mathbf{x} + b_i.
\end{align*}

Since the bias term in the maximization above is constant, we can drop it; the resulting maximum is then $l_i$ rather than $\overline{y}_i$. We also define $\mathbf{x}^{*,i} \in \mathbb{R}^n$ as the maximizing vector:

\begin{align*}
l_i &= \max_{\mathbf{x}\in\mathcal{X}} \ \mathbf{a}_i^T\mathbf{x} \\
\mathbf{x}^{*,i} &= \operatorname*{arg\,max}_{\mathbf{x}\in\mathcal{X}} \ \mathbf{a}_i^T\mathbf{x}.
\end{align*}

Domain 1: $\mathcal{X} = \{\mathbf{x} \mid \lVert\mathbf{x}\rVert_1 \leq \epsilon\}$

Let $x_j$ denote the $j$th element of $\mathbf{x}$. In this case we have $\sum_j \lvert x_j \rvert \leq \epsilon$. The quantity $\mathbf{a}_i^T\mathbf{x}$ is maximized when all of the weight is placed on the element of $\mathbf{a}_i$ with the largest magnitude. In other words,

\begin{align*}
j^* &\coloneqq \operatorname*{arg\,max}_{j} \lvert a_{ij} \rvert \\
x_j^{*,i} &= \begin{cases} \epsilon \cdot \operatorname{sgn}(a_{ij^*}), & j = j^* \\ 0, & \text{otherwise} \end{cases} \\
l_i &= \epsilon \lvert a_{ij^*} \rvert.
\end{align*}
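This closed form is a one-line computation in practice; the minimal sketch below (function name is illustrative) computes the entire vector $\boldsymbol{l}$ at once.

```python
# l_i = eps * |a_{i j*}|, computed for all rows of A at once (illustrative sketch).
import numpy as np

def l_for_l1_ball(A, eps):
    return eps * np.abs(A).max(axis=1)
```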

Domain 2: $\mathcal{X} = \{\mathbf{x} \mid \lVert\mathbf{x}\rVert_2 \leq \epsilon\}$

In this case, we are maximizing over all vectors $\mathbf{x}$ with 2-norm at most $\epsilon$. The maximum therefore occurs when $\mathbf{x}$ points in the direction of $\mathbf{a}_i$ and has the largest possible magnitude (i.e., $\epsilon$). Intuitively, this can be thought of as maximizing the dot product with $\mathbf{a}_i$ over the sphere of radius $\epsilon$ in $\mathbb{R}^n$. We have

\begin{align*}
\mathbf{x}^{*,i} &= \epsilon \frac{\mathbf{a}_i}{\lVert\mathbf{a}_i\rVert_2} \\
l_i &= \mathbf{a}_i^T \left( \epsilon \frac{\mathbf{a}_i}{\lVert\mathbf{a}_i\rVert_2} \right) \\
&= \epsilon \lVert\mathbf{a}_i\rVert_2.
\end{align*}

Note that we have assumed $\mathbf{a}_i \neq \mathbf{0}$. If $\mathbf{a}_i = \mathbf{0}$, then every $\mathbf{x} \in \mathcal{X}$ yields the maximum value $l_i = 0$, and the final expression still holds.
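A corresponding sketch for the 2-norm case, which also returns $l_i = 0$ for zero rows:

```python
# l_i = eps * ||a_i||_2 for every row of A (illustrative sketch).
import numpy as np

def l_for_l2_ball(A, eps):
    return eps * np.linalg.norm(A, axis=1)
```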

Domain 3: $\mathcal{X} = \{\mathbf{x} \mid \lVert\mathbf{x}\rVert_\infty \leq \epsilon\}$

In this case, $\mathbf{x} \in [-\epsilon, \epsilon]^n$. So, the quantity $\mathbf{a}_i^T\mathbf{x}$ is maximized when $x_j = \epsilon$ for positive $a_{ij}$, and $x_j = -\epsilon$ for negative $a_{ij}$. So, we have

\begin{align*}
x_j^{*,i} &= \begin{cases} -\epsilon, & a_{ij} < 0 \\ \epsilon, & a_{ij} > 0 \\ 0, & a_{ij} = 0 \end{cases} \\
l_i &= \epsilon \sum_j \lvert a_{ij} \rvert.
\end{align*}

Note that when $a_{ij} = 0$, the value of $x_j^{*,i}$ does not matter as long as it is in the range $[-\epsilon, \epsilon]$. But we define it as zero so that when we consider non-negative domains below, we can simply replace the matrix $\mathbf{A}$ with its positive part $\mathbf{A}^+$.
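The corresponding sketch for the $\infty$-norm case, where $l_i$ is $\epsilon$ times the 1-norm of the $i$th row:

```python
# l_i = eps * sum_j |a_ij| = eps * ||a_i||_1 for every row of A (illustrative sketch).
import numpy as np

def l_for_linf_ball(A, eps):
    return eps * np.abs(A).sum(axis=1)
```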

Non-negative Domains: $\mathcal{X} = \{\mathbf{x} \mid \lVert\mathbf{x}\rVert_q \leq \epsilon,\ \mathbf{x} \geq \mathbf{0}\}$

In many cases, because the affine-ReLU function is preceded by a ReLU, the domain $\mathcal{X}$ will consist of vectors with non-negative entries. In these cases, the bounding region often becomes smaller (i.e., some or all elements of $\overline{\mathbf{y}}$ are smaller). For the 1-, 2-, or $\infty$-norms above, we can first decompose $\mathbf{A}$ into its positive and negative parts, $\mathbf{A} = \mathbf{A}^+ - \mathbf{A}^-$. Since assigning weight to a negative element of $\mathbf{a}_i$ can only decrease $\mathbf{a}_i^T\mathbf{x}$ when $\mathbf{x} \geq \mathbf{0}$, we can apply the same analysis as in Domains 1, 2, and 3, except with $\mathbf{A}$ replaced by $\mathbf{A}^+$.
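The three sketches above extend directly to the non-negative case by substituting the positive part $\mathbf{A}^+$; the helper below (names illustrative) combines them under this substitution.

```python
# Non-negative domains: apply the same formulas to A+ = max(A, 0) (illustrative sketch).
import numpy as np

def l_for_nonneg_domain(A, eps, q):
    A_plus = np.clip(A, 0.0, None)            # positive part A+
    if q == 1:
        return eps * A_plus.max(axis=1)
    if q == 2:
        return eps * np.linalg.norm(A_plus, axis=1)
    return eps * A_plus.sum(axis=1)           # q = infinity
```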

Efficient computation

Note that for large convolutional layers, it is too expensive to represent the entire matrix $\mathbf{A}$ explicitly. In these cases, we can obtain the $i$th row of $\mathbf{A}$ using the transposed convolution operator. More specifically, we create a transposed convolution function with no bias, based on the original convolution function. For a standard basis vector $\mathbf{e}_i \in \mathbb{R}^m$, the $i$th column of $\mathbf{A}^T$ (equivalently, the $i$th row of $\mathbf{A}$) is given by $\mathbf{A}^T\mathbf{e}_i$. Therefore, by passing the $i$th standard basis vector through the transposed convolution function, we obtain $\mathbf{a}_i$, the $i$th row of $\mathbf{A}$. Note that $\mathbf{e}_i$ must first be reshaped to the transposed convolution's input shape (i.e., the original convolution's output shape) before it is passed in. Furthermore, to reduce computation time in practice, instead of passing in each standard basis vector $\mathbf{e}_i$ one at a time, we pass in a batch of standard basis vectors to obtain multiple rows of $\mathbf{A}$ at once.
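The following PyTorch sketch illustrates this procedure for a stride-1 Conv2d layer (matching the convolution layers in Table 2); it is a minimal illustration rather than our exact implementation, and the function and variable names are assumptions.

```python
# Extract rows of A for a convolution layer via the transposed convolution
# (illustrative sketch; assumes a stride-1 Conv2d, so no output_padding is needed).
import torch
import torch.nn.functional as F

def rows_of_A(conv, in_shape, row_indices):
    """Return the rows a_i of A (as flat vectors) for the given output indices."""
    out_shape = conv(torch.zeros(1, *in_shape)).shape[1:]      # (C_out, H_out, W_out)
    m = int(torch.tensor(out_shape).prod())
    # Batch of standard basis vectors e_i, reshaped to the convolution's output shape.
    E = torch.zeros(len(row_indices), m)
    E[torch.arange(len(row_indices)), torch.tensor(row_indices)] = 1.0
    E = E.view(len(row_indices), *out_shape)
    # Transposed convolution with the same weights and no bias computes A^T e_i.
    with torch.no_grad():
        rows = F.conv_transpose2d(E, conv.weight, bias=None,
                                  stride=conv.stride, padding=conv.padding)
    return rows.reshape(len(row_indices), -1)                  # each row is a_i
```

For example, `rows_of_A(conv, (1, 28, 28), [0, 1, 2])` would return the first three rows of $\mathbf{A}$ for a convolution acting on MNIST-sized inputs.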

A.3 Neural network architectures

We used three neural networks in this paper. The first network is based on the MNIST dataset, and we refer to it as “MNIST net”. We constructed MNIST net ourselves and trained it to 99% top-1 test accuracy in 100 epochs. The second network is based on the CIFAR-10 dataset, and we refer to it as “CIFAR-10 net”. We constructed CIFAR-10 net ourselves and trained it to 84% top-1 test accuracy in 500 epochs. The third network is the pre-trained implementation of AlexNet from PyTorch’s torchvision package. The following table shows the architectures of MNIST net and CIFAR-10 net.

Table 2: Networks we constructed for this paper. Convolution layers are denoted conv{kernel size}-{output channels}, max pooling layers are denoted maxpool{kernel size}, and fully-connected layers are denoted FC-{output features}. All convolution layers have a stride of 1 and are followed by a ReLU. All fully-connected layers are followed by a ReLU, except the last layer.
MNIST net    CIFAR-10 net
conv5-6      conv3-32
maxpool2     conv3-32
conv5-16     maxpool2
maxpool2     dropout
FC-120       conv3-64
FC-84        conv3-64
FC-10        maxpool2
             dropout
             FC-512
             dropout
             FC-10
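As an illustration of how Table 2 translates into code, the following is a hypothetical PyTorch sketch of MNIST net; the padding (assumed to be 0) and the resulting flattened feature size are assumptions rather than details taken from the table.

```python
# Hypothetical reconstruction of "MNIST net" from Table 2 (padding assumed to be 0).
import torch.nn as nn

mnist_net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),   # conv5-6
    nn.MaxPool2d(2),                             # maxpool2
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),  # conv5-16
    nn.MaxPool2d(2),                             # maxpool2
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 120), nn.ReLU(),       # FC-120 (assumes 28x28 inputs, no padding)
    nn.Linear(120, 84), nn.ReLU(),               # FC-84
    nn.Linear(84, 10),                           # FC-10 (no ReLU on the last layer)
)
```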