This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

11institutetext: Department of Computer Science, Korea University, Seoul, Republic of Korea
11email: {byung4329,lpmn678,kangj,hyunwoojkim}@korea.ac.kr
22institutetext: Department of Mathematics and Statistics, San Diego State University, San Diego, California, USA
22email: {bchudomelka,yhong2}@sdsu.edu

Robust Neural Networks inspired by Strong Stability Preserving Runge-Kutta methods

Byungjoo Kim Equal Contribution, Corresponding Author.11    Bryce Chudomelka 22    Jinyoung Park 11   
Jaewoo Kang
11
   Youngjoon Hong 22    Hyunwoo J. Kim 11
Abstract

Deep neural networks have achieved state-of-the-art performance in a variety of fields. Recent works observe that a class of widely used neural networks can be viewed as the Euler method of numerical discretization. From the numerical discretization perspective, Strong Stability Preserving (SSP) methods are more advanced techniques than the explicit Euler method that produce both accurate and stable solutions. Motivated by the SSP property and a generalized Runge-Kutta method, we proposed Strong Stability Preserving networks (SSP networks) which improve robustness against adversarial attacks. We empirically demonstrate that the proposed networks improve the robustness against adversarial examples without any defensive methods. Further, the SSP networks are complementary with a state-of-the-art adversarial training scheme. Lastly, our experiments show that SSP networks suppress the blow-up of adversarial perturbations. Our results open up a way to study robust architectures of neural networks leveraging rich knowledge from numerical discretization literature.

1 Introduction

Refer to caption
(a) Euler method
Refer to caption
(b) 3rd order SSP
Figure 1: We illustrate the difference between a forward Euler discretization and a third-order SSP discretization applied to the inviscid Burgers’ solution. After computing numerical solutions, the solutions are filtered through the sigmoid function as an activation function. Evidently, in (a) the Euler scheme, i.e., a ResBlock, produces notable numerical errors while the SSP3 discretization in (b) shows a stable numerical approximation. For more details, please see the supplement.

Recent progress in deep learning has shown promising results in various research areas, such as computer vision, natural language processing and recommendation systems. In particular, on the ImageNet classification task [18], deep neural networks show state-of-the-art performance, e.g., residual networks (ResNet), which outperform humans in image classification [15]. Despite the success, deep neural networks often suffer from the lack of robustness against adversarial attacks [30]. ResNet, which is a widely used base network, also suffers from adversarial attacks which necessitates a more fundamental understanding of the architecture at hand.

One interesting interpretation of the ResNet architecture is that of the explicit Euler discretization scheme, i.e., x(tk+1)=x(tk)+F(x(tk))x(t_{k+1})=x(t_{k})+F(x(t_{k})), because it allows one to view neural networks as numerical methods. The explicit Euler method is one of the simplest first-order numerical schemes but often leads to large numerical errors due to its low order. Thus, we would expect that applying advanced numerical discretizations would produce a more accurate numerical solution than the Euler method, such as an explicit high-order Runge-Kutta method. However, an arbitrary explicit high-order Runge-Kutta method can pose a stability problem if the numerical solution becomes unstable [27]. To tackle this issue, [9] and [27] introduce the notion of total variation diminishing (TVD); also called Strong Stability Preserving (SSP) methods. The strong stability preserving approach produces a more accurate solution of the differential equation than the Euler method. We would expect to obtain a more accurate solution of the underlying function with non-smooth initial data (shocks) compared to the Euler method without notable numerical errors, see Figure 1. This phenomenon is directly related to the problem of adversarial attack and robustness of neural networks [30].

Motivated by the advanced numerical discretization schemes, we propose novel network architectures with the SSP property that address robustness; SSP networks (SSPNets). The use of the SSP property consistently demonstrates that all of our proposed architectures outperform ResNet in terms of robustness. SSP architectural blocks do not increase the amount of model parameters compared to ResNet, can be easily implemented, and realized by a convex combination of existing ResNet modules. The parameters used in SSP blocks are mathematically derived coefficients from the advanced numerical discretization methods. In addition, starting from an explicit Runge-Kutta method with the SSP property, we propose novel Adaptive Runge-Kutta blocks with learned coefficients obtained by training. With these learned coefficients, we are able to improve robustness while retaining the natural accuracy of ResNet.

The simple architectural change, SSPNets, improve robustness and are complementary with adversarial training, which is the de facto state-of-the-art defensive methodology. Our contributions are summarized as follows:

  • We propose multiple novel architectural blocks motivated by the Strong Stability Preserving explicit higher-order numerical discretization method.

  • We demonstrate empirically that these proposed blocks improve the robustness of Strong Stability Preserving networks consistently; against adversarial examples and without any defensive methods.

  • We further improve on robustness with a novel adaptive architectural block motivated by a generalized Runge-Kutta method and the SSP property.

  • Last but not least, we show that Strong Stability Preserving Networks suppress the blow-up of adversarial perturbations added to inputs.

2 Background and Related Work

2.1 Neural Networks and Differential Equations

Neural networks such as ResNet [15], PolyNet [40] and recurrent neural networks share a common operation represented as xt+1=xt+F(xt;Θt)x_{t+1}=x_{t}+F(x_{t};\Theta_{t}). Interestingly, a sequence of the operations (or equivalently the network architectures) can be interpreted as an explicit Euler method for numerical discretization [6, 7, 20, 24, 25]. For instance, ResNet can be written mathematically as

x0=x,xk+1=xk+F(xk;Θk),k{0,1,,A1},y^=f(xA),\begin{split}&x_{0}=x,\\ &x_{k+1}=x_{k}+F(x_{k};\Theta_{k}),\quad k\in\{0,1,\dots,A-1\},\\ &\hat{y}=f(x_{A}),\end{split} (1)

where AA denotes the number of layers in the network.

If we multiply the function FF by Δt\Delta t, i.e., xk+1=xk+ΔtF(xk;Θk)x_{k+1}=x_{k}+\Delta tF(x_{k};\Theta_{k}), then ResNet can be seen as the explicit Euler numerical scheme discretization with an initial condition, x(0)x(0), to solve the initial value problem given as

x(0)=x,dx(t)dt=F(x(t);Θ(t)),y^=f(x(A)).\begin{split}&x(0)=x,\\ &\frac{dx(t)}{dt}=F(x(t);\Theta(t)),\\ &\hat{y}=f(x(A)).\end{split} (2)

The explicit Euler method is the simplest Runge-Kutta method and often suffers from low accuracy because it is a first-order method. In this regard, higher-order numerical methods are natural candidates to obtain a more precise numerical solution, but the higher accuracy from higher-order methods may come with the cost of instability, e.g., poor convergence behaviour on stiff differential equations compared to the first order Euler method [4]. Therefore, it is important to understand the trade-off between accuracy and stability when considering a numerical method.

Recently, some network architectures inspired by the computational similarity between ResNet and Euler discretization have been proposed, e.g., NeuralODE and FFJORD [6, 11]. Unlike ResNet, which requires the discretization of observation/emission intervals to be represented by a finite number of hidden layers, NeuralODE and FFJORD use numerical discretization methods in the forward propagation to define continuous-depth and continuous-time latent variable models. These require ODE solvers for training and inference, unlike our implementation of SSP networks. Since we changed only computational graphs and coefficients based on the numerical discretization theory, our methods perform the standard forward/backward propagation in the discrete space as ResNet.

Another approach to design new blocks/layers of neural networks is to make them have operations similar to advanced numerical discretization techniques that possess desirable properties [20, 25]. From the partial differential equation perspective, analysis on numerical stability of conventional residual connections lead to the development of new architectures: parabolic/hyperbolic CNNs to achieve better stability as parabolic/hyperbolic PDEs [25]. The models use theoretical assumptions on the function to achieve stability with a positive semi-definite Jacobian of the function resulting in constraints on convolutional kernels; alternatively, our networks do not require such constraints.

2.2 Robust Machine Learning and Adversarial Attacks

Stability and robustness of neural networks have been studied in the context of adversarial attacks after the success of deep learning [2, 30, 36]. Gradient-based adversarial attacks create adversarial examples solving optimization problems. One example is the maximization of loss against ground truth labels within a small ball, e.g., maxδ(hθ(x+δ),y), s.t. δϵ\max_{\delta}\mathcal{L}(h_{\theta}(x+\delta),y),\textit{ s.t. }\|\delta\|_{\infty}\leq\epsilon, where hθh_{\theta} is a model parameterized by θ\theta, xx, yy are the input (natural sample) and its target label respectively, and \mathcal{L} is a loss function. The simplest procedure to approximate the solution is to use the fast gradient sign method (FGSM) [8]. It can be seen as an optimal solution to a linearized loss function, i.e., argmaxvαvTδ(hθ(x+δ),y)=αsign(δ(hθ(x+δ),y))\arg\max_{\|v\|_{\infty}\leq\alpha}v^{T}\nabla_{\delta}\mathcal{L}(h_{\theta}(x+\delta),y)=\alpha\cdot\text{sign}(\nabla_{\delta}\mathcal{L}(h_{\theta}(x+\delta),y)). Furthermore, the FGSM can be more powerful when it is used with iterative methods such as the projected gradient descent (PGD). PGD has been used in both untargeted and targeted attacks [21, 5].

One of the early attempts to defend against adversarial attacks is adversarial training using FGSM, a single-step method [8]. After that, various defensive techniques have been proposed [3, 22, 26, 29, 31]. Many of them were defeated by iterative attack methods [5] and Backward Pass Differentiable Approximation [1]. Adversarial training with stronger multi-step attack methods is still promising and shows state-of-the-art performance [21, 32, 35]. More recently, provably robust neural networks have been successfully trained by minimizing the lower bound of risk based on convex duality and convex relaxation [33, 34]. Most adversarial training methods above assume that attack methods are known a priori, i.e., a white-box attack, and generate augmented samples using the attacks. Another defensive technique is to alleviate the effect of perturbation by augmentation and reconstruction [23, 37], or denoising [35]. These methods alongside adversarial training achieved comparable robustness. Similarly, in this work we will introduce our approach and evaluate it with adversarial training.

3 Strong Stability Preserving Networks

In this section, we introduce the mathematical framework for the Strong Stability Preserving property and describe how to implement SSP blocks with mathematically derived coefficients. Next, we provide a variance analysis to compare high-order Runge-Kutta blocks with residual blocks. Lastly, we introduce adaptive Runge-Kutta blocks with learnable coefficients which possess the SSP property.

3.1 Motivation of strong stability preserving method

Our objective is to solve the non-autonomous differential equation given as

ut=L(u(t),t),t[t0,,tN],\frac{\partial u}{\partial t}=L(u(t),t),\quad t\in[t_{0},...,t_{N}], (3)

where t0,tNt_{0},t_{N} are the initial and terminal time state respectively; a non-autonomous system permits a time varying solution, e.g., the learned function varies as the depth of the network increases. The function LL is a linear (or nonlinear) function and u(t0)u(t_{0}) is given by the initial condition. The objective is to figure out the terminal state of the function uu, i.e., u(tN)u(t_{N}).

A general high-order Runge-Kutta time discretization for solving the initial value problem (3) introduced in [28] is given as

u(0)=un,u(i)=k=0i1(αi,ku(k)+Δtβi,kL(u(k))),i{1,,m},un+1=u(m),\begin{split}&u^{(0)}=u^{n},\\ &u^{(i)}=\sum^{i-1}_{k=0}\left(\alpha_{i,k}u^{(k)}+\Delta t\beta_{i,k}L(u^{(k)})\right),\quad i\in\{1,\cdots,m\},\\ &u^{n+1}=u^{(m)},\end{split} (4)

where k=0i1αi,k=1\sum^{i-1}_{k=0}\alpha_{i,k}=1 and αi,k0\alpha_{i,k}\geq 0. For example, if m=1m=1, it becomes the first-order Euler method as in Equation (1) with α1,0=β1,0=1\alpha_{1,0}=\beta_{1,0}=1.

Shu et al. [27, 28] propose a TVD time discretization method that is called the SSP time discretization method; for more discussion on the TVD method, we refer the reader to [12, 13]. The procedure of TVD time discretization is to take the high-order method to decrease the local truncation error and maintain the stability under a suitable restriction on the time step. While applying the TVD scheme into the explicit high-order Runge-Kutta methods, there needs the assumption to hold it: The first-order Euler method in time is strongly stable under a certain (semi) norm when the time step Δt\Delta t is suitably restricted [10]. More precisely, if we assume that the forward Euler time discretization is stable under a certain norm, the SSP methods find a higher-order time discretization that maintains strong stability for the same norm; improving accuracy.

Followed by this assumption, for a sufficiently small time step known as Courant-Friedrichs-Lewy (CFL) condition ΔtΔtCFL\Delta t\leq\Delta t_{CFL}, the total variation semi-norm of the numerical scheme does not increase in time, that is,

TV(un+1)TV(un),TV(u^{n+1})\leq TV(u^{n}), (5)

where the total variation is defined by

TV(un):=j|uj+1nujn|,TV(u^{n}):=\sum_{j}|u_{j+1}^{n}-u_{j}^{n}|, (6)

where jj is the spatial discretization. The explicit high-order Runge-Kutta discretization with the SSP property maintains a higher order accuracy with a modified CFL condition ΔtcΔtCFL\Delta t\leq c\Delta t_{CFL}. In other words, the high-order SSP Runge-Kutta scheme improves accuracy while retaining its stability. This has been theoretically studied by the following Lemma 1.

Lemma 1

If the forward Euler method is strongly stable under the CFL condition, i.e. un+ΔtL(un)un||u^{n}+\Delta tL(u^{n})||\leq||u^{n}||, then the Runge-Kutta method possesses SSP, un+1un||u^{n+1}||\leq||u^{n}||, provided that ΔtcΔtCFL.\Delta t\leq c\Delta t_{CFL}.

We provide a sketch of the proof of Lemma 1 in the supplement. The full proof of the Lemma 1 can be found in [28]. Following this representation, we can figure out the specific coefficients αi,k\alpha_{i,k} and βi,k\beta_{i,k} in equation (4). In particular, the second and third order nonlinear SSP Runge-Kutta method was studied in [28].

Lemma 2

An optimal second-order SSP Runge-Kutta method is given by,

u(1)=un+ΔtL(un),un+1=12un+12u(1)+12ΔtL(u(1)),\begin{split}&u^{(1)}=u^{n}+\Delta tL(u^{n}),\\ &u^{n+1}=\frac{1}{2}u^{n}+\frac{1}{2}u^{(1)}+\frac{1}{2}\Delta tL(u^{(1)}),\end{split} (7)

with a CFL coefficient c=1c=1. In addition, an optimal third-order SSP Runge-Kutta method is of the form

u(1)=un+ΔtL(un),u(2)=34un+14u(1)+14ΔtL(u(1)),un+1=13un+23u(2)+23ΔtL(u(2)),\begin{split}&u^{(1)}=u^{n}+\Delta tL(u^{n}),\\ &u^{(2)}=\frac{3}{4}u^{n}+\frac{1}{4}u^{(1)}+\frac{1}{4}\Delta tL(u^{(1)}),\\ &u^{n+1}=\frac{1}{3}u^{n}+\frac{2}{3}u^{(2)}+\frac{2}{3}\Delta tL(u^{(2)}),\end{split} (8)

with a CFL coefficient c=1c=1.

A sketch of the proof for Lemma 2 can be found in the supplement and for the detailed proof, we refer the reader to [9, 10, 28].

3.2 Strong Stability Preserving Networks

Refer to caption

(a)

Refer to caption

(b)
Refer to caption
(c)

Refer to caption

(d)
Figure 2: Network modules with ResBlock and SSP blocks. (a): ResBlock. (b): SSP2-block (c): SSP3-block, (d): ArkBlock

Next, we show how to incorporate the explicit SSP Runge-Kutta method into neural networks. Equation (19) and (20) can be implemented with standard residual blocks and simple operations, as shown in Figure 2.

Let ResBlock denote a standard residual block written as ResBlock(x(tk);Θ(tk))\textit{ResBlock}(x(t_{k});\Theta(t_{k})) =x(tk)+F(x(tk);Θ(tk)),=x(t_{k})+F(x(t_{k});\Theta(t_{k})), where Θ(tk)\Theta(t_{k}) are the parameters of ResBlock(\cdot;Θ(tk)\Theta(t_{k})). The function FF is typically composed of two or three sets of normalization, activation, and convolutional layers, e.g., Figure 2 and [15, 16]. When the numbers of input and output channels differ, we use the expansive residual block ResBlock-E; this can be implemented with a 1×11\times 1 convolutional filter to expand the number of channels.

Using the standard modules in ResNet (ResBlock and ResBlock-E), SSPNets can be constructed. First, SSP blocks can be implemented using linear combinations of ResBlocks. As the Euler method interpretation of ResNet requires Δt=1\Delta t=1, we assume Δt=1\Delta t=1 in Equation (19), then the SSP2-block is given by,

x(tk+12)=x(tk)+F(x(tk);Θ(tk))ResBlock(x(tk);Θ(tk)),x(tk+1)=12x(tk)+12x(tk+12)+12F(x(tk+12);Θ(tk))12ResBlock(x(tk+1/2);Θ(tk)).\begin{split}&x({t_{k+\frac{1}{2}}})=\underbrace{x(t_{k})+F(x(t_{k});\Theta(t_{k}))}_{\textit{ResBlock}\left(x\left(t_{k}\right);\Theta\left(t_{k}\right)\right)},\\ &x({t_{k+1}})=\frac{1}{2}x(t_{k})+\underbrace{\frac{1}{2}x({t_{k+\frac{1}{2}}})+\frac{1}{2}F\left(x(t_{k+\frac{1}{2}});\Theta(t_{k})\right)}_{\frac{1}{2}\textit{ResBlock}\left(x\left(t_{k+1/2}\right);\Theta\left(t_{k}\right)\right)}.\end{split} (9)

Similarly, the third order SSP in Equation (20) (SSP3-block) is written as

x(tk+13)=x(tk)+F(x(tk);Θ(tk))ResBlock(x(tk);Θ(tk)),x(tk+23)=34x(tk)+14x(tk+13)+14F(x(tk+13);Θ(tk))14ResBlock(x(tk+1/3);Θ(tk)),x(tk+1)=13x(tk)+23x(tk+23)+23F(x(tk+23);Θ(tk))23ResBlock(x(tk+2/3);Θ(tk)).\begin{split}&x({t_{k+\frac{1}{3}}})=\underbrace{x(t_{k})+F\left(x(t_{k});\Theta(t_{k})\right)}_{\textit{ResBlock}\left(x\left(t_{k}\right);\Theta\left(t_{k}\right)\right)},\\ &x({t_{k+\frac{2}{3}}})=\frac{3}{4}x(t_{k})+\underbrace{\frac{1}{4}x({t_{k+\frac{1}{3}}})+\frac{1}{4}F\left(x({t_{k+\frac{1}{3}}});\Theta(t_{k})\right)}_{\frac{1}{4}\textit{ResBlock}\left(x\left(t_{k+1/3}\right);\Theta\left(t_{k}\right)\right)},\\ &x(t_{k+1})=\frac{1}{3}x(t_{k})+\underbrace{\frac{2}{3}x({t_{k+\frac{2}{3}}})+\frac{2}{3}F\left(x({t_{k+\frac{2}{3}}});\Theta(t_{k})\right)}_{\frac{2}{3}\textit{ResBlock}\left(x\left(t_{k+2/3}\right);\Theta\left(t_{k}\right)\right)}.\end{split} (10)

The SSP block schematic is presented in Figure 2 and SSP blocks are only used when the number of channels does not change.

The explicit SSP Runge-Kutta methods in Equation (19) and (20) use the same function LL multiple times. Similarly, SSP blocks in Equation (9) and (10) apply the same ResBlock multiple times. Using the same ResBlock multiple times can be viewed as parameter sharing, which is a kind of regularization. In other words, without increasing the number of parameters, a SSP block implementation improves the robustness of neural networks by utilizing higher-order schemes.

Midpoint Runge-Kutta Second-Order Methods. For contrast, one may ask whether or not the stability preserving properties are key to the robustness against adversarial perturbation. We address this important question by training another network that utilizes a second-order midpoint Runge-Kutta method (mid-RK2) which does not have the strong stability preserving property [4, 10]. Recall that this method is implemented numerically as

x(tk+1)=x(tk)+F(x(tk)+12F(x(tk);Θ(tk));Θ(tk)),x(t_{k+1})=x(t_{k})+F\left(x(t_{k})+\frac{1}{2}F(x(t_{k});\Theta(t_{k}));\Theta(t_{k})\right), (11)

and does not have the SSP property. This network will provide a comparison of numerical discretization methods with regard to stability in attacked accuracy.

Variance Analysis of SSP networks. We analyze the variance increase of SSP blocks following previous works [14, 39], which compare the variance of input and output of functional modules. Next, we show that SSP blocks suppress the variance increase compared to ResBlock; as well as comparing the variance of the midpoint Runge-Kutta second-order numerical method for further justification.

Lemma 3

If Var[F(x)]=Var[x]\texttt{Var}[F(x)]=\texttt{Var}[x], Cov[x,F(y)]=0\texttt{Cov}[x,F(y)]=0 then the variance increases by

Var[ResBlock(x)]=2Var[x],Var[mid-RK2(x)]=94Var[x],Var[SSP2-Block(x)]=74Var[x],Var[SSP3-Block(x)]=2918Var[x].\begin{split}&\texttt{Var}[ResBlock(x)]\;\;=2\texttt{Var}[x],\;\;\;\;\;\;\texttt{Var}[\textit{mid-RK2(x)}]\;\;\;\>=\frac{9}{4}\texttt{Var}[x],\\ &\texttt{Var}[\textit{SSP2-Block(x)}]=\frac{7}{4}\texttt{Var}[x],\;\;\;\;\;\texttt{Var}[\textit{SSP3-Block(x)}]=\frac{29}{18}\texttt{Var}[x].\end{split} (12)

The variance of SSP blocks is smaller than that of ResBlock. The variance adds to our argument that the SSP property is the reason for improved robustness; for more detailed derivation and proof, see the supplement.

Adaptive SSP Networks.

Refer to caption
Figure 3: The overall architecture of neural networks used in experiments. Each group has N{6,10}N\in\{6,10\} blocks and the block is either ResBlock, SSP blocks (2 or 3) or ArkBlock. The ResBlock-E is inserted between groups to expand the number of channels for all the architectures.

Also, we generalize Equation (4) with the second-order Adaptive Runge-Kutta block (ArkBlock) that has the SSP property by construction. These novel computational blocks slightly increase the number of parameters compared to ResBlock but also provide greater robustness and natural accuracy than SSP2-Block or SSP3-Block. Finally, we explore different computational architectures within each group to retain natural accuracy and further improve robustness.

A naive implementation of Equation (4) yields 5 additional parameters. We can retain the SSP property in ArkBlocks by reducing the number of parameters with Ralston’s method [9]. Thus, the number of additional learned parameters per block, when compared with ResBlock, is 2 and is defined as

α1,0=1,α2,0=1α2,1,β2,0=112β1,0α2,1β1,0,β2,1=12β1,0.\begin{split}\alpha_{1,0}&=1,\qquad\alpha_{2,0}=1-\alpha_{2,1},\\ \beta_{2,0}&=1-\frac{1}{2\beta_{1,0}}-\alpha_{2,1}\beta_{1,0},\qquad\beta_{2,1}=\frac{1}{2\beta_{1,0}}.\end{split} (13)

We further improve performance by reducing the number of parameters by fixing α2,1\alpha_{2,1} and simply learning β1,0\beta_{1,0} in each block.

Adaptive SSP networks still maintain the same architecture, as in Figure 3, but are comprised of blocks that have the form

u(1)=un+β1,0L(un),u(n+1)=α2,0un+β2,0L(un)+α2,1u(1)+β2,1L(u(1)).\begin{split}u^{(1)}&=u^{n}+\beta_{1,0}L(u^{n}),\\ u^{(n+1)}&=\alpha_{2,0}u^{n}+\beta_{2,0}L(u^{n})+\alpha_{2,1}u^{(1)}+\beta_{2,1}L(u^{(1)}).\end{split} (14)

We implement ArkBlocks with,

x(tk+12)=x(tk)+β1,0F(x(tk);Θ(tk)),x(tk+1)=α2,0x(tk)+β2,0F(x(tk);Θ(tk))+α2,1x(tk+12)+β2,1F(x(tk+12);Θ(tk)).\begin{split}x({t_{k+\frac{1}{2}}})&=x(t_{k})+\beta_{1,0}F\left(x(t_{k});\Theta(t_{k})\right),\\ x(t_{k+1})&=\alpha_{2,0}x(t_{k})+\beta_{2,0}F\left(x({t_{k}});\Theta(t_{k})\right)\\ &~{}+\alpha_{2,1}x({t_{k+\frac{1}{2}}})+\beta_{2,1}F\left(x({t_{k+\frac{1}{2}}});\Theta(t_{k})\right).\end{split} (15)

The ArkBlocks are inspired by the generalized Runge-Kutta method in (14). However, the numerical scheme in Equation (14), keeps α2,1\alpha_{2,1} and β1,0\beta_{1,0} constant in all blocks, while ArkBlocks set those parameters as learnable; varying in each block. Such an adaptivity based on data and architectures cannot be obtained by mathematically derived coefficients. To our knowledge, this is the first attempt.

Model Clean FGSM PGD20 PGD30
ResNet 0.9961 0.7674 0.5799 0.1773
SSP-2 0.9954 0.7984 0.5979 0.1850
SSP-3 0.9960 0.8022 0.6176 0.1930
SSP-adap 0.9946 0.8586 0.7611 0.5102
Table 1: The accuracy against adversarial attacks with standard training on the MNIST dataset; all models were trained with 6 blocks. Note that PGDi represents a projected gradient descent attack with ii iterations and that all the SSPNets are more robust against adversarial attacks than ResNet.

4 Experiments

We evaluate the robustness of various SSP networks against adversarial examples. MNIST [19] and CIFAR10 [18] are used for evaluation; for results on other datasets, see the supplement. The robustness is measured by the classification accuracy on adversarial examples generated by FGSM [8] and PGD [21].

In this section, we empirically address the following three questions:

  • Are deep neural networks with the SSP property more robust than ResNet when the models are trained with or without adversarial training?

  • Can we further improve upon adversarial robustness and simultaneously retain the natural accuracy of ResNet?

  • Do Strong Stability Preserving networks suppress the perturbation growth during forward propagation?

4.1 Experimental setup

ResNet and SSP networks. Each group has NN blocks where each block can be either ResBlock, SSP2-block, SSP3-block, or ArkBlock, as seen in Figure 3. Networks are named after the type of blocks: ResNet, SSP-2, SSP-3, and SSP-adap. The blocks in each group have the same number of input/output channels. The convolutional layers in group 1, group 2, and group 3 have 16, 32, 64 channels respectively. The classification layer of our networks consist of an average pooling and softmax layer, in order to calculate the confidence score.

4.2 Evaluation on MNIST with standard training

We demonstrate that SSPNets are more robust than ResNet with standard training. Since MNIST has relatively low-resolution images compared to CIFAR10, we used a smaller architecture by skipping group 1 and 2 in Figure 3.

Experimental Details. We evaluate the models on MNIST. When training the models, samples are augmented by adding random noise δ\delta drawn from a uniform distribution Uniform(ϵ,ϵ)Uniform(-\epsilon,\epsilon). We set the maximum perturbation magnitude ϵ=0.3\epsilon=0.3 for both training and evaluation. For optimization, Adam [17] is used with learning rate 0.00010.0001 and (β1,β2)=(0.9,0.999)(\beta_{1},\beta_{2})=(0.9,0.999), minibatch size of 128. Models are trained for 100 epochs.

Robustness Comparison. The results in Table 1 show that all four models have high accuracy (99.599.6%99.5\sim 99.6\%) in classifying clean samples. This means that SSP blocks do not lead to a significant loss of accuracy on clean samples. Further, the improvement by SSP compared to ResNet is consistently observed in different settings. SSP-2 improves the robustness by 3% against FGSM and 1% against PGD. SSP-3 shows larger improvement about 4% and 2% against FGSM and PGD. SSP-adap shows the largest improvement about 9% and 33% against FGSM and PGD. It is known that adversarial training on MNIST is sufficiently robust against FGSM and PGD. All models trained by adversarial training achieve 9697%96\sim 97\% on MNIST, which makes it hard to demonstrate the benefit of SSP networks with adversarial training compared to ResNet.

4.3 SSP with adversarial training

We analyze the robustness of SSP networks, on the CIFAR10 dataset. Our preliminary experiments show that all the models, e.g., ResNet, SSP-2, SSP-3, and SSP-adap trained without adversarial training are easily fooled by PGD attacks, but more analysis is needed on a more challenging dataset. For this reason, we focus on the adversarial training setting for CIFAR10. Please see supplementary materials for more analysis on SSP networks with adversarial training.

NN KK Model Clean FGSM PGD7 PGD12 PGD20
6 7 ResNet 0.8357 0.5116 0.4389 0.4215 0.4150
6 7 mid-RK2 0.8407 0.5156 0.4377 0.4193 0.4129
6 7 SSP-2 0.8257 0.5223 0.4577 0.4426 0.4368
6 7 SSP-3 0.8376 0.5165 0.4478 0.4305 0.4246
6 7 SSP-adap 0.8376 0.5283 0.4640 0.4455 0.4403
6 12 ResNet 0.8010 0.5304 0.4817 0.4691 0.4650
6 12 mid-RK2 0.7957 0.5326 0.4849 0.4740 0.4693
6 12 SSP-2 0.7899 0.5426 0.5073 0.4983 0.4961
6 12 SSP-3 0.7966 0.5440 0.5092 0.4999 0.4976
6 12 SSP-adap 0.7988 0.5504 0.5066 0.4964 0.4943
10 7 ResNet 0.8516 0.5225 0.4398 0.4188 0.4111
10 7 mid-RK2 0.8451 0.5146 0.4343 0.4122 0.4045
10 7 SSP-2 0.8437 0.5373 0.4714 0.4502 0.4427
10 7 SSP-3 0.8505 0.5350 0.4719 0.4558 0.4497
10 7 SSP-adap 0.8504 0.5308 0.4592 0.4376 0.4310
10 12 ResNet 0.8181 0.5467 0.4957 0.4799 0.4755
10 12 mid-RK2 0.8198 0.5522 0.4968 0.4818 0.4775
10 12 SSP-2 0.8144 0.5497 0.5074 0.4957 0.4932
10 12 SSP-3 0.8119 0.5507 0.5032 0.4929 0.4890
10 12 SSP-adap 0.8156 0.5643 0.5166 0.5054 0.5016
Table 2: CIFAR10 robustness evaluation against adversarial attacks. The column index NN indicates the number of blocks in each group of Figure 3, KK indicates the number of PGD iterations during training while PGDi represents the attack with ii iterations during attack. The SSP-adap model indicates an adaptive Runge-Kutta structure. All the SSP networks are more robust against adversarial attack than ResNet. Moreover, SSP-adap maintains the natural accuracy.

Adversarial Training. Before experimental results, we briefly summarize the adversarial training proposed by [21]. The objective of adversarial training is to minimize the adversarial risk given as,

Radv(hθ)=𝔼(x,y)D[maxδΔ(hθ(x+δ),y)],R_{adv}(h_{\theta})=\mathbb{E}_{(x,y)\sim D}\big{[}\max\limits_{\delta\in\Delta}\mathcal{L}(h_{\theta}(x+\delta),y)\big{]}, (16)

where the hθh_{\theta} is a model parameterized by θ\theta, \mathcal{L} is a loss function, yy is the label of corresponding image xx, DD is a true data distribution, and Δ\Delta is a set of small perturbations satisfying δpϵ\|\delta\|_{p}\leq\epsilon. In our experiments, the \ell_{\infty} metric is used, i.e., p=p=\infty. Finding the exact solution to maxδΔ(hθ(x+δ),y)\max\limits_{\delta\in\Delta}\mathcal{L}(h_{\theta}(x+\delta),y) is intractable, so [21] approximate it with a sample generated by the PGD attack. PGD attack finds the adversarial example given as xi+1=Π(xi+αxi(hθ(xi),y))x_{i+1}=\Pi(x_{i}+\alpha\nabla_{x_{i}}\mathcal{L}(h_{\theta}(x_{i}),y)), where i{0,1,,K1}i\in\{0,1,\cdots,K-1\}, KK is the number of iterations of PGD attack, Π\Pi denotes the projection to a small ball Δ\Delta and a valid pixel range. In our experiment, x0x_{0} is initialized with the input image augmented by adding the random perturbation δ0\delta_{0} sampled from the uniform distribution Uniform(ϵ,ϵ)Uniform(-\epsilon,\epsilon).

To summarize, our adversarial training procedure works as follows: First, randomly perturb the image within the allowed perturbation range ϵ\epsilon. Next, generate the candidate adversarial example by PGD attack. Finally, take the gradient descent step on a minibatch composed of only candidate adversarial examples. The adversarial training is closely related to the Frank-Wolfe Algorithm and two projections in the original adversarial training can be simplified to one projection to the intersection of two convex sets. The pseudocode of adversarial training and a detailed discussion of implementation are provided in the supplement.

Experimental Details. We use the Stochastic Gradient Descent method with Nesterov momentum, learning rate of 0.10.1, weight decay of 0.00050.0005, momentum 0.90.9, and a minibatch size of 128 samples. All models are trained for 200 epochs and in every 60, 100, 140 epochs, the learning rate decayed with a decaying factor 0.10.1. Both adversarial training and robustness evaluation, we set the maximum perturbation range ϵ=8/255\epsilon=8/255. To evaluate the robustness, we use FGSM [8] and PGD [21]; similar to our MNIST experiments. We set the PGD attack parameters to α=2/255\alpha=2/255, and the number of iterations K=7,12,20K=7,12,20 in evaluation.

Robustness Comparison. The experimental results are shown in Table 2. Models are evaluated in four different settings varying both the number of blocks (6 or 10 in column NN) for each group in Figure 3 and the number of iterations in PGD (7 or 12 in column KK) to generate adversarial examples during training.

Before discussing about the effectiveness of SSP, we briefly show the relationship among robustness, the amount of model parameters, and the strength of attacks used in adversarial training. As shown in Table 2, the robustness of all the models is improved by stronger attacks during training (e.g., larger KK in PGD). The same observation is reported in [32]. For instance, SSP-3 (N=6, K=12) shows higher accuracy than SSP-3 (N=6, K=7) against all the attacks and especially the improvement is about 7% against PGD with 2020 iterations. Also, a bigger model size (e.g., larger NN) increases the robustness against adversarial examples. This is closely related to the finding in [21] that increasing the number of channels in hidden layers often improves the robustness. Our experiments show that increasing the model size by adding more layers improves the robustness. For example, when K=7K=7, SSP-3 with N=10N=10 blocks show overall higher accuracy than SSP-3 with N=6N=6 blocks, the gain is about 2%2\%. From the numerical discretization perspective, more blocks can be seen as a finer time discretization that leads to a more accurate numerical solution (or prediction).

All SSPNets, SSP-2, SSP-3 and SSP-adap, consistently outperform ResNet by, roughly, 13.9%1\sim 3.9\% when ResNet and SSPNets have the same number of blocks, NN, and iterations, KK, in adversarial training. Note that we compare SSP-2 (and SSP-3) with ResNet, which has the same amount of parameters, and this is important to assure that the gain is not from an increased the amount of model parameters. Also, SSP-2, SSP-3 and SSP-adap have the same time discretization as ResNet. So, we conclude that the improvement in robustness against adversarial attacks solely comes from the strength of a higher-order numerical discretization. Table 2 shows one more interesting property of SSP networks. Unlike adversarial training and defensive methods that usually cause the label leaking effects [31, 38], SSP-2, SSP-3 and SSP-adap (our architectural changes) do not bring any additional loss of accuracy on natural samples.

On the other hand, Table 2 also shows that the mid-RK2 architecture does not outperform ResNet, SSP-2, SSP-3 or SSP-adap even though the mid-RK2 is derived from the second order numerical scheme. This gives credence to the implementation of SSPNets and implies that the robust performance is not a result of arbitrary high-order methods. In addition, SSP-adap achieves comparable natural accuracy as ResNet and improves robustness. Table 2 demonstrates the consistent improvement across various settings. For example, SSP-adap achieves nearly 4% absolute performance improvement for N=10,K=7N=10,K=7. The improvement by SSP networks compared to ResNet and the performance difference between different SSP networks are relatively smaller than Table 1. Our conjecture is that this is due to the improvement by adversarial training. We believe that the Strong Stability Preserving property imposed by our architectural change allows the SSPNets to improve the robustness against adversarial attacks.

Refer to caption
(a) N==6,K==7,p==1
Refer to caption
(b) N==6,K==7,p==2
Refer to caption
(c) N==10,K==12,p==1
Refer to caption
(d) N==10,K==12,p==2
Figure 4: Perturbation growth ratio in Equation (17) of clean samples and its adversarial counterparts. As the SSP networks suppress the perturbation growth during forward propagation, SSP-2, SSP-3 and SSP-adap have a lower ratio than ResNet. For full version of this figure, see the supplement.

Perturbation Growth Ratio Comparison. We investigate how the distance between clean samples and adversarial examples evolves through networks by calculating the perturbation growth ratio between input/output of groups given by

PGR(f)=𝔼x𝒟[𝔼x𝒳[f(x)f(x)pxxp]],p{1,2}\texttt{PGR}(f)=\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{x^{\prime}\sim\mathcal{X^{\prime}}}\left[\frac{\|f(x)-f(x^{\prime})\|_{p}}{\|x-x^{\prime}\|_{p}}\right]\right],\quad p\in\{1,2\} (17)

where f()f(\cdot) is a function of a group, xx^{\prime} is a corrupted sample from xx and is an element of the set 𝒳\mathcal{X^{\prime}}, 𝒳\mathcal{X^{\prime}} is a small neighborhood of xx, and pp defines a type of norm either 1\ell_{1} (related to TV in Equation (5)) or 2\ell_{2} (related to Lemma 6). Since each model has a different scale of feature maps, to compare, the distance needs proper normalization. So, we first measure the distance between a clean sample and its adversarial example before/after each group in Figure 3. xx^{\prime} is the adversarial example generated by PGD attack with 20 iterations for each model.

Figure 4 presents the perturbation growth ratio when N=6,K=7N=6,K=7 and N=10,K=12N=10,K=12 at each group in the models. Since the adversarial examples change the final predictions, the perturbation growth ratio increases in all the models. However, for SSPNets, the perturbation growth ratio is significantly lower than ResNet. This result supports that the proposed SSP blocks improve robustness of networks against adversarial attacks when compared to ResNet. We also conducted an experiment when xx^{\prime} is corrupted by adding a random perturbation to xx and the result is consistent with Figure 4. For full version of Figure 4 and more discussion, see the supplement.

5 Conclusion

In this work, we leverage the Strong Stability Preserving property of numerical discretization in order to improve adversarial robustness. Inspired by the Strong Stability Preserving methods, we design a series of SSPNets by applying the same ResBlock multiple times with parameters derived from numerical analysis. All of the SSP networks provide robustness against adversarial attacks. In particular, SSPNets with the ArkBlock improve adversarial robustness while maintaining natural accuracy. The proposed networks are complementary with adversarial training and suppress the perturbation growth. Our work shows the way to improve the robustness of neural networks by utilizing the theory of advanced numerical discretization schemes. We believe that the intersection of numerical discretization and robust deep learning will provide new opportunities to study robust neural networks. 111The codes are available at https://github.com/matbambbang/sspnet.

Acknowledgements This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00533, Research on CPU vulnerability detection and validation), National Supercomputing Center with supercomputing resources including technical support (KSC-2019-CRE-0186), National Research Foundation of Korea (NRF-2020R1A2C3010638), and Simons Foundation Collaboration Grants for Mathematicians.

References

  • [1] Athalye, A., Carlini, N., Wagner, D.: Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In: ICML. pp. 274–283 (2018)
  • [2] Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust optimization, vol. 28. Princeton University Press (2009)
  • [3] Buckman, J., Roy, A., Raffel, C., Goodfellow, I.: Thermometer encoding: One hot way to resist adversarial examples. In: ICLR (2018)
  • [4] Butcher, J.C.: The numerical analysis of ordinary differential equations. A Wiley-Interscience Publication, John Wiley & Sons, Ltd., Chichester (1987), runge Kutta and general linear methods
  • [5] Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP). pp. 39–57 (2017)
  • [6] Chen, T.Q., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary differential equations. In: NeurIPS. pp. 6572–6583 (2018)
  • [7] Ciccone, M., Gallieri, M., Masci, J., Osendorfer, C., Gomez, F.: NAIS-Net: stable deep networks from non-autonomous differential equations. In: NeurIPS. pp. 3025–3035 (2018)
  • [8] Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: ICLR (2015)
  • [9] Gottlieb, S., Shu, C.W.: Total variation diminishing runge-kutta schemes. Mathematics of computation of the American Mathematical Society 67(221), 73–85 (1998)
  • [10] Gottlieb, S., Shu, C.W., Tadmor, E.: Strong stability-preserving high-order time discretization methods. SIAM Rev. 43(1), 89–112 (2001)
  • [11] Grathwohl, W., Chen, R.T., Betterncourt, J., Sutskever, I., Duvenaud, D.: FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367 (2018)
  • [12] Harten, A.: High resolution schemes for hyperbolic conservation laws. J. Comput. Phys. 49(3), 357–393 (1983)
  • [13] Harten, A., Engquist, B., Osher, S., Chakravarthy, S.R.: Uniformly high-order accurate essentially nonoscillatory schemes. III. J. Comput. Phys. 71(2), 231–303 (1987)
  • [14] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision. pp. 1026–1034 (2015)
  • [15] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
  • [16] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV. pp. 630–645. Springer (2016)
  • [17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2014)
  • [18] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009)
  • [19] LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010), http://yann.lecun.com/exdb/mnist/
  • [20] Lu, Y., Zhong, A., Li, Q., Dong, B.: Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In: ICML. pp. 5181–5190 (2018)
  • [21] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: ICLR (2018)
  • [22] Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: 2016 IEEE Symposium on Security and Privacy (SP). pp. 582–597. IEEE (2016)
  • [23] Raff, E., Sylvester, J., Forsyth, S., McLean, M.: Barrage of random transforms for adversarially robust defense. In: CVPR (June 2019)
  • [24] Rubanova, Y., Chen, R.T., Duvenaud, D.: Latent odes for irregularly-sampled time series. arXiv preprint arXiv:1907.03907 (2019)
  • [25] Ruthotto, L., Haber, E.: Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272 (2018)
  • [26] Samangouei, P., Kabkab, M., Chellappa, R.: Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In: ICLR (2018)
  • [27] Shu, C.W.: Total-variation-diminishing time discretizations. SIAM Journal on Scientific and Statistical Computing 9(6), 1073–1084 (1988)
  • [28] Shu, C.W., Osher, S.: Efficient implementation of essentially non-oscillatory shock-capturing schemes. Journal of computational physics 77(2), 439–471 (1988)
  • [29] Song, Y., Kim, T., Nowozin, S., Ermon, S., Kushman, N.: Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In: ICLR (2018)
  • [30] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
  • [31] Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., Madry, A.: Robustness may be at odds with accuracy. In: ICLR (2019)
  • [32] Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., Gu, Q.: On the convergence and robustness of adversarial training. In: ICML. pp. 6586–6595 (2019)
  • [33] Wong, E., Kolter, Z.: Provable defenses against adversarial examples via the convex outer adversarial polytope. In: ICML. pp. 5283–5292 (2018)
  • [34] Wong, E., Schmidt, F., Metzen, J.H., Kolter, J.Z.: Scaling provable adversarial defenses. In: NeurIPS. pp. 8400–8409 (2018)
  • [35] Xie, C., Wu, Y., Maaten, L.v.d., Yuille, A.L., He, K.: Feature denoising for improving adversarial robustness. In: CVPR. pp. 501–509 (2019)
  • [36] Xu, H., Caramanis, C., Mannor, S.: Robustness and regularization of support vector machines. Journal of Machine Learning Research 10(Jul), 1485–1510 (2009)
  • [37] Yang, Y., Zhang, G., Xu, Z., Katabi, D.: Me-net: Towards effective adversarial robustness with matrix estimation. In: ICML. pp. 7025–7034 (2019)
  • [38] Zhang, H., Yu, Y., Jiao, J., Xing, E., Ghaoui, L.E., Jordan, M.: Theoretically principled trade-off between robustness and accuracy. In: ICML. pp. 7472–7482 (2019)
  • [39] Zhang, H., Dauphin, Y.N., Ma, T.: Residual learning without normalization via better initialization. In: International Conference on Learning Representations (2019)
  • [40] Zhang, X., Li, Z., Change Loy, C., Lin, D.: Polynet: A pursuit of structural diversity in very deep networks. In: CVPR. pp. 718–726 (2017)

Appendix 0.A Summary

This supplementary material is structured as follows: proofs of strong stability preserving methods (section 0.B), proofs of variance analysis of SSP networks (section 0.C), comparison of non-TVD scheme and TVD scheme (section 0.D), reminder of adversarial training (section 0.E), exploratory network analysis (section 0.F) and suppression on perturbation growth (section 0.G).

Appendix 0.B Proofs of Strong Stability Preserving Scheme

Lemma 4

[28] If the forward Euler method is strongly stable under the CFL condition, i.e. un+ΔtL(un)un||u^{n}+\Delta tL(u^{n})||\leq||u^{n}||, then the Runge-Kutta method possesses SSP, un+1un||u^{n+1}||\leq||u^{n}||, provided that ΔtcΔtCFL.\Delta t\leq c\Delta t_{CFL}.

Sketch of proof. To begin, we rewrite the Runge-Kutta method as a convex combination of forward Euler steps

u(i)=k=0i1(αi,ku(k)+Δtβi,kL(u(k)))k=0i1αi,ju(k)+Δtβi,kαi,kL(u(k)).\begin{split}\|u^{(i)}\|&=\left\|\sum^{i-1}_{k=0}\left(\alpha_{i,k}u^{(k)}+\Delta t\beta_{i,k}L(u^{(k)})\right)\right\|\\ &\leq\sum^{i-1}_{k=0}\alpha_{i,j}\left\|u^{(k)}+\Delta t\frac{\beta_{i,k}}{\alpha_{i,k}}L(u^{(k)})\right\|.\end{split}

If we set c=mini,k(αi,k/βi,k)c=\min_{i,k}(\alpha_{i,k}/\beta_{i,k}) for ΔtcΔtCFL\Delta t\leq c\Delta t_{CFL}, we find that

u(k)+Δtβi,kαi,kL(u(k))u(k).\|u^{(k)}+\Delta t\frac{\beta_{i,k}}{\alpha_{i,k}}L(u^{(k)})\|\leq\|u^{(k)}\|.

Also, we notice that k=0i1αi,k=1\sum^{i-1}_{k=0}\alpha_{i,k}=1 by consistency. We now use induction to show

u(k)un,\|u^{(k)}\|\leq\|u^{n}\|, (18)

for k=0,1,,mk=0,1,...,m. Clearly, when k=0k=0, (18) holds. Assuming that it is valid for all ki1k\leq i-1, we deduce that

u(i)k=0i1αi,ju(k)+Δtβi,kαi,kL(u(k))k=0i1αi,ku(k)k=0i1αi,kun=un.\begin{split}\|u^{(i)}\|&\leq\sum^{i-1}_{k=0}\alpha_{i,j}\left\|u^{(k)}+\Delta t\frac{\beta_{i,k}}{\alpha_{i,k}}L(u^{(k)})\right\|\\ &\leq\sum^{i-1}_{k=0}\alpha_{i,k}\|u^{(k)}\|\\ &\leq\sum^{i-1}_{k=0}\alpha_{i,k}\|u^{n}\|=\|u^{n}\|.\end{split}

Hence, the lemma follows.

Lemma 5

[28] An optimal second-order SSP Runge-Kutta method is given by,

u(1)=un+ΔtL(un),un+1=12un+12u(1)+12ΔtL(u(1)),\begin{split}&u^{(1)}=u^{n}+\Delta tL(u^{n}),\\ &u^{n+1}=\frac{1}{2}u^{n}+\frac{1}{2}u^{(1)}+\frac{1}{2}\Delta tL(u^{(1)}),\end{split} (19)

with a CFL coefficient c=1c=1. In addition, an optimal third-order SSP Runge-Kutta method is of the form

u(1)=un+ΔtL(un),u(2)=34un+14u(1)+14ΔtL(u(1)),un+1=13un+23u(2)+23ΔtL(u(2)),\begin{split}&u^{(1)}=u^{n}+\Delta tL(u^{n}),\\ &u^{(2)}=\frac{3}{4}u^{n}+\frac{1}{4}u^{(1)}+\frac{1}{4}\Delta tL(u^{(1)}),\\ &u^{n+1}=\frac{1}{3}u^{n}+\frac{2}{3}u^{(2)}+\frac{2}{3}\Delta tL(u^{(2)}),\end{split} (20)

with a CFL coefficient c=1c=1.

Sketch of proof. For the second order m=2m=2, we choose the coefficients as

{α1,0=1,α2,0=1α2,1,β2,0=112β1,0α2,1β1,0,β2,1=12β1,0,\begin{cases}\alpha_{1,0}=1,\\ \alpha_{2,0}=1-\alpha_{2,1},\\ \beta_{2,0}=1-\frac{1}{2\beta_{1,0}}-\alpha_{2,1}\beta_{1,0},\\ \beta_{2,1}=\frac{1}{2\beta_{1,0}},\end{cases}

where β1,0\beta_{1,0} and α2,1\alpha_{2,1} are free parameters. Assume a CFL coefficient c>1c>1, then α1,0=1\alpha_{1,0}=1 implies β1,0<1\beta_{1,0}<1. Hence, we deduce that

12β1,0>12.\frac{1}{2\beta_{1,0}}>\frac{1}{2}.

In addition, we note that

α2,1>β2,1=12β1,0α2,1β1,0>12.\alpha_{2,1}>\beta_{2,1}=\frac{1}{2\beta_{1,0}}\implies\alpha_{2,1}\beta_{1,0}>\frac{1}{2}.

Hence, we obtain that

β2,0=112β1,0α2,1β1,0<11212=0,\beta_{2,0}=1-\frac{1}{2\beta_{1,0}}-\alpha_{2,1}\beta_{1,0}<1-\frac{1}{2}-\frac{1}{2}=0,

which is a contradiction. For the third order case m=3m=3, we choose the coefficients as

{α3,2=1α3,1α3,0,β3,2=3β1,026P(β1,0P),β2,1=16β1,0β3,2,β3,1=1/2α3,2β1,0β2,1Pβ3,2β1,0,β3,0=1α3,1β1,0α3,2Pβ3,1β3,2,β2,0=Pα2,1β1,0β2,1,\begin{cases}\alpha_{3,2}=1-\alpha_{3,1}-\alpha_{3,0},\\ \beta_{3,2}=\dfrac{3\beta_{1,0}-2}{6P(\beta_{1,0}-P)},\\ \beta_{2,1}=\dfrac{1}{6\beta_{1,0}\beta_{3,2}},\\ \beta_{3,1}=\dfrac{1/2-\alpha_{3,2}\beta_{1,0}\beta_{2,1}-P\beta_{3,2}}{\beta_{1,0}},\\ \beta_{3,0}=1-\alpha_{3,1}\beta_{1,0}-\alpha_{3,2}P-\beta_{3,1}-\beta_{3,2},\\ \beta_{2,0}=P-\alpha_{2,1}\beta_{1,0}-\beta_{2,1},\end{cases}

where α2,1,α3,0,α3,1,β1,0\alpha_{2,1},\alpha_{3,0},\alpha_{3,1},\beta_{1,0}, and P=β2,0+α2,1β1,0+β2,1P=\beta_{2,0}+\alpha_{2,1}\beta_{1,0}+\beta_{2,1} are free parameters. We omit the detailed proof for the third order scheme as it is more technical. For the complete proof, see e.g. [9].

Appendix 0.C Proofs of Variance

In this section, we provide a proof of the Lemma 6.

Lemma 6

If Var[F(x)]=Var[x]\texttt{Var}[F(x)]=\texttt{Var}[x], Cov[x,F(y)]=0\texttt{Cov}[x,F(y)]=0 then the variance increases by

Var[ResBlock(x)]=2Var[x],Var[mid-RK2(x)]=94Var[x],Var[SSP2-Block(x)]=74Var[x],Var[SSP3-Block(x)]=2918Var[x].\begin{split}&\texttt{Var}[ResBlock(x)]\;\;=2\texttt{Var}[x],\;\;\;\;\;\;\texttt{Var}[\textit{mid-RK2(x)}]\;\;\;\>=\frac{9}{4}\texttt{Var}[x],\\ &\texttt{Var}[\textit{SSP2-Block(x)}]=\frac{7}{4}\texttt{Var}[x],\;\;\;\;\;\texttt{Var}[\textit{SSP3-Block(x)}]=\frac{29}{18}\texttt{Var}[x].\end{split} (21)

Proof To begin, we summarize basic properties of variance and convariance which commonly used in this proof.

Var[x+y]=Var[x]+Var[y]+2Cov[x,y],Var[ax]=a2Var[x],\begin{split}&\texttt{Var}[x+y]=\texttt{Var}[x]+\texttt{Var}[y]+2\texttt{Cov}[x,y],\\ &\texttt{Var}[ax]=a^{2}\texttt{Var}[x],\end{split} (22)
Cov[x,y+z]=Cov[x,y]+Cov[x,z],Cov[ax,by]=abCov[x,y],\begin{split}&\texttt{Cov}[x,y+z]=\texttt{Cov}[x,y]+\texttt{Cov}[x,z],\\ &\texttt{Cov}[ax,by]=ab\texttt{Cov}[x,y],\end{split} (23)

where a,ba,b are real-valued constants, x,y,zx,y,z are random variables.

Our assumption holds

Var[F(x)]=Var[x]\texttt{Var}[F(x)]=\texttt{Var}[x] (24)

and

Cov[x,F(y;Θ)]=0,\texttt{Cov}[x,F(y;\Theta)]=0, (25)

where xx and yy are random variables [14, 39]. Recall that operations of each block is written as

xk+1=xk+F(xk;Θk),x_{k+1}=x_{k}+F(x_{k};\Theta_{k}), (26)
xk+12=xk+12F(xk;Θk),xk+1=xk+F(xk+12;Θk),\begin{split}&x_{k+\frac{1}{2}}=x_{k}+\frac{1}{2}F(x_{k};\Theta_{k}),\\ &x_{k+1}=x_{k}+F(x_{k+\frac{1}{2}};\Theta_{k}),\end{split} (27)
xk+12=xk+F(xk;Θk),xk+1=12xk+12xk+12+12F(xk+12;Θk),\begin{split}&x_{k+\frac{1}{2}}=x_{k}+F(x_{k};\Theta_{k}),\\ &x_{k+1}=\frac{1}{2}x_{k}+\frac{1}{2}x_{k+\frac{1}{2}}+\frac{1}{2}F(x_{k+\frac{1}{2}};\Theta_{k}),\end{split} (28)
xk+13=xk+F(xk;Θk),xk+23=34xk+14xk+13+14F(xk+13;Θk),xk+1=13xk+23xk+23+23F(xk+23;Θk),\begin{split}&x_{k+\frac{1}{3}}=x_{k}+F(x_{k};\Theta_{k}),\\ &x_{k+\frac{2}{3}}=\frac{3}{4}x_{k}+\frac{1}{4}x_{k+\frac{1}{3}}+\frac{1}{4}F(x_{k+\frac{1}{3}};\Theta_{k}),\\ &x_{k+1}=\frac{1}{3}x_{k}+\frac{2}{3}x_{k+\frac{2}{3}}+\frac{2}{3}F(x_{k+\frac{2}{3}};\Theta_{k}),\end{split} (29)

where the Equation (26) is the equation of ResBlock, Equation (27) is mid-RK2 block, Equation (28) is SSP2-block, Equation (29) is SSP3-block. We divide the proofs of each block. In every proofs, for simplicity, F(x;Θk):=f(x)F(x;\Theta_{k}):=f(x).

Proof of ResBlock Using (24) and (25), we derive the variance of output [39].

Var[xk+1]=(22)Var[xk]+Var[f(xk)]+2Cov[xk,f(xk)]=(25)Var[xk]+Var[f(xk)]=(24)2Var[xk].\begin{split}\texttt{Var}[x_{k+1}]&\operatornamewithlimits{=}^{\eqref{var_property}}\texttt{Var}[x_{k}]+\texttt{Var}[f(x_{k})]+2\texttt{Cov}[x_{k},f(x_{k})]\\ &\operatornamewithlimits{=}^{\eqref{assumption2}}\texttt{Var}[x_{k}]+\texttt{Var}[f(x_{k})]\\ &\operatornamewithlimits{=}^{\eqref{assumption1}}2\texttt{Var}[x_{k}].\end{split}

Proof of mid-RK2 First, we derive Var[xk+12]\texttt{Var}[x_{k+\frac{1}{2}}] and Cov[xk,xk+12]\texttt{Cov}[x_{k},x_{k+\frac{1}{2}}].

Var[xk+12]=(22)Var[xk]+14Var[f(xk)]+Cov[xk,f(xk)]=(24,25)54Var[xk],\begin{split}\texttt{Var}[x_{k+\frac{1}{2}}]&\operatornamewithlimits{=}^{\eqref{var_property}}\texttt{Var}[x_{k}]+\frac{1}{4}\texttt{Var}[f(x_{k})]+\texttt{Cov}[x_{k},f(x_{k})]\\ &\operatornamewithlimits{=}^{(\ref{assumption1},\ref{assumption2})}\frac{5}{4}\texttt{Var}[x_{k}],\end{split} (30)
Cov[xk,xk+12]=Cov[xk,xk+12f(xk)]=(23)Cov[xk,xk]+12Cov[xk,f(xk)]=(24,25)Var[xk].\begin{split}\texttt{Cov}[x_{k},x_{k+\frac{1}{2}}]&=\texttt{Cov}[x_{k},x_{k}+\frac{1}{2}f(x_{k})]\\ &\operatornamewithlimits{=}^{\eqref{cov_property}}\texttt{Cov}[x_{k},x_{k}]+\frac{1}{2}\texttt{Cov}[x_{k},f(x_{k})]\\ &\operatornamewithlimits{=}^{(\ref{assumption1},\ref{assumption2})}\texttt{Var}[x_{k}].\end{split} (31)

By using xk+1=xk+f(xk+12)x_{k+1}=x_{k}+f(x_{k+\frac{1}{2}}),

Var[xk+1]=(27)Var[xk+f(xk+12)]=(22)Var[xk]+Var[f(xk+12)]+2Cov[xk,f(xk+12)]=(24,25)Var[xk]+Var[xk+12]=(30)94Var[xk].\begin{split}\texttt{Var}[x_{k+1}]&\operatornamewithlimits{=}^{\eqref{eq:midrk2}}\texttt{Var}[x_{k}+f(x_{k+\frac{1}{2}})]\\ &\operatornamewithlimits{=}^{\eqref{var_property}}\texttt{Var}[x_{k}]+\texttt{Var}[f(x_{k+\frac{1}{2}})]+2\texttt{Cov}[x_{k},f(x_{k+\frac{1}{2}})]\\ &\operatornamewithlimits{=}^{(\ref{assumption1},\ref{assumption2})}\texttt{Var}[x_{k}]+\texttt{Var}[x_{k+\frac{1}{2}}]\\ &\operatornamewithlimits{=}^{\eqref{midrk2_1/2var}}\frac{9}{4}\texttt{Var}[x_{k}].\end{split}

Proof of SSP2-block We start with obtaining Var[xk+12]\texttt{Var}[x_{k+\frac{1}{2}}] and Cov[xk,xk+12]\texttt{Cov}[x_{k},x_{k+\frac{1}{2}}].

Var[xk+12]=(22)Var[xk]+Var[f(xk)]+2Cov[xk,f(xk)]=(24,25)2Var[xk],\begin{split}\texttt{Var}[x_{k+\frac{1}{2}}]&\operatornamewithlimits{=}^{\eqref{var_property}}\texttt{Var}[x_{k}]+\texttt{Var}[f(x_{k})]+2\texttt{Cov}[x_{k},f(x_{k})]\\ &\operatornamewithlimits{=}^{(\ref{assumption1},\ref{assumption2})}2\texttt{Var}[x_{k}],\end{split} (32)
Cov[xk,xk+12]=(23)Cov[xk,xk]+Cov[xk,f(xk)]=(25)Cov[xk,xk]=Var[xk].\begin{split}\texttt{Cov}[x_{k},x_{k+\frac{1}{2}}]&\operatornamewithlimits{=}^{\eqref{cov_property}}\texttt{Cov}[x_{k},x_{k}]+\texttt{Cov}[x_{k},f(x_{k})]\\ &\operatornamewithlimits{=}^{\eqref{assumption2}}\texttt{Cov}[x_{k},x_{k}]\\ &=\texttt{Var}[x_{k}].\end{split} (33)

Next, let x(1)=xk+12+f(xk+12)x^{(1)}=x_{k+\frac{1}{2}}+f(x_{k+\frac{1}{2}}). Then we can derive Var[x(1)]\texttt{Var}[x^{(1)}] and Cov[xk,x(1)]\texttt{Cov}[x_{k},x^{(1)}].

Var[x(1)]=(22)Var[xk+12]+Var[f(xk+12)]+2Cov[xk+12,f(xk+12)]=(25)Var[xk+12]+Var[f(xk+12)]=(24)Var[xk+12]+Var[xk+12]=2Var[xk+12]=(32)4Var[xk],\begin{split}&\texttt{Var}[x^{(1)}]\\ &\operatornamewithlimits{=}^{\eqref{var_property}}\texttt{Var}[x_{k+\frac{1}{2}}]+\texttt{Var}[f(x_{k+\frac{1}{2}})]+2\texttt{Cov}[x_{k+\frac{1}{2}},f(x_{k+\frac{1}{2}})]\\ &\operatornamewithlimits{=}^{\eqref{assumption2}}\texttt{Var}[x_{k+\frac{1}{2}}]+\texttt{Var}[f(x_{k+\frac{1}{2}})]\\ &\operatornamewithlimits{=}^{\eqref{assumption1}}\texttt{Var}[x_{k+\frac{1}{2}}]+\texttt{Var}[x_{k+\frac{1}{2}}]\\ &=2\texttt{Var}[x_{k+\frac{1}{2}}]\\ &\operatornamewithlimits{=}^{\eqref{ssp2_1/2var}}4\texttt{Var}[x_{k}],\end{split} (34)
Cov[xk,x(1)]=(23)Cov[xk,xk+12]+Cov[xk,f(xk+12)]=(25)Cov[xk,xk+12]=(33)Var[xk].\begin{split}\texttt{Cov}[x_{k},x^{(1)}]&\operatornamewithlimits{=}^{\eqref{cov_property}}\texttt{Cov}[x_{k},x_{k+\frac{1}{2}}]+\texttt{Cov}[x_{k},f(x_{k+\frac{1}{2}})]\\ &\operatornamewithlimits{=}^{\eqref{assumption2}}\texttt{Cov}[x_{k},x_{k+\frac{1}{2}}]\\ &\operatornamewithlimits{=}^{\eqref{ssp2_1/2cov}}\texttt{Var}[x_{k}].\end{split} (35)

Finally, by using x(1)=xk+12+f(xk+12)x^{(1)}=x_{k+\frac{1}{2}}+f(x_{k+\frac{1}{2}}),

Var[xk+1]=(28)Var[12xk+12xk+12+12f(xk+12)]=Var[12xk+12x(1)]=(22)14Var[xk]+14Var[x(1)]+12Cov[xk,x(1)]=(34,35)14Var[xk]+Var[xk]+12Var[xk]=74Var[xk].\begin{split}\texttt{Var}[x_{k+1}]&\operatornamewithlimits{=}^{\eqref{eq:sspblock2}}\texttt{Var}\left[\frac{1}{2}x_{k}+\frac{1}{2}x_{k+\frac{1}{2}}+\frac{1}{2}f(x_{k+\frac{1}{2}})\right]\\ &=\texttt{Var}\left[\frac{1}{2}x_{k}+\frac{1}{2}x^{(1)}\right]\\ &\operatornamewithlimits{=}^{\eqref{var_property}}\frac{1}{4}\texttt{Var}[x_{k}]+\frac{1}{4}\texttt{Var}[x^{(1)}]+\frac{1}{2}\texttt{Cov}[x_{k},x^{(1)}]\\ &\operatornamewithlimits{=}^{(\ref{ssp2_(1)var},\ref{ssp2_(1)cov})}\frac{1}{4}\texttt{Var}[x_{k}]+\texttt{Var}[x_{k}]+\frac{1}{2}\texttt{Var}[x_{k}]\\ &=\frac{7}{4}\texttt{Var}[x_{k}].\end{split}

Proof of SSP3-block Similar to prove the SSP2-block, the first step is inducing Var[xk+13]\texttt{Var}[x_{k+\frac{1}{3}}] and Cov[xk,xk+13]\texttt{Cov}[x_{k},x_{k+\frac{1}{3}}].

Var[xk+13]=2Var[xk],\texttt{Var}[x_{k+\frac{1}{3}}]=2\texttt{Var}[x_{k}], (36)
Cov[xk,xk+13]=(23)Cov[xk,xk]+Cov[xk,f(xk)]=(24)Var[xk].\begin{split}\texttt{Cov}[x_{k},x_{k+\frac{1}{3}}]&\operatornamewithlimits{=}^{\eqref{cov_property}}\texttt{Cov}[x_{k},x_{k}]+\texttt{Cov}[x_{k},f(x_{k})]\\ &\operatornamewithlimits{=}^{\eqref{assumption1}}\texttt{Var}[x_{k}].\end{split} (37)

Let x(1)=xk+13+f(xk+13)x^{(1)}=x_{k+\frac{1}{3}}+f(x_{k+\frac{1}{3}}). Then,

Var[x(1)]=(22)Var[xk+13]+Var[f(xx+13)]+2Cov[xk+13,f(xx+13)]=(25)Var[xk+13]+Var[f(xx+13)]=(24)2Var[xk+13]=(36)4Var[xk],\begin{split}&\texttt{Var}[x^{(1)}]\\ &\operatornamewithlimits{=}^{\eqref{var_property}}\texttt{Var}[x_{k+\frac{1}{3}}]+\texttt{Var}[f(x_{x+\frac{1}{3}})]+2\texttt{Cov}[x_{k+\frac{1}{3}},f(x_{x+\frac{1}{3}})]\\ &\operatornamewithlimits{=}^{\eqref{assumption2}}\texttt{Var}[x_{k+\frac{1}{3}}]+\texttt{Var}[f(x_{x+\frac{1}{3}})]\\ &\operatornamewithlimits{=}^{\eqref{assumption1}}2\texttt{Var}[x_{k+\frac{1}{3}}]\\ &\operatornamewithlimits{=}^{\eqref{ssp3_1/3var}}4\texttt{Var}[x_{k}],\end{split} (38)
Cov[xk,x(1)]=(23)Cov[xk,xk+13]+Cov[xk,f(xx+13)]=(25,37)Var[xk].\begin{split}\texttt{Cov}[x_{k},x^{(1)}]&\operatornamewithlimits{=}^{\eqref{cov_property}}\texttt{Cov}[x_{k},x_{k+\frac{1}{3}}]+\texttt{Cov}[x_{k},f(x_{x+\frac{1}{3}})]\\ &\operatornamewithlimits{=}^{(\ref{assumption2},\ref{ssp3_1/3cov})}\texttt{Var}[x_{k}].\end{split} (39)

By using x(1)=xk+13+f(xk+13)x^{(1)}=x_{k+\frac{1}{3}}+f(x_{k+\frac{1}{3}}),

Var[xk+23]=(29)Var[34xk+14xk+13+14f(xk+13)]=Var[34xk+14x(1)]=(22)916Var[xk]+116Var[x(1)]+38Cov[xk,x(1)]=(38,39)1916Var[xk],\begin{split}\texttt{Var}[x_{k+\frac{2}{3}}]&\operatornamewithlimits{=}^{\eqref{eq:sspblock3}}\texttt{Var}\left[\frac{3}{4}x_{k}+\frac{1}{4}x_{k+\frac{1}{3}}+\frac{1}{4}f(x_{k+\frac{1}{3}})\right]\\ &=\texttt{Var}\left[\frac{3}{4}x_{k}+\frac{1}{4}x^{(1)}\right]\\ &\operatornamewithlimits{=}^{\eqref{var_property}}\frac{9}{16}\texttt{Var}[x_{k}]+\frac{1}{16}\texttt{Var}[x^{(1)}]+\frac{3}{8}\texttt{Cov}[x_{k},x^{(1)}]\\ &\operatornamewithlimits{=}^{(\ref{ssp3_(1)var},\ref{ssp3_(1)cov})}\frac{19}{16}\texttt{Var}[x_{k}],\end{split} (40)

and

Cov[xk,xk+23]=Cov[xk,34xk+14x(1)]=(23)34Cov[xk,xk]+14Cov[xk,x(1)]=(39)Var[xk].\begin{split}\texttt{Cov}[x_{k},x_{k+\frac{2}{3}}]&=\texttt{Cov}\left[x_{k},\frac{3}{4}x_{k}+\frac{1}{4}x^{(1)}\right]\\ &\operatornamewithlimits{=}^{\eqref{cov_property}}\frac{3}{4}\texttt{Cov}[x_{k},x_{k}]+\frac{1}{4}\texttt{Cov}[x_{k},x^{(1)}]\\ &\operatornamewithlimits{=}^{\eqref{ssp3_(1)cov}}\texttt{Var}[x_{k}].\end{split} (41)

Similar to previous steps, let x(2)=xk+23+f(xk+23)x^{(2)}=x_{k+\frac{2}{3}}+f(x_{k+\frac{2}{3}}). Once again, by applying same procedure,

Var[x(2)]=(22)Var[xk+23]+Var[f(xk+23)]+2Cov[xk+23,f(xk+23)]=(25)Var[xk+23]+Var[f(xk+23)]=(24)2Var[xk+23]=(40)198Var[xk],\begin{split}&\texttt{Var}[x^{(2)}]\\ &\operatornamewithlimits{=}^{\eqref{var_property}}\texttt{Var}[x_{k+\frac{2}{3}}]+\texttt{Var}[f(x_{k+\frac{2}{3}})]+2\texttt{Cov}[x_{k+\frac{2}{3}},f(x_{k+\frac{2}{3}})]\\ &\operatornamewithlimits{=}^{\eqref{assumption2}}\texttt{Var}[x_{k+\frac{2}{3}}]+\texttt{Var}[f(x_{k+\frac{2}{3}})]\\ &\operatornamewithlimits{=}^{\eqref{assumption1}}2\texttt{Var}[x_{k+\frac{2}{3}}]\\ &\operatornamewithlimits{=}^{\eqref{ssp3_2/3var}}\frac{19}{8}\texttt{Var}[x_{k}],\end{split} (42)

and

Cov[xk,x(2)]=(23)Cov[xk,xk+23]+Cov[xk,f(xk+23)]=(25)Cov[xk,xk+23]=(41)Var[xk].\begin{split}\texttt{Cov}[x_{k},x^{(2)}]&\operatornamewithlimits{=}^{\eqref{cov_property}}\texttt{Cov}[x_{k},x_{k+\frac{2}{3}}]+\texttt{Cov}[x_{k},f(x_{k+\frac{2}{3}})]\\ &\operatornamewithlimits{=}^{\eqref{assumption2}}\texttt{Cov}[x_{k},x_{k+\frac{2}{3}}]\\ &\operatornamewithlimits{=}^{\eqref{ssp3_2/3cov}}\texttt{Var}[x_{k}].\end{split} (43)

Finally, since x(2)=xk+23+f(xk+23)x^{(2)}=x_{k+\frac{2}{3}}+f(x_{k+\frac{2}{3}}),

Var[xk+1]=(29)Var[13xk+23xk+23+23f(xk+23)]=Var[13xk+23x(2)]=(22)19Var[xk]+49Var[x(2)]+49Cov[xk,x(2)]=(42,43)19Var[xk]+1918Var[xk]+49Var[xk]=2918Var[xk].\begin{split}\texttt{Var}[x_{k+1}]&\operatornamewithlimits{=}^{\eqref{eq:sspblock3}}\texttt{Var}\left[\frac{1}{3}x_{k}+\frac{2}{3}x_{k+\frac{2}{3}}+\frac{2}{3}f(x_{k+\frac{2}{3}})\right]\\ &=\texttt{Var}\left[\frac{1}{3}x_{k}+\frac{2}{3}x^{(2)}\right]\\ &\operatornamewithlimits{=}^{\eqref{var_property}}\frac{1}{9}\texttt{Var}[x_{k}]+\frac{4}{9}\texttt{Var}[x^{(2)}]+\frac{4}{9}\texttt{Cov}[x_{k},x^{(2)}]\\ &\operatornamewithlimits{=}^{(\ref{ssp3_(2)var},\ref{ssp3_(2)cov})}\frac{1}{9}\texttt{Var}[x_{k}]+\frac{19}{18}\texttt{Var}[x_{k}]+\frac{4}{9}\texttt{Var}[x_{k}]\\ &=\frac{29}{18}\texttt{Var}[x_{k}].\end{split}

Appendix 0.D Comparison non-TVD scheme and TVD scheme

Refer to caption
(a) non-TVD scheme
Refer to caption
(b) TVD scheme
Figure 5: Two numerical solutions of the inviscid Burgers’ equations using two different time discretizations are presented above. (a) shows the numerical solution with the non-TVD scheme in (45) while (b) is the numerical solution with the SSP3 discretization as in (20). After computing numerical solutions of the Burgers’ equations, the solutions are filtered through the sigmoid function as an activation function. Evidently, the left panel (a) displays wild oscillations while the right panel (b) displays accurate numerical solutions.

In Figure 5, we implemented numerical solutions of the inviscid Burgers’ equations

ut+uux=0,x(0,1),u(0,t)=u(1,t),u(x,0)=u0,\begin{split}&u_{t}+uu_{x}=0,\quad x\in(0,1),\\ &u(0,t)=u(1,t),\\ &u(x,0)=u_{0},\end{split} (44)

using two different time discretizations; non-TVD scheme in (a) and TVD scheme in (b). The initial condition u0(x)u_{0}(x) is

u0(x):={0,0<x1/6,1,1/6<x2/6,0,2/6<x<1.u_{0}(x):=\begin{cases}0,&0<x\leq 1/6,\\ 1,&1/6<x\leq 2/6,\\ 0,&2/6<x<1.\end{cases}

The same equations and initial conditions were also used in Figure 1 of the main manuscript. For the spatial discretization, we adopt the third-order weighted essentially non-oscillatory (WENO) schemes. For numerical computations, the following configurations are used:

N=number of grid points in x=100,h=Time step size=0.8/N,T=final time=0.3.\begin{split}&N=\text{number of grid points in $x$}=100,\\ &h=\text{Time step size}=0.8/N,\\ &T=\text{final time}=0.3.\end{split}

The left panel (a) shows the numerical solution with non-TVD time discretization of the second order while the right panel (b) presents the numerical solution with the SSP-2 discretization stated in (19). More precisely, in the left panel, we used the second order non-TVD scheme

u(1)=un20ΔtL(un),un+1=un+4140ΔtL(un)140ΔtL(u(1)).\begin{split}&u^{(1)}=u^{n}-20\Delta tL(u^{n}),\\ &u^{n+1}=u^{n}+\frac{41}{40}\Delta tL(u^{n})-\frac{1}{40}\Delta tL(u^{(1)}).\end{split} (45)

After computing numerical solutions of the Burgers’ equations, the solutions are filtered through the sigmoid function as activation function.

Appendix 0.E Adversarial Training Details

Algorithm 1 Projected Gradient Descent

Input: Clean image xnat[0,1]mx_{\text{nat}}\in[0,1]^{m} and corresponding label yy, model hθh_{\theta}, loss function \ell, step α\alpha, bound ϵ\epsilon, # of iteration KK, metric pp.
Output: Candidate adversarial example xadvx_{\text{adv}}.

xadv:=xnatx_{\text{adv}}:=x_{\text{nat}}
feasible-set={x|xxnatpϵ}[0,1]m\texttt{feasible-set}=\{x^{\prime}|\|x^{\prime}-x_{\text{nat}}\|_{p}\leq\epsilon\}\cap[0,1]^{m}
for i in range(KKdo
    xadv=xadv+αsign(xadv(hθ(xadv),y))x_{\text{adv}}=x_{\text{adv}}+\alpha\cdot\texttt{sign}(\nabla_{x_{\text{adv}}}\ell(h_{\theta}(x_{\text{adv}}),y))
    xadv=Π(xadv,feasible-set)x_{\text{adv}}=\Pi(x_{\text{adv}},\texttt{feasible-set})
end for
return xadvx_{\text{adv}}
Algorithm 2 PGD Adversarial Training

Input: Training data minibatch (xi,yi)(x_{i},y_{i}) with i{1,,N}i\in\{1,\cdots,N\}, initialized model hθh_{\theta}, training epochs KK, PGD attack algorithm PGD, Optimization method optim.
Output: Trained model hθh_{\theta}

for i in range(KKdo
    for j in range(NNdo
        Sample random δUniform(ϵ,ϵ)\delta\in Uniform(-\epsilon,\epsilon)
        xj,adv=xj,nat+δx_{j,\text{adv}}=x_{j,\text{nat}}+\delta
        xj,adv=PGD(xj,adv,xj,nat,yj)x_{j,\text{adv}}=\texttt{PGD}(x_{j,\text{adv}},x_{j,\text{nat}},y_{j})
        θ:=optim(hθ(xj,adv),yj)\theta:=\texttt{optim}(h_{\theta}(x_{j,\text{adv}}),y_{j})
    end for
end for
return hθh_{\theta}

In this section, we briefly introduce the adversarial training which we used in our experiments.

Adversarial training is state-of-the-art methodology for defending adversarial attacks [21, 32, 35]. As we mentioned in our main paper, the objective of adversarial training is to minimize the adversarial risk given as

Radv(hθ)=𝔼(x,y)D[maxδΔ(hθ(x+δ),y)],R_{adv}(h_{\theta})=\mathbb{E}_{(x,y)\sim D}\big{[}\max\limits_{\delta\in\Delta}\mathcal{L}(h_{\theta}(x+\delta),y)\big{]}, (46)

where the all notations are the same as our main paper. Strictly speaking, the set Δ\Delta has finite number of elements, so the exact solutions exist which maximize the loss (hθ(x),y)\mathcal{L}(h_{\theta}(x),y). However, as we mentioned in our main paper, numerically solving this problem is intractable. Therefore, most of works estimate the maxδΔ(hθ(x),y)\max\limits_{\delta\in\Delta}\mathcal{L}(h_{\theta}(x),y) by using PGD; see Algorithm 1. Further, the randomness is injected during adversarial training, and this may help the robustness [21, 35]. The description of adversarial training is shown in Algorithm 2.

Appendix 0.F Exploratory Network Analysis

ϵ\epsilon Natural 1 2 3 4 5 6 7 8
ResNet 88.14 83.00 77.22 70.49 64.46 58.45 52.50 47.21 42.29
SSP-2 87.59 83.04 77.61 71.77 66.09 60.27 54.68 49.08 44.73
SSP-3 87.51 83.31 78.49 72.70 66.61 60.90 55.28 50.20 45.40
Table 3: Network performance against FGSM adversarial attacks when trained with PGD training (α=1/255\alpha=1/255). SSP-3 is more robust than ResNet and SSP-2; approximately 3%.
ϵ\epsilon Natural 1 2 3 4 5 6 7 8
ResNet 88.14 82.74 76.00 67.69 59.33 51.01 43.25 36.96 31.31
SSP-2 87.59 82.90 76.87 70.06 62.59 54.83 47.35 41.00 35.02
SSP-3 87.51 83.16 77.78 70.75 63.19 55.53 48.51 42.08 36.13
Table 4: Network performance against PGD adversarial attacks when trained with PGD training (α=1/255\alpha=1/255). SSP-3 is more robust than ResNet and SSP-2; approximately 5%.
Model Clean FGSM PGD20
ResNet 0.9090 0.8562 0.8179
SSP-2 0.9132 0.8591 0.8252
SSP-3 0.9110 0.8639 0.8295
SSP-adap 0.9098 0.8621 0.8264
Table 5: Result on Fashion-MNIST. All the models follow the same architecture which used in MNIST experiment in main paper. The number of blocks is 20, with using Group Normalization. We perform adversarial training with ϵ=0.1\epsilon=0.1, α=0.02\alpha=0.02 with 10 iterations. For evaluating robustness, α=0.01\alpha=0.01 with 20 iterations are used in PGD attack (PGD20).
Model Clean PGD20
ResNet 0.4648 0.1738
SSP-2 0.4386 0.1761
SSP-3 0.4529 0.1955
Table 6: Result on Tiny-Imagenet. We perform adversarial training with ϵ=8/255\epsilon=8/255, α=2/255\alpha=2/255 with 5 iterations. For evaluating robustness, α=2/255\alpha=2/255 with 20 iterations are used in PGD attack (PGD20).

In this section, we provide more experimental data on the CIFAR-10 dataset; as well as other well known datasets: Fashion-MNIST and Tiny-Imagenet.

We have trained ResNet, SSP-2 and SSP-3 on the CIFAR10 dataset, with N=5,K=7N=5,~{}K=7, and PGD adversarial training with α=1/255\alpha=1/255; in order to gauge performance and robustness of our architecture. We were able to perform various attacks on the network using the PGD and FGSM methods. We believe that this can be optimized with higher order methods, specifically SSP. Intuitively speaking, we are expecting performance to increase as a result of using a more accurate approximation. Also, accuracy should increase as we increase the order of the numerical approximation.

We noticed that when all three models were attacked via FGSM, or PGD, with α=1/255\alpha=1/255, and various ϵ\epsilon, that the higher order methods outperformed ResNet from roughly 35%3\sim 5\%. We observed that SSP-3 outperforms SSP-2, which outperforms ResNet; consistently. Furthermore, as the strength of the perturbation was increased, SSP networks became more resilient to adversarial attacks than ResNet. This results in behavior that is truly characteristic of numerical methods, in that accuracy is increased as higher order methods are implemented, and stability is preserved.

What is perhaps the most notable is that SSP-3 is extremely more robust than ResNet whether it is attacked via FGSM or PGD. Robustness is achieved without introducing more parameters, or a dramatic increase to computational power. Our architecture achieves comparable performance on unperturbed images and superior performance with respect to adversarial attacks.

Next, we evaluate the robustness on Fashion-MNIST [xiao2017fashion] dataset. The architecture of model is same as the model used in MNIST, but the only difference is the number of blocks and maximum perturbation range (ϵ\epsilon). The results are shown in Table 5 and are consistent with our previous assumptions and results.

Last but not least, we conduct an experiment on the more challenging dataset, Tiny-Imagenet [le2015tiny]. The models used in Tiny-Imagenet experiment are composed of 4 groups of blocks and each group has 10 blocks. Table 6 shows the top-1 accuracy of natural samples and adversarial examples generated by PGD attack. As the result shows, all the SSP networks show better robustness than ResNet.

Appendix 0.G Suppression on Perturbation Growth

Refer to caption
(a) N==6, K==7, p==1
Refer to caption
(b) N==6, K==12, p==1
Refer to caption
(c) N==10, K==7, p==1
Refer to caption
(d) N==10, K==12, p==1
Refer to caption
(e) N==6, K==7, p==2
Refer to caption
(f) N==6, K==12, p==2
Refer to caption
(g) N==10, K==7, p==2
Refer to caption
(h) N==10, K==12, p==2
Figure 6: Perturbation growth ratio of clean samples and its adversarial counterparts. As the perturbation evolves through networks, SSPNets have lower ratio than ResNet.

In this section, we present the perturbation growth ratio of all the networks used in the CIFAR-10 experiments. Recall that the perturbation growth ratio is given by

PGR(f)=𝔼x𝒟[𝔼x𝒳[f(x)f(x)pxxp]],p{1,2},\texttt{PGR}(f)=\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{x^{\prime}\sim\mathcal{X^{\prime}}}\left[\frac{\|f(x)-f(x^{\prime})\|_{p}}{\|x-x^{\prime}\|_{p}}\right]\right],\quad p\in\{1,2\},

where each corrupted sample xx^{\prime} is sampled from a small neighborhood of xx, i.e., 𝒳\mathcal{X^{\prime}}, and pp defines a type of norm either 1\ell_{1} or 2\ell_{2}.

In Figure 6, all the corrupted sample xx^{\prime} is the adversarial example generated by PGD attack with 20 iterations for each model. Despite there is no other regularization using Lipschitzness or Jacobian, all the SSPNets have lower perturbation growth ratio than ResNet.