Fast Jacobian-Vector Product for Deep Networks
Abstract
Jacobian-vector products (JVPs) form the backbone of many recent developments in Deep Networks (DNs), with applications including faster constrained optimization, regularization with generalization guarantees, and adversarial example sensitivity assessments. Unfortunately, JVPs are computationally expensive for real-world DN architectures and require the use of automatic differentiation to avoid manually adapting the JVP program when changing the DN architecture. We propose a novel method to quickly compute JVPs for any DN that employs Continuous Piecewise Affine nonlinearities (e.g., leaky-ReLU, max-pooling, maxout). We show that our technique is, on average, faster than the fastest alternative across DN architectures and across various hardware. In addition, our solution does not require automatic differentiation and is thus easy to deploy in software, requiring only the modification of a few lines of code that do not depend on the DN architecture.
1 Introduction
Deep (Neural) Networks (DNs) are used throughout the spectrum of machine learning tasks, ranging from image classification (Lawrence et al., 1997; Lin et al., 2013; Zagoruyko & Komodakis, 2016) to financial fraud prediction (Rushin et al., 2017; Roy et al., 2018; Zheng et al., 2018). The ability of DNs to reach super-human performance has added great momentum to the deployment of those techniques in real-life applications, e.g., self-driving cars (Rao & Frtunikj, 2018), fire prediction (Kim & Lee, 2019) or seismic activity prediction (Seydoux et al., 2020). The main ingredients behind this success roughly fall into three camps: (i) specialized optimization algorithms such as Adam (Kingma & Ba, 2014) coupled with carefully designed learning rate schedules (Smith, 2017) and/or mini-batch sizes (Devarakonda et al., 2017; Wang et al., 2017); (ii) novel regularization techniques, which can be explicit as with orthogonal/sparse weight penalties (Cohen & Welling, 2016; Jia et al., 2019; Wang et al., 2020) or implicit as with dropout (Molchanov et al., 2017) or data augmentation (Taylor & Nitschke, 2017; Perez & Wang, 2017; Hernández-García & König, 2018; Shorten & Khoshgoftaar, 2019); and (iii) adaptivity of the DN architecture to the data and task at hand, thanks to expert knowledge (Graves et al., 2013; Seydoux et al., 2020) or automated Neural Architecture Search (Elsken et al., 2019).
Despite DNs’ astonishing performance, their practical deployment in the real world remains hazardous, requiring tremendous tweaking and the implementation of various safeguard measures (Marcus, 2018), not only to avoid failure cases that can lead to injuries or death, but also to ensure fairness and unbiased behaviors (Gianfrancesco et al., 2018). One path for overcoming those challenges in a more principled way is to develop novel, specialized methods for each part of the deep learning pipeline (optimization/regularization/architecture).
Recent techniques aiming at tackling those shortcomings have been proposed. Most of them rely on Jacobian-vector products (JVPs) (Hirsch & Smale, 1974), defined as the projection of a given vector onto the Jacobian matrix of an operator, i.e., the DN in our case. The JVP captures crucial information on the local geometry of the DN input-output mapping, which is one of the main reasons behind its popularity. For example, Dehaene et al. (2020) leverage the JVP to better control the denoising/data-generation process of autoencoders via explicit manifold projections (generated samples are repeatedly projected onto the autoencoder Jacobian matrix); Balestriero & Baraniuk (2020) assess the sensitivity of DNs to adversarial examples (noise realizations are projected onto the DN Jacobian matrix); Cosentino et al. (2020) guarantee the generalization performance of DNs based on a novel JVP-based measure (neighboring data samples are projected onto the DN Jacobian matrix and compared); Márquez-Neila et al. (2017) train DNs with explicit input-output constraints (the constraints are cast into JVPs and evaluated as part of the DN training procedure). Beyond those recent developments, JVPs have a long history of being employed to probe models and obtain a precise understanding of their properties, e.g., input sensitivity, intrinsic dimension, or conditioning (Gudmundsson et al., 1995; Bentbib & Kanber, 2015; Balestriero et al., 2020). To enable the deployment of these techniques in practical scenarios, it is crucial to evaluate JVPs as fast as possible and without relying on automatic differentiation, a feature primarily implemented in development libraries but often missing in software deployed in production.
In this paper, we propose a novel method to evaluate JVPs that is inspired by recent theoretical studies of DNs employing Continuous Piecewise Affine (CPA) nonlinearities and by Forward-mode Differentiation (FD). In short, the CPA property allows FD to be evaluated simply by feeding two inputs to a DN while holding the DN's internal state fixed, and combining the two produced outputs such that the final result corresponds to the desired JVP. Our solution offers the following benefits:
• Speed: faster on average than the fastest alternative solution, with empirical validation on different DN architectures (including Recurrent Neural Networks, ResNets, DenseNets, VGGs, UNets) and with speed-ups consistent across hardware.
• Simple implementation: implementing our solution for any DN requires changing only a few lines of code in total, regardless of the architecture being considered. This allows our method to be employed inside an architecture search loop if needed.
• Does not employ automatic differentiation: our solution can be implemented in any software/library, including ones that are not equipped with automatic differentiation, allowing deployment of the method on production-specific hardware/software.
While our method is simplest to implement for CPA DNs (a constraint embracing most current state-of-the-art architectures), we also demonstrate how to employ it for smooth DN layers such as batch-normalization, which is CPA only at test time.
This paper is organized as follows: Sec. 2 reviews current JVP solutions and thoroughly presents notations for CPA DNs; Sec. 3 presents our proposed solution, with a detailed implementation provided in Sec. 3.2 and careful empirical validations performed in Sec. 3.3; Sec. 4 presents further results on how JVPs are crucial to evaluate many insightful statistics of CPA DNs, such as the largest eigenvalues of the DN Jacobian matrix.
2 Jacobian-Vector Products and Continuous Piecewise Affine Deep Networks
Jacobian-Vector Products (JVPs). The JVP operation, also called Rop (short for right-operation), is abbreviated as $\mathrm{Rop}(f,x,u)$ and computes the directional derivative, with direction $u \in \mathbb{R}^D$, of the multi-dimensional operator $f:\mathbb{R}^D \to \mathbb{R}^K$ with respect to the input $x$ as

$\mathrm{Rop}(f,x,u) \triangleq \lim_{\epsilon \to 0} \frac{f(x+\epsilon u) - f(x)}{\epsilon} = J_f(x)\,u. \qquad (1)$

Throughout our study, we will assume that $f$ is differentiable at $x$; $J_f(x) \in \mathbb{R}^{K \times D}$ denotes the Jacobian of $f$ evaluated at $x$. The Rop defines a linear operator from $\mathbb{R}^D$ to $\mathbb{R}^K$ with respect to the direction argument; additional details can be found in Appendix A. Throughout this paper we will use the terms JVP and Rop interchangeably. The vector-Jacobian product (VJP) operation, also called Lop, is abbreviated as $\mathrm{Lop}(f,x,v)$ with direction $v \in \mathbb{R}^K$ and computes the adjoint directional derivative

$\mathrm{Lop}(f,x,v) \triangleq v^{\top} J_f(x). \qquad (2)$
For a thorough review of those operations, we refer the reader to Paszke et al. (2017); Baydin et al. (2017); Elliott (2018); throughout this paper we will use the terms VJP and Lop interchangeably.
In practice, evaluating the Rop of a differentiable operator $f$ can be done naturally through forward-mode automatic differentiation, which builds $J_f(x)\,u$ while forwarding $x$ through the computational graph of $f$. However, it can also be done by employing backward-mode automatic differentiation (Linnainmaa, 1976), i.e., traversing the computational graph of $f$ from output to input, via multiple Lop calls as in

$\mathrm{Rop}(f,x,u) = \big[\langle \mathrm{Lop}(f,x,e_1), u\rangle, \dots, \langle \mathrm{Lop}(f,x,e_K), u\rangle\big]^{\top}, \qquad (3)$

where $K$ is the length of the output vector of the mapping $f$ and $e_k$ denotes the $k$-th element of the canonical basis of $\mathbb{R}^K$. Practical implementations of the above often resort to parallel computation of each $\mathrm{Lop}(f,x,e_k)$, forming the rows of the Jacobian matrix, reducing computation time at the cost of greater memory requirements. Alternatively, an interesting computational trick recently brought to light in Jamie et al. (2017) allows one to compute the Rop from two Lop calls only (instead of $K$) as follows:

$\mathrm{Rop}(f,x,u) = \mathrm{Lop}\big(v \mapsto \mathrm{Lop}(f,x,v),\, v,\, u\big)^{\top}; \qquad (4)$

a detailed derivation can be found in Appendix B. In short, the inner $\mathrm{Lop}(f,x,v)$ represents the mapping of $v$ onto the linear operator $J_f(x)^{\top}$; hence, being linear in $v$, it can itself be differentiated w.r.t. $v$ and evaluated with direction $u$, leading to the above equality. This alternative strategy might not be of great interest for theoretical studies, since the asymptotic time-complexities of Rop and Lop are similar (proportional to the number of operations in the computational graph (Griewank & Walther, 2008)); which of the above strategies provides the fastest computation in practice depends on the type of operations (input/output shapes) present in the computational graph. And while Rop and Lop can be combined, the general problem of optimal Jacobian accumulation (finding the most efficient computation) is NP-complete (Naumann, 2008). Note that current deep learning software/hardware heavily focuses on the Lop, as backpropagation (the gradient-based DN training strategy) is a special case of the latter (Rumelhart et al., 1986).
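To make the trick of (4) concrete, below is a minimal TensorFlow sketch (our own illustration, not code from the paper); the function name jvp_via_double_vjp and the toy operator are ours, and only standard tf.GradientTape calls are used.

```python
import tensorflow as tf

def jvp_via_double_vjp(f, x, u):
    """Sketch of Eq. (4): obtain Rop(f, x, u) = J_f(x) u from two Lop (VJP) calls."""
    with tf.GradientTape() as outer:
        with tf.GradientTape() as inner:
            inner.watch(x)
            y = f(x)
        v = tf.zeros_like(y)       # dummy cotangent; v -> v^T J_f(x) is linear in v
        outer.watch(v)
        g = inner.gradient(y, x, output_gradients=v)   # first Lop: J_f(x)^T v
    return outer.gradient(g, v, output_gradients=u)    # second Lop: J_f(x) u

# Toy check with f(x) = (sum(x^2), sum(sin(x))), so J_f(x) u = (<2x, u>, <cos(x), u>).
f = lambda x: tf.stack([tf.reduce_sum(x ** 2), tf.reduce_sum(tf.sin(x))])
x = tf.constant([1.0, 2.0, 3.0])
u = tf.constant([0.1, 0.0, -0.2])
print(jvp_via_double_vjp(f, x, u))
```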
Deep Networks (DNs). A DN is an operator $f$ that maps an observation $x \in \mathbb{R}^D$ (we consider vectors, matrices and tensors as flattened vectors) to a prediction $y \in \mathbb{R}^K$. This prediction is a nonlinear transformation of the input built through a cascade of linear and nonlinear operators (LeCun et al., 2015). Those operators include affine operators such as the fully connected operator (simply an affine transformation defined by a weight matrix $W$ and a bias vector $b$) and the convolution operator (with a circulant $W$), and nonlinear operators such as the activation operator (applying a scalar nonlinearity such as the ubiquitous ReLU) or the max-pooling operator; precise definitions of these operators can be found in Goodfellow et al. (2016).
It will become convenient for our development to group the succession of operators forming a DN into layers in order to express the entire DN as a composition of $L$ intermediate layers $f_1, \dots, f_L$, as in

$f(x) = (f_L \circ f_{L-1} \circ \dots \circ f_1)(x), \qquad (5)$

where each $f_\ell$ maps an input vector of length $D_{\ell-1}$ to an output vector of length $D_\ell$ (with $D_0 = D$ and $D_L = K$). We formally specify how to perform this grouping to ensure that any DN has a unique layer decomposition.
Definition 1 (Deep Network Layer).
A DN layer $f_\ell$ is the composition of a single nonlinear operator with the linear operators (if any) lying between it and the preceding nonlinear operator (if any).
We also explicitly denote the up-to-layer-$\ell$ mapping as

$z_\ell(x) = (f_\ell \circ \dots \circ f_1)(x),$

with initialization $z_0(x) = x$; those intermediate representations are denoted as feature maps.
Continuous Piecewise Affine (CPA) DNs. Whenever a DN employs CPA nonlinearities throughout its layers, the entire input-output mapping is a (continuous) affine spline with a partition $\Omega$ of its domain $\mathbb{R}^D$ (the input space of the DN) and a per-region affine mapping (for more background on splines we refer the reader to De Boor et al. (1978); Schumaker (2007); Bojanov et al. (2013); Pascanu et al. (2013)) such that

$f(x) = \sum_{\omega \in \Omega} \big(A_{\omega}\, x + b_{\omega}\big)\, 1_{\{x \in \omega\}}, \qquad (6)$

where the per-region slope and bias parameters $(A_\omega, b_\omega)$ are functions of the DN architecture and parameters. Given an observation $x$, the affine parameters of the region $\omega \ni x$ from (6) can be obtained explicitly as follows.
1. For each layer $\ell$ (recall Def. 1), compose the possibly multiple linear operators into a single one with slope matrix $W_\ell$ and bias $b_\ell$ (any composition of linear operators can be expressed this way thanks to composition and the distributive law) and denote the layer's pre-activation as

$h_\ell(x) = W_\ell\, z_{\ell-1}(x) + b_\ell; \qquad (7)$

in the absence of linear operators in a layer, set $W_\ell$ to the identity matrix and $b_\ell$ to the zero vector.

2. For each layer, encode the states of the layer's activation operator into the following diagonal matrix

$[Q_\ell(x)]_{i,i} = \begin{cases} 1, & [h_\ell(x)]_i > 0,\\ \alpha, & [h_\ell(x)]_i \leq 0, \end{cases} \qquad (8)$

where $\alpha$ equals the negative-slope coefficient (leaky-ReLU), $0$ (ReLU), or $-1$ (absolute value); for the max-pooling operator, see Appendix C. $[\cdot]_{i,j}$ is used to access a specific entry of a matrix/vector, and for clarity we will often omit the explicit dependencies in (8) and simply write $Q_\ell$.

3. Combine (7) and (8) to obtain $f_\ell(z_{\ell-1}(x)) = Q_\ell(x)\big(W_\ell\, z_{\ell-1}(x) + b_\ell\big)$ and, by composing the $L$ layers, $f(x) = A_\omega\, x + b_\omega$
with $A_\omega = Q_L(x) W_L \cdots Q_1(x) W_1$ and $b_\omega = \sum_{\ell=1}^{L} \big(Q_L(x) W_L \cdots Q_{\ell+1}(x) W_{\ell+1}\big)\, Q_\ell(x)\, b_\ell$.
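To make the three steps above concrete, here is a minimal sketch (a toy two-layer ReLU network of our own, not the paper's code) that forms $A_\omega$ and $b_\omega$ and checks that $f(x) = A_\omega x + b_\omega$:

```python
import tensorflow as tf

tf.random.set_seed(1)
D, H, K = 4, 5, 3
W1, b1 = tf.random.normal((H, D)), tf.random.normal((H,))   # layer 1 (dense + ReLU)
W2, b2 = tf.random.normal((K, H)), tf.random.normal((K,))   # layer 2 (dense, linear)
x = tf.random.normal((D,))

# Steps 1-2: pre-activation and state matrix Q (Eqs. (7)-(8)); ReLU means alpha = 0.
h1 = tf.linalg.matvec(W1, x) + b1
Q1 = tf.linalg.diag(tf.cast(h1 > 0, tf.float32))
Q2 = tf.eye(K)                                   # last layer has no nonlinearity here

# Step 3: per-region slope and bias of the region containing x.
A = Q2 @ W2 @ Q1 @ W1
b = tf.linalg.matvec(Q2 @ W2 @ Q1, b1) + b2

f_x = tf.linalg.matvec(W2, tf.nn.relu(h1)) + b2  # plain forward pass
print(tf.reduce_max(tf.abs(tf.linalg.matvec(A, x) + b - f_x)).numpy())   # ~0
```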
Recently, many studies have successfully exploited the CPA property of current DNs to derive novel insights and results such as bounding the number of partition regions (Montufar et al., 2014; Arora et al., 2016; Rister & Rubin, 2017; Serra et al., 2018; Hanin & Rolnick, 2019), characterizing the geometry of the partitioning (Balestriero et al., 2019; Robinson et al., 2019; Gamba et al., 2020); and quantifying the expressive power of CPA DNs as function approximators (Unser, 2019; Bartlett et al., 2019; Nadav et al., 2020; Grigsby & Lindsey, 2020).
We propose to follow the same path and exploit the CPA property of DNs to obtain a fast and architecture-independent method to evaluate Jacobian-vector products of DNs that does not require any flavor of automatic differentiation. Our method allows speeding up recent methods employing JVPs and easing their deployment.
3 Fast Jacobian-Vector Product for Continuous Piecewise Affine Deep Networks
As mentioned above, the core of our solution rests on the assumption that the employed DN's nonlinearities are CPA. We first describe our solution and its implementation, and then conclude with thorough empirical validations and with solutions to extend our technique to smooth DNs.
3.1 Deep Network Cloning
When employing CPA nonlinearities, the DN input-output mapping can be expressed as a per-region affine mapping (recall (6)). This key property greatly simplifies the form of the explicit Rop (or JVP) of a DN taken with respect to its input, given an observation $x$ and a directional vector $u$. In fact, we directly have

$J_f(x)\,u = A_{\omega(x)}\,u, \qquad (9)$

where we denote by $\omega(x)$ the region of $\Omega$ enclosing $x$. In the unrealistic setting where one would a priori know and store the $A_\omega$ matrices for each region $\omega \in \Omega$, and instantaneously infer the region $\omega(x)$, the evaluation of (9) would be performed in $O(KD)$ time. This complexity could be reduced in the presence of low-rank (or otherwise peculiar) matrices; such cases do not occur with standard architectures and are thus omitted. In the more realistic scenario, however, direct evaluation of (9) is not possible: each matrix $A_\omega$ tends to be large (and not sparse), with a large input dimension $D$, an output dimension $K$ that can itself be large for classification/regression tasks and as large as $D$ for autoencoders, and a number of regions on the order of hundreds of thousands even for small DNs (Cosentino et al., 2020), which keeps increasing with the DN's number of nodes/layers. As a result, all current solutions rely on automatic differentiation, which constructs only the single JVP of interest, on demand, based on the given observation $x$ and its propagation through the DN layers.
Instead, we propose an alternative solution that does not rely on automatic differentiation. For clarity, we detail our approach first when the JVP is taken with respect to the DN input, and second when the JVP is taken with respect to the parameters of a layer.
JVPs of DN inputs. Recalling (6), we know that forwarding an observation $x$ through the DN layers produces the output $f(x) = A_{\omega(x)}\, x + b_{\omega(x)}$. Now, suppose that we possessed the ability to freeze, or clone, the layers such that the internal $Q_\ell$ matrices (recall (8)) remain the same regardless of the presented input; denote such a DN cloned at $x$ by $f_x$. Then, in all simplicity, by using (6) we obtain the following equality

$f_x(u) - f_x(0) = \big(A_{\omega(x)}\, u + b_{\omega(x)}\big) - b_{\omega(x)} = J_f(x)\,u, \qquad (10)$

producing the desired JVP. Note that the user only needs to perform forward passes through the cloned DN $f_x$, without the need to explicitly compute $A_{\omega(x)}$: the JVP is gradually computed through the forward pass. Computationally, (10) requires only forward passes through the DN, which we group into two steps. Step 1: forward the observation $x$ through all the layers to observe the $Q_\ell$ matrices. Step 2: evaluate the cloned DN at $u$ and at $0$ using the observed matrices from step 1. We depict this cloning and propagation procedure in Fig. 1.
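The following toy sketch (ours; the names and the two-layer ReLU network are illustrative) implements the two steps of (10) and cross-checks the result against TensorFlow's built-in forward-mode JVP:

```python
import tensorflow as tf

tf.random.set_seed(0)
D, H, K = 6, 8, 3
W1, b1 = tf.random.normal((H, D)), tf.random.normal((H,))
W2, b2 = tf.random.normal((K, H)), tf.random.normal((K,))

def forward(z, mask=None):
    """Forward pass; if `mask` is given, the ReLU states are cloned from it."""
    h1 = tf.linalg.matvec(W1, z) + b1
    q1 = tf.cast(h1 > 0, h1.dtype) if mask is None else mask
    return tf.linalg.matvec(W2, q1 * h1) + b2, q1

x, u = tf.random.normal((D,)), tf.random.normal((D,))
_, q1 = forward(x)                        # Step 1: observe the states at x
fu, _ = forward(u, q1)                    # Step 2: cloned pass at u ...
f0, _ = forward(tf.zeros_like(x), q1)     # ... and at 0
jvp_clone = fu - f0                       # Eq. (10): equals J_f(x) u

# Cross-check against TensorFlow's forward-mode automatic differentiation.
with tf.autodiff.ForwardAccumulator(primals=x, tangents=u) as acc:
    y, _ = forward(x)
print(tf.reduce_max(tf.abs(jvp_clone - acc.jvp(y))).numpy())   # ~0 up to float error
```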
JVP of DN layer weights. We now consider the evaluation of a DN's JVP taken with respect to one of the layers' affine parameters, i.e., weights, say $W_\ell$ for some layer $\ell$. Let us first define an auxiliary DN mapping that takes as input the value of the weight of interest ($W_\ell$ in this case). This leads to the following mapping

$g_x(w) \triangleq (f_L \circ \dots \circ f_{\ell+1})\big(\sigma_\ell(R_x\, w + b_\ell)\big), \quad \text{with } g_x(\mathrm{vec}(W_\ell)) = f(x), \qquad (11)$

where $w$ is the flattened version of $W_\ell$, $\sigma_\ell$ denotes layer $\ell$'s nonlinearity, and $R_x$ is the matrix such that projecting $w$ onto it is equivalent to projecting $z_{\ell-1}(x)$ onto $W_\ell$, i.e., $R_x\,\mathrm{vec}(W_\ell) = W_\ell\, z_{\ell-1}(x)$. If $W_\ell$ is a dense matrix then $R_x$ is block diagonal with $D_\ell$ rows and $D_\ell D_{\ell-1}$ columns, each diagonal block being $z_{\ell-1}(x)^{\top}$ (that is, $R_x = I_{D_\ell} \otimes z_{\ell-1}(x)^{\top}$);
if $W_\ell$ were a convolutional layer parameter, then the matrix $R_x$ would encode the patch translations corresponding to the application of the filter. Now, it is clear that taking the JVP of $f$ w.r.t. $W_\ell$ can be done exactly as per (10), since the flattened layer parameter $w$ can be seen as the DN input, leading to

$J_{W_\ell} f(x)\,\mathrm{vec}(U) = \bar{g}_x(\mathrm{vec}(U)) - \bar{g}_x(0), \qquad (12)$

where $U$ is the direction of the JVP and has the same shape as $W_\ell$; in this case, $\bar{g}_x$ represents $g_x$ with the nonlinearity states cloned from the DN evaluated with input $x$ and weight $W_\ell$. Furthermore, thanks to the equality of (11), one does not even need to actually form $R_x$, since performing the usual layer forward propagation with the given parameter and the layer input $z_{\ell-1}(x)$ is equivalent. As a result, regardless of the type of layer, taking the JVP w.r.t. a layer parameter is done simply by (i) propagating $x$ through the DN with $W_\ell$ to observe all the $Q_\ell$ matrices, (ii) propagating the same layer input with $U$ and with $0$ in place of the layer's parameter of interest (keeping all the nonlinearities frozen), and (iii) computing the difference between the two produced quantities. We depict this cloning procedure in Fig. 2.
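Steps (i)-(iii) can be sketched on a toy dense layer as follows (our own illustration, not the paper's code; the reference check uses TensorFlow's forward-mode accumulator):

```python
import tensorflow as tf

tf.random.set_seed(2)
D, H, K = 4, 6, 2
W1 = tf.Variable(tf.random.normal((H, D)))
b1 = tf.random.normal((H,))
W2, b2 = tf.random.normal((K, H)), tf.random.normal((K,))
x = tf.random.normal((D,))
U = tf.random.normal((H, D))              # JVP direction, same shape as W1

def forward(weight, q1=None):
    """Two-layer network seen as a function of the first layer's weight."""
    h1 = tf.linalg.matvec(weight, x) + b1
    q1 = tf.cast(h1 > 0, h1.dtype) if q1 is None else q1
    return tf.linalg.matvec(W2, q1 * h1) + b2, q1

_, q1 = forward(W1)                       # (i) observe the states with the true weight
gU, _ = forward(U, q1)                    # (ii) frozen pass with U ...
g0, _ = forward(tf.zeros_like(U), q1)     # ... and with 0 in place of the weight
jvp_clone = gU - g0                       # (iii) difference, as in Eq. (12)

# Cross-check against TensorFlow's forward-mode JVP w.r.t. W1.
with tf.autodiff.ForwardAccumulator(primals=W1, tangents=U) as acc:
    y, _ = forward(W1)
print(tf.reduce_max(tf.abs(jvp_clone - acc.jvp(y))).numpy())   # ~0 up to float error
```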
We now formalize those two cases in the following theorem, which supports the main results of our study; the proof, as well as many additional examples and details, is provided in Appendix G.1, while implementation details and experiments are provided in the following sections.
Theorem 1.
Evaluating the JVP of a DN $f$ w.r.t. its input at $x$ with direction $u$ (recall (1)), and evaluating the JVP of $f$ at $x$ w.r.t. its layer parameter $W_\ell$ with direction $U$, can be done with

$J_f(x)\,u = f_x(u) - f_x(0) \quad\text{and}\quad J_{W_\ell} f(x)\,\mathrm{vec}(U) = \bar{g}_x(\mathrm{vec}(U)) - \bar{g}_x(0),$

respectively, i.e., with forward passes through the cloned DN only.
Note that, in practice, the forward passes are done synchronously in a single pass by concatenating the inputs into a single mini-batch. Leveraging the highly parallel computations enabled by GPUs, Thm. 1 enables rapid evaluation of JVPs without requiring any automatic differentiation, as we detail in the following section.
3.2 Implementation
We describe in detail the implementation of our method, which only takes the form of a few lines of code. We focus here on taking the JVP of a DN with respect to its input as given by (10), using TensorFlow (Abadi et al., 2015) for concreteness, and refer the reader to Appendix E for the case of the JVP taken with respect to layer weights (12) and for further details and discussion of the implementation.

As explained in the previous section, our method requires forward-passing three different DN inputs, $x$, $u$ and $0$ (recall (10)), and cloning the nonlinearity states ($Q_\ell$) observed during the forward pass of $x$ for the forward passes of $u$ and $0$. This is implemented as follows:
1. Given a DN of interest $f$, replace all the nonlinearities (e.g., tf.nn.relu and tf.nn.max_pool) with the custom ones provided in Algo. 3, leading to a custom DN denoted as F;
2. Propagate through this custom DN a single input formed as X=tf.concat([x,u,tf.zeros_like(x)], 0), which concatenates the three needed inputs over the mini-batch dimension; the shapes of x and u must be identical, and one can process a single or multiple JVPs at once based on the shapes of those inputs;
3. Obtain the needed quantities to compute the JVP from (10) as $f_x(u)$=F(X)[len(x):2*len(x)] and $f_x(0)$=F(X)[2*len(x):]; the usual DN output $f(x)$=F(X)[:len(x)] is computed as a side effect. A minimal sketch of such a custom cloned nonlinearity is given right after this list.
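For illustration, here is what such a custom cloned ReLU could look like (this is our own simplified sketch, not the paper's Algo. 3; clone_relu is a hypothetical name, and the mini-batch is assumed to be the concatenation [x, u, 0] from step 2):

```python
import tensorflow as tf

def clone_relu(h):
    """Cloned ReLU: the states are computed on the first third of the mini-batch
    (the x samples) and reused, frozen, on the other two thirds (the u and 0 samples).
    A similar cloning applies to max-pooling via the argmax indices of the first third."""
    n = tf.shape(h)[0] // 3
    mask = tf.cast(h[:n] > 0, h.dtype)            # states Q observed on x
    return tf.concat([mask, mask, mask], 0) * h   # same frozen states for u and 0

# Usage with a DN F built with clone_relu in place of tf.nn.relu:
#   X = tf.concat([x, u, tf.zeros_like(x)], 0)
#   Y = F(X)
#   jvp = Y[len(x):2 * len(x)] - Y[2 * len(x):]   # Eq. (10)
```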
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | |
Quadro RTX 8000 | RNN | 5M | 0.264 | - | 0.233 | 0.036 | 6.47 | - | - | 0.900 | 0.129 | 6.98 |
UNet | 49M | - | 0.047 | 0.390 | 0.024 | 1.63 | - | 0.059 | 0.037 | 0.022 | 1.68 | |
VGG16 | 138M | 0.038 | 0.014 | 0.014 | 0.005 | 2.80 | - | 0.168 | 0.158 | 0.046 | 3.43 | |
VGG19 | 143M | 0.048 | 0.020 | 0.017 | 0.006 | 2.83 | - | 0.210 | 0.193 | 0.059 | 3.27 | |
Inception V3 | 23M | 0.037 | 0.030 | 0.025 | 0.016 | 1.56 | - | 0.028 | 0.027 | 0.019 | 1.42 | |
Resnet50 | 25M | 0.030 | 0.019 | 0.015 | 0.012 | 1.25 | - | 0.028 | 0.020 | 0.018 | 1.12 | |
Resnet101 | 44M | 0.051 | 0.029 | 0.024 | 0.022 | 1.09 | - | 0.038 | 0.036 | 0.030 | 1.13 | |
Resnet152 | 60M | 0.680 | 0.037 | 0.039 | 0.029 | 1.28 | - | 0.053 | 0.049 | 0.036 | 1.36 | |
EfficientNet B0 | 5M | 0.053 | 0.023 | 0.018 | 0.013 | 1.38 | - | 0.032 | 0.021 | 0.015 | 1.40 | |
EfficientNet B1 | 7M | 0.073 | 0.028 | 0.024 | 0.018 | 1.33 | - | 0.032 | 0.025 | 0.019 | 1.32 | |
Densenet121 | 8M | 0.040 | 0.036 | 0.033 | 0.022 | 1.50 | - | 0.036 | 0.033 | 0.025 | 1.32 | |
Densenet169 | 14M | 0.054 | 0.046 | 0.034 | 0.026 | 1.31 | - | 0.047 | 0.037 | 0.028 | 1.32 | |
Densenet202 | 17M | 0.070 | 0.056 | 0.042 | 0.033 | 1.27 | - | 0.055 | 0.043 | 0.033 | 1.30 | |
arithmetic average speedup from fastest alternative: | 1.98 | 2.08 | ||||||||||
geometric average speedup from fastest alternative: | 1.71 | 1.74 | ||||||||||
TITAN X (Pascal) | RNN | 5M | - | - | 0.220 | 0.039 | 5.64 | - | - | - | 0.134 | |
UNet | 49M | - | 0.083 | 0.069 | 0.04 | 1.72 | - | 0.118 | 0.084 | 0.057 | 1.47 | |
VGG16 | 138M | 0.059 | 0.026 | 0.019 | 0.012 | 1.58 | - | 0.122 | 0.088 | 0.062 | 1.42 | |
VGG19 | 143M | 0.072 | 0.034 | 0.026 | 0.017 | 1.53 | - | 0.133 | 0.101 | 0.077 | 1.31 | |
Inception V3 | 23M | 0.043 | 0.033 | 0.028 | 0.017 | 1.65 | - | 0.038 | 0.033 | 0.019 | 1.74 | |
Resnet50 | 25M | 0.050 | 0.019 | 0.016 | 0.011 | 1.45 | - | 0.022 | 0.017 | 0.013 | 1.31 | |
Resnet101 | 44M | 0.069 | 0.032 | 0.029 | 0.02 | 1.45 | - | 0.041 | 0.032 | 0.026 | 1.23 | |
Resnet152 | 60M | 0.091 | 0.051 | 0.041 | 0.028 | 1.46 | - | 0.057 | 0.046 | 0.033 | 1.39 | |
EfficientNet B0 | 5M | 0.064 | 0.025 | 0.021 | 0.01 | 2.10 | - | 0.030 | 0.024 | 0.012 | 2.00 | |
EfficientNet B1 | 7M | 0.086 | 0.035 | 0.028 | 0.014 | 2.00 | - | 0.040 | 0.033 | 0.016 | 2.06 | |
Densenet121 | 8M | 0.055 | 0.046 | 0.036 | 0.022 | 1.64 | - | 0.055 | 0.040 | 0.025 | 1.60 | |
Densenet169 | 14M | 0.077 | 0.071 | 0.049 | 0.029 | 1.69 | - | 0.076 | 0.055 | 0.033 | 1.67 | |
Densenet202 | 17M | 0.094 | 0.083 | 0.051 | 0.035 | 1.46 | - | 0.087 | 0.064 | 0.04 | 1.60 | |
arithmetic average speedup from fastest alternative: | 1.95 | 1.57 | ||||||||||
geometric average speedup from fastest alternative: | 1.79 | 1.49 |
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | |
Quadro RTX 8000 | RNN | 5M | 0.245 | - | 0.227 | 0.047 | 4.83 | - | - | 0.877 | 0.163 | 5.38 |
UNet | 49M | - | 0.048 | 0.04 | 0.023 | 1.74 | - | 0.062 | 0.042 | 0.022 | 1.91 | |
VGG16 | 138M | 0.036 | 0.014 | 0.014 | 0.005 | 2.80 | - | 0.168 | 0.157 | 0.047 | 3.34 | |
VGG19 | 143M | 0.046 | 0.021 | 0.016 | 0.006 | 2.67 | - | 0.205 | 0.191 | 0.055 | 3.47 | |
Inception V3 | 23M | 0.032 | 0.033 | 0.023 | 0.017 | 1.35 | - | 0.031 | 0.026 | 0.018 | 1.44 | |
Resnet50 | 25M | 0.027 | 0.019 | 0.015 | 0.012 | 1.25 | - | 0.028 | 0.021 | 0.019 | 1.11 | |
Resnet101 | 44M | 0.051 | 0.034 | 0.026 | 0.021 | 1.24 | - | 0.038 | 0.032 | 0.029 | 1.10 | |
Resnet152 | 60M | 0.059 | 0.039 | 0.035 | 0.028 | 1.25 | - | 0.052 | 0.042 | 0.035 | 1.20 | |
EfficientNet B0 | 5M | 0.051 | 0.022 | 0.018 | 0.013 | 1.38 | - | 0.023 | 0.020 | 0.015 | 1.33 | |
EfficientNet B1 | 7M | 0.072 | 0.028 | 0.024 | 0.016 | 1.50 | - | 0.028 | 0.027 | 0.018 | 1.50 | |
Densenet121 | 8M | 0.039 | 0.036 | 0.031 | 0.024 | 1.29 | - | 0.038 | 0.030 | 0.024 | 1.25 | |
Densenet169 | 14M | 0.062 | 0.055 | 0.035 | 0.028 | 1.25 | - | 0.048 | 0.035 | 0.028 | 1.25 | |
Densenet201 | 17M | 0.064 | 0.055 | 0.041 | 0.031 | 1.32 | - | 0.062 | 0.042 | 0.037 | 1.14 | |
arithmetic average speedup from fastest alternative: | 1.84 | 1.96 | ||||||||||
geometric average speedup from fastest alternative: | 1.66 | 1.69 | ||||||||||
TITAN X (Pascal) | RNN | 5M | - | - | 0.219 | 0.045 | 4.87 | - | - | - | 0.153 | |
UNet | 49M | - | 0.084 | 0.068 | 0.039 | 1.74 | - | 0.114 | 0.079 | 0.059 | 1.34 | |
VGG16 | 138M | 0.059 | 0.024 | 0.019 | 0.012 | 1.58 | - | 0.114 | 0.085 | 0.062 | 1.37 | |
VGG19 | 143M | 0.072 | 0.032 | 0.025 | 0.016 | 1.56 | - | 0.137 | 0.103 | 0.089 | 1.16 | |
Inception V3 | 23M | 0.041 | 0.033 | 0.029 | 0.017 | 1.71 | - | 0.039 | 0.035 | 0.021 | 1.67 | |
Resnet50 | 25M | 0.048 | 0.019 | 0.015 | 0.011 | 1.36 | - | 0.023 | 0.021 | 0.015 | 1.4 | |
Resnet101 | 44M | 0.067 | 0.035 | 0.030 | 0.019 | 1.58 | - | 0.040 | 0.037 | 0.029 | 1.28 | |
Resnet152 | 60M | 0.090 | 0.053 | 0.042 | 0.028 | 1.50 | - | 0.057 | 0.046 | 0.036 | 1.28 | |
EfficientNet B0 | 5M | 0.063 | 0.025 | 0.020 | 0.010 | 2.00 | - | 0.029 | 0.025 | 0.012 | 2.08 | |
EfficientNet B1 | 7M | 0.092 | 0.034 | 0.029 | 0.014 | 2.07 | - | 0.042 | 0.034 | 0.018 | 1.89 | |
Densenet121 | 8M | 0.054 | 0.046 | 0.035 | 0.021 | 1.67 | - | 0.058 | 0.042 | 0.025 | 1.68 | |
Densenet169 | 14M | 0.075 | 0.068 | 0.048 | 0.029 | 1.66 | - | 0.075 | 0.057 | 0.033 | 1.72 | |
Densenet201 | 17M | 0.089 | 0.081 | 0.057 | 0.035 | 1.63 | - | 0.086 | 0.066 | 0.039 | 1.69 | |
arithmetic average speedup factor from fastest alternative: | 1.92 | 1.55 | ||||||||||
geometric average speedup from fastest alternative: | 1.80 | 1.47 |
As clarified by the above implementation walk-through, our method offers two great advantages. First, the details and peculiarities of the considered DN do not impact nor complicate the implementation of our method: all that needs to be done is to replace the nonlinearity functions in the model declaration with the provided ones, leaving everything else untouched. Second, our method can be implemented in languages/libraries that are not equipped with automatic differentiation. This is an important benefit, as it allows deployment on platforms that are not focused on deep learning. Furthermore, specifically designed chips (Ando et al., 2017) can employ our solution with no additional difficulty beyond implementing a standard DN forward pass. The next section provides various experiments comparing the speed of our method to alternative solutions.
3.3 Speed Validation Across Architectures and Hardware

We first propose in Fig. 3 to depict the slow-down incurred by using the custom nonlinearities from Algo. 3, which process the concatenated input $[x, u, 0]$, instead of the default nonlinearities with input $x$, for different input shapes and mini-batch sizes. As discussed in Sec. 3.2, despite the need to clone the nonlinearity matrices and to process three times more inputs, we observe that the slow-down factor remains limited, thanks to the highly parallel platform offered by GPUs. This serves as a good indicator to practitioners that the computation time of the JVP evaluation proposed in this paper will generally remain on par with the computation time needed to perform a forward pass through the original DN employing the default nonlinearities. We also report in Table 1 and Table 2 the computation times to evaluate the JVP of multiple DN architectures, taken with respect to their input and layer weights, respectively. We employ various architectures and also compare the performance of different GPUs (in all cases using a single GPU). We observe that our solution is the fastest, with speed-up factors varying based on the considered architecture, ranging from roughly 1 to almost 7. In addition to accelerating JVP evaluations, our solution enables JVP evaluation for models like RNNs, where the alternative methods fail to provide any result due to Out-Of-Memory failures, most likely caused by the overhead that automatic differentiation incurs from the recurrence mechanism.
4 Fast Jacobian-Vector Product Extensions
We now demonstrate how (fast) JVPs can be used to compute important quantities that have been shown to characterize DNs, e.g., the largest eigenvalues and eigenvectors of $A_{\omega(x)}$, the per-region slope matrix that produces the DN output (recall (6)). Recall that, for large models, extracting $A_{\omega(x)}$ entirely and then performing any sort of analysis is not feasible (a $K$ by $D$ matrix, with $D$ above $10^5$ for datasets à la ImageNet (Deng et al., 2009)). It is thus crucial to evaluate the desired quantities only through JVP calls.
4.1 Deep Network Spectroscopy

[Figure 4: convergence, as a function of the number of iterations, of the top eigenvalues and of the Frobenius norm estimate of a UNet's per-region slope matrix, computed through JVP calls only.]
First, we demonstrate how the DN cloning recipe allows one to compute the per-region affine parameters.
Proposition 1.
The DN cloning of (10) allows efficient computation of the per-region slope and bias parameters from (9) given a sample $x$ as

$[A_{\omega(x)}]_{\cdot,i} = f_x(e_i) - f_x(0), \; i = 1, \dots, D, \qquad b_{\omega(x)} = f_x(0),$

where $e_i$ represents the $i$-th element of the canonical basis of $\mathbb{R}^D$.
Another quantity that can easily be obtained from the DN cloning and Rop evaluation concerns the spectrum of the slope matrix $A_{\omega(x)}$, which has been theoretically studied for DNs in the context of data augmentation (LeJeune et al., 2019) and dropout (Gal & Ghahramani, 2016). It is possible to efficiently compute the top-k spectral decomposition of such matrices through iterative procedures, as given in Algo. 2 from Bentbib & Kanber (2015), leveraging the JVP, with the same asymptotic complexity as a simple forward propagation of an input through the DN (see Appendix H for the proof). An illustrative application on a UNet is provided in Fig. 4. This demonstrates the ability of such analyses to finally enable tractable computation of key quantities for DNs; it should enable principled empirical validation of various theoretical studies and open the door to novel interpretability and visualization techniques. A minimal sketch of such a matrix-free procedure is given below.
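As an illustration, below is a matrix-free sketch of such an iterative procedure (a plain subspace/block power iteration of our own, simpler than Algo. 2 of Bentbib & Kanber (2015)); the helper jvp_fn(V) is assumed to return $A_{\omega(x)} V$ column by column, e.g., by feeding the columns of V as a mini-batch of directions to the cloned DN of (10).

```python
import tensorflow as tf

def top_k_eigenvalues(jvp_fn, dim, k=3, iters=100):
    """Subspace (block power) iteration for the dominant eigenvalues of the square
    per-region slope matrix, using only matrix-free JVP calls."""
    V, _ = tf.linalg.qr(tf.random.normal((dim, k)))   # random orthonormal start
    for _ in range(iters):
        V, _ = tf.linalg.qr(jvp_fn(V))                # power step + re-orthonormalization
    T = tf.transpose(V) @ jvp_fn(V)                   # k x k Rayleigh quotient
    return tf.linalg.diag_part(T), V                  # approximate eigenvalues / subspace

# Toy usage on an explicit matrix standing in for A_{omega(x)}:
A = tf.random.normal((50, 50))
A = A @ tf.transpose(A)                               # symmetric, well-separated spectrum
vals, _ = top_k_eigenvalues(lambda V: A @ V, dim=50, k=3)
print(vals)
```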
4.2 Frobenius Norm Estimation
We also demonstrate how the JVP provides an efficient solution to estimate the Frobenius norm of the per-region slope matrix. This norm has been theoretically studied for DNs and was found to be an indicator of training failures of deep generative networks (Tanielian et al., 2020); rapid evaluation of it would thus allow practitioners to detect such cases. It can be done as per Gudmundsson et al. (1995) with

$\|A_{\omega(x)}\|_F^2 = D\; \mathbb{E}_u\big[\|A_{\omega(x)}\, u\|_2^2\big], \qquad (13)$

with $u$ a random variable uniformly distributed on the unit sphere, e.g., a normalized Gaussian vector (Devroye, 1986; Calafiore et al., 1999). Hence, (13) can easily be estimated via Monte-Carlo sampling by repeatedly performing JVPs, as we demonstrate in Fig. 4 and sketch below. While the above illustrates a particular case, the same recipe can be employed to estimate the trace and log-determinant of Hermitian positive semi-definite matrices (Saibaba et al., 2017), or to estimate condition numbers (Kenney et al., 1998). For an in-depth study of such numerical methods built upon JVPs, we refer the reader to Trefethen & Bau III (1997); Ipsen & Meyer (1998); Meyer (2000).
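A Monte-Carlo sketch of the estimator in (13) is given below (our own illustration; jvp_fn(u) is an assumed helper returning $A_{\omega(x)} u$, e.g., via the cloned DN of (10)).

```python
import tensorflow as tf

def frobenius_norm_estimate(jvp_fn, dim, n_samples=64):
    """Monte-Carlo estimate of ||A_{omega(x)}||_F from Eq. (13) using only JVP calls."""
    total = 0.0
    for _ in range(n_samples):
        u = tf.random.normal((dim,))
        u = u / tf.norm(u)                         # uniform direction on the unit sphere
        total += tf.reduce_sum(jvp_fn(u) ** 2)     # ||A u||_2^2
    return tf.sqrt(dim * total / n_samples)

# Toy usage on an explicit matrix standing in for A_{omega(x)}:
A = tf.random.normal((40, 30))
print(frobenius_norm_estimate(lambda u: tf.linalg.matvec(A, u), dim=30).numpy(),
      tf.norm(A).numpy())                          # the two values should be close
```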
5 Conclusions
Automatic differentiation and backpropagation are behind most, if not all, recent successes in deep learning, yielding not only state-of-the-art results but also fueling the discovery of new architectures, normalizations, and the like. This first wave of research relied almost exclusively on backward differentiation (gradient computation), but recently many techniques further pushing DN performance have been relying on Jacobian-vector products (JVPs). Current JVP evaluations rely on automatic differentiation, making them slower and less memory efficient than direct evaluation, which in turn would require a hand-written implementation for each architecture. We demonstrated in this paper that current DNs, which employ nonlinearities such as (leaky-)ReLU and max-pooling, can have their JVPs evaluated directly, without the need for automatic differentiation and without hard-coding each architecture, simply by changing a few lines of code in their implementation. We validated our proposed method against alternative implementations and demonstrated significant speed-ups, of a factor of almost two on average, over different architectures and hardware. We also demonstrated that fast JVPs will help the development of novel analyses and theoretical techniques for CPA DNs that require the evaluation of quantities such as the eigenvectors/eigenvalues of the per-region slope matrices.
6 Acknowledgments
This work was supported by NSF grants CCF-1911094, IIS-1838177, and IIS-1730574; ONR grants N00014-18-12571, N00014-20-1-2787, and N00014-20-1-2534; AFOSR grant FA9550-18-1-0478; and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047.
References
- Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
- Ando et al. (2017) Ando, K., Ueyoshi, K., Orimo, K., Yonekawa, H., Sato, S., Nakahara, H., Takamaeda-Yamazaki, S., Ikebe, M., Asai, T., Kuroda, T., et al. Brein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 tops at 0.6 w. IEEE Journal of Solid-State Circuits, 53(4):983–994, 2017.
- Arora et al. (2016) Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.
- Balestriero & Baraniuk (2020) Balestriero, R. and Baraniuk, R. G. Mad max: Affine spline insights into deep learning. Proceedings of the IEEE, pp. 1–24, 2020. doi: 10.1109/JPROC.2020.3042100.
- Balestriero et al. (2019) Balestriero, R., Cosentino, R., Aazhang, B., and Baraniuk, R. The geometry of deep networks: Power diagram subdivision. arXiv preprint arXiv:1905.08443, 2019.
- Balestriero et al. (2020) Balestriero, R., Paris, S., and Baraniuk, R. Max-affine spline insights into deep generative networks. arXiv preprint arXiv:2002.11912, 2020.
- Bartlett et al. (2019) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res., 20:63–1, 2019.
- Baydin et al. (2017) Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research, 18(1):5595–5637, 2017.
- Bentbib & Kanber (2015) Bentbib, A. and Kanber, A. Block power method for svd decomposition. Analele Universitatii” Ovidius” Constanta-Seria Matematica, 23(2):45–58, 2015.
- Bojanov et al. (2013) Bojanov, B. D., Hakopian, H., and Sahakian, B. Spline functions and multivariate interpolations, volume 248. Springer Science & Business Media, 2013.
- Calafiore et al. (1999) Calafiore, G., Dabbene, F., and Tempo, R. Radial and uniform distributions in vector and matrix spaces for probabilistic robustness. In Topics in Control and its Applications, pp. 17–31. Springer, 1999.
- Cohen & Welling (2016) Cohen, T. and Welling, M. Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999, 2016.
- Cosentino et al. (2020) Cosentino, R., Balestriero, R., Baraniuk, R., and Aazhang, B. Provable finite data generalization with group autoencoder, 2020.
- De Boor et al. (1978) De Boor, C. A practical guide to splines, volume 27. Springer-Verlag, New York, 1978.
- Dehaene et al. (2020) Dehaene, D., Frigo, O., Combrexelle, S., and Eline, P. Iterative energy-based projection on a normal data manifold for anomaly localization. arXiv preprint arXiv:2002.03734, 2020.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee, 2009.
- Devarakonda et al. (2017) Devarakonda, A., Naumov, M., and Garland, M. Adabatch: Adaptive batch sizes for training deep neural networks. arXiv preprint arXiv:1712.02029, 2017.
- Devroye (1986) Devroye, L. Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation, pp. 260–265, 1986.
- Elliott (2018) Elliott, C. The simple essence of automatic differentiation. Proceedings of the ACM on Programming Languages, 2(ICFP):1–29, 2018.
- Elsken et al. (2019) Elsken, T., Metzen, J. H., Hutter, F., et al. Neural architecture search: A survey. J. Mach. Learn. Res., 20(55):1–21, 2019.
- Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
- Gamba et al. (2020) Gamba, M., Carlsson, S., Azizpour, H., and Björkman, M. Hyperplane arrangements of trained convnets are biased. arXiv preprint arXiv:2003.07797, 2020.
- Gianfrancesco et al. (2018) Gianfrancesco, M. A., Tamang, S., Yazdany, J., and Schmajuk, G. Potential biases in machine learning algorithms using electronic health record data. JAMA internal medicine, 178(11):1544–1547, 2018.
- Golub & Kahan (1965) Golub, G. and Kahan, W. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 2(2):205–224, 1965.
- Golub & Van Loan (1996) Golub, G. H. and Van Loan, C. F. Matrix computations johns hopkins university press. Baltimore and London, 1996.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, volume 1. MIT Press, 2016. http://www.deeplearningbook.org.
- Graves et al. (2013) Graves, A., Jaitly, N., and Mohamed, A.-r. Hybrid speech recognition with deep bidirectional lstm. In 2013 IEEE workshop on automatic speech recognition and understanding, pp. 273–278. IEEE, 2013.
- Griewank & Walther (2008) Griewank, A. and Walther, A. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.
- Grigsby & Lindsey (2020) Grigsby, J. E. and Lindsey, K. On transversality of bent hyperplane arrangements and the topological expressiveness of relu neural networks. arXiv preprint arXiv:2008.09052, 2020.
- Gudmundsson et al. (1995) Gudmundsson, T., Kenney, C. S., and Laub, A. J. Small-sample statistical estimates for matrix norms. SIAM journal on matrix analysis and applications, 16(3):776–792, 1995.
- Hanin & Rolnick (2019) Hanin, B. and Rolnick, D. Complexity of linear regions in deep networks. arXiv preprint arXiv:1901.09021, 2019.
- Hernández-García & König (2018) Hernández-García, A. and König, P. Data augmentation instead of explicit regularization. arXiv preprint arXiv:1806.03852, 2018.
- Hirsch & Smale (1974) Hirsch, M. and Smale, S. Differential equations, dynamical systems, and linear algebra (pure and applied mathematics, vol. 60). 1974.
- Ipsen & Meyer (1998) Ipsen, I. C. F. and Meyer, C. D. The idea behind krylov methods. The American Mathematical Monthly, 105(10):889–899, 1998. doi: 10.1080/00029890.1998.12004985. URL https://doi.org/10.1080/00029890.1998.12004985.
- Jamie et al. (2017) Jamie, T., Alexandre, P., and Matthew, J. generating forward-mode graphs by calling tf.gradients twice. https://github.com/renmengye/tensorflow-forward-ad/issues/2, 2017.
- Jia et al. (2019) Jia, K., Li, S., Wen, Y., Liu, T., and Tao, D. Orthogonal deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019. doi: 10.1109/TPAMI.2019.2948352.
- Kenney et al. (1998) Kenney, C. S., Laub, A. J., and Reese, M. Statistical condition estimation for linear systems. SIAM Journal on Scientific Computing, 19(2):566–583, 1998.
- Kim & Lee (2019) Kim, B. and Lee, J. A video-based fire detection using deep learning models. Applied Sciences, 9(14):2862, 2019.
- Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Lawrence et al. (1997) Lawrence, S., Giles, C. L., Tsoi, A. C., and Back, A. D. Face recognition: A convolutional neural-network approach. IEEE transactions on neural networks, 8(1):98–113, 1997.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436–444, 2015.
- LeJeune et al. (2019) LeJeune, D., Balestriero, R., Javadi, H., and Baraniuk, R. G. Implicit rugosity regularization via data augmentation. arXiv preprint arXiv:1905.11639, 2019.
- Lin et al. (2013) Lin, M., Chen, Q., and Yan, S. Network in network. arXiv preprint arXiv:1312.4400, 2013.
- Linnainmaa (1976) Linnainmaa, S. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976.
- Marcus (2018) Marcus, G. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018.
- Márquez-Neila et al. (2017) Márquez-Neila, P., Salzmann, M., and Fua, P. Imposing hard constraints on deep networks: Promises and limitations. arXiv preprint arXiv:1706.02025, 2017.
- Meyer (2000) Meyer, C. D. Matrix analysis and applied linear algebra, volume 71. Siam, 2000.
- Molchanov et al. (2017) Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
- Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932, 2014.
- Nadav et al. (2020) Nadav, D., Barak, S., and Ingrid, D. Expression of fractals through neural network functions. IEEE Journal on Selected Areas in Information Theory, 2020.
- Naumann (2008) Naumann, U. Optimal jacobian accumulation is np-complete. Mathematical Programming, 112(2):427–441, 2008.
- Pascanu et al. (2013) Pascanu, R., Montufar, G., and Bengio, Y. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.
- Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
- Perez & Wang (2017) Perez, L. and Wang, J. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.
- Rao & Frtunikj (2018) Rao, Q. and Frtunikj, J. Deep learning for self-driving cars: chances and challenges. In Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pp. 35–38, 2018.
- Rister & Rubin (2017) Rister, B. and Rubin, D. L. Piecewise convexity of artificial neural networks. Neural Networks, 94:34–45, 2017.
- Robinson et al. (2019) Robinson, H., Rasheed, A., and San, O. Dissecting deep neural networks. arXiv preprint arXiv:1910.03879, 2019.
- Roy et al. (2018) Roy, A., Sun, J., Mahoney, R., Alonzi, L., Adams, S., and Beling, P. Deep learning detecting fraud in credit card transactions. In 2018 Systems and Information Engineering Design Symposium (SIEDS), pp. 129–134. IEEE, 2018.
- Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
- Rushin et al. (2017) Rushin, G., Stancil, C., Sun, M., Adams, S., and Beling, P. Horse race analysis in credit card fraud—deep learning, logistic regression, and gradient boosted tree. In 2017 systems and information engineering design symposium (SIEDS), pp. 117–121. IEEE, 2017.
- Saibaba et al. (2017) Saibaba, A. K., Alexanderian, A., and Ipsen, I. C. Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2):353–395, 2017.
- Schumaker (2007) Schumaker, L. Spline functions: basic theory. Cambridge University Press, 2007.
- Serra et al. (2018) Serra, T., Tjandraatmadja, C., and Ramalingam, S. Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, pp. 4558–4566. PMLR, 2018.
- Seydoux et al. (2020) Seydoux, L., Balestriero, R., Poli, P., De Hoop, M., Campillo, M., and Baraniuk, R. Clustering earthquake signals and background noises in continuous seismic data with unsupervised deep learning. Nature communications, 11(1):1–12, 2020.
- Shorten & Khoshgoftaar (2019) Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.
- Smith (2017) Smith, L. N. Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pp. 464–472. IEEE, 2017.
- Strang (2019) Strang, G. Linear algebra and learning from data. Wellesley-Cambridge Press, 2019.
- Tanielian et al. (2020) Tanielian, U., Issenhuth, T., Dohmatob, E., and Mary, J. Learning disconnected manifolds: a no gan’s land. In International Conference on Machine Learning, pp. 9418–9427. PMLR, 2020.
- Taylor & Nitschke (2017) Taylor, L. and Nitschke, G. Improving deep learning using generic data augmentation. arXiv preprint arXiv:1708.06020, 2017.
- Trefethen & Bau III (1997) Trefethen, L. N. and Bau III, D. Numerical linear algebra, volume 50. Siam, 1997.
- Unser (2019) Unser, M. A representer theorem for deep neural networks. Journal of Machine Learning Research, 20(110):1–30, 2019.
- Wang et al. (2017) Wang, H., Ren, K., and Song, J. A closer look at batch size in mini-batch training of deep auto-encoders. In 2017 3rd IEEE International Conference on Computer and Communications (ICCC), pp. 2756–2761. IEEE, 2017.
- Wang et al. (2020) Wang, J., Chen, Y., Chakraborty, R., and Yu, S. X. Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11505–11515, 2020.
- Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Zheng et al. (2018) Zheng, Y.-J., Zhou, X.-H., Sheng, W.-G., Xue, Y., and Chen, S.-Y. Generative adversarial network based telecom fraud detection at the receiving bank. Neural Networks, 102:78–86, 2018.
Appendix A Details on Jacobian Vector Products
The differential of an operator $f$ allows one to express the local growth of each of its output dimensions as a linear term in the growth of each input dimension, plus a small error term. This information is formatted into a Jacobian matrix which has the form

$J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(x) & \dots & \frac{\partial f_1}{\partial x_D}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_K}{\partial x_1}(x) & \dots & \frac{\partial f_K}{\partial x_D}(x) \end{bmatrix} \in \mathbb{R}^{K \times D},$

where $K$ is the output dimension of the operator $f$. Note that one can also express the Jacobian matrix in terms of the gradient operator taken for each output dimension, as

$J_f(x) = \big[\nabla f_1(x), \dots, \nabla f_K(x)\big]^{\top}.$

One has the following property when assuming that $f$ is differentiable in $\mathcal{U}$, where $\mathcal{U}$ is an open subspace of $\mathbb{R}^D$ containing $x$:

$f(x + \epsilon) = f(x) + J_f(x)\,\epsilon + o(\|\epsilon\|),$

where it should be clear that the tangent space induced by the Jacobian matrix allows one to locally describe the operator, with an approximation error that increases as one moves away from $x$, the point at which the Jacobian matrix has been evaluated.
Appendix B Proof of Rop Evaluation From Double Lop
It is possible to compute the Rop from two Lop calls via

$\mathrm{Rop}(f,x,u) = \mathrm{Lop}\big(v \mapsto \mathrm{Lop}(f,x,v),\, v,\, u\big)^{\top},$

which we now prove. First, we have that

$\mathrm{Lop}(f,x,v) = v^{\top} J_f(x);$

denote this linear mapping in $v$ as $g(v) = v^{\top} J_f(x)$, whose Jacobian is $J_g(v) = J_f(x)^{\top}$. We then have

$\mathrm{Lop}(g,v,u) = u^{\top} J_g(v) = u^{\top} J_f(x)^{\top} = \big(J_f(x)\,u\big)^{\top} = \mathrm{Rop}(f,x,u)^{\top},$

which completes the proof.
Appendix C Max-Pooling Nonlinearity Matrix
We saw in (8) that the nonlinearity matrices $Q_\ell$ of activation operators are square, diagonal, and with entries based on the employed activation, e.g., $\{0,1\}$ for ReLU and $\{-1,1\}$ for absolute value. This is because the activation operator maintains the same dimension from its input to its output. A max-pooling operator, however, reduces the dimensionality of its input based on some policy, for concreteness say by performing spatial pooling over non-overlapping regions of its input. Hence the $Q$ matrix of the max-pooling operator is rectangular, in this particular case with a number of rows equal to the number of columns (the input dimension) divided by the size of the pooling regions. Since we consider flattened versions of the input tensors, each output entry of the pooling corresponds to a set of indices of the flattened input vector (the indices of its pooling region, across the spatial dimensions and within the same channel). As a result, the rectangular $Q$ matrix is filled with 0s and 1s, with a single 1 per row; the position of this 1 in each row can be any of the indices of the corresponding pooling region, namely the one achieving the argmax of the input values at those indices. A minimal sketch of how such a matrix can be formed is given below.
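The following sketch (our own, restricted to 1-D non-overlapping pooling for clarity; maxpool_state_matrix is a hypothetical helper name) builds this rectangular matrix from the argmax positions:

```python
import tensorflow as tf

def maxpool_state_matrix(h, pool=2):
    """Rectangular 0/1 state matrix Q of a 1-D non-overlapping max-pooling operator,
    built from the argmax position inside each pooling region of the pre-activation h."""
    n = tf.shape(h)[0]
    regions = tf.reshape(h, (-1, pool))                          # non-overlapping regions
    winners = tf.argmax(regions, axis=1, output_type=tf.int32)   # argmax inside each region
    cols = tf.range(0, n, pool) + winners                        # winning flat-input index
    rows = tf.range(tf.shape(cols)[0])
    return tf.scatter_nd(tf.stack([rows, cols], axis=1),
                         tf.ones_like(cols, dtype=tf.float32),
                         tf.stack([tf.shape(cols)[0], n]))

h = tf.constant([0.3, -1.2, 2.0, 0.1, -0.5, -0.4])
Q = maxpool_state_matrix(h, pool=2)
print(tf.linalg.matvec(Q, h).numpy())   # [0.3, 2.0, -0.4]: the max-pooled output
```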
Appendix D Implementation Details
D.1 Software
We employed the latest stable version of TensorFlow at the time of our study, namely version 2.4.0; for all additional details and the versions of supporting libraries like NumPy, we refer the reader to our GitHub page.
D.2 Hardware
TITAN X (Pascal): The TITAN X Pascal is an enthusiast-class graphics card by NVIDIA, launched in August 2016. Built on the 16 nm process, and based on the GP102 graphics processor, in its GP102-400-A1 variant, the card supports DirectX 12. This ensures that all modern games will run on TITAN X Pascal. The GP102 graphics processor is a large chip with a die area of 471 mm2 and 11,800 million transistors. Unlike the fully unlocked TITAN Xp, which uses the same GPU but has all 3840 shaders enabled, NVIDIA has disabled some shading units on the TITAN X Pascal to reach the product’s target shader count. It features 3584 shading units, 224 texture mapping units, and 96 ROPs. NVIDIA has paired 12 GB GDDR5X memory with the TITAN X Pascal, which are connected using a 384-bit memory interface. The GPU is operating at a frequency of 1417 MHz, which can be boosted up to 1531 MHz, memory is running at 1251 MHz (10 Gbps effective). Being a dual-slot card, the NVIDIA TITAN X Pascal draws power from 1x 6-pin + 1x 8-pin power connector, with power draw rated at 250 W maximum. Display outputs include: 1x DVI, 1x HDMI, 3x DisplayPort. TITAN X Pascal is connected to the rest of the system using a PCI-Express 3.0 x16 interface. The card’s dimensions are 267 mm x 112 mm x 40 mm, and it features a dual-slot cooling solution. Its price at launch was 1199 US Dollars.
GeForce GTX 1080 Ti: The GeForce GTX 1080 Ti is an enthusiast-class graphics card by NVIDIA, launched in March 2017. Built on the 16 nm process, and based on the GP102 graphics processor, in its GP102-350-K1-A1 variant, the card supports DirectX 12. This ensures that all modern games will run on GeForce GTX 1080 Ti. The GP102 graphics processor is a large chip with a die area of 471 mm2 and 11,800 million transistors. Unlike the fully unlocked TITAN Xp, which uses the same GPU but has all 3840 shaders enabled, NVIDIA has disabled some shading units on the GeForce GTX 1080 Ti to reach the product’s target shader count. It features 3584 shading units, 224 texture mapping units, and 88 ROPs. NVIDIA has paired 11 GB GDDR5X memory with the GeForce GTX 1080 Ti, which are connected using a 352-bit memory interface. The GPU is operating at a frequency of 1481 MHz, which can be boosted up to 1582 MHz, memory is running at 1376 MHz (11 Gbps effective). Being a dual-slot card, the NVIDIA GeForce GTX 1080 Ti draws power from 1x 6-pin + 1x 8-pin power connector, with power draw rated at 250 W maximum. Display outputs include: 1x HDMI, 3x DisplayPort. GeForce GTX 1080 Ti is connected to the rest of the system using a PCI-Express 3.0 x16 interface. The card’s dimensions are 267 mm x 112 mm x 40 mm, and it features a dual-slot cooling solution. Its price at launch was 699 US Dollars.
Quadro RTX 8000: The Quadro RTX 8000 is an enthusiast-class professional graphics card by NVIDIA, launched in August 2018. Built on the 12 nm process, and based on the TU102 graphics processor, in its TU102-875-A1 variant, the card supports DirectX 12 Ultimate. The TU102 graphics processor is a large chip with a die area of 754 mm2 and 18,600 million transistors. It features 4608 shading units, 288 texture mapping units, and 96 ROPs. Also included are 576 tensor cores which help improve the speed of machine learning applications. The card also has 72 raytracing acceleration cores. NVIDIA has paired 48 GB GDDR6 memory with the Quadro RTX 8000, which are connected using a 384-bit memory interface. The GPU is operating at a frequency of 1395 MHz, which can be boosted up to 1770 MHz, memory is running at 1750 MHz (14 Gbps effective). Being a dual-slot card, the NVIDIA Quadro RTX 8000 draws power from 1x 6-pin + 1x 8-pin power connector, with power draw rated at 260 W maximum. Display outputs include: 4x DisplayPort, 1x USB Type-C. Quadro RTX 8000 is connected to the rest of the system using a PCI-Express 3.0 x16 interface. The card measures 267 mm in length, 111 mm in width, and features a dual-slot cooling solution. Its price at launch was 9999 US Dollars.
CPU: Architecture: x86_64, CPU op-mode(s): 32-bit, 64-bit, Byte Order: Little Endian, CPU(s): 96, On-line CPU(s) list: 0-95, Thread(s) per core: 2, Core(s) per socket: 24, Socket(s): 2, NUMA node(s): 2, Vendor ID: GenuineIntel, CPU family: 6, Model: 85, Model name: Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz, Stepping: 7, CPU MHz: 2200.000, CPU max MHz: 2201.0000, CPU min MHz: 1000.0000, BogoMIPS: 4400.00, Virtualization: VT-x, L1d cache: 32K, L1i cache: 32K, L2 cache: 1024K, L3 cache: 36608K, NUMA node0 CPU(s): 0-23,48-71, NUMA node1 CPU(s): 24-47,72-95.
Appendix E Layer JVP Implementation
For the layer case, that is, taking the JVP of the DN output at $x$ with respect to a layer weight $W_\ell$, the implementation is similar except that all the layers up to and including layer $\ell-1$ keep their default nonlinearities, while all the layers from layer $\ell$ onward employ the custom nonlinearities that we provided. At layer $\ell$, the user takes the layer input $z_{\ell-1}(x)$ and generates the concatenated tensor by applying the layer's affine transformation three times: once with the considered layer weight, denoted $W$, once with the direction $U$ to produce the JVP from, and once with zero weights (a sketch for the case of a dense layer is given below).

The same goes for the case of a convolutional layer, with the matrix multiplication replaced by the convolution of the layer input with $W$ and with $U$, respectively.

If one wanted to use batch-normalization, a clone_bn operation should be added just before the activation, taking X as input; the output of this cloned BN would then be fed into the clone_act operation.
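A hedged sketch of how this concatenated tensor could be formed for a dense layer is given below (the names z, W, U, b and the helper are ours; the subsequent layers are assumed to use the cloned nonlinearities, cloning their states from the first third of the batch):

```python
import tensorflow as tf

def dense_layer_weight_jvp_input(z, W, U, b):
    """Form the tripled mini-batch fed to the remaining (cloned) layers when taking the
    JVP w.r.t. a dense layer weight W with direction U; z is the layer input z_{l-1}(x)."""
    h_x = tf.matmul(z, W) + b                 # usual pre-activation (true weight)
    h_u = tf.matmul(z, U) + b                 # same input, direction in place of the weight
    h_0 = tf.zeros_like(h_x) + b              # zero weight
    return tf.concat([h_x, h_u, h_0], axis=0)

# For a convolutional layer, the same pattern applies with the convolution of z with W
# and with U, e.g. tf.nn.conv2d(z, W, strides=1, padding="SAME") and likewise for U.
```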
Appendix F Codes for Additional Nonlinearities
In the case of the batch-normalization layer, the input-output mapping is not CPA during training, and thus we need to explicitly compute the Jacobian of the layer by hand and implement the corresponding forward pass. This is only needed if one wants to compute the JVP during training; otherwise it can be omitted since, during testing, batch-normalization is simply an affine transformation of its input.
Appendix G Proof
G.1 Proof of Theorem 1
Proof.
The proof for the DN input JVP case is trivial and given in (10). Now when considering the layer weight case, the only added complication comes from the fact that the parameter of interest is a matrix (as opposed to a vector). However, we will simply demonstrate that a layer output can be produced by considering the flattened version of the weight matrix, in which case the JVP is taken w.r.t. a vector.
Denote by $w$ a vector encoding the flattened representation of $W_\ell$; the DN mapping based on an input $x$ but now taking $w$ as input is defined as

$g_x(w) \triangleq (f_L \circ \dots \circ f_{\ell+1})\big(\sigma_\ell(R_x\, w + b_\ell)\big), \qquad (14)$

with $R_x$ a $D_\ell$ by $D_\ell D_{\ell-1}$ matrix stacking $z_{\ell-1}(x)^{\top}$ repeatedly on its (block) diagonal, i.e., $R_x = I_{D_\ell} \otimes z_{\ell-1}(x)^{\top}$. First, notice that layers $1, \dots, \ell-1$ do not depend on $w$. From this, it is possible to apply the same strategy as when taking the JVP with respect to the DN input, since one can consider that the DN input is now the layer weight of interest, as in

$J_{W_\ell} f(x)\,\mathrm{vec}(U) = \bar{g}_x(\mathrm{vec}(U)) - \bar{g}_x(0).$

However, one can further simplify the above such that in practice there is no need to produce the matrix $R_x$ explicitly. Notice that for an input $w$, the layer computes $R_x\, w + b_\ell$. This quantity is nothing else than $U z_{\ell-1}(x) + b_\ell$, where $U$ is the matrix form of $w$. We thus directly obtain that, in practice, computing the above is equivalent to computing the layer output but replacing the layer weight $W_\ell$ by the matrix $U$ (or by $0$). For the convolutional layer case, the exact same process applies, now noticing that the convolution can be written as the product between the (small) filter weights and a matrix producing the circulant-block-circulant matrix of the convolution. ∎
Appendix H Computational Complexity of Spectral Decomposition
The QR decomposition (Strang, 2019) using Householder transformations requires on the order of $2mn^2 - \frac{2}{3}n^3$ flops (Golub & Van Loan, 1996) for an $m \times n$ matrix. Hence the top-k SVD is obtained by repeating, until the stopping criterion is reached, the matrix-free JVP calls together with these additional QR flops. Considering a fixed number of iterations of the inner loop, we obtain that the asymptotic time complexity of Algo. 2 is the same as the asymptotic time complexity of performing a forward pass of an input through the considered DN.
Appendix I Additional Results
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | |
GeForce GTX 1080Ti | RNN | 5M | - | - | 0.166 | 0.034 | 4.88 |
UNet | 49M | - | - | - | - | - | |
VGG16 | 138M | 0.060 | 0.028 | 0.024 | 0.017 | 1.41 | |
VGG19 | 143M | 0.075 | 0.039 | 0.031 | 0.021 | 1.47 | |
Inception V3 | 23M | 0.041 | 0.022 | 0.017 | 0.011 | 1.54 | |
Resnet50 | 25M | 0.041 | 0.014 | 0.012 | 0.011 | 1.09 | |
Resnet101 | 44M | 0.058 | 0.021 | 0.017 | 0.017 | 1 | |
Resnet152 | 60M | 0.082 | 0.03 | 0.025 | 0.025 | 1 | |
EfficientNet B0 | 5M | - | - | - | 0.007 | ||
EfficientNet B1 | 7M | - | - | - | 0.009 | ||
Densenet121 | 8M | 0.037 | 0.026 | 0.019 | 0.013 | 1.46 | |
Densenet169 | 14M | 0.045 | 0.036 | 0.026 | 0.018 | 1.44 | |
Densenet201 | 17M | 0.061 | 0.046 | 0.033 | 0.018 | 1.83 |
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | |
GeForce GTX 1080Ti | RNN | 5M | - | - | 0.151 | 0.024 | 6.29167 |
UNet | 49M | - | - | - | - | - | |
VGG16 | 138M | 0.060 | 0.028 | 0.024 | 0.017 | 1.41 | |
VGG19 | 143M | 0.072 | 0.036 | 0.031 | 0.022 | 1.40 | |
Inception V3 | 23M | 0.039 | 0.021 | 0.016 | 0.011 | 1.45 | |
Resnet50 | 25M | 0.039 | 0.013 | 0.011 | 0.011 | 1 | |
Resnet101 | 44M | 0.058 | 0.023 | 0.017 | 0.017 | 1 | |
Resnet152 | 60M | 0.080 | 0.029 | 0.027 | 0.025 | 1.08 | |
EfficientNet B0 | 5M | - | - | - | 0.007 | ||
EfficientNet B1 | 7M | - | - | - | 0.009 | ||
Densenet121 | 8M | 0.033 | 0.025 | 0.020 | 0.011 | 1.81 | |
Densenet169 | 14M | 0.050 | 0.036 | 0.026 | 0.015 | 1.73 | |
Densenet201 | 17M | 0.061 | 0.044 | 0.030 | 0.018 | 1.67 |
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | batch- jacobian | jvp | double vjp | clone (ours) |
RNN | 5M | 0.011 | - | 0.014 | 0.004 | - | - | 0.046 | 0.016 |
UNet | 49M | - | 0.004 | 0.003 | 0.002 | - | 0.006 | 0.004 | 0.005 |
VGG16 | 138M | 0.004 | 0.003 | 0.003 | 0.001 | - | 0.008 | 0.008 | 0.004 |
VGG19 | 143M | 0.005 | 0.003 | 0.003 | 0.002 | - | 0.009 | 0.008 | 0.005 |
Inception V3 | 23M | 0.006 | 0.005 | 0.004 | 0.003 | - | 0.006 | 0.004 | 0.004 |
Resnet50 | 25M | 0.005 | 0.003 | 0.003 | 0.003 | - | 0.005 | 0.004 | 0.004 |
Resnet101 | 44M | 0.005 | 0.005 | 0.004 | 0.003 | - | 0.008 | 0.007 | 0.006 |
Resnet152 | 60M | 0.010 | 0.006 | 0.004 | 0.004 | - | 0.010 | 0.006 | 0.007 |
EfficientNet B0 | 5M | 0.007 | 0.004 | 0.003 | 0.003 | - | 0.006 | 0.004 | 0.003 |
EfficientNet B1 | 7M | 0.007 | 0.005 | 0.004 | 0.003 | - | 0.007 | 0.006 | 0.004 |
Densenet121 | 8M | 0.007 | 0.005 | 0.005 | 0.003 | - | 0.008 | 0.006 | 0.005 |
Densenet169 | 14M | 0.012 | 0.008 | 0.005 | 0.005 | - | 0.008 | 0.008 | 0.006 |
Densenet202 | 17M | 0.013 | 0.011 | 0.007 | 0.004 | - | 0.008 | 0.012 | 0.007 |
RNN | 5M | - | - | 0.022 | 0.007 | - | - | - | 0.017 |
UNet | 49M | - | 0.004 | 0.003 | 0.002 | - | 0.01 | 0.006 | 0.007 |
VGG16 | 138M | 0.007 | 0.002 | 0.001 | 0.001 | - | 0.009 | 0.005 | 0.005 |
VGG19 | 143M | 0.004 | 0.003 | 0.002 | 0.001 | - | 0.010 | 0.005 | 0.007 |
Inception V3 | 23M | 0.007 | 0.007 | 0.006 | 0.003 | - | 0.006 | 0.005 | 0.004 |
Resnet50 | 25M | 0.005 | 0.003 | 0.003 | 0.002 | - | 0.002 | 0.002 | 0.004 |
Resnet101 | 44M | 0.006 | 0.004 | 0.005 | 0.004 | - | 0.004 | 0.004 | 0.007 |
Resnet152 | 60M | 0.006 | 0.009 | 0.007 | 0.005 | - | 0.003 | 0.004 | 0.004 |
EfficientNet B0 | 5M | 0.011 | 0.004 | 0.003 | 0.001 | - | 0.002 | 0.001 | 0.001 |
EfficientNet B1 | 7M | 0.014 | 0.006 | 0.005 | 0.002 | - | 0.007 | 0.001 | 0.002 |
Densenet121 | 8M | 0.011 | 0.009 | 0.006 | 0.004 | - | 0.008 | 0.004 | 0.002 |
Densenet169 | 14M | 0.015 | 0.011 | 0.008 | 0.005 | - | 0.006 | 0.004 | 0.004 |
Densenet202 | 17M | 0.017 | 0.012 | 0.011 | 0.005 | - | 0.007 | 0.004 | 0.004 |
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | batch- jacobian | jvp | double vjp | clone (ours) |
RNN | 5M | 0.013 | - | 0.013 | 0.006 | - | - | 0.042 | 0.018 |
UNet | 49M | - | 0.004 | 0.003 | 0.002 | - | 0.005 | 0.004 | 0.005 |
VGG16 | 138M | 0.004 | 0.003 | 0.003 | 0.001 | - | 0.008 | 0.008 | 0.003 |
VGG19 | 143M | 0.004 | 0.003 | 0.003 | 0.002 | - | 0.008 | 0.008 | 0.005 |
Inception V3 | 23M | 0.006 | 0.004 | 0.004 | 0.004 | - | 0.007 | 0.004 | 0.004 |
Resnet50 | 25M | 0.003 | 0.003 | 0.003 | 0.003 | - | 0.004 | 0.004 | 0.003 |
Resnet101 | 44M | 0.007 | 0.005 | 0.004 | 0.003 | - | 0.007 | 0.004 | 0.004 |
Resnet152 | 60M | 0.008 | 0.006 | 0.007 | 0.005 | - | 0.007 | 0.008 | 0.004 |
EfficientNet B0 | 5M | 0.004 | 0.003 | 0.003 | 0.003 | - | 0.006 | 0.004 | 0.003 |
EfficientNet B1 | 7M | 0.007 | 0.005 | 0.004 | 0.003 | - | 0.007 | 0.005 | 0.004 |
Densenet121 | 8M | 0.006 | 0.005 | 0.004 | 0.004 | - | 0.006 | 0.006 | 0.004 |
Densenet169 | 14M | 0.012 | 0.006 | 0.006 | 0.004 | - | 0.014 | 0.007 | 0.006 |
Densenet202 | 17M | 0.010 | 0.007 | 0.006 | 0.005 | - | 0.009 | 0.008 | 0.006 |
RNN | 5M | - | - | 0.02 | 0.009 | - | - | - | 0.024 |
UNet | 49M | - | 0.004 | 0.002 | 0.002 | - | 0.009 | 0.003 | 0.007 |
VGG16 | 138M | 0.008 | 0.002 | 0.002 | 0.001 | - | 0.009 | 0.005 | 0.005 |
VGG19 | 143M | 0.004 | 0.003 | 0.002 | 0.001 | - | 0.009 | 0.006 | 0.007 |
Inception V3 | 23M | 0.008 | 0.007 | 0.005 | 0.003 | - | 0.006 | 0.003 | 0.003 |
Resnet50 | 25M | 0.006 | 0.004 | 0.001 | 0.002 | - | 0.002 | 0.002 | 0.004 |
Resnet101 | 44M | 0.007 | 0.007 | 0.006 | 0.004 | - | 0.005 | 0.003 | 0.007 |
Resnet152 | 60M | 0.007 | 0.009 | 0.008 | 0.006 | - | 0.007 | 0.007 | 0.005 |
EfficientNet B0 | 5M | 0.010 | 0.005 | 0.004 | 0.002 | - | 0.005 | 0.002 | 0.002 |
EfficientNet B1 | 7M | 0.029 | 0.007 | 0.005 | 0.003 | - | 0.009 | 0.001 | 0.003 |
Densenet121 | 8M | 0.011 | 0.010 | 0.007 | 0.004 | - | 0.009 | 0.004 | 0.003 |
Densenet169 | 14M | 0.013 | 0.011 | 0.009 | 0.005 | - | 0.008 | 0.006 | 0.004 |
Densenet202 | 17M | 0.014 | 0.013 | 0.010 | 0.005 | - | 0.01 | 0.003 | 0.002 |