Fast Jacobian-Vector Product for Deep Networks
Abstract
Jacobian-vector products (JVPs) form the backbone of many recent developments in Deep Networks (DNs), with applications including faster constrained optimization, regularization with generalization guarantees, and adversarial example sensitivity assessments. Unfortunately, JVPs are computationally expensive for real-world DN architectures and require the use of automatic differentiation to avoid manually adapting the JVP program when changing the DN architecture. We propose a novel method to quickly compute JVPs for any DN that employs Continuous Piecewise Affine nonlinearities (e.g., leaky-ReLU, max-pooling, maxout). We show that our technique is, on average, faster than the fastest alternative across DN architectures and across various hardware. In addition, our solution does not require automatic differentiation and is thus easy to deploy in software, requiring only the modification of a few lines of code that do not depend on the DN architecture.
1 Introduction
Deep (Neural) Networks (DNs) are used throughout the spectrum of machine learning tasks, ranging from image classification (Lawrence et al., 1997; Lin et al., 2013; Zagoruyko & Komodakis, 2016) to financial fraud prediction (Rushin et al., 2017; Roy et al., 2018; Zheng et al., 2018). The ability of DNs to reach super-human performance has added great momentum to the deployment of those techniques in real-life applications, e.g., self-driving cars (Rao & Frtunikj, 2018), fire prediction (Kim & Lee, 2019) or seismic activity prediction (Seydoux et al., 2020). The main ingredients behind this success roughly fall into three camps: (i) specialized optimization algorithms such as Adam (Kingma & Ba, 2014) coupled with carefully designed learning rate schedules (Smith, 2017) and/or mini-batch sizes (Devarakonda et al., 2017; Wang et al., 2017); (ii) novel regularization techniques, which can be explicit as with orthogonal/sparse weight penalties (Cohen & Welling, 2016; Jia et al., 2019; Wang et al., 2020) or implicit as with dropout (Molchanov et al., 2017) or data augmentation (Taylor & Nitschke, 2017; Perez & Wang, 2017; Hernández-García & König, 2018; Shorten & Khoshgoftaar, 2019); and (iii) adaptivity of the DN architecture to the data and task at hand, thanks to expert knowledge (Graves et al., 2013; Seydoux et al., 2020) or automated Neural Architecture Search (Elsken et al., 2019).
Despite DNs’ astonishing performance, their practical deployment in the real world remains hazardous, requiring tremendous tweaking and the implementation of various safeguard measures (Marcus, 2018), not only to avoid failure cases that can lead to injuries or death, but also to ensure fairness and unbiased behaviors (Gianfrancesco et al., 2018). One path for overcoming those challenges in a more principled way is to develop novel, specialized methods for each part of the deep learning pipeline (optimization/regularization/architecture).
Recent techniques aiming at tackling those shortcomings have been proposed. Most of them rely on Jacobian-vector products (JVPs) (Hirsch & Smale, 1974), defined as the projection of a given vector onto the Jacobian matrix of an operator, i.e., the DN in our case. The JVP captures crucial information on the local geometry of the DN input-output mapping, which is one of the main reasons behind its popularity. For example, Dehaene et al. (2020) leverage the JVP to better control the denoising/data-generation process of autoencoders via explicit manifold projections (generated samples are repeatedly projected onto the autoencoder Jacobian matrix); Balestriero & Baraniuk (2020) assess the sensitivity of DNs to adversarial examples (noise realizations are projected onto the DN Jacobian matrix); Cosentino et al. (2020) guarantee the generalization performance of DNs based on a novel JVP-based measure (neighboring data samples are projected onto the DN Jacobian matrix and compared); Márquez-Neila et al. (2017) train DNs with explicit input-output constraints (the constraints are cast into JVPs and evaluated as part of the DN training procedure). Beyond those recent developments, JVPs have a long history of being employed to probe models and obtain a precise understanding of their properties, e.g., input sensitivity, intrinsic dimension, or conditioning (Gudmundsson et al., 1995; Bentbib & Kanber, 2015; Balestriero et al., 2020). To enable the deployment of these techniques in practical scenarios, it is crucial to evaluate JVPs as fast as possible and without relying on automatic differentiation, a feature primarily implemented in development libraries but often missing in software deployed in production.
In this paper, we propose a novel method to evaluate JVPs that is inspired by recent theoretical studies of DNs employing Continuous Piecewise Affine (CPA) nonlinearities and by Forward-mode Differentiation (FD). In short, the CPA property allows FD to be evaluated simply by feeding two inputs to a DN while holding the DN's internal state fixed, and combining the two produced outputs such that the final result corresponds to the desired JVP. Our solution offers the following benefits:
• Speed: faster on average than the fastest alternative solution, with empirical validation on different DN architectures (including Recurrent Neural Networks, ResNets, DenseNets, VGGs, UNets) and with speed-ups consistent across hardware.
• Simple implementation: implementing our solution for any DN requires changing only a few lines of code in total, regardless of the architecture being considered. This allows our method to be employed inside an architecture search loop if needed.
• Does not employ automatic differentiation: our solution can be implemented in any software/library, including ones that are not equipped with automatic differentiation, allowing deployment of the method on production-specific hardware/software.
While our method is simplest to implement for CPA DNs (a constraint embracing most current state-of-the-art architectures), we also demonstrate how to employ it for smooth DN layers such as batch-normalization, which is CPA only at test time.
This paper is organized as follows: Sec. 2 reviews current JVP solutions and thoroughly presents notations for CPA DNs; Sec. 3 presents our proposed solution, with a detailed implementation provided in Sec. 3.2 and careful empirical validations performed in Sec. 3.3; Sec. 4 presents further results on how JVPs are crucial to evaluate many insightful statistics of CPA DNs, such as the largest eigenvalues of the DN Jacobian matrix.
2 Jacobian-Vector Products and Continuous Piecewise Affine Deep Networks
Jacobian-Vector Products (JVPs). The JVP operation, also called Rop (short for right-operation), is abbreviated as $\mathrm{Rop}(f,x,u)$ and computes the directional derivative, with direction $u \in \mathbb{R}^D$, of the multi-dimensional operator $f:\mathbb{R}^D \to \mathbb{R}^K$ with respect to the input $x$ as

$\mathrm{Rop}(f,x,u) \triangleq \lim_{\epsilon \to 0} \frac{f(x+\epsilon u) - f(x)}{\epsilon} = J_f(x)\,u. \qquad (1)$

Throughout our study, we will assume that $f$ is differentiable at $x$; $J_f(x) \in \mathbb{R}^{K \times D}$ denotes the Jacobian of $f$ evaluated at $x$. The Rop defines a linear operator from $\mathbb{R}^D$ to $\mathbb{R}^K$ with respect to the direction argument; additional details can be found in Appendix A. Throughout this paper we will use the terms JVP and Rop interchangeably. The vector-Jacobian product (VJP) operation, also called Lop, is abbreviated as $\mathrm{Lop}(f,x,v)$ with direction $v \in \mathbb{R}^K$ and computes the adjoint directional derivative

$\mathrm{Lop}(f,x,v) \triangleq v^{\top} J_f(x). \qquad (2)$
For a thorough review of those operations, we refer the reader to Paszke et al. (2017); Baydin et al. (2017); Elliott (2018); throughout this paper we will use the terms VJP and Lop interchangeably.
In practice, evaluating the Rop of a differentiable operator $f$ can be done naturally through forward-mode automatic differentiation, which builds $J_f(x)\,u$ while forwarding $x$ through the computational graph of $f$. However, it can also be done by employing backward-mode automatic differentiation (Linnainmaa, 1976), i.e., traversing the computational graph of $f$ from output to input, via multiple Lop calls as in

$\mathrm{Rop}(f,x,u) = \big[\langle \mathrm{Lop}(f,x,e_1), u\rangle, \dots, \langle \mathrm{Lop}(f,x,e_K), u\rangle\big]^{\top}, \qquad (3)$

where $K$ is the length of the output vector of the mapping $f$ and $e_k$ denotes the $k$-th element of the canonical basis of $\mathbb{R}^K$. Practical implementations of the above often resort to parallel computation of each $\mathrm{Lop}(f,x,e_k)$, forming the rows of the Jacobian matrix, reducing computation time at the cost of greater memory requirements. Alternatively, an interesting computational trick recently brought to light in Jamie et al. (2017) allows one to compute the Rop from two Lop calls only (instead of $K$) as follows:

$\mathrm{Rop}(f,x,u) = \mathrm{Lop}\big(v \mapsto \mathrm{Lop}(f,x,v),\, v,\, u\big)^{\top}; \qquad (4)$

a detailed derivation can be found in Appendix B. In short, the inner $\mathrm{Lop}(f,x,v)$ represents the mapping of $v$ onto the linear operator $J_f(x)^{\top}$; hence, being linear in $v$, it can itself be differentiated w.r.t. $v$ and evaluated with direction $u$, leading to the above equality. This alternative strategy might not be of great interest for theoretical studies, since the asymptotic time-complexities of Rop and Lop are similar (proportional to the number of operations in the computational graph (Griewank & Walther, 2008)); which of the above strategies provides the fastest computation in practice depends on the type of operations (input/output shapes) present in the computational graph. And while Rop and Lop can be combined, the general problem of optimal Jacobian accumulation (finding the most efficient computation) is NP-complete (Naumann, 2008). Note that current deep learning software/hardware heavily focuses on the Lop, as backpropagation (the gradient-based DN training strategy) is a special case of the latter (Rumelhart et al., 1986).
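To make the trick of (4) concrete, below is a minimal TensorFlow sketch (our own illustration, not code from the paper); the function name jvp_via_double_vjp and the toy operator are ours, and only standard tf.GradientTape calls are used.

```python
import tensorflow as tf

def jvp_via_double_vjp(f, x, u):
    """Sketch of Eq. (4): obtain Rop(f, x, u) = J_f(x) u from two Lop (VJP) calls."""
    with tf.GradientTape() as outer:
        with tf.GradientTape() as inner:
            inner.watch(x)
            y = f(x)
        v = tf.zeros_like(y)       # dummy cotangent; v -> v^T J_f(x) is linear in v
        outer.watch(v)
        g = inner.gradient(y, x, output_gradients=v)   # first Lop: J_f(x)^T v
    return outer.gradient(g, v, output_gradients=u)    # second Lop: J_f(x) u

# Toy check with f(x) = (sum(x^2), sum(sin(x))), so J_f(x) u = (<2x, u>, <cos(x), u>).
f = lambda x: tf.stack([tf.reduce_sum(x ** 2), tf.reduce_sum(tf.sin(x))])
x = tf.constant([1.0, 2.0, 3.0])
u = tf.constant([0.1, 0.0, -0.2])
print(jvp_via_double_vjp(f, x, u))
```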
Deep Networks (DNs). A DN is an operator $f$ that maps an observation $x \in \mathbb{R}^D$ (we consider vectors, matrices and tensors as flattened vectors) to a prediction $y \in \mathbb{R}^K$. This prediction is a nonlinear transformation of the input built through a cascade of linear and nonlinear operators (LeCun et al., 2015). Those operators include affine operators such as the fully connected operator (simply an affine transformation defined by a weight matrix $W$ and a bias vector $b$) and the convolution operator (with a circulant $W$), and nonlinear operators such as the activation operator (applying a scalar nonlinearity such as the ubiquitous ReLU) or the max-pooling operator; precise definitions of these operators can be found in Goodfellow et al. (2016).
It will become convenient for our development to group the succession of operators forming a DN into layers in order to express the entire DN as a composition of $L$ intermediate layers $f_1, \dots, f_L$, as in

$f(x) = (f_L \circ f_{L-1} \circ \dots \circ f_1)(x), \qquad (5)$

where each $f_\ell$ maps an input vector of length $D_{\ell-1}$ to an output vector of length $D_\ell$ (with $D_0 = D$ and $D_L = K$). We formally specify how to perform this grouping to ensure that any DN has a unique layer decomposition.
Definition 1 (Deep Network Layer).
A DN layer $f_\ell$ is the composition of a single nonlinear operator with the linear operators (if any) lying between it and the preceding nonlinear operator (if any).
We also explicitly denote the up-to-layer-$\ell$ mapping as

$z_\ell(x) = (f_\ell \circ \dots \circ f_1)(x),$

with initialization $z_0(x) = x$; those intermediate representations are denoted as feature maps.
Continuous Piecewise Affine (CPA) DNs. Whenever a DN employs CPA nonlinearities throughout its layers, the entire input-output mapping is a (continuous) affine spline with a partition $\Omega$ of its domain $\mathbb{R}^D$ (the input space of the DN) and a per-region affine mapping (for more background on splines we refer the reader to De Boor et al. (1978); Schumaker (2007); Bojanov et al. (2013); Pascanu et al. (2013)) such that

$f(x) = \sum_{\omega \in \Omega} \big(A_{\omega}\, x + b_{\omega}\big)\, 1_{\{x \in \omega\}}, \qquad (6)$

where the per-region slope and bias parameters $(A_\omega, b_\omega)$ are functions of the DN architecture and parameters. Given an observation $x$, the affine parameters of the region $\omega \ni x$ from (6) can be obtained explicitly as follows.
1. For each layer $\ell$ (recall Def. 1), compose the possibly multiple linear operators into a single one with slope matrix $W_\ell$ and bias $b_\ell$ (any composition of linear operators can be expressed this way thanks to composition and the distributive law) and denote the layer's pre-activation as

$h_\ell(x) = W_\ell\, z_{\ell-1}(x) + b_\ell; \qquad (7)$

in the absence of linear operators in a layer, set $W_\ell$ to the identity matrix and $b_\ell$ to the zero vector.

2. For each layer, encode the states of the layer's activation operator into the following diagonal matrix

$[Q_\ell(x)]_{i,i} = \begin{cases} 1, & [h_\ell(x)]_i > 0,\\ \alpha, & [h_\ell(x)]_i \leq 0, \end{cases} \qquad (8)$

where $\alpha$ equals the negative-slope coefficient (leaky-ReLU), $0$ (ReLU), or $-1$ (absolute value); for the max-pooling operator, see Appendix C. $[\cdot]_{i,j}$ is used to access a specific entry of a matrix/vector, and for clarity we will often omit the explicit dependencies in (8) and simply write $Q_\ell$.

3. Combine (7) and (8) to obtain $f_\ell(z_{\ell-1}(x)) = Q_\ell(x)\big(W_\ell\, z_{\ell-1}(x) + b_\ell\big)$ and, by composing the $L$ layers, $f(x) = A_\omega\, x + b_\omega$
with $A_\omega = Q_L(x) W_L \cdots Q_1(x) W_1$ and $b_\omega = \sum_{\ell=1}^{L} \big(Q_L(x) W_L \cdots Q_{\ell+1}(x) W_{\ell+1}\big)\, Q_\ell(x)\, b_\ell$.
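To make the three steps above concrete, here is a minimal sketch (a toy two-layer ReLU network of our own, not the paper's code) that forms $A_\omega$ and $b_\omega$ and checks that $f(x) = A_\omega x + b_\omega$:

```python
import tensorflow as tf

tf.random.set_seed(1)
D, H, K = 4, 5, 3
W1, b1 = tf.random.normal((H, D)), tf.random.normal((H,))   # layer 1 (dense + ReLU)
W2, b2 = tf.random.normal((K, H)), tf.random.normal((K,))   # layer 2 (dense, linear)
x = tf.random.normal((D,))

# Steps 1-2: pre-activation and state matrix Q (Eqs. (7)-(8)); ReLU means alpha = 0.
h1 = tf.linalg.matvec(W1, x) + b1
Q1 = tf.linalg.diag(tf.cast(h1 > 0, tf.float32))
Q2 = tf.eye(K)                                   # last layer has no nonlinearity here

# Step 3: per-region slope and bias of the region containing x.
A = Q2 @ W2 @ Q1 @ W1
b = tf.linalg.matvec(Q2 @ W2 @ Q1, b1) + b2

f_x = tf.linalg.matvec(W2, tf.nn.relu(h1)) + b2  # plain forward pass
print(tf.reduce_max(tf.abs(tf.linalg.matvec(A, x) + b - f_x)).numpy())   # ~0
```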
Recently, many studies have successfully exploited the CPA property of current DNs to derive novel insights and results such as bounding the number of partition regions (Montufar et al., 2014; Arora et al., 2016; Rister & Rubin, 2017; Serra et al., 2018; Hanin & Rolnick, 2019), characterizing the geometry of the partitioning (Balestriero et al., 2019; Robinson et al., 2019; Gamba et al., 2020); and quantifying the expressive power of CPA DNs as function approximators (Unser, 2019; Bartlett et al., 2019; Nadav et al., 2020; Grigsby & Lindsey, 2020).
We propose to follow the same path and exploit the CPA property of DNs to obtain a fast and architecture-independent method to evaluate Jacobian-vector products of DNs that does not require any flavor of automatic differentiation. Our method allows speeding up recent methods employing JVPs and easing their deployment.
3 Fast Jacobian-Vector Product for Continuous Piecewise Affine Deep Networks
As mentioned above, the core of our solution rests on the assumption that the employed DN's nonlinearities are CPA. We first describe our solution and its implementation, and then conclude with thorough empirical validations and with solutions to extend our technique to smooth DNs.
3.1 Deep Network Cloning
When employing CPA nonlinearities, the DN input-output mapping can be expressed as a per-region affine mapping (recall (6)). This key property greatly simplifies the form of the explicit Rop (or JVP) of a DN taken with respect to its input, given an observation $x$ and a directional vector $u$. In fact, we directly have

$J_f(x)\,u = A_{\omega(x)}\,u, \qquad (9)$

where we denote by $\omega(x)$ the region of $\Omega$ enclosing $x$. In the unrealistic setting where one would a priori know and store the $A_\omega$ matrices for each region $\omega \in \Omega$, and instantaneously infer the region $\omega(x)$, the evaluation of (9) would be performed in $O(KD)$ time. This complexity could be reduced in the presence of low-rank (or otherwise peculiar) matrices; such cases do not occur with standard architectures and are thus omitted. In the more realistic scenario, however, direct evaluation of (9) is not possible: each matrix $A_\omega$ tends to be large (and not sparse), with a large input dimension $D$, an output dimension $K$ that can itself be large for classification/regression tasks and as large as $D$ for autoencoders, and a number of regions on the order of hundreds of thousands even for small DNs (Cosentino et al., 2020), which keeps increasing with the DN's number of nodes/layers. As a result, all current solutions rely on automatic differentiation, which constructs only the single JVP of interest, on demand, based on the given observation $x$ and its propagation through the DN layers.
Instead, we propose an alternative solution that does not rely on automatic differentiation. For clarity, we detail our approach first when the JVP is taken with respect to the DN input, and second when the JVP is taken with respect to the parameters of a layer.
JVPs of DN inputs. Recalling (6), we know that forwarding an observation $x$ through the DN layers produces the output $f(x) = A_{\omega(x)}\, x + b_{\omega(x)}$. Now, suppose that we possessed the ability to freeze, or clone, the layers such that the internal $Q_\ell$ matrices (recall (8)) remain the same regardless of the presented input; denote such a DN cloned at $x$ by $f_x$. Then, in all simplicity, by using (6) we obtain the following equality

$f_x(u) - f_x(0) = \big(A_{\omega(x)}\, u + b_{\omega(x)}\big) - b_{\omega(x)} = J_f(x)\,u, \qquad (10)$

producing the desired JVP. Note that the user only needs to perform forward passes through the cloned DN $f_x$, without the need to explicitly compute $A_{\omega(x)}$: the JVP is gradually computed through the forward pass. Computationally, (10) requires only forward passes through the DN, which we group into two steps. Step 1: forward the observation $x$ through all the layers to observe the $Q_\ell$ matrices. Step 2: evaluate the cloned DN at $u$ and at $0$ using the observed matrices from step 1. We depict this cloning and propagation procedure in Fig. 1.
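The following toy sketch (ours; the names and the two-layer ReLU network are illustrative) implements the two steps of (10) and cross-checks the result against TensorFlow's built-in forward-mode JVP:

```python
import tensorflow as tf

tf.random.set_seed(0)
D, H, K = 6, 8, 3
W1, b1 = tf.random.normal((H, D)), tf.random.normal((H,))
W2, b2 = tf.random.normal((K, H)), tf.random.normal((K,))

def forward(z, mask=None):
    """Forward pass; if `mask` is given, the ReLU states are cloned from it."""
    h1 = tf.linalg.matvec(W1, z) + b1
    q1 = tf.cast(h1 > 0, h1.dtype) if mask is None else mask
    return tf.linalg.matvec(W2, q1 * h1) + b2, q1

x, u = tf.random.normal((D,)), tf.random.normal((D,))
_, q1 = forward(x)                        # Step 1: observe the states at x
fu, _ = forward(u, q1)                    # Step 2: cloned pass at u ...
f0, _ = forward(tf.zeros_like(x), q1)     # ... and at 0
jvp_clone = fu - f0                       # Eq. (10): equals J_f(x) u

# Cross-check against TensorFlow's forward-mode automatic differentiation.
with tf.autodiff.ForwardAccumulator(primals=x, tangents=u) as acc:
    y, _ = forward(x)
print(tf.reduce_max(tf.abs(jvp_clone - acc.jvp(y))).numpy())   # ~0 up to float error
```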
JVP of DN layer weights. We now consider the evaluation of a DN's JVP taken with respect to one of the layers' affine parameters, i.e., weights, say $W_\ell$ for some layer $\ell$. Let us first define an auxiliary DN mapping that takes as input the value of the weight of interest ($W_\ell$ in this case). This leads to the following mapping

$g_x(w) \triangleq (f_L \circ \dots \circ f_{\ell+1})\big(\sigma_\ell(R_x\, w + b_\ell)\big), \quad \text{with } g_x(\mathrm{vec}(W_\ell)) = f(x), \qquad (11)$

where $w$ is the flattened version of $W_\ell$, $\sigma_\ell$ denotes layer $\ell$'s nonlinearity, and $R_x$ is the matrix such that projecting $w$ onto it is equivalent to projecting $z_{\ell-1}(x)$ onto $W_\ell$, i.e., $R_x\,\mathrm{vec}(W_\ell) = W_\ell\, z_{\ell-1}(x)$. If $W_\ell$ is a dense matrix then $R_x$ is block diagonal with $D_\ell$ rows and $D_\ell D_{\ell-1}$ columns, each diagonal block being $z_{\ell-1}(x)^{\top}$ (that is, $R_x = I_{D_\ell} \otimes z_{\ell-1}(x)^{\top}$);
if $W_\ell$ were a convolutional layer parameter, then the matrix $R_x$ would encode the patch translations corresponding to the application of the filter. Now, it is clear that taking the JVP of $f$ w.r.t. $W_\ell$ can be done exactly as per (10), since the flattened layer parameter $w$ can be seen as the DN input, leading to

$J_{W_\ell} f(x)\,\mathrm{vec}(U) = \bar{g}_x(\mathrm{vec}(U)) - \bar{g}_x(0), \qquad (12)$

where $U$ is the direction of the JVP and has the same shape as $W_\ell$; in this case, $\bar{g}_x$ represents $g_x$ with the nonlinearity states cloned from the DN evaluated with input $x$ and weight $W_\ell$. Furthermore, thanks to the equality of (11), one does not even need to actually form $R_x$, since performing the usual layer forward propagation with the given parameter and the layer input $z_{\ell-1}(x)$ is equivalent. As a result, regardless of the type of layer, taking the JVP w.r.t. a layer parameter is done simply by (i) propagating $x$ through the DN with $W_\ell$ to observe all the $Q_\ell$ matrices, (ii) propagating the same layer input with $U$ and with $0$ in place of the layer's parameter of interest (keeping all the nonlinearities frozen), and (iii) computing the difference between the two produced quantities. We depict this cloning procedure in Fig. 2.
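Steps (i)-(iii) can be sketched on a toy dense layer as follows (our own illustration, not the paper's code; the reference check uses TensorFlow's forward-mode accumulator):

```python
import tensorflow as tf

tf.random.set_seed(2)
D, H, K = 4, 6, 2
W1 = tf.Variable(tf.random.normal((H, D)))
b1 = tf.random.normal((H,))
W2, b2 = tf.random.normal((K, H)), tf.random.normal((K,))
x = tf.random.normal((D,))
U = tf.random.normal((H, D))              # JVP direction, same shape as W1

def forward(weight, q1=None):
    """Two-layer network seen as a function of the first layer's weight."""
    h1 = tf.linalg.matvec(weight, x) + b1
    q1 = tf.cast(h1 > 0, h1.dtype) if q1 is None else q1
    return tf.linalg.matvec(W2, q1 * h1) + b2, q1

_, q1 = forward(W1)                       # (i) observe the states with the true weight
gU, _ = forward(U, q1)                    # (ii) frozen pass with U ...
g0, _ = forward(tf.zeros_like(U), q1)     # ... and with 0 in place of the weight
jvp_clone = gU - g0                       # (iii) difference, as in Eq. (12)

# Cross-check against TensorFlow's forward-mode JVP w.r.t. W1.
with tf.autodiff.ForwardAccumulator(primals=W1, tangents=U) as acc:
    y, _ = forward(W1)
print(tf.reduce_max(tf.abs(jvp_clone - acc.jvp(y))).numpy())   # ~0 up to float error
```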
We now formalize those two cases in the following theorem, which supports the main results of our study; the proof, as well as many additional examples and details, is provided in Appendix G.1, while implementation details and experiments are provided in the following sections.
Theorem 1.
Evaluating the JVP of a DN $f$ w.r.t. its input at $x$ with direction $u$ (recall (1)), and evaluating the JVP of $f$ at $x$ w.r.t. its layer parameter $W_\ell$ with direction $U$, can be done with

$J_f(x)\,u = f_x(u) - f_x(0) \quad\text{and}\quad J_{W_\ell} f(x)\,\mathrm{vec}(U) = \bar{g}_x(\mathrm{vec}(U)) - \bar{g}_x(0),$

respectively, i.e., with forward passes through the cloned DN only.
Note that, in practice, the forward passes are done synchronously in a single pass by concatenating the inputs into a single mini-batch. Leveraging the highly parallel computations enabled by GPUs, Thm. 1 enables rapid evaluation of JVPs without requiring any automatic differentiation, as we detail in the following section.
3.2 Implementation
We describe in detail the implementation of our method, which only takes the form of a few lines of code. We focus here on taking the JVP of a DN with respect to its input as given by (10), using TensorFlow (Abadi et al., 2015) for concreteness, and refer the reader to Appendix E for the case of the JVP taken with respect to layer weights (12) and for further details and discussion of the implementation.

As explained in the previous section, our method requires forward-passing three different DN inputs, $x$, $u$ and $0$ (recall (10)), and cloning the nonlinearity states ($Q_\ell$) observed during the forward pass of $x$ for the forward passes of $u$ and $0$. This is implemented as follows:
1. Given a DN of interest $f$, replace all the nonlinearities (e.g., tf.nn.relu and tf.nn.max_pool) with the custom ones provided in Algo. 3, leading to a custom DN denoted as F;
2. Propagate through this custom DN a single input formed as X=tf.concat([x,u,tf.zeros_like(x)], 0), which concatenates the three needed inputs over the mini-batch dimension; the shapes of x and u must be identical, and one can process a single or multiple JVPs at once based on the shapes of those inputs;
3. Obtain the needed quantities to compute the JVP from (10) as $f_x(u)$=F(X)[len(x):2*len(x)] and $f_x(0)$=F(X)[2*len(x):]; the usual DN output $f(x)$=F(X)[:len(x)] is computed as a side effect. A minimal sketch of such a custom cloned nonlinearity is given right after this list.
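For illustration, here is what such a custom cloned ReLU could look like (this is our own simplified sketch, not the paper's Algo. 3; clone_relu is a hypothetical name, and the mini-batch is assumed to be the concatenation [x, u, 0] from step 2):

```python
import tensorflow as tf

def clone_relu(h):
    """Cloned ReLU: the states are computed on the first third of the mini-batch
    (the x samples) and reused, frozen, on the other two thirds (the u and 0 samples).
    A similar cloning applies to max-pooling via the argmax indices of the first third."""
    n = tf.shape(h)[0] // 3
    mask = tf.cast(h[:n] > 0, h.dtype)            # states Q observed on x
    return tf.concat([mask, mask, mask], 0) * h   # same frozen states for u and 0

# Usage with a DN F built with clone_relu in place of tf.nn.relu:
#   X = tf.concat([x, u, tf.zeros_like(x)], 0)
#   Y = F(X)
#   jvp = Y[len(x):2 * len(x)] - Y[2 * len(x):]   # Eq. (10)
```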
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | |
Quadro RTX 8000 | RNN | 5M | 0.264 | - | 0.233 | 0.036 | 6.47 | - | - | 0.900 | 0.129 | 6.98 |
UNet | 49M | - | 0.047 | 0.390 | 0.024 | 1.63 | - | 0.059 | 0.037 | 0.022 | 1.68 | |
VGG16 | 138M | 0.038 | 0.014 | 0.014 | 0.005 | 2.80 | - | 0.168 | 0.158 | 0.046 | 3.43 | |
VGG19 | 143M | 0.048 | 0.020 | 0.017 | 0.006 | 2.83 | - | 0.210 | 0.193 | 0.059 | 3.27 | |
Inception V3 | 23M | 0.037 | 0.030 | 0.025 | 0.016 | 1.56 | - | 0.028 | 0.027 | 0.019 | 1.42 | |
Resnet50 | 25M | 0.030 | 0.019 | 0.015 | 0.012 | 1.25 | - | 0.028 | 0.020 | 0.018 | 1.12 | |
Resnet101 | 44M | 0.051 | 0.029 | 0.024 | 0.022 | 1.09 | - | 0.038 | 0.036 | 0.030 | 1.13 | |
Resnet152 | 60M | 0.680 | 0.037 | 0.039 | 0.029 | 1.28 | - | 0.053 | 0.049 | 0.036 | 1.36 | |
EfficientNet B0 | 5M | 0.053 | 0.023 | 0.018 | 0.013 | 1.38 | - | 0.032 | 0.021 | 0.015 | 1.40 | |
EfficientNet B1 | 7M | 0.073 | 0.028 | 0.024 | 0.018 | 1.33 | - | 0.032 | 0.025 | 0.019 | 1.32 | |
Densenet121 | 8M | 0.040 | 0.036 | 0.033 | 0.022 | 1.50 | - | 0.036 | 0.033 | 0.025 | 1.32 | |
Densenet169 | 14M | 0.054 | 0.046 | 0.034 | 0.026 | 1.31 | - | 0.047 | 0.037 | 0.028 | 1.32 | |
Densenet202 | 17M | 0.070 | 0.056 | 0.042 | 0.033 | 1.27 | - | 0.055 | 0.043 | 0.033 | 1.30 | |
arithmetic average speedup from fastest alternative: | 1.98 | 2.08 | ||||||||||
geometric average speedup from fastest alternative: | 1.71 | 1.74 | ||||||||||
TITAN X (Pascal) | RNN | 5M | - | - | 0.220 | 0.039 | 5.64 | - | - | - | 0.134 | |
UNet | 49M | - | 0.083 | 0.069 | 0.04 | 1.72 | - | 0.118 | 0.084 | 0.057 | 1.47 | |
VGG16 | 138M | 0.059 | 0.026 | 0.019 | 0.012 | 1.58 | - | 0.122 | 0.088 | 0.062 | 1.42 | |
VGG19 | 143M | 0.072 | 0.034 | 0.026 | 0.017 | 1.53 | - | 0.133 | 0.101 | 0.077 | 1.31 | |
Inception V3 | 23M | 0.043 | 0.033 | 0.028 | 0.017 | 1.65 | - | 0.038 | 0.033 | 0.019 | 1.74 | |
Resnet50 | 25M | 0.050 | 0.019 | 0.016 | 0.011 | 1.45 | - | 0.022 | 0.017 | 0.013 | 1.31 | |
Resnet101 | 44M | 0.069 | 0.032 | 0.029 | 0.02 | 1.45 | - | 0.041 | 0.032 | 0.026 | 1.23 | |
Resnet152 | 60M | 0.091 | 0.051 | 0.041 | 0.028 | 1.46 | - | 0.057 | 0.046 | 0.033 | 1.39 | |
EfficientNet B0 | 5M | 0.064 | 0.025 | 0.021 | 0.01 | 2.10 | - | 0.030 | 0.024 | 0.012 | 2.00 | |
EfficientNet B1 | 7M | 0.086 | 0.035 | 0.028 | 0.014 | 2.00 | - | 0.040 | 0.033 | 0.016 | 2.06 | |
Densenet121 | 8M | 0.055 | 0.046 | 0.036 | 0.022 | 1.64 | - | 0.055 | 0.040 | 0.025 | 1.60 | |
Densenet169 | 14M | 0.077 | 0.071 | 0.049 | 0.029 | 1.69 | - | 0.076 | 0.055 | 0.033 | 1.67 | |
Densenet202 | 17M | 0.094 | 0.083 | 0.051 | 0.035 | 1.46 | - | 0.087 | 0.064 | 0.04 | 1.60 | |
arithmetic average speedup from fastest alternative: | 1.95 | 1.57 | ||||||||||
geometric average speedup from fastest alternative: | 1.79 | 1.49 |
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | |
Quadro RTX 8000 | RNN | 5M | 0.245 | - | 0.227 | 0.047 | 4.83 | - | - | 0.877 | 0.163 | 5.38 |
UNet | 49M | - | 0.048 | 0.04 | 0.023 | 1.74 | - | 0.062 | 0.042 | 0.022 | 1.91 | |
VGG16 | 138M | 0.036 | 0.014 | 0.014 | 0.005 | 2.80 | - | 0.168 | 0.157 | 0.047 | 3.34 | |
VGG19 | 143M | 0.046 | 0.021 | 0.016 | 0.006 | 2.67 | - | 0.205 | 0.191 | 0.055 | 3.47 | |
Inception V3 | 23M | 0.032 | 0.033 | 0.023 | 0.017 | 1.35 | - | 0.031 | 0.026 | 0.018 | 1.44 | |
Resnet50 | 25M | 0.027 | 0.019 | 0.015 | 0.012 | 1.25 | - | 0.028 | 0.021 | 0.019 | 1.11 | |
Resnet101 | 44M | 0.051 | 0.034 | 0.026 | 0.021 | 1.24 | - | 0.038 | 0.032 | 0.029 | 1.10 | |
Resnet152 | 60M | 0.059 | 0.039 | 0.035 | 0.028 | 1.25 | - | 0.052 | 0.042 | 0.035 | 1.20 | |
EfficientNet B0 | 5M | 0.051 | 0.022 | 0.018 | 0.013 | 1.38 | - | 0.023 | 0.020 | 0.015 | 1.33 | |
EfficientNet B1 | 7M | 0.072 | 0.028 | 0.024 | 0.016 | 1.50 | - | 0.028 | 0.027 | 0.018 | 1.50 | |
Densenet121 | 8M | 0.039 | 0.036 | 0.031 | 0.024 | 1.29 | - | 0.038 | 0.030 | 0.024 | 1.25 | |
Densenet169 | 14M | 0.062 | 0.055 | 0.035 | 0.028 | 1.25 | - | 0.048 | 0.035 | 0.028 | 1.25 | |
Densenet201 | 17M | 0.064 | 0.055 | 0.041 | 0.031 | 1.32 | - | 0.062 | 0.042 | 0.037 | 1.14 | |
arithmetic average speedup from fastest alternative: | 1.84 | 1.96 | ||||||||||
geometric average speedup from fastest alternative: | 1.66 | 1.69 | ||||||||||
TITAN X (Pascal) | RNN | 5M | - | - | 0.219 | 0.045 | 4.87 | - | - | - | 0.153 | |
UNet | 49M | - | 0.084 | 0.068 | 0.039 | 1.74 | - | 0.114 | 0.079 | 0.059 | 1.34 | |
VGG16 | 138M | 0.059 | 0.024 | 0.019 | 0.012 | 1.58 | - | 0.114 | 0.085 | 0.062 | 1.37 | |
VGG19 | 143M | 0.072 | 0.032 | 0.025 | 0.016 | 1.56 | - | 0.137 | 0.103 | 0.089 | 1.16 | |
Inception V3 | 23M | 0.041 | 0.033 | 0.029 | 0.017 | 1.71 | - | 0.039 | 0.035 | 0.021 | 1.67 | |
Resnet50 | 25M | 0.048 | 0.019 | 0.015 | 0.011 | 1.36 | - | 0.023 | 0.021 | 0.015 | 1.4 | |
Resnet101 | 44M | 0.067 | 0.035 | 0.030 | 0.019 | 1.58 | - | 0.040 | 0.037 | 0.029 | 1.28 | |
Resnet152 | 60M | 0.090 | 0.053 | 0.042 | 0.028 | 1.50 | - | 0.057 | 0.046 | 0.036 | 1.28 | |
EfficientNet B0 | 5M | 0.063 | 0.025 | 0.020 | 0.010 | 2.00 | - | 0.029 | 0.025 | 0.012 | 2.08 | |
EfficientNet B1 | 7M | 0.092 | 0.034 | 0.029 | 0.014 | 2.07 | - | 0.042 | 0.034 | 0.018 | 1.89 | |
Densenet121 | 8M | 0.054 | 0.046 | 0.035 | 0.021 | 1.67 | - | 0.058 | 0.042 | 0.025 | 1.68 | |
Densenet169 | 14M | 0.075 | 0.068 | 0.048 | 0.029 | 1.66 | - | 0.075 | 0.057 | 0.033 | 1.72 | |
Densenet201 | 17M | 0.089 | 0.081 | 0.057 | 0.035 | 1.63 | - | 0.086 | 0.066 | 0.039 | 1.69 | |
arithmetic average speedup factor from fastest alternative: | 1.92 | 1.55 | ||||||||||
geometric average speedup from fastest alternative: | 1.80 | 1.47 |
As clarified by the above implementation walk-through, our method offers two great advantages. First, the details and peculiarities of the considered DN do not impact nor complicate the implementation of our method: all that needs to be done is to replace the nonlinearity functions in the model declaration with the provided ones, leaving everything else untouched. Second, our method can be implemented in languages/libraries that are not equipped with automatic differentiation. This is an important benefit, as it allows deployment on platforms that are not focused on deep learning. Furthermore, specifically designed chips (Ando et al., 2017) can employ our solution with no additional difficulty beyond implementing a standard DN forward pass. The next section provides various experiments comparing the speed of our method to alternative solutions.
3.3 Speed Validation Across Architectures and Hardware

We first propose in Fig. 3 to depict the slow-down incurred by using the custom nonlinearities from Algo. 3, which process the concatenated input $[x, u, 0]$, instead of the default nonlinearities with input $x$, for different input shapes and mini-batch sizes. As discussed in Sec. 3.2, despite the need to clone the nonlinearity matrices and to process three times more inputs, we observe that the slow-down factor remains limited, thanks to the highly parallel platform offered by GPUs. This serves as a good indicator to practitioners that the computation time of the JVP evaluation proposed in this paper will generally remain on par with the computation time needed to perform a forward pass through the original DN employing the default nonlinearities. We also report in Table 1 and Table 2 the computation times to evaluate the JVP of multiple DN architectures, taken with respect to their input and layer weights, respectively. We employ various architectures and also compare the performance of different GPUs (in all cases using a single GPU). We observe that our solution is the fastest, with speed-up factors varying based on the considered architecture, ranging from roughly 1 to almost 7. In addition to accelerating JVP evaluations, our solution enables JVP evaluation for models like RNNs, where the alternative methods fail to provide any result due to Out-Of-Memory failures, most likely caused by the overhead that automatic differentiation incurs from the recurrence mechanism.
4 Fast Jacobian-Vector Product Extensions
We now demonstrate how (fast) JVPs can be used to compute important quantities that have been shown to characterize DNs, e.g., the largest eigenvalues and eigenvectors of $A_{\omega(x)}$, the per-region slope matrix that produces the DN output (recall (6)). Recall that, for large models, extracting $A_{\omega(x)}$ entirely and then performing any sort of analysis is not feasible (a $K$ by $D$ matrix, with $D$ above $10^5$ for datasets à la ImageNet (Deng et al., 2009)). It is thus crucial to evaluate the desired quantities only through JVP calls.
4.1 Deep Network Spectroscopy

[Figure 4: convergence, as a function of the number of iterations, of the top eigenvalues and of the Frobenius norm estimate of a UNet's per-region slope matrix, computed through JVP calls only.]
First, we demonstrate how the DN cloning recipe allows one to compute the per-region affine parameters.
Proposition 1.
The DN cloning of (10) allows efficient computation of the per-region slope and bias parameters from (9) given a sample $x$ as

$[A_{\omega(x)}]_{\cdot,i} = f_x(e_i) - f_x(0), \; i = 1, \dots, D, \qquad b_{\omega(x)} = f_x(0),$

where $e_i$ represents the $i$-th element of the canonical basis of $\mathbb{R}^D$.
Another quantity that can easily be obtained from the DN cloning and Rop evaluation concerns the spectrum of the slope matrix $A_{\omega(x)}$, which has been theoretically studied for DNs in the context of data augmentation (LeJeune et al., 2019) and dropout (Gal & Ghahramani, 2016). It is possible to efficiently compute the top-k spectral decomposition of such matrices through iterative procedures, as given in Algo. 2 from Bentbib & Kanber (2015), leveraging the JVP, with the same asymptotic complexity as a simple forward propagation of an input through the DN (see Appendix H for the proof). An illustrative application on a UNet is provided in Fig. 4. This demonstrates the ability of such analyses to finally enable tractable computation of key quantities for DNs; it should enable principled empirical validation of various theoretical studies and open the door to novel interpretability and visualization techniques. A minimal sketch of such a matrix-free procedure is given below.
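As an illustration, below is a matrix-free sketch of such an iterative procedure (a plain subspace/block power iteration of our own, simpler than Algo. 2 of Bentbib & Kanber (2015)); the helper jvp_fn(V) is assumed to return $A_{\omega(x)} V$ column by column, e.g., by feeding the columns of V as a mini-batch of directions to the cloned DN of (10).

```python
import tensorflow as tf

def top_k_eigenvalues(jvp_fn, dim, k=3, iters=100):
    """Subspace (block power) iteration for the dominant eigenvalues of the square
    per-region slope matrix, using only matrix-free JVP calls."""
    V, _ = tf.linalg.qr(tf.random.normal((dim, k)))   # random orthonormal start
    for _ in range(iters):
        V, _ = tf.linalg.qr(jvp_fn(V))                # power step + re-orthonormalization
    T = tf.transpose(V) @ jvp_fn(V)                   # k x k Rayleigh quotient
    return tf.linalg.diag_part(T), V                  # approximate eigenvalues / subspace

# Toy usage on an explicit matrix standing in for A_{omega(x)}:
A = tf.random.normal((50, 50))
A = A @ tf.transpose(A)                               # symmetric, well-separated spectrum
vals, _ = top_k_eigenvalues(lambda V: A @ V, dim=50, k=3)
print(vals)
```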
4.2 Frobenius Norm Estimation
We also demonstrate how the JVP provides an efficient solution to estimate the Frobenius norm of the per-region slope matrix. This norm has been theoretically studied for DNs and was found to be an indicator of training failures of deep generative networks (Tanielian et al., 2020); rapid evaluation of it would thus allow practitioners to detect such cases. It can be done as per Gudmundsson et al. (1995) with

$\|A_{\omega(x)}\|_F^2 = D\; \mathbb{E}_u\big[\|A_{\omega(x)}\, u\|_2^2\big], \qquad (13)$

with $u$ a random variable uniformly distributed on the unit sphere, e.g., a normalized Gaussian vector (Devroye, 1986; Calafiore et al., 1999). Hence, (13) can easily be estimated via Monte-Carlo sampling by repeatedly performing JVPs, as we demonstrate in Fig. 4 and sketch below. While the above illustrates a particular case, the same recipe can be employed to estimate the trace and log-determinant of Hermitian positive semi-definite matrices (Saibaba et al., 2017), or to estimate condition numbers (Kenney et al., 1998). For an in-depth study of such numerical methods built upon JVPs, we refer the reader to Trefethen & Bau III (1997); Ipsen & Meyer (1998); Meyer (2000).
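A Monte-Carlo sketch of the estimator in (13) is given below (our own illustration; jvp_fn(u) is an assumed helper returning $A_{\omega(x)} u$, e.g., via the cloned DN of (10)).

```python
import tensorflow as tf

def frobenius_norm_estimate(jvp_fn, dim, n_samples=64):
    """Monte-Carlo estimate of ||A_{omega(x)}||_F from Eq. (13) using only JVP calls."""
    total = 0.0
    for _ in range(n_samples):
        u = tf.random.normal((dim,))
        u = u / tf.norm(u)                         # uniform direction on the unit sphere
        total += tf.reduce_sum(jvp_fn(u) ** 2)     # ||A u||_2^2
    return tf.sqrt(dim * total / n_samples)

# Toy usage on an explicit matrix standing in for A_{omega(x)}:
A = tf.random.normal((40, 30))
print(frobenius_norm_estimate(lambda u: tf.linalg.matvec(A, u), dim=30).numpy(),
      tf.norm(A).numpy())                          # the two values should be close
```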
5 Conclusions
Automatic differentiation and backpropagation are behind most, if not all, recent successes in deep learning, yielding not only state-of-the-art results but also fueling the discovery of new architectures, normalizations, and the like. This first wave of research relied almost exclusively on backward differentiation (gradient computation), but recently many techniques further pushing DN performance have been relying on Jacobian-vector products (JVPs). Current JVP evaluations rely on automatic differentiation, making them slower and less memory efficient than direct evaluation, which in turn would require a hand-written implementation for each architecture. We demonstrated in this paper that current DNs, which employ nonlinearities such as (leaky-)ReLU and max-pooling, can have their JVPs evaluated directly, without the need for automatic differentiation and without hard-coding each architecture, simply by changing a few lines of code in their implementation. We validated our proposed method against alternative implementations and demonstrated significant speed-ups, of a factor of almost two on average, over different architectures and hardware. We also demonstrated that fast JVPs will help the development of novel analyses and theoretical techniques for CPA DNs that require the evaluation of quantities such as the eigenvectors/eigenvalues of the per-region slope matrices.
6 Acknowledgments
This work was supported by NSF grants CCF-1911094, IIS-1838177, and IIS-1730574; ONR grants N00014-18-12571, N00014-20-1-2787, and N00014-20-1-2534; AFOSR grant FA9550-18-1-0478; and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047.
References
- Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
- Ando et al. (2017) Ando, K., Ueyoshi, K., Orimo, K., Yonekawa, H., Sato, S., Nakahara, H., Takamaeda-Yamazaki, S., Ikebe, M., Asai, T., Kuroda, T., et al. Brein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 tops at 0.6 w. IEEE Journal of Solid-State Circuits, 53(4):983–994, 2017.
- Arora et al. (2016) Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.
- Balestriero & Baraniuk (2020) Balestriero, R. and Baraniuk, R. G. Mad max: Affine spline insights into deep learning. Proceedings of the IEEE, pp. 1–24, 2020. doi: 10.1109/JPROC.2020.3042100.
- Balestriero et al. (2019) Balestriero, R., Cosentino, R., Aazhang, B., and Baraniuk, R. The geometry of deep networks: Power diagram subdivision. arXiv preprint arXiv:1905.08443, 2019.
- Balestriero et al. (2020) Balestriero, R., Paris, S., and Baraniuk, R. Max-affine spline insights into deep generative networks. arXiv preprint arXiv:2002.11912, 2020.
- Bartlett et al. (2019) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res., 20:63–1, 2019.
- Baydin et al. (2017) Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research, 18(1):5595–5637, 2017.
- Bentbib & Kanber (2015) Bentbib, A. and Kanber, A. Block power method for svd decomposition. Analele Universitatii” Ovidius” Constanta-Seria Matematica, 23(2):45–58, 2015.
- Bojanov et al. (2013) Bojanov, B. D., Hakopian, H., and Sahakian, B. Spline functions and multivariate interpolations, volume 248. Springer Science & Business Media, 2013.
- Calafiore et al. (1999) Calafiore, G., Dabbene, F., and Tempo, R. Radial and uniform distributions in vector and matrix spaces for probabilistic robustness. In Topics in Control and its Applications, pp. 17–31. Springer, 1999.
- Cohen & Welling (2016) Cohen, T. and Welling, M. Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999, 2016.
- Cosentino et al. (2020) Cosentino, R., Balestriero, R., Baraniuk, R., and Aazhang, B. Provable finite data generalization with group autoencoder, 2020.
- De Boor et al. (1978) De Boor, C. A practical guide to splines, volume 27. Springer-Verlag, New York, 1978.
- Dehaene et al. (2020) Dehaene, D., Frigo, O., Combrexelle, S., and Eline, P. Iterative energy-based projection on a normal data manifold for anomaly localization. arXiv preprint arXiv:2002.03734, 2020.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee, 2009.
- Devarakonda et al. (2017) Devarakonda, A., Naumov, M., and Garland, M. Adabatch: Adaptive batch sizes for training deep neural networks. arXiv preprint arXiv:1712.02029, 2017.
- Devroye (1986) Devroye, L. Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation, pp. 260–265, 1986.
- Elliott (2018) Elliott, C. The simple essence of automatic differentiation. Proceedings of the ACM on Programming Languages, 2(ICFP):1–29, 2018.
- Elsken et al. (2019) Elsken, T., Metzen, J. H., Hutter, F., et al. Neural architecture search: A survey. J. Mach. Learn. Res., 20(55):1–21, 2019.
- Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
- Gamba et al. (2020) Gamba, M., Carlsson, S., Azizpour, H., and Björkman, M. Hyperplane arrangements of trained convnets are biased. arXiv preprint arXiv:2003.07797, 2020.
- Gianfrancesco et al. (2018) Gianfrancesco, M. A., Tamang, S., Yazdany, J., and Schmajuk, G. Potential biases in machine learning algorithms using electronic health record data. JAMA internal medicine, 178(11):1544–1547, 2018.
- Golub & Kahan (1965) Golub, G. and Kahan, W. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 2(2):205–224, 1965.
- Golub & Van Loan (1996) Golub, G. H. and Van Loan, C. F. Matrix computations johns hopkins university press. Baltimore and London, 1996.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, volume 1. MIT Press, 2016. http://www.deeplearningbook.org.
- Graves et al. (2013) Graves, A., Jaitly, N., and Mohamed, A.-r. Hybrid speech recognition with deep bidirectional lstm. In 2013 IEEE workshop on automatic speech recognition and understanding, pp. 273–278. IEEE, 2013.
- Griewank & Walther (2008) Griewank, A. and Walther, A. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.
- Grigsby & Lindsey (2020) Grigsby, J. E. and Lindsey, K. On transversality of bent hyperplane arrangements and the topological expressiveness of relu neural networks. arXiv preprint arXiv:2008.09052, 2020.
- Gudmundsson et al. (1995) Gudmundsson, T., Kenney, C. S., and Laub, A. J. Small-sample statistical estimates for matrix norms. SIAM journal on matrix analysis and applications, 16(3):776–792, 1995.
- Hanin & Rolnick (2019) Hanin, B. and Rolnick, D. Complexity of linear regions in deep networks. arXiv preprint arXiv:1901.09021, 2019.
- Hernández-García & König (2018) Hernández-García, A. and König, P. Data augmentation instead of explicit regularization. arXiv preprint arXiv:1806.03852, 2018.
- Hirsch & Smale (1974) Hirsch, M. and Smale, S. Differential equations, dynamical systems, and linear algebra (pure and applied mathematics, vol. 60). 1974.
- Ipsen & Meyer (1998) Ipsen, I. C. F. and Meyer, C. D. The idea behind krylov methods. The American Mathematical Monthly, 105(10):889–899, 1998. doi: 10.1080/00029890.1998.12004985. URL https://doi.org/10.1080/00029890.1998.12004985.
- Jamie et al. (2017) Jamie, T., Alexandre, P., and Matthew, J. generating forward-mode graphs by calling tf.gradients twice. https://github.com/renmengye/tensorflow-forward-ad/issues/2, 2017.
- Jia et al. (2019) Jia, K., Li, S., Wen, Y., Liu, T., and Tao, D. Orthogonal deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019. doi: 10.1109/TPAMI.2019.2948352.
- Kenney et al. (1998) Kenney, C. S., Laub, A. J., and Reese, M. Statistical condition estimation for linear systems. SIAM Journal on Scientific Computing, 19(2):566–583, 1998.
- Kim & Lee (2019) Kim, B. and Lee, J. A video-based fire detection using deep learning models. Applied Sciences, 9(14):2862, 2019.
- Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Lawrence et al. (1997) Lawrence, S., Giles, C. L., Tsoi, A. C., and Back, A. D. Face recognition: A convolutional neural-network approach. IEEE transactions on neural networks, 8(1):98–113, 1997.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436–444, 2015.
- LeJeune et al. (2019) LeJeune, D., Balestriero, R., Javadi, H., and Baraniuk, R. G. Implicit rugosity regularization via data augmentation. arXiv preprint arXiv:1905.11639, 2019.
- Lin et al. (2013) Lin, M., Chen, Q., and Yan, S. Network in network. arXiv preprint arXiv:1312.4400, 2013.
- Linnainmaa (1976) Linnainmaa, S. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976.
- Marcus (2018) Marcus, G. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018.
- Márquez-Neila et al. (2017) Márquez-Neila, P., Salzmann, M., and Fua, P. Imposing hard constraints on deep networks: Promises and limitations. arXiv preprint arXiv:1706.02025, 2017.
- Meyer (2000) Meyer, C. D. Matrix analysis and applied linear algebra, volume 71. Siam, 2000.
- Molchanov et al. (2017) Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
- Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932, 2014.
- Nadav et al. (2020) Nadav, D., Barak, S., and Ingrid, D. Expression of fractals through neural network functions. IEEE Journal on Selected Areas in Information Theory, 2020.
- Naumann (2008) Naumann, U. Optimal jacobian accumulation is np-complete. Mathematical Programming, 112(2):427–441, 2008.
- Pascanu et al. (2013) Pascanu, R., Montufar, G., and Bengio, Y. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.
- Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
- Perez & Wang (2017) Perez, L. and Wang, J. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.
- Rao & Frtunikj (2018) Rao, Q. and Frtunikj, J. Deep learning for self-driving cars: chances and challenges. In Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pp. 35–38, 2018.
- Rister & Rubin (2017) Rister, B. and Rubin, D. L. Piecewise convexity of artificial neural networks. Neural Networks, 94:34–45, 2017.
- Robinson et al. (2019) Robinson, H., Rasheed, A., and San, O. Dissecting deep neural networks. arXiv preprint arXiv:1910.03879, 2019.
- Roy et al. (2018) Roy, A., Sun, J., Mahoney, R., Alonzi, L., Adams, S., and Beling, P. Deep learning detecting fraud in credit card transactions. In 2018 Systems and Information Engineering Design Symposium (SIEDS), pp. 129–134. IEEE, 2018.
- Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
- Rushin et al. (2017) Rushin, G., Stancil, C., Sun, M., Adams, S., and Beling, P. Horse race analysis in credit card fraud—deep learning, logistic regression, and gradient boosted tree. In 2017 systems and information engineering design symposium (SIEDS), pp. 117–121. IEEE, 2017.
- Saibaba et al. (2017) Saibaba, A. K., Alexanderian, A., and Ipsen, I. C. Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2):353–395, 2017.
- Schumaker (2007) Schumaker, L. Spline functions: basic theory. Cambridge University Press, 2007.
- Serra et al. (2018) Serra, T., Tjandraatmadja, C., and Ramalingam, S. Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, pp. 4558–4566. PMLR, 2018.
- Seydoux et al. (2020) Seydoux, L., Balestriero, R., Poli, P., De Hoop, M., Campillo, M., and Baraniuk, R. Clustering earthquake signals and background noises in continuous seismic data with unsupervised deep learning. Nature communications, 11(1):1–12, 2020.
- Shorten & Khoshgoftaar (2019) Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.
- Smith (2017) Smith, L. N. Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pp. 464–472. IEEE, 2017.
- Strang (2019) Strang, G. Linear algebra and learning from data. Wellesley-Cambridge Press, 2019.
- Tanielian et al. (2020) Tanielian, U., Issenhuth, T., Dohmatob, E., and Mary, J. Learning disconnected manifolds: a no gan’s land. In International Conference on Machine Learning, pp. 9418–9427. PMLR, 2020.
- Taylor & Nitschke (2017) Taylor, L. and Nitschke, G. Improving deep learning using generic data augmentation. arXiv preprint arXiv:1708.06020, 2017.
- Trefethen & Bau III (1997) Trefethen, L. N. and Bau III, D. Numerical linear algebra, volume 50. Siam, 1997.
- Unser (2019) Unser, M. A representer theorem for deep neural networks. Journal of Machine Learning Research, 20(110):1–30, 2019.
- Wang et al. (2017) Wang, H., Ren, K., and Song, J. A closer look at batch size in mini-batch training of deep auto-encoders. In 2017 3rd IEEE International Conference on Computer and Communications (ICCC), pp. 2756–2761. IEEE, 2017.
- Wang et al. (2020) Wang, J., Chen, Y., Chakraborty, R., and Yu, S. X. Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11505–11515, 2020.
- Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Zheng et al. (2018) Zheng, Y.-J., Zhou, X.-H., Sheng, W.-G., Xue, Y., and Chen, S.-Y. Generative adversarial network based telecom fraud detection at the receiving bank. Neural Networks, 102:78–86, 2018.
Appendix A Details on Jacobian Vector Products
The differential of an operator $f$ allows one to express the local growth of each of its output dimensions as a linear term in the growth of each input dimension, plus a small error term. This information is formatted into a Jacobian matrix which has the form

$J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(x) & \dots & \frac{\partial f_1}{\partial x_D}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_K}{\partial x_1}(x) & \dots & \frac{\partial f_K}{\partial x_D}(x) \end{bmatrix} \in \mathbb{R}^{K \times D},$

where $K$ is the output dimension of the operator $f$. Note that one can also express the Jacobian matrix in terms of the gradient operator taken for each output dimension, as

$J_f(x) = \big[\nabla f_1(x), \dots, \nabla f_K(x)\big]^{\top}.$

One has the following property when assuming that $f$ is differentiable in $\mathcal{U}$, where $\mathcal{U}$ is an open subspace of $\mathbb{R}^D$ containing $x$:

$f(x + \epsilon) = f(x) + J_f(x)\,\epsilon + o(\|\epsilon\|),$

where it should be clear that the tangent space induced by the Jacobian matrix allows one to locally describe the operator, with an approximation error that increases as one moves away from $x$, the point at which the Jacobian matrix has been evaluated.
Appendix B Proof of Rop Evaluation From Double Lop
It is possible to compute the Rop from two Lop calls via

$\mathrm{Rop}(f,x,u) = \mathrm{Lop}\big(v \mapsto \mathrm{Lop}(f,x,v),\, v,\, u\big)^{\top},$

which we now prove. First, we have that

$\mathrm{Lop}(f,x,v) = v^{\top} J_f(x);$

denote this linear mapping in $v$ as $g(v) = v^{\top} J_f(x)$, whose Jacobian is $J_g(v) = J_f(x)^{\top}$. We then have

$\mathrm{Lop}(g,v,u) = u^{\top} J_g(v) = u^{\top} J_f(x)^{\top} = \big(J_f(x)\,u\big)^{\top} = \mathrm{Rop}(f,x,u)^{\top},$

which completes the proof.
Appendix C Max-Pooling Nonlinearity Matrix
We saw in (8) that the nonlinearity matrices $Q_\ell$ of activation operators are square, diagonal, and with entries based on the employed activation, e.g., $\{0,1\}$ for ReLU and $\{-1,1\}$ for absolute value. This is because the activation operator maintains the same dimension from its input to its output. A max-pooling operator, however, reduces the dimensionality of its input based on some policy, for concreteness say by performing spatial pooling over non-overlapping regions of its input. Hence the $Q$ matrix of the max-pooling operator is rectangular, in this particular case with a number of rows equal to the number of columns (the input dimension) divided by the size of the pooling regions. Since we consider flattened versions of the input tensors, each output entry of the pooling corresponds to a set of indices of the flattened input vector (the indices of its pooling region, across the spatial dimensions and within the same channel). As a result, the rectangular $Q$ matrix is filled with 0s and 1s, with a single 1 per row; the position of this 1 in each row can be any of the indices of the corresponding pooling region, namely the one achieving the argmax of the input values at those indices. A minimal sketch of how such a matrix can be formed is given below.
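The following sketch (our own, restricted to 1-D non-overlapping pooling for clarity; maxpool_state_matrix is a hypothetical helper name) builds this rectangular matrix from the argmax positions:

```python
import tensorflow as tf

def maxpool_state_matrix(h, pool=2):
    """Rectangular 0/1 state matrix Q of a 1-D non-overlapping max-pooling operator,
    built from the argmax position inside each pooling region of the pre-activation h."""
    n = tf.shape(h)[0]
    regions = tf.reshape(h, (-1, pool))                          # non-overlapping regions
    winners = tf.argmax(regions, axis=1, output_type=tf.int32)   # argmax inside each region
    cols = tf.range(0, n, pool) + winners                        # winning flat-input index
    rows = tf.range(tf.shape(cols)[0])
    return tf.scatter_nd(tf.stack([rows, cols], axis=1),
                         tf.ones_like(cols, dtype=tf.float32),
                         tf.stack([tf.shape(cols)[0], n]))

h = tf.constant([0.3, -1.2, 2.0, 0.1, -0.5, -0.4])
Q = maxpool_state_matrix(h, pool=2)
print(tf.linalg.matvec(Q, h).numpy())   # [0.3, 2.0, -0.4]: the max-pooled output
```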
Appendix D Implementation Details
D.1 Software
We employed the latest stable version of TensorFlow at the time of our study, namely version 2.4.0; for all additional details and the versions of supporting libraries like NumPy, we refer the reader to our GitHub page.
D.2 Hardware
TITAN X (Pascal): The TITAN X Pascal is an enthusiast-class graphics card by NVIDIA, launched in August 2016. Built on the 16 nm process, and based on the GP102 graphics processor, in its GP102-400-A1 variant, the card supports DirectX 12. This ensures that all modern games will run on TITAN X Pascal. The GP102 graphics processor is a large chip with a die area of 471 mm2 and 11,800 million transistors. Unlike the fully unlocked TITAN Xp, which uses the same GPU but has all 3840 shaders enabled, NVIDIA has disabled some shading units on the TITAN X Pascal to reach the product’s target shader count. It features 3584 shading units, 224 texture mapping units, and 96 ROPs. NVIDIA has paired 12 GB GDDR5X memory with the TITAN X Pascal, which are connected using a 384-bit memory interface. The GPU is operating at a frequency of 1417 MHz, which can be boosted up to 1531 MHz, memory is running at 1251 MHz (10 Gbps effective). Being a dual-slot card, the NVIDIA TITAN X Pascal draws power from 1x 6-pin + 1x 8-pin power connector, with power draw rated at 250 W maximum. Display outputs include: 1x DVI, 1x HDMI, 3x DisplayPort. TITAN X Pascal is connected to the rest of the system using a PCI-Express 3.0 x16 interface. The card’s dimensions are 267 mm x 112 mm x 40 mm, and it features a dual-slot cooling solution. Its price at launch was 1199 US Dollars.
GeForce GTX 1080 Ti: The GeForce GTX 1080 Ti is an enthusiast-class graphics card by NVIDIA, launched in March 2017. Built on the 16 nm process, and based on the GP102 graphics processor, in its GP102-350-K1-A1 variant, the card supports DirectX 12. This ensures that all modern games will run on GeForce GTX 1080 Ti. The GP102 graphics processor is a large chip with a die area of 471 mm2 and 11,800 million transistors. Unlike the fully unlocked TITAN Xp, which uses the same GPU but has all 3840 shaders enabled, NVIDIA has disabled some shading units on the GeForce GTX 1080 Ti to reach the product’s target shader count. It features 3584 shading units, 224 texture mapping units, and 88 ROPs. NVIDIA has paired 11 GB GDDR5X memory with the GeForce GTX 1080 Ti, which are connected using a 352-bit memory interface. The GPU is operating at a frequency of 1481 MHz, which can be boosted up to 1582 MHz, memory is running at 1376 MHz (11 Gbps effective). Being a dual-slot card, the NVIDIA GeForce GTX 1080 Ti draws power from 1x 6-pin + 1x 8-pin power connector, with power draw rated at 250 W maximum. Display outputs include: 1x HDMI, 3x DisplayPort. GeForce GTX 1080 Ti is connected to the rest of the system using a PCI-Express 3.0 x16 interface. The card’s dimensions are 267 mm x 112 mm x 40 mm, and it features a dual-slot cooling solution. Its price at launch was 699 US Dollars.
Quadro RTX 8000: The Quadro RTX 8000 is an enthusiast-class professional graphics card by NVIDIA, launched in August 2018. Built on the 12 nm process, and based on the TU102 graphics processor, in its TU102-875-A1 variant, the card supports DirectX 12 Ultimate. The TU102 graphics processor is a large chip with a die area of 754 mm2 and 18,600 million transistors. It features 4608 shading units, 288 texture mapping units, and 96 ROPs. Also included are 576 tensor cores which help improve the speed of machine learning applications. The card also has 72 raytracing acceleration cores. NVIDIA has paired 48 GB GDDR6 memory with the Quadro RTX 8000, which are connected using a 384-bit memory interface. The GPU is operating at a frequency of 1395 MHz, which can be boosted up to 1770 MHz, memory is running at 1750 MHz (14 Gbps effective). Being a dual-slot card, the NVIDIA Quadro RTX 8000 draws power from 1x 6-pin + 1x 8-pin power connector, with power draw rated at 260 W maximum. Display outputs include: 4x DisplayPort, 1x USB Type-C. Quadro RTX 8000 is connected to the rest of the system using a PCI-Express 3.0 x16 interface. The card measures 267 mm in length, 111 mm in width, and features a dual-slot cooling solution. Its price at launch was 9999 US Dollars.
CPU: Architecture: x86_64, CPU op-mode(s): 32-bit, 64-bit, Byte Order: Little Endian, CPU(s): 96, On-line CPU(s) list: 0-95, Thread(s) per core: 2, Core(s) per socket: 24, Socket(s): 2, NUMA node(s): 2, Vendor ID: GenuineIntel, CPU family: 6, Model: 85, Model name: Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz, Stepping: 7, CPU MHz: 2200.000, CPU max MHz: 2201.0000, CPU min MHz: 1000.0000, BogoMIPS: 4400.00, Virtualization: VT-x, L1d cache: 32K, L1i cache: 32K, L2 cache: 1024K, L3 cache: 36608K, NUMA node0 CPU(s): 0-23,48-71, NUMA node1 CPU(s): 24-47,72-95.
Appendix E Layer JVP Implementation
For the layer case, that is, taking the JVP of the DN output at $x$ with respect to a layer weight $W_\ell$, the implementation is similar except that all the layers up to and including layer $\ell-1$ keep their default nonlinearities, while all the layers from layer $\ell$ onward employ the custom nonlinearities that we provided. At layer $\ell$, the user takes the layer input $z_{\ell-1}(x)$ and generates the concatenated tensor by applying the layer's affine transformation three times: once with the considered layer weight, denoted $W$, once with the direction $U$ to produce the JVP from, and once with zero weights (a sketch for the case of a dense layer is given below).

The same goes for the case of a convolutional layer, with the matrix multiplication replaced by the convolution of the layer input with $W$ and with $U$, respectively.

If one wanted to use batch-normalization, a clone_bn operation should be added just before the activation, taking X as input; the output of this cloned BN would then be fed into the clone_act operation.
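A hedged sketch of how this concatenated tensor could be formed for a dense layer is given below (the names z, W, U, b and the helper are ours; the subsequent layers are assumed to use the cloned nonlinearities, cloning their states from the first third of the batch):

```python
import tensorflow as tf

def dense_layer_weight_jvp_input(z, W, U, b):
    """Form the tripled mini-batch fed to the remaining (cloned) layers when taking the
    JVP w.r.t. a dense layer weight W with direction U; z is the layer input z_{l-1}(x)."""
    h_x = tf.matmul(z, W) + b                 # usual pre-activation (true weight)
    h_u = tf.matmul(z, U) + b                 # same input, direction in place of the weight
    h_0 = tf.zeros_like(h_x) + b              # zero weight
    return tf.concat([h_x, h_u, h_0], axis=0)

# For a convolutional layer, the same pattern applies with the convolution of z with W
# and with U, e.g. tf.nn.conv2d(z, W, strides=1, padding="SAME") and likewise for U.
```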
Appendix F Codes for Additional Nonlinearities
In the case of the batch-normalization layer, the input-output mapping is not CPA during training, and thus we need to explicitly compute the Jacobian of the layer by hand and implement the corresponding forward pass. This is only needed if one wants to compute the JVP during training; otherwise it can be omitted since, during testing, batch-normalization is simply an affine transformation of its input.
Appendix G Proof
G.1 Proof of Theorem 1
Proof.
The proof for the DN input JVP case is trivial and given in (10). Now when considering the layer weight case, the only added complication comes from the fact that the parameter of interest is a matrix (as opposed to a vector). However, we will simply demonstrate that a layer output can be produced by considering the flattened version of the weight matrix, in which case the JVP is taken w.r.t. a vector.
Denote by $w$ a vector encoding the flattened representation of $W_\ell$; the DN mapping based on an input $x$ but now taking $w$ as input is defined as

$g_x(w) \triangleq (f_L \circ \dots \circ f_{\ell+1})\big(\sigma_\ell(R_x\, w + b_\ell)\big), \qquad (14)$

with $R_x$ a $D_\ell$ by $D_\ell D_{\ell-1}$ matrix stacking $z_{\ell-1}(x)^{\top}$ repeatedly on its (block) diagonal, i.e., $R_x = I_{D_\ell} \otimes z_{\ell-1}(x)^{\top}$. First, notice that layers $1, \dots, \ell-1$ do not depend on $w$. From this, it is possible to apply the same strategy as when taking the JVP with respect to the DN input, since one can consider that the DN input is now the layer weight of interest, as in

$J_{W_\ell} f(x)\,\mathrm{vec}(U) = \bar{g}_x(\mathrm{vec}(U)) - \bar{g}_x(0).$

However, one can further simplify the above such that in practice there is no need to produce the matrix $R_x$ explicitly. Notice that for an input $w$, the layer computes $R_x\, w + b_\ell$. This quantity is nothing else than $U z_{\ell-1}(x) + b_\ell$, where $U$ is the matrix form of $w$. We thus directly obtain that, in practice, computing the above is equivalent to computing the layer output but replacing the layer weight $W_\ell$ by the matrix $U$ (or by $0$). For the convolutional layer case, the exact same process applies, now noticing that the convolution can be written as the product between the (small) filter weights and a matrix producing the circulant-block-circulant matrix of the convolution. ∎
Appendix H Computational Complexity of Spectral Decomposition
The QR decomposition (Strang, 2019) using Householder transformations requires on the order of $2mn^2 - \frac{2}{3}n^3$ flops (Golub & Van Loan, 1996) for an $m \times n$ matrix. Hence the top-k SVD is obtained by repeating, until the stopping criterion is reached, the matrix-free JVP calls together with these additional QR flops. Considering a fixed number of iterations of the inner loop, we obtain that the asymptotic time complexity of Algo. 2 is the same as the asymptotic time complexity of performing a forward pass of an input through the considered DN.
Appendix I Additional Results
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | |
GeForce GTX 1080Ti | RNN | 5M | - | - | 0.166 | 0.034 | 4.88 |
UNet | 49M | - | - | - | - | - | |
VGG16 | 138M | 0.060 | 0.028 | 0.024 | 0.017 | 1.41 | |
VGG19 | 143M | 0.075 | 0.039 | 0.031 | 0.021 | 1.47 | |
Inception V3 | 23M | 0.041 | 0.022 | 0.017 | 0.011 | 1.54 | |
Resnet50 | 25M | 0.041 | 0.014 | 0.012 | 0.011 | 1.09 | |
Resnet101 | 44M | 0.058 | 0.021 | 0.017 | 0.017 | 1 | |
Resnet152 | 60M | 0.082 | 0.03 | 0.025 | 0.025 | 1 | |
EfficientNet B0 | 5M | - | - | - | 0.007 | ||
EfficientNet B1 | 7M | - | - | - | 0.009 | ||
Densenet121 | 8M | 0.037 | 0.026 | 0.019 | 0.013 | 1.46 | |
Densenet169 | 14M | 0.045 | 0.036 | 0.026 | 0.018 | 1.44 | |
Densenet201 | 17M | 0.061 | 0.046 | 0.033 | 0.018 | 1.83 |
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | speedup factor | |
GeForce GTX 1080Ti | RNN | 5M | - | - | 0.151 | 0.024 | 6.29167 |
UNet | 49M | - | - | - | - | - | |
VGG16 | 138M | 0.060 | 0.028 | 0.024 | 0.017 | 1.41 | |
VGG19 | 143M | 0.072 | 0.036 | 0.031 | 0.022 | 1.40 | |
Inception V3 | 23M | 0.039 | 0.021 | 0.016 | 0.011 | 1.45 | |
Resnet50 | 25M | 0.039 | 0.013 | 0.011 | 0.011 | 1 | |
Resnet101 | 44M | 0.058 | 0.023 | 0.017 | 0.017 | 1 | |
Resnet152 | 60M | 0.080 | 0.029 | 0.027 | 0.025 | 1.08 | |
EfficientNet B0 | 5M | - | - | - | 0.007 | ||
EfficientNet B1 | 7M | - | - | - | 0.009 | ||
Densenet121 | 8M | 0.033 | 0.025 | 0.020 | 0.011 | 1.81 | |
Densenet169 | 14M | 0.050 | 0.036 | 0.026 | 0.015 | 1.73 | |
Densenet201 | 17M | 0.061 | 0.044 | 0.030 | 0.018 | 1.67 |
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | batch- jacobian | jvp | double vjp | clone (ours) |
RNN | 5M | 0.011 | - | 0.014 | 0.004 | - | - | 0.046 | 0.016 |
UNet | 49M | - | 0.004 | 0.003 | 0.002 | - | 0.006 | 0.004 | 0.005 |
VGG16 | 138M | 0.004 | 0.003 | 0.003 | 0.001 | - | 0.008 | 0.008 | 0.004 |
VGG19 | 143M | 0.005 | 0.003 | 0.003 | 0.002 | - | 0.009 | 0.008 | 0.005 |
Inception V3 | 23M | 0.006 | 0.005 | 0.004 | 0.003 | - | 0.006 | 0.004 | 0.004 |
Resnet50 | 25M | 0.005 | 0.003 | 0.003 | 0.003 | - | 0.005 | 0.004 | 0.004 |
Resnet101 | 44M | 0.005 | 0.005 | 0.004 | 0.003 | - | 0.008 | 0.007 | 0.006 |
Resnet152 | 60M | 0.010 | 0.006 | 0.004 | 0.004 | - | 0.010 | 0.006 | 0.007 |
EfficientNet B0 | 5M | 0.007 | 0.004 | 0.003 | 0.003 | - | 0.006 | 0.004 | 0.003 |
EfficientNet B1 | 7M | 0.007 | 0.005 | 0.004 | 0.003 | - | 0.007 | 0.006 | 0.004 |
Densenet121 | 8M | 0.007 | 0.005 | 0.005 | 0.003 | - | 0.008 | 0.006 | 0.005 |
Densenet169 | 14M | 0.012 | 0.008 | 0.005 | 0.005 | - | 0.008 | 0.008 | 0.006 |
Densenet202 | 17M | 0.013 | 0.011 | 0.007 | 0.004 | - | 0.008 | 0.012 | 0.007 |
RNN | 5M | - | - | 0.022 | 0.007 | - | - | - | 0.017 |
UNet | 49M | - | 0.004 | 0.003 | 0.002 | - | 0.01 | 0.006 | 0.007 |
VGG16 | 138M | 0.007 | 0.002 | 0.001 | 0.001 | - | 0.009 | 0.005 | 0.005 |
VGG19 | 143M | 0.004 | 0.003 | 0.002 | 0.001 | - | 0.010 | 0.005 | 0.007 |
Inception V3 | 23M | 0.007 | 0.007 | 0.006 | 0.003 | - | 0.006 | 0.005 | 0.004 |
Resnet50 | 25M | 0.005 | 0.003 | 0.003 | 0.002 | - | 0.002 | 0.002 | 0.004 |
Resnet101 | 44M | 0.006 | 0.004 | 0.005 | 0.004 | - | 0.004 | 0.004 | 0.007 |
Resnet152 | 60M | 0.006 | 0.009 | 0.007 | 0.005 | - | 0.003 | 0.004 | 0.004 |
EfficientNet B0 | 5M | 0.011 | 0.004 | 0.003 | 0.001 | - | 0.002 | 0.001 | 0.001 |
EfficientNet B1 | 7M | 0.014 | 0.006 | 0.005 | 0.002 | - | 0.007 | 0.001 | 0.002 |
Densenet121 | 8M | 0.011 | 0.009 | 0.006 | 0.004 | - | 0.008 | 0.004 | 0.002 |
Densenet169 | 14M | 0.015 | 0.011 | 0.008 | 0.005 | - | 0.006 | 0.004 | 0.004 |
Densenet202 | 17M | 0.017 | 0.012 | 0.011 | 0.005 | - | 0.007 | 0.004 | 0.004 |
Model | # params. | batch- jacobian | jvp | double vjp | clone (ours) | batch- jacobian | jvp | double vjp | clone (ours) |
RNN | 5M | 0.013 | - | 0.013 | 0.006 | - | - | 0.042 | 0.018 |
UNet | 49M | - | 0.004 | 0.003 | 0.002 | - | 0.005 | 0.004 | 0.005 |
VGG16 | 138M | 0.004 | 0.003 | 0.003 | 0.001 | - | 0.008 | 0.008 | 0.003 |
VGG19 | 143M | 0.004 | 0.003 | 0.003 | 0.002 | - | 0.008 | 0.008 | 0.005 |
Inception V3 | 23M | 0.006 | 0.004 | 0.004 | 0.004 | - | 0.007 | 0.004 | 0.004 |
Resnet50 | 25M | 0.003 | 0.003 | 0.003 | 0.003 | - | 0.004 | 0.004 | 0.003 |
Resnet101 | 44M | 0.007 | 0.005 | 0.004 | 0.003 | - | 0.007 | 0.004 | 0.004 |
Resnet152 | 60M | 0.008 | 0.006 | 0.007 | 0.005 | - | 0.007 | 0.008 | 0.004 |
EfficientNet B0 | 5M | 0.004 | 0.003 | 0.003 | 0.003 | - | 0.006 | 0.004 | 0.003 |
EfficientNet B1 | 7M | 0.007 | 0.005 | 0.004 | 0.003 | - | 0.007 | 0.005 | 0.004 |
Densenet121 | 8M | 0.006 | 0.005 | 0.004 | 0.004 | - | 0.006 | 0.006 | 0.004 |
Densenet169 | 14M | 0.012 | 0.006 | 0.006 | 0.004 | - | 0.014 | 0.007 | 0.006 |
Densenet202 | 17M | 0.010 | 0.007 | 0.006 | 0.005 | - | 0.009 | 0.008 | 0.006 |
RNN | 5M | - | - | 0.02 | 0.009 | - | - | - | 0.024 |
UNet | 49M | - | 0.004 | 0.002 | 0.002 | - | 0.009 | 0.003 | 0.007 |
VGG16 | 138M | 0.008 | 0.002 | 0.002 | 0.001 | - | 0.009 | 0.005 | 0.005 |
VGG19 | 143M | 0.004 | 0.003 | 0.002 | 0.001 | - | 0.009 | 0.006 | 0.007 |
Inception V3 | 23M | 0.008 | 0.007 | 0.005 | 0.003 | - | 0.006 | 0.003 | 0.003 |
Resnet50 | 25M | 0.006 | 0.004 | 0.001 | 0.002 | - | 0.002 | 0.002 | 0.004 |
Resnet101 | 44M | 0.007 | 0.007 | 0.006 | 0.004 | - | 0.005 | 0.003 | 0.007 |
Resnet152 | 60M | 0.007 | 0.009 | 0.008 | 0.006 | - | 0.007 | 0.007 | 0.005 |
EfficientNet B0 | 5M | 0.010 | 0.005 | 0.004 | 0.002 | - | 0.005 | 0.002 | 0.002 |
EfficientNet B1 | 7M | 0.029 | 0.007 | 0.005 | 0.003 | - | 0.009 | 0.001 | 0.003 |
Densenet121 | 8M | 0.011 | 0.010 | 0.007 | 0.004 | - | 0.009 | 0.004 | 0.003 |
Densenet169 | 14M | 0.013 | 0.011 | 0.009 | 0.005 | - | 0.008 | 0.006 | 0.004 |
Densenet202 | 17M | 0.014 | 0.013 | 0.010 | 0.005 | - | 0.01 | 0.003 | 0.002 |