
Practical Quasi-Newton Methods for Training Deep Neural Networks

Donald Goldfarb, Yi Ren, Achraf Bahamou
Department of Industrial Engineering and Operations Research
Columbia University
New York, NY 10027
{goldfarb, yr2322, ab4689}@columbia.edu
Abstract

We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^{2}$ elements. Consequently, computing and storing a full $n\times n$ BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC [30], which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.

1 Introduction

We consider in this paper the development of practical stochastic quasi-Newton (QN), and in particular Kronecker-factored block-diagonal BFGS [6, 13, 16, 39] and L-BFGS [27], methods for training deep neural networks (DNNs). Recall that the BFGS method starts each iteration with a symmetric positive definite matrix $B$ (or $H=B^{-1}$) that approximates the current Hessian matrix (or its inverse), computes the gradient $\nabla f$ of $f$ at the current iterate $\mathbf{x}$, and then takes a step $\mathbf{s}=-\alpha H\nabla f$, where $\alpha$ is a step length (usually) determined by some inexact line-search procedure, such that $\mathbf{y}^{\top}\mathbf{s}>0$, where $\mathbf{y}=\nabla f^{+}-\nabla f$ and $\nabla f^{+}$ is the gradient of $f$ at the new point $\mathbf{x}^{+}=\mathbf{x}+\mathbf{s}$. The method then computes an updated approximation $B^{+}$ to $B$ (or $H^{+}$ to $H$) that remains symmetric and positive definite and satisfies the so-called quasi-Newton (QN) condition $B^{+}\mathbf{s}=\mathbf{y}$ (or equivalently, $H^{+}\mathbf{y}=\mathbf{s}$). A consequence of this is that the matrix $B^{+}$ operates on the vector $\mathbf{s}$ in exactly the same way as the average of the Hessian matrix along the line segment between $\mathbf{x}$ and $\mathbf{x}^{+}$ operates on $\mathbf{s}$.

In DNN training, the number of variables and components of the gradient $n$ is often of the order of tens of millions and the Hessian has $n^{2}$ elements. Hence, computing and storing a full $n\times n$ BFGS approximation or storing $p$ $(\mathbf{s},\mathbf{y})$ pairs, where $p$ is approximately $10$ or larger, for use in an L-BFGS implementation is out of the question. Consequently, in our methods, we approximate the Hessian by a block-diagonal matrix, where each diagonal block corresponds to a layer, further approximating them as the Kronecker product of two much smaller matrices, as in [30, 5, 19, 10].

Literature Review on Using Second-order Information for DNN Training. For solving the stochastic optimization problems with high-dimensional data that arise in machine learning (ML), stochastic gradient descent (SGD) [36] and its variants are the methods that are most often used, especially for training DNNs. These variants include such methods as AdaGrad [12], RMSprop [21], and Adam [24], all of which scale the stochastic gradient by a diagonal matrix based on estimates of the first and second moments of the individual gradient components. Nonetheless, there has been a lot of effort to find ways to take advantage of second-order information in solving ML optimization problems. Approaches have run the gamut from the use of a diagonal re-scaling of the stochastic gradient, based on the secant condition associated with quasi-Newton (QN) methods [4], to sub-sampled Newton methods (e.g. see [43], and references therein), including those that solve the Newton system using the linear conjugate gradient method (see [8]).

In between these two extremes are stochastic methods that are based either on QN methods or on generalized Gauss-Newton (GGN) and natural gradient [1] methods. For example, a stochastic L-BFGS method for solving strongly convex problems that uses sampled Hessian-vector products rather than gradient differences was proposed in [9], and was proved in [33] to be linearly convergent when combined with the variance reduction technique SVRG [23] to alleviate the effect of noisy gradients. A closely related variance-reduced block L-BFGS method was proposed in [17]. A regularized stochastic BFGS method was proposed in [31], and an online L-BFGS method was proposed in [32] for strongly convex problems and extended in [28] to incorporate SVRG variance reduction. Stochastic BFGS and L-BFGS methods were also developed for online convex optimization in [38]. For nonconvex problems, a damped L-BFGS method that incorporates SVRG variance reduction was developed, and its convergence properties were studied, in [41].

GGN methods that approximate the Hessian have been proposed, including the Hessian-free method [29] and the Krylov subspace method [40]. Variants of the closely related natural gradient method that use block-diagonal approximations to the Fisher information matrix, where blocks correspond to layers, have been proposed in, e.g., [20, 11, 30, 14]. By further approximating each of these (empirical) Fisher matrix and GGN blocks by the Kronecker product of two much smaller matrices, the efficient KFAC [30], KFRA [5], EKFAC [15], and Shampoo [19] methods were developed. See also [2], [10], and [37], which combine both Hessian and covariance (Fisher-like) matrix information in stochastic Newton-type methods. In addition, methods are given in [26, 42] that replace the Kullback-Leibler divergence by the Wasserstein distance to define the natural gradient, but at a greater computational cost.

Our Contributions. The main contributions of this paper can be summarized as follows:

  1. New BFGS and limited-memory variants (i.e. L-BFGS) that take advantage of the structure of feed-forward DNN training problems;

  2. Efficient non-diagonal second-order algorithms for deep learning that require a comparable amount of memory and computational cost per iteration as first-order methods;

  3. A new damping scheme for BFGS and L-BFGS updating of an inverse Hessian approximation, that not only preserves its positive definiteness, but also limits the decrease (and increase) in its smallest (and largest) eigenvalues for non-convex problems;

  4. A novel application of Hessian-action BFGS;

  5. The first proof of convergence (to the best of our knowledge) of a stochastic Kronecker-factored quasi-Newton method.

2 Kronecker-factored Quasi-Newton Method for DNN

After reviewing the computations used in DNN training, we describe the Kronecker structures of the gradient and Hessian for a single data point, followed by their extension to approximate expectations of these quantities for multiple data-points and give a generic algorithm that employs BFGS (or L-BFGS) approximations for the Hessians.

Deep Neural Networks. We consider a feed-forward DNN with $L$ layers, defined by weight matrices $W_{l}$ (whose last columns are bias vectors $b_{l}$), activation functions $\phi_{l}$ for $l\in\{1,\dots,L\}$, and loss function $\mathcal{L}$. For a data-point $(x,y)$, the loss $\mathcal{L}(a_{L},y)$ between the output $a_{L}$ of the DNN and $y$ is a non-convex function of $\theta=[\operatorname{vec}(W_{1})^{\top},\ldots,\operatorname{vec}(W_{L})^{\top}]^{\top}$. The network's forward and backward pass for a single input data-point $(x,y)$ is described in Algorithm 1.

Algorithm 1 Forward and backward pass of DNN for a single data-point
1:  Given input $(x,y)$, weights (and biases) $W_{l}$, and activations $\phi_{l}$ for $l\in[1,L]$
2:  $\mathbf{a}_{0}=x$; for $l=1,\ldots,L$ do $\bar{\mathbf{a}}_{l-1}=(\mathbf{a}_{l-1},\,1)$; $\mathbf{h}_{l}=W_{l}\bar{\mathbf{a}}_{l-1}$; $\mathbf{a}_{l}=\phi_{l}(\mathbf{h}_{l})$
3:  $\mathcal{D}\mathbf{a}_{L}\leftarrow\left.\frac{\partial\mathcal{L}(z,y)}{\partial z}\right|_{z=\mathbf{a}_{L}}$
4:  for $l=L,\ldots,1$ do $\mathbf{g}_{l}=\mathcal{D}\mathbf{a}_{l}\odot\phi_{l}^{\prime}(\mathbf{h}_{l})$; $\mathcal{D}W_{l}=\mathbf{g}_{l}\bar{\mathbf{a}}_{l-1}^{\top}$; $\mathcal{D}\mathbf{a}_{l-1}=W_{l}^{\top}\mathbf{g}_{l}$
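To make the computations concrete, here is a minimal NumPy sketch of Algorithm 1 for a single data-point. It is illustrative only: the function and variable names (e.g., `forward_backward`, `loss_grad`) are ours, and we assume the bias is stored as the last column of each $W_l$, so that component is dropped when back-propagating $\mathcal{D}\mathbf{a}_{l-1}$.

```python
import numpy as np

def forward_backward(x, y, Ws, acts, act_primes, loss_grad):
    """Forward and backward pass for one data-point (sketch of Algorithm 1).

    Ws[l]      : (d_out, d_in + 1) weight matrix of layer l+1, bias in last column
    acts[l]    : activation function phi_l; act_primes[l] is its derivative
    loss_grad  : function returning dL/da_L at (a_L, y)
    Returns the activations a_l, pre-activations h_l, pre-activation
    gradients g_l, and layer gradients DW_l = g_l abar_{l-1}^T.
    """
    a, h, abar = [np.asarray(x, float)], [], []
    for W, phi in zip(Ws, acts):                    # forward pass
        abar.append(np.append(a[-1], 1.0))          # abar_{l-1} = (a_{l-1}, 1)
        h.append(W @ abar[-1])                      # h_l = W_l abar_{l-1}
        a.append(phi(h[-1]))                        # a_l = phi_l(h_l)
    Da = loss_grad(a[-1], y)                        # D a_L
    g_pre, DW = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):              # backward pass
        g_pre[l] = Da * act_primes[l](h[l])         # g_l = D a_l * phi_l'(h_l)
        DW[l] = np.outer(g_pre[l], abar[l])         # D W_l = g_l abar_{l-1}^T
        Da = Ws[l][:, :-1].T @ g_pre[l]             # D a_{l-1}, bias column dropped
    return a, h, g_pre, DW
```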

For a training dataset that contains multiple data-points indexed by $i=1,\ldots,I$, let $f(i;\theta)$ denote the loss for the $i$th data-point. Then, viewing the dataset as an empirical distribution, the total loss function $f(\theta)$ that we wish to minimize is

$$f(\theta):=\mathbb{E}_{i}[f(i;\theta)]:=\frac{1}{I}\sum_{i=1}^{I}f(i;\theta).$$

Single Data-point: Layer-wise Structure of the Gradient and Hessian. Let $\nabla f_{l}$ and $\nabla^{2}f_{l}$ denote, respectively, the restriction of $\nabla f$ and $\nabla^{2}f$ to the weights $W_{l}$ in layer $l=1,\ldots,L$. For a single data-point, $\nabla f_{l}$ and $\nabla^{2}f_{l}$ have a tensor (Kronecker) structure, as shown in [30] and [5]. Specifically,

$$\nabla f_{l}(i)=\mathbf{g}_{l}(i)(\mathbf{a}_{l-1}(i))^{\top},\quad\text{equivalently,}\quad\operatorname{vec}(\nabla f_{l}(i))=\mathbf{a}_{l-1}(i)\otimes\mathbf{g}_{l}(i),$$ (1)

$$\nabla^{2}f_{l}(i)=(\mathbf{a}_{l-1}(i)(\mathbf{a}_{l-1}(i))^{\top})\otimes G_{l}(i),$$ (2)

where the pre-activation gradient $\mathbf{g}_{l}(i)=\frac{\partial f(i)}{\partial\mathbf{h}_{l}(i)}$ and the pre-activation Hessian $G_{l}(i)=\frac{\partial^{2}f(i)}{\partial\mathbf{h}_{l}(i)^{2}}$. Our algorithm uses an approximation to $(G_{l}(i))^{-1}$, which is updated via the BFGS updating formulas based upon a secant condition that relates the change in $\mathbf{g}_{l}(i)$ with the change in $\mathbf{h}_{l}(i)$.
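The Kronecker structure in (1)-(2) can be checked numerically; the following snippet (with synthetic vectors, purely for illustration) verifies the column-major vec identity behind (1) and the identity $(A\otimes G)\operatorname{vec}(V)=\operatorname{vec}(GVA)$ for symmetric $A$, which is what lets the block in (2) be applied via cheap matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4)                       # stands in for abar_{l-1}
g = rng.standard_normal(3)                       # stands in for g_l
grad_W = np.outer(g, a)                          # grad f_l = g a^T, shape (d_o, d_i)
# (1): vec(g a^T) = a kron g  (column-major vec)
assert np.allclose(grad_W.flatten(order="F"), np.kron(a, g))

G = rng.standard_normal((3, 3)); G = G + G.T     # symmetric stand-in for G_l
A = np.outer(a, a)                               # a a^T, as in (2)
V = rng.standard_normal((3, 4))                  # arbitrary d_o x d_i matrix
# (A kron G) vec(V) = vec(G V A) for symmetric A, so the large Kronecker
# block in (2) never needs to be formed explicitly
assert np.allclose(np.kron(A, G) @ V.flatten(order="F"),
                   (G @ V @ A).flatten(order="F"))
```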

Although we focus on fully-connected layers in this paper, the idea of Kronecker-factored approximations to the diagonal blocks $\nabla^{2}f_{l}$, $l=1,\ldots,L$, of the Hessian can be extended to other layers used in deep learning, such as convolutional and recurrent layers.

Multiple Data-points: Kronecker-factored QN Approach. Now consider the case where we have a dataset of $I$ data-points indexed by $i=1,\ldots,I$. By (2), we have

$$\mathbb{E}_{i}[\nabla^{2}f_{l}(i)]\approx\mathbb{E}_{i}\left[\mathbf{a}_{l-1}(i)(\mathbf{a}_{l-1}(i))^{\top}\right]\otimes\mathbb{E}_{i}\left[G_{l}(i)\right]:=A_{l}\otimes G_{l}.$$ (3)

Note that the approximation in (3), namely that the expectation of the Kronecker product of two matrices equals the Kronecker product of their expectations, is the same as the one used by KFAC [30]. Now, based on this structural approximation, we use $H^{l}=H^{l}_{a}\otimes H^{l}_{g}$ as our QN approximation to $\left(\mathbb{E}_{i}[\nabla^{2}f_{l}(i)]\right)^{-1}$, where $H^{l}_{a}$ and $H^{l}_{g}$ are positive definite approximations to $A_{l}^{-1}$ and $G_{l}^{-1}$, respectively. Hence, using our layer-wise block-diagonal approximation to the Hessian, a step in our algorithm for each layer $l$ is computed as

$$\operatorname{vec}(W_{l}^{+})-\operatorname{vec}(W_{l})=-\alpha H^{l}\operatorname{vec}\left(\widehat{\nabla f_{l}}\right)=-\alpha(H^{l}_{a}\otimes H^{l}_{g})\operatorname{vec}\left(\widehat{\nabla f_{l}}\right)=-\alpha\operatorname{vec}\left(H^{l}_{g}\widehat{\nabla f_{l}}H^{l}_{a}\right),$$ (4)

where $\widehat{\nabla f_{l}}$ denotes the estimate of $\mathbb{E}_{i}[\nabla f_{l}(i)]$ and $\alpha$ is the learning rate. After computing $W_{l}^{+}$ and performing another forward-backward pass, our method computes or updates $H_{a}^{l}$ and $H_{g}^{l}$ as follows:

  1. For $H_{g}^{l}$, we use a damped version of BFGS (or L-BFGS) (see Section 3) based on the $(\mathbf{s},\mathbf{y})$ pairs corresponding to the average change in $\mathbf{h}_{l}(i)$ and in the gradient with respect to $\mathbf{h}_{l}(i)$; i.e.,

     $$\mathbf{s}_{g}^{l}=\mathbb{E}_{i}[\mathbf{h}_{l}^{+}(i)]-\mathbb{E}_{i}[\mathbf{h}_{l}(i)],\qquad\mathbf{y}_{g}^{l}=\mathbb{E}_{i}[\mathbf{g}_{l}^{+}(i)]-\mathbb{E}_{i}[\mathbf{g}_{l}(i)].$$ (5)

  2. For $H_{a}^{l}$, we use the "Hessian-action" BFGS method described in Section 4. The issue of possible singularity of the positive semi-definite matrix $A_{l}$ approximated by $(H_{a}^{l})^{-1}$ is also addressed there by incorporating a Levenberg-Marquardt (LM) damping term.

Algorithm 2 gives a high-level summary of our K-BFGS and K-BFGS(L) algorithms (which use BFGS or L-BFGS to update $H_{g}^{l}$, respectively). See Algorithm 4 in the appendix for detailed pseudocode. The use of mini-batches is described in Section 6. Note that an additional forward-backward pass is used in Algorithm 2 because the quantities in (5) need to be estimated using the same mini-batch.

Algorithm 2 High-level summary of K-BFGS / K-BFGS(L)
0:  Given initial weights $\theta$, batch size $m$, learning rate $\alpha$
1:  for $k=1,2,\ldots$ do
2:     Sample mini-batch of size $m$: $M_{k}=\{\xi_{k,i},i=1,\dots,m\}$
3:     Perform a forward-backward pass over the current mini-batch $M_{k}$ (see Algorithm 1)
4:     for $l=1,\ldots,L$ do $p_{l}=H_{g}^{l}\widehat{\nabla f}_{l}H_{a}^{l}$; $W_{l}=W_{l}-\alpha\cdot p_{l}$
5:     Perform another forward-backward pass over $M_{k}$ to get $(\mathbf{s}_{g}^{l},\mathbf{y}_{g}^{l})$
6:     Use damped BFGS or L-BFGS to update $H_{g}^{l}$ ($l=1,\ldots,L$) (see Section 3, in particular Algorithm 3)
7:     Use Hessian-action BFGS to update $H_{a}^{l}$ ($l=1,\ldots,L$) (see Section 4)
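A minimal sketch of the step in line 4 of Algorithm 2 (equation (4)): the two small factors $H_g^{l}$ and $H_a^{l}$ are applied directly to the $d_o\times d_i$ gradient estimate, so the $(d_id_o)\times(d_id_o)$ matrix $H_a^{l}\otimes H_g^{l}$ is never formed. The function name and the tiny sanity check are ours.

```python
import numpy as np

def kbfgs_step(W, grad_hat, H_g, H_a, alpha):
    """One layer's update: vec(W+) - vec(W) = -alpha (H_a kron H_g) vec(grad_hat)."""
    return W - alpha * (H_g @ grad_hat @ H_a)

# sanity check against the explicit Kronecker product on a tiny layer
rng = np.random.default_rng(0)
d_i, d_o, alpha = 4, 3, 0.1
W, grad_hat = rng.standard_normal((d_o, d_i)), rng.standard_normal((d_o, d_i))
M = rng.standard_normal((d_i, d_i)); H_a = M @ M.T + np.eye(d_i)   # SPD factor
M = rng.standard_normal((d_o, d_o)); H_g = M @ M.T + np.eye(d_o)   # SPD factor
W_plus = kbfgs_step(W, grad_hat, H_g, H_a, alpha)
direct = W.flatten("F") - alpha * np.kron(H_a, H_g) @ grad_hat.flatten("F")
assert np.allclose(W_plus.flatten("F"), direct)
```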

3 BFGS and L-BFGS for $G_{l}$

Damped BFGS Updating. It is well known that training a DNN is a non-convex optimization problem. As (2) and (3) show, this non-convexity manifests in the fact that $G_{l}\succ 0$ often does not hold. Thus, for the BFGS update of $H_{g}^{l}$, the approximation to $G_{l}^{-1}$, to remain positive definite, we have to ensure that $(\mathbf{s}_{g}^{l})^{\top}\mathbf{y}_{g}^{l}>0$. Due to the stochastic setting, ensuring this condition by line-search, as is done in deterministic settings, is impractical. In addition, due to the large changes in curvature in DNN models that occur as the parameters are varied, we also need to suppress large changes to $H_{g}^{l}$ as it is updated. To deal with both of these issues, we propose a double damping (DD) procedure (Algorithm 3), which is based upon Powell's damped-BFGS approach [35], for modifying the $(\mathbf{s}_{g}^{l},\mathbf{y}_{g}^{l})$ pair. To motivate Algorithm 3, consider the formulas used for BFGS updating of $B$ and $H$:

$$B^{+}=B-\frac{B\mathbf{s}\mathbf{s}^{\top}B}{\mathbf{s}^{\top}B\mathbf{s}}+\rho\mathbf{y}\mathbf{y}^{\top},\quad H^{+}=(I-\rho\mathbf{s}\mathbf{y}^{\top})H(I-\rho\mathbf{y}\mathbf{s}^{\top})+\rho\mathbf{s}\mathbf{s}^{\top},$$ (6)

where $\rho=\frac{1}{\mathbf{s}^{\top}\mathbf{y}}>0$. If we can ensure that $0<\frac{\mathbf{y}^{\top}H\mathbf{y}}{\mathbf{s}^{\top}\mathbf{y}}\leq\frac{1}{\mu_{1}}$ and $0<\frac{\mathbf{s}^{\top}\mathbf{s}}{\mathbf{s}^{\top}\mathbf{y}}\leq\frac{1}{\mu_{2}}$, then we can obtain the following bounds:

$$\|B^{+}\|\leq\left\|B-\frac{B\mathbf{s}\mathbf{s}^{\top}B}{\mathbf{s}^{\top}B\mathbf{s}}\right\|+\|\rho\mathbf{y}\mathbf{y}^{\top}\|\leq\|B\|+\left\|\frac{B^{1/2}H^{1/2}\mathbf{y}\mathbf{y}^{\top}H^{1/2}B^{1/2}}{\mathbf{s}^{\top}\mathbf{y}}\right\|$$ (7)

$$\leq\|B\|+\|B\|\frac{\|H^{1/2}\mathbf{y}\|^{2}}{\mathbf{s}^{\top}\mathbf{y}}\leq\|B\|\left(1+\frac{\mathbf{y}^{\top}H\mathbf{y}}{\mathbf{s}^{\top}\mathbf{y}}\right)\leq\|B\|\left(1+\frac{1}{\mu_{1}}\right)$$ (8)

and

$$\|H^{+}\|\leq\left\|H^{1/2}-\frac{\mathbf{s}\mathbf{y}^{\top}H^{1/2}}{\mathbf{s}^{\top}\mathbf{y}}\right\|^{2}+\left\|\frac{\mathbf{s}\mathbf{s}^{\top}}{\mathbf{s}^{\top}\mathbf{y}}\right\|\leq\left(\|H^{1/2}\|+\frac{\|\mathbf{s}\|\|H^{1/2}\mathbf{y}\|}{\mathbf{s}^{\top}\mathbf{y}}\right)^{2}+\frac{\|\mathbf{s}\|^{2}}{\mathbf{s}^{\top}\mathbf{y}}$$ (9)

$$\leq\left(\|H^{1/2}\|+\Big(\frac{\mathbf{s}^{\top}\mathbf{s}}{\mathbf{s}^{\top}\mathbf{y}}\Big)^{1/2}\Big(\frac{\mathbf{y}^{\top}H\mathbf{y}}{\mathbf{s}^{\top}\mathbf{y}}\Big)^{1/2}\right)^{2}+\frac{\mathbf{s}^{\top}\mathbf{s}}{\mathbf{s}^{\top}\mathbf{y}}\leq\left(\|H^{1/2}\|+\frac{1}{\sqrt{\mu_{1}\mu_{2}}}\right)^{2}+\frac{1}{\mu_{2}}.$$ (10)

Thus, the change in $B$ (and $H$) is controlled if $\frac{\mathbf{y}^{\top}H\mathbf{y}}{\mathbf{s}^{\top}\mathbf{y}}\leq\frac{1}{\mu_{1}}$ and $\frac{\mathbf{s}^{\top}\mathbf{s}}{\mathbf{s}^{\top}\mathbf{y}}\leq\frac{1}{\mu_{2}}$. Our DD approach is a two-step procedure, where the first step (i.e., Powell's damping of $H$) guarantees that $\frac{\mathbf{y}^{\top}H\mathbf{y}}{\mathbf{s}^{\top}\mathbf{y}}\leq\frac{1}{\mu_{1}}$ and the second step (i.e., Powell's damping with $B=I$) guarantees that $\frac{\mathbf{s}^{\top}\mathbf{s}}{\mathbf{s}^{\top}\mathbf{y}}\leq\frac{1}{\mu_{2}}$. Note that there is no guarantee that $\frac{\mathbf{y}^{\top}H\mathbf{y}}{\mathbf{s}^{\top}\mathbf{y}}\leq\frac{1}{\mu_{1}}$ still holds after the second step. However, we can skip updating $H$ in this case so that the bounds on these matrices hold. In our implementation, we always do the update, since in empirical testing we observed that at least 90% of the pairs satisfy $\frac{\mathbf{y}^{\top}H\mathbf{y}}{\mathbf{s}^{\top}\mathbf{y}}\leq\frac{2}{\mu_{1}}$. See Section C in the appendix for more details on damping.

Algorithm 3 Double Damping (DD)
1:  Input: $\mathbf{s}$, $\mathbf{y}$; Output: $\tilde{\mathbf{s}}$, $\tilde{\mathbf{y}}$; Given: $H,\mu_{1},\mu_{2}$
2:  if $\mathbf{s}^{\top}\mathbf{y}<\mu_{1}\mathbf{y}^{\top}H\mathbf{y}$ then $\theta_{1}=\frac{(1-\mu_{1})\mathbf{y}^{\top}H\mathbf{y}}{\mathbf{y}^{\top}H\mathbf{y}-\mathbf{s}^{\top}\mathbf{y}}$ else $\theta_{1}=1$
3:  $\tilde{\mathbf{s}}=\theta_{1}\mathbf{s}+(1-\theta_{1})H\mathbf{y}$ {Powell's damping on $H$}
4:  if $\tilde{\mathbf{s}}^{\top}\mathbf{y}<\mu_{2}\tilde{\mathbf{s}}^{\top}\tilde{\mathbf{s}}$ then $\theta_{2}=\frac{(1-\mu_{2})\tilde{\mathbf{s}}^{\top}\tilde{\mathbf{s}}}{\tilde{\mathbf{s}}^{\top}\tilde{\mathbf{s}}-\tilde{\mathbf{s}}^{\top}\mathbf{y}}$ else $\theta_{2}=1$
5:  $\tilde{\mathbf{y}}=\theta_{2}\mathbf{y}+(1-\theta_{2})\tilde{\mathbf{s}}$ {Powell's damping with $B=I$}
6:  return $\tilde{\mathbf{s}}$, $\tilde{\mathbf{y}}$
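For concreteness, a direct NumPy transcription of Algorithm 3; the default values of $\mu_1$ and $\mu_2$ here are illustrative (the paper uses $\mu_1=0.2$ and $\mu_2=\lambda_G$).

```python
import numpy as np

def double_damping(s, y, H, mu1=0.2, mu2=1e-3):
    """Double Damping (Algorithm 3): returns the modified pair (s~, y~)."""
    Hy = H @ y
    yHy, sy = y @ Hy, s @ y
    # step 1: Powell's damping on H, enforcing s~^T y >= mu1 * y^T H y
    theta1 = (1 - mu1) * yHy / (yHy - sy) if sy < mu1 * yHy else 1.0
    s_t = theta1 * s + (1 - theta1) * Hy
    # step 2: Powell's damping with B = I, enforcing s~^T y~ >= mu2 * s~^T s~
    sts, sty = s_t @ s_t, s_t @ y
    theta2 = (1 - mu2) * sts / (sts - sty) if sty < mu2 * sts else 1.0
    y_t = theta2 * y + (1 - theta2) * s_t
    return s_t, y_t
```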

L-BFGS Implementation. L-BFGS can also be used to update $H_{g}^{l}$. However, implementing L-BFGS using the standard "two-loop recursion" (see Algorithm 7.4 in [34]) is not efficient. This is because the main work in computing $H^{l}_{g}\widehat{\nabla f_{l}}H^{l}_{a}$ in line 4 of Algorithm 2 would require $4p$ matrix-vector multiplications, each requiring $O(d_{i}d_{o})$ operations, where $p$ denotes the number of $(\mathbf{s},\mathbf{y})$ pairs stored by L-BFGS. (Recall that $\widehat{\nabla f_{l}}\in\mathbb{R}^{d_{o}\times d_{i}}$.) Instead, we use a "non-loop" implementation [7] of L-BFGS, whose main work involves 2 matrix-matrix multiplications, each requiring $O(pd_{i}d_{o})$ operations. When $p$ is not small (we used $p=100$ in our tests), and $d_{i}$ and $d_{o}$ are large, this is much more efficient, especially on GPUs.
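As an illustration of the "non-loop" idea, the sketch below applies the L-BFGS inverse approximation to a matrix right-hand side using the compact representation of Byrd et al. [7], under the assumption that the initial matrix is $H_0=\gamma I$; this is our own rendering of that representation, not the paper's code. Its main work consists of a few matrix-matrix products whose cost is governed by $p$, $d_i$, and $d_o$ (stacking $[S,\ H_0Y]$ into one block reduces these to the two products mentioned above), rather than $4p$ separate matrix-vector multiplications.

```python
import numpy as np

def lbfgs_apply_compact(G, S, Y, gamma=1.0):
    """Apply the L-BFGS inverse-Hessian approximation H to a matrix G using the
    compact ("non-loop") representation of Byrd et al. [7], with H0 = gamma * I.

    S, Y : (n, p) arrays of stored (s, y) pairs, oldest to newest
    G    : (n, k) right-hand side, e.g. the d_o x d_i gradient block
    """
    SY = S.T @ Y
    R = np.triu(SY)                          # R_ij = s_i^T y_j for i <= j, else 0
    D = np.diag(np.diag(SY))
    Rinv = np.linalg.solve(R, np.eye(R.shape[0]))
    M11 = Rinv.T @ (D + gamma * (Y.T @ Y)) @ Rinv
    StG = S.T @ G
    YtG = Y.T @ G
    # H G = gamma*G + S (M11 S^T G - gamma R^{-T} Y^T G) - gamma Y R^{-1} S^T G
    return (gamma * G + S @ (M11 @ StG - gamma * (Rinv.T @ YtG))
            - gamma * Y @ (Rinv @ StG))
```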

4 "Hessian action" BFGS for AlA_{l}

In addition to approximating $G_{l}^{-1}$ by $H_{g}^{l}$ using BFGS, we also propose approximating $A_{l}^{-1}$ by $H_{a}^{l}$ using BFGS. Note that $A_{l}$ does not correspond to some Hessian of the objective function. However, we can generate $(\mathbf{s},\mathbf{y})$ pairs for it by "Hessian action" (see, e.g., [9, 17, 18]).

Connection between Hessian-action BFGS and Matrix Inversion. In our methods, we choose $\mathbf{s}=H^{l}_{a}\cdot\mathbb{E}_{i}[\mathbf{a}_{l-1}(i)]$ and $\mathbf{y}=A_{l}\mathbf{s}$, which, as we now show, is closely connected to using the Sherman-Morrison modification formula to invert $A_{l}$. In particular, suppose that $A^{+}=A+c\,\mathbf{a}\mathbf{a}^{\top}$; i.e., only a rank-one update is made to $A$. This corresponds to the case where the information in $A$ is accumulated from iteration to iteration, and the size of the mini-batch is 1 or $\mathbf{a}$ represents the average of the vectors $\mathbf{a}(i)$ from multiple data-points.

Theorem 1.

Suppose that $A$ and $H$ are symmetric and positive definite, and that $H=A^{-1}$. If we choose $\mathbf{s}=H\mathbf{a}$ and $\mathbf{y}=A^{+}\mathbf{s}$, where $A^{+}=A+c\,\mathbf{a}\mathbf{a}^{\top}$ ($c>0$), then the $H^{+}$ generated by any QN update in the Broyden family

$$H^{+}=H-\sigma H\mathbf{y}\mathbf{y}^{\top}H+\rho\mathbf{s}\mathbf{s}^{\top}+\phi(\mathbf{y}^{\top}H\mathbf{y})\mathbf{h}\mathbf{h}^{\top},$$ (11)

where $\rho=1/\mathbf{s}^{\top}\mathbf{y}$, $\sigma=1/\mathbf{y}^{\top}H\mathbf{y}$, $\mathbf{h}=\rho\mathbf{s}-\sigma H\mathbf{y}$ and $\phi$ is a scalar parameter in $[0,1]$, equals $(A^{+})^{-1}$. Note that $\phi=1$ yields the BFGS update (6) and $\phi=0$ yields the DFP update.

Proof.

If $\mathbf{s}=H\mathbf{a}$ and $\mathbf{y}=A^{+}\mathbf{s}$, then $\mathbf{h}=0$, so all choices of $\phi$ yield the same matrix $H^{+}$. We have $H^{+}A^{+}\mathbf{s}=H^{+}\mathbf{y}=\mathbf{s}$, and for any vector $\mathbf{v}$ that is orthogonal to $\mathbf{a}$, $H^{+}A^{+}\mathbf{v}=H^{+}A\mathbf{v}=\mathbf{v}$, since $\mathbf{s}^{\top}A\mathbf{v}=0$ and $\mathbf{y}^{\top}HA\mathbf{v}=0$. It follows that $H^{+}A^{+}=I$, using the fact that $\mathbf{s}$ together with any linearly independent set of $n-1$ vectors orthogonal to $\mathbf{a}$ spans $\mathbb{R}^{n}$. (Note that $\mathbf{s}^{\top}\mathbf{a}=\mathbf{a}^{\top}H\mathbf{a}>0$ since $H\succ 0$, so $\mathbf{s}$ is not orthogonal to $\mathbf{a}$.) ∎

In fact, all updates in the Broyden family are equivalent to applying the Sherman-Morrison modification formula to $A^{+}=A+c\,\mathbf{a}\mathbf{a}^{\top}$, given $H=A^{-1}$, since after substituting for $\mathbf{s}$ and $\mathbf{y}$ in (11) and simplifying, one obtains

$$H^{+}=H-H\mathbf{a}(c^{-1}+\mathbf{a}^{\top}H\mathbf{a})^{-1}\mathbf{a}^{\top}H.$$
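Theorem 1 and the Sherman-Morrison connection are easy to verify numerically; the snippet below (synthetic data, our own helper names) builds a random $A\succ 0$, applies the BFGS update (6) with $\mathbf{s}=H\mathbf{a}$ and $\mathbf{y}=A^{+}\mathbf{s}$, and checks that the result equals $(A^{+})^{-1}$.

```python
import numpy as np

def bfgs_inv_update(H, s, y):
    """BFGS update of an inverse approximation H, as in (6)."""
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

rng = np.random.default_rng(0)
n, c = 5, 0.3
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                      # A is symmetric positive definite
H = np.linalg.inv(A)                         # H = A^{-1}
a = rng.standard_normal(n)
A_plus = A + c * np.outer(a, a)              # rank-one update of A
s = H @ a                                    # "Hessian-action" pair
y = A_plus @ s
H_plus = bfgs_inv_update(H, s, y)
assert np.allclose(H_plus, np.linalg.inv(A_plus))   # Theorem 1: H+ = (A+)^{-1}
```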

When using momentum, $A^{+}=\beta A+(1-\beta)\mathbf{a}\mathbf{a}^{\top}$ ($0<\beta<1$). Hence, if we still want Theorem 1 to hold, we have to scale $H$ by $1/\beta$ before updating it. This, however, turns out to be unstable. Hence, in practice, we use the non-scaled version of "Hessian action" BFGS.

Levenberg-Marquardt Damping for $A_{l}$. Since $A_{l}=\mathbb{E}_{i}\left[\mathbf{a}_{l-1}(i)(\mathbf{a}_{l-1}(i))^{\top}\right]\succeq 0$ may not be positive definite, or may have very small positive eigenvalues, we add a Levenberg-Marquardt (LM) damping term to make our "Hessian-action" BFGS stable; i.e., we use $A_{l}+\lambda_{A}I_{A}$ instead of $A_{l}$ when we update $H_{a}^{l}$. Specifically, "Hessian-action" BFGS for $A_{l}$ is performed in the following two steps (a code sketch follows the list below):

  1. $A_{l}=\beta\cdot A_{l}+(1-\beta)\cdot\mathbb{E}_{i}\left[\mathbf{a}_{l-1}(i)\mathbf{a}_{l-1}(i)^{\top}\right]$; $A_{l}^{\text{LM}}=A_{l}+\lambda_{A}I_{A}$.

  2. $\mathbf{s}_{a}^{l}=H_{a}^{l}\cdot\mathbb{E}_{i}[\mathbf{a}_{l-1}(i)]$, $\mathbf{y}_{a}^{l}=A_{l}^{\text{LM}}\mathbf{s}_{a}^{l}$; use BFGS with $(\mathbf{s}_{a}^{l},\mathbf{y}_{a}^{l})$ to update $H_{a}^{l}$.
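The sketch below strings the two steps together for one layer, reusing the `bfgs_inv_update` helper from the snippet after Theorem 1; the momentum $\beta$ and damping $\lambda_A$ values are illustrative, and `a_outer_bar` / `a_bar` stand for the mini-batch averages $\overline{\mathbf{a}_{l-1}\mathbf{a}_{l-1}^{\top}}$ and $\overline{\mathbf{a}_{l-1}}$ used in practice (see Section 6).

```python
import numpy as np

def hessian_action_update(A, H_a, a_outer_bar, a_bar, beta=0.9, lam_A=0.1):
    """Steps 1-2 above: refresh A_l, add LM damping, and BFGS-update H_a^l."""
    A = beta * A + (1 - beta) * a_outer_bar          # step 1: moving average of A_l
    A_LM = A + lam_A * np.eye(A.shape[0])            #         LM-damped A_l
    s = H_a @ a_bar                                  # step 2: s_a = H_a E[a]
    y = A_LM @ s                                     #         y_a = A_LM s_a
    H_a = bfgs_inv_update(H_a, s, y)                 #         BFGS update of H_a
    return A, H_a
```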

5 Convergence Analysis

Following the framework for stochastic quasi-Newton methods (SQN) established in [41] for solving nonconvex stochastic optimization problems (see Section B in the appendix for this framework), we prove that, under fairly standard assumptions, for our K-BFGS(L) algorithm with skipping DD and exact inversion of $A_{l}$ (see Algorithm 5 in Section B), the number of iterations $N$ needed to obtain $\frac{1}{N}\sum_{k=1}^{N}\mathbb{E}[\|\nabla f(\theta_{k})\|^{2}]\leq\epsilon$ is $N=O(\epsilon^{-\frac{1}{1-\beta}})$, for step size $\alpha_{k}$ chosen proportional to $k^{-\beta}$, where $\beta\in(0.5,1)$ is a constant. Our proofs, which are deferred to Section B, make use of the following assumptions, the first two of which were made in [41].

AS. 1.

$f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is continuously differentiable. $f(\theta)\geq f^{low}>-\infty$ for any $\theta\in\mathbb{R}^{n}$. $\nabla f$ is globally $L$-Lipschitz continuous; namely, for any $x,y\in\mathbb{R}^{n}$, $\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|$.

AS. 2.

For any iteration $k$, the stochastic gradient $\widehat{\nabla f}_{k}=\widehat{\nabla f}(\theta_{k},\xi_{k})$ satisfies:
a) $\mathbb{E}_{\xi_{k}}\left[\widehat{\nabla f}(\theta_{k},\xi_{k})\right]=\nabla f(\theta_{k})$, b) $\mathbb{E}_{\xi_{k}}\left[\left\|\widehat{\nabla f}(\theta_{k},\xi_{k})-\nabla f(\theta_{k})\right\|^{2}\right]\leq\sigma^{2}$, where $\sigma>0$, and $\xi_{k}$, $k=1,2,\ldots$, are independent samples that are independent of $\{\theta_{j}\}_{j=1}^{k}$.

AS. 3.

The activation functions $\phi_{l}$ have bounded values: $\exists\varphi>0$ s.t. $\forall l,\forall h$, $|\phi_{l}(h)|\leq\varphi$.

To use the convergence analysis in [41], we need to show that the block-diagonal approximation of the inverse Hessian used in Algorithm 5 satisfies the assumption that it is bounded above and below by positive definite matrices. Given the Kronecker structure of our inverse Hessian approximation, it suffices to prove boundedness of both $H_{a}^{l}(k)$ and $H_{g}^{l}(k)$ for all iterations $k$. Making the additional assumption AS.3, we are able to prove Lemma 1, and hence Lemma 3, below. Note that many popular activation functions satisfy AS.3, such as sigmoid and tanh.

Lemma 1.

Suppose that AS.3 holds. There exist two positive constants $\underline{\kappa}_{a}$, $\bar{\kappa}_{a}$ such that $\underline{\kappa}_{a}I\preceq H_{a}^{l}(k)\preceq\bar{\kappa}_{a}I,\ \forall k,l$.

Lemma 2.

There exist two positive constants $\underline{\kappa}_{g}$ and $\bar{\kappa}_{g}$, such that $\underline{\kappa}_{g}I\preceq H_{g}^{l}(k)\preceq\bar{\kappa}_{g}I,\ \forall k,l$.

Lemma 3.

Suppose that AS.3 holds. Let $\theta_{k+1}=\theta_{k}-\alpha_{k}H_{k}\widehat{\nabla f}_{k}$ be the step taken in Algorithm 5. There exist two positive constants $\underline{\kappa}$, $\bar{\kappa}$ such that $\underline{\kappa}I\preceq H_{k}\preceq\bar{\kappa}I,\ \forall k$.

Using Lemma 3, we can now apply Theorem 2.8 in [41] to prove the convergence of Algorithm 5:

Theorem 2.

Suppose that assumptions AS.1-3 hold for $\{\theta_{k}\}$ generated by Algorithm 5 with mini-batch size $m_{k}=m$ for all $k$, and $\alpha_{k}$ is chosen as $\alpha_{k}=\frac{\underline{\kappa}}{L\bar{\kappa}^{2}}k^{-\beta}$, with $\beta\in(0.5,1)$. Then

$$\frac{1}{N}\sum_{k=1}^{N}\mathbb{E}\left[\left\|\nabla f(\theta_{k})\right\|^{2}\right]\leq\frac{2L\left(M_{f}-f^{low}\right)\bar{\kappa}^{2}}{\underline{\kappa}^{2}}N^{\beta-1}+\frac{\sigma^{2}}{(1-\beta)m}\left(N^{-\beta}-N^{-1}\right),$$

where $N$ denotes the iteration number and $M_{f}>0$ depends only on $f$. Moreover, for a given $\epsilon\in(0,1)$, to guarantee that $\frac{1}{N}\sum_{k=1}^{N}\mathbb{E}\left[\left\|\nabla f(\theta_{k})\right\|^{2}\right]<\epsilon$, the number of iterations $N$ needed is at most $O\left(\epsilon^{-\frac{1}{1-\beta}}\right)$.
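As a concrete instance (a direct substitution into the bound of Theorem 2, not an additional result), taking $\beta=2/3$ gives

$$\frac{1}{N}\sum_{k=1}^{N}\mathbb{E}\left[\left\|\nabla f(\theta_{k})\right\|^{2}\right]\leq\frac{2L\left(M_{f}-f^{low}\right)\bar{\kappa}^{2}}{\underline{\kappa}^{2}}N^{-1/3}+\frac{3\sigma^{2}}{m}\left(N^{-2/3}-N^{-1}\right)=O\left(N^{-1/3}\right),$$

so driving the averaged squared gradient norm below $\epsilon$ requires $N=O(\epsilon^{-3})=O(\epsilon^{-\frac{1}{1-\beta}})$ iterations.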

Note: other theorems in [41], namely Theorems 2.5 and 2.6, also apply here under our assumptions.

6 Experiments

Before we present some experimental results, we address the use of moving averages, and the computational and storage requirements of the algorithms that we tested.

Mini-batch and Moving Average. Clearly, using the whole dataset at each iteration is inefficient; hence, we use a mini-batch to estimate the desired quantities. We use $\overline{X}$ to denote the averaged value of $X$ across the mini-batch for any quantity $X$. To incorporate information from the past, as well as to reduce variability, we use an exponentially decaying moving average, with decay parameter $\beta\in(0,1)$, to estimate the desired quantities:

  1. To estimate the gradient $\mathbb{E}_{i}[\nabla f(i)]$, at each iteration we update $\widehat{\nabla f}=\beta\cdot\widehat{\nabla f}+(1-\beta)\cdot\overline{\nabla f}$.

  2. $H_{a}^{l}$: To estimate $A_{l}$, at each iteration we update $\widehat{A_{l}}=\beta\cdot\widehat{A_{l}}+(1-\beta)\cdot\overline{\mathbf{a}_{l-1}\mathbf{a}_{l-1}^{\top}}$. Note that although we compute $\mathbf{s}_{a}^{l}$ as $H_{a}^{l}\cdot\overline{\mathbf{a}_{l-1}}$, we update $\widehat{A_{l}}$ with $\overline{\mathbf{a}_{l-1}\mathbf{a}_{l-1}^{\top}}$ (i.e., the average of $\mathbf{a}_{l-1}(i)\mathbf{a}_{l-1}(i)^{\top}$ over the mini-batch, not $\overline{\mathbf{a}_{l-1}}\cdot(\overline{\mathbf{a}_{l-1}})^{\top}$).

  3. $H^{l}_{g}$: BFGS implicitly incorporates momentum through the matrices $H^{l}_{g}$. To further stabilize the BFGS update, we also use a moving-averaged $(\mathbf{s}_{g}^{l},\mathbf{y}_{g}^{l})$ (before damping); i.e., we update $\mathbf{s}_{g}^{l}=\beta\cdot\mathbf{s}_{g}^{l}+(1-\beta)\cdot\left(\overline{\mathbf{h}_{l}^{+}}-\overline{\mathbf{h}_{l}}\right)$ and $\mathbf{y}_{g}^{l}=\beta\cdot\mathbf{y}_{g}^{l}+(1-\beta)\cdot\left(\overline{\mathbf{g}_{l}^{+}}-\overline{\mathbf{g}_{l}}\right)$.

Finally, when computing $\overline{\mathbf{h}_{l}^{+}}$ and $\overline{\mathbf{g}_{l}^{+}}$, we use the same mini-batch as was used to compute $\overline{\mathbf{h}_{l}}$ and $\overline{\mathbf{g}_{l}}$. This doubles the number of forward-backward passes at each iteration.
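A compact sketch of the moving-average rule used in all three items above; the layer quantities here are dummy placeholders for the corresponding mini-batch averages, and $\beta=0.9$ is the paper's default.

```python
import numpy as np

def ema(old, new, beta=0.9):
    """Exponentially decaying moving average with decay parameter beta."""
    return beta * old + (1 - beta) * new

# illustration with dummy quantities for a layer with d_i = 4, d_o = 3
rng = np.random.default_rng(0)
grad_hat, grad_bar = np.zeros((3, 4)), rng.standard_normal((3, 4))
A_hat, a_outer_bar = np.eye(4), rng.standard_normal((4, 4))
grad_hat = ema(grad_hat, grad_bar)     # item 1: gradient estimate
A_hat = ema(A_hat, a_outer_bar)        # item 2: average of a a^T (not abar abar^T)
# item 3: the (s_g, y_g) pairs are averaged in exactly the same way before damping
```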

Storage and Computational Complexity. Tables 1 and 2 compare the storage and computational requirements, respectively, for a layer with $d_{i}$ inputs and $d_{o}$ outputs for K-BFGS, K-BFGS(L), KFAC, and Adam/RMSprop. We denote the size of the mini-batch by $m$, the number of $(\mathbf{s},\mathbf{y})$ pairs stored for L-BFGS by $p$, and the frequency of matrix inversion in KFAC by $T$. Besides the requirements listed in Table 1, all algorithms need storage for the parameters $W_{l}$ and the estimate of the gradient $\widehat{\nabla f_{l}}$ (i.e., $O(d_{i}d_{o})$). Besides the work listed in Table 2, all algorithms also need to do a forward-backward pass to compute $\nabla f_{l}$ as well as to update $W_{l}$ (i.e., $O(md_{i}d_{o})$). Also note that, even though we use big-$O$ notation in these tables, the constants for all of the terms in each of the rows are roughly at the same level and relatively small.

In Table 2, for K-BFGS and K-BFGS(L), "Additional pass" refers to Line 5 of Algorithm 2; under "Curvature", $O(md_{i}^{2})$ arises from "Hessian-action" BFGS used to update $H_{a}^{l}$ (see the algorithm at the end of Section 4), $O(md_{o})$ arises from (5), and $O(d_{o}^{2})$ arises from updating $H_{g}^{l}$ (only for K-BFGS); and "Step $\Delta W_{l}$" refers to (4). For KFAC, referring to Algorithm 7 (in the appendix), "Additional pass" refers to Line 7; under "Curvature", $O(md_{i}^{2}+md_{o}^{2})$ refers to Line 8, and $O(\frac{1}{T}d_{i}^{3}+\frac{1}{T}d_{o}^{3})$ refers to Line 10; and "Step $\Delta W_{l}$" refers to Line 5.

From Table 1, we see that the Kronecker property enables K-BFGS and K-BFGS(L) (as well as KFAC) to have storage requirements comparable to those of first-order methods. Moreover, from Table 2, we see that K-BFGS and K-BFGS(L) require less computation per iteration than KFAC, since they only involve matrix multiplications, whereas KFAC requires matrix inversions which depend cubically on both $d_{i}$ and $d_{o}$. The cost of matrix inversion in KFAC (and singular value decomposition in [19]) is amortized by performing these operations only once every $T$ iterations; nonetheless, these amortized operations usually become much slower than matrix multiplication as models scale up.

Table 1: Storage
Algorithm | $\nabla f_{l}\odot\nabla f_{l}$ | $A$ | $G$ | Total
K-BFGS | -- | $O(d_{i}^{2})$ | $O(d_{o}^{2})$ | $O(d_{i}^{2}+d_{o}^{2}+d_{i}d_{o})$
K-BFGS(L) | -- | $O(d_{i}^{2})$ | $O(pd_{o})$ | $O(d_{i}^{2}+d_{i}d_{o}+pd_{o})$
KFAC | -- | $O(d_{i}^{2})$ | $O(d_{o}^{2})$ | $O(d_{i}^{2}+d_{o}^{2}+d_{i}d_{o})$
Adam/RMSprop | $O(d_{i}d_{o})$ | -- | -- | $O(d_{i}d_{o})$
Table 2: Computation per iteration
Algorithm | Additional pass | Curvature | Step $\Delta W_{l}$
K-BFGS | $O(md_{i}d_{o})$ | $O(md_{i}^{2}+md_{o}+d_{o}^{2})$ | $O(d_{i}^{2}d_{o}+d_{o}^{2}d_{i})$
K-BFGS(L) | $O(md_{i}d_{o})$ | $O(md_{i}^{2}+md_{o})$ | $O(d_{i}^{2}d_{o}+pd_{i}d_{o})$
KFAC | $O(md_{i}d_{o})$ | $O(md_{i}^{2}+md_{o}^{2}+\frac{1}{T}d_{i}^{3}+\frac{1}{T}d_{o}^{3})$ | $O(d_{i}^{2}d_{o}+d_{o}^{2}d_{i})$
Adam/RMSprop | -- | $O(d_{i}d_{o})$ | $O(d_{i}d_{o})$

Experimental Results. We tested K-BFGS and K-BFGS(L), as well as KFAC, Adam/RMSprop and SGD-m (SGD with momentum) on three autoencoder problems, namely, MNIST [25], FACES, and CURVES, which are used in e.g. [22, 29, 30], except that we replaced the sigmoid activation with ReLU. See Section D in the appendix for a complete description of these problems and the competing algorithms.

Since one can view Powell's damping with $B=I$ as LM damping, we write $\mu_{2}=\lambda_{G}$, where $\lambda_{G}$ denotes the LM damping parameter for $G_{l}$. We then define $\lambda=\lambda_{A}\lambda_{G}$ as the overall damping term of our QN approximation. To simplify matters, we chose $\lambda_{A}=\lambda_{G}=\sqrt{\lambda}$, so that we needed to tune only one hyper-parameter (HP), $\lambda$.

Figure 1: Comparison between algorithms on (a) MNIST, (b) FACES, (c) CURVES. For each problem, the upper (lower) row depicts training loss (testing (mean square) error), whereas the left (right) column depicts training/test progress versus epoch (CPU time), respectively. After each epoch, the training loss/testing error over the whole training/testing set is reported (the time for computing this loss is not included in the plots). For each problem, algorithms are terminated after the same amount of CPU time.

To obtain the results in Figure 1, we first did a grid search on (learning rate, damping) pairs for all algorithms (except for SGD-m, whose grid search was only on the learning rate), where damping refers to $\lambda$ for K-BFGS/K-BFGS(L)/KFAC, and to $\epsilon$ for RMSprop/Adam. We then selected the best (learning rate, damping) pairs, i.e., those with the lowest training loss encountered. The range for the grid search and the best HP values (as well as other fixed HP values) are listed in Section D in the appendix. Using the best HP values that we found, we then made 20 runs employing different random seeds, and plotted the mean value of the 20 runs as the solid line and the standard deviation as the shaded area. (Results were obtained on a machine with 8 x Intel(R) Xeon(R) CPU @ 2.30GHz and 1 x NVIDIA Tesla P100. Code is available at https://github.com/renyiryry/kbfgs_neurips2020_public.)

From the training loss plots in Figure 1, our algorithms clearly outperformed the first-order methods, except for RMSprop/Adam on CURVES, with respect to CPU time, and performed comparably to KFAC in terms of both CPU time and number of epochs taken. The testing error plots in Figure 1 show that our K-BFGS(L) method and KFAC behave very similarly and substantially outperform all of the first-order methods in terms of both of these measures. This suggests that our algorithms not only optimize well, but also generalize well.

To further demonstrate the robustness of our algorithms, we examined the loss under various HP settings, which showed that our algorithms are stable under a fairly wide range for the HPs (see Section D.4 in the appendix).

We also repeated our experiments using mini-batches of size 100 for all algorithms (see Figures 4, 5, and 6 in the appendix, where HPs are optimally tuned for batch sizes of 100), and our proposed methods continue to demonstrate advantageous performance, both in training and testing. For these experiments, we did not experiment with 20 random seeds. These results show that our approach works as well for relatively small mini-batch sizes of 100 as for the less noisy mini-batches of size 1000, and hence is robust in the stochastic setting employed to train DNNs.

Compared with the other methods mentioned in this paper, our K-BFGS and K-BFGS(L) have the extra advantage of being able to double the size of the mini-batch used to compute the stochastic gradient at almost no extra cost, which might be of particular interest in a highly stochastic setting. See Section D.6 in the appendix for more discussion of this and some preliminary experimental results.

7 Conclusion

We proposed Kronecker-factored QN methods, namely, K-BFGS and K-BFGS(L), for training multi-layer feed-forward neural network models, that use layer-wise second-order information and require modest memory and computational resources. Experimental results indicate that our methods outperform or perform comparably to the state-of-the-art first-order and second-order methods. Our methods can also be extended to convolutional and recurrent NNs.

Broader Impact

The research presented in this paper provides a new method for training DNNs that our experimental testing has shown to be more efficient in several cases than current state-of-the-art optimization methods for this task. Consequently, because of the wide use of DNNs in machine learning, this should help save a substantial amount of energy. Our new algorithms simply attempt to minimize the loss function that is given to them, and the computations that they perform are all transparent. In machine learning, DNNs can be trained to address many types of problems, some of which should lead to positive societal outcomes, such as ones in medicine (e.g., diagnostics and drug effectiveness), autonomous vehicle development, voice recognition and climate change. Of course, optimization algorithms for training DNNs can be used to train models that may have negative consequences, such as those intended to develop psychological profiles, invade privacy and justify biases. The misuse of any efficient optimization algorithm for machine learning, and in particular our algorithms, is beyond the control of the work presented here.

Acknowledgments and Disclosure of Funding

DG and YR were supported in part by NSF Grant IIS-1838061. DG acknowledges the computational support provided by Google Cloud Platform Education Grant, Q81G-U4X3-5AG5-F5CG.

References

  • Amari et al. [2000] S.-I. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural computation, 12(6):1399–1409, 2000.
  • Ba et al. [2017] J. Ba, R. B. Grosse, and J. Martens. Distributed second-order optimization using kronecker-factored approximations. In ICLR, 2017.
  • Badreddine et al. [2014] H. Badreddine, S. Vandewalle, and J. Meyers. Sequential quadratic programming (sqp) for optimal control in direct numerical simulation of turbulent flow. Journal of Computational Physics, 256:1–16, 2014.
  • Bordes et al. [2009] A. Bordes, L. Bottou, and P. Gallinari. Sgd-qn: Careful quasi-newton stochastic gradient descent. Journal of Machine Learning Research, 10(Jul):1737–1754, 2009.
  • Botev et al. [2017] A. Botev, H. Ritter, and D. Barber. Practical gauss-newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 557–565. JMLR. org, 2017.
  • Broyden [1970] C. G. Broyden. The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal of Applied Mathematics, 6(1):76–90, 1970.
  • Byrd et al. [1994] R. H. Byrd, J. Nocedal, and R. B. Schnabel. Representations of quasi-newton matrices and their use in limited memory methods. Mathematical Programming, 63(1-3):129–156, 1994.
  • Byrd et al. [2011] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
  • Byrd et al. [2016] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
  • Dangel et al. [2019] F. Dangel, P. Hennig, and S. Harmeling. Modular block-diagonal curvature approximations for feedforward architectures. arXiv preprint arXiv:1902.01813, 2019.
  • Desjardins et al. [2015] G. Desjardins, K. Simonyan, R. Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.
  • Duchi et al. [2011] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Fletcher [1970] R. Fletcher. A new approach to variable metric algorithms. The computer journal, 13(3):317–322, 1970.
  • Fujimoto and Ohira [2018] Y. Fujimoto and T. Ohira. A neural network model with bidirectional whitening. In International Conference on Artificial Intelligence and Soft Computing, pages 47–57. Springer, 2018.
  • George et al. [2018] T. George, C. Laurent, X. Bouthillier, N. Ballas, and P. Vincent. Fast approximate natural gradient descent in a kronecker factored eigenbasis. In Advances in Neural Information Processing Systems, pages 9550–9560, 2018.
  • Goldfarb [1970] D. Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of computation, 24(109):23–26, 1970.
  • Gower et al. [2016] R. Gower, D. Goldfarb, and P. Richtárik. Stochastic block bfgs: Squeezing more curvature out of data. In International Conference on Machine Learning, pages 1869–1878, 2016.
  • Gower and Richtárik [2017] R. M. Gower and P. Richtárik. Randomized quasi-newton updates are linearly convergent matrix inversion algorithms. SIAM Journal on Matrix Analysis and Applications, 38(4):1380–1409, 2017.
  • Gupta et al. [2018] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1842–1850. PMLR, 2018.
  • Heskes [2000] T. Heskes. On "natural" learning and pruning in multilayered perceptrons. Neural Computation, 12, 01 2000. doi: 10.1162/089976600300015637.
  • Hinton et al. [2012] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 14(8), 2012.
  • Hinton and Salakhutdinov [2006] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
  • Johnson and Zhang [2013] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
  • Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li and Montúfar [2018] W. Li and G. Montúfar. Natural gradient via optimal transport. Information Geometry, 1(2):181–214, 2018.
  • Liu and Nocedal [1989] D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
  • Lucchi et al. [2015] A. Lucchi, B. McWilliams, and T. Hofmann. A variance reduced stochastic newton method. arXiv preprint arXiv:1503.08316, 2015.
  • Martens [2010] J. Martens. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.
  • Martens and Grosse [2015] J. Martens and R. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417, 2015.
  • Mokhtari and Ribeiro [2014] A. Mokhtari and A. Ribeiro. Res: Regularized stochastic bfgs algorithm. IEEE Transactions on Signal Processing, 62(23):6089–6104, 2014.
  • Mokhtari and Ribeiro [2015] A. Mokhtari and A. Ribeiro. Global convergence of online limited memory bfgs. The Journal of Machine Learning Research, 16(1):3151–3181, 2015.
  • Moritz et al. [2016] P. Moritz, R. Nishihara, and M. Jordan. A linearly-convergent stochastic l-bfgs algorithm. In Artificial Intelligence and Statistics, pages 249–258, 2016.
  • Nocedal and Wright [2006] J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • Powell [1978] M. J. Powell. Algorithms for nonlinear constraints that use lagrangian functions. Mathematical programming, 14(1):224–248, 1978.
  • Robbins and Monro [1951] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • Roux and Fitzgibbon [2010] N. L. Roux and A. W. Fitzgibbon. A fast natural newton method. In ICML, 2010.
  • Schraudolph et al. [2007] N. N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-newton method for online convex optimization. In Artificial intelligence and statistics, pages 436–443, 2007.
  • Shanno [1970] D. F. Shanno. Conditioning of quasi-newton methods for function minimization. Mathematics of computation, 24(111):647–656, 1970.
  • Vinyals and Povey [2012] O. Vinyals and D. Povey. Krylov subspace descent for deep learning. In Artificial Intelligence and Statistics, pages 1261–1268, 2012.
  • Wang et al. [2017] X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2):927–956, 2017.
  • Wang and Li [2020] Y. Wang and W. Li. Information newton’s flow: second-order optimization method in probability space. arXiv preprint arXiv:2001.04341, 2020.
  • Xu et al. [2019] P. Xu, F. Roosta, and M. W. Mahoney. Newton-type methods for non-convex optimization under inexact hessian information. Mathematical Programming, pages 1–36, 2019.

Appendix A Pseudocode for K-BFGS/K-BFGS(L)

Algorithm 4 gives pseudocode for K-BFGS/K-BFGS(L), which is implemented in the experiments. For details see Sections 3, 4, and Section C in the Appendix.

Algorithm 4 Pseudocode for K-BFGS / K-BFGS(L)
0:  Given initial weights $\theta=\left[\operatorname{vec}(W_{1})^{\top},\ldots,\operatorname{vec}(W_{L})^{\top}\right]^{\top}$, batch size $m$, learning rate $\alpha$, damping value $\lambda$, and for K-BFGS(L), the number of $(\mathbf{s},\mathbf{y})$ pairs $p$ that are stored and used to compute $H_{g}^{l}$ at each iteration
1:  $\mu_{1}=0.2$, $\beta=0.9$ {set default hyper-parameter values}
2:  $\lambda_{A}=\lambda_{G}=\sqrt{\lambda}$ {split the damping into $A$ and $G$}
3:  $\widehat{\nabla f}_{l}=0$, $A_{l}=\mathbb{E}_{i}\left[\mathbf{a}_{l-1}(i)\mathbf{a}_{l-1}(i)^{\top}\right]$ by a forward pass, $H_{a}^{l}=(A_{l}+\lambda_{A}I_{A})^{-1}$, $H_{g}^{l}=I$ ($l=1,\ldots,L$) {initialization}
4:  for $k=1,2,\ldots$ do
5:     Sample mini-batch of size $m$: $M_{k}=\{\xi_{k,i},i=1,\dots,m\}$
6:     Perform a forward-backward pass over the current mini-batch $M_{k}$ to compute $\overline{\nabla f}_{l}$, $\mathbf{a}_{l}$, $\mathbf{h}_{l}$, and $\mathbf{g}_{l}$ ($l=1,\ldots,L$) (see Algorithm 1)
7:     for $l=1,\ldots,L$ do
8:        $\widehat{\nabla f}_{l}=\beta\widehat{\nabla f}_{l}+(1-\beta)\overline{\nabla f}_{l}$
9:        $p_{l}=H_{g}^{l}\widehat{\nabla f}_{l}H_{a}^{l}$
10:        {In K-BFGS(L), when computing $H^{l}_{g}\left(\widehat{\nabla f}_{l}H^{l}_{a}\right)$, L-BFGS is initialized with an identity matrix}
11:        $W_{l}=W_{l}-\alpha\cdot p_{l}$
12:     Perform another forward-backward pass over $M_{k}$ to compute $\mathbf{h}_{l}^{+}$, $\mathbf{g}_{l}^{+}$ ($l=1,\ldots,L$)
13:     for $l=1,\ldots,L$ do
14:        {Use damped BFGS or L-BFGS to update $H_{g}^{l}$ (see Section 3)}
15:        $\mathbf{s}_{g}^{l}=\beta\cdot\mathbf{s}_{g}^{l}+(1-\beta)\cdot\left(\overline{\mathbf{h}_{l}^{+}}-\overline{\mathbf{h}_{l}}\right)$, $\mathbf{y}_{g}^{l}=\beta\cdot\mathbf{y}_{g}^{l}+(1-\beta)\cdot\left(\overline{\mathbf{g}_{l}^{+}}-\overline{\mathbf{g}_{l}}\right)$
16:        $(\tilde{\mathbf{s}}_{g}^{l},\tilde{\mathbf{y}}_{g}^{l})=\text{DD}(\mathbf{s}_{g}^{l},\mathbf{y}_{g}^{l})$ with $H=H_{g}^{l}$, $\mu_{1}=\mu_{1}$, $\mu_{2}=\lambda_{G}$ {see Algorithm 3}
17:        Use BFGS or L-BFGS with $(\tilde{\mathbf{s}}_{g}^{l},\tilde{\mathbf{y}}_{g}^{l})$ to update $H_{g}^{l}$
18:        {Use Hessian-action BFGS to update $H_{a}^{l}$ (see Section 4)}
19:        $A_{l}=\beta\cdot A_{l}+(1-\beta)\cdot\overline{\mathbf{a}_{l-1}\mathbf{a}_{l-1}^{\top}}$
20:        $A_{l}^{\text{LM}}=A_{l}+\lambda_{A}I_{A}$
21:        $\mathbf{s}_{a}^{l}=H_{a}^{l}\cdot\overline{\mathbf{a}_{l-1}}$, $\mathbf{y}_{a}^{l}=A_{l}^{\text{LM}}\mathbf{s}_{a}^{l}$
22:        Use BFGS with $(\mathbf{s}_{a}^{l},\mathbf{y}_{a}^{l})$ to update $H_{a}^{l}$

Appendix B Convergence: Proofs of Lemmas 1-3 and Theorem 2

Algorithm 5 K-BFGS(L) with DD-skip and exact inversion of $A_{l}^{\text{LM}}$
0:  Given initial weights $\theta=\left[\operatorname{vec}(W_{1})^{\top},\ldots,\operatorname{vec}(W_{L})^{\top}\right]^{\top}$, batch size $m$, learning rates $\{\alpha_{k}\}$, damping value $\lambda$, and the number of $(\mathbf{s},\mathbf{y})$ pairs $p$ that are stored and used to compute $H_{g}^{l}$ at each iteration
1:  $\mu_{1}=0.2$, $\beta=0.9$ {set default hyper-parameter values}
2:  $\lambda_{A}=\lambda_{G}=\sqrt{\lambda}$ {split the damping into $A$ and $G$}
3:  $A_{l}(0)=\mathbb{E}_{i}\left[\mathbf{a}_{l-1}(i)\mathbf{a}_{l-1}(i)^{\top}\right]$ by a forward pass, $H_{a}^{l}(0)=(A_{l}(0)+\lambda_{A}I_{A})^{-1}$, $H_{g}^{l}(0)=I$ ($l=1,\ldots,L$) {initialization}
4:  for $k=1,2,\ldots$ do
5:     Sample mini-batch of size $m$: $M_{k}=\{\xi_{k,i},i=1,\dots,m\}$
6:     Perform a forward-backward pass over the current mini-batch $M_{k}$ to compute $\overline{\nabla f}_{l}$, $\mathbf{a}_{l}$, $\mathbf{h}_{l}$, and $\mathbf{g}_{l}$ ($l=1,\ldots,L$) (see Algorithm 1)
7:     for $l=1,\ldots,L$ do
8:        $p_{l}=H_{g}^{l}(k-1)\widehat{\nabla f}_{l}H_{a}^{l}(k-1)$, where $\widehat{\nabla f}_{l}=\overline{\nabla f}_{l}$
9:        {When computing $H^{l}_{g}\left(\widehat{\nabla f}_{l}H^{l}_{a}\right)$, L-BFGS is initialized with an identity matrix}
10:        $W_{l}=W_{l}-\alpha_{k}\cdot p_{l}$
11:     Perform another forward-backward pass over $M_{k}$ to compute $\mathbf{h}_{l}^{+}$, $\mathbf{g}_{l}^{+}$ ($l=1,\ldots,L$)
12:     for $l=1,\ldots,L$ do
13:        {Use damped L-BFGS with skip to update $H_{g}^{l}$ (see Section 3)}
14:        $\mathbf{s}_{g}^{l}=\beta\cdot\mathbf{s}_{g}^{l}+(1-\beta)\cdot\left(\overline{\mathbf{h}_{l}^{+}}-\overline{\mathbf{h}_{l}}\right)$, $\mathbf{y}_{g}^{l}=\beta\cdot\mathbf{y}_{g}^{l}+(1-\beta)\cdot\left(\overline{\mathbf{g}_{l}^{+}}-\overline{\mathbf{g}_{l}}\right)$
15:        $(\tilde{\mathbf{s}}_{g}^{l},\tilde{\mathbf{y}}_{g}^{l})=\text{DD}(\mathbf{s}_{g}^{l},\mathbf{y}_{g}^{l})$ with $H=H_{g}^{l}(k-1)$, $\mu_{1}=\mu_{1}$, $\mu_{2}=\lambda_{G}$ {see Algorithm 3}
16:        if $(\tilde{\mathbf{s}}_{g}^{l})^{\top}\tilde{\mathbf{y}}_{g}^{l}\geq\mu_{1}(\tilde{\mathbf{y}}_{g}^{l})^{\top}H_{g}^{l}\tilde{\mathbf{y}}_{g}^{l}$ then
17:           Use L-BFGS with $(\tilde{\mathbf{s}}_{g}^{l},\tilde{\mathbf{y}}_{g}^{l})$ to update $H_{g}^{l}(k)$
18:        {Use exact inversion to compute $H_{a}^{l}$}
19:        $A_{l}(k)=\beta\cdot A_{l}(k-1)+(1-\beta)\cdot\overline{\mathbf{a}_{l-1}\mathbf{a}_{l-1}^{\top}}$
20:        $A_{l}^{\text{LM}}(k)=A_{l}(k)+\lambda_{A}I_{A}$
21:        $H_{a}^{l}(k)=\left(A_{l}^{\text{LM}}(k)\right)^{-1}$

In this section, we prove the convergence of Algorithm 5, a variant of K-BFGS(L). Algorithm 5 is very similar to our actual implementation of K-BFGS(L) (i.e., Algorithm 4), except that

  • we skip updating $H_{g}^{l}$ if $(\tilde{\mathbf{s}}_{g}^{l})^{\top}\tilde{\mathbf{y}}_{g}^{l}<\mu_{1}(\tilde{\mathbf{y}}_{g}^{l})^{\top}H_{g}^{l}\tilde{\mathbf{y}}_{g}^{l}$ (see Line 16);

  • we set $H_{a}^{l}$ to the exact inverse of $A_{l}^{\text{LM}}$ (see Line 21);

  • we use decreasing step sizes $\{\alpha_{k}\}$ as specified in Theorem 2;

  • we use the mini-batch gradient instead of the momentum gradient (see Line 8).

To accomplish this, we prove Lemmas 1-3, which, in addition to Assumptions AS.1-2, ensure that all of the assumptions in Theorem 2.8 in [41] are satisfied, and hence that the generic stochastic quasi-Newton (SQN) method, i.e., Algorithm 6 below, converges. Specifically, Theorem 2.8 in [41] requires, in addition to Assumptions AS.1-2, the assumption

AS. 4.

There exist two positive constants $\underline{\kappa},\bar{\kappa}$, such that $\underline{\kappa}I\preceq H_{k}\preceq\bar{\kappa}I,\ \forall k$; for any $k\geq 2$, the random variable $H_{k}$ depends only on $\xi_{[k-1]}$.

Algorithm 6 SQN method for nonconvex stochastic optimization.
0:  Given θ1n\theta_{1}\in\mathbb{R}^{n}, batch sizes {mk}k1\left\{m_{k}\right\}_{k\geq 1}, and step sizes {αk}k1\left\{\alpha_{k}\right\}_{k\geq 1}
1:  for k=1,2,k=1,2,\ldots do
2:     Calculate 𝐟^k=1mki=1mk𝐟(θk,ξk,i)\widehat{\mathbf{\nabla f}}_{k}=\frac{1}{m_{k}}\sum_{i=1}^{m_{k}}\mathbf{\nabla f}(\theta_{k},\xi_{k,i})
3:     Generate a positive definite Hessian inverse approximation HkH_{k}
4:     Calculate θk+1=θkαkHk𝐟^k\theta_{k+1}=\theta_{k}-\alpha_{k}H_{k}\widehat{\mathbf{\nabla f}}_{k}
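In code, one run of the SQN framework in Algorithm 6 amounts to the loop sketched below. This is schematic only: grad_fn, make_H, and sample_batch are hypothetical placeholders for the stochastic gradient oracle, the construction of the positive definite H_k, and the mini-batch sampler.

import numpy as np

def sqn(theta, grad_fn, make_H, sample_batch, step_sizes, num_iters):
    """Generic SQN loop: theta_{k+1} = theta_k - alpha_k * H_k * g_k."""
    for k in range(1, num_iters + 1):
        batch = sample_batch(k)                                    # mini-batch {xi_{k,i}}
        g = np.mean([grad_fn(theta, xi) for xi in batch], axis=0)  # averaged stochastic gradient
        H = make_H(k)        # positive definite H_k built only from information up to k-1 (AS.4)
        theta = theta - step_sizes[k - 1] * (H @ g)
    return theta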

In the following proofs, \|\cdot\| denotes the 2-norm for vectors, and the spectral norm for matrices.

Proof of Lemma 1:

Proof.

Because AlLM(k)λAIAA_{l}^{\text{LM}}(k)\succeq\lambda_{A}I_{A}, we have that Hal(k)κ¯aIAH_{a}^{l}(k)\preceq\bar{\kappa}_{a}I_{A}, where κ¯a=1λA\bar{\kappa}_{a}=\frac{1}{\lambda_{A}}.

On the other hand, for any 𝐱dl{\mathbf{x}}\in\mathbb{R}^{d_{l}}, by Cauchy-Schwarz, 𝐚l1(i),𝐱2𝐱2𝐚l1(i)2𝐱2(1+φ2dl)\langle{{\mathbf{a}}}_{l-1}(i),{\mathbf{x}}\rangle^{2}\;\leq\|{\mathbf{x}}\|^{2}\|{{\mathbf{a}}}_{l-1}(i)\|^{2}\leq\|{\mathbf{x}}\|^{2}(1+\varphi^{2}d_{l}). Hence, 𝐚l1𝐚l1¯1+φ2dl\left\|\overline{{\mathbf{a}}_{l-1}{\mathbf{a}}_{l-1}^{\top}}\right\|\leq 1+\varphi^{2}d_{l}; similarly, Al(0)1+φ2dl\|A_{l}(0)\|\leq 1+\varphi^{2}d_{l}. Because Al(k)βAl(k1)+(1β)𝐚l1𝐚l1¯\|A_{l}(k)\|\leq\beta\|A_{l}(k-1)\|+(1-\beta)\left\|\overline{{\mathbf{a}}_{l-1}{\mathbf{a}}_{l-1}^{\top}}\right\|, by induction, Al(k)1+φ2dl\|A_{l}(k)\|\leq 1+\varphi^{2}d_{l} for any kk and ll. Thus, AlLM(k)1+φ2dl+λA\|A_{l}^{\text{LM}}(k)\|\leq 1+\varphi^{2}d_{l}+\lambda_{A}. Since AlLM(k)A_{l}^{\text{LM}}(k) is symmetric positive definite, its largest eigenvalue equals its spectral norm, and hence Hal(k)κ¯aIAH_{a}^{l}(k)\succeq\underline{\kappa}_{a}I_{A}, where κ¯a=11+φ2dl+λA\underline{\kappa}_{a}=\frac{1}{1+\varphi^{2}d_{l}+\lambda_{A}}. ∎

Proof of Lemma 2:

Proof.

To simplify notation, we omit the subscript gg, superscript ll and the iteration index kk in the proof. Hence, our goal is to prove κ¯gIH=Hgl(k)κ¯gI\underline{\kappa}_{g}I\preceq H=H_{g}^{l}(k)\preceq\bar{\kappa}_{g}I, for any ll and kk. Let (𝐬i,𝐲i)({\mathbf{s}}_{i},{\mathbf{y}}_{i}) (i=1,,pi=1,...,p) denote the pairs used in an L-BFGS computation of HH. Since (𝐬i,𝐲i)({\mathbf{s}}_{i},{\mathbf{y}}_{i}) was not skipped, 𝐲iH¯(i)𝐲i𝐬i𝐲i1μ1\frac{{\mathbf{y}}_{i}^{\top}\bar{H}^{(i)}{\mathbf{y}}_{i}}{{\mathbf{s}}_{i}^{\top}{\mathbf{y}}_{i}}\leq\frac{1}{\mu_{1}}, where H¯(i)\bar{H}^{(i)} denotes the matrix HglH^{l}_{g} used at the iteration in which 𝐬i{\mathbf{s}}_{i} and 𝐲i{\mathbf{y}}_{i} were computed. Note that this is not the matrix HiH_{i} used in the recursive computation of HH at the current iterate θk\theta_{k}.

Given an initial estimate H0=B01=IH_{0}=B_{0}^{-1}=I of (Ggl(θk))1(G^{l}_{g}(\theta_{k}))^{-1}, the L-BFGS method updates HiH_{i} recursively as

Hi=(Iρi𝐬i𝐲i)Hi1(Iρi𝐲i𝐬i)+ρi𝐬i𝐬i,i=1,,p,\displaystyle H_{i}=\left(I-\rho_{i}{\mathbf{s}}_{i}{\mathbf{y}}_{i}^{\top}\right)H_{i-1}\left(I-\rho_{i}{\mathbf{y}}_{i}{\mathbf{s}}_{i}^{\top}\right)+\rho_{i}{\mathbf{s}}_{i}{\mathbf{s}}_{i}^{\top},\quad i=1,\ldots,p, (12)

where ρi=(𝐬i𝐲i)1\rho_{i}=({\mathbf{s}}_{i}^{\top}{\mathbf{y}}_{i})^{-1}, and equivalently,

Bi=Bi1+𝐲i𝐲i𝐬i𝐲iBi1𝐬i𝐬iBi1𝐬iBi1𝐬i,i=1,,p,B_{i}=B_{i-1}+\frac{{{\mathbf{y}}}_{i}{{\mathbf{y}}}_{i}^{\top}}{{\mathbf{s}}_{i}^{\top}{{\mathbf{y}}}_{i}}-\frac{B_{i-1}{\mathbf{s}}_{i}{\mathbf{s}}_{i}^{\top}B_{i-1}}{{\mathbf{s}}_{i}^{\top}B_{i-1}{\mathbf{s}}_{i}},\quad i=1,\ldots,p,

where Bi=Hi1B_{i}=H_{i}^{-1}. Since we use DD with skipping, we have that 𝐬i𝐬i𝐬i𝐲i1μ2\frac{{\mathbf{s}}_{i}^{\top}{\mathbf{s}}_{i}}{{\mathbf{s}}_{i}^{\top}{\mathbf{y}}_{i}}\leq\frac{1}{\mu_{2}} and 𝐲iH¯(i)𝐲i𝐬i𝐲i1μ1\frac{{\mathbf{y}}_{i}^{\top}\bar{H}^{(i)}{\mathbf{y}}_{i}}{{\mathbf{s}}_{i}^{\top}{\mathbf{y}}_{i}}\leq\frac{1}{\mu_{1}}. Note that we do not have 𝐲iHi1𝐲i𝐬i𝐲i1μ1\frac{{\mathbf{y}}_{i}^{\top}H_{i-1}{\mathbf{y}}_{i}}{{\mathbf{s}}_{i}^{\top}{\mathbf{y}}_{i}}\leq\frac{1}{\mu_{1}}, so we cannot directly apply (10). Hence, by (8), we have that BiBi1(1+1μ1)||B_{i}||\leq||B_{i-1}||\left(1+\frac{1}{\mu_{1}}\right), and therefore B=BpB0(1+1μ1)p=(1+1μ1)p||B||=||B_{p}||\leq||B_{0}||\left(1+\frac{1}{\mu_{1}}\right)^{p}=\left(1+\frac{1}{\mu_{1}}\right)^{p}. Thus, B(1+1μ1)pIB\preceq\left(1+\frac{1}{\mu_{1}}\right)^{p}I and H(1+1μ1)pI:=κ¯gIH\succeq\left(1+\frac{1}{\mu_{1}}\right)^{-p}I:=\underline{\kappa}_{g}I.

On the other hand, since κ¯g\underline{\kappa}_{g} is a uniform lower bound for Hgl(k)H_{g}^{l}(k) for any kk and ll, H¯(i)κ¯gI\bar{H}^{(i)}\succeq\underline{\kappa}_{g}I. Thus,

1μ1𝐲iH¯(i)𝐲i𝐬i𝐲iκ¯g𝐲i𝐲i𝐬i𝐲i𝐲i𝐲i𝐬i𝐲i1μ1κ¯g.\displaystyle\frac{1}{\mu_{1}}\geq\frac{{\mathbf{y}}_{i}^{\top}\bar{H}^{(i)}{\mathbf{y}}_{i}}{{\mathbf{s}}_{i}^{\top}{\mathbf{y}}_{i}}\geq\underline{\kappa}_{g}\frac{{\mathbf{y}}_{i}^{\top}{\mathbf{y}}_{i}}{{\mathbf{s}}_{i}^{\top}{\mathbf{y}}_{i}}\Rightarrow\frac{{\mathbf{y}}_{i}^{\top}{\mathbf{y}}_{i}}{{\mathbf{s}}_{i}^{\top}{\mathbf{y}}_{i}}\leq\frac{1}{\mu_{1}\underline{\kappa}_{g}}.

Hence, using the fact that uv=uv||uv^{\top}||=||u||\cdot||v|| for any vectors u,vu,v, we have ρi𝐬i𝐬i=ρi𝐬i𝐬i=𝐬i𝐬i𝐬i𝐲i1μ2,||\rho_{i}{\mathbf{s}}_{i}{\mathbf{s}}_{i}^{\top}||=\rho_{i}||{\mathbf{s}}_{i}||||{\mathbf{s}}_{i}||=\frac{{\mathbf{s}}_{i}^{\top}{\mathbf{s}}_{i}}{{\mathbf{s}}_{i}^{\top}{\mathbf{y}}_{i}}\leq\frac{1}{\mu_{2}}, and therefore

Hi\displaystyle||H_{i}|| =(Iρi𝐬i𝐲i)Hi1(Iρi𝐲i𝐬i)+ρi𝐬i𝐬i\displaystyle=||\left(I-\rho_{i}{\mathbf{s}}_{i}{\mathbf{y}}_{i}^{\top}\right)H_{i-1}\left(I-\rho_{i}{\mathbf{y}}_{i}{\mathbf{s}}_{i}^{\top}\right)+\rho_{i}{\mathbf{s}}_{i}{\mathbf{s}}_{i}^{\top}||
=Hi1+ρi2(𝐲iHi1𝐲i)𝐬i𝐬iρi𝐬i𝐲iHi1ρiHi1𝐲i𝐬i+ρi𝐬i𝐬i\displaystyle=||H_{i-1}+\rho_{i}^{2}({\mathbf{y}}_{i}^{\top}H_{i-1}{\mathbf{y}}_{i}){\mathbf{s}}_{i}{\mathbf{s}}_{i}^{\top}-\rho_{i}{\mathbf{s}}_{i}{\mathbf{y}}_{i}^{\top}H_{i-1}-\rho_{i}H_{i-1}{\mathbf{y}}_{i}{\mathbf{s}}_{i}^{\top}+\rho_{i}{\mathbf{s}}_{i}{\mathbf{s}}_{i}^{\top}||
Hi1+ρi2(𝐲iHi1𝐲i)𝐬i𝐬i+ρi𝐬i𝐲iHi1+ρiHi1𝐲i𝐬i+ρi𝐬i𝐬i\displaystyle\leq||H_{i-1}||+||\rho_{i}^{2}({\mathbf{y}}_{i}^{\top}H_{i-1}{\mathbf{y}}_{i}){\mathbf{s}}_{i}{\mathbf{s}}_{i}^{\top}||+||\rho_{i}{\mathbf{s}}_{i}{\mathbf{y}}_{i}^{\top}H_{i-1}||+||\rho_{i}H_{i-1}{\mathbf{y}}_{i}{\mathbf{s}}_{i}^{\top}||+||\rho_{i}{\mathbf{s}}_{i}{\mathbf{s}}_{i}^{\top}||
Hi1+Hi1ρi2(𝐲i𝐲i)𝐬i𝐬i+2ρi𝐬i𝐲iHi1+1μ2\displaystyle\leq||H_{i-1}||+||H_{i-1}||\cdot||\rho_{i}^{2}({\mathbf{y}}_{i}^{\top}{\mathbf{y}}_{i}){\mathbf{s}}_{i}{\mathbf{s}}_{i}^{\top}||+2\rho_{i}||{\mathbf{s}}_{i}||\cdot||{\mathbf{y}}_{i}^{\top}H_{i-1}||+\frac{1}{\mu_{2}}
Hi1+Hi11μ1κ¯g1μ2+2ρi𝐬i𝐲iHi1+1μ2\displaystyle\leq||H_{i-1}||+||H_{i-1}||\cdot\frac{1}{\mu_{1}\underline{\kappa}_{g}}\frac{1}{\mu_{2}}+2\rho_{i}||{\mathbf{s}}_{i}||\cdot||{\mathbf{y}}_{i}^{\top}||\cdot||H_{i-1}||+\frac{1}{\mu_{2}}
Hi1(1+1μ1κ¯g1μ2+21μ1μ2κ¯g)+1μ2\displaystyle\leq||H_{i-1}||\left(1+\frac{1}{\mu_{1}\underline{\kappa}_{g}}\frac{1}{\mu_{2}}+2\frac{1}{\sqrt{\mu_{1}\mu_{2}\underline{\kappa}_{g}}}\right)+\frac{1}{\mu_{2}}
=μ^Hi1+1μ2,whereμ^=(1+1μ1μ2κ¯g)2.\displaystyle=\hat{\mu}||H_{i-1}||+\frac{1}{\mu_{2}},\quad{\rm where}\quad\hat{\mu}=\left(1+\frac{1}{\sqrt{\mu_{1}\mu_{2}\underline{\kappa}_{g}}}\right)^{2}.

From the fact that H0=IH_{0}=I, and induction, we have that Hμ^p+μ^p1μ^11μ2κ¯g||H||\leq\hat{\mu}^{p}+\frac{\hat{\mu}^{p}-1}{\hat{\mu}-1}\frac{1}{\mu_{2}}\equiv\bar{\kappa}_{g}, i.e., Hκ¯gIH\preceq\bar{\kappa}_{g}I. ∎
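As an informal sanity check on the recursion (12), the following sketch builds H from a few synthetic curvature pairs with s_i^T y_i > 0 and verifies that the result is symmetric positive definite. The pairs are random and are not produced by the DD procedure, so this is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, p = 10, 5
H = np.eye(d)                               # H_0 = I
for _ in range(p):
    s = rng.standard_normal(d)
    y = s + 0.1 * rng.standard_normal(d)    # keep s^T y > 0 with high probability
    if s @ y <= 0:
        continue                            # skip pairs violating the curvature condition
    rho = 1.0 / (s @ y)
    V = np.eye(d) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)  # update (12)
eigvals = np.linalg.eigvalsh((H + H.T) / 2)
print(eigvals.min() > 0)                    # H remains positive definite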

Proof of Lemma 3:

Proof.

Lemma 3 follows directly from Lemmas 1 and 2 and the fact that Hk=diag{Ha1(k1)Hg1(k1),,HaL(k1)HgL(k1)}H_{k}=\text{diag}\{H_{a}^{1}(k-1)\otimes H_{g}^{1}(k-1),...,H_{a}^{L}(k-1)\otimes H_{g}^{L}(k-1)\}, since the eigenvalues of a Kronecker product of symmetric positive definite matrices are the products of the eigenvalues of its factors, and the eigenvalues of a block-diagonal matrix are those of its blocks. ∎

Proof of Theorem 2:

Proof.

To show that Algorithm 5 lies in the framework of Algorithm 6, it suffices to show that HkH_{k} generated by Algorithm 5 is positive definite, which is true since Hk=diag{Ha1(k1)Hg1(k1),,HaL(k1)HgL(k1)}H_{k}=\text{diag}\{H_{a}^{1}(k-1)\otimes H_{g}^{1}(k-1),...,H_{a}^{L}(k-1)\otimes H_{g}^{L}(k-1)\} and Hal(k)H_{a}^{l}(k) and Hgl(k)H_{g}^{l}(k) are positive definite for all kk and ll. Then, by Lemma 3 and the fact that HkH_{k} depends only on Hal(k1)H_{a}^{l}(k-1) and Hgl(k1)H_{g}^{l}(k-1), which do not depend on the random sampling in the kkth iteration, AS.4 holds. Hence, Theorem 2.8 of [41] applies to Algorithm 5, proving Theorem 2. ∎

Appendix C Powell’s Damped BFGS Updating

For BFGS and L-BFGS, one needs 𝐲𝐬>0{\mathbf{y}}^{\top}{\mathbf{s}}>0. However, when used to update HglH_{g}^{l}, there is no guarantee that (𝐲gl)𝐬gl>0({\mathbf{y}}^{l}_{g})^{\top}{\mathbf{s}}^{l}_{g}>0 for any layer l=1,,Ll=1,\ldots,L. In deterministic optimization, positive definiteness of the QN Hessian approximation BB (or its inverse) is maintained by performing an inexact line search that ensures that 𝐬T𝐲>0{\mathbf{s}}^{T}{\mathbf{y}}>0, which is always possible as long as the function being minimized is bounded below. However, such a line search would be too expensive for DNNs. Thus, we propose the following heuristic based on Powell’s damped-BFGS approach [35].

Powell’s Damping on BB. Powell’s damping on BB, proposed in [35], replaces 𝐲{\mathbf{y}} in the BFGS update by 𝐲~=θ𝐲+(1θ)B𝐬\tilde{{\mathbf{y}}}=\theta{\mathbf{y}}+(1-\theta)B{\mathbf{s}}, where

θ={(1μ)𝐬B𝐬𝐬B𝐬𝐬𝐲,if 𝐬𝐲<μ𝐬B𝐬,1,otherwise.\displaystyle\theta=\begin{cases}\frac{(1-\mu){\mathbf{s}}^{\top}B{\mathbf{s}}}{{\mathbf{s}}^{\top}B{\mathbf{s}}-{\mathbf{s}}^{\top}{\mathbf{y}}},&\text{if }{\mathbf{s}}^{\top}{\mathbf{y}}<\mu{\mathbf{s}}^{\top}B{\mathbf{s}},\\ 1,&\text{otherwise.}\end{cases}

It is easy to verify that 𝐬𝐲~μ𝐬B𝐬{\mathbf{s}}^{\top}\tilde{{\mathbf{y}}}\geq\mu{\mathbf{s}}^{\top}B{\mathbf{s}}.

Powell’s Damping on HH. In Powell’s damping on HH (see e.g. [3]), 𝐬~=θ𝐬+(1θ)H𝐲\tilde{{\mathbf{s}}}=\theta{\mathbf{s}}+(1-\theta)H{\mathbf{y}} replaces 𝐬{\mathbf{s}}, where

θ={(1μ1)𝐲H𝐲𝐲H𝐲𝐬𝐲,if 𝐬𝐲<μ1𝐲H𝐲,1,otherwise.\displaystyle\theta=\begin{cases}\frac{(1-\mu_{1}){\mathbf{y}}^{\top}H{\mathbf{y}}}{{\mathbf{y}}^{\top}H{\mathbf{y}}-{\mathbf{s}}^{\top}{\mathbf{y}}},&\text{if }{\mathbf{s}}^{\top}{\mathbf{y}}<\mu_{1}{\mathbf{y}}^{\top}H{\mathbf{y}},\\ 1,&\text{otherwise.}\end{cases}

This is used in lines 2 and 3 of the DD (Algorithm 3). It is also easy to verify that 𝐬~𝐲μ1𝐲H𝐲\tilde{{\mathbf{s}}}^{\top}{\mathbf{y}}\geq\mu_{1}{\mathbf{y}}^{\top}H{\mathbf{y}}.

Powell’s Damping with B=IB=I. Powell’s damping on BB is not suitable for our algorithms because we do not keep track of BB. Moreover, it does not provide a simple bound on 𝐬~𝐬~𝐬~𝐲~\frac{\tilde{{\mathbf{s}}}^{\top}\tilde{{\mathbf{s}}}}{\tilde{{\mathbf{s}}}^{\top}\tilde{{\mathbf{y}}}} that is independent of B\|B\|. Therefore, we use Powell’s damping with B=IB=I, in lines 4 and 5 of the DD (Algorithm 3). It is easy to verify that it ensures that 𝐬~𝐲~μ2𝐬~𝐬~\tilde{{\mathbf{s}}}^{\top}\tilde{\mathbf{y}}\geq\mu_{2}\tilde{{\mathbf{s}}}^{\top}\tilde{{\mathbf{s}}}.

Powell’s damping with B=IB=I can be interpreted as adding a Levenberg-Marquardt (LM) damping term to BB. Note that an LM damping term μ2\mu_{2} would lead to Bμ2IB\succeq\mu_{2}I. Then, the secant condition 𝐲~=B𝐬~\tilde{{\mathbf{y}}}=B\tilde{{\mathbf{s}}} implies

𝐲~𝐬~=𝐬~B𝐬~μ2𝐬~𝐬~,\tilde{{\mathbf{y}}}^{\top}\tilde{{\mathbf{s}}}=\tilde{{\mathbf{s}}}^{\top}B\tilde{{\mathbf{s}}}\geq\mu_{2}\tilde{{\mathbf{s}}}^{\top}\tilde{{\mathbf{s}}},

which is the same inequality as we get using Powell’s damping with B=IB=I. Note that although the μ2\mu_{2} parameter in Powell’s damping with B=IB=I can be interpreted as an LM damping, we recommend setting the value of μ2\mu_{2} within (0,1](0,1] so that 𝐲~\tilde{{\mathbf{y}}} is a convex combination of 𝐲{\mathbf{y}} and 𝐬~\tilde{{\mathbf{s}}}. In all of our experimental tests, we found that the best value for the hyperparameter λ\lambda for both K-BFGS and K-BFGS(L) was less than or equal to 11, and hence that μ2=λG=λ\mu_{2}=\lambda_{G}=\sqrt{\lambda} was in the interval (0,1](0,1].

C.1 Double Damping (DD)

Our double damping (Algorithm 3) is a two-step damping procedure, where the first step (i.e. Powell’s damping on HH) can be viewed as an interpolation between the current curvature and the previous ones, and the second step (i.e. Powell’s damping with B=IB=I) can be viewed as an LM damping.
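A minimal sketch of this two-step procedure, assuming H is available as a dense matrix and using the two damping formulas given above, could look as follows; the function name and interface are ours and are not part of Algorithm 3.

def double_damping(s, y, H, mu1, mu2):
    """Step 1: Powell's damping on H (replace s); Step 2: Powell's damping with B = I (replace y)."""
    # Step 1: if s^T y < mu1 * y^T H y, interpolate s toward H y so that s^T y >= mu1 * y^T H y
    yHy = y @ H @ y
    if s @ y < mu1 * yHy:
        theta = (1 - mu1) * yHy / (yHy - s @ y)
        s = theta * s + (1 - theta) * (H @ y)
    # Step 2: if s^T y < mu2 * s^T s, interpolate y toward s so that s^T y >= mu2 * s^T s
    ss = s @ s
    if s @ y < mu2 * ss:
        theta = (1 - mu2) * ss / (ss - s @ y)
        y = theta * y + (1 - theta) * s
    return s, y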

Figure 2: Fraction of the iterations in each epoch in which the inequality 𝐲H𝐲𝐬𝐲2μ1\frac{{\mathbf{y}}^{\top}H{\mathbf{y}}}{{\mathbf{s}}^{\top}{\mathbf{y}}}\leq\frac{2}{\mu_{1}} holds (upper plots), and the average value of 𝐲TH𝐲𝐬𝐲\frac{{\mathbf{y}}^{T}H{\mathbf{y}}}{{\mathbf{s}}^{\top}{\mathbf{y}}} in each epoch (lower plots). The legend in each plot assigns a different color to each layer ll.

Recall that there is no guarantee that 𝐲H𝐲𝐬𝐲2μ1\frac{{\mathbf{y}}^{\top}H{\mathbf{y}}}{{\mathbf{s}}^{\top}{\mathbf{y}}}\leq\frac{2}{\mu_{1}} holds after DD. In Algorithm 5, the K-BFGS(L) variant used to prove convergence, we skip the pairs that do not satisfy this inequality when updating HglH_{g}^{l}; in our implementations of both K-BFGS and K-BFGS(L), however, we use all (𝐬,𝐲)({\mathbf{s}},{\mathbf{y}}) pairs to update HglH_{g}^{l}. Whether or not one skips makes only a slight difference in the performance of these algorithms because, as our empirical testing has shown, at least 90% of the iterations satisfy 𝐲H𝐲𝐬𝐲2μ1\frac{{\mathbf{y}}^{\top}H{\mathbf{y}}}{{\mathbf{s}}^{\top}{\mathbf{y}}}\leq\frac{2}{\mu_{1}} even if we do not skip. See Figure 2, which reports results on this for K-BFGS(L) when tested on the MNIST, FACES and CURVES datasets.

Appendix D Implementation Details and More Experiments

D.1 Description of Competing Algorithms

D.1.1 KFAC

We first describe KFAC in Algorithm 7. Note that GlG_{l} in KFAC refers to the GG matrices in [30], which is different from the GlG_{l} in K-BFGS.

Algorithm 7 KFAC
0:  Given θ0\theta_{0}, batch size mm, learning rate α\alpha, damping value λ\lambda, and inversion frequency TT
1:  for k=1,2,k=1,2,\ldots do
2:     Sample mini-batch of size mm: Mk={ξk,i,i=1,,m}M_{k}=\{\xi_{k,i},i=1,\dots,m\}
3:     Perform a forward-backward pass over the current mini-batch MkM_{k} (see Algorithm 1)
4:     for l=1,2,Ll=1,2,\ldots L do
5:        pl=Hgl𝐟^lHalp_{l}=H_{g}^{l}\widehat{\mathbf{\nabla f}}_{l}H_{a}^{l}
6:        Wl=WlαplW_{l}=W_{l}-\alpha\cdot p_{l}.
7:     Perform another pass over MkM_{k} with yy sampled from the predictive distribution
8:     Update Al=βAl+(1β)𝐚l1𝐚l1¯A_{l}=\beta\cdot A_{l}+(1-\beta)\cdot\overline{{\mathbf{a}}_{l-1}{\mathbf{a}}_{l-1}^{\top}}, Gl=βGl+(1β)𝐠l𝐠l¯G_{l}=\beta\cdot G_{l}+(1-\beta)\cdot\overline{{\mathbf{g}}_{l}{\mathbf{g}}_{l}^{\top}}
9:     if kTk\leq T or k0(modT)k\equiv 0\pmod{T} then
10:        Recompute Hal=(Al+λI)1H_{a}^{l}=(A_{l}+\sqrt{\lambda}I)^{-1}, Hgl=(Gl+λI)1H_{g}^{l}=(G_{l}+\sqrt{\lambda}I)^{-1}
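Schematically, the per-layer computation in Algorithm 7 (lines 5-6 and 9-10) can be sketched as follows, assuming the statistics A_l and G_l have already been accumulated as in line 8; the names and interface are illustrative only.

import numpy as np

def kfac_layer_step(W, grad_W, A, G, H_a, H_g, alpha, lam, k, T):
    """Schematic per-layer KFAC step."""
    if k <= T or k % T == 0:                       # line 9: recompute the inverses every T iterations
        H_a = np.linalg.inv(A + np.sqrt(lam) * np.eye(A.shape[0]))   # line 10
        H_g = np.linalg.inv(G + np.sqrt(lam) * np.eye(G.shape[0]))
    p = H_g @ grad_W @ H_a                          # line 5: preconditioned direction
    return W - alpha * p, H_a, H_g                  # line 6: parameter update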

D.1.2 Adam/RMSprop

We implement Adam and RMSprop exactly as in [24] and [21], respectively. Note that the only difference between them is that Adam applies bias correction to the 1st and 2nd moments of the gradient, while RMSprop does not.
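Based on this description, the difference can be summarized in a few lines (a schematic sketch with our own variable names; the parameter step would then be -alpha * m_hat / (sqrt(v_hat) + eps)):

def update_moments(m, v, g, beta1, beta2, t, bias_correct):
    """Shared moment updates; Adam additionally rescales by 1 - beta^t, RMSprop does not."""
    m = beta1 * m + (1 - beta1) * g          # 1st moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # 2nd moment estimate
    if bias_correct:                          # Adam
        m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    else:                                     # RMSprop (no bias correction)
        m_hat, v_hat = m, v
    return m, v, m_hat, v_hat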

D.1.3 Initialization of Algorithms

We now describe how each algorithm is initialized. For all algorithms, 𝐟^\widehat{\mathbf{\nabla f}} is always initialized as zero.

For second-order information, we use a "warm start" to estimate the curvature when applicable. In particular, we estimate the following curvature information using the entire training set before we start updating parameters. The information gathered is

  • AlA_{l} for K-BFGS and K-BFGS(L);

  • AlA_{l} and GlG_{l} for KFAC;

  • ff\nabla f\odot\nabla f for RMSprop;

  • Not applicable to Adam because of the bias correction.

Lastly, for K-BFGS and K-BFGS(L), HalH_{a}^{l} is always initially set to an identity matrix. HglH_{g}^{l} is also initially set to an identity matrix in K-BFGS; for K-BFGS(L), the L-BFGS computation of HglH_{g}^{l} applies the information from the pp stored (𝐬,𝐲)({\mathbf{s}},{\mathbf{y}}) pairs to an initial matrix that is set to the identity. Consequently, the above initialization/warm-start costs are roughly twice as large for KFAC as they are for K-BFGS and K-BFGS(L).
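As a concrete illustration of this warm start, the factors A_l might be accumulated over the training set as in the following sketch, where forward_activations is a hypothetical helper returning the activations a_{l-1} of every layer on a mini-batch.

import numpy as np

def warm_start_A(dims, training_batches, forward_activations):
    """Accumulate A_l = E[a_{l-1} a_{l-1}^T] over the whole training set before training starts."""
    A = [np.zeros((d, d)) for d in dims]       # one factor per layer, d = dimension of a_{l-1}
    n = 0
    for batch in training_batches:
        acts = forward_activations(batch)      # list of (m, d) activation matrices, one per layer
        for l, a in enumerate(acts):
            A[l] += a.T @ a                    # accumulate sums of outer products
        n += len(batch)
    return [A_l / n for A_l in A]              # average over all training examples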

D.2 Autoencoder Problems

Table 3 lists information about the three datasets, namely, MNIST (downloadable at http://yann.lecun.com/exdb/mnist/), FACES (downloadable at www.cs.toronto.edu/~jmartens/newfaces_rot_single.mat), and CURVES (downloadable at www.cs.toronto.edu/~jmartens/digs3pts_1.mat). Table 4 specifies the architectures of the 3 problems, where binary entropy denotes the loss (aL,y)=n[ynlogaL,n+(1yn)log(1aL,n)]\mathcal{L}\left(a_{L},y\right)=-\sum_{n}[y_{n}\log a_{L,n}+(1-y_{n})\log(1-a_{L,n})] and MSE denotes the loss (aL,y)=12n(aL,nyn)2\mathcal{L}\left(a_{L},y\right)=\frac{1}{2}\sum_{n}(a_{L,n}-y_{n})^{2}. In addition to the loss function specified in Table 4, we add a regularization term η2θ2\frac{\eta}{2}||\theta||^{2} to the loss, where η=105\eta=10^{-5}.

Table 3: Info for 3 datasets
Dataset # data points # training examples # testing examples
MNIST 70,000 60,000 10,000
FACES 165,600 103,500 62,100
CURVES 30,000 20,000 10,000
Table 4: Architecture of 3 auto-encoder problems
Dataset Layer width & activation Loss function
MNIST [784, 1000, 500, 250, 30, 250, 500, 1000, 784] binary entropy
[ReLU, ReLU, ReLU, linear, ReLU, ReLU, ReLU, sigmoid]
FACES [625, 2000, 1000, 500, 30, 500, 1000, 2000, 625] MSE
[ReLU, ReLU, ReLU, linear, ReLU, ReLU, ReLU, linear]
CURVES [784, 400, 200, 100, 50, 25, 6, 25, 50, 100, 200, 400, 784] binary entropy
[ReLU, ReLU, ReLU, ReLU, ReLU, linear,
ReLU, ReLU, ReLU, ReLU, ReLU, sigmoid]
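For reference, the two reconstruction losses in Table 4, together with the regularization term, can be written as in the sketch below; the eps clipping that guards the logarithm is our addition.

import numpy as np

def binary_entropy_loss(a_L, y, eps=1e-12):
    # L(a_L, y) = -sum_n [ y_n log a_{L,n} + (1 - y_n) log(1 - a_{L,n}) ]
    a = np.clip(a_L, eps, 1.0 - eps)
    return -np.sum(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def mse_loss(a_L, y):
    # L(a_L, y) = (1/2) sum_n (a_{L,n} - y_n)^2
    return 0.5 * np.sum((a_L - y) ** 2)

def regularized_loss(base_loss, theta, eta=1e-5):
    # total objective: base_loss + (eta / 2) ||theta||^2
    return base_loss + 0.5 * eta * np.sum(theta ** 2)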

D.3 Specification of Hyper-parameters

In our experiments, we focus our tuning effort on the learning rate and the damping parameter. The ranges of values tuned over are listed below:

  • learning rate αk=α{\alpha_{k}=\alpha\in\{ 1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3, 10 }\}.

  • damping:

    • λ\lambda for K-BFGS, K-BFGS(L) and KFAC: λ{\lambda\in\{ 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1, 3 }\}.

    • ϵ\epsilon for RMSprop and Adam: ϵ{\epsilon\in\{ 1e-10, 1e-8, 1e-6, 1e-4, 1e-3, 1e-2, 1e-1 }\}.

    • Not applicable for SGD with momentum.

Table 5: Best (learning rate, damping) for Figure 1
K-BFGS K-BFGS(L) KFAC Adam RMSprop SGD-m
MNIST (0.3, 0.3) (0.3, 0.3) (1, 3) (1e-4, 1e-4) (1e-4, 1e-4) (0.03, -)
FACES (0.1, 0.1) (0.1, 0.1) (0.1, 0.1) (1e-4, 1e-4) (1e-4, 1e-4) (0.01, -)
CURVES (0.1, 0.01) (0.3, 0.3) (0.3, 0.3) (1e-3, 1e-3) (1e-3, 1e-3) (0.1, -)

The best hyper-parameters were those that produced the lowest value of the deterministic loss function, evaluated at the end of every epoch, before the algorithm was terminated. These values were used in Figure 1 and are listed in Table 5. Besides the tuned hyper-parameters, we also list the remaining hyper-parameters, which were fixed at the following values:

  • Size of minibatch mm = 1000, which is also suggested in [5].

  • Decay parameter:

    • K-BFGS, K-BFGS(L): β=0.9\beta=0.9;

    • KFAC: β=0.9\beta=0.9;

    • RMSprop, Adam: Following the notation in [24], we use β1=β2=0.9\beta_{1}=\beta_{2}=0.9. (The default value of β2\beta_{2} recommended in [24] is 0.999; we also tested β2=0.999\beta_{2}=0.999 and obtained results similar to those presented in Figure 1, i.e., with β2=0.9\beta_{2}=0.9. For the sake of a fair comparison, we chose to report the results with β2=0.9\beta_{2}=0.9.)

    • SGD with momentum: β=0.9\beta=0.9.

  • Other:

    • μ1=0.2\mu_{1}=0.2 in double damping (DD):

      We recommend leaving this value at its default, because μ1\mu_{1} represents the "ratio" between the current and past curvature, which is scale invariant;

    • Number of (𝐬,𝐲)({\mathbf{s}},{\mathbf{y}}) pairs stored for K-BFGS(L) p=100p=100:

      It might be more efficient to use a smaller pp for the narrow layers, but for the sake of simplicity and consistency we did not investigate this;

    • Inversion frequency T=20T=20 in KFAC.

D.4 Sensitivity to Hyper-parameters

Figure 3: Landscape of the loss w.r.t. hyper-parameters (i.e. learning rate and damping). The left, middle, and right columns depict results for MNIST, FACES, and CURVES, terminated after 500, 2000, and 500 seconds (CPU time), respectively; the upper and lower rows correspond to K-BFGS and K-BFGS(L), respectively.

Figure 3 shows the sensitivity of K-BFGS and K-BFGS(L) to hyper-parameter values (i.e. learning rate and damping). The xx-axis corresponds to the learning rate α\alpha, while the yy-axis corresponds to the damping value λ\lambda. Color indicates the training loss attained after the amount of CPU time stated in the caption of Figure 3. We can see that both K-BFGS and K-BFGS(L) are robust within a fairly wide range of hyper-parameters.

To generate these plots, we first obtained the training loss for α{\alpha\in\{1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1}\} and λ{\lambda\in\{1e-2, 3e-2, 1e-1, 3e-1, 1}\}, and then drew contour lines of the loss over these grids.

D.5 Experimental Results Using Mini-batches of Size 100

We repeated our experiments using mini-batches of size 100 for all algorithms (see Figures 4, 5, and 6). In each figure, the upper (lower) row depicts the training loss (testing mean-squared error), whereas the left (right) column depicts progress versus epoch (CPU time).

The best hyper-parameters were those that produced the lowest value of the deterministic loss function, evaluated at the end of every epoch, before the algorithm was terminated. These values were used in Figures 4, 5, and 6 and are listed in Table 6.

Our proposed methods continue to demonstrate advantageous performance, both in training and testing. It is interesting to note that, whereas for a minibatch size of 1000, KFAC slightly outperformed K-BFGS(L), for a minibatch size of 100, K-BFGS(L) clearly outperformed KFAC in training on CURVES.

Table 6: Best (learning rate, damping) for Figures 4, 5, 6
K-BFGS K-BFGS(L) KFAC Adam RMSprop SGD-m
MNIST (0.1, 0.3) (0.1, 0.3) (0.1, 0.3) (1e-4, 1e-4) (1e-4, 1e-4) (0.03, -)
FACES (0.03, 0.03) (0.03, 0.3) (0.03, 0.3) (3e-5, 1e-4) (3e-5, 1e-4) (0.01, -)
CURVES (0.3, 1) (0.3, 0.3) (0.03, 0.1) (3e-4, 1e-4) (3e-3, 1e-4) (0.03, -)
Figure 4: Comparison between algorithms on MNIST with batch size 100
Figure 5: Comparison between algorithms on FACES with batch size 100
Figure 6: Comparison between algorithms on CURVES with batch size 100

D.6 Doubling the Mini-batch for the Gradient at Almost No Cost

Figure 7: Comparison between K-BFGS and its "double-grad" variants

Compared with the other methods mentioned in this paper, our K-BFGS and K-BFGS(L) methods have the extra advantage of being able to double the size of the mini-batch used to compute the stochastic gradient at almost no extra cost, which might be of particular interest in a highly stochastic setting. To accomplish this, we can make use of the stochastic gradient 𝐟¯+\overline{\mathbf{\nabla f}}^{+} computed in the second pass of the previous iteration, which is needed anyway to compute the (𝐬¯,𝐲¯)(\overline{{\mathbf{s}}},\overline{{\mathbf{y}}}) pair for the BFGS or L-BFGS updates, and average it with the stochastic gradient 𝐟¯\overline{\mathbf{\nabla f}} of the current iteration; both gradients are evaluated at the current parameters. For example, if the mini-batch size is m=1000m=1000, the above "double-grad-minibatch" method computes a stochastic gradient from 2000 data points at each iteration, except at the very first iteration.
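A sketch of this averaging is given below (illustrative names; grad_prev_second_pass denotes the stochastic gradient obtained in the second pass of the previous iteration, which is computed anyway for the (s, y) pair):

def double_grad(grad_current, grad_prev_second_pass):
    """Average the current mini-batch gradient with the gradient from the second pass of the
    previous iteration; both are evaluated at the current parameters, so the average acts like
    a stochastic gradient computed from a mini-batch of twice the size."""
    if grad_prev_second_pass is None:       # very first iteration: nothing to reuse
        return grad_current
    return 0.5 * (grad_current + grad_prev_second_pass)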

The results of some preliminary experiments are depicted in Figure 7, in which an earlier version of the K-BFGS algorithm (Algorithm 4), which uses a slightly different variant of Hessian-action to update HalH_{a}^{l} and a mini-batch size of m=1000m=1000, is compared with its "double-grad-minibatch" variants for m=500m=500 and 10001000. Even though "double-grad" does not help much in these experiments, our K-BFGS algorithm performs stably across these different variants. These results indicate that there is potential for further improvement; e.g., a finer grid search might identify hyper-parameter values that yield better-performing algorithms.