

Complexity Guarantees for Nonconvex Newton-MR Under Inexact Hessian Information

Alexander Lim \orgdivSchool of Mathematics and Physics, \orgnameUniversity of Queensland, \orgaddress\streetSt Lucia, \postcode4072, \stateQueensland, \countryAustralia    Fred Roosta \orgdivSchool of Mathematics and Physics, \orgnameUniversity of Queensland, \orgaddress\streetSt Lucia, \postcode4072, \stateQueensland, \countryAustralia
Abstract

We consider an extension of the Newton-MR algorithm for nonconvex unconstrained optimization to the settings where Hessian information is approximated. Under a particular noise model on the Hessian matrix, we investigate the iteration and operation complexities of this variant to achieve appropriate sub-optimality criteria in several nonconvex settings. We do this by first considering functions that satisfy the (generalized) Polyak-Łojasiewicz condition, a special sub-class of nonconvex functions. We show that, under certain conditions, our algorithm achieves global linear convergence rate. We then consider more general nonconvex settings where the rate to obtain first order sub-optimality is shown to be sub-linear. In all these settings, we show that our algorithm converges regardless of the degree of approximation of the Hessian as well as the accuracy of the solution to the sub-problem. Finally, we compare the performance of our algorithm with several alternatives on a few machine learning problems.

keywords:
Nonconvex; Newton-MR; Minimum Residual; Hessian Approximation.

1 Introduction

Consider the following unconstrained optimization problem:

min𝐱df(𝐱),\displaystyle\min_{\mathbf{x}\in\mathbb{R}^{d}}f(\mathbf{x}), (1.1)

where f:df:\mathbb{R}^{d}\to\mathbb{R} is twice continuously differentiable and nonconvex. Modern machine learning, in particular deep learning, has motivated the development of a plethora of optimization methods for solving 1.1. The majority of these methods belong to the class of first order algorithms, which use only gradient information [6, 73, 69, 2, 38, 50, 46, 82, 32, 44, 54]. While first order methods are typically memory efficient, have low-cost per iteration, and are simple to implement, they suffer from serious drawbacks. These methods are notoriously difficult to fine-tune and tend to converge slowly, especially when dealing with ill-conditioned problems. By incorporating the curvature information in the form of (approximate) Hessian matrix, second order algorithms [26, 61] tend to alleviate these drawbacks. Beyond their theoretical appeal such as affine invariance and fast local convergence, these methods are also empirically shown to be less sensitive to hyper-parameters tuning and problem ill-conditioning [12, 76].

Arguably, the primary computational bottleneck in second order methods for solving large-scale problems is due to operations involving the Hessian matrix. In such large-scale settings where storing the Hessian matrix explicitly is prohibitive, operations such as matrix-vector products would impose a cost equivalent to multiple function evaluations; see [51, 65]. This is often considered as the major disadvantage of these methods. Consequently, second order algorithms that utilize appropriate approximations of the Hessian matrix have been introduced, e.g., methods based on constructing probabilistic models [64, 42, 8, 16, 18, 41, 74, 48], methods that employ randomized sub-sampling in finite sum settings [66, 77, 75, 78, 79, 19, 13, 10, 9, 11, 7, 17, 12, 58, 27], similar techniques based on randomized sketching of the Hessian matrix [12, 58, 63, 35, 59], and the related randomized subspace methods that are at times categorized under the sketch-and-project framework [37, 22, 40, 58, 81, 31]. A bird’s eye view of these methods reveals three main categories of globalization strategies based on trust region [28, 75], cubic regularization [23, 24, 60], and line-search [61]. Almost all second order methods with line-search algorithmic framework employ the conjugate gradient (CG) algorithm as their respective sub-problem solver, e.g., variants of Newton-CG [66, 17, 79, 7, 10]. An exception lies in [51] where the Newton-MR algorithm, initially proposed in [65], is extended to incorporate Hessian approximation. Unlike Newton-CG, the framework of Newton-MR relies on the minimum residual (MINRES) method [53, 62] as the sub-problem solver. However, the Newton-MR variants discussed in [51] are limited in their scope to invex problems [57].

In this paper, we extend the analysis in [51] to accommodate inexact Hessian information in a variety of more general nonconvex settings. In addition to iteration complexity, we provide operation complexity of our method. This is so since in large-scale problems, where solving the sub-problem constitutes the main computational costs, analyzing operation complexity – which incorporates the cost of solving these sub-problems – better captures the essence of the algorithmic cost than simple iteration complexity [26]. Although extensively studied in convex settings, efforts to establish operation complexity results for nonconvex settings have emerged only recently. For example, to our knowledge, [20] was among the first to analyze the complexity of solving sub-problems in trust region and cubic regularization methods in nonconvex settings, while [68, 67] focused on line-search-based methods. Since then, numerous follow-up works have aimed to quantify the overall operational complexity across a wide range of nonconvex optimization algorithms for solving 1.1 by assessing the computational cost of solving their respective sub-problems, e.g., [79, 52, 4, 48, 29].

The overarching objective of our work here is to derive both iteration and operation complexities of our proposed nonconvex Newton-MR variant in the presence of inexact Hessian information with minimal assumptions and limited algorithmic modifications compared with similar Newton-type methods in the literature. In particular, (i) we only consider minimal smoothness assumptions on the gradient and do not extend smoothness assumptions to the Hessian itself; and (ii) we do not employ the Hessian regularization/damping techniques used in [79, 67, 48, 29], as distorting the curvature information further could potentially lead to inferior performance (see experiments in Section A.3).

The rest of this paper is organized as follows. We end this section by presenting the essential notation and definitions used throughout. In Section 2, we lay out the details of our algorithm. This is then followed by the theoretical analyses in Section 3. Specifically, we first establish the iteration complexities of our algorithm in Section 3.1. Subsequently, we derive the underlying operation complexities in Section 3.2. In Section 4, we empirically evaluate the performance of our algorithm on several machine learning problems. Conclusions and further thoughts are gathered in Section 5. Some proof details as well as further numerical results are deferred to Appendix A.

Notation

Throughout the paper, vectors and matrices are, respectively, denoted by bold lower and upper case letters, e.g. 𝐚\mathbf{a} and 𝐀\mathbf{A}. Their respective norms, e.g., 𝐚\|\mathbf{a}\| and 𝐀\|\mathbf{A}\|, are Euclidean and matrix spectral norms. Regular letters represent scalar, e.g. dd, LL, α\alpha, etc. We denote the exact and the inexact Hessian as 𝐇\mathbf{H} and 𝐇¯\mathbf{\bar{H}}, respectively. We use subscripts to denote the iteration counter of our algorithm, e.g., 𝐱k\mathbf{x}_{k}. The objective function and its gradient at iteration kk are denoted by fkf(𝐱k)f_{k}\triangleq f(\mathbf{x}_{k}) and 𝐠k𝐠(𝐱k)\mathbf{g}_{k}\triangleq\mathbf{g}(\mathbf{x}_{k}), respectively. We use superscripts to denote the iterations of MINRES. Specifically, 𝐬k(t)\mathbf{s}_{k}^{(t)} refers to the ttht\textsuperscript{th} iterate of MINRES using 𝐇¯k\mathbf{\bar{H}}_{k} and 𝐠k\mathbf{g}_{k} with the corresponding residual 𝐫k(t)=𝐇¯k𝐬k(t)𝐠k\mathbf{r}_{k}^{(t)}=-\mathbf{\bar{H}}_{k}\mathbf{s}_{k}^{(t)}-\mathbf{g}_{k}. The Krylov subspace of degree tt, generated by 𝐇¯k\mathbf{\bar{H}}_{k} and 𝐠k\mathbf{g}_{k}, is denoted by 𝒦t(𝐇¯k,𝐠k)\mathcal{K}_{t}(\mathbf{\bar{H}}_{k},\mathbf{g}_{k}). Natural logarithm is denoted by log()\log(\cdot). The range and the nullspace of a matrix, say 𝐇¯\mathbf{\bar{H}}, are denoted by Range(𝐇¯)\text{Range}(\mathbf{\bar{H}}) and Null(𝐇¯)\text{Null}(\mathbf{\bar{H}}), respectively. We also denote fminf(𝐱)f^{\star}\triangleq\min f(\mathbf{x}), which is assumed finite.

Definitions

We analyze the complexity of our proposed algorithm in nonconvex settings. An interesting subclass of nonconvex functions that oftentimes allows for favourable convergence rates is the class of functions satisfying the (generalized) Polyak-Łojasiewicz (PL) condition.

Definition 1 (θ\theta-Polyak-Łojasiewicz Condition).

For any θ[1,2]\theta\in[1,2], we say a function satisfies the θ\theta-PL condition, if there exists a constant μ>0\mu>0 such that

𝐠(𝐱)22μ(f(𝐱)f)2/θ,𝐱d.\displaystyle\|\mathbf{g}(\mathbf{x})\|^{2}\geq 2\mu(f(\mathbf{x})-f^{\star})^{{2}/{\theta}},\quad\forall\mathbf{x}\in\mathbb{R}^{d}. (1.2)

The class of θ\theta-PL functions has been extensively considered in the literature for the convergence analysis of various optimization algorithms, e.g., [43, 65, 56, 1, 80, 3, 83, 5, 34]. It has been shown that 1.2 is a special case of global Kurdyka-Łojasiewicz inequality [34]. Note that as θ\theta gets closer to 11, the function is allowed to be more flat near the set of optimal points, e.g., consider f(x)=x2αf(x)=x^{2\alpha} for α1\alpha\geq 1, which satisfies 1.2 with θ=2α/(2α1)\theta=2\alpha/(2\alpha-1) (and μ=0.5(2α)(2α1)/α\mu=0.5(2\alpha)^{(2\alpha-1)/\alpha}).
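For instance, the claim for f(x)=x^{2\alpha} above can be verified directly; the short calculation below is ours and merely checks 1.2. Since f^{\star}=0 and \mathbf{g}(x)=2\alpha x^{2\alpha-1}, we have

\displaystyle\|\mathbf{g}(x)\|^{2}=4\alpha^{2}x^{4\alpha-2},\qquad 2\mu\left(f(x)-f^{\star}\right)^{2/\theta}=(2\alpha)^{2-1/\alpha}x^{4\alpha-2},

and, since \alpha\geq 1 implies (2\alpha)^{2-1/\alpha}\leq(2\alpha)^{2}=4\alpha^{2}, the inequality 1.2 indeed holds with \theta=2\alpha/(2\alpha-1) and \mu=0.5(2\alpha)^{(2\alpha-1)/\alpha}.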

In optimization, one’s objective is often to find solutions that satisfy a certain level of approximate optimality. When dealing with θ\theta-PL functions, it is appropriate to consider approximate global optimality.

Definition 2 (εf\varepsilon_{f}-Global Optimality).

Given 0<εf<10<\varepsilon_{f}<1, a point 𝐱d\mathbf{x}\in\mathbb{R}^{d} is an εf\varepsilon_{f}- global optimal solution to the problem 1.1 if

f(𝐱)fεff(\mathbf{x})-f^{*}\leq\varepsilon_{f} (1.3)

For more general nonconvex functions, a more suitable measure of sub-optimality is that of the approximate first order criticality.

Definition 3 (ε𝐠\varepsilon_{\mathbf{g}}-First Order Optimality).

Given 0<ε𝐠<10<\varepsilon_{\mathbf{g}}<1, a point 𝐱d\mathbf{x}\in\mathbb{R}^{d} is an ε𝐠\varepsilon_{\mathbf{g}}- first order optimal solution to the problem 1.1 if

𝐠(𝐱)ε𝐠\|\mathbf{g}(\mathbf{x})\|\leq\varepsilon_{\mathbf{g}} (1.4)

The geometry of the optimization landscape is essentially encoded in the spectrum of the Hessian matrix. In that light, one can obtain descent by leveraging directions that, to a great degree, align with the eigenspace associated with the small or negative eigenvalues of the Hessian. These directions are referred to as σ\sigma-Limited Curvature (LC) directions.

Definition 4 (σ\sigma-Limited Curvature Direction).

For any nonzero 𝐱d\mathbf{x}\in\mathbb{R}^{d} and σ>0\sigma>0, we say 𝐱\mathbf{x} is a σ\sigma-limited curvature (σ\sigma-LC) direction for a matrix 𝐀\mathbf{A}, if 𝐱,𝐀𝐱dσ𝐱2.\langle\mathbf{x},\mathbf{A}\mathbf{x}\rangle\leq d\sigma\|\mathbf{x}\|^{2}.

We perform our convergence analysis under a simple noise model for the Hessian matrix.

Definition 5 (Inexact Hessian).

For some ε>0\varepsilon>0, the matrix 𝐇¯(𝐱)\mathbf{\bar{H}}(\mathbf{x}) is an estimate of the underlying Hessian, 𝐇(𝐱)\mathbf{H}(\mathbf{x}), in that

𝐇(𝐱)𝐇¯(𝐱)ε.\displaystyle\left\|\mathbf{H}(\mathbf{x})-\mathbf{\bar{H}}(\mathbf{x})\right\|\leq\varepsilon. (1.5)

This type of inexactness model includes popular frameworks such as the sub-sampled Newton [66, 75] and Newton Sketch [63, 35] and is considered frequently in the literature, e.g., [79, 51, 48, 17]. For example, consider large-scale finite sum minimization problems [71], where f(𝐱)=i=1nfi(𝐱)/nf(\mathbf{x})=\sum_{i=1}^{n}f_{i}(\mathbf{x})/n and n1n\gg 1. Note that 𝐇(𝐱)=i=1n𝐇i(𝐱)/n\mathbf{H}(\mathbf{x})=\sum_{i=1}^{n}\mathbf{H}_{i}(\mathbf{x})/n. Under a simple Lipschitz gradient assumption, it is shown in [75, Lemma 16] that, for any ε>0\varepsilon>0 and 0<δ<10<\delta<1, if 𝐇¯(𝐱)=i𝒮𝐇i(𝐱)/|𝒮|\mathbf{\bar{H}}(\mathbf{x})=\sum_{i\in\mathcal{S}}\mathbf{H}_{i}(\mathbf{x})/|\mathcal{S}| is constructed using |𝒮|𝒪(ε2log(2d/δ))|\mathcal{S}|\in\mathcal{O}(\varepsilon^{-2}\log(2d/\delta)) samples drawn uniformly at random from {1,2,,n}\{1,2,\ldots,n\}, we obtain 1.5 with probability at least 1δ1-\delta.
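As an illustration of this sub-sampling construction, the following Python snippet is a minimal sketch of ours (not the implementation accompanying this paper) of a matrix-free sub-sampled Hessian-vector product for such finite-sum problems; the function hess_vec_i, the sampling with replacement, and the absolute constant in the sample size are assumptions made purely for illustration.

```python
import numpy as np

def subsampled_hessian_matvec(hess_vec_i, n, d, eps, delta, rng=None):
    """Return a matrix-free operator v -> H_bar(x) v, where H_bar(x) averages
    the component Hessians over a uniformly sampled index set S.

    hess_vec_i(i, v) is assumed to return H_i(x) v at the current iterate x.
    The sample size mirrors |S| = O(eps^{-2} log(2 d / delta)) from
    [75, Lemma 16], with the unspecified absolute constant set to 1 here.
    """
    rng = np.random.default_rng() if rng is None else rng
    sample_size = min(n, int(np.ceil(eps ** (-2) * np.log(2 * d / delta))))
    S = rng.choice(n, size=sample_size, replace=True)  # uniform sampling

    def matvec(v):
        # H_bar(x) v = (1/|S|) sum_{i in S} H_i(x) v, without ever forming H_bar
        return sum(hess_vec_i(i, v) for i in S) / sample_size

    return matvec
```

The returned operator can then be used wherever a product with 𝐇¯\mathbf{\bar{H}} is required, e.g., inside Algorithm 1.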

2 Newton-MR with Inexact Hessian

We now present our variant of Newton-MR that incorporates inexact Hessian information. The core of our algorithm lies in MINRES, depicted in Algorithm 1, as the sub-problem solver. Algorithm 1 is similar to Algorithm 2.1 in [53]. The main difference is that Algorithm 1 is equipped with early termination criteria, outlined in Step 7 and Step 11, which respectively correspond to sub-problem inexactness (Condition 1) and detection of limited curvature (Condition 2).

Algorithm 1 MINRES(𝐇¯,𝐠,η,σ)(\mathbf{\bar{H}},\mathbf{g},\eta,\sigma)
1:Inexact Hessian 𝐇¯\mathbf{\bar{H}}, Gradient 𝐠\mathbf{g}, Inexactness tolerance η>0\eta>0, limited curvature tolerance σ>0\sigma>0
2:ϕ0=β~1=𝐠\phi_{0}=\tilde{\beta}_{1}=\|\mathbf{g}\|, 𝐫0=𝐠\mathbf{r}_{0}=-\mathbf{g}, 𝐯1=𝐫0/ϕ0\mathbf{v}_{1}=\mathbf{r}_{0}/\phi_{0}, 𝐯0=𝐬0=𝐰0=𝐰1=𝟎\mathbf{v}_{0}=\mathbf{s}_{0}=\mathbf{w}_{0}=\mathbf{w}_{-1}=\mathbf{0},
3:c0=1c_{0}=-1, δ1(1)=s0=τ0=0\delta_{1}^{(1)}=s_{0}=\tau_{0}=0, Type = SOL,
4:t=1t=1
5:while True do
6:     𝐪t=𝐇¯𝐯t\mathbf{q}_{t}=\mathbf{\bar{H}}\mathbf{v}_{t}, α~t=𝐯t𝐪t\tilde{\alpha}_{t}=\mathbf{v}_{t}^{\top}\mathbf{q}_{t}, 𝐪t=𝐪tβ~t𝐯t1α~t𝐯t\mathbf{q}_{t}=\mathbf{q}_{t}-\tilde{\beta}_{t}\mathbf{v}_{t-1}-\tilde{\alpha}_{t}\mathbf{v}_{t}, β~t+1=𝐪t\tilde{\beta}_{t+1}=\|\mathbf{q}_{t}\|
7:     [δt(2)ϵt+1γt(1)δt+1(1)]=[ct1st1st1ct1][δt(1)0α~tβ~t+1]\begin{bmatrix}\delta_{t}^{(2)}&\epsilon_{t+1}\\ \gamma_{t}^{(1)}&\delta_{t+1}^{(1)}\end{bmatrix}=\begin{bmatrix}c_{t-1}&s_{t-1}\\ s_{t-1}&-c_{t-1}\end{bmatrix}\begin{bmatrix}\delta_{t}^{(1)}&0\\ \tilde{\alpha}_{t}&\tilde{\beta}_{t+1}\end{bmatrix}
8:     if ϕt1(γt(1))2+(δt+1(1))2ηϕ02ϕt12\phi_{t-1}\sqrt{\left(\gamma_{t}^{(1)}\right)^{2}+\left(\delta_{t+1}^{(1)}\right)^{2}}\leq\eta\sqrt{\phi_{0}^{2}-\phi_{t-1}^{2}} then \triangleright Condition 1: 𝐇¯𝐫t1η𝐇¯𝐬t1\left\|\mathbf{\bar{H}}\mathbf{r}_{t-1}\right\|\leq\eta\left\|\mathbf{\bar{H}}\mathbf{s}_{t-1}\right\|
9:         Type = SOL \triangleright SOL direction
10:         return 𝐬t1\mathbf{s}_{t-1}, Type
11:     end if
12:     if ct1γt(1)σd-c_{t-1}\gamma_{t}^{(1)}\leq\sigma d then \triangleright Condition 2: 𝐫t1,𝐇¯𝐫t1σd𝐫t12\left\langle{\mathbf{r}_{t-1},\mathbf{\bar{H}}\mathbf{r}_{t-1}}\right\rangle\leq\sigma d\left\|\mathbf{r}_{t-1}\right\|^{2}
13:         Type = LC \triangleright σ\sigma-LC direction
14:         return 𝐫t1\mathbf{r}_{t-1}, Type
15:     end if
16:     γt(2)=(γt(1))2+β~t+12\gamma_{t}^{(2)}=\sqrt{\left(\gamma_{t}^{(1)}\right)^{2}+\tilde{\beta}_{t+1}^{2}}
17:     if γt(2)0\gamma_{t}^{(2)}\not=0 then
18:         ct=γt(1)/γt(2)c_{t}=\gamma_{t}^{(1)}/\gamma_{t}^{(2)}, st=β~t+1/γt(2)s_{t}=\tilde{\beta}_{t+1}/\gamma_{t}^{(2)}, τt=ctϕt1\tau_{t}=c_{t}\phi_{t-1}, ϕt=stϕt1\phi_{t}=s_{t}\phi_{t-1},
19:         𝐰t=(𝐯tδt(2)𝐰t1ϵt𝐰t2)/γt(2)\mathbf{w}_{t}=\left(\mathbf{v}_{t}-\delta_{t}^{(2)}\mathbf{w}_{t-1}-\epsilon_{t}\mathbf{w}_{t-2}\right)/\gamma_{t}^{(2)}, 𝐬t=𝐬t1+τt𝐰t\mathbf{s}_{t}=\mathbf{s}_{t-1}+\tau_{t}\mathbf{w}_{t}
20:         if β~t+10\tilde{\beta}_{t+1}\not=0 then
21:              𝐯t+1=𝐪t/β~t+1,𝐫t=st2𝐫t1ϕtct𝐯t+1\mathbf{v}_{t+1}=\mathbf{q}_{t}/\tilde{\beta}_{t+1},\mathbf{r}_{t}=s_{t}^{2}\mathbf{r}_{t-1}-\phi_{t}c_{t}\mathbf{v}_{t+1}
22:         else
23:              return 𝐬t\mathbf{s}_{t}, Type
24:         end if
25:     else
26:         ct=0c_{t}=0, st=1s_{t}=1, τt=0\tau_{t}=0, ϕt=ϕt1\phi_{t}=\phi_{t-1}, 𝐫t=𝐫t1\mathbf{r}_{t}=\mathbf{r}_{t-1}, 𝐬t=𝐬t1\mathbf{s}_{t}=\mathbf{s}_{t-1}
27:     end if
28:     t = t + 1
29:end while
Condition 1 (Sub-problem Inexactness Condition).

If 𝐇¯k𝐫k(t1)<η𝐇¯k𝐬k(t1)\|\mathbf{\bar{H}}_{k}\mathbf{r}_{k}^{(t-1)}\|<\eta\|\mathbf{\bar{H}}_{k}\mathbf{s}_{k}^{(t-1)}\| at iteration tt of Algorithm 1, then 𝐬k(t1)\mathbf{s}_{k}^{(t-1)} is declared as a solution direction satisfying the sub-problem inexactness condition where η>0\eta>0 is any given inexactness parameter. This condition is effortlessly verified in Step 7 of Algorithm 1.

Condition 2 (σ\sigma-Limited Curvature Condition).

At iteration tt of Algorithm 1, if 𝐫k(t1),𝐇¯k𝐫k(t1)σd𝐫k(t1)2\langle\mathbf{r}_{k}^{(t-1)},\mathbf{\bar{H}}_{k}\mathbf{r}_{k}^{(t-1)}\rangle\leq\sigma d\|\mathbf{r}_{k}^{(t-1)}\|^{2}, for some positive constant σ\sigma, then 𝐫k(t1)\mathbf{r}_{k}^{(t-1)} is declared as a σ\sigma-LC direction. This condition is effortlessly verified in Step 11 of Algorithm 1.

Condition 1 relates to the optimality of the sub-problem, i.e., it measures how close the approximate direction is to Newton's direction in terms of the residual of the normal equation. We emphasize that Condition 1 is guaranteed to be eventually satisfied during the iterations, regardless of the choice of η\eta. This is because, as a consequence of [53, Lemma 3.1], 𝐇¯k𝐫k(t1)\|\mathbf{\bar{H}}_{k}\mathbf{r}_{k}^{(t-1)}\| decreases to zero while 𝐇¯k𝐬k(t1)\|\mathbf{\bar{H}}_{k}\mathbf{s}_{k}^{(t-1)}\| is monotonically increasing. This is in sharp contrast to the typical relative residual condition 𝐫k(t1)η𝐠k\|\mathbf{r}_{k}^{(t-1)}\|\leq\eta\|\mathbf{g}_{k}\|, which is often used in related works. In nonconvex settings, where the gradient might not lie entirely in the range of the Hessian, the relative residual condition might never be satisfied unless the inexactness parameter is appropriately set, which itself depends on some unavailable lower bound.
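To make this point concrete, consider the following small two-dimensional illustration of ours (it is not part of the original analysis). Let \mathbf{\bar{H}}_{k}=\text{diag}(1,0) and \mathbf{g}_{k}=(1,1)^{\top}, so that \mathbf{g}_{k} does not lie entirely in \text{Range}(\mathbf{\bar{H}}_{k}). Every MINRES residual then satisfies \|\mathbf{r}_{k}^{(t)}\|\geq 1=\|\mathbf{g}_{k}\|/\sqrt{2}, so the relative residual condition can never be met for any \eta<1/\sqrt{2}. In contrast, the first MINRES iterate is \mathbf{s}_{k}^{(1)}=-(1,1)^{\top} with residual \mathbf{r}_{k}^{(1)}=(0,-1)^{\top}, giving \|\mathbf{\bar{H}}_{k}\mathbf{r}_{k}^{(1)}\|=0\leq\eta\|\mathbf{\bar{H}}_{k}\mathbf{s}_{k}^{(1)}\|=\eta, so Condition 1 is triggered at the second iteration for any \eta>0 (provided \sigma<1/4, so that Condition 2 is not detected first).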

Note that, when σ=0\sigma=0, Condition 2 coincides with the nonpositive curvature (NPC) condition. Variants of such NPC condition have been extensively used within various optimization algorithms to generate descent directions for nonconvex problems [52, 67, 76, 21, 48]. It turns out that for the settings we consider in this paper, considering σ>0\sigma>0 removes the need to introduce more structural assumptions on the Hessian matrix, while still allowing for the construction of descent directions. This is mainly due to the following simple observation that as long as Condition 2 has not been triggered, the underlying Krylov subspace contains vectors that better align with the eigenspace of Hessian corresponding to large positive eigenvalues, i.e., regions of sufficiently large positive curvature.

Lemma 1.

Suppose Condition 2 has not yet been detected at iteration tt. For any vector 𝐯𝒦t(𝐇¯k,𝐠k)\mathbf{v}\in\mathcal{K}_{t}({\mathbf{\bar{H}}_{k}},{\mathbf{g}_{k}}), we have 𝐯,𝐇¯k𝐯σ𝐯2\left\langle{\mathbf{v},\mathbf{\bar{H}}_{k}\mathbf{v}}\right\rangle\geq\sigma\|\mathbf{v}\|^{2}.

Proof.

Let 𝐯𝒦t(𝐇¯k,𝐠k)\mathbf{v}\in\mathcal{K}_{t}({\mathbf{\bar{H}}_{k}},{\mathbf{g}_{k}}). Since, Span{𝐫k(0),𝐫k(1),,𝐫k(t1)}=𝒦t(𝐇¯k,𝐠k)\{\mathbf{r}_{k}^{(0)},\mathbf{r}_{k}^{(1)},\cdots,\mathbf{r}_{k}^{(t-1)}\}=\mathcal{K}_{t}({\mathbf{\bar{H}}_{k}},{\mathbf{g}_{k}}), we can write 𝐯=i=0t1ci𝐫k(i)\mathbf{v}=\sum_{i=0}^{t-1}c_{i}\mathbf{r}_{k}^{(i)} for some set of scalars, cic_{i}. Using the 𝐇¯k\mathbf{\bar{H}}_{k}-conjugacy of the residuals, it follows that

\displaystyle\left\langle{\mathbf{v},\mathbf{\bar{H}}_{k}\mathbf{v}}\right\rangle=\sum_{i=0}^{t-1}c_{i}^{2}\left\langle{\mathbf{r}_{k}^{(i)},\mathbf{\bar{H}}_{k}\mathbf{r}_{k}^{(i)}}\right\rangle\geq\sum_{i=0}^{t-1}c_{i}^{2}\sigma d\left\|\mathbf{r}_{k}^{(i)}\right\|^{2}\geq\frac{\sigma d}{t+1}\left\|\sum_{i=0}^{t-1}c_{i}\mathbf{r}_{k}^{(i)}\right\|^{2}\geq\sigma\left\|\mathbf{v}\right\|^{2},

where the second-to-last inequality follows from [14, Fact 9.7.9]. ∎

We highlight that, unlike Condition 1, Condition 2 does not relate to any sub-optimality criterion for the MINRES iterates. As shown in Lemma 1, as long as Condition 2 has not been detected, the MINRES iterate 𝐬k(t1)\mathbf{s}_{k}^{(t-1)} from Algorithm 1 will be, to a great extent, aligned with the eigenspace corresponding to large positive eigenvalues and can thus be used to construct a suitable descent direction. However, once triggered, Condition 2 indicates that the underlying Krylov subspace contains directions of small or negative curvature, suggesting that the MINRES iterate 𝐬k(t1)\mathbf{s}_{k}^{(t-1)} might not be a good direction to follow. In such cases, a descent direction is constructed from the residual vector, 𝐫k(t1)\mathbf{r}_{k}^{(t-1)}, rather than the iterate itself.

Depending on which termination condition is satisfied first, the search direction for Newton-MR is constructed as follows:

𝐝k={𝐬k(t1),if Condition 1 is satisfied,𝐫k(t1),if Condition 2 is satisfied.\displaystyle\mathbf{d}_{k}=\left\{\begin{array}[]{lll}\mathbf{s}_{k}^{(t-1)},&&\text{if \lx@cref{creftype~refnum}{cond:inexactness} is satisfied,}\\ \\ \mathbf{r}_{k}^{(t-1)},&&\text{if \lx@cref{creftype~refnum}{cond:LC} is satisfied.}\end{array}\right.

It turns out that MINRES enjoys a wealth of theoretical and empirical properties that position it as a preferred sub-problem solver for many Newton-type methods for nonconvex settings [53, 49]. Among them, one can show that the search direction 𝐝k\mathbf{d}_{k} is always guaranteed to be a descent direction for ff. Indeed, Properties 2.4 and 2.2 below always guarantee 𝐝k,𝐠k<0\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle<0 when Conditions 1 and 2 are met, respectively. The proofs of these properties are deferred to Appendix A.

Properties 1.

Consider the ttht\textsuperscript{th} iteration of Algorithm 1 and let 0jit10\leq j\leq i\leq t-1. We have

𝐫k(i)2\displaystyle\left\|\mathbf{r}_{k}^{(i)}\right\|^{2} 𝐫k(j)2,\displaystyle\geq\left\|\mathbf{r}_{k}^{(j)}\right\|^{2}, (2.1)
𝐫k(i),𝐠k\displaystyle-\left\langle{\mathbf{r}^{(i)}_{k},\mathbf{g}_{k}}\right\rangle =𝐫k(i)2,\displaystyle=\left\|\mathbf{r}^{(i)}_{k}\right\|^{2}, (2.2)
𝐫k(t1),𝐫k(i)\displaystyle\left\langle{\mathbf{r}^{(t-1)}_{k},\mathbf{r}^{(i)}_{k}}\right\rangle =𝐫k(t1)2.\displaystyle=\left\|\mathbf{r}^{(t-1)}_{k}\right\|^{2}. (2.3)

Furthermore, if Condition 2 has not yet been detected, then

𝐬k(i+1),𝐠k<𝐬k(i+1),𝐇¯k𝐬k(i+1)\displaystyle\left\langle{\mathbf{s}_{k}^{(i+1)},\mathbf{g}_{k}}\right\rangle<-\left\langle{\mathbf{s}_{k}^{(i+1)},\mathbf{\bar{H}}_{k}\mathbf{s}_{k}^{(i+1)}}\right\rangle (2.4)
𝐠k2𝐠k,𝐇¯𝐠k2𝐇¯k𝐠k4=𝐬k(1)2𝐬k(j+1)2𝐬k(i+1)2.\displaystyle\left\|\mathbf{g}_{k}\right\|^{2}\frac{\left\langle{\mathbf{g}_{k},\mathbf{\bar{H}}\mathbf{g}_{k}}\right\rangle^{2}}{\left\|\mathbf{\bar{H}}_{k}\mathbf{g}_{k}\right\|^{4}}=\left\|\mathbf{s}_{k}^{(1)}\right\|^{2}\leq\left\|\mathbf{s}_{k}^{(j+1)}\right\|^{2}\leq\left\|\mathbf{s}_{k}^{(i+1)}\right\|^{2}. (2.5)

Having characterized the search direction, we then enforce sufficient descent for each iteration by choosing a step size αk\alpha_{k} that satisfies the Armijo-type line-search with parameter 0<ρ<10<\rho<1 as

f(\mathbf{x}_{k}+\alpha_{k}\mathbf{d}_{k})\leq f(\mathbf{x}_{k})+\rho\alpha_{k}\langle\mathbf{g}_{k},\mathbf{d}_{k}\rangle. (2.6)

In particular, when 𝐝k=𝐬k(t1)\mathbf{d}_{k}=\mathbf{s}_{k}^{(t-1)} (i.e., Type = SOL), we perform the backward tracking line-search (Algorithm 3) to obtain the largest 0<αk10<\alpha_{k}\leq 1. Such a backtracking technique is standard within optimization methods that employ line-search as the globalization strategy [61]. When 𝐝k=𝐫k(t1)\mathbf{d}_{k}=\mathbf{r}_{k}^{(t-1)} (i.e., Type = LC), we perform the forward and backward tracking line-search (Algorithm 4) to find the largest 0<αk<0<\alpha_{k}<\infty. Once a step size αk\alpha_{k} is found, the next iterate is given as 𝐱k+1=𝐱k+αk𝐝k\mathbf{x}_{k+1}=\mathbf{x}_{k}+\alpha_{k}\mathbf{d}_{k}. Such a forward tracking strategy, though less widely used than backward tracking, has been shown to yield much larger step sizes when directions of small or negative curvature are used and can significantly improve empirical performance [39, 52]. We note that even though this forward tracking strategy can also be used with 𝐝k=𝐬k(t1)\mathbf{d}_{k}=\mathbf{s}_{k}^{(t-1)}, we have found that, empirically, the larger step sizes from forward tracking when Type = SOL might be less effective, and at times even detrimental, compared with the case when Type = LC.

The culmination of these steps is depicted in Algorithm 2. Compared with the variant of Newton-MR in [52], Algorithm 2 contains two main differences: employing the inexact Hessian 𝐇¯\mathbf{\bar{H}} instead of the exact one 𝐇\mathbf{H}, as well as leveraging σ\sigma-LC condition with σ>0\sigma>0 instead of the typical NPC condition with σ=0\sigma=0. For the sake of brevity, we have introduced a variable “Type” in Algorithm 2 and in our subsequent discussions. This variable takes the value “Type = LC” when Condition 2 is met, signifying 𝐝k=𝐫k(t1)\mathbf{d}_{k}=\mathbf{r}_{k}^{(t-1)}, and is set to “Type = SOL” when Condition 1 is triggered, indicating 𝐝k=𝐬k(t1)\mathbf{d}_{k}=\mathbf{s}_{k}^{(t-1)}.

Algorithm 2 Newton-MR with Inexact Hessian
1:Initial point: 𝐱0\mathbf{x}_{0}, First order optimality tolerance: 0<ε𝐠10<\varepsilon_{\mathbf{g}}\leq 1, Sub-problem inexactness tolerance : η>0\eta>0, Line-search parameter : 0<ρ<10<\rho<1, Limited curvature parameter : σ>0\sigma>0
2:k=0k=0
3:while 𝐠k>ε𝐠\|\mathbf{g}_{k}\|>\varepsilon_{\mathbf{g}} do
4:     Call MINRES Algorithm 1: (𝐝k,Type)=MINRES(𝐇¯k,𝐠k,η,σ)(\mathbf{d}_{k},\texttt{Type})=\text{MINRES}(\mathbf{\bar{H}}_{k},\mathbf{g}_{k},\eta,\sigma)
5:     Use Algorithm 3 (Type = SOL) or Algorithm 4 (Type = LC) to obtain αk\alpha_{k} satisfying 2.6
6:     𝐱k+1=𝐱k+αk𝐝k\mathbf{x}_{k+1}=\mathbf{x}_{k}+\alpha_{k}\mathbf{d}_{k}
7:     k=k+1k=k+1
8:end while
9:return 𝐱k\mathbf{x}_{k}
Algorithm 3 Backward Tracking Line-Search [52]
1:α=1\alpha=1 and 0<ξ<10<\xi<1
2:while 2.6 is not satisfied do
3:     α=ξα\alpha=\xi\alpha
4:end while
5:return α\alpha
Algorithm 4 Forward/Backward Tracking Line-Search [52]
1:α>0\alpha>0 and 0<ξ<10<\xi<1
2:if 2.6 is not satisfied then
3:     Call Algorithm 3
4:else
5:     while 2.6 is satisfied do
6:         α=α/ξ\alpha=\alpha/\xi
7:     end while
8:     return ξα\xi\alpha
9:end if
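For readers who prefer code, the following Python fragment is a minimal sketch of ours (not the implementation accompanying this paper) of Algorithms 3 and 4 and of how the resulting step size is used within one iteration of Algorithm 2; the function names, the default parameter values, and the black-box minres_inexact solver are illustrative assumptions.

```python
import numpy as np

def armijo(f, fk, g_dot_d, xk, dk, alpha, rho):
    # Armijo condition (2.6): f(x_k + alpha d_k) <= f(x_k) + rho * alpha * <g_k, d_k>
    return f(xk + alpha * dk) <= fk + rho * alpha * g_dot_d

def backward_tracking(f, fk, g_dot_d, xk, dk, rho=1e-4, xi=0.5):
    # Algorithm 3: start from alpha = 1 and shrink until (2.6) holds
    alpha = 1.0
    while not armijo(f, fk, g_dot_d, xk, dk, alpha, rho):
        alpha *= xi
    return alpha

def forward_backward_tracking(f, fk, g_dot_d, xk, dk, alpha, rho=1e-4, xi=0.5):
    # Algorithm 4: fall back to backward tracking if (2.6) fails at the initial
    # alpha; otherwise grow alpha while (2.6) keeps holding and return the last
    # value that satisfied it
    if not armijo(f, fk, g_dot_d, xk, dk, alpha, rho):
        return backward_tracking(f, fk, g_dot_d, xk, dk, rho, xi)
    while armijo(f, fk, g_dot_d, xk, dk, alpha, rho):
        alpha /= xi
    return xi * alpha

def newton_mr_step(f, grad, minres_inexact, xk, rho=1e-4, xi=0.5, alpha0=1.0):
    # One outer iteration of Algorithm 2: obtain (d_k, Type) from the
    # sub-problem solver, pick alpha_k with the matching line-search, and
    # return x_{k+1} = x_k + alpha_k d_k
    gk = grad(xk)
    dk, kind = minres_inexact(xk, gk)          # kind is "SOL" or "LC"
    fk, g_dot_d = f(xk), float(np.dot(gk, dk))
    if kind == "SOL":
        alpha = backward_tracking(f, fk, g_dot_d, xk, dk, rho, xi)
    else:
        alpha = forward_backward_tracking(f, fk, g_dot_d, xk, dk, alpha0, rho, xi)
    return xk + alpha * dk
```

Here rho and xi play the roles of the line-search parameters 0<ρ<1 and 0<ξ<1 above, and the SOL/LC dispatch mirrors the corresponding steps of Algorithm 2.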

3 Theoretical Analyses

In this section, we present complexity analyses of Algorithm 2. In particular, after establishing the iteration complexity in Section 3.1, we derive the operation complexities in Section 3.2, which are more representative of the actual costs in the large-scale problems. Remarkably, we show that, under minimal assumptions, our algorithm converges irrespective of the degree of Hessian approximation, i.e., ε\varepsilon, and the accuracy of the inner problem solution, i.e., η\eta and σ\sigma. In fact, for all of our analysis, we only make the following blanket assumption, which is widely used throughout the literature.

Assumption 1.

The function ff is twice continuously differentiable and bounded below. Furthermore, there exists a constant 0L𝐠<0\leq L_{\mathbf{g}}<\infty such that for any 𝐱,𝐲d\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}, we have 𝐠(𝐱)𝐠(𝐲)L𝐠𝐱𝐲\left\|\mathbf{g}(\mathbf{x})-\mathbf{g}(\mathbf{y})\right\|\leq L_{\mathbf{g}}\left\|\mathbf{x}-\mathbf{y}\right\| (or, equivalently, for any 𝐱d\mathbf{x}\in\mathbb{R}^{d}, 𝐇(𝐱)L𝐠\left\|\mathbf{H}(\mathbf{x})\right\|\leq L_{\mathbf{g}}).

It is well-known that Assumption 1 implies

f(𝐱+𝐯)f(𝐱)+𝐠(𝐱),𝐯+L𝐠2𝐯2,𝐱d,𝐯d.\displaystyle f(\mathbf{x}+\mathbf{v})\leq f(\mathbf{x})+\left\langle{\mathbf{g}(\mathbf{x}),\mathbf{v}}\right\rangle+\frac{L_{\mathbf{g}}}{2}\|\mathbf{v}\|^{2},\quad\forall\mathbf{x}\in\mathbb{R}^{d},\;\forall\mathbf{v}\in\mathbb{R}^{d}. (3.1)

Furthermore, from Assumption 1 and 1.5, and using the reverse triangle inequality, we have

\displaystyle\left\|\mathbf{\bar{H}}(\mathbf{x})\mathbf{v}\right\|\leq\left\|\mathbf{\bar{H}}(\mathbf{x})\right\|\left\|\mathbf{v}\right\|\leq(L_{\mathbf{g}}+\varepsilon)\left\|\mathbf{v}\right\|,\quad\forall\mathbf{x}\in\mathbb{R}^{d},\;\forall\mathbf{v}\in\mathbb{R}^{d}. (3.2)

3.1 Iteration Complexity

To obtain the operation complexity of Algorithm 2, we first establish its convergence rate, i.e., the iteration complexity. We emphasize that our main focus in this paper is to study the consequences of using inexact Hessian as it relates to complexity, under minimal assumptions and with minimal modifications to the original exact algorithm. Hence, we only consider minimal smoothness, as stated in Assumption 1, i.e., we merely assume that the gradient is Lipschitz continuous and do not extend this smoothness assumption to the Hessian itself. Additionally, we do not employ the Hessian regularization techniques used in [79, 67, 48], as distorting the curvature information further could potentially lead to inferior performance; see the experiments in Section A.3.

Our analysis in this section relies on Lemmas 2 and 3 below, which show the minimum amount of descent Algorithm 2 can achieve going from fkf_{k} to fk+1f_{k+1} when “Type = SOL” and “Type = LC”, respectively.

Lemma 2.

Under Assumption 1, in Algorithm 2 with Type = SOL, we have

fk+1fk2ρ(1ρ)ξCs𝐠k2,\displaystyle f_{k+1}-f_{k}\leq-2\rho(1-\rho)\xi C_{s}\|\mathbf{g}_{k}\|^{2},

where 0<ρ<10<\rho<1 and 0<ξ<10<\xi<1 are the line-search parameters and Csd2σ4/(L𝐠(L𝐠+ε)4)C_{s}\triangleq d^{2}\sigma^{4}/(L_{\mathbf{g}}(L_{\mathbf{g}}+\varepsilon)^{4}).

Proof.

In this case, 𝐝k=𝐬k(t1)\mathbf{d}_{k}=\mathbf{s}_{k}^{(t-1)}. Using 3.1, we have f(𝐱k+αk𝐝k)f(𝐱k)αk𝐝k,𝐠k+αk2L𝐠𝐝k2/2f(\mathbf{x}_{k}+\alpha_{k}\mathbf{d}_{k})-f(\mathbf{x}_{k})\leq\alpha_{k}\left\langle\mathbf{d}_{k},\mathbf{g}_{k}\right\rangle+\alpha_{k}^{2}L_{\mathbf{g}}\left\|\mathbf{d}_{k}\right\|^{2}/2. Now, the line-search condition 2.6 is satisfied if, for some 0<αk10<\alpha_{k}\leq 1, we have αk𝐝k,𝐠k+αk2L𝐠𝐝k2/2αkρ𝐝k,𝐠k\alpha_{k}\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle+\alpha_{k}^{2}L_{\mathbf{g}}\|\mathbf{d}_{k}\|^{2}/2\leq\alpha_{k}\rho\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle. This inequality is satisfied for any step-size that is smaller than 2(1ρ)𝐝k,𝐠k/(L𝐠𝐝k2)-2(1-\rho)\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle/(L_{\mathbf{g}}\|\mathbf{d}_{k}\|^{2}). As a result, starting from a sufficiently large initial trial step-size, the backtracking line-search will return a step-size that is at least αk2(1ρ)ξ𝐝k,𝐠k/(L𝐠𝐝k2)\alpha_{k}\geq-2(1-\rho)\xi\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle/(L_{\mathbf{g}}\|\mathbf{d}_{k}\|^{2}). Since Condition 2 was not met at iteration t2t-2 (otherwise, we would have returned Type = LC in the previous iteration), from 2.4 and Lemma 1, we have 𝐝k,𝐠k<𝐝k,𝐇¯k𝐝kσ𝐝k2.\left\langle\mathbf{d}_{k},\mathbf{g}_{k}\right\rangle<-\left\langle\mathbf{d}_{k},\mathbf{\bar{H}}_{k}\mathbf{d}_{k}\right\rangle\leq-\sigma\left\|\mathbf{d}_{k}\right\|^{2}. Using this and substituting back into the inequality from the line-search, we get

fk+1fkαkρ𝐝k,𝐠k2ρ(1ρ)ξ𝐝k,𝐠k2/(L𝐠𝐝k2)2σ2ρ(1ρ)ξ𝐝k2/L𝐠.\displaystyle f_{k+1}-f_{k}\leq\alpha_{k}\rho\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle\leq-2\rho(1-\rho)\xi\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle^{2}/(L_{\mathbf{g}}\|\mathbf{d}_{k}\|^{2})\leq-2\sigma^{2}\rho(1-\rho)\xi\|\mathbf{d}_{k}\|^{2}/L_{\mathbf{g}}.

To find the lower bound for 𝐝k2\|\mathbf{d}_{k}\|^{2}, from 2.5 we get

𝐝k2\displaystyle\|\mathbf{d}_{k}\|^{2} 𝐠k,𝐇¯𝐠k2𝐠k2/𝐇¯𝐠k4σ2d2𝐠k2/(L𝐠+ε)4,\displaystyle\geq{\left\langle{\mathbf{g}_{k},\mathbf{\bar{H}}\mathbf{g}_{k}}\right\rangle^{2}}\left\|\mathbf{g}_{k}\right\|^{2}/\left\|\mathbf{\bar{H}}\mathbf{g}_{k}\right\|^{4}\geq{\sigma^{2}d^{2}}\|\mathbf{g}_{k}\|^{2}/(L_{\mathbf{g}}+\varepsilon)^{4},

where the last inequality follows from 3.2 together with the fact that Condition 2 was not detected at the first iteration of Algorithm 1, i.e., \left\langle{\mathbf{g}_{k},\mathbf{\bar{H}}_{k}\mathbf{g}_{k}}\right\rangle>\sigma d\left\|\mathbf{g}_{k}\right\|^{2}. This gives the desired result. ∎

Lemma 3.

Under Assumption 1, in Algorithm 2 with Type = LC, we have

fk+1fk2ρ(1ρ)ξCl𝐠k2,\displaystyle f_{k+1}-f_{k}\leq-2\rho(1-\rho)\xi C_{l}\|\mathbf{g}_{k}\|^{2},

where 0<ρ<10<\rho<1 and 0<ξ<10<\xi<1 are the line-search parameters and Clη2/[L𝐠((L𝐠+ε)2+η2)]C_{l}\triangleq\eta^{2}/[L_{\mathbf{g}}((L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2})].

Proof.

First, we note that when Type = LC, Condition 1 has not been met (since it appears before Condition 2 in Algorithm 1). Similar to the proof of Lemma 2, for αk\alpha_{k} to satisfy 2.6, we need αk𝐝k,𝐠k+αk2L𝐠𝐝k2/2αkρ𝐝k,𝐠k\alpha_{k}\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle+\alpha_{k}^{2}L_{\mathbf{g}}\|\mathbf{d}_{k}\|^{2}/2\leq\alpha_{k}\rho\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle, which implies that any step-size smaller than 2(1ρ)𝐝k,𝐠k/(L𝐠𝐝k2)-2(1-\rho)\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle/(L_{\mathbf{g}}\|\mathbf{d}_{k}\|^{2}) satisfies the line-search. Since Condition 2 is satisfied, i.e., 𝐝k=𝐫k(t1)\mathbf{d}_{k}=\mathbf{r}_{k}^{(t-1)}, from 2.2, we have 𝐝k,𝐠k=𝐫k(t1),𝐠k=𝐫k(t1)2=𝐝k2\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle=\langle\mathbf{r}_{k}^{(t-1)},\mathbf{g}_{k}\rangle=-\|\mathbf{r}_{k}^{(t-1)}\|^{2}=-\left\|\mathbf{d}_{k}\right\|^{2}. Hence, using a similar argument as in the proof of Lemma 2, the step size returned from the line-search must satisfy αk2(1ρ)ξ/L𝐠\alpha_{k}\geq 2(1-\rho)\xi/L_{\mathbf{g}}. Substituting back into the inequality from the line-search, we get fk+1fkαkρ𝐝k,𝐠k2ρ(1ρ)ξ𝐝k2/L𝐠f_{k+1}-f_{k}\leq\alpha_{k}\rho\langle\mathbf{d}_{k},\mathbf{g}_{k}\rangle\leq-2\rho(1-\rho)\xi\left\|\mathbf{d}_{k}\right\|^{2}/L_{\mathbf{g}}. Since Condition 1 is not satisfied, we have

𝐇¯k𝐫k(t1)2η2\displaystyle\frac{\left\|\mathbf{\bar{H}}_{k}\mathbf{r}_{k}^{(t-1)}\right\|^{2}}{\eta^{2}} 𝐇¯k𝐬k(t1)2=𝐫k(t1)+𝐠k2\displaystyle\geq\left\|\mathbf{\bar{H}}_{k}\mathbf{s}_{k}^{(t-1)}\right\|^{2}=\left\|\mathbf{r}_{k}^{(t-1)}+\mathbf{g}_{k}\right\|^{2}
=𝐫k(t1)2+𝐠k2+2𝐫k(t1),𝐠k=𝐠k2𝐫k(t1)2,\displaystyle=\left\|\mathbf{r}_{k}^{(t-1)}\right\|^{2}+\left\|\mathbf{g}_{k}\right\|^{2}+2\left\langle{\mathbf{r}_{k}^{(t-1)},\mathbf{g}_{k}}\right\rangle=\left\|\mathbf{g}_{k}\right\|^{2}-\left\|\mathbf{r}_{k}^{(t-1)}\right\|^{2},

where the last equality follows from 2.2. Together, this implies ((L𝐠+ε)2+η2)𝐫k(t1)2𝐇¯k𝐫k(t1)2+η2𝐫k(t1)2η2𝐠k2((L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2})\|\mathbf{r}_{k}^{(t-1)}\|^{2}\geq\|\mathbf{\bar{H}}_{k}\mathbf{r}_{k}^{(t-1)}\|^{2}+\eta^{2}\|\mathbf{r}_{k}^{(t-1)}\|^{2}\geq\eta^{2}\left\|\mathbf{g}_{k}\right\|^{2}, which gives the desired result. ∎

With Lemmas 2 and 3 in hand, we can study the convergence rate of Algorithm 2 for nonconvex functions. Beyond general nonconvex settings, we also consider the subclass of functions that satisfy the θ\theta-PL condition, i.e., Definition 1. In this setting, we show that with θ=2\theta=2 and under Assumption 1, Algorithm 2 achieves a global linear convergence rate. For 1θ<21\leq\theta<2, however, the global convergence of Algorithm 2 transitions from a linear rate to a sub-linear rate in the neighbourhood of a solution. In all these cases, the θ\theta-PL property allows us to derive a faster convergence rate than in the general nonconvex settings. In the context of Newton-type methods with inexact Hessian information, to the best of our knowledge, our work here is the first to establish fast convergence rates for θ\theta-PL functions under mild smoothness assumptions and algorithmic requirements.

Prior works on Newton-type methods with inexact Hessians and linear convergence rates have predominantly focused on (strongly) convex functions. These studies have either assumed that the approximate Hessian remains positive definite [66, 18, 19, 17, 33, 66] or that the gradient lies in the range of Hessian at all times [40]. These stringent assumptions either necessitate highly accurate Hessian approximations or significantly restrict the scope of problems that can be effectively solved. However, these limitations often conflict with what is observed in practice. For instance, the sub-sampled Newton method can effectively solve (strongly) convex finite-sum problems with very crude Hessian approximations using much fewer samples than what the concentration inequalities predict [77]. Similarly, for high-dimensional convex problems where the Hessian matrix is singular and the gradient may not entirely lie in the range of Hessian, Newton’s method remains practically a viable algorithm [65].

The limitations imposed by these stringent assumptions arise from the common practice in previous studies, where the sub-problem solver is treated as a black box, neglecting the opportunity to leverage its inherent properties. In contrast, by tapping into the properties of MINRES, we show that simply requiring Assumption 1 allows us to establish the convergence of Algorithm 2 for nonconvex functions under any amount of inexactness in either the Hessian matrix or the sub-problem solution, effectively bridging the gap between theory and practice in these settings.

Theorem 1.

Let Assumption 1 hold and consider Algorithm 2 with any ε>0\varepsilon>0 in 1.5, any η>0\eta>0 in Condition 1, any σ>0\sigma>0 in Condition 2, and line-search parameters 0<ρ<10<\rho<1 and 0<ξ<10<\xi<1.

  • For functions satisfying the θ\theta-PL condition 1.2, after at most

    K12μC(log(max{1,f0f})+εf(12/θ)log(1/εf)),\displaystyle K\triangleq\frac{1}{2\mu C}\left(\log\left(\max\{1,f_{0}-f^{\star}\}\right)+\varepsilon_{f}^{(1-{2}/{\theta})}\log(1/\varepsilon_{f})\right),

    iterations of Algorithm 2, the approximate global optimality 1.3 is satisfied.

  • For more general nonconvex functions, after at most

    K(f0f)ε𝐠2C,\displaystyle K\triangleq\frac{(f_{0}-f^{\star})\varepsilon_{\mathbf{g}}^{-2}}{C},

    iterations of Algorithm 2, the approximate first order sub-optimality 1.4 is satisfied.

Here, C2ρ(1ρ)ξmin{Cs,Cl}C\triangleq 2\rho(1-\rho)\xi\min\left\{C_{s},C_{l}\right\} where CsC_{s} and ClC_{l} are defined in Lemmas 2 and 3, respectively.

Proof.

For each iteration of Algorithm 2, we either have Type = SOL or Type = LC. Hence, from Lemmas 2 and 3, we obtain

fk+1fkC𝐠k2.\displaystyle f_{k+1}-f_{k}\leq-C\|\mathbf{g}_{k}\|^{2}. (3.3)

Suppose the θ\theta-PL condition 1.2 holds. When θ=2\theta=2, it immediately follows that

fk+1f(12μC)(fkf).\displaystyle f_{k+1}-f^{\star}\leq\left(1-2\mu C\right)(f_{k}-f^{\star}).

Hence, to get fkfεff_{k}-f^{\star}\leq\varepsilon_{f}, it suffices to find kk for which (12μC)kεf/(f0f)\left(1-2\mu C\right)^{k}\leq\varepsilon_{f}/(f_{0}-f^{\star}), which gives the desired results. Now, consider 1θ<21\leq\theta<2. As long as fkf1f_{k}-f^{\star}\geq 1, we have 𝐠k2/2μ(fkf)2/θfkf\left\|\mathbf{g}_{k}\right\|^{2}/2\mu\geq(f_{k}-f^{\star})^{2/\theta}\geq f_{k}-f^{\star}, which is handled just as in the case of θ=2\theta=2, implying that after at most log(f0f)/(2μC)\log(f_{0}-f^{\star})/(2\mu C) iterations, we have fkf1f_{k}-f^{\star}\leq 1. Once, εffkf<1\varepsilon_{f}\leq f_{k}-f^{\star}<1, from 3.3 and 1.2, we obtain

fk+1f\displaystyle f_{k+1}-f^{\star} (1C𝐠k2fkf)(fkf)(12μC(fkf)2/θ1)(fkf)\displaystyle\leq\left(1-\frac{C\|\mathbf{g}_{k}\|^{2}}{f_{k}-f^{\star}}\right)(f_{k}-f^{\star})\leq\left(1-2\mu C(f_{k}-f^{\star})^{{2}/{\theta}-1}\right)(f_{k}-f^{\star})
(12μCεf2θ1)(fkf).\displaystyle\leq\left(1-2\mu C\varepsilon_{f}^{\frac{2}{\theta}-1}\right)(f_{k}-f^{\star}).

Setting (12μCεf2θ1)kεf(1-2\mu C\varepsilon_{f}^{\frac{2}{\theta}-1})^{k}\leq\varepsilon_{f} gives the result.

Now, consider a more general nonconvex setting. By way of contradiction, suppose 𝐠J+1ε𝐠\left\|\mathbf{g}_{J+1}\right\|\leq\varepsilon_{\mathbf{g}} happens for the first time at some J>KJ>K. Then, we must have 𝐠j>ε𝐠\|\mathbf{g}_{j}\|>\varepsilon_{\mathbf{g}}, for all jJj\leq J. From 3.3, it follows that

fjfj+1\displaystyle f_{j}-f_{j+1} C𝐠j2>Cε𝐠2,jJ.\displaystyle\geq C\|\mathbf{g}_{j}\|^{2}>C\varepsilon_{\mathbf{g}}^{2},\quad\forall j\leq J.

Hence, by the definition of KK, we must have

f0fJ\displaystyle f_{0}-f_{J} =j=0J1fjfj+1>JCε𝐠2>KCε𝐠2=f0f\displaystyle=\sum_{j=0}^{J-1}f_{j}-f_{j+1}>JC\varepsilon_{\mathbf{g}}^{2}>KC\varepsilon_{\mathbf{g}}^{2}=f_{0}-f^{\star}

This implies f>fJf^{\star}>f_{J}, which is a contradiction. ∎

Remark 1.

As it can be seen, for θ=2\theta=2, the convergence is linear, while for 1θ<21\leq\theta<2, the convergence transitions to a sublinear rate in the vicinity of an optimum point, i.e., when fkf1f_{k}-f^{\star}\leq 1. Smaller values of θ\theta imply a slower convergence rate in the neighborhood of a solution, which is expected as the function becomes flatter in those regions.

Remark 2.

The advantage of Theorem 1 is that it establishes the convergence of Algorithm 2 for any value of ε\varepsilon, η\eta, and σ\sigma. In other words, regardless of how crude our Hessian approximation or inner problem solution might be, Algorithm 2 still converges for nonconvex functions. This is only possible thanks to leveraging the properties of MINRES as a sub-problem solver in Lemmas 2 and 3. This, to a great degree, corroborates what is typically observed in practice. For example, for solving strongly convex problems, which can be seen as a sub-class of PL functions, Newton's method remains a convergent algorithm with well-defined iterations, irrespective of the level of approximation used for the Hessian or the underlying sub-problem solution [65, 66, 12].

Remark 3.

The iteration complexity of 𝒪(ε𝐠2)\mathcal{O}(\varepsilon_{\mathbf{g}}^{-2}) in Theorem 1 is known to be optimal within the class of second-order methods when applied to solve 1.1 under Lipschitz continuous gradient assumption [25]. Therefore, under Assumption 1, our analysis aligns with the theoretical lower bound, highlighting the efficiency of our approach under these conditions.

3.2 Operation Complexity

In large-scale settings where storing the Hessian matrix explicitly is impractical, and one can only employ operations such as matrix-vector products, the main computational cost lies in solving the sub-problem. Therefore, iteration complexity alone does not adequately represent the actual costs. In this section, we provide an upper bound estimate on the operation complexity, that is the total number of gradient and Hessian-vector product evaluations required by Algorithm 2, to achieve a desired sub-optimality. For clarity of exposition, we drop the subscript kk of 𝐠k\mathbf{g}_{k} and 𝐇¯k\mathbf{\bar{H}}_{k} in what follows in this section.

The interplay between the inexact Hessian and the gradient plays a central role in establishing the operation complexity of Algorithm 2. This interplay is captured by the notion of 𝐠\mathbf{g}-relevant eigenvalues/eigenvectors.

Definition 6.

An eigenvector 𝐯\mathbf{v} of 𝐇¯\mathbf{\bar{H}} is called a 𝐠\mathbf{g}-relevant eigenvector if 𝐠,𝐯0\left\langle{\mathbf{g},\mathbf{v}}\right\rangle\not=0. The corresponding eigenvalue λ\lambda, i.e., 𝐇¯𝐯=λ𝐯\mathbf{\bar{H}}\mathbf{v}=\lambda\mathbf{v}, is called a 𝐠\mathbf{g}-relevant eigenvalue.

Consider the eigen-decomposition of 𝐇¯\mathbf{\bar{H}} be

𝐇¯=[𝐕𝐕𝐕𝟎][𝐙𝐙𝟎][𝐕𝐕𝐕𝟎],\displaystyle\mathbf{\bar{H}}=\begin{bmatrix}\mathbf{V}&\mathbf{V}_{\perp}&\mathbf{V}_{\mathbf{0}}\end{bmatrix}\begin{bmatrix}\mathbf{Z}&&\\ &\mathbf{Z}_{\perp}&\\ &&\mathbf{0}\end{bmatrix}\begin{bmatrix}\mathbf{V}&\mathbf{V}_{\perp}&\mathbf{V}_{\mathbf{0}}\end{bmatrix}^{\top}, (3.4)

where 𝐙=diag(λ1,,λϕ)\mathbf{Z}=\text{diag}(\lambda_{1},\cdots,\lambda_{\phi}) is the diagonal matrix with nonzero 𝐠\mathbf{g}-relevant eigenvalues in a nonincreasing order, that is λ1λϕ+>0>λϕ++1λϕ\lambda_{1}\geq\cdots\geq\lambda_{\phi_{+}}>0>\lambda_{\phi_{+}+1}\geq\cdots\geq\lambda_{\phi}, where ϕ\phi and ϕ+\phi_{+} denote, respectively, the total number of nonzero and positive 𝐠\mathbf{g}-relevant eigenvalues (including multiplicities). The columns of the matrix 𝐕\mathbf{V} contain the corresponding orthonormal eigenvectors. The matrices 𝐕\mathbf{V}_{\perp} and 𝐙\mathbf{Z}_{\perp} are defined similarly, but with respect to nonzero non-𝐠\mathbf{g}-relevant counterparts, i.e., 𝐕𝐠=𝟎\mathbf{V}_{\perp}\mathbf{g}=\mathbf{0}. Finally, the columns of the matrix 𝐕𝟎\mathbf{V}_{\mathbf{0}} span the nullspace of 𝐇¯\mathbf{\bar{H}}. For any 1jϕ1\leq j\leq\phi, we further define 𝐕=[𝐕j𝐕j]\mathbf{V}=\begin{bmatrix}\mathbf{V}_{j}&\mathbf{V}_{j\perp}\end{bmatrix}, where 𝐕j\mathbf{V}_{j} denotes the matrix formed by the first jj columns of 𝐕\mathbf{V} and 𝐕j\mathbf{V}_{j\perp} contains the remaining columns of 𝐕\mathbf{V}. Similarly, we define 𝐙j=diag(λ1,,λj)\mathbf{Z}_{j}=\text{diag}(\lambda_{1},\cdots,\lambda_{j}) and 𝐙j=diag(λj+1,,λϕ)\mathbf{Z}_{j\perp}=\text{diag}(\lambda_{j+1},\cdots,\lambda_{\phi}). A simple example can help clarify the notation.

Example 1.

Let 𝐔=[𝐮1𝐮10]10×10\mathbf{U}=\begin{bmatrix}\mathbf{u}_{1}&\cdots&\mathbf{u}_{10}\end{bmatrix}\in\mathbb{R}^{10\times 10} be an orthonormal matrix, 𝐠=𝐮1+𝐮3+𝐮7+𝐮10\mathbf{g}=\mathbf{u}_{1}+\mathbf{u}_{3}+\mathbf{u}_{7}+\mathbf{u}_{10} and

𝐇¯\displaystyle\mathbf{\bar{H}} =𝐔diag(9,7,6,6,4,4,0,2,3,5)𝐔.\displaystyle=\mathbf{U}\;\text{diag}(9,7,6,6,4,4,0,-2,-3,-5)\;\mathbf{U}^{\top}.

Then, 𝐕=[𝐮1𝐮3𝐮4𝐮10]\mathbf{V}=\begin{bmatrix}\mathbf{u}_{1}&\mathbf{u}_{3}&\mathbf{u}_{4}&\mathbf{u}_{10}\end{bmatrix}, 𝐕=[𝐮2𝐮5𝐮6𝐮8𝐮9]\mathbf{V}_{\perp}=\begin{bmatrix}\mathbf{u}_{2}&\mathbf{u}_{5}&\mathbf{u}_{6}&\mathbf{u}_{8}&\mathbf{u}_{9}\end{bmatrix}, 𝐕0=𝐮7\mathbf{V}_{0}=\mathbf{u}_{7}; λ1=9\lambda_{1}=9, λ2=6\lambda_{2}=6, λ3=6\lambda_{3}=6 and λ4=5\lambda_{4}=-5; ϕ+=3\phi_{+}=3 and ϕ=4\phi=4. Note that the eigenvalue 66 is counted twice, i.e., with its multiplicity, since its eigenspace, spanned by 𝐮3\mathbf{u}_{3} and 𝐮4\mathbf{u}_{4}, has a nonzero overlap with 𝐠\mathbf{g}.
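The counts above can also be checked numerically. The short Python snippet below is ours and merely illustrates the convention of counting 𝐠\mathbf{g}-relevant eigenvalues with their multiplicities; taking 𝐔\mathbf{U} to be the identity is an assumption made purely for this illustration.

```python
import numpy as np

d = 10
U = np.eye(d)                                   # u_i are the canonical basis vectors
eigvals = np.array([9, 7, 6, 6, 4, 4, 0, -2, -3, -5], dtype=float)
H_bar = U @ np.diag(eigvals) @ U.T
g = U[:, 0] + U[:, 2] + U[:, 6] + U[:, 9]       # g = u1 + u3 + u7 + u10

phi, phi_plus = 0, 0
for lam in np.unique(eigvals):
    if lam == 0:
        continue                                # the nullspace is excluded
    basis = U[:, np.isclose(eigvals, lam)]      # eigenspace of lam
    if np.linalg.norm(basis.T @ g) > 1e-12:     # lam is a g-relevant eigenvalue
        mult = basis.shape[1]                   # counted with its multiplicity
        phi += mult
        if lam > 0:
            phi_plus += mult

print(phi_plus, phi)                            # expected output: 3 4
```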

We now make several key observations when Condition 2 is not met in the very first iteration. In this case, from 3.2, we must have σd<(L𝐠+ε)\sigma d<(L_{\mathbf{g}}+\varepsilon). Also, clearly, ϕ+1\phi_{+}\geq 1. Furthermore, we must have λj>σd/2\lambda_{j}>\sigma d/2, for some 1jϕ+1\leq j\leq\phi_{+}. Indeed, suppose the contrary, which implies λ1σd/2\lambda_{1}\leq\sigma d/2. Since Condition 2 is not detected in the first iteration, i.e., 𝐠,𝐇¯𝐠>σd𝐠2\left\langle{\mathbf{g},\mathbf{\bar{H}}\mathbf{g}}\right\rangle>\sigma d\left\|\mathbf{g}\right\|^{2}, we get

σd<𝐠,𝐇¯𝐠𝐠2=i=1ϕλi𝐠,𝐯i2𝐠2λ1i=1ϕ𝐠,𝐯i2𝐠2λ1σd2,\displaystyle\sigma d<\frac{\left\langle{\mathbf{g},\mathbf{\bar{H}}\mathbf{g}}\right\rangle}{\left\|\mathbf{g}\right\|^{2}}=\sum_{i=1}^{\phi}\lambda_{i}\frac{\left\langle{\mathbf{g},\mathbf{v}_{i}}\right\rangle^{2}}{\|\mathbf{g}\|^{2}}\leq\lambda_{1}\sum_{i=1}^{\phi}\frac{\left\langle{\mathbf{g},\mathbf{v}_{i}}\right\rangle^{2}}{\left\|\mathbf{g}\right\|^{2}}\leq\lambda_{1}\leq\frac{\sigma d}{2},

which is a contradiction. Now, let 1jϕ+1\leq j\leq\phi_{+} be the largest index such that λj>σd/2\lambda_{j}>\sigma d/2, i.e., λj\lambda_{j} is the smallest 𝐠\mathbf{g}-relevant eigenvalue with such property. Since σd𝐠2<𝐠,𝐇¯𝐠\sigma d\|\mathbf{g}\|^{2}<\left\langle{\mathbf{g},\mathbf{\bar{H}}\mathbf{g}}\right\rangle, the Pythagorean theorem gives

σd(𝐕𝐕𝐕j𝐕j)𝐠2\displaystyle\sigma d\left\|(\mathbf{VV}^{\top}-\mathbf{V}_{j}\mathbf{V}_{j}^{\top})\mathbf{g}\right\|^{2} +σd𝐕j𝐕j𝐠2+σd𝐕𝐕𝐠2\displaystyle+\sigma d\left\|\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g}\right\|^{2}+\sigma d\left\|\mathbf{V}_{\perp}\mathbf{V}_{\perp}^{\top}\mathbf{g}\right\|^{2}
<𝐕j𝐕j𝐠,𝐇¯𝐕j𝐕j𝐠+(𝐕𝐕𝐕j𝐕j)𝐠,𝐇¯(𝐕𝐕𝐕j𝐕j)𝐠\displaystyle<\left\langle{\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g},\mathbf{\bar{H}}\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g}}\right\rangle+\left\langle{(\mathbf{VV}^{\top}-\mathbf{V}_{j}\mathbf{V}_{j}^{\top})\mathbf{g},\mathbf{\bar{H}}(\mathbf{VV}^{\top}-\mathbf{V}_{j}\mathbf{V}_{j}^{\top})\mathbf{g}}\right\rangle
(L𝐠+ε)𝐕j𝐕j𝐠2+σd2(𝐕𝐕𝐕j𝐕j)𝐠2,\displaystyle\leq(L_{\mathbf{g}}+\varepsilon)\left\|\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g}\right\|^{2}+\frac{\sigma d}{2}\left\|(\mathbf{VV}^{\top}-\mathbf{V}_{j}\mathbf{V}_{j}^{\top})\mathbf{g}\right\|^{2},

where the last inequality follows from 3.2 and the fact that 𝐕j𝐕j=𝐕𝐕𝐕j𝐕j\mathbf{V}_{j\perp}\mathbf{V}_{j\perp}^{\top}=\mathbf{VV}^{\top}-\mathbf{V}_{j}\mathbf{V}_{j}^{\top} is the projection on the eigenspace corresponding to the eigenvalues that are smaller than or equal to σd/2\sigma d/2. Hence, it follows that

σd2(𝐕𝐕𝐕j𝐕j)𝐠2\displaystyle\frac{\sigma d}{2}\left\|(\mathbf{VV}^{\top}-\mathbf{V}_{j}\mathbf{V}_{j}^{\top})\mathbf{g}\right\|^{2} (L𝐠+εσd)𝐕j𝐕j𝐠2.\displaystyle\leq(L_{\mathbf{g}}+\varepsilon-\sigma d)\left\|\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g}\right\|^{2}. (3.5)

Next, we state a sufficient condition for Condition 1 to be satisfied.

Lemma 4.

Suppose Assumption 1 holds and ϕ+1\phi_{+}\geq 1. For any 1jϕ+1\leq j\leq\phi_{+}, if

𝐫(t1)2\displaystyle\left\|\mathbf{r}^{(t-1)}\right\|^{2} η2(L𝐠+ε)2+η2𝐕j𝐕j𝐠2,\displaystyle\leq\frac{\eta^{2}}{(L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2}}\left\|\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g}\right\|^{2},

then Condition 1 is satisfied.

Proof.

By the assumption on 𝐫(t1)\|\mathbf{r}^{(t-1)}\|, we have

𝐫(t1)2\displaystyle\left\|\mathbf{r}^{(t-1)}\right\|^{2} η2(L𝐠+ε)2+η2𝐕j𝐕j𝐠2η2(L𝐠+ε)2+η2𝐠2,\displaystyle\leq\frac{\eta^{2}}{(L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2}}\left\|\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g}\right\|^{2}\leq\frac{\eta^{2}}{(L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2}}\left\|\mathbf{g}\right\|^{2},

which using 3.2 implies

𝐇¯𝐫(t1)2+η2𝐫(t1)2\displaystyle\left\|\mathbf{\bar{H}}\mathbf{r}^{(t-1)}\right\|^{2}+\eta^{2}\left\|\mathbf{r}^{(t-1)}\right\|^{2} ((L𝐠+ε)2+η2)𝐫(t1)2η2𝐠2.\displaystyle\leq\left((L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2}\right)\left\|\mathbf{r}^{(t-1)}\right\|^{2}\leq\eta^{2}\left\|\mathbf{g}\right\|^{2}.

Hence, using 2.2 and the fact that 𝐠2𝐫(t1)2=𝐠2+2𝐠,𝐫(t1)+𝐫(t1)2=𝐠+𝐫(t1)2\|\mathbf{g}\|^{2}-\|\mathbf{r}^{(t-1)}\|^{2}=\|\mathbf{g}\|^{2}+2\langle\mathbf{g},\mathbf{r}^{(t-1)}\rangle+\|\mathbf{r}^{(t-1)}\|^{2}=\|\mathbf{g}+\mathbf{r}^{(t-1)}\|^{2}, we obtain

𝐇¯𝐫(t1)2\displaystyle\left\|\mathbf{\bar{H}}\mathbf{r}^{(t-1)}\right\|^{2} η2(𝐠2𝐫(t1)2)=η2𝐠+𝐫(t1)2=η2𝐇¯𝐬(t1)2.\displaystyle\leq\eta^{2}\left(\left\|\mathbf{g}\right\|^{2}-\left\|\mathbf{r}^{(t-1)}\right\|^{2}\right)=\eta^{2}\left\|\mathbf{g}+\mathbf{r}^{(t-1)}\right\|^{2}=\eta^{2}\left\|\mathbf{\bar{H}}\mathbf{s}^{(t-1)}\right\|^{2}.

The following lemma gives a certain convergence behaviour of MINRES for indefinite matrices, as it relates to a subspace corresponding to positive 𝐠\mathbf{g}-relevant eigenvalues.

Lemma 5.

Let ϕ+1\phi_{+}\geq 1 and 1iϕ+1\leq i\leq\phi_{+}. For any 0<ω<40<\omega<4 and any iteration tt of Algorithm 1 with

tλ1/λi4log(4ω)+1,\displaystyle t\geq\left\lceil\frac{\sqrt{\lambda_{1}/\lambda_{i}}}{4}\log\left(\frac{4}{\omega}\right)\right\rceil+1,

we have 𝐫(t1)2(𝐕𝐕𝐕i𝐕i)𝐠2+ω𝐕i𝐕i𝐠2\|\mathbf{r}^{(t-1)}\|^{2}\leq\|(\mathbf{VV}^{\top}-\mathbf{V}_{i}\mathbf{V}_{i}^{\top})\mathbf{g}\|^{2}+\omega\|\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g}\|^{2}.

Proof.

Recall that the ttht\textsuperscript{th} iterate of MINRES using inexact Hessian 𝐇¯(𝐱)\mathbf{\bar{H}}(\mathbf{x}) is given as [53, 62]

𝐬k(t)=argmin𝐬𝒦t(𝐇¯k,𝐠k)𝐇¯k𝐬+𝐠k2.\displaystyle\mathbf{s}_{k}^{(t)}=\operatorname*{arg\,min}_{\mathbf{s}\in\mathcal{K}_{t}(\mathbf{\bar{H}}_{k},\mathbf{g}_{k})}\left\|\mathbf{\bar{H}}_{k}\mathbf{s}+\mathbf{g}_{k}\right\|^{2}. (3.6)

From 3.4, Range(𝐇¯)=Range(𝐕i𝐕i)Range(𝐕i𝐕i+𝐕𝐕)\text{Range}(\mathbf{\bar{H}})=\text{Range}(\mathbf{V}_{i}\mathbf{V}_{i}^{\top})\bigoplus\text{Range}(\mathbf{V}_{i\perp}\mathbf{V}_{i\perp}^{\top}+\mathbf{V}_{\perp}\mathbf{V}_{\perp}^{\top}), for any 1iϕ+1\leq i\leq\phi_{+}. Using a similar reasoning as [51, Lemma 3.12], the formulation 3.6 can be written as

𝐫(t)2\displaystyle\left\|\mathbf{r}^{(t)}\right\|^{2} =min𝐬𝒦t(𝐇¯,𝐠)𝐇¯𝐬+𝐠2\displaystyle=\min_{\mathbf{s}\in\mathcal{K}_{t}(\mathbf{\bar{H}},\mathbf{g})}\left\|\mathbf{\bar{H}}\mathbf{s}+\mathbf{g}\right\|^{2}
=min𝐬𝐕i𝐕i𝒦t(𝐇¯,𝐠)𝐇¯𝐬+𝐕i𝐕i𝐠2+min𝐬(𝐕i𝐕i+𝐕𝐕)𝒦t(𝐇¯,𝐠)𝐇¯𝐬+(𝐕i𝐕i+𝐕𝐕)𝐠2\displaystyle=\min_{\mathbf{s}\in\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathcal{K}_{t}(\mathbf{\bar{H}},\mathbf{g})}\left\|\mathbf{\bar{H}}\mathbf{s}+\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g}\right\|^{2}+\min_{\mathbf{s}\in(\mathbf{V}_{i\perp}\mathbf{V}_{i\perp}^{\top}+\mathbf{V}_{\perp}\mathbf{V}_{\perp}^{\top})\mathcal{K}_{t}(\mathbf{\bar{H}},\mathbf{g})}\left\|\mathbf{\bar{H}}\mathbf{s}+(\mathbf{V}_{i\perp}\mathbf{V}_{i\perp}^{\top}+\mathbf{V}_{\perp}\mathbf{V}_{\perp}^{\top})\mathbf{g}\right\|^{2}
=min𝐬𝐕i𝐕i𝒦t(𝐇¯,𝐠)𝐇¯𝐬+𝐕i𝐕i𝐠2+min𝐬𝐕i𝐕i𝒦t(𝐇¯,𝐠)𝐇¯𝐬+𝐕i𝐕i𝐠2,\displaystyle=\min_{\mathbf{s}\in\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathcal{K}_{t}(\mathbf{\bar{H}},\mathbf{g})}\left\|\mathbf{\bar{H}}\mathbf{s}+\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g}\right\|^{2}+\min_{\mathbf{s}\in\mathbf{V}_{i\perp}\mathbf{V}_{i\perp}^{\top}\mathcal{K}_{t}(\mathbf{\bar{H}},\mathbf{g})}\left\|\mathbf{\bar{H}}\mathbf{s}+\mathbf{V}_{i\perp}\mathbf{V}_{i\perp}^{\top}\mathbf{g}\right\|^{2},

where the last equality follows from the fact that 𝐕\mathbf{V}_{\perp} are constructed from non-𝐠\mathbf{g}-relevant eigenvectors and hence, 𝐕𝐕𝐠=𝟎\mathbf{V}_{\perp}\mathbf{V}_{\perp}^{\top}\mathbf{g}=\mathbf{0} and 𝐕𝐕𝒦t(𝐇¯,𝐠)={𝟎}\mathbf{V}_{\perp}\mathbf{V}_{\perp}^{\top}\mathcal{K}_{t}(\mathbf{\bar{H}},\mathbf{g})=\{\mathbf{0}\}. We also note that

min𝐬𝐕i𝐕i𝒦t(𝐇¯,𝐠)𝐇¯𝐬+𝐕i𝐕i𝐠2𝐕i𝐕i𝐠2=(𝐕𝐕𝐕i𝐕i)𝐠2.\displaystyle\min_{\mathbf{s}\in\mathbf{V}_{i\perp}\mathbf{V}_{i\perp}^{\top}\mathcal{K}_{t}(\mathbf{\bar{H}},\mathbf{g})}\left\|\mathbf{\bar{H}}\mathbf{s}+\mathbf{V}_{i\perp}\mathbf{V}_{i\perp}^{\top}\mathbf{g}\right\|^{2}\leq\left\|\mathbf{V}_{i\perp}\mathbf{V}_{i\perp}^{\top}\mathbf{g}\right\|^{2}=\left\|(\mathbf{VV}^{\top}-\mathbf{V}_{i}\mathbf{V}_{i}^{\top})\mathbf{g}\right\|^{2}.

Since for all 𝐬𝐕i𝐕i𝒦t(𝐇¯,𝐠)\mathbf{s}\in\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathcal{K}_{t}(\mathbf{\bar{H}},\mathbf{g}), we can write 𝐬=𝐕i𝐕i𝐬\mathbf{s}=\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{s}, it follows that

min𝐬𝐕i𝐕i𝒦t(𝐇¯,𝐠)𝐇¯𝐕i𝐕i𝐬+𝐕i𝐕i𝐠2=min𝐬𝒦t(𝐕i𝐙i𝐕i,𝐕i𝐕i𝐠)𝐕i𝐙i𝐕i𝐬+𝐕i𝐕i𝐠2.\displaystyle\min_{\mathbf{s}\in\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathcal{K}_{t}(\mathbf{\bar{H}},\mathbf{g})}\left\|\mathbf{\bar{H}}\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{s}+\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g}\right\|^{2}=\min_{\mathbf{s}\in\mathcal{K}_{t}(\mathbf{V}_{i}\mathbf{Z}_{i}\mathbf{V}_{i}^{\top},\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g})}\left\|\mathbf{V}_{i}\mathbf{Z}_{i}\mathbf{V}_{i}^{\top}\mathbf{s}+\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g}\right\|^{2}.

where 𝐙i\mathbf{Z}_{i} is the diagonal matrix containing the top ii positive 𝐠\mathbf{g}-relevant eigenvalues of 𝐇¯\mathbf{\bar{H}}. Hence, using the standard convergence results for MINRES [70], we have

min𝐬𝒦t(𝐕i𝐙i𝐕i,𝐕i𝐕i𝐠)𝐕i𝐙i𝐕i𝐬+𝐕i𝐕i𝐠2\displaystyle\min_{\mathbf{s}\in\mathcal{K}_{t}(\mathbf{V}_{i}\mathbf{Z}_{i}\mathbf{V}_{i}^{\top},\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g})}\left\|\mathbf{V}_{i}\mathbf{Z}_{i}\mathbf{V}_{i}^{\top}\mathbf{s}+\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g}\right\|^{2} 4𝐕i𝐕i𝐠2(λ1/λi1λ1/λi+1)2t.\displaystyle\leq 4\left\|\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g}\right\|^{2}\left(\frac{\sqrt{\lambda_{1}/\lambda_{i}}-1}{\sqrt{\lambda_{1}/\lambda_{i}}+1}\right)^{2t}.

Together, we obtain,

𝐫(t)2\displaystyle\left\|\mathbf{r}^{(t)}\right\|^{2} (𝐕𝐕𝐕i𝐕i)𝐠2+4𝐕i𝐕i𝐠2(λ1/λi1λ1/λi+1)2t.\displaystyle\leq\left\|(\mathbf{VV}-\mathbf{V}_{i}\mathbf{V}_{i}^{\top})\mathbf{g}\right\|^{2}+4\left\|\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g}\right\|^{2}\left(\frac{\sqrt{\lambda_{1}/\lambda_{i}}-1}{\sqrt{\lambda_{1}/\lambda_{i}}+1}\right)^{2t}.

Now, for any 0<ω<40<\omega<4, if Algorithm 1 iterates for

\displaystyle t\geq\left\lceil\frac{\sqrt{\lambda_{1}/\lambda_{i}}}{4}\log\left(\frac{4}{\omega}\right)\right\rceil\geq\left\lceil\log\left(\frac{4}{\omega}\right)\bigg{/}\log\left(\left(\frac{\sqrt{\lambda_{1}/\lambda_{i}}+1}{\sqrt{\lambda_{1}/\lambda_{i}}-1}\right)^{2}\right)\right\rceil,

then we have 𝐫(t)2(𝐕𝐕𝐕i𝐕i)𝐠2+ω𝐕i𝐕i𝐠2\|\mathbf{r}^{(t)}\|^{2}\leq\left\|(\mathbf{VV}^{\top}-\mathbf{V}_{i}\mathbf{V}_{i}^{\top})\mathbf{g}\right\|^{2}+\omega\left\|\mathbf{V}_{i}\mathbf{V}_{i}^{\top}\mathbf{g}\right\|^{2}. ∎

Using Lemmas 4 and 5, the following lemma gives an upper estimate on the total number of iterations of Algorithm 1 before termination.

Lemma 6.

Suppose Assumption 1 holds and let σ>0\sigma>0 and η>0\eta>0. Algorithm 1 terminates after at most TT iterations where

T{1,σd(L𝐠+ε)1,(L𝐠+ε)8σdlog(8((L𝐠+ε)2+η2)η2)+1,η2+(L𝐠+ε)21.25η2+(L𝐠+ε)2σd(L𝐠+ε)<1,(L𝐠+ε)((L𝐠+ε)2+η2)σ(1.25η2+(L𝐠+ε)2),Otherwise.\displaystyle T\triangleq\left\{\begin{array}[]{ll}1,&\displaystyle\frac{\sigma d}{(L_{\mathbf{g}}+\varepsilon)}\geq 1,\\ \\ \displaystyle\left\lceil\sqrt{\frac{(L_{\mathbf{g}}+\varepsilon)}{8\sigma d}}\log\left(\frac{8((L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2})}{\eta^{2}}\right)\right\rceil+1,&\displaystyle\frac{\eta^{2}+(L_{\mathbf{g}}+\varepsilon)^{2}}{1.25\eta^{2}+(L_{\mathbf{g}}+\varepsilon)^{2}}\leq\frac{\sigma d}{(L_{\mathbf{g}}+\varepsilon)}<1,\\ \\ \displaystyle\left\lceil\frac{(L_{\mathbf{g}}+\varepsilon)((L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2})}{\sigma(1.25\eta^{2}+(L_{\mathbf{g}}+\varepsilon)^{2})}\right\rceil,&\text{Otherwise.}\end{array}\right. (3.12)
Proof.

If Condition 2 is met in the very first iteration, then the result trivially holds. Otherwise, by the observations made earlier, we must have σd<L𝐠+ε\sigma d<L_{\mathbf{g}}+\varepsilon and λj>σd/2\lambda_{j}>\sigma d/2, for some 1jϕ+1\leq j\leq\phi_{+}. Also, from 3.5, for this jj we get, (𝐕𝐕𝐕j𝐕j)𝐠22(L𝐠+εσd)/(σd)𝐕j𝐕j𝐠2\|(\mathbf{VV}^{\top}-\mathbf{V}_{j}\mathbf{V}_{j}^{\top})\mathbf{g}\|^{2}\leq 2(L_{\mathbf{g}}+\varepsilon-\sigma d)/(\sigma d)\|\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g}\|^{2}. In this case, for the sufficient condition from Lemma 4 to be satisfied, we need to find ω\omega in Lemma 5 such that

(𝐕𝐕𝐕j𝐕j)𝐠2+ω𝐕j𝐕j𝐠2η2(L𝐠+ε)2+η2𝐕j𝐕j𝐠2.\displaystyle\|(\mathbf{VV}^{\top}-\mathbf{V}_{j}\mathbf{V}_{j}^{\top})\mathbf{g}\|^{2}+\omega\|\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g}\|^{2}\leq\frac{\eta^{2}}{(L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2}}\left\|\mathbf{V}_{j}\mathbf{V}_{j}^{\top}\mathbf{g}\right\|^{2}.

Hence, it suffices to find ω\omega such that

2(L𝐠+εσd)σd+ωη2(L𝐠+ε)2+η2.\displaystyle\frac{2(L_{\mathbf{g}}+\varepsilon-\sigma d)}{\sigma d}+\omega\leq\frac{\eta^{2}}{(L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2}}.

It follows that, as long as

d2(L𝐠+ε)((L𝐠+ε)2+η2)σ(2.5η2+2(L𝐠+ε)2),\displaystyle d\geq\frac{2(L_{\mathbf{g}}+\varepsilon)((L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2})}{\sigma(2.5\eta^{2}+2(L_{\mathbf{g}}+\varepsilon)^{2})}, (3.13)

we have

η2(L𝐠+ε)2+η22(L𝐠+εσd)σd>0.5η2(L𝐠+ε)2+η2,\displaystyle\frac{\eta^{2}}{(L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2}}-\frac{2(L_{\mathbf{g}}+\varepsilon-\sigma d)}{\sigma d}>\frac{0.5\eta^{2}}{(L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2}},

which implies that

ω0.5η2(L𝐠+ε)2+η2,\displaystyle\omega\leq\frac{0.5\eta^{2}}{(L_{\mathbf{g}}+\varepsilon)^{2}+\eta^{2}},

ensures Condition 1. Otherwise, Algorithm 1 naturally terminates after at most dd iterations with an exact solution. Noting that λ1L𝐠+ε\lambda_{1}\leq L_{\mathbf{g}}+\varepsilon and λj>σd/2\lambda_{j}>\sigma d/2, Lemma 5 implies the result. ∎

With Lemma 6, it is straightforward to obtain the total operation complexity of Algorithm 2, which is again optimal among the class of second-order methods for solving nonconvex problems of the form 1.1 that satisfy the Lipschitz continuous gradient assumption [25].

Corollary 1 (Optimal Operation Complexity of Algorithm 2).

Under the assumptions of Theorem 1, for θ\theta-PL functions, after at most 𝒪(log(max{1,f0f})+εf(12/θ)log(1/εf))\mathcal{O}\left(\log\left(\max\{1,f_{0}-f^{\star}\}\right)+\varepsilon_{f}^{(1-{2}/{\theta})}\log(1/\varepsilon_{f})\right) gradient and Hessian-vector product evaluations, Algorithm 2 produces a solution that satisfies the approximate global optimality 1.3. For more general nonconvex functions, this bound is 𝒪(ε𝐠2)\mathcal{O}\left(\varepsilon_{\mathbf{g}}^{-2}\right), for the approximate first order sub-optimality 1.4 to be satisfied.

4 Numerical Experiments

For our numerical experiments, we consider a series of finite-sum minimization problems. We evaluate the performance of Algorithm 2 in comparison with several other second order methods, namely Newton-CG [67] and its sub-sampled variant [79], Steihaug’s trust-region method [28, 61] and its corresponding sub-sampled algorithms [75, 78], as well as L-BFGS [61]. We approximate the Hessian by means of sub-sampling, that is, we form 𝐇¯\mathbf{\bar{H}} using 1%1\%, 5%5\%, and 10%10\% of the total samples. For each experiment, the performance is depicted in two plots, namely the graph of f(𝐱)f(\mathbf{x}) with respect to the number of iterations as well as a corresponding plot based on the number of function oracle calls. For instance, the work required to compute the gradient can be considered equivalent to two oracle calls, i.e., one forward and one backward propagation, and a Hessian-vector product can be obtained with work equivalent to four oracle calls (see [51, 65] for more details). Our implementations are matrix-free in that we do not store the (inexact) Hessian and instead rely on Hessian-vector products to perform the iterations of MINRES and CG. The code for the following experiments can be found at https://github.com/alexlim1993/NumOpt.
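To make the matrix-free setup and the oracle-call accounting concrete, the following is a minimal Python sketch; it is an illustration only (not the repository code), the class and attribute names are hypothetical, and the binary logistic loss of Section 4.2 is used merely as a convenient example. The proportional counting of sub-sampled Hessian-vector products is our assumed reading of the convention described above.

```python
import numpy as np

class SubsampledOracle:
    """Toy finite-sum oracle with matrix-free, sub-sampled Hessian-vector
    products, illustrated on the binary logistic loss.  Oracle-call
    accounting (assumed convention): 2 per full gradient, 4 scaled by the
    sub-sample fraction per Hessian-vector product."""

    def __init__(self, A, b, sample_frac=0.1, seed=0):
        self.A, self.b = A, b                 # data matrix (n x d), labels in {0, 1}
        self.n = A.shape[0]
        self.frac = sample_frac
        self.rng = np.random.default_rng(seed)
        self.calls = 0.0                      # cumulative oracle calls

    def grad(self, x):
        # full gradient of f(x) = (1/n) sum_i ln(1 + exp(<a_i, x>)) - b_i <a_i, x>
        p = 1.0 / (1.0 + np.exp(-self.A @ x))
        self.calls += 2
        return self.A.T @ (p - self.b) / self.n

    def hess_vec(self, x, v):
        # sub-sampled, matrix-free Hessian-vector product: the Hessian is never formed
        m = max(1, int(self.frac * self.n))
        idx = self.rng.choice(self.n, m, replace=False)
        As = self.A[idx]
        p = 1.0 / (1.0 + np.exp(-As @ x))
        self.calls += 4 * m / self.n
        return As.T @ ((p * (1.0 - p)) * (As @ v)) / m

# tiny usage example on random data
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
b = (A @ np.ones(5) > 0).astype(float)
oracle = SubsampledOracle(A, b, sample_frac=0.05)
g = oracle.grad(np.zeros(5))
Hv = oracle.hess_vec(np.zeros(5), g)
print(g.shape, Hv.shape, oracle.calls)
```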

4.1 Implementation details

Newton-MR and its sub-sampled variants

: For the Armijo line-search parameter in 2.6, we use ρ=104\rho=10^{-4}, which is typical [61]. For both Algorithms 3 and 4, we use ξ=0.5\xi=0.5. The parameters σ\sigma in Condition 2 and η\eta in Condition 1 are chosen specifically for each problem. In particular, we aim to fine-tune η\eta so that directions of Type = SOL are generated more frequently.
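As a point of reference, a generic back-tracking sketch using these parameter values is shown below; the exact sufficient-decrease condition 2.6 and Algorithms 3 and 4 are as defined earlier in the paper, so this standard Armijo variant is only indicative.

```python
import numpy as np

def backtracking_armijo(f, x, p, g, alpha0=1.0, rho=1e-4, xi=0.5, max_backtracks=60):
    """Generic Armijo back-tracking: shrink the step size by xi until
    f(x + alpha p) <= f(x) + rho * alpha * <g, p>.  A standard sketch with
    rho = 1e-4 and xi = 0.5, not the paper's exact condition 2.6."""
    fx, slope = f(x), float(g @ p)     # slope should be negative for a descent direction
    alpha = alpha0
    for _ in range(max_backtracks):
        if f(x + alpha * p) <= fx + rho * alpha * slope:
            break
        alpha *= xi
    return alpha

# usage on a toy quadratic with the steepest-descent direction
f = lambda z: 0.5 * float(z @ z)
x = np.array([1.0, -2.0])
g = x
p = -g
print(backtracking_armijo(f, x, p, g))   # accepts the unit step here
```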

Newton-CG and its sub-sampled variants

: We implement the line-search used in [67, 79] with its parameter set to 10410^{-4}. The back-tracking line-search parameter is set to 0.50.5. Within the Capped-CG algorithm [67], the damping parameter in 𝐇~=𝐇+ζ𝐈\tilde{\mathbf{H}}=\mathbf{H}+\zeta\mathbf{I} is set to ζ=0.5\zeta=0.5. The desired accuracy of the Capped-CG algorithm varies across different experiments to allow for the best performance.

Steihaug’s trust-region and its sub-sampled variants

: Following the notation used in [61, Algorithm 4.1], we set Δ^=1010\hat{\Delta}=10^{10}, Δ0=105\Delta_{0}=10^{5} and η=0.1\eta=0.1. We use the CG–Steihaug [61, Algorithm 7.2] to approximately solve the trust-region sub-problems.

L-BFGS

: The limited memory size is set to 20, and the Armijo and the Strong Wolfe line-search parameters are, respectively, set to their typical values of 10410^{-4} and 0.90.9.

All algorithms are terminated when the number of oracle calls reaches 10610^{6}, the norm of the gradient falls below 10610^{-6}, or the step size/trust-region radius falls below 101810^{-18}. In the legends of Figures 1 and 2, ‘x%x\%’ indicates that the approximate Hessian is formed using ‘xx’ percent of the full dataset. For all other figures, ‘algorithm_xx’ refers to a given algorithm in which 100x100x percent of the full dataset is used to construct the Hessian. When there is no suffix, the Hessian matrix is formed exactly.

4.2 Binary Classification

We first investigate the advantages/disadvantages of Hessian approximation within Newton-MR framework using two simple finite-sum problems in the form of binary classification. Namely, we consider convex binary logistic regression (Figure 1),

f(𝐱)=1ni=1nln(1+exp(𝐚i,𝐱))bi𝐚i,𝐱,\displaystyle f(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}\ln(1+\exp({\langle\mathbf{a}_{i},\mathbf{x}\rangle}))-b_{i}\langle\mathbf{a}_{i},\mathbf{x}\rangle,

and nonconvex nonlinear least square (Figure 2),

f(𝐱)=1ni=1n(11+exp(𝐚i,𝐱)bi)2.\displaystyle f(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{1+\exp(-\langle\mathbf{a}_{i},\mathbf{x}\rangle)}-b_{i}\right)^{2}.
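For concreteness, a direct implementation of this nonconvex nonlinear least-squares objective, together with its gradient obtained by the chain rule, could read as follows (an illustrative sketch, not the repository code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nls_loss_and_grad(x, A, b):
    """Nonconvex nonlinear least-squares objective of Section 4.2 and its
    gradient (2/n) A^T [(p - b) * p * (1 - p)], derived via the chain rule."""
    p = sigmoid(A @ x)                    # per-sample predictions
    r = p - b                             # residuals
    f = np.mean(r ** 2)
    g = 2.0 * A.T @ (r * p * (1.0 - p)) / A.shape[0]
    return f, g

# quick forward-difference check of the first gradient entry on random data
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))
b = rng.integers(0, 2, 50).astype(float)
x = rng.standard_normal(4)
f, g = nls_loss_and_grad(x, A, b)
eps = 1e-6
e0 = np.zeros(4)
e0[0] = 1.0
fd = (nls_loss_and_grad(x + eps * e0, A, b)[0] - f) / eps
print(abs(fd - g[0]))                     # discrepancy of finite-difference check
```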

For our experiments here, we use the MNIST dataset [30], which consists of 60,00060,000 grayscale images of size 28×2828\times 28, i.e., {(𝐚i,bi)}i=1n=60,000784×[0,1]\{(\mathbf{a}_{i},b_{i})\}_{i=1}^{n=60,000}\subset\mathbb{R}^{784}\times[0,1]. The labels of the MNIST dataset are converted to binary labels, i.e., even and odd digits.

In Algorithm 2, we set η=0.001\eta=0.001 and 𝐱0=𝟎\mathbf{x}_{0}=\mathbf{0}. The results of running Algorithm 2 using various degrees of Hessian approximation are given in Figures 1 and 2. As predicted by our theory, Algorithm 2 converges irrespective of the degree of Hessian approximation. Also, while Hessian approximation generally helps reduce the overall computational cost, its effectiveness can diminish if the approximation is overly crude. This is expected, since a significant reduction in sub-sample size can lead to a substantial loss of curvature information, resulting in poor performance.

Figure 1: Performance of Algorithm 2 using various degrees of Hessian approximation on the nonconvex nonlinear least squares loss function. As predicted by our theory, Algorithm 2 converges irrespective of the degree of Hessian approximation. Also, Hessian approximation typically reduces computational costs; however, a substantial reduction in sub-sample size can lead to a significant loss of curvature information, resulting in poor performance.
Figure 2: Performance of Algorithm 2 using various degrees of Hessian approximation on the convex logistic loss function. As predicted by our theory, Algorithm 2 converges irrespective of the degree of Hessian approximation. Also, Hessian approximation typically reduces computational costs; however, a substantial reduction in sub-sample size can lead to a significant loss of curvature information, resulting in poor performance.

4.3 Feed Forward Neural Network

In this section, we consider a multi-class classification problem using a feed forward neural network, namely

f(𝐱)=1n(i=1n𝐛i,ln(𝐡(𝐱;𝐚i)))+λg(𝐱),f(\mathbf{x})=-\frac{1}{n}\left(\sum_{i=1}^{n}\left\langle\mathbf{b}_{i},\ln\left(\mathbf{h}(\mathbf{x};\mathbf{a}_{i})\right)\right\rangle\right)+\lambda g(\mathbf{x}), (4.1)

where 𝐡(𝐱;𝐚i):d[0,1]C\mathbf{h}(\mathbf{x};\mathbf{a}_{i}):\mathbb{R}^{d}\to[0,1]^{C} represents the output of the neural network, 𝐛i{0,1}C\mathbf{b}_{i}\in\{0,1\}^{C} denotes the label vector corresponding to input data 𝐚i\mathbf{a}_{i}, g(𝐱):dg(\mathbf{x}):\mathbb{R}^{d}\to\mathbb{R} is a nonconvex regularizer defined as g(𝐱)i=1dxi2/(1+xi2)g(\mathbf{x})\triangleq\sum^{d}_{i=1}x_{i}^{2}/(1+x_{i}^{2}), and λ>0\lambda>0 is a regularization parameter.
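Since the same regularizer $g$ appears in all of the neural-network experiments, a small sketch of its value and gradient (the elementwise derivative is $2x_{i}/(1+x_{i}^{2})^{2}$, computed here by hand) may be a useful reference; this is an illustration only.

```python
import numpy as np

def nonconvex_reg(x):
    """Regularizer g(x) = sum_i x_i^2 / (1 + x_i^2) of 4.1 and its gradient,
    whose i-th entry is 2 x_i / (1 + x_i^2)^2."""
    x2 = x ** 2
    val = np.sum(x2 / (1.0 + x2))
    grad = 2.0 * x / (1.0 + x2) ** 2
    return val, grad

# note that g is bounded above by d, so lambda * g stays small for lambda = 1e-8
x = np.linspace(-5.0, 5.0, 11)
val, grad = nonconvex_reg(x)
print(val, grad[:3])
```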

We first consider the CIFAR10 dataset [45], consisting of 60,00060,000 color images of size 32×3232\times 32 across 10 classes, i.e., {(𝐚i,𝐛i)}i=1n=60,0003,072×{0,1}10\{(\mathbf{a}_{i},\mathbf{b}_{i})\}_{i=1}^{n=60,000}\subset\mathbb{R}^{3,072}\times\{0,1\}^{10}. We split the 60,000 samples into 50,000 training samples and 10,000 validation samples. Figures 3, 4 and 5 depict the performance of each algorithm on the training data. Plots showing performance in terms of validation accuracy/error are gathered in Section A.2. The feed forward neural network has the following architecture,

3072 (input)tanh512tanh128tanh32softmax10 (output),\displaystyle 3072\text{ (input)}\to\tanh\to 512\to\tanh\to 128\to\tanh\to 32\to\text{softmax}\to 10\text{ (output)},

resulting in d=1,643,498d=1,643,498 parameters. The regularization parameter is set to λ=108\lambda=10^{-8}. All algorithms are initialized from a normal distribution with zero mean and covariance 0.1𝐈0.1\mathbf{I}. The inexactness and the σ\sigma-LC parameters of Algorithm 2 are set to η=10\eta=10 and σ=1016\sigma=10^{-16}, respectively. For Newton-CG [67, 79], the desired accuracy for Capped-CG is set to 0.010.01.
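For illustration, the architecture above could be reconstructed in PyTorch roughly as follows; bias conventions and other implementation details are not specified in the text, so this is an assumed reconstruction and its exact parameter count may deviate slightly from the reported $d=1{,}643{,}498$.

```python
import torch
import torch.nn as nn

# Assumed PyTorch reconstruction of the feed-forward network of Section 4.3;
# details such as bias terms may differ from the authors' implementation.
model = nn.Sequential(
    nn.Linear(3072, 512), nn.Tanh(),
    nn.Linear(512, 128), nn.Tanh(),
    nn.Linear(128, 32), nn.Tanh(),
    nn.Linear(32, 10), nn.LogSoftmax(dim=-1),
)

def loss(log_probs, onehot_b, params, lam=1e-8):
    """Cross-entropy of 4.1 for one-hot labels plus the nonconvex regularizer."""
    ce = -(onehot_b * log_probs).sum(dim=1).mean()
    reg = sum((p ** 2 / (1.0 + p ** 2)).sum() for p in params)
    return ce + lam * reg

x = torch.randn(4, 3072)                                  # a dummy mini-batch
b = torch.nn.functional.one_hot(torch.randint(0, 10, (4,)), 10).float()
print(loss(model(x), b, model.parameters()).item(),
      sum(p.numel() for p in model.parameters()))         # roughly 1.64 million parameters
```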

Figure 3: Comparison of Newton-MR and Newton-CG on CIFAR10 dataset in Section 4.3.
Figure 4: Comparison of Newton-MR and Trust-Region on CIFAR10 dataset in Section 4.3.
Figure 5: Comparison of Newton-MR and L-BFGS on CIFAR10 dataset in Section 4.3.

Next, we consider a different dataset, namely Covertype [15], which includes n=300,000n=300,000 training and 2,0002,000 validation samples with 5454 attributes across seven classes. Figures 6, 7 and 8 depict the performance of each algorithm on the training data. Plots showing performance in terms of validation accuracy/error are gathered in Section A.2. For these experiments, the neural network architecture is chosen to be

54 (input)sigmoid256tanh256sigmoid128tanh128\displaystyle 54\text{ (input)}\to\text{sigmoid}\to 256\to\tanh\to 256\to\text{sigmoid}\to 128\to\tanh\to 128\to
sigmoid64tanh64sigmoid32tanh32softmax7 (output),\displaystyle\to\text{sigmoid}\to 64\to\tanh\to 64\to\text{sigmoid}\to 32\to\tanh\to 32\to\text{softmax}\to 7\text{ (output)},

resulting in d=145,063d=145,063. We also set λ=108\lambda=10^{-8}. Again, 𝐱0\mathbf{x}_{0} is chosen as above. For Newton-MR, the sub-problem inexactness and σ\sigma-LC parameters are changed to η=10,000\eta=10,000 and σ=1026\sigma=10^{-26} respectively. For Newton-CG, the desired accuracy parameter is set to 0.010.01.

Figure 6: Comparison of Newton-MR and Newton-CG on Covertype dataset in Section 4.3.
Figure 7: Comparison of Newton-MR and Trust-Region on Covertype dataset in Section 4.3.
Figure 8: Comparison of Newton-MR and L-BFGS on Covertype dataset in Section 4.3.

In the experiments of this section, all algorithms reach the maximum allowed number of oracle calls. In all of Figures 3, 4, 5, 6, 7 and 8, Algorithm 2 with an appropriate degree of sub-sampling is either very competitive with, or downright superior to, the alternative methods. Using the exact Hessian improves convergence by reducing the number of iterations. However, unless highly accurate solutions are required – which is often not necessary in machine learning applications – subsampling the Hessian can reduce the number of required oracle calls. What is striking is the relatively poor performance of all Newton-CG variants. This further corroborates the observations in [65, 51, 52] that MINRES typically proves to be a far superior inner solver compared with the widely used CG method.

4.4 Recurrent Neural Network

Finally, we consider a slightly more challenging problem of training a recurrent neural network (RNN) [72] with mean squared error loss. Specifically, we consider the following objective function

f(𝐱)=1ni=1n𝐡(𝐱;𝐚i)𝐛i2+λg(𝐱),\displaystyle f(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}\left\|\mathbf{h}(\mathbf{x};\mathbf{a}_{i})-\mathbf{b}_{i}\right\|^{2}+\lambda g(\mathbf{x}),

where 𝐡(𝐱;𝐚)\mathbf{h}(\mathbf{x};\mathbf{a}) is the output of the RNN and g(𝐱)g(\mathbf{x}) is again the same nonconvex regularizer as in Section 4.3 with regularization parameter λ=108\lambda=10^{-8}. The point 𝐱0\mathbf{x}_{0} is chosen in the same manner as in Section 4.3.

For the experiment here, we use the Gas sensor array under dynamic gas mixtures dataset [36], which is a time series dataset. In particular, each sample contains a time series of 16 features across three time stamps. So, we get {𝐚i,𝐛i}i=1n=45,828(3×16)×\{\mathbf{a}_{i},\mathbf{b}_{i}\}_{i=1}^{n=45,828}\subset(\mathbb{R}^{3}\times\mathbb{R}^{16})\times\mathbb{R}, and d=100,417d=100,417. We further split the samples into 40,828 training and 5,000 validation samples. For Newton-MR, the relative residual tolerance is set to η=100\eta=100. For Newton-CG, the accuracy of Capped-CG is set to 0.10.1.

Figures 9, 10 and 11 depict the performance of each algorithm on the training data. Plots showing performance in terms of validation accuracy/error are gathered in Section A.2. Once again, in all these experiments, Algorithm 2 with adequate sub-sampling can either outperform or at least remain competitive with the alternative methods. Also, one can again observe the visibly poor performance of Newton-CG relative to Algorithm 2. In Figure 9, NewtonCG_0.05\texttt{NewtonCG}\_0.05 and NewtonCG_0.1\texttt{NewtonCG}\_0.1 are terminated after 1410 and 2168 iterations, respectively, as their step sizes fall below 101810^{-18}. All the other algorithms reach the maximum allowed number of oracle calls.

Figure 9: Comparison of Newton-MR and Newton-CG for training a recurrent neural network as in Section 4.4. NewtonCG_0.05\texttt{NewtonCG}\_0.05 and NewtonCG_0.1\texttt{NewtonCG}\_0.1 are terminated after 1410 and 2168 iterations with function values of 2.4428 and 0.31219, respectively, as their step sizes get below 101810^{-18}.
Figure 10: Comparison of Newton-MR and Trust-Region for training a recurrent neural network as in Section 4.4.
Figure 11: Comparison of Newton-MR and L-BFGS for training a recurrent neural network as in Section 4.4.

5 Conclusion

We considered a variant of the Newton-MR algorithm for nonconvex problems, initially proposed in [51], to accommodate inexact Hessian information. Our primary focus was to establish iteration and operation complexities of our proposed variant under simple assumptions and with minimal algorithmic changes. In particular, we assume minimal smoothness on the gradient without extending it to the Hessian, and we avoid Hessian regularization/damping techniques to maintain curvature information and prevent performance degradation. We show that our algorithm converges regardless of the degree of approximation of the Hessian as well as the accuracy of the solution to the sub-problem. Beyond general nonconvex settings, we consider, in particular, a subclass of functions that satisfy the θ\theta-PL condition and show that with θ=2\theta=2, our algorithm achieves a global linear convergence rate while for 1θ<21\leq\theta<2, the global convergence transitions from a linear rate to a sub-linear rate in the neighbourhood of a solution. Finally, we provided numerical experiments on several machine learning problems to further illustrate the advantages of sub-sampled Hessian in the Newton-MR framework.

5.1 Acknowledgments

Fred Roosta was partially supported by the Australian Research Council through an Industrial Transformation Training Centre for Information Resilience (IC200100022) as well as a Discovery Early Career Researcher Award (DE180100923).

References

  • [1] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.
  • [2] Motasem Alfarra, Slavomir Hanzely, Alyazeed Albasyoni, Bernard Ghanem, and Peter Richtárik. Adaptive Learning of the Optimal Batch Size of SGD. arXiv preprint arXiv:2005.01097, 2020.
  • [3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International conference on machine learning, pages 242–252. PMLR, 2019.
  • [4] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Ayush Sekhari, and Karthik Sridharan. Second-order information in non-convex stochastic optimization: Power and limitations. In Conference on Learning Theory, pages 242–299. PMLR, 2020.
  • [5] Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.
  • [6] A. Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics, 2017.
  • [7] Stefania Bellavia, Eugenio Fabrizi, and Benedetta Morini. Linesearch Newton-CG methods for convex optimization with noise. ANNALI DELL’UNIVERSITA’DI FERRARA, 68(2):483–504, 2022.
  • [8] Stefania Bellavia and Gianmarco Gurioli. Stochastic analysis of an adaptive cubic regularization method under inexact gradient evaluations and dynamic Hessian accuracy. Optimization, 71(1):227–261, 2022.
  • [9] Stefania Bellavia, Gianmarco Gurioli, and Benedetta Morini. Adaptive cubic regularization methods with dynamic inexact Hessian information and applications to finite-sum minimization. IMA Journal of Numerical Analysis, 41(1):764–799, 2021.
  • [10] Stefania Bellavia, Nataša Krejić, and Nataša Krklec Jerinkić. Subsampled inexact Newton methods for minimizing large sums of convex functions. IMA Journal of Numerical Analysis, 40(4):2309–2341, 2020.
  • [11] Stefania Bellavia, Nataša Krejić, and Benedetta Morini. Inexact restoration with subsampled trust-region methods for finite-sum minimization. Computational Optimization and Applications, 76:701–736, 2020.
  • [12] Albert S Berahas, Raghu Bollapragada, and Jorge Nocedal. An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software, 35(4):661–680, 2020.
  • [13] El Houcine Bergou, Youssef Diouane, Vladimir Kunc, Vyacheslav Kungurtsev, and Clément W Royer. A subsampling line-search method with second-order results. INFORMS Journal on Optimization, 4(4):403–425, 2022.
  • [14] Dennis S Bernstein. Matrix mathematics: Theory, Facts, and Formulas (Second Edition). Princeton university press, 2009.
  • [15] Jock Blackard. Covertype. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50K5N.
  • [16] Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS journal on optimization, 1(2):92–119, 2019.
  • [17] Raghu Bollapragada, Richard H Byrd, and Jorge Nocedal. Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis, 39(2):545–578, 2019.
  • [18] Richard H. Byrd, Gillian M. Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
  • [19] Richard H. Byrd, Gillian M. Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127–155, 2012.
  • [20] Yair Carmon and John C Duchi. Analysis of Krylov subspace solutions of regularized non-convex quadratic problems. Advances in Neural Information Processing Systems, 31, 2018.
  • [21] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. “Convex until proven guilty”: dimension-free acceleration of gradient descent on non-convex functions. In International conference on machine learning, pages 654–663. PMLR, 2017.
  • [22] Coralia Cartis, Jaroslav Fowkes, and Zhen Shao. Randomised subspace methods for non-convex optimization, with applications to nonlinear least-squares. arXiv preprint arXiv:2211.09873, 2022.
  • [23] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
  • [24] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part ii: worst-case function-and derivative-evaluation complexity. Mathematical programming, 130(2):295–319, 2011.
  • [25] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pages 3711–3750. World Scientific, 2018.
  • [26] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM, 2022.
  • [27] Jiabin Chen, Rui Yuan, Guillaume Garrigos, and Robert M Gower. SAN: stochastic average newton algorithm for minimizing finite sums. In International Conference on Artificial Intelligence and Statistics, pages 279–318. PMLR, 2022.
  • [28] Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust region methods. SIAM, 2000.
  • [29] Frank E Curtis, Daniel P Robinson, Clément W Royer, and Stephen J Wright. Trust-region Newton-CG with strong second-order complexity guarantees for nonconvex optimization. SIAM Journal on Optimization, 31(1):518–544, 2021.
  • [30] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142, 2012.
  • [31] Michał Dereziński and Elizaveta Rebrova. Sharp analysis of sketch-and-project methods via a connection to randomized singular value decomposition. SIAM Journal on Mathematics of Data Science, 6(1):127–153, 2024.
  • [32] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  • [33] Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. Advances in Neural Information Processing Systems, 28, 2015.
  • [34] Ilyas Fatkhullin, Jalal Etesami, Niao He, and Negar Kiyavash. Sharp analysis of stochastic optimization under global Kurdyka-Lojasiewicz inequality. Advances in Neural Information Processing Systems, 35:15836–15848, 2022.
  • [35] Zhili Feng, Fred Roosta, and David P Woodruff. Non-PSD matrix sketching with applications to regression and optimization. In Uncertainty in Artificial Intelligence, pages 1841–1851. PMLR, 2021.
  • [36] Jordi Fonollosa, Sadique Sheik, Ramón Huerta, and Santiago Marco. Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sensors and Actuators B: Chemical, 215:618–629, 2015.
  • [37] Terunari Fuji, Pierre-Louis Poirion, and Akiko Takeda. Randomized subspace regularized Newton method for unconstrained non-convex optimization. arXiv preprint arXiv:2209.04170, 2022.
  • [38] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • [39] Nicholas IM Gould, Stefano Lucidi, Massimo Roma, and Ph L Toint. Exploiting negative curvature directions in linesearch methods for unconstrained optimization. Optimization methods and software, 14(1-2):75–98, 2000.
  • [40] Robert Gower, Dmitry Kovalev, Felix Lieder, and Peter Richtárik. RSN: Randomized subspace Newton. Advances in Neural Information Processing Systems, 32, 2019.
  • [41] Serge Gratton, Clément W Royer, Luís N Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 38(3):1579–1597, 2018.
  • [42] Billy Jin, Katya Scheinberg, and Miaolan Xie. Sample complexity analysis for adaptive optimization algorithms with stochastic oracles. Mathematical Programming, pages 1–29, 2024.
  • [43] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pages 795–811. Springer, 2016.
  • [44] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [45] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical Report, 2009.
  • [46] G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer Series in the Data Sciences. Springer International Publishing, 2020.
  • [47] Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of applied mathematics, 2(2):164–168, 1944.
  • [48] Shuyao Li and Stephen J Wright. A randomized algorithm for nonconvex minimization with inexact evaluations and complexity guarantees. arXiv preprint arXiv:2310.18841, 2023.
  • [49] Alexander Lim, Yang Liu, and Fred Roosta. Conjugate Direction Methods Under Inconsistent Systems. arXiv preprint arXiv:2401.11714, 2024.
  • [50] Z. Lin, H. Li, and C. Fang. Accelerated Optimization for Machine Learning: First-Order Algorithms. Springer Singapore, 2020.
  • [51] Yang Liu and Fred Roosta. Convergence of Newton-MR under inexact Hessian information. SIAM Journal on Optimization, 31(1):59–90, 2021.
  • [52] Yang Liu and Fred Roosta. A Newton-MR algorithm with complexity guarantees for nonconvex smooth unconstrained optimization. arXiv preprint arXiv:2208.07095, 2022.
  • [53] Yang Liu and Fred Roosta. MINRES: From Negative Curvature Detection to Monotonicity Properties. SIAM Journal on Optimization, 32(4):2636–2661, 2022.
  • [54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [55] Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
  • [56] Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. In International conference on machine learning, pages 6820–6829. PMLR, 2020.
  • [57] Shashi K Mishra and Giorgio Giorgi. Invexity and optimization, volume 88. Springer Science & Business Media, 2008.
  • [58] Sen Na, Michał Dereziński, and Michael W Mahoney. Hessian averaging in stochastic Newton methods achieves superlinear convergence. Mathematical Programming, 201(1):473–520, 2023.
  • [59] Sen Na and Michael W. Mahoney. Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming. arXiv preprint arXiv:2205.13687, 2024.
  • [60] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • [61] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • [62] Christopher C Paige and Michael A Saunders. Solution of sparse indefinite systems of linear equations. SIAM journal on numerical analysis, 12(4):617–629, 1975.
  • [63] Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.
  • [64] Francesco Rinaldi, Luis Nunes Vicente, and Damiano Zeffiro. Stochastic trust-region and direct-search methods: A weak tail bound condition and reduced sample sizing. SIAM Journal on Optimization, 34(2):2067–2092, 2024.
  • [65] Fred Roosta, Yang Liu, Peng Xu, and Michael W Mahoney. Newton-MR: Inexact Newton method with minimum residual sub-problem solver. EURO Journal on Computational Optimization, 10:100035, 2022.
  • [66] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods. Mathematical Programming, 174(1):293–326, 2019.
  • [67] Clément W Royer, Michael O’Neill, and Stephen J Wright. A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization. Mathematical Programming, 180:451–488, 2020.
  • [68] Clément W Royer and Stephen J Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM Journal on Optimization, 28(2):1448–1477, 2018.
  • [69] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • [70] Yousef Saad. Iterative methods for sparse linear systems. SIAM, 2003.
  • [71] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • [72] Alex Sherstinsky. Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm) network. Physica D: Nonlinear Phenomena, 404:132306, 2020.
  • [73] Yingjie Tian, Yuqi Zhang, and Haibin Zhang. Recent advances in stochastic gradient descent in deep learning. Mathematics, 11(3):682, 2023.
  • [74] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. Advances in neural information processing systems, 31, 2018.
  • [75] Peng Xu, Fred Roosta, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact hessian information. Mathematical Programming, 184(1-2):35–70, 2020.
  • [76] Peng Xu, Fred Roosta, and Michael W Mahoney. Second-order optimization for non-convex machine learning: An empirical study. In Proceedings of the 2020 SIAM International Conference on Data Mining, pages 199–207. SIAM, 2020.
  • [77] Peng Xu, Jiyan Yang, Fred Roosta, Christopher Ré, and Michael W Mahoney. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems (NeurIPS-16), pages 3000–3008, 2016.
  • [78] Zhewei Yao, Peng Xu, Fred Roosta, and Michael W Mahoney. Inexact nonconvex Newton-type methods. INFORMS Journal on Optimization, 3(2):154–182, 2021.
  • [79] Zhewei Yao, Peng Xu, Fred Roosta, Stephen J Wright, and Michael W Mahoney. Inexact Newton-CG algorithms with complexity guarantees. IMA Journal of Numerical Analysis, 43(3):1855–1897, 2023.
  • [80] Rui Yuan, Robert M Gower, and Alessandro Lazaric. A general sample complexity analysis of vanilla policy gradient. In International Conference on Artificial Intelligence and Statistics, pages 3332–3380. PMLR, 2022.
  • [81] Rui Yuan, Alessandro Lazaric, and Robert M Gower. Sketched Newton–Raphson. SIAM Journal on Optimization, 32(3):1555–1583, 2022.
  • [82] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • [83] Jinshan Zeng, Shikang Ouyang, Tim Tsz-Kit Lau, Shaobo Lin, and Yuan Yao. Global convergence in deep learning with variable splitting via the Kurdyka-Łojasiewicz property. arXiv preprint arXiv:1803.00225, 9, 2018.

Appendix A Appendix

A.1 Proof of 1

These results have been established, in one form or another, in prior works [70, 53]; however, we provide the proofs here for the sake of completeness. The proofs of 2.1 and 2.2 can be readily found in [70] and [53, Lemma 3.1d], respectively. For 2.3, we note 𝐫(t1),𝐫(i)=𝐫(t1),𝐠𝐇¯𝐬(i)=𝐫(t1)2\langle\mathbf{r}^{(t-1)},\mathbf{r}^{(i)}\rangle=\langle\mathbf{r}^{(t-1)},-\mathbf{g}-\mathbf{\bar{H}}\mathbf{s}^{(i)}\rangle=\|\mathbf{r}^{(t-1)}\|^{2}, where the last equality follows from 2.2 and the Petrov-Galerkin condition, i.e., 𝐫(t1)𝐇¯𝒦t(𝐇¯,𝐠)\mathbf{r}^{(t-1)}\perp\mathbf{\bar{H}}\mathcal{K}_{t}({\mathbf{\bar{H}}},{\mathbf{g}}). Property 2.4 can be found in [53, Theorem 3.8a]. For 2.5, the monotonic growth of the iterate norms, 𝐬(i)𝐬(j)\|\mathbf{s}^{(i)}\|\geq\|\mathbf{s}^{(j)}\|, can again be found in [53, Theorem 3.11e]. For the expression of 𝐬(1)\mathbf{s}^{(1)}, we write

𝐬(1)=argmin𝐬𝒦1(𝐇¯,𝐠)𝐠𝐇¯𝐬2=argmin𝐬span(𝐠)𝐇¯𝐬2+2𝐠,𝐇¯𝐬\displaystyle\mathbf{s}^{(1)}=\underset{\mathbf{s}\in\mathcal{K}_{1}({\mathbf{\bar{H}}},{\mathbf{g}})}{\text{argmin}}\left\|-\mathbf{g}-\mathbf{\bar{H}}\mathbf{s}\right\|^{2}=\underset{\mathbf{s}\in\text{span}(\mathbf{g})}{\text{argmin}}\left\|\mathbf{\bar{H}}\mathbf{s}\right\|^{2}+2\left\langle{\mathbf{g},\mathbf{\bar{H}}\mathbf{s}}\right\rangle

So, the vector 𝐬(1)\mathbf{s}^{(1)} must be some multiple of 𝐠\mathbf{g}, i.e., 𝐬(1)=α𝐠\mathbf{s}^{(1)}=\alpha\mathbf{g}. To find α\alpha, we note,

𝐠,𝐇¯𝐠𝐇¯𝐠2=argminαα2𝐇¯𝐠2+2α𝐠,𝐇¯𝐠.\displaystyle-\frac{\left\langle{\mathbf{g},\mathbf{\bar{H}}\mathbf{g}}\right\rangle}{\left\|\mathbf{\bar{H}}\mathbf{g}\right\|^{2}}=\underset{\alpha\in\mathbb{R}}{\text{argmin}}\,\,\alpha^{2}\left\|\mathbf{\bar{H}}\mathbf{g}\right\|^{2}+2\alpha\left\langle{\mathbf{g},\mathbf{\bar{H}}\mathbf{g}}\right\rangle.
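For completeness (a short check of our own, not in the original text), the displayed value is simply the first-order optimality condition of this scalar quadratic:

\displaystyle\frac{\mathrm{d}}{\mathrm{d}\alpha}\left(\alpha^{2}\left\|\mathbf{\bar{H}}\mathbf{g}\right\|^{2}+2\alpha\left\langle{\mathbf{g},\mathbf{\bar{H}}\mathbf{g}}\right\rangle\right)=2\alpha\left\|\mathbf{\bar{H}}\mathbf{g}\right\|^{2}+2\left\langle{\mathbf{g},\mathbf{\bar{H}}\mathbf{g}}\right\rangle=0\quad\Longrightarrow\quad\alpha=-\frac{\left\langle{\mathbf{g},\mathbf{\bar{H}}\mathbf{g}}\right\rangle}{\left\|\mathbf{\bar{H}}\mathbf{g}\right\|^{2}},

which yields $\mathbf{s}^{(1)}=\alpha\mathbf{g}$ with this value of $\alpha$.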

A.2 Further Numerical Results: Training and Validation Accuracy/Error

In this section, we provide the comparative performance of all methods, as measured by training and validation accuracy/error in each experiment of Section 4.

Figure 12: Comparison of Newton-MR and Newton-CG on CIFAR10 dataset in Section 4.3.
Figure 13: Comparison of Newton-MR and Trust-Region on CIFAR10 dataset in Section 4.3.
Figure 14: Comparison of Newton-MR and L-BFGS on CIFAR10 dataset in Section 4.3.
Figure 15: Comparison of Newton-MR and Newton-CG on Covertype dataset in Section 4.3.
Figure 16: Comparison of Newton-MR and Trust-Region on Covertype dataset in Section 4.3.
Figure 17: Comparison of Newton-MR and L-BFGS on Covertype in Section 4.3.
Figure 18: Comparison of Newton-MR and Newton-CG for training a recurrent neural network as in Section 4.4. The error is measured as (i𝒮|h(𝐱;𝐚i)bi|)/|𝒮|(\sum_{i\in\mathcal{S}}|h(\mathbf{x};\mathbf{a}_{i})-b_{i}|)/|\mathcal{S}|, where 𝒮\mathcal{S} denotes the training or validation set, containing |𝒮||\mathcal{S}| data points. Note that although the plots of validation error and training error appear highly similar, there are slight, albeit barely perceptible, differences between them.
Figure 19: Comparison of Newton-MR and Trust-Region for training a recurrent neural network as in Section 4.4. The error is measured as (i𝒮|h(𝐱;𝐚i)bi|)/|𝒮|(\sum_{i\in\mathcal{S}}|h(\mathbf{x};\mathbf{a}_{i})-b_{i}|)/|\mathcal{S}|, where 𝒮\mathcal{S} denotes the training or validation set, containing |𝒮||\mathcal{S}| data points. Note that although the plots of validation error and training error appear highly similar, there are slight, albeit barely perceptible, differences between them.
Figure 20: Comparison of Newton-MR and L-BFGS for training a recurrent neural network as in Section 4.4. The error is measured as (i𝒮|h(𝐱;𝐚i)bi|)/|𝒮|(\sum_{i\in\mathcal{S}}|h(\mathbf{x};\mathbf{a}_{i})-b_{i}|)/|\mathcal{S}|, where 𝒮\mathcal{S} denotes the training or validation set, containing |𝒮||\mathcal{S}| data points. Note that although the plots of validation error and training error appear highly similar, there are slight, albeit barely perceptible, differences between them.

A.3 Why not damp/regularize the Hessian matrix?

As mentioned earlier, our primary focus in this paper is to design a convergent algorithm for nonconvex problems that requires minimal algorithmic modification relative to similar Newton-type methods in the literature. In nonconvex settings, many Newton-type methods consider some modification of the Hessian, e.g., Levenberg-Marquardt [47, 55], Newton-CG [67, 75], and Trust-region-CG [29]. In particular, the regularized/damped Hessian of the form 𝐇~k=𝐇k+ζ𝐈\tilde{\mathbf{H}}_{k}=\mathbf{H}_{k}+\zeta\mathbf{I}, for some ζ>0\zeta>0, is used in [67, 75, 29] within the Capped-CG procedure. Not only does this particular modification allow for the extraction of directions of sufficiently negative curvature, but it also provides a means for obtaining state-of-the-art convergence analysis. However, as we now show, this damping of the Hessian spectrum can adversely affect the practical performance of the algorithm. Figure 21 depicts the results of the nonlinear least squares problem from Section 4.2 with the CIFAR10 dataset [45], using different magnitudes of damping for the Hessian. As can be seen, such regularization/damping almost always results in poor empirical performance. Intuitively, this is largely due to the lessening of the contribution of the eigenvectors corresponding to the small eigenvalues of the Hessian in forming the update direction.
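To see how little the damping changes algorithmically, and yet how strongly it can reshape the computed direction, consider the following matrix-free toy sketch (an illustration of the general idea, not the experiment code):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def damped_hvp(hess_vec, zeta, d):
    """Matrix-free damped operator v -> (H + zeta I) v, of the kind used by
    the Capped-CG-based baselines; zeta = 0 recovers the undamped operator."""
    return LinearOperator((d, d), matvec=lambda v: hess_vec(v) + zeta * v)

# toy illustration: damping suppresses the component of the computed direction
# along eigenvectors with small eigenvalues, which is the effect discussed above
H = np.diag([1e-3, 1.0])
g = np.array([1.0, 1.0])
for zeta in (0.0, 0.1, 1.0):
    A = damped_hvp(lambda v: H @ v, zeta, d=2)
    step, _ = cg(A, -g)          # inner solve with the (damped) operator only
    print(zeta, step)
```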

Figure 21: Nonlinear least squares problem from Section 4.2 with the CIFAR10 dataset [45]. Newton-MR with different magnitudes of regularization/damping of (Left) the full Hessian, (Middle) the 10% sub-sampled Hessian and (Right) the 5% sub-sampled Hessian. Here, NewtonMR_reg-ζ\zeta refers to Algorithm 2 with the regularized/damped Hessian, i.e., 𝐇~k=𝐇¯k+ζ𝐈\tilde{\mathbf{H}}_{k}=\mathbf{\bar{H}}_{k}+\zeta\mathbf{I}, where ζ=0,0.001,0.01,0.1,1\zeta=0,0.001,0.01,0.1,1. As can be seen, such regularization/damping almost always results in poor empirical performance.