

New Methods for Parametric Optimization via Differential Equations

Heyuan Liu Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720 ([email protected]).    Paul Grigas Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720 ([email protected]). This author’s research is supported, in part, by NSF AI Institute for Advances in Optimization Award 2112533.
Abstract

We develop and analyze several different second-order algorithms for computing a near-optimal solution path of a convex parametric optimization problem with a smooth Hessian. Our algorithms are inspired by a differential equation perspective on the parametric solution path and do not rely on the specific structure of the objective function. We present computational guarantees that bound the oracle complexity of achieving a near-optimal solution path under different sets of smoothness assumptions. Under these assumptions, our results improve on the best-known guarantees for grid search methods. We also develop second-order conjugate gradient variants that avoid exact Hessian computations and exact linear system solves. We present computational results that demonstrate the effectiveness of our methods over grid search methods on both real and synthetic datasets. On large-scale problems, we demonstrate significant speedups of the second-order conjugate gradient variants as compared to the standard versions of our methods.

1 Introduction

In many applications of interest, it is necessary to solve not just a single optimization problem but also an entire collection of related problems. In these settings, some or all of the objects involved in defining the objective function or constraints of an optimization problem depend on one or more parameters, and we would like to solve the problem as a function of these parameters. Generally, a parametric optimization problem can be written as:

(1)  $P(\lambda):\quad\min_{x\in S(\lambda)}F(x,\lambda),$

where $\lambda$ belongs to the set of interest $\Lambda\subseteq\mathbb{R}^{m}$, and the feasible sets satisfy $S(\lambda)\subseteq\mathbb{R}^{p}$. There are many problems of interest that are formulated as parametric optimization problems of the form Eq. 1. Subsequently, as indicated by Guddat et al. [19], there are several strong motivations to design algorithms for Eq. 1, including but not limited to: (i) the need to solve problems arising in application areas like regularized regression with cross-validation (see, e.g., Osborne et al. [33]) and model predictive control (see, e.g., Garcia et al. [14]), (ii) as a building block for developing globally convergent algorithms by the approach of path-following as is done in interior-point methods (see, e.g., Nesterov and Nemirovskii [31]), and (iii) the need to address multi-objective optimization, for instance, finding the Pareto frontier of a two-objective optimization problem. Depending on the assumptions made, the goal may be to find global/local optimal solutions or Karush–Kuhn–Tucker (KKT) points of the problem $P(\lambda)$ for $\lambda\in\Lambda$.

In the rest of the paper, we will focus on a more specific problem, in which we assume that: (i) the dependence on $\lambda$ is linear, that is, $F(x,\lambda)$ can be written as $f(x)+\lambda\cdot\Omega(x)$, (ii) both $f(\cdot)$ and $\Omega(\cdot)$ are convex functions with certain properties, and (iii) the feasible set $S(\lambda)$ is the entire vector space $\mathbb{R}^{p}$ for all $\lambda\in\Lambda$. That is, we focus on the parametric optimization problem:

(2)  $P(\lambda):\quad F_{\lambda}^{*}:=\min_{x\in\mathbb{R}^{p}}\left\{F_{\lambda}(x):=f(x)+\lambda\cdot\Omega(x)\right\},$

where $f(\cdot):\mathbb{R}^{p}\rightarrow\mathbb{R}$ and $\Omega(\cdot):\mathbb{R}^{p}\rightarrow\mathbb{R}$ are twice-differentiable functions such that $f(\cdot)$ is $\mu$-strongly convex for some $\mu\geq 0$ and $\Omega(\cdot)$ is $\sigma$-strongly convex for some $\sigma>0$, both with respect to the $\ell_{2}$-norm (denoted by $\|\cdot\|$ herein). For any $\lambda>0$, let

(3)  $x(\lambda):=\arg\min_{x\in\mathbb{R}^{p}}F_{\lambda}(x)$

denote the unique optimal solution of $P(\lambda)$ defined in Eq. 2. We are interested in the problem of (approximately) computing the set of optimal solutions $\{x(\lambda):\lambda\in\Lambda\}$, where $\Lambda=[\lambda_{\min},\lambda_{\max}]$ is the set of interest for some $0<\lambda_{\min}<\lambda_{\max}$, and we also refer to this set of solutions as the (exact) solution path. An important set of problems in practice and a popular line of research involves computing the solution path of regularized machine learning problems, including the LASSO as in Efron et al. [12], Osborne et al. [33] and the SVM problem as in Hastie et al. [20]. In these works, algorithms are designed to compute the exact piecewise-linear solution path. Also, in the context of interior-point methods for constrained convex optimization (see, for instance, Nesterov and Nemirovskii [31] and Renegar [36]), $f(\cdot)$ represents the objective function and $\Omega(\cdot)$ represents the barrier function induced by the constraints of the original problem. Note that the application to interior-point methods requires a slightly more general variant of problem (2) where $\Omega(\cdot):\mathbb{R}^{p}\rightarrow\mathbb{R}\cup\{+\infty\}$. Interior-point methods generally start with the problem $P(\lambda)$ for a moderately large $\lambda_{0}$ and terminate when $\lambda_{k}<\delta$ for some small enough positive threshold $\delta$.
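As a concrete (hypothetical) instance of problem (2), the following sketch takes $f(\cdot)$ to be a logistic loss (convex, so $\mu=0$) and $\Omega(x)=\tfrac{1}{2}\|x\|^{2}$ (so $\sigma=1$); the data `A` and labels `b` are synthetic placeholders and are not taken from the paper.

```python
import numpy as np

# Hypothetical instance of problem (2): logistic loss (convex, mu = 0)
# plus the quadratic regularizer Omega(x) = 0.5 * ||x||^2 (sigma = 1).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))       # synthetic features
b = rng.choice([-1.0, 1.0], size=50)   # synthetic labels in {-1, +1}

def f(x):
    # average logistic loss: mean of log(1 + exp(-b_i * a_i' x))
    return np.mean(np.logaddexp(0.0, -b * (A @ x)))

def Omega(x):
    return 0.5 * x @ x

def F(x, lam):
    # objective of P(lambda) in (2)
    return f(x) + lam * Omega(x)
```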

Recently, there has been growing interest in developing algorithms for computing an approximate solution path of a generic problem such as (2). Rosset [37] considers applying exact Newton steps on prespecified grids, and Ndiaye et al. [29] consider adaptive methods to discretize the interval $[\lambda_{\min},\lambda_{\max}]$, for example. These grid search type methods, which discretize the interval $[\lambda_{\min},\lambda_{\max}]$ and subsequently solve a sequence of individual optimization problems, take a very black-box approach. A natural question is: can we “open the black box” by developing a methodology that better exploits the structure of the solution path? We answer this question positively by introducing a differential equation perspective on the solution path, which enables us to better reveal and exploit its underlying structure. This deeper understanding enables us to build more efficient algorithms and to present improved computational guarantees.

In particular, we derive an ordinary differential equation with an initial condition whose solution is the exact solution path of (2). The dynamics of the ODE that we derive resemble, but are distinct from, the dynamics of a path-wise version of Newton’s method. Based on the ODE, we propose efficient algorithms to generate approximate solution paths $\hat{x}(\cdot):[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}^{p}$ and provide the corresponding complexity analysis. The metric we consider is the $\ell_{2}$-norm of the gradient of the regularized problem, namely $\|\nabla F_{\lambda}(\hat{x}(\lambda))\|_{2}$, and we use the largest norm along the approximate path, $\sup_{\lambda\in\Lambda}\|\nabla F_{\lambda}(\hat{x}(\lambda))\|_{2}$, to represent the accuracy of an approximate path $\hat{x}(\lambda)$ (as formally defined in Definition 3.1). To analyze the computational cost of our proposed algorithms, we consider the oracle complexity – either in terms of full Hessian or Hessian-vector product/gradient evaluations – to obtain an $\epsilon$-accurate solution path. Note that considering the oracle complexity is in contrast to other works that consider the number of individual optimization problems that need to be solved (see, for example, Giesen et al. [16], Ndiaye et al. [29]), as well as the number of ordinary differential equations that need to be solved (see, for example, Zhou and Lange [46]).

1.1 Contributions

The first set of contributions of this paper concerns the perspective on the solution path of (2) from an ordinary differential equation point of view. We derive an ordinary differential equation with an initial condition whose solution is the solution path of (2), based on the first-order optimality conditions of (2). With this observation, we propose a novel and efficient way to approximate the entire solution path. Unlike existing results on the solution paths of LASSO or SVM problems, our derivation does not rely on the special structure of the optimization problem and holds for general objective functions.

The second set of contributions of this paper concerns the design of efficient algorithms and the corresponding oracle complexity analysis. Classical error analysis of numerical ordinary differential equation methods [15, 40] provides only asymptotic results, and the global error has an exponential dependence on the Lipschitz constant and the length of the time interval. In contrast, we design new update schemes to compute an approximate solution path and develop nonasymptotic complexity bounds. In particular, we apply a semi-implicit Euler method to the ordinary differential equation in order to compute approximate optimal solutions under a finite set of penalty coefficients. Then, we incorporate linear interpolation, which is usually missing either in practice or in lower bound complexity analyses, to generate nearly optimal solutions under the other penalty coefficients within the range of parameter values of interest. The two-step algorithm guarantees an $\epsilon$-accurate solution path within at most $\mathcal{O}(\frac{1}{\epsilon})$ gradient and Hessian evaluations as well as linear system solves. When the objective function has higher-order smoothness properties, we modify the traditional trapezoid method in numerical differential equations and design a new update scheme, which guarantees an $\epsilon$-accurate path within at most $\mathcal{O}(\frac{1}{\sqrt{\epsilon}})$ Hessian evaluations. It is important to emphasize that the complexity results in this paper are in terms of the number of operations (for example, Hessian evaluations), rather than the number of sub-problems that need to be solved (for example, solving a single numerical ODE or individual optimization problems) as has been studied in prior work such as Giesen et al. [16], Zhou and Lange [46] and Ndiaye et al. [29]. Further extensions of our proposed algorithms to Runge-Kutta methods are possible, but we omit them for brevity. We also provide a detailed computational evaluation of our algorithms and existing methods, including several experiments with synthetic data, the breast cancer dataset [11], and the leukemia dataset [18].

The third set of contributions of the paper concerns second-order conjugate gradient type methods and computational guarantees in the presence of inexact gradient and Hessian oracles, as well as approximate linear equation solvers. When the dimension of the problem is high, computing the Hessian and/or solving linear systems becomes a computational bottleneck. To avoid this, one would like an algorithm that only requires approximate directions at each iteration. We first consider the case where the (absolute) numerical error incurred in the calculation of a new direction $d_{k}$ is bounded by some $\delta_{k}>0$. We show that our algorithms are robust to numerical error in the sense that the additional errors of inexact directions do not accumulate and do not depend on the condition number. We extend the complexity analysis to the case where the numerical error $\delta_{k}$ has a uniform upper bound $\alpha\epsilon$ for $\alpha\in(0,1)$ and show that the Euler method maintains $\mathcal{O}(\frac{1}{\epsilon})$ complexity, and the trapezoid method maintains $\mathcal{O}(\frac{1}{\sqrt{\epsilon}})$ complexity when there is higher-order smoothness. We then propose variations of the aforementioned algorithms that only require gradient and Hessian-vector product oracles, rather than gradient and Hessian oracles as well as a linear system solver. We also leverage the previous analysis in the case of inexact directions in order to provide computational complexity results for the second-order conjugate gradient-type algorithms, which have the same order in $\epsilon$ as the results for the exact methods mentioned above. Our results demonstrate that our algorithms are more robust, and the second-order conjugate gradient variations require less computational cost compared to existing methods.

1.2 Related Literature

We now discuss previous work related to our algorithm and analysis from three distinct aspects.

Other path methods and comparison of results

As previously mentioned, for the LASSO and SVM problems, the exact solution path is piecewise linear and can be computed by path-following methods such as the least angle regression (LARS) algorithm proposed by Efron et al. [12], Hastie et al. [20] and Osborne et al. [33]. Additional problems whose solution paths are piecewise linear are considered by Rosset and Zhu [38], and Mairal and Yu [24] showed that the number of breakpoints in the solution path can be exponential in the number of data points. Generalized linear regression problems with $\ell_{1}$ regularization are considered by Yuan and Zou [44] and Zhou and Wu [47] via LARS-based methods. Another line of work focuses on computing approximate solution paths for specific problems, including the elastic net in Friedman et al. [13], the SVM problem in Bach et al. [7] and [16], matrix completion with nuclear norm regularization in Mazumder et al. [26], and other structural convex problems in Giesen et al. [17] and Loosli et al. [23]. For problems with nonconvex but coordinate-decomposable regularization functions, a coordinate descent-based algorithm was considered by Mazumder et al. [27] and Wang et al. [42]. Closest to our problem setup, Rosset [37] considered a general problem where $f(\cdot)$ and $\Omega(\cdot)$ have third-order derivatives and provided an algorithm which applies exact Newton steps on equally spaced grids starting from the optimal solution of the nonregularized problem. The lower bound complexity analysis when the approximate solution path is limited to a piecewise constant function is considered by Giesen et al. [16], Ndiaye et al. [29].

Related global complexity analysis of second-order methods

Newton-like methods are an important class of algorithms in optimization. Some notable lines of work include interior-point methods (Nesterov and Nemirovskii [31], Renegar [36]) and applications in regression problems (Kim et al. [22]) as well as the Levenberg-Marquardt method (Moré [28]). In practice, techniques are often incorporated to ensure global convergence, such as line searches (Ralph [35]) and trust-region techniques (Conn et al. [10]). The global complexity analysis of Newton and higher-order methods with regularization has also received much recent interest, such as the work of Nesterov [30], Nesterov and Polyak [32] and Polyak [34] and the references therein. In our paper, we make similar types of assumptions as in the global complexity analysis of regularized second and higher-order methods and we also prove global complexity results for the class of second-order algorithms considered herein.

Related work on differential equations and optimization methods

Early works, including the work of Alvarez and Attouch [2] and Alvarez et al. [3], analyzed inertial dynamical systems driven by gradient and Newton flows with application to optimization problems. Newton-like dynamic systems of monotone inclusions with connections to Levenberg-Marquardt and regularized Newton methods were also analyzed by Abbas et al. [1], Attouch and Svaiter [4] and Bot and Csetnek [9]. Due to the recent popularity of the accelerated gradient descent method in the machine learning and optimization communities, the limiting dynamics of different optimization methods and their discretization have thus received much renewed interest in recent years; see Attouch et al. [6], Scieur et al. [39], Su et al. [41], Wibisono et al. [43], Zhang et al. [45] and the references therein.

1.3 Organization

The paper is organized as follows. In Section 2, we derive the ordinary differential equation with an initial condition whose solution is the exact solution path (3) and establish the existence and uniqueness of the solution of the differential equation. In Section 3, we leverage the ODE to develop a numerical algorithm to compute an approximate solution path of (2), and derive the corresponding complexity analysis in Theorem 3.2. In Section 4, we propose a multistage method that is beneficial when the functions $f(\cdot)$ and $\Omega(\cdot)$ have higher-order smoothness, and we also provide its complexity analysis in Theorem 4.1. In Section 5, we extend our results to the presence of inexact oracles, and as a direct application we propose second-order conjugate gradient variants of the aforementioned algorithms, which avoid exact Hessian evaluations and exact solutions of linear systems. Section 6 contains detailed computational experiments comparing the proposed algorithms and grid search methods on real and synthetic datasets.

1.4 Notation

For a positive integer $n$, let $[n]:=\{1,\dots,n\}$. For a vector-valued function $y(t):\mathbb{R}\rightarrow\mathbb{R}^{p}$ that can be written as $y(t)=(y_{1}(t),\dots,y_{p}(t))$, we say that $y(\cdot)$ is differentiable if $y_{i}(\cdot)$ is differentiable for all $i=1,\dots,p$, and let $\frac{\textnormal{d}y}{\textnormal{d}t}$ be the derivative of $y(t)$, namely $\frac{\textnormal{d}y}{\textnormal{d}t}=(\frac{\textnormal{d}y_{1}}{\textnormal{d}t},\dots,\frac{\textnormal{d}y_{p}}{\textnormal{d}t})$. Let ${\bm{1}}_{p}$ and ${\bm{1}}_{p\times p}$ denote the $p$-dimensional all-ones vector and the $p\times p$ all-ones matrix, respectively. Throughout the paper, we fix the norm $\|\cdot\|$ on $\mathbb{R}^{p}$ to be the $\ell_{2}$-norm, which is defined by $\|x\|:=\|x\|_{2}=\sqrt{x^{T}x}$. Also, in a slight abuse of notation, we use $\|\cdot\|$ to represent the operator norm, i.e., the induced $\ell_{2}$-norm $\|\cdot\|_{2}$ on $\mathbb{R}^{n\times p}$, which is defined by $\|A\|:=\|A\|_{2,2}=\max_{\|x\|_{2}\leq 1}\|Ax\|_{2}$.

2 Ordinary Differential Equation Characterization of Solution Path

Let us begin by describing a differential equations perspective on the solution path Eq. 3 of Eq. 2 that will prove fruitful in the development of efficient computational methods. First, we introduce a reparameterization in terms of an auxiliary variable $t\geq 0$ (thought of as “time”), whereby for a given $T>0$ we introduce functions $\lambda(\cdot):[0,T]\to[\lambda_{\min},\lambda_{\max}]$ and $\xi(\cdot):[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}$ such that: $\xi(\cdot)$ is Lipschitz, $\lambda(\cdot)$ is differentiable on $(0,T)$, and we have $\frac{\textnormal{d}\lambda}{\textnormal{d}t}=\xi(\lambda(t))$ for all $t\in(0,T)$. In a slight abuse of notation, we define the path with respect to $t$ as $\{x(t):=x(\lambda(t)):t\in[0,T]\}$. Now notice that, for any $t\in[0,T]$, the first-order optimality condition for the problem $P(\lambda(t))$ states that $\nabla f(x(t))+\lambda(t)\nabla\Omega(x(t))=0$. By differentiating both sides of the previous equation with respect to $t$, we have $\nabla^{2}f(x(t))\cdot\frac{\textnormal{d}x}{\textnormal{d}t}+\nabla\Omega(x(t))\cdot\frac{\textnormal{d}\lambda}{\textnormal{d}t}+\lambda(t)\nabla^{2}\Omega(x(t))\cdot\frac{\textnormal{d}x}{\textnormal{d}t}=0$. Rearranging the above and again using $\frac{\textnormal{d}\lambda}{\textnormal{d}t}=\xi(\lambda(t))$ yields $\frac{\textnormal{d}x}{\textnormal{d}t}=-\left(\nabla^{2}f(x(t))+\lambda(t)\nabla^{2}\Omega(x(t))\right)^{-1}\xi(\lambda(t))\nabla\Omega(x(t))$. Then, applying the fact that $\nabla f(x(t))+\lambda(t)\nabla\Omega(x(t))=0$ yields $\frac{\textnormal{d}x}{\textnormal{d}t}=\left(\nabla^{2}f(x(t))+\lambda(t)\nabla^{2}\Omega(x(t))\right)^{-1}\cdot\frac{\xi(\lambda(t))}{\lambda(t)}\nabla f(x(t))$. Thus, we arrive at the following autonomous system

(4)  $\frac{\textnormal{d}\lambda}{\textnormal{d}t}=\xi(\lambda),\quad\frac{\textnormal{d}x}{\textnormal{d}t}=v(x,\lambda):=\left(\nabla^{2}f(x)+\lambda\nabla^{2}\Omega(x)\right)^{-1}\frac{\xi(\lambda)}{\lambda}\nabla f(x),$

for $t\in[0,T]$.
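As a minimal illustration of the vector field in (4), the following sketch evaluates $v(x,\lambda)$ for a strongly convex quadratic $f(x)=\tfrac{1}{2}x^{T}Qx+c^{T}x$ and $\Omega(x)=\tfrac{1}{2}\|x\|^{2}$, with the default choice $\xi(\lambda)=-\lambda$ used later in the paper; `Q` and `c` are illustrative placeholders, not objects from the paper.

```python
import numpy as np

def v(x, lam, Q, c, xi=lambda lam: -lam):
    """Vector field of (4) for f(x) = 0.5 x'Qx + c'x and Omega(x) = 0.5||x||^2,
    so that grad f = Qx + c, Hess f = Q, and Hess Omega = I."""
    grad_f = Q @ x + c
    H = Q + lam * np.eye(len(x))            # Hess f + lam * Hess Omega
    return np.linalg.solve(H, (xi(lam) / lam) * grad_f)

# Tiny usage example with placeholder data
Q = np.diag([1.0, 2.0, 3.0])
c = np.array([1.0, -1.0, 0.5])
print(v(np.zeros(3), lam=1.0, Q=Q, c=c))
```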

By considering specific choices of $\xi(\cdot)$ and $\Omega(\cdot)$, the system (4) generalizes some previously studied methodologies in parametric optimization. First, consider the scenario with an equally spaced discretization of the interval $[0,T]$, namely $t_{k}=k\cdot h$ for some fixed step-size $h>0$. Thus, the sequence $\lambda_{k}:=\lambda(t_{k})$ is approximately given by $\lambda_{k+1}\approx\lambda_{k}+h\cdot\xi(\lambda_{k})$. Intuitively, the choice of $\xi(\cdot)$ controls the dynamics of $\lambda(\cdot)$ and generalizes some previously considered sequences $\{\lambda_{k}\}$ for the problem (2). For example, letting $\xi(\lambda)\equiv 1$ we recover the arithmetic sequence in Rosset [37], and letting $\xi(\lambda)\equiv-\lambda$ we recover the geometric sequence in Ndiaye et al. [29]. In addition, consider the special case where $\Omega(x)=\frac{1}{2}\|x\|^{2}$. Then the dynamics for $x(t)$ in Eq. 4 are similar to the limiting dynamics of the proximal Newton method (also known as the Levenberg-Marquardt regularization procedure [25] for convex optimization problems). A similar dynamic for monotone inclusions, which includes finding a zero of the gradient of a convex function, is analyzed in Attouch et al. [5], Attouch and Svaiter [4].

Before developing and presenting algorithms designed to compute the approximate solution path based on (4), we first verify that the system (4) has a unique trajectory. The following proposition states conditions on $f(\cdot)$ and $\Omega(\cdot)$ such that $v(\cdot,\cdot)$ defined in Eq. 4 is continuous in $\lambda\in[\lambda_{\min},\lambda_{\max}]$ and is uniformly $L_{v}$-Lipschitz continuous with respect to $x$, namely, $\|v(x_{1},\lambda)-v(x_{2},\lambda)\|\leq L_{v}\|x_{1}-x_{2}\|$ for any $\lambda\in[\lambda_{\min},\lambda_{\max}]$ and $x_{1},x_{2}\in\mathbb{R}^{p}$. This uniform Lipschitz property ensures that the above system has a unique trajectory, which therefore coincides with the solution path defined in (3).

Proposition 2.1 (Theorem 5.3.1 of Gautschi [15]).

Suppose that $\nabla^{2}f(\cdot)$, $\nabla^{2}\Omega(\cdot)$, $\nabla f(\cdot)$, $f(\cdot)$ are all $L$-Lipschitz continuous, $f(\cdot)$ is $\mu$-strongly convex for $\mu\geq 0$, and $\Omega(\cdot)$ is $\sigma$-strongly convex for $\sigma>0$. Suppose further that $\xi(\cdot)$ is Lipschitz continuous on $[\lambda_{\min},\lambda_{\max}]$ and satisfies $|\frac{\xi(\lambda)}{\lambda}|\leq C$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$. Then, it holds that $v(\cdot,\cdot)$ defined in Eq. 4 is continuous in $\lambda\in[\lambda_{\min},\lambda_{\max}]$ and is uniformly $L_{v}$-Lipschitz continuous with $L_{v}=\frac{LC}{\mu+\lambda_{\min}\sigma}+\frac{L^{2}C(1+\lambda_{\max})}{(\mu+\lambda_{\min}\sigma)^{2}}$, and Eq. 4 has a unique trajectory $(\lambda(t),x(t))$ for $t\in[0,T]$.

3 Discretization and Complexity Analysis

In this section, we present algorithms to compute an approximate solution path based on discretizations of Eq. 4, along with the corresponding complexity analysis. The primary error metric we consider is the $\ell_{2}$-norm of the gradient across the entire interval $[\lambda_{\min},\lambda_{\max}]$, as formally presented in Definition 3.1.

Definition 3.1.

An approximate solution path $\hat{x}(\cdot):[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}^{p}$ to the parametric optimization problem Eq. 2 has accuracy $\epsilon\geq 0$ if $\|\nabla F_{\lambda}(\hat{x}(\lambda))\|\leq\epsilon$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$.
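In practice, the accuracy of Definition 3.1 can be estimated by evaluating the gradient norm on a fine grid of $\lambda$ values. The sketch below does this for user-supplied callables (`x_hat`, `grad_f`, `grad_Omega`, all assumptions for illustration); it only approximates the supremum over $[\lambda_{\min},\lambda_{\max}]$.

```python
import numpy as np

def path_accuracy(x_hat, grad_f, grad_Omega, lam_min, lam_max, n_grid=1000):
    """Approximate sup over lambda of ||grad F_lambda(x_hat(lambda))||
    by evaluating on a geometric grid (cf. Definition 3.1)."""
    lams = np.geomspace(lam_min, lam_max, n_grid)
    return max(
        np.linalg.norm(grad_f(x_hat(lam)) + lam * grad_Omega(x_hat(lam)))
        for lam in lams
    )

# Example: f(x) = 0.5 (x-1)^2, Omega(x) = 0.5 x^2, exact path x(lam) = 1/(1+lam)
print(path_accuracy(lambda lam: np.array([1.0 / (1.0 + lam)]),
                    grad_f=lambda x: x - 1.0,
                    grad_Omega=lambda x: x,
                    lam_min=0.1, lam_max=10.0))   # ~0 up to rounding
```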

Notice that the strong convexity of the objective function $F_{\lambda}(\cdot)$ for all $\lambda>0$ immediately implies that an $\epsilon$-accurate solution path $\hat{x}(\cdot)$ also has the optimality gap guarantee $F_{\lambda}(\hat{x}(\lambda))-F_{\lambda}^{\ast}\leq\frac{\epsilon^{2}}{2(\mu+\lambda\sigma)}\leq\frac{\epsilon^{2}}{2(\mu+\lambda_{\min}\sigma)}$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$. Algorithm 1 below presents a two-step “meta-algorithm” for computing an approximate solution path $\hat{x}(\cdot)$. Inspired by numerical methods for solving ordinary differential equations, we first design several update schemes that iteratively update $(x_{k},\lambda_{k})$ by exploiting the dynamics in Eq. 4. We use the function $\psi(\cdot,\cdot):\mathbb{R}^{p}\times[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}^{p}\times[\lambda_{\min},\lambda_{\max}]$ to denote a generic update rule in the meta-algorithm below, and we consider several different specific examples herein. Then we apply an interpolation method $\mathcal{I}(\cdot)$ to resolve the previously computed sequence of points into an approximate path $\hat{x}(\cdot)$ over $[\lambda_{\min},\lambda_{\max}]$.

Algorithm 1 Meta-algorithm for computing an approximate solution path $\hat{x}(\cdot)$
1:  Input: initial point $x_{0}\in\mathbb{R}^{p}$, total number of iterations $K\geq 1$, update rule $\psi(\cdot,\cdot)$, and interpolation method $\mathcal{I}(\cdot)$
2:  Initialize regularization parameter $\lambda_{0}\leftarrow\lambda_{\max}$
3:  for $k=0,\dots,K-1$ do
4:     Update $(x_{k+1},\lambda_{k+1})\leftarrow\psi(x_{k},\lambda_{k})$
5:  end for
6:  return  $\hat{x}(\cdot)\leftarrow\mathcal{I}\left(\{(x_{k},\lambda_{k})\}_{k=0}^{K}\right)$

We develop oracle complexity results for different update schemes and interpolation methods in terms of the number of gradient computations, Hessian computations, and linear system solves required to compute an $\epsilon$-accurate approximate path. In this section, we stick to a simple version of Algorithm 1 based on the application of the semi-implicit Euler method and linear interpolation to specify the update rule $\psi(\cdot,\cdot)$ and the interpolation method $\mathcal{I}(\cdot)$. In particular, the semi-implicit Euler discretization of Eq. 4 is

(5)  $\lambda_{k+1}=\lambda_{k}+h\cdot\xi(\lambda_{k}),\quad x_{k+1}=x_{k}+h\cdot v(x_{k},\lambda_{k+1}),$

and the linear interpolation $\mathcal{I}_{\textnormal{linear}}(\cdot):\{(x_{k},\lambda_{k})\}_{k=0}^{K}\to\hat{x}(\cdot)$ is defined by $\hat{x}(\lambda):=\alpha x_{k}+(1-\alpha)x_{k+1}$ with $\alpha=\frac{\lambda-\lambda_{k+1}}{\lambda_{k}-\lambda_{k+1}}$ for all $\lambda\in[\lambda_{k+1},\lambda_{k}]$ and $k\in\{0,\ldots,K-1\}$.

The algorithm with the exponentially decaying parameter sequence

Recalling the update rule Eq. 5, we see that the function $\xi(\cdot)$ (or equivalently, $\lambda(\cdot)$) is still to be determined. In practical settings, the value of $\lambda$ usually decreases exponentially from $\lambda_{\max}$ to $\lambda_{\min}$. This choice of penalty scale parameters $\{\lambda_{k}\}$ arises in the solution path for linear models, see Friedman et al. [13], and in interior-point methods, see Renegar [36]. Although our analysis holds for a broad class of $\lambda(\cdot)$, we first present the version with an exponentially decaying parameter sequence, namely $\lambda(t)=e^{-t}\lambda(0)$. This specific version of Algorithm 1 is formally described in Algorithm 2.

Algorithm 2 Euler method for computing an approximate solution path $\hat{x}(\cdot)$
1:  Input: initial point $x_{0}\in\mathbb{R}^{p}$, total number of iterations $K\geq 1$
2:  Initialize regularization parameter $\lambda_{0}\leftarrow\lambda_{\max}$, set step-size $h\leftarrow 1-(\frac{\lambda_{\min}}{\lambda_{\max}})^{\frac{1}{K}}$
3:  for $k=0,\dots,K-1$ do
4:     $\lambda_{k+1}\leftarrow\lambda_{k}-h\cdot\lambda_{k}$
5:     $x_{k+1}\leftarrow x_{k}-h\left(\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k})\right)^{-1}\nabla f(x_{k})$
6:  end for
7:  return  $\hat{x}(\cdot)\leftarrow\mathcal{I}_{\textnormal{linear}}\left(\{(x_{k},\lambda_{k})\}_{k=0}^{K}\right)$ according to linear interpolation
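The following sketch implements the loop of Algorithm 2 for the quadratic test problem $f(x)=\tfrac{1}{2}x^{T}Qx+c^{T}x$, $\Omega(x)=\tfrac{1}{2}\|x\|^{2}$ (placeholders, not from the paper), so that step 5 amounts to solving $(Q+\lambda_{k+1}I)d=\nabla f(x_{k})$; the returned breakpoints are meant to be passed to the linear interpolation of Section 3.2.

```python
import numpy as np

def euler_path(x0, Q, c, lam_max, lam_min, K):
    """Algorithm 2 for f(x) = 0.5 x'Qx + c'x and Omega(x) = 0.5||x||^2."""
    h = 1.0 - (lam_min / lam_max) ** (1.0 / K)        # step 2
    lams, xs = [lam_max], [x0.copy()]
    lam, x = lam_max, x0.copy()
    for _ in range(K):
        lam = lam - h * lam                           # step 4
        grad_f = Q @ x + c
        d = np.linalg.solve(Q + lam * np.eye(len(x)), grad_f)
        x = x - h * d                                 # step 5
        lams.append(lam)
        xs.append(x.copy())
    return np.array(lams), np.array(xs)               # breakpoints for interpolation

# Placeholder data; x0 exactly solves P(lam_max) for this quadratic instance
Q = np.diag([1.0, 2.0, 3.0])
c = np.array([1.0, -1.0, 0.5])
lam_max, lam_min, K = 100.0, 0.1, 500
x0 = -np.linalg.solve(Q + lam_max * np.eye(3), c)
lams, xs = euler_path(x0, Q, c, lam_max, lam_min, K)
# distance of the final iterate to the exact minimizer at lam_min (should be small)
print(np.linalg.norm(xs[-1] - (-np.linalg.solve(Q + lam_min * np.eye(3), c))))
```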

Before going further into the detailed analysis, we first state the computational guarantee of Algorithm 2. In our complexity analysis, we make the following smoothness assumptions on $f(\cdot)$ and $\Omega(\cdot)$.

Assumption.

In addition to $\mu$-strong convexity of $f(\cdot)$ and $\sigma$-strong convexity of $\Omega(\cdot)$, these functions have $L$-Lipschitz gradients and Hessians, where $L>0$ is an upper bound on the four relevant Lipschitz constants. Furthermore, we assume that $f^{\ast}:=\min_{x}f(x)>-\infty$ and that $G>0$ is an upper bound on the norm of the gradients of $f(\cdot)$ and $\Omega(\cdot)$ on the level set $\{x\in\mathbb{R}^{p}:f(x)\leq f(x_{0})\}$.

Theorem 3.2 is our main result concerning the complexity of Algorithm 2 and demonstrates that, in terms of the accuracy parameter $\epsilon$, Algorithm 2 requires $\mathcal{O}(1/\epsilon)$ iterations to compute an $\epsilon$-accurate solution path.

Theorem 3.2.

Suppose that the assumption in Section 3 holds, let $\epsilon>0$ be the desired accuracy, and suppose that the initial point $x_{0}$ satisfies $\|\nabla F_{\lambda_{\max}}(x_{0})\|\leq\frac{\epsilon}{4}$. Let $T:=\log(\lambda_{\max}/\lambda_{\min})$, let $\tau=\max\{\frac{1+\lambda_{\min}}{\mu+\lambda_{\min}\sigma},\frac{1+\lambda_{\max}}{\mu+\lambda_{\max}\sigma}\}$, and let

(6)  $K_{\textnormal{E}}:=\left\lceil\max\left\{2T,\ \frac{\sqrt{LG}\tau T}{\sqrt{3}},\ \frac{4(f(x_{0})-f^{\ast})\tau LT}{\epsilon},\ \frac{2\sqrt{L}(\tau G+1)T}{\sqrt{\epsilon}}\right\}\right\rceil.$

If the total number of iterations $K$ satisfies $K\geq K_{\textnormal{E}}$, then Algorithm 2 returns an $\epsilon$-accurate solution path.

Remark 3.3.

Grid search type methods for computing approximate solution paths are proposed in Giesen et al. [16], Ndiaye et al. [29], and we follow the analysis of the latter, which covers more general cases. To ensure that the function value gap satisfies $h(x)-h^{*}\leq\epsilon^{\prime}$, we require the number of grid points to be $K=\frac{\sqrt{\tau}GT}{\sqrt{\epsilon^{\prime}}}$. For an $L$-smooth function $h(\cdot)$, $h(x)-h^{*}\leq\frac{\epsilon^{2}}{2L}$ implies $\|\nabla h(x)\|\leq\epsilon$, which is the accuracy measure used in our paper. Therefore, we need to set $\epsilon^{\prime}=\frac{\epsilon^{2}}{2L}$, and hence the number of grid points is $K=\frac{\sqrt{\tau L}GT}{\epsilon}$ when the desired accuracy of the inner problem is set to $\epsilon_{c}=\frac{\epsilon^{\prime}}{2}$. Using the exact Newton method with the last grid point as a warm start to solve the inner problem, Karimireddy et al. [21] implies that the inner complexity is $(\tau L)^{2}\log 2$, and therefore the total complexity is $\frac{(\tau L)^{5/2}GT\log 2}{\epsilon}$.

From the results in Theorem 3.2 and Remark 3.3, we can see that both our method and the grid search method have a complexity of order $\mathcal{O}(\frac{1}{\epsilon})$. In most practical cases, $f(x_{0})-f^{*}$ is smaller than $(\tau L)^{3/2}G\log 2$, which implies that the constant in Eq. 6 is better. In Section 4, we propose algorithms that utilize the higher-order smoothness of $f(\cdot)$ and $\Omega(\cdot)$ and achieve better complexity results.

3.1 Semi-Implicit Euler Update Scheme

Herein we provide the computational guarantee of the semi-implicit Euler update scheme defined in Eq. 5, where the main object of interest is the accuracy at each grid point. Let $r_{k}:=\|\nabla F_{\lambda_{k}}(x_{k})\|$ denote the accuracy at $(x_{k},\lambda_{k})$. We consider a broad family of $\xi(\cdot)$ in the following analysis, although in Algorithm 2 and the corresponding complexity analysis in Theorem 3.2, we only consider the special scenario where $\lambda(t)=e^{-t}\lambda(0)$. We first present the computational guarantees of Taylor expansion approximations given the Lipschitz continuity of the Hessian in Lemma 3.4.

Lemma 3.4 (Lemma 1 in Nesterov and Polyak [32]).

Suppose $\phi(\cdot):S\to\mathbb{R}$ has an $L$-Lipschitz Hessian for some convex set $S\subseteq\mathbb{R}^{d}$. Then the following inequalities hold for all $x,y\in S$:

  1. (i)

     $\left\|\nabla\phi(y)-\nabla\phi(x)-\nabla^{2}\phi(x)(y-x)\right\|\leq\frac{L}{2}\left\|y-x\right\|^{2}$.

  2. (ii)

     $\left\|\phi(y)-\phi(x)-\nabla\phi(x)^{T}(y-x)-\frac{1}{2}(y-x)^{T}\nabla^{2}\phi(x)(y-x)\right\|\leq\frac{L}{6}\left\|y-x\right\|^{3}$.

Based on the results in Lemma 3.4, we provide a local analysis of $r_{k}$, that is, of how the norm of the gradient at each grid point accumulates. Lemma 3.5 provides an upper bound on $r_{k+1}$ in terms of $r_{k}$, the accuracy at the previous iteration.

Lemma 3.5.

Suppose the assumption in Section 3 holds. Then the discretization Eq. 5 has the following guarantee for all $k\geq 0$:

$$r_{k+1}\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{2}\cdot\frac{L(1+\lambda_{k+1})}{2}\left\|v(x_{k},\lambda_{k+1})\right\|^{2}.$$

Proof 3.6.

It holds that

$$\begin{aligned}
r_{k+1} &= \left\|\nabla f(x_{k+1})+\lambda_{k+1}\nabla\Omega(x_{k+1})\right\| \\
&\leq \left\|\nabla f(x_{k+1})-\nabla f(x_{k})-\nabla^{2}f(x_{k})(x_{k+1}-x_{k})\right\| \\
&\quad+\left\|\lambda_{k+1}\left(\nabla\Omega(x_{k+1})-\nabla\Omega(x_{k})-\nabla^{2}\Omega(x_{k})(x_{k+1}-x_{k})\right)\right\| \\
&\quad+\left\|\lambda_{k+1}\left(\nabla\Omega(x_{k})+\nabla^{2}\Omega(x_{k})(x_{k+1}-x_{k})\right)+\nabla f(x_{k})+\nabla^{2}f(x_{k})(x_{k+1}-x_{k})\right\| \\
&\leq \frac{L}{2}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{\lambda_{k+1}L}{2}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{\lambda_{k+1}}{\lambda_{k}}\left\|\nabla f(x_{k})+\lambda_{k}\nabla\Omega(x_{k})\right\| \\
&= \frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{2}\cdot\frac{L(1+\lambda_{k+1})}{2}\left\|v(x_{k},\lambda_{k+1})\right\|^{2},
\end{aligned}$$

where the first inequality follows from the triangle inequality; in the second inequality, we apply Item i of Lemma 3.4 to the first two terms, while the two third terms are equal to each other by the definition of the update Eq. 5.

Lemma 3.5 provides the first technical result for the semi-implicit Euler update scheme. When the assumption in Section 3 holds, Eq. 5 has the computational guarantee $\tfrac{r_{j+1}}{\lambda_{j+1}}\leq\tfrac{r_{j}}{\lambda_{j}}+\mathcal{O}(h^{2})$. Telescoping these inequalities for $j=0,1,\dots,k-1$ with $k\leq K=T/h$ yields $r_{k}\sim\mathcal{O}(h)$. Therefore, one would expect a uniform $\mathcal{O}(h)$ bound on the accuracies $r_{k}$ at the points $x_{k}$ for all $k=0,\dots,K$. We formalize this idea in the following lemma.

Lemma 3.7.

Suppose the assumption in Section 3 holds, $\lambda_{k+1}\geq\lambda_{\min}$, $\xi(\lambda_{j})<0$ for all $j\leq k$, and the step-size $h$ satisfies the following condition for all $0\leq j<k$:

(7)  $h\leq\min\left\{\frac{\lambda_{j}}{-2\xi(\lambda_{j})},\ \sqrt{\frac{3\lambda_{j}(\mu+\lambda_{j+1}\sigma)^{2}}{-\xi(\lambda_{j})LG}}\right\}.$

Then the sequence $\{(x_{k},\lambda_{k})\}_{k=0}^{K}$ generated by the update scheme Eq. 5 satisfies $f(x_{k+1})\leq f(x_{k})$, and the corresponding accuracies $\{r_{k}\}_{k=0}^{K}$ have the following guarantee:

(8)  $\frac{r_{k}}{\lambda_{k}}\leq\frac{r_{0}}{\lambda_{0}}+2hL(f(x_{0})-f(x_{k}))\cdot\max_{j\in[k-1]}\left\{\frac{-(1+\lambda_{j+1})\xi(\lambda_{j})}{\lambda_{j}\lambda_{j+1}(\mu+\lambda_{j+1}\sigma)}\right\}.$

Proof 3.8.

We will use induction to show that $f(x_{j+1})\leq f(x_{j})$, which implies that $x_{k}\in S_{x_{0}}$ for all $k\in\{0,\dots,K\}$. First, suppose $x_{j}\in S_{x_{0}}$ for some $j\in\{0,\dots,K-1\}$. Let $d_{j}$ denote $v(x_{j},\lambda_{j+1})$, so that $\|d_{j}\|\leq\frac{G}{\mu+\lambda_{j+1}\sigma}$. Since $h+\frac{\lambda_{j}}{2\xi(\lambda_{j})}\leq 0$ and $h^{2}\leq\frac{3\lambda_{j}(\mu+\lambda_{j+1}\sigma)^{2}}{-\xi(\lambda_{j})LG}\leq\frac{3\lambda_{j}(\mu+\lambda_{j+1}\sigma)}{-\xi(\lambda_{j})L\left\|d_{j}\right\|}$, it holds that

$$\begin{aligned}
f(x_{j+1}) &\leq f(x_{j})+h\cdot\nabla f(x_{j})^{T}d_{j}+h^{2}\cdot\frac{1}{2}d_{j}^{T}\nabla^{2}f(x_{j})d_{j}+h^{3}\cdot\frac{1}{6}L\left\|d_{j}\right\|^{3} \\
&\leq f(x_{j})+\frac{h}{4}\cdot\frac{\lambda_{j}}{\xi(\lambda_{j})}\cdot(\mu+\lambda_{j+1}\sigma)\left\|d_{j}\right\|^{2}.
\end{aligned}$$

Since $\xi(\lambda_{j})<0$, it holds that $f(x_{j+1})\leq f(x_{j})$, which implies that $x_{j+1}\in S_{x_{0}}$. Then by induction we conclude that $x_{j}\in S_{x_{0}}$ for all $j\leq k$. Also, combining the result in Lemma 3.5, for all $j\leq k$, it holds that

$$\frac{r_{j+1}}{\lambda_{j+1}}\leq\frac{r_{j}}{\lambda_{j}}+h^{2}\cdot\frac{L(1+\lambda_{j+1})}{2\lambda_{j+1}}\left\|d_{j}\right\|^{2}\leq\frac{r_{j}}{\lambda_{j}}+h\cdot\frac{L(1+\lambda_{j+1})}{2\lambda_{j+1}}\cdot\frac{4(f(x_{j})-f(x_{j+1}))}{\frac{\lambda_{j}}{-\xi(\lambda_{j})}\cdot(\mu+\lambda_{j+1}\sigma)}.$$

Taking the summation over $j$ from $0$ to $k-1$, we obtain Eq. 8.

Lemma 3.7 provides the computational guarantees of Algorithm 2 with any decreasing sequence $\{\lambda_{k}\}$. Note that the upper bound on the accuracy at the point $x_{k}$ still involves a constant related to the sequence $\{\lambda_{k}\}$. We therefore consider a family of sequences $\{\lambda_{k}\}$ and derive a simpler upper bound.

Proposition 3.9.

Suppose the assumption in Section 3 holds, $\lambda_{k+1}\geq\lambda_{\min}$, $-\lambda\leq\xi(\lambda)<0$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$, and the step-size $h$ satisfies

(9)  $h\leq\min\left\{\frac{1}{2},\ \sqrt{\frac{3}{\tau^{2}LG}}\right\}.$

Then the iterates of the discretization Eq. 5 satisfy

(10)  $r_{k+1}\leq r_{0}+2h\tau L(f(x_{0})-f(x_{k+1})).$

Proof 3.10.

Since $\xi(\lambda_{j})\in[-\lambda_{j},0)$, we have $\frac{\lambda_{j}}{-\xi(\lambda_{j})}\geq 1$. Therefore, we have

$$\frac{\lambda_{j}}{-2\xi(\lambda_{j})}\geq\frac{1}{2}\quad\text{and}\quad\sqrt{\frac{3\lambda_{j}(\mu+\lambda_{j+1}\sigma)^{2}}{-\xi(\lambda_{j})LG}}\geq\sqrt{\frac{3(\mu+\lambda_{\min}\sigma)^{2}}{LG}},$$

and hence condition Eq. 9 implies condition Eq. 7. Also, since $\frac{1+\lambda}{\mu+\lambda\sigma}$ is monotone in $[\lambda_{\min},\lambda_{\max}]$, it holds that $\tau=\max_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1+\lambda}{\mu+\lambda\sigma}$. Therefore, we have

$$\max_{j\in[k-1]}\left\{\frac{-(1+\lambda_{j+1})\xi(\lambda_{j})}{\lambda_{j}\lambda_{j+1}(\mu+\lambda_{j+1}\sigma)}\right\}\leq\frac{1+\lambda_{k}}{\lambda_{k}(\mu+\lambda_{k}\sigma)}\leq\frac{\tau}{\lambda_{k}}.$$

Applying the above inequality to Eq. 8, we obtain Eq. 10.

Proposition 3.9 provides a uniform upper bound on the accuracy of all almost-optimal solutions $x_{k}$. When $\frac{\xi(\lambda)}{\lambda}\in[-1,0)$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$, Algorithm 2 generates a solution sequence $\{x_{k}\}$ such that $\|\nabla F_{\lambda_{k}}(x_{k})\|\sim\mathcal{O}(h)+r_{0}$.

3.2 Linear Interpolation

In the previous subsection, we provided a general accuracy analysis at all nearly-optimal solutions $x_{k}$. The next step is to construct the entire path $\hat{x}(\lambda):[\lambda_{\min},\lambda_{\max}]\rightarrow\mathbb{R}^{p}$ based on these nearly-optimal solutions. In this section, we analyze the second procedure in Algorithm 2, i.e., linear interpolation. First, we recall the definition of linear interpolation for $x(\cdot),\lambda(\cdot)$:

(11)  $\hat{\lambda}(t):=\alpha\lambda_{k}+(1-\alpha)\lambda_{k+1},\quad\hat{x}(t):=\alpha x_{k}+(1-\alpha)x_{k+1},$

where $\alpha:=\frac{t_{k+1}-t}{h}$, $t\in[t_{k},t_{k+1}]$, and $k\in\{0,\dots,K-1\}$. That is, given an arbitrary $\lambda\in[\lambda_{\min},\lambda_{\max}]$, we first select $t$ such that $\hat{\lambda}(t)=\lambda$, and then we output $\hat{x}:=\hat{x}(t)$ as a nearly-optimal solution to problem $P(\lambda)$. The following lemma provides a uniform upper bound on $\|\nabla F_{\lambda}(\hat{x}(\lambda))\|$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$.
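A small sketch of how the interpolated path (11) can be evaluated at an arbitrary $\lambda\in[\lambda_{\min},\lambda_{\max}]$, assuming the breakpoints $\lambda_{k}$ are stored in decreasing order as produced by Algorithm 2 or Algorithm 3 (names and inputs are illustrative, not from the paper):

```python
import numpy as np

def eval_path(lam, lams, xs):
    """Linear interpolation (11): lams is decreasing, xs[k] pairs with lams[k]."""
    # locate the segment [lams[k+1], lams[k]] containing lam
    k = np.searchsorted(-np.asarray(lams), -lam, side="right") - 1
    k = min(max(k, 0), len(lams) - 2)          # clamp to a valid segment
    alpha = (lam - lams[k + 1]) / (lams[k] - lams[k + 1])
    return alpha * xs[k] + (1.0 - alpha) * xs[k + 1]

# Example usage: x_hat_lam = eval_path(0.5, lams, xs) for Algorithm 2 output
```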

Theorem 3.11.

Suppose the assumption in Section 3 holds. Let $r_{\max}=\max_{0\leq k\leq K}r_{k}$. Then, the linear interpolation $\hat{x}(\cdot),\hat{\lambda}(\cdot)$ of the sequence $\{(x_{k},\lambda_{k})\}_{k=0}^{K}$ generated by Eq. 11 has the following computational guarantee for all $t\in[t_{0},t_{K}]$:

$$\left\|\nabla F_{\hat{\lambda}(t)}(\hat{x}(t))\right\|\leq r_{\max}+\frac{L}{8}\cdot\max_{k\in[K-1]}\left\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h|\xi(\lambda_{k})|\|x_{k+1}-x_{k}\|\right\}.$$

Proof 3.12.

First suppose $t\in[t_{k},t_{k+1}]$. For simplicity, define $x:=\hat{x}(t)$, $\lambda:=\hat{\lambda}(t)$, $\delta_{1}:=\nabla f(x_{k})+\lambda_{k}\nabla\Omega(x_{k})$, and $\delta_{2}:=\nabla f(x_{k+1})+\lambda_{k+1}\nabla\Omega(x_{k+1})$. By the triangle inequality and Item i of Lemma 3.4, it holds that

$$\left\|\alpha\nabla f(x_{k})+(1-\alpha)\nabla f(x_{k+1})-\nabla f(x)\right\|\leq\frac{\alpha(1-\alpha)L}{2}\left\|x_{k}-x_{k+1}\right\|^{2}.$$

Also, applying a similar argument to $\nabla\Omega(\cdot)$, it holds that

$$\begin{aligned}
&\left\|\lambda\nabla\Omega(x)-\alpha\lambda_{k}\nabla\Omega(x_{k})-(1-\alpha)\lambda_{k+1}\nabla\Omega(x_{k+1})\right\| \\
\leq\ &\left\|\lambda(\alpha\nabla\Omega(x_{k})+(1-\alpha)\nabla\Omega(x_{k+1})-\nabla\Omega(x))\right\|+\left\|\alpha(\lambda-\lambda_{k})\nabla\Omega(x_{k})+(1-\alpha)(\lambda-\lambda_{k+1})\nabla\Omega(x_{k+1})\right\| \\
\leq\ &\frac{\lambda L}{2}\alpha(1-\alpha)\left\|x_{k+1}-x_{k}\right\|^{2}+\left\|\alpha(1-\alpha)(\lambda_{k+1}-\lambda_{k})(\nabla\Omega(x_{k+1})-\nabla\Omega(x_{k}))\right\| \\
\leq\ &\frac{\lambda L}{8}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{h\left|\xi(\lambda_{k})\right|L}{4}\left\|x_{k+1}-x_{k}\right\|.
\end{aligned}$$

Combining the above two inequalities and applying the triangle inequality, we have

$$\begin{aligned}
&\left\|\nabla f(x)+\lambda\nabla\Omega(x)-\alpha\delta_{1}-(1-\alpha)\delta_{2}\right\| \\
\leq\ &\frac{L}{8}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{\lambda L}{8}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{\left|\xi(\lambda_{k})\right|Lh}{4}\left\|x_{k+1}-x_{k}\right\|.
\end{aligned}$$

Since $\left\|\alpha\delta_{1}+(1-\alpha)\delta_{2}\right\|\leq\alpha r_{k}+(1-\alpha)r_{k+1}\leq r_{\max}$ and $\lambda\leq\lambda_{k}$, the triangle inequality yields the stated bound.

Theorem 3.11 provides the computational guarantee of the linear interpolation of the sequence $\{(x_{k},\lambda_{k})\}_{k=0}^{K}$ generated by the update scheme Eq. 5. Observing that $\|x_{k+1}-x_{k}\|$ is of the order $\mathcal{O}(h)$, we see that the additional error incurred by linear interpolation is of the order $\mathcal{O}(h^{2})$. Together with the $\mathcal{O}(h)$ accuracy from the update scheme Eq. 5, we are able to provide a computational guarantee on the accuracy of the approximate path generated by Algorithm 2. In what follows, we provide a uniform bound on $\|\nabla F_{\hat{\lambda}(t)}(\hat{x}(t))\|$ for a family of $\lambda(t)$ and discretizations.

3.3 Computational Guarantee for the Exponentially Decaying Parameter Sequence

Under the scenario $\lambda(t)=\lambda_{\max}\cdot\exp(-t)$, we have $\xi(\lambda)=-\lambda$, and $\xi(\cdot)$ satisfies the assumption in Proposition 3.9. Therefore, we extend the result of Theorem 3.11 and provide an explicit uniform bound on the accuracy of the path.

Proposition 3.13.

Suppose the assumption in Section 3 holds, and the step-size $h$ satisfies the condition in Eq. 9. Let $\hat{x}(\cdot):[0,T]\to\mathbb{R}^{p}$ and $\hat{\lambda}(\cdot):[0,T]\to[\lambda_{\min},\lambda_{\max}]$ denote the approximate solution path generated by Algorithm 2. Then, we have the following computational guarantee for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$:

(12)  $\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\|\leq\left\|\nabla F_{\lambda_{\max}}(x_{0})\right\|+2h\tau L(f(x_{0})-f^{*})+\frac{h^{2}L}{8}\cdot(\tau G+1)^{2}.$

Proof 3.14.

First we extend the result in Proposition 3.9, and it holds that

$$r_{\max}\leq\max_{k\in[K]}r_{k}\leq r_{0}+2hL(f(x_{0})-f^{*})\cdot\tau.$$

Also, for the result in Theorem 3.11, we further have

$$\begin{aligned}
&\frac{L}{8}\cdot\max_{k\in[K-1]}\left\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h|\xi(\lambda_{k})|\|x_{k+1}-x_{k}\|\right\} \\
\leq\ &\frac{h^{2}L}{8}\cdot\max_{k\in[K-1]}\left\{\frac{(1+\lambda_{k})G^{2}}{(\mu+\lambda_{k}\sigma)^{2}}+\frac{2\lambda_{k}G}{\mu+\lambda_{k}\sigma}\right\}\leq\frac{h^{2}L}{8}\cdot(\tau G+1)^{2}.
\end{aligned}$$

By combining the above two inequalities, Proposition 3.9, and Theorem 3.11, we obtain Eq. 12.

In Algorithm 2, the sequence $\{\lambda_{k}\}_{k=0}^{K}$ is given by $\lambda_{k+1}=(1-h)\lambda_{k}$, which implies $\lambda_{\min}=(1-h)^{K}\lambda_{\max}$ and hence $h=1-(\tfrac{\lambda_{\min}}{\lambda_{\max}})^{1/K}$. Applying this to Proposition 3.13, we arrive at the complexity analysis with respect to the number of iterations. We formalize the complexity analysis in the following proof of Theorem 3.2, which was stated at the beginning of this section.

Proof 3.15 (Proof of Theorem 3.2).

The conditions $K\geq\max\left\{2T,\frac{\sqrt{LG}\tau T}{\sqrt{3}}\right\}$ and $h=1-\left(\frac{\lambda_{\min}}{\lambda_{\max}}\right)^{1/K}\leq\frac{T}{K}$ guarantee that the step-size $h$ satisfies Eq. 9. Also, $K\geq\frac{4(f(x_{0})-f^{\ast})\tau LT}{\epsilon}$ and $K\geq\frac{2\sqrt{L}(\tau G+1)T}{\sqrt{\epsilon}}$ guarantee that $2h\tau L(f(x_{0})-f^{*})\leq\frac{\epsilon}{2}$ and $\frac{h^{2}L}{8}\cdot(\tau G+1)^{2}\leq\frac{\epsilon}{4}$. Hence Algorithm 2 returns an $\epsilon$-accurate solution path.

Recall that the assumption of Theorem 3.2 requires a good initialization $x_{0}$ satisfying $\|\nabla F_{\lambda_{\max}}(x_{0})\|\leq\frac{\epsilon}{4}$. In practice, we can either run a convex optimization algorithm to find an $x_{0}$ satisfying the initial condition or use the initialization suggested as follows, which comes with a computational guarantee when the function $\Omega(\cdot)$ is structured, i.e., when the minimizer of $\Omega(\cdot)$ is easy to compute. Let $x_{\Omega}:=\arg\min_{x\in\mathbb{R}^{p}}\Omega(x)$ and $x_{0}:=x_{\Omega}-(\nabla^{2}f(x_{\Omega})+\lambda_{\max}\nabla^{2}\Omega(x_{\Omega}))^{-1}\nabla f(x_{\Omega})$. Intuitively, the initialization $x_{0}$ is a Newton step from the minimizer of $\Omega(\cdot)$, and we can show that $\|\nabla F_{\lambda_{\max}}(x_{0})\|\leq\frac{L(1+\lambda_{\max})\|\nabla f(x_{\Omega})\|^{2}}{2(\mu+\lambda_{\max}\sigma)^{2}}$. Notice that the values of $L$ and $\|\nabla f(x_{\Omega})\|$ are independent of $\lambda_{\max}$. Hence, when $\lambda_{\max}$ is large enough, we have $r_{0}\leq\frac{\epsilon}{4}$. Also, since $x_{\Omega}$ is the optimal solution when $\lambda=+\infty$, the initialization can be viewed as one update step of Eq. 5 starting from $x_{\Omega}$.
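For the quadratic placeholder problem used earlier ($f(x)=\tfrac{1}{2}x^{T}Qx+c^{T}x$, $\Omega(x)=\tfrac{1}{2}\|x\|^{2}$, so $x_{\Omega}=0$), the suggested initialization is a single Newton step from $x_{\Omega}$; in this quadratic case the step is exact, so the printed residual is zero up to rounding, while in general it only obeys the bound stated above.

```python
import numpy as np

# Suggested initialization for the quadratic toy problem (placeholder data):
# f(x) = 0.5 x'Qx + c'x, Omega(x) = 0.5||x||^2, hence x_Omega = 0.
Q = np.diag([1.0, 2.0, 3.0])
c = np.array([1.0, -1.0, 0.5])
lam_max = 100.0

x_Omega = np.zeros(3)                               # argmin of Omega
grad_f = Q @ x_Omega + c                            # = c at x_Omega
x0 = x_Omega - np.linalg.solve(Q + lam_max * np.eye(3), grad_f)

# Resulting initial accuracy ||grad F_{lam_max}(x0)||
print(np.linalg.norm(Q @ x0 + c + lam_max * x0))
```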

4 Multi-Stage Discretizations

In the analysis of the previous section, Algorithm 2 guarantees an $\epsilon$-accurate solution path within $\mathcal{O}(\epsilon^{-1})$ calls to the gradient and Hessian oracles and to a linear system solver. One advantage of the main result in Theorem 3.2 is that only smoothness of the Hessians of $f(\cdot)$ and $\Omega(\cdot)$ is required, and no assumption on $\epsilon$ is needed. When $f(\cdot)$ and $\Omega(\cdot)$ have better properties and $\epsilon$ is relatively small, one would like an algorithm that utilizes these properties and requires fewer oracle calls with respect to the order of $\epsilon$. Motivated by multistage numerical methods for solving differential equations, we design several update schemes that achieve higher-order accuracy. In particular, in this section we still consider the exponentially decaying parameter, that is, $\lambda(t)=e^{-t}\lambda(0)$.

4.1 The Trapezoid Method

In this section, we propose and analyze the trapezoid method, whose formal description is given in Algorithm 3. The trapezoid method is beneficial when the functions $f(\cdot)$ and $\Omega(\cdot)$ have Lipschitz continuous third-order derivatives, and it achieves higher-order accuracy than the semi-implicit Euler method. The accuracy of the output path of Algorithm 3 is of order $\mathcal{O}(h^{2})$, where $h$ is the step-size, or equivalently, $\mathcal{O}(K^{-2})$, where $K$ is the number of iterations. Moreover, we emphasize that the algorithm does not require an oracle for higher-order derivatives, but only a gradient and Hessian oracle as in the Euler method.

Algorithm 3 Trapezoid method for solution path
1:  Input: initial point $x_{0}\in\mathbb{R}^{p}$, total number of iterations $K\geq 1$
2:  Initialize parameter $\lambda_{0}\leftarrow\lambda_{\max}$, set step-size $h\leftarrow 1-\sqrt{2(\frac{\lambda_{\min}}{\lambda_{\max}})^{\frac{1}{K}}-1}$
3:  for $k=0,\dots,K-1$ do
4:     $d_{k,1}\leftarrow v(x_{k},\lambda_{k})$ and $d_{k,2}\leftarrow v(x_{k}+h\cdot d_{k,1},(1-h+h^{2})\lambda_{k})$
5:     $x_{k+1}\leftarrow x_{k}+h\cdot\frac{d_{k,1}+d_{k,2}}{2}$ and $\lambda_{k+1}\leftarrow(1-h+\frac{h^{2}}{2})\lambda_{k}$
6:  end for
7:  return  $\hat{x}(\cdot)\leftarrow\mathcal{I}_{\textnormal{linear}}\left(\{(x_{k},\lambda_{k})\}_{k=0}^{K}\right)$ according to linear interpolation
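A sketch of Algorithm 3 for the same quadratic placeholder problem as before, with $v(x,\lambda)$ as in (4) and $\xi(\lambda)=-\lambda$; note that the step-size rule in step 2 makes $\lambda_{k+1}=(1-h+h^{2}/2)\lambda_{k}$ reach exactly $\lambda_{\min}$ after $K$ iterations.

```python
import numpy as np

def v(x, lam, Q, c):
    # vector field (4) with xi(lam) = -lam, for f(x) = 0.5 x'Qx + c'x, Omega = 0.5||x||^2
    return -np.linalg.solve(Q + lam * np.eye(len(x)), Q @ x + c)

def trapezoid_path(x0, Q, c, lam_max, lam_min, K):
    """Algorithm 3: trapezoid update (13) plus the parameter recursion."""
    h = 1.0 - np.sqrt(2.0 * (lam_min / lam_max) ** (1.0 / K) - 1.0)   # step 2
    lams, xs = [lam_max], [x0.copy()]
    lam, x = lam_max, x0.copy()
    for _ in range(K):
        d1 = v(x, lam, Q, c)                                          # step 4
        d2 = v(x + h * d1, (1.0 - h + h ** 2) * lam, Q, c)
        x = x + h * (d1 + d2) / 2.0                                   # step 5
        lam = (1.0 - h + h ** 2 / 2.0) * lam
        lams.append(lam)
        xs.append(x.copy())
    return np.array(lams), np.array(xs)

# Placeholder usage
Q = np.diag([1.0, 2.0, 3.0]); c = np.array([1.0, -1.0, 0.5])
x0 = -np.linalg.solve(Q + 100.0 * np.eye(3), c)
lams, xs = trapezoid_path(x0, Q, c, 100.0, 0.1, 200)
```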

We first state the main technical assumption and computational guarantees of Algorithm 3.

Assumption.

In addition to the assumption in Section 3, we assume that the third-order directional derivatives of $f(\cdot)$ and $\Omega(\cdot)$ are $L$-Lipschitz continuous and that $\sigma\geq 1$.

Theorem 4.1.

Suppose the assumption in Section 4.1 holds, let $\epsilon>0$ be the desired accuracy, let $\tilde{\mu}:=\mu+\lambda_{\min}\sigma$, suppose that the initial point $x_{0}$ satisfies $\|\nabla F_{\lambda_{\max}}(x_{0})\|\leq\frac{\epsilon}{2}\leq\tilde{\mu}$, let $T:=1.1\log(\lambda_{\max}/\lambda_{\min})$, and let

$$K_{\textnormal{tr}}:=\left\lceil\max\left\{10T,\ \frac{8LT(1+G)}{\tilde{\mu}},\ \frac{6L^{1/2}(1+G)^{3/2}T}{\epsilon^{1/2}},\ \frac{5\tau^{2/3}L(1+G)^{4/3}T}{\epsilon^{1/3}}\right\}\right\rceil.$$

If the total number of iterations $K$ satisfies $K\geq K_{\textnormal{tr}}$, then Algorithm 3 returns an $\epsilon$-accurate solution path.

The result in Theorem 4.1 shows that we improve the total complexity to $\mathcal{O}(\frac{1}{\sqrt{\epsilon}})$, which is better than the $\mathcal{O}(\frac{1}{\epsilon})$ complexity of the Euler method and the best known results for the grid search method (see Theorem 3.2 and Remark 3.3). Similarly to the previous complexity analysis of the semi-implicit Euler method, the analysis of the trapezoid method consists of two parts: we first present the computational guarantee of the trapezoid update scheme, which is defined as

(13)  $(x_{k+1},\lambda_{k+1})\leftarrow T(x_{k},\lambda_{k}):=\left(x_{k}+h\cdot\tfrac{d_{k,1}+d_{k,2}}{2},\ (1-h+\tfrac{h^{2}}{2})\lambda_{k}\right),$

where $d_{k,1}=v(x_{k},\lambda_{k})$ and $d_{k,2}=v(x_{k}+h\cdot d_{k,1},(1-h+h^{2})\lambda_{k})$.

Lemma 4.2.

Suppose the assumption in Section 4.1 holds, $r_{k}=\|\nabla F_{\lambda_{k}}(x_{k})\|\leq\tilde{\mu}$, and the next iterate $(x_{k+1},\lambda_{k+1})$ is given by $(x_{k+1},\lambda_{k+1})=T(x_{k},\lambda_{k})$ defined in Eq. 13. Then, it holds that $(1+\lambda_{k})\|x_{k}-x_{k+1}\|\leq 3h(1+G)$, and

(14)  $r_{k+1}:=\left\|\nabla F_{\lambda_{k+1}}(x_{k+1})\right\|\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{3}\cdot 3L(1+G)^{3}+h^{4}\cdot 2L^{3}\tau^{2}(1+G)^{4}.$

We defer the proof of Lemma 4.2 to the appendix because of its length and complexity. In the proof, we mainly work with directional derivatives (see Nesterov [30] for details) and the accuracy of the Taylor expansion in Lemma 3.4. The result in Lemma 4.2 shows that the trapezoid update scheme in Eq. 13 guarantees a local error accumulation of $\mathcal{O}(h^{3})$. Moreover, we can derive an $\mathcal{O}(h^{2})$ uniform upper bound on the accuracy of all nearly optimal solutions $\{x_{k}\}$. For all other $\lambda\in[\lambda_{\min},\lambda_{\max}]$, we use linear interpolation to approximate the corresponding nearly-optimal solution. We now provide the formal proof of Theorem 4.1.

Proof 4.3 (Proof of Theorem 4.1).

Since $h-\frac{h^{2}}{2}\leq\frac{1}{K}\log(\frac{\lambda_{\max}}{\lambda_{\min}})$, it holds that

$$h\leq\min\left\{0.1,\ \frac{\tilde{\mu}}{8L(1+G)},\ \frac{\epsilon^{1/2}}{3L^{1/2}(1+G)^{3/2}},\ \frac{\epsilon^{1/3}}{3\tau^{2/3}L(1+G)^{4/3}}\right\}.$$

Then we show that $r_{k}:=\|\nabla F_{\lambda_{k}}(x_{k})\|\leq\frac{\epsilon}{2}$ for all $k$ by induction. Suppose $r_{k}\leq\frac{\epsilon}{2}$; then by Lemma 4.2, it holds that

$$r_{k+1}:=\left\|\nabla F_{\lambda_{k+1}}(x_{k+1})\right\|\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{3}\cdot 3L(1+G)^{3}+h^{4}\cdot 2L^{3}\tau^{2}(1+G)^{4}\leq\frac{\epsilon}{2}.$$

Therefore, rkϵ2r_{k}\leq\frac{\epsilon}{2} for all k{0,,K}k\in\{0,\dots,K\}. Suppose λ[λk+1,λk]\lambda\in[\lambda_{k+1},\lambda_{k}], and hence x^(λ)=αxk+(1α)xk+1\hat{x}(\lambda)=\alpha x_{k}+(1-\alpha)x_{k+1} where α=λλk+1λkλk+1\alpha=\frac{\lambda-\lambda_{k+1}}{\lambda_{k}-\lambda_{k+1}}. Applying the results in Theorem 3.11, we have

\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\|\leq\frac{\epsilon}{2}+\frac{L}{8}\max_{k\in[K-1]}\left\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h|\xi(\lambda_{k})|\|x_{k+1}-x_{k}\|\right\}\leq\frac{\epsilon}{2}+\frac{L}{8}\cdot\left(9h^{2}(1+G)^{2}+6h^{2}(1+G)\right)\leq\epsilon.
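
For concreteness, the linear interpolation used in the proof above can be sketched as follows; here lams is assumed to be the decreasing sequence of breakpoints $\lambda_{0}>\lambda_{1}>\dots>\lambda_{K}$ produced by the algorithm and xs the corresponding iterates $x_{k}$.

import numpy as np

def interpolate_path(lam, lams, xs):
    # Evaluate the approximate solution path at lam by linear
    # interpolation between the two neighboring breakpoints;
    # lams is assumed sorted in decreasing order.
    for k in range(len(lams) - 1):
        if lams[k + 1] <= lam <= lams[k]:
            alpha = (lam - lams[k + 1]) / (lams[k] - lams[k + 1])
            return alpha * np.asarray(xs[k]) + (1 - alpha) * np.asarray(xs[k + 1])
    raise ValueError("lam lies outside [lambda_min, lambda_max]")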

5 Analysis with Inexact Linear Equation Solutions and Second-Order Conjugate Gradient Variants

In this section, we present the complexity analysis of the aforementioned methods in the presence of inexact gradient and Hessian oracles and/or inexact solutions of the linear equations, as well as variants that apply second-order conjugate gradient (SOCG) type methods. We first consider the case in which the gradient and Hessian oracles are inexact and/or the linear equation solver yields approximate solutions.

5.1 Analysis with Inexact Oracle and/or Approximate Solver

At each iteration of the Euler, trapezoid, and Runge-Kutta methods, an essential subroutine is to compute the directions dk,id_{k,i}. For example, in the Euler method, dkd_{k} is given by the formula dk=v(xk,λk)=(2f(xk)+λk+12Ω(xk))1f(xk)d_{k}=v(x_{k},\lambda_{k})=-(\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k}))^{-1}\nabla f(x_{k}). In most large-scale problems, computing the Hessian matrix and solving the linear equations exactly are the computational bottlenecks. Therefore, we consider the presence of numerical error, which may be induced by an inexact gradient and Hessian oracle or by the linear equation solver. We treat both types of numerical error together.

Definition 5.1.

Suppose an exact solution dkd_{k} has the form dk=Hk1gkd_{k}=-H_{k}^{-1}g_{k}. For any constant δ0\delta\geq 0, d^k\hat{d}_{k} is a δ\delta-approximate direction with respect to dkd_{k} if Hkd^k+gkδ\|H_{k}\hat{d}_{k}+g_{k}\|\leq\delta. Furthermore, the δ\delta-approximate versions of Algorithm 2 or Algorithm 3 are the same as the original algorithms except that they use a δ\delta-approximate direction in each update.

Compared with the original update scheme in Algorithm 2, the only difference in the update scheme of the δ\delta-approximate version is that an approximate direction d^k\hat{d}_{k} is applied at each iteration instead of the exact direction v(,)v(\cdot,\cdot). We emphasize that there is no constraint on how the approximate direction d^k\hat{d}_{k} is generated; in Section 5.2 we provide several efficient methods to compute the approximate direction, along with the corresponding complexity analysis. The following two lemmas characterize the accumulation of local errors of the update scheme in the δ\delta-approximate versions of Algorithm 2 and Algorithm 3.
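
As a simple illustration of Definition 5.1, the sketch below checks whether a candidate direction is $\delta$-approximate; H and g stand for the matrix $H_{k}$ and vector $g_{k}$ of the corresponding linear system (for the Euler method, $H_{k}=\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k})$ and $g_{k}=\nabla f(x_{k})$), however they are produced.

import numpy as np

def is_delta_approximate(H, g, d_hat, delta):
    # Definition 5.1: d_hat is a delta-approximate direction with
    # respect to d = -H^{-1} g whenever ||H d_hat + g|| <= delta.
    return np.linalg.norm(H @ d_hat + g) <= delta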

Lemma 5.2.

Suppose the assumptions in Section 3 hold. Let d^k\hat{d}_{k} denote an approximate solution to v(xk,λk)v(x_{k},\lambda_{k}). Let δk=(2f(xk)+λk+12Ω(xk))d^k+f(xk)\delta_{k}=\left(\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k})\right)\hat{d}_{k}+\nabla f(x_{k}), rk=Fλk(xk)r_{k}=\|\nabla F_{\lambda_{k}}(x_{k})\|, and rk+1=Fλk+1(xk+1)r_{k+1}=\|\nabla F_{\lambda_{k+1}}(x_{k+1})\|. Then we have

(15) rk+1λk+1λkrk+h2L(1+λk+1)2d^k2+hδk.r_{k+1}\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+\frac{h^{2}L(1+\lambda_{k+1})}{2}\cdot\|\hat{d}_{k}\|^{2}+h\left\|\delta_{k}\right\|.

Furthermore, if we set λs+1=(1h)λs\lambda_{s+1}=(1-h)\lambda_{s} and δsδ\left\|\delta_{s}\right\|\leq\delta for some scalar δ>0\delta>0 and all s=0,,ks=0,\dots,k, it holds that

(16) rkλkλ0r0+2hτL(f(x0)f)+δ.r_{k}\leq\frac{\lambda_{k}}{\lambda_{0}}\cdot r_{0}+2h\tau L(f(x_{0})-f^{*})+\delta.

Proof 5.3.

First it holds that

λk+1(Ω(xk)+2Ω(xk)(xk+1xk))+f(xk)+2f(xk)(xk+1xk)\displaystyle\lambda_{k+1}\left(\nabla\Omega(x_{k})+\nabla^{2}\Omega(x_{k})(x_{k+1}-x_{k})\right)+\nabla f(x_{k})+\nabla^{2}f(x_{k})(x_{k+1}-x_{k})
=\displaystyle= λk+1λk(λkΩ(xk)+f(xk))+hδk.\displaystyle\frac{\lambda_{k+1}}{\lambda_{k}}(\lambda_{k}\nabla\Omega(x_{k})+\nabla f(x_{k}))+h\cdot\delta_{k}.

Then, applying the same technique as in Lemma 3.5 shows that Eq. 15 holds. Also, applying Eq. 15 together with Proposition 3.13 implies the result in Eq. 16.

Lemma 5.4.

Suppose rk=Fλk(xk)μ~r_{k}=\left\|\nabla F_{\lambda_{k}}(x_{k})\right\|\leq\tilde{\mu}. Let δk,1,δk,2\delta_{k,1},\delta_{k,2} denote the residual of the approximate directions d^k,1,d^k,2\hat{d}_{k,1},\hat{d}_{k,2} with δk,1,δk,2μ~\left\|\delta_{k,1}\right\|,\left\|\delta_{k,2}\right\|\leq\tilde{\mu}. Then, it holds that

(17) rk+1λk+1λkrk+h33L(2+G)3+h42L3τ2(2+G)4+h2δk,1δk,2+h22δk,1.r_{k+1}\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{3}\cdot 3L(2+G)^{3}+h^{4}\cdot 2L^{3}\tau^{2}(2+G)^{4}+\frac{h}{2}\left\|\delta_{k,1}-\delta_{k,2}\right\|+\frac{h^{2}}{2}\left\|\delta_{k,1}\right\|.

Proof 5.5.

We will follow the ideas in Lemmas A.2, A.4 and 4.2. Recalling the result in Lemma A.2, since H~1d1=f(x1)+δ1\tilde{H}_{1}d_{1}=-\nabla f(x_{1})+\delta_{1}, it holds that (1+λ)d12(G+1+δ1/μ~)(1+\lambda)\left\|d_{1}\right\|\leq 2(G+1+\left\|\delta_{1}\right\|/\tilde{\mu}). Also, the result in Lemma A.4 becomes

H~1(d2d1)2f(x1)(x1x2)(H~1H~2)d2+δ1δ2L2x1x22,\left\|\tilde{H}_{1}(d_{2}-d_{1})-\nabla^{2}f(x_{1})(x_{1}-x_{2})-(\tilde{H}_{1}-\tilde{H}_{2})d_{2}+\delta_{1}-\delta_{2}\right\|\leq\frac{L}{2}\left\|x_{1}-x_{2}\right\|^{2},

where H~1=2f(x1)+λ12Ω(x1)\tilde{H}_{1}=\nabla^{2}f(x_{1})+\lambda_{1}\nabla^{2}\Omega(x_{1}) and H~2=2f(x2)+λ22Ω(x2)\tilde{H}_{2}=\nabla^{2}f(x_{2})+\lambda_{2}\nabla^{2}\Omega(x_{2}). We now modify the proof of Lemma 4.2 to obtain Eq. 17: the right-hand side of Eq. 20 becomes hδ1-h\delta_{1}, the right-hand side of Eq. 21 becomes h2(δ1δ2)\frac{h}{2}(\delta_{1}-\delta_{2}), and the right-hand side of Eq. 22 becomes h22λ2Ω(x)d1+h22δ1\frac{h^{2}}{2}\lambda\nabla^{2}\Omega(x)d_{1}+\frac{h^{2}}{2}\delta_{1}.

The following two corollaries provide the complexity analysis of the δ\delta-approximate version of Algorithm 2 and Algorithm 3.

Corollary 5.6.

Suppose that the assumptions in Section 3 hold, and suppose that the initial point x0x_{0} satisfies Fλmax(x0)ϵ4\left\|\nabla F_{\lambda_{\max}}(x_{0})\right\|\leq\frac{\epsilon}{4}. Let T:=log(λmax/λmin)T:=\log(\lambda_{\max}/\lambda_{\min}), let ϵ(0,μ+λminσ]\epsilon\in(0,\mu+\lambda_{\min}\sigma] be the desired accuracy, and let

KE, approx:=max{2T,LGτT3,8(f(x0)f)τLTϵ,4L(τ(G+ϵ)+1)Tϵ}.K_{\textnormal{E, approx}}:=\left\lceil\max\left\{2T,\frac{\sqrt{LG}\tau T}{\sqrt{3}},\frac{8(f(x_{0})-f^{\ast})\tau LT}{\epsilon},\frac{4\sqrt{L}(\tau(G+\epsilon)+1)T}{\sqrt{\epsilon}}\right\}\right\rceil.

If the number of iterations KK satisfies KKE, approxK\geq K_{\textnormal{E, approx}} and approximate directions d^k\hat{d}_{k} are all ϵ4\frac{\epsilon}{4}-approximate, the δ\delta-approximate version of Algorithm 2 returns an ϵ\epsilon-accurate solution path.

Proof 5.7.

Since the step-size h=1(λminλmax)1KTKh=1-(\frac{\lambda_{\min}}{\lambda_{\max}})^{\frac{1}{K}}\leq\frac{T}{K}, combining this with Eq. 16, we have rk3ϵ4r_{k}\leq\frac{3\epsilon}{4} for all k{0,,K}k\in\{0,\dots,K\}. For the linear interpolation error, we have

L8maxk[K1]{(1+λk)xk+1xk2+2h|ξ(λk)|xk+1xk}\displaystyle\frac{L}{8}\cdot\max_{k\in[K-1]}\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h|\xi(\lambda_{k})|\|x_{k+1}-x_{k}\|\}
\displaystyle\leq h2L2maxk[K1]{(1+λk)(dk2+dkd^k2)+|ξ(λk)|(dk+dkd^k)}\displaystyle\frac{h^{2}L}{2}\cdot\max_{k\in[K-1]}\{(1+\lambda_{k})(\|d_{k}\|^{2}+\|d_{k}-\hat{d}_{k}\|^{2})+|\xi(\lambda_{k})|(\|d_{k}\|+\|d_{k}-\hat{d}_{k}\|)\}
\displaystyle\leq h2L2(τ(G+ϵ)+1)2ϵ4.\displaystyle\frac{h^{2}L}{2}\cdot(\tau(G+\epsilon)+1)^{2}\leq\frac{\epsilon}{4}.

Applying the above inequality to Theorem 3.11 completes the proof.

Corollary 5.8.

Suppose the assumptions in Section 4.1 hold, let μ~:=μ+λminσ\tilde{\mu}:=\mu+\lambda_{\min}\sigma, let ϵ(0,μ~]\epsilon\in(0,\tilde{\mu}] be the desired accuracy, suppose that the initial point x0x_{0} satisfies Fλmax(x0)ϵ4\left\|\nabla F_{\lambda_{\max}}(x_{0})\right\|\leq\frac{\epsilon}{4}, let T:=1.1log(λmax/λmin)T:=1.1\log(\lambda_{\max}/\lambda_{\min}), and let

Ktr, approx:=max{10T,8LT(2+G)μ~,6L1/2(2+G)3/2Tϵ1/2,6Lτ2/3(2+G)4/3Tϵ1/3}.K_{\textnormal{tr, approx}}:=\left\lceil\max\left\{10T,\frac{8LT(2+G)}{\tilde{\mu}},\frac{6L^{1/2}(2+G)^{3/2}T}{\epsilon^{1/2}},\frac{6L\tau^{2/3}(2+G)^{4/3}T}{\epsilon^{1/3}}\right\}\right\rceil.

If the number of iterations KK satisfies KKtr, approxK\geq K_{\textnormal{tr, approx}} and all approximate directions d^k,1,d^k,2\hat{d}_{k,1},\hat{d}_{k,2} are ϵ4\frac{\epsilon}{4}-approximate, the δ\delta-approximate version of Algorithm 3 returns an ϵ\epsilon-accurate solution path.

Proof 5.9.

Using Eq. 17 for rk+1,,r1r_{k+1},\dots,r_{1}, we have

rk+1\displaystyle r_{k+1} r0+3h3L(2+G)3+2h4L3τ2(2+G)4+h2δk,1δk,2+h22δk,11(1h+h22)\displaystyle\leq r_{0}+\frac{3h^{3}L(2+G)^{3}+2h^{4}L^{3}\tau^{2}(2+G)^{4}+\frac{h}{2}\left\|\delta_{k,1}-\delta_{k,2}\right\|+\frac{h^{2}}{2}\left\|\delta_{k,1}\right\|}{1-\left(1-h+\frac{h^{2}}{2}\right)}
ϵ4+3h2L(2+G)3+2h3L3τ2(2+G)4+ϵ4+hϵ81h23ϵ4.\displaystyle\leq\frac{\epsilon}{4}+\frac{3h^{2}L(2+G)^{3}+2h^{3}L^{3}\tau^{2}(2+G)^{4}+\frac{\epsilon}{4}+\frac{h\epsilon}{8}}{1-\frac{h}{2}}\leq\frac{3\epsilon}{4}.

Also, since xk+1xkh(dk+d^k,1+d^k,22(μ+λkσ))\|x_{k+1}-x_{k}\|\leq h(\|d_{k}\|+\frac{\|\hat{d}_{k,1}\|+\|\hat{d}_{k,2}\|}{2(\mu+\lambda_{k}\sigma)}), it holds that

\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\|\leq\frac{3\epsilon}{4}+\frac{L}{8}\cdot\max_{k\in[K-1]}\left\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h\lambda_{k}\|x_{k+1}-x_{k}\|\right\}\leq\frac{3\epsilon}{4}+\frac{L}{4}\cdot\left(h^{2}\left(9(1+G)^{2}+2\tau\epsilon\right)+h^{2}(3(1+G)+\epsilon)\right)\leq\epsilon.

Corollaries 5.6 and 5.8 provide the complexity analysis of the Euler method and the trapezoid method in the presence of approximate directions. We can see that when the residuals of the approximate directions have a uniform upper bound over all iterations, the δ\delta-approximate versions of Algorithms 2 and 3 have complexity of the same order as the original algorithms, which require exact directions. Moreover, Corollaries 5.6 and 5.8 only require ϵ4\frac{\epsilon}{4}-approximate directions and make no assumptions about how the approximate directions are generated. This provides flexibility in the choice of the approximate oracle used to compute the approximate directions.

5.2 Second-Order Conjugate Gradient Variants

Following the complexity analysis, we apply the conjugate gradient method to solve the subproblem, i.e., to compute a δ\delta-approximate direction with respect to v(,)v(\cdot,\cdot). To assess the efficiency of second-order conjugate gradient type algorithms, we measure the computational complexity by the total number of calls to the gradient and Hessian-vector product oracles.

Now we apply the conjugate gradient method as an approximate oracle to compute the approximate direction d^k\hat{d}_{k} at each iteration. We use the δ\delta-approximate version of the Euler method as an example. At iteration kk, the algorithm requires an approximate solution d^k\hat{d}_{k} satisfying Hkd^k+gk2δ\|{H_{k}\hat{d}_{k}+g_{k}}\|_{2}\leq\delta, where Hk:=2f(xk)+λk+12Ω(xk)H_{k}:=\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k}) and gk:=f(xk)g_{k}:=\nabla f(x_{k}). We apply the conjugate gradient method to approximately solve the equation Hkd^k+gk=0H_{k}\hat{d}_{k}+g_{k}=0, setting the initial guess to be the approximate direction d^k1\hat{d}_{k-1} from the previous iteration. Below, we provide the complexity analysis of the Euler-CG method and the trapezoid-CG method.
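
The following sketch illustrates this warm-started conjugate gradient subroutine using only Hessian-vector products; hvp(v) is assumed to return $H_{k}v$, g stands for $g_{k}=\nabla f(x_{k})$, and d0 is the warm start.

import numpy as np

def cg_direction(hvp, g, d0, delta, max_iter=1000):
    # Approximately solve H d = -g by conjugate gradient, warm-started
    # at d0, stopping once the residual ||H d + g|| drops below delta.
    # hvp(v) is assumed to return the Hessian-vector product H v.
    d = d0.copy()
    r = -g - hvp(d)          # residual of the system H d = -g
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= delta:   # ||H d + g|| = ||r||
            break
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

In the Euler-CG method, d0 is taken to be the approximate direction from the previous iteration, which is exactly the warm start used in the analysis below.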

Theorem 5.10.

Under the same conditions as Theorem 3.2, let δϵ/4\delta\leftarrow\epsilon/4, and let KE-CG𝒪~(L3/2τ3/2(f(x0)f)T/ϵ)K_{\textnormal{E-CG}}\sim\tilde{\mathcal{O}}(L^{3/2}\tau^{3/2}(f(x_{0})-f^{*})T/\epsilon). If the total number of iterations KK satisfies KKE-CGK\geq K_{\textnormal{E-CG}}, then the Euler-CG method returns an ϵ\epsilon-accurate solution path. Also, suppose that the assumption in Theorem 4.1 holds, let δϵ/4\delta\leftarrow\epsilon/4, and let Ktr-CG𝒪~(Lτ1/2(2+G)3/2T/ϵ1/2)K_{\textnormal{tr-CG}}\sim\tilde{\mathcal{O}}(L\tau^{1/2}(2+G)^{3/2}T/\epsilon^{1/2}). If the total number of iterations KK satisfies KKtr-CGK\geq K_{\textnormal{tr-CG}}, then the trapezoid-CG method returns an ϵ\epsilon-accurate solution path.

Proof 5.11 (Proof of Theorem 5.10).

Here we only need to consider the inner complexity, i.e., the number of iterations required to compute the approximate direction. For the Euler-CG algorithm, let {yk,s}\left\{y_{k,s}\right\} denote the sequence generated by the conjugate gradient method with yk,0=d^k1y_{k,0}=\hat{d}_{k-1}, let Hk=2f(xk)+λk+12Ω(xk)H_{k}=\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k}), and let gk=f(xk)g_{k}=\nabla f(x_{k}). The existing results of the conjugate gradient method [8] guarantee that Hkyk,s+gk22κk(12κk+1)sHkyk,0+gk2\left\|H_{k}y_{k,s}+g_{k}\right\|_{2}\leq 2\sqrt{\kappa_{k}}(1-\frac{2}{\sqrt{\kappa_{k}}+1})^{s}\left\|H_{k}y_{k,0}+g_{k}\right\|_{2}, where κk\kappa_{k} is the condition number of HkH_{k} and κk(1+λk)Lμ+λkστL\kappa_{k}\leq\frac{(1+\lambda_{k})L}{\mu+\lambda_{k}\sigma}\leq\tau L. Since the initialization y0=d^k1y_{0}=\hat{d}_{k-1} is the approximate direction at the last iteration, we have Hk1d^k1+gk1ϵ4\|H_{k-1}\hat{d}_{k-1}+g_{k-1}\|\leq\frac{\epsilon}{4}. Then the initialization guarantees that

Hkd^k1+gk2\displaystyle\|H_{k}\hat{d}_{k-1}+g_{k}\|_{2} Hk1d^k1+gk12+Hkd^k1+gkHk1d^k1gk1\displaystyle\leq\|H_{k-1}\hat{d}_{k-1}+g_{k-1}\|_{2}+\|H_{k}\hat{d}_{k-1}+g_{k}-H_{k-1}\hat{d}_{k-1}-g_{k-1}\|
ϵ4+hL(1+λk)(d^k12+d^k1)ϵ4+2hL(2+G)2.\displaystyle\leq\frac{\epsilon}{4}+hL(1+\lambda_{k})\left(\|\hat{d}_{k-1}\|^{2}+\|\hat{d}_{k-1}\|\right)\leq\frac{\epsilon}{4}+2hL(2+G)^{2}.

Applying h𝒪(ϵ(f(x0)f)τL)h\sim\mathcal{O}(\frac{\epsilon}{(f(x_{0})-f^{*})\tau L}), the inner complexity NkN_{k} has an upper bound Nkκk+12log(2κkHkd^k1+gkϵ/4)𝒪~(τL)N_{k}\leq\frac{\sqrt{\kappa_{k}}+1}{2}\log(\frac{2\sqrt{\kappa_{k}}\|H_{k}\hat{d}_{k-1}+g_{k}\|}{\epsilon/4})\sim\tilde{\mathcal{O}}(\sqrt{\tau L}). Therefore, the total computational complexity of the Euler-CG method with the conjugate gradient approximate oracle to compute an ϵ\epsilon-accurate solution path is 𝒪~(L3/2τ3/2(f(x0)f)Tϵ)\tilde{\mathcal{O}}(\frac{L^{3/2}\tau^{3/2}(f(x_{0})-f^{*})T}{\epsilon}). For the trapezoid-CG algorithm, let xk=xk+hd^k,1x_{k}^{\prime}=x_{k}+h\hat{d}_{k,1}, λk=(1h+h2)λk\lambda_{k}^{\prime}=(1-h+h^{2})\lambda_{k}, Hk,1:=2f(xk)+λk2Ω(xk),gk,1:=f(xk)H_{k,1}:=\nabla^{2}f(x_{k})+\lambda_{k}\nabla^{2}\Omega(x_{k}),g_{k,1}:=\nabla f(x_{k}), Hk,2:=2f(xk)+λk2Ω(xk)H_{k,2}:=\nabla^{2}f(x_{k}^{\prime})+\lambda_{k}^{\prime}\nabla^{2}\Omega(x_{k}^{\prime}), and gk,2:=f(xk)g_{k,2}:=\nabla f(x_{k}^{\prime}). At iteration kk, the trapezoid-CG method requires computing the ϵ4\frac{\epsilon}{4}-approximate directions d^k,1\hat{d}_{k,1} and d^k,2\hat{d}_{k,2}. In the conjugate gradient subroutine, we use d^k1,1\hat{d}_{k-1,1} as a warm start to compute d^k,1\hat{d}_{k,1}, and likewise use d^k,1\hat{d}_{k,1} to warm start d^k,2\hat{d}_{k,2}. Similarly, we have Hk,1d^k1,1+gk,1ϵ4+2hL(2+G)2\|H_{k,1}\hat{d}_{k-1,1}+g_{k,1}\|\leq\frac{\epsilon}{4}+2hL(2+G)^{2} and Hk,2d^k,1+gk,2ϵ4+2hL(2+G)2\|H_{k,2}\hat{d}_{k,1}+g_{k,2}\|\leq\frac{\epsilon}{4}+2hL(2+G)^{2}. Applying h𝒪(ϵ1/2L1/2(2+G)3/2)h\sim\mathcal{O}(\frac{\epsilon^{1/2}}{L^{1/2}(2+G)^{3/2}}), the inner complexities Nk,1N_{k,1} and Nk,2N_{k,2} have upper bounds Nk,1,Nk,2𝒪~(τL)N_{k,1},N_{k,2}\sim\tilde{\mathcal{O}}(\sqrt{\tau L}). Therefore, the total computational complexity of the trapezoid-CG method with the conjugate gradient approximate oracle to compute an ϵ\epsilon-accurate solution path is 𝒪~(Lτ1/2(2+G)3/2Tϵ1/2)\tilde{\mathcal{O}}(\frac{L\tau^{1/2}(2+G)^{3/2}T}{\epsilon^{1/2}}).

Recalling the complexity results in Remark 3.3, we notice that when we use Nesterov’s accelerated gradient method as the subproblem solver, the total complexity will be (τL)3/2GTlog2ϵ\frac{(\tau L)^{3/2}GT\log 2}{\epsilon}. We can see that the total complexity of the δ\delta-approximate version of Algorithm 3 has order 𝒪(1ϵ)\mathcal{O}(\frac{1}{\sqrt{\epsilon}}) and is better than the best known result for the grid search method. Again, the complexity of the trapezoid method is better than that of the Euler method, since it exploits the higher-order smoothness of the functions f()f(\cdot) and Ω()\Omega(\cdot).

6 Computational Experiments and Results

In this section, we present computational results of numerical experiments in which we implement different versions of discretization schemes and interpolation methods to compute the approximate solution path. To compare with our methods, we introduce two approaches based on grid search (see, e.g., Giesen et al. [16], Ndiaye et al. [29]). For the subproblem solver in the grid search methods, we use the warm-started exact Newton method and Nesterov’s accelerated gradient method to compare with our exact methods and SOCG variants, respectively. We focus on the following 8 versions of the update schemes, where “CG” stands for conjugate gradient. In all cases, we fix an accuracy level ϵ>0\epsilon>0 and set the step/grid-size parameters accordingly.

  • Euler, Euler-CG: Algorithm 2, and its δ\delta-approximate version using CG as the sub-problem approximate oracle.

  • Trapezoid, Trapezoid-CG: Algorithm 3, and its δ\delta-approximate version using CG as the approximate subproblem oracle.

  • Runge-Kutta, Runge-Kutta-CG: an extension of our algorithm that uses the Runge-Kutta update scheme, and its δ\delta-approximate version using CG as the subproblem approximate oracle.

  • Grid Search-Newton/AGD: grid search method with either the exact Newton or Nesterov’s accelerated gradient method as the sub-problem solver.

6.1 Logistic Regression

Herein we examine the empirical behavior of each of the previously presented methods on logistic regression problems using the breast cancer dataset from Dua and Graff [11] (32 features and 569 observations) and the leukemia dataset from Golub et al. [18] (7129 features and 72 observations). In particular, let {(ai,bi)}i=1n\left\{(a_{i},b_{i})\right\}_{i=1}^{n} denote a training set of features aipa_{i}\in\mathbb{R}^{p} and labels bi{1,+1}b_{i}\in\{-1,+1\}, and define the sets of positive and negative examples by S+:={i[n]:bi=1}S_{+}:=\left\{i\in[n]:b_{i}=1\right\} and S:={i[n]:bi=1}S_{-}:=\left\{i\in[n]:b_{i}=-1\right\}. We examine two logistic regression variants: (i) regularized logistic regression with f(x)=1ni=1nlog(1+ebiaiTx)f(x)=\frac{1}{n}\sum_{i=1}^{n}\log(1+e^{-b_{i}a_{i}^{T}x}) and Ω(x)=12x2\Omega(x)=\frac{1}{2}\left\|x\right\|^{2}, where λmin=104\lambda_{\min}=10^{-4}, λmax=104\lambda_{\max}=10^{4}, and (ii) re-weighted logistic regression with f(x)=1|S+|iS+log(1+ebiaiTx)f(x)=\frac{1}{|S_{+}|}\sum_{i\in S_{+}}\log(1+e^{-b_{i}a_{i}^{T}x}) and Ω(x)=1|S|iSlog(1+ebiaiTx)\Omega(x)=\frac{1}{|S_{-}|}\sum_{i\in S_{-}}\log(1+e^{-b_{i}a_{i}^{T}x}), where λmin=101\lambda_{\min}=10^{-1}, λmax=10\lambda_{\max}=10. The initialization x0x_{0} applied in each method is the same and is a very nearly optimal solution to the problem, in the sense that Fλmax(x0)1015\left\|\nabla F_{\lambda_{\max}}(x_{0})\right\|\approx 10^{-15}. Note that in general it is difficult to compute the exact path accuracy of an approximate solution path x^():[λmin,λmax]p\hat{x}(\cdot):[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}^{p}, namely A(x^):=maxλΛFλ(x^(λ))A(\hat{x}):=\max_{\lambda\in\Lambda}\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\|, where Λ=[λmin,λmax]\Lambda=[\lambda_{\min},\lambda_{\max}]. We therefore examine the approximate path accuracy in lieu of the exact computation of the path accuracy, namely, we consider A^(x^):=maxλΛ^Fλ(x^(λ))\hat{A}(\hat{x}):=\max_{\lambda\in\hat{\Lambda}}\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\| where Λ^={λ0,λ0+λ12,λ1,λ1+λ22,λ2,,λK}\hat{\Lambda}=\{\lambda_{0},\frac{\lambda_{0}+\lambda_{1}}{2},\lambda_{1},\frac{\lambda_{1}+\lambda_{2}}{2},\lambda_{2},\dots,\lambda_{K}\}, since the largest theoretical interpolation error occurs at the midpoint of two consecutive breakpoints. In Fig. 1, we vary the desired accuracy parameter ϵ\epsilon and plot the number of Hessian oracle computations required by Algorithm 2, Algorithm 3, the algorithm with the Runge-Kutta update scheme, and the grid search method. The theoretical evaluation numbers are derived from the complexity analysis, and the observed number of evaluations of each method is determined according to a "doubling trick," whereby for each value of KK we calculate the observed accuracy along the path by interpolation, and if the observed accuracy is too large, then we double the value of KK until it is below ϵ\epsilon (a minimal sketch of this procedure appears below). The numerical results match the theoretical asymptotic order and intuition, and they confirm the superior performance of the trapezoid and Runge-Kutta methods, which exploit the higher-order smoothness of the loss and regularization functions. In part (a) of Fig. 1, we notice that the theoretical complexity is higher than the practical one, since the theoretical bound is more conservative. Therefore, we will stick to the "doubling trick" in the following experiments and compare the practical performance of each method.
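
A minimal sketch of the approximate path accuracy check on the midpoint grid and of the doubling trick is given below; compute_path(K), grad_F(x, lam), and interpolate(lam, lams, xs) are placeholder callables standing in for one of the discretization schemes, the gradient $\nabla F_{\lambda}(x)$, and the linear interpolation step, respectively.

import numpy as np

def observed_accuracy(lams, xs, grad_F, interpolate):
    # Approximate path accuracy on the midpoint grid: the breakpoints
    # plus the midpoints of consecutive breakpoints.
    grid = []
    for k in range(len(lams) - 1):
        grid += [lams[k], 0.5 * (lams[k] + lams[k + 1])]
    grid.append(lams[-1])
    return max(np.linalg.norm(grad_F(interpolate(lam, lams, xs), lam))
               for lam in grid)

def doubling_trick(compute_path, grad_F, interpolate, eps, K0=8):
    # Double K until the observed accuracy along the path is below eps.
    K = K0
    while True:
        lams, xs = compute_path(K)
        if observed_accuracy(lams, xs, grad_F, interpolate) <= eps:
            return K, lams, xs
        K *= 2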

Figure 1: Exact methods on the breast cancer data with n=569n=569 observations and p=30p=30 features. (a) Regularized logistic regression. (b) Re-weighted logistic regression.

Figure 2 summarizes the performance of the three aforementioned SOCG methods and the grid search method with Nesterov's accelerated gradient method. For the SOCG methods, we record the total number of gradient evaluations and Hessian-vector products, which is our measure of computational complexity. We implement these methods on the leukemia dataset and the regularized logistic regression problem. From Fig. 2, we can see that the numerical results match our theoretical asymptotic bounds. In part (b), we provide the CPU time to compute the approximate solutions for both the SOCG methods and the exact second-order methods (shown in the more transparent bars). We note that for the exact methods we only run the case in which the desired accuracy equals 10410^{-4}, since the grid search method already takes about 15 hours to finish. The dominance of the SOCG methods in these cases illustrates their ability to handle large-scale problems.

Figure 2: Second-order conjugate gradient methods on regularized logistic regression on leukemia data with n=72n=72 observations and p=7129p=7129 features. (a) Computation complexity. (b) CPU time.

6.2 Moment Matching Problem

Here we consider the moment matching problem with entropy regularization. Suppose that a discrete random variable ZZ has sample space {w1,,wp+1}\{w_{1},\dots,w_{p+1}\} and probability distribution x=(x(1),,x(p+1))x=(x_{(1)},\dots,x_{(p+1)}). The goal is to match the first nn empirical moments of ZZ, with entropy regularization to promote a nearly uniform ("equal") distribution. To formalize the problem, we consider the following constrained optimization problem:

P(λ):minxp+1\displaystyle P(\lambda):\quad\min_{x\in\mathbb{R}^{p+1}}\ \ 12Axb2+λj=1p+1x(j)log(x(j))\displaystyle\frac{1}{2}\left\|Ax-b\right\|^{2}+\lambda\cdot\sum_{j=1}^{p+1}x_{(j)}\log(x_{(j)})
s.t. 𝟏p+1Tx=1,x0,\displaystyle{\bm{1}}_{p+1}^{T}x=1,\quad x\geq 0,

where x(j)x_{(j)} is the jj-th component of xx, An×(p+1)A\in\mathbb{R}^{n\times(p+1)} with Ai,j=wjiA_{i,j}=w_{j}^{i}, and λΛ=[102,102]\lambda\in\Lambda=[10^{-2},10^{2}]. The parametric optimization problem P(λ)P(\lambda) is a constrained optimization problem, including an affine constraint, which does not satisfy our assumptions. Therefore, we introduce a new variable yy to substitute xx. Let ypy\in\mathbb{R}^{p} with y(i)=x(i)y_{(i)}=x_{(i)}, for i=1,,pi=1,\dots,p, and S={yp:y0,𝟏pTy1}S^{\prime}=\{y\in\mathbb{R}^{p}:y\geq 0,{\bm{1}}_{p}^{T}y\leq 1\}. Then the above moment matching problem P(λ)P(\lambda) is equivalent to the following parametric optimization problem:

P(λ):minyp\displaystyle P^{\prime}(\lambda):\quad\min_{y\in\mathbb{R}^{p}}\ \ 12Ayb2+λ(j=1py(j)log(y(j))+(11pTy)log(11pTy))\displaystyle\frac{1}{2}\|A^{\prime}y-b^{\prime}\|^{2}+\lambda\cdot\left(\sum_{j=1}^{p}y_{(j)}\log(y_{(j)})+(1-{\bm{1}}_{p}^{T}y)\log(1-{\bm{1}}_{p}^{T}y)\right)
s.t. 𝟏pTy1,y0,\displaystyle{\bm{1}}_{p}^{T}y\leq 1,\quad y\geq 0,

where A=A1:pAp+1𝟏pTA^{\prime}=A_{1:p}-A_{p+1}{\bm{1}}_{p}^{T} and b=bAp+1b^{\prime}=b-A_{p+1}. We examine the empirical behavior of each of the previously presented methods on (P)(P^{\prime}) without constraints, that is, f(y)=12Ayb2f(y)=\frac{1}{2}\|A^{\prime}y-b^{\prime}\|^{2} and Ω(y)=j=1py(j)log(y(j))+(11pTy)log(11pTy)\Omega(y)=\sum_{j=1}^{p}y_{(j)}\log(y_{(j)})+(1-{\bm{1}}_{p}^{T}y)\log(1-{\bm{1}}_{p}^{T}y). Translating yy back to xx, we can show that the exact path x(λ)x(\lambda) for λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}] is a subset of the relative interior of SS, but the proof is omitted for brevity. Also, when the step-size hh is small enough, all grid points {xk}\{x_{k}\} will be in the relative interior of SS, and therefore, the approximate path x^(λ)\hat{x}(\lambda) for λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}] is a subset of the relative interior of SS.
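
A brief sketch of this reformulation is given below; it forms $A^{\prime}$ and $b^{\prime}$ (assuming $A_{1:p}$ denotes the first $p$ columns of $A$ and $A_{p+1}$ its last column) and evaluates $f(y)$ and $\Omega(y)$ on the relative interior of $S^{\prime}$.

import numpy as np

def reformulate(A, b):
    # A' = A_{1:p} - A_{p+1} 1_p^T and b' = b - A_{p+1}, assuming
    # A_{1:p} is the first p columns of A and A_{p+1} its last column.
    a_last = A[:, -1]
    A_prime = A[:, :-1] - np.outer(a_last, np.ones(A.shape[1] - 1))
    b_prime = b - a_last
    return A_prime, b_prime

def f(y, A_prime, b_prime):
    return 0.5 * np.linalg.norm(A_prime @ y - b_prime) ** 2

def omega(y):
    # Entropy regularizer of the full distribution (y, 1 - 1^T y);
    # well defined on the relative interior, where all entries are positive.
    s = 1.0 - np.sum(y)
    return np.sum(y * np.log(y)) + s * np.log(s)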

Synthetic Data Generation Process

We generate the data (x,w)(x,w) according to the following generative model. The true distribution vector xtruex^{\textnormal{true}} is generated according to the model x(i)true=exp(z(i))j=1p+1exp(z(j))x_{(i)}^{\textnormal{true}}=\frac{\exp(z_{(i)})}{\sum_{j=1}^{p+1}\exp(z_{(j)})} for i=1,,p+1i=1,\dots,p+1, where z(i)unif(0,1)z_{(i)}\sim\textnormal{unif}(0,1). The sample vector ww is generated from an independent uniform distribution, i.e., w(i)unif(0,1)w_{(i)}\sim\textnormal{unif}(0,1) for i=1,,pi=1,\dots,p, and without loss of generality, we set w(p+1)=0w_{(p+1)}=0.
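
A sketch of this generative process is given below; forming $b$ from the true moments, i.e., $b=Ax^{\textnormal{true}}$, is one natural choice that is not spelled out in the text and is marked as an assumption in the code.

import numpy as np

def generate_data(n, p, seed=0):
    rng = np.random.default_rng(seed)
    # True distribution: softmax of uniform(0, 1) entries.
    z = rng.uniform(0.0, 1.0, size=p + 1)
    x_true = np.exp(z) / np.sum(np.exp(z))
    # Sample points: w_i ~ unif(0, 1) for i = 1, ..., p and w_{p+1} = 0.
    w = np.append(rng.uniform(0.0, 1.0, size=p), 0.0)
    # Moment matrix A_{i,j} = w_j^i for i = 1, ..., n.
    A = np.vstack([w ** i for i in range(1, n + 1)])
    b = A @ x_true  # assumed choice: the true first n moments
    return x_true, w, A, b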

We examine the aforementioned exact and SOCG methods with different problem dimensions. In the following set of experiments, we set the number of observations n=20n=20, the desired accuracy ϵ=105\epsilon=10^{-5}, and vary the problem dimension p{128,256,512,1024,2048}p\in\{128,256,512,1024,2048\}. The true distribution xtruex^{\text{true}} and the sample vector ww are synthetically generated according to the same process as before. Figure 3 displays our results for this experiment. Generally, we observe that the CPU time required to compute an ϵ\epsilon-accurate path increases as the problem dimension increases. Comparing the CPU time of the exact methods and the SOCG methods, we notice that the SOCG methods are less sensitive to the problem size. Again, the results demonstrate the superiority of the SOCG methods for solving large-scale problems.

Figure 3: Exact and SOCG methods on the moment matching problem with n=20n=20 and ϵ=105\epsilon=10^{-5}: CPU time comparison for different problem dimensions.

7 Conclusion

Inspired by a differential equation perspective on the parametric solution path, we developed and analyzed several different exact and inexact second-order algorithms. We established theoretical guarantees that improve over the best-known results for grid search methods, and we presented computational results that demonstrate the effectiveness of our methods.

Acknowledgments

This research is supported, in part, by NSF AI Institute for Advances in Optimization Award 2112533.

References

  • Abbas et al. [2014] Boushra Abbas, Hédy Attouch, and Benar F Svaiter. Newton-like dynamics and forward-backward methods for structured monotone inclusions in Hilbert spaces. Journal of Optimization Theory and Applications, 161(2):331–360, 2014.
  • Alvarez and Attouch [2001] Felipe Alvarez and Hedy Attouch. An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis, 9(1-2):3–11, 2001.
  • Alvarez et al. [2002] Felipe Alvarez, Hedy Attouch, Jérôme Bolte, and P Redont. A second-order gradient-like dissipative dynamical system with Hessian-driven damping: Application to optimization and mechanics. Journal de mathématiques pures et appliquées, 81(8):747–779, 2002.
  • Attouch and Svaiter [2011] Hedy Attouch and Benar Fux Svaiter. A continuous dynamical Newton-like approach to solving monotone inclusions. SIAM Journal on Control and Optimization, 49(2):574–598, 2011.
  • Attouch et al. [2016] Hedy Attouch, Maicon Marques Alves, and Benar Fux Svaiter. A dynamic approach to a proximal-Newton method for monotone inclusions in Hilbert spaces, with complexity o(1/n2)o(1/n^{2}). Journal of Convex Analysis, 23(1):139–180, 2016.
  • Attouch et al. [2018] Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168(1-2):123–175, 2018.
  • Bach et al. [2006] Francis R Bach, David Heckerman, and Eric Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7(Aug):1713–1741, 2006.
  • Bertsekas [1997] Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
  • Bot and Csetnek [2016] Radu Ioan Bot and Ernö Robert Csetnek. Second order forward-backward dynamical systems for monotone inclusion problems. SIAM Journal on Control and Optimization, 54(3):1423–1443, 2016.
  • Conn et al. [2000] Andrew R Conn, Nicholas IM Gould, and Ph L Toint. Trust region methods, volume 1. SIAM, 2000.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Efron et al. [2004] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
  • Friedman et al. [2010] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
  • Garcia et al. [1989] Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey. Automatica, 25(3):335–348, 1989.
  • Gautschi [2011] Walter Gautschi. Numerical analysis. Springer Science & Business Media, 2011.
  • Giesen et al. [2012a] Joachim Giesen, Martin Jaggi, and Sören Laue. Approximating parameterized convex optimization problems. ACM Transactions on Algorithms (TALG), 9(1):10, 2012a.
  • Giesen et al. [2012b] Joachim Giesen, Martin Jaggi, and Sören Laue. Regularization paths with guarantees for convex semidefinite optimization. In Artificial Intelligence and Statistics, pages 432–439, 2012b.
  • Golub et al. [1999] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
  • Guddat et al. [1990] Jürgen Guddat, F Guerra Vazquez, and Hubertus Th Jongen. Parametric optimization: singularities, pathfollowing and jumps. Springer, 1990.
  • Hastie et al. [2004] Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5(Oct):1391–1415, 2004.
  • Karimireddy et al. [2018] Sai Praneeth Karimireddy, Sebastian U Stich, and Martin Jaggi. Global linear convergence of Newton's method without strong convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.
  • Kim et al. [2007] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky. An interior-point method for large-scale 1\ell_{1}-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007.
  • Loosli et al. [2007] Gaëlle Loosli, Gilles Gasso, and Stéphane Canu. Regularization paths for ν\nu-SVM and ν\nu-SVR. In International Symposium on Neural Networks, pages 486–496. Springer, 2007.
  • Mairal and Yu [2012] Julien Mairal and Bin Yu. Complexity analysis of the lasso regularization path. In Proceedings of the 29th International Conference on Machine Learning, pages 1835–1842. Omnipress, 2012.
  • Marquardt [1963] Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
  • Mazumder et al. [2010] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322, 2010.
  • Mazumder et al. [2011] Rahul Mazumder, Jerome H Friedman, and Trevor Hastie. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.
  • Moré [1978] Jorge J Moré. The Levenberg-Marquardt algorithm: implementation and theory. In Numerical analysis, pages 105–116. Springer, 1978.
  • Ndiaye et al. [2019] Eugene Ndiaye, Tam Le, Olivier Fercoq, Joseph Salmon, and Ichiro Takeuchi. Safe grid search with optimal complexity. In International Conference on Machine Learning, pages 4771–4780, 2019.
  • Nesterov [2018] Yurii Nesterov. Implementable tensor methods in unconstrained convex optimization. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2018.
  • Nesterov and Nemirovskii [1994] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
  • Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • Osborne et al. [2000] Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000.
  • Polyak [2009] Roman A Polyak. Regularized Newton method for unconstrained convex optimization. Mathematical Programming, 120(1):125–145, 2009.
  • Ralph [1994] Daniel Ralph. Global convergence of damped Newton's method for nonsmooth equations via the path search. Mathematics of Operations Research, 19(2):352–389, 1994.
  • Renegar [2001] James Renegar. A mathematical view of interior-point methods in convex optimization, volume 3. SIAM, 2001.
  • Rosset [2005] Saharon Rosset. Following curved regularized optimization solution paths. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1153–1160. MIT Press, 2005. URL http://papers.nips.cc/paper/2600-following-curved-regularized-optimization-solution-paths.pdf.
  • Rosset and Zhu [2007] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, pages 1012–1030, 2007.
  • Scieur et al. [2017] Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d’Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems, pages 1109–1118, 2017.
  • Scott [2011] L.R. Scott. Numerical Analysis. Princeton University Press, 2011. ISBN 9781400838967. URL https://books.google.com/books?id=SfCjL_5AaRQC.
  • Su et al. [2014] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
  • Wang et al. [2014] Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6):2164–2201, 2014.
  • Wibisono et al. [2016] Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
  • Yuan and Zou [2009] Ming Yuan and Hui Zou. Efficient global approximation of generalized nonlinear 1\ell_{1}-regularized solution paths and its applications. Journal of the American Statistical Association, 104(488):1562–1574, 2009.
  • Zhang et al. [2018] Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct Runge-Kutta discretization achieves acceleration. In Advances in Neural Information Processing Systems, pages 3900–3909, 2018.
  • Zhou and Lange [2015] Hua Zhou and Kenneth Lange. Path following in the exact penalty method of convex programming. Computational optimization and applications, 61(3):609–634, 2015.
  • Zhou and Wu [2014] Hua Zhou and Yichao Wu. A generic path algorithm for regularized statistical estimation. Journal of the American Statistical Association, 109(506):686–699, 2014.

Appendix A Proofs

A.1 Proof of Proposition 2.1

For fixed λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}], ξ(λ)λ\frac{\xi(\lambda)}{\lambda} is a constant whose absolute value is no larger than CC. Let Hi=2f(xi)+λ2Ω(xi)H_{i}=\nabla^{2}f(x_{i})+\lambda\nabla^{2}\Omega(x_{i}) and gi=f(xi)g_{i}=\nabla f(x_{i}) for i=1,2i=1,2. Since f()f(\cdot) is μ\mu-strongly convex and Ω()\Omega(\cdot) is σ\sigma-strongly convex, we have Hi(μ+λminσ)IH_{i}\succeq(\mu+\lambda_{\min}\sigma)I for i=1,2i=1,2. Since f()f(\cdot) is LL-Lipschitz continuous, we have g2L\|g_{2}\|\leq L. Also, let v1=H11g1v_{1}=H_{1}^{-1}g_{1}, v2=H21g2v_{2}=H_{2}^{-1}g_{2}, and v1=H11g2v_{1}^{\prime}=H_{1}^{-1}g_{2}. It holds that

v1v1=H11(g1g2)H11g1g21μ+λminσLx1x2.\|v_{1}-v_{1}^{\prime}\|=\|H_{1}^{-1}(g_{1}-g_{2})\|\leq\|H_{1}^{-1}\|\cdot\|g_{1}-g_{2}\|\leq\tfrac{1}{\mu+\lambda_{\min}\sigma}\cdot L\|x_{1}-x_{2}\|.

Also, since H1H22f(x1)2f(x2)+λ2Ω(x1)2Ω(x2)L(1+λmax)x1x2\|H_{1}-H_{2}\|\leq\|\nabla^{2}f(x_{1})-\nabla^{2}f(x_{2})\|+\lambda\|\nabla^{2}\Omega(x_{1})-\nabla^{2}\Omega(x_{2})\|\leq L(1+\lambda_{\max})\|x_{1}-x_{2}\| and H11H21=H11(H2H1)H21H11H1H2H21\|H_{1}^{-1}-H_{2}^{-1}\|=\|H_{1}^{-1}(H_{2}-H_{1})H_{2}^{-1}\|\leq\|H_{1}^{-1}\|\cdot\|H_{1}-H_{2}\|\cdot\|H_{2}^{-1}\|, it holds that H11H21L(1+λmax)x1x2(μ+λminσ)2\|H_{1}^{-1}-H_{2}^{-1}\|\leq\tfrac{L(1+\lambda_{\max})\|x_{1}-x_{2}\|}{(\mu+\lambda_{\min}\sigma)^{2}}. Therefore, we have

v1v2=(H11H21)g2H11H21g2L2(1+λmax)x1x2(μ+λminσ)2.\|v_{1}^{\prime}-v_{2}\|=\|\left(H_{1}^{-1}-H_{2}^{-1}\right)g_{2}\|\leq\|H_{1}^{-1}-H_{2}^{-1}\|\cdot\|g_{2}\|\leq\tfrac{L^{2}(1+\lambda_{\max})\|x_{1}-x_{2}\|}{(\mu+\lambda_{\min}\sigma)^{2}}.

To conclude, since v(x1,λ)v(x2,λ)=ξ(λ)λ(v1v2)v(x_{1},\lambda)-v(x_{2},\lambda)=\frac{\xi(\lambda)}{\lambda}\cdot(v_{1}-v_{2}), it holds that

v(x1,λ)v(x2,λ)(LCμ+λminσ+L2C(1+λmax)(μ+λminσ)2)x1x2.\|v(x_{1},\lambda)-v(x_{2},\lambda)\|\leq(\tfrac{LC}{\mu+\lambda_{\min}\sigma}+\tfrac{L^{2}C(1+\lambda_{\max})}{(\mu+\lambda_{\min}\sigma)^{2}})\cdot\|x_{1}-x_{2}\|.

A.2 Proof of Lemma 4.2

In the following residual analysis, we will work with the high-order directional derivatives of f()f(\cdot) and Ω()\Omega(\cdot). We now introduce the definition of the directional derivatives. For p1p\geq 1, let Dpf(x)[h1,,hp]D^{p}f(x)[h_{1},\dots,h_{p}] denote the directional derivative of function ff at xx along directions hi,i=1,,ph_{i},i=1,\dots,p. For example, Df(x)[h]=f(x)ThDf(x)[h]=\nabla f(x)^{T}h and D2f(x)[h1,h2]=h1T2f(x)h2D^{2}f(x)[h_{1},h_{2}]=h_{1}^{T}\nabla^{2}f(x)h_{2}. Furthermore, the norm of directional derivatives is defined as Dpf(x):=maxh1,,hp{|Dpf(x)[h1,,hp]|:hi1}\|D^{p}f(x)\|:=\max_{h_{1},\dots,h_{p}}\left\{\left|D^{p}f(x)[h_{1},\dots,h_{p}]\right|:\|h_{i}\|\leq 1\right\}. For detailed properties of directional derivatives, we refer the reader to Nesterov [30]. We start with computational guarantees of the Taylor approximation on functions with Lipschitz continuous high-order derivatives. The following lemma guarantees the accuracy of the Taylor expansion.

Lemma A.1 ((1.5)(1.5) and (1.6)(1.6) in Nesterov [30]).

Let the function ϕ()\phi(\cdot) be convex and pp-times differentiable. Suppose the pp-th order derivative of ϕ()\phi(\cdot) is LpL_{p}-Lipschitz continuous. Let Φx,p()\Phi_{x,p}(\cdot) denote the Taylor approximation of the function ϕ()\phi(\cdot) at xx: Φx,p(y):=ϕ(x)+i=1p1i!Diϕ(x)[yx]i\Phi_{x,p}(y):=\phi(x)+\sum_{i=1}^{p}\frac{1}{i!}D^{i}\phi(x)[y-x]^{i}. Then we have the following guarantees:

|ϕ(y)Φx,p(y)|Lp(p+1)!yxp+1,ϕ(y)Φx,p(y)Lpp!yxp.\left|\phi(y)-\Phi_{x,p}(y)\right|\leq\frac{L_{p}}{(p+1)!}\|y-x\|^{p+1},\quad\|\nabla\phi(y)-\nabla\Phi_{x,p}(y)\|\leq\frac{L_{p}}{p!}\|y-x\|^{p}.
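
For instance, for $p=2$ these bounds specialize to the familiar inequalities for functions with $L_{2}$-Lipschitz Hessian:

\left|\phi(y)-\phi(x)-\nabla\phi(x)^{T}(y-x)-\tfrac{1}{2}(y-x)^{T}\nabla^{2}\phi(x)(y-x)\right|\leq\frac{L_{2}}{6}\|y-x\|^{3},\qquad\left\|\nabla\phi(y)-\nabla\phi(x)-\nabla^{2}\phi(x)(y-x)\right\|\leq\frac{L_{2}}{2}\|y-x\|^{2},

and for $p=3$ the gradient bound becomes \left\|\nabla\phi(y)-\nabla\phi(x)-\nabla^{2}\phi(x)(y-x)-\tfrac{1}{2}D^{3}\phi(x)[y-x]^{2}\right\|\leq\frac{L_{3}}{6}\|y-x\|^{3}, which is the form invoked in the proof of Lemma 4.2 below.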

Condition (step-size).

Suppose the step-size hh satisfies hmin{0.2,μ~8L(1+G)}h\leq\min\left\{0.2,\frac{\tilde{\mu}}{8L(1+G)}\right\}.

For simplicity, we use (x~,λ~)(\tilde{x},\tilde{\lambda}) to represent (xnext,λnext)(x_{\textnormal{next}},\lambda_{\textnormal{next}}). We begin the complexity analysis of the trapezoid method with the following two lemmas, which provide upper bounds on the norm of the direction d1\|d_{1}\| and on the difference between d1d_{1} and d2d_{2}.

Lemma A.2.

Suppose σ1\sigma\geq 1, xSx0x\in S_{x_{0}}, λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}], h>0h>0 and (x~,λ~)=T(x,λ;h)(\tilde{x},\tilde{\lambda})=T(x,\lambda;h). Let rr denote the initial residual f(x)+λΩ(x)\|\nabla f(x)+\lambda\nabla\Omega(x)\| that satisfies rμ~r\leq\tilde{\mu}. Then we have (1+λ)d12(G+1)(1+\lambda)\|d_{1}\|\leq 2(G+1).

Proof A.3.

Let H=2f(x)+λ2Ω(x)H=\nabla^{2}f(x)+\lambda\nabla^{2}\Omega(x). By definition, d1=H1f(x)d_{1}=-H^{-1}\nabla f(x). When λ1\lambda\geq 1, we have (1+λ)d11+λλG2G(1+\lambda)\|d_{1}\|\leq\frac{1+\lambda}{\lambda}G\leq 2G. Also, when λ1\lambda\leq 1, it holds that

(1+λ)d1(1+λ)(H1λΩ(x)+H1(f(x)+λΩ(x)))2(G+1).(1+\lambda)\|d_{1}\|\leq(1+\lambda)\left(\|H^{-1}\lambda\nabla\Omega(x)\|+\|H^{-1}(\nabla f(x)+\lambda\nabla\Omega(x))\|\right)\leq 2(G+1).

Lemma A.4.

Suppose the assumptions in Section 4.1 and the step-size condition in Section A.2 hold. Let (x,λ)n×+(x,\lambda)\in\mathbb{R}^{n}\times\mathbb{R}^{+}, and let xi,λi,di,i{1,2}x_{i},\lambda_{i},d_{i},i\in\{1,2\} be generated by the trapezoid update scheme defined in Eq. 13. Then we have

H~1(d2d1)2f(x1)(x1x2)(H~1H~2)d2L2x1x22,\|\tilde{H}_{1}(d_{2}-d_{1})-\nabla^{2}f(x_{1})(x_{1}-x_{2})-(\tilde{H}_{1}-\tilde{H}_{2})d_{2}\|\leq\tfrac{L}{2}\|x_{1}-x_{2}\|^{2},

where H~i=2f(xi)+λi2Ω(xi),i{1,2}\tilde{H}_{i}=\nabla^{2}f(x_{i})+\lambda_{i}\nabla^{2}\Omega(x_{i}),i\in\left\{1,2\right\}. Furthermore, it holds that

d2d12hLτ(d1+d12).\|d_{2}-d_{1}\|\leq 2hL\tau(\|d_{1}\|+\|d_{1}\|^{2}).

Proof A.5.

Using the definition of d1,d2d_{1},d_{2} we have

(18) H~1(d2d1)\displaystyle\tilde{H}_{1}(d_{2}-d_{1}) =f(x1)f(x2)+(IH~1H~21)f(x2)\displaystyle=\nabla f(x_{1})-\nabla f(x_{2})+(I-\tilde{H}_{1}\tilde{H}_{2}^{-1})\nabla f(x_{2})
=2f(x1)(x1x2)+(H~1H~2)d2+(R),\displaystyle=\nabla^{2}f(x_{1})(x_{1}-x_{2})+(\tilde{H}_{1}-\tilde{H}_{2})d_{2}+(R),

where (R)=f(x1)f(x2)2f(x1)(x1x2)h22Ld12\|(R)\|=\|\nabla f(x_{1})-\nabla f(x_{2})-\nabla^{2}f(x_{1})(x_{1}-x_{2})\|\leq\tfrac{h^{2}}{2}\cdot L\|d_{1}\|^{2}. Also, we have 2f(x1)(x1x2)=h2f(x1)d1hLd1\|\nabla^{2}f(x_{1})(x_{1}-x_{2})\|=h\|\nabla^{2}f(x_{1})d_{1}\|\leq hL\|d_{1}\|, and that

(H~1H~2)d2\displaystyle\|(\tilde{H}_{1}-\tilde{H}_{2})d_{2}\|
=\displaystyle= (2f(x1)2f(x2)+λ1(2Ω(x1)2Ω(x2))+(λ1λ2)2Ω(x2))d2\displaystyle\|\left(\nabla^{2}f(x_{1})-\nabla^{2}f(x_{2})+\lambda_{1}(\nabla^{2}\Omega(x_{1})-\nabla^{2}\Omega(x_{2}))+(\lambda_{1}-\lambda_{2})\nabla^{2}\Omega(x_{2})\right)d_{2}\|
\displaystyle\leq (Lx1x2+λ1Lx1x2+|λ1λ2|L)d2h(L(λ+1)d1+λL)d2.\displaystyle\left(L\|x_{1}-x_{2}\|+\lambda_{1}L\|x_{1}-x_{2}\|+\left|\lambda_{1}-\lambda_{2}\right|L\right)\|d_{2}\|\leq h\left(L(\lambda+1)\|d_{1}\|+\lambda L\right)\|d_{2}\|.

Hence, it holds that

(19) μ~d2μ~d1μ~d2d1H~1(d2d1)\displaystyle\tilde{\mu}\|d_{2}\|-\tilde{\mu}\|d_{1}\|\leq\tilde{\mu}\|d_{2}-d_{1}\|\leq\|\tilde{H}_{1}(d_{2}-d_{1})\|
\displaystyle\leq h22Ld12+hLd1+h(L(λ+1)d1+λL)d2.\displaystyle\tfrac{h^{2}}{2}\cdot L\|d_{1}\|^{2}+hL\|d_{1}\|+h\left(L(\lambda+1)\|d_{1}\|+\lambda L\right)\|d_{2}\|.

When hh satisfies the step-size condition in Section A.2, it holds that h(L(λ+1)d1+λL)μ~3h\left(L(\lambda+1)\|d_{1}\|+\lambda L\right)\leq\tfrac{\tilde{\mu}}{3} and h22Ld1+hLμ~3\tfrac{h^{2}}{2}\cdot L\|d_{1}\|+hL\leq\tfrac{\tilde{\mu}}{3}. Applying these bounds to Eq. 19, we have 23μ~d243μ~d1\tfrac{2}{3}\tilde{\mu}\|d_{2}\|\leq\tfrac{4}{3}\tilde{\mu}\|d_{1}\|, which implies that d22d1\|d_{2}\|\leq 2\|d_{1}\|. Applying this to Eq. 19, it holds that

H~1(d1d2)\displaystyle\|\tilde{H}_{1}(d_{1}-d_{2})\| h22Ld12+hLd1+h(L(λ+1)d1+λL)d2\displaystyle\leq\tfrac{h^{2}}{2}\cdot L\|d_{1}\|^{2}+hL\|d_{1}\|+h\left(L(\lambda+1)\|d_{1}\|+\lambda L\right)\|d_{2}\|
2hL(1+λ)(d1+d12).\displaystyle\leq 2hL(1+\lambda)(\|d_{1}\|+\|d_{1}\|^{2}).

Hence, it holds that d1d2h2L(1+λ)(d1+d12)μ+λσ2hLτ(d1+d12)\|d_{1}-d_{2}\|\leq h\cdot\tfrac{2L(1+\lambda)(\|d_{1}\|+\|d_{1}\|^{2})}{\mu+\lambda\sigma}\leq 2hL\tau(\|d_{1}\|+\|d_{1}\|^{2}).

Based on the results of Lemma A.2 and Lemma A.4, we now prove Lemma 4.2, which analyzes the one-step residual accumulation of the trapezoid update in Eq. 13.

Proof A.6 (Proof of Lemma 4.2).

We begin the local residual analysis by estimating the difference between Fλ~(x~)\nabla F_{\tilde{\lambda}}(\tilde{x}) and λ~λFλ(x)\tfrac{\tilde{\lambda}}{\lambda}\cdot\nabla F_{\lambda}(x). For simplicity, let xx, x~\tilde{x}, λ\lambda, and λ~\tilde{\lambda} denote xkx_{k}, xk+1x_{k+1}, λk\lambda_{k}, and λk+1\lambda_{k+1} in the trapezoid update Eq. 13, respectively, and let d~:=12(d1+d2)\tilde{d}:=\tfrac{1}{2}(d_{1}+d_{2}), so that x~x=hd~\tilde{x}-x=h\tilde{d}. Then we have

R=Fλ~(x~)λ~λFλ(x)=f(x~)f(x)(A)+λ~(Ω(x~)Ω(x))(B)+(1λ~λ)f(x)(C).R=\nabla F_{\tilde{\lambda}}(\tilde{x})-\tfrac{\tilde{\lambda}}{\lambda}\cdot\nabla F_{\lambda}(x)=\underbrace{\nabla f(\tilde{x})-\nabla f(x)}_{(A)}+\underbrace{\tilde{\lambda}(\nabla\Omega(\tilde{x})-\nabla\Omega(x))}_{(B)}+\underbrace{(1-\tfrac{\tilde{\lambda}}{\lambda})\nabla f(x)}_{(C)}.

We will approach the result in Eq. 14 by splitting and rearranging the terms in (A)(A), (B)(B) and (C)(C). From Lemma A.1, it holds that

(RA):=(A)2f(x)(x~x)(A)12D3f(x)[x~x]2(A3)L6x~x3=h36Ld~3.\|(RA)\|:=\|(A)-\underbrace{\nabla^{2}f(x)(\tilde{x}-x)}_{(A^{\prime})}-\underbrace{\tfrac{1}{2}D^{3}f(x)[\tilde{x}-x]^{2}}_{(A3)}\|\leq\tfrac{L}{6}\|\tilde{x}-x\|^{3}=\tfrac{h^{3}}{6}L\|\tilde{d}\|^{3}.

From the update Eq. 13, we have (A)=h2f(x)d1(A1)+h22f(x)(d2d1)(A2)(A^{\prime})=\underbrace{h\nabla^{2}f(x)d_{1}}_{(A1)}+\underbrace{\tfrac{h}{2}\nabla^{2}f(x)(d_{2}-d_{1})}_{(A2)}. For (B)(B), using Lemma A.1 and based on update Eq. 13 we have

(B)\displaystyle(B) =λ2Ω(x)(x~x)+(λ~λ)2Ω(x)(x~x)(B3)+λ~12D3Ω(x)[x~x]2(B4)+(RB)\displaystyle=\lambda\nabla^{2}\Omega(x)(\tilde{x}-x)+\underbrace{(\tilde{\lambda}-\lambda)\nabla^{2}\Omega(x)(\tilde{x}-x)}_{(B3)}+\underbrace{\tilde{\lambda}\cdot\tfrac{1}{2}D^{3}\Omega(x)[\tilde{x}-x]^{2}}_{(B4)}+(RB)
=hλ2Ω(x)d1(B1)+h2λ2Ω(x)(d2d1)(B2)+(B3)+(B4)+(RB),\displaystyle=\underbrace{h\lambda\nabla^{2}\Omega(x)d_{1}}_{(B1)}+\underbrace{\tfrac{h}{2}\lambda\nabla^{2}\Omega(x)(d_{2}-d_{1})}_{(B2)}+(B3)+(B4)+(RB),

where (RB)=λ~Ω(x~)Ω(x)2Ω(x)(x~x)12D3Ω(x)[x~x]2h3λ~L6d~3\|(RB)\|=\tilde{\lambda}\|\nabla\Omega(\tilde{x})-\nabla\Omega(x)-\nabla^{2}\Omega(x)(\tilde{x}-x)-\tfrac{1}{2}D^{3}\Omega(x)[\tilde{x}-x]^{2}\|\leq\tfrac{h^{3}\tilde{\lambda}L}{6}\|\tilde{d}\|^{3}. Also, (C)=(hh22)f(x)=hf(x)(C1)h22f(x)(C2)(C)=(h-\tfrac{h^{2}}{2})\nabla f(x)=\underbrace{h\nabla f(x)}_{(C1)}-\underbrace{\tfrac{h^{2}}{2}\nabla f(x)}_{(C2)}. By the definition of d1d_{1}, we have

(20) (A1)+(B1)+(C1)=h2f(x)d1+hλ2Ω(x)d1+hf(x)=0.(A1)+(B1)+(C1)=h\nabla^{2}f(x)d_{1}+h\lambda\nabla^{2}\Omega(x)d_{1}+h\nabla f(x)=0.

Using Lemma A.4, we have

(21) (A2)+(B2)\displaystyle(A2)+(B2) =h22f(x)(x1x2)(D1)+h2(2f(x1)2f(x2))d2(D2)\displaystyle=\underbrace{\tfrac{h}{2}\nabla^{2}f(x)(x_{1}-x_{2})}_{(D1)}+\underbrace{\tfrac{h}{2}(\nabla^{2}f(x_{1})-\nabla^{2}f(x_{2}))d_{2}}_{(D2)}
+h2(λ12Ω(x1)λ22Ω(x2))d2(D3)+(RD),\displaystyle\phantom{=}+\underbrace{\tfrac{h}{2}(\lambda_{1}\nabla^{2}\Omega(x_{1})-\lambda_{2}\nabla^{2}\Omega(x_{2}))d_{2}}_{(D3)}+(RD),

where (RD)h2L2x1x22=h34Ld12\|(RD)\|\leq\tfrac{h}{2}\cdot\tfrac{L}{2}\|x_{1}-x_{2}\|^{2}=\tfrac{h^{3}}{4}L\|d_{1}\|^{2}. Furthermore, it holds that

(22) (D1)+(C2)=h222f(x)d1h22f(x)=h22λ2Ω(x)d1(E1),(D1)+(C2)=-\tfrac{h^{2}}{2}\nabla^{2}f(x)d_{1}-\tfrac{h^{2}}{2}\nabla f(x)=\underbrace{\tfrac{h^{2}}{2}\lambda\nabla^{2}\Omega(x)d_{1}}_{(E1)},

and

(A3)+(D2)\displaystyle(A3)+(D2) =h22D3f(x)[12(d1+d2)]2+h2D3f(x)[x1x2,d2]+(R1)\displaystyle=\tfrac{h^{2}}{2}D^{3}f(x)\left[\tfrac{1}{2}(d_{1}+d_{2})\right]^{2}+\tfrac{h}{2}D^{3}f(x)[x_{1}-x_{2},d_{2}]+(R1)
=h22D3f(x)[12(d1d2)]2(R2)+(R1),\displaystyle=\underbrace{\tfrac{h^{2}}{2}D^{3}f(x)\left[\tfrac{1}{2}(d_{1}-d_{2})\right]^{2}}_{(R2)}+(R1),

where

(R1)\displaystyle\|(R1)\| =h2(2f(x1)2f(x2))d2h2D3f(x)[x1x2,d2]\displaystyle=\|\tfrac{h}{2}\left(\nabla^{2}f(x_{1})-\nabla^{2}f(x_{2})\right)d_{2}-\tfrac{h}{2}D^{3}f(x)[x_{1}-x_{2},d_{2}]\|
h2L2x1x22d2=h34Ld12d2.\displaystyle\leq\tfrac{h}{2}\cdot\tfrac{L}{2}\|x_{1}-x_{2}\|^{2}\|d_{2}\|=\tfrac{h^{3}}{4}L\|d_{1}\|^{2}\|d_{2}\|.

We further have (D3)=h2λ2(2Ω(x1)2Ω(x2))d2(E2)+h2(λ1λ2)2Ω(x1)d2(E3)(D3)=\underbrace{\tfrac{h}{2}\lambda_{2}\left(\nabla^{2}\Omega(x_{1})-\nabla^{2}\Omega(x_{2})\right)d_{2}}_{(E2)}+\underbrace{\tfrac{h}{2}(\lambda_{1}-\lambda_{2})\nabla^{2}\Omega(x_{1})d_{2}}_{(E3)}, and

(B4)+(E2)\displaystyle(B4)+(E2) =λ~12D3Ω(x)[x~x]2+h2λ2D3Ω(x1)[x1x2,d2]+(R5)\displaystyle=\tilde{\lambda}\cdot\tfrac{1}{2}D^{3}\Omega(x)[\tilde{x}-x]^{2}+\tfrac{h}{2}\lambda_{2}D^{3}\Omega(x_{1})[x_{1}-x_{2},d_{2}]+(R5)
=h22(λ~λ2)D3Ω(x)[d1+d22]2(R3)+h22λ2D3Ω(x1)[d1d22]2(R4)+(R5),\displaystyle=\underbrace{\tfrac{h^{2}}{2}(\tilde{\lambda}-\lambda_{2})D^{3}\Omega(x)\left[\tfrac{d_{1}+d_{2}}{2}\right]^{2}}_{(R3)}+\underbrace{\tfrac{h^{2}}{2}\lambda_{2}D^{3}\Omega(x_{1})\left[\tfrac{d_{1}-d_{2}}{2}\right]^{2}}_{(R4)}+(R5),

where

(R5)\displaystyle\|(R5)\| =h2λ2(2Ω(x1)2Ω(x2))d2h2λ2D3Ω(x)[x1x2,d2]\displaystyle=\|\tfrac{h}{2}\lambda_{2}\left(\nabla^{2}\Omega(x_{1})-\nabla^{2}\Omega(x_{2})\right)d_{2}-\tfrac{h}{2}\lambda_{2}D^{3}\Omega(x)[x_{1}-x_{2},d_{2}]\|
h2λ2L2x1x22d2=h34λ2Ld12d2,\displaystyle\leq\tfrac{h}{2}\lambda_{2}\cdot\tfrac{L}{2}\|x_{1}-x_{2}\|^{2}\|d_{2}\|=\tfrac{h^{3}}{4}\lambda_{2}L\|d_{1}\|^{2}\|d_{2}\|,

and

(B3)+(E1)+(E3)=\displaystyle(B3)+(E1)+(E3)= (h+h22)λ2Ω(x)hd1+d22+h22λ2Ω(x)d1\displaystyle(-h+\tfrac{h^{2}}{2})\lambda\nabla^{2}\Omega(x)h\cdot\tfrac{d_{1}+d_{2}}{2}+\tfrac{h^{2}}{2}\lambda\nabla^{2}\Omega(x)d_{1}
+h2(hh2)λ2Ω(x)d2=h34λ2Ω(x)(d1d2)(R6).\displaystyle+\tfrac{h}{2}\left(h-h^{2}\right)\lambda\nabla^{2}\Omega(x)d_{2}=\underbrace{\tfrac{h^{3}}{4}\lambda\nabla^{2}\Omega(x)(d_{1}-d_{2})}_{(R6)}.

Hence, it holds that

(23) (R)\displaystyle\|(R)\|\leq (RA)+(RB)+(RD)+(R1)+(R2)\displaystyle\|(RA)\|+\|(RB)\|+\|(RD)\|+\|(R1)\|+\|(R2)\|
+(R3)+(R4)+(R5)+(R6)\displaystyle+\|(R3)\|+\|(R4)\|+\|(R5)\|+\|(R6)\|
\displaystyle\leq h3L6d1+d223+h3λ~L6d1+d223+h3L6d12\displaystyle h^{3}\cdot\tfrac{L}{6}\|\tfrac{d_{1}+d_{2}}{2}\|^{3}+h^{3}\cdot\tfrac{\tilde{\lambda}L}{6}\|\tfrac{d_{1}+d_{2}}{2}\|^{3}+h^{3}\cdot\tfrac{L}{6}\|d_{1}\|^{2}
+h3L4d12d2+h2L8d1d22+h4λL4d1+d222\displaystyle+h^{3}\cdot\tfrac{L}{4}\|d_{1}\|^{2}\|d_{2}\|+h^{2}\cdot\tfrac{L}{8}\|d_{1}-d_{2}\|^{2}+h^{4}\cdot\tfrac{\lambda L}{4}\|\tfrac{d_{1}+d_{2}}{2}\|^{2}
+h2λ2L8d1d22+h3λ2L4d12d2+h3λL4d1d2\displaystyle+h^{2}\cdot\tfrac{\lambda_{2}L}{8}\|d_{1}-d_{2}\|^{2}+h^{3}\cdot\tfrac{\lambda_{2}L}{4}\|d_{1}\|^{2}\|d_{2}\|+h^{3}\cdot\tfrac{\lambda L}{4}\|d_{1}-d_{2}\|
\displaystyle\leq h3L3d13+h3Lλ3d13+h3L6d12+h3L2d13\displaystyle h^{3}\cdot\tfrac{L}{3}\|d_{1}\|^{3}+h^{3}\cdot\tfrac{L\lambda}{3}\|d_{1}\|^{3}+h^{3}\cdot\tfrac{L}{6}\|d_{1}\|^{2}+h^{3}\cdot\tfrac{L}{2}\|d_{1}\|^{3}
+h2L8d1d22+h4λL2d12+h2λL8d1d22\displaystyle+h^{2}\cdot\tfrac{L}{8}\|d_{1}-d_{2}\|^{2}+h^{4}\cdot\tfrac{\lambda L}{2}\|d_{1}\|^{2}+h^{2}\cdot\tfrac{\lambda L}{8}\|d_{1}-d_{2}\|^{2}
+h3λL2d13+h3λL4d1d2\displaystyle+h^{3}\cdot\tfrac{\lambda L}{2}\|d_{1}\|^{3}+h^{3}\cdot\tfrac{\lambda L}{4}\|d_{1}-d_{2}\|
\displaystyle\leq h3L(1+λ)(d13+d12+d1)+h2L(1+λ)8d1d22.\displaystyle h^{3}L(1+\lambda)(\|d_{1}\|^{3}+\|d_{1}\|^{2}+\|d_{1}\|)+h^{2}\cdot\tfrac{L(1+\lambda)}{8}\|d_{1}-d_{2}\|^{2}.

Applying the result in Lemma A.4 to Eq. 23, we further get

(24) d1d22(2hLτ(d1+d12))28h2L2τ2(d12+d14).\|d_{1}-d_{2}\|^{2}\leq(2hL\tau(\|d_{1}\|+\|d_{1}\|^{2}))^{2}\leq 8h^{2}L^{2}\tau^{2}(\|d_{1}\|^{2}+\|d_{1}\|^{4}).

Also, by applying the results in Lemma A.2 to Eqs. 23 and 24, we have

(R)h33L(1+G)3+h42L3τ2(1+G)4.\displaystyle\|(R)\|\leq h^{3}\cdot 3L(1+G)^{3}+h^{4}\cdot 2L^{3}\tau^{2}(1+G)^{4}.