

New Methods for Parametric Optimization via Differential Equations

Heyuan Liu Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720 ([email protected]).    Paul Grigas Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720 ([email protected]). This author’s research is supported, in part, by NSF AI Institute for Advances in Optimization Award 2112533.
Abstract

We develop and analyze several different second-order algorithms for computing a near-optimal solution path of a convex parametric optimization problem with a smooth Hessian. Our algorithms are inspired by a differential equation perspective on the parametric solution path and do not rely on the specific structure of the objective function. We present computational guarantees that bound the oracle complexity of achieving a near-optimal solution path under different sets of smoothness assumptions. Under these assumptions, our results improve on the best-known guarantees for grid search methods. We also develop second-order conjugate gradient variants that avoid exact Hessian computations and exact linear system solves. We present computational results that demonstrate the effectiveness of our methods over grid search methods on both real and synthetic datasets. On large-scale problems, we demonstrate significant speedups of the second-order conjugate gradient variants as compared to the standard versions of our methods.

1 Introduction

In many applications of interest, it is necessary to solve not just a single optimization problem but also an entire collection of related problems. In these settings, some or all of the objects involved in defining the objective function or constraints of an optimization problem depend on one or more parameters, and we would like to solve the problem as a function of these parameters. Generally, a parametric optimization problem can be written as:

(1)  $P(\lambda):\quad\min_{x\in S(\lambda)}F(x,\lambda),$

where $\lambda$ belongs to the set of interest $\Lambda\subseteq\mathbb{R}^{m}$, and the feasible sets satisfy $S(\lambda)\subseteq\mathbb{R}^{p}$. There are many problems of interest that are formulated as parametric optimization problems of the form Eq. 1. Subsequently, as indicated by Guddat et al. [19], there are several strong motivations to design algorithms for Eq. 1, including but not limited to: (i) the need to solve problems arising in application areas like regularized regression with cross-validation (see, e.g., Osborne et al. [33]) and model predictive control (see, e.g., Garcia et al. [14]), (ii) as a building block for developing globally convergent algorithms by the approach of path-following as is done in interior-point methods (see, e.g., Nesterov and Nemirovskii [31]), and (iii) the need to address multi-objective optimization, for instance, finding the Pareto frontier of a two-objective optimization problem. Depending on the assumptions made, the goal may be to find global/local optimal solutions or Karush–Kuhn–Tucker (KKT) points of the problem $P(\lambda)$ for $\lambda\in\Lambda$.

In the rest of the paper, we will focus on a more specific problem, in which we assume that: (i) the dependence on $\lambda$ is linear, that is, $F(x,\lambda)$ can be written as $f(x)+\lambda\cdot\Omega(x)$, (ii) both $f(\cdot)$ and $\Omega(\cdot)$ are convex functions with certain properties, and (iii) the feasible set $S(\lambda)$ is the entire vector space $\mathbb{R}^{p}$ for all $\lambda\in\Lambda$. That is, we focus on the parametric optimization problem:

(2)  $P(\lambda):\quad F_{\lambda}^{*}:=\min_{x\in\mathbb{R}^{p}}\left\{F_{\lambda}(x):=f(x)+\lambda\cdot\Omega(x)\right\},$

where $f(\cdot):\mathbb{R}^{p}\rightarrow\mathbb{R}$ and $\Omega(\cdot):\mathbb{R}^{p}\rightarrow\mathbb{R}$ are twice-differentiable functions such that $f(\cdot)$ is $\mu$-strongly convex for some $\mu\geq 0$ and $\Omega(\cdot)$ is $\sigma$-strongly convex for some $\sigma>0$, both with respect to the $\ell_{2}$-norm (denoted by $\|\cdot\|$ herein). For any $\lambda>0$, let

(3)  $x(\lambda):=\arg\min_{x\in\mathbb{R}^{p}}F_{\lambda}(x)$

denote the unique optimal solution of $P(\lambda)$ defined in Eq. 2. We are interested in the problem of (approximately) computing the set of optimal solutions $\{x(\lambda):\lambda\in\Lambda\}$, where $\Lambda=[\lambda_{\min},\lambda_{\max}]$ is the set of interest for some $0<\lambda_{\min}<\lambda_{\max}$, and we also refer to this set of solutions as the (exact) solution path. An important set of problems in practice and a popular line of research involves computing the solution path of regularized machine learning problems, including the LASSO as in Efron et al. [12], Osborne et al. [33] and the SVM problem as in Hastie et al. [20]. In these works, algorithms are designed to compute the exact piecewise-linear solution path. Also, in the context of interior-point methods for constrained convex optimization (see, for instance, Nesterov and Nemirovskii [31] and Renegar [36]), $f(\cdot)$ represents the objective function and $\Omega(\cdot)$ represents the barrier function induced by the constraints of the original problem. Note that the application to interior-point methods requires a slightly more general variant of problem (2) where $\Omega(\cdot):\mathbb{R}^{p}\rightarrow\mathbb{R}\cup\{+\infty\}$. Interior-point methods generally start with the problem $P(\lambda)$ for a moderately large $\lambda_{0}$ and terminate when $\lambda_{k}<\delta$ for some small enough positive threshold $\delta$.
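As a concrete (hypothetical) instance of problem (2), the following sketch takes $f(\cdot)$ to be a logistic loss (convex, so $\mu=0$) and $\Omega(x)=\tfrac{1}{2}\|x\|^{2}$ (so $\sigma=1$); the data `A` and labels `b` are synthetic placeholders and are not taken from the paper.

```python
import numpy as np

# Hypothetical instance of problem (2): logistic loss (convex, mu = 0)
# plus the quadratic regularizer Omega(x) = 0.5 * ||x||^2 (sigma = 1).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))       # synthetic features
b = rng.choice([-1.0, 1.0], size=50)   # synthetic labels in {-1, +1}

def f(x):
    # average logistic loss: mean of log(1 + exp(-b_i * a_i' x))
    return np.mean(np.logaddexp(0.0, -b * (A @ x)))

def Omega(x):
    return 0.5 * x @ x

def F(x, lam):
    # objective of P(lambda) in (2)
    return f(x) + lam * Omega(x)
```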

Recently, there has been growing interest in developing algorithms for computing an approximate solution path of a generic problem such as (2). Rosset [37] considers applying exact Newton steps on prespecified grids, and Ndiaye et al. [29] consider adaptive methods to discretize the interval $[\lambda_{\min},\lambda_{\max}]$, for example. These grid search type methods, which discretize the interval $[\lambda_{\min},\lambda_{\max}]$ and subsequently solve a sequence of individual optimization problems, take a very black-box approach. A natural question is: can we “open the black box” by developing a methodology that better exploits the structure of the solution path? We answer this question positively by introducing a differential equation perspective on the solution path, which enables us to better reveal and exploit its underlying structure. This deeper understanding enables us to build more efficient algorithms and to present improved computational guarantees.

In particular, we derive an ordinary differential equation with an initial condition whose solution is the exact solution path of (2). The dynamics of the ODE that we derive resemble, but are distinct from, the dynamics of a path-wise version of Newton’s method. Based on the ODE, we propose efficient algorithms to generate approximate solution paths $\hat{x}(\cdot):[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}^{p}$ and provide the corresponding complexity analysis. The metric we consider is the $\ell_{2}$-norm of the gradient of the regularized problem, namely $\|\nabla F_{\lambda}(\hat{x}(\lambda))\|_{2}$, and we use the largest norm along the approximate path, $\sup_{\lambda\in\Lambda}\|\nabla F_{\lambda}(\hat{x}(\lambda))\|_{2}$, to represent the accuracy of an approximate path $\hat{x}(\lambda)$ (as formally defined in Definition 3.1). To analyze the computational cost of our proposed algorithms, we consider the oracle complexity – either in terms of full Hessian or Hessian-vector product/gradient evaluations – to obtain an $\epsilon$-accurate solution path. Note that considering the oracle complexity is in contrast to other works that consider the number of individual optimization problems that need to be solved (see, for example, Giesen et al. [16], Ndiaye et al. [29]), as well as the number of ordinary differential equations that need to be solved (see, for example, Zhou and Lange [46]).

1.1 Contributions

The first set of contributions of this paper concerns the perspective on the solution path of (2) from an ordinary differential equation point of view. We derive an ordinary differential equation with an initial condition whose solution is the solution path of (2), based on the first-order optimality conditions of (2). With this observation, we propose a novel and efficient way to approximate the entire solution path. Unlike existing results on the solution paths of LASSO or SVM problems, our derivation does not rely on the special structure of the optimization problem and holds for general objective functions.

The second set of contributions of this paper concerns the design of efficient algorithms and the corresponding oracle complexity analysis. Classical error analysis of numerical ordinary differential equation methods [15, 40] provides only asymptotic results, and the global error has an exponential dependence on the Lipschitz constant and the length of the time interval. In contrast, we design new update schemes to compute an approximate solution path and develop nonasymptotic complexity bounds. In particular, we apply a semi-implicit Euler method to the ordinary differential equation in order to compute approximate optimal solutions under a finite set of penalty coefficients. Then, we incorporate linear interpolation, which is usually missing either in practice or in lower bound complexity analyses, to generate nearly optimal solutions under the other penalty coefficients within the range of parameter values of interest. The two-step algorithm guarantees an $\epsilon$-accurate solution path within at most $\mathcal{O}(\frac{1}{\epsilon})$ gradient and Hessian evaluations as well as linear system solves. When the objective function has higher-order smoothness properties, we modify the traditional trapezoid method in numerical differential equations and design a new update scheme, which guarantees an $\epsilon$-accurate path within at most $\mathcal{O}(\frac{1}{\sqrt{\epsilon}})$ Hessian evaluations. It is important to emphasize that the complexity results in this paper are in terms of the number of operations (for example, Hessian evaluations), rather than the number of sub-problems that need to be solved (for example, solving a single numerical ODE or individual optimization problems) as has been studied in prior work such as Giesen et al. [16], Zhou and Lange [46] and Ndiaye et al. [29]. Further extensions of our proposed algorithms to Runge-Kutta methods are possible, but we omit them for brevity. We also provide a detailed computational evaluation of our algorithms and existing methods, including several experiments with synthetic data, the breast cancer dataset [11], and the leukemia dataset [18].

The third set of contributions of the paper concerns second-order conjugate gradient type methods and computational guarantees in the presence of inexact gradient and Hessian oracles, as well as approximate linear equation solvers. When the dimension of the problem is high, computing the Hessian and/or solving linear systems becomes a computational bottleneck. To avoid this, one would like an algorithm that only requires approximate directions at each iteration. We first consider the case where the (absolute) numerical error incurred in the calculation of a new direction $d_{k}$ is bounded by some $\delta_{k}>0$. We show that our algorithms are robust to numerical error in the sense that the additional errors of inexact directions do not accumulate and do not depend on the condition number. We extend the complexity analysis to the case where the numerical error $\delta_{k}$ has a uniform upper bound $\alpha\epsilon$ for $\alpha\in(0,1)$ and show that the Euler method maintains $\mathcal{O}(\frac{1}{\epsilon})$ complexity, and the trapezoid method maintains $\mathcal{O}(\frac{1}{\sqrt{\epsilon}})$ complexity when there is higher-order smoothness. We then propose variations of the aforementioned algorithms that only require gradient and Hessian-vector product oracles, rather than gradient and Hessian oracles as well as a linear system solver. We also leverage the previous analysis in the case of inexact directions in order to provide computational complexity results for the second-order conjugate gradient-type algorithms, which have the same order in $\epsilon$ as the results for the exact methods mentioned above. Our results demonstrate that our algorithms are more robust, and the second-order conjugate gradient variations require less computational cost compared to existing methods.

1.2 Related Literature

We now discuss previous work related to our algorithm and analysis from three distinct aspects.

Other path methods and comparison of results

As previously mentioned, for the LASSO and SVM problems, the exact solution path is piecewise linear and can be computed by path-following methods such as the least angle regression (LARS) algorithm proposed by Efron et al. [12], Hastie et al. [20] and Osborne et al. [33]. Additional problems whose solution paths are piecewise linear are considered by Rosset and Zhu [38], and Mairal and Yu [24] showed that the number of breakpoints in the solution path can be exponential in the number of data points. Generalized linear regression problems with $\ell_{1}$ regularization are considered by Yuan and Zou [44] and Zhou and Wu [47] via LARS-based methods. Another line of work focuses on computing approximate solution paths for specific problems, including the elastic net in Friedman et al. [13], the SVM problem in Bach et al. [7] and [16], matrix completion with nuclear norm regularization in Mazumder et al. [26], and other structural convex problems in Giesen et al. [17] and Loosli et al. [23]. For problems with nonconvex but coordinate-decomposable regularization functions, a coordinate descent-based algorithm was considered by Mazumder et al. [27] and Wang et al. [42]. Closest to our problem setup, Rosset [37] considered a general problem where $f(\cdot)$ and $\Omega(\cdot)$ have third-order derivatives and provided an algorithm which applies exact Newton steps on equally spaced grids starting from the optimal solution of the nonregularized problem. The lower bound complexity analysis when the approximate solution path is limited to a piecewise constant function is considered by Giesen et al. [16], Ndiaye et al. [29].

Related global complexity analysis of second-order methods

Newton-like methods are an important class of algorithms in optimization. Some notable lines of work include interior-point methods (Nesterov and Nemirovskii [31], Renegar [36]) and applications in regression problems (Kim et al. [22]) as well as the Levenberg-Marquardt method (Moré [28]). In practice, techniques are often incorporated to ensure global convergence, such as line searches (Ralph [35]) and trust-region techniques (Conn et al. [10]). The global complexity analysis of Newton and higher-order methods with regularization has also received much recent interest, such as the work of Nesterov [30], Nesterov and Polyak [32] and Polyak [34] and the references therein. In our paper, we make similar types of assumptions as in the global complexity analysis of regularized second and higher-order methods and we also prove global complexity results for the class of second-order algorithms considered herein.

Related work on differential equations and optimization methods

Early works, including the work of Alvarez and Attouch [2] and Alvarez et al. [3], analyzed inertial dynamical systems driven by gradient and Newton flows with application to optimization problems. Newton-like dynamic systems of monotone inclusions with connections to Levenberg-Marquardt and regularized Newton methods were also analyzed by Abbas et al. [1], Attouch and Svaiter [4] and Bot and Csetnek [9]. Due to the recent popularity of the accelerated gradient descent method in the machine learning and optimization communities, the limiting dynamics of different optimization methods and their discretization have thus received much renewed interest in recent years; see Attouch et al. [6], Scieur et al. [39], Su et al. [41], Wibisono et al. [43], Zhang et al. [45] and the references therein.

1.3 Organization

The paper is organized as follows. In Section 2, we derive the ordinary differential equation with an initial condition whose solution is the exact solution path (3) and establish the existence and uniqueness of the solution of the differential equation. In Section 3, we leverage the ODE to develop a numerical algorithm to compute an approximate solution path of (2), and derive the corresponding complexity analysis in Theorem 3.2. In Section 4, we propose a multistage method that is beneficial when the functions $f(\cdot)$ and $\Omega(\cdot)$ have higher-order smoothness, and we also provide its complexity analysis in Theorem 4.1. In Section 5, we extend our results to the presence of inexact oracles, and as a direct application we propose second-order conjugate gradient variants of the aforementioned algorithms, which avoid exact Hessian evaluations and exact solutions of linear systems. Section 6 contains detailed computational experiments comparing the proposed algorithms and grid search methods on real and synthetic datasets.

1.4 Notation

For a positive integer $n$, let $[n]:=\{1,\dots,n\}$. For a vector-valued function $y(t):\mathbb{R}\rightarrow\mathbb{R}^{p}$ that can be written as $y(t)=(y_{1}(t),\dots,y_{p}(t))$, we say that $y(\cdot)$ is differentiable if $y_{i}(\cdot)$ is differentiable for all $i=1,\dots,p$, and let $\frac{\textnormal{d}y}{\textnormal{d}t}$ be the derivative of $y(t)$, namely $\frac{\textnormal{d}y}{\textnormal{d}t}=(\frac{\textnormal{d}y_{1}}{\textnormal{d}t},\dots,\frac{\textnormal{d}y_{p}}{\textnormal{d}t})$. Let ${\bm{1}}_{p}$ and ${\bm{1}}_{p\times p}$ denote the $p$-dimensional all-ones vector and the $p\times p$ all-ones matrix, respectively. Throughout the paper, we fix the norm $\|\cdot\|$ on $\mathbb{R}^{p}$ to be the $\ell_{2}$-norm, which is defined by $\|x\|:=\|x\|_{2}=\sqrt{x^{T}x}$. Also, in a slight abuse of notation, we use $\|\cdot\|$ to represent the operator norm, i.e., the induced $\ell_{2}$-norm $\|\cdot\|_{2}$ on $\mathbb{R}^{n\times p}$, which is defined by $\|A\|:=\|A\|_{2,2}=\max_{\|x\|_{2}\leq 1}\|Ax\|_{2}$.

2 Ordinary Differential Equation Characterization of Solution Path

Let us begin by describing a differential equations perspective on the solution path Eq. 3 of Eq. 2 that will prove fruitful in the development of efficient computational methods. First, we introduce a reparameterization in terms of an auxiliary variable $t\geq 0$ (thought of as “time”), whereby for a given $T>0$ we introduce functions $\lambda(\cdot):[0,T]\to[\lambda_{\min},\lambda_{\max}]$ and $\xi(\cdot):[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}$ such that: $\xi(\cdot)$ is Lipschitz, $\lambda(\cdot)$ is differentiable on $(0,T)$, and we have $\frac{\textnormal{d}\lambda}{\textnormal{d}t}=\xi(\lambda(t))$ for all $t\in(0,T)$. In a slight abuse of notation, we define the path with respect to $t$ as $\{x(t):=x(\lambda(t)):t\in[0,T]\}$. Now notice that, for any $t\in[0,T]$, the first-order optimality condition for the problem $P(\lambda(t))$ states that $\nabla f(x(t))+\lambda(t)\nabla\Omega(x(t))=0$. By differentiating both sides of the previous equation with respect to $t$, we have $\nabla^{2}f(x(t))\cdot\frac{\textnormal{d}x}{\textnormal{d}t}+\nabla\Omega(x(t))\cdot\frac{\textnormal{d}\lambda}{\textnormal{d}t}+\lambda(t)\nabla^{2}\Omega(x(t))\cdot\frac{\textnormal{d}x}{\textnormal{d}t}=0$. Rearranging the above and again using $\frac{\textnormal{d}\lambda}{\textnormal{d}t}=\xi(\lambda(t))$ yields $\frac{\textnormal{d}x}{\textnormal{d}t}=-\left(\nabla^{2}f(x(t))+\lambda(t)\nabla^{2}\Omega(x(t))\right)^{-1}\xi(\lambda(t))\nabla\Omega(x(t))$. Then, applying the fact that $\nabla f(x(t))+\lambda(t)\nabla\Omega(x(t))=0$ yields $\frac{\textnormal{d}x}{\textnormal{d}t}=\left(\nabla^{2}f(x(t))+\lambda(t)\nabla^{2}\Omega(x(t))\right)^{-1}\cdot\frac{\xi(\lambda(t))}{\lambda(t)}\nabla f(x(t))$. Thus, we arrive at the following autonomous system

(4)  $\frac{\textnormal{d}\lambda}{\textnormal{d}t}=\xi(\lambda),\quad\frac{\textnormal{d}x}{\textnormal{d}t}=v(x,\lambda):=\left(\nabla^{2}f(x)+\lambda\nabla^{2}\Omega(x)\right)^{-1}\frac{\xi(\lambda)}{\lambda}\nabla f(x),$

for $t\in[0,T]$.
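As a minimal illustration of the vector field in (4), the following sketch evaluates $v(x,\lambda)$ for a strongly convex quadratic $f(x)=\tfrac{1}{2}x^{T}Qx+c^{T}x$ and $\Omega(x)=\tfrac{1}{2}\|x\|^{2}$, with the default choice $\xi(\lambda)=-\lambda$ used later in the paper; `Q` and `c` are illustrative placeholders, not objects from the paper.

```python
import numpy as np

def v(x, lam, Q, c, xi=lambda lam: -lam):
    """Vector field of (4) for f(x) = 0.5 x'Qx + c'x and Omega(x) = 0.5||x||^2,
    so that grad f = Qx + c, Hess f = Q, and Hess Omega = I."""
    grad_f = Q @ x + c
    H = Q + lam * np.eye(len(x))            # Hess f + lam * Hess Omega
    return np.linalg.solve(H, (xi(lam) / lam) * grad_f)

# Tiny usage example with placeholder data
Q = np.diag([1.0, 2.0, 3.0])
c = np.array([1.0, -1.0, 0.5])
print(v(np.zeros(3), lam=1.0, Q=Q, c=c))
```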

By considering specific choices of $\xi(\cdot)$ and $\Omega(\cdot)$, the system (4) generalizes some previously studied methodologies in parametric optimization. First, consider the scenario with an equally spaced discretization of the interval $[0,T]$, namely $t_{k}=k\cdot h$ for some fixed step-size $h>0$. Thus, the sequence $\lambda_{k}:=\lambda(t_{k})$ is approximately given by $\lambda_{k+1}\approx\lambda_{k}+h\cdot\xi(\lambda_{k})$. Intuitively, the choice of $\xi(\cdot)$ controls the dynamics of $\lambda(\cdot)$ and generalizes some previously considered sequences $\{\lambda_{k}\}$ for the problem (2). For example, letting $\xi(\lambda)\equiv 1$ we recover the arithmetic sequence in Rosset [37], and letting $\xi(\lambda)\equiv-\lambda$ we recover the geometric sequence in Ndiaye et al. [29]. In addition, consider the special case where $\Omega(x)=\frac{1}{2}\|x\|^{2}$. Then the dynamics for $x(t)$ in Eq. 4 are similar to the limiting dynamics of the proximal Newton method (also known as the Levenberg-Marquardt regularization procedure [25] for convex optimization problems). A similar dynamic for monotone inclusions, which includes finding a zero of the gradient of a convex function, is analyzed in Attouch et al. [5], Attouch and Svaiter [4].

Before developing and presenting algorithms designed to compute the approximate solution path based on (4), we first verify that the system (4) has a unique trajectory. The following proposition states conditions on $f(\cdot)$ and $\Omega(\cdot)$ such that $v(\cdot,\cdot)$ defined in Eq. 4 is continuous in $\lambda\in[\lambda_{\min},\lambda_{\max}]$ and is uniformly $L_{v}$-Lipschitz continuous with respect to $x$, namely, $\|v(x_{1},\lambda)-v(x_{2},\lambda)\|\leq L_{v}\|x_{1}-x_{2}\|$ for any $\lambda\in[\lambda_{\min},\lambda_{\max}]$ and $x_{1},x_{2}\in\mathbb{R}^{p}$. This uniform Lipschitz property ensures that the above system has a unique trajectory, which therefore coincides with the solution path defined in (3).

Proposition 2.1 (Theorem 5.3.1 of Gautschi [15]).

Suppose that $\nabla^{2}f(\cdot)$, $\nabla^{2}\Omega(\cdot)$, $\nabla f(\cdot)$, $f(\cdot)$ are all $L$-Lipschitz continuous, $f(\cdot)$ is $\mu$-strongly convex for $\mu\geq 0$, and $\Omega(\cdot)$ is $\sigma$-strongly convex for $\sigma>0$. Suppose further that $\xi(\cdot)$ is Lipschitz continuous on $[\lambda_{\min},\lambda_{\max}]$ and satisfies $|\frac{\xi(\lambda)}{\lambda}|\leq C$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$. Then, it holds that $v(\cdot,\cdot)$ defined in Eq. 4 is continuous in $\lambda\in[\lambda_{\min},\lambda_{\max}]$ and is uniformly $L_{v}$-Lipschitz continuous with $L_{v}=\frac{LC}{\mu+\lambda_{\min}\sigma}+\frac{L^{2}C(1+\lambda_{\max})}{(\mu+\lambda_{\min}\sigma)^{2}}$, and Eq. 4 has a unique trajectory $(\lambda(t),x(t))$ for $t\in[0,T]$.

3 Discretization and Complexity Analysis

In this section, we present algorithms to compute an approximate solution path based on discretizations of Eq. 4, along with the corresponding complexity analysis. The primary error metric we consider is the $\ell_{2}$-norm of the gradient across the entire interval $[\lambda_{\min},\lambda_{\max}]$, as formally presented in Definition 3.1.

Definition 3.1.

An approximate solution path $\hat{x}(\cdot):[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}^{p}$ to the parametric optimization problem Eq. 2 has accuracy $\epsilon\geq 0$ if $\|\nabla F_{\lambda}(\hat{x}(\lambda))\|\leq\epsilon$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$.
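In practice, the accuracy of Definition 3.1 can be estimated by evaluating the gradient norm on a fine grid of $\lambda$ values. The sketch below does this for user-supplied callables (`x_hat`, `grad_f`, `grad_Omega`, all assumptions for illustration); it only approximates the supremum over $[\lambda_{\min},\lambda_{\max}]$.

```python
import numpy as np

def path_accuracy(x_hat, grad_f, grad_Omega, lam_min, lam_max, n_grid=1000):
    """Approximate sup over lambda of ||grad F_lambda(x_hat(lambda))||
    by evaluating on a geometric grid (cf. Definition 3.1)."""
    lams = np.geomspace(lam_min, lam_max, n_grid)
    return max(
        np.linalg.norm(grad_f(x_hat(lam)) + lam * grad_Omega(x_hat(lam)))
        for lam in lams
    )

# Example: f(x) = 0.5 (x-1)^2, Omega(x) = 0.5 x^2, exact path x(lam) = 1/(1+lam)
print(path_accuracy(lambda lam: np.array([1.0 / (1.0 + lam)]),
                    grad_f=lambda x: x - 1.0,
                    grad_Omega=lambda x: x,
                    lam_min=0.1, lam_max=10.0))   # ~0 up to rounding
```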

Notice that the strong convexity of the objective function $F_{\lambda}(\cdot)$ for all $\lambda>0$ immediately implies that an $\epsilon$-accurate solution path $\hat{x}(\cdot)$ also has the optimality gap guarantee $F_{\lambda}(\hat{x}(\lambda))-F_{\lambda}^{\ast}\leq\frac{\epsilon^{2}}{2(\mu+\lambda\sigma)}\leq\frac{\epsilon^{2}}{2(\mu+\lambda_{\min}\sigma)}$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$. Algorithm 1 below presents a two-step “meta-algorithm” for computing an approximate solution path $\hat{x}(\cdot)$. Inspired by numerical methods for solving ordinary differential equations, we first design several update schemes that iteratively update $(x_{k},\lambda_{k})$ by exploiting the dynamics in Eq. 4. We use the function $\psi(\cdot,\cdot):\mathbb{R}^{p}\times[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}^{p}\times[\lambda_{\min},\lambda_{\max}]$ to denote a generic update rule in the meta-algorithm below, and we consider several different specific examples herein. Then we apply an interpolation method $\mathcal{I}(\cdot)$ to resolve the previously computed sequence of points into an approximate path $\hat{x}(\cdot)$ over $[\lambda_{\min},\lambda_{\max}]$.

Algorithm 1 Meta-algorithm for computing an approximate solution path $\hat{x}(\cdot)$
1:  Input: initial point $x_{0}\in\mathbb{R}^{p}$, total number of iterations $K\geq 1$, update rule $\psi(\cdot,\cdot)$, and interpolation method $\mathcal{I}(\cdot)$
2:  Initialize regularization parameter $\lambda_{0}\leftarrow\lambda_{\max}$
3:  for $k=0,\dots,K-1$ do
4:     Update $(x_{k+1},\lambda_{k+1})\leftarrow\psi(x_{k},\lambda_{k})$
5:  end for
6:  return  $\hat{x}(\cdot)\leftarrow\mathcal{I}\left(\{(x_{k},\lambda_{k})\}_{k=0}^{K}\right)$

We develop oracle complexity results for different update schemes and interpolation methods in terms of the number of gradient computations, Hessian computations, and linear system solves required to compute an $\epsilon$-accurate approximate path. In this section, we stick to a simple version of Algorithm 1 based on the application of the semi-implicit Euler method and linear interpolation to specify the update rule $\psi(\cdot,\cdot)$ and the interpolation method $\mathcal{I}(\cdot)$. In particular, the semi-implicit Euler discretization of Eq. 4 is

(5)  $\lambda_{k+1}=\lambda_{k}+h\cdot\xi(\lambda_{k}),\quad x_{k+1}=x_{k}+h\cdot v(x_{k},\lambda_{k+1}),$

and the linear interpolation $\mathcal{I}_{\textnormal{linear}}(\cdot):\{(x_{k},\lambda_{k})\}_{k=0}^{K}\to\hat{x}(\cdot)$ is defined by $\hat{x}(\lambda):=\alpha x_{k}+(1-\alpha)x_{k+1}$ with $\alpha=\frac{\lambda-\lambda_{k+1}}{\lambda_{k}-\lambda_{k+1}}$ for all $\lambda\in[\lambda_{k+1},\lambda_{k}]$ and $k\in\{0,\ldots,K-1\}$.

The algorithm with the exponentially decaying parameter sequence

Recalling the update rule Eq. 5, we see that the function $\xi(\cdot)$ (or equivalently, $\lambda(\cdot)$) is still to be determined. In practical settings, the value of $\lambda$ usually decreases exponentially from $\lambda_{\max}$ to $\lambda_{\min}$. This choice of penalty scale parameters $\{\lambda_{k}\}$ arises in the solution path for linear models, see Friedman et al. [13], and in interior-point methods, see Renegar [36]. Although our analysis holds for a broad class of $\lambda(\cdot)$, we first present the version with an exponentially decaying parameter sequence, namely $\lambda(t)=e^{-t}\lambda(0)$. This specific version of Algorithm 1 is formally described in Algorithm 2.

Algorithm 2 Euler method for computing an approximate solution path $\hat{x}(\cdot)$
1:  Input: initial point $x_{0}\in\mathbb{R}^{p}$, total number of iterations $K\geq 1$
2:  Initialize regularization parameter $\lambda_{0}\leftarrow\lambda_{\max}$, set step-size $h\leftarrow 1-(\frac{\lambda_{\min}}{\lambda_{\max}})^{\frac{1}{K}}$
3:  for $k=0,\dots,K-1$ do
4:     $\lambda_{k+1}\leftarrow\lambda_{k}-h\cdot\lambda_{k}$
5:     $x_{k+1}\leftarrow x_{k}-h\left(\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k})\right)^{-1}\nabla f(x_{k})$
6:  end for
7:  return  $\hat{x}(\cdot)\leftarrow\mathcal{I}_{\textnormal{linear}}\left(\{(x_{k},\lambda_{k})\}_{k=0}^{K}\right)$ according to linear interpolation
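The following sketch implements the loop of Algorithm 2 for the quadratic test problem $f(x)=\tfrac{1}{2}x^{T}Qx+c^{T}x$, $\Omega(x)=\tfrac{1}{2}\|x\|^{2}$ (placeholders, not from the paper), so that step 5 amounts to solving $(Q+\lambda_{k+1}I)d=\nabla f(x_{k})$; the returned breakpoints are meant to be passed to the linear interpolation of Section 3.2.

```python
import numpy as np

def euler_path(x0, Q, c, lam_max, lam_min, K):
    """Algorithm 2 for f(x) = 0.5 x'Qx + c'x and Omega(x) = 0.5||x||^2."""
    h = 1.0 - (lam_min / lam_max) ** (1.0 / K)        # step 2
    lams, xs = [lam_max], [x0.copy()]
    lam, x = lam_max, x0.copy()
    for _ in range(K):
        lam = lam - h * lam                           # step 4
        grad_f = Q @ x + c
        d = np.linalg.solve(Q + lam * np.eye(len(x)), grad_f)
        x = x - h * d                                 # step 5
        lams.append(lam)
        xs.append(x.copy())
    return np.array(lams), np.array(xs)               # breakpoints for interpolation

# Placeholder data; x0 exactly solves P(lam_max) for this quadratic instance
Q = np.diag([1.0, 2.0, 3.0])
c = np.array([1.0, -1.0, 0.5])
lam_max, lam_min, K = 100.0, 0.1, 500
x0 = -np.linalg.solve(Q + lam_max * np.eye(3), c)
lams, xs = euler_path(x0, Q, c, lam_max, lam_min, K)
# distance of the final iterate to the exact minimizer at lam_min (should be small)
print(np.linalg.norm(xs[-1] - (-np.linalg.solve(Q + lam_min * np.eye(3), c))))
```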

Before going further into the detailed analysis, we first state the computational guarantee of Algorithm 2. In our complexity analysis, we make the following smoothness assumptions on $f(\cdot)$ and $\Omega(\cdot)$.

Assumption.

In addition to $\mu$-strong convexity of $f(\cdot)$ and $\sigma$-strong convexity of $\Omega(\cdot)$, these functions have $L$-Lipschitz gradients and Hessians, where $L>0$ is an upper bound on the four relevant Lipschitz constants. Furthermore, we assume that $f^{\ast}:=\min_{x}f(x)>-\infty$ and that $G>0$ is an upper bound on the norm of the gradients of $f(\cdot)$ and $\Omega(\cdot)$ on the level set $\{x\in\mathbb{R}^{p}:f(x)\leq f(x_{0})\}$.

Theorem 3.2 is our main result concerning the complexity of Algorithm 2 and demonstrates that, in terms of the accuracy parameter $\epsilon$, Algorithm 2 requires $\mathcal{O}(1/\epsilon)$ iterations to compute an $\epsilon$-accurate solution path.

Theorem 3.2.

Suppose that the assumption in Section 3 holds, let $\epsilon>0$ be the desired accuracy, and suppose that the initial point $x_{0}$ satisfies $\|\nabla F_{\lambda_{\max}}(x_{0})\|\leq\frac{\epsilon}{4}$. Let $T:=\log(\lambda_{\max}/\lambda_{\min})$, let $\tau=\max\{\frac{1+\lambda_{\min}}{\mu+\lambda_{\min}\sigma},\frac{1+\lambda_{\max}}{\mu+\lambda_{\max}\sigma}\}$, and let

(6)  $K_{\textnormal{E}}:=\left\lceil\max\left\{2T,\ \frac{\sqrt{LG}\tau T}{\sqrt{3}},\ \frac{4(f(x_{0})-f^{\ast})\tau LT}{\epsilon},\ \frac{2\sqrt{L}(\tau G+1)T}{\sqrt{\epsilon}}\right\}\right\rceil.$

If the total number of iterations $K$ satisfies $K\geq K_{\textnormal{E}}$, then Algorithm 2 returns an $\epsilon$-accurate solution path.

Remark 3.3.

Grid search type methods for computing approximate solution paths are proposed in Giesen et al. [16], Ndiaye et al. [29], and we follow the analysis of the latter, which covers more general cases. To ensure that the function value gap satisfies $h(x)-h^{*}\leq\epsilon^{\prime}$, we require the number of grid points to be $K=\frac{\sqrt{\tau}GT}{\sqrt{\epsilon^{\prime}}}$. For an $L$-smooth function $h(\cdot)$, $h(x)-h^{*}\leq\frac{\epsilon^{2}}{2L}$ implies $\|\nabla h(x)\|\leq\epsilon$, which is the accuracy measure used in our paper. Therefore, we need to set $\epsilon^{\prime}=\frac{\epsilon^{2}}{2L}$, and hence the number of grid points is $K=\frac{\sqrt{\tau L}GT}{\epsilon}$ when the desired accuracy of the inner problem is set to $\epsilon_{c}=\frac{\epsilon^{\prime}}{2}$. Using the exact Newton method with the last grid point as a warm start to solve the inner problem, Karimireddy et al. [21] implies that the inner complexity is $(\tau L)^{2}\log 2$, and therefore the total complexity is $\frac{(\tau L)^{5/2}GT\log 2}{\epsilon}$.

From the results in Theorem 3.2 and Remark 3.3, we can see that both our method and the grid search method have a complexity of order $\mathcal{O}(\frac{1}{\epsilon})$. In most practical cases, $f(x_{0})-f^{*}$ is smaller than $(\tau L)^{3/2}G\log 2$, which implies that the constant in Eq. 6 is better. In Section 4, we propose algorithms that utilize the higher-order smoothness of $f(\cdot)$ and $\Omega(\cdot)$ and achieve better complexity results.

3.1 Semi-Implicit Euler Update Scheme

Herein we provide the computational guarantee of the semi-implicit Euler update scheme defined in Eq. 5, where the main object of interest is the accuracy at each grid point. Let $r_{k}:=\|\nabla F_{\lambda_{k}}(x_{k})\|$ denote the accuracy at $(x_{k},\lambda_{k})$. We consider a broad family of $\xi(\cdot)$ in the following analysis, although in Algorithm 2 and the corresponding complexity analysis in Theorem 3.2, we only consider the special scenario where $\lambda(t)=e^{-t}\lambda(0)$. We first present the computational guarantees of Taylor expansion approximations given the Lipschitz continuity of the Hessian in Lemma 3.4.

Lemma 3.4 (Lemma 1 in Nesterov and Polyak [32]).

Suppose $\phi(\cdot):S\to\mathbb{R}$ has an $L$-Lipschitz Hessian for some convex set $S\subseteq\mathbb{R}^{d}$. Then the following inequalities hold for all $x,y\in S$:

  1. (i)

     $\left\|\nabla\phi(y)-\nabla\phi(x)-\nabla^{2}\phi(x)(y-x)\right\|\leq\frac{L}{2}\left\|y-x\right\|^{2}$.

  2. (ii)

     $\left\|\phi(y)-\phi(x)-\nabla\phi(x)^{T}(y-x)-\frac{1}{2}(y-x)^{T}\nabla^{2}\phi(x)(y-x)\right\|\leq\frac{L}{6}\left\|y-x\right\|^{3}$.

Based on the results in Lemma 3.4, we provide a local analysis of $r_{k}$, that is, of how the norm of the gradient at each grid point accumulates. Lemma 3.5 provides an upper bound on $r_{k+1}$ in terms of $r_{k}$, the accuracy at the previous iteration.

Lemma 3.5.

Suppose the assumption in Section 3 holds. Then the discretization Eq. 5 has the following guarantee for all $k\geq 0$:

$$r_{k+1}\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{2}\cdot\frac{L(1+\lambda_{k+1})}{2}\left\|v(x_{k},\lambda_{k+1})\right\|^{2}.$$

Proof 3.6.

It holds that

$$\begin{aligned}
r_{k+1} &= \left\|\nabla f(x_{k+1})+\lambda_{k+1}\nabla\Omega(x_{k+1})\right\| \\
&\leq \left\|\nabla f(x_{k+1})-\nabla f(x_{k})-\nabla^{2}f(x_{k})(x_{k+1}-x_{k})\right\| \\
&\quad+\left\|\lambda_{k+1}\left(\nabla\Omega(x_{k+1})-\nabla\Omega(x_{k})-\nabla^{2}\Omega(x_{k})(x_{k+1}-x_{k})\right)\right\| \\
&\quad+\left\|\lambda_{k+1}\left(\nabla\Omega(x_{k})+\nabla^{2}\Omega(x_{k})(x_{k+1}-x_{k})\right)+\nabla f(x_{k})+\nabla^{2}f(x_{k})(x_{k+1}-x_{k})\right\| \\
&\leq \frac{L}{2}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{\lambda_{k+1}L}{2}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{\lambda_{k+1}}{\lambda_{k}}\left\|\nabla f(x_{k})+\lambda_{k}\nabla\Omega(x_{k})\right\| \\
&= \frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{2}\cdot\frac{L(1+\lambda_{k+1})}{2}\left\|v(x_{k},\lambda_{k+1})\right\|^{2},
\end{aligned}$$

where the first inequality follows from the triangle inequality; in the second inequality, we apply Item i of Lemma 3.4 to the first two terms, while the two third terms are equal to each other by the definition of the update Eq. 5.

Lemma 3.5 provides the first technical result for the semi-implicit Euler update scheme. When the assumption in Section 3 holds, Eq. 5 has the computational guarantee $\tfrac{r_{j+1}}{\lambda_{j+1}}\leq\tfrac{r_{j}}{\lambda_{j}}+\mathcal{O}(h^{2})$. Telescoping these inequalities for $j=0,1,\dots,k-1$ with $k\leq K=T/h$ yields $r_{k}\sim\mathcal{O}(h)$. Therefore, one would expect a uniform $\mathcal{O}(h)$ bound on the accuracies $r_{k}$ at the points $x_{k}$ for all $k=0,\dots,K$. We formalize this idea in the following lemma.

Lemma 3.7.

Suppose the assumption in Section 3 holds, $\lambda_{k+1}\geq\lambda_{\min}$, $\xi(\lambda_{j})<0$ for all $j\leq k$, and the step-size $h$ satisfies the following condition for all $0\leq j<k$:

(7)  $h\leq\min\left\{\frac{\lambda_{j}}{-2\xi(\lambda_{j})},\ \sqrt{\frac{3\lambda_{j}(\mu+\lambda_{j+1}\sigma)^{2}}{-\xi(\lambda_{j})LG}}\right\}.$

Then the sequence $\{(x_{k},\lambda_{k})\}_{k=0}^{K}$ generated by the update scheme Eq. 5 satisfies $f(x_{k+1})\leq f(x_{k})$, and the corresponding accuracies $\{r_{k}\}_{k=0}^{K}$ have the following guarantee:

(8)  $\frac{r_{k}}{\lambda_{k}}\leq\frac{r_{0}}{\lambda_{0}}+2hL(f(x_{0})-f(x_{k}))\cdot\max_{j\in[k-1]}\left\{\frac{-(1+\lambda_{j+1})\xi(\lambda_{j})}{\lambda_{j}\lambda_{j+1}(\mu+\lambda_{j+1}\sigma)}\right\}.$

Proof 3.8.

We will use induction to show that $f(x_{j+1})\leq f(x_{j})$, which implies that $x_{k}\in S_{x_{0}}$ for all $k\in\{0,\dots,K\}$. First, suppose $x_{j}\in S_{x_{0}}$ for some $j\in\{0,\dots,K-1\}$. Let $d_{j}$ denote $v(x_{j},\lambda_{j+1})$, so that $\|d_{j}\|\leq\frac{G}{\mu+\lambda_{j+1}\sigma}$. Since $h+\frac{\lambda_{j}}{2\xi(\lambda_{j})}\leq 0$ and $h^{2}\leq\frac{3\lambda_{j}(\mu+\lambda_{j+1}\sigma)^{2}}{-\xi(\lambda_{j})LG}\leq\frac{3\lambda_{j}(\mu+\lambda_{j+1}\sigma)}{-\xi(\lambda_{j})L\left\|d_{j}\right\|}$, it holds that

$$\begin{aligned}
f(x_{j+1}) &\leq f(x_{j})+h\cdot\nabla f(x_{j})^{T}d_{j}+h^{2}\cdot\frac{1}{2}d_{j}^{T}\nabla^{2}f(x_{j})d_{j}+h^{3}\cdot\frac{1}{6}L\left\|d_{j}\right\|^{3} \\
&\leq f(x_{j})+\frac{h}{4}\cdot\frac{\lambda_{j}}{\xi(\lambda_{j})}\cdot(\mu+\lambda_{j+1}\sigma)\left\|d_{j}\right\|^{2}.
\end{aligned}$$

Since $\xi(\lambda_{j})<0$, it holds that $f(x_{j+1})\leq f(x_{j})$, which implies that $x_{j+1}\in S_{x_{0}}$. Then by induction we conclude that $x_{j}\in S_{x_{0}}$ for all $j\leq k$. Also, combining the result in Lemma 3.5, for all $j\leq k$, it holds that

$$\frac{r_{j+1}}{\lambda_{j+1}}\leq\frac{r_{j}}{\lambda_{j}}+h^{2}\cdot\frac{L(1+\lambda_{j+1})}{2\lambda_{j+1}}\left\|d_{j}\right\|^{2}\leq\frac{r_{j}}{\lambda_{j}}+h\cdot\frac{L(1+\lambda_{j+1})}{2\lambda_{j+1}}\cdot\frac{4(f(x_{j})-f(x_{j+1}))}{\frac{\lambda_{j}}{-\xi(\lambda_{j})}\cdot(\mu+\lambda_{j+1}\sigma)}.$$

Taking the summation over $j$ from $0$ to $k-1$, we obtain Eq. 8.

Lemma 3.7 provides the computational guarantees of Algorithm 2 with any decreasing sequence $\{\lambda_{k}\}$. Note that the upper bound on the accuracy at the point $x_{k}$ still involves a constant related to the sequence $\{\lambda_{k}\}$. We therefore consider a family of sequences $\{\lambda_{k}\}$ and derive a simpler upper bound.

Proposition 3.9.

Suppose the assumption in Section 3 holds, $\lambda_{k+1}\geq\lambda_{\min}$, $-\lambda\leq\xi(\lambda)<0$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$, and the step-size $h$ satisfies

(9)  $h\leq\min\left\{\frac{1}{2},\ \sqrt{\frac{3}{\tau^{2}LG}}\right\}.$

Then the iterates of the discretization Eq. 5 satisfy

(10)  $r_{k+1}\leq r_{0}+2h\tau L(f(x_{0})-f(x_{k+1})).$

Proof 3.10.

Since $\xi(\lambda_{j})\in[-\lambda_{j},0)$, we have $\frac{\lambda_{j}}{-\xi(\lambda_{j})}\geq 1$. Therefore, we have

$$\frac{\lambda_{j}}{-2\xi(\lambda_{j})}\geq\frac{1}{2}\quad\text{and}\quad\sqrt{\frac{3\lambda_{j}(\mu+\lambda_{j+1}\sigma)^{2}}{-\xi(\lambda_{j})LG}}\geq\sqrt{\frac{3(\mu+\lambda_{\min}\sigma)^{2}}{LG}},$$

and hence condition Eq. 9 implies condition Eq. 7. Also, since $\frac{1+\lambda}{\mu+\lambda\sigma}$ is monotone in $[\lambda_{\min},\lambda_{\max}]$, it holds that $\tau=\max_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1+\lambda}{\mu+\lambda\sigma}$. Therefore, we have

$$\max_{j\in[k-1]}\left\{\frac{-(1+\lambda_{j+1})\xi(\lambda_{j})}{\lambda_{j}\lambda_{j+1}(\mu+\lambda_{j+1}\sigma)}\right\}\leq\frac{1+\lambda_{k}}{\lambda_{k}(\mu+\lambda_{k}\sigma)}\leq\frac{\tau}{\lambda_{k}}.$$

Applying the above inequality to Eq. 8, we obtain Eq. 10.

Proposition 3.9 provides a uniform upper bound on the accuracy of all almost-optimal solutions $x_{k}$. When $\frac{\xi(\lambda)}{\lambda}\in[-1,0)$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$, Algorithm 2 generates a solution sequence $\{x_{k}\}$ such that $\|\nabla F_{\lambda_{k}}(x_{k})\|\sim\mathcal{O}(h)+r_{0}$.

3.2 Linear Interpolation

In the previous subsection, we provided a general accuracy analysis at all nearly-optimal solutions $x_{k}$. The next step is to construct the entire path $\hat{x}(\lambda):[\lambda_{\min},\lambda_{\max}]\rightarrow\mathbb{R}^{p}$ based on these nearly-optimal solutions. In this section, we analyze the second procedure in Algorithm 2, i.e., linear interpolation. First, we recall the definition of linear interpolation for $x(\cdot),\lambda(\cdot)$:

(11)  $\hat{\lambda}(t):=\alpha\lambda_{k}+(1-\alpha)\lambda_{k+1},\quad\hat{x}(t):=\alpha x_{k}+(1-\alpha)x_{k+1},$

where $\alpha:=\frac{t_{k+1}-t}{h}$, $t\in[t_{k},t_{k+1}]$, and $k\in\{0,\dots,K-1\}$. That is, given an arbitrary $\lambda\in[\lambda_{\min},\lambda_{\max}]$, we first select $t$ such that $\hat{\lambda}(t)=\lambda$, and then we output $\hat{x}:=\hat{x}(t)$ as a nearly-optimal solution to problem $P(\lambda)$. The following lemma provides a uniform upper bound on $\|\nabla F_{\lambda}(\hat{x}(\lambda))\|$ for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$.
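A small sketch of how the interpolated path (11) can be evaluated at an arbitrary $\lambda\in[\lambda_{\min},\lambda_{\max}]$, assuming the breakpoints $\lambda_{k}$ are stored in decreasing order as produced by Algorithm 2 or Algorithm 3 (names and inputs are illustrative, not from the paper):

```python
import numpy as np

def eval_path(lam, lams, xs):
    """Linear interpolation (11): lams is decreasing, xs[k] pairs with lams[k]."""
    # locate the segment [lams[k+1], lams[k]] containing lam
    k = np.searchsorted(-np.asarray(lams), -lam, side="right") - 1
    k = min(max(k, 0), len(lams) - 2)          # clamp to a valid segment
    alpha = (lam - lams[k + 1]) / (lams[k] - lams[k + 1])
    return alpha * xs[k] + (1.0 - alpha) * xs[k + 1]

# Example usage: x_hat_lam = eval_path(0.5, lams, xs) for Algorithm 2 output
```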

Theorem 3.11.

Suppose the assumption in Section 3 holds. Let $r_{\max}=\max_{0\leq k\leq K}r_{k}$. Then, the linear interpolation $\hat{x}(\cdot),\hat{\lambda}(\cdot)$ of the sequence $\{(x_{k},\lambda_{k})\}_{k=0}^{K}$ generated by Eq. 11 has the following computational guarantee for all $t\in[t_{0},t_{K}]$:

$$\left\|\nabla F_{\hat{\lambda}(t)}(\hat{x}(t))\right\|\leq r_{\max}+\frac{L}{8}\cdot\max_{k\in[K-1]}\left\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h|\xi(\lambda_{k})|\|x_{k+1}-x_{k}\|\right\}.$$

Proof 3.12.

First suppose $t\in[t_{k},t_{k+1}]$. For simplicity, define $x:=\hat{x}(t)$, $\lambda:=\hat{\lambda}(t)$, $\delta_{1}:=\nabla f(x_{k})+\lambda_{k}\nabla\Omega(x_{k})$, and $\delta_{2}:=\nabla f(x_{k+1})+\lambda_{k+1}\nabla\Omega(x_{k+1})$. By the triangle inequality and Item i of Lemma 3.4, it holds that

$$\left\|\alpha\nabla f(x_{k})+(1-\alpha)\nabla f(x_{k+1})-\nabla f(x)\right\|\leq\frac{\alpha(1-\alpha)L}{2}\left\|x_{k}-x_{k+1}\right\|^{2}.$$

Also, applying a similar argument to $\nabla\Omega(\cdot)$, it holds that

$$\begin{aligned}
&\left\|\lambda\nabla\Omega(x)-\alpha\lambda_{k}\nabla\Omega(x_{k})-(1-\alpha)\lambda_{k+1}\nabla\Omega(x_{k+1})\right\| \\
\leq\ &\left\|\lambda(\alpha\nabla\Omega(x_{k})+(1-\alpha)\nabla\Omega(x_{k+1})-\nabla\Omega(x))\right\|+\left\|\alpha(\lambda-\lambda_{k})\nabla\Omega(x_{k})+(1-\alpha)(\lambda-\lambda_{k+1})\nabla\Omega(x_{k+1})\right\| \\
\leq\ &\frac{\lambda L}{2}\alpha(1-\alpha)\left\|x_{k+1}-x_{k}\right\|^{2}+\left\|\alpha(1-\alpha)(\lambda_{k+1}-\lambda_{k})(\nabla\Omega(x_{k+1})-\nabla\Omega(x_{k}))\right\| \\
\leq\ &\frac{\lambda L}{8}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{h\left|\xi(\lambda_{k})\right|L}{4}\left\|x_{k+1}-x_{k}\right\|.
\end{aligned}$$

Combining the above two inequalities and applying the triangle inequality, we have

$$\begin{aligned}
&\left\|\nabla f(x)+\lambda\nabla\Omega(x)-\alpha\delta_{1}-(1-\alpha)\delta_{2}\right\| \\
\leq\ &\frac{L}{8}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{\lambda L}{8}\left\|x_{k+1}-x_{k}\right\|^{2}+\frac{\left|\xi(\lambda_{k})\right|Lh}{4}\left\|x_{k+1}-x_{k}\right\|.
\end{aligned}$$

Since $\left\|\alpha\delta_{1}+(1-\alpha)\delta_{2}\right\|\leq\alpha r_{k}+(1-\alpha)r_{k+1}\leq r_{\max}$ and $\lambda\leq\lambda_{k}$, the triangle inequality yields the stated bound.

Theorem 3.11 provides the computational guarantee of the linear interpolation of the sequence $\{(x_{k},\lambda_{k})\}_{k=0}^{K}$ generated by the update scheme Eq. 5. Observing that $\|x_{k+1}-x_{k}\|$ is of the order $\mathcal{O}(h)$, we see that the additional error incurred by linear interpolation is of the order $\mathcal{O}(h^{2})$. Together with the $\mathcal{O}(h)$ accuracy from the update scheme Eq. 5, we are able to provide a computational guarantee on the accuracy of the approximate path generated by Algorithm 2. In what follows, we provide a uniform bound on $\|\nabla F_{\hat{\lambda}(t)}(\hat{x}(t))\|$ for a family of $\lambda(t)$ and discretizations.

3.3 Computational Guarantee for the Exponentially Decaying Parameter Sequence

Under the scenario $\lambda(t)=\lambda_{\max}\cdot\exp(-t)$, we have $\xi(\lambda)=-\lambda$, and $\xi(\cdot)$ satisfies the assumption in Proposition 3.9. Therefore, we extend the result of Theorem 3.11 and provide an explicit uniform bound on the accuracy of the path.

Proposition 3.13.

Suppose the assumption in Section 3 holds, and the step-size $h$ satisfies the condition in Eq. 9. Let $\hat{x}(\cdot):[0,T]\to\mathbb{R}^{p}$ and $\hat{\lambda}(\cdot):[0,T]\to[\lambda_{\min},\lambda_{\max}]$ denote the approximate solution path generated by Algorithm 2. Then, we have the following computational guarantee for all $\lambda\in[\lambda_{\min},\lambda_{\max}]$:

(12)  $\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\|\leq\left\|\nabla F_{\lambda_{\max}}(x_{0})\right\|+2h\tau L(f(x_{0})-f^{*})+\frac{h^{2}L}{8}\cdot(\tau G+1)^{2}.$

Proof 3.14.

First we extend the result in Proposition 3.9, and it holds that

$$r_{\max}\leq\max_{k\in[K]}r_{k}\leq r_{0}+2hL(f(x_{0})-f^{*})\cdot\tau.$$

Also, for the result in Theorem 3.11, we further have

$$\begin{aligned}
&\frac{L}{8}\cdot\max_{k\in[K-1]}\left\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h|\xi(\lambda_{k})|\|x_{k+1}-x_{k}\|\right\} \\
\leq\ &\frac{h^{2}L}{8}\cdot\max_{k\in[K-1]}\left\{\frac{(1+\lambda_{k})G^{2}}{(\mu+\lambda_{k}\sigma)^{2}}+\frac{2\lambda_{k}G}{\mu+\lambda_{k}\sigma}\right\}\leq\frac{h^{2}L}{8}\cdot(\tau G+1)^{2}.
\end{aligned}$$

By combining the above two inequalities, Proposition 3.9, and Theorem 3.11, we obtain Eq. 12.

In Algorithm 2, the sequence $\{\lambda_{k}\}_{k=0}^{K}$ is given by $\lambda_{k+1}=(1-h)\lambda_{k}$, which implies $\lambda_{\min}=(1-h)^{K}\lambda_{\max}$ and hence $h=1-(\tfrac{\lambda_{\min}}{\lambda_{\max}})^{1/K}$. Applying this to Proposition 3.13, we arrive at the complexity analysis with respect to the number of iterations. We formalize the complexity analysis in the following proof of Theorem 3.2, which was stated at the beginning of this section.

Proof 3.15 (Proof of Theorem 3.2).

The conditions $K\geq\max\left\{2T,\frac{\sqrt{LG}\tau T}{\sqrt{3}}\right\}$ and $h=1-\left(\frac{\lambda_{\min}}{\lambda_{\max}}\right)^{1/K}\leq\frac{T}{K}$ guarantee that the step-size $h$ satisfies Eq. 9. Also, $K\geq\frac{4(f(x_{0})-f^{\ast})\tau LT}{\epsilon}$ and $K\geq\frac{2\sqrt{L}(\tau G+1)T}{\sqrt{\epsilon}}$ guarantee that $2h\tau L(f(x_{0})-f^{*})\leq\frac{\epsilon}{2}$ and $\frac{h^{2}L}{8}\cdot(\tau G+1)^{2}\leq\frac{\epsilon}{4}$. Hence Algorithm 2 returns an $\epsilon$-accurate solution path.

Recall that the assumption of Theorem 3.2 requires a good initialization $x_{0}$ satisfying $\|\nabla F_{\lambda_{\max}}(x_{0})\|\leq\frac{\epsilon}{4}$. In practice, we can either run a convex optimization algorithm to find an $x_{0}$ satisfying the initial condition or use the initialization suggested as follows, which comes with a computational guarantee when the function $\Omega(\cdot)$ is structured, i.e., when the minimizer of $\Omega(\cdot)$ is easy to compute. Let $x_{\Omega}:=\arg\min_{x\in\mathbb{R}^{p}}\Omega(x)$ and $x_{0}:=x_{\Omega}-(\nabla^{2}f(x_{\Omega})+\lambda_{\max}\nabla^{2}\Omega(x_{\Omega}))^{-1}\nabla f(x_{\Omega})$. Intuitively, the initialization $x_{0}$ is a Newton step from the minimizer of $\Omega(\cdot)$, and we can show that $\|\nabla F_{\lambda_{\max}}(x_{0})\|\leq\frac{L(1+\lambda_{\max})\|\nabla f(x_{\Omega})\|^{2}}{2(\mu+\lambda_{\max}\sigma)^{2}}$. Notice that the values of $L$ and $\|\nabla f(x_{\Omega})\|$ are independent of $\lambda_{\max}$. Hence, when $\lambda_{\max}$ is large enough, we have $r_{0}\leq\frac{\epsilon}{4}$. Also, since $x_{\Omega}$ is the optimal solution when $\lambda=+\infty$, the initialization can be viewed as one update step of Eq. 5 starting from $x_{\Omega}$.
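For the quadratic placeholder problem used earlier ($f(x)=\tfrac{1}{2}x^{T}Qx+c^{T}x$, $\Omega(x)=\tfrac{1}{2}\|x\|^{2}$, so $x_{\Omega}=0$), the suggested initialization is a single Newton step from $x_{\Omega}$; in this quadratic case the step is exact, so the printed residual is zero up to rounding, while in general it only obeys the bound stated above.

```python
import numpy as np

# Suggested initialization for the quadratic toy problem (placeholder data):
# f(x) = 0.5 x'Qx + c'x, Omega(x) = 0.5||x||^2, hence x_Omega = 0.
Q = np.diag([1.0, 2.0, 3.0])
c = np.array([1.0, -1.0, 0.5])
lam_max = 100.0

x_Omega = np.zeros(3)                               # argmin of Omega
grad_f = Q @ x_Omega + c                            # = c at x_Omega
x0 = x_Omega - np.linalg.solve(Q + lam_max * np.eye(3), grad_f)

# Resulting initial accuracy ||grad F_{lam_max}(x0)||
print(np.linalg.norm(Q @ x0 + c + lam_max * x0))
```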

4 Multi-Stage Discretizations

In the analysis of the previous section, Algorithm 2 guarantees an $\epsilon$-accurate solution path within $\mathcal{O}(\epsilon^{-1})$ calls to the gradient and Hessian oracles and to a linear system solver. One advantage of the main result in Theorem 3.2 is that only smoothness of the Hessians of $f(\cdot)$ and $\Omega(\cdot)$ is required, and no assumption on $\epsilon$ is needed. When $f(\cdot)$ and $\Omega(\cdot)$ have better properties and $\epsilon$ is relatively small, one would like an algorithm that utilizes these properties and requires fewer oracle calls with respect to the order of $\epsilon$. Motivated by multistage numerical methods for solving differential equations, we design several update schemes that achieve higher-order accuracy. In particular, in this section we still consider the exponentially decaying parameter, that is, $\lambda(t)=e^{-t}\lambda(0)$.

4.1 The Trapezoid Method

In this section, we propose and analyze the trapezoid method, whose formal description is given in Algorithm 3. The trapezoid method is beneficial when the functions $f(\cdot)$ and $\Omega(\cdot)$ have Lipschitz continuous third-order derivatives, and it achieves higher-order accuracy than the semi-implicit Euler method. The accuracy of the output path of Algorithm 3 is of order $\mathcal{O}(h^{2})$, where $h$ is the step-size, or equivalently, $\mathcal{O}(K^{-2})$, where $K$ is the number of iterations. Moreover, we emphasize that the algorithm does not require an oracle for higher-order derivatives, but only a gradient and Hessian oracle as in the Euler method.

Algorithm 3 Trapezoid method for solution path
1:  Input: initial point $x_{0}\in\mathbb{R}^{p}$, total number of iterations $K\geq 1$
2:  Initialize parameter $\lambda_{0}\leftarrow\lambda_{\max}$, set step-size $h\leftarrow 1-\sqrt{2(\frac{\lambda_{\min}}{\lambda_{\max}})^{\frac{1}{K}}-1}$
3:  for $k=0,\dots,K-1$ do
4:     $d_{k,1}\leftarrow v(x_{k},\lambda_{k})$ and $d_{k,2}\leftarrow v(x_{k}+h\cdot d_{k,1},(1-h+h^{2})\lambda_{k})$
5:     $x_{k+1}\leftarrow x_{k}+h\cdot\frac{d_{k,1}+d_{k,2}}{2}$ and $\lambda_{k+1}\leftarrow(1-h+\frac{h^{2}}{2})\lambda_{k}$
6:  end for
7:  return  $\hat{x}(\cdot)\leftarrow\mathcal{I}_{\textnormal{linear}}\left(\{(x_{k},\lambda_{k})\}_{k=0}^{K}\right)$ according to linear interpolation
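A sketch of Algorithm 3 for the same quadratic placeholder problem as before, with $v(x,\lambda)$ as in (4) and $\xi(\lambda)=-\lambda$; note that the step-size rule in step 2 makes $\lambda_{k+1}=(1-h+h^{2}/2)\lambda_{k}$ reach exactly $\lambda_{\min}$ after $K$ iterations.

```python
import numpy as np

def v(x, lam, Q, c):
    # vector field (4) with xi(lam) = -lam, for f(x) = 0.5 x'Qx + c'x, Omega = 0.5||x||^2
    return -np.linalg.solve(Q + lam * np.eye(len(x)), Q @ x + c)

def trapezoid_path(x0, Q, c, lam_max, lam_min, K):
    """Algorithm 3: trapezoid update (13) plus the parameter recursion."""
    h = 1.0 - np.sqrt(2.0 * (lam_min / lam_max) ** (1.0 / K) - 1.0)   # step 2
    lams, xs = [lam_max], [x0.copy()]
    lam, x = lam_max, x0.copy()
    for _ in range(K):
        d1 = v(x, lam, Q, c)                                          # step 4
        d2 = v(x + h * d1, (1.0 - h + h ** 2) * lam, Q, c)
        x = x + h * (d1 + d2) / 2.0                                   # step 5
        lam = (1.0 - h + h ** 2 / 2.0) * lam
        lams.append(lam)
        xs.append(x.copy())
    return np.array(lams), np.array(xs)

# Placeholder usage
Q = np.diag([1.0, 2.0, 3.0]); c = np.array([1.0, -1.0, 0.5])
x0 = -np.linalg.solve(Q + 100.0 * np.eye(3), c)
lams, xs = trapezoid_path(x0, Q, c, 100.0, 0.1, 200)
```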

We first state the main technical assumption and computational guarantees of Algorithm 3.

Assumption.

In addition to the assumption in Section 3, we assume that the third-order directional derivatives of $f(\cdot)$ and $\Omega(\cdot)$ are $L$-Lipschitz continuous and that $\sigma\geq 1$.

Theorem 4.1.

Suppose the assumption in Section 4.1 holds, let $\epsilon>0$ be the desired accuracy, let $\tilde{\mu}:=\mu+\lambda_{\min}\sigma$, suppose that the initial point $x_{0}$ satisfies $\|\nabla F_{\lambda_{\max}}(x_{0})\|\leq\frac{\epsilon}{2}\leq\tilde{\mu}$, let $T:=1.1\log(\lambda_{\max}/\lambda_{\min})$, and let

$$K_{\textnormal{tr}}:=\left\lceil\max\left\{10T,\ \frac{8LT(1+G)}{\tilde{\mu}},\ \frac{6L^{1/2}(1+G)^{3/2}T}{\epsilon^{1/2}},\ \frac{5\tau^{2/3}L(1+G)^{4/3}T}{\epsilon^{1/3}}\right\}\right\rceil.$$

If the total number of iterations $K$ satisfies $K\geq K_{\textnormal{tr}}$, then Algorithm 3 returns an $\epsilon$-accurate solution path.

The result in Theorem 4.1 shows that we improve the total complexity to $\mathcal{O}(\frac{1}{\sqrt{\epsilon}})$, which is better than the $\mathcal{O}(\frac{1}{\epsilon})$ complexity of the Euler method and the best known results for the grid search method (see Theorem 3.2 and Remark 3.3). Similarly to the previous complexity analysis of the semi-implicit Euler method, the analysis of the trapezoid method consists of two parts: we first present the computational guarantee of the trapezoid update scheme, which is defined as

(13)  $(x_{k+1},\lambda_{k+1})\leftarrow T(x_{k},\lambda_{k}):=\left(x_{k}+h\cdot\tfrac{d_{k,1}+d_{k,2}}{2},\ (1-h+\tfrac{h^{2}}{2})\lambda_{k}\right),$

where $d_{k,1}=v(x_{k},\lambda_{k})$ and $d_{k,2}=v(x_{k}+h\cdot d_{k,1},(1-h+h^{2})\lambda_{k})$.

Lemma 4.2.

Suppose the assumption in Section 4.1 holds, $r_{k}=\|\nabla F_{\lambda_{k}}(x_{k})\|\leq\tilde{\mu}$, and the next iterate $(x_{k+1},\lambda_{k+1})$ is given by $(x_{k+1},\lambda_{k+1})=T(x_{k},\lambda_{k})$ defined in Eq. 13. Then, it holds that $(1+\lambda_{k})\|x_{k}-x_{k+1}\|\leq 3h(1+G)$, and

(14)  $r_{k+1}:=\left\|\nabla F_{\lambda_{k+1}}(x_{k+1})\right\|\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{3}\cdot 3L(1+G)^{3}+h^{4}\cdot 2L^{3}\tau^{2}(1+G)^{4}.$

We defer the proof of Lemma 4.2 to the appendix because of its length and complexity. In the proof, we mainly work with directional derivatives (see Nesterov [30] for details) and the accuracy of the Taylor expansion in Lemma 3.4. The result in Lemma 4.2 shows that the trapezoid update scheme in Eq. 13 guarantees a local error accumulation of $\mathcal{O}(h^{3})$. Moreover, we can derive an $\mathcal{O}(h^{2})$ uniform upper bound on the accuracy of all nearly optimal solutions $\{x_{k}\}$. For all other $\lambda\in[\lambda_{\min},\lambda_{\max}]$, we use linear interpolation to approximate the corresponding nearly-optimal solution. We now provide the formal proof of Theorem 4.1.

Proof 4.3 (Proof of Theorem 4.1).

Since $h-\frac{h^{2}}{2}\leq\frac{1}{K}\log(\frac{\lambda_{\max}}{\lambda_{\min}})$, it holds that

$$h\leq\min\left\{0.1,\ \frac{\tilde{\mu}}{8L(1+G)},\ \frac{\epsilon^{1/2}}{3L^{1/2}(1+G)^{3/2}},\ \frac{\epsilon^{1/3}}{3\tau^{2/3}L(1+G)^{4/3}}\right\}.$$

Then we show that $r_{k}:=\|\nabla F_{\lambda_{k}}(x_{k})\|\leq\frac{\epsilon}{2}$ for all $k$ by induction. Suppose $r_{k}\leq\frac{\epsilon}{2}$; then by Lemma 4.2, it holds that

$$r_{k+1}:=\left\|\nabla F_{\lambda_{k+1}}(x_{k+1})\right\|\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{3}\cdot 3L(1+G)^{3}+h^{4}\cdot 2L^{3}\tau^{2}(1+G)^{4}\leq\frac{\epsilon}{2}.$$

Therefore, rkϵ2r_{k}\leq\frac{\epsilon}{2} for all k{0,,K}k\in\{0,\dots,K\}. Suppose λ[λk+1,λk]\lambda\in[\lambda_{k+1},\lambda_{k}], and hence x^(λ)=αxk+(1α)xk+1\hat{x}(\lambda)=\alpha x_{k}+(1-\alpha)x_{k+1} where α=λλk+1λkλk+1\alpha=\frac{\lambda-\lambda_{k+1}}{\lambda_{k}-\lambda_{k+1}}. Applying the results in Theorem 3.11, we have

\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\|\leq\frac{\epsilon}{2}+\frac{L}{8}\max_{k\in[K-1]}\left\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h|\xi(\lambda_{k})|\|x_{k+1}-x_{k}\|\right\}\leq\frac{\epsilon}{2}+\frac{L}{8}\cdot\left(9h^{2}(1+G)^{2}+6h^{2}(1+G)\right)\leq\epsilon.
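
For concreteness, the linear interpolation used in the proof above can be sketched as follows; here lams is assumed to be the decreasing sequence of breakpoints $\lambda_{0}>\lambda_{1}>\dots>\lambda_{K}$ produced by the algorithm and xs the corresponding iterates $x_{k}$.

import numpy as np

def interpolate_path(lam, lams, xs):
    # Evaluate the approximate solution path at lam by linear
    # interpolation between the two neighboring breakpoints;
    # lams is assumed sorted in decreasing order.
    for k in range(len(lams) - 1):
        if lams[k + 1] <= lam <= lams[k]:
            alpha = (lam - lams[k + 1]) / (lams[k] - lams[k + 1])
            return alpha * np.asarray(xs[k]) + (1 - alpha) * np.asarray(xs[k + 1])
    raise ValueError("lam lies outside [lambda_min, lambda_max]")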

5 Analysis with Inexact Linear Equation Solutions and Second-Order Conjugate Gradient Variants

In this section, we present the complexity analysis of the aforementioned methods in the presence of inexact gradient and Hessian oracles and/or inexact solutions of the linear equations, as well as variants that apply second-order conjugate gradient (SOCG) type methods. We first consider the case in which the gradient and Hessian oracles are inexact and/or the linear equation solver yields approximate solutions.

5.1 Analysis with Inexact Oracle and/or Approximate Solver

At each iteration of the Euler, trapezoid, and Runge-Kutta methods, an essential subroutine is to compute the directions dk,id_{k,i}. For example, in the Euler method, dkd_{k} is given by the formula dk=v(xk,λk)=(2f(xk)+λk+12Ω(xk))1f(xk)d_{k}=v(x_{k},\lambda_{k})=-(\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k}))^{-1}\nabla f(x_{k}). In most large-scale problems, computing the Hessian matrix and solving the linear equations exactly are the computational bottlenecks. Therefore, we consider the presence of numerical error, which may be induced by an inexact gradient and Hessian oracle or by the linear equation solver. We treat both types of numerical error together.

Definition 5.1.

Suppose an exact solution dkd_{k} has the form dk=Hk1gkd_{k}=-H_{k}^{-1}g_{k}. For any constant δ0\delta\geq 0, d^k\hat{d}_{k} is a δ\delta-approximate direction with respect to dkd_{k} if Hkd^k+gkδ\|H_{k}\hat{d}_{k}+g_{k}\|\leq\delta. Furthermore, the δ\delta-approximate versions of Algorithm 2 or Algorithm 3 are the same as the original algorithms except that they use a δ\delta-approximate direction in each update.

Compared with the original update scheme in Algorithm 2, the only difference in the update scheme of the δ\delta-approximate version is that an approximate direction d^k\hat{d}_{k} is applied at each iteration instead of the exact direction v(,)v(\cdot,\cdot). We emphasize that there is no constraint on how the approximate direction d^k\hat{d}_{k} is generated; in Section 5.2 we provide several efficient methods to compute the approximate direction, along with the corresponding complexity analysis. The following two lemmas characterize the accumulation of local errors of the update scheme in the δ\delta-approximate versions of Algorithm 2 and Algorithm 3.
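
As a simple illustration of Definition 5.1, the sketch below checks whether a candidate direction is $\delta$-approximate; H and g stand for the matrix $H_{k}$ and vector $g_{k}$ of the corresponding linear system (for the Euler method, $H_{k}=\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k})$ and $g_{k}=\nabla f(x_{k})$), however they are produced.

import numpy as np

def is_delta_approximate(H, g, d_hat, delta):
    # Definition 5.1: d_hat is a delta-approximate direction with
    # respect to d = -H^{-1} g whenever ||H d_hat + g|| <= delta.
    return np.linalg.norm(H @ d_hat + g) <= delta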

Lemma 5.2.

Suppose the assumptions in Section 3 hold. Let d^k\hat{d}_{k} denote an approximate solution to v(xk,λk)v(x_{k},\lambda_{k}). Let δk=(2f(xk)+λk+12Ω(xk))d^k+f(xk)\delta_{k}=\left(\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k})\right)\hat{d}_{k}+\nabla f(x_{k}), rk=Fλk(xk)r_{k}=\|\nabla F_{\lambda_{k}}(x_{k})\|, and rk+1=Fλk+1(xk+1)r_{k+1}=\|\nabla F_{\lambda_{k+1}}(x_{k+1})\|. Then we have

(15) rk+1λk+1λkrk+h2L(1+λk+1)2d^k2+hδk.r_{k+1}\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+\frac{h^{2}L(1+\lambda_{k+1})}{2}\cdot\|\hat{d}_{k}\|^{2}+h\left\|\delta_{k}\right\|.

Furthermore, if we set λs+1=(1h)λs\lambda_{s+1}=(1-h)\lambda_{s} and δsδ\left\|\delta_{s}\right\|\leq\delta for some scalar δ>0\delta>0 and all s=0,,ks=0,\dots,k, it holds that

(16) rkλkλ0r0+2hτL(f(x0)f)+δ.r_{k}\leq\frac{\lambda_{k}}{\lambda_{0}}\cdot r_{0}+2h\tau L(f(x_{0})-f^{*})+\delta.

Proof 5.3.

First it holds that

λk+1(Ω(xk)+2Ω(xk)(xk+1xk))+f(xk)+2f(xk)(xk+1xk)\displaystyle\lambda_{k+1}\left(\nabla\Omega(x_{k})+\nabla^{2}\Omega(x_{k})(x_{k+1}-x_{k})\right)+\nabla f(x_{k})+\nabla^{2}f(x_{k})(x_{k+1}-x_{k})
=\displaystyle= λk+1λk(λkΩ(xk)+f(xk))+hδk.\displaystyle\frac{\lambda_{k+1}}{\lambda_{k}}(\lambda_{k}\nabla\Omega(x_{k})+\nabla f(x_{k}))+h\cdot\delta_{k}.

Then, applying the same technique as in Lemma 3.5 shows that Eq. 15 holds. Also, applying Eq. 15 together with Proposition 3.13 implies the result in Eq. 16.

Lemma 5.4.

Suppose rk=Fλk(xk)μ~r_{k}=\left\|\nabla F_{\lambda_{k}}(x_{k})\right\|\leq\tilde{\mu}. Let δk,1,δk,2\delta_{k,1},\delta_{k,2} denote the residual of the approximate directions d^k,1,d^k,2\hat{d}_{k,1},\hat{d}_{k,2} with δk,1,δk,2μ~\left\|\delta_{k,1}\right\|,\left\|\delta_{k,2}\right\|\leq\tilde{\mu}. Then, it holds that

(17) rk+1λk+1λkrk+h33L(2+G)3+h42L3τ2(2+G)4+h2δk,1δk,2+h22δk,1.r_{k+1}\leq\frac{\lambda_{k+1}}{\lambda_{k}}\cdot r_{k}+h^{3}\cdot 3L(2+G)^{3}+h^{4}\cdot 2L^{3}\tau^{2}(2+G)^{4}+\frac{h}{2}\left\|\delta_{k,1}-\delta_{k,2}\right\|+\frac{h^{2}}{2}\left\|\delta_{k,1}\right\|.

Proof 5.5.

We will follow the ideas in Lemmas A.2, A.4 and 4.2. Recalling the result in Lemma A.2, since H~1d1=f(x1)+δ1\tilde{H}_{1}d_{1}=-\nabla f(x_{1})+\delta_{1}, it holds that (1+λ)d12(G+1+δ1/μ~)(1+\lambda)\left\|d_{1}\right\|\leq 2(G+1+\left\|\delta_{1}\right\|/\tilde{\mu}). Also, the result in Lemma A.4 becomes

H~1(d2d1)2f(x1)(x1x2)(H~1H~2)d2+δ1δ2L2x1x22,\left\|\tilde{H}_{1}(d_{2}-d_{1})-\nabla^{2}f(x_{1})(x_{1}-x_{2})-(\tilde{H}_{1}-\tilde{H}_{2})d_{2}+\delta_{1}-\delta_{2}\right\|\leq\frac{L}{2}\left\|x_{1}-x_{2}\right\|^{2},

where H~1=2f(x1)+λ12Ω(x1)\tilde{H}_{1}=\nabla^{2}f(x_{1})+\lambda_{1}\nabla^{2}\Omega(x_{1}) and H~2=2f(x2)+λ22Ω(x2)\tilde{H}_{2}=\nabla^{2}f(x_{2})+\lambda_{2}\nabla^{2}\Omega(x_{2}). We now modify the proof of Lemma 4.2 to obtain Eq. 17: the right-hand side of Eq. 20 becomes hδ1-h\delta_{1}, the right-hand side of Eq. 21 becomes h2(δ1δ2)\frac{h}{2}(\delta_{1}-\delta_{2}), and the right-hand side of Eq. 22 becomes h22λ2Ω(x)d1+h22δ1\frac{h^{2}}{2}\lambda\nabla^{2}\Omega(x)d_{1}+\frac{h^{2}}{2}\delta_{1}.

The following two corollaries provide the complexity analysis of the δ\delta-approximate version of Algorithm 2 and Algorithm 3.

Corollary 5.6.

Suppose that the assumptions in Section 3 hold, and suppose that the initial point x0x_{0} satisfies Fλmax(x0)ϵ4\left\|\nabla F_{\lambda_{\max}}(x_{0})\right\|\leq\frac{\epsilon}{4}. Let T:=log(λmax/λmin)T:=\log(\lambda_{\max}/\lambda_{\min}), let ϵ(0,μ+λminσ]\epsilon\in(0,\mu+\lambda_{\min}\sigma] be the desired accuracy, and let

KE, approx:=max{2T,LGτT3,8(f(x0)f)τLTϵ,4L(τ(G+ϵ)+1)Tϵ}.K_{\textnormal{E, approx}}:=\left\lceil\max\left\{2T,\frac{\sqrt{LG}\tau T}{\sqrt{3}},\frac{8(f(x_{0})-f^{\ast})\tau LT}{\epsilon},\frac{4\sqrt{L}(\tau(G+\epsilon)+1)T}{\sqrt{\epsilon}}\right\}\right\rceil.

If the number of iterations KK satisfies KKE, approxK\geq K_{\textnormal{E, approx}} and approximate directions d^k\hat{d}_{k} are all ϵ4\frac{\epsilon}{4}-approximate, the δ\delta-approximate version of Algorithm 2 returns an ϵ\epsilon-accurate solution path.

Proof 5.7.

Since the step-size h=1(λminλmax)1KTKh=1-(\frac{\lambda_{\min}}{\lambda_{\max}})^{\frac{1}{K}}\leq\frac{T}{K}, combining this with Eq. 16, we have rk3ϵ4r_{k}\leq\frac{3\epsilon}{4} for all k{0,,K}k\in\{0,\dots,K\}. For the linear interpolation error, we have

L8maxk[K1]{(1+λk)xk+1xk2+2h|ξ(λk)|xk+1xk}\displaystyle\frac{L}{8}\cdot\max_{k\in[K-1]}\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h|\xi(\lambda_{k})|\|x_{k+1}-x_{k}\|\}
\displaystyle\leq h2L2maxk[K1]{(1+λk)(dk2+dkd^k2)+|ξ(λk)|(dk+dkd^k)}\displaystyle\frac{h^{2}L}{2}\cdot\max_{k\in[K-1]}\{(1+\lambda_{k})(\|d_{k}\|^{2}+\|d_{k}-\hat{d}_{k}\|^{2})+|\xi(\lambda_{k})|(\|d_{k}\|+\|d_{k}-\hat{d}_{k}\|)\}
\displaystyle\leq h2L2(τ(G+ϵ)+1)2ϵ4.\displaystyle\frac{h^{2}L}{2}\cdot(\tau(G+\epsilon)+1)^{2}\leq\frac{\epsilon}{4}.

Applying the above inequality to Theorem 3.11 completes the proof.

Corollary 5.8.

Suppose the assumptions in Section 4.1 hold, let μ~:=μ+λminσ\tilde{\mu}:=\mu+\lambda_{\min}\sigma, let ϵ(0,μ~]\epsilon\in(0,\tilde{\mu}] be the desired accuracy, suppose that the initial point x0x_{0} satisfies Fλmax(x0)ϵ4\left\|\nabla F_{\lambda_{\max}}(x_{0})\right\|\leq\frac{\epsilon}{4}, let T:=1.1log(λmax/λmin)T:=1.1\log(\lambda_{\max}/\lambda_{\min}), and let

Ktr, approx:=max{10T,8LT(2+G)μ~,6L1/2(2+G)3/2Tϵ1/2,6Lτ2/3(2+G)4/3Tϵ1/3}.K_{\textnormal{tr, approx}}:=\left\lceil\max\left\{10T,\frac{8LT(2+G)}{\tilde{\mu}},\frac{6L^{1/2}(2+G)^{3/2}T}{\epsilon^{1/2}},\frac{6L\tau^{2/3}(2+G)^{4/3}T}{\epsilon^{1/3}}\right\}\right\rceil.

If the number of iterations KK satisfies KKtr, approxK\geq K_{\textnormal{tr, approx}} and all approximate directions d^k,1,d^k,2\hat{d}_{k,1},\hat{d}_{k,2} are ϵ4\frac{\epsilon}{4}-approximate, the δ\delta-approximate version of Algorithm 3 returns an ϵ\epsilon-accurate solution path.

Proof 5.9.

Using Eq. 17 for rk+1,,r1r_{k+1},\dots,r_{1}, we have

rk+1\displaystyle r_{k+1} r0+3h3L(2+G)3+2h4L3τ2(2+G)4+h2δk,1δk,2+h22δk,11(1h+h22)\displaystyle\leq r_{0}+\frac{3h^{3}L(2+G)^{3}+2h^{4}L^{3}\tau^{2}(2+G)^{4}+\frac{h}{2}\left\|\delta_{k,1}-\delta_{k,2}\right\|+\frac{h^{2}}{2}\left\|\delta_{k,1}\right\|}{1-\left(1-h+\frac{h^{2}}{2}\right)}
ϵ4+3h2L(2+G)3+2h3L3τ2(2+G)4+ϵ4+hϵ81h23ϵ4.\displaystyle\leq\frac{\epsilon}{4}+\frac{3h^{2}L(2+G)^{3}+2h^{3}L^{3}\tau^{2}(2+G)^{4}+\frac{\epsilon}{4}+\frac{h\epsilon}{8}}{1-\frac{h}{2}}\leq\frac{3\epsilon}{4}.

Also, since xk+1xkh(dk+d^k,1+d^k,22(μ+λkσ))\|x_{k+1}-x_{k}\|\leq h(\|d_{k}\|+\frac{\|\hat{d}_{k,1}\|+\|\hat{d}_{k,2}\|}{2(\mu+\lambda_{k}\sigma)}), it holds that

\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\|\leq\frac{3\epsilon}{4}+\frac{L}{8}\cdot\max_{k\in[K-1]}\left\{(1+\lambda_{k})\|x_{k+1}-x_{k}\|^{2}+2h\lambda_{k}\|x_{k+1}-x_{k}\|\right\}\leq\frac{3\epsilon}{4}+\frac{L}{4}\cdot\left(h^{2}\left(9(1+G)^{2}+2\tau\epsilon\right)+h^{2}(3(1+G)+\epsilon)\right)\leq\epsilon.

Corollaries 5.6 and 5.8 provide the complexity analysis of the Euler method and the trapezoid method in the presence of approximate directions. We can see that when the residuals of the approximate directions have a uniform upper bound over all iterations, the δ\delta-approximate versions of Algorithms 2 and 3 have complexity of the same order as the original algorithms, which require exact directions. Moreover, Corollaries 5.6 and 5.8 only require ϵ4\frac{\epsilon}{4}-approximate directions and make no assumptions about how the approximate directions are generated. This provides flexibility in the choice of the approximate oracle used to compute the approximate directions.

5.2 Second-Order Conjugate Gradient Variants

Following the complexity analysis, we apply the conjugate gradient method to solve the subproblem, i.e., to compute a δ\delta-approximate direction with respect to v(,)v(\cdot,\cdot). To assess the efficiency of second-order conjugate gradient type algorithms, we measure the computational complexity by the total number of calls to the gradient and Hessian-vector product oracles.

Now we apply the conjugate gradient method as an approximate oracle to compute the approximate direction d^k\hat{d}_{k} at each iteration. We use the δ\delta-approximate version of the Euler method as an example. At iteration kk, the algorithm requires an approximate solution d^k\hat{d}_{k} satisfying Hkd^k+gk2δ\|{H_{k}\hat{d}_{k}+g_{k}}\|_{2}\leq\delta, where Hk:=2f(xk)+λk+12Ω(xk)H_{k}:=\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k}) and gk:=f(xk)g_{k}:=\nabla f(x_{k}). We apply the conjugate gradient method to approximately solve the equation Hkd^k+gk=0H_{k}\hat{d}_{k}+g_{k}=0, setting the initial guess to be the approximate direction d^k1\hat{d}_{k-1} from the previous iteration. Below, we provide the complexity analysis of the Euler-CG method and the trapezoid-CG method.
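
The following sketch illustrates this warm-started conjugate gradient subroutine using only Hessian-vector products; hvp(v) is assumed to return $H_{k}v$, g stands for $g_{k}=\nabla f(x_{k})$, and d0 is the warm start.

import numpy as np

def cg_direction(hvp, g, d0, delta, max_iter=1000):
    # Approximately solve H d = -g by conjugate gradient, warm-started
    # at d0, stopping once the residual ||H d + g|| drops below delta.
    # hvp(v) is assumed to return the Hessian-vector product H v.
    d = d0.copy()
    r = -g - hvp(d)          # residual of the system H d = -g
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= delta:   # ||H d + g|| = ||r||
            break
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

In the Euler-CG method, d0 is taken to be the approximate direction from the previous iteration, which is exactly the warm start used in the analysis below.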

Theorem 5.10.

Under the same conditions as Theorem 3.2, let δϵ/4\delta\leftarrow\epsilon/4, and let KE-CG𝒪~(L3/2τ3/2(f(x0)f)T/ϵ)K_{\textnormal{E-CG}}\sim\tilde{\mathcal{O}}(L^{3/2}\tau^{3/2}(f(x_{0})-f^{*})T/\epsilon). If the total number of iterations KK satisfies KKE-CGK\geq K_{\textnormal{E-CG}}, then the Euler-CG method returns an ϵ\epsilon-accurate solution path. Also, suppose that the assumption in Theorem 4.1 holds, let δϵ/4\delta\leftarrow\epsilon/4, and let Ktr-CG𝒪~(Lτ1/2(2+G)3/2T/ϵ1/2)K_{\textnormal{tr-CG}}\sim\tilde{\mathcal{O}}(L\tau^{1/2}(2+G)^{3/2}T/\epsilon^{1/2}). If the total number of iterations KK satisfies KKtr-CGK\geq K_{\textnormal{tr-CG}}, then the trapezoid-CG method returns an ϵ\epsilon-accurate solution path.

Proof 5.11 (Proof of Theorem 5.10).

Here we only need to consider the inner complexity, i.e., the number of iterations required to compute the approximate direction. For the Euler-CG algorithm, let {yk,s}\left\{y_{k,s}\right\} denote the sequence generated by the conjugate gradient method with yk,0=d^k1y_{k,0}=\hat{d}_{k-1}, let Hk=2f(xk)+λk+12Ω(xk)H_{k}=\nabla^{2}f(x_{k})+\lambda_{k+1}\nabla^{2}\Omega(x_{k}), and let gk=f(xk)g_{k}=\nabla f(x_{k}). The existing results of the conjugate gradient method [8] guarantee that Hkyk,s+gk22κk(12κk+1)sHkyk,0+gk2\left\|H_{k}y_{k,s}+g_{k}\right\|_{2}\leq 2\sqrt{\kappa_{k}}(1-\frac{2}{\sqrt{\kappa_{k}}+1})^{s}\left\|H_{k}y_{k,0}+g_{k}\right\|_{2}, where κk\kappa_{k} is the condition number of HkH_{k} and κk(1+λk)Lμ+λkστL\kappa_{k}\leq\frac{(1+\lambda_{k})L}{\mu+\lambda_{k}\sigma}\leq\tau L. Since the initialization y0=d^k1y_{0}=\hat{d}_{k-1} is the approximate direction at the last iteration, we have Hk1d^k1+gk1ϵ4\|H_{k-1}\hat{d}_{k-1}+g_{k-1}\|\leq\frac{\epsilon}{4}. Then the initialization guarantees that

Hkd^k1+gk2\displaystyle\|H_{k}\hat{d}_{k-1}+g_{k}\|_{2} Hk1d^k1+gk12+Hkd^k1+gkHk1d^k1gk1\displaystyle\leq\|H_{k-1}\hat{d}_{k-1}+g_{k-1}\|_{2}+\|H_{k}\hat{d}_{k-1}+g_{k}-H_{k-1}\hat{d}_{k-1}-g_{k-1}\|
ϵ4+hL(1+λk)(d^k12+d^k1)ϵ4+2hL(2+G)2.\displaystyle\leq\frac{\epsilon}{4}+hL(1+\lambda_{k})\left(\|\hat{d}_{k-1}\|^{2}+\|\hat{d}_{k-1}\|\right)\leq\frac{\epsilon}{4}+2hL(2+G)^{2}.

Applying h𝒪(ϵ(f(x0)f)τL)h\sim\mathcal{O}(\frac{\epsilon}{(f(x_{0})-f^{*})\tau L}), the inner complexity NkN_{k} has an upper bound Nkκk+12log(2κkHkd^k1+gkϵ/4)𝒪~(τL)N_{k}\leq\frac{\sqrt{\kappa_{k}}+1}{2}\log(\frac{2\sqrt{\kappa_{k}}\|H_{k}\hat{d}_{k-1}+g_{k}\|}{\epsilon/4})\sim\tilde{\mathcal{O}}(\sqrt{\tau L}). Therefore, the total computational complexity of the Euler-CG method with the conjugate gradient approximate oracle to compute an ϵ\epsilon-accurate solution path is 𝒪~(L3/2τ3/2(f(x0)f)Tϵ)\tilde{\mathcal{O}}(\frac{L^{3/2}\tau^{3/2}(f(x_{0})-f^{*})T}{\epsilon}). For the trapezoid-CG algorithm, let xk=xk+hd^k,1x_{k}^{\prime}=x_{k}+h\hat{d}_{k,1}, λk=(1h+h2)λk\lambda_{k}^{\prime}=(1-h+h^{2})\lambda_{k}, Hk,1:=2f(xk)+λk2Ω(xk),gk,1:=f(xk)H_{k,1}:=\nabla^{2}f(x_{k})+\lambda_{k}\nabla^{2}\Omega(x_{k}),g_{k,1}:=\nabla f(x_{k}), Hk,2:=2f(xk)+λk2Ω(xk)H_{k,2}:=\nabla^{2}f(x_{k}^{\prime})+\lambda_{k}^{\prime}\nabla^{2}\Omega(x_{k}^{\prime}), and gk,2:=f(xk)g_{k,2}:=\nabla f(x_{k}^{\prime}). At iteration kk, the trapezoid-CG method requires computing the ϵ4\frac{\epsilon}{4}-approximate directions d^k,1\hat{d}_{k,1} and d^k,2\hat{d}_{k,2}. In the conjugate gradient subroutine, we use d^k1,1\hat{d}_{k-1,1} as a warm start to compute d^k,1\hat{d}_{k,1}, and likewise use d^k,1\hat{d}_{k,1} to warm start d^k,2\hat{d}_{k,2}. Similarly, we have Hk,1d^k1,1+gk,1ϵ4+2hL(2+G)2\|H_{k,1}\hat{d}_{k-1,1}+g_{k,1}\|\leq\frac{\epsilon}{4}+2hL(2+G)^{2} and Hk,2d^k,1+gk,2ϵ4+2hL(2+G)2\|H_{k,2}\hat{d}_{k,1}+g_{k,2}\|\leq\frac{\epsilon}{4}+2hL(2+G)^{2}. Applying h𝒪(ϵ1/2L1/2(2+G)3/2)h\sim\mathcal{O}(\frac{\epsilon^{1/2}}{L^{1/2}(2+G)^{3/2}}), the inner complexities Nk,1N_{k,1} and Nk,2N_{k,2} have upper bounds Nk,1,Nk,2𝒪~(τL)N_{k,1},N_{k,2}\sim\tilde{\mathcal{O}}(\sqrt{\tau L}). Therefore, the total computational complexity of the trapezoid-CG method with the conjugate gradient approximate oracle to compute an ϵ\epsilon-accurate solution path is 𝒪~(Lτ1/2(2+G)3/2Tϵ1/2)\tilde{\mathcal{O}}(\frac{L\tau^{1/2}(2+G)^{3/2}T}{\epsilon^{1/2}}).

Recalling the complexity results in Remark 3.3, we notice that when we use Nesterov’s accelerated gradient method as the subproblem solver, the total complexity will be (τL)3/2GTlog2ϵ\frac{(\tau L)^{3/2}GT\log 2}{\epsilon}. We can see that the total complexity of the δ\delta-approximate version of Algorithm 3 has order 𝒪(1ϵ)\mathcal{O}(\frac{1}{\sqrt{\epsilon}}) and is better than the best known result for the grid search method. Again, the complexity of the trapezoid method is better than that of the Euler method, since it exploits the higher-order smoothness of the functions f()f(\cdot) and Ω()\Omega(\cdot).

6 Computational Experiments and Results

In this section, we present computational results of numerical experiments in which we implement different versions of discretization schemes and interpolation methods to compute the approximate solution path. To compare with our methods, we introduce two approaches based on grid search (see, e.g., Giesen et al. [16], Ndiaye et al. [29]). For the subproblem solver in the grid search methods, we use the warm-started exact Newton method and Nesterov’s accelerated gradient method to compare with our exact methods and SOCG variants, respectively. We focus on the following 8 versions of the update schemes, where “CG” stands for conjugate gradient. In all cases, we fix an accuracy level ϵ>0\epsilon>0 and set the step/grid-size parameters accordingly.

  • Euler, Euler-CG: Algorithm 2, and its δ\delta-approximate version using CG as the sub-problem approximate oracle.

  • Trapezoid, Trapezoid-CG: Algorithm 3, and its δ\delta-approximate version using CG as the approximate subproblem oracle.

  • Runge-Kutta, Runge-Kutta-CG: an extension of our algorithm that uses the Runge-Kutta update scheme, and its δ\delta-approximate version using CG as the subproblem approximate oracle.

  • Grid Search-Newton/AGD: grid search method with either the exact Newton or Nesterov’s accelerated gradient method as the sub-problem solver.

6.1 Logistic Regression

Herein we examine the empirical behavior of each of the previously presented methods on logistic regression problems using the breast cancer dataset from Dua and Graff [11] (32 features and 569 observations) and the leukemia dataset from Golub et al. [18] (7129 features and 72 observations). In particular, let {(ai,bi)}i=1n\left\{(a_{i},b_{i})\right\}_{i=1}^{n} denote a training set of features aipa_{i}\in\mathbb{R}^{p} and labels bi{1,+1}b_{i}\in\{-1,+1\}, and define the sets of positive and negative examples by S+:={i[n]:bi=1}S_{+}:=\left\{i\in[n]:b_{i}=1\right\} and S:={i[n]:bi=1}S_{-}:=\left\{i\in[n]:b_{i}=-1\right\}. We examine two logistic regression variants: (i) regularized logistic regression with f(x)=1ni=1nlog(1+ebiaiTx)f(x)=\frac{1}{n}\sum_{i=1}^{n}\log(1+e^{-b_{i}a_{i}^{T}x}) and Ω(x)=12x2\Omega(x)=\frac{1}{2}\left\|x\right\|^{2}, where λmin=104\lambda_{\min}=10^{-4}, λmax=104\lambda_{\max}=10^{4}, and (ii) re-weighted logistic regression with f(x)=1|S+|iS+log(1+ebiaiTx)f(x)=\frac{1}{|S_{+}|}\sum_{i\in S_{+}}\log(1+e^{-b_{i}a_{i}^{T}x}) and Ω(x)=1|S|iSlog(1+ebiaiTx)\Omega(x)=\frac{1}{|S_{-}|}\sum_{i\in S_{-}}\log(1+e^{-b_{i}a_{i}^{T}x}), where λmin=101\lambda_{\min}=10^{-1}, λmax=10\lambda_{\max}=10. The initialization x0x_{0} applied in each method is the same and is a very nearly optimal solution to the problem, in the sense that Fλmax(x0)1015\left\|\nabla F_{\lambda_{\max}}(x_{0})\right\|\approx 10^{-15}. Note that in general it is difficult to compute the exact path accuracy of an approximate solution path x^():[λmin,λmax]p\hat{x}(\cdot):[\lambda_{\min},\lambda_{\max}]\to\mathbb{R}^{p}, namely A(x^):=maxλΛFλ(x^(λ))A(\hat{x}):=\max_{\lambda\in\Lambda}\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\|, where Λ=[λmin,λmax]\Lambda=[\lambda_{\min},\lambda_{\max}]. We therefore examine the approximate path accuracy in lieu of the exact computation of the path accuracy, namely, we consider A^(x^):=maxλΛ^Fλ(x^(λ))\hat{A}(\hat{x}):=\max_{\lambda\in\hat{\Lambda}}\left\|\nabla F_{\lambda}(\hat{x}(\lambda))\right\| where Λ^={λ0,λ0+λ12,λ1,λ1+λ22,λ2,,λK}\hat{\Lambda}=\{\lambda_{0},\frac{\lambda_{0}+\lambda_{1}}{2},\lambda_{1},\frac{\lambda_{1}+\lambda_{2}}{2},\lambda_{2},\dots,\lambda_{K}\}, since the largest theoretical interpolation error occurs at the midpoint of two consecutive breakpoints. In Fig. 1, we vary the desired accuracy parameter ϵ\epsilon and plot the number of Hessian oracle computations required by Algorithm 2, Algorithm 3, the algorithm with the Runge-Kutta update scheme, and the grid search method. The theoretical evaluation numbers are derived from the complexity analysis, and the observed number of evaluations of each method is determined according to a "doubling trick," whereby for each value of KK we calculate the observed accuracy along the path by interpolation, and if the observed accuracy is too large, then we double the value of KK until it is below ϵ\epsilon (a minimal sketch of this procedure appears below). The numerical results match the theoretical asymptotic order and intuition, and they confirm the superior performance of the trapezoid and Runge-Kutta methods, which exploit the higher-order smoothness of the loss and regularization functions. In part (a) of Fig. 1, we notice that the theoretical complexity is higher than the practical one, since the theoretical bound is more conservative. Therefore, we will stick to the "doubling trick" in the following experiments and compare the practical performance of each method.
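
A minimal sketch of the approximate path accuracy check on the midpoint grid and of the doubling trick is given below; compute_path(K), grad_F(x, lam), and interpolate(lam, lams, xs) are placeholder callables standing in for one of the discretization schemes, the gradient $\nabla F_{\lambda}(x)$, and the linear interpolation step, respectively.

import numpy as np

def observed_accuracy(lams, xs, grad_F, interpolate):
    # Approximate path accuracy on the midpoint grid: the breakpoints
    # plus the midpoints of consecutive breakpoints.
    grid = []
    for k in range(len(lams) - 1):
        grid += [lams[k], 0.5 * (lams[k] + lams[k + 1])]
    grid.append(lams[-1])
    return max(np.linalg.norm(grad_F(interpolate(lam, lams, xs), lam))
               for lam in grid)

def doubling_trick(compute_path, grad_F, interpolate, eps, K0=8):
    # Double K until the observed accuracy along the path is below eps.
    K = K0
    while True:
        lams, xs = compute_path(K)
        if observed_accuracy(lams, xs, grad_F, interpolate) <= eps:
            return K, lams, xs
        K *= 2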

Figure 1: Exact methods on the breast cancer data with n=569n=569 observations and p=30p=30 features. (a) Regularized logistic regression. (b) Re-weighted logistic regression.

Figure 2 summarizes the performance of the three aforementioned SOCG methods and the grid search method with Nesterov's accelerated gradient method. For the SOCG methods, we record the total number of gradient evaluations and Hessian-vector products, which is our measure of computational complexity. We implement these methods on the leukemia dataset and the regularized logistic regression problem. From Fig. 2, we can see that the numerical results match our theoretical asymptotic bounds. In part (b), we provide the CPU time to compute the approximate solutions for both the SOCG methods and the exact second-order methods (shown in the more transparent bars). We note that for the exact methods we only run the case in which the desired accuracy equals 10410^{-4}, since the grid search method already takes about 15 hours to finish. The dominance of the SOCG methods in these cases illustrates their ability to handle large-scale problems.

Figure 2: Second-order conjugate gradient methods on regularized logistic regression on leukemia data with n=72n=72 observations and p=7129p=7129 features. (a) Computation complexity. (b) CPU time.

6.2 Moment Matching Problem

Here we consider the moment matching problem with entropy regularization. Suppose that a discrete random variable ZZ has sample space {w1,,wp+1}\{w_{1},\dots,w_{p+1}\} and probability distribution x=(x(1),,x(p+1))x=(x_{(1)},\dots,x_{(p+1)}). The goal is to match the first nn empirical moments of ZZ, with entropy regularization to promote a nearly uniform ("equal") distribution. To formalize the problem, we consider the following constrained optimization problem:

P(λ):minxp+1\displaystyle P(\lambda):\quad\min_{x\in\mathbb{R}^{p+1}}\ \ 12Axb2+λj=1p+1x(j)log(x(j))\displaystyle\frac{1}{2}\left\|Ax-b\right\|^{2}+\lambda\cdot\sum_{j=1}^{p+1}x_{(j)}\log(x_{(j)})
s.t. 𝟏p+1Tx=1,x0,\displaystyle{\bm{1}}_{p+1}^{T}x=1,\quad x\geq 0,

where x(j)x_{(j)} is the jj-th component of xx, An×(p+1)A\in\mathbb{R}^{n\times(p+1)} with Ai,j=wjiA_{i,j}=w_{j}^{i}, and λΛ=[102,102]\lambda\in\Lambda=[10^{-2},10^{2}]. The parametric optimization problem P(λ)P(\lambda) is a constrained optimization problem, including an affine constraint, which does not satisfy our assumptions. Therefore, we introduce a new variable yy to substitute xx. Let ypy\in\mathbb{R}^{p} with y(i)=x(i)y_{(i)}=x_{(i)}, for i=1,,pi=1,\dots,p, and S={yp:y0,𝟏pTy1}S^{\prime}=\{y\in\mathbb{R}^{p}:y\geq 0,{\bm{1}}_{p}^{T}y\leq 1\}. Then the above moment matching problem P(λ)P(\lambda) is equivalent to the following parametric optimization problem:

P(λ):minyp\displaystyle P^{\prime}(\lambda):\quad\min_{y\in\mathbb{R}^{p}}\ \ 12Ayb2+λ(j=1py(j)log(y(j))+(11pTy)log(11pTy))\displaystyle\frac{1}{2}\|A^{\prime}y-b^{\prime}\|^{2}+\lambda\cdot\left(\sum_{j=1}^{p}y_{(j)}\log(y_{(j)})+(1-{\bm{1}}_{p}^{T}y)\log(1-{\bm{1}}_{p}^{T}y)\right)
s.t. 𝟏pTy1,y0,\displaystyle{\bm{1}}_{p}^{T}y\leq 1,\quad y\geq 0,

where A=A1:pAp+1𝟏pTA^{\prime}=A_{1:p}-A_{p+1}{\bm{1}}_{p}^{T} and b=bAp+1b^{\prime}=b-A_{p+1}. We examine the empirical behavior of each of the previously presented methods on (P)(P^{\prime}) without constraints, that is, f(y)=12Ayb2f(y)=\frac{1}{2}\|A^{\prime}y-b^{\prime}\|^{2} and Ω(y)=j=1py(j)log(y(j))+(11pTy)log(11pTy)\Omega(y)=\sum_{j=1}^{p}y_{(j)}\log(y_{(j)})+(1-{\bm{1}}_{p}^{T}y)\log(1-{\bm{1}}_{p}^{T}y). Translating yy back to xx, we can show that the exact path x(λ)x(\lambda) for λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}] is a subset of the relative interior of SS, but the proof is omitted for brevity. Also, when the step-size hh is small enough, all grid points {xk}\{x_{k}\} will be in the relative interior of SS, and therefore, the approximate path x^(λ)\hat{x}(\lambda) for λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}] is a subset of the relative interior of SS.
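
A brief sketch of this reformulation is given below; it forms $A^{\prime}$ and $b^{\prime}$ (assuming $A_{1:p}$ denotes the first $p$ columns of $A$ and $A_{p+1}$ its last column) and evaluates $f(y)$ and $\Omega(y)$ on the relative interior of $S^{\prime}$.

import numpy as np

def reformulate(A, b):
    # A' = A_{1:p} - A_{p+1} 1_p^T and b' = b - A_{p+1}, assuming
    # A_{1:p} is the first p columns of A and A_{p+1} its last column.
    a_last = A[:, -1]
    A_prime = A[:, :-1] - np.outer(a_last, np.ones(A.shape[1] - 1))
    b_prime = b - a_last
    return A_prime, b_prime

def f(y, A_prime, b_prime):
    return 0.5 * np.linalg.norm(A_prime @ y - b_prime) ** 2

def omega(y):
    # Entropy regularizer of the full distribution (y, 1 - 1^T y);
    # well defined on the relative interior, where all entries are positive.
    s = 1.0 - np.sum(y)
    return np.sum(y * np.log(y)) + s * np.log(s)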

Synthetic Data Generation Process

We generate the data (x,w)(x,w) according to the following generative model. The true distribution vector xtruex^{\textnormal{true}} is generated according to the model x(i)true=exp(z(i))j=1p+1exp(z(j))x_{(i)}^{\textnormal{true}}=\frac{\exp(z_{(i)})}{\sum_{j=1}^{p+1}\exp(z_{(j)})} for i=1,,p+1i=1,\dots,p+1, where z(i)unif(0,1)z_{(i)}\sim\textnormal{unif}(0,1). The sample vector ww is generated from an independent uniform distribution, i.e., w(i)unif(0,1)w_{(i)}\sim\textnormal{unif}(0,1) for i=1,,pi=1,\dots,p, and without loss of generality, we set w(p+1)=0w_{(p+1)}=0.
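
A sketch of this generative process is given below; forming $b$ from the true moments, i.e., $b=Ax^{\textnormal{true}}$, is one natural choice that is not spelled out in the text and is marked as an assumption in the code.

import numpy as np

def generate_data(n, p, seed=0):
    rng = np.random.default_rng(seed)
    # True distribution: softmax of uniform(0, 1) entries.
    z = rng.uniform(0.0, 1.0, size=p + 1)
    x_true = np.exp(z) / np.sum(np.exp(z))
    # Sample points: w_i ~ unif(0, 1) for i = 1, ..., p and w_{p+1} = 0.
    w = np.append(rng.uniform(0.0, 1.0, size=p), 0.0)
    # Moment matrix A_{i,j} = w_j^i for i = 1, ..., n.
    A = np.vstack([w ** i for i in range(1, n + 1)])
    b = A @ x_true  # assumed choice: the true first n moments
    return x_true, w, A, b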

We examine the aforementioned exact and SOCG methods with different problem dimensions. In the following set of experiments, we set the number of observations n=20n=20, the desired accuracy ϵ=105\epsilon=10^{-5}, and vary the problem dimension p{128,256,512,1024,2048}p\in\{128,256,512,1024,2048\}. The true distribution xtruex^{\text{true}} and the sample vector ww are synthetically generated according to the same process as before. Figure 3 displays our results for this experiment. Generally, we observe that the CPU time required to compute an ϵ\epsilon-accurate path increases as the problem dimension increases. Comparing the CPU time of the exact methods and the SOCG methods, we notice that the SOCG methods are less sensitive to the problem size. Again, the results demonstrate the superiority of the SOCG methods for solving large-scale problems.

Figure 3: Exact and SOCG methods on the moment matching problem with n=20n=20 and ϵ=105\epsilon=10^{-5}: CPU time comparison for different problem dimensions.

7 Conclusion

Inspired by a differential equation perspective on the parametric solution path, we developed and analyzed several different exact and inexact second-order algorithms. We established theoretical guarantees that improve over the best-known results for grid search methods, and we presented computational results that demonstrate the effectiveness of our methods.

Acknowledgments

This research is supported, in part, by NSF AI Institute for Advances in Optimization Award 2112533.

References

  • Abbas et al. [2014] Boushra Abbas, Hédy Attouch, and Benar F Svaiter. Newton-like dynamics and forward-backward methods for structured monotone inclusions in Hilbert spaces. Journal of Optimization Theory and Applications, 161(2):331–360, 2014.
  • Alvarez and Attouch [2001] Felipe Alvarez and Hedy Attouch. An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis, 9(1-2):3–11, 2001.
  • Alvarez et al. [2002] Felipe Alvarez, Hedy Attouch, Jérôme Bolte, and P Redont. A second-order gradient-like dissipative dynamical system with Hessian-driven damping: Application to optimization and mechanics. Journal de mathématiques pures et appliquées, 81(8):747–779, 2002.
  • Attouch and Svaiter [2011] Hedy Attouch and Benar Fux Svaiter. A continuous dynamical Newton-like approach to solving monotone inclusions. SIAM Journal on Control and Optimization, 49(2):574–598, 2011.
  • Attouch et al. [2016] Hedy Attouch, Maicon Marques Alves, and Benar Fux Svaiter. A dynamic approach to a proximal-Newton method for monotone inclusions in Hilbert spaces, with complexity o(1/n2)o(1/n^{2}). Journal of Convex Analysis, 23(1):139–180, 2016.
  • Attouch et al. [2018] Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168(1-2):123–175, 2018.
  • Bach et al. [2006] Francis R Bach, David Heckerman, and Eric Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7(Aug):1713–1741, 2006.
  • Bertsekas [1997] Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
  • Bot and Csetnek [2016] Radu Ioan Bot and Ernö Robert Csetnek. Second order forward-backward dynamical systems for monotone inclusion problems. SIAM Journal on Control and Optimization, 54(3):1423–1443, 2016.
  • Conn et al. [2000] Andrew R Conn, Nicholas IM Gould, and Ph L Toint. Trust region methods, volume 1. SIAM, 2000.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Efron et al. [2004] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
  • Friedman et al. [2010] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
  • Garcia et al. [1989] Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey. Automatica, 25(3):335–348, 1989.
  • Gautschi [2011] Walter Gautschi. Numerical analysis. Springer Science & Business Media, 2011.
  • Giesen et al. [2012a] Joachim Giesen, Martin Jaggi, and Sören Laue. Approximating parameterized convex optimization problems. ACM Transactions on Algorithms (TALG), 9(1):10, 2012a.
  • Giesen et al. [2012b] Joachim Giesen, Martin Jaggi, and Sören Laue. Regularization paths with guarantees for convex semidefinite optimization. In Artificial Intelligence and Statistics, pages 432–439, 2012b.
  • Golub et al. [1999] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
  • Guddat et al. [1990] Jürgen Guddat, F Guerra Vazquez, and Hubertus Th Jongen. Parametric optimization: singularities, pathfollowing and jumps. Springer, 1990.
  • Hastie et al. [2004] Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5(Oct):1391–1415, 2004.
  • Karimireddy et al. [2018] Sai Praneeth Karimireddy, Sebastian U Stich, and Martin Jaggi. Global linear convergence of Newton's method without strong convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.
  • Kim et al. [2007] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky. An interior-point method for large-scale 1\ell_{1}-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007.
  • Loosli et al. [2007] Gaëlle Loosli, Gilles Gasso, and Stéphane Canu. Regularization paths for ν\nu-SVM and ν\nu-SVR. In International Symposium on Neural Networks, pages 486–496. Springer, 2007.
  • Mairal and Yu [2012] Julien Mairal and Bin Yu. Complexity analysis of the lasso regularization path. In Proceedings of the 29th International Conference on Machine Learning, pages 1835–1842. Omnipress, 2012.
  • Marquardt [1963] Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
  • Mazumder et al. [2010] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322, 2010.
  • Mazumder et al. [2011] Rahul Mazumder, Jerome H Friedman, and Trevor Hastie. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.
  • Moré [1978] Jorge J Moré. The Levenberg-Marquardt algorithm: implementation and theory. In Numerical analysis, pages 105–116. Springer, 1978.
  • Ndiaye et al. [2019] Eugene Ndiaye, Tam Le, Olivier Fercoq, Joseph Salmon, and Ichiro Takeuchi. Safe grid search with optimal complexity. In International Conference on Machine Learning, pages 4771–4780, 2019.
  • Nesterov [2018] Yurii Nesterov. Implementable tensor methods in unconstrained convex optimization. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2018.
  • Nesterov and Nemirovskii [1994] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
  • Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • Osborne et al. [2000] Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000.
  • Polyak [2009] Roman A Polyak. Regularized Newton method for unconstrained convex optimization. Mathematical Programming, 120(1):125–145, 2009.
  • Ralph [1994] Daniel Ralph. Global convergence of damped Newton's method for nonsmooth equations via the path search. Mathematics of Operations Research, 19(2):352–389, 1994.
  • Renegar [2001] James Renegar. A mathematical view of interior-point methods in convex optimization, volume 3. SIAM, 2001.
  • Rosset [2005] Saharon Rosset. Following curved regularized optimization solution paths. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1153–1160. MIT Press, 2005. URL http://papers.nips.cc/paper/2600-following-curved-regularized-optimization-solution-paths.pdf.
  • Rosset and Zhu [2007] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, pages 1012–1030, 2007.
  • Scieur et al. [2017] Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d’Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems, pages 1109–1118, 2017.
  • Scott [2011] L.R. Scott. Numerical Analysis. Princeton University Press, 2011. ISBN 9781400838967. URL https://books.google.com/books?id=SfCjL_5AaRQC.
  • Su et al. [2014] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
  • Wang et al. [2014] Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6):2164–2201, 2014.
  • Wibisono et al. [2016] Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
  • Yuan and Zou [2009] Ming Yuan and Hui Zou. Efficient global approximation of generalized nonlinear 1\ell_{1}-regularized solution paths and its applications. Journal of the American Statistical Association, 104(488):1562–1574, 2009.
  • Zhang et al. [2018] Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct Runge-Kutta discretization achieves acceleration. In Advances in Neural Information Processing Systems, pages 3900–3909, 2018.
  • Zhou and Lange [2015] Hua Zhou and Kenneth Lange. Path following in the exact penalty method of convex programming. Computational optimization and applications, 61(3):609–634, 2015.
  • Zhou and Wu [2014] Hua Zhou and Yichao Wu. A generic path algorithm for regularized statistical estimation. Journal of the American Statistical Association, 109(506):686–699, 2014.

Appendix A Proofs

A.1 Proof of Proposition 2.1

For fixed λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}], ξ(λ)λ\frac{\xi(\lambda)}{\lambda} is a constant whose absolute value is no larger than CC. Let Hi=2f(xi)+λ2Ω(xi)H_{i}=\nabla^{2}f(x_{i})+\lambda\nabla^{2}\Omega(x_{i}) and gi=f(xi)g_{i}=\nabla f(x_{i}) for i=1,2i=1,2. Since f()f(\cdot) is μ\mu-strongly convex and Ω()\Omega(\cdot) is σ\sigma-strongly convex, we have Hi(μ+λminσ)IH_{i}\succeq(\mu+\lambda_{\min}\sigma)I for i=1,2i=1,2. Since f()f(\cdot) is LL-Lipschitz continuous, we have g2L\|g_{2}\|\leq L. Also, let v1=H11g1v_{1}=H_{1}^{-1}g_{1}, v2=H21g2v_{2}=H_{2}^{-1}g_{2}, and v1=H11g2v_{1}^{\prime}=H_{1}^{-1}g_{2}. It holds that

v1v1=H11(g1g2)H11g1g21μ+λminσLx1x2.\|v_{1}-v_{1}^{\prime}\|=\|H_{1}^{-1}(g_{1}-g_{2})\|\leq\|H_{1}^{-1}\|\cdot\|g_{1}-g_{2}\|\leq\tfrac{1}{\mu+\lambda_{\min}\sigma}\cdot L\|x_{1}-x_{2}\|.

Also, since H1H22f(x1)2f(x2)+λ2Ω(x1)2Ω(x2)L(1+λmax)x1x2\|H_{1}-H_{2}\|\leq\|\nabla^{2}f(x_{1})-\nabla^{2}f(x_{2})\|+\lambda\|\nabla^{2}\Omega(x_{1})-\nabla^{2}\Omega(x_{2})\|\leq L(1+\lambda_{\max})\|x_{1}-x_{2}\| and H11H21=H11(H2H1)H21H11H1H2H21\|H_{1}^{-1}-H_{2}^{-1}\|=\|H_{1}^{-1}(H_{2}-H_{1})H_{2}^{-1}\|\leq\|H_{1}^{-1}\|\cdot\|H_{1}-H_{2}\|\cdot\|H_{2}^{-1}\|, it holds that H11H21L(1+λmax)x1x2(μ+λminσ)2\|H_{1}^{-1}-H_{2}^{-1}\|\leq\tfrac{L(1+\lambda_{\max})\|x_{1}-x_{2}\|}{(\mu+\lambda_{\min}\sigma)^{2}}. Therefore, we have

v1v2=(H11H21)g2H11H21g2L2(1+λmax)x1x2(μ+λminσ)2.\|v_{1}^{\prime}-v_{2}\|=\|\left(H_{1}^{-1}-H_{2}^{-1}\right)g_{2}\|\leq\|H_{1}^{-1}-H_{2}^{-1}\|\cdot\|g_{2}\|\leq\tfrac{L^{2}(1+\lambda_{\max})\|x_{1}-x_{2}\|}{(\mu+\lambda_{\min}\sigma)^{2}}.

To conclude, since v(x1,λ)v(x2,λ)=ξ(λ)λ(v1v2)v(x_{1},\lambda)-v(x_{2},\lambda)=\frac{\xi(\lambda)}{\lambda}\cdot(v_{1}-v_{2}), it holds that

v(x1,λ)v(x2,λ)(LCμ+λminσ+L2C(1+λmax)(μ+λminσ)2)x1x2.\|v(x_{1},\lambda)-v(x_{2},\lambda)\|\leq(\tfrac{LC}{\mu+\lambda_{\min}\sigma}+\tfrac{L^{2}C(1+\lambda_{\max})}{(\mu+\lambda_{\min}\sigma)^{2}})\cdot\|x_{1}-x_{2}\|.

A.2 Proof of Lemma 4.2

In the following residual analysis, we will work with the high-order directional derivatives of f()f(\cdot) and Ω()\Omega(\cdot). We now introduce the definition of the directional derivatives. For p1p\geq 1, let Dpf(x)[h1,,hp]D^{p}f(x)[h_{1},\dots,h_{p}] denote the directional derivative of function ff at xx along directions hi,i=1,,ph_{i},i=1,\dots,p. For example, Df(x)[h]=f(x)ThDf(x)[h]=\nabla f(x)^{T}h and D2f(x)[h1,h2]=h1T2f(x)h2D^{2}f(x)[h_{1},h_{2}]=h_{1}^{T}\nabla^{2}f(x)h_{2}. Furthermore, the norm of directional derivatives is defined as Dpf(x):=maxh1,,hp{|Dpf(x)[h1,,hp]|:hi1}\|D^{p}f(x)\|:=\max_{h_{1},\dots,h_{p}}\left\{\left|D^{p}f(x)[h_{1},\dots,h_{p}]\right|:\|h_{i}\|\leq 1\right\}. For detailed properties of directional derivatives, we refer the reader to Nesterov [30]. We start with computational guarantees of the Taylor approximation on functions with Lipschitz continuous high-order derivatives. The following lemma guarantees the accuracy of the Taylor expansion.

Lemma A.1 ((1.5)(1.5) and (1.6)(1.6) in Nesterov [30]).

Let the function ϕ()\phi(\cdot) be convex and pp-times differentiable. Suppose the pp-th order derivative of ϕ()\phi(\cdot) is LpL_{p}-Lipschitz continuous. Let Φx,p()\Phi_{x,p}(\cdot) denote the Taylor approximation of the function ϕ()\phi(\cdot) at xx: Φx,p(y):=ϕ(x)+i=1p1i!Diϕ(x)[yx]i\Phi_{x,p}(y):=\phi(x)+\sum_{i=1}^{p}\frac{1}{i!}D^{i}\phi(x)[y-x]^{i}. Then we have the following guarantees:

|ϕ(y)Φx,p(y)|Lp(p+1)!yxp+1,ϕ(y)Φx,p(y)Lpp!yxp.\left|\phi(y)-\Phi_{x,p}(y)\right|\leq\frac{L_{p}}{(p+1)!}\|y-x\|^{p+1},\quad\|\nabla\phi(y)-\nabla\Phi_{x,p}(y)\|\leq\frac{L_{p}}{p!}\|y-x\|^{p}.
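
For instance, for $p=2$ these bounds specialize to the familiar inequalities for functions with $L_{2}$-Lipschitz Hessian:

\left|\phi(y)-\phi(x)-\nabla\phi(x)^{T}(y-x)-\tfrac{1}{2}(y-x)^{T}\nabla^{2}\phi(x)(y-x)\right|\leq\frac{L_{2}}{6}\|y-x\|^{3},\qquad\left\|\nabla\phi(y)-\nabla\phi(x)-\nabla^{2}\phi(x)(y-x)\right\|\leq\frac{L_{2}}{2}\|y-x\|^{2},

and for $p=3$ the gradient bound becomes \left\|\nabla\phi(y)-\nabla\phi(x)-\nabla^{2}\phi(x)(y-x)-\tfrac{1}{2}D^{3}\phi(x)[y-x]^{2}\right\|\leq\frac{L_{3}}{6}\|y-x\|^{3}, which is the form invoked in the proof of Lemma 4.2 below.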

Condition (step-size).

Suppose the step-size hh satisfies hmin{0.2,μ~8L(1+G)}h\leq\min\left\{0.2,\frac{\tilde{\mu}}{8L(1+G)}\right\}.

For simplicity, we use (x~,λ~)(\tilde{x},\tilde{\lambda}) to represent (xnext,λnext)(x_{\textnormal{next}},\lambda_{\textnormal{next}}). We begin the complexity analysis of the trapezoid method with the following two lemmas, which provide upper bounds on the norm of the direction d1\|d_{1}\| and on the difference between d1d_{1} and d2d_{2}.

Lemma A.2.

Suppose σ1\sigma\geq 1, xSx0x\in S_{x_{0}}, λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}], h>0h>0 and (x~,λ~)=T(x,λ;h)(\tilde{x},\tilde{\lambda})=T(x,\lambda;h). Let rr denote the initial residual f(x)+λΩ(x)\|\nabla f(x)+\lambda\nabla\Omega(x)\| that satisfies rμ~r\leq\tilde{\mu}. Then we have (1+λ)d12(G+1)(1+\lambda)\|d_{1}\|\leq 2(G+1).

Proof A.3.

Let H=2f(x)+λ2Ω(x)H=\nabla^{2}f(x)+\lambda\nabla^{2}\Omega(x). By definition, d1=H1f(x)d_{1}=-H^{-1}\nabla f(x). When λ1\lambda\geq 1, we have (1+λ)d11+λλG2G(1+\lambda)\|d_{1}\|\leq\frac{1+\lambda}{\lambda}G\leq 2G. Also, when λ1\lambda\leq 1, it holds that

(1+λ)d1(1+λ)(H1λΩ(x)+H1(f(x)+λΩ(x)))2(G+1).(1+\lambda)\|d_{1}\|\leq(1+\lambda)\left(\|H^{-1}\lambda\nabla\Omega(x)\|+\|H^{-1}(\nabla f(x)+\lambda\nabla\Omega(x))\|\right)\leq 2(G+1).

Lemma A.4.

Suppose the assumptions in Section 4.1 and the step-size condition in Section A.2 hold. Let (x,λ)n×+(x,\lambda)\in\mathbb{R}^{n}\times\mathbb{R}^{+}, and let xi,λi,di,i{1,2}x_{i},\lambda_{i},d_{i},i\in\{1,2\} be generated by the trapezoid update scheme defined in Eq. 13. Then we have

H~1(d2d1)2f(x1)(x1x2)(H~1H~2)d2L2x1x22,\|\tilde{H}_{1}(d_{2}-d_{1})-\nabla^{2}f(x_{1})(x_{1}-x_{2})-(\tilde{H}_{1}-\tilde{H}_{2})d_{2}\|\leq\tfrac{L}{2}\|x_{1}-x_{2}\|^{2},

where H~i=2f(xi)+λi2Ω(xi),i{1,2}\tilde{H}_{i}=\nabla^{2}f(x_{i})+\lambda_{i}\nabla^{2}\Omega(x_{i}),i\in\left\{1,2\right\}. Furthermore, it holds that

d2d12hLτ(d1+d12).\|d_{2}-d_{1}\|\leq 2hL\tau(\|d_{1}\|+\|d_{1}\|^{2}).

Proof A.5.

Using the definition of d1,d2d_{1},d_{2} we have

(18) H~1(d2d1)\displaystyle\tilde{H}_{1}(d_{2}-d_{1}) =f(x1)f(x2)+(IH~1H~21)f(x2)\displaystyle=\nabla f(x_{1})-\nabla f(x_{2})+(I-\tilde{H}_{1}\tilde{H}_{2}^{-1})\nabla f(x_{2})
=2f(x1)(x1x2)+(H~1H~2)d2+(R),\displaystyle=\nabla^{2}f(x_{1})(x_{1}-x_{2})+(\tilde{H}_{1}-\tilde{H}_{2})d_{2}+(R),

where (R)=f(x1)f(x2)2f(x1)(x1x2)h22Ld12\|(R)\|=\|\nabla f(x_{1})-\nabla f(x_{2})-\nabla^{2}f(x_{1})(x_{1}-x_{2})\|\leq\tfrac{h^{2}}{2}\cdot L\|d_{1}\|^{2}. Also, we have 2f(x1)(x1x2)=h2f(x1)d1hLd1\|\nabla^{2}f(x_{1})(x_{1}-x_{2})\|=h\|\nabla^{2}f(x_{1})d_{1}\|\leq hL\|d_{1}\|, and that

(H~1H~2)d2\displaystyle\|(\tilde{H}_{1}-\tilde{H}_{2})d_{2}\|
=\displaystyle= (2f(x1)2f(x2)+λ1(2Ω(x1)2Ω(x2))+(λ1λ2)2Ω(x2))d2\displaystyle\|\left(\nabla^{2}f(x_{1})-\nabla^{2}f(x_{2})+\lambda_{1}(\nabla^{2}\Omega(x_{1})-\nabla^{2}\Omega(x_{2}))+(\lambda_{1}-\lambda_{2})\nabla^{2}\Omega(x_{2})\right)d_{2}\|
\displaystyle\leq (Lx1x2+λ1Lx1x2+|λ1λ2|L)d2h(L(λ+1)d1+λL)d2.\displaystyle\left(L\|x_{1}-x_{2}\|+\lambda_{1}L\|x_{1}-x_{2}\|+\left|\lambda_{1}-\lambda_{2}\right|L\right)\|d_{2}\|\leq h\left(L(\lambda+1)\|d_{1}\|+\lambda L\right)\|d_{2}\|.

Hence, it holds that

(19) μ~d2μ~d1μ~d2d1H~1(d2d1)\displaystyle\tilde{\mu}\|d_{2}\|-\tilde{\mu}\|d_{1}\|\leq\tilde{\mu}\|d_{2}-d_{1}\|\leq\|\tilde{H}_{1}(d_{2}-d_{1})\|
\displaystyle\leq h22Ld12+hLd1+h(L(λ+1)d1+λL)d2.\displaystyle\tfrac{h^{2}}{2}\cdot L\|d_{1}\|^{2}+hL\|d_{1}\|+h\left(L(\lambda+1)\|d_{1}\|+\lambda L\right)\|d_{2}\|.

When hh satisfies the step-size condition in Section A.2, it holds that h(L(λ+1)d1+λL)μ~3h\left(L(\lambda+1)\|d_{1}\|+\lambda L\right)\leq\tfrac{\tilde{\mu}}{3} and h22Ld1+hLμ~3\tfrac{h^{2}}{2}\cdot L\|d_{1}\|+hL\leq\tfrac{\tilde{\mu}}{3}. Applying these bounds to Eq. 19, we have 23μ~d243μ~d1\tfrac{2}{3}\tilde{\mu}\|d_{2}\|\leq\tfrac{4}{3}\tilde{\mu}\|d_{1}\|, which implies that d22d1\|d_{2}\|\leq 2\|d_{1}\|. Applying this to Eq. 19, it holds that

H~1(d1d2)\displaystyle\|\tilde{H}_{1}(d_{1}-d_{2})\| h22Ld12+hLd1+h(L(λ+1)d1+λL)d2\displaystyle\leq\tfrac{h^{2}}{2}\cdot L\|d_{1}\|^{2}+hL\|d_{1}\|+h\left(L(\lambda+1)\|d_{1}\|+\lambda L\right)\|d_{2}\|
2hL(1+λ)(d1+d12).\displaystyle\leq 2hL(1+\lambda)(\|d_{1}\|+\|d_{1}\|^{2}).

Hence, it holds that d1d2h2L(1+λ)(d1+d12)μ+λσ2hLτ(d1+d12)\|d_{1}-d_{2}\|\leq h\cdot\tfrac{2L(1+\lambda)(\|d_{1}\|+\|d_{1}\|^{2})}{\mu+\lambda\sigma}\leq 2hL\tau(\|d_{1}\|+\|d_{1}\|^{2}).

Based on the results of Lemma A.2 and Lemma A.4, we now prove Lemma 4.2, which analyzes the one-step residual accumulation of the trapezoid update in Eq. 13.

Proof A.6 (Proof of Lemma 4.2).

We begin the local residual analysis by estimating the difference between Fλ~(x~)\nabla F_{\tilde{\lambda}}(\tilde{x}) and λ~λFλ(x)\tfrac{\tilde{\lambda}}{\lambda}\cdot\nabla F_{\lambda}(x). For simplicity, let xx, x~\tilde{x}, λ\lambda, and λ~\tilde{\lambda} denote xkx_{k}, xk+1x_{k+1}, λk\lambda_{k}, and λk+1\lambda_{k+1} in the trapezoid update Eq. 13, respectively, and let d~:=12(d1+d2)\tilde{d}:=\tfrac{1}{2}(d_{1}+d_{2}), so that x~x=hd~\tilde{x}-x=h\tilde{d}. Then we have

R=Fλ~(x~)λ~λFλ(x)=f(x~)f(x)(A)+λ~(Ω(x~)Ω(x))(B)+(1λ~λ)f(x)(C).R=\nabla F_{\tilde{\lambda}}(\tilde{x})-\tfrac{\tilde{\lambda}}{\lambda}\cdot\nabla F_{\lambda}(x)=\underbrace{\nabla f(\tilde{x})-\nabla f(x)}_{(A)}+\underbrace{\tilde{\lambda}(\nabla\Omega(\tilde{x})-\nabla\Omega(x))}_{(B)}+\underbrace{(1-\tfrac{\tilde{\lambda}}{\lambda})\nabla f(x)}_{(C)}.

We will approach the result in Eq. 14 by splitting and rearranging the terms in (A)(A), (B)(B) and (C)(C). From Lemma A.1, it holds that

(RA):=(A)2f(x)(x~x)(A)12D3f(x)[x~x]2(A3)L6x~x3=h36Ld~3.\|(RA)\|:=\|(A)-\underbrace{\nabla^{2}f(x)(\tilde{x}-x)}_{(A^{\prime})}-\underbrace{\tfrac{1}{2}D^{3}f(x)[\tilde{x}-x]^{2}}_{(A3)}\|\leq\tfrac{L}{6}\|\tilde{x}-x\|^{3}=\tfrac{h^{3}}{6}L\|\tilde{d}\|^{3}.

From the update Eq. 13, we have (A)=h2f(x)d1(A1)+h22f(x)(d2d1)(A2)(A^{\prime})=\underbrace{h\nabla^{2}f(x)d_{1}}_{(A1)}+\underbrace{\tfrac{h}{2}\nabla^{2}f(x)(d_{2}-d_{1})}_{(A2)}. For (B)(B), using Lemma A.1 and based on update Eq. 13 we have

(B)\displaystyle(B) =λ2Ω(x)(x~x)+(λ~λ)2Ω(x)(x~x)(B3)+λ~12D3Ω(x)[x~x]2(B4)+(RB)\displaystyle=\lambda\nabla^{2}\Omega(x)(\tilde{x}-x)+\underbrace{(\tilde{\lambda}-\lambda)\nabla^{2}\Omega(x)(\tilde{x}-x)}_{(B3)}+\underbrace{\tilde{\lambda}\cdot\tfrac{1}{2}D^{3}\Omega(x)[\tilde{x}-x]^{2}}_{(B4)}+(RB)
=hλ2Ω(x)d1(B1)+h2λ2Ω(x)(d2d1)(B2)+(B3)+(B4)+(RB),\displaystyle=\underbrace{h\lambda\nabla^{2}\Omega(x)d_{1}}_{(B1)}+\underbrace{\tfrac{h}{2}\lambda\nabla^{2}\Omega(x)(d_{2}-d_{1})}_{(B2)}+(B3)+(B4)+(RB),

where (RB)=λ~Ω(x~)Ω(x)2Ω(x)(x~x)12D3Ω(x)[x~x]2h3λ~L6d~3\|(RB)\|=\tilde{\lambda}\|\nabla\Omega(\tilde{x})-\nabla\Omega(x)-\nabla^{2}\Omega(x)(\tilde{x}-x)-\tfrac{1}{2}D^{3}\Omega(x)[\tilde{x}-x]^{2}\|\leq\tfrac{h^{3}\tilde{\lambda}L}{6}\|\tilde{d}\|^{3}. Also, (C)=(hh22)f(x)=hf(x)(C1)h22f(x)(C2)(C)=(h-\tfrac{h^{2}}{2})\nabla f(x)=\underbrace{h\nabla f(x)}_{(C1)}-\underbrace{\tfrac{h^{2}}{2}\nabla f(x)}_{(C2)}. By the definition of d1d_{1}, we have

(20) (A1)+(B1)+(C1)=h2f(x)d1+hλ2Ω(x)d1+hf(x)=0.(A1)+(B1)+(C1)=h\nabla^{2}f(x)d_{1}+h\lambda\nabla^{2}\Omega(x)d_{1}+h\nabla f(x)=0.

Using Lemma A.4, we have

(21) (A2)+(B2)\displaystyle(A2)+(B2) =h22f(x)(x1x2)(D1)+h2(2f(x1)2f(x2))d2(D2)\displaystyle=\underbrace{\tfrac{h}{2}\nabla^{2}f(x)(x_{1}-x_{2})}_{(D1)}+\underbrace{\tfrac{h}{2}(\nabla^{2}f(x_{1})-\nabla^{2}f(x_{2}))d_{2}}_{(D2)}
+h2(λ12Ω(x1)λ22Ω(x2))d2(D3)+(RD),\displaystyle\phantom{=}+\underbrace{\tfrac{h}{2}(\lambda_{1}\nabla^{2}\Omega(x_{1})-\lambda_{2}\nabla^{2}\Omega(x_{2}))d_{2}}_{(D3)}+(RD),

where (RD)h2L2x1x22=h34Ld12\|(RD)\|\leq\tfrac{h}{2}\cdot\tfrac{L}{2}\|x_{1}-x_{2}\|^{2}=\tfrac{h^{3}}{4}L\|d_{1}\|^{2}. Furthermore, it holds that

(22) (D1)+(C2)=h222f(x)d1h22f(x)=h22λ2Ω(x)d1(E1),(D1)+(C2)=-\tfrac{h^{2}}{2}\nabla^{2}f(x)d_{1}-\tfrac{h^{2}}{2}\nabla f(x)=\underbrace{\tfrac{h^{2}}{2}\lambda\nabla^{2}\Omega(x)d_{1}}_{(E1)},

and

(A3)+(D2)\displaystyle(A3)+(D2) =h22D3f(x)[12(d1+d2)]2+h2D3f(x)[x1x2,d2]+(R1)\displaystyle=\tfrac{h^{2}}{2}D^{3}f(x)\left[\tfrac{1}{2}(d_{1}+d_{2})\right]^{2}+\tfrac{h}{2}D^{3}f(x)[x_{1}-x_{2},d_{2}]+(R1)
=h22D3f(x)[12(d1d2)]2(R2)+(R1),\displaystyle=\underbrace{\tfrac{h^{2}}{2}D^{3}f(x)\left[\tfrac{1}{2}(d_{1}-d_{2})\right]^{2}}_{(R2)}+(R1),

where

(R1)\displaystyle\|(R1)\| =h2(2f(x1)2f(x2))d2h2D3f(x)[x1x2,d2]\displaystyle=\|\tfrac{h}{2}\left(\nabla^{2}f(x_{1})-\nabla^{2}f(x_{2})\right)d_{2}-\tfrac{h}{2}D^{3}f(x)[x_{1}-x_{2},d_{2}]\|
h2L2x1x22d2=h34Ld12d2.\displaystyle\leq\tfrac{h}{2}\cdot\tfrac{L}{2}\|x_{1}-x_{2}\|^{2}\|d_{2}\|=\tfrac{h^{3}}{4}L\|d_{1}\|^{2}\|d_{2}\|.

We further have (D3)=h2λ2(2Ω(x1)2Ω(x2))d2(E2)+h2(λ1λ2)2Ω(x1)d2(E3)(D3)=\underbrace{\tfrac{h}{2}\lambda_{2}\left(\nabla^{2}\Omega(x_{1})-\nabla^{2}\Omega(x_{2})\right)d_{2}}_{(E2)}+\underbrace{\tfrac{h}{2}(\lambda_{1}-\lambda_{2})\nabla^{2}\Omega(x_{1})d_{2}}_{(E3)}, and

(B4)+(E2)\displaystyle(B4)+(E2) =λ~12D3Ω(x)[x~x]2+h2λ2D3Ω(x1)[x1x2,d2]+(R5)\displaystyle=\tilde{\lambda}\cdot\tfrac{1}{2}D^{3}\Omega(x)[\tilde{x}-x]^{2}+\tfrac{h}{2}\lambda_{2}D^{3}\Omega(x_{1})[x_{1}-x_{2},d_{2}]+(R5)
=h22(λ~λ2)D3Ω(x)[d1+d22]2(R3)+h22λ2D3Ω(x1)[d1d22]2(R4)+(R5),\displaystyle=\underbrace{\tfrac{h^{2}}{2}(\tilde{\lambda}-\lambda_{2})D^{3}\Omega(x)\left[\tfrac{d_{1}+d_{2}}{2}\right]^{2}}_{(R3)}+\underbrace{\tfrac{h^{2}}{2}\lambda_{2}D^{3}\Omega(x_{1})\left[\tfrac{d_{1}-d_{2}}{2}\right]^{2}}_{(R4)}+(R5),

where

(R5)\displaystyle\|(R5)\| =h2λ2(2Ω(x1)2Ω(x2))d2h2λ2D3Ω(x)[x1x2,d2]\displaystyle=\|\tfrac{h}{2}\lambda_{2}\left(\nabla^{2}\Omega(x_{1})-\nabla^{2}\Omega(x_{2})\right)d_{2}-\tfrac{h}{2}\lambda_{2}D^{3}\Omega(x)[x_{1}-x_{2},d_{2}]\|
h2λ2L2x1x22d2=h34λ2Ld12d2,\displaystyle\leq\tfrac{h}{2}\lambda_{2}\cdot\tfrac{L}{2}\|x_{1}-x_{2}\|^{2}\|d_{2}\|=\tfrac{h^{3}}{4}\lambda_{2}L\|d_{1}\|^{2}\|d_{2}\|,

and

(B3)+(E1)+(E3)=\displaystyle(B3)+(E1)+(E3)= (h+h22)λ2Ω(x)hd1+d22+h22λ2Ω(x)d1\displaystyle(-h+\tfrac{h^{2}}{2})\lambda\nabla^{2}\Omega(x)h\cdot\tfrac{d_{1}+d_{2}}{2}+\tfrac{h^{2}}{2}\lambda\nabla^{2}\Omega(x)d_{1}
+h2(hh2)λ2Ω(x)d2=h34λ2Ω(x)(d1d2)(R6).\displaystyle+\tfrac{h}{2}\left(h-h^{2}\right)\lambda\nabla^{2}\Omega(x)d_{2}=\underbrace{\tfrac{h^{3}}{4}\lambda\nabla^{2}\Omega(x)(d_{1}-d_{2})}_{(R6)}.

Hence, it holds that

(23) (R)\displaystyle\|(R)\|\leq (RA)+(RB)+(RD)+(R1)+(R2)\displaystyle\|(RA)\|+\|(RB)\|+\|(RD)\|+\|(R1)\|+\|(R2)\|
+(R3)+(R4)+(R5)+(R6)\displaystyle+\|(R3)\|+\|(R4)\|+\|(R5)\|+\|(R6)\|
\displaystyle\leq h3L6d1+d223+h3λ~L6d1+d223+h3L6d12\displaystyle h^{3}\cdot\tfrac{L}{6}\|\tfrac{d_{1}+d_{2}}{2}\|^{3}+h^{3}\cdot\tfrac{\tilde{\lambda}L}{6}\|\tfrac{d_{1}+d_{2}}{2}\|^{3}+h^{3}\cdot\tfrac{L}{6}\|d_{1}\|^{2}
+h3L4d12d2+h2L8d1d22+h4λL4d1+d222\displaystyle+h^{3}\cdot\tfrac{L}{4}\|d_{1}\|^{2}\|d_{2}\|+h^{2}\cdot\tfrac{L}{8}\|d_{1}-d_{2}\|^{2}+h^{4}\cdot\tfrac{\lambda L}{4}\|\tfrac{d_{1}+d_{2}}{2}\|^{2}
+h2λ2L8d1d22+h3λ2L4d12d2+h3λL4d1d2\displaystyle+h^{2}\cdot\tfrac{\lambda_{2}L}{8}\|d_{1}-d_{2}\|^{2}+h^{3}\cdot\tfrac{\lambda_{2}L}{4}\|d_{1}\|^{2}\|d_{2}\|+h^{3}\cdot\tfrac{\lambda L}{4}\|d_{1}-d_{2}\|
\displaystyle\leq h3L3d13+h3Lλ3d13+h3L6d12+h3L2d13\displaystyle h^{3}\cdot\tfrac{L}{3}\|d_{1}\|^{3}+h^{3}\cdot\tfrac{L\lambda}{3}\|d_{1}\|^{3}+h^{3}\cdot\tfrac{L}{6}\|d_{1}\|^{2}+h^{3}\cdot\tfrac{L}{2}\|d_{1}\|^{3}
+h2L8d1d22+h4λL2d12+h2λL8d1d22\displaystyle+h^{2}\cdot\tfrac{L}{8}\|d_{1}-d_{2}\|^{2}+h^{4}\cdot\tfrac{\lambda L}{2}\|d_{1}\|^{2}+h^{2}\cdot\tfrac{\lambda L}{8}\|d_{1}-d_{2}\|^{2}
+h3λL2d13+h3λL4d1d2\displaystyle+h^{3}\cdot\tfrac{\lambda L}{2}\|d_{1}\|^{3}+h^{3}\cdot\tfrac{\lambda L}{4}\|d_{1}-d_{2}\|
\displaystyle\leq h3L(1+λ)(d13+d12+d1)+h2L(1+λ)8d1d22.\displaystyle h^{3}L(1+\lambda)(\|d_{1}\|^{3}+\|d_{1}\|^{2}+\|d_{1}\|)+h^{2}\cdot\tfrac{L(1+\lambda)}{8}\|d_{1}-d_{2}\|^{2}.

Applying the result in Lemma A.4 to Eq. 23, we further get

(24) d1d22(2hLτ(d1+d12))28h2L2τ2(d12+d14).\|d_{1}-d_{2}\|^{2}\leq(2hL\tau(\|d_{1}\|+\|d_{1}\|^{2}))^{2}\leq 8h^{2}L^{2}\tau^{2}(\|d_{1}\|^{2}+\|d_{1}\|^{4}).

Also, by applying the results in Lemma A.2 to Eqs. 23 and 24, we have

(R)h33L(1+G)3+h42L3τ2(1+G)4.\displaystyle\|(R)\|\leq h^{3}\cdot 3L(1+G)^{3}+h^{4}\cdot 2L^{3}\tau^{2}(1+G)^{4}.