New Methods for Parametric Optimization via Differential Equations
Abstract
We develop and analyze several different second-order algorithms for computing a near-optimal solution path of a convex parametric optimization problem with a smooth Hessian. Our algorithms are inspired by a differential equation perspective on the parametric solution path and do not rely on the specific structure of the objective function. We present computational guarantees that bound the oracle complexity to achieve a near-optimal solution path under different sets of smoothness assumptions. Under these assumptions, our results improve upon the best-known guarantees for grid search methods. We also develop second-order conjugate gradient variants that avoid exact Hessian computations and exact linear system solves. We present computational results that demonstrate the effectiveness of our methods over grid search methods on both real and synthetic datasets. On large-scale problems, we demonstrate significant speedups of the second-order conjugate gradient variants as compared to the standard versions of our methods.
1 Introduction
In many applications of interest, it is necessary to solve not just a single optimization problem but also an entire collection of related problems. In these settings, some or all of the objects involved in defining the objective function or constraints of an optimization problem depend on one or more parameters, and we would like to solve the problem as a function of these parameters. Generally, a parametric optimization problem can be written as:
(1) |
where belongs to the set of interest , and the feasible sets satisfy . There are many problems of interest that are formulated as parametric optimization problems of the form Eq. 1. Subsequently, as indicated by Guddat et al. [19], there are several strong motivations to design algorithms for Eq. 1, including but not limited to: (i) the need to solve problems arising in application areas like regularized regression with cross-validation (see, e.g., Osborne et al. [33]) and model predictive control (see, e.g., Garcia et al. [14]), (ii) as a building block for developing globally convergent algorithms by the approach of path-following as is done in interior-point methods (see, e.g., Nesterov and Nemirovskii [31]), and (iii) the need to address multi-objective optimization, for instance, finding the Pareto frontier of a two-objective optimization problem. Depending on the assumptions made, the goal may be to find global/local optimal solutions or Karush–Kuhn–Tucker (KKT) points of the problem for .
In the rest of the paper, we will focus on a more specific problem, in which we assume that: (i) the dependence on is linear, that is, can be written as , (ii) both and are convex functions with certain properties, and (iii) the feasible set is the entire vector space for all . That is, we focus on the parametric optimization problem:
(2)   $P(\lambda): \qquad \min_{x \in \mathbb{R}^n} \; f(x) + \lambda\,\Omega(x)$
where and are twice-differentiable functions such that is -strongly convex for some and is -strongly convex for some , both with respect to the -norm (denoted by herein). For any , let
(3)   $x(\lambda) \;:=\; \operatorname*{arg\,min}_{x \in \mathbb{R}^n} \; f(x) + \lambda\,\Omega(x)$
denote the unique optimal solution of defined in Eq. 2. We are interested in the problem of (approximately) computing the set of optimal solutions where is the set of interest for some , and we also refer to this set of solutions as the (exact) solution path. An important set of problems in practice and a popular line of research involves computing the solution path of regularized machine learning problems, including the LASSO as in Efron et al. [12], Osborne et al. [33] and the SVM problem as in Hastie et al. [20]. In these works, algorithms are designed to compute the exact piecewise-linear solution path. Also, in the context of interior-point methods for constrained convex optimization (see, for instance, Nesterov and Nemirovskii [31] and Renegar [36]), represents the objective function and represents the barrier function induced by the constraints of the original problem. Note that the application to interior-point methods requires a slightly more general variant of problem (2) where . Interior-point methods generally start with the problem for a moderately large and terminate when for some small enough positive threshold .
Recently, there has been growing interest in developing algorithms for computing an approximate solution path of a generic problem such as (3). Rosset [37] considers applying exact Newton steps on prespecified grids, and Ndiaye et al. [29] consider adaptive methods to discretize the interval , for example. These grid search type methods, which discretize the interval and subsequently solve a sequence of individual optimization problems, take a very black-box approach. A natural question is: Can we “open the black box” by developing a methodology that better exploits the structure of the solution path? We answer this question positively by introducing a differential equation perspective to analyze the solution path, which enables us to better reveal and exploit the underlying structure of the solution path. This deeper understanding enables us to build more efficient algorithms and present improved computational guarantees.
In particular, we derive an ordinary differential equation with an initial condition whose solution is the exact solution path of (2). The dynamics of the ODE that we derive resemble, but are distinct from, the dynamics of a path-wise version of Newton’s method. Based on the ODE, we propose efficient algorithms to generate approximate solution paths and provide the corresponding complexity analysis. The metric we consider is the -norm of the gradient of the regularized problem, namely , and we use the largest norm along the approximate path to represent the accuracy of an approximate path (as formally defined in Definition 3.1). To analyze the computational cost of our proposed algorithms, we consider the oracle complexity – either in terms of full Hessian or Hessian-vector product/gradient evaluations – to obtain an -accurate solution path. Note that considering the oracle complexity is in contrast to other works that consider the number of individual optimization problems that need to be solved (see, for example, Giesen et al. [16], Ndiaye et al. [29]), as well as the number of ordinary differential equations that need to be solved (see, for example, Zhou and Lange [46]).
1.1 Contributions
The first set of contributions of this paper concerns the perspective on the solution path of (2) from an ordinary differential equation point of view. We derive an ordinary differential equation with an initial condition whose solution is the solution path of (2), based on the first-order optimality conditions of (2). With this observation, we propose a novel and efficient way to approximate the entire solution path. Unlike existing results on the solution paths of LASSO or SVM problems, our derivation does not rely on the special structure of the optimization problem and holds for general objective functions.
The second set of contributions of this paper concerns the design of efficient algorithms and the corresponding oracle complexity analysis. Classical error analysis of numerical ordinary differential equation methods [15, 40] provides only asymptotic results, and the global error has an exponential dependency on the Lipschitz constant and the length of the time period. In contrast, we design new update schemes to compute an approximate solution path and develop nonasymptotic complexity bounds. In particular, we apply a semi-implicit Euler method to the ordinary differential equation in order to compute approximately optimal solutions under a finite set of penalty coefficients. Then, we incorporate linear interpolation, which is usually missing either in practice or in lower-bound complexity analyses, to generate nearly-optimal solutions for the remaining penalty coefficients within the range of parameter values of interest. The two-step algorithm guarantees an -accurate solution path within at most gradient and Hessian evaluations as well as linear system solves. When the objective function has higher-order smoothness properties, we modify the traditional trapezoid method from numerical differential equations and design a new update scheme, which guarantees an -accurate path within at most Hessian evaluations. It is important to emphasize that the complexity results in this paper are in terms of the number of operations (for example, Hessian evaluations), rather than the number of sub-problems that need to be solved (for example, individual numerical ODEs or individual optimization problems), as has been studied in prior work such as Giesen et al. [16], Zhou and Lange [46], and Ndiaye et al. [29]. Further extensions of our proposed algorithms to Runge-Kutta methods are possible, but we omit them for brevity. We also provide a detailed computational evaluation of our algorithms and existing methods, including several experiments with synthetic data, the breast cancer dataset [11], and the leukemia dataset [18].
The third set of contributions of the paper concerns second-order conjugate gradient type methods and computational guarantees in the presence of inexact gradient and Hessian oracles, as well as approximate linear equation solvers. When the dimension of the problem is high, computing the Hessian and/or solving linear systems becomes a computational bottleneck. To avoid this, one would like an algorithm that only requires approximate directions at each iteration. We first consider the case where the (absolute) numerical error incurred in the calculation of a new direction is bounded by some . We show that our algorithms are robust to numerical error in the sense that the additional errors of inexact directions do not accumulate and do not depend on the condition number. We extend the complexity analysis to the case where the numerical error has a uniform upper bound for and show that the Euler method maintains complexity, and that the trapezoid method maintains complexity when there is higher-order smoothness. We then propose variations of the aforementioned algorithms that only require gradient and Hessian-vector product oracles, rather than gradient and Hessian oracles together with a linear system solver. We also leverage the preceding analysis of inexact directions to provide computational complexity results for the second-order conjugate gradient type algorithms, which have the same order in as the results for the exact methods mentioned above. Our results demonstrate that our algorithms are more robust and that the second-order conjugate gradient variants require less computational cost than existing methods.
1.2 Related Literature
We now discuss previous work related to our algorithm and analysis from three distinct aspects.
Other path methods and comparison of results
As previously mentioned, for the LASSO and SVM problems, the exact solution path is piecewise linear and can be computed by path-following methods such as the least angle regression (LARS) algorithm proposed by Efron et al. [12], Hastie et al. [20], and Osborne et al. [33]. Additional problems whose solution paths are piecewise linear are considered by Rosset and Zhu [38], and Mairal and Yu [24] showed that the number of breakpoints in the solution path can be exponential in the number of data points. Generalized linear regression problems with regularization are considered by Yuan and Zou [44] and Zhou and Wu [47] via LARS-based methods. Another line of work focuses on computing approximate solution paths for specific problems, including the elastic net in Friedman et al. [13], the SVM problem in Bach et al. [7] and [16], matrix completion with nuclear norm regularization in Mazumder et al. [26], and other structured convex problems in Giesen et al. [17] and Loosli et al. [23]. For problems with nonconvex but coordinate-decomposable regularization functions, a coordinate descent-based algorithm was considered by Mazumder et al. [27] and Wang et al. [42]. Closest to our problem setup, Rosset [37] considered a general problem in which and have third-order derivatives and provided an algorithm that applies exact Newton steps on equally spaced grids, starting from the optimal solution of the nonregularized problem. The lower-bound complexity analysis when the approximate solution path is limited to a piecewise constant function is considered by Giesen et al. [16] and Ndiaye et al. [29].
Related global complexity analysis of second-order methods
Newton-like methods are an important class of algorithms in optimization. Some notable lines of work include interior-point methods (Nesterov and Nemirovskii [31], Renegar [36]) and applications in regression problems (Kim et al. [22]) as well as the Levenberg-Marquardt method (Moré [28]). In practice, techniques are often incorporated to ensure global convergence, such as line searches (Ralph [35]) and trust-region techniques (Conn et al. [10]). The global complexity analysis of Newton and higher-order methods with regularization has also received much recent interest, such as the work of Nesterov [30], Nesterov and Polyak [32] and Polyak [34] and the references therein. In our paper, we make similar types of assumptions as in the global complexity analysis of regularized second and higher-order methods and we also prove global complexity results for the class of second-order algorithms considered herein.
Related work on differential equations and optimization methods
Early works, including the work of Alvarez and Attouch [2] and Alvarez et al. [3], analyzed inertial dynamical systems driven by gradient and Newton flows with application to optimization problems. Newton-like dynamic systems of monotone inclusions with connections to Levenberg-Marquardt and regularized Newton methods were also analyzed by Abbas et al. [1], Attouch and Svaiter [4] and Bot and Csetnek [9]. Due to the recent popularity of the accelerated gradient descent method in the machine learning and optimization communities, the limiting dynamics of different optimization methods and their discretization have thus received much renewed interest in recent years; see Attouch et al. [6], Scieur et al. [39], Su et al. [41], Wibisono et al. [43], Zhang et al. [45] and the references therein.
1.3 Organization
The paper is organized as follows. In Section 2, we derive the ordinary differential equation with an initial condition whose solution is the exact solution path (3) and establish the existence and uniqueness of the solution of the differential equation. In Section 3, we leverage the ODE to develop a numerical algorithm to compute an approximate solution path of (2), and derive the corresponding complexity analysis in Theorem 3.2. In Section 4, we propose a multistage method that is beneficial when the functions and have higher-order smoothness, and we also provide its complexity analysis in Theorem 4.1. In Section 5, we extend our results to the presence of inexact oracles, and as a direct application we propose second-order conjugate gradient variants of the aforementioned algorithms, which avoid exact Hessian evaluations and exact solutions of linear systems. Section 6 contains detailed computational experiments comparing the proposed algorithms and grid search methods on real and synthetic datasets.
1.4 Notation
For a positive integer , let . For a vector-valued function that can be written as , we say that is differentiable if is differentiable for all and let be the derivative of , namely . Let and denote the -dimensional all-ones vector and the all-ones matrix, respectively. Throughout the paper, we fix the norm on to be the -norm, which is defined by . Also, in a slight abuse of notation, we use to represent the operator norm, i.e., the induced -norm on , which is defined by .
2 Ordinary Differential Equation Characterization of Solution Path
Let us begin by describing a differential equations perspective on the solution path Eq. 3 of Eq. 2 that will prove fruitful in the development of efficient computational methods. First, we introduce a reparameterization in terms of an auxiliary variable (thought of as “time”), whereby for a given we introduce functions and such that: is Lipschitz, is differentiable on , and we have for all . In a slight abuse of notation, we define the path with respect to as . Now notice that, for any , the first-order optimality condition for the problem states that . By differentiating both sides of the previous equation with respect to , we have . Rearranging the above and again using yields . Then, applying the fact that yields . Thus, we arrive at the following autonomous system
(4)   $\dot{x}(t) \;=\; \dfrac{\lambda'(t)}{\lambda(t)}\,\big[\nabla^2 f(x(t)) + \lambda(t)\,\nabla^2 \Omega(x(t))\big]^{-1}\nabla f(x(t)), \qquad x(0) = x(\lambda(0))$
for .
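To make the dynamics concrete, the following sketch evaluates the right-hand side of a system of the form Eq. 4 numerically. It is a minimal illustration, assuming the problem takes the regularized form of Eq. 2 with component functions f and Ω, and assuming user-supplied callables for their gradients and Hessians (the names `grad_f`, `hess_f`, `grad_omega`, and `hess_omega` are ours, not the paper's):

```python
import numpy as np

def path_ode_rhs(x, lam, dlam, grad_f, hess_f, grad_omega, hess_omega):
    """Right-hand side of the solution-path ODE (a sketch of Eq. 4).

    x          : current point x(t)
    lam, dlam  : lambda(t) and its time derivative lambda'(t)
    the remaining arguments are callables returning gradients/Hessians
    of f and Omega (assumed notation).
    """
    H = hess_f(x) + lam * hess_omega(x)   # Hessian of f + lam * Omega at x
    # On the exact path, lam * grad_omega(x) = -grad_f(x), so the direction
    # can equivalently be written with grad_omega; we use the grad_f form.
    return (dlam / lam) * np.linalg.solve(H, grad_f(x))
```

A generic ODE integrator applied to this right-hand side would trace the solution path, but the algorithms developed below exploit the structure of Eq. 4 more directly.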
By considering specific choices of and , the system (4) generalizes some previously studied methodologies in parametric optimization. First, consider the scenario with an equally spaced discretization of the interval , namely for some fixed step-size . Thus, the sequence is approximately given by . Intuitively, the choice of controls the dynamics of and generalizes some previously considered sequences for the problem (2). For example, letting we recover the arithmetic sequence in Rosset [37], and letting we recover the geometric sequence in Ndiaye et al. [29]. In addition, consider the special case where . Then the dynamics for in Eq. 4 are similar to the limiting dynamics of the proximal Newton method (also known as the Levenberg-Marquardt regularization procedure [25] for convex optimization problems). A similar dynamic for monotone inclusions, which includes finding a zero of the gradient of a convex function, is analyzed in Attouch et al. [5] and Attouch and Svaiter [4].
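For instance, assuming the reparameterization takes the affine form λ(t) = λ_0 − c t or the exponential form λ(t) = λ_0 e^{−c t} (the constants below are purely illustrative and not from the paper), an equally spaced time grid reproduces the arithmetic and geometric penalty sequences mentioned above:

```python
import numpy as np

lam0, c, T, K = 10.0, 0.05, 100.0, 100   # illustrative constants, not from the paper
t = np.linspace(0.0, T, K + 1)           # equally spaced time grid t_k = k * h

lam_arithmetic = lam0 - c * t            # affine lambda(t): arithmetic sequence (Rosset)
lam_geometric = lam0 * np.exp(-c * t)    # exponential lambda(t): geometric sequence (Ndiaye et al.)
```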
Before developing and presenting algorithms designed to compute the approximate solution path based on (4), we first verify that the system (4) has a unique trajectory. The following proposition states conditions on and such that defined in Eq. 4 is continuous in and is uniformly -Lipschitz continuous with respect to , namely, for any . This uniform Lipschitz property ensures that the above system has a unique trajectory, which therefore coincides with the solution path defined in (3).
Proposition 2.1 (Theorem 5.3.1 of Gautschi [15]).
Suppose that , , , are all -Lipschitz continuous, is -strongly convex for , and is -strongly convex for . Suppose further that is Lipschitz continuous on and satisfies for all . Then, it holds that defined in Eq. 4 is continuous in and is uniformly -Lipschitz continuous with , and Eq. 4 has a unique trajectory for .
3 Discretization and Complexity Analysis
In this section, we present algorithms to compute an approximate solution path based on discretizations of Eq. 4, along with the corresponding complexity analysis. The primary error metric we consider is the -norm of the gradient across the entire interval as formally presented in Definition 3.1.
Definition 3.1.
An approximate solution path to the parametric optimization problem Eq. 2 has accuracy if for all .
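As a concrete reading of this definition, the following sketch estimates the accuracy of a candidate path by taking the largest gradient norm over a finite grid of parameter values; the residual ∇f(x) + λ∇Ω(x) and the evaluation grid are our assumptions, since the supremum over the whole interval cannot be computed exactly in general:

```python
import numpy as np

def path_accuracy(path, lam_grid, grad_f, grad_omega):
    """Estimate the accuracy of an approximate path x_hat (Definition 3.1):
    the largest norm of grad f(x_hat(lam)) + lam * grad Omega(x_hat(lam))
    over a finite grid of lambda values, used as a proxy for the supremum."""
    return max(
        np.linalg.norm(grad_f(path(lam)) + lam * grad_omega(path(lam)))
        for lam in lam_grid
    )
```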
Notice that the strong convexity of the objective function for all immediately implies that an -accurate solution path also has the optimality gap guarantee, which is , for all . Algorithm 1 below presents a two-step “meta-algorithm” for computing an approximate solution path . Inspired by numerical methods to solve ordinary differential equations, we first design several update schemes to iteratively update by exploiting the dynamics in Eq. 4. We use the function to denote a generic update rule in the meta-algorithm below, and we consider several different specific examples herein. Then we apply an interpolation method to resolve the previously computed sequence of points into an approximate path over .
We develop oracle complexity results for different update schemes and interpolation methods in terms of the number of gradient computations, Hessian computations, and linear system solves required to compute an -accurate approximate path. In this section, we will stick to a simple version of Algorithm 1 based on the application of the semi-implicit Euler method and linear interpolation to specify the update rule and the interpolation method . In particular, the semi-implicit Euler discretization of Eq. 4 is
(5) |
and the linear interpolation is defined by with for all and .
The algorithm with an exponentially decaying parameter sequence
Recalling the update rule Eq. 5, we see that the function (or equivalently, ) is still to be determined. In practical cases, the value of usually decreases exponentially from to . This choice of penalty scale parameters arises in the solution path for linear models (see Friedman et al. [13]) and in interior-point methods (see Renegar [36]). Although our analysis holds for a broad class of , we first present the version with an exponentially decaying parameter sequence, namely . This specific version of Algorithm 1 is formally described in Algorithm 2.
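The following sketch puts the two steps together in the spirit of Algorithm 2: semi-implicit Euler updates along an exponentially decaying sequence, followed by linear interpolation between consecutive grid points. The exact placement of λ_k versus λ_{k+1} inside the update is our reading of the semi-implicit scheme rather than a verbatim transcription of Eq. 5, and all callables are assumed oracles:

```python
import numpy as np

def euler_path(x0, lam_max, lam_min, n_iters, grad_f, hess_f, grad_omega, hess_omega):
    """Sketch of Algorithm 2: semi-implicit Euler steps on Eq. 4 with
    lam_k = lam_max * exp(-k*h), followed by linear interpolation."""
    h = np.log(lam_max / lam_min) / n_iters       # so that lam_{n_iters} = lam_min
    lams = lam_max * np.exp(-h * np.arange(n_iters + 1))
    xs = [np.asarray(x0, dtype=float)]
    for k in range(n_iters):
        x, lam_next = xs[-1], lams[k + 1]
        H = hess_f(x) + lam_next * hess_omega(x)  # "semi-implicit": new lambda, old x
        d = np.linalg.solve(H, grad_f(x))
        xs.append(x + (lams[k + 1] - lams[k]) / lam_next * d)
    xs = np.array(xs)

    def x_hat(lam):
        """Linear interpolation between the two neighbouring grid points."""
        k = int(np.searchsorted(-lams, -lam))     # lams is decreasing
        k = min(max(k, 1), n_iters)
        w = (lam - lams[k]) / (lams[k - 1] - lams[k])
        return w * xs[k - 1] + (1.0 - w) * xs[k]

    return lams, xs, x_hat
```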
Before going further into detailed analysis, we first state the computational guarantee of Algorithm 2. In our complexity analysis, we make the following smoothness assumptions for and .
In addition to -strong convexity of and -strong convexity of , these functions have -Lipschitz gradients and Hessians, where is an upper bound on the four relevant Lipschitz constants. Furthermore, we assume that and that is an upper bound on the norm of the gradients of and on the level set .
Theorem 3.2 is our main result concerning the complexity of Algorithm 2 and demonstrates that in terms of the accuracy parameter , Algorithm 2 requires iterations to compute an -accurate solution path.
Theorem 3.2.
Suppose that the assumption in Section 3 holds, let be the desired accuracy, and suppose that the initial point satisfies . Let , let , and let
(6) |
If the total number of iterations satisfies , then Algorithm 2 returns an -accurate solution path.
Remark 3.3.
Grid search type methods for computing approximate solution paths are proposed in Giesen et al. [16] and Ndiaye et al. [29], and we will follow the analysis of the latter, which considers more general cases. To ensure that the function value gap satisfies , we require the number of grid points . For an -smooth function , implies , which is the accuracy measure used in our paper. Therefore, we need to set , and hence the number of grid points is when the desired accuracy of the inner problem is set to . Using the exact Newton method and the last grid point as a warm start to solve the inner problem, the analysis of Karimireddy et al. [21] implies that the inner complexity is , and therefore the total complexity is .
From the results in Theorem 3.2 and Remark 3.3, we can see that both our method and the grid search method have a complexity of order . In most practical cases, is smaller than , which implies that the constant in Eq. 6 is better. In Section 4, we will propose algorithms that utilize the smoothness of and and achieve better complexity results.
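For comparison, a warm-started grid search baseline in the spirit of Remark 3.3 can be sketched as follows; the grid, the Newton inner solver, and the stopping rule on the gradient norm are our illustrative choices:

```python
import numpy as np

def grid_search_path(x0, lam_grid, tol, grad_f, hess_f, grad_omega, hess_omega,
                     max_newton=50):
    """Warm-started grid search: approximately solve each P(lam) by Newton's
    method, starting from the solution at the previous grid point."""
    xs, x = [], np.asarray(x0, dtype=float)
    for lam in lam_grid:
        for _ in range(max_newton):
            g = grad_f(x) + lam * grad_omega(x)
            if np.linalg.norm(g) <= tol:          # inner problem solved to tolerance
                break
            H = hess_f(x) + lam * hess_omega(x)
            x = x - np.linalg.solve(H, g)
        xs.append(x.copy())
    return np.array(xs)
```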
3.1 Semi-Implicit Euler Update Scheme
Herein we provide the computational guarantee of the semi-implicit Euler update scheme defined in Eq. 5, where the main object of concern is the accuracy at each grid point. Let denote the accuracy at . We consider a broad family of in the following analysis, although in Algorithm 2 and the corresponding complexity analysis in Theorem 3.2, we only consider the special scenario that . We first present the computational guarantees of Taylor expansion approximations given the Lipschitz continuity of the Hessian in Lemma 3.4.
Lemma 3.4 (Lemma 1 in Nesterov and Polyak [32]).
Suppose has a -Lipschitz Hessian on some convex set . Then the following inequalities hold for all :
- (i) .
- (ii) .
Based on the results in Lemma 3.4, we provide the local analysis of , that is, how the norm of the gradient at each grid point accumulates. Lemma 3.5 provides an upper bound on based on , which represents the accuracy in the previous iteration.
Proof 3.6.
Lemma 3.5 provides the first technical result for the semi-implicit Euler update scheme. When the assumption in Section 3 holds, Eq. 5 has the computational guarantee . Telescoping these inequalities for , we have . Therefore, one would expect a uniform bound on the accuracy at all grid points for all . We formalize this idea in the following lemma.
Lemma 3.7.
Proof 3.8.
We will use induction to show that , which implies that for all . First, suppose for some . Let denote and we have . Since and , it holds that
Since , it holds that and therefore implies that . Then by induction we conclude that for all . Also, combining the result in Lemma 3.5, for all , it holds that
By taking the summation over from to , we obtain Eq. 8.
Lemma 3.7 provides the computational guarantees of Algorithm 2 with any decreasing sequence of . Note that the upper bound of the accuracy at point still involves a constant related to the sequence . We then consider a family of the sequence and derive a simpler upper bound.
Proposition 3.9.
Proof 3.10.
Proposition 3.9 provides a uniform upper bound on the accuracy of all almost-optimal solutions . When for all , Algorithm 2 generates a solution sequence such that .
3.2 Linear Interpolation
In the previous subsection, we provided a general accuracy analysis at all nearly-optimal solutions . The next step is to construct the entire path based on these nearly-optimal solutions. In this section, we analyze the second procedure in Algorithm 2, i.e., linear interpolation. First, we recall the definition of linear interpolation for :
(11) |
where , , and . That is, given an arbitrary , we first select such that , and then we output as a nearly-optimal solution to problem . The following theorem provides a uniform upper bound of for all .
Theorem 3.11.
Proof 3.12.
Theorem 3.11 provides the computational guarantee of linear interpolation of the sequence generated by the update scheme Eq. 5. Observing that is of the order , we see that the additional error incurred by linear interpolation is of the order . Together with the accuracy from the update scheme Eq. 5, we are able to provide a computational guarantee on the accuracy of the approximate path generated by Algorithm 2. In the following subsection, we provide a uniform bound on for a family of and discretizations.
3.3 Computational Guarantee for the Exponential Decaying Parameter Sequence
Under this scenario, we have and satisfies the assumption in Proposition 3.9. Therefore, we extend the result of Theorem 3.11 and provide an explicit uniform bound on the path accuracy.
Proposition 3.13.
Suppose the assumption in Section 3 holds and the step-size satisfies the condition in Eq. 9. Let and denote the approximate solution path generated by Algorithm 2. Then, we have the following computational guarantee for all :
(12) |
Proof 3.14.
First we extend the result in Proposition 3.9, and it holds that
Also, for the result in Theorem 3.11, we further have
By combining the above two inequalities, Proposition 3.9, and Theorem 3.11, we obtain Eq. 12.
In Algorithm 2, the sequence is given by , and implies . Therefore, we have . Applying this to Proposition 3.13, we arrive at the complexity analysis with respect to the number of iterations. We formalize the complexity analysis in the following proof of Theorem 3.2, which appears at the beginning of this section.
Proof 3.15 (Proof of Theorem 3.2).
The conditions that and guarantee that the step-size satisfies Eq. 9. Also, and guarantee that and . Hence Algorithm 2 returns an -accurate solution path.
Recall that the assumption of Theorem 3.2 requires a good initialization satisfying . In practical cases, we can either run a convex optimization algorithm to find a point satisfying the initial condition or use the initialization suggested below. Here we suggest one choice of initialization with a computational guarantee when the function is structured, i.e., when the optimal solution that minimizes is easy to compute. Let and . Intuitively, the initialization is a Newton step from the optimizer of , and we can show that . Notice that the values of and are independent of . Hence, when is large enough, we have . Also, since is the optimal solution when , the initialization can be considered as an update step of Eq. 5 from .
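As a concrete illustration of this suggestion, the following sketch takes one Newton-type step of the regularized problem from the minimizer of Ω; the exact formula and constants used in the paper are not reproduced here, so this is only one natural instantiation:

```python
import numpy as np

def suggested_init(x_omega, lam0, grad_f, hess_f, hess_omega):
    """One Newton-type step at the largest penalty lam0, taken from the
    minimizer x_omega of Omega (so grad Omega(x_omega) = 0); a sketch of
    the initialization discussed above."""
    H = hess_f(x_omega) + lam0 * hess_omega(x_omega)
    g = grad_f(x_omega)               # gradient of f + lam0 * Omega at x_omega
    return x_omega - np.linalg.solve(H, g)
```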
4 Multi-Stage Discretizations
In the analysis of the previous section, Algorithm 2 guarantees an -accurate solution path within calls to the gradient and Hessian oracles and the linear system solver. One advantage of the main result proposed in Theorem 3.2 is that only smoothness of the Hessians of and is required, and no assumption on is required. When and have better properties and is relatively small, one would like an algorithm that utilizes these properties and requires fewer oracle calls with respect to the order of . Motivated by multistage numerical methods for solving differential equations, we design several update schemes to achieve higher-order accuracy. In particular, in this section we still consider the exponentially decaying parameter, that is, .
4.1 The Trapezoid Method
In this section, we propose and analyze the trapezoid method, whose formal description is given in Algorithm 3. The trapezoid method is beneficial when the functions and have Lipschitz continuous third-order derivatives, and it achieves higher-order accuracy than the semi-implicit Euler method. The accuracy of the output path of Algorithm 3 is of order , where is the step-size, or equivalently , where is the number of iterations. Moreover, the algorithm does not require an oracle for higher-order derivatives; like the Euler method, it only requires gradient and Hessian oracles.
We first state the main technical assumption and computational guarantees of Algorithm 3.
In addition to the assumption in Section 3, we assume that the third-order directional derivatives of and are -Lipschitz continuous and .
Theorem 4.1.
Suppose the assumption in Section 4.1 holds, let be the desired accuracy, let , suppose that the initial point satisfies , let , and let
If the total number of iterations satisfies , then Algorithm 3 returns an -accurate solution path.
The result in Theorem 4.1 shows that we improve the total complexity to , which is better than the complexity of the Euler method and the best-known results for the grid search method (see Theorem 3.2 and Remark 3.3). Similarly to the previous complexity analysis of the semi-implicit Euler method, the analysis of the trapezoid method consists of two parts: we first present the computational guarantee of the trapezoid update scheme, which is defined as
(13) |
where , and .
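The sketch below shows one natural reading of such a trapezoid (Heun-type) update for the path ODE: an Euler predictor followed by averaging the directions at the current and predicted points. The precise placement of λ_k versus λ_{k+1} inside each direction is our assumption and may differ from the exact form of Eq. 13:

```python
import numpy as np

def trapezoid_step(x, lam, lam_next, grad_f, hess_f, grad_omega, hess_omega):
    """One trapezoid (Heun-type) step for the path ODE, written with lambda
    as the independent variable; a sketch of the update scheme in Eq. 13."""
    dlam = lam_next - lam

    def direction(z, l):
        H = hess_f(z) + l * hess_omega(z)
        return np.linalg.solve(H, grad_f(z)) / l   # dx/dlambda on the path

    d1 = direction(x, lam)                 # direction at the current point
    x_pred = x + dlam * d1                 # Euler predictor
    d2 = direction(x_pred, lam_next)       # direction at the predicted point
    return x + 0.5 * dlam * (d1 + d2)      # average the two directions
```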
Lemma 4.2.
Suppose the assumption in Section 4.1 holds, , and the next iterate is given by defined in Eq. 13. Then, it holds that , and
(14) |
We defer the proof of Lemma 4.2 to the appendix because of its length and complexity. In the proof, we mainly work with directional derivatives (see Nesterov [30] for details) and the accuracy of the Taylor expansion in Lemma 3.4. The result in Lemma 4.2 shows that the trapezoid update scheme in Eq. 13 guarantees a local accumulation of . Moreover, we can derive a uniform upper bound on the accuracy of all nearly-optimal solutions . For all other , we implement linear interpolation to approximate the corresponding nearly-optimal solution. We now provide the formal proof of Theorem 4.1.
Proof 4.3 (Proof of Theorem 4.1).
Since , it holds that
Then we show that for all by induction. Suppose , then by Lemma 4.2, it holds that
Therefore, for all . Suppose , and hence where . Applying the results in Theorem 3.11, we have
5 Analysis with Inexact Linear Equations Solutions and Second-Order Conjugate Gradient Variants
In this section, we present the complexity analysis of the aforementioned methods in the presence of inexact gradient and Hessian oracles and/or inexact linear system solutions, as well as variants that apply second-order conjugate gradient (SOCG) type methods. We first consider the case when the gradient and Hessian oracles are inexact and/or the linear system solver yields approximate solutions.
5.1 Analysis with Inexact Oracle and/or Approximate Solver
At each iteration of the Euler, trapezoid, and Runge-Kutta methods, an essential subroutine is to compute the directions . For example, in the Euler method, is given by the formula . In most large-scale problems, computing the Hessian matrix and solving linear equations exactly are the computational bottleneck. Therefore, we consider the case with numerical error, which may be induced by an inexact gradient and Hessian oracle or by the linear equation solver. We tackle the two types of numerical error together.
Definition 5.1.
Suppose an exact solution has the form . For any constant , is a -approximate direction with respect to if . Furthermore, the -approximate versions of Algorithm 2 or Algorithm 3 are the same as the original algorithms except that they use a -approximate direction in each update.
Compared with the original update scheme in Algorithm 2, the only difference in the update scheme of the -approximate version is that an approximate direction is applied at each iteration instead of the exact direction . We want to mention that there is no constraint on how the approximate direction is generated, and in Section 5.2 we provide several efficient methods to compute the approximate direction and the corresponding complexity analysis. The following two lemmas characterize the accumulation of local errors of the update scheme in the -approximate version of Algorithm 2 and Algorithm 3.
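Since the exact direction is typically unavailable, in practice one would certify approximateness indirectly. A hedged sketch, assuming (our notation) that the exact direction solves a linear system H d = b with H positive definite and smallest eigenvalue at least mu, is:

```python
import numpy as np

def residual_tolerance_for_delta(delta, mu):
    """If ||H d_hat - b|| <= delta * mu and the smallest eigenvalue of H is at
    least mu, then ||d_hat - H^{-1} b|| <= delta, i.e. d_hat is a
    delta-approximate direction in one natural reading of Definition 5.1."""
    return delta * mu

def is_delta_approximate(H, b, d_hat, delta, mu):
    """Certify delta-approximateness via the linear-system residual."""
    return np.linalg.norm(H @ d_hat - b) <= residual_tolerance_for_delta(delta, mu)
```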
Lemma 5.2.
Suppose the assumption in Section 3 holds. Let denote an approximate solution to . Let , , and . Then we have
(15) |
Furthermore, if we set and for some scalar and all , it holds that
(16) |
Proof 5.3.
First it holds that
Then, applying the same technique as in Lemma 3.5, we show that Eq. 15 holds. Also, applying Eq. 15 to Proposition 3.13 yields the result in Eq. 16.
Lemma 5.4.
Suppose . Let denote the residual of the approximate directions with . Then, it holds that
(17) |
Proof 5.5.
We follow the ideas in Lemmas A.2, A.4, and 4.2. Recalling the result in Lemma A.2, since , it holds that . Also, the result in Lemma A.4 becomes
where and . Now we modify the proof of Lemma 4.2 to get Eq. 17. Then the right-hand side of Eq. 20 becomes . Also, the right-hand side of Eq. 21 becomes and the right-hand side of Eq. 22 becomes .
The following two corollaries provide the complexity analysis of the -approximate version of Algorithm 2 and Algorithm 3.
Corollary 5.6.
Suppose that the assumption in Section 3 holds, and suppose that the initial point satisfies . Let , let be the desired accuracy, and let
If the number of iterations satisfies and approximate directions are all -approximate, the -approximate version of Algorithm 2 returns an -accurate solution path.
Proof 5.7.
Since the step-size , combining Eq. 16, we have , for all . For linear interpolation error, we have
Applying the above inequality to Theorem 3.11 completes the proof.
Corollary 5.8.
Suppose the assumption in Section 4.1 holds, let , let be the desired accuracy, suppose that the initial point satisfies , let , and let
If the number of iterations satisfies and all approximate directions are -approximate, the -approximate version of Algorithm 3 returns an -accurate solution path.
Proof 5.9.
Corollaries 5.6 and 5.8 provide the complexity analysis of the Euler method and the trapezoid method in the presence of approximate directions. We can see that when the residuals of the approximate directions have a uniform upper bound over all iterations, the -approximate versions of Algorithms 2 and 3 have complexity of the same order as the original algorithms, respectively, which require exact directions. Moreover, Corollaries 5.6 and 5.8 only require the directions to be -approximate and make no assumptions about how the approximate directions are generated. This provides flexibility in the choice of the approximate oracle used to compute them.
5.2 Second-Order Conjugate Gradient Variants
Following the complexity analysis, we apply the conjugate gradient method to solve the subproblem, i.e., to compute a -approximate direction of . We measure the efficiency of second-order conjugate gradient type algorithms by the total number of calls to gradient and Hessian-vector product oracles.
Now we apply the conjugate gradient method as an approximate oracle to compute the approximate direction at each iteration, using the -approximate version of the Euler method as an example. At iteration , the algorithm requires an approximate solution satisfying , where and . We apply the conjugate gradient method to approximately solve this linear system, with the approximate direction from the previous iteration as the initial guess. Herein we provide the complexity analysis of the Euler-CG and trapezoid-CG methods.
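A minimal sketch of this subroutine is given below: it runs the conjugate gradient method using only Hessian-vector products and is warm-started at the direction from the previous iteration. The interface (a callable `hvp` for the Hessian-vector product and a residual tolerance) is our assumption:

```python
import numpy as np

def cg_direction(hvp, b, d_init, max_iter=200, tol=1e-8):
    """Approximately solve H d = b by conjugate gradient, accessing H only
    through the Hessian-vector product oracle `hvp` and warm-starting at
    d_init (e.g. the approximate direction from the previous iteration)."""
    d = np.asarray(d_init, dtype=float).copy()
    r = b - hvp(d)                        # residual of the linear system
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:            # residual small enough: stop
            break
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```

In the Euler-CG variant, `hvp` would be a closure that returns (∇²f(x_k) + λ_{k+1}∇²Ω(x_k))v via two Hessian-vector products, so the Hessian matrix itself is never formed.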
Theorem 5.10.
Under the same conditions as Theorem 3.2, let , and let . If the total number of iterations satisfies , then the Euler-CG method returns an -accurate solution path. Also, suppose that the assumption in Theorem 4.1 holds, let , and let . If the total number of iterations satisfies , then the trapezoid-CG method returns an -accurate solution path.
Proof 5.11 (Proof of Theorem 5.10).
Here we only need to consider the inner complexity, i.e., the number of iterations required to compute the approximate direction. For the Euler-CG algorithm, let denote the sequence generated by the conjugate gradient method with , let , and let . The existing results of the conjugate gradient method [8] guarantee that , where is the condition number of and . Since the initialization is the approximate direction at the last iteration, we have . Then the initialization guarantees that
Applying , the inner complexity has an upper bound of . Therefore, the total computational complexity of the Euler-CG method with the conjugate gradient approximate oracle to compute an -accurate solution path is . For the trapezoid-CG algorithm, let , , , , and . At iteration , the trapezoid-CG method requires computing the directions and . In the conjugate gradient subroutine, we use as a warm start to compute , and likewise use to warm start . Similarly, we have and . Applying , the inner complexities and have upper bounds of . Therefore, the total computational complexity of the trapezoid-CG method with the conjugate gradient approximate oracle to compute an -accurate solution path is .
Recalling the complexity results in Remark 3.3, we notice that when we use Nesterov’s accelerated gradient method as the subproblem solver, the total complexity will be . We can see that the total complexity of the -approximate version of Algorithm 3 has order and is better than the best known result for the grid search method. Again, the complexity of the trapezoid method is better than that of the Euler method, since it exploits the higher-order smoothness of the functions and .
6 Computational Experiments and Results
In this section, we present computational results of numerical experiments in which we implement different versions of discretization schemes and interpolation methods to compute the approximate solution path. To compare with our methods, we introduce two approaches based on grid search (see, e.g., Giesen et al. [16], Ndiaye et al. [29]). For the subproblem solver in the grid search methods, we use the warm-started exact Newton method and Nesterov’s accelerated gradient method to compare with our exact methods and SOCG variants, respectively. We focus on the following 8 versions of the update schemes, where “CG” stands for conjugate gradient. In all cases, we fix an accuracy level and set the step/grid-size parameters accordingly.
- Euler, Euler-CG: Algorithm 2, and its -approximate version using CG as the approximate subproblem oracle.
- Trapezoid, Trapezoid-CG: Algorithm 3, and its -approximate version using CG as the approximate subproblem oracle.
- Runge-Kutta, Runge-Kutta-CG: an extension of our algorithm that uses the Runge-Kutta update scheme, and its -approximate version using CG as the approximate subproblem oracle.
- Grid Search-Newton/AGD: the grid search method with either the exact Newton method or Nesterov’s accelerated gradient method as the subproblem solver.
6.1 Logistic Regression
Herein we examine the empirical behavior of each of the previously presented methods on logistic regression problems using the breast cancer dataset from Dua and Graff [11] ( features and observations) and the leukemia dataset from Golub et al. [18] ( features and observations). In particular, let denote a training set of features and labels and define the sets of positive and negative examples by and . We examine two logistic regression variants: (i) regularized logistic regression with and , where , , and (ii) re-weighted logistic regression with and , where , . The initialization applied in each method is the same: a very nearly-optimal solution to the problem such that . Note that in general it is difficult to compute the exact path accuracy of an approximate solution path , namely , where . We therefore examine an approximate path accuracy in lieu of the exact computation, namely, we consider where , since the theoretically largest interpolation error occurs at the midpoint between two breakpoints. In Fig. 1, we vary the desired accuracy parameter and plot the number of Hessian oracle computations required by Algorithm 2, Algorithm 3, the algorithm with the Runge-Kutta update scheme, and the grid search method. The theoretical evaluation numbers are derived from the complexity analysis, and the observed number of evaluations of each method is determined according to a “doubling trick,” whereby for each value of we calculate the observed accuracy along the path by interpolation, and if the observed accuracy is too large, then we double the value of until it is below . The numerical results match the asymptotic order and intuition, and confirm the superior performance of the trapezoid and Runge-Kutta methods due to the higher-order smoothness of the loss and regularization functions. In part (a) of Fig. 1, we notice that the theoretical complexity is higher than the observed one, since the analysis is conservative. Therefore, we will stick to the doubling trick in the following experiments and compare the practical performance of each method.
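The doubling trick used to determine the observed number of evaluations can be sketched as follows; `build_path` and `estimate_accuracy` are assumed callables (for example, one of the path algorithms above and the midpoint-based accuracy estimate), not the code used in the paper:

```python
def doubling_trick(build_path, estimate_accuracy, eps, n_init=8, n_max=2**20):
    """Run the path algorithm with n iterations, estimate the observed path
    accuracy (e.g. at midpoints between breakpoints), and double n until the
    target accuracy eps is met."""
    n = n_init
    while n <= n_max:
        path = build_path(n)
        if estimate_accuracy(path) <= eps:
            return n, path
        n *= 2                            # observed accuracy too large: double n
    raise RuntimeError("target accuracy not reached within n_max iterations")
```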
Figure 2 summarizes the performance of the three aforementioned SOCG methods and the grid search method with Nesterov’s accelerated gradient method. For the SOCG methods, we record the total number of gradient evaluations and Hessian-vector products, which is our measure of computational complexity. We implement these methods on the leukemia dataset and the regularized logistic regression problem. From Fig. 2, we can see that the numerical results match our theoretical asymptotic bounds. In part (b), we provide the CPU time to compute the approximate solutions of both the SOCG methods and the exact second-order methods (in the more transparent bars). We note that for the exact methods we only run the case when the desired accuracy equals , since the grid search method already takes about 15 hours to finish. The dominance of the SOCG methods in these cases illustrates their ability to handle large-scale problems.
6.2 Moment Matching Problem
Here we consider the moment matching problem with entropy regularization. Suppose that a discrete random variable has sample space and probability distribution . The goal is to match the empirical first -th moments of , with entropy regularization to promote “equal” distributions. To formalize the problem, we consider the following constrained optimization problem:
s.t. |
where is the -th component of , with , and . The parametric optimization problem is a constrained optimization problem, including an affine constraint, which does not satisfy our assumptions. Therefore, we introduce a new variable to substitute . Let with , for , and . Then the above moment matching problem is equivalent to the following parametric optimization problem:
s.t. |
where and . We examine the empirical behavior of each of the previously presented methods on without constraints, that is, and . Translating back to , we can show that the exact path for is a subset of the relative interior of , but the proof is omitted for brevity. Also, when the step-size is small enough, all grid points will be in the relative interior of , and therefore, the approximate path for is a subset of the relative interior of .
Synthetic Data Generation Process
We generate the data according to the following generative model. The true distribution vector is generated according to the model for , where . The sample vector is generated from an independent uniform distribution, i.e., for , and without loss of generality, we set .
We examine the aforementioned exact and SOCG methods with different problem dimensions. In the following set of experiments, we set the number of observations , the desired accuracy , and vary the problem dimension . The true distribution and the sample vector are synthetically generated according to the same process as before. Figure 3 displays our results for this experiment. Generally, we observe that the CPU time required to compute an -accurate path increases as the dimension of the problem increases. Comparing the CPU time of the exact methods and the SOCG methods, we notice that the SOCG methods are less sensitive to the problem size. Again, the results demonstrate the superiority of SOCG methods for solving large-scale problems.
7 Conclusion
Inspired by a differential equation perspective on the parametric solution path, we developed and analyzed several different exact and inexact second-order algorithms. We established theoretical guarantees that improve upon the best-known results for grid search methods, and we presented computational results that demonstrate the effectiveness of our methods.
Acknowledgments
This research is supported, in part, by NSF AI Institute for Advances in Optimization Award 2112533.
References
- Abbas et al. [2014] Boushra Abbas, Hédy Attouch, and Benar F Svaiter. Newton-like dynamics and forward-backward methods for structured monotone inclusions in Hilbert spaces. Journal of Optimization Theory and Applications, 161(2):331–360, 2014.
- Alvarez and Attouch [2001] Felipe Alvarez and Hedy Attouch. An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis, 9(1-2):3–11, 2001.
- Alvarez et al. [2002] Felipe Alvarez, Hedy Attouch, Jérôme Bolte, and P Redont. A second-order gradient-like dissipative dynamical system with Hessian-driven damping: Application to optimization and mechanics. Journal de mathématiques pures et appliquées, 81(8):747–779, 2002.
- Attouch and Svaiter [2011] Hedy Attouch and Benar Fux Svaiter. A continuous dynamical Newton-like approach to solving monotone inclusions. SIAM Journal on Control and Optimization, 49(2):574–598, 2011.
- Attouch et al. [2016] Hedy Attouch, Maicon Marques Alves, and Benar Fux Svaiter. A dynamic approach to a proximal-Newton method for monotone inclusions in Hilbert spaces, with complexity . Journal of Convex Analysis, 23(1):139–180, 2016.
- Attouch et al. [2018] Hedy Attouch, Zaki Chbani, Juan Peypouquet, and Patrick Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168(1-2):123–175, 2018.
- Bach et al. [2006] Francis R Bach, David Heckerman, and Eric Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7(Aug):1713–1741, 2006.
- Bertsekas [1997] Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
- Bot and Csetnek [2016] Radu Ioan Bot and Ernö Robert Csetnek. Second order forward-backward dynamical systems for monotone inclusion problems. SIAM Journal on Control and Optimization, 54(3):1423–1443, 2016.
- Conn et al. [2000] Andrew R Conn, Nicholas IM Gould, and Ph L Toint. Trust region methods, volume 1. SIAM, 2000.
- Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Efron et al. [2004] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
- Friedman et al. [2010] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
- Garcia et al. [1989] Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey. Automatica, 25(3):335–348, 1989.
- Gautschi [2011] Walter Gautschi. Numerical analysis. Springer Science & Business Media, 2011.
- Giesen et al. [2012a] Joachim Giesen, Martin Jaggi, and Sören Laue. Approximating parameterized convex optimization problems. ACM Transactions on Algorithms (TALG), 9(1):10, 2012a.
- Giesen et al. [2012b] Joachim Giesen, Martin Jaggi, and Sören Laue. Regularization paths with guarantees for convex semidefinite optimization. In Artificial Intelligence and Statistics, pages 432–439, 2012b.
- Golub et al. [1999] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
- Guddat et al. [1990] Jürgen Guddat, F Guerra Vazquez, and Hubertus Th Jongen. Parametric optimization: singularities, pathfollowing and jumps. Springer, 1990.
- Hastie et al. [2004] Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5(Oct):1391–1415, 2004.
- Karimireddy et al. [2018] Sai Praneeth Karimireddy, Sebastian U Stich, and Martin Jaggi. Global linear convergence of newton’s method without strong-convexity or lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.
- Kim et al. [2007] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky. An interior-point method for large-scale -regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007.
- Loosli et al. [2007] Gaëlle Loosli, Gilles Gasso, and Stéphane Canu. Regularization paths for -SVM and -SVR. In International Symposium on Neural Networks, pages 486–496. Springer, 2007.
- Mairal and Yu [2012] Julien Mairal and Bin Yu. Complexity analysis of the lasso regularization path. In Proceedings of the 29th International Conference on Machine Learning, pages 1835–1842. Omnipress, 2012.
- Marquardt [1963] Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
- Mazumder et al. [2010] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322, 2010.
- Mazumder et al. [2011] Rahul Mazumder, Jerome H Friedman, and Trevor Hastie. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.
- Moré [1978] Jorge J Moré. The Levenberg-Marquardt algorithm: implementation and theory. In Numerical Analysis, pages 105–116. Springer, 1978.
- Ndiaye et al. [2019] Eugene Ndiaye, Tam Le, Olivier Fercoq, Joseph Salmon, and Ichiro Takeuchi. Safe grid search with optimal complexity. In International Conference on Machine Learning, pages 4771–4780, 2019.
- Nesterov [2018] Yurii Nesterov. Implementable tensor methods in unconstrained convex optimization. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2018.
- Nesterov and Nemirovskii [1994] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
- Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- Osborne et al. [2000] Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection in least squares problems. IMA journal of numerical analysis, 20(3):389–403, 2000.
- Polyak [2009] Roman A Polyak. Regularized Newton method for unconstrained convex optimization. Mathematical Programming, 120(1):125–145, 2009.
- Ralph [1994] Daniel Ralph. Global convergence of damped Newton’s method for nonsmooth equations via the path search. Mathematics of Operations Research, 19(2):352–389, 1994.
- Renegar [2001] James Renegar. A mathematical view of interior-point methods in convex optimization, volume 3. SIAM, 2001.
- Rosset [2005] Saharon Rosset. Following curved regularized optimization solution paths. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1153–1160. MIT Press, 2005. URL http://papers.nips.cc/paper/2600-following-curved-regularized-optimization-solution-paths.pdf.
- Rosset and Zhu [2007] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, pages 1012–1030, 2007.
- Scieur et al. [2017] Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d’Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems, pages 1109–1118, 2017.
- Scott [2011] L.R. Scott. Numerical Analysis. Princeton University Press, 2011. ISBN 9781400838967. URL https://books.google.com/books?id=SfCjL_5AaRQC.
- Su et al. [2014] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
- Wang et al. [2014] Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6):2164–2201, 2014.
- Wibisono et al. [2016] Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
- Yuan and Zou [2009] Ming Yuan and Hui Zou. Efficient global approximation of generalized nonlinear -regularized solution paths and its applications. Journal of the American Statistical Association, 104(488):1562–1574, 2009.
- Zhang et al. [2018] Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct Runge-Kutta discretization achieves acceleration. In Advances in Neural Information Processing Systems, pages 3900–3909, 2018.
- Zhou and Lange [2015] Hua Zhou and Kenneth Lange. Path following in the exact penalty method of convex programming. Computational optimization and applications, 61(3):609–634, 2015.
- Zhou and Wu [2014] Hua Zhou and Yichao Wu. A generic path algorithm for regularized statistical estimation. Journal of the American Statistical Association, 109(506):686–699, 2014.
Appendix A Proofs
A.1 Proof of Proposition 2.1
For fixed , we have that is a constant and its absolute value is no larger than . Let and for . Since is -strongly convex and is strongly convex, we have for . Since is -Lipschitz continuous, we have . Also, let , , and . It holds that
Also, since and , it holds that . Therefore, we have
To conclude, since , it holds that
A.2 Proof of Lemma 4.2
In the following residual analysis, we will work with the high-order directional derivatives of and . We now introduce the definition of the directional derivatives. For , let denote the directional derivative of function at along directions . For example, and . Furthermore, the norm of directional derivatives is defined as . For detailed properties of directional derivatives, we refer the reader to Nesterov [30]. We start with computational guarantees of the Taylor approximation on functions with Lipschitz continuous high-order derivatives. The following lemma guarantees the accuracy of the Taylor expansion.
Lemma A.1 (see Nesterov [30]).
Let the function be convex and -times differentiable. Suppose the -th order derivative of is Lipschitz continuous. Let denote the Taylor approximation of the function at : . Then we have the following guarantees:
Suppose the step-size satisfies .
For simplicity, we use to represent . We begin the complexity analysis of the trapezoid method with the following two lemmas, which provide upper bounds on the norm of the direction and on the difference between and .
Lemma A.2.
Suppose , , , and . Let denote the initial residual that satisfies . Then we have .
Proof A.3.
Let . By definition . When , we have . Also when , it holds that
Lemma A.4.
Suppose the assumptions in Section 4.1 and Section A.2 hold. Let , and let be generated by the trapezoid update scheme defined in Eq. 13. Then we have
where . Furthermore, it holds that
Proof A.5.
Using the definition of we have
(18) | ||||
where . Also, we have , and that
Hence, it holds that
(19) | ||||
When satisfies the condition in Section A.2, it holds that and . Applying these to Eq. 19, we have , which implies that . Applying this to Eq. 19, it holds that
Hence, it holds that .
Based on the results of Lemmas A.2 and A.4, we now analyze the one-step residual accumulation of the trapezoid update in Eq. 13 and prove Lemma 4.2.
Proof A.6 (Proof of Lemma 4.2).
We will begin the local residual analysis by estimating the difference between and . For simplicity, let , , , and denote , , , and in the trapezoid update Eq. 13 respectively. Then we have
We will approach the result in Eq. 14 by splitting and rearranging the terms in , and . From Lemma A.1, it holds that
From the update Eq. 13, we have . For , using Lemma A.1 and based on update Eq. 13 we have
where . Also, . By the definition of , we have
(20) |
Using Lemma A.4, we have
(21) | ||||
where . Furthermore, it holds that
(22) |
and
where
We further have , and
where
and
Hence, it holds that
(23) | ||||
Applying the result in Lemma A.4 to Eq. 23, we further obtain
(24) |
Also, by applying the results in Lemma A.2 to Eqs. 23 and 24, we have