
Exploiting higher-order derivatives in convex optimization methods

Dmitry Kamzolov (Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE), Alexander Gasnikov (MIPT, Moscow, Russia; IITP RAS, Moscow, Russia; Caucasus Mathematical Center, Adyghe State University, Maikop, Russia), Pavel Dvurechensky (WIAS, Berlin, Germany), Artem Agafonov (Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; MIPT, Moscow, Russia), Martin Takáč (Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE)

Keywords— High-order methods, tensor methods, convex optimization, inexact methods, stochastic optimization, distributed optimization

1 Introduction

It is well known since the works of I. Newton [64] and L. Kantorovich [45] that the second-order derivative of the objective function can be used in numerical algorithms for solving optimization problems and nonlinear equations, and that such algorithms have better convergence guarantees. Higher-order derivatives can be efficiently used for solving nonlinear equations, as was shown by P. Chebyshev [17, 24]. For optimization problems, the basic idea, known at least since the 1970's [39] and developed in later works [69, 68, 9], is to approximate, at each iteration of an algorithm, the objective by its Taylor polynomial at the current iterate, optionally add a regularization, and minimize this Taylor approximation to obtain the next iterate. In this way, the first Taylor polynomial leads to first-order methods that are very well understood, see, e.g., [54, 58], with optimal methods existing since the 1980's [54, 56]. Second-order methods use the second Taylor polynomial. The most famous representative is Newton's method, which minimizes at each iteration the second-order quadratic approximation of the objective function. If the second derivative is Lipschitz continuous, the objective is strongly convex, and the starting point is sufficiently close to the solution, then this algorithm has very fast quadratic convergence and requires $\log\log\varepsilon^{-1}$ iterations to reach an $\varepsilon$-solution in terms of the objective value [45]. This bound is optimal [54] even for univariate optimization problems and even if derivatives of any order may be used in the algorithm. Different modifications of the basic algorithm, such as the Damped Newton's method or the Levenberg–Marquardt algorithm, achieve global convergence, but have a slower, i.e., linear, convergence [67, 65]. Second-order methods also played a central role in the development by Yu. Nesterov and A. Nemirovskii of interior-point methods that have a global linear convergence rate and allow proving polynomial solvability of a large class of convex optimization problems [62]. This theory was based on the analysis of the Damped Newton's method for the class of self-concordant functions that, in particular, includes functions without Lipschitz derivatives.

An important idea that eventually led to the current developments of tensor methods was the cubic regularization of Newton's method, which dates back to the work of A. Griewank [37]. The global performance of the cubic regularized Newton's method was analysed in 2006 by Yu. Nesterov and B. Polyak in [63] for a large list of settings, with the main assumption that the second derivative is Lipschitz continuous. The main idea is based on the fact that, for functions with Lipschitz second derivative, the objective's model consisting of the second Taylor polynomial regularized by the cube of the norm with a sufficiently large regularization parameter is an upper bound for the objective function. This model is minimized in each iteration of the algorithm, which, in particular, made it possible to obtain the global convergence rate $1/k^{2}$ for minimizing convex functions with Lipschitz second derivative. Moreover, the authors showed that the complexity of minimizing the third-order polynomial model of the objective is of the same order as the complexity of a standard Newton's method step. The cubic regularized Newton's method was further accelerated in [55] to reach the $1/k^{3}$ convergence rate for convex problems with Lipschitz second derivative. In 2012, R. Monteiro and B. Svaiter proposed in [53] a very promising accelerated proximal envelope (see also the work [16]) that allowed them to develop an even faster second-order method with the convergence rate $1/k^{3.5}$ for minimizing convex objectives with Lipschitz second derivative.

Another important step in the development of tensor methods was made by M. Baes in 2009 [8], where the Newton–Nesterov–Polyak algorithm was generalized to the setting of convex minimization with the objective having a Lipschitz $p$-th derivative for $p\geq 2$. The idea was to construct a $(p+1)$-th order polynomial model that upper bounds the objective function by taking the $p$-th Taylor polynomial of the objective and regularizing it by the $(p+1)$-th power of the Euclidean norm with a sufficiently large regularization parameter. The author showed a $1/k^{p}$ global convergence rate for methods that minimize such a model in each iteration and proposed an accelerated version with the rate $1/k^{p+1}$, all under the assumption of a Lipschitz $p$-th derivative. As in the world of first-order methods, where the interest in optimal methods was one of the central driving forces in 2000–2020, a natural question arose as to what the lower bounds for second- and higher-order methods are and which algorithms are optimal. The results in [7, 5, 29] gave the answer that the lower bound is $1/k^{(3p+1)/2}$, revealing a gap for $p\geq 3$ between the rates of the existing methods and the lower bounds. One of the drawbacks of tensor methods at this stage was that each iteration required minimizing a higher-order polynomial that may not be convex, leading to a high cost of each iteration and the impracticality of such methods.

In 2018, a breakthrough was made by Yu. Nesterov [59] in understanding the significance of higher-order methods in modern convex optimization when the $p$-th derivative satisfies the Lipschitz condition. The breakthrough involved increasing the regularization parameter in M. Baes's approach for the regularized $p$-th Taylor polynomial of the objective. It was demonstrated that this modified Taylor model is convex. Additionally, it was shown that the convex Taylor model can be efficiently minimized for $p=3$, with a complexity similar to that of a standard Newton's method step. Furthermore, it was proven that minimizing the convex Taylor model does not require the calculation of the entire tensor of third derivatives. Instead, directional third derivatives can be calculated, for example, through automatic differentiation. Lastly, a lower bound of $1/k^{(3p+1)/2}$ was established for convex problems with Lipschitz $p$-th derivative. For each $p\geq 2$, worst-case functions within the class of convex functions with Lipschitz $p$-th derivatives were constructed, and a research program was outlined to develop optimal tensor methods [58]. It followed from the obtained lower bounds for convex optimization problems that the Monteiro–Svaiter second-order method is optimal only up to a logarithmic factor, since it uses a complicated line-search procedure. Based on the Monteiro–Svaiter method and the convex Taylor model of the objective, in [30, 26, 27] and independently in [11, 42], the authors proposed near-optimal tensor methods for convex problems with Lipschitz $p$-th derivatives, which have the rate $1/k^{(3p+1)/2}$ up to logarithmic factors. In Section 3, the Monteiro–Svaiter approach with Nesterov's implementable tensor steps is described. As mentioned in [58, 21], the auxiliary line-search procedures that lead to logarithmic factors in the convergence rates of the near-optimal tensor methods may also significantly slow down the convergence in practice. In 2022, two independent works [48] and [14] proposed optimal tensor methods with convergence rates free of additional logarithmic factors. At the same time, such methods were also proposed for monotone variational inequalities [1, 51] under higher-order regularity of the operator.

The developments in the theory of tensor methods also had an important impact on the theory of second-order methods. In 2020, Yu. Nesterov proposed an implementation of the third-order method using an inexact third derivative constructed via finite differences of gradients [61]. In other words, it turned out to be possible to implement a third-order method using only first and second derivatives, which led to the <<superfast>> second-order method with the convergence rate $1/k^{4}$, circumventing the lower bound thanks to the additional assumption that the third derivative is Lipschitz continuous. The key observation was that, in the class of functions with Lipschitz second derivative, the worst-case function [59] for second-order methods does not have a Lipschitz third derivative. These results were further improved in [43, 60], where second-order algorithms with the rate $1/k^{5}$, corresponding to optimal third-order methods, were proposed for minimizing convex functions with Lipschitz third derivative. These developments are in sharp contrast to first-order methods, for which the worst-case function is quadratic [57] and has Lipschitz derivatives of any order, thus preventing improvements under additional assumptions. This line of research was continued in [6], where such reductions were shown to be possible for higher-order methods. The ideas used in superfast methods are described in Section 4.

Tensor methods remain a very active area of research with many extensions of the existing methods. In particular, there are adaptive variants for the setting when the Lipschitz constant is not known [41, 36], universal generalizations for the setting when the $p$-th derivative is Hölder continuous [33, 70, 19], versions with inexact minimization of the Taylor model [21, 35], and tensor methods for finding approximate stationary points of convex functions [34, 22]. The ideas described above, especially the Monteiro–Svaiter accelerated proximal point method, turned out to be productive also in areas not directly related to tensor methods, see non-trivial examples in [10, 12, 15]. Modern second-order and third-order methods also demonstrate their efficiency in Data Science and Machine Learning applications [18, 2, 23, 13]. More details of such applications are given in Section 5.

2 Notation and problem statement

The following optimization problem is considered

\min\limits_{x\in\mathbb{R}^{d}}\{F(x):=f(x)+g(x)\}, (1)

where $f$ and $g$ are convex functions. Denote by $x^{*}$ a solution of problem (1). If the solution is not unique, let $x^{*}$ be the solution that is closest to the starting point $x_{0}$ in the 2-norm.

Denote by $\|\cdot\|$ the Euclidean 2-norm in $\mathbb{R}^{d}$,

D^{k}f(x)[h]^{k}=\sum_{i_{1},\ldots,i_{d}\geq 0:\;\sum_{j=1}^{d}i_{j}=k}\frac{\partial^{k}f(x)}{\partial x_{1}^{i_{1}}\ldots\partial x_{d}^{i_{d}}}h_{1}^{i_{1}}\cdot\ldots\cdot h_{d}^{i_{d}},
\|D^{k}f(x)\|=\max_{\|h\|\leq 1}\left\|D^{k}f(x)[h]^{k}\right\|.

Assume that $f$ has a Lipschitz derivative of order $p$ ($p\geq 1$):

\|D^{p}f(x)-D^{p}f(y)\|\leq L_{p,f}\|x-y\|,\;x,y\in\mathbb{R}^{d}. (2)

Here and below (see, e.g., (5)) it is assumed that $x,y\in\mathbb{R}^{d}$ belong to the ball in the 2-norm centred at $x^{*}$ with radius $O(\|x_{0}-x^{*}\|)$ [60].

The $p$-th Taylor polynomial of $f$ is defined as

\Omega_{p}(f,x;y)=f(x)+\sum_{k=1}^{p}\frac{1}{k!}D^{k}f(x)\left[y-x\right]^{k},\;y\in\mathbb{R}^{d}. (3)

Note that from (2) it follows [59] that

\left|f(y)-\Omega_{p}(f,x;y)\right|\leq\frac{L_{p,f}}{(p+1)!}\|y-x\|^{p+1}. (4)

$F$ satisfies the $r$-growth condition ($p+1\geq r\geq 2$) (see [66]) with constant $\sigma_{r}>0$ if

F(x)-F(x^{*})\geq\sigma_{r}\|x-x^{*}\|^{r},\;x\in\mathbb{R}^{d}. (5)
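For illustration, a minimal Python sketch follows that builds the second Taylor polynomial $\Omega_{2}(f,x;y)$ from (3) for a toy logistic-type function and checks the bound (4) numerically; the test function, the random points, and the crude Lipschitz constant $L_{2}$ used below are assumptions made only for this example.

```python
import numpy as np

# Toy smooth convex function f(x) = log(1 + exp(<a, x>)) with known derivatives.
a = np.array([1.0, -2.0, 0.5])

def f(x):
    return np.logaddexp(0.0, a @ x)

def grad(x):
    s = 1.0 / (1.0 + np.exp(-(a @ x)))
    return s * a

def hess(x):
    s = 1.0 / (1.0 + np.exp(-(a @ x)))
    return s * (1.0 - s) * np.outer(a, a)

def taylor2(x, y):
    """Second Taylor polynomial Omega_2(f, x; y) from (3)."""
    h = y - x
    return f(x) + grad(x) @ h + 0.5 * h @ hess(x) @ h

# Check the bound (4) for p = 2 on random pairs of points.
# For this f one can take L2 <= ||a||^3 (a crude but valid Lipschitz constant
# of the second derivative); this value is used for illustration only.
L2 = np.linalg.norm(a) ** 3
rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lhs = abs(f(y) - taylor2(x, y))
    rhs = L2 / 6.0 * np.linalg.norm(y - x) ** 3
    assert lhs <= rhs + 1e-12
print("Inequality (4) verified on random points for p = 2.")
```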

3 Optimal Tensor methods

The following algorithm is taken from [28]; for $g\equiv 0$ see [11].

Algorithm 1 Monteiro–Svaiter–Nesterov
MSN($x_{0}$, $f$, $g$, $p$, $H$, $K$)
1:  Input: $p\geq 1$, $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$, $H>0$.
2:  $A_{0}=0$, $y_{0}=x_{0}$.
3:  for $k=0$ to $k=K-1$ do
4:     Find $\lambda_{k+1}>0$ and $y_{k+1}\in\mathbb{R}^{d}$ such that
\frac{1}{2}\leq\lambda_{k+1}\frac{H\|y_{k+1}-\tilde{x}_{k}\|^{p-1}}{p!}\leq\frac{p}{p+1}\,,\text{ where}
y_{k+1}=\operatornamewithlimits{argmin}_{y\in\mathbb{R}^{d}}\left\{\Omega_{p}(f,\tilde{x}_{k};y)+g(y)+\frac{H}{(p+1)!}\|y-\tilde{x}_{k}\|^{p+1}\right\}\,, (6)
a_{k+1}=\frac{\lambda_{k+1}+\sqrt{\lambda_{k+1}^{2}+4\lambda_{k+1}A_{k}}}{2}\,,\quad A_{k+1}=A_{k}+a_{k+1}\,,
\tilde{x}_{k}=\frac{A_{k}}{A_{k+1}}y_{k}+\frac{a_{k+1}}{A_{k+1}}x_{k}.
5:     $x_{k+1}:=x_{k}-a_{k+1}\nabla f(y_{k+1})-a_{k+1}\nabla g(y_{k+1})$.
6:  end for
7:  return $y_{K}$
Theorem 1.

[28] Let $y_{k}$ be the output of Algorithm 1 MSN($x_{0}$, $f$, $g$, $p$, $H$, $k$) after $k$ iterations, where $p\geq 1$ and $H\geq(p+1)L_{p,f}$. Then

F(y_{k})-F(x^{\ast})\leq\frac{c_{p}HR^{p+1}}{k^{\frac{3p+1}{2}}}\,, (7)

where $c_{p}=2^{p-1}(p+1)^{\frac{3p+1}{2}}/p!$ and $R=\|x_{0}-x^{\ast}\|$.

Moreover, when $p\geq 2$, to reach an accuracy $\varepsilon$, i.e., $F(y_{k})-F(x^{\ast})\leq\varepsilon$, the auxiliary problem (6) has to be solved (to find $(\lambda_{k+1},y_{k+1})$ with proper accuracy) $O\left(\ln\left(\varepsilon^{-1}\right)\right)$ times at each iteration.

Note that Theorem 1 also holds when $H\geq 2L_{p,f}$ (independently of $p\geq 1$); this can be derived from (4). The condition $H\geq(p+1)L_{p,f}$ was used only because it guarantees the convexity of the auxiliary problem (6), see [59]. Under the conditions $H\geq(p+1)L_{p,f}$ and $g\equiv 0$, there exist efficient ways to solve the auxiliary problem (6) for $p=1,2,3$, see [59]. For $p=1$ there exists an explicit formula for the solution of (6). For $p=2,3$ the complexity of (6) is almost the same (up to a logarithmic factor in $\varepsilon$) as the complexity of a Newton's method iteration [59], see also Section 4. It is important that there is no need to solve (6) exactly. It is sufficient to find $\tilde{y}_{k+1}$ that satisfies

\left\|\nabla\widetilde{\Omega}^{k}(\tilde{y}_{k+1})\right\|\leq\frac{1}{4p(p+1)}\|\nabla F(\tilde{y}_{k+1})\|, (8)

where

\widetilde{\Omega}^{k}(y):=\Omega_{p}(f,\tilde{x}_{k};y)+g(y)+\frac{H}{(p+1)!}\|y-\tilde{x}_{k}\|^{p+1}.

Such a practical modification slows down the theoretical convergence rate only by a factor of 12/5 in the right-hand side of (7) [43, 35, 60].
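To illustrate why, for $p=2$ and $g\equiv 0$, the cost of (6) is comparable to a Newton step, here is a hedged Python sketch of the standard way to solve the cubic-regularized subproblem: the first-order optimality condition reduces it to a one-dimensional root-finding problem in $r=\|y-x\|$ plus one linear solve. The test function, the value of $H$, and the tolerances below are illustrative assumptions only.

```python
import numpy as np

def cubic_newton_step(grad, hess, x, H, tol=1e-10, max_iter=200):
    """Sketch of solving (6) for p = 2, g = 0:
        min_y  <grad(x), y - x> + 1/2 (y-x)^T hess(x) (y-x) + H/6 ||y - x||^3.
    The first-order optimality condition gives
        (hess(x) + (H r / 2) I)(y - x) = -grad(x),  r = ||y - x||,
    so it suffices to find the scalar r by bisection and then solve one
    linear system, i.e., the cost is comparable to a Newton step."""
    g, A = grad(x), hess(x)
    d = x.size

    def step_for(r):
        return np.linalg.solve(A + 0.5 * H * r * np.eye(d), -g)

    # phi(r) = ||step_for(r)|| - r is decreasing in r; bracket and bisect.
    lo, hi = 0.0, 1.0
    while np.linalg.norm(step_for(hi)) > hi:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step_for(mid)) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return x + step_for(0.5 * (lo + hi))

# Toy usage on a strongly convex quartic (chosen only for illustration).
A0 = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A0 @ x + 0.1 * np.sum(x ** 4)
grad = lambda x: A0 @ x + 0.4 * x ** 3
hess = lambda x: A0 + np.diag(1.2 * x ** 2)
x = np.array([2.0, -1.0])
for _ in range(10):
    x = cubic_newton_step(grad, hess, x, H=3.0)
print("approximate minimizer:", x, "f(x) =", f(x))
```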

In May 2022, D. Kovalev et al. [48] proposed, for $g\equiv 0$, an explicit policy for choosing the pair $(\lambda_{k+1},y_{k+1})$ in Algorithm 1 and modified the stopping criterion (8) according to [40]. This allows solving (6) at each iteration only about two times on average rather than $O\left(\ln\left(\varepsilon^{-1}\right)\right)$ times as before. The final bound on the number of $p$-th order oracle calls in [48] matched the lower oracle complexity bound from [59] (see also [25]) obtained for the worst-case function

F_{p}\left(x_{1},\ldots,x_{d}\right)=|x_{1}|^{p+1}+|x_{2}-x_{1}|^{p+1}+\ldots+|x_{d}-x_{d-1}|^{p+1}.

In the concurrent and independent paper [14], the authors also proposed a way of removing the additional logarithmic factors. Note that the approach of [14] does not require any a priori knowledge of the smoothness parameters (and allows a Hölder continuity assumption instead of a Lipschitz one).

If, additionally, $F$ satisfies the $r$-growth condition (5), then an optimal method can be developed based on a restart procedure [63, 27], see Algorithm 2.

Algorithm 2 Restarted MSN($x_{0}$, $f$, $g$, $p$, $r$, $\sigma_{r}$, $H$, $K$)
1:  Input: $F=f+g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ satisfying the $r$-growth condition with constant $\sigma_{r}$; MSN($x_{0}$, $f$, $g$, $p$, $H$, $K$).
2:  $z_{0}=x_{0}$.
3:  for $k=0$ to $K$ do
4:     $R_{k}=R_{0}\cdot 2^{-k}$,
N_{k}=\max\left\{\left\lceil\left(\frac{rc_{p}H2^{r}}{\sigma_{r}}R_{k}^{p+1-r}\right)^{\frac{2}{3p+1}}\right\rceil,1\right\}. (9)
5:     $z_{k+1}:=y_{N_{k}}$, where $y_{N_{k}}$ is the output of MSN($z_{k}$, $f$, $g$, $p$, $H$, $N_{k}$).
6:  end for
7:  return $z_{K}$
Theorem 2.

[63, 27] Let $z_{K}$ be the output of Algorithm 2 after $K$ restarts. If $H\geq(p+1)L_{p,f}$ and $\sigma_{r}>0$, then to reach $F(z_{K})-F(x^{*})\leq\varepsilon$ it is sufficient to solve (6)

N=\tilde{O}\left(\left(\frac{HR^{p+1-r}}{\sigma_{r}}\right)^{\frac{2}{3p+1}}\right) (10)

times, where $\tilde{O}(\cdot)$ means the same as $O(\cdot)$ up to a $\ln^{\alpha}\left(\varepsilon^{-1}\right)$ factor.

Everything that was noted after Theorem 1 also holds in this case.
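As an illustration of the restart scheduling in Algorithm 2, a minimal Python sketch follows; it assumes a hypothetical solver msn(z, N) that returns the output of MSN($z$, $f$, $g$, $p$, $H$, $N$), and simply halves the distance estimate $R_{k}$ while choosing $N_{k}$ according to (9).

```python
import math

def restarted_msn(msn, x0, R0, p, r, sigma_r, H, K):
    """Sketch of Algorithm 2: halve the distance estimate R_k at every restart
    and run MSN for N_k iterations chosen according to (9)."""
    c_p = 2 ** (p - 1) * (p + 1) ** ((3 * p + 1) / 2) / math.factorial(p)
    z, R = x0, R0
    for _ in range(K + 1):
        N = max(math.ceil((r * c_p * H * 2 ** r / sigma_r
                           * R ** (p + 1 - r)) ** (2.0 / (3 * p + 1))), 1)
        z = msn(z, N)   # stands for MSN(z, f, g, p, H, N), returning y_N
        R /= 2.0        # R_{k+1} = R_k / 2
    return z
```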

In particular, when $g\equiv 0$, $p\geq 2$ and $r=2$, the MSN algorithm in Theorem 2 can be replaced by Kovalev's variant of MSN [48] to obtain the

O\left(\left(\frac{L_{p,f}R^{p-1}}{\sigma_{2}}\right)^{\frac{2}{3p+1}}+\log\log\left(\frac{\sigma_{2}^{\frac{p+1}{p-1}}}{L_{p,f}^{\frac{2}{p-1}}\varepsilon}\right)\right) (11)

$p$-th order oracle complexity bound. This upper bound corresponds to the lower bound for the $\sigma_{2}$-strongly convex case from [46] and improves (10) by a logarithmic factor. The dependence on $\varepsilon$ is $\log\log\varepsilon^{-1}$, as it should be locally for tensor methods [26, 20] (see also [54, 63, 5, 19]). Note that (11) describes two regimes: the first term (complexity independent of $\varepsilon$) describes the complexity of reaching the vicinity of quadratic convergence, while the second term (of order $\log\log\varepsilon^{-1}$) describes the Newton-type local convergence.

Note that, due to the presence of the composite term $g$, the MSN algorithm described above and its variations can be used for splitting oracle complexities [44, 47]. Namely, assume that $f$ and $g$ have Lipschitz $p$-th order derivatives with different Lipschitz constants. Based on the MSN algorithm, one can propose a general framework to accelerate tensor methods by splitting computational complexities. As a result, one can get near-optimal oracle complexity for each function in the sum separately for any $p\geq 1$, including first-order methods. To be more precise, if the near-optimal complexity to minimize $f$ is $N_{f}(\varepsilon)$ and to minimize $g$ is $N_{g}(\varepsilon)$, then the MSN-based sliding algorithm [44, 47] requires no more than $\tilde{O}\left(N_{f}(\varepsilon)\right)$ oracle calls for $f$ and $\tilde{O}\left(N_{g}(\varepsilon)\right)$ oracle calls for $g$ to minimize $F=f+g$.

Note also that for $p=1$ and $f\equiv 0$ one can take any $H>0$ in MSN and obtain the Monteiro–Svaiter accelerated proximal envelope. In the series of recent papers [40, 16, 47, 49, 48] it was shown that this type of proximal (Catalyst-type [50]) envelope is free of logarithmic factors due to well-developed stopping criteria for the inner (auxiliary) problem. This opens up different perspectives for improving existing Catalyst-based algorithms, see, e.g., [72, 71].

4 Superfast acceleration and the structure of the auxiliary problem

In this section, the computationally efficient solution of the tensor subproblem (6) for $p=3$ is discussed. With the described procedure, it is possible to implement third-order methods without computing and storing the third-order derivative. At the beginning of the section, it is shown that this subproblem is convex. Then the Bregman-distance gradient method is used as a subsolver for the tensor step (6).

For this section, it is assumed that $g=0$ and $p=3$. Denote $L_{3}=L_{3,f}$. The third-order Taylor polynomial is

\Omega(f,x;y)=\Omega_{3}(f,x;y)=f(x)+\nabla f(x)[y-x]+\tfrac{1}{2}\nabla^{2}f(x)\left[y-x\right]^{2}+\tfrac{1}{6}D^{3}f(x)\left[y-x\right]^{3}. (12)

The regularized third-order model is

\tilde{\Omega}(f,x;y)=f(x)+\nabla f(x)[y-x]+\tfrac{1}{2}\nabla^{2}f(x)\left[y-x\right]^{2}+\tfrac{1}{6}D^{3}f(x)\left[y-x\right]^{3}+\tfrac{H}{24}\|y-x\|^{4}. (13)

The basic step for every tensor method is formulated as

x_{k+1}=\operatornamewithlimits{argmin}\limits_{y\in\mathbb{R}^{d}}\left\{\tilde{\Omega}(f,x_{k};y)\right\}. (14)

This step is a major part of almost every third-order method. Next, it is described how to solve this subproblem efficiently, without computing the third-order derivative and using only second-order information.

First, it is clarified that the function (13) is convex for $H\geq 3L_{3}$. Note that for smaller values of $H$ (e.g., $H\geq L_{3}$, which, by (4), already makes (13) an upper bound on $f$) the model (13) may be non-convex because of the third-order term; then minimizing the non-convex subproblem would be harder than minimizing the original convex problem. In 2018, Yu. Nesterov made a breakthrough in [59]: it was shown that if $H\geq 3L_{3}$, then the model (13) is convex and higher-order methods become implementable.

To give some intuition, a sketch of the convexity proof is presented. The following inequality can be derived from the Lipschitz condition on the third derivative and the convexity of $f$:

0\preceq\nabla^{2}f(y)\preceq\nabla^{2}\Omega(f,x;y)+\frac{L_{3}}{2}\|y-x\|^{2}I.

Now, to finish the proof, one needs to choose $H$ such that

\nabla^{2}\tilde{\Omega}(f,x;y)\succeq\nabla^{2}\Omega(f,x;y)+\frac{L_{3}}{2}\|y-x\|^{2}I.

This leads to the crucial observation, overlooked before:

\nabla^{2}\left\{\tfrac{1}{4}\|x\|^{4}\right\}=2xx^{T}+\|x\|^{2}I\succeq\|x\|^{2}I.

Since $xx^{T}$ is a rank-one (and hence singular) matrix, a factor of 3 is lost in the last inequality: one can only guarantee the lower bound $\|x\|^{2}I$ instead of $3\|x\|^{2}I$. Finally, if $H\geq 3L_{3}$, then the function (13) is convex. For more details, one can check Theorem 1 in [59].
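Explicitly, since $\nabla^{2}\left\{\tfrac{H}{24}\|y-x\|^{4}\right\}=\tfrac{H}{6}\|y-x\|^{2}I+\tfrac{H}{3}(y-x)(y-x)^{T}\succeq\tfrac{H}{6}\|y-x\|^{2}I$, for $H\geq 3L_{3}$ one gets

\nabla^{2}\tilde{\Omega}(f,x;y)\succeq\nabla^{2}\Omega(f,x;y)+\tfrac{H}{6}\|y-x\|^{2}I\succeq\nabla^{2}\Omega(f,x;y)+\tfrac{L_{3}}{2}\|y-x\|^{2}I\succeq\nabla^{2}f(y)\succeq 0.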

Thus, one can move to an efficient subsolver of problem (14), the Bregman-distance gradient method for relatively smooth functions from [52]. The main idea is to show that the minimized function $\phi(y)$ is $L_{\rho}$-smooth and $\mu_{\rho}$-strongly convex with respect to some convex function $\rho(y)$, i.e.,

\mu_{\rho}\nabla^{2}\rho(y)\preceq\nabla^{2}\phi(y)\preceq L_{\rho}\nabla^{2}\rho(y).

Then the Bregman-distance gradient method performs the steps

y_{t+1}=\operatornamewithlimits{argmin}\limits_{y\in\mathbb{R}^{d}}\left\{\left\langle\nabla\phi(y_{t}),y-y_{t}\right\rangle+L_{\rho}\beta_{\rho}\left(y_{t},y\right)\right\},

where

\beta_{\rho}\left(y_{t},y\right)=\rho(y)-\rho(y_{t})-\left\langle\nabla\rho(y_{t}),y-y_{t}\right\rangle

is the Bregman distance generated by $\rho(y)$. The Bregman-distance gradient method converges linearly with condition number $\kappa=\tfrac{L_{\rho}}{\mu_{\rho}}$ and iteration complexity

N=\tfrac{L_{\rho}}{\mu_{\rho}}\log\left(\tfrac{L_{\rho}\beta_{\rho}\left(y_{0},y_{\ast}\right)}{\varepsilon}\right).

One can show that the model (13) with $H=6L_{3}$ can be minimized, taking $\phi(y)=\tilde{\Omega}(f,x_{k};y)$, by the gradient method with the Bregman distance generated by

\rho_{x_{k}}(y)=\tfrac{1}{2}\nabla^{2}f(x_{k})\left[y-x_{k}\right]^{2}+\tfrac{L_{3}}{4}\|y-x_{k}\|^{4},

with $L_{\rho}=1+\tfrac{1}{\sqrt{2}}$, $\mu_{\rho}=1-\tfrac{1}{\sqrt{2}}$, and $\kappa=L_{\rho}/\mu_{\rho}=\left(1+\sqrt{2}\right)^{2}=O(1)$. This means that the subsolver is very fast and converges in a fixed number of iterations. It is worth noting that at each outer step the full Hessian $\nabla^{2}f(x_{k})$ entering $\rho_{x_{k}}$ is computed, but the full third-order derivative is unnecessary: one only needs the derivative-vector-vector product $D^{3}f(x_{k})\left[y-x_{k}\right]^{2}$ to compute $\nabla\tilde{\Omega}(f,x_{k};y)$. This product can be computed efficiently and precisely by automatic differentiation or approximated by finite differences of gradients. To summarize, a method with the third-order convergence rate $1/k^{5}$ is obtained, while only first- and second-order derivatives are computed. This approach was first proposed in [59] and then improved in [61]. Section 5 of [59] is good for a general understanding of this approach; for details and precise proofs, one can read [61].
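Below is a hedged Python sketch of this Bregman-distance subsolver for (14): it uses one Hessian of $f$ at $x_{k}$, approximates $D^{3}f(x_{k})[h]^{2}$ by a finite difference of gradients, and inverts $\nabla\rho_{x_{k}}$ via a one-dimensional bisection. The function names, the toy quartic in the usage example, and the numerical tolerances are assumptions made for illustration; the actual procedures are described in [59, 61].

```python
import numpy as np

def bregman_subsolver(grad, hess_mat, x, L3, n_steps=20, tau=1e-4):
    """Sketch of the Bregman-distance gradient method for the subproblem (14)
    with H = 6 * L3, using only gradients and one Hessian of f at x:
    the term D^3 f(x)[h]^2 is approximated by a finite difference of gradients."""
    A = hess_mat(x)                      # Hessian at the current point x_k
    g0 = grad(x)
    d = x.size
    H = 6.0 * L3
    L_rho = 1.0 + 1.0 / np.sqrt(2.0)

    def d3_hh(h):
        # D^3 f(x)[h]^2 ~ (grad(x + tau h) - 2 grad(x) + grad(x - tau h)) / tau^2
        return (grad(x + tau * h) - 2.0 * g0 + grad(x - tau * h)) / tau ** 2

    def grad_phi(y):
        h = y - x
        return g0 + A @ h + 0.5 * d3_hh(h) + (H / 6.0) * np.dot(h, h) * h

    def grad_rho(y):
        h = y - x
        return A @ h + L3 * np.dot(h, h) * h

    def inv_grad_rho(c):
        # Solve A u + L3 ||u||^2 u = c via bisection on r = ||u||.
        def u_of(r):
            return np.linalg.solve(A + L3 * r ** 2 * np.eye(d), c)
        lo, hi = 0.0, 1.0
        while np.linalg.norm(u_of(hi)) > hi:
            hi *= 2.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if np.linalg.norm(u_of(mid)) > mid else (lo, mid)
        return x + u_of(0.5 * (lo + hi))

    y = x.copy()
    for _ in range(n_steps):
        # Bregman step: grad_rho(y_new) = grad_rho(y) - grad_phi(y) / L_rho.
        y = inv_grad_rho(grad_rho(y) - grad_phi(y) / L_rho)
    return y

# Toy usage: f(z) = 0.5 z^T B z + (1/12)||z||^4, for which L3 = 2.
B = np.array([[2.0, 0.3], [0.3, 1.0]])
f_grad = lambda z: B @ z + (1.0 / 3.0) * np.dot(z, z) * z
f_hess = lambda z: (B + (2.0 / 3.0) * np.outer(z, z)
                    + (1.0 / 3.0) * np.dot(z, z) * np.eye(2))
xk = np.array([1.0, -2.0])
y_next = bregman_subsolver(f_grad, f_hess, xk, L3=2.0)
print("tensor step from", xk, "to", y_next)
```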

5 Tensor methods and stochastic distributed setup

As for first-order methods, modifications of tensor methods have been proposed for finite-sum, stochastic, and distributed optimization problems.

5.1 Stochastic Optimization

In this subsection, problem (1) with $g=0$ is considered under inexact information on the derivatives of the objective $f$ up to order $p\geq 2$. In particular, this setting is motivated by stochastic optimization, where the objective $f$ has the form

f(x)=\mathbb{E}_{\xi\sim\mathcal{D}}[f(x;\xi)], (15)

where the random variable $\xi$ is sampled from a distribution $\mathcal{D}$, and an optimization procedure has access only to stochastic realizations of the function $f(x;\xi)$ via its derivatives up to order $p\geq 2$.

The full Taylor expansion (3) of $f$ requires computing all the derivatives up to order $p$, which can be expensive. Following the works [32, 3, 4], approximations $G_{x,i}$ of the derivatives $\nabla^{i}f(x)$, $i=1,\ldots,p$, are used to construct an inexact $p$-th order Taylor expansion of the objective:

\Phi_{p}(f,x;y)=f\left(x\right)+\sum\limits_{i=1}^{p}\frac{1}{i!}G_{x,i}[y-x]^{i}. (16)

Let us consider the case when the approximations $G_{x,i}=G_{i}(x,\xi)$ are stochastic, namely, they satisfy the following assumption.

Assumption 1 (Unbiased stochastic gradients with bounded variance and bounded-variance stochastic higher-order derivatives).

For any $x\in\mathbb{R}^{d}$ the stochastic gradient $G_{1}(x,\xi)$ and the stochastic higher-order derivatives $G_{i}(x,\xi)$, $i=2,\ldots,p$, satisfy

\mathbb{E}[G_{1}(x,\xi)\mid x]=\nabla f(x),\quad\mathbb{E}\left[\|G_{1}(x,\xi)-\nabla f(x)\|^{2}\mid x\right]\leq\sigma_{1}^{2}, (17)
\mathbb{E}\left[\|G_{i}(x,\xi)-\nabla^{i}f(x)\|^{2}\mid x\right]\leq\sigma_{i}^{2},\quad i=2,\ldots,p.

Next, stochastic tensor methods under Assumption 1 are described.

Recall that exact tensor methods are based on the minimization of the tensor model $\tilde{\Omega}(f,x;y)$ (see eq. (13) for $p=3$). For inexact tensor methods the model is constructed in the following way [4, 3]:

\omega^{H,\bar{\delta}}_{p}(f,x;y):=\Phi_{p}(f,x;y)+\tfrac{\bar{\delta}}{2}\|x-y\|^{2}+\sum\limits_{i=3}^{p}\tfrac{\eta_{i}\sigma_{i}}{i!}\|x-y\|^{i}+\tfrac{pH}{(p+1)!}\|x-y\|^{p+1},

where $\bar{\delta}>0$ is a regularization parameter depending on $\sigma_{1},\sigma_{2}$, and $\eta_{i}$ are constants with $\eta_{i}>0$, $3\leq i\leq p$.

This model satisfies two main conditions ([3], Theorem 1):

  • The model $\omega^{H,\bar{\delta}}_{p}(f,x;y)$ is a global upper bound for the function $f$:

    f(y)\leq\omega^{H,\bar{\delta}}_{p}(f,x;y),\;x,y\in\mathbb{R}^{d}.
  • The model $\omega^{H,\bar{\delta}}_{p}(f,x;y)$ is convex.

At each step of these tensor algorithms, one needs to find
$u\in\operatornamewithlimits{argmin}_{y\in\mathbb{R}^{d}}\omega^{H,\bar{\delta}}_{p}(f,x;y)$. However, finding this minimizer exactly is a separate challenge. Instead of computing the exact minimum, one can find a point $s\in\mathbb{R}^{d}$ with a small norm of the gradient.

Definition 3.

Denote by $s^{H,\bar{\delta},\tau}(x)$ a $\tau$-inexact solution of the subproblem, i.e., a point $s:=s^{H,\bar{\delta},\tau}(x)$ such that

\|\nabla\omega^{H,\bar{\delta}}_{p}(f,x;s)\|\leq\tau.

Everything is now prepared to introduce the Accelerated Stochastic Tensor Method ([4], Algorithm 2) listed as Algorithm 3.

Algorithm 3 Accelerated Stochastic Tensor Method [4]
1:  Input: starting point $y_{0}=x_{0}$; constants $H\geq\frac{2}{p}L_{p}$; $\eta_{i}\geq 4$, $3\leq i\leq p$; starting inexactness $\bar{\delta}_{0}\geq 0$; nonnegative nondecreasing sequences $\{\bar{\kappa}^{t}_{i}\}_{t\geq 0}$ for $i=2,\ldots,p+1$; and
\alpha_{t}=\tfrac{p+1}{t+p+1},\quad A_{t}=\prod\limits_{j=1}^{t}(1-\alpha_{j}),\quad A_{0}=1. (18)
\psi_{0}(x):=\tfrac{\bar{\kappa}_{2}^{0}+\lambda_{0}}{2}\|x-x_{0}\|^{2}+\sum\limits_{i=3}^{p}\tfrac{\bar{\kappa}_{i}^{0}}{i!}\|x-x_{0}\|^{i}.
2:  for $t\geq 0$ do
3:     
v_{t}=(1-\alpha_{t})x_{t}+\alpha_{t}y_{t},\quad x_{t+1}=S_{p}^{H,\bar{\delta}_{t},\tau}(v_{t})
4:     Compute
y_{t+1}=\arg\min_{x\in\mathbb{R}^{d}}\left\{\psi_{t+1}(x):=\psi_{t}(x)+\tfrac{\lambda_{t+1}-\lambda_{t}}{2}\|x-x_{0}\|^{2}+\sum\limits_{i=2}^{p}\tfrac{\bar{\kappa}^{t+1}_{i}-\bar{\kappa}^{t}_{i}}{i!}\|x-x_{0}\|^{i}+\tfrac{\alpha_{t}}{A_{t}}\left(f(x_{t+1})+\left\langle G_{1}(x_{t+1},\xi),x-x_{t+1}\right\rangle\right)\right\},
5:  end for
Theorem 4.

[4] Let $f$ have a Lipschitz $p$-th derivative and let Assumption 1 hold. After $T\geq 1$ iterations of Algorithm 3 with parameters

\bar{\kappa}_{2}^{t}=\tfrac{2p\alpha_{t}^{2}}{A_{t}}\bar{\delta}_{t};\quad\bar{\kappa}_{p+1}^{t}=\tfrac{(p+1)^{p+1}}{2}\tfrac{\alpha_{t}^{p+1}}{A_{t}}H;\quad\bar{\kappa}_{i}^{t}=\tfrac{p^{i-1}}{2}\tfrac{\alpha_{t}^{i}}{A_{t}}\eta_{i}\sigma_{i},~3\leq i\leq p;\quad\lambda_{t}=\tfrac{\sigma}{R}(t+p+1)^{p+1/2};\quad\delta_{t}=\sigma_{2}+\tfrac{\tau+\sigma_{1}}{R}(t+p+1)^{3/2},

the objective residual can be bounded as follows

\mathbb{E}\left[f(x_{T})-f(x^{\ast})\right]\leq\tfrac{3(p+1)^{3}(\tau+\sigma_{1})R}{(T+p+1)^{1/2}}+\sum\limits_{i=3}^{p}\tfrac{2(p+1)^{2i-1}}{i!}\tfrac{\sigma_{i}R^{i}}{(T+p+1)^{i}}+\tfrac{(p+1)^{2(p+1)}}{(p+1)!}\tfrac{HR^{p+1}}{(T+p+1)^{p+1}}.

Theorem 4 yields several intriguing consequences.

First of all, Assumption 1 requires unbiasedness of the stochastic gradients but not of the higher-order derivatives. The general case of inexact (and possibly biased) gradients and higher-order derivatives is considered in [3]. In this scenario, a slightly different version of Algorithm 3 converges with the rate $O\left(\delta_{1}R+\sum\limits_{i=2}^{p}\frac{\delta_{i}R^{i}}{T^{i}}+\frac{L_{p}R^{p+1}}{T^{p+1}}\right)$, where $\delta_{i}$ is the inexactness of the $i$-th derivative: $\|(G_{x,i}-\nabla^{i}f(x))[y-x]^{i-1}\|\leq\delta_{i}\|y-x\|^{i-1}$, $i=1,\ldots,p$.

Moreover, Algorithm 3 also allows for restarts. The total number of iterations of the Restarted Accelerated Stochastic Tensor Method ([4], Algorithm 3) to reach the desired accuracy $\varepsilon$, i.e., $f(x_{t})-f(x^{*})\leq\varepsilon$ in expectation, is

O\left(\tfrac{(\tau+\sigma_{1})^{2}}{\mu\varepsilon}+\left(\sqrt{\tfrac{\sigma_{2}}{\mu}}+1\right)\log\tfrac{f(z_{0})-f(x^{*})}{\varepsilon}+\sum_{i=3}^{p}\left(\tfrac{\sigma_{i}R^{i-2}}{\mu}\right)^{\frac{1}{i}}+\left(\tfrac{L_{p}R^{p-1}}{\mu}\right)^{\frac{1}{p+1}}\right). (19)

One can use the mini-batch Restarted Accelerated Stochastic Tensor Method to solve problem (15). For simplicity, let $p=2$ and assume that the subproblem can be solved exactly, i.e., $\tau=0$. In this approach, the mini-batched stochastic gradients and Hessians are computed as $\tfrac{1}{r_{1}}\sum_{i=1}^{r_{1}}\nabla f(x,\xi_{i})$ and $\tfrac{1}{r_{2}}\sum_{i=1}^{r_{2}}\nabla^{2}f(x,\xi_{i})$, respectively, where $r_{1}$ and $r_{2}$ are the batch sizes. From the convergence estimate (19), one can determine the overall number of stochastic gradient computations, $O\left(\tfrac{\sigma_{1}^{2}}{\varepsilon\mu^{2/3}}\right)$, which is similar to the accelerated SGD method [31]. Interestingly, the number of stochastic Hessian computations, $O\left(\tfrac{\sigma_{2}}{\mu^{1/3}}\log\frac{1}{\varepsilon}\right)$, depends on the desired accuracy $\varepsilon$ only logarithmically.
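A minimal Python sketch of this mini-batch construction for $p=2$ follows (synthetic least-squares data, illustrative batch sizes and constants); it is not the exact method of [4], only the mini-batch estimators and a $\bar{\delta}$-regularized cubic step built from them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stochastic objective: f(x; xi) = 0.5 * (a_xi @ x - b_xi)^2 with synthetic data
# (purely illustrative; in (15) xi is a sample from the data distribution D).
d, n = 5, 10_000
A_data = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b_data = A_data @ x_true + 0.1 * rng.normal(size=n)

def minibatch_estimates(x, r1, r2):
    """Mini-batch stochastic gradient G_1 and Hessian G_2 as in Section 5.1."""
    idx1 = rng.integers(0, n, size=r1)
    idx2 = rng.integers(0, n, size=r2)
    G1 = A_data[idx1].T @ (A_data[idx1] @ x - b_data[idx1]) / r1
    G2 = A_data[idx2].T @ A_data[idx2] / r2
    return G1, G2

def inexact_cubic_step(x, G1, G2, H, delta_bar):
    """One step built from the inexact model for p = 2:
       min_h  <G1, h> + 0.5 h^T G2 h + delta_bar/2 ||h||^2 + H/3 ||h||^3,
       solved via bisection on r = ||h||, since optimality gives
       (G2 + (delta_bar + H r) I) h = -G1."""
    I = np.eye(x.size)
    h_of = lambda r: np.linalg.solve(G2 + (delta_bar + H * r) * I, -G1)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(h_of(hi)) > hi:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(h_of(mid)) > mid else (lo, mid)
    return x + h_of(hi)

x = np.zeros(d)
for t in range(20):
    G1, G2 = minibatch_estimates(x, r1=256, r2=64)
    x = inexact_cubic_step(x, G1, G2, H=1.0, delta_bar=0.1)
print("distance to x_true:", np.linalg.norm(x - x_true))
```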

Finally, note that in the case of $p=2$, Algorithm 3 matches the lower bound ([4], Theorem 4.2) $\Omega\left(\tfrac{\sigma_{1}R}{\sqrt{T}}+\frac{\sigma_{2}R^{2}}{T^{2}}+\frac{L_{2}R^{3}}{T^{7/2}}\right)$ for stochastic second-order methods in the gradient and Hessian error terms.

5.2 Distributed optimization

In this subsection, the distributed empirical risk minimization problem is considered. In this setup, multiple agents/devices/workers collaborate to solve a machine learning problem by communicating through a central node (server). The goal is to minimize the following finite-sum objective:

\min\limits_{x\in\mathbb{R}^{d}}\left\{f(x):=\tfrac{1}{m}\sum\limits_{j=1}^{m}f_{j}(x)\right\}, (20)

where $f_{j}(x)$ is the loss function (parametrized by $x\in\mathbb{R}^{d}$) associated with the data stored on the $j$-th machine, and the $\mu_{f}$-strongly convex function $f(x)$ is the global empirical loss. Each node has access only to its local data and can communicate with the central server.

Problems of the form (20) arise in distributed machine learning and other applications. Beyond first-order methods, second-order methods have been developed for such Federated Learning problems. One of the keys to efficient second-order methods in this setting is exploiting the statistical similarity of the data [73, 23, 18, 13, 2]. Using statistical arguments, it is possible to show that if $f_{1}(x)=\frac{1}{n}\sum_{l=1}^{n}\ell(x,y_{l})$, where $\ell(x,y_{l})$ is the individual loss of each example $l$ stored at the first device, chosen to be the central device, then $\|\nabla^{2}f(x)-\nabla^{2}f_{1}(x)\|_{2}\leq\sigma$, where $\sigma$ is proportional to $\frac{1}{\sqrt{n}}$. Then, defining

\rho(x)=\frac{1}{n}\sum_{l=1}^{n}\ell(x;y_{l})+\frac{\sigma}{2}\|x\|_{2}^{2}, (21)

the global objective $f$ becomes strongly convex and smooth relative to $\rho$, i.e., (cf. Section 4)

\mu_{\rho}\nabla^{2}\rho(x)\preceq\nabla^{2}f(x)\preceq L_{\rho}\nabla^{2}\rho(x),

where $L_{\rho}=1$, $\mu_{\rho}=\mu_{f}/(\mu_{f}+2\sigma)$, and $\kappa_{\rho}=L_{\rho}/\mu_{\rho}=1+2\sigma/\mu_{f}$. Thus, as in Section 4, one can apply the Bregman-distance gradient method and its <<accelerated>> variant [38].

When implemented, these methods require minimizing at the master node (the first node)

\min_{y}\left\{\left\langle\nabla f(y_{m}),y-y_{m}\right\rangle+L_{\rho}\beta_{\rho}\left(y_{m},y\right)\right\}.

The first term is available due to communication, and the last term is a sum-type function with $n$ terms according to (21). If $n$ is large enough and $d$ is moderate, then the Hessian calculation time can dominate the Hessian inversion time. This allows tensor methods to be used efficiently, in particular the superfast second-order methods mentioned in Section 4.
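A minimal Python sketch of one statistically preconditioned iteration is given below, with quadratic losses so that the master subproblem reduces to a single linear solve; the synthetic data, the choice $\sigma=1/\sqrt{n}$, and all sizes are illustrative assumptions. For non-quadratic losses the master subproblem is itself solved iteratively, which is exactly where the superfast methods of Section 4 can be plugged in.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic distributed least-squares: m workers, n samples each (illustrative).
m_workers, n, d = 8, 500, 10
A = rng.normal(size=(m_workers, n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.05 * rng.normal(size=(m_workers, n))

def local_grad(j, x):
    """Gradient of f_j(x) = (1/n) sum_l 0.5 (a_l @ x - b_l)^2 at worker j."""
    return A[j].T @ (A[j] @ x - b[j]) / n

# Statistical-similarity parameter; in theory sigma ~ 1/sqrt(n), here a guess.
sigma = 1.0 / np.sqrt(n)
H_rho = A[0].T @ A[0] / n + sigma * np.eye(d)   # Hessian of rho from (21)

y = np.zeros(d)
for it in range(30):
    # Communication round: workers send local gradients, master averages them.
    g = np.mean([local_grad(j, y) for j in range(m_workers)], axis=0)
    # Master step: min_y <g, y - y_m> + L_rho * beta_rho(y_m, y) with L_rho = 1.
    # Since rho is quadratic here, grad_rho(y_new) = grad_rho(y) - g is linear:
    y = np.linalg.solve(H_rho, H_rho @ y - g)
print("distance to x_true:", np.linalg.norm(y - x_true))
```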

This idea was exploited and analysed in [23], which resulted in an efficient distributed second-order solver with good practical performance and, in a certain regime, a total arithmetic-operations complexity better than that of existing variance-reduced first-order algorithms for problem (20).

The work of A. Agafonov was supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye), No. 075-00337-20-03, project No. 0714-2020-0005. The work of A. Gasnikov was supported by the strategic academic leadership program <<Priority 2030>> (Agreement 075-02-2021-1316 30.09.2021).

6 Conclusion

At each iteration, higher-order (also called tensor) methods minimize a regularized Taylor expansion of the objective function, which leads to faster convergence rates if the corresponding higher-order derivative is Lipschitz continuous. Recently, a series of lower iteration complexity bounds for such methods was proved, and a gap between upper and lower complexity bounds was revealed. Moreover, it was shown that such methods can be implementable since the appropriately regularized Taylor expansion of a convex function is also convex and, thus, can be minimized in polynomial time. Only very recently, an algorithm with the optimal convergence rate $1/k^{(3p+1)/2}$ was proposed for minimizing convex functions with Lipschitz $p$-th derivative. For convex functions with Lipschitz third derivative, these developments allowed proposing a second-order method with convergence rate $1/k^{5}$, which is faster than the rate $1/k^{3.5}$ of existing second-order methods.

7 MSC codes

90C25, 65K10, 90C15

8 Cross-References

  • Automatic Differentiation: Calculation of Newton Steps; Dixon, L.

  • Computational Optimal Transport; Dvurechensky P., Dvinskikh D., Tupitsa N., Gasnikov A.

  • Stochastic gradient descent; Lan G.

  • Unified analysis of SGD-type methods; Gorbunov E.

  • Decentralized Convex Optimization over Time-Varying Graphs; Rogozin A., Gasnikov A., Beznosikov A., Kovalev D.

  • Stochastic Quasi-Newton Scheme; Jalilzadeh, A.

References

  • [1] Deeksha Adil, Brian Bullins, Arun Jambulapati, and Sushant Sachdeva. Line search-free methods for higher-order smooth monotone variational inequalities. arXiv preprint arXiv:2205.06167, 2022.
  • [2] Artem Agafonov, Pavel Dvurechensky, Gesualdo Scutari, Alexander Gasnikov, Dmitry Kamzolov, Aleksandr Lukashevich, and Amir Daneshmand. An accelerated second-order method for distributed stochastic optimization. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 2407–2413. IEEE, 2021.
  • [3] Artem Agafonov, Dmitry Kamzolov, Pavel Dvurechensky, and Alexander Gasnikov. Inexact tensor methods and their application to stochastic convex optimization. arXiv preprint arXiv:2012.15636, 2020.
  • [4] Artem Agafonov, Dmitry Kamzolov, Alexander Gasnikov, Kimon Antonakopoulos, Volkan Cevher, and Martin Takáč. Advancing the lower bounds: An accelerated, stochastic, second-order method with optimal adaptation to inexactness, 2023.
  • [5] Naman Agarwal and Elad Hazan. Lower bounds for higher-order convex optimization. In Conference On Learning Theory, pages 774–792. PMLR, 2018.
  • [6] Masoud Ahookhosh and Yurii Nesterov. High-order methods beyond the classical complexity bounds, ii: inexact high-order proximal-point methods with segment search. arXiv preprint arXiv:2109.12303, 2021.
  • [7] Yossi Arjevani, Ohad Shamir, and Ron Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, 178(1):327–360, 2019.
  • [8] Michel Baes. Estimate sequence methods: extensions and approximations. Institute for Operations Research, ETH, Zürich, Switzerland, 2009.
  • [9] Ali Bouaricha. Tensor methods for large, sparse unconstrained optimization. SIAM Journal on Optimization, 7(3):732–756, 1997.
  • [10] Sébastien Bubeck, Qijia Jiang, Yin-Tat Lee, Yuanzhi Li, and Aaron Sidford. Complexity of highly parallel non-smooth convex optimization. Advances in Neural Information Processing Systems, 32, 2019.
  • [11] Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near-optimal method for highly smooth convex optimization. In Conference on Learning Theory, pages 492–507. PMLR, 2019.
  • [12] Brian Bullins. Highly smooth minimization of non-smooth problems. In Conference on Learning Theory, pages 988–1030. PMLR, 2020.
  • [13] Brian Bullins, Kshitij Patel, Ohad Shamir, Nathan Srebro, and Blake E Woodworth. A stochastic newton algorithm for distributed convex optimization. Advances in Neural Information Processing Systems, 34:26818–26830, 2021.
  • [14] Yair Carmon, Danielle Hausler, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Optimal and adaptive monteiro-svaiter acceleration. arXiv preprint arXiv:2205.15371, 2022.
  • [15] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Thinking inside the ball: Near-optimal minimization of the maximal loss. In Conference on Learning Theory, pages 866–882. PMLR, 2021.
  • [16] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Recapp: Crafting a more efficient catalyst for convex optimization. In International Conference on Machine Learning, pages 2658–2685. PMLR, 2022.
  • [17] Pafnuty Chebyshev. Collected works. Vol 5. Strelbytskyy Multimedia Publishing, 2018.
  • [18] Amir Daneshmand, Gesualdo Scutari, Pavel Dvurechensky, and Alexander Gasnikov. Newton method over networks is fast up to the statistical precision. In International Conference on Machine Learning, pages 2398–2409. PMLR, 2021.
  • [19] Nikita Doikov, Konstantin Mishchenko, and Yurii Nesterov. Super-universal regularized newton method. arXiv preprint arXiv:2208.05888, 2022.
  • [20] Nikita Doikov and Yurii Nesterov. Local convergence of tensor methods. Mathematical Programming, 193(1):315–336, 2022.
  • [21] Nikita Doikov and Yurii E Nesterov. Inexact tensor methods with dynamic accuracies. In ICML, pages 2577–2586, 2020.
  • [22] Pavel Dvurechensky, Alexander Gasnikov, Petr Ostroukhov, César A. Uribe, and Anastasiya Ivanova. Near-optimal tensor methods for minimizing the gradient norm of convex function. arXiv:1912.03381, 2019. WIAS Preprint No. 2694.
  • [23] Pavel Dvurechensky, Dmitry Kamzolov, Aleksandr Lukashevich, Soomin Lee, Erik Ordentlich, César A Uribe, and Alexander Gasnikov. Hyperfast second-order local solvers for efficient statistically preconditioned distributed optimization. arXiv preprint arXiv:2102.08246, 2021.
  • [24] Yurii Gavrilovich Evtushenko and Alexey Anatol’evich Tret’yakov. pth-order approximation of the solution set of nonlinear equations. Computational Mathematics and Mathematical Physics, 53(12):1763–1780, 2013.
  • [25] Ankit Garg, Robin Kothari, Praneeth Netrapalli, and Suhail Sherif. Near-optimal lower bounds for convex optimization for all orders of smoothness. Advances in Neural Information Processing Systems, 34:29874–29884, 2021.
  • [26] Alexander Gasnikov. Modern numerical methods. universal gradient descent. MIPT, arXiv:1711.00394, 2018.
  • [27] Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, and César A Uribe. Optimal tensor methods in smooth convex and uniformly convexoptimization. In Conference on Learning Theory, pages 1374–1391. PMLR, 2019.
  • [28] Alexander Vladimirovich Gasnikov, Darina Mikhailovna Dvinskikh, Pavel Evgenievich Dvurechensky, Dmitry Igorevich Kamzolov, Vladislav Vyacheslavovich Matyukhin, Dmitry Arkadievich Pasechnyuk, Nazarii Konstantinovich Tupitsa, and Aleksey Vladimirovich Chernov. Accelerated meta-algorithm for convex optimization problems. Computational Mathematics and Mathematical Physics, 61(1):17–28, 2021.
  • [29] Alexander Vladimirovich Gasnikov and Dmitry Alexandrovich Kovalev. A hypothesis about the rate of global convergence for optimal methods (newton’s type) in smooth convex optimization. Computer research and modeling, 10(3):305–314, 2018.
  • [30] A.V. Gasnikov, E.A. Gorbunov, D.A. Kovalev, A.A.M. Mohammed, and E.O. Chernousova. Substantiation of the hypothesis about optimal estimates of the rate of convergence of numerical methods of high-order convex optimization. Computer Research and Modeling, 10(6):737–753, 2018.
  • [31] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
  • [32] Saeed Ghadimi, Han Liu, and Tong Zhang. Second-order methods with cubic regularization under inexact information. arXiv preprint arXiv:1710.05782, 2017.
  • [33] G. N. Grapiglia and Yu. Nesterov. Tensor methods for minimizing convex functions with hölder continuous higher-order derivatives. SIAM Journal on Optimization, 30(4):2750–2779, 2020.
  • [34] G. N. Grapiglia and Yurii Nesterov. Tensor methods for finding approximate stationary points of convex functions. Optimization Methods and Software, 0(0):1–34, 2020.
  • [35] Geovani Nunes Grapiglia and Yurii Nesterov. On inexact solution of auxiliary problems in tensor methods for convex optimization. Optimization Methods and Software, 36(1):145–170, 2021.
  • [36] Geovani Nunes Grapiglia and Yurii Nesterov. Adaptive third-order methods for composite convex optimization. arXiv:2202.12730, 2022.
  • [37] Andreas Griewank. The modification of newton’s method for unconstrained optimization by bounding cubic terms. Technical report, Technical report NA/12, 1981.
  • [38] Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie. Statistically preconditioned accelerated gradient method for distributed optimization. In International conference on machine learning, pages 4203–4227. PMLR, 2020.
  • [39] Karl-Heinz Hoffmann and Hans-Joachim Kornstaedt. Higher-order necessary conditions in abstract mathematical programming. Journal of Optimization Theory and Applications, 26(4):533–568, 1978.
  • [40] Anastasiya Ivanova, Dmitry Pasechnyuk, Dmitry Grishchenko, Egor Shulgin, Alexander Gasnikov, and Vladislav Matyukhin. Adaptive catalyst for smooth convex optimization. In International Conference on Optimization and Applications, pages 20–37. Springer, 2021.
  • [41] Bo Jiang, Tianyi Lin, and Shuzhong Zhang. A unified adaptive tensor approximation scheme to accelerate composite convex optimization. SIAM Journal on Optimization, 30(4):2897–2926, 2020.
  • [42] Bo Jiang, Haoyue Wang, and Shuzhong Zhang. An optimal high-order tensor method for convex optimization. In Conference on Learning Theory, pages 1799–1801. PMLR, 2019.
  • [43] Dmitry Kamzolov and Alexander Gasnikov. Near-optimal hyperfast second-order method for convex optimization and its sliding. arXiv preprint arXiv:2002.09050, 2020.
  • [44] Dmitry Kamzolov, Alexander Gasnikov, and Pavel Dvurechensky. Optimal combination of tensor optimization methods. In International Conference on Optimization and Applications, pages 166–183. Springer, 2020.
  • [45] Leonid Vital’evich Kantorovich. On newton’s method. Trudy Matematicheskogo Instituta imeni VA Steklova, 28:104–144, 1949.
  • [46] Guy Kornowski and Ohad Shamir. High-order oracle complexity of smooth and strongly convex optimization. arXiv preprint arXiv:2010.06642, 2020.
  • [47] Dmitry Kovalev, Aleksandr Beznosikov, Ekaterina Borodich, Alexander Gasnikov, and Gesualdo Scutari. Optimal gradient sliding and its application to distributed optimization under similarity. arXiv preprint arXiv:2205.15136, 2022.
  • [48] Dmitry Kovalev and Alexander Gasnikov. The first optimal acceleration of high-order methods in smooth convex optimization. arXiv preprint arXiv:2205.09647, 2022.
  • [49] Dmitry Kovalev and Alexander Gasnikov. The first optimal acceleration of high-order methods in smooth convex optimization. arXiv preprint arXiv:2205.09647, 2022.
  • [50] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research, 18(1):7854–7907, 2018.
  • [51] Tianyi Lin, Michael Jordan, et al. Perseus: A simple high-order regularization method for variational inequalities. arXiv preprint arXiv:2205.03202, 2022.
  • [52] Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
  • [53] Renato D.C. Monteiro and Benar Fux Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013.
  • [54] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
  • [55] Yu Nesterov. Accelerating the cubic regularization of newton’s method on convex problems. Mathematical Programming, 112(1):159–181, 2008.
  • [56] Yurii Nesterov. A method of solving a convex programming problem with convergence rate o(1/k2)o(1/k^{2}). Soviet Mathematics Doklady, 27(2):372–376, 1983.
  • [57] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • [58] Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.
  • [59] Yurii Nesterov. Implementable tensor methods in unconstrained convex optimization. Mathematical Programming, 186(1):157–183, 2021.
  • [60] Yurii Nesterov. Inexact high-order proximal-point methods with auxiliary search procedure. SIAM Journal on Optimization, 31(4):2807–2828, 2021.
  • [61] Yurii Nesterov. Superfast second-order methods for unconstrained convex optimization. Journal of Optimization Theory and Applications, 191(1):1–30, 2021.
  • [62] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
  • [63] Yurii Nesterov and Boris Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • [64] Isaac Newton. Methodus fluxionum et serierum infinitarum, 1671. The method of fluxions and infinite series, 1967.
  • [65] Jorge Nocedal and Stephen J. Wright. Numerical optimization. Springer, 1999.
  • [66] Boris Teodorovich Poljak. Sharp minimum. In Generalized Lagrangians and applications. Oxford: Pergamon Press, 1982.
  • [67] B. Polyak. Introduction to optimization. Optimization Software, 1987.
  • [68] Robert B Schnabel and Ta-Tung Chow. Tensor methods for unconstrained optimization using second derivatives. SIAM Journal on Optimization, 1(3):293–315, 1991.
  • [69] Robert B. Schnabel and Paul D. Frank. Tensor methods for nonlinear equations. SIAM Journal on Numerical Analysis, 21(5):815–843, 1984.
  • [70] Chaobing Song, Yong Jiang, and Yi Ma. Unified acceleration of high-order algorithms under general hölder continuity. SIAM Journal on Optimization, 31(3):1797–1826, 2021.
  • [71] Ye Tian, Gesualdo Scutari, Tianyu Cao, and Alexander Gasnikov. Acceleration in distributed optimization under similarity. In International Conference on Artificial Intelligence and Statistics, pages 5721–5756. PMLR, 2022.
  • [72] Vladislav Tominin, Yaroslav Tominin, Ekaterina Borodich, Dmitry Kovalev, Alexander Gasnikov, and Pavel Dvurechensky. On accelerated methods for saddle-point problems with composite structure. arXiv preprint arXiv:2103.09344, 2021.
  • [73] Yuchen Zhang and Lin Xiao. Communication-Efficient Distributed Optimization of Self-concordant Empirical Loss, pages 289–341. Springer International Publishing, Cham, 2018.