Stochastic Adaptive Regularization Method with Cubics:
A High Probability Complexity Bound
Abstract
We present a high probability complexity bound for a stochastic adaptive regularization method with cubics, also known as a regularized Newton method. The method makes use of stochastic zeroth-, first-, and second-order oracles that satisfy certain accuracy and reliability assumptions. Such oracles have been used in the literature by other stochastic adaptive methods, such as trust region and line search methods. These oracles capture many settings, such as expected risk minimization, simulation optimization, and others. In this paper, we give the first high probability iteration bound for stochastic cubic regularization, and show that, just as in the deterministic case, it is superior to other stochastic adaptive methods.
1 Introduction
We are interested in unconstrained optimization problems of the form
$$\min_{x \in \mathbb{R}^n} f(x),$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is possibly nonconvex and satisfies the following condition:
Assumption 1.
$f$ is bounded from below by a constant $f^*$, is twice continuously differentiable, and has globally $L$-Lipschitz continuous gradient and $L_H$-Lipschitz continuous Hessian.
We present and analyze a stochastic adaptive cubic regularization algorithm for computing a point $x$ such that $\|\nabla f(x)\| \le \epsilon$, for some $\epsilon > 0$, when $f$ or its derivatives are not computable exactly. Specifically, we assume access to stochastic zeroth-, first- and second-order oracles, which are defined as follows.
Stochastic zeroth-order oracle (SZO($\epsilon_f$)). Given a point $x$, the oracle computes $f(x,\xi)$, where $\xi$ is a random variable, whose distribution may depend on $x$, and $e(x,\xi) = |f(x,\xi) - f(x)|$, that satisfies
(1) $\quad \mathbb{E}_\xi[e(x,\xi)] \le \epsilon_f \quad \text{and} \quad \mathbb{E}_\xi\!\left[\exp\big(\lambda\,(e(x,\xi) - \mathbb{E}_\xi[e(x,\xi)])\big)\right] \le \exp\!\Big(\tfrac{\lambda^2 \nu^2}{2}\Big) \ \text{ for all } |\lambda| \le \tfrac{1}{b},$
for any $x$.
We view $x$ as the input to the oracle, $f(x,\xi)$ as the output, and $\epsilon_f$, $\nu$ and $b$ as values intrinsic to the oracle. Thus $e(x,\xi)$ is a sub-exponential random variable with parameters $(\nu, b)$, whose mean is bounded by some constant $\epsilon_f$.
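In the expected risk minimization setting, an oracle of this kind can be implemented by subsampling. The sketch below is a hypothetical illustration (the toy loss, batch size, and distributions are my own choices, not the paper's): averaging i.i.d. per-sample losses yields an estimate whose error has bounded mean and, under standard assumptions on the per-sample losses, sub-exponential tails.

```python
import numpy as np

def szo(x, sample_loss, rng, batch_size=100):
    """Hypothetical subsampled zeroth-order oracle for expected risk
    minimization, f(x) = E_d[loss(x, d)].  The returned estimate averages
    `batch_size` i.i.d. samples, so the estimation error has mean of
    order 1/sqrt(batch_size) (playing the role of epsilon_f)."""
    draws = rng.standard_normal(batch_size)          # data samples d ~ D
    return np.mean([sample_loss(x, d) for d in draws])

# Toy risk: f(x) = E[(x - d)^2] with d ~ N(0,1), so f(1.0) = 2.0.
rng = np.random.default_rng(0)
est = szo(1.0, lambda x, d: (x - d) ** 2, rng, batch_size=10000)
```

With a batch of 10000 samples the estimate lands close to the true value 2.0, and shrinking the batch inflates the irreducible error, mirroring the role of the intrinsic constants above.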
Stochastic first-order oracle (SFO($\epsilon_g, \delta_g$)). Given a point $x$ and constants $\epsilon_g > 0$, $\delta_g \in (0,1)$, the oracle computes $g(x,\xi')$, such that
(2) $\quad \mathbb{P}_{\xi'}\big(\|g(x,\xi') - \nabla f(x)\| \le \epsilon_g\big) \ge 1 - \delta_g,$
where $\xi'$ is a random variable whose distribution may depend on $x$, $\epsilon_g$, and $\delta_g$. We view $x$, $\epsilon_g$ and $\delta_g$ as inputs to the oracle, while the distribution of $\xi'$ is intrinsic to the oracle.
Stochastic second-order oracle (SSO($\epsilon_H, \delta_H$)). Given a point $x$ and constants $\epsilon_H > 0$, $\delta_H \in (0,1)$, the oracle computes $H(x,\xi'')$, such that
(3) $\quad \mathbb{P}_{\xi''}\big(\|H(x,\xi'') - \nabla^2 f(x)\| \le \epsilon_H\big) \ge 1 - \delta_H,$
where $\xi''$ is a random variable whose distribution may depend on $x$, $\epsilon_H$, and $\delta_H$. The norm on the matrix difference is the operator norm. $x$, $\epsilon_H$ and $\delta_H$ are inputs to the oracle, while the distribution of $\xi''$ is intrinsic to the oracle.
Wherever possible, we will omit the dependence on the random variables and write $f(x)$, $g(x)$ and $H(x)$ instead of $f(x,\xi)$, $g(x,\xi')$ and $H(x,\xi'')$, and we will use $g_k$ and $H_k$ to denote the outputs of the stochastic oracles at iteration $k$ for brevity.
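A first-order oracle of this kind can again be realized by mini-batching. The sketch below is illustrative (the batch-size rule comes from a crude Chebyshev/Markov bound assuming per-coordinate noise variance at most one, which is an assumption of the sketch, not of the paper); a second-order oracle would be analogous with matrices and the operator norm.

```python
import numpy as np

def sfo(x, eps_g, delta_g, grad_sample, rng):
    """Hypothetical mini-batch first-order oracle.  Averaging n i.i.d.
    stochastic gradients divides the noise covariance by n; assuming
    per-coordinate variance <= 1, E||noise_mean||^2 = d/n, so by
    Markov's inequality n >= d/(delta_g * eps_g**2) gives
        P(||g - grad f(x)|| > eps_g) <= delta_g,
    matching the accuracy/reliability requirement in (2)."""
    d = np.size(x)
    n = int(np.ceil(d / (delta_g * eps_g ** 2)))
    return np.mean([grad_sample(x, rng) for _ in range(n)], axis=0)

# Toy problem: f(x) = 0.5*||x||^2, so grad f(x) = x; the stochastic
# gradient adds standard normal noise to each coordinate.
rng = np.random.default_rng(1)
x = np.array([1.0, -2.0])
g = sfo(x, eps_g=0.1, delta_g=0.1,
        grad_sample=lambda y, r: y + r.standard_normal(y.shape), rng=rng)
```

Note how the requested inputs (`eps_g`, `delta_g`) drive the sampling cost, while the noise distribution itself is intrinsic to the oracle.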
Related work. Several stochastic adaptive optimization methods have been studied in recent years under various stochastic oracle assumptions, similar to the ones we present above. Specifically, [CS18] bounds the expected iteration complexity for an adaptive step search (line search) method and an adaptive cubic regularization method, with a similar first-order oracle, but with an exact zeroth-order oracle. In [PS20], an expected iteration complexity result is derived for a variant of a step search method under a stochastic zeroth-order oracle. In [JSX21], the results of [PS20] are strengthened under a somewhat more restrictive zeroth-order oracle, which is similar to the SZO in this paper, and a high probability complexity bound is derived.
Similarly, in [BSV14], [GRVZ18], a trust region method is analyzed under essentially the SFO and SSO, but with an exact zeroth-order oracle. Later, in [CMS18, BCMS19], an expected complexity bound is derived for a trust region method with a stochastic zeroth-order oracle. In [CBS22], a high probability iteration complexity bound for first- and second-order stochastic trust region methods is derived under the same oracles we use. Recently, the same oracles were used within a stochastic quasi-Newton method in [MWX23], and a stochastic SQP-based method for nonlinear equality constrained problems in [BXZ23].
Adaptive regularization with cubics (ARC) methods are theoretically superior alternatives to line search and trust region methods, when applied to deterministic smooth functions, because of their optimal complexity of $\mathcal{O}(\epsilon^{-3/2})$ vs $\mathcal{O}(\epsilon^{-2})$ for finding $\epsilon$-stationary points [CGT11c, CGT11a]. There are many variants of adaptive cubic regularization methods under various assumptions and requirements on the function value, gradient, and Hessian estimates. Specifically, in [CGT11b, LLHT18, BGMT19, WZLL19, KL17, PJP20], the oracles are assumed to be deterministically bounded, with errors of adaptive magnitude. In [CS18, BG22, BGMT22, BGMT20], bounds on expected complexity are provided under an exact or deterministically bounded zeroth-order oracle and gradient and Hessian oracles similar to SFO and SSO. A cubically regularized method in a fully stochastic setting is analyzed in [TSJ+18]. The method is not adaptive, relying on the knowledge of the Lipschitz constants of $f$, and therefore not requiring a zeroth-order oracle at all. However, the results in that paper are derived simply by assuming that stochastic gradient and Hessian estimates are sufficiently accurate at each iteration. The final complexity bound applies only with the probability that this accuracy requirement holds on every iteration. No expected complexity bound can thus be derived.
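To put the gap between the two rates in perspective, here is a quick back-of-the-envelope comparison (the target accuracy is an arbitrary illustrative choice, and constants are ignored):

```python
# Worst-case iteration counts for reaching ||grad f(x)|| <= eps:
# cubic regularization needs O(eps**-1.5) iterations, while line search
# and trust region methods need O(eps**-2).  At eps = 1e-4:
eps = 1e-4
arc_iters = eps ** -1.5     # ~1e6 iterations for ARC-type methods
other_iters = eps ** -2.0   # ~1e8 iterations for eps**-2 methods
ratio = other_iters / arc_iters
```

At this accuracy the cubic rate is two orders of magnitude smaller, and the advantage grows as `eps` shrinks.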
Our contributions. In this work we provide the first high probability analysis of a stochastic ARC method (SARC) that allows 1. stochastic function estimates that can have arbitrarily large errors, and 2. stochastic gradient and Hessian approximations whose error is bounded by an adaptive quantity with sufficiently high probability, but otherwise can be arbitrarily large. To the best of our knowledge, our work is the first to derive an iteration complexity bound that matches the deterministic iteration complexity of $\mathcal{O}(\epsilon^{-3/2})$ in this setting with an overwhelmingly high probability. We show that our variant of stochastic ARC, while more general than those in prior literature, still maintains its optimal iteration complexity.
The analysis presented here extends the stochastic settings and high probability results in [JSX21] and [CBS22] to the framework in [CS18]. However, this extension is far from trivial, as it requires careful modification of most of the elements of the existing analysis. We point out these modifications in the appropriate places in the paper.
The oracles used in this paper are essentially the same as in [JSX21] and [CBS22]. However, our assumption on the oracles is somewhat stronger than in these two previous works. In particular, we assume that SFO and SSO are implementable for arbitrarily small values of $\epsilon_g$ and $\epsilon_H$, respectively. In contrast, the analysis in [JSX21] and [CBS22] allows for the case when these oracles cannot be implemented for arbitrarily small error bounds. We will discuss later in the paper that, even though SARC may impose small values of $\epsilon_g$ and $\epsilon_H$, this happens only with small probability.
We do not discuss the numerical performance of our method in this paper. Although deterministic implementations of ARC can be competitive with trust-region and line search methods when implemented with care, their efficiency in practice is highly dependent on the subproblem solver used. We expect similar behavior in the stochastic case and leave this study as a subject of future research.
2 Stochastic Adaptive Regularization Method with Cubics (SARC) with Probabilistic Second-Order Models
The Stochastic Adaptive Regularization with Cubics (SARC) method is presented below as Algorithm 1. At each iteration $k$, given the gradient estimate $g_k$, Hessian estimate $H_k$, and a regularization parameter $\sigma_k$, the following model is approximately minimized with respect to $s$ to obtain the trial step $s_k$:
(4) $\quad m_k(x_k + s) = f(x_k) + s^\top g_k + \tfrac{1}{2} s^\top H_k s + \tfrac{\sigma_k}{3}\|s\|^3.$
The constant term $f(x_k)$ is never computed and is used simply for presentation purposes. In the case of SARC, $g_k$ and $H_k$ are computed using SFO and SSO so as to satisfy certain accuracy requirements with sufficiently high probability, which will be specified in Section 3. We require the trial step $s_k$ to be an “approximate minimizer" of $m_k$ in the sense that it has to satisfy:
(5) $\quad s_k^\top g_k + s_k^\top H_k s_k + \sigma_k \|s_k\|^3 = 0 \quad \text{and} \quad s_k^\top H_k s_k + \sigma_k \|s_k\|^3 \ge 0,$
and
(6) $\quad \|\nabla m_k(x_k + s_k)\| \le \eta \|s_k\|^2,$
where $\eta > 0$ is a user-chosen constant. These conditions are typical in the literature, e.g., in [CGT11c], and can be satisfied, for example, using the algorithms in [CGT11b, CD19] to approximately minimize the model (4), as well as by any global minimizer of $m_k$.
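To make the subproblem concrete, the sketch below minimizes a cubic model by plain gradient descent, in the spirit of the solver analyzed in [CD19]. Everything here is an illustrative assumption rather than the paper's exact conditions: the termination test used is the common one requiring the model gradient norm to be at most a constant times the squared step norm, and the step size, starting point, and test data are arbitrary choices.

```python
import numpy as np

def cubic_model_grad(s, g, H, sigma):
    """Gradient of the cubic model m(s) = g's + s'Hs/2 + (sigma/3)||s||^3
    (the constant term drops out of the gradient)."""
    return g + H @ s + sigma * np.linalg.norm(s) * s

def solve_subproblem(g, H, sigma, eta=0.1, lr=0.01, max_iter=20000):
    """Minimal sketch of an approximate subproblem solver: gradient
    descent on the model, stopped once the illustrative termination test
    ||grad m(s)|| <= eta * ||s||**2 holds."""
    rng = np.random.default_rng(0)
    s = 1e-3 * rng.standard_normal(g.shape)   # small random start
    for _ in range(max_iter):
        gm = cubic_model_grad(s, g, H, sigma)
        if np.linalg.norm(gm) <= eta * np.linalg.norm(s) ** 2:
            break
        s -= lr * gm
    return s

g = np.array([1.0, 0.5])
H = np.array([[2.0, 0.0], [0.0, -1.0]])   # indefinite Hessian estimate
s = solve_subproblem(g, H, sigma=1.0)
```

Despite the indefinite Hessian estimate, the cubic term makes the model coercive, so the descent iterates stay bounded and reach a nonzero approximate minimizer satisfying the termination test.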
As in any variant of the ARC method, once $s_k$ is computed, the trial step is accepted (and $\sigma_k$ is decreased) if the estimated function value at $x_k + s_k$ is sufficiently smaller than that at $x_k$, when compared to the model value decrease. We call these iterations successful. Otherwise, the trial step is rejected (and $\sigma_k$ is increased). We call these iterations unsuccessful. In the case of SARC, however, function value estimates are obtained via SZO and the step acceptance criterion is modified by adding an “error correction" term of $2\epsilon_f$. This is because function value estimates have an irreducible error, so without this correction term, the algorithm may always reject improvement steps.
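A minimal sketch of this acceptance logic follows; the correction constant `2 * eps_f` mirrors the description above, while the threshold `theta` and all numbers are illustrative stand-ins, not the paper's quantities.

```python
def accept_step(fbar_x, fbar_xs, model_decrease, eps_f, theta=0.1):
    """Accept the trial step if the estimated decrease, corrected for a
    worst-case estimation error of eps_f at each of the two points, is
    at least theta times the model decrease."""
    rho = (fbar_x - fbar_xs + 2 * eps_f) / model_decrease
    return rho >= theta

# A genuinely good step (true decrease 0.02, equal to the model
# decrease) whose two estimates each err adversarially by eps_f = 0.01:
eps_f = 0.01
fbar_x = 10.0 - eps_f               # f(x) = 10.0, underestimated
fbar_xs = (10.0 - 0.02) + eps_f     # f(x+s) = 9.98, overestimated
ok_with_correction = accept_step(fbar_x, fbar_xs, 0.02, eps_f)
ok_without = accept_step(fbar_x, fbar_xs, 0.02, 0.0)
```

Without the correction the adversarial noise cancels the entire observed decrease and the good step is rejected; with it, the step is accepted.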
- Input:
-
Oracles SZO($\epsilon_f$), SFO($\epsilon_g, \delta_g$) and SSO($\epsilon_H, \delta_H$); initial iterate $x_0$, parameters $\gamma > 1$, $\sigma_{\min} > 0$, $\sigma_0 \ge \sigma_{\min}$, $\theta \in (0,1)$, $\eta > 0$, and $\epsilon_f \ge 0$.
- Repeat for $k = 0, 1, 2, \ldots$
- 1. Compute a model trial step $s_k$:
-
Compute a trial step $s_k$ satisfying (5) and (6) for the model (4), using the outputs $g_k$ of SFO and $H_k$ of SSO.
- 2. Check sufficient decrease:
-
Let $x_k^+ = x_k + s_k$. Compute function value estimations $f(x_k, \xi_k)$ and $f(x_k^+, \xi_k^+)$ using the SZO($\epsilon_f$), and set
(7) $\quad \rho_k = \dfrac{f(x_k, \xi_k) - f(x_k^+, \xi_k^+) + 2\epsilon_f}{m_k(x_k) - m_k(x_k^+)}.$
- 3. Update the iterate:
-
Set
(8) $\quad x_{k+1} = \begin{cases} x_k^+ & \text{if } \rho_k \ge \theta \text{ (successful iteration)},\\ x_k & \text{otherwise (unsuccessful iteration)}. \end{cases}$
- 4. Update the regularization parameter $\sigma_{k+1}$:
-
Set
(9) $\quad \sigma_{k+1} = \begin{cases} \max\{\sigma_{\min},\, \sigma_k/\gamma\} & \text{if } \rho_k \ge \theta,\\ \gamma \sigma_k & \text{otherwise}. \end{cases}$
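The whole loop can be sketched end to end. Everything below is an illustrative assumption rather than the paper's algorithm: exact derivatives plus Gaussian noise stand in for the first- and second-order oracles, uniformly perturbed function values stand in for the zeroth-order oracle, the inner solver is crude gradient descent on the model, and all constants are arbitrary. It does, however, follow the accept/reject and regularization-update pattern of steps 1 to 4.

```python
import numpy as np

def sarc_sketch(f, grad, hess, x0, eps=1e-2, eps_f=1e-3, noise=1e-3,
                sigma0=1.0, sigma_min=1e-3, gamma=2.0, theta=0.1,
                eta=0.1, max_iter=200, seed=0):
    """Illustrative SARC-style loop with synthetic noisy oracles."""
    rng = np.random.default_rng(seed)
    x, sigma = np.asarray(x0, dtype=float), sigma0
    for _ in range(max_iter):
        g = grad(x) + noise * rng.standard_normal(x.size)            # ~ SFO
        if np.linalg.norm(g) <= eps:
            break                               # estimated eps-stationarity
        H = hess(x) + noise * rng.standard_normal((x.size, x.size))  # ~ SSO
        s = -0.1 * g                            # crude inner solve: gradient
        for _ in range(500):                    # descent on the cubic model
            gm = g + H @ s + sigma * np.linalg.norm(s) * s
            if np.linalg.norm(gm) <= eta * np.linalg.norm(s) ** 2:
                break
            s -= 0.01 * gm
        mdec = -(g @ s + 0.5 * s @ (H @ s)
                 + sigma / 3 * np.linalg.norm(s) ** 3)   # model decrease
        fx = f(x) + eps_f * rng.uniform(-1, 1)                       # ~ SZO
        fxs = f(x + s) + eps_f * rng.uniform(-1, 1)
        if mdec > 0 and (fx - fxs + 2 * eps_f) / mdec >= theta:  # successful
            x, sigma = x + s, max(sigma / gamma, sigma_min)
        else:                                                  # unsuccessful
            sigma *= gamma
    return x

# Toy run on f(x) = 0.5*||x||^2:
x_end = sarc_sketch(lambda x: 0.5 * x @ x, lambda x: x,
                    lambda x: np.eye(x.size), [2.0, -1.0])
```

On this toy quadratic the iterates contract to the noise floor within a handful of successful iterations, while the regularization parameter drifts down toward its floor.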
3 Deterministic Properties of Algorithm 1
Algorithm 1 generates a stochastic process and we will analyze it in the next section. First, however, we state and prove several lemmas that establish the behavior of the algorithm for every realization.
A key concept that will be used in the analysis is the concept of a true iteration. Let and .
Definition 1 (True iteration).
We say that iteration is true if
(10) | ||||
(11) |
Remark 1.
We will show in Lemma 6 that by using SFO and SSO with the respective inputs, in (2) and in (3), each iteration of Algorithm 1 satisfies (10) with probability at least . However, we note that the probabilistic requirement of (10) can be implied by more relaxed inputs that use in (2), and in (3), instead of and . Since $s_k$ depends on the output of the oracles, implementing such a relaxed version is not trivial, and may require modification of Algorithm 1. We leave it as a subject of future research.
We will now prove a sequence of results that hold for each realization of Algorithm 1, and are essential for the complexity analysis. The two key results are Corollary 1 and Lemma 5, where Corollary 1 shows that until an -stationary point is reached, every true iteration with large enough is successful, and Lemma 5 establishes the lower bound on function improvement on true and successful iterations. Lemmas 1, 2, 3 and 4 lay the building blocks for them: On every successful iteration, the function improvement is lower bounded in terms of the norm of the step (Lemma 1). There is a threshold value of where any true iteration is either always successful or results in a very small step (Lemma 2). When an iteration is true and the step is not very small, the norm of the step is lower bounded in terms of (Lemma 3). Until an -stationary point is reached, the step cannot be too small on true iterations (Lemma 4).
Lemma 1 (Improvement on successful iterations).
Consider any realization of Algorithm 1. For each iteration , we have
(12) |
On every successful iteration , we have
(13) |
which implies
(14) |
Proof.
The proof is similar to the proof of Lemma 3.3 in [CGT11b]. Clearly, (13) follows from (12) and the sufficient decrease condition (7)-(8):
and (14) follows from the definition of and .
It remains to prove (12). Combining the first condition on step in (5), with the model expression for , we can write
The second condition on in (5) implies . Together with the above equation, we obtain (12).
∎
Lemma 2 (Large guarantees success or small step).
Proof.
Clearly, if , then is successful by definition. Let us consider the case when ; then if , is successful. We have from (7), that
Notice that:
The second-to-last inequality follows from the definition of and , and the last inequality follows from the iteration being true.
Taylor expansion and Cauchy-Schwarz inequalities give, for some ,
where the last inequality follows from the fact that the iteration is true and hence (10) holds: and from Assumption 1. So as long as , we have
∎
Note that for the above lemma to hold, does not need to include in the numerator. However, we will need another condition on later that will involve ; hence for simplicity of notation we introduced above to satisfy all necessary bounds.
Lemma 3 (Lower bound on step norm in terms of ).
Proof.
The triangle inequality, the equality and condition (6) on together give
(17) |
Recalling Taylor expansion of : and applying triangle inequality again, we have
where to get the second inequality, we also used (10) and Assumption 1. We can bound as follows:
Thus finally, we can bound all the terms on the right hand side of (17) in terms of and using the fact that we can write
which is equivalent to (16). ∎
Lemma 4 (Lower bound on step norm until -accuracy is reached).
Proof.
Corollary 1 (True iteration with large must be successful).
Lemma 5 (Minimum improvement achieved by true and successful iterations).
4 Stochastic Properties of Algorithm 1
Algorithm 1 generates a stochastic process. Let denote the collection of random variables , whose realizations are . Let denote the filtration generated by . At iteration , denotes the random iterate, is the random gradient approximation, is the random Hessian approximation, is the random model regularization parameter. is the step computed for the random model, are the random function estimates at the current point and the candidate point, respectively.
Conditioned on and , the random quantities and are determined by and respectively. The realization of depends on the realizations of and . The function estimates are determined by , conditioned on and . In summary, the stochastic process generated by the algorithm, with realization , is adapted to the filtration .
We further define and , with realizations and . Let , and let . The indicator random variables and are clearly measurable with respect to the filtration .
The next lemma shows that by construction of the algorithm, the stochastic model at iteration $k$ is “sufficiently accurate" with probability at least .
Lemma 6.
The indicator variable
satisfies the following submartingale-like condition
Proof.
Recall Definition 1 and that we denote the event of iteration being true by indicator random variable . It is crucial for our analysis that for all . We will later combine Lemma 6 with the properties of for a bound on to ensure .
The iteration complexity of our algorithm is defined as the following stopping time.
Definition 2 (Stopping time).
For $\epsilon > 0$, let $T_\epsilon = \min\{k : \|\nabla f(x_k + s_k)\| \le \epsilon\}$, the iteration complexity of the algorithm for reaching an $\epsilon$-stationary point. We will refer to $T_\epsilon$ as the stopping time of the algorithm.
It is important to note that even if for some iteration $k$, $\|\nabla f(x_k + s_k)\| \le \epsilon$, this iteration may not be successful and thus $\|\nabla f(x_{k+1})\|$ may be greater than $\epsilon$. This is a consequence of the complexity analysis of cubic regularization methods, which measures progress in terms of the gradient at the trial point and not at the current iterate, and thus is not specific to SARC. The stopping time is thus defined as the first time at which the algorithm computes a point at which the norm of the gradient of $f$ is at most $\epsilon$.
It is easy to see that $T_\epsilon$ is a stopping time of the stochastic process with respect to the filtration. Given a level of accuracy $\epsilon$, we aim to derive a bound on the iteration complexity with overwhelmingly high probability. In particular, we will show that the number of iterations until the stopping time is a sub-exponential random variable, whose value (with high probability) scales as $\epsilon^{-3/2}$, similarly to the deterministic case. Towards that end, we define a stochastic process to measure the progress towards optimality.
Definition 3 (Measure of Progress).
For each $k$, let $Z_k$ be a random variable measuring the progress of the algorithm at step $k$: $Z_k = f(x_k) - f^*$, where $f^*$ is a lower bound of $f$.
Armed with these definitions, we will be able to state properties of the stochastic process generated by Algorithm 1, which lead to the desired bounds on . These properties hold under certain conditions on the parameters used by Algorithm 1. We state these conditions here.
Assumption 2.
Define and and .
- (a) ,
- (b) are chosen sufficiently small so that ,
- (c) (27)
Remark 2.
Assumption 2 (c) gives a lower bound on the best accuracy the algorithm can achieve given the accuracy parameters related to the stochastic oracles. Specifically, $\epsilon$ is lower bounded by a quantity driven by $\epsilon_f$, the “irreducible" error of the zeroth-order oracle. We observe that, when the other oracle errors are small enough, the term involving the error of the zeroth-order oracle in the lower bound of $\epsilon$ for SARC is proportional to $\epsilon_f^{2/3}$, which is a better dependency than those of SASS in [JSX21] and the stochastic trust region algorithms in [CBS22], where $\epsilon$ is lower bounded by a multiple of $\sqrt{\epsilon_f}$.
The dependence of $\epsilon$ on $\epsilon_g$ and $\epsilon_H$ has a somewhat more complicated interpretation: these accuracies can be chosen arbitrarily by the algorithm, as long as the oracles can deliver the appropriate accuracy. Recall that in the algorithm, the accuracy input for SFO depends on $\epsilon_g$ and the accuracy input for SSO depends on $\epsilon_H$. If $\sigma_k$ is bounded from above by a constant, then essentially $\epsilon$ is proportional to the best accuracy required of SFO during the algorithm procedure, and it is proportional to the square of the best accuracy required of SSO during the algorithm procedure. This dependency is the same as in deterministic inexact algorithms, as well as in [JSX21] and the stochastic trust region algorithms in [CBS22]. We will comment on the existence of the upper bound on $\sigma_k$ after our main complexity result.
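A back-of-the-envelope check of the zeroth-order comparison (my own illustration, not a computation from the paper): if per-iteration function decrease scales as the 3/2 power of the stationarity level for cubic regularization, and as its square for step search or trust region, then progress is only detectable while that decrease exceeds the irreducible estimation error, which pins down the achievable accuracy:

```python
# Achievable stationarity level eps when per-iteration progress must
# dominate the irreducible function-estimation error eps_f:
#   cubic regularization: eps**1.5 >= eps_f  ->  eps ~ eps_f**(2/3)
#   step search:          eps**2   >= eps_f  ->  eps ~ eps_f**(1/2)
eps_f = 1e-6
eps_cubic = eps_f ** (2 / 3)    # 1e-4
eps_search = eps_f ** 0.5       # 1e-3
```

Here the cubic method can certify a stationarity level ten times tighter from equally noisy function values, and the gap widens as `eps_f` shrinks.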
The following theorem establishes key properties of the stochastic process generated by Algorithm 1 that are essential for the convergence analysis. Similar properties are used in [JSX21] to obtain a high probability iteration complexity bound for a stochastic step search method.
To remain consistent with the notation used in [JSX21], we define the random variable , with realization , and a constant .
Theorem 2.
Let Assumptions 1 and 2 hold. For and the following non-decreasing function :
the following hold for all :
- (i) for all . (Conditioning on the past, each iteration is true with probability at least .)
- (ii) If and then . (True iterations with sufficiently small are successful.)
- (iii) If then . (True, successful iterations make progress.)
- (iv) . (The lower bound of potential progress for an iteration with parameter .)
- (v) for all . (The “damage” at each iteration is bounded.)
Proof.
Part (i) follows from the assumptions on and the definition of the true iteration.
Part (ii) follows directly from Corollary 1.
Part (iii) follows from Lemma 5.
Part (iv) follows from the definitions of , , and inequality (27). Specifically, plugging in the definitions of and , one can show that the inequality is equivalent to , which holds by Assumption 2.
Part (v) has exactly the same proof as that of Proposition 1 part (v) in [JSX21] and is easily derived from the step acceptance condition of Algorithm 1.
∎
5 High Probability Iteration Complexity Result
In [JSX21], a high probability bound on the stopping time is derived for a stochastic process with the properties stated in Theorem 2. Thus, we can simply apply that result here. We first observe that
(28) |
with , where . This follows from (1) of the SZO by applying Proposition 2.7.1 of [Ver18]. Another minor modification of the result in [JSX21] is that it now applies to the event instead of the event , due to the different definitions of the stopping time.
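The sub-exponential behavior used here can be sanity-checked numerically. In the simulation below (exponentially distributed errors are an illustrative stand-in for the zeroth-order errors, not the paper's model), the average of the per-iteration errors concentrates around its mean, with moderate excesses rare and large excesses essentially never observed:

```python
import numpy as np

# Model each iteration's zeroth-order error e_k as Exp(mean=eps_f), a
# sub-exponential variable, and look at the tails of the average of T
# of them: they collapse exponentially fast in the deviation size.
rng = np.random.default_rng(0)
eps_f, T, reps = 0.01, 100, 20000
avgs = rng.exponential(scale=eps_f, size=(reps, T)).mean(axis=1)
tail_small = float(np.mean(avgs > 1.2 * eps_f))   # ~2 standard deviations
tail_big = float(np.mean(avgs > 2.0 * eps_f))     # ~10 standard deviations
```

This exponential tail decay is precisely what allows the accumulated function-estimation errors to be controlled over the whole run with overwhelming probability.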
Theorem 3.
Remark 3.
- 1.
- 2. The SARC algorithm encounters an $\epsilon$-stationary point in a finite number of iterations with probability 1. This is a direct consequence of the Borel–Cantelli lemma.
- 3. Since the probabilities of the failure events are exponentially decaying for all , this implies a complexity bound of $\mathcal{O}(\epsilon^{-3/2})$ in expectation for SARC.
5.1 Upper Bound on $\sigma_k$
While the penalty parameters $\sigma_k$ form a stochastic process, this process has some nice properties. Specifically, it is upper bounded by a one-sided geometric random walk. This random walk is analyzed in [JSX23], where it is shown that for any given number of iterations $T$, the maximum of $\sigma_k$ over the first $T$ iterations is bounded, with high probability, by a quantity chosen appropriately dependent on $T$. A consequence of this fact is that with high probability, for any realization of Algorithm 1, there exists a lower bound on all of the accuracy requirements $\epsilon_g$ and $\epsilon_H$, which are inputs to the oracles SFO and SSO as in (2) and (3). This, in turn, can give rise to a total “sample" complexity bound for Algorithm 1. For examples of such analyses, we refer the reader to [JSX23].
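The bounding process can be simulated directly. In the sketch below (all parameter values are illustrative), the walk is divided by `gamma` with probability `p > 1/2`, reflected at `sigma_min`, and multiplied by `gamma` otherwise; the downward drift keeps the running maximum within a modest number of `gamma`-levels above the floor across many independent runs, consistent with the high probability bound cited above.

```python
import numpy as np

# One-sided geometric random walk dominating sigma_k: with probability
# p > 1/2 the parameter is divided by gamma (reflected at sigma_min),
# otherwise it is multiplied by gamma.
rng = np.random.default_rng(0)
gamma, sigma_min, p, T, reps = 2.0, 1.0, 0.9, 1000, 200
peaks = []
for _ in range(reps):
    sigma = sigma_min
    peak = sigma
    for _ in range(T):
        if rng.random() < p:
            sigma = max(sigma / gamma, sigma_min)
        else:
            sigma *= gamma
        peak = max(peak, sigma)
    peaks.append(peak)
worst_peak = max(peaks)   # far below the trivial bound gamma**T * sigma_min
```

Because excursions of height m occur with probability roughly ((1-p)/p)**m, the worst peak over all runs stays tiny relative to the worst case, which is what makes the accuracy requests to the oracles uniformly implementable with high probability.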
References
- [BCMS19] Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS Journal on Optimization, 1(2):92–119, 2019.
- [BG22] Stefania Bellavia and Gianmarco Gurioli. Stochastic analysis of an adaptive cubic regularization method under inexact gradient evaluations and dynamic Hessian accuracy. Optimization, 71(1):227–261, 2022.
- [BGMT19] Stefania Bellavia, Gianmarco Gurioli, Benedetta Morini, and Philippe L Toint. Adaptive regularization algorithms with inexact evaluations for nonconvex optimization. SIAM Journal on Optimization, 29(4):2881–2915, 2019.
- [BGMT20] Stefania Bellavia, Gianmarco Gurioli, Benedetta Morini, and Philippe L Toint. A stochastic cubic regularisation method with inexact function evaluations and random derivatives for finite sum minimisation. In Thirty-seventh International Conference on Machine Learning: ICML2020, 2020.
- [BGMT22] Stefania Bellavia, Gianmarco Gurioli, Benedetta Morini, and Philippe L Toint. Adaptive regularization for nonconvex optimization using inexact function values and randomly perturbed derivatives. Journal of Complexity, 68:101591, 2022.
- [BSV14] Afonso S Bandeira, Katya Scheinberg, and Luis Nunes Vicente. Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization, 24(3):1238–1264, 2014.
- [BXZ23] Albert S. Berahas, Miaolan Xie, and Baoyu Zhou. A sequential quadratic programming method with high probability complexity bounds for nonlinear equality constrained stochastic optimization. arXiv preprint arXiv:2301.00477, 2023.
- [CBS22] Liyuan Cao, Albert S Berahas, and Katya Scheinberg. First-and second-order high probability complexity bounds for trust-region methods with noisy oracles. arXiv preprint arXiv:2205.03667, 2022.
- [CD19] Yair Carmon and John Duchi. Gradient descent finds the cubic-regularized nonconvex Newton step. SIAM Journal on Optimization, 29(3):2146–2178, 2019.
- [CGT11a] C. Cartis, N. Gould, and Philippe L. Toint. Optimal Newton-type methods for nonconvex smooth optimization problems. Technical Report Optimization Online, 2011.
- [CGT11b] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Math. Program., 127(2):245–295, 2011.
- [CGT11c] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity. Math. Program., 130(2):295–319, 2011.
- [CMS18] Ruobing Chen, Matt Menickelly, and Katya Scheinberg. Stochastic optimization using a trust-region method and random models. Mathematical Programming, 169(2):447–487, 2018.
- [CS18] Coralia Cartis and Katya Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, 169(2):337–375, 2018.
- [GRVZ18] Serge Gratton, Clément W Royer, Luís N Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 38(3):1579–1597, 2018.
- [JSX21] Billy Jin, Katya Scheinberg, and Miaolan Xie. High probability complexity bounds for adaptive step search based on stochastic oracles. arXiv preprint arXiv:2106.06454, 2021.
- [JSX23] Billy Jin, Katya Scheinberg, and Miaolan Xie. Sample complexity analysis for adaptive optimization algorithms with stochastic oracles. arXiv preprint arXiv:2303.06838, 2023.
- [KL17] Jonas Moritz Kohler and Aurélien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research. PMLR, 2017.
- [LLHT18] Liu Liu, Xuanqing Liu, Cho-Jui Hsieh, and Dacheng Tao. Stochastic second-order methods for non-convex optimization with inexact Hessian and gradient. arXiv preprint arXiv:1809.09853, 2018.
- [MWX23] Matt Menickelly, Stefan M. Wild, and Miaolan Xie. A stochastic quasi-newton method in the absence of common random numbers. arXiv preprint arXiv:2302.09128, 2023.
- [PJP20] Seonho Park, Seung Hyun Jung, and Panos M Pardalos. Combining stochastic adaptive cubic regularization with negative curvature for nonconvex optimization. Journal of Optimization Theory and Applications, 184(3):953–971, 2020.
- [PS20] Courtney Paquette and Katya Scheinberg. A stochastic line search method with expected complexity analysis. SIAM Journal on Optimization, 30(1):349–376, 2020.
- [TSJ+18] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899–2908, 2018.
- [Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
- [WZLL19] Zhe Wang, Yi Zhou, Yingbin Liang, and Guanghui Lan. A note on inexact gradient and Hessian conditions for cubic regularized Newton's method. Operations Research Letters, 47(2):146–149, 2019.