
Stochastic Adaptive Regularization Method with Cubics:
A High Probability Complexity Bound

Katya Scheinberg   Miaolan Xie
School of Operations Research and Information Engineering
Cornell University
Abstract

We present a high probability complexity bound for a stochastic adaptive regularization method with cubics, also known as the regularized Newton method. The method makes use of stochastic zeroth-, first- and second-order oracles that satisfy certain accuracy and reliability assumptions. Such oracles have been used in the literature by other stochastic adaptive methods, such as trust region and line search. These oracles capture many settings, such as expected risk minimization, simulation optimization, and others. In this paper, we give the first high probability iteration complexity bound for stochastic cubic regularization, and show that, just as in the deterministic case, it is superior to that of other stochastic adaptive methods.

1 Introduction

We are interested in unconstrained optimization problems of the form

\min_{x\in\mathbb{R}^{n}}\phi(x),

where $\phi$ is possibly nonconvex and satisfies the following condition:

Assumption 1.

$\phi$ is bounded from below by a constant $\phi^{*}$, is twice continuously differentiable, and has globally $L$-Lipschitz continuous gradient and $L_{H}$-Lipschitz continuous Hessian.

We present and analyze a stochastic adaptive cubic regularization algorithm for computing a point $x$ such that $\left\lVert\nabla\phi(x)\right\rVert\leq\epsilon$, for some $\epsilon>0$, when $\phi(x)$ or its derivatives are not computable exactly. Specifically, we assume access to stochastic zeroth-, first- and second-order oracles, which are defined as follows.
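For instance, in the expected risk minimization setting mentioned in the abstract, a typical (purely illustrative) instance of this problem is

\phi(x)=\mathbb{E}_{d\sim\mathcal{D}}\left[\ell(x;d)\right],

where $\ell$ is a smooth loss function and $d$ is a data sample drawn from a distribution $\mathcal{D}$; mini-batch averages of $\ell$, $\nabla\ell$ and $\nabla^{2}\ell$ are then natural candidates for the stochastic oracles defined next.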

Stochastic zeroth-order oracle ($\mathsf{SZO}(\epsilon_{f},\lambda,a)$). Given a point $x$, the oracle computes $f(x,\Xi(x))$, where $\Xi(x)$ is a random variable, whose distribution may depend on $x$, $\epsilon_{f}$, $\lambda$ and $a$, that satisfies

\mathbb{E}_{\Xi(x)}\left[\,|\phi(x)-f(x,\Xi(x))|\,\right]\leq\epsilon_{f}\quad\text{and}\quad\mathbb{P}_{\Xi(x)}\left(|\phi(x)-f(x,\Xi(x))|<t\right)\geq 1-e^{\lambda(a-t)}, (1)

for any $t>0$.

We view $x$ as the input to the oracle, $f(x,\Xi(x))$ as the output, and $(\epsilon_{f},\lambda,a)$ as values intrinsic to the oracle. Thus $|f(x,\Xi(x))-\phi(x)|$ is a sub-exponential random variable with parameters $(\lambda,a)$, whose mean is bounded by some constant $\epsilon_{f}>0$.
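In the expected risk minimization setting, for example, such an oracle is commonly realized by averaging a mini-batch of per-sample losses. The following minimal sketch is only an illustration under that assumption; the helper sample_loss, the batch size, and the noise model are hypothetical, and the resulting parameters $(\epsilon_{f},\lambda,a)$ are determined by the sampling distribution and batch size rather than by the code itself.

import numpy as np

def szo(x, sample_loss, batch_size, rng):
    # Hypothetical stochastic zeroth-order oracle: returns a mini-batch
    # average f(x, Xi(x)) approximating phi(x).  The accuracy parameters
    # (epsilon_f, lambda, a) are properties of the sampling noise and the
    # batch size, not inputs to this routine.
    samples = [sample_loss(x, rng) for _ in range(batch_size)]
    return float(np.mean(samples))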

Stochastic first-order oracle ($\mathsf{SFO}(\kappa_{g})$). Given a point $x$ and constants $\mu_{1}>0$, $\delta_{1}\in[0,\frac{1}{2})$, the oracle computes $g(x,\Xi^{1}(x))$, such that

\mathbb{P}_{\Xi^{1}(x)}\left(\|\nabla\phi(x)-g(x,\Xi^{1}(x))\|\leq\kappa_{g}\mu_{1}\right)\geq 1-\delta_{1}, (2)

where $\Xi^{1}(x)$ is a random variable whose distribution may depend on $x$, $\mu_{1}$, $\delta_{1}$ and $\kappa_{g}$. We view $x$, $\mu_{1}$ and $\delta_{1}$ as inputs to the oracle, while $\kappa_{g}$ is intrinsic to the oracle.

Stochastic second-order oracle ($\mathsf{SSO}(\kappa_{H})$). Given a point $x$ and constants $\mu_{2}>0$, $\delta_{2}\in[0,\frac{1}{2})$, the oracle computes $H(x,\Xi^{2}(x))$, such that

\mathbb{P}_{\Xi^{2}(x)}\left(\|\nabla^{2}\phi(x)-H(x,\Xi^{2}(x))\|\leq\kappa_{H}\mu_{2}\right)\geq 1-\delta_{2}, (3)

where $\Xi^{2}(x)$ is a random variable whose distribution may depend on $x$, $\mu_{2}$, $\delta_{2}$ and $\kappa_{H}$. The norm on the matrix is the operator norm. Here $x$, $\mu_{2}$ and $\delta_{2}$ are inputs to the oracle, while $\kappa_{H}$ is intrinsic to the oracle.

Wherever possible, we will omit the dependence on $x$ and write $\Xi$, $\Xi^{1}$ and $\Xi^{2}$ instead of $\Xi(x)$, $\Xi^{1}(x)$ and $\Xi^{2}(x)$, and for brevity we will use $f(x)$, $g(x)$ and $H(x)$ to denote the outputs of the stochastic oracles.
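Continuing the mini-batch illustration above, the following sketch shows one hypothetical way such oracles could be realized; the helpers sample_grad and sample_hess and the batch-size rules are assumptions of this sketch (standard concentration arguments, not reproduced here, would be needed to verify that (2) and (3) actually hold for a given noise model).

import numpy as np

def sfo(x, sample_grad, mu1, delta1, rng, c_g=1.0):
    # Hypothetical SFO: averages per-sample gradients.  The batch size of
    # order c_g / (mu1^2 * delta1) is an illustrative choice, not a
    # prescription from the paper.
    n = int(np.ceil(c_g / (mu1 ** 2 * delta1)))
    return np.mean([sample_grad(x, rng) for _ in range(n)], axis=0)

def sso(x, sample_hess, mu2, delta2, rng, c_h=1.0):
    # Hypothetical SSO: averages per-sample Hessians, aiming for an
    # operator-norm error of at most kappa_H * mu2 with probability 1 - delta2.
    n = int(np.ceil(c_h / (mu2 ** 2 * delta2)))
    return np.mean([sample_hess(x, rng) for _ in range(n)], axis=0)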

Related work. Several stochastic adaptive optimization methods have been studied in recent years under various stochastic oracle assumptions, similar to the ones we present above. Specifically, [CS18] bounds the expected iteration complexity for an adaptive step search (line search) method and an adaptive cubic regularization method, with a similar first-order oracle, but with an exact zeroth-order oracle. In [PS20], an expected iteration complexity result is derived for a variant of a step search method under a stochastic zeroth-order oracle. In [JSX21], the results of [PS20] are strengthened under a somewhat more restrictive zeroth-order oracle, which is similar to the $\mathsf{SZO}$ in this paper, and a high probability complexity bound is derived.

Similarly, in [BSV14, GRVZ18], a trust region method is analyzed under essentially $\mathsf{SFO}$ and $\mathsf{SSO}$, but with an exact zeroth-order oracle. Later, in [CMS18, BCMS19], an expected complexity bound is derived for a trust region method with a stochastic zeroth-order oracle. In [CBS22], a high probability iteration complexity bound for first- and second-order stochastic trust region methods is derived under the same oracles we use. Recently, the same oracles were used within a stochastic quasi-Newton method in [MWX23], and a stochastic SQP-based method for nonlinear equality constrained problems in [BXZ23].

Adaptive regularization with cubics (ARC) methods are theoretically superior alternatives to line search and trust region methods when applied to deterministic smooth functions, because of their optimal complexity of $O(\epsilon^{-3/2})$ vs. $O(\epsilon^{-2})$ for finding $\epsilon$-stationary points [CGT11c, CGT11a]. There are many variants of adaptive cubic regularization methods under various assumptions and requirements on the function value, gradient, and Hessian estimates. Specifically, in [CGT11b, LLHT18, BGMT19, WZLL19, KL17, PJP20], the oracles are assumed to be deterministically bounded, with adaptive magnitude of errors. In [CS18, BG22, BGMT22, BGMT20], bounds on expected complexity are provided under an exact or deterministically bounded zeroth-order oracle and gradient and Hessian oracles similar to $\mathsf{SFO}$ and $\mathsf{SSO}$. A cubically regularized method in a fully stochastic setting is analyzed in [TSJ+18]. That method is not adaptive, relying instead on knowledge of the Lipschitz constants of $\nabla\phi(x)$, and therefore does not require a zeroth-order oracle at all. However, the results in that paper are derived under the assumption that the stochastic gradient and Hessian estimates are sufficiently accurate at every iteration, so the final complexity bound holds only with the probability that this event occurs; no expected complexity bound can thus be derived.

Our contributions. In this work we provide the first high probability analysis of a stochastic ARC method (SARC) that allows (1) stochastic function estimates that can have arbitrarily large errors, and (2) stochastic gradient and Hessian approximations whose error is bounded by an adaptive quantity with sufficiently high probability, but otherwise can be arbitrarily large. To the best of our knowledge, our work is the first to derive an iteration complexity bound that matches the deterministic iteration complexity of $O(\epsilon^{-3/2})$ in this setting with an overwhelmingly high probability. We show that our variant of stochastic ARC, while more general than those in prior literature, still maintains its optimal iteration complexity.

The analysis presented here extends the stochastic settings and high probability results in [JSX21] and [CBS22] to the framework in [CS18]. However, this extension is far from trivial, as it requires careful modification of most of the elements of the existing analysis. We point out these modifications in the appropriate places in the paper.

The oracles used in this paper are essentially the same as in [JSX21] and [CBS22]. However, our assumption on the oracles is somewhat stronger than in these two previous works. In particular, we assume that $\mathsf{SFO}$ and $\mathsf{SSO}$ are implementable for arbitrarily small values of $\mu_{1}$ and $\mu_{2}$, respectively. In contrast, the analysis in [JSX21] and [CBS22] allows for the case when these oracles cannot be implemented for arbitrarily small error bounds. We will discuss later in the paper that, even though SARC may impose small values of $\mu_{1}$ and $\mu_{2}$, this happens only with small probability.

We do not discuss the numerical performance of our method in this paper. Although deterministic implementations of ARC can be competitive with trust-region and line search methods when implemented with care, their efficiency in practice is highly dependent on the subproblem solver used. We expect similar behavior in the stochastic case and leave this study as a subject of future research.

2 Stochastic Adaptive Regularization Method with Cubics (SARC) with Probabilistic Second-Order Models

The Stochastic Adaptive Regularization with Cubics (SARC) method is presented below as Algorithm 1. At each iteration $k$, given a gradient estimate $g_{k}$, a Hessian estimate $H_{k}$, and a regularization parameter $\sigma_{k}>0$, the following model is approximately minimized with respect to $s$ to obtain the trial step $s_{k}$:

m_{k}(x_{k}+s)=\phi(x_{k})+s^{T}g_{k}+\frac{1}{2}s^{T}H_{k}s+\frac{\sigma_{k}}{3}\|s\|^{3}. (4)

The constant term $\phi(x_{k})$ is never computed and is used simply for presentation purposes. In the case of SARC, $g_{k}$ and $H_{k}$ are computed using $\mathsf{SFO}$ and $\mathsf{SSO}$ so as to satisfy certain accuracy requirements with sufficiently high probability, which will be specified in Section 3. We require the trial step $s_{k}$ to be an “approximate minimizer” of $m_{k}(x_{k}+s)$ in the sense that it has to satisfy

(s_{k})^{T}g_{k}+(s_{k})^{T}H_{k}s_{k}+\sigma_{k}\|s_{k}\|^{3}=0\quad\text{and}\quad(s_{k})^{T}H_{k}s_{k}+\sigma_{k}\|s_{k}\|^{3}\geq 0 (5)

and

\|\nabla m_{k}(x_{k}+s_{k})\|\leq\eta\min\left\{1,\|s_{k}\|\right\}\|g_{k}\|, (6)

where $\eta\in(0,1)$ is a user-chosen constant. These conditions are typical in the literature, e.g., [CGT11c], and can be satisfied, for example, by using the algorithms in [CGT11b, CD19] to approximately minimize the model (4), as well as by any global minimizer of $m_{k}(x_{k}+s)$.
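As an illustration of how a trial step satisfying (5) and (6) might be computed in practice, the sketch below runs plain gradient descent on the model $m_{k}$ (in the spirit of [CD19], although the step size, iteration cap, and tolerances here are assumptions of this sketch rather than the paper's prescription) and then checks (5) and (6) numerically.

import numpy as np

def model_grad(s, g, H, sigma):
    # Gradient of the cubic model: g + H s + sigma * ||s|| * s.
    return g + H @ s + sigma * np.linalg.norm(s) * s

def solve_cubic_subproblem(g, H, sigma, eta, lr=1e-2, max_iter=5000):
    # Hypothetical subproblem solver: gradient descent on m_k, stopped once
    # the termination test (6) holds.  Condition (5) is not enforced here
    # explicitly; it holds (approximately) at stationary points of m_k and
    # should be verified with check_conditions below.
    s = np.zeros_like(g)
    for _ in range(max_iter):
        grad = model_grad(s, g, H, sigma)
        if np.linalg.norm(grad) <= eta * min(1.0, np.linalg.norm(s)) * np.linalg.norm(g):
            break
        s = s - lr * grad
    return s

def check_conditions(s, g, H, sigma, eta, tol=1e-8):
    # Numerical check of the step conditions (5) and (6).
    sHs = s @ H @ s
    cubic = sigma * np.linalg.norm(s) ** 3
    cond5 = abs(s @ g + sHs + cubic) <= tol and sHs + cubic >= -tol
    cond6 = (np.linalg.norm(model_grad(s, g, H, sigma))
             <= eta * min(1.0, np.linalg.norm(s)) * np.linalg.norm(g) + tol)
    return cond5, cond6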

As in any variant of the ARC method, once $s_{k}$ is computed, the trial step is accepted (and $\sigma_{k}$ is decreased) if the estimated function value of $x_{k}^{+}=x_{k}+s_{k}$ is sufficiently smaller than that of $x_{k}$, when compared to the model value decrease. We call these iterations successful. Otherwise, the trial step is rejected (and $\sigma_{k}$ is increased). We call these iterations unsuccessful. In the case of SARC, however, function value estimates are obtained via $\mathsf{SZO}$ and the step acceptance criterion is modified by adding an “error correction” term of $2\epsilon_{f}^{\prime}$. This is because function value estimates have an irreducible error, so without this correction term, the algorithm may always reject improvement steps.

Input:

Oracles $\mathsf{SZO}(\epsilon_{f},\lambda,a)$, $\mathsf{SFO}(\kappa_{g})$ and $\mathsf{SSO}(\kappa_{H})$; initial iterate $x_{0}$; parameters $\gamma\in(0,1)$, $\theta\in(0,1)$, $\delta_{1},\delta_{2}\in[0,\frac{1}{2})$, $\sigma_{\min}>0$, $\eta\in(0,1)$, $\mu\geq 0$, $\epsilon_{f}^{\prime}>0$ and $\sigma_{0}\geq\sigma_{\min}$.

Repeat for $k=0,1,\ldots$
 1. Compute a model trial step $s_{k}$:

Generate $g_{k}=g(x_{k},\xi_{k}^{1})$ and $H_{k}=H(x_{k},\xi_{k}^{2})$ using $\mathsf{SFO}(\kappa_{g})$ and $\mathsf{SSO}(\kappa_{H})$ with $(\frac{\mu}{\sigma_{k}},\delta_{1})$ and $(\sqrt{\frac{\mu}{\sigma_{k}}},\delta_{2})$ as inputs, respectively. Compute a trial step $s_{k}$ that satisfies (5) and (6) with parameters $\eta$ and $\sigma_{k}$ at $x_{k}$.

 2. Check sufficient decrease:

Let $x_{k}^{+}=x_{k}+s_{k}$. Compute function value estimates $f(x_{k})=f(x_{k},\xi_{k})$ and $f(x_{k}^{+})=f(x_{k}^{+},\xi_{k}^{+})$ using $\mathsf{SZO}(\epsilon_{f},\lambda,a)$, and set

\rho_{k}=\frac{f(x_{k})-f(x_{k}^{+})+2\epsilon_{f}^{\prime}}{m_{k}(x_{k})-m_{k}(x_{k}^{+})}. (7)
 3. Update the iterate:

Set

x_{k+1}=\begin{cases}x_{k}^{+}&{\rm if}\ \rho_{k}\geq\theta\quad\text{[successful iteration]}\\ x_{k}&{\rm otherwise}\quad\text{[unsuccessful iteration]}\end{cases} (8)
 4. Update the regularization parameter $\sigma_{k}$:

Set

\sigma_{k+1}=\begin{cases}\max\left\{\gamma\sigma_{k},\sigma_{\min}\right\}&{\rm if}\ \rho_{k}\geq\theta\\ \frac{1}{\gamma}\sigma_{k}&{\rm otherwise.}\end{cases} (9)
Algorithm 1 Stochastic Adaptive Regularization with Cubics (SARC)
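The sketch below is a minimal, illustrative rendering of Algorithm 1 under the assumption that oracle callables and a subproblem solver in the spirit of the earlier sketches are available (their simplified signatures and all default constants here are choices of this sketch, not of the paper).

import numpy as np

def sarc(x0, szo, sfo, sso, solve_subproblem, eps_f_prime,
         sigma0=1.0, sigma_min=1e-3, gamma=0.5, theta=0.1, eta=0.5,
         mu=1e-2, delta1=0.1, delta2=0.1, max_iter=100):
    # Minimal sketch of Algorithm 1 (SARC); not a tuned implementation.
    x, sigma = np.asarray(x0, dtype=float), sigma0
    for k in range(max_iter):
        g = sfo(x, mu / sigma, delta1)              # SFO with input (mu / sigma_k, delta_1)
        H = sso(x, np.sqrt(mu / sigma), delta2)     # SSO with input (sqrt(mu / sigma_k), delta_2)
        s = solve_subproblem(g, H, sigma, eta)      # trial step satisfying (5) and (6)
        x_plus = x + s
        model_decrease = -(s @ g + 0.5 * s @ H @ s + sigma / 3 * np.linalg.norm(s) ** 3)
        rho = (szo(x) - szo(x_plus) + 2 * eps_f_prime) / model_decrease      # (7)
        if rho >= theta:                            # successful iteration: (8) and (9)
            x, sigma = x_plus, max(gamma * sigma, sigma_min)
        else:                                       # unsuccessful iteration
            sigma = sigma / gamma
    return x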

3 Deterministic Properties of Algorithm 1

Algorithm 1 generates a stochastic process and we will analyze it in the next section. First, however, we state and prove several lemmas that establish the behavior of the algorithm for every realization.

A key concept that will be used in the analysis is the concept of a true iteration. Let $e_{k}=|f(x_{k})-\phi(x_{k})|$ and $e_{k}^{+}=|f(x_{k}^{+})-\phi(x_{k}^{+})|$.

Definition 1 (True iteration).

We say that iteration $k$ is true if

\|\nabla\phi(x_{k})-g_{k}\|\leq\kappa_{g}\max\left\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\right\},\qquad\|(\nabla^{2}\phi(x_{k})-H_{k})s_{k}\|\leq\kappa_{H}\max\left\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\right\}, (10)
\text{and}\qquad e_{k}+e_{k}^{+}\leq 2\epsilon_{f}^{\prime}. (11)
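A true iteration is defined through quantities ($\nabla\phi$, $\nabla^{2}\phi$, $e_{k}$) that the algorithm never sees, so it cannot be detected online; on synthetic test problems where $\phi$ is available, however, it can be checked after the fact. The following diagnostic sketch simply restates Definition 1 (its inputs and tolerances are assumptions made for illustration).

import numpy as np

def is_true_iteration(grad_phi, hess_phi, g_k, H_k, s_k, sigma_k, mu,
                      e_k, e_k_plus, kappa_g, kappa_H, eps_f_prime):
    # Diagnostic check of Definition 1; only possible when the exact gradient
    # and Hessian of phi are available (e.g., on test problems).
    bound = max(mu / sigma_k, np.linalg.norm(s_k) ** 2)
    cond_g = np.linalg.norm(grad_phi - g_k) <= kappa_g * bound           # gradient part of (10)
    cond_H = np.linalg.norm((hess_phi - H_k) @ s_k) <= kappa_H * bound   # Hessian part of (10)
    cond_f = e_k + e_k_plus <= 2 * eps_f_prime                           # (11)
    return cond_g and cond_H and cond_f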
Remark 1.

We will show in Lemma 6 that by using $\mathsf{SFO}$ and $\mathsf{SSO}$ with the respective inputs $\mu_{1}=\frac{\mu}{\sigma_{k}}$ in (2) and $\mu_{2}=\sqrt{\frac{\mu}{\sigma_{k}}}$ in (3), each iteration of Algorithm 1 satisfies (10) with probability at least $1-\delta_{1}-\delta_{2}$. However, we note that the probabilistic requirement (10) can also be implied by the more relaxed inputs $\mu_{1}=\max\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\}$ in (2) and $\mu_{2}=\max\{\frac{\mu}{\sigma_{k}\|s_{k}\|},\|s_{k}\|\}$ in (3), instead of $\mu_{1}=\frac{\mu}{\sigma_{k}}$ and $\mu_{2}=\sqrt{\frac{\mu}{\sigma_{k}}}$. Since $s_{k}$ depends on the output of the oracles, implementing such a relaxed version is not trivial and may require modification of Algorithm 1. We leave it as a subject of future research.

We will now prove a sequence of results that hold for each realization of Algorithm 1 and are essential for the complexity analysis. The two key results are Corollary 1 and Lemma 5: Corollary 1 shows that until an $\epsilon$-stationary point is reached, every true iteration with large enough $\sigma_{k}$ is successful, and Lemma 5 establishes a lower bound on the function improvement on true and successful iterations. Lemmas 1, 2, 3 and 4 lay the building blocks for them: on every successful iteration, the function improvement is lower bounded in terms of the norm of the step (Lemma 1); there is a threshold value of $\sigma_{k}$ above which any true iteration is either successful or results in a very small step (Lemma 2); when an iteration is true and the step is not very small, the norm of the step is lower bounded in terms of $\|\nabla\phi(x_{k}^{+})\|$ (Lemma 3); and until an $\epsilon$-stationary point is reached, the step cannot be too small on true iterations (Lemma 4).

Lemma 1 (Improvement on successful iterations).

Consider any realization of Algorithm 1. For each iteration $k$, we have

m_{k}(x_{k})-m_{k}(x_{k}^{+})\geq\frac{1}{6}\sigma_{k}\|s_{k}\|^{3}. (12)

On every successful iteration $k$, we have

f(x_{k})-f(x_{k+1})\geq\frac{\theta}{6}\sigma_{k}\|s_{k}\|^{3}-2\epsilon_{f}^{\prime}, (13)

which implies

\phi(x_{k})-\phi(x_{k+1})\geq\frac{\theta}{6}\sigma_{k}\|s_{k}\|^{3}-e_{k}-e_{k}^{+}-2\epsilon_{f}^{\prime}. (14)
Proof.

The proof is similar to the proof of Lemma 3.3 in [CGT11b]. Clearly, (13) follows from (12) and the sufficient decrease condition (7)-(8):

\frac{f(x_{k})-f(x_{k}^{+})+2\epsilon_{f}^{\prime}}{m_{k}(x_{k})-m_{k}(x_{k}^{+})}\geq\theta,

and (14) follows from the definition of $e_{k}$ and $e_{k}^{+}$.

It remains to prove (12). Combining the first condition on the step $s_{k}$ in (5) with the model expression (4) for $s=s_{k}$, we can write

m_{k}(x_{k})-m_{k}(x_{k}^{+})=\frac{1}{2}(s_{k})^{T}H_{k}s_{k}+\frac{2}{3}\sigma_{k}\|s_{k}\|^{3}.
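In more detail, this identity is obtained by substituting $s_{k}^{T}g_{k}=-(s_{k})^{T}H_{k}s_{k}-\sigma_{k}\|s_{k}\|^{3}$, which is the first condition in (5), into the model decrease:

\begin{aligned}
m_{k}(x_{k})-m_{k}(x_{k}^{+})&=-\left(s_{k}^{T}g_{k}+\tfrac{1}{2}(s_{k})^{T}H_{k}s_{k}+\tfrac{\sigma_{k}}{3}\|s_{k}\|^{3}\right)\\
&=\left((s_{k})^{T}H_{k}s_{k}+\sigma_{k}\|s_{k}\|^{3}\right)-\tfrac{1}{2}(s_{k})^{T}H_{k}s_{k}-\tfrac{\sigma_{k}}{3}\|s_{k}\|^{3}=\tfrac{1}{2}(s_{k})^{T}H_{k}s_{k}+\tfrac{2}{3}\sigma_{k}\|s_{k}\|^{3}.
\end{aligned}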

The second condition on $s_{k}$ in (5) implies $(s_{k})^{T}H_{k}s_{k}\geq-\sigma_{k}\|s_{k}\|^{3}$. Together with the above equation, we obtain (12). ∎

Lemma 2 (Large $\sigma_{k}$ guarantees success or a small step).

Let Assumption 1 hold. For any realization of Algorithm 1, if iteration $k$ is true and if

\sigma_{k}\geq\bar{\sigma}=\frac{2\kappa_{g}+\kappa_{H}+L+L_{H}}{1-\frac{1}{3}\theta}, (15)

then iteration $k$ is either successful or produces $s_{k}$ such that $\|s_{k}\|^{2}<\frac{\mu}{\sigma_{k}}$.

Proof.

Clearly, if $\rho_{k}-1\geq 0$, then $k$ is successful by definition. Let us consider the case when $\rho_{k}<1$; then, if $1-\rho_{k}\leq 1-\theta$, $k$ is successful. We have from (7) that

1-\rho_{k}=\frac{m_{k}(x_{k})-m_{k}(x_{k}^{+})-f(x_{k})+f(x_{k}^{+})-2\epsilon_{f}^{\prime}}{m_{k}(x_{k})-m_{k}(x_{k}^{+})}.

Notice that:

\begin{array}{l}m_{k}(x_{k})-m_{k}(x_{k}^{+})-f(x_{k})+f(x_{k}^{+})-2\epsilon_{f}^{\prime}=f(x_{k}^{+})-\left(f(x_{k})+s_{k}^{T}g_{k}+\frac{1}{2}s_{k}^{T}H_{k}s_{k}+\frac{\sigma_{k}}{3}\|s_{k}\|^{3}\right)-2\epsilon_{f}^{\prime}\\
\leq\phi(x_{k}^{+})-\left(\phi(x_{k})+s_{k}^{T}g_{k}+\frac{1}{2}s_{k}^{T}H_{k}s_{k}+\frac{\sigma_{k}}{3}\|s_{k}\|^{3}\right)-2\epsilon_{f}^{\prime}+e_{k}+e_{k}^{+}\leq\phi(x_{k}^{+})-\phi(x_{k})-s_{k}^{T}g_{k}-\frac{1}{2}s_{k}^{T}H_{k}s_{k}-\frac{\sigma_{k}}{3}\|s_{k}\|^{3}.\end{array}

The second-to-last inequality follows from the definition of $e_{k}$ and $e_{k}^{+}$, and the last inequality from the fact that the iteration is true, so that $e_{k}+e_{k}^{+}\leq 2\epsilon_{f}^{\prime}$.

A Taylor expansion and the Cauchy-Schwarz inequality give, for some $\tau\in[x_{k},x_{k}^{+}]$,

\begin{array}{l}\phi(x_{k}^{+})-\phi(x_{k})-s_{k}^{T}g_{k}-\frac{1}{2}s_{k}^{T}H_{k}s_{k}-\frac{\sigma_{k}}{3}\|s_{k}\|^{3}\\
=[\nabla\phi(x_{k})-g_{k}]^{T}s_{k}+\frac{1}{2}(s_{k})^{T}[\nabla^{2}\phi(\tau)-\nabla^{2}\phi(x_{k})]s_{k}+\frac{1}{2}(s_{k})^{T}[\nabla^{2}\phi(x_{k})-H_{k}]s_{k}-\frac{1}{3}\sigma_{k}\|s_{k}\|^{3}\\
\leq\|\nabla\phi(x_{k})-g_{k}\|\cdot\|s_{k}\|+\frac{1}{2}\|\nabla^{2}\phi(\tau)-\nabla^{2}\phi(x_{k})\|\cdot\|s_{k}\|^{2}+\frac{1}{2}\|(\nabla^{2}\phi(x_{k})-H_{k})s_{k}\|\cdot\|s_{k}\|-\frac{1}{3}\sigma_{k}\|s_{k}\|^{3}\\
\leq\left(\kappa_{g}+\frac{\kappa_{H}}{2}\right)\max\left\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\right\}\|s_{k}\|+\left(\frac{L_{H}}{2}-\frac{1}{3}\sigma_{k}\right)\|s_{k}\|^{3},\end{array}

where the last inequality follows from the fact that the iteration is true and hence (10) holds, i.e., $\|\nabla\phi(x_{k})-g_{k}\|\leq\kappa_{g}\max\left\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\right\}$ and $\|(\nabla^{2}\phi(x_{k})-H_{k})s_{k}\|\leq\kappa_{H}\max\left\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\right\}$, and from Assumption 1. So as long as $\|s_{k}\|^{2}\geq\frac{\mu}{\sigma_{k}}$, we have

m_{k}(x_{k})-m_{k}(x_{k}^{+})-f(x_{k})+f(x_{k}^{+})-2\epsilon_{f}^{\prime}\leq\left(\kappa_{g}+\frac{\kappa_{H}}{2}+\frac{L_{H}}{2}-\frac{1}{3}\sigma_{k}\right)\|s_{k}\|^{3}=\left(6\kappa_{g}+3\kappa_{H}+3L_{H}-2\sigma_{k}\right)\frac{1}{6}\|s_{k}\|^{3},

which, together with (12), gives that $1-\rho_{k}\leq 1-\theta$ when $\sigma_{k}$ satisfies (15).

Note that for the above lemma to hold, $\bar{\sigma}$ does not need to include $L$ in the numerator. However, we will need another condition on $\bar{\sigma}$ later that involves $L$; hence, for simplicity of notation, we introduced $\bar{\sigma}$ above so that it satisfies all necessary bounds.

Lemma 3 (Lower bound on the step norm in terms of $\|\nabla\phi(x_{k}^{+})\|$).

Let Assumption 1 hold. For any realization of Algorithm 1, if $k$ is a true iteration, we have

\max\left\{\|s_{k}\|^{2},\frac{\mu}{\sigma_{k}}\right\}\geq\frac{1-\eta}{\sigma_{k}+(1-\frac{\theta}{3})\bar{\sigma}}\|\nabla\phi(x_{k}^{+})\|. (16)
Proof.

The triangle inequality, the equality $\nabla m_{k}(x_{k}+s)=g_{k}+H_{k}s+\sigma_{k}\|s\|s$ and condition (6) on $s_{k}$ together give

\begin{array}{lcl}\|\nabla\phi(x_{k}^{+})\|&\leq&\|\nabla\phi(x_{k}^{+})-\nabla m_{k}(x_{k}^{+})\|+\|\nabla m_{k}(x_{k}^{+})\|\\ &\leq&\|\nabla\phi(x_{k}^{+})-g_{k}-H_{k}s_{k}\|+\sigma_{k}\|s_{k}\|^{2}+\eta\min\left\{1,\|s_{k}\|\right\}\|g_{k}\|.\end{array} (17)

Recalling the Taylor expansion of $\nabla\phi(x_{k}^{+})$, namely $\nabla\phi(x_{k}^{+})=\nabla\phi(x_{k})+\int_{0}^{1}\nabla^{2}\phi(x_{k}+ts_{k})s_{k}\,dt$, and applying the triangle inequality again, we have

\begin{array}{lcl}\|\nabla\phi(x_{k}^{+})-g_{k}-H_{k}s_{k}\|&\leq&\|\nabla\phi(x_{k})-g_{k}\|+\left\|\int_{0}^{1}[\nabla^{2}\phi(x_{k}+ts_{k})-\nabla^{2}\phi(x_{k})]s_{k}\,dt\right\|+\|\nabla^{2}\phi(x_{k})s_{k}-H_{k}s_{k}\|\\ &\leq&(\kappa_{g}+\kappa_{H})\max\left\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\right\}+\frac{1}{2}L_{H}\|s_{k}\|^{2},\end{array}

where to obtain the second inequality we also used (10) and Assumption 1. We can bound $\|g_{k}\|$ as follows:

\|g_{k}\|\leq\|g_{k}-\nabla\phi(x_{k})\|+\|\nabla\phi(x_{k})-\nabla\phi(x_{k}^{+})\|+\|\nabla\phi(x_{k}^{+})\|\leq\kappa_{g}\max\left\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\right\}+L\|s_{k}\|+\|\nabla\phi(x_{k}^{+})\|.

Thus, finally, we can bound all the terms on the right-hand side of (17) in terms of $\|s_{k}\|^{2}$ and, using the fact that $\eta\in(0,1)$ together with $\min\{1,\|s_{k}\|\}\|s_{k}\|\leq\|s_{k}\|^{2}$, we can write

\begin{array}{lcl}(1-\eta)\|\nabla\phi(x_{k}^{+})\|&\leq&(2\kappa_{g}+\kappa_{H})\max\left\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\right\}+(L+L_{H}+\sigma_{k})\|s_{k}\|^{2}\\ &\leq&(2\kappa_{g}+\kappa_{H}+L+L_{H}+\sigma_{k})\max\left\{\frac{\mu}{\sigma_{k}},\|s_{k}\|^{2}\right\},\end{array}

which is equivalent to (16). ∎

Lemma 4 (Lower bound on the step norm until $\epsilon$-accuracy is reached).

Let Assumption 1 hold. Consider any realization of Algorithm 1. Let $\epsilon$ satisfy

\mu\leq\frac{1-\eta}{1+\frac{(1-\frac{\theta}{3})\bar{\sigma}}{\sigma_{\min}}}\epsilon. (18)

Then on each true iteration $k$ such that $\|\nabla\phi(x_{k}^{+})\|\geq\epsilon$, we have

\|s_{k}\|^{2}\geq\frac{\mu}{\sigma_{k}}.
Proof.

If iteration $k$ is true and $\|\nabla\phi(x_{k}^{+})\|>\epsilon$, then by Lemma 3,

\max\left\{\|s_{k}\|^{2},\frac{\mu}{\sigma_{k}}\right\}\geq\frac{1-\eta}{\sigma_{k}+(1-\frac{\theta}{3})\bar{\sigma}}\|\nabla\phi(x_{k}^{+})\|>\frac{1-\eta}{\sigma_{k}+(1-\frac{\theta}{3})\bar{\sigma}}\epsilon.

Since

\mu\leq\frac{1-\eta}{1+\frac{(1-\frac{\theta}{3})\bar{\sigma}}{\sigma_{\min}}}\epsilon,

and $\sigma_{k}\geq\sigma_{\min}$, we have

\frac{\mu}{\sigma_{k}}\leq\frac{1-\eta}{\sigma_{k}+\frac{(1-\frac{\theta}{3})\bar{\sigma}\sigma_{k}}{\sigma_{\min}}}\epsilon\leq\frac{1-\eta}{\sigma_{k}+(1-\frac{\theta}{3})\bar{\sigma}}\epsilon.

Hence, we must have

\|s_{k}\|^{2}>\frac{1-\eta}{\sigma_{k}+(1-\frac{\theta}{3})\bar{\sigma}}\epsilon\geq\frac{\mu}{\sigma_{k}}.

Corollary 1 (A true iteration with large $\sigma_{k}$ must be successful).

Let Assumption 1 hold. Consider any realization of Algorithm 1. Let $\epsilon$ satisfy (18) and let

\sigma_{k}\geq\bar{\sigma}=\frac{2\kappa_{g}+\kappa_{H}+L+L_{H}}{1-\frac{1}{3}\theta}.

If iteration $k$ is true and $\|\nabla\phi(x_{k}^{+})\|>\epsilon$, then iteration $k$ is successful.

Proof.

The result is straightforward by applying Lemmas 2 and 4. ∎

Lemma 5 (Minimum improvement achieved by true and successful iterations).

Let Assumption 1 hold. Consider any realization of Algorithm 1. Let $\epsilon$ satisfy (18). Then on each true and successful iteration $k$ for which $\left\lVert\nabla\phi(x_{k+1})\right\rVert>\epsilon$, we have

\phi(x_{k})-\phi(x_{k+1})\geq\frac{\theta}{6}(1-\eta)^{3/2}\frac{\sigma_{k}}{(\sigma_{k}+(1-\frac{\theta}{3})\bar{\sigma})^{3/2}}\|\nabla\phi(x_{k+1})\|^{3/2}-e_{k}-e_{k}^{+}-2\epsilon_{f}^{\prime} (19)
\geq\frac{\theta}{6}(1-\eta)^{3/2}\frac{\sigma_{\min}}{(\sigma_{k}+(1-\frac{\theta}{3})\bar{\sigma})^{3/2}}\|\nabla\phi(x_{k+1})\|^{3/2}-e_{k}-e_{k}^{+}-2\epsilon_{f}^{\prime}, (20)

where $\bar{\sigma}$ is defined in (15).

Proof.

Combining Lemmas 3 and 4, inequality (14) from Lemma 1, and the definition of a successful iteration in Algorithm 1, we have, for all true and successful iterations $k$,

\phi(x_{k})-\phi(x_{k+1})\geq\frac{\theta}{6}\sigma_{k}\|s_{k}\|^{3}-e_{k}-e_{k}^{+}-2\epsilon_{f}^{\prime} (21)
\geq\frac{\theta}{6}(1-\eta)^{3/2}\frac{\sigma_{k}}{(\sigma_{k}+(1-\frac{\theta}{3})\bar{\sigma})^{3/2}}\|\nabla\phi(x_{k+1})\|^{3/2}-e_{k}-e_{k}^{+}-2\epsilon_{f}^{\prime}. (22)

Using the fact that $\sigma_{k}\geq\sigma_{\min}$, the result follows. ∎

We can now use these properties of Algorithm 1 to show that the stochastic process generated by the algorithm fits into the framework analyzed in [JSX21].

4 Stochastic Properties of Algorithm 1

Algorithm 1 generates a stochastic process. Let $M_{k}$ denote the collection of random variables $\left\{\Xi_{k},\Xi_{k}^{+},\Xi_{k}^{1},\Xi_{k}^{2}\right\}$, whose realizations are $\left\{\xi_{k},\xi_{k}^{+},\xi_{k}^{1},\xi_{k}^{2}\right\}$. Let $\left\{{\cal F}_{k}:k\geq 0\right\}$ denote the filtration generated by $M_{0},M_{1},\ldots,M_{k}$. At iteration $k$, $X_{k}$ denotes the random iterate, $G_{k}$ the random gradient approximation, $\mathsf{H}_{k}$ the random Hessian approximation, and $\Sigma_{k}$ the random model regularization parameter; $S_{k}$ is the step computed for the random model, and $f(X_{k},\Xi_{k})$ and $f(X_{k}^{+},\Xi_{k}^{+})$ are the random function estimates at the current point and the candidate point, respectively.

Conditioned on $X_{k}$ and $\Sigma_{k}$, the random quantities $G_{k}$ and $\mathsf{H}_{k}$ are determined by $\Xi_{k}^{1}$ and $\Xi_{k}^{2}$, respectively. The realization of $S_{k}$ depends on the realizations of $G_{k}$ and $\mathsf{H}_{k}$. The function estimates $f(X_{k},\Xi_{k})$ and $f(X_{k}^{+},\Xi_{k}^{+})$ are determined by $\Xi_{k}$ and $\Xi_{k}^{+}$, conditioned on $X_{k}$ and $X_{k}^{+}$. In summary, the stochastic process $\left\{\left(G_{k},\mathsf{H}_{k},S_{k},f(X_{k},\Xi_{k}),f(X_{k}^{+},\Xi_{k}^{+}),X_{k},\Sigma_{k}\right)\right\}$ generated by the algorithm, with realization $\left\{\left(g_{k},H_{k},s_{k},f(x_{k},\xi_{k}),f(x_{k}^{+},\xi_{k}^{+}),x_{k},\sigma_{k}\right)\right\}$, is adapted to the filtration $\left\{{\cal F}_{k}:k\geq 0\right\}$.

We further define $E_{k}:=|f(X_{k},\Xi_{k})-\phi(X_{k})|$ and $E_{k}^{+}:=|f(X_{k}^{+},\Xi_{k}^{+})-\phi(X_{k}^{+})|$, with realizations $e_{k}$ and $e_{k}^{+}$. Let $\Theta_{k}:=\mathbbm{1}\left\{\text{iteration $k$ is successful}\right\}$ and $I_{k}:=\mathbbm{1}\left\{\text{iteration $k$ is true}\right\}$. The indicator random variables $\Theta_{k}$ and $I_{k}$ are clearly measurable with respect to the filtration ${\cal F}_{k}$.

The next lemma shows that, by construction of the algorithm, the stochastic model $m_{k}$ at iteration $k$ is “sufficiently accurate” with probability at least $1-\delta_{1}-\delta_{2}$.

Lemma 6.

The indicator variable

J_{k}=\mathbbm{1}\left\{\|\nabla\phi(X_{k})-g(X_{k},\Xi_{k}^{1}(X_{k}))\|\leq\kappa_{g}\max\left\{\frac{\mu}{\Sigma_{k}},\|S_{k}\|^{2}\right\}\text{ and }\|(\nabla^{2}\phi(X_{k})-H(X_{k},\Xi_{k}^{2}(X_{k})))S_{k}\|\leq\kappa_{H}\max\left\{\frac{\mu}{\Sigma_{k}},\|S_{k}\|^{2}\right\}\right\}

satisfies the following submartingale-like condition

\mathbb{P}(J_{k}=1\mid\mathcal{F}_{k-1})\geq 1-\delta_{1}-\delta_{2}.
Proof.

By the properties of the oracles $\mathsf{SFO}$ and $\mathsf{SSO}$ and the choice of their inputs in Step 1 of the algorithm, we have

\mathbb{P}\left(\|\nabla\phi(X_{k})-g(X_{k},\Xi_{k}^{1}(X_{k}))\|\leq\kappa_{g}\frac{\mu}{\Sigma_{k}}\mid\mathcal{F}_{k-1}\right)\geq 1-\delta_{1}, (23)

and

\mathbb{P}\left(\|\nabla^{2}\phi(X_{k})-H(X_{k},\Xi_{k}^{2}(X_{k}))\|\leq\kappa_{H}\sqrt{\frac{\mu}{\Sigma_{k}}}\mid\mathcal{F}_{k-1}\right)\geq 1-\delta_{2}. (24)

Inequality (23) implies

\mathbb{P}\left(\|\nabla\phi(X_{k})-g(X_{k},\Xi_{k}^{1}(X_{k}))\|\leq\kappa_{g}\max\left\{\frac{\mu}{\Sigma_{k}},\|S_{k}\|^{2}\right\}\mid\mathcal{F}_{k-1}\right)\geq 1-\delta_{1}, (25)

and, since $\sqrt{\frac{\mu}{\Sigma_{k}}}\,\|S_{k}\|\leq\max\left\{\frac{\mu}{\Sigma_{k}},\|S_{k}\|^{2}\right\}$, inequality (24) implies

\mathbb{P}\left(\|(\nabla^{2}\phi(X_{k})-H(X_{k},\Xi_{k}^{2}(X_{k})))S_{k}\|\leq\kappa_{H}\max\left\{\frac{\mu}{\Sigma_{k}},\|S_{k}\|^{2}\right\}\mid\mathcal{F}_{k-1}\right)\geq 1-\delta_{2}. (26)

Thus, we conclude that $\mathbb{P}(J_{k}=1\mid\mathcal{F}_{k-1})\geq 1-\delta_{1}-\delta_{2}$ by the union bound. ∎

Recall Definition 1 and that we denote the event of iteration $k$ being true by the indicator random variable $I_{k}$. It is crucial for our analysis that $\mathbb{P}(I_{k}=1\mid\mathcal{F}_{k-1})\geq p>\frac{1}{2}$ for all $k$. We will later combine Lemma 6 with the properties of $\mathsf{SZO}$ and a bound on $\delta_{1}+\delta_{2}$ to ensure that $p>\frac{1}{2}$.

The iteration complexity of our algorithm is defined as the following stopping time.

Definition 2 (Stopping time).

For $\epsilon>0$, let $T_{\epsilon}:=\min\left\{k:\left\lVert\nabla\phi(X_{k}^{+})\right\rVert\leq\epsilon\right\}+1$ denote the iteration complexity of the algorithm for reaching an $\epsilon$-stationary point. We will refer to $T_{\epsilon}$ as the stopping time of the algorithm.

It is important to note that even if, for some iteration $k$, $\left\lVert\nabla\phi(X_{k}^{+})\right\rVert\leq\epsilon$, this iteration may not be successful, and thus $\left\lVert\nabla\phi(X_{k+1})\right\rVert$ may be greater than $\epsilon$. This is a consequence of the complexity analysis of cubic regularization methods, which measures progress in terms of the gradient at the trial point rather than at the current iterate, and is thus not specific to SARC. The stopping time is therefore defined as the first time at which the algorithm computes a point at which the gradient of $\phi$ is at most $\epsilon$.

It is easy to see that $T_{\epsilon}$ is a stopping time of the stochastic process with respect to ${\cal F}_{k}$. Given a level of accuracy $\epsilon$, we aim to derive a bound on the iteration complexity $T_{\epsilon}$ that holds with overwhelmingly high probability. In particular, we will show that the number of iterations until the stopping time $T_{\epsilon}$ is a sub-exponential random variable whose value (with high probability) scales as $O(\epsilon^{-3/2})$, similarly to the deterministic case. Towards that end, we define a stochastic process $Z_{k}$ to measure the progress towards optimality.

Definition 3 (Measure of Progress).

For each $k\geq 0$, let $Z_{k}\geq 0$ be a random variable measuring the progress of the algorithm at step $k$: $Z_{k}=\phi(X_{k})-\phi^{*}$, where $\phi^{*}$ is a lower bound of $\phi$.

Armed with these definitions, we can now state properties of the stochastic process generated by Algorithm 1 that lead to the desired bounds on $T_{\epsilon}$. These properties hold under certain conditions on the parameters used by Algorithm 1, which we state here.

Assumption 2.

Define $u=\epsilon_{f}^{\prime}-\epsilon_{f}$, $K=C\max\{\frac{1}{\lambda},\frac{\ln 2}{a}\}$ for a universal constant $C$, and $p=1-\delta_{1}-\delta_{2}-\exp\left(-\min\left\{\frac{u^{2}}{2K^{2}},\frac{u}{2K}\right\}\right)$.

(a)

$\epsilon_{f}^{\prime}>\epsilon_{f}$,

(b)

$\delta_{1}+\delta_{2}$ is chosen sufficiently small so that $p>\frac{1}{2}$,

(c)
\epsilon>\max\left\{\frac{1+\frac{(1-\frac{\theta}{3})\bar{\sigma}}{\sigma_{\min}}}{1-\eta}\mu,\ \frac{(2-\frac{\theta}{3})\bar{\sigma}}{1-\eta}\left(\frac{24\epsilon_{f}^{\prime}}{(p-\frac{1}{2})\theta\sigma_{\min}}\right)^{\frac{2}{3}}\right\}. (27)
Remark 2.

Assumption 2 (c) gives a lower bound on the best accuracy the algorithm can achieve, given the accuracy parameters of the stochastic oracles $\mathsf{SFO}/\mathsf{SSO}$ and $\mathsf{SZO}$. Specifically, $\epsilon_{f}^{\prime}$ is lower bounded by $\epsilon_{f}$, which is the “irreducible” error of the zeroth-order oracle. We observe that if $\epsilon_{f}^{\prime}\approx\epsilon_{f}$, then the term involving the error of the zeroth-order oracle in the lower bound on $\epsilon$ for SARC is $O(\epsilon_{f}^{\frac{2}{3}})$, which is a better dependence than those of SASS in [JSX21] and the stochastic trust region algorithms in [CBS22], where $\epsilon$ is lower bounded by $O\left(\sqrt{\epsilon_{f}}\right)$.

The dependence of $\epsilon$ on $\mu$ has a somewhat more involved interpretation: $\mu$ can be chosen arbitrarily by the algorithm, as long as the oracles $\mathsf{SFO}/\mathsf{SSO}$ can deliver the corresponding accuracy. Recall that in the algorithm the accuracy input $\mu_{1}$ for $\mathsf{SFO}$ is $\frac{\mu}{\sigma_{k}}$ and the accuracy input $\mu_{2}$ for $\mathsf{SSO}$ is $\sqrt{\frac{\mu}{\sigma_{k}}}$. If $\sigma_{k}$ is bounded from above by a constant, then $\epsilon$ is essentially proportional to the best accuracy required of $\mathsf{SFO}$ during the run of the algorithm, and proportional to the square of the best accuracy required of $\mathsf{SSO}$. This dependence is the same as in deterministic inexact algorithms, as well as in [JSX21] and the stochastic trust region algorithms in [CBS22]. We comment on the existence of an upper bound on $\sigma_{k}$ after our main complexity result.
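To make the interplay of these constants concrete, the short computation below evaluates $p$ and the right-hand side of (27) for one set of purely illustrative, hypothetical values; it simply implements the formulas of Assumption 2 and makes no claim about realistic magnitudes.

import math

# Purely illustrative, hypothetical constants (not values used in the paper).
eps_f, eps_f_prime = 0.001, 0.05        # SZO bias and error-correction term
lam, a, C = 1000.0, 100.0, 1.0          # SZO tail parameters and universal constant (taken as 1)
delta1, delta2 = 0.05, 0.05             # SFO / SSO failure probabilities
theta, eta, mu = 0.5, 0.1, 1e-4         # algorithm parameters
sigma_min, sigma_bar = 1.0, 2.0         # regularization lower bound and threshold (15)

u = eps_f_prime - eps_f
K = C * max(1.0 / lam, math.log(2.0) / a)
p = 1 - delta1 - delta2 - math.exp(-min(u ** 2 / (2 * K ** 2), u / (2 * K)))
assert p > 0.5                          # Assumption 2 (b)

term1 = (1 + (1 - theta / 3) * sigma_bar / sigma_min) / (1 - eta) * mu
term2 = ((2 - theta / 3) * sigma_bar) / (1 - eta) * (
    24 * eps_f_prime / ((p - 0.5) * theta * sigma_min)) ** (2.0 / 3.0)
eps_lower_bound = max(term1, term2)     # right-hand side of (27)
print(p, eps_lower_bound)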

The following theorem establishes key properties of the stochastic process generated by Algorithm 1 that are essential for the convergence analysis. Similar properties are used in [JSX21] to obtain a high probability iteration complexity bound for a stochastic step search method.

To remain consistent with the notation used in [JSX21], we define the random variable $A_{k}:=\frac{1}{\Sigma_{k}}$, with realization $\alpha_{k}=\frac{1}{\sigma_{k}}$, and the constant $\bar{\alpha}=\frac{1}{\bar{\sigma}}$.

Theorem 2.

Let Assumptions 1 and 2 hold. For $\bar{\alpha}=\frac{1}{\bar{\sigma}}$ and the following non-decreasing function $h:\mathbb{R}\to\mathbb{R}$:

h(\alpha)=\frac{\theta}{6}(1-\eta)^{3/2}\frac{\sigma_{\min}}{\left(\frac{1}{\alpha}+\frac{1-\frac{\theta}{3}}{\bar{\alpha}}\right)^{3/2}}\epsilon^{3/2},

the following hold for all $k<T_{\epsilon}-1$:

  • (i)

    $\mathbb{P}(I_{k}=1\mid\mathcal{F}_{k-1})\geq p$ for all $k$. (Conditioned on the past, each iteration is true with probability at least $p$.)

  • (ii)

    If $A_{k}\leq\bar{\alpha}$ and $I_{k}=1$, then $\Theta_{k}=1$. (True iterations with sufficiently small $\alpha_{k}$ are successful.)

  • (iii)

    If $I_{k}\Theta_{k}=1$, then $Z_{k+1}\leq Z_{k}-h(A_{k})+4\epsilon_{f}^{\prime}$. (True, successful iterations make progress.)

  • (iv)

    $h(\bar{\alpha})>\frac{4\epsilon_{f}^{\prime}}{p-\frac{1}{2}}$. (A lower bound on the potential progress of an iteration with parameter $\bar{\alpha}$.)

  • (v)

    $Z_{k+1}\leq Z_{k}+2\epsilon_{f}^{\prime}+E_{k}+E_{k}^{+}$ for all $k$. (The “damage” at each iteration is bounded.)

Proof.

Part (i) follows from the assumptions on $p$ and the definition of a true iteration.

Part (ii) follows directly from Corollary 1.

Part (iii) follows from Lemma 5.

Part (iv) follows from the definitions of $\bar{\alpha}$ and $h(\alpha)$, and from inequality (27). Specifically, plugging in the definitions of $\bar{\alpha}$ and $h(\cdot)$, one can show that the inequality $h(\bar{\alpha})\geq\frac{4\epsilon_{f}^{\prime}}{p-\frac{1}{2}}$ is equivalent to $\epsilon>\frac{(2-\frac{\theta}{3})\bar{\sigma}}{1-\eta}\left(\frac{24\epsilon_{f}^{\prime}}{(p-\frac{1}{2})\theta\sigma_{\min}}\right)^{\frac{2}{3}}$, which holds by Assumption 2.
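For the reader's convenience, the algebra behind this equivalence can be sketched as follows, using $\frac{1}{\bar{\alpha}}=\bar{\sigma}$ (the constant $c_{1}$ below is the one appearing in Theorem 3):

\begin{aligned}
h(\bar{\alpha})&=\frac{\theta}{6}(1-\eta)^{3/2}\frac{\sigma_{\min}}{\left((2-\frac{\theta}{3})\bar{\sigma}\right)^{3/2}}\,\epsilon^{3/2}=c_{1}\epsilon^{3/2},\\
h(\bar{\alpha})\geq\frac{4\epsilon_{f}^{\prime}}{p-\frac{1}{2}}
&\iff\epsilon^{3/2}\geq\frac{24\epsilon_{f}^{\prime}\left((2-\frac{\theta}{3})\bar{\sigma}\right)^{3/2}}{(p-\frac{1}{2})\,\theta\,(1-\eta)^{3/2}\sigma_{\min}}
\iff\epsilon\geq\frac{(2-\frac{\theta}{3})\bar{\sigma}}{1-\eta}\left(\frac{24\epsilon_{f}^{\prime}}{(p-\frac{1}{2})\theta\sigma_{\min}}\right)^{\frac{2}{3}}.
\end{aligned}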

Part (v) has exactly the same proof as that of Proposition 1 part (v) in [JSX21] and is easily derived from the step acceptance condition of Algorithm 1.

5 High Probability Iteration Complexity Result

In [JSX21], a high probability bound on $T_{\epsilon}$ is derived for any stochastic process with the properties stated in Theorem 2. Thus, we can apply that result here. We first observe that

\mathbb{E}_{\Xi}\left[\exp\left\{\tau(E(x)-\mathbb{E}[E(x)])\right\}\right]\leq\exp\left(\frac{\tau^{2}\nu^{2}}{2}\right),\quad\forall\tau\in\left[0,\frac{1}{b}\right], (28)

with $\nu=b=K$, where $K=C\max\{\frac{1}{\lambda},\frac{\ln 2}{a}\}$ and $C$ is a universal constant. This follows from (1) of $\mathsf{SZO}$ by applying Proposition 2.7.1 of [Ver18]. Another minor modification of the result in [JSX21] is that it now applies to the event $T_{\epsilon}\leq t+1$ instead of the event $T_{\epsilon}\leq t$, due to the different definition of the stopping time.

Theorem 3.

Suppose Assumptions 1 and 2 hold for Algorithm 1. Then we have the following bound on the iteration complexity: for any $s\geq 0$, $\hat{p}\in\left(\frac{1}{2}+\frac{4\epsilon_{f}^{\prime}+s}{c_{1}\epsilon^{3/2}},p\right)$, and $t\geq\frac{R}{\hat{p}-\frac{1}{2}-\frac{4\epsilon_{f}^{\prime}+s}{c_{1}\epsilon^{3/2}}}$, we have

\mathbb{P}\left(T_{\epsilon}\leq t+1\right)\geq 1-\exp\left(-\frac{(p-\hat{p})^{2}}{2p^{2}}t\right)-\exp\left(-\min\left\{\frac{s^{2}t}{8K^{2}},\frac{st}{4K}\right\}\right),

where $c_{1}=\frac{\theta}{6}(1-\eta)^{3/2}\frac{\sigma_{\min}}{((2-\frac{\theta}{3})\bar{\sigma})^{3/2}}$, $K=C\max\{\frac{1}{\lambda},\frac{\ln 2}{a}\}$ with $C$ a universal constant, and $R=\frac{\phi(x_{0})-\phi^{*}}{c_{1}\epsilon^{3/2}}+\max\left\{-\frac{\ln\alpha_{0}+\ln\bar{\sigma}}{2\ln\gamma},0\right\}$, with $p$ and $\bar{\sigma}$ as defined previously.
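As a usage illustration, the following snippet evaluates the guarantee of Theorem 3 for given constants; it only implements the formulas in the theorem statement, and any concrete values passed to it should be regarded as placeholders.

import math

def theorem3_bound(t, s, hat_p, p, eps, eps_f_prime, K, theta, eta,
                   sigma_min, sigma_bar, phi_gap, alpha0, gamma):
    # Evaluate the lower bound on P(T_eps <= t + 1) from Theorem 3.
    # phi_gap stands for phi(x_0) - phi^*.  Returns None if (t, s, hat_p)
    # violate the theorem's conditions.
    c1 = theta / 6 * (1 - eta) ** 1.5 * sigma_min / ((2 - theta / 3) * sigma_bar) ** 1.5
    R = phi_gap / (c1 * eps ** 1.5) + max(-(math.log(alpha0) + math.log(sigma_bar)) / (2 * math.log(gamma)), 0.0)
    lo = 0.5 + (4 * eps_f_prime + s) / (c1 * eps ** 1.5)
    if not (lo < hat_p < p) or t < R / (hat_p - lo):
        return None
    return (1 - math.exp(-(p - hat_p) ** 2 / (2 * p ** 2) * t)
              - math.exp(-min(s ** 2 * t / (8 * K ** 2), s * t / (4 * K))))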

Remark 3.
  1.

    Theorem 3 shows that the iteration complexity of Algorithm 1 is $O(\epsilon^{-3/2})$ with overwhelmingly high probability, which matches the deterministic counterpart.

  2.

    The SARC algorithm encounters an $\epsilon$-stationary point in a finite number of iterations with probability $1$. This is a direct consequence of the Borel–Cantelli lemma.

  3.

    Since the probabilities of the failure events $\{T_{\epsilon}>t+1\}$ are exponentially decaying for all $t\geq\Theta(\epsilon^{-3/2})$, this also implies a complexity bound of $O(\epsilon^{-3/2})$ in expectation for SARC.

5.1 Upper Bound on $\sigma_{k}$

While the regularization parameters $\Sigma_{k}$ form a stochastic process, this process has some useful properties. Specifically, it is upper bounded by a one-sided geometric random walk. This random walk is analyzed in [JSX23], where it is shown that for any given number of iterations $t$, and for $\gamma$ chosen appropriately depending on $t$, $\max_{1\leq k\leq t}\{\sigma_{k}\}\leq O(\bar{\sigma})$ with high probability. A consequence of this fact is that, with high probability, for any realization of Algorithm 1 there exists a lower bound on all of the accuracy requirements $\mu_{1}$ and $\mu_{2}$ that are passed as inputs to the oracles $\mathsf{SFO}$ and $\mathsf{SSO}$ in (2) and (3). This, in turn, can give rise to a total “sample” complexity bound for Algorithm 1. For examples of such analyses, we refer the reader to [JSX23].

References

  • [BCMS19] Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS journal on optimization, 1(2):92–119, 2019.
  • [BG22] Stefania Bellavia and Gianmarco Gurioli. Stochastic analysis of an adaptive cubic regularization method under inexact gradient evaluations and dynamic hessian accuracy. Optimization, 71(1):227–261, 2022.
  • [BGMT19] Stefania Bellavia, Gianmarco Gurioli, Benedetta Morini, and Philippe L Toint. Adaptive regularization algorithms with inexact evaluations for nonconvex optimization. SIAM Journal on Optimization, 29(4):2881–2915, 2019.
  • [BGMT20] Stefania Bellavia, Gianmarco Gurioli, Benedetta Morini, and Philippe L Toint. A stochastic cubic regularisation method with inexact function evaluations and random derivatives for finite sum minimisation. In Thirty-seventh International Conference on Machine Learning: ICML2020, 2020.
  • [BGMT22] Stefania Bellavia, Gianmarco Gurioli, Benedetta Morini, and Philippe L. Toint. Adaptive regularization for nonconvex optimization using inexact function values and randomly perturbed derivatives. Journal of Complexity, 68:101591, 2022.
  • [BSV14] Afonso S Bandeira, Katya Scheinberg, and Luis Nunes Vicente. Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization, 24(3):1238–1264, 2014.
  • [BXZ23] Albert S. Berahas, Miaolan Xie, and Baoyu Zhou. A sequential quadratic programming method with high probability complexity bounds for nonlinear equality constrained stochastic optimization. arXiv preprint arXiv:2301.00477, 2023.
  • [CBS22] Liyuan Cao, Albert S Berahas, and Katya Scheinberg. First-and second-order high probability complexity bounds for trust-region methods with noisy oracles. arXiv preprint arXiv:2205.03667, 2022.
  • [CD19] Yair Carmon and John Duchi. Gradient descent finds the cubic-regularized nonconvex newton step. SIAM Journal on Optimization, 29(3):2146–2178, 2019.
  • [CGT11a] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Optimal Newton-type methods for nonconvex smooth optimization problems. Technical Report, Optimization Online, 2011.
  • [CGT11b] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Math. Program., 127(2):245–295, 2011.
  • [CGT11c] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity. Math. Program., 130(2):295–319, 2011.
  • [CMS18] Ruobing Chen, Matt Menickelly, and Katya Scheinberg. Stochastic optimization using a trust-region method and random models. Mathematical Programming, 169(2):447–487, 2018.
  • [CS18] Coralia Cartis and Katya Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, 169(2):337–375, 2018.
  • [GRVZ18] Serge Gratton, Clément W Royer, Luís N Vicente, and Zaikun Zhang. Complexity and global rates of trust-region methods based on probabilistic models. IMA Journal of Numerical Analysis, 38(3):1579–1597, 2018.
  • [JSX21] Billy Jin, Katya Scheinberg, and Miaolan Xie. High probability complexity bounds for adaptive step search based on stochastic oracles. arXiv preprint arXiv:2106.06454, 2021.
  • [JSX23] Billy Jin, Katya Scheinberg, and Miaolan Xie. Sample complexity analysis for adaptive optimization algorithms with stochastic oracles. arXiv preprint arXiv:2303.06838, 2023.
  • [KL17] Jonas Moritz Kohler and Aurélien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research. PMLR, 2017.
  • [LLHT18] Liu Liu, Xuanqing Liu, Cho-Jui Hsieh, and Dacheng Tao. Stochastic second-order methods for non-convex optimization with inexact hessian and gradient. arXiv preprint arXiv:1809.09853, 2018.
  • [MWX23] Matt Menickelly, Stefan M. Wild, and Miaolan Xie. A stochastic quasi-newton method in the absence of common random numbers. arXiv preprint arXiv:2302.09128, 2023.
  • [PJP20] Seonho Park, Seung Hyun Jung, and Panos M Pardalos. Combining stochastic adaptive cubic regularization with negative curvature for nonconvex optimization. Journal of Optimization Theory and Applications, 184(3):953–971, 2020.
  • [PS20] Courtney Paquette and Katya Scheinberg. A stochastic line search method with expected complexity analysis. SIAM Journal on Optimization, 30(1):349–376, 2020.
  • [TSJ+18] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899–2908, 2018.
  • [Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
  • [WZLL19] Zhe Wang, Yi Zhou, Yingbin Liang, and Guanghui Lan. A note on inexact gradient and hessian conditions for cubic regularized newton’s method. Operations Research Letters, 47(2):146–149, 2019.