
Randomized Subspace Derivative-Free Optimization with Quadratic Models and Second-Order Convergence

Coralia Cartis^a and Lindon Roberts^b
CONTACT: L. Roberts. Email: [email protected]
^a Mathematical Institute, University of Oxford, Radcliffe Observatory Quarter, Woodstock Road, Oxford OX2 6GG, UK
^b School of Mathematics and Statistics, University of Sydney, Carslaw Building, Camperdown NSW 2006, Australia
Abstract

We consider model-based derivative-free optimization (DFO) for large-scale problems, based on iterative minimization in random subspaces. We provide the first worst-case complexity bound for such methods for convergence to approximate second-order critical points, and show that these bounds have significantly improved dimension dependence compared to standard full-space methods, provided low accuracy solutions are desired and/or the problem has low effective rank. We also introduce a practical subspace model-based method suitable for general objective minimization, based on iterative quadratic interpolation in subspaces, and show that it can solve significantly larger problems than state-of-the-art full-space methods, while also having comparable performance on medium-scale problems when allowed to use full-dimension subspaces.

keywords:
Derivative-free optimization; Large-scale optimization; Worst case complexity

1 Introduction

In this work, we consider the problem of smooth nonlinear optimization where derivatives of the objective are unavailable, so-called derivative-free optimization (DFO) [13, 2, 24]. DFO methods are widely used across many fields [1], but in this work we focus on large-scale derivative-free problems, such as those arising from bilevel learning of variational models in imaging [17], fine-tuning large language models [25], adversarial example generation for neural networks [32], and as a possible proxy for global optimization methods in the high-dimensional setting [10]. Here we focus on the widely used class of model-based DFO methods, which are based on iteratively constructing local (interpolating) models for the objective.

Standard model-based methods [13, 24] suffer from a lack of scalability with respect to the problem dimension, coming from dimension-dependent interpolation error bounds affecting worst-case complexity, and significant per-iteration linear algebra costs; thus, they have traditionally been limited to small- or medium-scale problems. More recently, starting with [9], model-based DFO methods based on iterative minimization in random subspaces have proven to be a promising approach for tackling large-scale DFO problems in theory and practice. Theoretically, for example, using linear models in random subspaces improves the worst-case complexity of such methods from O(n^{3}\epsilon^{-2}) to O(n^{2}\epsilon^{-2}) evaluations to find an \epsilon-approximate stationary point, for a problem with n decision variables. This aligns with similar results for derivative-based optimization [5, 31] and other DFO approaches, such as direct search methods with random models [21, 30] and gradient sampling in randomly drawn subspaces [23]. This model-based framework has been extended to handle stochastic objective values [16] and specialized to exploit structured quadratic model construction approaches [12]. The case of direct search methods in random subspaces has also been extended to the stochastic and nonsmooth case, and to show convergence to second-order optimal points [15].

There are alternative approaches for large-scale DFO where iterations are performed in subspaces that are deterministically generated from information from previous iterations, such as [36, 22, 33], where the method from [36, Chapter 5.3] alternates between iterations in two-dimensional subspaces and full-dimensional finite differencing. The method from [35] uses partially random subspaces, similar to [12], but is specialized to two-dimensional subspaces. We also note [26], where a full-dimensional model is updated at each iteration only inside a randomly generated subspace.

1.1 Contributions

Existing random subspace methods for model-based DFO have theoretical convergence to approximate first-order optimal points, but there is currently no theory handling second-order optimality in random subspace model-based DFO (with only [15] handling the direct search case). Although [12] does provide a mechanism for constructing fully quadratic interpolation models in random subspaces, which is a necessary component of second-order convergence theory, only first-order convergence theory is given. We also note the related work [3], which proves convergence to second-order optimal points for a trust-region method with probabilistic models and function values, which uses similar ideas but with different sources of uncertainty (instead of the approach here which uses randomly drawn subspaces with deterministic models and function values).

There are two main contributions in this work:

  • First, we show how the random subspace model-based method in [9] can be extended to allow convergence to second-order stationary points (similar to full space methods [13, 19]), and prove a high probability worst-case complexity bound for this method. Our approach is based on adapting a recent approach for second-order convergence of (derivative-based) subspace adaptive regularization with cubics [31, 11]. We show a worst-case complexity bound (where \tilde{O}(\cdot) is the same as O(\cdot) but hiding logarithmic factors as well as constants) of \tilde{O}(n^{4.5}\epsilon^{-3}) iterations and evaluations to achieve second-order optimality level \epsilon for an n-dimensional problem, a significant improvement on the O(n^{9}\epsilon^{-3}) iterations and O(n^{11}\epsilon^{-3}) evaluations for full space methods. For this theory to hold, we need to consider either low accuracy solutions or problems with low effective rank (i.e. low rank Hessians, such as fine-tuning large language models [25]). We note that the numerical evidence from [9] suggests that subspace methods are indeed most practical for finding low accuracy solutions.

  • We present a new practical algorithm for random subspace model-based DFO based on quadratic interpolation models. In [9], a practical approach based on linear interpolation models was built for nonlinear least-squares problems (a special case where linear models provide good practical performance [8]). Here, we extend this approach to handle general quadratic interpolation approaches, focusing in particular on underdetermined quadratic interpolation [28] in subspaces. Compared to state-of-the-art full-space methods using similar model constructions [4], our approach has comparable performance when set to use full-dimensional subspaces, but when using low-dimensional subspaces is able to solve larger problems (of dimension n\approx O(1000)) where existing approaches fail due to the significant linear algebra costs at this scale.

Structure

Firstly, in Section 2 we recall the random subspace model-based approach from [9] and its associated first-order complexity results. In Section 3 we extend this approach to find second-order critical points, stating the theoretical algorithm and proving the associated worst-case complexity bounds. Then, in Section 4 we outline our practical subspace method based on quadratic model construction in subspaces, which we test numerically in Section 5.

Notation

We use \|\cdot\| to denote the Euclidean norm of vectors in \mathbb{R}^{n} and the operator 2-norm of matrices, and B(\bm{x},\Delta)=\{\bm{y}:\|\bm{y}-\bm{x}\|\leq\Delta\} is the closed ball of radius \Delta centered at \bm{x}. We use \tilde{O}(\cdot) to denote O(\cdot) with both logarithmic and constant factors omitted.

2 Background

In this work we consider the problem

\min_{\bm{x}\in\mathbb{R}^{n}}f(\bm{x}), (1)

where f:\mathbb{R}^{n}\to\mathbb{R} is continuously differentiable and nonconvex, but \nabla f is not available. In standard model-based DFO, at iteration k, we construct a local quadratic approximation for f that we hope to be accurate near the current iterate \bm{x}_{k}:

f(\bm{x}_{k}+\bm{s})\approx m_{k}(\bm{s}):=f(\bm{x}_{k})+\bm{g}_{k}^{T}\bm{s}+\frac{1}{2}\bm{s}^{T}H_{k}\bm{s}, (2)

for some \bm{g}_{k}\in\mathbb{R}^{n} and (symmetric) H_{k}\in\mathbb{R}^{n\times n}. This is typically achieved by finding \bm{g}_{k} and H_{k} to interpolate the value of f at a collection of points near \bm{x}_{k}. We then use this model inside a trust-region framework to ensure a globally convergent algorithm, where, compared to derivative-based trust-region methods, special care has to be taken to ensure that m_{k} is sufficiently accurate and \|\bm{g}_{k}\| is not too small relative to \Delta_{k}; see e.g. [13] for details.

In the large-scale case, which would typically be n\geq 1000 for DFO, the linear algebra cost of model construction and management becomes significant, and the model error (relative to the true objective value) can grow rapidly. To avoid this issue, in [9] it was proposed to perform each iteration within a randomly drawn, low-dimensional subspace. Specifically, for some p\ll n, at each iteration k we generate a random P_{k}\in\mathbb{R}^{n\times p}, defining an affine space \mathcal{Y}_{k}:=\{\bm{x}_{k}+P_{k}\hat{\bm{s}}:\hat{\bm{s}}\in\mathbb{R}^{p}\}. We then construct a low-dimensional model to approximate f only on \mathcal{Y}_{k},

f(\bm{x}_{k}+P_{k}\hat{\bm{s}})\approx\hat{m}_{k}(\hat{\bm{s}}):=f(\bm{x}_{k})+\hat{\bm{g}}_{k}^{T}\hat{\bm{s}}+\frac{1}{2}\hat{\bm{s}}^{T}\hat{H}_{k}\hat{\bm{s}}, (3)

for some \hat{\bm{g}}_{k}\in\mathbb{R}^{p} and (symmetric) \hat{H}_{k}\in\mathbb{R}^{p\times p}. We will adopt the convention that variables with hats are in the low-dimensional space and those without hats are in the full space. The trust-region framework is easily adapted to this situation: the model \hat{m}_{k} is minimized subject to a ball constraint to get a tentative low-dimensional step \hat{\bm{s}}_{k}, which in turn defines a new potential iterate \bm{x}_{k}+P_{k}\hat{\bm{s}}_{k}, which may be accepted or rejected via a sufficient decrease test.

To achieve first-order convergence, the model \hat{m}_{k} (3) must be sufficiently accurate in the following sense [9, Definition 1].

Definition 2.1.

Given \bm{x}\in\mathbb{R}^{n}, \Delta>0 and P\in\mathbb{R}^{n\times p}, a model \hat{m}:\mathbb{R}^{p}\to\mathbb{R} is P-fully linear in B(\bm{x},\Delta) if

|f(\bm{x}+P\hat{\bm{s}})-\hat{m}(\hat{\bm{s}})| \leq\kappa_{\textnormal{ef}}\Delta^{2}, (4a)
\|P^{T}\nabla f(\bm{x}+P\hat{\bm{s}})-\nabla\hat{m}(\hat{\bm{s}})\| \leq\kappa_{\textnormal{eg}}\Delta, (4b)

for all \hat{\bm{s}}\in\mathbb{R}^{p} with \|\hat{\bm{s}}\|\leq\Delta, where the constants \kappa_{\textnormal{ef}},\kappa_{\textnormal{eg}}>0 are independent of \bm{x}, \Delta, P and \hat{m}.

1: Inputs: initial iterate \bm{x}_{0}\in\mathbb{R}^{n}, trust-region radii 0<\Delta_{0}\leq\Delta_{\max}, trust-region updating parameters 0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}, subspace dimension p\in\{1,\ldots,n\}, and success check parameters \eta\in(0,1) and \mu>0.
2: for k=0,1,2,\ldots do
3:     Generate a random matrix P_{k}\in\mathbb{R}^{n\times p} satisfying Assumption 2.5.
4:     Build a model \hat{m}_{k} (3) which is P_{k}-fully linear in B(\bm{x}_{k},\Delta_{k}).
5:     Calculate a step by approximately solving the trust-region subproblem
\hat{\bm{s}}_{k}\approx\operatorname*{arg\,min}_{\hat{\bm{s}}\in\mathbb{R}^{p}}\hat{m}_{k}(\hat{\bm{s}}),\quad\text{s.t.}\>\>\|\hat{\bm{s}}\|\leq\Delta_{k}. (5)
6:     Evaluate f(\bm{x}_{k}+P_{k}\hat{\bm{s}}_{k}) and calculate the ratio
R_{k}:=\frac{f(\bm{x}_{k})-f(\bm{x}_{k}+P_{k}\hat{\bm{s}}_{k})}{\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})}. (6)
7:     if R_{k}\geq\eta and \|\hat{\bm{g}}_{k}\|\geq\mu\Delta_{k} then
8:         (successful) Set \bm{x}_{k+1}=\bm{x}_{k}+P_{k}\hat{\bm{s}}_{k} and \Delta_{k+1}=\min(\gamma_{\textnormal{inc}}\Delta_{k},\Delta_{\max}).
9:     else
10:         (unsuccessful) Set \bm{x}_{k+1}=\bm{x}_{k} and \Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}.
11:     end if
12: end for
Algorithm 1: Random subspace DFO algorithm applied to (1), a simplified version of [9, Algorithm 1] based on [16, Algorithm 2.1].

A prototypical random subspace algorithm is presented in Algorithm 1. It is a simplification of the original algorithm [9, Algorithm 1], which includes a more sophisticated trust-region management procedure based on [29] and allows the model \hat{m}_{k} to not be P_{k}-fully linear at all iterations. The version presented here is based on [16, Algorithm 2.1] and has the same first-order convergence guarantees.
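To make the structure of Algorithm 1 concrete, the following is a minimal Python sketch of its main loop, under stated assumptions: the routine build_model (returning the coefficients \hat{\bm{g}}_{k} and \hat{H}_{k} of a P_{k}-fully linear subspace model) is a hypothetical placeholder, and the trust-region subproblem (5) is only solved approximately via a Cauchy step, which is enough to satisfy the decrease condition (7) below.

import numpy as np

def cauchy_step(g, H, Delta):
    """Step along -g achieving the Cauchy decrease condition (7)."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    gHg = g @ H @ g
    t = Delta / gnorm if gHg <= 0 else min(gnorm ** 2 / gHg, Delta / gnorm)
    return -t * g

def random_subspace_tr(f, x0, build_model, p, Delta0=1.0, Delta_max=1e3,
                       gamma_dec=0.5, gamma_inc=2.0, eta=0.1, mu=1.0,
                       max_iters=500):
    """Minimal sketch of Algorithm 1. `build_model(f, x, P, Delta)` is a
    placeholder assumed to return (g_hat, H_hat) for a P-fully linear model."""
    x = np.asarray(x0, dtype=float)
    n, Delta, fx = x.size, Delta0, f(x)
    for _ in range(max_iters):
        P = np.random.randn(n, p) / np.sqrt(p)       # scaled Gaussian sketch
        g_hat, H_hat = build_model(f, x, P, Delta)   # subspace model (3)
        s_hat = cauchy_step(g_hat, H_hat, Delta)     # approximate solve of (5)
        pred = -(g_hat @ s_hat + 0.5 * s_hat @ H_hat @ s_hat)  # model decrease
        x_new = x + P @ s_hat
        f_new = f(x_new)
        R = (fx - f_new) / max(pred, 1e-15)          # ratio (6)
        if R >= eta and np.linalg.norm(g_hat) >= mu * Delta:   # successful
            x, fx = x_new, f_new
            Delta = min(gamma_inc * Delta, Delta_max)
        else:                                        # unsuccessful
            Delta = gamma_dec * Delta
    return x, fx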

In order to prove convergence of Algorithm 1, we need the following standard assumptions on the objective function, model and trust-region subproblem calculation (5).

Assumption 2.2.

The objective function f:\mathbb{R}^{n}\to\mathbb{R} in (1) is bounded below by f_{\textnormal{low}} and is continuously differentiable, where \nabla f is L_{\nabla f}-Lipschitz continuous for some L_{\nabla f}>0.

Assumption 2.3.

The model Hessians \hat{H}_{k} are uniformly bounded, \|\hat{H}_{k}\|\leq\kappa_{H} for all k, for some \kappa_{H}\geq 1.

Assumption 2.4.

The computed solution \hat{\bm{s}}_{k} in (5) satisfies the Cauchy decrease condition

\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})\geq c_{1}\|\hat{\bm{g}}_{k}\|\min\left(\Delta_{k},\frac{\|\hat{\bm{g}}_{k}\|}{\max(\|\hat{H}_{k}\|,1)}\right), (7)

for some c_{1}\in[1/2,1] independent of k.

While the above assumptions are common for trust-region methods, we need an additional assumption on the random matrices P_{k} defining the subspace at each iteration.

Assumption 2.5.

The matrices P_{k} are uniformly bounded, \|P_{k}\|\leq P_{\max} for all k, and at each iteration P_{k} is well-aligned, i.e.

\mathbb{P}\left[\|P_{k}^{T}\nabla f(\bm{x}_{k})\|\geq\alpha\|\nabla f(\bm{x}_{k})\|\right]\geq 1-\delta, (8)

for some \alpha>0 and \delta\in(0,1) independent of k and all previous iterations P_{0},\ldots,P_{k-1}.

Although condition (8) depends on \nabla f(\bm{x}_{k}), which is not available, Assumption 2.5 can still be satisfied using Johnson-Lindenstrauss transforms [34]. For example, P_{k} may have independent N(0,1/p) entries or be the first p columns of a random orthogonal n\times n matrix scaled by \sqrt{n/p}. More details on how Assumption 2.5 may be satisfied can be found in [31, Chapter 4], [5, Section 3.2], [9, Section 2.6], [16, Section 3.1], [30, Section 3.3], or [12, Section 2.5], for example. Most crucially, for such constructions, (8) can be satisfied for p=\mathcal{O}(1) independent of n.
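As an illustration of the scaled Gaussian construction mentioned above, the following short Monte-Carlo sketch estimates the probability in (8) for i.i.d. N(0,1/p) entries; the particular values of n, p, \alpha and the synthetic gradient are arbitrary choices for illustration, not quantities from the paper.

import numpy as np

# Empirical check (a sketch) that a scaled Gaussian sketch P with i.i.d.
# N(0, 1/p) entries is well-aligned in the sense of (8): with probability
# close to 1, ||P^T grad|| >= alpha * ||grad||, even for p = O(1).
rng = np.random.default_rng(0)
n, p, alpha, trials = 1000, 10, 0.5, 2000

grad = rng.standard_normal(n)            # stands in for the unknown gradient
hits = 0
for _ in range(trials):
    P = rng.standard_normal((n, p)) / np.sqrt(p)
    if np.linalg.norm(P.T @ grad) >= alpha * np.linalg.norm(grad):
        hits += 1
print(f"empirical P[well-aligned] = {hits / trials:.3f}")  # close to 1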

Under these assumptions, Algorithm 1 has a first-order worst-case complexity bound of the following form (e.g. [9, Corollary 1] or [12, Theorem 9]).

Theorem 2.6.

Suppose Assumptions 2.2, 2.3, 2.4 and 2.5 hold. Suppose also that \gamma_{\textnormal{inc}}>\gamma_{\textnormal{dec}}^{-2} and that \delta is sufficiently small in Assumption 2.5. Then for k sufficiently large, the iterates of Algorithm 1 satisfy

\mathbb{P}\left[\min_{j\leq k}\|\nabla f(\bm{x}_{j})\|\leq\frac{C\sqrt{\kappa_{H}}\,\kappa_{d}}{\sqrt{k}}\right]\geq 1-e^{-ck}, (9)

for constants c,C>0, where \kappa_{d}:=\max(\kappa_{\textnormal{ef}},\kappa_{\textnormal{eg}}) from Definition 2.1.

This result implies that \|\nabla f(\bm{x}_{k})\| is driven below \epsilon for the first time after \mathcal{O}(\kappa_{H}\kappa_{d}^{2}\epsilon^{-2}) iterations and \mathcal{O}(p\kappa_{H}\kappa_{d}^{2}\epsilon^{-2}) objective evaluations (with high probability), assuming \mathcal{O}(p) interpolation points are used to construct \hat{m}_{k}. Assuming linear interpolation models, we have \kappa_{d}=\mathcal{O}(P_{\max}^{2}p) and \kappa_{H}=1, so in the full-space case p=n and P_{k}=I_{n\times n} we get a worst-case complexity of \mathcal{O}(n^{2}\epsilon^{-2}) iterations and \mathcal{O}(n^{3}\epsilon^{-2}) objective evaluations. However, using a Johnson-Lindenstrauss transform we may use p=\mathcal{O}(1) and P_{\max}=\mathcal{O}(\sqrt{n}), giving the same iteration complexity but an improved complexity of \mathcal{O}(n^{2}\epsilon^{-2}) evaluations.

3 Convergence to Second-Order Stationary Points

We now consider the extension of the framework in Section 2 to a second-order convergent algorithm. Our basic framework is the same, namely at each iteration we select a random P_{k} and build a low-dimensional model \hat{m}_{k} (3) to approximate f in the corresponding affine space \mathcal{Y}_{k}.

Now, we are interested in the second-order criticality measure

\sigma_{k}:=\max(\|\nabla f(\bm{x}_{k})\|,\tau_{k}),\qquad\text{where}\qquad\tau_{k}:=\max(-\lambda_{\min}(\nabla^{2}f(\bm{x}_{k})),0), (10)

(see [13, Chapter 10], for example). Restricting the objective to the space \mathcal{Y}_{k} we get the corresponding subspace criticality measure

\hat{\sigma}_{k}:=\max(\|P_{k}^{T}\nabla f(\bm{x}_{k})\|,\hat{\tau}_{k}),\qquad\text{where}\qquad\hat{\tau}_{k}:=\max(-\lambda_{\min}(P_{k}^{T}\nabla^{2}f(\bm{x}_{k})P_{k}),0), (11)

and our interpolation model \hat{m}_{k} defines its own approximate (subspace) criticality measure

\hat{\sigma}^{m}_{k}:=\max(\|\hat{\bm{g}}_{k}\|,\hat{\tau}^{m}_{k}),\qquad\text{where}\qquad\hat{\tau}^{m}_{k}:=\max(-\lambda_{\min}(\hat{H}_{k}),0). (12)
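For illustration, the criticality measures (10)-(12) amount to a norm plus a smallest-eigenvalue computation, as in the following sketch. Note that in the DFO setting the true derivatives in (10)-(11) are of course unavailable; only the model quantities in (12) can actually be computed, so the example below uses synthetic data.

import numpy as np

def criticality(g, H):
    """sigma = max(||g||, max(-lambda_min(H), 0)), cf. (10)-(12)."""
    lam_min = np.linalg.eigvalsh(H)[0]          # smallest eigenvalue
    return max(np.linalg.norm(g), max(-lam_min, 0.0))

# Full-space, subspace and model measures (illustration only).
rng = np.random.default_rng(1)
n, p = 50, 5
grad = rng.standard_normal(n)
Hess = rng.standard_normal((n, n)); Hess = 0.5 * (Hess + Hess.T)
P = rng.standard_normal((n, p)) / np.sqrt(p)

sigma_k = criticality(grad, Hess)                      # (10)
sigma_hat_k = criticality(P.T @ grad, P.T @ Hess @ P)  # (11)
# sigma_hat_m would use the interpolation model's (g_hat, H_hat) in (12).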

To achieve \sigma_{k}\to 0 (i.e. convergence to second-order critical points), we need a more restrictive notion of model accuracy.

Definition 3.1.

The subspace model \hat{m}_{k}:\mathbb{R}^{p}\to\mathbb{R} in (3) is P_{k}-fully quadratic if there exist constants \kappa_{\textnormal{ef}},\kappa_{\textnormal{eg}},\kappa_{\textnormal{eh}} independent of k such that

|f(\bm{x}_{k}+P_{k}\hat{\bm{s}})-\hat{m}_{k}(\hat{\bm{s}})| \leq\kappa_{\textnormal{ef}}\Delta_{k}^{3}, (13a)
\|P_{k}^{T}\nabla f(\bm{x}_{k}+P_{k}\hat{\bm{s}})-\nabla\hat{m}_{k}(\hat{\bm{s}})\| \leq\kappa_{\textnormal{eg}}\Delta_{k}^{2}, (13b)
\|P_{k}^{T}\nabla^{2}f(\bm{x}_{k}+P_{k}\hat{\bm{s}})P_{k}-\nabla^{2}\hat{m}_{k}(\hat{\bm{s}})\| \leq\kappa_{\textnormal{eh}}\Delta_{k}, (13c)

for all \|\hat{\bm{s}}\|\leq\Delta_{k}.

This notion is discussed in [12, Section 2.4], and a specific sampling mechanism for achieving this is described there.
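As a diagnostic illustration of Definition 3.1 (not the sampling mechanism of [12]), the following sketch estimates the smallest constants satisfying (13a)-(13c) for a given subspace model by sampling points in the trust region. It assumes access to the true derivatives of f, so it only applies to test problems with known gradients and Hessians; all function and argument names are our own.

import numpy as np

def fully_quadratic_constants(f, grad_f, hess_f, x, P, m_hat, g_hat, H_hat,
                              Delta, n_samples=200, rng=None):
    """Estimate the smallest kappa_ef, kappa_eg, kappa_eh in (13a)-(13c) by
    sampling s_hat in the ball ||s_hat|| <= Delta (a diagnostic sketch only).
    Uses grad m_hat(s) = g_hat + H_hat s and hess m_hat(s) = H_hat."""
    rng = rng or np.random.default_rng()
    p = P.shape[1]
    kef = keg = keh = 0.0
    for _ in range(n_samples):
        s = rng.standard_normal(p)
        s *= Delta * rng.random() ** (1.0 / p) / np.linalg.norm(s)  # uniform in ball
        y = x + P @ s
        kef = max(kef, abs(f(y) - m_hat(s)) / Delta ** 3)
        keg = max(keg, np.linalg.norm(P.T @ grad_f(y) - (g_hat + H_hat @ s)) / Delta ** 2)
        keh = max(keh, np.linalg.norm(P.T @ hess_f(y) @ P - H_hat, 2) / Delta)
    return kef, keg, keh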

1: Inputs: initial iterate \bm{x}_{0}\in\mathbb{R}^{n} and trust-region radius \Delta_{0}>0, and subspace dimension p\in\{1,\ldots,n\}.
2: Parameters: maximum trust-region radius \Delta_{\max}\geq\Delta_{0}, trust-region parameters 0<\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}, and step acceptance thresholds \eta\in(0,1) and \mu>0.
3: for k=0,1,2,\ldots do
4:     Select a random matrix P_{k}\in\mathbb{R}^{n\times p} satisfying Assumption 3.7.
5:     Build a subspace model \hat{m}_{k} (3) which is P_{k}-fully quadratic (Definition 3.1).
6:     Calculate a step \hat{\bm{s}}_{k} by approximately solving the trust-region subproblem (5).
7:     Evaluate f(\bm{x}_{k}+P_{k}\hat{\bm{s}}_{k}) and calculate the ratio R_{k} (6).
8:     if R_{k}\geq\eta and \hat{\sigma}^{m}_{k}\geq\mu\Delta_{k} then
9:         (successful) Set \bm{x}_{k+1}=\bm{x}_{k}+P_{k}\hat{\bm{s}}_{k} and \Delta_{k+1}=\min(\gamma_{\textnormal{inc}}\Delta_{k},\Delta_{\max}).
10:     else
11:         (unsuccessful) Set \bm{x}_{k+1}=\bm{x}_{k} and \Delta_{k+1}=\gamma_{\textnormal{dec}}\Delta_{k}.
12:     end if
13: end for
Algorithm 2: Random-subspace model-based DFO algorithm applied to (1) with second-order convergence.

In addition, we need the following assumptions on the objective and computed trust-region step which are standard for second-order convergence (e.g. [13, Chapter 10]).

Assumption 3.2.

The objective f is bounded below by f_{\textnormal{low}}, is twice continuously differentiable, and \nabla^{2}f is L_{H}-Lipschitz continuous. Additionally, the Hessian at all iterates is uniformly bounded, \|\nabla^{2}f(\bm{x}_{k})\|\leq M for some M>0 independent of k.

Assumption 3.3.

The step \hat{\bm{s}}_{k} satisfies

\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})\geq\kappa_{s}\max\left(\|\hat{\bm{g}}_{k}\|\min\left(\Delta_{k},\frac{\|\hat{\bm{g}}_{k}\|}{\max(\|\hat{H}_{k}\|,1)}\right),\hat{\tau}^{m}_{k}\Delta_{k}^{2}\right), (14)

for some \kappa_{s}>0 independent of k.
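One simple way to compute a step satisfying (14) (with \kappa_{s}=1/2) is to take the better of the Cauchy point and a boundary step along an eigenvector corresponding to the most negative eigenvalue of \hat{H}_{k}. The following is a sketch of this construction, under that assumption; exact trust-region subproblem solvers would also satisfy (14).

import numpy as np

def second_order_tr_step(g, H, Delta):
    """Return s_hat achieving the decrease (14): the better of the Cauchy
    point and a boundary step along a leftmost eigenvector of H (a sketch)."""
    def model_decrease(s):
        return -(g @ s + 0.5 * s @ H @ s)

    # Cauchy point along -g
    gnorm = np.linalg.norm(g)
    if gnorm > 0:
        gHg = g @ H @ g
        t = Delta / gnorm if gHg <= 0 else min(gnorm ** 2 / gHg, Delta / gnorm)
        s_c = -t * g
    else:
        s_c = np.zeros_like(g)

    # Eigenstep: +/- Delta * u, where u is an eigenvector of lambda_min(H)
    lam, U = np.linalg.eigh(H)
    s_e = np.zeros_like(g)
    if lam[0] < 0:
        u = U[:, 0]
        s_e = Delta * u if model_decrease(Delta * u) >= model_decrease(-Delta * u) else -Delta * u

    return s_c if model_decrease(s_c) >= model_decrease(s_e) else s_e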

We also require that Assumption 2.3 (bounded model Hessians) holds here.

Lemma 3.4 (Lemma 10.15, [13]).

Suppose Assumption 3.2 holds and \hat{m}_{k} is P_{k}-fully quadratic on B(\bm{x}_{k},\Delta_{k}). Then |\hat{\sigma}_{k}-\hat{\sigma}^{m}_{k}|\leq\kappa_{\sigma}\Delta_{k}, where \kappa_{\sigma}:=\max(\kappa_{\textnormal{eg}}\Delta_{\max},\kappa_{\textnormal{eh}}).

Lemma 3.5.

Suppose Assumptions 2.3 and 3.3 hold. If

\Delta_{k}\leq c_{0}\,\hat{\sigma}^{m}_{k},\qquad\text{where}\qquad c_{0}:=\min\left(\frac{1}{\mu},\frac{1}{\kappa_{H}},\frac{\kappa_{s}(1-\eta)}{\kappa_{\textnormal{ef}}\Delta_{\max}},\frac{\kappa_{s}(1-\eta)}{\kappa_{\textnormal{ef}}}\right), (15)

then iteration k is successful (i.e. R_{k}\geq\eta and \hat{\sigma}^{m}_{k}\geq\mu\Delta_{k}).

Proof.

This argument is based on [13, Lemma 10.17]. First, since c_{0}\leq 1/\mu, we immediately have \hat{\sigma}^{m}_{k}\geq\mu\Delta_{k}; it remains to show R_{k}\geq\eta. By definition of \hat{\sigma}^{m}_{k}, we either have \hat{\sigma}^{m}_{k}=\|\hat{\bm{g}}_{k}\| or \hat{\sigma}^{m}_{k}=\hat{\tau}^{m}_{k}. From Assumptions 3.3 and 2.3, we have

\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})\geq\kappa_{s}\max\left(\|\hat{\bm{g}}_{k}\|\min\left(\Delta_{k},\frac{\|\hat{\bm{g}}_{k}\|}{\kappa_{H}}\right),\hat{\tau}^{m}_{k}\Delta_{k}^{2}\right). (16)

First, if \hat{\sigma}^{m}_{k}=\|\hat{\bm{g}}_{k}\|, then \Delta_{k}\leq\frac{1}{\kappa_{H}}\hat{\sigma}^{m}_{k}=\frac{1}{\kappa_{H}}\|\hat{\bm{g}}_{k}\| implies

\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})\geq\kappa_{s}\|\hat{\bm{g}}_{k}\|\Delta_{k}=\kappa_{s}\hat{\sigma}^{m}_{k}\Delta_{k}. (17)

Then we have

|R_{k}-1| =\frac{|f(\bm{x}_{k}+P_{k}\hat{\bm{s}}_{k})-\hat{m}_{k}(\hat{\bm{s}}_{k})|}{|\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})|}, (18)
\leq\frac{\kappa_{\textnormal{ef}}\Delta_{k}^{3}}{\kappa_{s}\hat{\sigma}^{m}_{k}\Delta_{k}}, (19)
\leq\frac{\kappa_{\textnormal{ef}}\Delta_{\max}}{\kappa_{s}\hat{\sigma}^{m}_{k}}\Delta_{k}, (20)

and so |R_{k}-1|\leq 1-\eta (by definition of c_{0}), and hence R_{k}\geq\eta as required. Alternatively, if \hat{\sigma}^{m}_{k}=\hat{\tau}^{m}_{k}, then

\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})\geq\kappa_{s}\hat{\tau}^{m}_{k}\Delta_{k}^{2}=\kappa_{s}\hat{\sigma}^{m}_{k}\Delta_{k}^{2}. (21)

In this case, we get

|R_{k}-1| =\frac{|f(\bm{x}_{k}+P_{k}\hat{\bm{s}}_{k})-\hat{m}_{k}(\hat{\bm{s}}_{k})|}{|\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})|}\leq\frac{\kappa_{\textnormal{ef}}\Delta_{k}^{3}}{\kappa_{s}\hat{\sigma}^{m}_{k}\Delta_{k}^{2}}=\frac{\kappa_{\textnormal{ef}}\Delta_{k}}{\kappa_{s}\hat{\sigma}^{m}_{k}}, (22)

and so again |R_{k}-1|\leq 1-\eta by definition of c_{0}, and R_{k}\geq\eta. ∎

Now, we need to specify our requirements for a well-aligned subspace. This is based on the approach in [11], originally from the PhD thesis [31, Chapter 5.6]. At iteration k, suppose \nabla^{2}f(\bm{x}_{k}) has eigendecomposition \nabla^{2}f(\bm{x}_{k})=\sum_{i=1}^{r}\lambda_{i}\bm{v}_{i}\bm{v}_{i}^{T}, where \lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{r}, r:=\operatorname{rank}(\nabla^{2}f(\bm{x}_{k}))\leq n, and \bm{v}_{1},\ldots,\bm{v}_{r} are orthonormal. Define \hat{\bm{v}}_{i}:=P_{k}^{T}\bm{v}_{i}\in\mathbb{R}^{p}, so that the (true) subspace Hessian is P_{k}^{T}\nabla^{2}f(\bm{x}_{k})P_{k}=\sum_{i=1}^{r}\lambda_{i}\hat{\bm{v}}_{i}\hat{\bm{v}}_{i}^{T}.

Definition 3.6.

A subspace P_{k} is well-aligned if

\|P_{k}\| \leq P_{\max}, (23a)
\|P_{k}^{T}\nabla f(\bm{x}_{k})\| \geq(1-\alpha)\|\nabla f(\bm{x}_{k})\|, (23b)
\|\hat{\bm{v}}_{r}\| \geq 1-\alpha, (23c)
(\hat{\bm{v}}_{i}^{T}\hat{\bm{v}}_{r})^{2} \leq 4\alpha^{2},\qquad\forall i=1,\ldots,r-1, (23d)

for some \alpha\in(0,1) and P_{\max}>0 independent of k.

In practice, a well-aligned subspace can be achieved by generating P_{k} randomly (see Section 3.1 below). Hence, we use the following assumption.

Assumption 3.7.

At any iteration k, the matrix P_{k} is well-aligned in the sense of Definition 3.6 with probability at least 1-\delta_{S}, for some \delta_{S}\in[0,1), independent of all previous iterations P_{0},\ldots,P_{k-1}.

The first two well-aligned properties (23a) and (23b) match the first-order case (Assumption 2.5); i.e. the subspace sees a sizeable fraction of the true gradient. The third condition (23c) says the subspace sees a sizeable fraction of the left-most eigenvector (since \|\bm{v}_{r}\|=1), and the last condition (23d) says the first r-1 eigenvectors remain approximately orthogonal to the last eigenvector in the subspace.
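The following Monte-Carlo sketch checks how often a scaled Gaussian P_{k} satisfies conditions (23a)-(23d) for a synthetic gradient and a synthetic rank-r Hessian eigenbasis; the dimensions, the value of \alpha and the constant in the P_{\max} bound are illustrative assumptions only.

import numpy as np

# Monte-Carlo sketch: how often is a scaled Gaussian P_k well-aligned in the
# sense of (23a)-(23d)?  The gradient and eigenvectors here are synthetic.
rng = np.random.default_rng(2)
n, p, r, alpha = 500, 20, 5, 0.5
P_max = 2.0 * np.sqrt(n / p)                        # Theta(sqrt(n/p)) bound

grad = rng.standard_normal(n)
V = np.linalg.qr(rng.standard_normal((n, r)))[0]    # orthonormal v_1, ..., v_r

hits, trials = 0, 500
for _ in range(trials):
    P = rng.standard_normal((n, p)) / np.sqrt(p)
    Vh = P.T @ V                                    # columns are v_hat_i
    ok = (np.linalg.norm(P, 2) <= P_max                                        # (23a)
          and np.linalg.norm(P.T @ grad) >= (1 - alpha) * np.linalg.norm(grad) # (23b)
          and np.linalg.norm(Vh[:, -1]) >= 1 - alpha                           # (23c)
          and np.all((Vh[:, :-1].T @ Vh[:, -1]) ** 2 <= 4 * alpha ** 2))       # (23d)
    hits += ok
print(f"empirical P[well-aligned] = {hits / trials:.2f}")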

Lemma 3.8.

Suppose Assumption 3.2 holds, P_{k} is well-aligned, \sigma_{k}\geq\epsilon>0, and

\theta:=(1-\alpha)^{2}-\frac{4M(r-1)\alpha^{2}}{\epsilon(1-\alpha)^{2}}>0. (24)

Then

\hat{\sigma}_{k}\geq\min((1-\alpha)^{2},\theta)\epsilon. (25)
Proof.

This argument is based on [11, Lemma 6.10], originally from [31, Lemma 5.6.6]. First, suppose that \sigma_{k}=\|\nabla f(\bm{x}_{k})\|. Then

\hat{\sigma}_{k}\geq\|P_{k}^{T}\nabla f(\bm{x}_{k})\|\geq(1-\alpha)\|\nabla f(\bm{x}_{k})\|=(1-\alpha)\sigma_{k}\geq(1-\alpha)^{2}\epsilon, (26)

using (1-\alpha)^{2}<1-\alpha from \alpha\in(0,1), and we are done. Instead, suppose that \sigma_{k}=\tau_{k}. Using Rayleigh quotients, we have

\lambda_{\min}(P_{k}^{T}\nabla^{2}f(\bm{x}_{k})P_{k})\leq\frac{\hat{\bm{v}}_{r}^{T}(\sum_{i=1}^{r}\lambda_{i}\hat{\bm{v}}_{i}\hat{\bm{v}}_{i}^{T})\hat{\bm{v}}_{r}}{\|\hat{\bm{v}}_{r}\|^{2}}=\lambda_{r}\|\hat{\bm{v}}_{r}\|^{2}+\frac{\sum_{i=1}^{r-1}\lambda_{i}(\hat{\bm{v}}_{i}^{T}\hat{\bm{v}}_{r})^{2}}{\|\hat{\bm{v}}_{r}\|^{2}}. (27)

Since \tau_{k}\geq\epsilon>0, we have \tau_{k}=-\lambda_{r} and \lambda_{r}\leq-\epsilon, and so

-\lambda_{\min}(P_{k}^{T}\nabla^{2}f(\bm{x}_{k})P_{k}) \geq\epsilon\|\hat{\bm{v}}_{r}\|^{2}-\lambda_{1}\frac{\sum_{i=1}^{r-1}(\hat{\bm{v}}_{i}^{T}\hat{\bm{v}}_{r})^{2}}{\|\hat{\bm{v}}_{r}\|^{2}}. (28)

If \lambda_{1}\leq 0 then the second term is non-negative, and so

-\lambda_{\min}(P_{k}^{T}\nabla^{2}f(\bm{x}_{k})P_{k})\geq\epsilon(1-\alpha)^{2}>0, (29)

and so \hat{\tau}_{k}>0 with \hat{\sigma}_{k}\geq\hat{\tau}_{k}\geq\epsilon(1-\alpha)^{2}. Instead, if \lambda_{1}>0 then \lambda_{1}\leq\|\nabla^{2}f(\bm{x}_{k})\|\leq M, and so we have

-\lambda_{\min}(P_{k}^{T}\nabla^{2}f(\bm{x}_{k})P_{k}) \geq\epsilon(1-\alpha)^{2}-M\frac{4(r-1)\alpha^{2}}{(1-\alpha)^{2}}=\theta\epsilon>0, (30)

and so again \lambda_{\min}(P_{k}^{T}\nabla^{2}f(\bm{x}_{k})P_{k})<0, yielding \hat{\sigma}_{k}\geq\hat{\tau}_{k}\geq\theta\epsilon, and the result follows. ∎

Lemma 3.9.

Suppose Assumption 3.2 holds and \theta>0 (24). If P_{k} is well-aligned, \sigma_{k}\geq\epsilon and iteration k is successful, then

\hat{\sigma}^{m}_{k}\geq c_{1}\epsilon,\qquad\text{where}\qquad c_{1}:=\frac{\min(1-\alpha,(1-\alpha)^{2},\theta)}{1+\kappa_{\sigma}\mu^{-1}}. (31)
Proof.

On successful iterations, we have \hat{\sigma}^{m}_{k}\geq\mu\Delta_{k} and so from Lemmas 3.4 and 3.8 we have

\min(1-\alpha,(1-\alpha)^{2},\theta)\epsilon\leq\hat{\sigma}_{k}\leq\hat{\sigma}^{m}_{k}+|\hat{\sigma}_{k}-\hat{\sigma}^{m}_{k}|\leq\hat{\sigma}^{m}_{k}+\kappa_{\sigma}\Delta_{k}\leq(1+\kappa_{\sigma}\mu^{-1})\hat{\sigma}^{m}_{k}, (32)

and we are done. ∎

We are now ready to prove the worst-case complexity of Algorithm 2 to (approximate) second-order stationary points. The argument below closely follows that from [9], used to prove Theorem 2.6. For a fixed K, define the following sets of iterations:

  • \mathcal{A} is the set of iterations in \{0,\ldots,K\} for which P_{k} is well-aligned.

  • \mathcal{A}^{C} is the set of iterations in \{0,\ldots,K\} for which P_{k} is not well-aligned.

  • \mathcal{S} (resp. \mathcal{U}) is the set of iterations in \{0,\ldots,K\} which are successful (resp. unsuccessful).

  • \mathcal{D}(\Delta) is the set of iterations in \{0,\ldots,K\} for which \Delta_{k}\geq\Delta.

  • \mathcal{D}^{C}(\Delta) is the set of iterations in \{0,\ldots,K\} for which \Delta_{k}<\Delta.

Hence, for any K and \Delta, we have the partitions \{0,\ldots,K\}=\mathcal{A}\cup\mathcal{A}^{C}=\mathcal{S}\cup\mathcal{U}=\mathcal{D}(\Delta)\cup\mathcal{D}^{C}(\Delta), as well as \mathcal{D}(\Delta_{1})\subseteq\mathcal{D}(\Delta_{2}) if \Delta_{1}\geq\Delta_{2}.

We first use standard trust-region arguments to bound \#(\mathcal{A}) (by considering the four cases \mathcal{A}\cap\mathcal{D}(\Delta)\cap\mathcal{S},\ldots,\mathcal{A}\cap\mathcal{D}^{C}(\Delta)\cap\mathcal{U} separately), provided \sigma_{k} remains large. We can then bound (with high probability) the total number of iterations K for which \sigma_{k} is large by noting that \mathcal{A}^{C} occurs with small probability (Assumption 3.7).

Lemma 3.10.

Suppose Assumptions 2.3, 3.2 and 3.3 hold, and \theta>0 (24). If \sigma_{k}\geq\epsilon>0 for all k\leq K, then

\#(\mathcal{A}\cap\mathcal{D}(\Delta)\cap\mathcal{S})\leq\phi(\Delta,\epsilon),\qquad\text{where}\quad\phi(\Delta,\epsilon):=\frac{f(\bm{x}_{0})-f_{\textnormal{low}}}{\eta\kappa_{s}c_{1}\min(\epsilon\Delta,\epsilon\Delta^{2})}, (33)

where c_{1}>0 is defined in Lemma 3.9.

Proof.

From (17) and (21), we have

\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})\geq\min(\kappa_{s}\hat{\sigma}^{m}_{k}\Delta_{k},\kappa_{s}\hat{\sigma}^{m}_{k}\Delta_{k}^{2}), (34)

depending on whether \hat{\sigma}^{m}_{k}=\|\hat{\bm{g}}_{k}\| or \hat{\sigma}^{m}_{k}=\hat{\tau}^{m}_{k}. For k\in\mathcal{A}\cap\mathcal{D}(\Delta)\cap\mathcal{S}, from Lemma 3.9 we have \Delta_{k}\geq\Delta and \hat{\sigma}^{m}_{k}\geq c_{1}\epsilon, and so

\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})\geq\kappa_{s}c_{1}\min(\epsilon\Delta,\epsilon\Delta^{2}). (35)

We note that \bm{x}_{k} is only changed for k\in\mathcal{S}, and so

f(\bm{x}_{0})-f_{\textnormal{low}}\geq\sum_{k\in\mathcal{S}}f(\bm{x}_{k})-f(\bm{x}_{k}+P_{k}\hat{\bm{s}}_{k})\geq\sum_{k\in\mathcal{A}\cap\mathcal{D}(\Delta)\cap\mathcal{S}}\eta(\hat{m}_{k}(\bm{0})-\hat{m}_{k}(\hat{\bm{s}}_{k})), (36)

which yields

f(\bm{x}_{0})-f_{\textnormal{low}}\geq\eta\kappa_{s}c_{1}\min(\epsilon\Delta,\epsilon\Delta^{2})\cdot\#(\mathcal{A}\cap\mathcal{D}(\Delta)\cap\mathcal{S}), (37)

which gives the result. ∎

Lemma 3.11.

Suppose Assumptions 2.3, 3.2 and 3.3 hold, and \theta>0 (24). If \sigma_{k}\geq\epsilon>0 for all k\leq K, then

\#(\mathcal{A}\cap\mathcal{D}^{C}(\Delta)\cap\mathcal{U})=0, (38)

for all \Delta\leq c_{2}\epsilon, where

c_{2}:=\frac{1}{\kappa_{\sigma}+c_{0}^{-1}}\min(1-\alpha,(1-\alpha)^{2},\theta), (39)

where \kappa_{\sigma}>0 is defined in Lemma 3.4 and c_{0}>0 is defined in Lemma 3.5.

Proof.

This argument follows [19, Lemma 4.2.3]. To find a contradiction, suppose k\in\mathcal{A}\cap\mathcal{D}^{C}(\Delta)\cap\mathcal{U}. Since k\in\mathcal{A}, we have \hat{\sigma}_{k}\geq\min(1-\alpha,(1-\alpha)^{2},\theta)\epsilon by Lemma 3.8. Also, since k\in\mathcal{U}, we must have \Delta_{k}>c_{0}\hat{\sigma}^{m}_{k} by Lemma 3.5. Thus,

\min(1-\alpha,(1-\alpha)^{2},\theta)\epsilon\leq\hat{\sigma}_{k}\leq(\hat{\sigma}_{k}-\hat{\sigma}^{m}_{k})+\hat{\sigma}^{m}_{k}<\kappa_{\sigma}\Delta_{k}+c_{0}^{-1}\Delta_{k}, (40)

where the last inequality follows from Lemma 3.4. By assumption on \Delta, this contradicts k\in\mathcal{D}^{C}(\Delta), and we are done. ∎

Lemma 3.12.

Suppose Assumptions 2.3, 3.2 and 3.3 hold, and \theta>0 (24). If \sigma_{k}\geq\epsilon>0 for all k\leq K, then

\#(\mathcal{D}(\gamma_{\textnormal{dec}}^{-1}\Delta)\cap\mathcal{U}) \leq c_{3}\cdot\#(\mathcal{D}(\gamma_{\textnormal{inc}}^{-1}\Delta)\cap\mathcal{S})+c_{4}(\Delta), (41)

for all \Delta\leq\Delta_{0}, and

\#(\mathcal{D}^{C}(\gamma_{\textnormal{inc}}^{-1}\Delta)\cap\mathcal{S}) \leq c_{3}^{-1}\cdot\#(\mathcal{D}^{C}(\gamma_{\textnormal{dec}}^{-1}\Delta)\cap\mathcal{U}), (42)

for all \Delta\leq\min(\Delta_{0},\gamma_{\textnormal{inc}}^{-1}\Delta_{\max}), where we define

c_{3}:=\frac{\log(\gamma_{\textnormal{inc}})}{\log(\gamma_{\textnormal{dec}}^{-1})},\qquad\text{and}\qquad c_{4}(\Delta):=\frac{\log(\Delta_{0}/\Delta)}{\log(\gamma_{\textnormal{dec}}^{-1})}. (43)
Proof.

This follows the same argument as in [9, Lemmas 6 & 7], which we include here for completeness. In both cases, we consider the movement of \log(\Delta_{k}), which is increased by at most \log(\gamma_{\textnormal{inc}}) if k\in\mathcal{S} and decreased by \log(\gamma_{\textnormal{dec}}^{-1}) if k\in\mathcal{U}.

To show (41), we note:

  • Since \Delta\leq\Delta_{0}, the value \log(\Delta) is \log(\Delta_{0}/\Delta) below the starting value \log(\Delta_{0}).

  • If k\in\mathcal{S}, then \log(\Delta_{k}) increases by at most \log(\gamma_{\textnormal{inc}}), and \Delta_{k+1}\geq\Delta is only possible if \Delta_{k}\geq\gamma_{\textnormal{inc}}^{-1}\Delta.

  • If k\in\mathcal{U}, then \log(\Delta_{k}) is decreased by \log(\gamma_{\textnormal{dec}}^{-1}), and \Delta_{k+1}\geq\Delta if \Delta_{k}\geq\gamma_{\textnormal{dec}}^{-1}\Delta.

Hence the total decrease in \log(\Delta_{k}) from k\in\mathcal{D}(\gamma_{\textnormal{dec}}^{-1}\Delta)\cap\mathcal{U} must be fully matched by the initial gap \log(\Delta_{0}/\Delta) plus the maximum possible amount that \log(\Delta_{k}) can be increased above \log(\Delta); that is,

\log(\gamma_{\textnormal{dec}}^{-1})\#(\mathcal{D}(\gamma_{\textnormal{dec}}^{-1}\Delta)\cap\mathcal{U})\leq\log(\Delta_{0}/\Delta)+\log(\gamma_{\textnormal{inc}})\#(\mathcal{D}(\gamma_{\textnormal{inc}}^{-1}\Delta)\cap\mathcal{S}), (44)

and we get (41).

To get (42), we use a similar approach. Specifically,

  • If k\in\mathcal{S}, then since \Delta\leq\gamma_{\textnormal{inc}}^{-1}\Delta_{\max}, \log(\Delta_{k}) is increased by exactly \log(\gamma_{\textnormal{inc}}), and \Delta_{k+1}<\Delta if and only if \Delta_{k}<\gamma_{\textnormal{inc}}^{-1}\Delta.

  • If k\in\mathcal{U}, then \log(\Delta_{k}) is decreased by \log(\gamma_{\textnormal{dec}}^{-1}), and we need \Delta_{k}<\gamma_{\textnormal{dec}}^{-1}\Delta to get \Delta_{k+1}<\Delta.

Since \Delta\leq\Delta_{0}, the total increase in \log(\Delta_{k}) from k\in\mathcal{D}^{C}(\gamma_{\textnormal{inc}}^{-1}\Delta)\cap\mathcal{S} must be fully matched by the total decrease from k\in\mathcal{D}^{C}(\gamma_{\textnormal{dec}}^{-1}\Delta)\cap\mathcal{U}, and so

\log(\gamma_{\textnormal{inc}})\#(\mathcal{D}^{C}(\gamma_{\textnormal{inc}}^{-1}\Delta)\cap\mathcal{S})\leq\log(\gamma_{\textnormal{dec}}^{-1})\#(\mathcal{D}^{C}(\gamma_{\textnormal{dec}}^{-1}\Delta)\cap\mathcal{U}), (45)

and we get (42). ∎

We can now bound the total number of well-aligned iterations.

Lemma 3.13.

Suppose Assumptions 2.3, 3.2 and 3.3 hold, \theta>0 (24), and \gamma_{\textnormal{inc}}>\gamma_{\textnormal{dec}}^{-1}. If \sigma_{k}\geq\epsilon>0 for all k\leq K, then

\#(\mathcal{A})\leq\psi(\epsilon)+\frac{c_{5}}{c_{5}+1}\cdot(K+1), (46)

where

\psi(\epsilon) :=\frac{1}{c_{5}+1}\left[(c_{3}+1)\phi(\Delta_{\min},\epsilon)+c_{4}(\gamma_{\textnormal{inc}}\Delta_{\min})+\frac{\phi(\gamma_{\textnormal{inc}}^{-1}\gamma_{\textnormal{dec}}\Delta_{\min},\epsilon)}{1-c_{3}^{-1}}\right], (47)
\Delta_{\min} :=\min\left(\gamma_{\textnormal{inc}}^{-1}\Delta_{0},\gamma_{\textnormal{dec}}^{-1}\gamma_{\textnormal{inc}}^{-1}\Delta_{\max},\gamma_{\textnormal{dec}}\gamma_{\textnormal{inc}}^{-1}c_{2}\epsilon\right), (48)
c_{5} :=\max\left(c_{3},\frac{c_{3}^{-1}}{1-c_{3}^{-1}}\right), (49)

with c_{5}>0, where c_{3} and c_{4}(\Delta) are defined in Lemma 3.12.

Proof.

This follows the proof of [9, Lemma 9]. Since \#(\mathcal{A})=\#(\mathcal{A}\cap\mathcal{D}(\Delta_{\min}))+\#(\mathcal{A}\cap\mathcal{D}^{C}(\Delta_{\min})), we bound these terms individually.

First, we use Lemma 3.10 to get

\#(\mathcal{A}\cap\mathcal{D}(\Delta_{\min})) =\#(\mathcal{A}\cap\mathcal{D}(\Delta_{\min})\cap\mathcal{S})+\#(\mathcal{A}\cap\mathcal{D}(\Delta_{\min})\cap\mathcal{U}), (50)
\leq\phi(\Delta_{\min},\epsilon)+\#(\mathcal{A}\cap\mathcal{D}(\gamma_{\textnormal{dec}}^{-1}\gamma_{\textnormal{inc}}\Delta_{\min})\cap\mathcal{U})+\#(\mathcal{A}\cap\mathcal{D}^{C}(\gamma_{\textnormal{dec}}^{-1}\gamma_{\textnormal{inc}}\Delta_{\min})\cap\mathcal{D}(\Delta_{\min})\cap\mathcal{U}), (51)
\leq\phi(\Delta_{\min},\epsilon)+c_{3}\#(\mathcal{D}(\Delta_{\min})\cap\mathcal{S})+c_{4}(\gamma_{\textnormal{inc}}\Delta_{\min})+0, (52)
\leq(c_{3}+1)\phi(\Delta_{\min},\epsilon)+c_{3}\#(\mathcal{A}^{C}\cap\mathcal{D}(\Delta_{\min})\cap\mathcal{S})+c_{4}(\gamma_{\textnormal{inc}}\Delta_{\min}), (53)

where the second inequality follows from Lemma 3.12 (specifically (41) after noting \gamma_{\textnormal{inc}}\Delta_{\min}\leq\Delta_{0}) and Lemma 3.11 (with \gamma_{\textnormal{dec}}^{-1}\gamma_{\textnormal{inc}}\Delta_{\min}\leq c_{2}\epsilon), and the last inequality comes from using Lemma 3.10 a second time.

Separately, we use Lemma 3.11 (with \Delta_{\min}\leq c_{2}\epsilon) to get

\#(\mathcal{A}\cap\mathcal{D}^{C}(\Delta_{\min})) =\#(\mathcal{A}\cap\mathcal{D}^{C}(\Delta_{\min})\cap\mathcal{S}), (54)
=\#(\mathcal{A}\cap\mathcal{D}(\gamma_{\textnormal{inc}}^{-1}\gamma_{\textnormal{dec}}\Delta_{\min})\cap\mathcal{D}^{C}(\Delta_{\min})\cap\mathcal{S})+\#(\mathcal{A}\cap\mathcal{D}^{C}(\gamma_{\textnormal{inc}}^{-1}\gamma_{\textnormal{dec}}\Delta_{\min})\cap\mathcal{S}), (55)
\leq\#(\mathcal{A}\cap\mathcal{D}(\gamma_{\textnormal{inc}}^{-1}\gamma_{\textnormal{dec}}\Delta_{\min})\cap\mathcal{S})+\#(\mathcal{D}^{C}(\gamma_{\textnormal{inc}}^{-1}\gamma_{\textnormal{dec}}\Delta_{\min})\cap\mathcal{S}), (56)
\leq\phi(\gamma_{\textnormal{inc}}^{-1}\gamma_{\textnormal{dec}}\Delta_{\min},\epsilon)+c_{3}^{-1}\#(\mathcal{D}^{C}(\Delta_{\min})\cap\mathcal{U}), (57)
=\phi(\gamma_{\textnormal{inc}}^{-1}\gamma_{\textnormal{dec}}\Delta_{\min},\epsilon)+c_{3}^{-1}\#(\mathcal{A}\cap\mathcal{D}^{C}(\Delta_{\min})\cap\mathcal{U})+c_{3}^{-1}\#(\mathcal{A}^{C}\cap\mathcal{D}^{C}(\Delta_{\min})\cap\mathcal{U}), (58)

where the second inequality comes from Lemma 3.10 and Lemma 3.12 (specifically (42) with \gamma_{\textnormal{dec}}\Delta_{\min}\leq\Delta_{\min}\leq\Delta_{0} and \gamma_{\textnormal{dec}}\Delta_{\min}\leq\gamma_{\textnormal{inc}}^{-1}\Delta_{\max}). We now use \#(\mathcal{A}\cap\mathcal{D}^{C}(\Delta_{\min})\cap\mathcal{U})\leq\#(\mathcal{A}\cap\mathcal{D}^{C}(\Delta_{\min})) and rearrange the above to get

\#(\mathcal{A}\cap\mathcal{D}^{C}(\Delta_{\min})) \leq\frac{1}{1-c_{3}^{-1}}\left[\phi(\gamma_{\textnormal{inc}}^{-1}\gamma_{\textnormal{dec}}\Delta_{\min},\epsilon)+c_{3}^{-1}\#(\mathcal{A}^{C}\cap\mathcal{D}^{C}(\Delta_{\min})\cap\mathcal{U})\right], (59)

where the assumption \gamma_{\textnormal{inc}}>\gamma_{\textnormal{dec}}^{-1}>1 guarantees c_{3}^{-1}\in(0,1).

We combine (53) and (59) to get

\#(\mathcal{A})\leq(c_{3}+1)\phi(\Delta_{\min},\epsilon)+c_{4}(\gamma_{\textnormal{inc}}\Delta_{\min})+\frac{\phi(\gamma_{\textnormal{inc}}^{-1}\gamma_{\textnormal{dec}}\Delta_{\min},\epsilon)}{1-c_{3}^{-1}} (60)
+\max\left(c_{3},\frac{c_{3}^{-1}}{1-c_{3}^{-1}}\right)\cdot\#(\mathcal{A}^{C}), (61)

and the result then follows after using \#(\mathcal{A}^{C})=(K+1)-\#(\mathcal{A}) and rearranging. To show that c_{5}>0, recall c_{3}^{-1}\in(0,1). ∎

To get a total complexity bound, it suffices to note that P_{k} is not well-aligned (i.e. k\in\mathcal{A}^{C}) for only a small fraction of all iterations, which comes from the following Chernoff-type bound.

Lemma 3.14.

Suppose Assumption 3.7 holds. Then

\mathbb{P}\left[\#(\mathcal{A})\leq(1-\delta)(1-\delta_{S})(K+1)\right]\leq\exp\left(-\delta^{2}(1-\delta_{S})(K+1)/2\right), (62)

for any \delta\in(0,1).

Proof.

This is [9, Eq. (66)]. ∎

Lastly, we follow the argument from [9, Theorem 1] to get our main worst-case complexity bound.

Theorem 3.15.

Suppose Assumptions 2.3, 3.2, 3.3 and 3.7 hold, \theta>0 (24), \gamma_{\textnormal{inc}}>\gamma_{\textnormal{dec}}^{-1}, and \delta_{S}<\frac{1}{1+c_{5}} for c_{5} defined in Lemma 3.13. Then for any \epsilon>0 and

k\geq\frac{2\psi(\epsilon)}{1-\delta_{S}-c_{5}/(c_{5}+1)}-1, (63)

where \psi(\epsilon) is defined in (47), we have

\mathbb{P}\left[\min_{j\leq k}\sigma_{j}\leq\epsilon\right]\geq 1-\exp\left(-(k+1)\cdot\frac{[1-\delta_{S}-c_{5}/(c_{5}+1)]^{2}}{8(1-\delta_{S})}\right). (64)

Alternatively, if we define K_{\epsilon}:=\min\{k:\sigma_{k}\leq\epsilon\}, we have

\mathbb{P}\left[K_{\epsilon}\leq\left\lceil\frac{2\psi(\epsilon)}{1-\delta_{S}-c_{5}/(c_{5}+1)}-1\right\rceil\right]\geq 1-\exp\left(-\psi(\epsilon)\cdot\frac{1-\delta_{S}-c_{5}/(c_{5}+1)}{4(1-\delta_{S})}\right). (65)
Proof.

Fix some iteration k, and let \epsilon_{k}:=\min_{j\leq k}\sigma_{j} and A_{k} be the number of well-aligned iterations in \{0,\ldots,k\}. If \epsilon_{k}>0, from Lemma 3.13 we have

A_{k}\leq\psi(\epsilon_{k})+\frac{c_{5}}{c_{5}+1}(k+1). (66)

If we choose \delta\in(0,1) such that (1-\delta)(1-\delta_{S})>c_{5}/(c_{5}+1), then

\mathbb{P}\left[\psi(\epsilon_{k})\leq\left((1-\delta)(1-\delta_{S})-\frac{c_{5}}{c_{5}+1}\right)(k+1)\right] \leq\mathbb{P}\left[A_{k}\leq(1-\delta)(1-\delta_{S})(k+1)\right] (67)
\leq\exp\left(-\delta^{2}(1-\delta_{S})(k+1)/2\right), (68)

from Lemma 3.14. Now define

\delta:=\frac{1}{2}\left[1-\frac{c_{5}}{(c_{5}+1)(1-\delta_{S})}\right], (69)

which gives

(1-\delta)(1-\delta_{S})=\frac{1}{2}\left[1-\delta_{S}+\frac{c_{5}}{c_{5}+1}\right]>\frac{c_{5}}{c_{5}+1}, (70)

using 1-\delta_{S}>c_{5}/(c_{5}+1) from our assumption on \delta_{S}. Then (68) gives

\mathbb{P}\left[\psi(\epsilon_{k})\leq\frac{1}{2}\left(1-\delta_{S}-\frac{c_{5}}{c_{5}+1}\right)(k+1)\right] \leq\exp\left(-\delta^{2}(1-\delta_{S})(k+1)/2\right). (71)

If we have \epsilon_{k}=0, then since \lim_{\epsilon\to 0^{+}}\psi(\epsilon)=\infty, we again get (71), since the left-hand side of (71) is zero.

Now, fix \epsilon>0 and choose k satisfying (63). Since \psi(\epsilon) is non-increasing in \epsilon, we have

\mathbb{P}\left[\epsilon_{k}>\epsilon\right] \leq\mathbb{P}\left[\psi(\epsilon_{k})\leq\psi(\epsilon)\right] (72)
\leq\mathbb{P}\left[\psi(\epsilon_{k})\leq\frac{1}{2}\left(1-\delta_{S}-\frac{c_{5}}{c_{5}+1}\right)(k+1)\right] (73)
\leq\exp\left(-\delta^{2}(1-\delta_{S})(k+1)/2\right), (74)

where the second and third inequalities follow from (63) and (71) respectively. We substitute the definition of \delta (69) to get our desired result (64).

Lastly, define

k:=\left\lceil\frac{2\psi(\epsilon)}{1-\delta_{S}-c_{5}/(c_{5}+1)}-1\right\rceil, (75)

satisfying (63). Then (64) gives

\mathbb{P}[K_{\epsilon}>k]=\mathbb{P}[\epsilon_{k}>\epsilon] \leq\exp\left(-(k+1)\cdot\frac{[1-\delta_{S}-c_{5}/(c_{5}+1)]^{2}}{8(1-\delta_{S})}\right) (76)
\leq\exp\left(-\psi(\epsilon)\cdot\frac{1-\delta_{S}-c_{5}/(c_{5}+1)}{4(1-\delta_{S})}\right), (77)

where we get the last inequality by substituting (75), and we get (65). ∎

Summary of complexity

Our main result, Theorem 3.15, gives a worst-case complexity of O(\psi(\epsilon)) iterations to achieve \sigma_{k}\leq\epsilon, with high probability (and where that probability converges to 1 as \epsilon\to 0).

For convenience, define \kappa_{d}:=\max(\kappa_{\textnormal{ef}},\kappa_{\textnormal{eg}},\kappa_{\textnormal{eh}},\kappa_{H}). Then for \epsilon sufficiently small (specifically, we need \epsilon\leq\min(\gamma_{\textnormal{dec}}^{-1}\Delta_{0},\gamma_{\textnormal{dec}}^{-2}\Delta_{\max})/c_{2} to get \Delta_{\min}=\Theta(\epsilon), and \epsilon\leq\gamma_{\textnormal{inc}}/(\gamma_{\textnormal{dec}}c_{2}) to ensure \Delta_{\min}\leq 1, which is required for \phi(\Delta_{\min},\epsilon)=\Theta(\epsilon^{-1}\Delta_{\min}^{-2}); we also need \epsilon\leq 1 in order to ignore the term c_{4}(\gamma_{\textnormal{inc}}\Delta_{\min})\sim\log(\epsilon^{-1}) in \psi(\epsilon)), we have \Delta_{\min}=\Theta(c_{2}\epsilon)=\Theta(\kappa_{d}^{-1}\epsilon), and \psi(\epsilon)=\Theta(c_{1}^{-1}\epsilon^{-1}\Delta_{\min}^{-2}), with c_{1}=\Theta(\kappa_{\sigma}^{-1})=\Theta(\kappa_{d}^{-1}).

Hence, our overall (high probability) complexity bound is O(\kappa_{d}^{3}\epsilon^{-3}) iterations. If we assume O(p^{2}) evaluations per iteration, which is required for fully quadratic models, we get an alternative (high probability) complexity of O(p^{2}\kappa_{d}^{3}\epsilon^{-3}) objective evaluations. This matches the existing (full space) complexity bound for model-based DFO from [19, Chapter 4.2].

Remark 1.

Following the arguments from [9, Section 2.5], the above high-probability complexity bound can be used to derive alternative results, such as complexity bounds that hold in expectation and almost-sure convergence of a subsequence of \sigma_{k} to zero.

3.1 Generating Well-Aligned Subspaces

We now turn our attention to satisfying Assumption 3.7, that is, ensuring the matrix P_{k} is well-aligned with high probability. We can achieve this by choosing P_{k} to be a Johnson-Lindenstrauss Transform (JLT) (e.g. [34, Section 2.1]). This is a random matrix ensemble that approximately preserves norms of finite collections of vectors, the most common example being where the entries of P_{k} are chosen to be i.i.d. N(0,p^{-1}) distributed, although other constructions exist. See [31, Chapter 4] and [9, Section 2.6] for more details.

To satisfy Assumption 3.7, based on ideas from [31, Chapter 5] and [11], we choose P_{k} to be a JLT such that (with high probability)

(1-\alpha)\|\nabla f(\bm{x}_{k})\| \leq\|P_{k}^{T}\nabla f(\bm{x}_{k})\|\leq(1+\alpha)\|\nabla f(\bm{x}_{k})\|, (78a)
(1-\alpha)\|\bm{v}_{r}\| \leq\|\hat{\bm{v}}_{r}\|\leq(1+\alpha)\|\bm{v}_{r}\|, (78b)
(1-\alpha)\|\bm{v}_{i}+\bm{v}_{r}\| \leq\|\hat{\bm{v}}_{i}+\hat{\bm{v}}_{r}\|\leq(1+\alpha)\|\bm{v}_{i}+\bm{v}_{r}\|,\qquad\forall i=1,\ldots,r-1, (78c)
(1-\alpha)\|\bm{v}_{i}-\bm{v}_{r}\| \leq\|\hat{\bm{v}}_{i}-\hat{\bm{v}}_{r}\|\leq(1+\alpha)\|\bm{v}_{i}-\bm{v}_{r}\|,\qquad\forall i=1,\ldots,r-1, (78d)

all hold, where r:=\operatorname{rank}(\nabla^{2}f(\bm{x}_{k}))\leq n. This requires (approximately) preserving the norms of 2r vectors, and so can be achieved with subspace dimension p=O(\alpha^{-2}\log r). The additional requirement \|P_{k}\|\leq P_{\max} can also be achieved with high probability without any modification to p, with P_{\max}=\Theta(\sqrt{n/p}) typical, e.g. [30, Table 2].

The above immediately gives the first three conditions for a well-aligned subspace, (23a), (23b) and (23c). For the last condition (23d), we use the polarization identity (that is, \bm{x}^{T}\bm{y}=\frac{1}{4}(\|\bm{x}+\bm{y}\|^{2}-\|\bm{x}-\bm{y}\|^{2}) for any \bm{x},\bm{y}\in\mathbb{R}^{n}) to compute

\hat{\bm{v}}_{i}^{T}\hat{\bm{v}}_{r} =\frac{1}{4}\left(\|\hat{\bm{v}}_{i}+\hat{\bm{v}}_{r}\|^{2}-\|\hat{\bm{v}}_{i}-\hat{\bm{v}}_{r}\|^{2}\right) (79)
\leq\frac{1}{4}\left((1+\alpha)^{2}\|\bm{v}_{i}+\bm{v}_{r}\|^{2}-(1-\alpha)^{2}\|\bm{v}_{i}-\bm{v}_{r}\|^{2}\right) (80)
=(1+\alpha^{2})\bm{v}_{i}^{T}\bm{v}_{r}+\frac{1}{4}\left(2\alpha\|\bm{v}_{i}+\bm{v}_{r}\|^{2}+2\alpha\|\bm{v}_{i}-\bm{v}_{r}\|^{2}\right) (81)
=2\alpha, (82)

where the last equality uses the fact that \bm{v}_{i} and \bm{v}_{r} are orthonormal. Similar reasoning yields \hat{\bm{v}}_{i}^{T}\hat{\bm{v}}_{r}\geq-2\alpha, and so (23d) is achieved.

In terms of the subspace fully quadratic constants, we note that \hat{f}_{k}(\hat{\bm{s}}):=f(\bm{x}_{k}+P_{k}\hat{\bm{s}}) has (P_{\max}^{3}L_{H})-Lipschitz continuous Hessian under Assumption 3.2. Recalling the definition \kappa_{d}:=\max(\kappa_{\textnormal{ef}},\kappa_{\textnormal{eg}},\kappa_{\textnormal{eh}}), for models built using \Lambda-poised fully quadratic interpolation of a function with L-Lipschitz continuous Hessian, we have \kappa_{d}=O(p_{1}^{3/2}L\Lambda) [13, Theorems 3.14 & 3.16], where p_{1} is the number of interpolation points needed. So, using full space methods we have the standard values \kappa_{d}=O(n^{3}L_{H}\Lambda).

For our subspace models, we have p_{1}=O(p^{2}) and L=O(P_{\max}^{3}L_{H}), and so we have \kappa_{d}=O(p^{3}P_{\max}^{3}L_{H}\Lambda). So, using JLTs as above with p=O(\log r) and P_{\max}=O(\sqrt{n/p}), we get \kappa_{d}=O(p^{3/2}n^{3/2}L_{H}\Lambda)=\tilde{O}(n^{3/2}L_{H}\Lambda).

That is, full space methods use p=n with \kappa_{d}=O(n^{3}), while subspace methods require only p=O(\log r)\leq O(\log n) with \kappa_{d}=\tilde{O}(n^{3/2}), a significant improvement for large n. Specifically, using the complexity of O(\kappa_{d}^{3}\epsilon^{-3}) iterations from above, the dependency on n decreases from O(n^{9}) iterations for full-space methods to \tilde{O}(n^{4.5}) for subspace methods. This is summarized in Table 1.

Method      p            P_max                \kappa_{d}           Iteration WCC                     Evaluation WCC
Full space  n            1                    O(n^{3})             O(n^{9}\epsilon^{-3})             O(n^{11}\epsilon^{-3})
Subspace    O(\log r)    \Theta(\sqrt{n/p})   \tilde{O}(n^{1.5})   \tilde{O}(n^{4.5}\epsilon^{-3})   \tilde{O}(n^{4.5}\epsilon^{-3})
Table 1: Summary of relevant constants for full space and subspace methods, based on \kappa_{d}=O(p^{3}P_{\max}^{3}), iteration complexity O(\kappa_{d}^{3}\epsilon^{-3}) and evaluation complexity O(p^{2}\kappa_{d}^{3}\epsilon^{-3}) (WCC = worst-case complexity).
Remark 2.

A limitation of the above theory is the requirement θ>0\theta>0 (24). If we pick α1\alpha\ll 1 and assume M=O(1)M=O(1), then we get α2=O(ϵ/r)\alpha^{2}=O(\epsilon/r). To apply the JLT, we actually need pO(α2)=O(rϵ1)p\geq O(\alpha^{-2})=O(r\epsilon^{-1}), which gives p=O(ϵ1rlogr)p=O(\epsilon^{-1}r\log r). Hence for θ>0\theta>0 to hold, we need to seek low accuracy solutions, or have a situation where rnr\ll n, i.e. the problem has low effective rank. This structure does arise in practice, see [7, Section 1], [6, Section 1] and references therein for examples, with a notable example being finetuning of large language models, where scalable DFO methods are practically useful [25].

In the case where $P_{k}$ is generated using scaled Gaussian matrices (i.e. each entry of $P_{k}$ is i.i.d. $N(0,p^{-1})$), this bound can be improved (see [11, Lemma 6.6] or [31, Lemma 5.6.2]). Specifically, we compute

(\hat{\bm{v}}_{i}^{T}\hat{\bm{v}}_{r})^{2}=\|\hat{\bm{v}}_{r}\|^{2}\left(\hat{\bm{v}}_{i}^{T}\frac{\hat{\bm{v}}_{r}}{\|\hat{\bm{v}}_{r}\|}\right)^{2}\leq(1+\alpha)^{2}\,O(p^{-1}), (83)

since $\hat{\bm{v}}_{i}^{T}\bm{y}$ is an $N(0,p^{-1})$-distributed random variable for any unit vector $\bm{y}$. In this case, we get a different expression for $\theta$, namely

\theta=(1-\alpha)^{2}-\frac{M(r-1)(1+\alpha)^{2}O(p^{-1})}{\epsilon(1-\alpha)^{2}}, (84)

and so for $\theta>0$ we still require either low accuracy solutions or low effective rank, via $\epsilon>O(rp^{-1})$, but without the stringent requirement on $\alpha$ (and hence on $p$).
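
The $O(p^{-1})$ factor in (83) can also be checked empirically. The following illustrative snippet (sizes chosen arbitrarily) estimates $\mathbb{E}[(\hat{\bm{v}}_{i}^{T}\hat{\bm{v}}_{r})^{2}]$ for scaled Gaussian $P_{k}$ over several values of $p$; the product with $p$ should be roughly constant:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 500, 500
V, _ = np.linalg.qr(rng.standard_normal((n, 2)))
v_i, v_r = V[:, 0], V[:, 1]          # orthonormal directions in R^n

for p in (10, 40, 160):
    vals = []
    for _ in range(trials):
        P = rng.standard_normal((n, p)) / np.sqrt(p)   # entries i.i.d. N(0, 1/p)
        vh_i, vh_r = P.T @ v_i, P.T @ v_r
        vals.append((vh_i @ vh_r) ** 2)
    # E[(vh_i^T vh_r)^2] scales like 1/p, matching the O(p^{-1}) factor in (83),
    # so the printed product should be roughly constant in p.
    print(p, np.mean(vals) * p)
```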

4 Practical Implementation

In [9, Section 4], DFBGN, a practical implementation of Algorithm 1 for nonlinear least-squares problems, was presented, based on linear interpolation models (one for each residual in the least-squares sum). DFBGN was designed to use objective evaluations efficiently, a key principle of DFO methods (which are often used when objective evaluations are slow or expensive to obtain). In this section, we outline RSDFO-Q (random subspace DFO with quadratic models), a practical implementation of the ideas in Algorithms 1 and 2 for general objectives (1), based on quadratic interpolation models.

At the core of DFBGN is the maintenance of an interpolation set comprising $p+1$ points including the current iterate, say $\{\bm{x}_{k},\bm{x}_{k}+\bm{d}_{1},\ldots,\bm{x}_{k}+\bm{d}_{p}\}$ for some linearly independent set $\bm{d}_{1},\ldots,\bm{d}_{p}\in\mathbb{R}^{n}$. Points added to this set come either from a trust-region step (5) or have the form $\bm{x}_{k}+\Delta_{k}\bm{d}$ for random (mutually orthogonal) unit vectors $\bm{d}$, orthogonal to all current $\bm{d}_{i}$ (which is optimal in a specific sense, see [12, Theorem 4]). Points are removed from this set using geometric considerations to ensure the full linearity of the resulting model, following the approach in [29, 8]. Given this interpolation set, the subspace $P_{k}$ in Algorithm 2 is effectively defined by $P_{k}=[\bm{d}_{1}\>\cdots\>\bm{d}_{p}]$.

For a general objective (1), quadratic interpolation models, with at least $p+2$ interpolation points (including $\bm{x}_{k}$), are typically preferred to linear models. However, in this situation we lose an important feature of DFBGN, namely that every interpolation point (except $\bm{x}_{k}$) defines a direction in the subspace. In this section we give a practical approach that addresses this issue, yielding a subspace method for solving the general problem (1) based on (underdetermined) quadratic interpolation models.

As with DFBGN, the design of RSDFO-Q is based on the following principles:

  • Low linear algebra cost: the per-iteration linear algebra cost should be linear in $n$ (but may be higher in $p$, since the algorithm is designed for $p\ll n$).

  • Efficient use of objective evaluations: objective evaluations should be re-used where possible, while restricting exploration at each iteration to a $p$-dimensional subspace. In particular, the algorithm with $p=n$ should have broadly comparable performance to state-of-the-art (full-space) DFO methods. (Given the added flexibility of subspace methods to set $p\ll n$, and their resulting ability to handle problems with much larger $n$, it is unlikely that a subspace method with $p=n$ will match or outperform state-of-the-art full-space methods.)

The key idea in extending the DFBGN approach to quadratic models is to maintain two sets of interpolation points. To work in a $p$-dimensional subspace with at most $p+2\leq q\leq(p+1)(p+2)/2$ interpolation points, we maintain a ‘primary’ set of $p+1$ interpolation points $\mathcal{Y}_{1}=\{\bm{x}_{k},\bm{x}_{k}+\bm{d}_{1},\ldots,\bm{x}_{k}+\bm{d}_{p}\}$, which defines the current subspace $P_{k}=[\bm{d}_{1}\>\cdots\>\bm{d}_{p}]$ just as in DFBGN. We separately maintain a ‘secondary’ set $\mathcal{Y}_{2}$ of at most $q-p-1$ interpolation points, disjoint from the primary set. Points in the secondary set are only used for model construction, whereas points in the primary set are used for model construction as well as for defining the subspace $P_{k}$ and for geometry management. Whenever points are removed from the primary set, which is done based on similar principles to DFBGN, they are added to the secondary set (and the oldest points are deleted from the secondary set if it grows too large). We note that the case $q=(p+1)(p+2)/2$ can in general yield fully quadratic (subspace) models, as required for Algorithm 2, while smaller values of $q$ yield fully linear (subspace) models, as required for Algorithm 1; see [13, Chapters 3 & 5] for details.

Given the subspace defined by $P_{k}$, for convenience we work with an orthonormal basis $Q_{k}$ for the same space. By construction of $P_{k}$, all primary interpolation points lie in the subspace, but the secondary points generally will not. Hence, before building an interpolation model, we project all secondary interpolation points into the subspace via

\hat{\bm{s}}=Q_{k}Q_{k}^{T}(\bm{y}-\bm{x}_{k}),\qquad\forall\bm{y}\in\mathcal{Y}_{2}. (85)

For notational convenience, we write $\hat{\bm{s}}=\bm{y}-\bm{x}_{k}$ for all $\bm{y}\in\mathcal{Y}_{1}$. Given these projected points, the subspace model (3) is constructed using minimum Frobenius norm quadratic interpolation [28]. That is, we solve the equality-constrained QP

\min_{\hat{\bm{g}},\hat{H}}\|\hat{H}-\widetilde{H}_{k-1}\|_{F}^{2}\qquad\text{s.t.}\quad\hat{H}=\hat{H}^{T},\quad\hat{m}(\hat{\bm{s}})=f(\bm{y}),\>\forall\bm{y}\in\mathcal{Y}_{1}\cup\mathcal{Y}_{2}, (86)

where for any $\bm{y}\in\mathcal{Y}_{1}\cup\mathcal{Y}_{2}$ we have the associated $\hat{\bm{s}}$ defined above, and $\widetilde{H}_{k-1}:=Q_{k}^{T}Q_{k-1}\hat{H}_{k-1}Q_{k-1}^{T}Q_{k}$ is the projection of $\hat{H}_{k-1}$ into the current subspace (with $\hat{H}_{-1}=0$).
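
For concreteness, the following NumPy sketch shows one standard way to solve (86), via the KKT system of the least Frobenius norm formulation of [28], working in subspace coordinates. It is a minimal illustration under these assumptions (function and variable names are ours), not the implementation used in our experiments.

```python
import numpy as np

def mfn_quadratic_model(S_hat, f_vals, H_ref):
    """Minimum Frobenius norm quadratic interpolation in subspace coordinates.

    S_hat : (m, p) array of interpolation displacements (one row per point in
            Y1 u Y2, expressed in the subspace; the row for x_k is zero).
    f_vals: (m,) objective values at the corresponding points.
    H_ref : (p, p) reference Hessian (here the projected previous Hessian).
    Returns (c, g, H) with m(s) = c + g^T s + 0.5 s^T H s interpolating f_vals
    and H - H_ref of minimum Frobenius norm.
    """
    m, p = S_hat.shape
    # Shifted values: r_i = f_i - 0.5 s_i^T H_ref s_i.
    r = f_vals - 0.5 * np.einsum('ij,jk,ik->i', S_hat, H_ref, S_hat)
    A = 0.5 * (S_hat @ S_hat.T) ** 2                 # A_ij = 0.5 (s_i^T s_j)^2
    X = np.hstack([np.ones((m, 1)), S_hat])          # rows [1, s_i^T]
    K = np.block([[A, X], [X.T, np.zeros((p + 1, p + 1))]])
    rhs = np.concatenate([r, np.zeros(p + 1)])
    sol = np.linalg.solve(K, rhs)                    # assumes K nonsingular (poisedness)
    lam, c, g = sol[:m], sol[m], sol[m + 1:]
    H = H_ref + (S_hat.T * lam) @ S_hat              # H_ref + sum_j lam_j s_j s_j^T
    return c, g, H
```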

1: Inputs: initial iterate $\bm{x}_{0}\in\mathbb{R}^{n}$, subspace dimension $p\in\{1,\ldots,n\}$, maximum number of interpolation points $p+2\leq q\leq(p+1)(p+2)/2$.
2: Parameters: initial trust-region radius $\Delta_{0}>0$, maximum trust-region radius $\Delta_{\max}$, trust-region radius thresholds $0<\gamma_{S},\gamma_{\textnormal{dec}}<1<\gamma_{\textnormal{inc}}\leq\overline{\gamma}_{\textnormal{inc}}$ and $0<\alpha_{1}\leq\alpha_{2}<1$, acceptance thresholds $0<\eta_{1}\leq\eta_{2}<1$, and minimum number of steps between $\rho_{k}$ reductions $N\in\mathbb{N}$.
3: Define the primary interpolation set $\mathcal{Y}_{1}:=\{\bm{x}_{0},\bm{x}_{0}+\Delta_{0}\bm{d}_{1},\ldots,\bm{x}_{0}+\Delta_{0}\bm{d}_{p}\}$ for $p$ random orthonormal vectors $\bm{d}_{1},\ldots,\bm{d}_{p}$, and the secondary interpolation set $\mathcal{Y}_{2}=\emptyset$.
4: Set $H_{-1}=0\in\mathbb{R}^{p\times p}$, $Q_{-1}\in\mathbb{R}^{n\times p}$ arbitrary and $\rho_{0}=\Delta_{0}$.
5: for $k=0,1,2,\ldots$ do
6:     Build $Q_{k}\in\mathbb{R}^{n\times p}$ with orthonormal columns forming a basis for $\{\bm{y}-\bm{x}_{k}:\bm{y}\in\mathcal{Y}_{1}\setminus\{\bm{x}_{k}\}\}$.
7:     Build the subspace quadratic model $\hat{m}_{k}$ (3) using minimum Frobenius norm quadratic interpolation (86).
8:     Calculate a step $\hat{\bm{s}}_{k}$ by approximately solving the trust-region subproblem (5).
9:     Set CAN_REDUCE_RHO=TRUE if $k\geq N$, $\rho_{k-N}=\rho_{k}$ and $\min(\|\hat{\bm{s}}_{j}\|,\Delta_{j})\leq\rho_{j}$ for all $j=k-N,\ldots,k$, and FALSE otherwise.
10:     if $\|\hat{\bm{s}}_{k}\|<\gamma_{S}\rho_{k}$ then
11:         (safety) Set $R_{k}=-1$, $\Delta_{k+1}=\max(\gamma_{\textnormal{dec}}\Delta_{k},\rho_{k})$ and $\bm{x}_{k+1}=\bm{x}_{k}$.
12:         If CAN_REDUCE_RHO=FALSE or $\Delta_{k}>\rho_{k}$, then remove 1 point from $\mathcal{Y}_{1}$ using Algorithm 4.
13:     else
14:         Evaluate $f(\bm{x}_{k}+Q_{k}\hat{\bm{s}}_{k})$ and calculate the ratio $R_{k}$ (6).
15:         Update the trust-region radius:
\Delta_{k+1}=\begin{cases}\max(\min(\gamma_{\textnormal{dec}}\Delta_{k},\|\hat{\bm{s}}_{k}\|),\rho_{k}),&R_{k}<\eta_{1},\\ \max(\gamma_{\textnormal{dec}}\Delta_{k},\|\hat{\bm{s}}_{k}\|,\rho_{k}),&\eta_{1}\leq R_{k}\leq\eta_{2},\\ \min(\max(\gamma_{\textnormal{inc}}\Delta_{k},\overline{\gamma}_{\textnormal{inc}}\|\hat{\bm{s}}_{k}\|),\Delta_{\max}),&R_{k}>\eta_{2}.\end{cases} (87)
16:         Set $\bm{x}_{k+1}=\bm{x}_{k}+Q_{k}\hat{\bm{s}}_{k}$ if $R_{k}>0$ and $\bm{x}_{k+1}=\bm{x}_{k}$ otherwise.
17:         if $p<n$ then
18:             Set $\mathcal{Y}_{1}=\mathcal{Y}_{1}\cup\{\bm{x}_{k}+Q_{k}\hat{\bm{s}}_{k}\}$.
19:             Move $\min(\max(p_{\textnormal{drop}},2),p)$ points from $\mathcal{Y}_{1}$ to $\mathcal{Y}_{2}$ using Algorithm 4.
20:         else
21:             Move one point from $\mathcal{Y}_{1}$ to $\mathcal{Y}_{2}$ using Algorithm 4 and set $\mathcal{Y}_{1}=\mathcal{Y}_{1}\cup\{\bm{x}_{k}+Q_{k}\hat{\bm{s}}_{k}\}$.
22:             Move $\min(\max(p_{\textnormal{drop}},1),p)$ points from $\mathcal{Y}_{1}$ to $\mathcal{Y}_{2}$ using Algorithm 4.
23:         end if
24:     end if
25:     (reduce $\rho_{k}$) If $R_{k}<0$, $\Delta_{k}\leq\rho_{k}$, and CAN_REDUCE_RHO=TRUE, then set $\rho_{k+1}=\alpha_{1}\rho_{k}$ and $\Delta_{k+1}=\alpha_{2}\rho_{k}$. Otherwise, set $\rho_{k+1}=\rho_{k}$.
26:     (add new points) Let $p_{\textnormal{add}}:=p+1-|\mathcal{Y}_{1}|$ and generate random orthonormal vectors $\{\bm{d}_{1},\ldots,\bm{d}_{p_{\textnormal{add}}}\}$ that are also orthogonal to $\{\bm{y}-\bm{x}_{k+1}:\bm{y}\in\mathcal{Y}_{1}\setminus\{\bm{x}_{k+1}\}\}$.
27:     Set $\mathcal{Y}_{1}=\mathcal{Y}_{1}\cup\{\bm{x}_{k+1}+\Delta_{k+1}\bm{d}_{1},\ldots,\bm{x}_{k+1}+\Delta_{k+1}\bm{d}_{p_{\textnormal{add}}}\}$, and let $\bm{x}_{k+1}$ be the point in $\mathcal{Y}_{1}$ with smallest objective value.
28: end for
Algorithm 3 RSDFO-Q (Random subspace DFO with quadratic models) for (1).

A complete description of RSDFO-Q is given in Algorithm 3. Aside from the more complex model construction in (86), it closely follows the overall structure of DFBGN [9]. The trust-region updating mechanism is more involved than in DFBGN: it maintains a lower bound $\rho_{k}\leq\Delta_{k}$, where $\rho_{k}$ is only periodically reduced, and it does not evaluate the objective at the tentative trust-region step if that step is too short. Both of these mechanisms are drawn from BOBYQA [29].
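
The radius update (87) itself is straightforward; a minimal sketch (parameter defaults as in Remark 3; names are illustrative) is:

```python
def update_tr_radius(Delta, s_norm, R, rho, Delta_max,
                     gamma_dec=0.5, gamma_inc=2.0, gamma_inc_bar=4.0,
                     eta1=0.1, eta2=0.7):
    """Sketch of the trust-region radius update (87); names are illustrative."""
    if R < eta1:        # unsuccessful: shrink, but never below the lower bound rho
        return max(min(gamma_dec * Delta, s_norm), rho)
    if R <= eta2:       # moderately successful: keep roughly the same size
        return max(gamma_dec * Delta, s_norm, rho)
    # very successful: increase, capped at Delta_max
    return min(max(gamma_inc * Delta, gamma_inc_bar * s_norm), Delta_max)
```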

There are two situations in which points are removed from the primary interpolation set $\mathcal{Y}_{1}$: either a single point is removed (lines 12 and 21), or potentially multiple points are removed (lines 19 and 22). In the single-point case, Algorithm 4 gives details of how this is done; in the multiple-point case, [9, Algorithm 4] is used to select the points to remove. These two approaches are very similar, differing only in whether the tentative trust-region step is used in the Lagrange polynomial evaluation: both try to remove points which are far from the current iterate and/or degrade the geometry of the primary interpolation set. In both cases, the points removed from $\mathcal{Y}_{1}$ are added to $\mathcal{Y}_{2}$ (with the oldest points removed from $\mathcal{Y}_{2}$ if it grows too large). Following the heuristic from DFBGN [9, Section 4.5], the number of points $p_{\textnormal{drop}}$ to remove in lines 19 and 22 is $p/10$ for unsuccessful steps ($R_{k}<0$) and 1 otherwise.

Remark 3.

The default parameter choices in Algorithm 3 are $\Delta_{0}=0.1\max(\|\bm{x}_{0}\|_{\infty},1)$, $\Delta_{\max}=10^{10}$, $\gamma_{S}=\gamma_{\textnormal{dec}}=0.5$, $\gamma_{\textnormal{inc}}=2$, $\overline{\gamma}_{\textnormal{inc}}=4$, $\alpha_{1}=0.1$, $\alpha_{2}=0.5$, $\eta_{1}=0.1$, $\eta_{2}=0.7$, and $N=5$.

1: Inputs: primary interpolation set $\mathcal{Y}_{1}$ with $p+1$ points including the current iterate $\bm{x}_{k}$, secondary interpolation set $\mathcal{Y}_{2}$ with at most $q-p-1$ points, tentative trust-region step $Q_{k}\hat{\bm{s}}_{k}$, trust-region radius $\Delta_{k}>0$.
2: Compute the linear Lagrange polynomials for all points in $\mathcal{Y}_{1}=\{\bm{x}_{k+1},\bm{y}_{1},\ldots,\bm{y}_{p}\}$.
3: Let $\bm{y}_{t}\in\mathcal{Y}_{1}\setminus\{\bm{x}_{k+1}\}$ be the point that maximizes
\theta_{t}=|\ell_{t}(\bm{x}_{k}+Q_{k}\hat{\bm{s}}_{k})|\cdot\max\left(\frac{\|\bm{y}_{t}-\bm{x}_{k}\|^{4}}{\Delta_{k}^{4}},1\right). (88)
4: Set $\mathcal{Y}_{1}=\mathcal{Y}_{1}\setminus\{\bm{y}_{t}\}$ and $\mathcal{Y}_{2}=\mathcal{Y}_{2}\cup\{\bm{y}_{t}\}$.
5: If $|\mathcal{Y}_{2}|>q-p-1$, remove the oldest point from $\mathcal{Y}_{2}$.
Algorithm 4 Remove a single point from $\mathcal{Y}_{1}$.
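
To illustrate the criterion (88), the following sketch computes the linear Lagrange polynomial values at the tentative step by solving the associated interpolation system in subspace coordinates. It is a simplified illustration under these assumptions (our own naming and interface), not the exact implementation.

```python
import numpy as np

def point_to_remove(S_hat, s_hat_step, Delta):
    """Illustrative sketch of criterion (88) in subspace coordinates.

    S_hat      : (p, p) array whose rows are the displacements y_t - x_k of the
                 non-iterate points of Y1, written in the subspace basis Q_k.
    s_hat_step : (p,) tentative trust-region step in the same coordinates.
    Delta      : current trust-region radius.
    Returns the index t of the point maximizing theta_t in (88).
    """
    p = S_hat.shape[0]
    # Linear Lagrange polynomials on {0, s_1, ..., s_p}: the vector of values
    # ell(s) at a point s solves M^T ell = [1; s], where M has rows [1, s_j^T].
    M = np.hstack([np.ones((p + 1, 1)), np.vstack([np.zeros(p), S_hat])])
    ell = np.linalg.solve(M.T, np.concatenate([[1.0], s_hat_step]))
    dists = np.linalg.norm(S_hat, axis=1)   # ||y_t - x_k|| for the non-iterate points
    theta = np.abs(ell[1:]) * np.maximum((dists / Delta) ** 4, 1.0)
    return int(np.argmax(theta))
```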

5 Numerical Results

We now demonstrate the numerical performance of RSDFO-Q in light of the design principles described in Section 4. In particular, we benchmark against Py-BOBYQA [4], a state-of-the-art full-space DFO method based on the same minimum Frobenius norm quadratic interpolation mechanism, which also shares many of the features of the trust-region updating mechanism in Algorithm 3.

5.1 Testing Framework

We test our solvers on two collections of problems from the CUTEst set [20, 18]:

  • The medium-scale collection ‘CFMR’ of 90 problems with $25\leq n\leq 120$ (with $n=100$ most common) from [4, Section 7].

  • The large-scale collection ‘CR-large’ of 28 problems with $1000\leq n\leq 5000$ (with $n=1000$ most common) from [9, Appendix C].

We terminate all solvers after $100(n+1)$ objective evaluations or 12 hours of (wall-clock) runtime, whichever occurs first, and run each randomized solver 10 times on each problem (from the same starting point) to assess average-case behavior. All solvers are implemented in Python with key linear algebra routines implemented in NumPy, and the objective evaluation time for all problems was negligible. All solvers had the same starting point and initial trust-region radius.

For each problem instance and solver, we measure the total number of objective evaluations required to find a point $\bm{x}$ with $f(\bm{x})\leq f_{\min}+\tau(f(\bm{x}_{0})-f_{\min})$, where $f_{\min}$ is the true minimum listed in [4, 9] and $\tau\ll 1$ is an accuracy level (smaller values mean higher accuracy solutions). In our results we show accuracy levels $\tau\in\{10^{-1},10^{-3}\}$, as it was observed in [9, Section 5] that subspace methods are best suited to finding lower-accuracy solutions. If a solver hits the evaluation budget or runtime limit before a sufficiently accurate point is found, the number of evaluations for that instance is taken to be $\infty$.

Solvers are compared using data profiles [27], which show the proportion of problem instances solved within a given number of evaluations, measured in units of $(n+1)$ for problems in $\mathbb{R}^{n}$ (i.e. the cost of estimating a full gradient by finite differences), and performance profiles [14], which show the proportion of problem instances solved within a given ratio of the minimum evaluations required by any instance of any solver.
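
For reference, both profiles can be computed from a matrix of per-instance evaluation counts with a few lines of NumPy. The sketch below (our own helper names; unsolved instances recorded as $\infty$) follows the definitions of [27] and [14]:

```python
import numpy as np

def data_profile(evals, n_dims, budgets):
    """evals[i, j]: evaluations for solver j to solve problem instance i (np.inf
    if unsolved); n_dims[i]: dimension of problem i; budgets: budgets in units of
    (n+1) evaluations. Returns (len(budgets), n_solvers) solved proportions."""
    scaled = evals / (np.asarray(n_dims)[:, None] + 1.0)
    return np.array([(scaled <= b).mean(axis=0) for b in budgets])

def performance_profile(evals, ratios):
    """Proportion of instances solved within a given ratio of the best solver's
    evaluation count (assumes every instance is solved by at least one solver)."""
    best = np.min(evals, axis=1, keepdims=True)
    return np.array([((evals / best) <= r).mean(axis=0) for r in ratios])
```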

5.2 Medium-Scale Results

We first show results for medium-scale problems based on the ‘CFMR’ problem collection ($n\approx 100$); the 12-hour runtime limit was not reached for any of these problem instances. In Figure 1, we compare Py-BOBYQA with $q\in\{n+1,n+2,2n+1\}$ interpolation points (i.e. linear or underdetermined quadratic interpolation models) against RSDFO-Q with $q=2p+1$ interpolation points and subspace dimension $p\in\{n/10,n/4,n/2,n\}$.

We see that the full-space method Py-BOBYQA is the best-performing algorithm, with larger numbers of interpolation points $q$ giving the best performance. Among the subspace methods, those with larger choices of $p$ achieve the best performance, aligning with the results in [9]. In particular (cf. the principles in Section 4), RSDFO-Q with full-dimensional subspaces $p=n$ (and $q=2p+1$ interpolation points) achieves broadly comparable performance to Py-BOBYQA (worse than Py-BOBYQA with $q=2n+1$ points but better than with $q=n+2$ points), while having the significant additional flexibility of working in a subspace of arbitrary dimension $p$.

Figure 1: Comparisons for medium-scale problems, $n\approx 100$. Panels: (a) data profile, $\tau=10^{-1}$; (b) performance profile, $\tau=10^{-1}$; (c) data profile, $\tau=10^{-3}$; (d) performance profile, $\tau=10^{-3}$.

5.3 Large-Scale Results

In Figure 2 we show the same results (based on total objective evaluations) as Figure 1, but for the large-scale test set ‘CR-large’ ($n\approx 1000$). Here, we observe that Py-BOBYQA with quadratic models ($q=2n+1$ and $q=n+2$) fails to solve any problems: its significant per-iteration linear algebra costs (model construction in each iteration of Py-BOBYQA with $q>n+1$ requires the solution of a $(q+n+1)\times(q+n+1)$ linear system) mean that it hits the runtime limit quickly. The only Py-BOBYQA variant able to make progress uses linear models with $q=n+1$. However, this variant is outperformed by all variants of the subspace method RSDFO-Q (including $p=n$). We find that medium-dimensional subspaces $p\in\{n/10,n/4\}$ (i.e. $p\approx 100$ and $p\approx 250$) are the best-performing variants.

For comparison, in Figure 3 we show data and performance profiles based on the runtime required to find a point with sufficient decrease. Among the variants of RSDFO-Q, the low-dimensional variants $p\in\{n/10,n/100\}$ are the best-performing, at least initially. However, the increased robustness of the $p=n/4$ variant (accompanied by an increase in per-iteration linear algebra cost) means it ultimately outperforms the $p=n/100$ variant.

Figure 2: Comparisons for large-scale problems, $n\approx 1000$. Runtime capped at 12 hours per problem instance for all solvers. Panels: (a) data profile, $\tau=10^{-1}$; (b) performance profile, $\tau=10^{-1}$; (c) data profile, $\tau=10^{-3}$; (d) performance profile, $\tau=10^{-3}$.
Figure 3: Comparisons for large-scale problems, $n\approx 1000$, based on the runtime required to solve the problem (rather than objective evaluations). Runtime capped at 12 hours per problem instance for all solvers. Panels: (a) data profile, $\tau=10^{-1}$; (b) performance profile, $\tau=10^{-1}$; (c) data profile, $\tau=10^{-3}$; (d) performance profile, $\tau=10^{-3}$.

6 Conclusion

We have provided the first second-order convergence result for a random subspace model-based DFO method, and shown that, compared to full-space model-based DFO, these approaches can achieve a significant improvement in the dimension dependence of the complexity bound, from $O(n^{11}\epsilon^{-3})$ to $\tilde{O}(n^{4.5}\epsilon^{-3})$ evaluations to find an $\epsilon$-approximate second-order critical point of an $n$-dimensional problem. This theory is most relevant when seeking low accuracy solutions or when the objective has low effective rank (i.e. low rank Hessians).

We have also introduced a practical algorithm, RSDFO-Q, for general unconstrained model-based DFO based on the iterative construction of quadratic models in subspaces. This extends the practical method from [9], which was limited to linear models (and specialized to nonlinear least-squares problems). Numerical comparisons with the state-of-the-art method Py-BOBYQA show that RSDFO-Q can solve problems of significantly higher dimension ($n\approx 1000$) by exploiting a significantly reduced per-iteration linear algebra cost, while RSDFO-Q with full-dimensional subspaces achieves comparable performance to Py-BOBYQA on medium-scale problems ($n\approx 100$).

Interesting extensions of this approach would be to consider the case of stochastic objective evaluations, similar to [16] for the first-order theory, and a practical comparison of RSDFO-Q on large language model fine-tuning problems, which typically have low effective rank [25].

Acknowledgement(s)

We acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility (http://dx.doi.org/10.5281/zenodo.22558) in carrying out this work.

Funding

LR was supported by the Australian Research Council Discovery Early Career Award DE240100006. CC was supported by the Hong Kong Innovation and Technology Commission (InnoHK CIMDA Project).

References

  • [1] S. Alarie, C. Audet, A.E. Gheribi, M. Kokkolaras, and S. Le Digabel, Two decades of blackbox optimization applications, EURO Journal on Computational Optimization 9 (2021), p. 100011.
  • [2] C. Audet and W. Hare, Derivative-Free and Blackbox Optimization, Springer Series in Operations Research and Financial Engineering, Springer, Cham, Switzerland, 2017.
  • [3] J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg, Convergence rate analysis of a stochastic trust-region method via supermartingales, INFORMS Journal on Optimization 1 (2019), pp. 92–119.
  • [4] C. Cartis, J. Fiala, B. Marteau, and L. Roberts, Improving the flexibility and robustness of model-based derivative-free optimization solvers, ACM Transactions on Mathematical Software 45 (2019), pp. 1–41.
  • [5] C. Cartis, J. Fowkes, and Z. Shao, Randomised subspace methods for non-convex optimization, with applications to nonlinear least-squares, arXiv preprint arXiv:2211.09873 (2022).
  • [6] C. Cartis, X. Liang, E. Massart, and A. Otemissov, Learning the subspace of variation for global optimization of functions with low effective dimension, arXiv preprint arXiv:2401.17825 (2024).
  • [7] C. Cartis and A. Otemissov, A dimensionality reduction technique for unconstrained global optimization of functions with low effective dimensionality, Information and Inference: A Journal of the IMA 11 (2022), pp. 167–201.
  • [8] C. Cartis and L. Roberts, A derivative-free Gauss–Newton method, Mathematical Programming Computation 11 (2019), pp. 631–674.
  • [9] C. Cartis and L. Roberts, Scalable subspace methods for derivative-free nonlinear least-squares optimization, Mathematical Programming 199 (2023), pp. 461–524.
  • [10] C. Cartis, L. Roberts, and O. Sheridan-Methven, Escaping local minima with local derivative-free methods: A numerical investigation, Optimization 71 (2022), pp. 2343–2373.
  • [11] C. Cartis, Z. Shao, and E. Tansley, Random subspace cubic regularization methods, December (2024).
  • [12] Y. Chen, W. Hare, and A. Wiebe, Q-fully quadratic modeling and its application in a random subspace derivative-free method, Computational Optimization and Applications (2024).
  • [13] A.R. Conn, K. Scheinberg, and L.N. Vicente, Introduction to Derivative-Free Optimization, MPS-SIAM Series on Optimization, Society for Industrial and Applied Mathematics, Philadelphia, 2009.
  • [14] E.D. Dolan and J.J. Moré, Benchmarking optimization software with performance profiles, Mathematical Programming 91 (2002), pp. 201–213.
  • [15] K.J. Dzahini and S.M. Wild, Direct search for stochastic optimization in random subspaces with zeroth-, first-, and second-order convergence and expected complexity, arXiv preprint arXiv:2403.13320 (2024).
  • [16] K.J. Dzahini and S.M. Wild, Stochastic trust-region algorithm in random subspaces with convergence and expected complexity analyses, SIAM Journal on Optimization 34 (2024), pp. 2671–2699.
  • [17] M.J. Ehrhardt and L. Roberts, Inexact derivative-free optimization for bilevel learning, Journal of Mathematical Imaging and Vision 63 (2021), pp. 580–600.
  • [18] J. Fowkes, L. Roberts, and Á. Bűrmen, PyCUTEst: An open source Python package of optimization test problems, Journal of Open Source Software 7 (2022), p. 4377.
  • [19] R. Garmanjani, Trust-region methods without using derivatives: Worst case complexity and the nonsmooth case, Ph.D. diss., Universidade de Coimbra, 2015.
  • [20] N.I.M. Gould, D. Orban, and P.L. Toint, CUTEst: A constrained and unconstrained testing environment with safe threads for mathematical optimization, Computational Optimization and Applications 60 (2015), pp. 545–557.
  • [21] S. Gratton, C.W. Royer, L.N. Vicente, and Z. Zhang, Direct search based on probabilistic descent, SIAM Journal on Optimization 25 (2015), pp. 1515–1541.
  • [22] M. Kimiaei, A. Neumaier, and P. Faramarzi, New subspace method for unconstrained derivative-free optimization, ACM Transactions on Mathematical Software 49 (2023), pp. 1–28.
  • [23] D. Kozak, S. Becker, A. Doostan, and L. Tenorio, A stochastic subspace approach to gradient-free optimization in high dimensions, Computational Optimization and Applications 79 (2021), pp. 339–368.
  • [24] J. Larson, M. Menickelly, and S.M. Wild, Derivative-free optimization methods, Acta Numerica 28 (2019), pp. 287–404.
  • [25] S. Malladi, T. Gao, E. Nichani, A. Damian, J.D. Lee, D. Chen, and S. Arora, Fine-Tuning Language Models with Just Forward Passes, in 37th Conference on Neural Information Processing Systems (NeurIPS 2023). 2023.
  • [26] M. Menickelly, Avoiding geometry improvement in derivative-free model-based methods via randomization, arXiv preprint arXiv:2305.17336 (2023).
  • [27] J.J. Moré and S.M. Wild, Benchmarking derivative-free optimization algorithms, SIAM Journal on Optimization 20 (2009), pp. 172–191.
  • [28] M.J.D. Powell, Least Frobenius norm updating of quadratic models that satisfy interpolation conditions, Mathematical Programming 100 (2004), pp. 183–215.
  • [29] M.J.D. Powell, The BOBYQA algorithm for bound constrained optimization without derivatives, Tech. Rep. DAMTP 2009/NA06, University of Cambridge, 2009.
  • [30] L. Roberts and C.W. Royer, Direct search based on probabilistic descent in reduced spaces, SIAM Journal on Optimization 33 (2023).
  • [31] Z. Shao, On random embeddings and their application to optimisation, Ph.D. diss., University of Oxford, 2021.
  • [32] G. Ughi, V. Abrol, and J. Tanner, An empirical study of derivative-free-optimization algorithms for targeted black-box attacks in deep neural networks, Optimization and Engineering 23 (2022), pp. 1319–1346.
  • [33] D. van de Berg, N. Shan, and A. del Rio-Chanona, High-dimensional derivative-free optimization via trust region surrogates in linear subspaces, in Proceedings of the 34th European Symposium on Computer Aided Process Engineering / 15th International Symposium on Process Systems Engineering (ESCAPE34/PSE24). 2024.
  • [34] D.P. Woodruff, Sketching as a tool for numerical linear algebra, Foundations and Trends® in Theoretical Computer Science 10 (2014), pp. 1–157.
  • [35] P. Xie and Y.x. Yuan, A new two-dimensional model-based subspace method for large-scale unconstrained derivative-free optimization: 2D-MoSub, arXiv preprint arXiv:2309.14855 (2024).
  • [36] Z. Zhang, On derivative-free optimization methods, Ph.D. diss., Chinese Academy of Sciences, 2012.