
Zeroth-order Random Subspace Algorithm for Non-smooth Convex Optimization

Ryota Nozawa ([email protected]), Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan
Pierre-Louis Poirion ([email protected]), Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
Akiko Takeda ([email protected]), Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan; Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
Abstract

Zeroth-order optimization, which does not use derivative information, is one of the significant research areas in the field of mathematical optimization and machine learning. Although various studies have explored zeroth-order algorithms, one of the theoretical limitations is that the oracle complexity depends on the dimension, i.e., on the number of variables, of the optimization problem. In this paper, to reduce this dependency on the dimension in the oracle complexity, we propose a zeroth-order random subspace algorithm by combining a gradient-free algorithm (specifically, Gaussian randomized smoothing with central differences) with random projection. We derive the worst-case oracle complexity of our proposed method in non-smooth and convex settings; it is equivalent to standard results for full-dimensional non-smooth convex algorithms. Furthermore, we prove that our method also has a local convergence rate independent of the original dimension under additional assumptions. In addition to the theoretical results, numerical experiments show that when an objective function has a specific structure, the proposed method can become experimentally more efficient thanks to random projection.

Keywords: zeroth-order optimization, random projection, convex optimization, oracle complexity

1 Introduction

We consider the following unconstrained optimization problem:

\min_{x\in\mathbb{R}^{n}} f(x), \qquad (1)

where the objective function f is possibly non-smooth, but it is convex. Throughout this paper, we assume that f is L-Lipschitz continuous, i.e.,

|f(x)-f(y)|\leq L\|x-y\|_{2}

holds for all x and y. Furthermore, we assume that (1) is a zeroth-order optimization problem, which means that only the function values f(x) are accessible and the derivatives of f are unavailable, or that the calculation of ∇f is expensive. Zeroth-order optimization has many applications such as bandit problems [5], adversarial attacks [18], and hyperparameter tuning [3]. Additionally, there are various types of optimization methods [23] for zeroth-order optimization, such as random search and model-based methods, and these methods have been widely studied in both smooth settings [35, 2, 22, 36] and non-smooth settings [14, 30, 11, 13, 4, 33].

However, one of the theoretical limitations of zeroth-order algorithms is that the oracle complexity depends on the dimension n. For example, Duchi et al. [11] and Gasnikov et al. [14] proposed variants of mirror descent algorithms. Duchi et al. [11] prove under the zeroth-order setting that a lower bound on the oracle complexity required for their method to find an ε-approximate solution is O(n/ε²) for the case of linear losses. Gasnikov et al. [14] analyze their method on the unit simplex S_n = {x | x_i ≥ 0, Σ_{i=1}^n x_i = 1} and prove that the oracle complexity is O(n/ε²). On the other hand, Nesterov and Spokoiny [27] proposed a method using Gaussian smoothing, which is defined by

f_{\mu}(x):=\mathbb{E}_{u}\left[f(x+\mu u)\right].

By approximating the gradient of the smoothed function f_μ by the forward difference (f(x+μu)-f(x))/μ · u or the central difference (f(x+μu)-f(x-μu))/(2μ) · u, they obtain the oracle complexity under different settings; concretely, they obtain a complexity of order O(n²/ε²) when the objective function is non-smooth and convex. Later, several papers using random smoothing [33, 13, 30] were able to improve the oracle complexity to O(n/ε²) using the central difference. An oracle complexity of O(n/ε²) is a natural result in zeroth-order optimization, since standard subgradient methods for non-smooth convex functions require O(1/ε²) iterations and the approximation of a gradient by finite differences needs O(n) function evaluations.
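For concreteness, the two single-sample gradient estimators discussed above can be sketched in a few lines of NumPy; the test function, point, and smoothing radius below are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def forward_diff_estimator(f, x, mu, rng):
    # single-sample estimator (f(x + mu*u) - f(x)) / mu * u with u ~ N(0, I_n)
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x)) / mu * u

def central_diff_estimator(f, x, mu, rng):
    # single-sample estimator (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

rng = np.random.default_rng(0)
f = lambda x: np.linalg.norm(x, 1)   # a non-smooth convex test function (our choice)
x = np.ones(50)
g = np.mean([central_diff_estimator(f, x, 1e-3, rng) for _ in range(10000)], axis=0)
# averaging many samples approximates the gradient of the smoothed function f_mu at x
```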

To overcome this dependence on the dimension n, several studies [36, 15, 29] assume that f has a low-dimensional structure. For example, Yue et al. [36] use a notion of effective dimensionality and prove that the oracle complexity depends on the effective dimension ED_α := sup_x Σ_{i=1}^n σ_i(∇²f(x))^α, where σ_i denotes a singular value, rather than on n. When the objective function is convex with L_1-Lipschitz gradient and H-Lipschitz Hessian, their algorithm is shown to have an oracle complexity in which the n appearing in the oracle complexity of [27] is replaced by ED_{1/2}. In practice, the effective dimension is often significantly smaller than the dimension n. In such a case, the oracle complexity is improved under convex and Lipschitz-gradient settings.

An alternative approach to reduce the dependency on the dimension in the complexity is to use random projections [6, 22, 2, 31]. Cartis and Roberts [6] combine random projections with a model-based derivative-free method, which approximates the objective function by interpolation. In their approach, in each iteration they work with f(x_k + P_k u) as a function of a smaller-sized variable u ∈ ℝ^d, constructed with a random matrix P_k ∈ ℝ^{n×d} and the current iterate x_k, instead of the original function f(x). Using random projection theory, they prove that when the objective is smooth and non-convex, their method reduces the dimension dependence of the oracle complexity from O(n³/ε²) to O(n²/ε²) for finding an ε-stationary point. Kozak et al. [22] consider a zeroth-order variant of [21], approximating the exact gradient by the randomly projected gradient. They obtain iteration complexities under various parameter choices and assumptions in the smooth setting. In subsequent work, Rando et al. [30] propose a variant of [22] and obtain an oracle complexity of order O(n/ε²) in the non-smooth setting. Berglund et al. [2] propose a zeroth-order variant of the randomized subspace Newton method [16] and prove an iteration complexity in the strongly convex case. Roberts and Royer [31] propose a subspace variant of the direct search methods [17, 20]. They obtain convergence guarantees under a wide range of subspace projections and directions, and show that their randomized methods give better results than the original full-dimensional ones. However, to the best of our knowledge, there is no research that reduces the dependency on the dimension n in the oracle complexity for non-smooth functions.

1.1 Main Contribution

In this paper, we aim to reduce the dependency on the dimension in the worst-case oracle complexity by employing random projections, specifically under a non-smooth and convex setting. We propose an algorithm which combines Gaussian smoothing using central differences [27] and random projections. Our idea is to apply the Gaussian smoothing to the objective function restricted to a subspace, i.e., h^{(k)}(u) := f(x_k + P_k u), instead of the original function f(x). We prove that our algorithm achieves an oracle complexity of O(n/ε²) globally, which is the standard result under the non-smooth and convex setting. Moreover, under additional local assumptions on f, we prove an oracle complexity of O(d²/ε²) locally, where d is the dimension of the random subspace defined by P_k. This indicates that by choosing d much smaller than √n, our proposed method improves the local oracle complexity.

We can summarize our contribution as follows.

  • We propose a zeroth-order random subspace algorithm, obtained by applying a random projection technique to a Gaussian smoothing algorithm, for non-smooth convex optimization problems.

  • Our algorithm achieves an oracle complexity of O(n/ε²) globally, and also has a local convergence rate independent of the original dimension under additional assumptions.

  • Our numerical experiments show that, for objective functions with a specific structure, the proposed method performs well thanks to random projection.

1.2 Related Works on L_2 randomized smoothing

Recently, L_2 randomized smoothing has been actively studied for smoothing a non-smooth function f [11, 13, 33]. The random variable u that defines f_μ is assumed to be a random vector uniformly distributed on a ball with center 0 and radius μ, instead of a normally distributed Gaussian vector. For L_2 randomized smoothing, Shamir [33] proposed an algorithm for bandit convex optimization and showed the optimal rate for convex Lipschitz functions using central differences. Gasnikov et al. [13] proposed a generic approach that combines smoothing and first-order methods. They show that the approach achieves the optimal rate of zeroth-order methods and can utilize various techniques from first-order methods for non-smooth zeroth-order optimization.

In fact, our proposed method can be modified so as to use L_2 randomized smoothing instead of the Gaussian one, and the theoretical guarantees are essentially the same for both methods (see Remark 1). In any case, since we use properties of Gaussian random matrices to reduce the dimension of the problem, we adopt Gaussian smoothing in this paper for the simplicity of our discussion.

1.3 Organization

In Section 2, we introduce some properties of the smoothing function and of random matrices and vectors used in our analysis. In Section 3, we present our algorithm, and in Section 4 we prove global convergence with oracle complexity O(n/ε²). In Section 5, we prove local convergence with oracle complexity O(d²/ε²). In Section 6, we show numerical results and demonstrate that when the objective has a structure suitable for random projections, our method converges faster than existing methods by reducing the function evaluation time.

2 Preliminaries

2.1 Notation

x^* denotes one of the optimal solutions of (1). Let 𝔼_X denote the expectation with respect to a random variable X, and let 𝒩(u, Σ) denote the normal distribution with mean u and covariance Σ. I_d denotes the identity matrix of size d.

We use ‖·‖ for the Euclidean norm and ‖·‖_{ψ₂} for the sub-Gaussian norm of a sub-Gaussian random variable, which is defined by

\|X\|_{\psi_{2}}=\inf\{s>0 \mid \mathbb{E}_{X}[\exp(X^{2}/s^{2})]\leq 2\}.

From the property of the sub-Gaussian norm, ‖X‖_{ψ₂} ≤ C is equivalent to

\mathrm{Prob}(|X|\geq t)\leq 2\exp(-ct^{2}/C^{2}), \qquad (2)

where c is an absolute constant. Let λ_i(A) denote the i-th largest eigenvalue of a matrix A, and let 𝟏_𝒳(x) denote the indicator function defined by

\mathbf{1}_{\mathcal{X}}(x)=\begin{cases}1 & (x\in\mathcal{X}),\\ 0 & (\mathrm{otherwise}).\end{cases}

In particular, when 𝒳 = {x | ⟨x, u⟩ ≥ 0} or 𝒳 = {x | ⟨x, u⟩ < 0} for some u, we write 𝟏_u^+(x) or 𝟏_u^-(x), respectively. ∂f(x) denotes the sub-differential of f at x.

2.2 Gaussian Smoothing Function

In this subsection, we introduce the definition of Gaussian smoothing function and recall its properties.

Definition 1.

(e.g., [27]) The Gaussian smoothing of f: ℝⁿ → ℝ is defined by

f_{\mu}(x):=\mathbb{E}_{u}[f(x+\mu u)], \qquad (3)

where u ∼ 𝒩(0, I_n) and μ is some positive constant.

It is well known that if f is convex, then f_μ is also convex. As derived from Definition 1, the gradient ∇f_μ can be calculated by the following:

\mathbb{E}_{u}\left[\frac{f(x+\mu u)}{\mu}u\right]=\mathbb{E}_{u}\left[\frac{f(x+\mu u)-f(x)}{\mu}u\right]=\mathbb{E}_{u}\left[\frac{f(x+\mu u)-f(x-\mu u)}{2\mu}u\right]=\nabla f_{\mu}(x). \qquad (4)
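As a quick sanity check of identity (4), one can verify it numerically for a function whose smoothed gradient is known in closed form; the quadratic test function and the sample sizes below are our own illustrative choices, not taken from the paper.

```python
import numpy as np

# For f(x) = 0.5 * ||x||^2 one has f_mu(x) = 0.5 * ||x||^2 + 0.5 * mu^2 * n,
# hence grad f_mu(x) = x, so the central-difference estimator should average to x.
rng = np.random.default_rng(1)
n, mu = 20, 0.1
f = lambda x: 0.5 * np.dot(x, x)
x = rng.standard_normal(n)

m = 100000
est = np.zeros(n)
for _ in range(m):
    u = rng.standard_normal(n)
    est += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
est /= m
print(np.max(np.abs(est - x)))   # small (Monte-Carlo error only)
```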

We can evaluate the error bound between f and f_μ when f is convex and Lipschitz continuous.

Lemma 1.

[27] If a function f: ℝⁿ → ℝ is convex and L-Lipschitz continuous, then

f(x)\leq f_{\mu}(x)\leq f(x)+\mu L\sqrt{n} \qquad (5)

holds for any positive μ.
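The bound (5) is easy to check empirically; the following Monte-Carlo sketch uses an illustrative 1-Lipschitz convex function f(x) = ‖x‖ (so L = 1) and dimensions chosen by us.

```python
import numpy as np

rng = np.random.default_rng(2)
n, mu, L = 100, 0.05, 1.0
f = lambda x: np.linalg.norm(x)      # convex and 1-Lipschitz
x = rng.standard_normal(n)

U = rng.standard_normal((100000, n))                 # samples of u ~ N(0, I_n)
f_mu = np.linalg.norm(x + mu * U, axis=1).mean()     # Monte-Carlo estimate of f_mu(x)
print(f(x) <= f_mu <= f(x) + mu * L * np.sqrt(n))    # expected: True
```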

2.3 Random Matrices and Vectors

In this subsection, we introduce some properties of Gaussian random matrices and vectors.

Lemma 2.

Let u ∈ ℝ^d be sampled from 𝒩(0, I_d).

  1. [27, Lemma 1] For p ∈ [0, 2], we have

     \mathbb{E}_{u}\left[\|u\|^{p}\right]\leq d^{p/2}. \qquad (6)

     If p ≥ 2, then we have the two-sided bounds

     d^{p/2}\leq\mathbb{E}_{u}\left[\|u\|^{p}\right]\leq(d+p)^{p/2}. \qquad (7)

  2. [25] Let A ∈ ℝ^{d×d} be a symmetric matrix. Then,

     \mathbb{E}_{u}[u^{\top}Au]=\mathrm{tr}(A), \qquad (8)
     \mathbb{E}_{u}[(u^{\top}Au)^{2}]=(\mathrm{tr}(A))^{2}+2\,\mathrm{tr}(A^{2}) \qquad (9)

     hold.

  3. [34] Consider an L-Lipschitz function f: ℝⁿ → ℝ. Then,

     \|f(u)-\mathbb{E}_{u}\left[f(u)\right]\|_{\psi_{2}}\leq CL

     holds, where C is an absolute constant.

In particular, when p = 2, the following relationship is derived from simple calculations:

\mathbb{E}_{u}\left[\|u\|^{2}\right]=d. \qquad (10)
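The moment bounds and trace identities above are easy to confirm numerically; this short check uses a dimension, a sample size, and a random symmetric matrix of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 10, 500000
U = rng.standard_normal((m, d))                 # rows are samples of u ~ N(0, I_d)
sq = np.sum(U**2, axis=1)                       # ||u||^2 for each sample

print(np.mean(sq))                              # ~ d, matching (10)
print(d**2 <= np.mean(sq**2) <= (d + 4)**2)     # two-sided bound (7) with p = 4

A = rng.standard_normal((d, d)); A = (A + A.T) / 2     # a symmetric test matrix
quad = np.einsum('ij,jk,ik->i', U, A, U)               # u^T A u for each sample
print(np.mean(quad), np.trace(A))                      # (8): the two values should be close
print(np.mean(quad**2), np.trace(A)**2 + 2 * np.trace(A @ A))   # (9): likewise
```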

From Lemma 2.3, we can evaluate the sub-Gaussian norm ‖f(u) − 𝔼_u[f(u)]‖_{ψ₂} for random variables u ∼ 𝒩(0, μ²I_n) as follows.

Corollary 1.

Consider a random vector u ∼ 𝒩(0, I_n) and an L-Lipschitz function f: ℝⁿ → ℝ. Then,

\|f(\mu u)-\mathbb{E}_{u}\left[f(\mu u)\right]\|_{\psi_{2}}\leq C\mu L

holds.

From Corollary 1 and the property of the sub-Gaussian norm (2), we have

\mathrm{Prob}(|f(\mu u)-\mathbb{E}_{u}\left[f(\mu u)\right]|\geq t)\leq 2\exp\left(-\frac{ct^{2}}{\mu^{2}L^{2}}\right), \qquad (11)

where the constant c is independent of n. Next, we recall some properties of random matrices.

Lemma 3.

Let P ∈ ℝ^{n×d} (d < n) be a random matrix whose entries are independently drawn from 𝒩(0, 1).

  1. [34] Then for any x ∈ ℝⁿ and ε ∈ (0, 1), we have

     \mathrm{Prob}\left[(1-\varepsilon)\|x\|^{2}\leq\frac{1}{d}\|P^{\top}x\|^{2}\leq(1+\varepsilon)\|x\|^{2}\right]\geq 1-2\exp(-C_{0}\varepsilon^{2}d),

     where C_0 is an absolute constant.

  2. Then for any x ∈ ℝⁿ, we have

     \mathbb{E}_{P}\left[\|PP^{\top}x\|^{2}\right]\leq 2(n+4)(d+4)\|x\|^{2}. \qquad (12)

  3. [9, Theorem II.13] Let β = d/n with d ≤ n. Then for any t > 0,

     \mathrm{Prob}\left(\sigma_{\min}\left(\frac{1}{\sqrt{n}}P\right)\leq 1-\sqrt{\beta}-t\right)<\exp(-nt^{2}/2)

     holds, where σ_min denotes the minimum nonzero singular value of a matrix.
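Lemma 3.1 is a Johnson-Lindenstrauss-type concentration statement, and the following small experiment (with sizes chosen by us) illustrates it: the scaled projected norm rarely leaves the interval [(1−ε)‖x‖², (1+ε)‖x‖²].

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps, trials = 500, 100, 0.5, 1000
x = rng.standard_normal(n)

hits = 0
for _ in range(trials):
    P = rng.standard_normal((n, d))
    ratio = np.linalg.norm(P.T @ x) ** 2 / (d * np.linalg.norm(x) ** 2)
    hits += (1 - eps) <= ratio <= (1 + eps)
print(hits / trials)   # close to 1, consistent with the 1 - 2 exp(-C0 eps^2 d) bound
```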

Proof of Lemma 3.2.

We define P = (u_1, u_2, ..., u_n)^⊤, where u_i is a d-dimensional vector. Then, we have (PP^⊤)_{ij} = u_i^⊤ u_j, and therefore,

\mathbb{E}_{P}\left[\|PP^{\top}x\|^{2}\right]
=\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\sum^{n}_{k=1}\sum^{n}_{l=1}u_{l}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}x_{l}\right]
=\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\sum^{n}_{k=1}u_{k}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}^{2}+\sum^{n}_{i=1}\sum^{n}_{k=1}\sum^{n}_{l\neq k}u_{l}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}x_{l}\right]
=\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\|u_{i}\|^{4}x_{i}^{2}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}u_{k}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}^{2}+\sum^{n}_{i=1}\sum^{n}_{l\neq i}u_{l}^{\top}u_{i}u_{i}^{\top}u_{i}x_{i}x_{l}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}\sum^{n}_{l\neq k}u_{l}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}x_{l}\right].

Regarding the third and fourth terms on the right-hand side, since the index l is not equal to the other indices i, k, and given that 𝔼_{u_l}[u_l] = 0, we have 𝔼_{u_1,..,u_n}[u_l^⊤ u_i u_i^⊤ u_i x_i x_l] = 0 and 𝔼_{u_1,..,u_n}[u_l^⊤ u_i u_i^⊤ u_k x_k x_l] = 0. Similarly, regarding the second term on the right-hand side, since the index i is not equal to k and 𝔼_{u_i}[u_i u_i^⊤] = I, we have 𝔼_{u_i}[u_k^⊤ u_i u_i^⊤ u_k x_k²] = ‖u_k‖² x_k². Hence, we obtain

\mathbb{E}_{P}\left[\|PP^{\top}x\|^{2}\right]
=\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\|u_{i}\|^{4}x_{i}^{2}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}\|u_{k}\|^{2}x_{k}^{2}\right]
\overset{(10)}{=}\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\|u_{i}\|^{4}x_{i}^{2}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}dx_{k}^{2}\right]
\overset{(7)}{\leq}\sum^{n}_{i=1}(d+4)^{2}x_{i}^{2}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}dx_{k}^{2}
=((d+4)^{2}+d(n-1))\|x\|^{2}
\overset{(d\leq n,\;1\leq n)}{\leq}((d+4)(n+4)+(d+4)(n+4))\|x\|^{2}
=2(n+4)(d+4)\|x\|^{2}.

∎∎

Remark 1.

L_2 randomized smoothing is defined by (3) with the random vector u sampled from the uniform distribution on the sphere of radius 1 (i.e., ‖u‖₂ = 1), and it is known in [10, 13] to have a dimension-independent upper bound for f_μ(x):

f(x)\leq f_{\mu}(x)\leq f(x)+\mu L. \qquad (13)

The difference between (5) and (13) comes from the norm of the random vectors. In terms of upper bounds on f_μ(x), there is no essential difference between these two smoothing methods.

Indeed, our oracle complexity analysis also holds when the random vector u is sampled from the uniform distribution on the sphere of radius √n. We can prove Lemmas 1 and 2 for a vector u uniformly sampled on the sphere of radius √n. For example,

f(x)\leq\mathbb{E}_{u}f(x+\mu u)\leq f(x)+\mu L\sqrt{n}

holds from [10, 13]. Instead of Lemma 2.1, we can directly state ‖u‖ = √n, and equation (8) also holds. For equation (9), 𝔼_u[(u^⊤Au)²] = (tr(A))² holds. Lemma 2.3 holds for the uniform distribution on the sphere from [34, Theorem 5.1.4]. Then, the arguments in the following sections hold when we use these lemmas instead of the Gaussian versions described in this section.
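For completeness, sampling u uniformly on the sphere of radius √n, as used in this remark, can be sketched by normalizing a Gaussian vector; the dimension below is an arbitrary choice of ours.

```python
import numpy as np

def sample_sphere(n, rng):
    # uniform distribution on the sphere of radius sqrt(n): normalize a Gaussian draw
    g = rng.standard_normal(n)
    return np.sqrt(n) * g / np.linalg.norm(g)

rng = np.random.default_rng(5)
u = sample_sphere(10, rng)
print(np.linalg.norm(u) ** 2)   # equals n = 10 up to floating-point error
```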

3 Proposed Algorithm

In this section, we describe the Randomized Gradient-Free method (RGF) [27] and our proposed method. RGF, a random search method, updates x_k by (14), where g_μ(x, u) denotes the approximation of the gradient at x along a random direction u. The method can use forward differences or central differences for g_μ(x, u), i.e.,

g_{\mu}(x,u)=\frac{f(x+\mu u)-f(x)}{\mu}u, \quad\mbox{or}\quad g_{\mu}(x,u)=\frac{f(x+\mu u)-f(x-\mu u)}{2\mu}u,

respectively. For a convex and Lipschitz continuous objective function, RGF using forward differences achieves an iteration complexity of O(n²/ε²) [27], and RGF using central differences achieves one of O(n/ε²) [33]. For our proposed method, we confirm that using central differences attains a better oracle complexity than using forward differences.

Algorithm 1 Randomized gradient-free method (RGF) [27]
Require: x_0, {α_k}, {μ_k}
  for k = 0, 1, 2, ... do
    Sample u_k from 𝒩(0, I_n)
    x_{k+1} = x_k − α_k g_{μ_k}(x_k, u_k)   (14)
  end for
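A minimal NumPy sketch of Algorithm 1 with central differences is given below; the fixed step size, smoothing radius, iteration count, and test objective are illustrative choices of ours.

```python
import numpy as np

def rgf(f, x0, alpha, mu, n_iters, rng):
    x = x0.copy()
    for _ in range(n_iters):
        u = rng.standard_normal(x.shape)                     # u_k ~ N(0, I_n)
        g = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u   # central-difference g_mu(x, u)
        x = x - alpha * g                                    # update (14)
    return x

rng = np.random.default_rng(6)
f = lambda x: np.linalg.norm(x, 1)                           # non-smooth convex test function
x_hat = rgf(f, x0=np.ones(50), alpha=1e-3, mu=1e-4, n_iters=20000, rng=rng)
print(f(x_hat))   # much smaller than f(x0) = 50
```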

Combining RGF and random projections, we propose Algorithm 2.

Algorithm 2 Subspace randomized gradient-free method
Require: x_0, {α_k}, {μ_k}, d
  for k = 0, 1, 2, ... do
    Get a random matrix P_k ∈ ℝ^{n×d} whose entries are sampled from 𝒩(0, 1)
    Set h^{(k)}(u) := f(x_k + (1/√n) P_k u)
    Sample u_k from 𝒩(0, I_d)
    x_{k+1} = x_k − α_k (h^{(k)}(μ_k u_k) − h^{(k)}(−μ_k u_k)) / (2μ_k) · P_k u_k   (15)
  end for
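The following is a minimal NumPy sketch of Algorithm 2; the fixed step size and smoothing radius, the subspace dimension d, and the test objective are illustrative choices of ours, not settings used in the paper's experiments.

```python
import numpy as np

def subspace_rgf(f, x0, alpha, mu, d, n_iters, rng):
    x = x0.copy()
    n = x0.size
    for _ in range(n_iters):
        P = rng.standard_normal((n, d))              # P_k with i.i.d. N(0, 1) entries
        u = rng.standard_normal(d)                   # u_k ~ N(0, I_d)
        h = lambda v: f(x + P @ v / np.sqrt(n))      # h^{(k)}(v), cf. (16)
        g = (h(mu * u) - h(-mu * u)) / (2 * mu)      # central difference along the subspace
        x = x - alpha * g * (P @ u)                  # update (15)
    return x

rng = np.random.default_rng(7)
f = lambda x: np.linalg.norm(x, 1)
x_hat = subspace_rgf(f, x0=np.ones(50), alpha=1e-3, mu=1e-4, d=5, n_iters=20000, rng=rng)
print(f(x_hat))   # decreases well below f(x0) = 50
```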

We define h^{(k)}(u) as the restriction of f to the random subspace,

h^{(k)}(u):=f\left(x_{k}+\frac{1}{\sqrt{n}}P_{k}u\right), \qquad (16)

where P_k ∈ ℝ^{n×d} is a random matrix whose elements are sampled from 𝒩(0, 1). Lemma 4 shows that taking the expectation with respect to the random matrix yields smoothing functions.

Lemma 4.
\mathbb{E}_{P_{k}}[h^{(k)}(\mu_{k}u_{k})]=\mathbb{E}_{P_{k}}[h^{(k)}(-\mu_{k}u_{k})]=f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}), \qquad (17)
\mathbb{E}_{P_{k}}\left[\frac{h^{(k)}(\mu_{k}u_{k})}{\mu_{k}}P_{k}u_{k}\right]=\mathbb{E}_{P_{k}}\left[-\frac{h^{(k)}(-\mu_{k}u_{k})}{\mu_{k}}P_{k}u_{k}\right]=\frac{\|u_{k}\|^{2}}{\sqrt{n}}\nabla f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}) \qquad (18)

hold.

Proof.

Using the rotational invariance of the normal distribution, we have that 𝔼_{P_k}[F(P_k)] = 𝔼_{P_k}[F(−P_k)] holds for any function F. Using this property, we obtain

\mathbb{E}_{P_{k}}\left[h^{(k)}(\mu_{k}u_{k})\right]=\mathbb{E}_{P_{k}}\left[f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})\right]=\mathbb{E}_{P_{k}}\left[f(x_{k}-\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})\right]=\mathbb{E}_{P_{k}}\left[h^{(k)}(-\mu_{k}u_{k})\right].

Notice that the distribution of P_k u_k is given by 𝒩(0, ‖u_k‖²I_n). We can therefore replace P_k u_k by ‖u_k‖ z_k, where z_k is sampled from 𝒩(0, I_n). Then, we obtain

\mathbb{E}_{P_{k}}[h^{(k)}(\mu_{k}u_{k})]=\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]=f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}). \qquad (19)

Similarly, we have

\mathbb{E}_{P_{k}}\left[h^{(k)}(\mu_{k}u_{k})P_{k}u_{k}\right]=\mathbb{E}_{P_{k}}\left[f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})P_{k}u_{k}\right]=-\mathbb{E}_{P_{k}}\left[f(x_{k}-\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})P_{k}u_{k}\right]=-\mathbb{E}_{P_{k}}\left[h^{(k)}(-\mu_{k}u_{k})P_{k}u_{k}\right],

and

\mathbb{E}_{P_{k}}\left[\frac{h^{(k)}(\mu_{k}u_{k})}{\mu_{k}}P_{k}u_{k}\right]=\mathbb{E}_{z_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})}{\mu_{k}}\|u_{k}\|z_{k}\right]=\frac{\|u_{k}\|^{2}}{\sqrt{n}}\mathbb{E}_{z_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})}{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}z_{k}\right]\overset{(4)}{=}\frac{\|u_{k}\|^{2}}{\sqrt{n}}\nabla f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}).

∎∎
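Identity (17) can also be checked by simulation; the sketch below, with an arbitrary test function and problem sizes of our own choosing, compares a Monte-Carlo estimate of 𝔼_{P_k}[h^{(k)}(μ_k u_k)] with a direct Monte-Carlo estimate of the Gaussian smoothing f_{μ_k‖u_k‖/√n}(x_k).

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, mu, m = 30, 5, 0.3, 50000
f = lambda y: np.linalg.norm(y)        # an arbitrary Lipschitz convex test function
x = rng.standard_normal(n)             # plays the role of x_k
u = rng.standard_normal(d)             # a fixed u_k

# left-hand side of (17): average of f(x + (mu/sqrt(n)) P u) over random P
lhs = np.mean([f(x + mu / np.sqrt(n) * (rng.standard_normal((n, d)) @ u))
               for _ in range(m)])
# right-hand side of (17): Gaussian smoothing of f at x with radius mu*||u||/sqrt(n)
r = mu * np.linalg.norm(u) / np.sqrt(n)
rhs = np.mean([f(x + r * rng.standard_normal(n)) for _ in range(m)])
print(lhs, rhs)   # the two estimates agree up to Monte-Carlo error
```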

4 Global Convergence

In this section, we prove global convergence of our proposed method for convex and Lipschitz continuous functions. We define 𝒰_k := (u_0, P_0, ..., u_{k−1}, P_{k−1}).

Assumption 1.

f is convex and L-Lipschitz continuous, i.e.,

|f(x)-f(y)|\leq L\|x-y\|

holds for all x and y.

Theorem 1.

Suppose that Assumption 1 holds. Let the sequence {x_k} be generated by Algorithm 2. Then for any N ≥ 1,

\sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}[f(x_{k})]-f(x^{*}))\leq\frac{\sqrt{n}}{2d}\|x_{0}-x^{*}\|^{2}+\frac{L(d+3)^{3/2}}{d}\sum_{k=0}^{N-1}\mu_{k}\alpha_{k}+\frac{L^{2}(d+4)^{2}(n+4)}{cd\sqrt{n}}\sum_{k=0}^{N-1}\alpha_{k}^{2}

holds.

Proof.

We define rk:=xkxr_{k}:=\|x_{k}-x^{*}\|. From (15), we have

rk+12=rk22αkh(k)(μkuk)h(k)(μkuk)2μkPkuk,xkx\displaystyle r^{2}_{k+1}=r^{2}_{k}-2\alpha_{k}\langle\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}P_{k}u_{k},x_{k}-x^{*}\rangle
+αk2(h(k)(μkuk)h(k)(μkuk)2μk)2Pkuk2.\displaystyle+\alpha_{k}^{2}\left(\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}\right)^{2}\|P_{k}u_{k}\|^{2}. (20)

Taking the expectation with respect to uku_{k} and PkP_{k}, we then evaluate the second and third terms. Regarding the second term on the right-hand side of (4), we have

𝔼uk𝔼Pk[h(k)(μkuk)h(k)(μkuk)2μkPkuk,xkx]\displaystyle\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}\left[\left\langle\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}P_{k}u_{k},x_{k}-x^{*}\right\rangle\right] =(18)𝔼uk[uk2nfμkukn(xk),xkx]\displaystyle\overset{\eqref{eq:subspace gradient of smoothing function}}{=}\mathbb{E}_{u_{k}}\left[\left\langle\frac{\|u_{k}\|^{2}}{\sqrt{n}}\nabla f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}),x_{k}-x^{*}\right\rangle\right]
convexity𝔼uk[uk2n(fμkukn(xk)fμkukn(x))]\displaystyle\overset{\text{convexity}}{\geq}\mathbb{E}_{u_{k}}\left[\frac{\|u_{k}\|^{2}}{\sqrt{n}}(f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k})-f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x^{*}))\right]
(5)𝔼uk[uk2n(f(xk)f(x)Lμkuk)]\displaystyle\overset{\eqref{eq:error smoothing function}}{\geq}\mathbb{E}_{u_{k}}\left[\frac{\|u_{k}\|^{2}}{\sqrt{n}}(f(x_{k})-f(x^{*})-L\mu_{k}\|u_{k}\|)\right]
=(10)dn(f(xk)f(x))Lμk𝔼uk[uk3]n1/2\displaystyle\overset{\text{\eqref{eq:squared norm expectation}}}{=}\frac{d}{\sqrt{n}}(f(x_{k})-f(x^{*}))-\frac{L\mu_{k}\mathbb{E}_{u_{k}}[\|u_{k}\|^{3}]}{n^{1/2}}
(7)dn(f(xk)f(x))Lμk(d+3)3/2n1/2.\displaystyle\overset{\text{\eqref{eq:bound of expectation of norm2}}}{\geq}\frac{d}{\sqrt{n}}(f(x_{k})-f(x^{*}))-\frac{L\mu_{k}(d+3)^{3/2}}{n^{1/2}}.

For the third term on the right-hand side of (4), from Pkuk=ukzk(zk𝒩(0,In))P_{k}u_{k}=\|u_{k}\|z_{k}\;(z_{k}\sim\mathcal{N}(0,I_{n})), we have

γ\displaystyle\gamma :=𝔼Pk𝔼uk[(h(k)(μkuk)h(k)(μkuk)2μk)2Pkuk2]\displaystyle:=\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[\left(\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}\right)^{2}\|P_{k}u_{k}\|^{2}\right]
=(19)14μk2𝔼zk𝔼uk[(f(xk+μkuknzk)f(xkμkuknzk))2uk2zk2]\displaystyle\overset{\text{\eqref{eq:smoothing subspace function}}}{=}\frac{1}{4\mu_{k}^{2}}\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-f(x_{k}-\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right)^{2}\|u_{k}\|^{2}\|z_{k}\|^{2}\right]
=14μk2𝔼zk𝔼uk[((f(xk+μkuknzk)β)+(βf(xkμkuknzk)))2uk2zk2].\displaystyle=\frac{1}{4\mu_{k}^{2}}\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\left((f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\beta)+(\beta-f(x_{k}-\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k}))\right)^{2}\|u_{k}\|^{2}\|z_{k}\|^{2}\right].

By applying the inequality (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2}, we obtain

γ12μk2𝔼zk𝔼uk[((f(xk+μkuknzk)β)2+(βf(xkμkuknzk))2)uk2zk2].\gamma\leq\frac{1}{2\mu_{k}^{2}}\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\left(\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\beta\right)^{2}+\left(\beta-f(x_{k}-\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right)^{2}\right)\|u_{k}\|^{2}\|z_{k}\|^{2}\right].

From the rotational invariance of zkz_{k}, we have

γ12μk2𝔼zk𝔼uk[((f(xk+μkuknzk)β)2+(βf(xk+μkuknzk))2)uk2zk2].\gamma\leq\frac{1}{2\mu_{k}^{2}}\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\left(\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\beta\right)^{2}+\left(\beta-f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right)^{2}\right)\|u_{k}\|^{2}\|z_{k}\|^{2}\right].

Selecting β=𝔼zk[f(xk+μkuknzk)]\beta=\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right], we have

γ\displaystyle\gamma 1μk2𝔼uk[uk2𝔼zk[(f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])2zk2]]\displaystyle\leq\frac{1}{\mu_{k}^{2}}\mathbb{E}_{u_{k}}\left[\|u_{k}\|^{2}\mathbb{E}_{z_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{2}\|z_{k}\|^{2}\right]\right]
Hölder’s ineq.1μk2𝔼uk[uk2𝔼zk[(f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])4]𝔼zk[zk4]].\displaystyle\overset{\text{H\"{o}lder's ineq.}}{\leq}\frac{1}{\mu_{k}^{2}}\mathbb{E}_{u_{k}}\left[\|u_{k}\|^{2}\sqrt{\mathbb{E}_{z_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{4}\right]}\sqrt{\mathbb{E}_{z_{k}}[\|z_{k}\|^{4}]}\right].

We evaluate 𝔼zk[(f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])4]\mathbb{E}_{z_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{4}\right] using Corollary 1. Note that for any non-negative random variables XX, 𝔼X[X]=0Prob(Xt)dt\mathbb{E}_{X}[X]=\int_{0}^{\infty}\mathrm{Prob}(X\geq t)\mathrm{d}t holds. Using this relation, we have

𝔼zk[(f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])4]\displaystyle\mathbb{E}_{z_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{4}\right]
=0Prob((f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])4t)dt\displaystyle\hskip 100.0pt=\int_{0}^{\infty}\mathrm{Prob}\left(\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{4}\geq t\right)\mathrm{d}t
=0Prob(|f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)]|t1/4)dt\displaystyle\hskip 100.0pt=\int_{0}^{\infty}\mathrm{Prob}\left(\left|f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right|\geq t^{1/4}\right)\mathrm{d}t
(11)02exp(cnt1/2μk2uk2L2)dt\displaystyle\hskip 100.0pt\overset{\text{\eqref{eq:gaussian concentration prob}}}{\leq}\int_{0}^{\infty}2\exp{\left(-\frac{cnt^{1/2}}{\mu_{k}^{2}\|u_{k}\|^{2}L^{2}}\right)}\mathrm{d}t
=μk4uk4L4c2n202exp(T1/2)dT\displaystyle\hskip 100.0pt=\frac{\mu_{k}^{4}\|u_{k}\|^{4}L^{4}}{c^{2}n^{2}}\int_{0}^{\infty}2\exp{\left(-T^{1/2}\right)}\mathrm{d}T
=4μk4uk4L4c2n2,\displaystyle\hskip 100.0pt=\frac{4\mu_{k}^{4}\|u_{k}\|^{4}L^{4}}{c^{2}n^{2}},

where the last equality follows from 0exp(T1/2)dT=2\int_{0}^{\infty}\exp{\left(-T^{1/2}\right)}\mathrm{d}T=2. Then, we obtain

γ\displaystyle\gamma 1μk2𝔼uk[uk24μk4uk4L4c2n2𝔼zk[zk4]]\displaystyle\leq\frac{1}{\mu_{k}^{2}}\mathbb{E}_{u_{k}}\left[\|u_{k}\|^{2}\sqrt{\frac{4\mu_{k}^{4}\|u_{k}\|^{4}L^{4}}{c^{2}n^{2}}}\sqrt{\mathbb{E}_{z_{k}}[\|z_{k}\|^{4}]}\right]
=2L2cn𝔼uk[uk4]𝔼zk[zk4]\displaystyle=\frac{2L^{2}}{cn}\mathbb{E}_{u_{k}}[\|u_{k}\|^{4}]\sqrt{\mathbb{E}_{z_{k}}[\|z_{k}\|^{4}]}
(7)2L2(d+4)2(n+4)cn.\displaystyle\overset{\text{\eqref{eq:bound of expectation of norm2}}}{\leq}\frac{2L^{2}(d+4)^{2}(n+4)}{cn}. (21)

Therefore, we obtain

𝔼uk𝔼Pk[rk+12]rk22dαkn(f(xk)f(x))+2Lμkαk(d+3)3/2n1/2+2L2αk2(d+4)2(n+4)cn.\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}[r_{k+1}^{2}]\leq r_{k}^{2}-\frac{2d\alpha_{k}}{\sqrt{n}}(f(x_{k})-f(x^{*}))+\frac{2L\mu_{k}\alpha_{k}(d+3)^{3/2}}{n^{1/2}}+\frac{2L^{2}\alpha_{k}^{2}(d+4)^{2}(n+4)}{cn}.

Taking the expectation with respect to 𝒰k\mathcal{U}_{k}, we obtain

𝔼𝒰k+1[rk+12]𝔼𝒰k[rk2]2dαkn(𝔼𝒰k[f(xk)]f(x))+2Lμkαk(d+3)3/2n1/2+2L2αk2(d+4)2(n+4)cn.\mathbb{E}_{\mathcal{U}_{k+1}}[r_{k+1}^{2}]\leq\mathbb{E}_{\mathcal{U}_{k}}[r_{k}^{2}]-\frac{2d\alpha_{k}}{\sqrt{n}}(\mathbb{E}_{\mathcal{U}_{k}}\left[f(x_{k})\right]-f(x^{*}))+\frac{2L\mu_{k}\alpha_{k}(d+3)^{3/2}}{n^{1/2}}+\frac{2L^{2}\alpha_{k}^{2}(d+4)^{2}(n+4)}{cn}.

Summing up these inequalities from k=0k=0 and k=N1k=N-1, we obtain

𝔼𝒰N[rN2]r022dnk=0N1αk(𝔼𝒰k[f(xk)]f(x))+2L(d+3)3/2n1/2k=0N1μkαk+2L2(d+4)2(n+4)cnk=0N1αk2.\mathbb{E}_{\mathcal{U}_{N}}[r_{N}^{2}]\leq r_{0}^{2}-\frac{2d}{\sqrt{n}}\sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}\left[f(x_{k})\right]-f(x^{*}))+\frac{2L(d+3)^{3/2}}{n^{1/2}}\sum_{k=0}^{N-1}\mu_{k}\alpha_{k}+\frac{2L^{2}(d+4)^{2}(n+4)}{cn}\sum_{k=0}^{N-1}\alpha_{k}^{2}.

Therefore,

k=0N1αk(𝔼𝒰k[f(xk)]f(x))n2dr02+L(d+3)3/2dk=0N1μkαk+L2(d+4)2(n+4)cndk=0N1αk2.\sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}\left[f(x_{k})\right]-f(x^{*}))\leq\frac{\sqrt{n}}{2d}r_{0}^{2}+\frac{L(d+3)^{3/2}}{d}\sum_{k=0}^{N-1}\mu_{k}\alpha_{k}+\frac{L^{2}(d+4)^{2}(n+4)}{c\sqrt{n}d}\sum_{k=0}^{N-1}\alpha_{k}^{2}.

∎∎

With fixed α_k = α and μ_k = μ, we obtain

\min_{0\leq i\leq N-1}\mathbb{E}_{\mathcal{U}_{i}}[f(x_{i})]-f(x^{*})\leq\frac{\sqrt{n}}{2dN\alpha}r_{0}^{2}+\frac{L(d+3)^{3/2}}{d}\mu+\frac{L^{2}(d+4)^{2}(n+4)}{cd\sqrt{n}}\alpha.

From this relation, the oracle complexity for achieving the inequality \min_{0\leq i\leq N-1}\mathbb{E}_{\mathcal{U}_{i}}[f(x_{i})]-f(x^{*})\leq\varepsilon is

N=\frac{8r_{0}^{2}L^{2}(n+4)(d+4)^{2}}{c^{2}d^{2}\varepsilon^{2}}=O\left(\frac{n}{\varepsilon^{2}}\right)

with

\alpha=\frac{\sqrt{cn}\,r_{0}}{L(d+4)\sqrt{2(n+4)N}},\quad\mu\leq\frac{\varepsilon d}{2L(d+3)^{3/2}}.

These parameters are obtained from the relations \frac{\sqrt{n}}{2dN\alpha}r_{0}^{2}=\frac{L^{2}(d+4)^{2}(n+4)}{cd\sqrt{n}}\alpha=\frac{\varepsilon}{4} and \frac{L(d+3)^{3/2}}{d}\mu\leq\frac{\varepsilon}{2}.

Note that in each iteration, our algorithm calculates the function value twice. Therefore, the oracle complexity is equal to twice the iteration complexity.
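As an illustration of how these quantities scale, the helper below evaluates N, α, and μ from the expressions above; in practice r_0, L, and the absolute constant c are unknown, so the values passed here are assumptions made only for the sake of the example.

```python
import numpy as np

def theorem1_parameters(n, d, eps, L, r0, c):
    # iteration count and parameters suggested by the discussion above
    N = 8 * r0**2 * L**2 * (n + 4) * (d + 4)**2 / (c**2 * d**2 * eps**2)
    alpha = np.sqrt(c * n) * r0 / (L * (d + 4) * np.sqrt(2 * (n + 4) * N))
    mu = eps * d / (2 * L * (d + 3)**1.5)        # upper bound on the smoothing radius
    return int(np.ceil(N)), alpha, mu

# hypothetical problem constants, chosen only to show the scaling
print(theorem1_parameters(n=1000, d=10, eps=0.1, L=1.0, r0=1.0, c=1.0))
```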

5 Local Convergence

In this section, we prove local convergence of our proposed method under some local assumptions on f.

5.1 Assumptions

Assumption 2.

We have that d = o(n), and d → ∞ as n → ∞.

Next we consider the following local assumptions on ff.

Assumption 3.

There exist a neighborhood B^* of x^* and an Alexandrov matrix \tilde{H}(x) that satisfy the following properties.

  (i) There exist constants \tilde{L} and τ, and a subgradient g ∈ ∂f(y), such that for all x, y ∈ B^*:

      f(x)\geq f(y)+\langle g,x-y\rangle+\frac{\tau}{2}(x-y)^{\top}\tilde{H}(y)(x-y), \qquad (22)
      f(x)\leq f(y)+\langle g,x-y\rangle+\frac{\tilde{L}}{2}(x-y)^{\top}\tilde{H}(y)(x-y). \qquad (23)

  (ii) There exist constants σ ∈ (0, 1) and \bar{\lambda} > 0 such that λ_{σn}(\tilde{H}(x)) ≥ \bar{\lambda} holds for all x ∈ B^*.

When the objective function is twice differentiable, we set g = ∇f(y) and \tilde{H}(y) = ∇²f(y). In the smooth setting, (22) and (23) in Assumption 3(i) are called relative convexity and relative smoothness [16, 19], respectively, and several functions satisfy this property (e.g., the logistic function, the Wasserstein distance). This assumption implies that the objective function f is τ-strongly convex and \tilde{L}-smooth under the semi-norm ‖·‖_{\tilde{H}(x)} for all x, y ∈ B^*.

While non-smooth objective functions do not necessarily have gradients and Hessians at some points, we can show that when f is convex, f is twice differentiable almost everywhere.

Theorem 2.

[1, 28] Every convex function f: ℝⁿ → ℝ is twice differentiable almost everywhere in the following sense: f is twice differentiable at z with Alexandrov Hessian \tilde{H}(z) = \tilde{H}^{\top}(z) ⪰ 0, if ∇f(z) exists, and if for every ε > 0 there exists δ > 0 such that ‖x − z‖ ≤ δ implies

\sup_{y\in\partial f(x)}\|y-\nabla f(z)-\tilde{H}(z)(x-z)\|\leq\varepsilon\|x-z\|.

Assumption 3(i) is inspired by Theorem 2, because there exists \tilde{H}(x) almost everywhere such that

\lim_{h\to 0}\frac{f(x+h)-f(x)-\langle\nabla f(x),h\rangle-\frac{1}{2}h^{\top}\tilde{H}(x)h}{\|h\|^{2}}=0.

Note that the subgradient g and the matrix \tilde{H}(x) are used only in the analysis, not in our algorithm.

In the following subsection, we assume the above two assumptions in addition to Assumption 1. The theoretical results in this section can only hold locally around an optimal solution x^*. Indeed, we assume Lipschitz continuity in Assumption 1 and relative convexity in Assumption 3(i). This implies that f(x) − f(x^*) = O(‖x − x^*‖) and f(x) − f(x^*) = Ω(‖x − x^*‖²) hold, and these assumptions therefore conflict as ‖x − x^*‖ → ∞.

5.2 Local Theoretical Guarantees

We define h^{(k)}_{μ_k}(u) as the smoothing function on the random subspace:

h^{(k)}_{\mu_{k}}(u):=\mathbb{E}_{u_{k}}[h^{(k)}(u+\mu_{k}u_{k})]=\mathbb{E}_{u_{k}}\left[f\left(x_{k}+\frac{1}{\sqrt{n}}P_{k}(u+\mu_{k}u_{k})\right)\right]. \qquad (24)

From (4), we have

\nabla h^{(k)}_{\mu_{k}}(u)=\mathbb{E}_{u_{k}}\left[\frac{h^{(k)}(u+\mu_{k}u_{k})-h^{(k)}(u)}{\mu_{k}}u_{k}\right]=\mathbb{E}_{u_{k}}\left[\frac{h^{(k)}(u+\mu_{k}u_{k})-h^{(k)}(u-\mu_{k}u_{k})}{2\mu_{k}}u_{k}\right]. \qquad (25)

Under some assumptions, we can show that P^⊤\tilde{H}(x)P is a positive definite matrix with high probability.

Proposition 1.

[12, Proposition 5.4] Let 0 < ε_0 < 1. Then under Assumptions 1, 2, and 3(ii), there exists n_0 ∈ ℕ (which depends only on ε_0 and σ) such that if n ≥ n_0, then for any x ∈ B^*,

P^{\top}\tilde{H}(x)P\succeq\frac{(1-\varepsilon_{0})^{2}n}{2}\sigma^{2}\bar{\lambda}I_{d}

holds with probability at least 1 − 6exp(−d).

Now, we prove local convergence of our proposed method.

Theorem 3.

Let 0 < ε_0 < 1 and n_0 be as defined in Proposition 1, and the sequence {x_k} be generated by Algorithm 2. Suppose that Assumptions 1, 2, and 3 hold. If n ≥ n_0, x_k ∈ B^*, and μ_k ≤ 1/tr(P_k^⊤\tilde{H}(x_k)P_k) hold for any k ≥ 0, then for any N ≥ 1, at least one of the following holds:

  1. \sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}[f(x_{k})]-f(x^{*}))\leq\frac{1}{2}r_{0}^{2}+L\sqrt{d}\sum_{k=0}^{N-1}\alpha_{k}\mu_{k}+\frac{L^{2}(n+4)(d+4)^{2}}{cn}\sum_{k=0}^{N-1}\alpha_{k}^{2}, \qquad (26)

  2. \min_{0\leq k\leq N-1}f(x_{k})-f(x^{*})\leq LC_{1}\sqrt{\frac{8d+41}{d^{2}}}+\frac{LC_{2}}{n\sqrt{d}}, \qquad (27)

where C_1 := \frac{8L}{(1-\varepsilon_{0})^{2}\tau C\sigma^{2}\bar{\lambda}}, C_2 := \frac{2\sqrt{3}\tilde{L}}{(1-\varepsilon_{0})^{2}\tau C\sigma^{2}\bar{\lambda}}, and C := 1-6\exp(-d)-2\exp(-\frac{C_{0}d}{4}).

Proof.

Let gkf(xk)g_{k}\in\partial f(x_{k}). Using the same argument as in the proof of Theorem 1, from (4) and (21), we obtain

𝔼uk𝔼Pk[rk+12]\displaystyle\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}[r_{k+1}^{2}] rk22αk𝔼uk𝔼Pk[h(k)(μkuk)h(k)(μkuk)2μkPkuk,xkx]+2L2αk2cn(d+4)2(n+4)\displaystyle\leq r_{k}^{2}-2\alpha_{k}\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}\left[\left\langle\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}P_{k}u_{k},x_{k}-x^{*}\right\rangle\right]+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4) (28)
=(25)rk22αk𝔼Pk[hμk(k)(0),Pk(xkx)]+2L2αk2cn(d+4)2(n+4).\displaystyle\overset{\text{\eqref{eq:gradient of smoothing subspace function for u}}}{=}r_{k}^{2}-2\alpha_{k}\mathbb{E}_{P_{k}}\left[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle\right]+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4).

We reevaluate the second term on the right-hand side. Now, we evaluate the error between 𝔼uk[hμk(k)(0),u]\mathbb{E}_{u_{k}}[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),u\rangle] and gk,1nPku\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle for any uu. From relation (22) with x=xk+1nPku+μknPkukx=x_{k}+\frac{1}{\sqrt{n}}P_{k}u+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k} and y=xky=x_{k}, we have

f(xk+1nPku+μknPkuk)f(xk)+gk,1nPku+μknPkuk+τ2n(u+μkuk)PkH~(xk)Pk(u+μkuk).f(x_{k}+\frac{1}{\sqrt{n}}P_{k}u+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})\geq f(x_{k})+\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k}\rangle+\frac{\tau}{2n}(u+\mu_{k}u_{k})^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}(u+\mu_{k}u_{k}).

Taking the expectation with respect to uku_{k}, we obtain

hμk(k)(u)\displaystyle{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(u) f(xk)+𝔼uk[gk,1nPku+μknPkuk]+τ2nuPkH~(xk)Pku+𝔼uk[τμknukPkH~(xk)Pku]\displaystyle\geq f(x_{k})+\mathbb{E}_{u_{k}}\left[\left\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k}\right\rangle\right]+\frac{\tau}{2n}u^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u+\mathbb{E}_{u_{k}}\left[\frac{\tau\mu_{k}}{n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u\right] (29)
+𝔼uk[τμk22nukPkH~(xk)Pkuk]\displaystyle\hskip 100.0pt+\mathbb{E}_{u_{k}}\left[\frac{\tau\mu_{k}^{2}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u_{k}\right]
=𝔼uk[uk]=0f(xk)+gk,1nPku+τ2nuPkH~(xk)Pku+𝔼uk[τμk22nukPkH~(xk)Pkuk]\displaystyle\overset{\text{$\mathbb{E}_{u_{k}}[u_{k}]=0$}}{=}f(x_{k})+\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle+\frac{\tau}{2n}u^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u+\mathbb{E}_{u_{k}}\left[\frac{\tau\mu_{k}^{2}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u_{k}\right]
=(8)f(xk)+gk,1nPku+τ2nuPkH~(xk)Pku+τμk22ntr(PkH~(xk)Pk).\displaystyle\overset{\text{\eqref{eq:trace expectation1}}}{=}f(x_{k})+\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle+\frac{\tau}{2n}u^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u+\frac{\tau\mu_{k}^{2}}{2n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right).

First, we evaluate gk,1nPku\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle. Using 𝔼uk[ukuk]=Id\mathbb{E}_{u_{k}}[u_{k}u_{k}^{\top}]=I_{d}, we have

gk,1nPku\displaystyle\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle =𝔼ukgk,1nPkukuku\displaystyle=\mathbb{E}_{u_{k}}\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}u_{k}^{\top}u\rangle (30)
=𝔼uk[gk,1nPkukuk,u]\displaystyle=\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\right]
=𝔼uk[gk,1nPkukuk,u𝟏u+(uk)]+𝔼uk[gk,1nPkukuk,u𝟏u(uk)].\displaystyle=\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]+\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\mathbf{1}_{u}^{-}(u_{k})\right].\hskip 30.0pt

We compute both an upper bound and a lower bound for gk,1nPkuk\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle. From the convexity of ff, we have

f(xk+μknPkuk)f(xk)μkgk,1nPkuk.\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}\geq\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle. (31)

From relation (23) with x=xk+μknPkukx=x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k} and y=xky=x_{k}, we have

gk,1nPkuk+L~μk2nukPkH~(xk)Pkukf(xk+μknPkuk)f(xk)μk.\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle+\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u_{k}\geq\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}. (32)

Using these relations, we have

gk,1nPku\displaystyle\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle =(30)𝔼uk[gk,1nPkukuk,u𝟏u+(uk)]+𝔼uk[gk,1nPkukuk,u𝟏u(uk)]\displaystyle\overset{\text{\eqref{eq:evalueta gTPu}}}{=}\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]+\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\mathbf{1}_{u}^{-}(u_{k})\right]
(31),(32)𝔼uk[f(xk+μknPkuk)f(xk)μkuk,u𝟏u+(uk)]𝔼uk[L~μk2nukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\overset{\text{\eqref{eq:upper bound1},\eqref{eq:lower bound1}}}{\geq}\mathbb{E}_{u_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]-\mathbb{E}_{u_{k}}\left[\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
+𝔼uk[f(xk+μknPkuk)f(xk)μkuk,u𝟏u(uk)]\displaystyle\hskip 100.0pt+\mathbb{E}_{u_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}\langle u_{k},u\rangle\mathbf{1}_{u}^{-}(u_{k})\right]
=𝔼uk[f(xk+μknPkuk)f(xk)μkuk,u]𝔼uk[L~μk2nukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle=\mathbb{E}_{u_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}\langle u_{k},u\rangle\right]-\mathbb{E}_{u_{k}}\left[\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
=𝔼uk[h(k)(μkuk)h(k)(0)μkuk,u]𝔼uk[L~μk2nukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle=\mathbb{E}_{u_{k}}\left[\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(0)}{\mu_{k}}\langle u_{k},u\rangle\right]-\mathbb{E}_{u_{k}}\left[\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
=(25)hμk(k)(0),u𝔼uk[L~μk2nukPkH~(x)Pkukuk,u𝟏u+(uk)].\displaystyle\overset{\text{\eqref{eq:gradient of smoothing subspace function for u}}}{=}\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),u\rangle-\mathbb{E}_{u_{k}}\left[\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]. (33)

Regarding the second term, using 𝔼uk[F(uk)]=𝔼uk[F(uk)]\mathbb{E}_{u_{k}}[F(u_{k})]=\mathbb{E}_{u_{k}}[F(-u_{k})] from the rotational invariance of the normal distribution, we have

𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right] =12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle=\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right] (34)
+12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\hskip 100.0pt+\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
=rotation invariant12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\overset{\text{rotation invariant}}{=}\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\hskip 100.0pt-\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(-u_{k})\right]
=12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle=\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u(uk)]\displaystyle\hskip 100.0pt-\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{-}(u_{k})\right]
=12𝔼uk[ukPkH~(x)Pkuk|uk,u|].\displaystyle=\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}|\langle u_{k},u\rangle|\right].

Now, we evaluate 𝔼uk[ukPkH~(x)Pkuk|uk,u|]\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}|\langle u_{k},u\rangle|\right]. From Hölder’s inequality and Lemma 2.2, we obtain

𝔼uk[ukPkH~(x)Pkuk|uk,u|]\displaystyle\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}|\langle u_{k},u\rangle|\right] Hölder’s ineq.𝔼uk[(ukPkH~(x)Pkuk)2]𝔼uk[|uk,u|2]\displaystyle\overset{\text{H\"{o}lder's ineq.}}{\leq}\sqrt{\mathbb{E}_{u_{k}}\left[(u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k})^{2}\right]}\sqrt{\mathbb{E}_{u_{k}}\left[|\langle u_{k},u\rangle|^{2}\right]}
=𝔼uk[(ukPkH~(x)Pkuk)2]𝔼uk[uukuku]\displaystyle=\sqrt{\mathbb{E}_{u_{k}}\left[(u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k})^{2}\right]}\sqrt{\mathbb{E}_{u_{k}}\left[u^{\top}u_{k}u_{k}^{\top}u\right]}
=𝔼uk[ukuk]=Id𝔼uk[(ukPkH~(x)Pkuk)2]u2\displaystyle\overset{\text{$\mathbb{E}_{u_{k}}[u_{k}u_{k}^{\top}]=I_{d}$}}{=}\sqrt{\mathbb{E}_{u_{k}}\left[(u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k})^{2}\right]}\sqrt{\|u\|^{2}}
(9)(tr(PkH~(x)Pk))2+2tr((PkH~(x)Pk)2)u.\displaystyle\overset{\text{\eqref{eq:trace expectation2}}}{\leq}\sqrt{(\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right))^{2}+2\mathrm{tr}\left({(P_{k}^{\top}\tilde{H}(x)P_{k})^{2}}\right)}\|u\|.

For any positive semidefinite matrix AA, tr(A2)=iλi(A)2(iλi(A))2=(tr(A))2\mathrm{tr}\left({A^{2}}\right)=\sum_{i}\lambda_{i}(A)^{2}\leq(\sum_{i}\lambda_{i}(A))^{2}=(\mathrm{tr}\left({A}\right))^{2} holds. Then, we have

𝔼uk[ukPkH~(x)Pkuk|uk,u|]3tr(PkH~(x)Pk)u.\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}|\langle u_{k},u\rangle|\right]\leq\sqrt{3}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|u\|. (35)

Finally, from (33),(34), and (35), we obtain

gk,1nPkuhμk(k)(0),u3L~μk4ntr(PkH~(x)Pk)u.{}\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle\geq\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),u\rangle-\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|u\|. (36)

Combining relations from (29) and (36), we obtain

hμk(k)(u)f(xk)+hμk(k)(0),u3L~μk4ntr(PkH~(x)Pk)u+τ2nuPkH~(xk)Pku+μk2τ2ntr(PkH~(xk)Pk).{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(u)\geq f(x_{k})+\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),u\rangle-\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|u\|+\frac{\tau}{2n}u^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u+\frac{\mu_{k}^{2}\tau}{2n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right).

By substituting Pk(xkx)-P_{k}^{\top}(x_{k}-x^{*}) for uu,

hμk(k)(Pk(xkx))f(xk)hμk(k)(0),Pk(xkx)3L~μk4ntr(PkH~(x)Pk)Pk(xkx)\displaystyle{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*}))\geq f(x_{k})-\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle-\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|P_{k}^{\top}(x_{k}-x^{*})\|
+τ2n(xkx)PkPkH~(xk)PkPk(xkx)+μk2τ2ntr(PkH~(xk)Pk)\displaystyle+\frac{\tau}{2n}(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*})+\frac{\mu_{k}^{2}\tau}{2n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right)

holds, and then, regarding the second term on the right-hand side of (28), we have

hμk(k)(0),Pk(xkx)f(xk)hμk(k)(Pk(xkx))3L~μk4ntr(PkH~(x)Pk)Pk(xkx)\displaystyle\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle\geq f(x_{k})-{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*}))-\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|P_{k}^{\top}(x_{k}-x^{*})\|
+τ2n(xkx)PkPkH~(xk)PkPk(xkx)+μk2τ2ntr(PkH~(xk)Pk).\displaystyle+\frac{\tau}{2n}(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*})+\frac{\mu_{k}^{2}\tau}{2n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right).\hskip 30.0pt (37)

After taking the expectation with respect to PkP_{k}, we evaluate the right-hand side. As for hμk(k)(Pk(xkx)){\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*})), we have

𝔼Pk[hμk(k)(Pk(xkx))]\displaystyle\mathbb{E}_{P_{k}}[{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*}))] =(24)𝔼Pk𝔼uk[f(xk1nPkPk(xkx)+μknPkuk)]\displaystyle\overset{\text{\eqref{eq:smoothing subspace function for u}}}{=}\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[f(x_{k}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top}(x_{k}-x^{*})+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})\right]
Lipschitz𝔼Pk[f(xk1nPkPk(xkx))]+𝔼Pk𝔼uk[LμknPkuk]\displaystyle\overset{\text{Lipschitz}}{\leq}\mathbb{E}_{P_{k}}\left[f(x_{k}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top}(x_{k}-x^{*}))\right]+\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[\frac{L\mu_{k}}{\sqrt{n}}\|P_{k}u_{k}\|\right]
Lipschitzf(x)+L𝔼Pk[(In1nPkPk)(xkx)]+𝔼Pk𝔼uk[LμknPkuk].\displaystyle\overset{\text{Lipschitz}}{\leq}f(x^{*})+L\mathbb{E}_{P_{k}}\left[\|(I_{n}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top})(x_{k}-x^{*})\|\right]+\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[\frac{L\mu_{k}}{\sqrt{n}}\|P_{k}u_{k}\|\right].

Applying Lemma 2.1, we have 𝔼Pk𝔼uk[Pkuk]=𝔼zk𝔼uk[zkuk]nd\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[\|P_{k}u_{k}\|\right]=\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\|z_{k}\|\|u_{k}\|\right]\leq\sqrt{nd} and

𝔼Pk[(In1nPkPk)(xkx)2]\displaystyle\mathbb{E}_{P_{k}}\left[\sqrt{\|(I_{n}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top})(x_{k}-x^{*})\|^{2}}\right] Jensen’s ineq.𝔼Pk[(In1nPkPk)(xkx)2]\displaystyle\overset{\text{Jensen's ineq.}}{\leq}\sqrt{\mathbb{E}_{P_{k}}\left[\|(I_{n}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top})(x_{k}-x^{*})\|^{2}\right]}
=𝔼Pk[xkx22nPk(xkx)2+1nPkPk(xkx)2]\displaystyle=\sqrt{\mathbb{E}_{P_{k}}\left[\|x_{k}-x^{*}\|^{2}-\frac{2}{\sqrt{n}}\|P_{k}^{\top}(x_{k}-x^{*})\|^{2}+\frac{1}{n}\|P_{k}P_{k}^{\top}(x_{k}-x^{*})\|^{2}\right]}
=𝔼Pk[PkPk]=dInxkx22dnxkx2+𝔼Pk[1nPkPk(xkx)2]\displaystyle\overset{\text{$\mathbb{E}_{P_{k}}[P_{k}P_{k}^{\top}]=dI_{n}$}}{=}\sqrt{\|x_{k}-x^{*}\|^{2}-\frac{2d}{\sqrt{n}}\|x_{k}-x^{*}\|^{2}+\mathbb{E}_{P_{k}}\left[\frac{1}{n}\|P_{k}P_{k}^{\top}(x_{k}-x^{*})\|^{2}\right]}
(12)12dn+2(d+4)(n+4)nxkx\displaystyle\overset{\text{\eqref{eq: PPT x}}}{\leq}\sqrt{1-2\frac{d}{\sqrt{n}}+2\frac{(d+4)(n+4)}{n}}~{}\|x_{k}-x^{*}\|
=2d+9+8dn+32n2dnxkx\displaystyle=\sqrt{2d+9+\frac{8d}{n}+\frac{32}{n}-\frac{2d}{\sqrt{n}}}~{}\|x_{k}-x^{*}\|
=2d+9+6dn+32n2d(1n1n)xkx\displaystyle{\color[rgb]{0,0,0}=\sqrt{2d+9+\frac{6d}{n}+\frac{32}{n}-2d\left(\frac{1}{\sqrt{n}}-\frac{1}{n}\right)}~{}\|x_{k}-x^{*}\|}
(d0,n1)8d+41xkx.\displaystyle{\color[rgb]{0,0,0}\overset{\text{$(d\geq 0,\;n\geq 1)$}}{\leq}\sqrt{8d+41}~{}\|x_{k}-x^{*}\|.}

Finally, we have

𝔼Pk[hμk(k)(Pk(xkx))]f(x)+Lμkd+L8d+41xkx.\mathbb{E}_{P_{k}}[{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*}))]\leq f(x^{*})+L\mu_{k}\sqrt{d}+L{\color[rgb]{0,0,0}\sqrt{8d+41}}~{}\|x_{k}-x^{*}\|. (38)

For 3L~μk4ntr(PkH~(x)Pk)Pk(xkx)\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|P_{k}^{\top}(x_{k}-x^{*})\| in (37), from μk1tr(PkH~(xk)Pk)\mu_{k}\leq\frac{1}{\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right)} and 𝔼Pk[Pk(xkx)]dxkx\mathbb{E}_{P_{k}}[\|P_{k}^{\top}(x_{k}-x^{*})\|]\leq\sqrt{d}\|x_{k}-x^{*}\|, we have

𝔼Pk[3L~μk4ntr(PkH~(x)Pk)Pk(xkx)]3dL~4nxkx.\mathbb{E}_{P_{k}}\left[\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|P_{k}^{\top}(x_{k}-x^{*})\|\right]\leq\frac{\sqrt{3d}\tilde{L}}{4n}\|x_{k}-x^{*}\|. (39)

Next, regarding (xkx)PkPkH~(xk)PkPk(xkx)(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*}), we have

𝔼Pk[(xkx)PkPkH~(xk)PkPk(xkx)]𝔼Pk[λmin(PkH~(xk)Pk)Pk(xkx)2].\displaystyle\mathbb{E}_{P_{k}}[(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*})]\geq\mathbb{E}_{P_{k}}[\lambda_{\min}(P_{k}^{\top}\tilde{H}(x_{k})P_{k})\|P_{k}^{\top}(x_{k}-x^{*})\|^{2}].

To apply the law of total expectation, we define $\mathcal{A}:=\{P\,|\,\lambda_{\min}(P^{\top}\tilde{H}(x_{k})P)\geq\frac{(1-\varepsilon_{0})^{2}n}{2}\sigma^{2}\bar{\lambda}\}$, $\mathcal{B}:=\{P\,|\,\|P^{\top}(x_{k}-x^{*})\|^{2}\geq\frac{d}{2}\|x_{k}-x^{*}\|^{2}\}$, and $X:=(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*})\geq 0$. Then, we have

𝔼Pk[X]=𝔼Pk[X|𝒜]P(𝒜)+𝔼Pk[X|𝒜¯](1P(𝒜))𝔼Pk[X|𝒜]P(𝒜).\mathbb{E}_{P_{k}}[X]=\mathbb{E}_{P_{k}}[X|\mathcal{A}\cap\mathcal{B}]P(\mathcal{A}\cap\mathcal{B})+\mathbb{E}_{P_{k}}[X|\overline{\mathcal{A}\cap\mathcal{B}}](1-P(\mathcal{A}\cap\mathcal{B}))\geq\mathbb{E}_{P_{k}}[X|\mathcal{A}\cap\mathcal{B}]P(\mathcal{A}\cap\mathcal{B}).

Applying Lemma 3.1 and Proposition 1, we have

P(\mathcal{A}\cap\mathcal{B})\geq 1-P(\overline{\mathcal{A}})-P(\overline{\mathcal{B}})\geq 1-6\exp{(-d)}-2\exp{(-\frac{C_{0}d}{4})}.

Letting $C:=1-6\exp{(-d)}-2\exp{(-\frac{C_{0}d}{4})}$, we have

𝔼Pk[X](1ε0)2Cdn4σ2λ¯xkx2.\mathbb{E}_{P_{k}}[X]\geq\frac{(1-\varepsilon_{0})^{2}Cdn}{4}\sigma^{2}\bar{\lambda}\|x_{k}-x^{*}\|^{2}. (40)

From (37), (38), (39), and (40), we obtain

𝔼Pk[hμk(k)(0),Pk(xkx)]f(xk)f(x)LμkdL8d+41xkx\displaystyle\mathbb{E}_{P_{k}}[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle]\geq f(x_{k})-f(x^{*})-L\mu_{k}\sqrt{d}-L{\color[rgb]{0,0,0}\sqrt{8d+41}}\|x_{k}-x^{*}\|
3dL~4nxkx+(1ε0)2τCd8σ2λ¯xkx2.\displaystyle-\frac{\sqrt{3d}\tilde{L}}{4n}\|x_{k}-x^{*}\|+\frac{(1-\varepsilon_{0})^{2}\tau Cd}{8}\sigma^{2}\bar{\lambda}\|x_{k}-x^{*}\|^{2}.\hskip 30.0pt (41)

To satisfy the condition:

L8d+41xkx3dL~4nxkx+(1ε0)2τCd8σ2λ¯xkx20,-L{\color[rgb]{0,0,0}\sqrt{8d+41}}\|x_{k}-x^{*}\|-\frac{\sqrt{3d}\tilde{L}}{4n}\|x_{k}-x^{*}\|+\frac{(1-\varepsilon_{0})^{2}\tau Cd}{8}\sigma^{2}\bar{\lambda}\|x_{k}-x^{*}\|^{2}\geq 0,

we need

xkxC18d+41d2+C2nd.{}\|x_{k}-x^{*}\|\geq C_{1}\sqrt{{\color[rgb]{0,0,0}\frac{8d+41}{d^{2}}}}+\frac{C_{2}}{n\sqrt{d}}. (42)

When xkx_{k} satisfies (42), we obtain

𝔼Pk[hμk(k)(0),Pk(xkx)]f(xk)f(x)Lμkd.\mathbb{E}_{P_{k}}[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle]\geq f(x_{k})-f(x^{*})-L\mu_{k}\sqrt{d}. (43)

Then, we have

𝔼uk𝔼Pk[rk+12]\displaystyle\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}[r_{k+1}^{2}] (28)rk22αk𝔼Pk[hμk(k)(0),Pk(xkx)]+2L2αk2cn(d+4)2(n+4)\displaystyle\overset{\eqref{eq:local first inequality}}{\leq}r_{k}^{2}-2\alpha_{k}\mathbb{E}_{P_{k}}[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle]+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4)
(43)rk22αk(f(xk)f(x))+2Lαkμkd+2L2αk2cn(d+4)2(n+4).\displaystyle\overset{\eqref{eq:new eval in local}}{\leq}r_{k}^{2}-2\alpha_{k}(f(x_{k})-f(x^{*}))+2L\alpha_{k}\mu_{k}\sqrt{d}+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4).

Taking the expectation with respect to 𝒰k\mathcal{U}_{k}, we have

𝔼𝒰k+1[rk+12]𝔼𝒰k[rk2]2αk(𝔼𝒰k[f(xk)]f(x))+2Lαkμkd+2L2αk2cn(d+4)2(n+4).\displaystyle\mathbb{E}_{\mathcal{U}_{k+1}}[r_{k+1}^{2}]\leq\mathbb{E}_{\mathcal{U}_{k}}[r_{k}^{2}]-2\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}[f(x_{k})]-f(x^{*}))+2L\alpha_{k}\mu_{k}\sqrt{d}+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4).\hskip 30.0pt (44)

If $\{x_{k}\}_{k=0,\ldots,N-1}$ all satisfy (42), then summing (44) over $k$ yields

k=0N1αk(𝔼𝒰k[f(xk)]f(x))12r02+Ldk=0N1αkμk+L2cn(d+4)2(n+4)k=0N1αk2.\sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}[f(x_{k})]-f(x^{*}))\leq\frac{1}{2}r_{0}^{2}+L\sqrt{d}\sum_{k=0}^{N-1}\alpha_{k}\mu_{k}+\frac{L^{2}}{cn}(d+4)^{2}(n+4)\sum_{k=0}^{N-1}\alpha_{k}^{2}.

When xkx_{k} does not satisfy (42) for some kk, i.e.,

xkx<C18d+41d2+C2nd,\|x_{k}-x^{*}\|<C_{1}\sqrt{{\color[rgb]{0,0,0}\frac{8d+41}{d^{2}}}}+\frac{C_{2}}{n\sqrt{d}},

from Lipschitz continuity of ff, we have

f(xk)f(x)LC18d+41d2+LC2nd.f(x_{k})-f(x^{*})\leq LC_{1}\sqrt{{\color[rgb]{0,0,0}\frac{8d+41}{d^{2}}}}+\frac{LC_{2}}{n\sqrt{d}}.

∎

With fixed αk=α\alpha_{k}=\alpha and μk=μ\mu_{k}=\mu, from (26) we obtain

min0iN1𝔼𝒰i[f(xi)]f(x)r022Nα+Ldμ+L2(n+4)(d+4)2cnα.\min_{0\leq i\leq N-1}\mathbb{E}_{\mathcal{U}_{i}}[f(x_{i})]-f(x^{*})\leq\frac{r_{0}^{2}}{2N\alpha}+L\sqrt{d}\mu+\frac{L^{2}(n+4)(d+4)^{2}}{cn}\alpha.

From this relation, the iteration complexity for achieving the inequality: min0iN1𝔼𝒰i[f(xi)]f(x)<ε\min_{0\leq i\leq N-1}\mathbb{E}_{\mathcal{U}_{i}}[f(x_{i})]-f(x^{*})<\varepsilon is

N=8r02L2(n+4)(d+4)2cnε2=O(d2ε2)N=\frac{8r_{0}^{2}L^{2}(n+4)(d+4)^{2}}{cn\varepsilon^{2}}=O\left(\frac{d^{2}}{\varepsilon^{2}}\right) (45)

with

α=cnr0L(d+4)2(n+4)N,με2Ld.\alpha=\frac{\sqrt{cn}r_{0}}{L(d+4)\sqrt{2(n+4)N}},\;\mu\leq\frac{\varepsilon}{2L\sqrt{d}}. (46)

From (27), when $LC_{1}\sqrt{\frac{8d+41}{d^{2}}}\leq\frac{\varepsilon}{2}$ and $\frac{LC_{2}}{n\sqrt{d}}\leq\frac{\varepsilon}{2}$ hold, we obtain $f(x_{k})-f(x^{*})\leq\varepsilon$; the reduced dimension $d$ must then satisfy $d=\Omega(\varepsilon^{-2})$. Comparing the global and local iteration complexities, the local behavior is better when the original dimension $n$ satisfies $n=\Theta(\varepsilon^{-p})$ with $p>4$: with reduced dimension $d=\Theta(\varepsilon^{-2})$, the global iteration complexity is $O(n\varepsilon^{-2})=O(\varepsilon^{-p-2})$, while the local iteration complexity is $O(d^{2}\varepsilon^{-2})=O(\varepsilon^{-6})$, which is smaller than $O(\varepsilon^{-p-2})$.
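For concreteness, here is a hypothetical numerical instance of this regime (the values are chosen purely for illustration and do not correspond to any experiment in this paper): take $\varepsilon=10^{-1}$ and $p=6$, so that $n=\Theta(\varepsilon^{-6})=\Theta(10^{6})$ and $d=\Theta(\varepsilon^{-2})=\Theta(10^{2})$. The global bound is then $O(n\varepsilon^{-2})=O(10^{8})$ iterations, while the local bound is $O(d^{2}\varepsilon^{-2})=O(10^{6})$ iterations, i.e., the local bound improves on the global one by a factor of $n/d^{2}=10^{2}$.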

Remark 2.

Our algorithm uses only the product $P_{k}u_{k}$; it never uses $P_{k}$ and $u_{k}$ separately. This is one of the most interesting aspects of using random projections in this work. We consider that the advantage of $P_{k}u_{k}$ comes from the relation

𝔼Pk[minudh(k)(u)=f(xk+1nPku)]𝔼vk[minαf(xk+αvk)],\mathbb{E}_{P_{k}}[\min_{u\in\mathbb{R}^{d}}{\color[rgb]{0,0,0}h^{(k)}}(u)=f(x_{k}+\frac{1}{\sqrt{n}}P_{k}u)]\leq\mathbb{E}_{v_{k}}[\min_{\alpha\in\mathbb{R}}f(x_{k}+\alpha v_{k})],

where $v_{k}\in\mathbb{R}^{n}$ is a random vector whose entries are drawn from $\mathcal{N}(0,1)$. This relation holds because all entries of $P_{k}$ and $v_{k}$ follow $\mathcal{N}(0,1)$: when $d=1$, the left-hand problem minimizes $f$ over a random line through $x_{k}$ and is identical in distribution to the right-hand problem, and for $d>1$ it minimizes over a $d$-dimensional affine subspace containing such a line. Noticing that the function on the left-hand side is denoted by $h^{(k)}(u)$ in (16), we can regard (15) in Algorithm 2 as one iteration of Algorithm 1 (i.e., RGF [27]) applied to the problem on the left-hand side.

6 Experiments

6.1 Softmax Regression

In this section, we evaluate the performance of the proposed method (Algorithm 2) and compare it with RGF with central differences [27], described in Algorithm 1, by minimizing a softmax loss function with $L_{1}$ regularization:

minw1,,wc,b1,,bc1mi=1mlogexp(wyixi+byi)k=1cexp(wkxi+bk)+λk=1c(wk1+bk1),\min_{w_{1},...,w_{c},b_{1},...,b_{c}}-\frac{1}{m}\sum_{i=1}^{m}\log{\frac{\exp(w_{y_{i}}^{\top}x_{i}+b_{y_{i}})}{\sum_{k=1}^{c}\exp(w_{k}^{\top}x_{i}+b_{k})}}+\lambda\sum_{k=1}^{c}(\|w_{k}\|_{1}+\|b_{k}\|_{1}),

where $(x_{i},y_{i})$ denotes a data point with label $y_{i}\in\{1,2,\ldots,c\}$. We set the $L_{1}$-regularization parameter to $\lambda=10^{-6}$ and use reduced dimensions $d\in\{10,50,100\}$ for the proposed method. For both methods, we set the smoothing parameter $\mu_{k}=10^{-8}$ and use a fixed step size $\alpha_{k}=10^{i}\;(i\in\mathbb{N})$.
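To make this setup concrete, the following is a minimal Python sketch of one iteration of the proposed update applied to this objective. All names (softmax_loss_l1, zo_subspace_step) and parameter values are our own illustrative assumptions, not the code or settings used in the experiments; step-size tuning and data loading are omitted.

```python
import numpy as np

def softmax_loss_l1(theta, X, y, c, lam=1e-6):
    """L1-regularized softmax loss; theta stacks the class weights W (c x p) and biases b (c).
    Labels y are assumed to lie in {0, ..., c-1}."""
    m, p = X.shape
    W = theta[:p * c].reshape(c, p)
    b = theta[p * c:]
    scores = X @ W.T + b                                  # (m, c) logits
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(m), y].mean()
    return nll + lam * (np.abs(W).sum() + np.abs(b).sum())

def zo_subspace_step(f, x, alpha, mu, d, rng):
    """One iteration of the proposed update: draw P (n x d) and u (d), form the central-difference
    estimate of the derivative of h(u) = f(x + P u / sqrt(n)) at 0 along u, and move along P u."""
    n = x.size
    P = rng.standard_normal((n, d))
    u = rng.standard_normal(d)
    Pu = P @ u
    diff = f(x + (mu / np.sqrt(n)) * Pu) - f(x - (mu / np.sqrt(n)) * Pu)
    return x - alpha * diff / (2.0 * mu) * Pu

# Illustrative usage on synthetic data (the sizes and the step size are arbitrary).
rng = np.random.default_rng(0)
m, p, c = 200, 1000, 5
X, y = rng.standard_normal((m, p)), rng.integers(0, c, m)
x = np.zeros(p * c + c)
for _ in range(100):
    x = zo_subspace_step(lambda t: softmax_loss_l1(t, X, y, c), x,
                         alpha=1.0, mu=1e-8, d=50, rng=rng)
print(softmax_loss_l1(x, X, y, c))
```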

Table 1: Details of the datasets for softmax regression [7].
Name   | features | classes (c) | training size (m)
SCOTUS | 126,405  | 13          | 5,000
news20 | 62,061   | 20          | 15,935
Figure 1: Softmax loss function with $L_{1}$ regularization ((a) SCOTUS dataset, (b) news20 dataset).

Figure 1 shows the function values of RGF and the proposed method per iteration for the datasets listed in Table 1. From Figure 1, we observe that our algorithm converges faster than RGF after a sufficient number of iterations, while RGF reduces the objective function value more rapidly in the early iterations. This behavior might be consistent with the theoretical guarantees on the global and local worst-case complexities (Theorems 1 and 3, respectively), considering that the global complexity $O(\frac{n}{\varepsilon^{2}})$ is the same as that of RGF. However, it is difficult to see in the plots the difference between the coefficients of the local and global sublinear rates, i.e., that the local iteration complexity is $d^{2}/n$ times the global one. A possible reason is that the improvement of the local rate concerns the worst-case complexity, and the worst case is not necessarily attained in practice.

Regarding function values versus elapsed time, generating random matrices is time-consuming, so for general objective functions random projection offers no advantage in runtime: in this setting, the proposed method spends more time than RGF for the same number of iterations.

6.2 Adversarially Robust Logistic Regression

We consider the following adversarially robust optimization problem, which is studied in [24]:

minw,bmaxx~δg~(x~;w,b):=\displaystyle\min_{w,b}\max_{\|\tilde{x}\|\leq\delta}~{}\tilde{g}(\tilde{x};w,b):=
1mi=1mlog1(1+eyi(w(xi+x~)+b))+λ(w1+b1),\displaystyle-\frac{1}{m}\sum_{i=1}^{m}\log{\frac{1}{(1+e^{-y_{i}(w^{\top}(x_{i}+\tilde{x})+b)})}}+\lambda(\|w\|_{1}+\|b\|_{1}), (47)

where λ=107\lambda=10^{-7}. By letting θ=(w,b)\theta^{\top}=(w^{\top},b) and η=(x~,0)\eta^{\top}=(\tilde{x}^{\top},0), we can rewrite Problem (47) as minθf(θ)\min_{\theta}f(\theta), where

f(θ):=maxηgθ(θη),:={η:ηδ},\displaystyle f(\theta):=\max_{\eta\in\mathcal{H}}g_{\theta}(\theta^{\top}\eta),~{}~{}\mathcal{H}:=\{\eta:\|\eta\|\leq\delta\},
gθ(α):=1mi=1mlog1(1+eyi(α+wxi+b))+λ(w1+b1).\displaystyle g_{\theta}(\alpha):=-\frac{1}{m}\sum_{i=1}^{m}\log{\frac{1}{(1+e^{-y_{i}(\alpha+w^{\top}x_{i}+b)})}}+\lambda(\|w\|_{1}+\|b\|_{1}).

Note that the derivative of ff is difficult to compute due to the non-smoothness of ff in general.

In this formulation, our proposed method can take advantage of random projections. When we evaluate the function value $f(\theta_{k}+\frac{1}{\sqrt{n}}P_{k}u)$, it is necessary to solve

maxηgθk(θkη+u1nPkη).{}\max_{\eta\in\mathcal{H}}g_{\theta_{k}}(\theta_{k}^{\top}\eta+u^{\top}\frac{1}{\sqrt{n}}P_{k}^{\top}\eta). (48)

In this case, we solve the following approximated optimization problem:

max(α,β)𝒜gθk(α+uβ),\max_{(\alpha,\beta)\in\mathcal{A}}g_{\theta_{k}}(\alpha+u^{\top}\beta), (49)

where $\mathcal{A}=\{(\alpha,\beta)\,|\,\frac{\alpha^{2}}{\|\theta_{k}\|^{2}}+\|\beta\|^{2}\leq\delta^{2}\}$. We will explain that this approximated problem (49) is equivalent to the original problem (48) when a certain condition holds, and also show that this condition holds with high probability. First, we confirm that we can obtain $\eta$ with $\|\eta\|\leq\delta$ from a solution of the approximated problem (49). Let $(\alpha^{*},\beta^{*})$ denote an optimal solution of (49). When $d+1<n$, we can confirm the existence of $\eta$ satisfying $\alpha^{*}=\theta_{k}^{\top}\eta$ and $\beta^{*}=\frac{1}{\sqrt{n}}P_{k}^{\top}\eta$ by solving the linear system

z=Aη,A:=(θkθk1nPk),z:=(αθkβ).z^{*}=A\eta,\;A:=\left(\begin{array}[]{c}\frac{\theta_{k}^{\top}}{\|\theta_{k}\|}\\ \frac{1}{\sqrt{n}}P_{k}^{\top}\end{array}\right),\;z^{*}:=\left(\begin{array}[]{c}\frac{\alpha^{*}}{\|\theta_{k}\|}\\ \beta^{*}\end{array}\right). (50)

Since the row vectors of $A$ are linearly independent (almost surely, as $d+1<n$), the minimum-norm solution of (50) is $\eta=A^{\top}(AA^{\top})^{-1}z^{*}$, and then $\|\eta\|^{2}=(z^{*})^{\top}(AA^{\top})^{-1}z^{*}\leq\lambda_{\max}((AA^{\top})^{-1})\|z^{*}\|^{2}$ holds. This inequality implies that when $\|z^{*}\|\leq\sqrt{\lambda_{\min}(AA^{\top})}\,\delta$ holds, $\|\eta\|\leq\delta$ also holds. From $(\alpha^{*},\beta^{*})\in\mathcal{A}$, we have $\|z^{*}\|^{2}=\left(\frac{\alpha^{*}}{\|\theta_{k}\|}\right)^{2}+\|\beta^{*}\|^{2}\leq\delta^{2}$. Hence, if $\lambda_{\min}(AA^{\top})\geq 1$ holds, $\|z^{*}\|\leq\delta$ implies $\|\eta\|\leq\delta$. To establish this, we prove the following Lemma 5.

Lemma 5.

Let A:=(xx,1nP)A:=\left(\begin{array}[]{cc}\frac{x}{\|x\|},\frac{1}{\sqrt{n}}P\end{array}\right)^{\top}, where Pn×dP\in\mathbb{R}^{n\times d} is a random matrix whose entries are sampled from 𝒩(0,1)\mathcal{N}(0,1) and xnx\in\mathbb{R}^{n}. Then, λmin(AA)12(3+ε)dn+4dn\lambda_{\min}(AA^{\top})\geq 1-2(3+\varepsilon)\sqrt{\frac{d}{n}}+\frac{4d}{n} holds with probability at least 12exp(C0ε2d)exp(d/2)1-2\exp{(-C_{0}\varepsilon^{2}d)}-\exp{(-d/2)}.

Proof.

Let $w_{1}\in\mathbb{R}$ and $w_{2}\in\mathbb{R}^{d}$, and set $w=(w_{1},w_{2}^{\top})^{\top}$. From the variational characterization of the minimum eigenvalue, we have

λmin(AA)\displaystyle\lambda_{\min}(AA^{\top}) =minw=1wAAw=minw12+w22=1w12+2nxw1xPw2+1nw2PPw2\displaystyle=\min_{\|w\|=1}w^{\top}AA^{\top}w=\min_{w_{1}^{2}+\|w_{2}\|^{2}=1}w_{1}^{2}+\frac{2}{\sqrt{n}\|x\|}w_{1}x^{\top}Pw_{2}+\frac{1}{n}w_{2}^{\top}P^{\top}Pw_{2}
minw12+w22=1w122nx|w1|Pxw2+1nw2PPw2.\displaystyle\geq\min_{w_{1}^{2}+\|w_{2}\|^{2}=1}w_{1}^{2}-\frac{2}{\sqrt{n}\|x\|}|w_{1}|\|P^{\top}x\|\|w_{2}\|+\frac{1}{n}w_{2}^{\top}P^{\top}Pw_{2}.

Regarding the second term on the right-hand side, from Lemma 3.1 and w12+w22=1w_{1}^{2}+\|w_{2}\|^{2}=1, we obtain

2nx|w1|Pxw2Lemma 3.12(1+ε)dn|w1|w22(1+ε)dn\frac{2}{\sqrt{n}\|x\|}|w_{1}|\|P^{\top}x\|\|w_{2}\|\overset{\text{Lemma~{}\ref{lemma:properties of random matrix}.\ref{lemma:Johnson}}}{\leq}2(1+\varepsilon)\sqrt{\frac{d}{n}}|w_{1}|\|w_{2}\|\leq 2(1+\varepsilon)\sqrt{\frac{d}{n}}

with probability at least 12exp(C0ε2d)1-2\exp{(-C_{0}\varepsilon^{2}d)}. Then, we have

λmin(AA)minw12+w22=1(w12+1nw2PPw2)2(1+ε)dn.\lambda_{\min}(AA^{\top})\geq\min_{w_{1}^{2}+\|w_{2}\|^{2}=1}(w_{1}^{2}+\frac{1}{n}w_{2}^{\top}P^{\top}Pw_{2})-2(1+\varepsilon)\sqrt{\frac{d}{n}}.

The first term on the right-hand side equals the minimum eigenvalue of

(1001nPP),\left(\begin{array}[]{cc}1&0\\ 0&\frac{1}{n}P^{\top}P\end{array}\right),

and then, we have minw12+w22=1(w12+1nw2PPw2)=min{1,σd(1nP)2}\min_{w_{1}^{2}+\|w_{2}\|^{2}=1}(w_{1}^{2}+\frac{1}{n}w_{2}^{\top}P^{\top}Pw_{2})=\min\{1,\sigma_{d}(\frac{1}{\sqrt{n}}P)^{2}\}. Applying Lemma 3.3, σd(1nP)12dn\sigma_{d}(\frac{1}{\sqrt{n}}P)\geq 1-2\sqrt{\frac{d}{n}} holds with probability at least 1exp(d/2)1-\exp{(-d/2)}. Hence, we have

λmin(AA)(12dn)22(1+ε)dn=12(3+ε)dn+4dn\lambda_{\min}(AA^{\top})\geq\left(1-2\sqrt{\frac{d}{n}}\right)^{2}-2(1+\varepsilon)\sqrt{\frac{d}{n}}=1-2(3+\varepsilon)\sqrt{\frac{d}{n}}+\frac{4d}{n}

with probability at least $1-2\exp{(-C_{0}\varepsilon^{2}d)}-\exp{(-d/2)}$. ∎

When nn is large enough and d=o(n)d=o(n), from Lemma 5, λmin(AA)1\lambda_{\min}(AA^{\top})\gtrapprox 1 holds with high probability. Then, ηδ\|\eta\|\leq\delta holds with high probability.
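As an illustration of this recovery step, the following is a minimal Python sketch, assuming an optimal pair $(\alpha^{*},\beta^{*})$ of (49) is already available; recover_eta is an illustrative name, and the degenerate case of a singular $AA^{\top}$ is not handled.

```python
import numpy as np

def recover_eta(theta, P, alpha_star, beta_star):
    """Recover eta with theta^T eta = alpha_star and (1/sqrt(n)) P^T eta = beta_star
    as the minimum-norm solution eta = A^T (A A^T)^{-1} z* of the linear system (50)."""
    n = theta.size
    A = np.vstack([theta / np.linalg.norm(theta), P.T / np.sqrt(n)])   # (d+1) x n
    z = np.concatenate(([alpha_star / np.linalg.norm(theta)], beta_star))
    return A.T @ np.linalg.solve(A @ A.T, z)

# When lambda_min(A A^T) >= 1 (Lemma 5: high probability for d = o(n)),
# ||z*|| <= delta implies that the recovered eta satisfies ||eta|| <= delta,
# i.e., it is feasible for the original problem (48).
```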

Next, we confirm that the optimal value of the original problem (48) can be obtained from (49). Let $\eta^{*}$ denote a global solution of (48). When $(\theta_{k}^{\top}\eta^{*},\frac{1}{\sqrt{n}}P_{k}^{\top}\eta^{*})$ is contained in $\mathcal{A}$, (49) attains the same maximal value as the original problem (48). We now show that $(\theta_{k}^{\top}\eta^{*},\frac{1}{\sqrt{n}}P_{k}^{\top}\eta^{*})$ is contained in $\mathcal{A}$ with high probability. We have

(θkη)2θk2+1nPkη2Lemma 3.1(θkη)2θk2+d(1+ε)nη2(1+d(1+ε)n)η2\frac{(\theta_{k}^{\top}\eta^{*})^{2}}{\|\theta_{k}\|^{2}}+\frac{1}{n}\|P_{k}^{\top}\eta^{*}\|^{2}\overset{\text{Lemma~{}\ref{lemma:properties of random matrix}.\ref{lemma:Johnson}}}{\leq}\frac{(\theta_{k}^{\top}\eta^{*})^{2}}{\|\theta_{k}\|^{2}}+\frac{d(1+\varepsilon)}{n}\|\eta^{*}\|^{2}\leq\left(1+\frac{d(1+\varepsilon)}{n}\right)\|\eta^{*}\|^{2}

with probability at least $1-2\exp{(-C_{0}\varepsilon^{2}d)}$. Hence, with sufficiently large $n$ and $d=o(n)$, we have $\|A\eta^{*}\|\lessapprox\|\eta^{*}\|\leq\delta$, and $A\eta^{*}$ is contained in $\mathcal{A}$. Note that problem (48) is not a convex optimization problem even if $\mathcal{H}$ is a convex set, because it maximizes the convex function $g_{\theta_{k}}$ (equivalently, minimizes the concave function $-g_{\theta_{k}}$).

Table 2: Details of the datasets for logistic regression [7].
Name           | features  | classes (c) | training size (m)
news20(binary) | 1,355,191 | 2           | 19,996
random         | 1,000,000 | 2           | 100
Figure 2: Adversarially robust logistic regression, $f(x)$ vs. time ((a) random dataset, $\delta=10^{-2}$; (b) news20(binary) dataset, $\delta=10^{-3}$).

In the numerical experiments, we evaluate $f(x)$ by solving the inner maximization problem with the accelerated proximal gradient method until the norm of the generalized gradient is less than $10^{-7}$. In our proposed method, we evaluate $f(x+\frac{1}{\sqrt{n}}Pu)$ through the approximated optimization problem (49). Furthermore, we use multiple sampled directions per iteration in both the proposed method and RGF, i.e., we update $x_{k}$ by

\displaystyle x_{k+1}=x_{k}-\frac{\alpha_{k}}{\sqrt{l}}\sum_{i=1}^{l}\frac{h^{(k)}(\mu_{k}u_{k}^{(i)})-h^{(k)}(-\mu_{k}u_{k}^{(i)})}{2\mu_{k}}P_{k}u_{k}^{(i)},
\displaystyle x_{k+1}=x_{k}-\frac{\alpha_{k}}{\sqrt{l}}\sum_{i=1}^{l}\frac{f(x_{k}+\mu_{k}v_{k}^{(i)})-f(x_{k}-\mu_{k}v_{k}^{(i)})}{2\mu_{k}}v_{k}^{(i)},

respectively. We use one dataset from [7] and one randomly generated dataset. For the random dataset, we generate a matrix $X$ and a vector $w$ whose entries are sampled from $\mathcal{N}(0,1)$, and the labels are given by $y:=\mathbf{1}_{x\geq 0}(Xw+\varepsilon)$, where $\varepsilon$ is a noise vector sampled from $\mathcal{N}(0,I_{m}/100)$. In both methods, we set the smoothing parameter $\mu_{k}=10^{-8}$ and use a fixed step size $\alpha_{k}=10^{i}\;(i\in\mathbb{N})$.
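The following is a minimal Python sketch of this multi-direction update for the proposed method, under the same illustrative assumptions as the sketch in Section 6.1; the evaluation of $f$ via (49) is abstracted behind the callable f, and the RGF variant is obtained by replacing $P_{k}u_{k}^{(i)}$ with a full-dimensional Gaussian direction $v_{k}^{(i)}$.

```python
import numpy as np

def zo_subspace_step_multi(f, x, alpha, mu, d, l, rng):
    """Multi-direction subspace update: one P per iteration, l central-difference
    estimates along P u^(i), summed and scaled by 1/sqrt(l) as in the update above."""
    n = x.size
    P = rng.standard_normal((n, d))
    g = np.zeros(n)
    for _ in range(l):
        u = rng.standard_normal(d)
        Pu = P @ u
        diff = f(x + (mu / np.sqrt(n)) * Pu) - f(x - (mu / np.sqrt(n)) * Pu)
        g += diff / (2.0 * mu) * Pu
    return x - (alpha / np.sqrt(l)) * g
```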

Figure 2 shows the function values of RGF and the proposed method versus elapsed time for the datasets listed in Table 2. As Figure 2 shows, when evaluating $f(x)$ is time-consuming because it requires solving a maximization problem, our proposed method converges faster than RGF. This efficiency comes from the random projection technique, which reduces the function evaluation time.

7 Conclusion

We proposed a new zeroth-order method combining random projections with a smoothing approach for non-smooth convex optimization problems. While the proposed method achieves the $O(\frac{n}{\varepsilon^{2}})$ worst-case iteration complexity, which matches the standard result in the convex non-smooth setting, it converges locally with $O(\frac{d^{2}}{\varepsilon^{2}})$, which does not depend on the dimension $n$, under additional local assumptions on the objective. In numerical experiments, our method performed well when the function evaluation time can be reduced by random projection. As discussed in Section 6.1, since what we have shown in this paper is an improvement of the ``worst-case'' oracle complexity, the worst case is not necessarily attained when the algorithm is actually run. Indeed, many applications using zeroth-order methods have succeeded despite large-scale models and oracle complexities depending on the dimension $n$ [35, 26, 8, 32]. As future work, it would be interesting to investigate whether the iteration complexity of zeroth-order methods is unaffected by $n$ in practical use, or whether it can depend strongly on $n$ in some problem settings. We would also like to investigate the convergence rate of our algorithm in the non-smooth and non-convex setting.

Acknowledgement

This work was supported by JSPS KAKENHI Grant Number 23H03351 and JST ERATO Grant Number JPMJER1903.

8 Compliance with Ethical Standards

This work was partially supported by JSPS KAKENHI (23H03351) and JST ERATO (JPMJER1903). The authors declare that they have no conflict of interest.

References

  • Alexandrov [1939] A. D. Alexandrov. Almost everywhere existence of the second differential of a convex function and some properties of convex surfaces connected with it. Leningrad State Univ. Annals [Uchenye Zapiski] Math. Ser., 6:3, 1939.
  • Berglund et al. [2022] E. Berglund, S. Khirirat, and X. Wang. Zeroth-order randomized subspace Newton methods. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6002–6006. IEEE, 2022.
  • Bergstra et al. [2011] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
  • Beznosikov et al. [2020] A. Beznosikov, E. Gorbunov, and A. Gasnikov. Derivative-free method for composite optimization with applications to decentralized distributed optimization. IFAC-PapersOnLine, 53(2):4038–4043, 2020.
  • Bubeck et al. [2017] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, page 72–85, New York, 2017. Association for Computing Machinery.
  • Cartis and Roberts [2023] C. Cartis and L. Roberts. Scalable subspace methods for derivative-free nonlinear least-squares optimization. Mathematical Programming, 199(1-2):461–524, 2023.
  • Chang and Lin [2011] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
  • Choromanski et al. [2018] K. Choromanski, M. Rowland, V. Sindhwani, R. Turner, and A. Weller. Structured evolution with compact architectures for scalable policy optimization. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 970–978. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/choromanski18a.html.
  • Davidson and Szarek [2001] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. Handbook of the geometry of Banach spaces, 1(317-366):131, 2001.
  • Duchi et al. [2012] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
  • Duchi et al. [2015] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
  • Fuji et al. [2022] T. Fuji, P.-L. Poirion, and A. Takeda. Randomized subspace regularized newton method for unconstrained non-convex optimization. arXiv preprint arXiv:2209.04170, 2022.
  • Gasnikov et al. [2022] A. Gasnikov, A. Novitskii, V. Novitskii, F. Abdukhakimov, D. Kamzolov, A. Beznosikov, M. Takac, P. Dvurechensky, and B. Gu. The power of first-order smooth optimization for black-box non-smooth problems. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 7241–7265. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/gasnikov22a.html.
  • Gasnikov et al. [2016] A. V. Gasnikov, A. A. Lagunovskaya, I. N. Usmanova, and F. A. Fedorenko. Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex. Automation and Remote Control, 77:2018–2034, 2016.
  • Golovin et al. [2020] D. Golovin, J. Karro, G. Kochanski, C. Lee, X. Song, and Q. Zhang. Gradientless descent: High-dimensional zeroth-order optimization. In International Conference on Learning Representations, 2020.
  • Gower et al. [2019] R. Gower, D. Kovalev, F. Lieder, and P. Richtarik. RSN: Randomized subspace Newton. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Gratton et al. [2015] S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25(3):1515–1541, 2015.
  • Ilyas et al. [2018] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2137–2146. PMLR, 10–15 Jul 2018.
  • Karimireddy et al. [2018] S. P. Karimireddy, S. U. Stich, and M. Jaggi. Global linear convergence of Newton’s method without strong-convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.
  • Kolda et al. [2003] T. G. Kolda, R. M. Lewis, and V. Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM review, 45(3):385–482, 2003.
  • Kozak et al. [2021] D. Kozak, S. Becker, A. Doostan, and L. Tenorio. A stochastic subspace approach to gradient-free optimization in high dimensions. Computational Optimization and Applications, 79(2):339–368, 2021.
  • Kozak et al. [2023] D. Kozak, C. Molinari, L. Rosasco, L. Tenorio, and S. Villa. Zeroth-order optimization with orthogonal random directions. Mathematical Programming, 199(1-2):1179–1219, 2023.
  • Larson et al. [2019] J. Larson, M. Menickelly, and S. M. Wild. Derivative-free optimization methods. Acta Numerica, 28:287–404, 2019.
  • Madry et al. [2018] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
  • Magnus [1978] J. R. Magnus. The moments of products of quadratic forms in normal variables. Univ., Instituut voor Actuariaat en Econometrie, 1978.
  • Mania et al. [2018] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
  • Nesterov and Spokoiny [2017] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
  • Niculescu and Persson [2006] C. Niculescu and L.-E. Persson. Convex functions and their applications, volume 23. Springer, New York, 2006.
  • Qian et al. [2016] H. Qian, Y.-Q. Hu, and Y. Yu. Derivative-free optimization of high-dimensional non-convex functions by sequential random embeddings. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1946–1952. AAAI Press, 2016.
  • Rando et al. [2023] M. Rando, C. Molinari, L. Rosasco, and S. Villa. An optimal structured zeroth-order algorithm for non-smooth optimization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=SfdkS6tt81.
  • Roberts and Royer [2023] L. Roberts and C. W. Royer. Direct search based on probabilistic descent in reduced spaces. SIAM Journal on Optimization, 33(4):3057–3082, 2023.
  • Salimans et al. [2017] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • Shamir [2017] O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1):1703–1713, 2017.
  • Vershynin [2018] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, Cambridge, 2018.
  • Ye et al. [2018] H. Ye, Z. Huang, C. Fang, C. J. Li, and T. Zhang. Hessian-aware zeroth-order optimization for black-box adversarial attack. arXiv preprint arXiv:1812.11377, 2018.
  • Yue et al. [2023] P. Yue, L. Yang, C. Fang, and Z. Lin. Zeroth-order optimization with weak dimension dependency. In G. Neu and L. Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 4429–4472. PMLR, 12–15 Jul 2023.