
Zeroth-order Random Subspace Algorithm for Non-smooth Convex Optimization

Ryota Nozawa ([email protected]), Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan
Pierre-Louis Poirion ([email protected]), Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
Akiko Takeda ([email protected]), Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan; Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
Abstract

Zeroth-order optimization, which does not use derivative information, is one of the significant research areas in the field of mathematical optimization and machine learning. Although various studies have explored zeroth-order algorithms, one of the theoretical limitations is that the oracle complexity depends on the dimension, i.e., on the number of variables, of the optimization problem. In this paper, to reduce this dependency on the dimension in the oracle complexity, we propose a zeroth-order random subspace algorithm by combining a gradient-free algorithm (specifically, Gaussian randomized smoothing with central differences) with random projection. We derive the worst-case oracle complexity of our proposed method in non-smooth and convex settings; it is equivalent to standard results for full-dimensional non-smooth convex algorithms. Furthermore, we prove that our method also has a local convergence rate independent of the original dimension under additional assumptions. In addition to the theoretical results, numerical experiments show that when an objective function has a specific structure, the proposed method can become experimentally more efficient thanks to random projection.

Keywords: zeroth-order optimization, random projection, convex optimization, oracle complexity

1 Introduction

We consider the following unconstrained optimization problem:

\min_{x\in\mathbb{R}^{n}} f(x), \qquad (1)

where the objective function f is possibly non-smooth, but it is convex. Throughout this paper, we assume that f is L-Lipschitz continuous, i.e.,

|f(x)-f(y)|\leq L\|x-y\|_{2}

holds for all x and y. Furthermore, we assume that (1) is a zeroth-order optimization problem, which means that only the function values f(x) are accessible and the derivatives of f are unavailable, or that the calculation of ∇f is expensive. Zeroth-order optimization has many applications such as bandit problems [5], adversarial attacks [18], and hyperparameter tuning [3]. Additionally, there are various types of optimization methods [23] for zeroth-order optimization, such as random search and model-based methods, and these methods have been widely studied in both smooth settings [35, 2, 22, 36] and non-smooth settings [14, 30, 11, 13, 4, 33].

However, one of the theoretical limitations of zeroth-order algorithms is that the oracle complexity depends on the dimension n. For example, Duchi et al. [11] and Gasnikov et al. [14] proposed variants of mirror descent algorithms. Duchi et al. [11] prove under the zeroth-order setting that a lower bound on the oracle complexity required for their method to find an ε-approximate solution is O(n/ε²) for the case of linear losses. Gasnikov et al. [14] analyze their method on the unit simplex S_n = {x | x_i ≥ 0, Σ_{i=1}^n x_i = 1} and prove that the oracle complexity is O(n/ε²). On the other hand, Nesterov and Spokoiny [27] proposed a method using Gaussian smoothing, which is defined by

f_{\mu}(x):=\mathbb{E}_{u}\left[f(x+\mu u)\right].

By approximating the gradient of the smoothed function f_μ by the forward difference (f(x+μu)-f(x))/μ · u or the central difference (f(x+μu)-f(x-μu))/(2μ) · u, they obtain the oracle complexity under different settings; concretely, they obtain a complexity of order O(n²/ε²) when the objective function is non-smooth and convex. Later, several papers using random smoothing [33, 13, 30] were able to improve the oracle complexity to O(n/ε²) using the central difference. An oracle complexity of O(n/ε²) is a natural result in zeroth-order optimization, since standard subgradient methods for non-smooth convex functions require O(1/ε²) iterations and the approximation of a gradient by finite differences needs O(n) function evaluations.
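For concreteness, the two single-sample gradient estimators discussed above can be sketched in a few lines of NumPy; the test function, point, and smoothing radius below are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def forward_diff_estimator(f, x, mu, rng):
    # single-sample estimator (f(x + mu*u) - f(x)) / mu * u with u ~ N(0, I_n)
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x)) / mu * u

def central_diff_estimator(f, x, mu, rng):
    # single-sample estimator (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

rng = np.random.default_rng(0)
f = lambda x: np.linalg.norm(x, 1)   # a non-smooth convex test function (our choice)
x = np.ones(50)
g = np.mean([central_diff_estimator(f, x, 1e-3, rng) for _ in range(10000)], axis=0)
# averaging many samples approximates the gradient of the smoothed function f_mu at x
```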

To overcome this dependence on the dimension n, several studies [36, 15, 29] assume that f has a low-dimensional structure. For example, Yue et al. [36] use a notion of effective dimensionality and prove that the oracle complexity depends on the effective dimension ED_α := sup_x Σ_{i=1}^n σ_i(∇²f(x))^α, where σ_i denotes a singular value, rather than on n. When the objective function is convex with L_1-Lipschitz gradient and H-Lipschitz Hessian, their algorithm is shown to have an oracle complexity in which the n appearing in the oracle complexity of [27] is replaced by ED_{1/2}. In practice, the effective dimension is often significantly smaller than the dimension n. In such a case, the oracle complexity is improved under convex and Lipschitz-gradient settings.

An alternative approach to reduce the dependency on the dimension in the complexity is to use random projections [6, 22, 2, 31]. Cartis and Roberts [6] combine random projections with a model-based derivative-free method, which approximates the objective function by interpolation. In their approach, in each iteration they work with f(x_k + P_k u) as a function of a smaller-sized variable u ∈ ℝ^d, constructed with a random matrix P_k ∈ ℝ^{n×d} and the current iterate x_k, instead of the original function f(x). Using random projection theory, they prove that when the objective is smooth and non-convex, their method reduces the dimension dependence of the oracle complexity from O(n³/ε²) to O(n²/ε²) for finding an ε-stationary point. Kozak et al. [22] consider a zeroth-order variant of [21], approximating the exact gradient by the randomly projected gradient. They obtain iteration complexities under various parameter choices and assumptions in the smooth setting. In subsequent work, Rando et al. [30] propose a variant of [22] and obtain an oracle complexity of order O(n/ε²) in the non-smooth setting. Berglund et al. [2] propose a zeroth-order variant of the randomized subspace Newton method [16] and prove an iteration complexity in the strongly convex case. Roberts and Royer [31] propose a subspace variant of the direct search methods [17, 20]. They obtain convergence guarantees under a wide range of subspace projections and directions, and show that their randomized methods give better results than the original full-dimensional ones. However, to the best of our knowledge, there is no research that reduces the dependency on the dimension n in the oracle complexity for non-smooth functions.

1.1 Main Contribution

In this paper, we aim to reduce the dependency on the dimension in the worst-case oracle complexity by employing random projections, specifically under a non-smooth and convex setting. We propose an algorithm which combines Gaussian smoothing using central differences [27] and random projections. Our idea is to apply the Gaussian smoothing to the objective function restricted to a subspace, i.e., h^{(k)}(u) := f(x_k + P_k u), instead of the original function f(x). We prove that our algorithm achieves an oracle complexity of O(n/ε²) globally, which is the standard result under the non-smooth and convex setting. Moreover, under additional local assumptions on f, we prove an oracle complexity of O(d²/ε²) locally, where d is the dimension of the random subspace defined by P_k. This indicates that by choosing d much smaller than √n, our proposed method improves the local oracle complexity.

We can summarize our contribution as follows.

  • We propose a zeroth-order random subspace algorithm, obtained by applying a random projection technique to a Gaussian smoothing algorithm, for non-smooth convex optimization problems.

  • Our algorithm achieves an oracle complexity of O(n/ε²) globally, and also has a local convergence rate independent of the original dimension under additional assumptions.

  • Our numerical experiments show that, for objective functions with a specific structure, the proposed method performs well thanks to random projection.

1.2 Related Works on L_2 randomized smoothing

Recently, L_2 randomized smoothing has been actively studied for smoothing a non-smooth function f [11, 13, 33]. The random variable u that defines f_μ is assumed to be a random vector uniformly distributed on a ball with center 0 and radius μ, instead of a normally distributed Gaussian vector. For L_2 randomized smoothing, Shamir [33] proposed an algorithm for bandit convex optimization and showed the optimal rate for convex Lipschitz functions using central differences. Gasnikov et al. [13] proposed a generic approach that combines smoothing and first-order methods. They show that the approach achieves the optimal rate of zeroth-order methods and can utilize various techniques from first-order methods for non-smooth zeroth-order optimization.

In fact, our proposed method can be modified so as to use L_2 randomized smoothing instead of the Gaussian one, and the theoretical guarantees are essentially the same for both methods (see Remark 1). In any case, since we use properties of Gaussian random matrices to reduce the dimension of the problem, we adopt Gaussian smoothing in this paper for the simplicity of our discussion.

1.3 Organization

In Section 2, we introduce some properties of the smoothing function and of random matrices and vectors used in our analysis. In Section 3, we present our algorithm, and in Section 4 we prove global convergence with oracle complexity O(n/ε²). In Section 5, we prove local convergence with oracle complexity O(d²/ε²). In Section 6, we show numerical results and demonstrate that when the objective has a structure suitable for random projections, our method converges faster than existing methods by reducing the function evaluation time.

2 Preliminaries

2.1 Notation

x^* denotes one of the optimal solutions of (1). Let 𝔼_X denote the expectation with respect to a random variable X, and let 𝒩(u, Σ) denote the normal distribution with mean u and covariance Σ. I_d denotes the identity matrix of size d.

We use ‖·‖ for the Euclidean norm and ‖·‖_{ψ₂} for the sub-Gaussian norm of a sub-Gaussian random variable, which is defined by

\|X\|_{\psi_{2}}=\inf\{s>0 \mid \mathbb{E}_{X}[\exp(X^{2}/s^{2})]\leq 2\}.

From the property of the sub-Gaussian norm, ‖X‖_{ψ₂} ≤ C is equivalent to

\mathrm{Prob}(|X|\geq t)\leq 2\exp(-ct^{2}/C^{2}), \qquad (2)

where c is an absolute constant. Let λ_i(A) denote the i-th largest eigenvalue of a matrix A, and let 𝟏_𝒳(x) denote the indicator function defined by

\mathbf{1}_{\mathcal{X}}(x)=\begin{cases}1 & (x\in\mathcal{X}),\\ 0 & (\mathrm{otherwise}).\end{cases}

In particular, when 𝒳 = {x | ⟨x, u⟩ ≥ 0} or 𝒳 = {x | ⟨x, u⟩ < 0} for some u, we write 𝟏_u^+(x) or 𝟏_u^-(x), respectively. ∂f(x) denotes the sub-differential of f at x.

2.2 Gaussian Smoothing Function

In this subsection, we introduce the definition of Gaussian smoothing function and recall its properties.

Definition 1.

(e.g., [27]) The Gaussian smoothing of f: ℝⁿ → ℝ is defined by

f_{\mu}(x):=\mathbb{E}_{u}[f(x+\mu u)], \qquad (3)

where u ∼ 𝒩(0, I_n) and μ is some positive constant.

It is well known that if f is convex, then f_μ is also convex. As derived from Definition 1, the gradient ∇f_μ can be calculated by the following:

\mathbb{E}_{u}\left[\frac{f(x+\mu u)}{\mu}u\right]=\mathbb{E}_{u}\left[\frac{f(x+\mu u)-f(x)}{\mu}u\right]=\mathbb{E}_{u}\left[\frac{f(x+\mu u)-f(x-\mu u)}{2\mu}u\right]=\nabla f_{\mu}(x). \qquad (4)
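As a quick sanity check of identity (4), one can verify it numerically for a function whose smoothed gradient is known in closed form; the quadratic test function and the sample sizes below are our own illustrative choices, not taken from the paper.

```python
import numpy as np

# For f(x) = 0.5 * ||x||^2 one has f_mu(x) = 0.5 * ||x||^2 + 0.5 * mu^2 * n,
# hence grad f_mu(x) = x, so the central-difference estimator should average to x.
rng = np.random.default_rng(1)
n, mu = 20, 0.1
f = lambda x: 0.5 * np.dot(x, x)
x = rng.standard_normal(n)

m = 100000
est = np.zeros(n)
for _ in range(m):
    u = rng.standard_normal(n)
    est += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
est /= m
print(np.max(np.abs(est - x)))   # small (Monte-Carlo error only)
```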

We can evaluate the error bound between f and f_μ when f is convex and Lipschitz continuous.

Lemma 1.

[27] If a function f: ℝⁿ → ℝ is convex and L-Lipschitz continuous, then

f(x)\leq f_{\mu}(x)\leq f(x)+\mu L\sqrt{n} \qquad (5)

holds for any positive μ.
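The bound (5) is easy to check empirically; the following Monte-Carlo sketch uses an illustrative 1-Lipschitz convex function f(x) = ‖x‖ (so L = 1) and dimensions chosen by us.

```python
import numpy as np

rng = np.random.default_rng(2)
n, mu, L = 100, 0.05, 1.0
f = lambda x: np.linalg.norm(x)      # convex and 1-Lipschitz
x = rng.standard_normal(n)

U = rng.standard_normal((100000, n))                 # samples of u ~ N(0, I_n)
f_mu = np.linalg.norm(x + mu * U, axis=1).mean()     # Monte-Carlo estimate of f_mu(x)
print(f(x) <= f_mu <= f(x) + mu * L * np.sqrt(n))    # expected: True
```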

2.3 Random Matrices and Vectors

In this subsection, we introduce some properties of Gaussian random matrices and vectors.

Lemma 2.

Let u ∈ ℝ^d be sampled from 𝒩(0, I_d).

  1. [27, Lemma 1] For p ∈ [0, 2], we have

     \mathbb{E}_{u}\left[\|u\|^{p}\right]\leq d^{p/2}. \qquad (6)

     If p ≥ 2, then we have the two-sided bounds

     d^{p/2}\leq\mathbb{E}_{u}\left[\|u\|^{p}\right]\leq(d+p)^{p/2}. \qquad (7)

  2. [25] Let A ∈ ℝ^{d×d} be a symmetric matrix. Then,

     \mathbb{E}_{u}[u^{\top}Au]=\mathrm{tr}(A), \qquad (8)
     \mathbb{E}_{u}[(u^{\top}Au)^{2}]=(\mathrm{tr}(A))^{2}+2\,\mathrm{tr}(A^{2}) \qquad (9)

     hold.

  3. [34] Consider an L-Lipschitz function f: ℝⁿ → ℝ. Then,

     \|f(u)-\mathbb{E}_{u}\left[f(u)\right]\|_{\psi_{2}}\leq CL

     holds, where C is an absolute constant.

In particular, when p = 2, the following relationship is derived from simple calculations:

\mathbb{E}_{u}\left[\|u\|^{2}\right]=d. \qquad (10)
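The moment bounds and trace identities above are easy to confirm numerically; this short check uses a dimension, a sample size, and a random symmetric matrix of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 10, 500000
U = rng.standard_normal((m, d))                 # rows are samples of u ~ N(0, I_d)
sq = np.sum(U**2, axis=1)                       # ||u||^2 for each sample

print(np.mean(sq))                              # ~ d, matching (10)
print(d**2 <= np.mean(sq**2) <= (d + 4)**2)     # two-sided bound (7) with p = 4

A = rng.standard_normal((d, d)); A = (A + A.T) / 2     # a symmetric test matrix
quad = np.einsum('ij,jk,ik->i', U, A, U)               # u^T A u for each sample
print(np.mean(quad), np.trace(A))                      # (8): the two values should be close
print(np.mean(quad**2), np.trace(A)**2 + 2 * np.trace(A @ A))   # (9): likewise
```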

From Lemma 2.3, we can evaluate the sub-Gaussian norm ‖f(u) − 𝔼_u[f(u)]‖_{ψ₂} for random variables u ∼ 𝒩(0, μ²I_n) as follows.

Corollary 1.

Consider a random vector u ∼ 𝒩(0, I_n) and an L-Lipschitz function f: ℝⁿ → ℝ. Then,

\|f(\mu u)-\mathbb{E}_{u}\left[f(\mu u)\right]\|_{\psi_{2}}\leq C\mu L

holds.

From Corollary 1 and the property of the sub-Gaussian norm (2), we have

\mathrm{Prob}(|f(\mu u)-\mathbb{E}_{u}\left[f(\mu u)\right]|\geq t)\leq 2\exp\left(-\frac{ct^{2}}{\mu^{2}L^{2}}\right), \qquad (11)

where the constant c is independent of n. Next, we recall some properties of random matrices.

Lemma 3.

Let P ∈ ℝ^{n×d} (d < n) be a random matrix whose entries are independently drawn from 𝒩(0, 1).

  1. [34] Then for any x ∈ ℝⁿ and ε ∈ (0, 1), we have

     \mathrm{Prob}\left[(1-\varepsilon)\|x\|^{2}\leq\frac{1}{d}\|P^{\top}x\|^{2}\leq(1+\varepsilon)\|x\|^{2}\right]\geq 1-2\exp(-C_{0}\varepsilon^{2}d),

     where C_0 is an absolute constant.

  2. Then for any x ∈ ℝⁿ, we have

     \mathbb{E}_{P}\left[\|PP^{\top}x\|^{2}\right]\leq 2(n+4)(d+4)\|x\|^{2}. \qquad (12)

  3. [9, Theorem II.13] Let β = d/n with d ≤ n. Then for any t > 0,

     \mathrm{Prob}\left(\sigma_{\min}\left(\frac{1}{\sqrt{n}}P\right)\leq 1-\sqrt{\beta}-t\right)<\exp(-nt^{2}/2)

     holds, where σ_min denotes the minimum nonzero singular value of a matrix.
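Lemma 3.1 is a Johnson-Lindenstrauss-type concentration statement, and the following small experiment (with sizes chosen by us) illustrates it: the scaled projected norm rarely leaves the interval [(1−ε)‖x‖², (1+ε)‖x‖²].

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps, trials = 500, 100, 0.5, 1000
x = rng.standard_normal(n)

hits = 0
for _ in range(trials):
    P = rng.standard_normal((n, d))
    ratio = np.linalg.norm(P.T @ x) ** 2 / (d * np.linalg.norm(x) ** 2)
    hits += (1 - eps) <= ratio <= (1 + eps)
print(hits / trials)   # close to 1, consistent with the 1 - 2 exp(-C0 eps^2 d) bound
```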

Proof of Lemma 3.2.

We define P = (u_1, u_2, ..., u_n)^⊤, where u_i is a d-dimensional vector. Then, we have (PP^⊤)_{ij} = u_i^⊤ u_j, and therefore,

\mathbb{E}_{P}\left[\|PP^{\top}x\|^{2}\right]
=\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\sum^{n}_{k=1}\sum^{n}_{l=1}u_{l}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}x_{l}\right]
=\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\sum^{n}_{k=1}u_{k}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}^{2}+\sum^{n}_{i=1}\sum^{n}_{k=1}\sum^{n}_{l\neq k}u_{l}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}x_{l}\right]
=\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\|u_{i}\|^{4}x_{i}^{2}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}u_{k}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}^{2}+\sum^{n}_{i=1}\sum^{n}_{l\neq i}u_{l}^{\top}u_{i}u_{i}^{\top}u_{i}x_{i}x_{l}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}\sum^{n}_{l\neq k}u_{l}^{\top}u_{i}u_{i}^{\top}u_{k}x_{k}x_{l}\right].

Regarding the third and fourth terms on the right-hand side, since the index l is not equal to the other indices i, k, and given that 𝔼_{u_l}[u_l] = 0, we have 𝔼_{u_1,..,u_n}[u_l^⊤ u_i u_i^⊤ u_i x_i x_l] = 0 and 𝔼_{u_1,..,u_n}[u_l^⊤ u_i u_i^⊤ u_k x_k x_l] = 0. Similarly, regarding the second term on the right-hand side, since the index i is not equal to k and 𝔼_{u_i}[u_i u_i^⊤] = I, we have 𝔼_{u_i}[u_k^⊤ u_i u_i^⊤ u_k x_k²] = ‖u_k‖² x_k². Hence, we obtain

\mathbb{E}_{P}\left[\|PP^{\top}x\|^{2}\right]
=\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\|u_{i}\|^{4}x_{i}^{2}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}\|u_{k}\|^{2}x_{k}^{2}\right]
\overset{(10)}{=}\mathbb{E}_{u_{1},..,u_{n}}\left[\sum^{n}_{i=1}\|u_{i}\|^{4}x_{i}^{2}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}dx_{k}^{2}\right]
\overset{(7)}{\leq}\sum^{n}_{i=1}(d+4)^{2}x_{i}^{2}+\sum^{n}_{i=1}\sum^{n}_{k\neq i}dx_{k}^{2}
=((d+4)^{2}+d(n-1))\|x\|^{2}
\overset{(d\leq n,\;1\leq n)}{\leq}((d+4)(n+4)+(d+4)(n+4))\|x\|^{2}
=2(n+4)(d+4)\|x\|^{2}.

∎∎

Remark 1.

L_2 randomized smoothing is defined by (3) with the random vector u sampled from the uniform distribution on the sphere of radius 1 (i.e., ‖u‖₂ = 1), and it is known in [10, 13] to have a dimension-independent upper bound for f_μ(x):

f(x)\leq f_{\mu}(x)\leq f(x)+\mu L. \qquad (13)

The difference between (5) and (13) comes from the norm of the random vectors. In terms of upper bounds on f_μ(x), there is no essential difference between these two smoothing methods.

Indeed, our oracle complexity analysis also holds when the random vector u is sampled from the uniform distribution on the sphere of radius √n. We can prove Lemmas 1 and 2 for a vector u uniformly sampled on the sphere of radius √n. For example,

f(x)\leq\mathbb{E}_{u}f(x+\mu u)\leq f(x)+\mu L\sqrt{n}

holds from [10, 13]. Instead of Lemma 2.1, we can directly state ‖u‖ = √n, and equation (8) also holds. For equation (9), 𝔼_u[(u^⊤Au)²] = (tr(A))² holds. Lemma 2.3 holds for the uniform distribution on the sphere from [34, Theorem 5.1.4]. Then, the arguments in the following sections hold when we use these lemmas instead of the Gaussian versions described in this section.
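For completeness, sampling u uniformly on the sphere of radius √n, as used in this remark, can be sketched by normalizing a Gaussian vector; the dimension below is an arbitrary choice of ours.

```python
import numpy as np

def sample_sphere(n, rng):
    # uniform distribution on the sphere of radius sqrt(n): normalize a Gaussian draw
    g = rng.standard_normal(n)
    return np.sqrt(n) * g / np.linalg.norm(g)

rng = np.random.default_rng(5)
u = sample_sphere(10, rng)
print(np.linalg.norm(u) ** 2)   # equals n = 10 up to floating-point error
```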

3 Proposed Algorithm

In this section, we describe the Randomized Gradient-Free method (RGF) [27] and our proposed method. RGF, a random search method, updates x_k by (14), where g_μ(x, u) denotes the approximation of the gradient at x along a random direction u. The method can use forward differences or central differences for g_μ(x, u), i.e.,

g_{\mu}(x,u)=\frac{f(x+\mu u)-f(x)}{\mu}u, \quad\mbox{or}\quad g_{\mu}(x,u)=\frac{f(x+\mu u)-f(x-\mu u)}{2\mu}u,

respectively. For a convex and Lipschitz continuous objective function, RGF using forward differences achieves an iteration complexity of O(n²/ε²) [27], and RGF using central differences achieves one of O(n/ε²) [33]. For our proposed method, we confirm that using central differences attains a better oracle complexity than using forward differences.

Algorithm 1 Randomized gradient-free method (RGF) [27]
Require: x_0, {α_k}, {μ_k}
  for k = 0, 1, 2, ... do
    Sample u_k from 𝒩(0, I_n)
    x_{k+1} = x_k − α_k g_{μ_k}(x_k, u_k)   (14)
  end for
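A minimal NumPy sketch of Algorithm 1 with central differences is given below; the fixed step size, smoothing radius, iteration count, and test objective are illustrative choices of ours.

```python
import numpy as np

def rgf(f, x0, alpha, mu, n_iters, rng):
    x = x0.copy()
    for _ in range(n_iters):
        u = rng.standard_normal(x.shape)                     # u_k ~ N(0, I_n)
        g = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u   # central-difference g_mu(x, u)
        x = x - alpha * g                                    # update (14)
    return x

rng = np.random.default_rng(6)
f = lambda x: np.linalg.norm(x, 1)                           # non-smooth convex test function
x_hat = rgf(f, x0=np.ones(50), alpha=1e-3, mu=1e-4, n_iters=20000, rng=rng)
print(f(x_hat))   # much smaller than f(x0) = 50
```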

Combining RGF and random projections, we propose Algorithm 2.

Algorithm 2 Subspace randomized gradient-free method
Require: x_0, {α_k}, {μ_k}, d
  for k = 0, 1, 2, ... do
    Get a random matrix P_k ∈ ℝ^{n×d} whose entries are sampled from 𝒩(0, 1)
    Set h^{(k)}(u) := f(x_k + (1/√n) P_k u)
    Sample u_k from 𝒩(0, I_d)
    x_{k+1} = x_k − α_k (h^{(k)}(μ_k u_k) − h^{(k)}(−μ_k u_k)) / (2μ_k) · P_k u_k   (15)
  end for
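The following is a minimal NumPy sketch of Algorithm 2; the fixed step size and smoothing radius, the subspace dimension d, and the test objective are illustrative choices of ours, not settings used in the paper's experiments.

```python
import numpy as np

def subspace_rgf(f, x0, alpha, mu, d, n_iters, rng):
    x = x0.copy()
    n = x0.size
    for _ in range(n_iters):
        P = rng.standard_normal((n, d))              # P_k with i.i.d. N(0, 1) entries
        u = rng.standard_normal(d)                   # u_k ~ N(0, I_d)
        h = lambda v: f(x + P @ v / np.sqrt(n))      # h^{(k)}(v), cf. (16)
        g = (h(mu * u) - h(-mu * u)) / (2 * mu)      # central difference along the subspace
        x = x - alpha * g * (P @ u)                  # update (15)
    return x

rng = np.random.default_rng(7)
f = lambda x: np.linalg.norm(x, 1)
x_hat = subspace_rgf(f, x0=np.ones(50), alpha=1e-3, mu=1e-4, d=5, n_iters=20000, rng=rng)
print(f(x_hat))   # decreases well below f(x0) = 50
```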

We define h^{(k)}(u) as the restriction of f to the random subspace,

h^{(k)}(u):=f\left(x_{k}+\frac{1}{\sqrt{n}}P_{k}u\right), \qquad (16)

where P_k ∈ ℝ^{n×d} is a random matrix whose elements are sampled from 𝒩(0, 1). Lemma 4 shows that taking the expectation with respect to the random matrix yields smoothing functions.

Lemma 4.
\mathbb{E}_{P_{k}}[h^{(k)}(\mu_{k}u_{k})]=\mathbb{E}_{P_{k}}[h^{(k)}(-\mu_{k}u_{k})]=f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}), \qquad (17)
\mathbb{E}_{P_{k}}\left[\frac{h^{(k)}(\mu_{k}u_{k})}{\mu_{k}}P_{k}u_{k}\right]=\mathbb{E}_{P_{k}}\left[-\frac{h^{(k)}(-\mu_{k}u_{k})}{\mu_{k}}P_{k}u_{k}\right]=\frac{\|u_{k}\|^{2}}{\sqrt{n}}\nabla f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}) \qquad (18)

hold.

Proof.

Using the rotational invariance of the normal distribution, we have that 𝔼_{P_k}[F(P_k)] = 𝔼_{P_k}[F(−P_k)] holds for any function F. Using this property, we obtain

\mathbb{E}_{P_{k}}\left[h^{(k)}(\mu_{k}u_{k})\right]=\mathbb{E}_{P_{k}}\left[f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})\right]=\mathbb{E}_{P_{k}}\left[f(x_{k}-\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})\right]=\mathbb{E}_{P_{k}}\left[h^{(k)}(-\mu_{k}u_{k})\right].

Notice that the distribution of P_k u_k is given by 𝒩(0, ‖u_k‖²I_n). We can therefore replace P_k u_k by ‖u_k‖ z_k, where z_k is sampled from 𝒩(0, I_n). Then, we obtain

\mathbb{E}_{P_{k}}[h^{(k)}(\mu_{k}u_{k})]=\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]=f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}). \qquad (19)

Similarly, we have

\mathbb{E}_{P_{k}}\left[h^{(k)}(\mu_{k}u_{k})P_{k}u_{k}\right]=\mathbb{E}_{P_{k}}\left[f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})P_{k}u_{k}\right]=-\mathbb{E}_{P_{k}}\left[f(x_{k}-\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})P_{k}u_{k}\right]=-\mathbb{E}_{P_{k}}\left[h^{(k)}(-\mu_{k}u_{k})P_{k}u_{k}\right],

and

\mathbb{E}_{P_{k}}\left[\frac{h^{(k)}(\mu_{k}u_{k})}{\mu_{k}}P_{k}u_{k}\right]=\mathbb{E}_{z_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})}{\mu_{k}}\|u_{k}\|z_{k}\right]=\frac{\|u_{k}\|^{2}}{\sqrt{n}}\mathbb{E}_{z_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})}{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}z_{k}\right]\overset{(4)}{=}\frac{\|u_{k}\|^{2}}{\sqrt{n}}\nabla f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}).

∎∎
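Identity (17) can also be checked by simulation; the sketch below, with an arbitrary test function and problem sizes of our own choosing, compares a Monte-Carlo estimate of 𝔼_{P_k}[h^{(k)}(μ_k u_k)] with a direct Monte-Carlo estimate of the Gaussian smoothing f_{μ_k‖u_k‖/√n}(x_k).

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, mu, m = 30, 5, 0.3, 50000
f = lambda y: np.linalg.norm(y)        # an arbitrary Lipschitz convex test function
x = rng.standard_normal(n)             # plays the role of x_k
u = rng.standard_normal(d)             # a fixed u_k

# left-hand side of (17): average of f(x + (mu/sqrt(n)) P u) over random P
lhs = np.mean([f(x + mu / np.sqrt(n) * (rng.standard_normal((n, d)) @ u))
               for _ in range(m)])
# right-hand side of (17): Gaussian smoothing of f at x with radius mu*||u||/sqrt(n)
r = mu * np.linalg.norm(u) / np.sqrt(n)
rhs = np.mean([f(x + r * rng.standard_normal(n)) for _ in range(m)])
print(lhs, rhs)   # the two estimates agree up to Monte-Carlo error
```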

4 Global Convergence

In this section, we prove global convergence of our proposed method for convex and Lipschitz continuous functions. We define 𝒰_k := (u_0, P_0, ..., u_{k−1}, P_{k−1}).

Assumption 1.

f is convex and L-Lipschitz continuous, i.e.,

|f(x)-f(y)|\leq L\|x-y\|

holds for all x and y.

Theorem 1.

Suppose that Assumption 1 holds. Let the sequence {x_k} be generated by Algorithm 2. Then for any N ≥ 1,

\sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}[f(x_{k})]-f(x^{*}))\leq\frac{\sqrt{n}}{2d}\|x_{0}-x^{*}\|^{2}+\frac{L(d+3)^{3/2}}{d}\sum_{k=0}^{N-1}\mu_{k}\alpha_{k}+\frac{L^{2}(d+4)^{2}(n+4)}{cd\sqrt{n}}\sum_{k=0}^{N-1}\alpha_{k}^{2}

holds.

Proof.

We define rk:=xkxr_{k}:=\|x_{k}-x^{*}\|. From (15), we have

rk+12=rk22αkh(k)(μkuk)h(k)(μkuk)2μkPkuk,xkx\displaystyle r^{2}_{k+1}=r^{2}_{k}-2\alpha_{k}\langle\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}P_{k}u_{k},x_{k}-x^{*}\rangle
+αk2(h(k)(μkuk)h(k)(μkuk)2μk)2Pkuk2.\displaystyle+\alpha_{k}^{2}\left(\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}\right)^{2}\|P_{k}u_{k}\|^{2}. (20)

Taking the expectation with respect to uku_{k} and PkP_{k}, we then evaluate the second and third terms. Regarding the second term on the right-hand side of (4), we have

𝔼uk𝔼Pk[h(k)(μkuk)h(k)(μkuk)2μkPkuk,xkx]\displaystyle\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}\left[\left\langle\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}P_{k}u_{k},x_{k}-x^{*}\right\rangle\right] =(18)𝔼uk[uk2nfμkukn(xk),xkx]\displaystyle\overset{\eqref{eq:subspace gradient of smoothing function}}{=}\mathbb{E}_{u_{k}}\left[\left\langle\frac{\|u_{k}\|^{2}}{\sqrt{n}}\nabla f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k}),x_{k}-x^{*}\right\rangle\right]
convexity𝔼uk[uk2n(fμkukn(xk)fμkukn(x))]\displaystyle\overset{\text{convexity}}{\geq}\mathbb{E}_{u_{k}}\left[\frac{\|u_{k}\|^{2}}{\sqrt{n}}(f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x_{k})-f_{\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}}(x^{*}))\right]
(5)𝔼uk[uk2n(f(xk)f(x)Lμkuk)]\displaystyle\overset{\eqref{eq:error smoothing function}}{\geq}\mathbb{E}_{u_{k}}\left[\frac{\|u_{k}\|^{2}}{\sqrt{n}}(f(x_{k})-f(x^{*})-L\mu_{k}\|u_{k}\|)\right]
=(10)dn(f(xk)f(x))Lμk𝔼uk[uk3]n1/2\displaystyle\overset{\text{\eqref{eq:squared norm expectation}}}{=}\frac{d}{\sqrt{n}}(f(x_{k})-f(x^{*}))-\frac{L\mu_{k}\mathbb{E}_{u_{k}}[\|u_{k}\|^{3}]}{n^{1/2}}
(7)dn(f(xk)f(x))Lμk(d+3)3/2n1/2.\displaystyle\overset{\text{\eqref{eq:bound of expectation of norm2}}}{\geq}\frac{d}{\sqrt{n}}(f(x_{k})-f(x^{*}))-\frac{L\mu_{k}(d+3)^{3/2}}{n^{1/2}}.

For the third term on the right-hand side of (4), from Pkuk=ukzk(zk𝒩(0,In))P_{k}u_{k}=\|u_{k}\|z_{k}\;(z_{k}\sim\mathcal{N}(0,I_{n})), we have

γ\displaystyle\gamma :=𝔼Pk𝔼uk[(h(k)(μkuk)h(k)(μkuk)2μk)2Pkuk2]\displaystyle:=\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[\left(\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}\right)^{2}\|P_{k}u_{k}\|^{2}\right]
=(19)14μk2𝔼zk𝔼uk[(f(xk+μkuknzk)f(xkμkuknzk))2uk2zk2]\displaystyle\overset{\text{\eqref{eq:smoothing subspace function}}}{=}\frac{1}{4\mu_{k}^{2}}\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-f(x_{k}-\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right)^{2}\|u_{k}\|^{2}\|z_{k}\|^{2}\right]
=14μk2𝔼zk𝔼uk[((f(xk+μkuknzk)β)+(βf(xkμkuknzk)))2uk2zk2].\displaystyle=\frac{1}{4\mu_{k}^{2}}\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\left((f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\beta)+(\beta-f(x_{k}-\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k}))\right)^{2}\|u_{k}\|^{2}\|z_{k}\|^{2}\right].

By applying the inequality (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2}, we obtain

γ12μk2𝔼zk𝔼uk[((f(xk+μkuknzk)β)2+(βf(xkμkuknzk))2)uk2zk2].\gamma\leq\frac{1}{2\mu_{k}^{2}}\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\left(\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\beta\right)^{2}+\left(\beta-f(x_{k}-\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right)^{2}\right)\|u_{k}\|^{2}\|z_{k}\|^{2}\right].

From the rotational invariance of zkz_{k}, we have

γ12μk2𝔼zk𝔼uk[((f(xk+μkuknzk)β)2+(βf(xk+μkuknzk))2)uk2zk2].\gamma\leq\frac{1}{2\mu_{k}^{2}}\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\left(\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\beta\right)^{2}+\left(\beta-f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right)^{2}\right)\|u_{k}\|^{2}\|z_{k}\|^{2}\right].

Selecting β=𝔼zk[f(xk+μkuknzk)]\beta=\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right], we have

γ\displaystyle\gamma 1μk2𝔼uk[uk2𝔼zk[(f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])2zk2]]\displaystyle\leq\frac{1}{\mu_{k}^{2}}\mathbb{E}_{u_{k}}\left[\|u_{k}\|^{2}\mathbb{E}_{z_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{2}\|z_{k}\|^{2}\right]\right]
Hölder’s ineq.1μk2𝔼uk[uk2𝔼zk[(f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])4]𝔼zk[zk4]].\displaystyle\overset{\text{H\"{o}lder's ineq.}}{\leq}\frac{1}{\mu_{k}^{2}}\mathbb{E}_{u_{k}}\left[\|u_{k}\|^{2}\sqrt{\mathbb{E}_{z_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{4}\right]}\sqrt{\mathbb{E}_{z_{k}}[\|z_{k}\|^{4}]}\right].

We evaluate 𝔼zk[(f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])4]\mathbb{E}_{z_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{4}\right] using Corollary 1. Note that for any non-negative random variables XX, 𝔼X[X]=0Prob(Xt)dt\mathbb{E}_{X}[X]=\int_{0}^{\infty}\mathrm{Prob}(X\geq t)\mathrm{d}t holds. Using this relation, we have

𝔼zk[(f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])4]\displaystyle\mathbb{E}_{z_{k}}\left[\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{4}\right]
=0Prob((f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)])4t)dt\displaystyle\hskip 100.0pt=\int_{0}^{\infty}\mathrm{Prob}\left(\left(f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right)^{4}\geq t\right)\mathrm{d}t
=0Prob(|f(xk+μkuknzk)𝔼zk[f(xk+μkuknzk)]|t1/4)dt\displaystyle\hskip 100.0pt=\int_{0}^{\infty}\mathrm{Prob}\left(\left|f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})-\mathbb{E}_{z_{k}}\left[f(x_{k}+\frac{\mu_{k}\|u_{k}\|}{\sqrt{n}}z_{k})\right]\right|\geq t^{1/4}\right)\mathrm{d}t
(11)02exp(cnt1/2μk2uk2L2)dt\displaystyle\hskip 100.0pt\overset{\text{\eqref{eq:gaussian concentration prob}}}{\leq}\int_{0}^{\infty}2\exp{\left(-\frac{cnt^{1/2}}{\mu_{k}^{2}\|u_{k}\|^{2}L^{2}}\right)}\mathrm{d}t
=μk4uk4L4c2n202exp(T1/2)dT\displaystyle\hskip 100.0pt=\frac{\mu_{k}^{4}\|u_{k}\|^{4}L^{4}}{c^{2}n^{2}}\int_{0}^{\infty}2\exp{\left(-T^{1/2}\right)}\mathrm{d}T
=4μk4uk4L4c2n2,\displaystyle\hskip 100.0pt=\frac{4\mu_{k}^{4}\|u_{k}\|^{4}L^{4}}{c^{2}n^{2}},

where the last equality follows from 0exp(T1/2)dT=2\int_{0}^{\infty}\exp{\left(-T^{1/2}\right)}\mathrm{d}T=2. Then, we obtain

γ\displaystyle\gamma 1μk2𝔼uk[uk24μk4uk4L4c2n2𝔼zk[zk4]]\displaystyle\leq\frac{1}{\mu_{k}^{2}}\mathbb{E}_{u_{k}}\left[\|u_{k}\|^{2}\sqrt{\frac{4\mu_{k}^{4}\|u_{k}\|^{4}L^{4}}{c^{2}n^{2}}}\sqrt{\mathbb{E}_{z_{k}}[\|z_{k}\|^{4}]}\right]
=2L2cn𝔼uk[uk4]𝔼zk[zk4]\displaystyle=\frac{2L^{2}}{cn}\mathbb{E}_{u_{k}}[\|u_{k}\|^{4}]\sqrt{\mathbb{E}_{z_{k}}[\|z_{k}\|^{4}]}
(7)2L2(d+4)2(n+4)cn.\displaystyle\overset{\text{\eqref{eq:bound of expectation of norm2}}}{\leq}\frac{2L^{2}(d+4)^{2}(n+4)}{cn}. (21)

Therefore, we obtain

𝔼uk𝔼Pk[rk+12]rk22dαkn(f(xk)f(x))+2Lμkαk(d+3)3/2n1/2+2L2αk2(d+4)2(n+4)cn.\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}[r_{k+1}^{2}]\leq r_{k}^{2}-\frac{2d\alpha_{k}}{\sqrt{n}}(f(x_{k})-f(x^{*}))+\frac{2L\mu_{k}\alpha_{k}(d+3)^{3/2}}{n^{1/2}}+\frac{2L^{2}\alpha_{k}^{2}(d+4)^{2}(n+4)}{cn}.

Taking the expectation with respect to 𝒰k\mathcal{U}_{k}, we obtain

𝔼𝒰k+1[rk+12]𝔼𝒰k[rk2]2dαkn(𝔼𝒰k[f(xk)]f(x))+2Lμkαk(d+3)3/2n1/2+2L2αk2(d+4)2(n+4)cn.\mathbb{E}_{\mathcal{U}_{k+1}}[r_{k+1}^{2}]\leq\mathbb{E}_{\mathcal{U}_{k}}[r_{k}^{2}]-\frac{2d\alpha_{k}}{\sqrt{n}}(\mathbb{E}_{\mathcal{U}_{k}}\left[f(x_{k})\right]-f(x^{*}))+\frac{2L\mu_{k}\alpha_{k}(d+3)^{3/2}}{n^{1/2}}+\frac{2L^{2}\alpha_{k}^{2}(d+4)^{2}(n+4)}{cn}.

Summing up these inequalities from k=0k=0 and k=N1k=N-1, we obtain

𝔼𝒰N[rN2]r022dnk=0N1αk(𝔼𝒰k[f(xk)]f(x))+2L(d+3)3/2n1/2k=0N1μkαk+2L2(d+4)2(n+4)cnk=0N1αk2.\mathbb{E}_{\mathcal{U}_{N}}[r_{N}^{2}]\leq r_{0}^{2}-\frac{2d}{\sqrt{n}}\sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}\left[f(x_{k})\right]-f(x^{*}))+\frac{2L(d+3)^{3/2}}{n^{1/2}}\sum_{k=0}^{N-1}\mu_{k}\alpha_{k}+\frac{2L^{2}(d+4)^{2}(n+4)}{cn}\sum_{k=0}^{N-1}\alpha_{k}^{2}.

Therefore,

k=0N1αk(𝔼𝒰k[f(xk)]f(x))n2dr02+L(d+3)3/2dk=0N1μkαk+L2(d+4)2(n+4)cndk=0N1αk2.\sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}\left[f(x_{k})\right]-f(x^{*}))\leq\frac{\sqrt{n}}{2d}r_{0}^{2}+\frac{L(d+3)^{3/2}}{d}\sum_{k=0}^{N-1}\mu_{k}\alpha_{k}+\frac{L^{2}(d+4)^{2}(n+4)}{c\sqrt{n}d}\sum_{k=0}^{N-1}\alpha_{k}^{2}.

∎∎

With fixed α_k = α and μ_k = μ, we obtain

\min_{0\leq i\leq N-1}\mathbb{E}_{\mathcal{U}_{i}}[f(x_{i})]-f(x^{*})\leq\frac{\sqrt{n}}{2dN\alpha}r_{0}^{2}+\frac{L(d+3)^{3/2}}{d}\mu+\frac{L^{2}(d+4)^{2}(n+4)}{cd\sqrt{n}}\alpha.

From this relation, the oracle complexity for achieving the inequality \min_{0\leq i\leq N-1}\mathbb{E}_{\mathcal{U}_{i}}[f(x_{i})]-f(x^{*})\leq\varepsilon is

N=\frac{8r_{0}^{2}L^{2}(n+4)(d+4)^{2}}{c^{2}d^{2}\varepsilon^{2}}=O\left(\frac{n}{\varepsilon^{2}}\right)

with

\alpha=\frac{\sqrt{cn}\,r_{0}}{L(d+4)\sqrt{2(n+4)N}},\quad\mu\leq\frac{\varepsilon d}{2L(d+3)^{3/2}}.

These parameters are obtained from the relations \frac{\sqrt{n}}{2dN\alpha}r_{0}^{2}=\frac{L^{2}(d+4)^{2}(n+4)}{cd\sqrt{n}}\alpha=\frac{\varepsilon}{4} and \frac{L(d+3)^{3/2}}{d}\mu\leq\frac{\varepsilon}{2}.

Note that in each iteration, our algorithm calculates the function value twice. Therefore, the oracle complexity is equal to twice the iteration complexity.
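As an illustration of how these quantities scale, the helper below evaluates N, α, and μ from the expressions above; in practice r_0, L, and the absolute constant c are unknown, so the values passed here are assumptions made only for the sake of the example.

```python
import numpy as np

def theorem1_parameters(n, d, eps, L, r0, c):
    # iteration count and parameters suggested by the discussion above
    N = 8 * r0**2 * L**2 * (n + 4) * (d + 4)**2 / (c**2 * d**2 * eps**2)
    alpha = np.sqrt(c * n) * r0 / (L * (d + 4) * np.sqrt(2 * (n + 4) * N))
    mu = eps * d / (2 * L * (d + 3)**1.5)        # upper bound on the smoothing radius
    return int(np.ceil(N)), alpha, mu

# hypothetical problem constants, chosen only to show the scaling
print(theorem1_parameters(n=1000, d=10, eps=0.1, L=1.0, r0=1.0, c=1.0))
```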

5 Local Convergence

In this section, we prove local convergence of our proposed method under some local assumptions on f.

5.1 Assumptions

Assumption 2.

We have that d = o(n), and d → ∞ as n → ∞.

Next we consider the following local assumptions on ff.

Assumption 3.

There exist a neighborhood B^* of x^* and an Alexandrov matrix \tilde{H}(x) that satisfy the following properties.

  (i) There exist constants \tilde{L} and τ, and a subgradient g ∈ ∂f(y), such that for all x, y ∈ B^*:

      f(x)\geq f(y)+\langle g,x-y\rangle+\frac{\tau}{2}(x-y)^{\top}\tilde{H}(y)(x-y), \qquad (22)
      f(x)\leq f(y)+\langle g,x-y\rangle+\frac{\tilde{L}}{2}(x-y)^{\top}\tilde{H}(y)(x-y). \qquad (23)

  (ii) There exist constants σ ∈ (0, 1) and \bar{\lambda} > 0 such that λ_{σn}(\tilde{H}(x)) ≥ \bar{\lambda} holds for all x ∈ B^*.

When the objective function is twice differentiable, we set g = ∇f(y) and \tilde{H}(y) = ∇²f(y). In the smooth setting, (22) and (23) in Assumption 3(i) are called relative convexity and relative smoothness [16, 19], respectively, and several functions satisfy this property (e.g., the logistic function, the Wasserstein distance). This assumption implies that the objective function f is τ-strongly convex and \tilde{L}-smooth under the semi-norm ‖·‖_{\tilde{H}(x)} for all x, y ∈ B^*.

While non-smooth objective functions do not necessarily have gradients and Hessians at some points, we can show that when f is convex, f is twice differentiable almost everywhere.

Theorem 2.

[1, 28] Every convex function f: ℝⁿ → ℝ is twice differentiable almost everywhere in the following sense: f is twice differentiable at z with Alexandrov Hessian \tilde{H}(z) = \tilde{H}^{\top}(z) ⪰ 0, if ∇f(z) exists, and if for every ε > 0 there exists δ > 0 such that ‖x − z‖ ≤ δ implies

\sup_{y\in\partial f(x)}\|y-\nabla f(z)-\tilde{H}(z)(x-z)\|\leq\varepsilon\|x-z\|.

Assumption 3(i) is inspired by Theorem 2, because there exists \tilde{H}(x) almost everywhere such that

\lim_{h\to 0}\frac{f(x+h)-f(x)-\langle\nabla f(x),h\rangle-\frac{1}{2}h^{\top}\tilde{H}(x)h}{\|h\|^{2}}=0.

Note that the subgradient g and the matrix \tilde{H}(x) are used only in the analysis, not in our algorithm.

In the following subsection, we assume the above two assumptions in addition to Assumption 1. The theoretical results in this section can only hold locally around an optimal solution x^*. Indeed, we assume Lipschitz continuity in Assumption 1 and relative convexity in Assumption 3(i). This implies that f(x) − f(x^*) = O(‖x − x^*‖) and f(x) − f(x^*) = Ω(‖x − x^*‖²) hold, and these assumptions therefore conflict as ‖x − x^*‖ → ∞.

5.2 Local Theoretical Guarantees

We define h^{(k)}_{μ_k}(u) as the smoothing function on the random subspace:

h^{(k)}_{\mu_{k}}(u):=\mathbb{E}_{u_{k}}[h^{(k)}(u+\mu_{k}u_{k})]=\mathbb{E}_{u_{k}}\left[f\left(x_{k}+\frac{1}{\sqrt{n}}P_{k}(u+\mu_{k}u_{k})\right)\right]. \qquad (24)

From (4), we have

\nabla h^{(k)}_{\mu_{k}}(u)=\mathbb{E}_{u_{k}}\left[\frac{h^{(k)}(u+\mu_{k}u_{k})-h^{(k)}(u)}{\mu_{k}}u_{k}\right]=\mathbb{E}_{u_{k}}\left[\frac{h^{(k)}(u+\mu_{k}u_{k})-h^{(k)}(u-\mu_{k}u_{k})}{2\mu_{k}}u_{k}\right]. \qquad (25)

Under some assumptions, we can show that P^⊤\tilde{H}(x)P is a positive definite matrix with high probability.

Proposition 1.

[12, Proposition 5.4] Let 0 < ε_0 < 1. Then under Assumptions 1, 2, and 3(ii), there exists n_0 ∈ ℕ (which depends only on ε_0 and σ) such that if n ≥ n_0, then for any x ∈ B^*,

P^{\top}\tilde{H}(x)P\succeq\frac{(1-\varepsilon_{0})^{2}n}{2}\sigma^{2}\bar{\lambda}I_{d}

holds with probability at least 1 − 6exp(−d).

Now, we prove local convergence of our proposed method.

Theorem 3.

Let 0 < ε_0 < 1 and n_0 be as defined in Proposition 1, and the sequence {x_k} be generated by Algorithm 2. Suppose that Assumptions 1, 2, and 3 hold. If n ≥ n_0, x_k ∈ B^*, and μ_k ≤ 1/tr(P_k^⊤\tilde{H}(x_k)P_k) hold for any k ≥ 0, then for any N ≥ 1, at least one of the following holds:

  1. \sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}[f(x_{k})]-f(x^{*}))\leq\frac{1}{2}r_{0}^{2}+L\sqrt{d}\sum_{k=0}^{N-1}\alpha_{k}\mu_{k}+\frac{L^{2}(n+4)(d+4)^{2}}{cn}\sum_{k=0}^{N-1}\alpha_{k}^{2}, \qquad (26)

  2. \min_{0\leq k\leq N-1}f(x_{k})-f(x^{*})\leq LC_{1}\sqrt{\frac{8d+41}{d^{2}}}+\frac{LC_{2}}{n\sqrt{d}}, \qquad (27)

where C_1 := \frac{8L}{(1-\varepsilon_{0})^{2}\tau C\sigma^{2}\bar{\lambda}}, C_2 := \frac{2\sqrt{3}\tilde{L}}{(1-\varepsilon_{0})^{2}\tau C\sigma^{2}\bar{\lambda}}, and C := 1-6\exp(-d)-2\exp(-\frac{C_{0}d}{4}).

Proof.

Let gkf(xk)g_{k}\in\partial f(x_{k}). Using the same argument as in the proof of Theorem 1, from (4) and (21), we obtain

𝔼uk𝔼Pk[rk+12]\displaystyle\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}[r_{k+1}^{2}] rk22αk𝔼uk𝔼Pk[h(k)(μkuk)h(k)(μkuk)2μkPkuk,xkx]+2L2αk2cn(d+4)2(n+4)\displaystyle\leq r_{k}^{2}-2\alpha_{k}\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}\left[\left\langle\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(-\mu_{k}u_{k})}{2\mu_{k}}P_{k}u_{k},x_{k}-x^{*}\right\rangle\right]+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4) (28)
=(25)rk22αk𝔼Pk[hμk(k)(0),Pk(xkx)]+2L2αk2cn(d+4)2(n+4).\displaystyle\overset{\text{\eqref{eq:gradient of smoothing subspace function for u}}}{=}r_{k}^{2}-2\alpha_{k}\mathbb{E}_{P_{k}}\left[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle\right]+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4).

We reevaluate the second term on the right-hand side. Now, we evaluate the error between 𝔼uk[hμk(k)(0),u]\mathbb{E}_{u_{k}}[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),u\rangle] and gk,1nPku\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle for any uu. From relation (22) with x=xk+1nPku+μknPkukx=x_{k}+\frac{1}{\sqrt{n}}P_{k}u+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k} and y=xky=x_{k}, we have

f(xk+1nPku+μknPkuk)f(xk)+gk,1nPku+μknPkuk+τ2n(u+μkuk)PkH~(xk)Pk(u+μkuk).f(x_{k}+\frac{1}{\sqrt{n}}P_{k}u+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})\geq f(x_{k})+\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k}\rangle+\frac{\tau}{2n}(u+\mu_{k}u_{k})^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}(u+\mu_{k}u_{k}).

Taking the expectation with respect to uku_{k}, we obtain

hμk(k)(u)\displaystyle{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(u) f(xk)+𝔼uk[gk,1nPku+μknPkuk]+τ2nuPkH~(xk)Pku+𝔼uk[τμknukPkH~(xk)Pku]\displaystyle\geq f(x_{k})+\mathbb{E}_{u_{k}}\left[\left\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k}\right\rangle\right]+\frac{\tau}{2n}u^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u+\mathbb{E}_{u_{k}}\left[\frac{\tau\mu_{k}}{n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u\right] (29)
+𝔼uk[τμk22nukPkH~(xk)Pkuk]\displaystyle\hskip 100.0pt+\mathbb{E}_{u_{k}}\left[\frac{\tau\mu_{k}^{2}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u_{k}\right]
=𝔼uk[uk]=0f(xk)+gk,1nPku+τ2nuPkH~(xk)Pku+𝔼uk[τμk22nukPkH~(xk)Pkuk]\displaystyle\overset{\text{$\mathbb{E}_{u_{k}}[u_{k}]=0$}}{=}f(x_{k})+\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle+\frac{\tau}{2n}u^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u+\mathbb{E}_{u_{k}}\left[\frac{\tau\mu_{k}^{2}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u_{k}\right]
=(8)f(xk)+gk,1nPku+τ2nuPkH~(xk)Pku+τμk22ntr(PkH~(xk)Pk).\displaystyle\overset{\text{\eqref{eq:trace expectation1}}}{=}f(x_{k})+\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle+\frac{\tau}{2n}u^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u+\frac{\tau\mu_{k}^{2}}{2n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right).

First, we evaluate gk,1nPku\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle. Using 𝔼uk[ukuk]=Id\mathbb{E}_{u_{k}}[u_{k}u_{k}^{\top}]=I_{d}, we have

gk,1nPku\displaystyle\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle =𝔼ukgk,1nPkukuku\displaystyle=\mathbb{E}_{u_{k}}\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}u_{k}^{\top}u\rangle (30)
=𝔼uk[gk,1nPkukuk,u]\displaystyle=\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\right]
=𝔼uk[gk,1nPkukuk,u𝟏u+(uk)]+𝔼uk[gk,1nPkukuk,u𝟏u(uk)].\displaystyle=\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]+\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\mathbf{1}_{u}^{-}(u_{k})\right].\hskip 30.0pt

We compute both an upper bound and a lower bound for gk,1nPkuk\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle. From the convexity of ff, we have

f(xk+μknPkuk)f(xk)μkgk,1nPkuk.\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}\geq\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle. (31)

From relation (23) with x=xk+μknPkukx=x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k} and y=xky=x_{k}, we have

gk,1nPkuk+L~μk2nukPkH~(xk)Pkukf(xk+μknPkuk)f(xk)μk.\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle+\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u_{k}\geq\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}. (32)

Using these relations, we have

gk,1nPku\displaystyle\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle =(30)𝔼uk[gk,1nPkukuk,u𝟏u+(uk)]+𝔼uk[gk,1nPkukuk,u𝟏u(uk)]\displaystyle\overset{\text{\eqref{eq:evalueta gTPu}}}{=}\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]+\mathbb{E}_{u_{k}}\left[\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u_{k}\rangle\langle u_{k},u\rangle\mathbf{1}_{u}^{-}(u_{k})\right]
(31),(32)𝔼uk[f(xk+μknPkuk)f(xk)μkuk,u𝟏u+(uk)]𝔼uk[L~μk2nukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\overset{\text{\eqref{eq:upper bound1},\eqref{eq:lower bound1}}}{\geq}\mathbb{E}_{u_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]-\mathbb{E}_{u_{k}}\left[\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
+𝔼uk[f(xk+μknPkuk)f(xk)μkuk,u𝟏u(uk)]\displaystyle\hskip 100.0pt+\mathbb{E}_{u_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}\langle u_{k},u\rangle\mathbf{1}_{u}^{-}(u_{k})\right]
=𝔼uk[f(xk+μknPkuk)f(xk)μkuk,u]𝔼uk[L~μk2nukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle=\mathbb{E}_{u_{k}}\left[\frac{f(x_{k}+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})-f(x_{k})}{\mu_{k}}\langle u_{k},u\rangle\right]-\mathbb{E}_{u_{k}}\left[\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
=𝔼uk[h(k)(μkuk)h(k)(0)μkuk,u]𝔼uk[L~μk2nukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle=\mathbb{E}_{u_{k}}\left[\frac{{\color[rgb]{0,0,0}h^{(k)}}(\mu_{k}u_{k})-{\color[rgb]{0,0,0}h^{(k)}}(0)}{\mu_{k}}\langle u_{k},u\rangle\right]-\mathbb{E}_{u_{k}}\left[\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
=(25)hμk(k)(0),u𝔼uk[L~μk2nukPkH~(x)Pkukuk,u𝟏u+(uk)].\displaystyle\overset{\text{\eqref{eq:gradient of smoothing subspace function for u}}}{=}\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),u\rangle-\mathbb{E}_{u_{k}}\left[\frac{\tilde{L}\mu_{k}}{2n}u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]. (33)

Regarding the second term, using 𝔼uk[F(uk)]=𝔼uk[F(uk)]\mathbb{E}_{u_{k}}[F(u_{k})]=\mathbb{E}_{u_{k}}[F(-u_{k})] from the rotational invariance of the normal distribution, we have

𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right] =12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle=\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right] (34)
+12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\hskip 100.0pt+\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
=rotation invariant12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\overset{\text{rotation invariant}}{=}\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle\hskip 100.0pt-\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(-u_{k})\right]
=12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u+(uk)]\displaystyle=\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{+}(u_{k})\right]
12𝔼uk[ukPkH~(x)Pkukuk,u𝟏u(uk)]\displaystyle\hskip 100.0pt-\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}\langle u_{k},u\rangle\mathbf{1}_{u}^{-}(u_{k})\right]
=12𝔼uk[ukPkH~(x)Pkuk|uk,u|].\displaystyle=\frac{1}{2}\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}|\langle u_{k},u\rangle|\right].

Now, we evaluate 𝔼uk[ukPkH~(x)Pkuk|uk,u|]\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}|\langle u_{k},u\rangle|\right]. From Hölder’s inequality and Lemma 2.2, we obtain

𝔼uk[ukPkH~(x)Pkuk|uk,u|]\displaystyle\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}|\langle u_{k},u\rangle|\right] Hölder’s ineq.𝔼uk[(ukPkH~(x)Pkuk)2]𝔼uk[|uk,u|2]\displaystyle\overset{\text{H\"{o}lder's ineq.}}{\leq}\sqrt{\mathbb{E}_{u_{k}}\left[(u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k})^{2}\right]}\sqrt{\mathbb{E}_{u_{k}}\left[|\langle u_{k},u\rangle|^{2}\right]}
=𝔼uk[(ukPkH~(x)Pkuk)2]𝔼uk[uukuku]\displaystyle=\sqrt{\mathbb{E}_{u_{k}}\left[(u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k})^{2}\right]}\sqrt{\mathbb{E}_{u_{k}}\left[u^{\top}u_{k}u_{k}^{\top}u\right]}
=𝔼uk[ukuk]=Id𝔼uk[(ukPkH~(x)Pkuk)2]u2\displaystyle\overset{\text{$\mathbb{E}_{u_{k}}[u_{k}u_{k}^{\top}]=I_{d}$}}{=}\sqrt{\mathbb{E}_{u_{k}}\left[(u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k})^{2}\right]}\sqrt{\|u\|^{2}}
(9)(tr(PkH~(x)Pk))2+2tr((PkH~(x)Pk)2)u.\displaystyle\overset{\text{\eqref{eq:trace expectation2}}}{\leq}\sqrt{(\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right))^{2}+2\mathrm{tr}\left({(P_{k}^{\top}\tilde{H}(x)P_{k})^{2}}\right)}\|u\|.

For any positive semidefinite matrix AA, tr(A2)=iλi(A)2(iλi(A))2=(tr(A))2\mathrm{tr}\left({A^{2}}\right)=\sum_{i}\lambda_{i}(A)^{2}\leq(\sum_{i}\lambda_{i}(A))^{2}=(\mathrm{tr}\left({A}\right))^{2} holds. Then, we have

𝔼uk[ukPkH~(x)Pkuk|uk,u|]3tr(PkH~(x)Pk)u.\mathbb{E}_{u_{k}}\left[u_{k}^{\top}P_{k}^{\top}\tilde{H}(x)P_{k}u_{k}|\langle u_{k},u\rangle|\right]\leq\sqrt{3}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|u\|. (35)

Finally, from (33),(34), and (35), we obtain

gk,1nPkuhμk(k)(0),u3L~μk4ntr(PkH~(x)Pk)u.{}\langle g_{k},\frac{1}{\sqrt{n}}P_{k}u\rangle\geq\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),u\rangle-\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|u\|. (36)

Combining relations from (29) and (36), we obtain

hμk(k)(u)f(xk)+hμk(k)(0),u3L~μk4ntr(PkH~(x)Pk)u+τ2nuPkH~(xk)Pku+μk2τ2ntr(PkH~(xk)Pk).{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(u)\geq f(x_{k})+\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),u\rangle-\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|u\|+\frac{\tau}{2n}u^{\top}P_{k}^{\top}\tilde{H}(x_{k})P_{k}u+\frac{\mu_{k}^{2}\tau}{2n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right).

By substituting Pk(xkx)-P_{k}^{\top}(x_{k}-x^{*}) for uu,

hμk(k)(Pk(xkx))f(xk)hμk(k)(0),Pk(xkx)3L~μk4ntr(PkH~(x)Pk)Pk(xkx)\displaystyle{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*}))\geq f(x_{k})-\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle-\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|P_{k}^{\top}(x_{k}-x^{*})\|
+τ2n(xkx)PkPkH~(xk)PkPk(xkx)+μk2τ2ntr(PkH~(xk)Pk)\displaystyle+\frac{\tau}{2n}(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*})+\frac{\mu_{k}^{2}\tau}{2n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right)

holds, and then, regarding the second term on the right-hand side of (28), we have

hμk(k)(0),Pk(xkx)f(xk)hμk(k)(Pk(xkx))3L~μk4ntr(PkH~(x)Pk)Pk(xkx)\displaystyle\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle\geq f(x_{k})-{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*}))-\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|P_{k}^{\top}(x_{k}-x^{*})\|
+τ2n(xkx)PkPkH~(xk)PkPk(xkx)+μk2τ2ntr(PkH~(xk)Pk).\displaystyle+\frac{\tau}{2n}(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*})+\frac{\mu_{k}^{2}\tau}{2n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right).\hskip 30.0pt (37)

After taking the expectation with respect to PkP_{k}, we evaluate the right-hand side. As for hμk(k)(Pk(xkx)){\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*})), we have

𝔼Pk[hμk(k)(Pk(xkx))]\displaystyle\mathbb{E}_{P_{k}}[{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*}))] =(24)𝔼Pk𝔼uk[f(xk1nPkPk(xkx)+μknPkuk)]\displaystyle\overset{\text{\eqref{eq:smoothing subspace function for u}}}{=}\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[f(x_{k}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top}(x_{k}-x^{*})+\frac{\mu_{k}}{\sqrt{n}}P_{k}u_{k})\right]
Lipschitz𝔼Pk[f(xk1nPkPk(xkx))]+𝔼Pk𝔼uk[LμknPkuk]\displaystyle\overset{\text{Lipschitz}}{\leq}\mathbb{E}_{P_{k}}\left[f(x_{k}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top}(x_{k}-x^{*}))\right]+\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[\frac{L\mu_{k}}{\sqrt{n}}\|P_{k}u_{k}\|\right]
Lipschitzf(x)+L𝔼Pk[(In1nPkPk)(xkx)]+𝔼Pk𝔼uk[LμknPkuk].\displaystyle\overset{\text{Lipschitz}}{\leq}f(x^{*})+L\mathbb{E}_{P_{k}}\left[\|(I_{n}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top})(x_{k}-x^{*})\|\right]+\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[\frac{L\mu_{k}}{\sqrt{n}}\|P_{k}u_{k}\|\right].

Applying Lemma 2.1, we have 𝔼Pk𝔼uk[Pkuk]=𝔼zk𝔼uk[zkuk]nd\mathbb{E}_{P_{k}}\mathbb{E}_{u_{k}}\left[\|P_{k}u_{k}\|\right]=\mathbb{E}_{z_{k}}\mathbb{E}_{u_{k}}\left[\|z_{k}\|\|u_{k}\|\right]\leq\sqrt{nd} and

𝔼Pk[(In1nPkPk)(xkx)2]\displaystyle\mathbb{E}_{P_{k}}\left[\sqrt{\|(I_{n}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top})(x_{k}-x^{*})\|^{2}}\right] Jensen’s ineq.𝔼Pk[(In1nPkPk)(xkx)2]\displaystyle\overset{\text{Jensen's ineq.}}{\leq}\sqrt{\mathbb{E}_{P_{k}}\left[\|(I_{n}-\frac{1}{\sqrt{n}}P_{k}P_{k}^{\top})(x_{k}-x^{*})\|^{2}\right]}
=𝔼Pk[xkx22nPk(xkx)2+1nPkPk(xkx)2]\displaystyle=\sqrt{\mathbb{E}_{P_{k}}\left[\|x_{k}-x^{*}\|^{2}-\frac{2}{\sqrt{n}}\|P_{k}^{\top}(x_{k}-x^{*})\|^{2}+\frac{1}{n}\|P_{k}P_{k}^{\top}(x_{k}-x^{*})\|^{2}\right]}
=𝔼Pk[PkPk]=dInxkx22dnxkx2+𝔼Pk[1nPkPk(xkx)2]\displaystyle\overset{\text{$\mathbb{E}_{P_{k}}[P_{k}P_{k}^{\top}]=dI_{n}$}}{=}\sqrt{\|x_{k}-x^{*}\|^{2}-\frac{2d}{\sqrt{n}}\|x_{k}-x^{*}\|^{2}+\mathbb{E}_{P_{k}}\left[\frac{1}{n}\|P_{k}P_{k}^{\top}(x_{k}-x^{*})\|^{2}\right]}
(12)12dn+2(d+4)(n+4)nxkx\displaystyle\overset{\text{\eqref{eq: PPT x}}}{\leq}\sqrt{1-2\frac{d}{\sqrt{n}}+2\frac{(d+4)(n+4)}{n}}~{}\|x_{k}-x^{*}\|
=2d+9+8dn+32n2dnxkx\displaystyle=\sqrt{2d+9+\frac{8d}{n}+\frac{32}{n}-\frac{2d}{\sqrt{n}}}~{}\|x_{k}-x^{*}\|
=2d+9+6dn+32n2d(1n1n)xkx\displaystyle{\color[rgb]{0,0,0}=\sqrt{2d+9+\frac{6d}{n}+\frac{32}{n}-2d\left(\frac{1}{\sqrt{n}}-\frac{1}{n}\right)}~{}\|x_{k}-x^{*}\|}
(d0,n1)8d+41xkx.\displaystyle{\color[rgb]{0,0,0}\overset{\text{$(d\geq 0,\;n\geq 1)$}}{\leq}\sqrt{8d+41}~{}\|x_{k}-x^{*}\|.}

Finally, we have

𝔼Pk[hμk(k)(Pk(xkx))]f(x)+Lμkd+L8d+41xkx.\mathbb{E}_{P_{k}}[{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(-P_{k}^{\top}(x_{k}-x^{*}))]\leq f(x^{*})+L\mu_{k}\sqrt{d}+L{\color[rgb]{0,0,0}\sqrt{8d+41}}~{}\|x_{k}-x^{*}\|. (38)

For 3L~μk4ntr(PkH~(x)Pk)Pk(xkx)\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|P_{k}^{\top}(x_{k}-x^{*})\| in (37), from μk1tr(PkH~(xk)Pk)\mu_{k}\leq\frac{1}{\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x_{k})P_{k}}\right)} and 𝔼Pk[Pk(xkx)]dxkx\mathbb{E}_{P_{k}}[\|P_{k}^{\top}(x_{k}-x^{*})\|]\leq\sqrt{d}\|x_{k}-x^{*}\|, we have

𝔼Pk[3L~μk4ntr(PkH~(x)Pk)Pk(xkx)]3dL~4nxkx.\mathbb{E}_{P_{k}}\left[\frac{\sqrt{3}\tilde{L}\mu_{k}}{4n}\mathrm{tr}\left({P_{k}^{\top}\tilde{H}(x)P_{k}}\right)\|P_{k}^{\top}(x_{k}-x^{*})\|\right]\leq\frac{\sqrt{3d}\tilde{L}}{4n}\|x_{k}-x^{*}\|. (39)

Next, regarding (xkx)PkPkH~(xk)PkPk(xkx)(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*}), we have

𝔼Pk[(xkx)PkPkH~(xk)PkPk(xkx)]𝔼Pk[λmin(PkH~(xk)Pk)Pk(xkx)2].\displaystyle\mathbb{E}_{P_{k}}[(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*})]\geq\mathbb{E}_{P_{k}}[\lambda_{\min}(P_{k}^{\top}\tilde{H}(x_{k})P_{k})\|P_{k}^{\top}(x_{k}-x^{*})\|^{2}].

To apply the law of total expectation, we define $\mathcal{A}:=\{P\,|\,\lambda_{\min}(P^{\top}\tilde{H}(x_{k})P)\geq\frac{(1-\varepsilon_{0})^{2}n}{2}\sigma^{2}\bar{\lambda}\}$, $\mathcal{B}:=\{P\,|\,\|P^{\top}(x_{k}-x^{*})\|^{2}\geq\frac{d}{2}\|x_{k}-x^{*}\|^{2}\}$, and $X:=(x_{k}-x^{*})^{\top}P_{k}P_{k}^{\top}\tilde{H}(x_{k})P_{k}P_{k}^{\top}(x_{k}-x^{*})\geq 0$. Then, we have

𝔼Pk[X]=𝔼Pk[X|𝒜]P(𝒜)+𝔼Pk[X|𝒜¯](1P(𝒜))𝔼Pk[X|𝒜]P(𝒜).\mathbb{E}_{P_{k}}[X]=\mathbb{E}_{P_{k}}[X|\mathcal{A}\cap\mathcal{B}]P(\mathcal{A}\cap\mathcal{B})+\mathbb{E}_{P_{k}}[X|\overline{\mathcal{A}\cap\mathcal{B}}](1-P(\mathcal{A}\cap\mathcal{B}))\geq\mathbb{E}_{P_{k}}[X|\mathcal{A}\cap\mathcal{B}]P(\mathcal{A}\cap\mathcal{B}).

Applying Lemma 3.1 and Proposition 1, we have

P(\mathcal{A}\cap\mathcal{B})\geq 1-P(\overline{\mathcal{A}})-P(\overline{\mathcal{B}})\geq 1-6\exp{(-d)}-2\exp{(-\frac{C_{0}d}{4})}.

Letting $C:=1-6\exp{(-d)}-2\exp{(-\frac{C_{0}d}{4})}$, we have

𝔼Pk[X](1ε0)2Cdn4σ2λ¯xkx2.\mathbb{E}_{P_{k}}[X]\geq\frac{(1-\varepsilon_{0})^{2}Cdn}{4}\sigma^{2}\bar{\lambda}\|x_{k}-x^{*}\|^{2}. (40)

From (37), (38), (39), and (40), we obtain

𝔼Pk[hμk(k)(0),Pk(xkx)]f(xk)f(x)LμkdL8d+41xkx\displaystyle\mathbb{E}_{P_{k}}[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle]\geq f(x_{k})-f(x^{*})-L\mu_{k}\sqrt{d}-L{\color[rgb]{0,0,0}\sqrt{8d+41}}\|x_{k}-x^{*}\|
3dL~4nxkx+(1ε0)2τCd8σ2λ¯xkx2.\displaystyle-\frac{\sqrt{3d}\tilde{L}}{4n}\|x_{k}-x^{*}\|+\frac{(1-\varepsilon_{0})^{2}\tau Cd}{8}\sigma^{2}\bar{\lambda}\|x_{k}-x^{*}\|^{2}.\hskip 30.0pt (41)

To satisfy the condition:

L8d+41xkx3dL~4nxkx+(1ε0)2τCd8σ2λ¯xkx20,-L{\color[rgb]{0,0,0}\sqrt{8d+41}}\|x_{k}-x^{*}\|-\frac{\sqrt{3d}\tilde{L}}{4n}\|x_{k}-x^{*}\|+\frac{(1-\varepsilon_{0})^{2}\tau Cd}{8}\sigma^{2}\bar{\lambda}\|x_{k}-x^{*}\|^{2}\geq 0,

we need

xkxC18d+41d2+C2nd.{}\|x_{k}-x^{*}\|\geq C_{1}\sqrt{{\color[rgb]{0,0,0}\frac{8d+41}{d^{2}}}}+\frac{C_{2}}{n\sqrt{d}}. (42)

When xkx_{k} satisfies (42), we obtain

𝔼Pk[hμk(k)(0),Pk(xkx)]f(xk)f(x)Lμkd.\mathbb{E}_{P_{k}}[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle]\geq f(x_{k})-f(x^{*})-L\mu_{k}\sqrt{d}. (43)

Then, we have

𝔼uk𝔼Pk[rk+12]\displaystyle\mathbb{E}_{u_{k}}\mathbb{E}_{P_{k}}[r_{k+1}^{2}] (28)rk22αk𝔼Pk[hμk(k)(0),Pk(xkx)]+2L2αk2cn(d+4)2(n+4)\displaystyle\overset{\eqref{eq:local first inequality}}{\leq}r_{k}^{2}-2\alpha_{k}\mathbb{E}_{P_{k}}[\langle\nabla{\color[rgb]{0,0,0}h^{(k)}_{\mu_{k}}}(0),P_{k}^{\top}(x_{k}-x^{*})\rangle]+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4)
(43)rk22αk(f(xk)f(x))+2Lαkμkd+2L2αk2cn(d+4)2(n+4).\displaystyle\overset{\eqref{eq:new eval in local}}{\leq}r_{k}^{2}-2\alpha_{k}(f(x_{k})-f(x^{*}))+2L\alpha_{k}\mu_{k}\sqrt{d}+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4).

Taking the expectation with respect to 𝒰k\mathcal{U}_{k}, we have

𝔼𝒰k+1[rk+12]𝔼𝒰k[rk2]2αk(𝔼𝒰k[f(xk)]f(x))+2Lαkμkd+2L2αk2cn(d+4)2(n+4).\displaystyle\mathbb{E}_{\mathcal{U}_{k+1}}[r_{k+1}^{2}]\leq\mathbb{E}_{\mathcal{U}_{k}}[r_{k}^{2}]-2\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}[f(x_{k})]-f(x^{*}))+2L\alpha_{k}\mu_{k}\sqrt{d}+\frac{2L^{2}\alpha_{k}^{2}}{cn}(d+4)^{2}(n+4).\hskip 30.0pt (44)

If $\{x_{k}\}_{k=0,\ldots,N-1}$ all satisfy (42), then summing (44) over $k$ yields

k=0N1αk(𝔼𝒰k[f(xk)]f(x))12r02+Ldk=0N1αkμk+L2cn(d+4)2(n+4)k=0N1αk2.\sum_{k=0}^{N-1}\alpha_{k}(\mathbb{E}_{\mathcal{U}_{k}}[f(x_{k})]-f(x^{*}))\leq\frac{1}{2}r_{0}^{2}+L\sqrt{d}\sum_{k=0}^{N-1}\alpha_{k}\mu_{k}+\frac{L^{2}}{cn}(d+4)^{2}(n+4)\sum_{k=0}^{N-1}\alpha_{k}^{2}.

When xkx_{k} does not satisfy (42) for some kk, i.e.,

xkx<C18d+41d2+C2nd,\|x_{k}-x^{*}\|<C_{1}\sqrt{{\color[rgb]{0,0,0}\frac{8d+41}{d^{2}}}}+\frac{C_{2}}{n\sqrt{d}},

from Lipschitz continuity of ff, we have

f(xk)f(x)LC18d+41d2+LC2nd.f(x_{k})-f(x^{*})\leq LC_{1}\sqrt{{\color[rgb]{0,0,0}\frac{8d+41}{d^{2}}}}+\frac{LC_{2}}{n\sqrt{d}}.

∎

With fixed αk=α\alpha_{k}=\alpha and μk=μ\mu_{k}=\mu, from (26) we obtain

min0iN1𝔼𝒰i[f(xi)]f(x)r022Nα+Ldμ+L2(n+4)(d+4)2cnα.\min_{0\leq i\leq N-1}\mathbb{E}_{\mathcal{U}_{i}}[f(x_{i})]-f(x^{*})\leq\frac{r_{0}^{2}}{2N\alpha}+L\sqrt{d}\mu+\frac{L^{2}(n+4)(d+4)^{2}}{cn}\alpha.

From this relation, the iteration complexity for achieving the inequality: min0iN1𝔼𝒰i[f(xi)]f(x)<ε\min_{0\leq i\leq N-1}\mathbb{E}_{\mathcal{U}_{i}}[f(x_{i})]-f(x^{*})<\varepsilon is

N=8r02L2(n+4)(d+4)2cnε2=O(d2ε2)N=\frac{8r_{0}^{2}L^{2}(n+4)(d+4)^{2}}{cn\varepsilon^{2}}=O\left(\frac{d^{2}}{\varepsilon^{2}}\right) (45)

with

α=cnr0L(d+4)2(n+4)N,με2Ld.\alpha=\frac{\sqrt{cn}r_{0}}{L(d+4)\sqrt{2(n+4)N}},\;\mu\leq\frac{\varepsilon}{2L\sqrt{d}}. (46)

From (27), when $LC_{1}\sqrt{\frac{8d+41}{d^{2}}}\leq\frac{\varepsilon}{2}$ and $\frac{LC_{2}}{n\sqrt{d}}\leq\frac{\varepsilon}{2}$ hold, we obtain $f(x_{k})-f(x^{*})\leq\varepsilon$; the reduced dimension $d$ must then satisfy $d=\Omega(\varepsilon^{-2})$. Comparing the global and local iteration complexities, the local behavior is better when the original dimension $n$ satisfies $n=\Theta(\varepsilon^{-p})$ with $p>4$: with reduced dimension $d=\Theta(\varepsilon^{-2})$, the global iteration complexity is $O(n\varepsilon^{-2})=O(\varepsilon^{-p-2})$, while the local iteration complexity is $O(d^{2}\varepsilon^{-2})=O(\varepsilon^{-6})$, which is smaller than $O(\varepsilon^{-p-2})$.
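For concreteness, here is a hypothetical numerical instance of this regime (the values are chosen purely for illustration and do not correspond to any experiment in this paper): take $\varepsilon=10^{-1}$ and $p=6$, so that $n=\Theta(\varepsilon^{-6})=\Theta(10^{6})$ and $d=\Theta(\varepsilon^{-2})=\Theta(10^{2})$. The global bound is then $O(n\varepsilon^{-2})=O(10^{8})$ iterations, while the local bound is $O(d^{2}\varepsilon^{-2})=O(10^{6})$ iterations, i.e., the local bound improves on the global one by a factor of $n/d^{2}=10^{2}$.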

Remark 2.

Our algorithm uses only the product $P_{k}u_{k}$; it never uses $P_{k}$ and $u_{k}$ separately. This is one of the most interesting aspects of using random projections in this work. We consider that the advantage of $P_{k}u_{k}$ comes from the relation

𝔼Pk[minudh(k)(u)=f(xk+1nPku)]𝔼vk[minαf(xk+αvk)],\mathbb{E}_{P_{k}}[\min_{u\in\mathbb{R}^{d}}{\color[rgb]{0,0,0}h^{(k)}}(u)=f(x_{k}+\frac{1}{\sqrt{n}}P_{k}u)]\leq\mathbb{E}_{v_{k}}[\min_{\alpha\in\mathbb{R}}f(x_{k}+\alpha v_{k})],

where $v_{k}\in\mathbb{R}^{n}$ is a random vector whose entries are drawn from $\mathcal{N}(0,1)$. This relation holds because all entries of $P_{k}$ and $v_{k}$ follow $\mathcal{N}(0,1)$: when $d=1$, the left-hand problem minimizes $f$ over a random line through $x_{k}$ and is identical in distribution to the right-hand problem, and for $d>1$ it minimizes over a $d$-dimensional affine subspace containing such a line. Noticing that the function on the left-hand side is denoted by $h^{(k)}(u)$ in (16), we can regard (15) in Algorithm 2 as one iteration of Algorithm 1 (i.e., RGF [27]) applied to the problem on the left-hand side.

6 Experiments

6.1 Softmax Regression

In this section, we evaluate the performance of the proposed method (Algorithm 2) and compare it with RGF with central differences [27], described in Algorithm 1, by minimizing a softmax loss function with $L_{1}$ regularization:

minw1,,wc,b1,,bc1mi=1mlogexp(wyixi+byi)k=1cexp(wkxi+bk)+λk=1c(wk1+bk1),\min_{w_{1},...,w_{c},b_{1},...,b_{c}}-\frac{1}{m}\sum_{i=1}^{m}\log{\frac{\exp(w_{y_{i}}^{\top}x_{i}+b_{y_{i}})}{\sum_{k=1}^{c}\exp(w_{k}^{\top}x_{i}+b_{k})}}+\lambda\sum_{k=1}^{c}(\|w_{k}\|_{1}+\|b_{k}\|_{1}),

where $(x_{i},y_{i})$ denotes a data point with label $y_{i}\in\{1,2,\ldots,c\}$. We set the $L_{1}$-regularization parameter to $\lambda=10^{-6}$ and use reduced dimensions $d\in\{10,50,100\}$ for the proposed method. For both methods, we set the smoothing parameter $\mu_{k}=10^{-8}$ and use a fixed step size $\alpha_{k}=10^{i}\;(i\in\mathbb{N})$.
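To make this setup concrete, the following is a minimal Python sketch of one iteration of the proposed update applied to this objective. All names (softmax_loss_l1, zo_subspace_step) and parameter values are our own illustrative assumptions, not the code or settings used in the experiments; step-size tuning and data loading are omitted.

```python
import numpy as np

def softmax_loss_l1(theta, X, y, c, lam=1e-6):
    """L1-regularized softmax loss; theta stacks the class weights W (c x p) and biases b (c).
    Labels y are assumed to lie in {0, ..., c-1}."""
    m, p = X.shape
    W = theta[:p * c].reshape(c, p)
    b = theta[p * c:]
    scores = X @ W.T + b                                  # (m, c) logits
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(m), y].mean()
    return nll + lam * (np.abs(W).sum() + np.abs(b).sum())

def zo_subspace_step(f, x, alpha, mu, d, rng):
    """One iteration of the proposed update: draw P (n x d) and u (d), form the central-difference
    estimate of the derivative of h(u) = f(x + P u / sqrt(n)) at 0 along u, and move along P u."""
    n = x.size
    P = rng.standard_normal((n, d))
    u = rng.standard_normal(d)
    Pu = P @ u
    diff = f(x + (mu / np.sqrt(n)) * Pu) - f(x - (mu / np.sqrt(n)) * Pu)
    return x - alpha * diff / (2.0 * mu) * Pu

# Illustrative usage on synthetic data (the sizes and the step size are arbitrary).
rng = np.random.default_rng(0)
m, p, c = 200, 1000, 5
X, y = rng.standard_normal((m, p)), rng.integers(0, c, m)
x = np.zeros(p * c + c)
for _ in range(100):
    x = zo_subspace_step(lambda t: softmax_loss_l1(t, X, y, c), x,
                         alpha=1.0, mu=1e-8, d=50, rng=rng)
print(softmax_loss_l1(x, X, y, c))
```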

Table 1: Details of the datasets for softmax regression [7].
Name   | features | classes (c) | training size (m)
SCOTUS | 126,405  | 13          | 5,000
news20 | 62,061   | 20          | 15,935
Figure 1: Softmax loss function with $L_{1}$ regularization ((a) SCOTUS dataset, (b) news20 dataset).

Figure 1 shows the function values of RGF and the proposed method per iteration for the datasets listed in Table 1. From Figure 1, we observe that our algorithm converges faster than RGF after a sufficient number of iterations, while RGF reduces the objective function value more rapidly in the early iterations. This behavior might be consistent with the theoretical guarantees on the global and local worst-case complexities (Theorems 1 and 3, respectively), considering that the global complexity $O(\frac{n}{\varepsilon^{2}})$ is the same as that of RGF. However, it is difficult to see in the plots the difference between the coefficients of the local and global sublinear rates, i.e., that the local iteration complexity is $d^{2}/n$ times the global one. A possible reason is that the improvement of the local rate concerns the worst-case complexity, and the worst case is not necessarily attained in practice.

Regarding function values versus elapsed time, generating random matrices is time-consuming, so for general objective functions random projection offers no advantage in runtime: in this setting, the proposed method spends more time than RGF for the same number of iterations.

6.2 Adversarially Robust Logistic Regression

We consider the following adversarially robust optimization problem, which is studied in [24]:

minw,bmaxx~δg~(x~;w,b):=\displaystyle\min_{w,b}\max_{\|\tilde{x}\|\leq\delta}~{}\tilde{g}(\tilde{x};w,b):=
1mi=1mlog1(1+eyi(w(xi+x~)+b))+λ(w1+b1),\displaystyle-\frac{1}{m}\sum_{i=1}^{m}\log{\frac{1}{(1+e^{-y_{i}(w^{\top}(x_{i}+\tilde{x})+b)})}}+\lambda(\|w\|_{1}+\|b\|_{1}), (47)

where λ=107\lambda=10^{-7}. By letting θ=(w,b)\theta^{\top}=(w^{\top},b) and η=(x~,0)\eta^{\top}=(\tilde{x}^{\top},0), we can rewrite Problem (47) as minθf(θ)\min_{\theta}f(\theta), where

f(θ):=maxηgθ(θη),:={η:ηδ},\displaystyle f(\theta):=\max_{\eta\in\mathcal{H}}g_{\theta}(\theta^{\top}\eta),~{}~{}\mathcal{H}:=\{\eta:\|\eta\|\leq\delta\},
gθ(α):=1mi=1mlog1(1+eyi(α+wxi+b))+λ(w1+b1).\displaystyle g_{\theta}(\alpha):=-\frac{1}{m}\sum_{i=1}^{m}\log{\frac{1}{(1+e^{-y_{i}(\alpha+w^{\top}x_{i}+b)})}}+\lambda(\|w\|_{1}+\|b\|_{1}).

Note that the derivative of ff is difficult to compute due to the non-smoothness of ff in general.

In this formulation, our proposed method can take advantage of random projections. When we evaluate the function value $f(\theta_{k}+\frac{1}{\sqrt{n}}P_{k}u)$, it is necessary to solve

maxηgθk(θkη+u1nPkη).{}\max_{\eta\in\mathcal{H}}g_{\theta_{k}}(\theta_{k}^{\top}\eta+u^{\top}\frac{1}{\sqrt{n}}P_{k}^{\top}\eta). (48)

In this case, we solve the following approximated optimization problem:

max(α,β)𝒜gθk(α+uβ),\max_{(\alpha,\beta)\in\mathcal{A}}g_{\theta_{k}}(\alpha+u^{\top}\beta), (49)

where $\mathcal{A}=\{(\alpha,\beta)\,|\,\frac{\alpha^{2}}{\|\theta_{k}\|^{2}}+\|\beta\|^{2}\leq\delta^{2}\}$. We will explain that this approximated problem (49) is equivalent to the original problem (48) when a certain condition holds, and also show that this condition holds with high probability. First, we confirm that we can obtain $\eta$ with $\|\eta\|\leq\delta$ from a solution of the approximated problem (49). Let $(\alpha^{*},\beta^{*})$ denote an optimal solution of (49). When $d+1<n$, we can confirm the existence of $\eta$ satisfying $\alpha^{*}=\theta_{k}^{\top}\eta$ and $\beta^{*}=\frac{1}{\sqrt{n}}P_{k}^{\top}\eta$ by solving the linear system

z=Aη,A:=(θkθk1nPk),z:=(αθkβ).z^{*}=A\eta,\;A:=\left(\begin{array}[]{c}\frac{\theta_{k}^{\top}}{\|\theta_{k}\|}\\ \frac{1}{\sqrt{n}}P_{k}^{\top}\end{array}\right),\;z^{*}:=\left(\begin{array}[]{c}\frac{\alpha^{*}}{\|\theta_{k}\|}\\ \beta^{*}\end{array}\right). (50)

Since the row vectors of $A$ are linearly independent (almost surely, as $d+1<n$), the minimum-norm solution of (50) is $\eta=A^{\top}(AA^{\top})^{-1}z^{*}$, and then $\|\eta\|^{2}=(z^{*})^{\top}(AA^{\top})^{-1}z^{*}\leq\lambda_{\max}((AA^{\top})^{-1})\|z^{*}\|^{2}$ holds. This inequality implies that when $\|z^{*}\|\leq\sqrt{\lambda_{\min}(AA^{\top})}\,\delta$ holds, $\|\eta\|\leq\delta$ also holds. From $(\alpha^{*},\beta^{*})\in\mathcal{A}$, we have $\|z^{*}\|^{2}=\left(\frac{\alpha^{*}}{\|\theta_{k}\|}\right)^{2}+\|\beta^{*}\|^{2}\leq\delta^{2}$. Hence, if $\lambda_{\min}(AA^{\top})\geq 1$ holds, $\|z^{*}\|\leq\delta$ implies $\|\eta\|\leq\delta$. To establish this, we prove the following Lemma 5.

Lemma 5.

Let A:=(xx,1nP)A:=\left(\begin{array}[]{cc}\frac{x}{\|x\|},\frac{1}{\sqrt{n}}P\end{array}\right)^{\top}, where Pn×dP\in\mathbb{R}^{n\times d} is a random matrix whose entries are sampled from 𝒩(0,1)\mathcal{N}(0,1) and xnx\in\mathbb{R}^{n}. Then, λmin(AA)12(3+ε)dn+4dn\lambda_{\min}(AA^{\top})\geq 1-2(3+\varepsilon)\sqrt{\frac{d}{n}}+\frac{4d}{n} holds with probability at least 12exp(C0ε2d)exp(d/2)1-2\exp{(-C_{0}\varepsilon^{2}d)}-\exp{(-d/2)}.

Proof.

Let $w_{1}\in\mathbb{R}$ and $w_{2}\in\mathbb{R}^{d}$, and set $w=(w_{1},w_{2}^{\top})^{\top}$. From the variational characterization of the minimum eigenvalue, we have

λmin(AA)\displaystyle\lambda_{\min}(AA^{\top}) =minw=1wAAw=minw12+w22=1w12+2nxw1xPw2+1nw2PPw2\displaystyle=\min_{\|w\|=1}w^{\top}AA^{\top}w=\min_{w_{1}^{2}+\|w_{2}\|^{2}=1}w_{1}^{2}+\frac{2}{\sqrt{n}\|x\|}w_{1}x^{\top}Pw_{2}+\frac{1}{n}w_{2}^{\top}P^{\top}Pw_{2}
minw12+w22=1w122nx|w1|Pxw2+1nw2PPw2.\displaystyle\geq\min_{w_{1}^{2}+\|w_{2}\|^{2}=1}w_{1}^{2}-\frac{2}{\sqrt{n}\|x\|}|w_{1}|\|P^{\top}x\|\|w_{2}\|+\frac{1}{n}w_{2}^{\top}P^{\top}Pw_{2}.

Regarding the second term on the right-hand side, from Lemma 3.1 and w12+w22=1w_{1}^{2}+\|w_{2}\|^{2}=1, we obtain

2nx|w1|Pxw2Lemma 3.12(1+ε)dn|w1|w22(1+ε)dn\frac{2}{\sqrt{n}\|x\|}|w_{1}|\|P^{\top}x\|\|w_{2}\|\overset{\text{Lemma~{}\ref{lemma:properties of random matrix}.\ref{lemma:Johnson}}}{\leq}2(1+\varepsilon)\sqrt{\frac{d}{n}}|w_{1}|\|w_{2}\|\leq 2(1+\varepsilon)\sqrt{\frac{d}{n}}

with probability at least 12exp(C0ε2d)1-2\exp{(-C_{0}\varepsilon^{2}d)}. Then, we have

λmin(AA)minw12+w22=1(w12+1nw2PPw2)2(1+ε)dn.\lambda_{\min}(AA^{\top})\geq\min_{w_{1}^{2}+\|w_{2}\|^{2}=1}(w_{1}^{2}+\frac{1}{n}w_{2}^{\top}P^{\top}Pw_{2})-2(1+\varepsilon)\sqrt{\frac{d}{n}}.

The first term on the right-hand side equals the minimum eigenvalue of

(1001nPP),\left(\begin{array}[]{cc}1&0\\ 0&\frac{1}{n}P^{\top}P\end{array}\right),

and then, we have minw12+w22=1(w12+1nw2PPw2)=min{1,σd(1nP)2}\min_{w_{1}^{2}+\|w_{2}\|^{2}=1}(w_{1}^{2}+\frac{1}{n}w_{2}^{\top}P^{\top}Pw_{2})=\min\{1,\sigma_{d}(\frac{1}{\sqrt{n}}P)^{2}\}. Applying Lemma 3.3, σd(1nP)12dn\sigma_{d}(\frac{1}{\sqrt{n}}P)\geq 1-2\sqrt{\frac{d}{n}} holds with probability at least 1exp(d/2)1-\exp{(-d/2)}. Hence, we have

λmin(AA)(12dn)22(1+ε)dn=12(3+ε)dn+4dn\lambda_{\min}(AA^{\top})\geq\left(1-2\sqrt{\frac{d}{n}}\right)^{2}-2(1+\varepsilon)\sqrt{\frac{d}{n}}=1-2(3+\varepsilon)\sqrt{\frac{d}{n}}+\frac{4d}{n}

with probability at least $1-2\exp{(-C_{0}\varepsilon^{2}d)}-\exp{(-d/2)}$. ∎

When nn is large enough and d=o(n)d=o(n), from Lemma 5, λmin(AA)1\lambda_{\min}(AA^{\top})\gtrapprox 1 holds with high probability. Then, ηδ\|\eta\|\leq\delta holds with high probability.
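As an illustration of this recovery step, the following is a minimal Python sketch, assuming an optimal pair $(\alpha^{*},\beta^{*})$ of (49) is already available; recover_eta is an illustrative name, and the degenerate case of a singular $AA^{\top}$ is not handled.

```python
import numpy as np

def recover_eta(theta, P, alpha_star, beta_star):
    """Recover eta with theta^T eta = alpha_star and (1/sqrt(n)) P^T eta = beta_star
    as the minimum-norm solution eta = A^T (A A^T)^{-1} z* of the linear system (50)."""
    n = theta.size
    A = np.vstack([theta / np.linalg.norm(theta), P.T / np.sqrt(n)])   # (d+1) x n
    z = np.concatenate(([alpha_star / np.linalg.norm(theta)], beta_star))
    return A.T @ np.linalg.solve(A @ A.T, z)

# When lambda_min(A A^T) >= 1 (Lemma 5: high probability for d = o(n)),
# ||z*|| <= delta implies that the recovered eta satisfies ||eta|| <= delta,
# i.e., it is feasible for the original problem (48).
```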

Next, we confirm that the optimal value of the original problem (48) can be obtained from (49). Let $\eta^{*}$ denote a global solution of (48). When $(\theta_{k}^{\top}\eta^{*},\frac{1}{\sqrt{n}}P_{k}^{\top}\eta^{*})$ is contained in $\mathcal{A}$, (49) attains the same maximal value as the original problem (48). We now show that $(\theta_{k}^{\top}\eta^{*},\frac{1}{\sqrt{n}}P_{k}^{\top}\eta^{*})$ is contained in $\mathcal{A}$ with high probability. We have

(θkη)2θk2+1nPkη2Lemma 3.1(θkη)2θk2+d(1+ε)nη2(1+d(1+ε)n)η2\frac{(\theta_{k}^{\top}\eta^{*})^{2}}{\|\theta_{k}\|^{2}}+\frac{1}{n}\|P_{k}^{\top}\eta^{*}\|^{2}\overset{\text{Lemma~{}\ref{lemma:properties of random matrix}.\ref{lemma:Johnson}}}{\leq}\frac{(\theta_{k}^{\top}\eta^{*})^{2}}{\|\theta_{k}\|^{2}}+\frac{d(1+\varepsilon)}{n}\|\eta^{*}\|^{2}\leq\left(1+\frac{d(1+\varepsilon)}{n}\right)\|\eta^{*}\|^{2}

with probability at least $1-2\exp{(-C_{0}\varepsilon^{2}d)}$. Hence, with sufficiently large $n$ and $d=o(n)$, we have $\|A\eta^{*}\|\lessapprox\|\eta^{*}\|\leq\delta$, and $A\eta^{*}$ is contained in $\mathcal{A}$. Note that problem (48) is not a convex optimization problem even if $\mathcal{H}$ is a convex set, because it maximizes the convex function $g_{\theta_{k}}$ (equivalently, minimizes the concave function $-g_{\theta_{k}}$).

Table 2: Details of the datasets for logistic regression [7].
Name           | features  | classes (c) | training size (m)
news20(binary) | 1,355,191 | 2           | 19,996
random         | 1,000,000 | 2           | 100
Figure 2: Adversarially robust logistic regression, $f(x)$ vs. time ((a) random dataset, $\delta=10^{-2}$; (b) news20(binary) dataset, $\delta=10^{-3}$).

In the numerical experiments, we evaluate $f(x)$ by solving the inner maximization problem with the accelerated proximal gradient method until the norm of the generalized gradient is less than $10^{-7}$. In our proposed method, we evaluate $f(x+\frac{1}{\sqrt{n}}Pu)$ through the approximated optimization problem (49). Furthermore, we use multiple sampled directions per iteration in both the proposed method and RGF, i.e., we update $x_{k}$ by

\displaystyle x_{k+1}=x_{k}-\frac{\alpha_{k}}{\sqrt{l}}\sum_{i=1}^{l}\frac{h^{(k)}(\mu_{k}u_{k}^{(i)})-h^{(k)}(-\mu_{k}u_{k}^{(i)})}{2\mu_{k}}P_{k}u_{k}^{(i)},
\displaystyle x_{k+1}=x_{k}-\frac{\alpha_{k}}{\sqrt{l}}\sum_{i=1}^{l}\frac{f(x_{k}+\mu_{k}v_{k}^{(i)})-f(x_{k}-\mu_{k}v_{k}^{(i)})}{2\mu_{k}}v_{k}^{(i)},

respectively. We use one dataset from [7] and one randomly generated dataset. For the random dataset, we generate a matrix $X$ and a vector $w$ whose entries are sampled from $\mathcal{N}(0,1)$, and the labels are given by $y:=\mathbf{1}_{x\geq 0}(Xw+\varepsilon)$, where $\varepsilon$ is a noise vector sampled from $\mathcal{N}(0,I_{m}/100)$. In both methods, we set the smoothing parameter $\mu_{k}=10^{-8}$ and use a fixed step size $\alpha_{k}=10^{i}\;(i\in\mathbb{N})$.
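The following is a minimal Python sketch of this multi-direction update for the proposed method, under the same illustrative assumptions as the sketch in Section 6.1; the evaluation of $f$ via (49) is abstracted behind the callable f, and the RGF variant is obtained by replacing $P_{k}u_{k}^{(i)}$ with a full-dimensional Gaussian direction $v_{k}^{(i)}$.

```python
import numpy as np

def zo_subspace_step_multi(f, x, alpha, mu, d, l, rng):
    """Multi-direction subspace update: one P per iteration, l central-difference
    estimates along P u^(i), summed and scaled by 1/sqrt(l) as in the update above."""
    n = x.size
    P = rng.standard_normal((n, d))
    g = np.zeros(n)
    for _ in range(l):
        u = rng.standard_normal(d)
        Pu = P @ u
        diff = f(x + (mu / np.sqrt(n)) * Pu) - f(x - (mu / np.sqrt(n)) * Pu)
        g += diff / (2.0 * mu) * Pu
    return x - (alpha / np.sqrt(l)) * g
```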

Figure 2 shows the function values of RGF and the proposed method versus elapsed time for the datasets listed in Table 2. As Figure 2 shows, when evaluating $f(x)$ is time-consuming because it requires solving a maximization problem, our proposed method converges faster than RGF. This efficiency comes from the random projection technique, which reduces the function evaluation time.

7 Conclusion

We proposed a new zeroth-order method combining random projections with a smoothing approach for non-smooth convex optimization problems. While the proposed method achieves the $O(\frac{n}{\varepsilon^{2}})$ worst-case iteration complexity, which matches the standard result in the convex non-smooth setting, it converges locally with $O(\frac{d^{2}}{\varepsilon^{2}})$, which does not depend on the dimension $n$, under additional local assumptions on the objective. In numerical experiments, our method performed well when the function evaluation time can be reduced by random projection. As discussed in Section 6.1, since what we have shown in this paper is an improvement of the ``worst-case'' oracle complexity, the worst case is not necessarily attained when the algorithm is actually run. Indeed, many applications using zeroth-order methods have succeeded despite large-scale models and oracle complexities depending on the dimension $n$ [35, 26, 8, 32]. As future work, it would be interesting to investigate whether the iteration complexity of zeroth-order methods is unaffected by $n$ in practical use, or whether it can depend strongly on $n$ in some problem settings. We would also like to investigate the convergence rate of our algorithm in the non-smooth and non-convex setting.

Acknowledgement

This work was supported by JSPS KAKENHI Grant Number 23H03351 and JST ERATO Grant Number JPMJER1903.

8 Compliance with Ethical Standards

This work was partially supported by JSPS KAKENHI (23H03351) and JST ERATO (JPMJER1903). The authors declare that they have no conflict of interest.

References

  • Alexandrov [1939] A. D. Alexandrov. Almost everywhere existence of the second differential of a convex function and some properties of convex surfaces connected with it. Leningrad State Univ. Annals [Uchenye Zapiski] Math. Ser., 6:3, 1939.
  • Berglund et al. [2022] E. Berglund, S. Khirirat, and X. Wang. Zeroth-order randomized subspace Newton methods. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6002–6006. IEEE, 2022.
  • Bergstra et al. [2011] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
  • Beznosikov et al. [2020] A. Beznosikov, E. Gorbunov, and A. Gasnikov. Derivative-free method for composite optimization with applications to decentralized distributed optimization. IFAC-PapersOnLine, 53(2):4038–4043, 2020.
  • Bubeck et al. [2017] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, page 72–85, New York, 2017. Association for Computing Machinery.
  • Cartis and Roberts [2023] C. Cartis and L. Roberts. Scalable subspace methods for derivative-free nonlinear least-squares optimization. Mathematical Programming, 199(1-2):461–524, 2023.
  • Chang and Lin [2011] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
  • Choromanski et al. [2018] K. Choromanski, M. Rowland, V. Sindhwani, R. Turner, and A. Weller. Structured evolution with compact architectures for scalable policy optimization. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 970–978. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/choromanski18a.html.
  • Davidson and Szarek [2001] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. Handbook of the geometry of Banach spaces, 1(317-366):131, 2001.
  • Duchi et al. [2012] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
  • Duchi et al. [2015] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
  • Fuji et al. [2022] T. Fuji, P.-L. Poirion, and A. Takeda. Randomized subspace regularized newton method for unconstrained non-convex optimization. arXiv preprint arXiv:2209.04170, 2022.
  • Gasnikov et al. [2022] A. Gasnikov, A. Novitskii, V. Novitskii, F. Abdukhakimov, D. Kamzolov, A. Beznosikov, M. Takac, P. Dvurechensky, and B. Gu. The power of first-order smooth optimization for black-box non-smooth problems. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 7241–7265. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/gasnikov22a.html.
  • Gasnikov et al. [2016] A. V. Gasnikov, A. A. Lagunovskaya, I. N. Usmanova, and F. A. Fedorenko. Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex. Automation and Remote Control, 77:2018–2034, 2016.
  • Golovin et al. [2020] D. Golovin, J. Karro, G. Kochanski, C. Lee, X. Song, and Q. Zhang. Gradientless descent: High-dimensional zeroth-order optimization. In International Conference on Learning Representations, 2020.
  • Gower et al. [2019] R. Gower, D. Kovalev, F. Lieder, and P. Richtarik. RSN: Randomized subspace Newton. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Gratton et al. [2015] S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25(3):1515–1541, 2015.
  • Ilyas et al. [2018] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2137–2146. PMLR, 10–15 Jul 2018.
  • Karimireddy et al. [2018] S. P. Karimireddy, S. U. Stich, and M. Jaggi. Global linear convergence of Newton’s method without strong-convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.
  • Kolda et al. [2003] T. G. Kolda, R. M. Lewis, and V. Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM review, 45(3):385–482, 2003.
  • Kozak et al. [2021] D. Kozak, S. Becker, A. Doostan, and L. Tenorio. A stochastic subspace approach to gradient-free optimization in high dimensions. Computational Optimization and Applications, 79(2):339–368, 2021.
  • Kozak et al. [2023] D. Kozak, C. Molinari, L. Rosasco, L. Tenorio, and S. Villa. Zeroth-order optimization with orthogonal random directions. Mathematical Programming, 199(1-2):1179–1219, 2023.
  • Larson et al. [2019] J. Larson, M. Menickelly, and S. M. Wild. Derivative-free optimization methods. Acta Numerica, 28:287–404, 2019.
  • Madry et al. [2018] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
  • Magnus [1978] J. R. Magnus. The moments of products of quadratic forms in normal variables. Univ., Instituut voor Actuariaat en Econometrie, 1978.
  • Mania et al. [2018] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
  • Nesterov and Spokoiny [2017] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
  • Niculescu and Persson [2006] C. Niculescu and L.-E. Persson. Convex functions and their applications, volume 23. Springer, New York, 2006.
  • Qian et al. [2016] H. Qian, Y.-Q. Hu, and Y. Yu. Derivative-free optimization of high-dimensional non-convex functions by sequential random embeddings. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1946–1952. AAAI Press, 2016.
  • Rando et al. [2023] M. Rando, C. Molinari, L. Rosasco, and S. Villa. An optimal structured zeroth-order algorithm for non-smooth optimization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=SfdkS6tt81.
  • Roberts and Royer [2023] L. Roberts and C. W. Royer. Direct search based on probabilistic descent in reduced spaces. SIAM Journal on Optimization, 33(4):3057–3082, 2023.
  • Salimans et al. [2017] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • Shamir [2017] O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1):1703–1713, 2017.
  • Vershynin [2018] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, Cambridge, 2018.
  • Ye et al. [2018] H. Ye, Z. Huang, C. Fang, C. J. Li, and T. Zhang. Hessian-aware zeroth-order optimization for black-box adversarial attack. arXiv preprint arXiv:1812.11377, 2018.
  • Yue et al. [2023] P. Yue, L. Yang, C. Fang, and Z. Lin. Zeroth-order optimization with weak dimension dependency. In G. Neu and L. Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 4429–4472. PMLR, 12–15 Jul 2023.