Zeroth-order Random Subspace Algorithm for Non-smooth Convex Optimization
Abstract
Zeroth-order optimization, which does not use derivative information, is one of the significant research areas in mathematical optimization and machine learning. Although various studies have explored zeroth-order algorithms, a key theoretical limitation is that the oracle complexity depends on the dimension, i.e., on the number of variables, of the optimization problem. In this paper, to reduce this dependence on the dimension, we propose a zeroth-order random subspace algorithm that combines a gradient-free algorithm (specifically, Gaussian randomized smoothing with central differences) with random projection. We derive the worst-case oracle complexity of the proposed method in the non-smooth and convex setting; it matches the standard results for full-dimensional non-smooth convex algorithms. Furthermore, we prove that, under additional assumptions, our method also has a local convergence rate independent of the original dimension. In addition to these theoretical results, numerical experiments show that when the objective function has a suitable structure, the proposed method can be more efficient in practice thanks to random projection.
Keywords: zeroth-order optimization, random projection, convex optimization, oracle complexity
1 Introduction
We consider the following unconstrained optimization problem:
(1)   $\min_{x \in \mathbb{R}^n} f(x)$
where the objective function, $f : \mathbb{R}^n \to \mathbb{R}$, is possibly non-smooth, but it is convex. Throughout this paper, we assume that $f$ is $L$-Lipschitz continuous, i.e.,
$|f(x) - f(y)| \le L \|x - y\|$
holds for all $x, y \in \mathbb{R}^n$. Furthermore, we assume that (1) is a zeroth-order optimization problem, which means that only function values of $f$ are accessible and its derivatives are either unavailable or expensive to compute. Zeroth-order optimization has many applications such as bandit optimization [5], adversarial attacks [18], and hyperparameter tuning [3]. Additionally, various types of optimization methods [23] exist for zeroth-order optimization, such as random search and model-based methods, and these methods have been widely studied in both the smooth setting [35, 2, 22, 36] and the non-smooth setting [14, 30, 11, 13, 4, 33].
However, one of the theoretical limitations of zeroth-order algorithms is that the oracle complexity depends on the dimension $n$. For example, Duchi et al. [11] and Gasnikov et al. [14] proposed variants of mirror descent algorithms. Duchi et al. [11] prove, in the zeroth-order setting, a dimension-dependent lower bound on the oracle complexity required for their method to find an $\epsilon$-approximate solution in the case of linear losses. Gasnikov et al. [14] analyze their method on the unit simplex and derive its oracle complexity. On the other hand, Nesterov and Spokoiny [27] proposed a method using Gaussian smoothing, which is defined by
$f_\mu(x) := \mathbb{E}_{u \sim N(0, I_n)}[f(x + \mu u)]$
for a smoothing parameter $\mu > 0$.
By approximating the gradient of the smoothed function by the forward difference $\frac{f(x + \mu u) - f(x)}{\mu} u$ or the central difference $\frac{f(x + \mu u) - f(x - \mu u)}{2\mu} u$, they obtain oracle complexity bounds under different settings, including the case where the objective function is non-smooth and convex. Later, several papers using randomized smoothing [33, 13, 30] improved the oracle complexity by using the central difference. An oracle complexity of order $n \epsilon^{-2}$ is a natural result in zeroth-order optimization, since standard subgradient methods for non-smooth convex functions require $O(\epsilon^{-2})$ iterations and approximating a gradient by finite differences needs $O(n)$ function evaluations.
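As an illustration of these estimators, the following minimal NumPy sketch draws Gaussian directions and forms the forward- and central-difference approximations; the test function, sample size, and parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def forward_difference_estimate(f, x, mu, rng):
    """One forward-difference estimate of the Gaussian-smoothed gradient:
    ((f(x + mu*u) - f(x)) / mu) * u with u ~ N(0, I_n)."""
    u = rng.standard_normal(x.shape[0])
    return (f(x + mu * u) - f(x)) / mu * u

def central_difference_estimate(f, x, mu, rng):
    """One central-difference (two-point) estimate:
    ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u with u ~ N(0, I_n)."""
    u = rng.standard_normal(x.shape[0])
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = lambda x: np.sum(np.abs(x))           # a non-smooth convex test function
    x = rng.standard_normal(100)
    # Averaging many independent estimates approximates the gradient of the
    # smoothed function; for the l1 norm it is close to sign(x).
    g = np.mean([central_difference_estimate(f, x, 1e-3, rng) for _ in range(2000)], axis=0)
    print(np.mean(np.sign(g) == np.sign(x)))  # fraction of matching signs
```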
To overcome this dependence on the dimension $n$, several studies [36, 15, 29] assume that $f$ has a low-dimensional structure. For example, Yue et al. [36] use a notion of effective dimensionality, defined in terms of singular values, and prove that the oracle complexity depends on this effective dimension rather than on $n$. When the objective function is convex with Lipschitz-continuous gradient and Hessian, their algorithm is shown to have an oracle complexity in which the dimension $n$ appearing in the bound of [27] is replaced by the effective dimension. In practice, the effective dimension is often significantly smaller than $n$; in such cases, the oracle complexity is improved under the convex and Lipschitz-gradient setting.
An alternative approach to reducing the dimension dependence in the complexity is to use random projections [6, 22, 2, 31]. Cartis and Roberts [6] combine random projections with a model-based derivative-free method, which approximates the objective function by interpolation. In their approach, instead of working with the original function $f$ directly, they solve in each iteration a lower-dimensional subproblem constructed with a random matrix and a smaller-sized variable. Using random projection theory, they prove that when the objective is smooth and non-convex, the method reduces the dimension dependence of the oracle complexity required to find an $\epsilon$-stationary point. Kozak et al. [22] consider a zeroth-order variant of [21], approximating the exact gradient by a randomly projected gradient. They obtain iteration complexity bounds under various parameter choices and assumptions in the smooth setting. In subsequent work, Rando et al. [30] propose a variant of [22] and obtain an oracle complexity bound in the non-smooth setting. Berglund et al. [2] propose a zeroth-order variant of the randomized subspace Newton method [16] and prove an iteration complexity bound in the strongly convex case. Roberts and Royer [31] propose a subspace variant of the direct search methods [17, 20]. They obtain convergence guarantees under a wide range of subspace projections and directions, and show that their randomized methods give better results than the original full-dimensional ones. However, to the best of our knowledge, there is no work that reduces the dimension dependence of the oracle complexity for non-smooth functions.
1.1 Main Contribution
In this paper, we aim to reduce the dependence on the dimension $n$ in the worst-case oracle complexity by employing random projections, specifically in the non-smooth and convex setting. We propose an algorithm that combines Gaussian smoothing with central differences [27] and random projections. Our idea is to apply Gaussian smoothing to the objective function restricted to a random subspace, i.e., to $y \mapsto f(x + P y)$ for a random matrix $P$, instead of to the original function $f$. We prove that our algorithm achieves an oracle complexity of order $n \epsilon^{-2}$ globally, which is the standard result in the non-smooth and convex setting. Moreover, under additional local assumptions on $f$, we prove an oracle complexity of order $d \epsilon^{-2}$ locally, where $d (< n)$ is the dimension of the random subspace. This indicates that, by choosing $d$ much smaller than $n$, the proposed method improves the local oracle complexity.
We can summarize our contribution as follows.
•
We propose a zeroth-order random subspace algorithm, obtained by applying a random projection technique to a Gaussian smoothing algorithm, for non-smooth convex optimization problems.
•
Our algorithm achieves an oracle complexity of order $n \epsilon^{-2}$ globally, and it also has a local convergence rate independent of the original dimension under additional assumptions.
•
Our numerical experiments show that, thanks to random projection, the proposed method performs well for objective functions with a suitable structure.
1.2 Related Work on Randomized Smoothing
Recently, randomized smoothing has been actively studied for smoothing non-smooth functions [11, 13, 33]. In this approach, the random variable defining the smoothed function is a random vector uniformly distributed on a ball, instead of a normally distributed Gaussian vector. Using randomized smoothing, Shamir [33] proposed an algorithm for bandit convex optimization and showed the optimal rate for convex Lipschitz functions using central differences. Gasnikov et al. [13] proposed a generic approach that combines smoothing and first-order methods. They show that this approach achieves the optimal rate of zeroth-order methods and can exploit various techniques from first-order methods for non-smooth zeroth-order optimization.
In fact, our proposed method can be modified so as to use randomized smoothing instead of Gaussian smoothing, and the theoretical guarantees are essentially the same for both (see Remark 1). In any case, since we use properties of Gaussian random matrices to reduce the dimension of the problem, we adopt Gaussian smoothing in this paper for simplicity of the discussion.
1.3 Organization
In Section 2, we introduce properties of the smoothing function and of random matrices and vectors used in our analysis. In Section 3, we present our algorithm; in Section 4 we prove global convergence with an oracle complexity of order $n \epsilon^{-2}$, and in Section 5 we prove local convergence with an oracle complexity of order $d \epsilon^{-2}$. In Section 6, we show numerical results and demonstrate that when the objective has a structure suitable for random projections, our method converges faster than existing methods by reducing the function evaluation time.
2 Preliminaries
2.1 Notation
$x^*$ denotes one of the optimal solutions of (1). Let $\mathbb{E}[X]$ denote the expectation of a random variable $X$, and let $N(\mu, \Sigma)$ denote the normal distribution with mean $\mu$ and covariance matrix $\Sigma$. $I_n$ denotes the identity matrix of size $n$.
We use $\|\cdot\|$ for the Euclidean norm and $\|X\|_{\psi_2}$ for the sub-Gaussian norm of a sub-Gaussian random variable $X$, which is defined by
$\|X\|_{\psi_2} := \inf\{ t > 0 : \mathbb{E}[\exp(X^2 / t^2)] \le 2 \}.$
From the properties of the sub-Gaussian norm, $\|X\|_{\psi_2} \le K$ is equivalent to
(2)   $\left( \mathbb{E}[|X|^p] \right)^{1/p} \le C K \sqrt{p} \quad \text{for all } p \ge 1,$
where $C$ is an absolute constant. Let $\lambda_i(A)$ denote the $i$-th largest eigenvalue of a matrix $A$, and let $\mathbb{1}[\cdot]$ denote the indicator function defined by $\mathbb{1}[\mathcal{E}] = 1$ if the event $\mathcal{E}$ holds and $\mathbb{1}[\mathcal{E}] = 0$ otherwise.
In particular, when $i = 1$ or $i$ equals the size of $A$, we write $\lambda_{\max}(A)$ or $\lambda_{\min}(A)$, respectively. $\partial f(x)$ denotes the sub-differential of $f$ at $x$.
2.2 Gaussian Smoothing Function
In this subsection, we introduce the definition of Gaussian smoothing function and recall its properties.
Definition 1. For a smoothing parameter $\mu > 0$, the Gaussian smoothing of $f$ is defined by
(3)   $f_\mu(x) := \mathbb{E}_{u \sim N(0, I_n)}[f(x + \mu u)].$
It is well known that if $f$ is convex, then $f_\mu$ is also convex. As derived from Definition 1, the gradient of $f_\mu$ can be calculated by the following:
(4)   $\nabla f_\mu(x) = \mathbb{E}_{u \sim N(0, I_n)}\!\left[ \dfrac{f(x + \mu u) - f(x)}{\mu}\, u \right].$
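For completeness, here is a standard way to obtain the identity (4) from the smoothing definition above: write $f_\mu$ as an integral against the Gaussian density, change variables, and differentiate under the integral (a sketch; regularity issues are omitted).
\[
f_\mu(x) = \frac{1}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} f(x + \mu u)\, e^{-\|u\|^2/2}\, \mathrm{d}u
         = \frac{1}{(2\pi)^{n/2} \mu^n} \int_{\mathbb{R}^n} f(z)\, e^{-\|z - x\|^2/(2\mu^2)}\, \mathrm{d}z,
\]
\[
\nabla f_\mu(x) = \frac{1}{(2\pi)^{n/2} \mu^n} \int_{\mathbb{R}^n} f(z)\, \frac{z - x}{\mu^2}\, e^{-\|z - x\|^2/(2\mu^2)}\, \mathrm{d}z
               = \mathbb{E}_{u \sim N(0, I_n)}\!\left[ \frac{f(x + \mu u)}{\mu}\, u \right]
               = \mathbb{E}_{u \sim N(0, I_n)}\!\left[ \frac{f(x + \mu u) - f(x)}{\mu}\, u \right],
\]
where the last equality uses $\mathbb{E}[u] = 0$.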
We can evaluate the error bound between $f_\mu$ and $f$ when $f$ is convex and Lipschitz continuous.
Lemma 1. Suppose that $f$ is convex and $L$-Lipschitz continuous. Then, for all $x \in \mathbb{R}^n$,
(5)   $f(x) \le f_\mu(x) \le f(x) + \mu L \sqrt{n}.$
2.3 Random Matrices and Vectors
In this subsection, we introduce some properties of Gaussian random matrices and vectors.
Lemma 2.
In particular, when , the following relationship is derived from simple calculations:
(10) |
From Lemma 2.3, we can evaluate the sub-Gaussian norm for random variables as follows.
Corollary 1.
Consider a random vector $u \sim N(0, I_n)$ and an $L$-Lipschitz function $h : \mathbb{R}^n \to \mathbb{R}$. Then,
$\| h(u) - \mathbb{E}[h(u)] \|_{\psi_2} \le C L$
holds, where $C$ is an absolute constant.
From Corollary 1 and the property of the sub-Gaussian norm (2), we have
(11) |
where the constant is independent of the dimension $n$. Next, we recall some properties of random matrices.
Lemma 3.
Proof of Lemma 3.2.
We define , where is a -dimensional vector. Then, we have , and therefore,
Regarding the third and fourth terms on the right-hand side, since the index is not equal to the other indices , and given that , we have and . Similarly, regarding the second term on the right-hand side, since the index is not equal to and , we have . Hence, we obtain
∎
Remark 1.
Randomized smoothing is defined by (3) with the random vector sampled from the uniform distribution on a sphere instead of from the normal distribution, and it is known from [10, 13] that the corresponding smoothing error has a dimension-independent upper bound:
(13)   $f(x) \le f_\mu(x) \le f(x) + \mu L.$
The difference between (5) and (13) comes from the norm of the random vectors. In terms of upper bounds on the smoothing error, there is no essential difference between these two smoothing methods.
Indeed, our oracle complexity analysis also holds when the random vector is sampled from the uniform distribution on a sphere. We can prove analogues of Lemmas 1 and 2 for a vector uniformly sampled on the sphere; for example, the bound above follows from [10, 13]. Instead of Lemma 2.1 we can directly state the corresponding bound, equation (8) also holds, and the analogue of equation (9) holds as well. Lemma 2.3 holds for the uniform distribution on the sphere by [34, Theorem 5.1.4]. Then, the arguments in the following sections hold when we use these lemmas instead of the Gaussian versions described in this section.
3 Proposed Algorithm
In this section, we describe the Randomized Gradient-Free method (RGF) [27] and our proposed method. RGF, a random search method, updates the iterate $x_k$ by (14), where $g_\mu(x_k)$ denotes an approximation of the gradient at $x_k$ along a random direction $u_k \sim N(0, I_n)$. The method can use either the forward difference or the central difference for $g_\mu(x_k)$, i.e.,
$g_\mu(x_k) = \dfrac{f(x_k + \mu u_k) - f(x_k)}{\mu}\, u_k \quad \text{or} \quad g_\mu(x_k) = \dfrac{f(x_k + \mu u_k) - f(x_k - \mu u_k)}{2\mu}\, u_k,$
respectively. For a convex and Lipschitz continuous objective function, RGF with forward differences achieves the iteration complexity shown in [27], and RGF with central differences achieves the improved one shown in [33]. For our proposed method, we confirm that using central differences attains a better oracle complexity than using forward differences.
(14)   $x_{k+1} = x_k - h_k\, g_\mu(x_k),$  where $h_k > 0$ is a step size.
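A minimal NumPy sketch of Algorithm 1 (RGF with central differences) as described above; the step size, iteration budget, and test objective are illustrative choices rather than the parameters analyzed in [27, 33].

```python
import numpy as np

def rgf_central(f, x0, mu, step, num_iters, seed=0):
    """Randomized Gradient-Free method with the central-difference estimator.

    Each iteration draws u ~ N(0, I_n), forms
        g = (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u,
    and updates x <- x - step * g; two function evaluations per iteration.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        u = rng.standard_normal(x.shape[0])
        g = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
        x -= step * g
    return x

if __name__ == "__main__":
    f = lambda x: np.sum(np.abs(x - 1.0))     # non-smooth convex toy objective
    x_hat = rgf_central(f, x0=np.zeros(50), mu=1e-3, step=1e-3, num_iters=20000)
    print(f(x_hat))                            # typically much smaller than f(0) = 50
```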
Combining RGF and random projections, we propose Algorithm 2.
(15)   $x_{k+1} = x_k - h_k\, \dfrac{f(x_k + \mu P_k u_k) - f(x_k - \mu P_k u_k)}{2\mu}\, P_k u_k, \qquad u_k \sim N(0, I_d).$
We define the restriction of $f$ to the random subspace by
(16)   $f_{x_k, P_k}(y) := f(x_k + P_k y), \qquad y \in \mathbb{R}^d,$
where $P_k \in \mathbb{R}^{n \times d}$ is a random matrix whose elements are sampled independently from the standard normal distribution. Lemma 4 shows that taking the expectation over the random matrices yields smoothing functions.
Lemma 4.
(17) | |||
(18) |
hold.
Proof.
Using rotational invariance of normal distribution, we have that holds for any function . Using this property, we obtain
Notice that the distribution of is given by . We can therefore replace by , where is sampled from . Then, we obtain
(19) |
Similarly, we have
and
∎
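To make the construction concrete, here is a schematic NumPy sketch of one natural reading of Algorithm 2: at each iteration, draw a Gaussian matrix $P_k \in \mathbb{R}^{n \times d}$ and apply the central-difference RGF step to the restricted function $y \mapsto f(x_k + P_k y)$ at $y = 0$. The scaling of $P_k$, the step-size rule, and the parameter values below are assumptions made for illustration and may differ from the method analyzed in this paper.

```python
import numpy as np

def subspace_rgf_central(f, x0, d, mu, step, num_iters, seed=0):
    """Sketch of a zeroth-order random subspace step (cf. Algorithm 2).

    At iteration k: sample P with i.i.d. standard normal entries and u ~ N(0, I_d),
    estimate the gradient of y -> f(x + P @ y) at y = 0 by the central difference,
    and move x along the direction P @ u.  Only two evaluations of f are used per
    iteration, regardless of n and d.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = x.shape[0]
    for _ in range(num_iters):
        P = rng.standard_normal((n, d))
        u = rng.standard_normal(d)
        direction = P @ u                      # only the product P @ u is needed
        g = (f(x + mu * direction) - f(x - mu * direction)) / (2.0 * mu)
        x -= step * g * direction
    return x

if __name__ == "__main__":
    f = lambda x: np.sum(np.abs(x - 1.0))      # non-smooth convex toy objective
    x_hat = subspace_rgf_central(f, x0=np.zeros(500), d=10, mu=1e-3, step=1e-5, num_iters=20000)
    print(f(x_hat))
```

Note that the update touches $P_k$ and $u_k$ only through their product $P_k u_k$; compare the discussion in Remark 2.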
4 Global Convergence
In this section, we prove global convergence of our proposed method for convex and Lipschitz continuous functions. We define .
Assumption 1.
$f$ is $L$-Lipschitz continuous, i.e., $|f(x) - f(y)| \le L \|x - y\|$
holds for all $x, y \in \mathbb{R}^n$, and $f$ is convex.
Theorem 1.
Proof.
We define . From (15), we have
(20) |
Taking the expectation with respect to and , we then evaluate the second and third terms. Regarding the second term on the right-hand side of (4), we have
For the third term on the right-hand side of (4), from , we have
By applying the inequality , we obtain
From the rotational invariance of , we have
Selecting , we have
We evaluate using Corollary 1. Note that for any non-negative random variables , holds. Using this relation, we have
where the last equality follows from . Then, we obtain
(21) |
Therefore, we obtain
Taking the expectation with respect to , we obtain
Summing up these inequalities from and , we obtain
Therefore,
∎
With fixed and , we obtain
From this relation, the oracle complexity for achieving the inequality: is
with
These parameters are obtained by the relations of and .
Note that in each iteration, our algorithm calculates the function value twice. Therefore, the oracle complexity is equal to twice the iteration complexity.
5 Local Convergence
In this section, we prove, under some local assumptions on , local convergence of our proposed method.
5.1 Assumptions
Assumption 2.
We have that , and as .
Next we consider the following local assumptions on .
Assumption 3.
There exist a neighborhood of $x^*$ and an Alexandrov matrix that satisfy the following properties.
(i) There exist constants and , and a subgradient such that for all :
(22), (23).
(ii) There exist constants and such that holds for all .
When the objective function is twice differentiable, we can take the subgradient to be the gradient and the Alexandrov matrix to be the Hessian. In the smooth setting, (22) and (23) in Assumption 3(i) are called relative convexity and relative smoothness [16, 19], respectively, and several functions satisfy this property (e.g., the logistic loss and the Wasserstein distance). This assumption implies that, on the neighborhood, the objective function is strongly convex and smooth with respect to the semi-norm induced by the Alexandrov matrix.
While non-smooth objective functions do not necessarily have gradients and Hessians at every point, we can show that when $f$ is convex, it is twice differentiable almost everywhere.
Theorem 2 (Alexandrov [1]). A convex function $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable almost everywhere; that is, for almost every $x \in \mathbb{R}^n$ there exist a vector $\nabla f(x)$ and a symmetric matrix $H(x)$ such that
$f(y) = f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} \langle H(x)(y - x), y - x \rangle + o(\|y - x\|^2) \quad \text{as } y \to x.$
Assumption 3(i) is inspired by Theorem 2, because for almost every point there exists an Alexandrov matrix for which the second-order expansion above holds.
Note that the subgradient and the matrix are used only in the analysis, not in our algorithm.
In the following subsection, we assume the above two assumptions in addition to Assumption 1. The theoretical results in this section can only hold locally around an optimal solution $x^*$. Indeed, we assume Lipschitz continuity in Assumption 1 and relative convexity in Assumption 3(i); the former bounds the growth of $f$ at most linearly, while the latter forces at least quadratic growth with respect to the semi-norm, so these assumptions conflict far away from $x^*$.
5.2 Local Theoretical Guarantees
We define as the smoothing function on the random subspace:
(24) |
From (2.2), we have
(25) |
Under some assumptions, we can show that is a positive definite matrix with high probability.
Proposition 1.
Now, we prove local convergence of our proposed method.
Theorem 3.
Proof.
Let . Using the same argument as in the proof of Theorem 1, from (4) and (21), we obtain
(28) | |||||
We reevaluate the second term on the right-hand side. Now, we evaluate the error between and for any . From relation (22) with and , we have
Taking the expectation with respect to , we obtain
(29) | |||||
First, we evaluate . Using , we have
(30) | |||||
We compute both an upper bound and a lower bound for . From the convexity of , we have
(31) |
From relation (23) with and , we have
(32) |
Using these relations, we have
(33) |
Regarding the second term, using from the rotational invariance of the normal distribution, we have
(34) | |||||
Now, we evaluate . From Hölder’s inequality and Lemma 2.2, we obtain
For any positive semidefinite matrix , holds. Then, we have
(35) |
Finally, from (33),(34), and (35), we obtain
(36) |
Combining relations from (29) and (36), we obtain
By substituting for ,
holds, and then, regarding the second term on the right-hand side of (28), we have
(37) |
After taking the expectation with respect to , we evaluate the right-hand side. As for , we have
Applying Lemma 2.1, we have and
Finally, we have
(38) |
For in (37), from and , we have
(39) |
Next, regarding , we have
For applying conditional expectation property, we define , and . Then, we have
Applying Lemma 3.1 and Proposition 1, we have
Let , we have
(40) |
From (37), (38), (39), and (40), we obtain
(41) |
To satisfy the condition:
we need
(42) |
When satisfies (42), we obtain
(43) |
Then, we have
Taking the expectation with respect to , we have
(44) |
If satisfy (42),
When does not satisfy (42) for some , i.e.,
from Lipschitz continuity of , we have
∎
With fixed and , from (26) we obtain
From this relation, the iteration complexity for achieving the inequality: is
(45) |
with
(46) |
From (27), when the stated conditions hold, we obtain a condition that the reduced dimension $d$ must satisfy. When comparing the global and local iteration complexities, the local behavior becomes better when the original dimension $n$ is sufficiently large relative to $d$; in this case, the local iteration complexity is smaller than the global one for a suitable choice of the reduced dimension $d$.
Remark 2.
Indeed, our algorithm uses only the product $P_k u_k$, and does not use $P_k$ and $u_k$ separately. This is the most interesting aspect of this research using random projections. We consider that the advantage of the random subspace comes from the relation
where the vector on the right-hand side is a random vector whose entries come from a normal distribution. This relation is clear because all of the entries involved follow normal distributions, and the left-hand side problem is identical to the right-hand side when $d = n$. Noticing that the function on the left-hand side is the one denoted in (16), we can regard (15) in Algorithm 2 as one iteration of Algorithm 1 (i.e., RGF [27]) applied to the problem on the left-hand side.
6 Experiments
6.1 Softmax Regression
In this section, we evaluate the performance of the proposed method, Algorithm 2, and compare it with RGF [27] with central differences, described as Algorithm 1, by optimizing a softmax loss function with regularization:
where $X$ and $y$ denote the training data and labels. For the regularization, we fix the coefficient, and the proposed method uses a reduced dimension $d$. For both methods, we set a smoothing parameter $\mu$ and use a fixed step size.
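For reference, here is a sketch of a regularized softmax (multinomial logistic) loss of the kind minimized in this experiment, written so that it can be treated as a black box by the zeroth-order methods sketched earlier. The $\ell_1$ penalty and the parameter names (lam, num_classes) are illustrative assumptions, not the exact setting of the experiments.

```python
import numpy as np

def softmax_loss(w_flat, X, y, num_classes, lam):
    """Softmax (multinomial logistic) loss with an illustrative l1 regularizer.

    w_flat : flattened weight matrix, shape (num_features * num_classes,)
    X      : data matrix, shape (m, num_features)
    y      : integer labels in {0, ..., num_classes - 1}, shape (m,)
    """
    m, p = X.shape
    W = w_flat.reshape(p, num_classes)
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(m), y].mean()
    return nll + lam * np.abs(W).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((200, 50)), rng.integers(0, 3, size=200)
    obj = lambda w: softmax_loss(w, X, y, num_classes=3, lam=1e-3)
    print(obj(np.zeros(50 * 3)))   # log(3) at w = 0; `obj` can be fed to the earlier zeroth-order sketches
```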
Table 1: Datasets used in the softmax regression experiments.

| Name | feature | class | training size |
|---|---|---|---|
| SCOTUS | 126,405 | 13 | 5,000 |
| news20 | 62,061 | 20 | 15,935 |
[Figure 1: objective function values of RGF and the proposed methods versus iteration count for the datasets in Table 1.]
Figure 1 shows the function values of RGF and the proposed methods per iteration for the datasets listed in Table 1. From Figure 1, we find that our algorithm converges faster than RGF after a sufficient number of iterations, while RGF reduces the objective function value more rapidly in the early iterations. This behavior might be consistent with the theoretical guarantees on the global and local worst-case complexities (Theorems 1 and 3, respectively), considering that the global complexity is the same as that of RGF. However, it seems difficult to observe the difference between the coefficients of the local and global sublinear rates, i.e., that the local iteration complexity is a factor $d/n$ of the global one. A likely reason is that the improvement in the local rate concerns the worst-case complexity, and the worst case is not always attained in practice.
As for the function value versus elapsed time, generating random matrices is time-consuming, and there is no time benefit from random projections for general objective functions. In this setting, our proposed methods spend more time than RGF for the same number of iterations.
6.2 Adversarially Robust Logistic Regression
We consider the following adversarially robust optimization problem, which is studied in [24]:
(47) |
where . By letting and , we can rewrite Problem (47) as , where
Note that the derivative of the rewritten objective is difficult to compute in general due to its non-smoothness.
In this formulation, we can take advantage of random projections in our proposed method. When we evaluate the function value, it is necessary to solve
(48) |
In this case, we solve the following approximated optimization problem:
(49) |
where . We will explain that this approximated problem (49) is equivalent to the original problem (48) when some condition holds, and also show that the condition holds with high probability. Now, we confirm that we can obtain such that from the solution of the approximated problem (49). Let denote optimal solutions of this approximated problem (49). When , we can confirm the existence of that satisfies and by solving the linear equation;
(50) |
From the linear independence of the row vectors in , the minimum norm solution of (50) is , and then holds. This inequality implies that when holds, also holds. From , holds. Then, if holds, implies . To show this relation, we next prove Lemma 5.
Lemma 5.
Let , where is a random matrix whose entries are sampled from and . Then, holds with probability at least .
Proof.
Let and . From the property of the minimum eigenvalue, we have
Regarding the second term on the right-hand side, from Lemma 3.1 and , we obtain
with probability at least . Then, we have
The first term on right-hand side is equivalent to the minimum eigenvalue of
and then, we have . Applying Lemma 3.3, holds with probability at least . Hence, we have
with probability at least . ∎
When is large enough and , from Lemma 5, holds with high probability. Then, holds with high probability.
Next, we confirm that the optimal solution to the original problem (48) can be obtained from (49). Let denote the global solution. When is contained in , we can calculate the same maximal value as in the original problem (48) from (49). We show that is contained in with high probability. We have
with probability at least . Hence, with sufficiently large and , we have and is contained in . Note that the problem (48) is not convex optimization even if is a convex set, because is concave.
Table 2: Datasets used in the adversarially robust logistic regression experiments.

| Name | feature | class | training size |
|---|---|---|---|
| news20(binary) | 1,355,191 | 2 | 19,996 |
| random | 1,000,000 | 2 | 100 |
[Figure 2: objective function values of RGF and the proposed methods versus elapsed time for the datasets in Table 2.]
In the numerical experiments, we evaluate the objective value by solving the inner maximization problem with the accelerated proximal gradient method until the norm of the generalized gradient falls below a tolerance. In our proposed method, we evaluate the objective using the approximated random subspace problem (49). Furthermore, we increase the sampling size per iteration in both the proposed method and RGF, i.e., each method updates its iterate using multiple two-point samples per iteration. We use a dataset from [7] and a randomly generated one. For the random dataset, we use a data matrix and a ground-truth vector whose entries are sampled from a normal distribution, and the labels are generated from the resulting linear model with additive Gaussian noise. In both methods, we set a smoothing parameter $\mu$ and use a fixed step size.
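The increased sampling size mentioned above can be read as averaging several independent two-point estimates before each update; a hedged sketch of such a step follows (the batch size and whether the direction is $u$ or $P u$ mirror the earlier sketches, and the exact averaging rule used in the experiments may differ).

```python
import numpy as np

def averaged_two_point_step(f, x, mu, step, batch, rng, P=None):
    """One update using the average of `batch` central-difference estimates.

    If P is None, directions are u ~ N(0, I_n) (RGF); otherwise directions are
    P @ u with u ~ N(0, I_d) (the random subspace variant).  Each estimate costs
    two function evaluations, so one step costs 2 * batch evaluations.
    """
    g = np.zeros_like(x)
    for _ in range(batch):
        u = rng.standard_normal(x.shape[0] if P is None else P.shape[1])
        direction = u if P is None else P @ u
        g += (f(x + mu * direction) - f(x - mu * direction)) / (2.0 * mu) * direction
    return x - step * g / batch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = lambda x: np.sum(np.abs(x))
    x = np.ones(1000)
    P = rng.standard_normal((1000, 20))
    x = averaged_two_point_step(f, x, mu=1e-3, step=1e-4, batch=16, rng=rng, P=P)
    print(f(x))
```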
Figure 2 shows the function values of RGF and the proposed methods versus elapsed time for the datasets listed in Table 2. From Figure 2, when evaluating the objective is time-consuming because of the inner maximization problems, our proposed method converges faster than RGF. This efficiency comes from the random projection technique, which reduces the function evaluation time.
7 Conclusion
We proposed a new zeroth-order method that combines random projections with a smoothing method for non-smooth convex optimization problems. While the proposed method achieves a worst-case oracle complexity equivalent to the standard result in the convex and non-smooth setting, it can converge locally at a rate that does not depend on the dimension $n$ under additional local properties of the objective. In numerical experiments, our method performed well when the function evaluation time can be reduced by random projection. As discussed in Section 6.1, since what we have shown is an improvement of the "worst-case" oracle complexity, the worst case is not necessarily attained when the algorithm is actually run. Indeed, many applications of zeroth-order methods have succeeded despite large-scale models and oracle complexities that depend on the dimension [35, 26, 8, 32]. It would be interesting to investigate whether the iteration complexity of zeroth-order methods is unaffected by the dimension in practical use, or whether it can depend strongly on it in some problem settings; we leave this as future work. We also would like to investigate the convergence rate of our algorithm in the non-smooth and non-convex setting in the future.
Acknowledgement
This work was supported by JSPS KAKENHI Grant Number 23H03351 and JST ERATO Grant Number JPMJER1903.
8 Compliance with Ethical Standards
This work was partially supported by JSPS KAKENHI (23H03351) and JST ERATO (JPMJER1903). The authors declare no conflict of interest.
References
- Alexandrov [1939] A. D. Alexandrov. Almost everywhere existence of the second differential of a convex function and some properties of convex surfaces connected with it. Leningrad State Univ. Annals [Uchenye Zapiski] Math. Ser., 6:3, 1939.
- Berglund et al. [2022] E. Berglund, S. Khirirat, and X. Wang. Zeroth-order randomized subspace newton methods. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6002–6006. IEEE, 2022.
- Bergstra et al. [2011] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.
- Beznosikov et al. [2020] A. Beznosikov, E. Gorbunov, and A. Gasnikov. Derivative-free method for composite optimization with applications to decentralized distributed optimization. IFAC-PapersOnLine, 53(2):4038–4043, 2020.
- Bubeck et al. [2017] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, page 72–85, New York, 2017. Association for Computing Machinery.
- Cartis and Roberts [2023] C. Cartis and L. Roberts. Scalable subspace methods for derivative-free nonlinear least-squares optimization. Mathematical Programming, 199(1-2):461–524, 2023.
- Chang and Lin [2011] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
- Choromanski et al. [2018] K. Choromanski, M. Rowland, V. Sindhwani, R. Turner, and A. Weller. Structured evolution with compact architectures for scalable policy optimization. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 970–978. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/choromanski18a.html.
- Davidson and Szarek [2001] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. Handbook of the geometry of Banach spaces, 1(317-366):131, 2001.
- Duchi et al. [2012] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
- Duchi et al. [2015] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
- Fuji et al. [2022] T. Fuji, P.-L. Poirion, and A. Takeda. Randomized subspace regularized newton method for unconstrained non-convex optimization. arXiv preprint arXiv:2209.04170, 2022.
- Gasnikov et al. [2022] A. Gasnikov, A. Novitskii, V. Novitskii, F. Abdukhakimov, D. Kamzolov, A. Beznosikov, M. Takac, P. Dvurechensky, and B. Gu. The power of first-order smooth optimization for black-box non-smooth problems. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 7241–7265. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/gasnikov22a.html.
- Gasnikov et al. [2016] A. V. Gasnikov, A. A. Lagunovskaya, I. N. Usmanova, and F. A. Fedorenko. Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex. Automation and Remote Control, 77:2018–2034, 2016.
- Golovin et al. [2020] D. Golovin, J. Karro, G. Kochanski, C. Lee, X. Song, and Q. Zhang. Gradientless descent: High-dimensional zeroth-order optimization. In International Conference on Learning Representations, 2020.
- Gower et al. [2019] R. Gower, D. Kovalev, F. Lieder, and P. Richtarik. RSN: Randomized subspace Newton. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Gratton et al. [2015] S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25(3):1515–1541, 2015.
- Ilyas et al. [2018] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2137–2146. PMLR, 10–15 Jul 2018.
- Karimireddy et al. [2018] S. P. Karimireddy, S. U. Stich, and M. Jaggi. Global linear convergence of Newton's method without strong convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.
- Kolda et al. [2003] T. G. Kolda, R. M. Lewis, and V. Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM review, 45(3):385–482, 2003.
- Kozak et al. [2021] D. Kozak, S. Becker, A. Doostan, and L. Tenorio. A stochastic subspace approach to gradient-free optimization in high dimensions. Computational Optimization and Applications, 79(2):339–368, 2021.
- Kozak et al. [2023] D. Kozak, C. Molinari, L. Rosasco, L. Tenorio, and S. Villa. Zeroth-order optimization with orthogonal random directions. Mathematical Programming, 199(1-2):1179–1219, 2023.
- Larson et al. [2019] J. Larson, M. Menickelly, and S. M. Wild. Derivative-free optimization methods. Acta Numerica, 28:287–404, 2019.
- Madry et al. [2018] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
- Magnus [1978] J. R. Magnus. The moments of products of quadratic forms in normal variables. Univ., Instituut voor Actuariaat en Econometrie, 1978.
- Mania et al. [2018] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
- Nesterov and Spokoiny [2017] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
- Niculescu and Persson [2006] C. Niculescu and L.-E. Persson. Convex functions and their applications, volume 23. Springer, New York, 2006.
- Qian et al. [2016] H. Qian, Y.-Q. Hu, and Y. Yu. Derivative-free optimization of high-dimensional non-convex functions by sequential random embeddings. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1946–1952. AAAI Press, 2016.
- Rando et al. [2023] M. Rando, C. Molinari, L. Rosasco, and S. Villa. An optimal structured zeroth-order algorithm for non-smooth optimization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=SfdkS6tt81.
- Roberts and Royer [2023] L. Roberts and C. W. Royer. Direct search based on probabilistic descent in reduced spaces. SIAM Journal on Optimization, 33(4):3057–3082, 2023.
- Salimans et al. [2017] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
- Shamir [2017] O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1):1703–1713, 2017.
- Vershynin [2018] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, Cambridge, 2018.
- Ye et al. [2018] H. Ye, Z. Huang, C. Fang, C. J. Li, and T. Zhang. Hessian-aware zeroth-order optimization for black-box adversarial attack. arXiv preprint arXiv:1812.11377, 2018.
- Yue et al. [2023] P. Yue, L. Yang, C. Fang, and Z. Lin. Zeroth-order optimization with weak dimension dependency. In G. Neu and L. Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 4429–4472. PMLR, 12–15 Jul 2023.