Variance-Reduced Gradient Estimator for Nonconvex Zeroth-Order Distributed Optimization
Abstract
This paper investigates distributed zeroth-order optimization for smooth nonconvex problems. We propose a novel variance-reduced gradient estimator, which randomly updates the gradient estimate along one coordinate direction in each iteration while leveraging historical snapshots for variance correction. By integrating this estimator with the gradient tracking mechanism, we address the trade-off between convergence rate and sampling cost per zeroth-order gradient estimation that exists in current zeroth-order distributed optimization algorithms, which rely on either the 2-point or the $2d$-point gradient estimator. We derive a sublinear convergence rate for smooth nonconvex functions in terms of the sampling number and the problem dimension $d$. Numerical simulations comparing our algorithm with existing methods confirm the effectiveness and efficiency of the proposed gradient estimator.
I INTRODUCTION
We consider a multi-agent system with $n$ agents, where the agents are connected by a communication network that allows them to exchange information for decentralized decision-making. The goal of this group of agents is to collaboratively solve the following consensus optimization problem in a decentralized manner:
\[
  \min_{x \in \mathbb{R}^{d}} \; f(x) \;\triangleq\; \frac{1}{n}\sum_{i=1}^{n} f_i(x). \tag{1}
\]
Here $x \in \mathbb{R}^{d}$ is the global decision variable. Each function $f_i$ represents the local objective function for agent $i$, known only to the agent itself; $f_i$ is assumed to be smooth but may be nonconvex. Each agent can only use zeroth-order information of $f_i$ during the optimization procedure.
Decentralized optimization has gained significant attention due to its wide range of applications, including multi-agent system coordination [1], power systems [2], communication networks [3], etc. For smooth and convex objective functions, the decentralized gradient descent (DGD) algorithm with a decreasing step-size achieved a sublinear convergence rate [4], while [5] and subsequent works proposed gradient tracking (GT) mechanisms with a fixed step-size, achieving a faster sublinear convergence rate that is comparable to the centralized gradient descent method. In many real-world applications, the objective functions can be nonconvex, and distributed nonconvex optimization has applications in areas such as machine learning [6] and robotics control [7]. For smooth nonconvex functions, DGD achieves convergence to a stationary point [8, 9], while various GT methods achieve convergence to a stationary point at an improved rate [10].
The aforementioned optimization algorithms rely on first-order information. However, in some scenarios, the gradient is unavailable or expensive to obtain, and agents can only access zeroth-order information of the objective functions. Such scenarios arise in optimization with black-box models [11], optimization with bandit feedback [12], fine-tuning language models [13], etc. To address this issue, various gradient-free methods, particularly algorithms based on zeroth-order gradient estimators, have attracted considerable attention [14]. The work [15] investigated the 2-point zeroth-order gradient estimator, which produces a biased stochastic gradient by using the function values of two randomly sampled points. In [16], the $2d$-point gradient estimator was proposed, where $d$ is the dimension of the state variable of each agent. [17] combined the 2-point gradient estimator with DGD and the $2d$-point gradient estimator with GT for nonconvex multi-agent optimization, which led to convergence rates that are comparable with their first-order counterparts. However, [17] also argued that there appears to be a trade-off between the convergence rate and the sampling cost per zeroth-order gradient estimation when combining zeroth-order gradient estimation techniques with distributed optimization frameworks. This trade-off arises from the high sampling burden of the $2d$-point estimator and the inherent variance of the 2-point estimator in distributed settings.
To address this trade-off, we aim to design a variance-reduced zeroth-order gradient estimator with a scalable sampling number of function values that is independent of the dimension $d$. Variance reduction is widely applied in machine learning [18] and stochastic optimization [19]. In [20], variance reduction was used for centralized stochastic gradient descent with strongly convex objectives, achieving a linear convergence rate. [21] applied a 2-point gradient estimator and employed variance reduction for zeroth-order nonconvex centralized stochastic optimization, achieving sublinear convergence. Note that these works only focused on centralized problems. Decentralized finite-sum minimization problems were considered in [22], where variance reduction was used to accelerate convergence; we note that in that work, the variance reduction technique was employed to reduce the variance caused by the finite-sum structure, rather than to address the inherent variance of the 2-point zeroth-order gradient estimator.
In this paper, we propose a new distributed zeroth-order optimization method that incorporates variance reduction techniques as well as the gradient tracking framework, to address the trade-off between convergence rate and sampling cost per zeroth-order gradient estimation in existing zeroth-order distributed optimization algorithms. Specifically, we employ the variance reduction (VR) mechanism to design a novel variance-reduced gradient estimator for distributed nonconvex zeroth-order optimization problems, as formulated in (1). We then combine this new zeroth-order gradient estimation method with the gradient tracking framework, and the resulting algorithm is able to achieve both fast convergence and low sampling cost per zeroth-order gradient estimation. To the best of the authors' knowledge, this is the first work that attempts to address the aforementioned trade-off for general zeroth-order distributed optimization problems. We also provide a rigorous convergence analysis of our proposed algorithm under the smoothness assumption. Although the derived oracle complexities have a higher dependence on the dimension $d$ compared to the algorithms in [17], numerical experiments demonstrate that our proposed algorithm enjoys superior convergence speed and accuracy, reaching lower optimization errors with the same number of samples compared to existing zeroth-order distributed optimization algorithms [17, 23].
Notations: The set of positive integers up to $n$ is denoted as $[n]$. The $i$-th component of a vector $x$ is denoted as $[x]_i$. The spectral norm and spectral radius of a matrix $A$ are represented by $\|A\|$ and $\rho(A)$, respectively. For a vector $x$, $\|x\|$ refers to the Euclidean norm, and $\|x\|_{\infty}^{v}$ refers to the weighted infinity norm defined with respect to a vector $v$ whose components are all positive. For a matrix $A$, $\|A\|$ represents the matrix norm induced by $\|\cdot\|$, and $\|A\|_{\infty}^{v}$ represents the matrix norm induced by $\|\cdot\|_{\infty}^{v}$. For two matrices $A$ and $B$, $A \otimes B$ denotes the Kronecker product. We denote by $\mathbb{B}^{d}$ the closed unit ball in $\mathbb{R}^{d}$, and by $\mathbb{S}^{d-1}$ the unit sphere. $\mathrm{Unif}(\cdot)$ denotes the uniform distribution.
II Formulation And Preliminaries
II-A Problem Formulation
We consider a network consisting of $n$ agents connected via an undirected communication network. The topology of the network is represented by the graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ represent the set of agents and the set of communication links, respectively. The distributed consensus optimization problem (1) can be equivalently reformulated as follows:
\[
  \min_{x_1, \dots, x_n \in \mathbb{R}^{d}} \;\; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i)
  \qquad \text{s.t.} \quad x_1 = x_2 = \cdots = x_n, \tag{2}
\]
where $x_i$ now represents the local decision variable of agent $i$, and the constraint requires the agents to achieve global consensus for the final decision. During the optimization procedure, each agent may obtain other agents' information only via exchanging messages with their neighbors in the communication network. We further impose the restriction that only zeroth-order information of the local objective function is available to each agent. In other words, in each iteration, agent $i$ can query the function values of $f_i$ at finitely many points.
The following assumption will be employed later in this paper.
Assumption 1.
Each $f_i$ is $L$-smooth, i.e., we have
\[
  \|\nabla f_i(x) - \nabla f_i(y)\| \;\le\; L \|x - y\| \tag{3}
\]
for all $x, y \in \mathbb{R}^{d}$ and $i \in [n]$. Furthermore, $f^{\star} \triangleq \inf_{x \in \mathbb{R}^{d}} f(x) > -\infty$.
II-B Preliminaries on Distributed Zeroth-Order Optimization
When gradient information of the objective function is unavailable, one may construct gradient estimators by sampling the function values at a finite number of points, which has been shown to be a very effective approach in the existing literature. We first briefly introduce two types of gradient estimators [17] that are commonly used in noiseless distributed optimization.
Let $F: \mathbb{R}^{d} \to \mathbb{R}$ be a continuously differentiable function. One version of the 2-point zeroth-order gradient estimator for $F$ has the following form:
\[
  g^{(2)}_{\delta}(x) \;\triangleq\; \frac{d}{2\delta}\big(F(x + \delta u) - F(x - \delta u)\big)\, u, \tag{4}
\]
where $\delta > 0$ is a positive scalar called the smoothing radius and $u \sim \mathrm{Unif}(\mathbb{S}^{d-1})$ is a random vector sampled from the uniform distribution on the unit sphere. One can show that the expectation of the 2-point gradient estimator is the gradient of a smoothed version of the original function [24, 25], i.e.,
\[
  \mathbb{E}_{u}\big[g^{(2)}_{\delta}(x)\big] \;=\; \nabla F_{\delta}(x),
  \qquad \text{where } F_{\delta}(x) \triangleq \mathbb{E}_{v \sim \mathrm{Unif}(\mathbb{B}^{d})}\big[F(x + \delta v)\big].
\]
As the smoothing radius $\delta$ tends to zero, the expectation of the 2-point gradient estimator approaches the true gradient $\nabla F(x)$.
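To make the construction concrete, the following minimal Python sketch implements the 2-point estimator in the standard form assumed in (4); the function name, the quadratic test function, and the parameter values are illustrative only, and the snippet is a sketch rather than a reference implementation.

```python
# Minimal sketch of the 2-point zeroth-order estimator (4), assuming the
# symmetric-difference form with u drawn uniformly from the unit sphere.
import numpy as np

def two_point_estimator(F, x, delta, rng):
    """One realization of the 2-point zeroth-order gradient estimator."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                       # u ~ Unif(unit sphere)
    return d * (F(x + delta * u) - F(x - delta * u)) / (2.0 * delta) * u

if __name__ == "__main__":
    # Sanity check: averaging many realizations approaches the true gradient
    # as the smoothing radius tends to zero.
    rng = np.random.default_rng(0)
    F = lambda x: 0.5 * np.sum(x ** 2)           # gradient of this test function is x
    x = np.array([1.0, -2.0, 0.5])
    est = np.mean([two_point_estimator(F, x, 1e-3, rng) for _ in range(20000)], axis=0)
    print(est, "vs", x)
```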
By combining the simple 2-point gradient estimator (4) with the decentralized gradient descent framework, one obtains the following algorithm for distributed zeroth-order consensus optimization (2):
\[
  x_i^{k+1} \;=\; \sum_{j=1}^{n} W_{ij}\, x_j^{k} \;-\; \eta_k\, g^{(2)}_{i,\delta_k}\big(x_i^{k}\big), \tag{5}
\]
which we shall call DGD-2p in this paper. Here $x_i^{k}$ denotes the local decision variable of agent $i$ at the $k$-th iteration, $g^{(2)}_{i,\delta_k}(x_i^{k})$ denotes the 2-point estimator (4) constructed for $f_i$ with smoothing radius $\delta_k$, $W = [W_{ij}]$ is a weight matrix that is taken to be doubly stochastic, and $\eta_k$ is the step-size at iteration $k$. Since each construction of the 2-point gradient estimator (4) requires sampling only two function values, we can see that DGD-2p achieves low sampling cost per zeroth-order gradient estimation. However, as shown by [17], DGD-2p achieves a relatively slow convergence rate in terms of the number of function value queries. [17] argued that this slow convergence rate is mainly due to the inherent variance of the 2-point gradient estimator, whose second moment is bounded by a quantity on the order of $d\,\|\nabla F(x)\|^{2}$ (plus a higher-order term in the smoothing radius $\delta$) under the assumption that $F$ is $L$-smooth. In a distributed optimization algorithm, each agent's local gradient does not vanish to zero even if the system has reached consensus and optimality. Consequently, the inherent variance of the 2-point gradient estimator is inevitable and will considerably slow down the convergence rate.
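For illustration, a minimal sketch of the DGD-2p recursion (5) is given below, assuming a doubly stochastic mixing matrix `W` and a simple decaying step-size schedule; the function names and the schedule are illustrative assumptions rather than a reference implementation.

```python
# Minimal sketch of DGD-2p (5): each agent mixes with its neighbors via a doubly
# stochastic W and then steps along its local 2-point gradient estimate.
import numpy as np

def dgd_2p(F_list, W, X0, delta, steps, step_size, seed=0):
    """F_list: local objectives; X0: (n, d) array of initial local iterates."""
    rng = np.random.default_rng(seed)
    X = X0.copy()
    n, d = X.shape
    for k in range(steps):
        G = np.zeros_like(X)
        for i, F in enumerate(F_list):
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)               # random direction on the unit sphere
            G[i] = d * (F(X[i] + delta * u) - F(X[i] - delta * u)) / (2 * delta) * u
        eta = step_size / np.sqrt(k + 1)         # illustrative decaying step-size
        X = W @ X - eta * G                      # consensus step + zeroth-order gradient step
    return X
```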
To achieve higher accuracy for zeroth-order gradient estimation, existing literature has also proposed the $2d$-point gradient estimator:
\[
  g^{(2d)}_{\delta}(x) \;\triangleq\; \sum_{i=1}^{d} \frac{F(x + \delta e_i) - F(x - \delta e_i)}{2\delta}\, e_i. \tag{6}
\]
Here $e_i$ is the $i$-th standard basis vector such that $[e_i]_j = 1$ when $j = i$ and $[e_i]_j = 0$ otherwise. It can be shown that $\|g^{(2d)}_{\delta}(x) - \nabla F(x)\| \le \frac{L\sqrt{d}}{2}\delta$ when $F$ is $L$-smooth (see, e.g., [17]). Consequently, if we assume the function values of $F$ can be obtained accurately and machine precision issues in numerical computations are ignored, then the $2d$-point gradient estimator can achieve arbitrarily high accuracy when approximating the true gradient. By combining (6) with the distributed gradient tracking method, one obtains the following algorithm:
\[
\begin{aligned}
  x_i^{k+1} &= \sum_{j=1}^{n} W_{ij}\, x_j^{k} - \eta\, s_i^{k},\\
  s_i^{k+1} &= \sum_{j=1}^{n} W_{ij}\, s_j^{k} + g^{(2d)}_{i,\delta_{k+1}}\big(x_i^{k+1}\big) - g^{(2d)}_{i,\delta_{k}}\big(x_i^{k}\big),
\end{aligned}
\tag{7}
\]
which we shall call GT-$2d$. Here $g^{(2d)}_{i,\delta_k}$ denotes the estimator (6) applied to $f_i$ with smoothing radius $\delta_k$, and the auxiliary state variable $s_i^{k}$ in (7) tracks the global gradient across iterations. Distributed zeroth-order optimization algorithms that utilize the $2d$-point gradient estimator, such as GT-$2d$, can achieve faster convergence because the precise estimation of the true gradients allows the further incorporation of gradient tracking techniques. However, GT-$2d$ has a higher sampling cost per gradient estimation compared to DGD-2p: as shown in (6), $2d$ points need to be sampled for each construction of the gradient estimator. This high sampling cost may lead to poor scalability when the dimension $d$ is large.
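The following sketch illustrates the $2d$-point estimator (6) and the GT-$2d$ recursion (7); the helper names and the initialization of the tracking variable are illustrative assumptions, and the snippet is a sketch rather than a reference implementation.

```python
# Minimal sketch of the 2d-point estimator (6) and the GT-2d recursion (7),
# assuming a doubly stochastic mixing matrix W and a fixed smoothing radius.
import numpy as np

def est_2d(F, x, delta):
    """Deterministic 2d-point estimate of grad F(x); costs 2*d function queries."""
    d = x.size
    return np.array([(F(x + delta * e) - F(x - delta * e)) / (2.0 * delta)
                     for e in np.eye(d)])

def gt_2d(F_list, W, X0, delta, eta, steps):
    """Gradient tracking driven by the 2d-point estimator."""
    X = X0.copy()
    G_old = np.array([est_2d(F, x, delta) for F, x in zip(F_list, X)])
    S = G_old.copy()                      # tracker initialized with the first estimates
    for _ in range(steps):
        X = W @ X - eta * S               # consensus step + descent along the tracker
        G_new = np.array([est_2d(F, x, delta) for F, x in zip(F_list, X)])
        S = W @ S + G_new - G_old         # gradient-tracking update of the tracker
        G_old = G_new
    return X
```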
We remark that the $2d$-point gradient estimator (6) can also be interpreted as the expectation of the following coordinate-wise gradient estimator:
\[
  g^{\mathrm{co}}_{\delta}(x; i) \;\triangleq\; d \cdot \frac{F(x + \delta e_i) - F(x - \delta e_i)}{2\delta}\, e_i, \tag{8}
\]
and we have
\[
  \mathbb{E}_{i \sim \mathrm{Unif}([d])}\big[g^{\mathrm{co}}_{\delta}(x; i)\big] \;=\; g^{(2d)}_{\delta}(x), \tag{9}
\]
where $\mathrm{Unif}([d])$ denotes the discrete uniform distribution over the set $[d]$. The coordinate-wise gradient estimator in (8) shares a similar structure with the 2-point gradient estimator in (4). The key difference is that in (8), the perturbation direction is restricted to the orthogonal directions associated with the standard basis, instead of being uniformly sampled from the unit sphere.
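As a quick numerical check of the relation (9), the short sketch below averages the coordinate-wise estimator (8) over all coordinates and compares it with the full $2d$-point estimate; the test function and names are illustrative only.

```python
# Minimal sketch verifying (9): averaging the coordinate-wise estimator (8)
# over all d coordinates recovers the 2d-point estimate (6).
import numpy as np

def coordinate_estimator(F, x, delta, j):
    d = x.size
    e = np.zeros(d); e[j] = 1.0
    return d * (F(x + delta * e) - F(x - delta * e)) / (2.0 * delta) * e

if __name__ == "__main__":
    F = lambda x: np.sum(x ** 4)                 # smooth test function
    x = np.array([0.3, -0.7, 1.1]); delta = 1e-4; d = x.size
    avg = np.mean([coordinate_estimator(F, x, delta, j) for j in range(d)], axis=0)
    full = np.array([(F(x + delta * e) - F(x - delta * e)) / (2 * delta) for e in np.eye(d)])
    print(np.allclose(avg, full))                # True: the average equals the 2d-point estimate
```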
III Our Algorithm
To address the trade-off between convergence rate and sampling cost per gradient estimation in zeroth-order distributed optimization, we employ a variance reduction mechanism [20] to design an improved gradient estimator. The intuition is to combine the best of both worlds, i.e., the precise approximation feature of the $2d$-point gradient estimator and the low sampling cost of the 2-point gradient estimator.
Let $k$ denote the iteration number. For each agent $i$, we keep a snapshot point $\bar{x}_i^{k}$, together with its associated smoothing radius $\bar{\delta}_i^{k}$, at which a relatively accurate gradient estimation has been conducted. We then use a random variable generated from the Bernoulli distribution with parameter $p$ as an activation indicator for the update of the snapshot point: when the indicator equals $1$, agent $i$ updates the snapshot point to be the current iterate, i.e., $\bar{x}_i^{k} = x_i^{k}$ and $\bar{\delta}_i^{k} = \delta_k$, and takes a snapshot of the $2d$-point gradient estimator, which provides a more accurate gradient estimation at iteration $k$; when the indicator equals $0$, the snapshot point together with the smoothing radius remain unchanged from the previous iteration, i.e., $\bar{x}_i^{k} = \bar{x}_i^{k-1}$ and $\bar{\delta}_i^{k} = \bar{\delta}_i^{k-1}$.
In each iteration, agent $i$ constructs the coordinate-wise gradient estimators (8) at the two points $x_i^{k}$ and $\bar{x}_i^{k}$, using the same randomly selected coordinate. We then propose the following variance-reduced gradient estimator (VR-GE):
\[
  g_i^{k} \;\triangleq\; g^{\mathrm{co}}_{i,\delta_k}\big(x_i^{k}; j\big) \;-\; g^{\mathrm{co}}_{i,\bar{\delta}_i^{k}}\big(\bar{x}_i^{k}; j\big) \;+\; g^{(2d)}_{i,\bar{\delta}_i^{k}}\big(\bar{x}_i^{k}\big), \tag{10}
\]
where the coordinate index $j \sim \mathrm{Unif}([d])$ is randomly selected by each agent at each iteration $k$, and $g^{\mathrm{co}}_{i,\cdot}$ and $g^{(2d)}_{i,\cdot}$ denote the estimators (8) and (6) applied to $f_i$. By (9), we can see that
\[
  \mathbb{E}_{j}\big[g_i^{k}\big] \;=\; g^{(2d)}_{i,\delta_k}\big(x_i^{k}\big) - g^{(2d)}_{i,\bar{\delta}_i^{k}}\big(\bar{x}_i^{k}\big) + g^{(2d)}_{i,\bar{\delta}_i^{k}}\big(\bar{x}_i^{k}\big) \;=\; g^{(2d)}_{i,\delta_k}\big(x_i^{k}\big), \tag{11}
\]
i.e., VR-GE provides a stochastic gradient of $f_i$ whose bias is of order $L\sqrt{d}\,\delta_k$, assuming $f_i$ is $L$-smooth.
The expected number of function value samples required per construction of VR-GE is of order $2dp + O(1)$: a constant number of queries for the coordinate-wise estimators, plus $2d$ queries for refreshing the snapshot with probability $p$. When $p = c/d$ for some absolute constant $c$, this becomes $O(1)$, which is independent of the dimension $d$. This gives VR-GE the potential to decrease the sampling cost in high-dimensional zeroth-order distributed optimization by appropriately adjusting the probability $p$. In the following section, specifically in Lemma 1, we will rigorously analyze the variance of VR-GE and demonstrate its variance reduction property.
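A minimal single-agent sketch of the VR-GE construction (10) is given below; the helper names, the stored-snapshot bookkeeping, and the sampling-cost accounting in the comments are illustrative assumptions, and the snippet is a sketch rather than a reference implementation.

```python
# Minimal sketch of VR-GE (10) for one agent: a fresh coordinate-wise estimate
# at the current iterate, the same coordinate's estimate at the snapshot point,
# and the stored full 2d-point snapshot used as the control variate.
import numpy as np

def coord_est(F, x, delta, j):
    d = x.size
    e = np.zeros(d); e[j] = 1.0
    return d * (F(x + delta * e) - F(x - delta * e)) / (2.0 * delta) * e

def full_est(F, x, delta):
    return np.array([(F(x + delta * e) - F(x - delta * e)) / (2 * delta)
                     for e in np.eye(x.size)])

def vr_ge(F, x, delta, snap_x, snap_delta, snap_grad, rng):
    """Variance-reduced gradient estimate at x, given a stored snapshot."""
    j = rng.integers(x.size)                     # random coordinate, uniform over [d]
    return coord_est(F, x, delta, j) - coord_est(F, snap_x, snap_delta, j) + snap_grad

# Sampling cost per call: a constant number of fresh queries at x and at the
# snapshot point; refreshing the snapshot (handled by the caller, with
# probability p) costs 2*d additional queries, i.e. about 2*d*p on average.
```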
In designing our distributed zeroth-order optimization algorithm, we further leverage the gradient tracking mechanism. Existing literature (including [5, 26, 27], etc.) has demonstrated that gradient tracking mechanisms help mitigate the gap in the convergence rates between distributed optimization and centralized optimization when the objective function is smooth. Drawing inspiration from this advantage, we incorporate the variance-reduced gradient estimator with the gradient tracking mechanism to design our algorithm.
The details of the proposed algorithm are outlined in Algorithm 1. Here $\eta$ is the step-size; Steps 1 and 6 implement the gradient tracking mechanism, while Steps 2–5 implement our proposed variance-reduced gradient estimator (10). The convergence guarantees of Algorithm 1 will be provided and discussed in the next section.
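For concreteness, the following end-to-end sketch shows how the gradient-tracking recursion, the Bernoulli($p$) snapshot refresh, and VR-GE fit together; the helper names, initializations, and the constant smoothing radius are illustrative assumptions, and the snippet is a sketch rather than a reference implementation of Algorithm 1.

```python
# Minimal sketch of gradient tracking driven by VR-GE with Bernoulli(p) snapshots.
import numpy as np

def vr_gt(F_list, W, X0, p, eta, delta, steps, seed=0):
    rng = np.random.default_rng(seed)
    X = X0.copy(); n, d = X.shape
    def coord(F, x, dl, j):
        e = np.zeros(d); e[j] = 1.0
        return d * (F(x + dl * e) - F(x - dl * e)) / (2 * dl) * e
    def full(F, x, dl):
        return np.array([(F(x + dl * e) - F(x - dl * e)) / (2 * dl) for e in np.eye(d)])
    snap_x = X.copy(); snap_d = np.full(n, delta)
    snap_g = np.array([full(F, snap_x[i], snap_d[i]) for i, F in enumerate(F_list)])
    G_old = snap_g.copy(); S = G_old.copy()      # tracker initialized with the first estimates
    for _ in range(steps):
        X = W @ X - eta * S                      # consensus step + descent along the tracker
        G_new = np.zeros_like(X)
        for i, F in enumerate(F_list):
            if rng.random() < p:                 # Bernoulli(p) snapshot refresh
                snap_x[i], snap_d[i] = X[i], delta
                snap_g[i] = full(F, snap_x[i], snap_d[i])
            j = rng.integers(d)                  # VR-GE at the current iterate
            G_new[i] = coord(F, X[i], delta, j) - coord(F, snap_x[i], snap_d[i], j) + snap_g[i]
        S = W @ S + G_new - G_old                # gradient-tracking update
        G_old = G_new
    return X
```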
IV Main Results
In this section, we present the convergence result of Algorithm 1 under Assumption 1.
For the subsequent analysis, we denote
and define the following quantities:
Here, the first of these quantities measures the optimality gap in terms of the objective value, the middle ones characterize the consensus errors, and the last one characterizes the tracking error.
Theorem 1.
Under Assumption 1, suppose the parameters of Algorithm 1 satisfy , , is non-increasing, and
Then we have
and
where , .
Proof.
See Section V. ∎
Remark 1.
The convergence rate of Algorithm 1 under Assumption 1 aligns with the rate achieved for distributed nonconvex optimization with gradient tracking using first-order information [10]. In addition, each iteration of VR-GE requires about $2dp$ function value queries, plus a constant, on average. As long as the probability $p$ is chosen sufficiently small, the averaged sampling number for VR-GE is less than that for the $2d$-point gradient estimator.
Remark 2.
Under an appropriate choice of the probability $p$, and assuming that all agents start from the same initial point, the convergence rate can be restated with respect to the number of function value queries through a simple algebraic calculation. Although the resulting rate is not as favorable as the rate of GT-$2d$ in terms of the dependence on the problem dimension $d$, we suspect that this discrepancy is primarily due to the complicated analysis procedure; the conditions in Theorem 1 serve as sufficient but not necessary conditions for convergence, meaning that the rate may not be a theoretically tight bound. As demonstrated in the simulation section, Algorithm 1 converges faster than GT-$2d$ and achieves higher accuracy with the same number of samples. Deriving a tighter convergence result remains a topic for future work.
Next, we present the convergence rate when the probability $p$ is fixed as a suitable function of the dimension $d$ in the following corollary.
Corollary 1.
Proof.
See Appendix F. ∎
Remark 3.
With some standard algebraic calculation, the convergence rate in (12) can be restated in terms of the number of function value queries, and it improves upon the corresponding rate in Theorem 1. Roughly speaking, this improvement is due to the fact that, by fixing the probability $p$ in this manner, the step-size's dependency on the dimension $d$ can be decreased. In light of this improvement, this choice of $p$ offers practical guidance for selecting the probability in practice.
V Outline of Convergence Analysis
V-A Bounding the Variance of VR-GE
The variance of VR-GE is essential for the convergence proof of Algorithm 1, and we provide the analysis details in this subsection. We first rewrite Algorithm 1 as follows:
\[
  \mathbf{x}^{k+1} = (W \otimes I_d)\,\mathbf{x}^{k} - \eta\, \mathbf{s}^{k}, \tag{13a}
\]
\[
  \mathbf{s}^{k+1} = (W \otimes I_d)\,\mathbf{s}^{k} + \mathbf{g}^{k+1} - \mathbf{g}^{k}, \tag{13b}
\]
where $\mathbf{x}^{k}$, $\mathbf{s}^{k}$, and $\mathbf{g}^{k}$ denote the stacked local iterates, tracking variables, and VR-GE estimates, respectively.
We now derive a bound on the expected difference between the variance-reduced gradient estimator and the true gradient in the following lemma.
Lemma 1.
Let each $f_i$ be $L$-smooth. Then for any iteration $k$ and agent $i \in [n]$, it holds that
(14) | ||||
where .
Proof.
Using , we derive that
Then, by the inequality $\mathbb{E}\|\xi - \mathbb{E}[\xi]\|^{2} \le \mathbb{E}\|\xi\|^{2}$ for an arbitrary random vector $\xi$, we can derive from the above inequality that
where we have used (3) and inequality in the second inequality, and the fact that in the last inequality. ∎
Remark 4.
The estimation error of VR-GE (i.e., the left-hand side of (14)) comprises two parts: consensus error and the smoothing radius effect. Consensus error is inherent due to the distributed nature of the algorithm, while the smoothing radius can be tuned properly. As Algorithm 1 approaches consensus and the smoothing radius approaches zero, the estimation error between the VR-GE and the true gradient diminishes. Thus, VR-GE offers reduced variance compared to the 2-point gradient estimator while requiring fewer samples than the -point gradient estimator.
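To illustrate this variance-reduction effect numerically, the short sketch below compares the mean squared deviation from the full $2d$-point estimate for the plain coordinate-wise estimator and for VR-GE, when the current iterate is close to the snapshot point; the test function and parameter values are illustrative assumptions.

```python
# Minimal sketch: near the snapshot point, VR-GE concentrates around the
# 2d-point estimate, whereas the plain coordinate-wise estimator does not.
import numpy as np

rng = np.random.default_rng(1)
d = 50
A = rng.standard_normal((d, d)); A = A @ A.T / d
F = lambda x: 0.5 * x @ A @ x                    # smooth quadratic test function
x = rng.standard_normal(d); delta = 1e-4

def coord(xp, j):
    e = np.zeros(d); e[j] = 1.0
    return d * (F(xp + delta * e) - F(xp - delta * e)) / (2 * delta) * e

full = np.array([(F(x + delta * e) - F(x - delta * e)) / (2 * delta) for e in np.eye(d)])
snap_x = x + 1e-3 * rng.standard_normal(d)       # snapshot point close to the current iterate
snap_g = np.array([(F(snap_x + delta * e) - F(snap_x - delta * e)) / (2 * delta) for e in np.eye(d)])

plain = np.array([coord(x, j) for j in rng.integers(d, size=2000)])
vr = np.array([coord(x, j) - coord(snap_x, j) + snap_g for j in rng.integers(d, size=2000)])
print("plain estimator mean sq. deviation:", np.mean(np.sum((plain - full) ** 2, axis=1)))
print("VR-GE mean sq. deviation:          ", np.mean(np.sum((vr - full) ** 2, axis=1)))
```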
V-B Proof Sketch of Theorem 1
The proof relies on four lemmas. The first lemma analyzes the evolution of the function value using the $L$-smoothness property. We also introduce the necessary shorthand notation and assumptions for the following analysis.
Lemma 2.
Suppose , we have
(15) |
Proof.
See Appendix B. ∎
Lemma 2 derives a bound on the optimization error. We further need to bound the consensus errors; this is tackled by the following lemma.
Lemma 3.
Suppose we choose and . Then we have the following inequality:
(16) |
where
and
Proof.
See Appendix C. ∎
For the subsequent analysis, we need to bound the induced weighted matrix norm of the coefficient matrix in the inequality (16).
Lemma 4.
Proof.
See Appendix D. ∎
Next, we derive a bound on the accumulated consensus errors over iterations using Lemma 3.
Lemma 5.
Suppose we choose and . We have
(18) |
where . We also derive that .
Proof.
See Appendix E. ∎
Based on Lemmas 2 and 5, we are now ready to prove Theorem 1. Plugging the inequality (18) into (15) leads to
(19) |
where we have used in the first inequality and in the last inequality.
Using the bound in Lemma 5 and is summable, we see that is also summable. Consequently, we have as by (19). We can further derive from the inequality (19) that
(20) | ||||
Combining (20) with (18), we have
(21) |
where we have used . Similarly,
where we have used inequality (18) and in the third inequality.
The proof of Theorem 1 can now be concluded.
VI Simulation
We consider a multi-agent nonconvex optimization problem adapted from [17] with $n$ agents in the network, and the objective function of each agent is given as follows:
(22) |
where the parameters of each local objective are randomly generated, and the problem dimension $d$ is fixed for this experiment.
For the following numerical simulation of Algorithm 1, we fix the step-size and the smoothing radius. All agents start from the same initial point to ensure consistency in the initial conditions across the network.
VI-A Comparison with Other Algorithms
Fig. 1 compares Algorithm 1 with DGD-2p, GT-$2d$, and ZONE-M (with its algorithm parameter set to 100). In the figure, Algorithm 1 is run with a fixed snapshot probability $p$. The horizontal axis is normalized and represents the sampling number (i.e., the number of zeroth-order queries). The two sub-figures illustrate the stationarity gap and the consensus error, respectively.
By inspecting Fig. 1, we first see that DGD-2p converges faster than ZONE-M. When comparing DGD-2p and GT-$2d$, we can see a clear difference in their convergence behavior: DGD-2p achieves fast convergence initially but slows down afterwards due to the inherent variance of the 2-point gradient estimator, whereas GT-$2d$ achieves higher eventual accuracy but slower initial convergence due to the higher sampling burden of the $2d$-point gradient estimator.
As demonstrated in Fig. 1, Algorithm 1 offers both high eventual accuracy and a fast convergence rate in terms of stationarity gap and consensus error. This improvement is attributed to the variance reduction mechanism employed in designing VR-GE, which effectively balances the sampling number and expected variance, thereby addressing the trade-off between convergence rate and sampling cost per zeroth-order gradient estimation that exists in current zeroth-order distributed optimization algorithms.
VI-B Comparison of Algorithm 1 under Different Probabilities
Fig. 2 compares the convergence of Algorithm 1 under different choices of the probability $p$, which reflects the frequency with which each agent takes snapshots. The three sub-figures illustrate the stationarity gap, the consensus error, and the tracking error, respectively.
The results demonstrate that Algorithm 1 with a lower probability $p$ achieves better accuracy with fewer samples. However, a lower probability also results in more fluctuation during convergence. This is expected because, with a lower probability, the snapshot variables are updated less frequently, leading to a greater deviation from the true gradient as the iterations progress.
Two notable cases are $p = 1$ and $p = 0$. When $p = 1$, VR-GE behaves similarly to GT-$2d$, taking snapshots at every iteration, which results in poorer convergence performance compared to the cases where $p < 1$. Conversely, when $p = 0$, agents do not take any snapshots, and the snapshot point remains at its initial value. In this scenario, VR-GE uses solely the initial iterate for gradient correction, which introduces variance and reduces convergence accuracy, as illustrated in Fig. 2.
VI-C Comparison of Algorithm 1 under Different Dimensions
Fig. 3 compares the convergence of Algorithm 1 across different problem dimensions, alongside varying snapshot probabilities within the algorithm. The results show that Algorithm 1 can effectively handle these different scenarios, including higher-dimensional ones, achieving small stationarity gaps in all cases.
As the dimension increases, VR-GE requires more samples to accurately estimate the gradient. To maintain similar convergence performance across higher dimensions, the probability for taking snapshots can be adjusted to lower values. As shown in Fig. 3, decreasing the probability as the dimension grows allows Algorithm 1 to achieve a convergence rate and optimization accuracy that are comparable to cases with lower dimensions. However, this adjustment also leads to increased fluctuation during the convergence process. This fluctuation is a result of the randomness introduced by the snapshot mechanism.
VII Conclusion
In this paper, we proposed an improved variance-reduced gradient estimator and integrated it with the gradient tracking mechanism for nonconvex distributed zeroth-order optimization problems. Through rigorous analysis, we demonstrated that our algorithm achieves a sublinear convergence rate for smooth nonconvex functions that is comparable with that of first-order gradient tracking algorithms, while maintaining a relatively low sampling cost per gradient estimation. Comparative evaluations with existing distributed zeroth-order optimization algorithms verified the effectiveness of the proposed gradient estimator. Future work will focus on refining the step-size bounds of the algorithm and reducing the dependence of the convergence rate on the dimension $d$.
Appendix A. Auxiliary Lemmas
This section summarizes some auxiliary lemmas for convergence analysis.
Lemma 6.
Suppose $F$ is $L$-smooth and $x, y \in \mathbb{R}^{d}$. Then
\[
  F(y) \;\le\; F(x) + \langle \nabla F(x),\, y - x\rangle + \frac{L}{2}\|y - x\|^{2}. \tag{23}
\]
Proof.
This lemma is standard in optimization theory, and can be shown via
where the last step follows from the $L$-smoothness of $F$. ∎
Lemma 7 ([17]).
Let $F$ be $L$-smooth. Then for any $x \in \mathbb{R}^{d}$ and $\delta > 0$,
(24) |
Lemma 8 ([26]).
Let . For any , we have
where we denote .
Lemma 9 ([28]).
Let $A$ be a non-negative matrix and $v$ be a positive vector. If $Av \le \lambda v$ holds componentwise for some $\lambda > 0$, then $\|A\|_{\infty}^{v} \le \lambda$.
Appendix B. Proof of Lemma 2
First, by left-multiplying both sides of equation (13b) by the averaging operator and using the initialization of the tracking variable, we derive that
where and .
Appendix C. Proof of Lemma 3
First, following from (13) and (25), we derive that
(29) | ||||
where we have used Lemma 8 in the inequality.
Second, we bound the consensus error . We have
(30) |
For the second term of (30), we have
(31) | ||||
where is an auxiliary parameter that will be determined later.
Now, we can plug (29) and (31) into (30) to obtain
(32) | ||||
Letting and using leads to
Using and under the condition that , we get
Consequently, we derive that
(33) | ||||
Third, we bound the consensus error . Note that
(34) | ||||
For the term in (34), we have
(35) |
Combining (35) with (34), we derive that
(36) |
where we have used inequality for all and in the second inequality.
Appendix D. Proof of Lemma 4
Using Lemma 9, we derive a range of and a positive vector such that the following inequality holds:
which can be equivalently written as
(41a) | |||
(41b) | |||
(41c) |
where and . Under the condition , to show the inequalities (41), it is sufficient to check the following inequalities hold:
and by further letting and , it is sufficient to show the following inequalities:
Using inequality , we have the following inequality:
Consequently, it suffices to ensure that
which can now be justified by the condition . It is straightforward to verify that satisfies and , which completes the proof.
Appendix E. Proof of Lemma 5
It is straightforward to verify that
is a sufficient (but not necessary) condition for
Thus, Lemma 3 holds.
Note that the matrix norm is induced by the vector norm . From the inequality (16), we derive that
(42) | ||||
and by induction on (42), we get
Using the definition of the vector norm , we derive that
(43) | ||||
Note that for any nonnegative sequence and , we have
(44) | ||||
Consequently, we obtain
where and we have used in the inequality.
For the bound on , we first derive that
where we have used the condition that all the follow from the same iterative equation in the third equality. Then, using inequality (44), we further get
Since is summable, we derive that is also summable, which completes the proof.
Appendix F. Proof of Corollary 1
We then demonstrate that the vector and the step-size condition together ensure the inequality
Similar to the proof of Lemma 4 , it suffices to verify
(46) | ||||
It is straightforward to verify that this is a sufficient condition for inequality (46). Using Lemma 9, we then obtain
(47) |
Now, we combine (48) with (15) to derive
(49) | ||||
where we have used in the first inequality and in the last inequality. We then derive from (49) that
The proof of Corollary 1 can now be concluded.
References
- [1] Shu Liang, Le Yi Wang, and George Yin. Distributed smooth convex optimization with coupled constraints. IEEE Transactions on Automatic Control, 65(1):347–353, 2020.
- [2] Xuan Zhang, Antonis Papachristodoulou, and Na Li. Distributed control for reaching optimal steady state in network systems: An optimization approach. IEEE Transactions on Automatic Control, 63(3):864–871, 2018.
- [3] Jie Liu, Daniel W. C. Ho, and Lulu Li. A generic algorithm framework for distributed optimization over the time-varying network with communication delays. IEEE Transactions on Automatic Control, 69(1):371–378, 2024.
- [4] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
- [5] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
- [6] Angelia Nedić. Distributed gradient methods for convex machine learning problems in networks: Distributed optimization. IEEE Signal Processing Magazine, 37(3):92–101, 2020.
- [7] Guido Carnevale, Nicola Mimmo, and Giuseppe Notarstefano. Nonconvex distributed feedback optimization for aggregative cooperative robotics. Automatica, 167:111767, 2024.
- [8] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [9] Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834–2848, 2018.
- [10] Gesualdo Scutari and Ying Sun. Distributed nonconvex constrained optimization over time-varying digraphs. Mathematical Programming, 176:497–544, 2019.
- [11] Zhongguo Li, Zhen Dong, Zhongchao Liang, and Zhengtao Ding. Surrogate-based distributed optimisation for expensive black-box functions. Automatica, 125:109407, 2021.
- [12] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- [13] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. In Advances in Neural Information Processing Systems, volume 36, pages 53038–53075, 2023.
- [14] Anit Kumar Sahu, Dusan Jakovetic, Dragana Bajovic, and Soummya Kar. Communication-efficient distributed strongly convex stochastic optimization: Non-asymptotic rates. arXiv preprint arXiv:1809.02920, 2018.
- [15] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
- [16] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, pages 462–466, 1952.
- [17] Yujie Tang, Junshan Zhang, and Na Li. Distributed zero-order algorithms for nonconvex multiagent optimization. IEEE Transactions on Control of Network Systems, 8(1):269–281, 2021.
- [18] Yongyi Guo, Dominic Coey, Mikael Konutgan, Wenting Li, Chris Schoener, and Matt Goldman. Machine learning for variance reduction in online experiments. In Advances in Neural Information Processing Systems, volume 34, pages 8637–8648, 2021.
- [19] Chong Wang, Xi Chen, Alexander J. Smola, and Eric P. Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, volume 26, 2013.
- [20] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, 2013.
- [21] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [22] Ran Xin, Usman A Khan, and Soummya Kar. Variance-reduced decentralized stochastic optimization with accelerated convergence. IEEE Transactions on Signal Processing, 68:6255–6271, 2020.
- [23] Davood Hajinezhad, Mingyi Hong, and Alfredo Garcia. ZONE: Zeroth-order nonconvex multiagent optimization over networks. IEEE Transactions on Automatic Control, 64(10):3995–4010, 2019.
- [24] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
- [25] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.
- [26] Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017.
- [27] Angelia Nedić, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
- [28] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.