Distributed Stochastic Optimization under Heavy-Tailed Noises
Abstract
This paper studies the distributed optimization problem under heavy-tailed gradient noises. Here, a heavy-tailed noise means that the noise does not necessarily satisfy the bounded variance assumption, i.e., $\mathbb{E}[\|\xi\|^2]\le\sigma^2$ for some positive constant $\sigma$. Instead, it satisfies a more general assumption $\mathbb{E}[\|\xi\|^{\alpha}]\le\sigma^{\alpha}$ for some $\alpha\in(1,2]$. The commonly used bounded variance assumption is a special case of the considered noise assumption. A typical example of this kind of noise is a Pareto-distributed noise with tail index within $(1,2]$, which has infinite variance. Although several distributed optimization algorithms have been proposed for the heavy-tailed noise scenario, these algorithms need a centralized server in the network that collects the information of all clients. Different from these algorithms, this paper considers the setting where there is no centralized server and the agents can only exchange information with neighbors in a communication graph. A distributed method combining gradient clipping and distributed stochastic subgradient projection is proposed. It is proven that when the gradient descent step-size and the gradient clipping step-size meet certain conditions, the state of each agent converges to the optimal solution of the distributed optimization problem with probability 1. Simulation results validate the algorithm.
Index Terms:
Multi-agent system; Heavy-tailed noise; Distributed stochastic optimization.
I Introduction
I-A Distributed Optimization
Distributed optimization is a method of achieving optimization through cooperation among multiple agents, which can be used to solve large-scale optimization problems that are hard for centralized algorithms to cope with. The core idea of distributed optimization is to decompose a large problem into smaller subproblems and distribute the local objectives to multiple agents for solution. Compared with centralized optimization, distributed optimization overcomes the disadvantages of a single point of failure, high computation and communication burden, and restrictions on scalability and flexibility [1]. Distributed optimization has a wide range of applications, such as machine learning [2], the distributed economic dispatch problem of energy systems [3], and the formation control problem in robotic systems [4].
I-B Distributed Stochastic Optimization
Different from deterministic optimization, stochastic optimization employs random factors in the search for a solution to optimization problems. Stochastic optimization has wide applications in various fields. For example, the stochastic gradient descent (SGD) method has been a major driver of progress in machine learning.
As an extension of distributed deterministic optimization to stochastic scenarios, distributed stochastic optimization has also attracted widespread attention. For instance, the authors in [5] presented a distributed stochastic subgradient projection algorithm tailored for convex optimization problems. The authors in [6] introduced a distributed algorithm that relies on asynchronous step-sizes. The authors in [7] proposed a proximal primal-dual approach geared towards nonconvex and nonsmooth problems. The authors in [8] examined the high-probability convergence of distributed stochastic optimization.
I-C Heavy-Tailed Noise
Most of the above literature on stochastic optimization assumes that the variance of the random variable is bounded. Although this is a mild assumption that covers many noises such as Gaussian noises, there are still many noises that do not satisfy this condition, for example, the Pareto distribution and the $\alpha$-stable Lévy distribution [9]. Heavy-tailed noises have recently been observed in many machine learning systems [9, 10, 11, 12, 13]. For example, in [9], it was found that training an attention model on Wikipedia can generate heavy-tailed gradient noises. Moreover, it has been shown that the traditional SGD method may diverge when the noise is heavy-tailed [9, 10].
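To make the heavy-tail phenomenon concrete, the following small experiment (not from the paper; the tail index and sample sizes are illustrative choices) draws from a Pareto distribution with tail index $2$: the sample mean stabilizes near the finite expectation, but the sample variance does not settle, reflecting the infinite variance.

```python
# Empirical illustration: a Pareto distribution with tail index a = 2 has a
# finite mean but infinite variance, so the sample variance never stabilizes.
import numpy as np

rng = np.random.default_rng(0)
a = 2.0                                   # tail index; variance infinite for a <= 2
n_small, n_large = 10_000, 1_000_000

# numpy's pareto(a) returns z - 1 where z is classical Pareto with minimum 1
z_small = 1.0 + rng.pareto(a, size=n_small)
z_large = 1.0 + rng.pareto(a, size=n_large)

# The mean a/(a-1) = 2 is finite, so the sample mean is typically close to 2 ...
print(z_large.mean())
# ... but the sample variance keeps fluctuating/growing with the sample size.
print(z_small.var(), z_large.var())
```

This is exactly the regime where plain SGD can behave badly, motivating the clipping used later in the paper.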
I-D Contributions
In this work, we study the distributed stochastic optimization algorithm under heavy-tailed gradient noises using neighboring agents' information exchange only. The main contributions of this paper are listed as follows:
1) We design a distributed updating law to deal with the stochastic optimization problem under heavy-tailed noises. A rigorous proof is given to show that, under some mild parameter conditions, the state of each agent converges to the optimal solution with probability 1.
2) Compared with the existing distributed stochastic optimization works, e.g., [5, 6, 14, 15, 16, 7], our algorithm allows heavy-tailed noises that satisfy $\mathbb{E}[\|\xi_{i,k}\|^{\alpha}\mid\mathcal{F}_{k-1}]\le\sigma^{\alpha}$ for some $\alpha\in(1,2]$ and positive constant $\sigma$, which generalizes the commonly used bounded variance assumption $\mathbb{E}[\|\xi_{i,k}\|^{2}\mid\mathcal{F}_{k-1}]\le\sigma^{2}$.
3) Compared with the existing literature on heavy-tailed noises, e.g., [9, 17, 18, 19, 10, 20, 21, 22], our algorithm is distributed in the sense that there is no central server and the agents can only exchange information via a strongly connected communication graph. Note that although [10, 20, 21, 22] studied heavy-tailed noises in a distributed setting, these algorithms require a central server [10, 21, 22] or require each agent to access the states of all the other agents at some time instants [20]. These works are not related to multi-agent consensus, while our work builds on consensus and graph theory.
I-E Organization
The rest of this paper is organized as follows: In Section II, notations and preliminary knowledge are given. In Section III, the studied distributed optimization problem is formulated. In Section IV, a distributed updating law is proposed for each agent based on gradient clipping and distributed stochastic subgradient projection method. The almost sure convergence to the optimal solution is proven. In Section V, a numerical example is given to illustrate the effectiveness and efficiency of the proposed algorithm. Finally, in Section VI, we conclude the paper.
II Notations and Preliminaries
Throughout this paper, $0$ represents a zero vector with an appropriate dimension or the scalar zero. $\mathbb{R}$ and $\mathbb{R}^{n}$ represent the real number set and the $n$-dimensional real vector set, respectively. $|\cdot|$ is the absolute value of a scalar, and $\|\cdot\|$ is the 2-norm of a vector. ``inf'' represents the infimum of a sequence. $P_X(\cdot)$ is the Euclidean projection of a vector onto the set $X$. $[x]_j$ is the $j$-th element of a vector $x$. $[A]_{ij}$ is the element in the $i$-th row and the $j$-th column of a matrix $A$. $\min\{a,b\}$ represents the minimum of $a$ and $b$. $\mathbb{E}[\cdot]$ is the expectation of a random variable.
Lemma 1
[5] For scalars $\beta_1,\dots,\beta_N\ge 0$ where $\sum_{i=1}^{N}\beta_i=1$ and vectors $v_1,\dots,v_N\in\mathbb{R}^{n}$, the following conclusion holds:
$\big\|\sum_{i=1}^{N}\beta_i v_i\big\|^{2}\le\sum_{i=1}^{N}\beta_i\|v_i\|^{2}$. (1)
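Reading (1) as the standard convexity inequality for the squared norm (an assumption here, reconstructed from [5]), a quick numeric sanity check:

```python
# Numeric check of the convexity inequality of Lemma 1, read as
# ||sum_i b_i v_i||^2 <= sum_i b_i ||v_i||^2 for convex weights b_i
# (our reconstruction of the garbled statement).
import numpy as np

rng = np.random.default_rng(1)
N, n = 5, 3
b = rng.random(N); b /= b.sum()          # nonnegative weights summing to 1
V = rng.standard_normal((N, n))          # N random vectors in R^n

lhs = np.linalg.norm(b @ V) ** 2                       # ||sum_i b_i v_i||^2
rhs = float(sum(bi * np.linalg.norm(vi) ** 2 for bi, vi in zip(b, V)))
print(lhs <= rhs + 1e-12)   # the squared norm is convex, so this holds
```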
Lemma 2
(Lemma 3.1 of [5]) Let $\{\gamma_k\}$ be a scalar sequence. (1) If $\lim_{k\to\infty}\gamma_k=\gamma$ and $0<\beta<1$, then $\lim_{k\to\infty}\sum_{\ell=0}^{k}\beta^{k-\ell}\gamma_\ell=\frac{\gamma}{1-\beta}$. (2) If $\gamma_k\ge 0$ for all $k$, $\sum_{k=0}^{\infty}\gamma_k<\infty$ and $0<\beta<1$, then $\sum_{k=0}^{\infty}\big(\sum_{\ell=0}^{k}\beta^{k-\ell}\gamma_\ell\big)<\infty$.
III Problem Formulation
Consider a multi-agent system comprised of $N$ agents. The agents cooperate to solve the following optimization problem
$\min_{x\in X}\ f(x)=\sum_{i=1}^{N}f_i(x)$, (2)
where $x\in\mathbb{R}^{n}$ is the decision variable, $f$ is the global objective function, $f_i$ is the local objective function of agent $i$, and $X\subseteq\mathbb{R}^{n}$ is the constraint set.
Suppose that each agent can only get information from its neighbors via a communication graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}=\{1,\dots,N\}$ is the node set and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the edge set. Let $A=[a_{ij}]\in\mathbb{R}^{N\times N}$ be the adjacency matrix of $\mathcal{G}$.
Assumption 1
[5] The graph $\mathcal{G}$ is strongly connected. The adjacency matrix $A$ is doubly stochastic, i.e., $\sum_{i=1}^{N}a_{ij}=1$ for all $j$ and $\sum_{j=1}^{N}a_{ij}=1$ for all $i$. There exists a scalar $\eta>0$ such that $a_{ij}\ge\eta$ when $a_{ij}>0$.
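A common way to obtain weights satisfying the doubly stochastic requirement of Assumption 1 on an undirected connected graph is the Metropolis construction; the sketch below (a standard recipe, not taken from the paper; the 4-node ring is a hypothetical example) builds such a matrix and checks both row and column sums.

```python
# Metropolis weights: a standard construction of a symmetric, doubly
# stochastic weight matrix on an undirected connected graph.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # hypothetical ring of 4 agents
N = 4
deg = [sum(1 for e in edges if v in e) for v in range(N)]

A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
for i in range(N):
    A[i, i] = 1.0 - A[i].sum()             # self-weight makes each row sum to 1

# Symmetry plus row-stochasticity gives double stochasticity.
print(A.sum(axis=0), A.sum(axis=1))
```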
Assumption 2
The constraint set $X$ is nonempty, convex and compact.
Assumption 3
$f_i$ is strongly convex on $X$, i.e., there exists a modulus $\mu>0$ such that $(\nabla f_i(x)-\nabla f_i(y))^{\top}(x-y)\ge\mu\|x-y\|^{2}$ for any $x,y\in X$. Moreover, $f_i$ is continuously differentiable.
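The inline inequality of Assumption 3 was garbled in extraction; reading it as the gradient-monotonicity form of strong convexity, the following check illustrates it on a simple hypothetical objective $f(x)=\|x\|^{2}$ (modulus $\mu=2$), for which the inequality holds with equality.

```python
# Check of the strong-convexity inequality (our reading of Assumption 3):
# (grad f(x) - grad f(y))^T (x - y) >= mu ||x - y||^2,
# illustrated with the hypothetical f(x) = ||x||^2, grad f(x) = 2x, mu = 2.
import numpy as np

rng = np.random.default_rng(4)
mu = 2.0
margins = []
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    lhs = (2 * x - 2 * y) @ (x - y)          # (grad f(x) - grad f(y))^T (x - y)
    margins.append(lhs - mu * np.linalg.norm(x - y) ** 2)
print(min(margins))   # >= 0 up to rounding: inequality holds (equality here)
```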
IV Algorithm Design
The distributed updating law for agent $i$ is designed as
$x_{i,k+1}=P_X\big(v_{i,k}-\gamma_k\hat{g}_{i,k}\big)$, (3)
where
$v_{i,k}=\sum_{j=1}^{N}a_{ij}x_{j,k}$, (4)
and
$\hat{g}_{i,k}=\min\Big(1,\frac{\tau_k}{\|\nabla f_i(v_{i,k})+\xi_{i,k}\|}\Big)\big(\nabla f_i(v_{i,k})+\xi_{i,k}\big)$. (5)
In (3)-(5), $\gamma_k$ and $\tau_k$ are two positive sequences to be determined later, and $\xi_{i,k}$ is the gradient noise random variable of agent $i$ at step $k$.
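Since the display equations (3)-(5) did not fully survive extraction, the following is a minimal runnable sketch of our reading of the scheme: consensus averaging with a doubly stochastic matrix, a clipped noisy gradient step, and Euclidean projection onto $X$. The ring topology, quadratic local objectives, ball constraint, and step/clipping sequences are illustrative assumptions, not the paper's simulation setup.

```python
# Sketch of the update law as we read it: consensus averaging (4), gradient
# clipping (5), projected step (3). All problem data below are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
N, n, R = 4, 2, 10.0                     # agents, dimension, radius of X

# Doubly stochastic averaging weights on a 4-agent ring.
A = np.zeros((N, N))
for i in range(N):
    A[i, i] = 0.5
    A[i, (i - 1) % N] = 0.25
    A[i, (i + 1) % N] = 0.25

targets = rng.standard_normal((N, n))    # f_i(x) = 0.5 * ||x - t_i||^2

def project(x):                          # Euclidean projection onto ||x|| <= R
    nx = np.linalg.norm(x)
    return x if nx <= R else (R / nx) * x

def clip(g, tau):                        # GClip-style operator, cf. (5)
    ng = np.linalg.norm(g)
    return g if ng <= tau else (tau / ng) * g

x = rng.standard_normal((N, n))          # initial states
for k in range(1, 3001):
    gamma, tau = 1.0 / k, k ** 0.25      # hypothetical step and clipping sizes
    v = A @ x                            # consensus averaging, cf. (4)
    noise = rng.pareto(2.0, size=(N, n)) - 1.0   # zero-mean heavy-tailed noise
    g = np.stack([clip((v[i] - targets[i]) + noise[i], tau) for i in range(N)])
    x = np.stack([project(v[i] - gamma * g[i]) for i in range(N)])   # cf. (3)

x_star = targets.mean(axis=0)            # minimizer of sum_i f_i (inside X)
print(np.linalg.norm(x - x_star, axis=1))   # per-agent distance to optimum
```

With the decaying step-size and a slowly growing clipping threshold, the agents typically reach consensus and approach the optimizer despite the infinite-variance noise.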
Let $\mathcal{F}_k$ be the $\sigma$-algebra generated by the random variables $\{\xi_{i,\ell}:i=1,\dots,N,\ 0\le\ell\le k\}$.
The following assumptions on the noises are made to facilitate the subsequent analysis.
Assumption 4
[9] (Bounded $\alpha$-Moment) There exist two positive constants $\alpha$ and $\sigma$ with $\alpha\in(1,2]$ such that $\mathbb{E}[\|\xi_{i,k}\|^{\alpha}\mid\mathcal{F}_{k-1}]\le\sigma^{\alpha}$ with probability 1.
Remark 2
Assumption 5
(Unbiased Local Gradient Estimator) $\mathbb{E}[\xi_{i,k}\mid\mathcal{F}_{k-1}]=0$ with probability 1.
Remark 3
The algorithm proposed in (3)-(5) is motivated by the GClip algorithm for a centralized convex optimization problem with heavy-tailed noises proposed in [9] and the distributed stochastic subgradient projection algorithm proposed for a distributed convex optimization problem with standard noises in [5].
Remark 4
In (3)-(5), all agents use the same gradient descent step-size and gradient clipping step-size. Thus, the algorithm is not ``fully distributed''. In fact, this hypothesis is common in the existing distributed optimization literature. This work is still novel compared with the existing algorithms on heavy-tailed noises since the state exchange is distributed.
Let and .
Based on Assumption 2, there exists a positive scalar $B$ such that $\|x\|\le B$ for all $x\in X$.
Based on the definition of $v_{i,k}$ in (4) and Assumption 1, $v_{i,k}$ is a convex combination of points in $X$, so $v_{i,k}\in X$ for all $i$ and $k$.
Thus, $\|v_{i,k}\|\le B$ for all $i$ and $k$.
Based on the above analysis, we can arrive at the following conclusion.
Proof:
The proof is similar to Lemma 5.1 of [25]. For the completeness of this work, we put it here. Define
(6) |
and
(7) |
Since , . Thus, . Since and based on Assumption 5,
(8) |
with probability 1, where we used Jensen's inequality and the convexity of the norm.
Based on Hölder's inequality for conditional expectation,
(9) |
with probability 1.
Based on Markov's inequality for conditional expectation,
(10) |
with probability 1.
Then, with probability 1. ∎
The following lemma can be directly obtained from Lemma 3.2 of [5].
Lemma 4
Based on the above lemmas, the following conclusion holds.
Theorem 1
where is a constant satisfying for all . Then, converges to the unique optimal solution with probability 1.
Proof:
Define
(15) |
and
(16) |
(17) |
where we used Assumption 1.
Then,
(18) |
Let , , , then (16) can be written as
(19) |
which implies that
(20) |
where in the last step we use to replace , and .
Thus,
(21) |
where .
For any , based on the non-expansive property and the definition of ,
(24) |
where in the last step we used the strong convexity of in Assumption 3.
Thus, based on Assumption 1,
(28) |
Taking the conditional expectation of (29) with respect to the $\sigma$-algebra $\mathcal{F}_{k-1}$ gives
(30) |
Based on Lemma 3 and applying Young's inequality to the penultimate term, we have
(31) |
with probability 1.
According to (23), it can be obtained that
(32) |
Since , is bounded. According to the definition of in Lemma 4, . Then, is bounded.
Since , based on Lemma 2 (2), we have
(33) |
Based on the conditions and , we can get that and . Thus, by (32), .
Remark 6
Comparing the assumptions in [5] with those in this work, the main difference is the conditions on the noises: in Assumption 4 we consider heavy-tailed noises. In addition, in Assumption 3, we require the local objective functions to be strongly convex while [5] only requires convexity. The reason is that, to guarantee the boundedness of the terms on the right-hand side of (31), strong convexity yields a quantity that can be shown to be bounded, while with mere convexity the required boundedness is hard to guarantee together with the other conditions given in the theorem.
V Simulation
In this section, we consider $N$ agents that collaborate to solve an $\ell_1$-regularized logistic regression problem [26], where
(34) |
with .
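The specifics of the local objective in (34) did not survive extraction; the following sketches a typical $\ell_1$-regularized logistic-regression local cost in the spirit of [26]. The data, regularization weight, and the exact form of the cost are our assumptions, not the paper's.

```python
# Hypothetical l1-regularized logistic-regression local cost for agent i,
# with samples (u_s, y_s), labels y_s in {-1, +1} (an assumed form).
import numpy as np

def local_cost(x, U, y, lam):
    # mean logistic loss over the agent's samples, plus l1 regularization
    z = -y * (U @ x)
    return np.mean(np.logaddexp(0.0, z)) + lam * np.abs(x).sum()

rng = np.random.default_rng(3)
U = rng.standard_normal((20, 5))           # 20 samples, 5 features
y = np.sign(U @ rng.standard_normal(5))    # hypothetical labels
print(local_cost(np.zeros(5), U, y, lam=0.1))   # log(2) ≈ 0.6931 at x = 0
```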
The constraint set is given by .
Let , , and let be
(35) |
The communication graph is depicted in Fig. 1, where the weights are defined as and . The graph satisfies Assumption 1.

Let $\nu$ be a random variable with $\nu=z-2$, where $z$ satisfies the Pareto distribution with tail index $2$ and minimum parameter $1$. The probability density function of $z$ can be described as $p(z)=2/z^{3}$ for $z\ge 1$. The expectation of $z$ is $2$ and the variance is infinite. Thus, $\mathbb{E}[\nu]=0$. Moreover, for $\alpha\in(1,2)$ we have
(36) |
Let each element of $\xi_{i,k}$ be an independent copy of $\nu$, i.e., a zero-mean shifted Pareto random variable with tail index 2 and minimum parameter 1. Based on the above analysis, $\mathbb{E}[\xi_{i,k}\mid\mathcal{F}_{k-1}]=0$ and the conditional $\alpha$-moment of $\xi_{i,k}$ is bounded. Thus, $\xi_{i,k}$ satisfies Assumptions 4 and 5.
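The noise construction above can be checked empirically; the sketch below (our reading of the garbled text: noise $=z-2$ with $z$ Pareto, tail index 2, minimum parameter 1) verifies the zero mean and a finite $\alpha$-moment for $\alpha<2$.

```python
# Simulation-noise check: nu = z - 2 with z classical Pareto(a=2, z_min=1)
# is zero-mean (Assumption 5) with finite alpha-moment for alpha < 2
# (Assumption 4), despite infinite variance.
import numpy as np

rng = np.random.default_rng(5)
n_samples = 500_000
z = 1.0 + rng.pareto(2.0, size=n_samples)  # classical Pareto, min parameter 1
nu = z - 2.0                               # E[z] = a/(a-1) = 2, so E[nu] = 0

print(nu.mean())                           # near 0: unbiasedness
alpha = 1.5
print(np.mean(np.abs(nu) ** alpha))        # finite alpha-moment estimate
```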
Fig. 2 shows the evolution of . As can be seen, the algorithm converges to the optimal solution.

Simulation with Different Tail Index: It can be verified that noises with tail index satisfy Assumptions 4 and 5. When , the variance is bounded. The evolution of with tail index is shown in Fig. 3. It can be seen that a larger tail index can lead to faster convergence.

Comparison with [5]: We compare our algorithm with the distributed stochastic subgradient projection algorithm proposed in [5], even though the Pareto random noise does not satisfy the bounded variance condition (Assumption 4 of [5]). When is small, it is found that the performance of our algorithm is similar to that of [5]. Let . The simulation result is shown in Fig. 4. The performance of our algorithm is better than that of [5].

VI Conclusions
In this paper, we considered the distributed optimization problem under heavy-tailed noises. Unlike previous studies on heavy-tailed noises, we proposed a distributed update law that only uses neighboring information and requires no central node in the network. Unlike previous studies on distributed stochastic optimization, this paper adopts a heavy-tailed noise assumption instead of the commonly considered bounded variance assumption, which is broader. The simulation results demonstrate the effectiveness of the algorithm. In the future, we will investigate algorithm design for distributed optimization problems with inequality constraints.
References
- [1] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, ``A survey of distributed optimization,'' Annual Reviews in Control, vol. 47, pp. 278–305, 2019.
- [2] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, ``Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning,'' in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2012, pp. 1543–1550.
- [3] P. Yi, Y. Hong, and F. Liu, ``Initialization-free distributed algorithms for optimal resource allocation with feasibility constraints and application to economic dispatch of power systems,'' Automatica, vol. 74, pp. 259–269, 2016.
- [4] C. Sun, Z. Feng, and G. Hu, ``Time-varying optimization-based approach for distributed formation of uncertain Euler–Lagrange systems,'' IEEE Transactions on Cybernetics, vol. 52, no. 7, pp. 5984–5998, 2021.
- [5] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli, ``Distributed stochastic subgradient projection algorithms for convex optimization,'' Journal of Optimization Theory and Applications, vol. 147, pp. 516–545, 2010.
- [6] K. Srivastava and A. Nedic, ``Distributed asynchronous constrained stochastic optimization,'' IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 772–790, 2011.
- [7] Z. Wang, J. Zhang, T.-H. Chang, J. Li, and Z.-Q. Luo, ``Distributed stochastic consensus optimization with momentum for nonconvex nonsmooth problems,'' IEEE Transactions on Signal Processing, vol. 69, pp. 4486–4501, 2021.
- [8] K. Lu, H. Wang, H. Zhang, and L. Wang, ``Convergence in high probability of distributed stochastic gradient descent algorithms,'' IEEE Transactions on Automatic Control, 2023.
- [9] J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra, ``Why are adaptive methods good for attention models?'' Advances in Neural Information Processing Systems, vol. 33, pp. 15 383–15 393, 2020.
- [10] H. Yang, P. Qiu, and J. Liu, ``Taming fat-tailed (“heavier-tailed” with potentially infinite variance) noise in federated learning,'' Advances in Neural Information Processing Systems, vol. 35, pp. 17 017–17 029, 2022.
- [11] U. Simsekli, L. Sagun, and M. Gurbuzbalaban, ``A tail-index analysis of stochastic gradient noise in deep neural networks,'' in International Conference on Machine Learning. PMLR, 2019, pp. 5827–5837.
- [12] M. Gurbuzbalaban, U. Simsekli, and L. Zhu, ``The heavy-tail phenomenon in SGD,'' in International Conference on Machine Learning. PMLR, 2021, pp. 3964–3975.
- [13] H. Wang, M. Gurbuzbalaban, L. Zhu, U. Simsekli, and M. A. Erdogdu, ``Convergence rates of stochastic gradient descent under infinite noise variance,'' Advances in Neural Information Processing Systems, vol. 34, pp. 18 866–18 877, 2021.
- [14] S. A. Alghunaim and A. H. Sayed, ``Distributed coupled multiagent stochastic optimization,'' IEEE Transactions on Automatic Control, vol. 65, no. 1, pp. 175–190, 2019.
- [15] S. Pu and A. Nedić, ``Distributed stochastic gradient tracking methods,'' Mathematical Programming, vol. 187, pp. 409–457, 2021.
- [16] D. Yuan, Y. Hong, D. W. Ho, and G. Jiang, ``Optimal distributed stochastic mirror descent for strongly convex optimization,'' Automatica, vol. 90, pp. 196–203, 2018.
- [17] E. Gorbunov, M. Danilova, and A. Gasnikov, ``Stochastic optimization with heavy-tailed noise via accelerated gradient clipping,'' Advances in Neural Information Processing Systems, vol. 33, pp. 15 042–15 053, 2020.
- [18] Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, ``High probability convergence of stochastic gradient methods,'' in International Conference on Machine Learning. PMLR, 2023, pp. 21 884–21 914.
- [19] A. Cutkosky and H. Mehta, ``High-probability bounds for non-convex stochastic optimization with heavy tails,'' Advances in Neural Information Processing Systems, vol. 34, pp. 4883–4895, 2021.
- [20] M. Liu, Z. Zhuang, Y. Lei, and C. Liao, ``A communication-efficient distributed gradient clipping algorithm for training deep neural networks,'' Advances in Neural Information Processing Systems, vol. 35, pp. 26 204–26 217, 2022.
- [21] S. Yu, D. Jakovetic, and S. Kar, ``Smoothed gradient clipping and error feedback for distributed optimization under heavy-tailed noise,'' arXiv preprint arXiv:2310.16920, 2023.
- [22] E. Gorbunov, A. Sadiev, M. Danilova, S. Horváth, G. Gidel, P. Dvurechensky, A. Gasnikov, and P. Richtárik, ``High-probability convergence for composite and distributed stochastic minimization and variational inequalities with heavy-tailed noise,'' arXiv preprint arXiv:2310.01860, 2023.
- [23] S. S. Ram, A. Nedic, and V. V. Veeravalli, ``Distributed subgradient projection algorithm for convex optimization,'' in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 3653–3656.
- [24] M. O. Sayin, N. D. Vanli, S. S. Kozat, and T. Başar, ``Stochastic subgradient algorithms for strongly convex optimization over distributed networks,'' IEEE Transactions on Network Science and Engineering, vol. 4, no. 4, pp. 248–260, 2017.
- [25] A. Sadiev, M. Danilova, E. Gorbunov, S. Horváth, G. Gidel, P. Dvurechensky, A. Gasnikov, and P. Richtárik, ``High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance,'' arXiv preprint arXiv:2302.00999, 2023.
- [26] A. Y. Ng, ``Feature selection, L1 vs. L2 regularization, and rotational invariance,'' in Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004, p. 78.
Chao Sun received his B.Eng degree from University of Science and Technology of China in 2013. Then he obtained the Ph.D. degree from School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore in 2018. After graduation, he worked as a research fellow at Nanyang Technological University till May 2022. Currently, he is an associate professor at Beihang University, China.