Communication Compression for Decentralized Learning with Operator Splitting Methods
Abstract
In decentralized learning, operator splitting methods that use a primal-dual formulation, such as the Edge-Consensus Learning (ECL), have been shown to be robust to heterogeneous data and have attracted significant attention in recent years. However, in the ECL, a node needs to exchange dual variables with its neighbors, and these exchanges incur significant communication costs. For Gossip-based algorithms, many compression methods have been proposed, but these algorithms do not perform well when the data distribution held by each node is statistically heterogeneous. In this work, we propose a novel framework of compression methods for the ECL, called the Communication Compressed ECL (C-ECL). Specifically, we reformulate the update formulas of the ECL and propose to compress the update values of the dual variables. We demonstrate experimentally that the C-ECL can achieve nearly equivalent performance with fewer parameter exchanges than the ECL. Moreover, we demonstrate that the C-ECL is more robust to heterogeneous data than Gossip-based algorithms.
1 Introduction
In recent years, neural networks have shown promising results in various fields, including image processing [4, 7] and natural language processing [31, 6], and have thus attracted considerable attention. To train a neural network, we generally need to collect a large amount of training data. Owing to the use of crowdsourcing services, it is now easy to collect large numbers of annotated images and texts. However, because of privacy concerns, it is difficult to collect large amounts of personal data, such as medical data including gene expression and medical images, on a single server. In such cases, decentralized learning, which aims to train a model without sharing the training data among servers, is a powerful tool. Decentralized learning was originally studied to train large-scale models in parallel, and because it allows us to train models without aggregating the training data, it has recently attracted significant attention from the perspective of privacy preservation.
One of the most widely used algorithms for decentralized learning is the Gossip-based algorithm [2, 18]. In the Gossip-based algorithm, each node (i.e., server) updates the model parameters using its own gradient, exchanges the model parameters with its neighbors, and then takes the average value to reach a consensus. The Gossip-based algorithm is a simple yet effective approach. When the distribution of the data subset held by each node is statistically homogeneous, the Gossip-based algorithm can perform as well as the vanilla SGD, which trains the model on a single node using all of the training data. However, when the data distribution of each node is statistically heterogeneous (e.g., each node has images of only some classes and not others), client drift [11] occurs, and Gossip-based algorithms do not perform well [30, 32].
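To make this concrete, one round of a Gossip-based update at a single node can be sketched as follows; this is a minimal sketch with uniform mixing weights, not the exact D-PSGD implementation, and the data structures are illustrative assumptions.

```python
import torch

def gossip_round(node_id, params, neighbors, local_grad, lr=0.1):
    """One round of a simple Gossip-based update (sketch).

    params: dict mapping node id -> parameter tensor. For illustration all
    parameters are visible here; in practice each node only receives its
    neighbors' parameters over the network.
    """
    # 1) local SGD step with the node's own stochastic gradient
    updated = params[node_id] - lr * local_grad
    # 2) exchange parameters with the neighbors and average (uniform weights)
    received = [params[j] for j in neighbors[node_id]]
    return (updated + sum(received)) / (1 + len(received))
```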
Recently, an operator splitting method using a primal-dual formulation, called the Edge-Consensus Learning (ECL) [22], has been proposed. Primal-dual algorithms, including the Alternating Direction Method of Multipliers (ADMM) [3] and the Primal-Dual Method of Multipliers (PDMM) [38], can be applied to decentralized learning by representing the model consensus as linear constraints. It has recently been shown that the ADMM and the PDMM for decentralized learning can be derived by solving the dual problem with operator splitting methods [27] (e.g., the Douglas-Rachford splitting [8] and the Peaceman-Rachford splitting [24]). Niwa et al. [22, 23] applied these operator splitting methods to neural networks and named the resulting method the Edge-Consensus Learning (ECL). They showed that the ECL can be interpreted as a variance reduction method and is robust to heterogeneous data.
However, in both the Gossip-based algorithm and the ECL, each node needs to exchange model parameters and/or dual variables with its neighbors, and such exchanges incur significant communication costs. For the Gossip-based algorithm, many studies have recently proposed methods for compressing the exchanged parameters [29, 13, 21, 33] and have shown that these compression methods can train a model with fewer parameter exchanges than the uncompressed Gossip-based algorithm. However, these compression methods are built on Gossip-based algorithms and do not perform well when the data distribution of each node is statistically heterogeneous.
In this work, we propose a novel framework of compression methods for the ECL, which we refer to as the Communication Compressed ECL (C-ECL). Specifically, we reformulate the update formulas of the ECL and propose to compress the update values of the dual variables. Theoretically, we analyze how the proposed compression affects the convergence rate of the ECL and show that the C-ECL, like the ECL, converges linearly to the optimal solution. Experimentally, we show that the C-ECL achieves almost the same accuracy as the ECL with fewer parameter exchanges. Furthermore, the experimental results show that the C-ECL is more robust to heterogeneous data than the Gossip-based algorithm, and that, when the data distribution of each node is statistically heterogeneous, the C-ECL can outperform even the uncompressed Gossip-based algorithm in terms of both accuracy and communication costs.
Notation: In this work, $\|\cdot\|$ denotes the L2 norm, $\mathbf{0}$ denotes the vector of all zeros, and $\mathbf{I}$ denotes the identity matrix.
2 Related Work
In this section, we briefly introduce the problem setting of decentralized learning, and then introduce the Gossip-based algorithm and the ECL.
2.1 Decentralized Learning
Let $G = (V, E)$ be an undirected connected graph that represents the network topology, where $V$ denotes the set of nodes and $E$ denotes the set of edges. For simplicity, we write the set of nodes as $V = \{1, 2, \ldots, n\}$. We denote the set of neighbors of the node $i$ as $\mathcal{N}_i$. The goal of decentralized learning is formulated as follows:
(1) $\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w), \quad f_i(w) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}\left[ F_i(w; \xi_i) \right]$
where $w$ is the model parameter, $f$ is the loss function, $\mathcal{D}_i$ represents the data held by the node $i$, $\xi_i$ is a data sample from $\mathcal{D}_i$, and $f_i$ is the loss function of the node $i$.
2.2 Gossip-based Algorithm
A widely used approach for decentralized learning is the Gossip-based algorithm (e.g., D-PSGD [18]). In the Gossip-based algorithm, each node computes its own stochastic gradient, exchanges model parameters with its neighbors, and then takes their average. Although the Gossip-based algorithm is a simple and effective approach, it has two main issues: high communication costs and sensitivity to the heterogeneity of the data distribution.
In the Gossip-based algorithm, each node needs to receive the model parameters from its neighbors. Because the number of parameters in a neural network is large, these exchanges of model parameters incur huge communication costs. To reduce the communication costs of the Gossip-based algorithm, many methods that compress the exchanged parameters by sparsification, quantization, or low-rank approximation have been proposed [29, 13, 21, 12, 33]. It has been shown that these compression methods can achieve almost the same accuracy as the uncompressed Gossip-based algorithm with fewer parameter exchanges.
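For instance, a rank-1 low-rank compressor in the spirit of the PowerGossip [33] can be sketched as follows; the single power-iteration step and the warm-started query vector are illustrative simplifications rather than the exact implementation.

```python
import torch

def rank1_compress(M, q):
    """Approximate the matrix M by the outer product p q_new^T (sketch).

    Only the vectors p and q_new (d1 + d2 numbers) need to be communicated
    instead of the d1 x d2 entries of M. The query vector q is carried over
    from the previous round as a warm start for the power iteration.
    """
    p = M @ q
    p = p / (p.norm() + 1e-12)   # normalize the left factor
    q_new = M.t() @ p            # best right factor for this p
    return p, q_new

def rank1_decompress(p, q_new):
    return torch.outer(p, q_new)
```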
The second issue is that the Gossip-based algorithm is sensitive to the heterogeneity of the data distribution. When the data distribution of each node is statistically heterogeneous, the minimizer of the global loss function and the minimizers of the local loss functions are far from each other, and the Gossip-based algorithms do not perform well [22, 32]. To address heterogeneous data in the Gossip-based algorithm, Tang et al. [30] and Xin et al. [37] applied variance reduction methods [5, 10] to the Gossip-based algorithm, Lorenzo and Scutari [20] proposed gradient tracking algorithms, and Vogels et al. [32] recently proposed RelaySum.
2.3 Edge-Consensus Learning
In this section, we briefly introduce the Edge-Consensus Learning (ECL) [22]. By reformulating Eq. (1), the primal problem can be defined as follows:
(2) $\min_{w_1, \ldots, w_n} \sum_{i=1}^{n} f_i(w_i) \quad \text{s.t.} \quad A_{i|j} w_i + A_{j|i} w_j = \mathbf{0} \;\; (\forall (i, j) \in E)$
where $A_{i|j} = \mathbf{I}$ when $(i, j) \in E$ and $i < j$, and $A_{i|j} = -\mathbf{I}$ when $(i, j) \in E$ and $i > j$. Whereas the Gossip-based algorithm explicitly computes the average at each round, the primal problem of Eq. (2) represents the consensus through the linear constraints. By solving the dual problem of Eq. (2) with the Douglas-Rachford splitting [8], the following update formulas can be derived [27]:
(3) $w_i \leftarrow \operatorname*{argmin}_{w_i} \big\{ f_i(w_i) + \sum_{j \in \mathcal{N}_i} \big( \langle z_{i|j}, A_{i|j} w_i \rangle + \frac{\alpha}{2} \| A_{i|j} w_i \|^2 \big) \big\}$
(4) $y_{i|j} \leftarrow z_{i|j} + 2 \alpha A_{i|j} w_i \quad (\forall j \in \mathcal{N}_i)$
(5) $z_{i|j} \leftarrow (1 - \theta) z_{i|j} + \theta \, y_{j|i} \quad (\forall j \in \mathcal{N}_i)$
where $\alpha > 0$ and $\theta \in (0, 1]$ are hyperparameters, and $y_{i|j}$ and $z_{i|j}$ are dual variables. We present the detailed derivation of these update formulas in Sec. B. When $\theta = 1$, the Douglas-Rachford splitting is specifically called the Peaceman-Rachford splitting [24]. When $f_i$ is non-convex (e.g., the loss function of a neural network), Eq. (3) cannot generally be solved exactly. Niwa et al. [22] therefore proposed the Edge-Consensus Learning (ECL), which approximately solves Eq. (3) as follows:
(6) |
where the step size corresponds to the learning rate. Niwa et al. [23] showed that the ECL can be interpreted as a stochastic variance reduction method and demonstrated that the ECL is more robust to heterogeneous data than Gossip-based algorithms. However, as shown in Eq. (5), the node $i$ must receive the dual variable $y_{j|i}$ from its neighbor $j$ in the ECL. Therefore, in the ECL as well as in the Gossip-based algorithm, large communication costs arise during training.
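To make one round of the ECL concrete, the following per-node sketch approximates the minimization of Eq. (3) by a single stochastic gradient step, in the spirit of Eq. (6). The data structures, the step sizes, and the single-step approximation are illustrative assumptions, not the implementation of [22].

```python
def ecl_round(i, w, z, y, grad_f, neighbors, A_sign, alpha=0.1, theta=1.0, lr=0.01):
    """One ECL round at node i (sketch).

    w[i]: model parameter of node i; z[(i, j)], y[(i, j)]: dual variables;
    A_sign[(i, j)] in {+1.0, -1.0} plays the role of A_{i|j} = +/- I;
    grad_f(w) returns a stochastic gradient of the local loss f_i.
    """
    # Eq. (3), solved approximately by a single gradient step
    g = grad_f(w[i])
    for j in neighbors[i]:
        g = g + A_sign[(i, j)] * z[(i, j)] + alpha * w[i]  # uses A_{i|j}^T A_{i|j} = I
    w[i] = w[i] - lr * g

    # Eq. (4): local update of the dual variable y_{i|j}
    for j in neighbors[i]:
        y[(i, j)] = z[(i, j)] + 2 * alpha * A_sign[(i, j)] * w[i]

    # Eq. (5): after receiving y_{j|i} from each neighbor j
    for j in neighbors[i]:
        z[(i, j)] = (1 - theta) * z[(i, j)] + theta * y[(j, i)]
    return w, z, y
```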
In addition to the ECL, other methods using primal-dual formulations have been proposed [17], and compression methods for these primal-dual algorithms have recently been studied [14, 19]. However, compression methods for the ECL have not been studied. In this work, we propose a compression method for the ECL, called the C-ECL, that can train a model with fewer parameter exchanges and is robust to heterogeneous data.
3 Proposed Method
3.1 Compression Operator
Before proposing the C-ECL, we first introduce the compression operator used in this work.
Assumption 1 (Compression Operator).
For some $\omega$, we assume that the compression operator $Q_{\omega}$ satisfies the following conditions:
(7)
(8)
(9) |
where $\omega$ represents the parameter of the compression operator. In the following, we abbreviate $Q_{\omega}$ as $Q$ and regard $Q$ as an operator that contains the randomness.
The assumption of Eq. (7) is commonly used for the compression methods for Gossip-based algorithms [21, 33, 13]. In addition, we assume that the compression operator satisfies Eqs. (8)-(9); both the low-rank approximation [33] and the following sparsification, which are used in compression methods for the Gossip-based algorithm, satisfy Eqs. (8)-(9).
Example 1.
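As an illustration of such a sparsification operator (the exact definition and scaling of the operator in Eq. (10) may differ), a random-mask compressor that can reuse a shared seed on both endpoints can be sketched as follows.

```python
import torch

def comp_k(x, k, seed):
    """Random-mask sparsification (sketch): keep k randomly chosen entries of x.

    Using the same seed on both nodes reproduces the same mask, so the mask
    itself does not have to be transmitted. Whether the kept entries should be
    rescaled (e.g., by x.numel() / k) depends on the exact conditions required
    of the operator.
    """
    g = torch.Generator().manual_seed(seed)
    flat = x.reshape(-1)
    idx = torch.randperm(flat.numel(), generator=g)[:k]  # shared random support
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(x.shape)
```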
3.2 Communication Compressed Edge-Consensus Learning
In this section, we propose the Communication Compressed ECL (C-ECL), a method for compressing the dual variables exchanged in the ECL using the compression operator introduced in the previous section.
In the ECL, to update $z_{i|j}$ in Eq. (5), the node $i$ needs to receive $y_{j|i}$ from the node $j$. Because the number of elements in $y_{j|i}$ is the same as that of the model parameter $w_i$, this exchange incurs significant communication costs. A straightforward approach to reducing this communication cost is to compress $y_{j|i}$ in Eq. (5) as follows:
(11) $z_{i|j} \leftarrow (1 - \theta) z_{i|j} + \theta \, Q(y_{j|i})$
However, we found experimentally that compressing $y_{j|i}$ directly does not work well. For the compression methods of the Gossip-based algorithm, Lu and De Sa [21] showed that the model parameters themselves are not robust to compression. This is because the optimal model parameter is generally not zero, and thus the error caused by the compression does not approach zero even when the model parameters are near the optimal solution. Therefore, successful compression methods for Gossip-based algorithms compress the gradient or the model difference, both of which approach zero when the model parameters are near the optimal solution [13, 21, 33].
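This intuition can be checked numerically: with the same operator, compressing the parameters leaves an error on the order of the parameter norm, whereas compressing a quantity that vanishes at the optimum leaves an error that vanishes as well. The toy sketch below uses top-k sparsification; the operator and the sizes are assumptions for illustration only.

```python
import torch

def topk(x, k):
    """Keep the k entries of x with the largest magnitude (sketch)."""
    out = torch.zeros_like(x)
    idx = x.abs().topk(k).indices
    out[idx] = x[idx]
    return out

torch.manual_seed(0)
w_opt = torch.randn(10_000)                # a stand-in for a (non-zero) optimum
w = w_opt + 1e-3 * torch.randn(10_000)     # parameters already close to the optimum
k = 100

err_param = (topk(w, k) - w).norm()                    # stays on the order of ||w||
err_diff = (topk(w - w_opt, k) - (w - w_opt)).norm()   # on the order of ||w - w_opt|| -> 0
print(float(err_param), float(err_diff))
```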
Inspired by these compression methods for the Gossip-based algorithms, we reformulate Eq. (5) into Eq. (12) so that we can compress a quantity that approaches zero when the model parameters are near the optimal solution.
(12) $z_{i|j} \leftarrow z_{i|j} + \theta \, (y_{j|i} - z_{i|j})$
In the Douglas-Rachford splitting, the dual variable approaches the fixed point when the model parameters approach the optimal solution (see Sec. A for the details of the Douglas-Rachford splitting and the definition of the fixed point). Then, from Eq. (12), the update value $y_{j|i} - z_{i|j}$ approaches zero when the model parameters approach the optimal solution. Therefore, instead of compressing $y_{j|i}$ as in Eq. (11), we propose compressing $y_{j|i} - z_{i|j}$ as follows:
(13) $z_{i|j} \leftarrow z_{i|j} + \theta \, Q(y_{j|i} - z_{i|j})$
where we use Assumption 1 in the last equation. When the sparsification operator is used as the compression operator, $y_{j|i}$ and $z_{i|j}$ must be compressed using the same sparse vector in Eq. (10) in order to use Assumption 1. In Alg. 1, we provide the pseudo-code of the C-ECL. For simplicity, the node $i$ and the node $j$ exchange the random sparse vectors at Lines 5-6 in Alg. 1. However, by sharing the same seed value used to generate these sparse vectors before starting the training, the node $i$ and the node $j$ can obtain the same sparse vectors without any exchange. Moreover, when the sparsification operator is used, the node $i$ can recover the sparse vector from the received value because it is stored in a sparse matrix format (e.g., the COO format). Thus, these exchanges of the sparse vectors can be omitted in practice. Therefore, in the C-ECL, each node only needs to exchange the compressed value of $y_{j|i} - z_{i|j}$, and the C-ECL can train a model with fewer parameter exchanges than the ECL.
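One way to realize the exchange of Eq. (13) in code is for the sending node to keep a local copy of the neighbor-side dual variable it compresses against, so that both endpoints form the same difference and apply the same update. This bookkeeping is an assumption of the sketch below; Alg. 1 may organize the exchange differently.

```python
def c_ecl_send(theta, Q, y_ji, z_ij_copy):
    """Sender side (node j), sketch: compress the update value of z_{i|j}.

    y_ji      : the dual variable y_{j|i} held by node j
    z_ij_copy : node j's local copy of node i's dual variable z_{i|j}
    Q         : compression operator (e.g., comp_k with a shared seed)
    """
    msg = Q(y_ji - z_ij_copy)          # only this (sparse) message is sent
    z_ij_copy = z_ij_copy + theta * msg
    return msg, z_ij_copy

def c_ecl_receive(theta, z_ij, msg):
    """Receiver side (node i): apply Eq. (13) with the received compressed update."""
    return z_ij + theta * msg
```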
4 Convergence Analysis
In this section, we analyze how the compression in the C-ECL affects the convergence rate of the ECL. Our convergence analysis is based on the analysis of the Douglas-Rachford splitting [9], and the proofs are presented in Sec. C.
4.1 Assumptions
In this section, we introduce additional notations and assumptions used in our convergence analysis. We define , , where is the set of neighbors of the node . Let be the -th smallest index of the node in , we define , , and as follows:
(14) |
For simplicity, we drop the superscripts of the number of round . Let be the optimal solution of Eq. (2), we define in the same manner as the definition of in Eq. (14). We define the loss function as . Next, we introduce the assumptions used in the convergence analysis.
Assumption 2.
We assume that is proper, closed and convex.
Assumption 3.
We assume that is -smooth and -strongly convex with and .
Assumption 4.
We assume that the graph has no isolated nodes (i.e., ).
Assumptions 2 and 3 are the standard assumptions used in the convergence analysis of operator splitting methods [9, 26]. Assumption 3 is weaker than assuming the smoothness and strong convexity of $f_i$ for all $i$, which is commonly used in decentralized learning. Assumption 4 holds in general because decentralized learning assumes that the graph is connected. In addition, we define the following quantity:
Suppose that Assumptions 2, 3, and 4 hold and holds, then holds because and .
4.2 Convergence Rates
Theorem 1.
Corollary 1.
Because implies that in the C-ECL, Corollary 1 shows the convergence rate of the ECL under Assumptions 1, 2, 3, and 4, which is almost the same rate as that shown in the previous work [9]. Comparing the domain of for the ECL and the C-ECL to converge, as decreases, the domain in Eq. (15) becomes smaller. Subsequently, in order for the domain in Eq. (15) to be non-empty, must be greater than or equal to . Next, comparing the convergence rate of the ECL and the C-ECL, the compression in the C-ECL slows down the convergence rate of the ECL by the term . Moreover, similar to the convergence analysis of the Douglas-Rachford splitting [9], Theorem 1 and Corollary 1 imply that the optimal parameter of can be determined as follows:
Corollary 2.
5 Experiments
In this section, we demonstrate that the C-ECL can achieve almost the same performance with fewer parameter exchanges than the ECL. Furthermore, we show that the C-ECL is more robust to heterogeneous data than the Gossip-based algorithm.
5.1 Experimental Setting
Dataset and Model: We evaluate the C-ECL on FashionMNIST [35] and CIFAR10 [15], which are 10-class image-classification datasets. For both datasets, we use a 5-layer convolutional neural network [16] with group normalization [34]. Following the previous work [22], we distribute the data to the nodes in two settings: a homogeneous setting and a heterogeneous setting. In the homogeneous setting, the data are distributed such that each node has data of all classes and approximately the same number of samples per class. In the heterogeneous setting, the data are distributed such that each node has data of only randomly selected classes. In both settings, the data are distributed such that each node holds the same number of samples.
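One possible way to generate such splits is sketched below; the number of classes per node and the balancing scheme are illustrative assumptions, since the text above only fixes that each node holds randomly selected classes and the same number of samples.

```python
import random
from collections import defaultdict

def partition_heterogeneous(labels, n_nodes=8, classes_per_node=2, seed=0):
    """Split sample indices so that each node only sees a few classes (sketch).

    labels: sequence of integer class labels of the training set.
    Returns a list of index lists, one per node, all of equal size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    node_indices = []
    classes = list(by_class.keys())
    for _ in range(n_nodes):
        chosen = rng.sample(classes, classes_per_node)
        node_indices.append([i for c in chosen for i in by_class[c]])

    # truncate so that every node holds the same number of samples
    m = min(len(ix) for ix in node_indices)
    return [rng.sample(ix, m) for ix in node_indices]
```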
Network: In Sec. 5.2, we evaluate all comparison methods on a ring network consisting of eight nodes. In Sec. 5.3, we additionally evaluate all comparison methods on four network topologies: a chain, a ring, a multiplex ring, and a fully connected graph, each consisting of eight nodes. In Sec. D, we show a visualization of the network topologies. Each node exchanges parameters with its neighbors once every five local updates. We implement all methods in PyTorch using gloo (https://pytorch.org/docs/stable/distributed.html) as the backend and run all comparison methods on eight GPUs (NVIDIA RTX 3090).
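A minimal sketch of such a setup with the gloo backend is shown below; the rendezvous address and port are placeholders, and the actual experiment code may be organized differently.

```python
import torch.distributed as dist

def ring_neighbors(rank, world_size=8):
    """Neighbors of a node in a ring of `world_size` nodes."""
    return [(rank - 1) % world_size, (rank + 1) % world_size]

def init_node(rank, world_size=8):
    """Join the process group over TCP with the gloo backend (sketch)."""
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",  # placeholder rendezvous address
        rank=rank,
        world_size=world_size,
    )
    return ring_neighbors(rank, world_size)
```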
Comparison Methods: (1) D-PSGD [18]: the uncompressed Gossip-based algorithm. (2) PowerGossip [33]: a Gossip-based algorithm that compresses the exchanged parameters using a low-rank approximation. We use the PowerGossip as the compression baseline for the Gossip-based algorithm because it has been shown to achieve almost the same performance as other existing compression methods without additional hyperparameter tuning. (3) ECL [22, 23]: the primal-dual algorithm described in Sec. 2.3. Because Niwa et al. [22] showed that the ECL converges faster when $\theta = 1$ than when $\theta = 1/2$, we set $\theta = 1$. (4) C-ECL: our proposed method described in Sec. 3. We use the sparsification operator as the compression operator and, following Corollary 2, set the hyperparameter $\alpha$ accordingly. We initialize the dual variables $y_{i|j}$ and $z_{i|j}$ to zeros. However, we found that compressing the update values of the dual variables from the beginning slows down convergence because the dual variables remain sparse in the early stage of training. Therefore, we apply no compression during the first epoch only.
In addition, for reference, we show the results of the Stochastic Gradient Descent (SGD), in which the model is trained on a single node containing all the training data. In our experiments, we set the learning rate, the number of epochs, and the batch size to the same values for all comparison methods. In Sec. D, we show the detailed hyperparameters used for all comparison methods.
5.2 Experimental Results
In this section, we evaluate the accuracy and the communication costs when setting the network topology to be a ring.
Table 1: Accuracy and communication costs (Send/Epoch) in the homogeneous setting.

| | FashionMNIST Accuracy | FashionMNIST Send/Epoch | CIFAR10 Accuracy | CIFAR10 Send/Epoch |
|---|---|---|---|---|
| SGD | | - | | - |
| D-PSGD | | KB () | | KB () |
| ECL | | KB () | | KB () |
| PowerGossip (1) | | KB () | | KB () |
| PowerGossip (10) | | KB () | | KB () |
| PowerGossip (20) | | KB () | | KB () |
| C-ECL () | | KB () | | KB () |
| C-ECL () | | KB () | | KB () |
| C-ECL () | | KB () | | KB () |
Table 2: Accuracy and communication costs (Send/Epoch) in the heterogeneous setting.

| | FashionMNIST Accuracy | FashionMNIST Send/Epoch | CIFAR10 Accuracy | CIFAR10 Send/Epoch |
|---|---|---|---|---|
| SGD | | - | | - |
| D-PSGD | | KB () | | KB () |
| ECL | | KB () | | KB () |
| PowerGossip (1) | | KB () | | KB () |
| PowerGossip (10) | | KB () | | KB () |
| PowerGossip (20) | | KB () | | KB () |
| C-ECL () | | KB () | | KB () |
| C-ECL () | | KB () | | KB () |
| C-ECL () | | KB () | | KB () |
Homogeneous Setting: First, we discuss the results in the homogeneous setting. Table 1 shows the accuracy and the communication costs in the homogeneous setting. The results show that the D-PSGD and the ECL achieve almost the same accuracy on both datasets. The C-ECL and the PowerGossip are comparable to each other and achieve almost the same accuracy as the ECL and the D-PSGD even with aggressive compression settings (i.e., a small compression parameter and few power iteration steps, respectively). Therefore, in the homogeneous setting, the C-ECL achieves comparable accuracy with far fewer parameter exchanges than the ECL and the D-PSGD.
Heterogeneous Setting: Next, we discuss the results in the heterogeneous setting. Table 2 shows the accuracy and the communication costs in the heterogeneous setting. For the D-PSGD, the accuracy in the heterogeneous setting decreases compared to that in the homogeneous setting. For the PowerGossip, even if the number of power iteration steps is increased, the accuracy does not approach that of the D-PSGD or the ECL. On the other hand, the accuracy of the ECL is almost the same in both the homogeneous and heterogeneous settings, indicating that the ECL is more robust to heterogeneous data than the D-PSGD. For the C-ECL, with the smallest compression parameter, the accuracy in the heterogeneous setting decreases compared to that of the ECL. However, when we increase the compression parameter, the C-ECL becomes competitive with the ECL and outperforms the D-PSGD and the PowerGossip on both FashionMNIST and CIFAR10.
In summary, in the homogeneous setting, the C-ECL and the PowerGossip achieve almost the same accuracy as the ECL and the D-PSGD with far fewer parameter exchanges. In the heterogeneous setting, the C-ECL achieves almost the same accuracy as the ECL with far fewer parameter exchanges and outperforms the PowerGossip. Furthermore, the results show that the C-ECL can outperform the D-PSGD, the uncompressed Gossip-based algorithm, in terms of both accuracy and communication costs.
5.3 Network Topology
In this section, we show the accuracy and the communication costs when the network topology is varied. Table 3 and Fig. 1 show the communication costs and the accuracy on FashionMNIST when the network topology is varied as a chain, ring, multiplex ring, or fully connected graph.
In the homogeneous setting, Fig. 1 shows that the accuracy of all comparison methods is almost the same and reaches that of the SGD on all network topologies. In the heterogeneous setting, Fig. 1 shows that the accuracy of the D-PSGD and the PowerGossip decreases compared to the homogeneous setting. On the other hand, the accuracy of the ECL is almost the same as in the homogeneous setting on all network topologies. Moreover, on all network topologies, the C-ECL achieves almost the same accuracy as the ECL with fewer parameter exchanges and consistently outperforms the PowerGossip. The results also show that, in the heterogeneous setting, the C-ECL outperforms the D-PSGD, the uncompressed Gossip-based algorithm, on all network topologies in terms of both accuracy and communication costs.
Figure 1: Accuracy on FashionMNIST for each network topology in the homogeneous and heterogeneous settings.
Table 3: Communication costs (Send/Epoch) on FashionMNIST for each network topology.

| | Chain | Ring | Multiplex Ring | Fully Connected Graph |
|---|---|---|---|---|
| D-PSGD | KB | KB | KB | KB |
| ECL | KB | KB | KB | KB |
| PowerGossip (10) | KB | KB | KB | KB |
| C-ECL () | KB | KB | KB | KB |
6 Conclusion
In this work, we propose the Communication Compressed ECL (C-ECL), a novel framework of compression methods for the ECL. Specifically, we reformulate the update formulas of the ECL and propose compressing the update values of the dual variables. Theoretically, we analyze the convergence rate of the C-ECL and show that the C-ECL, like the ECL, converges linearly to the optimal solution. Experimentally, we demonstrate that, even when the data distribution of each node is statistically heterogeneous, the C-ECL achieves almost the same accuracy as the ECL with fewer parameter exchanges. Moreover, we show that, when the data distribution of each node is statistically heterogeneous, the C-ECL outperforms the uncompressed Gossip-based algorithm in terms of both accuracy and communication costs.
Acknowledgments
M.Y. was supported by MEXT KAKENHI Grant Number 20H04243.
References
- Bauschke and Combettes, [2017] Bauschke, H. H. and Combettes, P. L. (2017). Convex analysis and monotone operator theory in Hilbert spaces. Springer, 2nd edition.
- Boyd et al., [2006] Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. (2006). Randomized gossip algorithms. In IEEE Transactions on Information Theory.
- Boyd et al., [2011] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
- Chen et al., [2020] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning.
- Defazio et al., [2014] Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems.
- Devlin et al., [2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Dosovitskiy et al., [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
- Douglas and Rachford, [1956] Douglas, J. and Rachford, H. H. (1956). On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American mathematical Society.
- Giselsson and Boyd, [2017] Giselsson, P. and Boyd, S. P. (2017). Linear convergence and metric selection for douglas-rachford splitting and ADMM. IEEE Transactions on Automatic Control.
- Johnson and Zhang, [2013] Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems.
- Karimireddy et al., [2020] Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. (2020). SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning.
- Koloskova et al., [2020] Koloskova, A., Lin, T., Stich, S. U., and Jaggi, M. (2020). Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations.
- Koloskova et al., [2019] Koloskova, A., Stich, S., and Jaggi, M. (2019). Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning.
- Kovalev et al., [2021] Kovalev, D., Koloskova, A., Jaggi, M., Richtarik, P., and Stich, S. (2021). A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In International Conference on Artificial Intelligence and Statistics.
- Krizhevsky, [2009] Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
- LeCun et al., [1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.
- Li et al., [2019] Li, Z., Shi, W., and Yan, M. (2019). A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. In IEEE Transactions on Signal Processing.
- Lian et al., [2017] Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems.
- Liu et al., [2021] Liu, X., Li, Y., Wang, R., Tang, J., and Yan, M. (2021). Linear convergent decentralized optimization with compression. In International Conference on Learning Representations.
- Lorenzo and Scutari, [2016] Lorenzo, P. D. and Scutari, G. (2016). NEXT: in-network nonconvex optimization. In IEEE Transactions on Signal and Information Processing over Networks.
- Lu and De Sa, [2020] Lu, Y. and De Sa, C. (2020). Moniqua: Modulo quantized communication in decentralized SGD. In International Conference on Machine Learning.
- Niwa et al., [2020] Niwa, K., Harada, N., Zhang, G., and Kleijn, W. B. (2020). Edge-consensus learning: Deep learning on p2p networks with nonhomogeneous data. In International Conference on Knowledge Discovery and Data Mining.
- Niwa et al., [2021] Niwa, K., Zhang, G., Kleijn, W. B., Harada, N., Sawada, H., and Fujino, A. (2021). Asynchronous decentralized optimization with implicit stochastic variance reduction. In International Conference on Machine Learning.
- Peaceman and Rachford, [1955] Peaceman, D. W. and Rachford, H. H. (1955). The numerical solution of parabolic and elliptic differential equations. Journal of the Society for Industrial and Applied Mathematics.
- Rockafellar, [2015] Rockafellar, R. T. (2015). Convex analysis. Princeton University Press.
- Ryu and Boyd, [2015] Ryu, E. K. and Boyd, S. P. (2015). A primer on monotone operator methods. In Applied and Computational Mathematics.
- Sherson et al., [2019] Sherson, T. W., Heusdens, R., and Kleijn, W. B. (2019). Derivation and analysis of the primal-dual method of multipliers based on monotone operator theory. In IEEE Transactions on Signal and Information Processing over Networks.
- Stich et al., [2018] Stich, S. U., Cordonnier, J.-B., and Jaggi, M. (2018). Sparsified sgd with memory. In Advances in Neural Information Processing Systems.
- [29] Tang, H., Gan, S., Zhang, C., Zhang, T., and Liu, J. (2018a). Communication compression for decentralized training. In Advances in Neural Information Processing Systems.
- [30] Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. (2018b). D$^2$: Decentralized training over decentralized data. In International Conference on Machine Learning.
- Vaswani et al., [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
- Vogels et al., [2021] Vogels, T., He, L., Koloskova, A., Karimireddy, S. P., Lin, T., Stich, S. U., and Jaggi, M. (2021). Relaysum for decentralized deep learning on heterogeneous data. In Advances in Neural Information Processing Systems.
- Vogels et al., [2020] Vogels, T., Karimireddy, S. P., and Jaggi, M. (2020). Powergossip: Practical low-rank communication compression in decentralized deep learning. In Advances in Neural Information Processing Systems.
- Wu and He, [2018] Wu, Y. and He, K. (2018). Group normalization. In European Conference on Computer Vision.
- Xiao et al., [2017] Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. In arXiv.
- Xiao et al., [2007] Xiao, L., Boyd, S. P., and Kim, S. (2007). Distributed average consensus with least-mean-square deviation. In Journal of Parallel and Distributed Computing.
- Xin et al., [2020] Xin, R., Khan, U. A., and Kar, S. (2020). Variance-reduced decentralized stochastic optimization with accelerated convergence. In IEEE Transactions on Signal and Information Processing over Networks.
- Zhang and Heusdens, [2018] Zhang, G. and Heusdens, R. (2018). Distributed optimization using the primal-dual method of multipliers. In IEEE Transactions on Signal and Information Processing over Networks.
Appendix A Preliminary
In this section, we introduce the definitions used in the following sections and briefly review the Douglas-Rachford splitting (see [1, 26] for more details).
A.1 Definition
Definition 1 (Smooth Function).
Let $f$ be a closed, proper, and convex function. $f$ is $\beta$-smooth if $f$ satisfies the following for all $x, y$: $\|\nabla f(x) - \nabla f(y)\| \leq \beta \|x - y\|$.
Definition 2 (Strongly Convex Function).
Let $f$ be a closed, proper, and convex function. $f$ is $\sigma$-strongly convex if $f$ satisfies the following for all $x, y$: $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2} \|y - x\|^2$.
Definition 3 (Conjugate Function).
Let $f$ be a function on $\mathbb{R}^d$. The conjugate function of $f$ is defined as follows: $f^*(y) = \sup_{x} \{ \langle x, y \rangle - f(x) \}$.
Definition 4 (Nonexpansive Operator).
Let $D$ be a non-empty subset of $\mathbb{R}^d$. An operator $T: D \to \mathbb{R}^d$ is nonexpansive if $T$ is $1$-Lipschitz continuous, i.e., $\|Tx - Ty\| \leq \|x - y\|$ for all $x, y \in D$.
Definition 5 (Contractive Operator).
Let $D$ be a non-empty subset of $\mathbb{R}^d$. An operator $T: D \to \mathbb{R}^d$ is $\delta$-contractive if $T$ is Lipschitz continuous with constant $\delta \in [0, 1)$, i.e., $\|Tx - Ty\| \leq \delta \|x - y\|$ for all $x, y \in D$.
Definition 6 (Proximal Operator).
Let $f$ be a closed, proper, and convex function. The proximal operator of $f$ is defined as follows: $\mathrm{prox}_{f}(x) = \operatorname*{argmin}_{y} \{ f(y) + \frac{1}{2} \|y - x\|^2 \}$.
Definition 7 (Resolvent).
Let $A$ be an operator. The resolvent of $A$ is defined as follows: $J_{A} = (\mathbf{I} + A)^{-1}$.
Definition 8 (Reflected Resolvent).
Let $A$ be an operator. The reflected resolvent of $A$ is defined as follows: $R_{A} = 2 J_{A} - \mathbf{I}$.
A.2 Douglas-Rachford Splitting
In this section, we briefly introduce the Douglas-Rachford splitting [8].
The Douglas-Rachford splitting can be applied to the following problem:
(18) $\min_{x} \; f(x) + g(x)$
where $f$ and $g$ are closed, proper, and convex functions. Let $x^{\star}$ be the optimal solution of the above problem; the optimality condition can then be written as $0 \in \partial f(x^{\star}) + \partial g(x^{\star})$.
From [1, Proposition 26.1], the optimality condition above is equivalent to the following:
where and . The Douglas-Rachford splitting then computes the fixed point as follows:
(19) |
where is the hyperparameter of the Douglas-Rachford splitting. Under certain assumptions, the operator is contractive [9], and the update formula above is guaranteed to converge to the fixed point. Then, after converging to the fixed point, the Douglas-Rachford splitting obtains the optimal solution as follows:
(20) |
Moreover, by the definition of the reflected resolvent, the Douglas-Rachford splitting of Eq. (19) can be rewritten as follows:
(21)
(22)
(23) |
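As a self-contained numerical illustration of the Douglas-Rachford iteration (here in its classic averaged form), the following sketch solves a toy l1-regularized quadratic problem with the proximal operators written out explicitly; the objective, step size, and iteration count are assumptions for illustration only.

```python
import torch

def prox_l1(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return torch.sign(v) * torch.clamp(v.abs() - t, min=0.0)

def prox_quad(v, b, gamma):
    """Proximal operator of gamma * 0.5 * ||x - b||^2."""
    return (v + gamma * b) / (1.0 + gamma)

def douglas_rachford(b, lam=0.1, gamma=1.0, iters=200):
    """Douglas-Rachford splitting for min_x 0.5 * ||x - b||^2 + lam * ||x||_1 (sketch)."""
    z = torch.zeros_like(b)
    for _ in range(iters):
        x = prox_l1(z, gamma * lam)          # resolvent of the first term
        u = prox_quad(2 * x - z, b, gamma)   # resolvent of the second term at the reflected point
        z = z + u - x                        # averaged (classic) Douglas-Rachford step
    return prox_l1(z, gamma * lam)           # recover the solution from the fixed point

b = torch.randn(5)
print(douglas_rachford(b))                   # approximately soft-threshold(b, 0.1)
```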
Appendix B Derivation of Update Formulas of ECL
In this section, we briefly describe the derivation of the update formulas of the ECL. (See [22, 23, 27] for more details.)
B.1 Derivation of Dual Problem
First, we derive the Lagrangian function. The Lagrangian function of Eq. (2) can be derived as follows:
where is the dual variable. Defining , the Lagrangian function can be rewritten as follows:
Note that the definition of implies that . Let be the number of nodes . Here, denotes the -th smallest index of the node in . We define , and as follows:
We define the function as follows:
The Lagrangian function can be rewritten as follows:
Then, the primal and dual problems can be defined as follows:
(24)
(25) |
where is the indicator function defined as follows:
B.2 Derivation of Update Formulas
Next, we derive the update formulas of the ECL. By applying the Douglas-Rachford splitting of Eqs. (21-23) to the dual problem of Eq. (25), we obtain the following update formulas:
(26)
(27)
(28) |
where and can be decomposed into and as follows:
Update formulas of Eqs. (26-27): By the definition of the resolvent , we obtain
(29) |
We define as follows:
(30) |
From the property of the convex conjugate function, we obtain
(31) |
Substituting Eqs. (29-30) into Eq. (31), we obtain
We then obtain the update formula of as follows:
Substituting Eq. (30) into Eq. (29), we obtain
Then, the update formula of Eq. (27) is written as follows:
Appendix C Convergence Analysis of C-ECL
From the derivation of the ECL shown in Sec. B, the update formulas of Eq. (3), Eq. (4), and Eq. (13) in the C-ECL can be rewritten as follows (by the definition of the reflected resolvent, the update formulas of Eq. (26) and Eq. (27) are equivalent to that of Eq. (35)):
(35)
(36) |
To simplify the notation, we define . Then, Eq. (35) is rewritten as follows:
(37) |
In the following, we analyze the convergence rate of the update formulas of Eq. (37) and Eq. (36).
Lemma 1.
Under Assumption 4, the maximum singular value and the minimum singular value of are and respectively.
Proof.
By the definition of , we have . This implies that is not only a block diagonal matrix but also a diagonal matrix. Then, the eigenvalues of are . This concludes the proof. ∎
Remark 1.
Under Assumption 4, the maximum singular value and the minimum singular value of are and respectively.
Proof.
We define as follows:
Note that when Assumptions 2, 3, and 4 are satisfied, and when , is satisfied.
Proof.
In the following, we use the notation as the expectation over the randomness in the round .
Lemma 5.
Proof.
Combining Eq. (36) and Eq. (37), the update formula of can be written as follows:
Let be the parameter used to compress . In the following, to use Eq. (8) and Eq. (9) in Assumption 1, we rewrite the update formula of as follows:
Because , we have . Under Assumption 1, because for any , we have
We then have
where we use Assumption 1 for (a) and (b). From Lemma 4, , , and are upper bounded as follows:
Therefore, we obtain
This concludes the proof. ∎
Lemma 6.
Proof.
When , the domain of Eq. (39) is non-empty and contains . Then, when and satisfies the following:
we have
When and satisfies the following:
we have
This concludes the proof. ∎
In the following, we define the operator as follows:
Proof.
We have
From [1, Example 23.3], the proximal operator of is the resolvent of , and the resolvent can be rewritten as follows:
From Assumption 3 and Remark 1, is -strongly convex. Then, is -smooth. Then, the proximal operator of is -Lipschitz continuous. Then, we have
where we use Lemma 1 for (a). This concludes the proof. ∎
Lemma 8.
Proof.
By using the optimality condition [25, Theorem 27.1] of , we have
From the property of the convex conjugate function of , we get
Then, by multiplying and using the property of the convex conjugate function of , we have
From the above equation, satisfies the optimality condition of Eq. (24). This concludes the proof. ∎
Lemma 9.
Proof.
Lemma 10.
Theorem 1.
Corollary 1.
Proof.
The statement follows from Theorem 1. ∎
Corollary 2.
Proof.
When , we have
Because , decreases when approaches . Therefore, the optimal convergence rate is achieved when . ∎
Corollary 3.
Proof.
The statement follows from Corollary 2. ∎
Appendix D Experimental Setting
D.1 Hyperparameter
In this section, we describe the detailed hyperparameters used in our experiments.
FashionMNIST: We set the learning rate to , the batch size to , and the number of epochs to . To avoid overfitting, we use RandomCrop of PyTorch for data augmentation. In the ECL, following the previous work [23], we set as follows:
(46) |
where is the number of local steps. In the C-ECL, when we use the sparsification operator as the compression operator, the number of local steps can be regarded as . Then, in the C-ECL, we set as follows:
(47) |
Note that is set to different values across the nodes by the definition in Eqs. (46)-(47). For the D-PSGD and the PowerGossip, we use the Metropolis-Hastings weights [36] as the weights of the edges.
CIFAR10: We set the learning rate to , the batch size to , and the number of epochs to . To avoid overfitting, we use RandomCrop and RandomHorizontalFlip of PyTorch for data augmentation. For the ECL and the C-ECL, we set in the same way as for FashionMNIST. For the D-PSGD and the PowerGossip, we use the Metropolis-Hastings weights as the weights of the edges.
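For reference, the Metropolis-Hastings weights [36] used for the D-PSGD and the PowerGossip can be computed as follows; the graph representation is an illustrative assumption.

```python
import numpy as np

def metropolis_hastings_weights(edges, n_nodes):
    """Metropolis-Hastings mixing matrix of an undirected graph (sketch).

    W[i, j] = 1 / (1 + max(deg_i, deg_j)) for each edge (i, j),
    W[i, i] = 1 - sum_j W[i, j], and W[i, j] = 0 otherwise.
    """
    deg = np.zeros(n_nodes, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    for i in range(n_nodes):
        W[i, i] = 1.0 - W[i].sum()
    return W

ring = [(i, (i + 1) % 8) for i in range(8)]   # the ring of eight nodes
print(metropolis_hastings_weights(ring, 8))
```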
D.2 Network Topology
Fig. 2 shows the visualization of the network topology used in our experiments.
Figure 2: Visualization of the network topologies used in our experiments: chain, ring, multiplex ring, and fully connected graph.