
Communication Compression for Decentralized Learning with Operator Splitting Methods

Yuki Takezawa
Kyoto University and RIKEN AIP
[email protected]

Kenta Niwa
NTT Communication Science Laboratories
[email protected]

Makoto Yamada
Kyoto University and RIKEN AIP
[email protected]
Abstract

In decentralized learning, operator splitting methods using a primal-dual formulation (e.g., the Edge-Consensus Learning (ECL)) have been shown to be robust to heterogeneous data and have attracted significant attention in recent years. However, in the ECL, a node needs to exchange dual variables with its neighbors, and these exchanges incur significant communication costs. For the Gossip-based algorithms, many compression methods have been proposed, but these Gossip-based algorithms do not perform well when the data distribution held by each node is statistically heterogeneous. In this work, we propose a novel framework of compression methods for the ECL, called the Communication Compressed ECL (C-ECL). Specifically, we reformulate the update formulas of the ECL and propose to compress the update values of the dual variables. We demonstrate experimentally that the C-ECL can achieve nearly equivalent performance with fewer parameter exchanges than the ECL. Moreover, we demonstrate that the C-ECL is more robust to heterogeneous data than the Gossip-based algorithms.

1 Introduction

In recent years, neural networks have shown promising results in various fields, including image processing [4, 7] and natural language processing [31, 6], and have thus attracted considerable attention. To train a neural network, we generally need to collect a large amount of training data. Owing to the use of crowdsourcing services, it is now easy to collect a large number of annotated images and texts. However, because of privacy concerns, it is difficult to collect a large amount of personal data, such as medical data including gene expression and medical images, on a single server. In such cases, decentralized learning, which aims to train a model without sharing the training data among servers, is a powerful tool. Decentralized learning was originally studied to train large-scale models in parallel, and because it allows us to train models without aggregating the training data, it has recently attracted significant attention from the perspective of privacy preservation.

One of the most widely used algorithms for decentralized learning is the Gossip-based algorithm [2, 18]. In the Gossip-based algorithm, each node (i.e., server) updates the model parameters using its own gradient, exchanges model parameters with its neighbors, and then takes the average value to reach a consensus. The Gossip-based algorithm is a simple yet effective approach. When the distribution of the data subset held by each node is statistically homogeneous, the Gossip-based algorithm can perform as well as the vanilla SGD, which trains the model on a single node using all of the training data. However, when the data distribution of each node is statistically heterogeneous (e.g., each node has only images of some classes and not others), the client-drift [11] occurs, and the Gossip-based algorithms do not perform well [30, 32].

Recently, an operator splitting method using a primal-dual formulation, called the Edge-Consensus Learning (ECL) [22], has been proposed. Primal-dual algorithms, including the Alternating Direction Method of Multipliers (ADMM) [3] and the Primal-Dual Method of Multipliers (PDMM) [38], can be applied to decentralized learning by representing the model consensus as linear constraints. It has recently been shown that the ADMM and the PDMM for decentralized learning can be derived by solving the dual problem using operator splitting methods [27] (e.g., the Douglas-Rachford splitting [8] and the Peaceman-Rachford splitting [24]). In addition, Niwa et al. [22, 23] applied these operator splitting methods to neural networks and named the resulting method the Edge-Consensus Learning (ECL). They showed that the ECL can be interpreted as a variance reduction method and is robust to heterogeneous data.

However, for both the Gossip-based algorithm and the ECL, each node needs to exchange the model parameters and/or dual variables with its neighbors, and such exchanges incur significant communication costs. Recently, many studies have proposed methods for compressing the parameters exchanged in the Gossip-based algorithm [29, 13, 21, 33] and showed that these compression methods can train a model with fewer parameter exchanges than the uncompressed Gossip-based algorithm. However, these compression methods are built on the Gossip-based algorithm and do not perform well when the data distribution of each node is statistically heterogeneous.

In this work, we propose a novel framework of compression methods for the ECL, which we refer to as the Communication Compressed ECL (C-ECL). Specifically, we reformulate the update formulas of the ECL and propose to compress the update values of the dual variables. Theoretically, we analyze how our proposed compression affects the convergence rate of the ECL and show that the C-ECL converges linearly to the optimal solution, as does the ECL. Experimentally, we show that the C-ECL can achieve almost the same accuracy as the ECL with fewer parameter exchanges. Furthermore, the experimental results show that the C-ECL is more robust to heterogeneous data than the Gossip-based algorithm, and that, when the data distribution of each node is statistically heterogeneous, the C-ECL can outperform the uncompressed Gossip-based algorithm in terms of both the accuracy and the communication costs.

Notation: In this work, \|\cdot\| denotes the L2 norm, \mathbf{0} denotes a vector with all zeros, and \mathbf{I} denotes an identity matrix.

2 Related Work

In this section, we briefly introduce the problem setting of decentralized learning, and then introduce the Gossip-based algorithm and the ECL.

2.1 Decentralized Learning

Let G=(\mathcal{V},\mathcal{E}) be an undirected connected graph that represents the network topology, where \mathcal{V} denotes the set of nodes and \mathcal{E} denotes the set of edges. For simplicity, we denote \mathcal{V} as the set of integers \{1,2,\ldots,|\mathcal{V}|\}. We denote the set of neighbors of the node i as \mathcal{N}_{i}=\{j\in\mathcal{V}|(i,j)\in\mathcal{E}\}. The goal of decentralized learning is formulated as follows:

\displaystyle\inf_{\mathbf{w}}\;\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}),\;\;f_{i}(\mathbf{w})\coloneqq\mathbb{E}_{\zeta_{i}\sim\mathcal{D}_{i}}[F(\mathbf{w};\zeta_{i})], (1)

where \mathbf{w}\in\mathbb{R}^{d} is the model parameter, F is the loss function, \mathcal{D}_{i} represents the data held by the node i, \zeta_{i} is the data sample from \mathcal{D}_{i}, and f_{i} is the loss function of the node i.

2.2 Gossip-based Algorithm

A widely used approach for decentralized learning is the Gossip-based algorithm (e.g., D-PSGD [18]). In the Gossip-based algorithm, each node i computes its own gradient \nabla f_{i}, exchanges parameters with its neighbors, and then takes their average. Although the Gossip-based algorithm is a simple and effective approach, there are two main issues: high communication costs and sensitivity to the heterogeneity of the data distribution.
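For reference in the rest of this section, the following is a minimal sketch of one synchronous D-PSGD round, i.e., a local SGD step followed by gossip averaging with the neighbors; the uniform mixing weights, the toy quadratic losses, and the constant step size are our own illustrative choices rather than details of [18].

```python
import numpy as np

def dpsgd_round(params, grads, neighbors, lr=0.1):
    """One synchronous D-PSGD round (sketch): local SGD step, then gossip averaging."""
    half_step = [w - lr * g for w, g in zip(params, grads)]
    new_params = []
    for i, w in enumerate(half_step):
        peers = [half_step[j] for j in neighbors[i]]
        new_params.append(np.mean([w] + peers, axis=0))  # uniform averaging with the neighbors
    return new_params

# Example: 4 nodes on a ring, each minimizing its own quadratic f_i(w) = (w - c_i)^2 / 2.
targets = np.array([1.0, 2.0, 3.0, 4.0])
params = [np.zeros(1) for _ in range(4)]
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
for _ in range(100):
    grads = [params[i] - targets[i] for i in range(4)]
    params = dpsgd_round(params, grads, ring)
print([round(p.item(), 3) for p in params])  # every node ends up close to the global mean 2.5
```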

In the Gossip-based algorithm, each node needs to receive the model parameters from its neighbors. Because the number of parameters in a neural network is large, these exchanges of the model parameters incur huge communication costs. To reduce the communication costs of the Gossip-based algorithm, many methods that compress the exchanged parameters by using sparsification, quantization, and low-rank approximation have been proposed [29, 13, 21, 12, 33]. It was shown that these compression methods can achieve almost the same accuracy as the uncompressed Gossip-based algorithm with fewer parameter exchanges.

The second issue is that the Gossip-based algorithm is sensitive to the heterogeneity of the data distribution. When the data distribution of each node is statistically heterogeneous, the optimal solution of the global loss function \sum_{i}f_{i} and the optimal solutions of the local loss functions f_{i} are far from each other, and the Gossip-based algorithms do not perform well [22, 32]. To address heterogeneous data in the Gossip-based algorithm, Tang et al. [30] and Xin et al. [37] applied variance reduction methods [5, 10] to the Gossip-based algorithm, Lorenzo and Scutari [20] proposed gradient tracking algorithms, and Vogels et al. [32] recently proposed the RelaySum.

2.3 Edge-Consensus Learning

In this section, we briefly introduce the Edge-Consensus Learning (ECL) [22]. By reformulating Eq. (1), the primal problem can be defined as follows:

\displaystyle\inf_{\{\mathbf{w}_{i}\}_{i}}\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})\;\;\text{s.t.}\;\;\mathbf{A}_{i|j}\mathbf{w}_{i}+\mathbf{A}_{j|i}\mathbf{w}_{j}=\mathbf{0},\;(\forall(i,j)\in\mathcal{E}), (2)

where \mathbf{A}_{i|j}=\mathbf{I} when j\in\mathcal{N}_{i} and i<j, and \mathbf{A}_{i|j}=-\mathbf{I} when j\in\mathcal{N}_{i} and i>j. Whereas the Gossip-based algorithm explicitly computes the average at each round, the primal problem of Eq. (2) represents the consensus through the linear constraints. Subsequently, by solving the dual problem of Eq. (2) using the Douglas-Rachford splitting [8], the update formulas can be derived as follows [27]:

\displaystyle\mathbf{w}^{(r+1)}_{i}=\text{argmin}_{\mathbf{w}_{i}}\{f_{i}(\mathbf{w}_{i})+\frac{\alpha}{2}\sum_{j\in\mathcal{N}_{i}}{\left\|\mathbf{A}_{i|j}\mathbf{w}_{i}-\frac{1}{\alpha}\mathbf{z}^{(r)}_{i|j}\right\|}^{2}\}, (3)
\displaystyle\mathbf{y}^{(r+1)}_{i|j}=\mathbf{z}^{(r)}_{i|j}-2\alpha\mathbf{A}_{i|j}\mathbf{w}^{(r+1)}_{i}, (4)
\displaystyle\mathbf{z}^{(r+1)}_{i|j}=(1-\theta)\mathbf{z}^{(r)}_{i|j}+\theta\mathbf{y}^{(r+1)}_{j|i}, (5)

where \theta\in(0,1] and \alpha>0 are hyperparameters, and \mathbf{y}_{i|j},\mathbf{z}_{i|j}\in\mathbb{R}^{d} are dual variables. We present the detailed derivation of these update formulas in Sec. B. When \theta=1, the Douglas-Rachford splitting is specifically called the Peaceman-Rachford splitting [24]. When f_{i} is non-convex (e.g., a loss function of a neural network), Eq. (3) cannot generally be solved exactly. Niwa et al. [22] therefore proposed the Edge-Consensus Learning (ECL), which approximately solves Eq. (3) as follows:

\displaystyle\mathbf{w}^{(r+1)}_{i}=\text{argmin}_{\mathbf{w}_{i}}\{\langle\mathbf{w}_{i},\nabla f_{i}(\mathbf{w}^{(r)}_{i})\rangle+\frac{1}{2\eta}\left\|\mathbf{w}_{i}-\mathbf{w}^{(r)}_{i}\right\|^{2}+\frac{\alpha}{2}\sum_{j\in\mathcal{N}_{i}}{\left\|\mathbf{A}_{i|j}\mathbf{w}_{i}-\frac{1}{\alpha}\mathbf{z}^{(r)}_{i|j}\right\|}^{2}\}, (6)

where \eta>0 corresponds to the learning rate. Niwa et al. [23] then showed that the ECL can be interpreted as a stochastic variance reduction method and demonstrated that the ECL is more robust to heterogeneous data than the Gossip-based algorithms. However, as shown in Eq. (5), the node i must receive the dual variable \mathbf{y}_{j|i} from its neighbor j in the ECL. Therefore, in the ECL as well as in the Gossip-based algorithm, large communication costs are incurred during training.
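For concreteness, the following is a minimal sketch of one ECL round at a single node. Because \mathbf{A}_{i|j}=\pm\mathbf{I}, the subproblem in Eq. (6) is quadratic in \mathbf{w}_{i} and can be solved in closed form; the closed-form solve, the grad_fi callback, and the per-neighbor sign scalars below are our own illustrative simplifications, not part of [22].

```python
import numpy as np

def ecl_local_step(w_i, z_i, signs, grad_fi, alpha=0.1, eta=0.1):
    """One node's primal and y-updates in the ECL (Eqs. 6 and 4), as a sketch.

    w_i:     current model parameter of node i
    z_i:     dict mapping neighbor j -> dual variable z_{i|j}
    signs:   dict mapping neighbor j -> +1.0 or -1.0 (the scalar behind A_{i|j} = sign * I)
    grad_fi: callable returning the stochastic gradient of f_i at w_i
    """
    # Eq. (6) is quadratic, so its minimizer satisfies
    # (1/eta + alpha * |N_i|) w = w_i / eta - grad_fi(w_i) + sum_j A_{i|j} z_{i|j}.
    rhs = w_i / eta - grad_fi(w_i) + sum(signs[j] * z_i[j] for j in z_i)
    w_new = rhs / (1.0 / eta + alpha * len(z_i))

    # Eq. (4): the dual message y_{i|j} sent to each neighbor j.
    y_out = {j: z_i[j] - 2.0 * alpha * signs[j] * w_new for j in z_i}
    return w_new, y_out

def ecl_z_update(z_i, y_in, theta=1.0):
    """Eq. (5): mix the received y_{j|i} into z_{i|j}."""
    return {j: (1.0 - theta) * z_i[j] + theta * y_in[j] for j in z_i}

# Tiny usage example: node i = 2 with neighbors 1 and 3, so A_{2|1} = -I and A_{2|3} = +I,
# and a quadratic local loss f_2(w) = ||w - c||^2 / 2.
c = np.array([1.0, -2.0])
w, z = np.zeros(2), {1: np.zeros(2), 3: np.zeros(2)}
signs = {1: -1.0, 3: 1.0}
w_new, y_out = ecl_local_step(w, z, signs, grad_fi=lambda w: w - c)
print(w_new, y_out)
```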

In addition to the ECL, other methods using primal-dual formulations have been proposed [17], and compression methods for these primal-dual algorithms have recently been studied [14, 19]. However, compression methods for the ECL have not been studied. In this work, we propose a compression method for the ECL, called the C-ECL, that can train a model with fewer parameter exchanges and is robust to heterogeneous data.

3 Proposed Method

3.1 Compression Operator

Before proposing the C-ECL, we first introduce the compression operator used in this work.

Assumption 1 (Compression Operator).

For some \tau\in(0,1], we assume that the compression operator \mathrm{comp}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} satisfies the following conditions:

\displaystyle\mathbb{E}_{\omega}\left\|\mathrm{comp}(\mathbf{x};\omega)-\mathbf{x}\right\|^{2}\leq(1-\tau)\left\|\mathbf{x}\right\|^{2}\quad(\forall\mathbf{x}\in\mathbb{R}^{d}), (7)
\displaystyle\mathrm{comp}(\mathbf{x}+\mathbf{y};\omega)=\mathrm{comp}(\mathbf{x};\omega)+\mathrm{comp}(\mathbf{y};\omega)\quad(\forall\omega,\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}), (8)
\displaystyle\mathrm{comp}(-\mathbf{x};\omega)=-\mathrm{comp}(\mathbf{x};\omega)\quad(\forall\omega,\forall\mathbf{x}\in\mathbb{R}^{d}), (9)

where \omega represents the parameter of the compression operator. In the following, we abbreviate \omega and write \mathrm{comp}(\mathbf{x}) as the operator containing the randomness.

The assumption of Eq. (7) is commonly used for the compression methods for the Gossip-based algorithms [21, 33, 13]. In addition, we assume that the compression operator satisfies Eqs. (8-9); both the low-rank approximation [33] and the following sparsification, which are used in the compression methods for the Gossip-based algorithm, satisfy Eqs. (8-9).

Example 1.

For some k\in(0,100], we define the operator \mathrm{rand}_{k\%}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} as follows:

\displaystyle\mathrm{rand}_{k\%}(\mathbf{x})\coloneqq\mathbf{s}\circ\mathbf{x}, (10)

where \circ is the Hadamard product and \mathbf{s}\in\{0,1\}^{d} is a uniformly sampled sparse vector whose elements are one with probability k\%. Here, the parameter \omega of the compression operator corresponds to the randomly sampled vector \mathbf{s}. Then, \mathrm{rand}_{k\%} satisfies Assumption 1 [28].
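A minimal sketch of \mathrm{rand}_{k\%} is given below, together with an empirical check of Eq. (7). Returning the mask alongside the compressed vector reflects the point used later in Sec. 3.2: Eq. (8) only holds when both arguments are compressed with the same \mathbf{s}. The function name and the dense mask representation are our own choices.

```python
import numpy as np

def rand_k_percent(x, k, rng):
    """rand_{k%} of Example 1: keep each coordinate with probability k/100 (sketch)."""
    s = (rng.random(x.shape) < k / 100.0).astype(x.dtype)  # the sparse vector s (= omega)
    return s * x, s

# Empirical check of Eq. (7): E||comp(x) - x||^2 <= (1 - tau) ||x||^2 with tau = k/100.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
k = 10
errors = [np.sum((rand_k_percent(x, k, rng)[0] - x) ** 2) for _ in range(2000)]
print(np.mean(errors) / np.sum(x ** 2))  # close to 1 - k/100 = 0.9
```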

3.2 Communication Compressed Edge-Consensus Learning

In this section, we propose the Communication Compressed ECL (C-ECL), a method for compressing the dual variables exchanged in the ECL by using the compression operator introduced in the previous section.

In the ECL, to update \mathbf{z}_{i|j} in Eq. (5), the node i needs to receive \mathbf{y}_{j|i} from the node j. Because the number of elements in \mathbf{y}_{j|i} is the same as that of the model parameter \mathbf{w}_{j}, this exchange incurs significant communication costs. A straightforward approach to reduce this communication cost is compressing \mathbf{y}_{j|i} in Eq. (5) as follows:

\displaystyle\mathbf{z}^{(r+1)}_{i|j}=(1-\theta)\mathbf{z}^{(r)}_{i|j}+\theta\;\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i}). (11)

However, we found experimentally that compressing \mathbf{y}_{j|i} does not work well. For the compression methods for the Gossip-based algorithm, Lu and De Sa [21] showed that the model parameters are not robust to compression. This is because the optimal solution of the model parameter is generally not zero, and thus the error caused by the compression does not approach zero even if the model parameters are near the optimal solution. Therefore, the successful compression methods for the Gossip-based algorithms compress the gradient \nabla f_{i}(\mathbf{w}_{i}) or the model difference (\mathbf{w}_{j}-\mathbf{w}_{i}), which approach zero when the model parameters are near the optimal solution [13, 21, 33].

Inspired by these compression methods for the Gossip-based algorithms, we reformulate Eq. (5) into Eq. (12) so that we can compress the parameters which approach zero when the model parameters are near the optimal solution.

\displaystyle\mathbf{z}^{(r+1)}_{i|j}=\mathbf{z}^{(r)}_{i|j}+\theta(\mathbf{y}^{(r+1)}_{j|i}-\mathbf{z}^{(r)}_{i|j}). (12)

In the Douglas-Rachford splitting, \mathbf{z}_{i|j} approaches the fixed point (i.e., \mathbf{z}_{i|j}^{(r)}=\mathbf{z}_{i|j}^{(r+1)}) when the model parameters approach the optimal solution (see Sec. A for details of the Douglas-Rachford splitting and the definition of the fixed point). Hence, from Eq. (12), when the model parameters approach the optimal solution, (\mathbf{y}_{j|i}-\mathbf{z}_{i|j}) in Eq. (12) approaches zero. Therefore, instead of compressing \mathbf{y}_{j|i} as in Eq. (11), we propose compressing (\mathbf{y}_{j|i}-\mathbf{z}_{i|j}) as follows:

\displaystyle\mathbf{z}^{(r+1)}_{i|j}=\mathbf{z}^{(r)}_{i|j}+\theta\;\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i}-\mathbf{z}^{(r)}_{i|j}) (13)
\displaystyle=\mathbf{z}^{(r)}_{i|j}+\theta\;(\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i})-\mathrm{comp}(\mathbf{z}^{(r)}_{i|j})),

where we use Assumption 1 in the last equation. When \mathrm{rand}_{k\%} is used as the compression operator, \mathbf{y}_{j|i} and \mathbf{z}_{i|j} must be compressed using the same sparse vector \mathbf{s} in Eq. (10) for Assumption 1 to apply. In Alg. 1, we provide the pseudo-code of the C-ECL. For simplicity, the node i and the node j exchange \omega_{i|j} and \omega_{j|i} at Lines 5-6 in Alg. 1. However, by sharing the same seed value used to generate \omega_{i|j} and \omega_{j|i} before starting the training, the node i and the node j can obtain the same values \omega_{i|j} and \omega_{j|i} without any exchanges. Moreover, when \mathrm{rand}_{k\%} is used as the compression operator, the node i can recover \omega_{i|j} from the received value \mathrm{comp}(\mathbf{y}_{j|i};\omega_{i|j}) because \mathrm{comp}(\mathbf{y}_{j|i};\omega_{i|j}) is stored in a sparse matrix format (e.g., the COO format). Thus, these exchanges of \omega_{i|j} and \omega_{j|i} can be omitted in practice. Therefore, in the C-ECL, each node only needs to exchange the compressed value of \mathbf{y}_{j|i}, and the C-ECL can train a model with fewer parameter exchanges than the ECL.

1:  for r=0 to R do
2:     \mathbf{w}^{(r+1)}_{i}\leftarrow\text{argmin}_{\mathbf{w}_{i}}\{f_{i}(\mathbf{w}_{i})+\frac{\alpha}{2}\sum_{j\in\mathcal{N}_{i}}{\left\|\mathbf{A}_{i|j}\mathbf{w}_{i}-\frac{1}{\alpha}\mathbf{z}^{(r)}_{i|j}\right\|}^{2}\}.
3:     for j\in\mathcal{N}_{i} do
4:        \mathbf{y}^{(r+1)}_{i|j}\leftarrow\mathbf{z}^{(r)}_{i|j}-2\alpha\mathbf{A}_{i|j}\mathbf{w}^{(r+1)}_{i}.
5:        \textbf{Receive}_{i\leftarrow j}(\omega_{i|j}^{(r+1)}). // This exchange can be omitted.
6:        \textbf{Transmit}_{i\rightarrow j}(\omega_{j|i}^{(r+1)}). // This exchange can be omitted.
7:        \textbf{Receive}_{i\leftarrow j}(\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i};\omega^{(r+1)}_{i|j})).
8:        \textbf{Transmit}_{i\rightarrow j}(\mathrm{comp}(\mathbf{y}^{(r+1)}_{i|j};\omega^{(r+1)}_{j|i})).
9:        \mathbf{z}^{(r+1)}_{i|j}\leftarrow\mathbf{z}^{(r)}_{i|j}+\theta\;\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i}-\mathbf{z}^{(r)}_{i|j};\omega^{(r+1)}_{i|j}).
10:     end for
11:  end for
Algorithm 1 Update procedure at the node i of the C-ECL.
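As a minimal illustration of Line 9 of Alg. 1, the sketch below implements the z-update with \mathrm{rand}_{k\%} as the compression operator, where the shared mask plays the role of \omega_{i|j}. The dense mask and the variable names are our own simplifications; in practice the received message would be stored in a sparse format (e.g., the COO format) so that only the nonzero entries are exchanged.

```python
import numpy as np

def cecl_z_update(z_ij, y_ji_compressed, mask, theta=1.0):
    """Line 9 of Alg. 1 with rand_{k%}: z_{i|j} <- z_{i|j} + theta * comp(y_{j|i} - z_{i|j}).

    Because rand_{k%} is linear (Eqs. 8-9), comp(y_{j|i} - z_{i|j}) = mask * y_{j|i} - mask * z_{i|j},
    so node i only needs the compressed message mask * y_{j|i} from node j.
    """
    return z_ij + theta * (y_ji_compressed - mask * z_ij)

rng = np.random.default_rng(0)
d, k = 8, 25
mask = (rng.random(d) < k / 100.0).astype(float)   # shared mask, i.e., omega_{i|j}
y_ji = rng.standard_normal(d)                      # held by node j
z_ij = rng.standard_normal(d)                      # held by node i
print(cecl_z_update(z_ij, mask * y_ji, mask))      # only mask * y_ji crosses the network
```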

4 Convergence Analysis

In this section, we analyze how the compression in the C-ECL affects the convergence rate of the ECL. Our convergence analysis is based on the analysis of the Douglas-Rachford splitting [9], and the proofs are presented in Sec. C.

4.1 Assumptions

In this section, we introduce additional notation and the assumptions used in our convergence analysis. We define N\coloneqq|\mathcal{V}|, N_{\text{min}}\coloneqq\min_{i}\{|\mathcal{N}_{i}|\}, and N_{\text{max}}\coloneqq\max_{i}\{|\mathcal{N}_{i}|\}, where \mathcal{N}_{i} is the set of neighbors of the node i. Letting \mathcal{N}_{i}(j) be the j-th smallest index of the nodes in \mathcal{N}_{i}, we define \mathbf{w}\in\mathbb{R}^{dN}, \mathbf{z}_{i}\in\mathbb{R}^{d|\mathcal{N}_{i}|}, and \mathbf{z}\in\mathbb{R}^{2d|\mathcal{E}|} as follows:

\displaystyle\mathbf{w}\coloneqq(\mathbf{w}_{1}^{\top},\ldots,\mathbf{w}_{N}^{\top})^{\top},\;\;\mathbf{z}_{i}\coloneqq(\mathbf{z}_{i|\mathcal{N}_{i}(1)}^{\top},\ldots,\mathbf{z}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}^{\top})^{\top},\;\;\mathbf{z}\coloneqq(\mathbf{z}_{1}^{\top},\ldots,\mathbf{z}_{N}^{\top})^{\top}. (14)

For simplicity, we drop the superscript of the round number r. Letting \{\mathbf{w}_{i}^{\star}\}_{i} be the optimal solution of Eq. (2), we define \mathbf{w}^{\star}\in\mathbb{R}^{dN} in the same manner as the definition of \mathbf{w} in Eq. (14). We define the loss function as f(\mathbf{w})\coloneqq\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i}). Next, we introduce the assumptions used in the convergence analysis.

Assumption 2.

We assume that f is proper, closed, and convex.

Assumption 3.

We assume that f is L-smooth and \mu-strongly convex with L>0 and \mu>0.

Assumption 4.

We assume that the graph G has no isolated nodes (i.e., N_{\text{min}}>0).

Assumptions 2 and 3 are standard assumptions used for the convergence analysis of operator splitting methods [9, 26]. Assumption 3 is weaker than assuming the smoothness and strong convexity of f_{i} for all i\in\mathcal{V}, which are commonly used in decentralized learning. Assumption 4 holds in general because decentralized learning assumes that the graph G is connected. In addition, we define \delta\in\mathbb{R} as follows:

\displaystyle\delta\coloneqq\max\left(\frac{\alpha N_{\text{max}}-\mu}{\alpha N_{\text{max}}+\mu},\frac{L-\alpha N_{\text{min}}}{L+\alpha N_{\text{min}}}\right).

Suppose that Assumptions 2, 3, and 4 hold and that \alpha\in(0,\infty); then \delta\in[0,1) holds because L\geq\mu>0 and N_{\text{max}}\geq N_{\text{min}}>0.

4.2 Convergence Rates

Theorem 1.

Let \bar{\mathbf{z}}\in\mathbb{R}^{2d|\mathcal{E}|} be the fixed point of the Douglas-Rachford splitting (a more detailed definition is given in Secs. A and C). Suppose that Assumptions 1, 2, 3, and 4 hold. If \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies

\displaystyle\theta\in\left(\frac{2\delta\sqrt{1-\tau}}{(1-\delta)(1-\sqrt{1-\tau})},\frac{2}{(1+\delta)(1+\sqrt{1-\tau})}\right), (15)

then \mathbf{w}^{(r+1)} generated by Alg. 1 converges linearly to the optimal solution \mathbf{w}^{\star} of Eq. (2) as follows:

\displaystyle\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\left\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\right\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|. (16)
Corollary 1.

Let \bar{\mathbf{z}}\in\mathbb{R}^{2d|\mathcal{E}|} be the fixed point of the Douglas-Rachford splitting. Under Assumptions 1, 2, 3, and 4, when \tau=1 and \theta\in(0,\frac{2}{1+\delta}), \mathbf{w}^{(r+1)} generated by Alg. 1 converges linearly to the optimal solution \mathbf{w}^{\star} of Eq. (2) as follows:

\displaystyle\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}(|1-\theta|+\theta\delta)^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|. (17)

Because \tau=1 implies that \mathrm{comp}(\mathbf{x})=\mathbf{x} in the C-ECL, Corollary 1 gives the convergence rate of the ECL under Assumptions 1, 2, 3, and 4, which is almost the same rate as that shown in the previous work [9]. Comparing the domains of \theta for which the ECL and the C-ECL converge, as \tau decreases, the domain in Eq. (15) becomes smaller. Moreover, for the domain in Eq. (15) to be non-empty, \tau must be greater than or equal to (1-(1-\delta)^{2}/(1+\delta)^{2}). Next, comparing the convergence rates of the ECL and the C-ECL, the compression in the C-ECL slows down the convergence rate of the ECL by the term (\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)). Moreover, similar to the convergence analysis of the Douglas-Rachford splitting [9], Theorem 1 and Corollary 1 imply that the optimal parameter \theta can be determined as follows:

Corollary 2.

Suppose that Assumptions 1, 2, 3, and 4 hold and that \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2}; then the optimal convergence rate of Eq. (16) in the C-ECL is achieved when \theta=1.

Corollary 3.

Suppose that Assumptions 1, 2, 3, and 4 hold and that \tau=1; then the optimal convergence rate of Eq. (17) is achieved when \theta=1.
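As a purely numerical illustration of Theorem 1 and Corollaries 1-3, the sketch below evaluates \delta, the lower bound on \tau, the admissible interval for \theta in Eq. (15), and the contraction factor appearing in Eq. (16); the helper name and the constants L, \mu, \alpha, and the node degrees are assumed values chosen only for illustration.

```python
import numpy as np

def cecl_rates(L, mu, alpha, n_min, n_max, theta, tau):
    """Evaluate the quantities appearing in Theorem 1 (illustrative sketch)."""
    delta = max((alpha * n_max - mu) / (alpha * n_max + mu),
                (L - alpha * n_min) / (L + alpha * n_min))
    tau_min = 1.0 - ((1.0 - delta) / (1.0 + delta)) ** 2          # condition on tau
    s = np.sqrt(1.0 - tau)
    theta_lo = 2.0 * delta * s / ((1.0 - delta) * (1.0 - s))      # Eq. (15), lower end
    theta_hi = 2.0 / ((1.0 + delta) * (1.0 + s))                  # Eq. (15), upper end
    rate = abs(1 - theta) + theta * delta + s * (theta + abs(1 - theta) * delta + delta)
    return delta, tau_min, (theta_lo, theta_hi), rate

# Assumed constants: L = 10, mu = 1, alpha = 1, all nodes of degree 2, theta = 1, tau = 0.99.
delta, tau_min, theta_range, rate = cecl_rates(10.0, 1.0, 1.0, 2, 2, theta=1.0, tau=0.99)
print(delta, tau_min, theta_range, rate)  # a rate below 1 corresponds to linear convergence
```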

5 Experiments

In this section, we demonstrate that the C-ECL can achieve almost the same performance as the ECL with fewer parameter exchanges. Furthermore, we show that the C-ECL is more robust to heterogeneous data than the Gossip-based algorithm.

5.1 Experimental Setting

Dataset and Model: We evaluate the C-ECL using FashionMNIST [35] and CIFAR10 [15], which are datasets for 10-class image-classification tasks. As the models for both datasets, we use 5-layer convolutional neural networks [16] with group normalization [34]. Following the previous work [22], we distribute the data to the nodes in two settings: the homogeneous setting and the heterogeneous setting. In the homogeneous setting, the data are distributed such that each node has the data of all 10 classes and approximately the same number of data of each class. In the heterogeneous setting, the data are distributed such that each node has the data of 8 randomly selected classes. In both settings, the data are distributed such that each node has the same number of data.

Network: In Sec. 5.2, we evaluate all comparison methods on a ring network consisting of eight nodes. In addition, in Sec. 5.3, we evaluate all comparison methods in four settings of the network topology: chain, ring, multiplex ring, and fully connected graph, where each setting consists of eight nodes. In Sec. D, we show a visualization of the network topologies. Each node exchanges parameters with its neighbors every five local updates. We implement all methods with PyTorch using gloo (https://pytorch.org/docs/stable/distributed.html) as the backend, and run all comparison methods on eight GPUs (NVIDIA RTX 3090).

Comparison Methods: (1) D-PSGD [18]: the uncompressed Gossip-based algorithm. (2) PowerGossip [33]: a Gossip-based algorithm that compresses the exchanged parameters by using a low-rank approximation. We use the PowerGossip as the compression method for the Gossip-based algorithm because the PowerGossip has been shown to achieve almost the same performance as other existing compression methods without additional hyperparameter tuning. (3) ECL [22, 23]: the primal-dual algorithm described in Sec. 2.3. Because Niwa et al. [22] showed that the ECL converges faster when \theta=1 than when \theta=0.5, we set \theta=1. (4) C-ECL: our proposed method described in Sec. 3. We use \mathrm{rand}_{k\%} as the compression operator. Following Corollary 2, we set \theta=1. We initialize \mathbf{z}_{i|j} and \mathbf{y}_{i|j} to zeros. However, we found that when we compress the update values of \mathbf{z}_{i|j} by using \mathrm{rand}_{k\%}, the convergence becomes slower because \mathbf{z}_{i|j} remains sparse in the early training stage. We therefore set k\% of \mathrm{rand}_{k\%} to 100\% only during the first epoch.

In addition, for reference, we show the results of the Stochastic Gradient Descent (SGD), in which the model is trained on a single node containing all the training data. In our experiments, we set the learning rate, the number of epochs, and the batch size to the same values for all comparison methods. In Sec. D, we show the detailed hyperparameters used for all comparison methods.

5.2 Experimental Results

In this section, we evaluate the accuracy and the communication costs when setting the network topology to be a ring.

Table 1: Test accuracy and communication costs on the homogeneous setting. For the C-ECL, the number in the bracket is k of \mathrm{rand}_{k\%}. For the PowerGossip, the number in the bracket is the number of power iteration steps. As the communication costs, the average amount of parameters sent per epoch is shown.

                  FashionMNIST                    CIFAR10
                  Accuracy   Send/Epoch           Accuracy   Send/Epoch
SGD               88.7       -                    75.7       -
D-PSGD            84.1       5336 KB (×1.0)       72.8       6255 KB (×1.0)
ECL               84.4       5336 KB (×1.0)       72.6       6255 KB (×1.0)
PowerGossip (1)   84.0       138 KB (×38.7)       72.0       135 KB (×46.3)
PowerGossip (10)  84.3       1079 KB (×5.0)       72.3       1102 KB (×5.7)
PowerGossip (20)  84.2       2124 KB (×2.5)       72.2       2175 KB (×2.9)
C-ECL (1%)        84.0       115 KB (×48.1)       71.5       132 KB (×47.4)
C-ECL (10%)       84.0       1075 KB (×5.1)       71.4       1257 KB (×5.0)
C-ECL (20%)       84.0       2142 KB (×2.5)       71.1       2507 KB (×2.5)
Table 2: Test accuracy and communication costs on the heterogeneous setting. For the C-ECL, the number in the bracket is k of \mathrm{rand}_{k\%}. For the PowerGossip, the number in the bracket is the number of power iteration steps. As the communication costs, the average amount of parameters sent per epoch is shown.

                  FashionMNIST                    CIFAR10
                  Accuracy   Send/Epoch           Accuracy   Send/Epoch
SGD               88.7       -                    75.7       -
D-PSGD            79.4       5336 KB (×1.0)       70.8       6155 KB (×1.0)
ECL               84.5       5336 KB (×1.0)       72.7       6155 KB (×1.0)
PowerGossip (1)   77.5       138 KB (×38.7)       64.3       133 KB (×46.3)
PowerGossip (10)  77.7       1079 KB (×5.0)       67.2       1084 KB (×5.7)
PowerGossip (20)  77.4       2124 KB (×2.5)       67.9       2141 KB (×2.9)
C-ECL (1%)        77.7       115 KB (×48.1)       60.2       129 KB (×47.7)
C-ECL (10%)       83.4       1075 KB (×5.1)       61.8       1237 KB (×5.0)
C-ECL (20%)       83.6       2142 KB (×2.5)       72.3       2467 KB (×2.5)

Homogeneous Setting: First, we discuss the results on the homogeneous setting. Table 1 shows the accuracy and the communication costs on the homogeneous setting. The results show that the D-PSGD and the ECL achieve almost the same accuracy on both datasets. The C-ECL and the PowerGossip are comparable and achieve almost the same accuracy as the ECL and the D-PSGD even when we set k\% of \mathrm{rand}_{k\%} to 1\% and the number of power iteration steps to 1, respectively. Therefore, the C-ECL can achieve comparable accuracy with approximately 50-times fewer parameter exchanges than the ECL and the D-PSGD on the homogeneous setting.

Heterogeneous Setting: Next, we discuss the results on the heterogeneous setting. Table 2 shows the accuracy and the communication costs on the heterogeneous setting. For the D-PSGD, the accuracy on the heterogeneous setting decreases by approximately 3\% compared to that on the homogeneous setting. For the PowerGossip, even if the number of power iteration steps is increased, the accuracy does not approach that of the D-PSGD and the ECL. On the other hand, the accuracy of the ECL is almost the same on both the homogeneous and heterogeneous settings, which indicates that the ECL is more robust to heterogeneous data than the D-PSGD. For the C-ECL, when we set k\% of \mathrm{rand}_{k\%} to 1\%, the accuracy on the heterogeneous setting decreases by approximately 10\% compared to that of the ECL. However, when we increase k\%, the C-ECL becomes competitive with the ECL and outperforms the D-PSGD and the PowerGossip: on FashionMNIST this already holds when we set k\% to 10\%, and on CIFAR10 when we set k\% to 20\%.

In summary, on the homogeneous setting, the C-ECL and the PowerGossip can achieve almost the same accuracy as the ECL and the D-PSGD with approximately 50-times fewer parameter exchanges. On the heterogeneous setting, the C-ECL can achieve almost the same accuracy as the ECL with approximately 4-times fewer parameter exchanges and can outperform the PowerGossip. Furthermore, the results show that the C-ECL can outperform the D-PSGD, the uncompressed Gossip-based algorithm, in terms of both the accuracy and the communication costs.

5.3 Network Topology

In this section, we show the accuracy and the communication costs when the network topology is varied. Table 3 and Fig. 1 show the communication costs and the accuracy on FashionMNIST when the network topology is varied as a chain, ring, multiplex ring, or fully connected graph.

On the homogeneous setting, Fig. 1 shows that the accuracies of all comparison methods are almost the same and reach that of the SGD on all network topologies. On the heterogeneous setting, Fig. 1 shows that the accuracies of the D-PSGD and the PowerGossip decrease compared to those on the homogeneous setting. On the other hand, the accuracy of the ECL is almost the same as that on the homogeneous setting on all network topologies. Furthermore, on all network topologies, the C-ECL achieves almost the same accuracy as the ECL with fewer parameter exchanges and consistently outperforms the PowerGossip. Moreover, the results show that, on the heterogeneous setting, the C-ECL can outperform the D-PSGD, the uncompressed Gossip-based algorithm, on all network topologies in terms of both the accuracy and the communication costs.

Figure 1: Test accuracy on FashionMNIST when varying the network topology: (a) homogeneous setting, (b) heterogeneous setting. We evaluate the average test accuracy of each node per 10 epochs.
Table 3: Communication costs on FashionMNIST when varying the network topology. As the communication costs, the average amount of parameters sent per epoch on the homogeneous and heterogeneous settings is shown.

                  Chain      Ring       Multiplex Ring   Fully Connected Graph
D-PSGD            4670 KB    5336 KB    10673 KB         18677 KB
ECL               4670 KB    5336 KB    10673 KB         18677 KB
PowerGossip (10)  944 KB     1078 KB    2158 KB          3776 KB
C-ECL (10%)       941 KB     1075 KB    2151 KB          3764 KB

6 Conclusion

In this work, we propose the Communication Compressed ECL (C-ECL), a novel framework of compression methods for the ECL. Specifically, we reformulate the update formula of the ECL and propose compressing the update values of the dual variables. Theoretically, we analyze the convergence rate of the C-ECL and show that the C-ECL converges linearly to the optimal solution, as does the ECL. Experimentally, we demonstrate that, even if the data distribution of each node is statistically heterogeneous, the C-ECL can achieve almost the same accuracy as the ECL with fewer parameter exchanges. Moreover, we show that, when the data distribution of each node is statistically heterogeneous, the C-ECL outperforms the uncompressed Gossip-based algorithm in terms of both the accuracy and the communication costs.

Acknowledgments

M.Y. was supported by MEXT KAKENHI Grant Number 20H04243.

References

  • Bauschke and Combettes, [2017] Bauschke, H. H. and Combettes, P. L. (2017). Convex analysis and monotone operator theory in Hilbert spaces. Springer, 2nd edition.
  • Boyd et al., [2006] Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. (2006). Randomized gossip algorithms. In IEEE Transactions on Information Theory.
  • Boyd et al., [2011] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
  • Chen et al., [2020] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning.
  • Defazio et al., [2014] Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems.
  • Devlin et al., [2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics.
  • Dosovitskiy et al., [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
  • Douglas and Rachford, [1956] Douglas, J. and Rachford, H. H. (1956). On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American mathematical Society.
  • Giselsson and Boyd, [2017] Giselsson, P. and Boyd, S. P. (2017). Linear convergence and metric selection for douglas-rachford splitting and ADMM. IEEE Transactions on Automatic Control.
  • Johnson and Zhang, [2013] Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems.
  • Karimireddy et al., [2020] Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. (2020). SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning.
  • Koloskova et al., [2020] Koloskova, A., Lin, T., Stich, S. U., and Jaggi, M. (2020). Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations.
  • Koloskova et al., [2019] Koloskova, A., Stich, S., and Jaggi, M. (2019). Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning.
  • Kovalev et al., [2021] Kovalev, D., Koloskova, A., Jaggi, M., Richtarik, P., and Stich, S. (2021). A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In International Conference on Artificial Intelligence and Statistics.
  • Krizhevsky, [2009] Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
  • LeCun et al., [1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. In IEEE.
  • Li et al., [2019] Li, Z., Shi, W., and Yan, M. (2019). A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. In IEEE Transactions on Signal Processing.
  • Lian et al., [2017] Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems.
  • Liu et al., [2021] Liu, X., Li, Y., Wang, R., Tang, J., and Yan, M. (2021). Linear convergent decentralized optimization with compression. In International Conference on Learning Representations.
  • Lorenzo and Scutari, [2016] Lorenzo, P. D. and Scutari, G. (2016). NEXT: in-network nonconvex optimization. In IEEE Transactions on Signal and Information Processing over Networks.
  • Lu and De Sa, [2020] Lu, Y. and De Sa, C. (2020). Moniqua: Modulo quantized communication in decentralized SGD. In International Conference on Machine Learning.
  • Niwa et al., [2020] Niwa, K., Harada, N., Zhang, G., and Kleijn, W. B. (2020). Edge-consensus learning: Deep learning on p2p networks with nonhomogeneous data. In International Conference on Knowledge Discovery and Data Mining.
  • Niwa et al., [2021] Niwa, K., Zhang, G., Kleijn, W. B., Harada, N., Sawada, H., and Fujino, A. (2021). Asynchronous decentralized optimization with implicit stochastic variance reduction. In International Conference on Machine Learning.
  • Peaceman and Rachford, [1955] Peaceman, D. W. and Rachford, H. H. (1955). The numerical solution of parabolic and elliptic differential equations. Journal of the Society for Industrial and Applied Mathematics.
  • Rockafellar, [2015] Rockafellar, R. T. (2015). Convex analysis. Princeton University Press.
  • Ryu and Boyd, [2015] Ryu, E. K. and Boyd, S. P. (2015). A primer on monotone operator methods. In Applied and Computational Mathematics.
  • Sherson et al., [2019] Sherson, T. W., Heusdens, R., and Kleijn, W. B. (2019). Derivation and analysis of the primal-dual method of multipliers based on monotone operator theory. In IEEE Transactions on Signal and Information Processing over Networks.
  • Stich et al., [2018] Stich, S. U., Cordonnier, J.-B., and Jaggi, M. (2018). Sparsified sgd with memory. In Advances in Neural Information Processing Systems.
  • [29] Tang, H., Gan, S., Zhang, C., Zhang, T., and Liu, J. (2018a). Communication compression for decentralized training. In Advances in Neural Information Processing Systems.
  • [30] Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. (2018b). D^{2}: Decentralized training over decentralized data. In International Conference on Machine Learning.
  • Vaswani et al., [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
  • Vogels et al., [2021] Vogels, T., He, L., Koloskova, A., Karimireddy, S. P., Lin, T., Stich, S. U., and Jaggi, M. (2021). Relaysum for decentralized deep learning on heterogeneous data. In Advances in Neural Information Processing Systems.
  • Vogels et al., [2020] Vogels, T., Karimireddy, S. P., and Jaggi, M. (2020). Powergossip: Practical low-rank communication compression in decentralized deep learning. In Advances in Neural Information Processing Systems.
  • Wu and He, [2018] Wu, Y. and He, K. (2018). Group normalization. In European Conference on Computer Vision.
  • Xiao et al., [2017] Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. In arXiv.
  • Xiao et al., [2007] Xiao, L., Boyd, S. P., and Kim, S. (2007). Distributed average consensus with least-mean-square deviation. In Journal of Parallel and Distributed Computing.
  • Xin et al., [2020] Xin, R., Khan, U. A., and Kar, S. (2020). Variance-reduced decentralized stochastic optimization with accelerated convergence. In IEEE Transactions on Signal and Information Processing over Networks.
  • Zhang and Heusdens, [2018] Zhang, G. and Heusdens, R. (2018). Distributed optimization using the primal-dual method of multipliers. In IEEE Transactions on Signal and Information Processing over Networks.

Appendix A Preliminary

In this section, we introduce the definitions used in the following section and briefly introduce the Douglas-Rachford splitting. (See [1, 26] for more details.)

A.1 Definition

Definition 1 (Smooth Function).

Let f:\mathcal{H}\rightarrow\mathbb{R}\cup\{\infty\} be a closed, proper, and convex function. f is L-smooth if f satisfies the following:

\displaystyle f(x)\leq f(y)+\langle\nabla f(y),x-y\rangle+\frac{L}{2}\left\|x-y\right\|^{2}\qquad(\forall x,y\in\mathcal{H}).
Definition 2 (Strongly Convex Function).

Let f:\mathcal{H}\rightarrow\mathbb{R}\cup\{\infty\} be a closed, proper, and convex function. f is \mu-strongly convex if f satisfies the following:

\displaystyle f(x)\geq f(y)+\langle\nabla f(y),x-y\rangle+\frac{\mu}{2}\left\|x-y\right\|^{2}\qquad(\forall x,y\in\mathcal{H}).
Definition 3 (Conjugate Function).

Let f:\mathcal{H}\rightarrow\mathbb{R}\cup\{\infty\}. The conjugate function of f is defined as follows:

\displaystyle f^{\ast}(y)=\sup_{x\in\mathcal{H}}(\langle y,x\rangle-f(x)).
Definition 4 (Nonexpansive Operator).

Let D be a non-empty subset of \mathcal{H}. An operator T:D\rightarrow\mathcal{H} is nonexpansive if T is 1-Lipschitz continuous.

Definition 5 (Contractive Operator).

Let D be a non-empty subset of \mathcal{H}. An operator T:D\rightarrow\mathcal{H} is \beta-contractive if T is Lipschitz continuous with constant \beta\in[0,1).

Definition 6 (Proximal Operator).

Let f:\mathcal{H}\rightarrow\mathbb{R}\cup\{\infty\} be a closed, proper, and convex function. The proximal operator of f is defined as follows:

\displaystyle\text{prox}_{f}(\mathbf{x})=\text{argmin}_{\mathbf{y}}\{f(\mathbf{y})+\frac{1}{2}\|\mathbf{y}-\mathbf{x}\|^{2}\}.
Definition 7 (Resolvent).

Let A:\mathcal{H}\rightarrow 2^{\mathcal{H}}; the resolvent of A is defined as follows:

\displaystyle J_{A}=(\text{Id}+A)^{-1}.
Definition 8 (Reflected Resolvent).

Let A:\mathcal{H}\rightarrow 2^{\mathcal{H}}; the reflected resolvent of A is defined as follows:

\displaystyle R_{A}=2J_{A}-\text{Id}.

A.2 Douglas-Rachford Splitting

In this section, we briefly introduce the Douglas-Rachford splitting [8].

The Douglas-Rachford splitting can be applied to the following problem:

\displaystyle\inf_{\mathbf{x}}f(\mathbf{x})+g(\mathbf{x}), (18)

where f and g are closed, proper, and convex functions. Letting \mathbf{x}^{\star} be the optimal solution of the above problem, the optimality condition can be written as follows:

\displaystyle\mathbf{0}\in\partial f(\mathbf{x}^{\star})+\partial g(\mathbf{x}^{\star}).

From [1, Proposition 26.1], the optimality condition above is equivalent to the following:

\displaystyle\mathbf{x}^{\star}=J_{\alpha\partial f}(\text{Fix}(R_{\alpha\partial g}R_{\alpha\partial f})),

where \text{Fix}(R_{\alpha\partial g}R_{\alpha\partial f})=\{\mathbf{z}|R_{\alpha\partial g}R_{\alpha\partial f}\mathbf{z}=\mathbf{z}\} and \alpha>0. The Douglas-Rachford splitting then computes the fixed point \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial g}R_{\alpha\partial f}) as follows:

\displaystyle\mathbf{z}^{(r+1)}=((1-\theta)\text{Id}+\theta R_{\alpha\partial g}R_{\alpha\partial f})\mathbf{z}^{(r)}, (19)

where \theta\in(0,1] is the hyperparameter of the Douglas-Rachford splitting. Under certain assumptions, ((1-\theta)\text{Id}+\theta R_{\alpha\partial g}R_{\alpha\partial f}) is contractive [9], and the update formula above is guaranteed to converge to the fixed point \bar{\mathbf{z}}. Then, after converging to the fixed point \bar{\mathbf{z}}, the Douglas-Rachford splitting obtains the optimal solution \mathbf{x}^{\star} as follows:

\displaystyle\mathbf{x}^{\star}=J_{\alpha\partial f}\bar{\mathbf{z}}. (20)

Moreover, by the definition of the reflected resolvent, the Douglas-Rachford splitting of Eq. (19) can be rewritten as follows:

\displaystyle\mathbf{x}^{(r+1)}=J_{\alpha\partial f}\mathbf{z}^{(r)}, (21)
\displaystyle\mathbf{y}^{(r+1)}=2\mathbf{x}^{(r+1)}-\mathbf{z}^{(r)}, (22)
\displaystyle\mathbf{z}^{(r+1)}=(1-\theta)\mathbf{z}^{(r)}+\theta R_{\alpha\partial g}\mathbf{y}^{(r+1)}. (23)
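As a self-contained illustration of Eqs. (19-23), the sketch below runs the Douglas-Rachford splitting on two strongly convex quadratics whose proximal operators have closed forms; the choice of f and g is ours, purely for illustration.

```python
import numpy as np

def prox_quadratic(z, center, alpha):
    """prox_{alpha f}(z) for f(x) = ||x - center||^2 / 2, in closed form."""
    return (z + alpha * center) / (1.0 + alpha)

def douglas_rachford(center_f, center_g, alpha=1.0, theta=0.5, iters=200):
    """Eqs. (21-23) for f(x) = ||x - center_f||^2 / 2 and g(x) = ||x - center_g||^2 / 2."""
    z = np.zeros_like(center_f)
    for _ in range(iters):
        x = prox_quadratic(z, center_f, alpha)               # Eq. (21): x = J_{alpha df}(z)
        y = 2.0 * x - z                                      # Eq. (22)
        r_g = 2.0 * prox_quadratic(y, center_g, alpha) - y   # reflected resolvent R_{alpha dg} y
        z = (1.0 - theta) * z + theta * r_g                  # Eq. (23)
    return prox_quadratic(z, center_f, alpha)                # Eq. (20): recover x* from the fixed point

a, b = np.array([0.0, 0.0]), np.array([2.0, 4.0])
print(douglas_rachford(a, b))  # approaches the minimizer of f + g, i.e., (a + b) / 2 = [1., 2.]
```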

Appendix B Derivation of Update Formulas of ECL

In this section, we briefly describe the derivation of the update formulas of the ECL. (See [22, 23, 27] for more details.)

B.1 Derivation of Dual Problem

First, we derive the Lagrangian function. The Lagrangian function of Eq. (2) can be derived as follows:

\displaystyle\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{ij},\mathbf{A}_{i|j}\mathbf{w}_{i}+\mathbf{A}_{j|i}\mathbf{w}_{j}\rangle
\displaystyle=\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{ij},\mathbf{A}_{i|j}\mathbf{w}_{i}\rangle-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{ij},\mathbf{A}_{j|i}\mathbf{w}_{j}\rangle
\displaystyle=\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{ij}+\boldsymbol{\lambda}_{ji},\mathbf{A}_{i|j}\mathbf{w}_{i}\rangle,

where \boldsymbol{\lambda}_{ij}\in\mathbb{R}^{d} is the dual variable. Defining \boldsymbol{\lambda}_{i|j}\coloneqq\boldsymbol{\lambda}_{ij}+\boldsymbol{\lambda}_{ji}, the Lagrangian function can be rewritten as follows:

\displaystyle\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{i|j},\mathbf{A}_{i|j}\mathbf{w}_{i}\rangle.

Note that the definition of \boldsymbol{\lambda}_{i|j} implies that \boldsymbol{\lambda}_{i|j}=\boldsymbol{\lambda}_{j|i}. Let N be the number of nodes |\mathcal{V}|. Here, \mathcal{N}_{i}(j) denotes the j-th smallest index of the nodes in \mathcal{N}_{i}. We define \mathbf{w}\in\mathbb{R}^{dN}, \mathbf{A}\in\mathbb{R}^{dN\times 2d|\mathcal{E}|}, and \boldsymbol{\lambda}\in\mathbb{R}^{2d|\mathcal{E}|} as follows:

\displaystyle\mathbf{w}=\begin{pmatrix}\mathbf{w}_{1}^{\top},&\cdots,&\mathbf{w}_{N}^{\top}\end{pmatrix}^{\top},
\displaystyle\mathbf{A}_{i}=\begin{pmatrix}\mathbf{A}_{i|\mathcal{N}_{i}(1)},&\cdots,&\mathbf{A}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}\end{pmatrix},
\displaystyle\mathbf{A}=\text{diag}(\mathbf{A}_{1},\cdots,\mathbf{A}_{N}),
\displaystyle\boldsymbol{\lambda}_{i}=\begin{pmatrix}\boldsymbol{\lambda}_{i|\mathcal{N}_{i}(1)}^{\top},&\cdots,&\boldsymbol{\lambda}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}^{\top}\end{pmatrix}^{\top},
\displaystyle\boldsymbol{\lambda}=\begin{pmatrix}\boldsymbol{\lambda}_{1}^{\top},&\cdots,&\boldsymbol{\lambda}_{N}^{\top}\end{pmatrix}^{\top}.

We define the function f as follows:

\displaystyle f(\mathbf{w})=\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i}).

The Lagrangian function can be rewritten as follows:

\displaystyle f(\mathbf{w})-\langle\boldsymbol{\lambda},\mathbf{A}^{\top}\mathbf{w}\rangle.

Then, the primal and dual problems can be defined as follows:

\displaystyle\inf_{\mathbf{w}}\sup_{\boldsymbol{\lambda}}f(\mathbf{w})-\langle\boldsymbol{\lambda},\mathbf{A}^{\top}\mathbf{w}\rangle-\iota(\boldsymbol{\lambda})=\inf_{\mathbf{w}}f(\mathbf{w})+\iota^{\ast}(-\mathbf{A}^{\top}\mathbf{w}), (24)
\displaystyle\sup_{\boldsymbol{\lambda}}\inf_{\mathbf{w}}f(\mathbf{w})-\langle\boldsymbol{\lambda},\mathbf{A}^{\top}\mathbf{w}\rangle-\iota(\boldsymbol{\lambda})=-\inf_{\boldsymbol{\lambda}}f^{\ast}(\mathbf{A}\boldsymbol{\lambda})+\iota(\boldsymbol{\lambda}), (25)

where \iota is the indicator function defined as follows:

\displaystyle\iota(\boldsymbol{\lambda})=\begin{cases}0&\text{if}\;\boldsymbol{\lambda}_{i|j}=\boldsymbol{\lambda}_{j|i},\;(\forall(i,j)\in\mathcal{E})\\ \infty&\text{otherwise}\end{cases}.

B.2 Derivation of Update Formulas

Next, we derive the update formulas of the ECL. By applying the Douglas-Rachford splitting of Eqs. (21-23) to the dual problem of Eq. (25), we obtain the following update formulas:

\displaystyle\boldsymbol{\lambda}^{(r+1)}=J_{\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\cdot)}\mathbf{z}^{(r)}, (26)
\displaystyle\mathbf{y}^{(r+1)}=2\boldsymbol{\lambda}^{(r+1)}-\mathbf{z}^{(r)}, (27)
\displaystyle\mathbf{z}^{(r+1)}=(1-\theta)\mathbf{z}^{(r)}+\theta R_{\alpha\partial\iota}\mathbf{y}^{(r+1)}, (28)

where \mathbf{z}\in\mathbb{R}^{2d|\mathcal{E}|} and \mathbf{y}\in\mathbb{R}^{2d|\mathcal{E}|} can be decomposed into \mathbf{z}_{i|j}\in\mathbb{R}^{d} and \mathbf{y}_{i|j}\in\mathbb{R}^{d} as follows:

\displaystyle\mathbf{z}_{i}=(\mathbf{z}_{i|\mathcal{N}_{i}(1)}^{\top},\ldots,\mathbf{z}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}^{\top})^{\top},
\displaystyle\mathbf{z}=(\mathbf{z}_{1}^{\top},\ldots,\mathbf{z}_{N}^{\top})^{\top},
\displaystyle\mathbf{y}_{i}=(\mathbf{y}_{i|\mathcal{N}_{i}(1)}^{\top},\ldots,\mathbf{y}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}^{\top})^{\top},
\displaystyle\mathbf{y}=(\mathbf{y}_{1}^{\top},\ldots,\mathbf{y}_{N}^{\top})^{\top}.

Update formulas of Eqs. (26-27): By the definition of the resolvent J_{\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\cdot)}, we obtain

\displaystyle\boldsymbol{\lambda}^{(r+1)}+\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{(r+1)})=\mathbf{z}^{(r)},
\displaystyle\boldsymbol{\lambda}^{(r+1)}=\mathbf{z}^{(r)}-\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{(r+1)}). (29)

We define \mathbf{w}^{(r+1)} as follows:

\displaystyle\mathbf{w}^{(r+1)}=\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{(r+1)}). (30)

From the property of the convex conjugate function, we obtain

\displaystyle\mathbf{A}\boldsymbol{\lambda}^{(r+1)}=\nabla f(\mathbf{w}^{(r+1)}). (31)

Substituting Eqs. (29-30) into Eq. (31), we obtain

\displaystyle\mathbf{0}=\nabla f(\mathbf{w}^{(r+1)})+\mathbf{A}(\alpha\mathbf{A}^{\top}\mathbf{w}^{(r+1)}-\mathbf{z}^{(r)}).

We then obtain the update formula of 𝐰\mathbf{w} as follows:

\displaystyle\mathbf{w}^{(r+1)}=\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}\|\mathbf{A}^{\top}\mathbf{w}-\frac{1}{\alpha}\mathbf{z}^{(r)}\|^{2}\}.

Substituting Eq. (30) into Eq. (29), we obtain

\displaystyle\boldsymbol{\lambda}^{(r+1)}=\mathbf{z}^{(r)}-\alpha\mathbf{A}^{\top}\mathbf{w}^{(r+1)}.

Then, the update formula of Eq. (27) is written as follows:

\displaystyle\mathbf{y}^{(r+1)}=\mathbf{z}^{(r)}-2\alpha\mathbf{A}^{\top}\mathbf{w}^{(r+1)}.

Update formula of Eq. (28): From [27, Lemma VI.2], the update formula of Eq. (28) is rewritten as follows:

\displaystyle\mathbf{z}^{(r+1)}=(1-\theta)\mathbf{z}^{(r)}+\theta\mathbf{P}\mathbf{y}^{(r+1)},

where \mathbf{P} denotes the permutation matrix transforming \mathbf{y}_{i|j} into \mathbf{y}_{j|i} for all (i,j)\in\mathcal{E}.

In summary, the update formulas of the ECL can be derived as follows:

\displaystyle\mathbf{w}^{(r+1)}=\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}{\left\|\mathbf{A}^{\top}\mathbf{w}-\frac{1}{\alpha}\mathbf{z}^{(r)}\right\|}^{2}\}, (32)
\displaystyle\mathbf{y}^{(r+1)}=\mathbf{z}^{(r)}-2\alpha\mathbf{A}^{\top}\mathbf{w}^{(r+1)}, (33)
\displaystyle\mathbf{z}^{(r+1)}=(1-\theta)\mathbf{z}^{(r)}+\theta\mathbf{P}\mathbf{y}^{(r+1)}. (34)

Then, by rewriting \mathbf{w}, \mathbf{z}, and \mathbf{y} with \mathbf{w}_{i}, \mathbf{z}_{i|j}, and \mathbf{y}_{i|j}, respectively, we obtain the update formulas of Eqs. (3-5).
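To make the stacked notation concrete, the following is a small numerical sketch (our own, with d=1 and three nodes on a chain) of the matrices \mathbf{A} and \mathbf{P} defined above, using the directed-edge ordering of Sec. B.1:

```python
import numpy as np

# Directed-edge ordering for the stacked vectors: (i|j) for each node i and each
# neighbor j in increasing order (d = 1 for readability); here a 3-node chain 0-1-2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
pairs = [(i, j) for i in neighbors for j in neighbors[i]]

# A is block diagonal over nodes with A_{i|j} = +1 if i < j and -1 if i > j (Sec. B.1).
A = np.zeros((len(neighbors), len(pairs)))
for col, (i, j) in enumerate(pairs):
    A[i, col] = 1.0 if i < j else -1.0

# P permutes the (i|j) slot into the (j|i) slot ([27, Lemma VI.2]).
P = np.zeros((len(pairs), len(pairs)))
for col, (i, j) in enumerate(pairs):
    P[pairs.index((j, i)), col] = 1.0

y = np.arange(len(pairs), dtype=float)   # stand-in for the stacked y
print(A @ A.T)                           # diag(|N_0|, |N_1|, |N_2|) = diag(1, 2, 1)
print(P @ y)                             # each y_{i|j} swapped with y_{j|i}
```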

Appendix C Convergence Analysis of C-ECL

From the derivation of the ECL shown in Sec. B, the update formulas of Eq. (3), Eq. (4), and Eq. (13) in the C-ECL can be rewritten as follows:³

\mathbf{y}^{(r+1)}=R_{\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\cdot)}\mathbf{z}^{(r)}, (35)
\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\theta\,\text{comp}(R_{\alpha\partial\iota}\mathbf{y}^{(r+1)}-\mathbf{z}^{(r)}). (36)

³By the definition of the reflected resolvent, the update formulas of Eq. (26) and Eq. (27) are equivalent to that of Eq. (35).

To simplify the notation, we define g\coloneqq f^{\ast}(\mathbf{A}\cdot). Then, Eq. (35) is rewritten as follows:

\mathbf{y}^{(r+1)}=R_{\alpha\nabla g}\mathbf{z}^{(r)}. (37)

In the following, we analyze the convergence rate of the update formulas of Eq. (37) and Eq. (36).
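
Before stating the lemmas, it may help to see a concrete compression operator of the kind the analysis has in mind. The sketch below implements an unscaled \textbf{rand}_{k} operator and numerically checks a contraction-type property \mathbb{E}\|\text{comp}(\mathbf{x};\omega)-\mathbf{x}\|\leq\sqrt{1-\tau}\|\mathbf{x}\| with \tau=k/d, which is how Assumption 1 is used in step (b) of the proof of Lemma 5; the operator, the parameter names, and the choice \tau=k/d are illustrative assumptions rather than the paper's exact Assumption 1.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def rand_k(x, k, rng):
    # Unscaled rand_k sparsification: keep k uniformly chosen coordinates of x
    # and zero out the rest; rand_k(0) = 0 holds trivially.
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)  # omega: the random index set
    out[idx] = x[idx]
    return out

d, k = 100, 30
tau = k / d                                          # assumed compression parameter
x = rng.standard_normal(d)

# Monte-Carlo estimate of E||comp(x; omega) - x||; the first printed value
# should not exceed the second (up to Monte-Carlo error).
errs = [np.linalg.norm(rand_k(x, k, rng) - x) for _ in range(20000)]
print(np.mean(errs), np.sqrt(1 - tau) * np.linalg.norm(x))
\end{verbatim}

In the C-ECL, such an operator compresses the update of the dual variables, i.e., the difference R_{\alpha\partial\iota}\mathbf{y}^{(r+1)}-\mathbf{z}^{(r)} in Eq. (36).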

Lemma 1.

Under Assumption 4, the maximum singular value and the minimum singular value of \mathbf{A} are \sqrt{N_{\text{max}}} and \sqrt{N_{\text{min}}}, respectively.

Proof.

By the definition of \mathbf{A}, we have \mathbf{A}\mathbf{A}^{\top}=\text{diag}(|\mathcal{N}_{1}|\mathbf{I},\ldots,|\mathcal{N}_{N}|\mathbf{I}). This implies that \mathbf{A}\mathbf{A}^{\top} is not only a block diagonal matrix but also a diagonal matrix. Then, the eigenvalues of \mathbf{A}\mathbf{A}^{\top} are |\mathcal{N}_{1}|,\ldots,|\mathcal{N}_{N}|, and the singular values of \mathbf{A} are their square roots. This concludes the proof. ∎

Remark 1.

Under Assumption 4, the maximum singular value and the minimum singular value of \mathbf{A}^{\top} are \sqrt{N_{\text{max}}} and \sqrt{N_{\text{min}}}, respectively.
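
As a quick numerical sanity check of Lemma 1 and Remark 1, the snippet below builds \mathbf{A} for a small irregular graph, assuming a \pm\mathbf{I} block structure per directed edge so that \mathbf{A}\mathbf{A}^{\top}=\text{diag}(|\mathcal{N}_{1}|\mathbf{I},\ldots,|\mathcal{N}_{N}|\mathbf{I}), and compares the square roots of the eigenvalues of \mathbf{A}\mathbf{A}^{\top} with the square roots of the node degrees; the construction build_A and the example graph are illustrative assumptions.

\begin{verbatim}
import numpy as np

def build_A(edges, N, d):
    # +/-I block per directed edge (i|j); A A^T = diag(|N_1| I, ..., |N_N| I).
    directed = edges + [(j, i) for (i, j) in edges]
    A = np.zeros((N * d, len(directed) * d))
    for col, (i, j) in enumerate(directed):
        A[i * d:(i + 1) * d, col * d:(col + 1) * d] = (1 if i < j else -1) * np.eye(d)
    return A

N, d = 4, 3
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]          # node degrees: 2, 2, 3, 1
A = build_A(edges, N, d)
degrees = np.array([2, 2, 3, 1])

# np.linalg.svd returns the square roots of the eigenvalues of A A^T here.
sv = np.linalg.svd(A, compute_uv=False)
print(sorted(np.round(sv ** 2, 6)))               # each degree repeated d times
print(np.isclose(sv.max(), np.sqrt(degrees.max())),
      np.isclose(sv.min(), np.sqrt(degrees.min())))
\end{verbatim}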

Lemma 2.

Under Assumptions 2, 3, and 4, f^{\ast}(\mathbf{A}\cdot) is (N_{\text{max}}/\mu)-smooth and (N_{\text{min}}/L)-strongly convex.

Proof.

From Assumption 3, f^{\ast} is (1/\mu)-smooth and (1/L)-strongly convex. Then, for any \boldsymbol{\lambda} and \boldsymbol{\lambda}^{\prime}, by the (1/\mu)-smoothness of f^{\ast} and Lemma 1, we have

f^{\ast}(\mathbf{A}\boldsymbol{\lambda})\leq f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime})+\langle\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime}),\mathbf{A}\boldsymbol{\lambda}-\mathbf{A}\boldsymbol{\lambda}^{\prime}\rangle+\frac{1}{2\mu}\left\|\mathbf{A}\boldsymbol{\lambda}-\mathbf{A}\boldsymbol{\lambda}^{\prime}\right\|^{2}
\leq f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime})+\langle\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime}),\boldsymbol{\lambda}-\boldsymbol{\lambda}^{\prime}\rangle+\frac{N_{\text{max}}}{2\mu}\left\|\boldsymbol{\lambda}-\boldsymbol{\lambda}^{\prime}\right\|^{2}.

Because f^{\ast} is (1/L)-strongly convex, from Lemma 1, for any \boldsymbol{\lambda} and \boldsymbol{\lambda}^{\prime}, we have

f^{\ast}(\mathbf{A}\boldsymbol{\lambda})\geq f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime})+\langle\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime}),\mathbf{A}\boldsymbol{\lambda}-\mathbf{A}\boldsymbol{\lambda}^{\prime}\rangle+\frac{1}{2L}\left\|\mathbf{A}\boldsymbol{\lambda}-\mathbf{A}\boldsymbol{\lambda}^{\prime}\right\|^{2}
\geq f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime})+\langle\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime}),\boldsymbol{\lambda}-\boldsymbol{\lambda}^{\prime}\rangle+\frac{N_{\text{min}}}{2L}\left\|\boldsymbol{\lambda}-\boldsymbol{\lambda}^{\prime}\right\|^{2}.

This concludes the proof. ∎

We define \delta as follows:

\delta\coloneqq\max\left(\frac{\frac{\alpha N_{\text{max}}}{\mu}-1}{\frac{\alpha N_{\text{max}}}{\mu}+1},\frac{1-\frac{\alpha N_{\text{min}}}{L}}{1+\frac{\alpha N_{\text{min}}}{L}}\right).

Note that when Assumptions 2, 3, and 4 are satisfied and \alpha>0, we have 0\leq\delta<1.
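
The following small helper evaluates \delta as defined above for a few step sizes and checks that 0\leq\delta<1; the constants \mu, L, N_{\text{max}}, and N_{\text{min}} are placeholders.

\begin{verbatim}
import numpy as np

def delta(alpha, mu, L, n_max, n_min):
    # delta = max((alpha*N_max/mu - 1)/(alpha*N_max/mu + 1),
    #             (1 - alpha*N_min/L)/(1 + alpha*N_min/L))
    a, b = alpha * n_max / mu, alpha * n_min / L
    return max((a - 1) / (a + 1), (1 - b) / (1 + b))

mu, L, n_max, n_min = 0.5, 2.0, 4, 2              # placeholder problem constants
for alpha in (0.01, 0.1, 1.0, 10.0):
    d = delta(alpha, mu, L, n_max, n_min)
    print(alpha, round(d, 4), 0.0 <= d < 1.0)     # the last entry should be True
\end{verbatim}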

Lemma 3.

Under Assumptions 2, 3, and 4, R_{\alpha\nabla g} is \delta-contractive for any \alpha\in(0,\infty).

Proof.

The statement follows from [9, Theorem 1] and Lemma 2. ∎

Lemma 4.

Under Assumptions 2, 3, and 4, R_{\alpha\partial\iota}R_{\alpha\nabla g} is \delta-contractive for any \alpha\in(0,\infty).

Proof.

From [1, Corollary 23.9] and [1, Theorem 20.25], R_{\alpha\partial\iota} is nonexpansive for any \alpha\in(0,\infty). Then, from Lemma 3, for any \mathbf{z} and \mathbf{z}^{\prime}, we have

\|R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{\prime}\|\leq\|R_{\alpha\nabla g}\mathbf{z}-R_{\alpha\nabla g}\mathbf{z}^{\prime}\|
\leq\delta\|\mathbf{z}-\mathbf{z}^{\prime}\|.

This concludes the proof. ∎

In the following, we use the notation \mathbb{E}_{r}[\cdot] to denote the expectation over the randomness in round r.

Lemma 5.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). Under Assumptions 1, 2, 3, and 4, \mathbf{z}^{(r+1)} and \mathbf{z}^{(r)} generated by Eqs. (35-36) satisfy the following:

\mathbb{E}_{r}\|\mathbf{z}^{(r+1)}-\bar{\mathbf{z}}\|\leq\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|. (38)
Proof.

Combining Eq. (36) and Eq. (37), the update formula of \mathbf{z} can be written as follows:

\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\mathbf{z}^{(r)}).

Let \omega^{(r)} be the parameter used to compress (R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\mathbf{z}^{(r)}. In the following, to use Eq. (8) and Eq. (9) in Assumption 1, we rewrite the update formula of \mathbf{z} as follows:

\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\mathbf{z}^{(r)};\omega^{(r)}).

Because \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}), we have (R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\bar{\mathbf{z}}=\mathbf{0}. Under Assumption 1, because \text{comp}(\mathbf{0};\omega)=\mathbf{0} for any \omega, we have

\bar{\mathbf{z}}=\bar{\mathbf{z}}+\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\bar{\mathbf{z}};\omega^{(r)}).

We then have

\mathbb{E}_{r}\|\mathbf{z}^{(r+1)}-\bar{\mathbf{z}}\|
=\mathbb{E}_{r}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}+\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\mathbf{z}^{(r)};\omega^{(r)})-\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\bar{\mathbf{z}};\omega^{(r)})\|
\stackrel{(a)}{\leq}\|(1-\theta)(\mathbf{z}^{(r)}-\bar{\mathbf{z}})+\theta(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|
\qquad+\mathbb{E}_{r}\|\theta(\mathbf{z}^{(r)}-\bar{\mathbf{z}}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}+R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})
\qquad\qquad-\theta\,\text{comp}(\mathbf{z}^{(r)}-\bar{\mathbf{z}}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}+R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}};\omega^{(r)})\|
\stackrel{(b)}{\leq}\|(1-\theta)(\mathbf{z}^{(r)}-\bar{\mathbf{z}})+\theta(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|
\qquad+\sqrt{1-\tau}\|\theta(\mathbf{z}^{(r)}-\bar{\mathbf{z}})-\theta(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|
\leq\underbrace{\|(1-\theta)(\mathbf{z}^{(r)}-\bar{\mathbf{z}})+\theta(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|}_{T_{1}}
\qquad+\underbrace{\sqrt{1-\tau}\|\theta(\mathbf{z}^{(r)}-\bar{\mathbf{z}})+(1-\theta)(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|}_{T_{2}}
\qquad+\underbrace{\sqrt{1-\tau}\|R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}}\|}_{T_{3}},

where we use Assumption 1 for (a) and (b). From Lemma 4, T_{1}, T_{2}, and T_{3} are upper bounded as follows:

T_{1}\leq|1-\theta|\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|+\theta\|R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}}\|
\leq(|1-\theta|+\theta\delta)\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|,
T_{2}\leq\sqrt{1-\tau}(\theta\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|+|1-\theta|\|R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}}\|)
\leq\sqrt{1-\tau}(\theta+|1-\theta|\delta)\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|,
T_{3}\leq\sqrt{1-\tau}\,\delta\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|.

Therefore, we obtain

\mathbb{E}_{r}\|\mathbf{z}^{(r+1)}-\bar{\mathbf{z}}\|\leq\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|.

This concludes the proof. ∎

Lemma 6.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). Under Assumptions 1, 2, 3, and 4, when \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies the following:

\theta\in\left(\frac{2\delta\sqrt{1-\tau}}{(1-\delta)(1-\sqrt{1-\tau})},\frac{2}{(1+\delta)(1+\sqrt{1-\tau})}\right), (39)

\mathbf{z}^{(r+1)} and \mathbf{z}^{(r)} generated by Eqs. (35-36) satisfy the following:

\mathbb{E}_{r}\|\mathbf{z}^{(r+1)}-\bar{\mathbf{z}}\|\leq\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|<\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|. (40)
Proof.

When \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2}, the interval in Eq. (39) is non-empty and contains 1. Then, when \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies the following:

\theta\in\left(\frac{2\delta\sqrt{1-\tau}}{(1-\delta)(1-\sqrt{1-\tau})},1\right],

we have

|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)
=1+2\delta\sqrt{1-\tau}-\theta(1-\sqrt{1-\tau})(1-\delta)<1.

When \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies the following:

\theta\in\left[1,\frac{2}{(1+\delta)(1+\sqrt{1-\tau})}\right),

we have

|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)
=-1+\theta(1+\delta)(1+\sqrt{1-\tau})<1.

This concludes the proof. ∎

In the following, we define the operator T as follows:

T\mathbf{z}^{(r)}\coloneqq\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}{\left\|\mathbf{A}^{\top}\mathbf{w}-\frac{1}{\alpha}\mathbf{z}^{(r)}\right\|}^{2}\}.
Lemma 7.

Under Assumptions 2, 3, and 4, the operator T is (\sqrt{N_{\text{max}}}/(\mu+\alpha N_{\text{min}}))-Lipschitz continuous.

Proof.

We have

\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}{\left\|\mathbf{A}^{\top}\mathbf{w}-\frac{1}{\alpha}\mathbf{z}^{(r)}\right\|}^{2}\}
=\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}\|\mathbf{A}^{\top}\mathbf{w}\|^{2}-\langle\mathbf{w},\mathbf{A}\mathbf{z}^{(r)}\rangle\}
=\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}\|\mathbf{A}^{\top}\mathbf{w}\|^{2}-\frac{1}{2}\|\mathbf{w}\|^{2}+\frac{1}{2}\|\mathbf{w}-\mathbf{A}\mathbf{z}^{(r)}\|^{2}\}
=\text{prox}_{f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}}(\mathbf{A}\mathbf{z}^{(r)}).

From [1, Example 23.3], the proximal operator of (f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}) is the resolvent of \nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}), and this resolvent can be rewritten as follows:

J_{\nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2})}=(\text{Id}+\nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}))^{-1}
=(\nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}))^{-1}
=\nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2})^{\ast}.

From Assumption 3 and Remark 1, (f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}) is (\mu+\alpha N_{\text{min}})-strongly convex, so its convex conjugate (f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2})^{\ast} is (1/(\mu+\alpha N_{\text{min}}))-smooth, and hence the proximal operator of (f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}) is (1/(\mu+\alpha N_{\text{min}}))-Lipschitz continuous. Then, we have

\left\|\text{prox}_{f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}}(\mathbf{A}\mathbf{z})-\text{prox}_{f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}}(\mathbf{A}\mathbf{z}^{\prime})\right\|\leq\frac{1}{\mu+\alpha N_{\text{min}}}\|\mathbf{A}\mathbf{z}-\mathbf{A}\mathbf{z}^{\prime}\|
\stackrel{(a)}{\leq}\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\|\mathbf{z}-\mathbf{z}^{\prime}\|,

where we use Lemma 1 for (a). This concludes the proof. ∎
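
For the toy quadratic f(\mathbf{w})=\frac{1}{2}\|\mathbf{w}-\mathbf{b}\|^{2} (so that \mu=L=1), the operator T has the closed form T\mathbf{z}=(\mathbf{I}+\alpha\mathbf{A}\mathbf{A}^{\top})^{-1}(\mathbf{b}+\mathbf{A}\mathbf{z}), whose Lipschitz constant in \mathbf{z} is the spectral norm of (\mathbf{I}+\alpha\mathbf{A}\mathbf{A}^{\top})^{-1}\mathbf{A}, so Lemma 7's bound can be checked numerically. The graph, the \pm\mathbf{I} construction of \mathbf{A}, and all constants below are illustrative assumptions.

\begin{verbatim}
import numpy as np

def build_A(edges, N, d):
    # +/-I block per directed edge; same illustrative construction as above
    directed = edges + [(j, i) for (i, j) in edges]
    A = np.zeros((N * d, len(directed) * d))
    for col, (i, j) in enumerate(directed):
        A[i * d:(i + 1) * d, col * d:(col + 1) * d] = (1 if i < j else -1) * np.eye(d)
    return A

N, d, alpha = 4, 2, 0.3
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]          # degrees 2, 2, 3, 1 -> N_max=3, N_min=1
A = build_A(edges, N, d)
mu = 1.0                                          # f(w) = 0.5*||w - b||^2 is 1-strongly convex

# For this quadratic f, T z = (I + alpha*A*A^T)^{-1} (b + A z), so its
# Lipschitz constant is the spectral norm of (I + alpha*A*A^T)^{-1} A.
M = np.linalg.solve(np.eye(N * d) + alpha * A @ A.T, A)
lipschitz = np.linalg.norm(M, 2)
bound = np.sqrt(3) / (mu + alpha * 1)             # sqrt(N_max) / (mu + alpha*N_min)
print(round(lipschitz, 4), round(bound, 4), lipschitz <= bound + 1e-9)
\end{verbatim}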

Lemma 8.

Suppose that Assumptions 2, 3, and 4 hold. Let \boldsymbol{\lambda}^{\star} be the optimal solution of the dual problem of Eq. (25). Then, \nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star}) is the optimal solution of the primal problem of Eq. (24).

Proof.

By using the optimality condition [25, Theorem 27.1] of \boldsymbol{\lambda}^{\star}, we have

-\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star})\in\partial\iota(\boldsymbol{\lambda}^{\star}).

From the property of the convex conjugate function of \iota, we get

\boldsymbol{\lambda}^{\star}\in\partial\iota^{\ast}(-\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star})).

Then, by multiplying by \mathbf{A} and using the property of the convex conjugate function of f, we have

\nabla f(\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star}))\in\mathbf{A}\partial\iota^{\ast}(-\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star})).

From the above relation, \nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star}) satisfies the optimality condition of Eq. (24). This concludes the proof. ∎

Lemma 9.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}), and let \mathbf{w}^{\star} be the optimal solution of Eq. (2). Under Assumptions 1, 2, 3, and 4, \mathbf{w}^{(r+1)} and \mathbf{z}^{(r-1)} generated by Alg. 1 satisfy the following:

\mathbb{E}_{r-1}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|
\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r-1)}-\bar{\mathbf{z}}\|. (41)
Proof.

Let \boldsymbol{\lambda}^{\star} be the optimal solution of the dual problem of Eq. (25). We have \boldsymbol{\lambda}^{\star}=J_{\alpha\nabla g}\bar{\mathbf{z}} because \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). By Lemma 8 and the definition of \mathbf{w} in Eq. (30), \mathbf{w}^{\star} can be obtained from the fixed point \bar{\mathbf{z}} as \mathbf{w}^{\star}=T\bar{\mathbf{z}}. Then, we have

\mathbb{E}_{r-1}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|
\leq\mathbb{E}_{r-1}\|T\mathbf{z}^{(r)}-T\bar{\mathbf{z}}\|
\stackrel{(a)}{\leq}\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\mathbb{E}_{r-1}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|
\stackrel{(b)}{\leq}\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r-1)}-\bar{\mathbf{z}}\|,

where we use Lemma 7 for (a) and use Lemma 5 for (b). This concludes the proof. ∎

Lemma 10.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}), and let \mathbf{w}^{\star} be the optimal solution of Eq. (2). Under Assumptions 1, 2, 3, and 4, \mathbf{w}^{(r+1)} generated by Alg. 1 satisfies the following:

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|
\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\left\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\right\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|. (42)
Proof.

The statement follows from Lemma 9 and a recursive application of Lemma 5. ∎

Theorem 1.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). Under Assumptions 1, 2, 3, and 4, when \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies the following:

\theta\in\left(\frac{2\delta\sqrt{1-\tau}}{(1-\delta)(1-\sqrt{1-\tau})},\frac{2}{(1+\delta)(1+\sqrt{1-\tau})}\right), (43)

\mathbf{w}^{(r+1)} generated by Alg. 1 linearly converges to the optimal solution \mathbf{w}^{\star} as follows:

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|
\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\left\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\right\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|. (44)
Proof.

The statement follows from Lemma 6 and Lemma 10. ∎
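
To illustrate the linear convergence claimed in Theorem 1, the sketch below runs a C-ECL-style iteration on the same toy consensus problem as in Sec. B, compressing the stacked \mathbf{z}-update with the unscaled \textbf{rand}_{k} operator (for simplicity, the whole stacked vector is compressed at once rather than per edge, and \theta=0.5 is a conservative choice inside the admissible range). The construction of \mathbf{A} and \mathbf{P}, the objective, and all constants are illustrative assumptions; the printed error is expected to shrink toward zero.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def build_A_and_P(edges, N, d):
    # Same illustrative +/-I construction as in the ECL sketch of Sec. B.
    directed = edges + [(j, i) for (i, j) in edges]
    m = len(directed)
    A = np.zeros((N * d, m * d))
    P = np.zeros((m * d, m * d))
    for col, (i, j) in enumerate(directed):
        A[i * d:(i + 1) * d, col * d:(col + 1) * d] = (1 if i < j else -1) * np.eye(d)
        swap = directed.index((j, i))
        P[swap * d:(swap + 1) * d, col * d:(col + 1) * d] = np.eye(d)
    return A, P

def rand_k(x, k, rng):
    # Keep k randomly chosen coordinates of x, zero out the rest.
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

N, d = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
A, P = build_A_and_P(edges, N, d)
b = rng.standard_normal(N * d)                      # f_i(w_i) = 0.5 * ||w_i - b_i||^2
w_star = np.tile(b.reshape(N, d).mean(axis=0), N)   # consensus optimum of the toy problem

alpha, theta = 0.5, 0.5
k = A.shape[1] // 2                                 # keep 50% of the entries of the z-update
z = np.zeros(A.shape[1])

for r in range(201):
    w = np.linalg.solve(np.eye(N * d) + alpha * A @ A.T, b + A @ z)   # Eq. (32)
    y = z - 2 * alpha * A.T @ w                                       # Eq. (33)
    z = z + theta * rand_k(P @ y - z, k, rng)       # compressed z-update of the C-ECL
    if r % 40 == 0:
        print(r, np.linalg.norm(w - w_star))        # the error should shrink toward zero
\end{verbatim}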

Corollary 1.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). Under Assumptions 1, 2, 3, and 4, when \tau=1 and \theta satisfies the following:

\theta\in\left(0,\frac{2}{1+\delta}\right), (45)

\mathbf{w}^{(r+1)} generated by Alg. 1 linearly converges to the optimal solution \mathbf{w}^{\star} as follows:

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\left\{|1-\theta|+\theta\delta\right\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|.
Proof.

The statement follows from Theorem 1. ∎

Corollary 2.

Under Assumptions 1, 2, 3, and 4, when \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2}, the optimal convergence rate of Eq. (16) in the C-ECL is achieved when \theta=1.

Proof.

By Theorem 1, when \theta\leq 1, we have

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\{1+2\delta\sqrt{1-\tau}-\theta(1-\sqrt{1-\tau})(1-\delta)\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|.

Because (1-\sqrt{1-\tau})(1-\delta)>0, the factor \{1+2\delta\sqrt{1-\tau}-\theta(1-\sqrt{1-\tau})(1-\delta)\} decreases as \theta increases toward 1.

When \theta\geq 1, we have

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\{-1+\theta(1+\delta)(1+\sqrt{1-\tau})\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|.

Because (1+\delta)(1+\sqrt{1-\tau})>0, the factor \{-1+\theta(1+\delta)(1+\sqrt{1-\tau})\} decreases as \theta decreases toward 1. Therefore, the optimal convergence rate is achieved when \theta=1. ∎
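
The argument above can also be checked numerically by evaluating the rate factor \rho(\theta)=|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta) on a grid over the interval of Eq. (39); the values of \delta and \tau below are placeholders chosen so that \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2}.

\begin{verbatim}
import numpy as np

def rate(theta, delta, tau):
    # |1-theta| + theta*delta + sqrt(1-tau)*(theta + |1-theta|*delta + delta)
    s = np.sqrt(1 - tau)
    return abs(1 - theta) + theta * delta + s * (theta + abs(1 - theta) * delta + delta)

delta, tau = 0.2, 0.7                              # placeholders; here 1-((1-delta)/(1+delta))^2 ~= 0.556
lo = 2 * delta * np.sqrt(1 - tau) / ((1 - delta) * (1 - np.sqrt(1 - tau)))
hi = 2 / ((1 + delta) * (1 + np.sqrt(1 - tau)))
thetas = np.linspace(lo + 1e-6, hi - 1e-6, 10001)
rhos = np.array([rate(t, delta, tau) for t in thetas])
print(lo, hi)                                      # admissible interval of Eq. (39); contains 1
print(rhos.max() < 1.0)                            # the factor is < 1 on the whole interval (Lemma 6)
print(thetas[rhos.argmin()])                       # minimized at (approximately) theta = 1 (Corollary 2)
\end{verbatim}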

Corollary 3.

Under Assumptions 1, 2, 3, and 4, when \tau=1, the optimal convergence rate of Eq. (17) is achieved when \theta=1.

Proof.

The statement follows from Corollary 2. ∎

Appendix D Experimental Setting

D.1 Hyperparameter

In this section, we describe the detailed hyperparameters used in our experiments.

FashionMNIST: We set the learning rate to 0.001, the batch size to 100, and the number of epochs to 1500. To avoid overfitting, we use RandomCrop of PyTorch for data augmentation. In the ECL, following the previous work [23], we set \alpha as follows:

\alpha=\frac{1}{\eta|\mathcal{N}_{i}|(K-1)}, (46)

where K is the number of local steps. In the C-ECL, when we use \textbf{rand}_{k\%} as the compression operator, the number of local steps can be regarded as \frac{100K}{k}. Then, in the C-ECL, we set \alpha as follows:

\alpha=\frac{1}{\eta|\mathcal{N}_{i}|(\frac{100K}{k}-1)}. (47)

Note that, by the definition of \alpha in Eqs. (46-47), \alpha takes different values across the nodes. For the D-PSGD and the PowerGossip, we use the Metropolis-Hastings weights [36] as the edge weights.
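
As a concrete example of Eqs. (46)-(47), the helper below computes \alpha for nodes of different degrees; the values of \eta, K, and k are placeholders, not the settings used in the experiments.

\begin{verbatim}
def alpha_ecl(eta, degree, K):
    # Eq. (46): alpha = 1 / (eta * |N_i| * (K - 1))
    return 1.0 / (eta * degree * (K - 1))

def alpha_cecl(eta, degree, K, k_percent):
    # Eq. (47): with rand_{k%}, the effective number of local steps is 100*K/k
    return 1.0 / (eta * degree * (100.0 * K / k_percent - 1))

eta, K, k_percent = 0.001, 5, 10                   # placeholder constant, local steps, kept %
for degree in (1, 2, 3):
    print(degree, alpha_ecl(eta, degree, K), alpha_cecl(eta, degree, K, k_percent))
\end{verbatim}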

CIFAR10: We set the learning rate to 0.005, the batch size to 100, and the number of epochs to 2500. To avoid overfitting, we use RandomCrop and RandomHorizontalFlip of PyTorch for data augmentation. For the ECL and the C-ECL, we set \alpha in the same way as for FashionMNIST. For the D-PSGD and the PowerGossip, we use the Metropolis-Hastings weights as the edge weights.

D.2 Network Topology

Fig. 2 shows the network topologies used in our experiments.

Figure 2: Visualization of the network topologies: (a) Chain, (b) Ring, (c) Multiplex Ring, (d) Fully Connected Graph.