
Communication Compression for Decentralized Learning with Operator Splitting Methods

Yuki Takezawa
Kyoto University and RIKEN AIP
[email protected]

Kenta Niwa
NTT Communication Science Laboratories
[email protected]

Makoto Yamada
Kyoto University and RIKEN AIP
[email protected]
Abstract

In decentralized learning, operator splitting methods using a primal-dual formulation (e.g., the Edge-Consensus Learning (ECL)) have been shown to be robust to heterogeneous data and have attracted significant attention in recent years. However, in the ECL, a node needs to exchange dual variables with its neighbors, and these exchanges incur significant communication costs. For the Gossip-based algorithms, many compression methods have been proposed, but these Gossip-based algorithms do not perform well when the data distribution held by each node is statistically heterogeneous. In this work, we propose a novel framework of compression methods for the ECL, called the Communication Compressed ECL (C-ECL). Specifically, we reformulate the update formulas of the ECL and propose to compress the update values of the dual variables. We demonstrate experimentally that the C-ECL can achieve nearly equivalent performance with fewer parameter exchanges than the ECL. Moreover, we demonstrate that the C-ECL is more robust to heterogeneous data than the Gossip-based algorithms.

1 Introduction

In recent years, neural networks have shown promising results in various fields, including image processing [4, 7] and natural language processing [31, 6], and have thus attracted considerable attention. To train a neural network, we generally need to collect a large amount of training data. Owing to the use of crowdsourcing services, it is now easy to collect a large number of annotated images and texts. However, because of privacy concerns, it is difficult to collect a large amount of personal data, such as medical data including gene expression and medical images, on a single server. In such cases, decentralized learning, which aims to train a model without sharing the training data among servers, is a powerful tool. Decentralized learning was originally studied to train large-scale models in parallel, and because it allows us to train models without aggregating the training data, it has recently attracted significant attention from the perspective of privacy preservation.

One of the most widely used algorithms for decentralized learning is the Gossip-based algorithm [2, 18]. In the Gossip-based algorithm, each node (i.e., server) updates the model parameters using its own gradient, exchanges model parameters with its neighbors, and then takes the average value to reach a consensus. The Gossip-based algorithm is a simple yet effective approach. When the distribution of the data subset held by each node is statistically homogeneous, the Gossip-based algorithm can perform as well as the vanilla SGD, which trains the model on a single node using all of the training data. However, when the data distribution of each node is statistically heterogeneous (e.g., each node has only images of some classes and not others), the client-drift [11] occurs, and the Gossip-based algorithms do not perform well [30, 32].

Recently, an operator splitting method using a primal-dual formulation, called the Edge-Consensus Learning (ECL) [22], has been proposed. Primal-dual algorithms, including the Alternating Direction Method of Multipliers (ADMM) [3] and the Primal-Dual Method of Multipliers (PDMM) [38], can be applied to decentralized learning by representing the model consensus as linear constraints. It has recently been shown that the ADMM and the PDMM for decentralized learning can be derived by solving the dual problem using operator splitting methods [27] (e.g., the Douglas-Rachford splitting [8] and the Peaceman-Rachford splitting [24]). In addition, Niwa et al. [22, 23] applied these operator splitting methods to neural networks and named the resulting method the Edge-Consensus Learning (ECL). They showed that the ECL can be interpreted as a variance reduction method and is robust to heterogeneous data.

However, for both the Gossip-based algorithm and the ECL, each node needs to exchange the model parameters and/or dual variables with its neighbors, and such exchanges incur significant communication costs. Recently, many studies have proposed methods for compressing the parameters exchanged in the Gossip-based algorithm [29, 13, 21, 33] and showed that these compression methods can train a model with fewer parameter exchanges than the uncompressed Gossip-based algorithm. However, these compression methods are built on the Gossip-based algorithm and do not perform well when the data distribution of each node is statistically heterogeneous.

In this work, we propose a novel framework of compression methods for the ECL, which we refer to as the Communication Compressed ECL (C-ECL). Specifically, we reformulate the update formulas of the ECL and propose to compress the update values of the dual variables. Theoretically, we analyze how our proposed compression affects the convergence rate of the ECL and show that the C-ECL converges linearly to the optimal solution, as does the ECL. Experimentally, we show that the C-ECL can achieve almost the same accuracy as the ECL with fewer parameter exchanges. Furthermore, the experimental results show that the C-ECL is more robust to heterogeneous data than the Gossip-based algorithm, and that, when the data distribution of each node is statistically heterogeneous, the C-ECL can outperform the uncompressed Gossip-based algorithm in terms of both the accuracy and the communication costs.

Notation: In this work, \|\cdot\| denotes the L2 norm, \mathbf{0} denotes a vector with all zeros, and \mathbf{I} denotes an identity matrix.

2 Related Work

In this section, we briefly introduce the problem setting of decentralized learning, and then introduce the Gossip-based algorithm and the ECL.

2.1 Decentralized Learning

Let G=(\mathcal{V},\mathcal{E}) be an undirected connected graph that represents the network topology, where \mathcal{V} denotes the set of nodes and \mathcal{E} denotes the set of edges. For simplicity, we denote \mathcal{V} as the set of integers \{1,2,\ldots,|\mathcal{V}|\}. We denote the set of neighbors of the node i as \mathcal{N}_{i}=\{j\in\mathcal{V}|(i,j)\in\mathcal{E}\}. The goal of decentralized learning is formulated as follows:

\displaystyle\inf_{\mathbf{w}}\;\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}),\;\;f_{i}(\mathbf{w})\coloneqq\mathbb{E}_{\zeta_{i}\sim\mathcal{D}_{i}}[F(\mathbf{w};\zeta_{i})], (1)

where \mathbf{w}\in\mathbb{R}^{d} is the model parameter, F is the loss function, \mathcal{D}_{i} represents the data held by the node i, \zeta_{i} is the data sample from \mathcal{D}_{i}, and f_{i} is the loss function of the node i.

2.2 Gossip-based Algorithm

A widely used approach for decentralized learning is the Gossip-based algorithm (e.g., D-PSGD [18]). In the Gossip-based algorithm, each node i computes its own gradient \nabla f_{i}, exchanges parameters with its neighbors, and then takes their average. Although the Gossip-based algorithm is a simple and effective approach, there are two main issues: high communication costs and sensitivity to the heterogeneity of the data distribution.
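For reference in the rest of this section, the following is a minimal sketch of one synchronous D-PSGD round, i.e., a local SGD step followed by gossip averaging with the neighbors; the uniform mixing weights, the toy quadratic losses, and the constant step size are our own illustrative choices rather than details of [18].

```python
import numpy as np

def dpsgd_round(params, grads, neighbors, lr=0.1):
    """One synchronous D-PSGD round (sketch): local SGD step, then gossip averaging."""
    half_step = [w - lr * g for w, g in zip(params, grads)]
    new_params = []
    for i, w in enumerate(half_step):
        peers = [half_step[j] for j in neighbors[i]]
        new_params.append(np.mean([w] + peers, axis=0))  # uniform averaging with the neighbors
    return new_params

# Example: 4 nodes on a ring, each minimizing its own quadratic f_i(w) = (w - c_i)^2 / 2.
targets = np.array([1.0, 2.0, 3.0, 4.0])
params = [np.zeros(1) for _ in range(4)]
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
for _ in range(100):
    grads = [params[i] - targets[i] for i in range(4)]
    params = dpsgd_round(params, grads, ring)
print([round(p.item(), 3) for p in params])  # every node ends up close to the global mean 2.5
```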

In the Gossip-based algorithm, each node needs to receive the model parameters from its neighbors. Because the number of parameters in a neural network is large, these exchanges of the model parameters incur huge communication costs. To reduce the communication costs of the Gossip-based algorithm, many methods that compress the exchanged parameters by using sparsification, quantization, and low-rank approximation have been proposed [29, 13, 21, 12, 33]. It was shown that these compression methods can achieve almost the same accuracy as the uncompressed Gossip-based algorithm with fewer parameter exchanges.

The second issue is that the Gossip-based algorithm is sensitive to the heterogeneity of the data distribution. When the data distribution of each node is statistically heterogeneous, the optimal solution of the global loss function \sum_{i}f_{i} and the optimal solutions of the local loss functions f_{i} are far from each other, and the Gossip-based algorithms do not perform well [22, 32]. To address heterogeneous data in the Gossip-based algorithm, Tang et al. [30] and Xin et al. [37] applied variance reduction methods [5, 10] to the Gossip-based algorithm, Lorenzo and Scutari [20] proposed gradient tracking algorithms, and Vogels et al. [32] recently proposed the RelaySum.

2.3 Edge-Consensus Learning

In this section, we briefly introduce the Edge-Consensus Learning (ECL) [22]. By reformulating Eq. (1), the primal problem can be defined as follows:

\displaystyle\inf_{\{\mathbf{w}_{i}\}_{i}}\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})\;\;\text{s.t.}\;\;\mathbf{A}_{i|j}\mathbf{w}_{i}+\mathbf{A}_{j|i}\mathbf{w}_{j}=\mathbf{0},\;(\forall(i,j)\in\mathcal{E}), (2)

where \mathbf{A}_{i|j}=\mathbf{I} when j\in\mathcal{N}_{i} and i<j, and \mathbf{A}_{i|j}=-\mathbf{I} when j\in\mathcal{N}_{i} and i>j. Whereas the Gossip-based algorithm explicitly computes the average at each round, the primal problem of Eq. (2) represents the consensus through the linear constraints. Subsequently, by solving the dual problem of Eq. (2) using the Douglas-Rachford splitting [8], the update formulas can be derived as follows [27]:

\displaystyle\mathbf{w}^{(r+1)}_{i}=\text{argmin}_{\mathbf{w}_{i}}\{f_{i}(\mathbf{w}_{i})+\frac{\alpha}{2}\sum_{j\in\mathcal{N}_{i}}{\left\|\mathbf{A}_{i|j}\mathbf{w}_{i}-\frac{1}{\alpha}\mathbf{z}^{(r)}_{i|j}\right\|}^{2}\}, (3)
\displaystyle\mathbf{y}^{(r+1)}_{i|j}=\mathbf{z}^{(r)}_{i|j}-2\alpha\mathbf{A}_{i|j}\mathbf{w}^{(r+1)}_{i}, (4)
\displaystyle\mathbf{z}^{(r+1)}_{i|j}=(1-\theta)\mathbf{z}^{(r)}_{i|j}+\theta\mathbf{y}^{(r+1)}_{j|i}, (5)

where \theta\in(0,1] and \alpha>0 are hyperparameters, and \mathbf{y}_{i|j},\mathbf{z}_{i|j}\in\mathbb{R}^{d} are dual variables. We present the detailed derivation of these update formulas in Sec. B. When \theta=1, the Douglas-Rachford splitting is specifically called the Peaceman-Rachford splitting [24]. When f_{i} is non-convex (e.g., a loss function of a neural network), Eq. (3) cannot generally be solved exactly. Niwa et al. [22] therefore proposed the Edge-Consensus Learning (ECL), which approximately solves Eq. (3) as follows:

\displaystyle\mathbf{w}^{(r+1)}_{i}=\text{argmin}_{\mathbf{w}_{i}}\{\langle\mathbf{w}_{i},\nabla f_{i}(\mathbf{w}^{(r)}_{i})\rangle+\frac{1}{2\eta}\left\|\mathbf{w}_{i}-\mathbf{w}^{(r)}_{i}\right\|^{2}+\frac{\alpha}{2}\sum_{j\in\mathcal{N}_{i}}{\left\|\mathbf{A}_{i|j}\mathbf{w}_{i}-\frac{1}{\alpha}\mathbf{z}^{(r)}_{i|j}\right\|}^{2}\}, (6)

where \eta>0 corresponds to the learning rate. Niwa et al. [23] then showed that the ECL can be interpreted as a stochastic variance reduction method and demonstrated that the ECL is more robust to heterogeneous data than the Gossip-based algorithms. However, as shown in Eq. (5), the node i must receive the dual variable \mathbf{y}_{j|i} from its neighbor j in the ECL. Therefore, in the ECL as well as in the Gossip-based algorithm, large communication costs are incurred during training.
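For concreteness, the following is a minimal sketch of one ECL round at a single node. Because \mathbf{A}_{i|j}=\pm\mathbf{I}, the subproblem in Eq. (6) is quadratic in \mathbf{w}_{i} and can be solved in closed form; the closed-form solve, the grad_fi callback, and the per-neighbor sign scalars below are our own illustrative simplifications, not part of [22].

```python
import numpy as np

def ecl_local_step(w_i, z_i, signs, grad_fi, alpha=0.1, eta=0.1):
    """One node's primal and y-updates in the ECL (Eqs. 6 and 4), as a sketch.

    w_i:     current model parameter of node i
    z_i:     dict mapping neighbor j -> dual variable z_{i|j}
    signs:   dict mapping neighbor j -> +1.0 or -1.0 (the scalar behind A_{i|j} = sign * I)
    grad_fi: callable returning the stochastic gradient of f_i at w_i
    """
    # Eq. (6) is quadratic, so its minimizer satisfies
    # (1/eta + alpha * |N_i|) w = w_i / eta - grad_fi(w_i) + sum_j A_{i|j} z_{i|j}.
    rhs = w_i / eta - grad_fi(w_i) + sum(signs[j] * z_i[j] for j in z_i)
    w_new = rhs / (1.0 / eta + alpha * len(z_i))

    # Eq. (4): the dual message y_{i|j} sent to each neighbor j.
    y_out = {j: z_i[j] - 2.0 * alpha * signs[j] * w_new for j in z_i}
    return w_new, y_out

def ecl_z_update(z_i, y_in, theta=1.0):
    """Eq. (5): mix the received y_{j|i} into z_{i|j}."""
    return {j: (1.0 - theta) * z_i[j] + theta * y_in[j] for j in z_i}

# Tiny usage example: node i = 2 with neighbors 1 and 3, so A_{2|1} = -I and A_{2|3} = +I,
# and a quadratic local loss f_2(w) = ||w - c||^2 / 2.
c = np.array([1.0, -2.0])
w, z = np.zeros(2), {1: np.zeros(2), 3: np.zeros(2)}
signs = {1: -1.0, 3: 1.0}
w_new, y_out = ecl_local_step(w, z, signs, grad_fi=lambda w: w - c)
print(w_new, y_out)
```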

In addition to the ECL, other methods using primal-dual formulations have been proposed [17], and compression methods for these primal-dual algorithms have recently been studied [14, 19]. However, compression methods for the ECL have not been studied. In this work, we propose a compression method for the ECL, called the C-ECL, that can train a model with fewer parameter exchanges and is robust to heterogeneous data.

3 Proposed Method

3.1 Compression Operator

Before proposing the C-ECL, we first introduce the compression operator used in this work.

Assumption 1 (Compression Operator).

For some \tau\in(0,1], we assume that the compression operator \mathrm{comp}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} satisfies the following conditions:

\displaystyle\mathbb{E}_{\omega}\left\|\mathrm{comp}(\mathbf{x};\omega)-\mathbf{x}\right\|^{2}\leq(1-\tau)\left\|\mathbf{x}\right\|^{2}\quad(\forall\mathbf{x}\in\mathbb{R}^{d}), (7)
\displaystyle\mathrm{comp}(\mathbf{x}+\mathbf{y};\omega)=\mathrm{comp}(\mathbf{x};\omega)+\mathrm{comp}(\mathbf{y};\omega)\quad(\forall\omega,\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}), (8)
\displaystyle\mathrm{comp}(-\mathbf{x};\omega)=-\mathrm{comp}(\mathbf{x};\omega)\quad(\forall\omega,\forall\mathbf{x}\in\mathbb{R}^{d}), (9)

where \omega represents the parameter of the compression operator. In the following, we abbreviate \omega and write \mathrm{comp}(\mathbf{x}) as the operator containing the randomness.

The assumption of Eq. (7) is commonly used for the compression methods for the Gossip-based algorithms [21, 33, 13]. In addition, we assume that the compression operator satisfies Eqs. (8-9); both the low-rank approximation [33] and the following sparsification, which are used in the compression methods for the Gossip-based algorithm, satisfy Eqs. (8-9).

Example 1.

For some k\in(0,100], we define the operator \mathrm{rand}_{k\%}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} as follows:

\displaystyle\mathrm{rand}_{k\%}(\mathbf{x})\coloneqq\mathbf{s}\circ\mathbf{x}, (10)

where \circ is the Hadamard product and \mathbf{s}\in\{0,1\}^{d} is a uniformly sampled sparse vector whose elements are one with probability k\%. Here, the parameter \omega of the compression operator corresponds to the randomly sampled vector \mathbf{s}. Then, \mathrm{rand}_{k\%} satisfies Assumption 1 [28].
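A minimal sketch of \mathrm{rand}_{k\%} is given below, together with an empirical check of Eq. (7). Returning the mask alongside the compressed vector reflects the point used later in Sec. 3.2: Eq. (8) only holds when both arguments are compressed with the same \mathbf{s}. The function name and the dense mask representation are our own choices.

```python
import numpy as np

def rand_k_percent(x, k, rng):
    """rand_{k%} of Example 1: keep each coordinate with probability k/100 (sketch)."""
    s = (rng.random(x.shape) < k / 100.0).astype(x.dtype)  # the sparse vector s (= omega)
    return s * x, s

# Empirical check of Eq. (7): E||comp(x) - x||^2 <= (1 - tau) ||x||^2 with tau = k/100.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
k = 10
errors = [np.sum((rand_k_percent(x, k, rng)[0] - x) ** 2) for _ in range(2000)]
print(np.mean(errors) / np.sum(x ** 2))  # close to 1 - k/100 = 0.9
```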

3.2 Communication Compressed Edge-Consensus Learning

In this section, we propose the Communication Compressed ECL (C-ECL), a method for compressing the dual variables exchanged in the ECL by using the compression operator introduced in the previous section.

In the ECL, to update \mathbf{z}_{i|j} in Eq. (5), the node i needs to receive \mathbf{y}_{j|i} from the node j. Because the number of elements in \mathbf{y}_{j|i} is the same as that of the model parameter \mathbf{w}_{j}, this exchange incurs significant communication costs. A straightforward approach to reduce this communication cost is compressing \mathbf{y}_{j|i} in Eq. (5) as follows:

\displaystyle\mathbf{z}^{(r+1)}_{i|j}=(1-\theta)\mathbf{z}^{(r)}_{i|j}+\theta\;\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i}). (11)

However, we found experimentally that compressing \mathbf{y}_{j|i} does not work well. For the compression methods for the Gossip-based algorithm, Lu and De Sa [21] showed that the model parameters are not robust to compression. This is because the optimal solution of the model parameter is generally not zero, and thus the error caused by the compression does not approach zero even if the model parameters are near the optimal solution. Therefore, the successful compression methods for the Gossip-based algorithms compress the gradient \nabla f_{i}(\mathbf{w}_{i}) or the model difference (\mathbf{w}_{j}-\mathbf{w}_{i}), which approach zero when the model parameters are near the optimal solution [13, 21, 33].

Inspired by these compression methods for the Gossip-based algorithms, we reformulate Eq. (5) into Eq. (12) so that we can compress the parameters which approach zero when the model parameters are near the optimal solution.

\displaystyle\mathbf{z}^{(r+1)}_{i|j}=\mathbf{z}^{(r)}_{i|j}+\theta(\mathbf{y}^{(r+1)}_{j|i}-\mathbf{z}^{(r)}_{i|j}). (12)

In the Douglas-Rachford splitting, \mathbf{z}_{i|j} approaches the fixed point (i.e., \mathbf{z}_{i|j}^{(r)}=\mathbf{z}_{i|j}^{(r+1)}) when the model parameters approach the optimal solution (see Sec. A for details of the Douglas-Rachford splitting and the definition of the fixed point). Hence, from Eq. (12), when the model parameters approach the optimal solution, (\mathbf{y}_{j|i}-\mathbf{z}_{i|j}) in Eq. (12) approaches zero. Therefore, instead of compressing \mathbf{y}_{j|i} as in Eq. (11), we propose compressing (\mathbf{y}_{j|i}-\mathbf{z}_{i|j}) as follows:

\displaystyle\mathbf{z}^{(r+1)}_{i|j}=\mathbf{z}^{(r)}_{i|j}+\theta\;\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i}-\mathbf{z}^{(r)}_{i|j}) (13)
\displaystyle=\mathbf{z}^{(r)}_{i|j}+\theta\;(\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i})-\mathrm{comp}(\mathbf{z}^{(r)}_{i|j})),

where we use Assumption 1 in the last equation. When \mathrm{rand}_{k\%} is used as the compression operator, \mathbf{y}_{j|i} and \mathbf{z}_{i|j} must be compressed using the same sparse vector \mathbf{s} in Eq. (10) for Assumption 1 to apply. In Alg. 1, we provide the pseudo-code of the C-ECL. For simplicity, the node i and the node j exchange \omega_{i|j} and \omega_{j|i} at Lines 5-6 in Alg. 1. However, by sharing the same seed value used to generate \omega_{i|j} and \omega_{j|i} before starting the training, the node i and the node j can obtain the same values \omega_{i|j} and \omega_{j|i} without any exchanges. Moreover, when \mathrm{rand}_{k\%} is used as the compression operator, the node i can recover \omega_{i|j} from the received value \mathrm{comp}(\mathbf{y}_{j|i};\omega_{i|j}) because \mathrm{comp}(\mathbf{y}_{j|i};\omega_{i|j}) is stored in a sparse matrix format (e.g., the COO format). Thus, these exchanges of \omega_{i|j} and \omega_{j|i} can be omitted in practice. Therefore, in the C-ECL, each node only needs to exchange the compressed value of \mathbf{y}_{j|i}, and the C-ECL can train a model with fewer parameter exchanges than the ECL.

1:  for r=0 to R do
2:     \mathbf{w}^{(r+1)}_{i}\leftarrow\text{argmin}_{\mathbf{w}_{i}}\{f_{i}(\mathbf{w}_{i})+\frac{\alpha}{2}\sum_{j\in\mathcal{N}_{i}}{\left\|\mathbf{A}_{i|j}\mathbf{w}_{i}-\frac{1}{\alpha}\mathbf{z}^{(r)}_{i|j}\right\|}^{2}\}.
3:     for j\in\mathcal{N}_{i} do
4:        \mathbf{y}^{(r+1)}_{i|j}\leftarrow\mathbf{z}^{(r)}_{i|j}-2\alpha\mathbf{A}_{i|j}\mathbf{w}^{(r+1)}_{i}.
5:        \textbf{Receive}_{i\leftarrow j}(\omega_{i|j}^{(r+1)}). // This exchange can be omitted.
6:        \textbf{Transmit}_{i\rightarrow j}(\omega_{j|i}^{(r+1)}). // This exchange can be omitted.
7:        \textbf{Receive}_{i\leftarrow j}(\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i};\omega^{(r+1)}_{i|j})).
8:        \textbf{Transmit}_{i\rightarrow j}(\mathrm{comp}(\mathbf{y}^{(r+1)}_{i|j};\omega^{(r+1)}_{j|i})).
9:        \mathbf{z}^{(r+1)}_{i|j}\leftarrow\mathbf{z}^{(r)}_{i|j}+\theta\;\mathrm{comp}(\mathbf{y}^{(r+1)}_{j|i}-\mathbf{z}^{(r)}_{i|j};\omega^{(r+1)}_{i|j}).
10:     end for
11:  end for
Algorithm 1 Update procedure at the node i of the C-ECL.
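As a minimal illustration of Line 9 of Alg. 1, the sketch below implements the z-update with \mathrm{rand}_{k\%} as the compression operator, where the shared mask plays the role of \omega_{i|j}. The dense mask and the variable names are our own simplifications; in practice the received message would be stored in a sparse format (e.g., the COO format) so that only the nonzero entries are exchanged.

```python
import numpy as np

def cecl_z_update(z_ij, y_ji_compressed, mask, theta=1.0):
    """Line 9 of Alg. 1 with rand_{k%}: z_{i|j} <- z_{i|j} + theta * comp(y_{j|i} - z_{i|j}).

    Because rand_{k%} is linear (Eqs. 8-9), comp(y_{j|i} - z_{i|j}) = mask * y_{j|i} - mask * z_{i|j},
    so node i only needs the compressed message mask * y_{j|i} from node j.
    """
    return z_ij + theta * (y_ji_compressed - mask * z_ij)

rng = np.random.default_rng(0)
d, k = 8, 25
mask = (rng.random(d) < k / 100.0).astype(float)   # shared mask, i.e., omega_{i|j}
y_ji = rng.standard_normal(d)                      # held by node j
z_ij = rng.standard_normal(d)                      # held by node i
print(cecl_z_update(z_ij, mask * y_ji, mask))      # only mask * y_ji crosses the network
```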

4 Convergence Analysis

In this section, we analyze how the compression in the C-ECL affects the convergence rate of the ECL. Our convergence analysis is based on the analysis of the Douglas-Rachford splitting [9], and the proofs are presented in Sec. C.

4.1 Assumptions

In this section, we introduce additional notation and the assumptions used in our convergence analysis. We define N\coloneqq|\mathcal{V}|, N_{\text{min}}\coloneqq\min_{i}\{|\mathcal{N}_{i}|\}, and N_{\text{max}}\coloneqq\max_{i}\{|\mathcal{N}_{i}|\}, where \mathcal{N}_{i} is the set of neighbors of the node i. Letting \mathcal{N}_{i}(j) be the j-th smallest index of the nodes in \mathcal{N}_{i}, we define \mathbf{w}\in\mathbb{R}^{dN}, \mathbf{z}_{i}\in\mathbb{R}^{d|\mathcal{N}_{i}|}, and \mathbf{z}\in\mathbb{R}^{2d|\mathcal{E}|} as follows:

\displaystyle\mathbf{w}\coloneqq(\mathbf{w}_{1}^{\top},\ldots,\mathbf{w}_{N}^{\top})^{\top},\;\;\mathbf{z}_{i}\coloneqq(\mathbf{z}_{i|\mathcal{N}_{i}(1)}^{\top},\ldots,\mathbf{z}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}^{\top})^{\top},\;\;\mathbf{z}\coloneqq(\mathbf{z}_{1}^{\top},\ldots,\mathbf{z}_{N}^{\top})^{\top}. (14)

For simplicity, we drop the superscript of the round number r. Letting \{\mathbf{w}_{i}^{\star}\}_{i} be the optimal solution of Eq. (2), we define \mathbf{w}^{\star}\in\mathbb{R}^{dN} in the same manner as the definition of \mathbf{w} in Eq. (14). We define the loss function as f(\mathbf{w})\coloneqq\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i}). Next, we introduce the assumptions used in the convergence analysis.

Assumption 2.

We assume that f is proper, closed, and convex.

Assumption 3.

We assume that f is L-smooth and \mu-strongly convex with L>0 and \mu>0.

Assumption 4.

We assume that the graph G has no isolated nodes (i.e., N_{\text{min}}>0).

Assumptions 2 and 3 are standard assumptions used for the convergence analysis of operator splitting methods [9, 26]. Assumption 3 is weaker than assuming the smoothness and strong convexity of f_{i} for all i\in\mathcal{V}, which are commonly used in decentralized learning. Assumption 4 holds in general because decentralized learning assumes that the graph G is connected. In addition, we define \delta\in\mathbb{R} as follows:

\displaystyle\delta\coloneqq\max\left(\frac{\alpha N_{\text{max}}-\mu}{\alpha N_{\text{max}}+\mu},\frac{L-\alpha N_{\text{min}}}{L+\alpha N_{\text{min}}}\right).

Suppose that Assumptions 2, 3, and 4 hold and that \alpha\in(0,\infty); then \delta\in[0,1) holds because L\geq\mu>0 and N_{\text{max}}\geq N_{\text{min}}>0.

4.2 Convergence Rates

Theorem 1.

Let \bar{\mathbf{z}}\in\mathbb{R}^{2d|\mathcal{E}|} be the fixed point of the Douglas-Rachford splitting (a more detailed definition is given in Secs. A and C). Suppose that Assumptions 1, 2, 3, and 4 hold. If \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies

\displaystyle\theta\in\left(\frac{2\delta\sqrt{1-\tau}}{(1-\delta)(1-\sqrt{1-\tau})},\frac{2}{(1+\delta)(1+\sqrt{1-\tau})}\right), (15)

then \mathbf{w}^{(r+1)} generated by Alg. 1 converges linearly to the optimal solution \mathbf{w}^{\star} of Eq. (2) as follows:

\displaystyle\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\left\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\right\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|. (16)
Corollary 1.

Let \bar{\mathbf{z}}\in\mathbb{R}^{2d|\mathcal{E}|} be the fixed point of the Douglas-Rachford splitting. Under Assumptions 1, 2, 3, and 4, when \tau=1 and \theta\in(0,\frac{2}{1+\delta}), \mathbf{w}^{(r+1)} generated by Alg. 1 converges linearly to the optimal solution \mathbf{w}^{\star} of Eq. (2) as follows:

\displaystyle\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}(|1-\theta|+\theta\delta)^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|. (17)

Because \tau=1 implies that \mathrm{comp}(\mathbf{x})=\mathbf{x} in the C-ECL, Corollary 1 gives the convergence rate of the ECL under Assumptions 1, 2, 3, and 4, which is almost the same rate as that shown in the previous work [9]. Comparing the domains of \theta for which the ECL and the C-ECL converge, as \tau decreases, the domain in Eq. (15) becomes smaller. Moreover, for the domain in Eq. (15) to be non-empty, \tau must be greater than or equal to (1-(1-\delta)^{2}/(1+\delta)^{2}). Next, comparing the convergence rates of the ECL and the C-ECL, the compression in the C-ECL slows down the convergence rate of the ECL by the term (\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)). Moreover, similar to the convergence analysis of the Douglas-Rachford splitting [9], Theorem 1 and Corollary 1 imply that the optimal parameter \theta can be determined as follows:

Corollary 2.

Suppose that Assumptions 1, 2, 3, and 4 hold and that \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2}; then the optimal convergence rate of Eq. (16) in the C-ECL is achieved when \theta=1.

Corollary 3.

Suppose that Assumptions 1, 2, 3, and 4 hold and that \tau=1; then the optimal convergence rate of Eq. (17) is achieved when \theta=1.
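As a purely numerical illustration of Theorem 1 and Corollaries 1-3, the sketch below evaluates \delta, the lower bound on \tau, the admissible interval for \theta in Eq. (15), and the contraction factor appearing in Eq. (16); the helper name and the constants L, \mu, \alpha, and the node degrees are assumed values chosen only for illustration.

```python
import numpy as np

def cecl_rates(L, mu, alpha, n_min, n_max, theta, tau):
    """Evaluate the quantities appearing in Theorem 1 (illustrative sketch)."""
    delta = max((alpha * n_max - mu) / (alpha * n_max + mu),
                (L - alpha * n_min) / (L + alpha * n_min))
    tau_min = 1.0 - ((1.0 - delta) / (1.0 + delta)) ** 2          # condition on tau
    s = np.sqrt(1.0 - tau)
    theta_lo = 2.0 * delta * s / ((1.0 - delta) * (1.0 - s))      # Eq. (15), lower end
    theta_hi = 2.0 / ((1.0 + delta) * (1.0 + s))                  # Eq. (15), upper end
    rate = abs(1 - theta) + theta * delta + s * (theta + abs(1 - theta) * delta + delta)
    return delta, tau_min, (theta_lo, theta_hi), rate

# Assumed constants: L = 10, mu = 1, alpha = 1, all nodes of degree 2, theta = 1, tau = 0.99.
delta, tau_min, theta_range, rate = cecl_rates(10.0, 1.0, 1.0, 2, 2, theta=1.0, tau=0.99)
print(delta, tau_min, theta_range, rate)  # a rate below 1 corresponds to linear convergence
```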

5 Experiments

In this section, we demonstrate that the C-ECL can achieve almost the same performance as the ECL with fewer parameter exchanges. Furthermore, we show that the C-ECL is more robust to heterogeneous data than the Gossip-based algorithm.

5.1 Experimental Setting

Dataset and Model: We evaluate the C-ECL using FashionMNIST [35] and CIFAR10 [15], which are datasets for 10-class image-classification tasks. As the models for both datasets, we use 5-layer convolutional neural networks [16] with group normalization [34]. Following the previous work [22], we distribute the data to the nodes in two settings: the homogeneous setting and the heterogeneous setting. In the homogeneous setting, the data are distributed such that each node has the data of all 10 classes and approximately the same number of data of each class. In the heterogeneous setting, the data are distributed such that each node has the data of 8 randomly selected classes. In both settings, the data are distributed such that each node has the same number of data.

Network: In Sec. 5.2, we evaluate all comparison methods on a ring network consisting of eight nodes. In addition, in Sec. 5.3, we evaluate all comparison methods in four settings of the network topology: chain, ring, multiplex ring, and fully connected graph, where each setting consists of eight nodes. In Sec. D, we show a visualization of the network topologies. Each node exchanges parameters with its neighbors every five local updates. We implement all methods with PyTorch using gloo (https://pytorch.org/docs/stable/distributed.html) as the backend, and run all comparison methods on eight GPUs (NVIDIA RTX 3090).

Comparison Methods: (1) D-PSGD [18]: the uncompressed Gossip-based algorithm. (2) PowerGossip [33]: a Gossip-based algorithm that compresses the exchanged parameters by using a low-rank approximation. We use the PowerGossip as the compression method for the Gossip-based algorithm because the PowerGossip has been shown to achieve almost the same performance as other existing compression methods without additional hyperparameter tuning. (3) ECL [22, 23]: the primal-dual algorithm described in Sec. 2.3. Because Niwa et al. [22] showed that the ECL converges faster when \theta=1 than when \theta=0.5, we set \theta=1. (4) C-ECL: our proposed method described in Sec. 3. We use \mathrm{rand}_{k\%} as the compression operator. Following Corollary 2, we set \theta=1. We initialize \mathbf{z}_{i|j} and \mathbf{y}_{i|j} to zeros. However, we found that when we compress the update values of \mathbf{z}_{i|j} by using \mathrm{rand}_{k\%}, the convergence becomes slower because \mathbf{z}_{i|j} remains sparse in the early training stage. We therefore set k\% of \mathrm{rand}_{k\%} to 100\% only during the first epoch.

In addition, for reference, we show the results of the Stochastic Gradient Descent (SGD), in which the model is trained on a single node containing all the training data. In our experiments, we set the learning rate, the number of epochs, and the batch size to the same values for all comparison methods. In Sec. D, we show the detailed hyperparameters used for all comparison methods.

5.2 Experimental Results

In this section, we evaluate the accuracy and the communication costs when setting the network topology to be a ring.

Table 1: Test accuracy and communication costs on the homogeneous setting. For the C-ECL, the number in the bracket is k of \mathrm{rand}_{k\%}. For the PowerGossip, the number in the bracket is the number of power iteration steps. As the communication costs, the average amount of parameters sent per epoch is shown.

                  FashionMNIST                    CIFAR10
                  Accuracy   Send/Epoch           Accuracy   Send/Epoch
SGD               88.7       -                    75.7       -
D-PSGD            84.1       5336 KB (×1.0)       72.8       6255 KB (×1.0)
ECL               84.4       5336 KB (×1.0)       72.6       6255 KB (×1.0)
PowerGossip (1)   84.0       138 KB (×38.7)       72.0       135 KB (×46.3)
PowerGossip (10)  84.3       1079 KB (×5.0)       72.3       1102 KB (×5.7)
PowerGossip (20)  84.2       2124 KB (×2.5)       72.2       2175 KB (×2.9)
C-ECL (1%)        84.0       115 KB (×48.1)       71.5       132 KB (×47.4)
C-ECL (10%)       84.0       1075 KB (×5.1)       71.4       1257 KB (×5.0)
C-ECL (20%)       84.0       2142 KB (×2.5)       71.1       2507 KB (×2.5)
Table 2: Test accuracy and communication costs on the heterogeneous setting. For the C-ECL, the number in the bracket is k of \mathrm{rand}_{k\%}. For the PowerGossip, the number in the bracket is the number of power iteration steps. As the communication costs, the average amount of parameters sent per epoch is shown.

                  FashionMNIST                    CIFAR10
                  Accuracy   Send/Epoch           Accuracy   Send/Epoch
SGD               88.7       -                    75.7       -
D-PSGD            79.4       5336 KB (×1.0)       70.8       6155 KB (×1.0)
ECL               84.5       5336 KB (×1.0)       72.7       6155 KB (×1.0)
PowerGossip (1)   77.5       138 KB (×38.7)       64.3       133 KB (×46.3)
PowerGossip (10)  77.7       1079 KB (×5.0)       67.2       1084 KB (×5.7)
PowerGossip (20)  77.4       2124 KB (×2.5)       67.9       2141 KB (×2.9)
C-ECL (1%)        77.7       115 KB (×48.1)       60.2       129 KB (×47.7)
C-ECL (10%)       83.4       1075 KB (×5.1)       61.8       1237 KB (×5.0)
C-ECL (20%)       83.6       2142 KB (×2.5)       72.3       2467 KB (×2.5)

Homogeneous Setting: First, we discuss the results on the homogeneous setting. Table 1 shows the accuracy and the communication costs on the homogeneous setting. The results show that the D-PSGD and the ECL achieve almost the same accuracy on both datasets. The C-ECL and the PowerGossip are comparable and achieve almost the same accuracy as the ECL and the D-PSGD even when we set k\% of \mathrm{rand}_{k\%} to 1\% and the number of power iteration steps to 1, respectively. Therefore, the C-ECL can achieve comparable accuracy with approximately 50-times fewer parameter exchanges than the ECL and the D-PSGD on the homogeneous setting.

Heterogeneous Setting: Next, we discuss the results on the heterogeneous setting. Table 2 shows the accuracy and the communication costs on the heterogeneous setting. For the D-PSGD, the accuracy on the heterogeneous setting decreases by approximately 3\% compared to that on the homogeneous setting. For the PowerGossip, even if the number of power iteration steps is increased, the accuracy does not approach that of the D-PSGD and the ECL. On the other hand, the accuracy of the ECL is almost the same on both the homogeneous and heterogeneous settings, which indicates that the ECL is more robust to heterogeneous data than the D-PSGD. For the C-ECL, when we set k\% of \mathrm{rand}_{k\%} to 1\%, the accuracy on the heterogeneous setting decreases by approximately 10\% compared to that of the ECL. However, when we increase k\%, the C-ECL becomes competitive with the ECL and outperforms the D-PSGD and the PowerGossip: on FashionMNIST this already holds when we set k\% to 10\%, and on CIFAR10 when we set k\% to 20\%.

In summary, on the homogeneous setting, the C-ECL and the PowerGossip can achieve almost the same accuracy as the ECL and the D-PSGD with approximately 50-times fewer parameter exchanges. On the heterogeneous setting, the C-ECL can achieve almost the same accuracy as the ECL with approximately 4-times fewer parameter exchanges and can outperform the PowerGossip. Furthermore, the results show that the C-ECL can outperform the D-PSGD, the uncompressed Gossip-based algorithm, in terms of both the accuracy and the communication costs.

5.3 Network Topology

In this section, we show the accuracy and the communication costs when the network topology is varied. Table 3 and Fig. 1 show the communication costs and the accuracy on FashionMNIST when the network topology is varied as a chain, ring, multiplex ring, or fully connected graph.

On the homogeneous setting, Fig. 1 shows that the accuracies of all comparison methods are almost the same and reach that of the SGD on all network topologies. On the heterogeneous setting, Fig. 1 shows that the accuracies of the D-PSGD and the PowerGossip decrease compared to those on the homogeneous setting. On the other hand, the accuracy of the ECL is almost the same as that on the homogeneous setting on all network topologies. Furthermore, on all network topologies, the C-ECL achieves almost the same accuracy as the ECL with fewer parameter exchanges and consistently outperforms the PowerGossip. Moreover, the results show that, on the heterogeneous setting, the C-ECL can outperform the D-PSGD, the uncompressed Gossip-based algorithm, on all network topologies in terms of both the accuracy and the communication costs.

Figure 1: Test accuracy on FashionMNIST when varying the network topology: (a) homogeneous setting, (b) heterogeneous setting. We evaluate the average test accuracy of each node per 10 epochs.
Table 3: Communication costs on FashionMNIST when varying the network topology. As the communication costs, the average amount of parameters sent per epoch on the homogeneous and heterogeneous settings is shown.

                  Chain      Ring       Multiplex Ring   Fully Connected Graph
D-PSGD            4670 KB    5336 KB    10673 KB         18677 KB
ECL               4670 KB    5336 KB    10673 KB         18677 KB
PowerGossip (10)  944 KB     1078 KB    2158 KB          3776 KB
C-ECL (10%)       941 KB     1075 KB    2151 KB          3764 KB

6 Conclusion

In this work, we propose the Communication Compressed ECL (C-ECL), a novel framework of compression methods for the ECL. Specifically, we reformulate the update formula of the ECL and propose compressing the update values of the dual variables. Theoretically, we analyze the convergence rate of the C-ECL and show that the C-ECL converges linearly to the optimal solution, as does the ECL. Experimentally, we demonstrate that, even if the data distribution of each node is statistically heterogeneous, the C-ECL can achieve almost the same accuracy as the ECL with fewer parameter exchanges. Moreover, we show that, when the data distribution of each node is statistically heterogeneous, the C-ECL outperforms the uncompressed Gossip-based algorithm in terms of both the accuracy and the communication costs.

Acknowledgments

M.Y. was supported by MEXT KAKENHI Grant Number 20H04243.

References

  • Bauschke and Combettes, [2017] Bauschke, H. H. and Combettes, P. L. (2017). Convex analysis and monotone operator theory in Hilbert spaces. Springer, 2nd edition.
  • Boyd et al., [2006] Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. (2006). Randomized gossip algorithms. In IEEE Transactions on Information Theory.
  • Boyd et al., [2011] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
  • Chen et al., [2020] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning.
  • Defazio et al., [2014] Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems.
  • Devlin et al., [2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics.
  • Dosovitskiy et al., [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
  • Douglas and Rachford, [1956] Douglas, J. and Rachford, H. H. (1956). On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American mathematical Society.
  • Giselsson and Boyd, [2017] Giselsson, P. and Boyd, S. P. (2017). Linear convergence and metric selection for douglas-rachford splitting and ADMM. IEEE Transactions on Automatic Control.
  • Johnson and Zhang, [2013] Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems.
  • Karimireddy et al., [2020] Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. (2020). SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning.
  • Koloskova et al., [2020] Koloskova, A., Lin, T., Stich, S. U., and Jaggi, M. (2020). Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations.
  • Koloskova et al., [2019] Koloskova, A., Stich, S., and Jaggi, M. (2019). Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning.
  • Kovalev et al., [2021] Kovalev, D., Koloskova, A., Jaggi, M., Richtarik, P., and Stich, S. (2021). A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In International Conference on Artificial Intelligence and Statistics.
  • Krizhevsky, [2009] Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
  • LeCun et al., [1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. In IEEE.
  • Li et al., [2019] Li, Z., Shi, W., and Yan, M. (2019). A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. In IEEE Transactions on Signal Processing.
  • Lian et al., [2017] Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems.
  • Liu et al., [2021] Liu, X., Li, Y., Wang, R., Tang, J., and Yan, M. (2021). Linear convergent decentralized optimization with compression. In International Conference on Learning Representations.
  • Lorenzo and Scutari, [2016] Lorenzo, P. D. and Scutari, G. (2016). NEXT: in-network nonconvex optimization. In IEEE Transactions on Signal and Information Processing over Networks.
  • Lu and De Sa, [2020] Lu, Y. and De Sa, C. (2020). Moniqua: Modulo quantized communication in decentralized SGD. In International Conference on Machine Learning.
  • Niwa et al., [2020] Niwa, K., Harada, N., Zhang, G., and Kleijn, W. B. (2020). Edge-consensus learning: Deep learning on p2p networks with nonhomogeneous data. In International Conference on Knowledge Discovery and Data Mining.
  • Niwa et al., [2021] Niwa, K., Zhang, G., Kleijn, W. B., Harada, N., Sawada, H., and Fujino, A. (2021). Asynchronous decentralized optimization with implicit stochastic variance reduction. In International Conference on Machine Learning.
  • Peaceman and Rachford, [1955] Peaceman, D. W. and Rachford, H. H. (1955). The numerical solution of parabolic and elliptic differential equations. Journal of the Society for Industrial and Applied Mathematics.
  • Rockafellar, [2015] Rockafellar, R. T. (2015). Convex analysis. Princeton University Press.
  • Ryu and Boyd, [2015] Ryu, E. K. and Boyd, S. P. (2015). A primer on monotone operator methods. In Applied and Computational Mathematics.
  • Sherson et al., [2019] Sherson, T. W., Heusdens, R., and Kleijn, W. B. (2019). Derivation and analysis of the primal-dual method of multipliers based on monotone operator theory. In IEEE Transactions on Signal and Information Processing over Networks.
  • Stich et al., [2018] Stich, S. U., Cordonnier, J.-B., and Jaggi, M. (2018). Sparsified sgd with memory. In Advances in Neural Information Processing Systems.
  • [29] Tang, H., Gan, S., Zhang, C., Zhang, T., and Liu, J. (2018a). Communication compression for decentralized training. In Advances in Neural Information Processing Systems.
  • [30] Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. (2018b). D^{2}: Decentralized training over decentralized data. In International Conference on Machine Learning.
  • Vaswani et al., [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
  • Vogels et al., [2021] Vogels, T., He, L., Koloskova, A., Karimireddy, S. P., Lin, T., Stich, S. U., and Jaggi, M. (2021). Relaysum for decentralized deep learning on heterogeneous data. In Advances in Neural Information Processing Systems.
  • Vogels et al., [2020] Vogels, T., Karimireddy, S. P., and Jaggi, M. (2020). Powergossip: Practical low-rank communication compression in decentralized deep learning. In Advances in Neural Information Processing Systems.
  • Wu and He, [2018] Wu, Y. and He, K. (2018). Group normalization. In European Conference on Computer Vision.
  • Xiao et al., [2017] Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. In arXiv.
  • Xiao et al., [2007] Xiao, L., Boyd, S. P., and Kim, S. (2007). Distributed average consensus with least-mean-square deviation. In Journal of Parallel and Distributed Computing.
  • Xin et al., [2020] Xin, R., Khan, U. A., and Kar, S. (2020). Variance-reduced decentralized stochastic optimization with accelerated convergence. In IEEE Transactions on Signal and Information Processing over Networks.
  • Zhang and Heusdens, [2018] Zhang, G. and Heusdens, R. (2018). Distributed optimization using the primal-dual method of multipliers. In IEEE Transactions on Signal and Information Processing over Networks.

Appendix A Preliminary

In this section, we introduce the definitions used in the following section and briefly introduce the Douglas-Rachford splitting. (See [1, 26] for more details.)

A.1 Definition

Definition 1 (Smooth Function).

Let f:\mathcal{H}\rightarrow\mathbb{R}\cup\{\infty\} be a closed, proper, and convex function. f is L-smooth if f satisfies the following:

\displaystyle f(x)\leq f(y)+\langle\nabla f(y),x-y\rangle+\frac{L}{2}\left\|x-y\right\|^{2}\qquad(\forall x,y\in\mathcal{H}).
Definition 2 (Strongly Convex Function).

Let f:\mathcal{H}\rightarrow\mathbb{R}\cup\{\infty\} be a closed, proper, and convex function. f is \mu-strongly convex if f satisfies the following:

\displaystyle f(x)\geq f(y)+\langle\nabla f(y),x-y\rangle+\frac{\mu}{2}\left\|x-y\right\|^{2}\qquad(\forall x,y\in\mathcal{H}).
Definition 3 (Conjugate Function).

Let f:\mathcal{H}\rightarrow\mathbb{R}\cup\{\infty\}. The conjugate function of f is defined as follows:

\displaystyle f^{\ast}(y)=\sup_{x\in\mathcal{H}}(\langle y,x\rangle-f(x)).
Definition 4 (Nonexpansive Operator).

Let D be a non-empty subset of \mathcal{H}. An operator T:D\rightarrow\mathcal{H} is nonexpansive if T is 1-Lipschitz continuous.

Definition 5 (Contractive Operator).

Let D be a non-empty subset of \mathcal{H}. An operator T:D\rightarrow\mathcal{H} is \beta-contractive if T is Lipschitz continuous with constant \beta\in[0,1).

Definition 6 (Proximal Operator).

Let f:\mathcal{H}\rightarrow\mathbb{R}\cup\{\infty\} be a closed, proper, and convex function. The proximal operator of f is defined as follows:

\displaystyle\text{prox}_{f}(\mathbf{x})=\text{argmin}_{\mathbf{y}}\{f(\mathbf{y})+\frac{1}{2}\|\mathbf{y}-\mathbf{x}\|^{2}\}.
Definition 7 (Resolvent).

Let A:\mathcal{H}\rightarrow 2^{\mathcal{H}}; the resolvent of A is defined as follows:

\displaystyle J_{A}=(\text{Id}+A)^{-1}.
Definition 8 (Reflected Resolvent).

Let A:\mathcal{H}\rightarrow 2^{\mathcal{H}}; the reflected resolvent of A is defined as follows:

\displaystyle R_{A}=2J_{A}-\text{Id}.

A.2 Douglas-Rachford Splitting

In this section, we briefly introduce the Douglas-Rachford splitting [8].

The Douglas-Rachford splitting can be applied to the following problem:

\displaystyle\inf_{\mathbf{x}}f(\mathbf{x})+g(\mathbf{x}), (18)

where f and g are closed, proper, and convex functions. Letting \mathbf{x}^{\star} be the optimal solution of the above problem, the optimality condition can be written as follows:

\displaystyle\mathbf{0}\in\partial f(\mathbf{x}^{\star})+\partial g(\mathbf{x}^{\star}).

From [1, Proposition 26.1], the optimality condition above is equivalent to the following:

\displaystyle\mathbf{x}^{\star}=J_{\alpha\partial f}(\text{Fix}(R_{\alpha\partial g}R_{\alpha\partial f})),

where \text{Fix}(R_{\alpha\partial g}R_{\alpha\partial f})=\{\mathbf{z}|R_{\alpha\partial g}R_{\alpha\partial f}\mathbf{z}=\mathbf{z}\} and \alpha>0. The Douglas-Rachford splitting then computes the fixed point \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial g}R_{\alpha\partial f}) as follows:

\displaystyle\mathbf{z}^{(r+1)}=((1-\theta)\text{Id}+\theta R_{\alpha\partial g}R_{\alpha\partial f})\mathbf{z}^{(r)}, (19)

where \theta\in(0,1] is the hyperparameter of the Douglas-Rachford splitting. Under certain assumptions, ((1-\theta)\text{Id}+\theta R_{\alpha\partial g}R_{\alpha\partial f}) is contractive [9], and the update formula above is guaranteed to converge to the fixed point \bar{\mathbf{z}}. Then, after converging to the fixed point \bar{\mathbf{z}}, the Douglas-Rachford splitting obtains the optimal solution \mathbf{x}^{\star} as follows:

\displaystyle\mathbf{x}^{\star}=J_{\alpha\partial f}\bar{\mathbf{z}}. (20)

Moreover, by the definition of the reflected resolvent, the Douglas-Rachford splitting of Eq. (19) can be rewritten as follows:

\displaystyle\mathbf{x}^{(r+1)}=J_{\alpha\partial f}\mathbf{z}^{(r)}, (21)
\displaystyle\mathbf{y}^{(r+1)}=2\mathbf{x}^{(r+1)}-\mathbf{z}^{(r)}, (22)
\displaystyle\mathbf{z}^{(r+1)}=(1-\theta)\mathbf{z}^{(r)}+\theta R_{\alpha\partial g}\mathbf{y}^{(r+1)}. (23)
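As a self-contained illustration of Eqs. (19-23), the sketch below runs the Douglas-Rachford splitting on two strongly convex quadratics whose proximal operators have closed forms; the choice of f and g is ours, purely for illustration.

```python
import numpy as np

def prox_quadratic(z, center, alpha):
    """prox_{alpha f}(z) for f(x) = ||x - center||^2 / 2, in closed form."""
    return (z + alpha * center) / (1.0 + alpha)

def douglas_rachford(center_f, center_g, alpha=1.0, theta=0.5, iters=200):
    """Eqs. (21-23) for f(x) = ||x - center_f||^2 / 2 and g(x) = ||x - center_g||^2 / 2."""
    z = np.zeros_like(center_f)
    for _ in range(iters):
        x = prox_quadratic(z, center_f, alpha)               # Eq. (21): x = J_{alpha df}(z)
        y = 2.0 * x - z                                      # Eq. (22)
        r_g = 2.0 * prox_quadratic(y, center_g, alpha) - y   # reflected resolvent R_{alpha dg} y
        z = (1.0 - theta) * z + theta * r_g                  # Eq. (23)
    return prox_quadratic(z, center_f, alpha)                # Eq. (20): recover x* from the fixed point

a, b = np.array([0.0, 0.0]), np.array([2.0, 4.0])
print(douglas_rachford(a, b))  # approaches the minimizer of f + g, i.e., (a + b) / 2 = [1., 2.]
```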

Appendix B Derivation of Update Formulas of ECL

In this section, we briefly describe the derivation of the update formulas of the ECL. (See [22, 23, 27] for more details.)

B.1 Derivation of Dual Problem

First, we derive the Lagrangian function. The Lagrangian function of Eq. (2) can be derived as follows:

\displaystyle\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{ij},\mathbf{A}_{i|j}\mathbf{w}_{i}+\mathbf{A}_{j|i}\mathbf{w}_{j}\rangle
\displaystyle=\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{ij},\mathbf{A}_{i|j}\mathbf{w}_{i}\rangle-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{ij},\mathbf{A}_{j|i}\mathbf{w}_{j}\rangle
\displaystyle=\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{ij}+\boldsymbol{\lambda}_{ji},\mathbf{A}_{i|j}\mathbf{w}_{i}\rangle,

where \boldsymbol{\lambda}_{ij}\in\mathbb{R}^{d} is the dual variable. Defining \boldsymbol{\lambda}_{i|j}\coloneqq\boldsymbol{\lambda}_{ij}+\boldsymbol{\lambda}_{ji}, the Lagrangian function can be rewritten as follows:

\displaystyle\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i})-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{N}_{i}}\langle\boldsymbol{\lambda}_{i|j},\mathbf{A}_{i|j}\mathbf{w}_{i}\rangle.

Note that the definition of \boldsymbol{\lambda}_{i|j} implies that \boldsymbol{\lambda}_{i|j}=\boldsymbol{\lambda}_{j|i}. Let N be the number of nodes |\mathcal{V}|. Here, \mathcal{N}_{i}(j) denotes the j-th smallest index of the nodes in \mathcal{N}_{i}. We define \mathbf{w}\in\mathbb{R}^{dN}, \mathbf{A}\in\mathbb{R}^{dN\times 2d|\mathcal{E}|}, and \boldsymbol{\lambda}\in\mathbb{R}^{2d|\mathcal{E}|} as follows:

\displaystyle\mathbf{w}=\begin{pmatrix}\mathbf{w}_{1}^{\top},&\cdots,&\mathbf{w}_{N}^{\top}\end{pmatrix}^{\top},
\displaystyle\mathbf{A}_{i}=\begin{pmatrix}\mathbf{A}_{i|\mathcal{N}_{i}(1)},&\cdots,&\mathbf{A}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}\end{pmatrix},
\displaystyle\mathbf{A}=\text{diag}(\mathbf{A}_{1},\cdots,\mathbf{A}_{N}),
\displaystyle\boldsymbol{\lambda}_{i}=\begin{pmatrix}\boldsymbol{\lambda}_{i|\mathcal{N}_{i}(1)}^{\top},&\cdots,&\boldsymbol{\lambda}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}^{\top}\end{pmatrix}^{\top},
\displaystyle\boldsymbol{\lambda}=\begin{pmatrix}\boldsymbol{\lambda}_{1}^{\top},&\cdots,&\boldsymbol{\lambda}_{N}^{\top}\end{pmatrix}^{\top}.

We define the function f as follows:

\displaystyle f(\mathbf{w})=\sum_{i\in\mathcal{V}}f_{i}(\mathbf{w}_{i}).

The Lagrangian function can be rewritten as follows:

\displaystyle f(\mathbf{w})-\langle\boldsymbol{\lambda},\mathbf{A}^{\top}\mathbf{w}\rangle.

Then, the primal and dual problems can be defined as follows:

\displaystyle\inf_{\mathbf{w}}\sup_{\boldsymbol{\lambda}}f(\mathbf{w})-\langle\boldsymbol{\lambda},\mathbf{A}^{\top}\mathbf{w}\rangle-\iota(\boldsymbol{\lambda})=\inf_{\mathbf{w}}f(\mathbf{w})+\iota^{\ast}(-\mathbf{A}^{\top}\mathbf{w}), (24)
\displaystyle\sup_{\boldsymbol{\lambda}}\inf_{\mathbf{w}}f(\mathbf{w})-\langle\boldsymbol{\lambda},\mathbf{A}^{\top}\mathbf{w}\rangle-\iota(\boldsymbol{\lambda})=-\inf_{\boldsymbol{\lambda}}f^{\ast}(\mathbf{A}\boldsymbol{\lambda})+\iota(\boldsymbol{\lambda}), (25)

where \iota is the indicator function defined as follows:

\displaystyle\iota(\boldsymbol{\lambda})=\begin{cases}0&\text{if}\;\boldsymbol{\lambda}_{i|j}=\boldsymbol{\lambda}_{j|i},\;(\forall(i,j)\in\mathcal{E})\\ \infty&\text{otherwise}\end{cases}.

B.2 Derivation of Update Formulas

Next, we derive the update formulas of the ECL. By applying the Douglas-Rachford splitting of Eqs. (21-23) to the dual problem of Eq. (25), we obtain the following update formulas:

\displaystyle\boldsymbol{\lambda}^{(r+1)}=J_{\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\cdot)}\mathbf{z}^{(r)}, (26)
\displaystyle\mathbf{y}^{(r+1)}=2\boldsymbol{\lambda}^{(r+1)}-\mathbf{z}^{(r)}, (27)
\displaystyle\mathbf{z}^{(r+1)}=(1-\theta)\mathbf{z}^{(r)}+\theta R_{\alpha\partial\iota}\mathbf{y}^{(r+1)}, (28)

where \mathbf{z}\in\mathbb{R}^{2d|\mathcal{E}|} and \mathbf{y}\in\mathbb{R}^{2d|\mathcal{E}|} can be decomposed into \mathbf{z}_{i|j}\in\mathbb{R}^{d} and \mathbf{y}_{i|j}\in\mathbb{R}^{d} as follows:

\displaystyle\mathbf{z}_{i}=(\mathbf{z}_{i|\mathcal{N}_{i}(1)}^{\top},\ldots,\mathbf{z}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}^{\top})^{\top},
\displaystyle\mathbf{z}=(\mathbf{z}_{1}^{\top},\ldots,\mathbf{z}_{N}^{\top})^{\top},
\displaystyle\mathbf{y}_{i}=(\mathbf{y}_{i|\mathcal{N}_{i}(1)}^{\top},\ldots,\mathbf{y}_{i|\mathcal{N}_{i}(|\mathcal{N}_{i}|)}^{\top})^{\top},
\displaystyle\mathbf{y}=(\mathbf{y}_{1}^{\top},\ldots,\mathbf{y}_{N}^{\top})^{\top}.

Update formulas of Eqs. (26-27): By the definition of the resolvent J_{\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\cdot)}, we obtain

\displaystyle\boldsymbol{\lambda}^{(r+1)}+\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{(r+1)})=\mathbf{z}^{(r)},
\displaystyle\boldsymbol{\lambda}^{(r+1)}=\mathbf{z}^{(r)}-\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{(r+1)}). (29)

We define \mathbf{w}^{(r+1)} as follows:

\displaystyle\mathbf{w}^{(r+1)}=\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{(r+1)}). (30)

From the property of the convex conjugate function, we obtain

\displaystyle\mathbf{A}\boldsymbol{\lambda}^{(r+1)}=\nabla f(\mathbf{w}^{(r+1)}). (31)

Substituting Eqs. (29-30) into Eq. (31), we obtain

\displaystyle\mathbf{0}=\nabla f(\mathbf{w}^{(r+1)})+\mathbf{A}(\alpha\mathbf{A}^{\top}\mathbf{w}^{(r+1)}-\mathbf{z}^{(r)}).

We then obtain the update formula of 𝐰\mathbf{w} as follows:

\displaystyle\mathbf{w}^{(r+1)}=\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}\|\mathbf{A}^{\top}\mathbf{w}-\frac{1}{\alpha}\mathbf{z}^{(r)}\|^{2}\}.

Substituting Eq. (30) into Eq. (29), we obtain

\displaystyle\boldsymbol{\lambda}^{(r+1)}=\mathbf{z}^{(r)}-\alpha\mathbf{A}^{\top}\mathbf{w}^{(r+1)}.

Then, the update formula of Eq. (27) is written as follows:

\displaystyle\mathbf{y}^{(r+1)}=\mathbf{z}^{(r)}-2\alpha\mathbf{A}^{\top}\mathbf{w}^{(r+1)}.

Update formula of Eq. (28): From [27, Lemma VI.2], the update formula of Eq. (28) is rewritten as follows:

\displaystyle\mathbf{z}^{(r+1)}=(1-\theta)\mathbf{z}^{(r)}+\theta\mathbf{P}\mathbf{y}^{(r+1)},

where \mathbf{P} denotes the permutation matrix transforming \mathbf{y}_{i|j} into \mathbf{y}_{j|i} for all (i,j)\in\mathcal{E}.

In summary, the update formulas of the ECL can be derived as follows:

\displaystyle\mathbf{w}^{(r+1)}=\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}{\left\|\mathbf{A}^{\top}\mathbf{w}-\frac{1}{\alpha}\mathbf{z}^{(r)}\right\|}^{2}\}, (32)
\displaystyle\mathbf{y}^{(r+1)}=\mathbf{z}^{(r)}-2\alpha\mathbf{A}^{\top}\mathbf{w}^{(r+1)}, (33)
\displaystyle\mathbf{z}^{(r+1)}=(1-\theta)\mathbf{z}^{(r)}+\theta\mathbf{P}\mathbf{y}^{(r+1)}. (34)

Then, by rewriting \mathbf{w}, \mathbf{z}, and \mathbf{y} with \mathbf{w}_{i}, \mathbf{z}_{i|j}, and \mathbf{y}_{i|j}, respectively, we obtain the update formulas of Eqs. (3-5).
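To make the stacked notation concrete, the following is a small numerical sketch (our own, with d=1 and three nodes on a chain) of the matrices \mathbf{A} and \mathbf{P} defined above, using the directed-edge ordering of Sec. B.1:

```python
import numpy as np

# Directed-edge ordering for the stacked vectors: (i|j) for each node i and each
# neighbor j in increasing order (d = 1 for readability); here a 3-node chain 0-1-2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
pairs = [(i, j) for i in neighbors for j in neighbors[i]]

# A is block diagonal over nodes with A_{i|j} = +1 if i < j and -1 if i > j (Sec. B.1).
A = np.zeros((len(neighbors), len(pairs)))
for col, (i, j) in enumerate(pairs):
    A[i, col] = 1.0 if i < j else -1.0

# P permutes the (i|j) slot into the (j|i) slot ([27, Lemma VI.2]).
P = np.zeros((len(pairs), len(pairs)))
for col, (i, j) in enumerate(pairs):
    P[pairs.index((j, i)), col] = 1.0

y = np.arange(len(pairs), dtype=float)   # stand-in for the stacked y
print(A @ A.T)                           # diag(|N_0|, |N_1|, |N_2|) = diag(1, 2, 1)
print(P @ y)                             # each y_{i|j} swapped with y_{j|i}
```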

Appendix C Convergence Analysis of C-ECL

From the derivation of the ECL shown in Sec. B, the update formulas of Eq. (3), Eq. (4), and Eq. (13) in the C-ECL can be rewritten as follows:³

\mathbf{y}^{(r+1)}=R_{\alpha\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\cdot)}\mathbf{z}^{(r)}, (35)
\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\theta\,\text{comp}(R_{\alpha\partial\iota}\mathbf{y}^{(r+1)}-\mathbf{z}^{(r)}). (36)

³By the definition of the reflected resolvent, the update formulas of Eq. (26) and Eq. (27) are equivalent to that of Eq. (35).

To simplify the notation, we define g\coloneqq f^{\ast}(\mathbf{A}\cdot). Then, Eq. (35) is rewritten as follows:

\mathbf{y}^{(r+1)}=R_{\alpha\nabla g}\mathbf{z}^{(r)}. (37)

In the following, we analyze the convergence rate of the update formulas of Eq. (37) and Eq. (36).
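
Before stating the lemmas, it may help to see a concrete compression operator of the kind the analysis has in mind. The sketch below implements an unscaled \textbf{rand}_{k} operator and numerically checks a contraction-type property \mathbb{E}\|\text{comp}(\mathbf{x};\omega)-\mathbf{x}\|\leq\sqrt{1-\tau}\|\mathbf{x}\| with \tau=k/d, which is how Assumption 1 is used in step (b) of the proof of Lemma 5; the operator, the parameter names, and the choice \tau=k/d are illustrative assumptions rather than the paper's exact Assumption 1.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def rand_k(x, k, rng):
    # Unscaled rand_k sparsification: keep k uniformly chosen coordinates of x
    # and zero out the rest; rand_k(0) = 0 holds trivially.
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)  # omega: the random index set
    out[idx] = x[idx]
    return out

d, k = 100, 30
tau = k / d                                          # assumed compression parameter
x = rng.standard_normal(d)

# Monte-Carlo estimate of E||comp(x; omega) - x||; the first printed value
# should not exceed the second (up to Monte-Carlo error).
errs = [np.linalg.norm(rand_k(x, k, rng) - x) for _ in range(20000)]
print(np.mean(errs), np.sqrt(1 - tau) * np.linalg.norm(x))
\end{verbatim}

In the C-ECL, such an operator compresses the update of the dual variables, i.e., the difference R_{\alpha\partial\iota}\mathbf{y}^{(r+1)}-\mathbf{z}^{(r)} in Eq. (36).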

Lemma 1.

Under Assumption 4, the maximum singular value and the minimum singular value of \mathbf{A} are \sqrt{N_{\text{max}}} and \sqrt{N_{\text{min}}}, respectively.

Proof.

By the definition of \mathbf{A}, we have \mathbf{A}\mathbf{A}^{\top}=\text{diag}(|\mathcal{N}_{1}|\mathbf{I},\ldots,|\mathcal{N}_{N}|\mathbf{I}). This implies that \mathbf{A}\mathbf{A}^{\top} is not only a block diagonal matrix but also a diagonal matrix. Then, the eigenvalues of \mathbf{A}\mathbf{A}^{\top} are |\mathcal{N}_{1}|,\ldots,|\mathcal{N}_{N}|, and the singular values of \mathbf{A} are their square roots. This concludes the proof. ∎

Remark 1.

Under Assumption 4, the maximum singular value and the minimum singular value of \mathbf{A}^{\top} are \sqrt{N_{\text{max}}} and \sqrt{N_{\text{min}}}, respectively.
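
As a quick numerical sanity check of Lemma 1 and Remark 1, the snippet below builds \mathbf{A} for a small irregular graph, assuming a \pm\mathbf{I} block structure per directed edge so that \mathbf{A}\mathbf{A}^{\top}=\text{diag}(|\mathcal{N}_{1}|\mathbf{I},\ldots,|\mathcal{N}_{N}|\mathbf{I}), and compares the square roots of the eigenvalues of \mathbf{A}\mathbf{A}^{\top} with the square roots of the node degrees; the construction build_A and the example graph are illustrative assumptions.

\begin{verbatim}
import numpy as np

def build_A(edges, N, d):
    # +/-I block per directed edge (i|j); A A^T = diag(|N_1| I, ..., |N_N| I).
    directed = edges + [(j, i) for (i, j) in edges]
    A = np.zeros((N * d, len(directed) * d))
    for col, (i, j) in enumerate(directed):
        A[i * d:(i + 1) * d, col * d:(col + 1) * d] = (1 if i < j else -1) * np.eye(d)
    return A

N, d = 4, 3
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]          # node degrees: 2, 2, 3, 1
A = build_A(edges, N, d)
degrees = np.array([2, 2, 3, 1])

# np.linalg.svd returns the square roots of the eigenvalues of A A^T here.
sv = np.linalg.svd(A, compute_uv=False)
print(sorted(np.round(sv ** 2, 6)))               # each degree repeated d times
print(np.isclose(sv.max(), np.sqrt(degrees.max())),
      np.isclose(sv.min(), np.sqrt(degrees.min())))
\end{verbatim}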

Lemma 2.

Under Assumptions 2, 3, and 4, f^{\ast}(\mathbf{A}\cdot) is (N_{\text{max}}/\mu)-smooth and (N_{\text{min}}/L)-strongly convex.

Proof.

From Assumption 3, f^{\ast} is (1/\mu)-smooth and (1/L)-strongly convex. Then, for any \boldsymbol{\lambda} and \boldsymbol{\lambda}^{\prime}, by the (1/\mu)-smoothness of f^{\ast} and Lemma 1, we have

f^{\ast}(\mathbf{A}\boldsymbol{\lambda})\leq f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime})+\langle\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime}),\mathbf{A}\boldsymbol{\lambda}-\mathbf{A}\boldsymbol{\lambda}^{\prime}\rangle+\frac{1}{2\mu}\left\|\mathbf{A}\boldsymbol{\lambda}-\mathbf{A}\boldsymbol{\lambda}^{\prime}\right\|^{2}
\leq f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime})+\langle\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime}),\boldsymbol{\lambda}-\boldsymbol{\lambda}^{\prime}\rangle+\frac{N_{\text{max}}}{2\mu}\left\|\boldsymbol{\lambda}-\boldsymbol{\lambda}^{\prime}\right\|^{2}.

Because f^{\ast} is (1/L)-strongly convex, from Lemma 1, for any \boldsymbol{\lambda} and \boldsymbol{\lambda}^{\prime}, we have

f^{\ast}(\mathbf{A}\boldsymbol{\lambda})\geq f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime})+\langle\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime}),\mathbf{A}\boldsymbol{\lambda}-\mathbf{A}\boldsymbol{\lambda}^{\prime}\rangle+\frac{1}{2L}\left\|\mathbf{A}\boldsymbol{\lambda}-\mathbf{A}\boldsymbol{\lambda}^{\prime}\right\|^{2}
\geq f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime})+\langle\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\prime}),\boldsymbol{\lambda}-\boldsymbol{\lambda}^{\prime}\rangle+\frac{N_{\text{min}}}{2L}\left\|\boldsymbol{\lambda}-\boldsymbol{\lambda}^{\prime}\right\|^{2}.

This concludes the proof. ∎

We define \delta as follows:

\delta\coloneqq\max\left(\frac{\frac{\alpha N_{\text{max}}}{\mu}-1}{\frac{\alpha N_{\text{max}}}{\mu}+1},\frac{1-\frac{\alpha N_{\text{min}}}{L}}{1+\frac{\alpha N_{\text{min}}}{L}}\right).

Note that when Assumptions 2, 3, and 4 are satisfied and \alpha>0, we have 0\leq\delta<1.
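
The following small helper evaluates \delta as defined above for a few step sizes and checks that 0\leq\delta<1; the constants \mu, L, N_{\text{max}}, and N_{\text{min}} are placeholders.

\begin{verbatim}
import numpy as np

def delta(alpha, mu, L, n_max, n_min):
    # delta = max((alpha*N_max/mu - 1)/(alpha*N_max/mu + 1),
    #             (1 - alpha*N_min/L)/(1 + alpha*N_min/L))
    a, b = alpha * n_max / mu, alpha * n_min / L
    return max((a - 1) / (a + 1), (1 - b) / (1 + b))

mu, L, n_max, n_min = 0.5, 2.0, 4, 2              # placeholder problem constants
for alpha in (0.01, 0.1, 1.0, 10.0):
    d = delta(alpha, mu, L, n_max, n_min)
    print(alpha, round(d, 4), 0.0 <= d < 1.0)     # the last entry should be True
\end{verbatim}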

Lemma 3.

Under Assumptions 2, 3, and 4, R_{\alpha\nabla g} is \delta-contractive for any \alpha\in(0,\infty).

Proof.

The statement follows from [9, Theorem 1] and Lemma 2. ∎

Lemma 4.

Under Assumptions 2, 3, and 4, R_{\alpha\partial\iota}R_{\alpha\nabla g} is \delta-contractive for any \alpha\in(0,\infty).

Proof.

From [1, Corollary 23.9] and [1, Theorem 20.25], R_{\alpha\partial\iota} is nonexpansive for any \alpha\in(0,\infty). Then, from Lemma 3, for any \mathbf{z} and \mathbf{z}^{\prime}, we have

\|R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{\prime}\|\leq\|R_{\alpha\nabla g}\mathbf{z}-R_{\alpha\nabla g}\mathbf{z}^{\prime}\|
\leq\delta\|\mathbf{z}-\mathbf{z}^{\prime}\|.

This concludes the proof. ∎

In the following, we use the notation \mathbb{E}_{r}[\cdot] to denote the expectation over the randomness in round r.

Lemma 5.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). Under Assumptions 1, 2, 3, and 4, \mathbf{z}^{(r+1)} and \mathbf{z}^{(r)} generated by Eqs. (35-36) satisfy the following:

\mathbb{E}_{r}\|\mathbf{z}^{(r+1)}-\bar{\mathbf{z}}\|\leq\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|. (38)
Proof.

Combining Eq. (36) and Eq. (37), the update formula of \mathbf{z} can be written as follows:

\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\mathbf{z}^{(r)}).

Let \omega^{(r)} be the parameter used to compress (R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\mathbf{z}^{(r)}. In the following, to use Eq. (8) and Eq. (9) in Assumption 1, we rewrite the update formula of \mathbf{z} as follows:

\mathbf{z}^{(r+1)}=\mathbf{z}^{(r)}+\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\mathbf{z}^{(r)};\omega^{(r)}).

Because \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}), we have (R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\bar{\mathbf{z}}=\mathbf{0}. Under Assumption 1, because \text{comp}(\mathbf{0};\omega)=\mathbf{0} for any \omega, we have

\bar{\mathbf{z}}=\bar{\mathbf{z}}+\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\bar{\mathbf{z}};\omega^{(r)}).

We then have

\mathbb{E}_{r}\|\mathbf{z}^{(r+1)}-\bar{\mathbf{z}}\|
=\mathbb{E}_{r}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}+\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\mathbf{z}^{(r)};\omega^{(r)})-\theta\,\text{comp}((R_{\alpha\partial\iota}R_{\alpha\nabla g}-\text{Id})\bar{\mathbf{z}};\omega^{(r)})\|
\stackrel{(a)}{\leq}\|(1-\theta)(\mathbf{z}^{(r)}-\bar{\mathbf{z}})+\theta(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|
\qquad+\mathbb{E}_{r}\|\theta(\mathbf{z}^{(r)}-\bar{\mathbf{z}}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}+R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})
\qquad\qquad-\theta\,\text{comp}(\mathbf{z}^{(r)}-\bar{\mathbf{z}}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}+R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}};\omega^{(r)})\|
\stackrel{(b)}{\leq}\|(1-\theta)(\mathbf{z}^{(r)}-\bar{\mathbf{z}})+\theta(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|
\qquad+\sqrt{1-\tau}\|\theta(\mathbf{z}^{(r)}-\bar{\mathbf{z}})-\theta(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|
\leq\underbrace{\|(1-\theta)(\mathbf{z}^{(r)}-\bar{\mathbf{z}})+\theta(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|}_{T_{1}}
\qquad+\underbrace{\sqrt{1-\tau}\|\theta(\mathbf{z}^{(r)}-\bar{\mathbf{z}})+(1-\theta)(R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}})\|}_{T_{2}}
\qquad+\underbrace{\sqrt{1-\tau}\|R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}}\|}_{T_{3}},

where we use Assumption 1 for (a) and (b). From Lemma 4, T_{1}, T_{2}, and T_{3} are upper bounded as follows:

T_{1}\leq|1-\theta|\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|+\theta\|R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}}\|
\leq(|1-\theta|+\theta\delta)\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|,
T_{2}\leq\sqrt{1-\tau}(\theta\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|+|1-\theta|\|R_{\alpha\partial\iota}R_{\alpha\nabla g}\mathbf{z}^{(r)}-R_{\alpha\partial\iota}R_{\alpha\nabla g}\bar{\mathbf{z}}\|)
\leq\sqrt{1-\tau}(\theta+|1-\theta|\delta)\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|,
T_{3}\leq\sqrt{1-\tau}\,\delta\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|.

Therefore, we obtain

\mathbb{E}_{r}\|\mathbf{z}^{(r+1)}-\bar{\mathbf{z}}\|\leq\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|.

This concludes the proof. ∎

Lemma 6.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). Under Assumptions 1, 2, 3, and 4, when \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies the following:

\theta\in\left(\frac{2\delta\sqrt{1-\tau}}{(1-\delta)(1-\sqrt{1-\tau})},\frac{2}{(1+\delta)(1+\sqrt{1-\tau})}\right), (39)

\mathbf{z}^{(r+1)} and \mathbf{z}^{(r)} generated by Eqs. (35-36) satisfy the following:

\mathbb{E}_{r}\|\mathbf{z}^{(r+1)}-\bar{\mathbf{z}}\|\leq\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|<\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|. (40)
Proof.

When \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2}, the interval in Eq. (39) is non-empty and contains 1. Then, when \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies the following:

\theta\in\left(\frac{2\delta\sqrt{1-\tau}}{(1-\delta)(1-\sqrt{1-\tau})},1\right],

we have

|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)
=1+2\delta\sqrt{1-\tau}-\theta(1-\sqrt{1-\tau})(1-\delta)<1.

When \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies the following:

\theta\in\left[1,\frac{2}{(1+\delta)(1+\sqrt{1-\tau})}\right),

we have

|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)
=-1+\theta(1+\delta)(1+\sqrt{1-\tau})<1.

This concludes the proof. ∎

In the following, we define the operator T as follows:

T\mathbf{z}^{(r)}\coloneqq\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}{\left\|\mathbf{A}^{\top}\mathbf{w}-\frac{1}{\alpha}\mathbf{z}^{(r)}\right\|}^{2}\}.
Lemma 7.

Under Assumptions 2, 3, and 4, the operator T is (\sqrt{N_{\text{max}}}/(\mu+\alpha N_{\text{min}}))-Lipschitz continuous.

Proof.

We have

\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}{\left\|\mathbf{A}^{\top}\mathbf{w}-\frac{1}{\alpha}\mathbf{z}^{(r)}\right\|}^{2}\}
=\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}\|\mathbf{A}^{\top}\mathbf{w}\|^{2}-\langle\mathbf{w},\mathbf{A}\mathbf{z}^{(r)}\rangle\}
=\text{argmin}_{\mathbf{w}}\{f(\mathbf{w})+\frac{\alpha}{2}\|\mathbf{A}^{\top}\mathbf{w}\|^{2}-\frac{1}{2}\|\mathbf{w}\|^{2}+\frac{1}{2}\|\mathbf{w}-\mathbf{A}\mathbf{z}^{(r)}\|^{2}\}
=\text{prox}_{f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}}(\mathbf{A}\mathbf{z}^{(r)}).

From [1, Example 23.3], the proximal operator of (f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}) is the resolvent of \nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}), and this resolvent can be rewritten as follows:

J_{\nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2})}=(\text{Id}+\nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}))^{-1}
=(\nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}))^{-1}
=\nabla(f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2})^{\ast}.

From Assumption 3 and Remark 1, (f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}) is (\mu+\alpha N_{\text{min}})-strongly convex, so its convex conjugate (f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2})^{\ast} is (1/(\mu+\alpha N_{\text{min}}))-smooth, and hence the proximal operator of (f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}) is (1/(\mu+\alpha N_{\text{min}}))-Lipschitz continuous. Then, we have

\left\|\text{prox}_{f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}}(\mathbf{A}\mathbf{z})-\text{prox}_{f+\frac{\alpha}{2}\|\mathbf{A}^{\top}\cdot\|^{2}-\frac{1}{2}\|\cdot\|^{2}}(\mathbf{A}\mathbf{z}^{\prime})\right\|\leq\frac{1}{\mu+\alpha N_{\text{min}}}\|\mathbf{A}\mathbf{z}-\mathbf{A}\mathbf{z}^{\prime}\|
\stackrel{(a)}{\leq}\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\|\mathbf{z}-\mathbf{z}^{\prime}\|,

where we use Lemma 1 for (a). This concludes the proof. ∎
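
For the toy quadratic f(\mathbf{w})=\frac{1}{2}\|\mathbf{w}-\mathbf{b}\|^{2} (so that \mu=L=1), the operator T has the closed form T\mathbf{z}=(\mathbf{I}+\alpha\mathbf{A}\mathbf{A}^{\top})^{-1}(\mathbf{b}+\mathbf{A}\mathbf{z}), whose Lipschitz constant in \mathbf{z} is the spectral norm of (\mathbf{I}+\alpha\mathbf{A}\mathbf{A}^{\top})^{-1}\mathbf{A}, so Lemma 7's bound can be checked numerically. The graph, the \pm\mathbf{I} construction of \mathbf{A}, and all constants below are illustrative assumptions.

\begin{verbatim}
import numpy as np

def build_A(edges, N, d):
    # +/-I block per directed edge; same illustrative construction as above
    directed = edges + [(j, i) for (i, j) in edges]
    A = np.zeros((N * d, len(directed) * d))
    for col, (i, j) in enumerate(directed):
        A[i * d:(i + 1) * d, col * d:(col + 1) * d] = (1 if i < j else -1) * np.eye(d)
    return A

N, d, alpha = 4, 2, 0.3
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]          # degrees 2, 2, 3, 1 -> N_max=3, N_min=1
A = build_A(edges, N, d)
mu = 1.0                                          # f(w) = 0.5*||w - b||^2 is 1-strongly convex

# For this quadratic f, T z = (I + alpha*A*A^T)^{-1} (b + A z), so its
# Lipschitz constant is the spectral norm of (I + alpha*A*A^T)^{-1} A.
M = np.linalg.solve(np.eye(N * d) + alpha * A @ A.T, A)
lipschitz = np.linalg.norm(M, 2)
bound = np.sqrt(3) / (mu + alpha * 1)             # sqrt(N_max) / (mu + alpha*N_min)
print(round(lipschitz, 4), round(bound, 4), lipschitz <= bound + 1e-9)
\end{verbatim}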

Lemma 8.

Suppose that Assumptions 2, 3, and 4 hold. Let \boldsymbol{\lambda}^{\star} be the optimal solution of the dual problem of Eq. (25). Then, \nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star}) is the optimal solution of the primal problem of Eq. (24).

Proof.

By using the optimality condition [25, Theorem 27.1] of \boldsymbol{\lambda}^{\star}, we have

-\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star})\in\partial\iota(\boldsymbol{\lambda}^{\star}).

From the property of the convex conjugate function of \iota, we get

\boldsymbol{\lambda}^{\star}\in\partial\iota^{\ast}(-\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star})).

Then, by multiplying by \mathbf{A} and using the property of the convex conjugate function of f, we have

\nabla f(\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star}))\in\mathbf{A}\partial\iota^{\ast}(-\mathbf{A}^{\top}\nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star})).

From the above relation, \nabla f^{\ast}(\mathbf{A}\boldsymbol{\lambda}^{\star}) satisfies the optimality condition of Eq. (24). This concludes the proof. ∎

Lemma 9.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}), and let \mathbf{w}^{\star} be the optimal solution of Eq. (2). Under Assumptions 1, 2, 3, and 4, \mathbf{w}^{(r+1)} and \mathbf{z}^{(r-1)} generated by Alg. 1 satisfy the following:

\mathbb{E}_{r-1}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|
\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r-1)}-\bar{\mathbf{z}}\|. (41)
Proof.

Let \boldsymbol{\lambda}^{\star} be the optimal solution of the dual problem of Eq. (25). We have \boldsymbol{\lambda}^{\star}=J_{\alpha\nabla g}\bar{\mathbf{z}} because \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). By Lemma 8 and the definition of \mathbf{w} in Eq. (30), \mathbf{w}^{\star} can be obtained from the fixed point \bar{\mathbf{z}} as \mathbf{w}^{\star}=T\bar{\mathbf{z}}. Then, we have

\mathbb{E}_{r-1}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|
\leq\mathbb{E}_{r-1}\|T\mathbf{z}^{(r)}-T\bar{\mathbf{z}}\|
\stackrel{(a)}{\leq}\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\mathbb{E}_{r-1}\|\mathbf{z}^{(r)}-\bar{\mathbf{z}}\|
\stackrel{(b)}{\leq}\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\}\|\mathbf{z}^{(r-1)}-\bar{\mathbf{z}}\|,

where we use Lemma 7 for (a) and use Lemma 5 for (b). This concludes the proof. ∎

Lemma 10.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}), and let \mathbf{w}^{\star} be the optimal solution of Eq. (2). Under Assumptions 1, 2, 3, and 4, \mathbf{w}^{(r+1)} generated by Alg. 1 satisfies the following:

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|
\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\left\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\right\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|. (42)
Proof.

The statement follows from Lemma 9 and a recursive application of Lemma 5. ∎

Theorem 1.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). Under Assumptions 1, 2, 3, and 4, when \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2} and \theta satisfies the following:

\theta\in\left(\frac{2\delta\sqrt{1-\tau}}{(1-\delta)(1-\sqrt{1-\tau})},\frac{2}{(1+\delta)(1+\sqrt{1-\tau})}\right), (43)

\mathbf{w}^{(r+1)} generated by Alg. 1 linearly converges to the optimal solution \mathbf{w}^{\star} as follows:

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|
\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\left\{|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta)\right\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|. (44)
Proof.

The statement follows from Lemma 6 and Lemma 10. ∎
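
To illustrate the linear convergence claimed in Theorem 1, the sketch below runs a C-ECL-style iteration on the same toy consensus problem as in Sec. B, compressing the stacked \mathbf{z}-update with the unscaled \textbf{rand}_{k} operator (for simplicity, the whole stacked vector is compressed at once rather than per edge, and \theta=0.5 is a conservative choice inside the admissible range). The construction of \mathbf{A} and \mathbf{P}, the objective, and all constants are illustrative assumptions; the printed error is expected to shrink toward zero.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def build_A_and_P(edges, N, d):
    # Same illustrative +/-I construction as in the ECL sketch of Sec. B.
    directed = edges + [(j, i) for (i, j) in edges]
    m = len(directed)
    A = np.zeros((N * d, m * d))
    P = np.zeros((m * d, m * d))
    for col, (i, j) in enumerate(directed):
        A[i * d:(i + 1) * d, col * d:(col + 1) * d] = (1 if i < j else -1) * np.eye(d)
        swap = directed.index((j, i))
        P[swap * d:(swap + 1) * d, col * d:(col + 1) * d] = np.eye(d)
    return A, P

def rand_k(x, k, rng):
    # Keep k randomly chosen coordinates of x, zero out the rest.
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

N, d = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
A, P = build_A_and_P(edges, N, d)
b = rng.standard_normal(N * d)                      # f_i(w_i) = 0.5 * ||w_i - b_i||^2
w_star = np.tile(b.reshape(N, d).mean(axis=0), N)   # consensus optimum of the toy problem

alpha, theta = 0.5, 0.5
k = A.shape[1] // 2                                 # keep 50% of the entries of the z-update
z = np.zeros(A.shape[1])

for r in range(201):
    w = np.linalg.solve(np.eye(N * d) + alpha * A @ A.T, b + A @ z)   # Eq. (32)
    y = z - 2 * alpha * A.T @ w                                       # Eq. (33)
    z = z + theta * rand_k(P @ y - z, k, rng)       # compressed z-update of the C-ECL
    if r % 40 == 0:
        print(r, np.linalg.norm(w - w_star))        # the error should shrink toward zero
\end{verbatim}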

Corollary 1.

Let \bar{\mathbf{z}}\in\text{Fix}(R_{\alpha\partial\iota}R_{\alpha\nabla g}). Under Assumptions 1, 2, 3, and 4, when \tau=1 and \theta satisfies the following:

\theta\in\left(0,\frac{2}{1+\delta}\right), (45)

\mathbf{w}^{(r+1)} generated by Alg. 1 linearly converges to the optimal solution \mathbf{w}^{\star} as follows:

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\left\{|1-\theta|+\theta\delta\right\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|.
Proof.

The statement follows from Theorem 1. ∎

Corollary 2.

Under Assumptions 1, 2, 3, and 4, when \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2}, the optimal convergence rate of Eq. (16) in the C-ECL is achieved when \theta=1.

Proof.

By Theorem 1, when \theta\leq 1, we have

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\{1+2\delta\sqrt{1-\tau}-\theta(1-\sqrt{1-\tau})(1-\delta)\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|.

Because (1-\sqrt{1-\tau})(1-\delta)>0, the factor \{1+2\delta\sqrt{1-\tau}-\theta(1-\sqrt{1-\tau})(1-\delta)\} decreases as \theta increases toward 1.

When \theta\geq 1, we have

\mathbb{E}\|\mathbf{w}^{(r+1)}-\mathbf{w}^{\star}\|\leq\frac{\sqrt{N_{\text{max}}}}{\mu+\alpha N_{\text{min}}}\{-1+\theta(1+\delta)(1+\sqrt{1-\tau})\}^{r}\|\mathbf{z}^{(0)}-\bar{\mathbf{z}}\|.

Because (1+\delta)(1+\sqrt{1-\tau})>0, the factor \{-1+\theta(1+\delta)(1+\sqrt{1-\tau})\} decreases as \theta decreases toward 1. Therefore, the optimal convergence rate is achieved when \theta=1. ∎
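
The argument above can also be checked numerically by evaluating the rate factor \rho(\theta)=|1-\theta|+\theta\delta+\sqrt{1-\tau}(\theta+|1-\theta|\delta+\delta) on a grid over the interval of Eq. (39); the values of \delta and \tau below are placeholders chosen so that \tau\geq 1-(\frac{1-\delta}{1+\delta})^{2}.

\begin{verbatim}
import numpy as np

def rate(theta, delta, tau):
    # |1-theta| + theta*delta + sqrt(1-tau)*(theta + |1-theta|*delta + delta)
    s = np.sqrt(1 - tau)
    return abs(1 - theta) + theta * delta + s * (theta + abs(1 - theta) * delta + delta)

delta, tau = 0.2, 0.7                              # placeholders; here 1-((1-delta)/(1+delta))^2 ~= 0.556
lo = 2 * delta * np.sqrt(1 - tau) / ((1 - delta) * (1 - np.sqrt(1 - tau)))
hi = 2 / ((1 + delta) * (1 + np.sqrt(1 - tau)))
thetas = np.linspace(lo + 1e-6, hi - 1e-6, 10001)
rhos = np.array([rate(t, delta, tau) for t in thetas])
print(lo, hi)                                      # admissible interval of Eq. (39); contains 1
print(rhos.max() < 1.0)                            # the factor is < 1 on the whole interval (Lemma 6)
print(thetas[rhos.argmin()])                       # minimized at (approximately) theta = 1 (Corollary 2)
\end{verbatim}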

Corollary 3.

Under Assumptions 1, 2, 3, and 4, when \tau=1, the optimal convergence rate of Eq. (17) is achieved when \theta=1.

Proof.

The statement follows from Corollary 2. ∎

Appendix D Experimental Setting

D.1 Hyperparameter

In this section, we describe the detailed hyperparameters used in our experiments.

FashionMNIST: We set the learning rate to 0.001, the batch size to 100, and the number of epochs to 1500. To avoid overfitting, we use RandomCrop of PyTorch for data augmentation. In the ECL, following the previous work [23], we set \alpha as follows:

\alpha=\frac{1}{\eta|\mathcal{N}_{i}|(K-1)}, (46)

where K is the number of local steps. In the C-ECL, when we use \textbf{rand}_{k\%} as the compression operator, the number of local steps can be regarded as \frac{100K}{k}. Then, in the C-ECL, we set \alpha as follows:

\alpha=\frac{1}{\eta|\mathcal{N}_{i}|(\frac{100K}{k}-1)}. (47)

Note that, by the definition of \alpha in Eqs. (46-47), \alpha takes different values across the nodes. For the D-PSGD and the PowerGossip, we use the Metropolis-Hastings weights [36] as the edge weights.
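
As a concrete example of Eqs. (46)-(47), the helper below computes \alpha for nodes of different degrees; the values of \eta, K, and k are placeholders, not the settings used in the experiments.

\begin{verbatim}
def alpha_ecl(eta, degree, K):
    # Eq. (46): alpha = 1 / (eta * |N_i| * (K - 1))
    return 1.0 / (eta * degree * (K - 1))

def alpha_cecl(eta, degree, K, k_percent):
    # Eq. (47): with rand_{k%}, the effective number of local steps is 100*K/k
    return 1.0 / (eta * degree * (100.0 * K / k_percent - 1))

eta, K, k_percent = 0.001, 5, 10                   # placeholder constant, local steps, kept %
for degree in (1, 2, 3):
    print(degree, alpha_ecl(eta, degree, K), alpha_cecl(eta, degree, K, k_percent))
\end{verbatim}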

CIFAR10: We set the learning rate to 0.005, the batch size to 100, and the number of epochs to 2500. To avoid overfitting, we use RandomCrop and RandomHorizontalFlip of PyTorch for data augmentation. For the ECL and the C-ECL, we set \alpha in the same way as for FashionMNIST. For the D-PSGD and the PowerGossip, we use the Metropolis-Hastings weights as the edge weights.

D.2 Network Topology

Fig. 2 shows the network topologies used in our experiments.

Figure 2: Visualization of the network topologies: (a) Chain, (b) Ring, (c) Multiplex Ring, (d) Fully Connected Graph.