
Distributed Optimization of Clique-Wise Coupled Problems via Three-Operator Splitting

Yuto Watanabe, and Kazunori Sakurama Yuto Watanabe is with the Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093 USA (email: [email protected]).Kazunori Sakurama is with the Department of System Innovation, Graduate School of Engineering Science, Osaka University, 1-3, Machikaneyama, Toyonaka, Osaka 560-8531, Japan (email: [email protected]).This work was partially supported by the joint project of Kyoto University and Toyota Motor Corporation, titled “Advanced Mathematical Science for Mobility Society”.
Abstract

This study explores distributed optimization problems with clique-wise coupling via operator splitting and shows how this framework can be utilized for performance analysis and enhancement. This framework extends beyond conventional pairwise coupled problems (e.g., consensus optimization) and is applicable to broader examples. To this end, we first introduce a new distributed optimization algorithm by leveraging a clique-based matrix and the Davis-Yin splitting (DYS), a versatile three-operator splitting method. We then demonstrate that this approach sheds new light on conventional algorithms in the following ways: (i) existing algorithms (NIDS, Exact diffusion, diffusion, and our previous work) can be derived from our proposed method; (ii) we present a new mixing matrix based on clique-wise coupling, which surfaces when deriving the NIDS, and we prove that its eigenvalue distribution is preferable, enabling fast consensus; (iii) these observations yield a new linear convergence rate for the NIDS with non-smooth objective functions. Remarkably, our linear rate is the first established for the general DYS with a projection onto a subspace; to our knowledge, this case is not covered by any prior results. Finally, numerical examples showcase the efficacy of our proposed approach.

1 Introduction

The last two decades have witnessed significant advances in distributed optimization. In the literature, a large body of existing studies has been dedicated to pairwise coupled optimization problems, where every coupling of variables comprises the decision variables of two agents corresponding to the communication path (edge) between them. Representative examples are consensus optimization [22, 28, 38, 20, 1, 33] and formation control [26]. Problems with globally coupled linear constraints [21] also belong to this class because their dual problems reduce to pairwise coupled consensus optimization.

To handle wider applications that involve complex coupling beyond edges, we address a more generic class of distributed optimization: clique-wise coupled optimization problems. This class has been introduced in our recent works [30, 31] with an emphasis on its generality. In this note, we elucidate additional benefits of this class of problems for performance enhancement and analysis via a new algorithm based on a three-operator splitting [13]. The problem class is formulated as follows: Consider a multi-agent system with $n$ agents over a time-invariant undirected graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ with $\mathcal{N}=\{1,\ldots,n\}$ and an edge set $\mathcal{E}$. Let $x_i\in\mathbb{R}^{d_i}$ represent the $d_i$-dimensional decision variable of agent $i$. Then, the following is called a clique-wise coupled optimization problem:

\[
\min_{\substack{x_i\in\mathbb{R}^{d_i}\\ i\in\mathcal{N}}}\ \underbrace{\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\left(f_l(x_{\mathcal{C}_l})+g_l(x_{\mathcal{C}_l})\right)}_{\text{clique-wise coupling}}+\sum_{i=1}^{n}\left(\hat{f}_i(x_i)+\hat{g}_i(x_i)\right), \tag{2}
\]

where for all $l\in\mathcal{Q}_{\mathcal{G}}$, $f_l:\mathbb{R}^{\sum_{j\in\mathcal{C}_l}d_j}\to\mathbb{R}$ is $L_l$-smooth and convex, and $g_l:\mathbb{R}^{\sum_{j\in\mathcal{C}_l}d_j}\to\mathbb{R}$ is proper, closed, and convex. For all $i\in\mathcal{N}$, $\hat{f}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ is $\hat{L}_i$-smooth and convex, and $\hat{g}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ is proper, closed, and convex. For a set $\mathcal{C}_l=\{j_1,\ldots,j_{|\mathcal{C}_l|}\}\subset\mathcal{N}$, let $x_{\mathcal{C}_l}$ denote $x_{\mathcal{C}_l}=[x_{j_1}^{\top},\ldots,x_{j_{|\mathcal{C}_l|}}^{\top}]^{\top}$. Here, the set $\mathcal{C}_l$ represents a clique, i.e., a complete subgraph of $\mathcal{G}$ [7]. The set $\mathcal{Q}_{\mathcal{G}}\neq\emptyset$ is a subset of $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$, where $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$ is the index set of all the cliques in $\mathcal{G}$. For example, for the undirected graph in Fig. 1, we have $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}=\{1,\ldots,9\}$ with cliques $\mathcal{C}_1,\ldots,\mathcal{C}_9$.

Refer to caption
Figure 1: Sketches of (a) pairwise coupling and (b) clique-wise coupling.

As mentioned above, an immediate benefit of Problem (2) is that it can handle variable couplings of more than two agents. As shown in Fig. 1, the cliques in (b) can deal with the coupling of the three nodes $\{1,2,3\}$, unlike the pairwise coupling in (a). Indeed, Problem (2) always contains conventional pairwise coupled optimization problems since nodes and edges are also cliques. Beyond pairwise coupled problems, other possible applications include (i) clique-wise coupled linear constraints [30, 31, 19] (e.g., resource allocation in Section 6), (ii) sparse SDP [29] (e.g., distributed design of distributed controllers [42], sensor network localization [2], etc.), (iii) regularization accounting for network structures (e.g., network lasso [15]), and (iv) approximation of trace norm minimization problems (e.g., multi-task learning [40], robust PCA [9], etc.).

This note addresses Problem (2) using the Davis-Yin Splitting (DYS) and reveals that the notion of clique-wise coupling is beneficial for analyzing and improving convergence performance. The DYS is a versatile three-operator splitting scheme that generalizes basic operator-splitting methods (e.g., the forward-backward and Douglas-Rachford splittings). Firstly, we reformulate Problem (2) by introducing a matrix called the clique-wise duplication (CD) matrix. This matrix lifts Problem (2) to a tractable separated form for algorithm design. Then, applying the DYS, we derive the proposed algorithm called the clique-based distributed DYS (CD-DYS). Subsequently, we demonstrate that the CD-DYS generalizes several existing algorithms, encompassing the celebrated NIDS [20]. Then, we analyze a new mixing matrix that naturally comes up in deriving the NIDS and show a preferable distribution of its eigenvalues. Moreover, we present a new linear convergence rate for the NIDS with non-smooth terms by proving a more general linear rate for the DYS with a projection onto a subspace under strong convexity of the smooth term. Finally, numerical examples illustrate the effectiveness of the proposed approach.

Our contributions can be summarized as follows. (i) We propose a new distributed algorithm, CD-DYS, for Problem (2), applicable to broad examples ranging from optimization to control and learning problems. (ii) Our investigation of consensus optimization as a clique-wise coupled problem unveils that several conventional distributed optimization methods, including the NIDS [20], are derived from the proposed CD-DYS method, which leads to a new linear convergence rate for the NIDS with non-smooth objective functions. This linear rate admits larger stepsizes than those in [1, 33]. It is worth mentioning that our linear convergence is the first established for the general DYS with an indicator function of a linear image space, which does not follow from the prior works [13, 18, 37, 12] since indicator functions are neither smooth nor strongly convex. (iii) Numerical examples demonstrate the higher performance of our proposed approach compared with [20] and [21]. In particular, the superiority over the standard NIDS [20] is attributed to a novel mixing matrix obtained from our proposed method, which realizes a preferable eigenvalue distribution for fast consensus; we also provide theoretical evidence for this. Note that one can construct this matrix without global information and use it for other consensus-based algorithms.

The remainder of this note is organized as follows. Section 2 provides preliminaries. Section 3 presents the definition of the CD matrix and its analysis including a new mixing matrix. In Section 4, we propose new distributed algorithms based on the DYS. In Section 5, we analyze the proposed methods for consensus optimization and show a new linear convergence result. Section 6 presents numerical experiments. Section 7 provides the proof of the convergence rate. Finally, Section 8 concludes this note.

2 Preliminaries

We here prepare several important notions.

Notations

Throughout this note, we use the following notation. Let $I_d$ denote the $d\times d$ identity matrix in $\mathbb{R}^{d\times d}$; we omit the subscript $d$ when the dimension is obvious. Let $O_{d_1\times d_2}$ be the $d_1\times d_2$ zero matrix and $\mathbf{1}_d=[1,\ldots,1]^{\top}\in\mathbb{R}^{d}$. For $\mathcal{M}\subset\mathcal{N}$, $[x_j]_{j\in\mathcal{M}}$ and $x_{\mathcal{M}}$ represent the stacked vector, in ascending order of indices, obtained from the vectors $x_j\in\mathbb{R}^{d_j}$, $j\in\mathcal{M}$; we use the same notation for stacked matrices. Let $\mathrm{diag}(a)$ with $a=[a_1,\ldots,a_n]^{\top}$ denote the diagonal matrix whose $i$th diagonal entry is $a_i\in\mathbb{R}$. Similarly, $\mathrm{blk\text{-}diag}([\ldots,R_i,\ldots])$ and $\mathrm{blk\text{-}diag}([R_j]_{j\in\mathcal{M}})$ represent block diagonal matrices. For a symmetric matrix $Q\succ O$, let $\|u\|_Q=\sqrt{\langle u,u\rangle_Q}$ with the inner product $\langle u,v\rangle_Q:=v^{\top}Qu$, and we simply write $\|\cdot\|_{I_m}=\|\cdot\|$ for $Q=I_m$. Let $\lambda_{\mathrm{max}}(Q)$ and $\lambda_{\mathrm{min}}(Q)$ be the largest and smallest eigenvalues of $Q$, respectively. For a differentiable function $f:\mathbb{R}^{d}\to\mathbb{R}$ and $x\in\mathbb{R}^{d}$, we write $\nabla_x f(\cdot)=\partial f/\partial x(\cdot)$, and simply $\nabla$ when the variable is obvious. For a proper, closed, and convex function $g:\mathbb{R}^{d}\to\mathbb{R}\cup\{+\infty\}$ and $Q\succ O$, the proximal operator of $g$ with metric $Q$ is $\mathrm{prox}^{Q}_{g}(x)=\operatorname*{arg\,min}_{x^{\prime}\in\mathbb{R}^{d}}\{g(x^{\prime})+\|x-x^{\prime}\|_{Q}^{2}/2\}$, and we denote $\mathrm{prox}_{g}^{I}(\cdot)=\mathrm{prox}_{g}(\cdot)$ for $Q=I$. Let $\delta_{\mathcal{D}}(\cdot)$ represent the indicator function of $\mathcal{D}$, i.e., $\delta_{\mathcal{D}}(x)=0$ for $x\in\mathcal{D}$ and $\delta_{\mathcal{D}}(x)=\infty$ for $x\notin\mathcal{D}$. The projection onto a closed convex set $\mathcal{D}$ with metric $Q$ is $P^{Q}_{\mathcal{D}}(x)=\operatorname*{arg\,min}_{x^{\prime}\in\mathcal{D}}\|x-x^{\prime}\|_{Q}$, and we write $P^{I}_{\mathcal{D}}(\cdot)=P_{\mathcal{D}}(\cdot)$ for $Q=I$.

Graph theory

Consider a graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ with a node set $\mathcal{N}=\{1,\ldots,n\}$ and an edge set $\mathcal{E}$ consisting of pairs $\{i,j\}$ of distinct nodes $i,j\in\mathcal{N}$. Throughout this note, we consider undirected graphs and do not distinguish $\{i,j\}$ and $\{j,i\}$ for each $\{i,j\}\in\mathcal{E}$. For $i\in\mathcal{N}$ and $\mathcal{G}$, let $\mathcal{N}_i\subset\mathcal{N}$ be the neighbor set of node $i$ over $\mathcal{G}$, defined as $\mathcal{N}_i=\{j\in\mathcal{N}:\{i,j\}\in\mathcal{E}\}\cup\{i\}$.

For an undirected graph $\mathcal{G}$, consider a set $\mathcal{C}\subset\mathcal{N}$. The set $\mathcal{C}$ is called a clique of $\mathcal{G}$ if the subgraph of $\mathcal{G}$ induced by $\mathcal{C}$ is complete [7]. We define $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}=\{1,2,\ldots,q\}$ as the set of indices of all the cliques in $\mathcal{G}$, and $\mathcal{Q}_{\mathcal{G}}$ represents a subset of $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$. If a clique $\mathcal{C}$ is not contained in any other clique, $\mathcal{C}$ is said to be maximal. Let $\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}\subset\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$ be the set of indices of all the maximal cliques in $\mathcal{G}$, and let $\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$ be the index set of all the edges. For $\mathcal{Q}_{\mathcal{G}}\subset\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$ and $i\in\mathcal{N}$, we define $\mathcal{Q}_{\mathcal{G}}^{i}$ as the index set of all cliques in $\mathcal{Q}_{\mathcal{G}}$ containing $i$. Similarly, $\mathcal{Q}_{\mathcal{G}}^{ij}=\mathcal{Q}_{\mathcal{G}}^{ji}=\mathcal{Q}_{\mathcal{G}}^{i}\cap\mathcal{Q}_{\mathcal{G}}^{j}$. For each $i\in\mathcal{N}$, $\mathcal{N}_i$, and $\mathcal{C}_l,\,l\in\mathcal{Q}^{i}_{\mathcal{G}}$,

\[
\bigcup_{l\in\mathcal{Q}^{i}_{\mathcal{G}}}\mathcal{C}_l\subset\mathcal{N}_i, \tag{3}
\]

holds [26]. Note that agent $i$ can independently obtain $\mathcal{C}_l,\,l\in\mathcal{Q}^{i}_{\mathcal{G}}$ from the undirected subgraph induced by $\mathcal{N}_i$.

Operator splitting

Consider

\[
\min_{y\in\mathbb{R}^{d}}\ f(y)+g(y)+h(y), \tag{5}
\]

where $f:\mathbb{R}^{d}\to\mathbb{R}$ is a smooth convex function, and $g,h:\mathbb{R}^{d}\to\mathbb{R}\cup\{\infty\}$ are proper, closed, and convex functions. For this problem, the following versatile algorithm, called the (variable metric) Davis-Yin splitting (DYS), has been proposed in [13]:

\[
\begin{array}{l}
y^{k+1/2}=\mathrm{prox}_{\alpha h}^{M}(z^{k})\\
y^{k+1}=\mathrm{prox}_{\alpha g}^{M}\bigl(2y^{k+1/2}-z^{k}-\alpha M^{-1}\nabla f(y^{k+1/2})\bigr)\\
z^{k+1}=z^{k}+y^{k+1}-y^{k+1/2},
\end{array} \tag{9}
\]

where $M\in\mathbb{R}^{d\times d}$ is a positive definite symmetric matrix. The case $M=I$ corresponds to the standard DYS. This algorithm reduces to the Douglas-Rachford splitting when $f=0$ and to the forward-backward splitting when $g=0$. We have the following basic result, which states that $y^{k+1/2}$ and $y^{k+1}$ converge to a solution of (5) under an appropriate $\alpha>0$. For further convergence results, see [13, 24, 3, 18, 37, 12] and Subsection 5.2.

Lemma 1.

Suppose that $M^{-1/2}\nabla f(y)M^{-1/2}$ is $L$-Lipschitz continuous for positive definite $M$. Let $z^{0}\in\mathbb{R}^{d}$ and $\alpha\in(0,2/L)$. Assume that Problem (5) has an optimal solution. Then, $y^{k}$ and $y^{k+1/2}$ updated by (9) converge to an optimal solution of Problem (5).
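To make the order of the updates in (9) concrete, here is a minimal numerical sketch with $M=I$ in Python. The problem data (a quadratic $f$, the $\ell_1$ norm as $g$, and the nonnegativity indicator as $h$) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def davis_yin(grad_f, prox_g, prox_h, z0, alpha, num_iters=1000):
    """Standard DYS (9) with M = I; prox_g(v, a) should return prox_{a g}(v)."""
    z = z0.copy()
    for _ in range(num_iters):
        y_half = prox_h(z, alpha)                                        # y^{k+1/2}
        y = prox_g(2 * y_half - z - alpha * grad_f(y_half), alpha)       # y^{k+1}
        z = z + y - y_half                                               # z^{k+1}
    return y_half

# Toy instance: minimize 0.5*||y - b||^2 + ||y||_1 + indicator{y >= 0}
b = np.array([1.5, -0.3, 0.8])
y_opt = davis_yin(
    grad_f=lambda y: y - b,                                              # f = 0.5||y - b||^2, L = 1
    prox_g=lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a, 0.0),     # soft-thresholding
    prox_h=lambda v, a: np.maximum(v, 0.0),                              # projection onto the nonnegative orthant
    z0=np.zeros(3), alpha=1.0)                                           # alpha < 2/L
print(y_opt)                                                             # ~ [0.5, 0.0, 0.0]
```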

3 Clique-Wise Duplication Matrix

This section presents the definition and properties of the CD matrix $\mathbf{D}$, which allows us to leverage operator splitting techniques for Problem (2) in a distributed fashion. (Note that the matrix $\mathbf{D}$ itself is not new; the same or similar ideas can be found in other existing papers, e.g., on SDP [29, 42] and generalized Nash equilibrium seeking [6].) We also present a new mixing matrix $\boldsymbol{\Phi}$ built from the matrix $\mathbf{D}$, showing a preferable distribution of eigenvalues.

3.1 Fundamentals

The definition and essential properties of the CD matrix are presented in what follows.

First, we assume the non-emptiness of $\mathcal{Q}_{\mathcal{G}}^{i}$. If this assumption is not satisfied, we can alternatively consider the subgraph induced by the node set $\bigcup_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{C}_l$.

Assumption 1.

For all $i\in\mathcal{N}$, $\mathcal{Q}_{\mathcal{G}}^{i}\neq\emptyset$ holds.

Then, the definition of the CD matrix is given as follows. Here, $d_i$ for each $i\in\mathcal{N}$ is the size of $x_i$ in Problem (2), and we define

\[
d=\sum_{i=1}^{n}d_i,\quad d^{l}=\sum_{j\in\mathcal{C}_l}d_j,\quad \hat{d}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}d^{l}.
\]
Definition 1.

For $d_i,\,i\in\mathcal{N}$ and cliques $\mathcal{C}_l,\,l\in\mathcal{Q}_{\mathcal{G}}$ of the graph $\mathcal{G}$, the clique-wise duplication (CD) matrix $\mathbf{D}$ is defined as

\[
\mathbf{D}:=\begin{bmatrix}D_{1}\\ \vdots\\ D_{|\mathcal{Q}_{\mathcal{G}}|}\end{bmatrix}\in\mathbb{R}^{\hat{d}\times d}, \tag{10}
\]

where $D_l=[E_j]_{j\in\mathcal{C}_l}\in\mathbb{R}^{d^{l}\times d}$ and $E_j=[O_{d_j\times d_1},\ldots,I_{d_j},\ldots,O_{d_j\times d_n}]\in\mathbb{R}^{d_j\times d}$ for each $l\in\mathcal{Q}_{\mathcal{G}}$.

The CD matrix $\mathbf{D}$ can be interpreted as follows. For $\mathbf{x}=[x_1^{\top},\ldots,x_n^{\top}]^{\top}\in\mathbb{R}^{d}$, we have $\mathbf{D}\mathbf{x}=[x_{\mathcal{C}_l}]_{l\in\mathcal{Q}_{\mathcal{G}}}\in\mathbb{R}^{\hat{d}}$ since $D_l\mathbf{x}=x_{\mathcal{C}_l}\in\mathbb{R}^{d^{l}}$. Hence, the CD matrix $\mathbf{D}$ generates the copies of $\mathbf{x}$ with respect to the cliques $\mathcal{C}_l,\,l\in\mathcal{Q}_{\mathcal{G}}$.

The following lemma provides the fundamental properties of the CD matrix. Let the matrix $E_{l,i}\in\mathbb{R}^{d_i\times d^{l}}$ be

\[
E_{l,i}=[O_{d_i\times d_{j_1}},\ldots,I_{d_i},\ldots,O_{d_i\times d_{j_{|\mathcal{C}_l|}}}]\in\mathbb{R}^{d_i\times d^{l}} \tag{11}
\]

for $\mathcal{C}_l=\{j_1,\ldots,i,\ldots,j_{|\mathcal{C}_l|}\},\,l\in\mathcal{Q}_{\mathcal{G}}^{i}$. This matrix $E_{l,i}$ fulfills $E_{l,i}x_{\mathcal{C}_l}=x_i$ for $x_{\mathcal{C}_l}$ and $i\in\mathcal{C}_l$.

Lemma 2.

Under Assumption 1, the following hold.

  • (a)

    $\mathbf{D}$ has full column rank.

  • (b)

    $\mathbf{D}^{\top}\mathbf{D}=\mathrm{blk\text{-}diag}(|\mathcal{Q}_{\mathcal{G}}^{1}|I_{d_1},\ldots,|\mathcal{Q}_{\mathcal{G}}^{n}|I_{d_n})\succ O$.

  • (c)

    For $\mathbf{y}=[y_l]_{l\in\mathcal{Q}_{\mathcal{G}}}\in\mathbb{R}^{\hat{d}}$ with $y_l\in\mathbb{R}^{d^{l}}$,

    \[
    \mathbf{D}^{\top}\mathbf{y}=\begin{bmatrix}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{1}}E_{l,1}y_l\\ \vdots\\ \sum_{l\in\mathcal{Q}_{\mathcal{G}}^{n}}E_{l,n}y_l\end{bmatrix}\in\mathbb{R}^{d}. \tag{12}
    \]

Using the CD matrix and (3), we can compute, in a distributed fashion, the least-squares solution of $\mathbf{y}=\mathbf{D}\mathbf{x}$, i.e.,

\[
\mathbf{x}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}, \tag{13}
\]

and the projection of $\mathbf{y}$ onto $\mathrm{Im}(\mathbf{D})$ as

\[
P_{\mathrm{Im}(\mathbf{D})}(\mathbf{y})=\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}. \tag{14}
\]
Example 1.

Consider $\mathcal{G}=(\mathcal{N},\mathcal{E})$ with $\mathcal{N}=\{1,2,3\}$ and $\mathcal{E}=\{\{1,2\},\{2,3\}\}$. Let $d_1=d_2=d_3=1$ and $\mathcal{Q}_{\mathcal{G}}=\{1,2\}$ with $\mathcal{C}_1=\{1,2\}$ and $\mathcal{C}_2=\{2,3\}$. Then, we obtain $\mathcal{Q}_{\mathcal{G}}^{1}=\{1\}$, $\mathcal{Q}_{\mathcal{G}}^{2}=\{1,2\}$, and $\mathcal{Q}_{\mathcal{G}}^{3}=\{2\}$, which ensures Assumption 1. For this system, the CD matrix is $\mathbf{D}=[D_1^{\top},D_2^{\top}]^{\top}\in\mathbb{R}^{4\times 3}$, where $D_1=\begin{bmatrix}1&0&0\\0&1&0\end{bmatrix}$ and $D_2=\begin{bmatrix}0&1&0\\0&0&1\end{bmatrix}$. We then obtain $D_1\mathbf{x}=[x_1,x_2]^{\top}$ and $D_2\mathbf{x}=[x_2,x_3]^{\top}$ for $\mathbf{x}=[x_1,x_2,x_3]^{\top}\in\mathbb{R}^{3}$. Moreover, $\mathbf{D}^{\top}\mathbf{D}=D_1^{\top}D_1+D_2^{\top}D_2=\mathrm{diag}(1,2,1)=\mathrm{diag}(|\mathcal{Q}_{\mathcal{G}}^{1}|,|\mathcal{Q}_{\mathcal{G}}^{2}|,|\mathcal{Q}_{\mathcal{G}}^{3}|)$, and

\[
\mathbf{D}^{\top}\mathbf{y}=D_1^{\top}y_1+D_2^{\top}y_2=\begin{bmatrix}y_{1,1}\\ y_{1,2}+y_{2,1}\\ y_{2,2}\end{bmatrix}=\begin{bmatrix}E_{1,1}y_1\\ E_{1,2}y_1+E_{2,2}y_2\\ E_{2,3}y_2\end{bmatrix}
\]

for any vector $\mathbf{y}=[y_1^{\top},y_2^{\top}]^{\top}\in\mathbb{R}^{4}$ with $y_1=[y_{1,1},y_{1,2}]^{\top}\in\mathbb{R}^{2}$ and $y_2=[y_{2,1},y_{2,2}]^{\top}\in\mathbb{R}^{2}$, which can be computed in a distributed fashion.
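Example 1 can also be reproduced numerically. The following sketch (scalar agents, $d_i=1$) builds $\mathbf{D}$ from a clique list and checks Lemma 2(b) as well as the least-squares solution (13) and the projection (14).

```python
import numpy as np

def cd_matrix(cliques, n):
    """CD matrix D in (10) for scalar agents: stack the selector rows E_j, j in C_l, clique by clique."""
    rows = []
    for C in cliques:
        for j in C:                                   # agents are numbered 1, ..., n
            e = np.zeros(n)
            e[j - 1] = 1.0
            rows.append(e)
    return np.vstack(rows)

cliques = [(1, 2), (2, 3)]                            # C_1 = {1,2}, C_2 = {2,3} as in Example 1
D = cd_matrix(cliques, n=3)
print(D.T @ D)                                        # diag(1, 2, 1) = diag(|Q^1|, |Q^2|, |Q^3|), Lemma 2(b)

y = np.array([1.0, 2.0, 3.0, 4.0])                    # y = [y_1; y_2], clique-wise copies
x_ls = np.linalg.solve(D.T @ D, D.T @ y)              # least-squares solution (13): [1.0, 2.5, 4.0]
print(x_ls, D @ x_ls)                                 # D x_ls is the projection of y onto Im(D), (14)
```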

3.2 Useful properties

Here, we provide useful properties of the CD matrix $\mathbf{D}$.

The following result shows that gradients and proximal operators involving $\mathbf{D}$ can be computed in a distributed fashion. Here, the $i$th block $x_i$ of $\mathbf{x}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}$ is given by $x_i=E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}y_l$ from Lemma 2.

Proposition 1.

Let $\mathbf{y}\in\mathbb{R}^{\hat{d}}$. Then, under Assumption 1, the following equations hold.

  • (a)

    Let $\hat{g}_i:\mathbb{R}^{d_i}\to\mathbb{R}\cup\{\infty\}$ be a proper, closed, and convex function for each $i\in\mathcal{N}$. Define $G:\mathbb{R}^{\hat{d}}\to\mathbb{R}\cup\{\infty\}$ as $G(\mathbf{z})=\delta_{\mathrm{Im}(\mathbf{D})}(\mathbf{z})+\sum_{i=1}^{n}\hat{g}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z})$. Let $\alpha>0$. Then,

    \[
    \mathrm{prox}_{\alpha G}(\mathbf{y})=\mathbf{D}\begin{bmatrix}\mathrm{prox}_{\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{1}|}\hat{g}_1}(E_1(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\\ \vdots\\ \mathrm{prox}_{\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{n}|}\hat{g}_n}(E_n(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\end{bmatrix}. \tag{15}
    \]
  • (b)

    Let $\mathbf{Q}=\mathrm{blk\text{-}diag}([Q_l]_{l\in\mathcal{Q}_{\mathcal{G}}})$, where $Q_l=\mathrm{blk\text{-}diag}([\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}I_{d_j}]_{j\in\mathcal{C}_l})$ for each $l\in\mathcal{Q}_{\mathcal{G}}$. Then,

    \[
    \mathrm{prox}_{\alpha G}^{\mathbf{Q}}(\mathbf{y})=\mathbf{D}\begin{bmatrix}\mathrm{prox}_{\alpha\hat{g}_1}(E_1(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\\ \vdots\\ \mathrm{prox}_{\alpha\hat{g}_n}(E_n(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\end{bmatrix}. \tag{16}
    \]
  • (c)

    Let $\hat{f}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ be a differentiable function. Then,

    \[
    \frac{\partial}{\partial\mathbf{y}}\sum_{i=1}^{n}\hat{f}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})
    =\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}\begin{bmatrix}\nabla_{x_1}\hat{f}_1(E_1(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\\ \vdots\\ \nabla_{x_n}\hat{f}_n(E_n(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\end{bmatrix}. \tag{17}
    \]
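As a quick sanity check of (15), the sketch below takes $\hat{g}_i(x_i)=|x_i|$ on the Example 1 graph and compares the closed form with a direct numerical minimization of $\alpha G(\mathbf{z})+\|\mathbf{y}-\mathbf{z}\|^{2}/2$ over $\mathbf{z}=\mathbf{D}\mathbf{x}$. The choices of $\hat{g}_i$, $\alpha$, and $\mathbf{y}$ are assumptions of this illustration, and SciPy is used only to produce a reference solution.

```python
import numpy as np
from scipy.optimize import minimize

D = np.array([[1., 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])   # CD matrix of Example 1
q = np.diag(D.T @ D)                                           # |Q^i| = [1, 2, 1]
alpha = 0.7
y = np.array([0.9, -1.4, 2.0, 0.3])

# Closed form (15): soft-threshold the clique-averaged copies with weight alpha/|Q^i|
x_bar = np.linalg.solve(D.T @ D, D.T @ y)                      # E_i (D^T D)^{-1} D^T y
x_prox = np.sign(x_bar) * np.maximum(np.abs(x_bar) - alpha / q, 0.0)
prox_closed = D @ x_prox

# Reference: G is +infinity outside Im(D), so minimize over x with z = D x
obj = lambda x: alpha * np.abs(x).sum() + 0.5 * np.sum((y - D @ x) ** 2)
x_ref = minimize(obj, np.zeros(3), method="Nelder-Mead",
                 options={"xatol": 1e-10, "fatol": 1e-12}).x
print(np.allclose(prox_closed, D @ x_ref, atol=1e-4))          # True
```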

Additionally, we provide properties of the CD matrix concerning the matrix $\mathbf{Q}$. These properties are useful for deriving the NIDS [20] and Exact diffusion [38] from the proposed method.

Proposition 2.

Let $\mathbf{Q}$ denote the matrix in Proposition 1b. Then, under Assumption 1, the following equations hold:

  • (a)

    $\mathbf{D}^{\top}\mathbf{Q}\mathbf{D}=I_d$.

  • (b)

    $\mathbf{D}^{\top}\mathbf{Q}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}$ and $\mathbf{D}^{\top}\mathbf{Q}^{-1}=\mathbf{D}^{\top}\mathbf{D}\mathbf{D}^{\top}$.

  • (c)

    $\mathbf{Q}\mathbf{D}=\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}$ and $\mathbf{Q}^{-1}\mathbf{D}=\mathbf{D}\mathbf{D}^{\top}\mathbf{D}$.

3.3 A mixing matrix $\boldsymbol{\Phi}$

Using the CD matrix and the matrices $Q_l,\,l\in\mathcal{Q}_{\mathcal{G}}$ in Proposition 1b, we can obtain a positive semidefinite and doubly stochastic matrix $\boldsymbol{\Phi}$ that will be used in Section 5. Thanks to its definition, $\boldsymbol{\Phi}$ in (18) can be constructed only from local information (i.e., $\mathcal{Q}_{\mathcal{G}}^{j},\,j\in\bigcup_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\mathcal{C}_l\subset\mathcal{N}_i$). Note that this matrix can be viewed as a special case of the clique-based projection $T$ in [30] and Appendix F for the consensus constraint, i.e., $\boldsymbol{\Phi}\mathbf{x}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}P^{\mathbf{Q}}_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_l}(\mathbf{D}\mathbf{x})$ for $\mathcal{D}_l$ in (39). We pose the following assumption, which is not restrictive: it is satisfied, for example, for $\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$, $\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}$, and $\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$.

Assumption 2.

For $\mathcal{Q}_{\mathcal{G}}$ and all distinct $i,j\in\mathcal{N}$, $\mathcal{Q}_{\mathcal{G}}^{i}\cap\mathcal{Q}_{\mathcal{G}}^{j}\neq\emptyset\Leftrightarrow\{i,j\}\in\mathcal{E}$.

The matrix $\boldsymbol{\Phi}$ and its basic properties are given as follows.

Proposition 3.

Suppose Assumptions 1 and 2. Consider the matrices $Q_l,\,l\in\mathcal{Q}_{\mathcal{G}}$ in Proposition 1b, and suppose that $d_1=\cdots=d_n=1$. Then,

\[
\boldsymbol{\Phi}=\begin{bmatrix}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{1}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{1}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\\ \vdots\\ \frac{1}{|\mathcal{Q}_{\mathcal{G}}^{n}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{n}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\end{bmatrix}\in\mathbb{R}^{n\times n} \tag{18}
\]

is doubly stochastic, and it holds that

\[
[\boldsymbol{\Phi}]_{ij}=\begin{cases}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}||\mathcal{Q}_{\mathcal{G}}^{j}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{ij}}\frac{1}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}},&\{i,j\}\in\mathcal{E}\ \text{or}\ i=j\\ 0,&\text{otherwise},\end{cases} \tag{19}
\]

where $[\boldsymbol{\Phi}]_{ij}$ represents the $(i,j)$ entry of $\boldsymbol{\Phi}$. Moreover, $\lambda_{\mathrm{max}}(\boldsymbol{\Phi})=1$ and $\lambda_{\mathrm{min}}(\boldsymbol{\Phi})\geq 0$ hold. Furthermore, when $\mathcal{G}$ is connected, the eigenvalue $1$ of $\boldsymbol{\Phi}$ is simple.

Proof.

The right stochasticity is proved as $\bigl(\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\bigr)\mathbf{1}_n=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}1=1$. Using the definition of $D_l$ in Definition 1, the left stochasticity is also verified as

\[
\mathbf{1}_n^{\top}\boldsymbol{\Phi}=\sum_{i=1}^{n}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}
=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in\mathcal{C}_l}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}
=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in\mathcal{C}_l}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}E_j
=\sum_{i=1}^{n}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_i=\sum_{i=1}^{n}E_i=\mathbf{1}_n^{\top}
\]

from $\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}=\sum_{j\in\mathcal{C}_l}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}$ and $\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l=\sum_{j\in\mathcal{C}_l}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}E_j$. Next,

\[
[\boldsymbol{\Phi}]_{ij}=E_i\boldsymbol{\Phi}E_j^{\top}=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}E_j^{\top}=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\sum_{p\in\mathcal{C}_l}\frac{1}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{p}|}E_pE_j^{\top}
\]

holds, which yields (19). Moreover, $\lambda_{\mathrm{max}}(\boldsymbol{\Phi})=1$ directly follows from the Gershgorin disk theorem [8]. Additionally, from the firm nonexpansiveness of the clique-based projection $T$ (see Proposition 6), we obtain $\mathbf{x}^{\top}\boldsymbol{\Phi}\mathbf{x}\geq\|\boldsymbol{\Phi}\mathbf{x}\|^{2}$ for any $\mathbf{x}\in\mathbb{R}^{n}$, which gives $\lambda_{\mathrm{min}}(\boldsymbol{\Phi})\geq 0$.

Finally, by the assumption that $\mathcal{Q}_{\mathcal{G}}^{ij}\neq\emptyset\Leftrightarrow\{i,j\}\in\mathcal{E}$, the graph associated with $\boldsymbol{\Phi}$ equals $\mathcal{G}$. Therefore, the eigenvalue $1$ of $\boldsymbol{\Phi}$ is simple when $\mathcal{G}$ is connected (see [8]). ∎

We now state the following proposition for $\boldsymbol{\Phi}$ in (18), which implies that $\boldsymbol{\Phi}$ enables fast consensus: by the Gershgorin disk theorem [8], all the eigenvalues smaller than $1$ are likely to take smaller values than those of other popular positive semidefinite mixing matrices. A numerical example of the eigenvalues and a sketch of the implication of Proposition 4 are illustrated in Fig. 2.

Proposition 4.

Suppose Assumption 1. For an undirected graph $\mathcal{G}$, consider the matrix $\boldsymbol{\Phi}$ in (18) with $\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$. Let $\widetilde{\mathbf{W}}_{\mathbf{L}}=(I+\mathbf{W}_{\mathbf{L}})/2$, where $\mathbf{W}_{\mathbf{L}}=I-\epsilon\mathbf{L}$ with the Laplacian matrix $\mathbf{L}$ of $\mathcal{G}$ and $\epsilon\in(0,1/\max_{i\in\mathcal{N}}\{|\mathcal{N}_i|-1\})$. (The matrix $\mathbf{L}$ is defined by $[\mathbf{L}]_{ii}=|\mathcal{N}_i|-1$ for $i=1,\ldots,n$, $[\mathbf{L}]_{ij}=-1$ for $\{i,j\}\in\mathcal{E}$, and $[\mathbf{L}]_{ij}=0$ otherwise; in [32], $[\mathbf{W}_{\mathbf{L}}]_{ij}$ with $\epsilon=1/\max_{i\in\mathcal{N}}\{|\mathcal{N}_i|\}$ is called the max-degree weight.) Similarly, let $\widetilde{\mathbf{W}}_{\mathrm{mh}}=(I+\mathbf{W}_{\mathrm{mh}})/2$ with $\mathbf{W}_{\mathrm{mh}}$ given by the Metropolis–Hastings weights in [32], i.e., $[\mathbf{W}_{\mathrm{mh}}]_{ij}=1/(\max\{|\mathcal{N}_i|-1,|\mathcal{N}_j|-1\}+\varepsilon)$ for $\{i,j\}\in\mathcal{E}$, $[\mathbf{W}_{\mathrm{mh}}]_{ii}=1-\sum_{j\in\mathcal{N}_i\setminus\{i\}}[\mathbf{W}_{\mathrm{mh}}]_{ij}$ for $i=1,\ldots,n$, and $[\mathbf{W}_{\mathrm{mh}}]_{ij}=0$ otherwise. Then, for all $i=1,\ldots,n$, we have

\[
[\boldsymbol{\Phi}]_{ii}<[\widetilde{\mathbf{W}}_{\mathbf{L}}]_{ii},\quad[\boldsymbol{\Phi}]_{ii}<[\widetilde{\mathbf{W}}_{\mathrm{mh}}]_{ii}. \tag{20}
\]
Proof.

When $\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$, $|\mathcal{Q}_{\mathcal{G}}^{i}|=|\mathcal{N}_i|-1$ holds for $i=1,\ldots,n$, and $\mathcal{Q}_{\mathcal{G}}^{ij}$ for $\{i,j\}\in\mathcal{E}$ becomes a singleton $\mathcal{Q}_{\mathcal{G}}^{ij}=\{l\}$ with $\mathcal{C}_l=\{i,j\}$, since $\mathcal{Q}_{\mathcal{G}}^{i}$ is just the set of indices of the edges that include $i$. Then, for $\{i,j\}\in\mathcal{E}$ we get $[\boldsymbol{\Phi}]_{ij}=\frac{1}{|\mathcal{N}_i|-1+|\mathcal{N}_j|-1}$. Hence, recalling the definitions of $\widetilde{\mathbf{W}}_{\mathbf{L}}$ and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$, for $\{i,j\}\in\mathcal{E}$ we have $[\widetilde{\mathbf{W}}_{\mathbf{L}}]_{ij}=\epsilon/2<1/(2\max_{i\in\mathcal{N}}\{|\mathcal{N}_i|-1\})\leq[\boldsymbol{\Phi}]_{ij}$ and $[\widetilde{\mathbf{W}}_{\mathrm{mh}}]_{ij}=1/\bigl(2(\max\{|\mathcal{N}_i|-1,|\mathcal{N}_j|-1\}+\varepsilon)\bigr)<1/\bigl(2\max\{|\mathcal{N}_i|-1,|\mathcal{N}_j|-1\}\bigr)\leq[\boldsymbol{\Phi}]_{ij}$, respectively. Therefore, since all $(i,j)$ entries of $\boldsymbol{\Phi}$ for $\{i,j\}\in\mathcal{E}$ are larger than those of $\widetilde{\mathbf{W}}_{\mathbf{L}}$ and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$ and all these matrices are doubly stochastic, we obtain (20). ∎

Remark 1.

In Fig. 2a, we use $\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}$ rather than $\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$, which also realizes smaller eigenvalues. Likewise, even when $\mathcal{Q}_{\mathcal{G}}\neq\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$, $\boldsymbol{\Phi}$ can have smaller eigenvalues than $\widetilde{\mathbf{W}}_{\mathbf{L}}$ and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$.

Refer to caption
Figure 2: (a) Comparison of the eigenvalues $\lambda_i$ of $\boldsymbol{\Phi}$, $\widetilde{\mathbf{W}}_{\mathbf{L}}$ with $\epsilon=0.99/(\max_{i\in\mathcal{N}}|\mathcal{N}_i|-1)$, and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$ for the random graph with $n=50$ shown inside the plot; (b) a sketch of the region containing each eigenvalue of $\boldsymbol{\Phi}$, $\widetilde{\mathbf{W}}_{\mathbf{L}}$, and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$ implied by Proposition 4 and the Gershgorin disk theorem. Both indicate that $\boldsymbol{\Phi}$ likely has smaller eigenvalues.
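The mixing matrix (18) is easy to assemble from the cliques alone. The sketch below builds $\boldsymbol{\Phi}$ for scalar agents from the maximal cliques of a random graph and checks the double stochasticity and eigenvalue range of Proposition 3, mirroring the comparison in Fig. 2; the particular graph and the networkx usage are assumptions of this illustration.

```python
import numpy as np
import networkx as nx

def mixing_phi(cliques, n):
    """Mixing matrix Phi of (18)/(19) for scalar agents (d_i = 1)."""
    q = np.zeros(n)                                    # q[i] = |Q_G^i|
    for C in cliques:
        for j in C:
            q[j] += 1.0
    Phi = np.zeros((n, n))
    for C in cliques:
        C = list(C)
        w = 1.0 / q[C]                                 # diagonal of Q_l
        for i in C:                                    # row i gains (1/|Q^i|) * 1^T Q_l D_l / (1^T Q_l 1)
            Phi[i, C] += w / (q[i] * w.sum())
    return Phi

n = 20
G = nx.erdos_renyi_graph(n, 0.4, seed=1)
cliques = [tuple(c) for c in nx.find_cliques(G)]       # Q_G = Q_G^max (maximal cliques)
Phi = mixing_phi(cliques, n)
print(np.allclose(Phi.sum(axis=0), 1.0), np.allclose(Phi.sum(axis=1), 1.0))  # doubly stochastic
print(np.linalg.eigvalsh(Phi).round(3))                # eigenvalues lie in [0, 1]; 1 is simple if G is connected
```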

4 Solution to Clique-Wise Coupled Problems via Operator Splitting

This section presents our proposed algorithms for Problem (2), built from the CD matrix and the DYS in (9) with the metrics $M=I$ and $M=\mathbf{Q}$. We now assume the following.

Assumption 3.

Problem (2) has an optimal solution.

In what follows, the functions $f:\mathbb{R}^{\hat{d}}\to\mathbb{R}$, $g:\mathbb{R}^{\hat{d}}\to\mathbb{R}$, $\hat{f}:\mathbb{R}^{d}\to\mathbb{R}$, and $\hat{g}:\mathbb{R}^{d}\to\mathbb{R}$ represent

\[
f(\mathbf{y})=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}f_l(y_l),\quad g(\mathbf{y})=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}g_l(y_l), \tag{21}
\]
\[
\hat{f}(\mathbf{x})=\sum_{i=1}^{n}\hat{f}_i(x_i),\quad \hat{g}(\mathbf{x})=\sum_{i=1}^{n}\hat{g}_i(x_i). \tag{22}
\]

4.1 Algorithm description

Algorithm 1 Clique-based distributed Davis-Yin splitting (CD-DYS) algorithm for agent $i\in\mathcal{N}$.
0:  $z_l^{0}$ for all $l\in\mathcal{Q}_{\mathcal{G}}^{i}$ and $\alpha>0$.
1:  for $k=0,1,\ldots$ do
2:     $x_i^{k}=\mathrm{prox}_{\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\hat{g}_i}\bigl(\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}z_l^{k}\bigr)$
3:     Gather $x_j^{k}$ from the neighboring agents $j\in\bigcup_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\mathcal{C}_l\subset\mathcal{N}_i$.
4:     Obtain $y_l^{k+1/2}$, $y_l^{k+1}$, and $z_l^{k+1}$ for $l\in\mathcal{Q}_{\mathcal{G}}^{i}$ by
       $y_l^{k+1/2}=x^{k}_{\mathcal{C}_l}$
       $y_l^{k+1}=\mathrm{prox}_{\alpha g_l}\bigl(2y_l^{k+1/2}-z_l^{k}-\alpha\nabla_{y_l}f_l(y_l^{k+1/2})-\alpha\bigl[\tfrac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}\nabla_{x_j}\hat{f}_j(x_j^{k})\bigr]_{j\in\mathcal{C}_l}\bigr)$
       $z_l^{k+1}=z_l^{k}+y_l^{k+1}-y_l^{k+1/2}$
5:  end for

We present the distributed optimization algorithm in Algorithm 1, the clique-based distributed Davis-Yin splitting (CD-DYS) algorithm. This algorithm is distributed thanks to (3). By Lemma 2, the aggregated form of this algorithm is as follows:

\[
\begin{array}{l}
\mathbf{x}^{k}=\mathrm{prox}_{\sum_{i=1}^{n}\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\hat{g}_i(\cdot)}\bigl((\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k}\bigr)\\
\mathbf{y}^{k+1/2}=\mathbf{D}\mathbf{x}^{k}\\
\mathbf{y}^{k+1}=\mathrm{prox}_{\alpha g}\bigl(2\mathbf{y}^{k+1/2}-\mathbf{z}^{k}-\alpha\nabla_{\mathbf{y}}f(\mathbf{y}^{k+1/2})-\alpha\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr)\\
\mathbf{z}^{k+1}=\mathbf{z}^{k}+\mathbf{y}^{k+1}-\mathbf{y}^{k+1/2},
\end{array} \tag{28}
\]

where $\mathbf{x}^{k}=[x_1^{k\top},\ldots,x_n^{k\top}]^{\top}$, $\mathbf{y}^{k}=[y_l^{k}]_{l\in\mathcal{Q}_{\mathcal{G}}}$, $\mathbf{y}^{k+1/2}=[y_l^{k+1/2}]_{l\in\mathcal{Q}_{\mathcal{G}}}$, and $\mathbf{z}^{k}=[z_l^{k}]_{l\in\mathcal{Q}_{\mathcal{G}}}$. By Lemma 1, this algorithm converges to an optimal point under $\alpha\in\bigl(0,2/(\max_{l\in\mathcal{Q}_{\mathcal{G}}}L_l+\max_{i\in\mathcal{N}}\frac{\hat{L}_i}{|\mathcal{Q}_{\mathcal{G}}^{i}|})\bigr)$.

This algorithm can be derived as follows. From (13), for $\mathbf{y}=\mathbf{D}\mathbf{x}\in\mathrm{Im}(\mathbf{D})$, we can reformulate Problem (2) into the form of (5) as follows:

\[
\min_{y_l\in\mathbb{R}^{d^{l}},\,l\in\mathcal{Q}_{\mathcal{G}}}\ \underbrace{\sum_{i=1}^{n}\hat{f}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})+\sum_{l\in\mathcal{Q}_{\mathcal{G}}}f_l(y_l)}_{f\text{ in (5)}}+\underbrace{\sum_{l\in\mathcal{Q}_{\mathcal{G}}}g_l(y_l)}_{g\text{ in (5)}}+\underbrace{\sum_{i=1}^{n}\hat{g}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})+\delta_{\mathrm{Im}(\mathbf{D})}(\mathbf{y})}_{h\text{ in (5)}}. \tag{31}
\]

For Problem (31), Proposition 1 tells us that the function $\sum_{i=1}^{n}\hat{g}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})+\delta_{\mathrm{Im}(\mathbf{D})}(\mathbf{y})$ is proximable for proximable $\hat{g}_i$, and that its proximal operator can be computed in a distributed fashion. Accordingly, we can directly apply the DYS in (9) with $M=I$ to (31). From Proposition 1, setting $x_i^{k}=\mathrm{prox}_{\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\hat{g}_i}(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k})$ gives the distributed algorithm in Algorithm 1 (or (28)).
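For concreteness, the following sketch runs the aggregated iteration (28) on a toy instance over the Example 1 graph, with $f_l(y_l)=\tfrac12\|y_l-b_l\|^2$, $g_l=0$, $\hat f_i(x_i)=\tfrac12(x_i-c_i)^2$, and $\hat g_i=\delta_{\mathbb{R}_+}$; all data values are assumptions of this example. Since $\mathbf{D}^{\top}\mathbf{D}$ is diagonal, the minimizer has a closed form against which the iterate can be compared.

```python
import numpy as np

# Example 1 graph: D_1, D_2 select cliques {1,2} and {2,3}
D = np.array([[1., 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])
Dl = [D[0:2], D[2:4]]
b = [np.array([1.0, -2.0]), np.array([0.5, 3.0])]      # data of f_l
c = np.array([-1.0, 0.2, 1.0])                          # data of fhat_i
q = np.diag(D.T @ D)                                    # |Q_G^i|

alpha = 0.5                                             # < 2/(max_l L_l + max_i Lhat_i/|Q^i|) = 1
z = [np.zeros(2), np.zeros(2)]                          # z_l^0
for k in range(300):
    avg = sum(Dli.T @ zl for Dli, zl in zip(Dl, z)) / q # (D^T D)^{-1} D^T z^k
    x = np.maximum(avg, 0.0)                            # prox of ghat_i = delta_{R_+}: projection
    for l in range(2):
        y_half = Dl[l] @ x                              # y_l^{k+1/2} = x_{C_l}
        grad = (y_half - b[l]) + Dl[l] @ ((x - c) / q)  # grad f_l plus the clique's share of grad fhat
        y = 2 * y_half - z[l] - alpha * grad            # prox_{alpha g_l} is the identity (g_l = 0)
        z[l] = z[l] + y - y_half

print(x)
print(np.maximum((D.T @ np.concatenate(b) + c) / (q + 1), 0.0))  # closed-form minimizer, ~ [0, 0, 2]
```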

4.2 Variable metric version

Applying the variable metric DYS in (9) with $M=\mathbf{Q}$ in Proposition 1 to Problem (31) instead gives the following algorithm:

\[
\begin{array}{l}
x_i^{k}=\mathrm{prox}_{\alpha\hat{g}_i}\bigl(\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}z_l^{k}\bigr)\\
y_l^{k+1/2}=x^{k}_{\mathcal{C}_l}\\
y_l^{k+1}=\mathrm{prox}_{\alpha g_l}^{Q_l}\bigl(2y_l^{k+1/2}-z_l^{k}-\alpha Q_l^{-1}\nabla_{y_l}f_l(y_l^{k+1/2})-\alpha[\nabla_{x_j}\hat{f}_j(x_j^{k})]_{j\in\mathcal{C}_l}\bigr)\\
z_l^{k+1}=z_l^{k}+y_l^{k+1}-y_l^{k+1/2},
\end{array} \tag{36}
\]

where we use Propositions 1b and 2. As Fig. 3 illustrates, it will turn out in Section 5 that this algorithm has an interesting connection to other methods through $\boldsymbol{\Phi}$ in (18). By Lemma 1, a sufficient condition for convergence is $\alpha\in\bigl(0,2/(\max_{l\in\mathcal{Q}_{\mathcal{G}}}\max_{j\in\mathcal{C}_l}(|\mathcal{Q}_{\mathcal{G}}^{j}|L_l)+\max_{i\in\mathcal{N}}\hat{L}_i)\bigr)$.

5 Revisit of Consensus Optimization as A Clique-Wise Coupled Problem

[Diagram: the variable metric CD-DYS ((9) applied to (2) via (31)) specializes, via $M=\mathbf{Q}$ or $M=I$, to the variable metric CD-DYS w.r.t. $\mathbf{Q}$ (Alg. (36)) and the CD-DYS (Alg. 1 or (28)); with $g_l=\delta_{\mathcal{D}_l}$, $\mathcal{D}_l$ in (39), and $\hat{g}_i=0$ these further reduce to the NIDS [20] ((42) with $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$) and the CPGD [30], and, after approximation, to Exact diffusion [38] and diffusion [27] with $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$.]
Figure 3: The relationships among the proposed methods and existing methods.

This section is dedicated to a detailed analysis of the proposed methods of Section 4 in light of consensus optimization, presenting a renewed perspective. We demonstrate the relationships summarized in Fig. 3 by showing the most important part: Algorithm (36) generalizes the NIDS in [20]. Our analysis reveals that the matrix $\boldsymbol{\Phi}$ in (18) naturally arises in the NIDS. This fact and Proposition 4 imply that the proposed algorithm enables fast convergence [20] beyond standard mixing matrices. Furthermore, we present a new linear convergence rate for the NIDS with a non-smooth term based on its DYS structure. The linear rate follows from a more general result for the DYS (9).

We here consider a special case of Problem (2) given by

\[
\min_{x_i\in\mathbb{R}^{d_i},\,i\in\mathcal{N}}\ \sum_{i=1}^{n}\hat{f}_i(x_i)+\sum_{i=1}^{n}\hat{g}_i(x_i)+\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\underbrace{\delta_{\mathcal{D}_l}(x_{\mathcal{C}_l})}_{g_l(x_{\mathcal{C}_l})}. \tag{38}
\]

When $m=d_1=\cdots=d_n$ and

\[
\mathcal{D}_l=\{x_{\mathcal{C}_l}\in\mathbb{R}^{|\mathcal{C}_l|m}:\exists\theta\in\mathbb{R}^{m}\text{ s.t. }x_{\mathcal{C}_l}=\mathbf{1}_{|\mathcal{C}_l|}\otimes\theta\}, \tag{39}
\]

this problem is called a consensus optimization problem, which we discuss here. Note that according to [26], $\bigcap_{l\in\mathcal{Q}_{\mathcal{G}}}\{\mathbf{x}\in\mathbb{R}^{nm}:x_{\mathcal{C}_l}\in\mathcal{D}_l\}=\{\mathbf{x}\in\mathbb{R}^{nm}:x_1=\cdots=x_n\}$ holds under the connectivity of the graph $\mathcal{G}$ and Assumptions 1 and 2.

5.1 CD-DYS as generalized NIDS

NIDS algorithm

First, the NIDS algorithm [20] for consensus optimization is given, for $k=1,2,\ldots$, by

\[
\begin{array}{l}
\mathbf{x}^{k}=\mathrm{prox}_{\alpha\hat{g}}(\mathbf{w}^{k})\\
\mathbf{w}^{k+1}=\mathbf{w}^{k}-\mathbf{x}^{k}+\widetilde{\mathbf{W}}\bigl(2\mathbf{x}^{k}-\mathbf{x}^{k-1}+\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr)
\end{array} \tag{42}
\]

with arbitrary $\mathbf{x}^{0}$ and $\mathbf{w}^{1}=\mathbf{x}^{0}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{0})$. The matrix $\widetilde{\mathbf{W}}$ is a positive semidefinite doubly stochastic mixing matrix. A standard choice of $\widetilde{\mathbf{W}}$ that uses no global information is $\widetilde{\mathbf{W}}=\widetilde{\mathbf{W}}_{\mathrm{mh}}$ in Proposition 4. To make $\widetilde{\mathbf{W}}$ less conservative, [20] suggests that some global information is necessary (e.g., the value of $\lambda_{\mathrm{max}}(\mathbf{W}_{\mathrm{mh}})$).
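The update (42) is straightforward to prototype. The sketch below implements it and runs a toy scalar consensus problem with a Metropolis-type mixing matrix on a cycle graph; the graph, the data, and the stand-in $\widetilde{\mathbf{W}}$ are assumptions of this illustration, and $\boldsymbol{\Phi}$ from (18) can be plugged in instead.

```python
import numpy as np

def nids(W, grad_f, prox_g, x0, alpha, num_iters=2000):
    """NIDS (42): x^k = prox_{alpha ghat}(w^k), then the w-update with mixing matrix W."""
    x_prev = x0
    w = x0 - alpha * grad_f(x0)                              # w^1
    for _ in range(num_iters):
        x = prox_g(w, alpha)
        w = w - x + W @ (2 * x - x_prev + alpha * (grad_f(x_prev) - grad_f(x)))
        x_prev = x
    return x

# Toy data: fhat_i(x_i) = 0.5 (x_i - b_i)^2, ghat_i(x_i) = lam |x_i|, scalar agents on a cycle
n, lam = 20, 0.1
b = np.random.default_rng(0).normal(size=n)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0
W_tilde = (np.eye(n) + W) / 2                                # doubly stochastic and positive semidefinite
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x = nids(W_tilde, grad_f=lambda x: x - b, prox_g=lambda v, a: soft(v, a * lam),
         x0=np.zeros(n), alpha=1.0)
print(x)                                                     # ~ soft(mean(b), lam) at every agent
```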

Analysis

We present the following proposition, which states that the proposed algorithm in (36) yields the NIDS algorithm with $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$ in (18) and can thus achieve fast convergence, as shown in Proposition 4 and Fig. 2. Note that we show the case of $m=1$ for simplicity.

Proposition 5.

Consider Algorithm (36) for Problem (38). Suppose Assumptions 1–3. Assume that for all $i\in\mathcal{N}$, $\hat{f}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ is $\hat{L}_i$-smooth and convex, and $\hat{g}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ is proper, closed, and convex. For arbitrary $\mathbf{x}^{0}$, let $\mathbf{z}^{1}=\mathbf{D}(\mathbf{x}^{0}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{0}))$. Then, for $k=1,2,\ldots$, $\mathbf{w}^{k}:=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k}$ and $\mathbf{x}^{k}:=\mathrm{prox}_{\alpha\hat{g}}(\mathbf{w}^{k})$ satisfy the NIDS update (42) with $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$ in (18).

Proof.

By Lemma 2b–c, multiplying the $z_l$-update in (36) by $(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}$ gives $w_i^{k+1}=w_i^{k}-x_i^{k}+\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}\,\mathrm{prox}_{\delta_{\mathcal{D}_l}}^{Q_l}\bigl(2x^{k}_{\mathcal{C}_l}-z_l^{k}-\alpha D_l\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr)$. Then, plugging in $\mathrm{prox}_{\delta_{\mathcal{D}_l}}^{Q_l}(x_{\mathcal{C}_l})=P_{\mathcal{D}_l}^{Q_l}(x_{\mathcal{C}_l})=\mathbf{1}_{|\mathcal{C}_l|}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lx_{\mathcal{C}_l}}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}$, we obtain

\[
w_i^{k+1}=w_i^{k}-x_i^{k}+\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\bigl(2x^{k}_{\mathcal{C}_l}-z_l^{k}-\alpha D_l\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr).
\]

Additionally, for $k\geq 1$ we can transform $\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lz_l^{k+1}$ into

\[
\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lz_l^{k+1}=\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l(z_l^{k}-x_{\mathcal{C}_l}^{k})+\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\bigl(2x^{k}_{\mathcal{C}_l}-z_l^{k}-\alpha D_l\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr)=\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\bigl(x_{\mathcal{C}_l}^{k}-\alpha D_l\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr). \tag{43}
\]

Thus, for $k=1,2,\ldots$, recalling the initialization $\mathbf{z}^{1}=\mathbf{D}(\mathbf{x}^{0}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{0}))$ and applying (43) to $w_i^{k+1}$ yields $w_i^{k+1}=w_i^{k}-x_i^{k}+\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\bigl(2x^{k}_{\mathcal{C}_l}-x^{k-1}_{\mathcal{C}_l}+\alpha D_l(\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\bigr)=w_i^{k}-x_i^{k}+\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\bigl(2\mathbf{x}^{k}-\mathbf{x}^{k-1}+\alpha(\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\bigr)$. Hence, setting $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$ with ${\boldsymbol{\Phi}}$ in (18), we get $\mathbf{w}^{k+1}=\mathbf{w}^{k}-\mathbf{x}^{k}+\widetilde{\mathbf{W}}\bigl(2\mathbf{x}^{k}-\mathbf{x}^{k-1}+\alpha(\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\bigr)$. ∎

Remark 2.

The original NIDS paper [20] states that the NIDS is obtained from the PD3O in [36] (a primal-dual variant of the DYS). Meanwhile, in Proposition 5, we rely only on the primal part and obtain $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$ as a fixed parameter.

5.2 Linear convergence of the NIDS with $\hat{g}_i(\cdot)\neq 0$

This subsection presents a linear convergence rate of the NIDS via the CD-DYS. We first present a new linear convergence result for the general DYS for Problem (5) (not limited to the CD-DYS) when $f$ is strongly convex and $g$ is the indicator function of a linear image space. Since indicator functions are neither smooth nor strongly convex, our result cannot be derived from prior linear convergence results such as [13, 12, 18, 37]. The proof is presented in Section 7.

Theorem 1.

Consider the variable metric DYS in (9) for $k=1,2,\ldots$ for Problem (5). Let $y^{*}$ and $z^{*}$ be the optimal values of $y^{k+1/2}$ (and $y^{k}$) and of $z^{k}$, respectively. Suppose that $M^{-1}\nabla f(y)$ is $L$-Lipschitz continuous, $f$ is $\mu$-strongly convex, $g(y)=\delta_{\mathrm{Im}(U)}(y)$ with a full column rank matrix $U$, and $h$ is proper, closed, and convex. Set a stepsize $\alpha\in(0,2\varepsilon/L)$, where $\varepsilon\in(0,1)$. Pick any starting point $y^{1/2}=y^{\mathrm{init}}$ and set $z^{1}=y^{1/2}-\alpha M^{-1}\nabla f(y^{1/2})$. Then it holds that

\[
\|z^{k+1}-z^{*}\|^{2}+\nu\|\zeta(y^{k+1/2})-\zeta(y^{*})\|^{2}\leq(1-C)\bigl(\|z^{k}-z^{*}\|^{2}+\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}\bigr),
\]

where $\nu\in\bigl(0,\frac{\beta}{2}(\alpha-\frac{\alpha^{2}L}{2\varepsilon})\bigr)$ with $\beta=\min\{\frac{1}{\alpha L},\mu\}$, $\zeta(y):=y-\alpha M^{-1}\nabla_{y}f(y)$, and $C=\min\{\frac{\kappa}{48},\frac{\kappa}{12\alpha},\frac{\nu}{\nu+9}\}$ with $\kappa:=\beta(\alpha-\frac{\alpha^{2}L}{2\varepsilon})-2\nu>0$.

Since $\mathcal{D}_l=\mathrm{Im}(\mathbf{1}_{|\mathcal{C}_l|})$ for $\mathcal{D}_l$ in (39), Theorem 1 yields the following linear rate (Theorem 2) for the NIDS with $\boldsymbol{\Phi}$. Although [1, 33] have addressed this case, our result below admits larger stepsizes owing to the arbitrariness of $\varepsilon\in(0,1)$.

Theorem 2.

Consider the same assumptions as in Proposition 5. Further, assume that $\mathcal{G}$ is connected and that for each $i\in\mathcal{N}$, $\hat{f}_i(\cdot)$ is $\hat{\mu}_i$-strongly convex. Set a stepsize $\alpha\in(0,2\varepsilon/\max_{i\in\mathcal{N}}\hat{L}_i)$, where $\varepsilon\in(0,1)$. Pick any starting point $\mathbf{x}^{0}$ and set $\mathbf{w}^{1}=\mathbf{x}^{0}-\alpha\nabla\hat{f}(\mathbf{x}^{0})$. Then, $\|\mathbf{x}^{k}-\mathbf{x}^{*}\|=O((1-C)^{k/2})$ holds, where $C$ is given as in Theorem 1 with $L=\max_{i\in\mathcal{N}}\hat{L}_i$ and $\mu=\min_{i\in\mathcal{N}}\hat{\mu}_i/\max_{i\in\mathcal{N}}|\mathcal{Q}_{\mathcal{G}}^{i}|$.

Proof.

This theorem follows from Proposition 5 and Theorem 1. Note that while the $\hat{\mu}_i$-strong convexity of $\hat{f}_i$ for $i=1,\ldots,n$ guarantees the convexity of $\hat{f}((\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})-(\min_{i}\hat{\mu}_i)\|(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}\|^{2}$ (i.e., $\hat{f}((\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})$ is strongly convex over $\mathrm{Im}(\mathbf{D})$; recall the formulation in (31)), we can treat $\sum_{i=1}^{n}\hat{f}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})$ as a $\min_{i\in\mathcal{N}}\hat{\mu}_i/\max_{i\in\mathcal{N}}|\mathcal{Q}_{\mathcal{G}}^{i}|$-strongly convex function and directly apply the same arguments as in Theorem 1, because both $\mathbf{y}^{k+1/2}$ in Algorithm (36) and $\mathbf{y}^{*}$ always belong to $\mathrm{Im}(\mathbf{D})$. The linear convergence of $\{\mathbf{x}^{k}\}$ then follows from the first line of Algorithm (36). ∎

6 Numerical Experiments

This section presents two numerical examples: resource allocation and consensus optimization.

Clique-wise resource allocation

First, we consider a resource allocation problem for the network of $n=20$ agents in [30, Fig. 1] with four communities modeled by cliques, and suppose that a resource constraint is imposed on each clique. The clique parameters are given by $\mathcal{Q}_{\mathcal{G}}=\{1,2,3,4\}$ and $\mathcal{C}_1=\{1,2,\ldots,6\}$, $\mathcal{C}_2=\{5,6,\ldots,9\}$, $\mathcal{C}_3=\{8,9,\ldots,12\}$, $\mathcal{C}_4=\{9,10,13,14,\ldots,20\}$. Let $x_i\geq 0$ for $i\in\mathcal{N}$ be the amount of resources of agent $i$. For $i\in\mathcal{N}$ and $l\in\mathcal{Q}_{\mathcal{G}}$, we set $f_l(x_{\mathcal{C}_l})=\frac{a_l}{2}\|\frac{1}{|\mathcal{C}_l|}\mathbf{1}_{|\mathcal{C}_l|}^{\top}x_{\mathcal{C}_l}-b_l\|^{2}$ with $a_l=1$ and $b_l$ drawn from the uniform distribution on $[0,5]$; $g_l(x_{\mathcal{C}_l})=\delta_{\mathcal{D}_l}(x_{\mathcal{C}_l})$ with $\mathcal{D}_l=\{x_{\mathcal{C}_l}:\sum_{j\in\mathcal{C}_l}x_j=N_l\}$ and $(N_1,\ldots,N_4)=(5,10,5,15)$; $\hat{f}_i(x_i)=\frac{\hat{a}_i}{2}\|x_i-\hat{b}_i\|^{2}$ with $\hat{a}_i=1$ and $\hat{b}_i$ drawn from the uniform distribution on $[0,1]$; and $\hat{g}_i(x_i)=\delta_{\mathbb{R}_{+}\cup\{0\}}(x_i)$.
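In this instance, the per-clique and per-agent proximal steps of Algorithm 1 have simple closed forms; the brief sketch below (scalar resources, with values chosen only for illustration) spells them out.

```python
import numpy as np

def prox_resource(v, N_l):
    """Prox of g_l = delta_{D_l} with D_l = {x : sum_j x_j = N_l}: projection onto the hyperplane."""
    return v - (v.sum() - N_l) / v.size

def prox_nonneg(v):
    """Prox of ghat_i = delta_{R_+}, independent of the stepsize: projection onto x_i >= 0."""
    return np.maximum(v, 0.0)

def grad_fl(y, a_l, b_l):
    """Gradient of f_l(y) = (a_l / 2) * (mean(y) - b_l)^2."""
    return a_l * (y.mean() - b_l) * np.ones_like(y) / y.size

# e.g., projecting candidate allocations of a five-agent clique onto its budget N_l = 10
v = np.array([3.0, 1.0, 0.5, 2.0, 4.0])
print(prox_resource(v, 10.0), prox_resource(v, 10.0).sum())   # shifted allocation, sums to 10
```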

We compare the proposed method in (28) (or Algorithm 1) with $\alpha=1/\bigl(\max_{i\in\mathcal{N}}\hat{a}_i/\min_{i\in\mathcal{N}}|\mathcal{Q}_{\mathcal{G}}^{i}|+\max_{l\in\mathcal{Q}_{\mathcal{G}}}a_l\bigr)$ against Liang et al. [21] for two different stepsizes ($\tau=0.09,\,0.1$). The method in [21] is a distributed algorithm for globally coupled constraints using the gradient of the cost function. Notice that the dual decomposition technique cannot be directly used here since we have to minimize the clique-coupled terms $f_l,\,l\in\mathcal{Q}_{\mathcal{G}}$.

The simulation result is plotted in Fig. 4. The proposed CD-DYS converges much faster than Liang et al. [21] with $\tau=0.09$, while the latter with $\tau=0.1$ fails to converge. This difference is rooted in the fact that the CD-DYS exploits the community structure of the problem and admits larger stepsizes. Note that to obtain the largest stepsize for Liang et al., one has to know an upper bound on the norm of the dual variables.

Consensus optimization via NIDS

We next consider the 1\ell_{1} norm regularized consensus optimization problem (38) for n=50n=50 agents over an undirected graph 𝒢\mathcal{G}. Here, we set f^i(xi)=12Ψixibi2\hat{f}_{i}(x_{i})=\frac{1}{2}\|\Psi_{i}x_{i}-b_{i}\|^{2} and g^i(xi)=λixi1\hat{g}_{i}(x_{i})=\lambda_{i}\|x_{i}\|_{1} for i𝒩i\in\mathcal{N}. We take Ψi=I10+0.05Ωi10×10\Psi_{i}=I_{10}+0.05\Omega_{i}\in\mathbb{R}^{10\times 10}, bi10,i𝒩b_{i}\in\mathbb{R}^{10},\,i\in\mathcal{N}, and λ1==λn=λ=0.001\lambda_{1}=\cdots=\lambda_{n}=\lambda=0.001. Each entry of Ωi\Omega_{i} and bib_{i} is generated by the standard normal distribution.

We here conduct simulations of the NIDS in (42) for 𝐖~=𝚽\widetilde{\mathbf{W}}={\boldsymbol{\Phi}} with 𝒬𝒢=𝒬𝒢max\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}} and 𝒬𝒢=𝒬𝒢edge\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}, 𝐖~𝐋\widetilde{\mathbf{W}}_{\mathbf{L}} with ϵ=0.99/maxi𝒩(|𝒩i|1)\epsilon=0.99/\max_{i\in\mathcal{N}}(|\mathcal{N}_{i}|-1), 𝐖~mh\widetilde{\mathbf{W}}_{\mathrm{mh}}, and 𝐖~c:=Iαc(I𝐖𝐋)\widetilde{\mathbf{W}}_{c}:=I-\alpha c(I-\mathbf{W}_{\mathbf{L}}), where c=1/(1λmin(𝐖𝐋)α)c=1/(1-\lambda_{\min}(\mathbf{W}_{\mathbf{L}})\alpha). The last is introduced in [20] as a less conservative choice using global information λmin(𝐖𝐋)\lambda_{\mathrm{min}}(\mathbf{W}_{\mathbf{L}}). The stepsize is assigned as α=1/L^\alpha=1/\hat{L} with L^=maxi{λmax(|𝒬𝒢i|(ΨiΨi))}.\hat{L}=\max_{i}\{\lambda_{\mathrm{max}}(|\mathcal{Q}_{\mathcal{G}}^{i}|(\Psi_{i}^{\top}\Psi_{i}))\}.

The simulation result is reported in Fig. 5, which plots the relative objective residual |F(𝐱k)F(𝐱)|/F(𝐱)|F(\mathbf{x}^{k})-F(\mathbf{x}^{*})|/F(\mathbf{x}^{*}) where F(𝐱):=f^(𝐱)+λ𝐱1F(\mathbf{x}):=\hat{f}(\mathbf{x})+\lambda\|\mathbf{x}\|_{1}. It can be observed that the case of 𝚽{\boldsymbol{\Phi}} with 𝒬𝒢=𝒬𝒢max\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}} exhibits the fastest convergence in about 60 iterations with high accuracy, outperforming 𝐖~c\widetilde{\mathbf{W}}_{c} without using global information. While the case of 𝚽{\boldsymbol{\Phi}} with 𝒬𝒢=𝒬𝒢edge\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}} is slower than 𝐖~c\widetilde{\mathbf{W}}_{c}, it is still superior to 𝐖~mh\widetilde{\mathbf{W}}_{\mathrm{mh}} and 𝐖~𝐋\widetilde{\mathbf{W}}_{\mathbf{L}}, as implied in Proposition 4.

Figure 4: Plots of the relative objective residual under the method of Liang et al. [21] with τ=0.09, 0.1\tau=0.09,\,0.1 and the CD-DYS in Algorithm 1 (or (28)). Here, τ\tau represents the stepsize.
Figure 5: Plots of the relative objective residual under the NIDS in (42) for the five different choices of 𝐖~\widetilde{\mathbf{W}}.

7 Proof of Theorem 2

Our proof is based on the following trick (see Theorem D.6 in [13]): If a0,,ap,b0,,bp,c0,,cp(0,)a_{0},\cdots,a_{p},b_{0},\cdots,b_{p},c_{0},\cdots,c_{p}\in(0,\infty) for some p>0p>0, and

\|w^{k+1}-w^{*}\|^{2}+\sum_{i=0}^{p}a_{i}c_{i}\leq\|w^{k}-w^{*}\|^{2}\leq\sum_{i=0}^{p}a_{i}b_{i}, (44)

then \sum_{i=0}^{p}a_{i}b_{i}\leq\max_{i}(b_{i}/c_{i})\sum_{i=0}^{p}a_{i}c_{i}, so

\displaystyle\|w^{k+1}-w^{*}\|^{2}+\min_{i}(c_{i}/b_{i})\|w^{k}-w^{*}\|^{2}\leq\|w^{k+1}-w^{*}\|^{2}+\sum_{i=0}^{p}a_{i}c_{i}\leq\|w^{k}-w^{*}\|^{2}.

Thus,

wk+1w2(1minici/bi)wkw2,\displaystyle\!\!\!\!\!\!\|w^{k+1}-w^{*}\|^{2}\leq(1-\min_{i}c_{i}/b_{i})\|w^{k}-w^{*}\|^{2}, (45)

which provides a linear convergence rate. In the following, we derive an inequality of the form in (44) with wk=[(zk),ν1/2(ζ(yk1/2))]w^{k}=[(z^{k})^{\top},\nu^{1/2}(\zeta(y^{k-1/2}))^{\top}]^{\top}, w=[z,ν1/2(ζ(y))]w^{*}=[z^{*\top},\nu^{1/2}(\zeta(y^{*}))^{\top}]^{\top}, and some constants a0,,ap,b0,,bp,c0,,cp>0a_{0},\cdots,a_{p},b_{0},\cdots,b_{p},c_{0},\cdots,c_{p}>0.

We first prepare a key inclusion for establishing the desired rate. We suppose that zkz^{k}, yky^{k}, and yk+1/2y^{k+1/2} are not optimal without loss of generality. For g(y)=δIm(U)(y)g(y)=\delta_{\mathrm{Im}(U)}(y), we have proxαgM(y)=U(UMU)1UMy\mathrm{prox}_{\alpha g}^{M}(y)=U(U^{\top}MU)^{-1}U^{\top}My, which leads to

\displaystyle U^{\top}Mz^{k+1}=U^{\top}Mz^{k}-U^{\top}My^{k+1/2}+U^{\top}M(2y^{k+1/2}-z^{k}-\alpha M^{-1}\nabla_{y}f(y^{k+1/2}))
\displaystyle=U^{\top}M(y^{k+1/2}-\alpha M^{-1}\nabla_{y}f(y^{k+1/2}))

since UMproxαgM(y)=UMyU^{\top}M\mathrm{prox}_{\alpha g}^{M}(y)=U^{\top}My. Note that this holds for all k1k\geq 1 thanks to the initialization. Then we have

yk+1=(UMU)1UM(2yk+1/2yk1/2αM1yf(yk+1/2)+αM1yf(yk1/2)),\displaystyle y^{k+1}=(U^{\top}MU)^{-1}U^{\top}M(2y^{k+1/2}-y^{k-1/2}-\alpha M^{-1}\nabla_{y}f(y^{k+1/2})+\alpha M^{-1}\nabla_{y}f(y^{k-1/2})),

and thus we get ξk+1:=2yk+1/2yk1/2αM1yf(yk+1/2)+αM1yf(yk1/2)(I+αM1g)(yk+1)\xi^{k+1}:=2y^{k+1/2}-y^{k-1/2}-\alpha M^{-1}\nabla_{y}f(y^{k+1/2})+\alpha M^{-1}\nabla_{y}f(y^{k-1/2})\in(I+\alpha M^{-1}\partial g)(y^{k+1}) by [4, Prop. 16.44]. This inclusion allows us to remove the assumption of smoothness or strong convexity for gg or hh in [13, 18, 37, 12] because one can evaluate yk+1y^{k+1} only with ff.

We derive the lower side of the inequality in (44) as follows. By the smoothness and strong convexity of ff, for zkz2\|z^{k}-z^{*}\|^{2}, [13, Proposition D.4] provides

zkz2\displaystyle\|z^{k}-z^{*}\|^{2}\geq (1ε)zk+1z2+2αmax{μyk+1/2y2,1LM1yf(yk+1/2)M1yf(y)2}\displaystyle(1-\varepsilon)\|z^{k+1}-z^{*}\|^{2}+2\alpha\max\{\mu\|y^{k+1/2}-y^{*}\|^{2},\frac{1}{L}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}\}
α2εM1yf(yk+1/2)M1yf(y)2.\displaystyle-\frac{\alpha^{2}}{\varepsilon}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}.

Then, using θ=1/(2ε)\theta=1/(2-\varepsilon), wkw2\|w^{k}-w^{*}\|^{2} is lower bounded as

wkw2zk+1z2+νζ(yk1/2)ζ(y)2\displaystyle\|w^{k}-w^{*}\|^{2}\geq\|z^{k+1}-z^{*}\|^{2}+\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}
+(1θ1)zk+1zk2+2αmax{μyk+1/2y2,1LM1yf(yk+1/2)M1yf(y)2}\displaystyle+(\frac{1}{\theta}-1)\|z^{k+1}-z^{k}\|^{2}+2\alpha\max\{\mu\|y^{k+1/2}-y^{*}\|^{2},\frac{1}{L}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}\}
α2Lε×1LM1yf(yk+1/2)M1yf(y)2\displaystyle-\frac{\alpha^{2}L}{\varepsilon}\times\frac{1}{L}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}
\displaystyle\geq wk+1w2+c0zk+1zk2+c1yk+1/2y2+c2M1yf(yk+1/2)M1yf(y)2\displaystyle\|w^{k+1}-w^{*}\|^{2}+c_{0}\|z^{k+1}-z^{k}\|^{2}+c_{1}\|y^{k+1/2}-y^{*}\|^{2}+c_{2}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}
+c3ζ(yk1/2)ζ(y)2\displaystyle+c_{3}\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2} (46)

where c0=1/θ1c_{0}=1/\theta-1, c1=β(αα2L2ε)2νc_{1}=\beta(\alpha-\frac{\alpha^{2}L}{2\varepsilon})-2\nu, c2=αc1c_{2}=\alpha c_{1}, and c3=νc_{3}=\nu. Here, the term wk+1w2\|w^{k+1}-w^{*}\|^{2} in (46) comes from the definition of wkw^{k} and the relationship w2+w2w+w2/2\|w\|^{2}+\|w^{\prime}\|^{2}\geq\|w+w^{\prime}\|^{2}/2 obtained by Jensen’s inequality for the terms of yk+1/2y2\|y^{k+1/2}-y^{*}\|^{2} and M1yf(yk+1/2)M1yf(y)2\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}.

Next, utilizing a similar calculation to the proof of [13, Proposition D.4] (see the proof of the second upper bound) and the inclusion ξk+1(I+αM1g)(yk+1)\xi^{k+1}\in(I+\alpha M^{-1}\partial g)(y^{k+1}), we derive the upper bound of zkz2+νζ(yk1/2)ζ(y)2\|z^{k}-z^{*}\|^{2}+\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2} as follows. Here, we set ϵ=c0/C\epsilon=c_{0}/C, and let uAk+1M1g(yk+1)u_{A}^{k+1}\in M^{-1}\partial g(y^{k+1}) and uAM1g(y)u_{A}^{*}\in M^{-1}\partial g(y^{*}) satisfy y^{k+1}+\alpha u_{A}^{k+1}=\xi^{k+1} and y+αuA=ξy^{*}+\alpha u_{A}^{*}=\xi^{*}, respectively:

wkw2νζ(yk1/2)ζ(y)2+yk+1α(uAk+1+M1yf(yk+1/2))\displaystyle\|w^{k}-w^{*}\|^{2}\leq\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+\|y^{k+1}-\alpha(u_{A}^{k+1}+M^{-1}\nabla_{y}f(y^{k+1/2}))
\displaystyle+2(y^{k+1/2}-y^{k+1})-(y^{*}-\alpha u_{A}^{*}-\alpha M^{-1}\nabla_{y}f(y^{*}))\|^{2}
\displaystyle\leq νζ(yk1/2)ζ(y)2+(ξk+1ξ)α(M1yf(yk+1/2)M1yf(y))\displaystyle\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+\|-(\xi^{k+1}-\xi^{*})-\alpha(M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*}))
+2(yk+1/2y)2+ϵzk+1zk2\displaystyle+2(y^{k+1/2}-y^{*})\|^{2}+\epsilon\|z^{k+1}-z^{k}\|^{2}
\displaystyle\leq νζ(yk1/2)ζ(y)2+ϵzk+1zk2+3ξk+1ξ2+12(yk+1/2y)2\displaystyle\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+\epsilon\|z^{k+1}-z^{k}\|^{2}+3\|\xi^{k+1}-\xi^{*}\|^{2}+12\|(y^{k+1/2}-y^{*})\|^{2}
+3α2M1yf(yk+1/2)M1yf(y)2\displaystyle+3\alpha^{2}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}
\displaystyle\leq (ν+9)ζ(yk1/2)ζ(y)2+ϵzk+1zk2\displaystyle(\nu+9)\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+\epsilon\|z^{k+1}-z^{k}\|^{2}
+48(yk+1/2y)2+12α2M1yf(yk+1/2)M1yf(y)2,\displaystyle+48\|(y^{k+1/2}-y^{*})\|^{2}+12\alpha^{2}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}, (47)

where the last line follows from ξk+1ξ23ζ(yk1/2)ζ(y)2+12(yk+1/2y)2+3α2M1yf(yk+1/2)M1yf(y)2\|\xi^{k+1}-\xi^{*}\|^{2}\leq 3\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+12\|(y^{k+1/2}-y^{*})\|^{2}+3\alpha^{2}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}.

Therefore, for a0=zk+1zk2a_{0}=\|z^{k+1}-z^{k}\|^{2}, a1=(yk+1/2y)2a_{1}=\|(y^{k+1/2}-y^{*})\|^{2}, a2=M1yf(yk+1/2)M1yf(y)2a_{2}=\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}, a3=ζ(yk1/2)ζ(y)2a_{3}=\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}, b0=ϵb_{0}=\epsilon, b1=48b_{1}=48, b2=12α2b_{2}=12\alpha^{2}, and b3=ν+9b_{3}=\nu+9, the linear rate follows from (45), (46), and (47).

8 Conclusion

This note addressed distributed optimization of clique-wise coupled problems via operator splitting. First, we defined the CD matrix and a new mixing matrix and analyzed their properties. Then, using the CD matrix, we presented the CD-DYS algorithm via the Davis-Yin splitting (DYS). Subsequently, its connection to consensus optimization methods such as the NIDS was also analyzed. Moreover, we presented a new linear convergence rate not only for the NIDS with non-smooth terms but also for the general DYS with a projection onto a subspace. Finally, we demonstrated the effectiveness of the proposed approach via numerical examples.

References

  • [1] Sulaiman A Alghunaim, Ernest K Ryu, Kun Yuan, and Ali H Sayed. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 66(6):2787–2794, 2020.
  • [2] Miguel F Anjos and Jean B Lasserre. Handbook on semidefinite, conic and polynomial optimization, volume 166. Springer Science & Business Media, 2011.
  • [3] Francisco J Aragón-Artacho and David Torregrosa-Belén. A direct proof of convergence of Davis–Yin splitting algorithm allowing larger stepsizes. Set-Valued and Variational Analysis, 30(3):1011–1029, 2022.
  • [4] Heinz H Bauschke and Patrick L Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2017.
  • [5] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • [6] Mattia Bianchi and Sergio Grammatico. The end: Estimation network design for games under partial-decision information. IEEE Transactions on Control of Network Systems, 2024.
  • [7] Béla Bollobás. Modern Graph Theory, volume 184. Springer Science & Business Media, 1998.
  • [8] Francesco Bullo. Lectures on Network Systems, volume 1. Kindle Direct Publishing Seattle, DC, USA, 2020.
  • [9] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
  • [10] Jianshu Chen and Ali H Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289–4305, 2012.
  • [11] Laurent Condat, Daichi Kitahara, Andrés Contreras, and Akira Hirabayashi. Proximal splitting algorithms for convex optimization: A tour of recent advances, with new twists. SIAM Review, 65(2):375–435, 2023.
  • [12] Laurent Condat and Peter Richtárik. Randprox: Primal-dual optimization algorithms with randomized proximal updates. arXiv preprint arXiv:2207.12891, 2022.
  • [13] Damek Davis and Wotao Yin. A three-operator splitting scheme and its optimization applications. Set-Valued and Variational Analysis, 25:829–858, 2017.
  • [14] Mituhiro Fukuda, Masakazu Kojima, Kazuo Murota, and Kazuhide Nakata. Exploiting sparsity in semidefinite programming via matrix completion i: General framework. SIAM Journal on optimization, 11(3):647–674, 2001.
  • [15] David Hallac, Jure Leskovec, and Stephen Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 387–396, 2015.
  • [16] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.
  • [17] Puya Latafat, Nikolaos M Freris, and Panagiotis Patrinos. A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization. IEEE Transactions on Automatic Control, 64(10):4050–4065, 2019.
  • [18] Jongmin Lee, Soheun Yi, and Ernest K Ryu. Convergence analyses of davis-yin splitting via scaled relative graphs. arXiv preprint arXiv:2207.04015, 2022.
  • [19] Huaqing Li, Enbing Su, Chengbo Wang, Jiawei Liu, Zuqing Zheng, Zheng Wang, and Dawen Xia. A primal-dual forward-backward splitting algorithm for distributed convex optimization. IEEE Transactions on Emerging Topics in Computational Intelligence, 2021.
  • [20] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019.
  • [21] Shu Liang, George Yin, et al. Distributed smooth convex optimization with coupled constraints. IEEE Transactions on Automatic Control, 65(1):347–353, 2019.
  • [22] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  • [23] Yurii Evgen’evich Nesterov. A method of solving a convex programming problem with convergence rate O(1/k2)O\bigl{(}1/k^{2}\bigr{)}. In Doklady Akademii Nauk, volume 269, pages 543–547. Russian Academy of Sciences, 1983.
  • [24] Ernest K Ryu and Wotao Yin. Large-scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press, 2022.
  • [25] Kazunori Sakurama. Unified formulation of multiagent coordination with relative measurements. IEEE Transactions on Automatic Control, 66(9):4101–4116, 2020.
  • [26] Kazunori Sakurama and Toshiharu Sugie. Generalized coordination of multi-robot systems. Foundations and Trends® in Systems and Control, 9(1):1–170, 2022.
  • [27] Ali H Sayed. Diffusion adaptation over networks. In Academic Press Library in Signal Processing, volume 3, pages 323–453. Elsevier, 2014.
  • [28] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015.
  • [29] Lieven Vandenberghe and Martin S Andersen. Chordal graphs and semidefinite optimization. Foundations and Trends® in Optimization, 1(4):241–433, 2015.
  • [30] Yuto Watanabe and Kazunori Sakurama. Accelerated distributed projected gradient descent for convex optimization with clique-wise coupled constraints. In Proceedings of the 22nd IFAC World Congress, 2023.
  • [31] Yuto Watanabe and Kazunori Sakurama. Distributed optimization of clique-wise coupled problems. In 2023 62nd IEEE Conference on Decision and Control (CDC), pages 296–302. IEEE, 2023.
  • [32] Lin Xiao, Stephen Boyd, and Sanjay Lall. Distributed average consensus with time-varying metropolis weights. Automatica, 1:1–4, 2006.
  • [33] Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Distributed algorithms for composite optimization: Unified framework and convergence analysis. IEEE Transactions on Signal Processing, 69:3555–3570, 2021.
  • [34] Isao Yamada. The hybrid steepest descent method for the variational inequality problem over the intersection of fixed point sets of nonexpansive mappings. Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications, 8:473–504, 2001.
  • [35] Isao Yamada, Nobuhiko Ogura, and Nobuyasu Shirakawa. A numerically robust hybrid steepest descent method for the convexly constrained generalized inverse problems. Contemporary Mathematics, 313:269–305, 2002.
  • [36] Ming Yan. A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. Journal of Scientific Computing, 76:1698–1717, 2018.
  • [37] Soheun Yi and Ernest K Ryu. Convergence analyses of davis-yin splitting via scaled relative graphs ii: Convex optimization problems. arXiv preprint arXiv:2211.15604, 2022.
  • [38] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for distributed optimization and learning—part I: Algorithm development. IEEE Transactions on Signal Processing, 67(3):708–723, 2018.
  • [39] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for distributed optimization and learning—part II: Convergence analysis. IEEE Transactions on Signal Processing, 67(3):724–739, 2018.
  • [40] Yu Zhang and Qiang Yang. An overview of multi-task learning. National Science Review, 5(1):30–43, 2018.
  • [41] Yang Zheng, Giovanni Fantuzzi, and Antonis Papachristodoulou. Chordal and factor-width decompositions for scalable semidefinite and polynomial optimization. Annual Reviews in Control, 52:243–279, 2021.
  • [42] Yang Zheng, Maryam Kamgarpour, Aivar Sootla, and Antonis Papachristodoulou. Distributed design for decentralized control using chordal decomposition and ADMM. IEEE Transactions on Control of Network Systems, 7(2):614–626, 2019.

Appendix A Application examples

In addition to the examples in Section 6, we here present various application examples of Problem (2).

Formation control

A formation control problem aims to steer the positions of robots to a desired configuration and has been actively investigated for the past two decades. For a multi-agent system over undirected graph 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}), the most basic formulation of this problem is

minimizexidi,i𝒩{i,j}xixjdij2,\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}x_{i}\in\mathbb{R}^{d_{i}},\,i\in\mathcal{N}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{\{i,j\}\in\mathcal{E}}\|x_{i}-x_{j}-d_{ij}\|^{2},\end{array} (49)

where xix_{i} is the position of agent ii, and dijd_{ij} is the desired relative position from xjx_{j} to xix_{i}. By assigning 𝒬𝒢=𝒬𝒢edge\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}, one can obtain the desired configuration via the proposed CD-DYS. Note that one can also deal with various constraints in the clique-wise coupled framework, e.g., an agent-wise constraint xiΩix_{i}\in\Omega_{i} and a pairwise distance constraint δ¯ijxixjδ¯ij\underline{\delta}_{ij}\leq\|x_{i}-x_{j}\|\leq\overline{\delta}_{ij}. In addition, the proposed framework allows us to achieve the desired formation in a distributed manner even for linear multi-agent systems, as shown in [17], and in the case where each agent has no access to the global coordinate frame and can only use information from relative measurements, as shown in [25, 26].
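
As a simple illustration (not the CD-DYS itself), the following NumPy sketch runs a plain distributed gradient step on the objective (49); the three-agent path graph, the desired offsets dijd_{ij}, and the stepsize are made up for illustration, and each agent's update only uses information from its neighbors.

import numpy as np

# Toy setup: a path graph of 3 agents in R^2 with hypothetical desired offsets.
edges = [(0, 1), (1, 2)]
d = {(0, 1): np.array([1.0, 0.0]), (1, 2): np.array([1.0, 0.0])}
x = np.random.randn(3, 2)                  # agent positions

def gradient(x):
    """Gradient of sum_{(i,j) in E} ||x_i - x_j - d_ij||^2."""
    g = np.zeros_like(x)
    for (i, j) in edges:
        r = x[i] - x[j] - d[(i, j)]        # residual on edge {i,j}
        g[i] += 2 * r
        g[j] -= 2 * r
    return g

alpha = 0.1
for _ in range(200):
    x = x - alpha * gradient(x)            # neighbor-only gradient step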

Network Lasso

The network lasso is an optimization-based machine-learning technique accounting for network structures. For a multi-agent system over graph 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}), a network lasso problem [15] is given as follows:

\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}x_{i}\in\mathbb{R}^{m},\,i\in\mathcal{N}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{n}\hat{f}_{i}(x_{i})+\lambda\sum_{\{i,j\}\in\mathcal{E}}w_{ij}\|x_{i}-x_{j}\|,\end{array} (51)

where λ>0\lambda>0 and wij>0w_{ij}>0 for {i,j}\{i,j\}\in\mathcal{E}. This problem can be seen as a special case of Problem (2). Owing to the second term in (51), neighboring nodes are more likely to form a cluster, i.e., to take close values. Applications of the Network Lasso include the estimation of home prices [15], where there is a spatial interdependence among house prices.
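
As an illustration of the edge-wise (clique-wise) structure, the sketch below evaluates the proximal operator of a single coupling term λwijxixj\lambda w_{ij}\|x_{i}-x_{j}\|, which is the building block a splitting method would use for (51). The closed form via block soft-thresholding of the difference is standard; the function name and parameters are ours.

import numpy as np

def prox_edge_coupling(a, b, lam_w, alpha=1.0):
    """Prox of h(u, v) = alpha * lam_w * ||u - v|| evaluated at the point (a, b).

    Via the change of variables s = (u + v)/2 and d = u - v, the average s is
    unchanged and the difference d is block-soft-thresholded with threshold
    2 * alpha * lam_w.
    """
    s = 0.5 * (a + b)
    diff = a - b
    norm = np.linalg.norm(diff)
    shrink = max(0.0, 1.0 - 2.0 * alpha * lam_w / norm) if norm > 0 else 0.0
    d = shrink * diff
    return s + 0.5 * d, s - 0.5 * d

The average of the pair is preserved while the difference is shrunk toward zero, which is exactly the clustering effect of the coupling term described above.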

Sparse semidefinite programming

Figure 6: Example of a system with three nodes in Example 1.

Semidefinite programming via chordal graph decomposition has been actively studied not only in optimization [29, 14] but also in control [41] as an efficient and scalable computation scheme exploiting the sparsity of matrices that naturally arises from underlying networked structures of problems. This type of problem can also be solved in a distributed manner based on the framework of clique-wise coupling.

Consider a multi-agent system over 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}) and the following standard semidefinite programming

minimizeyi,i𝒩,Zi=1nbiyi+δ𝕊+n(Z)subject toZ+j=1pAjyj=C,\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}y_{i}\in\mathbb{R},\,i\in\mathcal{N},\,Z\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{n}b_{i}^{\top}y_{i}+\delta_{\mathbb{S}^{n}_{+}}(Z)\\ {\mbox{subject to}}&\displaystyle Z+\sum_{j=1}^{p}A_{j}y_{j}=C,\end{array} (54)

where we consider that agent ii possesses yiy_{i} and iith column of ZZ. Here, 𝕊+n\mathbb{S}_{+}^{n} represents the set of n×nn\times n positive semidefinite matrices. This problem cannot be solved in a distributed manner by standard algorithms due to the undecomposable constraint Z𝕊+nZ\in\mathbb{S}^{n}_{+}. Nevertheless, if Z,A1,,Ap,CZ,\,A_{1},\ldots,A_{p},\,C have the sparsity with respect to 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}) and graph 𝒢\mathcal{G} is chordal, [29, 41] show that this problem can be equivalently transformed into the following decomposed form with smaller positive semidefinite constraints:

minimizeyi,i𝒩,Zl,l𝒬𝒢maxi=1nbiyi+l𝒬𝒢maxδ𝕊+|𝒞l|(Zl)subject toj=1pAjyj+l𝒬𝒢maxDlZlDl=C.\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}y_{i}\in\mathbb{R},\,i\in\mathcal{N},\,Z_{l},\,l\in\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{n}b_{i}^{\top}y_{i}+\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}}\delta_{\mathbb{S}_{+}^{|{\mathcal{C}_{l}}|}}(Z_{l})\\ {\mbox{subject to}}&\displaystyle\sum_{j=1}^{p}A_{j}y_{j}+\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}}D_{l}^{\top}Z_{l}D_{l}=C.\end{array} (57)

Moreover, when \sum_{j=1}^{p}A_{j}y_{j} in (57) can be rewritten as

j=1pAjyj=MY+YN\sum_{j=1}^{p}A_{j}y_{j}=MY+YN (58)

with Y=diag(y1,,yn)Y=\text{diag}(y_{1},\ldots,y_{n}) and some matrices M,NM,N with the sparsity with respect to 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}), we can reformulate Problem (57) into a clique-wise coupled problem in (2) by introducing auxiliary variables. For example, for the system with n=3n=3 in Fig. 6, which is a chordal graph with maximal cliques 𝒞1={1,2}\mathcal{C}_{1}=\{1,2\} and 𝒞2={2,3}\mathcal{C}_{2}=\{2,3\}, Problem (57) with (58) reduces to

minimizey1,y2,y3,Z1,Z2𝕊2i=13biyi+δ𝕊+2(Z1)+δ𝕊+2(Z2)subject to[m11y1+n11y1m12y1+n12y20m21y2+n21y1m22y2+n22y2m23y2+n23y30m32y3+n32y2m33y3+n33y3]+[Z100000]+[00000Z2]=[c11c120c21c22c230c32c33],\displaystyle\begin{array}[]{cl}&\underset{\begin{subarray}{c}y_{1},y_{2},y_{3}\in\mathbb{R},\,Z_{1},Z_{2}\in\mathbb{S}^{2}\end{subarray}}{\mbox{minimize}}\quad\displaystyle\sum_{i=1}^{3}b_{i}^{\top}y_{i}+\delta_{\mathbb{S}_{+}^{2}}(Z_{1})+\delta_{\mathbb{S}_{+}^{2}}(Z_{2})\\ &{\mbox{subject to}}\\ &\begin{bmatrix}m_{11}y_{1}+n_{11}y_{1}&m_{12}y_{1}+n_{12}y_{2}&0\\ m_{21}y_{2}+n_{21}y_{1}&m_{22}y_{2}+n_{22}y_{2}&m_{23}y_{2}+n_{23}y_{3}\\ 0&m_{32}y_{3}+n_{32}y_{2}&m_{33}y_{3}+n_{33}y_{3}\end{bmatrix}+\begin{bmatrix}\text{\large{$Z_{1}$}}&\begin{matrix}0\\ 0\end{matrix}\\ \begin{matrix}0&0\end{matrix}&0\end{bmatrix}+\begin{bmatrix}0&\begin{matrix}0&0\end{matrix}\\ \begin{matrix}0\\ 0\end{matrix}&\text{\large{$Z_{2}$}}\end{bmatrix}=\begin{bmatrix}c_{11}&c_{12}&0\\ c_{21}&c_{22}&c_{23}\\ 0&c_{32}&c_{33}\end{bmatrix},\end{array} (62)

where mijm_{ij}, nijn_{ij}, cijc_{ij} are the i,ji,j entries of MM, NN, and CC, respectively. Hence, by decomposing the constraint in (62) into clique-wise coupled constraints by using the auxiliary variables z^2,11\hat{z}_{2,11} and z^1,22\hat{z}_{1,22} as

[m11y1+n11y1m12y1+n12y2m21y2+n21y1m22y2+n22y2]+Z1+[000z^2,11]=[c11c12c21c22]\displaystyle\begin{bmatrix}m_{11}y_{1}+n_{11}y_{1}&m_{12}y_{1}+n_{12}y_{2}\\ m_{21}y_{2}+n_{21}y_{1}&m_{22}y_{2}+n_{22}y_{2}\end{bmatrix}+Z_{1}+\begin{bmatrix}0&0\\ 0&\hat{z}_{2,11}\end{bmatrix}=\begin{bmatrix}c_{11}&c_{12}\\ c_{21}&c_{22}\\ \end{bmatrix} (63)
[m22y2+n22y2m23y2+n23y3m32y3+n32y2m33y3+n33y3]+Z2+[z^1,22000]=[c22c32c32c33]\displaystyle\begin{bmatrix}m_{22}y_{2}+n_{22}y_{2}&m_{23}y_{2}+n_{23}y_{3}\\ m_{32}y_{3}+n_{32}y_{2}&m_{33}y_{3}+n_{33}y_{3}\end{bmatrix}+Z_{2}+\begin{bmatrix}\hat{z}_{1,22}&0\\ 0&0\end{bmatrix}=\begin{bmatrix}c_{22}&c_{32}\\ c_{32}&c_{33}\\ \end{bmatrix} (64)
z1,22=z^1,22,z2,11=z^2,11,\displaystyle z_{1,22}=\hat{z}_{1,22},\quad z_{2,11}=\hat{z}_{2,11}, (65)

where zl,ijz_{l,ij} represents the i,ji,j entry of ZlZ_{l}, we can obtain an equivalent clique-wise coupled problem in the following with x1=[y1;vec([Z1]1)]x_{1}=[y_{1};\mathrm{vec}([Z_{1}]_{1})], x2=[y2;vec([Z1]2);vec([Z2]1);z^1,22;z^2,11]x_{2}=[y_{2};\mathrm{vec}([Z_{1}]_{2});\mathrm{vec}([Z_{2}]_{1});\hat{z}_{1,22};\hat{z}_{2,11}], x3=[y3;vec([Z2]2)]x_{3}=[y_{3};\mathrm{vec}([Z_{2}]_{2})], where vec([Zl]i)\mathrm{vec}([Z_{l}]_{i}) represents the iith column of the matrix ZlZ_{l}:

minimizey1,y2,y3Z1,Z2𝕊2i=13biyi+δ𝕊+2(Z1)+δ𝕊+2(Z2)subject to(63)–(65)\displaystyle\begin{array}[]{cll}&\underset{\begin{subarray}{c}y_{1},y_{2},y_{3}\in\mathbb{R}\\ Z_{1},Z_{2}\in\mathbb{S}^{2}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{3}b_{i}^{\top}y_{i}+\delta_{\mathbb{S}_{+}^{2}}(Z_{1})+\delta_{\mathbb{S}_{+}^{2}}(Z_{2})\\ &{\mbox{subject to}}&\text{\eqref{sdp_decomposed_constraints_1}--\eqref{sdp_decomposed_constraints_3}}\end{array} (68)

Semidefinite programming in the form of (57) with (58) arises in practical problems, e.g., distributed design of decentralized controllers for linear networked systems [42] and sensor network localization [2]. Note that one can extend the discussion above to higher dimensional vectors yiy_{i} and block-partitioned matrices ZZ, as shown in [41].
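
To make the decomposition concrete, the following NumPy sketch numerically checks, for random data with the sparsity pattern of the three-node example in Fig. 6, that choosing the auxiliary copies as in (65) makes the clique-wise constraints (63)–(64) reproduce the coupled constraint (62); all variable names and the synthetic data are ours.

import numpy as np

rng = np.random.default_rng(0)
# Random M, N with the sparsity of Fig. 6 (entries (1,3) and (3,1) vanish).
M = rng.standard_normal((3, 3)); M[0, 2] = M[2, 0] = 0.0
N = rng.standard_normal((3, 3)); N[0, 2] = N[2, 0] = 0.0
y = rng.standard_normal(3)

def entrywise_A(y):
    """The sparse matrix in y written entrywise as in (62)."""
    return np.array([
        [M[0, 0]*y[0] + N[0, 0]*y[0], M[0, 1]*y[0] + N[0, 1]*y[1], 0.0],
        [M[1, 0]*y[1] + N[1, 0]*y[0], M[1, 1]*y[1] + N[1, 1]*y[1], M[1, 2]*y[1] + N[1, 2]*y[2]],
        [0.0, M[2, 1]*y[2] + N[2, 1]*y[1], M[2, 2]*y[2] + N[2, 2]*y[2]],
    ])

Z1 = rng.standard_normal((2, 2)); Z1 = Z1 + Z1.T      # symmetric clique blocks
Z2 = rng.standard_normal((2, 2)); Z2 = Z2 + Z2.T
A = entrywise_A(y)
C = A.copy(); C[:2, :2] += Z1; C[1:, 1:] += Z2        # choose C so that (62) holds

# Auxiliary copies as in (65), then the clique-wise constraints (63)-(64).
z_hat_122, z_hat_211 = Z1[1, 1], Z2[0, 0]
lhs_63 = A[:2, :2] + Z1 + np.array([[0.0, 0.0], [0.0, z_hat_211]])
lhs_64 = A[1:, 1:] + Z2 + np.array([[z_hat_122, 0.0], [0.0, 0.0]])
assert np.allclose(lhs_63, C[:2, :2]) and np.allclose(lhs_64, C[1:, 1:])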

Approximating trace norm minimization problems

Trace norm minimization is a powerful technique in machine learning and computer vision that can obtain a low-rank matrix L^\hat{L} representing the underlying structure of the data. Its applications include the Robust PCA (RPCA) and multi-task learning problems.

For example, we can relax an RPCA problem to a clique-wise coupled problem as follows. Consider a data matrix Yd×nY\in\mathbb{R}^{d\times n}. Then, a standard form of RPCA is formulated as follows:

minimizeS^,L^S^1+θL^subject toS^+L^=Y.\displaystyle\underset{\hat{S},\hat{L}}{\mbox{minimize}}\;\;\displaystyle\|\hat{S}\|_{1}+\theta\|\hat{L}\|_{*}\;\;\displaystyle\mbox{subject to}\;\;\hat{S}+\hat{L}=Y. (69)

By solving this problem, we can decompose a data matrix YY into two components: a low-rank matrix L^\hat{L} representing the underlying structure of the data and a sparse matrix S^\hat{S} capturing the outliers or noise. Consider that for a multi-agent system with nn agents and Y=[Y1,,Yn]Y=[Y_{1},\ldots,Y_{n}], agent ii possesses the matrix YiY_{i}. Then the robust PCA problem with the clique-based relaxation is formulated as follows:

\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}\hat{S}_{i},\,i\in\mathcal{N}\\ \hat{L}_{l}\in\mathbb{R}^{640\times 40|{\mathcal{C}_{l}}|},\,l\in\mathcal{Q}_{\mathcal{G}}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i\in\mathcal{N}}\|\hat{S}_{i}\|_{1}+\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\theta_{l}\|\hat{L}_{l}\|_{*}\\ \displaystyle\mbox{subject to}&\hat{S}_{\mathcal{C}_{l}}+\hat{L}_{l}=Y_{\mathcal{C}_{l}}\quad\forall l\in\mathcal{Q}_{\mathcal{G}},\end{array} (72)

where S^𝒞l=[S^j1,,S^j|𝒞l|]\hat{S}_{\mathcal{C}_{l}}=[\hat{S}_{j_{1}},\ldots,\hat{S}_{j_{|{\mathcal{C}_{l}}|}}] and Y𝒞l=[Yj1,,Yj|𝒞l|]Y_{\mathcal{C}_{l}}=[Y_{j_{1}},\ldots,Y_{j_{|{\mathcal{C}_{l}}|}}] for 𝒞l={j1,,j|𝒞l|}{\mathcal{C}_{l}}=\{j_{1},\ldots,j_{|{\mathcal{C}_{l}}|}\}. Here, S^i\hat{S}_{i} and L^l\hat{L}_{l} correspond to xix_{i} and yly_{l} in Problem (2). Although Problem (72) involves a relaxation, one can still obtain a low-rank matrix by solving it.
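
The two proximal operators that a splitting method for (69) or (72) would alternate between have simple closed forms; a minimal sketch (with our own function names) is given below. In the clique-wise problem (72), these maps would be applied blockwise to S^𝒞l\hat{S}_{\mathcal{C}_{l}} and L^l\hat{L}_{l}.

import numpy as np

def soft_threshold(X, tau):
    """Prox of tau * ||X||_1: entrywise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Prox of tau * ||X||_*: shrink the singular values by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt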

Appendix B Proof of Lemma 2

(a) We prove the statement by contradiction. Assume that the CD matrix 𝐃\mathbf{D} does not have full column rank. Then, there exists a vector 𝐯=[v1,,vn]0\mathbf{v}=[v_{1}^{\top},\ldots,v_{n}^{\top}]^{\top}\neq 0 with vidiv_{i}\in\mathbb{R}^{d_{i}} such that 𝐃𝐯=0\mathbf{D}\mathbf{v}=0. This yields Dl𝐯=0D_{l}\mathbf{v}=0 for all l𝒬𝒢l\in\mathcal{Q}_{\mathcal{G}}. Hence, we obtain Ei𝐯=vi=0E_{i}\mathbf{v}=v_{i}=0 for all i𝒩i\in\mathcal{N} from Assumption 1. This contradicts 𝐯0\mathbf{v}\neq 0.

(b) For 𝐃\mathbf{D}, we have 𝐃𝐃=l𝒬𝒢DlDl=l𝒬𝒢j𝒞lEjEj=i=1nl𝒬𝒢iEiEi=i=1n|𝒬𝒢i|EiEi\mathbf{D}^{\top}\mathbf{D}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}D_{l}^{\top}D_{l}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in{\mathcal{C}_{l}}}E_{j}^{\top}E_{j}=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{i}^{\top}E_{i}=\sum_{i=1}^{n}|\mathcal{Q}_{\mathcal{G}}^{i}|E_{i}^{\top}E_{i} from Definition 1. Here, EiEi=blk-diag(Od1×d1,,Idi,,Odn×dn)E_{i}^{\top}E_{i}=\text{blk-diag}(O_{d_{1}\times d_{1}},\ldots,I_{d_{i}},\ldots,O_{d_{n}\times d_{n}}) holds. Therefore, we obtain 𝐃𝐃=blk-diag(|𝒬𝒢1|Id1,,|𝒬𝒢n|Idn)\mathbf{D}^{\top}\mathbf{D}=\text{blk-diag}(|\mathcal{Q}_{\mathcal{G}}^{1}|I_{d_{1}},\ldots,|\mathcal{Q}_{\mathcal{G}}^{n}|I_{d_{n}}). 𝐃𝐃O\mathbf{D}^{\top}\mathbf{D}\succ O follows from Assumption 1.

(c) It holds that 𝐃𝐲=l𝒬𝒢Dlyl=l𝒬𝒢j𝒞lEj(El,jyl)=i=1nl𝒬𝒢iEiEl,iyl=i=1nEi(l𝒬𝒢iEl,iyl)\mathbf{D}^{\top}\mathbf{y}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}D_{l}^{\top}y_{l}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in{\mathcal{C}_{l}}}E_{j}^{\top}(E_{l,j}y_{l})=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{i}^{\top}E_{l,i}y_{l}=\sum_{i=1}^{n}E_{i}^{\top}(\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}y_{l}). Hence, we obtain (12).
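
A small numerical illustration of Lemma 2b: for a toy cover with two cliques and scalar agents, the CD matrix built from the selection matrices EiE_{i} satisfies 𝐃𝐃=blk-diag(|𝒬𝒢1|Id1,,|𝒬𝒢n|Idn)\mathbf{D}^{\top}\mathbf{D}=\text{blk-diag}(|\mathcal{Q}_{\mathcal{G}}^{1}|I_{d_{1}},\ldots,|\mathcal{Q}_{\mathcal{G}}^{n}|I_{d_{n}}). The clique cover and all variable names below are hypothetical.

import numpy as np

# Toy example: n = 4 scalar agents (d_i = 1), cliques C_1 = {1,2,3}, C_2 = {3,4}.
n = 4
cliques = [[0, 1, 2], [2, 3]]                              # zero-based indices
E = [np.eye(n)[[i], :] for i in range(n)]                  # E_i selects agent i's block
D_blocks = [np.vstack([E[j] for j in C]) for C in cliques]  # D_l = [E_j]_{j in C_l}
D = np.vstack(D_blocks)                                    # CD matrix D = [D_l]_l

counts = np.array([sum(i in C for C in cliques) for i in range(n)])  # |Q_G^i|
assert np.allclose(D.T @ D, np.diag(counts))               # Lemma 2b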

Appendix C Proof of Proposition 1

(a) For 𝐳Im(𝐃)\mathbf{z}\in\mathrm{Im}(\mathbf{D}), there exists some 𝐱d\mathbf{x}\in\mathbb{R}^{d} such that 𝐳=𝐃𝐱\mathbf{z}=\mathbf{D}\mathbf{x}. Then, we obtain

proxαG(𝐲)=𝐃argmin𝐱d(12α𝐲𝐃𝐱2+i=1ng^i(Ei𝐱))\displaystyle\mathrm{prox}_{\alpha G}(\mathbf{y})=\mathbf{D}\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{d}}(\frac{1}{2\alpha}\|\mathbf{y}-\mathbf{D}\mathbf{x}\|^{2}+\sum_{i=1}^{n}\hat{g}_{i}(E_{i}\mathbf{x}))
=𝐃argmin𝐱d(i=1n(l𝒬𝒢i12αEl,iylxi2+g^i(xi)))\displaystyle=\mathbf{D}\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{d}}(\sum_{i=1}^{n}(\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{2\alpha}\|E_{l,i}y_{l}-x_{i}\|^{2}+\hat{g}_{i}(x_{i})))
=𝐃argmin𝐱d(i=1n(|𝒬𝒢i|2αl𝒬𝒢i1|𝒬𝒢i|El,iylxi2+g^i(xi))).\displaystyle=\mathbf{D}\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{d}}(\sum_{i=1}^{n}(\frac{|\mathcal{Q}_{\mathcal{G}}^{i}|}{2\alpha}\|\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}E_{l,i}y_{l}-x_{i}\|^{2}+\hat{g}_{i}(x_{i}))).

Therefore, we obtain (15). Note that the last line can be verified by considering the optimality condition.

(b) This can be proved in the same way as Proposition 1a with an easy modification from the definition of 𝐐\mathbf{Q}.

(c) By the chain rule, we have 𝐲f^i(Ei(𝐃𝐃)1𝐃𝐲)=𝐃(𝐃𝐃)1Eixif^i(Ei(𝐃𝐃)1𝐃𝐲)\frac{\partial}{\partial\mathbf{y}}\hat{f}_{i}(E_{i}(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})=\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}E_{i}^{\top}\nabla_{x_{i}}\hat{f}_{i}(E_{i}(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}), which gives the claim.

Appendix D Proof of Proposition 2

(a) For 𝐐\mathbf{Q}, we obtain 𝐐𝐃=[QlDl]l𝒬𝒢\mathbf{Q}\mathbf{D}=[Q_{l}D_{l}]_{l\in\mathcal{Q}_{\mathcal{G}}}. Then,

𝐃𝐐𝐃=l𝒬𝒢DlQlDl=l𝒬𝒢j𝒞l1|𝒬𝒢j|EjEj.\mathbf{D}^{\top}\mathbf{Q}\mathbf{D}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}D_{l}^{\top}Q_{l}D_{l}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in{\mathcal{C}_{l}}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}E_{j}^{\top}E_{j}.

Thus, following the same calculation as the proof of Lemma 2b gives 𝐃𝐐𝐃=Id\mathbf{D}^{\top}\mathbf{Q}\mathbf{D}=I_{d}.

(b) For any 𝐲=[yl]l𝒬𝒢d^\mathbf{y}=[y_{l}]_{l\in\mathcal{Q}_{\mathcal{G}}}\in\mathbb{R}^{\hat{d}}, it holds that

𝐃𝐐𝐲=l𝒬𝒢DlQlyl=l𝒬𝒢j𝒞l1|𝒬𝒢j|EjEl,jyl.\mathbf{D}^{\top}\mathbf{Q}\mathbf{y}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}D_{l}^{\top}Q_{l}y_{l}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in{\mathcal{C}_{l}}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}E_{j}^{\top}E_{l,j}y_{l}.

Hence, reorganizing this and using the proof of Lemma 2c yield

𝐃𝐐𝐲=i=1n1|𝒬𝒢i|Eil𝒬𝒢iEl,iyl=blk-diag([1|𝒬𝒢i|Idi]i𝒩)𝐃𝐲.\mathbf{D}^{\top}\mathbf{Q}\mathbf{y}=\sum_{i=1}^{n}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}E_{i}^{\top}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}y_{l}=\text{blk-diag}([\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}I_{d_{i}}]_{i\in\mathcal{N}})\mathbf{D}^{\top}\mathbf{y}.

Therefore, we obtain 𝐃𝐐=(𝐃𝐃)1𝐃\mathbf{D}^{\top}\mathbf{Q}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top} from Lemma 2b. The latter equation is also proved in the same way.

(c) From Proposition 2b and Assumption 1, it holds that 𝐃=(𝐃𝐃)1𝐃𝐐1\mathbf{D}^{\top}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{Q}^{-1}. For the transpose of this matrix, multiplying 𝐃𝐃\mathbf{D}^{\top}\mathbf{D} from the right side gives 𝐐1𝐃=𝐃(𝐃𝐃)\mathbf{Q}^{-1}\mathbf{D}=\mathbf{D}(\mathbf{D}^{\top}\mathbf{D}). The latter equation is also proved in the same manner.
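
Continuing the toy example used for Lemma 2, the following sketch checks Proposition 2a-b numerically, with QlQ_{l} built from the weights 1/|𝒬𝒢j|1/|\mathcal{Q}_{\mathcal{G}}^{j}| for j𝒞lj\in{\mathcal{C}_{l}}; the cover and names are again illustrative.

import numpy as np

# Same toy cover: 4 scalar agents, cliques C_1 = {1,2,3}, C_2 = {3,4}.
n, cliques = 4, [[0, 1, 2], [2, 3]]
E = [np.eye(n)[[i], :] for i in range(n)]
D = np.vstack([np.vstack([E[j] for j in C]) for C in cliques])
counts = np.array([sum(i in C for C in cliques) for i in range(n)])   # |Q_G^i|

# Q = blk-diag(Q_l) with Q_l = diag(1/|Q_G^j|, j in C_l).
Q_blocks = [np.diag([1.0 / counts[j] for j in C]) for C in cliques]
Q = np.zeros((D.shape[0], D.shape[0]))
r = 0
for B in Q_blocks:
    Q[r:r + B.shape[0], r:r + B.shape[0]] = B
    r += B.shape[0]

assert np.allclose(D.T @ Q @ D, np.eye(n))                    # Proposition 2a
assert np.allclose(D.T @ Q, np.linalg.inv(D.T @ D) @ D.T)     # Proposition 2b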

Appendix E Connection of the CD-DYS to other distributed optimization algorithms

We here present a comprehensive analysis for the diagram in Fig. 3 by deriving the CPGD algorithm of [30] and Appendix F.

Exact diffusion and Diffusion algorithms

Over undirected graphs, the exact diffusion algorithm is just a special case of the NIDS. In the case of g^i=0\hat{g}_{i}=0 for all i𝒩i\in\mathcal{N}, the NIDS reduces to the Exact diffusion [38, 39], which is given as follows:

𝐱k+1\displaystyle\!\!\!\!\mathbf{x}^{k+1} =𝐖~(2𝐱k𝐱k1+α(𝐱f^(𝐱k1)𝐱f^(𝐱k))).\displaystyle=\widetilde{\mathbf{W}}(2\mathbf{x}^{k}-\mathbf{x}^{k-1}+\alpha(\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))).\!\!\!\! (73)

This can be rewritten as follows:

𝐯k+1=𝐱kα𝐱f^(𝐱k)𝐱k+1=𝐖~(𝐯k+1+𝐱k𝐯k).\displaystyle\begin{array}[]{lll}\mathbf{v}^{k+1}=&\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\\ \mathbf{x}^{k+1}=&\widetilde{\mathbf{W}}(\mathbf{v}^{k+1}+\mathbf{x}^{k}-\mathbf{v}^{k}).\end{array} (76)

These updates converge exactly to an optimal solution under mild conditions. Note that the Exact diffusion is also valid for directed networks and non-doubly stochastic 𝐖\mathbf{W}. For details, see [38, 39].

The diffusion algorithm [27, 10] is an early distributed optimization algorithm, given as

𝐱k+1=𝐖~(𝐱kα𝐱f^(𝐱k)).\displaystyle\begin{array}[]{lll}\mathbf{x}^{k+1}=&\widetilde{\mathbf{W}}(\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})).\end{array} (78)

This algorithm is obtained from the NIDS with g^i=0,i𝒩\hat{g}_{i}=0,\,i\in\mathcal{N} (i.e., the Exact diffusion) by approximating 𝐱k𝐯k0\mathbf{x}^{k}-\mathbf{v}^{k}\approx 0 in the second line of (76). Notice that the conditions on 𝐖{\mathbf{W}} for (78) are not equivalent to those for (73) and (76) (see [27, 10, 24, 38, 39]). Although its convergence is inexact for a constant stepsize α\alpha, its simple structure allows us to easily apply it to stochastic and online setups.
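
A schematic NumPy comparison of (76) and (78) on a toy consensus problem is given below; the quadratic local costs, the mixing matrix, the stepsize, and the number of iterations are all made up for illustration, and the comments on the limits reflect the exact-versus-inexact behavior discussed above.

import numpy as np

# Toy problem: f_hat_i(x_i) = 0.5 * (x_i - b_i)^2 for 3 agents; W is a toy mixing matrix.
b = np.array([1.0, 2.0, 6.0])
grad = lambda x: x - b                                          # separable gradient
W = np.array([[.5, .25, .25], [.25, .5, .25], [.25, .25, .5]])  # symmetric, doubly stochastic
alpha = 0.5

x_ed, x_df = b.copy(), b.copy()
v_prev = x_ed.copy()                       # standard initialization for (76)
for _ in range(200):
    # Exact diffusion (76): adapt, then combine with the correction x^k - v^k.
    v = x_ed - alpha * grad(x_ed)
    x_ed = W @ (v + x_ed - v_prev)
    v_prev = v
    # Diffusion (78): adapt, then combine without the correction term.
    x_df = W @ (x_df - alpha * grad(x_df))

print(x_ed)   # approaches the consensus minimizer mean(b) = 3.0 in every entry
print(x_df)   # approaches a biased point (inexact for a fixed stepsize)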

CPGD as a generalization of the diffusion algorithm

Invoking the relationship between NIDS/Exact diffusion and diffusion algorithms, we derive a diffusion-like algorithm from the variable metric CD-DYS in (36) for

minimizexidi,i𝒩i=1nf^i(xi)+l𝒬𝒢δ𝒟l(x𝒞l),\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}x_{i}\in\mathbb{R}^{d_{i}},\,i\in\mathcal{N}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{n}\hat{f}_{i}(x_{i})+\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\delta_{\mathcal{D}_{l}}(x_{\mathcal{C}_{l}}),\end{array} (80)

where 𝒟l\mathcal{D}_{l} is a closed convex set and not limited to (39). The derived algorithm will be formalized as the clique-based projected gradient descent (CPGD) in Appendix F.

We derive the diffusion-like algorithm as follows. From g^i=0\hat{g}_{i}=0, we have 𝐱k=𝐱k=(𝐃𝐃)1𝐃𝐳k\mathbf{x}^{k}=\mathbf{x}^{k-}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k} and (𝐃𝐃)1𝐃𝐲k+1/2=𝐱k(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}^{k+1/2}=\mathbf{x}^{k}. Accordingly, the variable metric CD-DYS in (36) reduces to

𝐱k=(𝐃𝐃)1𝐃𝐳k𝐲k+1=PΠl𝒬𝒢𝒟l𝐐(2𝐃𝐱k𝐳kα𝐃𝐱f^(𝐱k))𝐳k+1=𝐳k+𝐲k+1𝐃𝐱k.\displaystyle\begin{array}[]{lll}&\mathbf{x}^{k}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k}\\ &\mathbf{y}^{k+1}=P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(2\mathbf{D}\mathbf{x}^{k}-\mathbf{z}^{k}-\alpha\mathbf{D}\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\\ &\mathbf{z}^{k+1}=\mathbf{z}^{k}+\mathbf{y}^{k+1}-\mathbf{D}\mathbf{x}^{k}.\end{array}

By using 𝐯k+1\mathbf{v}^{k+1} of the form in (76), we get

𝐯k+1=𝐱kα𝐱f^(𝐱k)\displaystyle\mathbf{v}^{k+1}=\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}) (81)
𝐱k+1=(𝐃𝐃)1𝐃PΠl𝒬𝒢𝒟l𝐐(𝐃𝐯k+1+𝐃𝐱k𝐳k)\displaystyle\mathbf{x}^{k+1}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\mathbf{D}\mathbf{v}^{k+1}+\mathbf{D}\mathbf{x}^{k}-\mathbf{z}^{k})

with 𝐳k\mathbf{z}^{k} from 𝐱k+1=(𝐃𝐃)1𝐃𝐳k+1=(𝐃𝐃)1𝐃(𝐳k+𝐲k+1)𝐱k=(𝐃𝐃)1𝐃𝐲k+1\mathbf{x}^{k+1}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k+1}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}(\mathbf{z}^{k}+\mathbf{y}^{k+1})-\mathbf{x}^{k}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}^{k+1}. In consensus optimization, it can be observed from the previous subsection that PΠl𝒬𝒢𝒟l𝐐()P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\cdot) boils down to a linear map and 𝐳k\mathbf{z}^{k} satisfies PΠl𝒬𝒢𝒟l𝐐(𝐳k)=PΠl𝒬𝒢𝒟l𝐐(𝐃𝐯k)P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\mathbf{z}^{k})=P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\mathbf{D}\mathbf{v}^{k}) because we have

\displaystyle P_{\mathcal{D}_{l}}^{Q_{l}}(z_{l}^{k+1})=P_{\mathcal{D}_{l}}^{Q_{l}}(x^{k}_{\mathcal{C}_{l}}-\alpha D_{l}\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))=P_{\mathcal{D}_{l}}^{Q_{l}}(D_{l}\mathbf{v}^{k+1})

for 𝒟l\mathcal{D}_{l} in (39), as shown in (5.1). Therefore, recalling that the diffusion algorithm (78) can be viewed as (76) with 𝐱k𝐯k0\mathbf{x}^{k}-\mathbf{v}^{k}\approx 0, we can obtain the following diffusion-like algorithm (CPGD) from (81) by the similar approximation \mathbf{D}\mathbf{x}^{k}-\mathbf{z}^{k}\approx 0 in the second line of (81):

𝐱k+1=T(𝐱kα𝐱f^(𝐱k))\displaystyle\begin{array}[]{lll}&\mathbf{x}^{k+1}=T(\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\end{array} (83)

with T:ddT:\mathbb{R}^{d}\to\mathbb{R}^{d} defined as T(𝐱)=(𝐃𝐃)1𝐃PΠl𝒬𝒢𝒟l𝐐(𝐃𝐱)T(\mathbf{x})=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\mathbf{D}\mathbf{x}). Note that the operator TT, which will be defined as the clique-based projection in Appendix F, is equal to the doubly stochastic matrix 𝚽{\boldsymbol{\Phi}} in Proposition 3 for 𝒟l\mathcal{D}_{l} in (39).

Appendix F Clique-based projected gradient descent (CPGD) algorithm

We here formalize the generalization of the diffusion algorithm (CPGD) in (83). We provide a detailed convergence analysis, which guarantees exact convergence under diminishing step sizes and an inexact convergence rate for fixed ones. Moreover, we provide Nesterov’s acceleration and an improved convergence rate.

This section highlights the well-behavedness of clique-wise coupling, which enables theoretical and algorithmic properties similar to those of consensus optimization (the diffusion algorithm).

F.1 Clique-based Projected Gradient Descent (CPGD)

Consider Problem (80) with closed convex sets 𝒟ldl,l𝒬𝒢\mathcal{D}_{l}\subset\mathbb{R}^{d^{l}},\,l\in\mathcal{Q}_{\mathcal{G}}. We suppose Assumptions 1–3.

To this problem, the CPGD is given as follows:

𝐱k+1=Tp(𝐱kλk𝐱f^(𝐱k)),\displaystyle\mathbf{x}^{k+1}=T^{p}(\mathbf{x}^{k}-\lambda^{k}\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})), (84)

where T:ddT:\mathbb{R}^{d}\to\mathbb{R}^{d} is the clique-based projection for

𝒟=l𝒬𝒢{𝐱d:x𝒞l𝒟l},\mathcal{D}=\bigcap_{l\in\mathcal{Q}_{\mathcal{G}}}\{\mathbf{x}\in\mathbb{R}^{d}:x_{\mathcal{C}_{l}}\in\mathcal{D}_{l}\}, (85)

Tp=TTTpT^{p}=\underbrace{T\circ T\circ\cdots\circ T}_{p}, f^(𝐱)=i=1nf^i(xi)\hat{f}(\mathbf{x})=\sum_{i=1}^{n}\hat{f}_{i}(x_{i}), and λk\lambda^{k} is a step size. The clique-based projection TT is defined as follows.

Definition 2.

Suppose Assumption 1. For a non-empty closed convex set 𝒟\mathcal{D} in (85), a graph 𝒢\mathcal{G}, and its cliques 𝒞l,l𝒬𝒢\mathcal{C}_{l},\,l\in\mathcal{Q}_{\mathcal{G}}, the clique-based projection T:ddT:\mathbb{R}^{d}\to\mathbb{R}^{d} of xdx\in\mathbb{R}^{d} onto 𝒟\mathcal{D} is defined as T(𝐱)=[T1(x𝒩1),,Tn(x𝒩n)]T(\mathbf{x})=[T_{1}(x_{\mathcal{N}_{1}})^{\top},\ldots,T_{n}(x_{\mathcal{N}_{n}})^{\top}]^{\top} with

Ti(x𝒩i)=1|𝒬𝒢i|l𝒬𝒢iEl,iP𝒟lQl(x𝒞l)T_{i}(x_{\mathcal{N}_{i}})=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}}) (86)

for each i𝒩i\in\mathcal{N}.

The clique-based projection can be represented as T(𝐱)=(𝐃𝐃)1𝐃PΠl𝒬𝒢𝒟l𝐐(𝐃𝐱).T(\mathbf{x})=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}P^{\mathbf{Q}}_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}(\mathbf{D}\mathbf{x}).
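
For the consensus sets 𝒟l\mathcal{D}_{l} in (39) (the setting of Proposition 3), the QlQ_{l}-weighted clique projection is a weighted average, and TT can be sketched as follows for scalar agents. The clique cover and function names below are our own illustration.

import numpy as np

cliques = [[0, 1, 2], [2, 3]]                                    # toy cover
n = 4
counts = np.array([sum(i in C for C in cliques) for i in range(n)])  # |Q_G^i|

def proj_consensus_Ql(x_C, idx):
    """Q_l-weighted projection of x_{C_l} onto the consensus line span(1)."""
    w = 1.0 / counts[idx]                    # weights 1/|Q_G^j|, j in C_l
    c = np.dot(w, x_C) / np.sum(w)
    return np.full_like(x_C, c)

def T(x):
    """T_i(x_{N_i}) = (1/|Q_G^i|) * sum over cliques containing i of the i-th
    entry of the clique projection (Definition 2)."""
    out = np.zeros(n)
    for C in cliques:
        idx = np.array(C)
        out[idx] += proj_consensus_Ql(x[idx], idx) / counts[idx]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
print(T(x))        # averages within cliques; repeated application converges to D (Prop. 6d)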

The clique-based projection TT has many favorable operator-theoretic properties as follows.

Proposition 6.

Suppose Assumption 1. For the closed convex set 𝒟\mathcal{D} in (85) and clique-based projection TT in Definition 2 onto 𝒟\mathcal{D}, the following statements hold:

  • (a)

    The operator TT is firmly nonexpansive, i.e., T(𝐱)T(𝐰)2(𝐱𝐰)(T(𝐱)T(𝐰))\|T(\mathbf{x})-T(\mathbf{w})\|^{2}\leq(\mathbf{x}-\mathbf{w})^{\top}(T(\mathbf{x})-T(\mathbf{w})) holds for any 𝐱,𝐰d\mathbf{x},\mathbf{w}\in\mathbb{R}^{d}.

  • (b)

    The fixed point set of TT satisfies Fix(T)=𝒟\mathrm{Fix}(T)=\mathcal{D}.

  • (c)

    For any 𝐱d𝒟\mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D} and any 𝐰𝒟\mathbf{w}\in\mathcal{D}, T(𝐱)𝐰<𝐱𝐰\|T(\mathbf{x})-\mathbf{w}\|<\|\mathbf{x}-\mathbf{w}\| holds.

  • (d)

    For any 𝐱d\mathbf{x}\in\mathbb{R}^{d}, T(𝐱)=limpTp(𝐱)𝒟T^{\infty}(\mathbf{x})=\lim_{p\to\infty}T^{p}(\mathbf{x})\in\mathcal{D} holds.

Proof.

See Appendix F.3. ∎

The convergence properties of the CPGD for various step sizes are presented as follows. Note that, like the DGD and diffusion methods for consensus optimization, the CPGD with a fixed step size does not converge exactly to an optimal solution.

Theorem 3.

Consider Problem (80) with closed convex sets 𝒟l,l𝒬𝒢\mathcal{D}_{l},\,l\in\mathcal{Q}_{\mathcal{G}}. Consider the CPGD algorithm in (84). Suppose Assumptions 1–3.

  • (a)

    Let a positive sequence {λk}\{\lambda^{k}\} satisfy limkλk=0\lim_{k\to\infty}\lambda^{k}=0, k=1λk=\sum_{k=1}^{\infty}\lambda^{k}=\infty, and k=1(λk)2<\sum_{k=1}^{\infty}(\lambda^{k})^{2}<\infty (for example, λk=1/k\lambda^{k}=1/k satisfies these conditions). Assume that 𝒟\mathcal{D} is bounded. Then, for any 𝐱0d\mathbf{x}^{0}\in\mathbb{R}^{d} and any pp\in\mathbb{N}, 𝐱k\mathbf{x}^{k} converges to an optimal solution 𝐱argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}).

  • (b)

    Let a positive sequence {λk}\{\lambda^{k}\} satisfy limkλk=0\lim_{k\to\infty}\lambda^{k}=0, k=1λk=\sum_{k=1}^{\infty}\lambda^{k}=\infty, and k=1|λkλk+1|<\sum_{k=1}^{\infty}|\lambda^{k}-\lambda^{k+1}|<\infty (for example, λk=1/k\lambda^{k}=1/k and λk=1/k\lambda^{k}=1/\sqrt{k} satisfy these conditions). Additionally, assume that f^(𝐱)\hat{f}(\mathbf{x}) is strongly convex. Then 𝐱k\mathbf{x}^{k} converges to the unique optimal solution 𝐱=argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}=\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}) for any 𝐱0d\mathbf{x}^{0}\in\mathbb{R}^{d} and any pp\in\mathbb{N}.

  • (c)

    Let λk=α(0,1/L^]\lambda^{k}=\alpha\in(0,1/\hat{L}] for any kk\in\mathbb{N}. Let J:dJ:\mathbb{R}^{d}\to\mathbb{R} be

    J(𝐱)=f^(𝐱)+V(𝐱)/αJ(\mathbf{x})=\hat{f}(\mathbf{x})+V(\mathbf{x})/\alpha (87)

    with

    V(𝐱)=12l𝒬𝒢x𝒞lP𝒟lQl(x𝒞l)Ql2.V(\mathbf{x})=\frac{1}{2}\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|x_{\mathcal{C}_{l}}-P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})\|^{2}_{Q_{l}}. (88)

    Then, for any 𝐱0d\mathbf{x}^{0}\in\mathbb{R}^{d} and p=1p=1,

    J(𝐱k)J(𝐱)𝐱0𝐱22αkJ(\mathbf{x}^{k})-J(\mathbf{x}^{*})\leq\frac{\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2}}{2\alpha k} (89)

    holds for 𝐱argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}).

Proof.

(a) From Proposition 6a-b, the CPGD in (84) can be regarded as the hybrid steepest descent in [34, 35] for any pp\in\mathbb{N}. Hence, Theorem 3a follows from Theorem 2.18, Remark 2.17 in [35], and Proposition 6c. (b) The statement follows from Theorem 2.15 in [35] and Proposition 6a-b. (c) See Appendix F.4. ∎

Remark 3.

Using VV in (88), another expression of the clique-based projection TT is obtained as follows.

Proposition 7.

Consider the function V:dV:\mathbb{R}^{d}\to\mathbb{R} in (88). Then, it holds for any 𝐱d\mathbf{x}\in\mathbb{R}^{d} that

T(𝐱)=𝐱𝐱V(𝐱).T(\mathbf{x})=\mathbf{x}-\nabla_{\mathbf{x}}V(\mathbf{x}). (90)
Proof.

Since each 𝒟l\mathcal{D}_{l} is closed and convex, 1/2x𝒞lP𝒟lQl(x𝒞l)Ql21/2\,\|x_{\mathcal{C}_{l}}-P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})\|_{Q_{l}}^{2} is differentiable, and thus V(𝐱)V(\mathbf{x}) in (88) is also differentiable. Then, for all i𝒩i\in\mathcal{N}, we have xiV(𝐱)=l𝒬𝒢i1|𝒬𝒢i|(xiEl,iP𝒟lQl(x𝒞l))=xi1|𝒬𝒢i|l𝒬𝒢iEl,iP𝒟l(x𝒞l)=xiTi(x𝒩i)\nabla_{x_{i}}V(\mathbf{x})=\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}(x_{i}-E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}}))=x_{i}-\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}P_{\mathcal{D}_{l}}(x_{\mathcal{C}_{l}})=x_{i}-T_{i}(x_{\mathcal{N}_{i}}) from (3) and (86). Hence, we obtain (90). ∎

From Proposition 7, we can interpret the CPGD as a variant of the proximal gradient descent [24, 11, 5] since the clique-based projection TT can be represented as T(𝐱)=argmin𝐱d12𝐱𝐱2+V(𝐱)+𝐱V(𝐱)(𝐱𝐱).T(\mathbf{x})=\operatorname*{arg\,min}_{\mathbf{x}^{\prime}\in\mathbb{R}^{d}}\>\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{\prime}\|^{2}+V(\mathbf{x})+\nabla_{\mathbf{x}}V(\mathbf{x})^{\top}(\mathbf{x}^{\prime}-\mathbf{x}). In this sense, the CPGD is a generalization of the conventional projected gradient descent (PGD). When 𝒢\mathcal{G} is complete, the CPGD equals PGD because 𝒬𝒢all={1}{\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}}=\{1\} and 𝒞1=𝒩\mathcal{C}_{1}=\mathcal{N} hold for complete graphs.

Remark 4.

A benefit of the CPGD over the CD-DYS is its simple structure, which makes its analysis and extensions easy. Stochastic and online variants of the CPGD can be evaluated easily using the same strategy as the online projected gradient descent [16], owing to Proposition 6.

F.2 Nesterov’s acceleration

The CPGD with fixed step sizes can be accelerated up to the inexact convergence rate of O(1/k2)O(1/k^{2}) with Nesterov’s acceleration [23, 5]. The accelerated CPGD (ACPGD) is given as follows:

𝐱k+1=Tp(𝐱^kλk𝐱f^(𝐱^k))\displaystyle\mathbf{x}^{k+1}=T^{p}(\hat{\mathbf{x}}^{k}-\lambda^{k}\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k}))
\displaystyle\hat{\mathbf{x}}^{k+1}=\mathbf{x}^{k+1}+\frac{\sigma^{k}-1}{\sigma^{k+1}}(\mathbf{x}^{k+1}-\mathbf{x}^{k}), (91)

where 𝐱^0=𝐱0\hat{\mathbf{x}}^{0}=\mathbf{x}^{0} and σk+1=(1+1+4(σk)2)/2\sigma^{k+1}=(1+\sqrt{1+4(\sigma^{k})^{2}})/2 with σ0=1\sigma^{0}=1. This algorithm can also be implemented in a distributed manner.
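
A minimal sketch of the iteration (91) is given below, assuming the clique-based projection TT and the gradient of f^\hat{f} are available as callables; all names are ours. With p=1p=1 and α=1/L^\alpha=1/\hat{L}, this matches the setting of Theorem 4.

import numpy as np

def acpgd(x0, grad_f, T, alpha, p=1, num_iters=500):
    """Sketch of the accelerated CPGD (91): Nesterov extrapolation combined with
    p applications of the clique-based projection T per iteration."""
    x_prev = x0.copy()
    x_hat = x0.copy()
    sigma = 1.0
    for _ in range(num_iters):
        x = x_hat - alpha * grad_f(x_hat)
        for _ in range(p):
            x = T(x)                          # T^p applied to the gradient step
        sigma_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * sigma ** 2))
        x_hat = x + (sigma - 1.0) / sigma_next * (x - x_prev)
        x_prev, sigma = x, sigma_next
    return x_prev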

The convergence rate is proved as follows.

Theorem 4.

Consider Problem (80) with closed convex sets 𝒟l,l𝒬𝒢\mathcal{D}_{l},\,l\in\mathcal{Q}_{\mathcal{G}} and the ACPGD algorithm (91). Suppose Assumption 1. Assume that 𝒟d\mathcal{D}\subset\mathbb{R}^{d} in (85) is a non-empty closed convex set. Let p=1p=1 and λk=α(0,1/L^]\lambda^{k}=\alpha\in(0,1/\hat{L}] for all kk. Then, for any initial state 𝐱0=𝐱^0d\mathbf{x}^{0}=\hat{\mathbf{x}}^{0}\in\mathbb{R}^{d}, the following inequality holds:

J(𝐱k)J(𝐱)2𝐱0𝐱2αk2,J(\mathbf{x}^{k})-J(\mathbf{x}^{*})\leq\frac{2\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2}}{\alpha k^{2}}, (92)

where 𝐱argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}) and J(𝐱)J(\mathbf{x}) is given as (87).

Proof.

See Appendix F.4. ∎

F.3 Proof of Proposition 6

As a preliminary, we present important properties of the function V(𝐱)V(\mathbf{x}) in (88) for 𝒟\mathcal{D} in (85) as follows. Note that the function VV in (88) is convex because of the convexity of each 𝒟l\mathcal{D}_{l}.

Proposition 8.

For V(𝐱)V(\mathbf{x}) in (88) and a non-empty closed convex set 𝒟\mathcal{D} in (85), V(𝐱)=0𝐱𝒟V(\mathbf{x})=0\Leftrightarrow\mathbf{x}\in\mathcal{D} holds.

Proof.

If V(𝐱)=0V(\mathbf{x})=0 for 𝐱d\mathbf{x}\in\mathbb{R}^{d}, we obtain x𝒞l=P𝒟lQl(x𝒞l)𝒟lx_{\mathcal{C}_{l}}=P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})\in\mathcal{D}_{l} for all l𝒬𝒢l\in\mathcal{Q}_{\mathcal{G}}, which yields 𝐱𝒟\mathbf{x}\in\mathcal{D} because of (85). Conversely, if 𝐱𝒟\mathbf{x}\in\mathcal{D}, then we have x𝒞l𝒟lx_{\mathcal{C}_{l}}\in\mathcal{D}_{l} for all l𝒬𝒢l\in\mathcal{Q}_{\mathcal{G}}. Thus, V(𝐱)=0V(\mathbf{x})=0 holds. ∎

Proposition 9.

The function V(𝐱)V(\mathbf{x}) in (88) is a 11-smooth function, i.e., its gradient 𝐱V(𝐱)\nabla_{\mathbf{x}}V(\mathbf{x}) is 11-Lipschitzian.

Proof.

From Definition 2, 1-cocoercivity of P𝒟lQlP_{\mathcal{D}_{l}}^{Q_{l}} (see [4]), and Proposition 7, we obtain the following for any 𝐱,𝐰d\mathbf{x},\mathbf{w}\in\mathbb{R}^{d}:

𝐱V(𝐱)𝐱V(𝐰)2=(𝐱𝐰)(T(𝐱)T(𝐰))2\displaystyle\|\nabla_{\mathbf{x}}V(\mathbf{x})-\nabla_{\mathbf{x}}V(\mathbf{w})\|^{2}=\|(\mathbf{x}-\mathbf{w})-(T(\mathbf{x})-T(\mathbf{w}))\|^{2}
=\displaystyle= 𝐱𝐰2+T(𝐱)T(𝐰)22(𝐱𝐰)(T(𝐱)T(𝐰))\displaystyle\|\mathbf{x}-\mathbf{w}\|^{2}+\|T(\mathbf{x})-T(\mathbf{w})\|^{2}-2(\mathbf{x}-\mathbf{w})^{\top}(T(\mathbf{x})-T(\mathbf{w}))
=\displaystyle= 𝐱𝐰2+T(𝐱)T(𝐰)2\displaystyle\|\mathbf{x}-\mathbf{w}\|^{2}+\|T(\mathbf{x})-T(\mathbf{w})\|^{2}
2l𝒬𝒢(x𝒞lw𝒞l)Ql(P𝒟lQl(x𝒞l)P𝒟lQl(w𝒞l))\displaystyle-2\sum_{l\in\mathcal{Q}_{\mathcal{G}}}(x_{\mathcal{C}_{l}}-w_{\mathcal{C}_{l}})^{\top}Q_{l}(P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}}))
\displaystyle\leq 𝐱𝐰2+T(𝐱)T(𝐰)2\displaystyle\|\mathbf{x}-\mathbf{w}\|^{2}+\|T(\mathbf{x})-T(\mathbf{w})\|^{2}
2l𝒬𝒢P𝒟lQl(x𝒞l)P𝒟lQl(w𝒞l)Ql2\displaystyle\quad-2\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}})\|^{2}_{Q_{l}}
\displaystyle\leq 𝐱𝐰2T(𝐱)T(𝐰)2𝐱𝐰2.\displaystyle\|\mathbf{x}-\mathbf{w}\|^{2}-\|T(\mathbf{x})-T(\mathbf{w})\|^{2}\leq\|\mathbf{x}-\mathbf{w}\|^{2}.

The last line follows from (93) in the proof of Proposition 6a. This completes the proof. ∎

With this in mind, we prove Proposition 6 as follows.

a) From Jensen’s inequality and the quasinonexpansiveness of convex projection operators [4], the following inequality holds for any 𝐱,𝐰d\mathbf{x},\,\mathbf{w}\in\mathbb{R}^{d}:

(T(𝐱)T(𝐰))(𝐱𝐰)\displaystyle(T(\mathbf{x})-T(\mathbf{w}))^{\top}(\mathbf{x}-\mathbf{w})
=l𝒬𝒢(x𝒞lw𝒞l)Ql(P𝒟lQl(x𝒞l)P𝒟lQl(w𝒞l))\displaystyle=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}(x_{\mathcal{C}_{l}}-w_{\mathcal{C}_{l}})^{\top}Q_{l}(P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}}))
l𝒬𝒢P𝒟lQl(x𝒞l)P𝒟lQl(w𝒞l)Ql2\displaystyle\geq\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}})\|_{Q_{l}}^{2}
\displaystyle=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\|E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}})\|^{2}
i=1nTi(x𝒩i)Ti(w𝒩i)2=T(𝐱)T(𝐰)2.\displaystyle\geq\sum_{i=1}^{n}\|T_{i}(x_{\mathcal{N}_{i}})-T_{i}(w_{\mathcal{N}_{i}})\|^{2}=\|T(\mathbf{x})-T(\mathbf{w})\|^{2}. (93)

Thus, we obtain T(𝐱)T(𝐰)2(T(𝐱)T(𝐰))(𝐱𝐰)\|T(\mathbf{x})-T(\mathbf{w})\|^{2}\leq(T(\mathbf{x})-T(\mathbf{w}))^{\top}(\mathbf{x}-\mathbf{w}).

b) 𝒟Fix(T)\mathcal{D}\subset\mathrm{Fix}(T) holds because x𝒞l=P𝒟lQl(x𝒞l)x_{\mathcal{C}_{l}}=P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}}) holds for any 𝐱𝒟\mathbf{x}\in\mathcal{D} and all l𝒬𝒢l\in\mathcal{Q}_{\mathcal{G}}. In the following, we prove the converse inclusion Fix(T)𝒟\mathrm{Fix}(T)\subset\mathcal{D}. Let 𝐰𝒟\mathbf{w}\in\mathcal{D}. Then, it suffices to show 𝐰^Fix(T){𝐰}𝐰^𝒟\hat{\mathbf{w}}\in\mathrm{Fix}(T)\setminus\{\mathbf{w}\}\Rightarrow\hat{\mathbf{w}}\in\mathcal{D}. From 𝐰^Fix(T)\hat{\mathbf{w}}\in\mathrm{Fix}(T), we obtain w^i=Ti(w^𝒩i)\hat{w}_{i}=T_{i}(\hat{w}_{\mathcal{N}_{i}}) for all i𝒩i\in\mathcal{N}. In addition, from Jensen’s inequality and the quasinonexpansiveness of convex projection operators [4], we have

\displaystyle\|\mathbf{w}-\hat{\mathbf{w}}\|^{2}\geq\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|w_{\mathcal{C}_{l}}-P_{\mathcal{D}_{l}}^{Q_{l}}(\hat{w}_{\mathcal{C}_{l}})\|^{2}_{Q_{l}}
=i=1nl𝒬𝒢i1|𝒬𝒢i|wiEl,iP𝒟l(w^𝒞l)2\displaystyle=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\|w_{i}-E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}})\|^{2}
\displaystyle\geq\sum_{i=1}^{n}\|w_{i}-\underbrace{\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}})}_{=T_{i}(\hat{w}_{\mathcal{N}_{i}})=\hat{w}_{i}}\|^{2}=\|\mathbf{w}-\hat{\mathbf{w}}\|^{2}.

Thus, from the equality condition of Jensen’s inequality, we obtain w_{i}-E_{k,i}P_{\mathcal{D}_{k}}(\hat{w}_{\mathcal{C}_{k}})=w_{i}-E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}}) for all \mathcal{C}_{k},\mathcal{C}_{l}\,(k,l\in\mathcal{Q}_{\mathcal{G}}^{i}) and all i\in\mathcal{N}. Then, we have E_{k,i}P_{\mathcal{D}_{k}}(\hat{w}_{\mathcal{C}_{k}})=E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}}) for all \mathcal{C}_{k},\mathcal{C}_{l}\,(k,l\in\mathcal{Q}_{\mathcal{G}}^{i}). Therefore, since \hat{\mathbf{w}}\in\mathrm{Fix}(T), we have 2V(\hat{\mathbf{w}})=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\|\hat{w}_{i}-E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}})\|^{2}=\sum_{i=1}^{n}\|\hat{w}_{i}-T_{i}(\hat{w}_{\mathcal{N}_{i}})\|^{2}=0. Thus, \hat{\mathbf{w}}\in\mathcal{D} holds from Proposition 8.

c) For a non-empty closed convex set 𝒟\mathcal{D} in (85) and 𝐱d𝒟\mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D}, there exists l^𝒬𝒢\hat{l}\in\mathcal{Q}_{\mathcal{G}} such that x𝒞l^P𝒟l^(x𝒞l^)Ql^>0\|x_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})\|_{Q_{\hat{l}}}>0. Hence, for l^𝒬𝒢\hat{l}\in\mathcal{Q}_{\mathcal{G}}, 𝐱d𝒟\mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D}, and 𝐰𝒟\mathbf{w}\in\mathcal{D}, we have x𝒞l^w𝒞l^Ql^2>P𝒟l^Ql^(x𝒞l^)w𝒞l^Ql^2\|x_{\mathcal{C}_{\hat{l}}}-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}}>\|P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}} because

\displaystyle\|x_{\mathcal{C}_{\hat{l}}}-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}}
\displaystyle=\|x_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})\|^{2}_{Q_{\hat{l}}}+\|P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}}
\displaystyle\quad-2(x_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}}))^{\top}Q_{\hat{l}}(w_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}}))
\displaystyle>\|P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}},

where the last inequality follows from the projection theorem (see Theorem 3.16 in [4]) together with \|x_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})\|^{2}_{Q_{\hat{l}}}>0. Thus, by Jensen’s inequality and the quasinonexpansiveness of P_{\mathcal{D}_{l}}^{Q_{l}} [4], for any \mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D} and \mathbf{w}\in\mathcal{D}, we obtain \|\mathbf{x}-\mathbf{w}\|^{2}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|x_{\mathcal{C}_{l}}-w_{\mathcal{C}_{l}}\|^{2}_{Q_{l}}>\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-w_{\mathcal{C}_{l}}\|^{2}_{Q_{l}}\geq\sum_{i=1}^{n}\|\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-w_{i}\|^{2}=\|T(\mathbf{x})-\mathbf{w}\|^{2}. Hence, \|T(\mathbf{x})-\mathbf{w}\|<\|\mathbf{x}-\mathbf{w}\| for any \mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D} and \mathbf{w}\in\mathcal{D}.

d) For \mathbf{x}\in\mathbb{R}^{d}, we define \{a_{k}\} by a_{k+1}=T(a_{k}) with a_{0}=\mathbf{x}. Then, we obtain \lim_{k\to\infty}a_{k+1}=\lim_{k\to\infty}T(a_{k}). Thus, from the continuity of T shown in Proposition 6a, we have T^{\infty}(\mathbf{x})=\lim_{k\to\infty}a_{k+1}=T(\lim_{k\to\infty}a_{k})=T(T^{\infty}(\mathbf{x})). Hence, Proposition 6b yields T^{\infty}(\mathbf{x})\in\mathrm{Fix}(T)=\mathcal{D}.
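To illustrate the averaged operator T and the fixed-point iteration a_{k+1}=T(a_{k}) of Proposition 6d, the following minimal Python sketch uses an entirely hypothetical clique cover and takes each \mathcal{D}_{l} to be the in-clique consensus set, with the plain in-clique mean standing in for the Q_{l}-weighted projection P_{\mathcal{D}_{l}}^{Q_{l}}; this is an illustration only, not the setting analyzed above.

```python
import numpy as np

# Minimal sketch (assumptions: toy clique cover; each D_l is the in-clique
# consensus set; the plain in-clique mean stands in for P_{D_l}^{Q_l}).
n = 5
cliques = [[0, 1, 2], [2, 3], [3, 4]]                       # hypothetical cliques C_l
memberships = [[l for l, C in enumerate(cliques) if i in C] for i in range(n)]

def T(x):
    """Agent i averages its components of the clique-wise projections it belongs to."""
    return np.array([
        np.mean([x[cliques[l]].mean() for l in memberships[i]])   # E_{l,i} P_{D_l}(x_{C_l})
        for i in range(n)
    ])

a = np.random.default_rng(0).standard_normal(n)
for _ in range(500):                                        # a_{k+1} = T(a_k)
    a = T(a)
print(np.linalg.norm(a - T(a)), np.ptp(a))                  # both ~0: a is a consensus fixed point
```

In this toy instance the cliques cover a connected graph, and the limit has (numerically) equal components, consistent with \mathrm{Fix}(T)=\mathcal{D}.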

F.4 Proof of Theorems 3c and 4

Here, we show the proofs of Theorems 3c and 4. These proofs are based on the convergence theorems for the ISTA and FISTA (Theorems 3.1 and 4.4 in [5]), respectively.

Supporting Lemmas

Before proceeding to prove the theorems, we present some inequalities corresponding to those obtained from Lemma 2.3 in [5], which is key to proving the convergence theorems. Note that a differentiable function h:\mathbb{R}^{d}\to\mathbb{R} is convex if and only if

h(𝐰)h(𝐱)+h(𝐱)(𝐰𝐱)h(\mathbf{w})\geq h(\mathbf{x})+\nabla h(\mathbf{x})^{\top}(\mathbf{w}-\mathbf{x}) (94)

holds for any 𝐱,𝐰d\mathbf{x},\mathbf{w}\in\mathbb{R}^{d}. If hh is β\beta-smooth and convex,

h(𝐰)\displaystyle h(\mathbf{w}) h(𝐱)+h(𝐱)(𝐰𝐱)+β2𝐰𝐱2\displaystyle\leq h(\mathbf{x})+\nabla h(\mathbf{x})^{\top}(\mathbf{w}-\mathbf{x})+\frac{\beta}{2}\|\mathbf{w}-\mathbf{x}\|^{2} (95)
h(𝐰)\displaystyle h(\mathbf{w}) h(𝐱)+h(𝐱)(𝐰𝐱)+12βh(𝐱)h(𝐰)2\displaystyle\geq h(\mathbf{x})+\nabla h(\mathbf{x})^{\top}(\mathbf{w}-\mathbf{x})+\frac{1}{2\beta}\|\nabla h(\mathbf{x})-\nabla h(\mathbf{w})\|^{2} (96)

hold for any \mathbf{x},\mathbf{w}\in\mathbb{R}^{d}. For details, see textbooks on convex analysis, e.g., Theorem 18.15 in [4].
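As a quick numerical illustration (not part of the analysis), the inequalities (94)–(96) can be checked for a smooth convex quadratic; the sketch below assumes h(\mathbf{x})=\frac{1}{2}\mathbf{x}^{\top}A\mathbf{x} with A symmetric positive semidefinite and takes \beta as the largest eigenvalue of A.

```python
import numpy as np

# Sanity check of (94)-(96) for h(x) = 0.5 * x^T A x (illustrative only).
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M.T @ M                                    # symmetric positive semidefinite Hessian
beta = np.linalg.eigvalsh(A).max()             # a valid smoothness constant for h

h = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, w = rng.standard_normal(4), rng.standard_normal(4)
linear = h(x) + grad(x) @ (w - x)                                      # first-order model at x
assert h(w) >= linear                                                  # (94): convexity
assert h(w) <= linear + 0.5 * beta * np.sum((w - x) ** 2)              # (95): beta-smoothness
assert h(w) >= linear + np.sum((grad(x) - grad(w)) ** 2) / (2 * beta)  # (96)
```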

In preparation for the lemmas below, let \alpha\in(0,1/\hat{L}] and V_{\alpha}(\mathbf{x})=V(\mathbf{x})/\alpha with V(\mathbf{x}) in (88). Additionally, for a given \mathbf{w}\in\mathbb{R}^{d}, we define \hat{F}_{\mathbf{w}}:\mathbb{R}^{d}\to\mathbb{R} as

F^𝐰(𝐬)=f^(𝐬)+Vα(𝐰)+𝐱Vα(𝐰)(𝐬𝐰).\hat{F}_{\mathbf{w}}(\mathbf{s})=\hat{f}(\mathbf{s})+V_{\alpha}(\mathbf{w})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}(\mathbf{s}-\mathbf{w}). (97)

For \hat{F}_{\mathbf{w}}(\mathbf{s}) in (97), the following inequality holds.

Proposition 10.

Assume that f^\hat{f} is L^\hat{L}-smooth and convex. Let 𝐰=𝐱α𝐱f^(𝐱)\mathbf{w}=\mathbf{x}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}). Then,

F^𝐰(T(𝐰))F^𝐰(𝝃)+1α(𝐱T(𝐰))(𝐱𝝃)12α𝐱T(𝐰)2\hat{F}_{\mathbf{w}}(T(\mathbf{w}))\leq\hat{F}_{\mathbf{w}}(\boldsymbol{\xi})+\frac{1}{\alpha}(\mathbf{x}-T(\mathbf{w}))^{\top}(\mathbf{x}-\boldsymbol{\xi})-\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2} (98)

holds for any 𝛏d\boldsymbol{\xi}\in\mathbb{R}^{d}.

Proof.

Let G𝐰(𝐬)=f^(𝐬)+𝐱Vα(𝐰)(𝐬𝐰)G_{\mathbf{w}}(\mathbf{s})=\hat{f}(\mathbf{s})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}(\mathbf{s}-\mathbf{w}) and 𝝃d\boldsymbol{\xi}\in\mathbb{R}^{d}. Then, by using L^\hat{L}-smoothness of f^\hat{f}, 𝐱f^(𝐱)=(𝐱𝐰)/α\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})=(\mathbf{x}-\mathbf{w})/\alpha, and 𝐱Vα(𝐰)=(𝐰T(𝐰))/α\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})=(\mathbf{w}-T(\mathbf{w}))/\alpha (see Proposition 7),

\displaystyle G_{\mathbf{w}}(T(\mathbf{w}))=\hat{f}(T(\mathbf{w}))+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}(T(\mathbf{w})-\mathbf{w})
\displaystyle\leq\hat{f}(\mathbf{x})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})^{\top}(\mathbf{x}-T(\mathbf{w}))+\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2}
\displaystyle\quad+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}(T(\mathbf{w})-\mathbf{w})
\displaystyle\leq\hat{f}(\boldsymbol{\xi})+\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})^{\top}(\mathbf{x}-\boldsymbol{\xi})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})^{\top}(\mathbf{x}-T(\mathbf{w}))
\displaystyle\quad+\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2}+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}\underbrace{(T(\mathbf{w})-\mathbf{w})}_{=(\boldsymbol{\xi}-\mathbf{w})+(T(\mathbf{w})-\boldsymbol{\xi})}
\displaystyle=G_{\mathbf{w}}(\boldsymbol{\xi})+\frac{1}{\alpha}(\mathbf{x}-T(\mathbf{w}))^{\top}(T(\mathbf{w})-\boldsymbol{\xi})+\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2}
\displaystyle=G_{\mathbf{w}}(\boldsymbol{\xi})+\frac{1}{\alpha}(\mathbf{x}-T(\mathbf{w}))^{\top}(\mathbf{x}-\boldsymbol{\xi})-\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2}

is obtained from (94) and (95). Thus, adding Vα(𝐰)V_{\alpha}(\mathbf{w}) to both sides, we obtain (98). ∎
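As a sanity check of (98) (illustrative only, not part of the proof), consider the degenerate single-clique case in which T is the Euclidean projection onto the consensus subspace, so that V(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-T(\mathbf{x})\|^{2} and the gradient identity \nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})=(\mathbf{w}-T(\mathbf{w}))/\alpha of Proposition 7 is elementary. The sketch below verifies (98) numerically under these assumptions, with \hat{f}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-b\|^{2} (so \hat{L}=1) and \alpha=0.8.

```python
import numpy as np

# Numerical check of (98) in the single-clique case: T = projection onto consensus.
rng = np.random.default_rng(1)
n, alpha = 6, 0.8                                        # alpha <= 1/L_hat with L_hat = 1
b = rng.standard_normal(n)

f_hat = lambda x: 0.5 * np.sum((x - b) ** 2)
grad_f = lambda x: x - b
T = lambda x: np.full(n, x.mean())                       # projection onto {x_1 = ... = x_n}
V_alpha = lambda x: 0.5 * np.sum((x - T(x)) ** 2) / alpha
grad_V = lambda w: (w - T(w)) / alpha                    # Proposition 7 in this special case
F_hat = lambda s, w: f_hat(s) + V_alpha(w) + grad_V(w) @ (s - w)   # definition (97)

x = rng.standard_normal(n)
w = x - alpha * grad_f(x)
lhs = F_hat(T(w), w)
for _ in range(1000):                                    # (98) should hold for every xi
    xi = rng.standard_normal(n)
    rhs = (F_hat(xi, w) + (x - T(w)) @ (x - xi) / alpha
           - 0.5 * np.sum((x - T(w)) ** 2) / alpha)
    assert lhs <= rhs + 1e-9
```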

Proposition 11.

Let 𝐱k+1=T(𝐰k)\mathbf{x}^{k+1}=T(\mathbf{w}^{k}) with some {𝐰k}d\{\mathbf{w}^{k}\}\subset\mathbb{R}^{d}. Then, it holds that

\hat{F}_{\mathbf{w}^{k}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}\leq\hat{F}_{\mathbf{w}^{k-1}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}. (99)
Proof.

By 1/α1/\alpha-smoothness of Vα(𝐱)V_{\alpha}(\mathbf{x}) (see Proposition 9) and Proposition 7,

F^𝐰k1(𝐱k)=f^(𝐱k)+Vα(𝐰k1)+𝐱Vα(𝐰k1)(𝐱k𝐰k1)\displaystyle\hat{F}_{\mathbf{w}^{k-1}}(\mathbf{x}^{k})=\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k-1})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})^{\top}(\mathbf{x}^{k}-\mathbf{w}^{k-1})
=\displaystyle= f^(𝐱k)+Vα(𝐰k1)α𝐱Vα(𝐰k1)2\displaystyle\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k-1})-\alpha\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}
\displaystyle\geq f^(𝐱k)+Vα(𝐰k)+𝐱Vα(𝐰k)(𝐰k1𝐰k)\displaystyle\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{w}^{k-1}-\mathbf{w}^{k})
+α2𝐱Vα(𝐰k1)𝐱Vα(𝐰k)2α𝐱Vα(𝐰k1)2\displaystyle+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})-\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}-\alpha\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}
=\displaystyle= f^(𝐱k)+Vα(𝐰k)+𝐱Vα(𝐰k)(𝐱k𝐰k)\displaystyle\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{x}^{k}-\mathbf{w}^{k})
+𝐱Vα(𝐰k)(𝐰k1𝐱k)\displaystyle+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{w}^{k-1}-\mathbf{x}^{k})
+α2𝐱Vα(𝐰k1)𝐱Vα(𝐰k)2α𝐱Vα(𝐰k1)2\displaystyle+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})-\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}-\alpha\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}
=\displaystyle= F^𝐰k(𝐱k)+α2𝐱Vα(𝐰k)2α2𝐱Vα(𝐰k1)2\displaystyle\hat{F}_{\mathbf{w}^{k}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}-\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}

is obtained from (96). Hence, (99) holds. ∎

With this in mind, we consider the following update rule with \hat{\mathbf{x}}^{0}=\mathbf{x}^{0} and some \{\theta^{k}\}\subset\mathbb{R}:

\displaystyle\mathbf{w}^{k}=\hat{\mathbf{x}}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k})
\displaystyle\mathbf{x}^{k+1}=T(\mathbf{w}^{k})
\displaystyle\hat{\mathbf{x}}^{k+1}=\mathbf{x}^{k+1}+\theta^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k}). (100)
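For illustration, the following sketch implements (100) on a toy instance; everything below is an assumption made for the example only: a quadratic \hat{f}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-b\|^{2} (so \hat{L}=1), the plain-mean clique operator T from the earlier sketch, \sigma^{0}=1, and the FISTA-type recursion \sigma^{k+1}=(1+\sqrt{1+4(\sigma^{k})^{2}})/2 with \theta^{k}=(\sigma^{k}-1)/\sigma^{k+1}, the choice used later in the proof of Theorem 4. It reports the fixed-point residual of \mathbf{x}\mapsto T(\mathbf{x}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})).

```python
import numpy as np

# Sketch of the inertial update (100) on a toy instance (not the paper's code).
n, alpha = 5, 0.5
b = np.arange(n, dtype=float)
grad_f = lambda x: x - b                                   # gradient of 0.5*||x - b||^2
cliques = [[0, 1, 2], [2, 3], [3, 4]]
memberships = [[l for l, C in enumerate(cliques) if i in C] for i in range(n)]
T = lambda x: np.array([np.mean([x[cliques[l]].mean() for l in memberships[i]])
                        for i in range(n)])

x = np.zeros(n)
x_hat = x.copy()
sigma = 1.0
for k in range(300):
    w = x_hat - alpha * grad_f(x_hat)                      # gradient step at extrapolated point
    x_next = T(w)                                          # clique-wise averaging step
    sigma_next = (1 + np.sqrt(1 + 4 * sigma ** 2)) / 2     # assumed FISTA-type schedule
    theta = (sigma - 1) / sigma_next                       # theta^k = (sigma^k - 1)/sigma^{k+1}
    x_hat = x_next + theta * (x_next - x)                  # inertial extrapolation
    x, sigma = x_next, sigma_next
print(np.linalg.norm(x - T(x - alpha * grad_f(x))))        # fixed-point residual, ~0
```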

In addition, we define \Theta^{k} as

\Theta^{k}=\hat{F}_{\mathbf{w}^{k-1}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2} (101)

with F^𝐰\hat{F}_{\mathbf{w}} in (97). By 𝐱k𝐰k1=α𝐱Vα(𝐰k1)\mathbf{x}^{k}-\mathbf{w}^{k-1}=-\alpha\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1}), Θk\Theta^{k} can be rewritten as Θk=f^(𝐱k)+Vα(𝐰k1)12α𝐰k1T(𝐰k1)2=f^(𝐱k)+Vα(𝐰k1)12α𝐰k1𝐱k2\Theta^{k}=\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k-1})-\frac{1}{2\alpha}\|\mathbf{w}^{k-1}-T(\mathbf{w}^{k-1})\|^{2}=\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k-1})-\frac{1}{2\alpha}\|\mathbf{w}^{k-1}-\mathbf{x}^{k}\|^{2}.

Remarkably, the following lemma holds for \Theta^{k} in (101).

Lemma 3.

Consider the sequence generated by (100). Then,

J(𝐱k)=f^(𝐱k)+Vα(𝐱k)Θk.J(\mathbf{x}^{k})=\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{x}^{k})\leq\Theta^{k}. (102)
Proof.

In light of 1/\alpha-smoothness of V_{\alpha} and \nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})=(\mathbf{w}^{k-1}-\mathbf{x}^{k})/\alpha, we obtain V_{\alpha}(\mathbf{x}^{k})\leq V_{\alpha}(\mathbf{w}^{k-1})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})^{\top}(\mathbf{x}^{k}-\mathbf{w}^{k-1})+\frac{1}{2\alpha}\|\mathbf{w}^{k-1}-\mathbf{x}^{k}\|^{2}=V_{\alpha}(\mathbf{w}^{k-1})-\frac{1}{2\alpha}\|\mathbf{w}^{k-1}-\mathbf{x}^{k}\|^{2}. Hence, adding \hat{f}(\mathbf{x}^{k}) to both sides and recalling the rewriting of \Theta^{k} above yields (102). ∎

Furthermore, the following inequality holds. This is essential to Theorems 3c and 4.

Lemma 4.

For the sequence generated by (F.4) and Θk\Theta^{k} defined in (101), it holds that

ΘkΘk+1\displaystyle\small\Theta^{k}-\Theta^{k+1} 12α𝐱^k𝐱k+12+1α(𝐱k+1𝐱^k)(𝐱^k𝐱k).\displaystyle\geq\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}+\frac{1}{\alpha}(\mathbf{x}^{k+1}-\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k}). (103)
Proof.

Substituting \mathbf{x}=\hat{\mathbf{x}}^{k},\,\mathbf{w}=\mathbf{w}^{k}, and \boldsymbol{\xi}=\mathbf{x}^{k} into (98), we obtain

Θk+1=f^(𝐱k+1)+Vα(𝐰k)\displaystyle\Theta^{k+1}=\hat{f}(\mathbf{x}^{k+1})+V_{\alpha}(\mathbf{w}^{k})
+𝐱Vα(𝐰k)(𝐱k+1𝐰k)+α2𝐱Vα(𝐰k)2\displaystyle+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{x}^{k+1}-\mathbf{w}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}
f^(𝐱k)+Vα(𝐰k)\displaystyle\leq\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k})
+𝐱Vα(𝐰k)(𝐱k𝐰k)+α2𝐱Vα(𝐰k)2\displaystyle+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{x}^{k}-\mathbf{w}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}
+1α(𝐱^k𝐱k+1)(𝐱^k𝐱k)12α𝐱^k𝐱k+12\displaystyle+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}
\displaystyle=\hat{F}_{\mathbf{w}^{k}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}
+1α(𝐱^k𝐱k+1)(𝐱^k𝐱k)12α𝐱^k𝐱k+12\displaystyle+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}
\displaystyle\leq\hat{F}_{\mathbf{w}^{k-1}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}
+1α(𝐱^k𝐱k+1)(𝐱^k𝐱k)12α𝐱^k𝐱k+12\displaystyle+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}
=Θk+1α(𝐱^k𝐱k+1)(𝐱^k𝐱k)12α𝐱^k𝐱k+12\displaystyle=\Theta^{k}+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}

from (98) and (99). Thus, (103) holds. ∎

For 𝐱k\mathbf{x}^{k} and an optimal 𝐱\mathbf{x}^{*}, we present the following lemma.

Lemma 5.

For 𝐱argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}), it holds that

f^(𝐱)+Vα(𝐱)\displaystyle\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*}) Θk+112α𝐱^k𝐱k+12+1α(𝐱k+1𝐱^k)(𝐱^k𝐱).\displaystyle-\Theta^{k+1}\geq\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}+\frac{1}{\alpha}(\mathbf{x}^{k+1}-\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{*}). (104)
Proof.

Recalling (100), \hat{L}-smoothness of \hat{f}, and 1/\alpha-smoothness of V_{\alpha} for \alpha\in(0,1/\hat{L}], we obtain

Θk+1f^(𝐱^k)𝐱f^(𝐱^k)(𝐱^k𝐱k+1)\displaystyle\Theta^{k+1}\leq\hat{f}(\hat{\mathbf{x}}^{k})-\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})
+12α𝐱^k𝐱k+12+Vα(𝐰k)12α𝐰kT(𝐰k)2\displaystyle+\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}+V_{\alpha}(\mathbf{w}^{k})-\frac{1}{2\alpha}\|\mathbf{w}^{k}-T(\mathbf{w}^{k})\|^{2}
\displaystyle\leq\hat{f}(\mathbf{x}^{*})+\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{*})-\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-T(\mathbf{w}^{k}))
\displaystyle\quad+\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-T(\mathbf{w}^{k})\|^{2}+V_{\alpha}(\mathbf{x}^{*})-\frac{1}{2\alpha}\|\mathbf{w}^{k}-T(\mathbf{w}^{k})\|^{2}
\displaystyle\quad+\frac{1}{\alpha}(\mathbf{w}^{k}-T(\mathbf{w}^{k}))^{\top}(T(\mathbf{w}^{k})-\mathbf{x}^{*}+\mathbf{w}^{k}-T(\mathbf{w}^{k}))
12α𝐰kT(𝐰k)(𝐱T(𝐱))2\displaystyle-\frac{1}{2\alpha}\|\mathbf{w}^{k}-T(\mathbf{w}^{k})-(\mathbf{x}^{*}-T(\mathbf{x}^{*}))\|^{2}
=f^(𝐱)+Vα(𝐱)+1α(𝐱^k𝐱k+1)(𝐱^k𝐱)12α𝐱^k𝐱k+12\displaystyle=\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{*})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}

from (94), (95), and (96), where the last line is obtained because 𝐱=T(𝐱)\mathbf{x}^{*}=T(\mathbf{x}^{*}) holds for 𝐱𝒟\mathbf{x}^{*}\in\mathcal{D}. Therefore, (104) is obtained. ∎

Proof of Theorem 3c

In this proof, assume that \theta^{k}=0 for all k. Then, \hat{\mathbf{x}}^{k}=\mathbf{x}^{k} holds, and the algorithm in (100) coincides with the CPGD with \lambda^{k}=\alpha\in(0,1/\hat{L}] for all k\in\mathbb{N}.

In light of (104) and \hat{\mathbf{x}}^{k}=\mathbf{x}^{k}, we obtain 2\alpha(\Theta^{k+1}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq\|\mathbf{x}^{*}-\mathbf{x}^{k}\|^{2} because 2\alpha(\Theta^{k+1}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq 2(\mathbf{x}^{k}-\mathbf{x}^{k+1})^{\top}(\mathbf{x}^{k}-\mathbf{x}^{*})-\|\mathbf{x}^{k}-\mathbf{x}^{k+1}\|^{2}=\|\mathbf{x}^{*}-\mathbf{x}^{k}\|^{2}-\|\mathbf{x}^{*}-\mathbf{x}^{k+1}\|^{2}\leq\|\mathbf{x}^{*}-\mathbf{x}^{k}\|^{2}. Besides, invoking (103), we have

2\alpha(\Theta^{k+1}-\Theta^{k})\leq-\|\mathbf{x}^{k}-\mathbf{x}^{k+1}\|^{2}\leq 0.

Then, following the same procedure as Theorem 3.1 in [5] and using (102), we obtain (89).
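Concretely, the telescoping step can be sketched as follows (a sketch of the standard argument of [5], not a verbatim restatement of (89)): summing the inequality 2\alpha(\Theta^{j+1}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq\|\mathbf{x}^{*}-\mathbf{x}^{j}\|^{2}-\|\mathbf{x}^{*}-\mathbf{x}^{j+1}\|^{2} over j=0,\ldots,k-1 and using the monotonicity of \Theta^{k} gives

2\alpha k(\Theta^{k}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq 2\alpha\sum_{j=0}^{k-1}(\Theta^{j+1}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2},

which, combined with (102), yields an O(1/k) bound on J(\mathbf{x}^{k})-J(\mathbf{x}^{*}).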

Proof of Theorem 4

Substituting \theta^{k}=(\sigma^{k}-1)/\sigma^{k+1} into (100) yields the ACPGD in (F.2).

Now, by (103), (104), and (σk1)2=σk(σk1)(\sigma^{k-1})^{2}=\sigma^{k}(\sigma^{k}-1), following the procedure of the proof of Theorem 4.4 in [5] gives

\displaystyle(\sigma^{k-1})^{2}(\Theta^{k}-J(\mathbf{x}^{*}))-(\sigma^{k})^{2}(\Theta^{k+1}-J(\mathbf{x}^{*}))
\displaystyle\geq\frac{1}{2\alpha}(\|\boldsymbol{\zeta}^{k+1}\|^{2}-\|\boldsymbol{\zeta}^{k}\|^{2})

with J in (87) and \boldsymbol{\zeta}^{k}=\sigma^{k}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{*})-(\sigma^{k}-1)(\mathbf{x}^{k}-\mathbf{x}^{*}). Thus, summing both sides over k=1,2,\ldots yields

(σk)2(Θk+1J(𝐱))12α𝜻02=12α𝐱0𝐱2.(\sigma^{k})^{2}(\Theta^{k+1}-J(\mathbf{x}^{*}))\leq\frac{1}{2\alpha}\|\boldsymbol{\zeta}^{0}\|^{2}=\frac{1}{2\alpha}\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2}.

By σk(k+1)/2\sigma^{k}\geq(k+1)/2, which can be shown by mathematical induction, we obtain

Θk+1J(𝐱)2𝐱0𝐱2α(k+1)2.\Theta^{k+1}-J(\mathbf{x}^{*})\leq\frac{2\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2}}{\alpha(k+1)^{2}}.

Therefore, the inequality (92) follows from (102).
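For completeness, the induction for \sigma^{k}\geq(k+1)/2 can be sketched as follows (assuming \sigma^{0}=1, which is consistent with \boldsymbol{\zeta}^{0}=\mathbf{x}^{0}-\mathbf{x}^{*} above): solving the identity (\sigma^{k-1})^{2}=\sigma^{k}(\sigma^{k}-1) for \sigma^{k}>0 gives

\sigma^{k}=\frac{1+\sqrt{1+4(\sigma^{k-1})^{2}}}{2}\geq\frac{1+\sqrt{1+k^{2}}}{2}\geq\frac{k+1}{2}

whenever \sigma^{k-1}\geq k/2, and \sigma^{0}=1\geq 1/2 gives the base case.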