
Distributed Optimization of Clique-Wise Coupled Problems via Three-Operator Splitting

Yuto Watanabe, and Kazunori Sakurama Yuto Watanabe is with the Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093 USA (email: [email protected]).Kazunori Sakurama is with the Department of System Innovation, Graduate School of Engineering Science, Osaka University, 1-3, Machikaneyama, Toyonaka, Osaka 560-8531, Japan (email: [email protected]).This work was partially supported by the joint project of Kyoto University and Toyota Motor Corporation, titled “Advanced Mathematical Science for Mobility Society”.
Abstract

This study explores distributed optimization problems with clique-wise coupling via operator splitting and shows how this framework can be utilized for performance analysis and enhancement. This framework extends beyond conventional pairwise coupled problems (e.g., consensus optimization) and is applicable to broader examples. To this end, we first introduce a new distributed optimization algorithm by leveraging a clique-based matrix and the Davis-Yin splitting (DYS), a versatile three-operator splitting method. We then demonstrate that this approach sheds new light on conventional algorithms in the following ways: (i) existing algorithms (NIDS, Exact diffusion, diffusion, and our previous work) can be derived from our proposed method; (ii) we present a new mixing matrix based on clique-wise coupling, which surfaces when deriving the NIDS, and we prove that its eigenvalue distribution is preferable, enabling fast consensus; (iii) these observations yield a new linear convergence rate for the NIDS with non-smooth objective functions. Remarkably, our linear rate is the first established for the general DYS with a projection onto a subspace; to our knowledge, this case is not covered by any prior results. Finally, numerical examples showcase the efficacy of our proposed approach.

1 Introduction

The last two decades have witnessed significant advances in distributed optimization. In the literature, a large body of existing studies has been dedicated to pairwise coupled optimization problems, where every coupling of variables comprises the decision variables of two agents corresponding to the communication path (edge) between them. Representative examples are consensus optimization [22, 28, 38, 20, 1, 33] and formation control [26]. Problems with globally coupled linear constraints [21] also belong to this class because their dual problems reduce to pairwise coupled consensus optimization.

To handle wider applications that involve complex coupling beyond edges, we address a more generic class of distributed optimization: clique-wise coupled optimization problems. This class has been introduced in our recent works [30, 31] with an emphasis on its generality. In this note, we elucidate additional benefits of this class of problems for performance enhancement and analysis via a new algorithm based on a three-operator splitting [13]. The problem class is formulated as follows: Consider a multi-agent system with $n$ agents over a time-invariant undirected graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ with $\mathcal{N}=\{1,\ldots,n\}$ and an edge set $\mathcal{E}$. Let $x_i\in\mathbb{R}^{d_i}$ represent the $d_i$-dimensional decision variable of agent $i$. Then, the following is called a clique-wise coupled optimization problem:

\[
\min_{\substack{x_i\in\mathbb{R}^{d_i}\\ i\in\mathcal{N}}}\ \underbrace{\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\left(f_l(x_{\mathcal{C}_l})+g_l(x_{\mathcal{C}_l})\right)}_{\text{clique-wise coupling}}+\sum_{i=1}^{n}\left(\hat{f}_i(x_i)+\hat{g}_i(x_i)\right), \tag{2}
\]

where for all $l\in\mathcal{Q}_{\mathcal{G}}$, $f_l:\mathbb{R}^{\sum_{j\in\mathcal{C}_l}d_j}\to\mathbb{R}$ is $L_l$-smooth and convex, and $g_l:\mathbb{R}^{\sum_{j\in\mathcal{C}_l}d_j}\to\mathbb{R}$ is proper, closed, and convex. For all $i\in\mathcal{N}$, $\hat{f}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ is $\hat{L}_i$-smooth and convex, and $\hat{g}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ is proper, closed, and convex. For a set $\mathcal{C}_l=\{j_1,\ldots,j_{|\mathcal{C}_l|}\}\subset\mathcal{N}$, let $x_{\mathcal{C}_l}$ denote $x_{\mathcal{C}_l}=[x_{j_1}^{\top},\ldots,x_{j_{|\mathcal{C}_l|}}^{\top}]^{\top}$. Here, the set $\mathcal{C}_l$ represents a clique, i.e., a complete subgraph of $\mathcal{G}$ [7]. The set $\mathcal{Q}_{\mathcal{G}}\neq\emptyset$ is a subset of $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$, where $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$ is the index set of all the cliques in $\mathcal{G}$. For example, for the undirected graph in Fig. 1, we have $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}=\{1,\ldots,9\}$ with cliques $\mathcal{C}_1,\ldots,\mathcal{C}_9$.

Refer to caption
Figure 1: Sketches of (a) pairwise coupling and (b) clique-wise coupling.

As mentioned above, an immediate benefit of Problem (2) is that it can handle variable couplings of more than two agents. As shown in Fig. 1, the cliques in (b) can deal with the coupling of the three nodes $\{1,2,3\}$, unlike the pairwise coupling in (a). Indeed, Problem (2) always contains conventional pairwise coupled optimization problems since nodes and edges are also cliques. Beyond pairwise coupled problems, other possible applications include (i) clique-wise coupled linear constraints [30, 31, 19] (e.g., resource allocation in Section 6), (ii) sparse SDP [29] (e.g., distributed design of distributed controllers [42], sensor network localization [2], etc.), (iii) regularization accounting for network structures (e.g., network lasso [15]), and (iv) approximation of trace norm minimization problems (e.g., multi-task learning [40], robust PCA [9], etc.).

This note addresses Problem (2) using the Davis-Yin Splitting (DYS) and reveals that the notion of clique-wise coupling is beneficial for analyzing and improving convergence performance. The DYS is a versatile three-operator splitting scheme that generalizes basic operator-splitting methods (e.g., the forward-backward and Douglas-Rachford splittings). Firstly, we reformulate Problem (2) by introducing a matrix called the clique-wise duplication (CD) matrix. This matrix lifts Problem (2) to a tractable separated form for algorithm design. Then, applying the DYS, we derive the proposed algorithm called the clique-based distributed DYS (CD-DYS). Subsequently, we demonstrate that the CD-DYS generalizes several existing algorithms, encompassing the celebrated NIDS [20]. Then, we analyze a new mixing matrix that naturally comes up in deriving the NIDS and show a preferable distribution of its eigenvalues. Moreover, we present a new linear convergence rate for the NIDS with non-smooth terms by proving a more general linear rate for the DYS with a projection onto a subspace under strong convexity of the smooth term. Finally, numerical examples illustrate the effectiveness of the proposed approach.

Our contributions can be summarized as follows. (i) We propose a new distributed algorithm, CD-DYS, for Problem (2), applicable to broad examples ranging from optimization to control and learning problems. (ii) Our investigation of consensus optimization as a clique-wise coupled problem unveils that several conventional distributed optimization methods, including the NIDS [20], are derived from the proposed CD-DYS method, which leads to a new linear convergence rate for the NIDS with non-smooth objective functions. This linear rate admits larger stepsizes than those in [1, 33]. It is worth mentioning that our linear convergence is the first established for the general DYS with an indicator function of a linear image space, which does not follow from the prior works [13, 18, 37, 12] since indicator functions are neither smooth nor strongly convex. (iii) Numerical examples demonstrate the higher performance of our proposed approach compared with [20] and [21]. In particular, the superiority over the standard NIDS [20] is attributed to a novel mixing matrix obtained from our proposed method, which realizes a preferable eigenvalue distribution for fast consensus; we also provide theoretical evidence for this. Note that one can construct this matrix without global information and use it for other consensus-based algorithms.

The remainder of this note is organized as follows. Section 2 provides preliminaries. Section 3 presents the definition of the CD matrix and its analysis including a new mixing matrix. In Section 4, we propose new distributed algorithms based on the DYS. In Section 5, we analyze the proposed methods for consensus optimization and show a new linear convergence result. Section 6 presents numerical experiments. Section 7 provides the proof of the convergence rate. Finally, Section 8 concludes this note.

2 Preliminaries

We here prepare several important notions.

Notations

Throughout this note, we use the following notation. Let $I_d$ denote the $d\times d$ identity matrix in $\mathbb{R}^{d\times d}$; we omit the subscript $d$ when the dimension is obvious. Let $O_{d_1\times d_2}$ be the $d_1\times d_2$ zero matrix and $\mathbf{1}_d=[1,\ldots,1]^{\top}\in\mathbb{R}^{d}$. For $\mathcal{M}\subset\mathcal{N}$, $[x_j]_{j\in\mathcal{M}}$ and $x_{\mathcal{M}}$ represent the stacked vector, in ascending order of indices, obtained from the vectors $x_j\in\mathbb{R}^{d_j}$, $j\in\mathcal{M}$; we use the same notation for stacked matrices. Let $\mathrm{diag}(a)$ with $a=[a_1,\ldots,a_n]^{\top}$ denote the diagonal matrix whose $i$th diagonal entry is $a_i\in\mathbb{R}$. Similarly, $\mathrm{blk\text{-}diag}([\ldots,R_i,\ldots])$ and $\mathrm{blk\text{-}diag}([R_j]_{j\in\mathcal{M}})$ represent block diagonal matrices. For a symmetric matrix $Q\succ O$, let $\|u\|_Q=\sqrt{\langle u,u\rangle_Q}$ with the inner product $\langle u,v\rangle_Q:=v^{\top}Qu$, and we simply write $\|\cdot\|_{I_m}=\|\cdot\|$ for $Q=I_m$. Let $\lambda_{\mathrm{max}}(Q)$ and $\lambda_{\mathrm{min}}(Q)$ be the largest and smallest eigenvalues of $Q$, respectively. For a differentiable function $f:\mathbb{R}^{d}\to\mathbb{R}$ and $x\in\mathbb{R}^{d}$, we write $\nabla_x f(\cdot)=\partial f/\partial x(\cdot)$, and simply $\nabla$ when the variable is obvious. For a proper, closed, and convex function $g:\mathbb{R}^{d}\to\mathbb{R}\cup\{+\infty\}$ and $Q\succ O$, the proximal operator of $g$ with metric $Q$ is $\mathrm{prox}^{Q}_{g}(x)=\operatorname*{arg\,min}_{x^{\prime}\in\mathbb{R}^{d}}\{g(x^{\prime})+\|x-x^{\prime}\|_{Q}^{2}/2\}$, and we denote $\mathrm{prox}_{g}^{I}(\cdot)=\mathrm{prox}_{g}(\cdot)$ for $Q=I$. Let $\delta_{\mathcal{D}}(\cdot)$ represent the indicator function of $\mathcal{D}$, i.e., $\delta_{\mathcal{D}}(x)=0$ for $x\in\mathcal{D}$ and $\delta_{\mathcal{D}}(x)=\infty$ for $x\notin\mathcal{D}$. The projection onto a closed convex set $\mathcal{D}$ with metric $Q$ is $P^{Q}_{\mathcal{D}}(x)=\operatorname*{arg\,min}_{x^{\prime}\in\mathcal{D}}\|x-x^{\prime}\|_{Q}$, and we write $P^{I}_{\mathcal{D}}(\cdot)=P_{\mathcal{D}}(\cdot)$ for $Q=I$.

Graph theory

Consider a graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ with a node set $\mathcal{N}=\{1,\ldots,n\}$ and an edge set $\mathcal{E}$ consisting of pairs $\{i,j\}$ of distinct nodes $i,j\in\mathcal{N}$. Throughout this note, we consider undirected graphs and do not distinguish $\{i,j\}$ and $\{j,i\}$ for each $\{i,j\}\in\mathcal{E}$. For $i\in\mathcal{N}$ and $\mathcal{G}$, let $\mathcal{N}_i\subset\mathcal{N}$ be the neighbor set of node $i$ over $\mathcal{G}$, defined as $\mathcal{N}_i=\{j\in\mathcal{N}:\{i,j\}\in\mathcal{E}\}\cup\{i\}$.

For an undirected graph $\mathcal{G}$, consider a set $\mathcal{C}\subset\mathcal{N}$. The set $\mathcal{C}$ is called a clique of $\mathcal{G}$ if the subgraph of $\mathcal{G}$ induced by $\mathcal{C}$ is complete [7]. We define $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}=\{1,2,\ldots,q\}$ as the set of indices of all the cliques in $\mathcal{G}$, and $\mathcal{Q}_{\mathcal{G}}$ represents a subset of $\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$. If a clique $\mathcal{C}$ is not contained in any other clique, $\mathcal{C}$ is said to be maximal. Let $\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}\subset\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$ be the set of indices of all the maximal cliques in $\mathcal{G}$, and let $\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$ be the index set of all the edges. For $\mathcal{Q}_{\mathcal{G}}\subset\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$ and $i\in\mathcal{N}$, we define $\mathcal{Q}_{\mathcal{G}}^{i}$ as the index set of all cliques in $\mathcal{Q}_{\mathcal{G}}$ containing $i$. Similarly, $\mathcal{Q}_{\mathcal{G}}^{ij}=\mathcal{Q}_{\mathcal{G}}^{ji}=\mathcal{Q}_{\mathcal{G}}^{i}\cap\mathcal{Q}_{\mathcal{G}}^{j}$. For each $i\in\mathcal{N}$, $\mathcal{N}_i$, and $\mathcal{C}_l,\,l\in\mathcal{Q}^{i}_{\mathcal{G}}$,

\[
\bigcup_{l\in\mathcal{Q}^{i}_{\mathcal{G}}}\mathcal{C}_l\subset\mathcal{N}_i, \tag{3}
\]

holds [26]. Note that agent $i$ can independently obtain $\mathcal{C}_l,\,l\in\mathcal{Q}^{i}_{\mathcal{G}}$ from the undirected subgraph induced by $\mathcal{N}_i$.

Operator splitting

Consider

\[
\min_{y\in\mathbb{R}^{d}}\ f(y)+g(y)+h(y), \tag{5}
\]

where $f:\mathbb{R}^{d}\to\mathbb{R}$ is a smooth convex function, and $g,h:\mathbb{R}^{d}\to\mathbb{R}\cup\{\infty\}$ are proper, closed, and convex functions. For this problem, the following versatile algorithm, called the (variable metric) Davis-Yin splitting (DYS), has been proposed in [13]:

\[
\begin{array}{l}
y^{k+1/2}=\mathrm{prox}_{\alpha h}^{M}(z^{k})\\
y^{k+1}=\mathrm{prox}_{\alpha g}^{M}\bigl(2y^{k+1/2}-z^{k}-\alpha M^{-1}\nabla f(y^{k+1/2})\bigr)\\
z^{k+1}=z^{k}+y^{k+1}-y^{k+1/2},
\end{array} \tag{9}
\]

where $M\in\mathbb{R}^{d\times d}$ is a positive definite symmetric matrix. The case $M=I$ corresponds to the standard DYS. This algorithm reduces to the Douglas-Rachford splitting when $f=0$ and to the forward-backward splitting when $g=0$. We have the following basic result, which states that $y^{k+1/2}$ and $y^{k+1}$ converge to a solution of (5) under an appropriate $\alpha>0$. For further convergence results, see [13, 24, 3, 18, 37, 12] and Subsection 5.2.

Lemma 1.

Suppose that $M^{-1/2}\nabla f(y)M^{-1/2}$ is $L$-Lipschitz continuous for positive definite $M$. Let $z^{0}\in\mathbb{R}^{d}$ and $\alpha\in(0,2/L)$. Assume that Problem (5) has an optimal solution. Then, $y^{k}$ and $y^{k+1/2}$ updated by (9) converge to an optimal solution of Problem (5).
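To make the order of the updates in (9) concrete, here is a minimal numerical sketch with $M=I$ in Python. The problem data (a quadratic $f$, the $\ell_1$ norm as $g$, and the nonnegativity indicator as $h$) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def davis_yin(grad_f, prox_g, prox_h, z0, alpha, num_iters=1000):
    """Standard DYS (9) with M = I; prox_g(v, a) should return prox_{a g}(v)."""
    z = z0.copy()
    for _ in range(num_iters):
        y_half = prox_h(z, alpha)                                        # y^{k+1/2}
        y = prox_g(2 * y_half - z - alpha * grad_f(y_half), alpha)       # y^{k+1}
        z = z + y - y_half                                               # z^{k+1}
    return y_half

# Toy instance: minimize 0.5*||y - b||^2 + ||y||_1 + indicator{y >= 0}
b = np.array([1.5, -0.3, 0.8])
y_opt = davis_yin(
    grad_f=lambda y: y - b,                                              # f = 0.5||y - b||^2, L = 1
    prox_g=lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a, 0.0),     # soft-thresholding
    prox_h=lambda v, a: np.maximum(v, 0.0),                              # projection onto the nonnegative orthant
    z0=np.zeros(3), alpha=1.0)                                           # alpha < 2/L
print(y_opt)                                                             # ~ [0.5, 0.0, 0.0]
```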

3 Clique-Wise Duplication Matrix

This section presents the definition and properties of the CD matrix $\mathbf{D}$, which allows us to leverage operator splitting techniques for Problem (2) in a distributed fashion. (Note that the matrix $\mathbf{D}$ itself is not new; the same or similar ideas can be found in other existing papers, e.g., on SDP [29, 42] and generalized Nash equilibrium seeking [6].) We also present a new mixing matrix $\boldsymbol{\Phi}$ built from the matrix $\mathbf{D}$, showing a preferable distribution of eigenvalues.

3.1 Fundamentals

The definition and essential properties of the CD matrix are presented in what follows.

First, we assume the non-emptiness of $\mathcal{Q}_{\mathcal{G}}^{i}$. If this assumption is not satisfied, we can alternatively consider the subgraph induced by the node set $\bigcup_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{C}_l$.

Assumption 1.

For all $i\in\mathcal{N}$, $\mathcal{Q}_{\mathcal{G}}^{i}\neq\emptyset$ holds.

Then, the definition of the CD matrix is given as follows. Here, $d_i$ for each $i\in\mathcal{N}$ is the size of $x_i$ in Problem (2), and we define

\[
d=\sum_{i=1}^{n}d_i,\quad d^{l}=\sum_{j\in\mathcal{C}_l}d_j,\quad \hat{d}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}d^{l}.
\]
Definition 1.

For $d_i,\,i\in\mathcal{N}$ and cliques $\mathcal{C}_l,\,l\in\mathcal{Q}_{\mathcal{G}}$ of the graph $\mathcal{G}$, the clique-wise duplication (CD) matrix $\mathbf{D}$ is defined as

\[
\mathbf{D}:=\begin{bmatrix}D_{1}\\ \vdots\\ D_{|\mathcal{Q}_{\mathcal{G}}|}\end{bmatrix}\in\mathbb{R}^{\hat{d}\times d}, \tag{10}
\]

where $D_l=[E_j]_{j\in\mathcal{C}_l}\in\mathbb{R}^{d^{l}\times d}$ and $E_j=[O_{d_j\times d_1},\ldots,I_{d_j},\ldots,O_{d_j\times d_n}]\in\mathbb{R}^{d_j\times d}$ for each $l\in\mathcal{Q}_{\mathcal{G}}$.

The CD matrix $\mathbf{D}$ can be interpreted as follows. For $\mathbf{x}=[x_1^{\top},\ldots,x_n^{\top}]^{\top}\in\mathbb{R}^{d}$, we have $\mathbf{D}\mathbf{x}=[x_{\mathcal{C}_l}]_{l\in\mathcal{Q}_{\mathcal{G}}}\in\mathbb{R}^{\hat{d}}$ since $D_l\mathbf{x}=x_{\mathcal{C}_l}\in\mathbb{R}^{d^{l}}$. Hence, the CD matrix $\mathbf{D}$ generates the copies of $\mathbf{x}$ with respect to the cliques $\mathcal{C}_l,\,l\in\mathcal{Q}_{\mathcal{G}}$.

The following lemma provides the fundamental properties of the CD matrix. Let the matrix $E_{l,i}\in\mathbb{R}^{d_i\times d^{l}}$ be

\[
E_{l,i}=[O_{d_i\times d_{j_1}},\ldots,I_{d_i},\ldots,O_{d_i\times d_{j_{|\mathcal{C}_l|}}}]\in\mathbb{R}^{d_i\times d^{l}} \tag{11}
\]

for $\mathcal{C}_l=\{j_1,\ldots,i,\ldots,j_{|\mathcal{C}_l|}\},\,l\in\mathcal{Q}_{\mathcal{G}}^{i}$. This matrix $E_{l,i}$ fulfills $E_{l,i}x_{\mathcal{C}_l}=x_i$ for $x_{\mathcal{C}_l}$ and $i\in\mathcal{C}_l$.

Lemma 2.

Under Assumption 1, the following hold.

  • (a)

    $\mathbf{D}$ has full column rank.

  • (b)

    $\mathbf{D}^{\top}\mathbf{D}=\mathrm{blk\text{-}diag}(|\mathcal{Q}_{\mathcal{G}}^{1}|I_{d_1},\ldots,|\mathcal{Q}_{\mathcal{G}}^{n}|I_{d_n})\succ O$.

  • (c)

    For $\mathbf{y}=[y_l]_{l\in\mathcal{Q}_{\mathcal{G}}}\in\mathbb{R}^{\hat{d}}$ with $y_l\in\mathbb{R}^{d^{l}}$,

    \[
    \mathbf{D}^{\top}\mathbf{y}=\begin{bmatrix}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{1}}E_{l,1}y_l\\ \vdots\\ \sum_{l\in\mathcal{Q}_{\mathcal{G}}^{n}}E_{l,n}y_l\end{bmatrix}\in\mathbb{R}^{d}. \tag{12}
    \]

Using the CD matrix and (3), we can compute, in a distributed fashion, the least-squares solution of $\mathbf{y}=\mathbf{D}\mathbf{x}$, i.e.,

\[
\mathbf{x}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}, \tag{13}
\]

and the projection of $\mathbf{y}$ onto $\mathrm{Im}(\mathbf{D})$ as

\[
P_{\mathrm{Im}(\mathbf{D})}(\mathbf{y})=\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}. \tag{14}
\]
Example 1.

Consider $\mathcal{G}=(\mathcal{N},\mathcal{E})$ with $\mathcal{N}=\{1,2,3\}$ and $\mathcal{E}=\{\{1,2\},\{2,3\}\}$. Let $d_1=d_2=d_3=1$ and $\mathcal{Q}_{\mathcal{G}}=\{1,2\}$ with $\mathcal{C}_1=\{1,2\}$ and $\mathcal{C}_2=\{2,3\}$. Then, we obtain $\mathcal{Q}_{\mathcal{G}}^{1}=\{1\}$, $\mathcal{Q}_{\mathcal{G}}^{2}=\{1,2\}$, and $\mathcal{Q}_{\mathcal{G}}^{3}=\{2\}$, which ensures Assumption 1. For this system, the CD matrix is $\mathbf{D}=[D_1^{\top},D_2^{\top}]^{\top}\in\mathbb{R}^{4\times 3}$, where $D_1=\begin{bmatrix}1&0&0\\0&1&0\end{bmatrix}$ and $D_2=\begin{bmatrix}0&1&0\\0&0&1\end{bmatrix}$. We then obtain $D_1\mathbf{x}=[x_1,x_2]^{\top}$ and $D_2\mathbf{x}=[x_2,x_3]^{\top}$ for $\mathbf{x}=[x_1,x_2,x_3]^{\top}\in\mathbb{R}^{3}$. Moreover, $\mathbf{D}^{\top}\mathbf{D}=D_1^{\top}D_1+D_2^{\top}D_2=\mathrm{diag}(1,2,1)=\mathrm{diag}(|\mathcal{Q}_{\mathcal{G}}^{1}|,|\mathcal{Q}_{\mathcal{G}}^{2}|,|\mathcal{Q}_{\mathcal{G}}^{3}|)$, and

\[
\mathbf{D}^{\top}\mathbf{y}=D_1^{\top}y_1+D_2^{\top}y_2=\begin{bmatrix}y_{1,1}\\ y_{1,2}+y_{2,1}\\ y_{2,2}\end{bmatrix}=\begin{bmatrix}E_{1,1}y_1\\ E_{1,2}y_1+E_{2,2}y_2\\ E_{2,3}y_2\end{bmatrix}
\]

for any vector $\mathbf{y}=[y_1^{\top},y_2^{\top}]^{\top}\in\mathbb{R}^{4}$ with $y_1=[y_{1,1},y_{1,2}]^{\top}\in\mathbb{R}^{2}$ and $y_2=[y_{2,1},y_{2,2}]^{\top}\in\mathbb{R}^{2}$, which can be computed in a distributed fashion.
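Example 1 can also be reproduced numerically. The following sketch (scalar agents, $d_i=1$) builds $\mathbf{D}$ from a clique list and checks Lemma 2(b) as well as the least-squares solution (13) and the projection (14).

```python
import numpy as np

def cd_matrix(cliques, n):
    """CD matrix D in (10) for scalar agents: stack the selector rows E_j, j in C_l, clique by clique."""
    rows = []
    for C in cliques:
        for j in C:                                   # agents are numbered 1, ..., n
            e = np.zeros(n)
            e[j - 1] = 1.0
            rows.append(e)
    return np.vstack(rows)

cliques = [(1, 2), (2, 3)]                            # C_1 = {1,2}, C_2 = {2,3} as in Example 1
D = cd_matrix(cliques, n=3)
print(D.T @ D)                                        # diag(1, 2, 1) = diag(|Q^1|, |Q^2|, |Q^3|), Lemma 2(b)

y = np.array([1.0, 2.0, 3.0, 4.0])                    # y = [y_1; y_2], clique-wise copies
x_ls = np.linalg.solve(D.T @ D, D.T @ y)              # least-squares solution (13): [1.0, 2.5, 4.0]
print(x_ls, D @ x_ls)                                 # D x_ls is the projection of y onto Im(D), (14)
```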

3.2 Useful properties

Here, we provide useful properties of the CD matrix $\mathbf{D}$.

The following result shows that gradients and proximal operators involving $\mathbf{D}$ can be computed in a distributed fashion. Here, the $i$th block $x_i$ of $\mathbf{x}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}$ is given by $x_i=E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}y_l$ from Lemma 2.

Proposition 1.

Let $\mathbf{y}\in\mathbb{R}^{\hat{d}}$. Then, under Assumption 1, the following equations hold.

  • (a)

    Let $\hat{g}_i:\mathbb{R}^{d_i}\to\mathbb{R}\cup\{\infty\}$ be a proper, closed, and convex function for each $i\in\mathcal{N}$. Define $G:\mathbb{R}^{\hat{d}}\to\mathbb{R}\cup\{\infty\}$ as $G(\mathbf{z})=\delta_{\mathrm{Im}(\mathbf{D})}(\mathbf{z})+\sum_{i=1}^{n}\hat{g}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z})$. Let $\alpha>0$. Then,

    \[
    \mathrm{prox}_{\alpha G}(\mathbf{y})=\mathbf{D}\begin{bmatrix}\mathrm{prox}_{\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{1}|}\hat{g}_1}(E_1(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\\ \vdots\\ \mathrm{prox}_{\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{n}|}\hat{g}_n}(E_n(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\end{bmatrix}. \tag{15}
    \]
  • (b)

    Let $\mathbf{Q}=\mathrm{blk\text{-}diag}([Q_l]_{l\in\mathcal{Q}_{\mathcal{G}}})$, where $Q_l=\mathrm{blk\text{-}diag}([\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}I_{d_j}]_{j\in\mathcal{C}_l})$ for each $l\in\mathcal{Q}_{\mathcal{G}}$. Then,

    \[
    \mathrm{prox}_{\alpha G}^{\mathbf{Q}}(\mathbf{y})=\mathbf{D}\begin{bmatrix}\mathrm{prox}_{\alpha\hat{g}_1}(E_1(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\\ \vdots\\ \mathrm{prox}_{\alpha\hat{g}_n}(E_n(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\end{bmatrix}. \tag{16}
    \]
  • (c)

    Let $\hat{f}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ be a differentiable function. Then,

    \[
    \frac{\partial}{\partial\mathbf{y}}\sum_{i=1}^{n}\hat{f}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})
    =\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}\begin{bmatrix}\nabla_{x_1}\hat{f}_1(E_1(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\\ \vdots\\ \nabla_{x_n}\hat{f}_n(E_n(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})\end{bmatrix}. \tag{17}
    \]
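As a quick sanity check of (15), the sketch below takes $\hat{g}_i(x_i)=|x_i|$ on the Example 1 graph and compares the closed form with a direct numerical minimization of $\alpha G(\mathbf{z})+\|\mathbf{y}-\mathbf{z}\|^{2}/2$ over $\mathbf{z}=\mathbf{D}\mathbf{x}$. The choices of $\hat{g}_i$, $\alpha$, and $\mathbf{y}$ are assumptions of this illustration, and SciPy is used only to produce a reference solution.

```python
import numpy as np
from scipy.optimize import minimize

D = np.array([[1., 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])   # CD matrix of Example 1
q = np.diag(D.T @ D)                                           # |Q^i| = [1, 2, 1]
alpha = 0.7
y = np.array([0.9, -1.4, 2.0, 0.3])

# Closed form (15): soft-threshold the clique-averaged copies with weight alpha/|Q^i|
x_bar = np.linalg.solve(D.T @ D, D.T @ y)                      # E_i (D^T D)^{-1} D^T y
x_prox = np.sign(x_bar) * np.maximum(np.abs(x_bar) - alpha / q, 0.0)
prox_closed = D @ x_prox

# Reference: G is +infinity outside Im(D), so minimize over x with z = D x
obj = lambda x: alpha * np.abs(x).sum() + 0.5 * np.sum((y - D @ x) ** 2)
x_ref = minimize(obj, np.zeros(3), method="Nelder-Mead",
                 options={"xatol": 1e-10, "fatol": 1e-12}).x
print(np.allclose(prox_closed, D @ x_ref, atol=1e-4))          # True
```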

Additionally, we provide properties of the CD matrix concerning the matrix $\mathbf{Q}$. These properties are useful for deriving the NIDS [20] and Exact diffusion [38] from the proposed method.

Proposition 2.

Let $\mathbf{Q}$ denote the matrix in Proposition 1b. Then, under Assumption 1, the following equations hold:

  • (a)

    $\mathbf{D}^{\top}\mathbf{Q}\mathbf{D}=I_d$.

  • (b)

    $\mathbf{D}^{\top}\mathbf{Q}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}$ and $\mathbf{D}^{\top}\mathbf{Q}^{-1}=\mathbf{D}^{\top}\mathbf{D}\mathbf{D}^{\top}$.

  • (c)

    $\mathbf{Q}\mathbf{D}=\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}$ and $\mathbf{Q}^{-1}\mathbf{D}=\mathbf{D}\mathbf{D}^{\top}\mathbf{D}$.

3.3 A mixing matrix $\boldsymbol{\Phi}$

Using the CD matrix and the matrices $Q_l,\,l\in\mathcal{Q}_{\mathcal{G}}$ in Proposition 1b, we can obtain a positive semidefinite and doubly stochastic matrix $\boldsymbol{\Phi}$ that will be used in Section 5. Thanks to its definition, $\boldsymbol{\Phi}$ in (18) can be constructed only from local information (i.e., $\mathcal{Q}_{\mathcal{G}}^{j},\,j\in\bigcup_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\mathcal{C}_l\subset\mathcal{N}_i$). Note that this matrix can be viewed as a special case of the clique-based projection $T$ in [30] and Appendix F for the consensus constraint, i.e., $\boldsymbol{\Phi}\mathbf{x}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}P^{\mathbf{Q}}_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_l}(\mathbf{D}\mathbf{x})$ for $\mathcal{D}_l$ in (39). We pose the following assumption, which is not restrictive: it is satisfied, for example, for $\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}$, $\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}$, and $\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$.

Assumption 2.

For $\mathcal{Q}_{\mathcal{G}}$ and all distinct $i,j\in\mathcal{N}$, $\mathcal{Q}_{\mathcal{G}}^{i}\cap\mathcal{Q}_{\mathcal{G}}^{j}\neq\emptyset\Leftrightarrow\{i,j\}\in\mathcal{E}$.

The matrix $\boldsymbol{\Phi}$ and its basic properties are given as follows.

Proposition 3.

Suppose Assumptions 1 and 2. Consider the matrices $Q_l,\,l\in\mathcal{Q}_{\mathcal{G}}$ in Proposition 1b, and suppose that $d_1=\cdots=d_n=1$. Then,

\[
\boldsymbol{\Phi}=\begin{bmatrix}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{1}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{1}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\\ \vdots\\ \frac{1}{|\mathcal{Q}_{\mathcal{G}}^{n}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{n}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\end{bmatrix}\in\mathbb{R}^{n\times n} \tag{18}
\]

is doubly stochastic, and it holds that

\[
[\boldsymbol{\Phi}]_{ij}=\begin{cases}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}||\mathcal{Q}_{\mathcal{G}}^{j}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{ij}}\frac{1}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}},&\{i,j\}\in\mathcal{E}\ \text{or}\ i=j\\ 0,&\text{otherwise},\end{cases} \tag{19}
\]

where $[\boldsymbol{\Phi}]_{ij}$ represents the $(i,j)$ entry of $\boldsymbol{\Phi}$. Moreover, $\lambda_{\mathrm{max}}(\boldsymbol{\Phi})=1$ and $\lambda_{\mathrm{min}}(\boldsymbol{\Phi})\geq 0$ hold. Furthermore, when $\mathcal{G}$ is connected, the eigenvalue $1$ of $\boldsymbol{\Phi}$ is simple.

Proof.

The right stochasticity is proved as $\bigl(\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\bigr)\mathbf{1}_n=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}1=1$. Using the definition of $D_l$ in Definition 1, the left stochasticity is also verified as

\[
\mathbf{1}_n^{\top}\boldsymbol{\Phi}=\sum_{i=1}^{n}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}
=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in\mathcal{C}_l}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}
=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in\mathcal{C}_l}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}E_j
=\sum_{i=1}^{n}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_i=\sum_{i=1}^{n}E_i=\mathbf{1}_n^{\top}
\]

from $\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}=\sum_{j\in\mathcal{C}_l}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}$ and $\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l=\sum_{j\in\mathcal{C}_l}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}E_j$. Next,

\[
[\boldsymbol{\Phi}]_{ij}=E_i\boldsymbol{\Phi}E_j^{\top}=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}E_j^{\top}=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\sum_{p\in\mathcal{C}_l}\frac{1}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{p}|}E_pE_j^{\top}
\]

holds, which yields (19). Moreover, $\lambda_{\mathrm{max}}(\boldsymbol{\Phi})=1$ directly follows from the Gershgorin disk theorem [8]. Additionally, from the firm nonexpansiveness of the clique-based projection $T$ (see Proposition 6), we obtain $\mathbf{x}^{\top}\boldsymbol{\Phi}\mathbf{x}\geq\|\boldsymbol{\Phi}\mathbf{x}\|^{2}$ for any $\mathbf{x}\in\mathbb{R}^{n}$, which gives $\lambda_{\mathrm{min}}(\boldsymbol{\Phi})\geq 0$.

Finally, by the assumption that $\mathcal{Q}_{\mathcal{G}}^{ij}\neq\emptyset\Leftrightarrow\{i,j\}\in\mathcal{E}$, the graph associated with $\boldsymbol{\Phi}$ equals $\mathcal{G}$. Therefore, the eigenvalue $1$ of $\boldsymbol{\Phi}$ is simple when $\mathcal{G}$ is connected (see [8]). ∎

We now state the following proposition for $\boldsymbol{\Phi}$ in (18), which implies that $\boldsymbol{\Phi}$ enables fast consensus: by the Gershgorin disk theorem [8], all the eigenvalues smaller than $1$ are likely to take smaller values than those of other popular positive semidefinite mixing matrices. A numerical example of the eigenvalues and a sketch of the implication of Proposition 4 are illustrated in Fig. 2.

Proposition 4.

Suppose Assumption 1. For an undirected graph $\mathcal{G}$, consider the matrix $\boldsymbol{\Phi}$ in (18) with $\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$. Let $\widetilde{\mathbf{W}}_{\mathbf{L}}=(I+\mathbf{W}_{\mathbf{L}})/2$, where $\mathbf{W}_{\mathbf{L}}=I-\epsilon\mathbf{L}$ with the Laplacian matrix $\mathbf{L}$ of $\mathcal{G}$ and $\epsilon\in(0,1/\max_{i\in\mathcal{N}}\{|\mathcal{N}_i|-1\})$. (The matrix $\mathbf{L}$ is defined by $[\mathbf{L}]_{ii}=|\mathcal{N}_i|-1$ for $i=1,\ldots,n$, $[\mathbf{L}]_{ij}=-1$ for $\{i,j\}\in\mathcal{E}$, and $[\mathbf{L}]_{ij}=0$ otherwise; in [32], $[\mathbf{W}_{\mathbf{L}}]_{ij}$ with $\epsilon=1/\max_{i\in\mathcal{N}}\{|\mathcal{N}_i|\}$ is called the max-degree weight.) Similarly, let $\widetilde{\mathbf{W}}_{\mathrm{mh}}=(I+\mathbf{W}_{\mathrm{mh}})/2$ with $\mathbf{W}_{\mathrm{mh}}$ given by the Metropolis–Hastings weights in [32], i.e., $[\mathbf{W}_{\mathrm{mh}}]_{ij}=1/(\max\{|\mathcal{N}_i|-1,|\mathcal{N}_j|-1\}+\varepsilon)$ for $\{i,j\}\in\mathcal{E}$, $[\mathbf{W}_{\mathrm{mh}}]_{ii}=1-\sum_{j\in\mathcal{N}_i\setminus\{i\}}[\mathbf{W}_{\mathrm{mh}}]_{ij}$ for $i=1,\ldots,n$, and $[\mathbf{W}_{\mathrm{mh}}]_{ij}=0$ otherwise. Then, for all $i=1,\ldots,n$, we have

\[
[\boldsymbol{\Phi}]_{ii}<[\widetilde{\mathbf{W}}_{\mathbf{L}}]_{ii},\quad[\boldsymbol{\Phi}]_{ii}<[\widetilde{\mathbf{W}}_{\mathrm{mh}}]_{ii}. \tag{20}
\]
Proof.

When $\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$, $|\mathcal{Q}_{\mathcal{G}}^{i}|=|\mathcal{N}_i|-1$ holds for $i=1,\ldots,n$, and $\mathcal{Q}_{\mathcal{G}}^{ij}$ for $\{i,j\}\in\mathcal{E}$ becomes a singleton $\mathcal{Q}_{\mathcal{G}}^{ij}=\{l\}$ with $\mathcal{C}_l=\{i,j\}$, since $\mathcal{Q}_{\mathcal{G}}^{i}$ is just the set of indices of the edges that include $i$. Then, for $\{i,j\}\in\mathcal{E}$ we get $[\boldsymbol{\Phi}]_{ij}=\frac{1}{|\mathcal{N}_i|-1+|\mathcal{N}_j|-1}$. Hence, recalling the definitions of $\widetilde{\mathbf{W}}_{\mathbf{L}}$ and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$, for $\{i,j\}\in\mathcal{E}$ we have $[\widetilde{\mathbf{W}}_{\mathbf{L}}]_{ij}=\epsilon/2<1/(2\max_{i\in\mathcal{N}}\{|\mathcal{N}_i|-1\})\leq[\boldsymbol{\Phi}]_{ij}$ and $[\widetilde{\mathbf{W}}_{\mathrm{mh}}]_{ij}=1/\bigl(2(\max\{|\mathcal{N}_i|-1,|\mathcal{N}_j|-1\}+\varepsilon)\bigr)<1/\bigl(2\max\{|\mathcal{N}_i|-1,|\mathcal{N}_j|-1\}\bigr)\leq[\boldsymbol{\Phi}]_{ij}$, respectively. Therefore, since all $(i,j)$ entries of $\boldsymbol{\Phi}$ for $\{i,j\}\in\mathcal{E}$ are larger than those of $\widetilde{\mathbf{W}}_{\mathbf{L}}$ and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$ and all these matrices are doubly stochastic, we obtain (20). ∎

Remark 1.

In Fig. 2a, we use $\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}$ rather than $\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$, which also realizes smaller eigenvalues. Likewise, even when $\mathcal{Q}_{\mathcal{G}}\neq\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}$, $\boldsymbol{\Phi}$ can have smaller eigenvalues than $\widetilde{\mathbf{W}}_{\mathbf{L}}$ and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$.

Refer to caption
Figure 2: (a) Comparison of the eigenvalues $\lambda_i$ of $\boldsymbol{\Phi}$, $\widetilde{\mathbf{W}}_{\mathbf{L}}$ with $\epsilon=0.99/(\max_{i\in\mathcal{N}}|\mathcal{N}_i|-1)$, and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$ for the random graph with $n=50$ shown inside the plot; (b) a sketch of the region containing each eigenvalue of $\boldsymbol{\Phi}$, $\widetilde{\mathbf{W}}_{\mathbf{L}}$, and $\widetilde{\mathbf{W}}_{\mathrm{mh}}$ implied by Proposition 4 and the Gershgorin disk theorem. Both indicate that $\boldsymbol{\Phi}$ likely has smaller eigenvalues.
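The mixing matrix (18) is easy to assemble from the cliques alone. The sketch below builds $\boldsymbol{\Phi}$ for scalar agents from the maximal cliques of a random graph and checks the double stochasticity and eigenvalue range of Proposition 3, mirroring the comparison in Fig. 2; the particular graph and the networkx usage are assumptions of this illustration.

```python
import numpy as np
import networkx as nx

def mixing_phi(cliques, n):
    """Mixing matrix Phi of (18)/(19) for scalar agents (d_i = 1)."""
    q = np.zeros(n)                                    # q[i] = |Q_G^i|
    for C in cliques:
        for j in C:
            q[j] += 1.0
    Phi = np.zeros((n, n))
    for C in cliques:
        C = list(C)
        w = 1.0 / q[C]                                 # diagonal of Q_l
        for i in C:                                    # row i gains (1/|Q^i|) * 1^T Q_l D_l / (1^T Q_l 1)
            Phi[i, C] += w / (q[i] * w.sum())
    return Phi

n = 20
G = nx.erdos_renyi_graph(n, 0.4, seed=1)
cliques = [tuple(c) for c in nx.find_cliques(G)]       # Q_G = Q_G^max (maximal cliques)
Phi = mixing_phi(cliques, n)
print(np.allclose(Phi.sum(axis=0), 1.0), np.allclose(Phi.sum(axis=1), 1.0))  # doubly stochastic
print(np.linalg.eigvalsh(Phi).round(3))                # eigenvalues lie in [0, 1]; 1 is simple if G is connected
```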

4 Solution to Clique-Wise Coupled Problems via Operator Splitting

This section presents our proposed algorithms for Problem (2), built from the CD matrix and the DYS in (9) with the metrics $M=I$ and $M=\mathbf{Q}$. We now assume the following.

Assumption 3.

Problem (2) has an optimal solution.

In what follows, the functions $f:\mathbb{R}^{\hat{d}}\to\mathbb{R}$, $g:\mathbb{R}^{\hat{d}}\to\mathbb{R}$, $\hat{f}:\mathbb{R}^{d}\to\mathbb{R}$, and $\hat{g}:\mathbb{R}^{d}\to\mathbb{R}$ represent

\[
f(\mathbf{y})=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}f_l(y_l),\quad g(\mathbf{y})=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}g_l(y_l), \tag{21}
\]
\[
\hat{f}(\mathbf{x})=\sum_{i=1}^{n}\hat{f}_i(x_i),\quad \hat{g}(\mathbf{x})=\sum_{i=1}^{n}\hat{g}_i(x_i). \tag{22}
\]

4.1 Algorithm description

Algorithm 1 Clique-based distributed Davis-Yin splitting (CD-DYS) algorithm for agent $i\in\mathcal{N}$.
0:  $z_l^{0}$ for all $l\in\mathcal{Q}_{\mathcal{G}}^{i}$ and $\alpha>0$.
1:  for $k=0,1,\ldots$ do
2:     $x_i^{k}=\mathrm{prox}_{\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\hat{g}_i}\bigl(\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}z_l^{k}\bigr)$
3:     Gather $x_j^{k}$ from the neighboring agents $j\in\bigcup_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\mathcal{C}_l\subset\mathcal{N}_i$.
4:     Obtain $y_l^{k+1/2}$, $y_l^{k+1}$, and $z_l^{k+1}$ for $l\in\mathcal{Q}_{\mathcal{G}}^{i}$ by
       $y_l^{k+1/2}=x^{k}_{\mathcal{C}_l}$
       $y_l^{k+1}=\mathrm{prox}_{\alpha g_l}\bigl(2y_l^{k+1/2}-z_l^{k}-\alpha\nabla_{y_l}f_l(y_l^{k+1/2})-\alpha\bigl[\tfrac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}\nabla_{x_j}\hat{f}_j(x_j^{k})\bigr]_{j\in\mathcal{C}_l}\bigr)$
       $z_l^{k+1}=z_l^{k}+y_l^{k+1}-y_l^{k+1/2}$
5:  end for

We present the distributed optimization algorithm in Algorithm 1, the clique-based distributed Davis-Yin splitting (CD-DYS) algorithm. This algorithm is distributed thanks to (3). By Lemma 2, the aggregated form of this algorithm is as follows:

\[
\begin{array}{l}
\mathbf{x}^{k}=\mathrm{prox}_{\sum_{i=1}^{n}\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\hat{g}_i(\cdot)}\bigl((\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k}\bigr)\\
\mathbf{y}^{k+1/2}=\mathbf{D}\mathbf{x}^{k}\\
\mathbf{y}^{k+1}=\mathrm{prox}_{\alpha g}\bigl(2\mathbf{y}^{k+1/2}-\mathbf{z}^{k}-\alpha\nabla_{\mathbf{y}}f(\mathbf{y}^{k+1/2})-\alpha\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr)\\
\mathbf{z}^{k+1}=\mathbf{z}^{k}+\mathbf{y}^{k+1}-\mathbf{y}^{k+1/2},
\end{array} \tag{28}
\]

where $\mathbf{x}^{k}=[x_1^{k\top},\ldots,x_n^{k\top}]^{\top}$, $\mathbf{y}^{k}=[y_l^{k}]_{l\in\mathcal{Q}_{\mathcal{G}}}$, $\mathbf{y}^{k+1/2}=[y_l^{k+1/2}]_{l\in\mathcal{Q}_{\mathcal{G}}}$, and $\mathbf{z}^{k}=[z_l^{k}]_{l\in\mathcal{Q}_{\mathcal{G}}}$. By Lemma 1, this algorithm converges to an optimal point under $\alpha\in\bigl(0,2/(\max_{l\in\mathcal{Q}_{\mathcal{G}}}L_l+\max_{i\in\mathcal{N}}\frac{\hat{L}_i}{|\mathcal{Q}_{\mathcal{G}}^{i}|})\bigr)$.

This algorithm can be derived as follows. From (13), for $\mathbf{y}=\mathbf{D}\mathbf{x}\in\mathrm{Im}(\mathbf{D})$, we can reformulate Problem (2) into the form of (5) as follows:

\[
\min_{y_l\in\mathbb{R}^{d^{l}},\,l\in\mathcal{Q}_{\mathcal{G}}}\ \underbrace{\sum_{i=1}^{n}\hat{f}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})+\sum_{l\in\mathcal{Q}_{\mathcal{G}}}f_l(y_l)}_{f\text{ in (5)}}+\underbrace{\sum_{l\in\mathcal{Q}_{\mathcal{G}}}g_l(y_l)}_{g\text{ in (5)}}+\underbrace{\sum_{i=1}^{n}\hat{g}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})+\delta_{\mathrm{Im}(\mathbf{D})}(\mathbf{y})}_{h\text{ in (5)}}. \tag{31}
\]

For Problem (31), Proposition 1 tells us that the function $\sum_{i=1}^{n}\hat{g}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})+\delta_{\mathrm{Im}(\mathbf{D})}(\mathbf{y})$ is proximable for proximable $\hat{g}_i$, and that its proximal operator can be computed in a distributed fashion. Accordingly, we can directly apply the DYS in (9) with $M=I$ to (31). From Proposition 1, setting $x_i^{k}=\mathrm{prox}_{\frac{\alpha}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\hat{g}_i}(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k})$ gives the distributed algorithm in Algorithm 1 (or (28)).
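For concreteness, the following sketch runs the aggregated iteration (28) on a toy instance over the Example 1 graph, with $f_l(y_l)=\tfrac12\|y_l-b_l\|^2$, $g_l=0$, $\hat f_i(x_i)=\tfrac12(x_i-c_i)^2$, and $\hat g_i=\delta_{\mathbb{R}_+}$; all data values are assumptions of this example. Since $\mathbf{D}^{\top}\mathbf{D}$ is diagonal, the minimizer has a closed form against which the iterate can be compared.

```python
import numpy as np

# Example 1 graph: D_1, D_2 select cliques {1,2} and {2,3}
D = np.array([[1., 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])
Dl = [D[0:2], D[2:4]]
b = [np.array([1.0, -2.0]), np.array([0.5, 3.0])]      # data of f_l
c = np.array([-1.0, 0.2, 1.0])                          # data of fhat_i
q = np.diag(D.T @ D)                                    # |Q_G^i|

alpha = 0.5                                             # < 2/(max_l L_l + max_i Lhat_i/|Q^i|) = 1
z = [np.zeros(2), np.zeros(2)]                          # z_l^0
for k in range(300):
    avg = sum(Dli.T @ zl for Dli, zl in zip(Dl, z)) / q # (D^T D)^{-1} D^T z^k
    x = np.maximum(avg, 0.0)                            # prox of ghat_i = delta_{R_+}: projection
    for l in range(2):
        y_half = Dl[l] @ x                              # y_l^{k+1/2} = x_{C_l}
        grad = (y_half - b[l]) + Dl[l] @ ((x - c) / q)  # grad f_l plus the clique's share of grad fhat
        y = 2 * y_half - z[l] - alpha * grad            # prox_{alpha g_l} is the identity (g_l = 0)
        z[l] = z[l] + y - y_half

print(x)
print(np.maximum((D.T @ np.concatenate(b) + c) / (q + 1), 0.0))  # closed-form minimizer, ~ [0, 0, 2]
```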

4.2 Variable metric version

Applying the variable metric DYS in (9) with $M=\mathbf{Q}$ in Proposition 1 to Problem (31) instead gives the following algorithm:

\[
\begin{array}{l}
x_i^{k}=\mathrm{prox}_{\alpha\hat{g}_i}\bigl(\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}z_l^{k}\bigr)\\
y_l^{k+1/2}=x^{k}_{\mathcal{C}_l}\\
y_l^{k+1}=\mathrm{prox}_{\alpha g_l}^{Q_l}\bigl(2y_l^{k+1/2}-z_l^{k}-\alpha Q_l^{-1}\nabla_{y_l}f_l(y_l^{k+1/2})-\alpha[\nabla_{x_j}\hat{f}_j(x_j^{k})]_{j\in\mathcal{C}_l}\bigr)\\
z_l^{k+1}=z_l^{k}+y_l^{k+1}-y_l^{k+1/2},
\end{array} \tag{36}
\]

where we use Propositions 1b and 2. As Fig. 3 illustrates, it will turn out in Section 5 that this algorithm has an interesting connection to other methods through $\boldsymbol{\Phi}$ in (18). By Lemma 1, a sufficient condition for convergence is $\alpha\in\bigl(0,2/(\max_{l\in\mathcal{Q}_{\mathcal{G}}}\max_{j\in\mathcal{C}_l}(|\mathcal{Q}_{\mathcal{G}}^{j}|L_l)+\max_{i\in\mathcal{N}}\hat{L}_i)\bigr)$.

5 Revisit of Consensus Optimization as A Clique-Wise Coupled Problem

[Diagram: the variable metric CD-DYS ((9) applied to (2) via (31)) specializes, via $M=\mathbf{Q}$ or $M=I$, to the variable metric CD-DYS w.r.t. $\mathbf{Q}$ (Alg. (36)) and the CD-DYS (Alg. 1 or (28)); with $g_l=\delta_{\mathcal{D}_l}$, $\mathcal{D}_l$ in (39), and $\hat{g}_i=0$ these further reduce to the NIDS [20] ((42) with $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$) and the CPGD [30], and, after approximation, to Exact diffusion [38] and diffusion [27] with $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$.]
Figure 3: The relationships among the proposed methods and existing methods.

This section is dedicated to a detailed analysis of the proposed methods of Section 4 in light of consensus optimization, presenting a renewed perspective. We demonstrate the relationships summarized in Fig. 3 by showing the most important part: Algorithm (36) generalizes the NIDS in [20]. Our analysis reveals that the matrix $\boldsymbol{\Phi}$ in (18) naturally arises in the NIDS. This fact and Proposition 4 imply that the proposed algorithm enables fast convergence [20] beyond standard mixing matrices. Furthermore, we present a new linear convergence rate for the NIDS with a non-smooth term based on its DYS structure. The linear rate follows from a more general result for the DYS (9).

We here consider a special case of Problem (2) given by

\[
\min_{x_i\in\mathbb{R}^{d_i},\,i\in\mathcal{N}}\ \sum_{i=1}^{n}\hat{f}_i(x_i)+\sum_{i=1}^{n}\hat{g}_i(x_i)+\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\underbrace{\delta_{\mathcal{D}_l}(x_{\mathcal{C}_l})}_{g_l(x_{\mathcal{C}_l})}. \tag{38}
\]

When $m=d_1=\cdots=d_n$ and

\[
\mathcal{D}_l=\{x_{\mathcal{C}_l}\in\mathbb{R}^{|\mathcal{C}_l|m}:\exists\theta\in\mathbb{R}^{m}\text{ s.t. }x_{\mathcal{C}_l}=\mathbf{1}_{|\mathcal{C}_l|}\otimes\theta\}, \tag{39}
\]

this problem is called a consensus optimization problem, which we discuss here. Note that according to [26], $\bigcap_{l\in\mathcal{Q}_{\mathcal{G}}}\{\mathbf{x}\in\mathbb{R}^{nm}:x_{\mathcal{C}_l}\in\mathcal{D}_l\}=\{\mathbf{x}\in\mathbb{R}^{nm}:x_1=\cdots=x_n\}$ holds under the connectivity of the graph $\mathcal{G}$ and Assumptions 1 and 2.

5.1 CD-DYS as generalized NIDS

NIDS algorithm

First, the NIDS algorithm [20] for consensus optimization is given, for $k=1,2,\ldots$, by

\[
\begin{array}{l}
\mathbf{x}^{k}=\mathrm{prox}_{\alpha\hat{g}}(\mathbf{w}^{k})\\
\mathbf{w}^{k+1}=\mathbf{w}^{k}-\mathbf{x}^{k}+\widetilde{\mathbf{W}}\bigl(2\mathbf{x}^{k}-\mathbf{x}^{k-1}+\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr)
\end{array} \tag{42}
\]

with arbitrary $\mathbf{x}^{0}$ and $\mathbf{w}^{1}=\mathbf{x}^{0}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{0})$. The matrix $\widetilde{\mathbf{W}}$ is a positive semidefinite doubly stochastic mixing matrix. A standard choice of $\widetilde{\mathbf{W}}$ that uses no global information is $\widetilde{\mathbf{W}}=\widetilde{\mathbf{W}}_{\mathrm{mh}}$ in Proposition 4. To make $\widetilde{\mathbf{W}}$ less conservative, [20] suggests that some global information is necessary (e.g., the value of $\lambda_{\mathrm{max}}(\mathbf{W}_{\mathrm{mh}})$).
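The update (42) is straightforward to prototype. The sketch below implements it and runs a toy scalar consensus problem with a Metropolis-type mixing matrix on a cycle graph; the graph, the data, and the stand-in $\widetilde{\mathbf{W}}$ are assumptions of this illustration, and $\boldsymbol{\Phi}$ from (18) can be plugged in instead.

```python
import numpy as np

def nids(W, grad_f, prox_g, x0, alpha, num_iters=2000):
    """NIDS (42): x^k = prox_{alpha ghat}(w^k), then the w-update with mixing matrix W."""
    x_prev = x0
    w = x0 - alpha * grad_f(x0)                              # w^1
    for _ in range(num_iters):
        x = prox_g(w, alpha)
        w = w - x + W @ (2 * x - x_prev + alpha * (grad_f(x_prev) - grad_f(x)))
        x_prev = x
    return x

# Toy data: fhat_i(x_i) = 0.5 (x_i - b_i)^2, ghat_i(x_i) = lam |x_i|, scalar agents on a cycle
n, lam = 20, 0.1
b = np.random.default_rng(0).normal(size=n)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0
W_tilde = (np.eye(n) + W) / 2                                # doubly stochastic and positive semidefinite
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x = nids(W_tilde, grad_f=lambda x: x - b, prox_g=lambda v, a: soft(v, a * lam),
         x0=np.zeros(n), alpha=1.0)
print(x)                                                     # ~ soft(mean(b), lam) at every agent
```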

Analysis

We present the following proposition, which states that the proposed algorithm in (36) yields the NIDS algorithm with $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$ in (18) and can thus achieve fast convergence, as shown in Proposition 4 and Fig. 2. Note that we show the case of $m=1$ for simplicity.

Proposition 5.

Consider Algorithm (36) for Problem (38). Suppose Assumptions 1–3. Assume that for all $i\in\mathcal{N}$, $\hat{f}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ is $\hat{L}_i$-smooth and convex, and $\hat{g}_i:\mathbb{R}^{d_i}\to\mathbb{R}$ is proper, closed, and convex. For arbitrary $\mathbf{x}^{0}$, let $\mathbf{z}^{1}=\mathbf{D}(\mathbf{x}^{0}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{0}))$. Then, for $k=1,2,\ldots$, $\mathbf{w}^{k}:=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k}$ and $\mathbf{x}^{k}:=\mathrm{prox}_{\alpha\hat{g}}(\mathbf{w}^{k})$ satisfy the NIDS update (42) with $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$ in (18).

Proof.

By Lemma 2b–c, multiplying the $z_l$-update in (36) by $(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}$ gives $w_i^{k+1}=w_i^{k}-x_i^{k}+\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}\,\mathrm{prox}_{\delta_{\mathcal{D}_l}}^{Q_l}\bigl(2x^{k}_{\mathcal{C}_l}-z_l^{k}-\alpha D_l\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr)$. Then, plugging in $\mathrm{prox}_{\delta_{\mathcal{D}_l}}^{Q_l}(x_{\mathcal{C}_l})=P_{\mathcal{D}_l}^{Q_l}(x_{\mathcal{C}_l})=\mathbf{1}_{|\mathcal{C}_l|}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lx_{\mathcal{C}_l}}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}$, we obtain

\[
w_i^{k+1}=w_i^{k}-x_i^{k}+\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\bigl(2x^{k}_{\mathcal{C}_l}-z_l^{k}-\alpha D_l\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr).
\]

Additionally, for $k\geq 1$ we can transform $\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lz_l^{k+1}$ into

\[
\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lz_l^{k+1}=\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l(z_l^{k}-x_{\mathcal{C}_l}^{k})+\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\bigl(2x^{k}_{\mathcal{C}_l}-z_l^{k}-\alpha D_l\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr)=\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\bigl(x_{\mathcal{C}_l}^{k}-\alpha D_l\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\bigr). \tag{43}
\]

Thus, for $k=1,2,\ldots$, recalling the initialization $\mathbf{z}^{1}=\mathbf{D}(\mathbf{x}^{0}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{0}))$ and applying (43) to $w_i^{k+1}$ yields $w_i^{k+1}=w_i^{k}-x_i^{k}+\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\bigl(2x^{k}_{\mathcal{C}_l}-x^{k-1}_{\mathcal{C}_l}+\alpha D_l(\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\bigr)=w_i^{k}-x_i^{k}+\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_lD_l}{\mathbf{1}_{|\mathcal{C}_l|}^{\top}Q_l\mathbf{1}_{|\mathcal{C}_l|}}\bigl(2\mathbf{x}^{k}-\mathbf{x}^{k-1}+\alpha(\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\bigr)$. Hence, setting $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$ with ${\boldsymbol{\Phi}}$ in (18), we get $\mathbf{w}^{k+1}=\mathbf{w}^{k}-\mathbf{x}^{k}+\widetilde{\mathbf{W}}\bigl(2\mathbf{x}^{k}-\mathbf{x}^{k-1}+\alpha(\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\bigr)$. ∎

Remark 2.

The original NIDS paper [20] states that the NIDS is obtained from the PD3O in [36] (a primal-dual variant of the DYS). Meanwhile, in Proposition 5, we rely only on the primal part and obtain $\widetilde{\mathbf{W}}={\boldsymbol{\Phi}}$ as a fixed parameter.

5.2 Linear convergence of the NIDS with $\hat{g}_i(\cdot)\neq 0$

This subsection presents a linear convergence rate of the NIDS via the CD-DYS. We first present a new linear convergence result for the general DYS for Problem (5) (not limited to the CD-DYS) when $f$ is strongly convex and $g$ is the indicator function of a linear image space. Since indicator functions are neither smooth nor strongly convex, our result cannot be derived from prior linear convergence results such as [13, 12, 18, 37]. The proof is presented in Section 7.

Theorem 1.

Consider the variable metric DYS in (9) for $k=1,2,\ldots$ for Problem (5). Let $y^{*}$ and $z^{*}$ be the optimal values of $y^{k+1/2}$ (and $y^{k}$) and of $z^{k}$, respectively. Suppose that $M^{-1}\nabla f(y)$ is $L$-Lipschitz continuous, $f$ is $\mu$-strongly convex, $g(y)=\delta_{\mathrm{Im}(U)}(y)$ with a full column rank matrix $U$, and $h$ is proper, closed, and convex. Set a stepsize $\alpha\in(0,2\varepsilon/L)$, where $\varepsilon\in(0,1)$. Pick any starting point $y^{1/2}=y^{\mathrm{init}}$ and set $z^{1}=y^{1/2}-\alpha M^{-1}\nabla f(y^{1/2})$. Then it holds that

\[
\|z^{k+1}-z^{*}\|^{2}+\nu\|\zeta(y^{k+1/2})-\zeta(y^{*})\|^{2}\leq(1-C)\bigl(\|z^{k}-z^{*}\|^{2}+\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}\bigr),
\]

where $\nu\in\bigl(0,\frac{\beta}{2}(\alpha-\frac{\alpha^{2}L}{2\varepsilon})\bigr)$ with $\beta=\min\{\frac{1}{\alpha L},\mu\}$, $\zeta(y):=y-\alpha M^{-1}\nabla_{y}f(y)$, and $C=\min\{\frac{\kappa}{48},\frac{\kappa}{12\alpha},\frac{\nu}{\nu+9}\}$ with $\kappa:=\beta(\alpha-\frac{\alpha^{2}L}{2\varepsilon})-2\nu>0$.

Since $\mathcal{D}_l=\mathrm{Im}(\mathbf{1}_{|\mathcal{C}_l|})$ for $\mathcal{D}_l$ in (39), Theorem 1 yields the following linear rate (Theorem 2) for the NIDS with $\boldsymbol{\Phi}$. Although [1, 33] have addressed this case, our result below admits larger stepsizes owing to the arbitrariness of $\varepsilon\in(0,1)$.

Theorem 2.

Consider the same assumptions as in Proposition 5. Further, assume that $\mathcal{G}$ is connected and that for each $i\in\mathcal{N}$, $\hat{f}_i(\cdot)$ is $\hat{\mu}_i$-strongly convex. Set a stepsize $\alpha\in(0,2\varepsilon/\max_{i\in\mathcal{N}}\hat{L}_i)$, where $\varepsilon\in(0,1)$. Pick any starting point $\mathbf{x}^{0}$ and set $\mathbf{w}^{1}=\mathbf{x}^{0}-\alpha\nabla\hat{f}(\mathbf{x}^{0})$. Then, $\|\mathbf{x}^{k}-\mathbf{x}^{*}\|=O((1-C)^{k/2})$ holds, where $C$ is given as in Theorem 1 with $L=\max_{i\in\mathcal{N}}\hat{L}_i$ and $\mu=\min_{i\in\mathcal{N}}\hat{\mu}_i/\max_{i\in\mathcal{N}}|\mathcal{Q}_{\mathcal{G}}^{i}|$.

Proof.

This theorem follows from Proposition 5 and Theorem 1. Note that while the $\hat{\mu}_i$-strong convexity of $\hat{f}_i$ for $i=1,\ldots,n$ guarantees the convexity of $\hat{f}((\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})-(\min_{i}\hat{\mu}_i)\|(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}\|^{2}$ (i.e., $\hat{f}((\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})$ is strongly convex over $\mathrm{Im}(\mathbf{D})$; recall the formulation in (31)), we can treat $\sum_{i=1}^{n}\hat{f}_i(E_i(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})$ as a $\min_{i\in\mathcal{N}}\hat{\mu}_i/\max_{i\in\mathcal{N}}|\mathcal{Q}_{\mathcal{G}}^{i}|$-strongly convex function and directly apply the same arguments as in Theorem 1, because both $\mathbf{y}^{k+1/2}$ in Algorithm (36) and $\mathbf{y}^{*}$ always belong to $\mathrm{Im}(\mathbf{D})$. The linear convergence of $\{\mathbf{x}^{k}\}$ then follows from the first line of Algorithm (36). ∎

6 Numerical Experiments

This section presents two numerical examples: resource allocation and consensus optimization.

Clique-wise resource allocation

First, we consider a resource allocation problem for the network of $n=20$ agents in [30, Fig. 1] with four communities modeled by cliques, and suppose that a resource constraint is imposed on each clique. The clique parameters are given by $\mathcal{Q}_{\mathcal{G}}=\{1,2,3,4\}$ and $\mathcal{C}_1=\{1,2,\ldots,6\}$, $\mathcal{C}_2=\{5,6,\ldots,9\}$, $\mathcal{C}_3=\{8,9,\ldots,12\}$, $\mathcal{C}_4=\{9,10,13,14,\ldots,20\}$. Let $x_i\geq 0$ for $i\in\mathcal{N}$ be the amount of resources of agent $i$. For $i\in\mathcal{N}$ and $l\in\mathcal{Q}_{\mathcal{G}}$, we set $f_l(x_{\mathcal{C}_l})=\frac{a_l}{2}\|\frac{1}{|\mathcal{C}_l|}\mathbf{1}_{|\mathcal{C}_l|}^{\top}x_{\mathcal{C}_l}-b_l\|^{2}$ with $a_l=1$ and $b_l$ drawn from the uniform distribution on $[0,5]$; $g_l(x_{\mathcal{C}_l})=\delta_{\mathcal{D}_l}(x_{\mathcal{C}_l})$ with $\mathcal{D}_l=\{x_{\mathcal{C}_l}:\sum_{j\in\mathcal{C}_l}x_j=N_l\}$ and $(N_1,\ldots,N_4)=(5,10,5,15)$; $\hat{f}_i(x_i)=\frac{\hat{a}_i}{2}\|x_i-\hat{b}_i\|^{2}$ with $\hat{a}_i=1$ and $\hat{b}_i$ drawn from the uniform distribution on $[0,1]$; and $\hat{g}_i(x_i)=\delta_{\mathbb{R}_{+}\cup\{0\}}(x_i)$.
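In this instance, the per-clique and per-agent proximal steps of Algorithm 1 have simple closed forms; the brief sketch below (scalar resources, with values chosen only for illustration) spells them out.

```python
import numpy as np

def prox_resource(v, N_l):
    """Prox of g_l = delta_{D_l} with D_l = {x : sum_j x_j = N_l}: projection onto the hyperplane."""
    return v - (v.sum() - N_l) / v.size

def prox_nonneg(v):
    """Prox of ghat_i = delta_{R_+}, independent of the stepsize: projection onto x_i >= 0."""
    return np.maximum(v, 0.0)

def grad_fl(y, a_l, b_l):
    """Gradient of f_l(y) = (a_l / 2) * (mean(y) - b_l)^2."""
    return a_l * (y.mean() - b_l) * np.ones_like(y) / y.size

# e.g., projecting candidate allocations of a five-agent clique onto its budget N_l = 10
v = np.array([3.0, 1.0, 0.5, 2.0, 4.0])
print(prox_resource(v, 10.0), prox_resource(v, 10.0).sum())   # shifted allocation, sums to 10
```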

We compare the proposed method in (28) (or Algorithm 1) with $\alpha=1/\bigl(\max_{i\in\mathcal{N}}\hat{a}_i/\min_{i\in\mathcal{N}}|\mathcal{Q}_{\mathcal{G}}^{i}|+\max_{l\in\mathcal{Q}_{\mathcal{G}}}a_l\bigr)$ against Liang et al. [21] for two different stepsizes ($\tau=0.09,\,0.1$). The method in [21] is a distributed algorithm for globally coupled constraints using the gradient of the cost function. Notice that the dual decomposition technique cannot be directly used here since we have to minimize the clique-coupled terms $f_l,\,l\in\mathcal{Q}_{\mathcal{G}}$.

The simulation result is plotted in Fig. 4. The proposed CD-DYS converges much faster than Liang et al. [21] with $\tau=0.09$, while the latter with $\tau=0.1$ fails to converge. This difference is rooted in the fact that the CD-DYS exploits the community structure of the problem and admits larger stepsizes. Note that to obtain the largest stepsize for Liang et al., one has to know an upper bound on the norm of the dual variables.

Consensus optimization via NIDS

We next consider the 1\ell_{1} norm regularized consensus optimization problem (38) for n=50n=50 agents over an undirected graph 𝒢\mathcal{G}. Here, we set f^i(xi)=12Ψixibi2\hat{f}_{i}(x_{i})=\frac{1}{2}\|\Psi_{i}x_{i}-b_{i}\|^{2} and g^i(xi)=λixi1\hat{g}_{i}(x_{i})=\lambda_{i}\|x_{i}\|_{1} for i𝒩i\in\mathcal{N}. We take Ψi=I10+0.05Ωi10×10\Psi_{i}=I_{10}+0.05\Omega_{i}\in\mathbb{R}^{10\times 10}, bi10,i𝒩b_{i}\in\mathbb{R}^{10},\,i\in\mathcal{N}, and λ1==λn=λ=0.001\lambda_{1}=\cdots=\lambda_{n}=\lambda=0.001. Each entry of Ωi\Omega_{i} and bib_{i} is generated by the standard normal distribution.

We here conduct simulations of the NIDS in (42) for 𝐖~=𝚽\widetilde{\mathbf{W}}={\boldsymbol{\Phi}} with 𝒬𝒢=𝒬𝒢max\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}} and 𝒬𝒢=𝒬𝒢edge\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}, 𝐖~𝐋\widetilde{\mathbf{W}}_{\mathbf{L}} with ϵ=0.99/maxi𝒩(|𝒩i|1)\epsilon=0.99/\max_{i\in\mathcal{N}}(|\mathcal{N}_{i}|-1), 𝐖~mh\widetilde{\mathbf{W}}_{\mathrm{mh}}, and 𝐖~c:=Iαc(I𝐖𝐋)\widetilde{\mathbf{W}}_{c}:=I-\alpha c(I-\mathbf{W}_{\mathbf{L}}), where c=1/(1λmin(𝐖𝐋)α)c=1/(1-\lambda_{\min}(\mathbf{W}_{\mathbf{L}})\alpha). The last is introduced in [20] as a less conservative choice using global information λmin(𝐖𝐋)\lambda_{\mathrm{min}}(\mathbf{W}_{\mathbf{L}}). The stepsize is assigned as α=1/L^\alpha=1/\hat{L} with L^=maxi{λmax(|𝒬𝒢i|(ΨiΨi))}.\hat{L}=\max_{i}\{\lambda_{\mathrm{max}}(|\mathcal{Q}_{\mathcal{G}}^{i}|(\Psi_{i}^{\top}\Psi_{i}))\}.

The simulation result is reported in Fig. 5, which plots the relative objective residual |F(𝐱k)F(𝐱)|/F(𝐱)|F(\mathbf{x}^{k})-F(\mathbf{x}^{*})|/F(\mathbf{x}^{*}) where F(𝐱):=f^(𝐱)+λ𝐱1F(\mathbf{x}):=\hat{f}(\mathbf{x})+\lambda\|\mathbf{x}\|_{1}. It can be observed that the case of 𝚽{\boldsymbol{\Phi}} with 𝒬𝒢=𝒬𝒢max\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}} exhibits the fastest convergence in about 60 iterations with high accuracy, outperforming 𝐖~c\widetilde{\mathbf{W}}_{c} without using global information. While the case of 𝚽{\boldsymbol{\Phi}} with 𝒬𝒢=𝒬𝒢edge\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}} is slower than 𝐖~c\widetilde{\mathbf{W}}_{c}, it is still superior to 𝐖~mh\widetilde{\mathbf{W}}_{\mathrm{mh}} and 𝐖~𝐋\widetilde{\mathbf{W}}_{\mathbf{L}}, as implied in Proposition 4.

Figure 4: Plots of the relative objective residual under the method of Liang et al. [21] with τ=0.09, 0.1\tau=0.09,\,0.1 and the CD-DYS in Algorithm 1 (or (28)). Here, τ\tau represents the stepsize.
Figure 5: Plots of the relative objective residual under the NIDS in (42) for the five different choices of 𝐖~\widetilde{\mathbf{W}}.

7 Proof of Theorem 2

Our proof is based on the following trick (see Theorem D.6 in [13]): If a0,,ap,b0,,bp,c0,,cp(0,)a_{0},\cdots,a_{p},b_{0},\cdots,b_{p},c_{0},\cdots,c_{p}\in(0,\infty) for some p>0p>0, and

\|w^{k+1}-w^{*}\|^{2}+\sum_{i=0}^{p}a_{i}c_{i}\leq\|w^{k}-w^{*}\|^{2}\leq\sum_{i=0}^{p}a_{i}b_{i}, (44)

then \sum_{i=0}^{p}a_{i}b_{i}\leq\max_{i}(b_{i}/c_{i})\sum_{i=0}^{p}a_{i}c_{i}, so

\displaystyle\|w^{k+1}-w^{*}\|^{2}+\min_{i}(c_{i}/b_{i})\|w^{k}-w^{*}\|^{2}\leq\|w^{k+1}-w^{*}\|^{2}+\sum_{i=0}^{p}a_{i}c_{i}\leq\|w^{k}-w^{*}\|^{2}.

Thus,

wk+1w2(1minici/bi)wkw2,\displaystyle\!\!\!\!\!\!\|w^{k+1}-w^{*}\|^{2}\leq(1-\min_{i}c_{i}/b_{i})\|w^{k}-w^{*}\|^{2}, (45)

which provides a linear convergence rate. In the following, we derive an inequality of the form in (44) with wk=[(zk),ν1/2(ζ(yk1/2))]w^{k}=[(z^{k})^{\top},\nu^{1/2}(\zeta(y^{k-1/2}))^{\top}]^{\top}, w=[z,ν1/2(ζ(y))]w^{*}=[z^{*\top},\nu^{1/2}(\zeta(y^{*}))^{\top}]^{\top}, and some constants a0,,ap,b0,,bp,c0,,cp>0a_{0},\cdots,a_{p},b_{0},\cdots,b_{p},c_{0},\cdots,c_{p}>0.

We first prepare a key inclusion for establishing the desired rate. We suppose that zkz^{k}, yky^{k}, and yk+1/2y^{k+1/2} are not optimal without loss of generality. For g(y)=δIm(U)(y)g(y)=\delta_{\mathrm{Im}(U)}(y), we have proxαgM(y)=U(UMU)1UMy\mathrm{prox}_{\alpha g}^{M}(y)=U(U^{\top}MU)^{-1}U^{\top}My, which leads to

\displaystyle U^{\top}Mz^{k+1}=U^{\top}Mz^{k}-U^{\top}My^{k+1/2}+U^{\top}M(2y^{k+1/2}-z^{k}-\alpha M^{-1}\nabla_{y}f(y^{k+1/2}))
\displaystyle=U^{\top}M(y^{k+1/2}-\alpha M^{-1}\nabla_{y}f(y^{k+1/2}))

since UMproxαgM(y)=UMyU^{\top}M\mathrm{prox}_{\alpha g}^{M}(y)=U^{\top}My. Note that this holds for all k1k\geq 1 thanks to the initialization. Then we have

yk+1=(UMU)1UM(2yk+1/2yk1/2αM1yf(yk+1/2)+αM1yf(yk1/2)),\displaystyle y^{k+1}=(U^{\top}MU)^{-1}U^{\top}M(2y^{k+1/2}-y^{k-1/2}-\alpha M^{-1}\nabla_{y}f(y^{k+1/2})+\alpha M^{-1}\nabla_{y}f(y^{k-1/2})),

and thus we get ξk+1:=2yk+1/2yk1/2αM1yf(yk+1/2)+αM1yf(yk1/2)(I+αM1g)(yk+1)\xi^{k+1}:=2y^{k+1/2}-y^{k-1/2}-\alpha M^{-1}\nabla_{y}f(y^{k+1/2})+\alpha M^{-1}\nabla_{y}f(y^{k-1/2})\in(I+\alpha M^{-1}\partial g)(y^{k+1}) by [4, Prop. 16.44]. This inclusion allows us to remove the assumption of smoothness or strong convexity for gg or hh in [13, 18, 37, 12] because one can evaluate yk+1y^{k+1} only with ff.

We derive the lower side of the inequality in (44) as follows. By the smoothness and strong convexity of ff, for zkz2\|z^{k}-z^{*}\|^{2}, [13, Proposition D.4] provides

zkz2\displaystyle\|z^{k}-z^{*}\|^{2}\geq (1ε)zk+1z2+2αmax{μyk+1/2y2,1LM1yf(yk+1/2)M1yf(y)2}\displaystyle(1-\varepsilon)\|z^{k+1}-z^{*}\|^{2}+2\alpha\max\{\mu\|y^{k+1/2}-y^{*}\|^{2},\frac{1}{L}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}\}
α2εM1yf(yk+1/2)M1yf(y)2.\displaystyle-\frac{\alpha^{2}}{\varepsilon}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}.

Then, using θ=1/(2ε)\theta=1/(2-\varepsilon), wkw2\|w^{k}-w^{*}\|^{2} is lower bounded as

wkw2zk+1z2+νζ(yk1/2)ζ(y)2\displaystyle\|w^{k}-w^{*}\|^{2}\geq\|z^{k+1}-z^{*}\|^{2}+\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}
+(1θ1)zk+1zk2+2αmax{μyk+1/2y2,1LM1yf(yk+1/2)M1yf(y)2}\displaystyle+(\frac{1}{\theta}-1)\|z^{k+1}-z^{k}\|^{2}+2\alpha\max\{\mu\|y^{k+1/2}-y^{*}\|^{2},\frac{1}{L}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}\}
α2Lε×1LM1yf(yk+1/2)M1yf(y)2\displaystyle-\frac{\alpha^{2}L}{\varepsilon}\times\frac{1}{L}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}
\displaystyle\geq wk+1w2+c0zk+1zk2+c1yk+1/2y2+c2M1yf(yk+1/2)M1yf(y)2\displaystyle\|w^{k+1}-w^{*}\|^{2}+c_{0}\|z^{k+1}-z^{k}\|^{2}+c_{1}\|y^{k+1/2}-y^{*}\|^{2}+c_{2}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}
+c3ζ(yk1/2)ζ(y)2\displaystyle+c_{3}\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2} (46)

where c0=1/θ1c_{0}=1/\theta-1, c1=β(αα2L2ε)2νc_{1}=\beta(\alpha-\frac{\alpha^{2}L}{2\varepsilon})-2\nu, c2=αc1c_{2}=\alpha c_{1}, and c3=νc_{3}=\nu. Here, the term wk+1w2\|w^{k+1}-w^{*}\|^{2} in (46) comes from the definition of wkw^{k} and the relationship w2+w2w+w2/2\|w\|^{2}+\|w^{\prime}\|^{2}\geq\|w+w^{\prime}\|^{2}/2 obtained by Jensen’s inequality for the terms of yk+1/2y2\|y^{k+1/2}-y^{*}\|^{2} and M1yf(yk+1/2)M1yf(y)2\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}.

Next, utilizing a similar calculation to the proof of [13, Proposition D.4] (see the proof of the second upper bound) and the inclusion ξk+1(I+αM1g)(yk+1)\xi^{k+1}\in(I+\alpha M^{-1}\partial g)(y^{k+1}), we derive the upper bound of zkz2+νζ(yk1/2)ζ(y)2\|z^{k}-z^{*}\|^{2}+\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2} as follows. Here, we set ϵ=c0/C\epsilon=c_{0}/C, and let uAk+1M1g(yk+1)u_{A}^{k+1}\in M^{-1}\partial g(y^{k+1}) and uAM1g(y)u_{A}^{*}\in M^{-1}\partial g(y^{*}) satisfy y^{k+1}+\alpha u_{A}^{k+1}=\xi^{k+1} and y+αuA=ξy^{*}+\alpha u_{A}^{*}=\xi^{*}, respectively:

wkw2νζ(yk1/2)ζ(y)2+yk+1α(uAk+1+M1yf(yk+1/2))\displaystyle\|w^{k}-w^{*}\|^{2}\leq\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+\|y^{k+1}-\alpha(u_{A}^{k+1}+M^{-1}\nabla_{y}f(y^{k+1/2}))
\displaystyle+2(y^{k+1/2}-y^{k+1})-(y^{*}-\alpha u_{A}^{*}-\alpha M^{-1}\nabla_{y}f(y^{*}))\|^{2}
\displaystyle\leq νζ(yk1/2)ζ(y)2+(ξk+1ξ)α(M1yf(yk+1/2)M1yf(y))\displaystyle\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+\|-(\xi^{k+1}-\xi^{*})-\alpha(M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*}))
+2(yk+1/2y)2+ϵzk+1zk2\displaystyle+2(y^{k+1/2}-y^{*})\|^{2}+\epsilon\|z^{k+1}-z^{k}\|^{2}
\displaystyle\leq νζ(yk1/2)ζ(y)2+ϵzk+1zk2+3ξk+1ξ2+12(yk+1/2y)2\displaystyle\nu\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+\epsilon\|z^{k+1}-z^{k}\|^{2}+3\|\xi^{k+1}-\xi^{*}\|^{2}+12\|(y^{k+1/2}-y^{*})\|^{2}
+3α2M1yf(yk+1/2)M1yf(y)2\displaystyle+3\alpha^{2}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}
\displaystyle\leq (ν+9)ζ(yk1/2)ζ(y)2+ϵzk+1zk2\displaystyle(\nu+9)\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+\epsilon\|z^{k+1}-z^{k}\|^{2}
+48(yk+1/2y)2+12α2M1yf(yk+1/2)M1yf(y)2,\displaystyle+48\|(y^{k+1/2}-y^{*})\|^{2}+12\alpha^{2}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}, (47)

where the last line follows from ξk+1ξ23ζ(yk1/2)ζ(y)2+12(yk+1/2y)2+3α2M1yf(yk+1/2)M1yf(y)2\|\xi^{k+1}-\xi^{*}\|^{2}\leq 3\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}+12\|(y^{k+1/2}-y^{*})\|^{2}+3\alpha^{2}\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}.

Therefore, for a0=zk+1zk2a_{0}=\|z^{k+1}-z^{k}\|^{2}, a1=(yk+1/2y)2a_{1}=\|(y^{k+1/2}-y^{*})\|^{2}, a2=M1yf(yk+1/2)M1yf(y)2a_{2}=\|M^{-1}\nabla_{y}f(y^{k+1/2})-M^{-1}\nabla_{y}f(y^{*})\|^{2}, a3=ζ(yk1/2)ζ(y)2a_{3}=\|\zeta(y^{k-1/2})-\zeta(y^{*})\|^{2}, b0=ϵb_{0}=\epsilon, b1=48b_{1}=48, b2=12α2b_{2}=12\alpha^{2}, and b3=ν+9b_{3}=\nu+9, the linear rate follows from (45), (46), and (47).

8 Conclusion

This note addressed distributed optimization of clique-wise coupled problems via operator splitting. First, we defined the CD matrix and a new mixing matrix and analyzed their properties. Then, using the CD matrix, we presented the CD-DYS algorithm via the Davis-Yin splitting (DYS). Subsequently, its connection to consensus optimization methods such as the NIDS was also analyzed. Moreover, we presented a new linear convergence rate not only for the NIDS with non-smooth terms but also for the general DYS with a projection onto a subspace. Finally, we demonstrated the effectiveness of the proposed approach via numerical examples.

References

  • [1] Sulaiman A Alghunaim, Ernest K Ryu, Kun Yuan, and Ali H Sayed. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 66(6):2787–2794, 2020.
  • [2] Miguel F Anjos and Jean B Lasserre. Handbook on semidefinite, conic and polynomial optimization, volume 166. Springer Science & Business Media, 2011.
  • [3] Francisco J Aragón-Artacho and David Torregrosa-Belén. A direct proof of convergence of Davis–Yin splitting algorithm allowing larger stepsizes. Set-Valued and Variational Analysis, 30(3):1011–1029, 2022.
  • [4] Heinz H Bauschke and Patrick L Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2017.
  • [5] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • [6] Mattia Bianchi and Sergio Grammatico. The end: Estimation network design for games under partial-decision information. IEEE Transactions on Control of Network Systems, 2024.
  • [7] Béla Bollobás. Modern Graph Theory, volume 184. Springer Science & Business Media, 1998.
  • [8] Francesco Bullo. Lectures on Network Systems, volume 1. Kindle Direct Publishing Seattle, DC, USA, 2020.
  • [9] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
  • [10] Jianshu Chen and Ali H Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289–4305, 2012.
  • [11] Laurent Condat, Daichi Kitahara, Andrés Contreras, and Akira Hirabayashi. Proximal splitting algorithms for convex optimization: A tour of recent advances, with new twists. SIAM Review, 65(2):375–435, 2023.
  • [12] Laurent Condat and Peter Richtárik. Randprox: Primal-dual optimization algorithms with randomized proximal updates. arXiv preprint arXiv:2207.12891, 2022.
  • [13] Damek Davis and Wotao Yin. A three-operator splitting scheme and its optimization applications. Set-Valued and Variational Analysis, 25:829–858, 2017.
  • [14] Mituhiro Fukuda, Masakazu Kojima, Kazuo Murota, and Kazuhide Nakata. Exploiting sparsity in semidefinite programming via matrix completion i: General framework. SIAM Journal on optimization, 11(3):647–674, 2001.
  • [15] David Hallac, Jure Leskovec, and Stephen Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 387–396, 2015.
  • [16] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.
  • [17] Puya Latafat, Nikolaos M Freris, and Panagiotis Patrinos. A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization. IEEE Transactions on Automatic Control, 64(10):4050–4065, 2019.
  • [18] Jongmin Lee, Soheun Yi, and Ernest K Ryu. Convergence analyses of davis-yin splitting via scaled relative graphs. arXiv preprint arXiv:2207.04015, 2022.
  • [19] Huaqing Li, Enbing Su, Chengbo Wang, Jiawei Liu, Zuqing Zheng, Zheng Wang, and Dawen Xia. A primal-dual forward-backward splitting algorithm for distributed convex optimization. IEEE Transactions on Emerging Topics in Computational Intelligence, 2021.
  • [20] Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019.
  • [21] Shu Liang, George Yin, et al. Distributed smooth convex optimization with coupled constraints. IEEE Transactions on Automatic Control, 65(1):347–353, 2019.
  • [22] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  • [23] Yurii Evgen’evich Nesterov. A method of solving a convex programming problem with convergence rate O(1/k2)O\bigl{(}1/k^{2}\bigr{)}. In Doklady Akademii Nauk, volume 269, pages 543–547. Russian Academy of Sciences, 1983.
  • [24] Ernest K Ryu and Wotao Yin. Large-scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press, 2022.
  • [25] Kazunori Sakurama. Unified formulation of multiagent coordination with relative measurements. IEEE Transactions on Automatic Control, 66(9):4101–4116, 2020.
  • [26] Kazunori Sakurama and Toshiharu Sugie. Generalized coordination of multi-robot systems. Foundations and Trends® in Systems and Control, 9(1):1–170, 2022.
  • [27] Ali H Sayed. Diffusion adaptation over networks. In Academic Press Library in Signal Processing, volume 3, pages 323–453. Elsevier, 2014.
  • [28] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015.
  • [29] Lieven Vandenberghe and Martin S Andersen. Chordal graphs and semidefinite optimization. Foundations and Trends® in Optimization, 1(4):241–433, 2015.
  • [30] Yuto Watanabe and Kazunori Sakurama. Accelerated distributed projected gradient descent for convex optimization with clique-wise coupled constraints. In Proceedings of the 22nd IFAC World Congress, 2023.
  • [31] Yuto Watanabe and Kazunori Sakurama. Distributed optimization of clique-wise coupled problems. In 2023 62nd IEEE Conference on Decision and Control (CDC), pages 296–302. IEEE, 2023.
  • [32] Lin Xiao, Stephen Boyd, and Sanjay Lall. Distributed average consensus with time-varying metropolis weights. Automatica, 1:1–4, 2006.
  • [33] Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Distributed algorithms for composite optimization: Unified framework and convergence analysis. IEEE Transactions on Signal Processing, 69:3555–3570, 2021.
  • [34] Isao Yamada. The hybrid steepest descent method for the variational inequality problem over the intersection of fixed point sets of nonexpansive mappings. Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications, 8:473–504, 2001.
  • [35] Isao Yamada, Nobuhiko Ogura, and Nobuyasu Shirakawa. A numerically robust hybrid steepest descent method for the convexly constrained generalized inverse problems. Contemporary Mathematics, 313:269–305, 2002.
  • [36] Ming Yan. A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. Journal of Scientific Computing, 76:1698–1717, 2018.
  • [37] Soheun Yi and Ernest K Ryu. Convergence analyses of davis-yin splitting via scaled relative graphs ii: Convex optimization problems. arXiv preprint arXiv:2211.15604, 2022.
  • [38] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for distributed optimization and learning—part I: Algorithm development. IEEE Transactions on Signal Processing, 67(3):708–723, 2018.
  • [39] Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diffusion for distributed optimization and learning—part II: Convergence analysis. IEEE Transactions on Signal Processing, 67(3):724–739, 2018.
  • [40] Yu Zhang and Qiang Yang. An overview of multi-task learning. National Science Review, 5(1):30–43, 2018.
  • [41] Yang Zheng, Giovanni Fantuzzi, and Antonis Papachristodoulou. Chordal and factor-width decompositions for scalable semidefinite and polynomial optimization. Annual Reviews in Control, 52:243–279, 2021.
  • [42] Yang Zheng, Maryam Kamgarpour, Aivar Sootla, and Antonis Papachristodoulou. Distributed design for decentralized control using chordal decomposition and ADMM. IEEE Transactions on Control of Network Systems, 7(2):614–626, 2019.

Appendix A Application examples

In addition to the examples in Section 6, we here present various application examples of Problem (2).

Formation control

A formation control problem aims to steer the positions of robots to a desired configuration and has been actively investigated for the past two decades. For a multi-agent system over undirected graph 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}), the most basic formulation of this problem is

minimizexidi,i𝒩{i,j}xixjdij2,\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}x_{i}\in\mathbb{R}^{d_{i}},\,i\in\mathcal{N}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{\{i,j\}\in\mathcal{E}}\|x_{i}-x_{j}-d_{ij}\|^{2},\end{array} (49)

where xix_{i} is the position of agent ii, and dijd_{ij} is the desired relative position from xjx_{j} to xix_{i}. By assigning 𝒬𝒢=𝒬𝒢edge\mathcal{Q}_{\mathcal{G}}=\mathcal{Q}_{\mathcal{G}}^{\mathrm{edge}}, one can obtain the desired configuration via the proposed CD-DYS. Note that one can also deal with various constraints in the clique-wise coupled framework, e.g., an agent-wise constraint xiΩix_{i}\in\Omega_{i} and a pairwise distance constraint δ¯ijxixjδ¯ij\underline{\delta}_{ij}\leq\|x_{i}-x_{j}\|\leq\overline{\delta}_{ij}. In addition, the proposed framework allows us to achieve the desired formation in a distributed manner even for linear multi-agent systems, as shown in [17], and in the case where each agent has no access to the global coordinate frame and can only use information from relative measurements, as shown in [25, 26].
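
As a simple illustration (not the CD-DYS itself), the following NumPy sketch runs a plain distributed gradient step on the objective (49); the three-agent path graph, the desired offsets dijd_{ij}, and the stepsize are made up for illustration, and each agent's update only uses information from its neighbors.

import numpy as np

# Toy setup: a path graph of 3 agents in R^2 with hypothetical desired offsets.
edges = [(0, 1), (1, 2)]
d = {(0, 1): np.array([1.0, 0.0]), (1, 2): np.array([1.0, 0.0])}
x = np.random.randn(3, 2)                  # agent positions

def gradient(x):
    """Gradient of sum_{(i,j) in E} ||x_i - x_j - d_ij||^2."""
    g = np.zeros_like(x)
    for (i, j) in edges:
        r = x[i] - x[j] - d[(i, j)]        # residual on edge {i,j}
        g[i] += 2 * r
        g[j] -= 2 * r
    return g

alpha = 0.1
for _ in range(200):
    x = x - alpha * gradient(x)            # neighbor-only gradient step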

Network Lasso

The network lasso is an optimization-based machine-learning technique accounting for network structures. For a multi-agent system over graph 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}), a network lasso problem [15] is given as follows:

\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}x_{i}\in\mathbb{R}^{m},\,i\in\mathcal{N}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{n}\hat{f}_{i}(x_{i})+\lambda\sum_{\{i,j\}\in\mathcal{E}}w_{ij}\|x_{i}-x_{j}\|,\end{array} (51)

where λ>0\lambda>0 and wij>0w_{ij}>0 for {i,j}\{i,j\}\in\mathcal{E}. This problem can be seen as a special case of Problem (2). Owing to the second term in (51), neighboring nodes are more likely to form a cluster, i.e., to take close values. Applications of the Network Lasso include the estimation of home prices [15], where there is a spatial interdependence among house prices.
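
As an illustration of the edge-wise (clique-wise) structure, the sketch below evaluates the proximal operator of a single coupling term λwijxixj\lambda w_{ij}\|x_{i}-x_{j}\|, which is the building block a splitting method would use for (51). The closed form via block soft-thresholding of the difference is standard; the function name and parameters are ours.

import numpy as np

def prox_edge_coupling(a, b, lam_w, alpha=1.0):
    """Prox of h(u, v) = alpha * lam_w * ||u - v|| evaluated at the point (a, b).

    Via the change of variables s = (u + v)/2 and d = u - v, the average s is
    unchanged and the difference d is block-soft-thresholded with threshold
    2 * alpha * lam_w.
    """
    s = 0.5 * (a + b)
    diff = a - b
    norm = np.linalg.norm(diff)
    shrink = max(0.0, 1.0 - 2.0 * alpha * lam_w / norm) if norm > 0 else 0.0
    d = shrink * diff
    return s + 0.5 * d, s - 0.5 * d

The average of the pair is preserved while the difference is shrunk toward zero, which is exactly the clustering effect of the coupling term described above.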

Sparse semidefinite programming

Figure 6: Example of a system with three nodes in Example 1.

Semidefinite programming via chordal graph decomposition has been actively studied not only in optimization [29, 14] but also in control [41] as an efficient and scalable computation scheme exploiting the sparsity of matrices that naturally arises from underlying networked structures of problems. This type of problem can also be solved in a distributed manner based on the framework of clique-wise coupling.

Consider a multi-agent system over 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}) and the following standard semidefinite programming

minimizeyi,i𝒩,Zi=1nbiyi+δ𝕊+n(Z)subject toZ+j=1pAjyj=C,\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}y_{i}\in\mathbb{R},\,i\in\mathcal{N},\,Z\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{n}b_{i}^{\top}y_{i}+\delta_{\mathbb{S}^{n}_{+}}(Z)\\ {\mbox{subject to}}&\displaystyle Z+\sum_{j=1}^{p}A_{j}y_{j}=C,\end{array} (54)

where we consider that agent ii possesses yiy_{i} and iith column of ZZ. Here, 𝕊+n\mathbb{S}_{+}^{n} represents the set of n×nn\times n positive semidefinite matrices. This problem cannot be solved in a distributed manner by standard algorithms due to the undecomposable constraint Z𝕊+nZ\in\mathbb{S}^{n}_{+}. Nevertheless, if Z,A1,,Ap,CZ,\,A_{1},\ldots,A_{p},\,C have the sparsity with respect to 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}) and graph 𝒢\mathcal{G} is chordal, [29, 41] show that this problem can be equivalently transformed into the following decomposed form with smaller positive semidefinite constraints:

minimizeyi,i𝒩,Zl,l𝒬𝒢maxi=1nbiyi+l𝒬𝒢maxδ𝕊+|𝒞l|(Zl)subject toj=1pAjyj+l𝒬𝒢maxDlZlDl=C.\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}y_{i}\in\mathbb{R},\,i\in\mathcal{N},\,Z_{l},\,l\in\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{n}b_{i}^{\top}y_{i}+\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}}\delta_{\mathbb{S}_{+}^{|{\mathcal{C}_{l}}|}}(Z_{l})\\ {\mbox{subject to}}&\displaystyle\sum_{j=1}^{p}A_{j}y_{j}+\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{\mathrm{max}}}D_{l}^{\top}Z_{l}D_{l}=C.\end{array} (57)

Moreover, when \sum_{j=1}^{p}A_{j}y_{j} in (57) can be rewritten as

j=1pAjyj=MY+YN\sum_{j=1}^{p}A_{j}y_{j}=MY+YN (58)

with Y=diag(y1,,yn)Y=\text{diag}(y_{1},\ldots,y_{n}) and some matrices M,NM,N with the sparsity with respect to 𝒢=(𝒩,)\mathcal{G}=(\mathcal{N},\mathcal{E}), we can reformulate Problem (57) into a clique-wise coupled problem in (2) by introducing auxiliary variables. For example, for the system with n=3n=3 in Fig. 6, which is a chordal graph with maximal cliques 𝒞1={1,2}\mathcal{C}_{1}=\{1,2\} and 𝒞2={2,3}\mathcal{C}_{2}=\{2,3\}, Problem (57) with (58) reduces to

minimizey1,y2,y3,Z1,Z2𝕊2i=13biyi+δ𝕊+2(Z1)+δ𝕊+2(Z2)subject to[m11y1+n11y1m12y1+n12y20m21y2+n21y1m22y2+n22y2m23y2+n23y30m32y3+n32y2m33y3+n33y3]+[Z100000]+[00000Z2]=[c11c120c21c22c230c32c33],\displaystyle\begin{array}[]{cl}&\underset{\begin{subarray}{c}y_{1},y_{2},y_{3}\in\mathbb{R},\,Z_{1},Z_{2}\in\mathbb{S}^{2}\end{subarray}}{\mbox{minimize}}\quad\displaystyle\sum_{i=1}^{3}b_{i}^{\top}y_{i}+\delta_{\mathbb{S}_{+}^{2}}(Z_{1})+\delta_{\mathbb{S}_{+}^{2}}(Z_{2})\\ &{\mbox{subject to}}\\ &\begin{bmatrix}m_{11}y_{1}+n_{11}y_{1}&m_{12}y_{1}+n_{12}y_{2}&0\\ m_{21}y_{2}+n_{21}y_{1}&m_{22}y_{2}+n_{22}y_{2}&m_{23}y_{2}+n_{23}y_{3}\\ 0&m_{32}y_{3}+n_{32}y_{2}&m_{33}y_{3}+n_{33}y_{3}\end{bmatrix}+\begin{bmatrix}\text{\large{$Z_{1}$}}&\begin{matrix}0\\ 0\end{matrix}\\ \begin{matrix}0&0\end{matrix}&0\end{bmatrix}+\begin{bmatrix}0&\begin{matrix}0&0\end{matrix}\\ \begin{matrix}0\\ 0\end{matrix}&\text{\large{$Z_{2}$}}\end{bmatrix}=\begin{bmatrix}c_{11}&c_{12}&0\\ c_{21}&c_{22}&c_{23}\\ 0&c_{32}&c_{33}\end{bmatrix},\end{array} (62)

where mijm_{ij}, nijn_{ij}, cijc_{ij} are the i,ji,j entries of MM, NN, and CC, respectively. Hence, by decomposing the constraint in (62) into clique-wise coupled constraints by using the auxiliary variables z^2,11\hat{z}_{2,11} and z^1,22\hat{z}_{1,22} as

[m11y1+n11y1m12y1+n12y2m21y2+n21y1m22y2+n22y2]+Z1+[000z^2,11]=[c11c12c21c22]\displaystyle\begin{bmatrix}m_{11}y_{1}+n_{11}y_{1}&m_{12}y_{1}+n_{12}y_{2}\\ m_{21}y_{2}+n_{21}y_{1}&m_{22}y_{2}+n_{22}y_{2}\end{bmatrix}+Z_{1}+\begin{bmatrix}0&0\\ 0&\hat{z}_{2,11}\end{bmatrix}=\begin{bmatrix}c_{11}&c_{12}\\ c_{21}&c_{22}\\ \end{bmatrix} (63)
[m22y2+n22y2m23y2+n23y3m32y3+n32y2m33y3+n33y3]+Z2+[z^1,22000]=[c22c32c32c33]\displaystyle\begin{bmatrix}m_{22}y_{2}+n_{22}y_{2}&m_{23}y_{2}+n_{23}y_{3}\\ m_{32}y_{3}+n_{32}y_{2}&m_{33}y_{3}+n_{33}y_{3}\end{bmatrix}+Z_{2}+\begin{bmatrix}\hat{z}_{1,22}&0\\ 0&0\end{bmatrix}=\begin{bmatrix}c_{22}&c_{32}\\ c_{32}&c_{33}\\ \end{bmatrix} (64)
z1,22=z^1,22,z2,11=z^2,11,\displaystyle z_{1,22}=\hat{z}_{1,22},\quad z_{2,11}=\hat{z}_{2,11}, (65)

where zl,ijz_{l,ij} represents the i,ji,j entry of ZlZ_{l}, we can obtain an equivalent clique-wise coupled problem in the following with x1=[y1;vec([Z1]1)]x_{1}=[y_{1};\mathrm{vec}([Z_{1}]_{1})], x2=[y2;vec([Z1]2);vec([Z2]1);z^1,22;z^2,11]x_{2}=[y_{2};\mathrm{vec}([Z_{1}]_{2});\mathrm{vec}([Z_{2}]_{1});\hat{z}_{1,22};\hat{z}_{2,11}], x3=[y3;vec([Z2]2)]x_{3}=[y_{3};\mathrm{vec}([Z_{2}]_{2})], where vec([Zl]i)\mathrm{vec}([Z_{l}]_{i}) represents the iith column of the matrix ZlZ_{l}:

minimizey1,y2,y3Z1,Z2𝕊2i=13biyi+δ𝕊+2(Z1)+δ𝕊+2(Z2)subject to(63)–(65)\displaystyle\begin{array}[]{cll}&\underset{\begin{subarray}{c}y_{1},y_{2},y_{3}\in\mathbb{R}\\ Z_{1},Z_{2}\in\mathbb{S}^{2}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{3}b_{i}^{\top}y_{i}+\delta_{\mathbb{S}_{+}^{2}}(Z_{1})+\delta_{\mathbb{S}_{+}^{2}}(Z_{2})\\ &{\mbox{subject to}}&\text{\eqref{sdp_decomposed_constraints_1}--\eqref{sdp_decomposed_constraints_3}}\end{array} (68)

Semidefinite programming in the form of (57) with (58) arises in practical problems, e.g., distributed design of decentralized controllers for linear networked systems [42] and sensor network localization [2]. Note that one can extend the discussion above to higher dimensional vectors yiy_{i} and block-partitioned matrices ZZ, as shown in [41].
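
To make the decomposition concrete, the following NumPy sketch numerically checks, for random data with the sparsity pattern of the three-node example in Fig. 6, that choosing the auxiliary copies as in (65) makes the clique-wise constraints (63)–(64) reproduce the coupled constraint (62); all variable names and the synthetic data are ours.

import numpy as np

rng = np.random.default_rng(0)
# Random M, N with the sparsity of Fig. 6 (entries (1,3) and (3,1) vanish).
M = rng.standard_normal((3, 3)); M[0, 2] = M[2, 0] = 0.0
N = rng.standard_normal((3, 3)); N[0, 2] = N[2, 0] = 0.0
y = rng.standard_normal(3)

def entrywise_A(y):
    """The sparse matrix in y written entrywise as in (62)."""
    return np.array([
        [M[0, 0]*y[0] + N[0, 0]*y[0], M[0, 1]*y[0] + N[0, 1]*y[1], 0.0],
        [M[1, 0]*y[1] + N[1, 0]*y[0], M[1, 1]*y[1] + N[1, 1]*y[1], M[1, 2]*y[1] + N[1, 2]*y[2]],
        [0.0, M[2, 1]*y[2] + N[2, 1]*y[1], M[2, 2]*y[2] + N[2, 2]*y[2]],
    ])

Z1 = rng.standard_normal((2, 2)); Z1 = Z1 + Z1.T      # symmetric clique blocks
Z2 = rng.standard_normal((2, 2)); Z2 = Z2 + Z2.T
A = entrywise_A(y)
C = A.copy(); C[:2, :2] += Z1; C[1:, 1:] += Z2        # choose C so that (62) holds

# Auxiliary copies as in (65), then the clique-wise constraints (63)-(64).
z_hat_122, z_hat_211 = Z1[1, 1], Z2[0, 0]
lhs_63 = A[:2, :2] + Z1 + np.array([[0.0, 0.0], [0.0, z_hat_211]])
lhs_64 = A[1:, 1:] + Z2 + np.array([[z_hat_122, 0.0], [0.0, 0.0]])
assert np.allclose(lhs_63, C[:2, :2]) and np.allclose(lhs_64, C[1:, 1:])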

Approximating trace norm minimization problems

Trace norm minimization is a powerful technique in machine learning and computer vision that can obtain a low-rank matrix L^\hat{L} representing the underlying structure of the data. Its applications include the Robust PCA (RPCA) and multi-task learning problems.

For example, we can relax an RPCA problem to a clique-wise coupled problem as follows. Consider a data matrix Yd×nY\in\mathbb{R}^{d\times n}. Then, a standard form of RPCA is formulated as follows:

minimizeS^,L^S^1+θL^subject toS^+L^=Y.\displaystyle\underset{\hat{S},\hat{L}}{\mbox{minimize}}\;\;\displaystyle\|\hat{S}\|_{1}+\theta\|\hat{L}\|_{*}\;\;\displaystyle\mbox{subject to}\;\;\hat{S}+\hat{L}=Y. (69)

By solving this problem, we can decompose a data matrix YY into two components: a low-rank matrix L^\hat{L} representing the underlying structure of the data and a sparse matrix S^\hat{S} capturing the outliers or noise. Consider that for a multi-agent system with nn agents and Y=[Y1,,Yn]Y=[Y_{1},\ldots,Y_{n}], agent ii possesses the matrix YiY_{i}. Then the robust PCA problem with the clique-based relaxation is formulated as follows:

\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}\hat{S}_{i},\,i\in\mathcal{N}\\ \hat{L}_{l}\in\mathbb{R}^{640\times 40|{\mathcal{C}_{l}}|},\,l\in\mathcal{Q}_{\mathcal{G}}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i\in\mathcal{N}}\|\hat{S}_{i}\|_{1}+\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\theta_{l}\|\hat{L}_{l}\|_{*}\\ \displaystyle\mbox{subject to}&\hat{S}_{\mathcal{C}_{l}}+\hat{L}_{l}=Y_{\mathcal{C}_{l}}\quad\forall l\in\mathcal{Q}_{\mathcal{G}},\end{array} (72)

where S^𝒞l=[S^j1,,S^j|𝒞l|]\hat{S}_{\mathcal{C}_{l}}=[\hat{S}_{j_{1}},\ldots,\hat{S}_{j_{|{\mathcal{C}_{l}}|}}] and Y𝒞l=[Yj1,,Yj|𝒞l|]Y_{\mathcal{C}_{l}}=[Y_{j_{1}},\ldots,Y_{j_{|{\mathcal{C}_{l}}|}}] for 𝒞l={j1,,j|𝒞l|}{\mathcal{C}_{l}}=\{j_{1},\ldots,j_{|{\mathcal{C}_{l}}|}\}. Here, S^i\hat{S}_{i} and L^l\hat{L}_{l} correspond to xix_{i} and yly_{l} in Problem (2). Although Problem (72) involves a relaxation, one can still obtain a low-rank matrix by solving it.
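
The two proximal operators that a splitting method for (69) or (72) would alternate between have simple closed forms; a minimal sketch (with our own function names) is given below. In the clique-wise problem (72), these maps would be applied blockwise to S^𝒞l\hat{S}_{\mathcal{C}_{l}} and L^l\hat{L}_{l}.

import numpy as np

def soft_threshold(X, tau):
    """Prox of tau * ||X||_1: entrywise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Prox of tau * ||X||_*: shrink the singular values by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt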

Appendix B Proof of Lemma 2

(a) We prove the statement by contradiction. Assume that the CD matrix 𝐃\mathbf{D} does not have full column rank. Then, there exists a vector 𝐯=[v1,,vn]0\mathbf{v}=[v_{1}^{\top},\ldots,v_{n}^{\top}]^{\top}\neq 0 with vidiv_{i}\in\mathbb{R}^{d_{i}} such that 𝐃𝐯=0\mathbf{D}\mathbf{v}=0. This yields Dl𝐯=0D_{l}\mathbf{v}=0 for all l𝒬𝒢l\in\mathcal{Q}_{\mathcal{G}}. Hence, we obtain Ei𝐯=vi=0E_{i}\mathbf{v}=v_{i}=0 for all i𝒩i\in\mathcal{N} from Assumption 1. This contradicts 𝐯0\mathbf{v}\neq 0.

(b) For 𝐃\mathbf{D}, we have 𝐃𝐃=l𝒬𝒢DlDl=l𝒬𝒢j𝒞lEjEj=i=1nl𝒬𝒢iEiEi=i=1n|𝒬𝒢i|EiEi\mathbf{D}^{\top}\mathbf{D}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}D_{l}^{\top}D_{l}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in{\mathcal{C}_{l}}}E_{j}^{\top}E_{j}=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{i}^{\top}E_{i}=\sum_{i=1}^{n}|\mathcal{Q}_{\mathcal{G}}^{i}|E_{i}^{\top}E_{i} from Definition 1. Here, EiEi=blk-diag(Od1×d1,,Idi,,Odn×dn)E_{i}^{\top}E_{i}=\text{blk-diag}(O_{d_{1}\times d_{1}},\ldots,I_{d_{i}},\ldots,O_{d_{n}\times d_{n}}) holds. Therefore, we obtain 𝐃𝐃=blk-diag(|𝒬𝒢1|Id1,,|𝒬𝒢n|Idn)\mathbf{D}^{\top}\mathbf{D}=\text{blk-diag}(|\mathcal{Q}_{\mathcal{G}}^{1}|I_{d_{1}},\ldots,|\mathcal{Q}_{\mathcal{G}}^{n}|I_{d_{n}}). 𝐃𝐃O\mathbf{D}^{\top}\mathbf{D}\succ O follows from Assumption 1.

(c) It holds that 𝐃𝐲=l𝒬𝒢Dlyl=l𝒬𝒢j𝒞lEj(El,jyl)=i=1nl𝒬𝒢iEiEl,iyl=i=1nEi(l𝒬𝒢iEl,iyl)\mathbf{D}^{\top}\mathbf{y}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}D_{l}^{\top}y_{l}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in{\mathcal{C}_{l}}}E_{j}^{\top}(E_{l,j}y_{l})=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{i}^{\top}E_{l,i}y_{l}=\sum_{i=1}^{n}E_{i}^{\top}(\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}y_{l}). Hence, we obtain (12).
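
A small numerical illustration of Lemma 2b: for a toy cover with two cliques and scalar agents, the CD matrix built from the selection matrices EiE_{i} satisfies 𝐃𝐃=blk-diag(|𝒬𝒢1|Id1,,|𝒬𝒢n|Idn)\mathbf{D}^{\top}\mathbf{D}=\text{blk-diag}(|\mathcal{Q}_{\mathcal{G}}^{1}|I_{d_{1}},\ldots,|\mathcal{Q}_{\mathcal{G}}^{n}|I_{d_{n}}). The clique cover and all variable names below are hypothetical.

import numpy as np

# Toy example: n = 4 scalar agents (d_i = 1), cliques C_1 = {1,2,3}, C_2 = {3,4}.
n = 4
cliques = [[0, 1, 2], [2, 3]]                              # zero-based indices
E = [np.eye(n)[[i], :] for i in range(n)]                  # E_i selects agent i's block
D_blocks = [np.vstack([E[j] for j in C]) for C in cliques]  # D_l = [E_j]_{j in C_l}
D = np.vstack(D_blocks)                                    # CD matrix D = [D_l]_l

counts = np.array([sum(i in C for C in cliques) for i in range(n)])  # |Q_G^i|
assert np.allclose(D.T @ D, np.diag(counts))               # Lemma 2b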

Appendix C Proof of Proposition 1

(a) For 𝐳Im(𝐃)\mathbf{z}\in\mathrm{Im}(\mathbf{D}), there exists some 𝐱d\mathbf{x}\in\mathbb{R}^{d} such that 𝐳=𝐃𝐱\mathbf{z}=\mathbf{D}\mathbf{x}. Then, we obtain

proxαG(𝐲)=𝐃argmin𝐱d(12α𝐲𝐃𝐱2+i=1ng^i(Ei𝐱))\displaystyle\mathrm{prox}_{\alpha G}(\mathbf{y})=\mathbf{D}\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{d}}(\frac{1}{2\alpha}\|\mathbf{y}-\mathbf{D}\mathbf{x}\|^{2}+\sum_{i=1}^{n}\hat{g}_{i}(E_{i}\mathbf{x}))
=𝐃argmin𝐱d(i=1n(l𝒬𝒢i12αEl,iylxi2+g^i(xi)))\displaystyle=\mathbf{D}\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{d}}(\sum_{i=1}^{n}(\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{2\alpha}\|E_{l,i}y_{l}-x_{i}\|^{2}+\hat{g}_{i}(x_{i})))
=𝐃argmin𝐱d(i=1n(|𝒬𝒢i|2αl𝒬𝒢i1|𝒬𝒢i|El,iylxi2+g^i(xi))).\displaystyle=\mathbf{D}\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{d}}(\sum_{i=1}^{n}(\frac{|\mathcal{Q}_{\mathcal{G}}^{i}|}{2\alpha}\|\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}E_{l,i}y_{l}-x_{i}\|^{2}+\hat{g}_{i}(x_{i}))).

Therefore, we obtain (15). Note that the last line can be verified by considering the optimality condition.

(b) This can be proved in the same way as Proposition 1a with an easy modification from the definition of 𝐐\mathbf{Q}.

(c) By the chain rule, we have 𝐲f^i(Ei(𝐃𝐃)1𝐃𝐲)=𝐃(𝐃𝐃)1Eixif^i(Ei(𝐃𝐃)1𝐃𝐲)\frac{\partial}{\partial\mathbf{y}}\hat{f}_{i}(E_{i}(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y})=\mathbf{D}(\mathbf{D}^{\top}\mathbf{D})^{-1}E_{i}^{\top}\nabla_{x_{i}}\hat{f}_{i}(E_{i}(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}), which gives the claim.

Appendix D Proof of Proposition 2

(a) For 𝐐\mathbf{Q}, we obtain 𝐐𝐃=[QlDl]l𝒬𝒢\mathbf{Q}\mathbf{D}=[Q_{l}D_{l}]_{l\in\mathcal{Q}_{\mathcal{G}}}. Then,

𝐃𝐐𝐃=l𝒬𝒢DlQlDl=l𝒬𝒢j𝒞l1|𝒬𝒢j|EjEj.\mathbf{D}^{\top}\mathbf{Q}\mathbf{D}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}D_{l}^{\top}Q_{l}D_{l}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in{\mathcal{C}_{l}}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}E_{j}^{\top}E_{j}.

Thus, following the same calculation as the proof of Lemma 2b gives 𝐃𝐐𝐃=Id\mathbf{D}^{\top}\mathbf{Q}\mathbf{D}=I_{d}.

(b) For any 𝐲=[yl]l𝒬𝒢d^\mathbf{y}=[y_{l}]_{l\in\mathcal{Q}_{\mathcal{G}}}\in\mathbb{R}^{\hat{d}}, it holds that

𝐃𝐐𝐲=l𝒬𝒢DlQlyl=l𝒬𝒢j𝒞l1|𝒬𝒢j|EjEl,jyl.\mathbf{D}^{\top}\mathbf{Q}\mathbf{y}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}D_{l}^{\top}Q_{l}y_{l}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\sum_{j\in{\mathcal{C}_{l}}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{j}|}E_{j}^{\top}E_{l,j}y_{l}.

Hence, reorganizing this and using the proof of Lemma 2c yield

𝐃𝐐𝐲=i=1n1|𝒬𝒢i|Eil𝒬𝒢iEl,iyl=blk-diag([1|𝒬𝒢i|Idi]i𝒩)𝐃𝐲.\mathbf{D}^{\top}\mathbf{Q}\mathbf{y}=\sum_{i=1}^{n}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}E_{i}^{\top}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}y_{l}=\text{blk-diag}([\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}I_{d_{i}}]_{i\in\mathcal{N}})\mathbf{D}^{\top}\mathbf{y}.

Therefore, we obtain 𝐃𝐐=(𝐃𝐃)1𝐃\mathbf{D}^{\top}\mathbf{Q}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top} from Lemma 2b. The latter equation is also proved in the same way.

(c) From Proposition 2b and Assumption 1, it holds that 𝐃=(𝐃𝐃)1𝐃𝐐1\mathbf{D}^{\top}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{Q}^{-1}. For the transpose of this matrix, multiplying 𝐃𝐃\mathbf{D}^{\top}\mathbf{D} from the right side gives 𝐐1𝐃=𝐃(𝐃𝐃)\mathbf{Q}^{-1}\mathbf{D}=\mathbf{D}(\mathbf{D}^{\top}\mathbf{D}). The latter equation is also proved in the same manner.
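
Continuing the toy example used for Lemma 2, the following sketch checks Proposition 2a-b numerically, with QlQ_{l} built from the weights 1/|𝒬𝒢j|1/|\mathcal{Q}_{\mathcal{G}}^{j}| for j𝒞lj\in{\mathcal{C}_{l}}; the cover and names are again illustrative.

import numpy as np

# Same toy cover: 4 scalar agents, cliques C_1 = {1,2,3}, C_2 = {3,4}.
n, cliques = 4, [[0, 1, 2], [2, 3]]
E = [np.eye(n)[[i], :] for i in range(n)]
D = np.vstack([np.vstack([E[j] for j in C]) for C in cliques])
counts = np.array([sum(i in C for C in cliques) for i in range(n)])   # |Q_G^i|

# Q = blk-diag(Q_l) with Q_l = diag(1/|Q_G^j|, j in C_l).
Q_blocks = [np.diag([1.0 / counts[j] for j in C]) for C in cliques]
Q = np.zeros((D.shape[0], D.shape[0]))
r = 0
for B in Q_blocks:
    Q[r:r + B.shape[0], r:r + B.shape[0]] = B
    r += B.shape[0]

assert np.allclose(D.T @ Q @ D, np.eye(n))                    # Proposition 2a
assert np.allclose(D.T @ Q, np.linalg.inv(D.T @ D) @ D.T)     # Proposition 2b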

Appendix E Connection of the CD-DYS to other distributed optimization algorithms

We here present a comprehensive analysis for the diagram in Fig. 3 by deriving the CPGD algorithm of [30] and Appendix F.

Exact diffusion and Diffusion algorithms

Over undirected graphs, the exact diffusion algorithm is just a special case of the NIDS. In the case of g^i=0\hat{g}_{i}=0 for all i𝒩i\in\mathcal{N}, the NIDS reduces to the Exact diffusion [38, 39], which is given as follows:

𝐱k+1\displaystyle\!\!\!\!\mathbf{x}^{k+1} =𝐖~(2𝐱k𝐱k1+α(𝐱f^(𝐱k1)𝐱f^(𝐱k))).\displaystyle=\widetilde{\mathbf{W}}(2\mathbf{x}^{k}-\mathbf{x}^{k-1}+\alpha(\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k-1})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))).\!\!\!\! (73)

This can be rewritten as follows:

𝐯k+1=𝐱kα𝐱f^(𝐱k)𝐱k+1=𝐖~(𝐯k+1+𝐱k𝐯k).\displaystyle\begin{array}[]{lll}\mathbf{v}^{k+1}=&\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})\\ \mathbf{x}^{k+1}=&\widetilde{\mathbf{W}}(\mathbf{v}^{k+1}+\mathbf{x}^{k}-\mathbf{v}^{k}).\end{array} (76)

These updates converge exactly to an optimal solution under mild conditions. Note that the Exact diffusion is also valid for directed networks and non-doubly stochastic 𝐖\mathbf{W}. For details, see [38, 39].

The diffusion algorithm [27, 10] is an early distributed optimization algorithm, given as

𝐱k+1=𝐖~(𝐱kα𝐱f^(𝐱k)).\displaystyle\begin{array}[]{lll}\mathbf{x}^{k+1}=&\widetilde{\mathbf{W}}(\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})).\end{array} (78)

This algorithm is obtained from the NIDS with g^i=0,i𝒩\hat{g}_{i}=0,\,i\in\mathcal{N} (i.e., the Exact diffusion) by approximating 𝐱k𝐯k0\mathbf{x}^{k}-\mathbf{v}^{k}\approx 0 in the second line of (76). Notice that the conditions on 𝐖{\mathbf{W}} for (78) are not equivalent to those for (73) and (76) (see [27, 10, 24, 38, 39]). Although its convergence is inexact for a constant stepsize α\alpha, its simple structure allows us to easily apply it to stochastic and online setups.
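
A schematic NumPy comparison of (76) and (78) on a toy consensus problem is given below; the quadratic local costs, the mixing matrix, the stepsize, and the number of iterations are all made up for illustration, and the comments on the limits reflect the exact-versus-inexact behavior discussed above.

import numpy as np

# Toy problem: f_hat_i(x_i) = 0.5 * (x_i - b_i)^2 for 3 agents; W is a toy mixing matrix.
b = np.array([1.0, 2.0, 6.0])
grad = lambda x: x - b                                          # separable gradient
W = np.array([[.5, .25, .25], [.25, .5, .25], [.25, .25, .5]])  # symmetric, doubly stochastic
alpha = 0.5

x_ed, x_df = b.copy(), b.copy()
v_prev = x_ed.copy()                       # standard initialization for (76)
for _ in range(200):
    # Exact diffusion (76): adapt, then combine with the correction x^k - v^k.
    v = x_ed - alpha * grad(x_ed)
    x_ed = W @ (v + x_ed - v_prev)
    v_prev = v
    # Diffusion (78): adapt, then combine without the correction term.
    x_df = W @ (x_df - alpha * grad(x_df))

print(x_ed)   # approaches the consensus minimizer mean(b) = 3.0 in every entry
print(x_df)   # approaches a biased point (inexact for a fixed stepsize)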

CPGD as a generalization of the diffusion algorithm

Invoking the relationship between NIDS/Exact diffusion and diffusion algorithms, we derive a diffusion-like algorithm from the variable metric CD-DYS in (36) for

minimizexidi,i𝒩i=1nf^i(xi)+l𝒬𝒢δ𝒟l(x𝒞l),\displaystyle\begin{array}[]{cl}\underset{\begin{subarray}{c}x_{i}\in\mathbb{R}^{d_{i}},\,i\in\mathcal{N}\end{subarray}}{\mbox{minimize}}&\displaystyle\sum_{i=1}^{n}\hat{f}_{i}(x_{i})+\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\delta_{\mathcal{D}_{l}}(x_{\mathcal{C}_{l}}),\end{array} (80)

where 𝒟l\mathcal{D}_{l} is a closed convex set and not limited to (39). The derived algorithm will be formalized as the clique-based projected gradient descent (CPGD) in Appendix F.

We derive the diffusion-like algorithm as follows. From g^i=0\hat{g}_{i}=0, we have 𝐱k=𝐱k=(𝐃𝐃)1𝐃𝐳k\mathbf{x}^{k}=\mathbf{x}^{k-}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k} and (𝐃𝐃)1𝐃𝐲k+1/2=𝐱k(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}^{k+1/2}=\mathbf{x}^{k}. Accordingly, the variable metric CD-DYS in (36) reduces to

𝐱k=(𝐃𝐃)1𝐃𝐳k𝐲k+1=PΠl𝒬𝒢𝒟l𝐐(2𝐃𝐱k𝐳kα𝐃𝐱f^(𝐱k))𝐳k+1=𝐳k+𝐲k+1𝐃𝐱k.\displaystyle\begin{array}[]{lll}&\mathbf{x}^{k}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k}\\ &\mathbf{y}^{k+1}=P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(2\mathbf{D}\mathbf{x}^{k}-\mathbf{z}^{k}-\alpha\mathbf{D}\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\\ &\mathbf{z}^{k+1}=\mathbf{z}^{k}+\mathbf{y}^{k+1}-\mathbf{D}\mathbf{x}^{k}.\end{array}

By using 𝐯k+1\mathbf{v}^{k+1} of the form in (76), we get

𝐯k+1=𝐱kα𝐱f^(𝐱k)\displaystyle\mathbf{v}^{k+1}=\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}) (81)
𝐱k+1=(𝐃𝐃)1𝐃PΠl𝒬𝒢𝒟l𝐐(𝐃𝐯k+1+𝐃𝐱k𝐳k)\displaystyle\mathbf{x}^{k+1}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\mathbf{D}\mathbf{v}^{k+1}+\mathbf{D}\mathbf{x}^{k}-\mathbf{z}^{k})

with 𝐳k\mathbf{z}^{k} from 𝐱k+1=(𝐃𝐃)1𝐃𝐳k+1=(𝐃𝐃)1𝐃(𝐳k+𝐲k+1)𝐱k=(𝐃𝐃)1𝐃𝐲k+1\mathbf{x}^{k+1}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{z}^{k+1}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}(\mathbf{z}^{k}+\mathbf{y}^{k+1})-\mathbf{x}^{k}=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}\mathbf{y}^{k+1}. In consensus optimization, it can be observed from the previous subsection that PΠl𝒬𝒢𝒟l𝐐()P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\cdot) boils down to a linear map and 𝐳k\mathbf{z}^{k} satisfies PΠl𝒬𝒢𝒟l𝐐(𝐳k)=PΠl𝒬𝒢𝒟l𝐐(𝐃𝐯k)P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\mathbf{z}^{k})=P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\mathbf{D}\mathbf{v}^{k}) because we have

\displaystyle P_{\mathcal{D}_{l}}^{Q_{l}}(z_{l}^{k+1})=P_{\mathcal{D}_{l}}^{Q_{l}}(x^{k}_{\mathcal{C}_{l}}-\alpha D_{l}\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))=P_{\mathcal{D}_{l}}^{Q_{l}}(D_{l}\mathbf{v}^{k+1})

for 𝒟l\mathcal{D}_{l} in (39), as shown in (5.1). Therefore, recalling that the diffusion algorithm (78) can be viewed as (76) with 𝐱k𝐯k0\mathbf{x}^{k}-\mathbf{v}^{k}\approx 0, we can obtain the following diffusion-like algorithm (CPGD) from (81) by the similar approximation \mathbf{D}\mathbf{x}^{k}-\mathbf{z}^{k}\approx 0 in the second line of (81):

𝐱k+1=T(𝐱kα𝐱f^(𝐱k))\displaystyle\begin{array}[]{lll}&\mathbf{x}^{k+1}=T(\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k}))\end{array} (83)

with T:ddT:\mathbb{R}^{d}\to\mathbb{R}^{d} defined as T(𝐱)=(𝐃𝐃)1𝐃PΠl𝒬𝒢𝒟l𝐐(𝐃𝐱)T(\mathbf{x})=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}P_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}^{\mathbf{Q}}(\mathbf{D}\mathbf{x}). Note that the operator TT, which will be defined as the clique-based projection in Appendix F, is equal to the doubly stochastic matrix 𝚽{\boldsymbol{\Phi}} in Proposition 3 for 𝒟l\mathcal{D}_{l} in (39).

Appendix F Clique-based projected gradient descent (CPGD) algorithm

We here formalize the generalization of the diffusion algorithm (CPGD) in (83). We provide a detailed convergence analysis, which guarantees exact convergence under diminishing step sizes and an inexact convergence rate for fixed ones. Moreover, we provide Nesterov’s acceleration and an improved convergence rate.

This section highlights the well-behavedness of clique-wise coupling, which enables theoretical and algorithmic properties similar to those of consensus optimization (the diffusion algorithm).

F.1 Clique-based Projected Gradient Descent (CPGD)

Consider Problem (80) with closed convex sets 𝒟ldl,l𝒬𝒢\mathcal{D}_{l}\subset\mathbb{R}^{d^{l}},\,l\in\mathcal{Q}_{\mathcal{G}}. We suppose Assumptions 1–3.

To this problem, the CPGD is given as follows:

𝐱k+1=Tp(𝐱kλk𝐱f^(𝐱k)),\displaystyle\mathbf{x}^{k+1}=T^{p}(\mathbf{x}^{k}-\lambda^{k}\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}^{k})), (84)

where T:ddT:\mathbb{R}^{d}\to\mathbb{R}^{d} is the clique-based projection for

𝒟=l𝒬𝒢{𝐱d:x𝒞l𝒟l},\mathcal{D}=\bigcap_{l\in\mathcal{Q}_{\mathcal{G}}}\{\mathbf{x}\in\mathbb{R}^{d}:x_{\mathcal{C}_{l}}\in\mathcal{D}_{l}\}, (85)

Tp=TTTpT^{p}=\underbrace{T\circ T\circ\cdots\circ T}_{p}, f^(𝐱)=i=1nf^i(xi)\hat{f}(\mathbf{x})=\sum_{i=1}^{n}\hat{f}_{i}(x_{i}), and λk\lambda^{k} is a step size. The clique-based projection TT is defined as follows.

Definition 2.

Suppose Assumption 1. For a non-empty closed convex set 𝒟\mathcal{D} in (85), a graph 𝒢\mathcal{G}, and its cliques 𝒞l,l𝒬𝒢\mathcal{C}_{l},\,l\in\mathcal{Q}_{\mathcal{G}}, the clique-based projection T:ddT:\mathbb{R}^{d}\to\mathbb{R}^{d} of xdx\in\mathbb{R}^{d} onto 𝒟\mathcal{D} is defined as T(𝐱)=[T1(x𝒩1),,Tn(x𝒩n)]T(\mathbf{x})=[T_{1}(x_{\mathcal{N}_{1}})^{\top},\ldots,T_{n}(x_{\mathcal{N}_{n}})^{\top}]^{\top} with

Ti(x𝒩i)=1|𝒬𝒢i|l𝒬𝒢iEl,iP𝒟lQl(x𝒞l)T_{i}(x_{\mathcal{N}_{i}})=\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}}) (86)

for each i𝒩i\in\mathcal{N}.

The clique-based projection can be represented as T(𝐱)=(𝐃𝐃)1𝐃PΠl𝒬𝒢𝒟l𝐐(𝐃𝐱).T(\mathbf{x})=(\mathbf{D}^{\top}\mathbf{D})^{-1}\mathbf{D}^{\top}P^{\mathbf{Q}}_{\Pi_{l\in\mathcal{Q}_{\mathcal{G}}}\mathcal{D}_{l}}(\mathbf{D}\mathbf{x}).
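
For the consensus sets 𝒟l\mathcal{D}_{l} in (39) (the setting of Proposition 3), the QlQ_{l}-weighted clique projection is a weighted average, and TT can be sketched as follows for scalar agents. The clique cover and function names below are our own illustration.

import numpy as np

cliques = [[0, 1, 2], [2, 3]]                                    # toy cover
n = 4
counts = np.array([sum(i in C for C in cliques) for i in range(n)])  # |Q_G^i|

def proj_consensus_Ql(x_C, idx):
    """Q_l-weighted projection of x_{C_l} onto the consensus line span(1)."""
    w = 1.0 / counts[idx]                    # weights 1/|Q_G^j|, j in C_l
    c = np.dot(w, x_C) / np.sum(w)
    return np.full_like(x_C, c)

def T(x):
    """T_i(x_{N_i}) = (1/|Q_G^i|) * sum over cliques containing i of the i-th
    entry of the clique projection (Definition 2)."""
    out = np.zeros(n)
    for C in cliques:
        idx = np.array(C)
        out[idx] += proj_consensus_Ql(x[idx], idx) / counts[idx]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
print(T(x))        # averages within cliques; repeated application converges to D (Prop. 6d)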

The clique-based projection TT has many favorable operator-theoretic properties as follows.

Proposition 6.

Suppose Assumption 1. For the closed convex set 𝒟\mathcal{D} in (85) and clique-based projection TT in Definition 2 onto 𝒟\mathcal{D}, the following statements hold:

  • (a)

    The operator TT is firmly nonexpansive, i.e., T(𝐱)T(𝐰)2(𝐱𝐰)(T(𝐱)T(𝐰))\|T(\mathbf{x})-T(\mathbf{w})\|^{2}\leq(\mathbf{x}-\mathbf{w})^{\top}(T(\mathbf{x})-T(\mathbf{w})) holds for any 𝐱,𝐰d\mathbf{x},\mathbf{w}\in\mathbb{R}^{d}.

  • (b)

    The fixed point set of TT satisfies Fix(T)=𝒟\mathrm{Fix}(T)=\mathcal{D}.

  • (c)

    For any 𝐱d𝒟\mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D} and any 𝐰𝒟\mathbf{w}\in\mathcal{D}, T(𝐱)𝐰<𝐱𝐰\|T(\mathbf{x})-\mathbf{w}\|<\|\mathbf{x}-\mathbf{w}\| holds.

  • (d)

    For any 𝐱d\mathbf{x}\in\mathbb{R}^{d}, T(𝐱)=limpTp(𝐱)𝒟T^{\infty}(\mathbf{x})=\lim_{p\to\infty}T^{p}(\mathbf{x})\in\mathcal{D} holds.

Proof.

See Appendix F.3. ∎

The convergence properties of the CPGD for various step sizes are presented as follows. Note that, like the DGD and diffusion methods for consensus optimization, the CPGD with a fixed step size does not converge exactly to an optimal solution.

Theorem 3.

Consider Problem (80) with closed convex sets 𝒟l,l𝒬𝒢\mathcal{D}_{l},\,l\in\mathcal{Q}_{\mathcal{G}}. Consider the CPGD algorithm in (84). Suppose Assumptions 1–3.

  • (a)

    Let a positive sequence {λk}\{\lambda^{k}\} satisfy limkλk=0\lim_{k\to\infty}\lambda^{k}=0, k=1λk=\sum_{k=1}^{\infty}\lambda^{k}=\infty, and k=1(λk)2<\sum_{k=1}^{\infty}(\lambda^{k})^{2}<\infty (for example, λk=1/k\lambda^{k}=1/k satisfies these conditions). Assume that 𝒟\mathcal{D} is bounded. Then, for any 𝐱0d\mathbf{x}^{0}\in\mathbb{R}^{d} and any pp\in\mathbb{N}, 𝐱k\mathbf{x}^{k} converges to an optimal solution 𝐱argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}).

  • (b)

    Let a positive sequence {λk}\{\lambda^{k}\} satisfy limkλk=0\lim_{k\to\infty}\lambda^{k}=0, k=1λk=\sum_{k=1}^{\infty}\lambda^{k}=\infty, and k=1|λkλk+1|<\sum_{k=1}^{\infty}|\lambda^{k}-\lambda^{k+1}|<\infty (for example, λk=1/k\lambda^{k}=1/k and λk=1/k\lambda^{k}=1/\sqrt{k} satisfy these conditions). Additionally, assume that f^(𝐱)\hat{f}(\mathbf{x}) is strongly convex. Then 𝐱k\mathbf{x}^{k} converges to the unique optimal solution 𝐱=argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}=\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}) for any 𝐱0d\mathbf{x}^{0}\in\mathbb{R}^{d} and any pp\in\mathbb{N}.

  • (c)

    Let λk=α(0,1/L^]\lambda^{k}=\alpha\in(0,1/\hat{L}] for any kk\in\mathbb{N}. Let J:dJ:\mathbb{R}^{d}\to\mathbb{R} be

    J(𝐱)=f^(𝐱)+V(𝐱)/αJ(\mathbf{x})=\hat{f}(\mathbf{x})+V(\mathbf{x})/\alpha (87)

    with

    V(𝐱)=12l𝒬𝒢x𝒞lP𝒟lQl(x𝒞l)Ql2.V(\mathbf{x})=\frac{1}{2}\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|x_{\mathcal{C}_{l}}-P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})\|^{2}_{Q_{l}}. (88)

    Then, for any 𝐱0d\mathbf{x}^{0}\in\mathbb{R}^{d} and p=1p=1,

    J(𝐱k)J(𝐱)𝐱0𝐱22αkJ(\mathbf{x}^{k})-J(\mathbf{x}^{*})\leq\frac{\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2}}{2\alpha k} (89)

    holds for 𝐱argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}).

Proof.

(a) From Proposition 6a-b, the CPGD in (84) can be regarded as the hybrid steepest descent in [34, 35] for any pp\in\mathbb{N}. Hence, Theorem 3a follows from Theorem 2.18, Remark 2.17 in [35], and Proposition 6c. (b) The statement follows from Theorem 2.15 in [35] and Proposition 6a-b. (c) See Appendix F.4. ∎

Remark 3.

Using VV in (88), another expression of the clique-based projection TT is obtained as follows.

Proposition 7.

Consider the function V:dV:\mathbb{R}^{d}\to\mathbb{R} in (88). Then, it holds for any 𝐱d\mathbf{x}\in\mathbb{R}^{d} that

T(𝐱)=𝐱𝐱V(𝐱).T(\mathbf{x})=\mathbf{x}-\nabla_{\mathbf{x}}V(\mathbf{x}). (90)
Proof.

Since each 𝒟l\mathcal{D}_{l} is closed and convex, 1/2x𝒞lP𝒟lQl(x𝒞l)Ql21/2\,\|x_{\mathcal{C}_{l}}-P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})\|_{Q_{l}}^{2} is differentiable, and thus V(𝐱)V(\mathbf{x}) in (88) is also differentiable. Then, for all i𝒩i\in\mathcal{N}, we have xiV(𝐱)=l𝒬𝒢i1|𝒬𝒢i|(xiEl,iP𝒟lQl(x𝒞l))=xi1|𝒬𝒢i|l𝒬𝒢iEl,iP𝒟l(x𝒞l)=xiTi(x𝒩i)\nabla_{x_{i}}V(\mathbf{x})=\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}(x_{i}-E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}}))=x_{i}-\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}E_{l,i}P_{\mathcal{D}_{l}}(x_{\mathcal{C}_{l}})=x_{i}-T_{i}(x_{\mathcal{N}_{i}}) from (3) and (86). Hence, we obtain (90). ∎

From Proposition 7, we can interpret the CPGD as a variant of the proximal gradient descent [24, 11, 5] since the clique-based projection TT can be represented as T(𝐱)=argmin𝐱d12𝐱𝐱2+V(𝐱)+𝐱V(𝐱)(𝐱𝐱).T(\mathbf{x})=\operatorname*{arg\,min}_{\mathbf{x}^{\prime}\in\mathbb{R}^{d}}\>\frac{1}{2}\|\mathbf{x}-\mathbf{x}^{\prime}\|^{2}+V(\mathbf{x})+\nabla_{\mathbf{x}}V(\mathbf{x})^{\top}(\mathbf{x}^{\prime}-\mathbf{x}). In this sense, the CPGD is a generalization of the conventional projected gradient descent (PGD). When 𝒢\mathcal{G} is complete, the CPGD equals PGD because 𝒬𝒢all={1}{\mathcal{Q}_{\mathcal{G}}^{\mathrm{all}}}=\{1\} and 𝒞1=𝒩\mathcal{C}_{1}=\mathcal{N} hold for complete graphs.

Remark 4.

A benefit of the CPGD over the CD-DYS is its simple structure, which makes its analysis and extensions easy. Stochastic and online variants of the CPGD can be evaluated easily using the same strategy as the online projected gradient descent [16], owing to Proposition 6.

F.2 Nesterov’s acceleration

The CPGD with fixed step sizes can be accelerated up to the inexact convergence rate of O(1/k2)O(1/k^{2}) with Nesterov’s acceleration [23, 5]. The accelerated CPGD (ACPGD) is given as follows:

𝐱k+1=Tp(𝐱^kλk𝐱f^(𝐱^k))\displaystyle\mathbf{x}^{k+1}=T^{p}(\hat{\mathbf{x}}^{k}-\lambda^{k}\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k}))
\displaystyle\hat{\mathbf{x}}^{k+1}=\mathbf{x}^{k+1}+\frac{\sigma^{k}-1}{\sigma^{k+1}}(\mathbf{x}^{k+1}-\mathbf{x}^{k}), (91)

where 𝐱^0=𝐱0\hat{\mathbf{x}}^{0}=\mathbf{x}^{0} and σk+1=(1+1+4(σk)2)/2\sigma^{k+1}=(1+\sqrt{1+4(\sigma^{k})^{2}})/2 with σ0=1\sigma^{0}=1. This algorithm can also be implemented in a distributed manner.
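
A minimal sketch of the iteration (91) is given below, assuming the clique-based projection TT and the gradient of f^\hat{f} are available as callables; all names are ours. With p=1p=1 and α=1/L^\alpha=1/\hat{L}, this matches the setting of Theorem 4.

import numpy as np

def acpgd(x0, grad_f, T, alpha, p=1, num_iters=500):
    """Sketch of the accelerated CPGD (91): Nesterov extrapolation combined with
    p applications of the clique-based projection T per iteration."""
    x_prev = x0.copy()
    x_hat = x0.copy()
    sigma = 1.0
    for _ in range(num_iters):
        x = x_hat - alpha * grad_f(x_hat)
        for _ in range(p):
            x = T(x)                          # T^p applied to the gradient step
        sigma_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * sigma ** 2))
        x_hat = x + (sigma - 1.0) / sigma_next * (x - x_prev)
        x_prev, sigma = x, sigma_next
    return x_prev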

The convergence rate is proved as follows.

Theorem 4.

Consider Problem (80) with closed convex sets 𝒟l,l𝒬𝒢\mathcal{D}_{l},\,l\in\mathcal{Q}_{\mathcal{G}} and the ACPGD algorithm (91). Suppose Assumption 1. Assume that 𝒟d\mathcal{D}\subset\mathbb{R}^{d} in (85) is a non-empty closed convex set. Let p=1p=1 and λk=α(0,1/L^]\lambda^{k}=\alpha\in(0,1/\hat{L}] for all kk. Then, for any initial state 𝐱0=𝐱^0d\mathbf{x}^{0}=\hat{\mathbf{x}}^{0}\in\mathbb{R}^{d}, the following inequality holds:

J(𝐱k)J(𝐱)2𝐱0𝐱2αk2,J(\mathbf{x}^{k})-J(\mathbf{x}^{*})\leq\frac{2\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2}}{\alpha k^{2}}, (92)

where 𝐱argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}) and J(𝐱)J(\mathbf{x}) is given as (87).

Proof.

See Appendix F.4. ∎

F.3 Proof of Proposition 6

As a preliminary, we present important properties of the function V(𝐱)V(\mathbf{x}) in (88) for 𝒟\mathcal{D} in (85) as follows. Note that the function VV in (88) is convex because of the convexity of each 𝒟l\mathcal{D}_{l}.

Proposition 8.

For V(𝐱)V(\mathbf{x}) in (88) and a non-empty closed convex set 𝒟\mathcal{D} in (85), V(𝐱)=0𝐱𝒟V(\mathbf{x})=0\Leftrightarrow\mathbf{x}\in\mathcal{D} holds.

Proof.

If V(𝐱)=0V(\mathbf{x})=0 for 𝐱d\mathbf{x}\in\mathbb{R}^{d}, we obtain x𝒞l=P𝒟lQl(x𝒞l)𝒟lx_{\mathcal{C}_{l}}=P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})\in\mathcal{D}_{l} for all l𝒬𝒢l\in\mathcal{Q}_{\mathcal{G}}, which yields 𝐱𝒟\mathbf{x}\in\mathcal{D} because of (85). Conversely, if 𝐱𝒟\mathbf{x}\in\mathcal{D}, then we have x𝒞l𝒟lx_{\mathcal{C}_{l}}\in\mathcal{D}_{l} for all l𝒬𝒢l\in\mathcal{Q}_{\mathcal{G}}. Thus, V(𝐱)=0V(\mathbf{x})=0 holds. ∎

Proposition 9.

The function V(𝐱)V(\mathbf{x}) in (88) is a 11-smooth function, i.e., its gradient 𝐱V(𝐱)\nabla_{\mathbf{x}}V(\mathbf{x}) is 11-Lipschitzian.

Proof.

From Definition 2, 1-cocoercivity of P𝒟lQlP_{\mathcal{D}_{l}}^{Q_{l}} (see [4]), and Proposition 7, we obtain the following for any 𝐱,𝐰d\mathbf{x},\mathbf{w}\in\mathbb{R}^{d}:

𝐱V(𝐱)𝐱V(𝐰)2=(𝐱𝐰)(T(𝐱)T(𝐰))2\displaystyle\|\nabla_{\mathbf{x}}V(\mathbf{x})-\nabla_{\mathbf{x}}V(\mathbf{w})\|^{2}=\|(\mathbf{x}-\mathbf{w})-(T(\mathbf{x})-T(\mathbf{w}))\|^{2}
=\displaystyle= 𝐱𝐰2+T(𝐱)T(𝐰)22(𝐱𝐰)(T(𝐱)T(𝐰))\displaystyle\|\mathbf{x}-\mathbf{w}\|^{2}+\|T(\mathbf{x})-T(\mathbf{w})\|^{2}-2(\mathbf{x}-\mathbf{w})^{\top}(T(\mathbf{x})-T(\mathbf{w}))
=\displaystyle= 𝐱𝐰2+T(𝐱)T(𝐰)2\displaystyle\|\mathbf{x}-\mathbf{w}\|^{2}+\|T(\mathbf{x})-T(\mathbf{w})\|^{2}
2l𝒬𝒢(x𝒞lw𝒞l)Ql(P𝒟lQl(x𝒞l)P𝒟lQl(w𝒞l))\displaystyle-2\sum_{l\in\mathcal{Q}_{\mathcal{G}}}(x_{\mathcal{C}_{l}}-w_{\mathcal{C}_{l}})^{\top}Q_{l}(P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}}))
\displaystyle\leq 𝐱𝐰2+T(𝐱)T(𝐰)2\displaystyle\|\mathbf{x}-\mathbf{w}\|^{2}+\|T(\mathbf{x})-T(\mathbf{w})\|^{2}
2l𝒬𝒢P𝒟lQl(x𝒞l)P𝒟lQl(w𝒞l)Ql2\displaystyle\quad-2\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}})\|^{2}_{Q_{l}}
\displaystyle\leq 𝐱𝐰2T(𝐱)T(𝐰)2𝐱𝐰2.\displaystyle\|\mathbf{x}-\mathbf{w}\|^{2}-\|T(\mathbf{x})-T(\mathbf{w})\|^{2}\leq\|\mathbf{x}-\mathbf{w}\|^{2}.

The last line follows from (93) in the proof of Proposition 6a. This completes the proof. ∎

With this in mind, we prove Proposition 6 as follows.

a) From Jensen’s inequality and the quasinonexpansiveness of convex projection operators [4], the following inequality holds for any 𝐱,𝐰d\mathbf{x},\,\mathbf{w}\in\mathbb{R}^{d}:

(T(𝐱)T(𝐰))(𝐱𝐰)\displaystyle(T(\mathbf{x})-T(\mathbf{w}))^{\top}(\mathbf{x}-\mathbf{w})
=l𝒬𝒢(x𝒞lw𝒞l)Ql(P𝒟lQl(x𝒞l)P𝒟lQl(w𝒞l))\displaystyle=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}(x_{\mathcal{C}_{l}}-w_{\mathcal{C}_{l}})^{\top}Q_{l}(P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}}))
l𝒬𝒢P𝒟lQl(x𝒞l)P𝒟lQl(w𝒞l)Ql2\displaystyle\geq\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}})\|_{Q_{l}}^{2}
\displaystyle=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\|E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(w_{\mathcal{C}_{l}})\|^{2}
i=1nTi(x𝒩i)Ti(w𝒩i)2=T(𝐱)T(𝐰)2.\displaystyle\geq\sum_{i=1}^{n}\|T_{i}(x_{\mathcal{N}_{i}})-T_{i}(w_{\mathcal{N}_{i}})\|^{2}=\|T(\mathbf{x})-T(\mathbf{w})\|^{2}. (93)

Thus, we obtain T(𝐱)T(𝐰)2(T(𝐱)T(𝐰))(𝐱𝐰)\|T(\mathbf{x})-T(\mathbf{w})\|^{2}\leq(T(\mathbf{x})-T(\mathbf{w}))^{\top}(\mathbf{x}-\mathbf{w}).

b) 𝒟Fix(T)\mathcal{D}\subset\mathrm{Fix}(T) holds because x𝒞l=P𝒟lQl(x𝒞l)x_{\mathcal{C}_{l}}=P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}}) holds for any 𝐱𝒟\mathbf{x}\in\mathcal{D} and all l𝒬𝒢l\in\mathcal{Q}_{\mathcal{G}}. In the following, we prove the converse inclusion Fix(T)𝒟\mathrm{Fix}(T)\subset\mathcal{D}. Let 𝐰𝒟\mathbf{w}\in\mathcal{D}. Then, it suffices to show 𝐰^Fix(T){𝐰}𝐰^𝒟\hat{\mathbf{w}}\in\mathrm{Fix}(T)\setminus\{\mathbf{w}\}\Rightarrow\hat{\mathbf{w}}\in\mathcal{D}. From 𝐰^Fix(T)\hat{\mathbf{w}}\in\mathrm{Fix}(T), we obtain w^i=Ti(w^𝒩i)\hat{w}_{i}=T_{i}(\hat{w}_{\mathcal{N}_{i}}) for all i𝒩i\in\mathcal{N}. In addition, from Jensen’s inequality and the quasinonexpansiveness of convex projection operators [4], we have

\displaystyle\|\mathbf{w}-\hat{\mathbf{w}}\|^{2}\geq\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|w_{\mathcal{C}_{l}}-P_{\mathcal{D}_{l}}^{Q_{l}}(\hat{w}_{\mathcal{C}_{l}})\|^{2}_{Q_{l}}
=i=1nl𝒬𝒢i1|𝒬𝒢i|wiEl,iP𝒟l(w^𝒞l)2\displaystyle=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\|w_{i}-E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}})\|^{2}
\displaystyle\geq\sum_{i=1}^{n}\|w_{i}-\underbrace{\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}})}_{=T_{i}(\hat{w}_{\mathcal{N}_{i}})=\hat{w}_{i}}\|^{2}=\|\mathbf{w}-\hat{\mathbf{w}}\|^{2}.

Thus, from the equality condition of Jensen’s inequality, we obtain w_{i}-E_{k,i}P_{\mathcal{D}_{k}}(\hat{w}_{\mathcal{C}_{k}})=w_{i}-E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}}) for all \mathcal{C}_{k},\mathcal{C}_{l}\,(k,l\in\mathcal{Q}_{\mathcal{G}}^{i}) and all i\in\mathcal{N}. Then, we have E_{k,i}P_{\mathcal{D}_{k}}(\hat{w}_{\mathcal{C}_{k}})=E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}}) for all \mathcal{C}_{k},\mathcal{C}_{l}\,(k,l\in\mathcal{Q}_{\mathcal{G}}^{i}). Therefore, since \hat{\mathbf{w}}\in\mathrm{Fix}(T), we have 2V(\hat{\mathbf{w}})=\sum_{i=1}^{n}\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}\|\hat{w}_{i}-E_{l,i}P_{\mathcal{D}_{l}}(\hat{w}_{\mathcal{C}_{l}})\|^{2}=\sum_{i=1}^{n}\|\hat{w}_{i}-T_{i}(\hat{w}_{\mathcal{N}_{i}})\|^{2}=0. Thus, \hat{\mathbf{w}}\in\mathcal{D} holds from Proposition 8.

c) For a non-empty closed convex set 𝒟\mathcal{D} in (85) and 𝐱d𝒟\mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D}, there exists l^𝒬𝒢\hat{l}\in\mathcal{Q}_{\mathcal{G}} such that x𝒞l^P𝒟l^(x𝒞l^)Ql^>0\|x_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})\|_{Q_{\hat{l}}}>0. Hence, for l^𝒬𝒢\hat{l}\in\mathcal{Q}_{\mathcal{G}}, 𝐱d𝒟\mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D}, and 𝐰𝒟\mathbf{w}\in\mathcal{D}, we have x𝒞l^w𝒞l^Ql^2>P𝒟l^Ql^(x𝒞l^)w𝒞l^Ql^2\|x_{\mathcal{C}_{\hat{l}}}-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}}>\|P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}} because

\displaystyle\|x_{\mathcal{C}_{\hat{l}}}-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}}
\displaystyle=\|x_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})\|^{2}_{Q_{\hat{l}}}+\|P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}}
\displaystyle\quad-2(x_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}}))^{\top}Q_{\hat{l}}(w_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}}))
\displaystyle>\|P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})-w_{\mathcal{C}_{\hat{l}}}\|^{2}_{Q_{\hat{l}}},

where the last inequality follows from the projection theorem (see Theorem 3.16 in [4]) together with \|x_{\mathcal{C}_{\hat{l}}}-P_{\mathcal{D}_{\hat{l}}}^{Q_{\hat{l}}}(x_{\mathcal{C}_{\hat{l}}})\|^{2}_{Q_{\hat{l}}}>0. Thus, by Jensen’s inequality and the quasinonexpansiveness of P_{\mathcal{D}_{l}}^{Q_{l}} [4], for any \mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D} and \mathbf{w}\in\mathcal{D}, we obtain \|\mathbf{x}-\mathbf{w}\|^{2}=\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|x_{\mathcal{C}_{l}}-w_{\mathcal{C}_{l}}\|^{2}_{Q_{l}}>\sum_{l\in\mathcal{Q}_{\mathcal{G}}}\|P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-w_{\mathcal{C}_{l}}\|^{2}_{Q_{l}}\geq\sum_{i=1}^{n}\|\sum_{l\in\mathcal{Q}_{\mathcal{G}}^{i}}\frac{1}{|\mathcal{Q}_{\mathcal{G}}^{i}|}E_{l,i}P_{\mathcal{D}_{l}}^{Q_{l}}(x_{\mathcal{C}_{l}})-w_{i}\|^{2}=\|T(\mathbf{x})-\mathbf{w}\|^{2}. Hence, \|T(\mathbf{x})-\mathbf{w}\|<\|\mathbf{x}-\mathbf{w}\| for any \mathbf{x}\in\mathbb{R}^{d}\setminus\mathcal{D} and \mathbf{w}\in\mathcal{D}.

d) For \mathbf{x}\in\mathbb{R}^{d}, we define \{a_{k}\} by a_{k+1}=T(a_{k}) with a_{0}=\mathbf{x}. Then, we obtain \lim_{k\to\infty}a_{k+1}=\lim_{k\to\infty}T(a_{k}). Thus, from the continuity of T shown in Proposition 6a, we have T^{\infty}(\mathbf{x})=\lim_{k\to\infty}a_{k+1}=T(\lim_{k\to\infty}a_{k})=T(T^{\infty}(\mathbf{x})). Hence, Proposition 6b yields T^{\infty}(\mathbf{x})\in\mathrm{Fix}(T)=\mathcal{D}.
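To illustrate the averaged operator T and the fixed-point iteration a_{k+1}=T(a_{k}) of Proposition 6d, the following minimal Python sketch uses an entirely hypothetical clique cover and takes each \mathcal{D}_{l} to be the in-clique consensus set, with the plain in-clique mean standing in for the Q_{l}-weighted projection P_{\mathcal{D}_{l}}^{Q_{l}}; this is an illustration only, not the setting analyzed above.

```python
import numpy as np

# Minimal sketch (assumptions: toy clique cover; each D_l is the in-clique
# consensus set; the plain in-clique mean stands in for P_{D_l}^{Q_l}).
n = 5
cliques = [[0, 1, 2], [2, 3], [3, 4]]                       # hypothetical cliques C_l
memberships = [[l for l, C in enumerate(cliques) if i in C] for i in range(n)]

def T(x):
    """Agent i averages its components of the clique-wise projections it belongs to."""
    return np.array([
        np.mean([x[cliques[l]].mean() for l in memberships[i]])   # E_{l,i} P_{D_l}(x_{C_l})
        for i in range(n)
    ])

a = np.random.default_rng(0).standard_normal(n)
for _ in range(500):                                        # a_{k+1} = T(a_k)
    a = T(a)
print(np.linalg.norm(a - T(a)), np.ptp(a))                  # both ~0: a is a consensus fixed point
```

In this toy instance the cliques cover a connected graph, and the limit has (numerically) equal components, consistent with \mathrm{Fix}(T)=\mathcal{D}.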

F.4 Proof of Theorems 3c and 4

Here, we show the proofs of Theorems 3c and 4. These proofs are based on the convergence theorems for the ISTA and FISTA (Theorems 3.1 and 4.4 in [5]), respectively.

Supporting Lemmas

Before proceeding to prove the theorems, we present some inequalities corresponding to those obtained from Lemma 2.3 in [5], which is key to proving the convergence theorems. Note that a differentiable function h:\mathbb{R}^{d}\to\mathbb{R} is convex if and only if

h(𝐰)h(𝐱)+h(𝐱)(𝐰𝐱)h(\mathbf{w})\geq h(\mathbf{x})+\nabla h(\mathbf{x})^{\top}(\mathbf{w}-\mathbf{x}) (94)

holds for any 𝐱,𝐰d\mathbf{x},\mathbf{w}\in\mathbb{R}^{d}. If hh is β\beta-smooth and convex,

h(𝐰)\displaystyle h(\mathbf{w}) h(𝐱)+h(𝐱)(𝐰𝐱)+β2𝐰𝐱2\displaystyle\leq h(\mathbf{x})+\nabla h(\mathbf{x})^{\top}(\mathbf{w}-\mathbf{x})+\frac{\beta}{2}\|\mathbf{w}-\mathbf{x}\|^{2} (95)
h(𝐰)\displaystyle h(\mathbf{w}) h(𝐱)+h(𝐱)(𝐰𝐱)+12βh(𝐱)h(𝐰)2\displaystyle\geq h(\mathbf{x})+\nabla h(\mathbf{x})^{\top}(\mathbf{w}-\mathbf{x})+\frac{1}{2\beta}\|\nabla h(\mathbf{x})-\nabla h(\mathbf{w})\|^{2} (96)

hold for any \mathbf{x},\mathbf{w}\in\mathbb{R}^{d}. For details, see textbooks on convex analysis, e.g., Theorem 18.15 in [4].
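As a quick numerical illustration (not part of the analysis), the inequalities (94)–(96) can be checked for a smooth convex quadratic; the sketch below assumes h(\mathbf{x})=\frac{1}{2}\mathbf{x}^{\top}A\mathbf{x} with A symmetric positive semidefinite and takes \beta as the largest eigenvalue of A.

```python
import numpy as np

# Sanity check of (94)-(96) for h(x) = 0.5 * x^T A x (illustrative only).
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M.T @ M                                    # symmetric positive semidefinite Hessian
beta = np.linalg.eigvalsh(A).max()             # a valid smoothness constant for h

h = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, w = rng.standard_normal(4), rng.standard_normal(4)
linear = h(x) + grad(x) @ (w - x)                                      # first-order model at x
assert h(w) >= linear                                                  # (94): convexity
assert h(w) <= linear + 0.5 * beta * np.sum((w - x) ** 2)              # (95): beta-smoothness
assert h(w) >= linear + np.sum((grad(x) - grad(w)) ** 2) / (2 * beta)  # (96)
```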

In preparation for the lemmas below, let \alpha\in(0,1/\hat{L}] and V_{\alpha}(\mathbf{x})=V(\mathbf{x})/\alpha with V(\mathbf{x}) in (88). Additionally, for a given \mathbf{w}\in\mathbb{R}^{d}, we define \hat{F}_{\mathbf{w}}:\mathbb{R}^{d}\to\mathbb{R} as

F^𝐰(𝐬)=f^(𝐬)+Vα(𝐰)+𝐱Vα(𝐰)(𝐬𝐰).\hat{F}_{\mathbf{w}}(\mathbf{s})=\hat{f}(\mathbf{s})+V_{\alpha}(\mathbf{w})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}(\mathbf{s}-\mathbf{w}). (97)

For \hat{F}_{\mathbf{w}}(\mathbf{s}) in (97), the following inequality holds.

Proposition 10.

Assume that f^\hat{f} is L^\hat{L}-smooth and convex. Let 𝐰=𝐱α𝐱f^(𝐱)\mathbf{w}=\mathbf{x}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x}). Then,

F^𝐰(T(𝐰))F^𝐰(𝝃)+1α(𝐱T(𝐰))(𝐱𝝃)12α𝐱T(𝐰)2\hat{F}_{\mathbf{w}}(T(\mathbf{w}))\leq\hat{F}_{\mathbf{w}}(\boldsymbol{\xi})+\frac{1}{\alpha}(\mathbf{x}-T(\mathbf{w}))^{\top}(\mathbf{x}-\boldsymbol{\xi})-\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2} (98)

holds for any 𝛏d\boldsymbol{\xi}\in\mathbb{R}^{d}.

Proof.

Let G𝐰(𝐬)=f^(𝐬)+𝐱Vα(𝐰)(𝐬𝐰)G_{\mathbf{w}}(\mathbf{s})=\hat{f}(\mathbf{s})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}(\mathbf{s}-\mathbf{w}) and 𝝃d\boldsymbol{\xi}\in\mathbb{R}^{d}. Then, by using L^\hat{L}-smoothness of f^\hat{f}, 𝐱f^(𝐱)=(𝐱𝐰)/α\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})=(\mathbf{x}-\mathbf{w})/\alpha, and 𝐱Vα(𝐰)=(𝐰T(𝐰))/α\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})=(\mathbf{w}-T(\mathbf{w}))/\alpha (see Proposition 7),

\displaystyle G_{\mathbf{w}}(T(\mathbf{w}))=\hat{f}(T(\mathbf{w}))+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}(T(\mathbf{w})-\mathbf{w})
\displaystyle\leq\hat{f}(\mathbf{x})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})^{\top}(\mathbf{x}-T(\mathbf{w}))+\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2}
\displaystyle\quad+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}(T(\mathbf{w})-\mathbf{w})
\displaystyle\leq\hat{f}(\boldsymbol{\xi})+\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})^{\top}(\mathbf{x}-\boldsymbol{\xi})-\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})^{\top}(\mathbf{x}-T(\mathbf{w}))
\displaystyle\quad+\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2}+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})^{\top}\underbrace{(T(\mathbf{w})-\mathbf{w})}_{=(\boldsymbol{\xi}-\mathbf{w})+(T(\mathbf{w})-\boldsymbol{\xi})}
\displaystyle=G_{\mathbf{w}}(\boldsymbol{\xi})+\frac{1}{\alpha}(\mathbf{x}-T(\mathbf{w}))^{\top}(T(\mathbf{w})-\boldsymbol{\xi})+\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2}
\displaystyle=G_{\mathbf{w}}(\boldsymbol{\xi})+\frac{1}{\alpha}(\mathbf{x}-T(\mathbf{w}))^{\top}(\mathbf{x}-\boldsymbol{\xi})-\frac{1}{2\alpha}\|\mathbf{x}-T(\mathbf{w})\|^{2}

is obtained from (94) and (95). Thus, adding Vα(𝐰)V_{\alpha}(\mathbf{w}) to both sides, we obtain (98). ∎
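As a sanity check of (98) (illustrative only, not part of the proof), consider the degenerate single-clique case in which T is the Euclidean projection onto the consensus subspace, so that V(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-T(\mathbf{x})\|^{2} and the gradient identity \nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w})=(\mathbf{w}-T(\mathbf{w}))/\alpha of Proposition 7 is elementary. The sketch below verifies (98) numerically under these assumptions, with \hat{f}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-b\|^{2} (so \hat{L}=1) and \alpha=0.8.

```python
import numpy as np

# Numerical check of (98) in the single-clique case: T = projection onto consensus.
rng = np.random.default_rng(1)
n, alpha = 6, 0.8                                        # alpha <= 1/L_hat with L_hat = 1
b = rng.standard_normal(n)

f_hat = lambda x: 0.5 * np.sum((x - b) ** 2)
grad_f = lambda x: x - b
T = lambda x: np.full(n, x.mean())                       # projection onto {x_1 = ... = x_n}
V_alpha = lambda x: 0.5 * np.sum((x - T(x)) ** 2) / alpha
grad_V = lambda w: (w - T(w)) / alpha                    # Proposition 7 in this special case
F_hat = lambda s, w: f_hat(s) + V_alpha(w) + grad_V(w) @ (s - w)   # definition (97)

x = rng.standard_normal(n)
w = x - alpha * grad_f(x)
lhs = F_hat(T(w), w)
for _ in range(1000):                                    # (98) should hold for every xi
    xi = rng.standard_normal(n)
    rhs = (F_hat(xi, w) + (x - T(w)) @ (x - xi) / alpha
           - 0.5 * np.sum((x - T(w)) ** 2) / alpha)
    assert lhs <= rhs + 1e-9
```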

Proposition 11.

Let 𝐱k+1=T(𝐰k)\mathbf{x}^{k+1}=T(\mathbf{w}^{k}) with some {𝐰k}d\{\mathbf{w}^{k}\}\subset\mathbb{R}^{d}. Then, it holds that

\hat{F}_{\mathbf{w}^{k}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}\leq\hat{F}_{\mathbf{w}^{k-1}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}. (99)
Proof.

By 1/α1/\alpha-smoothness of Vα(𝐱)V_{\alpha}(\mathbf{x}) (see Proposition 9) and Proposition 7,

F^𝐰k1(𝐱k)=f^(𝐱k)+Vα(𝐰k1)+𝐱Vα(𝐰k1)(𝐱k𝐰k1)\displaystyle\hat{F}_{\mathbf{w}^{k-1}}(\mathbf{x}^{k})=\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k-1})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})^{\top}(\mathbf{x}^{k}-\mathbf{w}^{k-1})
=\displaystyle= f^(𝐱k)+Vα(𝐰k1)α𝐱Vα(𝐰k1)2\displaystyle\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k-1})-\alpha\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}
\displaystyle\geq f^(𝐱k)+Vα(𝐰k)+𝐱Vα(𝐰k)(𝐰k1𝐰k)\displaystyle\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{w}^{k-1}-\mathbf{w}^{k})
+α2𝐱Vα(𝐰k1)𝐱Vα(𝐰k)2α𝐱Vα(𝐰k1)2\displaystyle+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})-\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}-\alpha\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}
=\displaystyle= f^(𝐱k)+Vα(𝐰k)+𝐱Vα(𝐰k)(𝐱k𝐰k)\displaystyle\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{x}^{k}-\mathbf{w}^{k})
+𝐱Vα(𝐰k)(𝐰k1𝐱k)\displaystyle+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{w}^{k-1}-\mathbf{x}^{k})
+α2𝐱Vα(𝐰k1)𝐱Vα(𝐰k)2α𝐱Vα(𝐰k1)2\displaystyle+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})-\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}-\alpha\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}
=\displaystyle= F^𝐰k(𝐱k)+α2𝐱Vα(𝐰k)2α2𝐱Vα(𝐰k1)2\displaystyle\hat{F}_{\mathbf{w}^{k}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}-\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}

is obtained from (96). Hence, (99) holds. ∎

With this in mind, we consider the following update rule with \hat{\mathbf{x}}^{0}=\mathbf{x}^{0} and some \{\theta^{k}\}\subset\mathbb{R}:

\displaystyle\mathbf{w}^{k}=\hat{\mathbf{x}}^{k}-\alpha\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k})
\displaystyle\mathbf{x}^{k+1}=T(\mathbf{w}^{k})
\displaystyle\hat{\mathbf{x}}^{k+1}=\mathbf{x}^{k+1}+\theta^{k}(\mathbf{x}^{k+1}-\mathbf{x}^{k}). (100)
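For illustration, the following sketch implements (100) on a toy instance; everything below is an assumption made for the example only: a quadratic \hat{f}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-b\|^{2} (so \hat{L}=1), the plain-mean clique operator T from the earlier sketch, \sigma^{0}=1, and the FISTA-type recursion \sigma^{k+1}=(1+\sqrt{1+4(\sigma^{k})^{2}})/2 with \theta^{k}=(\sigma^{k}-1)/\sigma^{k+1}, the choice used later in the proof of Theorem 4. It reports the fixed-point residual of \mathbf{x}\mapsto T(\mathbf{x}-\alpha\nabla_{\mathbf{x}}\hat{f}(\mathbf{x})).

```python
import numpy as np

# Sketch of the inertial update (100) on a toy instance (not the paper's code).
n, alpha = 5, 0.5
b = np.arange(n, dtype=float)
grad_f = lambda x: x - b                                   # gradient of 0.5*||x - b||^2
cliques = [[0, 1, 2], [2, 3], [3, 4]]
memberships = [[l for l, C in enumerate(cliques) if i in C] for i in range(n)]
T = lambda x: np.array([np.mean([x[cliques[l]].mean() for l in memberships[i]])
                        for i in range(n)])

x = np.zeros(n)
x_hat = x.copy()
sigma = 1.0
for k in range(300):
    w = x_hat - alpha * grad_f(x_hat)                      # gradient step at extrapolated point
    x_next = T(w)                                          # clique-wise averaging step
    sigma_next = (1 + np.sqrt(1 + 4 * sigma ** 2)) / 2     # assumed FISTA-type schedule
    theta = (sigma - 1) / sigma_next                       # theta^k = (sigma^k - 1)/sigma^{k+1}
    x_hat = x_next + theta * (x_next - x)                  # inertial extrapolation
    x, sigma = x_next, sigma_next
print(np.linalg.norm(x - T(x - alpha * grad_f(x))))        # fixed-point residual, ~0
```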

In addition, we define \Theta^{k} as

\Theta^{k}=\hat{F}_{\mathbf{w}^{k-1}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2} (101)

with F^𝐰\hat{F}_{\mathbf{w}} in (97). By 𝐱k𝐰k1=α𝐱Vα(𝐰k1)\mathbf{x}^{k}-\mathbf{w}^{k-1}=-\alpha\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1}), Θk\Theta^{k} can be rewritten as Θk=f^(𝐱k)+Vα(𝐰k1)12α𝐰k1T(𝐰k1)2=f^(𝐱k)+Vα(𝐰k1)12α𝐰k1𝐱k2\Theta^{k}=\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k-1})-\frac{1}{2\alpha}\|\mathbf{w}^{k-1}-T(\mathbf{w}^{k-1})\|^{2}=\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k-1})-\frac{1}{2\alpha}\|\mathbf{w}^{k-1}-\mathbf{x}^{k}\|^{2}.

Remarkably, the following lemma holds for \Theta^{k} in (101).

Lemma 3.

Consider the sequence generated by (100). Then,

J(𝐱k)=f^(𝐱k)+Vα(𝐱k)Θk.J(\mathbf{x}^{k})=\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{x}^{k})\leq\Theta^{k}. (102)
Proof.

In light of 1/\alpha-smoothness of V_{\alpha} and \nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})=(\mathbf{w}^{k-1}-\mathbf{x}^{k})/\alpha, we obtain V_{\alpha}(\mathbf{x}^{k})\leq V_{\alpha}(\mathbf{w}^{k-1})+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})^{\top}(\mathbf{x}^{k}-\mathbf{w}^{k-1})+\frac{1}{2\alpha}\|\mathbf{w}^{k-1}-\mathbf{x}^{k}\|^{2}=V_{\alpha}(\mathbf{w}^{k-1})-\frac{1}{2\alpha}\|\mathbf{w}^{k-1}-\mathbf{x}^{k}\|^{2}. Hence, adding \hat{f}(\mathbf{x}^{k}) to both sides and recalling the rewriting of \Theta^{k} above yields (102). ∎

Furthermore, the following inequality holds. This is essential to Theorems 3c and 4.

Lemma 4.

For the sequence generated by (F.4) and Θk\Theta^{k} defined in (101), it holds that

ΘkΘk+1\displaystyle\small\Theta^{k}-\Theta^{k+1} 12α𝐱^k𝐱k+12+1α(𝐱k+1𝐱^k)(𝐱^k𝐱k).\displaystyle\geq\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}+\frac{1}{\alpha}(\mathbf{x}^{k+1}-\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k}). (103)
Proof.

Substituting \mathbf{x}=\hat{\mathbf{x}}^{k},\,\mathbf{w}=\mathbf{w}^{k}, and \boldsymbol{\xi}=\mathbf{x}^{k} into (98), we obtain

Θk+1=f^(𝐱k+1)+Vα(𝐰k)\displaystyle\Theta^{k+1}=\hat{f}(\mathbf{x}^{k+1})+V_{\alpha}(\mathbf{w}^{k})
+𝐱Vα(𝐰k)(𝐱k+1𝐰k)+α2𝐱Vα(𝐰k)2\displaystyle+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{x}^{k+1}-\mathbf{w}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}
f^(𝐱k)+Vα(𝐰k)\displaystyle\leq\hat{f}(\mathbf{x}^{k})+V_{\alpha}(\mathbf{w}^{k})
+𝐱Vα(𝐰k)(𝐱k𝐰k)+α2𝐱Vα(𝐰k)2\displaystyle+\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})^{\top}(\mathbf{x}^{k}-\mathbf{w}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}
+1α(𝐱^k𝐱k+1)(𝐱^k𝐱k)12α𝐱^k𝐱k+12\displaystyle+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}
\displaystyle=\hat{F}_{\mathbf{w}^{k}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k})\|^{2}
+1α(𝐱^k𝐱k+1)(𝐱^k𝐱k)12α𝐱^k𝐱k+12\displaystyle+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}
\displaystyle\leq\hat{F}_{\mathbf{w}^{k-1}}(\mathbf{x}^{k})+\frac{\alpha}{2}\|\nabla_{\mathbf{x}}V_{\alpha}(\mathbf{w}^{k-1})\|^{2}
+1α(𝐱^k𝐱k+1)(𝐱^k𝐱k)12α𝐱^k𝐱k+12\displaystyle+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}
=Θk+1α(𝐱^k𝐱k+1)(𝐱^k𝐱k)12α𝐱^k𝐱k+12\displaystyle=\Theta^{k}+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}

from (98) and (99). Thus, (103) holds. ∎

For 𝐱k\mathbf{x}^{k} and an optimal 𝐱\mathbf{x}^{*}, we present the following lemma.

Lemma 5.

For 𝐱argmin𝐱𝒟f^(𝐱)\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathcal{D}}\hat{f}(\mathbf{x}), it holds that

f^(𝐱)+Vα(𝐱)\displaystyle\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*}) Θk+112α𝐱^k𝐱k+12+1α(𝐱k+1𝐱^k)(𝐱^k𝐱).\displaystyle-\Theta^{k+1}\geq\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}+\frac{1}{\alpha}(\mathbf{x}^{k+1}-\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{*}). (104)
Proof.

Recalling (100), \hat{L}-smoothness of \hat{f}, and 1/\alpha-smoothness of V_{\alpha} for \alpha\in(0,1/\hat{L}], we obtain

Θk+1f^(𝐱^k)𝐱f^(𝐱^k)(𝐱^k𝐱k+1)\displaystyle\Theta^{k+1}\leq\hat{f}(\hat{\mathbf{x}}^{k})-\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})
+12α𝐱^k𝐱k+12+Vα(𝐰k)12α𝐰kT(𝐰k)2\displaystyle+\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}+V_{\alpha}(\mathbf{w}^{k})-\frac{1}{2\alpha}\|\mathbf{w}^{k}-T(\mathbf{w}^{k})\|^{2}
\displaystyle\leq\hat{f}(\mathbf{x}^{*})+\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{*})-\nabla_{\mathbf{x}}\hat{f}(\hat{\mathbf{x}}^{k})^{\top}(\hat{\mathbf{x}}^{k}-T(\mathbf{w}^{k}))
\displaystyle\quad+\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-T(\mathbf{w}^{k})\|^{2}+V_{\alpha}(\mathbf{x}^{*})-\frac{1}{2\alpha}\|\mathbf{w}^{k}-T(\mathbf{w}^{k})\|^{2}
\displaystyle\quad+\frac{1}{\alpha}(\mathbf{w}^{k}-T(\mathbf{w}^{k}))^{\top}(T(\mathbf{w}^{k})-\mathbf{x}^{*}+\mathbf{w}^{k}-T(\mathbf{w}^{k}))
12α𝐰kT(𝐰k)(𝐱T(𝐱))2\displaystyle-\frac{1}{2\alpha}\|\mathbf{w}^{k}-T(\mathbf{w}^{k})-(\mathbf{x}^{*}-T(\mathbf{x}^{*}))\|^{2}
=f^(𝐱)+Vα(𝐱)+1α(𝐱^k𝐱k+1)(𝐱^k𝐱)12α𝐱^k𝐱k+12\displaystyle=\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})+\frac{1}{\alpha}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1})^{\top}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{*})-\frac{1}{2\alpha}\|\hat{\mathbf{x}}^{k}-\mathbf{x}^{k+1}\|^{2}

from (94), (95), and (96), where the last line is obtained because 𝐱=T(𝐱)\mathbf{x}^{*}=T(\mathbf{x}^{*}) holds for 𝐱𝒟\mathbf{x}^{*}\in\mathcal{D}. Therefore, (104) is obtained. ∎

Proof of Theorem 3c

In this proof, assume that \theta^{k}=0 for all k. Then, \hat{\mathbf{x}}^{k}=\mathbf{x}^{k} holds, and the algorithm in (100) coincides with the CPGD with \lambda^{k}=\alpha\in(0,1/\hat{L}] for all k\in\mathbb{N}.

In light of (104) and \hat{\mathbf{x}}^{k}=\mathbf{x}^{k}, we obtain 2\alpha(\Theta^{k+1}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq\|\mathbf{x}^{*}-\mathbf{x}^{k}\|^{2} because 2\alpha(\Theta^{k+1}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq 2(\mathbf{x}^{k}-\mathbf{x}^{k+1})^{\top}(\mathbf{x}^{k}-\mathbf{x}^{*})-\|\mathbf{x}^{k}-\mathbf{x}^{k+1}\|^{2}=\|\mathbf{x}^{*}-\mathbf{x}^{k}\|^{2}-\|\mathbf{x}^{*}-\mathbf{x}^{k+1}\|^{2}\leq\|\mathbf{x}^{*}-\mathbf{x}^{k}\|^{2}. Besides, invoking (103), we have

2\alpha(\Theta^{k+1}-\Theta^{k})\leq-\|\mathbf{x}^{k}-\mathbf{x}^{k+1}\|^{2}\leq 0.

Then, following the same procedure as Theorem 3.1 in [5] and using (102), we obtain (89).
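Concretely, the telescoping step can be sketched as follows (a sketch of the standard argument of [5], not a verbatim restatement of (89)): summing the inequality 2\alpha(\Theta^{j+1}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq\|\mathbf{x}^{*}-\mathbf{x}^{j}\|^{2}-\|\mathbf{x}^{*}-\mathbf{x}^{j+1}\|^{2} over j=0,\ldots,k-1 and using the monotonicity of \Theta^{k} gives

2\alpha k(\Theta^{k}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq 2\alpha\sum_{j=0}^{k-1}(\Theta^{j+1}-(\hat{f}(\mathbf{x}^{*})+V_{\alpha}(\mathbf{x}^{*})))\leq\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2},

which, combined with (102), yields an O(1/k) bound on J(\mathbf{x}^{k})-J(\mathbf{x}^{*}).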

Proof of Theorem 4

Substituting \theta^{k}=(\sigma^{k}-1)/\sigma^{k+1} into (100) yields the ACPGD in (F.2).

Now, by (103), (104), and (σk1)2=σk(σk1)(\sigma^{k-1})^{2}=\sigma^{k}(\sigma^{k}-1), following the procedure of the proof of Theorem 4.4 in [5] gives

\displaystyle(\sigma^{k-1})^{2}(\Theta^{k}-J(\mathbf{x}^{*}))-(\sigma^{k})^{2}(\Theta^{k+1}-J(\mathbf{x}^{*}))
\displaystyle\geq\frac{1}{2\alpha}(\|\boldsymbol{\zeta}^{k+1}\|^{2}-\|\boldsymbol{\zeta}^{k}\|^{2})

with J in (87) and \boldsymbol{\zeta}^{k}=\sigma^{k}(\hat{\mathbf{x}}^{k}-\mathbf{x}^{*})-(\sigma^{k}-1)(\mathbf{x}^{k}-\mathbf{x}^{*}). Thus, summing both sides over k=1,2,\ldots yields

(σk)2(Θk+1J(𝐱))12α𝜻02=12α𝐱0𝐱2.(\sigma^{k})^{2}(\Theta^{k+1}-J(\mathbf{x}^{*}))\leq\frac{1}{2\alpha}\|\boldsymbol{\zeta}^{0}\|^{2}=\frac{1}{2\alpha}\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2}.

By σk(k+1)/2\sigma^{k}\geq(k+1)/2, which can be shown by mathematical induction, we obtain

Θk+1J(𝐱)2𝐱0𝐱2α(k+1)2.\Theta^{k+1}-J(\mathbf{x}^{*})\leq\frac{2\|\mathbf{x}^{0}-\mathbf{x}^{*}\|^{2}}{\alpha(k+1)^{2}}.

Therefore, the inequality (92) follows from (102).
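For completeness, the induction for \sigma^{k}\geq(k+1)/2 can be sketched as follows (assuming \sigma^{0}=1, which is consistent with \boldsymbol{\zeta}^{0}=\mathbf{x}^{0}-\mathbf{x}^{*} above): solving the identity (\sigma^{k-1})^{2}=\sigma^{k}(\sigma^{k}-1) for \sigma^{k}>0 gives

\sigma^{k}=\frac{1+\sqrt{1+4(\sigma^{k-1})^{2}}}{2}\geq\frac{1+\sqrt{1+k^{2}}}{2}\geq\frac{k+1}{2}

whenever \sigma^{k-1}\geq k/2, and \sigma^{0}=1\geq 1/2 gives the base case.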