
On the convergence of the distributed Proximal point algorithm

Woocheol Choi [email protected]
Abstract.

In this work, we establish convergence results for the distributed proximal point algorithm (DPPA) for distributed optimization problems. We consider the problem on the whole domain $\mathbb{R}^{d}$ and find a general condition on the stepsize and the cost functions under which the DPPA is stable. We prove that the DPPA with stepsize $\eta>0$ converges exponentially to an $O(\eta)$-neighborhood of the optimizer. Our result clearly explains the advantage of the DPPA over the distributed gradient descent algorithm with respect to convergence and stability. We also provide numerical tests supporting the theoretical results.

Key words and phrases:
Proximal point method, Distributed optimization, Distributed gradient algorithm
2010 Mathematics Subject Classification:
Primary 90C25, 68Q25

1. Introduction

This work considers the distributed optimization problem

\min_{x\in\mathbb{R}^{d}}f(x)=\sum_{i=1}^{n}f_{i}(x), \qquad (1.1)

where $n$ denotes the number of agents in the network, and $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a differentiable local cost known only to agent $i$ for each $1\leq i\leq n$. In recent years, extensive research has been conducted on distributed optimization due to its relevance in various applications, including wireless sensor networks [1, 24], multi-agent control [4, 22, 5], smart grids [9, 10], and machine learning [11, 21, 30, 3, 27].

Numerous studies have focused on solving distributed optimization problems on networks. Relevant research works include [17, 18, 25, 26], and the references therein. A key algorithm in this field is the distributed gradient descent algorithm (DGD), introduced in [17]. Additionally, there exist various versions of distributed optimization algorithms such as EXTRA [25] and decentralized gradient tracking [18, 20, 28]. Lately, there has been growing interest in designing communication-efficient algorithms for distributed optimization [23, 8, 12, 2].

The distributed proximal point algorithm (DPPA) was proposed in [15, 16] as a distributed counterpart of the proximal point method, analogous to the relation between the distributed gradient descent algorithm and the gradient descent method. The works [15, 16] established the asymptotic convergence of the DPPA under the assumptions of a compact domain for each local cost $f_{i}$ and a decreasing stepsize. The work [14] designed the DPPA on a directed graph and proved a convergence estimate with rate $O(1/\sqrt{t})$ when the stepsize is set to $1/\sqrt{t}$ and each cost function has a compact domain.

It is well known that the proximal point method is more stable than the gradient descent method for large choices of the stepsize. This fact suggests that the DPPA may also be more stable than the DGD, as mentioned in the previous works [13, 14, 16]. In this work, we provide convergence results for the DPPA in the case of the entire domain and a constant stepsize. Comparing our results with the convergence result [29] for the DGD, we find that the DPPA is more stable than the DGD when the stepsize is large.

The DPPA for the problem (1.1) is described as follows:

\begin{split}\hat{x}_{i}(t)&=\sum_{j=1}^{n}w_{ij}x_{j}(t),\\ x_{i}(t+1)&=\operatorname{argmin}_{x\in\mathbb{R}^{d}}\Big(f_{i}(x)+\frac{1}{2\eta}\|x-\hat{x}_{i}(t)\|^{2}\Big),\end{split} \qquad (1.2)

where $w_{ij}$ denotes the weight for the communication among agents in a network described by an undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. Each node in $\mathcal{V}$ represents an agent, and each edge $\{i,j\}\in\mathcal{E}$ means that agent $i$ can send messages to agent $j$ and vice versa. We consider a graph $\mathcal{G}$ satisfying the following assumption.
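As an illustration, one iteration of (1.2) can be sketched for quadratic local costs, for which the proximal subproblem has a closed-form solution. The quadratic costs, the function name `dppa_step`, and the use of NumPy are illustrative choices, not part of the paper:

```python
import numpy as np

def dppa_step(X, W, A, y, eta):
    """One DPPA iteration (1.2) for the illustrative quadratic local costs
    f_i(x) = 0.5 * ||A_i x - y_i||^2 (the algorithm allows any convex f_i).
    X: (n, d) stacked iterates, W: (n, n) mixing matrix."""
    X_hat = W @ X                      # consensus averaging step
    X_new = np.empty_like(X)
    n, d = X.shape
    for i in range(n):
        # proximal step: solve (I + eta * A_i^T A_i) x = x_hat_i + eta * A_i^T y_i
        M = np.eye(d) + eta * A[i].T @ A[i]
        X_new[i] = np.linalg.solve(M, X_hat[i] + eta * A[i].T @ y[i])
    return X_new
```

Each agent's update satisfies the first-order optimality condition of its proximal subproblem, which is exactly the relation (2.2) used later in the analysis.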

Assumption 1.

The communication graph $\mathcal{G}$ is undirected and connected, i.e., there exists a path between any two agents.

We define the mixing matrix $W=\{w_{ij}\}_{1\leq i,j\leq n}$ as follows. A nonnegative weight $w_{ij}$ is assigned to each pair of agents, with $w_{ij}\neq 0$ if $\{i,j\}\in\mathcal{E}$ and $w_{ij}=0$ if $\{i,j\}\notin\mathcal{E}$. In this paper, we make the following assumption on the mixing matrix $W$.

Assumption 2.

The mixing matrix $W=\{w_{ij}\}_{1\leq i,j\leq n}$ is doubly stochastic, i.e., $W\mathbf{1}=\mathbf{1}$ and $\mathbf{1}^{T}W=\mathbf{1}^{T}$. In addition, $w_{ii}>0$ for all $i\in\mathcal{V}$.
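Assumption 2 is straightforward to verify numerically for a given matrix; a small illustrative helper (the function name and NumPy usage are ours):

```python
import numpy as np

def satisfies_assumption2(W, tol=1e-12):
    """Check Assumption 2: W is doubly stochastic with positive diagonal."""
    W = np.asarray(W, dtype=float)
    return (np.all(W >= -tol)                        # nonnegative weights
            and np.allclose(W.sum(axis=1), 1.0)      # W 1 = 1
            and np.allclose(W.sum(axis=0), 1.0)      # 1^T W = 1^T
            and bool(np.all(np.diag(W) > 0)))        # w_ii > 0
```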

In the following result, we establish that the sequence $\{x_{k}(t)\}_{k=1}^{n}$ is stable, i.e., uniformly bounded for $t\geq 0$, under suitable conditions.

Theorem 1.1.

Suppose that Assumptions 1 and 2 hold, and assume that one of the following conditions holds:

  (1) The function

      F_{\eta}(x):=\sum_{k=1}^{n}f_{k}(x_{k})+\frac{1}{2\eta}x^{T}(I-W)x \qquad (1.3)

      is bounded below and has an optimal solution $(x_{1}^{*},\cdots,x_{n}^{*})$.

  (2) Each function $f_{k}$ is $L$-smooth and $f$ is $\alpha$-strongly convex. In addition, the stepsize $\eta>0$ satisfies the following inequality:

      \eta^{2}(\alpha^{2}+\alpha L)+\eta\Big(\alpha+L-\frac{(1-\rho_{W})\alpha^{2}}{L}\Big)<\frac{(1-\rho_{W})\alpha}{L}, \qquad (1.4)

      where $\rho_{W}$ is the spectral norm of the matrix $W-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$.

Then the sequence $\{x_{k}(t)\}_{k=1}^{n}$ generated by the DPPA with stepsize $\eta>0$ is uniformly bounded for $t\geq 0$.
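Condition (1.4) is a scalar inequality in $\eta$, so for given constants it can be tested directly (a small illustrative helper; the name is ours):

```python
def dppa_stepsize_condition(eta, alpha, L, rho_W):
    """Return True iff eta satisfies inequality (1.4) of Theorem 1.1(2)."""
    lhs = (eta**2 * (alpha**2 + alpha * L)
           + eta * (alpha + L - (1.0 - rho_W) * alpha**2 / L))
    rhs = (1.0 - rho_W) * alpha / L
    return lhs < rhs
```

Since the left-hand side is increasing in $\eta>0$ and vanishes at $\eta=0$, the admissible stepsizes form an interval $(0,\eta_{\max})$.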

We now compare the stability result with that of the DGD described by

x_{i}(t+1)=\sum_{j=1}^{n}w_{ij}x_{j}(t)-\eta\nabla f_{i}(x_{i}(t)).

Let $\lambda_{n}(W)\in\mathbb{R}$ be the smallest eigenvalue of $W$. Then it is known from [29] that the DGD is stable if the stepsize $\eta>0$ satisfies

\eta\leq\frac{1+\lambda_{n}(W)}{L},

provided that each cost function $f_{j}$ is convex and $L$-smooth.
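One DGD iteration and the stability threshold of [29] can be sketched as follows (the callable gradient interface and function names are illustrative; the eigenvalue computation assumes a symmetric $W$):

```python
import numpy as np

def dgd_step(X, W, grad, eta):
    """One DGD iteration: x_i(t+1) = sum_j w_ij x_j(t) - eta * grad f_i(x_i(t)).
    grad maps the (n, d) iterate matrix to the (n, d) matrix of local gradients."""
    return W @ X - eta * grad(X)

def dgd_stability_threshold(W, L):
    """Largest stepsize (1 + lambda_n(W)) / L for which [29] guarantees stability."""
    lam_min = np.linalg.eigvalsh((W + W.T) / 2.0).min()
    return (1.0 + lam_min) / L
```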

|      | Condition on the costs                           | Stepsize                             | Paper     |
|------|--------------------------------------------------|--------------------------------------|-----------|
| DGD  | each $f_{j}$ is convex and $L$-smooth            | $\eta\leq\frac{1+\lambda_{n}(W)}{L}$ | [29]      |
| DPPA | $F_{\eta}$ is bounded below and has an optimizer | $\eta\in(0,\infty)$                  | This work |

Table 1. Comparison of the stepsize conditions for the stability of the DGD and the DPPA.

Table 1 summarizes the conditions on the stepsize $\eta>0$ for the stability of the DGD and the DPPA. We observe that $F_{\eta}$ defined in (1.3) is bounded below for any $\eta>0$ whenever each $f_{j}$ is convex. Therefore, condition (1) of Theorem 1.1 is no more restrictive than the convexity of each $f_{j}$, which is required for the stability result [29] of the DGD. Hence Theorem 1.1 shows that the range of admissible stepsizes $\eta>0$ is much wider for the DPPA than for the DGD.

For the convergence analysis of the DPPA, we assume the following uniform boundedness property.

Assumption 3.

The sequence $\{x_{k}(t)\}_{k=1}^{n}$ is uniformly bounded for $t\geq 0$, i.e., there is $R>0$ such that

A_{t}\leq R\quad\textrm{and}\quad B_{t}\leq R\qquad\forall~t\geq 0,

where $A_{t}$ and $B_{t}$ are the quantities defined below.

Although this property is guaranteed for a broad class of functions by Theorem 1.1, we formulate it as an assumption for the sake of simplicity in the statement of the convergence result. We also consider the following assumption.

Assumption 4.

The aggregate cost function $f$ is $\alpha$-strongly convex for some $\alpha>0$, and each cost function $f_{j}$ is $L$-smooth for $1\leq j\leq n$, i.e.,

\|\nabla f_{j}(x)-\nabla f_{j}(y)\|\leq L\|x-y\|

for all $x,y\in\mathbb{R}^{d}$.

Under this assumption, there exists a unique optimizer $x_{*}=\arg\min_{x\in\mathbb{R}^{d}}f(x)$. We let $D=\max_{1\leq i\leq n}\|\nabla f_{i}(x_{*})\|$. Also, we regard $x_{i}(t)$ as a row vector in $\mathbb{R}^{1\times d}$, and define the variables $\mathbf{x}(t)\in\mathbb{R}^{n\times d}$ and $\bar{\mathbf{x}}(t)\in\mathbb{R}^{n\times d}$ by

\mathbf{x}(t)=\left(x_{1}(t)^{T},\cdots,x_{n}(t)^{T}\right)^{T}\quad\textrm{and}\quad\bar{\mathbf{x}}(t)=\left(\bar{x}(t)^{T},\cdots,\bar{x}(t)^{T}\right)^{T}, \qquad (1.5)

where $\bar{x}(t)=\frac{1}{n}\sum_{k=1}^{n}x_{k}(t)$. We let

A_{t}=\|\bar{x}(t)-x_{*}\|\quad\textrm{and}\quad B_{t}=\frac{1}{\sqrt{n}}\|\bar{\mathbf{x}}(t)-\mathbf{x}(t)\|=\Big(\frac{1}{n}\sum_{k=1}^{n}\|\bar{x}(t)-x_{k}(t)\|^{2}\Big)^{1/2}.
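In code, the optimality gap $A_t$ and the consensus error $B_t$ can be computed directly from the stacked iterates (an illustrative NumPy helper; the name is ours):

```python
import numpy as np

def consensus_errors(X, x_star):
    """A_t = ||xbar - x_*|| and B_t = (1/n sum_k ||xbar - x_k||^2)^{1/2}
    for the (n, d) iterate matrix X with rows x_k(t)."""
    xbar = X.mean(axis=0)
    A_t = np.linalg.norm(xbar - x_star)
    B_t = np.sqrt(np.mean(np.sum((X - xbar) ** 2, axis=1)))
    return A_t, B_t
```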

We show in the following result that the DPPA converges exponentially to an $O(\eta)$-neighborhood of the optimizer.

Theorem 1.2.

Suppose that Assumptions 1-4 hold. Then we have

A_{t}\leq\frac{A_{0}}{(1+\eta\alpha)^{t}}+\frac{t\eta LB_{0}\rho_{W}}{(1+\eta\alpha)}\max\Big\{\rho_{W},\frac{1}{1+\eta\alpha}\Big\}^{t-1}+\frac{\eta L(2RL+D)}{\alpha(1-\rho_{W})} \qquad (1.6)

and

B_{t}\leq(\rho_{W})^{t}B_{0}+\frac{\eta}{(1-\rho_{W})}(2RL+D).
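Both right-hand sides are explicit and easy to evaluate for given constants; as $t\to\infty$ they tend to the $O(\eta)$ floors $\frac{\eta L(2RL+D)}{\alpha(1-\rho_W)}$ and $\frac{\eta(2RL+D)}{1-\rho_W}$, respectively (an illustrative helper; the name and parameter values below are ours):

```python
def theorem12_bounds(t, eta, alpha, L, rho_W, A0, B0, R, D):
    """Right-hand sides of the A_t estimate (1.6) and the B_t estimate."""
    r = max(rho_W, 1.0 / (1.0 + eta * alpha))
    A_bound = (A0 / (1.0 + eta * alpha) ** t
               + t * eta * L * B0 * rho_W / (1.0 + eta * alpha) * r ** (t - 1)
               + eta * L * (2.0 * R * L + D) / (alpha * (1.0 - rho_W)))
    B_bound = rho_W ** t * B0 + eta * (2.0 * R * L + D) / (1.0 - rho_W)
    return A_bound, B_bound
```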

Once the DGD is known to be stable, its linear convergence is established when the stepsize satisfies $\eta\leq\frac{2}{\alpha+L}$; we refer to [29, 7] for the details. As for the DPPA, Theorem 1.2 places no restriction on the stepsize $\eta>0$ for the linear convergence. Table 2 compares the conditions on the stepsize of the DGD and the DPPA for the convergence of the algorithms.

|      | Condition on the costs                                      | Stepsize                     | Estimate                                        | Paper     |
|------|-------------------------------------------------------------|------------------------------|-------------------------------------------------|-----------|
| DGD  | each $f_{j}$ is $L$-smooth, $f$ is $\alpha$-strongly convex | $\eta\leq\frac{2}{\alpha+L}$ | $O(e^{-ct})+O\big(\frac{\eta}{1-\rho_{W}}\big)$ | [29, 7]   |
| DPPA | each $f_{j}$ is $L$-smooth, $f$ is $\alpha$-strongly convex | $\eta\in(0,\infty)$          | $O(e^{-ct})+O\big(\frac{\eta}{1-\rho_{W}}\big)$ | This work |

Table 2. Comparison of the stepsize conditions for the convergence of the DGD and the DPPA.

The rest of this paper is organized as follows. In Section 2, we establish two sequential inequalities for $A_{t}$ and $B_{t}$. Section 3 is devoted to the proof of the uniform boundedness result of Theorem 1.1. In Section 4, we prove the convergence result of Theorem 1.2. Numerical results are presented in Section 5.

2. Sequential estimates

In this section, we derive two sequential inequalities for $A_{t}$ and $B_{t}$, which are the main ingredients for the stability and convergence analysis of the DPPA.

Proposition 2.1.

Assume that $f$ is $\alpha$-strongly convex. Then, for any stepsize $\eta>0$, the sequence $\{(A_{t},B_{t})\}_{t\geq 0}$ satisfies the following inequality

(1+\eta\alpha)A_{t+1}\leq A_{t}+\eta LB_{t+1} \qquad (2.1)

for all $t\geq 0$.

Proof.

From the minimality of (1.2), it follows that

\nabla f_{k}(x_{k}(t+1))+\frac{1}{\eta}\Big(x_{k}(t+1)-\sum_{j=1}^{n}w_{kj}x_{j}(t)\Big)=0. \qquad (2.2)

We reformulate this as

x_{k}(t+1)+\eta\nabla f_{k}(x_{k}(t+1))=\sum_{j=1}^{n}w_{kj}x_{j}(t). \qquad (2.3)

By averaging this over $1\leq k\leq n$, we get

\bar{x}(t+1)+\frac{\eta}{n}\sum_{k=1}^{n}\nabla f_{k}(x_{k}(t+1))=\bar{x}(t). \qquad (2.4)

Using this and the fact that $\nabla f(x_{*})=0$, we find

\begin{split}&\bar{x}(t+1)-x_{*}+\eta(\nabla f(\bar{x}(t+1))-\nabla f(x_{*}))\\ &=\bar{x}(t)-x_{*}+\eta\Big(\sum_{k=1}^{n}\nabla f_{k}(\bar{x}(t+1))-\nabla f_{k}(x_{k}(t+1))\Big).\end{split} \qquad (2.5)

Since ff is assumed to be α\alpha-strongly convex, we have

\|\nabla f(x)-\nabla f(y)\|^{2}\geq\alpha^{2}\|x-y\|^{2}

and

\langle x-y,~\nabla f(x)-\nabla f(y)\rangle\geq\alpha\|x-y\|^{2}.

Combining these estimates, we get

\begin{split}&\Big\|\bar{x}(t+1)-x_{*}+\eta\Big(\nabla f(\bar{x}(t+1))-\nabla f(x_{*})\Big)\Big\|^{2}\\ &=\|\bar{x}(t+1)-x_{*}\|^{2}+2\eta\langle\bar{x}(t+1)-x_{*},~\nabla f(\bar{x}(t+1))-\nabla f(x_{*})\rangle\\ &\qquad+\eta^{2}\|\nabla f(\bar{x}(t+1))-\nabla f(x_{*})\|^{2}\\ &\geq(1+2\eta\alpha+\eta^{2}\alpha^{2})\|\bar{x}(t+1)-x_{*}\|^{2}=(1+\eta\alpha)^{2}\|\bar{x}(t+1)-x_{*}\|^{2}.\end{split}

Using this estimate in (2.5) and applying the triangle inequality, we get

\begin{split}&(1+\eta\alpha)\|\bar{x}(t+1)-x_{*}\|\\ &\leq\|\bar{x}(t)-x_{*}\|+\frac{\eta}{n}\Big\|\sum_{k=1}^{n}\nabla f_{k}(\bar{x}(t+1))-\nabla f_{k}(x_{k}(t+1))\Big\|\\ &\leq\|\bar{x}(t)-x_{*}\|+\frac{\eta L}{n}\sum_{k=1}^{n}\|\bar{x}(t+1)-x_{k}(t+1)\|\\ &\leq\|\bar{x}(t)-x_{*}\|+\frac{\eta L}{\sqrt{n}}\|\bar{\mathbf{x}}(t+1)-\mathbf{x}(t+1)\|,\end{split}

where we used the Cauchy-Schwarz inequality in the last step. This gives the desired inequality. ∎

Next we derive a bound on $B_{t+1}$. For this we will use the following result (see [19, Lemma 1]).

Lemma 2.2.

Suppose Assumptions 1 and 2 hold, and let $\rho_{W}$ be the spectral norm of the matrix $W-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$. Then we have $\rho_{W}<1$ and

\sum_{i=1}^{n}\Big\|\sum_{j=1}^{n}w_{ij}(x_{j}-\bar{x})\Big\|^{2}\leq(\rho_{W})^{2}\sum_{i=1}^{n}\|x_{i}-\bar{x}\|^{2},

where $\bar{x}=\frac{1}{n}\sum_{k=1}^{n}x_{k}$, for any $x_{i}\in\mathbb{R}^{d\times 1}$, $1\leq i\leq n$.
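Lemma 2.2 is easy to sanity-check numerically: for a doubly stochastic $W$, the mixing step contracts the disagreement by at least the factor $\rho_{W}$. The lazy cycle matrix below is our own illustrative choice:

```python
import numpy as np

n = 6
# lazy random walk on a cycle: doubly stochastic with positive diagonal
W = 0.5 * np.eye(n)
for i in range(n):
    W[i, (i + 1) % n] += 0.25
    W[i, (i - 1) % n] += 0.25

rho_W = np.linalg.norm(W - np.ones((n, n)) / n, 2)  # spectral norm of W - (1/n)11^T

rng = np.random.default_rng(1)
X = rng.standard_normal((n, 3))
Dm = X - X.mean(axis=0)                  # disagreement matrix, rows x_i - xbar
contraction = np.linalg.norm(W @ Dm) / np.linalg.norm(Dm)
```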

In the following result, we find an estimate of $B_{t+1}$ in terms of $B_{t}$ and $A_{t+1}$.

Proposition 2.3.

Suppose that each $f_{j}$ is $L$-smooth. Then the sequence $\{(A_{t},B_{t})\}_{t\geq 0}$ satisfies the following inequality

B_{t+1}\leq\rho_{W}B_{t}+\eta LB_{t+1}+\eta LA_{t+1}+\eta D \qquad (2.6)

for all $t\geq 0$.

Proof.

We may write (2.3) and (2.4) in the following way:

\mathbf{x}(t+1)+\eta\nabla F(\mathbf{x}(t+1))=W\mathbf{x}(t)

and

\bar{\mathbf{x}}(t+1)+\eta\overline{\nabla F}(\mathbf{x}(t+1))=\bar{\mathbf{x}}(t),

where $\nabla F(\mathbf{x}(t+1))\in\mathbb{R}^{n\times d}$ stacks the local gradients $\nabla f_{k}(x_{k}(t+1))$, $1\leq k\leq n$, as its rows, and

\overline{\nabla F}(\mathbf{x}(t+1))=\frac{1}{n}\mathbf{1}\mathbf{1}^{T}\nabla F(\mathbf{x}(t+1))

replaces each row by the average of the rows.

Combining the above equalities, we find

\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)=W(\mathbf{x}(t)-\bar{\mathbf{x}}(t))-\eta\Big(\nabla F(\mathbf{x}(t+1))-\overline{\nabla F}(\mathbf{x}(t+1))\Big).

By applying the triangle inequality and Lemma 2.2, we deduce

\begin{split}&\|\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)\|\\ &\leq\rho_{W}\|\mathbf{x}(t)-\bar{\mathbf{x}}(t)\|+\eta\|\nabla F(\mathbf{x}(t+1))-\overline{\nabla F}(\mathbf{x}(t+1))\|.\end{split} \qquad (2.7)

Using the fact that the spectral norm of the matrix $I_{n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$ is one, we obtain

\begin{split}&\|\nabla F(\mathbf{x}(t+1))-\overline{\nabla F}(\mathbf{x}(t+1))\|\\ &\leq\|\nabla F(\mathbf{x}(t+1))\|\\ &\leq\|\nabla F(\mathbf{x}(t+1))-\nabla F(\bar{\mathbf{x}}(t+1))\|+\|\nabla F(\bar{\mathbf{x}}(t+1))-\nabla F(\mathbf{x}_{*})\|+\|\nabla F(\mathbf{x}_{*})\|\\ &\leq L\|\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)\|+\sqrt{n}L\|\bar{x}(t+1)-x_{*}\|+\sqrt{n}D,\end{split}

where $\mathbf{x}_{*}\in\mathbb{R}^{n\times d}$ stacks $n$ copies of $x_{*}$.

Inserting this into (2.7) we obtain

\begin{split}&\|\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)\|\\ &\leq\rho_{W}\|\mathbf{x}(t)-\bar{\mathbf{x}}(t)\|+\eta L\|\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)\|\\ &\quad+\sqrt{n}L\eta\|\bar{x}(t+1)-x_{*}\|+\sqrt{n}D\eta,\end{split}

which is the desired inequality. ∎

3. Boundedness of the sequence

We prove the uniform boundedness result of Theorem 1.1 under the two conditions separately below.

Proof of Theorem 1.1.

Assume the first condition of Theorem 1.1. We claim the following inequality:

\sum_{k=1}^{n}\|x_{k}(t+1)-x_{k}^{*}\|\leq\sum_{k=1}^{n}\|x_{k}(t)-x_{k}^{*}\|\quad\forall~t\geq 0. \qquad (3.1)

The optimizer $(x_{1}^{*},\cdots,x_{n}^{*})$ of $F_{\eta}$ in (1.3) satisfies

\nabla f_{k}(x_{k}^{*})+\frac{1}{\eta}\Big(x_{k}^{*}-\sum_{j=1}^{n}w_{kj}x_{j}^{*}\Big)=0.

Combining this with (2.2) gives

\eta(\nabla f_{k}(x_{k}(t+1))-\nabla f_{k}(x_{k}^{*}))+(x_{k}(t+1)-x_{k}^{*})=\sum_{j=1}^{n}w_{kj}(x_{j}(t)-x_{j}^{*}).

From this, using the monotonicity of $\nabla f_{k}$, we have

\|x_{k}(t+1)-x_{k}^{*}\|\leq\Big\|\sum_{j=1}^{n}w_{kj}(x_{j}(t)-x_{j}^{*})\Big\|\leq\sum_{j=1}^{n}w_{kj}\|x_{j}(t)-x_{j}^{*}\|.

Summing up this for 1kn1\leq k\leq n, we get

\sum_{k=1}^{n}\|x_{k}(t+1)-x_{k}^{*}\|\leq\sum_{k=1}^{n}\sum_{j=1}^{n}w_{kj}\|x_{j}(t)-x_{j}^{*}\|=\sum_{j=1}^{n}\|x_{j}(t)-x_{j}^{*}\|,

which proves the inequality (3.1). Iterating (3.1) gives the bound

\sum_{k=1}^{n}\|x_{k}(t)-x_{k}^{*}\|\leq\sum_{j=1}^{n}\|x_{j}(0)-x_{j}^{*}\|\quad\forall~t\geq 0.

Hence $A_{t}$ and $B_{t}$ are uniformly bounded.

Next we assume the second condition of Theorem 1.1. Then we claim that the sequences $A_{t}$ and $B_{t}$ satisfy

A_{t}\leq R\quad\textrm{and}\quad B_{t}\leq\frac{\alpha}{L}R,

where $R>0$ is defined by

R=\max\Bigg\{A_{0},~\frac{L}{\alpha}B_{0},~\frac{L}{\alpha}B_{1},~\frac{\eta D}{\frac{\alpha}{L}\Big(1-\eta L-\frac{\eta^{2}L^{2}}{(1+\eta\alpha)}\Big)-\frac{\alpha}{L}\rho_{W}-\frac{\eta L}{(1+\eta\alpha)}}\Bigg\}. \qquad (3.2)

We argue by induction to prove this claim. First, we note that $A_{0}\leq R$, $B_{0}\leq\frac{\alpha}{L}R$ and $B_{1}\leq\frac{\alpha}{L}R$ by the definition of $R$. Next we assume that

A_{t}\leq R\quad\textrm{and}\quad B_{t+1}\leq cR \qquad (3.3)

for some $t\geq 0$, where $c=\frac{\alpha}{L}$. Then, we use these bounds in (2.1) to find

(1+\eta\alpha)A_{t+1}\leq(R+\eta LcR)=(1+\eta\alpha)R,

which gives $A_{t+1}\leq R$. Next we recall the estimates (2.6) and (2.1) with $t$ replaced by $t+1$:

B_{t+2}\leq\rho_{W}B_{t+1}+\eta LB_{t+2}+\eta LA_{t+2}+\eta D

and

\begin{split}(1+\eta\alpha)A_{t+2}&\leq A_{t+1}+\eta LB_{t+2}\\ &\leq R+\eta LB_{t+2},\end{split}

where we used $A_{t+1}\leq R$ in the last inequality. Combining these estimates with (3.3) gives

\Big(1-\eta L-\frac{\eta^{2}L^{2}}{(1+\eta\alpha)}\Big)B_{t+2}\leq\rho_{W}(cR)+\frac{\eta L}{1+\eta\alpha}R+\eta D.

This gives $B_{t+2}\leq cR$ provided that

\rho_{W}(cR)+\frac{\eta L}{(1+\eta\alpha)}R+\eta D\leq Rc\Big(1-\eta L-\frac{\eta^{2}L^{2}}{(1+\eta\alpha)}\Big).

This holds true for $R>0$ defined in (3.2) and $\eta>0$ satisfying

\frac{\alpha}{L}\Big((1-\rho_{W})-\eta L-\frac{\eta^{2}L^{2}}{(1+\eta\alpha)}\Big)>\frac{\eta L}{(1+\eta\alpha)},

which is equivalent to

\eta^{2}(\alpha^{2}+\alpha L)+\eta\Big(\alpha+L-\frac{(1-\rho_{W})\alpha^{2}}{L}\Big)<\frac{(1-\rho_{W})\alpha}{L}.

This is precisely the inequality (1.4) in condition (2). Thus $A_{t+1}\leq R$ and $B_{t+2}\leq cR$, which completes the induction and proves the claim. Hence the uniform boundedness is proved. ∎

4. Convergence result

In this section, we prove the main convergence result of the decentralized proximal point method.

Proof of Theorem 1.2.

By Proposition 2.1 and Proposition 2.3 we have the following inequalities:

(1+\eta\alpha)A_{t+1}\leq A_{t}+\eta LB_{t+1} \qquad (4.1)

and

B_{t+1}\leq\rho_{W}B_{t}+\eta LB_{t+1}+\eta LA_{t+1}+\eta D.

Using the bounds $A_{t+1}\leq R$ and $B_{t+1}\leq R$ from Assumption 3, we have

B_{t+1}\leq\rho_{W}B_{t}+\eta(2RL+D)

for all $t\geq 0$. Using this iteratively, we get

\begin{split}B_{t}&\leq(\rho_{W})^{t}B_{0}+\eta(2RL+D)\Big[1+\rho_{W}+\cdots+(\rho_{W})^{t-1}\Big]\\ &\leq(\rho_{W})^{t}B_{0}+\frac{\eta}{(1-\rho_{W})}(2RL+D).\end{split}

Putting this inequality into (4.1) leads to

\begin{split}&(1+\eta\alpha)A_{t+1}\\ &\leq A_{t}+\eta L\Big[(\rho_{W})^{t+1}B_{0}+\frac{\eta}{(1-\rho_{W})}(2RL+D)\Big]\\ &=A_{t}+\eta LB_{0}(\rho_{W})^{t+1}+\frac{\eta^{2}L}{(1-\rho_{W})}(2RL+D).\end{split} \qquad (4.2)

To analyze this sequential estimate, we consider two positive sequences $\{z_{t}\}_{t\geq 0}$ and $\{y_{t}\}_{t\geq 0}$ satisfying

(1+\eta\alpha)z_{t+1}=z_{t}+\eta LB_{0}(\rho_{W})^{t+1}

and

(1+\eta\alpha)y_{t+1}=y_{t}+\frac{\eta^{2}L}{(1-\rho_{W})}(2RL+D),

where $z_{0}=A_{0}$ and $y_{0}=0$. It then follows from (4.2) that $A_{t}\leq z_{t}+y_{t}$ for all $t\geq 0$.

We estimate $z_{t}$ as follows:

\begin{split}z_{t}&=\frac{z_{0}}{(1+\eta\alpha)^{t}}+\frac{\eta LB_{0}\rho_{W}}{(1+\eta\alpha)}\Big[\sum_{j=0}^{t-1}\frac{(\rho_{W})^{j}}{(1+\eta\alpha)^{t-1-j}}\Big]\\ &\leq\frac{z_{0}}{(1+\eta\alpha)^{t}}+\frac{t\eta LB_{0}\rho_{W}}{(1+\eta\alpha)}\max\Big\{\rho_{W},\frac{1}{1+\eta\alpha}\Big\}^{t-1}.\end{split}

Next we estimate $y_{t}$ as

\begin{split}y_{t}&=\frac{y_{0}}{(1+\eta\alpha)^{t}}+\frac{\eta^{2}L}{(1+\eta\alpha)(1-\rho_{W})}(2RL+D)\sum_{j=0}^{t-1}\frac{1}{(1+\eta\alpha)^{j}}\\ &\leq\frac{y_{0}}{(1+\eta\alpha)^{t}}+\frac{\eta^{2}L}{(1+\eta\alpha)(1-\rho_{W})}(2RL+D)\sum_{j=0}^{\infty}\frac{1}{(1+\eta\alpha)^{j}}\\ &=\frac{y_{0}}{(1+\eta\alpha)^{t}}+\frac{\eta L(2RL+D)}{\alpha(1-\rho_{W})}.\end{split}

Combining the above estimates gives

\begin{split}A_{t}&\leq z_{t}+y_{t}\\ &\leq\frac{A_{0}}{(1+\eta\alpha)^{t}}+\frac{t\eta LB_{0}\rho_{W}}{(1+\eta\alpha)}\max\Big\{\rho_{W},\frac{1}{1+\eta\alpha}\Big\}^{t-1}+\frac{\eta L(2RL+D)}{\alpha(1-\rho_{W})}.\end{split}

This is the desired estimate (1.6), and the proof is complete. ∎
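The comparison argument above can be checked numerically by iterating the exact recursions for $z_{t}$ and $y_{t}$ and comparing them against the closed-form bounds (an illustrative script with arbitrarily chosen constants):

```python
# arbitrary illustrative constants (not from the paper)
eta, alpha, L, rho_W = 0.1, 1.0, 2.0, 0.6
A0, B0, R, D = 1.0, 1.0, 1.0, 1.0

z, y = A0, 0.0                                   # z_0 = A_0, y_0 = 0
c = eta**2 * L * (2 * R * L + D) / (1 - rho_W)   # constant term of the y-recursion
gap = 0.0                                        # worst violation of the bounds
for t in range(1, 201):
    z = (z + eta * L * B0 * rho_W**t) / (1 + eta * alpha)
    y = (y + c) / (1 + eta * alpha)
    r = max(rho_W, 1.0 / (1 + eta * alpha))
    z_bound = (A0 / (1 + eta * alpha)**t
               + t * eta * L * B0 * rho_W / (1 + eta * alpha) * r**(t - 1))
    y_bound = eta * L * (2 * R * L + D) / (alpha * (1 - rho_W))
    gap = max(gap, z - z_bound, y - y_bound)
```

Here `gap` stays (up to rounding) nonpositive, confirming that the closed-form expressions dominate the recursions.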

5. Numerical tests

This section presents numerical test results for the DPPA. We consider the cost function

f(x)=\frac{1}{n}\sum_{k=1}^{n}\|A_{k}x-y_{k}\|^{2},

where $n$ is the number of agents and, for each $1\leq i\leq n$, $A_{i}$ is an $m\times d$ matrix whose entries are chosen randomly from the normal distribution $N(0,1)$. Similarly, each element of $y_{i}\in\mathbb{R}^{m}$ is generated from the normal distribution $N(0,1)$. We choose $n=20$, $m=5$ and $d=10$.
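The synthetic problem can be generated as follows; the minimizer $x_{*}$ used in the error plots is the stacked least-squares solution. This is a sketch under the stated distributions, with function names of our own choosing:

```python
import numpy as np

def make_problem(n=20, m=5, d=10, seed=0):
    """Random least-squares data: A_i is m x d, y_i in R^m, entries N(0, 1)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, m, d))
    y = rng.standard_normal((n, m))
    # minimizer of f(x) = (1/n) sum_k ||A_k x - y_k||^2: stacked least squares
    x_star, *_ = np.linalg.lstsq(A.reshape(n * m, d), y.reshape(n * m), rcond=None)
    return A, y, x_star
```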

For the communication matrix $W$, we link each pair of agents with probability 0.4, and define the weights $w_{ij}$ by

w_{ij}=\left\{\begin{array}{ll}1/\max\{\textrm{deg}(i),\textrm{deg}(j)\}&~\textrm{if}~j\in N_{i},\\ 1-\sum_{l\in N_{i}}w_{il}&~\textrm{if}~i=j,\\ 0&~\textrm{otherwise},\end{array}\right.

where $N_{i}$ denotes the set of neighbors of agent $i$ and $\textrm{deg}(i)=|N_{i}|$.
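This Metropolis-type weight rule can be implemented directly; since the off-diagonal rule is symmetric and each row sums to one, the resulting $W$ is doubly stochastic. The implementation below, including the resampling loop for a connected Erdős–Rényi graph, is our own illustrative sketch:

```python
import numpy as np

def metropolis_weights(adj):
    """Mixing matrix from a 0/1 adjacency matrix: w_ij = 1/max(deg i, deg j)
    on edges, with w_ii chosen so that every row sums to one."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / max(deg[i], deg[j])
        W[i, i] = 1.0 - W[i].sum()
    return W

def random_connected_adjacency(n=20, p=0.4, seed=0):
    """Undirected Erdos-Renyi graph with link probability p, resampled until connected."""
    rng = np.random.default_rng(seed)
    while True:
        U = rng.random((n, n)) < p
        adj = np.triu(U, 1)
        adj = (adj | adj.T).astype(float)
        # connected iff (I + adj)^n has no zero entry
        if np.all(np.linalg.matrix_power(np.eye(n) + adj, n) > 0):
            return adj
```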

We define the following value:

\eta_{c}=\frac{1+\lambda_{n}(W)}{L},

where $\lambda_{n}(W)$ is the smallest eigenvalue of $W$ and $L$ is the smoothness constant of the $f_{j}$ for $1\leq j\leq n$. In our experiment, the constants are computed as $L\simeq 29.7312$ and $\lambda_{n}(W)\simeq-0.4009$. Therefore we find $\eta_{c}\simeq 0.0202$.
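Given $W$ and the data, $\eta_{c}$ is computed from the smallest eigenvalue of $W$ and the largest local smoothness constant; assuming the local costs are read as $f_{i}(x)=\|A_{i}x-y_{i}\|^{2}$, that constant is $L=\max_{i}2\|A_{i}\|_{2}^{2}$ (the helper name and this reading are our assumptions):

```python
import numpy as np

def critical_stepsize(W, A):
    """eta_c = (1 + lambda_n(W)) / L with L = max_i 2*||A_i||_2^2,
    the smoothness constant of f_i(x) = ||A_i x - y_i||^2 (symmetric W assumed)."""
    lam_min = np.linalg.eigvalsh((W + W.T) / 2.0).min()
    L = max(2.0 * np.linalg.norm(A[i], 2) ** 2 for i in range(A.shape[0]))
    return (1.0 + lam_min) / L
```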

We perform numerical simulations of the DPPA and the DGD with stepsize chosen as $\eta=\eta_{c}+0.005$ and $\eta=\eta_{c}$. Figure 1 shows the graphs of the error $\log(\sum_{k=1}^{n}\|x_{k}(t)-x_{*}\|/n)$ with respect to $t\geq 0$. The result shows that both the DPPA and the DGD are stable for $\eta=\eta_{c}$, as the theoretical results of Table 1 guarantee. On the other hand, if we choose $\eta=\eta_{c}+0.005$, then the DGD becomes unstable while the DPPA remains stable. This supports the result of Theorem 1.1.

Figure 1. The graphs of the error $\log(\sum_{k=1}^{n}\|x_{k}(t)-x_{*}\|/n)$ for the DPPA and the DGD for $t\geq 0$ with stepsize $\eta=\eta_{c}+0.005$ and $\eta=\eta_{c}$.

Next we perform a numerical test of the DPPA with various stepsizes $\eta\in\{0.001,0.01,0.1,1,2\}$. We measure the error $\log(\sum_{k=1}^{n}\|x_{k}(t)-x_{*}\|/n)$, and the graphs are given in Figure 2. The result shows that the error decreases exponentially up to an $O(\eta)$ value, which supports the convergence result of Theorem 1.2.

Figure 2. The graphs of the error $\log(\sum_{k=1}^{n}\|x_{k}(t)-x_{*}\|/n)$ for the DPPA with respect to $t\geq 0$ and stepsize $\eta\in\{0.001,0.01,0.1,1,2\}$.

Acknowledgments

The author was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government No. NRF-2016R1A5A1008055 and No. NRF-2021R1F1A1059671.

References

  • [1] J. A. Bazerque, G.B. Giannakis, Distributed spectrum sensing for cognitive radio networks by exploiting sparsity. IEEE Trans. Signal Process. 58(3), 1847–1862 (2010)
  • [2] A.S. Berahas, R. Bollapragada, N.S. Keskar, E. Wei, Balancing communication and computation in distributed optimization. IEEE Trans. Autom. Control 64(8), 3141–3155 (2018)
  • [3] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
  • [4] F. Bullo, J. Cortes, S. Martinez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms, vol. 27, Princeton University Press, Princeton, 2009.
  • [5] Y. Cao, W. Yu, W. Ren, and G. Chen, An overview of recent progress in the study of distributed multi-agent coordination, IEEE Trans Ind. Informat., 9 (2013), pp. 427–438
  • [6] W. Choi, D. Kim, S. Yun, Convergence results of a nested decentralized gradient method for non-strongly convex problems. J. Optim. Theory Appl. 195 (2022), no. 1, 172–204.
  • [7] W. Choi, J. Kim, On the convergence of decentralized gradient descent with diminishing stepsize, revisited, arXiv:2203.09079.
  • [8] A. I. Chen and A. Ozdaglar, A fast distributed proximal-gradient method, in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on. IEEE, 2012, pp. 601–608.
  • [9] G. B. Giannakis, V. Kekatos, N. Gatsis, S.-J. Kim, H. Zhu, and B. Wollenberg, Monitoring and optimization for power grids: A signal processing perspective, IEEE Signal Processing Mag., 30 (2013), pp. 107–128.
  • [10] V. Kekatos and G. B. Giannakis, Distributed robust power system state estimation, IEEE Trans. Power Syst., 28 (2013), pp. 1617–1626,
  • [11] P. A. Forero, A. Cano, and G. B. Giannakis, Consensus-based distributed support vector machines, Journal of Machine Learning Research, vol. 11, pp. 1663–1707, 2010.
  • [12] D. Jakovetic, J. Xavier, and J. M. F. Moura, Fast Distributed Gradient Methods, IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
  • [13] X. Li, G. Feng, L. Xie, Distributed proximal algorithms for multiagent optimization with coupled inequality constraints. IEEE Trans. Automat. Control 66 (2021), no. 3, 1223–1230.
  • [14] X. Li, G. Feng, L. Xie, Distributed Proximal Point Algorithm for Constrained Optimization over Unbalanced Graphs. In 2019 IEEE 15th International Conference on Control and Automation (ICCA) (pp. 824-829).
  • [15] K. Margellos, A. Falsone, S. Garatti, and M. Prandini, Proximal minimization based distributed convex optimization, in 2016 American Control Conference (ACC), July 2016, pp. 2466–2471.
  • [16] K. Margellos, A. Falsone, S. Garatti, and M. Prandini, Distributed constrained optimization and consensus in uncertain networks via proximal minimization, IEEE Trans. Autom. Control, vol. 63, no. 5, pp. 1372–1387, May 2018.
  • [17] A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [18] A. Nedić, A. Olshevsky, and W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs, SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
  • [19] S. Pu and A. Nedić, Distributed stochastic gradient tracking methods, Math. Program, pp. 1–49, 2018
  • [20] G. Qu and N. Li, Harnessing smoothness to accelerate distributed optimization, IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2018.
  • [21] H. Raja and W. U. Bajwa, Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data, IEEE Transactions on Signal Processing, vol. 64, no. 1, pp. 173–188, Jan. 2016.
  • [22] W. Ren, Consensus Based Formation Control Strategies for Multi-Vehicle Systems, in Proceedings of the American Control Conference, 2006, pp. 4237–4242.
  • [23] A. H. Sayed, Diffusion adaptation over networks. Academic Press Library in Signal Processing, 2013, vol. 3.
  • [24] I.D. Schizas, G.B. Giannakis, S.I. Roumeliotis, A. Ribeiro, Consensus in ad hoc WSNs with noisy links—part II: distributed estimation and smoothing of random signals. IEEE Trans. Signal Process. 56(4), 1650–1666 (2008)
  • [25] W. Shi, Q. Ling, G. Wu, W. Yin, Extra: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
  • [26] W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Process. 62(7), 1750–1761 (2014)
  • [27] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning, in Proceedings of the IEEE Allerton Conference on Communication, Control, and Computing, IEEE, New York, 2012, pp. 1543–1550.
  • [28] R. Xin and U. A. Khan, A linear algorithm for optimization over directed graphs with geometric convergence, IEEE Control Systems Letters, vol. 2, no. 3, pp. 315–320, 2018.
  • [29] K. Yuan, Q. Ling, and W. Yin, On the convergence of decentralized gradient descent, SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, Sep. 2016.
  • [30] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, A survey of distributed optimization, Annual Reviews in Control, vol. 47, pp. 278–305, 2019.