
On the convergence of the distributed Proximal point algorithm

Woocheol Choi [email protected]
Abstract.

In this work, we establish convergence results for the distributed proximal point algorithm (DPPA) for distributed optimization problems. We consider the problem on the whole domain $\mathbb{R}^{d}$ and find a general condition on the stepsize and the cost functions under which the DPPA is stable. We prove that the DPPA with stepsize $\eta>0$ converges exponentially to an $O(\eta)$-neighborhood of the optimizer. Our result clearly explains the advantage of the DPPA over the distributed gradient descent algorithm with respect to convergence and stability. We also provide numerical tests supporting the theoretical results.

Key words and phrases:
Proximal point method, Distributed optimization, Distributed gradient algorithm
2010 Mathematics Subject Classification:
Primary 90C25, 68Q25

1. Introduction

This work considers the distributed optimization problem

\min_{x\in\mathbb{R}^{d}}f(x)=\sum_{i=1}^{n}f_{i}(x), \qquad (1.1)

where $n$ denotes the number of agents in the network, and $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a differentiable local cost known only to agent $i$ for each $1\leq i\leq n$. In recent years, extensive research has been conducted on distributed optimization due to its relevance in various applications, including wireless sensor networks [1, 24], multi-agent control [4, 22, 5], smart grids [9, 10], and machine learning [11, 21, 30, 3, 27].

Numerous studies have focused on solving distributed optimization problems on networks. Relevant research works include [17, 18, 25, 26], and the references therein. A key algorithm in this field is the distributed gradient descent algorithm (DGD), introduced in [17]. Additionally, there exist various versions of distributed optimization algorithms such as EXTRA [25] and decentralized gradient tracking [18, 20, 28]. Lately, there has been growing interest in designing communication-efficient algorithms for distributed optimization [23, 8, 12, 2].

The distributed proximal point algorithm (DPPA) was proposed in [15, 16] as a distributed counterpart of the proximal point method, analogous to the relation between the distributed gradient descent algorithm and the gradient descent method. The works [15, 16] established the asymptotic convergence of the DPPA under the assumptions of a compact domain for each local cost $f_{i}$ and a decreasing stepsize. The work [14] designed the DPPA on a directed graph and proved a convergence estimate with rate $O(1/\sqrt{t})$ when the stepsize is set to $1/\sqrt{t}$ and each cost function has a compact domain.

It is well known that the proximal point method is more stable than the gradient descent method for large choices of the stepsize. This fact suggests that the DPPA may also be more stable than the DGD, as mentioned in the previous works [13, 14, 16]. In this work, we provide convergence results for the DPPA in the case of the entire domain and a constant stepsize. Comparing our results with the convergence result [29] for the DGD, we find that the DPPA is more stable than the DGD when the stepsize is large.

The DPPA for the problem (1.1) is described as follows:

\begin{split}\hat{x}_{i}(t)&=\sum_{j=1}^{n}w_{ij}x_{j}(t),\\ x_{i}(t+1)&=\operatorname{argmin}_{x\in\mathbb{R}^{d}}\Big(f_{i}(x)+\frac{1}{2\eta}\|x-\hat{x}_{i}(t)\|^{2}\Big),\end{split} \qquad (1.2)

where $w_{ij}$ denotes the weight for the communication among agents in a network described by an undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. Each node in $\mathcal{V}$ represents an agent, and each edge $\{i,j\}\in\mathcal{E}$ means that agent $i$ can send messages to agent $j$ and vice versa. We consider a graph $\mathcal{G}$ satisfying the following assumption.
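As an illustration, one iteration of (1.2) can be sketched for quadratic local costs, for which the proximal subproblem has a closed-form solution. The quadratic costs, the function name `dppa_step`, and the use of NumPy are illustrative choices, not part of the paper:

```python
import numpy as np

def dppa_step(X, W, A, y, eta):
    """One DPPA iteration (1.2) for the illustrative quadratic local costs
    f_i(x) = 0.5 * ||A_i x - y_i||^2 (the algorithm allows any convex f_i).
    X: (n, d) stacked iterates, W: (n, n) mixing matrix."""
    X_hat = W @ X                      # consensus averaging step
    X_new = np.empty_like(X)
    n, d = X.shape
    for i in range(n):
        # proximal step: solve (I + eta * A_i^T A_i) x = x_hat_i + eta * A_i^T y_i
        M = np.eye(d) + eta * A[i].T @ A[i]
        X_new[i] = np.linalg.solve(M, X_hat[i] + eta * A[i].T @ y[i])
    return X_new
```

Each agent's update satisfies the first-order optimality condition of its proximal subproblem, which is exactly the relation (2.2) used later in the analysis.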

Assumption 1.

The communication graph $\mathcal{G}$ is undirected and connected, i.e., there exists a path between any two agents.

We define the mixing matrix $W=\{w_{ij}\}_{1\leq i,j\leq n}$ as follows. A nonnegative weight $w_{ij}$ is assigned to each pair of agents, with $w_{ij}\neq 0$ if $\{i,j\}\in\mathcal{E}$ and $w_{ij}=0$ if $\{i,j\}\notin\mathcal{E}$. In this paper, we make the following assumption on the mixing matrix $W$.

Assumption 2.

The mixing matrix $W=\{w_{ij}\}_{1\leq i,j\leq n}$ is doubly stochastic, i.e., $W\mathbf{1}=\mathbf{1}$ and $\mathbf{1}^{T}W=\mathbf{1}^{T}$. In addition, $w_{ii}>0$ for all $i\in\mathcal{V}$.
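Assumption 2 is straightforward to verify numerically for a given matrix; a small illustrative helper (the function name and NumPy usage are ours):

```python
import numpy as np

def satisfies_assumption2(W, tol=1e-12):
    """Check Assumption 2: W is doubly stochastic with positive diagonal."""
    W = np.asarray(W, dtype=float)
    return (np.all(W >= -tol)                        # nonnegative weights
            and np.allclose(W.sum(axis=1), 1.0)      # W 1 = 1
            and np.allclose(W.sum(axis=0), 1.0)      # 1^T W = 1^T
            and bool(np.all(np.diag(W) > 0)))        # w_ii > 0
```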

In the following result, we establish that the sequence $\{x_{k}(t)\}_{k=1}^{n}$ is stable, i.e., uniformly bounded for $t\geq 0$, under suitable conditions.

Theorem 1.1.

Suppose that Assumptions 1 and 2 hold, and assume that one of the following conditions holds:

  (1) The function

      F_{\eta}(x):=\sum_{k=1}^{n}f_{k}(x_{k})+\frac{1}{2\eta}x^{T}(I-W)x \qquad (1.3)

      is bounded below and has an optimal solution $(x_{1}^{*},\cdots,x_{n}^{*})$.

  (2) Each function $f_{k}$ is $L$-smooth and $f$ is $\alpha$-strongly convex. In addition, the stepsize $\eta>0$ satisfies the following inequality:

      \eta^{2}(\alpha^{2}+\alpha L)+\eta\Big(\alpha+L-\frac{(1-\rho_{W})\alpha^{2}}{L}\Big)<\frac{(1-\rho_{W})\alpha}{L}, \qquad (1.4)

      where $\rho_{W}$ is the spectral norm of the matrix $W-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$.

Then the sequence $\{x_{k}(t)\}_{k=1}^{n}$ generated by the DPPA with stepsize $\eta>0$ is uniformly bounded for $t\geq 0$.
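Condition (1.4) is a scalar inequality in $\eta$, so for given constants it can be tested directly (a small illustrative helper; the name is ours):

```python
def dppa_stepsize_condition(eta, alpha, L, rho_W):
    """Return True iff eta satisfies inequality (1.4) of Theorem 1.1(2)."""
    lhs = (eta**2 * (alpha**2 + alpha * L)
           + eta * (alpha + L - (1.0 - rho_W) * alpha**2 / L))
    rhs = (1.0 - rho_W) * alpha / L
    return lhs < rhs
```

Since the left-hand side is increasing in $\eta>0$ and vanishes at $\eta=0$, the admissible stepsizes form an interval $(0,\eta_{\max})$.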

We now compare the stability result with that of the DGD described by

x_{i}(t+1)=\sum_{j=1}^{n}w_{ij}x_{j}(t)-\eta\nabla f_{i}(x_{i}(t)).

Let $\lambda_{n}(W)\in\mathbb{R}$ be the smallest eigenvalue of $W$. Then it is known from [29] that the DGD is stable if the stepsize $\eta>0$ satisfies

\eta\leq\frac{1+\lambda_{n}(W)}{L},

provided that each cost function $f_{j}$ is convex and $L$-smooth.
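One DGD iteration and the stability threshold of [29] can be sketched as follows (the callable gradient interface and function names are illustrative; the eigenvalue computation assumes a symmetric $W$):

```python
import numpy as np

def dgd_step(X, W, grad, eta):
    """One DGD iteration: x_i(t+1) = sum_j w_ij x_j(t) - eta * grad f_i(x_i(t)).
    grad maps the (n, d) iterate matrix to the (n, d) matrix of local gradients."""
    return W @ X - eta * grad(X)

def dgd_stability_threshold(W, L):
    """Largest stepsize (1 + lambda_n(W)) / L for which [29] guarantees stability."""
    lam_min = np.linalg.eigvalsh((W + W.T) / 2.0).min()
    return (1.0 + lam_min) / L
```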

|      | Condition on the costs                           | Stepsize                             | Paper     |
|------|--------------------------------------------------|--------------------------------------|-----------|
| DGD  | each $f_{j}$ is convex and $L$-smooth            | $\eta\leq\frac{1+\lambda_{n}(W)}{L}$ | [29]      |
| DPPA | $F_{\eta}$ is bounded below and has an optimizer | $\eta\in(0,\infty)$                  | This work |

Table 1. Comparison of the stepsize conditions for the stability of the DGD and the DPPA.

Table 1 summarizes the conditions on the stepsize $\eta>0$ for the stability of the DGD and the DPPA. We observe that $F_{\eta}$ defined in (1.3) is bounded below for any $\eta>0$ whenever each $f_{j}$ is convex. Therefore, condition (1) of Theorem 1.1 is no more restrictive than the convexity of each $f_{j}$, which is required for the stability result [29] of the DGD. Hence Theorem 1.1 shows that the range of admissible stepsizes $\eta>0$ is much wider for the DPPA than for the DGD.

For the convergence analysis of the DPPA, we assume the following uniform boundedness property.

Assumption 3.

The sequence $\{x_{k}(t)\}_{k=1}^{n}$ is uniformly bounded for $t\geq 0$, i.e., there is $R>0$ such that

A_{t}\leq R\quad\textrm{and}\quad B_{t}\leq R\qquad\forall~t\geq 0,

where $A_{t}$ and $B_{t}$ are the quantities defined below.

Although this property is guaranteed for a broad class of functions by Theorem 1.1, we formulate it as an assumption for the sake of simplicity in the statement of the convergence result. We also consider the following assumption.

Assumption 4.

The aggregate cost function $f$ is $\alpha$-strongly convex for some $\alpha>0$, and each cost function $f_{j}$ is $L$-smooth for $1\leq j\leq n$, i.e.,

\|\nabla f_{j}(x)-\nabla f_{j}(y)\|\leq L\|x-y\|

for all $x,y\in\mathbb{R}^{d}$.

Under this assumption, there exists a unique optimizer $x_{*}=\arg\min_{x\in\mathbb{R}^{d}}f(x)$. We let $D=\max_{1\leq i\leq n}\|\nabla f_{i}(x_{*})\|$. Also, we regard $x_{i}(t)$ as a row vector in $\mathbb{R}^{1\times d}$, and define the variables $\mathbf{x}(t)\in\mathbb{R}^{n\times d}$ and $\bar{\mathbf{x}}(t)\in\mathbb{R}^{n\times d}$ by

\mathbf{x}(t)=\left(x_{1}(t)^{T},\cdots,x_{n}(t)^{T}\right)^{T}\quad\textrm{and}\quad\bar{\mathbf{x}}(t)=\left(\bar{x}(t)^{T},\cdots,\bar{x}(t)^{T}\right)^{T}, \qquad (1.5)

where $\bar{x}(t)=\frac{1}{n}\sum_{k=1}^{n}x_{k}(t)$. We let

A_{t}=\|\bar{x}(t)-x_{*}\|\quad\textrm{and}\quad B_{t}=\frac{1}{\sqrt{n}}\|\bar{\mathbf{x}}(t)-\mathbf{x}(t)\|=\Big(\frac{1}{n}\sum_{k=1}^{n}\|\bar{x}(t)-x_{k}(t)\|^{2}\Big)^{1/2}.
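In code, the optimality gap $A_t$ and the consensus error $B_t$ can be computed directly from the stacked iterates (an illustrative NumPy helper; the name is ours):

```python
import numpy as np

def consensus_errors(X, x_star):
    """A_t = ||xbar - x_*|| and B_t = (1/n sum_k ||xbar - x_k||^2)^{1/2}
    for the (n, d) iterate matrix X with rows x_k(t)."""
    xbar = X.mean(axis=0)
    A_t = np.linalg.norm(xbar - x_star)
    B_t = np.sqrt(np.mean(np.sum((X - xbar) ** 2, axis=1)))
    return A_t, B_t
```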

We show in the following result that the DPPA converges exponentially to an $O(\eta)$-neighborhood of the optimizer.

Theorem 1.2.

Suppose that Assumptions 1-4 hold. Then we have

A_{t}\leq\frac{A_{0}}{(1+\eta\alpha)^{t}}+\frac{t\eta LB_{0}\rho_{W}}{(1+\eta\alpha)}\max\Big\{\rho_{W},\frac{1}{1+\eta\alpha}\Big\}^{t-1}+\frac{\eta L(2RL+D)}{\alpha(1-\rho_{W})} \qquad (1.6)

and

B_{t}\leq(\rho_{W})^{t}B_{0}+\frac{\eta}{(1-\rho_{W})}(2RL+D).
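Both right-hand sides are explicit and easy to evaluate for given constants; as $t\to\infty$ they tend to the $O(\eta)$ floors $\frac{\eta L(2RL+D)}{\alpha(1-\rho_W)}$ and $\frac{\eta(2RL+D)}{1-\rho_W}$, respectively (an illustrative helper; the name and parameter values below are ours):

```python
def theorem12_bounds(t, eta, alpha, L, rho_W, A0, B0, R, D):
    """Right-hand sides of the A_t estimate (1.6) and the B_t estimate."""
    r = max(rho_W, 1.0 / (1.0 + eta * alpha))
    A_bound = (A0 / (1.0 + eta * alpha) ** t
               + t * eta * L * B0 * rho_W / (1.0 + eta * alpha) * r ** (t - 1)
               + eta * L * (2.0 * R * L + D) / (alpha * (1.0 - rho_W)))
    B_bound = rho_W ** t * B0 + eta * (2.0 * R * L + D) / (1.0 - rho_W)
    return A_bound, B_bound
```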

Once the DGD is known to be stable, its linear convergence is established when the stepsize satisfies $\eta\leq\frac{2}{\alpha+L}$; we refer to [29, 7] for the details. As for the DPPA, Theorem 1.2 places no restriction on the stepsize $\eta>0$ for the linear convergence. Table 2 compares the conditions on the stepsize of the DGD and the DPPA for the convergence of the algorithms.

|      | Condition on the costs                                      | Stepsize                     | Estimate                                        | Paper     |
|------|-------------------------------------------------------------|------------------------------|-------------------------------------------------|-----------|
| DGD  | each $f_{j}$ is $L$-smooth, $f$ is $\alpha$-strongly convex | $\eta\leq\frac{2}{\alpha+L}$ | $O(e^{-ct})+O\big(\frac{\eta}{1-\rho_{W}}\big)$ | [29, 7]   |
| DPPA | each $f_{j}$ is $L$-smooth, $f$ is $\alpha$-strongly convex | $\eta\in(0,\infty)$          | $O(e^{-ct})+O\big(\frac{\eta}{1-\rho_{W}}\big)$ | This work |

Table 2. Comparison of the stepsize conditions for the convergence of the DGD and the DPPA.

The rest of this paper is organized as follows. In Section 2, we establish two sequential inequalities for $A_{t}$ and $B_{t}$. Section 3 is devoted to the proof of the uniform boundedness result of Theorem 1.1. In Section 4, we prove the convergence result of Theorem 1.2. Numerical results are presented in Section 5.

2. Sequential estimates

In this section, we derive two sequential inequalities for $A_{t}$ and $B_{t}$, which are the main ingredients for the stability and convergence analysis of the DPPA.

Proposition 2.1.

Assume that $f$ is $\alpha$-strongly convex. Then, for any stepsize $\eta>0$, the sequence $\{(A_{t},B_{t})\}_{t\geq 0}$ satisfies the following inequality

(1+\eta\alpha)A_{t+1}\leq A_{t}+\eta LB_{t+1} \qquad (2.1)

for all $t\geq 0$.

Proof.

From the minimality of (1.2), it follows that

\nabla f_{k}(x_{k}(t+1))+\frac{1}{\eta}\Big(x_{k}(t+1)-\sum_{j=1}^{n}w_{kj}x_{j}(t)\Big)=0. \qquad (2.2)

We reformulate this as

x_{k}(t+1)+\eta\nabla f_{k}(x_{k}(t+1))=\sum_{j=1}^{n}w_{kj}x_{j}(t). \qquad (2.3)

By averaging this over $1\leq k\leq n$, we get

\bar{x}(t+1)+\frac{\eta}{n}\sum_{k=1}^{n}\nabla f_{k}(x_{k}(t+1))=\bar{x}(t). \qquad (2.4)

Using this and the fact that $\nabla f(x_{*})=0$, we find

\begin{split}&\bar{x}(t+1)-x_{*}+\eta(\nabla f(\bar{x}(t+1))-\nabla f(x_{*}))\\ &=\bar{x}(t)-x_{*}+\eta\Big(\sum_{k=1}^{n}\nabla f_{k}(\bar{x}(t+1))-\nabla f_{k}(x_{k}(t+1))\Big).\end{split} \qquad (2.5)

Since ff is assumed to be α\alpha-strongly convex, we have

\|\nabla f(x)-\nabla f(y)\|^{2}\geq\alpha^{2}\|x-y\|^{2}

and

\langle x-y,~\nabla f(x)-\nabla f(y)\rangle\geq\alpha\|x-y\|^{2}.

Combining these estimates, we get

\begin{split}&\Big\|\bar{x}(t+1)-x_{*}+\eta\Big(\nabla f(\bar{x}(t+1))-\nabla f(x_{*})\Big)\Big\|^{2}\\ &=\|\bar{x}(t+1)-x_{*}\|^{2}+2\eta\langle\bar{x}(t+1)-x_{*},~\nabla f(\bar{x}(t+1))-\nabla f(x_{*})\rangle\\ &\qquad+\eta^{2}\|\nabla f(\bar{x}(t+1))-\nabla f(x_{*})\|^{2}\\ &\geq(1+2\eta\alpha+\eta^{2}\alpha^{2})\|\bar{x}(t+1)-x_{*}\|^{2}=(1+\eta\alpha)^{2}\|\bar{x}(t+1)-x_{*}\|^{2}.\end{split}

Using this estimate in (2.5) and applying the triangle inequality, we get

\begin{split}&(1+\eta\alpha)\|\bar{x}(t+1)-x_{*}\|\\ &\leq\|\bar{x}(t)-x_{*}\|+\frac{\eta}{n}\Big\|\sum_{k=1}^{n}\nabla f_{k}(\bar{x}(t+1))-\nabla f_{k}(x_{k}(t+1))\Big\|\\ &\leq\|\bar{x}(t)-x_{*}\|+\frac{\eta L}{n}\sum_{k=1}^{n}\|\bar{x}(t+1)-x_{k}(t+1)\|\\ &\leq\|\bar{x}(t)-x_{*}\|+\frac{\eta L}{\sqrt{n}}\|\bar{\mathbf{x}}(t+1)-\mathbf{x}(t+1)\|,\end{split}

where we used the Cauchy-Schwarz inequality in the last step. This gives the desired inequality. ∎

Next we derive a bound on $B_{t+1}$. For this we will use the following result (see [19, Lemma 1]).

Lemma 2.2.

Suppose Assumptions 1 and 2 hold, and let $\rho_{W}$ be the spectral norm of the matrix $W-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$. Then we have $\rho_{W}<1$ and

\sum_{i=1}^{n}\Big\|\sum_{j=1}^{n}w_{ij}(x_{j}-\bar{x})\Big\|^{2}\leq(\rho_{W})^{2}\sum_{i=1}^{n}\|x_{i}-\bar{x}\|^{2},

where $\bar{x}=\frac{1}{n}\sum_{k=1}^{n}x_{k}$, for any $x_{i}\in\mathbb{R}^{d\times 1}$, $1\leq i\leq n$.
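Lemma 2.2 is easy to sanity-check numerically: for a doubly stochastic $W$, the mixing step contracts the disagreement by at least the factor $\rho_{W}$. The lazy cycle matrix below is our own illustrative choice:

```python
import numpy as np

n = 6
# lazy random walk on a cycle: doubly stochastic with positive diagonal
W = 0.5 * np.eye(n)
for i in range(n):
    W[i, (i + 1) % n] += 0.25
    W[i, (i - 1) % n] += 0.25

rho_W = np.linalg.norm(W - np.ones((n, n)) / n, 2)  # spectral norm of W - (1/n)11^T

rng = np.random.default_rng(1)
X = rng.standard_normal((n, 3))
Dm = X - X.mean(axis=0)                  # disagreement matrix, rows x_i - xbar
contraction = np.linalg.norm(W @ Dm) / np.linalg.norm(Dm)
```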

In the following result, we find an estimate of $B_{t+1}$ in terms of $B_{t}$ and $A_{t+1}$.

Proposition 2.3.

Suppose that each $f_{j}$ is $L$-smooth. Then the sequence $\{(A_{t},B_{t})\}_{t\geq 0}$ satisfies the following inequality

B_{t+1}\leq\rho_{W}B_{t}+\eta LB_{t+1}+\eta LA_{t+1}+\eta D \qquad (2.6)

for all $t\geq 0$.

Proof.

We may write (2.3) and (2.4) in the following way:

\mathbf{x}(t+1)+\eta\nabla F(\mathbf{x}(t+1))=W\mathbf{x}(t)

and

\bar{\mathbf{x}}(t+1)+\eta\overline{\nabla F}(\mathbf{x}(t+1))=\bar{\mathbf{x}}(t),

where $\nabla F(\mathbf{x}(t+1))\in\mathbb{R}^{n\times d}$ stacks the local gradients $\nabla f_{k}(x_{k}(t+1))$, $1\leq k\leq n$, as its rows, and

\overline{\nabla F}(\mathbf{x}(t+1))=\frac{1}{n}\mathbf{1}\mathbf{1}^{T}\nabla F(\mathbf{x}(t+1))

replaces each row by the average of the rows.

Combining the above equalities, we find

\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)=W(\mathbf{x}(t)-\bar{\mathbf{x}}(t))-\eta\Big(\nabla F(\mathbf{x}(t+1))-\overline{\nabla F}(\mathbf{x}(t+1))\Big).

By applying the triangle inequality and Lemma 2.2, we deduce

\begin{split}&\|\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)\|\\ &\leq\rho_{W}\|\mathbf{x}(t)-\bar{\mathbf{x}}(t)\|+\eta\|\nabla F(\mathbf{x}(t+1))-\overline{\nabla F}(\mathbf{x}(t+1))\|.\end{split} \qquad (2.7)

Using the fact that the spectral norm of the matrix $I_{n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$ is one, we obtain

\begin{split}&\|\nabla F(\mathbf{x}(t+1))-\overline{\nabla F}(\mathbf{x}(t+1))\|\\ &\leq\|\nabla F(\mathbf{x}(t+1))\|\\ &\leq\|\nabla F(\mathbf{x}(t+1))-\nabla F(\bar{\mathbf{x}}(t+1))\|+\|\nabla F(\bar{\mathbf{x}}(t+1))-\nabla F(\mathbf{x}_{*})\|+\|\nabla F(\mathbf{x}_{*})\|\\ &\leq L\|\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)\|+\sqrt{n}L\|\bar{x}(t+1)-x_{*}\|+\sqrt{n}D,\end{split}

where $\mathbf{x}_{*}\in\mathbb{R}^{n\times d}$ stacks $n$ copies of $x_{*}$.

Inserting this into (2.7) we obtain

\begin{split}&\|\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)\|\\ &\leq\rho_{W}\|\mathbf{x}(t)-\bar{\mathbf{x}}(t)\|+\eta L\|\mathbf{x}(t+1)-\bar{\mathbf{x}}(t+1)\|\\ &\quad+\sqrt{n}L\eta\|\bar{x}(t+1)-x_{*}\|+\sqrt{n}D\eta,\end{split}

which is the desired inequality. ∎

3. Boundedness of the sequence

We prove the uniform boundedness result of Theorem 1.1 under the two conditions separately below.

Proof of Theorem 1.1.

Assume the first condition of Theorem 1.1. We claim the following inequality:

\sum_{k=1}^{n}\|x_{k}(t+1)-x_{k}^{*}\|\leq\sum_{k=1}^{n}\|x_{k}(t)-x_{k}^{*}\|\quad\forall~t\geq 0. \qquad (3.1)

The optimizer $(x_{1}^{*},\cdots,x_{n}^{*})$ of $F_{\eta}$ in (1.3) satisfies

\nabla f_{k}(x_{k}^{*})+\frac{1}{\eta}\Big(x_{k}^{*}-\sum_{j=1}^{n}w_{kj}x_{j}^{*}\Big)=0.

Combining this with (2.2) gives

\eta(\nabla f_{k}(x_{k}(t+1))-\nabla f_{k}(x_{k}^{*}))+(x_{k}(t+1)-x_{k}^{*})=\sum_{j=1}^{n}w_{kj}(x_{j}(t)-x_{j}^{*}).

From this, using the monotonicity of $\nabla f_{k}$, we have

\|x_{k}(t+1)-x_{k}^{*}\|\leq\Big\|\sum_{j=1}^{n}w_{kj}(x_{j}(t)-x_{j}^{*})\Big\|\leq\sum_{j=1}^{n}w_{kj}\|x_{j}(t)-x_{j}^{*}\|.

Summing up this for 1kn1\leq k\leq n, we get

\sum_{k=1}^{n}\|x_{k}(t+1)-x_{k}^{*}\|\leq\sum_{k=1}^{n}\sum_{j=1}^{n}w_{kj}\|x_{j}(t)-x_{j}^{*}\|=\sum_{j=1}^{n}\|x_{j}(t)-x_{j}^{*}\|,

which proves the inequality (3.1). Iterating (3.1) gives the bound

\sum_{k=1}^{n}\|x_{k}(t)-x_{k}^{*}\|\leq\sum_{j=1}^{n}\|x_{j}(0)-x_{j}^{*}\|\quad\forall~t\geq 0.

Hence $A_{t}$ and $B_{t}$ are uniformly bounded.

Next we assume the second condition of Theorem 1.1. Then we claim that the sequences $A_{t}$ and $B_{t}$ satisfy

A_{t}\leq R\quad\textrm{and}\quad B_{t}\leq\frac{\alpha}{L}R,

where $R>0$ is defined by

R=\max\Bigg\{A_{0},~\frac{L}{\alpha}B_{0},~\frac{L}{\alpha}B_{1},~\frac{\eta D}{\frac{\alpha}{L}\Big(1-\eta L-\frac{\eta^{2}L^{2}}{(1+\eta\alpha)}\Big)-\frac{\alpha}{L}\rho_{W}-\frac{\eta L}{(1+\eta\alpha)}}\Bigg\}. \qquad (3.2)

We argue by induction to prove this claim. First, we note that $A_{0}\leq R$, $B_{0}\leq\frac{\alpha}{L}R$ and $B_{1}\leq\frac{\alpha}{L}R$ by the definition of $R$. Next we assume that

A_{t}\leq R\quad\textrm{and}\quad B_{t+1}\leq cR \qquad (3.3)

for some $t\geq 0$, where $c=\frac{\alpha}{L}$. Then, we use these bounds in (2.1) to find

(1+\eta\alpha)A_{t+1}\leq(R+\eta LcR)=(1+\eta\alpha)R,

which gives $A_{t+1}\leq R$. Next we recall the estimates (2.6) and (2.1) with $t$ replaced by $t+1$:

B_{t+2}\leq\rho_{W}B_{t+1}+\eta LB_{t+2}+\eta LA_{t+2}+\eta D

and

\begin{split}(1+\eta\alpha)A_{t+2}&\leq A_{t+1}+\eta LB_{t+2}\\ &\leq R+\eta LB_{t+2},\end{split}

where we used $A_{t+1}\leq R$ in the last inequality. Combining these estimates with (3.3) gives

\Big(1-\eta L-\frac{\eta^{2}L^{2}}{(1+\eta\alpha)}\Big)B_{t+2}\leq\rho_{W}(cR)+\frac{\eta L}{1+\eta\alpha}R+\eta D.

This gives $B_{t+2}\leq cR$ provided that

\rho_{W}(cR)+\frac{\eta L}{(1+\eta\alpha)}R+\eta D\leq Rc\Big(1-\eta L-\frac{\eta^{2}L^{2}}{(1+\eta\alpha)}\Big).

This holds true for $R>0$ defined in (3.2) and $\eta>0$ satisfying

\frac{\alpha}{L}\Big((1-\rho_{W})-\eta L-\frac{\eta^{2}L^{2}}{(1+\eta\alpha)}\Big)>\frac{\eta L}{(1+\eta\alpha)},

which is equivalent to

\eta^{2}(\alpha^{2}+\alpha L)+\eta\Big(\alpha+L-\frac{(1-\rho_{W})\alpha^{2}}{L}\Big)<\frac{(1-\rho_{W})\alpha}{L}.

This is precisely the inequality (1.4) in condition (2). Thus $A_{t+1}\leq R$ and $B_{t+2}\leq cR$, which completes the induction and proves the claim. Hence the uniform boundedness is proved. ∎

4. Convergence result

In this section, we prove the main convergence result of the decentralized proximal point method.

Proof of Theorem 1.2.

By Proposition 2.1 and Proposition 2.3 we have the following inequalities:

(1+\eta\alpha)A_{t+1}\leq A_{t}+\eta LB_{t+1} \qquad (4.1)

and

B_{t+1}\leq\rho_{W}B_{t}+\eta LB_{t+1}+\eta LA_{t+1}+\eta D.

Using the bounds $A_{t+1}\leq R$ and $B_{t+1}\leq R$ from Assumption 3, we have

B_{t+1}\leq\rho_{W}B_{t}+\eta(2RL+D)

for all $t\geq 0$. Using this iteratively, we get

\begin{split}B_{t}&\leq(\rho_{W})^{t}B_{0}+\eta(2RL+D)\Big[1+\rho_{W}+\cdots+(\rho_{W})^{t-1}\Big]\\ &\leq(\rho_{W})^{t}B_{0}+\frac{\eta}{(1-\rho_{W})}(2RL+D).\end{split}

Putting this inequality into (4.1) leads to

\begin{split}&(1+\eta\alpha)A_{t+1}\\ &\leq A_{t}+\eta L\Big[(\rho_{W})^{t+1}B_{0}+\frac{\eta}{(1-\rho_{W})}(2RL+D)\Big]\\ &=A_{t}+\eta LB_{0}(\rho_{W})^{t+1}+\frac{\eta^{2}L}{(1-\rho_{W})}(2RL+D).\end{split} \qquad (4.2)

To analyze this sequential estimate, we consider two positive sequences $\{z_{t}\}_{t\geq 0}$ and $\{y_{t}\}_{t\geq 0}$ satisfying

(1+\eta\alpha)z_{t+1}=z_{t}+\eta LB_{0}(\rho_{W})^{t+1}

and

(1+\eta\alpha)y_{t+1}=y_{t}+\frac{\eta^{2}L}{(1-\rho_{W})}(2RL+D),

where $z_{0}=A_{0}$ and $y_{0}=0$. It then follows from (4.2) that $A_{t}\leq z_{t}+y_{t}$ for all $t\geq 0$.

We estimate $z_{t}$ as follows:

\begin{split}z_{t}&=\frac{z_{0}}{(1+\eta\alpha)^{t}}+\frac{\eta LB_{0}\rho_{W}}{(1+\eta\alpha)}\Big[\sum_{j=0}^{t-1}\frac{(\rho_{W})^{j}}{(1+\eta\alpha)^{t-1-j}}\Big]\\ &\leq\frac{z_{0}}{(1+\eta\alpha)^{t}}+\frac{t\eta LB_{0}\rho_{W}}{(1+\eta\alpha)}\max\Big\{\rho_{W},\frac{1}{1+\eta\alpha}\Big\}^{t-1}.\end{split}

Next we estimate $y_{t}$ as

\begin{split}y_{t}&=\frac{y_{0}}{(1+\eta\alpha)^{t}}+\frac{\eta^{2}L}{(1+\eta\alpha)(1-\rho_{W})}(2RL+D)\sum_{j=0}^{t-1}\frac{1}{(1+\eta\alpha)^{j}}\\ &\leq\frac{y_{0}}{(1+\eta\alpha)^{t}}+\frac{\eta^{2}L}{(1+\eta\alpha)(1-\rho_{W})}(2RL+D)\sum_{j=0}^{\infty}\frac{1}{(1+\eta\alpha)^{j}}\\ &=\frac{y_{0}}{(1+\eta\alpha)^{t}}+\frac{\eta L(2RL+D)}{\alpha(1-\rho_{W})}.\end{split}

Combining the above estimates gives

\begin{split}A_{t}&\leq z_{t}+y_{t}\\ &\leq\frac{A_{0}}{(1+\eta\alpha)^{t}}+\frac{t\eta LB_{0}\rho_{W}}{(1+\eta\alpha)}\max\Big\{\rho_{W},\frac{1}{1+\eta\alpha}\Big\}^{t-1}+\frac{\eta L(2RL+D)}{\alpha(1-\rho_{W})}.\end{split}

This is the desired estimate (1.6), and the proof is complete. ∎
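The comparison argument above can be checked numerically by iterating the exact recursions for $z_{t}$ and $y_{t}$ and comparing them against the closed-form bounds (an illustrative script with arbitrarily chosen constants):

```python
# arbitrary illustrative constants (not from the paper)
eta, alpha, L, rho_W = 0.1, 1.0, 2.0, 0.6
A0, B0, R, D = 1.0, 1.0, 1.0, 1.0

z, y = A0, 0.0                                   # z_0 = A_0, y_0 = 0
c = eta**2 * L * (2 * R * L + D) / (1 - rho_W)   # constant term of the y-recursion
gap = 0.0                                        # worst violation of the bounds
for t in range(1, 201):
    z = (z + eta * L * B0 * rho_W**t) / (1 + eta * alpha)
    y = (y + c) / (1 + eta * alpha)
    r = max(rho_W, 1.0 / (1 + eta * alpha))
    z_bound = (A0 / (1 + eta * alpha)**t
               + t * eta * L * B0 * rho_W / (1 + eta * alpha) * r**(t - 1))
    y_bound = eta * L * (2 * R * L + D) / (alpha * (1 - rho_W))
    gap = max(gap, z - z_bound, y - y_bound)
```

Here `gap` stays (up to rounding) nonpositive, confirming that the closed-form expressions dominate the recursions.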

5. Numerical tests

This section presents numerical test results for the DPPA. We consider the cost function

f(x)=\frac{1}{n}\sum_{k=1}^{n}\|A_{k}x-y_{k}\|^{2},

where $n$ is the number of agents and, for each $1\leq i\leq n$, $A_{i}$ is an $m\times d$ matrix whose entries are chosen randomly from the normal distribution $N(0,1)$. Similarly, each element of $y_{i}\in\mathbb{R}^{m}$ is generated from the normal distribution $N(0,1)$. We choose $n=20$, $m=5$ and $d=10$.
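The synthetic problem can be generated as follows; the minimizer $x_{*}$ used in the error plots is the stacked least-squares solution. This is a sketch under the stated distributions, with function names of our own choosing:

```python
import numpy as np

def make_problem(n=20, m=5, d=10, seed=0):
    """Random least-squares data: A_i is m x d, y_i in R^m, entries N(0, 1)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, m, d))
    y = rng.standard_normal((n, m))
    # minimizer of f(x) = (1/n) sum_k ||A_k x - y_k||^2: stacked least squares
    x_star, *_ = np.linalg.lstsq(A.reshape(n * m, d), y.reshape(n * m), rcond=None)
    return A, y, x_star
```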

For the communication matrix $W$, we link each pair of agents with probability 0.4, and define the weights $w_{ij}$ by

w_{ij}=\left\{\begin{array}{ll}1/\max\{\textrm{deg}(i),\textrm{deg}(j)\}&~\textrm{if}~j\in N_{i},\\ 1-\sum_{l\in N_{i}}w_{il}&~\textrm{if}~i=j,\\ 0&~\textrm{otherwise},\end{array}\right.

where $N_{i}$ denotes the set of neighbors of agent $i$ and $\textrm{deg}(i)=|N_{i}|$.
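This Metropolis-type weight rule can be implemented directly; since the off-diagonal rule is symmetric and each row sums to one, the resulting $W$ is doubly stochastic. The implementation below, including the resampling loop for a connected Erdős–Rényi graph, is our own illustrative sketch:

```python
import numpy as np

def metropolis_weights(adj):
    """Mixing matrix from a 0/1 adjacency matrix: w_ij = 1/max(deg i, deg j)
    on edges, with w_ii chosen so that every row sums to one."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / max(deg[i], deg[j])
        W[i, i] = 1.0 - W[i].sum()
    return W

def random_connected_adjacency(n=20, p=0.4, seed=0):
    """Undirected Erdos-Renyi graph with link probability p, resampled until connected."""
    rng = np.random.default_rng(seed)
    while True:
        U = rng.random((n, n)) < p
        adj = np.triu(U, 1)
        adj = (adj | adj.T).astype(float)
        # connected iff (I + adj)^n has no zero entry
        if np.all(np.linalg.matrix_power(np.eye(n) + adj, n) > 0):
            return adj
```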

We define the following value:

\eta_{c}=\frac{1+\lambda_{n}(W)}{L},

where $\lambda_{n}(W)$ is the smallest eigenvalue of $W$ and $L$ is the smoothness constant of the $f_{j}$ for $1\leq j\leq n$. In our experiment, the constants are computed as $L\simeq 29.7312$ and $\lambda_{n}(W)\simeq-0.4009$. Therefore we find $\eta_{c}\simeq 0.0202$.
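Given $W$ and the data, $\eta_{c}$ is computed from the smallest eigenvalue of $W$ and the largest local smoothness constant; assuming the local costs are read as $f_{i}(x)=\|A_{i}x-y_{i}\|^{2}$, that constant is $L=\max_{i}2\|A_{i}\|_{2}^{2}$ (the helper name and this reading are our assumptions):

```python
import numpy as np

def critical_stepsize(W, A):
    """eta_c = (1 + lambda_n(W)) / L with L = max_i 2*||A_i||_2^2,
    the smoothness constant of f_i(x) = ||A_i x - y_i||^2 (symmetric W assumed)."""
    lam_min = np.linalg.eigvalsh((W + W.T) / 2.0).min()
    L = max(2.0 * np.linalg.norm(A[i], 2) ** 2 for i in range(A.shape[0]))
    return (1.0 + lam_min) / L
```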

We perform numerical simulations of the DPPA and the DGD with stepsize chosen as $\eta=\eta_{c}+0.005$ and $\eta=\eta_{c}$. Figure 1 shows the graphs of the error $\log(\sum_{k=1}^{n}\|x_{k}(t)-x_{*}\|/n)$ with respect to $t\geq 0$. The result shows that both the DPPA and the DGD are stable for $\eta=\eta_{c}$, as the theoretical results of Table 1 guarantee. On the other hand, if we choose $\eta=\eta_{c}+0.005$, then the DGD becomes unstable while the DPPA remains stable. This supports the result of Theorem 1.1.

Figure 1. The graphs of the error $\log(\sum_{k=1}^{n}\|x_{k}(t)-x_{*}\|/n)$ for the DPPA and the DGD for $t\geq 0$ with stepsize $\eta=\eta_{c}+0.005$ and $\eta=\eta_{c}$.

Next we perform a numerical test of the DPPA with various stepsizes $\eta\in\{0.001,0.01,0.1,1,2\}$. We measure the error $\log(\sum_{k=1}^{n}\|x_{k}(t)-x_{*}\|/n)$, and the graphs are given in Figure 2. The result shows that the error decreases exponentially up to an $O(\eta)$ value, which supports the convergence result of Theorem 1.2.

Figure 2. The graphs of the error $\log(\sum_{k=1}^{n}\|x_{k}(t)-x_{*}\|/n)$ for the DPPA with respect to $t\geq 0$ and stepsize $\eta\in\{0.001,0.01,0.1,1,2\}$.

Acknowledgments

The author was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government No. NRF-2016R1A5A1008055 and No. NRF-2021R1F1A1059671.

References

  • [1] J. A. Bazerque, G.B. Giannakis, Distributed spectrum sensing for cognitive radio networks by exploiting sparsity. IEEE Trans. Signal Process. 58(3), 1847–1862 (2010)
  • [2] A.S. Berahas, R. Bollapragada, N.S. Keskar, E. Wei, Balancing communication and computation in distributed optimization. IEEE Trans. Autom. Control 64(8), 3141–3155 (2018)
  • [3] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
  • [4] F. Bullo, J. Cortes, S. Martinez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms, vol. 27, Princeton University Press, Princeton, 2009.
  • [5] Y. Cao, W. Yu, W. Ren, and G. Chen, An overview of recent progress in the study of distributed multi-agent coordination, IEEE Trans Ind. Informat., 9 (2013), pp. 427–438
  • [6] W. Choi, D. Kim, S. Yun, Convergence results of a nested decentralized gradient method for non-strongly convex problems. J. Optim. Theory Appl. 195 (2022), no. 1, 172–204.
  • [7] W. Choi, J. Kim, On the convergence of decentralized gradient descent with diminishing stepsize, revisited, arXiv:2203.09079.
  • [8] A. I. Chen and A. Ozdaglar, A fast distributed proximal-gradient method, in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on. IEEE, 2012, pp. 601–608.
  • [9] G. B. Giannakis, V. Kekatos, N. Gatsis, S.-J. Kim, H. Zhu, and B. Wollenberg, Monitoring and optimization for power grids: A signal processing perspective, IEEE Signal Processing Mag., 30 (2013), pp. 107–128.
  • [10] V. Kekatos and G. B. Giannakis, Distributed robust power system state estimation, IEEE Trans. Power Syst., 28 (2013), pp. 1617–1626,
  • [11] P. A. Forero, A. Cano, and G. B. Giannakis, Consensus-based distributed support vector machines, Journal of Machine Learning Research, vol. 11, pp. 1663–1707, 2010.
  • [12] D. Jakovetic, J. Xavier, and J. M. F. Moura, Fast Distributed Gradient Methods, IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
  • [13] X. Li, G. Feng, L. Xie, Distributed proximal algorithms for multiagent optimization with coupled inequality constraints. IEEE Trans. Automat. Control 66 (2021), no. 3, 1223–1230.
  • [14] X. Li, G. Feng, L. Xie, Distributed Proximal Point Algorithm for Constrained Optimization over Unbalanced Graphs. In 2019 IEEE 15th International Conference on Control and Automation (ICCA) (pp. 824-829).
  • [15] K. Margellos, A. Falsone, S. Garatti, and M. Prandini, Proximal minimization based distributed convex optimization, in 2016 American Control Conference (ACC), July 2016, pp. 2466–2471.
  • [16] K. Margellos, A. Falsone, S. Garatti, and M. Prandini, Distributed constrained optimization and consensus in uncertain networks via proximal minimization, IEEE Trans. Autom. Control, vol. 63, no. 5, pp. 1372–1387, May 2018.
  • [17] A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [18] A. Nedić, A. Olshevsky, and W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs, SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
  • [19] S. Pu and A. Nedić, Distributed stochastic gradient tracking methods, Math. Program, pp. 1–49, 2018
  • [20] G. Qu and N. Li, Harnessing smoothness to accelerate distributed optimization, IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2018.
  • [21] H. Raja and W. U. Bajwa, Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data, IEEE Transactions on Signal Processing, vol. 64, no. 1, pp. 173–188, Jan. 2016.
  • [22] W. Ren, Consensus Based Formation Control Strategies for Multi-Vehicle Systems, in Proceedings of the American Control Conference, 2006, pp. 4237–4242.
  • [23] A. H. Sayed, Diffusion adaptation over networks. Academic Press Library in Signal Processing, 2013, vol. 3.
  • [24] I.D. Schizas, G.B. Giannakis, S.I. Roumeliotis, A. Ribeiro, Consensus in ad hoc WSNs with noisy links—part II: distributed estimation and smoothing of random signals. IEEE Trans. Signal Process. 56(4), 1650–1666 (2008)
  • [25] W. Shi, Q. Ling, G. Wu, W. Yin, Extra: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
  • [26] W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Process. 62(7), 1750–1761 (2014)
  • [27] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning, in Proceedings of the IEEE Allerton Conference on Communication, Control, and Computing, IEEE, New York, 2012, pp. 1543–1550.
  • [28] R. Xin and U. A. Khan, A linear algorithm for optimization over directed graphs with geometric convergence, IEEE Control Systems Letters, vol. 2, no. 3, pp. 315–320, 2018.
  • [29] K. Yuan, Q. Ling, and W. Yin, On the convergence of decentralized gradient descent, SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, Sep. 2016.
  • [30] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, A survey of distributed optimization, Annual Reviews in Control, vol. 47, pp. 278–305, 2019.