
Primal-dual $\varepsilon$-Subgradient Method for Distributed Optimization. This work was supported by the National Natural Science Foundation of China under Grant 61973043.

Kui Zhu and Yutao Tang K. Zhu and Y. Tang are both with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mails: [email protected], [email protected]).

Abstract: This paper studies the distributed optimization problem when the objective functions may be nondifferentiable and subject to heterogeneous set constraints. Unlike existing subgradient methods, we focus on the case where the exact subgradients of the local objective functions cannot be accessed by the agents. To solve this problem, we propose a projected primal-dual dynamics using only approximate subgradients of the objective functions. We first prove that the formulated optimization problem can generally be solved with an error depending upon the accuracy of the available subgradients. Then, we show that this distributed optimization problem is exactly solvable when the accumulated approximation error of the inexact subgradients is not too large. After that, we give a novel componentwise normalized variant to improve the transient behavior of the convergent sequence. The effectiveness of our algorithms is verified by a numerical example.

Keywords: Distributed optimization, $\varepsilon$-subgradient, constrained optimization, primal-dual dynamics

1 Introduction

The last decade has witnessed considerable interest in distributed optimization problems due to their numerous applications in signal processing, control, and machine learning. To solve such problems, subgradient information of the objective functions has been widely used due to its cheap iteration cost and well-established convergence properties [1, 2].

Note that most of these subgradient-based results assume the availability of exact subgradients of the local cost functions. In many circumstances, the subgradient is computed by solving another auxiliary optimization problem, as shown in [3, 5, 4]. In practice, we are often only able to solve these subproblems approximately. Hence, in that context, numerical methods for the original optimization problem are provided with only inexact subgradient information. This leads us to investigate the solvability of the distributed optimization problem using inexact subgradient information.

A closely related topic is the inexact augmented Lagrangian method. As surveyed in [6], this method has been extensively extended to distributed settings in various ways under the assumption that the primal variables are only obtained approximately. Nevertheless, most of these results still require exact gradient or subgradient information of the local objective functions at each given estimate. It is thus interesting to ask whether the primal-dual method is still effective when only inexact gradient or subgradient information is available.

In this paper, we focus on a typical distributed consensus optimization problem for a sum of convex objective functions subject to heterogeneous set constraints. Although this problem has been partially studied by gradient/subgradient methods in [7, 8, 9, 10, 11, 12, 13, 14], its solvability using only inexact subgradient information has not yet been addressed.

To solve this problem, we first convert it into a distributed saddle-point seeking problem and present a projected primal-dual $\varepsilon$-subgradient dynamics to handle both the distributedness and the constraints. When the objective functions are smooth with exact gradients, the proposed algorithm reduces to the primal-dual dynamics considered in [10, 8]. Then, we discuss the convergence properties under diminishing step sizes and the suboptimality of the proposed algorithm as a function of the accuracy of the available subgradients. In particular, we show that if the accumulated error resulting from the subgradient inexactness is not too large, the proposed algorithm under certain diminishing step sizes drives the estimates of all agents to a consensus on an optimal solution of the global optimization problem. To our knowledge, this might be the first attempt to solve the formulated distributed optimization problem using only inexact subgradients of the local objective functions. To improve the transient performance of the preceding designs, we further propose a novel componentwise normalized step size similar to that in [14]. As a byproduct, this normalized step size removes the subgradient boundedness assumption widely used in the literature [7, 8].

The rest of this paper is organized as follows. We first give some preliminaries in Section 2 and then introduce the formulation of our problem in Section 3. Main results are presented in Section 4. After that, we give a numerical example in Section 5 to show the effectiveness of our design. Finally, some concluding remarks are given in Section 6.

2 Preliminaries

In this section, we give some preliminaries on graph theory and convex analysis.

2.1 Graph Theory

Let $\mathbb{R}^{n}$ be the $n$-dimensional Euclidean space and $\mathbb{R}^{n\times m}$ the set of all $n\times m$ real matrices. $\mathbf{1}_{n}$ (or $\mathbf{0}_{n}$) denotes the $n$-dimensional all-one (or all-zero) column vector, and $\mathbf{1}_{n\times m}$ (or $\mathbf{0}_{n\times m}$) the all-one (or all-zero) $n\times m$ matrix. $\mathrm{col}(a_{1},\,\dots,\,a_{n})=[a_{1}^{\intercal},\,\dots,\,a_{n}^{\intercal}]^{\intercal}$ for column vectors $a_{1},\,\dots,\,a_{n}$. For a vector $x$ (or a matrix $A$), $\|x\|$ (or $\|A\|$) denotes its Euclidean (or spectral) norm.

A weighted (undirected) graph $\mathcal{G}=(\mathcal{N},\,\mathcal{E},\,\mathcal{A})$ is defined as follows: $\mathcal{N}=\{1,\,\dots,\,n\}$ is the set of nodes, $\mathcal{E}\subset\mathcal{N}\times\mathcal{N}$ is the set of edges, and $\mathcal{A}=[a_{ij}]\in\mathbb{R}^{n\times n}$ is the weighted adjacency matrix, where $a_{ii}=0$ and $a_{ij}\geq 0$, with $a_{ij}=a_{ji}>0$ if and only if there is an edge between nodes $i$ and $j$. An element $(i,\,j)\in\mathcal{E}$ denotes an edge leaving node $i$ and entering node $j$. The neighbor set of agent $i$ is defined as $\mathcal{N}_{i}=\{j\colon(j,\,i)\in\mathcal{E}\}$ for $i=1,\,\dots,\,n$. A path in graph $\mathcal{G}$ is an alternating sequence $i_{1}e_{1}i_{2}e_{2}\cdots e_{k-1}i_{k}$ of nodes $i_{l}$ and edges $e_{m}=(i_{m},\,i_{m+1})\in\mathcal{E}$. If there exists a path from node $i$ to node $j$, then node $j$ is said to be reachable from node $i$. The Laplacian $L=[l_{ij}]\in\mathbb{R}^{n\times n}$ of graph $\mathcal{G}$ is defined by $l_{ii}=\sum_{j\neq i}a_{ij}$ and $l_{ij}=-a_{ij}$ for $j\neq i$. The Laplacian is symmetric and positive semi-definite. Denote its ordered eigenvalues by $0=\lambda_{1}\leq\lambda_{2}\leq\dots\leq\lambda_{n}$. The eigenvector corresponding to $\lambda_{1}=0$ is the all-one vector $\mathbf{1}_{n}$. Moreover, $\lambda_{2}>0$ if and only if the graph $\mathcal{G}$ is connected.
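As an illustration (a minimal numpy sketch, not part of the paper), the Laplacian facts above can be checked on a path graph with four nodes and unit weights:

```python
import numpy as np

# Build the adjacency matrix of the undirected path graph 1-2-3-4.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0          # a_ij = a_ji > 0 iff {i, j} is an edge

# Laplacian: l_ii = sum_{j != i} a_ij, l_ij = -a_ij.
L = np.diag(A.sum(axis=1)) - A

assert np.allclose(L, L.T)                       # symmetric
eigvals = np.sort(np.linalg.eigvalsh(L))
assert abs(eigvals[0]) < 1e-9                    # lambda_1 = 0
assert eigvals[1] > 0                            # connected => lambda_2 > 0
assert np.allclose(L @ np.ones(4), np.zeros(4))  # 1_n is the eigenvector of 0
```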

2.2 Convex analysis

For a set $X\subset\mathbb{R}^{m}$, the indicator function $\delta_{X}$ is defined by $\delta_{X}(x)=0$ for any $x\in X$ and $\delta_{X}(x)=\infty$ for any $x\notin X$. A set $X\subset\mathbb{R}^{m}$ is said to be convex if $\theta x+(1-\theta)y\in X$ for any $x,\,y\in X$ and $\theta\in(0,\,1)$. For a nonempty closed convex set $X$, the projection operator $P_{X}\colon\mathbb{R}^{m}\to X$ is defined as $P_{X}[x]=\arg\min_{y\in X}\|y-x\|$. The projection operator is non-expansive in the sense that $\|P_{X}[x]-P_{X}[y]\|\leq\|x-y\|$ for any $x,\,y\in\mathbb{R}^{m}$. Moreover, for any $x\in\mathbb{R}^{m}$, it holds that $(x-P_{X}[x])^{\intercal}(y-P_{X}[x])\leq 0$ for all $y\in X$.
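The two projection properties quoted above are easy to verify numerically for a box constraint, where $P_X$ reduces to a componentwise clip. A small sketch under this assumption:

```python
import numpy as np

# Projection onto the box X = [lo, hi]^3 is a componentwise clip.
lo, hi = -1.0, 2.0
P = lambda x: np.clip(x, lo, hi)

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.normal(size=3) * 5, rng.normal(size=3) * 5
    # Non-expansiveness: ||P[x] - P[y]|| <= ||x - y||.
    assert np.linalg.norm(P(x) - P(y)) <= np.linalg.norm(x - y) + 1e-12
    # Variational inequality with an arbitrary point y_in of X.
    y_in = P(y)
    assert (x - P(x)) @ (y_in - P(x)) <= 1e-12
```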

For a function $f\colon\mathbb{R}^{m}\to\mathbb{R}$, we denote by $\mathrm{dom}\,f=\{x\in\mathbb{R}^{m}\mid|f(x)|<\infty\}$ the domain of $f$. We always assume $\mathrm{dom}\,f\neq\emptyset$, and $\mathrm{dom}\,f=\mathbb{R}^{m}$ if not specified otherwise. The function $f$ is convex if its domain is convex and $f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y)$ holds for all $x,\,y\in\mathrm{dom}\,f$ and $\alpha\in[0,\,1]$. If this inequality is strict whenever $x\neq y$ and $\alpha\in(0,\,1)$, the function is called strictly convex. A function $f$ is called closed and convex on a convex set $X\subset\mathrm{dom}\,f$ if its constrained epigraph $\mathrm{epi}_{X}(f)=\{(x,\,t)\in X\times\mathbb{R}\mid t\geq f(x)\}$ is a closed convex set. If $X=\mathrm{dom}\,f$, we call $f$ a closed convex function.

A vector-valued function $\bm{f}\colon\mathbb{R}^{m}\rightarrow\mathbb{R}^{m}$ is Lipschitz with constant $\vartheta>0$ (or simply $\vartheta$-Lipschitz) if

\[
\|\bm{f}(\zeta_{1})-\bm{f}(\zeta_{2})\|\leq\vartheta\|\zeta_{1}-\zeta_{2}\|,\quad\forall\zeta_{1},\,\zeta_{2}\in\mathbb{R}^{m}
\]

Let us consider a function $\phi\colon X\times Z\to\mathbb{R}$, where $X$ and $Z$ are nonempty subsets of $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$, respectively. A pair of vectors $x^{*}\in X$ and $z^{*}\in Z$ is called a saddle point of $\phi$ if $\phi(x^{*},\,z)\leq\phi(x^{*},\,z^{*})\leq\phi(x,\,z^{*})$ holds for any $x\in X$ and $z\in Z$.

3 Problem Formulation

In this paper, we focus on solving the following constrained optimization problem by a network of $N$ agents:

\[
\begin{split}
\min&\quad f(x)=\sum_{i=1}^{N}f_{i}(x)\\
\mbox{s.t.}&\quad x\in X\triangleq\bigcap_{i=1}^{N}X_{i}
\end{split}\tag{1}
\]

Here the function $f_{i}\colon\mathbb{R}\to\mathbb{R}$ and the set $X_{i}$ are private to agent $i$ for each $i\in\mathcal{N}\triangleq\{1,\,2,\,\dots,\,N\}$ and cannot be shared with others.

To ensure its solvability, the following assumption is made.

Assumption 1

For each $i\in\mathcal{N}$, the function $f_{i}\colon\mathbb{R}\to\mathbb{R}$ is convex, the set $X_{i}$ is convex and closed, and $\mathrm{int}\,X=\bigcap_{i\in\mathcal{N}}\mathrm{int}\,X_{i}$ is nonempty and contained in $\mathrm{dom}\,f_{i}$.

Note that $f_{i}$ might be nondifferentiable under this assumption. Denote the minimal value of problem (1) by $f^{*}$ and the optimal solution set by $\mathcal{X}^{*}$, i.e., $f^{*}=\min_{x\in X}f(x)$ and $\mathcal{X}^{*}=\{x\in X\mid f(x)=f^{*}\}$. As usual, we assume that $f^{*}$ is finite and the set $\mathcal{X}^{*}$ is nonempty. To cooperatively address the optimization problem (1) in a distributed manner, we use a weighted undirected graph $\mathcal{G}=(\mathcal{N},\,\mathcal{E},\,\mathcal{A})$ to describe the information-sharing relationships, with node set $\mathcal{N}$, edge set $\mathcal{E}\subset\mathcal{N}\times\mathcal{N}$, and weight matrix $\mathcal{A}=[a_{ij}]_{N\times N}$. Here $a_{ij}=a_{ji}>0$ means that agents $i$ and $j$ can communicate with each other.

Assumption 2

Graph $\mathcal{G}$ is connected.

Suppose agent $i$ maintains an estimate $x_{i}$ of the optimal solution to (1) along with other (possible) auxiliary variables. Agents exchange these variables through the communication network described by $\mathcal{G}$ and perform updates at given discrete-time instants $k=1,\,2,\,\dots$. The distributed optimization problem in this paper is then formulated as finding an update rule of $x_{i}(k)$ for agent $i$, using only its own and neighboring information, such that $\lim_{k\to\infty}[x_{i}(k)-x_{j}(k)]=0$ for any $i,\,j\in\mathcal{N}$ and $\lim_{k\to\infty}\sum_{i=1}^{N}f_{i}(x_{i}(k))=f^{*}$. If possible, we expect all the estimates to converge to an optimal solution of problem (1).

As stated above, this problem has been intensively studied in the literature [1, 2]. However, most existing designs require the exact subgradients of the local objective functions to construct effective distributed algorithms. In this paper, we are interested in the solvability of the formulated distributed constrained optimization problem (1) when working with inexact subgradients of the local objective functions. For this purpose, we adopt the notion of $\varepsilon$-subgradient to describe such inexactness as in [15] and assume that an $\varepsilon$-subgradient of $f_{i}$ can be easily computed for any given $\varepsilon\geq 0$.

Definition 1

For a convex function $f\colon\mathbb{R}^{m}\to\mathbb{R}$ and a scalar $\varepsilon>0$, a vector $g\in\mathbb{R}^{m}$ is said to be an $\varepsilon$-subgradient of $f$ at $x\in\mathbb{R}^{m}$ if

\[
f(y)\geq f(x)+g^{\intercal}(y-x)-\varepsilon,\quad\forall y\in\mathrm{dom}\,f
\]

Denote by $\partial_{\varepsilon}f(x)$ the $\varepsilon$-subdifferential of $f$ at $x\in\mathbb{R}^{m}$, i.e., the set of all $\varepsilon$-subgradients of $f$ at $x$. The set $\partial_{\varepsilon}f(x)$ is nonempty and convex for any $x\in\mathbb{R}^{m}$ due to the convexity of $f$. Moreover, $\partial_{0}f(x)$ coincides with the subdifferential of $f$ at $x\in\mathbb{R}^{m}$.
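For intuition, a one-dimensional sketch (an assumed example, not from the paper): for $f(x)=|x|$ at $x=0.5$, the scalar $g=0$ is not an exact subgradient (the subdifferential there is $\{1\}$), yet it is an $\varepsilon$-subgradient for any $\varepsilon\geq 0.5$:

```python
import numpy as np

f = abs
x, g, eps = 0.5, 0.0, 0.5

ys = np.linspace(-10, 10, 10001)
# eps-subgradient inequality holds everywhere on the grid...
assert all(f(y) >= f(x) + g * (y - x) - eps for y in ys)
# ...but the exact (eps = 0) inequality fails somewhere, e.g. near y = 0.
assert any(f(y) < f(x) + g * (y - x) for y in ys)
```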

In the next section, we convert our problem into a saddle-point seeking problem and develop a projected primal-dual $\varepsilon$-subgradient method with a rigorous solvability analysis.

4 Main Result

To begin with, we rewrite problem (1) in an alternative form as in [10, 16]:

\[
\begin{split}
\min~~&\tilde{f}({\bf x})=\sum_{i=1}^{N}f_{i}(x_{i})\\
\mbox{s.t.}~~&L{\bf x}={\bf 0}_{N}\\
&{\bf x}\in\tilde{X}\triangleq X_{1}\times\dots\times X_{N}
\end{split}\tag{2}
\]

where ${\bf x}=\mathrm{col}(x_{1},\,\dots,\,x_{N})$ and $L$ is the Laplacian of graph $\mathcal{G}$. Note that $L$ is symmetric and positive semi-definite with ordered eigenvalues $0=\lambda_{1}<\lambda_{2}\leq\dots\leq\lambda_{N}$ under Assumption 2 by Theorem 2.8 in [17].

Consider the augmented Lagrangian function of problem (2):

\[
\Phi({\bf x},\,{\bf v})=\tilde{f}({\bf x})+{\bf v}^{\intercal}L{\bf x}+\frac{1}{2}{\bf x}^{\intercal}L{\bf x}\tag{3}
\]

with ${\bf v}=\mathrm{col}(v_{1},\,\dots,\,v_{N})\in\mathbb{R}^{N}$. By Proposition 3.4.1 in [3], if $\Phi$ has a saddle point $({\bf x}^{*},\,{\bf v}^{*})$ in $\tilde{X}\times\mathbb{R}^{N}$, then ${\bf x}^{*}$ must be an optimal solution to problem (2), which in turn provides an optimal solution to (1). Since Slater's condition holds under Assumption 1, such saddle points indeed exist by virtue of Theorems 3.34 and 4.7 in [18]. Thus, it suffices to seek a saddle point of $\Phi$ in $\tilde{X}\times\mathbb{R}^{N}$.
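To make the saddle-point claim concrete, here is a small numerical check on assumed data (not the paper's example): two agents on a single edge with smooth quadratic costs $f_i(x)=(x-c_i)^2/2$, $c=(0,2)$, so the global minimizer is $x^*=1$ and ${\bf x}^*=(1,1)$. A multiplier ${\bf v}^*$ solving the stationarity condition $\nabla\tilde f({\bf x}^*)+L{\bf v}^*={\bf 0}$ then yields a saddle point of $\Phi$:

```python
import numpy as np

# Laplacian of the two-node graph with a unit-weight edge.
L = np.array([[1.0, -1.0], [-1.0, 1.0]])
c = np.array([0.0, 2.0])

# Augmented Lagrangian Phi(x, v) = f(x) + v^T L x + x^T L x / 2.
Phi = lambda x, v: 0.5 * np.sum((x - c) ** 2) + v @ L @ x + 0.5 * x @ L @ x

x_star = np.array([1.0, 1.0])
v_star = np.array([-0.5, 0.5])    # solves grad f(x*) + L v* = 0

rng = np.random.default_rng(1)
for _ in range(200):
    x, v = rng.normal(size=2) * 3, rng.normal(size=2) * 3
    assert Phi(x_star, v) <= Phi(x_star, v_star) + 1e-9   # sup over v at x*
    assert Phi(x, v_star) >= Phi(x_star, v_star) - 1e-9   # inf over x at v*
```

Note that $\Phi({\bf x}^*,\,{\bf v})$ is constant in ${\bf v}$ because $L{\bf x}^*={\bf 0}$, which is exactly why the first saddle inequality holds with equality.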

Following this conversion, many solvability results on problem (1) have been presented when the exact gradient or subgradient of $f_{i}$ is available, e.g., [19, 20, 16, 10, 21, 14]. However, whether and how $\varepsilon$-subgradient algorithms can be derived has not yet been discussed. To this end, motivated by the aforementioned saddle-point seeking designs, we present the following dynamics:

\[
\begin{split}
x_{i}(k+1)&=P_{X_{i}}[x_{i}(k)-\alpha_{k}(g_{i}(k)+\hat{x}_{i}(k)+\hat{v}_{i}(k))]\\
v_{i}(k+1)&=v_{i}(k)+\alpha_{k}\hat{x}_{i}(k)
\end{split}\tag{4}
\]

where $\hat{x}_{i}(k)\triangleq\sum_{j=1}^{N}a_{ij}(x_{i}(k)-x_{j}(k))$, $\hat{v}_{i}(k)\triangleq\sum_{j=1}^{N}a_{ij}(v_{i}(k)-v_{j}(k))$, and $g_{i}(k)\in\partial_{\varepsilon_{k}}f_{i}(x_{i}(k))$, with parameters $\varepsilon_{k},\,\alpha_{k}>0$ to be specified later. It can be viewed as a constrained version of the algorithms in [19, 20]. Different from similar primal-dual designs in [10, 16], we require neither the differentiability of the objective functions nor their exact gradients.

Letting ${\bf x}(k)=\mathrm{col}(x_{1}(k),\,\dots,\,x_{N}(k))$ and ${\bf v}(k)=\mathrm{col}(v_{1}(k),\,\dots,\,v_{N}(k))$, we can put (4) into the compact form:

\[
\begin{split}
{\bf x}(k+1)&=P_{\tilde{X}}[{\bf x}(k)-\alpha_{k}({\bf g}(k)+L{\bf v}(k)+L{\bf x}(k))]\\
{\bf v}(k+1)&={\bf v}(k)+\alpha_{k}L{\bf x}(k)
\end{split}\tag{5}
\]

with ${\bf g}(k)=\mathrm{col}(g_{1}(k),\,\dots,\,g_{N}(k))\in\partial_{N\varepsilon_{k}}\tilde{f}({\bf x}(k))\subset\mathbb{R}^{N}$. It can be further rewritten as follows:

\[
{\bf z}(k+1)=P_{\overline{X}}[{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))]\tag{6}
\]

where ${\bf z}(k)=\mathrm{col}({\bf x}(k),\,{\bf v}(k))$, $\overline{X}=\tilde{X}\times\mathbb{R}^{N}$, and

\[
T_{\varepsilon_{k}}({\bf z}(k))=\begin{bmatrix}{\bf g}(k)+L{\bf v}(k)+L{\bf x}(k)\\ -L{\bf x}(k)\end{bmatrix}
\]
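As a quick sanity check of iteration (4), the following minimal sketch runs it on an assumed toy instance (not the paper's numerical example): three agents on a path graph with $f_i(x)=|x-c_i|$, $c=(0,1,2)$, so the optimal solution is the median $x^*=1$, and heterogeneous boxes $X_i$ whose intersection contains $x^*$. The $\varepsilon_k$-subgradient is simulated by perturbing the exact sign subgradient with bounded noise.

```python
import numpy as np

N = 3
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
Lap = np.diag(A.sum(axis=1)) - A
c = np.array([0.0, 1.0, 2.0])
lo = np.array([-5.0, -4.0, -3.0])     # heterogeneous boxes X_i = [lo_i, hi_i]
hi = np.array([5.0, 4.0, 3.0])

rng = np.random.default_rng(2)
x = np.array([-4.0, 0.0, 3.0])
v = np.zeros(N)
for k in range(1, 30001):
    alpha = 1.0 / (k + 1) ** 0.6      # satisfies condition (11)
    eps = 1.0 / k                     # so that sum alpha_k * eps_k < inf
    # Inexact subgradient of |x_i - c_i|: exact sign plus bounded noise.
    g = np.sign(x - c) + eps * rng.uniform(-1, 1, N)
    x_hat, v_hat = Lap @ x, Lap @ v
    x = np.clip(x - alpha * (g + x_hat + v_hat), lo, hi)  # projected primal step
    v = v + alpha * x_hat                                 # dual step
```

After the run, the estimates should have (approximately) reached consensus near $x^*=1$; the tolerances below are loose since subgradient methods converge slowly.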

To establish the effectiveness of this algorithm, another assumption is made as follows.

Assumption 3

The $\varepsilon_{k}$-subgradient sequence $\{g_{i}(k)\}$ is uniformly bounded for each $i$, i.e., there exists a scalar $C>0$ such that $\max_{i\in\mathcal{N}}\{\|g_{i}(k)\|\}<C$ for all $k>0$.

This assumption is temporarily made for simplicity as in [22, 8] and will be removed later by means of some novel step sizes. Suppose ${\bf z}^{*}=\mathrm{col}({\bf x}^{*},\,{\bf v}^{*})$ is a saddle point of $\Phi$ in $\tilde{X}\times\mathbb{R}^{N}$. Here is a key lemma under Assumption 3.

Lemma 1

Suppose Assumptions 1–3 hold. Along the trajectory of algorithm (4), there exists some $C_{1}>0$ such that, for any $k=1,\,2,\,\dots$, the following inequality holds:

\[
\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq(1+C_{1}\alpha_{k}^{2})\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}\Delta({\bf x}(k))+2N\alpha_{k}\varepsilon_{k}+C_{1}\alpha_{k}^{2}\tag{7}
\]

where $\Delta({\bf x}(k))\triangleq\Phi({\bf x}(k),\,{\bf v}^{*})-\Phi({\bf x}^{*},\,{\bf v}^{*})+\frac{1}{2}{\bf x}(k)^{\intercal}L{\bf x}(k)\geq 0$.

Proof. Under the lemma's conditions, $({\bf x}^{*},\,{\bf v}^{*})$ is a saddle point of $\Phi$. Then, $\Delta({\bf x}(k))\geq 0$ can be easily verified from the definition of saddle points.

Next, we consider the evolution of $\|{\bf z}(k)-{\bf z}^{*}\|^{2}$ with respect to $k$. Under the iteration (6), it then follows that

\[
\begin{split}
\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}&=\|P_{\overline{X}}[{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))]-{\bf z}^{*}\|^{2}\\
&\leq\|{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))-{\bf z}^{*}\|^{2}-\|{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))-P_{\overline{X}}[{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))]\|^{2}\\
&\leq\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}({\bf z}(k)-{\bf z}^{*})^{\intercal}T_{\varepsilon_{k}}({\bf z}(k))+\alpha_{k}^{2}\|T_{\varepsilon_{k}}({\bf z}(k))\|^{2}
\end{split}\tag{8}
\]

By the properties of the saddle point and the $\varepsilon_{k}$-subgradient, we have

\[
\begin{split}
({\bf z}(k)-{\bf z}^{*})^{\intercal}T_{\varepsilon_{k}}({\bf z}(k))&=({\bf x}(k)-{\bf x}^{*})^{\intercal}({\bf g}(k)+L{\bf v}(k)+L{\bf x}(k))-({\bf v}(k)-{\bf v}^{*})^{\intercal}L{\bf x}(k)\\
&\geq\tilde{f}({\bf x}(k))-\tilde{f}({\bf x}^{*})-N\varepsilon_{k}+{{\bf v}^{*}}^{\intercal}L{\bf x}(k)+{\bf x}(k)^{\intercal}L{\bf x}(k)\\
&=\Phi({\bf x}(k),\,{\bf v}^{*})-\Phi({\bf x}^{*},\,{\bf v}^{*})+\frac{1}{2}{\bf x}(k)^{\intercal}L{\bf x}(k)-N\varepsilon_{k}\\
&=\Delta({\bf x}(k))-N\varepsilon_{k}
\end{split}\tag{9}
\]

Since $L{\bf x}^{*}={\bf 0}$, $T_{\varepsilon_{k}}({\bf z}(k))$ can be rewritten as

\[
T_{\varepsilon_{k}}({\bf z}(k))=\begin{bmatrix}{\bf g}(k)+L({\bf x}(k)-{\bf x}^{*})+L({\bf v}(k)-{\bf v}^{*})+L{\bf v}^{*}\\ -L({\bf x}(k)-{\bf x}^{*})\end{bmatrix}
\]

Under Assumption 3, there must exist a constant $C_{1}>0$ such that

\[
\|T_{\varepsilon_{k}}({\bf z}(k))\|^{2}\leq C_{1}(1+\|{\bf z}(k)-{\bf z}^{*}\|^{2})\tag{10}
\]

Putting the inequalities (8)–(10) together, we have

\[
\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq(1+C_{1}\alpha_{k}^{2})\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}\Delta({\bf x}(k))+2N\alpha_{k}\varepsilon_{k}+C_{1}\alpha_{k}^{2}
\]

which is exactly the expected inequality (7).  

When the exact subgradient is available (i.e., $\varepsilon_{k}=0$), the inequality (7) simplifies to the well-known supermartingale inequality, which ensures the convergence of ${\bf z}(k)$ towards ${\bf z}^{*}$ as shown in [3] if $\{\alpha_{k}\}$ is chosen to satisfy

\[
\sum_{k=1}^{\infty}\alpha_{k}=\infty,\quad\sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty\tag{11}
\]

However, the inexactness of the available subgradients deteriorates this property, and the expected convergence might fail when we use only $\varepsilon$-subgradients in the iteration (4).

Let us denote $\overline{\Delta}=\liminf_{k\to\infty}\Delta({\bf x}(k))$ and take a closer look at the inequality (7). Note that $\overline{\Delta}$ consists of two parts: the discrepancy of function values and the violation of constraints (in terms of consensus error) along the iterative sequence. It can thus be taken as a measure of suboptimality of the iterative sequence. In other words, we can determine an upper bound for $\overline{\Delta}$ to evaluate the effectiveness of algorithm (4).

We first consider the case when $\varepsilon_{k}$ is a constant.

Theorem 1

Suppose Assumptions 1–3 hold. Let the step size $\alpha_{k}$ be chosen to satisfy (11) and let $\varepsilon_{k}$ be fixed at some scalar $\varepsilon_{0}>0$. Then, along the trajectory of algorithm (4), it holds that

\[
0\leq\overline{\Delta}\leq N\varepsilon_{0}\tag{12}
\]

Proof. To prove this theorem, we only have to show that $\overline{\Delta}\leq N\varepsilon_{0}$. If this inequality does not hold, there must exist a $\delta>0$ and a sufficiently large integer $K_{1}>1$ such that $\Delta({\bf x}(k))>N\varepsilon_{0}+\delta$ for all $k\geq K_{1}$. By (11), we have $\lim_{k\to\infty}\alpha_{k}=0$. Thus, there must exist an integer $K_{2}>1$ such that $0<\alpha_{k}\leq\frac{\delta}{C_{1}}$ for all $k\geq K_{2}$. Bringing these conditions together, one can strengthen inequality (7) for all $k\geq K\triangleq\max\{K_{1},\,K_{2}\}$ as follows.

\[
\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq(1+C_{1}\alpha_{k}^{2})\|{\bf z}(k)-{\bf z}^{*}\|^{2}-\alpha_{k}\delta
\]

Iterating this inequality from $K$ to $\overline{K}>K$ gives

\[
\|{\bf z}(\overline{K}+1)-{\bf z}^{*}\|^{2}\leq\|{\bf z}(K)-{\bf z}^{*}\|^{2}\prod_{k=K}^{\overline{K}}(1+C_{1}\alpha_{k}^{2})-\delta\sum_{k=K}^{\overline{K}}\alpha_{k}
\]

where we use $1+C_{1}\alpha_{k}^{2}>1$ to handle the cross terms.

Note that $1+\theta\leq e^{\theta}$ for any $\theta>0$. Hence $\prod_{k=K}^{\overline{K}}(1+C_{1}\alpha_{k}^{2})\leq e^{C_{1}\sum_{k=K}^{\overline{K}}\alpha_{k}^{2}}\leq e^{C_{1}\sum_{k=1}^{\infty}\alpha_{k}^{2}}$. Under the condition (11), there must exist a positive scalar $\overline{C}>0$ such that

\[
\|{\bf z}(\overline{K}+1)-{\bf z}^{*}\|^{2}\leq\overline{C}\|{\bf z}(K)-{\bf z}^{*}\|^{2}-\delta\sum_{k=K}^{\overline{K}}\alpha_{k}
\]

which cannot hold for a sufficiently large $\overline{K}$ since $\sum_{k=1}^{\infty}\alpha_{k}=\infty$. We obtain a contradiction and complete the proof.

Remark 1

According to Theorem 1, one can generally obtain a suboptimal solution to problem (1) using inexact subgradients. If we are interested in an exact solution, it is required to ensure $\lim_{k\to\infty}\varepsilon_{k}=0$. In the special case when $\varepsilon_{k}=0$, this shows the effectiveness of our algorithm (4) in solving the formulated problem (1) with exact subgradients. This observation is consistent with the existing subgradient methods in [20, 11, 12, 13, 14].

With Theorem 1 in hand, it is natural to enforce a stronger condition on the error $\varepsilon_{k}$ to obtain better convergence of the entire sequence $\{x_{i}(k)\}$. Along this line, we provide another theorem under the assumption that the accumulated error of subgradient inexactness is not too large.

Theorem 2

Suppose Assumptions 1–3 hold. Let the parameters $\alpha_{k},\,\varepsilon_{k}>0$ be chosen to satisfy the following condition:

\[
\sum_{k=1}^{\infty}\alpha_{k}=\infty,\quad\sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty,\quad\sum_{k=1}^{\infty}\alpha_{k}\varepsilon_{k}<\infty\tag{13}
\]

Then, along the trajectory of algorithm (4), we have

1) the sequence $\{\|{\bf z}(k+1)-{\bf z}^{*}\|\}$ converges;

2) the estimates $x_{1}(k),\,\dots,\,x_{N}(k)$ reach an optimal consensus in the sense that $\lim_{k\to\infty}[x_{i}(k)-x_{j}(k)]=0$ and $\lim_{k\to\infty}\tilde{f}({\bf x}(k))=\tilde{f}({\bf x}^{*})=f^{*}$;

3) $\{{\bf z}(k)\}$ has at least one cluster point $\overline{\bf z}=\mathrm{col}(\overline{\bf x},\,\overline{\bf v})$ such that $\overline{\bf x}={\bf 1}_{N}x^{*}$ with $x^{*}$ an optimal solution to problem (1);

4) if the optimal solution to problem (1) is unique, i.e., $\mathcal{X}^{*}=\{x^{*}\}$, then $\lim_{k\to\infty}x_{i}(k)=x^{*}$ for each $i\in\mathcal{N}$.

Proof. Note that $\Delta({\bf x}(k))\geq 0$ by Lemma 1 and $\sum_{k=1}^{\infty}\alpha_{k}\varepsilon_{k}+\sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty$ under the theorem's assumption. Applying Lemma 5.31 in [23] to the inequality (7), we obtain the convergence of $\{\|{\bf z}(k+1)-{\bf z}^{*}\|\}$ and

\[
0\leq\sum_{k=1}^{\infty}\alpha_{k}\Delta({\bf x}(k))<\infty\tag{14}
\]

Thus, the sequence $\{{\bf z}(k)\}$ must be uniformly bounded by some $C_{2}>0$. Since the convex function $\Delta({\bf x})$ is continuous, it is $C_{3}$-Lipschitz with respect to ${\bf x}$ on this bounded set for some constant $C_{3}>0$. It then follows that

\[
\begin{split}
\Delta({\bf x}(k+1))-\Delta({\bf x}(k))&\leq C_{3}\|{\bf x}(k+1)-{\bf x}(k)\|\\
&=C_{3}\|P_{\tilde{X}}[{\bf x}(k)-\alpha_{k}({\bf g}(k)+L{\bf v}(k)+L{\bf x}(k))]-{\bf x}(k)\|\\
&\leq C_{3}\|{\bf x}(k)-\alpha_{k}({\bf g}(k)+L{\bf v}(k)+L{\bf x}(k))-{\bf x}(k)\|\\
&\leq C_{3}\alpha_{k}\|{\bf g}(k)+L{\bf v}(k)+L{\bf x}(k)\|\\
&\leq C_{3}(\sqrt{N}C+2\lambda_{\max}(L)C_{2})\alpha_{k}
\end{split}\tag{15}
\]

Jointly using (14), (15), and $\sum_{k=1}^{\infty}\alpha_{k}=\infty$, we resort to Proposition 2 in [24] and conclude that $\lim_{k\to\infty}\Delta({\bf x}(k))=0$. Recalling the expression of $\Delta$, we have $\lim_{k\to\infty}{\bf x}^{\intercal}(k)L{\bf x}(k)=0$ and $\lim_{k\to\infty}[\tilde{f}({\bf x}(k))+{{\bf v}^{*}}^{\intercal}L{\bf x}(k)]=\lim_{k\to\infty}\tilde{f}({\bf x}(k))=\tilde{f}({\bf x}^{*})$. Note that $L$ is positive semi-definite with $0$ as a simple eigenvalue under Assumption 2. It then follows that $\lim_{k\to\infty}[x_{i}(k)-x_{j}(k)]=0$ and $\lim_{k\to\infty}\tilde{f}({\bf x}(k))=\tilde{f}({\bf x}^{*})=f^{*}$.

Due to the uniform boundedness of the sequence $\{{\bf z}(k)\}$ by item 1), there must be a convergent subsequence $\{{\bf z}(k_{m})\}$ of $\{{\bf z}(k)\}$. We denote its limit by $\overline{\bf z}=\mathrm{col}(\overline{\bf x},\,\overline{\bf v})$. Then, it satisfies $L\overline{\bf x}={\bf 0}$ and $\tilde{f}(\overline{\bf x})=\lim_{m\to\infty}\tilde{f}({\bf x}(k_{m}))=f^{*}$ by item 2). In other words, $\overline{\bf x}$ is an optimal solution to (2). By Assumption 2, one can conclude that there exists some $\overline{x}\in\mathbb{R}$ such that $\overline{\bf x}={\bf 1}_{N}\overline{x}$. Note that $f(\overline{x})=\tilde{f}(\overline{\bf x})=f^{*}$, i.e., $\overline{x}$ is an exact optimal solution to problem (1).

If $\mathcal{X}^{*}=\{x^{*}\}$ holds, all convergent subsequences of $\{{\bf x}(k)\}$ have the same limit ${\bf 1}_{N}x^{*}$. This, combined with the boundedness of $\{{\bf z}(k)\}$, implies item 4) and completes the proof.
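As a numerical aside, one admissible choice in Theorem 2 is $\alpha_k=1/k$ with $\varepsilon_k=1/\sqrt{k}$ (an assumed choice for illustration, not one prescribed by the paper); the sketch below checks the three sums in (13) on partial sums, where $\sum\alpha_k\varepsilon_k=\sum k^{-3/2}$:

```python
import math

K = 10**6
S1 = sum(1.0 / k for k in range(1, K + 1))          # partial sum of alpha_k
S2 = sum(1.0 / k**2 for k in range(1, K + 1))       # partial sum of alpha_k^2
S3 = sum(1.0 / k**1.5 for k in range(1, K + 1))     # partial sum of alpha_k*eps_k

assert S1 > 13.0                  # ~ ln(K) + gamma: grows without bound
assert S2 < math.pi**2 / 6        # bounded by the convergent limit pi^2/6
assert S3 < 2.7                   # bounded by zeta(3/2) ~ 2.612
```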

Remark 2

This theorem specifies a nontrivial case in which our distributed optimization problem (1) can be exactly solved even using only inexact subgradient information of the local objective functions. This observation is consistent with the centralized results in [5]. Compared with similar primal-dual results in [19, 25, 21, 10, 16, 14], this algorithm further allows us to consider nonsmooth objective functions with only approximate subgradients.

It is known that normalization might improve the transient performance of subgradient algorithms by avoiding overshoots in the starting phase. However, conventional normalization techniques often involve some global information and cannot be directly implemented in distributed settings. We here present a novel componentwise normalized version of algorithm (4) as follows:

\[
\begin{split}
x_{i}(k+1)&=P_{X_{i}}\Big[x_{i}(k)-\frac{\alpha_{k}}{\max\{c,\,\delta_{ik,D}\}}(g_{i}(k)+\hat{x}_{i}(k)+\hat{v}_{i}(k))\Big]\\
v_{i}(k+1)&=v_{i}(k)+\frac{\alpha_{k}}{\max\{c,\,\delta_{ik,D}\}}\hat{x}_{i}(k)\\
\delta_{ik,\,m}&=\begin{cases}\|T^{i}_{\varepsilon_{k}}({\bf z}(k))\|,&\mbox{when }m=1\\ \max_{j\in\mathcal{N}_{i}}\{\delta_{ik,\,m-1},\,\delta_{jk,\,m-1}\},&\mbox{when }2\leq m\leq D\end{cases}
\end{split}\tag{16}
\]

where the integer $D\geq\mathrm{D}(\mathcal{G})+1$ with $\mathrm{D}(\mathcal{G})$ the diameter of graph $\mathcal{G}$, and $c>0$ is any given constant. Since $\mathrm{D}(\mathcal{G})$ (or an upper bound on it) can be computed by distributed rules [26], this normalized algorithm is implementable in a fully distributed manner by embedding a max-consensus subiteration.
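The embedded max-consensus subiteration can be sketched as follows (assumed toy data on a path graph with four nodes): after a number of rounds no smaller than the graph diameter, every node holds the network-wide maximum of the local values $\|T^{i}_{\varepsilon_k}({\bf z}(k))\|$, which is what the normalization relies on.

```python
import numpy as np

# Path graph 1-2-3-4 (diameter 3) and assumed local values ||T^i||.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
delta = np.array([3.0, 1.0, 4.0, 1.5])

D = 3                                        # >= diameter of the graph
vals = delta.copy()
for _ in range(D):
    # Each node takes the max over its closed neighborhood simultaneously.
    vals = np.array([max([vals[i]] + [vals[j] for j in neighbors[i]])
                     for i in range(4)])

assert np.all(vals == delta.max())           # every node holds the global max
```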

Here is a corollary stating that the normalized algorithm (16) retains all the properties established in Theorem 2.

Corollary 1

Suppose Assumptions 1 and 2 hold, and choose parameters satisfying condition (13). Then, along the trajectory of algorithm (16), the sequence $\{{\bf z}(k)\}$ satisfies:

1) the sequence $\{\|{\bf z}(k+1)-{\bf z}^{*}\|\}$ converges;

2) the estimates $x_{1}(k),\,\dots,\,x_{N}(k)$ reach an optimal consensus in the sense that $\lim_{k\to\infty}[x_{i}(k)-x_{j}(k)]=0$ and $\lim_{k\to\infty}\tilde{f}({\bf x}(k))=\tilde{f}({\bf x}^{*})=f^{*}$;

3) $\{{\bf z}(k)\}$ has at least one cluster point $\overline{\bf z}=\mathrm{col}(\overline{\bf x},\,\overline{\bf v})$ such that $\overline{\bf x}={\bf 1}_{N}x^{*}$ with $x^{*}$ an optimal solution to problem (1);

4) if the optimal solution to problem (1) is unique, i.e., $\mathcal{X}^{*}=\{x^{*}\}$, then $\lim_{k\to\infty}x_{i}(k)=x^{*}$ for each $i\in\mathcal{N}$.

Proof. The proof is similar to that of Theorem 2. First, by Theorem 4.1 in [27], after the max-consensus subiteration every agent obtains $\max\{c,\,\max_{i\in\mathcal{N}}\|T^{i}_{\varepsilon_{k}}({\bf z}(k))\|\}$. Thus, we only need to consider the following system:

$$\begin{split}x_{i}(k+1)&=P_{X_{i}}[x_{i}(k)-\alpha_{k}\gamma_{k}(g_{i}(k)+\hat{x}_{i}(k)+\hat{v}_{i}(k))]\\ v_{i}(k+1)&=v_{i}(k)+\alpha_{k}\gamma_{k}\hat{x}_{i}(k)\end{split}$$

with $\gamma_{k}=\frac{1}{\max\{c,\,\max_{i\in\mathcal{N}}\|T^{i}_{\varepsilon_{k}}({\bf z}(k))\|\}}$. For this new system, we can establish an inequality similar to (7) for $\|{\bf z}(k)-{\bf z}^{*}\|$:

$$\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}\gamma_{k}({\bf z}(k)-{\bf z}^{*})^{\intercal}T_{\varepsilon_{k}}({\bf z}(k))+\alpha_{k}^{2}\gamma_{k}^{2}\|T_{\varepsilon_{k}}({\bf z}(k))\|^{2}.$$

Note that $\gamma_{k}^{2}\|T_{\varepsilon_{k}}({\bf z}(k))\|^{2}\leq N$ and $0\leq\gamma_{k}\leq\frac{1}{c}$. Recalling the fact (9) and $\Delta({\bf x}(k))\geq 0$, one can obtain

$$\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}\gamma_{k}\Delta({\bf x}(k))+\frac{2N}{c}\alpha_{k}\varepsilon_{k}+N\alpha_{k}^{2}.$$

According to Lemma 5.3.1 in [23], we conclude the convergence of $\{\|{\bf z}(k+1)-{\bf z}^{*}\|\}$ and $\sum_{k=1}^{\infty}\alpha_{k}\gamma_{k}\Delta({\bf x}(k))<\infty$. Then, $\{\|{\bf z}(k)\|\}$ is uniformly bounded, which implies that there exists a sufficiently small constant $C_{4}>0$ such that $\gamma_{k}\geq C_{4}>0$. Using Proposition 2 in [24] again, we obtain $\lim_{k\to\infty}\Delta({\bf x}(k))=0$. Then items 3) and 4) can be verified by following a similar procedure as in Theorem 2. The proof is thus complete. $\blacksquare$

Remark 3

Compared with the conventional normalized step sizes in [18, 28], the proposed componentwise normalized step size can be viewed as their distributed extension, and the iterative sequence generated by (16) may exhibit better transient behavior than that generated by (4). Interestingly, the widely used subgradient boundedness assumption (i.e., Assumption 3) is also removed as a byproduct, which is favorable in distributed scenarios.

5 Simulation

Figure 1: Communication graph 𝒢\mathcal{G} in our example.

In this section, we consider a LASSO (least absolute shrinkage and selection operator) regression problem to verify the effectiveness of our algorithms:

$$\min_{x\in X}f(x)=\frac{1}{2}\sum_{i=1}^{N}\|x-p_{i}\|^{2}+\lambda N\|x\|_{1}$$

where $\lambda N>0$ is the regularization parameter, $p_{i}$ is an estimate known only to agent $i$, and $X$ is the constraint set. Letting $X_{i}=X$ and $f_{i}(x)=\frac{1}{2}\|x-p_{i}\|^{2}+\lambda\|x\|_{1}$ puts it into the form of problem (1). Distributed subgradient algorithms have been developed to solve this problem when $\partial f_{i}$ is available, e.g., [12]. Although both $\partial f_{i}$ and $\partial_{\varepsilon}f_{i}$ can be easily calculated here, we use this example to show the effectiveness of our algorithm using only the $\varepsilon$-subdifferential instead of the exact one.

According to the definition of ε\varepsilon-subgradient, we have

εfi(x)={[xpiλ,xpiλλεx]for x<ε2,[xpiλ,xpi+λ]for x[ε2,ε2],[xpi+λλεx,xpi+λ]for x>ε2\partial_{\varepsilon}f_{i}(x)=\begin{cases}[x-p_{i}-\lambda,x-p_{i}-\lambda-\frac{\lambda\varepsilon}{x}]&\text{for }x<-\frac{\varepsilon}{2},\\ [x-p_{i}-\lambda,x-p_{i}+\lambda]&\text{for }x\in[-\frac{\varepsilon}{2},\frac{\varepsilon}{2}],\\ [x-p_{i}+\lambda-\frac{\lambda\varepsilon}{x},x-p_{i}+\lambda]&\text{for }x>\frac{\varepsilon}{2}\end{cases}
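As an illustration in the scalar case, one element of each interval above can be selected as follows (the function names are ours, not the paper's). Since only the $\lambda\|x\|_{1}$ term is approximated, with accuracy $\lambda\varepsilon$, the returned $g$ satisfies $f_{i}(y)\geq f_{i}(x)+g(y-x)-\lambda\varepsilon$ for all $y$, which the accompanying check verifies numerically.

```python
def f_i(y, p_i, lam):
    """Local objective f_i(y) = 0.5*(y - p_i)^2 + lam*|y| (scalar case)."""
    return 0.5 * (y - p_i) ** 2 + lam * abs(y)

def eps_subgradient(x, p_i, lam, eps):
    """Return one eps-subgradient of f_i, following the three-case
    selection from the formula above (the endpoints involving lam*eps/x)."""
    if x < -eps / 2:
        return x - p_i - lam - lam * eps / x
    elif x <= eps / 2:
        return x - p_i + lam     # middle case: no division, safe at x = 0
    else:
        return x - p_i + lam - lam * eps / x
```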
Figure 2: Profiles of primal and dual variables in Algorithm (4).
Figure 3: Profiles of primal and dual variables in Algorithm (16).

For simulations, we let $N=4$, $p_{i}=2i$, and $\lambda=0.1$. The communication graph is given in Fig. 1 with unit weights. To make the problem more interesting, we assume it has set constraints specified by $X_{i}=[-11+i,\,8-i]$ for $i=1,\,2,\,3,\,4$. Assumptions 1–2 can be verified. To ensure condition (13), we choose $\alpha_{k}=\varepsilon_{k}=\frac{3}{k+1}$. Then the considered problem can be solved by our proposed distributed primal-dual $\varepsilon$-subgradient method (PD$\varepsilon$SM) (4) and the normalized primal-dual $\varepsilon$-subgradient method (NPD$\varepsilon$SM) (16) according to Theorem 2 and Corollary 1. In the simulations, we choose the $\varepsilon$-subgradient $x-p_{i}-\lambda-\frac{\lambda\varepsilon}{x}$ when $x<-\frac{\varepsilon}{2}$, $x-p_{i}+\lambda$ when $-\frac{\varepsilon}{2}\leq x\leq\frac{\varepsilon}{2}$, and $x-p_{i}+\lambda-\frac{\lambda\varepsilon}{x}$ when $x>\frac{\varepsilon}{2}$.

Figure 4: Profiles of residual errors in Algorithms (4) and (16).

Simulation results with ${\bf x}(1)=\mathrm{col}(1,\,0,\,5,\,-1)$ and $c=0.1$ for (4) and (16) are shown in Figs. 2 and 3, where all agents' primal variables are observed to converge to the global optimal solution $x^{*}=4$ while the dual variables remain bounded and converge. This verifies the effectiveness of our algorithms. Moreover, although the normalized step size in (16) might slow down the convergence compared with (4), the transient performance of the primal and dual variables is much improved, with fewer and weaker oscillations. For a clearer comparison, we let $e(k)=\frac{\|{\bf x}(k)-{\bf 1}_{4}x^{*}\|}{\|{\bf x}(1)-{\bf 1}_{4}x^{*}\|}$ be the residual error of our algorithms. The profiles of $e(k)$ for both algorithms are shown in Fig. 4, which confirms the improvement in transient performance achieved by the proposed componentwise normalized step size.
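The simulation setup above can be reproduced with a short script. The following is a minimal sketch of the unnormalized iteration (4) on this example; since Fig. 1 is not reproduced here, the ring topology, the use of Laplacian disagreements $\sum_{j\in\mathcal{N}_{i}}(x_{i}-x_{j})$ for $\hat{x}_{i}(k)$ and $\hat{v}_{i}(k)$, and all variable names are assumptions made for illustration.

```python
import numpy as np

N, lam = 4, 0.1
p = np.array([2.0, 4.0, 6.0, 8.0])            # p_i = 2i
lo = np.array([-10.0, -9.0, -8.0, -7.0])      # X_i = [-11 + i, 8 - i]
hi = np.array([7.0, 6.0, 5.0, 4.0])
A = np.array([[0., 1., 0., 1.],               # assumed ring 1-2-3-4-1, unit weights
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A                # graph Laplacian

def eps_subgrad(x, eps):
    """Componentwise eps-subgradient choice used in the simulations."""
    g = np.empty(N)
    for i in range(N):
        if x[i] < -eps / 2:
            g[i] = x[i] - p[i] - lam - lam * eps / x[i]
        elif x[i] <= eps / 2:
            g[i] = x[i] - p[i] + lam
        else:
            g[i] = x[i] - p[i] + lam - lam * eps / x[i]
    return g

x = np.array([1.0, 0.0, 5.0, -1.0])           # x(1)
v = np.zeros(N)
e1 = np.linalg.norm(x - 4.0)                  # x* = 4 under the constraints
for k in range(1, 50001):
    alpha = eps = 3.0 / (k + 1)
    g = eps_subgrad(x, eps)
    x_next = np.clip(x - alpha * (g + L @ x + L @ v), lo, hi)  # projection P_{X_i}
    v = v + alpha * (L @ x)                   # dual update uses x(k), not x(k+1)
    x = x_next
residual = np.linalg.norm(x - 4.0) / e1       # normalized residual error e(k)
```

Note that the constrained optimum is indeed $x^{*}=4$: the unconstrained minimizer of $f$ is $4.9$ (from $4x-20+0.4=0$), which is clipped by agent 4's constraint set $X_{4}=[-7,\,4]$.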

6 Conclusion

In this paper, we have addressed a distributed constrained optimization problem with inexact subgradient information of the local objective functions. We have developed a projected primal-dual dynamics using only $\varepsilon$-subgradients and discussed its convergence properties. In particular, we have shown the exact solvability of this problem when the accumulated error introduced by subgradient inexactness is not too large. We have also presented a novel distributed normalized step size to improve the transient performance of our algorithms. Extending these results to more general graphs is an interesting direction for future work.

References

  • [1] Nedić A, Liu J, Distributed optimization for control, Annual Review of Control, Robotics, and Autonomous Systems, 2018, 1: 77–103.
  • [2] Yang T, Yi X, Wu J, et al., A survey of distributed optimization, Annual Reviews in Control, 2019, 47: 278–305.
  • [3] Bertsekas D, Convex Optimization Algorithms, Athena Scientific, Belmont, 2015.
  • [4] Devolder O, Glineur F, Nesterov Y, First-order methods of smooth convex optimization with inexact oracle, Mathematical Programming, 2014, 146(1-2): 37–75.
  • [5] Kiwiel K, Convergence of approximate and incremental subgradient methods for convex optimization, SIAM Journal on Optimization, 2004, 14(3): 807–840.
  • [6] Jakovetić D, Bajović D, Xavier J, Moura J, Primal–Dual Methods for Large-Scale and Distributed Convex Optimization and Data Analytics, Proceedings of the IEEE, 2020, 108(11): 1923–1938.
  • [7] Nedić A, Ozdaglar A, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, 2009, 54(1):48–61.
  • [8] Jakovetić D, Moura J, Xavier J, Linear convergence rate of a class of distributed augmented Lagrangian algorithms, IEEE Transactions on Automatic Control, 2014, 60(4): 922–936.
  • [9] Yi P, Hong Y, Liu F, Distributed gradient algorithm for constrained optimization with application to load sharing in power systems, Systems & Control Letters, 2015, 83: 45–52.
  • [10] Lei J, Chen H, Fang H, Primal–dual algorithm for distributed constrained optimization, Systems & Control Letters, 2016, 96:110–117.
  • [11] Xi C, Khan U, Distributed subgradient projection algorithm over directed graphs, IEEE Transactions on Automatic Control, 2016, 62(8):3986–3992.
  • [12] Liu S, Qiu Z, Xie L, Convergence rate analysis of distributed optimization with projected subgradient algorithm, Automatica, 2017, 83:162–169.
  • [13] Zeng X, Yi P, Hong Y, Distributed continuous-time algorithm for constrained convex optimizations via nonsmooth analysis approach, IEEE Transactions on Automatic Control, 2017, 62(10): 5227–5233.
  • [14] Zhu K, Zhu H, Tang Y, On the Boundedness of Subgradients in Distributed Optimization, Proceedings of 39th Chinese Control Conference (CCC), Shenyang, 2020, 4912–4917.
  • [15] Polyak B, Introduction to Optimization, Optimization Software Inc., New York, 1987.
  • [16] Liu Q, Yang S, Hong Y, Constrained consensus algorithms with fixed step size for distributed convex optimization over multiagent networks, IEEE Transactions on Automatic Control, 2017, 62(8): 4259–4265.
  • [17] Mesbahi M, Egerstedt M, Graph Theoretic Methods in Multiagent Networks, Princeton University Press, Princeton, 2010.
  • [18] Ruszczynski A, Nonlinear Optimization, Princeton University Press, Princeton, 2011.
  • [19] Wang J, Elia N, Control approach to distributed optimization, Proceedings of 48th Annual Allerton Conference on Communication, Control, and Computing, Monticello, 2010, 557–561.
  • [20] Gharesifard B, Cortés J, Distributed continuous-time convex optimization on weight-balanced digraphs, IEEE Transactions on Automatic Control, 2014, 59(3): 781–786.
  • [21] Kia S, Cortés J, Martínez S, Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication, Automatica, 2015, 55: 254–264.
  • [22] Nedić A, Ozdaglar A, Subgradient methods for saddle-point problems, Journal of Optimization Theory & Applications, 2009, 142(1): 205–228.
  • [23] Bauschke H, Combettes P, Convex Analysis and Monotone Operator Theory in Hilbert Spaces (2nd ed.), Springer, Cham, 2017.
  • [24] Alber Y, Iusem A, Solodov M, On the projected subgradient method for nonsmooth convex optimization in a Hilbert space, Mathematical Programming, 1998, 81(1): 23–35.
  • [25] Jakovetić D, Xavier J, Moura J, Fast distributed gradient methods, IEEE Transactions on Automatic Control, 2014, 59(5): 1131–1146.
  • [26] Oliva G, Setola R, Hadjicostis C, Distributed finite-time average-consensus with limited computational and storage capability, IEEE Transactions on Control of Network Systems, 2016, 4(2): 380–391.
  • [27] Nejad B, Attia S, Raisch J, Max-consensus in a max-plus algebraic setting: The case of fixed communication topologies, Proceedings of 2009 XXII International Symposium on Information, Communication and Automation Technologies, Sarajevo, 2009, 1–7.
  • [28] Boyd S, Mutapcic A, Subgradient Methods, Notes for EE364b, Stanford University, 2008.