
Primal-dual $\varepsilon$-Subgradient Method for Distributed Optimization. This work was supported by the National Natural Science Foundation of China under Grant 61973043.

Kui Zhu and Yutao Tang K. Zhu and Y. Tang are both with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mails: [email protected], [email protected]).

Abstract: This paper studies the distributed optimization problem when the objective functions may be nondifferentiable and subject to heterogeneous set constraints. Unlike existing subgradient methods, we focus on the case where the exact subgradients of the local objective functions cannot be accessed by the agents. To solve this problem, we propose a projected primal-dual dynamics using only approximate subgradients of the objective functions. We first prove that the formulated optimization problem can generally be solved with an error depending upon the accuracy of the available subgradients. Then, we show that this distributed optimization problem is exactly solvable when the accumulated approximation error of the inexact subgradients is not too large. After that, we give a novel componentwise normalized variant to improve the transient behavior of the convergent sequence. The effectiveness of our algorithms is verified by a numerical example.

Keywords: Distributed optimization, $\varepsilon$-subgradient, constrained optimization, primal-dual dynamics

1 Introduction

The last decade has witnessed considerable interest in distributed optimization problems due to their numerous applications in signal processing, control, and machine learning. To solve such problems, subgradient information of the objective functions has been widely used due to its cheap iteration cost and well-established convergence properties [1, 2].

Note that most of these subgradient-based results assume the availability of exact subgradients of the local cost functions. In many circumstances, the subgradient is computed by solving another auxiliary optimization problem, as shown in [3, 5, 4]. In practice, we are often only able to solve these subproblems approximately. Hence, in that context, numerical methods for the original optimization problem are provided with only inexact subgradient information. This leads us to investigate the solvability of the distributed optimization problem using inexact subgradient information.

A closely related topic is the inexact augmented Lagrangian method. As surveyed in [6], this method has been extensively extended to distributed settings in various ways under the assumption that the primal variables are only obtained approximately. Nevertheless, most of these results still require exact gradient or subgradient information of the local objective functions at each given estimate. It is thus interesting to ask whether the primal-dual method is still effective when only inexact gradient or subgradient information is available.

In this paper, we focus on a typical distributed consensus optimization problem for a sum of convex objective functions subject to heterogeneous set constraints. Although this problem has been partially studied by gradient/subgradient methods in [7, 8, 9, 10, 11, 12, 13, 14], its solvability using only inexact subgradient information has not yet been addressed.

To solve this problem, we first convert it into a distributed saddle-point seeking problem and present a projected primal-dual $\varepsilon$-subgradient dynamics to handle both the distributedness and the constraints. When the objective functions are smooth with exact gradients, the proposed algorithm reduces to the primal-dual dynamics considered in [10, 8]. Then, we discuss the convergence properties under diminishing step sizes and the suboptimality of the proposed algorithm as a function of the accuracy of the available subgradients. In particular, we show that if the accumulated error resulting from the subgradient inexactness is not too large, the proposed algorithm under certain diminishing step sizes drives the estimates of all agents to a consensus on an optimal solution of the global optimization problem. To our knowledge, this might be the first attempt to solve the formulated distributed optimization problem using only inexact subgradients of the local objective functions. To improve the transient performance of the preceding designs, we further propose a novel componentwise normalized step size similar to that in [14]. As a byproduct, this normalized step size removes the subgradient boundedness assumption widely used in the literature [7, 8].

The rest of this paper is organized as follows. We first give some preliminaries in Section 2 and then introduce the formulation of our problem in Section 3. Main results are presented in Section 4. After that, we give a numerical example in Section 5 to show the effectiveness of our design. Finally, some concluding remarks are given in Section 6.

2 Preliminaries

In this section, we give some preliminaries on graph theory and convex analysis.

2.1 Graph Theory

Let $\mathbb{R}^{n}$ be the $n$-dimensional Euclidean space and $\mathbb{R}^{n\times m}$ the set of all $n\times m$ real matrices. $\mathbf{1}_{n}$ (or $\mathbf{0}_{n}$) denotes the $n$-dimensional all-one (or all-zero) column vector, and $\mathbf{1}_{n\times m}$ (or $\mathbf{0}_{n\times m}$) the all-one (or all-zero) $n\times m$ matrix. $\mathrm{col}(a_{1},\,\dots,\,a_{n})=[a_{1}^{\intercal},\,\dots,\,a_{n}^{\intercal}]^{\intercal}$ for column vectors $a_{1},\,\dots,\,a_{n}$. For a vector $x$ (or a matrix $A$), $\|x\|$ (or $\|A\|$) denotes its Euclidean (or spectral) norm.

A weighted (undirected) graph $\mathcal{G}=(\mathcal{N},\,\mathcal{E},\,\mathcal{A})$ is defined as follows: $\mathcal{N}=\{1,\,\dots,\,n\}$ is the set of nodes, $\mathcal{E}\subset\mathcal{N}\times\mathcal{N}$ is the set of edges, and $\mathcal{A}=[a_{ij}]\in\mathbb{R}^{n\times n}$ is the weighted adjacency matrix, where $a_{ii}=0$ and $a_{ij}\geq 0$, with $a_{ij}=a_{ji}>0$ if and only if there is an edge between nodes $i$ and $j$. An element $(i,\,j)\in\mathcal{E}$ denotes an edge leaving node $i$ and entering node $j$. The neighbor set of agent $i$ is defined as $\mathcal{N}_{i}=\{j\colon(j,\,i)\in\mathcal{E}\}$ for $i=1,\,\dots,\,n$. A path in graph $\mathcal{G}$ is an alternating sequence $i_{1}e_{1}i_{2}e_{2}\cdots e_{k-1}i_{k}$ of nodes $i_{l}$ and edges $e_{m}=(i_{m},\,i_{m+1})\in\mathcal{E}$. If there exists a path from node $i$ to node $j$, then node $j$ is said to be reachable from node $i$. The Laplacian $L=[l_{ij}]\in\mathbb{R}^{n\times n}$ of graph $\mathcal{G}$ is defined by $l_{ii}=\sum_{j\neq i}a_{ij}$ and $l_{ij}=-a_{ij}$ for $j\neq i$. The Laplacian is symmetric and positive semi-definite. Denote its ordered eigenvalues by $0=\lambda_{1}\leq\lambda_{2}\leq\dots\leq\lambda_{n}$. The eigenvector corresponding to $\lambda_{1}=0$ is the all-one vector $\mathbf{1}_{n}$. Moreover, $\lambda_{2}>0$ if and only if the graph $\mathcal{G}$ is connected.
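As an illustration (a minimal numpy sketch, not part of the paper), the Laplacian facts above can be checked on a path graph with four nodes and unit weights:

```python
import numpy as np

# Build the adjacency matrix of the undirected path graph 1-2-3-4.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0          # a_ij = a_ji > 0 iff {i, j} is an edge

# Laplacian: l_ii = sum_{j != i} a_ij, l_ij = -a_ij.
L = np.diag(A.sum(axis=1)) - A

assert np.allclose(L, L.T)                       # symmetric
eigvals = np.sort(np.linalg.eigvalsh(L))
assert abs(eigvals[0]) < 1e-9                    # lambda_1 = 0
assert eigvals[1] > 0                            # connected => lambda_2 > 0
assert np.allclose(L @ np.ones(4), np.zeros(4))  # 1_n is the eigenvector of 0
```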

2.2 Convex analysis

For a set $X\subset\mathbb{R}^{m}$, the indicator function $\delta_{X}$ is defined by $\delta_{X}(x)=0$ for any $x\in X$ and $\delta_{X}(x)=\infty$ for any $x\notin X$. A set $X\subset\mathbb{R}^{m}$ is said to be convex if $\theta x+(1-\theta)y\in X$ for any $x,\,y\in X$ and $\theta\in(0,\,1)$. For a nonempty closed convex set $X$, the projection operator $P_{X}\colon\mathbb{R}^{m}\to X$ is defined as $P_{X}[x]=\arg\min_{y\in X}\|y-x\|$. The projection operator is non-expansive in the sense that $\|P_{X}[x]-P_{X}[y]\|\leq\|x-y\|$ for any $x,\,y\in\mathbb{R}^{m}$. Moreover, for any $x\in\mathbb{R}^{m}$, it holds that $(x-P_{X}[x])^{\intercal}(y-P_{X}[x])\leq 0$ for all $y\in X$.
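The two projection properties quoted above are easy to verify numerically for a box constraint, where $P_X$ reduces to a componentwise clip. A small sketch under this assumption:

```python
import numpy as np

# Projection onto the box X = [lo, hi]^3 is a componentwise clip.
lo, hi = -1.0, 2.0
P = lambda x: np.clip(x, lo, hi)

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.normal(size=3) * 5, rng.normal(size=3) * 5
    # Non-expansiveness: ||P[x] - P[y]|| <= ||x - y||.
    assert np.linalg.norm(P(x) - P(y)) <= np.linalg.norm(x - y) + 1e-12
    # Variational inequality with an arbitrary point y_in of X.
    y_in = P(y)
    assert (x - P(x)) @ (y_in - P(x)) <= 1e-12
```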

For a function $f\colon\mathbb{R}^{m}\to\mathbb{R}$, we denote by $\mathrm{dom}\,f=\{x\in\mathbb{R}^{m}\mid|f(x)|<\infty\}$ the domain of $f$. We always assume $\mathrm{dom}\,f\neq\emptyset$, and $\mathrm{dom}\,f=\mathbb{R}^{m}$ if not specified otherwise. The function $f$ is convex if its domain is convex and $f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y)$ holds for all $x,\,y\in\mathrm{dom}\,f$ and $\alpha\in[0,\,1]$. If this inequality is strict whenever $x\neq y$ and $\alpha\in(0,\,1)$, the function is called strictly convex. A function $f$ is called closed and convex on a convex set $X\subset\mathrm{dom}\,f$ if its constrained epigraph $\mathrm{epi}_{X}(f)=\{(x,\,t)\in X\times\mathbb{R}\mid t\geq f(x)\}$ is a closed convex set. If $X=\mathrm{dom}\,f$, we call $f$ a closed convex function.

A vector-valued function $\bm{f}\colon\mathbb{R}^{m}\rightarrow\mathbb{R}^{m}$ is Lipschitz with constant $\vartheta>0$ (or simply $\vartheta$-Lipschitz) if

\[
\|\bm{f}(\zeta_{1})-\bm{f}(\zeta_{2})\|\leq\vartheta\|\zeta_{1}-\zeta_{2}\|,\quad\forall\zeta_{1},\,\zeta_{2}\in\mathbb{R}^{m}
\]

Let us consider a function $\phi\colon X\times Z\to\mathbb{R}$, where $X$ and $Z$ are nonempty subsets of $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$, respectively. A pair of vectors $x^{*}\in X$ and $z^{*}\in Z$ is called a saddle point of $\phi$ if $\phi(x^{*},\,z)\leq\phi(x^{*},\,z^{*})\leq\phi(x,\,z^{*})$ holds for any $x\in X$ and $z\in Z$.

3 Problem Formulation

In this paper, we focus on solving the following constrained optimization problem by a network of $N$ agents:

\[
\begin{split}
\min&\quad f(x)=\sum_{i=1}^{N}f_{i}(x)\\
\mbox{s.t.}&\quad x\in X\triangleq\bigcap_{i=1}^{N}X_{i}
\end{split}\tag{1}
\]

Here the function $f_{i}\colon\mathbb{R}\to\mathbb{R}$ and the set $X_{i}$ are private to agent $i$ for each $i\in\mathcal{N}\triangleq\{1,\,2,\,\dots,\,N\}$ and cannot be shared with others.

To ensure its solvability, the following assumption is made.

Assumption 1

For each $i\in\mathcal{N}$, the function $f_{i}\colon\mathbb{R}\to\mathbb{R}$ is convex, the set $X_{i}$ is convex and closed, and $\mathrm{int}\,X=\bigcap_{i\in\mathcal{N}}\mathrm{int}\,X_{i}$ is nonempty and contained in $\mathrm{dom}\,f_{i}$.

Note that $f_{i}$ might be nondifferentiable under this assumption. Denote the minimal value of problem (1) by $f^{*}$ and the optimal solution set by $\mathcal{X}^{*}$, i.e., $f^{*}=\min_{x\in X}f(x)$ and $\mathcal{X}^{*}=\{x\in X\mid f(x)=f^{*}\}$. As usual, we assume that $f^{*}$ is finite and the set $\mathcal{X}^{*}$ is nonempty. To cooperatively address the optimization problem (1) in a distributed manner, we use a weighted undirected graph $\mathcal{G}=(\mathcal{N},\,\mathcal{E},\,\mathcal{A})$ to describe the information-sharing relationships, with node set $\mathcal{N}$, edge set $\mathcal{E}\subset\mathcal{N}\times\mathcal{N}$, and weight matrix $\mathcal{A}=[a_{ij}]_{N\times N}$. Here $a_{ij}=a_{ji}>0$ means that agents $i$ and $j$ can communicate with each other.

Assumption 2

Graph $\mathcal{G}$ is connected.

Suppose agent $i$ maintains an estimate $x_{i}$ of the optimal solution to (1) along with other (possible) auxiliary variables. Agents exchange these variables through the communication network described by $\mathcal{G}$ and perform updates at given discrete-time instants $k=1,\,2,\,\dots$. The distributed optimization problem in this paper is then formulated as finding an update rule of $x_{i}(k)$ for agent $i$, using only its own and neighboring information, such that $\lim_{k\to\infty}[x_{i}(k)-x_{j}(k)]=0$ for any $i,\,j\in\mathcal{N}$ and $\lim_{k\to\infty}\sum_{i=1}^{N}f_{i}(x_{i}(k))=f^{*}$. If possible, we expect all the estimates to converge to an optimal solution of problem (1).

As stated above, this problem has been intensively studied in the literature [1, 2]. However, most existing designs require the exact subgradients of the local objective functions to construct effective distributed algorithms. In this paper, we are interested in the solvability of the formulated distributed constrained optimization problem (1) when working with inexact subgradients of the local objective functions. For this purpose, we adopt the notion of $\varepsilon$-subgradient to describe such inexactness as in [15] and assume that an $\varepsilon$-subgradient of $f_{i}$ can be easily computed for any given $\varepsilon\geq 0$.

Definition 1

For a convex function $f\colon\mathbb{R}^{m}\to\mathbb{R}$ and a scalar $\varepsilon>0$, a vector $g\in\mathbb{R}^{m}$ is said to be an $\varepsilon$-subgradient of $f$ at $x\in\mathbb{R}^{m}$ if

\[
f(y)\geq f(x)+g^{\intercal}(y-x)-\varepsilon,\quad\forall y\in\mathrm{dom}\,f
\]

Denote by $\partial_{\varepsilon}f(x)$ the $\varepsilon$-subdifferential of $f$ at $x\in\mathbb{R}^{m}$, i.e., the set of all $\varepsilon$-subgradients of $f$ at $x$. The set $\partial_{\varepsilon}f(x)$ is nonempty and convex for any $x\in\mathbb{R}^{m}$ due to the convexity of $f$. Moreover, $\partial_{0}f(x)$ coincides with the subdifferential of $f$ at $x\in\mathbb{R}^{m}$.
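For intuition, a one-dimensional sketch (an assumed example, not from the paper): for $f(x)=|x|$ at $x=0.5$, the scalar $g=0$ is not an exact subgradient (the subdifferential there is $\{1\}$), yet it is an $\varepsilon$-subgradient for any $\varepsilon\geq 0.5$:

```python
import numpy as np

f = abs
x, g, eps = 0.5, 0.0, 0.5

ys = np.linspace(-10, 10, 10001)
# eps-subgradient inequality holds everywhere on the grid...
assert all(f(y) >= f(x) + g * (y - x) - eps for y in ys)
# ...but the exact (eps = 0) inequality fails somewhere, e.g. near y = 0.
assert any(f(y) < f(x) + g * (y - x) for y in ys)
```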

In the next section, we convert our problem into a saddle-point seeking problem and develop a projected primal-dual $\varepsilon$-subgradient method with a rigorous solvability analysis.

4 Main Result

To begin with, we rewrite problem (1) in an alternative form as in [10, 16]:

\[
\begin{split}
\min~~&\tilde{f}({\bf x})=\sum_{i=1}^{N}f_{i}(x_{i})\\
\mbox{s.t.}~~&L{\bf x}={\bf 0}_{N}\\
&{\bf x}\in\tilde{X}\triangleq X_{1}\times\dots\times X_{N}
\end{split}\tag{2}
\]

where ${\bf x}=\mathrm{col}(x_{1},\,\dots,\,x_{N})$ and $L$ is the Laplacian of graph $\mathcal{G}$. Note that $L$ is symmetric and positive semi-definite with ordered eigenvalues $0=\lambda_{1}<\lambda_{2}\leq\dots\leq\lambda_{N}$ under Assumption 2 by Theorem 2.8 in [17].

Consider the augmented Lagrangian function of problem (2):

\[
\Phi({\bf x},\,{\bf v})=\tilde{f}({\bf x})+{\bf v}^{\intercal}L{\bf x}+\frac{1}{2}{\bf x}^{\intercal}L{\bf x}\tag{3}
\]

with ${\bf v}=\mathrm{col}(v_{1},\,\dots,\,v_{N})\in\mathbb{R}^{N}$. By Proposition 3.4.1 in [3], if $\Phi$ has a saddle point $({\bf x}^{*},\,{\bf v}^{*})$ in $\tilde{X}\times\mathbb{R}^{N}$, then ${\bf x}^{*}$ must be an optimal solution to problem (2), which in turn provides an optimal solution to (1). Since Slater's condition holds under Assumption 1, such saddle points indeed exist by virtue of Theorems 3.34 and 4.7 in [18]. Thus, it suffices to seek a saddle point of $\Phi$ in $\tilde{X}\times\mathbb{R}^{N}$.
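To make the saddle-point claim concrete, here is a small numerical check on assumed data (not the paper's example): two agents on a single edge with smooth quadratic costs $f_i(x)=(x-c_i)^2/2$, $c=(0,2)$, so the global minimizer is $x^*=1$ and ${\bf x}^*=(1,1)$. A multiplier ${\bf v}^*$ solving the stationarity condition $\nabla\tilde f({\bf x}^*)+L{\bf v}^*={\bf 0}$ then yields a saddle point of $\Phi$:

```python
import numpy as np

# Laplacian of the two-node graph with a unit-weight edge.
L = np.array([[1.0, -1.0], [-1.0, 1.0]])
c = np.array([0.0, 2.0])

# Augmented Lagrangian Phi(x, v) = f(x) + v^T L x + x^T L x / 2.
Phi = lambda x, v: 0.5 * np.sum((x - c) ** 2) + v @ L @ x + 0.5 * x @ L @ x

x_star = np.array([1.0, 1.0])
v_star = np.array([-0.5, 0.5])    # solves grad f(x*) + L v* = 0

rng = np.random.default_rng(1)
for _ in range(200):
    x, v = rng.normal(size=2) * 3, rng.normal(size=2) * 3
    assert Phi(x_star, v) <= Phi(x_star, v_star) + 1e-9   # sup over v at x*
    assert Phi(x, v_star) >= Phi(x_star, v_star) - 1e-9   # inf over x at v*
```

Note that $\Phi({\bf x}^*,\,{\bf v})$ is constant in ${\bf v}$ because $L{\bf x}^*={\bf 0}$, which is exactly why the first saddle inequality holds with equality.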

Following this conversion, many solvability results on problem (1) have been presented when the exact gradient or subgradient of $f_{i}$ is available, e.g., [19, 20, 16, 10, 21, 14]. However, whether and how $\varepsilon$-subgradient algorithms can be derived has not yet been discussed. To this end, motivated by the aforementioned saddle-point seeking designs, we present the following dynamics:

\[
\begin{split}
x_{i}(k+1)&=P_{X_{i}}[x_{i}(k)-\alpha_{k}(g_{i}(k)+\hat{x}_{i}(k)+\hat{v}_{i}(k))]\\
v_{i}(k+1)&=v_{i}(k)+\alpha_{k}\hat{x}_{i}(k)
\end{split}\tag{4}
\]

where $\hat{x}_{i}(k)\triangleq\sum_{j=1}^{N}a_{ij}(x_{i}(k)-x_{j}(k))$, $\hat{v}_{i}(k)\triangleq\sum_{j=1}^{N}a_{ij}(v_{i}(k)-v_{j}(k))$, and $g_{i}(k)\in\partial_{\varepsilon_{k}}f_{i}(x_{i}(k))$, with parameters $\varepsilon_{k},\,\alpha_{k}>0$ to be specified later. It can be viewed as a constrained version of the algorithms in [19, 20]. Different from similar primal-dual designs in [10, 16], we require neither the differentiability of the objective functions nor their exact gradients.

Letting ${\bf x}(k)=\mathrm{col}(x_{1}(k),\,\dots,\,x_{N}(k))$ and ${\bf v}(k)=\mathrm{col}(v_{1}(k),\,\dots,\,v_{N}(k))$, we can put (4) into the compact form:

\[
\begin{split}
{\bf x}(k+1)&=P_{\tilde{X}}[{\bf x}(k)-\alpha_{k}({\bf g}(k)+L{\bf v}(k)+L{\bf x}(k))]\\
{\bf v}(k+1)&={\bf v}(k)+\alpha_{k}L{\bf x}(k)
\end{split}\tag{5}
\]

with ${\bf g}(k)=\mathrm{col}(g_{1}(k),\,\dots,\,g_{N}(k))\in\partial_{N\varepsilon_{k}}\tilde{f}({\bf x}(k))\subset\mathbb{R}^{N}$. It can be further rewritten as follows:

\[
{\bf z}(k+1)=P_{\overline{X}}[{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))]\tag{6}
\]

where ${\bf z}(k)=\mathrm{col}({\bf x}(k),\,{\bf v}(k))$, $\overline{X}=\tilde{X}\times\mathbb{R}^{N}$, and

\[
T_{\varepsilon_{k}}({\bf z}(k))=\begin{bmatrix}{\bf g}(k)+L{\bf v}(k)+L{\bf x}(k)\\ -L{\bf x}(k)\end{bmatrix}
\]
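As a quick sanity check of iteration (4), the following minimal sketch runs it on an assumed toy instance (not the paper's numerical example): three agents on a path graph with $f_i(x)=|x-c_i|$, $c=(0,1,2)$, so the optimal solution is the median $x^*=1$, and heterogeneous boxes $X_i$ whose intersection contains $x^*$. The $\varepsilon_k$-subgradient is simulated by perturbing the exact sign subgradient with bounded noise.

```python
import numpy as np

N = 3
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
Lap = np.diag(A.sum(axis=1)) - A
c = np.array([0.0, 1.0, 2.0])
lo = np.array([-5.0, -4.0, -3.0])     # heterogeneous boxes X_i = [lo_i, hi_i]
hi = np.array([5.0, 4.0, 3.0])

rng = np.random.default_rng(2)
x = np.array([-4.0, 0.0, 3.0])
v = np.zeros(N)
for k in range(1, 30001):
    alpha = 1.0 / (k + 1) ** 0.6      # satisfies condition (11)
    eps = 1.0 / k                     # so that sum alpha_k * eps_k < inf
    # Inexact subgradient of |x_i - c_i|: exact sign plus bounded noise.
    g = np.sign(x - c) + eps * rng.uniform(-1, 1, N)
    x_hat, v_hat = Lap @ x, Lap @ v
    x = np.clip(x - alpha * (g + x_hat + v_hat), lo, hi)  # projected primal step
    v = v + alpha * x_hat                                 # dual step
```

After the run, the estimates should have (approximately) reached consensus near $x^*=1$; the tolerances below are loose since subgradient methods converge slowly.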

To establish the effectiveness of this algorithm, another assumption is made as follows.

Assumption 3

The $\varepsilon_{k}$-subgradient sequence $\{g_{i}(k)\}$ is uniformly bounded for each $i$, i.e., there exists a scalar $C>0$ such that $\max_{i\in\mathcal{N}}\{\|g_{i}(k)\|\}<C$ for all $k>0$.

This assumption is temporarily made for simplicity as in [22, 8] and will be removed later by means of some novel step sizes. Suppose ${\bf z}^{*}=\mathrm{col}({\bf x}^{*},\,{\bf v}^{*})$ is a saddle point of $\Phi$ in $\tilde{X}\times\mathbb{R}^{N}$. Here is a key lemma under Assumption 3.

Lemma 1

Suppose Assumptions 1–3 hold. Along the trajectory of algorithm (4), there exists some $C_{1}>0$ such that, for any $k=1,\,2,\,\dots$, the following inequality holds:

\[
\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq(1+C_{1}\alpha_{k}^{2})\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}\Delta({\bf x}(k))+2N\alpha_{k}\varepsilon_{k}+C_{1}\alpha_{k}^{2}\tag{7}
\]

where $\Delta({\bf x}(k))\triangleq\Phi({\bf x}(k),\,{\bf v}^{*})-\Phi({\bf x}^{*},\,{\bf v}^{*})+\frac{1}{2}{\bf x}(k)^{\intercal}L{\bf x}(k)\geq 0$.

Proof. Under the lemma's conditions, $({\bf x}^{*},\,{\bf v}^{*})$ is a saddle point of $\Phi$. Then, $\Delta({\bf x}(k))\geq 0$ can be easily verified from the definition of saddle points.

Next, we consider the evolution of $\|{\bf z}(k)-{\bf z}^{*}\|^{2}$ with respect to $k$. Under the iteration (6), it then follows that

\[
\begin{split}
\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}&=\|P_{\overline{X}}[{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))]-{\bf z}^{*}\|^{2}\\
&\leq\|{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))-{\bf z}^{*}\|^{2}-\|{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))-P_{\overline{X}}[{\bf z}(k)-\alpha_{k}T_{\varepsilon_{k}}({\bf z}(k))]\|^{2}\\
&\leq\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}({\bf z}(k)-{\bf z}^{*})^{\intercal}T_{\varepsilon_{k}}({\bf z}(k))+\alpha_{k}^{2}\|T_{\varepsilon_{k}}({\bf z}(k))\|^{2}
\end{split}\tag{8}
\]

By the properties of the saddle point and the $\varepsilon_{k}$-subgradient, we have

\[
\begin{split}
({\bf z}(k)-{\bf z}^{*})^{\intercal}T_{\varepsilon_{k}}({\bf z}(k))&=({\bf x}(k)-{\bf x}^{*})^{\intercal}({\bf g}(k)+L{\bf v}(k)+L{\bf x}(k))-({\bf v}(k)-{\bf v}^{*})^{\intercal}L{\bf x}(k)\\
&\geq\tilde{f}({\bf x}(k))-\tilde{f}({\bf x}^{*})-N\varepsilon_{k}+{{\bf v}^{*}}^{\intercal}L{\bf x}(k)+{\bf x}(k)^{\intercal}L{\bf x}(k)\\
&=\Phi({\bf x}(k),\,{\bf v}^{*})-\Phi({\bf x}^{*},\,{\bf v}^{*})+\frac{1}{2}{\bf x}(k)^{\intercal}L{\bf x}(k)-N\varepsilon_{k}\\
&=\Delta({\bf x}(k))-N\varepsilon_{k}
\end{split}\tag{9}
\]

Since $L{\bf x}^{*}={\bf 0}$, $T_{\varepsilon_{k}}({\bf z}(k))$ can be rewritten as

\[
T_{\varepsilon_{k}}({\bf z}(k))=\begin{bmatrix}{\bf g}(k)+L({\bf x}(k)-{\bf x}^{*})+L({\bf v}(k)-{\bf v}^{*})+L{\bf v}^{*}\\ -L({\bf x}(k)-{\bf x}^{*})\end{bmatrix}
\]

Under Assumption 3, there must exist a constant $C_{1}>0$ such that

\[
\|T_{\varepsilon_{k}}({\bf z}(k))\|^{2}\leq C_{1}(1+\|{\bf z}(k)-{\bf z}^{*}\|^{2})\tag{10}
\]

Putting the inequalities (8)–(10) together, we have

\[
\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq(1+C_{1}\alpha_{k}^{2})\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}\Delta({\bf x}(k))+2N\alpha_{k}\varepsilon_{k}+C_{1}\alpha_{k}^{2}
\]

which is exactly the expected inequality (7).  

When the exact subgradient is available (i.e., $\varepsilon_{k}=0$), the inequality (7) simplifies to the well-known supermartingale inequality, which ensures the convergence of ${\bf z}(k)$ towards ${\bf z}^{*}$ as shown in [3] if $\{\alpha_{k}\}$ is chosen to satisfy

\[
\sum_{k=1}^{\infty}\alpha_{k}=\infty,\quad\sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty\tag{11}
\]

However, the inexactness of the available subgradients deteriorates this property, and the expected convergence might fail when we use only $\varepsilon$-subgradients in the iteration (4).

Let us denote $\overline{\Delta}=\liminf_{k\to\infty}\Delta({\bf x}(k))$ and take a closer look at the inequality (7). Note that $\overline{\Delta}$ consists of two parts: the discrepancy of function values and the violation of constraints (in terms of consensus error) along the iterative sequence. It can thus be taken as a measure of suboptimality of the iterative sequence. In other words, we can determine an upper bound for $\overline{\Delta}$ to evaluate the effectiveness of algorithm (4).

We first consider the case when $\varepsilon_{k}$ is a constant.

Theorem 1

Suppose Assumptions 1–3 hold. Let the step size $\alpha_{k}$ be chosen to satisfy (11) and let $\varepsilon_{k}$ be fixed at some scalar $\varepsilon_{0}>0$. Then, along the trajectory of algorithm (4), it holds that

\[
0\leq\overline{\Delta}\leq N\varepsilon_{0}\tag{12}
\]

Proof. To prove this theorem, we only have to show that $\overline{\Delta}\leq N\varepsilon_{0}$. If this inequality does not hold, there must exist a $\delta>0$ and a sufficiently large integer $K_{1}>1$ such that $\Delta({\bf x}(k))>N\varepsilon_{0}+\delta$ for all $k\geq K_{1}$. By (11), we have $\lim_{k\to\infty}\alpha_{k}=0$. Thus, there must exist an integer $K_{2}>1$ such that $0<\alpha_{k}\leq\frac{\delta}{C_{1}}$ for all $k\geq K_{2}$. Bringing these conditions together, one can strengthen inequality (7) for all $k\geq K\triangleq\max\{K_{1},\,K_{2}\}$ as follows.

\[
\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq(1+C_{1}\alpha_{k}^{2})\|{\bf z}(k)-{\bf z}^{*}\|^{2}-\alpha_{k}\delta
\]

Iterating this inequality from $K$ to $\overline{K}>K$ gives

\[
\|{\bf z}(\overline{K}+1)-{\bf z}^{*}\|^{2}\leq\|{\bf z}(K)-{\bf z}^{*}\|^{2}\prod_{k=K}^{\overline{K}}(1+C_{1}\alpha_{k}^{2})-\delta\sum_{k=K}^{\overline{K}}\alpha_{k}
\]

where we use $1+C_{1}\alpha_{k}^{2}>1$ to handle the cross terms.

Note that $1+\theta\leq e^{\theta}$ for any $\theta>0$. Hence $\prod_{k=K}^{\overline{K}}(1+C_{1}\alpha_{k}^{2})\leq e^{C_{1}\sum_{k=K}^{\overline{K}}\alpha_{k}^{2}}\leq e^{C_{1}\sum_{k=1}^{\infty}\alpha_{k}^{2}}$. Under the condition (11), there must exist a positive scalar $\overline{C}>0$ such that

\[
\|{\bf z}(\overline{K}+1)-{\bf z}^{*}\|^{2}\leq\overline{C}\|{\bf z}(K)-{\bf z}^{*}\|^{2}-\delta\sum_{k=K}^{\overline{K}}\alpha_{k}
\]

which cannot hold for a sufficiently large $\overline{K}$ since $\sum_{k=1}^{\infty}\alpha_{k}=\infty$. We obtain a contradiction and complete the proof.

Remark 1

According to Theorem 1, one can generally obtain a suboptimal solution to problem (1) using inexact subgradients. If we are interested in an exact solution, it is required to ensure $\lim_{k\to\infty}\varepsilon_{k}=0$. In the special case when $\varepsilon_{k}=0$, this shows the effectiveness of our algorithm (4) in solving the formulated problem (1) with exact subgradients. This observation is consistent with the existing subgradient methods in [20, 11, 12, 13, 14].

With Theorem 1 in hand, it is natural to enforce a stronger condition on the error $\varepsilon_{k}$ to obtain better convergence of the entire sequence $\{x_{i}(k)\}$. Along this line, we provide another theorem under the assumption that the accumulated error of subgradient inexactness is not too large.

Theorem 2

Suppose Assumptions 1–3 hold. Let the parameters $\alpha_{k},\,\varepsilon_{k}>0$ be chosen to satisfy the following condition:

\[
\sum_{k=1}^{\infty}\alpha_{k}=\infty,\quad\sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty,\quad\sum_{k=1}^{\infty}\alpha_{k}\varepsilon_{k}<\infty\tag{13}
\]

Then, along the trajectory of algorithm (4), we have

1) the sequence $\{\|{\bf z}(k+1)-{\bf z}^{*}\|\}$ converges;

2) the estimates $x_{1}(k),\,\dots,\,x_{N}(k)$ reach an optimal consensus in the sense that $\lim_{k\to\infty}[x_{i}(k)-x_{j}(k)]=0$ and $\lim_{k\to\infty}\tilde{f}({\bf x}(k))=\tilde{f}({\bf x}^{*})=f^{*}$;

3) $\{{\bf z}(k)\}$ has at least one cluster point $\overline{\bf z}=\mathrm{col}(\overline{\bf x},\,\overline{\bf v})$ such that $\overline{\bf x}={\bf 1}_{N}x^{*}$ with $x^{*}$ an optimal solution to problem (1);

4) if the optimal solution to problem (1) is unique, i.e., $\mathcal{X}^{*}=\{x^{*}\}$, then $\lim_{k\to\infty}x_{i}(k)=x^{*}$ for each $i\in\mathcal{N}$.

Proof. Note that $\Delta({\bf x}(k))\geq 0$ by Lemma 1 and $\sum_{k=1}^{\infty}\alpha_{k}\varepsilon_{k}+\sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty$ under the theorem's assumption. Applying Lemma 5.31 in [23] to the inequality (7), we obtain the convergence of $\{\|{\bf z}(k+1)-{\bf z}^{*}\|\}$ and

\[
0\leq\sum_{k=1}^{\infty}\alpha_{k}\Delta({\bf x}(k))<\infty\tag{14}
\]

Thus, the sequence $\{{\bf z}(k)\}$ must be uniformly bounded by some $C_{2}>0$. Since the convex function $\Delta({\bf x})$ is continuous, it is $C_{3}$-Lipschitz with respect to ${\bf x}$ on this bounded set for some constant $C_{3}>0$. It then follows that

\[
\begin{split}
\Delta({\bf x}(k+1))-\Delta({\bf x}(k))&\leq C_{3}\|{\bf x}(k+1)-{\bf x}(k)\|\\
&=C_{3}\|P_{\tilde{X}}[{\bf x}(k)-\alpha_{k}({\bf g}(k)+L{\bf v}(k)+L{\bf x}(k))]-{\bf x}(k)\|\\
&\leq C_{3}\|{\bf x}(k)-\alpha_{k}({\bf g}(k)+L{\bf v}(k)+L{\bf x}(k))-{\bf x}(k)\|\\
&\leq C_{3}\alpha_{k}\|{\bf g}(k)+L{\bf v}(k)+L{\bf x}(k)\|\\
&\leq C_{3}(\sqrt{N}C+2\lambda_{\max}(L)C_{2})\alpha_{k}
\end{split}\tag{15}
\]

Jointly using (14), (15), and $\sum_{k=1}^{\infty}\alpha_{k}=\infty$, we resort to Proposition 2 in [24] and conclude that $\lim_{k\to\infty}\Delta({\bf x}(k))=0$. Recalling the expression of $\Delta$, we have $\lim_{k\to\infty}{\bf x}^{\intercal}(k)L{\bf x}(k)=0$ and $\lim_{k\to\infty}[\tilde{f}({\bf x}(k))+{{\bf v}^{*}}^{\intercal}L{\bf x}(k)]=\lim_{k\to\infty}\tilde{f}({\bf x}(k))=\tilde{f}({\bf x}^{*})$. Note that $L$ is positive semi-definite with $0$ as a simple eigenvalue under Assumption 2. It then follows that $\lim_{k\to\infty}[x_{i}(k)-x_{j}(k)]=0$ and $\lim_{k\to\infty}\tilde{f}({\bf x}(k))=\tilde{f}({\bf x}^{*})=f^{*}$.

Due to the uniform boundedness of the sequence $\{{\bf z}(k)\}$ by item 1), there must be a convergent subsequence $\{{\bf z}(k_{m})\}$ of $\{{\bf z}(k)\}$. We denote its limit by $\overline{\bf z}=\mathrm{col}(\overline{\bf x},\,\overline{\bf v})$. Then, it satisfies $L\overline{\bf x}={\bf 0}$ and $\tilde{f}(\overline{\bf x})=\lim_{m\to\infty}\tilde{f}({\bf x}(k_{m}))=f^{*}$ by item 2). In other words, $\overline{\bf x}$ is an optimal solution to (2). By Assumption 2, one can conclude that there exists some $\overline{x}\in\mathbb{R}$ such that $\overline{\bf x}={\bf 1}_{N}\overline{x}$. Note that $f(\overline{x})=\tilde{f}(\overline{\bf x})=f^{*}$, i.e., $\overline{x}$ is an exact optimal solution to problem (1).

If $\mathcal{X}^{*}=\{x^{*}\}$ holds, all convergent subsequences of $\{{\bf x}(k)\}$ have the same limit ${\bf 1}_{N}x^{*}$. This, combined with the boundedness of $\{{\bf z}(k)\}$, implies item 4) and completes the proof.
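As a numerical aside, one admissible choice in Theorem 2 is $\alpha_k=1/k$ with $\varepsilon_k=1/\sqrt{k}$ (an assumed choice for illustration, not one prescribed by the paper); the sketch below checks the three sums in (13) on partial sums, where $\sum\alpha_k\varepsilon_k=\sum k^{-3/2}$:

```python
import math

K = 10**6
S1 = sum(1.0 / k for k in range(1, K + 1))          # partial sum of alpha_k
S2 = sum(1.0 / k**2 for k in range(1, K + 1))       # partial sum of alpha_k^2
S3 = sum(1.0 / k**1.5 for k in range(1, K + 1))     # partial sum of alpha_k*eps_k

assert S1 > 13.0                  # ~ ln(K) + gamma: grows without bound
assert S2 < math.pi**2 / 6        # bounded by the convergent limit pi^2/6
assert S3 < 2.7                   # bounded by zeta(3/2) ~ 2.612
```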

Remark 2

This theorem specifies a nontrivial case in which our distributed optimization problem (1) can be exactly solved even using only inexact subgradient information of the local objective functions. This observation is consistent with the centralized results in [5]. Compared with similar primal-dual results in [19, 25, 21, 10, 16, 14], this algorithm further allows us to consider nonsmooth objective functions with only approximate subgradients.

It is known that normalization might improve the transient performance of subgradient algorithms by avoiding overshoots in the starting phase. However, conventional normalization techniques often involve some global information and cannot be directly implemented in distributed settings. We here present a novel componentwise normalized version of algorithm (4) as follows:

\[
\begin{split}
x_{i}(k+1)&=P_{X_{i}}\Big[x_{i}(k)-\frac{\alpha_{k}}{\max\{c,\,\delta_{ik,D}\}}(g_{i}(k)+\hat{x}_{i}(k)+\hat{v}_{i}(k))\Big]\\
v_{i}(k+1)&=v_{i}(k)+\frac{\alpha_{k}}{\max\{c,\,\delta_{ik,D}\}}\hat{x}_{i}(k)\\
\delta_{ik,\,m}&=\begin{cases}\|T^{i}_{\varepsilon_{k}}({\bf z}(k))\|,&\mbox{when }m=1\\ \max_{j\in\mathcal{N}_{i}}\{\delta_{ik,\,m-1},\,\delta_{jk,\,m-1}\},&\mbox{when }2\leq m\leq D\end{cases}
\end{split}\tag{16}
\]

where the integer $D\geq\mathrm{D}(\mathcal{G})+1$ with $\mathrm{D}(\mathcal{G})$ the diameter of graph $\mathcal{G}$, and $c>0$ is any given constant. Since $\mathrm{D}(\mathcal{G})$ (or an upper bound on it) can be computed by distributed rules [26], this normalized algorithm is implementable in a fully distributed manner by embedding a max-consensus subiteration.
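The embedded max-consensus subiteration can be sketched as follows (assumed toy data on a path graph with four nodes): after a number of rounds no smaller than the graph diameter, every node holds the network-wide maximum of the local values $\|T^{i}_{\varepsilon_k}({\bf z}(k))\|$, which is what the normalization relies on.

```python
import numpy as np

# Path graph 1-2-3-4 (diameter 3) and assumed local values ||T^i||.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
delta = np.array([3.0, 1.0, 4.0, 1.5])

D = 3                                        # >= diameter of the graph
vals = delta.copy()
for _ in range(D):
    # Each node takes the max over its closed neighborhood simultaneously.
    vals = np.array([max([vals[i]] + [vals[j] for j in neighbors[i]])
                     for i in range(4)])

assert np.all(vals == delta.max())           # every node holds the global max
```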

Here is a corollary stating that the normalized algorithm (16) retains all the properties established in Theorem 2.

Corollary 1

Suppose Assumptions 1 and 2 hold, and choose parameters satisfying condition (13). Then, along the trajectory of algorithm (16), the sequence $\{{\bf z}(k)\}$ satisfies:

1) the sequence $\{\|{\bf z}(k+1)-{\bf z}^{*}\|\}$ converges;

2) the estimates $x_{1}(k),\,\dots,\,x_{N}(k)$ reach an optimal consensus in the sense that $\lim_{k\to\infty}[x_{i}(k)-x_{j}(k)]=0$ and $\lim_{k\to\infty}\tilde{f}({\bf x}(k))=\tilde{f}({\bf x}^{*})=f^{*}$;

3) $\{{\bf z}(k)\}$ has at least one cluster point $\overline{\bf z}=\mathrm{col}(\overline{\bf x},\,\overline{\bf v})$ such that $\overline{\bf x}={\bf 1}_{N}x^{*}$ with $x^{*}$ an optimal solution to problem (1);

4) if the optimal solution to problem (1) is unique, i.e., $\mathcal{X}^{*}=\{x^{*}\}$, then $\lim_{k\to\infty}x_{i}(k)=x^{*}$ for each $i\in\mathcal{N}$.

Proof. The proof is similar to that of Theorem 2. First, by Theorem 4.1 in [27], after the max-consensus subiteration every agent obtains $\max\{c,\,\max_{i\in\mathcal{N}}\|T^{i}_{\varepsilon_{k}}({\bf z}(k))\|\}$. Thus, we only need to consider the following system:

$$\begin{split}x_{i}(k+1)&=P_{X_{i}}[x_{i}(k)-\alpha_{k}\gamma_{k}(g_{i}(k)+\hat{x}_{i}(k)+\hat{v}_{i}(k))]\\ v_{i}(k+1)&=v_{i}(k)+\alpha_{k}\gamma_{k}\hat{x}_{i}(k)\end{split}$$

with $\gamma_{k}=\frac{1}{\max\{c,\,\max_{i\in\mathcal{N}}\|T^{i}_{\varepsilon_{k}}({\bf z}(k))\|\}}$. For this new system, we can establish an inequality similar to (7) for $\|{\bf z}(k)-{\bf z}^{*}\|$:

$$\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}\gamma_{k}({\bf z}(k)-{\bf z}^{*})^{\intercal}T_{\varepsilon_{k}}({\bf z}(k))+\alpha_{k}^{2}\gamma_{k}^{2}\|T_{\varepsilon_{k}}({\bf z}(k))\|^{2}.$$

Note that $\gamma_{k}^{2}\|T_{\varepsilon_{k}}({\bf z}(k))\|^{2}\leq N$ and $0\leq\gamma_{k}\leq\frac{1}{c}$. Recalling the fact (9) and $\Delta({\bf x}(k))\geq 0$, one can obtain

$$\|{\bf z}(k+1)-{\bf z}^{*}\|^{2}\leq\|{\bf z}(k)-{\bf z}^{*}\|^{2}-2\alpha_{k}\gamma_{k}\Delta({\bf x}(k))+\frac{2N}{c}\alpha_{k}\varepsilon_{k}+N\alpha_{k}^{2}.$$

According to Lemma 5.3.1 in [23], we conclude the convergence of $\{\|{\bf z}(k+1)-{\bf z}^{*}\|\}$ and $\sum_{k=1}^{\infty}\alpha_{k}\gamma_{k}\Delta({\bf x}(k))<\infty$. Then, $\{\|{\bf z}(k)\|\}$ is uniformly bounded, which implies that there exists a sufficiently small constant $C_{4}>0$ such that $\gamma_{k}\geq C_{4}>0$. Using Proposition 2 in [24] again, we obtain $\lim_{k\to\infty}\Delta({\bf x}(k))=0$. Then items 3) and 4) can be verified by following a similar procedure as in Theorem 2. The proof is thus complete. $\blacksquare$

Remark 3

Compared with the conventional normalized step sizes in [18, 28], the proposed componentwise normalized step size can be viewed as their distributed extension, and the iterative sequence generated by (16) may exhibit better transient behavior than that generated by (4). Interestingly, the widely used subgradient boundedness assumption (i.e., Assumption 3) is also removed as a byproduct, which is favorable in distributed scenarios.

5 Simulation

Figure 1: Communication graph 𝒢\mathcal{G} in our example.

In this section, we consider a LASSO (least absolute shrinkage and selection operator) regression problem to verify the effectiveness of our algorithms:

$$\min_{x\in X}f(x)=\frac{1}{2}\sum_{i=1}^{N}\|x-p_{i}\|^{2}+\lambda N\|x\|_{1}$$

where $\lambda N>0$ is the regularization parameter, $p_{i}$ is an estimate known only to agent $i$, and $X$ is the constraint set. Letting $X_{i}=X$ and $f_{i}(x)=\frac{1}{2}\|x-p_{i}\|^{2}+\lambda\|x\|_{1}$ puts it into the form of problem (1). Distributed subgradient algorithms have been developed to solve this problem when $\partial f_{i}$ is available, e.g., [12]. Although both $\partial f_{i}$ and $\partial_{\varepsilon}f_{i}$ can be easily calculated here, we use this example to show the effectiveness of our algorithm using only the $\varepsilon$-subdifferential instead of the exact one.

According to the definition of ε\varepsilon-subgradient, we have

εfi(x)={[xpiλ,xpiλλεx]for x<ε2,[xpiλ,xpi+λ]for x[ε2,ε2],[xpi+λλεx,xpi+λ]for x>ε2\partial_{\varepsilon}f_{i}(x)=\begin{cases}[x-p_{i}-\lambda,x-p_{i}-\lambda-\frac{\lambda\varepsilon}{x}]&\text{for }x<-\frac{\varepsilon}{2},\\ [x-p_{i}-\lambda,x-p_{i}+\lambda]&\text{for }x\in[-\frac{\varepsilon}{2},\frac{\varepsilon}{2}],\\ [x-p_{i}+\lambda-\frac{\lambda\varepsilon}{x},x-p_{i}+\lambda]&\text{for }x>\frac{\varepsilon}{2}\end{cases}
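As an illustration in the scalar case, one element of each interval above can be selected as follows (the function names are ours, not the paper's). Since only the $\lambda\|x\|_{1}$ term is approximated, with accuracy $\lambda\varepsilon$, the returned $g$ satisfies $f_{i}(y)\geq f_{i}(x)+g(y-x)-\lambda\varepsilon$ for all $y$, which the accompanying check verifies numerically.

```python
def f_i(y, p_i, lam):
    """Local objective f_i(y) = 0.5*(y - p_i)^2 + lam*|y| (scalar case)."""
    return 0.5 * (y - p_i) ** 2 + lam * abs(y)

def eps_subgradient(x, p_i, lam, eps):
    """Return one eps-subgradient of f_i, following the three-case
    selection from the formula above (the endpoints involving lam*eps/x)."""
    if x < -eps / 2:
        return x - p_i - lam - lam * eps / x
    elif x <= eps / 2:
        return x - p_i + lam     # middle case: no division, safe at x = 0
    else:
        return x - p_i + lam - lam * eps / x
```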
Figure 2: Profiles of primal and dual variables in Algorithm (4).
Figure 3: Profiles of primal and dual variables in Algorithm (16).

For simulations, we let $N=4$, $p_{i}=2i$, and $\lambda=0.1$. The communication graph is given in Fig. 1 with unit weights. To make the problem more interesting, we assume it has set constraints specified by $X_{i}=[-11+i,\,8-i]$ for $i=1,\,2,\,3,\,4$. Assumptions 1–2 can be verified. To ensure condition (13), we choose $\alpha_{k}=\varepsilon_{k}=\frac{3}{k+1}$. Then the considered problem can be solved by our proposed distributed primal-dual $\varepsilon$-subgradient method (PD$\varepsilon$SM) (4) and the normalized primal-dual $\varepsilon$-subgradient method (NPD$\varepsilon$SM) (16) according to Theorem 2 and Corollary 1. In the simulations, we choose the $\varepsilon$-subgradient $x-p_{i}-\lambda-\frac{\lambda\varepsilon}{x}$ when $x<-\frac{\varepsilon}{2}$, $x-p_{i}+\lambda$ when $-\frac{\varepsilon}{2}\leq x\leq\frac{\varepsilon}{2}$, and $x-p_{i}+\lambda-\frac{\lambda\varepsilon}{x}$ when $x>\frac{\varepsilon}{2}$.

Figure 4: Profiles of residual errors in Algorithms (4) and (16).

Simulation results with ${\bf x}(1)=\mathrm{col}(1,\,0,\,5,\,-1)$ and $c=0.1$ for (4) and (16) are shown in Figs. 2 and 3, where all agents' primal variables are observed to converge to the global optimal solution $x^{*}=4$ while the dual variables remain bounded and converge. This verifies the effectiveness of our algorithms. Moreover, although the normalized step size in (16) might slow down the convergence compared with (4), the transient performance of the primal and dual variables is much improved, with fewer and weaker oscillations. For a clearer comparison, we let $e(k)=\frac{\|{\bf x}(k)-{\bf 1}_{4}x^{*}\|}{\|{\bf x}(1)-{\bf 1}_{4}x^{*}\|}$ be the residual error of our algorithms. The profiles of $e(k)$ for both algorithms are shown in Fig. 4, which confirms the improvement in transient performance achieved by the proposed componentwise normalized step size.
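The simulation setup above can be reproduced with a short script. The following is a minimal sketch of the unnormalized iteration (4) on this example; since Fig. 1 is not reproduced here, the ring topology, the use of Laplacian disagreements $\sum_{j\in\mathcal{N}_{i}}(x_{i}-x_{j})$ for $\hat{x}_{i}(k)$ and $\hat{v}_{i}(k)$, and all variable names are assumptions made for illustration.

```python
import numpy as np

N, lam = 4, 0.1
p = np.array([2.0, 4.0, 6.0, 8.0])            # p_i = 2i
lo = np.array([-10.0, -9.0, -8.0, -7.0])      # X_i = [-11 + i, 8 - i]
hi = np.array([7.0, 6.0, 5.0, 4.0])
A = np.array([[0., 1., 0., 1.],               # assumed ring 1-2-3-4-1, unit weights
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A                # graph Laplacian

def eps_subgrad(x, eps):
    """Componentwise eps-subgradient choice used in the simulations."""
    g = np.empty(N)
    for i in range(N):
        if x[i] < -eps / 2:
            g[i] = x[i] - p[i] - lam - lam * eps / x[i]
        elif x[i] <= eps / 2:
            g[i] = x[i] - p[i] + lam
        else:
            g[i] = x[i] - p[i] + lam - lam * eps / x[i]
    return g

x = np.array([1.0, 0.0, 5.0, -1.0])           # x(1)
v = np.zeros(N)
e1 = np.linalg.norm(x - 4.0)                  # x* = 4 under the constraints
for k in range(1, 50001):
    alpha = eps = 3.0 / (k + 1)
    g = eps_subgrad(x, eps)
    x_next = np.clip(x - alpha * (g + L @ x + L @ v), lo, hi)  # projection P_{X_i}
    v = v + alpha * (L @ x)                   # dual update uses x(k), not x(k+1)
    x = x_next
residual = np.linalg.norm(x - 4.0) / e1       # normalized residual error e(k)
```

Note that the constrained optimum is indeed $x^{*}=4$: the unconstrained minimizer of $f$ is $4.9$ (from $4x-20+0.4=0$), which is clipped by agent 4's constraint set $X_{4}=[-7,\,4]$.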

6 Conclusion

In this paper, we have addressed a distributed constrained optimization problem with inexact subgradient information of the local objective functions. We have developed a projected primal-dual dynamics using only $\varepsilon$-subgradients and discussed its convergence properties. In particular, we have shown the exact solvability of this problem when the accumulated error introduced by subgradient inexactness is not too large. We have also presented a novel distributed normalized step size to improve the transient performance of our algorithms. Extending these results to more general graphs is an interesting direction for future work.

References

  • [1] Nedić A, Liu J, Distributed optimization for control, Annual Review of Control, Robotics, and Autonomous Systems, 2018, 1: 77–103.
  • [2] Yang T, Yi X, Wu J, et al., A survey of distributed optimization, Annual Reviews in Control, 2019, 47: 278–305.
  • [3] Bertsekas D, Convex Optimization Algorithms, Athena Scientific, Belmont, 2015.
  • [4] Devolder O, Glineur F, Nesterov Y, First-order methods of smooth convex optimization with inexact oracle, Mathematical Programming, 2014, 146(1-2): 37–75.
  • [5] Kiwiel K, Convergence of approximate and incremental subgradient methods for convex optimization, SIAM Journal on Optimization, 2004, 14(3): 807–840.
  • [6] Jakovetić D, Bajović D, Xavier J, Moura J, Primal–Dual Methods for Large-Scale and Distributed Convex Optimization and Data Analytics, Proceedings of the IEEE, 2020, 108(11): 1923–1938.
  • [7] Nedić A, Ozdaglar A, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, 2009, 54(1):48–61.
  • [8] Jakovetić D, Moura J, Xavier J, Linear convergence rate of a class of distributed augmented Lagrangian algorithms, IEEE Transactions on Automatic Control, 2014, 60(4): 922–936.
  • [9] Yi P, Hong Y, Liu F, Distributed gradient algorithm for constrained optimization with application to load sharing in power systems, Systems & Control Letters, 2015, 83: 45–52.
  • [10] Lei J, Chen H, Fang H, Primal–dual algorithm for distributed constrained optimization, Systems & Control Letters, 2016, 96:110–117.
  • [11] Xi C, Khan U, Distributed subgradient projection algorithm over directed graphs, IEEE Transactions on Automatic Control, 2016, 62(8):3986–3992.
  • [12] Liu S, Qiu Z, Xie L, Convergence rate analysis of distributed optimization with projected subgradient algorithm, Automatica, 2017, 83:162–169.
  • [13] Zeng X, Yi P, Hong Y, Distributed continuous-time algorithm for constrained convex optimizations via nonsmooth analysis approach, IEEE Transactions on Automatic Control, 2017, 62(10): 5227–5233.
  • [14] Zhu K, Zhu H, Tang Y, On the Boundedness of Subgradients in Distributed Optimization, Proceedings of 39th Chinese Control Conference (CCC), Shenyang, 2020, 4912–4917.
  • [15] Polyak B, Introduction to Optimization, Optimization Software Inc., New York, 1987.
  • [16] Liu Q, Yang S, Hong Y, Constrained consensus algorithms with fixed step size for distributed convex optimization over multiagent networks, IEEE Transactions on Automatic Control, 2017, 62(8): 4259–4265.
  • [17] Mesbahi M, Egerstedt M, Graph Theoretic Methods in Multiagent Networks, Princeton University Press, Princeton, 2010.
  • [18] Ruszczynski A, Nonlinear Optimization, Princeton University Press, Princeton, 2011.
  • [19] Wang J, Elia N, Control approach to distributed optimization, Proceedings of 48th Annual Allerton Conference on Communication, Control, and Computing, Monticello, 2010, 557–561.
  • [20] Gharesifard B, Cortés J, Distributed continuous-time convex optimization on weight-balanced digraphs, IEEE Transactions on Automatic Control, 2014, 59(3): 781–786.
  • [21] Kia S, Cortés J, Martínez S, Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication, Automatica, 2015, 55: 254–264.
  • [22] Nedić A, Ozdaglar A, Subgradient methods for saddle-point problems, Journal of Optimization Theory & Applications, 2009, 142(1): 205–228.
  • [23] Bauschke H, Combettes P, Convex Analysis and Monotone Operator Theory in Hilbert Spaces (2nd ed.), Springer, Cham, 2017.
  • [24] Alber Y, Iusem A, Solodov M, On the projected subgradient method for nonsmooth convex optimization in a Hilbert space, Mathematical Programming, 1998, 81(1): 23–35.
  • [25] Jakovetić D, Xavier J, Moura J, Fast distributed gradient methods, IEEE Transactions on Automatic Control, 2014, 59(5): 1131–1146.
  • [26] Oliva G, Setola R, Hadjicostis C, Distributed finite-time average-consensus with limited computational and storage capability, IEEE Transactions on Control of Network Systems, 2016, 4(2): 380–391.
  • [27] Nejad B, Attia S, Raisch J, Max-consensus in a max-plus algebraic setting: The case of fixed communication topologies, Proceedings of 2009 XXII International Symposium on Information, Communication and Automation Technologies, Sarajevo, 2009, 1–7.
  • [28] Boyd S, Mutapcic A, Subgradient Methods, Notes for EE364b, Stanford University, 2008.