This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Distributed Optimization with Projection-free Dynamics

Guanpu Chen, Peng Yi, Yiguang Hong  Jie Chen This work was supported by the National Natural Science Foundation of China (No. 61733018), and by Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0100).G. Chen is with the Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Beijing, China. [email protected]P. Yi is with Department of Control Science and Engineering & Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China. [email protected]Y. Hong is with Department of Control Science and Engineering & Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China, and is also with the Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Beijing, China. [email protected]J. Chen is with Department of Control Science and Engineering & Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China. [email protected]
Abstract

We consider continuous-time dynamics for distributed optimization with set constraints in the paper. To handle the computational complexity of projection-based dynamics due to solving a general quadratic optimization subproblem with projection, we propose a distributed projection-free dynamics by employing the Frank-Wolfe method, also known as the conditional gradient algorithm. The process searches a feasible descent direction by solving an alternative linear optimization instead of a quadratic one. To make the approach applicable over weight-balanced digraphs, we design a dynamics for the consensus of local decision variables and another dynamics of auxiliary variables to track the global gradient. Then we prove the convergence of the dynamical systems to the optimal solution, and provide detailed numerical comparisons with both projection-based dynamics and other distributed projection-free algorithms. Also, we derive the distributed discrete-time scheme following the instructive ideas of the proposed dynamics and provide its accordingly convergence rate.

Index Terms:
distributed optimization, Frank-Wolfe method, projection-free algorithm, constraint.

I Introduction

Distributed optimization and its applications have attracted a large amount of research attention in the past decade. Under multi-agent frameworks, the global objective function consists of agents’ local objective functions, and each agent shares limited amounts of information with neighbors through the networks to achieve an optimal solution. Both discrete-time algorithms [1, 2, 3, 4] and continuous-time algorithms [5, 6, 7, 8, 9, 10] are extensively developed for solving distributed problems.

Among continuous-time algorithms, projection-based dynamics have been widely adopted to solve distributed optimization with constraints, on the basis of the well-developed theory in nonlinear optimization [11, 12]. Various projection-based dynamics have been designed with techniques in dynamical systems and control theory. For example, [6] proposed a proportional-integral protocol to solve distributed constrained optimization, while [7] proposed distributed dynamics where the projection maps are with respect to tangent cones. However, projection-based design implies that agents will encounter a quadratic subproblem when a variable needs to find its nearest point to a set. When the constraints are expressed as complex structures, the computational cost of quadratic subproblems discourages agents from employing projection-based approaches, particularly for high-dimensional constraints.

Meanwhile, though the primal-dual method is another common tool for distributed constrained optimization [4, 13, 14, 15, 16], the attendant difficulties and challenges actually occur in the high dimension of the algorithm, since the dimension of dual dynamics is as high as the dimension of the constraints. This brings extra burden in local storage, algorithm iteration and distributed communication. Furthermore, practical constraints are usually equipped with sparsity, that is, we may spend huge calculation resources to iterate the additional dual dynamics. Nowadays, considering the optimization with large-scale data or constraints, such as sparse PCA, structural SVM, and matrix completion, researchers are eagerly seeking efficient and specific tools for optimization with high-dimensional constraints, rather than the common approaches like projection-based or primal-dual algorithms.

Fortunately, the well-known Frank-Wolfe (FW) method [17], also known as the conditional gradient, provides us with a helpful idea. Briefly speaking, the FW method uses a linearized function to approximate the objective function and derives a feasible descent direction by solving optimization with a linear objective. Thanks to the linear programming toolbox, the feasible descent direction can be efficiently computed when the constraints are high-dimensional polyhedrons, which can usually be used as an approximation for general convex sets [18]. Then, this process avoids general projection operations and solving corresponding quadratic subproblems in algorithm iterations. In fact, the FW method is suitable for various sparse optimization with atomic constraints [19], among which the linear constraint is the most comprehensible one. Though proposed early in 1956 [17], the FW method has regained lots of attention in recent years due to its computational advantage for optimization with large-scale sparse constraints, which covers wide applications such as image process in machine learning and data analysis in bioinformation [21, 22, 23, 20]. Also, there have been massive investigations on the FW method afterward, such as rate analysis over strongly convex sets in [24], online learning over networks in [25], and quantized FW for lower communication in [26].

Motivated by the above, we aim to design a projection-free dynamics based on the FW method for solving distributed constrained optimization in this paper. Agents have their own local objective functions and need to achieve the optimal solution via communicating with neighbors over networks. The main contributions are as follows.

Contributions

Inspired by the historical literature [27], we design a novel FW-based distributed dynamics for agents to solve the constrained optimization. Compared to the dynamics in [6, 7], a feasible descent direction is derived by solving optimization with a linear objective. Hence, the dynamics avoids solving complicated quadratic subproblems due to projection operations on set constraints, which actually leads to a projection-free dynamics. Our distributed projection-free dynamics accordingly induces a novel discrete-time format, where averaging consensus is employed both to ensure the consensus of local decision variables and to help auxiliary variables track the global gradient. This differs from the mechanism in the decentralized discrete-time FW algorithm of [28]. Hence, we develop a novel convergence analysis with the Lyapunov theory and comparison theorems, and provide the analysis on the convergence rate. Moreover, compared with the projected dynamics given in [6, 7] and the discrete-time FW algorithm in [28], the distributed projection-free dynamics is applied over the communication networks described by weight-balanced digraphs.

Paper Organization

The organization of this paper is given as follows. Section II formulates the distributed constrained optimization with basic assumptions, and presents a projection-free algorithm with the corresponding discretization. Then Section III reports the main results, including the consensus of decision variables, global gradient tracking and convergence of the continuous-time algorithm, as well as the convergence rate of the induced discrete scheme. Finally, Section IV shows numerical examples with a comparison to the existing algorithms, while Section V provides some concluding remarks and discussions on future work.

Notations

Here we give some necessary notations in this paper. Denote n\mathbb{R}^{n} ( m×n\mathbb{R}^{m\times n}) as the set of nn-dimension (mm-by-nn) real column vectors (real matrices), 1n1_{n} (or 0n0_{n}) as the nn-dimensional column vectors with all elements of one (or zero), and col{x1,,xn}=(x1T,,xnT)T.col\{x_{1},\dots,x_{n}\}=(x_{1}^{\rm T},\dots,x_{n}^{\rm T})^{\rm T}. Let ABA\otimes B as the Kronecker product of matrices AA and BB. In addition, take \|\cdot\| as the Euclidean norm of vectors, and F\|\|_{F} as the Frobenius norm of real matrices defined by QF=tr(QTQ).\|Q\|_{F}=\sqrt{tr(Q^{T}Q)}.

II Distributed Projection-free Dynamics

In this section, we formulate the constrained distributed optimization, and then propose the distributed projection-free dynamics, as well as a discrete scheme derived from the continuous-time algorithm.

II-A Problem Formulation

Consider NN agents indexed by 𝒱={1,2,N}\mathcal{V}=\{1,2\dots,N\}. For agent i𝒱i\in\mathcal{V}, there is a local cost function fi:nf_{i}:\mathbb{R}^{n}\rightarrow\mathbb{R} on the feasible set Ωn\Omega\subseteq\mathbb{R}^{n}. The global cost function is

F(x)=1Ni=1Nfi(x).F(x)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(x).

All agents aim to solve the constrained optimization:

minxnF(x)s.t.,xΩ.\displaystyle\min_{x\in\mathbb{R}^{n}}F(x)\quad\text{s.t.},\;x\in\Omega. (1)

In a multi-agent network, the iith agent controls a local decision variable xiΩx_{i}\in\Omega to search the optimal xargminF(x)s.t.,xΩ.x^{*}\in\arg\min F(x)\quad\text{s.t.},\;x\in\Omega. Also, the information of local cost functions are regarded as private knowledge. The agents communicate with their neighbors through a network described by a digraph 𝒢(𝒱,)\mathcal{G}(\mathcal{V},\mathcal{E}), where 𝒱\mathcal{V} is the set of nodes (regarded as agents here) and 𝒱×𝒱\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V} is the set of edges. 𝒜=[aij]N×N\mathcal{A}=[a_{ij}]\in\mathbb{R}^{N\times N} is the adjacency matrix subject to aij>0a_{ij}>0 if and only if (i,j)(i,j)\in\mathcal{E}, which means that agent jj can send information to agent ii, and aij=0a_{ij}=0, otherwise. The Laplacian matrix =𝒟𝒜\mathcal{L}=\mathcal{D}-\mathcal{A}, where 𝒟N×N\mathcal{D}\in\mathbb{R}^{N\times N} is diagonal with 𝒟i,i=j=1Naij\mathcal{D}_{i,i}=\sum_{j=1}^{N}a_{ij}, for any i𝒱i\in\mathcal{V}. A digraph 𝒢\mathcal{G} is strongly connected if there exists at least one directed path between any pair of vertices, and 𝒢\mathcal{G} is weight-balanced if j=1Naij=j=1Naji\sum_{j=1}^{N}a_{ij}=\sum_{j=1}^{N}a_{ji} for i𝒱i\in\mathcal{V}.

In addition, the gradient of a differentiable function ff is κ\kappa-Lipschitz on convex set CnC\subseteq\mathbb{R}^{n} with a constant κ>0\kappa>0, if

f(x)f(y)κxy,x,yC.\|\nabla f(x)-\nabla f(y)\|\leq\kappa\|x-y\|,\quad\forall x,y\in C.

Also, the above is equivalent to the following:

f(x)f(y)(xy)Tf(y)+κ2xy2,x,yC.f(x)-f(y)\leq(x-y)^{\rm T}\nabla f(y)+\frac{\kappa}{2}\|x-y\|^{2},\quad\forall x,y\in C.

Then we consider solving distributed optimization (1) under the following assumptions.

Assumption 1
  • The feasible set Ω\Omega is convex, compact and nonempty.

  • For i𝒱i\in\mathcal{V}, fif_{i} is convex and differentiable, and fi\nabla f_{i} is κ\kappa-Lipschitz on Ω\Omega.

  • The digraph 𝒢\mathcal{G} is strongly connected and weight-balanced.

The differentiable cost functions enable us to use the gradient information as in [24, 28, 25]. Additionally, the strongly connected and weight-balanced digraph, as a generalization of connected undirected graphs, was studied in other continuous-time distributed algorithms [29, 16].

II-B Distributed Projection-free Dynamics

To solve the distributed optimization (1), we intend to explore a projection-free approach following the FW method. In fact, continuous-time algorithms provide evolved dynamics, which help reveal how each variable achieves the optimal solution with given directions. Also, the analysis techniques in modern calculus and nonlinear systems may lead to comprehensive results with mild assumptions in continuous-time schemes. Hence, we firstly propose a novel projection-free dynamics with the FW method in the following Algorithm 1, which differs from the projection-based continuous-time algorithms in [6, 7].

Algorithm 1 Distributed Projection-free Dynamics for i𝒱i\in\mathcal{V}

Initialization:

xi(0)Ωx_{i}(0)\in\Omega, yi(0)=𝟎ny_{i}(0)=\mathbf{0}_{n}, zi(0)nz_{i}(0)\in\mathbb{R}^{n}, vi(0)Ωv_{i}(0)\in\Omega.

Flows renewal:

x˙i(t)=\displaystyle\dot{x}_{i}(t)= j=1Naij(xj(t)xi(t))+β(t)(vi(t)xi(t)),\displaystyle\sum_{j=1}^{N}a_{ij}(x_{j}(t)-x_{i}(t))+\beta(t)(v_{i}(t)-x_{i}(t)),
vi(t)\displaystyle v_{i}(t)\in argminvΩzi(t)Tv,\displaystyle~{}\arg\min_{v\in\Omega}~{}z_{i}(t)^{T}v,
y˙i(t)=\displaystyle\dot{y}_{i}(t)= j=1Naij(zj(t)zi(t)),\displaystyle\sum_{j=1}^{N}a_{ij}(z_{j}(t)-z_{i}(t)),
zi(t)=\displaystyle z_{i}(t)= yi(t)+fi(xi(t)),\displaystyle~{}y_{i}(t)+\nabla f_{i}(x_{i}(t)),

where β(t)\beta(t) is a positive time-varying parameter satisfying

limtβ(t)=0,limt0tβ(τ)𝑑τ=.\displaystyle\lim_{t\rightarrow\infty}\beta(t)=0,\quad\lim_{t\rightarrow\infty}\int_{0}^{t}\beta(\tau)d\tau=\infty.

Algorithm 1 is distributed since the dynamics of the iith agent is only influenced by local values of xix_{i}, yiy_{i}, ziz_{i}, viv_{i} and fi(xi)\nabla f_{i}(x_{i}). Specifically, the iith agent uses local decision variable xix_{i} to estimate the optimal solution xΩx^{*}\in\Omega and local optimal solution viv_{i} as a conditional gradient. Since each agent is merely capable to calculate its own gradient fi(x)\nabla f_{i}(x), rather than the global gradient 1Ni=1Nfi(x)\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(x), thus, ziz_{i} serves as the variable that simultaneously operates two processes — one is to compute agent ii’s local gradient, the other is to achieve consensus with neighbors’ local gradients, in order to estimate the global gradient. In fact, the gradient tracking method in [30, 31] motivates our algorithm design. Although the time-varying β(t)\beta(t) seems to be a global parameter, it is easy to determine its value for all agents, by merely selecting some general decreasing functions like β(t)=1/t\beta(t)=1/t. That is analogous to other FW-based works dealing with parameters [28, 25].

II-C Discretization

So far, we have devoted to proposing distributed projection-free dynamics to avoid solving complex subproblems owing to projections. Following these instructive ideas in designing the above distributed continuous-time Algorithm 1, we consider deriving its implementable discretization. We notice the decentralized discrete-time FW algorithm in [28]. To make a comparison, we present the discretization of Algorithm 1 by adopting the notations in [28] in the following. To remain consistent with [28], set the network 𝒢\mathcal{G} undirected and connected, and adjacency matrix 𝒜\mathcal{A} as symmetric and doubly stochastic. Let 0<δ<10<\delta<1 be a fixed step-size in discretization, and denote ηk=δβk\eta^{k}=\delta\beta^{k}. Take the weighted average of agent ii’s neighbors 𝒩i\mathcal{N}_{i} in the network 𝒢\mathcal{G} by

Avgj𝒩i{xjk}=(1δ)xik+δj=1Naijxjk,Avg_{j\in\mathcal{N}_{i}}\{x_{j}^{k}\}=(1-\delta)x_{i}^{k}+\delta\sum_{j=1}^{N}a_{ij}x_{j}^{k},

and similarly,

Avgj𝒩i{zjk}=(1δ)zik+δj=1Naijzjk.Avg_{j\in\mathcal{N}_{i}}\{z_{j}^{k}\}=(1-\delta)z_{i}^{k}+\delta\sum_{j=1}^{N}a_{ij}z_{j}^{k}.

Then, consider the ODE in Algorithm 1

z˙i(t)=j=1Naij(zj(t)zi(t))+ddtfi(xi(t)).\displaystyle\dot{z}_{i}(t)=\sum_{j=1}^{N}a_{ij}(z_{j}(t)-z_{i}(t))+\frac{d}{dt}\nabla f_{i}(x_{i}(t)).

The corresponding difference equation is

zik+1=(1δ)zik+δj=1Naijzjk+fi(xik+1)fi(xik).\displaystyle z_{i}^{k+1}=(1-\delta)z_{i}^{k}+\delta\sum_{j=1}^{N}a_{ij}z_{j}^{k}+\nabla f_{i}(x_{i}^{k+1})-\nabla f_{i}(x_{i}^{k}).

Therefore, the discretization of Algorithm 1 gives

{xik+1=Avgj𝒩i{xjk}+ηk(vikxik),zik+1=Avgj𝒩i{zjk}+fi(xik+1)fi(xik),vikargminvΩvTzik.\vspace{10pt}\left\{\begin{aligned} x_{i}^{k+1}=&Avg_{j\in\mathcal{N}_{i}}\{x_{j}^{k}\}+\eta^{k}(v_{i}^{k}-x_{i}^{k}),\\ z_{i}^{k+1}=&Avg_{j\in\mathcal{N}_{i}}\{z_{j}^{k}\}+\nabla f_{i}(x_{i}^{k+1})-\nabla f_{i}(x_{i}^{k}),\\ v_{i}^{k}\in&\arg\min_{v\in\Omega}~{}v^{T}z_{i}^{k}.\end{aligned}\right. (2)

For clarification, the discrete-time FW in [28] is given as

{xik+1=Avgj𝒩i{xjk}+ηk(vikAvgj𝒩i{xjk}),zik+1=Avgj𝒩i{fj(Avgl𝒩j{xlk})},vikargminvΩvTzik.\vspace{10pt}\left\{\begin{aligned} x_{i}^{k+1}=&Avg_{j\in\mathcal{N}_{i}}\{x_{j}^{k}\}+\eta^{k}(v_{i}^{k}-Avg_{j\in\mathcal{N}_{i}}\{x_{j}^{k}\}),\\ z_{i}^{k+1}=&Avg_{j\in\mathcal{N}_{i}}\{\nabla f_{j}(Avg_{l\in\mathcal{N}_{j}}\{x_{l}^{k}\})\},\\ v_{i}^{k}\in&\arg\min_{v\in\Omega}~{}v^{T}z_{i}^{k}.\end{aligned}\right. (3)

The discretization above reveals that the major difference between (2) and (3) refers to the update protocol of zik+1z_{i}^{k+1}. In (2), agent ii uses both its neighbors’ gradient values and its own gradient renewal to track the global gradient. In (3), agent ii gathers the average value of neighbors’ decision variables to compute local gradient at first. Then agent ii makes again the neighbors’ average gradient value to estimate the global gradient. In other words, the consensus format in (3) requires agents to sequentially collect its neighbors’ decision variables xj𝒩ikx_{j\in\mathcal{N}_{i}}^{k} and auxiliary variables zj𝒩ikz_{j\in\mathcal{N}_{i}}^{k}, which results in more communication burden and data storage. Consider that constructing a reliable communication link is time-consuming and energy-consuming in practice. Therefore, (2) overcomes this weak point by requiring agents to simultaneously collect the both variables. Since the mechanism of Algorithm 1 differs from what in [28], we need to explore new tools for analysis. In addition, we apply Algorithm 1 over weight-balanced digraphs as the generalization of undirected connected graphs in [28], which brings more technical challenges correspondingly.

III Convergence Analysis

In this section, we give the convergence analysis of both Algorithm 1 and its discretazition (2).

III-A Convergence of Algorithm 1

For convenience of statements in the following, we first provide some compact notations here. Take

𝛀Ω××Ω,\bm{\Omega}\triangleq\Omega\times\cdots\times\Omega,
𝒙col{x1,,xN},\bm{x}\triangleq col\{x_{1},\dots,x_{N}\},
𝒚col{y1,,yN},\bm{y}\triangleq col\{y_{1},\dots,y_{N}\},
𝒛col{z1,,zN},\bm{z}\triangleq col\{z_{1},\dots,z_{N}\},

and 𝑳=In\bm{L}=\mathcal{L}\otimes I_{n}. Also, define the profile of gradients

G(𝒙)col{f1(x1),,fN(xN)}.G(\bm{x})\triangleq col\{\nabla f_{1}(x_{1}),\dots,\nabla f_{N}(x_{N})\}.

Equivalently, Algorithm 1 can be expressed in the following compact form

{𝒙˙(t)=𝑳𝒙(t)+β(t)(𝒗(t)𝒙(t)),𝒚˙(t)=𝑳𝒛(t),𝒛(t)=𝒚(t)+G(𝒙(t)),\left\{\begin{aligned} \dot{\bm{x}}(t)=&~{}-\bm{L}\bm{x}(t)+\beta(t)(\bm{v}(t)-{\bm{x}}(t)),\\ \dot{\bm{y}}(t)=&~{}-\bm{L}\bm{z}(t),\\ {\bm{z}}(t)=&~{}{\bm{y}}(t)+G(\bm{x}(t)),\end{aligned}\right. (4)

where 𝒗col{v1,,vN}\bm{v}\triangleq col\{v_{1},\dots,v_{N}\} with

viargminvΩziTv.v_{i}\in\arg\min_{v\in\Omega}~{}z_{i}^{T}v.

Hereupon, we can derive the following results for the convergence guarantees of Algorithm 1.

Theorem 1

Under Assumption 1 and with given initial condition xi(0)Ωx_{i}(0)\in\Omega, yi(0)=𝟎ny_{i}(0)=\bm{0}_{n}, zi(0)nz_{i}(0)\in\mathbb{R}^{n} and vi(0)Ωv_{i}(0)\in\Omega,

  1. i).

    all decision variables xix_{i} achieve consensus, i.e.,

    limt(xi(t)xj(t))=𝟎n,i,j𝒱;\lim_{t\rightarrow\infty}(x_{i}(t)-x_{j}(t))=\bm{0}_{n},\quad\forall i,j\in\mathcal{V};
  2. ii).

    all variables ziz_{i} asymptotically track the dynamical global gradient, i.e.,

    limt(zi(t)1Nj=1Nfj(xj(t)))=𝟎n,i𝒱;\lim_{t\rightarrow\infty}\big{(}z_{i}(t)-\frac{1}{N}\sum_{j=1}^{N}\nabla f_{j}(x_{j}(t))\big{)}=\bm{0}_{n},\quad\forall i\in\mathcal{V};
  3. iii).

    all decision variables xix_{i}, for i𝒱i\in\mathcal{V}, converge to a consensual optimal solution to problem (1).

The following two lemmas are necessary for the analysis of Algorithm 1, whose proofs can be found in the appendix.

Lemma 1

Under Assumption 1, if xi(0)Ωx_{i}(0)\in\Omega for all i𝒱i\in\mathcal{V}, then xi(t)Ωx_{i}(t)\in\Omega for all t>0t>0 and for all i𝒱i\in\mathcal{V}, .

Lemma 2

Given scalars ε(t)0\varepsilon(t)\geq 0, s(t)0s(t)\geq 0, and γ(t)>0\gamma(t)>0, if limt0tγ(τ)𝑑τ=\lim_{t\rightarrow\infty}\int_{0}^{t}\gamma(\tau)d\tau=\infty, limtε(t)=0\lim_{t\rightarrow\infty}\varepsilon(t)=0, and

s˙(t)γ(t)s(t)+γ(t)ε(t),\dot{s}(t)\leq-\gamma(t)s(t)+\gamma(t)\varepsilon(t),

then limts(t)=0\lim_{t\rightarrow\infty}s(t)=0.

In this way, we give the proof of Theorem 1.

Proof.

i). It follows from Lemma 1 that xi(t)Ωx_{i}(t)\in{{\Omega}}. Moreover, since vi(t)v_{i}(t) is chosen from Ω\Omega, it implies that vi(t)xi(t)v_{i}(t)-x_{i}(t) is bounded. Because β(t)0\beta(t)\rightarrow 0 as tt\rightarrow\infty, we have

β(t)(vi(t)xi(t))𝟎n,ast.\beta(t)(v_{i}(t)-x_{i}(t))\rightarrow\bm{0}_{n},\quad\text{as}\quad t\rightarrow\infty.

Thus, the dynamics for decision variables in Algorithm 1 can be written as

x˙i(t)=j=1Naij(xj(t)xi(t))+ui(t),\displaystyle\dot{x}_{i}(t)=\sum_{j=1}^{N}a_{ij}(x_{j}(t)-x_{i}(t))+u_{i}(t), (5)

where limtui(t)=𝟎\lim_{t\rightarrow\infty}u_{i}(t)=\bm{0}. According to the existing results in [32, Section 4.3], all decision variables in (5) reach consensus, i.e.,

limt(xi(t)xj(t))=𝟎n,i,j𝒱,\lim_{t\rightarrow\infty}(x_{i}(t)-x_{j}(t))=\bm{0}_{n},\quad\forall\;i,j\in\mathcal{V},

which yields the result.

ii). Set 𝟏^=1NIn\hat{\bm{1}}=1_{N}\otimes I_{n}, and let us investigate

W(t)=𝒛(t)1N𝟏^𝟏^TG(𝒙(t)).W(t)=\bm{z}(t)-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}G(\bm{x}(t)).

Considering the orthogonal decomposition in the subspace ker(𝑳)\ker(\bm{L}) and its complementary space ker(𝑳)\ker(\bm{L})_{\perp}, for t0t\geq 0, define

W(t)=W0(t)+W,W(t)=W_{0}(t)+W_{\perp},

and

𝒛(t)=𝒛0(t)+𝒛(t),\bm{z}(t)=\bm{z}_{0}(t)+\bm{z}_{\perp}(t),

where

W0,𝒛0ker(𝑳)=span{1Nv:vn},W_{0},\bm{z}_{0}\in\ker(\bm{L})=\text{span}\{1_{N}\otimes v:v\in\mathbb{R}^{n}\},

and W,𝒛ker(𝑳)W_{\perp},\bm{z}_{\perp}\in\ker(\bm{L})_{\perp}. Since 𝟏^𝟏^TG(𝒙(t))ker(𝑳)\hat{\bm{1}}\hat{\bm{1}}^{T}G(\bm{x}(t))\in\ker(\bm{L}), clearly, we further obtain that

W0(t)=𝒛01N𝟏^𝟏^TG(𝒙(t)),W_{0}(t)=\bm{z}_{0}-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}G(\bm{x}(t)),

and

W(t)=𝒛(t),t0.W_{\perp}(t)=\bm{z}_{\perp}(t),\quad\forall t\geq 0.

Due to the weight-balanced digraph 𝒢\mathcal{G},

i=1Ny˙i(t)=i=1Nj=1Naij(zj(t)zi(t))=𝟎n,t0.\sum_{i=1}^{N}\dot{y}_{i}(t)=\sum_{i=1}^{N}\sum_{j=1}^{N}a_{ij}(z_{j}(t)-z_{i}(t))=\bm{0}_{n},\quad\forall t\geq 0.

Together with the initial condition yi(0)=𝟎ny_{i}(0)=\bm{0}_{n} for i𝒱i\in\mathcal{V}, we can verify that

i=1Nyi(t)=𝟎n.\sum_{i=1}^{N}y_{i}(t)=\bm{0}_{n}.

Hence,

i=1Nzi(t)\displaystyle\sum_{i=1}^{N}z_{i}(t) =i=1Nyi(t)+i=1Nfi(xi(t))\displaystyle=\sum_{i=1}^{N}y_{i}(t)+\sum_{i=1}^{N}\nabla f_{i}(x_{i}(t))
=i=1Nfi(xi(t)),\displaystyle=\sum_{i=1}^{N}\nabla f_{i}(x_{i}(t)),

or equivalently,

i=1Nzi(t)=𝟏^TG(𝒙(t)).\sum_{i=1}^{N}z_{i}(t)=\hat{\bm{1}}^{T}G(\bm{x}(t)).

It follows from 𝒛0ker(𝑳)\bm{z}_{0}\in\ker(\bm{L}) that zi0(t)=zj0(t)z_{i0}(t)=z_{j0}(t). Therefore,

W0(t)=𝒛01N𝟏^𝟏^TG(𝒙(t))=𝟎nN,W_{0}(t)=\bm{z}_{0}-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}G(\bm{x}(t))=\bm{0}_{nN},

which indicates that

W(t)=W(t).W(t)=W_{\perp}(t).

Futhermore, let us invetigate the function

J(t)=12W(t)2,J(t)=\frac{1}{2}\|W(t)\|^{2},

and consider its derivative along Algorithm 1 in the following.

J˙(t)=\displaystyle\dot{J}(t)= (𝒛(t)1N𝟏^𝟏^TG(𝒙(t)))T(𝒛˙(t)1N𝟏^𝟏^TG˙(𝒙(t)))\displaystyle\Big{(}\bm{z}(t)-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}G(\bm{x}(t))\Big{)}^{T}\Big{(}\dot{\bm{z}}(t)-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}\dot{G}(\bm{x}(t))\Big{)}
=\displaystyle= (𝒛(t)+1N𝟏^𝟏^TG(𝒙(t)))T𝑳𝒛(t)\displaystyle\Big{(}-\bm{z}(t)+\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}G(\bm{x}(t))\Big{)}^{T}\bm{L}\bm{z}(t)
+(𝒛(t)1N𝟏^𝟏^TG(𝒙(t)))T(𝑰1N𝟏^𝟏^T)G˙(𝒙(t)),\displaystyle+\Big{(}\bm{z}(t)-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}G(\bm{x}(t))\Big{)}^{T}\Big{(}\bm{I}-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}\Big{)}\dot{G}(\bm{x}(t)),

where 𝑰=InN\bm{I}=I_{nN}. By Assumption 1, the digraph is strongly connected and weight-balanced, which yields

𝑳T𝟏^=𝟎nN.\bm{L}^{T}\hat{\bm{1}}=\bm{0}_{nN}.

Hence,

J˙(t)=\displaystyle\dot{J}(t)= W(t)T𝑳W(t)+W(t)T(𝑰1N𝟏^𝟏^T)G˙(𝒙(t))\displaystyle-W(t)^{T}\bm{L}W(t)+W(t)^{T}\Big{(}\bm{I}-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}\Big{)}\dot{G}(\bm{x}(t))
\displaystyle\leq W(t)T(12(𝑳+𝑳T))W(t)\displaystyle-W(t)^{T}\Big{(}\frac{1}{2}(\bm{L}+\bm{L}^{T})\Big{)}W(t)
+W(t)𝑰1N𝟏^𝟏^TFG˙(𝒙(t))\displaystyle+\|W(t)\|\|\bm{I}-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}\|_{F}\|\dot{G}(\bm{x}(t))\|
\displaystyle\leq λ2W(t)2+W(t)𝑰1N𝟏^𝟏^TFG˙(𝒙(t)),\displaystyle-\lambda_{2}\|W(t)\|^{2}+\|W(t)\|\|\bm{I}-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}\|_{F}\|\dot{G}(\bm{x}(t))\|,

where λ2\lambda_{2} is the smallest positive eigenvalue of 12(𝑳+𝑳T)\frac{1}{2}(\bm{L}+\bm{L}^{T}), and the last inequality follows from the fact W(t)=W(t)W(t)=W_{\perp}(t) and Rayleigh quotient theorem [33, Page 234]. Moreover,

ddtW(t)=\displaystyle\frac{d}{dt}\|W(t)\|= ddt2J(t)\displaystyle\frac{d}{dt}\sqrt{2J(t)}
=\displaystyle= J˙(t)W(t)\displaystyle\frac{\dot{J}(t)}{\|W(t)\|} (6)
\displaystyle\leq λ2W(t)+𝑰1N𝟏^𝟏^TFG˙(𝒙(t)).\displaystyle-\lambda_{2}\|W(t)\|+\|\bm{I}-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}\|_{F}\|\dot{G}(\bm{x}(t))\|.

It follows from Assumption 1 that fi\nabla f_{i} is κ\kappa-Lipschitz on Ω\Omega, which leads to the k2\frac{k}{2} boundedness of G(x)\|\nabla G(x)\|. Thus,

G˙(𝒙(t))\displaystyle\|\dot{G}(\bm{x}(t))\|\leq κ2𝒙˙(t)\displaystyle\frac{\kappa}{2}\|\dot{\bm{x}}(t)\|
=\displaystyle= κ2𝑳𝒙(t)+β(t)(𝒗(t)𝒙(t)).\displaystyle\frac{\kappa}{2}\|\bm{L}\bm{x}(t)+\beta(t)(\bm{v}(t)-\bm{x}(t))\|.

So far, since xi(t)x_{i}(t) achieves consensus and β(t)0\beta(t)\rightarrow 0 as tt\rightarrow\infty, we have

𝑳𝒙(t)+β(t)(𝒗(t)𝒙(t))0,ast,\|\bm{L}\bm{x}(t)+\beta(t)(\bm{v}(t)-\bm{x}(t))\|\rightarrow 0,\quad\text{as}\quad t\rightarrow\infty,

which indicates that

limtG˙(𝒙(t))=0.\lim_{t\rightarrow\infty}\|\dot{G}(\bm{x}(t))\|=0.

Recall the properties that λ2>0\lambda_{2}>0 and 𝑰1N𝟏^𝟏^TF>0\|\bm{I}-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}\|_{F}>0. Therefore, we learn from Lemma 2 that (III-A) yields

limtW(t)=0,\lim_{t\rightarrow\infty}\|W(t)\|=0,

which implies that all variables ziz_{i} track the dynamical global gradient.

iii). Suppose that xx^{*} is an optimal solution to problem (1) and denote 𝒙=col{x,,x}\bm{x}^{*}=col\{x^{*},\dots,x^{*}\}. Take

x¯(t)=1Ni=1Nxi(t),𝒙¯=col{x¯,,x¯}.\overline{x}(t)=\frac{1}{N}\sum_{i=1}^{N}x_{i}(t),\quad\overline{\bm{x}}=col\{\overline{x},\dots,\overline{x}\}.

Consider the following function

V(t)=\displaystyle V(t)= F(x¯(t))F(x)\displaystyle F(\overline{x}(t))-F(x^{*})
=\displaystyle= 1Ni=1Nfi(x¯(t))1Ni=1Nfi(x).\displaystyle\frac{1}{N}\sum_{i=1}^{N}f_{i}(\overline{x}(t))-\frac{1}{N}\sum_{i=1}^{N}f_{i}(x^{*}).

Clearly, V(t)0V(t)\geq 0. Then we investigate its derivative.

V˙(t)=\displaystyle\dot{V}(t)= 1N(i=1Nfi(x¯(t)))Tx¯˙(t)\displaystyle\frac{1}{N}\Big{(}\sum_{i=1}^{N}\nabla f_{i}(\overline{x}(t))\Big{)}^{T}\dot{\overline{x}}(t)
=\displaystyle= 1N2(𝟏^TG(𝒙¯(t)))T𝟏^T𝒙˙\displaystyle\frac{1}{N^{2}}\Big{(}\hat{\bm{1}}^{T}G(\overline{\bm{x}}(t))\Big{)}^{T}\hat{\bm{1}}^{T}\dot{\bm{x}}
=\displaystyle= 1N2GT(𝒙¯(t))𝟏^𝟏^T(𝑳𝒙(t)+β(t)(𝒗(t)𝒙(t)))\displaystyle\frac{1}{N^{2}}G^{T}(\overline{\bm{x}}(t))\hat{\bm{1}}\hat{\bm{1}}^{T}\Big{(}-\bm{L}\bm{x}(t)+\beta(t)(\bm{v}(t)-\bm{x}(t))\Big{)}
=\displaystyle= β(t)N2GT(𝒙¯(t))𝟏^𝟏^T(𝒗(t)𝒙(t)),\displaystyle\frac{\beta(t)}{N^{2}}G^{T}(\overline{\bm{x}}(t))\hat{\bm{1}}\hat{\bm{1}}^{T}\Big{(}\bm{v}(t)-\bm{x}(t)\Big{)},

where the last equality holds since 𝟏^T𝑳=𝟎\hat{\bm{1}}^{T}\bm{L}=\bm{0}. Then,

V˙(t)=\displaystyle\dot{V}(t)= β(t)N2(GT(𝒙¯(t))𝟏^𝟏^TN𝒛T(t))T(𝒗(t)𝒙(t))\displaystyle\frac{\beta(t)}{N^{2}}\Big{(}G^{T}(\overline{\bm{x}}(t))\hat{\bm{1}}\hat{\bm{1}}^{T}-N\bm{z}^{T}(t)\Big{)}^{T}\Big{(}\bm{v}(t)-\bm{x}(t)\Big{)}
+β(t)N𝒛T(t)(𝒗(t)𝒙(t)).\displaystyle+\frac{\beta(t)}{N}\bm{z}^{T}(t)\Big{(}\bm{v}(t)-\bm{x}(t)\Big{)}.

Recall the derivation of 𝒗(t)\bm{v}(t), or equivalently,

vi(t)=argminvΩziT(t)v,v_{i}(t)=\arg\min_{v\in\Omega}z_{i}^{T}(t)v,

which implies that,

𝒛T(t)𝒗(t)zT(t)𝒙,𝒙𝛀.\bm{z}^{T}(t)\bm{v}(t)\leq z^{T}(t)\bm{x}^{\prime},\quad\forall\bm{x}^{\prime}\in\bm{\Omega}.

On this basis, take 𝒙=𝒙\bm{x}^{\prime}=\bm{x}^{*}, and thus,

V˙(t)\displaystyle\dot{V}(t)\leq β(t)N2(GT(𝒙¯(t))𝟏^𝟏^TN𝒛T(t))(𝒗(t)𝒙(t))\displaystyle~{}\frac{\beta(t)}{N^{2}}\Big{(}G^{T}(\overline{\bm{x}}(t))\hat{\bm{1}}\hat{\bm{1}}^{T}-N\bm{z}^{T}(t)\Big{)}\Big{(}\bm{v}(t)-\bm{x}(t)\Big{)}
+β(t)N𝒛T(t)(𝒙𝒙(t))\displaystyle+\frac{\beta(t)}{N}\bm{z}^{T}(t)\Big{(}\bm{x}^{*}-\bm{x}(t)\Big{)}
=\displaystyle= β(t)N2GT(𝒙¯(t))𝟏^𝟏^T(𝒙𝒙(t))U1(t)\displaystyle~{}\underbrace{\frac{\beta(t)}{N^{2}}G^{T}(\overline{\bm{x}}(t))\hat{\bm{1}}\hat{\bm{1}}^{T}\Big{(}\bm{x}^{*}-\bm{x}(t)\Big{)}}_{U_{1}(t)}
+β(t)N2(GT(𝒙¯(t))𝟏^𝟏^TN𝒛T(t))(𝒗(t)𝒙)U2(t).\displaystyle+\underbrace{\frac{\beta(t)}{N^{2}}\Big{(}G^{T}(\overline{\bm{x}}(t))\hat{\bm{1}}\hat{\bm{1}}^{T}-N\bm{z}^{T}(t)\Big{)}\Big{(}\bm{v}(t)-\bm{x}^{*}\Big{)}}_{U_{2}(t)}.

By the convexity of the cost functions,

U1(t)=\displaystyle U_{1}(t)= β(t)(1Ni=1Nfi(x¯(t)))T(xx¯)\displaystyle\beta(t)\Big{(}\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(\overline{x}(t))\Big{)}^{T}\Big{(}x^{*}-\overline{x}\Big{)}
\displaystyle\leq β(t)(1Ni=1Nfi(x¯(t))+1Ni=1Nfi(x))\displaystyle\beta(t)\Big{(}-\frac{1}{N}\sum_{i=1}^{N}f_{i}(\overline{x}(t))+\frac{1}{N}\sum_{i=1}^{N}f_{i}(x^{*})\Big{)}
=\displaystyle= β(t)V(t).\displaystyle-\beta(t)V(t).

Meanwhile, it follows from W(t)=𝒛(t)1N𝟏^𝟏^TG(𝒙(t))W(t)=\bm{z}(t)-\frac{1}{N}\hat{\bm{1}}\hat{\bm{1}}^{T}G(\bm{x}(t)) that

U2(t)=\displaystyle U_{2}(t)= β(t)N2(GT(𝒙(t))𝟏^𝟏^TN𝒛T(t))(𝒗(t)𝒙)\displaystyle\frac{\beta(t)}{N^{2}}\Big{(}G^{T}({\bm{x}}(t))\hat{\bm{1}}\hat{\bm{1}}^{T}-N\bm{z}^{T}(t)\Big{)}\Big{(}\bm{v}(t)-\bm{x}^{*}\Big{)}
+β(t)N2(GT(𝒙¯(t))GT(𝒙(t)))𝟏^𝟏^T(𝒗(t)𝒙)\displaystyle+\frac{\beta(t)}{N^{2}}\Big{(}G^{T}(\overline{\bm{x}}(t))-G^{T}({\bm{x}}(t))\Big{)}\hat{\bm{1}}\hat{\bm{1}}^{T}\Big{(}\bm{v}(t)-\bm{x}^{*}\Big{)}
\displaystyle\leq β(t)N(W(t)+κ𝒙¯(t)𝒙(t))𝒗(t)𝒙.\displaystyle\frac{\beta(t)}{N}\Big{(}\|W(t)\|+\kappa\|\overline{\bm{x}}(t)-\bm{x}(t)\|\Big{)}\|\bm{v}(t)-\bm{x}^{*}\|.

Since 𝒗(t),𝒙𝛀\bm{v}(t),\bm{x}^{*}\in\bm{\Omega}, there exists a contant c>0c>0 such that 𝒗(t)𝒙c\|\bm{v}(t)-\bm{x}^{*}\|\leq c and

U2(t)cβ(t)N(W(t)+κ𝒙¯(t)𝒙(t)).U_{2}(t)\leq\frac{c\beta(t)}{N}\Big{(}\|W(t)\|+\kappa\|\overline{\bm{x}}(t)-\bm{x}(t)\|\Big{)}.

Therefore,

V˙(t)β(t)V(t)+cβ(t)N(W(t)+κ𝒙¯(t)𝒙(t)).\displaystyle\dot{V}(t)\leq-\beta(t)V(t)+\frac{c\beta(t)}{N}\Big{(}\|W(t)\|+\kappa\|\overline{\bm{x}}(t)-\bm{x}(t)\|\Big{)}.

Recall that we have already proved

limt𝒙¯(t)𝒙(t)=0,\lim_{t\rightarrow\infty}\|\overline{\bm{x}}(t)-\bm{x}(t)\|=0,

as well as

limtW(t)=0.\lim_{t\rightarrow\infty}\|W(t)\|=0.

Moreover, the positive parameter β(t)\beta(t) satisfies

limt0tβ(τ)𝑑τ=.\lim_{t\rightarrow\infty}\int_{0}^{t}\beta(\tau)d\tau=\infty.

Hence, by Lemma 2 again, we have

limtV(t)=0.\lim_{t\rightarrow\infty}V(t)=0.

Take XΩX^{*}\subseteq\Omega as the set of optimal solutions to problem (1), and

ρ(x,X)=infxXxx.\rho(x,X^{*})=\inf_{x^{\prime}\in X^{*}}\|x-x^{\prime}\|.

Since Ω\Omega is compact, there exists a point xΩx_{\infty}\in\Omega and a sequence {x¯(tk),k}\{\overline{x}(t_{k}),k\in\mathbb{N}\} such that

limkxx¯(tk)=0,\lim_{k\rightarrow\infty}\|x_{\infty}-\overline{x}(t_{k})\|=0,

and

limsuptρ(x¯(t),X)=limkρ(x¯(tk),X).\lim\sup_{t\rightarrow\infty}\rho(\overline{x}(t),X^{*})=\lim_{k\rightarrow\infty}\rho(\overline{x}(t_{k}),X^{*}).

Since fif_{i} is differentiable and ρ\rho is lower semicontinuous,

limkρ(x¯(tk),X)=ρ(x,X).\lim_{k\rightarrow\infty}\rho(\overline{x}(t_{k}),X^{*})=\rho(x_{\infty},X^{*}).

Thus,

limkF(x¯(tk))=F(x)=F(x),\lim_{k\rightarrow\infty}F(\overline{x}(t_{k}))=F(x_{\infty})=F(x^{*}),

which implies xXx_{\infty}\in X^{*}, i.e., the decision variable xix_{i} converges to an optimal solution to problem (1). \square

III-B Convergence Rate of Discretization

To compare our discretization (2) with the discrete-time algorithm (3) in [28], we provide theoretical guarantees on the convergence rate of discretization (2). Similarly, here we need some new notations to describe the discrete scheme. Take 𝒙k=col{xik}i=1N\bm{x}^{k}=col\{x^{k}_{i}\}^{N}_{i=1}, 𝒛k=col{zik}i=1N\bm{z}^{k}\!=\!col\{z^{k}_{i}\}^{N}_{i=1} and analogously,

x¯k=1Ni=1Nxik,𝒙¯k=col{x¯k}i=1N,\bar{x}^{k}=\frac{1}{N}\sum^{N}_{i=1}x_{i}^{k},\quad\overline{\bm{x}}^{k}=col\{\overline{x}^{k}\}^{N}_{i=1},
𝒔k=col{vikxik}i=1N,𝒔¯k=col{1Ni=1Nvikx¯k}i=1N.\bm{s}^{k}\!=col\{v^{k}_{i}-x^{k}_{i}\}^{N}_{i=1},\quad\overline{\bm{s}}^{k}=col\{\frac{1}{N}\!\sum_{i=1}^{N}v_{i}^{k}-\bar{x}^{k}\}^{N}_{i=1}.

Also, denote the averages global gradients by

kF¯=1Ni=1Nfi(xik),k𝑭¯=col{kF¯}i=1N.\overline{\nabla^{k}\!F}\!=\frac{1}{N}\!\sum_{i=1}^{N}\!\nabla\!f_{i}(x_{i}^{k}),\quad\overline{\nabla^{k}\!\bm{F}}\!=\!col\{\overline{\nabla^{k}\!F}\}_{i=1}^{N}.

Set λ(𝒜)\lambda(\mathcal{A}) as the second largest eigenvalue of the adjcency matrix 𝒜\mathcal{A}. Select k0k_{0} as the smallest positive integer such that

λ(𝒜)(k0+1)24(k0+2)(k0+2)(k0+1).\lambda(\mathcal{A})\leq\frac{(k_{0}+1)^{2}-4(k_{0}+2)}{(k_{0}+2)(k_{0}+1)}. (7)

The existence of k0k_{0} is guaranteed because the adjacency matrix 𝒜\mathcal{A} is symmetric and doubly stochastic [28]. Then we have the following result on the convergence rate of discrete algorithm (2).

Theorem 2

Under Assumption 1, set δ=1\delta=1, βk=2k+1\beta^{k}=\frac{2}{k+1}, and d¯\bar{d} as the diameter of Ω\Omega. For any k1k\geq 1, there exist constants

Cx=12Nd¯(k0+1),C_{x}=\frac{1}{2}\sqrt{N}\bar{d}(k_{0}+1),

and

Cz=κN(2Cx+d¯)(k0+1),C_{z}=\kappa\sqrt{N}\left(2C_{x}+\bar{d}\right)(k_{0}+1),

such that

  1. i).

    all decision variables xikx_{i}^{k} achieve consensus, i.e.,

    𝒙k𝒙¯k2Cxk+1;\|\bm{x}^{k}-\overline{\bm{x}}^{k}\|\leq\frac{2C_{x}}{k+1};
  2. ii).

    all variable zikz_{i}^{k} track the global gradient, i.e.,

    𝒛kk𝑭¯2Czk+1;\|\bm{z}^{k}-\overline{\nabla^{k}\bm{F}}\|\leq\frac{2C_{z}}{k+1};
  3. iii).

    the average x¯k\bar{x}^{k} converges to an optimal solution xx^{*} with a rate of O(1/k)O(1/k), i.e.,

    F(x¯k)F(x)8d¯Cz+8κd¯Cx+2κd¯2k+1.F(\bar{x}^{k})-F\left(x^{*}\right)\leq\frac{8\bar{d}C_{z}+8\kappa\bar{d}C_{x}+2\kappa\bar{d}^{2}}{k+1}.

Proof.

i). Recall the compactness of Ω\Omega and this is naturally true for k[1,k0]k\in[1,k_{0}]. For induction, assume

𝒙k𝒙¯k2Cxk+1,\|\bm{x}^{k}-\overline{\bm{x}}^{k}\|\leq\frac{2C_{x}}{k+1},

for some kk0k\geq k_{0}. Then, we can verify that

𝒙k+1𝒙¯k+1\displaystyle\|\bm{x}^{k+1}-\bar{\bm{x}}^{k+1}\| =𝑨𝒙k+2𝒔kk+1(𝒙¯k+2𝒔¯kk+1)\displaystyle=\|\bm{A}\bm{x}^{k}+\frac{2\bm{s}^{k}}{k+1}-(\overline{\bm{x}}^{k}+\frac{2\overline{\bm{s}}^{k}}{k+1})\|
|λ(𝒜)|𝒙k𝒙¯k+4Nd¯k+1\displaystyle\leq|\lambda(\mathcal{A})|\|\bm{x}^{k}-\bar{\bm{x}}^{k}\|+\frac{4\sqrt{N}\bar{d}}{k+1}
2k+1(|λ(𝒜)|Cx+2Nd¯)\displaystyle\leq\frac{2}{k+1}(|\lambda(\mathcal{A})|C_{x}+2\sqrt{N}\bar{d})
2Cxk+2,\displaystyle\leq\frac{2C_{x}}{k+2},

where 𝑨=𝒜In\bm{A}=\mathcal{A}\otimes I_{n} and the last inequality follows from (7). Thus, we have the conclusion according to induction.

ii). Similarly, we can verify this for k[1,k0]k\in[1,k_{0}]. For induction, assume

𝒛kk𝑭¯2Czk+1,\|\bm{z}^{k}-\overline{\nabla^{k}\bm{F}}\|\leq\frac{2C_{z}}{k+1},

for some k0k\geq 0. Then we can verify that

𝒛k+1k+1𝑭¯\displaystyle\|\bm{z}^{k+1}-\overline{\nabla^{k+1}\bm{F}}\|
=\displaystyle= 𝑨𝒛k+G(𝒙k+1)G(𝒙k)k+1𝑭¯+k𝑭¯k𝑭¯\displaystyle\|\bm{A}\bm{z}^{k}+G(\bm{x}^{k+1})-G(\bm{x}^{k})-\overline{\nabla^{k+1}\bm{F}}+\overline{\nabla^{k}\bm{F}}-\overline{\nabla^{k}\bm{F}}\|
𝑨𝒛kk𝑭¯++G(𝒙k+1)G(𝒙k)\displaystyle\leq\|\bm{A}\bm{z}^{k}-\overline{\nabla^{k}\bm{F}}\|+\|+G(\bm{x}^{k+1})-G(\bm{x}^{k})\|
+k+1𝑭¯k𝑭¯.\displaystyle+\|\overline{\nabla^{k+1}\bm{F}}-\overline{\nabla^{k}\bm{F}}\|. (8)

The firsr term in (8) satisfies

𝑨𝒛kk𝑭¯\displaystyle\|\bm{A}\bm{z}^{k}-\overline{\nabla^{k}\bm{F}}\|\leq λ(𝒜)𝒛kk𝑭¯\displaystyle\|\lambda(\mathcal{A})\|\|\bm{z}^{k}-\overline{\nabla^{k}\bm{F}}\|
\displaystyle\leq λ(𝒜)2Czk+1.\displaystyle\|\lambda(\mathcal{A})\|\frac{2C_{z}}{k+1}.

The second term in (8) satisfies

G(𝒙k+1)G(𝒙k)=\displaystyle\|G(\bm{x}^{k+1})-G(\bm{x}^{k})\|= i=1Nfi(xik+1)fi(xik)2\displaystyle\sqrt{\sum_{i=1}^{N}\|\nabla f_{i}(x_{i}^{k+1})-\nabla f_{i}(x_{i}^{k})\|^{2}}
\displaystyle\leq κi=1Nxik+1xik2.\displaystyle\kappa\sqrt{\sum_{i=1}^{N}\|x_{i}^{k+1}-x_{i}^{k}\|^{2}}.

Note that

xik+1xik\displaystyle\|x_{i}^{k+1}-x_{i}^{k}\|
\displaystyle\leq j=1Naij(xjkxik)+2k+1(vikxik)\displaystyle\|\sum_{j=1}^{N}a_{ij}(x_{j}^{k}-x_{i}^{k})+\frac{2}{k+1}(v_{i}^{k}-x_{i}^{k})\|
\displaystyle\leq j=1Naij(xjkx¯k+xikx¯k+2k+1vikxik)\displaystyle\sum_{j=1}^{N}a_{ij}(\|x_{j}^{k}-\bar{x}^{k}\|+\|x_{i}^{k}-\bar{x}^{k}\|+\frac{2}{k+1}\|v_{i}^{k}-x_{i}^{k}\|)
\displaystyle\leq 2(2Cx+d¯)k+1.\displaystyle\frac{2(2C_{x}+\bar{d})}{k+1}.

So far, we have

G(𝒙k+1)G(𝒙k)2κN2Cx+d¯k+1.\displaystyle\|G(\bm{x}^{k+1})-G(\bm{x}^{k})\|\leq 2\kappa\sqrt{N}\frac{2C_{x}+\bar{d}}{k+1}.

Next, the third term in (8) satisfies

k+1𝑭¯k𝑭¯\displaystyle\|\overline{\nabla^{k+1}\bm{F}}-\overline{\nabla^{k}\bm{F}}\|\leq G(𝒙k+1G(𝒙k)\displaystyle\|G(\bm{x}^{k+1}-G(\bm{x}^{k})\|
\displaystyle\leq 2κN2Cx+d¯k+1.\displaystyle 2\kappa\sqrt{N}\frac{2C_{x}+\bar{d}}{k+1}.

To sum up, we obtain that

𝒛k+1k+1𝑭¯\displaystyle\|\bm{z}^{k+1}-\overline{\nabla^{k+1}\bm{F}}\|
\displaystyle\leq λ(𝒜)2Czk+1+4κN2Cx+d¯k+1\displaystyle\|\lambda(\mathcal{A})\|\frac{2C_{z}}{k+1}+4\kappa\sqrt{N}\frac{2C_{x}+\bar{d}}{k+1}
\displaystyle\leq 2Czk+2,\displaystyle\frac{2C_{z}}{k+2},

where the last inequality follows from (7) and the definitions of constants CxC_{x} and CzC_{z}. Hence, we get the conclusion according to induction.

iii). Take

gkF(x¯k)F(x).g^{k}\triangleq F(\bar{x}^{k})-F\left(x^{*}\right).

It follows from (2) that

x¯k+1=x¯k+βkNi=1N(vikx¯k).\bar{x}^{k+1}=\bar{x}^{k}+\frac{\beta^{k}}{N}\sum_{i=1}^{N}(v_{i}^{k}-\bar{x}^{k}).

Since FF is κ\kappa-Lipschitz and Ω\Omega is compact,

gk+1gk+βkNi=1Nvikx¯k,F(x¯k)+κd¯22(βk)2.\displaystyle g^{k+1}\leq g^{k}+\frac{\beta^{k}}{N}\sum_{i=1}^{N}\langle v_{i}^{k}-\bar{x}^{k},\nabla F(\bar{x}^{k})\rangle+\frac{\kappa\bar{d}^{2}}{2}(\beta^{k})^{2}. (9)

For i𝒱i\in\mathcal{V}, because vikargminvΩv,zikv_{i}^{k}\in\operatorname{argmin}_{v\in\Omega}\langle v,z_{i}^{k}\rangle,

vikx¯k,F(x¯k)\displaystyle\langle v_{i}^{k}-\bar{x}^{k},\nabla F(\bar{x}^{k})\rangle
\displaystyle\leq vx¯k,zik+d¯zikF(x¯k)\displaystyle\langle v-\bar{x}^{k},z_{i}^{k}\rangle+\bar{d}\|z_{i}^{k}-\nabla F(\bar{x}^{k})\|
\displaystyle\leq vx¯k,F(x¯k)+2d¯zikF(x¯k),vΩ.\displaystyle\langle v-\bar{x}^{k},\nabla F(\bar{x}^{k})\rangle+2\bar{d}\|z_{i}^{k}-\nabla F(\bar{x}^{k})\|,\quad\forall v\in\Omega.

Thus, take

vargminvΩv,F(x¯k).v^{\prime}\in\operatorname{argmin}_{v\in\Omega}\langle v,\nabla F(\bar{x}^{k})\rangle.

Together with the optimality of vv^{\prime} and the convexity of FF,

vx¯k,F(x¯k)\displaystyle\langle v^{\prime}-\bar{x}^{k},\nabla F(\bar{x}^{k})\rangle xx¯k,F(x¯k)\displaystyle\leq\langle x^{*}-\bar{x}^{k},\nabla F(\bar{x}^{k})\rangle
gk.\displaystyle\leq-g^{k}. (10)

In addition,

zikF(x¯k)\displaystyle\|z_{i}^{k}-\nabla F(\bar{x}^{k})\|
\displaystyle\leq zikkF¯+kF¯F(x¯k)\displaystyle\|z_{i}^{k}-\overline{\nabla^{k}F}\|+\|\overline{\nabla^{k}F}-\nabla F(\bar{x}^{k})\|
\displaystyle\leq zikkF¯+κNi=1Nxikx¯k.\displaystyle\|z_{i}^{k}-\overline{\nabla^{k}F}\|+\frac{\kappa}{N}\sum_{i=1}^{N}\|x_{i}^{k}-\bar{x}^{k}\|. (11)

When βk=2/(k+1)\beta^{k}=2/(k+1), it follows from the obtained conclusions in i) and ii) that, (III-B) becomes

zikF(x¯k)2Czk+1+κ2Cxk+1.\displaystyle\|z_{i}^{k}-\nabla F(\bar{x}^{k})\|\leq\frac{2C_{z}}{k+1}+\kappa\frac{2C_{x}}{k+1}. (12)

By substituting (III-B) and (12) into (9), we further have

gk+1gk2gkk+1+8d¯(Cz+κCx)(k+1)2+2κd¯2(k+1)2,\displaystyle g^{k+1}\leq g^{k}-\frac{2g^{k}}{k+1}+\frac{8\bar{d}\left(C_{z}+\kappa C_{x}\right)}{(k+1)^{2}}+\frac{2\kappa\bar{d}^{2}}{(k+1)^{2}},

which yields

gk+1k1k+1gk+O(1/k2).g^{k+1}\leq\frac{k-1}{k+1}g^{k}+O(1/k^{2}).

Consequently, by [34, Lemma 4], we have O(1/k)O(1/k) convergence rate for gkg^{k}, that is,

gk(8d¯(Cz+κCx)+2κd¯2)/(k+1),k1.g^{k}\leq(8\bar{d}\left(C_{z}+\kappa C_{x}\right)+2\kappa\bar{d}^{2})/(k+1),\quad\forall\,k\geq 1.

This completes the proof. \square

IV Numerical Examples

Refer to caption
Figure 1: trajectories of the four agents’ decision variables.
TABLE I: the average real running time of solving subproblems.
       dimensions        n=16        n=64        n=256        n=1024        n=4096
       DCPf (msec)        11.5        12.0        12.6        13.3        14.1
       DeFW (msec)        11.5        12.1        12.8        13.1        14.4
       DCPb-Zeng (msec)        8.7        13.2        19.8        27.9        40.7
       DCPb-Yang (msec)        8.7        13.1        19.5        28.4        43.0
Refer to caption
(a) n=16n=16
Refer to caption
(b) n=64n=64
Refer to caption
(c) n=256n=256
Figure 2: Optimal solution errors with different dimensions n=16,64,256n=16,64,256.

In this section, we carry on some experiments to illustrate the effectiveness of our approach in this paper. We first take a simple example with only N=4N=4 agents and only n=2n=2 dimensions of the decision variables to illustrate the trajectories of Algorithm 1. Set the local cost functions as

fj(x)=(xj153+23j)2+(xj253+23j)2,j=1,2,3,4.f_{j}(x)=(x_{j}^{1}-\frac{5}{3}+\frac{2}{3}j)^{2}+(x_{j}^{2}-\frac{5}{3}+\frac{2}{3}j)^{2},\quad j=1,2,3,4.

The feasible sets is

Ω={x2:2x12,2x22}.\Omega=\{x\in\mathbb{R}^{2}:-2\leq x^{1}\leq 2,-2\leq x^{2}\leq 2\}.

The initial locations are as follows,

x1(0)\displaystyle x_{1}(0) =[1.81.8],x2(0)=[1.81.8],\displaystyle=\begin{bmatrix}-1.8\\ 1.8\end{bmatrix},\,x_{2}(0)=\begin{bmatrix}-1.8\\ -1.8\end{bmatrix},
x3(0)\displaystyle x_{3}(0) =[1.81.8],x4(0)=[1.81.8].\displaystyle=\begin{bmatrix}1.8\\ 1.8\end{bmatrix},\quad x_{4}(0)=\begin{bmatrix}1.8\\ -1.8\end{bmatrix}.

Under this circumstance, the global optimal solution should be the origin on Euclidean space. We employ a directed ring graph as the communication network. Fig.1 shows the trajectory in the plane. The boundaries of the feasible set are in black, while the trajectories of the four agents’ decision variables are showed with different colors.

Next, we show the effectiveness of our distributed projection-free dynamics by comparisons. The number of agents is increased to N=20N=20. As shown in [19], the FW method works better than projection-based algorithms when the vertices on the boundary of the constraint set are easily to find in high-dimensional decision spaces. Thus, along with quadratic local cost functions, we set the constraint set as x2\|x\|_{\infty}\leq 2. Then we choose different dimensions of decision variables and compare our distributed continuous-time projection-free algorithm (DCPf) with two distributed continuous-time projection-based algorithms, (DCPb-Yang) by [6], (DCPb-Zeng) by [7], and a decentralized discrete-time FW algorithm (DeFW) by [28]. Since all these distributed algorithms are suitable for undirected graphs, we set an undirected ring graph as the communication network.

We set the dimensions of decision variables as the power of two. In Fig.2, the xx-axis is for the real running time (CPU time) in seconds, while the yy-axis is for the optimal solution errors in each algorithm. We learn from Fig.2 that as the dimension increases, the real running time (CPU time) of projection-based algorithms is obviously longer than projection-free ones, because searching the vertices on the boundary of high-dimensional constraint sets (to solve a linear program) is faster than calculating a projection on high-dimensional constraint sets (to solve a quadratic program). Moreover, we can observe from Fig.2 that our DCPf is not second to DeFW over connected undirected graphs.

Furthermore, in Tab.I, we list the average real running time of solving subproblems, i.e., linear programs or quadratic programs. When the dimension is low, solving linear programs may take more time than solving quadratic programs over such constraint sets. However, as the dimension increases explosively, solving quadratic programs in such situation turns to be difficult, but the time of solving linear programs still remains almost the same. That conforms with the advantage of projection-free approaches.

V Discussions

In this paper, we developed a novel projection-free dynamics for solving distributed optimization with constraints. By employing the Frank-Wolfe method, we found a feasible descent direction by solving optimization with a linear objective, which avoided solving high-dimensional subproblems in projection-based algorithms. In the distributed dynamics, we simultaneously made the decision variables achieve consensus and the local auxiliary variables track the global gradient. Then we gave an analysis on the convergence of the projection-free dynamics by the Lyapunov theory. As for the induced discrete-time format, we proved the convergence rate. Finally, we provided comparative illustrations to show the efficiency of our algorithm.

In the future, we may focus on some potential challenges to improve the result in this paper. First, whether fixed constants or self-adaptive settings are suitable for parameter β(t)\beta(t) in the algorithm. Second, more advanced approaches should be investigated for directly analyzing the convergence rate of the continuous-time algorithm itself. Third, whether more general formulations of distributed constrained optimization can adopt analogous projection-free algorithms, such as resource allocation problems with coupled constraints. At last, as the advantages from the FW method itself, such projecion-free approaches fit a class of constrained optimization well, including polyhedrons, unit simplices, and unit squares. However, this approach does not always mean efficient for all kinds of constraints, and how to adopt similar ideas to handle diverse constraints is still a challenging but interesting task. Maybe mix-projection schemes for distributed optimization with different constraints will be developed in the near future.

Appendix

Proof of Lemma 1

For a convex set CnC\subseteq\mathbb{R}^{n} and xCx\in C, denote the normal cone to CC at xx by

𝒩C(x)={vn:vT(yx)0,yC},\mathcal{N}_{C}(x)=\big{\{}v\in\mathbb{R}^{n}:v^{\rm T}(y-x)\leq 0,\quad\forall y\in C\big{\}},

and the tangent cone to CC at xx by

𝒯C(x)={limkxkxtk:xkC,tk>0,xkx,tk0}.\mathcal{T}_{C}(x)=\{\lim_{k\rightarrow\infty}\frac{x_{k}-x}{t_{k}}:x_{k}\in C,t_{k}>0,x_{k}\rightarrow x,t_{k}\rightarrow 0\}.

Let PΩ(xi(t))P_{{\Omega}}(x_{i}(t)) as the projection on Ω\Omega at point xi(t)x_{i}(t), which yields that

xi(t)PΩ(xi(t))𝒩Ω(xi(t)).x_{i}(t)-P_{\Omega}(x_{i}(t))\in\mathcal{N}_{\Omega}(x_{i}(t)).

Consider xi(t)Ωx_{i}(t)\in\Omega for i𝒱i\in\mathcal{V} and some t0t\geq 0. Since vi(t)v_{i}(t), for i=1,,Ni=1,\dots,N, are also selected from Ω\Omega, we have

vi(t)xi(t)𝒯Ω(xi(t)),v_{i}(t)-x_{i}(t)\in\mathcal{T}_{\Omega}(x_{i}(t)),

and similarly,

xj(t)xi(t)𝒯Ω(xi(t)).x_{j}(t)-x_{i}(t)\in\mathcal{T}_{\Omega}(x_{i}(t)).

Hence, it follows from the dynamics of Algorithm 1 that

x˙i(t)=j=1Naij(xj(t)xi(t))+β(t)(vi(t)xi(t)),\displaystyle\dot{x}_{i}(t)=\sum_{j=1}^{N}a_{ij}(x_{j}(t)\!-\!x_{i}(t))+\beta(t)(v_{i}(t)\!-\!x_{i}(t)),

which leads to x˙i(t)𝒯Ω(xi(t))\dot{x}_{i}(t)\in\mathcal{T}_{\Omega}(x_{i}(t)). On this basis, consider the energy function as

E(t)=12xi(t)PΩ(xi(t))2.E(t)=\frac{1}{2}\|x_{i}(t)-P_{{\Omega}}(x_{i}(t))\|^{2}.

Its derivative along the dynamics of Algorithm 1 is

E˙(t)=xi(t)PΩ(xi(t)),x˙i(t)0,\displaystyle\dot{E}(t)=\langle x_{i}(t)-P_{{\Omega}}(x_{i}(t)),\,\dot{x}_{i}(t)\rangle\leq 0,

where the last inequality holds because normal cones and tangent cones are orthogonal. Since xi(0)Ωx_{i}(0)\in\Omega, we have E(0)=0E(0)=0. Together with E(t)0E(t)\geq 0 and E˙(t)0\dot{E}(t)\leq 0, we have E(t)0E(t)\equiv 0 at any time t0t\geq 0. This reveals that once all variables are located within Ω{{\Omega}} for some t0t\geq 0, they will not escape. Therefore, recalling xi(0)Ωx_{i}(0)\in\Omega for i𝒱i\in\mathcal{V}, we complete the proof. \square

Proof of Lemma 2

Let us denote the funciton

h(t)=exp0tγ(τ)𝑑τ,h(t)=\exp\int_{0}^{t}\gamma(\tau)d\tau,

which implies

limth(t)=,h˙(t)=γ(t)h(t).\lim_{t\rightarrow\infty}h(t)=\infty,\quad\dot{h}(t)=\gamma(t)h(t).

Recall that s(t)0s(t)\geq 0 and

s˙(t)γ(t)s(t)+γ(t)ε(t).\dot{s}(t)\leq-\gamma(t)s(t)+\gamma(t)\varepsilon(t).

After multiplying the both sides of the inequality above by h(t)h(t), we can further obtain that

ddt(s(t)h(t))γ(t)h(t)ε(t).\frac{d}{dt}\big{(}s(t)h(t)\big{)}\leq\gamma(t)h(t)\varepsilon(t).

Then we integrate the above on the segment (0,t)(0,t) by the Comparison Lemma in [11], which leads to the following inequality

s(t)s(0)h(t)+1h(t)0tγ(τ)h(τ)ε(τ)𝑑τ.s(t)\leq\frac{s(0)}{h(t)}+\frac{1}{h(t)}\int_{0}^{t}\gamma(\tau)h(\tau)\varepsilon(\tau)d\tau.

Hereupon, we can give a detailed discussion based on the above result. On the one hand, if

0γ(τ)h(τ)ε(τ)𝑑τ<,\int_{0}^{\infty}\gamma(\tau)h(\tau)\varepsilon(\tau)d\tau<\infty,

then limts(t)=0\lim_{t\rightarrow\infty}s(t)=0 eventually. On the other hand, if

0γ(τ)h(τ)ε(τ)𝑑τ=,\int_{0}^{\infty}\gamma(\tau)h(\tau)\varepsilon(\tau)d\tau=\infty,

then it follows from L’ Hospital rule that

limtsups(t)\displaystyle\lim_{t\rightarrow\infty}\sup s(t)\leq limtγ(t)h(t)ε(t)γ(t)h(t)\displaystyle\lim_{t\rightarrow\infty}\frac{\gamma(t)h(t)\varepsilon(t)}{\gamma(t)h(t)}
=limtε(t)=0,\displaystyle\quad\quad=\lim_{t\rightarrow\infty}\varepsilon(t)=0,

which completes the proof. \square

References

  • [1] A. Nedic, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
  • [2] D. Yuan, S. Xu, H. Zhao, and L. Rong, “Distributed dual averaging method for multi-agent optimization with quantized communication,” Systems & Control Letters, vol. 61, no. 11, pp. 1053–1061, 2012.
  • [3] X. Li, G. Feng, and L. Xie, “Distributed proximal algorithms for multi-agent optimization with coupled inequality constraints,” IEEE Transactions on Automatic Control, 2020.
  • [4] D. Yuan, D. W. Ho, and S. Xu, “Regularized primal–dual subgradient method for distributed constrained optimization,” IEEE transactions on Cybernetics, vol. 46, no. 9, pp. 2109–2118, 2015.
  • [5] X. Wang, Y. Hong, and H. Ji, “Distributed optimization for a class of nonlinear multiagent systems with disturbance rejection,” IEEE transactions on Cybernetics, vol. 46, no. 7, pp. 1655–1666, 2015.
  • [6] S. Yang, Q. Liu, and J. Wang, “A multi-agent system with a proportional-integral protocol for distributed constrained optimization,” IEEE Transactions on Automatic Control, vol. 62, no. 7, pp. 3461–3467, 2016.
  • [7] X. Zeng, P. Yi, and Y. Hong, “Distributed continuous-time algorithm for constrained convex optimizations via nonsmooth analysis approach,” IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 5227–5233, 2016.
  • [8] X. Yi, L. Yao, T. Yang, J. George, and K. H. Johansson, “Distributed optimization for second-order multi-agent systems with dynamic event-triggered communication,” in Conference on Decision and Control.   IEEE, 2018, pp. 3397–3402.
  • [9] X. Shi, J. Cao, G. Wen, and M. Perc, “Finite-time consensus of opinion dynamics and its applications to distributed optimization over digraph,” IEEE transactions on Cybernetics, vol. 49, no. 10, pp. 3767–3779, 2018.
  • [10] K. Lu, G. Jing, and L. Wang, “Distributed algorithms for searching generalized nash equilibrium of noncooperative games,” IEEE transactions on Cybernetics, vol. 49, no. 6, pp. 2362–2371, 2018.
  • [11] H. K. Khalil and J. W. Grizzle, Nonlinear Systems.   Prentice Hall Upper Saddle River, NJ, 2002, vol. 3.
  • [12] A. Ruszczynski, Nonlinear Optimization.   Princeton University Press, 2011.
  • [13] P. Yi, Y. Hong, and F. Liu, “Initialization-free distributed algorithms for optimal resource allocation with feasibility constraints and application to economic dispatch of power systems,” Automatica, vol. 74, pp. 259–269, 2016.
  • [14] Y. Zhu, W. Yu, G. Wen, and G. Chen, “Projected primal–dual dynamics for distributed constrained nonsmooth convex optimization,” IEEE Transactions on Cybernetics, vol. 50, no. 4, pp. 1776–1782, 2018.
  • [15] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, “A survey of distributed optimization,” Annual Reviews in Control, vol. 47, pp. 278–305, 2019.
  • [16] S. Liang, X. Zeng, G. Chen, and Y. Hong, “Distributed sub-optimal resource allocation via a projected form of singular perturbation,” Automatica, vol. 121, p. 109180, 2020.
  • [17] M. Frank, P. Wolfe et al., “An algorithm for quadratic programming,” Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 95–110, 1956.
  • [18] G. Chen, Y. Ming, Y. Hong, and P. Yi, “Distributed algorithm for ε\varepsilon-generalized nash equilibria with uncertain coupled constraints,” Automatica, vol. 123, p. 109313.
  • [19] M. Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse convex optimization,” in International Conference on Machine Learning, 2013, pp. 427–435.
  • [20] X. Jiang, X. Zeng, L. Xie, J. Sun, and J. Chen, “Distributed stochastic projection-free solver for constrained optimization,” arXiv preprint arXiv:2204.10605, 2022.
  • [21] Z. Cui, H. Chang, S. Shan, and X. Chen, “Generalized unsupervised manifold alignment,” Advances in Neural Information Processing Systems, vol. 27, pp. 2429–2437, 2014.
  • [22] L. Chapel, M. Z. Alaya, and G. Gasso, “Partial optimal tranport with applications on positive-unlabeled learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 2903–2913, 2020.
  • [23] X. Zhang and S. Mahadevan, “A bio-inspired approach to traffic network equilibrium assignment problem,” IEEE Transactions on Cybernetics, vol. 48, no. 4, pp. 1304–1315, 2017.
  • [24] D. Garber and E. Hazan, “Faster rates for the Frank-Wolfe method over strongly-convex sets,” in International Conference on Machine Learning, 2015, pp. 541–549.
  • [25] W. Zhang, P. Zhao, W. Zhu, S. C. Hoi, and T. Zhang, “Projection-free distributed online learning in networks,” in International Conference on Machine Learning, 2017, pp. 4054–4062.
  • [26] M. Zhang, L. Chen, A. Mokhtari, H. Hassani, and A. Karbasi, “Quantized Frank-Wolfe: Faster optimization, lower communication, and projection free,” in International Conference on Artificial Intelligence and Statistics, 2020, pp. 3696–3706.
  • [27] M. Jacimovic and A. Geary, “A continuous conditional gradient method,” Yugoslav Journal of Operations Research, vol. 9, no. 2, pp. 169–182, 1999.
  • [28] H.-T. Wai, J. Lafond, A. Scaglione, and E. Moulines, “Decentralized Frank-Wolfe algorithm for convex and nonconvex problems,” IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 5522–5537, 2017.
  • [29] B. Gharesifard and J. Cortés, “Distributed continuous-time convex optimization on weight-balanced digraphs,” IEEE Transactions on Automatic Control, vol. 59, no. 3, pp. 781–786, 2013.
  • [30] A. Nedic, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
  • [31] S. Pu and A. Nedić, “Distributed stochastic gradient tracking methods,” Mathematical Programming, pp. 1–49, 2020.
  • [32] Z. Qiu, S. Liu, and L. Xie, “Distributed constrained optimal consensus of multi-agent systems,” Automatica, vol. 68, pp. 209–215, 2016.
  • [33] R. A. Horn and C. R. Johnson, Matrix Analysis.   Cambridge University Press, 2012.
  • [34] A. Rivet and A. Souloumiac, “Introduction to optimization,” in Optimization Software, Publications Division.   Citeseer, 1987.