
A distributed proximal splitting method with linesearch for locally Lipschitz data

Felipe Atenas School of Mathematics & Statistics, The University of Melbourne, Australia. E-mail: [email protected]    Minh N. Dao School of Science, RMIT University, Australia E-mail: [email protected]    Matthew K. Tam School of Mathematics & Statistics, The University of Melbourne, Australia. E-mail: [email protected]
(October 21, 2024)
Abstract

In this paper, we propose a distributed first-order algorithm with backtracking linesearch for solving multi-agent minimisation problems, where each agent handles a local objective involving nonsmooth and smooth components. Unlike existing methods that require global Lipschitz continuity and predefined stepsizes, our algorithm adjusts stepsizes using distributed linesearch procedures, making it suitable for problems where global constants are unavailable or difficult to compute. The proposed algorithm is designed within an abstract linesearch framework for a primal-dual proximal-gradient method to solve min-max convex-concave problems, enabling the consensus constraint to be decoupled from the optimisation task. Our theoretical analysis allows for gradients of functions to be locally Lipschitz continuous, relaxing the prevalent assumption of globally Lipschitz continuous gradients.

Keywords:

Backtracking linesearch, distributed optimisation, locally Lipschitz data, min-max problems, primal-dual algorithm, proximal splitting algorithm.

Mathematics Subject Classification (MSC 2020):

90C25, 68W15, 49M27, 65K05.

1 Introduction

In this work, we consider a network of n agents, connected by a communication graph, working collaboratively to solve

minxdi=1n(fi(x)+hi(x)),\min_{x\in\mathbbm{R}^{d}}\sum_{i=1}^{n}(f_{i}(x)+h_{i}(x)), (1)

where each agent i\in\{1,\dots,n\} possesses its own components of the objective function f_{i}:\mathbbm{R}^{d}\to\mathbbm{R}\cup\{+\infty\} and h_{i}:\mathbbm{R}^{d}\to\mathbbm{R}. Here, f_{i} is a proper lower semicontinuous convex function that may represent constraints or nonsmooth components of the problem, while h_{i} is a differentiable convex function with locally Lipschitz continuous gradient capturing smooth components in the objective. Each agent has the ability to perform local computations using their respective f_{i} and h_{i}, and the agents communicate via their immediate neighbours in the network to exchange information.

Distributed optimisation problems fitting the form of (1) arise in numerous applications including signal processing, statistics and machine learning [15, 19, 17, 18]. Several methods have been proposed to address problems of this type; however, they typically assume that smooth components in the objective function have globally Lipschitz continuous gradients [23, 24, 11], leading to convergence guarantees involving stepsize bounds that depend on the inverse of the (global) Lipschitz modulus [6, 24, 11]. In the case such constants are unavailable, for instance, when the problem formulation does not have globally Lipschitz gradients, or when the Lipschitz constant exists but cannot be computed accurately, defining appropriate stepsizes poses a challenge. Consequently, alternative strategies for determining the stepsizes, such as via a backtracking linesearch [2, 3], are required.

In order to solve (1), we present linesearch procedures for a distributed first-order algorithm. In the proposed method, each component of the objective function is handled separately, and each iteration utilises a backtracking linesearch involving communication, only allowing each agent to communicate with immediate neighbours and update local variables. For each individual agent, vector operations are performed in a decentralised manner, and the coordination operations to determine the stepsizes use distributed communication with only scalar values. Our approach builds on decentralised proximal-gradient methods, casting problem (1) into the more general framework of min-max problems and addressing the consensus constraint in the dual space. Primal-dual methods to solve min-max problems have been extensively studied in the literature, see [5, 9, 10, 12, 14] and references therein.

Let us recall that a proximal-gradient exact first-order algorithm (PG-EXTRA) is proposed in [24] to solve (1). This algorithm performs rounds of communication for the iterates in each iteration, using a common constant stepsize among agents. Denoting by N(i) the set of agents directly connected to agent i in the network and by W\in\mathbbm{R}^{n\times n} a matrix that represents the topology of the network and models communication among agents, called a mixing matrix (see Definition 2.3), the main iteration of PG-EXTRA can be formulated as follows. Given a fixed stepsize \sigma>0, for each k\geqslant 1 and i\in\{1,\dots,n\}, compute

{wik=wik1+jN(i)Wijxjk12(jN(i)Wijxjk1+xik1)σ(hi(xik)hi(xik1))xik+1=proxσfi(wik).\left\{\begin{aligned} w^{k}_{i}&=w^{k-1}_{i}+\displaystyle\sum_{j\in N(i)}W_{ij}x^{k}_{j}-\dfrac{1}{2}\left(\displaystyle\sum_{j\in N(i)}W_{ij}x_{j}^{k-1}+x_{i}^{k-1}\right)-\sigma\left(\nabla h_{i}(x^{k}_{i})-\nabla h_{i}(x^{k-1}_{i})\right)\\ x^{k+1}_{i}&=\operatorname{prox}_{\sigma f_{i}}\big{(}w^{k}_{i}\big{)}.\end{aligned}\right. (2)

This algorithm converges to a solution of (1) as long as each h_{i} has a globally Lipschitz continuous gradient with constant L_{i}, and the stepsize \sigma satisfies a bound that depends on the minimum eigenvalue of W and the Lipschitz constants:

0<σ<1+λmin(W)Lh where Lh=maxi{1,,n}Li.0<\sigma<\dfrac{1+\lambda_{\min}(W)}{L_{h}}\text{~{}~{}where~{}~{}}L_{h}=\max_{i\in\{1,\dots,n\}}L_{i}.

However, computing this bound requires the smallest eigenvalue of the mixing matrix W and the global Lipschitz constants of the gradients \nabla h_{i} for each i\in\{1,\dots,n\}, two possibly expensive tasks. Moreover, the latter becomes unviable if at least one of the gradients is only locally Lipschitz continuous.

To overcome the aforementioned challenge, we propose a distributed first-order algorithm that uses a linesearch to compute appropriate stepsizes using rounds of peer-to-peer communication. For a sequence of stepsizes (\tau_{k})_{k}\subseteq(0,+\infty) computed via this linesearch, and a constant \beta>0, the proposed algorithm takes the following form: for all k\geqslant 1 and i\in\{1,\dots,n\},

{uik=uik1+τk12(xikjN(i)Wijxjk)u¯ik=uik+τkτk1(uikuik1)xik+1=proxβτkfi(xikβτkhi(xik)βτku¯ik).\left\{\begin{aligned} u^{k}_{i}&=u^{k-1}_{i}+\dfrac{\tau_{k-1}}{2}\left(x_{i}^{k}-\displaystyle\sum_{j\in N(i)}W_{ij}x_{j}^{k}\right)\\ \overline{u}^{k}_{i}&=u^{k}_{i}+\dfrac{\tau_{k}}{\tau_{k-1}}(u^{k}_{i}-u^{k-1}_{i})\\ x^{k+1}_{i}&=\operatorname{prox}_{\beta\tau_{k}f_{i}}\big{(}x^{k}_{i}-\beta\tau_{k}\nabla h_{i}(x^{k}_{i})-\beta\tau_{k}\overline{u}^{k}_{i}\big{)}.\end{aligned}\right. (3)

The relation between PG-EXTRA and the proposed algorithm is obtained by setting \tau_{k}=\tau, \beta\tau^{2}=1 and w^{k}_{i}=x^{k}_{i}-\beta\tau_{k}\nabla h_{i}(x^{k}_{i})-\beta\tau_{k}\overline{u}^{k}_{i} for each i=1,\dots,n. This change of variables yields (2) with \sigma=\beta\tau (see Remark 4.2 for further details). As we will see later, the alternative form used in (3) will be more convenient than (2) for the linesearch formulation and analysis.
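For concreteness, the following sketch (not part of the original development) records one pass of iteration (3) for a single agent in NumPy, under the simplifying assumption of a fixed stepsize pair (tau_prev, tau); in the actual method, tau is produced by the distributed linesearch of Section 4, and the oracle names grad_h_i and prox_f_i are purely illustrative.

```python
import numpy as np

def agent_iteration(i, x, u_prev, W, neighbours, grad_h_i, prox_f_i,
                    tau, tau_prev, beta):
    """One pass of iteration (3) for agent i (a sketch with a fixed stepsize).

    x is the n-by-d array whose rows are the agents' local copies; agent i only
    reads the rows of its neighbours. prox_f_i(z, t) is assumed to return
    prox_{t f_i}(z); both oracle names are illustrative.
    """
    # u_i^k = u_i^{k-1} + (tau_{k-1}/2) * (x_i^k - sum_{j in N(i)} W_ij x_j^k)
    mixed = sum(W[i, j] * x[j] for j in neighbours)   # include j = i if W_ii != 0
    u = u_prev + 0.5 * tau_prev * (x[i] - mixed)
    # extrapolation with theta_k = tau_k / tau_{k-1}
    u_bar = u + (tau / tau_prev) * (u - u_prev)
    # proximal-gradient step with stepsize beta * tau_k
    x_next = prox_f_i(x[i] - beta * tau * (grad_h_i(x[i]) + u_bar), beta * tau)
    return x_next, u
```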

Our contribution can be summarised as follows. We propose two linesearch procedures for a distributed proximal splitting method to solve the multi-agent minimisation problem (1); in other words, we propose backtracking linesearch routines for PG-EXTRA. Our derivation uses an abstract linesearch framework to define stepsizes for a primal-dual first-order splitting method for solving min-max convex-concave problems with a linear coupling term. We develop a convergence analysis for this splitting method that only requires locally Lipschitz continuous gradients; in this case, the linesearch estimates the local Lipschitz constant of the gradients. Furthermore, we extend this linesearch to allow approximation of the norm of the linear coupling term, giving more flexibility to the step length in each iteration.

The paper is organised as follows. Section 2 provides the preliminary concepts and materials used in this work. In Section 3, we propose and analyse the convergence of a unifying linesearch framework for a primal-dual algorithm, which is related to the method with linesearch proposed in [13]. Building on this, Section 4 presents two distributed linesearch procedures to define stepsizes for a distributed proximal-gradient algorithm with guarantees of convergence to a solution of (1). Concluding remarks are given in Section 5.

2 Preliminaries

2.1 General definitions and notations

We assume throughout that 𝒰\mathcal{U} is a finite-dimensional Hilbert space with inner product ,\langle\cdot,\cdot\rangle and induced norm \|\cdot\|. When 𝒰=d\mathcal{U}=\mathbbm{R}^{d}, the notation ,\langle\cdot,\cdot\rangle represents the dot product given by u,v=uv\langle u,v\rangle=u^{\top}v for all u,v𝒰u,v\in\mathcal{U}. Similarly, when 𝒰=n×d\mathcal{U}=\mathbbm{R}^{n\times d}, ,\langle\cdot,\cdot\rangle denotes the trace inner product given by A,B=tr(AB)\langle A,B\rangle=\text{tr}(A^{\top}B) for all A,B𝒰A,B\in\mathcal{U}. For a symmetric matrix AA, λmin(A)\lambda_{\min}(A) and λmax(A)\lambda_{\max}(A) denote the smallest and largest eigenvalues of AA, respectively.

Let ϕ:𝒰{+}\phi:\mathcal{U}\to\mathbbm{R}\cup\{+\infty\}. The domain of ϕ\phi is domϕ:={x𝒰:ϕ(x)<+}\operatorname{dom}\phi:=\{x\in\mathcal{U}:\phi(x)<+\infty\}. We say that ϕ\phi is proper if its domain is nonempty, lower semicontinuous (lsc) if, for all x𝒰x\in\mathcal{U}, ϕ(x)lim infyxϕ(y)\phi(x)\leqslant\liminf_{y\to x}\phi(y), and convex if, for all x,ydomϕx,y\in\operatorname{dom}\phi and all λ(0,1)\lambda\in(0,1),

ϕ(λx+(1λ)y)λϕ(x)+(1λ)ϕ(y).\phi(\lambda x+(1-\lambda)y)\leqslant\lambda\phi(x)+(1-\lambda)\phi(y).

We say an operator Φ:C𝒰𝒰\Phi:C\subseteq\mathcal{U}\to\mathcal{U} is (globally) Lipschitz continuous on CC with constant L>0L>0 if, for all x,yCx,y\in C,

Φ(x)Φ(y)Lxy.\|\Phi(x)-\Phi(y)\|\leqslant L\|x-y\|.

When C=𝒰C=\mathcal{U}, we simply say that Φ\Phi is Lipschitz continuous.

Functions with Lipschitz gradients provide a powerful ingredient for convergence analysis of gradient-based methods, known as the descent lemma [1, Lemma 2.64]. This result states that the first-order approximation error can be bounded by a quadratic term when gradients are Lipschitz continuous.

Lemma 2.1 (Descent lemma).

Suppose \phi:C\to\mathbbm{R} is a differentiable function over an open convex set C\subseteq\mathcal{U}, such that \nabla\phi is Lipschitz continuous on C with constant L>0. Then, for all x,y\in C,

|ϕ(y)ϕ(x)ϕ(x),yx|L2yx2.|\phi(y)-\phi(x)-\langle\nabla\phi(x),y-x\rangle|\leqslant\dfrac{L}{2}\|y-x\|^{2}.

We say that an operator \Phi:C\subseteq\mathcal{U}\to\mathcal{U} is locally Lipschitz continuous on C if, for all x\in C, there exists an open set U_{x}\subseteq C containing x such that \Phi restricted to U_{x} is Lipschitz continuous. The gradient of any twice continuously differentiable function is locally Lipschitz continuous, since the Hessian is locally bounded. For functions whose gradients are locally Lipschitz continuous on a finite-dimensional space, Lemma 2.1 is therefore satisfied over bounded convex sets.
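As a simple one-dimensional illustration (ours, not taken from the references), consider \phi(t)=\frac{1}{4}t^{4} on \mathbbm{R}, so that \nabla\phi(t)=t^{3}. On any bounded interval [-R,R] one has

|s^{3}-t^{3}|=|s-t|\,|s^{2}+st+t^{2}|\leqslant 3R^{2}|s-t|,

so \nabla\phi is Lipschitz continuous there with constant 3R^{2}, while no single constant works on all of \mathbbm{R}. Hence Lemma 2.1 applies to \phi on each bounded interval, but not globally.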

Given a proper lsc convex function \phi:\mathcal{U}\to\mathbbm{R}\cup\{+\infty\} and \tau>0, the proximity operator of \phi with stepsize \tau at x\in\mathcal{U} is given by

proxτϕ(x)=argminy𝒰{ϕ(y)+12τyx2}.\operatorname{prox}_{\tau\phi}(x)=\operatorname*{arg\,min}_{y\in\mathcal{U}}\left\{\phi(y)+\dfrac{1}{2\tau}\|y-x\|^{2}\right\}.

For a nonempty closed convex set X\subseteq\mathcal{U}, the indicator function \iota_{X} of X is given by \iota_{X}(x)=0 if x\in X and +\infty otherwise, and the projector onto X is denoted by \operatorname{proj}_{X}:=\operatorname{prox}_{\iota_{X}}.

The following lemma is concerned with continuity properties of proximity operators. Here, the notation X¯\overline{X} denotes the closure of a set X𝒰X\subseteq\mathcal{U}.

Lemma 2.2 ([8, Corollary 7]).

Let ϕ:𝒰{+}\phi:\mathcal{U}\to\mathbbm{R}\cup\{+\infty\} be a proper lsc convex function. Then the operator given by

[0,+)×𝒰𝒰:(t,x){proxtϕ(x)if t>0,projdom(ϕ)¯(x)if t=0[0,+\infty)\times\mathcal{U}\to\mathcal{U}\colon(t,x)\mapsto\begin{cases}\operatorname{prox}_{t\phi}(x)&\text{if~{}}t>0,\\ \operatorname{proj}_{\overline{\operatorname{dom}(\phi)}}(x)&\text{if~{}}t=0\end{cases}

is continuous on its domain. Moreover, it is locally Lipschitz on the interior of its domain.

2.2 Distributed optimisation

We now turn our attention to notation and preliminaries for distributed optimisation. Following [23, 24, 22], given nn (column) vectors x1,,xndx_{1},\dots,x_{n}\in\mathbbm{R}^{d}, we denote

𝐱=(x1xn)=(x1xn)n×d\mathbf{x}=\begin{pmatrix}x_{1}^{\top}\\ \vdots\\ x_{n}^{\top}\\ \end{pmatrix}=\begin{pmatrix}x_{1}&\cdots&x_{n}\\ \end{pmatrix}^{\top}\in\mathbb{R}^{n\times d}

For the functions in problem (1), we define f:n×d{+}f:\mathbbm{R}^{n\times d}\to\mathbbm{R}\cup\{+\infty\} and h:n×dh:\mathbbm{R}^{n\times d}\to\mathbbm{R} as

f(𝐱):=i=1nfi(xi) and h(𝐱):=i=1nhi(xi).f(\mathbf{x}):=\sum_{i=1}^{n}f_{i}(x_{i})\text{~{}~{}and~{}~{}}h(\mathbf{x}):=\sum_{i=1}^{n}h_{i}(x_{i}).

Under this definition, the proximity operator of ff with parameter τ\tau and the gradient of hh are given by

proxτf(𝐱)=(proxτf1(x1)proxτfn(xn)) and h(𝐱)=(h1(x1)hn(xn)).\operatorname{prox}_{\tau f}(\mathbf{x})=\begin{pmatrix}\operatorname{prox}_{\tau f_{1}}(x_{1})^{\top}\\ \vdots\\ \operatorname{prox}_{\tau f_{n}}(x_{n})^{\top}\end{pmatrix}\text{~{}~{}and~{}~{}}\nabla h(\mathbf{x})=\begin{pmatrix}\nabla h_{1}(x_{1})^{\top}\\ \vdots\\ \nabla h_{n}(x_{n})^{\top}\\ \end{pmatrix}.
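Both operators therefore act row by row and can be evaluated without any communication between agents. A minimal NumPy sketch of this row-wise evaluation is given below; the per-agent oracle lists prox_list and grad_list are illustrative names, not part of the paper.

```python
import numpy as np

def prox_f(X, tau, prox_list):
    """Row-wise proximity operator of f(X) = sum_i f_i(x_i).

    prox_list[i] is an illustrative oracle (z, t) -> prox_{t f_i}(z); each agent
    only needs its own row of X, so no communication is required.
    """
    return np.vstack([prox_list[i](X[i], tau) for i in range(X.shape[0])])

def grad_h(X, grad_list):
    """Row-wise gradient of h(X) = sum_i h_i(x_i)."""
    return np.vstack([grad_list[i](X[i]) for i in range(X.shape[0])])
```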

In order to model communication between agents through the communication graph (V,E) in (1), we recall the following definition (see, for example, [23]).

Definition 2.3 (Mixing matrix).

Given an undirected graph (V,E) with |V|=n, a matrix W\in\mathbbm{R}^{n\times n} is called a mixing matrix if

  1. (i)

    W_{ij}=0 whenever ij\notin E,

  2. (ii)

    W=W^{\top},

  3. (iii)

    \ker(I-W)=\mathbbm{R}e, where e=(1,\dots,1)^{\top},

  4. (iv)

    -I\prec W\preceq I.

Definition 2.3(i)–(ii) ensures that communication is symmetric and only occurs between neighbours, while Definition 2.3(iii) is used to ensure consensus, and Definition 2.3(iv) is related to convergence. Note that, under the notation defined in this section, (2) can be combined over all agents i\in\{1,\dots,n\} to write PG-EXTRA compactly as

{𝐰k=𝐰k1+W𝐱k12(W+I)𝐱k1τ(h(𝐱k)h(𝐱k1))𝐱k+1=proxτf(𝐰k).\left\{\begin{aligned} \mathbf{w}^{k}&=\mathbf{w}^{k-1}+W\mathbf{x}^{k}-\dfrac{1}{2}(W+I)\mathbf{x}^{k-1}-\tau\left(\nabla h(\mathbf{x}^{k})-\nabla h(\mathbf{x}^{k-1})\right)\\ \mathbf{x}^{k+1}&=\operatorname{prox}_{\tau f}\big{(}\mathbf{w}^{k}\big{)}.\end{aligned}\right. (4)
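A matrix-form sketch of (4) with a fixed stepsize may help fix ideas. The sketch below is ours: prox_f(X, t) and grad_h(X) denote the row-wise operators introduced above, and the first iterate uses the initialisation w^1 = W x^1 - sigma * grad h(x^1) discussed later in Remark 4.2(ii).

```python
import numpy as np

def pg_extra(W, prox_f, grad_h, x1, sigma, iters=1000):
    """Sketch of PG-EXTRA (4) with a fixed stepsize sigma (written tau in (4)).

    prox_f(X, t) applies prox_{t f_i} row-wise and grad_h(X) stacks the local
    gradients of the h_i.
    """
    n = W.shape[0]
    x_prev = x1.copy()
    w = W @ x_prev - sigma * grad_h(x_prev)        # w^1 (cf. Remark 4.2(ii))
    x = prox_f(w, sigma)                           # x^2
    for _ in range(iters):
        w = (w + W @ x - 0.5 * (W + np.eye(n)) @ x_prev
             - sigma * (grad_h(x) - grad_h(x_prev)))
        x_prev, x = x, prox_f(w, sigma)
    return x
```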

As discussed in [14], PG-EXTRA can be understood as the Condat–Vũ method [7, 25] applied to a reformulation of (1) as a saddle-point problem (see also Section 4). In what follows, we recall the derivation of this reformulation. For each agent i{1,,n}i\in\{1,\dots,n\}, let xidx_{i}\in\mathbb{R}^{d} denote its copy of the decision variable xdx\in\mathbb{R}^{d}. Then problem (1) can be expressed as

minx1,,xndi=1nfi(xi)+hi(xi) s.t. x1==xn.\min_{x_{1},\dots,x_{n}\in\mathbbm{R}^{d}}\sum_{i=1}^{n}f_{i}(x_{i})+h_{i}(x_{i})\text{~{}~{}s.t.~{}~{}}x_{1}=\dots=x_{n}. (5)

Let U:=\bigl((I-W)/2\bigr)^{1/2} so that \ker(U)=\ker(U^{2})=\ker(I-W)=\mathbb{R}e. Then x_{1}=\dots=x_{n} if and only if U\mathbf{x}=0, in which case we say \mathbf{x} is in consensus. Hence, problem (5) becomes

min𝐱n×dι{0}(U𝐱)+f(𝐱)+h(𝐱).\min_{\mathbf{x}\in\mathbbm{R}^{n\times d}}\iota_{\{0\}}(U\mathbf{x})+f(\mathbf{x})+h(\mathbf{x}). (6)

Since \iota_{\{0\}} is proper lsc convex with \iota^{*}_{\{0\}}=0, we have \iota_{\{0\}}=\iota_{\{0\}}^{**}=\sup_{\mathbf{y}\in\mathbb{R}^{n\times d}}\langle\cdot,\mathbf{y}\rangle. Substituting this into (6) therefore gives

min𝐱n×dsup𝐲n×dU𝐲,𝐱+f(𝐱)+h(𝐱).\min_{\mathbf{x}\in\mathbbm{R}^{n\times d}}\sup_{\mathbf{y}\in\mathbbm{R}^{n\times d}}\langle U\mathbf{y},\mathbf{x}\rangle+f(\mathbf{x})+h(\mathbf{x}). (7)

Altogether, this shows that the multi-agent minimisation problem (1) can be reformulated as a saddle-point problem (7) having a linear coupling term related to consensus among agents. Note that, in this context, \|U\| can be computed from the mixing matrix W. The structure of the latter problem (7) motivates our study of min-max problems in the following section.

3 Proximal splitting algorithm with linesearch

Consider the following convex-concave min-max problem

minu𝒰maxv𝒱Ψ(u,v):=Ku,v+G(u)F(v)H(v),\min_{u\in\mathcal{U}}\max_{v\in\mathcal{V}}\>\Psi(u,v):=\langle Ku,v\rangle+G(u)-F(v)-H(v), (8)

where K:𝒰𝒱K:\mathcal{U}\to\mathcal{V} is a (bounded) linear operator between two finite-dimensional Hilbert spaces, G:𝒰{+}G:\mathcal{U}\to\mathbbm{R}\cup\{+\infty\} and F:𝒱{+}F:\mathcal{V}\to\mathbbm{R}\cup\{+\infty\} are proper lsc convex functions, and H:𝒱H:\mathcal{V}\to\mathbbm{R} is convex differentiable with locally Lipschitz continuous gradient. We assume that K\|K\| is known and that there exists a saddle point for problem (8), that is, there exists a pair (u^,v^)𝒰×𝒱(\hat{u},\hat{v})\in\mathcal{U}\times\mathcal{V} satisfying

(u,v)𝒰×𝒱,Ψ(u^,v)Ψ(u^,v^)Ψ(u,v^).\forall(u,v)\in\mathcal{U}\times\mathcal{V},\quad\Psi(\hat{u},v)\leqslant\Psi(\hat{u},\hat{v})\leqslant\Psi(u,\hat{v}). (9)

Conditions for existence of saddle points for convex-concave problems can be found, for example, in [20, 21]. Note that (9) can be equivalently expressed as the following pair of inequalities

{u𝒰,P^(u^,v^)(u):=G(u)G(u^)+Kv^,uu^0,v𝒱,D^(u^,v^)(v):=(F+H)(v)(F+H)(v^)Ku^,vv^0.\left\{\begin{aligned} \forall u\in\mathcal{U},&&\hat{P}_{(\hat{u},\hat{v})}(u)&:=G(u)-G(\hat{u})+\langle K^{*}\hat{v},u-\hat{u}\rangle\geqslant 0,\\ \forall v\in\mathcal{V},&&\hat{D}_{(\hat{u},\hat{v})}(v)&:=(F+H)(v)-(F+H)(\hat{v})-\langle K\hat{u},v-\hat{v}\rangle\geqslant 0.\end{aligned}\right. (10)

The two quantities above define the primal-dual gap. Given a saddle point (u^,v^)(\hat{u},\hat{v}) of problem (8), the primal-dual gap is given by

𝒢^(u^,v^)(u,v)=P^(u^,v^)(u)+D^(u^,v^)(v)0.\hat{\mathcal{G}}_{(\hat{u},\hat{v})}(u,v)=\hat{P}_{(\hat{u},\hat{v})}(u)+\hat{D}_{(\hat{u},\hat{v})}(v)\geqslant 0. (11)

Assuming H\nabla H is globally Lipschitz continuous (with constant L>0L>0) and knowledge of K\|K\|, the Condat–Vũ method [7, 25] can be used to solve (8). Given stepsizes τ,σ>0\tau,\sigma>0 satisfying Lσ2+τσK21\frac{L\sigma}{2}+\tau\sigma\|K\|^{2}\leqslant 1, this method takes the form

{uk=proxτG(uk1τKvk)u¯k=2ukuk1vk+1=proxσF(vk+σ[Ku¯kH(vk)]).\left\{\begin{aligned} u^{k}&=\operatorname{prox}_{\tau G}\big{(}u^{k-1}-\tau K^{*}v^{k}\big{)}\\ \overline{u}^{k}&=2u^{k}-u^{k-1}\\ v^{k+1}&=\operatorname{prox}_{\sigma F}\big{(}v^{k}+\sigma\big{[}K\overline{u}^{k}-\nabla H(v^{k})\big{]}\big{)}.\end{aligned}\right.

Note that, in the special case where H=0, this method reduces to the primal-dual hybrid gradient (PDHG) method from [5]. To handle the case where useful estimates of both L and \|K\| are unavailable, a version of the Condat–Vũ method with a backtracking linesearch was developed in [13, Section 4]. However, this approach is not directly suited to our setting for two reasons: firstly, it still requires global Lipschitz continuity of \nabla H (even though the Lipschitz constant is not needed) and, secondly, it cannot exploit knowledge of \|K\| even when it is available. In the distributed setting of Section 4, the latter is related to the minimum eigenvalue of the mixing matrix and, as such, is typically assumed to be known.
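For reference, a direct NumPy transcription of the Condat–Vũ iteration above (fixed stepsizes, K given as a matrix) might look as follows; this is a sketch of the classical method rather than of our proposal, and prox_G, prox_F are illustrative oracles with signature (point, stepsize).

```python
import numpy as np

def condat_vu(prox_G, prox_F, grad_H, K, u0, v0, tau, sigma, iters=1000):
    """Fixed-stepsize Condat-Vu iteration for (8).

    Assumes tau and sigma satisfy L*sigma/2 + tau*sigma*||K||**2 <= 1, where L is
    a global Lipschitz constant of grad H.
    """
    u_prev, v = u0.copy(), v0.copy()
    for _ in range(iters):
        u = prox_G(u_prev - tau * (K.T @ v), tau)               # u^k
        u_bar = 2.0 * u - u_prev                                # extrapolation
        v = prox_F(v + sigma * (K @ u_bar - grad_H(v)), sigma)  # v^{k+1}
        u_prev = u
    return u_prev, v
```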

Thus, in order to solve problem (8) in the setting relevant to distributed optimisation, in this section we extend the analysis of the primal-dual algorithm with linesearch introduced in [13] to the case where \nabla H is locally Lipschitz (but not necessarily globally Lipschitz) and \|K\| is known. Our analysis considers an abstract linesearch procedure satisfying Assumptions 3.1 and 3.3 (for use in Section 4).

3.1 Primal-dual algorithm with abstract linesearch

In Algorithm 1, we present our proposed method with an abstract linesearch procedure, denoted LS, which is assumed to satisfy Assumption 3.1. The main result of this section (Theorem 3.4) shows that the sequences generated by Algorithm 1 converge to a saddle point of (8). An implementation of the abstract linesearch is given in Subroutine 2.

Assumption 3.1 (Abstract linesearch).

Given a set of parameters param containing β>0\beta>0 and δL(0,1)\delta_{L}\in(0,1), and some set 𝒫\mathcal{P}, consider an abstract linesearch LS that takes the input

(param,τinit,τ,u,u,v)𝒫×(0,+)×(0,+)×𝒰×𝒰×𝒱(\texttt{param},\tau_{\text{init}},\tau,u^{-},u,v)\in\mathcal{P}\times(0,+\infty)\times(0,+\infty)\times\mathcal{U}\times\mathcal{U}\times\mathcal{V}

and returns

(τ+,v+)(0,+)×𝒱,(\tau^{+},v^{+})\in(0,+\infty)\times\mathcal{V},

satisfying the following conditions.

  1. (i)

    τ+τinit\tau^{+}\leqslant\tau_{\text{init}}.

  2. (ii)

    \tau^{+}\left(H(v^{+})-H(v)-\langle\nabla H(v),v^{+}-v\rangle\right)\leqslant\dfrac{\delta_{L}}{2\beta}\|v^{+}-v\|^{2}.

The set of parameters param in Assumption 3.1, besides including the constants \beta and \delta_{L} that appear in the descent condition (ii), may also involve other parameters that define internal operations in LS. We will define a specific set of parameters in Section 3.2.

Choose initial points u0𝒰u^{0}\in\mathcal{U}, v1𝒱v^{1}\in\mathcal{V};
Choose a set of parameters param containing β>0\beta>0 and δL(0,1)\delta_{L}\in(0,1);
Choose parameters τ0>0\tau_{0}>0, γ(0,1),\gamma\in(0,1), and δK(0,1)\delta_{K}\in(0,1) such that δK+δL<1\delta_{K}+\delta_{L}<1;
Set θ0=1\theta_{0}=1;
for k=1,2,k=1,2,\dots\> do
      
       \triangleright Step 1 (Dual update). Define
uk=proxτk1G(uk1τk1Kvk).u^{k}=\operatorname{prox}_{\tau_{k-1}G}(u^{k-1}-\tau_{k-1}K^{*}v^{k}).
       \triangleright Step 2 (Backtracking linesearch and primal update). Initialise the linesearch by choosing αk[1,1+γθk1]\alpha_{k}\in[1,\sqrt{1+\gamma\theta_{k-1}}], and define
τk(0)=min{δKβK,τk1αk}.\tau_{k(0)}=\min\left\{\dfrac{\sqrt{\delta_{K}}}{\sqrt{\beta}\|K\|},\tau_{k-1}\alpha_{k}\right\}. (12)
      Compute
τk,vk+1=LS(param,τk(0),τk1,uk1,uk,vk),\tau_{k},v^{k+1}=\texttt{LS}(\texttt{param},\tau_{k(0)},\tau_{k-1},u^{k-1},u^{k},v^{k}),
      where
θk\displaystyle\theta_{k} =τkτk11\displaystyle=\tau_{k}\tau_{k-1}^{-1}
u¯k\displaystyle\overline{u}^{k} =uk+θk(ukuk1)\displaystyle=u^{k}+\theta_{k}(u^{k}-u^{k-1})
vk+1\displaystyle v^{k+1} =proxβτkF(vk+βτk[Ku¯kH(vk)]).\displaystyle=\operatorname{prox}_{\beta\tau_{k}F}\big{(}v^{k}+\beta\tau_{k}[K\overline{u}^{k}-\nabla H(v^{k})]\big{)}.
Algorithm 1 Primal-dual algorithm with linesearch for problem (8).
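The following NumPy sketch (ours, with illustrative parameter names) records the outer loop of Algorithm 1 for a matrix K, keeping LS abstract exactly as in Assumption 3.1: any routine with the stated input/output contract, such as Subroutine 2 below, can be plugged in, provided delta_K + delta_L < 1.

```python
import numpy as np

def algorithm1(prox_G, K, LS, params, u0, v1, tau0,
               beta=1.0, gamma=0.5, delta_K=0.4, iters=1000):
    """Outer loop of Algorithm 1; LS(params, tau_init, tau_prev, u_prev, u, v)
    must return (tau, v_next) satisfying Assumption 3.1."""
    u_prev, v = u0.copy(), v1.copy()
    tau_prev, theta = tau0, 1.0
    K_norm = np.linalg.norm(K, 2)                       # operator norm of K
    for _ in range(iters):
        # Step 1 (dual update)
        u = prox_G(u_prev - tau_prev * (K.T @ v), tau_prev)
        # Step 2: initialisation (12) with alpha_k at its upper bound
        alpha = np.sqrt(1.0 + gamma * theta)
        tau_init = min(np.sqrt(delta_K) / (np.sqrt(beta) * K_norm), tau_prev * alpha)
        tau, v = LS(params, tau_init, tau_prev, u_prev, u, v)
        theta, tau_prev, u_prev = tau / tau_prev, tau, u
    return u_prev, v
```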

In order to prove convergence of Algorithm 1, motivated by [13, Theorem 3.4], we first show that the algorithm satisfies a sufficient descent condition and generates bounded sequences.

Lemma 3.2 (Sufficient decrease and boundedness of iterates).

Suppose that F:𝒱{+}F:\mathcal{V}\to\mathbbm{R}\cup\{+\infty\} and G:𝒰{+}G:\mathcal{U}\to\mathbbm{R}\cup\{+\infty\} are proper lsc convex functions, and H:𝒱H:\mathcal{V}\to\mathbbm{R} is a differentiable convex function with locally Lipschitz continuous gradient. In addition, suppose Assumption 3.1 holds, and let (u^,v^)(\hat{u},\hat{v}) be a saddle point of (8). Then, for the sequences (vk)(v^{k}), (u¯k)(\overline{u}^{k}) and (uk)(u^{k}) generated by Algorithm 1, the following hold:

  1. (i)

    The sequence (φk)k(\varphi_{k})_{k} defined by

    φk:=τk1(1+θk1)P^(u^,v^)(uk1)+12uku^2+12βvkv^2,\varphi_{k}:=\tau_{k-1}(1+\theta_{k-1})\hat{P}_{(\hat{u},\hat{v})}(u^{k-1})+\dfrac{1}{2}\|u^{k}-\hat{u}\|^{2}+\dfrac{1}{2\beta}\|v^{k}-\hat{v}\|^{2}, (13)

    satisfies the sufficient decrease condition

    0φk+1φk12u¯kuk21(δK+δL)2βvk+1vk2τkD^(u^,v^)(vk+1)(1γ)τk1θk1P^(u^,v^)(uk1).0\leqslant\varphi_{k+1}\leqslant\varphi_{k}-\dfrac{1}{2}\|\overline{u}^{k}-u^{k}\|^{2}-\dfrac{1-(\delta_{K}+\delta_{L})}{2\beta}\|v^{k+1}-v^{k}\|^{2}\\ -\tau_{k}\hat{D}_{(\hat{u},\hat{v})}(v^{k+1})-(1-\gamma)\tau_{k-1}\theta_{k-1}\hat{P}_{(\hat{u},\hat{v})}(u^{k-1}). (14)
  2. (ii)

    The sequences (vk)(v^{k}), (u¯k)(\overline{u}^{k}) and (uk)(u^{k}) are bounded, and

  3. (iii)

    (vk+1vk)(\|v^{k+1}-v^{k}\|), (u¯kuk)(\|\overline{u}^{k}-u^{k}\|), (τkθkP^(u^,v^)(uk))(\tau_{k}\theta_{k}\hat{P}_{(\hat{u},\hat{v})}(u^{k})), and (τkD^(u^,v^)(vk+1))(\tau_{k}\hat{D}_{(\hat{u},\hat{v})}(v^{k+1})) converge to 0.

Proof.

In the context of Algorithm 1, Assumption 3.1(ii) implies:

k,τk(H(vk+1)H(vk)H(vk),vk+1vk)δL2βvk+1vk2.\forall k\in\mathbb{N},\quad\tau_{k}\left(H(v^{k+1})-H(v^{k})-\langle\nabla H(v^{k}),v^{k+1}-v^{k}\rangle\right)\leqslant\dfrac{\delta_{L}}{2\beta}\|v^{k+1}-v^{k}\|^{2}. (15)

In view of the update rule of vk+1v^{k+1} in Algorithm 1 and [1, Proposition 12.26], we have

v𝒱,F(vk+1)+1βτk(vkvk+1)+Ku¯k,vvk+1H(vk),vvk+1F(v).\forall v\in\mathcal{V},\>F(v^{k+1})+\left\langle\dfrac{1}{\beta\tau_{k}}(v^{k}-v^{k+1})+K\overline{u}^{k},v-v^{k+1}\right\rangle-\langle\nabla H(v^{k}),v-v^{k+1}\rangle\leqslant F(v). (16)

In view of (12) and Assumption 3.1(i), \tau_{k}^{2}\|K\|^{2}\leqslant\frac{\delta_{K}}{\beta}, and thus \tau_{k}^{2}\|K^{*}v^{k+1}-K^{*}v^{k}\|^{2}\leqslant\frac{\delta_{K}}{\beta}\|v^{k+1}-v^{k}\|^{2}. This estimate, together with condition (15) and the convexity of H, implies

H(vk+1)\displaystyle H(v^{k+1}) H(vk)+H(vk),vk+1vk+δL2βτkvk+1vk2\displaystyle\leqslant H(v^{k})+\langle\nabla H(v^{k}),v^{k+1}-v^{k}\rangle+\dfrac{\delta_{L}}{2\beta\tau_{k}}\|v^{k+1}-v^{k}\|^{2} (17)
H(vk)+H(vk),vvk+H(vk),vk+1v+δL2βτkvk+1vk2\displaystyle\leqslant H(v^{k})+\langle\nabla H(v^{k}),v-v^{k}\rangle+\langle\nabla H(v^{k}),v^{k+1}-v\rangle+\dfrac{\delta_{L}}{2\beta\tau_{k}}\|v^{k+1}-v^{k}\|^{2}
+δK2βτkvk+1vk2τk2Kvk+1Kvk2\displaystyle\quad+\dfrac{\delta_{K}}{2\beta\tau_{k}}\|v^{k+1}-v^{k}\|^{2}-\frac{\tau_{k}}{2}\|K^{*}v^{k+1}-K^{*}v^{k}\|^{2}
H(v)+H(vk),vk+1v+δK+δL2βτkvk+1vk2τk2Kvk+1Kvk2.\displaystyle\leqslant H(v)+\langle\nabla H(v^{k}),v^{k+1}-v\rangle+\dfrac{\delta_{K}+\delta_{L}}{2\beta\tau_{k}}\|v^{k+1}-v^{k}\|^{2}-\frac{\tau_{k}}{2}\|K^{*}v^{k+1}-K^{*}v^{k}\|^{2}.

Thus, combining (16) and (17) yields

τk((F+H)(vk+1)(F+H)(v))1β(vk+1vk)τkKu¯k,vvk+1+δK+δL2βvk+1vk2τk22Kvk+1Kvk2\tau_{k}\bigl{(}(F+H)(v^{k+1})-(F+H)(v)\bigr{)}\\ \leqslant\left\langle\dfrac{1}{\beta}(v^{k+1}-v^{k})-\tau_{k}K\overline{u}^{k},v-v^{k+1}\right\rangle+\dfrac{\delta_{K}+\delta_{L}}{2\beta}\|v^{k+1}-v^{k}\|^{2}-\frac{\tau_{k}^{2}}{2}\|K^{*}v^{k+1}-K^{*}v^{k}\|^{2} (18)

for all v𝒱v\in\mathcal{V}. With (18) established, the remainder of the proof closely follows [13, Theorem 3.4]. From the update rule for uk+1u^{k+1} in Algorithm 1 and [1, Proposition 12.26], we have

u𝒰,τk(G(uk+1)G(u))uk+1uk+τkKvk+1,uuk+1.\forall u\in\mathcal{U},\>\tau_{k}(G(u^{k+1})-G(u))\leqslant\langle u^{k+1}-u^{k}+\tau_{k}K^{*}v^{k+1},u-u^{k+1}\rangle. (19)

Using (19), the identity τk=θkτk1\tau_{k}=\theta_{k}\tau_{k-1} and the definition of u¯k\bar{u}^{k}, we deduce that

τk((1+θk)G(uk)θkG(uk1)G(u))\displaystyle\tau_{k}\bigr{(}(1+\theta_{k})G(u^{k})-\theta_{k}G(u^{k-1})-G(u)\big{)} (20)
=θkτk1(G(uk)G(uk+1))+θk2τk1(G(uk)G(uk1))+τk(G(uk+1)G(u))\displaystyle=\theta_{k}\tau_{k-1}\bigr{(}G(u^{k})-G(u^{k+1})\bigr{)}+\theta_{k}^{2}\tau_{k-1}\bigl{(}G(u^{k})-G(u^{k-1})\bigr{)}+\tau_{k}\bigl{(}G(u^{k+1})-G(u)\bigr{)}
=θkukuk1+τk1Kvk,uk+1uk+θk2ukuk1+τk1Kvk,uk1uk\displaystyle=\theta_{k}\langle u^{k}-u^{k-1}+\tau_{k-1}K^{*}v^{k},u^{k+1}-u^{k}\rangle+\theta_{k}^{2}\langle u^{k}-u^{k-1}+\tau_{k-1}K^{*}v^{k},u^{k-1}-u^{k}\rangle
+uk+1uk+τkKvk+1,uuk+1\displaystyle\qquad+\langle u^{k+1}-u^{k}+\tau_{k}K^{*}v^{k+1},u-u^{k+1}\rangle
=u¯kuk+τkKvk,uk+1u¯k+uk+1uk+τkKvk+1,uuk+1.\displaystyle=\langle\bar{u}^{k}-u^{k}+\tau_{k}K^{*}v^{k},u^{k+1}-\bar{u}^{k}\rangle+\langle u^{k+1}-u^{k}+\tau_{k}K^{*}v^{k+1},u-u^{k+1}\rangle.

Adding (18) and (20) yields

τk((F+H)(vk+1)(F+H)(v)+(1+θk)G(uk)θkG(uk1)G(u))uk+1uk,uuk+1+1βvk+1vk,vvk+1+u¯kuk,uk+1u¯k+δK+δL2βvk+1vk2τk22Kvk+1Kvk2+τk(Kvk+1,uuk+1Ku¯k,v^vk+1+Kvk,uk+1u¯k),\tau_{k}\big{(}(F+H)(v^{k+1})-(F+H)(v)+(1+\theta_{k})G(u^{k})-\theta_{k}G(u^{k-1})-G(u)\big{)}\\ \leqslant\langle u^{k+1}-u^{k},u-u^{k+1}\rangle+\dfrac{1}{\beta}\left\langle v^{k+1}-v^{k},v-v^{k+1}\right\rangle+\langle\overline{u}^{k}-u^{k},u^{k+1}-\overline{u}^{k}\rangle+\dfrac{\delta_{K}+\delta_{L}}{2\beta}\|v^{k+1}-v^{k}\|^{2}\\ -\frac{\tau_{k}^{2}}{2}\|K^{*}v^{k+1}-K^{*}v^{k}\|^{2}+\tau_{k}\left(\langle K^{*}v^{k+1},u-u^{k+1}\rangle-\left\langle K\overline{u}^{k},\hat{v}-v^{k+1}\right\rangle+\langle K^{*}v^{k},u^{k+1}-\overline{u}^{k}\rangle\right), (21)

for all (u,v)𝒰×𝒱(u,v)\in\mathcal{U}\times\mathcal{V}. Using properties of the adjoint, the inner product terms on last line of (21) can be written as

τkKvk,uk+1u¯k+τkKvk+1,uuk+1τkKu¯k,vvk+1\displaystyle\tau_{k}\langle K^{*}v^{k},u^{k+1}-\overline{u}^{k}\rangle+\tau_{k}\langle K^{*}v^{k+1},u-u^{k+1}\rangle-\tau_{k}\left\langle K\overline{u}^{k},v-v^{k+1}\right\rangle (22)
=τkKvk+1Kvk,u¯kuk+1+τkKvk+1,uu¯kτkKu¯k,vvk+1\displaystyle=\tau_{k}\langle K^{*}v^{k+1}-K^{*}v^{k},\overline{u}^{k}-u^{k+1}\rangle+\tau_{k}\langle K^{*}v^{k+1},u-\bar{u}^{k}\rangle-\tau_{k}\left\langle K\overline{u}^{k},v-v^{k+1}\right\rangle
=τkKvk+1Kvk,u¯kuk+1+τkvk+1v,KuτkKu¯kKu,v\displaystyle=\tau_{k}\langle K^{*}v^{k+1}-K^{*}v^{k},\overline{u}^{k}-u^{k+1}\rangle+\tau_{k}\langle v^{k+1}-v,Ku\rangle-\tau_{k}\left\langle K\overline{u}^{k}-Ku,v\right\rangle
=τkKvk+1Kvk,u¯kuk+1+τkvk+1v,Kuτk(1+θk)uku,Kv\displaystyle=\tau_{k}\langle K^{*}v^{k+1}-K^{*}v^{k},\overline{u}^{k}-u^{k+1}\rangle+\tau_{k}\langle v^{k+1}-v,Ku\rangle-\tau_{k}(1+\theta_{k})\left\langle u_{k}-u,K^{*}v\right\rangle
+τkθkuk1u,Kv^.\displaystyle\qquad+\tau_{k}\theta_{k}\left\langle u_{k-1}-u,K^{*}\hat{v}\right\rangle.

Let (u^,v^)(\hat{u},\hat{v}) be a saddle point of problem (8), and take (u,v)=(u^,v^)(u,v)=(\hat{u},\hat{v}). Substituting (22) into (21), noting the definitions for P^(u^,v^)\hat{P}_{(\hat{u},\hat{v})} and D^(u^,v^)\hat{D}_{(\hat{u},\hat{v})} from (11), we obtain

τk(D^(u^,v^)(vk+1)θkP^(u^,v^)(uk1)+(1+θk)P^(u^,v^)(uk))uk+1uk,u^uk+1+1βvk+1vk,v^vk+1+u¯kuk,uk+1u¯k+δK+δL2βvk+1vk2+τkKvk+1Kvk,u¯kuk+1τk22Kvk+1Kvk2\tau_{k}\big{(}\hat{D}_{(\hat{u},\hat{v})}(v^{k+1})-\theta_{k}\hat{P}_{(\hat{u},\hat{v})}(u^{k-1})+(1+\theta_{k})\hat{P}_{(\hat{u},\hat{v})}(u^{k})\big{)}\\ \leqslant\langle u^{k+1}-u^{k},\hat{u}-u^{k+1}\rangle+\dfrac{1}{\beta}\left\langle v^{k+1}-v^{k},\hat{v}-v^{k+1}\right\rangle+\langle\overline{u}^{k}-u^{k},u^{k+1}-\overline{u}^{k}\rangle\\ \quad+\dfrac{\delta_{K}+\delta_{L}}{2\beta}\|v^{k+1}-v^{k}\|^{2}+\tau_{k}\langle K^{*}v^{k+1}-K^{*}v^{k},\overline{u}^{k}-u^{k+1}\rangle-\frac{\tau_{k}^{2}}{2}\|K^{*}v^{k+1}-K^{*}v^{k}\|^{2}

Using a,b12a2+12b2\langle a,b\rangle\leqslant\frac{1}{2}\|a\|^{2}+\frac{1}{2}\|b\|^{2} in the last line and the law of cosines in the second line yields

τk(D^(u^,v^)(vk+1)θkP^(u^,v^)(uk1)+(1+θk)P^(u^,v^)(uk))12(uku^2uk+1uk2uk+1u^2)+12β(vkv^2vk+1vk2vk+1v^2)+12(uk+1uk2u¯kuk2uk+1u¯k2)+δK+δL2βvk+1vk2+12u¯kuk+12=12(uku^2uk+1u^2)+12β(vkv^2vk+1v^2)12u¯kuk21(δK+δL)2βvk+1vk2.\tau_{k}\big{(}\hat{D}_{(\hat{u},\hat{v})}(v^{k+1})-\theta_{k}\hat{P}_{(\hat{u},\hat{v})}(u^{k-1})+(1+\theta_{k})\hat{P}_{(\hat{u},\hat{v})}(u^{k})\big{)}\\ \leqslant\dfrac{1}{2}\left(\|u^{k}-\hat{u}\|^{2}-\|u^{k+1}-u^{k}\|^{2}-\|u^{k+1}-\hat{u}\|^{2}\right)+\dfrac{1}{2\beta}\left(\|v^{k}-\hat{v}\|^{2}-\|v^{k+1}-v^{k}\|^{2}-\|v^{k+1}-\hat{v}\|^{2}\right)\\ \quad+\dfrac{1}{2}\left(\|u^{k+1}-u^{k}\|^{2}-\|\overline{u}^{k}-u^{k}\|^{2}-\|u^{k+1}-\overline{u}^{k}\|^{2}\right)+\dfrac{{\delta_{K}+\delta_{L}}}{{2\beta}}\|v^{k+1}-v^{k}\|^{2}+\dfrac{1}{2}\|\overline{u}^{k}-u^{k+1}\|^{2}\\ =\dfrac{1}{2}\left(\|u^{k}-\hat{u}\|^{2}-\|u^{k+1}-\hat{u}\|^{2}\right)+\dfrac{1}{2\beta}\left(\|v^{k}-\hat{v}\|^{2}-\|v^{k+1}-\hat{v}\|^{2}\right)\\ -\dfrac{1}{2}\|\overline{u}^{k}-u^{k}\|^{2}-\dfrac{1-(\delta_{K}+\delta_{L})}{{2\beta}}\|v^{k+1}-v^{k}\|^{2}.

From (12) and Assumption 3.1(i), τkτk1αk\tau_{k}\leqslant\tau_{k-1}\alpha_{k}, and thus θkτkτk1(1+γθk1)\theta_{k}\tau_{k}\leqslant\tau_{k-1}(1+\gamma\theta_{k-1}). Then, (φk)k(\varphi_{k})_{k} satisfies the sufficient decrease condition (14). Hence, (φk)k(\varphi_{k})_{k} converges to infkφk0\inf_{k}\varphi_{k}\geqslant 0. The sufficient decrease condition implies

u¯kuk2,vk+1vk2,τkD^(u^,v^)(vk+1),τk1θk1P^(u^,v^)(uk1)0 as k+.\|\overline{u}^{k}-u^{k}\|^{2},\>\|v^{k+1}-v^{k}\|^{2},\>\tau_{k}\hat{D}_{(\hat{u},\hat{v})}(v^{k+1}),\>\tau_{k-1}\theta_{k-1}\hat{P}_{(\hat{u},\hat{v})}(u^{k-1})\to 0\text{ as }k\to+\infty.

Furthermore, since P^(u^,v^)(uk1)0\hat{P}_{(\hat{u},\hat{v})}(u^{k-1})\geqslant 0 and (φk)k(\varphi_{k})_{k} converges, the sequences (uk)(u^{k}) and (vk)(v^{k}) are bounded, which in turn implies that (u¯k)(\overline{u}^{k}) is also bounded. ∎

Global convergence will follow from the previous results, as long as the sequence of stepsizes (τk)k(\tau_{k})_{k} does not vanish. This fact determines the next assumption of the abstract linesearch.

Assumption 3.3.

For Algorithm 1, suppose the sequence (τk)k(\tau_{k})_{k} generated by LS is separated from 0 whenever the sequence (uk,vk)k(u^{k},v^{k})_{k} is bounded.

Note that, as a consequence of (12) and Assumptions 3.1(i) and 3.3, since θk=τkτk11\theta_{k}=\tau_{k}\tau_{k-1}^{-1}, the sequence (θk)k(\theta_{k})_{k} is bounded and separated from zero.

Theorem 3.4.

Suppose that F:𝒱{+}F:\mathcal{V}\to\mathbbm{R}\cup\{+\infty\} and G:𝒰{+}G:\mathcal{U}\to\mathbbm{R}\cup\{+\infty\} are proper lsc convex functions, and H:𝒱H:\mathcal{V}\to\mathbbm{R} is a differentiable convex function with locally Lipschitz continuous gradient. Suppose, in addition, Assumptions 3.1 and 3.3 hold, and there exists a saddle point of (8). Then, the sequence (uk,vk)k(u^{k},v^{k})_{k} generated by Algorithm 1 converges to a saddle point (u^,v^)(\hat{u},\hat{v}) of problem (8), and (𝒢^(u^,v^)(uk,vk))k\left(\hat{\mathcal{G}}_{(\hat{u},\hat{v})}(u^{k},v^{k})\right)_{k} converges to 0.

Proof.

We first prove that any cluster point of (uk,vk)k(u^{k},v^{k})_{k} is a saddle point. Let (u^,v^)(\hat{u},\hat{v}) be a cluster point of (uk,vk)k(u^{k},v^{k})_{k} which is bounded due to Lemma 3.2(ii). Then there exists a subsequence such that (uk,vk)(u^,v^)(u^{\ell_{k}},v^{\ell_{k}})\to(\hat{u},\hat{v}) as k+k\to+\infty. From (18) and (19), we have

(F+H)(vk+1)(F+H)(v)1βτk(vk+1vk)Ku¯k,vvk+1+δLβτkvk+1vk2G(uk+1)G(u)1τk(uk+1uk)+Kvk+1,uuk+1,\begin{array}[]{rcl}(F+H)(v^{\ell_{k}+1})-(F+H)(v)&\leqslant&\left\langle\dfrac{1}{\beta\tau_{\ell_{k}}}(v^{\ell_{k}+1}-v^{\ell_{k}})-K\overline{u}^{\ell_{k}},v-v^{\ell_{k}+1}\right\rangle+\dfrac{\delta_{L}}{\beta\tau_{\ell_{k}}}\|v^{\ell_{k}+1}-v^{\ell_{k}}\|^{2}\\ G(u^{\ell_{k}+1})-G(u)&\leqslant&\left\langle\dfrac{1}{\tau_{\ell_{k}}}(u^{\ell_{k}+1}-u^{\ell_{k}})+K^{*}v^{\ell_{k}+1},u-u^{\ell_{k}+1}\right\rangle,\end{array}

for all (u,v)𝒰×𝒱(u,v)\in\mathcal{U}\times\mathcal{V}. In view of Assumptions 3.1(i) and 3.3 and Lemma 3.2(iii), taking the limit as k+k\to+\infty, yields 1βτk(vk+1vk)0\dfrac{1}{\beta\tau_{\ell_{k}}}(v^{\ell_{k}+1}-v^{\ell_{k}})\to 0 and 1τk(uk+1uk)0\dfrac{1}{\tau_{\ell_{k}}}(u^{\ell_{k}+1}-u^{\ell_{k}})\to 0, and thus P^(u^,v^)(u),D^(u^,v^)(v)0\hat{P}_{(\hat{u},\hat{v})}(u),\hat{D}_{(\hat{u},\hat{v})}(v)\geqslant 0 for all (u,v)𝒰×𝒱(u,v)\in\mathcal{U}\times\mathcal{V}. That (u^,v^)(\hat{u},\hat{v}) is a saddle point for problem (8) follows from (10). Furthermore, substituting k\ell_{k} with \ell in the above set of inequalities and (u,v)=(u^,v^)(u,v)=(\hat{u},\hat{v}), taking the limit as +\ell\to+\infty, since both (τk)k(\tau_{k})_{k} and (θk)k(\theta_{k})_{k} are separated from zero, Lemma 3.2(iii) implies D^(u^,v^)(vk)0\hat{D}_{(\hat{u},\hat{v})}(v^{k})\to 0 and P^(u^,v^)(uk)0\hat{P}_{(\hat{u},\hat{v})}(u^{k})\to 0, so that 𝒢^(u^,v^)(uk,vk)0\hat{\mathcal{G}}_{(\hat{u},\hat{v})}(u^{k},v^{k})\to 0 follows from (11). Finally, we prove that the whole sequence (uk,vk)k(u^{k},v^{k})_{k} converges to (u^,v^)(\hat{u},\hat{v}). Since (φk)k(\varphi_{k})_{k} defined in (13) (monotonically) converges to some φ~0\tilde{\varphi}\geqslant 0, then

12uku^2+12βvkv^2φ~ as k+.\dfrac{1}{2}\|u^{k}-\hat{u}\|^{2}+\dfrac{1}{2\beta}\|v^{k}-\hat{v}\|^{2}\to\tilde{\varphi}\text{ as }k\to+\infty.

Taking the limit along the subsequence (uk,vk)k(u^{\ell_{k}},v^{\ell_{k}})_{k} as k+k\to+\infty in the previous equation, yields φ~=0\tilde{\varphi}=0. Thus uku^u^{k}\to\hat{u} and vkv^v^{k}\to\hat{v} as k+k\to+\infty. ∎

Remark 3.5.

Some comments regarding the proof of the result above are in order.

  1. (i)

    By adapting the argument in [13, Theorem 3.5], it is possible to derive the following ergodic rate for the primal-dual gap in the setting of Theorem 3.4:

    𝒢^(u^,v^)(UK,VK)1sK(τ1θ1P^(u^,v^)(u0)+12u1u^2+12βv1v^2),\hat{\mathcal{G}}_{(\hat{u},\hat{v})}(U^{K},V^{K})\leqslant\frac{1}{s_{K}}\left(\tau_{1}\theta_{1}\hat{P}_{(\hat{u},\hat{v})}(u^{0})+\frac{1}{2}\|u^{1}-\hat{u}\|^{2}+\frac{1}{2\beta}\|v^{1}-\hat{v}\|^{2}\right),

    where s_{K}=\sum_{k=1}^{K}\tau_{k}, U^{K}=\dfrac{\tau_{1}\theta_{0}u^{0}+\sum_{k=1}^{K}\tau_{k}\bar{u}^{k}}{\tau_{1}\theta_{1}+s_{K}} and V^{K}=\dfrac{\sum_{k=1}^{K}\tau_{k}v^{k}}{s_{K}}.

  2. (ii)

    The introduction of the parameter γ(0,1)\gamma\in(0,1) in Step 2 of Algorithm 1 is crucial to guarantee global convergence of the sequence. Otherwise, if γ=1\gamma=1, we would only be able to prove subsequential convergence to saddle points. Indeed, setting γ=1\gamma=1 in (14) would prevent us from concluding that (τkθkP^(u^,v^)(uk))k(\tau_{k}\theta_{k}\hat{P}_{(\hat{u},\hat{v})}(u^{k}))_{k} converges to 0, and thus we would only be able to guarantee that (12uku^2+12βvkv^2)k\left(\dfrac{1}{2}\|u^{k}-\hat{u}\|^{2}+\dfrac{1}{2\beta}\|v^{k}-\hat{v}\|^{2}\right)_{k} converges (but not necessarily to 0). In this case, an alternative is to assume, in addition, that GG is continuous on its domain as in [13, Theorem 3.4].

3.2 A linesearch procedure for primal-dual splitting methods

In this section, we propose a natural implementation in Subroutine 2 of LS to satisfy Assumptions 3.1 and 3.3.

Input: param ={β>0,δL(0,1),ρ(0,1)}=\{\beta>0,\delta_{L}\in(0,1),\rho\in(0,1)\}, τk(0)\tau_{k(0)}, τk1\tau_{k-1}, uk1u^{k-1}, uku^{k}, vkv^{k}.
for j=0,1,j=0,1,\dots do
       \triangleright Step 3 (Primal-dual trial points). Set θk(j)=τk(j)τk11\theta_{k(j)}=\tau_{k(j)}\tau_{k-1}^{-1}, and
u¯k(j)\displaystyle\overline{u}^{k(j)} =uk+θk(j)(ukuk1)\displaystyle=u^{k}+\theta_{k(j)}(u^{k}-u^{k-1})
vk(j)+1\displaystyle v^{k(j)+1} =proxβτk(j)F(vk+βτk(j)[Ku¯k(j)H(vk)])\displaystyle=\operatorname{prox}_{\beta\tau_{k(j)}F}\big{(}v^{k}+\beta\tau_{k(j)}[K\overline{u}^{k(j)}-\nabla H(v^{k})]\big{)}
       \triangleright Step 4 (Backtracking linesearch test). If
τk(j)(H(vk(j)+1)H(vk)H(vk),vk(j)+1vk)>δL2βvk(j)+1vk2,\tau_{k(j)}\left(H(v^{k(j)+1})-H(v^{k})-\langle\nabla H(v^{k}),v^{k(j)+1}-v^{k}\rangle\right)>\dfrac{\delta_{L}}{2\beta}\|v^{k(j)+1}-v^{k}\|^{2}, (23)
set \tau_{k(j+1)}=\rho\tau_{k(j)}, update j\leftarrow j+1, and go back to Step 3. Otherwise, set \tau_{k}=\tau_{k(j)}, v^{k+1}=v^{k(j)+1}, and exit the loop.
Output: τk,vk+1\tau_{k},v^{k+1}.
Subroutine 2 Linesearch procedure LS for Algorithm 1.
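A compact NumPy transcription of Subroutine 2 is sketched below; params is a dictionary carrying both the parameters of the paper and the oracles the routine needs (the dictionary keys are illustrative). Note that \nabla H(v^{k}) is evaluated only once per call, and Lemma 3.7 below guarantees that the loop terminates.

```python
import numpy as np

def backtracking_ls(params, tau_init, tau_prev, u_prev, u, v):
    """Sketch of Subroutine 2. params carries beta, delta_L, rho together with
    the oracles H, grad_H, prox_F and the matrix K (illustrative keys)."""
    beta, delta_L, rho = params["beta"], params["delta_L"], params["rho"]
    K, H, grad_H, prox_F = params["K"], params["H"], params["grad_H"], params["prox_F"]
    grad_v = grad_H(v)                    # evaluated once per call
    tau = tau_init
    while True:
        theta = tau / tau_prev
        u_bar = u + theta * (u - u_prev)
        v_trial = prox_F(v + beta * tau * (K @ u_bar - grad_v), beta * tau)
        lhs = tau * (H(v_trial) - H(v) - float(np.vdot(grad_v, v_trial - v)))
        rhs = (delta_L / (2.0 * beta)) * np.linalg.norm(v_trial - v) ** 2
        if lhs <= rhs:                    # test (23) passes: accept the trial pair
            return tau, v_trial
        tau *= rho                        # otherwise shrink the stepsize and retry
```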
Remark 3.6.

Some comments regarding the parameters in Subroutine 2 are in order.

  1. (i)

    The shrinking parameter \rho\in(0,1) is used in the backtracking linesearch to find a stepsize of adequate length to satisfy (15), while \beta>0 is used to balance the primal and dual stepsizes. The parameter sequence (\alpha_{k})_{k} and the parameter \gamma\in(0,1) give flexibility to the stepsizes. The initialisation in (12) ensures that the stepsize sequence remains bounded, while allowing it to increase when larger stepsizes are admissible.

  2. (ii)

    Suppose that, in (12), αk=1\alpha_{k}=1 for all k1k\geqslant 1, and that the linesearch LS loop terminates in a finite number of steps in each iteration. Then, the sequence of stepsizes (τk)k(\tau_{k})_{k} is nonincreasing and bounded above. In this case, it suffices to initialise the outer loop with τ0=δKβK\tau_{0}=\frac{\sqrt{\delta_{K}}}{\sqrt{\beta}\|K\|}, and set τk(0)=τk1\tau_{k(0)}=\tau_{k-1} for condition (12) to always hold. Furthermore, the stepsize sequence is also eventually constant, and hence our proposed method eventually reduces to [7, Algorithm 3.2]. To see this, suppose, to the contrary, that the stepsize sequence decreases an infinite number of times. By construction, each decrease must be at least by a factor of ρ(0,1)\rho\in(0,1). This implies that (τk)k(\tau_{k})_{k} converges to 0, a contradiction.

  3. (iii)

    Let us assume that, after finitely many iterations of Algorithm 1, the stepsizes \tau_{k} become constant and equal to \tau. In view of (12), it holds that \beta\tau^{2}\|K\|^{2}\leqslant\delta_{K}. Furthermore, the linesearch condition (23) suggests that \frac{\delta_{L}}{\beta\tau} plays the role of an estimated Lipschitz constant. Therefore, for \sigma=\beta\tau, the condition on the stepsize of the Condat–Vũ method reads

    δLβτσ2+βτ2K2δL2+δK<1δL2.\frac{\frac{\delta_{L}}{\beta\tau}\sigma}{2}+\beta\tau^{2}\|K\|^{2}\leqslant\frac{\delta_{L}}{2}+\delta_{K}<1-\frac{\delta_{L}}{2}.

    Hence, if the linesearch eventually settles on a constant stepsize, then that stepsize satisfies the bounds required in [7, 25], as the right-hand side in the above estimate is strictly smaller than 1. Therefore, we retrieve convergence to saddle points in this special case.

If \nabla H is globally Lipschitz continuous with constant L>0, then by Lemma 2.1 the inequality (15) holds whenever \tau_{k}\leqslant\frac{\delta_{L}}{\beta L}, since in that case \tau_{k}\left(H(v^{k+1})-H(v^{k})-\langle\nabla H(v^{k}),v^{k+1}-v^{k}\rangle\right)\leqslant\frac{\tau_{k}L}{2}\|v^{k+1}-v^{k}\|^{2}\leqslant\frac{\delta_{L}}{2\beta}\|v^{k+1}-v^{k}\|^{2}. This threshold is reached after finitely many steps of the linesearch procedure, since (\tau_{k(j)})_{j\in\mathbbm{N}} is a decreasing sequence, and hence the linesearch is well-defined for globally Lipschitz continuous gradients. The analogous result when \nabla H is only locally Lipschitz continuous is presented in the following lemma.

Lemma 3.7.

Suppose that F:𝒱{+}F:\mathcal{V}\to\mathbbm{R}\cup\{+\infty\} and G:𝒰{+}G:\mathcal{U}\to\mathbbm{R}\cup\{+\infty\} are proper lsc convex functions, and H:𝒱H:\mathcal{V}\to\mathbbm{R} is a differentiable convex function with locally Lipschitz continuous gradient. Then, the following hold:

  1. (i)

    For each k1k\geqslant 1, there exists ξk>0\xi_{k}>0 such that

    t(0,ξk],t(H(vk+1(t))H(vk)H(vk),vk+1(t)vk)δL2βvk+1(t)vk2,\forall t\in(0,\xi_{k}],\quad t\left(H(v^{k+1}(t))-H(v^{k})-\langle\nabla H(v^{k}),v^{k+1}(t)-v^{k}\rangle\right)\leqslant\dfrac{\delta_{L}}{2\beta}\|v^{k+1}(t)-v^{k}\|^{2}, (24)

    where u¯k(t)=uk+tτk11(ukuk1)\overline{u}^{k}(t)=u^{k}+t\tau_{k-1}^{-1}(u^{k}-u^{k-1}) and vk+1(t)=proxβtF(vk+βt[Ku¯k(t)H(vk)])v^{k+1}(t)=\operatorname{prox}_{\beta tF}(v^{k}+\beta t[K\overline{u}^{k}(t)-\nabla H(v^{k})]).

  2. (ii)

    The linesearch in Subroutine 2 terminates in finitely many iterations with τkτk(0)\tau_{k}\leqslant\tau_{k(0)}.

  3. (iii)

    The sequence (τk)k(\tau_{k})_{k} defined in Algorithm 1 is bounded and separated from 0.

In particular, the implementation of LS in Subroutine 2 is well-defined and satisfies Assumptions 3.1 and 3.3.

Proof.

In order to prove (i), fix k\geqslant 1. In view of Lemma 2.2, there exists \eta_{k}>0 such that, for all t\in(0,\eta_{k}], we have v^{k+1}(t)\in C_{k}:=\operatorname{proj}_{\overline{\operatorname{dom}(\partial F)}}(v^{k})+B(0,1), where B(0,1) denotes the open ball centred at 0 with radius 1. Moreover, as \nabla H is locally Lipschitz continuous and \mathcal{V} is finite dimensional, \nabla H is Lipschitz continuous on the open bounded convex set C_{k}. Thus, according to Lemma 2.1, there exists L_{k}>0 such that

vCk,(H(v)H(vk)H(vk),vvk)Lk2vvk2.\forall v\in C_{k},\quad\left(H(v)-H(v^{k})-\langle\nabla H(v^{k}),v-v^{k}\rangle\right)\leqslant\frac{L_{k}}{2}\|v-v^{k}\|^{2}.

Taking \xi_{k}=\min\{\eta_{k},\frac{\delta_{L}}{\beta L_{k}}\} yields (24). Since \rho\in(0,1), there exists j_{k}\geqslant 1 such that \tau_{k(0)}\rho^{j}\leqslant\xi_{k} for all j\geqslant j_{k}. This shows that the linesearch terminates after at most j_{k} iterations. Furthermore, since \tau_{k}=\tau_{k(0)}\rho^{j} for some 0\leqslant j\leqslant j_{k}, we have \tau_{k}\leqslant\tau_{k(0)}, and so Assumption 3.1(i) holds with \tau_{\text{init}}=\tau_{k(0)}, proving (ii). Taking t=\tau_{k} in (24) shows that Assumption 3.1(ii) also holds. Next, as a consequence of (12), (\tau_{k})_{k} is bounded above by \sqrt{\delta_{K}}/(\sqrt{\beta}\|K\|). Suppose, by way of contradiction, that (\tau_{k})_{k} is not separated from zero. Then there exists a decreasing subsequence (\tau_{k_{\ell}})_{\ell} with \tau_{k_{\ell}}\to 0 as \ell\to+\infty. Set \hat{\tau}_{k_{\ell}}=\rho^{-1}\tau_{k_{\ell}} and define

\hat{u}^{k_{\ell}}=u^{k_{\ell}}+\hat{\tau}_{k_{\ell}}\tau_{k_{\ell}-1}^{-1}(u^{k_{\ell}}-u^{k_{\ell}-1}),\quad\hat{v}^{k_{\ell}+1}=\operatorname{prox}_{\beta\hat{\tau}_{k_{\ell}}F}\big{(}v^{k_{\ell}}+\beta\hat{\tau}_{k_{\ell}}[K\hat{u}^{{k_{\ell}}}-\nabla H(v^{k_{\ell}})]\big{)}.

Due to the linesearch procedure in Subroutine 2, we obtain

δL2βv^k+1vk2<τ^k(H(v^k+1)H(vk)H(vk),v^k+1vk).\dfrac{\delta_{L}}{2\beta}\|\hat{v}^{{k_{\ell}}+1}-v^{k_{\ell}}\|^{2}<\hat{\tau}_{k_{\ell}}\left(H(\hat{v}^{{k_{\ell}}+1})-H(v^{k_{\ell}})-\langle\nabla H(v^{k_{\ell}}),\hat{v}^{{k_{\ell}}+1}-v^{k_{\ell}}\rangle\right). (25)

By Lemma 3.2, the sequences (u^{k})_{k} and (v^{k})_{k} are bounded. Thus, appealing to Lemma 2.2, (\hat{v}^{k_{\ell}+1})_{\ell} is bounded as the continuous image of a bounded sequence. Let C\subseteq\mathcal{V} denote any bounded open convex set containing (v^{k_{\ell}})_{\ell}\cup(\hat{v}^{k_{\ell}+1})_{\ell}. By applying Lemma 2.1, there exists L_{C}>0 such that

H(v^k+1)H(vk)H(vk),v^k+1vkLC2v^k+1vk2.H(\hat{v}^{k_{\ell}+1})-H(v^{k_{\ell}})-\langle\nabla H(v^{k_{\ell}}),\hat{v}^{k_{\ell}+1}-v^{k_{\ell}}\rangle\leqslant\dfrac{L_{C}}{2}\|\hat{v}^{k_{\ell}+1}-v^{k_{\ell}}\|^{2}. (26)

Combining (25) and (26) yields \hat{\tau}_{k_{\ell}}>\delta_{L}/(\beta L_{C}), which contradicts \hat{\tau}_{k_{\ell}}\to 0, and hence we conclude that (\tau_{k})_{k} is separated from zero. It follows that \theta_{k}=\tau_{k}/\tau_{k-1} is also bounded above and separated from zero, showing that (iii) holds, from which Assumption 3.3 follows. ∎

Corollary 3.8 (Primal-dual convergence).

Suppose that F:\mathcal{V}\to\mathbbm{R}\cup\{+\infty\} and G:\mathcal{U}\to\mathbbm{R}\cup\{+\infty\} are proper lsc convex functions, and H:\mathcal{V}\to\mathbbm{R} is a differentiable convex function with locally Lipschitz continuous gradient. Then, the sequence (u^{k},v^{k})_{k} generated by Algorithm 1, with LS implemented as in Subroutine 2, converges to a saddle point (\hat{u},\hat{v}) of problem (8), and \left(\hat{\mathcal{G}}_{(\hat{u},\hat{v})}(u^{k},v^{k})\right)_{k} converges to 0.

Proof.

In view of Lemma 3.7, Assumption 3.1 holds, and thus from Lemma 3.2(ii), the generated sequences are bounded. Furthermore, from Lemma 3.7, Assumption 3.3 holds as well. Therefore, the result follows from Theorem 3.4. ∎

Remark 3.9.

In the following, we comment on the above convergence result.

  1. (i)

    If the operator norm of KK is not available for use in (12), then it is instead possible to replace (12) with τk(0)=τk1αk\tau_{k(0)}=\tau_{k-1}\alpha_{k} and modify the linesearch conditions in (23) with

    τk(j)22Kvk(j)+1Kvk2+τk(j)(H(vk(j)+1)H(vk)H(vk),vk(j)+1vk)>δK+δL2βvk(j)+1vk2.\frac{\tau_{k(j)}^{2}}{2}\|K^{*}v^{k(j)+1}-K^{*}v^{k}\|^{2}+\tau_{k(j)}\left(H(v^{k(j)+1})-H(v^{k})-\langle\nabla H(v^{k}),v^{k(j)+1}-v^{k}\rangle\right)\\ >\dfrac{\delta_{K}+\delta_{L}}{2\beta}\|v^{k(j)+1}-v^{k}\|^{2}. (27)

    This corresponds to the linesearch proposed and analysed in [13, Algorithm 4] for globally Lipschitz \nabla H. In the locally Lipschitz case, it is also possible to prove analogues of Lemma 3.7(i)–(ii) and of Lemma 3.2. Regarding Lemma 3.7(iii), since the initialisation (12) is modified, we can only guarantee that (\tau_{k})_{k} is separated from zero, which is enough to argue convergence to saddle points as in [13, Section 4].

  2. (ii)

    In the presence of additional structure in HH, other variants of Subroutine 2 can be used. This will be discussed further in Section 4.

In this section, we defined a concrete linesearch procedure for Algorithm 1 that leads to convergence to saddle points of the min-max convex-concave problem (8), assuming that the operator norm \|K\| is available. In the next section, we apply these ideas to obtain implementable distributed algorithms for multi-agent minimisation problems.

4 A distributed proximal-gradient algorithm with linesearch

In this section, we propose a distributed first-order method with linesearch based upon Algorithm 1 to solve the multi-agent minimisation problem (1). In doing so, the main question that arises is how to implement the linesearch in distributed settings. We propose two implementations that address this with differing amounts of communication.

4.1 Construction of the algorithm

Let Wn×nW\in\mathbb{R}^{n\times n} be a mixing matrix, and set U=(IW2)1/2U=\big{(}\frac{I-W}{2}\bigr{)}^{1/2}. As explained in Section 2, problem (1) can be formulated as

min𝐱n×dmax𝐲n×dU𝐲,𝐱+f(𝐱)+h(𝐱).\displaystyle\min_{\mathbf{x}\in\mathbbm{R}^{n\times d}}\max_{\mathbf{y}\in\mathbbm{R}^{n\times d}}\langle U\mathbf{y},\mathbf{x}\rangle+f(\mathbf{x})+h(\mathbf{x}). (28)

Hence, problem (28) is in the form of problem (8) of Section 3 with

u=\mathbf{y},\quad v=\mathbf{x},\quad K=-U,\quad G=\iota_{\{0\}}^{*}\equiv 0,\quad F=f\quad\text{and}\quad H=h.

Applying Algorithm 1 to problem (28) gives

{𝐲k=𝐲k1+τk1U𝐱k𝐲¯k=𝐲k+θk(𝐲k𝐲k1)𝐱k+1=proxβτkf(𝐱kβτk(U𝐲¯k+h(𝐱k))),\left\{\begin{array}[]{rcl}\mathbf{y}^{k}&=&\mathbf{y}^{k-1}+\tau_{k-1}U\mathbf{x}^{k}\\ \overline{\mathbf{y}}^{k}&=&\mathbf{y}^{k}+\theta_{k}(\mathbf{y}^{k}-\mathbf{y}^{k-1})\\ \mathbf{x}^{k+1}&=&\operatorname{prox}_{\beta\tau_{k}f}\Big{(}\mathbf{x}^{k}-\beta\tau_{k}\big{(}U\overline{\mathbf{y}}^{k}+\nabla h(\mathbf{x}^{k})\big{)}\Big{)},\end{array}\right. (29)

where θk=τkτk11\theta_{k}=\tau_{k}\tau_{k-1}^{-1} and τk\tau_{k} is calculated using a linesearch. However, in this form, (29) is not suitable for a distributed implementation since UU need not be a mixing matrix (in particular, it need not satisfy Definition 2.3(i)). To address this, denote

𝐮k=U𝐲k and 𝐮¯k=U𝐲¯k.\mathbf{u}^{k}=U\mathbf{y}^{k}\text{ and }\overline{\mathbf{u}}^{k}=U\overline{\mathbf{y}}^{k}. (30)

Premultiplying the first two equations in (29) by UU and substituting (30) yield

{𝐮k=𝐮k1+τk12(IW)𝐱k𝐮¯k=𝐮k+θk(𝐮k𝐮k1)𝐱k+1=proxβτkf(𝐱kβτk[𝐮¯k+h(𝐱k)]),\left\{\begin{aligned} \mathbf{u}^{k}&=\mathbf{u}^{k-1}+\frac{\tau_{k-1}}{2}(I-W)\mathbf{x}^{k}\\ \overline{\mathbf{u}}^{k}&=\mathbf{u}^{k}+\theta_{k}(\mathbf{u}^{k}-\mathbf{u}^{k-1})\\ \mathbf{x}^{k+1}&=\operatorname{prox}_{\beta\tau_{k}f}\big{(}\mathbf{x}^{k}-\beta\tau_{k}[\overline{\mathbf{u}}^{k}+\nabla h(\mathbf{x}^{k})]\big{)},\end{aligned}\right. (31)

Since this only involves the mixing matrix WW, (31) is suitable for distributed implementation. Due to (30), we also note that the initialisation for (31) requires that 𝐮0rangeU\mathbf{u}^{0}\in\operatorname{range}U (e.g., 𝐮0=0\mathbf{u}^{0}=0).

As a result of the discussion above, we present, in Algorithm 3, a distributed proximal-gradient method for finding solutions to problem (28), given an abstract distributed linesearch procedure LS that computes stepsizes τk\tau_{k} satisfying

τk(h(𝐱k+1)h(𝐱k)h(𝐱k),𝐱k+1𝐱k)δL2β𝐱k+1𝐱k2.\tau_{k}\left(h(\mathbf{x}^{k+1})-h(\mathbf{x}^{k})-\langle\nabla h(\mathbf{x}^{k}),\mathbf{x}^{k+1}-\mathbf{x}^{k}\rangle\right)\leqslant\dfrac{\delta_{L}}{2\beta}\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|^{2}.

This method corresponds to Algorithm 1 applied to (28). In Section 4.2, we discuss two possible distributed implementations of LS.

Choose a mixing matrix Wn×nW\in\mathbbm{R}^{n\times n};
Choose a set of parameters param containing β>0\beta>0 and δL(0,1)\delta_{L}\in(0,1);
Choose parameters τ0>0\tau_{0}>0, γ(0,1),\gamma\in(0,1), and δK(0,1)\delta_{K}\in(0,1) such that δK+δL<1\delta_{K}+\delta_{L}<1;
Set θ0=1\theta_{0}=1;
Choose an initial point 𝐱1n×d\mathbf{x}^{1}\in\mathbbm{R}^{n\times d} and set 𝐮0=0\mathbf{u}^{0}=0;
for k=1,2,,k=1,2,\dots, do
       \triangleright Step 5 (Dual update). Update variables according to
𝐮k=𝐮k1+12τk1(IW)𝐱k.\mathbf{u}^{k}=\mathbf{u}^{k-1}+\frac{1}{2}\tau_{k-1}(I-W)\mathbf{x}^{k}.
       \triangleright Step 6 (Backtracking linesearch and primal update). Initialise the linesearch by choosing αk[1,1+γθk1]\alpha_{k}\in[1,\sqrt{1+\gamma\theta_{k-1}}], and define
τk(0)=min{2δKβ(1λmin(W)),τk1αk}.\tau_{k(0)}=\min\left\{\frac{\sqrt{2\delta_{K}}}{\sqrt{\beta(1-\lambda_{\min}(W))}},\tau_{k-1}\alpha_{k}\right\}. (32)
      Compute
τk,𝐱k+1=LS(param,τk(0),τk1,𝐮k1,𝐮k,𝐱k),\tau_{k},\mathbf{x}^{k+1}=\texttt{LS}(\texttt{param},\tau_{k(0)},\tau_{k-1},\mathbf{u}^{k-1},\mathbf{u}^{k},\mathbf{x}^{k}),
      where
θk\displaystyle\theta_{k} =τkτk11\displaystyle=\tau_{k}\tau_{k-1}^{-1}
𝐮¯k\displaystyle\overline{\mathbf{u}}^{k} =𝐮k+θk(𝐮k𝐮k1)\displaystyle=\mathbf{u}^{k}+\theta_{k}(\mathbf{u}^{k}-\mathbf{u}^{k-1})
𝐱k+1\displaystyle\mathbf{x}^{k+1} =proxβτkf(𝐱k+βτk[𝐮¯kh(𝐱k)]).\displaystyle=\operatorname{prox}_{\beta\tau_{k}f}\big{(}\mathbf{x}^{k}+\beta\tau_{k}[\overline{\mathbf{u}}^{k}-\nabla h(\mathbf{x}^{k})]\big{)}.
Algorithm 3 Distributed proximal-gradient algorithm for problem (1).
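In matrix form (a centralised simulation rather than a genuinely distributed implementation), Algorithm 3 can be sketched as follows. The sketch is ours: LS is any routine satisfying the descent test on h stated above, for instance one of the distributed procedures of Section 4.2, and the proximal-gradient update on f and h happens inside LS through the oracles carried in params.

```python
import numpy as np

def algorithm3(W, LS, params, x1, tau0, beta=1.0, gamma=0.5, delta_K=0.4, iters=1000):
    """Matrix-form sketch of Algorithm 3; rows of x are the agents' local copies.

    LS(params, tau_init, tau_prev, u_prev, u, x) must return (tau, x_next).
    """
    x, u_prev = x1.copy(), np.zeros_like(x1)     # u^0 = 0 lies in range(U)
    tau_prev, theta = tau0, 1.0
    lam_min = np.linalg.eigvalsh(W).min()
    for _ in range(iters):
        # dual update: u^k = u^{k-1} + (tau_{k-1}/2) (I - W) x^k
        u = u_prev + 0.5 * tau_prev * (x - W @ x)
        # linesearch initialisation (32) with alpha_k at its upper bound
        alpha = np.sqrt(1.0 + gamma * theta)
        tau_init = min(np.sqrt(2.0 * delta_K) / np.sqrt(beta * (1.0 - lam_min)),
                       tau_prev * alpha)
        tau, x = LS(params, tau_init, tau_prev, u_prev, u, x)
        theta, tau_prev, u_prev = tau / tau_prev, tau, u
    return x
```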

Note that the linesearch initialisation (12) for Algorithm 1 involves the operator norm of K=-U. In view of the definition of U, this can be expressed in terms of the minimum eigenvalue of the mixing matrix W:

K2=U2=12IW=12(1λmin(W)).\|K\|^{2}=\|U^{2}\|=\dfrac{1}{2}\|I-W\|=\dfrac{1}{2}(1-\lambda_{\min}(W)).

This identity is used to obtain the initialisation rule for τk(0)\tau_{k(0)} in (32) from (12).

Remark 4.1.

Some further comments regarding Algorithm 3 are in order. Although the initialisation 𝐮0=0\mathbf{u}^{0}=0 is used in Algorithm 3 for convenience, it is actually possible to use 𝐮0=(u10,,un0)\mathbf{u}^{0}=(u_{1}^{0},\dots,u_{n}^{0})^{\top} such that i=1nui0=0\sum_{i=1}^{n}u_{i}^{0}=0. To see this, note that the derivation (30) only requires 𝐮0rangeU\mathbf{u}^{0}\in\operatorname{range}U and, due to Definition 2.3(iii), we have

rangeU=rangeU2=range(IW)=ker(IW)={𝐮=(u1,,un):i=1nui=0}.\operatorname{range}U=\operatorname{range}U^{2}=\operatorname{range}(I-W)=\ker(I-W)^{\perp}=\{\mathbf{u}=(u_{1},\dots,u_{n})^{\top}:\sum_{i=1}^{n}u_{i}=0\}.

The linesearch procedure LS in Algorithm 3 is assumed to perform agent-independent proximal steps to define

𝐱k+1=proxβτkf(𝐱kβτk[𝐮¯k+h(𝐱k)]).\mathbf{x}^{k+1}=\operatorname{prox}_{\beta\tau_{k}f}(\mathbf{x}^{k}-\beta\tau_{k}[\bar{\mathbf{u}}^{k}+\nabla h(\mathbf{x}^{k})]).

Two possible ways to implement LS are described in Subroutines 4 and 5 assuming hi\nabla h_{i} are locally Lipschitz continuous for each i{1,,n}i\in\{1,\dots,n\} and that (an estimate of) λmin(W)\lambda_{\min}(W) is available.

At the cost of extra communication, a variation of the proposed linesearch based on [13, Section 5] is discussed in Remark 4.3, which does not require knowledge of λmin(W)\lambda_{\min}(W) in the initialisation (12).

Remark 4.2 (Equivalent forms of PG-EXTRA).

In the following, we explain the relation between PG-EXTRA (4) and our proposal (31) (Algorithm 3).

  1. (i)

    Although not immediately obvious, when the stepsize is fixed, (31) is equivalent to PG-EXTRA (4) as we explain below. When the stepsize is not fixed, it is convenient to work with (31) due to the role of the sequence (θk)k(\theta_{k})_{k}.

    To see the correspondence between (31) with fixed stepsize and (4), first let τk=τ\tau_{k}=\tau for some constant τ>0\tau>0 and all k1k\geqslant 1. Then θk=1\theta_{k}=1. Next, define β=τ2\beta=\tau^{-2}, and denote the argument of the proximal step in (31) by

    𝐰k=𝐱kβτ[𝐮¯k+h(𝐱k)]k1.\mathbf{w}^{k}=\mathbf{x}^{k}-\beta\tau[\overline{\mathbf{u}}^{k}+\nabla h(\mathbf{x}^{k})]\quad\forall k\geqslant 1.

    Noting that βτ=τ1\beta\tau=\tau^{-1}, for all k2k\geqslant 2, we have

    𝐰k𝐰k1\displaystyle\mathbf{w}^{k}-\mathbf{w}^{k-1} =𝐱k𝐱k1βτ(𝐮¯k𝐮¯k1)βτ(h(𝐱k)h(𝐱k1))\displaystyle=\mathbf{x}^{k}-\mathbf{x}^{k-1}-\beta\tau\bigl{(}\bar{\mathbf{u}}^{k}-\bar{\mathbf{u}}^{k-1}\bigr{)}-\beta\tau\bigl{(}\nabla h(\mathbf{x}^{k})-\nabla h(\mathbf{x}^{k-1})\bigr{)}
    =𝐱k𝐱k1τ1(2(𝐮k𝐮k1)(𝐮k1𝐮k2))βτ(h(𝐱k)h(𝐱k1))\displaystyle=\mathbf{x}^{k}-\mathbf{x}^{k-1}-\tau^{-1}\bigl{(}2(\mathbf{u}^{k}-\mathbf{u}^{k-1})-(\mathbf{u}^{k-1}-\mathbf{u}^{k-2})\bigr{)}-\beta\tau\bigl{(}\nabla h(\mathbf{x}^{k})-\nabla h(\mathbf{x}^{k-1})\bigr{)}
    =𝐱k𝐱k1τ1(τ(IW)𝐱kτ2(IW)𝐱k1)βτ(h(𝐱k)h(𝐱k1))\displaystyle=\mathbf{x}^{k}-\mathbf{x}^{k-1}-\tau^{-1}\left(\tau(I-W)\mathbf{x}^{k}-\frac{\tau}{2}(I-W)\mathbf{x}^{k-1}\right)-\beta\tau\bigl{(}\nabla h(\mathbf{x}^{k})-\nabla h(\mathbf{x}^{k-1})\bigr{)}
    =W𝐱k12(I+W)𝐱k1βτ(h(𝐱k)h(𝐱k1)).\displaystyle=W\mathbf{x}^{k}-\dfrac{1}{2}(I+W)\mathbf{x}^{k-1}-\beta\tau\bigl{(}\nabla h(\mathbf{x}^{k})-\nabla h(\mathbf{x}^{k-1})\bigr{)}.

    Altogether, we have

    {𝐰k=𝐰k1+W𝐱k12(W+I)𝐱k1σ(h(𝐱k)h(𝐱k1))𝐱k+1=proxσf(𝐰k),\left\{\begin{aligned} \mathbf{w}^{k}&=\mathbf{w}^{k-1}+W\mathbf{x}^{k}-\dfrac{1}{2}(W+I)\mathbf{x}^{k-1}-\sigma\left(\nabla h(\mathbf{x}^{k})-\nabla h(\mathbf{x}^{k-1})\right)\\ \mathbf{x}^{k+1}&=\operatorname{prox}_{\sigma f}\big{(}\mathbf{w}^{k}\big{)},\end{aligned}\right.

which corresponds to PG-EXTRA (4) with stepsize σ=βτ=τ1\sigma=\beta\tau=\tau^{-1}. A numerical illustration of this equivalence is sketched at the end of this remark.

  2. (ii)

The initialisation of Algorithm 3 can recover the initialisation of PG-EXTRA used in [24]. Indeed, taking 𝐮0=0\mathbf{u}^{0}=0 and any 𝐱1n×d\mathbf{x}^{1}\in\mathbbm{R}^{n\times d} gives 𝐮1=τ2(IW)𝐱1\mathbf{u}^{1}=\frac{\tau}{2}(I-W)\mathbf{x}^{1} which implies 𝐮¯1=𝐮1+(𝐮1𝐮0)=τ(IW)𝐱1\overline{\mathbf{u}}^{1}=\mathbf{u}^{1}+(\mathbf{u}^{1}-\mathbf{u}^{0})=\tau(I-W)\mathbf{x}^{1}. Since βτ2=1\beta\tau^{2}=1, we therefore have

    𝐱2=proxβτf(𝐱1βτ[𝐮¯1+h(𝐱1)])=proxβτf(W𝐱1βτh(𝐱1)).\mathbf{x}^{2}=\operatorname{prox}_{\beta\tau f}\big{(}\mathbf{x}^{1}-\beta\tau[\overline{\mathbf{u}}^{1}+\nabla h(\mathbf{x}^{1})]\big{)}=\operatorname{prox}_{\beta\tau f}\big{(}W\mathbf{x}^{1}-\beta\tau\nabla h(\mathbf{x}^{1})\big{)}.

    Alternatively, taking any 𝐱1n×d\mathbf{x}^{1}\in\mathbbm{R}^{n\times d} and setting 𝐮0=τ(IW)𝐱1\mathbf{u}^{0}=-\tau(I-W)\mathbf{x}^{1} gives 𝐮1=τ2(IW)𝐱1\mathbf{u}^{1}=-\frac{\tau}{2}(I-W)\mathbf{x}^{1} which implies 𝐮¯1=0\overline{\mathbf{u}}^{1}=0. We therefore obtain

    𝐱2=proxβτf(𝐱1βτh(𝐱1)),\mathbf{x}^{2}=\operatorname{prox}_{\beta\tau f}\big{(}\mathbf{x}^{1}-\beta\tau\nabla h(\mathbf{x}^{1})\big{)},

    which corresponds to the initialisation of PG-EXTRA discussed in [14, Remark 4.5].
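The equivalence described in part (i) of this remark can be confirmed numerically. The sketch below runs both recursions with f_i ≡ 0 (so the proximal step is the identity), hypothetical quadratic functions h_i, a hypothetical ring mixing matrix and a fixed stepsize; by the derivation in part (i), the two trajectories agree up to rounding error. All problem data are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
A = 0.5 * rng.standard_normal((n, d, d))
b = rng.standard_normal((n, d))

def grad_h(X):
    # h_i(x) = 0.5*||A_i x - b_i||^2 (hypothetical smooth terms); f_i = 0.
    return np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])

tau = 2.0
sigma = 1.0 / tau                  # sigma = beta*tau with beta = tau**(-2)
X1 = rng.standard_normal((n, d))
num_iter = 20

# Fixed-stepsize form (31) with u^0 = 0 and theta_k = 1.
X, U = X1.copy(), np.zeros((n, d))
pd = [X.copy()]
for _ in range(num_iter):
    U_new = U + 0.5 * tau * (np.eye(n) - W) @ X
    U_bar = 2.0 * U_new - U                      # u^k + (u^k - u^{k-1})
    X = X - sigma * (U_bar + grad_h(X))          # prox of f = 0 is the identity
    U = U_new
    pd.append(X.copy())

# PG-EXTRA recursion with the matching initialisation w^1 = W x^1 - sigma*grad_h(x^1).
X_prev, w = X1.copy(), W @ X1 - sigma * grad_h(X1)
X = w.copy()
extra = [X1.copy(), X.copy()]
for _ in range(num_iter - 1):
    w = w + W @ X - 0.5 * (W + np.eye(n)) @ X_prev - sigma * (grad_h(X) - grad_h(X_prev))
    X_prev, X = X, w.copy()
    extra.append(X.copy())

gap = max(np.max(np.abs(p - e)) for p, e in zip(pd, extra))
print(gap)                         # of the order of machine precision
```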

4.2 Linesearch procedures in distributed settings

In this section, we propose two distributed linesearch techniques that can take the role of LS in Algorithm 3. Although the updates for the variables 𝐱k,𝐮k\mathbf{x}^{k},\mathbf{u}^{k} and 𝐮¯k\bar{\mathbf{u}}^{k} in Algorithm 3 can be decentralised (when the stepsize is known), the proposed linesearch procedures internally require global communication of scalar values across all agents. More precisely, LS as defined in Subroutine 4 requires a method for computing a global (scalar) sum, and LS as defined in Subroutine 5 requires computing a minimum (scalar) value among agents. Both of these operations can be efficiently implemented, for instance, using the Message Passing Interface (MPI) [16]; see [4, Chapter 10] for more details. Since the aforementioned approaches only require communication of scalars across the network, their global communication costs are small compared to methods that need global communication of vector and matrix variables such as 𝐱k,𝐮k\mathbf{x}^{k},\mathbf{u}^{k} and 𝐮¯k\bar{\mathbf{u}}^{k}.
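For concreteness, each of these two scalar reductions can be realised with a single collective operation. The following minimal mpi4py sketch (one process per agent) indicates one possible realisation; the placeholder values are hypothetical and the snippet is not prescribed by the paper.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD              # one MPI process per agent

# Subroutine 4: each agent contributes its local scalar a_i and receives the
# global sum, so the linesearch condition (34) can be tested locally.
a_i = 0.0                          # placeholder for the locally computed scalar
total = comm.allreduce(a_i, op=MPI.SUM)
shrink_stepsize = total > 0.0      # condition (34)

# Subroutine 5: each agent contributes its locally accepted stepsize and
# receives the common stepsize tau_k = min_i tau_{k,i}.
tau_i = 1.0                        # placeholder for the locally accepted stepsize
tau_k = comm.allreduce(tau_i, op=MPI.MIN)
```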

Linesearch with global (scalar) summation.

The first distributed linesearch subroutine we propose is presented in Subroutine 4, and corresponds to a direct application of Subroutine 2 (a serial Python sketch is given after the subroutine). It locally computes independent proximal steps for each agent in (33), using one common stepsize. In each outer loop of the linesearch, a round of global communication is then used to compute the sum of the scalars a1,,ana_{1},\dots,a_{n}, and the linesearch condition (34) is checked. This check (for positivity of the sum) can either be performed by all agents after the global value has been returned, or performed by one agent which then broadcasts the outcome to the other agents. Observe also that each check of (34) is preceded by nn proximal steps (one per agent). When (34) holds, the current stepsize is too large and is shrunk by the same factor for all agents before returning to Step 1. When (34) does not hold, the linesearch procedure terminates with all agents agreeing on a common stepsize.

Input: param ={β>0,δL(0,1),ρ(0,1)}=\{\beta>0,\delta_{L}\in(0,1),\rho\in(0,1)\}, τk(0)\tau_{k(0)}, τk1\tau_{k-1}, 𝐮k1\mathbf{u}^{k-1}, 𝐮k\mathbf{u}^{k}, 𝐱k\mathbf{x}^{k}.
for j=0,1,j=0,1,\dots do
       \triangleright Step 1 (Primal-dual agent-wise operations).
      for i=1,,ni=1,\dots,n do
             Define u¯ik(j)=uik+τk(j)τk11(uikuik1)\overline{u}^{k(j)}_{i}=u^{k}_{i}+\tau_{k(j)}\tau_{k-1}^{-1}(u^{k}_{i}-u^{k-1}_{i}). Compute
xik(j)+1=proxβτk(j)fi(xikβτk(j)[u¯ik(j)+hi(xik)]),x^{k(j)+1}_{i}=\operatorname{prox}_{\beta\tau_{k(j)}f_{i}}\big{(}x^{k}_{i}-\beta\tau_{k(j)}[\overline{u}^{k(j)}_{i}+\nabla h_{i}(x^{k}_{i})]\big{)}, (33)
and
ai=τk(j)[hi(xik(j)+1)hi(xik)hi(xik),xik(j)+1xik]δL2βxik(j)+1xik2.a_{i}=\tau_{k(j)}[h_{i}(x_{i}^{k(j)+1})-h_{i}(x_{i}^{k})-\langle\nabla h_{i}(x_{i}^{k}),x_{i}^{k(j)+1}-x_{i}^{k}\rangle]-\dfrac{\delta_{L}}{2\beta}\displaystyle\|x_{i}^{k(j)+1}-x_{i}^{k}\|^{2}.
       \triangleright Step 2 (Check the linesearch condition). Compute the global sum i=1nai\sum_{i=1}^{n}a_{i}, and if
i=1nai>0,\begin{array}[]{c}\displaystyle\sum_{i=1}^{n}a_{i}>0,\end{array} (34)
set τk(j+1)=ρτk(j)\tau_{k(j+1)}=\rho\tau_{k(j)} and go back to Step 1. Otherwise, set τk=τk(j)\tau_{k}=\tau_{k(j)}, 𝐱k+1=𝐱k(j)+1\mathbf{x}^{k+1}=\mathbf{x}^{k(j)+1}, and exit the linesearch.
Output: τk,𝐱k+1\tau_{k},\mathbf{x}^{k+1}.
Subroutine 4 Linesearch LS with global sum for Algorithm 3.
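The following Python sketch mirrors Subroutine 4 in a serial, loop-over-agents form, with the global communication replaced by an ordinary sum; the proximal operators, gradient and function evaluations are passed in as callables. It is a sketch under these assumptions, not a reference implementation.

```python
import numpy as np

def linesearch_global_sum(x, u, u_prev, tau_prev, tau_init,
                          prox, grad_h, h, beta, delta_L, rho):
    """Backtracking linesearch with a global scalar sum (cf. Subroutine 4).

    x, u, u_prev : lists of per-agent arrays x_i^k, u_i^k, u_i^{k-1}
    prox[i](v, t): proximal operator of t*f_i evaluated at v
    grad_h[i], h[i]: gradient and value of the smooth term of agent i
    """
    n = len(x)
    tau = tau_init
    while True:
        x_new, a = [], []
        for i in range(n):                      # local, agent-wise operations
            u_bar = u[i] + (tau / tau_prev) * (u[i] - u_prev[i])
            x_plus = prox[i](x[i] - beta * tau * (u_bar + grad_h[i](x[i])),
                             beta * tau)
            bregman = h[i](x_plus) - h[i](x[i]) - grad_h[i](x[i]) @ (x_plus - x[i])
            a.append(tau * bregman
                     - delta_L / (2.0 * beta) * np.sum((x_plus - x[i]) ** 2))
            x_new.append(x_plus)
        # In a distributed run, sum(a) is obtained via a global scalar reduction.
        if sum(a) > 0.0:
            tau *= rho                          # stepsize too large: shrink and retry
        else:
            return tau, x_new                   # common stepsize accepted
```

For instance, when f_i is a multiple of the ℓ1 norm, the soft-thresholding map sketched after Remark 4.1 can be passed as prox[i].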

We also propose a linesearch that, instead of requiring the calculation of a distributed sum of local values over the network, only needs the computation of the minimum (scalar) value among agents.

Linesearch with a global minimum.

The second distributed linesearch subroutine we propose is presented in Subroutine 5 (a serial Python sketch is given after the subroutine). Unlike Subroutine 4, it is not a direct application of Subroutine 2, as it locally computes independent proximal steps for each agent in (35) using potentially different stepsizes across agents. In the backtracking loop (Steps 1.1 and 1.2), the linesearch condition (36) is tested locally by each agent, rather than globally as in Subroutine 4, and hence no global communication is needed in this part. In exchange, a common stepsize for all agents is computed in Step 2 by finding a global minimum across the network. This common stepsize is then used to recompute the proximal steps for those agents whose local stepsize was larger. Note that the global minimum only needs to be computed once each time LS is run.

Input: param ={β>0,δL(0,1),ρ(0,1)}=\{\beta>0,\delta_{L}\in(0,1),\rho\in(0,1)\}, τk(0)\tau_{k(0)}, τk1\tau_{k-1}, 𝐮k1\mathbf{u}^{k-1}, 𝐮k\mathbf{u}^{k}, 𝐱k\mathbf{x}^{k}.
\triangleright Step 1 (Primal-dual and linesearch agent-wise operations).
for i=1,,ni=1,\dots,n do
      
      Set τk(0),i=τk(0)\tau_{k(0),i}=\tau_{k(0)}.
      for j=0,1,j=0,1,\dots do
            
            Step 1.1. Define uˇik(j)=uik+τk(j),iτk11(uikuik1)\check{u}^{k(j)}_{i}=u^{k}_{i}+\tau_{k(j),i}\tau_{k-1}^{-1}(u^{k}_{i}-u^{k-1}_{i}). Compute
xˇik(j)+1=proxβτk(j),ifi(xikβτk(j),i[uˇik(j)+hi(xik)]).\check{x}^{k(j)+1}_{i}=\operatorname{prox}_{\beta\tau_{k(j),i}f_{i}}\big{(}x^{k}_{i}-\beta\tau_{k(j),i}[\check{u}^{k(j)}_{i}+\nabla h_{i}(x^{k}_{i})]\big{)}. (35)
and
bi=τk(j),i[hi(xˇik(j)+1)hi(xik)hi(xik),xˇik(j)+1xik]δL2βxˇik(j)+1xik2.b_{i}=\tau_{k(j),i}\left[h_{i}(\check{x}_{i}^{k(j)+1})-h_{i}(x_{i}^{k})-\langle\nabla h_{i}(x_{i}^{k}),\check{x}_{i}^{k(j)+1}-x_{i}^{k}\rangle\right]-\dfrac{\delta_{L}}{2\beta}\|\check{x}_{i}^{k(j)+1}-x_{i}^{k}\|^{2}.
            Step 1.2. Backtracking linesearch test. If
bi>0,\begin{array}[]{c}b_{i}>0,\end{array} (36)
set τk(j+1),i=ρτk(j),i\tau_{k(j+1),i}=\rho\tau_{k(j),i} and go back to Step 1.1. Otherwise, set τk,i=τk(j),i\tau_{k,i}=\tau_{k(j),i}, u¯ik=uˇik(j)\bar{u}_{i}^{k}=\check{u}_{i}^{k(j)}, xik+1=xˇik(j)+1x_{i}^{k+1}=\check{x}_{i}^{k(j)+1}, and finish the linesearch for agent ii.
      
\triangleright Step 2. Compute τk=mini=1,,nτk,i\tau_{k}=\displaystyle\min_{i=1,\dots,n}\tau_{k,i} and
for i=1,,ni=1,\dots,n such that τk,i>τk\tau_{k,i}>\tau_{k} do
       Recompute
u¯ik\displaystyle\overline{u}^{k}_{i} =uik+τkτk11(uikuik1)\displaystyle=u^{k}_{i}+\tau_{k}\tau_{k-1}^{-1}(u^{k}_{i}-u^{k-1}_{i})
xik+1\displaystyle x^{k+1}_{i} =proxβτkfi(xikβτk[u¯ik+hi(xik)]).\displaystyle=\operatorname{prox}_{\beta\tau_{k}f_{i}}\big{(}x^{k}_{i}-\beta\tau_{k}[\overline{u}^{k}_{i}+\nabla h_{i}(x^{k}_{i})]\big{)}.
Output: τk,𝐱k+1\tau_{k},\mathbf{x}^{k+1}.
Subroutine 5 Linesearch LS with a global minimum for Algorithm 3.
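A matching serial sketch of Subroutine 5 follows, with the same hypothetical callables. Each agent backtracks independently, the common stepsize is the minimum of the locally accepted stepsizes (a single scalar reduction in a distributed run), and only agents whose local stepsize was strictly larger recompute their proximal step.

```python
import numpy as np

def linesearch_global_min(x, u, u_prev, tau_prev, tau_init,
                          prox, grad_h, h, beta, delta_L, rho):
    """Backtracking linesearch with a global minimum (cf. Subroutine 5)."""
    n = len(x)
    tau_loc, x_new = [], []
    for i in range(n):                          # independent per-agent backtracking
        tau_i = tau_init
        while True:
            u_bar = u[i] + (tau_i / tau_prev) * (u[i] - u_prev[i])
            x_plus = prox[i](x[i] - beta * tau_i * (u_bar + grad_h[i](x[i])),
                             beta * tau_i)
            bregman = h[i](x_plus) - h[i](x[i]) - grad_h[i](x[i]) @ (x_plus - x[i])
            b_i = tau_i * bregman - delta_L / (2.0 * beta) * np.sum((x_plus - x[i]) ** 2)
            if b_i > 0.0:
                tau_i *= rho                    # local stepsize too large: shrink
            else:
                break
        tau_loc.append(tau_i)
        x_new.append(x_plus)

    tau = min(tau_loc)                          # one global scalar reduction
    for i in range(n):                          # recompute only where needed
        if tau_loc[i] > tau:
            u_bar = u[i] + (tau / tau_prev) * (u[i] - u_prev[i])
            x_new[i] = prox[i](x[i] - beta * tau * (u_bar + grad_h[i](x[i])),
                               beta * tau)
    return tau, x_new
```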
Remark 4.3.

Some comments regarding Subroutines 4 and 5 are in order.

  1. (i)

The rounds of communication in the two subroutines are different in nature. On the one hand, in each outer loop of Subroutine 4, we require communication to compute the global sum in (34) and test the inequality therein. On the other hand, in Subroutine 5, we require communication to compute the minimum stepsize among agents only once per linesearch call.

  2. (ii)

One advantage of Subroutine 4 is that it does not require the computation of additional proximal steps after communicating, as all agents share the same stepsize before entering a new inner linesearch iteration. However, this comes at the cost of a global communication to check the linesearch condition in every inner iteration. There is thus a trade-off between checking a global sum at every inner iteration (Subroutine 4), and computing the global minimum stepsize once and recomputing some proximal steps (Subroutine 5). In the latter, both the proximal steps and the test of the linesearch condition are performed locally, which may be more suitable whenever the proximal steps are cheap relative to the cost of computing global scalar sums.

4.3 Convergence analysis

In this section, we analyse convergence of Algorithm 3 with LS being either Subroutine 4 or 5, which will require the following lemma.

Lemma 4.4.

Suppose that f1,,fn:d{+}f_{1},\dots,f_{n}:\mathbbm{R}^{d}\to\mathbbm{R}\cup\{+\infty\} are proper lsc convex functions and h1,,hn:dh_{1},\dots,h_{n}:\mathbbm{R}^{d}\to\mathbbm{R} are convex differentiable functions with locally Lipschitz continuous gradients. Then, Subroutines 4 and 5 are well-defined and satisfy Assumptions 3.1 and 3.3. In particular,

k,τki=1n[hi(xik+1)hi(xik)hi(xik),xik+1xik]δL2βi=1nxik+1xik2.\forall k\in\mathbb{N},\quad\tau_{k}\sum_{i=1}^{n}\left[h_{i}(x_{i}^{k+1})-h_{i}(x_{i}^{k})-\langle\nabla h_{i}(x_{i}^{k}),x_{i}^{k+1}-x_{i}^{k}\rangle\right]\leqslant\dfrac{\delta_{L}}{2\beta}\sum_{i=1}^{n}\|x_{i}^{k+1}-x_{i}^{k}\|^{2}. (37)
Proof.

Subroutine 4 is a distributed implementation of Subroutine 2; therefore, the result for this procedure follows from Lemma 3.7, and (37) follows from (15). Regarding Subroutine 5, using Lemma 3.7(i) with H=hiH=h_{i}, for each i{1,,n}i\in\{1,\dots,n\} and k1k\geqslant 1, there exists ξk,i>0\xi_{k,i}>0 such that

t(0,ξk,i],t(hi(xik+1(t))hi(xik)hi(xik),xik+1(t)xik)δL2βxik+1(t)xik2,\forall t\in(0,\xi_{k,i}],\quad t\left(h_{i}(x_{i}^{k+1}(t))-h_{i}(x_{i}^{k})-\langle\nabla h_{i}(x_{i}^{k}),x_{i}^{k+1}(t)-x_{i}^{k}\rangle\right)\leqslant\dfrac{\delta_{L}}{2\beta}\|x_{i}^{k+1}(t)-x_{i}^{k}\|^{2}, (38)

where xik+1(t)=proxβtfi(xikβt[u¯ik+hi(xik)])x_{i}^{k+1}(t)=\operatorname{prox}_{\beta tf_{i}}(x_{i}^{k}-\beta t[\overline{u}_{i}^{k}+\nabla h_{i}(x_{i}^{k})]). After finitely many backtracking linesearch steps, the search for agent ii terminates with τk,i(0,ξk,i]\tau_{k,i}\in(0,\xi_{k,i}]. Since τk=mini=1,,nτk,i\tau_{k}=\min_{i=1,\dots,n}\tau_{k,i}, the linesearch terminates with τkτk(0)\tau_{k}\leqslant\tau_{k(0)}, so Assumption 3.1(i) holds with τinit=τk(0)\tau_{\text{init}}=\tau_{k(0)}. Aggregating (38) over all i{1,,n}i\in\{1,\dots,n\} with t=τkt=\tau_{k} yields Assumption 3.1(ii) and (37). Furthermore, in view of Lemma 3.2, (𝐱k)(\mathbf{x}^{k}) is bounded. In order to check that (τk)k(\tau_{k})_{k} is separated from 0, assume by way of contradiction that there exists a subsequence τk0\tau_{k_{\ell}}\to 0 as +\ell\to+\infty. From Step 2 of Subroutine 5, there exists i{1,,n}i\in\{1,\dots,n\} such that τk,i0\tau_{k_{\ell},i}\to 0. Applying Lemma 3.7(iii) with τk=τk,i\tau_{k}=\tau_{k_{\ell},i} yields a contradiction, as Step 1 in Subroutine 5 corresponds to Subroutine 2 applied to agent ii. Hence, Assumption 3.3 holds. ∎

Since we have established that Subroutines 4 and 5 are well-defined and satisfy Assumptions 3.1 and 3.3, we are ready to state the convergence result: the primal sequence (𝐱k)k(\mathbf{x}^{k})_{k} converges to a solution in consensus, and is therefore asymptotically feasible.

Corollary 4.5 (Convergence to primal solutions).

Suppose that, for each i{1,,n}i\in\{1,\dots,n\}, fi:d{+}f_{i}:\mathbbm{R}^{d}\to\mathbbm{R}\cup\{+\infty\} is a proper lsc convex function and hi:dh_{i}:\mathbbm{R}^{d}\to\mathbbm{R} is a differentiable convex function with locally Lipschitz continuous gradient, and that the set of solutions of (1) is nonempty. Then the sequence (𝐱k)k(\mathbf{x}^{k})_{k} generated by Algorithm 3 with the linesearch in Subroutine 4 or 5 converges to a point in consensus 𝐱^\hat{\mathbf{x}} such that any row of 𝐱^\hat{\mathbf{x}} defines a solution to (1).

Proof.

In view of Lemma 4.4, the result follows from Theorem 3.4: (𝐱k)k(\mathbf{x}^{k})_{k} converges to a solution to the primal problem associated to (28), which corresponds to (6). ∎

Remark 4.6.

The initialisation of τk(0)\tau_{k(0)} in (32) requires knowledge of the smallest eigenvalue of WW. Alternatively, we could perform the linesearch in (27) and initialise τk(0)=τk1αk\tau_{k(0)}=\tau_{k-1}\alpha_{k}. To make this alternative implementable, we need a distributed way to calculate the first term on its left-hand side. More specifically, since K=UK=-U, the extra term in the linesearch can be rewritten as

K𝐱k(j)+1K𝐱k2=𝐱k(j)+1𝐱k,K2(𝐱k(j)+1𝐱k)=12𝐱k(j)+1𝐱k,(IW)(𝐱k(j)+1𝐱k)=12(𝐱k(j)+1𝐱k2W(𝐱k(j)+1𝐱k),𝐱k(j)+1𝐱k).\begin{array}[]{rcl}\|K^{*}\mathbf{x}^{k(j)+1}-K^{*}\mathbf{x}^{k}\|^{2}&=&\langle\mathbf{x}^{k(j)+1}-\mathbf{x}^{k},K^{2}(\mathbf{x}^{k(j)+1}-\mathbf{x}^{k})\rangle\\ &=&\dfrac{1}{2}\langle\mathbf{x}^{k(j)+1}-\mathbf{x}^{k},(I-W)(\mathbf{x}^{k(j)+1}-\mathbf{x}^{k})\rangle\\ &=&\dfrac{1}{2}\left(\|\mathbf{x}^{k(j)+1}-\mathbf{x}^{k}\|^{2}-\langle W(\mathbf{x}^{k(j)+1}-\mathbf{x}^{k}),\mathbf{x}^{k(j)+1}-\mathbf{x}^{k}\rangle\right).\end{array} (39)

Therefore, each step of the linesearch requires pre-multiplications by WW, which amounts to additional rounds of communication with immediate neighbours. In this way, we check (34) with

ai=τk(j)24(xik(j)+1xik2jN(i)Wij(xjk(j)+1xjk),xik(j)+1xik)+τk(j)[hi(xik(j)+1)hi(xik)hi(xik),xik(j)+1xik]δK+δL2βxik(j)+1xik2,a_{i}=\dfrac{\tau_{k(j)}^{2}}{4}\left(\|x_{i}^{k(j)+1}-x_{i}^{k}\|^{2}-\left\langle\sum_{j\in N(i)}W_{ij}(x_{j}^{k(j)+1}-x_{j}^{k}),x_{i}^{k(j)+1}-x_{i}^{k}\right\rangle\right)\\ +\tau_{k(j)}\left[h_{i}(x_{i}^{k(j)+1})-h_{i}(x_{i}^{k})-\langle\nabla h_{i}(x_{i}^{k}),x_{i}^{k(j)+1}-x_{i}^{k}\rangle\right]-\dfrac{\delta_{K}+\delta_{L}}{2\beta}\|x_{i}^{k(j)+1}-x_{i}^{k}\|^{2},

while (36) requires

bi=τk(j),i24(xˇik(j)+1xik2jN(i)Wij(xˇjk(j)+1xjk),xˇik(j)+1xik)+τk(j),i[hi(xˇik(j)+1)hi(xik)hi(xik),xˇik(j)+1xik]δK+δL2βxˇik(j)+1xik2.b_{i}=\dfrac{\tau_{k(j),i}^{2}}{4}\left(\|\check{x}_{i}^{k(j)+1}-x_{i}^{k}\|^{2}-\left\langle\sum_{j\in N(i)}W_{ij}(\check{x}_{j}^{k(j)+1}-x_{j}^{k}),\check{x}_{i}^{k(j)+1}-x_{i}^{k}\right\rangle\right)\\ +\tau_{k(j),i}\left[h_{i}(\check{x}_{i}^{k(j)+1})-h_{i}(x_{i}^{k})-\langle\nabla h_{i}(x_{i}^{k}),\check{x}_{i}^{k(j)+1}-x_{i}^{k}\rangle\right]-\dfrac{\delta_{K}+\delta_{L}}{2\beta}\|\check{x}_{i}^{k(j)+1}-x_{i}^{k}\|^{2}.

Note that, upon termination, both linesearch procedures imply

τk24i=1n(xik+1xik2jN(i)Wij(xjk+1xjk),xik+1xik)+τki=1n[hi(xik+1)hi(xik)hi(xik),xik+1xik]δK+δL2βi=1nxik+1xik2.\dfrac{\tau_{k}^{2}}{4}\displaystyle\sum_{i=1}^{n}\left(\|x_{i}^{k+1}-x_{i}^{k}\|^{2}-\left\langle\sum_{j\in N(i)}W_{ij}(x_{j}^{k+1}-x_{j}^{k}),x_{i}^{k+1}-x_{i}^{k}\right\rangle\right)\\ +\tau_{k}\sum_{i=1}^{n}\left[h_{i}(x_{i}^{k+1})-h_{i}(x_{i}^{k})-\langle\nabla h_{i}(x_{i}^{k}),x_{i}^{k+1}-x_{i}^{k}\rangle\right]\leqslant\dfrac{\delta_{K}+\delta_{L}}{2\beta}\sum_{i=1}^{n}\|x_{i}^{k+1}-x_{i}^{k}\|^{2}.

These two options could be as costly as estimating λmin(W)\lambda_{\min}(W) separately, since they require a number of rounds of network communication per iteration equal to the number of linesearch steps. In this sense, one can instead reallocate these potentially costly per-iteration communication rounds to a separate, one-off computation of λmin(W)\lambda_{\min}(W). The modifications of the linesearch procedures (33)–(34) and (35)–(36) obtained by using (39) are also well-defined, and convergence of the resulting method follows from Remark 3.9(i).
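To illustrate the per-agent decomposition used in the modified a_i and b_i above, note that each agent only needs to receive the relevant differences of its neighbours' tentative iterates. The small Python sketch below, with a hypothetical mixing matrix and random data, checks that summing the local contributions recovers the global quantity ⟨Δ, ½(I−W)Δ⟩ from (39), where Δ stands for x^{k(j)+1} − x^k.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 3
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
delta = rng.standard_normal((n, d))             # stands for x^{k(j)+1} - x^k

# Global quantity from (39): <delta, (1/2)(I - W) delta> = ||K* delta||^2.
global_val = 0.5 * np.sum(delta * ((np.eye(n) - W) @ delta))

# Per-agent contributions: agent i only needs W_ij * delta_j from its neighbours.
local = [np.sum(delta[i] ** 2) - (W[i] @ delta) @ delta[i] for i in range(n)]
print(global_val, 0.5 * sum(local))             # the two values coincide
```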

5 Concluding remarks

In this work, we propose an abstract linesearch framework for a primal-dual splitting method to solve min-max convex-concave problems, which can be viewed as a generalisation of the method proposed in [13], and we provide an implementation of such a linesearch that naturally satisfies the assumptions. In the distributed optimisation setting, we propose two distributed linesearch procedures, so that the resulting method extends PG-EXTRA with variable stepsizes. We also only require the gradients hi\nabla h_{i} to be locally Lipschitz continuous for each agent i{1,,n}i\in\{1,\dots,n\}, instead of the usual assumption of globally Lipschitz continuous gradients.

Funding.

FA, MND and MKT were supported in part by Australian Research Council grant DP230101749.

References

  • [1] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer New York, 2011.
  • [2] A. Beck and M. Teboulle. Gradient-based algorithms with applications to signal-recovery problems. In D. P. Palomar and Y. C. Eldar, editors, Convex Optimization in Signal Processing and Communications, pages 42–88. Cambridge University Press, 2009.
  • [3] J. Y. Bello Cruz and T. T. Nghia. On the convergence of the forward–backward splitting method with linesearches. Optimization Methods and Software, 31(6):1209–1238, 2016.
  • [4] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • [5] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.
  • [6] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer New York, New York, NY, 2011.
  • [7] L. Condat. A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479, 2013.
  • [8] M. P. Friedlander, A. Goodwin, and T. Hoheisel. From perspective maps to epigraphical projections. Mathematics of Operations Research, 48(3):1711–1740, 2023.
  • [9] B. He and X. Yuan. Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective. SIAM Journal on Imaging Sciences, 5(1):119–149, 2012.
  • [10] N. Komodakis and J.-C. Pesquet. Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Processing Magazine, 32(6):31–54, 2015.
  • [11] Z. Li, W. Shi, and M. Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019.
  • [12] Z. Li and M. Yan. New convergence analysis of a primal-dual algorithm with large stepsizes. Advances in Computational Mathematics, 47:1–20, 2021.
  • [13] Y. Malitsky and T. Pock. A first-order primal-dual algorithm with linesearch. SIAM Journal on Optimization, 28(1):411–432, 2018.
  • [14] Y. Malitsky and M. K. Tam. A first-order algorithm for decentralised min-max problems. arXiv preprint arXiv:2308.11876, 2023.
  • [15] G. Mateos, J. A. Bazerque, and G. B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276, 2010.
  • [16] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard Version 4.1, Nov. 2023.
  • [17] A. Nedić, A. Olshevsky, and C. A. Uribe. Fast convergence rates for distributed non-Bayesian learning. IEEE Transactions on Automatic Control, 62(11):5538–5553, 2017.
  • [18] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning, pages 2681–2690. PMLR, 2017.
  • [19] C. Ravazzi, S. M. Fosson, and E. Magli. Distributed iterative thresholding for 0/1\ell_{0}/\ell_{1}-regularized linear inverse problems. IEEE Transactions on Information Theory, 61(4):2081–2100, 2015.
  • [20] R. T. Rockafellar. Monotone operators associated with saddle-functions and minimax problems. In Proceedings of Symposia in Pure Mathematics, volume 18, pages 241–250. American Mathematical Society, 1970.
  • [21] R. T. Rockafellar. Saddle-points and convex analysis. Differential games and related topics, 109, 1971.
  • [22] E. K. Ryu and W. Yin. Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press, 2022.
  • [23] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
  • [24] W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015.
  • [25] B. C. Vũ. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics, 38(3):667–681, 2013.