
A Double Tracking Method for Optimization with Decentralized Generalized Orthogonality Constraints

Lei Wang Department of Statistics, Pennsylvania State University, University Park, PA, USA (wlkings@lsec.cc.ac.cn).    Nachuan Xiao Institute of Operational Research and Analytics, National University of Singapore, Singapore (xnc@lsec.cc.ac.cn).    Xin Liu State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, Beijing, China (liuxin@lsec.cc.ac.cn).
Abstract

In this paper, we consider decentralized optimization problems with generalized orthogonality constraints, where both the objective function and the constraint exhibit a distributed structure. Such optimization problems, albeit ubiquitous in practical applications, remain unsolvable by existing algorithms in the presence of distributed constraints. To address this issue, we convert the original problem into an unconstrained penalty model by resorting to the recently proposed constraint-dissolving operator. However, this transformation compromises the essential property of separability in the resulting penalty function, rendering existing algorithms inapplicable. We overcome this difficulty by introducing a novel algorithm that simultaneously tracks the gradient of the objective function and the Jacobian of the constraint mapping. The global convergence guarantee is rigorously established with an iteration complexity bound. To substantiate the effectiveness and efficiency of the proposed algorithm, we present numerical results on both synthetic and real-world datasets.

1 Introduction

Rapid advances in data collection and processing capabilities have paved the way for the utilization of distributed systems in a large number of practical applications, such as dictionary learning [28], statistical inference [16], multi-agent control [15], and neural network training [42, 44]. The large scale and spatial/temporal disparity of data, coupled with the limitations in storage and computational resources, make centralized approaches infeasible or inefficient. Consequently, decentralized algorithms are developed to solve an optimization problem through the collaboration of the agents, where the need to efficiently manage and process vast amounts of distributed data is paramount.

Given a distributed system of d\in\mathbb{N}_{+} agents connected by a communication network, the focus of this paper is on the following decentralized optimization problem with the generalized orthogonality constraint:

\min_{X\in\mathbb{R}^{n\times p}}\quad f(X):=\sum_{i=1}^{d}f_{i}(X) (1.1a)

\mathrm{s.\,t.}\quad\sum_{i=1}^{d}X^{\top}M_{i}X=I_{p}, (1.1b)

where f_{i} is a continuously differentiable local function, M_{i}\in\mathbb{R}^{n\times n} is a symmetric matrix, and I_{p}\in\mathbb{R}^{p\times p} denotes the p\times p identity matrix. Moreover, for each i\in[d], both f_{i} and M_{i} are privately owned by agent i. For convenience, we denote M:=\sum_{i=1}^{d}M_{i}, which is assumed to be positive definite throughout this paper. The feasible region of problem (1.1), denoted by \mathcal{S}_{M}^{n,p}:=\{X\in\mathbb{R}^{n\times p}\mid X^{\top}MX=I_{p}\}, is an embedded submanifold of \mathbb{R}^{n\times p} and is commonly referred to as the generalized Stiefel manifold [1]. It is noteworthy that, different from classical decentralized optimization problems, the constraint (1.1b) also has a distributed structure across agents, which poses considerable challenges to solving (1.1) under the decentralized setting. Throughout this paper, we make the following blanket assumption.

Assumption 1.

Each f_{i} is continuously differentiable, and its gradient \nabla f_{i} is locally Lipschitz continuous in \mathbb{R}^{n\times p}.

Problems of the form (1.1) arise widely in applications. Below is a brief introduction to an important application of (1.1) in statistics.

Canonical Correlation Analysis (CCA)

CCA is a fundamental and ubiquitous statistical tool that characterizes linear relationships between two sets of variables [19]. Let A\in\mathbb{R}^{n\times q} and B\in\mathbb{R}^{m\times q} be two datasets in the form of matrices, where n and m are the dimensions of the two datasets, respectively, and q is the number of samples in each dataset. CCA aims to identify linear combinations of variables within each dataset that maximize their correlations. Mathematically, CCA can be equivalently formulated as the following optimization problem [17],

\min_{X\in\mathbb{R}^{(n+m)\times p}}\quad-\frac{1}{2}\mathrm{tr}\left(X^{\top}\Sigma X\right) (1.2)

\mathrm{s.\,t.}\quad X^{\top}MX=I_{p},

where \Sigma\in\mathbb{R}^{(n+m)\times(n+m)} and M\in\mathbb{R}^{(n+m)\times(n+m)} are two matrices generated by A and B, and \mathrm{tr}(\cdot) denotes the trace of a given square matrix. Specifically, \Sigma and M have the following form,

Σ=[AAABBABB],M=[AA00BB],\Sigma=\begin{bmatrix}AA^{\top}&AB^{\top}\\ BA^{\top}&BB^{\top}\end{bmatrix},\quad M=\begin{bmatrix}AA^{\top}&0\\ 0&BB^{\top}\end{bmatrix},

respectively.

In this paper, we consider the distributed setting in which the samples contained in A and B are stored locally in d locations, possibly having been collected and owned by different agents. Suppose each agent i possesses q_{i} samples with q_{1}+q_{2}+\dotsb+q_{d}=q. Let A_{i}\in\mathbb{R}^{n\times q_{i}} and B_{i}\in\mathbb{R}^{m\times q_{i}} denote the local data of agent i. Then the data matrices A and B can be divided into d blocks respectively, namely, A=[A_{1}\;A_{2}\;\dotsb\;A_{d}] and B=[B_{1}\;B_{2}\;\dotsb\;B_{d}]. Under this distributed setting, the optimization model (1.2) of CCA can be recast as the following form:

\min_{X\in\mathbb{R}^{(n+m)\times p}}\quad-\frac{1}{2}\sum_{i=1}^{d}\mathrm{tr}\left(X^{\top}\Sigma_{i}X\right) (1.3)

\mathrm{s.\,t.}\quad\sum_{i=1}^{d}X^{\top}M_{i}X=I_{p},

where \Sigma_{i}\in\mathbb{R}^{(n+m)\times(n+m)} and M_{i}\in\mathbb{R}^{(n+m)\times(n+m)} are given by

Σi=[AiAiAiBiBiAiBiBi],Mi=[AiAi00BiBi],\Sigma_{i}=\begin{bmatrix}A_{i}A_{i}^{\top}&A_{i}B_{i}^{\top}\\ B_{i}A_{i}^{\top}&B_{i}B_{i}^{\top}\end{bmatrix},\quad M_{i}=\begin{bmatrix}A_{i}A_{i}^{\top}&0\\ 0&B_{i}B_{i}^{\top}\end{bmatrix},

respectively. There are other optimization problems in statistics with structures similar to CCA. Interested readers can refer to the references [12, 17] for further details.
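To make the distributed splitting concrete, the following sketch uses synthetic data (the helper `local_blocks` is our own naming, not from the paper) and verifies that the local matrices \Sigma_{i} and M_{i} in (1.3) sum to the centralized \Sigma and M in (1.2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, q, d = 5, 4, 12, 3  # feature dimensions, sample count, agents

# Synthetic datasets whose q samples are split column-wise across d agents.
A = rng.standard_normal((n, q))
B = rng.standard_normal((m, q))
splits = np.array_split(np.arange(q), d)

def local_blocks(Ai, Bi):
    """Sigma_i and M_i of agent i, built from its local samples only."""
    Sigma_i = np.block([[Ai @ Ai.T, Ai @ Bi.T],
                        [Bi @ Ai.T, Bi @ Bi.T]])
    M_i = np.block([[Ai @ Ai.T, np.zeros((Ai.shape[0], Bi.shape[0]))],
                    [np.zeros((Bi.shape[0], Ai.shape[0])), Bi @ Bi.T]])
    return Sigma_i, M_i

pairs = [local_blocks(A[:, idx], B[:, idx]) for idx in splits]
Sigma = sum(S for S, _ in pairs)   # = sum_i Sigma_i
M = sum(Mi for _, Mi in pairs)     # = sum_i M_i

# The sums recover the centralized matrices of (1.2).
Sigma_full = np.block([[A @ A.T, A @ B.T], [B @ A.T, B @ B.T]])
M_full = np.block([[A @ A.T, np.zeros((n, m))],
                   [np.zeros((m, n)), B @ B.T]])
```

The equality holds because a Gram matrix such as AA^{\top} decomposes sample-wise: AA^{\top}=\sum_{i}A_{i}A_{i}^{\top}.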

1.1 Related Works

Decentralized optimization has experienced significant advancements in recent decades, particularly in the Euclidean space. Various algorithms have been proposed to tackle different types of problems, such as gradient-based algorithms [24, 41, 27, 31], primal-dual frameworks [29, 22, 8, 18], and second-order methods [5, 43, 13]. In general, these algorithms are only capable of handling scenarios where variables are restricted to a convex subset of the Euclidean space. Consequently, they cannot be directly applied to solve (1.1), whose feasible region is typically non-convex. Interested readers can refer to recent surveys [25, 9] for more comprehensive information.

In order to solve decentralized optimization problems on manifolds, many algorithms have adopted geometric tools derived from Riemannian optimization, including tangent spaces and retraction operators. For instance, the algorithms delineated in [11, 14] extend the Riemannian gradient descent method [1] to the decentralized setting, which can be combined with gradient tracking techniques [32] to achieve exact convergence. Building on this foundation, Chen et al. [10] further introduce the decentralized Riemannian conjugate gradient method. Additionally, several algorithms are devised to address nonsmooth optimization problems, such as the subgradient algorithm [34] and the proximal gradient algorithm [38]. It is crucial to underscore that the computation of tangent spaces and retraction operators requires complete information about the matrix M, which is unattainable in a decentralized environment. This inherent limitation hinders the application of the aforementioned algorithms to optimization problems with decentralized generalized orthogonality constraints.

There is another class of methodologies [35, 37, 36, 33] that focuses on employing infeasible approaches, such as augmented Lagrangian methods, to tackle nonconvex manifold constraints. These algorithms do not require each iterate to strictly adhere to manifold constraints, thereby allowing the pursuit of global consensus directly in the Euclidean space. Compared to Riemannian optimization methods, this type of algorithm requires only a single round of communication per iteration to guarantee global convergence. However, it is noteworthy that these algorithms are tailored for Stiefel manifolds and cannot handle scenarios where the constraints themselves exhibit a distributed structure. Although we can draw inspiration from these algorithms to tackle the problem (1.1), the resulting penalty model remains challenging to solve. On the one hand, the penalty function usually forfeits the distributed structure inherent in the constraint and is no longer separable across the agents. On the other hand, without full knowledge of the matrix M, each agent cannot independently compute its own local gradient. Consequently, it is impossible to straightforwardly extend existing algorithms to solve the problem (1.1).

1.2 Contributions

In decentralized optimization, current research mainly focuses on scenarios where the objective function exhibits a distributed structure. However, in problem (1.1), the generalized orthogonality constraint also exhibits a similar structure, which poses an enormous difficulty in solving it.

To develop a decentralized algorithm for the problem (1.1), we employ the constraint dissolving operator introduced in [40] to construct an exact penalty model, which cannot be solved by existing algorithms. To efficiently minimize our proposed penalty model, we devise an approximate direction for the gradient of the penalty function, which is composed of separable components, namely, the gradient of the objective function and the Jacobian of the constraint mapping. Then, we propose to track these two components simultaneously across the network and assemble them into the approximate direction. This double-tracking strategy is quite efficient in reaching a consensus on the generalized Stiefel manifold.

Based on the aforementioned techniques, we develop a novel constraint dissolving algorithm with double tracking (CDADT). To the best of our knowledge, this is the first algorithm capable of solving optimization problems with decentralized generalized orthogonality constraints. Under rather mild conditions, we establish the global convergence of CDADT and provide its iteration complexity. Preliminary numerical results demonstrate the great potential of CDADT.

1.3 Organization

The rest of this paper is organized as follows. Section 2 presents some preliminaries related to the topic of this paper. In Section 3, we develop a constraint dissolving algorithm with double tracking to solve the problem (1.1). The convergence properties of the proposed algorithm are investigated in Section 4. Numerical results are presented in Section 5 to evaluate the performance of our algorithm. Finally, Section 6 concludes the paper with key insights.

2 Preliminaries

In this section, we first present fundamental notations and network settings considered in this paper. Following this, we revisit the first-order stationarity condition of (1.1) and introduce the concepts of Kurdyka-Łojasiewicz (KŁ) property and constraint dissolving operator.

2.1 Notations

The following notations are adopted throughout this paper. The Euclidean inner product of two matrices Y_{1},Y_{2} of the same size is defined as \left\langle Y_{1},Y_{2}\right\rangle=\mathrm{tr}(Y_{1}^{\top}Y_{2}), where \mathrm{tr}(B) stands for the trace of a square matrix B. The p\times p identity matrix is represented by I_{p}\in\mathbb{R}^{p\times p}. We define the symmetric part of a square matrix B as \mathrm{sym}(B):=(B+B^{\top})/2. The Frobenius norm and 2-norm of a given matrix C are denoted by \left\|C\right\|_{\mathrm{F}} and \left\|C\right\|_{2}, respectively. The (i,j)-th entry of a matrix C is represented by C(i,j). The notations \mathbf{1}_{d}\in\mathbb{R}^{d} and \mathbf{0}_{d}\in\mathbb{R}^{d} stand for the d-dimensional vectors of all ones and all zeros, respectively. The notations \sigma_{\min}(X) and \sigma_{\max}(X) represent the smallest and largest singular values of a matrix X, respectively. The Kronecker product is denoted by \otimes. For any d\in\mathbb{N}_{+}, we denote [d]:=\{1,2,\dotsc,d\}. We define the distance between a point X and a set \mathcal{C} by \mathrm{dist}(X,\mathcal{C}):=\inf\{\left\|Y-X\right\|_{\mathrm{F}}\mid Y\in\mathcal{C}\}. Given a differentiable function g:\mathbb{R}^{n\times p}\to\mathbb{R}, the Euclidean gradient of g with respect to X is represented by \nabla g(X). Further notations will be introduced wherever they occur.

2.2 Network Setting

We assume that the d agents are connected by a communication network, and they can only exchange information with their immediate neighbors. The network \mathtt{G}=(\mathtt{V},\mathtt{E}) captures the communication links diffusing information among the agents. Here, \mathtt{V}=[d] is composed of all the agents, and \mathtt{E}=\{(i,j)\mid i\text{ and }j\text{ are connected}\} represents the set of communication links. Throughout this paper, we make the following assumptions on the network.

Assumption 2.

The communication network \mathtt{G}=(\mathtt{V},\mathtt{E}) is connected. Furthermore, there exists a mixing matrix W=[W(i,j)]\in\mathbb{R}^{d\times d} associated with \mathtt{G} satisfying the following conditions.

  (i) W is symmetric and nonnegative.

  (ii) W\mathbf{1}_{d}=W^{\top}\mathbf{1}_{d}=\mathbf{1}_{d}.

  (iii) W(i,j)=0 if i\neq j and (i,j)\notin\mathtt{E}, and W(i,j)>0 otherwise.

The conditions in Assumption 2 follow from standard assumptions in the literature [39, 25] and are dictated by the underlying network topology. According to the Perron-Frobenius Theorem [26], the eigenvalues of W fall within the range (-1,1], and hence,

\lambda:=\left\|W-\mathbf{1}_{d}\mathbf{1}_{d}^{\top}/d\right\|_{2}<1. (2.1)

The parameter \lambda serves as a key indicator of the network connectivity and is instrumental in the analysis of decentralized methods. Generally speaking, the closer \lambda is to 1, the poorer the network connectivity.
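As an illustration (our own construction, not from the paper), the sketch below builds a Metropolis mixing matrix for a ring network, which satisfies the conditions of Assumption 2, and evaluates \lambda from (2.1); longer rings yield \lambda closer to 1:

```python
import numpy as np

def ring_metropolis(d):
    """Metropolis weights on a ring of d agents: symmetric, doubly
    stochastic, and W(i,j) > 0 only between immediate neighbors."""
    W = np.zeros((d, d))
    for i in range(d):
        for j in ((i - 1) % d, (i + 1) % d):
            W[i, j] = 1.0 / 3.0          # 1 / (max(deg_i, deg_j) + 1)
        W[i, i] = 1.0 - W[i].sum()       # make each row sum to one
    return W

def connectivity(W):
    """The parameter lambda = ||W - 11^T / d||_2 from (2.1)."""
    d = W.shape[0]
    return np.linalg.norm(W - np.ones((d, d)) / d, 2)
```

For instance, `connectivity(ring_metropolis(5))` is noticeably smaller than `connectivity(ring_metropolis(20))`, reflecting the poorer connectivity of the longer ring.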

2.3 Stationarity

In this subsection, we delve into the first-order stationarity condition of the problem (1.1). Towards this end, we introduce some geometric concepts of Riemannian manifolds. For each point X\in\mathcal{S}_{M}^{n,p}, the tangent space to \mathcal{S}_{M}^{n,p} at X is referred to as \mathcal{T}_{X}:=\{D\in\mathbb{R}^{n\times p}\mid D^{\top}MX+X^{\top}MD=0\}. In this paper, we consider the Riemannian metric \left\langle\cdot,\cdot\right\rangle_{M} on \mathcal{T}_{X} induced from the inner product, i.e., \left\langle V_{1},V_{2}\right\rangle_{M}=\left\langle V_{1},MV_{2}\right\rangle=\mathrm{tr}(V_{1}^{\top}MV_{2}). The corresponding Riemannian gradient of a smooth function f is given by

\mathrm{grad}\,f(X):=M^{-1}\nabla f(X)-X\,\mathrm{sym}(X^{\top}\nabla f(X)),

which is nothing but the projection of \nabla f(X) onto \mathcal{T}_{X} under the metric \left\langle\cdot,\cdot\right\rangle_{M}. Finally, the first-order stationarity condition of the problem (1.1) can be stated as follows.

Definition 2.1.

A point X\in\mathcal{S}_{M}^{n,p} is called a first-order stationary point of the problem (1.1) if it satisfies the following condition,

\mathrm{grad}\,f(X)=0.

Since this paper focuses on infeasible algorithms for the problem (1.1), we introduce the following definition of an \epsilon-stationary point.

Definition 2.2.

A point X\in\mathbb{R}^{n\times p} is called a first-order \epsilon-stationary point of the problem (1.1) if it satisfies the following condition,

\max\left\{\left\|\mathrm{grad}\,f(\mathcal{P}(X))\right\|_{\mathrm{F}},\,\left\|X^{\top}MX-I_{p}\right\|_{\mathrm{F}}\right\}\leq\epsilon,

where \mathcal{P}(\cdot) is the projection operator onto the generalized Stiefel manifold \mathcal{S}_{M}^{n,p}.
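The condition in Definition 2.2 can be checked numerically. The sketch below (our own helper names; M assumed positive definite) uses the closed form \mathcal{P}(X)=X(X^{\top}MX)^{-1/2}, which is valid whenever X^{\top}MX is positive definite and is also used later in the proof of Lemma 3.1:

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

def project(X, M):
    """P(X) = X (X^T M X)^{-1/2}, the projection onto S_M^{n,p}."""
    return X @ inv_sqrt(X.T @ M @ X)

def riemannian_grad(X, M, G):
    """grad f(X) = M^{-1} G - X sym(X^T G), where G = nabla f(X)."""
    sym_part = (X.T @ G + G.T @ X) / 2
    return np.linalg.solve(M, G) - X @ sym_part

def is_eps_stationary(X, M, grad_f, eps):
    """Check the condition of Definition 2.2 for a tolerance eps."""
    PX = project(X, M)
    kkt = np.linalg.norm(riemannian_grad(PX, M, grad_f(PX)), 'fro')
    feas = np.linalg.norm(X.T @ M @ X - np.eye(X.shape[1]), 'fro')
    return max(kkt, feas) <= eps
```

For example, with M=I_{3}, \Sigma=\mathrm{diag}(3,2,1), and f(X)=-\frac{1}{2}\mathrm{tr}(X^{\top}\Sigma X), the leading eigenvector (1,0,0)^{\top} passes the check for tiny \epsilon, while a normalized non-eigenvector does not.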

2.4 Kurdyka-Łojasiewicz Property

A part of the convergence results developed in this paper falls within the scope of a general class of functions that satisfy the Kurdyka-Łojasiewicz (KŁ) property [23, 20]. Below, we introduce the basic elements to be used in the subsequent theoretical analysis.

For any \tau>0, we denote by \Phi_{\tau} the class of all concave and continuous functions \phi:[0,\tau)\to\mathbb{R}_{+} which satisfy the following conditions,

  (i) \phi(0)=0;

  (ii) \phi is continuously differentiable on (0,\tau) and continuous at 0;

  (iii) \phi^{\prime}(t)>0 for any t\in(0,\tau).

Now we define the KŁ property.

Definition 2.3.

Let g:\mathbb{R}^{n}\to(-\infty,+\infty] be a proper and lower semicontinuous function and \partial g be the (limiting) subdifferential of g.

  (i) The function g is said to satisfy the KŁ property at \bar{u}\in\mathrm{dom}(\partial g):=\{u\in\mathbb{R}^{n}\mid\partial g(u)\neq\emptyset\} if there exist a constant \tau\in(0,+\infty], a neighborhood \mathcal{U} of \bar{u}, and a function \phi\in\Phi_{\tau}, such that for any u\in\mathcal{U} satisfying g(\bar{u})<g(u)<g(\bar{u})+\tau, the following KŁ inequality holds,

    \phi^{\prime}(g(u)-g(\bar{u}))\,\mathrm{dist}(0,\partial g(u))\geq 1.

    The function \phi is called a desingularizing function of g at \bar{u}.

  (ii) We say g is a KŁ function if g satisfies the KŁ property at each point of \mathrm{dom}(\partial g).

KŁ functions are ubiquitous in practical applications and cover a wealth of nonconvex nonsmooth functions. For example, tame functions constitute a wide class of KŁ functions, including semialgebraic and real subanalytic functions. We refer interested readers to [6, 2, 3, 4, 7] for more details.

2.5 Constraint Dissolving Operator

The constraint dissolving operator proposed in [40] offers a powerful technique to handle manifold constraints, which can be leveraged to construct an unconstrained penalty model for Riemannian optimization problems. Specifically, a constraint dissolving operator of the generalized orthogonality constraint (1.1b) is given by

\mathcal{A}(X):=\dfrac{1}{2}X\left(3I_{p}-\sum_{i=1}^{d}X^{\top}M_{i}X\right).

Then solving the problem (1.1) can be converted into the unconstrained minimization of the following penalty function,

\min_{X\in\mathbb{R}^{n\times p}}\quad h(X):=\sum_{i=1}^{d}f_{i}\left(\mathcal{A}(X)\right)+\dfrac{\beta}{4}\left\|\sum_{i=1}^{d}X^{\top}M_{i}X-I_{p}\right\|^{2}_{\mathrm{F}}, (2.2)

where β>0\beta>0 is a penalty parameter.

Xiao et al. [40] have proved that (1.1) and (2.2) share the same first-order stationary points, second-order stationary points, and local minimizers in a neighborhood of \mathcal{S}_{M}^{n,p}. However, it is intractable to solve the problem (2.2) under the decentralized setting. It is worth noting that, in the construction of the penalty function h(X), the constraint (1.1b) is integrated into both the original objective function and the quadratic penalty term, resulting in the loss of the separable structure. To the best of our knowledge, there are currently no algorithms equipped to tackle such problems. Therefore, in the next section, we design a novel algorithm to solve the penalty model under the decentralized setting.
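A small numerical check of the defining property behind this construction, namely that \mathcal{A}(X)=X and h(X)=f(X) for any feasible X, under a hypothetical setup with synthetic data (`f` below is an arbitrary smooth test objective, not from the paper):

```python
import numpy as np

def cd_operator(X, M):
    """Constraint dissolving operator A(X) = X (3 I_p - X^T M X) / 2."""
    p = X.shape[1]
    return X @ (3 * np.eye(p) - X.T @ M @ X) / 2

def penalty(X, M, f, beta):
    """Penalty function h(X) from (2.2), with f the total objective."""
    p = X.shape[1]
    viol = X.T @ M @ X - np.eye(p)
    return f(cd_operator(X, M)) + beta / 4 * np.linalg.norm(viol, 'fro') ** 2

rng = np.random.default_rng(1)
n, p = 6, 2
C = rng.standard_normal((n, n))
M = C @ C.T + n * np.eye(n)            # a positive definite M

# A feasible point: X = M^{-1/2} Q with orthonormal Q gives X^T M X = I_p.
w, V = np.linalg.eigh(M)
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = V @ np.diag(w ** -0.5) @ V.T @ Q

f = lambda Z: np.sum(Z ** 2)           # arbitrary smooth test objective
```

On the manifold the operator acts as the identity, so the penalty model coincides with the original objective there; off the manifold, \mathcal{A} pulls the iterate back toward \mathcal{S}_{M}^{n,p}.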

3 Algorithm Development

The purpose of this section is to develop an efficient decentralized algorithm to solve the penalty model (2.2). An approximate direction is first constructed for the gradient of the penalty function, which is easier to evaluate under the decentralized setting. Then, we propose a double-tracking strategy to assemble the approximate direction across the whole network. The resulting algorithm is capable of reaching a consensus on the generalized Stiefel manifold.

3.1 Gradient Approximation

Under the conditions in Assumption 1, the penalty function h(X) in (2.2) is continuously differentiable, and its gradient takes the following form:

\nabla h(X)=\dfrac{1}{2}\sum_{i=1}^{d}\nabla f_{i}(Z)\big|_{Z=\mathcal{A}(X)}\left(3I_{p}-X^{\top}\sum_{i=1}^{d}M_{i}X\right)-\sum_{i=1}^{d}M_{i}X\,\mathrm{sym}\left(X^{\top}\sum_{i=1}^{d}\nabla f_{i}(Z)\big|_{Z=\mathcal{A}(X)}\right)

+\beta\sum_{i=1}^{d}M_{i}X\left(X^{\top}\sum_{i=1}^{d}M_{i}X-I_{p}\right).

Observe that each agent i cannot compute the local gradient \nabla f_{i}(Z)\big|_{Z=\mathcal{A}(X)} individually, since the evaluation of \mathcal{A}(X) requires access to \{M_{i}\} over all the agents. In light of the property that \left\|\mathcal{A}(X)-X\right\|_{\mathrm{F}}=\mathcal{O}(\left\|X^{\top}MX-I_{p}\right\|_{\mathrm{F}}) whenever X is not far away from \mathcal{S}_{M}^{n,p}, we propose to approximate the local gradient \nabla f_{i}(Z)\big|_{Z=\mathcal{A}(X)} by \nabla f_{i}(Z)\big|_{Z=X}. Hereafter, \nabla f_{i}(Z)\big|_{Z=X} is denoted by \nabla f_{i}(X) for simplicity. As a result, we obtain the following approximation of \nabla h(X):

H(X)=S(X)+\beta Q(X), (3.1)

where

S(X)=\dfrac{1}{2}\nabla f(X)\left(3I_{p}-X^{\top}MX\right)-MX\,\mathrm{sym}\left(X^{\top}\nabla f(X)\right),

and

Q(X)=MX\left(X^{\top}MX-I_{p}\right).

The approximate direction H(X) of \nabla h(X) possesses the following desirable properties.

Lemma 3.1.

Let \mathcal{R}:=\{X\in\mathbb{R}^{n\times p}\mid\|X^{\top}MX-I_{p}\|_{\mathrm{F}}\leq 1/6\} be a bounded region, C_{g}:=\sup_{X\in\mathcal{R}}\left\|\nabla f(X)\right\|_{\mathrm{F}} be a positive constant, and L_{s}>0 be the local Lipschitz constant of S(X) over \mathcal{R} appearing in the proof below. Then, if \beta\geq\max\{(12+3\sqrt{42}C_{g})\sigma_{\min}^{-1/2}(M)/5,\,\sigma_{\min}^{-3/2}(M)L_{s}^{2}\}, we have

\left\|H(X)\right\|^{2}_{\mathrm{F}}\geq\dfrac{1}{2}\sigma_{\min}^{2}(M)\left\|\mathrm{grad}\,f(\mathcal{P}(X))\right\|^{2}_{\mathrm{F}}+\beta\sigma_{\min}^{1/2}(M)\left\|X^{\top}MX-I_{p}\right\|^{2}_{\mathrm{F}},

for any X\in\mathcal{R}.

Proof.

For any X\in\mathcal{R}, we have the inequalities \sigma_{\max}^{2}(M^{1/2}X)\leq 7/6 and \sigma_{\min}^{2}(M^{1/2}X)\geq 5/6, and hence,

\left\|M^{1/2}X\left(X^{\top}MX-I_{p}\right)\right\|^{2}_{\mathrm{F}}\geq\sigma_{\min}^{2}(M^{1/2}X)\left\|X^{\top}MX-I_{p}\right\|^{2}_{\mathrm{F}}\geq\dfrac{5}{6}\left\|X^{\top}MX-I_{p}\right\|^{2}_{\mathrm{F}}.

Then from the formulation of S(X), it holds that

\left\langle S(X),X\left(X^{\top}MX-I_{p}\right)\right\rangle
=\dfrac{1}{2}\left\langle\nabla f(X)\left(3I_{p}-X^{\top}MX\right),X\left(X^{\top}MX-I_{p}\right)\right\rangle-\left\langle MX\,\mathrm{sym}\left(X^{\top}\nabla f(X)\right),X\left(X^{\top}MX-I_{p}\right)\right\rangle
=\dfrac{1}{2}\left\langle\mathrm{sym}\left(X^{\top}\nabla f(X)\right),\left(X^{\top}MX-I_{p}\right)\left(3I_{p}-X^{\top}MX\right)-2X^{\top}MX\left(X^{\top}MX-I_{p}\right)\right\rangle
=-\dfrac{3}{2}\left\langle\mathrm{sym}\left(X^{\top}\nabla f(X)\right),\left(X^{\top}MX-I_{p}\right)^{2}\right\rangle,

which implies that

\left\lvert\left\langle S(X),X\left(X^{\top}MX-I_{p}\right)\right\rangle\right\rvert\leq\dfrac{3}{2}\left\|\mathrm{sym}\left(X^{\top}\nabla f(X)\right)\right\|_{\mathrm{F}}\left\|\left(X^{\top}MX-I_{p}\right)^{2}\right\|_{\mathrm{F}}
\leq\dfrac{3}{2}\left\|(M^{1/2}X)^{\top}M^{-1/2}\nabla f(X)\right\|_{\mathrm{F}}\left\|\left(X^{\top}MX-I_{p}\right)^{2}\right\|_{\mathrm{F}}
\leq\dfrac{\sqrt{42}C_{g}}{4\sigma_{\min}^{1/2}(M)}\left\|X^{\top}MX-I_{p}\right\|^{2}_{\mathrm{F}}.

Now it can be readily verified that

\left\|H(X)\right\|^{2}_{\mathrm{F}}\geq\sigma_{\min}(M)\left\|M^{-1/2}H(X)\right\|^{2}_{\mathrm{F}} (3.2)
=\sigma_{\min}(M)\left\|M^{-1/2}S(X)\right\|^{2}_{\mathrm{F}}+2\beta\sigma_{\min}(M)\left\langle S(X),X\left(X^{\top}MX-I_{p}\right)\right\rangle+\beta^{2}\sigma_{\min}(M)\left\|M^{1/2}X\left(X^{\top}MX-I_{p}\right)\right\|^{2}_{\mathrm{F}}
\geq\sigma_{\min}^{2}(M)\left\|M^{-1}S(X)\right\|^{2}_{\mathrm{F}}+\dfrac{1}{6}\left(5\beta\sigma_{\min}^{1/2}(M)-3\sqrt{42}C_{g}\right)\beta\sigma_{\min}^{1/2}(M)\left\|X^{\top}MX-I_{p}\right\|^{2}_{\mathrm{F}}
\geq\sigma_{\min}^{2}(M)\left\|M^{-1}S(X)\right\|^{2}_{\mathrm{F}}+2\beta\sigma_{\min}^{1/2}(M)\left\|X^{\top}MX-I_{p}\right\|^{2}_{\mathrm{F}},

where the last inequality follows from the condition \beta\geq(12+3\sqrt{42}C_{g})\sigma_{\min}^{-1/2}(M)/5.

Since \sigma_{\min}^{2}(M^{1/2}X)\geq 5/6, we know that X^{\top}MX is positive definite and \mathcal{P}(X)=X(X^{\top}MX)^{-1/2}. Then straightforward calculations give

X-\mathcal{P}(X)=X(X^{\top}MX)^{-1/2}((X^{\top}MX)^{1/2}+I_{p})^{-1}(X^{\top}MX-I_{p}),

which further implies that

\left\|X-\mathcal{P}(X)\right\|_{\mathrm{F}}\leq\sigma_{\min}^{-1/2}(M)\left\|X^{\top}MX-I_{p}\right\|_{\mathrm{F}}.

According to the local Lipschitz continuity of S(X), there exists a constant L_{s}>0 such that

\left\|\mathrm{grad}\,f(\mathcal{P}(X))-M^{-1}S(X)\right\|_{\mathrm{F}}=\left\|M^{-1}S(\mathcal{P}(X))-M^{-1}S(X)\right\|_{\mathrm{F}}
\leq\sigma_{\max}(M^{-1})\left\|S(\mathcal{P}(X))-S(X)\right\|_{\mathrm{F}}
\leq\sigma_{\min}^{-1}(M)L_{s}\left\|\mathcal{P}(X)-X\right\|_{\mathrm{F}}
\leq\sigma_{\min}^{-3/2}(M)L_{s}\left\|X^{\top}MX-I_{p}\right\|_{\mathrm{F}}.

Moreover, we have

\left\|\mathrm{grad}\,f(\mathcal{P}(X))\right\|^{2}_{\mathrm{F}}\leq 2\left\|\mathrm{grad}\,f(\mathcal{P}(X))-M^{-1}S(X)\right\|^{2}_{\mathrm{F}}+2\left\|M^{-1}S(X)\right\|^{2}_{\mathrm{F}}
\leq 2\sigma_{\min}^{-3}(M)L_{s}^{2}\left\|X^{\top}MX-I_{p}\right\|^{2}_{\mathrm{F}}+2\left\|M^{-1}S(X)\right\|^{2}_{\mathrm{F}}.

Combining the above relationship with (3.2) yields that

\left\|H(X)\right\|^{2}_{\mathrm{F}}\geq\dfrac{1}{2}\sigma_{\min}^{2}(M)\left\|\mathrm{grad}\,f(\mathcal{P}(X))\right\|^{2}_{\mathrm{F}}+\left(2\beta\sigma_{\min}^{1/2}(M)-\sigma_{\min}^{-1}(M)L_{s}^{2}\right)\left\|X^{\top}MX-I_{p}\right\|^{2}_{\mathrm{F}}
\geq\dfrac{1}{2}\sigma_{\min}^{2}(M)\left\|\mathrm{grad}\,f(\mathcal{P}(X))\right\|^{2}_{\mathrm{F}}+\beta\sigma_{\min}^{1/2}(M)\left\|X^{\top}MX-I_{p}\right\|^{2}_{\mathrm{F}},

where the last inequality follows from the condition \beta\geq\sigma_{\min}^{-3/2}(M)L_{s}^{2}. The proof is completed. ∎

Lemma 3.1 reveals that the norm of \mathrm{grad}\,f(\mathcal{P}(X)) and the feasibility violation of X are both controlled by the norm of H(X), as long as X\in\mathcal{R} and \beta is sufficiently large. Therefore, as an approximation of \nabla h(X), H(X) can serve as a search direction to solve the problem (2.2).
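The direction (3.1) is straightforward to evaluate once M and \nabla f(X) are available. The sketch below (synthetic data; helper names are ours) also checks a useful consequence of the definitions: on the manifold, where Q(X)=0, H(X) reduces to S(X)=M\,\mathrm{grad}\,f(X):

```python
import numpy as np

def sym(B):
    return (B + B.T) / 2

def direction_H(X, M, G, beta):
    """Approximate direction H(X) = S(X) + beta Q(X) from (3.1),
    where G stands for nabla f(X)."""
    p = X.shape[1]
    XtMX = X.T @ M @ X
    S = G @ (3 * np.eye(p) - XtMX) / 2 - M @ X @ sym(X.T @ G)
    Q = M @ X @ (XtMX - np.eye(p))
    return S + beta * Q

rng = np.random.default_rng(2)
n, p = 5, 2
C = rng.standard_normal((n, n))
M = C @ C.T + n * np.eye(n)

# A feasible point X (so that Q(X) = 0) and an arbitrary gradient G.
w, V = np.linalg.eigh(M)
Q0, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = V @ np.diag(w ** -0.5) @ V.T @ Q0
G = rng.standard_normal((n, p))

H = direction_H(X, M, G, beta=10.0)
# On the manifold, H(X) = S(X) = M * grad f(X).
grad = np.linalg.solve(M, G) - X @ sym(X.T @ G)
```

This identity is the same mechanism used in the proof of Lemma 3.1, where M^{-1}S(\mathcal{P}(X)) equals \mathrm{grad}\,f(\mathcal{P}(X)).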

3.2 Double Tracking Strategy

In the construction of H(X), each agent i\in[d] is able to compute the local gradient \nabla f_{i}(X) independently. However, the evaluation of H(X) itself cannot be distributed across the d agents since it is not separable. Nevertheless, we find that the components constituting H(X) exhibit a separable structure as follows,

H(X)=\dfrac{d}{2}U(X)\left(3I_{p}-dX^{\top}V(X)\right)-d^{2}V(X)\,\mathrm{sym}\left(X^{\top}U(X)\right)+\beta dV(X)\left(dX^{\top}V(X)-I_{p}\right), (3.3)

where

U(X):=\dfrac{1}{d}\sum_{i=1}^{d}\nabla f_{i}(X)\quad\text{and}\quad V(X):=\dfrac{1}{d}\sum_{i=1}^{d}M_{i}X.

This observation inspires us to propose a double-tracking strategy. Specifically, we first track U(X) and V(X) separately across the whole network by resorting to the dynamic average consensus protocol [45]. Then, these two components are collected together to assemble a global estimate of H(X). It is worth mentioning that V(X) is exactly the Jacobian of the constraint mapping.
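A quick algebraic sanity check (synthetic data; helper names are ours) that the separable form (3.3), assembled from the averages U(X) and V(X), reproduces H(X)=S(X)+\beta Q(X) from (3.1):

```python
import numpy as np

def sym(B):
    return (B + B.T) / 2

def H_from_31(X, M, G, beta):
    """H(X) = S(X) + beta Q(X), computed with the full M and G = nabla f(X)."""
    p = X.shape[1]
    XtMX = X.T @ M @ X
    S = G @ (3 * np.eye(p) - XtMX) / 2 - M @ X @ sym(X.T @ G)
    return S + beta * M @ X @ (XtMX - np.eye(p))

def H_from_33(X, local_grads, local_MX, beta):
    """H(X) assembled from the agent averages U(X) and V(X) as in (3.3)."""
    d, p = len(local_grads), X.shape[1]
    U = sum(local_grads) / d           # U(X) = (1/d) sum_i grad f_i(X)
    V = sum(local_MX) / d              # V(X) = (1/d) sum_i M_i X
    return (d / 2 * U @ (3 * np.eye(p) - d * X.T @ V)
            - d ** 2 * V @ sym(X.T @ U)
            + beta * d * V @ (d * X.T @ V - np.eye(p)))

rng = np.random.default_rng(3)
n, p, d, beta = 5, 2, 4, 10.0
X = rng.standard_normal((n, p))
Ms = [sym(rng.standard_normal((n, n))) for _ in range(d)]   # local M_i
Gs = [rng.standard_normal((n, p)) for _ in range(d)]        # local grad f_i(X)
```

The two formulas agree exactly, since substituting U(X)=\nabla f(X)/d and V(X)=MX/d into (3.3) recovers (3.1) term by term.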

3.3 Algorithm Description

In this subsection, we describe the proposed algorithm to solve the problem (2.2). Hereafter, the notation X_{i}^{(k)} represents the k-th iterate of X_{i}. Our algorithm introduces three auxiliary local variables U_{i}^{(k)}\in\mathbb{R}^{n\times p}, V_{i}^{(k)}\in\mathbb{R}^{n\times p}, and H_{i}^{(k)}\in\mathbb{R}^{n\times p} for each agent i at the k-th iteration. Specifically, U_{i}^{(k)} and V_{i}^{(k)} track \nabla f(X_{i}^{(k)}) and MX_{i}^{(k)}, respectively, through the exchange of local information. In addition, H_{i}^{(k)} aims at estimating the search direction based on the formulation (3.3). The key steps of our algorithm from the perspective of each agent are outlined below.

Step 1: Computing Search Direction.

We first compute an approximate search direction based on (3.3) as follows:

H_{i}^{(k)}=\dfrac{d}{2}U_{i}^{(k)}\left(3I_{p}-d(X_{i}^{(k)})^{\top}V_{i}^{(k)}\right)-d^{2}V_{i}^{(k)}\,\mathrm{sym}\left((X_{i}^{(k)})^{\top}U_{i}^{(k)}\right)+\beta dV_{i}^{(k)}\left(d(X_{i}^{(k)})^{\top}V_{i}^{(k)}-I_{p}\right). (3.4)

Step 2: Mixing Local Information.

To ensure that the local estimates XiX_{i}’s asymptotically converge to a common value, we leverage the following consensus protocol. Given the search directions Hi(k)H_{i}^{(k)}’s in the previous step, we update the local variable Xi(k+1)X_{i}^{(k+1)} by

Xi(k+1)=j=1dW(i,j)(Xj(k)ηHj(k)),X_{i}^{(k+1)}=\sum\limits_{j=1}^{d}W(i,j)\left(X_{j}^{(k)}-\eta H_{j}^{(k)}\right), (3.5)

where η>0\eta>0 is a stepsize. The above procedure can be realized in a distributed manner.
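A minimal sketch of one round of (3.5), under our own toy setup (the function name is ours, and WW is any symmetric doubly stochastic mixing matrix). Because the column sums of WW equal one, the network average takes an exact step along the averaged direction.

```python
import numpy as np

def mix_step(X, H, W, eta):
    # One consensus round (3.5): agent i averages its neighbors'
    # descent-like steps X_j - eta * H_j with mixing weights W[i, j].
    d = len(X)
    return [sum(W[i, j] * (X[j] - eta * H[j]) for j in range(d))
            for i in range(d)]
```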

Step 3: Tracking Gradient and Jacobian.

Finally, to guarantee that each Ui(k)U_{i}^{(k)} and Vi(k)V_{i}^{(k)} track the average of fi(Xi(k))\nabla f_{i}(X_{i}^{(k)}) and MiXi(k)M_{i}X_{i}^{(k)} respectively, we leverage the dynamic average consensus [45] technique. The resulting gradient and Jacobian tracking schemes read as follows,

Ui(k+1)=j=1dW(i,j)(Uj(k)+fj(Xj(k+1))fj(Xj(k))),\displaystyle U_{i}^{(k+1)}=\sum\limits_{j=1}^{d}W(i,j)\left(U_{j}^{(k)}+\nabla f_{j}(X_{j}^{(k+1)})-\nabla f_{j}(X_{j}^{(k)})\right), (3.6)
Vi(k+1)=j=1dW(i,j)(Vj(k)+MjXj(k+1)MjXj(k)),\displaystyle V_{i}^{(k+1)}=\sum\limits_{j=1}^{d}W(i,j)\left(V_{j}^{(k)}+M_{j}X_{j}^{(k+1)}-M_{j}X_{j}^{(k)}\right), (3.7)

with Ui(0)=fi(Xi(0))U_{i}^{(0)}=\nabla f_{i}(X_{i}^{(0)}) and Vi(0)=MiXi(0)V_{i}^{(0)}=M_{i}X_{i}^{(0)}.
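A minimal sketch of the dynamic average consensus update shared by (3.6) and (3.7) (names and toy setup are ours; any symmetric doubly stochastic WW works). With the stated initialization, the trackers' average coincides with the average of the tracked quantities at every iteration.

```python
import numpy as np

def track_step(T, new_vals, old_vals, W):
    # Dynamic average consensus update as in (3.6)-(3.7): mix the previous
    # trackers, corrected by the change in each agent's local quantity.
    d = len(T)
    return [sum(W[i, j] * (T[j] + new_vals[j] - old_vals[j]) for j in range(d))
            for i in range(d)]
```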

Based on the aforementioned steps, we formally present the detailed algorithmic framework in Algorithm 1, named the constraint dissolving algorithm with double tracking (CDADT).

Input: Xinitn×pX_{\mathrm{init}}\in\mathbb{R}^{n\times p}, β>0\beta>0, and η>0\eta>0.

Set k:=0k:=0.
for i[d]i\in[d] do

      Initialize Xi(k):=XinitX_{i}^{(k)}:=X_{\mathrm{init}}, Ui(k):=fi(Xinit)U_{i}^{(k)}:=\nabla f_{i}(X_{\mathrm{init}}), and Vi(k):=MiXinitV_{i}^{(k)}:=M_{i}X_{\mathrm{init}}.

while “not converged” do

      for i[d]i\in[d] do

            Compute Hi(k)H_{i}^{(k)} by (3.4).
            Update Xi(k+1)X_{i}^{(k+1)} by (3.5).
            Update Ui(k+1)U_{i}^{(k+1)} and Vi(k+1)V_{i}^{(k+1)} by (3.6) and (3.7), respectively.

      Set k:=k+1k:=k+1.

Output: {Xi(k)}i=1d\{X_{i}^{(k)}\}_{i=1}^{d}.

Algorithm 1 Constraint dissolving algorithm with double tracking (CDADT) for (1.1).

In the rest of this subsection, we exhibit the compact form of Algorithm 1. For the sake of convenience, we denote J=𝟏d𝟏d/dd×dJ=\mathbf{1}_{d}\mathbf{1}_{d}^{\top}/d\in\mathbb{R}^{d\times d}, 𝐉=JIndn×dn\mathbf{J}=J\otimes I_{n}\in\mathbb{R}^{dn\times dn}, 𝐄=𝟏dIndn×n\mathbf{E}=\mathbf{1}_{d}\otimes I_{n}\in\mathbb{R}^{dn\times n}, and 𝐖=WIndn×dn\mathbf{W}=W\otimes I_{n}\in\mathbb{R}^{dn\times dn}. It can be readily verified that (𝐖𝐉)𝐉=0(\mathbf{W}-\mathbf{J})\mathbf{J}=0. The following notations are also used in the sequel.

  • 𝐗(k)=[(X1(k)),,(Xd(k))]\mathbf{X}^{(k)}=[(X_{1}^{(k)})^{\top},\dotsc,(X_{d}^{(k)})^{\top}]^{\top}, X¯(k)=𝐄𝐗(k)/d\bar{X}^{(k)}=\mathbf{E}^{\top}\mathbf{X}^{(k)}/d, 𝐗¯(k)=𝐄X¯(k)=𝐉𝐗(k)\bar{\mathbf{X}}^{(k)}=\mathbf{E}\bar{X}^{(k)}=\mathbf{J}\mathbf{X}^{(k)}.

  • 𝐔(k)=[(U1(k)),,(Ud(k))]\mathbf{U}^{(k)}=[(U_{1}^{(k)})^{\top},\dotsc,(U_{d}^{(k)})^{\top}]^{\top}, U¯(k)=𝐄𝐔(k)/d\bar{U}^{(k)}=\mathbf{E}^{\top}\mathbf{U}^{(k)}/d, 𝐔¯(k)=𝐄U¯(k)=𝐉𝐔(k)\bar{\mathbf{U}}^{(k)}=\mathbf{E}\bar{U}^{(k)}=\mathbf{J}\mathbf{U}^{(k)}.

  • 𝐕(k)=[(V1(k)),,(Vd(k))]\mathbf{V}^{(k)}=[(V_{1}^{(k)})^{\top},\dotsc,(V_{d}^{(k)})^{\top}]^{\top}, V¯(k)=𝐄𝐕(k)/d\bar{V}^{(k)}=\mathbf{E}^{\top}\mathbf{V}^{(k)}/d, 𝐕¯(k)=𝐄V¯(k)=𝐉𝐕(k)\bar{\mathbf{V}}^{(k)}=\mathbf{E}\bar{V}^{(k)}=\mathbf{J}\mathbf{V}^{(k)}.

  • 𝐇(k)=[(H1(k)),,(Hd(k))]\mathbf{H}^{(k)}=[(H_{1}^{(k)})^{\top},\dotsc,(H_{d}^{(k)})^{\top}]^{\top}, H¯(k)=𝐄𝐇(k)/d\bar{H}^{(k)}=\mathbf{E}^{\top}\mathbf{H}^{(k)}/d, 𝐇¯(k)=𝐄H¯(k)=𝐉𝐇(k)\bar{\mathbf{H}}^{(k)}=\mathbf{E}\bar{H}^{(k)}=\mathbf{J}\mathbf{H}^{(k)}.

  • 𝐆(k)=[(f1(X1(k))),,(fd(Xd(k)))]\mathbf{G}^{(k)}=[(\nabla f_{1}(X_{1}^{(k)}))^{\top},\dotsc,(\nabla f_{d}(X_{d}^{(k)}))^{\top}]^{\top}, G¯(k)=𝐄𝐆(k)/d\bar{G}^{(k)}=\mathbf{E}^{\top}\mathbf{G}^{(k)}/d, 𝐆¯(k)=𝐄G¯(k)=𝐉𝐆(k)\bar{\mathbf{G}}^{(k)}=\mathbf{E}\bar{G}^{(k)}=\mathbf{J}\mathbf{G}^{(k)}.

  • 𝐃(k)=[(M1X1(k)),,(MdXd(k))]\mathbf{D}^{(k)}=[(M_{1}X_{1}^{(k)})^{\top},\dotsc,(M_{d}X_{d}^{(k)})^{\top}]^{\top}, D¯(k)=𝐄𝐃(k)/d\bar{D}^{(k)}=\mathbf{E}^{\top}\mathbf{D}^{(k)}/d, 𝐃¯(k)=𝐄D¯(k)=𝐉𝐃(k)\bar{\mathbf{D}}^{(k)}=\mathbf{E}\bar{D}^{(k)}=\mathbf{J}\mathbf{D}^{(k)}.

With the above notations, the main iteration loop of Algorithm 1 can be summarized in the following compact form.

{𝐗(k+1)=𝐖(𝐗(k)η𝐇(k)),𝐔(k+1)=𝐖(𝐔(k)+𝐆(k+1)𝐆(k)),𝐕(k+1)=𝐖(𝐕(k)+𝐃(k+1)𝐃(k)).\left\{\begin{aligned} \mathbf{X}^{(k+1)}&=\mathbf{W}(\mathbf{X}^{(k)}-\eta\mathbf{H}^{(k)}),\\ \mathbf{U}^{(k+1)}&=\mathbf{W}(\mathbf{U}^{(k)}+\mathbf{G}^{(k+1)}-\mathbf{G}^{(k)}),\\ \mathbf{V}^{(k+1)}&=\mathbf{W}(\mathbf{V}^{(k)}+\mathbf{D}^{(k+1)}-\mathbf{D}^{(k)}).\end{aligned}\right.

Moreover, it is not difficult to check that, for any kk\in\mathbb{N}, the following relationships hold.

X¯(k+1)=X¯(k)ηH¯(k),U¯(k)=G¯(k),and V¯(k)=D¯(k).\bar{X}^{(k+1)}=\bar{X}^{(k)}-\eta\bar{H}^{(k)},~{}\bar{U}^{(k)}=\bar{G}^{(k)},~{}\mbox{and~{}}\bar{V}^{(k)}=\bar{D}^{(k)}. (3.8)
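The relationships in (3.8) can be checked numerically in the compact notation. The sketch below is our own toy illustration with synthetic choices of WW, fi\nabla f_{i}, and MiM_{i}; the relations rely only on WW being doubly stochastic and on the tracking initialization, not on the particular search direction.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, p, eta = 3, 4, 2, 0.1
W = (np.eye(d) + np.ones((d, d)) / d) / 2      # symmetric doubly stochastic
Wb = np.kron(W, np.eye(n))                     # bold W = W ⊗ I_n
E = np.kron(np.ones((d, 1)), np.eye(n))        # bold E = 1_d ⊗ I_n

A = [rng.standard_normal((n, n)) for _ in range(d)]   # stand-ins: ∇f_i(X) = A_i X
M = [rng.standard_normal((n, n)) for _ in range(d)]

def G(Xb):  # stacked local gradients G^(k)
    return np.vstack([A[i] @ Xb[i * n:(i + 1) * n] for i in range(d)])

def D(Xb):  # stacked local products M_i X_i, i.e. D^(k)
    return np.vstack([M[i] @ Xb[i * n:(i + 1) * n] for i in range(d)])

Xb = rng.standard_normal((d * n, p))
Ub, Vb = G(Xb), D(Xb)                          # tracking initialization
for _ in range(6):
    Hb = Ub + Vb                               # any local direction suffices here
    Xn = Wb @ (Xb - eta * Hb)
    Ub = Wb @ (Ub + G(Xn) - G(Xb))
    Vb = Wb @ (Vb + D(Xn) - D(Xb))
    # First relation of (3.8): the average performs the exact step.
    assert np.allclose(E.T @ Xn / d, E.T @ Xb / d - eta * (E.T @ Hb) / d)
    Xb = Xn
    # Tracking relations of (3.8): tracker averages equal tracked averages.
    assert np.allclose(E.T @ Ub / d, E.T @ G(Xb) / d)
    assert np.allclose(E.T @ Vb / d, E.T @ D(Xb) / d)
```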

4 Convergence Analysis

In this section, we present the convergence analysis of Algorithm 1. The global convergence guarantee is rigorously established under rather mild conditions, together with an iteration complexity.

4.1 Consensus and Tracking Errors

This subsection is devoted to establishing upper bounds on the consensus and tracking errors. We start with the consensus error 𝐗(k+1)𝐗¯(k+1)F\|\mathbf{X}^{(k+1)}-\bar{\mathbf{X}}^{(k+1)}\|_{\mathrm{F}} in the following lemma.

Lemma 4.1.

Suppose the conditions in Assumption 2 hold. Then for any kk\in\mathbb{N}, it holds that

𝐗(k+1)𝐗¯(k+1)F21+λ22𝐗(k)𝐗¯(k)F2+η2C1𝐇(k)F2,\left\|\mathbf{X}^{(k+1)}-\bar{\mathbf{X}}^{(k+1)}\right\|^{2}_{\mathrm{F}}\leq\dfrac{1+\lambda^{2}}{2}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}+\eta^{2}C_{1}\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}},

where C1=λ2(1+λ2)/(1λ2)>0C_{1}=\lambda^{2}\left(1+\lambda^{2}\right)/(1-\lambda^{2})>0 is a constant.

Proof.

By the update scheme in (3.5) together with the fact that (𝐖𝐉)𝐉=0(\mathbf{W}-\mathbf{J})\mathbf{J}=0, we obtain that

𝐗(k+1)𝐗¯(k+1)F2=\displaystyle\left\|\mathbf{X}^{(k+1)}-\bar{\mathbf{X}}^{(k+1)}\right\|^{2}_{\mathrm{F}}= (𝐖𝐉)(𝐗(k)𝐗¯(k))η(𝐖𝐉)𝐇(k)F2\displaystyle\left\|(\mathbf{W}-\mathbf{J})(\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)})-\eta(\mathbf{W}-\mathbf{J})\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq (1+γ)(𝐖𝐉)(𝐗(k)𝐗¯(k))F2+η2(1+1/γ)(𝐖𝐉)𝐇(k)F2\displaystyle\left(1+\gamma\right)\left\|(\mathbf{W}-\mathbf{J})(\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)})\right\|^{2}_{\mathrm{F}}+\eta^{2}\left(1+1/\gamma\right)\left\|(\mathbf{W}-\mathbf{J})\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq λ2(1+γ)𝐗(k)𝐗¯(k)F2+η2λ2(1+1/γ)𝐇(k)F2,\displaystyle\lambda^{2}\left(1+\gamma\right)\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}+\eta^{2}\lambda^{2}\left(1+1/\gamma\right)\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}},

where γ=(1λ2)/(2λ2)>0\gamma=(1-\lambda^{2})/(2\lambda^{2})>0 is a constant. This completes the proof since λ2(1+γ)=(1+λ2)/2\lambda^{2}\left(1+\gamma\right)=(1+\lambda^{2})/2 and λ2(1+1/γ)=C1\lambda^{2}\left(1+1/\gamma\right)=C_{1}. ∎
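As a standalone numeric verification (our own check, not part of the proof), the two closing identities can be confirmed over a grid of values λ2(0,1)\lambda^{2}\in(0,1):

```python
import numpy as np

# Verify lam2*(1+gamma) = (1+lam2)/2 and lam2*(1+1/gamma) = C1,
# where gamma = (1-lam2)/(2*lam2) and C1 = lam2*(1+lam2)/(1-lam2),
# with lam2 standing for λ².
for lam2 in np.linspace(0.01, 0.99, 99):
    gamma = (1.0 - lam2) / (2.0 * lam2)
    C1 = lam2 * (1.0 + lam2) / (1.0 - lam2)
    assert np.isclose(lam2 * (1.0 + gamma), (1.0 + lam2) / 2.0)
    assert np.isclose(lam2 * (1.0 + 1.0 / gamma), C1)
```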

Next, we proceed to bound the tracking errors 𝐔(k+1)𝐔¯(k+1)F\|\mathbf{U}^{(k+1)}-\bar{\mathbf{U}}^{(k+1)}\|_{\mathrm{F}} and 𝐕(k+1)𝐕¯(k+1)F\|\mathbf{V}^{(k+1)}-\bar{\mathbf{V}}^{(k+1)}\|_{\mathrm{F}}. To facilitate the narrative, we define the bounded region ={Xn×pXFCx}\mathcal{B}=\{X\in\mathbb{R}^{n\times p}\mid\|X\|_{\mathrm{F}}\leq C_{x}\}, where Cx=d(1+C~x)>0C_{x}=\sqrt{d}(1+\tilde{C}_{x})>0 is a constant with C~x=sup{XFX}>0\tilde{C}_{x}=\sup\left\{\|X\|_{\mathrm{F}}\mid X\in\mathcal{R}\right\}>0.

Lemma 4.2.

Suppose the conditions in Assumptions 1 and 2 hold, Xi(k+1)X_{i}^{(k+1)}\in\mathcal{B} and Xi(k)X_{i}^{(k)}\in\mathcal{B} for any i[d]i\in[d]. Then we have

𝐔(k+1)𝐔¯(k+1)F21+λ22𝐔(k)𝐔¯(k)F2+8Lc2C1𝐗(k)𝐗¯(k)F2+2η2Lc2C1𝐇(k)F2,\left\|\mathbf{U}^{(k+1)}-\bar{\mathbf{U}}^{(k+1)}\right\|^{2}_{\mathrm{F}}\leq\dfrac{1+\lambda^{2}}{2}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}+8L_{c}^{2}C_{1}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}+2\eta^{2}L_{c}^{2}C_{1}\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}},

and

𝐕(k+1)𝐕¯(k+1)F21+λ22𝐕(k)𝐕¯(k)F2+8Lc2C1𝐗(k)𝐗¯(k)F2+2η2Lc2C1𝐇(k)F2,\left\|\mathbf{V}^{(k+1)}-\bar{\mathbf{V}}^{(k+1)}\right\|^{2}_{\mathrm{F}}\leq\dfrac{1+\lambda^{2}}{2}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}+8L_{c}^{2}C_{1}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}+2\eta^{2}L_{c}^{2}C_{1}\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}},

where Lc=max{Lg,Lm}>0L_{c}=\max\{L_{g},L_{m}\}>0 is a constant with Lm=sup{σmax(Mi)i[d]}>0L_{m}=\sup\left\{\sigma_{\max}(M_{i})\mid i\in[d]\right\}>0 and Lg=sup{fi(X)fi(Y)F/XYFXY,X,Y,i[d]}>0L_{g}=\sup\left\{\left\|\nabla f_{i}(X)-\nabla f_{i}(Y)\right\|_{\mathrm{F}}/\left\|X-Y\right\|_{\mathrm{F}}\mid X\neq Y,X\in\mathcal{B},Y\in\mathcal{B},i\in[d]\right\}>0.

Proof.

To begin with, straightforward manipulations lead to that

𝐔(k+1)𝐔¯(k+1)F2\displaystyle\left\|\mathbf{U}^{(k+1)}-\bar{\mathbf{U}}^{(k+1)}\right\|^{2}_{\mathrm{F}} =(𝐖𝐉)(𝐔(k)𝐔¯(k))+(𝐖𝐉)(𝐆(k+1)𝐆(k))F2\displaystyle=\left\|(\mathbf{W}-\mathbf{J})(\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)})+(\mathbf{W}-\mathbf{J})(\mathbf{G}^{(k+1)}-\mathbf{G}^{(k)})\right\|^{2}_{\mathrm{F}}
(1+γ)(𝐖𝐉)(𝐔(k)𝐔¯(k))F2+(1+1/γ)(𝐖𝐉)(𝐆(k+1)𝐆(k))F2\displaystyle\leq\left(1+\gamma\right)\left\|(\mathbf{W}-\mathbf{J})(\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)})\right\|^{2}_{\mathrm{F}}+\left(1+1/\gamma\right)\left\|(\mathbf{W}-\mathbf{J})(\mathbf{G}^{(k+1)}-\mathbf{G}^{(k)})\right\|^{2}_{\mathrm{F}}
λ2(1+γ)𝐔(k)𝐔¯(k)F2+λ2(1+1/γ)𝐆(k+1)𝐆(k)F2,\displaystyle\leq\lambda^{2}\left(1+\gamma\right)\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}+\lambda^{2}\left(1+1/\gamma\right)\left\|\mathbf{G}^{(k+1)}-\mathbf{G}^{(k)}\right\|^{2}_{\mathrm{F}},

where γ=(1λ2)/(2λ2)>0\gamma=(1-\lambda^{2})/(2\lambda^{2})>0 is a constant. According to the local Lipschitz continuity of fi\nabla f_{i}, it follows that

𝐆(k+1)𝐆(k)FLg𝐗(k+1)𝐗(k)F.\left\|\mathbf{G}^{(k+1)}-\mathbf{G}^{(k)}\right\|_{\mathrm{F}}\leq L_{g}\left\|\mathbf{X}^{(k+1)}-\mathbf{X}^{(k)}\right\|_{\mathrm{F}}.

Moreover, it can be readily verified that

𝐗(k+1)𝐗(k)=𝐖(𝐗(k)η𝐇(k))𝐗(k)=(𝐖Idn)(𝐗(k)𝐗¯(k))η𝐖𝐇(k),\mathbf{X}^{(k+1)}-\mathbf{X}^{(k)}=\mathbf{W}(\mathbf{X}^{(k)}-\eta\mathbf{H}^{(k)})-\mathbf{X}^{(k)}=(\mathbf{W}-I_{dn})(\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)})-\eta\mathbf{W}\mathbf{H}^{(k)}, (4.1)

which, together with 𝐖Idn22\|\mathbf{W}-I_{dn}\|_{2}\leq 2 and 𝐖21\|\mathbf{W}\|_{2}\leq 1, implies that

𝐗(k+1)𝐗(k)F28𝐗(k)𝐗¯(k)F2+2η2𝐇(k)F2.\left\|\mathbf{X}^{(k+1)}-\mathbf{X}^{(k)}\right\|^{2}_{\mathrm{F}}\leq 8\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}+2\eta^{2}\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}}.

Combining the above three relationships, we can obtain the bound for 𝐔(k+1)𝐔¯(k+1)F\|\mathbf{U}^{(k+1)}-\bar{\mathbf{U}}^{(k+1)}\|_{\mathrm{F}}. Furthermore, the bound for 𝐕(k+1)𝐕¯(k+1)F\|\mathbf{V}^{(k+1)}-\bar{\mathbf{V}}^{(k+1)}\|_{\mathrm{F}} can be obtained by using the same technique, hence its proof is omitted for simplicity. ∎

The following lemma demonstrates that dUi(k)dU_{i}^{(k)} and dVi(k)dV_{i}^{(k)} are estimates of f(X¯(k))\nabla f(\bar{X}^{(k)}) and MX¯(k)M\bar{X}^{(k)} for each agent ii, respectively.

Lemma 4.3.

Suppose that Xi(k)X_{i}^{(k)}\in\mathcal{B} for any i[d]i\in[d]. Then, under the conditions in Assumptions 1 and 2, the following two inequalities hold

dUi(k)f(X¯(k))F22d2Ui(k)U¯(k)F2+2dLc2𝐗(k)𝐗¯(k)F2,\left\|dU_{i}^{(k)}-\nabla f(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq 2d^{2}\left\|U_{i}^{(k)}-\bar{U}^{(k)}\right\|^{2}_{\mathrm{F}}+2dL_{c}^{2}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}},

and

dVi(k)MX¯(k)F22d2Vi(k)V¯(k)F2+2dLc2𝐗(k)𝐗¯(k)F2.\left\|dV_{i}^{(k)}-M\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}\leq 2d^{2}\left\|V_{i}^{(k)}-\bar{V}^{(k)}\right\|^{2}_{\mathrm{F}}+2dL_{c}^{2}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}.
Proof.

According to the local Lipschitz continuity of fi\nabla f_{i}, it follows that

dU¯(k)f(X¯(k))F2=\displaystyle\left\|d\bar{U}^{(k)}-\nabla f(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}={} dG¯(k)f(X¯(k))F2=i=1d(fi(Xi(k))fi(X¯(k)))F2\displaystyle\left\|d\bar{G}^{(k)}-\nabla f(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}=\left\|\sum_{i=1}^{d}(\nabla f_{i}(X_{i}^{(k)})-\nabla f_{i}(\bar{X}^{(k)}))\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} di=1dfi(Xi(k))fi(X¯(k))F2dLg2𝐗(k)𝐗¯(k)F2,\displaystyle d\sum_{i=1}^{d}\left\|\nabla f_{i}(X_{i}^{(k)})-\nabla f_{i}(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq dL_{g}^{2}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}},

which further yields that

dUi(k)f(X¯(k))F2=\displaystyle\left\|dU_{i}^{(k)}-\nabla f(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}={} dUi(k)dU¯(k)+dU¯(k)f(X¯(k))F2\displaystyle\left\|dU_{i}^{(k)}-d\bar{U}^{(k)}+d\bar{U}^{(k)}-\nabla f(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 2d2Ui(k)U¯(k)F2+2dU¯(k)f(X¯(k))F2\displaystyle 2d^{2}\left\|U_{i}^{(k)}-\bar{U}^{(k)}\right\|^{2}_{\mathrm{F}}+2\left\|d\bar{U}^{(k)}-\nabla f(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 2d2Ui(k)U¯(k)F2+2dLg2𝐗(k)𝐗¯(k)F2.\displaystyle 2d^{2}\left\|U_{i}^{(k)}-\bar{U}^{(k)}\right\|^{2}_{\mathrm{F}}+2dL_{g}^{2}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}.

Hence, we can conclude that the first assertion of this lemma holds. The second assertion can be proved by using a similar argument. ∎
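The key inequality in the proof above is the standard norm-of-sum bound a1++adF2di=1daiF2\|a_{1}+\dotsb+a_{d}\|_{\mathrm{F}}^{2}\leq d\sum_{i=1}^{d}\|a_{i}\|_{\mathrm{F}}^{2}, a consequence of the Cauchy–Schwarz inequality. A quick standalone numeric check (our own illustration):

```python
import numpy as np

# Check ||sum_i a_i||_F^2 <= d * sum_i ||a_i||_F^2 for random matrices.
rng = np.random.default_rng(4)
d, n, p = 5, 4, 3
a = [rng.standard_normal((n, p)) for _ in range(d)]
lhs = np.linalg.norm(sum(a)) ** 2
rhs = d * sum(np.linalg.norm(ai) ** 2 for ai in a)
assert lhs <= rhs + 1e-12
```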

We conclude this subsection by showing that H¯(k)\bar{H}^{(k)} is an approximation of H(X¯(k))H(\bar{X}^{(k)}) with the approximation error controlled by the consensus and tracking errors. For convenience, we denote two constants Cu=d(1+Cg)>0C_{u}=\sqrt{d}(1+C_{g})>0 and Cv=d(1+CxLm)>0C_{v}=\sqrt{d}(1+C_{x}L_{m})>0 to be used in the following lemma.

Lemma 4.4.

Let the conditions in Assumptions 1 and 2 hold. Suppose that Xi(k)FCx\|X_{i}^{(k)}\|_{\mathrm{F}}\leq C_{x}, Ui(k)FCu\|U_{i}^{(k)}\|_{\mathrm{F}}\leq C_{u}, and Vi(k)FCv\|V_{i}^{(k)}\|_{\mathrm{F}}\leq C_{v} for any i[d]i\in[d]. Then we have

H¯(k)H(X¯(k))F2\displaystyle\left\|\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq{} C2d𝐔(k)𝐔¯(k)F2+C2+C3β2d𝐕(k)𝐕¯(k)F2\displaystyle\dfrac{C_{2}}{d}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}+\dfrac{C_{2}+C_{3}\beta^{2}}{d}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}
+C2+C3β2d𝐗(k)𝐗¯(k)F2,\displaystyle+\dfrac{C_{2}+C_{3}\beta^{2}}{d}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}},

where C2>0C_{2}>0 and C3>0C_{3}>0 are two constants.

Proof.

To begin with, we have

Hi(k)H(X¯(k))F2\displaystyle\left\|H_{i}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq{} 34dUi(k)(3Ipd(Xi(k))Vi(k))f(X¯(k))(3Ip(X¯(k))MX¯(k))F2\displaystyle\dfrac{3}{4}\left\|dU_{i}^{(k)}(3I_{p}-d(X_{i}^{(k)})^{\top}V_{i}^{(k)})-\nabla f(\bar{X}^{(k)})(3I_{p}-(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}
+3d2Vi(k)sym((Xi(k))Ui(k))MX¯(k)sym((X¯(k))f(X¯(k)))F2 \displaystyle+3\left\|d^{2}V_{i}^{(k)}\mathrm{sym}((X_{i}^{(k)})^{\top}U_{i}^{(k)})-M\bar{X}^{(k)}\mathrm{sym}((\bar{X}^{(k)})^{\top}\nabla f(\bar{X}^{(k)}))\right\|^{2}_{\mathrm{F}}
+3β2dVi(k)(d(Xi(k))Vi(k)Ip)MX¯(k)((X¯(k))MX¯(k)Ip)F2.\displaystyle+3\beta^{2}\left\|dV_{i}^{(k)}(d(X_{i}^{(k)})^{\top}V_{i}^{(k)}-I_{p})-M\bar{X}^{(k)}((\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p})\right\|^{2}_{\mathrm{F}}.

By straightforward calculations, we can obtain that

dUi(k)(3Ipd(Xi(k))Vi(k))f(X¯(k))(3Ip(X¯(k))MX¯(k))F2\displaystyle\left\|dU_{i}^{(k)}(3I_{p}-d(X_{i}^{(k)})^{\top}V_{i}^{(k)})-\nabla f(\bar{X}^{(k)})(3I_{p}-(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 3(dUi(k)f(X¯(k)))(3Ipd(Xi(k))Vi(k))F2+3f(X¯(k))(X¯(k)Xi(k))MX¯(k)F2\displaystyle 3\left\|(dU_{i}^{(k)}-\nabla f(\bar{X}^{(k)}))(3I_{p}-d(X_{i}^{(k)})^{\top}V_{i}^{(k)})\right\|^{2}_{\mathrm{F}}+3\left\|\nabla f(\bar{X}^{(k)})(\bar{X}^{(k)}-X_{i}^{(k)})^{\top}M\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}
+3f(X¯(k))(Xi(k))(MX¯(k)dVi(k))F2\displaystyle+3\left\|\nabla f(\bar{X}^{(k)})(X_{i}^{(k)})^{\top}(M\bar{X}^{(k)}-dV_{i}^{(k)})\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 3(18p+2d2Cx2Cv2)dUi(k)f(X¯(k))F2+3σmax2(M)Cx2Cg2Xi(k)X¯(k)F2\displaystyle 3(18p+2d^{2}C_{x}^{2}C_{v}^{2})\left\|dU_{i}^{(k)}-\nabla f(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+3\sigma_{\max}^{2}(M)C_{x}^{2}C_{g}^{2}\left\|X_{i}^{(k)}-\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}
+3Cx2Cg2dVi(k)MX¯(k)F2.\displaystyle+3C_{x}^{2}C_{g}^{2}\left\|dV_{i}^{(k)}-M\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}.

And it can be readily verified that

d2Vi(k)sym((Xi(k))Ui(k))MX¯(k)sym((X¯(k))f(X¯(k)))F2 \displaystyle\left\|d^{2}V_{i}^{(k)}\mathrm{sym}((X_{i}^{(k)})^{\top}U_{i}^{(k)})-M\bar{X}^{(k)}\mathrm{sym}((\bar{X}^{(k)})^{\top}\nabla f(\bar{X}^{(k)}))\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 3d2(dVi(k)MX¯(k))(Xi(k))Ui(k)F2+3d2MX¯(k)(Xi(k)X¯(k))Ui(k)F2\displaystyle 3d^{2}\left\|(dV_{i}^{(k)}-M\bar{X}^{(k)})(X_{i}^{(k)})^{\top}U_{i}^{(k)}\right\|^{2}_{\mathrm{F}}+3d^{2}\left\|M\bar{X}^{(k)}(X_{i}^{(k)}-\bar{X}^{(k)})^{\top}U_{i}^{(k)}\right\|^{2}_{\mathrm{F}}
+3MX¯(k)(X¯(k))(dUi(k)f(X¯(k)))F2\displaystyle+3\left\|M\bar{X}^{(k)}(\bar{X}^{(k)})^{\top}(dU_{i}^{(k)}-\nabla f(\bar{X}^{(k)}))\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 3d2Cx2Cu2dVi(k)MX¯(k)F2+3d2σmax2(M)Cx2Cu2Xi(k)X¯(k)F2\displaystyle 3d^{2}C_{x}^{2}C_{u}^{2}\left\|dV_{i}^{(k)}-M\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}+3d^{2}\sigma_{\max}^{2}(M)C_{x}^{2}C_{u}^{2}\left\|X_{i}^{(k)}-\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}
+3σmax2(M)Cx4dUi(k)f(X¯(k))F2.\displaystyle+3\sigma_{\max}^{2}(M)C_{x}^{4}\left\|dU_{i}^{(k)}-\nabla f(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}.

Moreover, we have

dVi(k)(d(Xi(k))Vi(k)Ip)MX¯(k)((X¯(k))MX¯(k)Ip)F2\displaystyle\left\|dV_{i}^{(k)}(d(X_{i}^{(k)})^{\top}V_{i}^{(k)}-I_{p})-M\bar{X}^{(k)}((\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p})\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 3(dVi(k)MX¯(k))(d(Xi(k))Vi(k)Ip)F2+3MX¯(k)(Xi(k))(dVi(k)MX¯(k))F2\displaystyle 3\left\|(dV_{i}^{(k)}-M\bar{X}^{(k)})(d(X_{i}^{(k)})^{\top}V_{i}^{(k)}-I_{p})\right\|^{2}_{\mathrm{F}}+3\left\|M\bar{X}^{(k)}(X_{i}^{(k)})^{\top}(dV_{i}^{(k)}-M\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}
+3MX¯(k)(Xi(k)X¯(k))MX¯(k)F2\displaystyle+3\left\|M\bar{X}^{(k)}(X_{i}^{(k)}-\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 3(2p+2d2Cx2Cv2)dVi(k)MX¯(k)F2+3σmax2(M)Cx4dVi(k)MX¯(k)F2\displaystyle 3(2p+2d^{2}C_{x}^{2}C_{v}^{2})\left\|dV_{i}^{(k)}-M\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}+3\sigma_{\max}^{2}(M)C_{x}^{4}\left\|dV_{i}^{(k)}-M\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}
+3σmax4(M)Cx4Xi(k)X¯(k)F2.\displaystyle+3\sigma_{\max}^{4}(M)C_{x}^{4}\left\|X_{i}^{(k)}-\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}.

Combining the above three relationships, we obtain that

Hi(k)H(X¯(k))F2\displaystyle\left\|H_{i}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq{} ChudUi(k)f(X¯(k))F2+(Chv+Chv′′β2)dVi(k)MX¯(k)F2\displaystyle C_{hu}\left\|dU_{i}^{(k)}-\nabla f(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+(C_{hv}^{\prime}+C_{hv}^{\prime\prime}\beta^{2})\left\|dV_{i}^{(k)}-M\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}
+(Chx+Chx′′β2)Xi(k)X¯(k)F2,\displaystyle+(C_{hx}^{\prime}+C_{hx}^{\prime\prime}\beta^{2})\left\|X_{i}^{(k)}-\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}},

where Chu=9(9p+d2Cx2Cv2+2σmax2(M)Cx4)/2C_{hu}=9(9p+d^{2}C_{x}^{2}C_{v}^{2}+2\sigma_{\max}^{2}(M)C_{x}^{4})/2, Chv=9Cx2Cg2/4+9d2Cx2Cu2C_{hv}^{\prime}=9C_{x}^{2}C_{g}^{2}/4+9d^{2}C_{x}^{2}C_{u}^{2}, Chv′′=9(2p+2d2Cx2Cv2+σmax2(M)Cx4)C_{hv}^{\prime\prime}=9(2p+2d^{2}C_{x}^{2}C_{v}^{2}+\sigma_{\max}^{2}(M)C_{x}^{4}), Chx=σmax2(M)ChvC_{hx}^{\prime}=\sigma_{\max}^{2}(M)C_{hv}^{\prime}, and Chx′′=9σmax4(M)Cx4C_{hx}^{\prime\prime}=9\sigma_{\max}^{4}(M)C_{x}^{4} are five positive constants. For convenience, we further denote two constants C2=max{1,2d2Chu,2d2Chv,d(Chx+2dLg2Chu+2dLm2Chv)}1C_{2}=\max\{1,2d^{2}C_{hu},2d^{2}C_{hv}^{\prime},d(C_{hx}^{\prime}+2dL_{g}^{2}C_{hu}+2dL_{m}^{2}C_{hv}^{\prime})\}\geq 1 and C3=max{2d2Chv′′,d(Chx′′+2dLm2Chv′′)}>0C_{3}=\max\{2d^{2}C_{hv}^{\prime\prime},d(C_{hx}^{\prime\prime}+2dL_{m}^{2}C_{hv}^{\prime\prime})\}>0. Then according to Lemma 4.3, it follows that

Hi(k)H(X¯(k))F2\displaystyle\left\|H_{i}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq{} 2d2ChuUi(k)U¯(k)F2+2d2(Chv+Chv′′β2)Vi(k)V¯(k)F2\displaystyle 2d^{2}C_{hu}\left\|U_{i}^{(k)}-\bar{U}^{(k)}\right\|^{2}_{\mathrm{F}}+2d^{2}(C_{hv}^{\prime}+C_{hv}^{\prime\prime}\beta^{2})\left\|V_{i}^{(k)}-\bar{V}^{(k)}\right\|^{2}_{\mathrm{F}} (4.2)
+(Chx+2dLg2Chu+2dLm2Chv+(Chx′′+2dLm2Chv′′)β2)𝐗(k)𝐗¯(k)F2 \displaystyle+(C_{hx}^{\prime}+2dL_{g}^{2}C_{hu}+2dL_{m}^{2}C_{hv}^{\prime}+(C_{hx}^{\prime\prime}+2dL_{m}^{2}C_{hv}^{\prime\prime})\beta^{2})\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} C2Ui(k)U¯(k)F2+(C2+C3β2)Vi(k)V¯(k)F2\displaystyle C_{2}\left\|U_{i}^{(k)}-\bar{U}^{(k)}\right\|^{2}_{\mathrm{F}}+(C_{2}+C_{3}\beta^{2})\left\|V_{i}^{(k)}-\bar{V}^{(k)}\right\|^{2}_{\mathrm{F}}
+1d(C2+C3β2)𝐗(k)𝐗¯(k)F2.\displaystyle+\dfrac{1}{d}(C_{2}+C_{3}\beta^{2})\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}.

Finally, the convexity of the squared Frobenius norm (Jensen’s inequality) yields

H¯(k)H(X¯(k))F21di=1dHi(k)H(X¯(k))F2.\left\|\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq\dfrac{1}{d}\sum_{i=1}^{d}\left\|H_{i}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}.

Combining the above two relationships, we complete the proof. ∎

4.2 Boundedness of Iterates

In this subsection, we show that the iterate sequence generated by Algorithm 1 remains in a neighborhood of the feasible region. Moreover, the average of the local variables always stays in the bounded region \mathcal{R}, which justifies the use of Lemma 3.1.

We first prove the following technical lemma.

Lemma 4.5.

Suppose that X¯(k+1)\bar{X}^{(k+1)} is generated by (3.8) with X¯(k)\bar{X}^{(k)}\in\mathcal{R} and H¯(k)H(X¯(k))F23\left\|\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq 3. Let the penalty parameter β\beta and stepsize η\eta satisfy

β864(3+Cs2)σmin(M),\beta\geq\sqrt{\dfrac{864(3+C_{s}^{2})}{\sigma_{\min}(M)}},

and

0<ηmin{14Lqβ,35σmin(M)β,β480(3+Cs2)},0<\eta\leq\min\left\{\dfrac{1}{4L_{q}\beta},\,\dfrac{3}{5\sigma_{\min}(M)\beta},\,\dfrac{\beta}{480(3+C_{s}^{2})}\right\},

respectively. Then, under the conditions in Assumptions 1 and 2, we have X¯(k+1)\bar{X}^{(k+1)}\in\mathcal{R}.

Proof.

It follows from the relationship (3.8) that

X¯(k+1)=X¯(k)ηH(X¯(k))+η(H(X¯(k))H¯(k))=X¯(k)ηβQ(X¯(k))+ηY(k),\displaystyle\bar{X}^{(k+1)}=\bar{X}^{(k)}-\eta H(\bar{X}^{(k)})+\eta(H(\bar{X}^{(k)})-\bar{H}^{(k)})=\bar{X}^{(k)}-\eta\beta Q(\bar{X}^{(k)})+\eta Y^{(k)},

where Y(k):=H(X¯(k))H¯(k)S(X¯(k))Y^{(k)}:=H(\bar{X}^{(k)})-\bar{H}^{(k)}-S(\bar{X}^{(k)}) and Y(k)Y^{(k)} satisfies

Y(k)F22H(X¯(k))H¯(k)F2+2S(X¯(k))F22(3+Cs2).\left\|Y^{(k)}\right\|^{2}_{\mathrm{F}}\leq 2\left\|H(\bar{X}^{(k)})-\bar{H}^{(k)}\right\|^{2}_{\mathrm{F}}+2\left\|S(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq 2(3+C_{s}^{2}).

Since X¯(k)\bar{X}^{(k)}\in\mathcal{R}, we have σmax2(M1/2X¯(k))7/6\sigma_{\max}^{2}(M^{1/2}\bar{X}^{(k)})\leq 7/6 and σmin2(M1/2X¯(k))5/6\sigma_{\min}^{2}(M^{1/2}\bar{X}^{(k)})\geq 5/6, and hence,

Q(X¯(k))F2=\displaystyle\left\|Q(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}={} M1/2M1/2X¯(k)((X¯(k))MX¯(k)Ip)F2\displaystyle\left\|M^{1/2}M^{1/2}\bar{X}^{(k)}((\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p})\right\|^{2}_{\mathrm{F}} (4.3)
\displaystyle\geq{} σmin2(M1/2)σmin2(M1/2X¯(k))(X¯(k))MX¯(k)IpF2\displaystyle\sigma_{\min}^{2}(M^{1/2})\sigma_{\min}^{2}(M^{1/2}\bar{X}^{(k)})\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}
\displaystyle\geq{} 56σmin(M)(X¯(k))MX¯(k)IpF2.\displaystyle\dfrac{5}{6}\sigma_{\min}(M)\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}.

By Young’s inequality, we can obtain that

X¯(k+1)X¯(k)F2=\displaystyle\left\|\bar{X}^{(k+1)}-\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}={} η2βQ(X¯(k))Y(k)F22η2β2Q(X¯(k))F2+2η2Y(k)F2\displaystyle\eta^{2}\left\|\beta Q(\bar{X}^{(k)})-Y^{(k)}\right\|^{2}_{\mathrm{F}}\leq 2\eta^{2}\beta^{2}\left\|Q(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+2\eta^{2}\left\|Y^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} ηβ2LqQ(X¯(k))F2+η2LqβY(k)F2,\displaystyle\dfrac{\eta\beta}{2L_{q}}\left\|Q(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+\dfrac{\eta}{2L_{q}\beta}\left\|Y^{(k)}\right\|^{2}_{\mathrm{F}},

where the last inequality results from the condition that η1/(4Lqβ)\eta\leq 1/(4L_{q}\beta). Moreover, we have

Q(X¯(k)),X¯(k+1)X¯(k)=\displaystyle\left\langle Q(\bar{X}^{(k)}),\bar{X}^{(k+1)}-\bar{X}^{(k)}\right\rangle={} ηβQ(X¯(k))F2+ηQ(X¯(k)),Y(k)\displaystyle-\eta\beta\left\|Q(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+\eta\left\langle Q(\bar{X}^{(k)}),Y^{(k)}\right\rangle
\displaystyle\leq{} 3ηβ4Q(X¯(k))F2+ηβY(k)F2.\displaystyle-\dfrac{3\eta\beta}{4}\left\|Q(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+\dfrac{\eta}{\beta}\left\|Y^{(k)}\right\|^{2}_{\mathrm{F}}.

For convenience, we denote c(X)=XMXIpF2/4c(X)=\|X^{\top}MX-I_{p}\|^{2}_{\mathrm{F}}/4. According to the local Lipschitz continuity of c(X)=Q(X)\nabla c(X)=Q(X), there exists a constant Lq>0L_{q}>0 such that

c(X¯(k+1))\displaystyle c(\bar{X}^{(k+1)})\leq{} c(X¯(k))+Q(X¯(k)),X¯(k+1)X¯(k)+Lq2X¯(k+1)X¯(k)F2\displaystyle c(\bar{X}^{(k)})+\left\langle Q(\bar{X}^{(k)}),\bar{X}^{(k+1)}-\bar{X}^{(k)}\right\rangle+\dfrac{L_{q}}{2}\left\|\bar{X}^{(k+1)}-\bar{X}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} c(X¯(k))ηβ2Q(X¯(k))F2+5η4βY(k)F2,\displaystyle c(\bar{X}^{(k)})-\dfrac{\eta\beta}{2}\left\|Q(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+\dfrac{5\eta}{4\beta}\left\|Y^{(k)}\right\|^{2}_{\mathrm{F}},

which together with (4.3) infers that

(X¯(k+1))MX¯(k+1)IpF2(153σmin(M)ηβ)(X¯(k))MX¯(k)IpF2+5ηβY(k)F2.\left\|(\bar{X}^{(k+1)})^{\top}M\bar{X}^{(k+1)}-I_{p}\right\|^{2}_{\mathrm{F}}\leq\left(1-\dfrac{5}{3}\sigma_{\min}(M)\eta\beta\right)\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}+\dfrac{5\eta}{\beta}\left\|Y^{(k)}\right\|^{2}_{\mathrm{F}}.

Now we investigate the above relationship in the following two cases.

Case I: (X¯(k))MX¯(k)IpF1/12\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|_{\mathrm{F}}\leq 1/12.

Since ηmin{3/(5σmin(M)β),β/(480(3+Cs2))}\eta\leq\min\{3/(5\sigma_{\min}(M)\beta),\beta/(480(3+C_{s}^{2}))\}, we have

(X¯(k+1))MX¯(k+1)IpF2(X¯(k))MX¯(k)IpF2+1481144+148=136.\left\|(\bar{X}^{(k+1)})^{\top}M\bar{X}^{(k+1)}-I_{p}\right\|^{2}_{\mathrm{F}}\leq\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}+\dfrac{1}{48}\leq\dfrac{1}{144}+\dfrac{1}{48}=\dfrac{1}{36}.

Case II: (X¯(k))MX¯(k)IpF>1/12\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|_{\mathrm{F}}>1/12.

It can be readily verified that

(X¯(k+1))MX¯(k+1)IpF2(X¯(k))MX¯(k)IpF2\displaystyle\left\|(\bar{X}^{(k+1)})^{\top}M\bar{X}^{(k+1)}-I_{p}\right\|^{2}_{\mathrm{F}}-\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}\leq{} 53σmin(M)ηβ(X¯(k))MX¯(k)IpF2\displaystyle-\dfrac{5}{3}\sigma_{\min}(M)\eta\beta\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}
+10ηβ(3+Cs2)\displaystyle+\dfrac{10\eta}{\beta}(3+C_{s}^{2})
\displaystyle\leq{} 5η(1432σmin(M)β+2β(3+Cs2))\displaystyle 5\eta\left(-\dfrac{1}{432}\sigma_{\min}(M)\beta+\dfrac{2}{\beta}(3+C_{s}^{2})\right)
\displaystyle\leq{} 0,\displaystyle 0,

where the last inequality follows from the condition β2864(3+Cs2)/σmin(M)\beta^{2}\geq 864(3+C_{s}^{2})/\sigma_{\min}(M). Hence, we arrive at

(X¯(k+1))MX¯(k+1)IpF(X¯(k))MX¯(k)IpF1/6.\left\|(\bar{X}^{(k+1)})^{\top}M\bar{X}^{(k+1)}-I_{p}\right\|_{\mathrm{F}}\leq\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|_{\mathrm{F}}\leq 1/6.

Combining the above two cases together, we complete the proof. ∎

Based on Lemma 4.5, we can prove the main results of this subsection.

Proposition 4.6.

Suppose the conditions in Assumptions 1 and 2 hold. Let the penalty parameter β\beta and stepsize η\eta satisfy

βmax{1, 2LcC2,864(3+Cs2)σmin(M),32Lc2C1C21λ2},\beta\geq\max\left\{1,\,2L_{c}\sqrt{C_{2}},\,\sqrt{\dfrac{864(3+C_{s}^{2})}{\sigma_{\min}(M)}},\,\sqrt{\dfrac{32L_{c}^{2}C_{1}C_{2}}{1-\lambda^{2}}}\right\}, (4.4)

and

ηmin{14Lqβ,35σmin(M)β,β480(3+Cs2),1β(Ch+Ch′′β)d(1λ2)2C1(C2+C3β2)},\eta\leq\min\left\{\dfrac{1}{4L_{q}\beta},\,\dfrac{3}{5\sigma_{\min}(M)\beta},\,\dfrac{\beta}{480(3+C_{s}^{2})},\,\dfrac{1}{\beta(C_{h}^{\prime}+C_{h}^{\prime\prime}\beta)}\sqrt{\dfrac{d(1-\lambda^{2})}{2C_{1}(C_{2}+C_{3}\beta^{2})}}\right\}, (4.5)

respectively. Then for any kk\in\mathbb{N}, it holds that

{X¯(k),𝐗(k)FCx,𝐔(k)FCu,𝐕(k)FCv,𝐗(k)𝐗¯(k)F2dβ2(C2+C3β2),𝐔(k)𝐔¯(k)F2dC2,𝐕(k)𝐕¯(k)F2dC2+C3β2.\left\{\begin{aligned} &\bar{X}^{(k)}\in\mathcal{R},~{}\left\|\mathbf{X}^{(k)}\right\|_{\mathrm{F}}\leq C_{x},~{}\left\|\mathbf{U}^{(k)}\right\|_{\mathrm{F}}\leq C_{u},~{}\left\|\mathbf{V}^{(k)}\right\|_{\mathrm{F}}\leq C_{v},\\ &\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}\leq\dfrac{d}{\beta^{2}(C_{2}+C_{3}\beta^{2})},\\ &\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}\leq\dfrac{d}{C_{2}},~{}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}\leq\dfrac{d}{C_{2}+C_{3}\beta^{2}}.\end{aligned}\right. (4.6)
Proof.

We prove this proposition by mathematical induction. The claim (4.6) holds directly at iteration k=0k=0 owing to the initialization. Now, we assume that this claim holds at iteration kk, and investigate the situation at iteration k+1k+1.

To begin with, it follows from Lemma 4.4 and the condition β1\beta\geq 1 that

H¯(k)H(X¯(k))F2C2d𝐔(k)𝐔¯(k)F2+C2+C3β2d(𝐕(k)𝐕¯(k)F2+𝐗(k)𝐗¯(k)F2)3.\displaystyle\left\|\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}\leq\dfrac{C_{2}}{d}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}+\dfrac{C_{2}+C_{3}\beta^{2}}{d}\left(\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}+\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}\right)\leq 3.

Hence, according to Lemma 4.5, we know that X¯(k+1)\bar{X}^{(k+1)}\in\mathcal{R}. Moreover, we have

Hi(k)F\displaystyle\left\|H_{i}^{(k)}\right\|_{\mathrm{F}}\leq{} d2Ui(k)(3Ipd(Xi(k))Vi(k))F+d2Vi(k)sym((Xi(k))Ui(k))F\displaystyle\dfrac{d}{2}\left\|U_{i}^{(k)}(3I_{p}-d(X_{i}^{(k)})^{\top}V_{i}^{(k)})\right\|_{\mathrm{F}}+d^{2}\left\|V_{i}^{(k)}\mathrm{sym}((X_{i}^{(k)})^{\top}U_{i}^{(k)})\right\|_{\mathrm{F}}
+βdVi(k)(d(Xi(k))Vi(k)Ip)F\displaystyle+\beta d\left\|V_{i}^{(k)}(d(X_{i}^{(k)})^{\top}V_{i}^{(k)}-I_{p})\right\|_{\mathrm{F}}
\displaystyle\leq{} Ch+Ch′′βd,\displaystyle\dfrac{C_{h}^{\prime}+C_{h}^{\prime\prime}\beta}{\sqrt{d}},

where Ch=3d3/2Cu(dCxCv+p)/2>0C_{h}^{\prime}=3d^{3/2}C_{u}(dC_{x}C_{v}+\sqrt{p})/2>0 and Ch′′=d3/2Cv(dCxCv+p)>0C_{h}^{\prime\prime}=d^{3/2}C_{v}(dC_{x}C_{v}+\sqrt{p})>0 are two constants. Then it can be readily verified that

𝐇(k)F2=i=1dHi(k)F2(Ch+Ch′′β)2.\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}}=\sum_{i=1}^{d}\left\|H_{i}^{(k)}\right\|^{2}_{\mathrm{F}}\leq(C_{h}^{\prime}+C_{h}^{\prime\prime}\beta)^{2}.

Combining Lemma 4.1 with the condition η2d(1λ2)2β2C1(C2+C3β2)(Ch+Ch′′β)2\eta^{2}\leq\dfrac{d(1-\lambda^{2})}{2\beta^{2}C_{1}(C_{2}+C_{3}\beta^{2})(C_{h}^{\prime}+C_{h}^{\prime\prime}\beta)^{2}} leads to that

𝐗(k+1)𝐗¯(k+1)F2\displaystyle\left\|\mathbf{X}^{(k+1)}-\bar{\mathbf{X}}^{(k+1)}\right\|^{2}_{\mathrm{F}}\leq{} 1+λ22𝐗(k)𝐗¯(k)F2+η2C1𝐇(k)F2\displaystyle\dfrac{1+\lambda^{2}}{2}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}+\eta^{2}C_{1}\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} d(1+λ2)2β2(C2+C3β2)+η2C1(Ch+Ch′′β)2\displaystyle\dfrac{d(1+\lambda^{2})}{2\beta^{2}(C_{2}+C_{3}\beta^{2})}+\eta^{2}C_{1}(C_{h}^{\prime}+C_{h}^{\prime\prime}\beta)^{2}
\displaystyle\leq{} dβ2(C2+C3β2),\displaystyle\dfrac{d}{\beta^{2}(C_{2}+C_{3}\beta^{2})},

which together with the condition β1\beta\geq 1 implies that

𝐗(k+1)F𝐗(k+1)𝐗¯(k+1)F+dX¯(k+1)Fd(1+C~x)=Cx.\displaystyle\left\|\mathbf{X}^{(k+1)}\right\|_{\mathrm{F}}\leq\left\|\mathbf{X}^{(k+1)}-\bar{\mathbf{X}}^{(k+1)}\right\|_{\mathrm{F}}+\sqrt{d}\left\|\bar{X}^{(k+1)}\right\|_{\mathrm{F}}\leq\sqrt{d}(1+\tilde{C}_{x})=C_{x}.

As a direct consequence of Lemma 4.2, we can proceed to show that

𝐕(k+1)𝐕¯(k+1)F2\displaystyle\left\|\mathbf{V}^{(k+1)}-\bar{\mathbf{V}}^{(k+1)}\right\|^{2}_{\mathrm{F}}\leq{} 1+λ22𝐕(k)𝐕¯(k)F2+8Lc2C1𝐗(k)𝐗¯(k)F2+2η2Lc2C1𝐇(k)F2\displaystyle\dfrac{1+\lambda^{2}}{2}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}+8L_{c}^{2}C_{1}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}+2\eta^{2}L_{c}^{2}C_{1}\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} d(1+λ2)2(C2+C3β2)+8dLc2C1β2(C2+C3β2)+2η2Lc2C1(Ch+Ch′′β)2\displaystyle\dfrac{d(1+\lambda^{2})}{2(C_{2}+C_{3}\beta^{2})}+\dfrac{8dL_{c}^{2}C_{1}}{\beta^{2}(C_{2}+C_{3}\beta^{2})}+2\eta^{2}L_{c}^{2}C_{1}(C_{h}^{\prime}+C_{h}^{\prime\prime}\beta)^{2}
\displaystyle\leq{} dC2+C3β2,\displaystyle\dfrac{d}{C_{2}+C_{3}\beta^{2}},

where the last inequality results from the conditions η2d(1λ2)8Lc2C1(C2+C3β2)(Ch+Ch′′β)2\eta^{2}\leq\dfrac{d(1-\lambda^{2})}{8L_{c}^{2}C_{1}(C_{2}+C_{3}\beta^{2})(C_{h}^{\prime}+C_{h}^{\prime\prime}\beta)^{2}} and β232Lc2C11λ2\beta^{2}\geq\dfrac{32L_{c}^{2}C_{1}}{1-\lambda^{2}}. Furthermore, since C2+C3β2C21C_{2}+C_{3}\beta^{2}\geq C_{2}\geq 1, we have

𝐕(k+1)F𝐕(k+1)𝐕¯(k+1)F+dV¯(k+1)Fd(1+CxLm)=Cv.\displaystyle\left\|\mathbf{V}^{(k+1)}\right\|_{\mathrm{F}}\leq\left\|\mathbf{V}^{(k+1)}-\bar{\mathbf{V}}^{(k+1)}\right\|_{\mathrm{F}}+\sqrt{d}\left\|\bar{V}^{(k+1)}\right\|_{\mathrm{F}}\leq\sqrt{d}(1+C_{x}L_{m})=C_{v}.

Similarly, under the conditions η2d(1λ2)8Lc2C1C2(Ch+Ch′′β)2\eta^{2}\leq\dfrac{d(1-\lambda^{2})}{8L_{c}^{2}C_{1}C_{2}(C_{h}^{\prime}+C_{h}^{\prime\prime}\beta)^{2}} and β232Lc2C1C21λ2\beta^{2}\geq\dfrac{32L_{c}^{2}C_{1}C_{2}}{1-\lambda^{2}}, we can show that

𝐔(k+1)𝐔¯(k+1)F2dC2,and 𝐔(k+1)FdC2+dCgCu.\left\|\mathbf{U}^{(k+1)}-\bar{\mathbf{U}}^{(k+1)}\right\|^{2}_{\mathrm{F}}\leq\dfrac{d}{C_{2}},~{}\mbox{and~{}}\left\|\mathbf{U}^{(k+1)}\right\|_{\mathrm{F}}\leq\sqrt{\dfrac{d}{C_{2}}}+\sqrt{d}C_{g}\leq C_{u}.

The proof is completed. ∎

4.3 Sufficient Descent

The purpose of this subsection is to establish a descent property of the sequence {h(X¯(k))}\{h(\bar{X}^{(k)})\}.

Lemma 4.7.

Suppose that Assumptions 1 and 2 hold. Let all the conditions in Proposition 4.6 be satisfied. We further assume that η1/(8(Ls+Lqβ))\eta\leq 1/(8(L_{s}+L_{q}\beta)). Then for any kk\in\mathbb{N}, it holds that

h(X¯(k+1))\displaystyle h(\bar{X}^{(k+1)})\leq{} h(X¯(k))58ηH(X¯(k))F2+94ηH¯(k)H(X¯(k))F2\displaystyle h(\bar{X}^{(k)})-\dfrac{5}{8}\eta\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+\dfrac{9}{4}\eta\left\|\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}
+4ηLg2C4(X¯(k))MX¯(k)IpF2,\displaystyle+4\eta L_{g}^{2}C_{4}\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}},

where C4>0C_{4}>0 is a constant.

Proof.

According to Proposition 4.6, we know that the inclusion X¯(k)\bar{X}^{(k)}\in\mathcal{R} holds for any kk\in\mathbb{N}. Since h\nabla h is locally Lipschitz continuous, there exist two constants Ls>0L_{s}>0 and Lq>0L_{q}>0 such that

h(X¯(k+1))=\displaystyle h(\bar{X}^{(k+1)})={} h(X¯(k)ηH¯(k))h(X¯(k))ηh(X¯(k)),H¯(k)+12η2(Ls+Lqβ)H¯(k)F2\displaystyle h(\bar{X}^{(k)}-\eta\bar{H}^{(k)})\leq h(\bar{X}^{(k)})-\eta\left\langle\nabla h(\bar{X}^{(k)}),\bar{H}^{(k)}\right\rangle+\dfrac{1}{2}\eta^{2}\left(L_{s}+L_{q}\beta\right)\left\|\bar{H}^{(k)}\right\|^{2}_{\mathrm{F}}
=\displaystyle={} h(X¯(k))ηH(X¯(k))F2ηh(X¯(k))H(X¯(k)),H¯(k)\displaystyle h(\bar{X}^{(k)})-\eta\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}-\eta\left\langle\nabla h(\bar{X}^{(k)})-H(\bar{X}^{(k)}),\bar{H}^{(k)}\right\rangle
ηH(X¯(k)),H¯(k)H(X¯(k))+12η2(Ls+Lqβ)H¯(k)F2\displaystyle-\eta\left\langle H(\bar{X}^{(k)}),\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\rangle+\dfrac{1}{2}\eta^{2}\left(L_{s}+L_{q}\beta\right)\left\|\bar{H}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} h(X¯(k))78ηH(X¯(k))F2ηh(X¯(k))H(X¯(k)),H¯(k)\displaystyle h(\bar{X}^{(k)})-\dfrac{7}{8}\eta\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}-\eta\left\langle\nabla h(\bar{X}^{(k)})-H(\bar{X}^{(k)}),\bar{H}^{(k)}\right\rangle
ηH(X¯(k)),H¯(k)H(X¯(k))+18ηH¯(k)H(X¯(k))F2,\displaystyle-\eta\left\langle H(\bar{X}^{(k)}),\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\rangle+\dfrac{1}{8}\eta\left\|\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}},

where the last inequality follows from the condition η1/(8(Ls+Lqβ))\eta\leq 1/(8(L_{s}+L_{q}\beta)) and the relationship

H¯(k)F22H¯(k)H(X¯(k))F2+2H(X¯(k))F2.\left\|\bar{H}^{(k)}\right\|^{2}_{\mathrm{F}}\leq 2\left\|\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+2\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}.

By virtue of Young’s inequality, we can obtain that

|h(X¯(k))H(X¯(k)),H¯(k)|\displaystyle\left\lvert\left\langle\nabla h(\bar{X}^{(k)})-H(\bar{X}^{(k)}),\bar{H}^{(k)}\right\rangle\right\rvert\leq{} 4h(X¯(k))H(X¯(k))F2+116H¯(k)F2\displaystyle 4\left\|\nabla h(\bar{X}^{(k)})-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+\dfrac{1}{16}\left\|\bar{H}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 4Lg2C4(X¯(k))MX¯(k)IpF2+18H¯(k)H(X¯(k))F2\displaystyle 4L_{g}^{2}C_{4}\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}+\dfrac{1}{8}\left\|\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}
+18H(X¯(k))F2,\displaystyle+\dfrac{1}{8}\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}},

where C4=(7σmin(M)(144p+1)+343σmax(M))/(432σmin2(M))>0C_{4}=(7\sigma_{\min}(M)(144p+1)+343\sigma_{\max}(M))/(432\sigma_{\min}^{2}(M))>0 is a constant. Moreover, we have

|H(X¯(k)),H¯(k)H(X¯(k))|\displaystyle\left\lvert\left\langle H(\bar{X}^{(k)}),\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\rangle\right\rvert\leq{} 2H¯(k)H(X¯(k))F2+18H(X¯(k))F2.\displaystyle 2\left\|\bar{H}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+\dfrac{1}{8}\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}.

Combining the above relationships, we finally arrive at the assertion of this lemma. ∎
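The two inner-product estimates above are instances of Young's inequality, |⟨A,B⟩| ≤ ε‖A‖_F² + ‖B‖_F²/(4ε), applied with ε = 4 and ε = 2, respectively. As a purely illustrative sanity check (randomly generated matrices, not part of the analysis), the bound can be verified numerically:

```python
import numpy as np

def young_bound(A, B, eps):
    # Returns (|<A, B>|, eps * ||A||_F^2 + ||B||_F^2 / (4 * eps)),
    # the two sides of Young's inequality for the Frobenius inner product.
    lhs = abs(np.sum(A * B))
    rhs = eps * np.sum(A * A) + np.sum(B * B) / (4.0 * eps)
    return lhs, rhs
```

With ε = 4 one recovers the coefficients 4 and 1/16 used for the first term, and ε = 2 gives the coefficients 2 and 1/8 used for the second.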

The above lemma indicates that the sequence {h(X¯(k))}\{h(\bar{X}^{(k)})\} is not necessarily monotonically decreasing. To address this issue, we introduce the following merit function,

(𝐗,𝐔,𝐕):=h(𝐄𝐗/d)+(Idn𝐉)𝐗F2+ρ(Idn𝐉)𝐔F2+ρ(Idn𝐉)𝐕F2,\hbar(\mathbf{X},\mathbf{U},\mathbf{V}):=h(\mathbf{E}^{\top}\mathbf{X}/d)+\left\|(I_{dn}-\mathbf{J})\mathbf{X}\right\|^{2}_{\mathrm{F}}+\rho\left\|(I_{dn}-\mathbf{J})\mathbf{U}\right\|^{2}_{\mathrm{F}}+\rho\left\|(I_{dn}-\mathbf{J})\mathbf{V}\right\|^{2}_{\mathrm{F}},

where ρ=(1λ2)/(128Lc2C1)>0\rho=(1-\lambda^{2})/(128L_{c}^{2}C_{1})>0 is a constant. For convenience, we denote (k):=(𝐗(k),𝐔(k),𝐕(k))\hbar^{(k)}:=\hbar(\mathbf{X}^{(k)},\mathbf{U}^{(k)},\mathbf{V}^{(k)}) hereafter. The following proposition illustrates that the sequence {(k)}\{\hbar^{(k)}\} satisfies a sufficient descent property, and hence, is monotonically decreasing.
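For concreteness, the merit function ℏ can be evaluated from the stacked local variables as in the following numpy sketch; the callable h and the stacked shapes are our illustrative assumptions:

```python
import numpy as np

def merit_function(h, X, U, V, d, n, rho):
    # X, U, V: (d * n, p) vertical stacks of the local copies on the d agents;
    # h is the penalty function evaluated at the network average.
    E = np.kron(np.ones((d, 1)), np.eye(n))            # E = 1_d (x) I_n
    J = np.kron(np.full((d, d), 1.0 / d), np.eye(n))   # block-averaging projector
    val = h(E.T @ X / d)                               # h at the average iterate
    val += np.linalg.norm(X - J @ X, "fro") ** 2       # consensus error in X
    val += rho * np.linalg.norm(U - J @ U, "fro") ** 2
    val += rho * np.linalg.norm(V - J @ V, "fro") ** 2
    return val
```

When all local copies agree, the three consensus terms vanish and ℏ reduces to h evaluated at the common iterate.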

Proposition 4.8.

Suppose that Assumptions 1 and 2 hold. Let all the conditions in Proposition 4.6 be satisfied. We further assume that

βmax{12+342Cg5σmin1/2(M),Ls2σmin3/2(M),16Lg2C4σmin1/2(M)},\beta\geq\max\left\{\dfrac{12+3\sqrt{42}C_{g}}{5\sigma_{\min}^{1/2}(M)},\,\dfrac{L_{s}^{2}}{\sigma_{\min}^{3/2}(M)},\,\dfrac{16L_{g}^{2}C_{4}}{\sigma_{\min}^{1/2}(M)}\right\}, (4.7)

and

ηmin{116dC5,18(Ls+Lqβ),d(1λ2)min{1,2ρ}36(C2+C3β2),(1λ2)min{1,2ρ}32(C2+C3β2)C5}.\eta\leq\min\left\{\dfrac{1}{16dC_{5}},\,\dfrac{1}{8(L_{s}+L_{q}\beta)},\,\dfrac{d(1-\lambda^{2})\min\{1,2\rho\}}{36(C_{2}+C_{3}\beta^{2})},\,\sqrt{\dfrac{(1-\lambda^{2})\min\{1,2\rho\}}{32(C_{2}+C_{3}\beta^{2})C_{5}}}\right\}. (4.8)

Then for any kk\in\mathbb{N}, the following sufficient descent property holds, namely,

(k+1)\displaystyle\hbar^{(k+1)}\leq{} (k)14ησmin2(M)gradf(𝒫(X¯(k)))F214ηβσmin1/2(M)(X¯(k))MX¯(k)IpF2\displaystyle\hbar^{(k)}-\dfrac{1}{4}\eta\sigma_{\min}^{2}(M)\left\|\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))\right\|^{2}_{\mathrm{F}}-\dfrac{1}{4}\eta\beta\sigma_{\min}^{1/2}(M)\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}
1λ24𝐗(k)𝐗¯(k)F2(1λ2)ρ4𝐔(k)𝐔¯(k)F2(1λ2)ρ4𝐕(k)𝐕¯(k)F2.\displaystyle-\dfrac{1-\lambda^{2}}{4}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}-\dfrac{(1-\lambda^{2})\rho}{4}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}-\dfrac{(1-\lambda^{2})\rho}{4}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}.
Proof.

Combining Lemmas 4.4 and 4.7 yields that

h(X¯(k+1))\displaystyle h(\bar{X}^{(k+1)})\leq{} h(X¯(k))58ηH(X¯(k))F2+4ηLg2C4(X¯(k))MX¯(k)IpF2\displaystyle h(\bar{X}^{(k)})-\dfrac{5}{8}\eta\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+4\eta L_{g}^{2}C_{4}\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}
+9ηC24d𝐔(k)𝐔¯(k)F2+9η(C2+C3β2)4d𝐕(k)𝐕¯(k)F2\displaystyle+\dfrac{9\eta C_{2}}{4d}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}+\dfrac{9\eta(C_{2}+C_{3}\beta^{2})}{4d}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}
+9η(C2+C3β2)4d𝐗(k)𝐗¯(k)F2\displaystyle+\dfrac{9\eta(C_{2}+C_{3}\beta^{2})}{4d}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} h(X¯(k))58ηH(X¯(k))F2+4ηLg2C4(X¯(k))MX¯(k)IpF2\displaystyle h(\bar{X}^{(k)})-\dfrac{5}{8}\eta\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+4\eta L_{g}^{2}C_{4}\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}
+(1λ2)ρ8𝐔(k)𝐔¯(k)F2+(1λ2)ρ8𝐕(k)𝐕¯(k)F2\displaystyle+\dfrac{(1-\lambda^{2})\rho}{8}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}+\dfrac{(1-\lambda^{2})\rho}{8}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}
+1λ216𝐗(k)𝐗¯(k)F2,\displaystyle+\dfrac{1-\lambda^{2}}{16}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}},

where the last inequality follows from the condition ηd(1λ2)min{1,2ρ}/(36(C2+C3β2))\eta\leq d(1-\lambda^{2})\min\{1,2\rho\}/(36(C_{2}+C_{3}\beta^{2})). Then according to Lemmas 4.1 and 4.2, it follows that

(k+1)\displaystyle\hbar^{(k+1)}\leq{} (k)58ηH(X¯(k))F2+η2C5𝐇(k)F2+4ηLg2C4(X¯(k))MX¯(k)IpF2\displaystyle\hbar^{(k)}-\dfrac{5}{8}\eta\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+\eta^{2}C_{5}\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}}+4\eta L_{g}^{2}C_{4}\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}} (4.9)
3(1λ2)ρ8𝐔(k)𝐔¯(k)F23(1λ2)ρ8𝐕(k)𝐕¯(k)F2\displaystyle-\dfrac{3(1-\lambda^{2})\rho}{8}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}-\dfrac{3(1-\lambda^{2})\rho}{8}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}
5(1λ2)16𝐗(k)𝐗¯(k)F2,\displaystyle-\dfrac{5(1-\lambda^{2})}{16}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}},

where C5=C1+4Lc2C1ρ>0C_{5}=C_{1}+4L_{c}^{2}C_{1}\rho>0 is a constant. As a direct consequence of the relationship (4.2), we can proceed to show that

𝐇(k)F2=\displaystyle\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}}={} i=1dHi(k)F22dH(X¯(k))F2+2i=1dHi(k)H(X¯(k))F2\displaystyle\sum_{i=1}^{d}\left\|H_{i}^{(k)}\right\|^{2}_{\mathrm{F}}\leq 2d\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+2\sum_{i=1}^{d}\left\|H_{i}^{(k)}-H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 2dH(X¯(k))F2+2C2𝐔(k)𝐔¯(k)F2+2(C2+C3β2)𝐕(k)𝐕¯(k)F2\displaystyle 2d\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+2C_{2}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}+2(C_{2}+C_{3}\beta^{2})\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}
+2(C2+C3β2)𝐗(k)𝐗¯(k)F2.\displaystyle+2(C_{2}+C_{3}\beta^{2})\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}.

By virtue of the condition ηmin{1/(16dC5),(1λ2)min{1,2ρ}/(32(C2+C3β2)C5)}\eta\leq\min\{1/(16dC_{5}),\sqrt{(1-\lambda^{2})\min\{1,2\rho\}/(32(C_{2}+C_{3}\beta^{2})C_{5})}\}, we have

𝐇(k)F2\displaystyle\left\|\mathbf{H}^{(k)}\right\|^{2}_{\mathrm{F}}\leq{} 18ηC5H(X¯(k))F2+(1λ2)ρ8η2C5𝐔(k)𝐔¯(k)F2+(1λ2)ρ8η2C5𝐕(k)𝐕¯(k)F2\displaystyle\dfrac{1}{8\eta C_{5}}\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}+\dfrac{(1-\lambda^{2})\rho}{8\eta^{2}C_{5}}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}+\dfrac{(1-\lambda^{2})\rho}{8\eta^{2}C_{5}}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}} (4.10)
+1λ216η2C5𝐗(k)𝐗¯(k)F2.\displaystyle+\dfrac{1-\lambda^{2}}{16\eta^{2}C_{5}}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}.

Then, by combining the two relationships (4.9) and (4.10), it can be readily verified that

(k+1)\displaystyle\hbar^{(k+1)}\leq{} (k)η2H(X¯(k))F2(1λ2)ρ4𝐔(k)𝐔¯(k)F2(1λ2)ρ4𝐕(k)𝐕¯(k)F2\displaystyle\hbar^{(k)}-\dfrac{\eta}{2}\left\|H(\bar{X}^{(k)})\right\|^{2}_{\mathrm{F}}-\dfrac{(1-\lambda^{2})\rho}{4}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}-\dfrac{(1-\lambda^{2})\rho}{4}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}
1λ24𝐗(k)𝐗¯(k)F2+4ηLg2C4(X¯(k))MX¯(k)IpF2,\displaystyle-\dfrac{1-\lambda^{2}}{4}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}+4\eta L_{g}^{2}C_{4}\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}},

which together with Lemma 3.1 yields that

(k+1)\displaystyle\hbar^{(k+1)}\leq{} (k)η4σmin2(M)gradf(𝒫(X¯(k)))F2η2(βσmin1/2(M)8Lg2C4)(X¯(k))MX¯(k)IpF2\displaystyle\hbar^{(k)}-\dfrac{\eta}{4}\sigma_{\min}^{2}(M)\left\|\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))\right\|^{2}_{\mathrm{F}}-\dfrac{\eta}{2}\left(\beta\sigma_{\min}^{1/2}(M)-8L_{g}^{2}C_{4}\right)\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}
1λ24𝐗(k)𝐗¯(k)F2(1λ2)ρ4𝐔(k)𝐔¯(k)F2(1λ2)ρ4𝐕(k)𝐕¯(k)F2.\displaystyle-\dfrac{1-\lambda^{2}}{4}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}-\dfrac{(1-\lambda^{2})\rho}{4}\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}}-\dfrac{(1-\lambda^{2})\rho}{4}\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}.

Since β16σmin1/2(M)Lg2C4\beta\geq 16\sigma_{\min}^{-1/2}(M)L_{g}^{2}C_{4}, we can obtain the desired sufficient descent property. The proof is completed. ∎

4.4 Global Convergence

Based on the sufficient descent property of {(k)}\{\hbar^{(k)}\}, we can finally establish the global convergence guarantee of Algorithm 1 to a first-order stationary point of the problem (1.1). Moreover, the iteration complexity is also presented.

Theorem 4.9.

Suppose Assumptions 1 and 2 hold. Let the penalty parameter β\beta satisfy the conditions (4.4) and (4.7) and the stepsize η\eta satisfy the conditions (4.5) and (4.8). Then the sequence {𝐗(k)}\{\mathbf{X}^{(k)}\} has at least one accumulation point. Moreover, for any accumulation point 𝐗\mathbf{X}^{\ast}, there exists a first-order stationary point X¯𝒮Mn,p\bar{X}^{\ast}\in\mathcal{S}_{M}^{n,p} of the problem (1.1) such that 𝐗=(𝟏dIn)X¯\mathbf{X}^{\ast}=(\mathbf{1}_{d}\otimes I_{n})\bar{X}^{\ast}. Finally, the following relationships hold, namely,

mink=0,1,,K1gradf(𝒫(X¯(k)))F24((0)h¯)ησmin2(M)K,\min_{k=0,1,\dotsc,K-1}\left\|\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))\right\|^{2}_{\mathrm{F}}\leq\dfrac{4(\hbar^{(0)}-\underline{h})}{\eta\sigma_{\min}^{2}(M)K}, (4.11)
mink=0,1,,K1𝐗(k)𝐗¯(k)F24((0)h¯)(1λ2)K,\min_{k=0,1,\dotsc,K-1}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}\leq\dfrac{4(\hbar^{(0)}-\underline{h})}{(1-\lambda^{2})K}, (4.12)
mink=0,1,,K1(X¯(k))MX¯(k)IpF24((0)h¯)ηβσmin1/2(M)K,\min_{k=0,1,\dotsc,K-1}\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}\leq\dfrac{4(\hbar^{(0)}-\underline{h})}{\eta\beta\sigma_{\min}^{1/2}(M)K}, (4.13)

where h¯\underline{h} is a constant.

Proof.

According to Proposition 4.6, we know that the sequence {𝐗(k)}\{\mathbf{X}^{(k)}\} is bounded. Then the sequence {h(X¯(k))}\{h(\bar{X}^{(k)})\} is bounded from below owing to the continuity of hh. Hence, there exists a constant h¯\underline{h} such that

(k)h(X¯(k))h¯,\hbar^{(k)}\geq h(\bar{X}^{(k)})\geq\underline{h},

for any kk\in\mathbb{N}. It follows from Proposition 4.8 that the sequence {(k)}\{\hbar^{(k)}\} is convergent and the following relationships hold,

limkgradf(𝒫(X¯(k)))F2=0,limk𝐗(k)𝐗¯(k)F2=0,limk(X¯(k))MX¯(k)IpF2=0.\lim\limits_{k\to\infty}\left\|\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))\right\|^{2}_{\mathrm{F}}=0,\,\lim\limits_{k\to\infty}\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|^{2}_{\mathrm{F}}=0,\,\lim\limits_{k\to\infty}\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|^{2}_{\mathrm{F}}=0. (4.14)

According to the Bolzano-Weierstrass theorem, it follows that {𝐗(k)}\{\mathbf{X}^{(k)}\} has an accumulation point, say 𝐗\mathbf{X}^{\ast}. Then the relationships in (4.14) imply that there exists a first-order stationary point X¯𝒮Mn,p\bar{X}^{\ast}\in\mathcal{S}_{M}^{n,p} of the problem (1.1) such that 𝐗=(𝟏dIn)X¯\mathbf{X}^{\ast}=(\mathbf{1}_{d}\otimes I_{n})\bar{X}^{\ast}.

The last thing to do in the proof is to show that the relationships (4.11)-(4.13) hold. Indeed, it follows from Proposition 4.8 that

k=0K1gradf(𝒫(X¯(k)))F24ησmin2(M)k=0K1((k)(k+1))4((0)(K))ησmin2(M)4((0)h¯)ησmin2(M),\displaystyle\sum_{k=0}^{K-1}\left\|\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))\right\|^{2}_{\mathrm{F}}\leq\dfrac{4}{\eta\sigma_{\min}^{2}(M)}\sum_{k=0}^{K-1}\left(\hbar^{(k)}-\hbar^{(k+1)}\right)\leq\dfrac{4(\hbar^{(0)}-\hbar^{(K)})}{\eta\sigma_{\min}^{2}(M)}\leq\dfrac{4(\hbar^{(0)}-\underline{h})}{\eta\sigma_{\min}^{2}(M)},

which yields the relationship (4.11). The other relationships can be proved similarly. Therefore, we complete the proof. ∎

The global sublinear convergence rate in Theorem 4.9 guarantees that Algorithm 1 is able to return a first-order ϵ\epsilon-stationary point in at most 𝒪(ϵ2)\mathcal{O}(\epsilon^{-2}) iterations. Since Algorithm 1 performs three rounds of communication per iteration, the total number of communication rounds required to obtain a first-order ϵ\epsilon-stationary point is also at most 𝒪(ϵ2)\mathcal{O}(\epsilon^{-2}).
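The complexity bound rests on the telescoping argument in the proof of Theorem 4.9: if each iteration decreases ℏ by at least c·a_k, then min_{k&lt;K} a_k ≤ (ℏ^(0) − ℏ^(K))/(cK). A small illustrative sketch of this argument (the merit values below are hypothetical):

```python
import numpy as np

def min_residual_bound(hbar, c):
    # hbar: nonincreasing merit values hbar^(0), ..., hbar^(K);
    # the residuals a_k are taken at the extreme case a_k = (hbar^(k) - hbar^(k+1)) / c.
    a = (hbar[:-1] - hbar[1:]) / c
    K = len(a)
    return a.min(), (hbar[0] - hbar[-1]) / (c * K)   # (min residual, telescoped bound)
```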

Remark 1.

Under the conditions in Theorem 4.9, the tracking errors also converge to zero asymptotically at a sublinear rate as follows,

mink=0,1,,K1max{𝐔(k)𝐔¯(k)F2,𝐕(k)𝐕¯(k)F2}4((0)h¯)(1λ2)ρK.\min_{k=0,1,\dotsc,K-1}\max\left\{\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|^{2}_{\mathrm{F}},\,\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|^{2}_{\mathrm{F}}\right\}\leq\dfrac{4(\hbar^{(0)}-\underline{h})}{(1-\lambda^{2})\rho K}.

4.5 Convergence Under Kurdyka-Łojasiewicz Property

In this subsection, we establish the convergence of Algorithm 1 when the problem (1.1) satisfies the KŁ property. In particular, for any i[d]i\in[d], we prove that the entire sequence {Xi(k)}k\{X_{i}^{(k)}\}_{k\in\mathbb{N}} converges to a stationary point of the problem (1.1) with guaranteed asymptotic convergence rates.

To begin with, we define the following quantity,

r(k):=\displaystyle r^{(k)}:={} gradf(𝒫(X¯(k)))F+(X¯(k))MX¯(k)IpF+𝐗(k)𝐗¯(k)F\displaystyle\left\|\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))\right\|_{\mathrm{F}}+\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|_{\mathrm{F}}+\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|_{\mathrm{F}}
+𝐔(k)𝐔¯(k)F+𝐕(k)𝐕¯(k)F.\displaystyle+\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|_{\mathrm{F}}+\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|_{\mathrm{F}}.

Then it can be readily verified from Proposition 4.8 that

(k)(k+1)Ce(r(k))2,\hbar^{(k)}-\hbar^{(k+1)}\geq C_{e}(r^{(k)})^{2}, (4.15)

where Ce:=min{ησmin2(M),ηβσmin1/2(M),1λ2,(1λ2)ρ}/20C_{e}:=\min\{\eta\sigma_{\min}^{2}(M),\eta\beta\sigma_{\min}^{1/2}(M),1-\lambda^{2},(1-\lambda^{2})\rho\}/20 is a fixed positive constant.

Next, we give a lower bound of r(k)r^{(k)} by the norm of :=(𝐗,𝐔,𝐕)\nabla\hbar:=(\nabla_{\mathbf{X}}\hbar,\nabla_{\mathbf{U}}\hbar,\nabla_{\mathbf{V}}\hbar) in the following lemma.

Lemma 4.10.

Suppose Assumption 1 and Assumption 2 hold. Let all the conditions in Theorem 4.9 be satisfied. Then, for any kk\in\mathbb{N}, it holds that

(𝐗(k),𝐔(k),𝐕(k))FCrr(k),\left\|\nabla\hbar(\mathbf{X}^{(k)},\mathbf{U}^{(k)},\mathbf{V}^{(k)})\right\|_{\mathrm{F}}\leq C_{r}r^{(k)},

where Cr>0C_{r}>0 is a constant.

Proof.

To begin with, from straightforward calculations, we can obtain that

𝐗(𝐗(k),𝐔(k),𝐕(k))=\displaystyle\nabla_{\mathbf{X}}\hbar(\mathbf{X}^{(k)},\mathbf{U}^{(k)},\mathbf{V}^{(k)})={} 𝐄h(X¯(k))/d+2(Idn𝐉)(𝐗(k)𝐗¯(k))\displaystyle\mathbf{E}\nabla h(\bar{X}^{(k)})/d+2(I_{dn}-\mathbf{J})(\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)})
=\displaystyle={} 𝐄Mgradf(𝒫(X¯(k)))/d+𝐄(h(X¯(k))Mgradf(𝒫(X¯(k))))/d\displaystyle\mathbf{E}M\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))/d+\mathbf{E}(\nabla h(\bar{X}^{(k)})-M\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)})))/d
+2(Idn𝐉)(𝐗(k)𝐗¯(k)).\displaystyle+2(I_{dn}-\mathbf{J})(\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}).

Noticing that Mgradf(𝒫(X¯(k)))=S(𝒫(X¯(k)))M\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))=S(\mathcal{P}(\bar{X}^{(k)})) and H(X¯(k))=S(X¯(k))+βQ(X¯(k))H(\bar{X}^{(k)})=S(\bar{X}^{(k)})+\beta Q(\bar{X}^{(k)}), we have

h(X¯(k))Mgradf(𝒫(X¯(k)))F\displaystyle\left\|\nabla h(\bar{X}^{(k)})-M\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))\right\|_{\mathrm{F}}\leq{} h(X¯(k))H(X¯(k))F+S(X¯(k))S(𝒫(X¯(k)))F\displaystyle\left\|\nabla h(\bar{X}^{(k)})-H(\bar{X}^{(k)})\right\|_{\mathrm{F}}+\left\|S(\bar{X}^{(k)})-S(\mathcal{P}(\bar{X}^{(k)}))\right\|_{\mathrm{F}} (4.16)
+βQ(X¯(k))F\displaystyle+\beta\left\|Q(\bar{X}^{(k)})\right\|_{\mathrm{F}}
\displaystyle\leq{} C6(X¯(k))MX¯(k)IpF+βQ(X¯(k))F,\displaystyle C_{6}\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|_{\mathrm{F}}+\beta\left\|Q(\bar{X}^{(k)})\right\|_{\mathrm{F}},

where C6:=C4Lg+σmin1/2(M)Ls>0C_{6}:=\sqrt{C_{4}}L_{g}+\sigma_{\min}^{-1/2}(M)L_{s}>0. According to Proposition 4.6, it follows that X¯(k)\bar{X}^{(k)}\in\mathcal{R}, and hence,

MX¯(k)((X¯(k))MX¯(k)Ip)F76σmax1/2(M)(X¯(k))MX¯(k)IpF.\left\|M\bar{X}^{(k)}((\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p})\right\|_{\mathrm{F}}\leq\sqrt{\dfrac{7}{6}}\sigma_{\max}^{1/2}(M)\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|_{\mathrm{F}}. (4.17)

Combining (4.16) with (4.17), we can obtain that

h(X¯(k))Mgradf(𝒫(X¯(k)))F(C6+76σmax1/2(M)β)(X¯(k))MX¯(k)IpF,\displaystyle\left\|\nabla h(\bar{X}^{(k)})-M\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))\right\|_{\mathrm{F}}\leq\left(C_{6}+\sqrt{\dfrac{7}{6}}\sigma_{\max}^{1/2}(M)\beta\right)\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|_{\mathrm{F}},

which further implies that

𝐗(𝐗(k),𝐔(k),𝐕(k))F\displaystyle\left\|\nabla_{\mathbf{X}}\hbar(\mathbf{X}^{(k)},\mathbf{U}^{(k)},\mathbf{V}^{(k)})\right\|_{\mathrm{F}}\leq{} ddσmax(M)gradf(𝒫(X¯(k)))F+2𝐗(k)𝐗¯(k)F\displaystyle\dfrac{\sqrt{d}}{d}\sigma_{\max}(M)\left\|\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)}))\right\|_{\mathrm{F}}+2\left\|\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)}\right\|_{\mathrm{F}} (4.18)
+dd(C6+76σmax1/2(M)β)(X¯(k))MX¯(k)IpF.\displaystyle+\dfrac{\sqrt{d}}{d}\left(C_{6}+\sqrt{\dfrac{7}{6}}\sigma_{\max}^{1/2}(M)\beta\right)\left\|(\bar{X}^{(k)})^{\top}M\bar{X}^{(k)}-I_{p}\right\|_{\mathrm{F}}.

Using a similar argument, we can proceed to prove that

𝐔(𝐗(k),𝐔(k),𝐕(k))F2ρ𝐔(k)𝐔¯(k)F,\left\|\nabla_{\mathbf{U}}\hbar(\mathbf{X}^{(k)},\mathbf{U}^{(k)},\mathbf{V}^{(k)})\right\|_{\mathrm{F}}\leq 2\rho\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|_{\mathrm{F}},\\ (4.19)

and

𝐕(𝐗(k),𝐔(k),𝐕(k))F2ρ𝐕(k)𝐕¯(k)F.\left\|\nabla_{\mathbf{V}}\hbar(\mathbf{X}^{(k)},\mathbf{U}^{(k)},\mathbf{V}^{(k)})\right\|_{\mathrm{F}}\leq 2\rho\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|_{\mathrm{F}}. (4.20)

Then from (4.18)-(4.20), we have

(𝐗(k),𝐔(k),𝐕(k))F(3σmax2(M)/d+3(C6+7/6σmax1/2(M)β)2/d+8ρ2+12)1/2r(k).\left\|\nabla\hbar(\mathbf{X}^{(k)},\mathbf{U}^{(k)},\mathbf{V}^{(k)})\right\|_{\mathrm{F}}\leq(3\sigma_{\max}^{2}(M)/d+3(C_{6}+\sqrt{7/6}\sigma_{\max}^{1/2}(M)\beta)^{2}/d+8\rho^{2}+12)^{1/2}r^{(k)}.

This completes the proof with CrC_{r} chosen as (3σmax2(M)/d+3(C6+7/6σmax1/2(M)β)2/d+8ρ2+12)1/2(3\sigma_{\max}^{2}(M)/d+3(C_{6}+\sqrt{7/6}\sigma_{\max}^{1/2}(M)\beta)^{2}/d+8\rho^{2}+12)^{1/2}. ∎

Let 𝐙(k):=(𝐗(k),𝐔(k),𝐕(k))\mathbf{Z}^{(k)}:=(\mathbf{X}^{(k)},\mathbf{U}^{(k)},\mathbf{V}^{(k)}). The following lemma reveals that the distance between 𝐙(k+1)\mathbf{Z}^{(k+1)} and 𝐙(k)\mathbf{Z}^{(k)} can be controlled by r(k)r^{(k)}.

Lemma 4.11.

Suppose Assumption 1 and Assumption 2 hold. Then with the same conditions as Theorem 4.9, there exists Cz>0C_{z}>0 such that

𝐙(k+1)𝐙(k)FCzr(k),\left\|\mathbf{Z}^{(k+1)}-\mathbf{Z}^{(k)}\right\|_{\mathrm{F}}\leq C_{z}r^{(k)},

for any kk\in\mathbb{N}.

Proof.

It follows from the equality (4.1) that

𝐗(k+1)𝐗(k)=\displaystyle\mathbf{X}^{(k+1)}-\mathbf{X}^{(k)}={} (𝐖Idn)(𝐗(k)𝐗¯(k))η𝐖(𝐇(k)𝐄H(X¯(k)))\displaystyle(\mathbf{W}-I_{dn})(\mathbf{X}^{(k)}-\bar{\mathbf{X}}^{(k)})-\eta\mathbf{W}(\mathbf{H}^{(k)}-\mathbf{E}H(\bar{X}^{(k)}))
η𝐄(S(X¯(k))S(𝒫(X¯(k))))η𝐄Q(X¯(k))η𝐄Mgradf(𝒫(X¯(k))),\displaystyle-\eta\mathbf{E}(S(\bar{X}^{(k)})-S(\mathcal{P}(\bar{X}^{(k)})))-\eta\mathbf{E}Q(\bar{X}^{(k)})-\eta\mathbf{E}M\mathrm{grad}\,f(\mathcal{P}(\bar{X}^{(k)})),

which further yields that

𝐗(k+1)𝐗(k)FC~zr(k),\left\|\mathbf{X}^{(k+1)}-\mathbf{X}^{(k)}\right\|_{\mathrm{F}}\leq\tilde{C}_{z}r^{(k)},

with C~z:=2+3(C2+C3β2)+ηd(C6+7/6σmax1/2(M)β+σmax(M))>0\tilde{C}_{z}:=2+\sqrt{3(C_{2}+C_{3}\beta^{2})}+\eta\sqrt{d}(C_{6}+\sqrt{7/6}\sigma_{\max}^{1/2}(M)\beta+\sigma_{\max}(M))>0. Moreover, we have

𝐔(k+1)𝐔(k)F2𝐔(k)𝐔¯(k)F+Lc𝐗(k+1)𝐗(k)F,\left\|\mathbf{U}^{(k+1)}-\mathbf{U}^{(k)}\right\|_{\mathrm{F}}\leq 2\left\|\mathbf{U}^{(k)}-\bar{\mathbf{U}}^{(k)}\right\|_{\mathrm{F}}+L_{c}\left\|\mathbf{X}^{(k+1)}-\mathbf{X}^{(k)}\right\|_{\mathrm{F}},

and

𝐕(k+1)𝐕(k)F2𝐕(k)𝐕¯(k)F+Lc𝐗(k+1)𝐗(k)F.\left\|\mathbf{V}^{(k+1)}-\mathbf{V}^{(k)}\right\|_{\mathrm{F}}\leq 2\left\|\mathbf{V}^{(k)}-\bar{\mathbf{V}}^{(k)}\right\|_{\mathrm{F}}+L_{c}\left\|\mathbf{X}^{(k+1)}-\mathbf{X}^{(k)}\right\|_{\mathrm{F}}.

The above three inequalities imply that

𝐙(k+1)𝐙(k)F((1+4Lc2)C~z2+16)1/2r(k),\left\|\mathbf{Z}^{(k+1)}-\mathbf{Z}^{(k)}\right\|_{\mathrm{F}}\leq\left(\left(1+4L_{c}^{2}\right)\tilde{C}_{z}^{2}+16\right)^{1/2}r^{(k)},

which completes the proof with CzC_{z} chosen as ((1+4Lc2)C~z2+16)1/2((1+4L_{c}^{2})\tilde{C}_{z}^{2}+16)^{1/2}. ∎

With Lemma 4.10 and Lemma 4.11, we establish the convergence of the sequence {𝐗(k)}\{\mathbf{X}^{(k)}\} generated by Algorithm 1 when \hbar is a KŁ function, as stated in the following theorem.

Theorem 4.12.

Suppose that \hbar is a KŁ function. Then with the same conditions as Theorem 4.9, there exists a first-order stationary point X𝒮Mn,pX^{\ast}\in\mathcal{S}_{M}^{n,p} of the problem (1.1) such that the sequence {𝐗(k)}\{\mathbf{X}^{(k)}\} converges to (𝟏dIn)X(\mathbf{1}_{d}\otimes I_{n})X^{\ast}.

Proof.

According to Proposition 4.8, the sequence {(k)}\{\hbar^{(k)}\} is nonincreasing and has a lower bound, which implies that the limit :=limk(k)\hbar^{\circ}:=\lim_{k\to\infty}\hbar^{(k)} exists. Let Ω\Omega be the set of all the accumulation points of the sequence {𝐙(k)}\{\mathbf{Z}^{(k)}\}. For any 𝐙:=(𝐗,𝐔,𝐕)Ω\mathbf{Z}^{\circ}:=(\mathbf{X}^{\circ},\mathbf{U}^{\circ},\mathbf{V}^{\circ})\in\Omega, we have

(𝐙)=limk(k)=.\hbar(\mathbf{Z}^{\circ})=\lim_{k\to\infty}\hbar^{(k)}=\hbar^{\circ}. (4.21)

Thus, \hbar is equal to the constant \hbar^{\circ} on Ω\Omega. If there exists k¯\bar{k}\in\mathbb{N} such that (k¯)=\hbar^{(\bar{k})}=\hbar^{\circ}, the direct combination of Proposition 4.8 and Lemma 4.11 would imply that 𝐙(k¯+1)=𝐙(k¯)\mathbf{Z}^{(\bar{k}+1)}=\mathbf{Z}^{(\bar{k})}. Then a trivial induction shows that the assertion of this theorem holds. Since {(k)}\{\hbar^{(k)}\} is a nonincreasing sequence, it is clear from (4.21) that (k)>\hbar^{(k)}>\hbar^{\circ} for any kk\in\mathbb{N}. Moreover, for any τ>0\tau>0, there exists kk^{\prime}\in\mathbb{N} such that (k)<+τ\hbar^{(k)}<\hbar^{\circ}+\tau for all k>kk>k^{\prime}. According to Lemma 5 in [7], we know that Ω\Omega is a compact and connected set and

limkdist(𝐙(k),Ω)=0.\lim_{k\to\infty}\mathrm{dist}(\mathbf{Z}^{(k)},\Omega)=0.

Hence, for any α>0\alpha>0, there exists k′′k^{\prime\prime} such that dist(𝐙(k),Ω)<α\mathrm{dist}(\mathbf{Z}^{(k)},\Omega)<\alpha for all k>k′′k>k^{\prime\prime}. Summing up all these facts, we can obtain that 𝐙(k)\mathbf{Z}^{(k)} belongs to the intersection of {𝐙dist(𝐙,Ω)<α}\{\mathbf{Z}\mid\mathrm{dist}(\mathbf{Z},\Omega)<\alpha\} and {𝐙<(𝐙)<+τ}\{\mathbf{Z}\mid\hbar^{\circ}<\hbar(\mathbf{Z})<\hbar^{\circ}+\tau\} for any k>max{k,k′′}k>\max\{k^{\prime},k^{\prime\prime}\}. By Lemma 6 in [7], there exist a constant τ>0\tau>0 and a function ϕΦτ\phi\in\Phi_{\tau} such that

ϕ((k))(𝐙(k))F1,\phi^{\prime}(\hbar^{(k)}-\hbar^{\circ})\left\|\nabla\hbar(\mathbf{Z}^{(k)})\right\|_{\mathrm{F}}\geq 1,

for any k>max{k,k′′}k>\max\{k^{\prime},k^{\prime\prime}\}. Multiplying both sides of (4.15) by ϕ((k))\phi^{\prime}(\hbar^{(k)}-\hbar^{\circ}) yields that

Ceϕ((k))(r(k))2ϕ((k))((k)(k+1))ϕ((k))ϕ((k+1)),C_{e}\phi^{\prime}(\hbar^{(k)}-\hbar^{\circ})(r^{(k)})^{2}\leq\phi^{\prime}(\hbar^{(k)}-\hbar^{\circ})(\hbar^{(k)}-\hbar^{(k+1)})\leq\phi(\hbar^{(k)}-\hbar^{\circ})-\phi(\hbar^{(k+1)}-\hbar^{\circ}),

where the second inequality follows from the concavity of ϕ\phi. Combining the above relationship with Lemma 4.10 and Lemma 4.11, we have

ϕ((k))ϕ((k+1))CeCrr(k)ϕ((k))(𝐙(k))FCeCrCz𝐙(k+1)𝐙(k)F,\phi(\hbar^{(k)}-\hbar^{\circ})-\phi(\hbar^{(k+1)}-\hbar^{\circ})\geq\dfrac{C_{e}}{C_{r}}r^{(k)}\phi^{\prime}(\hbar^{(k)}-\hbar^{\circ})\left\|\nabla\hbar(\mathbf{Z}^{(k)})\right\|_{\mathrm{F}}\geq\dfrac{C_{e}}{C_{r}C_{z}}\left\|\mathbf{Z}^{(k+1)}-\mathbf{Z}^{(k)}\right\|_{\mathrm{F}},

which further implies that

k=0s𝐙(k+1)𝐙(k)FCrCzCe(ϕ((0))ϕ((s+1)))CrCzCeϕ((0)),\sum_{k=0}^{s}\left\|\mathbf{Z}^{(k+1)}-\mathbf{Z}^{(k)}\right\|_{\mathrm{F}}\leq\dfrac{C_{r}C_{z}}{C_{e}}\left(\phi(\hbar^{(0)}-\hbar^{\circ})-\phi(\hbar^{(s+1)}-\hbar^{\circ})\right)\leq\dfrac{C_{r}C_{z}}{C_{e}}\phi(\hbar^{(0)}-\hbar^{\circ}),

for any ss\in\mathbb{N}. Letting ss\to\infty, we obtain that

k=0𝐙(k+1)𝐙(k)F<.\sum_{k=0}^{\infty}\left\|\mathbf{Z}^{(k+1)}-\mathbf{Z}^{(k)}\right\|_{\mathrm{F}}<\infty.

Hence, the iterate sequence {𝐙(k)}\{\mathbf{Z}^{(k)}\} is a Cauchy sequence and hence convergent, which implies that {𝐗(k)}\{\mathbf{X}^{(k)}\} is also convergent. Finally, Theorem 4.9 guarantees that the limit point of {𝐗(k)}\{\mathbf{X}^{(k)}\} has the form (𝟏dIn)X(\mathbf{1}_{d}\otimes I_{n})X^{\ast}, where X𝒮Mn,pX^{\ast}\in\mathcal{S}_{M}^{n,p} is a first-order stationary point of the problem (1.1). This completes the proof. ∎

When \hbar is a semialgebraic function, one important result is that the desingularizing function can be chosen to be of the form

ϕ(t)=ct1θ,\phi(t)=ct^{1-\theta},

where c>0c>0 is a constant and θ[0,1)\theta\in[0,1) is a parameter impacting the convergence rate. Let 𝐙\mathbf{Z}^{\circ} be the limit point of the sequence {𝐙(k)}\{\mathbf{Z}^{(k)}\}. Using the same line of analysis introduced in [2], we can obtain the following estimates of the convergence rate.

  1. (i)

    If θ=0\theta=0, the sequence {𝐙(k)}\{\mathbf{Z}^{(k)}\} converges in a finite number of steps.

  2. (ii)

    If θ(0,1/2]\theta\in(0,1/2], there exist μ>0\mu>0 and ω(0,1)\omega\in(0,1) such that

    𝐙(k)𝐙Fμωk.\left\|\mathbf{Z}^{(k)}-\mathbf{Z}^{\circ}\right\|_{\mathrm{F}}\leq\mu\omega^{k}.
  3. (iii)

    If θ(1/2,1)\theta\in(1/2,1), there exists μ>0\mu>0 such that

    𝐙(k)𝐙Fμk1θ2θ1.\left\|\mathbf{Z}^{(k)}-\mathbf{Z}^{\circ}\right\|_{\mathrm{F}}\leq\mu k^{-\frac{1-\theta}{2\theta-1}}.
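These three regimes can be summarized as explicit rate envelopes. The sketch below, with hypothetical constants μ and ω, maps a KŁ exponent θ to the corresponding upper bound on ‖𝐙(k) − 𝐙∘‖_F at iteration k:

```python
def kl_rate_envelope(theta, k, mu=1.0, omega=0.5):
    # Upper bound on ||Z^(k) - Z*||_F implied by the KL exponent theta;
    # mu and omega are hypothetical constants from the analysis.
    if theta == 0.0:
        return 0.0 if k >= 1 else mu          # (i) finite-step convergence
    if theta <= 0.5:
        return mu * omega ** k                # (ii) linear (geometric) rate
    return mu * k ** (-(1.0 - theta) / (2.0 * theta - 1.0))  # (iii) sublinear rate
```

Note how the sublinear exponent (1−θ)/(2θ−1) degrades as θ approaches 1.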

5 Numerical Experiments

In this section, we conduct a series of numerical experiments to demonstrate the efficiency and effectiveness of CDADT, specifically focusing on the CCA problems (1.2). The corresponding experiments are performed on a workstation with dual Intel Xeon Gold 6242R CPU processors (at 3.10 GHz, 20 cores each) and 510 GB of RAM under Ubuntu 20.04. The tested algorithms are implemented in the 𝙿𝚢𝚝𝚑𝚘𝚗\mathtt{Python} language with the communication realized by the 𝚖𝚙𝚒𝟺𝚙𝚢\mathtt{mpi4py} package.

In the numerical experiments, the following three quantities are collected and recorded at each iteration as performance metrics.

  • Stationarity violation: $\left\|\bar{U}^{(k)}-\bar{V}^{(k)}\mathrm{sym}\left((\bar{X}^{(k)})^{\top}\bar{U}^{(k)}\right)\right\|_{\mathrm{F}}$.

  • Consensus error: $\sum_{i=1}^{d}\left\|X_{i}^{(k)}-\bar{X}^{(k)}\right\|_{\mathrm{F}}/d$.

  • Feasibility violation: $\left\|(\bar{X}^{(k)})^{\top}\bar{V}^{(k)}-I_{p}\right\|_{\mathrm{F}}$.

Furthermore, we generate the Metropolis constant edge weight matrix [29] as the mixing matrix for the tested networks.
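For concreteness, the Metropolis constant edge weight matrix of [29] can be formed as in the following minimal NumPy sketch (the function and variable names are ours): each edge $(i,j)$ receives weight $1/(1+\max\{\deg_i,\deg_j\})$, and the diagonal absorbs the remaining mass so that every row sums to one.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis constant edge weight matrix for an undirected graph.

    adj: (d, d) symmetric 0/1 adjacency matrix with zero diagonal.
    Each edge (i, j) gets weight 1 / (1 + max(deg_i, deg_j)); the
    diagonal entry absorbs the rest so every row sums to one.
    """
    d = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# A 4-agent ring network: every node has degree 2, so each edge
# weight is 1/3 and each diagonal entry is 1/3 as well.
ring = np.array([[0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0]])
W = metropolis_weights(ring)
```

The resulting matrix is symmetric and doubly stochastic, which is exactly the property required of the mixing matrix in the consensus step.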

5.1 Numerical Results on Synthetic Datasets

The first experiment is to evaluate the performance of CDADT on synthetic datasets. Specifically, in the CCA problem (1.2), the first data matrix $A\in\mathbb{R}^{n\times q}$ (assuming $n\leq q$ without loss of generality) is generated by its (economy-form) singular value decomposition as follows,

A=USV^{\top}, (5.1)

where $U\in\mathbb{R}^{n\times n}$ is an orthogonal matrix and $V\in\mathbb{R}^{q\times n}$ has orthonormal columns, both obtained by orthonormalizing randomly generated matrices, and $S\in\mathbb{R}^{n\times n}$ is a diagonal matrix with diagonal entries

S(i,i)=\xi_{A}^{i},\quad i\in[n], (5.2)

for a parameter $\xi_{A}\in(0,1)$ that determines the decay rate of the singular values of $A$. The second data matrix $B\in\mathbb{R}^{m\times q}$ is generated in a similar manner but with a different decay rate $\xi_{B}\in(0,1)$ of the singular values. After construction, the columns of the data matrices $A$ and $B$ are uniformly distributed across the $d$ agents.
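The construction (5.1)–(5.2) can be sketched in a few lines of NumPy; the function name and the use of QR factorizations to orthonormalize random Gaussian matrices are our own choices.

```python
import numpy as np

def generate_data_matrix(n, q, xi, seed=0):
    """Generate A = U S V^T as in (5.1)-(5.2), with S(i, i) = xi^i."""
    rng = np.random.default_rng(seed)
    # Orthonormalize randomly generated matrices via QR factorization.
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))  # n x n, orthogonal
    V, _ = np.linalg.qr(rng.standard_normal((q, n)))  # q x n, orthonormal columns
    s = xi ** np.arange(1, n + 1)                     # xi^1, ..., xi^n
    return (U * s) @ V.T                              # equals U @ diag(s) @ V.T

# The setting of the first experiment: n = 20, m = 30, q = 3200.
A = generate_data_matrix(20, 3200, 0.97)
B = generate_data_matrix(30, 3200, 0.96)
```

By construction, the singular values of $A$ are exactly $\xi_{A},\xi_{A}^{2},\dots,\xi_{A}^{n}$, so the parameter $\xi_{A}$ directly controls the conditioning of the data.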

We test the performance of CDADT with different choices of the penalty parameter on the Erdös-Rényi (ER) network. The data matrices $A\in\mathbb{R}^{n\times q}$ and $B\in\mathbb{R}^{m\times q}$ are randomly generated with $n=20$, $m=30$, $q=3200$, $\xi_{A}=0.97$, and $\xi_{B}=0.96$. The CCA problem (1.2) is then tested with $p=5$ and $d=32$. The corresponding numerical results are provided in Figure 1, which presents the performance of CDADT with $\beta\in\{0.01,0.1,1,10,100\}$. It can be observed that the curves of the three performance metrics almost coincide with each other, which corroborates the robustness of CDADT with respect to the penalty parameter over a wide range.

Figure 1: Numerical performances of CDADT for different values of $\beta$: (a) stationarity violation; (b) consensus error; (c) feasibility violation.

Furthermore, we conduct numerical tests to evaluate the impact of network topologies on the performance of CDADT, focusing on ring networks, grid networks, and ER networks. Figure 2 illustrates the structures of these networks and the corresponding values of $\lambda$ defined in (2.1). For this experiment, the data matrices $A\in\mathbb{R}^{n\times q}$ and $B\in\mathbb{R}^{m\times q}$ are randomly generated with $n=m=50$, $q=3200$, $\xi_{A}=0.99$, and $\xi_{B}=0.98$. Then the CCA problem (1.2) is tested with $p=5$ and $d=16$. We set the algorithmic parameters $\eta=0.0001$ and $\beta=1$ in CDADT. Figure 3 depicts the diminishing trend of the three performance metrics against the communication rounds on a logarithmic scale, with different networks distinguished by colors. We can observe that, as the network connectivity deteriorates (i.e., as $\lambda$ approaches $1$), our algorithm requires more communication rounds to achieve the specified accuracy.

Figure 2: Illustration of different network structures: (a) ER ($\lambda\approx 0.82$); (b) grid ($\lambda\approx 0.87$); (c) ring ($\lambda\approx 0.95$).

Figure 3: Numerical performances of CDADT on synthetic datasets across three different networks: (a) stationarity violation; (b) consensus error; (c) feasibility violation.

5.2 Numerical Results on Real-world Datasets

Next, we conduct a numerical test to assess the effectiveness of CDADT on two real-world datasets, MNIST [21] and Mediamill [30]. Specifically, MNIST is a database of handwritten digits, consisting of gray-scale images of size $28\times 28$. Every image is split into left and right halves, which are used as the two views with $n=m=392$. We employ CCA to learn the correlated representations between the left and right halves of the images, which involves the full training set of MNIST containing $q=60000$ images. In the Mediamill dataset, each image is a representative keyframe of a video shot containing $n=120$ features and annotated with $m=101$ labels. We extract the first $q=43200$ samples to test the CCA problem, which is performed to explore the correlation structure between images and labels.
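The two-view construction from MNIST described above amounts to a simple array split; the sketch below uses a random placeholder batch in place of the actual MNIST images.

```python
import numpy as np

# Placeholder for q flattened 28 x 28 gray-scale MNIST images (one per row);
# in the actual experiment q = 60000.
q = 100
images = np.random.rand(q, 28 * 28)

# Split every image into its left and right halves, yielding two views
# with n = m = 28 * 14 = 392 features each.
imgs = images.reshape(q, 28, 28)
view_left = imgs[:, :, :14].reshape(q, 392)    # view 1: left halves
view_right = imgs[:, :, 14:].reshape(q, 392)   # view 2: right halves
```

The columns of the two resulting data matrices (one sample per column after transposition) are then distributed uniformly across the agents, as in the synthetic setting.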

For our testing, we employ CDADT to solve the CCA problem (1.2) on the ER network, where we fix $p=5$ and $d=32$. The algorithmic parameters are set to $\eta=0.005$ and $\beta=1$ in CDADT. The corresponding numerical results are presented in Figure 4. It is noteworthy that the effectiveness of CDADT is not limited to synthetic datasets but also extends to real-world applications. Moreover, CDADT demonstrates its potential to solve CCA problems over large-scale networks.

Figure 4: Numerical performances of CDADT on two real-world datasets: (a) MNIST; (b) Mediamill.

6 Conclusion

Decentralized optimization problems with generalized orthogonality constraints arise in various scientific and engineering applications. Existing algorithms for Riemannian optimization problems rely heavily on geometric tools of the underlying Riemannian manifold, such as tangent spaces and retraction operators. However, these algorithms cannot be applied to solve (1.3), where the manifold constraints possess an inherently distributed structure. To surmount this challenge, we propose to employ the constraint dissolving operator to construct an exact penalty model of the original problem. Nevertheless, existing algorithms remain unsuitable for solving the newly derived penalty model, as the penalty function is not entirely separable over the agents. To address these challenges, we develop an efficient decentralized algorithm based on a double-tracking strategy. In order to construct a descent direction of the penalty function, our algorithm not only tracks the gradient of the objective function but also maintains a global estimate of the Jacobian of the constraint mapping. The proposed algorithm is guaranteed to converge to a first-order stationary point under mild conditions. We validate its performance through numerical experiments conducted on both synthetic and real-world datasets.

As for future work, we are interested in extending our algorithm to more general constraints so that it can find a wider range of applications. Moreover, it is worthwhile to investigate the performance of CDADT in stochastic and online settings.

Acknowledgement

The third author was supported in part by the National Natural Science Foundation of China (12125108, 12226008, 11991021, 11991020, 12021001, 12288201), Key Research Program of Frontier Sciences, Chinese Academy of Sciences (ZDBS-LY-7022), and CAS–Croucher Funding Scheme for Joint Laboratories “CAS AMSS–PolyU Joint Laboratory of Applied Mathematics: Nonlinear Optimization Theory, Algorithms and Applications”.

References

  • Absil et al. [2008] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2008. ISBN 9781400830244.
  • Attouch and Bolte [2009] Hedy Attouch and Jérôme Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116:5–16, 2009.
  • Attouch et al. [2010] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
  • Attouch et al. [2013] Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: Proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
  • Bajovic et al. [2017] Dragana Bajovic, Dusan Jakovetic, Natasa Krejic, and Natasa Krklec Jerinkic. Newton-like method with diagonal correction for distributed optimization. SIAM Journal on Optimization, 27(2):1171–1203, 2017.
  • Bolte et al. [2007] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
  • Bolte et al. [2014] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.
  • Chang et al. [2015] Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, 2015.
  • Chang et al. [2020] Tsung-Hui Chang, Mingyi Hong, Hoi-To Wai, Xinwei Zhang, and Songtao Lu. Distributed learning in the nonconvex world: From batch data to streaming and beyond. IEEE Signal Processing Magazine, 37(3):26–38, 2020.
  • Chen et al. [2023] Jun Chen, Haishan Ye, Mengmeng Wang, Tianxin Huang, Guang Dai, Ivor W Tsang, and Yong Liu. Decentralized Riemannian conjugate gradient method on the Stiefel manifold. arXiv:2308.10547, 2023.
  • Chen et al. [2021] Shixiang Chen, Alfredo Garcia, Mingyi Hong, and Shahin Shahrampour. Decentralized Riemannian gradient descent on the Stiefel manifold. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 1594–1605. PMLR, 2021.
  • Chen et al. [2010] Xin Chen, Changliang Zou, and R Dennis Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. The Annals of Statistics, 38(6):3696–3723, 2010.
  • Daneshmand et al. [2021] Amir Daneshmand, Gesualdo Scutari, Pavel Dvurechensky, and Alexander Gasnikov. Newton method over networks is fast up to the statistical precision. In Proceedings of the 38th International Conference on Machine Learning, pages 2398–2409. PMLR, 2021.
  • Deng and Hu [2023] Kangkang Deng and Jiang Hu. Decentralized projected Riemannian gradient method for smooth optimization on compact submanifolds. arXiv:2304.08241, 2023.
  • Dimarogonas et al. [2011] Dimos V Dimarogonas, Emilio Frazzoli, and Karl H Johansson. Distributed event-triggered control for multi-agent systems. IEEE Transactions on Automatic Control, 57(5):1291–1297, 2011.
  • Du and Wang [2024] Jinye Du and Qihua Wang. Empirical likelihood inference over decentralized networks. arXiv:2401.12836, 2024.
  • Gao and Ma [2023] Sheng Gao and Zongming Ma. Sparse GCA and thresholded gradient descent. Journal of Machine Learning Research, 24(135):1–61, 2023.
  • Hajinezhad and Hong [2019] Davood Hajinezhad and Mingyi Hong. Perturbed proximal primal–dual algorithm for nonconvex nonsmooth optimization. Mathematical Programming, 176(1):207–245, 2019.
  • Hotelling [1936] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3–4):321–377, 1936.
  • Kurdyka [1998] Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier, 48(3):769–783, 1998.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Ling et al. [2015] Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro. DLM: Decentralized linearized alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(15):4051–4064, 2015.
  • Lojasiewicz [1963] Stanislaw Lojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les Équations aux Dérivées Partielles, 117:87–89, 1963.
  • Nedić and Ozdaglar [2009] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  • Nedić et al. [2018] Angelia Nedić, Alex Olshevsky, and Michael G Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953–976, 2018.
  • Pillai et al. [2005] S Unnikrishna Pillai, Torsten Suel, and Seunghun Cha. The Perron–Frobenius theorem: Some of its applications. IEEE Signal Processing Magazine, 22(2):62–75, 2005.
  • Qu and Li [2017] Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017.
  • Raja and Bajwa [2015] Haroon Raja and Waheed U Bajwa. Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data. IEEE Transactions on Signal Processing, 64(1):173–188, 2015.
  • Shi et al. [2015] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
  • Snoek et al. [2006] Cees GM Snoek, Marcel Worring, Jan C Van Gemert, Jan-Mark Geusebroek, and Arnold WM Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th ACM International Conference on Multimedia, pages 421–430, 2006.
  • Song et al. [2023] Zhuoqing Song, Lei Shi, Shi Pu, and Ming Yan. Optimal gradient tracking for decentralized optimization. Mathematical Programming, pages 1–53, 2023.
  • Sun et al. [2022] Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization, 32(2):354–385, 2022.
  • Sun et al. [2024] Youbang Sun, Shixiang Chen, Alfredo Garcia, and Shahin Shahrampour. Global convergence of decentralized retraction-free optimization on the Stiefel manifold. arXiv:2405.11590, 2024.
  • Wang et al. [2023] Jinxin Wang, Jiang Hu, Shixiang Chen, Zengde Deng, and Anthony Man-Cho So. Decentralized weakly convex optimization over the Stiefel manifold. arXiv:2303.17779, 2023.
  • Wang and Liu [2022] Lei Wang and Xin Liu. Decentralized optimization over the Stiefel manifold by an approximate augmented Lagrangian function. IEEE Transactions on Signal Processing, 70:3029–3041, 2022.
  • Wang and Liu [2023a] Lei Wang and Xin Liu. Smoothing gradient tracking for decentralized optimization over the Stiefel manifold with non-smooth regularizers. In Proceedings of the 62nd IEEE Conference on Decision and Control (CDC), pages 126–132. IEEE, 2023a.
  • Wang and Liu [2023b] Lei Wang and Xin Liu. A variance-reduced stochastic gradient tracking algorithm for decentralized optimization with orthogonality constraints. Journal of Industrial and Management Optimization, 19(10):7753–7776, 2023b.
  • Wang et al. [2024] Lei Wang, Le Bao, and Xin Liu. A decentralized proximal gradient tracking algorithm for composite optimization on Riemannian manifolds. arXiv:2401.11573, 2024.
  • Xiao and Boyd [2004] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.
  • Xiao et al. [2024] Nachuan Xiao, Xin Liu, and Kim-Chuan Toh. Dissolving constraints for Riemannian optimization. Mathematics of Operations Research, 49(1):366–397, 2024.
  • Xu et al. [2015] Jinming Xu, Shanying Zhu, Yeng Chai Soh, and Lihua Xie. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In Proceedings of the 54th IEEE Conference on Decision and Control (CDC), pages 2055–2060. IEEE, 2015.
  • Yuan et al. [2021] Kun Yuan, Yiming Chen, Xinmeng Huang, Yingya Zhang, Pan Pan, Yinghui Xu, and Wotao Yin. DecentLaM: Decentralized momentum SGD for large-batch deep training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3029–3039, 2021.
  • Zhang et al. [2021] Jiaojiao Zhang, Qing Ling, and Anthony Man-Cho So. A Newton tracking algorithm with exact linear convergence for decentralized consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 7:346–358, 2021.
  • Zhang et al. [2024] Siyuan Zhang, Nachuan Xiao, and Xin Liu. Decentralized stochastic subgradient methods for nonsmooth nonconvex optimization. arXiv:2403.11565, 2024.
  • Zhu and Martínez [2010] Minghui Zhu and Sonia Martínez. Discrete-time dynamic average consensus. Automatica, 46(2):322–329, 2010.