
Decentralized Optimization Over the Stiefel Manifold by an Approximate Augmented Lagrangian Function

Lei Wang State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China ([email protected]). Research is supported by the National Natural Science Foundation of China (No. 11971466 and 11991020).    Xin Liu State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China ([email protected]). Research is supported in part by the National Natural Science Foundation of China (No. 12125108 and 11991021), Key Research Program of Frontier Sciences, Chinese Academy of Sciences (No. ZDBS-LY-7022).
Abstract

In this paper, we focus on the decentralized optimization problem over the Stiefel manifold, which is defined on a connected network of d agents. The objective is an average of d local functions, and each function is privately held by an agent and encodes its data. The agents can only communicate with their neighbors in a collaborative effort to solve this problem. Existing methods require multiple rounds of communications per iteration to guarantee convergence, giving rise to high communication costs. In contrast, this paper proposes a decentralized algorithm, called DESTINY, which invokes only a single round of communications per iteration. DESTINY combines gradient tracking techniques with a novel approximate augmented Lagrangian function. The global convergence to stationary points is rigorously established. Comprehensive numerical experiments demonstrate that DESTINY has strong potential to deliver cutting-edge performance in solving a variety of testing problems.

1 Introduction

This paper is dedicated to providing an efficient approach to decentralized optimization with orthogonality constraints, a problem defined on a connected network and solved by d agents cooperatively.

\min\limits_{X\in\mathbb{R}^{n\times p}} \quad f(X):=\dfrac{1}{d}\sum\limits_{i=1}^{d}f_{i}(X) (1.1)
\mathrm{s.\,t.} \quad X^{\top}X=I_{p},

where I_{p} is the p\times p identity matrix, and f_{i}:\mathbb{R}^{n\times p}\to\mathbb{R} is a local function privately known by the i-th agent. The set of all n\times p orthogonal matrices is referred to as the Stiefel manifold [37], which is denoted by \mathcal{S}_{n,p}=\{X\in\mathbb{R}^{n\times p}\mid X^{\top}X=I_{p}\}. Throughout this paper, we make the following blanket assumptions about these local functions.

Assumption 1.

Each local objective function f_{i} is smooth and its gradient \nabla f_{i} is locally Lipschitz continuous.

We consider the scenario in which the agents can only exchange information with their immediate neighbors through the network, which is modeled as a connected undirected graph \mathtt{G}=(\mathtt{V},\mathtt{E}). Here, \mathtt{V}=[d]:=\{1,2,\dotsc,d\} is composed of all the agents and \mathtt{E}=\{(i,j)\mid i\text{ and }j\text{ are connected}\} denotes the set of all the communication links. Let X_{i}\in\mathbb{R}^{n\times p} be the local copy of the common variable X kept by the i-th agent. Then the optimization problem (1.1) over the Stiefel manifold can be equivalently recast as the following decentralized formulation.

\min\limits_{\{X_{i}\}_{i=1}^{d}} \quad \dfrac{1}{d}\sum\limits_{i=1}^{d}f_{i}(X_{i}) (1.2a)
\mathrm{s.\,t.} \quad X_{i}=X_{j},\quad(i,j)\in\mathtt{E}, (1.2b)
\quad\;\; X_{i}\in\mathcal{S}_{n,p},\quad i\in[d]. (1.2c)

Since the graph \mathtt{G} is assumed to be connected, the constraints (1.2b) enforce the consensus condition X_{1}=X_{2}=\dotsb=X_{d}, which guarantees the equivalence between (1.1) and (1.2). Due to the nonconvexity of Stiefel manifolds, it is challenging to design efficient algorithms to solve (1.2).

1.1 Broad applications

Problems of the form (1.1) that require decentralized computations arise widely in various scientific and engineering areas. We briefly describe several representative applications below.

Application 1: Principal Component Analysis.

Principal component analysis (PCA) is a basic and ubiquitous tool for dimensionality reduction, serving as a critical procedure or preprocessing step in numerous statistical and machine learning tasks. Mathematically, PCA amounts to solving the following optimization problem.

\min\limits_{X\in\mathbb{R}^{n\times p}} \quad -\dfrac{1}{2d}\sum\limits_{i=1}^{d}\mathrm{tr}\left(X^{\top}A_{i}A_{i}^{\top}X\right) (1.3)
\mathrm{s.\,t.} \quad X\in\mathcal{S}_{n,p},

where A=[A_{1},\dotsc,A_{d}]\in\mathbb{R}^{n\times m} is the data matrix consisting of m samples with n features. In the decentralized setting, A_{i}\in\mathbb{R}^{n\times m_{i}} represents the local data stored in the i-th agent such that m=m_{1}+\dotsb+m_{d}. This is a natural scenario since each sample is a column of A.
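To make the decentralized splitting concrete, the following Python sketch (our own illustration; the function and variable names are not from the paper) shows how an agent could evaluate its local PCA term in (1.3) and the corresponding Euclidean gradient from its private data block A_i alone.

import numpy as np

def local_pca_objective_and_grad(Ai, X):
    # Local PCA term of (1.3) for one agent holding the data block Ai (n x m_i):
    #   f_i(X) = -1/2 * tr(X^T Ai Ai^T X),  with gradient  -Ai (Ai^T X).
    AtX = Ai.T @ X                   # m_i x p
    fi = -0.5 * np.sum(AtX ** 2)     # equals -1/2 * tr(X^T Ai Ai^T X)
    grad_fi = -Ai @ AtX              # n x p
    return fi, grad_fi

# Toy usage: d agents, each holding a slice of the columns of A.
rng = np.random.default_rng(0)
n, p, d = 20, 3, 4
A_blocks = [rng.standard_normal((n, 15)) for _ in range(d)]
X = np.linalg.qr(rng.standard_normal((n, p)))[0]        # a point on the Stiefel manifold
f_val = np.mean([local_pca_objective_and_grad(Ai, X)[0] for Ai in A_blocks])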

Application 2: Orthogonal Least Squares Regression.

Orthogonal least squares regression (OLSR), proposed in [52], is a supervised learning method for feature extraction in linear discriminant analysis. Suppose we have m samples with n features drawn from p classes. Then OLSR identifies an orthogonal transformation X\in\mathcal{S}_{n,p} by solving the following problem.

\min\limits_{X\in\mathbb{R}^{n\times p}} \quad \dfrac{1}{d}\sum\limits_{i=1}^{d}\left\|C_{i}^{\top}X-D_{i}^{\top}\right\|^{2}_{\mathrm{F}} (1.4)
\mathrm{s.\,t.} \quad X\in\mathcal{S}_{n,p},

where C=[C_{1},\dotsc,C_{d}]\in\mathbb{R}^{n\times m} and D=[D_{1},\dotsc,D_{d}]\in\mathbb{R}^{p\times m} are two matrices constructed from the samples. Under the decentralized scenario, C_{i}\in\mathbb{R}^{n\times m_{i}} and D_{i}\in\mathbb{R}^{p\times m_{i}} are the local data and the corresponding class indicator matrix stored in the i-th agent, respectively. Here, m=m_{1}+\dotsb+m_{d}. As claimed in [52], enforcing orthogonality on the transformation matrices is conducive to preserving local structure and discriminant information.

Application 3: Sparse Dictionary Learning.

Sparse dictionary learning (SDL) is a classical unsupervised representation learning method, which finds wide applications in signal and image processing due to its powerful capability of exploiting low-dimensional structures in high-dimensional data. Motivated by the fact that maximizing a high-order norm promotes spikiness and sparsity simultaneously, the authors of [50] introduce the following \ell_{4}-norm maximization problem with orthogonality constraints for SDL.

\min\limits_{X\in\mathbb{R}^{n\times p}} \quad -\dfrac{1}{4d}\sum\limits_{i=1}^{d}\left\|X^{\top}B_{i}\right\|_{4}^{4} (1.5)
\mathrm{s.\,t.} \quad X\in\mathcal{S}_{n,p},

where B=[B_{1},\dotsc,B_{d}]\in\mathbb{R}^{n\times m} is the data matrix consisting of m samples with n features, and the \ell_{4}-norm is defined as \left\|Y\right\|_{4}^{4}=\sum_{i,j}Y(i,j)^{4}. In the decentralized setting, B_{i}\in\mathbb{R}^{n\times m_{i}} is the local data stored in the i-th agent such that m=m_{1}+\dotsb+m_{d}.
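As an illustration of the decentralized data splitting here (our own sketch, not code from [50]), the local gradient of the \ell_{4} term in (1.5) admits a simple closed form, namely \nabla f_{i}(X)=-B_{i}\big((X^{\top}B_{i})^{\circ 3}\big)^{\top} with \circ denoting the entrywise power; in Python this reads as follows.

import numpy as np

def local_sdl_objective_and_grad(Bi, X):
    # Local term of the l4-maximization model (1.5) for the data block Bi (n x m_i).
    Y = X.T @ Bi                   # p x m_i
    fi = -0.25 * np.sum(Y ** 4)    # -1/4 * ||X^T Bi||_4^4
    grad_fi = -Bi @ (Y ** 3).T     # n x p, entrywise cube inside
    return fi, grad_fi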

1.2 Literature survey

Recent years have witnessed the development of algorithms for optimization problems over the Stiefel manifold, including gradient descent approaches [24, 28, 1, 3], conjugate gradient approaches [9, 33, 53], constraint preserving updating schemes [41, 20], Newton methods [19, 18], trust-region methods [2], multipliers correction frameworks [13, 38], augmented Lagrangian methods [14], sequential linearized proximal gradient algorithms [45], and so on. Moreover, exact penalty models have been constructed for this special type of problem, such as PenC [44] and ExPen [43], and efficient infeasible algorithms can be designed based on these two models. Unfortunately, none of the above algorithms takes decentralized optimization into account.

Recently, several distributed ADMM algorithms designed for centralized networks have emerged to solve various variants of PCA problems [39, 40], which pursue a consensus on the subspaces spanned by the splitting variables. This strategy relaxes the feasibility restrictions and significantly improves the convergence. However, these algorithms cannot be straightforwardly extended to the decentralized setting.

Decentralized optimization in the Euclidean space has been adequately investigated in recent decades. Most existing algorithms adapt classical Euclidean methods to the decentralized setting. For instance, in the decentralized gradient descent (DGD) method [27], each agent combines the average of its neighbors with a local negative gradient step. DGD can only converge to a neighborhood of a solution with a constant stepsize, and it has to employ diminishing stepsizes to obtain an exact solution at the cost of slow convergence [48, 49]. To cope with this speed-accuracy dilemma of DGD, EXTRA [36] corrects the convergence error by introducing two different mixing matrices (to be defined in Subsection 2.1) for two consecutive iterations and leveraging their difference to update local variables. And decentralized gradient tracking (DGT) algorithms [46, 30, 26] introduce an auxiliary variable to maintain a local estimate of the global gradient descent direction. Both EXTRA and DGT permit the use of a constant stepsize to attain an exact solution without sacrificing the convergence rate. The above three types of algorithms are originally tailored for unconstrained problems. Their constrained versions, such as [31, 8, 35], are still incapable of solving (1.2) since they can only deal with convex constraints. Another category of methods is based on the primal-dual framework, including decentralized augmented Lagrangian methods [17, 16] and decentralized ADMM methods [4, 22, 23]. In the presence of nonconvex constraints, these algorithms are not applicable to (1.2) either.

A considerable amount of effort has been devoted to developing algorithms for decentralized PCA problems, a special case of (1.2). For instance, a decentralized ADMM algorithm is proposed in [34] based on the low-rank matrix approximation model. And [12] develops a decentralized algorithm called DSA, together with its accelerated version ADSA, by applying a DGD-like update formula to an early work on training neural networks [32]. The linear convergence of DSA to a neighborhood of a solution is established in [11], but ADSA does not yet have theoretical guarantees. Moreover, [10] further introduces a decentralized version of an early work on online PCA [21], which enjoys exact and linear convergence. Recently, [47] combines gradient tracking techniques with subspace iterations to develop an exact algorithm, DeEPCA, with a linear convergence rate. Generally speaking, it is intractable to extend these algorithms to generic cases.

In a word, none of the aforementioned approaches can be employed to solve (1.2). In fact, the investigation of decentralized optimization over the Stiefel manifold is relatively limited. To the best of our knowledge, there is only one work [5] targeting this specific type of problem, which extends DGD and DGT from the Euclidean space to the Stiefel manifold. The resulting algorithms are called DRSGD and DRGTA, respectively. Analogous to the Euclidean setting, DRGTA can achieve exact convergence with constant stepsizes, but DRSGD cannot. For these two algorithms, a communication bottleneck arises from multiple consensus steps, namely, multiple rounds of communications per iteration, which makes it difficult for them to find practical application in large-scale networks. As illustrated in [6], one local consensus step requires approximately one-third or one-half of the communication overhead of a global averaging operation under a certain practical scenario. Consequently, multiple consensus steps incur intolerably high communication budgets, thereby hindering the scalability of decentralized algorithms. Furthermore, an extra consensus stepsize, which is less than 1, is introduced in DRSGD and DRGTA, significantly slowing down the convergence. In fact, the above two requirements are imposed to control the consensus error on the Stiefel manifold, whose nonconvexity makes it enormously difficult to seek a consensus on it.

1.3 Contributions

This paper develops an infeasible decentralized approach, called DESTINY, for the problem (1.2), which invokes only a single round of communications per iteration and gets rid of small consensus stepsizes. These two goals are achieved with the help of a novel approximate augmented Lagrangian function. It enables us to use unconstrained algorithms to solve the problem (1.2) and hence helps us overcome the obstacle of reaching a consensus on the Stiefel manifold. The potential of this function is not limited to the decentralized setting: it can also serve as a penalty function for optimization problems over the Stiefel manifold, which is of independent interest.

Although DESTINY is based on the gradient tracking technique, existing convergence results are not applicable. We rigorously establish the global convergence to stationary points, together with a worst-case complexity bound, under rather mild conditions. All of the theoretical analysis applies to the special case of d=1. Therefore, as a by-product, DESTINY boils down to a new algorithm in the non-distributed setting.

Preliminary numerical experiments, conducted in a distributed environment, validate the effectiveness of our approximate augmented Lagrangian function and demonstrate its robustness to the choice of penalty parameter. In addition, for the first time, we find that BB stepsizes can accelerate the convergence of decentralized algorithms for optimization problems over the Stiefel manifold. Finally, extensive experimental results indicate that our algorithm requires far fewer rounds of communications than existing peer methods.

1.4 Notations

The Euclidean inner product of two matrices Y_{1},Y_{2} of the same size is defined as \left\langle Y_{1},Y_{2}\right\rangle=\mathrm{tr}(Y_{1}^{\top}Y_{2}), where \mathrm{tr}(B) stands for the trace of a square matrix B. The notation \mathrm{sym}(B)=(B+B^{\top})/2 represents the symmetric part of B. The Frobenius norm and 2-norm of a given matrix C are denoted by \left\|C\right\|_{\mathrm{F}} and \left\|C\right\|_{2}, respectively. The (i,j)-th entry of a matrix C is represented by C(i,j). The notation \mathbf{1}_{d} stands for the d-dimensional vector of all ones. The Kronecker product is denoted by \otimes. Given a differentiable function g(X):\mathbb{R}^{n\times p}\to\mathbb{R}, the gradient of g with respect to X is represented by \nabla g(X). Further notation will be introduced wherever it occurs.

1.5 Outline

The rest of this paper is organized as follows. In Section 2, we introduce an approximate augmented Lagrangian function and devise a novel decentralized algorithm for (1.2). Moreover, Section 3 is dedicated to investigating the convergence properties of the proposed algorithm. Numerical experiments on a variety of test problems are presented in Section 4 to evaluate the performance of the proposed algorithm. We draw final conclusions and discuss some possible future developments in the last section.

2 Algorithm Development

In this section, some preliminaries are first introduced, including first-order stationarity conditions and mixing matrices. Then we propose a novel approximate augmented Lagrangian function to deal with nonconvex orthogonality constraints. From this foundation, we further devise an efficient infeasible decentralized approach to optimization problems over the Stiefel manifold.

2.1 Preliminaries

To facilitate the narrative, we first introduce the stationarity conditions. As discussed in [3, 38], the definition of first-order stationary points of (1.1) can be stated as follows.

Definition 2.1.

A point X\in\mathcal{S}_{n,p} is called a first-order stationary point of (1.1) if and only if

0=\mathrm{grad}\,f(X):=\nabla f(X)-X\,\mathrm{sym}\left(X^{\top}\nabla f(X)\right).

Here, \mathrm{grad}\,f(X) denotes the Riemannian gradient [3] of f at X.
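As a minimal numerical companion to Definition 2.1 (our own sketch; grad_f stands for any routine returning the Euclidean gradient \nabla f(X)), the Riemannian gradient and the resulting stationarity violation can be evaluated as follows.

import numpy as np

def sym(B):
    # symmetric part of a square matrix
    return 0.5 * (B + B.T)

def riemannian_grad(X, grad_f):
    # grad f(X) = grad_f(X) - X sym(X^T grad_f(X)), as in Definition 2.1
    G = grad_f(X)
    return G - X @ sym(X.T @ G)

def stationarity_violation(X, grad_f):
    # Frobenius norm of the Riemannian gradient; zero at first-order stationary points
    return np.linalg.norm(riemannian_grad(X, grad_f), 'fro')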

In addition, the structure of the communication network is associated with a mixing matrix W=[W(i,j)]\in\mathbb{R}^{d\times d}, which conforms to the underlying graph topology. The mixing matrix plays an essential role in the development of decentralized algorithms. We assume that W satisfies the following properties, which are standard in the literature.

Assumption 2.

The mixing matrix W satisfies the following conditions.

  1. (i)

    W is symmetric.

  2. (ii)

    W is doubly stochastic, namely, W is nonnegative and W\mathbf{1}_{d}=W^{\top}\mathbf{1}_{d}=\mathbf{1}_{d}.

  3. (iii)

    W(i,j)=0 if i\neq j and (i,j)\notin\mathtt{E}.

Typically, decentralized algorithms rely on the mixing matrix W to diffuse information throughout the network, and there are a few common choices for W. We refer interested readers to [42, 36] for more details. According to the Perron-Frobenius Theorem [29], we know that the eigenvalues of W lie in (-1,1] and

\lambda:=\left\|W-\mathbf{1}_{d}\mathbf{1}_{d}^{\top}/d\right\|_{2}<1. (2.1)

The parameter \lambda measures the connectedness of the network and plays a prominent part in the analysis of gradient tracking based methods.
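For concreteness, the sketch below (our own; Metropolis weights are just one common construction, not the only choice compatible with Assumption 2) builds a mixing matrix from an undirected edge list and evaluates the connectedness parameter \lambda in (2.1).

import numpy as np

def metropolis_mixing_matrix(d, edges):
    # Symmetric, doubly stochastic W with W(i,j) = 0 for every non-edge pair.
    W = np.zeros((d, d))
    deg = np.zeros(d, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))   # make every row sum to one
    return W

def connectedness(W):
    # lambda = ||W - (1/d) * 1 1^T||_2 from (2.1); smaller means better connectivity
    d = W.shape[0]
    return np.linalg.norm(W - np.ones((d, d)) / d, 2)

# Example: a ring network of 5 agents.
W = metropolis_mixing_matrix(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
lam = connectedness(W)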

2.2 Approximate augmented Lagrangian function

As mentioned earlier, it is quite difficult to seek a consensus on the Stiefel manifold, which motivates us to devise an infeasible decentralized algorithm for the problem (1.2). The renowned augmented Lagrangian method is a natural choice. Therefore, the story behind our algorithm starts from the augmented Lagrangian function of optimization problems over the Stiefel manifold, which is of the following form.

\mathcal{L}(X,\Lambda):=f(X)-\dfrac{1}{2}\left\langle\Lambda,X^{\top}X-I_{p}\right\rangle+\dfrac{\beta}{4}\left\|X^{\top}X-I_{p}\right\|^{2}_{\mathrm{F}},

where \Lambda\in\mathbb{R}^{p\times p} consists of the corresponding Lagrangian multipliers and \beta>0 is the penalty parameter. However, the presence of multipliers brings difficulties in both algorithmic design and theoretical analysis.

In fact, the Lagrangian multipliers \Lambda, as discussed in [38], admit the following closed-form expression at any first-order stationary point X of (1.1).

\Lambda=\mathrm{sym}\left(X^{\top}\nabla f(X)\right). (2.2)

With the above expression of \Lambda, the second term of \mathcal{L}(X,\Lambda) can be rewritten in the following form.

\left\langle\Lambda,X^{\top}X-I_{p}\right\rangle=\left\langle\nabla f(X),XX^{\top}X-X\right\rangle.
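For completeness, this identity follows from elementary trace manipulations (a short verification we add here), using the symmetry of X^{\top}X-I_{p} and the cyclic property of the trace:

\begin{aligned}
\left\langle\Lambda,X^{\top}X-I_{p}\right\rangle
&=\mathrm{tr}\left(\mathrm{sym}\left(X^{\top}\nabla f(X)\right)\left(X^{\top}X-I_{p}\right)\right)
=\mathrm{tr}\left(\nabla f(X)^{\top}X\left(X^{\top}X-I_{p}\right)\right)\\
&=\left\langle\nabla f(X),X\left(X^{\top}X-I_{p}\right)\right\rangle
=\left\langle\nabla f(X),XX^{\top}X-X\right\rangle.
\end{aligned}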

Then in light of the Taylor expansion, we have the following approximation.

f(XX^{\top}X)-f(X)\approx\left\langle\nabla f(X),XX^{\top}X-X\right\rangle. (2.3)

Now we replace the second term of \mathcal{L}(X,\Lambda) with the left-hand side of (2.3), and then obtain the following function.

h(X):=g(X)+\beta b(X),

where

g(X):=\dfrac{3}{2}\,f(X)-\dfrac{1}{2}\,f(XX^{\top}X), \mbox{ and } b(X):=\dfrac{1}{4}\left\|X^{\top}X-I_{p}\right\|^{2}_{\mathrm{F}}.

We call h(X) the approximate augmented Lagrangian function. The multipliers are updated implicitly by the closed-form expression (2.2) in this function, and hence we can design an infeasible algorithm entirely in the primal formulation.

The gradient of h(X) can be easily computed as follows.

\nabla h(X)=\nabla g(X)+\beta X\left(X^{\top}X-I_{p}\right),

where

\nabla g(X)=\dfrac{3}{2}\nabla f(X)-\dfrac{1}{2}\nabla f(XX^{\top}X)X^{\top}X-X\mathrm{sym}\left(X^{\top}\nabla f(XX^{\top}X)\right).

Clearly, the gradient of f has to be evaluated at two different points, X and XX^{\top}X, in \nabla g(X), which gives rise to additional computational costs. In order to alleviate this difficulty, we use the following direction to approximate \nabla g(X).

G(X)=\dfrac{3}{2}\nabla f(XX^{\top}X)-\dfrac{1}{2}\nabla f(XX^{\top}X)X^{\top}X-X\mathrm{sym}\left(X^{\top}\nabla f(XX^{\top}X)\right).

For convenience, we denote

H(X)=G(X)+\beta X\left(X^{\top}X-I_{p}\right).

As an approximation of \nabla h(X), H(X) possesses the following desirable properties.

Lemma 2.2.

Let \mathcal{R}:=\{X\in\mathbb{R}^{n\times p}\mid\|X^{\top}X-I_{p}\|_{\mathrm{F}}\leq 1/6\} be a bounded region and C_{0}:=\sup_{X\in\mathcal{R}}\left\|\nabla f(XX^{\top}X)\right\|_{\mathrm{F}} be a positive constant. Then, if \beta\geq(6+21C_{0})/5, we have

\left\|H(X)\right\|^{2}_{\mathrm{F}}\geq\left\|G(X)\right\|^{2}_{\mathrm{F}}+\beta\left\|X^{\top}X-I_{p}\right\|^{2}_{\mathrm{F}},

for any X\in\mathcal{R}.

Proof.

Suppose σmin\sigma_{\min} is the smallest singular value of XX. Then for any XX\in\mathcal{R}, we have X227/6\left\|X\right\|_{2}^{2}\leq 7/6 and σmin25/6\sigma_{\min}^{2}\geq 5/6, and hence,

X(XXIp)F2σmin2XXIpF256XXIpF2.\left\|X\left(X^{\top}X-I_{p}\right)\right\|^{2}_{\mathrm{F}}\geq\sigma_{\min}^{2}\left\|X^{\top}X-I_{p}\right\|^{2}_{\mathrm{F}}\geq\dfrac{5}{6}\left\|X^{\top}X-I_{p}\right\|^{2}_{\mathrm{F}}.

Moreover, simple algebraic manipulations give us

G(X),X(XXIp)\displaystyle\left\langle G(X),X\left(X^{\top}X-I_{p}\right)\right\rangle
=f(XXX)(32Ip12XX),X(XXIp)Xsym(Xf(XXX)),X(XXIp)\displaystyle=\left\langle\nabla f(XX^{\top}X)\left(\dfrac{3}{2}I_{p}-\dfrac{1}{2}X^{\top}X\right),X\left(X^{\top}X-I_{p}\right)\right\rangle-\left\langle X\mathrm{sym}\left(X^{\top}\nabla f(XX^{\top}X)\right),X\left(X^{\top}X-I_{p}\right)\right\rangle
=sym(Xf(XXX)),(XXIp)(32Ip12XX)XX(XXIp)\displaystyle=\left\langle\mathrm{sym}\left(X^{\top}\nabla f(XX^{\top}X)\right),\left(X^{\top}X-I_{p}\right)\left(\dfrac{3}{2}I_{p}-\dfrac{1}{2}X^{\top}X\right)-X^{\top}X\left(X^{\top}X-I_{p}\right)\right\rangle
=32sym(Xf(XXX)),(XXIp)2,\displaystyle=-\dfrac{3}{2}\left\langle\mathrm{sym}\left(X^{\top}\nabla f(XX^{\top}X)\right),\left(X^{\top}X-I_{p}\right)^{2}\right\rangle,

which yields that

|G(X),X(XXIp)|\displaystyle\left\lvert\left\langle G(X),X\left(X^{\top}X-I_{p}\right)\right\rangle\right\rvert 32sym(Xf(XXX))F(XXIp)2F\displaystyle\leq\dfrac{3}{2}\left\|\mathrm{sym}\left(X^{\top}\nabla f(XX^{\top}X)\right)\right\|_{\mathrm{F}}\left\|\left(X^{\top}X-I_{p}\right)^{2}\right\|_{\mathrm{F}}
74C0XXIpF2.\displaystyle\leq\dfrac{7}{4}C_{0}\left\|X^{\top}X-I_{p}\right\|^{2}_{\mathrm{F}}.

Now it can be readily verified that

H(X)F2\displaystyle\left\|H(X)\right\|^{2}_{\mathrm{F}} =G(X)F2+2βG(X),X(XXIp)+β2X(XXIp)F2\displaystyle=\left\|G(X)\right\|^{2}_{\mathrm{F}}+2\beta\left\langle G(X),X\left(X^{\top}X-I_{p}\right)\right\rangle+\beta^{2}\left\|X\left(X^{\top}X-I_{p}\right)\right\|^{2}_{\mathrm{F}}
G(X)F2+16β(5β21C0)XXIpF2,\displaystyle\geq\left\|G(X)\right\|^{2}_{\mathrm{F}}+\dfrac{1}{6}\beta\left(5\beta-21C_{0}\right)\left\|X^{\top}X-I_{p}\right\|^{2}_{\mathrm{F}},

which together with \beta\geq(6+21C_{0})/5 completes the proof. ∎

Lemma 2.2 reveals that the orthogonality violation of X is controlled by the squared norm of H(X) as long as X\in\mathcal{R} and \beta is sufficiently large. It is also worth mentioning that G(X)=\mathrm{grad}\,f(X) when X\in\mathcal{S}_{n,p}. In fact, h(X) can play the part of a penalty function for optimization problems over the Stiefel manifold, which is of interest for future investigation.
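The following Python sketch (our own; grad_f denotes any user-supplied routine for the Euclidean gradient of f) assembles G(X) and H(X) and numerically confirms the remark that G(X) coincides with the Riemannian gradient whenever X^{\top}X=I_{p}.

import numpy as np

def sym(B):
    return 0.5 * (B + B.T)

def G_and_H(X, grad_f, beta):
    # Surrogate direction G(X) and penalized direction H(X) from Subsection 2.2.
    p = X.shape[1]
    XtX = X.T @ X
    Gf = grad_f(X @ XtX)                              # gradient of f evaluated at X X^T X
    G = Gf @ (1.5 * np.eye(p) - 0.5 * XtX) - X @ sym(X.T @ Gf)
    return G, G + beta * X @ (XtX - np.eye(p))

# Check G(X) = grad f(X) on the Stiefel manifold for a quadratic test function.
rng = np.random.default_rng(1)
n, p = 10, 3
A = rng.standard_normal((n, n)); A = A + A.T
grad_f = lambda X: A @ X                              # f(X) = 1/2 * tr(X^T A X)
X = np.linalg.qr(rng.standard_normal((n, p)))[0]      # feasible point: X^T X = I_p
G, _ = G_and_H(X, grad_f, beta=1.0)
assert np.allclose(G, grad_f(X) - X @ sym(X.T @ grad_f(X)))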

2.3 Algorithm description

Inspired by Lemma 2.2, we can devise a decentralized algorithm such that the iterates are restricted to the bounded region \mathcal{R} and H(X) asymptotically vanishes.

Under the decentralized setting, the computation of H(X) can be distributed among the d agents as follows.

H(X)=\dfrac{1}{d}\sum\limits_{i=1}^{d}H_{i}(X),

where H_{i}(X):=G_{i}(X)+\beta X(X^{\top}X-I_{p}), and

G_{i}(X):=\nabla f_{i}(XX^{\top}X)\left(3I_{p}-X^{\top}X\right)/2-X\mathrm{sym}\left(X^{\top}\nabla f_{i}(XX^{\top}X)\right).

Clearly, only the local gradient \nabla f_{i}(XX^{\top}X) is involved in the expression of H_{i}(X), which signifies that the evaluation of H_{i}(X) can be accomplished by the i-th agent individually. Consequently, H_{i}(X) can act as a local descent direction, and we can employ gradient tracking techniques to estimate the average of these local descent directions. The resulting algorithm is summarized in Algorithm 1, named the decentralized Stiefel algorithm by an approximate augmented Lagrangian penalty function and abbreviated as DESTINY.

Input: initial guess X_{\mathrm{initial}}\in\mathcal{S}_{n,p}, stepsize \eta>0, and penalty parameter \beta>0.
for all i\in[d] in parallel do
    Set k:=0.
    Initialize X_{i,k}:=X_{\mathrm{initial}} and D_{i,k}:=H_{i}(X_{i,k}).
    while "not converged" do
        Update X_{i,k+1}:=\sum_{j=1}^{d}W(i,j)\left(X_{j,k}-\eta D_{j,k}\right).
        Update D_{i,k+1}:=\sum_{j=1}^{d}W(i,j)D_{j,k}+H_{i}(X_{i,k+1})-H_{i}(X_{i,k}).
        Set k:=k+1.
Output: \{X_{i,k}\}.
Algorithm 1 DESTINY.

Our algorithm consists of two procedures, namely, updating local variables and tracking global descent directions. Unlike DRGTA [5], our algorithm performs only one consensus step per iteration in each procedure. These two operations can be realized in a single round of communications to reduce the latency. In addition, our algorithm does not involve an extra consensus stepsize.

For the sake of convenience, we now define the following averages of local variables.

  • \bar{X}_{k}=\dfrac{1}{d}\sum\limits_{i=1}^{d}X_{i,k}.

  • \bar{D}_{k}=\dfrac{1}{d}\sum\limits_{i=1}^{d}D_{i,k}.

  • \bar{H}_{k}=\dfrac{1}{d}\sum\limits_{i=1}^{d}H_{i}(X_{i,k}).

Then it is not difficult to check that, for any k\in\mathbb{N}, the following relationships hold.

\bar{X}_{k+1}=\bar{X}_{k}-\eta\bar{D}_{k},\ \mbox{and}\ \bar{D}_{k}=\bar{H}_{k}. (2.4)
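Indeed, averaging the two update rules in Algorithm 1 over the agents and using the double stochasticity of W (so that multiplication by W preserves averages) gives

\bar{X}_{k+1}=\dfrac{1}{d}\sum_{i=1}^{d}\sum_{j=1}^{d}W(i,j)\left(X_{j,k}-\eta D_{j,k}\right)=\bar{X}_{k}-\eta\bar{D}_{k},
\qquad
\bar{D}_{k+1}=\bar{D}_{k}+\bar{H}_{k+1}-\bar{H}_{k},

and since \bar{D}_{0}=\bar{H}_{0} by the initialization, an induction yields \bar{D}_{k}=\bar{H}_{k} for all k.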

Moreover, we denote J=\mathbf{1}_{d}\mathbf{1}_{d}^{\top}/d\in\mathbb{R}^{d\times d}, \mathbf{J}=J\otimes I_{n}\in\mathbb{R}^{dn\times dn}, and \mathbf{W}=W\otimes I_{n}\in\mathbb{R}^{dn\times dn}. The following stacked notations are also used in the sequel.

  • \mathbf{X}_{k}=[X_{1,k}^{\top},\dotsc,X_{d,k}^{\top}]^{\top}\in\mathbb{R}^{dn\times p}, \bar{\mathbf{X}}_{k}=\left(\mathbf{1}_{d}\otimes I_{n}\right)\bar{X}_{k}=\mathbf{J}\mathbf{X}_{k}\in\mathbb{R}^{dn\times p}.

  • \mathbf{D}_{k}=[D_{1,k}^{\top},\dotsc,D_{d,k}^{\top}]^{\top}\in\mathbb{R}^{dn\times p}, \bar{\mathbf{D}}_{k}=\left(\mathbf{1}_{d}\otimes I_{n}\right)\bar{D}_{k}=\mathbf{J}\mathbf{D}_{k}\in\mathbb{R}^{dn\times p}.

  • \mathbf{H}_{k}=[H_{1}(X_{1,k})^{\top},\dotsc,H_{d}(X_{d,k})^{\top}]^{\top}\in\mathbb{R}^{dn\times p}, \bar{\mathbf{H}}_{k}=\left(\mathbf{1}_{d}\otimes I_{n}\right)\bar{H}_{k}=\mathbf{J}\mathbf{H}_{k}\in\mathbb{R}^{dn\times p}.

By virtue of the above notations, we have \left(\mathbf{W}-\mathbf{J}\right)\bar{\mathbf{X}}_{k}=\left(\mathbf{W}-\mathbf{J}\right)\bar{\mathbf{D}}_{k}=0, and the main iteration loop of Algorithm 1 can be summarized in the following compact form.

\left\{\begin{aligned} \mathbf{X}_{k+1}&=\mathbf{W}\left(\mathbf{X}_{k}-\eta\mathbf{D}_{k}\right),\\ \mathbf{D}_{k+1}&=\mathbf{W}\mathbf{D}_{k}+\mathbf{H}_{k+1}-\mathbf{H}_{k}.\end{aligned}\right.
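To make the compact form concrete, the following Python sketch (our own reference implementation with hypothetical helper names; grad_fis is assumed to be a list of routines returning the local Euclidean gradients \nabla f_{i}) carries out the DESTINY iteration. The two multiplications by W correspond to the single round of communications per iteration discussed above.

import numpy as np

def sym(B):
    return 0.5 * (B + B.T)

def local_H(X, grad_fi, beta):
    # H_i(X) = G_i(X) + beta * X (X^T X - I_p), with G_i built from grad f_i(X X^T X).
    p = X.shape[1]
    XtX = X.T @ X
    Gf = grad_fi(X @ XtX)
    Gi = Gf @ (1.5 * np.eye(p) - 0.5 * XtX) - X @ sym(X.T @ Gf)
    return Gi + beta * X @ (XtX - np.eye(p))

def destiny(grad_fis, W, X_init, eta, beta, num_iters):
    # Algorithm 1: X[i] is agent i's local copy, D[i] its tracked descent direction.
    d = len(grad_fis)
    X = [X_init.copy() for _ in range(d)]
    H = [local_H(X[i], grad_fis[i], beta) for i in range(d)]
    D = [H[i].copy() for i in range(d)]
    for _ in range(num_iters):
        # One round of communications: neighbors exchange X_j - eta * D_j and D_j.
        X_new = [sum(W[i, j] * (X[j] - eta * D[j]) for j in range(d)) for i in range(d)]
        H_new = [local_H(X_new[i], grad_fis[i], beta) for i in range(d)]
        D = [sum(W[i, j] * D[j] for j in range(d)) + H_new[i] - H[i] for i in range(d)]
        X, H = X_new, H_new
    return X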

3 Convergence Analysis

Existing convergence guarantees of gradient tracking based algorithms, such as [7], are established for globally Lipschitz smooth and coercive functions, which is restrictive for (1.2) since \mathcal{S}_{n,p} is compact. In this section, the global convergence of Algorithm 1 is rigorously established under rather mild conditions: the objective function is only assumed to be locally Lipschitz smooth (see Assumption 1). The worst-case complexity is also provided.

All the special constants to be used in this section are listed below. We divide these constants into two categories. Recall that \mathcal{R} is the bounded region defined in Lemma 2.2 and \lambda is the constant defined in (2.1). In addition, we define \mathcal{B}:=\{X\in\mathbb{R}^{n\times p}\mid\left\|X\right\|_{\mathrm{F}}\leq\sqrt{7dp/6}+\sqrt{d}\}.

  • Category I:

    Mg=sup{Gi(X)FX,i[d]};Mb=sup{b(X)FX};\displaystyle M_{g}=\sup\,\{\left\|G_{i}(X)\right\|_{\mathrm{F}}\mid X\in\mathcal{B},i\in[d]\};\quad M_{b}=\sup\,\{\left\|\nabla b(X)\right\|_{\mathrm{F}}\mid X\in\mathcal{B}\}; (3.1)
    Lg=sup{Gi(X)Gi(Y)FXYF|XY,X,Y,i[d]};\displaystyle L_{g}=\sup\,\left\{\dfrac{\left\|G_{i}(X)-G_{i}(Y)\right\|_{\mathrm{F}}}{\left\|X-Y\right\|_{\mathrm{F}}}\bigg{|}X\neq Y,X\in\mathcal{B},Y\in\mathcal{B},i\in[d]\right\};
    Lb=sup{b(X)b(Y)FXYF|XY,X,Y};C1=λ2(1+λ2)1λ2;\displaystyle L_{b}=\sup\,\left\{\dfrac{\left\|\nabla b(X)-\nabla b(Y)\right\|_{\mathrm{F}}}{\left\|X-Y\right\|_{\mathrm{F}}}\bigg{|}X\neq Y,X\in\mathcal{B},Y\in\mathcal{B}\right\};\quad C_{1}=\dfrac{\lambda^{2}\left(1+\lambda^{2}\right)}{1-\lambda^{2}};
    Lf=sup{f(X)f(XXX)FX(XXIp)F|XXXX,X};γ=1λ22λ2.\displaystyle L_{f}=\sup\,\left\{\dfrac{\left\|\nabla f(X)-\nabla f(XX^{\top}X)\right\|_{\mathrm{F}}}{\left\|X\left(X^{\top}X-I_{p}\right)\right\|_{\mathrm{F}}}\bigg{|}X\neq XX^{\top}X,X\in\mathcal{B}\right\};\quad\gamma=\dfrac{1-\lambda^{2}}{2\lambda^{2}}.
  • Category II:

    CD=𝐃¯0𝐃0F+3d/(1λ)+d(Mg+βMb);\displaystyle C_{D}=\left\|\bar{\mathbf{D}}_{0}-\mathbf{D}_{0}\right\|_{\mathrm{F}}+3\sqrt{d}/\left(1-\lambda\right)+\sqrt{d}\left(M_{g}+\beta M_{b}\right); (3.2)
    C2=12(1+λ2)(Lg+βLb)21λ2;ρ=316C2(1λ2).\displaystyle C_{2}=\dfrac{12\left(1+\lambda^{2}\right)\left(L_{g}+\beta L_{b}\right)^{2}}{1-\lambda^{2}};\quad\rho=\dfrac{3}{16C_{2}}\left(1-\lambda^{2}\right).

The first category of constants, defined in (3.1), is independent of \beta, while the second category, defined in (3.2), is not. In order to establish the global convergence of Algorithm 1, we need to impose several mild conditions on \beta and \eta. To facilitate the narrative, we first state all these conditions here.

Assumption 3.

We make the following assumptions about β\beta and η\eta.

  1. (i)

    The penalty parameter β\beta satisfies

    β>max{6+21C05,72(4+3Mg)5,1Lb, 22Lf2}.\beta>\max\left\{\dfrac{6+21C_{0}}{5},\;\dfrac{72(4+3M_{g})}{5},\;\dfrac{1}{L_{b}},\;22L_{f}^{2}\right\}.
  2. (ii)

    The stepsize η\eta satisfies

    0<η<min{η¯1,η¯2,η¯3,η¯4,η¯5,η¯6,η¯7,η¯8,η¯9,η¯10,η¯11,η¯12},0<\eta<\min\left\{\bar{\eta}_{1},\bar{\eta}_{2},\bar{\eta}_{3},\bar{\eta}_{4},\bar{\eta}_{5},\bar{\eta}_{6},\bar{\eta}_{7},\bar{\eta}_{8},\bar{\eta}_{9},\bar{\eta}_{10},\bar{\eta}_{11},\bar{\eta}_{12}\right\},

    where

    η¯1=21649β2,η¯2=127β,η¯3=118(4+3Mg),η¯4=5β/7243Mg(1+Mg)2,\displaystyle\bar{\eta}_{1}=\dfrac{216}{49\beta^{2}},\quad\bar{\eta}_{2}=\dfrac{12}{7\beta},\quad\bar{\eta}_{3}=\dfrac{1}{18(4+3M_{g})},\quad\bar{\eta}_{4}=\dfrac{5\beta/72-4-3M_{g}}{(1+M_{g})^{2}},
    η¯5=d(1λ)(Lg+βLb)λCD,η¯6=d(Lg+βLb)CD,η¯7=18(Lg+βLb),\displaystyle\bar{\eta}_{5}=\dfrac{\sqrt{d}(1-\lambda)}{(L_{g}+\beta L_{b})\lambda C_{D}},\quad\bar{\eta}_{6}=\dfrac{\sqrt{d}}{(L_{g}+\beta L_{b})C_{D}},\quad\bar{\eta}_{7}=\dfrac{1}{8(L_{g}+\beta L_{b})},
    η¯8=14ρdC2,η¯9=1λ22λ2(Lg+βLb)16(1+λ2),\displaystyle\bar{\eta}_{8}=\dfrac{1}{4\rho dC_{2}},\quad\bar{\eta}_{9}=\dfrac{1-\lambda^{2}}{2\lambda^{2}(L_{g}+\beta L_{b})}\sqrt{\dfrac{1}{6(1+\lambda^{2})}},
    η¯10=d(1λ2)24(Lg+βLb)3,η¯11=d(1λ2)12(Lg+βLb)2,η¯12=3ρ(1λ2)16C1.\displaystyle\bar{\eta}_{10}=\sqrt{\dfrac{d(1-\lambda^{2})}{24(L_{g}+\beta L_{b})^{3}}},\quad\bar{\eta}_{11}=\dfrac{d(1-\lambda^{2})}{12(L_{g}+\beta L_{b})^{2}},\quad\bar{\eta}_{12}=\sqrt{\dfrac{3\rho(1-\lambda^{2})}{16C_{1}}}.

The conditions in Assumption 3 are introduced for the theoretical analysis; the values of \beta and \eta that satisfy them are usually overly restrictive for practical use.

3.1 Boundedness of iterates

The purpose of this subsection is to show that the sequence \{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} generated by Algorithm 1 is bounded and that \bar{X}_{k} is restricted to the bounded region \mathcal{R}. We first prove the following technical lemma.

Lemma 3.1.

Suppose the conditions in Assumptions 1, 2, and 3 hold, and X¯k+1\bar{X}_{k+1} is generated by (2.4) with X¯k\bar{X}_{k}\in\mathcal{R}, 𝐗kF7dp/6+d\left\|\mathbf{X}_{k}\right\|_{\mathrm{F}}\leq\sqrt{7dp/6}+\sqrt{d}, and 𝐗¯k𝐗kFd/(Lg+βLb)\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|_{\mathrm{F}}\leq\sqrt{d}/(L_{g}+\beta L_{b}). Then we have X¯k+1\bar{X}_{k+1}\in\mathcal{R}.

Proof.

Since 𝐗kF7dp/6+d\left\|\mathbf{X}_{k}\right\|_{\mathrm{F}}\leq\sqrt{7dp/6}+\sqrt{d}, we have Xi,kX_{i,k}\in\mathcal{B}. Then it can be readily verified that

H(X¯k)H¯kF\displaystyle\left\|H(\bar{X}_{k})-\bar{H}_{k}\right\|_{\mathrm{F}} 1di=1dHi(X¯k)Hi(Xi,k)FLg+βLbdi=1dX¯kXi,kF\displaystyle\leq\dfrac{1}{d}\sum\limits_{i=1}^{d}\left\|H_{i}(\bar{X}_{k})-H_{i}(X_{i,k})\right\|_{\mathrm{F}}\leq\dfrac{L_{g}+\beta L_{b}}{d}\sum\limits_{i=1}^{d}\left\|\bar{X}_{k}-X_{i,k}\right\|_{\mathrm{F}}
Lg+βLbd𝐗¯k𝐗kF1.\displaystyle\leq\dfrac{L_{g}+\beta L_{b}}{\sqrt{d}}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|_{\mathrm{F}}\leq 1.

Moreover, it follows from (2.4) that

X¯k+1=X¯kηH(X¯k)+η(H(X¯k)H¯k)=X¯kηβX¯k(X¯kX¯kIp)+ηYk,\displaystyle\bar{X}_{k+1}=\bar{X}_{k}-\eta H(\bar{X}_{k})+\eta\left(H(\bar{X}_{k})-\bar{H}_{k}\right)=\bar{X}_{k}-\eta\beta\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)+\eta Y_{k},

where Yk:=H(X¯k)H¯kG(X¯k)Y_{k}:=H(\bar{X}_{k})-\bar{H}_{k}-G(\bar{X}_{k}) and YkY_{k} satisfies

YkFH(X¯k)H¯kF+G(X¯k)F1+Mg.\left\|Y_{k}\right\|_{\mathrm{F}}\leq\left\|H(\bar{X}_{k})-\bar{H}_{k}\right\|_{\mathrm{F}}+\left\|G(\bar{X}_{k})\right\|_{\mathrm{F}}\leq 1+M_{g}.

Then by straightforward calculations, we can obtain that

X¯k+1X¯k+1Ip=\displaystyle\bar{X}_{k+1}^{\top}\bar{X}_{k+1}-I_{p}={} (X¯kηβX¯k(X¯kX¯kIp)+ηYk)(X¯kηβX¯k(X¯kX¯kIp)+ηYk)Ip\displaystyle\left(\bar{X}_{k}-\eta\beta\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)+\eta Y_{k}\right)^{\top}\left(\bar{X}_{k}-\eta\beta\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)+\eta Y_{k}\right)-I_{p}
=\displaystyle={} X¯kX¯kIp2ηβX¯kX¯k(X¯kX¯kIp)+ηX¯kYk+η2β2X¯kX¯k(X¯kX¯kIp)2\displaystyle\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}-2\eta\beta\bar{X}_{k}^{\top}\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)+\eta\bar{X}_{k}^{\top}Y_{k}+\eta^{2}\beta^{2}\bar{X}_{k}^{\top}\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)^{2}
η2β(X¯kX¯kIp)X¯kYk+ηYkX¯kη2βYkX¯k(X¯kX¯kIp)+η2YkYk\displaystyle-\eta^{2}\beta\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)\bar{X}_{k}^{\top}Y_{k}+\eta Y_{k}^{\top}\bar{X}_{k}-\eta^{2}\beta Y_{k}^{\top}\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)+\eta^{2}Y_{k}^{\top}Y_{k}
=\displaystyle={} (InηβX¯kX¯k)2(X¯kX¯kIp)η2β2X¯kX¯k(X¯kX¯kIp)+ηX¯kYk+ηYkX¯k\displaystyle\left(I_{n}-\eta\beta\bar{X}_{k}^{\top}\bar{X}_{k}\right)^{2}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)-\eta^{2}\beta^{2}\bar{X}_{k}^{\top}\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)+\eta\bar{X}_{k}^{\top}Y_{k}+\eta Y_{k}^{\top}\bar{X}_{k}
η2β(X¯kX¯kIp)X¯kYkη2βYkX¯k(X¯kX¯kIp)+η2YkYk.\displaystyle-\eta^{2}\beta\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)\bar{X}_{k}^{\top}Y_{k}-\eta^{2}\beta Y_{k}^{\top}\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)+\eta^{2}Y_{k}^{\top}Y_{k}.

This further implies that

X¯k+1X¯k+1IpF\displaystyle\left\|\bar{X}_{k+1}^{\top}\bar{X}_{k+1}-I_{p}\right\|_{\mathrm{F}}\leq{} (InηβX¯kX¯k)2(X¯kX¯kIp)F+η2β2X¯kX¯k(X¯kX¯kIp)F\displaystyle\left\|\left(I_{n}-\eta\beta\bar{X}_{k}^{\top}\bar{X}_{k}\right)^{2}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)\right\|_{\mathrm{F}}+\eta^{2}\beta^{2}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)\right\|_{\mathrm{F}}
+2ηX¯kYkF+2η2β(X¯kX¯kIp)X¯kYkF+η2YkYkF\displaystyle+2\eta\left\|\bar{X}_{k}^{\top}Y_{k}\right\|_{\mathrm{F}}+2\eta^{2}\beta\left\|\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)\bar{X}_{k}^{\top}Y_{k}\right\|_{\mathrm{F}}+\eta^{2}\left\|Y_{k}^{\top}Y_{k}\right\|_{\mathrm{F}}
\displaystyle\leq{} (156ηβ)2X¯kX¯kIpF+49216η2β2+73(1+Mg)η\displaystyle\left(1-\dfrac{5}{6}\eta\beta\right)^{2}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|_{\mathrm{F}}+\dfrac{49}{216}\eta^{2}\beta^{2}+\dfrac{7}{3}(1+M_{g})\eta
+718(1+Mg)η2β+(1+Mg)2η2\displaystyle+\dfrac{7}{18}(1+M_{g})\eta^{2}\beta+(1+M_{g})^{2}\eta^{2}
\displaystyle\leq{} (156ηβ)2X¯kX¯kIpF+η(4+3Mg)+η2(1+Mg)2,\displaystyle\left(1-\dfrac{5}{6}\eta\beta\right)^{2}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|_{\mathrm{F}}+\eta\left(4+3M_{g}\right)+\eta^{2}\left(1+M_{g}\right)^{2},

where the last inequality follows from \eta\leq\min\{\bar{\eta}_{1},\bar{\eta}_{2}\}. Now we consider the above relationship in the following two cases.

Case I: X¯kX¯kIpF1/12\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|_{\mathrm{F}}\leq 1/12.

Since ηη¯3\eta\leq\bar{\eta}_{3}, we have

X¯k+1X¯k+1IpFX¯kX¯kIpF+112=16.\left\|\bar{X}_{k+1}^{\top}\bar{X}_{k+1}-I_{p}\right\|_{\mathrm{F}}\leq\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|_{\mathrm{F}}+\dfrac{1}{12}=\dfrac{1}{6}.

Case II: X¯kX¯kIpF>1/12\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|_{\mathrm{F}}>1/12.

It can be readily verified that

X¯k+1X¯k+1IpFX¯kX¯kIpF\displaystyle\left\|\bar{X}_{k+1}^{\top}\bar{X}_{k+1}-I_{p}\right\|_{\mathrm{F}}-\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|_{\mathrm{F}}\leq{} ((156ηβ)21)X¯kX¯kIpF\displaystyle\left(\left(1-\dfrac{5}{6}\eta\beta\right)^{2}-1\right)\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|_{\mathrm{F}}
+η(4+3Mg)+η2(1+Mg)2\displaystyle+\eta\left(4+3M_{g}\right)+\eta^{2}\left(1+M_{g}\right)^{2}
\displaystyle\leq{} 572ηβ+η(4+3Mg)+η2(1+Mg)2,\displaystyle-\dfrac{5}{72}\eta\beta+\eta\left(4+3M_{g}\right)+\eta^{2}\left(1+M_{g}\right)^{2},

which together with β>72(4+3Mg)/5\beta>72(4+3M_{g})/5 and ηη¯4\eta\leq\bar{\eta}_{4} yields that

X¯k+1X¯k+1IpFX¯kX¯kIpF0.\left\|\bar{X}_{k+1}^{\top}\bar{X}_{k+1}-I_{p}\right\|_{\mathrm{F}}-\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|_{\mathrm{F}}\leq 0.

Hence, we arrive at \left\|\bar{X}_{k+1}^{\top}\bar{X}_{k+1}-I_{p}\right\|_{\mathrm{F}}\leq\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|_{\mathrm{F}}\leq 1/6. Combining the above two cases, we complete the proof. ∎

Based on Lemma 3.1, we can prove the main results of this subsection.

Proposition 3.2.

Suppose the conditions in Assumptions 1, 2, and 3 hold. Let {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} be the iterate sequence generated by Algorithm 1 with XinitialX_{\mathrm{initial}}\in\mathcal{R}. Then for any kk\in\mathbb{N}, it holds that

X¯k,𝐗¯k𝐗kFdLg+βLb,𝐗kF7dp6+d, and 𝐃kFCD.\bar{X}_{k}\in\mathcal{R},~{}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|_{\mathrm{F}}\leq\dfrac{\sqrt{d}}{L_{g}+\beta L_{b}},~{}\left\|\mathbf{X}_{k}\right\|_{\mathrm{F}}\leq\sqrt{\dfrac{7dp}{6}}+\sqrt{d},\mbox{~{}and~{}}\left\|\mathbf{D}_{k}\right\|_{\mathrm{F}}\leq C_{D}. (3.3)
Proof.

We use mathematical induction to prove this proposition. The argument (3.3) directly holds at {(𝐗0,𝐃0)}\{\left(\mathbf{X}_{0},\mathbf{D}_{0}\right)\} resulting from the initialization. Now, we assume that this argument holds at {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\}, and investigate the situation at {(𝐗k+1,𝐃k+1)}\{\left(\mathbf{X}_{k+1},\mathbf{D}_{k+1}\right)\}.

Our first purpose is to show that 𝐗¯k+1𝐗k+1Fd/(Lg+βLb)\left\|\bar{\mathbf{X}}_{k+1}-\mathbf{X}_{k+1}\right\|_{\mathrm{F}}\leq\sqrt{d}/(L_{g}+\beta L_{b}). By straightforward calculations, we can attain that

𝐗¯k+1𝐗k+1F\displaystyle\left\|\bar{\mathbf{X}}_{k+1}-\mathbf{X}_{k+1}\right\|_{\mathrm{F}} =(𝐖𝐉)(𝐗¯k𝐗k)+η(𝐖𝐉)𝐃kF\displaystyle=\left\|\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right)+\eta\left(\mathbf{W}-\mathbf{J}\right)\mathbf{D}_{k}\right\|_{\mathrm{F}}
(𝐖𝐉)(𝐗¯k𝐗k)F+η(𝐖𝐉)𝐃kF\displaystyle\leq\left\|\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right)\right\|_{\mathrm{F}}+\eta\left\|\left(\mathbf{W}-\mathbf{J}\right)\mathbf{D}_{k}\right\|_{\mathrm{F}}
λ𝐗¯k𝐗kF+ηλ𝐃kF\displaystyle\leq\lambda\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|_{\mathrm{F}}+\eta\lambda\left\|\mathbf{D}_{k}\right\|_{\mathrm{F}}
λ𝐗¯k𝐗kF+ηλCD,\displaystyle\leq\lambda\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|_{\mathrm{F}}+\eta\lambda C_{D},

which together with ηη¯5\eta\leq\bar{\eta}_{5} implies that

𝐗¯k+1𝐗k+1FdλLg+βLb+d(1λ)Lg+βLbdLg+βLb.\left\|\bar{\mathbf{X}}_{k+1}-\mathbf{X}_{k+1}\right\|_{\mathrm{F}}\leq\dfrac{\sqrt{d}\lambda}{L_{g}+\beta L_{b}}+\dfrac{\sqrt{d}\left(1-\lambda\right)}{L_{g}+\beta L_{b}}\leq\dfrac{\sqrt{d}}{L_{g}+\beta L_{b}}.

Then we aim to prove that X¯k+1\bar{X}_{k+1}\in\mathcal{R} and 𝐗k+1F7dp/6+d\left\|\mathbf{X}_{k+1}\right\|_{\mathrm{F}}\leq\sqrt{7dp/6}+\sqrt{d}. According to Lemma 3.1, we have X¯k+1\bar{X}_{k+1}\in\mathcal{R}. And it follows that

𝐗k+1F𝐗¯k+1F+𝐗¯k+1𝐗k+1F7dp6+dLg+βLb7dp6+d,\left\|\mathbf{X}_{k+1}\right\|_{\mathrm{F}}\leq\left\|\bar{\mathbf{X}}_{k+1}\right\|_{\mathrm{F}}+\left\|\bar{\mathbf{X}}_{k+1}-\mathbf{X}_{k+1}\right\|_{\mathrm{F}}\leq\sqrt{\dfrac{7dp}{6}}+\dfrac{\sqrt{d}}{L_{g}+\beta L_{b}}\leq\sqrt{\dfrac{7dp}{6}}+\sqrt{d},

as a result of the condition Lg+βLbβLb1L_{g}+\beta L_{b}\geq\beta L_{b}\geq 1.

In order to finish the proof, we still have to show that 𝐃k+1FCD\left\|\mathbf{D}_{k+1}\right\|_{\mathrm{F}}\leq C_{D}. In fact, we have

Hi(Xi,k+1)Hi(Xi,k)F(Lg+βLb)Xi,k+1Xi,kF,i[d],\left\|H_{i}(X_{i,k+1})-H_{i}(X_{i,k})\right\|_{\mathrm{F}}\leq(L_{g}+\beta L_{b})\left\|X_{i,k+1}-X_{i,k}\right\|_{\mathrm{F}},\quad i\in[d],

which implies that

𝐇k+1𝐇kF(Lg+βLb)𝐗k+1𝐗kF.\left\|\mathbf{H}_{k+1}-\mathbf{H}_{k}\right\|_{\mathrm{F}}\leq(L_{g}+\beta L_{b})\left\|\mathbf{X}_{k+1}-\mathbf{X}_{k}\right\|_{\mathrm{F}}.

Moreover, it can be readily verified that

𝐗k+1𝐗kF\displaystyle\left\|\mathbf{X}_{k+1}-\mathbf{X}_{k}\right\|_{\mathrm{F}} =𝐖(𝐗kη𝐃k)𝐗kF=(Idn𝐖)(𝐗¯k𝐗k)η𝐖𝐃kF\displaystyle=\left\|\mathbf{W}\left(\mathbf{X}_{k}-\eta\mathbf{D}_{k}\right)-\mathbf{X}_{k}\right\|_{\mathrm{F}}=\left\|\left(I_{dn}-\mathbf{W}\right)\left(\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right)-\eta\mathbf{W}\mathbf{D}_{k}\right\|_{\mathrm{F}}
2𝐗¯k𝐗kF+η𝐃kF2dLg+βLb+ηCD3dLg+βLb,\displaystyle\leq 2\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|_{\mathrm{F}}+\eta\left\|\mathbf{D}_{k}\right\|_{\mathrm{F}}\leq\dfrac{2\sqrt{d}}{L_{g}+\beta L_{b}}+\eta C_{D}\leq\dfrac{3\sqrt{d}}{L_{g}+\beta L_{b}},

where the last inequality follows from \eta\leq\bar{\eta}_{6}. Combining the above two relationships, we can obtain that

𝐃¯k+1𝐃k+1F\displaystyle\left\|\bar{\mathbf{D}}_{k+1}-\mathbf{D}_{k+1}\right\|_{\mathrm{F}} =(𝐖𝐉)(𝐃¯k𝐃k)(Idn𝐉)(𝐇k+1𝐇k)F\displaystyle=\left\|\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right)-\left(I_{dn}-\mathbf{J}\right)\left(\mathbf{H}_{k+1}-\mathbf{H}_{k}\right)\right\|_{\mathrm{F}}
(𝐖𝐉)(𝐃¯k𝐃k)F+(Idn𝐉)(𝐇k+1𝐇k)F\displaystyle\leq\left\|\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right)\right\|_{\mathrm{F}}+\left\|\left(I_{dn}-\mathbf{J}\right)\left(\mathbf{H}_{k+1}-\mathbf{H}_{k}\right)\right\|_{\mathrm{F}}
λ𝐃¯k𝐃kF+𝐇k+1𝐇kF\displaystyle\leq\lambda\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|_{\mathrm{F}}+\left\|\mathbf{H}_{k+1}-\mathbf{H}_{k}\right\|_{\mathrm{F}}
λ𝐃¯k𝐃kF+3d,\displaystyle\leq\lambda\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|_{\mathrm{F}}+3\sqrt{d},

which, by unrolling this recursion over k, yields

𝐃¯k+1𝐃k+1F𝐃¯0𝐃0F+3d1λ.\left\|\bar{\mathbf{D}}_{k+1}-\mathbf{D}_{k+1}\right\|_{\mathrm{F}}\leq\left\|\bar{\mathbf{D}}_{0}-\mathbf{D}_{0}\right\|_{\mathrm{F}}+\dfrac{3\sqrt{d}}{1-\lambda}.

Therefore, we can obtain that

𝐃k+1F\displaystyle\left\|\mathbf{D}_{k+1}\right\|_{\mathrm{F}} 𝐃¯k+1𝐃k+1F+𝐃¯k+1F=𝐃¯k+1𝐃k+1F+𝐇¯k+1F\displaystyle\leq\left\|\bar{\mathbf{D}}_{k+1}-\mathbf{D}_{k+1}\right\|_{\mathrm{F}}+\left\|\bar{\mathbf{D}}_{k+1}\right\|_{\mathrm{F}}=\left\|\bar{\mathbf{D}}_{k+1}-\mathbf{D}_{k+1}\right\|_{\mathrm{F}}+\left\|\bar{\mathbf{H}}_{k+1}\right\|_{\mathrm{F}}
𝐃¯0𝐃0F+3d1λ+d(Mg+βMb)=CD.\displaystyle\leq\left\|\bar{\mathbf{D}}_{0}-\mathbf{D}_{0}\right\|_{\mathrm{F}}+\dfrac{3\sqrt{d}}{1-\lambda}+\sqrt{d}\left(M_{g}+\beta M_{b}\right)=C_{D}.

The proof is completed. ∎

3.2 Sufficient descent of the merit function

In this subsection, we aim to prove a sufficient descent property of the merit function (to be defined later). Towards this end, we first bound the consensus error \|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\|^{2}_{\mathrm{F}} and the gradient tracking error \|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\|^{2}_{\mathrm{F}}.

Lemma 3.3.

Suppose {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} is the iterate sequence generated by Algorithm 1. Then for any kk\in\mathbb{N}, it holds that

𝐗¯k+1𝐗k+1F21+λ22𝐗¯k𝐗kF2+η2C1𝐃¯k𝐃kF2,\left\|\bar{\mathbf{X}}_{k+1}-\mathbf{X}_{k+1}\right\|^{2}_{\mathrm{F}}\leq\dfrac{1+\lambda^{2}}{2}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+\eta^{2}C_{1}\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}},

where C1>0C_{1}>0 is a constant defined in (3.1).

Proof.

By straightforward calculations, we can attain that

𝐗¯k+1𝐗k+1F2=\displaystyle\left\|\bar{\mathbf{X}}_{k+1}-\mathbf{X}_{k+1}\right\|^{2}_{\mathrm{F}}= (𝐖𝐉)(𝐗¯k𝐗k)η(𝐖𝐉)(𝐃¯k𝐃k)F2\displaystyle\left\|\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right)-\eta\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right)\right\|^{2}_{\mathrm{F}}
\displaystyle\leq (1+γ)(𝐖𝐉)(𝐗¯k𝐗k)F2+η2(1+1/γ)(𝐖𝐉)(𝐃¯k𝐃k)F2\displaystyle\left(1+\gamma\right)\left\|\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right)\right\|^{2}_{\mathrm{F}}+\eta^{2}\left(1+1/\gamma\right)\left\|\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right)\right\|^{2}_{\mathrm{F}}
\displaystyle\leq λ2(1+γ)𝐗¯k𝐗kF2+η2λ2(1+1/γ)𝐃¯k𝐃kF2,\displaystyle\lambda^{2}\left(1+\gamma\right)\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+\eta^{2}\lambda^{2}\left(1+1/\gamma\right)\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}},

where γ>0\gamma>0 is a constant defined in (3.1). This completes the proof since λ2(1+γ)=(1+λ2)/2\lambda^{2}\left(1+\gamma\right)=(1+\lambda^{2})/2 and λ2(1+1/γ)=C1\lambda^{2}\left(1+1/\gamma\right)=C_{1}. ∎

Lemma 3.4.

Suppose the conditions in Assumptions 1, 2, and 3 hold. Let {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} be the iterate sequence generated by Algorithm 1 with XinitialX_{\mathrm{initial}}\in\mathcal{R}. Then for any kk\in\mathbb{N}, it holds that

𝐃¯kF22(Lg+βLb)2𝐗¯k𝐗kF2+2dH(X¯k)F2.\left\|\bar{\mathbf{D}}_{k}\right\|^{2}_{\mathrm{F}}\leq 2\left(L_{g}+\beta L_{b}\right)^{2}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+2d\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}.
Proof.

At first, it is straightforward to verify that

H(X¯k)D¯kF\displaystyle\left\|H(\bar{X}_{k})-\bar{D}_{k}\right\|_{\mathrm{F}} =H(X¯k)H¯kF1di=1dHi(X¯k)Hi(Xi,k)F\displaystyle=\left\|H(\bar{X}_{k})-\bar{H}_{k}\right\|_{\mathrm{F}}\leq\dfrac{1}{d}\sum\limits_{i=1}^{d}\left\|H_{i}(\bar{X}_{k})-H_{i}(X_{i,k})\right\|_{\mathrm{F}} (3.4)
Lg+βLbdi=1dX¯kXi,kFLg+βLbd𝐗¯k𝐗kF,\displaystyle\leq\dfrac{L_{g}+\beta L_{b}}{d}\sum\limits_{i=1}^{d}\left\|\bar{X}_{k}-X_{i,k}\right\|_{\mathrm{F}}\leq\dfrac{L_{g}+\beta L_{b}}{\sqrt{d}}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|_{\mathrm{F}},

which further implies that

D¯kF2\displaystyle\left\|\bar{D}_{k}\right\|^{2}_{\mathrm{F}} 2H(X¯k)D¯kF2+2H(X¯k)F2\displaystyle\leq 2\left\|H(\bar{X}_{k})-\bar{D}_{k}\right\|^{2}_{\mathrm{F}}+2\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}} (3.5)
2d(Lg+βLb)2𝐗¯k𝐗kF2+2H(X¯k)F2.\displaystyle\leq\dfrac{2}{d}\left(L_{g}+\beta L_{b}\right)^{2}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+2\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}.

We complete the proof since 𝐃¯kF2=dD¯kF2\|\bar{\mathbf{D}}_{k}\|^{2}_{\mathrm{F}}=d\|\bar{D}_{k}\|^{2}_{\mathrm{F}}. ∎

Lemma 3.5.

Suppose the conditions in Assumptions 1, 2, and 3 hold. Let {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} be the iterate sequence generated by Algorithm 1 with XinitialX_{\mathrm{initial}}\in\mathcal{R}. Then for any kk\in\mathbb{N}, it holds that

𝐃¯k+1𝐃k+1F25+3λ28𝐃¯k𝐃kF2+C2𝐗¯k𝐗kF2+η8ρH(X¯k)F2,\left\|\bar{\mathbf{D}}_{k+1}-\mathbf{D}_{k+1}\right\|^{2}_{\mathrm{F}}\leq{}\dfrac{5+3\lambda^{2}}{8}\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}}+C_{2}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+\dfrac{\eta}{8\rho}\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}},

where C2>0C_{2}>0 and ρ>0\rho>0 are two constants defined in (3.2).

Proof.

By straightforward calculations, we can attain that

𝐃¯k+1𝐃k+1F2\displaystyle\left\|\bar{\mathbf{D}}_{k+1}-\mathbf{D}_{k+1}\right\|^{2}_{\mathrm{F}} =(𝐖𝐉)(𝐃¯k𝐃k)(Idn𝐉)(𝐇k+1𝐇k)F2\displaystyle=\left\|\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right)-\left(I_{dn}-\mathbf{J}\right)\left(\mathbf{H}_{k+1}-\mathbf{H}_{k}\right)\right\|^{2}_{\mathrm{F}}
(1+γ)(𝐖𝐉)(𝐃¯k𝐃k)F2+(1+1/γ)(Idn𝐉)(𝐇k+1𝐇k)F2\displaystyle\leq\left(1+\gamma\right)\left\|\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right)\right\|^{2}_{\mathrm{F}}+\left(1+1/\gamma\right)\left\|\left(I_{dn}-\mathbf{J}\right)\left(\mathbf{H}_{k+1}-\mathbf{H}_{k}\right)\right\|^{2}_{\mathrm{F}}
λ2(1+γ)𝐃¯k𝐃kF2+(1+1/γ)𝐇k+1𝐇kF2,\displaystyle\leq\lambda^{2}\left(1+\gamma\right)\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}}+\left(1+1/\gamma\right)\left\|\mathbf{H}_{k+1}-\mathbf{H}_{k}\right\|^{2}_{\mathrm{F}},

where γ>0\gamma>0 is a constant defined in (3.1). According to Proposition 3.2, the inequality 𝐗kF7dp/6+d\left\|\mathbf{X}_{k}\right\|_{\mathrm{F}}\leq\sqrt{7dp/6}+\sqrt{d} holds for all kk\in\mathbb{N}. Hence, it follows that

𝐇k+1𝐇kF(Lg+βLb)𝐗k+1𝐗kF.\left\|\mathbf{H}_{k+1}-\mathbf{H}_{k}\right\|_{\mathrm{F}}\leq(L_{g}+\beta L_{b})\left\|\mathbf{X}_{k+1}-\mathbf{X}_{k}\right\|_{\mathrm{F}}.

Moreover, it can be readily verified that

𝐗k+1𝐗kF2\displaystyle\left\|\mathbf{X}_{k+1}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}} =𝐖(𝐗kη𝐃k)𝐗kF2\displaystyle=\left\|\mathbf{W}\left(\mathbf{X}_{k}-\eta\mathbf{D}_{k}\right)-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}
=(Idn𝐖)(𝐗¯k𝐗k)+η(𝐖𝐉)(𝐃¯k𝐃k)η𝐖𝐃¯kF2\displaystyle=\left\|\left(I_{dn}-\mathbf{W}\right)\left(\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right)+\eta\left(\mathbf{W}-\mathbf{J}\right)\left(\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right)-\eta\mathbf{W}\bar{\mathbf{D}}_{k}\right\|^{2}_{\mathrm{F}}
6𝐗¯k𝐗kF2+3η2λ2𝐃¯k𝐃kF2+3η2𝐃¯kF2.\displaystyle\leq 6\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+3\eta^{2}\lambda^{2}\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}}+3\eta^{2}\left\|\bar{\mathbf{D}}_{k}\right\|^{2}_{\mathrm{F}}.

According to Lemma 3.4, we have

𝐗k+1𝐗kF2\displaystyle\left\|\mathbf{X}_{k+1}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}} 6(1+η2(Lg+βLb)2)𝐗¯k𝐗kF2+3η2λ2𝐃¯k𝐃kF2+6dη2H(X¯k)F2\displaystyle\leq 6\left(1+\eta^{2}\left(L_{g}+\beta L_{b}\right)^{2}\right)\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+3\eta^{2}\lambda^{2}\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}}+6d\eta^{2}\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}
12𝐗¯k𝐗kF2+3η2λ2𝐃¯k𝐃kF2+3η2ρC2H(X¯k)F2,\displaystyle\leq 12\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+3\eta^{2}\lambda^{2}\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}}+\dfrac{3\eta}{2\rho C_{2}}\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}},

where the last inequality follows from the condition \eta\leq\min\{\bar{\eta}_{7},\bar{\eta}_{8}\}. Combining the above relationships, we can obtain that

𝐃¯k+1𝐃k+1F2\displaystyle\left\|\bar{\mathbf{D}}_{k+1}-\mathbf{D}_{k+1}\right\|^{2}_{\mathrm{F}}\leq{} λ2(1+γ)(1+3η2(Lg+βLb)2/γ)𝐃¯k𝐃kF2\displaystyle\lambda^{2}\left(1+\gamma\right)\left(1+3\eta^{2}(L_{g}+\beta L_{b})^{2}/\gamma\right)\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}}
+C2𝐗¯k𝐗kF2+η8ρH(X¯k)F2.\displaystyle+C_{2}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+\dfrac{\eta}{8\rho}\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}.

Moreover, it follows from the condition ηη¯9\eta\leq\bar{\eta}_{9} that the inequality

λ2(1+γ)(1+3η2(Lg+βLb)2/γ)5+3λ28\lambda^{2}\left(1+\gamma\right)\left(1+3\eta^{2}(L_{g}+\beta L_{b})^{2}/\gamma\right)\leq\dfrac{5+3\lambda^{2}}{8}

holds. The proof is completed. ∎

Next, we evaluate the descent property of the sequence \{h(\bar{X}_{k})\}.

Lemma 3.6.

Suppose the conditions in Assumptions 1, 2, and 3 hold. Let {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} be the iterate sequence generated by Algorithm 1 with XinitialX_{\mathrm{initial}}\in\mathcal{R}. Then for any kk\in\mathbb{N}, it holds that

h(X¯k+1)h(X¯k)+1λ28𝐗¯k𝐗kF2+212ηLf2X¯kX¯kIpF258ηH(X¯k)F2.h(\bar{X}_{k+1})\leq h(\bar{X}_{k})+\dfrac{1-\lambda^{2}}{8}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+\dfrac{21}{2}\eta L_{f}^{2}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|^{2}_{\mathrm{F}}-\dfrac{5}{8}\eta\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}.
Proof.

According to Proposition 3.2, we know that the inequality 𝐗kF7dp/6+d\left\|\mathbf{X}_{k}\right\|_{\mathrm{F}}\leq\sqrt{7dp/6}+\sqrt{d} holds for any kk\in\mathbb{N}. Then it follows from the local Lipschitz continuity of hh that

h(X¯k+1)\displaystyle h(\bar{X}_{k+1}) =h(X¯kηD¯k)h(X¯k)ηh(X¯k),D¯k+12η2(Lg+βLb)D¯kF2\displaystyle=h(\bar{X}_{k}-\eta\bar{D}_{k})\leq h(\bar{X}_{k})-\eta\left\langle\nabla h(\bar{X}_{k}),\bar{D}_{k}\right\rangle+\dfrac{1}{2}\eta^{2}\left(L_{g}+\beta L_{b}\right)\left\|\bar{D}_{k}\right\|^{2}_{\mathrm{F}}
=h(X¯k)ηh(X¯k)H(X¯k),D¯kηH(X¯k),D¯k+12η2(Lg+βLb)D¯kF2\displaystyle=h(\bar{X}_{k})-\eta\left\langle\nabla h(\bar{X}_{k})-H(\bar{X}_{k}),\bar{D}_{k}\right\rangle-\eta\left\langle H(\bar{X}_{k}),\bar{D}_{k}\right\rangle+\dfrac{1}{2}\eta^{2}\left(L_{g}+\beta L_{b}\right)\left\|\bar{D}_{k}\right\|^{2}_{\mathrm{F}}
h(X¯k)ηh(X¯k)H(X¯k),D¯kηH(X¯k),D¯k+18ηH(X¯k)F2\displaystyle\leq h(\bar{X}_{k})-\eta\left\langle\nabla h(\bar{X}_{k})-H(\bar{X}_{k}),\bar{D}_{k}\right\rangle-\eta\left\langle H(\bar{X}_{k}),\bar{D}_{k}\right\rangle+\dfrac{1}{8}\eta\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}
+1λ224𝐗¯k𝐗kF2,\displaystyle\quad+\dfrac{1-\lambda^{2}}{24}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}},

where the last inequality follows from (3.5) and the condition ηmin{η¯7,η¯10}\eta\leq\min\{\bar{\eta}_{7},\bar{\eta}_{10}\}. Moreover, we have

|h(X¯k)H(X¯k),D¯k|4h(X¯k)H(X¯k)F2+116D¯kF2\displaystyle\left\lvert\left\langle\nabla h(\bar{X}_{k})-H(\bar{X}_{k}),\bar{D}_{k}\right\rangle\right\rvert\leq 4\left\|\nabla h(\bar{X}_{k})-H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}+\dfrac{1}{16}\left\|\bar{D}_{k}\right\|^{2}_{\mathrm{F}}
=9f(X¯k)f(X¯kX¯kX¯k)F2+116D¯kF2\displaystyle=9\left\|\nabla f(\bar{X}_{k})-\nabla f(\bar{X}_{k}\bar{X}_{k}^{\top}\bar{X}_{k})\right\|^{2}_{\mathrm{F}}+\dfrac{1}{16}\left\|\bar{D}_{k}\right\|^{2}_{\mathrm{F}}
9Lf2X¯k(X¯kX¯kIp)F2+116D¯kF2212Lf2X¯kX¯kIpF2+116D¯kF2\displaystyle\leq 9L_{f}^{2}\left\|\bar{X}_{k}\left(\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right)\right\|^{2}_{\mathrm{F}}+\dfrac{1}{16}\left\|\bar{D}_{k}\right\|^{2}_{\mathrm{F}}\leq\dfrac{21}{2}L_{f}^{2}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|^{2}_{\mathrm{F}}+\dfrac{1}{16}\left\|\bar{D}_{k}\right\|^{2}_{\mathrm{F}}
212Lf2X¯kX¯kIpF2+1λ224η𝐗¯k𝐗kF2+18H(X¯k)F2,\displaystyle\leq\dfrac{21}{2}L_{f}^{2}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|^{2}_{\mathrm{F}}+\dfrac{1-\lambda^{2}}{24\eta}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}+\dfrac{1}{8}\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}},

where the last inequality results from (3.5) and the condition ηη¯11\eta\leq\bar{\eta}_{11}. Moreover, straightforward manipulations give

H(X¯k),D¯k=H(X¯k),D¯kH(X¯k)+H(X¯k)=H(X¯k),D¯kH(X¯k)+H(X¯k)F2,\left\langle H(\bar{X}_{k}),\bar{D}_{k}\right\rangle=\left\langle H(\bar{X}_{k}),\bar{D}_{k}-H(\bar{X}_{k})+H(\bar{X}_{k})\right\rangle=\left\langle H(\bar{X}_{k}),\bar{D}_{k}-H(\bar{X}_{k})\right\rangle+\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}},

and

|H(X¯k),D¯kH(X¯k)|\displaystyle\left\lvert\left\langle H(\bar{X}_{k}),\bar{D}_{k}-H(\bar{X}_{k})\right\rangle\right\rvert 18H(X¯k)F2+2D¯kH(X¯k)F2\displaystyle\leq\dfrac{1}{8}\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}+2\left\|\bar{D}_{k}-H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}
18H(X¯k)F2+1λ224η𝐗¯k𝐗kF2,\displaystyle\leq\dfrac{1}{8}\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}+\dfrac{1-\lambda^{2}}{24\eta}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}},

where the last inequality is implied by (3.4) and the condition ηη¯11\eta\leq\bar{\eta}_{11}. Combining the above relationships, we arrive at the assertion of this lemma. ∎

Now we are in a position to introduce the merit function

h~(𝐗,𝐃):=h((𝟏dIn)𝐗)+𝐉𝐗𝐗F2+ρ𝐉𝐃𝐃F2,\tilde{h}(\mathbf{X},\mathbf{D}):=h((\mathbf{1}_{d}\otimes I_{n})^{\top}\mathbf{X})+\left\|\mathbf{J}\mathbf{X}-\mathbf{X}\right\|^{2}_{\mathrm{F}}+\rho\left\|\mathbf{J}\mathbf{D}-\mathbf{D}\right\|^{2}_{\mathrm{F}},

where ρ>0\rho>0 is a constant defined in (3.2).

Proposition 3.7.

Suppose the conditions in Assumptions 1, 2, and 3 hold. Let {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} be the iterate sequence generated by Algorithm 1 with XinitialX_{\mathrm{initial}}\in\mathcal{R}. Then for any kk\in\mathbb{N}, the following sufficient descent condition holds, which implies that the sequence {h~(𝐗k,𝐃k)}\{\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k})\} is monotonically non-increasing.

h~(𝐗k+1,𝐃k+1)\displaystyle\tilde{h}(\mathbf{X}_{k+1},\mathbf{D}_{k+1})\leq{} h~(𝐗k,𝐃k)316(1λ2)𝐗¯k𝐗kF23ρ16(1λ2)𝐃¯k𝐃kF2\displaystyle\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k})-\dfrac{3}{16}\left(1-\lambda^{2}\right)\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}-\dfrac{3\rho}{16}\left(1-\lambda^{2}\right)\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}} (3.6)
η2G(X¯k)F2η2Lf2X¯kX¯kIpF2.\displaystyle-\dfrac{\eta}{2}\left\|G(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}-\dfrac{\eta}{2}L_{f}^{2}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|^{2}_{\mathrm{F}}.
Proof.

Combining Lemmas 3.3, 3.5, and 3.6, we have

h~(𝐗k+1,𝐃k+1)\displaystyle\tilde{h}(\mathbf{X}_{k+1},\mathbf{D}_{k+1})\leq{} h~(𝐗k,𝐃k)316(1λ2)𝐗¯k𝐗kF2η2H(X¯k)F2\displaystyle\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k})-\dfrac{3}{16}\left(1-\lambda^{2}\right)\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}-\dfrac{\eta}{2}\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}
ρ(38(1λ2)η2C1ρ)𝐃¯k𝐃kF2+212ηLf2X¯kX¯kIpF2\displaystyle-\rho\left(\dfrac{3}{8}\left(1-\lambda^{2}\right)-\dfrac{\eta^{2}C_{1}}{\rho}\right)\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}}+\dfrac{21}{2}\eta L_{f}^{2}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} h~(𝐗k,𝐃k)316(1λ2)𝐗¯k𝐗kF23ρ16(1λ2)𝐃¯k𝐃kF2\displaystyle\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k})-\dfrac{3}{16}\left(1-\lambda^{2}\right)\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}-\dfrac{3\rho}{16}\left(1-\lambda^{2}\right)\left\|\bar{\mathbf{D}}_{k}-\mathbf{D}_{k}\right\|^{2}_{\mathrm{F}}
η2H(X¯k)F2+212ηLf2X¯kX¯kIpF2,\displaystyle-\dfrac{\eta}{2}\left\|H(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}+\dfrac{21}{2}\eta L_{f}^{2}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|^{2}_{\mathrm{F}},

where the last inequality follows from the condition ηη¯12\eta\leq\bar{\eta}_{12}. Together with Lemma 2.2 and β22Lf2\beta\geq 22L_{f}^{2}, we can obtain the sufficient descent condition (3.6), which implies that

h~(𝐗k+1,𝐃k+1)h~(𝐗k,𝐃k).\tilde{h}(\mathbf{X}_{k+1},\mathbf{D}_{k+1})\leq\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k}).

The proof is completed. ∎

3.3 Global convergence

Finally, we establish the global convergence of Algorithm 1 together with its worst-case complexity.

Theorem 3.8.

Suppose the conditions in Assumptions 1, 2, and 3 hold. Let {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} be the iterate sequence generated by Algorithm 1 with XinitialX_{\mathrm{initial}}\in\mathcal{R}. Then {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} has at least one accumulation point. Moreover, for any accumulation point (𝐗,𝐃)\left(\mathbf{X}^{\ast},\mathbf{D}^{\ast}\right), we have 𝐗=(𝟏dIn)X¯\mathbf{X}^{\ast}=(\mathbf{1}_{d}\otimes I_{n})\bar{X}^{\ast}, where X¯𝒮n,p\bar{X}^{\ast}\in\mathcal{S}_{n,p} is a first-order stationary point of the problem (1.1). Finally, the following relationships hold, which results in the global sublinear convergence rate of Algorithm 1.

mink=0,1,,K1𝐗¯k𝐗kF216(h~(𝐗0,𝐃0)h¯)3(1λ2)K,\min_{k=0,1,\dotsc,K-1}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}\leq\dfrac{16(\tilde{h}(\mathbf{X}_{0},\mathbf{D}_{0})-\underline{h})}{3\left(1-\lambda^{2}\right)K}, (3.7)
mink=0,1,,K1G(X¯k)F22(h~(𝐗0,𝐃0)h¯)ηK,\min_{k=0,1,\dotsc,K-1}\left\|G(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}\leq\dfrac{2(\tilde{h}(\mathbf{X}_{0},\mathbf{D}_{0})-\underline{h})}{\eta K}, (3.8)

and

mink=0,1,,K1X¯kX¯kIpF22(h~(𝐗0,𝐃0)h¯)ηLf2K,\min_{k=0,1,\dotsc,K-1}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|^{2}_{\mathrm{F}}\leq\dfrac{2(\tilde{h}(\mathbf{X}_{0},\mathbf{D}_{0})-\underline{h})}{\eta L_{f}^{2}K}, (3.9)

where h¯\underline{h} is a constant that bounds the sequence {h~(𝐗k,𝐃k)}\{\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k})\} from below (see the proof).

Proof.

According to Lemma 3.2, we know that the sequence {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} is bounded. Hence, the sequence {h~(𝐗k,𝐃k)}\{\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k})\} is bounded below owing to the continuity of h~\tilde{h}. Namely, there exists a constant h¯\underline{h} such that h~(𝐗k,𝐃k)h¯\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k})\geq\underline{h} for any kk\in\mathbb{N}. Then, it follows from Proposition 3.7 that the sequence {h~(𝐗k,𝐃k)}\{\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k})\} is convergent and the following relationships hold.

limk𝐗¯k𝐗kF2=0,limkG(X¯k)F2=0,limkX¯kX¯kIpF2=0.\lim\limits_{k\to\infty}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}}=0,\quad\lim\limits_{k\to\infty}\left\|G(\bar{X}_{k})\right\|^{2}_{\mathrm{F}}=0,\quad\lim\limits_{k\to\infty}\left\|\bar{X}_{k}^{\top}\bar{X}_{k}-I_{p}\right\|^{2}_{\mathrm{F}}=0. (3.10)

According to the Bolzano-Weierstrass theorem, it follows that {(𝐗k,𝐃k)}\{\left(\mathbf{X}_{k},\mathbf{D}_{k}\right)\} has an accumulation point, say (𝐗,𝐃)\left(\mathbf{X}^{\ast},\mathbf{D}^{\ast}\right). The relationships in (3.10) imply that 𝐗=(𝟏dIn)X¯\mathbf{X}^{\ast}=(\mathbf{1}_{d}\otimes I_{n})\bar{X}^{\ast} for some X¯𝒮n,p\bar{X}^{\ast}\in\mathcal{S}_{n,p}. Moreover, we have

gradf(X¯)=G(X¯)=limkG(X¯k)=0,\mathrm{grad}\,f(\bar{X}^{\ast})=G(\bar{X}^{\ast})=\lim\limits_{k\to\infty}G(\bar{X}_{k})=0,

which implies that X¯\bar{X}^{\ast} is a first-order stationary point of (1.1). Finally, we prove that the relationships (3.7)-(3.9) hold. Indeed, it follows from Proposition 3.7 that

k=0K1𝐗¯k𝐗kF2\displaystyle\sum_{k=0}^{K-1}\left\|\bar{\mathbf{X}}_{k}-\mathbf{X}_{k}\right\|^{2}_{\mathrm{F}} 163(1λ2)k=0K1(h~(𝐗k,𝐃k)h~(𝐗k+1,𝐃k+1))\displaystyle\leq\dfrac{16}{3\left(1-\lambda^{2}\right)}\sum_{k=0}^{K-1}\left(\tilde{h}(\mathbf{X}_{k},\mathbf{D}_{k})-\tilde{h}(\mathbf{X}_{k+1},\mathbf{D}_{k+1})\right)
16(h~(𝐗0,𝐃0)h~(𝐗K,𝐃K))3(1λ2)16(h~(𝐗0,𝐃0)h¯)3(1λ2),\displaystyle\leq\dfrac{16(\tilde{h}(\mathbf{X}_{0},\mathbf{D}_{0})-\tilde{h}(\mathbf{X}_{K},\mathbf{D}_{K}))}{3\left(1-\lambda^{2}\right)}\leq\dfrac{16(\tilde{h}(\mathbf{X}_{0},\mathbf{D}_{0})-\underline{h})}{3\left(1-\lambda^{2}\right)},

which implies the relationship (3.7). The other two relationships can be proved similarly. Therefore, we complete the proof. ∎
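As a concrete reading of Theorem 3.8 (a direct consequence of (3.8) rather than an additional result), for any tolerance ε>0 the bound (3.8) guarantees that min_{0≤k<K} ‖G(X̄_k)‖_F ≤ ε whenever

K\geq\dfrac{2\bigl(\tilde{h}(\mathbf{X}_{0},\mathbf{D}_{0})-\underline{h}\bigr)}{\eta\varepsilon^{2}},

so at most O(1/ε²) iterations, and hence O(1/ε²) rounds of communications, are required; the same order applies to the consensus and feasibility measures in (3.7) and (3.9).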

4 Numerical Simulations

In this section, the numerical performance of DESTINY is evaluated through comprehensive numerical experiments in comparison to state-of-the-art algorithms. The experiments are performed on a workstation with two Intel Xeon Gold 6242R processors (3.10 GHz, 20 cores each) and 510 GB of RAM under Ubuntu 20.04. All the tested algorithms are implemented in Python, and the communication is realized via the package mpi4py.

4.1 Test problems

In the numerical experiments, we test three types of problems, including PCA, OLSR and SDL, which are introduced in Subsection 1.1.

In the PCA problem (1.3), we construct the global data matrix An×mA\in\mathbb{R}^{n\times m} (assuming nmn\leq m without loss of generality) by its (economy-form) singular value decomposition as follows.

A=UΣV,A=U\Sigma V^{\top}, (4.1)

where Un×nU\in\mathbb{R}^{n\times n} and Vm×nV\in\mathbb{R}^{m\times n} have orthonormal columns, obtained by orthonormalizing randomly generated matrices, and Σn×n\Sigma\in\mathbb{R}^{n\times n} is a diagonal matrix with diagonal entries

Σii=ξi/2,i[n],\Sigma_{ii}=\xi^{i/2},\quad i\in[n], (4.2)

for a parameter ξ(0,1)\xi\in(0,1) that determines the decay rate of the singular values of AA. In general, slower decay (that is, ξ\xi closer to 1) corresponds to more difficult cases. Finally, the global data matrix AA is evenly distributed among the dd agents.
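For concreteness, the following Python sketch shows one way to realize this construction; the function name, the use of numpy, and the column-wise splitting of A among the agents are illustrative assumptions rather than part of our algorithm.

import numpy as np

def generate_pca_data(n, m, xi, d, seed=0):
    """Build A = U * diag(sigma) * V^T with sigma_i = xi^(i/2), i = 1, ..., n,
    and split its columns evenly among d agents (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Orthonormal factors obtained by QR factorization of random Gaussian matrices.
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((m, n)))  # economy form: V has n orthonormal columns
    sigma = xi ** (0.5 * np.arange(1, n + 1))         # Sigma_ii = xi^(i/2)
    A = (U * sigma) @ V.T                             # same as U @ np.diag(sigma) @ V.T
    return np.array_split(A, d, axis=1)               # local data blocks held by the d agents

# Example with the settings used later in Figure 1: n=100, m=3200, xi=0.9, d=32.
local_blocks = generate_pca_data(n=100, m=3200, xi=0.9, d=32)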

As for the OLSR problem (1.4) and SDL problem (1.5), we use real-world text and image datasets to construct the global data matrix, respectively.

Unless otherwise specified, the network in the following experiments is randomly generated as an Erdos-Renyi graph with a fixed edge probability 𝚙𝚛𝚘𝚋(0,1)\mathtt{prob}\in(0,1). We select the mixing matrix WW to be the Metropolis constant edge weight matrix [36].
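As an illustration, a minimal sketch of this setup follows. The weight rule W_ij = 1/(1+max{deg(i),deg(j)}) for (i,j)∈E, with W_ii = 1−∑_{j≠i} W_ij, is the common Metropolis construction; the resampling-until-connected loop and the use of the networkx package are our assumptions.

import numpy as np
import networkx as nx

def random_connected_er_graph(d, prob, seed=0):
    """Resample Erdos-Renyi graphs G(d, prob) until a connected one is obtained."""
    rng = np.random.default_rng(seed)
    while True:
        g = nx.erdos_renyi_graph(d, prob, seed=int(rng.integers(1 << 30)))
        if nx.is_connected(g):
            return g

def metropolis_weights(graph):
    """Metropolis constant edge weight matrix: symmetric and doubly stochastic."""
    d = graph.number_of_nodes()
    W = np.zeros((d, d))
    for i, j in graph.edges():
        W[i, j] = W[j, i] = 1.0 / (1.0 + max(graph.degree[i], graph.degree[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))  # diagonal entries make each row sum to one
    return W

W = metropolis_weights(random_connected_er_graph(d=32, prob=0.5))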

In our experiments, we collect and compare two performance measurements: (relative) substationarity violation defined by

1gradf(X¯0)Fgradf(X¯k)F,\dfrac{1}{\left\|\mathrm{grad}\,f(\bar{X}_{0})\right\|_{\mathrm{F}}}\left\|\mathrm{grad}\,f(\bar{X}_{k})\right\|_{\mathrm{F}},

and consensus error defined by

1di=1dXi,kX¯kF2.\sqrt{\dfrac{1}{d}\sum\limits_{i=1}^{d}\left\|X_{i,k}-\bar{X}_{k}\right\|^{2}_{\mathrm{F}}}.

When presenting numerical results, we plot the decay of the above two measurements versus the number of communication rounds.
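The sketch below illustrates how these quantities (together with the feasibility violation used in Subsection 4.2) can be evaluated. The tangent-space projection used for grad f is the standard formula for the Stiefel manifold under the Euclidean metric, and egrad_fn, a handle returning the Euclidean gradient of f, is a hypothetical helper; both are illustrative assumptions.

import numpy as np

def riemannian_grad(egrad, X):
    """Project the Euclidean gradient onto the tangent space of the Stiefel
    manifold: grad f(X) = egrad - X * sym(X^T egrad)."""
    XtG = X.T @ egrad
    return egrad - X @ (XtG + XtG.T) / 2.0

def substationarity_violation(egrad_fn, X_bar_k, X_bar_0):
    """Relative substationarity violation ||grad f(X_bar_k)||_F / ||grad f(X_bar_0)||_F."""
    return (np.linalg.norm(riemannian_grad(egrad_fn(X_bar_k), X_bar_k))
            / np.linalg.norm(riemannian_grad(egrad_fn(X_bar_0), X_bar_0)))

def consensus_error(local_copies):
    """sqrt((1/d) * sum_i ||X_i - X_bar||_F^2) over the d local copies."""
    X_bar = np.mean(local_copies, axis=0)
    return np.sqrt(np.mean([np.linalg.norm(X - X_bar) ** 2 for X in local_copies]))

def feasibility_violation(local_copies):
    """sqrt((1/d) * sum_i ||X_i^T X_i - I_p||_F^2), reported in Subsection 4.2."""
    p = local_copies[0].shape[1]
    return np.sqrt(np.mean([np.linalg.norm(X.T @ X - np.eye(p)) ** 2 for X in local_copies]))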

4.2 Default settings

In this subsection, we determine the default settings for our algorithm. The corresponding experiments are performed on an Erdos-Renyi network with 𝚙𝚛𝚘𝚋=0.5\mathtt{prob}=0.5.

In the first experiment, we test the performance of DESTINY with different choices of stepsizes. Classical strategies for choosing stepsizes, such as line search, are not applicable in the decentralized setting. Recently, BB stepsizes have been introduced into decentralized optimization for strongly convex problems [15]. In order to improve the performance of DESTINY, we apply the following BB stepsizes.

ηi,kBB=|Si,k,Ji,kJi,k,Ji,k|,\eta^{\mathrm{BB}}_{i,k}=\left\lvert\frac{\left\langle S_{i,k},J_{i,k}\right\rangle}{\left\langle J_{i,k},J_{i,k}\right\rangle}\right\rvert,

where Si,k=Xi,kXi,k1S_{i,k}=X_{i,k}-X_{i,k-1} and Ji,k=Di,kDi,k1J_{i,k}=D_{i,k}-D_{i,k-1}. As illustrated in Figure 1, the above BB stepsizes can significantly enhance the efficiency of DESTINY. Hence, in subsequent experiments, we equip DESTINY with BB stepsizes by default. To the best of our knowledge, this is the first attempt to accelerate the convergence of decentralized algorithms for nonconvex optimization problems by BB stepsizes.
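A minimal sketch of this local BB stepsize is given below; the safeguarding bounds eta_min and eta_max are our assumption and are not part of the formula above.

import numpy as np

def bb_stepsize(X_k, X_prev, D_k, D_prev, eta_min=1e-10, eta_max=1.0):
    """Local BB stepsize |<S, J> / <J, J>| with S = X_k - X_{k-1} and J = D_k - D_{k-1},
    computed independently by each agent from its own iterates."""
    S = X_k - X_prev
    J = D_k - D_prev
    denom = float(np.sum(J * J))
    if denom == 0.0:
        return eta_max  # no curvature information yet; fall back to the upper safeguard
    eta = abs(float(np.sum(S * J)) / denom)
    return float(np.clip(eta, eta_min, eta_max))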

Figure 1: Comparison of different stepsizes for DESTINY: (a) substationarity violation; (b) consensus error. The experimental settings are n=100, m=3200, p=5, ξ=0.9, and d=32. The penalty parameter of DESTINY is β=1.

Next, we demonstrate that our algorithm is not sensitive to the penalty parameter β\beta. Figure 2 depicts the performance of DESTINY with different values of β\beta, distinguished by colors. In Figure 2(c), the y-axis depicts the feasibility violation on a logarithmic scale, which is defined by:

1di=1dXi,kXi,kIpF2.\sqrt{\dfrac{1}{d}\sum\limits_{i=1}^{d}\left\|X_{i,k}^{\top}X_{i,k}-I_{p}\right\|^{2}_{\mathrm{F}}}.

We can observe that the curves of substationarity violations, consensus errors, and feasibility violations almost coincide with each other, which indicates that DESTINY delivers nearly identical performance over a wide range of penalty parameters. This observation validates the robustness of our algorithm with respect to the penalty parameter. Therefore, we fix β=1\beta=1 by default in the subsequent experiments.

Figure 2: Comparison of different penalty parameters for DESTINY: (a) substationarity violation; (b) consensus error; (c) feasibility violation. The experimental settings are n=200, m=6400, p=5, ξ=0.8, and d=16.

4.3 Comparison on decentralized PCA

In this subsection, we compare the performance of DESTINY in solving decentralized PCA problems (1.3) with some other state-of-the-art methods, including ADSA [12], DRGTA [5], and DeEPCA [47]. Among these three algorithms, ADSA and DRGTA require a stepsize to be chosen. In our empirical experiments, we find that DRGTA also benefits from the BB stepsizes, but ADSA cannot converge with them. Hence, DRGTA is equipped with the same BB stepsizes as DESTINY, while the stepsize of ADSA is fixed as a hand-optimized constant. In the following experiments, for a fair comparison, both DRGTA and DeEPCA perform only one round of communications per iteration.

In order to assess the quality of the above four solvers, we perform a set of experiments on randomly generated matrices as described in Subsection 4.1. The network is built based on an Erdos-Renyi graph with 𝚙𝚛𝚘𝚋\mathtt{prob} ranging from 0.20.2 to 0.60.6 in increments of 0.20.2. Other experimental settings are n=300n=300, m=3200m=3200, p=10p=10, ξ=0.95\xi=0.95, and d=32d=32. The numerical results are illustrated in Figure 3. It can be observed that DeEPCA and DRGTA cannot converge when 𝚙𝚛𝚘𝚋=0.2\mathtt{prob}=0.2, which signifies that a single consensus step per iteration is not sufficient for these two algorithms to converge on sparse networks. Apart from this special case, DeEPCA has the best performance among all the algorithms. Moreover, DESTINY is competitive with ADSA and significantly outperforms DRGTA in all cases. It should be noted that DeEPCA and ADSA are tailored for decentralized PCA problems and cannot be extended to generic problems.

Figure 3: Comparison of DRGTA, DESTINY, ADSA, and DeEPCA for prob ∈ {0.2, 0.4, 0.6} in solving decentralized PCA problems. The figures in the first and second rows depict substationarity violations and consensus errors, respectively.

4.4 Comparison on decentralized OLSR

In this subsection, the abilities of DESTINY and DRGTA [5] are examined in solving decentralized OLSR problems (1.4) on the real-world text dataset Cora [25]. For our testing, we select a subset of Cora, which contains research papers from three subfields, namely data structure (DS), hardware and architecture (HA), and programming language (PL), and extract the first n=100n=100 features. The numbers of samples (m)(m) and classes (p)(p) are summarized as follows.

  • Cora-DS: m=751m=751, p=9p=9.

  • Cora-HA: m=400m=400, p=7p=7.

  • Cora-PL: m=1575m=1575, p=9p=9.

A detailed description and download links for this dataset can be found in [51]. The number of agents is set to d=32d=32 in this experiment. We provide the numerical results in Figure 4, which demonstrates that DESTINY attains better performance than DRGTA in terms of both substationarity violations and consensus errors. Therefore, our algorithm proves efficient in dealing with real-world datasets.

Figure 4: Comparison between DRGTA and DESTINY on the Cora-DS, Cora-HA, and Cora-PL datasets in solving decentralized OLSR problems. The figures in the first and second rows depict substationarity violations and consensus errors, respectively.

4.5 Comparison on decentralized SDL

In this subsection, we evaluate the performance of DESTINY and DRGTA [5] on decentralized SDL problems (1.5). Three image datasets popular in machine learning research are tested in the following experiments: MNIST (available from http://yann.lecun.com/exdb/mnist/) as well as CIFAR-10 and CIFAR-100 (both available from https://www.cs.toronto.edu/~kriz/cifar.html). We summarize the numbers of features (n)(n) and samples (m)(m) as follows.

  • MNIST: n=784n=784, m=60000m=60000.

  • CIFAR-10: n=3072n=3072, m=50000m=50000.

  • CIFAR-100: n=3072n=3072, m=50000m=50000.

For our testing, the numbers of computed bases and agents are set to p=1p=1 and d=32d=32, respectively. Numerical results from this experiment are given in Figure 5. It can be observed that DESTINY always dominates DRGTA in terms of both substationarity violations and consensus errors. These results indicate that the observed superior performance of our algorithm is not limited to PCA and OLSR problems.

Figure 5: Comparison between DRGTA and DESTINY on the MNIST, CIFAR-10, and CIFAR-100 datasets in solving decentralized SDL problems. The figures in the first and second rows depict substationarity violations and consensus errors, respectively.

5 Conclusions

As a powerful and fundamental tool, Riemannian optimization has been adapted to solve consensus problems on a multi-agent network, giving rise to the decentralized Riemannian gradient tracking algorithm (DRGTA) on the Stiefel manifold. In order to guarantee convergence, DRGTA requires multiple rounds of communications per iteration, and hence its communication overhead is high. In this paper, we propose an infeasible decentralized algorithm, called DESTINY, which invokes only a single round of communications per iteration. DESTINY seeks a consensus in the Euclidean space directly by resorting to an elegant approximate augmented Lagrangian function. Even in the non-distributed setting, the proposed algorithm is new for optimization problems over the Stiefel manifold.

Most existing works analyzing the convergence of gradient-tracking-based algorithms impose stringent assumptions on the objective functions, such as global Lipschitz smoothness and coerciveness. For DESTINY, which tackles decentralized optimization problems with nonconvex constraints, we are able to derive global convergence and a worst-case complexity under rather mild conditions: the objective function is only assumed to be locally Lipschitz smooth.

Comprehensive numerical comparisons are carried out in the distributed setting, with test results strongly in favor of DESTINY. Most notably, DESTINY requires significantly fewer rounds of communications than DRGTA.

Finally, several topics are worthy of future study to fully understand the behavior and realize the potential of DESTINY and its variants. For example, an asynchronous version of DESTINY is worth investigating to address load balancing issues in distributed environments. Moreover, more work is needed to adapt DESTINY to dynamic network settings so that it can find a wider range of applications.

Acknowledgment

The authors would like to thank Prof. Lei-Hong Zhang for sharing the Cora dataset, and Prof. Qing Ling and Dr. Nachuan Xiao for their insightful discussions.

References

  • [1] T. E. Abrudan, J. Eriksson, and V. Koivunen, Steepest descent algorithms for optimization under unitary matrix constraint, IEEE Transactions on Signal Processing, 56 (2008), pp. 1134–1147.
  • [2] P.-A. Absil, C. G. Baker, and K. A. Gallivan, Trust-region methods on Riemannian manifolds, Foundations of Computational Mathematics, 7 (2006), pp. 303–330.
  • [3] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2008.
  • [4] T.-H. Chang, M. Hong, and X. Wang, Multi-agent distributed optimization via inexact consensus ADMM, IEEE Transactions on Signal Processing, 63 (2015), pp. 482–497.
  • [5] S. Chen, A. Garcia, M. Hong, and S. Shahrampour, Decentralized Riemannian gradient descent on the Stiefel manifold, in Proceedings of the 38th International Conference on Machine Learning, vol. 139, PMLR, 2021, pp. 1594–1605.
  • [6] Y. Chen, K. Yuan, Y. Zhang, P. Pan, Y. Xu, and W. Yin, Accelerating gossip SGD with periodic global averaging, in Proceedings of the 38th International Conference on Machine Learning, vol. 139, PMLR, 2021, pp. 1791–1802.
  • [7] A. Daneshmand, G. Scutari, and V. Kungurtsev, Second-order guarantees of distributed gradient algorithms, SIAM Journal on Optimization, 30 (2020), pp. 3029–3068.
  • [8] P. Di Lorenzo and G. Scutari, Next: In-network nonconvex optimization, IEEE Transactions on Signal and Information Processing over Networks, 2 (2016), pp. 120–136.
  • [9] A. Edelman, T. A. Arias, and S. T. Smith, The geometry of algorithms with orthogonality constraints, SIAM Journal on Matrix Analysis and Applications, 20 (1998), pp. 303–353.
  • [10] A. Gang and W. U. Bajwa, FAST-PCA: A fast and exact algorithm for distributed principal component analysis, arXiv:2108.12373, (2021).
  • [11] A. Gang and W. U. Bajwa, A linearly convergent algorithm for distributed principal component analysis, Signal Processing, 193 (2022), p. 108408.
  • [12] A. Gang, H. Raja, and W. U. Bajwa, Fast and communication-efficient distributed PCA, in Proceedings of the 2019 International Conference on Acoustics, Speech and Signal Processing, IEEE, 2019, pp. 7450–7454.
  • [13] B. Gao, X. Liu, X. Chen, and Y.-X. Yuan, A new first-order algorithmic framework for optimization problems with orthogonality constraints, SIAM Journal on Optimization, 28 (2018), pp. 302–332.
  • [14] B. Gao, X. Liu, and Y.-X. Yuan, Parallelizable algorithms for optimization problems with orthogonality constraints, SIAM Journal on Scientific Computing, 41 (2019), pp. A1949–A1983.
  • [15] J. Gao, X.-W. Liu, Y.-H. Dai, Y. Huang, and P. Yang, Achieving geometric convergence for distributed optimization with Barzilai-Borwein step sizes, Science China Information Sciences, 65 (2022), pp. 1–2.
  • [16] D. Hajinezhad and M. Hong, Perturbed proximal primal–dual algorithm for nonconvex nonsmooth optimization, Mathematical Programming, 176 (2019), pp. 207–245.
  • [17] M. Hong, D. Hajinezhad, and M.-M. Zhao, Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks, in International Conference on Machine Learning, PMLR, 2017, pp. 1529–1538.
  • [18] J. Hu, B. Jiang, L. Lin, Z. Wen, and Y.-X. Yuan, Structured quasi-Newton methods for optimization with orthogonality constraints, SIAM Journal on Scientific Computing, 41 (2019), pp. A2239–A2269.
  • [19] J. Hu, A. Milzarek, Z. Wen, and Y. Yuan, Adaptive quadratically regularized Newton method for Riemannian optimization, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 1181–1207.
  • [20] B. Jiang and Y.-H. Dai, A framework of constraint preserving update schemes for optimization on Stiefel manifold, Mathematical Programming, 153 (2015), pp. 535–575.
  • [21] T. Krasulina, Method of stochastic approximation in the determination of the largest eigenvalue of the mathematical expectation of random matrices, Automation and Remote Control, 2 (1970), pp. 50–56.
  • [22] Q. Ling, W. Shi, G. Wu, and A. Ribeiro, DLM: Decentralized linearized alternating direction method of multipliers, IEEE Transactions on Signal Processing, 63 (2015), pp. 4051–4064.
  • [23] Y. Liu, W. Xu, G. Wu, Z. Tian, and Q. Ling, Communication-censored ADMM for decentralized consensus optimization, IEEE Transactions on Signal Processing, 67 (2019), pp. 2565–2579.
  • [24] J. H. Manton, Optimization algorithms exploiting unitary constraints, IEEE Transactions on Signal Processing, 50 (2002), pp. 635–650.
  • [25] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, Automating the construction of internet portals with machine learning, Information Retrieval, 3 (2000), pp. 127–163.
  • [26] A. Nedic, A. Olshevsky, and W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs, SIAM Journal on Optimization, 27 (2017), pp. 2597–2633.
  • [27] A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, 54 (2009), pp. 48–61.
  • [28] Y. Nishimori and S. Akaho, Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold, Neurocomputing, 67 (2005), pp. 106–135.
  • [29] S. U. Pillai, T. Suel, and S. Cha, The Perron-Frobenius theorem: some of its applications, IEEE Signal Processing Magazine, 22 (2005), pp. 62–75.
  • [30] G. Qu and N. Li, Harnessing smoothness to accelerate distributed optimization, IEEE Transactions on Control of Network Systems, 5 (2017), pp. 1245–1260.
  • [31] S. S. Ram, A. Nedić, and V. V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization, Journal of Optimization Theory and Applications, 147 (2010), pp. 516–545.
  • [32] T. D. Sanger, Optimal unsupervised learning in a single-layer linear feedforward neural network, Neural Networks, 2 (1989), pp. 459–473.
  • [33] H. Sato, A Dai–Yuan-type Riemannian conjugate gradient method with the weak Wolfe conditions, Computational Optimization and Applications, 64 (2016), pp. 101–118.
  • [34] I. D. Schizas and A. Aduroja, A distributed framework for dimensionality reduction and denoising, IEEE Transactions on Signal Processing, 63 (2015), pp. 6379–6394.
  • [35] G. Scutari and Y. Sun, Distributed nonconvex constrained optimization over time-varying digraphs, Mathematical Programming, 176 (2019), pp. 497–544.
  • [36] W. Shi, Q. Ling, G. Wu, and W. Yin, EXTRA: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization, 25 (2015), pp. 944–966.
  • [37] E. Stiefel, Richtungsfelder und fernparallelismus in n-dimensionalen mannigfaltigkeiten, Commentarii Mathematici Helvetici, 8 (1935), pp. 305–353.
  • [38] L. Wang, B. Gao, and X. Liu, Multipliers correction methods for optimization problems over the Stiefel manifold, CSIAM Transactions on Applied Mathematics, 2 (2021), pp. 508–531.
  • [39] L. Wang, X. Liu, and Y. Zhang, A distributed and secure algorithm for computing dominant SVD based on projection splitting, arXiv:2012.03461, (2020).
  • [40] L. Wang, X. Liu, and Y. Zhang, A communication-efficient and privacy-aware distributed algorithm for sparse PCA, arXiv:2106.03320, (2021).
  • [41] Z. Wen and W. Yin, A feasible method for optimization with orthogonality constraints, Mathematical Programming, 142 (2013), pp. 397–434.
  • [42] L. Xiao and S. Boyd, Fast linear iterations for distributed averaging, Systems & Control Letters, 53 (2004), pp. 65–78.
  • [43] N. Xiao and X. Liu, Solving optimization problems over the Stiefel manifold by smooth exact penalty function, arXiv:2110.08986, (2021).
  • [44] N. Xiao, X. Liu, and Y.-X. Yuan, A class of smooth exact penalty function methods for optimization problems with orthogonality constraints, Optimization Methods and Software, (2020), pp. 1–37.
  • [45] N. Xiao, X. Liu, and Y.-X. Yuan, A penalty-free infeasible approach for a class of nonsmooth optimization problems over the Stiefel manifold, arXiv:2103.03514, (2021).
  • [46] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes, in 54th IEEE Conference on Decision and Control, IEEE, 2015, pp. 2055–2060.
  • [47] H. Ye and T. Zhang, DeEPCA: Decentralized exact PCA with linear convergence rate, Journal of Machine Learning Research, 22 (2021), pp. 1–27.
  • [48] K. Yuan, Q. Ling, and W. Yin, On the convergence of decentralized gradient descent, SIAM Journal on Optimization, 26 (2016), pp. 1835–1854.
  • [49] J. Zeng and W. Yin, On nonconvex decentralized gradient descent, IEEE Transactions on Signal Processing, 66 (2018), pp. 2834–2848.
  • [50] Y. Zhai, Z. Yang, Z. Liao, J. Wright, and Y. Ma, Complete dictionary learning via 4\ell_{4}-norm maximization over the orthogonal group, Journal of Machine Learning Research, 21 (2020), pp. 1–68.
  • [51] L.-H. Zhang, W. H. Yang, C. Shen, and J. Ying, An eigenvalue-based method for the unbalanced Procrustes problem, SIAM Journal on Matrix Analysis and Applications, 41 (2020), pp. 957–983.
  • [52] H. Zhao, Z. Wang, and F. Nie, Orthogonal least squares regression for feature extraction, Neurocomputing, 216 (2016), pp. 200–207.
  • [53] X. Zhu, A Riemannian conjugate gradient method for optimization on the Stiefel manifold, Computational Optimization and Applications, 67 (2017), pp. 73–110.