
A Communication-Efficient And Privacy-Aware Distributed Algorithm For Sparse PCA

Lei Wang State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China ([email protected]). Research is supported by the National Natural Science Foundation of China (No. 11971466 and 11991020).    Xin Liu State Key Laboratory of Scientific and Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China ([email protected]). Research is supported in part by the National Natural Science Foundation of China (No. 12125108, 11991021, and 12288201), Key Research Program of Frontier Sciences, Chinese Academy of Sciences (No. ZDBS-LY-7022).    Yin Zhang School of Data Science, The Chinese University of Hong Kong, Shenzhen, China ([email protected]). Research is supported in part by the Shenzhen Science and Technology Program (No. GXWD20201231105722002-20200901175001001).
Abstract

Sparse principal component analysis (PCA) improves the interpretability of the classic PCA by introducing sparsity into the dimension-reduction process. Optimization models for sparse PCA, however, are generally non-convex and non-smooth, and are more difficult to solve, especially on large-scale datasets requiring distributed computation over a wide network. In this paper, we develop a distributed algorithm (with a centralized communication pattern), called DSSAL1, for sparse PCA that aims to achieve low communication overhead by adapting a newly proposed subspace-splitting strategy to accelerate convergence. Theoretically, convergence to stationary points is established for DSSAL1. Extensive numerical results show that DSSAL1 requires far fewer rounds of communication than state-of-the-art peer methods. In addition, we make the case that, since the messages exchanged in DSSAL1 are well-masked, the possibility of private-data leakage in DSSAL1 is much lower than in some other distributed algorithms.

1 Introduction

Principal component analysis (PCA) is a fundamental and ubiquitous tool in statistics and data analytics. In particular, it frequently serves as a critical preprocessing step in numerous data science and machine learning tasks. In general, solutions produced by the classical PCA are dense combinations of all features. In practice, however, sparse combinations not only enhance the interpretability of the principal components but also reduce storage [47, 9], which motivates the idea of sparse PCA. More importantly, from a theoretical perspective as given in [34, 63], sparse PCA is able to remedy certain inconsistency phenomena of the classical PCA in the high-dimensional setting. As a dimension reduction and feature extraction method, sparse PCA can be applied in areas where PCA is normally used, such as medical imaging [47], biological knowledge mining [54], ecology study [24], cancer analysis [9], and neuroscience research [7], to mention a few examples.

Let A=[a_{1},a_{2},\dotsc,a_{m}] be an n\times m data matrix, where n and m denote the numbers of features and samples, respectively. Without loss of generality, we assume that each row of A has a zero mean. Mathematically speaking, PCA finds an orthonormal basis Z\in\mathcal{S}_{n,p}:=\{Z\in\mathbb{R}^{n\times p}\mid Z^{\top}Z=I_{p}\} of a p-dimensional subspace such that the projection of the samples a_{1},a_{2},\dotsc,a_{m} on this subspace has the most variance. The set \mathcal{S}_{n,p} is usually referred to as the Stiefel manifold [48]. Then PCA can be formally expressed as the following optimization problem with orthogonality constraints.

\min\limits_{Z\in\mathbb{R}^{n\times p}}\quad f(Z):=-\dfrac{1}{2}\mathrm{tr}\left(Z^{\top}AA^{\top}Z\right) \qquad (1)
\mathrm{s.\,t.}\quad\;\; Z\in\mathcal{S}_{n,p},

where \mathrm{tr}(\cdot) represents the trace of a given square matrix, and the columns of Z are called loading vectors or simply loadings.

In the projected data Z^{\top}A\in\mathbb{R}^{p\times m}, the number of features is reduced from n to p and each feature (row of Z^{\top}A) is a linear combination of the original features (rows of A) with coefficients from Z. For a sufficiently sparse Z, each reduced feature depends only on a few original features instead of all of them, leading to better interpretability in many applications. For this purpose, we consider the following \ell_{1}-regularized optimization model for sparse PCA:

\min\limits_{Z\in\mathbb{R}^{n\times p}}\quad \bar{f}(Z):=f(Z)+r(Z) \qquad (2)
\mathrm{s.\,t.}\quad\;\; Z\in\mathcal{S}_{n,p},

where the (non-operator) \ell_{1}-norm regularizer r(Z):=\mu\left\|Z\right\|_{1}=\mu\sum_{i,j}|[Z]_{ij}| is imposed to promote sparsity in Z, and \mu>0 is a parameter used to control the amount of sparsity. Here, [Z]_{ij} denotes the (i,j)-th entry of the matrix Z. Our distributed approach can efficiently tackle (2) with p>1 by pursuing sparsity and orthogonality at the same time.

The optimization problem (2) is a penalized version of the SCoTLASS model proposed in [35]. Evidently, there is a significant difficulty gap in going from the standard PCA to sparse PCA in terms of both problem complexity and solution methodology. The standard PCA is polynomial-time solvable, while sparse PCA is NP-hard if the \ell_{0}-regularizer is used to enforce sparsity [39]; the computational complexity of solving the \ell_{1}-regularized model (2) remains unclear. In this paper we will be content with a theoretical result of convergence to first-order stationary points of (2). There are other formulations for sparse PCA, such as regression models [62], semidefinite programming [14, 15], and matrix decompositions [46, 53]. A comparative study of these formulations is beyond the scope of this paper. We refer interested readers to [63] for a recent survey of theoretical and computational results for sparse PCA.

1.1 Distributed setting

We consider the following distributed setting. The data matrix A is divided into d blocks, each containing a number of samples; namely, A=[A_{1},A_{2},\dotsc,A_{d}] where A_{i}\in\mathbb{R}^{n\times m_{i}} so that m_{1}+\dotsb+m_{d}=m. This is a natural setting since each sample is a column of A. These submatrices A_{i}, i=1,\dotsc,d, are stored locally in d locations, possibly having been collected at different locations by different agents, and all the agents are connected through a communication network. According to the splitting scheme of A, the function f(Z) can also be distributed among the d agents, namely,

f(Z)=\sum\limits_{i=1}^{d}f_{i}(Z) \quad\mbox{with}\quad f_{i}(Z)=-\dfrac{1}{2}\mathrm{tr}\left(Z^{\top}A_{i}A_{i}^{\top}Z\right). \qquad (3)

In terms of network topology, we only need to assume that the network allows global summation operations (say, through all-reduce type of communications [43]) which are required by our algorithm. In particular, our algorithm will operate well in the federated-learning setting [41] where all the agents are connected to a center server so that global summations can be readily achieved in the network. To this extent, we say our algorithm is a centralized algorithm.
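The global summation over agents is the only cross-agent operation our method needs. The NumPy sketch below (illustrative only; the authors' actual implementation is in C++ with MPI, see Section 5) simulates this pattern for the per-agent products A_{i}A_{i}^{\top}Z; in a real deployment the final sum would be formed by a single all-reduce round.

# Illustrative NumPy simulation of per-agent products and their global sum;
# in a distributed run the sum over agents would be one MPI all-reduce.
import numpy as np

def local_product(A_i, Z):
    """Agent i's contribution A_i (A_i^T Z), costing O(n m_i p) flops."""
    return A_i @ (A_i.T @ Z)

def global_sum(A_blocks, Z):
    """Stand-in for the all-reduce: sum of the d local n-by-p products."""
    return sum(local_product(A_i, Z) for A_i in A_blocks)

rng = np.random.default_rng(0)
n, p, d = 50, 5, 4
A_blocks = [rng.standard_normal((n, 20)) for _ in range(d)]   # toy local data
Z = np.linalg.qr(rng.standard_normal((n, p)))[0]              # point on S_{n,p}
G = global_sum(A_blocks, Z)        # equals (A A^T) Z for A = [A_1, ..., A_d]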

For most distributed algorithms, since the amount of communications per iteration remains essentially the same, the total communication overhead is proportional to the required number of iterations regardless of the underlying network topology. Therefore, our goal is to devise an algorithm that converges fast in terms of iteration counts, while considerations on other network topologies and communication patterns are beyond the scope of the current paper.

In certain applications, such as those in healthcare and the financial industry [38, 60], preserving the privacy of local data is of primary importance. In this paper, we consider the following scenario: each agent i wants to keep its local dataset (i.e., A_{i}A_{i}^{\top}) from being discovered by any other agent, including the center. In this situation, it is not an option to implement a pre-agreed encryption or a coordinated masking operation. For convenience, we will refer to such a privacy requirement as intrinsic privacy. For an algorithm to preserve intrinsic privacy, publicly exchanged information must be safe so that none of the local-data matrices can be figured out by any means. We will show later that, in general, algorithms based on traditional methods with distributed matrix multiplications are not intrinsically private.

1.2 Related works

Optimization problems with orthogonality constraints have been actively investigated in recent decades, for which many algorithms and solvers have been developed, such as gradient approaches [40, 42, 1, 3], conjugate gradient approaches [16, 45, 61], constraint preserving updating schemes [52, 32], Newton methods [30, 29], trust-region methods [2], multipliers correction frameworks [21, 49], and orthonormalization-free approaches [22, 55]. These aforementioned algorithms are designed for smooth objective functions, and are generally not suitable for problem (2).

There exist algorithms specifically designed for solving non-smooth optimization problems with orthogonality constraints, most of which adapt certain non-smooth optimization techniques to the Stiefel manifold; for instance, Riemannian subgradient methods [17, 18], proximal point algorithms [6], non-smooth trust-region methods [25], gradient sampling methods [28], and proximal gradient methods [11, 31]. Among these algorithms, proximal gradient methods are among the most efficient. Specifically, Chen et al. [11] propose a manifold proximal gradient method (ManPG) and its accelerated version ManPG-Ada for non-smooth optimization problems over the Stiefel manifold. Starting from a current iterate, say Z^{(k)}\in\mathcal{S}_{n,p}, ManPG and ManPG-Ada generate the next iterate by solving the following subproblem restricted to the tangent space \mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}=\{D\in\mathbb{R}^{n\times p}\mid D^{\top}Z^{(k)}+(Z^{(k)})^{\top}D=0\}:

\min\limits_{D\in\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\; -\left\langle\sum\limits_{i=1}^{d}A_{i}A_{i}^{\top}Z^{(k)},D\right\rangle+\dfrac{1}{2\eta}\left\|D\right\|^{2}_{\mathrm{F}}+r(Z^{(k)}+D), \qquad (4)

which consumes the most computational time in the algorithm. In ManPG [11], the semi-smooth Newton (SSN) method [57] is deployed to solve the above subproblem, and global convergence to stationary points is established with the convergence rate O(1/\sqrt{k}). Huang and Wei [31] later extend the framework of ManPG to general Riemannian manifolds beyond the Stiefel manifold. Another class of algorithms first introduces auxiliary variables to split the objective function and the orthogonality constraint and then utilizes alternating minimization techniques to solve the resulting model, including SOC [37], MADMM [36], and PAMAL [12]. It is worth mentioning that SOC and MADMM lack a convergence guarantee. As for PAMAL, although convergence is guaranteed, its numerical performance is heavily dependent on its parameters as discussed in [11].

In principle, all the aforementioned algorithms can be adapted to the distributed setting considered in this paper. We take the algorithm ManPG as an example. Under the distributed setting, each agent computes the local product A_{i}A_{i}^{\top}Z^{(k)} individually, then one round of communication gathers the global sum \sum_{i=1}^{d}A_{i}A_{i}^{\top}Z^{(k)} at a center. The center then solves subproblem (4) and scatters the updated Z^{(k+1)} back to all agents. At each iteration, distributed computation is basically limited to the matrix-multiplication level.

We point out that, without prior data-masking, distributed algorithms at the matrix-multiplication level cannot preserve local-data privacy intrinsically. Specifically, local data A_{i}A_{i}^{\top}, privately owned by agent i, can be uncovered by anyone who has access to the publicly exchanged products A_{i}A_{i}^{\top}Z^{(k)}, after collecting enough such products and then solving a system of linear equations for the “unknown” A_{i}A_{i}^{\top}. This idea works even for more complicated iterative procedures as long as the mapping from the data A_{i}A_{i}^{\top} to a publicly exchanged quantity is deterministic and known.

To illustrate this data leakage situation, we now present a simple experiment on the algorithm ManPG-Ada [11]. The test environment will be described in Section 5. As mentioned earlier, at iteration k the quantity publicly shared in ManPG-Ada by agent i is S_{i}^{(k)}:=A_{i}A_{i}^{\top}Z^{(k)}, where Z^{(k)} is the k-th iterate of ManPG-Ada accessible to all agents. For instance, to recover the unknown local data A_{1}A_{1}^{\top} we construct the following linear system with unknown Y\in\mathbb{R}^{n\times n}:

Y[Z^{(1)},\cdots,Z^{(k)}]=[S_{1}^{(1)},\cdots,S_{1}^{(k)}]. \qquad (5)

Let Y^{(k)} be the minimum-norm least-squares solution to the linear equation (5) at iteration k. We perform an experiment to illustrate that Y^{(k)} quickly converges to A_{1}A_{1}^{\top} when ManPG-Ada is deployed to solve the sparse PCA problem on a randomly generated test matrix with n=100 and m=1280 (other parameters are d=10, p=10, and \mu=0.05). Figure 1 depicts how the reconstruction error \|Y^{(k)}-A_{1}A_{1}^{\top}\|_{\mathrm{F}} and the stationarity violation vary, on a logarithmic scale, as the number of iterations increases. (Suppose D^{(k)} is the solution to (4); according to Lemma 5.3 in [11], Z^{(k)} is a first-order stationary point if D^{(k)}=0, so the stationarity violation is defined as \|D^{(k)}\|_{\mathrm{F}}.) We observe that the local data A_{1}A_{1}^{\top} is recovered to high accuracy much sooner than the sparse PCA problem is solved.
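A minimal NumPy sketch of this reconstruction is given below. For illustration it feeds the least-squares system (5) with randomly generated orthonormal matrices in place of the actual ManPG-Ada iterates Z^{(k)}; the mechanism is the same.

# Sketch of the equation-solving attack (5): once enough public pairs
# (Z^(k), S_1^(k)) with S_1^(k) = A_1 A_1^T Z^(k) are collected, the private
# matrix A_1 A_1^T is recovered as the least-squares solution of Y [Z...] = [S...].
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 60, 5, 20                          # k rounds of shared iterates
A1 = rng.standard_normal((n, 30))
M1 = A1 @ A1.T                               # private local data A_1 A_1^T

Zs = [np.linalg.qr(rng.standard_normal((n, p)))[0] for _ in range(k)]
Ss = [M1 @ Z for Z in Zs]                    # publicly exchanged products
Zmat, Smat = np.hstack(Zs), np.hstack(Ss)    # both n x (k p)

# Solve Y Zmat = Smat, i.e. Zmat^T Y^T = Smat^T, in the least-squares sense.
Y = np.linalg.lstsq(Zmat.T, Smat.T, rcond=None)[0].T
print(np.linalg.norm(Y - M1))                # essentially zero once k p >= n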

Figure 1: Local data uncovered by solving linear systems during ManPG-Ada iterations.

In order to handle distributed datasets, some distributed ADMM-type algorithms have been introduced for PCA and related problems. These methods achieve algorithm-level parallelization, which is generally more secure than matrix-multiplication-level parallelization. For instance, [26] proposes a distributed algorithm for sparse PCA with convergence to stationary points, but only studies the special case of p=1. Moreover, under the distributed setting, [51] develops a subspace splitting strategy to tackle the smooth optimization problem over the Grassmann manifold, i.e., (2) without the \ell_{1}-norm regularizer.

Recently, decentralized optimization has attracted attention, partly due to its wide applications in wireless sensor networks. Only local communications are carried out in decentralized optimization, namely, agents exchange information only with their immediate neighbors. This is quite different from the distributed setting considered in the current paper. We refer interested readers to the references [23, 19, 20, 4, 59, 10, 50] for some decentralized PCA algorithms.

1.3 Main contributions

Recently, a subspace splitting strategy has been proposed [51] to accelerate the convergence of distributed algorithms for optimization problems over Grassmann manifolds whose objective functions are orthogonal-transformation invariant (a function f(X) is called orthogonal-transformation invariant if f(XO)=f(X) for any X\in\mathcal{S}_{n,p} and O\in\mathcal{S}_{p,p}) and smooth. In regularized problems such as sparse PCA, orthogonal-transformation invariance and smoothness no longer hold. In this paper, we present a non-trivial extension of the subspace splitting strategy to more general optimization problems over Stiefel manifolds. In particular, by incorporating the subspace splitting strategy into the framework of ManPG [11], we have developed a distributed algorithm, DSSAL1, to solve the \ell_{1}-regularized optimization problem (2) for sparse PCA.

The main step of DSSAL1 is to solve the subproblem:

\min\limits_{D\in\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\; \left\langle\sum\limits_{i=1}^{d}Q_{i}^{(k)}Z^{(k)},D\right\rangle+\dfrac{1}{2\eta}\left\|D\right\|^{2}_{\mathrm{F}}+r(Z^{(k)}+D), \qquad (6)

which is identical to subproblem (4) in ManPG except that each local data matrix A_{i}A_{i}^{\top} is replaced, or masked, by a matrix -Q_{i}^{(k)} whose expression will be derived later.

In a sense, the main contributions of this paper can be attributed to this remarkably simple replacement or masking, which brings two great benefits. Firstly, convergence is significantly accelerated, as will be shown empirically in Section 5. Secondly, the publicly exchanged products Q_{i}^{(k)}Z^{(k)} are no longer images of some fixed and known mapping of A_{i}A_{i}^{\top}, making it practically impossible to uncover A_{i}A_{i}^{\top} from such products via equation-solving.

Several innovations have been made in the development of our algorithms. Firstly, in our algorithm, only the global variable can possibly converge to a solution, while the local variables will never do. The role of the local variables, which are generally dense, is to help identify the right subspace. Secondly, we devise an inexact-solution strategy to effectively handle the difficult subproblem for the global variable where orthogonality and sparsity are pursued simultaneously. Thirdly, we establish the global convergence to stationarity for our algorithm, overcoming a number of technical difficulties associated with the rather complex algorithm construction, as well as the non-convexity and non-smoothness in problem (2).

1.4 Notations

The Euclidean inner product of two matrices Y_{1},Y_{2} of the same size is defined as \left\langle Y_{1},Y_{2}\right\rangle=\mathrm{tr}(Y_{1}^{\top}Y_{2}). The Frobenius norm and spectral norm of a given matrix C are denoted by \left\|C\right\|_{\mathrm{F}} and \left\|C\right\|_{2}, respectively. Given a differentiable function g(X):\mathbb{R}^{n\times p}\to\mathbb{R}, the gradient of g with respect to X is denoted by \nabla g(X), and the subdifferential of a Lipschitz continuous function h(X) is denoted by \partial h(X). The tangent space to the Stiefel manifold \mathcal{S}_{n,p} at Z\in\mathcal{S}_{n,p} is represented by \mathcal{T}_{Z}\mathcal{S}_{n,p}=\{Y\in\mathbb{R}^{n\times p}\mid Y^{\top}Z+Z^{\top}Y=0\}, and the orthogonal projection of Y onto \mathcal{T}_{Z}\mathcal{S}_{n,p} is denoted by \mathrm{Proj}_{\mathcal{T}_{Z}\mathcal{S}_{n,p}}\left(Y\right)=\left(I_{n}-ZZ^{\top}\right)Y+Z\left(Z^{\top}Y-Y^{\top}Z\right)/2. For X\in\mathcal{S}_{n,p}, we define {\mathbf{P}}^{\perp}_{X}:=I_{n}-XX^{\top}, the projection operator onto the null space of X^{\top}. Further notation will be introduced as it occurs.
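As a quick illustration of these operators, a NumPy sketch (illustrative only) of the tangent-space projection and of \mathbf{P}^{\perp}_{X} follows; Z and X are assumed to have orthonormal columns.

# NumPy sketch of the two projections defined above.
import numpy as np

def proj_tangent(Z, Y):
    """Proj_{T_Z S_{n,p}}(Y) = (I - Z Z^T) Y + Z (Z^T Y - Y^T Z) / 2."""
    ZtY = Z.T @ Y
    return Y - Z @ ZtY + Z @ (ZtY - ZtY.T) / 2.0

def proj_perp(X, Y):
    """P_X^perp applied to Y, i.e. (I - X X^T) Y, without forming I - X X^T."""
    return Y - X @ (X.T @ Y)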

1.5 Organization

The rest of this paper is organized as follows. In Section 2, we introduce a novel subspace-splitting model for sparse PCA, and investigate the structure of associated Lagrangian multipliers. Then a distributed algorithm is proposed to solve this model based on an ADMM-like framework in Section 3. Moreover, convergence properties of the proposed algorithm are studied in Section 4. Numerical experiments on a variety of test problems are presented in Section 5 to evaluate the performance of the proposed algorithm. We draw final conclusions and discuss some possible future developments in the last section.

2 Subspace-splitting model for sparse PCA

We consider the scenario in which the i-th component function f_{i}(X) of f in (3) can be evaluated only by the i-th agent, since the local dataset A_{i} is accessible only to the i-th agent. In order to devise a distributed algorithm, the classic variable-splitting approach is to introduce a set of local variables \{X_{i}\} to make the sum of local functions nominally separable. Then a (centralized) distributed algorithm would maintain a global variable Z and impose the variable-consensus constraints X_{i}=Z.

Although the regularizer term r in (2) is not invariant under orthogonal transformations, all component functions f_{i}(X) in f still are. It is therefore natural to adapt the subspace-splitting idea introduced in [51], that is, to use the subspace-consensus constraints X_{i}X_{i}^{\top}=ZZ^{\top} to accelerate convergence. In this paper, we propose to solve the following optimization problem:

\min\limits_{X_{i},Z\in\mathbb{R}^{n\times p}}\;\; \sum\limits_{i=1}^{d}f_{i}(X_{i})+r(Z) \qquad (7a)
\mathrm{s.\,t.}\;\;\quad X_{i}^{\top}X_{i}=I_{p},\quad i=1,\dotsc,d, \qquad (7b)
\qquad\quad\;\; X_{i}X_{i}^{\top}=ZZ^{\top},\quad i=1,\dotsc,d, \qquad (7c)
\qquad\quad\;\; Z^{\top}Z=I_{p}, \qquad (7d)

which we will call the subspace-splitting model for problem (2), noting that both sides of (7c) are orthogonal projections onto subspaces. For brevity, we collect the global variable Z and all local variables \{X_{i}\} into a point \left(Z,\{X_{i}\}\right). A point \left(Z,\{X_{i}\}\right) is feasible if it satisfies the constraints (7b)-(7d).

2.1 Stationarity conditions

In this subsection, we aim to present the stationarity conditions of the sparse PCA problem (2). We first introduce the definition of Clarke subgradient [13] for non-smooth functions.

Definition 2.1.

Suppose f:\mathbb{R}^{n\times p}\to\mathbb{R} is a Lipschitz continuous function. The generalized directional derivative of f at the point X\in\mathbb{R}^{n\times p} along the direction H\in\mathbb{R}^{n\times p} is defined by:

f^{\circ}(X;H):=\limsup\limits_{Y\to X,\,t\to 0^{+}}\dfrac{f(Y+tH)-f(Y)}{t}.

Based on the generalized directional derivative of f, the (Clarke) subgradient of f is defined by:

\partial f(X):=\{G\in\mathbb{R}^{n\times p}\mid\left\langle G,H\right\rangle\leq f^{\circ}(X;H)\mbox{~for all~}H\in\mathbb{R}^{n\times p}\}.

As discussed in [58, 11], the first-order stationarity condition of (2) can be stated as:

0\in\mathrm{Proj}_{\mathcal{T}_{Z}\mathcal{S}_{n,p}}\left(-AA^{\top}Z+\partial r(Z)\right).

We provide an equivalent description of the above first-order stationarity condition, which will be used in the theoretical analysis.

Lemma 2.2.

A point Z\in\mathcal{S}_{n,p} is a first-order stationary point of (2) if and only if there exists R(Z)\in\partial r(Z) such that the following conditions hold:

\left\{\begin{aligned} {\mathbf{P}}^{\perp}_{Z}\left(-AA^{\top}Z+R(Z)\right)&=0,\\ Z^{\top}R(Z)-R(Z)^{\top}Z&=0.\end{aligned}\right. \qquad (8)

The proof of Lemma 2.2 is given in Appendix A. We can then characterize the first-order stationary points of (7) in the following manner.

Definition 2.3.

Suppose X_{i}\in\mathbb{R}^{n\times p}\;(i=1,\dotsc,d) and Z\in\mathbb{R}^{n\times p}. A point (Z,\{X_{i}\}) is called a first-order stationary point of (7) if it is feasible and Z satisfies the conditions in (8).

2.2 Existence of low-rank multipliers

By associating dual variables \Gamma_{i}, \Lambda_{i}, and \Theta with the constraints (7b), (7c), and (7d), respectively, we derive an equivalent description of the first-order stationarity conditions of (7).

Proposition 2.4.

Suppose X_{i}\in\mathbb{R}^{n\times p}\;(i=1,\dotsc,d) and Z\in\mathbb{R}^{n\times p}. A point (Z,\{X_{i}\}) is a first-order stationary point of (7) if and only if there exist symmetric matrices \Lambda_{i}\in\mathbb{R}^{n\times n}, \Gamma_{i}\in\mathbb{R}^{p\times p}, and \Theta\in\mathbb{R}^{p\times p} such that (Z,\{X_{i}\}) satisfies the following conditions:

\left\{\begin{aligned} 0&=A_{i}A_{i}^{\top}X_{i}+X_{i}\Gamma_{i}+\Lambda_{i}X_{i},&i=1,\dotsc,d,\\ 0&\in\partial r(Z)+\sum\limits_{i=1}^{d}\Lambda_{i}Z-Z\Theta,\\ 0&=X_{i}^{\top}X_{i}-I_{p},&i=1,\dotsc,d,\\ 0&=X_{i}X_{i}^{\top}-ZZ^{\top},&i=1,\dotsc,d,\\ 0&=Z^{\top}Z-I_{p}.\end{aligned}\right. \qquad (9)

The proof of Proposition 2.4 is relegated to Appendix B. In fact, the equations in (9) can be viewed as the KKT conditions of (7), where \Gamma_{i}\in\mathbb{R}^{p\times p}, \Lambda_{i}\in\mathbb{R}^{n\times n}, and \Theta\in\mathbb{R}^{p\times p} are the Lagrangian multipliers corresponding to the constraints (7b), (7c), and (7d), respectively.

In [51], a low-rank and closed-form multiplier is devised with respect to the subspace constraint (7c) for the PCA problems, which can be expressed as:

\Lambda_{i}=-X_{i}X_{i}^{\top}A_{i}A_{i}^{\top}{\mathbf{P}}^{\perp}_{X_{i}}-{\mathbf{P}}^{\perp}_{X_{i}}A_{i}A_{i}^{\top}X_{i}X_{i}^{\top}\quad(i=1,\dotsc,d). \qquad (10)

In Appendix B, we further verify that the above formula is also valid for the sparse PCA problem at any first-order stationary point (Z,\{X_{i}\}). In the next section, we will use (10) to update the multipliers in our ADMM-like framework. This strategy saves both computational cost and storage.

3 Algorithmic framework

Now we describe the proposed algorithm, based on an ADMM-like framework, for solving the subspace-splitting model (7). Note that there are three sets of constraints (7b)-(7d). We penalize only the subspace constraints (7c) in the objective function, and obtain the corresponding augmented Lagrangian function:

\mathcal{L}(Z,\{X_{i}\},\{\Lambda_{i}\})=\sum\limits_{i=1}^{d}\mathcal{L}_{i}(Z,X_{i},\Lambda_{i})+r(Z), \qquad (11)

where

\mathcal{L}_{i}(Z,X_{i},\Lambda_{i})=-\dfrac{1}{2}\mathrm{tr}\left(X_{i}^{\top}A_{i}A_{i}^{\top}X_{i}\right)-\dfrac{1}{2}\left\langle\Lambda_{i},X_{i}X_{i}^{\top}-ZZ^{\top}\right\rangle+\dfrac{\beta_{i}}{4}\left\|X_{i}X_{i}^{\top}-ZZ^{\top}\right\|^{2}_{\mathrm{F}},

and \beta_{i}>0\;(i=1,\dotsc,d) are penalty parameters. Schematically, we will follow the ADMM-like framework below to build an algorithm for solving the subspace-splitting model (7), though we quickly add that the two optimization subproblems below in (12) and (13) are not “solved” in a normal sense since we may stop after a single iteration of an iterative scheme.

Z^{(k+1)}\approx\operatorname*{arg\,min}_{Z\in\mathcal{S}_{n,p}}\;\mathcal{L}(Z,\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\}). \qquad (12)
X_{i}^{(k+1)}\approx\operatorname*{arg\,min}_{X_{i}\in\mathcal{S}_{n,p}}\;\mathcal{L}_{i}(Z^{(k+1)},X_{i},\Lambda_{i}^{(k)}),\quad i=1,\dotsc,d. \qquad (13)
W_{i}^{(k+1)}=-{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}A_{i}A_{i}^{\top}X_{i}^{(k+1)},\quad i=1,\dotsc,d. \qquad (14)
{\Lambda}_{i}^{(k+1)}=X_{i}^{(k+1)}(W_{i}^{(k+1)})^{\top}+W_{i}^{(k+1)}(X_{i}^{(k+1)})^{\top},\quad i=1,\dotsc,d. \qquad (15)

In the above framework, the superscript (k) counts the number of iterations, and the subscript i indexes the agents.

A novel feature of our algorithm is the way the multipliers \Lambda_{i} associated with the subspace constraints (7c) are updated. Generally speaking, in the augmented Lagrangian based approach, the multipliers are updated by the dual ascent step

\Lambda_{i}^{(k+1)}=\Lambda_{i}^{(k)}-\tau\beta_{i}\left(X_{i}^{(k+1)}(X_{i}^{(k+1)})^{\top}-Z^{(k+1)}(Z^{(k+1)})^{\top}\right),

where \tau>0 is the step size. This standard update would require storing and updating an n\times n matrix at each agent, which could be costly when n is large. Instead, we use the low-rank updating formulas (14)-(15) based on the closed-form expression (10) derived in Section 2.2. In our iterative setting, these multiplier matrices are never stored but only applied in matrix multiplications, so the only additional storage required by agent i is the n\times p matrix W_{i}.
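The following NumPy sketch (illustrative, not the authors' implementation) shows how the factor W_i in (14) can be formed and how products \Lambda_i Z can then be computed via (15) using only n\times p quantities.

# Low-rank multiplier handling: keep only W_i = -P^perp_{X_i} A_i A_i^T X_i and
# apply Lambda_i = X_i W_i^T + W_i X_i^T implicitly, at O(n p^2) cost per product.
import numpy as np

def multiplier_factor(A_i, X_i):
    """W_i in (14): -(I - X_i X_i^T) A_i A_i^T X_i, an n-by-p matrix."""
    G = A_i @ (A_i.T @ X_i)
    return -(G - X_i @ (X_i.T @ G))

def multiplier_apply(W_i, X_i, Z):
    """Lambda_i Z = X_i (W_i^T Z) + W_i (X_i^T Z), per (15), without forming Lambda_i."""
    return X_i @ (W_i.T @ Z) + W_i @ (X_i.T @ Z)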

3.1 Subproblem for global variable

We now describe how to approximately solve subproblems (12) and (13). After rearrangement, subproblem (12) reduces to:

\min\limits_{Z\in\mathcal{S}_{n,p}}\; q^{(k)}(Z):=\dfrac{1}{2}\mathrm{tr}\left(Z^{\top}Q^{(k)}Z\right)+r(Z) \qquad (16)

where Q^{(k)}=\sum_{i=1}^{d}Q_{i}^{(k)} and Q_{i}^{(k)} is an n\times n matrix defined by

Q_{i}^{(k)}=\Lambda_{i}^{(k)}-\beta_{i}X_{i}^{(k)}(X_{i}^{(k)})^{\top}. \qquad (17)

We quickly point out here that it is not necessary to construct and store these Q-matrices explicitly since we will only use them to multiply n\times p matrices in an iterative scheme to obtain approximate solutions to subproblem (16). More details will follow later.

Apparently, subproblem (16) pursues orthogonality and sparsity simultaneously, and is not easier to solve than the original problem (2). However, we will demonstrate later, by both theoretical analysis and numerical experiments, that inexactly solving (16) by conducting one proximal gradient step is adequate for global convergence.

Starting from the current iterate Z^{(k)}, we first find a descent direction D^{(k)} restricted to the tangent space \mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p} by solving the following subproblem

\min\limits_{D\in\mathbb{R}^{n\times p}}\quad \left\langle Q^{(k)}Z^{(k)},D\right\rangle+\dfrac{1}{2\eta}\left\|D\right\|^{2}_{\mathrm{F}}+r(Z^{(k)}+D) \qquad (18)
\mathrm{s.\,t.}\quad\;\; D^{\top}Z^{(k)}+(Z^{(k)})^{\top}D=0,

where \eta>0 is the step size. Since Z^{(k)}+D^{(k)} does not necessarily lie on the Stiefel manifold \mathcal{S}_{n,p}, we then perform a projection to bring it back to \mathcal{S}_{n,p}, which can be represented as:

Z^{(k+1)}=\mathrm{Proj}_{\mathcal{S}_{n,p}}\left(Z^{(k)}+D^{(k)}\right).

Here, the orthogonal projection of a matrix C\in\mathbb{R}^{n\times p} onto \mathcal{S}_{n,p} is denoted by \mathrm{Proj}_{\mathcal{S}_{n,p}}\left(C\right)=U_{C}V_{C}^{\top}, where U_{C}\Sigma_{C}V_{C}^{\top} is the economic form of the singular value decomposition of C.
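A one-function NumPy sketch of this projection (illustrative; it assumes C has full column rank) is:

# Orthogonal projection onto the Stiefel manifold via the economy-size SVD:
# Proj_{S_{n,p}}(C) = U V^T where C = U Sigma V^T.
import numpy as np

def proj_stiefel(C):
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ Vt

# Usage in the update above: Z_next = proj_stiefel(Z + D).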

Remark 1.

We note that the subproblems solved by ManPG [11] are identical in form to our subproblem (18), but with Q^{(k)} replaced by the fixed matrix -AA^{\top} built from the data. That is, ManPG applies manifold proximal gradient steps to a fixed problem, while our algorithm computes steps of the same type using a sequence of matrices, each updated to incorporate the latest information. From this point of view, our algorithm can be interpreted as an acceleration scheme that reduces the number of outer iterations, thereby reducing communication overhead.

Since the n\times n matrices Q_{i}^{(k)}\;(i=1,\dotsc,d) are distributed among the d agents, no single agent is able to solve subproblem (18) independently. Fortunately, we only need to calculate

Q^{(k)}Z^{(k)}=\sum\limits_{i=1}^{d}Q_{i}^{(k)}Z^{(k)}, \qquad (19)

to solve this subproblem. In the distributed setting, the sum on the right-hand side of (19) can be formed by computing Q_{i}^{(k)}Z^{(k)} at each agent and then invoking an all-reduce type of communication, in which each agent shares just one n\times p matrix. If the all-reduce communication is realized by the butterfly algorithm [43], the communication overhead per iteration is O\left(np\log(d)\right). Furthermore, each local product Q_{i}^{(k)}Z^{(k)} can be computed from

Q_{i}^{(k)}Z^{(k)}=\Lambda_{i}^{(k)}Z^{(k)}-\beta_{i}X_{i}^{(k)}(X_{i}^{(k)})^{\top}Z^{(k)}=X_{i}^{(k)}(W_{i}^{(k)})^{\top}Z^{(k)}+W_{i}^{(k)}(X_{i}^{(k)})^{\top}Z^{(k)}-\beta_{i}X_{i}^{(k)}(X_{i}^{(k)})^{\top}Z^{(k)}, \qquad (20)

with a computational cost in the order of O(np^{2}). From the above formula, one observes that the n\times n matrices Q_{i} need not be stored explicitly.
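For concreteness, a NumPy sketch of formula (20) (illustrative only; the returned n\times p matrix is the quantity each agent shares) is:

# Agent i's masked product Q_i^(k) Z^(k), formed from n-by-p factors only, per (20).
import numpy as np

def local_masked_product(W_i, X_i, Z, beta_i):
    """Q_i Z = X_i (W_i^T Z) + W_i (X_i^T Z) - beta_i X_i (X_i^T Z)."""
    XtZ = X_i.T @ Z
    return X_i @ (W_i.T @ Z) + W_i @ XtZ - beta_i * (X_i @ XtZ)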

Now we consider how to efficiently solve subproblem (18). By associating a multiplier \Upsilon\in\mathbb{R}^{p\times p} with the linear equality constraint, the Lagrangian function of (18) can be written as:

\mathfrak{L}(D,\Upsilon)=\left\langle Q^{(k)}Z^{(k)},D\right\rangle+\dfrac{1}{2\eta}\left\|D\right\|^{2}_{\mathrm{F}}+r(Z^{(k)}+D)-\dfrac{1}{2}\left\langle\Upsilon,D^{\top}Z^{(k)}+(Z^{(k)})^{\top}D\right\rangle.

We choose to apply the Uzawa method [5] to solve subproblem (18). At the j-th inner iteration, we first minimize the above Lagrangian function with respect to D for a fixed \Upsilon=\Upsilon(j):

D(j+1)=\operatorname*{arg\,min}\limits_{D\in\mathbb{R}^{n\times p}}\mathfrak{L}(D,\Upsilon(j))=\mathrm{Prox}_{\eta r}\left(Z^{(k)}-\eta\left(Q^{(k)}Z^{(k)}-Z^{(k)}\Upsilon(j)\right)\right)-Z^{(k)}, \qquad (21)

where D(j) and \Upsilon(j) denote the j-th inner iterates of D and \Upsilon, respectively. Here, we use \mathrm{Prox}_{g}\left(X\right) to denote the proximal mapping of a given function g:\mathbb{R}^{n\times p}\to\mathbb{R} at the point X\in\mathbb{R}^{n\times p}, which is defined by:

\mathrm{Prox}_{g}\left(X\right)=\operatorname*{arg\,min}\limits_{Y\in\mathbb{R}^{n\times p}}\; g(Y)+\dfrac{1}{2}\left\|Y-X\right\|^{2}_{\mathrm{F}}.

For the \ell_{1}-norm regularizer r(X)=\mu\left\|X\right\|_{1}, the proximal mapping in (21) admits a closed-form solution:

\left[\mathrm{Prox}_{\eta r}\left(X\right)\right]_{ij}=\left\{\begin{array}{ll}\left[X\right]_{ij}-\eta\mu,&\mbox{~if~}\left[X\right]_{ij}>\eta\mu,\\ 0,&\mbox{~if~}-\eta\mu\leq\left[X\right]_{ij}\leq\eta\mu,\\ \left[X\right]_{ij}+\eta\mu,&\mbox{~if~}\left[X\right]_{ij}<-\eta\mu,\end{array}\right.

where the subscript [\,\cdot\,]_{ij} represents the (i,j)-th entry of a matrix. Then the multiplier is updated by a dual ascent step:

\Upsilon(j+1)=\Upsilon(j)-\tau\left(D(j+1)^{\top}Z^{(k)}+(Z^{(k)})^{\top}D(j+1)\right), \qquad (22)

where \tau>0 is the step size. These two steps are repeated until convergence. The complete framework is summarized in Algorithm 1.

The Uzawa method can be viewed as a special case of the primal-dual hybrid gradient algorithm (PDHG) developed in [27], with the convergence rate O(1/k) in the ergodic sense under mild conditions. Moreover, as an inner solver it can bring higher overall efficiency than the SSN method used by ManPG [11] (see [56] for a recent study).

Input: Z^{(k)}, Q^{(k)}Z^{(k)}, and \eta in subproblem (18).
Set j:=0, and choose the step size \tau>0 as well as the initial variable \Upsilon(0).
while not converged do
      Compute D(j+1) by (21).
      Update \Upsilon(j+1) by (22).
      Set j:=j+1.
Output: D(j).
Algorithm 1 Uzawa method for subproblem (18).
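A self-contained NumPy sketch of Algorithm 1 is given below; it is illustrative only, and the maximum iteration count and tolerance here are placeholder choices rather than the settings reported in Section 5.

# Uzawa iterations (21)-(22) for subproblem (18); QZ is the aggregated Q^(k) Z^(k).
import numpy as np

def soft_threshold(X, t):
    """Closed-form proximal mapping of t * ||.||_1 (entrywise shrinkage)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def uzawa(Z, QZ, eta, mu, tau, max_iter=10, tol=1e-10):
    p = Z.shape[1]
    Upsilon = np.zeros((p, p))      # multiplier for the tangent-space constraint
    D = np.zeros_like(Z)
    for _ in range(max_iter):
        # Primal step (21): D(j+1) = Prox_{eta r}(Z - eta (QZ - Z Upsilon)) - Z.
        D = soft_threshold(Z - eta * (QZ - Z @ Upsilon), eta * mu) - Z
        # Dual ascent step (22) on the residual of D^T Z + Z^T D = 0.
        R = D.T @ Z + Z.T @ D
        Upsilon = Upsilon - tau * R
        if np.linalg.norm(R, 'fro') <= tol:
            break
    return D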

3.2 Subproblems for local variables

In this subsection, we focus on subproblem (13) for the i-th local variable X_{i}, which can be rearranged into the following equivalent problem:

\min\limits_{X_{i}\in\mathcal{S}_{n,p}}\; h_{i}^{(k)}(X_{i}):=-\dfrac{1}{2}\mathrm{tr}\left(X_{i}^{\top}H_{i}^{(k)}X_{i}\right). \qquad (23)

Here, H_{i}^{(k)} is an n\times n real symmetric matrix:

H_{i}^{(k)}=A_{i}A_{i}^{\top}+\Lambda_{i}^{(k)}+\beta_{i}Z^{(k+1)}(Z^{(k+1)})^{\top}, \qquad (24)

which involves only the local data A_{i}. This is a standard eigenvalue problem in which one needs to compute a p-dimensional dominant eigenspace of H_{i}^{(k)}.

As a subproblem, (23) need not be solved to high precision. In practice, we just need to find a point X_{i}^{(k+1)}\in\mathcal{S}_{n,p} satisfying the following two conditions, which empirically suffices as a good inexact solution and guarantees the global convergence of the whole algorithm. The first condition demands a sufficient decrease in function value:

h_{i}^{(k)}\left(X_{i}^{(k)}\right)-h_{i}^{(k)}\left(X_{i}^{(k+1)}\right)\geq\dfrac{c_{i}}{c_{i}^{\prime}\left\|A_{i}\right\|_{2}^{2}+\beta_{i}}\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}H_{i}^{(k)}X_{i}^{(k)}\right\|^{2}_{\mathrm{F}}, \qquad (25)

where c_{i}>0 and c_{i}^{\prime}>0 are two constants independent of \beta_{i}. The second condition is a sufficient decrease in KKT violation:

\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}H_{i}^{(k)}X_{i}^{(k+1)}\right\|_{\mathrm{F}}\leq\delta_{i}\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}H_{i}^{(k)}X_{i}^{(k)}\right\|_{\mathrm{F}}, \qquad (26)

where \delta_{i}\in[0,1) is a constant independent of \beta_{i}. It turns out that these two rather weak termination conditions for subproblem (23) are sufficient for us to derive global convergence of our ADMM-like algorithmic framework (12)-(15).

In practice, we can combine a warm-start strategy with a single iteration of SSI [44] to generate the next iterate X_{i}^{(k+1)}=\mathrm{Proj}_{\mathcal{S}_{n,p}}\left(H_{i}^{(k)}X_{i}^{(k)}\right), i.e.,

X_{i}^{(k+1)}=\mathrm{Proj}_{\mathcal{S}_{n,p}}\left(A_{i}A_{i}^{\top}X_{i}^{(k)}+\Lambda_{i}^{(k)}X_{i}^{(k)}+\beta_{i}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k)}\right)=\mathrm{Proj}_{\mathcal{S}_{n,p}}\left(A_{i}A_{i}^{\top}X_{i}^{(k)}+W_{i}^{(k)}+\beta_{i}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k)}\right), \qquad (27)

which can be computed in the order of O(np^{2}) floating-point operations, given that the term A_{i}A_{i}^{\top}X_{i}^{(k)} is inherited from the previous iteration as a by-product of updating W_{i}^{(k)}; see (14).
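A NumPy sketch of this single SSI step (illustrative only; AAX denotes the cached product A_iA_i^T X_i^(k) mentioned above) is:

# One SSI step (27): orthonormal polar factor of H_i^(k) X_i^(k), assembled
# from n-by-p quantities only.
import numpy as np

def ssi_step(AAX, W_i, X_i, Z_new, beta_i):
    M = AAX + W_i + beta_i * (Z_new @ (Z_new.T @ X_i))
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt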

3.3 Algorithm description

We formally present the detailed algorithmic framework as Algorithm 2 below, named distributed subspace splitting algorithm with \ell_{1} regularization and abbreviated as DSSAL1. In the distributed environment, all agents are initiated from the same point Z^{(0)}\in\mathcal{S}_{n,p}, and the initial multipliers are computed by (15). After initialization, all agents first solve the common subproblem for Z collaboratively through a certain communication strategy. Then each agent solves its subproblem for X_{i} and updates its multiplier \Lambda_{i}. These two steps involve only the local data privately stored at each agent, and hence can be carried out by the d agents concurrently. This procedure is repeated until convergence.

Input: functions f_{i}\;(i=1,\dotsc,d) and r.
Set k:=0, choose penalty parameters \{\beta_{i}\}, and initialize Z^{(0)}.
Set X_{i}^{(0)}=Z^{(0)} for i=1,\dotsc,d.
Compute the initial multipliers \{\Lambda_{i}^{(0)}\} by (15).
while not converged do
      Solve (18) to obtain D^{(k)} by Algorithm 1.
      Set Z^{(k+1)}=\mathrm{Proj}_{\mathcal{S}_{n,p}}\left(Z^{(k)}+D^{(k)}\right).
      for i=1,\dotsc,d do
            Find X_{i}^{(k+1)}\in\mathcal{S}_{n,p} satisfying (25) and (26).
            Update the multipliers \Lambda_{i}^{(k+1)} by (15).
      Set k:=k+1.
Output: Z^{(k)}.
Algorithm 2 Distributed Subspace Splitting Algorithm with \ell_{1} regularization (DSSAL1).

3.4 Data privacy

We claim that DSSAL1 can naturally protect the intrinsic privacy of local data. To form the global sum in (19), the information shared in DSSAL1 at iteration k by the i-th agent is S_{i}^{(k)}:=Q_{i}^{(k)}Z^{(k)}, where Z^{(k)} is known to all agents. However, the n\times p system of equations Q_{i}^{(k)}Z^{(k)}=S_{i}^{(k)} is insufficient for obtaining the n\times n mask matrix Q_{i}^{(k)}, which changes from iteration to iteration. Secondly, even if a few mask matrices Q_{i}^{(k)} were unveiled, it would still be impossible to derive the local data matrix A_{i}A_{i}^{\top} from these Q_{i}^{(k)} without knowing the corresponding X_{i}^{(k)} (and \beta_{i}), which are always kept privately by the i-th agent. Finally, consider the ideal “converged” case where X_{i}X_{i}^{\top}=ZZ^{\top} held at iteration k and \beta_{i} were known. In this case, Q_{i}^{(k)} would be a known linear function of A_{i}A_{i}^{\top} parameterized by Z^{(k)}. Still, the n\times p system Q_{i}^{(k)}Z^{(k)}=S_{i}^{(k)} would not be sufficient to uncover the n\times n local data matrix A_{i}A_{i}^{\top} (strictly speaking, one only needs to recover n(n+1)/2 entries since A_{i}A_{i}^{\top} is symmetric). Based on this discussion, we call DSSAL1 a privacy-aware method.

3.5 Computational cost

We conclude this section by discussing the computational cost of our algorithm per iteration. We first compute the matrix product Q^{(k)}Z^{(k)} by (19) and (20), whose computational cost for each agent is O(np^{2}) as mentioned earlier. Then, at the center, the Uzawa method is applied to solving subproblem (18), which has a per-iteration complexity of O(np^{2}); in practice, it usually takes very few iterations to generate Z^{(k+1)}. Next, each agent uses a single iteration of SSI to generate X_{i}^{(k+1)} by (27) with the computational cost O(np^{2}) as discussed before. Finally, agent i updates W_{i}^{(k+1)} by (14) with the computational cost 4npm_{i}+O(np^{2}) (which represents the multiplier matrix \Lambda_{i}^{(k+1)} implicitly). Overall, for each agent, the computational cost of our algorithm is 4npm_{i}+O(np^{2}) per iteration. At the center, the computational cost for approximately solving (18) is, empirically speaking, O(np^{2}).

4 Convergence analysis

In this section, we analyze the global convergence of Algorithm 2 and prove that a sequence \{Z^{(k)}\} generated by Algorithm 2 has at least one accumulation point, and that any accumulation point is a first-order stationary point. A global, sub-linear convergence rate is also established.

We start with a property that a feasible point is first-order stationary if no progress can be made by solving (18).

Lemma 4.1.

Let (Z^{(k)},\{X_{i}^{(k)}\}) be feasible. Then Z^{(k)} is a first-order stationary point of the sparse PCA problem (2) if D^{(k)}=0 is the minimizer of subproblem (18).

The proof of Lemma 4.1 is deferred to Appendix C. It motivates the following definition of an \epsilon-stationary point for the subspace-splitting model (7).

Definition 4.2.

Suppose Z^{(k)} is the k-th iterate of Algorithm 2. Then Z^{(k)} is called an \epsilon-stationary point if the following condition holds:

\dfrac{1}{d}\sum\limits_{i=1}^{d}\left\|Z^{(k)}(Z^{(k)})^{\top}-X_{i}^{(k)}(X_{i}^{(k)})^{\top}\right\|^{2}_{\mathrm{F}}+\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}\leq\epsilon^{2},

where \epsilon>0 is a small constant.
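For reference, a NumPy sketch of this stationarity measure is shown below; it assumes Z and every X_i have orthonormal columns (which holds for all iterates of Algorithm 2), so the consensus term can be evaluated without forming any n-by-n matrix.

# Stationarity measure of Definition 4.2:
# (1/d) sum_i ||Z Z^T - X_i X_i^T||_F^2 + ||D||_F^2.
import numpy as np

def consensus_gap_sq(Z, X):
    # For orthonormal Z, X: ||Z Z^T - X X^T||_F^2 = 2 p - 2 ||Z^T X||_F^2.
    p = Z.shape[1]
    return 2.0 * p - 2.0 * np.linalg.norm(Z.T @ X, 'fro') ** 2

def stationarity_measure(Z, X_list, D):
    return (np.mean([consensus_gap_sq(Z, X) for X in X_list])
            + np.linalg.norm(D, 'fro') ** 2)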

In order to prove the convergence of our algorithm, we need to impose some mild conditions on algorithm parameters, which are summarized below.

Condition 1.

The algorithm parameters \eta>0, c_{i}>0, c_{i}^{\prime}>0, \delta_{i}\in[0,1), as well as two auxiliary parameters \rho\geq 1 and \underline{\sigma}\in(0,1), satisfy the following conditions:

0<\eta<\dfrac{1}{2\bar{M}},\quad 0\leq\delta_{i}<\dfrac{\underline{\sigma}}{2\sqrt{\rho d}},\quad 0<\underline{\sigma}<\min\left\{1,\dfrac{1}{\sqrt{c_{i}}}\right\},\quad i=1,\dotsc,d,

where \bar{M}=\mu\sqrt{np}/2+2\left\|A\right\|^{2}_{\mathrm{F}}+\sqrt{p}\sum_{i=1}^{d}\beta_{i}>0 is a constant.

Condition 2.

Each penalty parameter \beta_{i}\;(i=1,\dotsc,d) in (11) has the lower bound

\max\left\{\xi_{i}\left\|A_{i}\right\|_{2}^{2},\;\dfrac{8(\mu np+\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}})}{(1-\underline{\sigma}^{2})},\;\dfrac{6\sqrt{p}\left\|A_{i}\right\|^{2}_{\mathrm{F}}}{c_{i}\underline{\sigma}^{2}(1-\underline{\sigma}^{2})}\right\},

where \xi_{i}=\max\{c_{i}^{\prime},\;4\sqrt{2}/\underline{\sigma},\;4(2\sqrt{\rho d}+\sqrt{2})/(\underline{\sigma}-2\sqrt{\rho d}\delta_{i}),\;4(\sqrt{2\rho d}+1)/(c_{i}\underline{\sigma}^{2}\rho d)\}>0 is a constant. In addition, \beta_{i}\leq\rho\beta_{j} holds for any i,j\in\{1,\dotsc,d\}.

We note that the above technical conditions are not necessary in a practical implementation; they are introduced purely for the purpose of theoretical analysis, to facilitate obtaining a global convergence rate and a corresponding worst-case complexity for Algorithm 2.

Theorem 4.3.

Suppose \{Z^{(k)}\} is an iterate sequence generated by Algorithm 2, starting from an arbitrary orthogonal matrix Z^{(0)}\in\mathcal{S}_{n,p}, with parameters satisfying Conditions 1 and 2. Then \{Z^{(k)}\} has at least one accumulation point, and any accumulation point must be a first-order stationary point of the sparse PCA problem (2). Moreover, for any integer K>1, it holds that

\min\limits_{k=1,\dotsc,K}\left\{\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}+\dfrac{1}{d}\sum\limits_{i=1}^{d}\left\|Z^{(k)}(Z^{(k)})^{\top}-X_{i}^{(k)}(X_{i}^{(k)})^{\top}\right\|^{2}_{\mathrm{F}}\right\}\leq\dfrac{C}{K},

where C>0 is a constant.

The proof of Theorem 4.3, which is rather complicated and lengthy, is given in Appendix D. The global sub-linear convergence rate in Theorem 4.3 guarantees that DSSAL1 is able to return an \epsilon-stationary point in at most O(1/\epsilon^{2}) iterations. Since DSSAL1 performs one round of communication per iteration, the number of communication rounds required to obtain an \epsilon-stationary point is also at most O(1/\epsilon^{2}).

5 Numerical results

In this section, we evaluate the empirical performance of DSSAL1 through a set of comprehensive numerical experiments. All the experiments throughout this section are performed on a high-performance computing cluster called LSSC-IV (more information at http://lsec.cc.ac.cn/chinese/lsec/LSSC-IVintroduction.pdf), which is maintained at the State Key Laboratory of Scientific and Engineering Computing (LSEC), Chinese Academy of Sciences. The LSSC-IV cluster has 408 nodes, each consisting of two Intel(R) Xeon(R) Gold 6140 processors (2.30GHz, 18 cores each) with 192GB of memory, running under the operating system Red Hat Enterprise Linux Server 7.3.

We compare the performance of DSSAL1 with two state-of-the-art algorithms: (1) an ADMM-type algorithm called SOC [37] and (2) a manifold proximal gradient method called ManPG-Ada [11]. Since open-source, parallel codes for these two algorithms are not available, to conduct experiments under the distributed environment of the LSSC-IV cluster we implemented the two existing algorithms and our own algorithm DSSAL1 in C++ with MPI for inter-process communications (our code is downloadable from http://lsec.cc.ac.cn/~liuxin/code.html). For the two existing algorithms, we set all parameters to their default values as described in [37, 11]. The linear algebra library Eigen (version 3.3.8, available from https://eigen.tuxfamily.org/index.php?title=Main_Page) is adopted for matrix computation tasks.

5.1 DSSAL1 Implementation details

In Algorithm 2, we set the penalty parameters to \beta_{i}=0.1(\|\nabla f_{i}(X_{i}^{(0)})\|_{\mathrm{F}}+\mu), and in subproblem (18) we set the step size to \eta=1/(\sum_{i=1}^{d}\beta_{i}). In Algorithm 1, we set the step size to \tau=1/\left(2\eta\right) and terminate the algorithm whenever

\left\|D(j)^{\top}Z^{(k)}+(Z^{(k)})^{\top}D(j)\right\|_{\mathrm{F}}\leq\left\|D^{(k-1)}\right\|_{\mathrm{F}}

is satisfied or the number of iterations reaches 10.

We use the well-known SSI method [44] to obtain a very rough solution to subproblem (23) for X_{i}. More specifically, we initialize X_{i} to the previous iterate X_{i}^{(k)} and perform a single SSI iteration, as given by (27), to generate the next iterate X_{i}^{(k+1)}.

The stopping criteria used in Algorithm 2 are

\dfrac{1}{d}\sum\limits_{i=1}^{d}\left\|Z^{(k)}(Z^{(k)})^{\top}-X_{i}^{(k)}(X_{i}^{(k)})^{\top}\right\|_{\mathrm{F}}\leq\epsilon_{c} \quad\mbox{and}\quad \left\|D^{(k)}\right\|_{\mathrm{F}}\leq\epsilon_{g}, \qquad (28)

where \epsilon_{c} and \epsilon_{g} are two small positive constants. Unless otherwise specified, \epsilon_{c} and \epsilon_{g} are set to 10^{-6} and 10^{-8}np, respectively. Algorithm 2 is also terminated once the iteration count reaches \mathtt{MaxIter}=50000.

5.2 Synthetic data generation

In our experiments, a synthetic data matrix A\in\mathbb{R}^{n\times m} is constructed in the form of an (economy-size) SVD:

A=U\Sigma V^{\top}, \qquad (29)

where U\in\mathbb{R}^{n\times n} and V\in\mathbb{R}^{m\times n} satisfy U^{\top}U=V^{\top}V=I_{n} and \Sigma\in\mathbb{R}^{n\times n} is nonnegative and diagonal. Specifically, U and V are obtained by orthonormalizing random matrices whose entries are drawn independently and identically from the uniform distribution on [-1,1], and

\Sigma_{ii}=\xi^{1-i},\quad i=1,\dotsc,n,

where the parameter \xi\geq 1 determines the decay rate of the singular values. Finally, we apply the standard PCA pre-processing operations to the data matrix A=U\Sigma V^{\top} by subtracting the sample mean from each sample and then normalizing the rows of the resulting matrix to unit \ell_{2}-norm. For our synthetic data matrices, such pre-processing only slightly perturbs the singular-value decay rate, which before pre-processing is uniformly equal to 1/\xi by construction. Unless specified otherwise, the default value of the decay-rate parameter is \xi=1.1.
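A NumPy sketch of this data generator follows; it is illustrative, and the random-number details are assumptions rather than the authors' exact recipe.

# Synthetic data (29): A = U Sigma V^T with Sigma_ii = xi^(1-i), followed by
# the standard PCA pre-processing (center each feature, normalize rows).
import numpy as np

def make_data(n, m, xi=1.1, seed=0):
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.uniform(-1.0, 1.0, (n, n)))[0]
    V = np.linalg.qr(rng.uniform(-1.0, 1.0, (m, n)))[0]
    sigma = xi ** -np.arange(n)                          # xi^(1-i), i = 1,...,n
    A = (U * sigma) @ V.T                                # n-by-m data matrix
    A = A - A.mean(axis=1, keepdims=True)                # subtract sample mean
    A = A / np.linalg.norm(A, axis=1, keepdims=True)     # unit l2-norm rows
    return A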

In the numerical experiments, all the algorithms are started from the same initial points. Since the optimization problem is non-convex, different solvers may still occasionally return different solutions when starting from a common initial point at random. As suggested in [11], to increase the chance that all solvers find the same solution, we first run the Riemannian subgradient method [8, 18] for 500 iterations and then use the resulting output as the common starting point.

5.3 Comprehensive comparison on synthetic data

In order to do a thorough evaluation on the empirical performance of DSSAL1, we design four groups of test data matrices, generated as in Subsection 5.2. In each group, there is only one parameter varying while all the others are fixed. Specifically, the varying and fixed parameters for the four groups are as follows:

  1. varying sample dimension n=1000+500j for j=0,1,2,3,4, while m=128000, p=10, \mu=0.5, d=128;

  2. varying number of computed loading vectors p=10+5j for j=0,1,2,3,4, while n=1000, m=128000, \mu=0.3, d=128;

  3. varying regularization parameter \mu=0.2+0.2j for j=0,1,2,3,4, while n=1000, m=128000, p=10, d=128;

  4. varying number of cores d=16\times 2^{j} for j=0,1,2,3,4, while n=1000, m=256000, p=10, \mu=0.5.

Additional experimental results for varying \xi will be presented in the next subsection. Numerical results obtained for the above four test groups are presented in Tables 1 to 4, respectively, where we record wall-clock times in seconds and total rounds of communication. The average function values and sparsity levels for the four groups of tests are provided in Table 5. When computing the sparsity of a solution matrix (i.e., the percentage of zero elements), we set a matrix element to zero when its absolute value is less than 10^{-5}.
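For clarity, the sparsity level reported below can be computed as in the following one-line NumPy sketch (the 10^{-5} truncation threshold is the one stated above):

# Sparsity level: fraction of entries of Z whose magnitude is below 1e-5.
import numpy as np

def sparsity_level(Z, tol=1e-5):
    return np.mean(np.abs(Z) < tol)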

From Table 5, we see that all three algorithms have attained comparable solution qualities with similar function values and sparsity levels in the four groups of testing problems.

Table 1: Comparison of DSSAL1, ManPG-Ada, and SOC for different n.
Wall-clock time in seconds Rounds of communication
n DSSAL1 ManPG-Ada SOC DSSAL1 ManPG-Ada SOC
1000 10.97 26.76 72.07 655 1794 7569
1500 6.61 16.12 57.75 223 642 2201
2000 49.22 172.32 725.63 1238 5054 20444
2500 181.97 412.51 2296.26 3753 10971 45707
3000 34.58 153.40 860.76 680 2789 11480
Table 2: Comparison of DSSAL1, ManPG-Ada, and SOC for different p.
Wall-clock time in seconds Rounds of communication
p DSSAL1 ManPG-Ada SOC DSSAL1 ManPG-Ada SOC
10 9.74 26.42 75.25 629 1622 6652
15 29.99 56.31 153.98 1728 3586 15865
20 110.26 239.51 466.22 6144 14107 44086
25 68.79 148.34 334.58 3030 6153 27060
30 110.16 173.11 204.87 5133 6966 14621
Table 3: Comparison of DSSAL1, ManPG-Ada, and SOC for different \mu.
Wall-clock time in seconds Rounds of communication
\mu DSSAL1 ManPG-Ada SOC DSSAL1 ManPG-Ada SOC
0.2 25.05 59.46 430.37 1537 4140 50000
0.4 22.16 46.76 115.36 1393 3061 13680
0.6 11.61 29.08 81.82 838 1899 8278
0.8 10.09 40.07 66.46 733 3369 8121
1.0 9.12 19.80 56.33 655 1348 6396
Table 4: Comparison of DSSAL1, ManPG-Ada, and SOC for different d.
Wall-clock time in seconds Rounds of communication
d DSSAL1 ManPG-Ada SOC DSSAL1 ManPG-Ada SOC
16 161.72 168.36 383.14 897 1169 5175
32 106.95 135.81 312.20 808 1169 5175
64 54.33 68.89 167.27 753 1169 5175
128 24.28 35.54 96.04 683 1169 5175
256 9.17 14.74 47.61 660 1169 5175
Table 5: Average function values and sparsity levels of DSSAL1, ManPG-Ada, and SOC for different tests.
Function value Sparsity level
Test DSSAL1 ManPG-Ada SOC DSSAL1 ManPG-Ada SOC
Varying n -672.26 -672.26 -672.26 16.54% 16.51% 16.50%
Varying p -360.52 -360.51 -360.46 40.42% 40.46% 40.32%
Varying \mu -282.65 -282.65 -282.64 26.28% 26.29% 26.28%
Varying d -302.02 -301.97 -301.97 22.13% 22.21% 22.21%

It should be evident from Tables 1 to 4 that, in all four test groups and in terms of both wall-clock time and rounds of communication, DSSAL1 clearly outperforms ManPG-Ada, which in turn outperforms SOC by large margins. Since the amounts of communication per round for the three algorithms are essentially the same, their total communication overheads are proportional to the rounds of communication required. Because DSSAL1 takes far fewer rounds of communication than the other two algorithms (often by considerable margins), we conclude that DSSAL1 is a more communication-efficient algorithm than the other two. For example, in Table 3 for the case of \mu=0.8, the number of communication rounds taken by DSSAL1 is less than a quarter of that taken by ManPG-Ada and one tenth of that taken by SOC.

5.4 Empirical convergence rate

In this subsection, we examine empirical convergence rates of iterates produced by DSSAL1 and ManPG-Ada for comparison, while SOC is excluded from this experiment given its obvious non-competitiveness in previous experiments.

In the following experiments, we fix n=1000, m=128000, p=5, \mu=0.2, and d=128. Three synthetic matrices A\in\mathbb{R}^{n\times m} are randomly generated by (29), with \xi taking the three values 1.15, 1.1, and 1.05, respectively. On each of them, DSSAL1 and ManPG-Ada return Z^{\ast}_{\mathrm{D}} and Z^{\ast}_{\mathrm{M}}, respectively, computed with smaller-than-usual termination tolerances \epsilon_{c}=10^{-8} and \epsilon_{g}=10^{-10}np in (28). We use the average of the two, Z^{\ast}=(Z^{\ast}_{\mathrm{D}}+Z^{\ast}_{\mathrm{M}})/2, as a “ground truth” solution. Then we rerun the two algorithms on the same A with the termination condition \|Z^{(k)}-Z^{\ast}\|_{\mathrm{F}}\leq 3\times 10^{-4} and record the quantity \|Z^{(k)}-Z^{\ast}\|_{\mathrm{F}} at each iteration.

In Figure 2, we plot the iterate-error sequences \{\|Z^{(k)}-Z^{\ast}\|_{\mathrm{F}}\} for both DSSAL1 and ManPG-Ada and observe that both algorithms appear to converge asymptotically at linear rates. Overall, however, the convergence of DSSAL1 is several times faster than that of ManPG-Ada. We also report, in Table 6, the ratio between the iteration number of ManPG-Ada and that of DSSAL1 for different values of \xi. In general, the closer \xi is to one, the slower the singular values of A decay and the more difficult the problem tends to be. Table 6 demonstrates that, in our tests, the advantage of DSSAL1 becomes increasingly pronounced as the test instances become more difficult to solve.
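Here, an asymptotically linear rate means that the iterate errors eventually contract by a constant factor per iteration, that is,

\|Z^{(k+1)}-Z^{\ast}\|_{\mathrm{F}}\leq\varrho\,\|Z^{(k)}-Z^{\ast}\|_{\mathrm{F}}\quad\text{for all sufficiently large }k,

for some \varrho\in(0,1), so that the error curves in Figure 2 decay geometrically in k.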

Figure 2: Comparison between DSSAL1 and ManPG-Ada of empirical convergence rates: (a) \xi=1.15, (b) \xi=1.1, (c) \xi=1.05.
Table 6: The iteration number ratio between ManPG-Ada and DSSAL1 for different values of ξ\xi.
ξ=1.15\xi=1.15 ξ=1.1\xi=1.1 ξ=1.05\xi=1.05
ItManPGAdaItDSSAL1\dfrac{\mathrm{It}_{\mathrm{ManPG-Ada}}}{\mathrm{It}_{\mathrm{DSSAL1}}} 12022964.06\dfrac{1202}{296}\approx 4.06 14173354.23\dfrac{1417}{335}\approx 4.23 20143385.96\dfrac{2014}{338}\approx 5.96

6 Conclusion

In this paper, we propose a distributed algorithm, called DSSAL1, for solving the 1\ell_{1}-regularized optimization model (2) for sparse PCA computation. DSSAL1 has the following features.

  1. 1.

    The algorithm successfully extends the subspace-splitting strategy from orthogonal-transformation-invariant objective functions of the form f=\sum f_{i} to sums f+r, where r can be non-invariant and non-smooth.

  2. 2.

    The algorithm has a built-in mechanism for local-data masking, and hence naturally protects local-data privacy without requiring a separate privacy-preservation process.

  3. 3.

    The algorithm is storage-efficient in that, besides its local data, each agent only needs to store n\times p matrices (usually p\ll n), thanks to a low-rank multiplier formula.

  4. 4.

    The algorithm is guaranteed to converge globally to stationary points, with a worst-case complexity bound, under mild conditions, in spite of the nonlinear equality constraints imposed for subspace consensus.

Comprehensive numerical simulations are conducted in a distributed environment to evaluate the performance of our algorithm in comparison with two state-of-the-art approaches. Remarkably, the number of communication rounds required by DSSAL1 is often more than one order of magnitude smaller than that of existing methods. These results indicate that DSSAL1 has great potential for solving large-scale application problems in distributed environments where data privacy is a primary concern.

Finally, we mention two related topics worthy of future study. One is the possibility of developing asynchronous approaches for sparse PCA to address load-balancing issues in distributed environments. The other is to extend the subspace-splitting strategy to decentralized networks so that a wider range of applications can benefit from the effectiveness of this approach.


Appendix A Proof of Lemma 2.2

Proof of Lemma 2.2.
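Recall that, with respect to the Euclidean (Frobenius) metric, the projection onto the tangent space \mathcal{T}_{Z}\mathcal{S}_{n,p} admits the explicit form (see, e.g., [3])

\mathrm{Proj}_{\mathcal{T}_{Z}\mathcal{S}_{n,p}}\left(Y\right)={\mathbf{P}}^{\perp}_{Z}Y+\dfrac{1}{2}Z\left(Z^{\top}Y-Y^{\top}Z\right),\qquad Y\in\mathbb{R}^{n\times p},

whose two terms are orthogonal in the Frobenius inner product since Z^{\top}{\mathbf{P}}^{\perp}_{Z}=0; this yields the norm decomposition used in the display below.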

According to the definition of Proj𝒯Z𝒮n,p()\mathrm{Proj}_{\mathcal{T}_{Z}\mathcal{S}_{n,p}}\left(\cdot\right), it follows that

Proj𝒯Z𝒮n,p(AAZ+R(Z))F2\displaystyle\left\|\mathrm{Proj}_{\mathcal{T}_{Z}\mathcal{S}_{n,p}}\left(-AA^{\top}Z+R(Z)\right)\right\|^{2}_{\mathrm{F}}
=\displaystyle={} 14Z(AAZ+R(Z))(AAZ+R(Z))ZF2\displaystyle\dfrac{1}{4}\left\|Z^{\top}\left(-AA^{\top}Z+R(Z)\right)-\left(-AA^{\top}Z+R(Z)\right)^{\top}Z\right\|^{2}_{\mathrm{F}}
+𝐏Z(AAZ+R(Z))F2\displaystyle+\left\|{\mathbf{P}}^{\perp}_{Z}\left(-AA^{\top}Z+R(Z)\right)\right\|^{2}_{\mathrm{F}}
=\displaystyle={} 𝐏Z(AAZ+R(Z))F2+14ZR(Z)R(Z)ZF2,\displaystyle\left\|{\mathbf{P}}^{\perp}_{Z}\left(-AA^{\top}Z+R(Z)\right)\right\|^{2}_{\mathrm{F}}+\dfrac{1}{4}\left\|Z^{\top}R(Z)-R(Z)^{\top}Z\right\|^{2}_{\mathrm{F}},

where R(Z)r(Z)R(Z)\in\partial r(Z). The proof is completed. ∎

Appendix B Proof of Proposition 2.4

Proof of Proposition 2.4.

To begin with, we assume that (Z,{Xi})\left(Z,\{X_{i}\}\right) is a first-order stationary point. Then there exists R(Z)r(Z)R(Z)\in\partial r(Z) such that

𝐏Z(AAZ+R(Z))=0,{\mathbf{P}}^{\perp}_{Z}\left(-AA^{\top}Z+R(Z)\right)=0,

and ZR(Z)Z^{\top}R(Z) is symmetric. Let Θ=ZR(Z)Zr(Z)\Theta=Z^{\top}R(Z)\in Z^{\top}\partial r(Z), Γi=XiAiAiXi\Gamma_{i}=-X_{i}^{\top}A_{i}A_{i}^{\top}X_{i}, and

Λi=𝐏XiAiAiXiXiXiXiAiAi𝐏Xi\Lambda_{i}=-{\mathbf{P}}^{\perp}_{X_{i}}A_{i}A_{i}^{\top}X_{i}X_{i}^{\top}-X_{i}X_{i}^{\top}A_{i}A_{i}^{\top}{\mathbf{P}}^{\perp}_{X_{i}}

with i=1,,di=1,\dotsc,d. Then the matrices Θ\Theta, Γi\Gamma_{i} and Λi\Lambda_{i} are symmetric and rank(Λi)2p\mbox{\rm rank}\left(\Lambda_{i}\right)\leq 2p. Moreover, we can deduce that

AiAiXi+XiΓi+ΛiXi=AiAiXiXiXiAiAiXi𝐏XiAiAiXi=0,A_{i}A_{i}^{\top}X_{i}+X_{i}\Gamma_{i}+\Lambda_{i}X_{i}=A_{i}A_{i}^{\top}X_{i}-X_{i}X_{i}^{\top}A_{i}A_{i}^{\top}X_{i}-{\mathbf{P}}^{\perp}_{X_{i}}A_{i}A_{i}^{\top}X_{i}=0,

and

R(Z)+i=1dΛiZZΘ=\displaystyle R(Z)+\sum\limits_{i=1}^{d}\Lambda_{i}Z-Z\Theta={} R(Z)i=1d𝐏ZAiAiZZZR(Z)\displaystyle R(Z)-\sum\limits_{i=1}^{d}{\mathbf{P}}^{\perp}_{Z}A_{i}A_{i}^{\top}Z-ZZ^{\top}R(Z)
=\displaystyle={} 𝐏Z(AAZ+R(Z))=0.\displaystyle{\mathbf{P}}^{\perp}_{Z}\left(-AA^{\top}Z+R(Z)\right)=0.

Hence, (Z,{Xi})\left(Z,\{X_{i}\}\right) satisfies the conditions in (9) under these specific choices of Θ\Theta, Γi\Gamma_{i} and Λi\Lambda_{i}.

Conversely, we now assume that there exist R(Z)r(Z)R(Z)\in\partial r(Z) and symmetric matrices Θ\Theta, Γi\Gamma_{i} and Λi\Lambda_{i} such that (Z,{Xi})\left(Z,\{X_{i}\}\right) satisfies the conditions in (9). It follows from the first and second equality in (9) that

i=1d𝐏XiAiAiXiXi=i=1d𝐏Xi(XiΓi+ΛiXi)Xi=i=1d𝐏XiΛiXiXi\displaystyle\sum\limits_{i=1}^{d}{\mathbf{P}}^{\perp}_{X_{i}}A_{i}A_{i}^{\top}X_{i}X_{i}^{\top}=-\sum\limits_{i=1}^{d}{\mathbf{P}}^{\perp}_{X_{i}}\left(X_{i}\Gamma_{i}+\Lambda_{i}X_{i}\right)X_{i}^{\top}=-\sum\limits_{i=1}^{d}{\mathbf{P}}^{\perp}_{X_{i}}\Lambda_{i}X_{i}X_{i}^{\top}
=𝐏Z(i=1dΛiZ)Z=𝐏Z(R(Z)ZΘ)Z=𝐏ZR(Z)Z.\displaystyle=-{\mathbf{P}}^{\perp}_{Z}\left(\sum\limits_{i=1}^{d}\Lambda_{i}Z\right)Z^{\top}={\mathbf{P}}^{\perp}_{Z}\left(R(Z)-Z\Theta\right)Z^{\top}={\mathbf{P}}^{\perp}_{Z}R(Z)Z^{\top}.

At the same time, since XiXi=ZZX_{i}X_{i}^{\top}=ZZ^{\top}, we have

i=1d𝐏XiAiAiXiXi=\displaystyle\sum\limits_{i=1}^{d}{\mathbf{P}}^{\perp}_{X_{i}}A_{i}A_{i}^{\top}X_{i}X_{i}^{\top}={} i=1d𝐏ZAiAiZZ=𝐏ZAAZZ.\displaystyle\sum\limits_{i=1}^{d}{\mathbf{P}}^{\perp}_{Z}A_{i}A_{i}^{\top}ZZ^{\top}={\mathbf{P}}^{\perp}_{Z}AA^{\top}ZZ^{\top}.

Combining the above two equalities and orthogonality of ZZ, we arrive at

𝐏Z(AAZ+R(Z))=0.{\mathbf{P}}^{\perp}_{Z}\left(-AA^{\top}Z+R(Z)\right)=0.

Left-multiplying both sides of the second equality in (9) by ZZ^{\top}, we obtain that

ZR(Z)=Θi=1dZΛiZ,Z^{\top}R(Z)=\Theta-\sum\limits_{i=1}^{d}Z^{\top}\Lambda_{i}Z,

which together with the symmetry of Λi\Lambda_{i} and Θ\Theta implies that ZR(Z)Z^{\top}R(Z) is also symmetric. This completes the proof. ∎

Appendix C Proof of Lemma 4.1

Proof of Lemma 4.1.

Since (Z(k),{Xi(k)})(Z^{(k)},\{X_{i}^{(k)}\}) is feasible, we know Xi(k)(Xi(k))=Z(k)(Z(k))X_{i}^{(k)}(X_{i}^{(k)})^{\top}=Z^{(k)}(Z^{(k)})^{\top} for i=1,,di=1,\dotsc,d. Thus, it can be readily verified that

Q(k)Z(k)=\displaystyle Q^{(k)}Z^{(k)}={} i=1d(Λi(k)βiXi(k)(Xi(k)))Z(k)\displaystyle\sum\limits_{i=1}^{d}\left(\Lambda_{i}^{(k)}-\beta_{i}X_{i}^{(k)}(X_{i}^{(k)})^{\top}\right)Z^{(k)}
=\displaystyle={} i=1d(𝐏Z(k)AiAiZ(k)(Z(k))βiZ(k)(Z(k)))Z(k)\displaystyle\sum\limits_{i=1}^{d}\left(-{\mathbf{P}}^{\perp}_{Z^{(k)}}A_{i}A_{i}^{\top}Z^{(k)}(Z^{(k)})^{\top}-\beta_{i}Z^{(k)}(Z^{(k)})^{\top}\right)Z^{(k)}
\displaystyle={}\ -{\mathbf{P}}^{\perp}_{Z^{(k)}}AA^{\top}Z^{(k)}-\left(\sum\limits_{i=1}^{d}\beta_{i}\right)Z^{(k)},

which implies that

\mathrm{Proj}_{\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\left(Q^{(k)}Z^{(k)}\right)=\mathrm{Proj}_{\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\left(-AA^{\top}Z^{(k)}\right).
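Here we have also used the fact that the tangent-space projection annihilates any term of the form Z^{(k)}S with S symmetric; indeed,

\mathrm{Proj}_{\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\left(Z^{(k)}S\right)={\mathbf{P}}^{\perp}_{Z^{(k)}}Z^{(k)}S+\dfrac{1}{2}Z^{(k)}\left(S-S^{\top}\right)=0,

applied with S=-\left(\sum_{i=1}^{d}\beta_{i}\right)I_{p}.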

According to Theorem 4.1 in [58], the first-order optimality condition of (18) can be stated as:

0Proj𝒯Z(k)𝒮n,p(Q(k)Z(k)+1ηD(k)+r(Z(k)+D(k))).0\in\mathrm{Proj}_{\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\left(Q^{(k)}Z^{(k)}+\dfrac{1}{\eta}D^{(k)}+\partial r(Z^{(k)}+D^{(k)})\right).

Since D(k)=0D^{(k)}=0 is the global minimizer of (18), we have

0\displaystyle 0\in{} Proj𝒯Z(k)𝒮n,p(Q(k)Z(k)+r(Z(k)))\displaystyle\mathrm{Proj}_{\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\left(Q^{(k)}Z^{(k)}+\partial r(Z^{(k)})\right)
\displaystyle={}\ \mathrm{Proj}_{\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\left(-AA^{\top}Z^{(k)}+\partial r(Z^{(k)})\right).

We obtain the assertion of this lemma. ∎

Appendix D Convergence of Algorithm 2

Now we prove Theorem 4.3 to establish the global convergence of Algorithm 2. In addition to the notation introduced in Section 1, we adopt the following notation throughout the theoretical analysis. The symbols \mbox{\rm rank}\left(C\right) and \sigma_{\min}\left(C\right) denote the rank and the smallest singular value of C, respectively. For X,Y\in\mathcal{S}_{n,p}, we define {\mathbf{D_{p}}}\left(X,Y\right):=XX^{\top}-YY^{\top} and {\mathbf{d_{p}}}\left(X,Y\right):=\left\|{\mathbf{D_{p}}}\left(X,Y\right)\right\|_{\mathrm{F}}, which stand for the projection distance matrix and the projection distance, respectively.
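For later use, we record the following identity: for X,Y\in\mathcal{S}_{n,p},

{\mathbf{d}^{2}_{\mathbf{p}}}\left(X,Y\right)=\left\|XX^{\top}-YY^{\top}\right\|^{2}_{\mathrm{F}}=2p-2\left\|X^{\top}Y\right\|^{2}_{\mathrm{F}}=2\sum_{j=1}^{p}\left(1-\sigma_{j}^{2}(X^{\top}Y)\right),

which follows from \mathrm{tr}(XX^{\top})=\mathrm{tr}(YY^{\top})=p and \mathrm{tr}(XX^{\top}YY^{\top})=\left\|X^{\top}Y\right\|^{2}_{\mathrm{F}}, where \sigma_{j}(X^{\top}Y) denotes the j-th singular value of X^{\top}Y.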

To begin with, we provide a sketch of our proof. Suppose {Z(k)}\{Z^{(k)}\} is the iteration sequence generated by Algorithm 2, with Xi(k)X_{i}^{(k)} and Λi(k)\Lambda_{i}^{(k)} being the local variable and multiplier of the ii-th agent at the kk-th iteration, respectively. The proof includes the following main steps.

  1. 1.

    The sequence {Z(k)}\{Z^{(k)}\} is bounded and the sequence {(Z(k),{Xi(k)},{Λi(k)})}\{\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\} is bounded from below.

  2. 2.

    The sequence \{Z^{(k)}\} satisfies {\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\leq 2(1-\underline{\sigma}^{2}), where \underline{\sigma}\in(0,1) is a uniform lower bound on the smallest singular values of the matrices (X_{i}^{(k)})^{\top}Z^{(k+1)}, i=1,\dotsc,d.

  3. 3.

    The sequence {(Z(k),{Xi(k)},{Λi(k)})}\{\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\} is monotonically non-increasing, and hence is convergent.

  4. 4.

    The sequence {Z(k)}\{Z^{(k)}\} has at least one accumulation point, and any accumulation point is a first-order stationary point of the sparse PCA problem (2).

Next we verify all the items in the above sketch by proving the following lemmas and corollaries.

Lemma D.1.

Suppose {Z(k)}\{Z^{(k)}\} is the iterate sequence generated by Algorithm 2. Let

g(k)(D)=Q(k)Z(k),D+12ηDF2+r(Z(k)+D).g^{(k)}(D)=\left\langle Q^{(k)}Z^{(k)},D\right\rangle+\dfrac{1}{2\eta}\left\|D\right\|^{2}_{\mathrm{F}}+r(Z^{(k)}+D).

Then the following relationship holds for any kk\in\mathbb{N},

g(k)(0)g(k)(D(k))12ηD(k)F2.g^{(k)}(0)-g^{(k)}(D^{(k)})\geq\dfrac{1}{2\eta}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}.
Proof.

Since g(k)g^{(k)} is strongly convex with modulus 1η\dfrac{1}{\eta}, we have

g(k)(D^)g(k)(D)+g(k)(D),D^D+12ηD^DF2,g^{(k)}(\hat{D})\geq g^{(k)}(D)+\left\langle\partial g^{(k)}(D),\hat{D}-D\right\rangle+\dfrac{1}{2\eta}\left\|\hat{D}-D\right\|^{2}_{\mathrm{F}}, (30)

for any D,D^n×pD,\hat{D}\in\mathbb{R}^{n\times p}. In particular, if D^,D𝒯Z(k)𝒮n,p\hat{D},D\in\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}, it holds that

g(k)(D),D^D=Proj𝒯Z(k)𝒮n,p(g(k)(D)),D^D.\left\langle\partial g^{(k)}(D),\hat{D}-D\right\rangle=\left\langle\mathrm{Proj}_{\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\left(\partial g^{(k)}(D)\right),\hat{D}-D\right\rangle.

It follows from the first-order optimality condition of (18) that 0Proj𝒯Z(k)𝒮n,p(g(k)(D(k)))0\in\mathrm{Proj}_{\mathcal{T}_{Z^{(k)}}\mathcal{S}_{n,p}}\left(\partial g^{(k)}(D^{(k)})\right). Finally, taking D^=0\hat{D}=0 and D=D(k)D=D^{(k)} in (30) yields the assertion of this lemma. ∎

Lemma D.2.

Suppose Z𝒮n,pZ\in\mathcal{S}_{n,p} and D𝒯Z𝒮n,pD\in\mathcal{T}_{Z}\mathcal{S}_{n,p}. Then it holds that

Proj𝒮n,p(Z+D)ZFDF,\left\|\mathrm{Proj}_{\mathcal{S}_{n,p}}\left(Z+D\right)-Z\right\|_{\mathrm{F}}\leq\left\|D\right\|_{\mathrm{F}},

and

Proj𝒮n,p(Z+D)ZDF12DF2.\left\|\mathrm{Proj}_{\mathcal{S}_{n,p}}\left(Z+D\right)-Z-D\right\|_{\mathrm{F}}\leq\dfrac{1}{2}\left\|D\right\|^{2}_{\mathrm{F}}.
Proof.

The proof can be found in, for example, [33]. For the sake of completeness, we provide a proof here. It follows from the orthogonality of ZZ and the skew-symmetry of ZDZ^{\top}D that Z+DZ+D has full column rank. This yields that Proj𝒮n,p(Z+D)=(Z+D)F1\mathrm{Proj}_{\mathcal{S}_{n,p}}(Z+D)=(Z+D)F^{-1}, where F=(Ip+DD)1/2F=(I_{p}+D^{\top}D)^{1/2}. Since Proj𝒮n,p(Z+D)Z=(Z(IpF)+D)F1\mathrm{Proj}_{\mathcal{S}_{n,p}}(Z+D)-Z=(Z(I_{p}-F)+D)F^{-1}, we have

\displaystyle\left\|\mathrm{Proj}_{\mathcal{S}_{n,p}}\left(Z+D\right)-Z\right\|^{2}_{\mathrm{F}}={}\ 2\mathrm{tr}\left(I_{p}-F^{-1}\right)-2\mathrm{tr}\left(F^{-1}Z^{\top}D\right)=2\mathrm{tr}\left(I_{p}-F^{-1}\right)
\displaystyle={}\ 2\sum\limits_{j=1}^{p}\left(1-\left(1+\tilde{\sigma}_{j}^{2}\right)^{-1/2}\right)\leq\sum\limits_{j=1}^{p}\tilde{\sigma}_{j}^{2}=\left\|D\right\|^{2}_{\mathrm{F}},

where \tilde{\sigma}_{1}\geq\dotsb\geq\tilde{\sigma}_{p}\geq 0 are the singular values of D. Similarly, it follows from the relationship \mathrm{Proj}_{\mathcal{S}_{n,p}}(Z+D)-Z-D=(Z+D)(F^{-1}-I_{p}) that

\displaystyle\left\|\mathrm{Proj}_{\mathcal{S}_{n,p}}\left(Z+D\right)-Z-D\right\|^{2}_{\mathrm{F}}={}\ \mathrm{tr}\left(\left(I_{p}-F\right)^{2}\right)=\sum\limits_{j=1}^{p}\left(1-\left(1+\tilde{\sigma}_{j}^{2}\right)^{1/2}\right)^{2}
\displaystyle\leq{}\ \dfrac{1}{4}\sum\limits_{j=1}^{p}\tilde{\sigma}_{j}^{4}\leq\dfrac{1}{4}\left\|D\right\|_{\mathrm{F}}^{4},

and taking square roots yields the second bound,

which completes the proof. ∎
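The two elementary scalar estimates invoked in the displays above (with t=\tilde{\sigma}_{j}^{2}) can be verified directly: for any t\geq 0,

1-\left(1+t\right)^{-1/2}=\dfrac{t}{\sqrt{1+t}\left(\sqrt{1+t}+1\right)}\leq\dfrac{t}{2}\quad\text{and}\quad\left(\sqrt{1+t}-1\right)^{2}=\dfrac{t^{2}}{\left(\sqrt{1+t}+1\right)^{2}}\leq\dfrac{t^{2}}{4}.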

Corollary D.3.

Suppose {Z(k)}\{Z^{(k)}\} is the iterate sequence generated by Algorithm 2 with the parameters satisfying Condition 1. Then for any kk\in\mathbb{N}, it holds that

(Z(k),{Xi(k)},{Λi(k)})(Z(k+1),{Xi(k)},{Λi(k)})M¯D(k)F2,\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})-\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\geq\bar{M}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}, (31)

where M¯>0\bar{M}>0 is a constant defined in Section 4.

Proof.

Firstly, it can be readily verified that

Q(k)Fi=1dQi(k)Fi=1d(2AiF2+pβi).\left\|Q^{(k)}\right\|_{\mathrm{F}}\leq\sum\limits_{i=1}^{d}\left\|Q_{i}^{(k)}\right\|_{\mathrm{F}}\leq\sum\limits_{i=1}^{d}\left(2\left\|A_{i}\right\|^{2}_{\mathrm{F}}+\sqrt{p}\beta_{i}\right).
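The second inequality above can be seen as follows: recalling that Q_{i}^{(k)}=\Lambda_{i}^{(k)}-\beta_{i}X_{i}^{(k)}(X_{i}^{(k)})^{\top} (cf. the expression for Q^{(k)}Z^{(k)} in the proof of Lemma 4.1), and using the low-rank multiplier formula, under which \left\|\Lambda_{i}^{(k)}\right\|_{\mathrm{F}}\leq 2\left\|A_{i}\right\|^{2}_{\mathrm{F}}, together with \left\|X_{i}^{(k)}(X_{i}^{(k)})^{\top}\right\|_{\mathrm{F}}=\sqrt{p}, we have

\left\|Q_{i}^{(k)}\right\|_{\mathrm{F}}\leq\left\|\Lambda_{i}^{(k)}\right\|_{\mathrm{F}}+\beta_{i}\left\|X_{i}^{(k)}(X_{i}^{(k)})^{\top}\right\|_{\mathrm{F}}\leq 2\left\|A_{i}\right\|^{2}_{\mathrm{F}}+\sqrt{p}\,\beta_{i}.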

Let \bar{q}^{(k)}(Z)=\mathrm{tr}(Z^{\top}Q^{(k)}Z)/2 be the smooth part of the objective function q^{(k)}(Z) in (16). Since \nabla\bar{q}^{(k)} is Lipschitz continuous with Lipschitz constant \left\|Q^{(k)}\right\|_{\mathrm{F}}, we have

q¯(k)(Z(k+1))q¯(k)(Z(k))\displaystyle\bar{q}^{(k)}(Z^{(k+1)})-\bar{q}^{(k)}(Z^{(k)})\leq{} Q(k)Z(k),Z(k+1)Z(k)\displaystyle\left\langle Q^{(k)}Z^{(k)},Z^{(k+1)}-Z^{(k)}\right\rangle
+12Q(k)FZ(k+1)Z(k)F2.\displaystyle+\dfrac{1}{2}\left\|Q^{(k)}\right\|_{\mathrm{F}}\left\|Z^{(k+1)}-Z^{(k)}\right\|^{2}_{\mathrm{F}}.

It follows from Lemma D.2 that

Q(k)Z(k),Z(k+1)Z(k)D(k)\displaystyle\left\langle Q^{(k)}Z^{(k)},Z^{(k+1)}-Z^{(k)}-D^{(k)}\right\rangle\leq{} Q(k)Z(k)FZ(k+1)Z(k)D(k)F\displaystyle\left\|Q^{(k)}Z^{(k)}\right\|_{\mathrm{F}}\left\|Z^{(k+1)}-Z^{(k)}-D^{(k)}\right\|_{\mathrm{F}}
\displaystyle\leq{} i=1d(AiF2+p2βi)D(k)F2,\displaystyle\sum\limits_{i=1}^{d}\left(\left\|A_{i}\right\|^{2}_{\mathrm{F}}+\dfrac{\sqrt{p}}{2}\beta_{i}\right)\left\|D^{(k)}\right\|^{2}_{\mathrm{F}},

and

12Q(k)FZ(k+1)Z(k)F2i=1d(AiF2+p2βi)DkF2.\displaystyle\dfrac{1}{2}\left\|Q^{(k)}\right\|_{\mathrm{F}}\left\|Z^{(k+1)}-Z^{(k)}\right\|^{2}_{\mathrm{F}}\leq\sum\limits_{i=1}^{d}\left(\left\|A_{i}\right\|^{2}_{\mathrm{F}}+\dfrac{\sqrt{p}}{2}\beta_{i}\right)\left\|D^{k}\right\|^{2}_{\mathrm{F}}.

Combining the above three inequalities, we obtain that

q¯(k)(Z(k+1))q¯(k)(Z(k))Q(k)Z(k),D(k)+i=1d(2AiF2+βip)D(k)F2.\displaystyle\bar{q}^{(k)}(Z^{(k+1)})-\bar{q}^{(k)}(Z^{(k)})\leq\left\langle Q^{(k)}Z^{(k)},D^{(k)}\right\rangle+\sum\limits_{i=1}^{d}\left(2\left\|A_{i}\right\|^{2}_{\mathrm{F}}+\beta_{i}\sqrt{p}\right)\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}.

It follows from Lemma D.1 that

Q(k)Z(k),D(k)+r(Z(k)+D(k))r(Z(k))\displaystyle\left\langle Q^{(k)}Z^{(k)},D^{(k)}\right\rangle+r(Z^{(k)}+D^{(k)})-r(Z^{(k)})
=g(k)(D(k))g(k)(0)12ηD(k)F21ηD(k)F2,\displaystyle=g^{(k)}(D^{(k)})-g^{(k)}(0)-\dfrac{1}{2\eta}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}\leq-\dfrac{1}{\eta}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}},

which implies that

q¯(k)(Z(k+1))q¯(k)(Z(k))+r(Z(k)+D(k))r(Z(k))\displaystyle\bar{q}^{(k)}(Z^{(k+1)})-\bar{q}^{(k)}(Z^{(k)})+r(Z^{(k)}+D^{(k)})-r(Z^{(k)})
i=1d(2AiF2+βip)D(k)F21ηD(k)F2.\displaystyle\leq\sum\limits_{i=1}^{d}\left(2\left\|A_{i}\right\|^{2}_{\mathrm{F}}+\beta_{i}\sqrt{p}\right)\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}-\dfrac{1}{\eta}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}.

This together with the Lipschitz continuity of r(Z)r(Z) yields that

q(k)(Z(k+1))q(k)(Z(k))\displaystyle q^{(k)}(Z^{(k+1)})-q^{(k)}(Z^{(k)})
=q¯(k)(Z(k+1))q¯(k)(Z(k))+r(Z(k+1))r(Z(k))\displaystyle=\bar{q}^{(k)}(Z^{(k+1)})-\bar{q}^{(k)}(Z^{(k)})+r(Z^{(k+1)})-r(Z^{(k)})
=q¯(k)(Z(k+1))q¯(k)(Z(k))+r(Z(k)+D(k))r(Z(k))\displaystyle=\bar{q}^{(k)}(Z^{(k+1)})-\bar{q}^{(k)}(Z^{(k)})+r(Z^{(k)}+D^{(k)})-r(Z^{(k)})
+r(Z(k+1))r(Z(k)+D(k))\displaystyle\quad+r(Z^{(k+1)})-r(Z^{(k)}+D^{(k)})
q¯(k)(Z(k+1))q¯(k)(Z(k))+r(Z(k)+D(k))r(Z(k))\displaystyle\leq\bar{q}^{(k)}(Z^{(k+1)})-\bar{q}^{(k)}(Z^{(k)})+r(Z^{(k)}+D^{(k)})-r(Z^{(k)})
+μnpZ(k+1)Z(k)D(k)F\displaystyle\quad+\mu\sqrt{np}\left\|Z^{(k+1)}-Z^{(k)}-D^{(k)}\right\|_{\mathrm{F}}
i=1d(2AiF2+βip)D(k)F21ηD(k)F2+μ2npD(k)F2\displaystyle\leq\sum\limits_{i=1}^{d}\left(2\left\|A_{i}\right\|^{2}_{\mathrm{F}}+\beta_{i}\sqrt{p}\right)\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}-\dfrac{1}{\eta}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}+\dfrac{\mu}{2}\sqrt{np}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}
=(M¯1η)D(k)F2.\displaystyle=\left(\bar{M}-\dfrac{1}{\eta}\right)\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}.

Here, M¯>0\bar{M}>0 is a constant defined in Section 4. According to Condition 1, we know that M¯1/ηM¯\bar{M}-1/\eta\leq-\bar{M}. Hence, we finally arrive at

(Z(k),{Xi(k)},{Λi(k)})(Z(k+1),{Xi(k)},{Λi(k)})=\displaystyle\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})-\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})={} q(k)(Z(k))q(k)(Z(k+1))\displaystyle q^{(k)}(Z^{(k)})-q^{(k)}(Z^{(k+1)})
\displaystyle\geq{} M¯D(k)F2.\displaystyle\bar{M}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}.

This completes the proof. ∎

Lemma D.4.

Suppose {Z(k)}\{Z^{(k)}\} is the iterate sequence generated by Algorithm 2 with the parameters satisfying Condition 1. Then for any kk\in\mathbb{N}, it can be verified that

𝐝𝐩2(Z(k+1),Xi(k))ρj=1d𝐝𝐩2(Z(k),Xj(k))+8βi(pAF2+μnp),{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\leq\rho\sum\limits_{j=1}^{d}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k)},X_{j}^{(k)}\right)+\dfrac{8}{\beta_{i}}\left(\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}}+\mu np\right), (32)

where ρ1\rho\geq 1 is a constant defined in Section 4.

Proof.

The inequality (31) directly results in the following relationship.

q(k)(Z(k))q(k)(Z(k+1))0.{q}^{(k)}(Z^{(k)})-{q}^{(k)}(Z^{(k+1)})\geq 0.

According to the definition of q(k)q^{(k)}, it follows that

0\displaystyle 0\leq{} 12tr((Z(k))Q(k)Z(k))12tr((Z(k+1))Q(k)Z(k+1))+r(Z(k))r(Z(k+1))\displaystyle\dfrac{1}{2}\mathrm{tr}\left((Z^{(k)})^{\top}Q^{(k)}Z^{(k)}\right)-\dfrac{1}{2}\mathrm{tr}\left((Z^{(k+1)})^{\top}Q^{(k)}Z^{(k+1)}\right)+r(Z^{(k)})-r(Z^{(k+1)})
\displaystyle\leq{} 12j=1dtr((βjXj(k)(Xj(k))Λj(k))𝐃𝐩(Z(k+1),Z(k)))+2μnp.\displaystyle\dfrac{1}{2}\sum\limits_{j=1}^{d}\mathrm{tr}\left(\left(\beta_{j}X_{j}^{(k)}(X_{j}^{(k)})^{\top}-\Lambda_{j}^{(k)}\right){\mathbf{D_{p}}}\left(Z^{(k+1)},Z^{(k)}\right)\right)+2\mu np.

By straightforward calculations, we can deduce that

j=1dtr(Λj(k)𝐃𝐩(Z(k),Z(k+1)))\displaystyle\sum\limits_{j=1}^{d}\mathrm{tr}\left(\Lambda_{j}^{(k)}{\mathbf{D_{p}}}\left(Z^{(k)},Z^{(k+1)}\right)\right)\leq{} j=1dΛj(k)F𝐝𝐩(Z(k+1),Z(k))\displaystyle\sum\limits_{j=1}^{d}\left\|\Lambda_{j}^{(k)}\right\|_{\mathrm{F}}{\mathbf{d_{p}}}\left(Z^{(k+1)},Z^{(k)}\right)
\displaystyle\leq{} 4pj=1dAjF2=4pAF2,\displaystyle 4\sqrt{p}\sum\limits_{j=1}^{d}\left\|A_{j}\right\|^{2}_{\mathrm{F}}=4\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}},

and

j=1dβjtr(Xj(k)(Xj(k))𝐃𝐩(Z(k+1),Z(k)))\displaystyle\sum\limits_{j=1}^{d}\beta_{j}\mathrm{tr}\left(X_{j}^{(k)}(X_{j}^{(k)})^{\top}{\mathbf{D_{p}}}\left(Z^{(k+1)},Z^{(k)}\right)\right)
=12j=1dβj𝐝𝐩2(Z(k),Xj(k))12j=1dβj𝐝𝐩2(Z(k+1),Xj(k)).\displaystyle=\dfrac{1}{2}\sum\limits_{j=1}^{d}\beta_{j}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k)},X_{j}^{(k)}\right)-\dfrac{1}{2}\sum\limits_{j=1}^{d}\beta_{j}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{j}^{(k)}\right).

The above three inequalities yield that

j=1dβj𝐝𝐩2(Z(k+1),Xj(k))j=1dβj𝐝𝐩2(Z(k),Xj(k))+8pAF2+8μnp,\sum\limits_{j=1}^{d}\beta_{j}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{j}^{(k)}\right)\leq\sum\limits_{j=1}^{d}\beta_{j}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k)},X_{j}^{(k)}\right)+8\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}}+8\mu np,

which further implies that

𝐝𝐩2(Z(k+1),Xi(k))\displaystyle{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\leq{} 1βij=1dβj𝐝𝐩2(Z(k+1),Xj(k))\displaystyle\dfrac{1}{\beta_{i}}\sum\limits_{j=1}^{d}\beta_{j}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{j}^{(k)}\right)
\displaystyle\leq{} ρj=1d𝐝𝐩2(Z(k),Xj(k))+8βi(pAF2+μnp).\displaystyle\rho\sum\limits_{j=1}^{d}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k)},X_{j}^{(k)}\right)+\dfrac{8}{\beta_{i}}\left(\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}}+\mu np\right).

This completes the proof. ∎

Lemma D.5.

Suppose Z(k+1)Z^{(k+1)} is the (k+1)(k+1)-th iterate generated by Algorithm 2 and satisfies the following condition:

𝐝𝐩2(Z(k+1),Xi(k))2(1σ¯2),{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\leq 2\left(1-\underline{\sigma}^{2}\right),

where σ¯(0,1)\underline{\sigma}\in(0,1) is a constant defined in Condition 1. Let the algorithm parameters satisfy Conditions 1 and 2. Then for any i=1,,di=1,\dotsc,d, it holds that

hi(k)(Xi(k))hi(k)(Xi(k+1))14σ¯2ciβi𝐝𝐩2(Z(k+1),Xi(k)),h_{i}^{(k)}(X_{i}^{(k)})-h_{i}^{(k)}(X_{i}^{(k+1)})\geq\frac{1}{4}\underline{\sigma}^{2}c_{i}\beta_{i}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right), (33)

and

𝐝𝐩2(Z(k+1),Xi(k+1))(1ciσ¯2)𝐝𝐩2(Z(k+1),Xi(k))+12βipAiF2.{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right)\leq\left(1-c_{i}\underline{\sigma}^{2}\right){\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)+\dfrac{12}{\beta_{i}}\sqrt{p}\left\|A_{i}\right\|^{2}_{\mathrm{F}}. (34)
Proof.

It follows from Condition 2 that βi>ciAi22\beta_{i}>c_{i}^{\prime}\left\|A_{i}\right\|_{2}^{2}, which together with (25) yields that

hi(k)(Xi(k))hi(k)(Xi(k+1))ci2βi𝐏Xi(k)Hi(k)Xi(k)F2,h_{i}^{(k)}(X_{i}^{(k)})-h_{i}^{(k)}(X_{i}^{(k+1)})\geq\dfrac{c_{i}}{2\beta_{i}}\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}H_{i}^{(k)}X_{i}^{(k)}\right\|^{2}_{\mathrm{F}}, (35)

Moreover, it can be checked that

\displaystyle{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}H_{i}^{(k)}X_{i}^{(k)}={\mathbf{P}}^{\perp}_{X_{i}^{(k)}}\left(A_{i}A_{i}^{\top}X_{i}^{(k)}+\Lambda_{i}^{(k)}X_{i}^{(k)}+\beta_{i}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k)}\right) (36)
\displaystyle={\mathbf{P}}^{\perp}_{X_{i}^{(k)}}\left(A_{i}A_{i}^{\top}X_{i}^{(k)}-{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}A_{i}A_{i}^{\top}X_{i}^{(k)}\right)+\beta_{i}{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k)}
\displaystyle=\beta_{i}{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k)}.

Suppose \hat{\sigma}_{1},\dotsc,\hat{\sigma}_{p} are the singular values of (X_{i}^{(k)})^{\top}Z^{(k+1)}. It is clear that 0\leq\hat{\sigma}_{j}\leq 1 for any j=1,\dotsc,p due to the orthonormality of the columns of X_{i}^{(k)} and Z^{(k+1)}. On the one hand, we have

𝐝𝐩2(Z(k+1),Xi(k))=Xi(k)(Xi(k))Z(k+1)(Z(k+1))F2=2j=1p(1σ^j2).{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)=\left\|X_{i}^{(k)}(X_{i}^{(k)})^{\top}-Z^{(k+1)}(Z^{(k+1)})^{\top}\right\|^{2}_{\mathrm{F}}=2\sum\limits_{j=1}^{p}\left(1-\hat{\sigma}_{j}^{2}\right).

On the other hand, it follows from 𝐝𝐩2(Z(k+1),Xi(k))2(1σ¯2){\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\leq 2\left(1-\underline{\sigma}^{2}\right) that

σmin((Xi(k))Z(k+1))σ¯.\sigma_{\min}\left((X_{i}^{(k)})^{\top}Z^{(k+1)}\right)\geq\underline{\sigma}.
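Indeed, combining the display above with the assumed bound gives

2\left(1-\sigma_{\min}^{2}\left((X_{i}^{(k)})^{\top}Z^{(k+1)}\right)\right)\leq 2\sum_{j=1}^{p}\left(1-\hat{\sigma}_{j}^{2}\right)={\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\leq 2\left(1-\underline{\sigma}^{2}\right),

from which the stated bound on the smallest singular value follows.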

Let Yi(k)=(Xi(k))Z(k+1)(Z(k+1))Xi(k)Y_{i}^{(k)}=(X_{i}^{(k)})^{\top}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k)}. By straightforward calculations, we can derive that

𝐏Xi(k)Z(k+1)(Z(k+1))Xi(k)F2=tr(Yi(k))tr((Yi(k))2)=j=1pσ^j2(1σ^j2)\displaystyle\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k)}\right\|^{2}_{\mathrm{F}}=\mathrm{tr}\left(Y_{i}^{(k)}\right)-\mathrm{tr}\left((Y_{i}^{(k)})^{2}\right)=\sum\limits_{j=1}^{p}\hat{\sigma}_{j}^{2}\left(1-\hat{\sigma}_{j}^{2}\right) (37)
j=1pσmin2((Xi(k))Z(k+1))(1σ^j2)12σ¯2𝐝𝐩2(Z(k+1),Xi(k)).\displaystyle\geq\sum\limits_{j=1}^{p}\sigma_{\min}^{2}\left((X_{i}^{(k)})^{\top}Z^{(k+1)}\right)\left(1-\hat{\sigma}_{j}^{2}\right)\geq\dfrac{1}{2}\underline{\sigma}^{2}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right).

Combining (35), (36) and (37), we obtain the assertion (33). Then it follows from the definition of h_{i}^{(k)} that

ciσ¯2𝐝𝐩2(Z(k+1),Xi(k))\displaystyle c_{i}\underline{\sigma}^{2}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\leq{} 2tr(Z(k+1)(Z(k+1))𝐃𝐩(Xi(k+1),Xi(k)))\displaystyle 2\mathrm{tr}\left(Z^{(k+1)}(Z^{(k+1)})^{\top}{\mathbf{D_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right)\right)
+2βitr((AiAi+Λi(k))𝐃𝐩(Xi(k+1),Xi(k))).\displaystyle+\dfrac{2}{\beta_{i}}\mathrm{tr}\left(\left(A_{i}A_{i}^{\top}+\Lambda_{i}^{(k)}\right){\mathbf{D_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right)\right).

By straightforward calculations, we can obtain that

tr((AiAi+Λi(k))𝐃𝐩(Xi(k+1),Xi(k)))\displaystyle\mathrm{tr}\left(\left(A_{i}A_{i}^{\top}+\Lambda_{i}^{(k)}\right){\mathbf{D_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right)\right)\leq{} AiAi+Λi(k)F𝐝𝐩(Xi(k+1),Xi(k))\displaystyle\left\|A_{i}A_{i}^{\top}+\Lambda_{i}^{(k)}\right\|_{\mathrm{F}}{\mathbf{d_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right)
\displaystyle\leq{} 6pAiF2,\displaystyle 6\sqrt{p}\left\|A_{i}\right\|^{2}_{\mathrm{F}},

and

tr(Z(k+1)(Z(k+1))𝐃𝐩(Xi(k+1),Xi(k)))\displaystyle\mathrm{tr}\left(Z^{(k+1)}(Z^{(k+1)})^{\top}{\mathbf{D_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right)\right)
=12𝐝𝐩2(Z(k+1),Xi(k))12𝐝𝐩2(Z(k+1),Xi(k+1)).\displaystyle=\dfrac{1}{2}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)-\dfrac{1}{2}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right).

The above three relationships yield (34), which completes the proof. ∎

Lemma D.6.

Suppose {Z(k)}\{Z^{(k)}\} is the iterate sequence generated by Algorithm 2 initiated from Z(0)𝒮n,pZ^{(0)}\in\mathcal{S}_{n,p} with the parameters satisfying Conditions 1 and 2. Then for any i=1,,di=1,\dotsc,d and kk\in\mathbb{N}, it holds that

𝐝𝐩2(Z(k+1),Xi(k))2(1σ¯2).{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\leq 2\left(1-\underline{\sigma}^{2}\right). (38)
Proof.

We use mathematical induction to prove this lemma. To begin with, it follows from the inequality (32) that

𝐝𝐩2(Z(1),Xi(0))\displaystyle{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(1)},X_{i}^{(0)}\right)\leq{} ρj=1d𝐝𝐩2(Z(0),Xj(0))+8βi(pAF2+μnp)\displaystyle\rho\sum\limits_{j=1}^{d}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(0)},X_{j}^{(0)}\right)+\dfrac{8}{\beta_{i}}\left(\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}}+\mu np\right)
=\displaystyle={} 8βi(pAF2+μnp)2(1σ¯2),\displaystyle\dfrac{8}{\beta_{i}}\left(\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}}+\mu np\right)\leq 2\left(1-\underline{\sigma}^{2}\right),

where the first sum vanishes because the initial point is feasible, that is, X_{j}^{(0)}(X_{j}^{(0)})^{\top}=Z^{(0)}(Z^{(0)})^{\top}, and the last inequality follows from the relationship \beta_{i}>4(\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}}+\mu np)/(1-\underline{\sigma}^{2}) in Condition 2. Thus, the argument (38) directly holds for (Z^{(1)},\{X_{i}^{(0)}\}). Now, we assume the argument holds at (Z^{(k+1)},\{X_{i}^{(k)}\}) and investigate the situation at (Z^{(k+2)},\{X_{i}^{(k+1)}\}).

According to Condition 2, we have 12pAiF2/βi<2(1σ¯2)ciσ¯212\sqrt{p}\left\|A_{i}\right\|^{2}_{\mathrm{F}}/\beta_{i}<2\left(1-\underline{\sigma}^{2}\right)c_{i}\underline{\sigma}^{2}. Since we assume that 𝐝𝐩2(Z(k+1),Xi(k))2(1σ¯2){\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\leq 2\left(1-\underline{\sigma}^{2}\right), it follows from the relationship (34) that

𝐝𝐩2(Z(k+1),Xi(k+1))\displaystyle{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right)\leq{} (1ciσ¯2)𝐝𝐩2(Z(k+1),Xi(k))+12βipAiF2\displaystyle\left(1-c_{i}\underline{\sigma}^{2}\right){\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)+\dfrac{12}{\beta_{i}}\sqrt{p}\left\|A_{i}\right\|^{2}_{\mathrm{F}}
\displaystyle\leq{} 2(1σ¯2)(1ciσ¯2)+2(1σ¯2)ciσ¯2=2(1σ¯2),\displaystyle 2\left(1-\underline{\sigma}^{2}\right)\left(1-c_{i}\underline{\sigma}^{2}\right)+2\left(1-\underline{\sigma}^{2}\right)c_{i}\underline{\sigma}^{2}=2\left(1-\underline{\sigma}^{2}\right),

which implies that \sigma_{\min}\left((X_{i}^{(k+1)})^{\top}Z^{(k+1)}\right)\geq\underline{\sigma}. Similarly to the proof of Lemma D.5, we can obtain that

𝐏Xi(k+1)Z(k+1)(Z(k+1))Xi(k+1)F212σ¯2𝐝𝐩2(Z(k+1),Xi(k+1)).\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k+1)}\right\|^{2}_{\mathrm{F}}\geq\dfrac{1}{2}\underline{\sigma}^{2}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right). (39)

Combining the condition (26) and the equality (36), we have

𝐏Xi(k+1)Hi(k)Xi(k+1)Fδi𝐏Xi(k)Hi(k)Xi(k)F\displaystyle\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}H_{i}^{(k)}X_{i}^{(k+1)}\right\|_{\mathrm{F}}\leq\delta_{i}\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}H_{i}^{(k)}X_{i}^{(k)}\right\|_{\mathrm{F}} (40)
=δiβi𝐏Xi(k)Z(k+1)(Z(k+1))Xi(k)Fδiβi𝐝𝐩(Z(k+1),Xi(k)).\displaystyle=\delta_{i}\beta_{i}\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k)}}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k)}\right\|_{\mathrm{F}}\leq\delta_{i}\beta_{i}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right).

On the other hand, it follows from the triangle inequality that

𝐏Xi(k+1)Hi(k)Xi(k+1)F\displaystyle\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}H_{i}^{(k)}X_{i}^{(k+1)}\right\|_{\mathrm{F}}
\displaystyle\geq{} 𝐏Xi(k+1)(AiAi+Λi(k+1)+βiZ(k+1)(Z(k+1)))Xi(k+1)F\displaystyle\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}\left(A_{i}A_{i}^{\top}+\Lambda_{i}^{(k+1)}+\beta_{i}Z^{(k+1)}(Z^{(k+1)})^{\top}\right)X_{i}^{(k+1)}\right\|_{\mathrm{F}}
𝐏Xi(k+1)(Λi(k+1)Λi(k))Xi(k+1)F\displaystyle-\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}\left(\Lambda_{i}^{(k+1)}-\Lambda_{i}^{(k)}\right)X_{i}^{(k+1)}\right\|_{\mathrm{F}}

Combining the inequality (39), it can be verified that

𝐏Xi(k+1)(AiAi+Λi(k+1)+βiZ(k+1)(Z(k+1)))Xi(k+1)F\displaystyle\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}\left(A_{i}A_{i}^{\top}+\Lambda_{i}^{(k+1)}+\beta_{i}Z^{(k+1)}(Z^{(k+1)})^{\top}\right)X_{i}^{(k+1)}\right\|_{\mathrm{F}}
=βi𝐏Xi(k+1)Z(k+1)(Z(k+1))Xi(k+1)F22σ¯βi𝐝𝐩(Z(k+1),Xi(k+1)).\displaystyle=\beta_{i}\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}Z^{(k+1)}(Z^{(k+1)})^{\top}X_{i}^{(k+1)}\right\|_{\mathrm{F}}\geq\dfrac{\sqrt{2}}{2}\underline{\sigma}\beta_{i}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right).

Moreover, according to Lemma B.4 in [51], we have

𝐏Xi(k+1)(Λi(k+1)Λi(k))Xi(k+1)F\displaystyle\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}\left(\Lambda_{i}^{(k+1)}-\Lambda_{i}^{(k)}\right)X_{i}^{(k+1)}\right\|_{\mathrm{F}}\leq{} Λi(k+1)Λi(k)F\displaystyle\left\|\Lambda_{i}^{(k+1)}-\Lambda_{i}^{(k)}\right\|_{\mathrm{F}}
\displaystyle\leq{} 4Ai22𝐝𝐩(Xi(k+1),Xi(k)).\displaystyle 4\left\|A_{i}\right\|_{2}^{2}{\mathbf{d_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right).

Combining the above three inequalities, we further obtain that

𝐏Xi(k+1)Hi(k)Xi(k+1)F\displaystyle\left\|{\mathbf{P}}^{\perp}_{X_{i}^{(k+1)}}H_{i}^{(k)}X_{i}^{(k+1)}\right\|_{\mathrm{F}}\geq{} 22σ¯βi𝐝𝐩(Z(k+1),Xi(k+1))\displaystyle\dfrac{\sqrt{2}}{2}\underline{\sigma}\beta_{i}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right)
4Ai22𝐝𝐩(Xi(k+1),Xi(k)).\displaystyle-4\left\|A_{i}\right\|_{2}^{2}{\mathbf{d_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right).

Together with (40), this yields that

22σ¯βi𝐝𝐩(Z(k+1),Xi(k+1))\displaystyle\dfrac{\sqrt{2}}{2}\underline{\sigma}\beta_{i}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right)
δiβi𝐝𝐩(Z(k+1),Xi(k))+4Ai22𝐝𝐩(Xi(k+1),Xi(k))\displaystyle\leq\delta_{i}\beta_{i}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)+4\left\|A_{i}\right\|_{2}^{2}{\mathbf{d_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right)
(δiβi+4Ai22)𝐝𝐩(Z(k+1),Xi(k))+4Ai22𝐝𝐩(Z(k+1),Xi(k+1)).\displaystyle\leq\left(\delta_{i}\beta_{i}+4\left\|A_{i}\right\|_{2}^{2}\right){\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)+4\left\|A_{i}\right\|_{2}^{2}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right).

According to Conditions 1 and 2, we have 2σ¯βi8Ai22>0\sqrt{2}\underline{\sigma}\beta_{i}-8\left\|A_{i}\right\|_{2}^{2}>0 and σ¯2ρdδi>0\underline{\sigma}-2\sqrt{\rho d}\delta_{i}>0. Thus, it can be verified that

𝐝𝐩(Z(k+1),Xi(k+1))\displaystyle{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right)\leq{} 2(δiβi+4Ai22)2σ¯βi8Ai22𝐝𝐩(Z(k+1),Xi(k))\displaystyle\dfrac{2(\delta_{i}\beta_{i}+4\left\|A_{i}\right\|_{2}^{2})}{\sqrt{2}\underline{\sigma}\beta_{i}-8\left\|A_{i}\right\|_{2}^{2}}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right) (41)
\displaystyle\leq{} 12ρd𝐝𝐩(Z(k+1),Xi(k)),\displaystyle\sqrt{\dfrac{1}{2\rho d}}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right),

where the last inequality follows from the relationship \beta_{i}>\dfrac{4\left(2\sqrt{\rho d}+\sqrt{2}\right)\left\|A_{i}\right\|_{2}^{2}}{\underline{\sigma}-2\sqrt{\rho d}\delta_{i}} in Condition 2. This together with (32) and (38) yields that

𝐝𝐩2(Z(k+2),Xi(k+1))ρj=1d𝐝𝐩2(Z(k+1),Xj(k+1))+8βi(pAF2+μnp)\displaystyle{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+2)},X_{i}^{(k+1)}\right)\leq\rho\sum\limits_{j=1}^{d}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{j}^{(k+1)}\right)+\dfrac{8}{\beta_{i}}\left(\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}}+\mu np\right)
12dj=1d𝐝𝐩2(Z(k+1),Xj(k))+(1σ¯2)(1σ¯2)+(1σ¯2)=2(1σ¯2),\displaystyle\leq\dfrac{1}{2d}\sum\limits_{j=1}^{d}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{j}^{(k)}\right)+\left(1-\underline{\sigma}^{2}\right)\leq\left(1-\underline{\sigma}^{2}\right)+\left(1-\underline{\sigma}^{2}\right)=2\left(1-\underline{\sigma}^{2}\right),

since we assume that βi>8(pAF2+μnp)/(1σ¯2)\beta_{i}>8(\sqrt{p}\left\|A\right\|^{2}_{\mathrm{F}}+\mu np)/(1-\underline{\sigma}^{2}) in Condition 2. The proof is completed. ∎

Corollary D.7.

Suppose {Z(k)}\{Z^{(k)}\} is the iterate sequence generated by Algorithm 2 initiated from Z(0)𝒮n,pZ^{(0)}\in\mathcal{S}_{n,p}, and the problem parameters satisfy Conditions 1 and 2. Then for any kk\in\mathbb{N}, we can obtain that

(Z(k+1),{Xi(k)},{Λi(k)})(Z(k+1),{Xi(k+1)},{Λi(k)})\displaystyle\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})-\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k+1)}\},\{\Lambda_{i}^{(k)}\})
14σ¯2i=1dciβi𝐝𝐩2(Z(k+1),Xi(k)).\displaystyle\geq\dfrac{1}{4}\underline{\sigma}^{2}\sum\limits_{i=1}^{d}c_{i}\beta_{i}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right).
Proof.

This corollary directly follows from Lemma D.5 and Lemma D.6. ∎

Corollary D.8.

Suppose \{Z^{(k)}\} is the iterate sequence generated by Algorithm 2 initiated from Z^{(0)}\in\mathcal{S}_{n,p}, and the problem parameters satisfy Conditions 1 and 2. Then for any k\in\mathbb{N}, it holds that

(Z(k+1),{Xi(k+1)},{Λi(k)})(Z(k+1),{Xi(k+1)},{Λi(k+1)})\displaystyle\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k+1)}\},\{\Lambda_{i}^{(k)}\})-\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k+1)}\},\{\Lambda_{i}^{(k+1)}\})
2ρd+1ρdi=1dAi22𝐝𝐩2(Z(k+1),Xi(k)).\displaystyle\geq-\dfrac{\sqrt{2\rho d}+1}{\rho d}\sum\limits_{i=1}^{d}\left\|A_{i}\right\|_{2}^{2}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right).
Proof.

According to the Cauchy–Schwarz inequality, we can show that

|Λi(k+1)Λi(k),𝐃𝐩(Xi(k+1),Z(k+1))|Λi(k+1)Λi(k)F𝐝𝐩(Z(k+1),Xi(k+1))\displaystyle\left\lvert\left\langle\Lambda_{i}^{(k+1)}-\Lambda_{i}^{(k)},{\mathbf{D_{p}}}\left(X_{i}^{(k+1)},Z^{(k+1)}\right)\right\rangle\right\rvert\leq\left\|\Lambda_{i}^{(k+1)}-\Lambda_{i}^{(k)}\right\|_{\mathrm{F}}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right)
8ρdAi22𝐝𝐩(Xi(k+1),Xi(k))𝐝𝐩(Z(k+1),Xi(k)),\displaystyle\leq\sqrt{\dfrac{8}{\rho d}}\left\|A_{i}\right\|_{2}^{2}{\mathbf{d_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right){\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right),

where the last inequality follows from Lemma B.4 in [51] and (41). In addition, we have

𝐝𝐩(Xi(k+1),Xi(k))\displaystyle{\mathbf{d_{p}}}\left(X_{i}^{(k+1)},X_{i}^{(k)}\right)\leq{} 𝐝𝐩(Z(k+1),Xi(k+1))+𝐝𝐩(Z(k+1),Xi(k))\displaystyle{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right)+{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)
\displaystyle\leq{} 2ρd+12ρd𝐝𝐩(Z(k+1),Xi(k)),\displaystyle\dfrac{\sqrt{2\rho d}+1}{\sqrt{2\rho d}}{\mathbf{d_{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right),

which implies that

Λi(k+1)Λi(k),𝐃𝐩(Xi(k+1),Z(k+1))\displaystyle\left\langle\Lambda_{i}^{(k+1)}-\Lambda_{i}^{(k)},{\mathbf{D_{p}}}\left(X_{i}^{(k+1)},Z^{(k+1)}\right)\right\rangle
2(2ρd+1)ρdAi22𝐝𝐩2(Z(k+1),Xi(k)).\displaystyle\geq-\dfrac{2\left(\sqrt{2\rho d}+1\right)}{\rho d}\left\|A_{i}\right\|_{2}^{2}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right).

Combining this with the fact that

(Z(k+1),{Xi(k+1)},{Λi(k)})(Z(k+1),{Xi(k+1)},{Λi(k+1)})\displaystyle\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k+1)}\},\{\Lambda_{i}^{(k)}\})-\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k+1)}\},\{\Lambda_{i}^{(k+1)}\})
=12i=1dΛi(k+1)Λi(k),𝐃𝐩(Xi(k+1),Z(k+1)),\displaystyle=\dfrac{1}{2}\sum\limits_{i=1}^{d}\left\langle\Lambda_{i}^{(k+1)}-\Lambda_{i}^{(k)},{\mathbf{D_{p}}}\left(X_{i}^{(k+1)},Z^{(k+1)}\right)\right\rangle,

we complete the proof. ∎

Now, based on these lemmas and corollaries, we can show that the sequence \left\{\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\right\} is monotonically non-increasing, which leads to the global convergence of our algorithm.

Proposition D.9.

Suppose \{Z^{(k)}\} is the iteration sequence generated by Algorithm 2 initiated from Z^{(0)}\in\mathcal{S}_{n,p}, and the problem parameters satisfy Conditions 1 and 2. Then the sequence of augmented Lagrangian function values \{\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\} is monotonically non-increasing and, for any k\in\mathbb{N}, satisfies the following sufficient descent property:

(Z(k),{Xi(k)},{Λi(k)})(Z(k+1),{Xi(k+1)},{Λi(k+1)})\displaystyle\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})-\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k+1)}\},\{\Lambda_{i}^{(k+1)}\}) (42)
i=1dJi𝐝𝐩2(Z(k+1),Xi(k+1))+M¯D(k)F2,\displaystyle\geq\sum\limits_{i=1}^{d}J_{i}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right)+\bar{M}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}},

where Ji=12ρdσ¯2ciβi2(2ρd+1)Ai22>0J_{i}=\dfrac{1}{2}\rho d\underline{\sigma}^{2}c_{i}\beta_{i}-2(\sqrt{2\rho d}+1)\left\|A_{i}\right\|_{2}^{2}>0 is a constant.

Proof.

Combining Corollary D.3, Corollary D.7, and Corollary D.8, we obtain that

(Z(k),{Xi(k)},{Λi(k)})(Z(k+1),{Xi(k+1)},{Λi(k+1)})\displaystyle\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})-\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k+1)}\},\{\Lambda_{i}^{(k+1)}\})
i=1d(14σ¯2ciβi2ρd+1ρdAi22)𝐝𝐩2(Z(k+1),Xi(k))+M¯D(k)F2.\displaystyle\geq\sum\limits_{i=1}^{d}\left(\dfrac{1}{4}\underline{\sigma}^{2}c_{i}\beta_{i}-\dfrac{\sqrt{2\rho d}+1}{\rho d}\left\|A_{i}\right\|_{2}^{2}\right){\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)+\bar{M}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}.

Recalling the relationship βi>4(2ρd+1)Ai22/(ρdσ¯2ci)\beta_{i}>4(\sqrt{2\rho d}+1)\left\|A_{i}\right\|_{2}^{2}/(\rho d\underline{\sigma}^{2}c_{i}) in Condition 2, we can conclude that (Z(k),{Xi(k)},{Λi(k)})(Z(k+1),{Xi(k+1)},{Λi(k+1)})\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\geq\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k+1)}\},\{\Lambda_{i}^{(k+1)}\}). Hence, the sequence {(Z(k),{Xi(k)},{Λi(k)})}\{\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\} is monotonically non-increasing. Finally, the above relationship together with (41) yields the assertion (42). The proof is finished. ∎
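For clarity, we spell out how the constants J_{i} arise in the last step: since (41) implies {\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\geq 2\rho d\,{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right), and the coefficient in the display above is positive by Condition 2, we have

\left(\dfrac{1}{4}\underline{\sigma}^{2}c_{i}\beta_{i}-\dfrac{\sqrt{2\rho d}+1}{\rho d}\left\|A_{i}\right\|_{2}^{2}\right){\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k)}\right)\geq 2\rho d\left(\dfrac{1}{4}\underline{\sigma}^{2}c_{i}\beta_{i}-\dfrac{\sqrt{2\rho d}+1}{\rho d}\left\|A_{i}\right\|_{2}^{2}\right){\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right)=J_{i}\,{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k+1)},X_{i}^{(k+1)}\right).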

Based on the above properties, we are ready to prove Theorem 4.3, which establishes the global convergence rate of our proposed algorithm.

Proof of Theorem 4.3.

The whole sequence \{(Z^{(k)},\{X_{i}^{(k)}\})\} is naturally bounded, since every X_{i}^{(k)} and Z^{(k)} has orthonormal columns. It then follows from the Bolzano–Weierstrass theorem that this sequence has an accumulation point (Z^{\ast},\{X_{i}^{\ast}\}), where Z^{\ast}\in\mathcal{S}_{n,p} and X_{i}^{\ast}\in\mathcal{S}_{n,p}. Moreover, the boundedness of \{\Lambda_{i}^{(k)}\} results from the multiplier updating formula (15). Hence, the sequence \{\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\} is bounded from below, owing to the continuity of the augmented Lagrangian function; namely, there exists a constant \underline{L} such that \mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\geq\underline{L} for all k\in\mathbb{N}.

It follows from the sufficient descent property (42) that

k=1KD(k)F2\displaystyle\sum\limits_{k=1}^{K}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}} (43)
M¯1k=1K((Z(k),{Xi(k)},{Λi(k)})(Z(k+1),{Xi(k+1)},{Λi(k+1)}))\displaystyle\leq\bar{M}^{-1}\sum\limits_{k=1}^{K}\left(\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})-\mathcal{L}(Z^{(k+1)},\{X_{i}^{(k+1)}\},\{\Lambda_{i}^{(k+1)}\})\right)
=M¯1((Z(1),{Xi(1)},{Λi(1)})(Z(K+1),{Xi(K+1)},{Λi(K+1)}))\displaystyle=\bar{M}^{-1}\left(\mathcal{L}(Z^{(1)},\{X_{i}^{(1)}\},\{\Lambda_{i}^{(1)}\})-\mathcal{L}(Z^{(K+1)},\{X_{i}^{(K+1)}\},\{\Lambda_{i}^{(K+1)}\})\right)
M¯1((Z(1),{Xi(1)},{Λi(1)})L¯),\displaystyle\leq\bar{M}^{-1}\left(\mathcal{L}(Z^{(1)},\{X_{i}^{(1)}\},\{\Lambda_{i}^{(1)}\})-\underline{L}\right),

and

k=1K𝐝𝐩2(Z(k),Xi(k))\displaystyle\sum\limits_{k=1}^{K}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k)},X_{i}^{(k)}\right) (44)
Ji1k=1K((Z(k1),{Xi(k1)},{Λi(k1)})(Z(k),{Xi(k)},{Λi(k)}))\displaystyle\leq J_{i}^{-1}\sum\limits_{k=1}^{K}\left(\mathcal{L}(Z^{(k-1)},\{X_{i}^{(k-1)}\},\{\Lambda_{i}^{(k-1)}\})-\mathcal{L}(Z^{(k)},\{X_{i}^{(k)}\},\{\Lambda_{i}^{(k)}\})\right)
=Ji1((Z(0),{Xi(0)},{Λi(0)})(Z(K),{Xi(K)},{Λi(K)}))\displaystyle=J_{i}^{-1}\left(\mathcal{L}(Z^{(0)},\{X_{i}^{(0)}\},\{\Lambda_{i}^{(0)}\})-\mathcal{L}(Z^{(K)},\{X_{i}^{(K)}\},\{\Lambda_{i}^{(K)}\})\right)
Ji1((Z(0),{Xi(0)},{Λi(0)})L¯).\displaystyle\leq J_{i}^{-1}\left(\mathcal{L}(Z^{(0)},\{X_{i}^{(0)}\},\{\Lambda_{i}^{(0)}\})-\underline{L}\right).

Upon taking the limit as KK\to\infty, we obtain that

k=1D(k)F2< and k=1𝐝𝐩2(Z(k),Xi(k))<,\sum\limits_{k=1}^{\infty}\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}<\infty\text{~{}~{}and~{}~{}}\sum\limits_{k=1}^{\infty}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k)},X_{i}^{(k)}\right)<\infty,

which further implies that

limkD(k)F=0 and limk𝐝𝐩(Z(k),Xi(k))=0,\lim\limits_{k\to\infty}\left\|D^{(k)}\right\|_{\mathrm{F}}=0\text{~{}~{}and~{}~{}}\lim\limits_{k\to\infty}{\mathbf{d_{p}}}\left(Z^{(k)},X_{i}^{(k)}\right)=0,

respectively. Combing this with Lemma 4.1, we know that any accumulation point ZZ^{\ast} of sequence {Z(k)}\{Z^{(k)}\} is a first-order stationary point of the problem (2).

Finally, we establish the sublinear convergence rate. Indeed, it follows from the inequalities (43) and (44) that

mink=1,,K{D(k)F2+1di=1d𝐝𝐩2(Z(k),Xi(k))}\displaystyle\min\limits_{k=1,\dotsc,K}\left\{\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}+\dfrac{1}{d}\sum\limits_{i=1}^{d}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k)},X_{i}^{(k)}\right)\right\}
1Kk=1K{D(k)F2+1di=1d𝐝𝐩2(Z(k),Xi(k))}CK,\displaystyle\leq\dfrac{1}{K}\sum\limits_{k=1}^{K}\left\{\left\|D^{(k)}\right\|^{2}_{\mathrm{F}}+\dfrac{1}{d}\sum\limits_{i=1}^{d}{\mathbf{d}^{2}_{\mathbf{p}}}\left(Z^{(k)},X_{i}^{(k)}\right)\right\}\leq\dfrac{C}{K},

where

C=\displaystyle C={} M¯1((Z(1),{Xi(1)},{Λi(1)})L¯)\displaystyle\bar{M}^{-1}\left(\mathcal{L}(Z^{(1)},\{X_{i}^{(1)}\},\{\Lambda_{i}^{(1)}\})-\underline{L}\right)
+(i=1dJi1)d1((Z(0),{Xi(0)},{Λi(0)})L¯)\displaystyle+\left(\sum\limits_{i=1}^{d}J_{i}^{-1}\right)d^{-1}\left(\mathcal{L}(Z^{(0)},\{X_{i}^{(0)}\},\{\Lambda_{i}^{(0)}\})-\underline{L}\right)

is a positive constant. This completes the proof. ∎

References

  • [1] T. E. Abrudan, J. Eriksson, and V. Koivunen, Steepest descent algorithms for optimization under unitary matrix constraint, IEEE Transactions on Signal Processing, 56 (2008), pp. 1134–1147.
  • [2] P.-A. Absil, C. G. Baker, and K. A. Gallivan, Trust-region methods on Riemannian manifolds, Foundations of Computational Mathematics, 7 (2006), pp. 303–330.
  • [3] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2008.
  • [4] F. L. Andrade, M. A. Figueiredo, and J. Xavier, Distributed Picard iteration: Application to distributed EM and distributed PCA, arXiv:2106.10665, (2021).
  • [5] K. J. Arrow, H. Azawa, L. Hurwicz, and H. Uzawa, Studies in linear and non-linear programming, vol. 2, Stanford University Press, 1958.
  • [6] M. Bacák, R. Bergmann, G. Steidl, and A. Weinmann, A second order nonsmooth variational model for restoring manifold-valued images, SIAM Journal on Scientific Computing, 38 (2016), pp. A567–A597.
  • [7] T. Baden, P. Berens, K. Franke, M. R. Rosón, M. Bethge, and T. Euler, The functional diversity of retinal ganglion cells in the mouse, Nature, 529 (2016), pp. 345–350.
  • [8] G. C. Bento, O. P. Ferreira, and J. G. Melo, Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds, Journal of Optimization Theory and Applications, 173 (2017), pp. 548–562.
  • [9] G. Chen, P. F. Sullivan, and M. R. Kosorok, Biclustering with heterogeneous variance, Proceedings of the National Academy of Sciences, 110 (2013), pp. 12253–12258.
  • [10] S. Chen, A. Garcia, M. Hong, and S. Shahrampour, Decentralized Riemannian gradient descent on the Stiefel manifold, in Proceedings of the 38th International Conference on Machine Learning, vol. 139, 2021, pp. 1594–1605.
  • [11] S. Chen, S. Ma, A. Man-Cho So, and T. Zhang, Proximal gradient method for nonsmooth optimization over the Stiefel manifold, SIAM Journal on Optimization, 30 (2020), pp. 210–239.
  • [12] W. Chen, H. Ji, and Y. You, An augmented lagrangian method for 1\ell_{1}-regularized optimization problems with orthogonality constraints, SIAM Journal on Scientific Computing, 38 (2016), pp. B570–B592.
  • [13] F. H. Clarke, Optimization and nonsmooth analysis, SIAM, 1990.
  • [14] A. d’Aspremont, F. Bach, and L. El Ghaoui, Optimal solutions for sparse principal component analysis, Journal of Machine Learning Research, 9 (2008), pp. 1269–1294.
  • [15] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet, A direct formulation for sparse PCA using semidefinite programming, SIAM Review, 49 (2007), pp. 434–448.
  • [16] A. Edelman, T. A. Arias, and S. T. Smith, The geometry of algorithms with orthogonality constraints, SIAM Journal on Matrix Analysis and Applications, 20 (1998), pp. 303–353.
  • [17] O. Ferreira and P. Oliveira, Subgradient algorithm on Riemannian manifolds, Journal of Optimization Theory and Applications, 97 (1998), pp. 93–104.
  • [18] O. P. Ferreira, M. S. Louzeiro, and L. F. Prudente, Iteration-complexity of the subgradient method on Riemannian manifolds with lower bounded curvature, Optimization, 68 (2019), pp. 713–729.
  • [19] A. Gang and W. U. Bajwa, A linearly convergent algorithm for distributed principal component analysis, arXiv:2101.01300, (2021).
  • [20] A. Gang and W. U. Bajwa, FAST-PCA: A fast and exact algorithm for distributed principal component analysis, arXiv:2108.12373, (2021).
  • [21] B. Gao, X. Liu, X. Chen, and Y.-X. Yuan, A new first-order algorithmic framework for optimization problems with orthogonality constraints, SIAM Journal on Optimization, 28 (2018), pp. 302–332.
  • [22] B. Gao, X. Liu, and Y.-X. Yuan, Parallelizable algorithms for optimization problems with orthogonality constraints, SIAM Journal on Scientific Computing, 41 (2019), pp. A1949–A1983.
  • [23] I. Gemp, B. McWilliams, C. Vernade, and T. Graepel, Eigengame: PCA as a nash equilibrium, arXiv:2010.00554, (2020).
  • [24] K. Gravuer, J. J. Sullivan, P. A. Williams, and R. P. Duncan, Strong human association with plant invasion success for Trifolium introductions to New Zealand, Proceedings of the National Academy of Sciences, 105 (2008), pp. 6344–6349.
  • [25] P. Grohs and S. Hosseini, Nonsmooth trust region algorithms for locally Lipschitz functions on Riemannian manifolds, IMA Journal of Numerical Analysis, 36 (2016), pp. 1167–1192.
  • [26] D. Hajinezhad and M. Hong, Nonconvex alternating direction method of multipliers for distributed sparse principal component analysis, in 2015 IEEE Global Conference on Signal and Information Processing, 2015, pp. 255–259.
  • [27] B. He, Y. You, and X. Yuan, On the convergence of primal-dual hybrid gradient algorithm, SIAM Journal on Imaging Sciences, 7 (2014), pp. 2526–2537.
  • [28] S. Hosseini and A. Uschmajew, A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds, SIAM Journal on Optimization, 27 (2017), pp. 173–189.
  • [29] J. Hu, B. Jiang, L. Lin, Z. Wen, and Y.-X. Yuan, Structured quasi-Newton methods for optimization with orthogonality constraints, SIAM Journal on Scientific Computing, 41 (2019), pp. A2239–A2269.
  • [30] J. Hu, A. Milzarek, Z. Wen, and Y. Yuan, Adaptive quadratically regularized Newton method for Riemannian optimization, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 1181–1207.
  • [31] W. Huang and K. Wei, Riemannian proximal gradient methods, Mathematical Programming, (2021), pp. 1–43.
  • [32] B. Jiang and Y.-H. Dai, A framework of constraint preserving update schemes for optimization on Stiefel manifold, Mathematical Programming, 153 (2015), pp. 535–575.
  • [33] B. Jiang, S. Ma, A. M.-C. So, and S. Zhang, Vector transport-free SVRG with general retraction for Riemannian optimization: Complexity analysis and practical implementation, arXiv:1705.09059, (2017).
  • [34] I. M. Johnstone and A. Y. Lu, On consistency and sparsity for principal components analysis in high dimensions, Journal of the American Statistical Association, 104 (2009), pp. 682–693.
  • [35] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin, A modified principal component technique based on the LASSO, Journal of Computational and Graphical Statistics, 12 (2003), pp. 531–547.
  • [36] A. Kovnatsky, K. Glashoff, and M. M. Bronstein, MADMM: A generic algorithm for non-smooth optimization on manifolds, in European Conference on Computer Vision, Springer, 2016, pp. 680–696.
  • [37] R. Lai and S. Osher, A splitting method for orthogonality constrained problems, Journal of Scientific Computing, 58 (2014), pp. 431–449.
  • [38] Y. Lou, L. Yu, S. Wang, and P. Yi, Privacy preservation in distributed subgradient optimization algorithms, IEEE Transactions on Cybernetics, 48 (2017), pp. 2154–2165.
  • [39] M. Magdon-Ismail, NP-hardness and inapproximability of sparse PCA, Information Processing Letters, 126 (2017), pp. 35–38.
  • [40] J. H. Manton, Optimization algorithms exploiting unitary constraints, IEEE Transactions on Signal Processing, 50 (2002), pp. 635–650.
  • [41] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, Communication-efficient learning of deep networks from decentralized data, in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, PMLR, 2017, pp. 1273–1282.
  • [42] Y. Nishimori and S. Akaho, Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold, Neurocomputing, 67 (2005), pp. 106–135.
  • [43] P. S. Pacheco, An introduction to parallel programming, Elsevier, 2011.
  • [44] H. Rutishauser, Simultaneous iteration method for symmetric matrices, Numerische Mathematik, 16 (1970), pp. 205–223.
  • [45] H. Sato, A Dai–Yuan-type Riemannian conjugate gradient method with the weak Wolfe conditions, Computational Optimization and Applications, 64 (2016), pp. 101–118.
  • [46] H. Shen and J. Z. Huang, Sparse principal component analysis via regularized low rank matrix approximation, Journal of Multivariate Analysis, 99 (2008), pp. 1015–1034.
  • [47] K. Sjostrand, E. Rostrup, C. Ryberg, R. Larsen, C. Studholme, H. Baezner, J. Ferro, F. Fazekas, L. Pantoni, D. Inzitari, et al., Sparse decomposition and modeling of anatomical shape variation, IEEE Transactions on Medical Imaging, 26 (2007), pp. 1625–1635.
  • [48] E. Stiefel, Richtungsfelder und fernparallelismus in n-dimensionalen mannigfaltigkeiten, Commentarii Mathematici Helvetici, 8 (1935), pp. 305–353.
  • [49] L. Wang, B. Gao, and X. Liu, Multipliers correction methods for optimization problems over the Stiefel manifold, CSIAM Transactions on Applied Mathematics, 2 (2021), pp. 508–531.
  • [50] L. Wang and X. Liu, Decentralized optimization over the Stiefel manifold by an approximate augmented Lagrangian function, IEEE Transactions on Signal Processing, 70 (2022), pp. 3029–3041.
  • [51] L. Wang, X. Liu, and Y. Zhang, A distributed and secure algorithm for computing dominant SVD based on projection splitting, arXiv:2012.03461, (2020).
  • [52] Z. Wen and W. Yin, A feasible method for optimization with orthogonality constraints, Mathematical Programming, 142 (2013), pp. 397–434.
  • [53] D. M. Witten, R. Tibshirani, and T. Hastie, A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, Biostatistics, 10 (2009), pp. 515–534.
  • [54] T. Wu, E. Hu, S. Xu, M. Chen, P. Guo, Z. Dai, T. Feng, L. Zhou, W. Tang, L. Zhan, et al., clusterProfiler 4.0: A universal enrichment tool for interpreting omics data, The Innovation, 2 (2021), p. 100141.
  • [55] N. Xiao, X. Liu, and Y.-X. Yuan, A class of smooth exact penalty function methods for optimization problems with orthogonality constraints, Optimization Methods and Software, (2020), pp. 1–37.
  • [56] N. Xiao, X. Liu, and Y.-x. Yuan, A penalty-free infeasible approach for a class of nonsmooth optimization problems over the Stiefel manifold, arXiv:2103.03514, (2021).
  • [57] X. Xiao, Y. Li, Z. Wen, and L. Zhang, A regularized semi-smooth Newton method with projection steps for composite convex programs, Journal of Scientific Computing, 76 (2018), pp. 364–389.
  • [58] W. H. Yang, L.-H. Zhang, and R. Song, Optimality conditions for the nonlinear programming problems on Riemannian manifolds, Pacific Journal of Optimization, 10 (2014), pp. 415–434.
  • [59] H. Ye and T. Zhang, DeEPCA: Decentralized exact PCA with linear convergence rate, Journal of Machine Learning Research, 22 (2021), pp. 1–27.
  • [60] C. Zhang, M. Ahmad, and Y. Wang, ADMM based privacy-preserving decentralized optimization, IEEE Transactions on Information Forensics and Security, 14 (2018), pp. 565–580.
  • [61] X. Zhu, A Riemannian conjugate gradient method for optimization on the Stiefel manifold, Computational Optimization and Applications, 67 (2017), pp. 73–110.
  • [62] H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, Journal of Computational and Graphical Statistics, 15 (2006), pp. 265–286.
  • [63] H. Zou and L. Xue, A selective overview of sparse principal component analysis, Proceedings of the IEEE, 106 (2018), pp. 1311–1320.