A Communication-Efficient And Privacy-Aware Distributed Algorithm For Sparse PCA
Abstract
Sparse principal component analysis (PCA) improves the interpretability of classic PCA by introducing sparsity into the dimension-reduction process. Optimization models for sparse PCA, however, are generally non-convex, non-smooth, and more difficult to solve, especially on large-scale datasets that require distributed computation over a wide network. In this paper, we develop a distributed algorithm with a central server, called DSSAL1, for sparse PCA that aims to achieve low communication overhead by adapting a newly proposed subspace-splitting strategy to accelerate convergence. Theoretically, convergence to stationary points is established for DSSAL1. Extensive numerical results show that DSSAL1 requires far fewer rounds of communication than state-of-the-art peer methods. In addition, we make the case that, since the messages exchanged in DSSAL1 are well masked, the possibility of private-data leakage is much lower in DSSAL1 than in some other distributed algorithms.
1 Introduction
Principal component analysis (PCA) is a fundamental and ubiquitous tool in statistics and data analytics. In particular, it frequently serves as a critical preprocessing step in numerous data science and machine learning tasks. In general, solutions produced by classical PCA are dense combinations of all features. In practice, however, sparse combinations not only enhance the interpretability of the principal components but also reduce storage [47, 9], which motivates the idea of sparse PCA. More importantly, from a theoretical perspective [34, 63], sparse PCA is able to remedy the inconsistency of classical PCA in the high-dimensional setting. As a dimension-reduction and feature-extraction method, sparse PCA can be applied wherever PCA is normally used, such as medical imaging [47], biological knowledge mining [54], ecological studies [24], cancer analysis [9], and neuroscience research [7], to name a few examples.
Let be an data matrix, where and denote the numbers of features and samples, respectively. Without loss of generality, we assume that each row of has a zero mean. Mathematically speaking, PCA finds an orthonormal basis of a -dimensional subspace such that the projection of the samples onto this subspace has the largest variance. The set is usually referred to as the Stiefel manifold [48]. Then PCA can be formally expressed as the following optimization problem with orthogonality constraints:
(1)
where represents the trace of a given square matrix, and the columns of are called loading vectors or simply loadings.
In the projected data , the number of features is reduced from to and each feature (row of ) is a linear combination of the original features (rows of ) with coefficients from . For a sufficiently sparse , each reduced feature depends only on a few original features instead of all of them, leading to better interpretability in many applications. For this purpose, we consider the following -regularized optimization model for sparse PCA:
(2)
where the (non-operator) -norm regularizer is imposed to promote sparsity in , and is the parameter used to control the amount of sparseness. Here, denotes the -th entry of the matrix . Our distributed approach can efficiently tackle (2) by pursuing sparsity and orthogonality at the same time.
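For concreteness, the two models can be written in the following standard form; the symbols $A$ (data matrix), $X$ (loading matrix), and $\mu$ (regularization weight) are generic notational choices of our own and may differ from the authors' original symbols.

```latex
% Standard PCA (1) and the l1-regularized sparse PCA model (2),
% stated in generic notation (A: n-by-m data, X: n-by-p loadings).
\begin{align*}
\text{(1)}\quad & \max_{X \in \mathbb{R}^{n\times p}} \ \operatorname{tr}\!\left(X^\top A A^\top X\right)
   \quad \text{s.t.}\ X^\top X = I_p, \\[4pt]
\text{(2)}\quad & \min_{X \in \mathbb{R}^{n\times p}} \ -\operatorname{tr}\!\left(X^\top A A^\top X\right)
   + \mu \|X\|_1
   \quad \text{s.t.}\ X^\top X = I_p,
   \qquad \|X\|_1 := \sum_{i,j} |X_{ij}| .
\end{align*}
```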
The optimization problem (2) is a penalized version of the SCoTLASS model proposed in [35]. Evidently, there is a significant difficulty gap in going from standard PCA to sparse PCA, in terms of both problem complexity and solution methodology. Standard PCA is polynomial-time solvable, whereas sparse PCA is NP-hard if the -regularizer is used to enforce sparsity [39], though the computational complexity of solving the -regularized model (2) remains unclear. In this paper we will be content with a theoretical result of convergence to first-order stationary points of (2). There are other formulations for sparse PCA, such as regression models [62], semidefinite programming [14, 15], and matrix decompositions [46, 53]. A comparative study of these formulations is beyond the scope of this paper. We refer interested readers to [63] for a recent survey of theoretical and computational results for sparse PCA.
1.1 Distributed setting
We consider the following distributed setting. The data matrix is divided into blocks, each containing a number of samples; namely, where so that . This is a natural setting since each sample is a column of . These submatrices , , are stored locally in locations, possibly having been collected at different locations by different agents, and all the agents are connected through a communication network. According to the splitting scheme of , the function can also be distributed into agents, namely,
(3)
In terms of network topology, we only need to assume that the network allows the global summation operations (say, through all-reduce type communications [43]) required by our algorithm. In particular, our algorithm operates well in the federated-learning setting [41], where all the agents are connected to a central server so that global summations can be readily carried out over the network. In this sense, we say our algorithm is a centralized algorithm.
For most distributed algorithms, since the amount of communications per iteration remains essentially the same, the total communication overhead is proportional to the required number of iterations regardless of the underlying network topology. Therefore, our goal is to devise an algorithm that converges fast in terms of iteration counts, while considerations on other network topologies and communication patterns are beyond the scope of the current paper.
In certain applications, such as those in healthcare and the financial industry [38, 60], preserving the privacy of local data is of primary importance. In this paper, we consider the following scenario: each agent wants to keep its local dataset (i.e., ) from being discovered by any other agent, including the center. In this situation, it is not an option to implement a pre-agreed encryption or a coordinated masking operation. For convenience, we will call such a privacy requirement intrinsic privacy. For an algorithm to preserve intrinsic privacy, publicly exchanged information must be safe in the sense that none of the local data matrices can be figured out by any means. We will show later that, in general, algorithms based on traditional methods with distributed matrix multiplications are not intrinsically private.
1.2 Related works
Optimization problems with orthogonality constraints have been actively investigated in recent decades, and many algorithms and solvers have been developed for them, such as gradient approaches [40, 42, 1, 3], conjugate gradient approaches [16, 45, 61], constraint-preserving updating schemes [52, 32], Newton methods [30, 29], trust-region methods [2], multiplier-correction frameworks [21, 49], and orthonormalization-free approaches [22, 55]. These algorithms are designed for smooth objective functions and are generally not suitable for problem (2).
There exist algorithms specifically designed for solving non-smooth optimization problems with orthogonality constraints, most of which adapt certain non-smooth optimization techniques to the Stiefel manifold; for instance, Riemannian subgradient methods [17, 18], proximal point algorithms [6], non-smooth trust-region methods [25], gradient sampling methods [28], and proximal gradient methods [11, 31]. Among these, proximal gradient methods are among the most efficient. Specifically, Chen et al. [11] propose a manifold proximal gradient method (ManPG) and its accelerated version ManPG-Ada for non-smooth optimization problems over the Stiefel manifold. Starting from a current iterate, say , ManPG and ManPG-Ada generate the next iterate by solving the following subproblem restricted to the tangent space :
(4)
which consumes the most computational time in the algorithm. In ManPG [11], the semi-smooth Newton (SSN) method [57] is deployed to solve the above subproblem, and global convergence to stationary points is established with a convergence rate of . Huang and Wei [31] later extend the framework of ManPG to general Riemannian manifolds beyond the Stiefel manifold. Another class of algorithms first introduces auxiliary variables to split the objective function and the orthogonality constraint, and then utilizes alternating minimization techniques to solve the resulting model; this class includes SOC [37], MADMM [36], and PAMAL [12]. It is worth mentioning that SOC and MADMM lack a convergence guarantee. As for PAMAL, although convergence is guaranteed, its numerical performance depends heavily on its parameters, as discussed in [11].
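For reference, the ManPG subproblem described above takes the following standard form; the symbols $f$, $X^k$, $t$, and $\mu$ are generic placeholders of our own choosing, and the constraint expresses membership in the tangent space to the Stiefel manifold at $X^k$.

```latex
% ManPG-type subproblem (4) in generic notation: a proximal-gradient step
% restricted to the tangent space of the Stiefel manifold at X^k.
\begin{equation*}
  V^k \;=\; \arg\min_{V}\;
    \langle \nabla f(X^k),\, V\rangle \;+\; \frac{1}{2t}\,\|V\|_F^2 \;+\; \mu\,\|X^k + V\|_1
  \quad \text{s.t.}\quad (X^k)^\top V + V^\top X^k = 0 .
\end{equation*}
```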
In principle, all the aforementioned algorithms can be adapted to the distributed setting considered in this paper. Take ManPG as an example. Under the distributed setting, each agent computes its local product individually; then one round of communication gathers the global sum at the center. The center then solves subproblem (4) and scatters the updated back to all agents. At each iteration, distributed computation is essentially limited to the matrix-multiplication level.
We point out that, without prior data-masking, distributed algorithms at the matrix-multiplication level cannot preserve local-data privacy intrinsically. Specifically, local data , privately owned by agent , can be uncovered by anyone who has access to publicly exchanged products , after collecting enough such products and then solving a system of linear equations for the “unknown” . This idea works even for more complicated iterative procedures as long as a mapping from data to a publicly exchanged quantity is deterministic and known.
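As a concrete illustration of this equation-solving attack, the following minimal Python/NumPy sketch recovers the local Gram matrix $A_i A_i^\top$ (which is what the shared products determine) from a collection of publicly observed pairs; the function and variable names are ours, and the exact linear system (5) used in the paper may be formulated somewhat differently.

```python
import numpy as np

def recover_gram_matrix(X_list, P_list):
    """Recover the unknown n-by-n matrix M = A_i A_i^T from observed pairs
    (X^k, P^k) satisfying P^k = M X^k, collected over the iterations of a
    matrix-multiplication-level distributed method.  Stacking the iterates
    column-wise gives M [X^1 ... X^K] = [P^1 ... P^K], which is solved for M
    in the (minimum-norm) least-squares sense."""
    X_stack = np.hstack(X_list)   # n x (K*p), collected public iterates
    P_stack = np.hstack(P_list)   # n x (K*p), collected public products
    # M X_stack = P_stack  <=>  X_stack^T M^T = P_stack^T
    M_T, *_ = np.linalg.lstsq(X_stack.T, P_stack.T, rcond=None)
    return M_T.T
```

Once enough iterates have been collected so that the stacked matrix has full row rank, the least-squares solution coincides with the true Gram matrix, which is precisely the leakage illustrated in the experiment below.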
To illustrate this data-leakage situation, we now present a simple experiment on the algorithm ManPG-Ada [11]. The test environment will be described in Section 5. As mentioned earlier, at iteration the quantity publicly shared by agent in ManPG-Ada is , where is the -th iterate of ManPG-Ada, accessible to all agents. To recover the unknown local data , we construct the following linear system with unknown :
(5)
Let be the minimum-norm least-squares solution to the linear system (5) at iteration . We perform an experiment to illustrate that quickly converges to when ManPG-Ada is deployed to solve the sparse PCA problem on a randomly generated test matrix with and (the other parameters are , , and ). Figure 1 depicts how the reconstruction error and the stationarity violation vary, on a logarithmic scale, as the number of iterations increases. (Footnote: suppose is the solution to (4); according to Lemma 5.3 in [11], is a first-order stationary point if ; therefore, the stationarity violation is defined as .) We observe that the local data is recovered to high accuracy much faster than the sparse PCA problem itself is solved.

In order to handle distributed datasets, some distributed ADMM-type algorithms have been introduced for PCA and related problems. These methods achieve algorithm-level parallelization and are generally more secure than those based on matrix-multiplication-level parallelization. For instance, [26] proposes a distributed algorithm for sparse PCA with convergence to stationary points, but only studies the special case of . Moreover, under the distributed setting, [51] develops a subspace-splitting strategy to tackle the smooth optimization problem over the Grassmann manifold, i.e., without the -norm regularizer in (2).
Recently, decentralized optimization has attracted attention, partly due to its wide applications in wireless sensor networks. Only local communications are carried out in decentralized optimization; namely, agents exchange information only with their immediate neighbors. This is quite different from the distributed setting considered in the current paper. We refer interested readers to [23, 19, 20, 4, 59, 10, 50] for some decentralized PCA algorithms.
1.3 Main contributions
Recently, a subspace-splitting strategy has been proposed [51] to accelerate the convergence of distributed algorithms for optimization problems over Grassmann manifolds whose objective functions are orthogonal-transformation invariant (a function is called orthogonal-transformation invariant if for any and ) and smooth. In regularized problems such as sparse PCA, orthogonal-transformation invariance and smoothness no longer hold. In this paper, we present a non-trivial extension of the subspace-splitting strategy to more general optimization problems over Stiefel manifolds. In particular, by incorporating the subspace-splitting strategy into the framework of ManPG [11], we have developed a distributed algorithm, DSSAL1, to solve the -regularized optimization problem (2) for sparse PCA.
The main step of DSSAL1 is to solve the subproblem:
(6)
which is identical to subproblem (4) in ManPG except that each local data matrix is replaced, or masked, by a matrix whose expression will be derived later.
In a sense, the main contributions of this paper can be attributed to this remarkably simple replacement or masking, which brings two great benefits. Firstly, convergence will be significantly accelerated which will be shown empirically in Section 5. Secondly, publicly exchanged products are no longer images of some fixed and known mapping of , making it practically impossible to uncover from such products via equation-solving.
Several innovations have been made in the development of our algorithm. Firstly, only the global variable can possibly converge to a solution, while the local variables never do; the role of the local variables, which are generally dense, is to help identify the right subspace. Secondly, we devise an inexact-solution strategy to effectively handle the difficult subproblem for the global variable, in which orthogonality and sparsity are pursued simultaneously. Thirdly, we establish global convergence to stationarity for our algorithm, overcoming a number of technical difficulties associated with the rather complex algorithm construction, as well as the non-convexity and non-smoothness of problem (2).
1.4 Notations
The Euclidean inner product of two matrices of the same size is defined as . The Frobenius norm and the spectral norm of a given matrix are denoted by and , respectively. Given a differentiable function , the gradient of with respect to is denoted by , and the subdifferential of a Lipschitz continuous function is denoted by . The tangent space to the Stiefel manifold at is represented by , and the orthogonal projection of onto is denoted by . For , we define , standing for the projection operator onto the null space of . Further notation will be introduced as it occurs.
1.5 Organization
The rest of this paper is organized as follows. In Section 2, we introduce a novel subspace-splitting model for sparse PCA, and investigate the structure of associated Lagrangian multipliers. Then a distributed algorithm is proposed to solve this model based on an ADMM-like framework in Section 3. Moreover, convergence properties of the proposed algorithm are studied in Section 4. Numerical experiments on a variety of test problems are presented in Section 5 to evaluate the performance of the proposed algorithm. We draw final conclusions and discuss some possible future developments in the last section.
2 Subspace-splitting model for sparse PCA
We consider the scenario in which the -th component function of in (3) can be evaluated only by the -th agent, since the local dataset is accessible only to the -th agent. In order to devise a distributed algorithm, the classic variable-splitting approach is to introduce a set of local variables to make the sum of local functions nominally separable. A (centralized) distributed algorithm would then maintain a global variable and impose the variable-consensus constraints .
Although the regularizer term in (2) is not, all component functions in are still invariant under orthogonal transformations. It is therefore natural to adapt the subspace-splitting idea introduced in [51], that is, to use the subspace-consensus constraints to accelerate convergence. In this paper, we propose to solve the following optimization problem:
(7a)
(7b)
(7c)
(7d)
which we will call the subspace-splitting model for problem (2), noting that both sides of (7c) are orthogonal projections onto subspaces. For brevity, we collect the global variable and all local variables into a point . A point is feasible if it satisfies the constraints (7b)-(7d).
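Since the constraints are easier to discuss with explicit symbols, the subspace-splitting model can be sketched as follows, where $X$ denotes the global variable, $X_1,\dots,X_d$ the local variables held by the $d$ agents, and $\mu$ the regularization weight; these symbols, and the particular labeling of the two orthogonality constraints as (7b) and (7d), are our assumptions rather than the authors' original notation.

```latex
% Sketch of the subspace-splitting model (7) in generic notation.
\begin{align*}
\min_{X,\,X_1,\dots,X_d}\;\; & \sum_{i=1}^{d} f_i(X_i) \;+\; \mu\,\|X\|_1 && \text{(7a)}\\
\text{s.t.}\;\; & X_i^\top X_i = I_p, \quad i = 1,\dots,d, && \text{(7b)}\\
 & X_i X_i^\top = X X^\top, \quad i = 1,\dots,d, && \text{(7c)}\\
 & X^\top X = I_p, && \text{(7d)}
\end{align*}
```

where both sides of (7c) are indeed orthogonal projections onto the column spaces of the respective orthonormal matrices, as noted above.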
2.1 Stationarity conditions
In this subsection, we present the first-order stationarity conditions of the sparse PCA problem (2). We first introduce the definition of the Clarke subgradient [13] for non-smooth functions.
Definition 2.1.
Suppose is a Lipschitz continuous function. The generalized directional derivative of at the point along the direction is defined by:
Based on the generalized directional derivative of , the (Clarke) subgradient of is defined by:
As discussed in [58, 11], the first-order stationarity condition of (2) can be stated as:
We provide an equivalent description of the above first-order stationarity condition, which will be used in the theoretical analysis.
Lemma 2.2.
A point is a first-order stationary point of (2) if and only if there exists such that the following conditions hold:
(8)
2.2 Existence of low-rank multipliers
By associating dual variables , , and to the constraints (7b), (7c), and (7d), respectively, we derive an equivalent description of first-order stationarity conditions of (7).
Proposition 2.4.
Suppose and . A point is a first-order stationary point of (7) if and only if there exist symmetric matrices , , and such that satisfies the following condition:
(9)
The proof of Proposition 2.4 is relegated to Appendix B. Actually, the equations in (9) can be viewed as the KKT conditions of (7), while , , and are the Lagrangian multipliers corresponding to the constraints (7b), (7c), and (7d), respectively.
In [51], a low-rank and closed-form multiplier is devised with respect to the subspace constraint (7c) for the PCA problems, which can be expressed as:
(10)
In Appendix B, we further verify that the above formulation remains valid for the sparse PCA problem at any first-order stationary point . In the next section, we will use (10) to update the multipliers in our ADMM framework. This strategy reduces both the computational cost and the storage requirement.
3 Algorithmic framework
Now we describe the proposed algorithm, based on an ADMM-like framework, for solving the subspace-splitting model (7). Note that there are three groups of constraints, (7b)-(7d). We penalize only the subspace constraints (7c) in the objective function and obtain the corresponding augmented Lagrangian function:
(11)
where
and are penalty parameters. Schematically, we will follow the ADMM-like framework below to build an algorithm for solving subspace-splitting model (7), though we quickly add that the two optimization subproblems below in (12) and (13) are not “solved” in a normal sense since we may stop after a single iteration of an iterative scheme.
(12)
(13)
(14)
(15)
In the above framework, the superscript counts the number of iterations, and the subscript indicates the number of agents.
A novel feature of our algorithm is the way the multipliers associated with the subspace constraints (7c) are updated. Generally speaking, in augmented-Lagrangian-based approaches, the multipliers are updated by the dual ascent step
where is the step size. This standard approach would require storing and updating an matrix at each agent, which could be costly when is large. Instead, we use the low-rank updating formulas (14)-(15) based on the closed-form expression (10) derived in Section 2.2. In our iterative setting, these multiplier matrices are never stored explicitly but only used in matrix multiplications, in which the additional storage required by agent is that of the matrix .
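To illustrate the storage saving, the sketch below shows how a low-rank multiplier can be applied on the fly without ever forming the full square matrix; the factor names L and R are hypothetical placeholders, since the actual factors prescribed by (10) and (14)-(15) are not reproduced here.

```python
import numpy as np

def apply_lowrank_multiplier(L, R, Y):
    """Apply the multiplier Lambda = L @ R.T to Y without forming the n-by-n
    matrix Lambda explicitly: Lambda @ Y = L @ (R.T @ Y).  With n-by-p factors
    this costs O(n p^2) flops and O(n p) extra storage instead of O(n^2)."""
    return L @ (R.T @ Y)
```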
3.1 Subproblem for global variable
We now describe how to approximately solve subproblems (12) and (13). By rearrangement, subproblem (12) reduces to:
(16)
where and is an matrix defined by
(17)
We quickly point out here that it is not necessary to construct and store these -matrices explicitly since we will only use them to multiply matrices in an iterative scheme to obtain approximate solutions to subproblem (16). More details will follow later.
Apparently, subproblem (16) pursues orthogonality and sparsity simultaneously and is no easier to solve than the original problem (2). However, we will demonstrate later, by both theoretical analysis and numerical experiments, that solving (16) inexactly by a single proximal gradient step is adequate for global convergence.
Starting from the current iterate , we first find a descent direction restricted to the tangent space by solving the following subproblem
(18)
where is the step size. Since does not necessarily lie on the Stiefel manifold , we then perform a projection to bring it back to , which can be represented as:
Here, the orthogonal projection of a matrix onto is denoted by , where is the economic form of the singular value decomposition of .
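A minimal sketch of this SVD-based projection step in Python/NumPy is given below; it assumes the input matrix has full column rank.

```python
import numpy as np

def project_to_stiefel(Y):
    """Project Y onto the Stiefel manifold: with the economy-size SVD
    Y = U diag(s) V^T, the closest matrix with orthonormal columns (in the
    Frobenius norm) is U V^T."""
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ Vt
```

For instance, applying this projection to the sum of the current iterate and the computed tangent-space direction brings the updated point back onto the manifold.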
Remark 1.
We note that the subproblems solved by ManPG [11] are identical in form to our subproblem (18) but with replaced by the data matrix . That is, ManPG applies manifold proximal gradient steps to a fixed problem, while our algorithm computes steps of the same type using a sequence of matrices, each being updated to incorporate latest information. From this point of view, our algorithm can be interpreted as an acceleration scheme to reduce the number of outer-iterations, thereby reducing communication overheads.
Since these matrices are maintained distributively by the agents, no single agent is able to solve subproblem (18) independently. Fortunately, we only need to calculate
(19)
to solve this subproblem. In the distributed setting, the right-hand side of (19) can be formed by computing in each agent and then invoking an all-reduce type of communication, in which each agent shares just one matrix. If the all-reduce operation is realized by the butterfly algorithm [43], the communication overhead per iteration is . Furthermore, each local product can be computed from
(20)
with a computational cost in the order of . From the above formula, one observes that the matrices need not be stored explicitly.
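The communication pattern described above can be sketched with mpi4py as follows; here the local product is written as $A_i(A_i^\top X)$ merely as a stand-in, since the actual masked product in DSSAL1 involves the matrix defined in (17) and computed via (20).

```python
import numpy as np
from mpi4py import MPI

def global_sum_of_local_products(A_local, X, comm=MPI.COMM_WORLD):
    """Each agent holds its private sample block A_local and the common
    iterate X.  It forms an n-by-p local product, and a single all-reduce
    returns the global sum to every agent, so only one n-by-p matrix is
    communicated per agent per round."""
    local = A_local @ (A_local.T @ X)   # stand-in for the masked product in (19)-(20)
    total = np.empty_like(local)
    comm.Allreduce(local, total, op=MPI.SUM)
    return total
```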
Now we consider how to efficiently solve subproblem (18). By associating a multiplier to the linear equality constraint, the Lagrangian function of (18) can be written as:
We choose to apply the Uzawa method [5] to solve subproblem (18). At the -th inner iteration, we first minimize the above Lagrangian function with respect to for a fixed :
(21)
where and denote the -th inner iterate of and , respectively. Here, we use to denote the proximal mapping of a given function at the point , which is defined by:
For the -norm regularizer term , the proximal mapping in (21) admits a closed-form solution:
where the subscript represents the -th entry of a matrix. Then the multiplier is updated by a dual ascent step:
(22)
where is the step size. These two steps are repeated until convergence. The complete framework is summarized in Algorithm 1.
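A compact sketch of this inner solver is given below. It assumes that subproblem (18) has the same tangent-space proximal form as the ManPG subproblem recalled in Section 1.2; the names G (gradient of the smooth part of (16) at the current iterate), t (proximal parameter), mu (regularization weight), and tau (dual step size) are generic choices of our own.

```python
import numpy as np

def soft_threshold(Z, thr):
    """Closed-form proximal mapping of the (entrywise) l1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - thr, 0.0)

def uzawa_tangent_prox(X, G, t, mu, tau, max_iter=100, tol=1e-8):
    """Approximately solve
        min_V  <G, V> + 1/(2t)||V||_F^2 + mu*||X + V||_1
        s.t.   X^T V + V^T X = 0
    by alternating an exact primal minimization (a soft-thresholding step)
    with a dual-ascent step on the symmetric multiplier Lam."""
    p = X.shape[1]
    Lam = np.zeros((p, p))
    V = np.zeros_like(X)
    for _ in range(max_iter):
        # primal step: minimize the Lagrangian over V for fixed Lam
        W = soft_threshold(X - t * (G + 2.0 * X @ Lam), t * mu)
        V = W - X
        # dual ascent on the tangent-space constraint
        R = X.T @ V + V.T @ X
        Lam = Lam + tau * R
        if np.linalg.norm(R) <= tol:
            break
    return V
```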
The Uzawa method can be viewed as a special case of the primal-dual hybrid gradient algorithm (PDHG) developed in [27] with the convergence rate in the ergodic sense under mild conditions. Moreover, as an inner solver it can bring a higher overall efficiency than the SSN method used by ManPG [11] (see [56] for a recent study).
3.2 Subproblems for local variables
In this subsection, we focus on the subproblem for the -th local variable , which can be rearranged into the following equivalent problem:
(23)
Here, is an real symmetric matrix:
(24)
which is only related to the local data . This is a standard eigenvalue problem where one needs to compute a -dimensional dominant eigenspace of .
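For instance, a single step of simultaneous subspace iteration (SSI), which the paper uses as the inexact local solver in Section 3.5, can be sketched as follows; the symbol M stands for the symmetric local matrix defined in (24), whose explicit form is not reproduced here.

```python
import numpy as np

def ssi_step(M, X):
    """One step of simultaneous subspace iteration: multiply the current
    orthonormal basis X by the symmetric matrix M and re-orthonormalize the
    result with a thin QR factorization."""
    Q, _ = np.linalg.qr(M @ X)
    return Q
```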
As a subproblem, (23) need not be solved to high precision. In practice, we just need to find a point satisfying the following two conditions, which empirically suffice for a good inexact solution and which guarantee the global convergence of the whole algorithm. The first condition demands a sufficient decrease in the function value:
(25)
where and are two constants independent of . The second condition is a sufficient decrease in KKT violation:
(26)
where is a constant independent of . It turns out that these two rather weak termination conditions for subproblem (23) are sufficient for us to derive global convergence of our ADMM-like algorithm framework (12)-(15).
3.3 Algorithm description
We formally present the detailed algorithmic framework as Algorithm 2 below, named distributed subspace splitting algorithm with regularization and abbreviated DSSAL1. In the distributed environment, all agents start from the same point , and the initial multipliers are computed by (15). After initialization, all agents first solve the common subproblem for collaboratively through a certain communication strategy. Then each agent solves its subproblem for and updates its multiplier . These two steps involve only the local data privately stored at each agent, and hence can be carried out by all agents concurrently. This procedure is repeated until convergence.
3.4 Data privacy
We claim that DSSAL1 can naturally protect the intrinsic privacy of local data. To form the global sum in (19), the shared information in DSSAL1 at iteration from the -th agent is where is known to all agents. However, the system of equations, , is insufficient for obtaining the mask matrix which changes from iteration to iteration. Secondly, even if a few mask matrices were unveiled, it would still be impossible to derive the local data matrix from these without knowing corresponding (and ) which are always kept privately by the -th agent. Finally, consider the ideal “converged” case where held at iteration , and were known. In this case, would be a known linear function of parameterized by . Still, the system would not be sufficient to uncover the local data matrix (strictly speaking, one only needs to recover entries since is symmetric). Based on this discussion, we call DSSAL1 a privacy-aware method.
3.5 Computational cost
We conclude this section by discussing the computational cost of our algorithm per iteration. We first compute the matrix multiplication by (19) and (20), whose computational cost for each agent is as mentioned earlier. Then, at the center, the Uzawa method is applied to solving subproblem (18) which has a per-iteration complexity . In practice, it usually takes very few iterations to generate . Next, each agent uses a single iteration of SSI to generate by (27) with the computational cost as discussed before. Finally, agent updates by (14) with the computational cost (which represents the multiplier matrix implicitly). Overall, for each agent, the computational cost of our algorithm is per iteration. At the center, the computational cost for approximately solving (18) is, empirically speaking, .
4 Convergence analysis
In this section, we analyze the global convergence of Algorithm 2 and prove that a sequence generated by Algorithm 2 has at least one accumulation point, and any accumulation point is a first-order stationary point. A global, sub-linear convergence rate is also established.
We start with a property that a feasible point is first-order stationary if no progress can be made by solving (18).
Lemma 4.1.
The proof of Lemma 4.1 is deferred to Appendix C. It motivates the following definition of an -stationary point for the subspace-splitting model (7).
Definition 4.2.
Suppose is the -th iterate of Algorithm 2. Then is called an -stationary point if the following condition holds:
where is a small constant.
In order to prove the convergence of our algorithm, we need to impose some mild conditions on algorithm parameters, which are summarized below.
Condition 1.
The algorithm parameters , , , , as well as two auxiliary parameters and , satisfy the following conditions:
where is a constant.
Condition 2.
We note that the above technical conditions are not necessary in a practical implementation; they are introduced purely for the purpose of theoretical analysis, to facilitate obtaining a global convergence rate and the corresponding worst-case complexity for Algorithm 2.
Theorem 4.3.
Suppose is an iterate sequence generated by Algorithm 2, starting from an arbitrary orthogonal matrix , with parameters satisfying Conditions 1 and 2. Then has at least one accumulation point and any accumulation point must be a first-order stationary point of the sparse PCA problem (2). Moreover, for any integer , it holds that
where is a constant.
The proof of Theorem 4.3, which is rather long and complicated, will be given in Appendix D. The global sub-linear convergence rate in Theorem 4.3 guarantees that DSSAL1 is able to return an -stationary point in at most iterations. Since DSSAL1 performs one round of communication per iteration, the number of communication rounds required to obtain an -stationary point is also at most .
5 Numerical results
In this section, we evaluate the empirical performance of DSSAL1 through a set of comprehensive numerical experiments. All the experiments in this section are performed on a high-performance computing cluster called LSSC-IV (more information is available at http://lsec.cc.ac.cn/chinese/lsec/LSSC-IVintroduction.pdf), which is maintained at the State Key Laboratory of Scientific and Engineering Computing (LSEC), Chinese Academy of Sciences. The LSSC-IV cluster has 408 nodes, each consisting of two Intel(R) Xeon(R) Gold 6140 processors (at GHz) with GB of memory, running under the operating system Red Hat Enterprise Linux Server 7.3.
We compare the performance of DSSAL1 with two state-of-the-art algorithms: (1) an ADMM-type algorithm called SOC [37] and (2) a manifold proximal gradient method called ManPG-Ada [11]. Since open-source parallel codes for these two algorithms are not available, to conduct experiments in the distributed environment of the LSSC-IV cluster we implemented the two existing algorithms and our own algorithm DSSAL1 in C++ with MPI for inter-process communication (our code is downloadable from http://lsec.cc.ac.cn/~liuxin/code.html). For the two existing algorithms, we set all parameters to their default values as described in [37, 11]. The linear algebra library Eigen (version 3.3.8), available from https://eigen.tuxfamily.org/index.php?title=Main_Page, is adopted for matrix computation tasks.
5.1 DSSAL1 Implementation details
In Algorithm 2, we set the penalty parameters to , and in subproblem (18) we set the hyperparameter to . In Algorithm 1, we set the step size to and terminate the algorithm whenever
is satisfied or the number of iterations reaches .
5.2 Synthetic data generation
In our experiments, a synthetic data matrix is constructed into the form of (economy-size) SVD:
(29)
where and satisfy and is nonnegative and diagonal. Specifically, and are results of orthonormalization of random matrices whose entries are drawn independently and identically from under the uniform distribution, and
where the parameter determines the decay rate of the singular values. Finally, we apply the standard PCA pre-processing operations to a data matrix by subtracting the sample mean from each sample and then normalizing the rows of the resulting matrix to unit -norm. For our synthetic data matrices, such pre-processing only slightly perturbs the singular-value decay rate, which before pre-processing is uniformly equal to by construction. Unless specified otherwise, the default value of the decay-rate parameter is .
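A minimal generator matching this description might look as follows; the function signature, the rank parameter r, and the exact scaling of the singular values (here $\sigma_j = \xi^{\,j}$, giving a uniform decay ratio $\xi$) are our assumptions rather than the authors' exact construction.

```python
import numpy as np

def synthetic_data(n, m, r, xi, seed=0):
    """Generate A = U diag(sigma) V^T with orthonormal U, V obtained by
    orthonormalizing uniformly random matrices and singular values decaying
    geometrically (sigma_j = xi**j), then apply the standard PCA
    pre-processing: subtract the sample mean and rescale each row to unit
    2-norm."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.uniform(size=(n, r)))
    V, _ = np.linalg.qr(rng.uniform(size=(m, r)))
    sigma = xi ** np.arange(r)
    A = U @ (sigma[:, None] * V.T)
    A = A - A.mean(axis=1, keepdims=True)             # subtract the sample mean
    A = A / np.linalg.norm(A, axis=1, keepdims=True)  # rows of unit 2-norm
    return A
```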
In the numerical experiments, all the algorithms are started from the same initial points. Since the optimization problem is non-convex, different solvers may still occasionally return different solutions when starting from a common, randomly chosen initial point. As suggested in [11], to increase the chance that all solvers find the same solution, we first run the Riemannian subgradient method [8, 18] for iterations and then use the resulting output as the common starting point.
5.3 Comprehensive comparison on synthetic data
To thoroughly evaluate the empirical performance of DSSAL1, we design four groups of test data matrices, generated as in Subsection 5.2. In each group, only one parameter varies while all the others are fixed. Specifically, the varying and fixed parameters for the four groups are as follows:
1. varying sample dimension for , while , , , ;
2. varying number of computed loading vectors for , while , , , ;
3. varying regularization parameter for , while , , , ;
4. varying number of cores for , while , , , .
Additional experimental results for varying will be presented in the next subsection. Numerical results obtained for the above four test groups are presented in Tables 1 to 4, respectively, where we record wall-clock times in seconds and total rounds of communication. The average function values and sparsity levels for the four groups of tests are provided in Table 5. When computing the sparsity of a solution matrix (i.e., the percentage of zero elements), we set a matrix element to zero when its absolute value is less than .
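For completeness, the sparsity level reported in Table 5 can be computed as in the snippet below; the truncation threshold used in the paper is not reproduced in the text above, so the value 1e-5 is only a placeholder.

```python
import numpy as np

def sparsity_level(X, tol=1e-5):
    """Fraction of entries of X treated as zero: an entry counts as zero when
    its absolute value falls below tol (placeholder threshold)."""
    return float(np.mean(np.abs(X) < tol))
```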
From Table 5, we see that all three algorithms have attained comparable solution qualities with similar function values and sparsity levels in the four groups of testing problems.
Sample dimension | DSSAL1 (time, s) | ManPG-Ada (time, s) | SOC (time, s) | DSSAL1 (rounds) | ManPG-Ada (rounds) | SOC (rounds) |
---|---|---|---|---|---|---
1000 | 10.97 | 26.76 | 72.07 | 655 | 1794 | 7569 |
1500 | 6.61 | 16.12 | 57.75 | 223 | 642 | 2201 |
2000 | 49.22 | 172.32 | 725.63 | 1238 | 5054 | 20444 |
2500 | 181.97 | 412.51 | 2296.26 | 3753 | 10971 | 45707 |
3000 | 34.58 | 153.40 | 860.76 | 680 | 2789 | 11480 |
Loading vectors | DSSAL1 (time, s) | ManPG-Ada (time, s) | SOC (time, s) | DSSAL1 (rounds) | ManPG-Ada (rounds) | SOC (rounds) |
---|---|---|---|---|---|---
10 | 9.74 | 26.42 | 75.25 | 629 | 1622 | 6652 |
15 | 29.99 | 56.31 | 153.98 | 1728 | 3586 | 15865 |
20 | 110.26 | 239.51 | 466.22 | 6144 | 14107 | 44086 |
25 | 68.79 | 148.34 | 334.58 | 3030 | 6153 | 27060 |
30 | 110.16 | 173.11 | 204.87 | 5133 | 6966 | 14621 |
Regularization parameter | DSSAL1 (time, s) | ManPG-Ada (time, s) | SOC (time, s) | DSSAL1 (rounds) | ManPG-Ada (rounds) | SOC (rounds) |
---|---|---|---|---|---|---
0.2 | 25.05 | 59.46 | 430.37 | 1537 | 4140 | 50000 |
0.4 | 22.16 | 46.76 | 115.36 | 1393 | 3061 | 13680 |
0.6 | 11.61 | 29.08 | 81.82 | 838 | 1899 | 8278 |
0.8 | 10.09 | 40.07 | 66.46 | 733 | 3369 | 8121 |
1.0 | 9.12 | 19.80 | 56.33 | 655 | 1348 | 6396 |
Number of cores | DSSAL1 (time, s) | ManPG-Ada (time, s) | SOC (time, s) | DSSAL1 (rounds) | ManPG-Ada (rounds) | SOC (rounds) |
---|---|---|---|---|---|---
16 | 161.72 | 168.36 | 383.14 | 897 | 1169 | 5175 |
32 | 106.95 | 135.81 | 312.20 | 808 | 1169 | 5175 |
64 | 54.33 | 68.89 | 167.27 | 753 | 1169 | 5175 |
128 | 24.28 | 35.54 | 96.04 | 683 | 1169 | 5175 |
256 | 9.17 | 14.74 | 47.61 | 660 | 1169 | 5175 |
Test | DSSAL1 (function value) | ManPG-Ada (function value) | SOC (function value) | DSSAL1 (sparsity) | ManPG-Ada (sparsity) | SOC (sparsity) |
---|---|---|---|---|---|---
Varying | -672.26 | -672.26 | -672.26 | 16.54% | 16.51% | 16.50% |
Varying | -360.52 | -360.51 | -360.46 | 40.42% | 40.46% | 40.32% |
Varying | -282.65 | -282.65 | -282.64 | 26.28% | 26.29% | 26.28% |
Varying | -302.02 | -301.97 | -301.97 | 22.13% | 22.21% | 22.21% |
It should be evident from Tables 1 to 4 that, in all four test groups and in terms of both wall-clock time and rounds of communication, DSSAL1 clearly outperforms ManPG-Ada, which in turn outperforms SOC by a large margin. Since the amount of communication per round is essentially the same for the three algorithms, their total communication overhead is proportional to the number of communication rounds required. Because DSSAL1 takes far fewer rounds of communication than the other two algorithms (often by considerable margins), we conclude that DSSAL1 is more communication-efficient than the other two. For example, in Table 3 for the case of , the number of communication rounds taken by DSSAL1 is less than a quarter of that of ManPG-Ada and one tenth of that of SOC.
5.4 Empirical convergence rate
In this subsection, we examine empirical convergence rates of iterates produced by DSSAL1 and ManPG-Ada for comparison, while SOC is excluded from this experiment given its obvious non-competitiveness in previous experiments.
In the following experiments, we fix , , , , and . Three synthetic matrices are randomly generated by (29) with taking three different values , , and , respectively, on which DSSAL1 and ManPG-Ada return and , respectively, with smaller-than-usual termination tolerances and in (28). We use the average of the two, , as a "ground truth" solution. Then we rerun the two algorithms on the same with the termination condition and record the quantity at each iteration.
In Figure 2, we plot the iterate-error sequences for both DSSAL1 and ManPG-Ada and observe that both algorithms appear to converge asymptotically at linear rates. Overall, however, the convergence of DSSAL1 is several times faster than that of ManPG-Ada. We also provide the ratio between the iteration number of ManPG-Ada and that of DSSAL1 for different values of in Table 6. In general, the closer is to one, the more slowly the singular values of decay, and the more difficult the problem tends to be. Table 6 demonstrates that in our tests the advantage of DSSAL1 becomes more and more pronounced as the test instances become more difficult to solve.



6 Conclusion
In this paper, we propose a distributed algorithm, called DSSAL1, for solving the -regularized optimization model (2) for sparse PCA computation. DSSAL1 has the following features.
1. The algorithm successfully extends the subspace-splitting strategy from orthogonal-transformation-invariant objective functions to the sum , where can be non-invariant and non-smooth.
2. The algorithm has a built-in mechanism for local-data masking and hence naturally protects local-data privacy without requiring a special privacy-preservation process.
3. The algorithm is storage-efficient in that, besides the local data, each agent only needs to store matrices (usually ), thanks to a low-rank multiplier formula.
4. The algorithm has a global convergence guarantee to stationary points and a worst-case complexity under mild conditions, in spite of the nonlinear equality constraints for subspace consensus.
Comprehensive numerical simulations are conducted in a distributed environment to evaluate the performance of our algorithm in comparison with two state-of-the-art approaches. Remarkably, the number of communication rounds required by DSSAL1 is often more than one order of magnitude smaller than that of the existing methods. These results indicate that DSSAL1 has great potential for solving large-scale application problems in distributed environments where data privacy is a primary concern.
Finally, we mention two related topics worthy of future studies. One is the possibility of developing asynchronous approaches for sparse PCA to address load balance issues in distributed environments. Another is to extend the subspace splitting strategy to decentralized networks so that a wider range of applications can benefit from the effectiveness of this approach.
Appendix A Proof of Lemma 2.2
Proof of Lemma 2.2.
According to the definition of , it follows that
where . The proof is completed. ∎
Appendix B Proof of Proposition 2.4
Proof of Proposition 2.4.
To begin with, we assume that is a first-order stationary point. Then there exists such that
and is symmetric. Let , , and
with . Then the matrices , and are symmetric and . Moreover, we can deduce that
and
Hence, satisfies the conditions in (9) under these specific choices of , and .
Conversely, we now assume that there exist and symmetric matrices , and such that satisfies the conditions in (9). It follows from the first and second equality in (9) that
At the same time, since , we have
Combining the above two equalities and orthogonality of , we arrive at
Left-multiplying both sides of the second equality in (9) by , we obtain that
which together with the symmetry of and implies that is also symmetric. This completes the proof. ∎
Appendix C Proof of Lemma 4.1
Appendix D Convergence of Algorithm 2
Now we prove Theorem 4.3 to establish the global convergence of Algorithm 2. In addition to the notation introduced in Section 1, we further adopt the following throughout the theoretical analysis. The notations and represent the rank and the smallest singular value of , respectively. For , we define and , standing for, respectively, the projection distance matrix and its measurement.
To begin with, we provide a sketch of our proof. Suppose is the iteration sequence generated by Algorithm 2, with and being the local variable and multiplier of the -th agent at the -th iteration, respectively. The proof includes the following main steps.
1. The sequence is bounded and the sequence is bounded from below.
2. The sequence satisfies , and is a uniform lower bound on the smallest singular values of the matrices .
3. The sequence is monotonically non-increasing, and hence convergent.
4. The sequence has at least one accumulation point, and any accumulation point is a first-order stationary point of the sparse PCA problem (2).
Next we verify all the items in the above sketch by proving the following lemmas and corollaries.
Lemma D.1.
Suppose is the iterate sequence generated by Algorithm 2. Let
Then the following relationship holds for any ,
Proof.
Lemma D.2.
Suppose and . Then it holds that
and
Proof.
The proof can be found in, for example, [33]. For the sake of completeness, we provide a proof here. It follows from the orthogonality of and the skew-symmetry of that has full column rank. This yields that , where . Since , we have
where are the singular values of . Similarly, it follows from the relationship that
which completes the proof. ∎
Corollary D.3.
Proof.
Firstly, it can be readily verified that
Let be the smooth part of the objective function in (16). Since is Lipschitz continuous with the corresponding Lipschitz constant , we have
It follows from Lemma D.2 that
and
Combining the above three inequalities, we can obtain that
It follows from Lemma D.1 that
which implies that
This together with the Lipschitz continuity of yields that
Here, is a constant defined in Section 4. According to Condition 1, we know that . Hence, we finally arrive at
This completes the proof. ∎
Lemma D.4.
Proof.
The inequality (31) directly results in the following relationship.
According to the definition of , it follows that
By straightforward calculations, we can deduce that
and
The above three inequalities yield that
which further implies that
This completes the proof. ∎
Lemma D.5.
Proof.
It follows from Condition 2 that , which together with (25) yields that
(35)
And it can be checked that
(36)
Suppose are the singular values of . It is clear that for any due to the orthogonality of and . On the one hand, we have
On the other hand, it follows from that
Let . By straightforward calculations, we can derive that
(37)
Combining (35), (36) and (37), we acquire the assertion (33). Then it follows from the definition of that
By straightforward calculations, we can obtain that
and
The above three relationships yield (34). We complete the proof. ∎
Lemma D.6.
Proof.
We use mathematical induction to prove this lemma. To begin with, it follows from the inequality (32) that
under the relationship in Condition 2. Thus, the argument (38) directly holds for . Now, we assume the argument holds at , and investigate the situation at .
According to Condition 2, we have . Since we assume that , it follows from the relationship (34) that
which implies that . Similarly to the proof of Lemma D.5, we can show that
(39)
Combining the condition (26) and the equality (36), we have
(40)
On the other hand, it follows from the triangle inequality that
Combining the inequality (39), it can be verified that
Moreover, according to Lemma B.4 in [51], we have
Combining the above three inequalities, we further obtain that
Together with (40), this yields that
According to Conditions 1 and 2, we have and . Thus, it can be verified that
(41)
where the last inequality follows from the relationship in Condition 2. This together with (32) and (38) yields that
since we assume that in Condition 2. The proof is completed. ∎
Corollary D.7.
Corollary D.8.
Proof.
Now, based on these lemmas and corollaries, we can demonstrate the monotonic non-increase of , which results in the global convergence of our algorithm.
Proposition D.9.
Suppose is the iteration sequence generated by Algorithm 2 initiated from , and problem parameters satisfy Conditions 1 and 2. Then the sequence of augmented Lagrangian functions is monotonically non-increasing, and for any , it satisfies the following sufficient descent property:
(42)
where is a constant.
Proof.
Based on the above properties, we are ready to prove Theorem 4.3, which establishes the global convergence rate of our proposed algorithm.
Proof of Theorem 4.3.
The whole sequence is naturally bounded, since each of and is orthogonal. It then follows from the Bolzano-Weierstrass theorem that this sequence has an accumulation point , where and . Moreover, the boundedness of results from the multiplier updating formula (15). Hence, the lower boundedness of follows from the continuity of the augmented Lagrangian function; namely, there exists a constant such that for all .
References
- [1] T. E. Abrudan, J. Eriksson, and V. Koivunen, Steepest descent algorithms for optimization under unitary matrix constraint, IEEE Transactions on Signal Processing, 56 (2008), pp. 1134–1147.
- [2] P.-A. Absil, C. G. Baker, and K. A. Gallivan, Trust-region methods on Riemannian manifolds, Foundations of Computational Mathematics, 7 (2006), pp. 303–330.
- [3] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2008.
- [4] F. L. Andrade, M. A. Figueiredo, and J. Xavier, Distributed Picard iteration: Application to distributed EM and distributed PCA, arXiv:2106.10665, (2021).
- [5] K. J. Arrow, H. Azawa, L. Hurwicz, and H. Uzawa, Studies in linear and non-linear programming, vol. 2, Stanford University Press, 1958.
- [6] M. Bacák, R. Bergmann, G. Steidl, and A. Weinmann, A second order nonsmooth variational model for restoring manifold-valued images, SIAM Journal on Scientific Computing, 38 (2016), pp. A567–A597.
- [7] T. Baden, P. Berens, K. Franke, M. R. Rosón, M. Bethge, and T. Euler, The functional diversity of retinal ganglion cells in the mouse, Nature, 529 (2016), pp. 345–350.
- [8] G. C. Bento, O. P. Ferreira, and J. G. Melo, Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds, Journal of Optimization Theory and Applications, 173 (2017), pp. 548–562.
- [9] G. Chen, P. F. Sullivan, and M. R. Kosorok, Biclustering with heterogeneous variance, Proceedings of the National Academy of Sciences, 110 (2013), pp. 12253–12258.
- [10] S. Chen, A. Garcia, M. Hong, and S. Shahrampour, Decentralized Riemannian gradient descent on the Stiefel manifold, in Proceedings of the 38th International Conference on Machine Learning, vol. 139, 2021, pp. 1594–1605.
- [11] S. Chen, S. Ma, A. Man-Cho So, and T. Zhang, Proximal gradient method for nonsmooth optimization over the Stiefel manifold, SIAM Journal on Optimization, 30 (2020), pp. 210–239.
- [12] W. Chen, H. Ji, and Y. You, An augmented lagrangian method for -regularized optimization problems with orthogonality constraints, SIAM Journal on Scientific Computing, 38 (2016), pp. B570–B592.
- [13] F. H. Clarke, Optimization and nonsmooth analysis, SIAM, 1990.
- [14] A. d’Aspremont, F. Bach, and L. El Ghaoui, Optimal solutions for sparse principal component analysis, Journal of Machine Learning Research, 9 (2008), pp. 1269–1294.
- [15] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet, A direct formulation for sparse PCA using semidefinite programming, SIAM Review, 49 (2007), pp. 434–448.
- [16] A. Edelman, T. A. Arias, and S. T. Smith, The geometry of algorithms with orthogonality constraints, SIAM Journal on Matrix Analysis and Applications, 20 (1998), pp. 303–353.
- [17] O. Ferreira and P. Oliveira, Subgradient algorithm on Riemannian manifolds, Journal of Optimization Theory and Applications, 97 (1998), pp. 93–104.
- [18] O. P. Ferreira, M. S. Louzeiro, and L. F. Prudente, Iteration-complexity of the subgradient method on Riemannian manifolds with lower bounded curvature, Optimization, 68 (2019), pp. 713–729.
- [19] A. Gang and W. U. Bajwa, A linearly convergent algorithm for distributed principal component analysis, arXiv:2101.01300, (2021).
- [20] , FAST-PCA: A fast and exact algorithm for distributed principal component analysis, arXiv:2108.12373, (2021).
- [21] B. Gao, X. Liu, X. Chen, and Y.-X. Yuan, A new first-order algorithmic framework for optimization problems with orthogonality constraints, SIAM Journal on Optimization, 28 (2018), pp. 302–332.
- [22] B. Gao, X. Liu, and Y.-X. Yuan, Parallelizable algorithms for optimization problems with orthogonality constraints, SIAM Journal on Scientific Computing, 41 (2019), pp. A1949–A1983.
- [23] I. Gemp, B. McWilliams, C. Vernade, and T. Graepel, Eigengame: PCA as a nash equilibrium, arXiv:2010.00554, (2020).
- [24] K. Gravuer, J. J. Sullivan, P. A. Williams, and R. P. Duncan, Strong human association with plant invasion success for Trifolium introductions to New Zealand, Proceedings of the National Academy of Sciences, 105 (2008), pp. 6344–6349.
- [25] P. Grohs and S. Hosseini, Nonsmooth trust region algorithms for locally Lipschitz functions on Riemannian manifolds, IMA Journal of Numerical Analysis, 36 (2016), pp. 1167–1192.
- [26] D. Hajinezhad and M. Hong, Nonconvex alternating direction method of multipliers for distributed sparse principal component analysis, in 2015 IEEE Global Conference on Signal and Information Processing, 2015, pp. 255–259.
- [27] B. He, Y. You, and X. Yuan, On the convergence of primal-dual hybrid gradient algorithm, SIAM Journal on Imaging Sciences, 7 (2014), pp. 2526–2537.
- [28] S. Hosseini and A. Uschmajew, A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds, SIAM Journal on Optimization, 27 (2017), pp. 173–189.
- [29] J. Hu, B. Jiang, L. Lin, Z. Wen, and Y.-X. Yuan, Structured quasi-Newton methods for optimization with orthogonality constraints, SIAM Journal on Scientific Computing, 41 (2019), pp. A2239–A2269.
- [30] J. Hu, A. Milzarek, Z. Wen, and Y. Yuan, Adaptive quadratically regularized Newton method for Riemannian optimization, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 1181–1207.
- [31] W. Huang and K. Wei, Riemannian proximal gradient methods, Mathematical Programming, (2021), pp. 1–43.
- [32] B. Jiang and Y.-H. Dai, A framework of constraint preserving update schemes for optimization on Stiefel manifold, Mathematical Programming, 153 (2015), pp. 535–575.
- [33] B. Jiang, S. Ma, A. M.-C. So, and S. Zhang, Vector transport-free SVRG with general retraction for Riemannian optimization: Complexity analysis and practical implementation, arXiv:1705.09059, (2017).
- [34] I. M. Johnstone and A. Y. Lu, On consistency and sparsity for principal components analysis in high dimensions, Journal of the American Statistical Association, 104 (2009), pp. 682–693.
- [35] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin, A modified principal component technique based on the LASSO, Journal of Computational and Graphical Statistics, 12 (2003), pp. 531–547.
- [36] A. Kovnatsky, K. Glashoff, and M. M. Bronstein, MADMM: A generic algorithm for non-smooth optimization on manifolds, in European Conference on Computer Vision, Springer, 2016, pp. 680–696.
- [37] R. Lai and S. Osher, A splitting method for orthogonality constrained problems, Journal of Scientific Computing, 58 (2014), pp. 431–449.
- [38] Y. Lou, L. Yu, S. Wang, and P. Yi, Privacy preservation in distributed subgradient optimization algorithms, IEEE Transactions on Cybernetics, 48 (2017), pp. 2154–2165.
- [39] M. Magdon-Ismail, NP-hardness and inapproximability of sparse PCA, Information Processing Letters, 126 (2017), pp. 35–38.
- [40] J. H. Manton, Optimization algorithms exploiting unitary constraints, IEEE Transactions on Signal Processing, 50 (2002), pp. 635–650.
- [41] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, Communication-efficient learning of deep networks from decentralized data, in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, PMLR, 2017, pp. 1273–1282.
- [42] Y. Nishimori and S. Akaho, Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold, Neurocomputing, 67 (2005), pp. 106–135.
- [43] P. S. Pacheco, An introduction to parallel programming, Elsevier, 2011.
- [44] H. Rutishauser, Simultaneous iteration method for symmetric matrices, Numerische Mathematik, 16 (1970), pp. 205–223.
- [45] H. Sato, A Dai–Yuan-type Riemannian conjugate gradient method with the weak Wolfe conditions, Computational Optimization and Applications, 64 (2016), pp. 101–118.
- [46] H. Shen and J. Z. Huang, Sparse principal component analysis via regularized low rank matrix approximation, Journal of Multivariate Analysis, 99 (2008), pp. 1015–1034.
- [47] K. Sjostrand, E. Rostrup, C. Ryberg, R. Larsen, C. Studholme, H. Baezner, J. Ferro, F. Fazekas, L. Pantoni, D. Inzitari, et al., Sparse decomposition and modeling of anatomical shape variation, IEEE Transactions on Medical Imaging, 26 (2007), pp. 1625–1635.
- [48] E. Stiefel, Richtungsfelder und fernparallelismus in n-dimensionalen mannigfaltigkeiten, Commentarii Mathematici Helvetici, 8 (1935), pp. 305–353.
- [49] L. Wang, B. Gao, and X. Liu, Multipliers correction methods for optimization problems over the Stiefel manifold, CSIAM Transactions on Applied Mathematics, 2 (2021), pp. 508–531.
- [50] L. Wang and X. Liu, Decentralized optimization over the Stiefel manifold by an approximate augmented Lagrangian function, IEEE Transactions on Signal Processing, 70 (2022), pp. 3029–3041.
- [51] L. Wang, X. Liu, and Y. Zhang, A distributed and secure algorithm for computing dominant SVD based on projection splitting, arXiv:2012.03461, (2020).
- [52] Z. Wen and W. Yin, A feasible method for optimization with orthogonality constraints, Mathematical Programming, 142 (2013), pp. 397–434.
- [53] D. M. Witten, R. Tibshirani, and T. Hastie, A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, Biostatistics, 10 (2009), pp. 515–534.
- [54] T. Wu, E. Hu, S. Xu, M. Chen, P. Guo, Z. Dai, T. Feng, L. Zhou, W. Tang, L. Zhan, et al., clusterProfiler 4.0: A universal enrichment tool for interpreting omics data, The Innovation, 2 (2021), p. 100141.
- [55] N. Xiao, X. Liu, and Y.-X. Yuan, A class of smooth exact penalty function methods for optimization problems with orthogonality constraints, Optimization Methods and Software, (2020), pp. 1–37.
- [56] N. Xiao, X. Liu, and Y.-x. Yuan, A penalty-free infeasible approach for a class of nonsmooth optimization problems over the Stiefel manifold, arXiv:2103.03514, (2021).
- [57] X. Xiao, Y. Li, Z. Wen, and L. Zhang, A regularized semi-smooth Newton method with projection steps for composite convex programs, Journal of Scientific Computing, 76 (2018), pp. 364–389.
- [58] W. H. Yang, L.-H. Zhang, and R. Song, Optimality conditions for the nonlinear programming problems on Riemannian manifolds, Pacific Journal of Optimization, 10 (2014), pp. 415–434.
- [59] H. Ye and T. Zhang, DeEPCA: Decentralized exact PCA with linear convergence rate, Journal of Machine Learning Research, 22 (2021), pp. 1–27.
- [60] C. Zhang, M. Ahmad, and Y. Wang, ADMM based privacy-preserving decentralized optimization, IEEE Transactions on Information Forensics and Security, 14 (2018), pp. 565–580.
- [61] X. Zhu, A Riemannian conjugate gradient method for optimization on the Stiefel manifold, Computational Optimization and Applications, 67 (2017), pp. 73–110.
- [62] H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, Journal of Computational and Graphical Statistics, 15 (2006), pp. 265–286.
- [63] H. Zou and L. Xue, A selective overview of sparse principal component analysis, Proceedings of the IEEE, 106 (2018), pp. 1311–1320.