A Linearly Convergent Algorithm for Distributed Principal Component Analysis
Abstract
Principal Component Analysis (PCA) is the workhorse tool for dimensionality reduction in this era of big data. While often overlooked, the purpose of PCA is not only to reduce data dimensionality, but also to yield features that are uncorrelated. Furthermore, the ever-increasing volume of data in the modern world often requires storage of data samples across multiple machines, which precludes the use of centralized PCA algorithms. This paper focuses on the dual objective of PCA, namely, dimensionality reduction and decorrelation of features, but in a distributed setting. This requires estimating the eigenvectors of the data covariance matrix, as opposed to only estimating the subspace spanned by the eigenvectors, when data is distributed across a network of machines. Although a few distributed solutions to the PCA problem have been proposed recently, convergence guarantees and/or communications overhead of these solutions remain a concern. With an eye towards communications efficiency, this paper introduces a feedforward neural network-based one time-scale distributed PCA algorithm termed Distributed Sanger’s Algorithm (DSA) that estimates the eigenvectors of the data covariance matrix when data is distributed across an undirected and arbitrarily connected network of machines. Furthermore, the proposed algorithm is shown to converge linearly to a neighborhood of the true solution. Numerical results are also provided to demonstrate the efficacy of the proposed solution.
Index Terms:
Dimensionality reduction, distributed feature learning, generalized Hebbian learning, principal component analysis
I Introduction
The modern era of machine learning involves leveraging massive amounts of high-dimensional data, which can have large computational and storage costs. To combat the complexities arising from the high dimensionality of data, dimensionality reduction and feature learning techniques play a pivotal and necessary role in information processing. The most common and widely used technique for this task is Principal Component Analysis (PCA) [2] which, in the simplest of terms, transforms data into uncorrelated features that aid conversion of data from a high-dimensional space to a low-dimensional space while retaining maximum information. Simultaneously, the sheer volume of available data makes it difficult to manage at a single location. There is an increasing number of scenarios in which data is distributed across different locations, either due to storage constraints or by its inherent nature, as in the Internet-of-Things [3]. This aspect of modern-world data has led researchers to explore distributed algorithms, which can process information across different locations/machines [4]. These issues have motivated us to study and develop algorithms for distributed PCA that are efficient in terms of computations and communications among multiple machines, and that can also be proven to converge at a fast rate.
When the data is available at a single location, one of the goals of PCA is to find a -dimensional subspace, given by the column space of a matrix , such that the zero-mean data samples retain maximum information when projected onto . In other words, when reconstructed as (subject to ), the data samples should have minimum reconstruction error. It can be shown that this minimal error solution is given by the projection of data onto the subspace spanned by the eigenvectors of the data covariance matrix. This implies that for dimensionality reduction, learning any basis of that subspace is sufficient. This is referred to as the subspace learning problem. But while simple dimension reduction does not necessarily need uncorrelated features, most downstream machine learning tasks like classification, pattern matching, regression, etc., work more efficiently when the data features are uncorrelated. In the case of image coding, e.g., PCA is known as the Karhunen–Loève transform [5], wherein images are compressed by decorrelating neighboring pixels. With this goal in mind, one must find the specific directions that not only have maximum variance, but that also lead to uncorrelated features when data is projected onto those directions. These specific directions are given by the eigenvectors (also called the principal components) of the data covariance matrix, and not just any set of orthogonal basis vectors spanning the same space. Mathematically, along with minimum reconstruction error, the other goal of PCA is to ensure the condition that the off-diagonal entries of are zero (i.e., the data gets decorrelated), while finding the eigenspace of the covariance matrix .
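To make the decorrelation requirement concrete, the following minimal NumPy sketch (with hypothetical dimensions and variable names, not taken from the paper) checks that projecting zero-mean data onto the top eigenvectors of the sample covariance produces features with a diagonal covariance, whereas projecting onto a rotated basis of the same subspace does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K = 10, 5000, 3                       # hypothetical dimensions

# Zero-mean data whose covariance has a few dominant directions.
A = rng.standard_normal((d, d))
X = A @ rng.standard_normal((d, N))         # columns are samples
X -= X.mean(axis=1, keepdims=True)

C = (X @ X.T) / N                           # sample covariance matrix
Q = np.linalg.eigh(C)[1][:, ::-1]           # eigenvectors, dominant first
Q_top = Q[:, :K]                            # top-K eigenvectors (PCA solution)

# A rotated basis spans the same K-dimensional subspace (a PSA solution).
R, _ = np.linalg.qr(rng.standard_normal((K, K)))
U_rot = Q_top @ R

def offdiag_energy(W):
    """Frobenius norm of the off-diagonal part of W^T C W."""
    S = W.T @ C @ W
    return np.linalg.norm(S - np.diag(np.diag(S)))

print(offdiag_energy(Q_top))    # ~0: projected features are uncorrelated
print(offdiag_energy(U_rot))    # clearly nonzero: same subspace, correlated features
```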
As explained above, the true and complete purpose of PCA is served when the search for the optimal solution ends with the specific set of eigenvectors of the data covariance matrix, and not just with the subspace it spans. It is known that obtaining the principal components from any other basis of the subspace only requires performing a singular value decomposition (SVD) of the obtained basis. While true, the SVD operation has a high computational complexity, which makes it an expensive step for big data. The traditional solutions for PCA were developed to overcome the cost of SVD, and hence reverting to it defeats the whole purpose.
Thus, even though the problem of dimensionality reduction of data has many optimal solutions (corresponding to all the sets of basis vectors spanning the -dimensional space), our goal is to find only the ones that give the eigenvectors as the basis. In terms of the optimization geometry of the PCA problem, in which one minimizes the mean-squared reconstruction error under an orthogonality constraint, the objective is a non-convex strict-saddle function. In a strict-saddle function, all stationary points except the local minima are strict saddles, wherein the Hessians have at least one negative eigenvalue that helps in escaping these saddle points [6, 7]. Also, in the case of PCA the local minima are the same as the global minima. These geometric aspects make PCA, despite being non-convex, a “nice and solvable” problem whose optimal solution can be reached efficiently. However, note that the set of global minima contains, along with the set of eigenvectors as a basis, all other possible bases that are rotated with respect to the eigenvectors. Our goal is not to find just any of the global minima but to land in a very particular subset of them, where the basis is not rotated.
A very popular tool for learning features of data, and hence compressing it, is the autoencoder. It was shown in [8] that the globally optimum weights of an autoencoder for minimum reconstruction error are the principal components of the covariance matrix of the input data. In [9], Oja described how using the Hebbian rule to update the weights of a linear neural network extracts the first principal component from the input data. Several other Hebbian-based rules like Rubner’s model, the APEX model [10], the Generalized Hebbian Algorithm (GHA) [11], etc., were proposed to extend this idea of training a neural network, originally for finding the first eigenvector, to extracting the first principal components (eigenvectors) of the input covariance matrix. Given the parallelization potential, a feedforward linear neural network-based solution for PCA is very attractive.
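As an illustration of the Hebbian viewpoint, here is a minimal sketch of Oja's rule for a single linear neuron; the dimensions, step size, and data are assumed for illustration only. The weight vector drifts toward the dominant eigenvector of the input covariance without any explicit eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, alpha = 8, 20000, 1e-3                 # assumed dimension, samples, step size

A = rng.standard_normal((d, d))
C = A @ A.T / d                              # a synthetic input covariance
L = np.linalg.cholesky(C)                    # for drawing samples with covariance C

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for _ in range(T):
    x = L @ rng.standard_normal(d)           # streaming input sample
    y = w @ x                                # neuron output (Hebbian activity)
    w += alpha * y * (x - y * w)             # Oja's rule: Hebbian term + implicit normalization

q1 = np.linalg.eigh(C)[1][:, -1]             # true dominant eigenvector
print(abs(w @ q1))                           # close to 1 => aligned with the top eigenvector
```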
The other aspect of modern-day data is, as mentioned earlier, its massive size. Collating huge amounts of raw data is usually prohibitive due to communications overhead and/or privacy concerns. These reasons have encouraged researchers over the last couple of decades to develop algorithms that can solve various problems for non-collocated data. The algorithms developed to deal with such scenarios can be broadly classified into two categories: (1) setups where a central entity/server is required to coordinate among the various data centers to yield the final result, and (2) setups where the data is scattered over an arbitrary network of interconnected data centers with no central coordinating node. The authors in [12, 3] discuss these different setups and the algorithms developed for each of them in more detail. The second scenario is more generic, and algorithms developed for such setups can usually be easily modified to apply to the first scenario. The terms distributed and decentralized are used interchangeably for both setups in the literature. In this paper, we focus on the latter scenario of arbitrarily connected networks and henceforth call it the distributed setup. Our goal here is thus to solve the PCA problem in a distributed manner when data is scattered over a network of interconnected nodes, such that all the nodes in the network eventually agree with each other and converge to the true principal components of the distributed data.
I-A Relation to Prior Work
PCA was developed to find simpler models of smaller dimensions that can approximately fit some data. Some seminal work was done by Pearson [13], who aimed at fitting a line to a set of points, and by Hotelling [2], where a method for the classical PCA problem of decorrelating the features of a given set of data points (observations) by finding the principal components was proposed. Later, some fast iterative methods like the power method, Lanczos algorithm, and orthogonal iterations [14] were proposed, which were proved to converge to the eigenvectors at a linear rate in the case of symmetric matrices. Many other iterative methods have been proposed over the last few decades that are based on the well-known Hebbian learning rule [15] like Oja’s method [9], generalized Hebbian algorithm [11], APEX [10], etc. The analysis for Oja’s algorithm has been provided in [16], which shows that in the deterministic setting the convergence to the first eigenvector is guaranteed at a linear rate for some conditions on the step size and initial estimates. The work in [17] extended the analysis to the generalized Hebbian case for convergence to the first principal eigenvectors for a specific choice of step size.
While ways to solve PCA in the centralized case when data is available at a single location have been around for nearly a century, distributed solutions are very recent. Within our distributed setup where we assume a network of arbitrarily connected nodes with no central server, the data distribution can be broadly classified into two types: (1) distribution by features, and (2) distribution by samples. The PCA algorithms for these two different kinds of distribution are significantly different. While both are completely distributed, the first kind [18, 19, 20, 21] involves estimating only one (or a subset) of feature(s) at each node. In this paper, we focus on finding the eigenvectors when the distribution is by samples, which requires estimation of the whole set of eigenvectors at each node of the network. For this type of distribution, the power method was adapted for the distributed setup as a subroutine in [22, 23, 24] to extract the first principal component of the global covariance matrix. Such methods make use of an explicit consensus loop [25] after each power iteration to ensure that the nodes (approximately) agree with each other. While this is a novel approach that reaches the required solution at the nodes accurately (albeit with a small error due to the consensus iterations), the two time-scale aspect makes it relatively slow in terms of communications efficiency. Furthermore, finding multiple principal components with these methods requires sequential estimation, where subsequent components are determined using the covariance matrix residue left after projection onto estimates of the higher-order components. In contrast, the work in [26] focuses on finding the top eigenvector in the distributed setup for the streaming data case. A detailed review of various distributed PCA algorithms that exist for different setups is provided in [27].
Next, note that distributed optimization-based algorithms for non-convex problems have only recently been studied, and those dealing with constrained problems are even fewer. In [28], it is shown that a distributed method for unconstrained non-convex problems converges to a stationary solution at a sublinear rate. The methods proposed in [29, 30] deal with non-convex objective functions in a distributed setup when the constraint set is convex, and [31] works with convex approximations of non-convex problems. Thus, none of these methods are directly applicable to the distributed PCA problem in our setup.
Finally, we proposed an efficient distributed PCA solution in [1] for a distributed network when the data is split sample-wise among the interconnected nodes. In this paper, we extend the preliminary work in [1] and provide a detailed mathematical analysis of the proposed algorithm along with exact convergence rates and extensive numerical experiments.
I-B Our Contributions
The main contributions of this paper are (1) a novel algorithm for distributed PCA, (2) theoretical guarantees for the proposed distributed algorithm with a linear convergence rate to a small neighborhood of the true PCA solution, and (3) experimental results to further demonstrate the efficacy of the proposed algorithm.
Our focus in this paper is to solve the distributed PCA problem so as to find a solution that not only enables dimensionality reduction, but that also provides uncorrelated features of data distributed over a network. That is, our goal is to estimate the true eigenvectors, not just any subspace spanned by them, of the covariance matrix of the data that is distributed across an arbitrarily connected network. Also, we focus on providing a solution that is efficient in terms of communications between the interconnected nodes of an arbitrary network. To that end, we propose a distributed algorithm that is based on the generalized Hebbian algorithm (GHA) proposed by Sanger [11], wherein the nodes perform local computations along with information exchange with their directly connected neighbors, similar to the idea followed in the distributed gradient descent (DGD) approach in [32]. The local computations do not involve the calculation of any gradient; we instead use a “pseudo-gradient,” which we henceforth call Sanger’s direction. In our proposed solution, termed the Distributed Sanger’s Algorithm (DSA), we have also done away with the need for explicit consensus iterations for making the nodes agree with each other, thereby making it a one time-scale solution that is more communications efficient. Theoretical guarantees are also provided for our proposed distributed PCA algorithm when using a constant step size. The analysis shows that, when using a constant step size , the DSA solution reaches within a -neighborhood of the optimal solution at a linear rate when the error metric is the angles between the estimated vectors and the true eigenvectors. (Our results can also be extrapolated to guarantee exact convergence with a decaying step size, albeit at a slower than linear rate.) We also provide experimental results and comparisons with centralized orthogonal iteration [14], centralized GHA [11], a sequential distributed power method-based approach, and distributed projected gradient descent. The results support our claims and analysis.
To the best of our knowledge, this is the first solution for distributed PCA that uses a Hebbian update, achieves network agreement without the use of explicit consensus iterations, and still provably reaches the globally optimum solution (within an error margin) at all nodes at a linear rate.
I-C Notation and Organization
The following notation is used in this paper. Scalars and vectors are denoted by lower-case and lower-case bold letters, respectively, while matrices are denoted by upper-case bold letters. The operator denotes the absolute value of a scalar quantity. The superscript in denotes time (or iteration) index, while denotes the exponentiation operation. The superscript denotes the transpose operation, denotes the Frobenius norm of matrices, while both and denote the -norm of vectors. Given a matrix , both and denote its entry at the row and column, while denotes its column.
The rest of the paper is organized as follows. In Section II, we describe and mathematically formulate the distributed PCA problem, while Section III describes the proposed distributed algorithm, which is based on the generalized Hebbian algorithm. In Section IV, we derive a general result for a modified generalized Hebbian algorithm that aids in the convergence analysis of the proposed distributed algorithm, while convergence guarantees for the proposed algorithm are provided in Section V. We provide numerical results in Section VI to show efficacy of the proposed method and provide concluding remarks in Section VII. The statements and proofs of some auxiliary lemmas, which are needed for the proofs of the main lemmas that are used within the convergence analysis in Section IV and Section V, are given in Appendix A, while Appendices B–F contain the formal statements and proofs of the main lemmas.
II Problem Formulation
Principal Component Analysis (PCA) aims at finding the basis of a low-dimensional space that can decorrelate the features of data points and also retain maximum information. More formally, for a random vector with , PCA involves finding the top- eigenvectors of the covariance matrix . The zero mean assumption is taken here without loss of generality as the mean can be subtracted in case data is not centered. Mathematically, PCA can be formulated as
(1) |
The constraint , ensures that decorrelates the features of . Now, and it is straightforward to see that this quantity is diagonal only if contains the eigenvectors of . This explains why the search for a solution of PCA ends with the eigenvectors and not the subspace spanned by them. In practice, we do not have access to and so a covariance matrix estimated from the samples of is used instead. Specifically, for a dataset with samples , or equivalently, for a data matrix , the sample covariance matrix can be written as such that . The true solution for PCA is then obtained by finding the eigenvectors of the covariance matrix , which are also the left singular vectors of the data matrix . The empirical form of (1) is thus
(2) |
In the literature, however, PCA is usually posed with a ‘relaxed’ orthogonality constraint of instead of as follows:
(3) |
The optimization formulation in (3) with this constraint will only lead to a subspace spanned by the eigenvectors of as the solution, thus actually making it a Principal Subspace Analysis (PSA) formulation. In other words, although the formulation (3) gives a solution on the Stiefel manifold, the actual PCA formulation (2) requires the solution to be within a very specific subset of that manifold that corresponds to the eigenvectors of . The accuracy of the solutions given by the PCA and PSA formulations will be the same when measured in terms of the principal angles between the subspace estimates and the true subspace spanned by the eigenvectors of the covariance matrix. Specifically, if is an estimate of the basis of the space spanned by the eigenvectors , then the principal angles between and given by either (2) or (3) will be the same. But a more suitable measure of accuracy for any PCA solution should be the angles between and for all , which motivates us to judge the efficacy of any solution with respect to this metric instead of the principal subspace angles.
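The following sketch (with assumed dimensions) makes the distinction between the two metrics concrete: a rotated basis of the top eigenvectors has essentially zero principal angles with the true eigenspace, yet its individual columns can be far from the individual eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 12, 3
A = rng.standard_normal((d, d))
C = A @ A.T
Q = np.linalg.eigh(C)[1][:, ::-1][:, :K]        # true top-K eigenvectors

R, _ = np.linalg.qr(rng.standard_normal((K, K)))
X_hat = Q @ R                                    # a PSA solution: same subspace, rotated basis

# Principal angles between the subspaces: arccos of the singular values of Q^T X_hat.
sv = np.linalg.svd(Q.T @ X_hat, compute_uv=False)
principal_angles = np.degrees(np.arccos(np.clip(sv, -1.0, 1.0)))

# Column-wise angles between the estimated vectors and the true eigenvectors.
cosines = np.abs(np.sum(Q * X_hat, axis=0)) / np.linalg.norm(X_hat, axis=0)
column_angles = np.degrees(np.arccos(np.clip(cosines, 0.0, 1.0)))

print(principal_angles)   # all ~0 degrees: the subspace metric cannot detect the rotation
print(column_angles)      # generally far from 0: the PCA metric penalizes the rotation
```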
In the distributed setup considered in this paper, there is a network of nodes such that the undirected graph, , describing the network is connected. Here is the set of nodes and is the set of edges, i.e., if there is a direct path between and . The set of neighbors for any node is denoted by . Under the setup of samples being distributed over the nodes, let us assume that the node has a data matrix containing samples such that . Thus each node has access to only a local covariance matrix instead of the global covariance matrix but one can see that . In this setting, a straightforward approach might be that each node finds its own solution independent of the data at all the other nodes. While this might seem viable, this approach will have major drawbacks. Recall that the sample covariance approximates the population covariance at a rate of , where is some function (depending on the distribution) of the number of samples [33]. Since the local data has a smaller number of samples than the global data, working with the local covariance matrix alone instead of somehow using the whole data will lead to a larger error in estimation of the eigenvectors. Also, since uniform sampling from the underlying data distribution is not guaranteed in distributed setups, the samples at a node may end up being from a narrow part of the entire distribution, thus being more biased away from the true distribution. This invites the need for the nodes to collaborate amongst themselves in a way that all the data is utilized to find estimates of the eigenvectors at each node while ensuring that all the nodes agree with each other. Thus, for a distributed setting, the PCA problem in (1) can be rewritten as
(4) |
It is easy to see that . Also, in a distributed setup, each node maintains its own copy of the variable due to the difference in local information (local data) they carry. Thus, all nodes need to agree with each other to ensure the entire network reaches the same true solution. Hence, the true distributed PCA objective is written as
(5) |
Note that (1)–(5) are non-convex optimization problems due to the non-convexity of the constraint set. One possible solution to the PCA problem is to instead solve a convex relaxation of the original non-convex function [34, 35]. The issue with these solutions is that they require memory and computation, which can be prohibitive in high-dimensional settings. In addition, due to iterate size these solutions are not ideal for distributed settings. Also, these formulations, without any further constraints, will not necessarily give a basis that is the set of dominant eigenvectors. Instead, they might end up giving a rotated basis as explained earlier, thereby not completing the task of decorrelating features. Hence, in this paper we use an algebraic method based on GHA for neural network training, which has memory and computation requirements, to solve the distributed PCA problem. Our goal is to converge to the true eigenvectors of the global covariance matrix at every node of the network. As noted earlier in Section I-A, distributed variants of the power method exist in the literature [22, 24, 23] that can find the dominant eigenvector but these methods employ two time-scale approaches that involve several consensus averaging rounds for each iteration of the power method. Such two time-scale approaches can be expensive in terms of communications cost. In this paper, we propose a one time-scale method that finds the top eigenvectors of the global sample covariance matrix at each node through local computations and information exchange with neighbors. The proposed method also converges linearly up to a neighborhood of the true solution when the error metric considered is the angle between the estimates and the true eigenvectors.
III The Proposed Algorithm
In [11], Sanger proposed a generalized Hebbian algorithm (GHA) to train a neural network and find the eigenvectors of the input autocorrelation matrix (same as the covariance matrix for zero-mean input). The outputs of such a network, when the weights are given by the eigenvectors, are the uncorrelated features of the input data that allow data reconstruction with minimal error, hence serving the true purpose of PCA. The algorithm was originally developed to tackle the centralized PCA problem in the case of streaming data, where a new data sample , arrives at each time instance.
In this paper we consider a batch setting, but the alignment of GHA with our basic goal of finding the eigenvectors motivates us to leverage it for our distributed setup. The rationale behind the idea of extrapolating the streaming case to a distributed batch setting is simple: since , the sample-wise distributed data setting can be seen as a mini-batch variant of the streaming data setting. In the context of neural network training, our approach can be viewed as training a network at each node with a mini-batch of samples in a way that all nodes end up with the same trained network whose weights are given by the eigenvectors of the autocorrelation matrix of the entire batch of samples.
The iterate for the GHA as given in [11] has the following update for the matrix of eigenvectors (i.e., the neural network weight matrix) when the sample arrives at the input of the neural network:
(6) |
where is an operator that sets all the elements below the diagonal to zero and is the step size. For , and defining , it was shown in [9] that the term is the consequence of a power series approximation of Oja’s rule in lieu of the explicit normalization used in the case of the power method. In the case of , helps combine Oja’s algorithm with the Gram–Schmidt orthogonalization step as follows:
(7) |
Thus, for any , the term involving in (6) includes an implicit normalization term as well as an orthogonalization term , which—analogous to the Gram–Schmidt orthogonalization procedure—forces the estimate to be orthogonal to all the estimates . Another important property of the GHA is that, in order to estimate the dominant eigenvectors, it only requires the corresponding top eigenvalues to be distinct (and nonzero). In other words, it does not require the covariance matrix to be non-singular.
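A minimal sketch of the streaming GHA update in (6) is shown below, assuming the convention that the columns of the weight matrix are the eigenvector estimates, the network output is y = Wᵀx, and the operator that zeros the entries below the diagonal is implemented by np.triu; the step size, dimensions, and data are hypothetical.

```python
import numpy as np

def gha_step(W, x, alpha):
    """One streaming GHA (Sanger) update. Columns of W are the eigenvector
    estimates, y = W^T x is the output of the linear network, and np.triu
    plays the role of the operator that zeros entries below the diagonal."""
    y = W.T @ x
    return W + alpha * (np.outer(x, y) - W @ np.triu(np.outer(y, y)))

# Toy usage with assumed sizes, covariance, and step size.
rng = np.random.default_rng(3)
d, K, alpha = 10, 3, 1e-3
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma_half = U @ np.diag(np.sqrt(np.linspace(1.0, 0.1, d))) @ U.T   # covariance square root
W = 0.1 * rng.standard_normal((d, K))
for _ in range(30000):
    W = gha_step(W, Sigma_half @ rng.standard_normal(d), alpha)
```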
In the deterministic setting, where we have the full batch of data instead of a new sample at every instance, this iterate changes to
(8) |
Here, we term as the Sanger direction. An iterate similar to (8) has been proven to have global convergence in [17] for a very specific choice of step sizes that depend on the iterate itself. Its straightforward extension to the distributed case is not possible, as that would lead to different step sizes at different nodes of the network, making it difficult to establish convergence guarantees. Hence, to adapt this iterative method to our distributed setup, we use the typical combine and update strategy used widely in the literature for distributed algorithms such as [32, 36, 37, 38]. The main contributions of such works lie in showing that the resulting distributed algorithms achieve consensus (i.e., all nodes eventually have the same iterate values) and, in addition, that the consensus value is the same as the centralized solution. However, the convergence guarantees for these methods are mainly restricted to convex and strongly convex problems. Our distributed version of (8) for PCA, which is non-convex, is based on similar principles of combine and update.
Specifically, the node at iteration carries a local copy of the estimate of the eigenvectors of the global covariance matrix . In the combine step, each node exchanges the iterate values with its immediate neighbors , where denotes the neighborhood of node , and then takes a weighted sum of the iterates received along with its local iterate. Then this sum is updated independently at all nodes using their respective local information. Since node in the network only has access to its local sample covariance , the update is in the form of a local Sanger’s direction given as
(9) |
Algorithm 1 (Distributed Sanger’s Algorithm, DSA): Input; Initialize; Return.
The details of the proposed distributed PCA algorithm, called the Distributed Sanger’s Algorithm (DSA), are given in Algorithm 1. The weight matrix in this algorithm is a doubly stochastic matrix conforming to the network topology [25] in the sense that for , when and otherwise. Also, , i.e., there is a self loop at each node. Note that connectivity of the network, as discussed in Section II, is a necessary condition for convergence of DSA. The connectivity assumption, in turn, ensures the Markov chain underlying the graph is aperiodic and irreducible, which implies that the second-largest (in magnitude) eigenvalue of , , is strictly less than . While DSA shares algorithmic similarities with first-order distributed optimization methods [32, 39] in which the combine-and-update strategy is used, our challenge is characterizing its convergence behavior due to the non-convex and constrained nature of the distributed PCA problem. To this end, we first provide a general result in Section IV where we prove the convergence of a modified form of GHA. Then we utilize that result, along with some linear algebraic tools and additional lemmas provided in the appendices, to characterize the dynamics of the distributed setup in Section V and prove the convergence of the proposed algorithm.
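As a concrete illustration, the following sketch implements the combine-and-update iteration of DSA on a toy problem; the cycle-graph network, data, Metropolis edge weights, step size, and dimensions are assumed for illustration and are not prescribed by the algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, M, n_i, alpha, T = 10, 3, 8, 500, 0.05, 2000        # assumed problem/network sizes

# Synthetic data with a known covariance spectrum, split sample-wise across M nodes.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma_half = U @ np.diag(np.sqrt(np.linspace(1.0, 0.1, d))) @ U.T
data = [Sigma_half @ rng.standard_normal((d, n_i)) for _ in range(M)]
C_local = [X @ X.T / n_i for X in data]                   # local sample covariances

# Doubly stochastic Metropolis weights for a cycle graph (each node has 2 neighbors).
adj = np.zeros((M, M))
for i in range(M):
    adj[i, (i - 1) % M] = adj[i, (i + 1) % M] = 1
deg = adj.sum(axis=1)
Wt = np.zeros((M, M))
for i in range(M):
    for j in range(M):
        if adj[i, j]:
            Wt[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
    Wt[i, i] = 1.0 - Wt[i].sum()                          # self weight keeps rows summing to 1

def sanger_direction(C, W):
    """Local Sanger direction: C W - W * triu(W^T C W)."""
    return C @ W - W @ np.triu(W.T @ C @ W)

# DSA: combine neighbors' iterates, then update along the local Sanger direction.
iterates = [0.1 * rng.standard_normal((d, K)) for _ in range(M)]
for _ in range(T):
    new_iterates = []
    for i in range(M):
        combined = sum(Wt[i, j] * iterates[j] for j in range(M))
        # The direction is evaluated at the node's own iterate (DGD-style); evaluating
        # it at the combined iterate is a closely related variant.
        new_iterates.append(combined + alpha * sanger_direction(C_local[i], iterates[i]))
    iterates = new_iterates
```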
IV Convergence Analysis of a Modified GHA
Let , , be an estimate of the -dimensional subspace spanned by the eigenvectors of the covariance matrix after iterations and , be the eigenvectors of with corresponding eigenvalues . On expanding (8) using (7), it is clear that the GHA update equation for estimation of the eigenvector using a constant step size is as follows:
(10) |
We now slightly modify (10) by replacing for by the true eigenvectors . We term the resulting update equation modified GHA and note that this is not an algorithm in the true sense of the term as it cannot be implemented because of its dependence on the true eigenvectors . The sole purpose of this modified GHA is to help in our ultimate goal of providing convergence guarantee for the DSA algorithm. The update equation of the modified GHA for “estimation” of the eigenvector of , , has the form
(11) |
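For reference, here is a sketch of one modified-GHA iteration under assumed (0-indexed) notation; it is purely an analysis device, since it deflates the covariance with the true eigenvectors, which are unknown in practice.

```python
import numpy as np

def modified_gha_step(w_k, k, C, Q_true, alpha):
    """One iteration of the modified GHA for the k-th vector (0-indexed): the
    GHA estimates of the first k vectors are replaced by the true eigenvectors
    Q_true[:, :k], which deflates C before the Oja-type update on w_k."""
    P = np.eye(C.shape[0]) - Q_true[:, :k] @ Q_true[:, :k].T
    return w_k + alpha * (P @ C @ w_k - (w_k @ C @ w_k) * w_k)
```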
Note that similar to the original GHA, this modified GHA assumes that has distinct eigenvalues, i.e., . Now, since , are the eigenvectors of a real symmetric matrix, they form a basis for and can be used for expansion of any as
(12) |
where is the coefficient corresponding to the eigenvector in the expansion of . Multiplying both sides of (11) by and using the fact that for , we get
This gives
(13)
(14) |
It has been shown in [16] that the update equation given by
for converges to at a linear rate for a certain condition on the step size . Specifically, it was proven that , where is some constant and . Here, we extend the proof to a general and show that the update equation given in the form of (11) for any , converges to the dominant eigenvector.
Theorem 1.
Suppose , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, , and for all . Then the modified GHA iterate for given by (11) converges at a linear rate to the eigenvector corresponding to the largest eigenvalue of the covariance matrix .
Proof.
The convergence of to requires convergence of the lower-order coefficients and the higher-order coefficients to 0 and convergence of to . Now,
(15) |
Thus, convergence of the lower-order and the higher-order coefficients to 0 along with convergence of the term will also imply the convergence of to . To this end, Lemma 5 in the appendix proves linear convergence of the lower-order coefficients to 0 by showing for some constants . Furthermore, Lemma 6 in the appendix shows that , where and , thereby proving linear convergence of the higher-order coefficients to 0. Finally, Lemma 7 in the appendix shows that , where and . The formal statements and proofs of Lemma 5, Lemma 6 and Lemma 7 are given in Appendix B, Appendix C and Appendix D, respectively.
V Convergence Analysis of Distributed Sanger’s Algorithm (DSA)
With the analysis of the modified GHA in hand, let us proceed to analyze the proposed DSA algorithm. The iterate of DSA at node for the dominant -dimensional eigenspace estimate () is given as
(16) |
where is an estimate of the -dimensional subspace of the global covariance matrix at the node after iterations, is the local Sanger’s direction, and is a weight that node assigns to based on the connectivity between nodes and as mentioned before. The Sanger’s direction and the update equation for an estimate of the eigenvector are thus given as
(17)
(18) |
Now, let the average of after iteration be denoted as and given by taking average of (18) over all the nodes as
where . We present analysis of the DSA algorithm by first proving convergence of the average to a neighborhood of the eigenvector of the global covariance matrix while using a constant step size. Then with the help of Lemma 8 in Appendix E, which proves that the deviation of the iterates at each node from the average is upper bounded, we prove that the iterates at each node also converge to a neighborhood of the true solution. It is noteworthy that the analysis of DSA does not require additional constraints on eigenvalues of , i.e., similar to GHA, we only require the top eigenvalues of to be distinct and non-zero.
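The two quantities tracked in this analysis, the network average of the node iterates and each node's deviation from that average, can be computed as in the short sketch below (assuming the node iterates are stored as a list of equally sized arrays).

```python
import numpy as np

def network_average_and_deviation(iterates):
    """Network average of the node iterates and each node's distance from it;
    a sketch of the quantities used in the consensus part of the analysis
    (`iterates` is assumed to be a list of equally sized NumPy arrays)."""
    W_bar = sum(iterates) / len(iterates)
    deviations = [np.linalg.norm(W - W_bar) for W in iterates]
    return W_bar, deviations
```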
The complete proof of convergence of DSA is done by induction. First, we show the convergence of to a neighborhood of and then analyze the rest of the eigenvector estimates by assuming that the higher-order estimates have converged.
Case I for Induction – : The iterate for the dominant eigenvector is
(19) |
Theorem 2.
Suppose , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, , and . Then the DSA iterate for given by (19) converges at a linear rate to an neighborhood of the eigenvector corresponding to the largest eigenvalue of the global covariance matrix at every node of the network.
Proof.
We know that
(20) |
The term is a measure of consensus in the network and we prove in Lemma 8 in Appendix E that this difference decreases linearly until it reaches a level of . More precisely,
(21) |
where . In particular, it is well known that for a connected graph . Now, the average iterate of DSA for the estimate of the dominant eigenvector is
Thus,
(22) |
We saw in Section IV that an iterate of the form
converges linearly to for certain conditions on the step size and the initial point. Thus,
The term in the above equation appears due to the distributed nature of the algorithm and can be bounded separately. Specifically, we prove in Lemma 9, whose formal statement and proof is given in Appendix F, that
Thus,
Since , we have the following two cases:
1. . Then, .
2. . Then .
Therefore,
(23) |
Consequently, from (21) and (23), we get
This proves that converges to a neighborhood of or at a linear rate. ∎
Case II for Induction – : For the remainder of the eigenvectors, we proceed with the proof of convergence by induction. Since we have already proven the base case, we can assume there exist constants and such that
1. , and
2. .
Using the inequality in 1) above, we can write such that . This therefore implies where .
Thus, we have
(24) | |||||
where and .
Consequently,
(25) |
We can now proceed with the final theorem that characterizes the convergence behavior of DSA.
Theorem 3.
Suppose , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, and . Then the DSA iterate for given by (18) converges at a linear rate to an neighborhood of the eigenvector corresponding to the largest eigenvalue of the global covariance matrix at each node of the network.
Proof.
We know
(26) |
Also, from Lemma 8 in the appendix we know that
Now, the average iterate of DSA for estimating the eigenvector is
We know from the discussion in Section IV that for an iterate of the form
there exists a constant such that . Thus,
Now, the term was bounded in Lemma 9 in the appendix as
(27) |
Thus, using (24) and (27), we can write
Consequently, from (26) and Lemma 8 we get
This proves that converges to a neighborhood of or at a linear rate. ∎
It is noteworthy that if decaying step sizes are used such that as (instead of constant ), the convergence will be exact but not linear. The rate in that case will be dominated by the rate of decay of .
VI Experimental Results
In this section, we provide results that demonstrate the efficacy of the proposed DSA algorithm. The need for collaboration between the nodes of a network is a vital part of any distributed algorithm, as already pointed out in Section II. We first verify this necessity, along with the effect of the step size on DSA, through a set of experiments. In these experiments, the weight matrix that conforms to the underlying graph topology is generated using the Metropolis constant edge-weight approach [40]. The performance of DSA in comparison to some baseline methods is also evaluated in additional experiments. We provide experimental results for DSA on synthetic and real data and compare the results with the centralized generalized Hebbian algorithm (GHA) [11], centralized orthogonal iteration (OI) [14], distributed projected gradient descent (DPGD), and a sequential distributed power method (SeqDistPM). For both centralized methods, all the data is assumed to be at a single location, the difference being that GHA uses the Hebbian update whereas OI uses the well-known orthogonal iterations to estimate the top eigenvectors of the covariance matrix . DPGD involves two significant steps per iteration. The first is a distributed gradient descent step at every node given by as in [32] using trace maximization as the objective. This is followed by a projection step to enforce the orthogonality constraint . The orthogonalization is accomplished using QR decomposition, an approach that ensures projection onto the Stiefel manifold [41] and whose computational complexity is at each node in each iteration. In contrast, SeqDistPM involves implementing the distributed power method [22, 24] times, estimating one eigenvector at a time and subtracting its impact on the covariance matrix for the estimation of subsequent eigenvectors. Note that SeqDistPM requires a finite number of consensus iterations per iteration of the power method. Assuming the cost of communicating one matrix across the network from nodes to their neighbors to be one unit, the communication cost of SeqDistPM is per iteration of the power method. The error metric used for comparison and reporting of the results is the average of the angles between the estimated and true eigenvectors, i.e., if is the estimate of the eigenvector at node and is the true eigenvector, then the average error across all nodes is calculated as follows:
(28) |
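A sketch of this error metric is given below, assuming the node estimates are stored as a list of d-by-K matrices, the true eigenvectors form the columns of a d-by-K matrix, and angles are reported in degrees.

```python
import numpy as np

def average_angle_error(node_estimates, Q_true):
    """Average angle (in degrees) between estimated and true eigenvectors,
    averaged over all nodes and all components; a sketch of the error metric
    in (28) under assumed array conventions."""
    errs = []
    for W in node_estimates:
        cosines = np.abs(np.sum(Q_true * W, axis=0)) / (
            np.linalg.norm(Q_true, axis=0) * np.linalg.norm(W, axis=0))
        errs.append(np.degrees(np.arccos(np.clip(cosines, 0.0, 1.0))))
    return float(np.mean(errs))
```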
VI-A Synthetic Data
We first show results that emphasize the need for collaboration among the nodes. To that end, we generate independent and identically distributed (i.i.d.) samples drawn from a multivariate Gaussian distribution with an eigengap and dimension . These samples are distributed equally among the nodes of an Erdos–Renyi network (with connectivity probability ), implying that each node has 1,000 samples. The number of eigenvectors estimated is and a constant step size of is used for this experiment. Figure 1(a) shows the effect of using the GHA at a node without collaboration with other nodes versus DSA, which in simple terms embodies GHA + collaboration in the network. The blue line indicating GHA in the figure is the result of using all the data in a centralized manner. It is clear that the lack of any communication between nodes increases the error in estimation of the eigenvectors by a significant factor. In Figure 1(b), we use the same setup and parameters to show the effect of different step sizes on our proposed DSA algorithm. It is evident that if the step size is too low, convergence becomes significantly slower, while if it is too high, the final error is larger. Hence, a careful choice of the step size is required for DSA, as characterized by its convergence analysis.
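As a reference for this kind of setup, the following sketch generates Gaussian data whose covariance has a prescribed eigenvalue decay, splits it equally across the nodes, and draws an Erdos–Renyi connectivity pattern; all parameter values and the particular eigengap construction are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
d, M, N, gap, p = 20, 10, 10000, 0.8, 0.5        # assumed values

# One possible construction: ratio of consecutive eigenvalues equal to the eigengap.
eigvals = gap ** np.arange(d)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal eigenvectors
Sigma_half = U @ np.diag(np.sqrt(eigvals)) @ U.T
X = Sigma_half @ rng.standard_normal((d, N))      # i.i.d. zero-mean Gaussian samples

# Split the samples equally across M nodes (1,000 samples per node here).
local_data = np.split(X, M, axis=1)

# Erdos-Renyi connectivity pattern (in practice, resample until the graph is connected).
adj = np.triu((rng.random((M, M)) < p).astype(float), 1)
adj = adj + adj.T
```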
Next, we compare DSA with the distributed methods DPGD and SeqDistPM to demonstrate its communications efficiency. For that purpose, we generate synthetic data with different eigengaps . We simulate the distributed setup for Erdos–Renyi (), star, and cycle graph topologies with nodes. The data is generated so that each node has 1,000 i.i.d. samples () drawn from a multivariate Gaussian distribution for , i.e., the total number of samples generated is 10,000. The dimension of the subspace to be estimated is taken to be . We use as the number of consensus iterations per power iteration for SeqDistPM throughout our experiments. The results reported are an average of 10 Monte Carlo trials. Figure 2 shows the performance of the different algorithms for the estimation of the most dominant eigenvector for different network topologies. It is clear that for SeqDistPM outperforms both DSA and DPGD in terms of communications efficiency because it is essentially the distributed power method, which is shown in [24, 22] to have good performance for . Even though DSA and DPGD have the same performance in terms of communications cost, it is important to remember that DPGD requires an additional QR normalization step per communications round. Next, Figure 3 shows a comparison between the three algorithms when the top-5 eigenvectors are estimated, i.e., . It is clear that, while estimating higher-order eigenvectors, DSA slightly outperforms DPGD without performing explicit QR normalization, and it also has much better communications efficiency than SeqDistPM. The error for SeqDistPM is significantly higher in the beginning because of the sequential estimation: while the first (higher-order) eigenvectors are being estimated, the lower-order estimates are still at their initial values and hence contribute significant error even when the already-estimated components have low error. After a sufficiently large number of communications rounds, SeqDistPM eventually does reach a lower final error than DSA, but this comes at the expense of slower convergence as a function of communications cost. It should also be noted that SeqDistPM lacks a formal convergence analysis and has two time scales that need to be adjusted, as both contribute to the final error. Finally, the benefits of DSA over DPGD are twofold. First, DSA reaches a similar or better error floor without explicit QR normalization, thus saving computations per iteration; and second, the convergence guarantees for gradient descent-based algorithms for non-convex problems like PCA have limitations, as they usually only guarantee convergence to a stationary solution at a sub-linear rate.
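For comparison, a sketch of one DPGD iteration at a node is shown below, under the same assumed conventions as the DSA sketch given earlier: a DGD-style step on the local trace-maximization objective followed by QR-based re-orthonormalization, which serves as the projection back onto the Stiefel manifold.

```python
import numpy as np

def dpgd_step(iterates, C_local, Wt, i, alpha):
    """One DPGD iteration at node i (a sketch): combine neighbors' iterates,
    ascend the local trace objective tr(W^T C_i W), whose gradient is 2 C_i W,
    and then re-orthonormalize via a QR decomposition."""
    combined = sum(Wt[i, j] * iterates[j] for j in range(len(iterates)))
    stepped = combined + alpha * 2.0 * C_local[i] @ iterates[i]
    Q, R = np.linalg.qr(stepped)
    return Q * np.sign(np.diag(R))    # fix column signs; assumes a nonzero diagonal of R
```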
VI-B Real-World Data
Along with the synthetic data experiments, we provide some experiments with the real-world datasets MNIST [42] and CIFAR-10 [43]. For the distributed setup in this case, we use an Erdos–Renyi graph with nodes and . Both datasets have 60,000 samples, making the number of samples per node . The data dimension for MNIST is and a constant step size of was used. The plots in Figure 4(a) and Figure 4(b) show the results for for MNIST. Similar plots are shown for CIFAR-10 in Figure 5(a) and Figure 5(b), where the dimension for CIFAR-10 is 1024, the number of estimated eigenvectors is , and a constant step size of is used. For these real-world datasets, we exclude the comparison with SeqDistPM as it is evident that this method requires a much higher communications cost when estimating a larger number of eigenvectors.
VII Conclusion
In this paper, we proposed and analyzed a new distributed Principal Component Analysis (PCA) algorithm that, as opposed to distributed subspace learning methods, facilitates both dimensionality reduction and data decorrelation in a distributed setup. Our main contribution in this regard was a detailed convergence analysis to prove that the proposed distributed method linearly converges to a neighborhood of the eigenvectors of the global covariance matrix. We also provided numerical results to demonstrate the communications efficiency and overall effectiveness of the proposed algorithm.
In terms of future work, an obvious extension would be a distributed algorithm that enables exact convergence to the PCA solution at a linear rate. Note that the use of a diminishing step size along with the analysis in this paper already guarantees that DSA can converge exactly to the PCA solution. However, this exact convergence guarantee comes at the expense of a slow convergence rate. We instead expect to combine ideas from this work as well as ideas such as gradient tracking from the literature on distributed optimization [36, 31, 44] to develop a linearly convergent, exact algorithm for distributed PCA in the future. Another possible future direction involves developing an algorithm for distributed PCA that does not require the top eigenvalues to be distinct. We also leave the case of multiple eigenvector estimation from distributed, streaming data, as in [26], for future work.
Appendix A Statements and Proofs of Auxiliary Lemmas
A-A Statement and Proof of Lemma 1
Lemma 1.
Assume . If the step size is bounded above as , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, then
(29) |
Proof.
From (11), we know the iterate for eigenvector estimate is
where . Notice that . Hence,
(30) |
We now split our analysis into three cases based on the range of values of .
Case I: Let . Then we see from (30) that
Case II: Now suppose . Then from (30) we have
Case III: Finally suppose . Then from (30) we get
To show that , we have to show
(31) |
We now find a lower bound of the right hand side of (31). Note that
(32)
and
(33)
Now, is a generalized Rayleigh quotient whose maximum and minimum values are the largest and smallest eigenvalues of the generalized eigenvalue problem . Since the eigenvectors of and are the same, the largest and smallest eigenvalues of the generalized problems are and , respectively, where and are the largest and smallest eigenvalues of . Thus, . Also, since , we have the right hand side of (33)
Hence, we have that the right hand side of (31) exceeds
Thus, if , then .
Next,
(34) |
Hence, . ∎
A-B Statement and Proof of Lemma 2
Lemma 2.
Suppose and , then
A-C Statement and Proof of Lemma 3
Lemma 3.
Assume . If the step size is bounded above as , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, then
(38) |
Proof.
We have
(39) |
Hence,
Now,
Case I: Let us assume . Then, we have
Thus,
Case II: Now, suppose . Then, we get
Thus, if we need , the following condition should be met:
Since , if .
Case III: Finally, suppose . We then have the following: .
A-D Statement and Proof of Lemma 4
Lemma 4.
The norm of Sanger’s direction is bounded as
(41) |
Proof.
We know
Next, notice that . Thus,
We, therefore, get
Thus,
∎
Appendix B Statement and Proof of Lemma 5
Lemma 5.
Let . Now, suppose , then the following is true for , and some constant :
(42) |
Appendix C Statement and Proof of Lemma 6
Lemma 6.
Suppose and . Then the following is true for and some constant :
(44) |
Proof.
For we know from (14) that . If , we have .
Thus, we have for ,
Therefore, for , . Since and , we have and . Also, because of the assumption , let us assume . Thus, we can write
(45) |
∎
Appendix D Statement and Proof of Lemma 7
Lemma 7.
Suppose and . Then there exist constants such that
(46) |
Appendix E Statement and Proof of Lemma 8
Lemma 8.
The deviation of an iterate at a node from the average is bounded from above as
(48) |
where is the second largest magnitude of the eigenvalues of given as and is some constant.
Proof.
We stack the iterates and as
The next network-wide iterate (as a stacked vector) can then be written as , where denotes the Kronecker product. The iterate can thus be written as
Since is a symmetric and doubly stochastic mixing matrix, its largest eigenvalue is corresponding to the eigenvector , a column vector of all 1’s. It is also the left eigenvector of . That is, . Also, since the squared norm of Sanger’s direction at every node is bounded, it is easy to see . Now,
∎
Appendix F Statement and Proof of Lemma 9
Lemma 9.
Suppose and , then the following is true :
(49) |
Proof.
We have
Thus,
(50) |
∎
References
- [1] A. Gang, H. Raja, and W. U. Bajwa, “Fast and communication-efficient distributed PCA,” in Proc. IEEE International Conf. Acoustics, Speech and Signal Process. (ICASSP), 2019, pp. 7450–7454.
- [2] H. Hotelling, “Analysis of a complex of statistical variables into principal components.” J. Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.
- [3] M. Nokleby, H. Raja, and W. U. Bajwa, “Scaling-up distributed processing of data streams for machine learning,” Proc. IEEE, vol. 108, no. 11, pp. 1984–2012, Nov. 2020.
- [4] W. U. Bajwa, V. Cevher, D. Papailiopoulos, and A. Scaglione, “Machine learning from distributed, streaming data,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 11–13, May 2020.
- [5] M. Loève, Probability Theory, 3rd ed. New York: Springer, 1963.
- [6] R. Dixit and W. U. Bajwa, “Exit time analysis for approximations of gradient descent trajectories around saddle points,” ArXiv Prepr. ArXiv200601106, Jun. 2020. [Online]. Available: http://arxiv.org/abs/2006.01106
- [7] ——, “Boundary conditions for linear exit time gradient trajectories around saddle points: Analysis and algorithm,” ArXiv Prepr. ArXiv210102625, Jan. 2021. [Online]. Available: http://arxiv.org/abs/2101.02625
- [8] P. Baldi and K. Hornik, “Neural networks and principal component analysis: Learning from examples without local minima,” Neural Netw., vol. 2, no. 1, p. 53–58, Jan. 1989.
- [9] E. Oja and J. Karhunen, “On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix,” J. Math. Anal. Applicat., vol. 106, no. 1, pp. 69 – 84, 1985.
- [10] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications. USA: John Wiley & Sons, Inc., 1996.
- [11] T. D. Sanger, “Optimal unsupervised learning in a single-layer linear feedforward neural network,” Neural Netw., vol. 2, no. 6, pp. 459 – 473, 1989.
- [12] Z. Yang, A. Gang, and W. U. Bajwa, “Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 146–159, 2020.
- [13] K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Mag., vol. 2, pp. 559–572, 1901.
- [14] G. H. Golub and C. F. Van Loan, Matrix Computations (3rd Ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
- [15] D. O. Hebb, The Organization of Behavior: A Neuropsychological Theory. Wiley New York, 1949.
- [16] Z. Yi, M. Ye, J. C. Lv, and K. K. Tan, “Convergence analysis of a deterministic discrete time system of Oja’s PCA learning algorithm,” IEEE Trans. Neural Netw., vol. 16, no. 6, pp. 1318–1328, Nov 2005.
- [17] J. C. Lv, Z. Yi, and K. K. Tan, “Global convergence of GHA learning algorithm with nonzero-approaching adaptive learning rates,” IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1557–1571, 2007.
- [18] D. Kempe and F. McSherry, “A decentralized algorithm for spectral analysis,” J. Comput. and Syst. Sci., vol. 74, no. 1, pp. 70 – 83, 2008.
- [19] A. Scaglione, R. Pagliari, and H. Krim, “The decentralized estimation of the sample covariance,” in Proc. 42nd Asilomar Conf. on Signals, Syst. and Comput., 2008, pp. 1722–1726.
- [20] L. Li, A. Scaglione, and J. H. Manton, “Distributed principal subspace estimation in wireless sensor networks,” IEEE J. Sel. Topics Signal Process., vol. 5, no. 4, pp. 725–738, Aug 2011.
- [21] M. Nokleby and W. U. Bajwa, “Resource tradeoffs in distributed subspace tracking over the wireless medium,” in Proc. 1st IEEE Global Conf. Signal and Information Processing (GlobalSIP’13), Symposium on Network Theory, Austin, TX, Dec. 2013, pp. 823–826.
- [22] H. Raja and W. U. Bajwa, “Cloud K-SVD: Computing data-adaptive representations in the cloud,” in Proc. 51st Annual Allerton Conf. Commun., Control and Computing, 2013, pp. 1474–1481.
- [23] H. Wai, A. Scaglione, J. Lafond, and E. Moulines, “Fast and privacy preserving distributed low-rank regression,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., (ICASSP), 2017, pp. 4451–4455.
- [24] H. Raja and W. U. Bajwa, “Cloud-K-SVD: A collaborative dictionary learning algorithm for big, distributed data,” IEEE Trans. Signal Process., vol. 64, no. 1, pp. 173–188, Jan 2016.
- [25] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Syst. & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.
- [26] H. Raja and W. U. Bajwa, “Distributed stochastic algorithms for high-rate streaming principal component analysis,” CoRR, vol. abs/2001.01017, 2020. [Online]. Available: http://arxiv.org/abs/2001.01017
- [27] S. X. Wu, H.-T. Wai, L. Li, and A. Scaglione, “A review of distributed algorithms for principal component analysis,” Proc. IEEE, vol. 106, no. 8, pp. 1321–1340, 2018.
- [28] M. Hong, D. Hajinezhad, and M.-M. Zhao, “Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks,” in Proc. 34th Int. Conf. Mach. Learning, vol. 70. PMLR, 06–11 Aug 2017, pp. 1529–1538.
- [29] P. Bianchi and J. Jakubowicz, “Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization,” IEEE Trans. Autom. Control, vol. 58, no. 2, pp. 391–405, 2013.
- [30] H. Wai, A. Scaglione, J. Lafond, and E. Moulines, “A projection-free decentralized algorithm for non-convex optimization,” in Proc. 2016 IEEE Global Conf. Signal and Inform. Process. (GlobalSIP), 2016, pp. 475–479.
- [31] P. D. Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Trans. Signal Inform. Process. Netw., vol. 2, no. 2, pp. 120–136, 2016.
- [32] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, Jan 2009.
- [33] R. Vershynin, “How close is the sample covariance matrix to the actual covariance matrix?” Journal of Theoretical Probability, vol. 25, 04 2010.
- [34] R. Arora, A. Cotter, and N. Srebro, “Stochastic optimization of PCA with capped MSG,” in Proc. Advances Neural Inform. Process. Systs., 2013, pp. 1815–1823.
- [35] M. K. Warmuth and D. Kuzmin, “Randomized PCA algorithms with regret bounds that are logarithmic in the dimension,” in Proc. Advances Neural Inform. Process. Syst., 2007, pp. 1481–1488.
- [36] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: an exact first-order algorithm for decentralized consensus optimization,” SIAM J. Optim., vol. 25, no. 2, pp. 944–966, 2015.
- [37] F. S. Cattivelli and A. H. Sayed, “Diffusion LMS strategies for distributed estimation,” IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1035–1048, 2010.
- [38] S. Kar and J. M. Moura, “Consensus + innovations distributed inference over networks: Cooperation and sensing in networked systems,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 99–109, 2013.
- [39] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM J. Optim., vol. 26, no. 3, pp. 1835–1854, 2016.
- [40] S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing Markov chain on a graph,” SIAM Review, vol. 46, pp. 667–689, 2003.
- [41] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. USA: Princeton University Press, 2007.
- [42] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” ATT Labs, vol. 2, 2010.
- [43] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
- [44] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245–1260, 2018.