A Linearly Convergent Algorithm for Distributed Principal Component Analysis
Abstract
Principal Component Analysis (PCA) is the workhorse tool for dimensionality reduction in this era of big data. While often overlooked, the purpose of PCA is not only to reduce data dimensionality, but also to yield features that are uncorrelated. Furthermore, the ever-increasing volume of data in the modern world often requires storage of data samples across multiple machines, which precludes the use of centralized PCA algorithms. This paper focuses on the dual objective of PCA, namely, dimensionality reduction and decorrelation of features, but in a distributed setting. This requires estimating the eigenvectors of the data covariance matrix, as opposed to only estimating the subspace spanned by the eigenvectors, when data is distributed across a network of machines. Although a few distributed solutions to the PCA problem have been proposed recently, convergence guarantees and/or communications overhead of these solutions remain a concern. With an eye towards communications efficiency, this paper introduces a feedforward neural network-based one time-scale distributed PCA algorithm termed Distributed Sanger’s Algorithm (DSA) that estimates the eigenvectors of the data covariance matrix when data is distributed across an undirected and arbitrarily connected network of machines. Furthermore, the proposed algorithm is shown to converge linearly to a neighborhood of the true solution. Numerical results are also provided to demonstrate the efficacy of the proposed solution.
Index Terms:
Dimensionality reduction, distributed feature learning, generalized Hebbian learning, principal component analysis
I Introduction
The modern era of machine learning involves leveraging massive amounts of high-dimensional data, which can have large computational and storage costs. To combat the complexities arising from the high dimensionality of data, dimensionality reduction and feature learning techniques play a pivotal and necessary role in information processing. The most common and widely used technique for this task is Principal Component Analysis (PCA) [2] which, in the simplest of terms, transforms data into uncorrelated features that aid conversion of data from a high-dimensional space to a low-dimensional space while retaining maximum information. Simultaneously, the sheer volume of available data makes it difficult to manage at a single location. There is an increasing number of scenarios in which data is distributed across different locations, either due to storage constraints or by its inherent nature, as in the Internet-of-Things [3]. This aspect of modern-world data has led researchers to explore distributed algorithms, which can process information across different locations/machines [4]. These issues have motivated us to study and develop algorithms for distributed PCA that are efficient in terms of computations and communications among multiple machines, and that can also be proven to converge at a fast rate.
When the data is available at a single location, one of the goals of PCA is to find a -dimensional subspace, given by the column space of a matrix , such that the zero-mean data samples retain maximum information when projected onto . In other words, when reconstructed as (subject to ), the data samples should have minimum reconstruction error. It can be shown that this minimal error solution is given by the projection of data onto the subspace spanned by the eigenvectors of the data covariance matrix. This implies that for dimensionality reduction, learning any basis of that subspace is sufficient. This is referred to as the subspace learning problem. But while simple dimension reduction does not necessarily need uncorrelated features, most downstream machine learning tasks like classification, pattern matching, regression, etc., work more efficiently when the data features are uncorrelated. In the case of image coding, e.g., PCA is known as the Karhunen–Loève transform [5], wherein images are compressed by decorrelating neighboring pixels. With this goal in mind, one must find the specific directions that not only have maximum variance, but that also lead to uncorrelated features when data is projected onto those directions. These specific directions are given by the eigenvectors (also called the principal components) of the data covariance matrix, and not just any set of orthogonal basis vectors spanning the same space. Mathematically, along with minimum reconstruction error, the other goal of PCA is to ensure the condition that the off-diagonal entries of are zero (i.e., the data gets decorrelated), while finding the eigenspace of the covariance matrix .
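To make the decorrelation requirement concrete, the following minimal NumPy sketch (with hypothetical dimensions and variable names, not taken from the paper) checks that projecting zero-mean data onto the top eigenvectors of the sample covariance produces features with a diagonal covariance, whereas projecting onto a rotated basis of the same subspace does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K = 10, 5000, 3                       # hypothetical dimensions

# Zero-mean data whose covariance has a few dominant directions.
A = rng.standard_normal((d, d))
X = A @ rng.standard_normal((d, N))         # columns are samples
X -= X.mean(axis=1, keepdims=True)

C = (X @ X.T) / N                           # sample covariance matrix
Q = np.linalg.eigh(C)[1][:, ::-1]           # eigenvectors, dominant first
Q_top = Q[:, :K]                            # top-K eigenvectors (PCA solution)

# A rotated basis spans the same K-dimensional subspace (a PSA solution).
R, _ = np.linalg.qr(rng.standard_normal((K, K)))
U_rot = Q_top @ R

def offdiag_energy(W):
    """Frobenius norm of the off-diagonal part of W^T C W."""
    S = W.T @ C @ W
    return np.linalg.norm(S - np.diag(np.diag(S)))

print(offdiag_energy(Q_top))    # ~0: projected features are uncorrelated
print(offdiag_energy(U_rot))    # clearly nonzero: same subspace, correlated features
```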
As explained above, the true and complete purpose of PCA is served when the search for the optimal solution ends with the specific set of eigenvectors of the data covariance matrix, and not just with the subspace it spans. It is known that obtaining the principal components from any other basis of the subspace only requires performing a singular value decomposition (SVD) of the obtained basis. While true, the SVD operation has a high computational complexity, which makes it an expensive step for big data. The traditional solutions for PCA were developed to overcome the cost of SVD, and hence reverting to it defeats the whole purpose.
Thus, even though the problem of dimensionality reduction of data has many optimal solutions (corresponding to all the sets of basis vectors spanning the -dimensional space), our goal is to find only the ones that give the eigenvectors as the basis. In terms of the optimization geometry of the PCA problem, in which one minimizes the mean-squared reconstruction error under an orthogonality constraint, the objective is a non-convex strict-saddle function. In a strict-saddle function, all stationary points except the local minima are strict saddles, wherein the Hessians have at least one negative eigenvalue that helps in escaping these saddle points [6, 7]. Also, in the case of PCA the local minima are the same as the global minima. These geometric aspects make PCA, despite being non-convex, a “nice and solvable” problem whose optimal solution can be reached efficiently. However, note that the set of global minima contains, along with the set of eigenvectors as a basis, all other possible bases that are rotated with respect to the eigenvectors. Our goal is not to find just any of the global minima but to land in a very particular subset of them, where the basis is not rotated.
A very popular tool for learning features of data, and hence compressing it, is the autoencoder. It was shown in [8] that the globally optimum weights of an autoencoder for minimum reconstruction error are the principal components of the covariance matrix of the input data. In [9], Oja described how using the Hebbian rule to update the weights of a linear neural network extracts the first principal component from the input data. Several other Hebbian-based rules like Rubner’s model, the APEX model [10], the Generalized Hebbian Algorithm (GHA) [11], etc., were proposed to extend this idea of training a neural network, originally for finding the first eigenvector, to extracting the first principal components (eigenvectors) of the input covariance matrix. Given the parallelization potential, a feedforward linear neural network-based solution for PCA is very attractive.
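As an illustration of the Hebbian viewpoint, here is a minimal sketch of Oja's rule for a single linear neuron; the dimensions, step size, and data are assumed for illustration only. The weight vector drifts toward the dominant eigenvector of the input covariance without any explicit eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, alpha = 8, 20000, 1e-3                 # assumed dimension, samples, step size

A = rng.standard_normal((d, d))
C = A @ A.T / d                              # a synthetic input covariance
L = np.linalg.cholesky(C)                    # for drawing samples with covariance C

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for _ in range(T):
    x = L @ rng.standard_normal(d)           # streaming input sample
    y = w @ x                                # neuron output (Hebbian activity)
    w += alpha * y * (x - y * w)             # Oja's rule: Hebbian term + implicit normalization

q1 = np.linalg.eigh(C)[1][:, -1]             # true dominant eigenvector
print(abs(w @ q1))                           # close to 1 => aligned with the top eigenvector
```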
The other aspect of modern-day data is, as mentioned earlier, its massive size. Collating huge amounts of raw data is usually prohibitive due to communications overhead and/or privacy concerns. These reasons have encouraged researchers over the last couple of decades to develop algorithms that can solve various problems for non-collocated data. The algorithms developed to deal with such scenarios can be broadly classified into two categories: (1) setups where a central entity/server is required to coordinate among the various data centers to yield the final result, and (2) setups where the data is scattered over an arbitrary network of interconnected data centers with no central coordinating node. The authors in [12, 3] discuss these different setups and the algorithms developed for each of them in more detail. The second scenario is more generic, and algorithms developed for such setups can usually be easily modified to apply to the first scenario. The terms distributed and decentralized are used interchangeably for both setups in the literature. In this paper, we focus on the latter scenario of arbitrarily connected networks and henceforth call it the distributed setup. Our goal here is thus to solve the PCA problem in a distributed manner when data is scattered over a network of interconnected nodes, such that all the nodes in the network eventually agree with each other and converge to the true principal components of the distributed data.
I-A Relation to Prior Work
PCA was developed to find simpler models of smaller dimensions that can approximately fit some data. Some seminal work was done by Pearson [13], who aimed at fitting a line to a set of points, and by Hotelling [2], where a method for the classical PCA problem of decorrelating the features of a given set of data points (observations) by finding the principal components was proposed. Later, some fast iterative methods like the power method, Lanczos algorithm, and orthogonal iterations [14] were proposed, which were proved to converge to the eigenvectors at a linear rate in the case of symmetric matrices. Many other iterative methods have been proposed over the last few decades that are based on the well-known Hebbian learning rule [15] like Oja’s method [9], generalized Hebbian algorithm [11], APEX [10], etc. The analysis for Oja’s algorithm has been provided in [16], which shows that in the deterministic setting the convergence to the first eigenvector is guaranteed at a linear rate for some conditions on the step size and initial estimates. The work in [17] extended the analysis to the generalized Hebbian case for convergence to the first principal eigenvectors for a specific choice of step size.
While ways to solve PCA in the centralized case when data is available at a single location have been around for nearly a century, distributed solutions are very recent. Within our distributed setup where we assume a network of arbitrarily connected nodes with no central server, the data distribution can be broadly classified into two types: (1) distribution by features, and (2) distribution by samples. The PCA algorithms for these two different kinds of distribution are significantly different. While both are completely distributed, the first kind [18, 19, 20, 21] involves estimating only one (or a subset) of feature(s) at each node. In this paper, we focus on finding the eigenvectors when the distribution is by samples, which requires estimation of the whole set of eigenvectors at each node of the network. For this type of distribution, the power method was adapted for the distributed setup as a subroutine in [22, 23, 24] to extract the first principal component of the global covariance matrix. Such methods make use of an explicit consensus loop [25] after each power iteration to ensure that the nodes (approximately) agree with each other. While this is a novel approach that reaches the required solution at the nodes accurately (albeit with a small error due to the consensus iterations), the two time-scale aspect makes it relatively slow in terms of communications efficiency. Furthermore, finding multiple principal components with these methods requires sequential estimation, where subsequent components are determined using the covariance matrix residue left after projection onto estimates of the higher-order components. In contrast, the work in [26] focuses on finding the top eigenvector in the distributed setup for the streaming data case. A detailed review of various distributed PCA algorithms that exist for different setups is provided in [27].
Next, note that distributed optimization-based algorithms for non-convex problems have only recently been studied, and those dealing with constrained problems are even fewer. In [28], it is shown that a distributed method for unconstrained non-convex problems converges to a stationary solution at a sublinear rate. The methods proposed in [29, 30] deal with non-convex objective functions in a distributed setup when the constraint set is convex, and [31] works with convex approximations of non-convex problems. Thus, none of these methods are directly applicable to the distributed PCA problem in our setup.
Finally, we proposed an efficient distributed PCA solution in [1] for a distributed network when the data is split sample-wise among the interconnected nodes. In this paper, we extend the preliminary work in [1] and provide a detailed mathematical analysis of the proposed algorithm along with exact convergence rates and extensive numerical experiments.
I-B Our Contributions
The main contributions of this paper are (1) a novel algorithm for distributed PCA, (2) theoretical guarantees for the proposed distributed algorithm with a linear convergence rate to a small neighborhood of the true PCA solution, and (3) experimental results to further demonstrate the efficacy of the proposed algorithm.
Our focus in this paper is to solve the distributed PCA problem so as to find a solution that not only enables dimensionality reduction, but that also provides uncorrelated features of data distributed over a network. That is, our goal is to estimate the true eigenvectors, not just any subspace spanned by them, of the covariance matrix of the data that is distributed across an arbitrarily connected network. Also, we focus on providing a solution that is efficient in terms of communications between the interconnected nodes of an arbitrary network. To that end, we propose a distributed algorithm that is based on the generalized Hebbian algorithm (GHA) proposed by Sanger [11], wherein the nodes perform local computations along with information exchange with their directly connected neighbors, similar to the idea followed in the distributed gradient descent (DGD) approach in [32]. The local computations do not involve the calculation of any gradient; we instead use a “pseudo-gradient,” which we henceforth call Sanger’s direction. In our proposed solution, termed the Distributed Sanger’s Algorithm (DSA), we have also done away with the need for explicit consensus iterations for making the nodes agree with each other, thereby making it a one time-scale solution that is more communications efficient. Theoretical guarantees are also provided for our proposed distributed PCA algorithm when using a constant step size. The analysis shows that, when using a constant step size , the DSA solution reaches within a -neighborhood of the optimal solution at a linear rate when the error metric is the angles between the estimated vectors and the true eigenvectors. (Our results can also be extrapolated to guarantee exact convergence with a decaying step size, albeit at a slower than linear rate.) We also provide experimental results and comparisons with centralized orthogonal iteration [14], centralized GHA [11], a sequential distributed power method-based approach, and distributed projected gradient descent. The results support our claims and analysis.
To the best of our knowledge, this is the first solution for distributed PCA that uses a Hebbian update, achieves network agreement without the use of explicit consensus iterations, and still provably reaches the globally optimum solution (within an error margin) at all nodes at a linear rate.
I-C Notation and Organization
The following notation is used in this paper. Scalars and vectors are denoted by lower-case and lower-case bold letters, respectively, while matrices are denoted by upper-case bold letters. The operator denotes the absolute value of a scalar quantity. The superscript in denotes time (or iteration) index, while denotes the exponentiation operation. The superscript denotes the transpose operation, denotes the Frobenius norm of matrices, while both and denote the -norm of vectors. Given a matrix , both and denote its entry at the row and column, while denotes its column.
The rest of the paper is organized as follows. In Section II, we describe and mathematically formulate the distributed PCA problem, while Section III describes the proposed distributed algorithm, which is based on the generalized Hebbian algorithm. In Section IV, we derive a general result for a modified generalized Hebbian algorithm that aids in the convergence analysis of the proposed distributed algorithm, while convergence guarantees for the proposed algorithm are provided in Section V. We provide numerical results in Section VI to show efficacy of the proposed method and provide concluding remarks in Section VII. The statements and proofs of some auxiliary lemmas, which are needed for the proofs of the main lemmas that are used within the convergence analysis in Section IV and Section V, are given in Appendix A, while Appendices B–F contain the formal statements and proofs of the main lemmas.
II Problem Formulation
Principal Component Analysis (PCA) aims at finding the basis of a low-dimensional space that can decorrelate the features of data points and also retain maximum information. More formally, for a random vector with , PCA involves finding the top- eigenvectors of the covariance matrix . The zero mean assumption is taken here without loss of generality as the mean can be subtracted in case data is not centered. Mathematically, PCA can be formulated as
(1) |
The constraint , ensures that decorrelates the features of . Now, and it is straightforward to see that this quantity is diagonal only if contains the eigenvectors of . This explains why the search for a solution of PCA ends with the eigenvectors and not the subspace spanned by them. In practice, we do not have access to and so a covariance matrix estimated from the samples of is used instead. Specifically, for a dataset with samples , or equivalently, for a data matrix , the sample covariance matrix can be written as such that . The true solution for PCA is then obtained by finding the eigenvectors of the covariance matrix , which are also the left singular vectors of the data matrix . The empirical form of (1) is thus
(2) |
In the literature, however, PCA is usually posed with a ‘relaxed’ orthogonality constraint of instead of as follows:
(3) |
The optimization formulation in (3) with this constraint will only lead to a subspace spanned by the eigenvectors of as the solution, thus actually making it a Principal Subspace Analysis (PSA) formulation. In other words, although the formulation (3) gives a solution on the Stiefel manifold, the actual PCA formulation (2) requires the solution to be within a very specific subset of that manifold that corresponds to the eigenvectors of . The accuracy of the solutions given by the PCA and PSA formulations will be the same when measured in terms of the principal angles between the subspace estimates and the true subspace spanned by the eigenvectors of the covariance matrix. Specifically, if is an estimate of the basis of the space spanned by the eigenvectors , then the principal angles between and given by either (2) or (3) will be the same. But a more suitable measure of accuracy for any PCA solution should be the angles between and for all , which motivates us to judge the efficacy of any solution with respect to this metric instead of the principal subspace angles.
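The following sketch (with assumed dimensions) makes the distinction between the two metrics concrete: a rotated basis of the top eigenvectors has essentially zero principal angles with the true eigenspace, yet its individual columns can be far from the individual eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 12, 3
A = rng.standard_normal((d, d))
C = A @ A.T
Q = np.linalg.eigh(C)[1][:, ::-1][:, :K]        # true top-K eigenvectors

R, _ = np.linalg.qr(rng.standard_normal((K, K)))
X_hat = Q @ R                                    # a PSA solution: same subspace, rotated basis

# Principal angles between the subspaces: arccos of the singular values of Q^T X_hat.
sv = np.linalg.svd(Q.T @ X_hat, compute_uv=False)
principal_angles = np.degrees(np.arccos(np.clip(sv, -1.0, 1.0)))

# Column-wise angles between the estimated vectors and the true eigenvectors.
cosines = np.abs(np.sum(Q * X_hat, axis=0)) / np.linalg.norm(X_hat, axis=0)
column_angles = np.degrees(np.arccos(np.clip(cosines, 0.0, 1.0)))

print(principal_angles)   # all ~0 degrees: the subspace metric cannot detect the rotation
print(column_angles)      # generally far from 0: the PCA metric penalizes the rotation
```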
In the distributed setup considered in this paper, there is a network of nodes such that the undirected graph, , describing the network is connected. Here is the set of nodes and is the set of edges, i.e., if there is a direct path between and . The set of neighbors for any node is denoted by . Under the setup of samples being distributed over the nodes, let us assume that the node has a data matrix containing samples such that . Thus each node has access to only a local covariance matrix instead of the global covariance matrix but one can see that . In this setting, a straightforward approach might be that each node finds its own solution independent of the data at all the other nodes. While this might seem viable, this approach will have major drawbacks. Recall that the sample covariance approximates the population covariance at a rate of , where is some function (depending on the distribution) of the number of samples [33]. Since the local data has a smaller number of samples than the global data, working with the local covariance matrix alone instead of somehow using the whole data will lead to a larger error in estimation of the eigenvectors. Also, since uniform sampling from the underlying data distribution is not guaranteed in distributed setups, the samples at a node may end up being from a narrow part of the entire distribution, thus being more biased away from the true distribution. This invites the need for the nodes to collaborate amongst themselves in a way that all the data is utilized to find estimates of the eigenvectors at each node while ensuring that all the nodes agree with each other. Thus, for a distributed setting, the PCA problem in (1) can be rewritten as
(4) |
It is easy to see that . Also, in a distributed setup, each node maintains its own copy of the variable due to the difference in local information (local data) they carry. Thus, all nodes need to agree with each other to ensure the entire network reaches the same true solution. Hence, the true distributed PCA objective is written as
(5) |
Note that (1)–(5) are non-convex optimization problems due to the non-convexity of the constraint set. One possible solution to the PCA problem is to instead solve a convex relaxation of the original non-convex function [34, 35]. The issue with these solutions is that they require memory and computation, which can be prohibitive in high-dimensional settings. In addition, due to iterate size these solutions are not ideal for distributed settings. Also, these formulations, without any further constraints, will not necessarily give a basis that is the set of dominant eigenvectors. Instead, they might end up giving a rotated basis as explained earlier, thereby not completing the task of decorrelating features. Hence, in this paper we use an algebraic method based on GHA for neural network training, which has memory and computation requirements, to solve the distributed PCA problem. Our goal is to converge to the true eigenvectors of the global covariance matrix at every node of the network. As noted earlier in Section I-A, distributed variants of the power method exist in the literature [22, 24, 23] that can find the dominant eigenvector but these methods employ two time-scale approaches that involve several consensus averaging rounds for each iteration of the power method. Such two time-scale approaches can be expensive in terms of communications cost. In this paper, we propose a one time-scale method that finds the top eigenvectors of the global sample covariance matrix at each node through local computations and information exchange with neighbors. The proposed method also converges linearly up to a neighborhood of the true solution when the error metric considered is the angle between the estimates and the true eigenvectors.
III The Proposed Algorithm
In [11], Sanger proposed a generalized Hebbian algorithm (GHA) to train a neural network and find the eigenvectors of the input autocorrelation matrix (same as the covariance matrix for zero-mean input). The outputs of such a network, when the weights are given by the eigenvectors, are the uncorrelated features of the input data that allow data reconstruction with minimal error, hence serving the true purpose of PCA. The algorithm was originally developed to tackle the centralized PCA problem in the case of streaming data, where a new data sample , arrives at each time instance.
In this paper we consider a batch setting, but the alignment of GHA with our basic goal of finding the eigenvectors motivates us to leverage it for our distributed setup. The rationale behind the idea of extrapolating the streaming case to a distributed batch setting is simple: since , the sample-wise distributed data setting can be seen as a mini-batch variant of the streaming data setting. In the context of neural network training, our approach can be viewed as training a network at each node with a mini-batch of samples in a way that all nodes end up with the same trained network whose weights are given by the eigenvectors of the autocorrelation matrix of the entire batch of samples.
The iterate for the GHA as given in [11] has the following update for the matrix of eigenvectors (i.e., the neural network weight matrix) when the sample arrives at the input of the neural network:
(6) |
where is an operator that sets all the elements below the diagonal to zero and is the step size. For , and defining , it was shown in [9] that the term is the consequence of a power series approximation of Oja’s rule in lieu of the explicit normalization used in the case of the power method. In the case of , helps combine Oja’s algorithm with the Gram–Schmidt orthogonalization step as follows:
(7) |
Thus, for any , the term involving in (6) includes an implicit normalization term as well as an orthogonalization term , which—analogous to the Gram–Schmidt orthogonalization procedure—forces the estimate to be orthogonal to all the estimates . Another important property of the GHA is that, in order to estimate the dominant eigenvectors, it only requires the corresponding top eigenvalues to be distinct (and nonzero). In other words, it does not require the covariance matrix to be non-singular.
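A minimal sketch of the streaming GHA update in (6) is shown below, assuming the convention that the columns of the weight matrix are the eigenvector estimates, the network output is y = Wᵀx, and the operator that zeros the entries below the diagonal is implemented by np.triu; the step size, dimensions, and data are hypothetical.

```python
import numpy as np

def gha_step(W, x, alpha):
    """One streaming GHA (Sanger) update. Columns of W are the eigenvector
    estimates, y = W^T x is the output of the linear network, and np.triu
    plays the role of the operator that zeros entries below the diagonal."""
    y = W.T @ x
    return W + alpha * (np.outer(x, y) - W @ np.triu(np.outer(y, y)))

# Toy usage with assumed sizes, covariance, and step size.
rng = np.random.default_rng(3)
d, K, alpha = 10, 3, 1e-3
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma_half = U @ np.diag(np.sqrt(np.linspace(1.0, 0.1, d))) @ U.T   # covariance square root
W = 0.1 * rng.standard_normal((d, K))
for _ in range(30000):
    W = gha_step(W, Sigma_half @ rng.standard_normal(d), alpha)
```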
In the deterministic setting, where we have the full batch of data instead of a new sample at every instance, this iterate changes to
(8) |
Here, we term as the Sanger direction. An iterate similar to (8) has been proven to have global convergence in [17] for a very specific choice of step sizes that depend on the iterate itself. Its straightforward extension to the distributed case is not possible, as that would lead to different step sizes at different nodes of the network, making it difficult to establish convergence guarantees. Hence, to adapt this iterative method to our distributed setup, we use the typical combine and update strategy used widely in the literature for distributed algorithms such as [32, 36, 37, 38]. The main contributions of such works lie in showing that the resulting distributed algorithms achieve consensus (i.e., all nodes eventually have the same iterate values) and, in addition, that the consensus value is the same as the centralized solution. However, the convergence guarantees for these methods are mainly restricted to convex and strongly convex problems. Our distributed version of (8) for PCA, which is non-convex, is based on similar principles of combine and update.
Specifically, the node at iteration carries a local copy of the estimate of the eigenvectors of the global covariance matrix . In the combine step, each node exchanges the iterate values with its immediate neighbors , where denotes the neighborhood of node , and then takes a weighted sum of the iterates received along with its local iterate. Then this sum is updated independently at all nodes using their respective local information. Since node in the network only has access to its local sample covariance , the update is in the form of a local Sanger’s direction given as
(9) |
Algorithm 1 (Distributed Sanger’s Algorithm, DSA): Input; Initialize; Return.
The details of the proposed distributed PCA algorithm, called the Distributed Sanger’s Algorithm (DSA), are given in Algorithm 1. The weight matrix in this algorithm is a doubly stochastic matrix conforming to the network topology [25] in the sense that for , when and otherwise. Also, , i.e., there is a self loop at each node. Note that connectivity of the network, as discussed in Section II, is a necessary condition for convergence of DSA. The connectivity assumption, in turn, ensures the Markov chain underlying the graph is aperiodic and irreducible, which implies that the second-largest (in magnitude) eigenvalue of , , is strictly less than . While DSA shares algorithmic similarities with first-order distributed optimization methods [32, 39] in which the combine-and-update strategy is used, our challenge is characterizing its convergence behavior due to the non-convex and constrained nature of the distributed PCA problem. To this end, we first provide a general result in Section IV where we prove the convergence of a modified form of GHA. Then we utilize that result, along with some linear algebraic tools and additional lemmas provided in the appendices, to characterize the dynamics of the distributed setup in Section V and prove the convergence of the proposed algorithm.
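As a concrete illustration, the following sketch implements the combine-and-update iteration of DSA on a toy problem; the cycle-graph network, data, Metropolis edge weights, step size, and dimensions are assumed for illustration and are not prescribed by the algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, M, n_i, alpha, T = 10, 3, 8, 500, 0.05, 2000        # assumed problem/network sizes

# Synthetic data with a known covariance spectrum, split sample-wise across M nodes.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma_half = U @ np.diag(np.sqrt(np.linspace(1.0, 0.1, d))) @ U.T
data = [Sigma_half @ rng.standard_normal((d, n_i)) for _ in range(M)]
C_local = [X @ X.T / n_i for X in data]                   # local sample covariances

# Doubly stochastic Metropolis weights for a cycle graph (each node has 2 neighbors).
adj = np.zeros((M, M))
for i in range(M):
    adj[i, (i - 1) % M] = adj[i, (i + 1) % M] = 1
deg = adj.sum(axis=1)
Wt = np.zeros((M, M))
for i in range(M):
    for j in range(M):
        if adj[i, j]:
            Wt[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
    Wt[i, i] = 1.0 - Wt[i].sum()                          # self weight keeps rows summing to 1

def sanger_direction(C, W):
    """Local Sanger direction: C W - W * triu(W^T C W)."""
    return C @ W - W @ np.triu(W.T @ C @ W)

# DSA: combine neighbors' iterates, then update along the local Sanger direction.
iterates = [0.1 * rng.standard_normal((d, K)) for _ in range(M)]
for _ in range(T):
    new_iterates = []
    for i in range(M):
        combined = sum(Wt[i, j] * iterates[j] for j in range(M))
        # The direction is evaluated at the node's own iterate (DGD-style); evaluating
        # it at the combined iterate is a closely related variant.
        new_iterates.append(combined + alpha * sanger_direction(C_local[i], iterates[i]))
    iterates = new_iterates
```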
IV Convergence Analysis of a Modified GHA
Let , , be an estimate of the -dimensional subspace spanned by the eigenvectors of the covariance matrix after iterations and , be the eigenvectors of with corresponding eigenvalues . On expanding (8) using (7), it is clear that the GHA update equation for estimation of the eigenvector using a constant step size is as follows:
(10) |
We now slightly modify (10) by replacing for by the true eigenvectors . We term the resulting update equation modified GHA and note that this is not an algorithm in the true sense of the term as it cannot be implemented because of its dependence on the true eigenvectors . The sole purpose of this modified GHA is to help in our ultimate goal of providing convergence guarantee for the DSA algorithm. The update equation of the modified GHA for “estimation” of the eigenvector of , , has the form
(11) |
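For reference, here is a sketch of one modified-GHA iteration under assumed (0-indexed) notation; it is purely an analysis device, since it deflates the covariance with the true eigenvectors, which are unknown in practice.

```python
import numpy as np

def modified_gha_step(w_k, k, C, Q_true, alpha):
    """One iteration of the modified GHA for the k-th vector (0-indexed): the
    GHA estimates of the first k vectors are replaced by the true eigenvectors
    Q_true[:, :k], which deflates C before the Oja-type update on w_k."""
    P = np.eye(C.shape[0]) - Q_true[:, :k] @ Q_true[:, :k].T
    return w_k + alpha * (P @ C @ w_k - (w_k @ C @ w_k) * w_k)
```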
Note that similar to the original GHA, this modified GHA assumes that has distinct eigenvalues, i.e., . Now, since , are the eigenvectors of a real symmetric matrix, they form a basis for and can be used for expansion of any as
(12) |
where is the coefficient corresponding to the eigenvector in the expansion of . Multiplying both sides of (11) by and using the fact that for , we get
This gives
(13)
(14) |
It has been shown in [16] that the update equation given by
for converges to at a linear rate for a certain condition on the step size . Specifically, it was proven that , where is some constant and . Here, we extend the proof to a general and show that the update equation given in the form of (11) for any , converges to the dominant eigenvector.
Theorem 1.
Suppose , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, , and for all . Then the modified GHA iterate for given by (11) converges at a linear rate to the eigenvector corresponding to the largest eigenvalue of the covariance matrix .
Proof.
The convergence of to requires convergence of the lower-order coefficients and the higher-order coefficients to 0 and convergence of to . Now,
(15) |
Thus, convergence of the lower-order and the higher-order coefficients to 0 along with convergence of the term will also imply the convergence of to . To this end, Lemma 5 in the appendix proves linear convergence of the lower-order coefficients to 0 by showing for some constants . Furthermore, Lemma 6 in the appendix shows that , where and , thereby proving linear convergence of the higher-order coefficients to 0. Finally, Lemma 7 in the appendix shows that , where and . The formal statements and proofs of Lemma 5, Lemma 6 and Lemma 7 are given in Appendix B, Appendix C and Appendix D, respectively.
V Convergence Analysis of Distributed Sanger’s Algorithm (DSA)
With the analysis of the modified GHA in hand, let us proceed to analyze the proposed DSA algorithm. The iterate of DSA at node for the dominant -dimensional eigenspace estimate () is given as
(16) |
where is an estimate of the -dimensional subspace of the global covariance matrix at the node after iterations, is the local Sanger’s direction, and is a weight that node assigns to based on the connectivity between nodes and as mentioned before. The Sanger’s direction and the update equation for an estimate of the eigenvector are thus given as
(17)
(18) |
Now, let the average of after iteration be denoted as and given by taking average of (18) over all the nodes as
where . We present analysis of the DSA algorithm by first proving convergence of the average to a neighborhood of the eigenvector of the global covariance matrix while using a constant step size. Then with the help of Lemma 8 in Appendix E, which proves that the deviation of the iterates at each node from the average is upper bounded, we prove that the iterates at each node also converge to a neighborhood of the true solution. It is noteworthy that the analysis of DSA does not require additional constraints on eigenvalues of , i.e., similar to GHA, we only require the top eigenvalues of to be distinct and non-zero.
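The two quantities tracked in this analysis, the network average of the node iterates and each node's deviation from that average, can be computed as in the short sketch below (assuming the node iterates are stored as a list of equally sized arrays).

```python
import numpy as np

def network_average_and_deviation(iterates):
    """Network average of the node iterates and each node's distance from it;
    a sketch of the quantities used in the consensus part of the analysis
    (`iterates` is assumed to be a list of equally sized NumPy arrays)."""
    W_bar = sum(iterates) / len(iterates)
    deviations = [np.linalg.norm(W - W_bar) for W in iterates]
    return W_bar, deviations
```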
The complete proof of convergence of DSA is done by induction. First, we show the convergence of to a neighborhood of and then analyze the rest of the eigenvector estimates by assuming that the higher-order estimates have converged.
Case I for Induction – : The iterate for the dominant eigenvector is
(19) |
Theorem 2.
Suppose , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, , and . Then the DSA iterate for given by (19) converges at a linear rate to an neighborhood of the eigenvector corresponding to the largest eigenvalue of the global covariance matrix at every node of the network.
Proof.
We know that
(20) |
The term is a measure of consensus in the network and we prove in Lemma 8 in Appendix E that this difference decreases linearly until it reaches a level of . More precisely,
(21) |
where . In particular, it is well known that for a connected graph . Now, the average iterate of DSA for the estimate of the dominant eigenvector is
Thus,
(22) |
We saw in Section IV that an iterate of the form
converges linearly to for certain conditions on the step size and the initial point. Thus,
The term in the above equation appears due to the distributed nature of the algorithm and can be bounded separately. Specifically, we prove in Lemma 9, whose formal statement and proof is given in Appendix F, that
Thus,
Since , we have the following two cases:
1. . Then, .
2. . Then .
Therefore,
(23) |
Consequently, from (21) and (23), we get
This proves that converges to a neighborhood of or at a linear rate. ∎
Case II for Induction – : For the remainder of the eigenvectors, we proceed with the proof of convergence by induction. Since we have already proven the base case, we can assume there exist constants and such that
1. , and
2. .
Using the inequality in 1) above, we can write such that . This therefore implies where .
Thus, we have
(24) | |||||
where and .
Consequently,
(25) |
We can now proceed with the final theorem that characterizes the convergence behavior of DSA.
Theorem 3.
Suppose , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, and . Then the DSA iterate for given by (18) converges at a linear rate to an neighborhood of the eigenvector corresponding to the largest eigenvalue of the global covariance matrix at each node of the network.
Proof.
We know
(26) |
Also, from Lemma 8 in the appendix we know that
Now, the average iterate of DSA for estimating the eigenvector is
We know from the discussion in Section IV that for an iterate of the form
there exists a constant such that . Thus,
Now, the term was bounded in Lemma 9 in the appendix as
(27) |
Thus, using (24) and (27), we can write
Consequently, from (26) and Lemma 8 we get
This proves that converges to a neighborhood of or at a linear rate. ∎
It is noteworthy that if decaying step sizes are used such that as (instead of constant ), the convergence will be exact but not linear. The rate in that case will be dominated by the rate of decay of .
VI Experimental Results
In this section, we provide results that demonstrate the efficacy of the proposed DSA algorithm. The need for collaboration between the nodes of a network is a vital part of any distributed algorithm, as already pointed out in Section II. We first verify this necessity, along with the effect of the step size on DSA, through a set of experiments. In these experiments, the weight matrix that conforms to the underlying graph topology is generated using the Metropolis constant edge-weight approach [40]. The performance of DSA in comparison to some baseline methods is also evaluated in additional experiments. We provide experimental results for DSA on synthetic and real data and compare the results with the centralized generalized Hebbian algorithm (GHA) [11], centralized orthogonal iteration (OI) [14], distributed projected gradient descent (DPGD), and a sequential distributed power method (SeqDistPM). For both centralized methods, all the data is assumed to be at a single location, the difference being that GHA uses the Hebbian update whereas OI uses the well-known orthogonal iterations to estimate the top eigenvectors of the covariance matrix . DPGD involves two significant steps per iteration. The first is a distributed gradient descent step at every node given by as in [32] using trace maximization as the objective. This is followed by a projection step to enforce the orthogonality constraint . The orthogonalization is accomplished using QR decomposition, an approach that ensures projection onto the Stiefel manifold [41] and whose computational complexity is at each node in each iteration. In contrast, SeqDistPM involves implementing the distributed power method [22, 24] times, estimating one eigenvector at a time and subtracting its impact on the covariance matrix for the estimation of subsequent eigenvectors. Note that SeqDistPM requires a finite number of consensus iterations per iteration of the power method. Assuming the cost of communicating one matrix across the network from nodes to their neighbors to be one unit, the communication cost of SeqDistPM is per iteration of the power method. The error metric used for comparison and reporting of the results is the average of the angles between the estimated and true eigenvectors, i.e., if is the estimate of the eigenvector at node and is the true eigenvector, then the average error across all nodes is calculated as follows:
(28) |
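A sketch of this error metric is given below, assuming the node estimates are stored as a list of d-by-K matrices, the true eigenvectors form the columns of a d-by-K matrix, and angles are reported in degrees.

```python
import numpy as np

def average_angle_error(node_estimates, Q_true):
    """Average angle (in degrees) between estimated and true eigenvectors,
    averaged over all nodes and all components; a sketch of the error metric
    in (28) under assumed array conventions."""
    errs = []
    for W in node_estimates:
        cosines = np.abs(np.sum(Q_true * W, axis=0)) / (
            np.linalg.norm(Q_true, axis=0) * np.linalg.norm(W, axis=0))
        errs.append(np.degrees(np.arccos(np.clip(cosines, 0.0, 1.0))))
    return float(np.mean(errs))
```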
VI-A Synthetic Data
We first show results that emphasize the need for collaboration among the nodes. To that end, we generate independent and identically distributed (i.i.d.) samples drawn from a multivariate Gaussian distribution with an eigengap and dimension . These samples are distributed equally among the nodes of an Erdos–Renyi network (with connectivity probability ), implying that each node has 1,000 samples. The number of eigenvectors estimated is and a constant step size of is used for this experiment. Figure 1(a) shows the effect of using the GHA at a node without collaboration with other nodes versus DSA, which in simple terms embodies GHA + collaboration in the network. The blue line indicating GHA in the figure is the result of using all the data in a centralized manner. It is clear that the lack of any communication between nodes increases the error in estimation of the eigenvectors by a significant factor. In Figure 1(b), we use the same setup and parameters to show the effect of different step sizes on our proposed DSA algorithm. It is evident that if the step size is too low, convergence becomes significantly slower, while if it is too high, the final error is larger. Hence, a careful choice of the step size is required for DSA, as characterized by its convergence analysis.
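As a reference for this kind of setup, the following sketch generates Gaussian data whose covariance has a prescribed eigenvalue decay, splits it equally across the nodes, and draws an Erdos–Renyi connectivity pattern; all parameter values and the particular eigengap construction are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
d, M, N, gap, p = 20, 10, 10000, 0.8, 0.5        # assumed values

# One possible construction: ratio of consecutive eigenvalues equal to the eigengap.
eigvals = gap ** np.arange(d)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal eigenvectors
Sigma_half = U @ np.diag(np.sqrt(eigvals)) @ U.T
X = Sigma_half @ rng.standard_normal((d, N))      # i.i.d. zero-mean Gaussian samples

# Split the samples equally across M nodes (1,000 samples per node here).
local_data = np.split(X, M, axis=1)

# Erdos-Renyi connectivity pattern (in practice, resample until the graph is connected).
adj = np.triu((rng.random((M, M)) < p).astype(float), 1)
adj = adj + adj.T
```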
Next, we compare DSA with the distributed methods DPGD and SeqDistPM to demonstrate its communications efficiency. For that purpose, we generate synthetic data with different eigengaps . We simulate the distributed setup for Erdos–Renyi (), star, and cycle graph topologies with nodes. The data is generated so that each node has 1,000 i.i.d. samples () drawn from a multivariate Gaussian distribution for , i.e., the total number of samples generated is 10,000. The dimension of the subspace to be estimated is taken to be . We use as the number of consensus iterations per power iteration for SeqDistPM throughout our experiments. The results reported are an average of 10 Monte Carlo trials. Figure 2 shows the performance of the different algorithms for the estimation of the most dominant eigenvector for different network topologies. It is clear that for SeqDistPM outperforms both DSA and DPGD in terms of communications efficiency because it is essentially the distributed power method, which is shown in [24, 22] to have good performance for . Even though DSA and DPGD have the same performance in terms of communications cost, it is important to remember that DPGD requires an additional QR normalization step per communications round. Next, Figure 3 shows a comparison between the three algorithms when the top-5 eigenvectors are estimated, i.e., . It is clear that, while estimating higher-order eigenvectors, DSA slightly outperforms DPGD without performing explicit QR normalization, and it also has much better communications efficiency than SeqDistPM. The error for SeqDistPM is significantly higher in the beginning because of the sequential estimation: while the first (higher-order) eigenvectors are being estimated, the lower-order estimates are still at their initial values and hence contribute significant error even when the already-estimated components have low error. After a sufficiently large number of communications rounds, SeqDistPM eventually does reach a lower final error than DSA, but this comes at the expense of slower convergence as a function of communications cost. It should also be noted that SeqDistPM lacks a formal convergence analysis and has two time scales that need to be adjusted, as both contribute to the final error. Finally, the benefits of DSA over DPGD are twofold. First, DSA reaches a similar or better error floor without explicit QR normalization, thus saving computations per iteration; and second, the convergence guarantees for gradient descent-based algorithms for non-convex problems like PCA have limitations, as they usually only guarantee convergence to a stationary solution at a sub-linear rate.
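For comparison, a sketch of one DPGD iteration at a node is shown below, under the same assumed conventions as the DSA sketch given earlier: a DGD-style step on the local trace-maximization objective followed by QR-based re-orthonormalization, which serves as the projection back onto the Stiefel manifold.

```python
import numpy as np

def dpgd_step(iterates, C_local, Wt, i, alpha):
    """One DPGD iteration at node i (a sketch): combine neighbors' iterates,
    ascend the local trace objective tr(W^T C_i W), whose gradient is 2 C_i W,
    and then re-orthonormalize via a QR decomposition."""
    combined = sum(Wt[i, j] * iterates[j] for j in range(len(iterates)))
    stepped = combined + alpha * 2.0 * C_local[i] @ iterates[i]
    Q, R = np.linalg.qr(stepped)
    return Q * np.sign(np.diag(R))    # fix column signs; assumes a nonzero diagonal of R
```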
VI-B Real-World Data
Along with the synthetic data experiments, we provide some experiments with the real-world datasets MNIST [42] and CIFAR-10 [43]. For the distributed setup in this case, we use an Erdos–Renyi graph with nodes and . Both datasets have 60,000 samples, making the number of samples per node . The data dimension for MNIST is and a constant step size of was used. The plots in Figure 4(a) and Figure 4(b) show the results for for MNIST. Similar plots are shown for CIFAR-10 in Figure 5(a) and Figure 5(b), where the dimension for CIFAR-10 is 1024, the number of estimated eigenvectors is , and a constant step size of is used. For these real-world datasets, we exclude the comparison with SeqDistPM as it is evident that this method requires a much higher communications cost when estimating a larger number of eigenvectors.
VII Conclusion
In this paper, we proposed and analyzed a new distributed Principal Component Analysis (PCA) algorithm that, as opposed to distributed subspace learning methods, facilitates both dimensionality reduction and data decorrelation in a distributed setup. Our main contribution in this regard was a detailed convergence analysis to prove that the proposed distributed method linearly converges to a neighborhood of the eigenvectors of the global covariance matrix. We also provided numerical results to demonstrate the communications efficiency and overall effectiveness of the proposed algorithm.
In terms of future work, an obvious extension would be a distributed algorithm that enables exact convergence to the PCA solution at a linear rate. Note that the use of a diminishing step size along with the analysis in this paper already guarantees that DSA can converge exactly to the PCA solution. However, this exact convergence guarantee comes at the expense of a slow convergence rate. We instead expect to combine ideas from this work as well as ideas such as gradient tracking from the literature on distributed optimization [36, 31, 44] to develop a linearly convergent, exact algorithm for distributed PCA in the future. Another possible future direction involves developing an algorithm for distributed PCA that does not require the top eigenvalues to be distinct. We also leave the case of multiple eigenvector estimation from distributed, streaming data, as in [26], for future work.
Appendix A Statements and Proofs of Auxiliary Lemmas
A-A Statement and Proof of Lemma 1
Lemma 1.
Assume . If the step size is bounded above as , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, then
(29) |
Proof.
From (11), we know the iterate for eigenvector estimate is
where . Notice that . Hence,
(30) |
We now split our analysis into three cases based on the range of values of .
Case I: Let . Then we see from (30) that
Case II: Now suppose . Then from (30) we have
Case III: Finally suppose . Then from (30) we get
To show that , we have to show
(31) |
We now find a lower bound of the right hand side of (31). Note that
(32)
and
(33)
Now, is a generalized Rayleigh quotient whose maximum and minimum values are the largest and smallest eigenvalues of the generalized eigenvalue problem . Since the eigenvectors of and are the same, the largest and smallest eigenvalues of the generalized problems are and , respectively, where and are the largest and smallest eigenvalues of . Thus, . Also, since , we have the right hand side of (33)
Hence, we have that the right hand side of (31) exceeds
Thus, if , then .
Next,
(34) |
Hence, . ∎
A-B Statement and Proof of Lemma 2
Lemma 2.
Suppose and , then
A-C Statement and Proof of Lemma 3
Lemma 3.
Assume . If the step size is bounded above as , where is the largest eigenvalue of and is the number of eigenvectors to be estimated, then
(38) |
Proof.
We have
(39) |
Hence,
Now,
Case I: Let us assume . Then, we have
Thus,
Case II: Now, suppose . Then, we get
Thus, if we need , the following condition should be met:
Since , if .
Case III: Finally, suppose . We then have the following: .
A-D Statement and Proof of Lemma 4
Lemma 4.
The norm of Sanger’s direction is bounded as
(41) |
Proof.
We know
Next, notice that . Thus,
We, therefore, get
Thus,
∎
Appendix B Statement and Proof of Lemma 5
Lemma 5.
Let . Now, suppose , then the following is true for , and some constant :
(42) |
Appendix C Statement and Proof of Lemma 6
Lemma 6.
Suppose and . Then the following is true for and some constant :
(44) |
Proof.
For we know from (14) that . If , we have .
Thus, we have for ,
Therefore, for , . Since and , we have and . Also, because of the assumption , let us assume . Thus, we can write
(45) |
∎
Appendix D Statement and Proof of Lemma 7
Lemma 7.
Suppose and . Then there exist constants such that
(46) |
Appendix E Statement and Proof of Lemma 8
Lemma 8.
The deviation of an iterate at a node from the average is bounded from above as
(48) |
where is the second largest magnitude of the eigenvalues of given as and is some constant.
Proof.
We stack the iterates and as
The next network-wide iterate (as a stacked vector) can then be written as , where denotes the Kronecker product. The iterate can thus be written as
Since is a symmetric and doubly stochastic mixing matrix, its largest eigenvalue is corresponding to the eigenvector , a column vector of all 1’s. It is also the left eigenvector of . That is, . Also, since the squared norm of Sanger’s direction at every node is bounded, it is easy to see . Now,
∎
Appendix F Statement and Proof of Lemma 9
Lemma 9.
Suppose and , then the following is true :
(49) |
Proof.
We have
Thus,
(50) |
∎
References
- [1] A. Gang, H. Raja, and W. U. Bajwa, “Fast and communication-efficient distributed PCA,” in Proc. IEEE International Conf. Acoustics, Speech and Signal Process. (ICASSP), 2019, pp. 7450–7454.
- [2] H. Hotelling, “Analysis of a complex of statistical variables into principal components.” J. Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.
- [3] M. Nokleby, H. Raja, and W. U. Bajwa, “Scaling-up distributed processing of data streams for machine learning,” Proc. IEEE, vol. 108, no. 11, pp. 1984–2012, Nov. 2020.
- [4] W. U. Bajwa, V. Cevher, D. Papailiopoulos, and A. Scaglione, “Machine learning from distributed, streaming data,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 11–13, May 2020.
- [5] M. Loève, Probability Theory, 3rd ed. New York: Springer, 1963.
- [6] R. Dixit and W. U. Bajwa, “Exit time analysis for approximations of gradient descent trajectories around saddle points,” ArXiv Prepr. ArXiv200601106, Jun. 2020. [Online]. Available: http://arxiv.org/abs/2006.01106
- [7] ——, “Boundary conditions for linear exit time gradient trajectories around saddle points: Analysis and algorithm,” ArXiv Prepr. ArXiv210102625, Jan. 2021. [Online]. Available: http://arxiv.org/abs/2101.02625
- [8] P. Baldi and K. Hornik, “Neural networks and principal component analysis: Learning from examples without local minima,” Neural Netw., vol. 2, no. 1, p. 53–58, Jan. 1989.
- [9] E. Oja and J. Karhunen, “On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix,” J. Math. Anal. Applicat., vol. 106, no. 1, pp. 69 – 84, 1985.
- [10] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications. USA: John Wiley & Sons, Inc., 1996.
- [11] T. D. Sanger, “Optimal unsupervised learning in a single-layer linear feedforward neural network,” Neural Netw., vol. 2, no. 6, pp. 459 – 473, 1989.
- [12] Z. Yang, A. Gang, and W. U. Bajwa, “Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 146–159, 2020.
- [13] K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Mag., vol. 2, pp. 559–572, 1901.
- [14] G. H. Golub and C. F. Van Loan, Matrix Computations (3rd Ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
- [15] D. O. Hebb, The Organization of Behavior: A Neuropsychological Theory. Wiley New York, 1949.
- [16] Z. Yi, M. Ye, J. C. Lv, and K. K. Tan, “Convergence analysis of a deterministic discrete time system of Oja’s PCA learning algorithm,” IEEE Trans. Neural Netw., vol. 16, no. 6, pp. 1318–1328, Nov 2005.
- [17] J. C. Lv, Z. Yi, and K. K. Tan, “Global convergence of GHA learning algorithm with nonzero-approaching adaptive learning rates,” IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1557–1571, 2007.
- [18] D. Kempe and F. McSherry, “A decentralized algorithm for spectral analysis,” J. Comput. and Syst. Sci., vol. 74, no. 1, pp. 70 – 83, 2008.
- [19] A. Scaglione, R. Pagliari, and H. Krim, “The decentralized estimation of the sample covariance,” in Proc. 42nd Asilomar Conf. on Signals, Syst. and Comput., 2008, pp. 1722–1726.
- [20] L. Li, A. Scaglione, and J. H. Manton, “Distributed principal subspace estimation in wireless sensor networks,” IEEE J. Sel. Topics Signal Process., vol. 5, no. 4, pp. 725–738, Aug 2011.
- [21] M. Nokleby and W. U. Bajwa, “Resource tradeoffs in distributed subspace tracking over the wireless medium,” in Proc. 1st IEEE Global Conf. Signal and Information Processing (GlobalSIP’13), Symposium on Network Theory, Austin, TX, Dec. 2013, pp. 823–826.
- [22] H. Raja and W. U. Bajwa, “Cloud K-SVD: Computing data-adaptive representations in the cloud,” in Proc. 51st Annual Allerton Conf. Commun., Control and Computing, 2013, pp. 1474–1481.
- [23] H. Wai, A. Scaglione, J. Lafond, and E. Moulines, “Fast and privacy preserving distributed low-rank regression,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., (ICASSP), 2017, pp. 4451–4455.
- [24] H. Raja and W. U. Bajwa, “Cloud-K-SVD: A collaborative dictionary learning algorithm for big, distributed data,” IEEE Trans. Signal Process., vol. 64, no. 1, pp. 173–188, Jan 2016.
- [25] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Syst. & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.
- [26] H. Raja and W. U. Bajwa, “Distributed stochastic algorithms for high-rate streaming principal component analysis,” CoRR, vol. abs/2001.01017, 2020. [Online]. Available: http://arxiv.org/abs/2001.01017
- [27] S. X. Wu, H.-T. Wai, L. Li, and A. Scaglione, “A review of distributed algorithms for principal component analysis,” Proc. IEEE, vol. 106, no. 8, pp. 1321–1340, 2018.
- [28] M. Hong, D. Hajinezhad, and M.-M. Zhao, “Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks,” in Proc. 34th Int. Conf. Mach. Learning, vol. 70. PMLR, 06–11 Aug 2017, pp. 1529–1538.
- [29] P. Bianchi and J. Jakubowicz, “Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization,” IEEE Trans. Autom. Control, vol. 58, no. 2, pp. 391–405, 2013.
- [30] H. Wai, A. Scaglione, J. Lafond, and E. Moulines, “A projection-free decentralized algorithm for non-convex optimization,” in Proc. 2016 IEEE Global Conf. Signal and Inform. Process. (GlobalSIP), 2016, pp. 475–479.
- [31] P. D. Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Trans. Signal Inform. Process. Netw., vol. 2, no. 2, pp. 120–136, 2016.
- [32] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, Jan 2009.
- [33] R. Vershynin, “How close is the sample covariance matrix to the actual covariance matrix?” Journal of Theoretical Probability, vol. 25, 04 2010.
- [34] R. Arora, A. Cotter, and N. Srebro, “Stochastic optimization of PCA with capped MSG,” in Proc. Advances Neural Inform. Process. Systs., 2013, pp. 1815–1823.
- [35] M. K. Warmuth and D. Kuzmin, “Randomized PCA algorithms with regret bounds that are logarithmic in the dimension,” in Proc. Advances Neural Inform. Process. Syst., 2007, pp. 1481–1488.
- [36] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: an exact first-order algorithm for decentralized consensus optimization,” SIAM J. Optim., vol. 25, no. 2, pp. 944–966, 2015.
- [37] F. S. Cattivelli and A. H. Sayed, “Diffusion LMS strategies for distributed estimation,” IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1035–1048, 2010.
- [38] S. Kar and J. M. Moura, “Consensus + innovations distributed inference over networks: Cooperation and sensing in networked systems,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 99–109, 2013.
- [39] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM J. Optim., vol. 26, no. 3, pp. 1835–1854, 2016.
- [40] S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing Markov chain on a graph,” SIAM Review, vol. 46, pp. 667–689, 2003.
- [41] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. USA: Princeton University Press, 2007.
- [42] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” ATT Labs, vol. 2, 2010.
- [43] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
- [44] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245–1260, 2018.